One-D-Piece: Image Tokenizer Meets Quality-Controllable Compression

Keita Miwa1, 2, Kento Sasaki1, 3, Hidehisa Arai1, Tsubasa Takahashi1, Yu Yamaguchi1, 4
1Turing Inc. 2University of Tokyo 3University of Tsukuba 4Keio Research Institute at SFC

We propose One-D-Piece, a discrete image tokenizer that enables variable-length tokenization, adjustable from 1 to 256 tokens. Even with a very small number of tokens (e.g., \( n_{\text{tokens}} = 8 \)), it produces recognizable image reconstructions. As the token count increases, reconstruction quality progressively improves, reaching near-original fidelity at \( n_{\text{tokens}} = 256 \).

Abstract

Current image tokenization methods require a large number of tokens to capture the information contained within images. Although the amount of information varies across images, most image tokenizers only support fixed-length tokenization, leading to inefficient token allocation. In this study, we introduce One-D-Piece, a discrete image tokenizer designed for variable-length tokenization, providing a quality-controllable compression mechanism. To enable a variable compression rate, we introduce a simple but effective regularization mechanism named Tail Token Drop into discrete one-dimensional image tokenizers. This method encourages critical information to concentrate at the head of the token sequence, enabling variable-length tokenization while preserving state-of-the-art reconstruction quality. We evaluate our tokenizer across multiple reconstruction quality metrics and find that it delivers significantly better perceptual quality than existing quality-controllable compression methods, including JPEG and WebP, at smaller byte sizes. Furthermore, we assess our tokenizer on various downstream computer vision tasks, including image classification, object detection, semantic segmentation, and depth estimation, confirming its adaptability to numerous applications compared to other variable-rate methods. Our approach demonstrates the versatility of variable-length discrete image tokenization, establishing a new paradigm in both compression efficiency and reconstruction performance. Finally, we validate the effectiveness of Tail Token Drop via a detailed analysis of the tokenizer.

Overview

The architecture of One-D-Piece is designed to achieve variable-length tokenization for images while maintaining high reconstruction quality. The framework consists of three main components:

  • Encoder: Converts an input image into a sequence of discrete tokens, leveraging a vision transformer (ViT) backbone for efficient feature extraction.
  • Quantizer: Applies vector quantization to discretize the encoded features into a finite set of token embeddings.
  • Decoder: Reconstructs the image from the token embeddings, enabling progressive refinement as the number of tokens increases.
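
The three-stage pipeline above can be sketched minimally. This is not the paper's implementation (which uses a trained ViT encoder/decoder and a learned codebook); the sizes, the random "codebook," and the random stand-in for encoder features are illustrative assumptions, shown only to make the tokenize → quantize → detokenize data flow concrete.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for illustration; a real model learns the codebook.
CODEBOOK_SIZE, DIM, N_TOKENS = 4096, 16, 256
codebook = rng.normal(size=(CODEBOOK_SIZE, DIM))

def quantize(features: np.ndarray) -> np.ndarray:
    """Map each feature vector to the index of its nearest codebook entry."""
    # (n_tokens, 1, dim) - (1, codebook_size, dim) -> (n_tokens, codebook_size)
    dists = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=-1)
    return dists.argmin(axis=-1)  # discrete token ids

def dequantize(token_ids: np.ndarray) -> np.ndarray:
    """Look up the token embeddings that would be fed to the decoder."""
    return codebook[token_ids]

features = rng.normal(size=(N_TOKENS, DIM))  # stand-in for encoder output
tokens = quantize(features)                  # 256 discrete token ids
embeddings = dequantize(tokens)              # embeddings for the decoder
```

In the real tokenizer, `features` come from the ViT encoder applied to the input image, and `embeddings` are decoded back into pixels.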

A key contribution is the Tail Token Drop mechanism, which prioritizes important information at the start of the token sequence. This allows flexible truncation of tokens while preserving essential image details, resulting in quality-controllable compression.
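The truncation behavior can be illustrated with a small sketch. This is an assumption-laden simplification, not the paper's training code: the mask value and the per-batch sampling of the kept length are hypothetical details, shown only to convey how dropping the tail during training pushes important information toward the head of the sequence.

```python
import numpy as np

rng = np.random.default_rng(0)

def tail_token_drop(tokens: np.ndarray, n_keep: int, mask_id: int = -1) -> np.ndarray:
    """Keep the first n_keep tokens and mask out the tail.

    During training, n_keep would be resampled per batch (e.g. uniformly
    from 1 to len(tokens)), so the decoder must reconstruct the image from
    an arbitrary prefix of the token sequence.
    """
    out = tokens.copy()
    out[n_keep:] = mask_id
    return out

tokens = rng.integers(0, 4096, size=256)  # a full 256-token sequence
n_keep = int(rng.integers(1, 257))        # sampled truncation length
truncated = tail_token_drop(tokens, n_keep)
```

At inference time the same property allows truncating the sequence to any length, trading reconstruction quality for compression rate.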

One-D-Piece Architecture

Sample Reconstructions

Below are sample reconstructions demonstrating the capabilities of One-D-Piece-L-256. The images show how the quality of reconstructions progressively improves as the number of tokens increases.


BibTeX

@misc{onedpiece,
      title        = {One-D-Piece: Image Tokenizer Meets Quality-Controllable Compression},
      author       = {Keita Miwa and Kento Sasaki and Hidehisa Arai and Tsubasa Takahashi and Yu Yamaguchi},
      year         = {2025},
      eprint       = {2501.10064},
      archivePrefix= {arXiv},
      primaryClass = {cs.CV},
      url          = {https://arxiv.org/abs/2501.10064},
}