The architecture of One-D-Piece is designed to achieve variable-length tokenization for images while maintaining high reconstruction quality. The framework consists of three main components:
- Encoder: Converts an input image into a sequence of discrete tokens, leveraging a vision transformer (ViT) backbone for efficient feature extraction.
- Quantizer: Applies vector quantization, mapping each encoded feature to its nearest entry in a finite codebook of token embeddings.
- Decoder: Reconstructs the image from the token embeddings, enabling progressive refinement as the number of tokens increases (a minimal sketch of all three components follows this list).
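The sketch below illustrates how the three components could fit together: a ViT-style encoder attends over image patches together with a set of learnable 1D latent tokens, a vector quantizer snaps those latents to a codebook, and a decoder reconstructs all patches from however many tokens are kept. Class names, layer counts, and dimensions (e.g. `OneDPieceSketch`, `num_tokens=256`) are illustrative assumptions for exposition, not the authors' implementation.

```python
import torch
import torch.nn as nn


class VectorQuantizer(nn.Module):
    """Nearest-neighbour lookup into a learned codebook with straight-through gradients."""

    def __init__(self, codebook_size: int = 4096, dim: int = 256):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, z: torch.Tensor):
        # z: (batch, num_tokens, dim) continuous latents from the encoder
        flat = z.reshape(-1, z.shape[-1])
        dists = torch.cdist(flat, self.codebook.weight)       # (B*N, K)
        ids = dists.argmin(dim=-1).reshape(z.shape[:-1])      # discrete token ids
        z_q = self.codebook(ids)                              # quantized embeddings
        return z + (z_q - z).detach(), ids                    # straight-through estimator


class OneDPieceSketch(nn.Module):
    """Illustrative encoder -> quantizer -> decoder pipeline with 1D latent tokens."""

    def __init__(self, image_size: int = 256, patch: int = 16,
                 num_tokens: int = 256, dim: int = 256):
        super().__init__()
        self.image_size, self.patch = image_size, patch
        self.num_patches = (image_size // patch) ** 2
        self.patchify = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.latent_tokens = nn.Parameter(torch.randn(num_tokens, dim) * 0.02)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=4)
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=4)
        self.quantizer = VectorQuantizer(dim=dim)
        self.to_pixels = nn.ConvTranspose2d(dim, 3, kernel_size=patch, stride=patch)

    def encode(self, images: torch.Tensor):
        b = images.shape[0]
        patches = self.patchify(images).flatten(2).transpose(1, 2)   # (B, P, D)
        latents = self.latent_tokens.expand(b, -1, -1)               # (B, N, D)
        # Latent tokens attend jointly with patch tokens; only the latents are kept,
        # giving a 1D token sequence decoupled from the 2D patch grid.
        x = self.encoder(torch.cat([patches, latents], dim=1))
        return self.quantizer(x[:, patches.shape[1]:])               # (z_q, ids)

    def decode(self, z_q: torch.Tensor):
        b, n = z_q.shape[:2]
        # Mask tokens query the decoder for every image patch, conditioned on
        # however many latent tokens were kept (enabling variable-length input).
        queries = self.mask_token.expand(b, self.num_patches, -1)
        x = self.decoder(torch.cat([z_q, queries], dim=1))[:, n:]
        grid = self.image_size // self.patch
        return self.to_pixels(x.transpose(1, 2).reshape(b, -1, grid, grid))
```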
A key contribution is the Tail Token Drop mechanism, which drops tokens from the tail of the sequence during training so that the tokenizer learns to concentrate the most important information at the start. As a result, the token sequence can be truncated at inference time while preserving essential image details, yielding quality-controllable compression.
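The following is a minimal sketch of that idea, assuming the sketch classes above: during training, a keep-length is sampled and the trailing tokens are discarded before decoding, so the decoder must reconstruct the image from only the leading tokens. The uniform sampling scheme, the `min_tokens` floor, and the function name are assumptions for illustration, not the authors' exact procedure.

```python
import torch


def tail_token_drop(z_q: torch.Tensor, min_tokens: int = 32) -> torch.Tensor:
    """Drop a random number of tokens from the tail of a (B, N, D) token sequence.

    Applied during training so the decoder sees only the leading tokens, which
    pushes the most important information toward the head of the sequence.
    """
    num_tokens = z_q.shape[1]
    keep = int(torch.randint(min_tokens, num_tokens + 1, (1,)))  # sampled keep-length (assumed uniform)
    return z_q[:, :keep]


# At inference, truncating the sequence at any point trades rate for quality:
# tokens, _ = model.encode(images)
# coarse = model.decode(tokens[:, :64])    # heavier compression, coarser details
# fine   = model.decode(tokens[:, :256])   # full sequence, best reconstruction
```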