What is xAR?
xAR is an autoregressive visual generation framework developed by ByteDance and Johns Hopkins University. It aims to improve both image quality and generation speed through two techniques: Next-X Prediction and Noisy Context Learning.
Main Features of xAR
Next-X Prediction: Extends traditional next-token prediction so that the predicted unit X can be a more complex entity: an image patch, a cell (a group of neighboring patches), a subsample (a group of spatially distant patches), or an entire image. Larger entities capture richer semantic information than a single token.
Noisy Context Learning: Trains the model on noisy context rather than clean ground-truth tokens, which makes it more robust to its own prediction errors and reduces the error accumulation typical of autoregressive generation.
High-Performance Generation: On ImageNet, xAR is reported to outperform diffusion models such as DiT in both inference speed and generation quality.
Flexible Prediction Units: Supports various prediction unit designs, making it suitable for different visual generation tasks.
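To make the idea of a prediction unit concrete, here is a minimal sketch of how a grid of patches might be grouped into cells and subsamples. The function names `cell_units` and `subsample_units`, the raster ordering, and the reading of a cell as a k x k block of neighbors and a subsample as a strided, non-local group are assumptions made for illustration, not the released xAR code.

```python
def cell_units(grid_h, grid_w, k):
    """Group patch indices of a grid_h x grid_w grid into k x k cells (raster order)."""
    if grid_h % k or grid_w % k:
        raise ValueError("grid size must be divisible by k")
    units = []
    for cy in range(0, grid_h, k):
        for cx in range(0, grid_w, k):
            # Collect the k*k neighboring patches that form one cell.
            units.append([(cy + dy) * grid_w + (cx + dx)
                          for dy in range(k) for dx in range(k)])
    return units


def subsample_units(grid_h, grid_w, stride):
    """Group patches sharing the same offset modulo stride (a non-local grouping)."""
    units = []
    for oy in range(stride):
        for ox in range(stride):
            # Every stride-th patch in both directions, starting at (oy, ox).
            units.append([y * grid_w + x
                          for y in range(oy, grid_h, stride)
                          for x in range(ox, grid_w, stride)])
    return units


cells = cell_units(4, 4, 2)
subs = subsample_units(4, 4, 2)
print(cells[0])  # [0, 1, 4, 5]
print(subs[0])   # [0, 2, 8, 10]
```

Both groupings partition the same 16 patches into 4 units; they differ only in which patches are predicted together, which is the kind of design choice the flexible prediction units leave open.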
Technical Principles of xAR
Flow Matching: xAR recasts discrete token classification as continuous entity regression. During training, each target entity is interpolated with Gaussian noise to produce a noisy input, and at each autoregressive step the model predicts the velocity, i.e. the direction of flow from the noise distribution toward the target distribution.
Inference Strategy: xAR generates an image autoregressively, one unit at a time. Each unit starts from Gaussian noise and is refined conditioned on the units generated so far, until the entire image is complete.
Experimental Results: xAR reports significant improvements on the ImageNet-256 and ImageNet-512 benchmarks; the xAR-B model is reported to run inference 20 times faster than DiT-XL while achieving an FID of 1.72.
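The flow matching and inference steps described above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the xAR implementation: it assumes the common linear interpolation x_t = t * x1 + (1 - t) * x0 with velocity target x1 - x0, an independent noise level per entity during training, and plain Euler integration at inference. All function names are hypothetical, and `oracle_velocity` stands in for a trained network.

```python
import random


def interpolate(x1, t, rng):
    """Mix a clean entity x1 with Gaussian noise x0 at time t in [0, 1].

    Assumed convention: x_t = t * x1 + (1 - t) * x0, velocity target = x1 - x0.
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]
    xt = [t * a + (1.0 - t) * b for a, b in zip(x1, x0)]
    velocity = [a - b for a, b in zip(x1, x0)]
    return x0, xt, velocity


def training_example(entities, rng):
    """Noise every entity with its own random t, so the context is noisy too."""
    noisy, targets, times = [], [], []
    for x1 in entities:
        t = rng.random()
        _, xt, velocity = interpolate(x1, t, rng)
        noisy.append(xt)
        targets.append(velocity)
        times.append(t)
    return noisy, targets, times


def euler_sample(velocity_fn, context, dim, steps, rng):
    """Integrate dx/dt = v(x, t, context) from t=0 (noise) toward t=1 with Euler steps."""
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt, context)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def generate(velocity_fn, num_entities, dim, steps, rng):
    """Generate entities one at a time, each conditioned on the previous ones."""
    context = []
    for _ in range(num_entities):
        context.append(euler_sample(velocity_fn, context, dim, steps, rng))
    return context


def oracle_velocity(x, t, context):
    """Toy stand-in for a trained network: the target for entity i is the
    constant vector [i, i, ...], and the exact velocity toward a known
    target x1 on the linear path is (x1 - x) / (1 - t)."""
    target = float(len(context))
    return [(target - xi) / (1.0 - t) for xi in x]


rng = random.Random(0)
noisy, targets, times = training_example([[1.0, 2.0], [3.0, 4.0]], rng)
image = generate(oracle_velocity, num_entities=3, dim=2, steps=10, rng=rng)
print(image)
```

In an actual model, a network would take the noisy entities and their time steps as input and be trained with a regression loss against the velocity targets; the oracle is used here only so the sampling loop can be run end to end without training.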
Application Scenarios of xAR
Art Creation: Generate creative images for inspiration or direct use in artworks.
Virtual Scene Generation: Quickly generate realistic virtual scenes for game development and virtual reality.
Old Photo Restoration: Restore damaged parts of old photos, recovering original details and colors.
Video Content Generation: Generate specific scenes or objects in videos for video effects production and editing.
Data Augmentation: Expand training datasets by generating diverse images, improving model generalization and robustness.