The Dynamic-Length Float (DFloat11) framework compresses Large Language Models (LLMs) without any loss of accuracy: it reduces model size by roughly 30% while keeping outputs bit-for-bit identical to the original model. The key observation behind DFloat11 is that the BFloat16 weight representation used by LLMs is information-theoretically inefficient: its exponent bits carry low entropy, so they can be recoded with a variable-length (entropy) code at no loss.
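To see why low entropy makes lossless compression possible, here is a toy sketch (not the paper's implementation) that measures the entropy of a skewed exponent distribution and builds a Huffman code over it. The sample distribution is invented for illustration; in real LLM weights the BFloat16 exponents cluster similarly around a few values.

```python
import heapq
from collections import Counter
from math import log2

def entropy_bits(symbols):
    """Shannon entropy (bits/symbol) of an iterable of symbols."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())

def huffman_code_lengths(symbols):
    """Return {symbol: code length} from a Huffman tree over frequencies."""
    counts = Counter(symbols)
    if len(counts) == 1:
        return {next(iter(counts)): 1}
    # Heap entries: (frequency, tiebreak, {symbol: depth so far}).
    heap = [(c, i, {s: 0}) for i, (s, c) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, d1 = heapq.heappop(heap)
        f2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: d + 1 for s, d in {**d1, **d2}.items()}
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

# Toy exponent sample: heavily skewed toward a few values (made up here,
# but qualitatively like the exponent histogram of trained weights).
sample = [126] * 700 + [125] * 150 + [127] * 100 + [124] * 40 + [120] * 10

H = entropy_bits(sample)
lengths = huffman_code_lengths(sample)
counts = Counter(sample)
avg_len = sum(counts[s] * n for s, n in lengths.items()) / len(sample)

print(f"entropy:         {H:.2f} bits/exponent (vs. 8 bits stored)")
print(f"Huffman average: {avg_len:.2f} bits/exponent")
```

Because decoding a Huffman code recovers every symbol exactly, the recoded exponents decompress to the original bits, which is what makes this kind of scheme lossless rather than merely approximate.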
Experiments on models such as Llama-3.1, Qwen-2.5, and Gemma-3 show that DFloat11 achieves this roughly 30% size reduction while preserving exact outputs. Compared to offloading parts of an uncompressed model to the CPU, DFloat11 delivers 1.9-38.8x higher token-generation throughput. With a fixed GPU memory budget, it also enables 5.3-13.17x longer context lengths than the uncompressed model.
DFloat11 makes it possible to perform lossless inference on extremely large models like Llama-3.1-405B (810GB) on a single node equipped with 8x80GB GPUs. This significantly lowers the hardware requirements for deploying state-of-the-art LLMs.
For more technical details, you can access the full paper here.