News

Breakthrough in LLM Compression: Dynamic-Length Float (DFloat11) Reduces Model Size by 30% Without Accuracy Loss

16h ago
LLM Compression · Dynamic-Length Float · GPU Inference · Entropy Coding · Custom GPU Kernel · Llama-3.1 · Qwen-2.5 · Gemma-3
The Dynamic-Length Float (DFloat11) framework enables lossless compression of Large Language Models (LLMs), reducing model size by 30% while producing bit-for-bit identical outputs and significantly improving GPU inference efficiency.

Lossless LLM Compression with Dynamic-Length Float for Efficient GPU Inference

The Dynamic-Length Float (DFloat11) framework is a groundbreaking approach to compressing Large Language Models (LLMs) without any loss of accuracy. The method reduces model size by 30% while producing bit-for-bit identical outputs to the original model. The key observation behind DFloat11 is that the BFloat16 weight representation of LLMs is information-inefficient: its exponent bits carry far less entropy than the 8 bits allocated to them, leaving substantial headroom for lossless compression.
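
This low entropy can be checked directly by histogramming the 8-bit exponent field of a model's BFloat16 weights and computing its Shannon entropy. The sketch below is illustrative only: the random tensor stands in for real checkpoint weights, and `exponent_entropy` is a name chosen here, not from the paper.

```python
import torch

def exponent_entropy(weights: torch.Tensor) -> float:
    """Shannon entropy, in bits, of the 8-bit exponent field of BF16 weights."""
    # BF16 bit layout: sign [bit 15] | exponent [bits 14:7] | mantissa [bits 6:0]
    raw = weights.to(torch.bfloat16).view(torch.int16).to(torch.int32) & 0xFFFF
    exponents = (raw >> 7) & 0xFF
    counts = torch.bincount(exponents.flatten(), minlength=256).float()
    probs = counts[counts > 0] / counts.sum()
    return float(-(probs * probs.log2()).sum())

w = torch.randn(1_000_000)  # stand-in for real checkpoint weights
print(f"exponent entropy: {exponent_entropy(w):.2f} of 8 allocated bits")
```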

How DFloat11 Works

  • Entropy Coding: DFloat11 applies entropy coding to assign dynamic-length encodings to weights based on their frequency, achieving near information-optimal compression without sacrificing precision (a toy illustration follows this list).
  • Custom GPU Kernel: To support efficient inference with dynamic-length encodings, the authors develop a custom GPU kernel that includes:
    • Decomposition of memory-intensive lookup tables (LUTs) into compact LUTs that fit in GPU SRAM.
    • A two-phase kernel for coordinating thread read/write positions using lightweight auxiliary variables.
    • Transformer-block-level decompression to minimize latency.
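
As a toy illustration of the entropy-coding step referenced above, the sketch below builds a Huffman code over a skewed BF16-exponent histogram and compares the average code length with the 8 bits BFloat16 allocates to the exponent. The histogram values and the helper `huffman_code_lengths` are assumptions made here for illustration; the paper's actual decoder is the custom GPU kernel described above, not this CPU loop.

```python
import heapq
from collections import Counter

def huffman_code_lengths(frequencies: dict[int, int]) -> dict[int, int]:
    """Return the code length, in bits, of each symbol under a Huffman code."""
    heap = [(freq, i, {sym: 0}) for i, (sym, freq) in enumerate(frequencies.items())]
    heapq.heapify(heap)
    next_id = len(heap)  # unique tiebreaker so tuples never compare dicts
    while len(heap) > 1:
        f1, _, lens1 = heapq.heappop(heap)
        f2, _, lens2 = heapq.heappop(heap)
        # Merging two subtrees pushes every contained symbol one level deeper.
        merged = {sym: depth + 1 for sym, depth in {**lens1, **lens2}.items()}
        heapq.heappush(heap, (f1 + f2, next_id, merged))
        next_id += 1
    return heap[0][2]

# A skewed exponent histogram of the kind seen in trained LLM weights:
# a few exponent values dominate, so their codes become very short.
freqs = Counter({126: 50_000, 125: 30_000, 127: 12_000, 124: 6_000, 123: 2_000})
lengths = huffman_code_lengths(freqs)
total = sum(freqs.values())
avg_bits = sum(freqs[s] * n for s, n in lengths.items()) / total
print(f"average code length: {avg_bits:.2f} bits (vs. 8 bits uncompressed)")  # ~1.78
```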

Performance Benefits

Experiments on models such as Llama-3.1, Qwen-2.5, and Gemma-3 demonstrate that DFloat11 achieves a 30% reduction in model size while preserving exact outputs. Compared with offloading parts of an uncompressed model to the CPU, DFloat11 delivers 1.9-38.8x higher token-generation throughput. Additionally, with a fixed GPU memory budget, DFloat11 enables 5.3-13.17x longer context lengths than uncompressed models.
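
A back-of-the-envelope calculation shows why a 30% weight reduction can translate into a several-fold longer context under a fixed memory budget: when weights dominate GPU memory, a modest weight saving multiplies the small remainder available to the KV cache. All sizes below are assumptions for illustration, not figures from the paper.

```python
# Illustrative arithmetic only; the sizes are assumptions, not measurements.
gpu_memory_gb = 80.0
weights_bf16_gb = 75.0                        # weights nearly fill the GPU
weights_df11_gb = 0.70 * weights_bf16_gb      # ~30% smaller with DFloat11

kv_before = gpu_memory_gb - weights_bf16_gb   # 5.0 GB left for the KV cache
kv_after = gpu_memory_gb - weights_df11_gb    # 27.5 GB left for the KV cache
print(f"KV-cache budget grows {kv_after / kv_before:.1f}x")  # -> 5.5x
```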

Real-World Applications

DFloat11 makes it possible to perform lossless inference on extremely large models like Llama-3.1-405B (810GB in BFloat16) on a single node equipped with 8x80GB GPUs. This significantly lowers the hardware requirements for deploying state-of-the-art LLMs.
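
The arithmetic for the 405B case checks out. Using the sizes quoted above and the 70% compression ratio:

```python
# Quick sanity check: sizes from the article, ratio per DFloat11.
node_gb = 8 * 80                      # aggregate memory of 8x80GB GPUs = 640 GB
model_bf16_gb = 810                   # Llama-3.1-405B in BFloat16
model_df11_gb = 0.70 * model_bf16_gb  # = 567.0 GB
print("fits on one node" if model_df11_gb < node_gb else "does not fit",
      f"({node_gb - model_df11_gb:.0f} GB left for activations and KV cache)")
```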

For more technical details, see the full paper, "70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float" (listed in the sources below).

Sources

70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float (the paper introducing DFloat11, a lossless compression framework that reduces LLM size by 30% while preserving exact outputs)
Lossless Compression for LLMs with Dynamic-Length Float (newsletter coverage of the paper)