DeepGEMM is an open-source library designed for efficient and concise FP8 matrix multiplication, optimized for NVIDIA Hopper Tensor Cores.
What is DeepGEMM?
DeepGEMM is an open-source library by DeepSeek for efficient FP8 (8-bit floating point) general matrix multiplication (GEMM). It is optimized for NVIDIA Hopper Tensor Cores and supports both regular GEMM and the grouped GEMM used in Mixture of Experts (MoE) models. Kernels are compiled Just-In-Time (JIT) at runtime, so nothing needs to be compiled at installation time.
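The JIT idea described above can be illustrated with a small, pure-Python sketch. This is not DeepGEMM's API or its compiler pipeline: `build_matmul` is a hypothetical helper that generates source with the shape baked in as constants, compiles it on first use, and caches the result so that each shape is compiled only once.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def build_matmul(n, k):
    """Generate and compile a matmul specialized for fixed N and K.

    Hypothetical stand-in for a JIT step: the shape is written into the
    generated source as literal constants, and lru_cache ensures each
    (n, k) pair is compiled only once.
    """
    source = (
        "def matmul(a, b):\n"
        f"    return [[sum(row[p] * b[p][j] for p in range({k}))\n"
        f"             for j in range({n})] for row in a]\n"
    )
    namespace = {}
    exec(compile(source, f"<jit n={n} k={k}>", "exec"), namespace)
    return namespace["matmul"]


a = [[1.0, 2.0], [3.0, 4.0]]                # M x K = 2 x 2
b = [[5.0, 6.0, 7.0], [8.0, 9.0, 10.0]]     # K x N = 2 x 3

kernel = build_matmul(n=3, k=2)       # first call: generates and compiles
result = kernel(a, b)
same_kernel = build_matmul(n=3, k=2)  # second call: served from the cache
```

The trade-off is a one-time compilation cost per new shape in exchange for code in which loop bounds are known constants.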
Key Features
- Efficient FP8 Matrix Multiplication: Designed for FP8 GEMM, it supports fine-grained scaling to improve precision and performance.
- Regular and Grouped GEMM: Handles standard matrix multiplication and grouped GEMM for MoE models.
- JIT Compilation: Kernels are dynamically compiled at runtime based on matrix shape and parameters.
- Hopper Architecture Optimization: Uses Hopper's Tensor Memory Accelerator (TMA) to move data asynchronously between global and shared memory.
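- Fine-Grained Scaling and Dual-Level Accumulation: Compensates for the imprecise accumulation of FP8 Tensor Cores by applying fine-grained scaling factors and promoting partial sums to a higher-precision accumulator on CUDA cores (two-level accumulation).
- Lightweight Design: The core kernel is only about 300 lines of code, making it easy to read, learn from, and modify.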
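To make the fine-grained scaling and dual-level (two-level) accumulation features above more concrete, here is a minimal pure-Python sketch of a dot product. It is an illustration under stated assumptions, not DeepGEMM's kernel: `round_to_e4m3`, `quantize`, and `scaled_dot` are hypothetical helpers, FP8 is simulated as E4M3 rounding, and the block size of 128 is chosen for illustration. Each block gets its own scale, its partial sum is computed over the FP8 codes, and the scaled partial sum is then added to a wider accumulator.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite value in the FP8 E4M3 format


def round_to_e4m3(x):
    """Round x to the nearest E4M3 value (3 mantissa bits), saturating."""
    if x == 0.0:
        return 0.0
    mag = min(abs(x), FP8_E4M3_MAX)
    # frexp gives mag = m * 2**e with 0.5 <= m < 1, so floor(log2(mag)) = e - 1.
    # Exponents below -6 are subnormal and share a single step size.
    exp = max(math.frexp(mag)[1] - 1, -6)
    step = 2.0 ** (exp - 3)
    return math.copysign(round(mag / step) * step, x)


def quantize(values, block):
    """Quantize with one scale per `block` consecutive values."""
    codes, scales = [], []
    for start in range(0, len(values), block):
        chunk = values[start:start + block]
        amax = max(abs(v) for v in chunk)
        scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
        scales.append(scale)
        codes.extend(round_to_e4m3(v / scale) for v in chunk)
    return codes, scales


def scaled_dot(a, b, block):
    """Dot product of two block-quantized vectors, accumulated in two levels."""
    a_codes, a_scales = quantize(a, block)
    b_codes, b_scales = quantize(b, block)
    total = 0.0  # outer accumulator, kept in full precision
    for i, start in enumerate(range(0, len(a), block)):
        # Inner level: partial sum over one block of FP8 codes.
        partial = sum(x * y for x, y in zip(a_codes[start:start + block],
                                            b_codes[start:start + block]))
        # Outer level: apply both block scales, then add to the wide accumulator.
        total += partial * a_scales[i] * b_scales[i]
    return total


# Two blocks with very different magnitudes (made-up data).
a = [0.001 * (i % 7 + 1) for i in range(128)] + [1000.0] * 128
b = [1000.0] * 128 + [0.001] * 128

exact = sum(x * y for x, y in zip(a, b))
fine = scaled_dot(a, b, block=128)       # one scale per 128 values
coarse = scaled_dot(a, b, block=len(a))  # a single scale for the whole vector
```

With a single scale, the small values are pushed toward zero by the large ones, so `coarse` drifts far from `exact`. With per-block scales, each block uses the full FP8 range and `fine` stays close to the exact result.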
Performance
- Regular GEMM: Up to 2.7x speedup over an optimized CUTLASS 3.6-based baseline for certain matrix shapes, with throughput exceeding 1000 TFLOPS on large matrices.
- Grouped GEMM: 1.1-1.2x speedup for MoE models, optimizing memory bandwidth utilization.
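For readers unfamiliar with how such figures are derived, the following sketch shows the usual arithmetic. The helper names and the timings are hypothetical and are not measured DeepGEMM results. A GEMM of shape (m x k) by (k x n) performs 2·m·n·k floating-point operations, and speedup is the ratio of baseline time to new time.

```python
def gemm_tflops(m, n, k, seconds):
    """Throughput of an (m x k) @ (k x n) GEMM in TFLOPS.

    Each of the m * n outputs needs k multiplies and k adds,
    so the operation count is 2 * m * n * k.
    """
    return 2 * m * n * k / seconds / 1e12


def speedup(baseline_seconds, new_seconds):
    """How many times faster the new kernel is than the baseline."""
    return baseline_seconds / new_seconds


# Hypothetical timings chosen only to illustrate the arithmetic.
tflops = gemm_tflops(m=4096, n=4096, k=4096, seconds=125e-6)
ratio = speedup(baseline_seconds=150e-6, new_seconds=125e-6)
```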
System Requirements
- Hardware: NVIDIA Hopper architecture GPUs (e.g., H800, H100).
- Software: CUDA 12.3+, Python 3.8+, PyTorch 2.1+, CUTLASS 3.6+, Linux OS.
Use Cases
- Large-Scale AI Model Inference: Accelerates high-dimensional matrix multiplication.
- MoE Models: Optimizes grouped matrix multiplication for efficient training and inference.
- Low-Precision Computation: Mitigates FP8 accumulation error through fine-grained scaling and higher-precision accumulation, so results keep higher-precision accuracy despite FP8 inputs.
- High-Performance Computing: Enhances matrix operation efficiency on Hopper architecture.
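The grouped GEMM used for MoE models can be sketched in plain Python as follows. This is a conceptual reference only, with hypothetical names (`matmul`, `grouped_gemm`, `group_sizes`), and it assumes that token rows are already sorted by expert and concatenated along the M axis while all experts share the same N and K.

```python
def matmul(a, b):
    """Plain (m x k) @ (k x n) product on nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def grouped_gemm(tokens, expert_weights, group_sizes):
    """Multiply each contiguous group of token rows by its own expert matrix.

    tokens:         all rows concatenated along M, ordered by expert
    expert_weights: one (k x n) matrix per expert (N and K are shared)
    group_sizes:    number of rows routed to each expert
    """
    assert sum(group_sizes) == len(tokens)
    out, start = [], 0
    for weights, size in zip(expert_weights, group_sizes):
        out.extend(matmul(tokens[start:start + size], weights))
        start += size
    return out


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
experts = [
    [[1.0, 2.0], [3.0, 4.0]],      # expert 0
    [[10.0, 20.0], [30.0, 40.0]],  # expert 1
]
# The first two rows go to expert 0, the last row to expert 1.
output = grouped_gemm(tokens, experts, group_sizes=[2, 1])
```

An optimized library would run all groups in one kernel launch rather than looping over experts. The loop here only shows which rows meet which weights.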
Getting Started
Visit the GitHub repository for installation instructions and documentation.