DeepGEMM

by DeepSeek
DeepGEMM is an open-source library designed for efficient and concise FP8 matrix multiplication, optimized for NVIDIA Hopper Tensor Cores.

What is DeepGEMM?

DeepGEMM is an open-source library by DeepSeek for efficient FP8 (8-bit floating point) general matrix multiplication (GEMM). It is optimized for NVIDIA Hopper Tensor Cores and supports both regular GEMM and the grouped GEMM operations used by Mixture of Experts (MoE) models. Kernels are compiled Just-In-Time (JIT) at runtime, so they can be specialized to the shapes actually used and nothing needs to be compiled at installation time.
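The idea of compiling a kernel on first use, specialized to the shape it is called with, can be illustrated with a small pure-Python analogy. This is not DeepGEMM's mechanism or API: the names `build_matmul` and `jit_matmul` are hypothetical, and generating Python source stands in for compiling a CUDA kernel.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def build_matmul(m: int, n: int, k: int):
    """Generate and compile a matmul specialized for one (m, n, k) shape."""
    # The shape is baked into the generated source as constants, loosely
    # analogous to passing shapes as compile-time parameters to a kernel.
    source = (
        "def matmul(a, b):\n"
        f"    return [[sum(a[i][p] * b[p][j] for p in range({k}))\n"
        f"             for j in range({n})] for i in range({m})]\n"
    )
    namespace = {}
    exec(source, namespace)  # the "compile" step, run once per distinct shape
    return namespace["matmul"]


def jit_matmul(a, b):
    """Look up (or build) the specialized function, then run it."""
    m, k, n = len(a), len(b), len(b[0])
    return build_matmul(m, n, k)(a, b)


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
result = jit_matmul(a, b)  # first call for shape (2, 2, 2) triggers the build
```

Later calls with the same shape reuse the cached function, so the build cost is paid once per shape rather than at installation time.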

Key Features

  • Efficient FP8 Matrix Multiplication: Built specifically for FP8 GEMM, with fine-grained (per-block) scaling to preserve precision without giving up performance.
  • Regular and Grouped GEMM: Handles standard matrix multiplication and grouped GEMM for MoE models.
  • JIT Compilation: Kernels are compiled at runtime, specialized to the matrix shape and other parameters.
  • Hopper Architecture Optimization: Uses the Tensor Memory Accelerator (TMA) for efficient data movement.
  • Fine-Grained Scaling and Dual-Level Accumulation: Mitigates imprecise FP8 Tensor Core accumulation by periodically promoting partial sums to a higher-precision accumulator.
  • Lightweight Design: The core kernel is concise (about 300 lines of code), making it easy to read, learn from, and optimize.
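Fine-grained scaling and dual-level accumulation can be illustrated numerically in plain Python. The sketch below is a simulation under stated assumptions, not DeepGEMM code: `quantize_e4m3` is a simplified model of FP8 E4M3 rounding, the block size of 128 and the 11-bit accumulator width are illustrative choices, and all helper names are hypothetical.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite value in the E4M3 format


def quantize_e4m3(x: float) -> float:
    """Round x to a nearby E4M3-like value (simplified model, no NaN handling)."""
    if x == 0.0:
        return 0.0
    ax = min(abs(x), FP8_E4M3_MAX)
    # Exponent of ax, clamped at -6 so tiny values land on the subnormal grid.
    exp = max(math.frexp(ax)[1] - 1, -6)
    step = 2.0 ** (exp - 3)  # 3 mantissa bits
    return math.copysign(min(round(ax / step) * step, FP8_E4M3_MAX), x)


def quantize_blocks(values, block):
    """Quantize with one scale per `block` elements, then dequantize."""
    out = []
    for start in range(0, len(values), block):
        chunk = values[start:start + block]
        amax = max(abs(v) for v in chunk)
        scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
        out.extend(quantize_e4m3(v / scale) * scale for v in chunk)
    return out


def round_to_bits(x: float, bits: int) -> float:
    """Keep only `bits` significant bits of x (models a narrow accumulator)."""
    if x == 0.0:
        return 0.0
    mantissa, exp = math.frexp(x)
    return math.ldexp(round(mantissa * 2 ** bits) / 2 ** bits, exp)


def accumulate(values, bits, promote_every):
    """Sum in a narrow accumulator, flushing to a float every `promote_every` adds."""
    total, partial = 0.0, 0.0  # high-precision and narrow accumulators
    for i, v in enumerate(values, 1):
        partial = round_to_bits(partial + v, bits)
        if i % promote_every == 0:
            total += partial
            partial = 0.0
    return total + partial


def max_rel_error(reference, approx):
    return max(abs(r - a) / abs(r) for r, a in zip(reference, approx))


# One block of small values next to one block containing large values.
small = [0.001 * (i + 1) / 128 for i in range(128)]
large = [1000.0 * (i + 1) / 128 for i in range(128)]
data = small + large
max_rel_tensor = max_rel_error(data, quantize_blocks(data, len(data)))  # one scale
max_rel_block = max_rel_error(data, quantize_blocks(data, 128))         # per-block scales

ones = [1.0] * 4096
single_level = accumulate(ones, bits=11, promote_every=len(ones))
two_level = accumulate(ones, bits=11, promote_every=128)
```

With a single scale for the whole tensor, the small block underflows to zero, while per-block scales keep its relative error within the format's rounding error. In the accumulation part, the narrow accumulator stalls at 2048 when used alone, whereas flushing it to a wider accumulator every 128 additions recovers the exact sum of 4096.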

Performance

  • Regular GEMM: Up to 2.7x speedup over a CUTLASS-based baseline for certain matrix shapes, as reported by the project, with throughput above 1000 TFLOPS on large matrices.
  • Grouped GEMM: 1.1-1.2x speedup over the same kind of baseline for MoE workloads, with improved memory bandwidth utilization.
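For readers who want to reproduce figures of this kind, the arithmetic behind a TFLOPS number and a speedup ratio is sketched below. The function names are hypothetical and the timings are made-up values chosen only to illustrate the calculation; they are not measurements of DeepGEMM.

```python
def gemm_tflops(m: int, n: int, k: int, seconds: float) -> float:
    """Achieved TFLOPS for an (m x k) @ (k x n) product.

    A GEMM performs m * n * k multiply-adds, counted as 2 * m * n * k FLOPs.
    """
    return 2 * m * n * k / seconds / 1e12


def speedup(baseline_seconds: float, optimized_seconds: float) -> float:
    """How many times faster the optimized kernel is than the baseline."""
    return baseline_seconds / optimized_seconds


# Hypothetical timings for a 4096 x 4096 x 4096 GEMM.
tflops = gemm_tflops(4096, 4096, 4096, seconds=1.25e-4)
ratio = speedup(baseline_seconds=2.0e-4, optimized_seconds=1.25e-4)
```

Under these assumed timings the kernel would reach roughly 1100 TFLOPS and a 1.6x speedup.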

System Requirements

  • Hardware: NVIDIA Hopper architecture GPUs (e.g., H800, H100).
  • Software: CUDA 12.3+, Python 3.8+, PyTorch 2.1+, CUTLASS 3.6+, Linux OS.
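A quick way to compare an environment against these minimums is sketched below. The helper names (`parse_version`, `check`, `probe`) are hypothetical and not part of DeepGEMM; the probe assumes PyTorch is the source of the CUDA version and device capability, and it does not check CUTLASS or the operating system.

```python
import sys

MIN_PYTHON = (3, 8)
MIN_CUDA = (12, 3)
MIN_TORCH = (2, 1)
HOPPER_CAPABILITY = (9, 0)  # compute capability of Hopper GPUs such as H100/H800


def parse_version(text: str) -> tuple:
    """Turn '12.3' or '2.1.0+cu121' into a tuple of its leading integers."""
    parts = []
    for piece in text.split("+")[0].split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def check(python, cuda, torch_version, capability) -> list:
    """Return a list of unmet requirements (empty when everything is met)."""
    problems = []
    if tuple(python) < MIN_PYTHON:
        problems.append("Python 3.8+ required")
    if tuple(cuda) < MIN_CUDA:
        problems.append("CUDA 12.3+ required")
    if tuple(torch_version) < MIN_TORCH:
        problems.append("PyTorch 2.1+ required")
    if tuple(capability) != HOPPER_CAPABILITY:
        problems.append("Hopper GPU (compute capability 9.0) required")
    return problems


def probe() -> list:
    """Collect versions from the running environment and check them."""
    try:
        import torch
    except ImportError:
        return ["PyTorch is not installed"]
    if not torch.cuda.is_available():
        return ["No CUDA device is visible to PyTorch"]
    return check(
        sys.version_info[:2],
        parse_version(torch.version.cuda),
        parse_version(torch.__version__),
        torch.cuda.get_device_capability(),
    )
```

Calling `probe()` on the target machine returns an empty list when the checked requirements are satisfied.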

Use Cases

  • Large-Scale AI Model Inference: Accelerates high-dimensional matrix multiplication.
  • MoE Models: Optimizes grouped matrix multiplication for efficient training and inference.
  • Low-Precision Computation: Mitigates FP8 accumulation error so that low-precision inputs still produce higher-precision outputs.
  • High-Performance Computing: Enhances matrix operation efficiency on Hopper architecture.
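The grouped matrix multiplication mentioned for MoE models can be sketched in plain Python as follows. This is a reference-style illustration, not DeepGEMM's API: `grouped_gemm` is a hypothetical name, and the layout (token rows grouped contiguously by expert, with every expert weight sharing one shape) is an assumption made for the example.

```python
def matmul(a, b):
    """Plain (rows x k) @ (k x cols) product on nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def grouped_gemm(tokens, group_sizes, expert_weights):
    """Multiply each contiguous group of token rows by its own expert weight."""
    assert sum(group_sizes) == len(tokens)
    assert len(group_sizes) == len(expert_weights)
    out, start = [], 0
    for size, weight in zip(group_sizes, expert_weights):
        # Rows start .. start + size - 1 were routed to this expert.
        out.extend(matmul(tokens[start:start + size], weight))
        start += size
    return out


tokens = [[1, 0], [0, 1], [1, 1]]  # three tokens with hidden size 2
group_sizes = [2, 1]               # first two rows go to expert 0, last row to expert 1
expert_weights = [
    [[1, 2], [3, 4]],
    [[10, 0], [0, 10]],
]
output = grouped_gemm(tokens, group_sizes, expert_weights)
```

An optimized grouped GEMM computes the same result in a single kernel launch instead of looping over experts, which is where the efficiency gain for MoE layers comes from.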

Getting Started

Visit the GitHub repository for installation instructions and documentation.

Framework Features

Supported Tasks
  • Matrix Multiplication
  • Mixture of Experts (MoE) Operations
  • Low-Precision Computation
Tags
FP8 Matrix Multiplication, NVIDIA Hopper, JIT Compilation, High-Performance Computing, CUDA, Deep Learning, Open-Source AI Libraries, Mixture of Experts

Pricing
free
