DeepEP is an open-source Expert Parallel (EP) communication library for training and inference of Mixture of Experts (MoE) models, offering high-throughput, low-latency GPU kernels.
What is DeepEP?
DeepEP is an open-source Expert Parallel (EP) communication library developed by DeepSeek for training and inference of Mixture of Experts (MoE) models. It provides high-throughput, low-latency all-to-all GPU kernels and supports both intra-node communication over NVLink and inter-node communication over RDMA.
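To make the all-to-all pattern concrete, the sketch below simulates MoE dispatch and combine in a single Python process. It is not DeepEP's API: the function names (`dispatch`, `run_experts`, `combine`), the scalar "tokens", and the assumption that expert `e` lives on rank `e // experts_per_rank` are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def dispatch(tokens, topk_experts, experts_per_rank):
    """Send each token to every rank that owns one of its selected experts.

    Returns a dict: rank -> list of (token_id, expert_id, value).
    Assumes (for illustration) that expert e lives on rank e // experts_per_rank.
    """
    inbox = defaultdict(list)
    for token_id, (value, experts) in enumerate(zip(tokens, topk_experts)):
        for expert_id in experts:
            rank = expert_id // experts_per_rank
            inbox[rank].append((token_id, expert_id, value))
    return inbox


def run_experts(inbox, expert_fn):
    """Each rank applies its local experts to the tokens it received."""
    outbox = []
    for rank in sorted(inbox):
        for token_id, expert_id, value in inbox[rank]:
            outbox.append((token_id, expert_id, expert_fn(expert_id, value)))
    return outbox


def combine(outbox, topk_experts, topk_weights, num_tokens):
    """Return expert outputs to the owning token and sum them by gate weight."""
    result = [0.0] * num_tokens
    for token_id, expert_id, out in outbox:
        slot = topk_experts[token_id].index(expert_id)
        result[token_id] += topk_weights[token_id][slot] * out
    return result


# Toy setup: 3 scalar tokens, 4 experts spread over 2 ranks, top-2 routing.
tokens = [1.0, 2.0, 3.0]
topk_experts = [[0, 3], [1, 2], [0, 1]]
topk_weights = [[0.5, 0.5], [0.25, 0.75], [1.0, 0.0]]

inbox = dispatch(tokens, topk_experts, experts_per_rank=2)
outbox = run_experts(inbox, lambda expert_id, x: (expert_id + 1) * x)
combined = combine(outbox, topk_experts, topk_weights, num_tokens=len(tokens))
```

In a real expert-parallel deployment the `inbox` and `outbox` exchanges are the two all-to-all transfers between GPUs; these are the steps that a library such as DeepEP implements as communication kernels.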
Key Features of DeepEP
- Efficient Communication Kernels: High-throughput and low-latency all-to-all GPU kernels for MoE's dispatch and combine operations.
- Low-Precision Support: Supports low-precision data formats such as FP8 alongside BF16, reducing communication volume and memory footprint.
- Optimized Communication Mechanism: Kernels aligned with the group-limited gating algorithm from DeepSeek-V3, optimized for asymmetric-domain bandwidth forwarding, such as forwarding data from the NVLink domain to the RDMA domain.
- Low-Latency Inference Decoding: Pure-RDMA low-latency kernels for inference decoding, with reported dispatch latency as low as 163 microseconds.
- Communication-Computation Overlap: A hook-based overlap method that does not occupy any GPU streaming multiprocessor (SM) resources.
- Flexible Resource Management: Lets users control how many GPU SMs the communication kernels use, so resource usage can be adapted to different workloads.
- Network Configuration Optimization: Tested on InfiniBand networks, supporting traffic isolation through virtual lanes (VLs).
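The hook-based overlap idea from the list above can be pictured with a CPU-only analogy. The sketch below is not DeepEP code: `start_transfer` and `overlapped_step` are hypothetical names, and a background thread stands in for the network hardware that moves data while the caller keeps computing. The caller receives a hook and invokes it only when the data is actually needed.

```python
import threading
import time


def start_transfer(payload, delay_s=0.05):
    """Start a simulated transfer in the background and return a hook.

    Hypothetical helper: the worker thread plays the role of the network
    hardware, so the caller's compute resources stay free in the meantime.
    """
    received = {}

    def _transfer():
        time.sleep(delay_s)               # pretend the data is in flight
        received["data"] = list(payload)  # data has "arrived"

    worker = threading.Thread(target=_transfer)
    worker.start()

    def hook():
        """Block until the transfer has finished, then return the data."""
        worker.join()
        return received["data"]

    return hook


def overlapped_step(payload, compute_fn):
    """Issue the transfer, compute while it is in flight, then collect it."""
    hook = start_transfer(payload)
    local = compute_fn()   # runs while the simulated transfer is in progress
    remote = hook()        # wait only at the point where the data is needed
    return local, remote


local, remote = overlapped_step([1, 2, 3], lambda: sum(range(10)))
```

The design point is that waiting is deferred to the hook call, so useful work placed between issuing the transfer and calling the hook hides part of the communication time.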
Performance of DeepEP
- High-Throughput Kernels: Intended for training and inference prefilling, and tested on H800 GPUs, each connected to a CX7 InfiniBand 400 Gb/s RDMA network card (roughly 50 GB/s peak bandwidth).
- Low-Latency Kernels: Designed for inference decoding, using pure RDMA technology, significantly reducing latency.
- System Compatibility: Fully tested on InfiniBand networks and theoretically compatible with RDMA over Converged Ethernet (RoCE).
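As a rough back-of-the-envelope aid for reading these figures, the sketch below converts a link rate in Gb/s to GB/s and estimates an ideal transfer time for a batch of token activations. The batch size (128 tokens) and hidden size (7168) are assumed values used only for illustration, and the estimate ignores top-k fan-out, scaling factors, and protocol overhead, so real latencies will be higher.

```python
def link_rate_gbps_to_gbytes(gbps):
    """Convert a link rate in gigabits per second to gigabytes per second."""
    return gbps / 8.0


def payload_bytes(num_tokens, hidden_size, bytes_per_element):
    """Bytes needed to send one activation vector per token."""
    return num_tokens * hidden_size * bytes_per_element


def ideal_transfer_us(num_bytes, gbytes_per_s):
    """Lower bound on transfer time in microseconds at full link bandwidth."""
    return num_bytes / (gbytes_per_s * 1e9) * 1e6


# 400 Gb/s InfiniBand link -> 50 GB/s peak
peak = link_rate_gbps_to_gbytes(400)

# Assumed illustrative workload: 128 tokens, hidden size 7168
fp8_bytes = payload_bytes(128, 7168, 1)   # FP8: 1 byte per element
bf16_bytes = payload_bytes(128, 7168, 2)  # BF16: 2 bytes per element

fp8_us = ideal_transfer_us(fp8_bytes, peak)
bf16_us = ideal_transfer_us(bf16_bytes, peak)
```

Under these assumptions the FP8 payload is half the size of the BF16 one, which is why low-precision dispatch directly reduces the time spent on the wire.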
System Requirements for DeepEP
- Hardware Requirements: Hopper architecture GPUs (e.g., H100, H800), GPUDirect RDMA-capable devices, NVLink for intra-node communication, and RDMA networks for inter-node communication.
- Software Requirements: Python 3.8+, CUDA 12.3+, PyTorch 2.1+, and a modified version of NVSHMEM.
- Network Requirements: InfiniBand networks (fully tested); RDMA over Converged Ethernet (RoCE) is theoretically compatible.
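The software and GPU requirements above can be checked with a short script before attempting an install. This is a minimal sketch, not part of DeepEP: `parse_version` and `check_environment` are hypothetical helpers, the check assumes that Hopper GPUs report CUDA compute capability 9.0, and it does not verify NVSHMEM, NVLink, or the RDMA network.

```python
import re
import sys

MIN_PYTHON = (3, 8)
MIN_CUDA = (12, 3)
MIN_TORCH = (2, 1)
MIN_COMPUTE_CAPABILITY = (9, 0)  # Hopper (e.g., H100, H800)


def parse_version(text):
    """Turn a version string such as '2.1.0+cu121' into (major, minor)."""
    parts = []
    for piece in text.split("+")[0].split(".")[:2]:
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    while len(parts) < 2:
        parts.append(0)
    return tuple(parts)


def check_environment():
    """Return a dict of requirement name -> bool for this machine."""
    report = {"python": sys.version_info[:2] >= MIN_PYTHON}
    try:
        import torch
    except ImportError:
        report["torch"] = False
        return report

    report["torch"] = parse_version(torch.__version__) >= MIN_TORCH
    cuda = torch.version.cuda  # None on CPU-only PyTorch builds
    report["cuda"] = cuda is not None and parse_version(cuda) >= MIN_CUDA
    report["hopper_gpu"] = bool(
        torch.cuda.is_available()
        and torch.cuda.get_device_capability(0) >= MIN_COMPUTE_CAPABILITY
    )
    return report


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'ok' if ok else 'missing or too old'}")
```

If PyTorch is not installed, the script reports only the Python and PyTorch checks and skips the CUDA and GPU checks.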
Application Scenarios of DeepEP
- Large-Scale Model Training: Efficient parallel communication support for training MoE models.
- Inference Tasks: Suitable for latency-sensitive inference decoding scenarios.
- High-Performance Computing: Optimizes communication performance on NVLink and RDMA networks.
- Intelligent Customer Service: Low-latency MoE inference helps systems respond quickly to user inquiries.
- Financial Sector: Supports MoE models used for risk assessment and automated report generation.