KTransformers

KTransformers

by Tsinghua University
KTransformers is an open-source project by Tsinghua University's KVCache.AI team, designed to optimize the inference performance of large language models and reduce hardware requirements.

What is KTransformers?

KTransformers is an open-source framework developed by Tsinghua University's KVCache.AI team in collaboration with Qujing Technology. It is designed to optimize the inference performance of large language models (LLMs) and reduce hardware requirements, making it possible to run models with hundreds of billions of parameters on consumer-grade hardware.

Key Features of KTransformers

  • Supports Local Inference of Large Models: Run full versions of large models like DeepSeek-R1 with 671B parameters on a single GPU with only 24GB of VRAM.
  • Improves Inference Speed: Achieve preprocessing speeds of up to 286 tokens/s and inference generation speeds of up to 14 tokens/s.
  • Compatible with Various Models and Operators: Supports DeepSeek series and other MoE architecture models, with a flexible template injection framework for customization.
  • Reduces Hardware Requirements: Enables "home-based" deployment of large models, making it accessible to individual users and small teams.
  • Supports Long Sequence Tasks: Integrates Intel AMX instruction set for CPU pre-fill speeds up to 286 tokens/s, reducing processing time from minutes to seconds.

Technical Principles

  • MoE Architecture: Offloads sparse MoE matrices to CPU/DRAM while keeping dense parts on the GPU, reducing VRAM requirements.
  • Offload Strategy: Distributes tasks based on computational intensity, prioritizing high-intensity tasks on the GPU.
  • High-Performance Operator Optimization: Uses llamafile for CPU optimization and Marlin operators for GPU, achieving significant speedups.
  • CUDA Graph Optimization: Minimizes CPU/GPU communication breaks, improving inference performance.
  • Quantization and Storage Optimization: Uses 4bit quantization to compress model storage and optimize KV cache size.

Application Scenarios

  • Individual Developers and Small Teams: Run large models on consumer-grade hardware for text generation, Q&A systems, and more.
  • Long Sequence Tasks: Efficiently process long texts, code analysis, and other tasks.
  • Enterprise Applications: Deploy large models locally for intelligent customer service and content recommendation.
  • Academic Research: Explore and optimize MoE architecture models on ordinary hardware.
  • Education and Training: Use as a teaching tool for large model applications and optimization techniques.

Getting Started

Visit the GitHub repository to download and start using KTransformers.

Framework Features

Supported Tasks
Text Generation Q&a Systems Long Sequence Processing Code Analysis Intelligent Customer Service Content Recommendation
Tags
AI Optimization Large Language Models GPU/CPU Heterogeneous Computing Inference Speed MoE Architecture Quantization Open Source Deep Learning Model Deployment Hardware Efficiency

Getting Started

Pricing
free
Requirements
  • GPU with 24GB VRAM
  • Python 3.8+
  • CUDA Toolkit

Screenshots & Images

Primary Screenshot
Additional Images

Stats

0 Views
0 Favorites
13211 GitHub Stars

Community & Support

Similar Frameworks

TPO
0
Phantom by ByteDance
0
AgentSociety by Tsinghua University
0

Recently Viewed

MindSearch Framework