Running Qwen 2.5-Omni-3B on Consumer PCs: Key Requirements and Installation Guide
April 30, 2025
Qwen 2.5-Omni-3B
multimodal model
GPU memory
PyTorch
FlashAttention 2
Hugging Face
BF16 precision
Qwen 2.5-Omni-3B, a powerful multimodal model capable of processing text, images, audio, and video, can be run on consumer PCs with specific hardware and software requirements, including a modern GPU with sufficient VRAM and optimized precision settings.
Running Qwen 2.5-Omni-3B on Consumer PCs: Key Requirements and Installation Guide
Qwen 2.5-Omni-3B is a powerful multimodal model capable of processing text, images, audio, and video, and generating text and natural speech responses. While it is designed for high-performance environments, it is possible to run it on a consumer PC or laptop with certain considerations.
Key Requirements:
Installation Steps:
- Install the necessary libraries:
pip uninstall transformers
pip install git+https://github.com/huggingface/[email protected]
pip install accelerate
- Install the Qwen Omni utilities for handling multimodal inputs:
pip install qwen-omni-utils[decord] -U
- Load the model using the following code snippet:
from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor
from qwen_omni_utils import process_mm_info
model = Qwen2_5OmniForConditionalGeneration.from_pretrained("Qwen/Qwen2.5-Omni-3B", torch_dtype="auto", device_map="auto")
processor = Qwen2_5OmniProcessor.from_pretrained("Qwen/Qwen2.5-Omni-3B")
Usage Tips:
- Batch Inference: The model supports batch processing of mixed media inputs (text, images, audio, and video).
- Audio Output: To enable audio output, ensure the system prompt is set correctly. You can also change the voice type of the output audio using the
speaker
parameter.
- Memory Optimization: Use BF16 precision and FlashAttention 2 to reduce memory usage and speed up generation.
For more detailed instructions and examples, refer to the Hugging Face model page.
Sources
Qwen/Qwen2.5-Omni-3B - Hugging Face
Key Features · Omni and Novel Architecture · Real-Time Voice and Video Chat · Natural and Robust Speech Generation · Strong Performance Across ...