Pixtral 12B - Mistral AI's First Multimodal AI Model
Introduction
Pixtral 12B is a groundbreaking multimodal AI model developed by Mistral AI, capable of processing both text and image data. With 12 billion parameters, it is designed to handle complex tasks such as image captioning, object counting, and visual question answering.
Key Features
- Multimodal Capabilities: Processes both text and image data seamlessly.
- High Parameter Count: 12 billion parameters for enhanced performance.
- Vision Encoder: Processes images at their native resolution and aspect ratio, up to 1024x1024 pixels.
- Open Source: Available under the Apache 2.0 license for customization and deployment.
- Optimized Inference: Utilizes TensorRT-LLM for efficient performance on NVIDIA GPUs.
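Because the vision encoder tokenizes images rather than resizing them to a fixed square, the number of image tokens grows with resolution. The sketch below estimates that count under two assumptions drawn from the Pixtral release: a 16x16 patch size, and one row-break token per patch row plus a single end-of-image token. Treat the exact token names and counts as illustrative, not as the model's authoritative tokenizer behavior.

```python
# Sketch: estimate how many image tokens Pixtral's vision encoder produces
# for a given image size. Assumes 16x16 patches, one break token per patch
# row, and one end-of-image token (an approximation of the released model).

def pixtral_image_tokens(width: int, height: int, patch: int = 16) -> int:
    """Approximate image-token count for an image of the given size."""
    cols = width // patch    # patches per row
    rows = height // patch   # patch rows
    patches = cols * rows
    # one row-break token after each row, plus one end-of-image token
    return patches + rows + 1

# A full-resolution 1024x1024 image yields 64x64 = 4096 patch tokens,
# plus 64 row breaks and 1 end token.
print(pixtral_image_tokens(1024, 1024))  # 4161
```

This is why feeding a 512x512 crop instead of a 1024x1024 original roughly quarters the image-token budget, which directly affects latency and context usage.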
Technical Details
- Architecture: 40 decoder layers, 5,120 hidden dimension (14,336 feed-forward dimension), 32 attention heads.
- Vision Encoder and Adapter: a 400-million-parameter vision encoder, connected to the decoder through an adapter with GeLU activation.
- Inference Optimization: Supports dynamic batching, KV caching, and quantization.
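Of the optimizations above, KV caching is the one most responsible for fast autoregressive decoding: keys and values for already-generated tokens are stored so each step only computes attention inputs for the newest token. The toy sketch below illustrates the bookkeeping only; the class and function names are hypothetical and bear no relation to TensorRT-LLM's actual API.

```python
# Minimal illustration of KV caching during autoregressive decoding.
# Names and shapes are hypothetical, for exposition only.

class KVCache:
    """Stores per-position key and value vectors for one attention layer."""

    def __init__(self) -> None:
        self.keys: list[list[float]] = []
        self.values: list[list[float]] = []

    def append(self, k: list[float], v: list[float]) -> None:
        self.keys.append(k)
        self.values.append(v)

    def __len__(self) -> int:
        return len(self.keys)

def decode_step(cache: KVCache, new_k: list[float], new_v: list[float]) -> int:
    """One decoding step: cache the new token's K/V, then attention
    attends over all cached positions. Returns the attended length."""
    cache.append(new_k, new_v)
    return len(cache)

cache = KVCache()
for t in range(3):
    attended = decode_step(cache, [float(t)], [float(t)])
print(attended)  # 3 cached positions after three decode steps
```

Without the cache, step t would recompute keys and values for all t previous tokens, making generation quadratic in sequence length instead of linear.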
Use Cases
- Image and Text Understanding: Ideal for tasks requiring simultaneous parsing of visual and language information.
- Content Creation: Assists in generating descriptive text for images and creating article illustrations.
- Customer Support: Helps in understanding and responding to image-related queries.
- Medical Image Analysis: Provides diagnostic support by analyzing medical images.
Getting Started
Pixtral 12B is available for download and fine-tuning on Hugging Face. For more details, visit the project website.
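Once the model is served, requests typically interleave text and image parts in a single user message. The sketch below builds such a payload using the OpenAI-style chat format that servers like vLLM expose; the field names reflect that format and are an assumption about your serving stack, not something mandated by Pixtral itself.

```python
# Hypothetical sketch: build a multimodal chat message mixing a text part
# and an image-URL part, in the OpenAI-compatible format many inference
# servers (e.g. vLLM) accept. Field names are assumptions about the server.

def build_multimodal_message(question: str, image_url: str) -> dict:
    """Return one user message containing a text part and an image part."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }

msg = build_multimodal_message(
    "How many objects are in this picture?",
    "https://example.com/scene.jpg",
)
print(msg["content"][0]["type"], msg["content"][1]["type"])
```

The same structure extends to several images per message by appending additional `image_url` parts to the `content` list.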