DeepSeek-VL2

by DeepSeek
DeepSeek-VL2 is an open-source series of large-scale Mixture-of-Experts (MoE) vision-language models developed by DeepSeek. It significantly improves on its predecessor, DeepSeek-VL, and excels at visual question answering, optical character recognition, document/table/chart understanding, and visual grounding. The series comes in three sizes: DeepSeek-VL2-Tiny, DeepSeek-VL2-Small, and DeepSeek-VL2, with 1.0B, 2.8B, and 4.5B activated parameters, respectively. DeepSeek-VL2 handles resolutions up to 1152×1152 and extreme aspect ratios of 1:9 or 9:1, making it versatile across applications. It can also interpret scientific charts and generate Python code from plot images (Plot2Code).
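"Activated parameters" means that for each token, an MoE model runs only a small subset of its experts, so the compute per token is far below the total parameter count. The toy sketch below (not DeepSeek's actual DeepSeekMoE design, which adds shared experts and fine-grained expert segmentation; all dimensions are illustrative) shows the basic top-k routing idea:

```python
import numpy as np

def moe_forward(x, gate_w, expert_ws, top_k=2):
    """Route a token through only its top-k experts (toy MoE layer).

    x: (d,) token vector; gate_w: (d, n_experts) router weights;
    expert_ws: list of (d, d) expert weight matrices.
    """
    logits = x @ gate_w                        # router score per expert
    top = np.argsort(logits)[-top_k:]          # indices of the k best experts
    weights = np.exp(logits[top])
    weights /= weights.sum()                   # softmax over selected experts only
    # Only top_k of n_experts run, so activated params << total params.
    return sum(w * (x @ expert_ws[i]) for w, i in zip(weights, top))

rng = np.random.default_rng(0)
d, n_experts = 8, 16
x = rng.standard_normal(d)
gate_w = rng.standard_normal((d, n_experts))
expert_ws = [rng.standard_normal((d, d)) for _ in range(n_experts)]
y = moe_forward(x, gate_w, expert_ws)
print(y.shape)  # (8,)
```

Here 2 of 16 experts fire per token, which is why a model can have far more total parameters than the 1.0B/2.8B/4.5B activated figures quoted above.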


Main Features of DeepSeek-VL2

  • Dynamic Resolution Support: Handles images with resolutions up to 1152x1152, supporting extreme aspect ratios of 1:9 or 9:1.
  • Chart Understanding: Interprets a wide range of scientific charts, a capability learned from research-document data.
  • Plot2Code: Generates Python code from images.
  • Meme Recognition: Parses and understands various memes.
  • Visual Grounding: Performs zero-shot visual grounding, finding objects in images based on natural language descriptions.
  • Visual Storytelling: Connects multiple images to form a visual story.
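Dynamic resolution support typically works by splitting a large or oddly shaped image into fixed-size tiles for the vision encoder. The sketch below is a simplified illustration, not DeepSeek-VL2's actual tiling algorithm; the 384-pixel tile size and 9-tile budget are assumptions chosen so that the 1152×1152 (3×3 grid) and 9:1 (9×1 grid) limits from the text fall out naturally:

```python
def choose_tile_grid(width, height, tile=384, max_tiles=9):
    """Pick the (cols, rows) tile grid whose aspect ratio best matches
    the input image, preferring more tiles when ratios tie."""
    best, best_err = (1, 1), float("inf")
    for cols in range(1, max_tiles + 1):
        for rows in range(1, max_tiles + 1):
            if cols * rows > max_tiles:
                continue  # stay within the tile budget
            err = abs(width / height - cols / rows)
            more_tiles = cols * rows > best[0] * best[1]
            if err < best_err or (err == best_err and more_tiles):
                best, best_err = (cols, rows), err
    return best

print(choose_tile_grid(1152, 1152))  # (3, 3): a square image fills the budget
print(choose_tile_grid(3456, 384))   # (9, 1): an extreme 9:1 panorama
```

Each selected tile would then be resized to 384×384 and encoded independently, which is how a fixed-resolution vision encoder can serve arbitrary aspect ratios.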

Technical Principles of DeepSeek-VL2

  • Multi-Head Latent Attention (MLA): Compresses keys and values into a low-rank latent vector, shrinking the key-value cache that bottlenecks inference.
  • DeepSeekMoE Architecture: Uses a high-performance sparse MoE architecture in the feed-forward networks, reducing training cost.
  • Cost-Effective Training and Inference: The underlying DeepSeek MoE language model was pretrained on a diverse corpus of 8.1 trillion tokens, cutting training costs by 42.5% compared with DeepSeek 67B.
  • Long Context Windows: Supports context windows of up to 128K tokens.
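The KV-cache saving from MLA comes from caching one small latent vector per token and reconstructing full-width keys and values from it on demand. The numbers below are a toy illustration (real MLA also keeps a separate decoupled rotary-embedding key, which this sketch omits):

```python
import numpy as np

rng = np.random.default_rng(1)
d, r, n_tokens = 64, 8, 10    # model dim, latent rank r << d, sequence length

W_dkv = rng.standard_normal((d, r)) / np.sqrt(d)  # down-projection (compress)
W_uk  = rng.standard_normal((r, d)) / np.sqrt(r)  # up-projection for keys
W_uv  = rng.standard_normal((r, d)) / np.sqrt(r)  # up-projection for values

cache = []                       # the only per-token state kept during decoding
for _ in range(n_tokens):
    x = rng.standard_normal(d)   # hidden state of the newest token
    cache.append(x @ W_dkv)      # cache r numbers instead of 2*d (K plus V)

C = np.stack(cache)              # (n_tokens, r) latent cache
K, V = C @ W_uk, C @ W_uv        # keys/values reconstructed on the fly
print(C.size, K.size + V.size)   # 80 vs 1280: a 16x smaller cache
```

With these toy dimensions the latent cache stores 80 floats where a standard KV cache would store 1280, which is the kind of reduction that lets long context windows fit in memory.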

Application Scenarios of DeepSeek-VL2

  • Chatbots: Enables natural language interaction with users.
  • Image Captioning: Generates descriptive text based on image content.
  • Code Generation: Generates code from user requirements, useful in programming and software development.

Project Address of DeepSeek-VL2

GitHub: https://github.com/deepseek-ai/DeepSeek-VL2

Model Capabilities

Model Type: Multimodal
Supported Tasks: Visual Question Answering, Optical Character Recognition, Document Understanding, Table Understanding, Chart Understanding, Visual Grounding, Code Generation, Meme Recognition, Visual Storytelling
Tags: Vision-Language Model, Mixture-of-Experts, AI, Open Source, Computer Vision, Natural Language Processing, Multimodal AI, Document Understanding, Code Generation, Visual Grounding

Usage & Integration

Pricing: Free
License: Open source

Stats

4658 GitHub Stars

Similar Models

  • LongWriter by Tsinghua University and Zhipu AI
  • Pixtral12B by Mistral AI
  • LongCite by Tsinghua University