BGE-VL is a multimodal embedding model focused on tasks such as image-text retrieval and composed image retrieval.
What is BGE-VL?
BGE-VL is a multimodal embedding model developed by the Beijing Academy of Artificial Intelligence (BAAI) in collaboration with several universities. It is trained on MegaPairs, a large-scale synthetic dataset, and targets multimodal retrieval tasks such as image-text retrieval and composed image retrieval. An efficient multimodal data-synthesis pipeline gives the model strong generalization and retrieval performance. The BGE-VL series includes BGE-VL-Base and BGE-VL-Large, built on the CLIP architecture, and BGE-VL-MLLM, built on a multimodal large language model. The models perform strongly across a range of benchmarks, with particularly large accuracy gains on composed image retrieval. BGE-VL's core strengths are the scalability and quality of its data-synthesis method and its generalization across multimodal tasks.
Main Features of BGE-VL
- Image-Text Retrieval: Retrieves the most relevant images for a text query, or the most relevant text for an image query.
- Composed Image Retrieval: Accepts an image and a text instruction together, understands both jointly, and retrieves the target image more accurately.
- Multimodal Embedding: Maps images and text into a unified vector space, so data from different modalities can be compared and retrieved by vector similarity (see the sketch after this list).
- Instruction Fine-Tuning: Fine-tunes the model on synthesized multimodal instruction data so it better understands and executes complex multimodal tasks, improving generalization and task adaptability.
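To make the shared embedding space concrete, here is a minimal sketch using the stock CLIP checkpoint from Hugging Face transformers as a stand-in for BGE-VL-Base (which follows the same encoder contract). The checkpoint name and image paths are illustrative assumptions; the official model cards document BGE-VL's own loading code.

```python
# Sketch: map images and text into one vector space with a CLIP-style model,
# then rank candidates by cosine similarity. openai/clip-vit-base-patch32 is
# a stand-in checkpoint; BGE-VL-Base/Large expose the same kind of interface.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

name = "openai/clip-vit-base-patch32"
model = CLIPModel.from_pretrained(name).eval()
processor = CLIPProcessor.from_pretrained(name)

texts = ["a cat sleeping on a sofa", "a mountain lake at sunrise"]
images = [Image.open(p) for p in ("cat.jpg", "lake.jpg")]  # illustrative paths

inputs = processor(text=texts, images=images, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)

# L2-normalize so that a dot product equals cosine similarity.
img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)

scores = txt @ img.T            # [num_texts, num_images] similarity matrix
best = scores.argmax(dim=-1)    # most relevant image for each text query
```

One common baseline for composed queries in this CLIP-style setting is to fuse the normalized image and text embeddings of the query (for example, by averaging) before searching; BGE-VL-MLLM instead encodes the image-plus-instruction query natively.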
Technical Principles of BGE-VL
- Data Synthesis Method (MegaPairs):
  - Data Mining: Mines diverse image pairs from a massive image-text corpus, using multiple similarity models (e.g., CLIP) to find candidate images related to a query image.
  - Instruction Generation: Uses multimodal large language models (MLLMs) and large language models (LLMs) to summarize the relationship between each image pair and write a high-quality, open-domain retrieval instruction.
  - Triplet Construction: Assembles (query image, retrieval instruction, target image) triplets for model training. No manual annotation is required, which keeps the pipeline efficient and scalable (a simplified sketch follows).
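The loop below is a rough sketch of this pipeline under two simplifying assumptions: a single CLIP-style similarity model stands in for the multiple models the method combines, and a write_instruction callback stands in for the MLLM/LLM instruction-writing step. All function names here are hypothetical.

```python
# Hypothetical sketch of MegaPairs-style triplet construction, assuming
# precomputed image embeddings for the corpus (see the CLIP sketch above).
import torch

def mine_pairs(img_embs: torch.Tensor, k: int = 5, dup_cap: float = 0.95):
    """Return (query_idx, target_idx) pairs of related but distinct images."""
    embs = img_embs / img_embs.norm(dim=-1, keepdim=True)
    sims = embs @ embs.T              # cosine similarity matrix
    sims.fill_diagonal_(-1.0)         # never pair an image with itself
    pairs = []
    for q in range(sims.size(0)):
        top = sims[q].topk(k)
        # keep related neighbors, but drop near-duplicates
        pairs += [(q, int(t)) for s, t in zip(top.values, top.indices)
                  if s < dup_cap]
    return pairs

def build_triplets(pairs, write_instruction):
    # write_instruction(q, t) is a placeholder for the MLLM+LLM step that
    # describes how target t differs from query q as a retrieval instruction,
    # e.g. "the same dog, but running on a beach".
    return [(q, write_instruction(q, t), t) for q, t in pairs]
```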
- Multimodal Model Architecture:
  - CLIP-Based Architecture: BGE-VL-Base and BGE-VL-Large use a CLIP-style architecture: an image encoder and a text encoder map both modalities into the same vector space, and contrastive learning optimizes the representations.
  - Multimodal Large Model Architecture: BGE-VL-MLLM builds on a multimodal large language model, which handles more complex multimodal interactions and instruction understanding.
- Instruction Fine-Tuning: Fine-tunes the model on the synthesized multimodal instruction data, strengthening its ability to understand and execute multimodal retrieval tasks.
- Contrastive Learning and Optimization: During training, contrastive learning optimizes the multimodal embeddings, pulling related images and text closer together in the vector space and pushing unrelated data apart (a minimal loss sketch follows this list). Training on large-scale synthetic data lets the model learn more general multimodal representations, which underpins its strong performance across diverse multimodal tasks.
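Below is a minimal sketch of the symmetric InfoNCE objective that CLIP-style contrastive training commonly uses. The temperature value and any BGE-VL-specific details (e.g., hard-negative handling) are assumptions, not the model's published recipe.

```python
# Symmetric InfoNCE / CLIP-style contrastive loss with in-batch negatives:
# matched (query, target) pairs sit on the diagonal of the similarity matrix.
import torch
import torch.nn.functional as F

def contrastive_loss(query_embs, target_embs, temperature=0.05):
    q = F.normalize(query_embs, dim=-1)
    t = F.normalize(target_embs, dim=-1)
    logits = q @ t.T / temperature                     # [B, B] similarities
    labels = torch.arange(q.size(0), device=q.device)  # diagonal = positives
    # pull each matched pair together, push all other in-batch pairs apart,
    # averaged over both retrieval directions (query->target, target->query)
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.T, labels)) / 2
```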
Project Address of BGE-VL
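- GitHub Repository: https://github.com/VectorSpaceLab/MegaPairs
- HuggingFace Model Hub: https://huggingface.co/BAAI/BGE-VL-base (BGE-VL-large and the BGE-VL-MLLM variants are published under the same BAAI organization)
- arXiv Paper: https://arxiv.org/abs/2412.14475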
Application Scenarios of BGE-VL
- Intelligent Search: Users can upload images or input text to quickly find related content, improving search accuracy.
- Content Recommendation: Recommends similar image-text materials based on user-uploaded content or interests, enhancing personalized experiences.
- Image Editing Assistance: Helps designers quickly find reference images with similar styles, improving creative efficiency.
- Intelligent Customer Service: Combines image and text understanding to provide more intuitive solutions, enhancing service efficiency.
- Cultural Heritage Research: Retrieves related artifacts or research materials based on images and text, aiding archaeological and conservation efforts.