AI Models


All Models: a complete list of AI models and foundation models, sorted by newest first

Light-R1 by 360 Smart Brain

Light-R1 is an open-source AI model developed by 360 Smart Brain that focuses on long chain-of-thought reasoning in mathematics. It is based on Qwen2.5-32B-Instruct and trained on roughly 70,000 math problems with a two-stage curriculum learning approach (SFT followed by DPO). Light-R1 outperforms DeepSeek-R1-Distill-Qwen-32B, scoring 76.6 on the AIME24 benchmark versus DeepSeek's 72.6. Training is cost-efficient: about 6 hours on 12 H800 GPUs, roughly $1000 in compute. The model, dataset, training framework, and evaluation code are all released, making Light-R1 a valuable resource for the open-source community and a reference for low-cost training of specialized models.
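
As a rough sketch of what a two-stage SFT+DPO curriculum looks like in practice, the snippet below wires the two stages together with Hugging Face TRL. The dataset files and hyperparameters are placeholders, not Light-R1's published configuration.

```python
# Minimal two-stage curriculum sketch (SFT then DPO) in the spirit of
# Light-R1's recipe. Dataset paths and hyperparameters are hypothetical.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, SFTTrainer, DPOConfig, DPOTrainer

base = "Qwen/Qwen2.5-32B-Instruct"
model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)

# Stage 1: supervised fine-tuning on a (hypothetical) curated math CoT set,
# ordered easy-to-hard to form the curriculum.
sft_data = load_dataset("json", data_files="math_cot_stage1.jsonl", split="train")
SFTTrainer(
    model=model,
    train_dataset=sft_data,
    processing_class=tokenizer,
    args=SFTConfig(output_dir="light-r1-sft", num_train_epochs=1),
).train()

# Stage 2: preference optimization on chosen/rejected solution pairs mined
# from the hardest problems (columns: prompt, chosen, rejected).
dpo_data = load_dataset("json", data_files="math_pairs_stage2.jsonl", split="train")
DPOTrainer(
    model=model,
    train_dataset=dpo_data,
    processing_class=tokenizer,
    args=DPOConfig(output_dir="light-r1-dpo", beta=0.1),
).train()
```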

AI Model, Mathematics, Open Source, Long Chain Reasoning, Machine Learning, Education, Research, Enterprise Applications, Software Integration, Low-Cost Training
Language · Production · Open Source
44 views
QwQ-32B by Alibaba

QwQ-32B is Alibaba's open-source reasoning model with 32 billion parameters, trained using large-scale reinforcement learning (RL). It excels in tasks such as mathematical reasoning and programming, matching the performance of larger models like DeepSeek-R1. The model integrates agent capabilities, adjusting its reasoning process based on environmental feedback, demonstrating strong adaptability and reasoning power. Available on Hugging Face under the Apache 2.0 license, QwQ-32B highlights the potential of reinforcement learning in enhancing model performance and provides new directions for the development of Artificial General Intelligence (AGI).
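
Since the weights are published on Hugging Face under Apache 2.0, the model loads with standard transformers tooling; the snippet below assumes the official Qwen/QwQ-32B repository id.

```python
# Loading QwQ-32B from the Hugging Face hub with transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/QwQ-32B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

messages = [{"role": "user", "content": "How many primes are there below 100?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Reasoning models emit a long chain of thought before the final answer,
# so allow a generous token budget.
outputs = model.generate(inputs, max_new_tokens=4096)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```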

Reinforcement Learning, Mathematical Reasoning, Programming, Open Source, AI Models, Machine Learning, Artificial General Intelligence, Hugging Face, Qwen Chat, Agent Capabilities
Language · Production · Open Source
28 views
BGE-VL by Beijing Academy of Artificial Intelligence

BGE-VL is a multimodal vector model developed by the Beijing Academy of Artificial Intelligence in collaboration with several universities. It is trained on MegaPairs, a large-scale synthetic dataset, and specializes in multimodal retrieval tasks such as image-text retrieval and composed image retrieval. Efficient multimodal data synthesis gives the model strong generalization and retrieval performance. The family includes BGE-VL-Base and BGE-VL-Large, built on the CLIP architecture, and BGE-VL-MLLM, built on a multimodal large-model architecture. It performs strongly across benchmarks, with particularly large accuracy gains on composed image retrieval. BGE-VL's core strengths are the scalability and quality of its data synthesis method and its exceptional generalization across multimodal tasks.
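
To make the composed (image + text) retrieval task concrete, the sketch below scores candidates with a plain CLIP checkpoint and naive vector addition. BGE-VL ships its own checkpoints and fuses the query modalities inside the model, so treat this only as an illustration of the task, not BGE-VL's actual API.

```python
# CLIP-style composed retrieval: embed a query (reference image +
# modification text) and rank candidate images by cosine similarity.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

query_image = Image.open("reference.jpg")  # hypothetical files
query_text = "the same jacket but in red"
candidates = [Image.open(f"candidate_{i}.jpg") for i in range(5)]

with torch.no_grad():
    img = model.get_image_features(**processor(images=query_image, return_tensors="pt"))
    txt = model.get_text_features(**processor(text=[query_text], return_tensors="pt", padding=True))
    query = torch.nn.functional.normalize(img + txt, dim=-1)  # crude fusion, for illustration only
    cand = model.get_image_features(**processor(images=candidates, return_tensors="pt"))
    cand = torch.nn.functional.normalize(cand, dim=-1)

scores = (query @ cand.T).squeeze(0)
print(scores.argsort(descending=True))  # candidate ranking, best first
```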

Multimodal Vector Model, Image Retrieval, Text Retrieval, Artificial Intelligence, Machine Learning, CLIP Architecture, Data Synthesis, Contrastive Learning, Generalization
Multimodal · Production
30 views
SpatialVLA by Shanghai AI Lab, China Telecom AI Research Institute, ShanghaiTech

SpatialVLA is a spatial-enhanced vision-language-action model developed by Shanghai AI Lab, China Telecom AI Research Institute, and ShanghaiTech. It is pre-trained on 1.1 million real robot episodes and equipped with 3D Egocentric Position Encoding and Adaptive Spatial Grids. The model excels in 3D scene spatial understanding, zero-shot in-distribution generalization, and efficient adaptation to new robot setups. It achieves state-of-the-art performance across diverse evaluations and offers faster inference speed with fewer tokens per action. The open-source code and flexible fine-tuning mechanisms provide new technical pathways for robotics research and applications.
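
One reading of the adaptive-grid mechanism is distribution-aware discretization of continuous robot actions: bin edges follow the empirical action distribution so every token is used roughly equally often. The toy below bins a synthetic 1-D action dimension by quantiles; this is an illustrative interpretation, not the paper's exact construction.

```python
# Quantile-based action discretization: an "adaptive grid" over one
# action dimension, versus naive uniform binning.
import numpy as np

rng = np.random.default_rng(0)
actions = rng.normal(loc=0.0, scale=0.05, size=100_000)  # synthetic action data

n_bins = 256
# Adaptive edges: equal-probability-mass bins derived from the data itself.
edges = np.quantile(actions, np.linspace(0.0, 1.0, n_bins + 1))

def encode(a: np.ndarray) -> np.ndarray:
    """Map continuous actions to token ids in [0, n_bins)."""
    return np.clip(np.searchsorted(edges, a, side="right") - 1, 0, n_bins - 1)

def decode(tok: np.ndarray) -> np.ndarray:
    """Map token ids back to bin centers."""
    centers = 0.5 * (edges[:-1] + edges[1:])
    return centers[tok]

tokens = encode(actions[:5])
print(tokens, decode(tokens))  # round-trip: action -> token -> approx action
```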

Robotics, 3D Spatial Understanding, AI Models, Zero-shot Generalization, Cross-platform Universality, Efficient Adaptation, Real-time Robot Control, Open-source Fine-tuning Mechanisms, Industrial Applications
Multimodal · Production · Open Source
34 views
GEN3C by NVIDIA, University of Toronto, Vector Institute

GEN3C is a generative video model developed by NVIDIA, the University of Toronto, and the Vector Institute. It generates high-quality 3D video content by leveraging precise camera control and spatiotemporal consistency. The model constructs a 3D cache based on point clouds, using depth estimation from input images or video frames to back-project and generate 3D scenes. It then renders 2D videos based on user-provided camera trajectories, ensuring precise control over camera motion and avoiding inconsistencies found in traditional methods. GEN3C supports video generation from single to multiple viewpoints, making it suitable for both static and dynamic scenes. It also supports 3D editing and complex camera movements, providing powerful tools for video creation and simulation.
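
The 3D-cache construction starts from standard pinhole back-projection: every pixel of a depth map is lifted into camera-space 3D points via the inverse intrinsic matrix. A minimal version of that step, with a made-up intrinsic matrix and a toy constant-depth frame (rendering along a new trajectory and the generative refinement are the model's job and are not shown):

```python
# Back-project a depth map into a camera-space point cloud.
import numpy as np

def backproject(depth: np.ndarray, K: np.ndarray) -> np.ndarray:
    """depth: (H, W) metric depth; K: (3, 3) intrinsics. Returns (H*W, 3) points."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)
    rays = pix @ np.linalg.inv(K).T      # unit-depth rays in the camera frame
    return rays * depth.reshape(-1, 1)   # scale each ray by its pixel's depth

K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
points = backproject(np.full((480, 640), 2.0), K)  # toy constant-depth frame
print(points.shape)  # (307200, 3)
```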

Video Generation, 3D Modeling, Camera Control, AI Research, Novel View Synthesis, 3D Rendering, Dynamic Scenes, Spatiotemporal Consistency, Depth Estimation, Point Clouds
Vision · Experimental · Open Source
37 views
Gemini Embedding by Google

Gemini Embedding is a state-of-the-art text embedding model developed by Google that transforms text into high-dimensional numerical vectors capturing semantic and contextual information. Initialized from the Gemini model, it excels at language understanding, supports over 100 languages, and ranks first on the Massive Text Embedding Benchmark (MTEB) Multilingual leaderboard. It is versatile, supporting tasks such as efficient retrieval, text classification, and similarity detection. It accepts inputs up to 8K tokens and outputs 3072-dimensional vectors, with flexible dimension reduction via Matryoshka Representation Learning (MRL). Integrated into the Gemini API, it is aimed at developers, data scientists, and enterprise teams.
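
Client-side, the MRL trick that enables flexible output dimensions amounts to keeping the leading components of the embedding and renormalizing (the API can also return reduced dimensions directly). A minimal sketch, with a random stand-in vector since calling the API is out of scope here:

```python
# Matryoshka-style truncation of an MRL-trained embedding.
import numpy as np

def truncate_mrl(vec: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components and renormalize to unit length."""
    v = vec[:dim]
    return v / np.linalg.norm(v)

full = np.random.default_rng(0).normal(size=3072)  # stand-in for a real embedding
full /= np.linalg.norm(full)
small = truncate_mrl(full, 768)  # 768-d view of the same representation
print(small.shape)  # (768,)
```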

Text Embedding, Natural Language Processing, Multilingual Support, Semantic Analysis, Contextual Understanding, Efficient Retrieval, Text Classification, Similarity Detection, API Integration, High-Dimensional Vectors
Language · Production
22 views
START by Alibaba Group, University of Science and Technology of China

START (Self-Taught Reasoner with Tools) is a novel reasoning model developed by Alibaba Group and the University of Science and Technology of China. It enhances the reasoning capabilities of large language models (LLMs) by integrating external tools such as Python code executors. START employs the "Hint-infer" technique to insert prompts during the reasoning process, encouraging the model to use external tools. It also utilizes the "Hint-RFT" framework for self-learning and fine-tuning. START introduces tool invocation on top of long-chain reasoning (Long CoT), significantly improving accuracy and efficiency in complex mathematical problems, scientific questions, and programming challenges. It has outperformed existing models in multiple benchmarks and is the first open-source model to combine long-chain reasoning with tool integration.
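
A toy rendering of the Hint-infer loop: pause generation at a natural break in the chain of thought, append a hint steering the model toward the Python tool, execute the emitted code, and feed the output back. The generate_until wrapper and the sandbox below are hypothetical stand-ins, not START's released code, and the tag-based code markers replace the code fences a real prompt format would use.

```python
import contextlib
import io

def generate_until(prompt: str, stop: str) -> str:
    """Hypothetical: decode from an LLM until `stop` or end-of-sequence."""
    raise NotImplementedError("plug in your model client here")

def run_sandboxed(code: str) -> str:
    """Toy executor: run the snippet and capture stdout.
    A real system would isolate this in a proper sandbox."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(code, {})
    return buf.getvalue()

HINT = "\nWait, maybe using Python here is a good idea.\n<python>\n"

def hint_infer(question: str, rounds: int = 2) -> str:
    trace = question
    for _ in range(rounds):
        trace += generate_until(trace, stop="\n\n")    # reason up to a pause
        trace += HINT                                  # inject the tool hint
        code = generate_until(trace, stop="</python>") # model writes code
        trace += code + "</python>\nOutput: " + run_sandboxed(code) + "\n"
    return trace + generate_until(trace, stop="")      # finish the answer
```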

Reasoning Model, AI Tools, Python Integration, Long-Chain Reasoning, Self-Learning, Mathematical Problem Solving, Scientific Research, Programming, Education, Open Source
Language · Production · Open Source
46 views
Mistral Small 3.1 by Mistral AI

Mistral Small 3.1 is an open-source multimodal AI model developed by Mistral AI, featuring 24 billion parameters and released under the Apache 2.0 license. It excels in both text and image processing tasks, supporting a context window of up to 128k tokens and achieving inference speeds of up to 150 tokens per second. The model is optimized for efficiency, capable of running on a single RTX 4090 or a 32GB RAM Mac, making it suitable for local deployment. It supports up to 25 languages and performs well in benchmarks like MMLU and MMLU Pro, offering strong multimodal understanding capabilities.
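
For local serving, a vLLM setup along these lines should work; the repository id below is the one Mistral AI published for the instruct variant, but verify it against the Hugging Face hub before use, and note that full-precision 24B weights need quantization to fit a single consumer GPU.

```python
# Serving Mistral Small 3.1 locally with vLLM (repository id assumed).
from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Mistral-Small-3.1-24B-Instruct-2503",
          max_model_len=32768)  # cap below the full 128k window to save memory
params = SamplingParams(temperature=0.15, max_tokens=512)

outputs = llm.generate(["Summarize the Apache 2.0 license in two sentences."], params)
print(outputs[0].outputs[0].text)
```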

Multimodal AI, Open Source, Text Processing, Image Processing, Natural Language Processing, AI Models, Fast Inference, Long Context Window, Multilingual Support, Lightweight Design
Multimodal · Production · Open Source
35 views
SmolDocling by ds4sd

SmolDocling is a compact, multimodal document processing model designed for efficient conversion of document images into structured text. It supports a variety of elements including text, formulas, charts, and tables, making it suitable for academic papers, technical reports, and other document types. With only 256M parameters, it ensures fast inference speeds, processing each page in just 0.35 seconds on an A100 GPU. The model is fully compatible with Docling and supports multiple export formats.
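
A minimal transformers invocation, assuming the preview checkpoint id on the Hugging Face hub (verify before use); the model emits DocTags markup, which Docling can convert to Markdown, HTML, or JSON.

```python
# Convert one document page image to structured markup with SmolDocling.
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

model_id = "ds4sd/SmolDocling-256M-preview"  # assumed checkpoint id
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id)

page = Image.open("report_page1.png")  # hypothetical scanned page
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Convert this page to docling."},
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[page], return_tensors="pt")

out = model.generate(**inputs, max_new_tokens=4096)
print(processor.batch_decode(out, skip_special_tokens=True)[0])  # DocTags output
```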

Document Processing, OCR, Multimodal, Lightweight, Text Extraction, Formula Recognition, Chart Recognition, Table Recognition, Academic Papers, Technical Reports
Multimodal · Production · Open Source
59 views
Moshi by Kyutai

Moshi is an end-to-end real-time audio multimodal AI model developed by the French AI research lab Kyutai. It can listen and speak, simulating 70 different emotions and speaking styles. An open-source model positioned as comparable to GPT-4o, Moshi runs on ordinary laptops with low latency, supports on-device use, and protects user privacy. It was developed and trained by an eight-person team in six months. Moshi's code, weights, and technical papers will soon be released for free use and further research by users worldwide.

AI, Multimodal, Real-time, Open-source, Voice Assistant, Speech Recognition, Natural Language Processing, Emotion Simulation, Low Latency, Local Device Usage
Multimodal · Experimental · Open Source
33 views