HumanOmni is a large multimodal model designed for human-centric scenarios, integrating the visual and auditory modalities. It processes video, audio, or both together to understand human behavior, emotions, and interactions. Pre-trained on over 2.4 million video clips and 14 million instructions, HumanOmni employs a dynamic weight-adjustment mechanism to flexibly fuse visual and auditory information. It excels at tasks such as emotion recognition, facial description, and speech recognition, making it suitable for applications such as movie analysis, close-up video interpretation, and real-time video understanding.
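The dynamic-weighting idea can be pictured as a small gating module that blends the two modalities per input. The PyTorch snippet below is a conceptual sketch only, with assumed module names and dimensions, not HumanOmni's published architecture.

```python
import torch
import torch.nn as nn

class GatedAVFusion(nn.Module):
    """Illustrative dynamic-weight fusion of visual and auditory features.

    A learned gate decides, per sample, how much to rely on each modality.
    Conceptual sketch only; not HumanOmni's actual implementation.
    """

    def __init__(self, dim: int = 1024):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(2 * dim, dim),
            nn.ReLU(),
            nn.Linear(dim, 1),
            nn.Sigmoid(),
        )

    def forward(self, visual: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # visual, audio: (batch, dim) pooled features from each encoder
        w = self.gate(torch.cat([visual, audio], dim=-1))  # (batch, 1) in [0, 1]
        return w * visual + (1 - w) * audio  # input-dependent weighting

fused = GatedAVFusion()(torch.randn(2, 1024), torch.randn(2, 1024))
print(fused.shape)  # torch.Size([2, 1024])
```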
Soundwave is an open-source speech understanding model developed by The Chinese University of Hong Kong, Shenzhen. It focuses on aligning and understanding speech and text, using alignment and compression adapters to bridge the representation gap between the two modalities. This enables efficient compression of speech features and stronger performance across a range of speech-related tasks.
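One way a compression adapter can work is to downsample the speech feature sequence toward text-like lengths before projecting it into the language model's embedding space. The sketch below is a generic illustration under those assumptions; Soundwave's actual adapters differ in detail, and all dimensions here are invented.

```python
import torch
import torch.nn as nn

class CompressionAdapter(nn.Module):
    """Downsamples a speech feature sequence toward text-like lengths.

    Generic sketch of the adapter idea; not Soundwave's actual design.
    """

    def __init__(self, speech_dim: int = 1280, llm_dim: int = 4096, stride: int = 4):
        super().__init__()
        # Strided convolution compresses the time axis by `stride`.
        self.compress = nn.Conv1d(speech_dim, speech_dim,
                                  kernel_size=stride, stride=stride)
        self.project = nn.Linear(speech_dim, llm_dim)  # align to the text space

    def forward(self, speech_feats: torch.Tensor) -> torch.Tensor:
        # speech_feats: (batch, time, speech_dim), e.g. from a speech encoder
        x = self.compress(speech_feats.transpose(1, 2)).transpose(1, 2)
        return self.project(x)  # (batch, time // stride, llm_dim)

out = CompressionAdapter()(torch.randn(1, 100, 1280))
print(out.shape)  # torch.Size([1, 25, 4096])
```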
Parler-TTS is an open-source text-to-speech (TTS) model developed by Hugging Face. It generates high-quality, natural-sounding speech and can mimic specific speaker characteristics (gender, pitch, speaking style, etc.) described in an input prompt. The project is fully open source, including datasets, preprocessing, training code, and weights, encouraging innovation in high-quality, controllable TTS. Its architecture is based on MusicGen, combining a text encoder, a decoder, and an audio codec so that voice characteristics can be steered through natural-language descriptions.
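Generation follows the pattern from the project's documentation: one tokenized string describes the voice, another carries the words to speak. A minimal example, assuming the parler-tts package and the parler_tts_mini_v0.1 checkpoint (names may differ across releases):

```python
import soundfile as sf
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer

model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler_tts_mini_v0.1")
tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler_tts_mini_v0.1")

prompt = "Hey, how are you doing today?"
description = "A female speaker with a calm, slightly expressive tone and moderate pace."

# The description controls the voice; the prompt is the spoken content.
input_ids = tokenizer(description, return_tensors="pt").input_ids
prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids

generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
audio = generation.cpu().numpy().squeeze()
sf.write("parler_out.wav", audio, model.config.sampling_rate)
```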
Qwen2-Audio is an open-source speech model developed by Alibaba's Tongyi Qianwen (Qwen) team. It supports direct voice input with multilingual text output, enabling features such as voice chat and audio analysis in more than eight languages and dialects. The model performs strongly on audio benchmarks, is integrated into Hugging Face's transformers library for easy developer access, and supports fine-tuning for specific applications.
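Because the model ships in transformers, audio analysis takes only a few lines. A sketch based on the documented usage, assuming a local file clip.wav; argument names (e.g. audios=) can vary between transformers versions:

```python
import librosa
from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration

processor = AutoProcessor.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct")
model = Qwen2AudioForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-Audio-7B-Instruct", device_map="auto"
)

# Build a chat-style request mixing an audio clip with a text question.
conversation = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "clip.wav"},  # hypothetical local file
        {"type": "text", "text": "What is being said in this clip?"},
    ]},
]
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audio, _ = librosa.load("clip.wav", sr=processor.feature_extractor.sampling_rate)

inputs = processor(text=text, audios=[audio], return_tensors="pt", padding=True)
output_ids = model.generate(**inputs, max_new_tokens=128)
response = processor.batch_decode(
    output_ids[:, inputs.input_ids.size(1):], skip_special_tokens=True
)[0]
print(response)
```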
MiniCPM 3.0 is a high-performance edge AI model developed by ModelBest (FaceWall Intelligence), with 4B parameters. Despite its small size, it surpasses GPT-3.5 in performance. The model uses LLMxMapReduce technology to support effectively unlimited text length, substantially extending its contextual understanding. On function calling, MiniCPM 3.0 performs close to GPT-4o, demonstrating strong on-device execution capabilities. It also includes a RAG trio (retrieval, re-ranking, and generation models) that significantly improves Chinese retrieval and content-generation quality. MiniCPM 3.0 is fully open source, and the quantized model occupies only 2GB of memory, making it well suited to edge deployment while preserving data security and privacy.
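Loading the open weights for local inference follows the standard transformers pattern. This assumes the openbmb/MiniCPM3-4B checkpoint name on Hugging Face; the repository requires trust_remote_code:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

path = "openbmb/MiniCPM3-4B"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    path, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)

messages = [{"role": "user", "content": "Write a haiku about edge computing."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=64)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True))
```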
Marco is a large-scale commercial translation model developed by Alibaba International, supporting 15 major languages including Chinese, English, Japanese, Korean, Spanish, and French. It excels at context-aware translation, outperforming competitors such as Google Translate, DeepL, and GPT-4 on BLEU scores. Marco uses advanced multilingual data filtering and parameter-scaling techniques to ensure high-quality translations while reducing service costs. It is optimized for cross-border e-commerce, offering precise translations of product titles, descriptions, and customer interactions, and is available for large-scale commercial use on Alibaba's AI platform, Aidge.
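For reference, BLEU scores like those cited here are computed by comparing system output against reference translations. A toy example with the sacrebleu package (the sentences are invented for illustration):

```python
import sacrebleu

# One hypothetical system translation and one reference translation.
hypotheses = ["The wireless bluetooth earbuds feature noise cancellation."]
references = [["The wireless Bluetooth earbuds feature active noise cancellation."]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.1f}")  # higher means closer n-gram overlap
```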
SeedVR is a diffusion transformer model developed by Nanyang Technological University and ByteDance for high-quality, general-purpose video restoration. It introduces a shifted window attention mechanism that uses large (64x64) windows, with variable-sized windows at boundaries, allowing it to process videos of any length and resolution. SeedVR also incorporates a causal video variational autoencoder (CVVAE) to reduce computational cost while maintaining high reconstruction quality. Through large-scale joint training on images and videos and a multi-stage progressive training strategy, SeedVR excels on a range of video restoration benchmarks, particularly in perceptual quality and speed.
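The variable-sized boundary windows can be pictured with a simple 2-D partitioning routine that lets edge windows shrink instead of padding the input. This is a conceptual sketch; the real model operates on 3-D video tokens and shifts window positions between layers:

```python
import torch

def partition_windows(feat: torch.Tensor, win: int = 64):
    """Split a (H, W, C) feature map into attention windows of size `win`,
    letting boundary windows shrink to whatever remains instead of padding.
    Conceptual illustration of variable-sized boundary windows only.
    """
    H, W, _ = feat.shape
    windows = []
    for top in range(0, H, win):
        for left in range(0, W, win):
            # Slicing past the edge simply yields a smaller window.
            windows.append(feat[top:top + win, left:left + win])
    return windows

wins = partition_windows(torch.randn(150, 200, 16))
print(len(wins), wins[0].shape, wins[-1].shape)
# 12 torch.Size([64, 64, 16]) torch.Size([22, 8, 16])
```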
Mini-Omni is an open-source, end-to-end voice dialogue model that supports real-time voice input and output, enabling speech-to-speech conversation without separate automatic speech recognition (ASR) or text-to-speech (TTS) systems. It employs a text-guided speech generation method, boosting throughput with batch-parallel decoding strategies while preserving the base model's language capabilities. Mini-Omni is designed for applications requiring real-time, natural voice interaction, such as smart assistants, customer service, and smart home control.
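The text-guided idea, loosely, is that each decoding step emits a text token alongside several parallel audio-codebook tokens, so the speech stream is conditioned on the text stream. The toy loop below only gestures at that pattern; the head counts and vocabulary sizes are invented, and this is not Mini-Omni's actual code:

```python
import torch

NUM_CODEBOOKS = 7  # assumed number of parallel audio-token streams

def decode_step(hidden: torch.Tensor, text_head, audio_heads):
    # One text token guides the content; one token per audio codebook
    # is emitted in parallel from the same hidden state.
    text_token = text_head(hidden).argmax(-1)
    audio_tokens = [head(hidden).argmax(-1) for head in audio_heads]
    return text_token, audio_tokens

hidden = torch.randn(1, 512)                              # toy hidden state
text_head = torch.nn.Linear(512, 32000)                   # toy text vocab
audio_heads = [torch.nn.Linear(512, 1024) for _ in range(NUM_CODEBOOKS)]

t, a = decode_step(hidden, text_head, audio_heads)
print(t.item(), [tok.item() for tok in a])
```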
FunAudioLLM is an open-source speech large-model project developed by Alibaba's Tongyi Lab, consisting of two models: SenseVoice and CosyVoice. SenseVoice excels at multilingual speech recognition and emotion detection, supporting over 50 languages, with particularly strong performance in Chinese and Cantonese. CosyVoice focuses on natural speech generation with controllable timbre and emotion, and supports Chinese, English, Japanese, Cantonese, and Korean. FunAudioLLM suits scenarios such as multilingual translation and emotional voice dialogue. The models and code have been open-sourced on the ModelScope and Hugging Face platforms.
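SenseVoice is typically driven through the FunASR toolkit. A minimal transcription sketch, assuming the iic/SenseVoiceSmall checkpoint name and a hypothetical local file meeting_clip.wav:

```python
from funasr import AutoModel

# Checkpoint name assumed from the ModelScope release of SenseVoice.
model = AutoModel(model="iic/SenseVoiceSmall", trust_remote_code=True)

# "auto" lets the model detect the spoken language.
result = model.generate(input="meeting_clip.wav", language="auto", use_itn=True)
print(result[0]["text"])
```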
Phi-3.5 is a cutting-edge AI model series developed by Microsoft, comprising three specialized models: Phi-3.5-mini-instruct, Phi-3.5-MoE-instruct, and Phi-3.5-vision-instruct, optimized respectively for lightweight inference, mixture-of-experts workloads, and multimodal tasks. The series supports a 128k context length, excels at multilingual processing, and strengthens multi-turn dialogue capabilities. It is released under the MIT license and has demonstrated superior performance in benchmark tests against models such as GPT-4o, Llama 3.1, and Gemini Flash.
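All three variants are published on Hugging Face, so the mini model can be tried with standard transformers code. A sketch assuming the microsoft/Phi-3.5-mini-instruct checkpoint:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3.5-mini-instruct"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)

messages = [{"role": "user",
             "content": "Summarize the benefits of a 128k context window."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

out = model.generate(inputs, max_new_tokens=128)
# Print only the model's reply, not the echoed prompt.
print(tokenizer.decode(out[0][inputs.shape[1]:], skip_special_tokens=True))
```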