AI Models


All Models: a complete list of AI models and foundation models, sorted by newest first

Parler-TTS
Parler-TTS by Hugging Face

Parler-TTS is an open-source text-to-speech (TTS) model developed by Hugging Face. It generates high-quality, natural-sounding speech whose speaker characteristics (gender, pitch, speaking style, etc.) are controlled by a natural-language description prompt. The model is fully open-source, including datasets, preprocessing, training code, and weights, promoting innovation in high-quality, controllable TTS models. Its architecture is based on MusicGen, integrating text encoders, decoders, and audio codecs to optimize voice generation through text descriptions and embedding layers.

Text-to-Speech Open Source AI Model Natural Language Processing Voice Generation Customizable High-Quality Audio Developer Tools Ethical AI Speech Synthesis
Text-to-Speech production Open Source
Qwen2-Audio
Qwen2-Audio by Alibaba's Tongyi Qianwen Team

Qwen2-Audio is an open-source AI speech model developed by Alibaba's Tongyi Qianwen team. It supports direct voice input and multilingual text output, enabling features like voice chat, audio analysis, and support for over 8 languages. The model excels in performance on benchmark datasets and is integrated into Hugging Face's transformers library, making it accessible for developers. It also supports fine-tuning for specific applications.

AI Speech Model Multilingual Open-Source Voice Chat Audio Analysis Text Generation Hugging Face Fine-Tuning Developer Tools Natural Language Processing
multimodal production Open Source
MiniCPM 3.0
MiniCPM 3.0 by FaceWall Intelligence

MiniCPM 3.0 is a high-performance edge AI model developed by FaceWall Intelligence, featuring 4B parameters. Despite its smaller size, it surpasses GPT-3.5 in performance. The model utilizes LLMxMapReduce technology to support infinite-length text processing, effectively expanding its contextual understanding capabilities. In Function Calling, MiniCPM 3.0 performs close to GPT-4o, demonstrating excellent edge-side execution capabilities. The model also includes the RAG trio (retrieval, re-ranking, and generation models), significantly improving Chinese retrieval and content generation quality. MiniCPM 3.0 is fully open-source, with the quantized model occupying only 2GB of memory, making it ideal for edge-side deployment while ensuring data security and privacy.

AI Model Edge Computing Open Source Natural Language Processing Function Calling Text Processing Chinese Retrieval Content Generation Data Security Privacy Protection
language production Open Source
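The LLMxMapReduce approach described for MiniCPM 3.0 can be sketched as a chunked map-reduce pass over text longer than the model's context window. The sketch below is an illustrative assumption, not MiniCPM's actual implementation; `summarize` is a toy stand-in for a call to the model.

```python
# Minimal sketch of a MapReduce-style pass over arbitrarily long text.
# `summarize` is a hypothetical stand-in for an LLM call (e.g. to
# MiniCPM 3.0); here it just keeps the first sentence of each chunk.

def summarize(text: str) -> str:
    """Toy 'map' step: stand-in for an LLM summarization call."""
    return text.split(". ")[0].strip()

def chunk(text: str, max_chars: int) -> list[str]:
    """Split text into fixed-size chunks that fit the model window."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

def map_reduce_summary(text: str, max_chars: int = 200) -> str:
    """Map: summarize each chunk. Reduce: join the partial summaries
    and recurse until the combined text fits in one window."""
    if len(text) <= max_chars:
        return summarize(text)
    partials = [summarize(c) for c in chunk(text, max_chars)]
    combined = ". ".join(partials)
    if len(combined) >= len(text):  # no progress: stop recursing
        return summarize(combined[:max_chars])
    return map_reduce_summary(combined, max_chars)
```

Because each reduce step shrinks the text, the recursion depth grows only logarithmically with input length, which is why the pattern scales to effectively unbounded inputs.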
Marco
Marco by Alibaba International

Marco is a large-scale commercial translation model developed by Alibaba International, supporting 15 global languages including Chinese, English, Japanese, Korean, Spanish, and French. It excels in context-based translation, outperforming competitors such as Google Translate, DeepL, and GPT-4 on BLEU scores. Marco uses advanced multilingual data filtering and parameter expansion techniques to ensure high-quality translations and reduce service costs. It is optimized for cross-border e-commerce, offering precise translations for product titles, descriptions, and customer interactions. Marco is available on Alibaba's AI platform, Aidge, and is designed for large-scale commercial use.

Translation AI Multilingual E-commerce Language Processing Context-Based Translation Cross-Border Commerce Large Language Model BLEU Evaluation Alibaba
language production
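The BLEU metric cited in the Marco entry compares machine output against a reference translation using n-gram overlap. The following is a simplified sentence-level illustration of the idea; production evaluations use corpus-level BLEU-4 with smoothing (e.g. sacrebleu), not this toy version.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=2):
    """Simplified sentence-level BLEU: geometric mean of modified
    n-gram precisions (n = 1..max_n) times a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # "Modified" precision: clip each n-gram count by the reference count.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        precisions.append(overlap / max(sum(cand_counts.values()), 1))
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty discourages translations shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * geo_mean
```

A perfect match scores 1.0; a candidate with correct words but missing content is penalized by the brevity term rather than rewarded for high precision alone.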
SeedVR
SeedVR by Nanyang Technological University, ByteDance

SeedVR is a diffusion transformer model developed by Nanyang Technological University and ByteDance, capable of high-quality universal video restoration. It introduces a shifted window attention mechanism, using large (64x64) windows and variable-sized windows at boundaries, effectively processing videos of any length and resolution. SeedVR combines a causal video variational autoencoder (CVVAE) to reduce computational costs while maintaining high reconstruction quality. Through large-scale joint training of images and videos and a multi-stage progressive training strategy, SeedVR excels in various video restoration benchmarks, particularly in perceptual quality and speed.

Video Restoration Diffusion Transformer AI Model High-Quality Video Universal Restoration Swin-MMDiT CVVAE Perceptual Quality Efficient Processing Realistic Details
vision production
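The variable-sized boundary windows described for SeedVR can be illustrated with a simple partitioning routine: interior windows are fixed-size, while windows at the right and bottom edges shrink to cover whatever remains, so any resolution tiles exactly. This is a conceptual sketch of the partitioning step only (the shifted-window attention itself is not shown), not SeedVR's actual code.

```python
def partition_windows(height, width, win=64):
    """Split a frame into non-overlapping attention windows.

    Interior windows are win x win; windows touching the right/bottom
    boundary shrink to the remaining extent, so every pixel is covered
    exactly once regardless of resolution. Returns (y0, y1, x0, x1)
    spans for each window.
    """
    windows = []
    for y0 in range(0, height, win):
        for x0 in range(0, width, win):
            windows.append((y0, min(y0 + win, height),
                            x0, min(x0 + win, width)))
    return windows
```

In a shifted-window scheme, alternate layers would offset the grid by half a window before partitioning, letting information flow across window borders.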
Mini-Omni
Mini-Omni by gpt-omni

Mini-Omni is an open-source, end-to-end voice dialogue model that supports real-time voice input and output, allowing for seamless voice-to-voice dialogue without the need for additional Automatic Speech Recognition (ASR) or Text-to-Speech (TTS) systems. It employs a text-guided voice generation method, enhancing performance through batch parallel strategies while maintaining the original model's language capabilities. Mini-Omni is designed for applications requiring real-time, natural voice interactions, such as smart assistants, customer service, and smart home control.

Voice Dialogue Real-Time Interaction Open-Source Multimodal AI Voice Technology AI Models Conversational AI Text-Guided Generation Speech Synthesis Cross-Modal Understanding
multimodal experimental Open Source
FunAudioLLM
FunAudioLLM by Alibaba Tongyi Lab

FunAudioLLM is an open-source speech large model project developed by Alibaba Tongyi Lab, consisting of two models: SenseVoice and CosyVoice. SenseVoice excels in multilingual speech recognition and emotion detection, supporting over 50 languages, with particularly strong performance in Chinese and Cantonese. CosyVoice focuses on natural speech generation, capable of controlling tone and emotion, and supports Chinese, English, Japanese, Cantonese, and Korean. FunAudioLLM is suitable for scenarios such as multilingual translation and emotional voice dialogue. The related models and code have been open-sourced on the Modelscope and Huggingface platforms.

Speech Recognition Speech Synthesis Multilingual Emotion Detection Open Source AI Models Natural Language Processing Voice Interaction Machine Learning Developer Tools
speech production Open Source
Phi-3.5
Phi-3.5 by Microsoft

Phi-3.5 is a cutting-edge AI model series developed by Microsoft, comprising three specialized models: Phi-3.5-mini-instruct, Phi-3.5-MoE-instruct, and Phi-3.5-vision-instruct. These models are optimized for lightweight inference, mixture-of-experts scaling, and multimodal tasks, respectively. The series supports a 128k context length, excels in multilingual processing, and enhances multi-turn dialogue capabilities. It is licensed under the MIT open-source license and has demonstrated superior performance in benchmark tests against models like GPT-4o, Llama 3.1, and Gemini Flash.

AI Models Machine Learning Natural Language Processing Multimodal AI Lightweight Inference Expert Mixture Systems Multilingual Processing Multi-turn Dialogue Open Source Benchmark Performance
multimodal production Open Source
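The mixture-of-experts design behind Phi-3.5-MoE can be sketched as a gating network that scores all experts but runs only the top-k, combining their outputs with renormalized weights. The toy layer below illustrates that routing idea with NumPy; it is not Microsoft's implementation, and the gate/expert shapes are illustrative assumptions.

```python
import numpy as np

def moe_forward(x, gate_w, experts, top_k=2):
    """Toy mixture-of-experts layer.

    x: (d,) input vector; gate_w: (n_experts, d) gating matrix;
    experts: list of callables mapping (d,) -> (d,).
    Only the top_k highest-scoring experts are evaluated, which is
    what lets MoE models grow parameters without growing per-token
    compute proportionally.
    """
    scores = gate_w @ x                        # one score per expert
    top = np.argsort(scores)[-top_k:]          # indices of the top-k experts
    weights = np.exp(scores[top] - scores[top].max())
    weights /= weights.sum()                   # softmax over selected experts only
    return sum(w * experts[i](x) for w, i in zip(weights, top))
```

With top_k=2 out of, say, 16 experts, each token touches only the gate plus two expert MLPs even though all 16 hold trained parameters.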
Gen-3 Alpha
Gen-3 Alpha by Runway

Gen-3 Alpha is the latest AI video generation model developed by Runway, an AI video startup. It significantly improves video fidelity, consistency, and dynamic performance through large-scale multimodal training infrastructure. The model can generate detailed, smooth, high-fidelity video clips up to 10 seconds long, supports text-to-video and image-to-video transformations, and offers precise temporal control and various advanced control modes, providing a powerful tool for artists and creative professionals.

AI Video Generation Multimodal Training Creative Tools Text-to-Video Image-to-Video High-Fidelity Video Temporal Control Character Generation Stylization Creative Professionals
multimodal production
F5-TTS
F5-TTS by Shanghai Jiao Tong University

F5-TTS is an open-source, high-performance text-to-speech (TTS) system developed by Shanghai Jiao Tong University. It utilizes flow matching and diffusion transformer (DiT) technology to generate natural, fluent, and accurate speech without additional supervision. The system supports zero-shot learning, multi-language synthesis (including Chinese and English), and effective long-text synthesis. F5-TTS also features emotion control and speed control, allowing users to adjust the emotional expression and playback speed of the synthesized speech. Trained on a large-scale dataset of 100,000 hours, F5-TTS demonstrates excellent performance and generalization capabilities. It is widely applicable in scenarios such as audiobooks, voice assistants, language learning, news broadcasting, and game dubbing, providing robust speech synthesis for both commercial and non-commercial purposes.

Text-to-Speech AI Open Source Voice Synthesis Natural Language Processing Speech Synthesis Zero-shot Learning Multi-language Support Emotion Control Speed Control
language production Open Source
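The flow matching technique named in the F5-TTS entry trains a network to predict the velocity of a straight path from noise to data. The sketch below shows the generic conditional flow-matching objective that family of models builds on; it is not F5-TTS's actual training code, and `predict_v` stands in for the diffusion-transformer network.

```python
import numpy as np

def flow_matching_pair(x0, x1, t):
    """Conditional flow matching: interpolate between noise x0 and
    data x1 at time t, and return the interpolant together with the
    constant target velocity the network is trained to regress."""
    xt = (1.0 - t) * x0 + t * x1   # point on the straight path at time t
    v_target = x1 - x0             # velocity of that path (d xt / d t)
    return xt, v_target

def flow_matching_loss(predict_v, x0, x1, t):
    """Mean-squared error between the network's predicted velocity
    (predict_v is a stand-in for the model) and the target velocity."""
    xt, v_target = flow_matching_pair(x0, x1, t)
    return float(np.mean((predict_v(xt, t) - v_target) ** 2))
```

At inference, the learned velocity field is integrated from t=0 (noise) to t=1 to produce a sample, which is what lets such models generate speech without step-by-step diffusion denoising schedules.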