Voxtral: an AI model developed by Mistral AI for speech transcription and understanding
Voxtral is the first open audio model from French AI startup Mistral AI, released on July 15, 2025. It aims to give commercial applications out-of-the-box speech understanding for production environments at a highly competitive price. The Voxtral model is available in two versions for ....
CosyVoice: Alibaba's open source multilingual voice cloning and generation tool
CosyVoice is an open source multilingual speech generation model that focuses on high-quality text-to-speech (TTS) technology. It supports speech synthesis in multiple languages and provides features such as zero-shot speech generation, cross-lingual voice cloning, and fine-grained emotion control. Compared to the previous version, CosyVoice 2.0 significantly...
Qwen-TTS: Speech Synthesis Tool with Chinese Dialect and Bilingual Support
Qwen-TTS is a text-to-speech (TTS) tool developed by the Alibaba Cloud Qwen team and provided through the Qwen API. It is trained on a large-scale speech dataset and produces natural, expressive voice output that automatically adjusts intonation, speech rate, and emotion. Qwen-TTS supports Mandarin, English...
Kyutai: Real-time speech-to-text conversion tool
Kyutai Labs' delayed-streams-modeling project is an open source speech-to-text conversion framework built on Delayed Stream Modeling (DSM) technology. It supports real-time speech-to-text (STT) and text-to-speech (TTS) functions, making it suitable for building efficient voice interaction applications. The project provides p...
MiniMax Speech 02
With the continuous evolution of AI technologies, personalized and highly natural voice interaction has become a key requirement for many intelligent applications. However, existing text-to-speech (TTS) technologies still struggle to deliver personalized voices at scale, multilingual coverage, and highly realistic emotional expression. To address these line...
AssemblyAI: High-precision Speech-to-Text and Audio Intelligence Analysis Platform
AssemblyAI is a platform focused on speech AI technology, providing developers and enterprises with efficient speech-to-text and audio analysis tools. Its core highlight is the Universal family of models, especially the newly released Universal-2, which is AssemblyAI's most advanced speech-to-text...
Baichuan-Audio
Baichuan-Audio is an open source project developed by Baichuan Intelligence (baichuan-inc), hosted on GitHub, focusing on end-to-end voice interaction technology. The project provides a complete audio processing framework that can transform speech input into discrete audio tokens and then use a large model to generate a pair of ...
Step-Audio
Step-Audio is an open source intelligent speech interaction framework designed to provide out-of-the-box speech understanding and generation capabilities for production environments. The framework supports multilingual dialogue (e.g., Chinese, English, Japanese), emotional speech (e.g., happy, sad), regional dialects (e.g., Cantonese, Sichuanese), adjustable speech rate...
Parler-TTS: Generating speech in a specific speaker style from input text
Parler-TTS is an open-source text-to-speech (TTS) modeling library developed by Hugging Face, designed to generate high-quality, natural-sounding speech. The model can generate speech in a specific speaker style (e.g., gender, pitch, speaking style) based on the input text. Parler-TTS is based on the paper .....