VibeVoice-1.5B: A Speech Generation Model from Microsoft for Long-Form Multi-Speaker Conversations
VibeVoice-1.5B is a cutting-edge open source text-to-speech (TTS) model released by Microsoft Research. It is designed specifically for generating expressive, long-form, multi-speaker dialog audio such as podcasts and audiobooks. The core innovation of VibeVoice is its use of a 7...
MiniMax Releases Speech 2.5: Speech Synthesis Breaks Through in Multilingual Capability and Voice Cloning
On August 7, MiniMax announced Speech 2.5, a next-generation speech generation model that, according to official data, improves on its predecessor Speech 02 in multilingual expressiveness, voice cloning accuracy, and the number of supported languages. In the field of Artificial Intelligence Generated Content (AIGC)...
KittenTTS: A Lightweight Text-to-Speech Model
KittenTTS is an open source text-to-speech (TTS) model focused on being lightweight and efficient. It takes up less than 25MB of storage, has about 15 million parameters, and runs on low-end devices without GPU support. Developed by the KittenML team, KittenTTS offers multiple...
SongGeneration: An Open Source AI Model for Generating High-Quality Music and Lyrics
SongGeneration is a music generation model developed and open-sourced by Tencent AI Lab, focused on generating high-quality songs, including lyrics, accompaniment, and vocals. It is based on the LeVo framework, which combines the LeLM language model with music codecs, and supports song generation in English and Chinese. The model is trained on a dataset of millions of songs...
OpusLM_7B_Anneal: An Efficient Unified Model for Speech Recognition and Synthesis
OpusLM_7B_Anneal is an open source speech processing model developed by the ESPnet team and hosted on the Hugging Face platform. It covers a variety of tasks, including speech recognition, text-to-speech, speech translation, and speech enhancement, and is suited to researchers and developers experimenting and building applications in speech processing. The model...
Magenta RealTime: An Open Source Model for Generating Music in Real Time
Magenta RealTime (Magenta RT for short) is an open source music generation model developed by Google DeepMind that focuses on real-time music creation. It is an open source version of Lyria RealTime and supports generating high-quality music clips from text or audio prompts. The model is based on 80...
MOSS-TTSD: An Open Source Bilingual Dialog Speech Generation Tool
MOSS-TTSD is an open source dialog speech generation model that supports both Chinese and English. It can convert two-person dialog text into natural, expressive speech, making it suitable for AI podcast production, language research, and other scenarios. The model is based on low-bitrate codec technology and supports zero-shot two-speaker voice cloning and...
Higgs Audio: An Open Source Tool for Generating High-Quality Speech and Multi-Character Conversations
Higgs Audio is an open source text-to-speech (TTS) project developed by Boson AI that focuses on generating high-quality, emotionally rich speech and multi-character dialog. The project is trained on over 10 million hours of audio data and supports zero-shot voice cloning, natural dialog generation, and multilingual speech output...
Voxtral: An AI Model Developed by Mistral AI for Speech Transcription and Understanding
Voxtral is the first open audio model from French AI startup Mistral AI, released on July 15, 2025. It aims to provide commercial applications with production-ready, out-of-the-box speech understanding capabilities at a highly competitive price. The Voxtral model is available in two versions for...
CosyVoice: Alibaba's Open Source Multilingual Voice Cloning and Generation Tool
CosyVoice is an open source multilingual speech generation model focused on high-quality text-to-speech (TTS) technology. It supports speech synthesis in multiple languages, providing features such as zero-shot speech generation, cross-lingual voice cloning, and fine-grained emotion control. Compared with the previous version, CosyVoice 2.0 significantly...
Qwen-TTS: Speech Synthesis Tool with Chinese Dialect and Bilingual Support
Qwen-TTS is a text-to-speech (TTS) tool developed by the Alibaba Cloud Qwen team and provided through the Qwen API. It is trained on a large-scale speech dataset and produces natural, expressive voice output, automatically adjusting intonation, speaking rate, and emotion. Qwen-TTS supports Mandarin, English...
Kyutai: A Real-Time Speech-to-Text Conversion Tool
Kyutai Labs' delayed-streams-modeling project is an open source speech-to-text framework built around Delayed Streams Modeling (DSM) technology. It supports real-time speech-to-text (STT) and text-to-speech (TTS), making it suitable for building efficient voice interaction applications. The project provides p...
MiniMax Speech 02
With the continuous evolution of AI technology, personalized and highly natural voice interaction has become a key requirement for many intelligent applications. However, existing text-to-speech (TTS) technologies still struggle to deliver large-scale personalized voices, broad multilingual coverage, and highly realistic emotional expression. To address these line...
Muyan-TTS: Personalized Podcast Speech Training and Synthesis
Muyan-TTS is an open source text-to-speech (TTS) model designed for podcasting scenarios. It is pre-trained on over 100,000 hours of podcast audio data and supports zero-shot speech synthesis to generate high-quality, natural speech. The model is built on Llama-3.2-3B and, combined with the SoVITS decoder, provides high...
Kimi-Audio: An Open Source Audio Processing and Dialog Foundation Model
Kimi-Audio is an open source audio foundation model developed by Moonshot AI that focuses on audio understanding, generation, and dialog. It supports a variety of audio processing tasks, such as speech recognition, audio question answering, and speech emotion recognition. The model has been pre-trained on over 13 million hours of audio data, combined with innovative...
Orpheus-TTS: Text-to-Speech Tool for Generating Natural Chinese Speech
Orpheus-TTS is an open source text-to-speech (TTS) system built on the Llama-3B architecture with the goal of generating audio close to natural human speech. Launched by the Canopy Labs team, it supports multiple languages, including English, Spanish, French, German, Italian, Portuguese, and Chinese...
MegaTTS3: A Lightweight Model for Synthesizing Chinese and English Speech
MegaTTS3 is an open source speech synthesis tool developed by ByteDance in cooperation with Zhejiang University, focusing on generating high-quality Chinese and English speech. Its core model has only 0.45B parameters, making it lightweight and efficient, and it supports mixed Chinese-English speech generation and voice cloning. The project is hosted on GitHub, providing code and...
IndexTTS: A Text-to-Speech Tool with Mixed Chinese-English Support
IndexTTS is an open source text-to-speech (TTS) tool hosted on GitHub and developed by the index-tts team. Based on XTTS and Tortoise technologies, it delivers efficient, high-quality speech synthesis through an improved module design. IndexTTS uses tens of thousands of hours...
AssemblyAI: High-Precision Speech-to-Text and Audio Intelligence Analysis Platform
AssemblyAI is a platform focused on speech AI technology, providing developers and enterprises with efficient speech-to-text and audio analysis tools. Its core highlight is the Universal family of models, especially the newly released Universal-2, which is AssemblyAI's most advanced speech-to-text...