CosyVoice: Alibaba's open source multilingual voice cloning and generation tool
CosyVoice is an open source multilingual speech generation model that focuses on high-quality text-to-speech (TTS) technology. It supports speech synthesis in multiple languages, providing features such as zero-shot speech generation, cross-language voice cloning, and fine-grained emotion control. Compared to the previous version, CosyVoice 2.0 significantly...
Qwen-TTS: Speech Synthesis Tool with Chinese Dialect and Bilingual Support
Qwen-TTS is a text-to-speech (TTS) tool developed by the Alibaba Cloud Qwen team and provided through the Qwen API. It is trained on a large-scale speech dataset and produces natural, expressive voice output that automatically adjusts intonation, speech rate, and emotion. Qwen-TTS supports Mandarin, English...
Kyutai: Real-time speech-to-text conversion tool
Kyutai Labs' delayed-streams-modeling project is an open source speech-to-text conversion framework built on Delayed Streams Modeling (DSM) technology. It supports real-time speech-to-text (STT) and text-to-speech (TTS) functions, suitable for building efficient voice interaction applications. The project provides p...
MiniMax Speech 02
With the continuous evolution of AI technologies, personalized and highly natural voice interaction has become a key requirement for many intelligent applications. However, existing text-to-speech (TTS) technologies still struggle to deliver personalized voices at scale, multilingual coverage, and highly realistic emotional expression. To address these line...
Baichuan-Audio
Baichuan-Audio is an open source project developed by Baichuan Intelligence (baichuan-inc), hosted on GitHub, focusing on end-to-end voice interaction technology. The project provides a complete audio processing framework that transforms speech input into discrete audio tokens, and then uses a large model to generate a pair of ...
Step-Audio
Step-Audio is an open source intelligent speech interaction framework designed to provide out-of-the-box speech understanding and generation capabilities for production environments. The framework supports multilingual dialogue (e.g., Chinese, English, Japanese), emotional speech (e.g., happy, sad), regional dialects (e.g., Cantonese, Sichuanese), adjustable speech rate...