CosyVoice is a multilingual speech generation model open-sourced by Alibaba, focused on high-quality text-to-speech (TTS). Its core features include:
- Zero-shot speech generation: Generates speech that resembles a target voice from a short audio sample, without additional training.
- Cross-lingual speech synthesis: Supports multilingual speech generation while keeping the voice timbre consistent.
- Fine-grained emotional control: Emotional expression tags such as laughter and pauses can be added to generate more natural speech.
- Dialect and accent adjustment: Support for generating speech in specific dialects or accents such as Sichuanese.
- Streaming speech synthesis: Low-latency output, with first-packet latency as low as 150 ms.
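First-packet latency, the figure quoted for streaming synthesis above, is the time between submitting text and receiving the first audio chunk. The sketch below shows one way to measure it for any chunk-yielding stream. It does not call CosyVoice: `fake_streaming_tts` is a hypothetical stand-in that emits placeholder bytes, and `measure_first_packet_latency` is an illustrative helper name, not part of any library.

```python
import time
from typing import Iterable, Iterator, Tuple


def fake_streaming_tts(text: str, chunk_delay_s: float = 0.01) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming synthesizer (not the CosyVoice API)."""
    for word in text.split():
        time.sleep(chunk_delay_s)   # simulate per-chunk synthesis time
        yield word.encode("utf-8")  # placeholder bytes, not real audio


def measure_first_packet_latency(chunks: Iterable[bytes]) -> Tuple[float, int]:
    """Return (seconds until the first chunk arrived, total number of chunks)."""
    start = time.perf_counter()
    first_latency = None
    count = 0
    for _ in chunks:
        if first_latency is None:
            # The first chunk has just arrived: record the elapsed time once.
            first_latency = time.perf_counter() - start
        count += 1
    if first_latency is None:
        raise ValueError("stream produced no chunks")
    return first_latency, count


if __name__ == "__main__":
    latency, n = measure_first_packet_latency(fake_streaming_tts("hello streaming world"))
    print(f"first packet after {latency * 1000:.1f} ms, {n} chunks in total")
```

To measure a real model, the stand-in generator would be replaced by whatever iterator the synthesizer's streaming mode returns; the timing logic stays the same.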
The tool's main advantage is its high output quality: it reports a MOS score of 5.53, close to commercial-grade systems, and reduces pronunciation errors by 30%-50% compared with the previous version.
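To make the 30%-50% figure concrete, the relative reduction can be applied to a baseline error rate. The sketch below is only arithmetic: the baseline value in the usage note is hypothetical, since the text above does not give the previous version's error rate, and `reduced_error_range` is an illustrative helper name.

```python
from typing import Tuple


def reduced_error_range(previous_error_rate: float,
                        low: float = 0.30,
                        high: float = 0.50) -> Tuple[float, float]:
    """Return (best_case, worst_case) error rates after a relative reduction.

    `low` and `high` are the bounds of the relative reduction (30% and 50%).
    """
    if not 0.0 <= low <= high <= 1.0:
        raise ValueError("expected 0 <= low <= high <= 1")
    best_case = previous_error_rate * (1.0 - high)   # largest reduction
    worst_case = previous_error_rate * (1.0 - low)   # smallest reduction
    return best_case, worst_case
```

For example, a hypothetical previous error rate of 2.0% would fall to between 1.0% and 1.4% under a 30%-50% relative reduction.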
This answer comes from the article "CosyVoice: Ali open source multilingual cloning and generation tools".