Engineering Innovations in Emotional Speech Synthesis
CosyVoice brings real-time emotion control based on symbolic tags to speech synthesis for the first time. Its Tokenizer module predefines 8 types of paralinguistic tags, such as [laughter], [cry], and [pause=200ms], and supports prosody adjustment at 50 ms granularity. The technical approach uses multi-level conditional adversarial training:
- Low level (acoustic features): a pitch-contour prediction network models emotional prosody
- Middle level (control): prosody tokens enable cross-lingual emotion transfer
- Upper level (application): open interfaces for semantic-level control, such as [style=happy]
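To make the tag syntax concrete, here is a minimal sketch of how text carrying tags such as [laughter], [pause=200ms], or [style=happy] could be split into plain-text segments and tag objects before synthesis. It is not CosyVoice's actual Tokenizer or API: the names `Tag`, `parse_tagged_text`, and `quantize_pause_ms`, the `[name=value]` grammar, and the rounding of pauses to a 50 ms grid are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Union

# Assumed tag grammar: [name] or [name=value], e.g. [laughter], [pause=200ms].
TAG_RE = re.compile(r"\[(\w+)(?:=([^\]]+))?\]")
PAUSE_STEP_MS = 50  # assumed grid, based on the 50 ms figure quoted in the text


@dataclass
class Tag:
    """Hypothetical container for one symbolic tag."""
    name: str
    value: Optional[str] = None


def quantize_pause_ms(value: str, step: int = PAUSE_STEP_MS) -> int:
    """Round a pause such as '230ms' to the nearest multiple of `step` (half rounds up)."""
    digits = value[:-2] if value.endswith("ms") else value
    ms = int(digits)  # raises ValueError on malformed input
    return (ms + step // 2) // step * step


def parse_tagged_text(text: str) -> List[Union[str, Tag]]:
    """Split text into plain-text segments and Tag objects, preserving order."""
    parts: List[Union[str, Tag]] = []
    pos = 0
    for match in TAG_RE.finditer(text):
        if match.start() > pos:
            parts.append(text[pos:match.start()])
        name, value = match.group(1), match.group(2)
        if name == "pause" and value is not None:
            value = f"{quantize_pause_ms(value)}ms"
        parts.append(Tag(name, value))
        pos = match.end()
    if pos < len(text):
        parts.append(text[pos:])
    return parts


if __name__ == "__main__":
    for part in parse_tagged_text("[style=happy]Nice to see you[laughter] wait[pause=230ms]okay"):
        print(part)
```

A downstream synthesizer would then consume the resulting sequence, treating strings as text to speak and `Tag` objects as control events. How CosyVoice itself maps tags to tokens is not specified in this text.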
The empirical data shows that adding the [laughter] tag improves the pleasantness score of synthesized speech by 42%, and that pause-marking error stays within ±10 ms. This feature has been applied to game NPC dialogue systems, where it reduces annotation cost by 90% compared with traditional emotional speech synthesis approaches.
This answer comes from the article "CosyVoice: Ali open source multilingual cloning and generation tools".