Background
Speech generated by traditional TTS systems often lacks emotional variation. CosyVoice addresses this with a fine-grained emotion control tag system.
Implementation approaches
- Insert standard emotion tags: place tags such as [laughter] and [pause] directly in the text, for example '他突然[laughter]停下来,因为被逗笑了[laughter]' ("He suddenly [laughter] stopped, because he was amused [laughter]").
- Use instruction control: the inference_instruct2 method sets the overall emotional style through a natural-language instruction, for example '用欢快的语气说这段话' ("Say this passage in a cheerful tone").
- Prosody enhancement: enable the --use_prosody parameter during training to improve the naturalness of stress and intonation.
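The first two approaches can be sketched together in Python. This is a minimal illustration, not the project's own code: `KNOWN_TAGS`, `find_unknown_tags`, and `synthesize_with_style` are hypothetical names, the tag set contains only the two tags mentioned above, and the `model` argument is assumed to expose `inference_instruct2(tts_text, instruct_text, prompt_speech_16k, stream=...)` as a generator of audio chunks.

```python
import re

# Assumption: only the two tags named in the text above. The real tag
# inventory is defined by the project's tokenizer and may differ.
KNOWN_TAGS = {"[laughter]", "[pause]"}

# Matches bracketed ASCII tags such as [laughter].
TAG_PATTERN = re.compile(r"\[[A-Za-z_]+\]")


def find_unknown_tags(text):
    """Return bracketed tags in `text` that are not in KNOWN_TAGS, in order."""
    return [tag for tag in TAG_PATTERN.findall(text) if tag not in KNOWN_TAGS]


def synthesize_with_style(model, tagged_text, instruct_text, prompt_speech_16k):
    """Combine inline emotion tags with an overall style instruction.

    `model` is assumed to provide inference_instruct2 with the argument
    order used below and to yield one dict per generated audio chunk.
    """
    unknown = find_unknown_tags(tagged_text)
    if unknown:
        # Fail early instead of letting a typo be read aloud as text.
        raise ValueError(f"unrecognized tags: {unknown}")
    return list(
        model.inference_instruct2(
            tagged_text, instruct_text, prompt_speech_16k, stream=False
        )
    )
```

Checking tags before synthesis is a design choice for this sketch: a misspelled tag would otherwise pass through as ordinary text. With a loaded model, a call might look like `synthesize_with_style(model, '他突然[laughter]停下来', '用欢快的语气说这段话', prompt_speech_16k)`.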
Advanced Techniques
1. Combine tags and instructions for more expressive output.
2. See line 248 of tokenizer.py for the full list of tags.
3. For film and TV dubbing, align the emotion tags with the audio timeline.
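For the dubbing case, one way to align tags with a timeline is to keep emotion cues as timestamps and attach each one to the subtitle segment that contains it. The sketch below is only an illustration under that assumption: `Segment`, `EmotionCue`, and `attach_cues` are hypothetical names, and the alignment is coarse, since each tag is appended to the end of its segment rather than placed at an exact word.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    start: float  # segment start, in seconds
    end: float    # segment end, in seconds (exclusive)
    text: str     # line to be synthesized


@dataclass
class EmotionCue:
    time: float   # position on the audio timeline, in seconds
    tag: str      # e.g. "[laughter]"


def attach_cues(segments: List[Segment], cues: List[EmotionCue]) -> List[str]:
    """Append each cue's tag to the segment whose [start, end) contains it.

    Cues that fall outside every segment are ignored.
    """
    tagged = [seg.text for seg in segments]
    # Process cues in time order so multiple tags in one segment stay ordered.
    for cue in sorted(cues, key=lambda c: c.time):
        for i, seg in enumerate(segments):
            if seg.start <= cue.time < seg.end:
                tagged[i] += cue.tag
                break
    return tagged


segments = [Segment(0.0, 2.0, "他突然停下来"), Segment(2.0, 4.5, "因为被逗笑了")]
cues = [EmotionCue(3.0, "[laughter]"), EmotionCue(1.5, "[pause]")]
tagged_lines = attach_cues(segments, cues)
```

Each string in `tagged_lines` can then be synthesized separately and placed at its segment's start time.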
This answer comes from the article "CosyVoice: Ali open source multilingual cloning and generation tools".































