Background to the issue
Speech generated by traditional TTS systems often lacks emotional variation. CosyVoice addresses this with a fine-grained, tag-based emotion control system.
Implementation approaches
- Insert standard emotion tags: place tags such as [laughter] and [pause] directly in the text, for example '他突然[laughter]停下来,因为被逗笑了[laughter]' ("He suddenly [laughter] stopped, because he had been made to laugh [laughter]")
- Use instruction control: the inference_instruct2 method sets the overall emotional style through a natural-language instruction, for example '用欢快的语气说这段话' ("Say this passage in a cheerful tone")
- Prosody enhancement: enabling the --use_prosody parameter during training improves the naturalness of stress and intonation
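The first two approaches can be prepared together before calling the model. The following is a minimal sketch, not CosyVoice code: `find_tags`, `unknown_tags`, and `build_request` are hypothetical helpers, and `KNOWN_TAGS` holds only the two tags named above rather than the project's full list. It checks the tags embedded in the text and pairs the text with a style instruction.

```python
import re

# Illustrative subset only: the two tags named in this article.
# The full set is defined in the project's tokenizer.
KNOWN_TAGS = frozenset({"[laughter]", "[pause]"})

# Matches lowercase bracketed tags such as [laughter].
TAG_PATTERN = re.compile(r"\[[a-z_]+\]")


def find_tags(text):
    """Return every bracketed tag in the order it appears in the text."""
    return TAG_PATTERN.findall(text)


def unknown_tags(text, known=KNOWN_TAGS):
    """Return the tags in the text that are not in the known set."""
    return [tag for tag in find_tags(text) if tag not in known]


def build_request(tts_text, instruct_text):
    """Pair tagged text with a style instruction, rejecting unknown tags."""
    bad = unknown_tags(tts_text)
    if bad:
        raise ValueError(f"unrecognized tags: {bad}")
    return {"tts_text": tts_text, "instruct_text": instruct_text}
```

The resulting pair would then be handed to inference_instruct2 together with a reference voice prompt. The exact arguments of that method should be checked against the CosyVoice repository rather than taken from this sketch.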
Advanced Techniques
1. Combine tags and instructions for more expressive delivery
2. See line 248 of tokenizer.py for the full list of tags
3. For film and TV dubbing, align the emotion tags with the audio timeline.
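One way to approach the timeline alignment in item 3 is to start from timed subtitle cues and a list of timestamped emotion marks. The sketch below is only an illustration under that assumption: `Cue`, `EmotionMark`, and `attach_tags` are hypothetical names, not part of CosyVoice, and each tag is simply appended to the end of the cue that contains its timestamp.

```python
from dataclasses import dataclass


@dataclass
class Cue:
    start: float  # cue start time in seconds
    end: float    # cue end time in seconds (exclusive)
    text: str     # line to be synthesized


@dataclass
class EmotionMark:
    time: float   # moment on the audio timeline, in seconds
    tag: str      # emotion tag, e.g. "[laughter]"


def attach_tags(cues, marks):
    """Append each tag to the cue whose [start, end) span contains its time.

    Returns the tagged texts (one per cue) and the marks that fell
    outside every cue, so they can be reviewed by hand.
    """
    tagged = [cue.text for cue in cues]
    unplaced = []
    for mark in sorted(marks, key=lambda m: m.time):
        for i, cue in enumerate(cues):
            if cue.start <= mark.time < cue.end:
                tagged[i] += mark.tag
                break
        else:
            # No cue covers this timestamp.
            unplaced.append(mark)
    return tagged, unplaced
```

Appending at the end of the cue is a coarse placement. Putting a tag at a precise word position would need word-level timestamps, which this sketch does not assume.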
This answer comes from the article "CosyVoice: Ali open source multilingual cloning and generation tools".