Problem analysis
Dialect synthesis has two core problems: missing phonemes and disrupted prosody (rhythm). CosyVoice 2.0 reduces the pronunciation error rate by 30-50% with the following approach.
Solution
- Use the dialect instruction mode: specify the dialect type explicitly in the instruction text, e.g. '用四川话说这句话' ("Say this sentence in Sichuanese")
- Customize the phoneme set: extend config.yaml with dialect-specific phonemes, such as the alveolo-palatal nasal ȵ of Sichuanese
- Data augmentation: train on a mix of standard and dialect corpora; a ratio of 4:1 is recommended
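The first and third points can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names dialect_instruction and mix_corpora, the DIALECT_INSTRUCTIONS table, and the reading of 4:1 as standard:dialect are hypothetical and not taken from CosyVoice itself.

```python
import random

# Hypothetical lookup table; only the Sichuanese instruction appears in the article.
DIALECT_INSTRUCTIONS = {
    "sichuanese": "用四川话说这句话",  # "Say this sentence in Sichuanese"
}


def dialect_instruction(dialect: str) -> str:
    """Return the instruction text for a dialect, or raise if none is defined."""
    try:
        return DIALECT_INSTRUCTIONS[dialect.lower()]
    except KeyError:
        raise ValueError(f"no instruction text defined for {dialect!r}") from None


def mix_corpora(standard, dialect, ratio=(4, 1), seed=0):
    """Build a shuffled training list with ratio[0] standard utterances
    for every ratio[1] dialect utterances (assumed reading of "4:1")."""
    std_n, dia_n = ratio
    # Number of complete ratio groups that both corpora can supply.
    groups = min(len(standard) // std_n, len(dialect) // dia_n)
    rng = random.Random(seed)  # fixed seed keeps the mix reproducible
    mixed = rng.sample(standard, groups * std_n) + rng.sample(dialect, groups * dia_n)
    rng.shuffle(mixed)
    return mixed
```

In the CosyVoice repository's examples, an instruction string like this is passed as the instruct text argument of inference_instruct2; check the current README before relying on that signature. mix_corpora discards any utterances that do not fit into a complete 4:1 group, which keeps the ratio exact at the cost of a few samples.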
Implementation steps
1. Prefer CosyVoice2-0.5B as the base model
2. Collect at least 2 hours of clean corpus in the target dialect
3. Set the --dialect_weight=0.3 parameter when fine-tuning
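Before fine-tuning, it can help to confirm that the collected corpus actually meets the 2-hour minimum from step 2. The sketch below assumes the corpus is a directory of uncompressed WAV files; the function names and the directory layout are hypothetical, and it measures duration only, not how clean the recordings are.

```python
import wave
from pathlib import Path

MIN_HOURS = 2.0  # minimum corpus size from step 2


def wav_seconds(path) -> float:
    """Duration of one WAV file in seconds (frames / sample rate)."""
    with wave.open(str(path), "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def corpus_hours(corpus_dir) -> float:
    """Total duration in hours of every .wav file under corpus_dir."""
    total = sum(wav_seconds(p) for p in sorted(Path(corpus_dir).rglob("*.wav")))
    return total / 3600.0


def has_enough_audio(corpus_dir, min_hours=MIN_HOURS) -> bool:
    """True if the corpus reaches the minimum number of hours."""
    return corpus_hours(corpus_dir) >= min_hours
```

For example, has_enough_audio("data/sichuanese") (a made-up path) returns False until the recordings under that directory add up to at least 2 hours.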
Effectiveness verification
Measured with the MUSHRA test method, the naturalness MOS score of Sichuanese synthesis improved from 4.2 to 5.1, reaching the commercial standard.
This answer comes from the article "CosyVoice: Ali open source multilingual cloning and generation tools".