Improving speech quality requires attention to both the input data and the model configuration:
- Input audio quality: Make sure the sample audio used for voice cloning has a DNSMOS score ≥ 2.8. Capturing it with professional recording equipment is recommended, to avoid ambient noise.
- Text labeling specifications: Dialogue text must clearly label each speaker (e.g. Speaker1:), and descriptive labels should be added for inflections, such as [笑声] (laughter) or [停顿] (pause).
- Parameter tuning: In config.yaml, adjust the prosody_scale (prosody scaling factor) and noise_scale (noise randomness) parameters; the recommended range is 0.8-1.2.
- Model fine-tuning: LoRA fine-tuning on domain-specific data (e.g., medical conversations, customer service recordings) can significantly improve performance in specialized scenarios.
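The labeling and parameter recommendations above can be checked before synthesis with a small script. The following is only a minimal sketch and not part of MOSS-TTSD: the helper names (unlabeled_lines, inflection_tags, out_of_range) are hypothetical, the speaker label is assumed to follow the "Speaker1:" pattern, and the parameters are assumed to be already loaded from config.yaml into a plain dictionary.

```python
import re

# Assumption: speaker labels look like "Speaker1:"; the exact syntax the tool expects may differ.
SPEAKER_RE = re.compile(r"^Speaker\d+:\s*\S")
# Bracketed inflection tags such as [笑声] (laughter) or [停顿] (pause).
TAG_RE = re.compile(r"\[([^\[\]]+)\]")

# Range recommended in the text for both parameters.
RECOMMENDED_RANGE = (0.8, 1.2)
CHECKED_PARAMS = ("prosody_scale", "noise_scale")


def unlabeled_lines(dialogue: str) -> list[int]:
    """Return 1-based numbers of non-empty lines that lack a speaker label."""
    return [
        number
        for number, line in enumerate(dialogue.splitlines(), start=1)
        if line.strip() and not SPEAKER_RE.match(line.strip())
    ]


def inflection_tags(dialogue: str) -> list[str]:
    """Return the contents of every bracketed tag, in order of appearance."""
    return TAG_RE.findall(dialogue)


def out_of_range(params: dict[str, float]) -> dict[str, float]:
    """Return the checked parameters whose values fall outside the recommended range."""
    low, high = RECOMMENDED_RANGE
    return {
        name: value
        for name, value in params.items()
        if name in CHECKED_PARAMS and not low <= value <= high
    }


if __name__ == "__main__":
    dialogue = (
        "Speaker1: Hello there. [笑声]\n"
        "Speaker2: Hi. [停顿] How are you?\n"
        "This line has no label."
    )
    print(unlabeled_lines(dialogue))   # lines missing a speaker label
    print(inflection_tags(dialogue))   # tags found in the text
    print(out_of_range({"prosody_scale": 1.1, "noise_scale": 1.5}))
```

Running such a check first makes it easier to tell whether poor output comes from the transcript or settings rather than from the model itself.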
This answer comes from the article "MOSS-TTSD: An Open Source Bilingual Dialog Speech Generation Tool".































