How to optimize the naturalness and expressiveness of speech generated by MOSS-TTSD?

2025-08-19


Improving speech quality requires attention to both the input data and the model configuration:

Input audio quality: Make sure the sample audio used for voice cloning has a DNSMOS score ≥ 2.8. Recording it with professional equipment is recommended, to keep ambient noise out.
Text labeling conventions: Label each speaker clearly in the dialogue text (e.g. Speaker1:), and add descriptive tags for expressive cues, such as [笑声] (laughter) or [停顿] (pause).
Parameter tuning: In config.yaml, adjust the prosody_scale (prosody scaling factor) and noise_scale (noise randomness) parameters; the recommended range is 0.8-1.2.
Model fine-tuning: LoRA fine-tuning on domain-specific data (e.g. medical conversations, customer service recordings) can significantly improve performance in specialized scenarios.
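
The labeling and parameter recommendations above can be checked before running synthesis. The following is a minimal sketch, not part of MOSS-TTSD: the helper names unlabeled_lines and out_of_range are hypothetical, the speaker pattern assumes the "Speaker1:" format from the example above, and the settings are passed in as a plain dictionary rather than read from config.yaml.

```python
import re

# Assumed label format, based on the "Speaker1:" example in the text.
SPEAKER_LINE = re.compile(r"^Speaker\d+:\s*\S")

# Recommended range for both parameters, as stated in the text.
RECOMMENDED_RANGE = (0.8, 1.2)


def unlabeled_lines(script: str) -> list[int]:
    """Return 1-based numbers of non-empty lines that lack a speaker label."""
    return [
        number
        for number, line in enumerate(script.splitlines(), start=1)
        if line.strip() and not SPEAKER_LINE.match(line.strip())
    ]


def out_of_range(settings: dict[str, float]) -> dict[str, float]:
    """Return prosody_scale / noise_scale values outside the recommended range."""
    low, high = RECOMMENDED_RANGE
    flagged = {}
    for name in ("prosody_scale", "noise_scale"):
        value = settings.get(name)
        if value is not None and not low <= value <= high:
            flagged[name] = value
    return flagged


script = "Speaker1: Hello there. [笑声]\nSo, shall we begin?\nSpeaker2: [停顿] Sure."
print(unlabeled_lines(script))  # [2]
print(out_of_range({"prosody_scale": 1.1, "noise_scale": 1.5}))  # {'noise_scale': 1.5}
```

Here the second line of the script is flagged because it has no speaker label, and noise_scale is flagged because 1.5 lies outside 0.8-1.2.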
