A full range of solutions to optimize the naturalness of digital human speech
Linly-Talker offers a variety of technical solutions to the problem of unnatural speech:
- Basic approach: choose a high-quality TTS voice:
  - In the WebUI voice settings, prefer the voices provided by Microsoft Speech Services.
  - For Chinese, "Xiaoxiao" or "Yunxi" is recommended.
  - For English, "Jenny" or "Guy" is suggested.
- Advanced approach: voice cloning:
  - Prepare a sample of the target voice at least 1 minute long (clear, noise-free audio is recommended).
  - Clone the voice with the GPT-SoVITS model.
  - Adjust the speaker similarity parameter (0.7-0.9 recommended).
- Technical tuning:
  - Lower the speech rate parameter slightly to improve clarity.
  - Enable FunASR's voice enhancement.
  - Record in a quiet environment.
- Post-processing:
  - Synchronize speech and lip movement with MuseTalk.
  - Adjust the pitch curve in audio editing software.
  - Add a moderate amount of background sound to enhance the ambience.
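The voice choice and speech-rate tweak from the list above can be sketched as a small settings helper. This is a minimal sketch, not Linly-Talker's own code. The names `RECOMMENDED_VOICES`, `pick_voice`, and `rate_string` are hypothetical. It assumes the voices named above correspond to the Microsoft Edge neural voice short names shown, and that the rate is passed as a signed percentage string (the format the `edge-tts` package is understood to accept). The ±50% bound is an illustrative guard, not a documented limit.

```python
# Hypothetical helper, not part of Linly-Talker.
# Assumption: the recommended voices map to these Edge neural voice short names.
RECOMMENDED_VOICES = {
    "zh": ["zh-CN-XiaoxiaoNeural", "zh-CN-YunxiNeural"],
    "en": ["en-US-JennyNeural", "en-US-GuyNeural"],
}


def pick_voice(language: str, index: int = 0) -> str:
    """Return one of the recommended voices for 'zh' or 'en'."""
    try:
        return RECOMMENDED_VOICES[language][index]
    except (KeyError, IndexError):
        raise ValueError(f"no recommended voice for {language!r} at index {index}")


def rate_string(percent: int) -> str:
    """Format a speech-rate change as a signed percentage, e.g. -10 -> '-10%'."""
    # Illustrative bound (an assumption): very large changes tend to hurt clarity.
    if not -50 <= percent <= 50:
        raise ValueError("keep rate changes within +/-50%")
    return f"{percent:+d}%"


# Slightly slower Chinese voice, as suggested for clarity.
settings = {"voice": pick_voice("zh"), "rate": rate_string(-10)}
print(settings)
```

The resulting `voice` and `rate` values could then be handed to whichever TTS backend is configured. Keeping them in one dictionary makes it easy to try a different voice or rate between runs.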
Note that the system supports real-time adjustment of speech parameters, so users can keep tuning during a conversation until the result sounds right. For professional use, recording 3-5 higher-quality speech samples for model fine-tuning is recommended.
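Before cloning or fine-tuning, it can help to confirm that each recording meets the one-minute guideline. The following is a minimal sketch using only Python's standard `wave` module. The function names and the `MIN_SECONDS` constant are hypothetical, and it assumes the samples are uncompressed PCM WAV files. It does not assess noise or clarity, which still need to be checked by ear.

```python
import wave

MIN_SECONDS = 60.0  # the "1 minute or more" guideline from the text


def wav_duration_seconds(path: str) -> float:
    """Return the duration of an uncompressed PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def check_samples(paths: list[str], min_seconds: float = MIN_SECONDS) -> dict[str, bool]:
    """Map each path to True if the recording is at least min_seconds long."""
    return {p: wav_duration_seconds(p) >= min_seconds for p in paths}
```

For example, `check_samples(["sample1.wav", "sample2.wav"])` returns a dictionary flagging which files are long enough, so short recordings can be re-recorded before training.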
This answer comes from the article "Linly-Talker: An Intelligent Dialogue System for Digital People, Combining Big Language Modeling and Visual Modeling for a New Interactive Experience".































