Speech Cloning Optimization Solution
To push voice similarity above 95%, three areas need to be optimized:
- Sample quality: Choose 5-10 seconds of WeChat voice audio with no background noise; we recommend capturing it with the system's built-in recording function. Avoid clips that contain: 1) background music 2) multi-person conversations 3) electrical hum
- Parameter tuning: In xcodec_config.json, raise hop_length to 256 and set remove_silence=True to improve feature extraction.
- Data augmentation: Use the sox audio tool to change tempo without shifting pitch (command: sox input.wav output.wav tempo 0.9), generating multiple versions of each training sample.
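The parameter and augmentation steps above can be scripted. The following is a minimal sketch, not WeClone's own tooling: the helper names (`build_tempo_commands`, `update_codec_config`, `run_augmentation`), the output file naming, and the extra tempo factor 1.1 are all illustrative assumptions. It also assumes that `hop_length` and `remove_silence` are top-level keys in xcodec_config.json, which the source does not confirm.

```python
import json
import shutil
import subprocess
from pathlib import Path


def build_tempo_commands(src, out_dir, factors=(0.9, 1.1)):
    """Return one sox command per tempo factor.

    The sox `tempo` effect changes speed while keeping pitch unchanged.
    """
    src = Path(src)
    out_dir = Path(out_dir)
    commands = []
    for factor in factors:
        if factor == 1.0:
            continue  # the unmodified original is already a training sample
        dst = out_dir / f"{src.stem}_tempo{factor:.2f}{src.suffix}"
        commands.append(["sox", str(src), str(dst), "tempo", str(factor)])
    return commands


def update_codec_config(path, hop_length=256, remove_silence=True):
    """Set the two options named in the text.

    Assumption: both are top-level keys in the JSON file.
    """
    path = Path(path)
    config = json.loads(path.read_text(encoding="utf-8"))
    config["hop_length"] = hop_length
    config["remove_silence"] = remove_silence
    path.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return config


def run_augmentation(src, out_dir, factors=(0.9, 1.1)):
    """Run the generated sox commands; requires sox on PATH."""
    if shutil.which("sox") is None:
        raise RuntimeError("sox not found on PATH")
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for cmd in build_tempo_commands(src, out_dir, factors):
        subprocess.run(cmd, check=True)
```

Calling `run_augmentation("input.wav", "augmented")` would write one tempo-shifted copy per factor next to the original sample. Keeping the factors close to 1.0 is a cautious choice, since large tempo changes can introduce audible artifacts.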
Advanced tips include: 1) annotating the text with prosody markers 2) adding 10 ms of leading silence 3) using NSF-HiFiGAN as the back-end vocoder. Results can be evaluated with the mel-spectrogram similarity (mel-CDTW) metric.
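The 10 ms leading silence can be added with the Python standard library alone. The sketch below assumes uncompressed PCM WAV input; the function name `prepend_silence` is hypothetical and not part of WeClone.

```python
import wave


def prepend_silence(src, dst, silence_ms=10):
    """Copy a PCM WAV file to `dst` with `silence_ms` of leading silence.

    `src` and `dst` may be file paths or seekable binary file objects.
    Returns the number of silent frames that were added.
    """
    with wave.open(src, "rb") as reader:
        params = reader.getparams()
        frames = reader.readframes(reader.getnframes())

    n_silent = params.framerate * silence_ms // 1000
    # 8-bit WAV samples are unsigned (silence is 0x80); wider samples are
    # signed, so silence is all zero bytes.
    silent_byte = b"\x80" if params.sampwidth == 1 else b"\x00"
    silence = silent_byte * (n_silent * params.sampwidth * params.nchannels)

    with wave.open(dst, "wb") as writer:
        writer.setparams(params)
        # The frame count in the header is corrected after writing.
        writer.writeframes(silence + frames)
    return n_silent
```

For a 16 kHz recording, 10 ms corresponds to 160 frames, so `prepend_silence("input.wav", "padded.wav")` would lengthen such a file by exactly that many frames.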
This answer comes from the article "WeClone: training digital doppelgangers with WeChat chats and voices".