Personalized voice customization process
Muyan-TTS achieves personalized speech generation through an SFT (Supervised Fine-Tuning) model. The process consists of the following steps:
- Data preparation: Collect at least 30 minutes of clear voice data (in WAV format) from the target speaker; recommended: 16 kHz sampling rate, mono
- Data preprocessing: Transcribe the speech with the integrated Whisper and FunASR tools to generate a structured dataset
- Model fine-tuning: Modify the `training/sft.yaml` configuration file and run `train.sh` to start training
- Weight integration: Copy the base model's `sovits.pth` into the new model directory to keep the decoder consistent
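The data-preparation requirement in the first step can be spot-checked before training. The sketch below verifies that every WAV file is mono at 16 kHz and that the corpus totals at least 30 minutes; the function name and directory layout are illustrative and not part of Muyan-TTS itself:

```python
import wave
from pathlib import Path

def validate_corpus(wav_dir: str,
                    required_rate: int = 16000,
                    min_minutes: float = 30.0) -> tuple[bool, float]:
    """Check that every WAV in wav_dir is mono at required_rate and
    return (meets_requirements, total_minutes) for the corpus."""
    total_seconds = 0.0
    for path in Path(wav_dir).glob("*.wav"):
        with wave.open(str(path), "rb") as wf:
            # Reject files with the wrong channel count or sample rate
            if wf.getnchannels() != 1 or wf.getframerate() != required_rate:
                return False, 0.0
            total_seconds += wf.getnframes() / wf.getframerate()
    total_minutes = total_seconds / 60.0
    return total_minutes >= min_minutes, total_minutes
```

Running this before preprocessing catches format problems early, rather than after an hour of GPU time.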
Data quality requirements
- Avoid background noise and audio distortion
- Keep the voice style consistent (e.g., for podcast scenarios, a formal speaking style is recommended)
- Transcription accuracy must exceed 95%
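The >95% transcription-accuracy requirement can be spot-checked on a sample of hand-verified transcripts. The sketch below uses `difflib.SequenceMatcher` as a simple word-level similarity score; it is an illustrative stand-in, not the metric Muyan-TTS itself uses (a proper WER tool would be stricter):

```python
from difflib import SequenceMatcher

def transcript_accuracy(reference: str, hypothesis: str) -> float:
    """Word-level similarity between a hand-checked reference transcript
    and an ASR hypothesis (1.0 = identical, 0.0 = no overlap)."""
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()
    return SequenceMatcher(None, ref_words, hyp_words).ratio()
```

Transcripts scoring below the 0.95 threshold on sampled audio should be corrected by hand before they enter the dataset.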
Typical training parameters
With the base configuration, a usable personalized model can be obtained after about 1 hour of training (~1000 steps) on a single A100. Recommended learning rate: 3e-5; batch size: 8.
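The parameters above would be set in `training/sft.yaml`. The fragment below is only a sketch of how they might look; the key names are assumptions, so consult the repository's actual config file for the exact schema:

```yaml
# training/sft.yaml -- illustrative fragment; key names are assumed
learning_rate: 3.0e-5
batch_size: 8
max_steps: 1000   # roughly 1 hour on a single A100 in the base setup
```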
This answer comes from the article *Muyan-TTS: Personalized Podcast Speech Training and Synthesis*.