How does Muyan-TTS's personalized voice customization feature work? What data do I need to prepare?

2025-08-23

Personalized voice customization process

Muyan-TTS generates personalized speech by applying supervised fine-tuning (SFT) to its base model. The process consists of the following steps:

  1. Data preparation: Collect at least 30 minutes of clear voice recordings from the target speaker in WAV format. The recommended settings are a 16 kHz sampling rate and mono audio.
  2. Data preprocessing: Transcribe the speech with the integrated Whisper and FunASR tools to generate a structured dataset.
  3. Model tuning: Modify the training/sft.yaml configuration file and run train.sh to start training.
  4. Weight integration: Copy the base model's sovits.pth into the new model directory so that the decoder stays consistent with the base model.
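
Before starting step 2, it can help to confirm that the recordings meet the requirements from step 1. The following is a minimal sketch using only Python's standard library. It is not part of Muyan-TTS; the function names and the flat directory of .wav files are assumptions made for illustration.

```python
import wave
from pathlib import Path

TARGET_RATE = 16_000          # recommended sampling rate (16 kHz)
MIN_TOTAL_SECONDS = 30 * 60   # at least 30 minutes of audio


def inspect_wav(path):
    """Return (duration_seconds, sample_rate, channels) for one WAV file."""
    with wave.open(str(path), "rb") as wf:
        rate = wf.getframerate()
        return wf.getnframes() / rate, rate, wf.getnchannels()


def check_dataset(wav_dir):
    """Return (total_seconds, problems) for all .wav files in wav_dir."""
    total = 0.0
    problems = []
    for path in sorted(Path(wav_dir).glob("*.wav")):
        duration, rate, channels = inspect_wav(path)
        total += duration
        if rate != TARGET_RATE:
            problems.append(f"{path.name}: sample rate is {rate} Hz, expected {TARGET_RATE}")
        if channels != 1:
            problems.append(f"{path.name}: {channels} channels, expected mono")
    if total < MIN_TOTAL_SECONDS:
        problems.append(f"total duration is {total / 60:.1f} min, expected at least 30 min")
    return total, problems
```

Calling check_dataset on the recording directory returns the total duration in seconds and a list of human-readable problems. An empty list means the files match the recommended format and length. The check does not measure noise or distortion.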

Data quality requirements

  • Avoid background noise and audio distortion.
  • Keep the voice style consistent across recordings (e.g., for podcast scenarios, a formal speaking style is recommended).
  • Transcription accuracy should exceed 95%.
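
The source does not say how the 95% figure is measured. One common way to estimate it is to hand-correct a small sample of transcripts and compare them with the automatic output using word-level edit distance. The sketch below assumes that approach; the function names are hypothetical, and for languages written without spaces (such as Chinese) the comparison would need to run on characters instead of whitespace-separated words.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1]


def transcription_accuracy(reference, hypothesis):
    """Return 1 - word error rate, clamped to the range [0, 1]."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        return 1.0 if not hyp else 0.0
    return max(0.0, 1.0 - edit_distance(ref, hyp) / len(ref))
```

A clip whose score falls below 0.95 on the checked sample is a candidate for manual correction or removal before training.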

Typical training parameters

With the base configuration, training on a single A100 GPU for about 1 hour (roughly 1000 steps) yields a usable personalized model. The recommended learning rate is 3e-5 and the recommended batch size is 8.
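
To put these figures in perspective, the arithmetic below works out how many training samples the model sees and how long each step takes. It is a rough sketch that assumes no gradient accumulation and that one step processes exactly one batch; neither assumption is confirmed by the source.

```python
steps = 1000             # approximate number of training steps
batch_size = 8           # recommended batch size
training_seconds = 3600  # about 1 hour on a single A100

# Total samples processed over the run (clips are revisited across epochs).
samples_seen = steps * batch_size

# Average wall-clock time per optimizer step.
seconds_per_step = training_seconds / steps


def epochs_over(num_clips):
    """Approximate passes over a dataset of num_clips (hypothetical count)."""
    return samples_seen / num_clips
```

Under these assumptions the run processes 8000 samples at about 3.6 seconds per step. The number of epochs depends on how many clips the 30 minutes of audio is split into, which the source does not specify.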
