Customizing the voice style requires fine-tuning the model, which proceeds in five stages:
- Data preparation: Collect at least 300 speech samples of the target style (10-30 seconds per sample is recommended). Each sample should include:
  - WAV audio (24 kHz sampling rate)
  - The corresponding text transcription
  - Optional emotion labels
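The sample requirements above can be sketched as a simple validation check. The manifest field names used here (`audio`, `sample_rate`, `duration_sec`, `text`, `emotion`) are illustrative assumptions, not the tool's actual schema:

```python
def validate_sample(sample: dict) -> list[str]:
    """Return a list of problems found in one sample entry.

    Field names are hypothetical; adapt them to the real dataset layout.
    """
    problems = []
    if not sample.get("audio", "").endswith(".wav"):
        problems.append("audio must be a WAV file")
    if sample.get("sample_rate") != 24000:
        problems.append("expected 24 kHz sampling rate")
    duration = sample.get("duration_sec", 0)
    if not 10 <= duration <= 30:
        problems.append("duration outside the recommended 10-30 s range")
    if not sample.get("text"):
        problems.append("missing transcription")
    return problems

sample = {
    "audio": "clips/sample_0001.wav",
    "sample_rate": 24000,
    "duration_sec": 18.4,
    "text": "The weather is lovely today.",
    "emotion": "calm",  # optional label
}
print(validate_sample(sample))  # → []
```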
- Format conversion: Convert the data to the Hugging Face dataset format using the official Colab notebook (ID provided in the documentation), which automatically performs:
  - Text normalization (e.g., converting numbers to words)
  - Speech feature extraction (F0, mel spectrogram)
  - Dataset splitting (80/10/10)
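As a rough illustration of the 80/10/10 split, here is a standalone sketch; the official notebook handles this automatically, and this is not its actual code:

```python
import random

def split_dataset(items, seed=0):
    """Deterministically shuffle and split items into train/val/test at 80/10/10."""
    rng = random.Random(seed)
    shuffled = items[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * 0.8)
    n_val = int(n * 0.1)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )

# With 300 samples, the split yields 240/30/30.
train, val, test = split_dataset(list(range(300)))
print(len(train), len(val), len(test))  # → 240 30 30
```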
- Configuration adjustment: Modify the key parameters in finetune/config.yaml:
  - learning_rate: 3e-5 recommended
  - batch_size: adjust to available GPU memory (4 recommended for 12 GB cards)
  - max_epochs: typically 10-15
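A hypothetical finetune/config.yaml fragment with the recommended values; only the three parameter names above come from the source, and the real file's layout and remaining keys may differ:

```yaml
# Illustrative config fragment; values follow the recommendations above.
learning_rate: 3.0e-5   # recommended starting point
batch_size: 4           # fits a 12 GB card; raise on larger GPUs
max_epochs: 12          # within the suggested 10-15 range
```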
- Launch training: Use the accelerate distributed framework:

  ```bash
  accelerate launch train.py
  ```
  Training metrics are automatically uploaded to the WandB dashboard.
- Effectiveness verification: Assess the result by speaker similarity score (a Spearman's correlation coefficient ≥ 0.7 is considered satisfactory) and MOS naturalness score (≥ 4.0 is considered excellent).
Typically, around 10 hours of training on a V100 GPU yields the desired results.
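For intuition on the similarity check, Spearman's correlation compares the rank order of model similarity scores with human ratings. This minimal stdlib sketch uses made-up scores; a real pipeline would more likely call scipy.stats.spearmanr:

```python
def ranks(values):
    """Assign 1-based average ranks, handling ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank across the tie group
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman's rho = Pearson correlation of the rank vectors."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Made-up example scores: model similarity vs. human ratings.
model_scores = [0.82, 0.74, 0.91, 0.65, 0.88]
human_scores = [4.1, 3.8, 4.5, 3.2, 4.3]
print(round(spearman(model_scores, human_scores), 2))  # → 1.0
```

A rho at or above 0.7 against human judgments would meet the satisfactory threshold mentioned above.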
This answer is based on the article "Orpheus-TTS: Text-to-Speech Tool for Generating Natural Chinese Speech".