Customizing the voice style requires fine-tuning the model, which proceeds in five stages:
- Data preparation: Collect at least 300 speech samples of the target style (10-30 seconds per sample is recommended). Each sample should include:
  - WAV audio (24 kHz sampling rate)
  - The corresponding text transcription
  - Optional emotion labels
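The sample requirements above can be sketched as a simple validation check. The manifest field names used here (`audio`, `sample_rate`, `duration_sec`, `text`, `emotion`) are illustrative assumptions, not the tool's actual schema:

```python
def validate_sample(sample: dict) -> list[str]:
    """Return a list of problems found in one sample entry.

    Field names are hypothetical; adapt them to the real dataset layout.
    """
    problems = []
    if not sample.get("audio", "").endswith(".wav"):
        problems.append("audio must be a WAV file")
    if sample.get("sample_rate") != 24000:
        problems.append("expected 24 kHz sampling rate")
    duration = sample.get("duration_sec", 0)
    if not 10 <= duration <= 30:
        problems.append("duration outside the recommended 10-30 s range")
    if not sample.get("text"):
        problems.append("missing transcription")
    return problems

sample = {
    "audio": "clips/sample_0001.wav",
    "sample_rate": 24000,
    "duration_sec": 18.4,
    "text": "The weather is lovely today.",
    "emotion": "calm",  # optional label
}
print(validate_sample(sample))  # → []
```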
- Format conversion: Convert the data to the Hugging Face dataset format using the official Colab notebook (ID provided in the documentation), which automatically performs:
  - Text normalization (e.g., converting numbers to words)
  - Speech feature extraction (F0, mel spectrogram)
  - Dataset splitting (80/10/10)
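As a rough illustration of the 80/10/10 split, here is a standalone sketch; the official notebook handles this automatically, and this is not its actual code:

```python
import random

def split_dataset(items, seed=0):
    """Deterministically shuffle and split items into train/val/test at 80/10/10."""
    rng = random.Random(seed)
    shuffled = items[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * 0.8)
    n_val = int(n * 0.1)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )

# With 300 samples, the split yields 240/30/30.
train, val, test = split_dataset(list(range(300)))
print(len(train), len(val), len(test))  # → 240 30 30
```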
- Configuration adjustment: Modify the key parameters in finetune/config.yaml:
  - learning_rate: 3e-5 recommended
  - batch_size: adjust to available GPU memory (4 recommended for 12 GB cards)
  - max_epochs: typically 10-15
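A hypothetical finetune/config.yaml fragment with the recommended values; only the three parameter names above come from the source, and the real file's layout and remaining keys may differ:

```yaml
# Illustrative config fragment; values follow the recommendations above.
learning_rate: 3.0e-5   # recommended starting point
batch_size: 4           # fits a 12 GB card; raise on larger GPUs
max_epochs: 12          # within the suggested 10-15 range
```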
- Launch training: Use the accelerate distributed framework:

  ```bash
  accelerate launch train.py
  ```
  Training metrics are automatically uploaded to the WandB dashboard.
- Effectiveness verification: Assess the result by speaker similarity score (a Spearman's correlation coefficient ≥ 0.7 is considered satisfactory) and MOS naturalness score (≥ 4.0 is considered excellent).
Typically, around 10 hours of training on a V100 GPU yields the desired results.
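For intuition on the similarity check, Spearman's correlation compares the rank order of model similarity scores with human ratings. This minimal stdlib sketch uses made-up scores; a real pipeline would more likely call scipy.stats.spearmanr:

```python
def ranks(values):
    """Assign 1-based average ranks, handling ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank across the tie group
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman's rho = Pearson correlation of the rank vectors."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Made-up example scores: model similarity vs. human ratings.
model_scores = [0.82, 0.74, 0.91, 0.65, 0.88]
human_scores = [4.1, 3.8, 4.5, 3.2, 4.3]
print(round(spearman(model_scores, human_scores), 2))  # → 1.0
```

A rho at or above 0.7 against human judgments would meet the satisfactory threshold mentioned above.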
This answer is based on the article "Orpheus-TTS: Text-to-Speech Tool for Generating Natural Chinese Speech".