A practical approach to building a multi-role voice system
For audiobook or multi-host podcast scenarios, a stable multi-role voice library can be built by following the steps below:
- Infrastructure phase:
  - Collect at least 20 minutes of clean voice samples for each target character
  - Create a separate directory structure for each character's training dataset
  - Create a dedicated data/tts_sft_data_xx.json configuration file per character (a sketch of building this file follows the list)
- Model training options:
  - Option A: Fine-tune (SFT) a separate model for each character
  - Option B: Train a single model on mixed multi-speaker data (requires modifying the model architecture)
- Inference phase management:
  - Maintain a character-to-reference-audio mapping table
  - When calling the API, make ref_wav_path strictly match the character's training data
  - Optionally, add a character identifier to prompt_text to reinforce the voice characteristics (see the API sketch after this list)
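
To make the infrastructure phase concrete, here is a minimal Python sketch that builds one data/tts_sft_data_xx.json file per character from a per-character directory of audio clips and transcripts. The directory layout, the JSON schema (wav_path/text entries), and the character names are illustrative assumptions, not the exact Muyan-TTS training format.

```python
# Sketch of the infrastructure phase: one directory and one
# data/tts_sft_data_xx.json file per character. The JSON schema here
# (a list of {"wav_path", "text"} entries) is an assumption; adapt it
# to whatever fields the training code actually expects.
import json
from pathlib import Path

DATA_ROOT = Path("data")  # assumed layout: data/<character>/raw/*.wav + matching *.txt

def build_sft_config(character: str) -> Path:
    """Collect (wav, transcript) pairs for one character into its own JSON file."""
    raw_dir = DATA_ROOT / character / "raw"
    entries = []
    for wav in sorted(raw_dir.glob("*.wav")):
        txt = wav.with_suffix(".txt")          # transcript stored next to the audio
        if not txt.exists():
            continue                           # skip clips without a transcript
        entries.append({
            "wav_path": str(wav),
            "text": txt.read_text(encoding="utf-8").strip(),
        })
    out_path = DATA_ROOT / f"tts_sft_data_{character}.json"
    out_path.write_text(json.dumps(entries, ensure_ascii=False, indent=2), encoding="utf-8")
    return out_path

# One config file per character keeps Option A (per-character SFT) simple:
for name in ["narrator", "host_a", "host_b"]:
    build_sft_config(name)
```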
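For the inference phase, the sketch below shows a character-to-reference-audio mapping table and an API call that always reuses the training-time reference audio. The endpoint URLs, the /synthesize path, and the response format are hypothetical; ref_wav_path and prompt_text are the fields named above.

```python
# Sketch of the inference phase: a character -> reference-audio mapping table,
# plus a call helper that keeps ref_wav_path strictly matched to the audio the
# character's model was trained on.
import requests

# Endpoint URLs are hypothetical placeholders for per-character deployments.
VOICE_TABLE = {
    "narrator": {
        "endpoint": "http://tts-narrator:8020/synthesize",
        "ref_wav_path": "data/narrator/raw/ref_001.wav",
        "prompt_text": "[narrator] Reference line read in the narrator's voice.",
    },
    "host_a": {
        "endpoint": "http://tts-host-a:8020/synthesize",
        "ref_wav_path": "data/host_a/raw/ref_001.wav",
        "prompt_text": "[host_a] Reference line read in host A's voice.",
    },
}

def synthesize(character: str, text: str) -> bytes:
    """Call the character's endpoint, reusing its training-time reference audio."""
    voice = VOICE_TABLE[character]               # KeyError = unknown character, fail loudly
    payload = {
        "ref_wav_path": voice["ref_wav_path"],   # strict match with the training data
        "prompt_text": voice["prompt_text"],     # character identifier reinforces the voice
        "text": text,
    }
    resp = requests.post(voice["endpoint"], json=payload, timeout=120)
    resp.raise_for_status()
    return resp.content                          # assumed: raw audio bytes in the response

# audio = synthesize("narrator", "Chapter one. It was a quiet morning.")
```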
For scenarios that require frequent character switching, deploy each character's model as an independent API endpoint and put load balancing in front to keep invocation efficient (a routing sketch follows). This approach has been validated in audiobook production, where it kept the voices of 10+ characters stable simultaneously.
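A minimal routing sketch for that frequent-switching setup, assuming each character runs behind one or more replica endpoints and a simple round-robin spreads requests; the replica URLs are placeholders.

```python
# Simple round-robin over independent per-character endpoints.
import itertools

REPLICAS = {
    "narrator": ["http://tts-narrator-1:8020/synthesize",
                 "http://tts-narrator-2:8020/synthesize"],
    "host_a":   ["http://tts-host-a-1:8020/synthesize"],
}
_cycles = {name: itertools.cycle(urls) for name, urls in REPLICAS.items()}

def pick_endpoint(character: str) -> str:
    """Return the next replica for this character in round-robin order."""
    return next(_cycles[character])

# Rendering a script that alternates characters line by line:
script = [("narrator", "Chapter one."), ("host_a", "Welcome back to the show.")]
for character, line in script:
    endpoint = pick_endpoint(character)   # independent endpoint per character
    # audio = synthesize(character, line) # e.g. the helper sketched above, aimed at `endpoint`
```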
This answer comes from the article "Muyan-TTS: Personalized Podcast Speech Training and Synthesis".