Scenario requirements
Modern voice assistants need to support multi-user tone memories and personalized responses, and traditional solutions require training separate models for each user.
technical realization
- Fast cloning of tones: Record a 3-second calibration voice to be called when the user first uses it:
cosyvoice.add_zero_shot_spk(user_id, prompt_audio)
- Multi-tone management: Use
spk_embeddings.npy
File Storage User Tone Characteristics - Dynamic emotional adjustment: automatic insertion based on the content of the dialog
[happy]
,[whisper]
etag (math.)
system integration
1. Adopt gRPC servitization deployment, support 100+ concurrent requests
2. Work with NLU engine to realize context-aware sentiment label generation
3. AdoptionCosyVoice-300M-SFT
Model Optimization for Short Speech Generation
business value
The solution increases user satisfaction of voice assistants by 401 TP3T and user retention by 251 TP3T.
This answer comes from the articleCosyVoice: Ali open source multilingual cloning and generation toolsThe