Hibiki's voice transfer technology captures the prosodic features of the source speech with deep learning models and adapts them to the target-language output. The system employs a Classifier-Free Guidance (CFG) mechanism that lets users adjust voice similarity through the `cfg-coef` parameter (recommended value 3; a minimal sketch of how such a coefficient is applied follows the list below). The technical implementation rests on three key innovations:
- An attention-based acoustic feature transfer network
- Adversarial training to keep the timbre natural
- A prosody decoupling technique that separates linguistic content from prosodic features
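The exact way Hibiki applies guidance is not detailed here, but the standard classifier-free guidance rule blends a conditional prediction with an unconditional one using a single coefficient. The sketch below illustrates that rule in PyTorch; the function name `cfg_blend` and the tensor shapes are assumptions for illustration, not Hibiki's actual API.

```python
import torch

def cfg_blend(cond_logits: torch.Tensor,
              uncond_logits: torch.Tensor,
              cfg_coef: float = 3.0) -> torch.Tensor:
    """Standard classifier-free guidance blend (illustrative sketch only).

    cfg_coef = 1.0 returns the conditional prediction unchanged; larger values
    push the output further toward the conditioning signal (here, the source
    speaker's voice), which is the similarity/diversity trade-off that a
    cfg-coef setting exposes.
    """
    return uncond_logits + cfg_coef * (cond_logits - uncond_logits)

# Toy usage with the recommended coefficient of 3; shapes are made up.
cond = torch.randn(1, 8, 2048)      # (batch, codebooks, vocab)
uncond = torch.randn(1, 8, 2048)
guided = cfg_blend(cond, uncond, cfg_coef=3.0)
```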
Compared with the mechanical-sounding synthesized speech of traditional translation systems, Hibiki's output preserves suprasegmental features of the source speech such as rhythm and stress, and its MOS naturalness score improves by 37%. This makes it especially well suited to quality-sensitive scenarios such as film and TV dubbing and voice-based social apps.
This answer is drawn from the article "Hibiki: a real-time speech translation model, streaming translation that preserves the characteristics of the original voice".































