Hibiki uses a synthetic data generation scheme to reduce the reliance on parallel speech corpora that constrains traditional speech translation systems. It draws on the contextual alignment capabilities of the MADLAD machine translation system to derive word-level, weakly supervised alignment rules: a target-language word is kept at a given position only if it can be predicted from the source-language content that precedes it. This strict alignment strategy is realized through two key techniques:
- Silence insertion, which delays target words while maintaining the rhythmic structure of the utterance
- A voice-controlled TTS system, which ensures the naturalness of the synthesized speech
This scheme allows the system to be trained with only single-speaker aligned data in the French-English translation scenario, reducing the data requirement to less than 10% of what traditional methods need. In practical tests, the model trained on synthetic data achieves a Mean Opinion Score (MOS) of 4.2, which is close to the level of professional human translators.
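The alignment rule and the silence insertion step described above can be illustrated with a minimal sketch. It is not Hibiki's actual implementation: the `Word` type, the `insert_silences` function, and the `depends_on` list (for each target word, the index of the last source word needed to predict it) are hypothetical names introduced here, and the dependency indices are written by hand rather than computed with MADLAD.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float    # seconds


def insert_silences(source_words, target_words, depends_on):
    """Delay target words so that each one starts no earlier than the end
    of the source word it depends on (assumed to come from an alignment step).

    Silence is inserted before a word that would otherwise come too early;
    word durations are left unchanged and later words shift by the same amount.
    """
    shifted = []
    offset = 0.0  # total silence inserted so far
    for word, src_idx in zip(target_words, depends_on):
        earliest = source_words[src_idx].end
        start = word.start + offset
        if start < earliest:
            # Insert just enough silence to respect the dependency.
            offset += earliest - start
        shifted.append(Word(word.text, word.start + offset, word.end + offset))
    return shifted


# Hand-made example: French source, English target with natural TTS timing.
source = [Word("je", 0.0, 0.5), Word("mange", 0.5, 1.0),
          Word("une", 1.0, 1.5), Word("pomme", 1.5, 2.0)]
target = [Word("I", 0.0, 0.5), Word("eat", 0.5, 1.0),
          Word("an", 1.0, 1.5), Word("apple", 1.5, 2.0)]
# Assumption: "an" (rather than "a") is only predictable once "pomme" is heard.
depends_on = [0, 1, 3, 3]

aligned = insert_silences(source, target, depends_on)
for w in aligned:
    print(f"{w.text}: {w.start:.1f}s to {w.end:.1f}s")
```

In this toy example, silence is inserted before "I" and again before "an", so the target speech never runs ahead of the source content it translates.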
This answer comes from the article "Hibiki: a real-time speech translation model, streaming translation that preserves the characteristics of the original voice".