Lightweight Deployment Strategy
For devices with limited computing resources, the following strategies can be used to deploy Hibiki's real-time translation feature:
- Select the lightweight 1B model: e.g. kyutai/hibiki-1b-mlx-bf16, which is designed for on-device use and reduces the memory footprint by about 50% compared with the 2B version.
- Use the MLX framework: the Metal-backed MLX implementation delivers excellent energy efficiency on Apple silicon.
- Quantize the model weights: converting BF16 to INT8 halves the model size while retaining roughly 90% of the accuracy.
- Enable streaming processing: setting a smaller chunk_size (e.g. 1 second) reduces memory spikes.
- Use a cloud-collaboration scheme: keep only the voice front-end processing on the device and offload the core computation to edge servers.
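The quantization step above can be illustrated with a minimal sketch of per-tensor symmetric INT8 quantization. This is a generic illustration, not Hibiki's actual quantization pipeline; NumPy's float16 stands in for BF16 (both use 2 bytes per weight), so INT8 storage is exactly half.

```python
import numpy as np

def quantize_int8(w):
    """Quantize a float weight tensor to INT8 with one per-tensor scale."""
    scale = float(np.abs(w).max()) / 127.0 or 1.0  # guard all-zero tensors
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map INT8 codes back to approximate float weights."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
# float16 stands in for BF16: 2 bytes per weight, so INT8 halves storage.
w = rng.normal(0.0, 0.02, size=(1024, 1024)).astype(np.float16)

q, scale = quantize_int8(w.astype(np.float32))
w_hat = dequantize(q, scale)

print("size ratio:", q.nbytes / w.nbytes)  # 0.5: INT8 is half of 2-byte weights
print("max abs error:", float(np.abs(w_hat - w.astype(np.float32)).max()))
```

The per-tensor absolute quantization error is bounded by half the scale, which is why well-scaled INT8 weights lose little accuracy in practice.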
Experimental data shows that end-to-end latency under 500 ms is achievable with the MLX-Swift implementation on an iPhone 16 Pro. For Android devices, consider repackaging the model with TensorFlow Lite. Kyutai Labs also provides a Rust version (hibiki-rs) that can be cross-compiled for a range of embedded platforms.
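The chunked streaming idea can be sketched as follows: audio is fed to the model about one second at a time, so only one chunk is resident in memory. Here `translate_chunk` is a hypothetical stand-in for the real Hibiki inference call, not its actual API; the 24 kHz sample rate matches what Kyutai's Moshi/Hibiki models operate at.

```python
import numpy as np

SAMPLE_RATE = 24_000   # Hibiki/Moshi models operate on 24 kHz audio
CHUNK_SECONDS = 1.0    # smaller chunks -> lower peak memory, at some latency cost

def translate_chunk(chunk: np.ndarray) -> np.ndarray:
    # Placeholder for model inference; here it simply echoes the input chunk.
    return chunk

def stream_translate(audio: np.ndarray, chunk_seconds: float = CHUNK_SECONDS):
    """Yield translated audio chunk by chunk instead of holding it all at once."""
    step = int(SAMPLE_RATE * chunk_seconds)
    for start in range(0, len(audio), step):
        yield translate_chunk(audio[start:start + step])

# 3.5 seconds of silence -> 4 chunks (the last one is partial)
audio = np.zeros(int(3.5 * SAMPLE_RATE), dtype=np.float32)
chunks = list(stream_translate(audio))
print(len(chunks))  # 4
```

The trade-off is the one named in the list above: shorter chunks cap the memory spike per inference step, while longer chunks amortize per-call overhead but raise peak memory and latency.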
This answer comes from the article "Hibiki: a real-time speech translation model, streaming translation that preserves the characteristics of the original voice".
