A solution to the real-time speech-to-text latency problem
To achieve low-latency local speech-to-text, you can work on the following areas:
- Hardware Optimization: Prefer a GPU that supports CUDA or MPS, with ≥ 8 GB of video memory. If you use an NVIDIA graphics card, make sure the latest CUDA toolkit is installed. CPU users can try a quantized model (e.g. whisper-small-int8) to lighten the load; see the quantization sketch after this list.
- Parameter Configuration: Modify the WebRTC parameters in main.py, as shown in the configuration sketch after this list:
  - Set audio_chunk_duration=0.3 (shorter audio chunks)
  - Set speech_pad_ms=200 (less silence padding)
  - Set batch_size=1 (disable batch processing)
- Model Selection: Choose a model to match your hardware, as in the picker sketch after this list:
  - High-end devices: whisper-large-v3-turbo
  - Mid-range devices: whisper-base
  - Low-end devices: whisper-tiny-int8
- Preprocessing Optimization: Use ffmpeg to resample the audio to 16000 Hz (recommended) and downmix it to mono, for example (input.wav and output.wav are placeholder names):
  ffmpeg -i input.wav -ar 16000 -ac 1 output.wav
  For live microphone capture, see the streaming sketch after this list.
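As mentioned under Hardware Optimization, CPU users can fall back to a quantized model. A minimal sketch, assuming the model runs through the faster-whisper library (an assumption; the project may use a different runtime):

```python
# Sketch only: faster-whisper is an assumed runtime, not confirmed by the
# project. compute_type="int8" loads 8-bit quantized weights, roughly
# matching the whisper-small-int8 variant mentioned above.
from faster_whisper import WhisperModel

model = WhisperModel("small", device="cpu", compute_type="int8")
segments, _info = model.transcribe("audio.wav")  # audio.wav is a placeholder
for segment in segments:
    print(segment.text)
```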
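For the parameter changes above, here is a minimal sketch of how the three values might be grouped in main.py. The parameter names come from this answer; the dict wrapper and its name are assumptions:

```python
# Hypothetical grouping of the latency-oriented WebRTC parameters;
# only the three keys are taken from the answer above.
WEBRTC_PARAMS = {
    "audio_chunk_duration": 0.3,  # seconds per chunk: smaller = lower latency
    "speech_pad_ms": 200,         # silence padding kept around detected speech
    "batch_size": 1,              # transcribe each chunk at once, no batching
}
```

Smaller chunks and less padding trade a little accuracy at word boundaries for faster partial results.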
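The model tiers above can be encoded as a simple picker. The thresholds and detection logic below are illustrative, not project code:

```python
# Hypothetical hardware-tier detection; the model names come from the
# list above, everything else is an assumption.
import torch

def pick_model() -> str:
    if torch.cuda.is_available():
        vram_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
        return "whisper-large-v3-turbo" if vram_gb >= 8 else "whisper-base"
    if torch.backends.mps.is_available():
        return "whisper-base"      # Apple Silicon via MPS
    return "whisper-tiny-int8"     # CPU-only fallback, quantized
```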
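For real-time use, the same ffmpeg flags can feed raw PCM straight into Python instead of writing a file. A sketch, assuming a Linux/PulseAudio microphone (swap -f pulse -i default for your platform's capture device, e.g. avfoundation on macOS or dshow on Windows):

```python
# Pipe 16 kHz mono 16-bit PCM from the microphone in 0.3 s chunks.
import subprocess

CHUNK_BYTES = int(16000 * 0.3) * 2  # 0.3 s of 16-bit mono samples

proc = subprocess.Popen(
    ["ffmpeg", "-f", "pulse", "-i", "default",  # capture device: assumed Linux/PulseAudio
     "-ar", "16000", "-ac", "1",                # 16 kHz, mono
     "-f", "s16le", "pipe:1"],                  # raw PCM to stdout
    stdout=subprocess.PIPE,
)
while chunk := proc.stdout.read(CHUNK_BYTES):
    pass  # feed `chunk` to the recognizer here
```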
Finally, it is recommended to add USE_CACHE=false to the project's .env file; turning off intermediate-result caching reduces latency by a further 0.2-0.3 seconds.
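If the project loads its .env with python-dotenv (an assumption), the flag can be read like this; only the USE_CACHE name comes from this answer:

```python
# Hypothetical .env handling; the USE_CACHE flag name is from the answer,
# the loading mechanism is assumed.
import os
from dotenv import load_dotenv

load_dotenv()  # reads the project's .env file
use_cache = os.getenv("USE_CACHE", "true").lower() == "true"
```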
This answer comes from the article "Open source tool for real-time speech to text".