
Delayed Stream Modeling Technology Enables 0.5-Second Low Latency Voice Interaction

2025-08-23

Latency optimization principles of DSM

Kyutai's Delayed Stream Modeling (DSM) technology achieves a latency of about 500 milliseconds through a streaming architecture. Unlike traditional batch models, which wait for the complete audio input before processing begins, DSM treats audio and text as time-aligned streams: the model emits partial text results while the audio is still arriving.
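The idea of two time-aligned streams with a fixed offset can be illustrated with a toy generator. This is only a sketch under stated assumptions, not Kyutai's implementation: the function name `delayed_stream`, the string frames, and the `text(...)` placeholder standing in for a real model are all hypothetical.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, Optional


def delayed_stream(frames: Iterable[str], delay_frames: int) -> Iterator[Optional[str]]:
    """Yield one output per input frame.

    The output produced at step t describes the audio frame received at
    step t - delay_frames. Until that much audio has arrived, the output
    is None (a stand-in for a padding token).
    """
    pending: Deque[str] = deque()
    for frame in frames:
        pending.append(frame)
        if len(pending) > delay_frames:
            # Hypothetical "model": label the oldest buffered frame.
            yield f"text({pending.popleft()})"
        else:
            yield None


if __name__ == "__main__":
    for step, out in enumerate(delayed_stream(["a0", "a1", "a2", "a3"], delay_frames=2)):
        print(step, out)
```

Each input frame produces an output immediately, so the text stream never waits for the end of the recording; it only trails the audio by the chosen number of frames.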

The implementation rests on three techniques. First, a dynamic chunking strategy splits the audio stream based on semantic voice activity detection (VAD). Second, an incremental decoding mechanism starts decoding as soon as sufficient speech features are available. Third, a "flush trick" completes the remaining processing as soon as the end of speech is detected, cutting the latency from 500 milliseconds to 125 milliseconds.
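A minimal sketch of incremental decoding with a flush step is shown below. It is an illustration, not the actual DSM code: the class `ToyStreamingDecoder`, the helper `flush_wait_ms`, and the `text(...)` placeholder are hypothetical. The 4x processing speedup used in the example is an assumption, chosen only because it is consistent with the 500 ms to 125 ms figures quoted above.

```python
from collections import deque
from typing import Deque, List


class ToyStreamingDecoder:
    """Toy decoder whose output trails its input by `delay_frames` frames."""

    def __init__(self, delay_frames: int) -> None:
        self.delay_frames = delay_frames
        self.pending: Deque[str] = deque()

    def push(self, frame: str) -> List[str]:
        """Incremental decoding: emit text once enough frames are buffered."""
        self.pending.append(frame)
        if len(self.pending) > self.delay_frames:
            return [f"text({self.pending.popleft()})"]
        return []

    def flush(self) -> List[str]:
        """Called when end of speech is detected: decode everything that is
        still buffered right away instead of waiting for more audio."""
        out = [f"text({frame})" for frame in self.pending]
        self.pending.clear()
        return out


def flush_wait_ms(delay_ms: float, speedup: float) -> float:
    """Time to work through `delay_ms` of buffered audio when processing
    runs `speedup` times faster than real time (assumed model of the cost)."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return delay_ms / speedup


if __name__ == "__main__":
    decoder = ToyStreamingDecoder(delay_frames=2)
    for frame in ["a0", "a1", "a2"]:
        print(decoder.push(frame))
    print(decoder.flush())            # remaining frames decoded at end of speech
    print(flush_wait_ms(500.0, 4.0))  # assumed 4x faster than real time
```

Without the flush step, the last words would only appear after another full delay of real-time audio had passed; with it, the remaining wait is bounded by how fast the buffered frames can be processed.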

Speech-to-text test data shows that, when running the 1B-parameter model on an L40S GPU, real-time transcription latency for English stays in the 0.45-0.55 second range, while French is slightly higher at about 0.6 seconds. This performance is sufficient for most real-time dialog scenarios.
