Building a conversational partner using AI is easy in concept but hard in execution. The primary barrier to natural conversation isn't the quality of the responses; it's latency.
In human-to-human speech, the average turn-taking gap is roughly 200 milliseconds. Once the gap stretches past 1.5 seconds, the conversation breaks down: it feels like talking over a walkie-talkie rather than holding a natural dialogue.
At Fluenex, we are building a real-time AI speaking coach. To make these sessions feel real, we set a target: audio-to-audio latency must be under 500 milliseconds. Here is how we engineered the solution.
The Latency Equation
In a standard AI speech pipeline, a user's spoken audio goes through four stages:
- VAD & STT: Voice Activity Detection determines when the user finishes speaking, and Speech-to-Text converts the audio to text.
- LLM Inference: The conversational AI processes the text and generates a response.
- TTS Generation: Text-to-Speech synthesizes the text response back into audio.
- Streaming & Playback: The audio buffer is streamed back to the user's browser.
If executed sequentially, the total lag is often 2 to 4 seconds. To beat this, we optimized every link in this chain.
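To see why overlapping the stages matters, here is a rough back-of-the-envelope sketch. The per-stage numbers are illustrative assumptions, not Fluenex measurements, and `StageTiming`, `sequential_latency`, and `streamed_latency` are hypothetical names. The model is deliberately simple: in the sequential case each stage waits for the previous one to finish, while in the streamed case each stage starts as soon as the previous one emits its first output.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StageTiming:
    name: str
    first_output_ms: float  # time until the stage emits its first usable output
    total_ms: float         # time until the stage has processed the whole turn


# Illustrative values only (assumptions, not measured data).
STAGES: List[StageTiming] = [
    StageTiming("VAD & STT", first_output_ms=150, total_ms=600),
    StageTiming("LLM inference", first_output_ms=200, total_ms=1500),
    StageTiming("TTS generation", first_output_ms=100, total_ms=800),
    StageTiming("Streaming & playback", first_output_ms=50, total_ms=100),
]


def sequential_latency(stages: List[StageTiming]) -> float:
    """Every stage waits for the previous stage to finish completely."""
    return sum(stage.total_ms for stage in stages)


def streamed_latency(stages: List[StageTiming]) -> float:
    """Every stage starts on the previous stage's first output."""
    return sum(stage.first_output_ms for stage in stages)


if __name__ == "__main__":
    print(f"sequential: {sequential_latency(STAGES):.0f} ms")
    print(f"streamed:   {streamed_latency(STAGES):.0f} ms")
```

Under these assumed numbers the sequential pipeline takes 3000 ms before the learner hears anything, while the streamed pipeline reaches first audio in 500 ms. The point of the sketch is that time to first audio depends on how quickly each stage produces its first output, not on how long it takes to finish.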
1. WebRTC & Stream-Based Transcription
Instead of recording audio files locally and uploading them at the end of a sentence, we stream the user's audio continuously to our servers using WebRTC.
A lightweight Voice Activity Detection (VAD) model runs directly at the edge. The moment it detects that the user has stopped speaking, we finalize the streaming Whisper transcription, which returns the final transcript in under 150 ms.
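The endpointing logic can be sketched roughly as follows. This is not the VAD model described above: it swaps in a plain energy threshold with a "hangover" counter as a stand-in, and the class name, the threshold, and the frame counts are all assumptions chosen for illustration. Frames are assumed to be short, fixed-length lists of float samples in the range -1.0 to 1.0.

```python
import math
from typing import Sequence


def rms(frame: Sequence[float]) -> float:
    """Root-mean-square energy of one audio frame."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


class EnergyEndpointer:
    """Declares end-of-utterance after a run of quiet frames following speech.

    Hypothetical stand-in for a trained VAD model.
    """

    def __init__(self, threshold: float = 0.02, hangover_frames: int = 10):
        self.threshold = threshold              # assumed energy cutoff
        self.hangover_frames = hangover_frames  # quiet frames required to end a turn
        self._in_speech = False
        self._silent_run = 0

    def push(self, frame: Sequence[float]) -> bool:
        """Feed one frame; return True once, on the frame that ends the utterance."""
        if rms(frame) >= self.threshold:
            self._in_speech = True
            self._silent_run = 0
            return False
        if not self._in_speech:
            return False  # leading silence before any speech
        self._silent_run += 1
        if self._silent_run >= self.hangover_frames:
            self._in_speech = False
            self._silent_run = 0
            return True
        return False


if __name__ == "__main__":
    silent, loud = [0.0] * 320, [0.5] * 320
    endpointer = EnergyEndpointer()
    frames = [silent] * 3 + [loud] * 5 + [silent] * 12
    print([i for i, f in enumerate(frames) if endpointer.push(f)])
```

One design point worth noting: the hangover window is added directly to the perceived latency. With 20 ms frames, for example, 10 quiet frames means 200 ms of waiting before the turn is considered finished, so this value trades responsiveness against the risk of cutting the speaker off mid-pause.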
"By streaming chunked audio packets during speech, we perform transcription concurrently, reducing post-sentence transcription lag to almost zero."
2. Stream-to-Stream Pipeline Routing
Instead of waiting for the Large Language Model (LLM) to generate a full paragraph before sending it to the Text-to-Speech (TTS) engine, we feed the LLM output directly into the TTS engine as a stream of tokens.
The moment the LLM outputs the first 3 to 5 words (enough to make a phonetic phrase), the TTS engine begins synthesizing. The client browser begins playing the first audio buffer while subsequent sentences are still being generated in the cloud.
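A minimal sketch of the token-to-phrase hand-off might look like the following. The function name `phrase_chunks`, the `min_words` default, and the punctuation set are assumptions; a real integration would pass each yielded phrase to whatever streaming TTS client is in use. The chunker flushes on phrase-ending punctuation, or once enough complete words have accumulated, and it treats a word as complete only when a space follows it, since LLM tokens often split words.

```python
from typing import Iterable, Iterator

# Punctuation that is assumed to mark a natural break for synthesis.
PHRASE_END = (".", "!", "?", ",", ";", ":")


def phrase_chunks(tokens: Iterable[str], min_words: int = 4) -> Iterator[str]:
    """Group a stream of LLM tokens into short phrases for a TTS engine."""
    buffer = ""
    for token in tokens:
        buffer += token
        stripped = buffer.strip()
        # Flush immediately on punctuation that ends a phrase.
        if stripped.endswith(PHRASE_END):
            yield stripped
            buffer = ""
            continue
        # Otherwise flush once min_words complete words are buffered.
        # Text after the last space may be a partial word, so keep it back.
        head, sep, tail = buffer.rpartition(" ")
        if sep and len(head.split()) >= min_words:
            yield head.strip()
            buffer = tail
    # Emit whatever remains when the token stream ends.
    if buffer.strip():
        yield buffer.strip()


if __name__ == "__main__":
    stream = ["Hel", "lo", " there", ",", " how", " are", " you",
              " doing", " today", "?"]
    for phrase in phrase_chunks(stream):
        print(repr(phrase))
```

For the sample stream this yields `'Hello there,'`, then `'how are you doing'`, then `'today?'`. A smaller `min_words` starts audio sooner but gives the synthesizer less context for prosody, which is presumably why a 3 to 5 word window is a reasonable middle ground.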
3. Pronunciation & Grammar Scorecard Pipeline
To prevent grammar analytics and feedback parsing from slowing down the speech loop, we decoupled them from it entirely.
The main chat loop handles only direct speech output. In the background, a secondary thread processes the raw audio and transcription to generate grammar feedback and pronunciation scores, which appear on the user's dashboard only after the turn is complete.
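The decoupling can be illustrated with a small sketch. It uses `asyncio` background tasks rather than a dedicated thread, which is an assumption made for brevity, and `speak_reply`, `score_turn`, and `handle_turn` are hypothetical stand-ins with sleeps in place of real work. The key property is that the reply path never awaits the scoring task.

```python
import asyncio
from typing import Dict, List, Set, Tuple


async def speak_reply(transcript: str) -> str:
    """Hypothetical stand-in for the latency-critical LLM -> TTS path."""
    await asyncio.sleep(0.01)
    return f"reply to: {transcript}"


async def score_turn(transcript: str, dashboard: List[Dict]) -> None:
    """Hypothetical stand-in for slow grammar and pronunciation analysis."""
    await asyncio.sleep(0.2)
    dashboard.append({"transcript": transcript,
                      "word_count": len(transcript.split())})


async def handle_turn(transcript: str, dashboard: List[Dict],
                      background: Set[asyncio.Task]) -> str:
    # Start scoring without awaiting it, so it cannot delay the reply.
    task = asyncio.create_task(score_turn(transcript, dashboard))
    # Hold a reference so the task is not garbage-collected mid-flight.
    background.add(task)
    task.add_done_callback(background.discard)
    return await speak_reply(transcript)


async def demo() -> Tuple[str, int, int]:
    dashboard: List[Dict] = []
    background: Set[asyncio.Task] = set()
    reply = await handle_turn("I has went to school", dashboard, background)
    scored_when_reply_ready = len(dashboard)  # scoring is still running here
    await asyncio.gather(*background)         # let background work finish
    return reply, scored_when_reply_ready, len(dashboard)


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

If the analysis is CPU-bound, as pronunciation scoring on raw audio may well be, a task on the same event loop would still compete with the speech path; in that case handing the work to a thread or process pool, or to a separate worker service, would be the safer choice.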
The Result
By combining edge-based VAD, WebRTC streaming, and token-level streaming between the LLM and TTS, we reduced average conversational round-trip latency to 420 ms in testing.
This speed changes the experience. Conversations feel natural and fluid, training learners to listen, process, and respond in English at a human pace.