Real-time speech translation has become fast enough to feel nearly instantaneous.
Yet the way it achieves this speed introduces a different problem: the translated speech often arrives in fragments. A system speaks once it has enough source context to generate a partial translation, then waits until the next meaningful unit becomes available.
These wait-and-speak cycles introduce repeated silences between translated chunks, making the output sound less like continuous speech and more like a sequence of disconnected bursts — ultimately degrading listeners’ judgments of translation quality.
This is the question behind NaturalFlow. Instead of treating silence as unavoidable, we ask whether the model can use the diversity of expression to smooth the flow.
An LLM-based translator can express the same meaning in many faithful ways. Some are short and direct; others take slightly longer to say. NaturalFlow learns to choose translations that are not just accurate, but timed for speech — using natural paraphrases to keep talking while the source audio continues to arrive.
Hear the difference for yourself. 🎧