NaturalFlow: Optimizing Speech Flow in Simultaneous Speech-to-Speech Translation

Dongwook Lee¹* Youngho Cho¹* Sangkwon Park¹ Heeseung Kim²† Sungroh Yoon¹†

Seoul National University¹ University of Seoul²

*Equal contribution †Corresponding authors

NaturalFlow optimizes speech flow in simultaneous speech-to-speech translation by reducing pauses and silence ratios while preserving translation quality and latency.

Real-time speech translation has become fast enough to feel nearly instantaneous.

Yet the way it achieves this speed introduces a different problem: the translated speech often arrives in fragments. A system speaks once it has enough source context to generate a partial translation, then waits until the next meaningful unit becomes available.

These wait-and-speak cycles introduce repeated silences between translated chunks, making the output sound less like continuous speech and more like a sequence of disconnected bursts — ultimately degrading listeners’ judgments of translation quality.

But should real-time translation have to sound fragmented?

This is the question behind NaturalFlow. Instead of treating silence as unavoidable, we ask whether the model can exploit the many ways of phrasing the same meaning to smooth the flow of its speech.

The idea

An LLM-based translator can express the same meaning in many faithful ways. Some are short and direct; others take slightly longer to say. NaturalFlow learns to choose translations that are not just accurate, but timed for speech — using natural paraphrases to keep talking while the source audio continues to arrive.

Figure: Comparison of a baseline simultaneous translation, broken up by pauses, against NaturalFlow's continuous output, which paraphrases to keep speaking.

A baseline system stalls between chunks, leaving the listener waiting through repeated silences. NaturalFlow chooses a slightly longer paraphrase, *"awarded during the championship phase"*, that keeps the speech flowing continuously while the source audio is still arriving.

Hear the difference for yourself. 🎧

Demonstrating Enhanced Fluency: Minimized Pauses and Silence Ratios

These samples demonstrate how NaturalFlow significantly reduces the frequency of unnatural pauses and the overall silence ratio compared to baseline models. By optimizing for fluency-aligned preferences, our model delivers a more continuous and natural simultaneous speech-to-speech translation experience.
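To make the two quantities mentioned above concrete, here is a minimal sketch of how a pause count and a silence ratio could be computed from the timestamps of the spoken chunks. The function name `flow_stats`, the 0.3-second pause threshold, and the choice to measure silence over the span from the first chunk's start to the last chunk's end are illustrative assumptions, not the definitions used in the NaturalFlow evaluation.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec) of one spoken chunk


def flow_stats(segments: List[Segment], min_pause: float = 0.3) -> Tuple[int, float]:
    """Return (pause_count, silence_ratio) for a list of speech segments.

    Assumptions (illustrative, not from the paper):
    - the output spans from the first segment start to the last segment end;
    - a gap counts as a pause only if it lasts at least `min_pause` seconds.
    """
    if not segments:
        return 0, 0.0

    segs = sorted(segments)
    first_start = segs[0][0]
    prev_end = segs[0][1]
    silence = 0.0
    pauses = 0

    for start, end in segs[1:]:
        gap = start - prev_end
        if gap > 0:
            silence += gap          # every gap contributes to total silence
            if gap >= min_pause:
                pauses += 1         # only long gaps count as pauses
        prev_end = max(prev_end, end)

    span = prev_end - first_start
    ratio = silence / span if span > 0 else 0.0
    return pauses, ratio


# Hypothetical timings: a wait-and-speak output versus a smoother one.
fragmented = [(0.0, 1.0), (2.0, 3.0), (4.0, 5.0)]
smooth = [(0.0, 2.4), (2.5, 5.0)]

print(flow_stats(fragmented))  # 2 pauses, silence ratio 0.4
print(flow_stats(smooth))      # 0 pauses, small silence ratio
```

Under these assumptions, the fragmented timings spend 40% of the output in silence across two pauses, while the smoother timings contain no gap long enough to count as a pause.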

Short-form audio examples

Dataset: CVSS-C

Long-form audio examples

Dataset: mTEDx

Multistream Visualization

Each player is stereo: the left channel carries the source audio and the right channel carries the model output. Use headphones for clear channel separation.

Dataset: VoxPopuli
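For readers who would rather inspect the two streams separately than listen over headphones, the following is a minimal sketch of splitting a stereo recording into its source (left) and model-output (right) channels. It assumes 16-bit little-endian PCM audio, as found in typical WAV files; the function name `split_stereo_pcm16` and the file name in the usage comment are hypothetical, and the actual demo files may use a different encoding.

```python
import struct
from typing import List, Tuple


def split_stereo_pcm16(frames: bytes) -> Tuple[List[int], List[int]]:
    """De-interleave 16-bit little-endian stereo PCM into (left, right).

    Stereo PCM stores samples as L, R, L, R, ...; each sample is 2 bytes,
    so one stereo frame occupies 4 bytes.
    """
    if len(frames) % 4 != 0:
        raise ValueError("expected whole stereo frames of 4 bytes each")
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    left = list(samples[0::2])   # even positions: left channel (source)
    right = list(samples[1::2])  # odd positions: right channel (model output)
    return left, right


# Usage with the standard-library wave module (file name is hypothetical):
#
#   import wave
#   with wave.open("sample.wav", "rb") as w:
#       assert w.getnchannels() == 2 and w.getsampwidth() == 2
#       source, output = split_stereo_pcm16(w.readframes(w.getnframes()))
```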