When we say Steno delivers sub-second transcription, we mean something very specific: the time between you releasing the hotkey and seeing your transcribed text appear at the cursor is less than one second. Not "fast." Not "near-instant." Under one second, measured with a stopwatch. This article breaks down exactly how that works and why it matters more than you might think.

Defining Sub-Second Transcription

Most people have an intuitive sense of what "fast" means, but the technical definition matters here. Sub-second transcription refers to the total processing latency, which is the time from when audio recording stops to when the finished text is inserted into the active application. This excludes speaking time, since that is determined by you, not the software. It measures only the time the system needs to do its work.

To put this in perspective, the average human blink takes 150-400 milliseconds. A sub-second transcription means your text appears in the time it takes to blink two or three times. This is fast enough that the transcription feels simultaneous with the act of stopping speech.
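The definition above is concrete enough to express in code. This sketch (Python for illustration; Steno itself is native Swift) times only the work between "recording stopped" and "text inserted", using a stand-in pipeline in place of the real stages:

```python
import time

def measure_latency_ms(pipeline) -> float:
    """Time only the system's work: from 'recording stopped' to 'text inserted'.
    Speaking time is excluded because the user controls it, not the software."""
    start = time.perf_counter()
    pipeline()  # finalize audio -> compress -> transcribe -> insert text
    return (time.perf_counter() - start) * 1000

# Stand-in pipeline: sleep 50 ms instead of doing real work.
latency = measure_latency_ms(lambda: time.sleep(0.05))
```

Sub-second transcription simply means this measurement stays below 1000 for real dictations.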

The Latency Breakdown

Every millisecond in the transcription pipeline has a source. Here is where the time goes in Steno's architecture:

Audio Finalization: 10-30ms

When you release the hotkey, Steno needs to finalize the audio buffer, flush any remaining samples from the microphone, and prepare the audio data for transmission. This happens in native Swift code running on the main thread with direct access to Core Audio. There is no framework overhead and no garbage collector to cause pauses. This step is consistently under 30 milliseconds.

Audio Compression: 20-50ms

Raw audio from the microphone is too large to transmit efficiently. Steno compresses the audio before sending it to the API. The compression is optimized for speed over file size, producing a compact payload without spending excessive time on encoding. For a typical 5-10 second dictation, compression takes 20-50 milliseconds.
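Steno's actual codec is not specified here, but the speed-versus-size tradeoff is easy to demonstrate. As a stand-in, this sketch compresses one second of synthetic 16-bit PCM with zlib at its fastest setting, the same "favor speed over ratio" choice described above:

```python
import array
import math
import time
import zlib

# One second of synthetic 16-bit PCM at 16 kHz: a 400 Hz tone.
pcm = array.array("h", (int(10_000 * math.sin(2 * math.pi * 400 * n / 16_000))
                        for n in range(16_000))).tobytes()

start = time.perf_counter()
compressed = zlib.compress(pcm, level=1)  # level 1: fastest, not smallest
elapsed_ms = (time.perf_counter() - start) * 1000

ratio = len(compressed) / len(pcm)  # payload shrinks; encode time stays tiny
```

A higher compression level would shrink the payload further, but the extra encode time would eat into the latency budget for little network savings on a 5-10 second clip.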

Network Transmission: 30-100ms

The compressed audio needs to travel from your Mac to Groq's API servers. For most users in North America, this round-trip network latency is 30-80 milliseconds. Users in Europe or Asia may see higher latencies of 80-150ms depending on their distance from the nearest Groq endpoint, which can push this stage slightly past the typical budget. Steno uses HTTP/2 with connection pooling, so there is no connection setup overhead after the first request.
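The point of connection pooling is that the TCP and TLS handshake cost is paid once per connection rather than once per request. A toy model makes the amortization visible (the 120 ms setup and 50 ms round-trip figures are invented for illustration, not measured):

```python
class PooledConnection:
    """Toy model of connection reuse: pay the handshake cost once,
    then every later request pays only the round-trip time."""
    SETUP_MS = 120  # hypothetical TCP + TLS handshake cost
    RTT_MS = 50     # hypothetical round-trip latency

    def __init__(self):
        self.connected = False

    def request_cost_ms(self) -> int:
        cost = self.RTT_MS
        if not self.connected:
            cost += self.SETUP_MS  # first request sets up the connection
            self.connected = True
        return cost

conn = PooledConnection()
first = conn.request_cost_ms()   # handshake + round trip
second = conn.request_cost_ms()  # round trip only
```

With pooling, every dictation after the first sees only the round-trip cost, which is why the stage budget above can stay under 100ms.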

Groq Whisper Processing: 150-400ms

This is where the heavy lifting happens, and it is also where Steno's architecture provides the biggest advantage. Groq runs the Whisper large-v3 model on their custom Language Processing Units (LPUs), which are purpose-built for sequential inference workloads. Unlike GPUs, which are optimized for parallel computation, LPUs are designed to process the sequential token-by-token generation that language models require. The result is that Groq can transcribe a 10-second audio clip in 150-400 milliseconds, roughly 5-10x faster than GPU-based Whisper APIs.
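A minimal client for this stage can be sketched with the standard library alone. The route (`/openai/v1/audio/transcriptions`) and model name (`whisper-large-v3`) are assumptions based on Groq's OpenAI-compatible API; check Groq's documentation before relying on either:

```python
import io
import json
import urllib.request
import uuid

# Assumed OpenAI-compatible endpoint; verify against Groq's API docs.
GROQ_URL = "https://api.groq.com/openai/v1/audio/transcriptions"

def build_multipart(model: str, audio: bytes, filename: str = "clip.flac"):
    """Encode the model name and audio bytes as a multipart/form-data body."""
    boundary = uuid.uuid4().hex
    parts = [
        f'--{boundary}\r\nContent-Disposition: form-data; '
        f'name="model"\r\n\r\n{model}\r\n'.encode(),
        f'--{boundary}\r\nContent-Disposition: form-data; name="file"; '
        f'filename="{filename}"\r\n'
        f'Content-Type: application/octet-stream\r\n\r\n'.encode(),
        audio,
        f'\r\n--{boundary}--\r\n'.encode(),
    ]
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"

def transcribe(audio: bytes, api_key: str) -> str:
    """POST the audio and return the transcribed text."""
    body, content_type = build_multipart("whisper-large-v3", audio)
    req = urllib.request.Request(GROQ_URL, data=body, method="POST", headers={
        "Authorization": f"Bearer {api_key}",
        "Content-Type": content_type,
    })
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["text"]

# Building the payload is local and fast; only the POST touches the network.
payload, content_type = build_multipart("whisper-large-v3", b"\x00\x01\x02\x03")
```

Everything here except the POST itself is microseconds of work; the 150-400ms budget is almost entirely LPU inference time on Groq's side.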

Response Parsing and Text Insertion: 5-15ms

When the transcription response arrives, Steno parses the JSON, extracts the text, and inserts it at the current cursor position using macOS accessibility APIs. This final step is negligible in terms of latency.

Total: 215-595ms

Adding up all the stages, the typical total latency for Steno is 215-595 milliseconds, well under the one-second threshold. In practice, most dictations complete in 300-500ms under normal conditions.
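The total is nothing more than the sum of the per-stage budgets listed above, which is easy to verify:

```python
# Per-stage (best, worst) latency budgets in milliseconds, from the breakdown above.
stages = {
    "audio_finalization": (10, 30),
    "compression":        (20, 50),
    "network":            (30, 100),
    "groq_whisper":       (150, 400),
    "parse_and_insert":   (5, 15),
}

best = sum(lo for lo, _ in stages.values())    # 215 ms
worst = sum(hi for _, hi in stages.values())   # 595 ms
```

Even the worst case leaves roughly 400ms of headroom before the one-second threshold.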

Why One Second Is the Magic Number

The one-second threshold is not arbitrary. It comes from decades of human-computer interaction research. Jakob Nielsen's response time guidelines, first published in 1993 and validated repeatedly since, identify three critical thresholds: 0.1 seconds, below which a response feels instantaneous; 1 second, the limit for keeping the user's flow of thought uninterrupted; and 10 seconds, the limit for keeping the user's attention on the task at all.

For dictation, the one-second threshold is the boundary between "this feels like talking" and "this feels like waiting." When transcription completes in under a second, you finish speaking, glance at the screen, and the text is already there. Your brain barely registers the gap. When it takes two or three seconds, you find yourself watching the screen, waiting, and that waiting breaks the continuity between your thought and the written word.

The Flow State Connection

Flow state, the psychological condition of being fully immersed in a task, is fragile. Research by Mihaly Csikszentmihalyi and subsequent studies show that interruptions of just a few seconds can break flow, and re-entering flow can take 10-15 minutes. Every time your dictation app makes you wait, it risks breaking your flow.

Sub-second transcription keeps you in flow because the tool becomes invisible. You think, you speak, you see text. There is no gap for your attention to wander into. This is why Steno users consistently report that dictation "feels different" from other apps. It is not just faster in a measurable sense; it is faster in a way that changes the subjective experience of writing.

What Happens Above One Second

To appreciate what sub-second transcription gives you, consider what happens when latency is higher. At 2-3 seconds, you develop a habit of pausing after speaking to watch the screen. This pause breaks the think-speak-think rhythm and introduces dead time into your workflow. At 4-5 seconds, you start to wonder if the transcription failed. You might click the dictation button again, accidentally triggering a duplicate. At 10+ seconds, dictation stops feeling like a productivity tool and starts feeling like a chore.

These are not hypothetical scenarios. They are the daily reality for users of dictation apps that rely on standard GPU-hosted transcription APIs or local processing on consumer hardware.

Can Sub-Second Be Achieved Locally?

A common question is whether sub-second transcription is possible without a network connection, running Whisper entirely on the Mac's hardware. The answer is: not with the large model, and not consistently. Running Whisper large-v3 locally on an M3 Pro typically produces results in 3-8 seconds for a 10-second audio clip. The small model is faster at 1-3 seconds, but with noticeably lower accuracy. Only Groq's specialized hardware currently delivers large-model accuracy at sub-second speed.

That said, Steno does offer an offline mode using Apple's on-device speech recognition for situations where you have no internet connection. The accuracy is lower and the latency is higher, but it ensures you can always dictate.

The Pursuit of Speed

Sub-second transcription is not just a spec sheet number. It is the difference between a dictation tool that feels like a natural extension of your voice and one that feels like a slow, clunky intermediary. Steno achieves it through a combination of native Swift code, optimized audio handling, and Groq's purpose-built inference hardware. The result is a dictation experience where the technology disappears and only your words remain.

Try Steno free at stenofast.com, with Pro features available for $4.99/month.

Speed is not about convenience. It is about preserving the connection between your thoughts and your words. Sub-second transcription keeps that connection intact.