Live transcription — converting speech to text in real time as the words are spoken — has moved from a research project to a practical, daily-use technology in the span of just a few years. The gap between the moment you finish speaking and the moment your words appear as text has shrunk from several seconds to under a second in the best implementations. Understanding how live transcription works, and what makes some implementations dramatically faster than others, helps you choose the right tool and use it most effectively.
How Live Transcription Works
At its core, live transcription involves three steps that must happen in rapid succession: audio capture, speech processing, and text delivery. The speed of each step and the efficiency of the handoffs between them determine the total latency a user experiences.
Audio Capture
The microphone captures your speech as a continuous audio stream. The quality of this capture — sample rate, bit depth, noise floor — directly affects how much useful information the speech model receives. A microphone close to your mouth with low ambient noise provides the clearest signal. Most speech models are trained on 16kHz mono audio, which is sufficient for human speech frequencies and keeps file sizes and processing loads manageable.
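To put those capture parameters in concrete terms, here is a quick sketch of the data rate implied by 16kHz mono audio, assuming the common 16-bit sample depth:

```python
# Data rate of a typical speech-capture stream: 16kHz, mono, 16-bit samples.
SAMPLE_RATE_HZ = 16_000   # samples per second
CHANNELS = 1              # mono
BYTES_PER_SAMPLE = 2      # 16-bit PCM

bytes_per_second = SAMPLE_RATE_HZ * CHANNELS * BYTES_PER_SAMPLE
print(bytes_per_second)          # 32000 bytes/s, i.e. about 32 kB per second
print(bytes_per_second * 60)     # 1920000 bytes, roughly 1.9 MB per minute
```

At about 32 kB per second of raw audio, even lengthy dictation sessions stay small enough to buffer and transmit without strain.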
Speech Processing
The captured audio is passed to a speech recognition model that converts acoustic patterns into probable word sequences. In cloud-based live transcription systems, this audio is streamed over the internet to a remote server where the model runs on specialized hardware. The model processes audio in small chunks — typically 200 to 500 milliseconds at a time — and returns text predictions that are assembled into the final transcript.
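The chunking step above can be sketched as follows; the 300 millisecond chunk length is an illustrative value within the 200 to 500 millisecond range, not a figure from any particular system:

```python
# Sketch: slicing a PCM sample stream into fixed-size chunks for a recognizer.
SAMPLE_RATE_HZ = 16_000
CHUNK_MS = 300  # illustrative chunk length within the 200-500 ms range
CHUNK_SAMPLES = SAMPLE_RATE_HZ * CHUNK_MS // 1000  # 4800 samples per chunk

def chunks(samples):
    """Yield successive fixed-length chunks; the final partial chunk is kept."""
    for start in range(0, len(samples), CHUNK_SAMPLES):
        yield samples[start:start + CHUNK_SAMPLES]

# One second of audio (16,000 samples) splits into three full 300 ms chunks
# plus a 100 ms remainder.
audio = [0] * SAMPLE_RATE_HZ
print([len(c) for c in chunks(audio)])  # [4800, 4800, 4800, 1600]
```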
The two main approaches to live processing are streaming and segment-based. Streaming systems attempt to produce partial transcriptions while you are still speaking, updating the display in real time. Segment-based systems wait for you to finish a natural speech unit — a phrase or sentence — then process the whole segment at once. Streaming feels more immediate but has higher error rates on the partial results; segment-based processing is more accurate but requires a brief pause before text appears.
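The difference between the two approaches shows up in the shape of their output. Here is a minimal sketch, with made-up hypotheses, of what each style delivers for the same utterance:

```python
# Sketch contrasting the two delivery patterns for the same utterance.
# The hypotheses are invented; only the shape of the output matters.

def streaming_output():
    # Partial results arrive while the user is still speaking and may be
    # revised before the final result lands.
    yield ("partial", "the quick")
    yield ("partial", "the quick brown")     # earlier text can still change
    yield ("final",   "the quick brown fox")

def segment_output():
    # Nothing is shown until the segment ends; one accurate result arrives.
    yield ("final", "the quick brown fox")

# Both styles converge on the same final text; they differ in what the user
# sees along the way.
assert list(streaming_output())[-1] == list(segment_output())[-1]
```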
Text Delivery
Once processed, the transcription must be delivered to wherever the text needs to appear. In a web-based transcription tool, this means updating the UI within the browser. In a system-level tool like Steno, this means injecting the text at the cursor position in whatever application is currently focused. System-level text injection is technically more complex but dramatically more useful because it works in any application without requiring you to copy and paste.
What Determines Live Transcription Latency
Users experience latency as the time between finishing a spoken phrase and seeing that phrase as text. Several factors control this delay.
Network Latency
Cloud-based transcription must send audio data to a remote server and receive transcribed text back. Round-trip network latency — the time for this data to travel both ways — contributes directly to the total delay. Servers geographically closer to the user reduce this component. Systems that pre-buffer audio locally and send it in a single burst after you stop speaking are less affected by network variability than streaming systems that send data continuously.
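For a rough sense of what burst-mode sending costs, the sketch below estimates the upload time for a five-second utterance of raw 16kHz 16-bit mono audio; the 10 Mbit/s uplink is an illustrative assumption, not a requirement:

```python
# Illustrative upload-time estimate for sending a whole recording in one burst.
utterance_s = 5
bytes_total = 16_000 * 2 * utterance_s   # 160,000 bytes of raw 16-bit PCM
uplink_bytes_per_s = 10_000_000 / 8      # assumed 10 Mbit/s uplink = 1.25 MB/s

upload_ms = bytes_total / uplink_bytes_per_s * 1000
print(round(upload_ms))  # 128 ms to ship the full recording in one request
```

In practice compressed audio codecs shrink this payload further, so the upload contributes only a modest slice of the total latency budget.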
Inference Speed
The speech model itself takes time to process audio. Larger, more accurate models take longer to run. The best live transcription systems balance accuracy and speed by using highly optimized inference infrastructure — purpose-built hardware like GPUs or dedicated AI accelerators — to run models faster than standard compute would allow.
Segment Boundaries
The system must decide when you have finished speaking before it can process and return a complete transcription. Systems that detect the natural end of a phrase quickly can return text faster than systems with conservative silence detection that waits longer to be sure you have stopped.
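Silence-based endpointing can be sketched as a simple energy check with a hangover period; the threshold and timings below are illustrative values, not tuned parameters from any real system:

```python
# Sketch of silence-based endpointing: the utterance is considered finished
# once energy stays below a threshold for `hangover_ms` of consecutive audio.
def end_of_utterance(chunk_energies, chunk_ms=100, threshold=0.01,
                     hangover_ms=500):
    """Return the index of the chunk where the endpoint fires, or None."""
    needed = hangover_ms // chunk_ms  # silent chunks required in a row
    silent_run = 0
    for i, energy in enumerate(chunk_energies):
        silent_run = silent_run + 1 if energy < threshold else 0
        if silent_run >= needed:
            return i
    return None

# Speech (high energy) followed by silence: the endpoint fires only after
# five consecutive quiet 100 ms chunks, i.e. 500 ms after speech ended.
energies = [0.5, 0.6, 0.4] + [0.001] * 6
print(end_of_utterance(energies))  # 7
```

The hangover period is the unavoidable cost of inferring the endpoint from acoustics alone: shorten it and the system cuts off slow speakers mid-sentence; lengthen it and every utterance pays the extra wait.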
Steno uses a hold-to-speak model that solves this problem elegantly: the release of the hotkey is the explicit signal that you have finished speaking. The system processes the complete recording immediately upon key release, producing text within 500 to 800 milliseconds in most cases. This is faster than toggle-based systems that must infer when you have stopped speaking from acoustic cues alone.
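For a rough sense of how a window like 500 to 800 milliseconds might break down, here is an illustrative latency budget; the individual figures are assumptions for the sketch, not measured values from Steno:

```python
# Back-of-the-envelope latency budget for a hold-to-speak system.
# Each figure is an illustrative assumption, not a measurement.
budget_ms = {
    "audio upload (burst after key release)": 150,
    "model inference": 350,
    "text download + injection": 100,
}

total = sum(budget_ms.values())
print(total)  # 600 ms, inside the 500-800 ms window described above
```

A toggle-based system pays every one of these costs too, plus the silence-detection hangover on top, which is why explicit key-release endpointing comes out ahead.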
Live Transcription Use Cases
Dictation for Text Entry
The most common live transcription use case is personal dictation: converting your own spoken words into typed text in real time. This is Steno's primary use case — the app sits in your Mac menu bar, and any time you need to type something, you can speak it instead. The transcription happens so quickly that the conversion feels effectively instant.
Meeting Transcription
Live transcription of meetings — where the audio from a group conversation is converted to text in real time — requires multi-speaker models that can distinguish and label different speakers. Dedicated meeting transcription tools like Otter.ai and Fireflies handle this specific scenario. Steno focuses on personal dictation rather than multi-speaker meeting transcription.
Accessibility
For users who cannot type due to mobility impairments, live transcription is not a productivity tool — it is a fundamental access tool. The accuracy and latency requirements for accessibility use are the most demanding: errors and delays are not minor inconveniences but real barriers. The improvements in live transcription technology over the past few years have made voice-based computing access dramatically more viable for this population.
Lecture and Presentation Captioning
Live transcription appears in real time as captions during lectures, presentations, and classroom settings. This benefits users who are deaf or hard of hearing, non-native language speakers, and anyone in an acoustically challenging environment. Dedicated captioning systems from Apple (Live Captions in macOS) and others handle this use case.
The Future of Live Transcription
Live transcription accuracy continues to improve, and latency continues to decrease. The most significant near-term developments are in noise robustness — maintaining accuracy in challenging acoustic environments — and in edge processing, where the transcription happens on-device rather than in the cloud, eliminating network latency entirely. Apple Silicon's Neural Engine is already enabling on-device transcription quality that rivals cloud-based solutions for many use cases.
For most users, the live transcription available today through tools like Steno is already fast enough and accurate enough to replace typing as the primary text input method for sustained writing tasks. Download Steno at stenofast.com to experience sub-second live transcription on your Mac or iPhone.
When live transcription reaches sub-second latency with high accuracy, it stops feeling like a technology and starts feeling like telepathy — thoughts becoming text with no perceptible delay.