When you hold the Steno hotkey and speak a sentence, the words appear on screen within a fraction of a second. To anyone who last tried voice dictation ten years ago and found it a slow, error-prone frustration, this speed can feel almost magical. It is not magic — it is a well-engineered pipeline that has been optimized at every stage. Understanding what happens from voice to text helps you understand why some tools perform better than others, and how to get the best results from whichever tool you use.

Stage 1: Audio Capture

The journey from voice to text starts with your microphone. The microphone converts acoustic pressure waves — the physical compression of air produced by your vocal cords and mouth shaping — into an electrical signal. This signal is then digitized by your device's audio hardware at a fixed sampling rate: speech recognition systems commonly work at 16,000 samples per second, while consumer audio hardware often records at 44,100 or 48,000 and downsamples for the recognizer.
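
To make the digitization step concrete, here is a back-of-envelope calculation (a sketch, not Steno's actual wire format) of the raw data rate for 16-bit mono PCM at those sampling rates:

```python
# Illustrative raw PCM data rates for digitized speech.
def pcm_bytes_per_second(sample_rate_hz: int, bits_per_sample: int = 16,
                         channels: int = 1) -> int:
    """Raw PCM data rate in bytes per second."""
    return sample_rate_hz * (bits_per_sample // 8) * channels

print(pcm_bytes_per_second(16_000))   # 32000 bytes/s at 16 kHz
print(pcm_bytes_per_second(44_100))   # 88200 bytes/s at 44.1 kHz
```

A few seconds of dictation is therefore only a few hundred kilobytes of raw audio, small enough to upload quickly even on modest connections.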

Quality at this stage matters more than most people realize. A poor microphone or excessive background noise introduces acoustic information that is not part of your speech signal, and the recognition engine has to spend computational effort distinguishing your voice from ambient sound. This is one reason why using a dedicated headset microphone or earbuds with an inline mic produces better accuracy than relying on a built-in laptop microphone in a noisy environment.
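
The effect of background noise can be framed as signal-to-noise ratio (SNR). The sketch below uses synthetic numpy signals (not measurements of any real microphone) to show how a noisier room collapses the SNR the recognition engine has to work with:

```python
import numpy as np

def snr_db(signal: np.ndarray, noise: np.ndarray) -> float:
    """SNR in decibels: 10 * log10(signal power / noise power)."""
    return 10.0 * np.log10(np.mean(signal**2) / np.mean(noise**2))

rng = np.random.default_rng(0)
# A pure tone stands in for the speech signal; Gaussian noise for the room.
speech = np.sin(2 * np.pi * 220 * np.linspace(0, 1, 16_000))
quiet_room = 0.01 * rng.standard_normal(16_000)
noisy_room = 0.3 * rng.standard_normal(16_000)

print(round(snr_db(speech, quiet_room), 1))  # high SNR: easy to recognize
print(round(snr_db(speech, noisy_room), 1))  # low SNR: engine works harder
```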

Steno captures audio while you hold the hotkey and stops when you release it. This press-to-activate model means the system only records intentional speech, which eliminates false activations from ambient conversation and keeps each recording short and focused.

Stage 2: Audio Preprocessing

Before the audio reaches the recognition model, it typically undergoes preprocessing. This can include noise reduction (filtering out consistent background sounds like fan noise or air conditioning), normalization (adjusting the overall volume level so quiet and loud speakers are handled similarly), and feature extraction (converting the raw audio waveform into a compact representation — usually log-mel spectrograms — that is more efficient for neural networks to process).
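
A minimal sketch of the first two steps, using only numpy. Real pipelines apply a mel filterbank on top of the spectrogram to produce log-mel features; that final step is omitted here:

```python
import numpy as np

def normalize(audio: np.ndarray, peak: float = 0.95) -> np.ndarray:
    """Scale so the loudest sample hits `peak`, taming volume differences."""
    return audio * (peak / np.max(np.abs(audio)))

def log_spectrogram(audio: np.ndarray, frame: int = 400, hop: int = 160) -> np.ndarray:
    """Frame the waveform, apply a Hann window, take log |FFT|^2 per frame."""
    frames = [audio[i:i + frame] * np.hanning(frame)
              for i in range(0, len(audio) - frame, hop)]
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    return np.log(power + 1e-10)

sr = 16_000
tone = 0.1 * np.sin(2 * np.pi * 440 * np.arange(sr) / sr)  # 1 s test signal
feats = log_spectrogram(normalize(tone))
print(feats.shape)  # (num_frames, num_frequency_bins)
```

The 400-sample frame and 160-sample hop correspond to the 25 ms windows and 10 ms strides commonly used in speech feature extraction.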

The quality of preprocessing can meaningfully affect accuracy, especially in challenging audio conditions. Tools that do minimal preprocessing may perform well in ideal conditions but degrade noticeably with background noise. Robust preprocessing pipelines maintain more consistent performance across varying recording conditions.

Stage 3: Neural Network Transcription

The preprocessed audio representation is fed into a large neural network trained on speech recognition. These models — trained on hundreds of thousands of hours of diverse speech data — have learned to map acoustic patterns to text tokens, handling the enormous variation in pronunciation, accent, speaking speed, and voice characteristics across different speakers.

Modern speech recognition networks process not just individual sounds but sequences of sounds, using context to resolve ambiguity. The word "two" sounds identical to "to" and "too," but in the context of "I need two more minutes," the model's language understanding component recognizes that the numeral form is appropriate. This contextual disambiguation is one of the most significant advances in modern AI-powered speech recognition compared to older rule-based approaches.
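
A toy stand-in for that disambiguation (a real recognizer scores whole token sequences with a neural network; the word-pair counts here are invented purely for illustration):

```python
# Hypothetical counts of how often each spelling precedes the next word.
COUNTS = {("two", "more"): 80, ("to", "more"): 2, ("too", "more"): 1}

def pick_homophone(next_word: str, candidates: list[str]) -> str:
    """Choose the spelling whose pairing with the following word is most common."""
    return max(candidates, key=lambda w: COUNTS.get((w, next_word), 0))

# In "I need two more minutes", the following word "more" favors "two".
print(pick_homophone("more", ["to", "too", "two"]))  # → two
```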

The model produces a sequence of predicted tokens — words or subword units — with associated confidence scores. For most of the output, confidence is very high and the predicted text is unambiguous. For a small fraction of the output, particularly around proper nouns and technical terms, confidence is lower and errors are more likely. Custom vocabulary features address this by explicitly adding high-priority terms that the model should prefer when the audio is consistent with them.
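
One common way such biasing can be implemented is to nudge the model's scores (logits) for user-supplied terms before a token is chosen. The vocabulary, logits, and bias value below are illustrative assumptions showing the general technique, not Steno's actual mechanism:

```python
import numpy as np

def decode_with_bias(logits: np.ndarray, vocab: list[str],
                     custom_terms: set[str], bias: float = 2.0) -> str:
    """Boost the logits of custom-vocabulary tokens, then pick the best one."""
    boosted = logits.copy()
    for i, token in enumerate(vocab):
        if token in custom_terms:
            boosted[i] += bias  # prefer this token when audio is consistent
    return vocab[int(np.argmax(boosted))]

vocab = ["cooper", "Kupor", "copper"]        # "Kupor" is a made-up surname
logits = np.array([1.2, 1.0, 0.8])           # model slightly favors "cooper"
print(decode_with_bias(logits, vocab, {"Kupor"}))  # → Kupor
```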

Stage 4: Post-Processing and Formatting

The raw output from the transcription model is a sequence of text tokens that does not yet look like readable text. Post-processing turns it into polished prose: restoring punctuation and capitalization, formatting numbers and dates, and cleaning up spacing.

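A minimal sketch of what such post-processing rules might look like. Real systems handle far more cases, and these particular mappings are illustrative assumptions:

```python
import re

# Map spoken punctuation words to symbols (illustrative subset).
SPOKEN = {"comma": ",", "period": ".", "question mark": "?"}

def format_transcript(raw: str) -> str:
    text = raw
    # Replace longer phrases first so "question mark" isn't split up.
    for word, mark in sorted(SPOKEN.items(), key=lambda kv: -len(kv[0])):
        text = re.sub(rf"\s*\b{word}\b", mark, text)
    # Capitalize the first letter and any letter following end punctuation.
    text = re.sub(r"(^|[.?!]\s+)([a-z])",
                  lambda m: m.group(1) + m.group(2).upper(), text)
    return text.strip()

print(format_transcript("send the draft period i will review it later period"))
# → Send the draft. I will review it later.
```
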
Stage 5: Text Injection

The final step is delivering the text to wherever it needs to go. This is where many tools fall short. Web-based dictation tools often copy text to the clipboard and paste it, which can have side effects in some applications and adds a noticeable step. Browser extension-based tools can only inject text into web-based text fields.

Native macOS applications like Steno use the system accessibility APIs to inject text directly into whatever application and text field has focus. This method is universal — it works in any application that accepts text input, including native apps, Electron apps, Terminal windows, and web forms. The injection happens at the system level, which means it behaves exactly like keyboard input from the perspective of the receiving application.

The Total Latency Budget

For the entire pipeline from voice to text to feel instantaneous, the total round-trip time needs to be under about 500 milliseconds. That budget must cover every stage: capturing and uploading the audio, preprocessing, model inference, post-processing, and injecting the text.

Steno is engineered to fit within this budget on good connections. The interaction design — hold to record, release to trigger transcription — means the audio has already been captured and is ready to transmit the moment you release the key, which eliminates any additional waiting time beyond the pipeline itself.
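
As an illustration only (hypothetical numbers, not Steno measurements), a sub-500 ms budget might be apportioned like this:

```python
# Hypothetical per-stage latencies in milliseconds, for illustration.
budget_ms = {
    "audio upload": 80,
    "preprocessing": 20,
    "model inference": 250,
    "post-processing": 20,
    "text injection": 30,
}
total = sum(budget_ms.values())
print(total, "ms total;", 500 - total, "ms headroom")  # 400 ms total; 100 ms headroom
```

Note that inference dominates in this sketch, which is why model speed matters more than any other single optimization.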

What This Means for Your Dictation Practice

Understanding the pipeline illuminates a practical point about getting better results: the quality of what goes in (your microphone, your environment, your custom vocabulary) shapes the quality of what comes out.

Steno handles the engineering of the pipeline. Your job is to speak clearly and naturally. When both sides work together, the journey from voice to text becomes smooth enough to be invisible — and that is exactly the goal.

Try it at stenofast.com.

A well-engineered voice-to-text pipeline should be invisible. You speak, the words appear, and the technology between those two things never enters your awareness.