Automatic speech recognition (ASR) has been a research problem for over 60 years. Early systems in the 1950s could recognize spoken digits — nothing more. By the 1990s, commercial products like Dragon Dictate required you to speak in a stilted, word-by-word cadence. Today, ASR handles natural conversational speech with accuracy that often exceeds what a human transcriptionist achieves in the same amount of time.

What changed? The short answer is: neural networks, massive training datasets, and the compute to train them. But the longer answer reveals why this technology matters for everyday users and what its current limits actually are.

The Old Way: Phonemes and Hidden Markov Models

For most of ASR's history, the dominant approach was pipeline-based. You'd break speech into phonemes (the fundamental sound units of language), model the statistical transitions between them using Hidden Markov Models, and then apply a separate language model to choose the most probable word sequence.

This approach worked, but it was brittle. The acoustic model and language model were separate components that had to be tuned independently. Errors at one stage couldn't easily be corrected by information at another stage. The system had no real understanding of meaning — just statistical patterns in sound and text.

The tell-tale sign of these older systems was their sensitivity to speaking style. You had to train them on your specific voice, speak slowly and clearly, and avoid words outside a defined vocabulary. Deviation from these constraints caused quality to collapse rapidly.

The New Way: End-to-End Neural Networks

Modern ASR systems are built on deep neural networks that learn to map audio directly to text, often without an explicit phoneme representation at all. The key architectural shift was moving from separate acoustic and language models to unified models trained end-to-end on paired audio-text data.

The current generation of systems uses transformer architectures, the same class of neural network that powers large language models. These architectures excel at modeling long-range dependencies: a word spoken 20 words earlier in a sentence constrains which word is likely now.

Training these models requires enormous datasets: hundreds of thousands of hours of labeled audio across diverse speakers, accents, recording conditions, and languages. The breadth of training data is one reason modern systems generalize well to accents and speaking styles they haven't explicitly seen.

How Your Voice Becomes Text: Step by Step

Here's what happens when you speak into a modern ASR system:

  1. Audio capture: Your microphone converts air pressure variations into an electrical signal, which is digitized at a sample rate (typically 16 kHz for speech).
  2. Feature extraction: The raw waveform is converted into a time-frequency representation — a mel spectrogram — that highlights the features of sound most relevant to speech perception.
  3. Encoder processing: A deep neural network encoder processes the spectrogram, producing a rich internal representation that captures acoustic patterns across time.
  4. Decoder generation: A decoder network attends to the encoder output and generates text tokens one at a time, each conditioned on the acoustic context and the text generated so far.
  5. Post-processing: The raw text is cleaned up — punctuation is added, numbers are formatted, and domain-specific terms are normalized.
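Step 2 above, feature extraction, is concrete enough to sketch in code. The mel-scale math below is standard, but the specific parameters (25 ms frames, 10 ms hop, 40 mel bands) are common illustrative choices, not the settings of any particular production system:

```python
import numpy as np

def hz_to_mel(f):
    # Standard HTK-style mel scale: roughly linear below 1 kHz, log above.
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    # Triangular filters spaced evenly on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

def log_mel_spectrogram(wave, sr=16000, n_fft=400, hop=160, n_mels=40):
    # Frame the waveform: 25 ms windows (400 samples at 16 kHz), 10 ms hop.
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack([wave[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    # Power spectrum of each frame, projected onto the mel filters.
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2
    mel = power @ mel_filterbank(n_mels, n_fft, sr).T
    return np.log(mel + 1e-10)  # log compression mimics loudness perception

# One second of a 440 Hz tone, sampled at 16 kHz as in step 1.
t = np.arange(16000) / 16000
spec = log_mel_spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape)  # → (98, 40): 98 time frames, 40 mel bands
```

The resulting matrix of time frames by frequency bands is what the encoder in step 3 actually consumes, rather than the raw waveform.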

This entire process happens remarkably fast. In optimized cloud deployments, the round-trip from speaking a sentence to receiving transcribed text can be under 500 milliseconds. Apps like Steno are built on this kind of infrastructure — the result appears at your cursor almost as soon as you've finished speaking.

Why Accents and Background Noise Are Still Challenging

Despite dramatic improvements, two factors continue to degrade ASR accuracy: strong accents and background noise.

Accents shift the acoustic realization of phonemes in ways that a model trained on different accents may not have seen. A model trained mostly on American English news transcripts will have seen far fewer examples of Glaswegian Scottish English, Nigerian English, or Indian English — so it's statistically less well-calibrated for those patterns. The solution is training data diversity, and the best modern systems train on truly global multilingual corpora.

Background noise masks acoustic features and confuses the model at the input stage. No amount of model sophistication fully compensates for audio quality that starts degraded. Good noise-suppression preprocessing helps, but it's no substitute for capturing clean audio in the first place.
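To make "noise-suppression preprocessing" concrete, here is a minimal spectral-subtraction sketch, a classic technique rather than the method any specific product uses: estimate the noise floor from a noise-only lead-in, subtract it from each frequency bin, and resynthesize. Frame sizes and the assumption of a silent lead-in are illustrative:

```python
import numpy as np

def spectral_subtraction(wave, noise_frames=10, n_fft=512, hop=256):
    """Suppress stationary noise by subtracting an estimated noise floor.

    Assumes the first `noise_frames` frames contain only background noise.
    """
    window = np.hanning(n_fft)
    n = 1 + (len(wave) - n_fft) // hop
    frames = np.stack([wave[i * hop : i * hop + n_fft] * window
                       for i in range(n)])
    spectra = np.fft.rfft(frames, n=n_fft)
    mags, phases = np.abs(spectra), np.angle(spectra)
    noise_floor = mags[:noise_frames].mean(axis=0)
    clean = np.maximum(mags - noise_floor, 0.0)   # subtract, clamp at zero
    out_spec = clean * np.exp(1j * phases)        # keep the original phase
    out_frames = np.fft.irfft(out_spec, n=n_fft)
    out = np.zeros(len(wave))
    for i in range(n):                            # overlap-add resynthesis
        out[i * hop : i * hop + n_fft] += out_frames[i]
    return out

# A tone buried in noise, with a noise-only lead-in for the floor estimate.
rng = np.random.default_rng(0)
sr = 16000
t = np.arange(sr) / sr
noisy = 0.05 * rng.standard_normal(sr)
noisy[sr // 2:] += np.sin(2 * np.pi * 440 * t[sr // 2:])
cleaned = spectral_subtraction(noisy)
```

Note the limitation this sketch shares with real systems: whatever the subtraction removes is gone, and whatever noise overlaps the speech frequencies distorts the signal the model sees.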

Language Models and Context: The Secret to Accuracy

One of the counterintuitive aspects of modern ASR is how much the language model component contributes to accuracy. Even when the acoustic signal is ambiguous — "their," "there," and "they're" sound identical — a strong language model can usually pick the right spelling from context.

This is why the same audio that would confuse an older phoneme-based system is handled correctly by modern systems: they're not just pattern-matching sounds to words, they're reasoning about what word makes sense given everything that came before it in the sentence.
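The homophone case can be shown with a toy rescoring example. The scores below are invented for illustration; a real language model is learned from text and scores whole sequences, not just word pairs:

```python
import math

# Toy language model: log-probability of a word given the previous word.
# Values are invented for illustration only.
BIGRAM_LOGPROB = {
    ("over", "there"): math.log(0.30),
    ("over", "their"): math.log(0.01),
    ("over", "they're"): math.log(0.01),
}

def rescore(prev_word, candidates, acoustic_logprob, lm_weight=0.6):
    """Combine acoustic and language-model scores; highest combined wins."""
    def score(word):
        lm = BIGRAM_LOGPROB.get((prev_word, word), math.log(1e-4))
        return acoustic_logprob[word] + lm_weight * lm
    return max(candidates, key=score)

# Acoustically, the three homophones are indistinguishable.
acoustic = {w: math.log(1 / 3) for w in ("there", "their", "they're")}
best = rescore("over", ["there", "their", "they're"], acoustic)
print(best)  # → there
```

The acoustic scores are a three-way tie, so the language model alone decides, which is exactly the situation the homophones create.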

This also explains why domain-specific vocabulary is harder: if you're dictating medical notes with terminology the language model hasn't seen often, it falls back to its best phonetic guess — which might be a common word that sounds similar. Custom vocabulary features address this by boosting the probability of specific terms.
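One simplified way to picture how vocabulary boosting works, products differ in the details, is a bonus added to the score of any hypothesis containing a listed term. The phrases and scores here are hypothetical:

```python
import math

def pick_hypothesis(hypotheses, boost_terms, boost=2.0):
    """Choose among transcription hypotheses, boosting listed domain terms.

    `hypotheses` maps candidate text to its base log-score from the
    recognizer; the boost is added once per occurrence of each custom term.
    """
    def score(text):
        s = hypotheses[text]
        for term in boost_terms:
            s += boost * text.lower().count(term.lower())
        return s
    return max(hypotheses, key=score)

# Without boosting, the common-word phrase outscores the drug name.
hyps = {
    "prescribe met for men": math.log(0.6),
    "prescribe metformin":   math.log(0.4),
}
print(pick_hypothesis(hyps, boost_terms=[]))             # → prescribe met for men
print(pick_hypothesis(hyps, boost_terms=["metformin"]))  # → prescribe metformin
```

The boost does not change what the acoustic model heard; it only shifts the tiebreak toward terms the user has declared likely.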

Real-Time vs. Batch Processing: The Latency Tradeoff

ASR systems face a fundamental tradeoff between accuracy and latency. The most accurate predictions come from models that can see the full audio segment before committing to a transcription — but real-time use requires committing to text before the segment is complete.

Modern streaming ASR systems use a few techniques to approximate the accuracy of offline processing while still delivering low latency:

Limited lookahead: the model waits for a short window of future audio, typically a few hundred milliseconds, before committing to each word, trading a little latency for context.

Provisional hypotheses: partial text is shown immediately and silently revised as more audio arrives, so you see results fast without locking in early mistakes.

Two-pass decoding: a fast streaming model produces the live transcript, and a larger model rescores the finished utterance to fix residual errors.
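The revision behavior can be sketched abstractly. Everything here is a stand-in: the chunk names, the lookup tables playing the role of a neural model, and the one-chunk lookahead are all hypothetical, chosen to show how a provisional guess gets corrected once the next chunk arrives:

```python
# A toy streaming recognizer: emits a provisional word per audio chunk
# immediately, then finalizes (and possibly revises) it once one chunk
# of lookahead is available. Lookup tables stand in for a neural model.

PROVISIONAL = {"chunk1": "eye", "chunk2": "scream", "chunk3": "loudly"}
# With one chunk of right-context, earlier guesses can be corrected.
FINAL_WITH_CONTEXT = {("chunk1", "chunk2"): "ice", ("chunk2", "chunk3"): "cream"}

def stream_transcribe(chunks):
    finalized, pending = [], None
    for chunk in chunks:
        if pending is not None:
            # Revise the pending word now that we can see one chunk ahead.
            finalized.append(FINAL_WITH_CONTEXT.get((pending, chunk),
                                                    PROVISIONAL[pending]))
        yield finalized + [PROVISIONAL[chunk]]  # partial result, may change
        pending = chunk
    if pending is not None:
        finalized.append(PROVISIONAL[pending])  # last word: no lookahead left
    yield finalized  # final result

for partial in stream_transcribe(["chunk1", "chunk2", "chunk3"]):
    print(" ".join(partial))
# → eye
# → ice scream
# → ice cream loudly
# → ice cream loudly
```

The first partial result says "eye"; one chunk later it has been revised to "ice cream". That flicker-then-settle behavior is exactly what you see in live captioning.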

Where ASR Is Heading

The current trajectory points toward a few near-term improvements:

Better accent coverage: As training datasets diversify, the accuracy gap between different accents and dialects is narrowing. Models trained on hundreds of languages and regional variants generalize better to all of them.

Improved multi-speaker handling: Separating and correctly attributing overlapping speakers remains one of the hardest problems in ASR. New architectures specifically designed for multi-speaker scenarios are closing this gap.

Smaller on-device models: The push to run capable ASR models locally — for privacy, latency, and offline use — has driven rapid innovation in model compression and efficiency. The gap between cloud and on-device accuracy is shrinking.

Multimodal context: Future systems may use visual cues (lip movement, facial expressions) alongside audio, which human ears implicitly use in noisy environments.

For users, the practical upshot is that 2026's tools are genuinely useful for professional work in ways that tools from even three years ago weren't. If you tried voice-to-text in the past and gave up because accuracy wasn't there, it's worth trying again.

To see how these advances translate to a real-world dictation app, read our breakdown of how Steno works under the hood, or compare how modern AI-powered ASR stacks up against Apple's built-in dictation.