Speech recognition has existed in some form for decades. Dragon NaturallySpeaking launched in 1997. Apple introduced Siri in 2011. Google's voice search became reliable around 2014. But if you used any of these tools and then tried a modern AI transcription system like Whisper, you would immediately notice something different: it just works. Not mostly works, not works-if-you-speak-slowly — it works the way you would expect a human transcriptionist to work.

This shift is not incremental improvement. It is a fundamentally different approach to turning speech into text. Understanding how this technology works helps explain why dictation tools that use AI transcription are dramatically more accurate and capable than their predecessors, and why Steno chose to build its entire product on this foundation.

The Old Way: Hidden Markov Models

Traditional speech recognition systems, including the ones that powered Siri's early versions and Dragon NaturallySpeaking, were built on Hidden Markov Models (HMMs) combined with statistical language models. The approach worked like this:

  1. Break audio into short frames (typically 25 milliseconds each)
  2. Extract acoustic features from each frame (mel-frequency cepstral coefficients, for the technically inclined)
  3. Use an HMM to map sequences of acoustic features to phonemes (individual speech sounds)
  4. Use a pronunciation dictionary to map phoneme sequences to words
  5. Use a statistical language model to choose the most likely word sequence
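The heart of stage 3 is Viterbi decoding: finding the most likely hidden state sequence given the observations. A toy version over a two-phoneme HMM makes the idea concrete. Everything here — the states, the probabilities, the quantized "observations" — is made up for illustration; a real acoustic model has thousands of states and continuous feature vectors:

```python
import math

# Toy HMM: two phoneme states, two quantized acoustic observations.
# All probabilities are hypothetical.
states = ["AH", "S"]
start_p = {"AH": 0.6, "S": 0.4}
trans_p = {"AH": {"AH": 0.7, "S": 0.3}, "S": {"AH": 0.4, "S": 0.6}}
emit_p = {"AH": {"low": 0.8, "high": 0.2}, "S": {"low": 0.1, "high": 0.9}}

def viterbi(observations):
    """Return the most likely phoneme sequence for the observations."""
    # V[t][s] = best log-probability of any path ending in state s at time t
    V = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
          for s in states}]
    path = {s: [s] for s in states}
    for obs in observations[1:]:
        V.append({})
        new_path = {}
        for s in states:
            best_prev = max(states, key=lambda p: V[-2][p] + math.log(trans_p[p][s]))
            V[-1][s] = (V[-2][best_prev] + math.log(trans_p[best_prev][s])
                        + math.log(emit_p[s][obs]))
            new_path[s] = path[best_prev] + [s]
        path = new_path
    best = max(states, key=lambda s: V[-1][s])
    return path[best]

print(viterbi(["low", "low", "high"]))  # → ['AH', 'AH', 'S']
```

Note what the decoder cannot do: it only sees one frame's evidence at a time, plus a single transition step of history. That narrow window is exactly the limitation the next sections describe.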

This pipeline had several fundamental limitations. Each component was trained separately, so errors in one stage compounded through the rest. The pronunciation dictionary was fixed — words not in the dictionary could not be recognized. And the language model, while helpful, had limited context — it could consider perhaps the previous two or three words when predicting the next one, but it could not understand the meaning of a sentence.

The result was speech recognition that worked well for clear, carefully spoken English by native speakers in quiet environments, but degraded rapidly with accents, background noise, conversational speech patterns, or specialized vocabulary.

The New Way: End-to-End Transformer Models

Whisper, released by OpenAI in September 2022, took a fundamentally different approach. Instead of a pipeline of separate components, Whisper is a single neural network that takes audio as input and produces text as output. The model learns the entire mapping from sound to text during training, without relying on hand-crafted features, pronunciation dictionaries, or separate language models.

The Architecture

Whisper uses a transformer architecture — the same type of neural network that powers GPT and other large language models. The model has two main components:

Encoder: The audio encoder converts raw audio into a rich representation that captures not just acoustic features, but contextual information about how sounds relate to each other. The encoder processes 30 seconds of audio at a time, using self-attention mechanisms that allow it to consider the relationship between any two points in the audio. This means a word spoken at second 5 can inform the interpretation of a word spoken at second 25.
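The mechanism that lets second 5 inform second 25 is scaled dot-product self-attention. Here is a stripped-down sketch in pure Python — the learned query/key/value projections are replaced by the identity, and the "frames" are tiny hand-made vectors rather than real audio features — but the core computation (score every pair of positions, softmax, weighted average) is the same:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(frames):
    """Scaled dot-product self-attention over a list of feature vectors.

    For clarity the query/key/value projections are the identity, so each
    output is a context-weighted average of every frame in the window.
    """
    d = len(frames[0])
    outputs = []
    for q in frames:
        # Score this frame against every frame in the window, including itself.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in frames]
        weights = softmax(scores)
        outputs.append([sum(w * v[i] for w, v in zip(weights, frames))
                        for i in range(d)])
    return outputs

# Four toy "frames"; every output mixes information from all four positions.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, 0.0]]
out = self_attention(frames)
```

Because every position attends to every other position, nothing in the 30-second window is out of reach — there is no fixed "previous two or three words" horizon as in the old language models.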

Decoder: The text decoder generates text one token at a time, attending to both the encoder's audio representation and the previously generated tokens. This autoregressive process means each word is generated with full knowledge of both the audio context and all previously generated words. The decoder is, in essence, a language model that is conditioned on audio — it understands both what was said and what makes sense linguistically.

Training at Scale

What makes Whisper remarkable is not just its architecture (transformers for speech recognition existed before Whisper) but the scale and diversity of its training data. OpenAI trained Whisper on 680,000 hours of multilingual speech collected from the internet — spanning dozens of languages, a wide range of accents and recording conditions, and content from casual conversation to technical material.

This diverse training data is why Whisper handles real-world speech so well. It has seen (heard) enough variety that accented English, background cafe noise, or domain-specific jargon do not throw it off the way they would a traditional system trained primarily on clean, native-speaker read speech.

Why AI Transcription Is Better

The practical improvements over traditional speech recognition are substantial and measurable.

Robustness to Accents

Traditional systems were trained primarily on General American and Received Pronunciation English. Speakers with Indian, Nigerian, Scottish, or Southern US accents experienced significantly higher error rates. Whisper, trained on speech from across the English-speaking world, handles accent variation gracefully. The error rate difference between accents is typically less than 2 percentage points, compared to 10-15 percentage points for traditional systems.
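The "percentage points" in comparisons like these refer to word error rate (WER), the standard accuracy metric for transcription: substitutions, deletions, and insertions divided by the number of reference words. A minimal implementation via word-level edit distance:

```python
def word_error_rate(reference, hypothesis):
    """WER via Levenshtein distance over words: (S + D + I) / N."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[-1][-1] / len(ref)

ref = "please check the site for updates"
hyp = "please check the sight for updates"
print(word_error_rate(ref, hyp))  # one substitution in six words ≈ 0.167
```

A 2-point accent gap means WER might move from, say, 0.05 to 0.07 across accents; a 10-to-15-point gap means some speakers saw one error in every five to seven words.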

Noise Handling

Traditional systems used separate noise reduction preprocessors that often stripped useful audio information along with the noise. Whisper's encoder learns to attend to speech and ignore noise as part of its training. It was trained on audio with real-world noise — cafe chatter, air conditioning hum, keyboard clicking — so it has learned to separate speech from noise without explicit noise reduction.

Automatic Punctuation and Formatting

Traditional systems produced raw word sequences with no punctuation. Users had to say "period," "comma," and "new paragraph" explicitly. Whisper's decoder, being a language model, naturally produces punctuated and formatted text. It infers punctuation from speech patterns — pauses, intonation changes, and sentence structure — the same way a human transcriptionist would.

Context-Aware Disambiguation

The sentence "I need to check the site" could be transcribed as "sight," "site," or "cite" by a traditional system, with the choice depending on the language model's statistics about which word is most common after "the." Whisper's decoder has a much richer understanding of context — it considers the entire preceding sentence, the domain of conversation, and linguistic patterns to make the right choice far more often.
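The effect of conditioning a language model on audio can be sketched with a toy greedy decoder. All of the scores below are hypothetical stand-ins for the real decoder network, but they show the mechanism: acoustic evidence alone cannot separate homophones, while a context score over the previously generated tokens can:

```python
# A toy autoregressive greedy decoder with hypothetical scores.
VOCAB = ["the", "site", "sight", "check", "<eos>"]

# Per-step acoustic evidence: how well each token matches the sound.
AUDIO_SCORES = [
    {"check": 0.9, "the": 0.05},
    {"the": 0.9},
    {"site": 0.5, "sight": 0.5},  # acoustically identical homophones
    {"<eos>": 0.9},
]

def lm_score(prev_tokens, token):
    """Fake language-model score: prefers 'site' after 'check the'."""
    if prev_tokens[-2:] == ["check", "the"] and token == "site":
        return 0.3
    return 0.0

def greedy_decode(audio_scores):
    tokens = []
    for step_scores in audio_scores:
        best = max(VOCAB,
                   key=lambda t: step_scores.get(t, 0.0) + lm_score(tokens, t))
        if best == "<eos>":
            break
        tokens.append(best)
    return tokens

print(greedy_decode(AUDIO_SCORES))  # → ['check', 'the', 'site']
```

In the real model, of course, there is no separate `lm_score` — the decoder's linguistic knowledge and its attention to the audio are fused in one network, which is precisely why its disambiguation is so much richer than a bolted-on n-gram model's.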

Where Groq Fits In

Whisper defines what the model can do. Groq defines how fast it runs. And for a dictation app where you are waiting for text to appear after you stop speaking, speed is critical.

Groq has developed custom silicon — the Language Processing Unit (LPU) — designed specifically for running transformer models at high speed. While GPUs process neural network computations in parallel across thousands of small cores, Groq's LPU uses a deterministic, synchronous architecture that eliminates the overhead of memory management and scheduling that slows GPU inference.

The practical result is that Groq runs the Whisper large-v3 model faster than real-time. A 10-second audio clip is transcribed in under one second. For Steno users, this means the delay between releasing the hotkey and seeing text is dominated by network latency (the time for audio to travel to the server and text to come back), not by model inference time.
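The usual way to express this is the real-time factor (RTF): processing time divided by audio duration, where anything below 1.0 is faster than real time. A quick back-of-the-envelope with hypothetical numbers consistent with the text:

```python
def real_time_factor(audio_seconds, processing_seconds):
    """RTF < 1.0 means faster than real time."""
    return processing_seconds / audio_seconds

# Illustrative numbers from the text: a 10-second clip transcribed in
# under one second, plus a hypothetical network round trip.
audio_seconds = 10.0
inference_seconds = 0.8        # hypothetical: "under one second"
network_round_trip = 0.15      # hypothetical: 150 ms

rtf = real_time_factor(audio_seconds, inference_seconds)
total_delay = inference_seconds + network_round_trip
print(f"RTF: {rtf:.2f}, perceived delay: {total_delay:.2f} s")
```

At an RTF near 0.08, the inference itself contributes well under a second regardless of clip length, which is why the network round trip ends up as the dominant term.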

This speed is what makes cloud-based AI transcription viable for real-time dictation. If the model took 10 seconds to process 10 seconds of audio (1x real-time), the delay would be too long for interactive use. At faster-than-real-time inference, the delay is imperceptible — comparable to the latency of loading a web page.

What This Means for Dictation on Mac

The combination of Whisper's accuracy and Groq's speed creates a dictation experience that was not possible even three years ago. Here is what it means in practical terms:

You Do Not Need to Train the System

Traditional dictation software required you to spend 15 to 30 minutes reading training passages so the system could learn your voice. Whisper has already been trained on hundreds of thousands of hours of diverse speech. It works accurately the first time you use it, with no enrollment period.

You Do Not Need to Speak Differently

With traditional systems, you learned to speak "to the computer" — slowly, clearly, with explicit pauses. With Whisper-based transcription, you speak naturally. Conversational pace, natural cadence, normal pronunciation. The model was trained on natural speech, so natural speech is what it handles best.

It Gets Technical Terms Right

Say "Kubernetes" to a traditional dictation system and you might get "Cooper Netties." Say it to Whisper and you get "Kubernetes" — because the model has seen enough technical content in its training data to recognize specialized vocabulary. The same applies to medical terms, legal jargon, scientific nomenclature, and brand names.

Punctuation Is Automatic

You never need to say "period" or "comma." Whisper infers punctuation from your speech patterns with high accuracy. This alone saves significant time compared to traditional dictation, where managing punctuation was a constant cognitive burden.

The Future of AI Transcription

The current generation of AI transcription models is impressive, but the field is advancing rapidly. Several trends suggest that the next few years will bring further significant improvements.

Model distillation techniques are producing smaller models that retain most of the accuracy of larger ones, making high-quality local transcription increasingly feasible. Multimodal models that understand both audio and visual context could improve accuracy in specialized domains. And real-time streaming architectures are reducing latency from seconds to milliseconds.

For Steno users, these advances translate directly into better dictation experiences. Because Steno uses server-side transcription, model improvements are deployed instantly — you get better accuracy without downloading updates or changing anything on your end. The model that transcribes your speech today is better than the one that transcribed it last month, and next month's will be better still.

If you want to experience what AI-powered transcription actually feels like in daily use, download Steno and try it. The difference between traditional dictation and AI transcription is not subtle — it is the difference between a tool you fight with and a tool that simply works.