English Speech to Text: Accuracy, Accents, and Choosing the Right Tool

English speech to text is the most mature and best-supported use case in voice recognition technology. English has the most training data, the most varied dialect coverage, and the most active research investment of any language. Yet significant differences exist between tools in how well they handle the full breadth of English as it is actually spoken — across accents, speeds, and speaking styles.

Choosing the right English speech-to-text tool for your needs requires understanding these differences and testing tools with your own voice, not just benchmark results on standardized test sets.

The Accent Problem in English Speech to Text

English is spoken natively by hundreds of millions of people across the United States, United Kingdom, Australia, Canada, Ireland, India, Nigeria, South Africa, and dozens of other countries. Each of these regions has distinct phonological features (vowel sounds, consonant realizations, stress patterns, and prosodic rhythms) that differ significantly from the General American English that historically dominated speech recognition training data.

Early speech recognition systems were trained almost exclusively on broadcast-quality American English, which meant they performed poorly on virtually every other English accent. This created a frustrating dynamic where speakers with regional American accents, British accents, Indian English, or any non-standard variety found that dictation tools were unreliable or unusable for them.

Modern large neural models trained on diverse multilingual and multi-accent data have dramatically reduced this gap. The best English speech-to-text tools now handle British, Australian, Indian, Nigerian, and Scottish English with accuracy that approaches what they achieve on standard American English. But not all tools have made this investment equally, and performance differences on accented speech remain significant across products.

Non-Native English Speakers

English speech to text for non-native speakers presents additional challenges. When someone speaks English with phonological features transferred from their native language, particularly substituted consonants and vowels where the two sound systems do not line up, older systems failed badly. Modern neural models handle non-native accents substantially better because they have been trained on varied speaker populations.

The practical recommendation for non-native English speakers is to test any tool with your actual speaking style before committing to it. Many tools that claim broad accent support still struggle with specific accent combinations. The only way to know how a tool handles your voice is to try it.
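One way to make that test concrete is to read aloud a short passage whose text you already have, run the recording through each tool, and compare the results by word error rate (WER): the number of word substitutions, deletions, and insertions divided by the number of words in the reference. The sketch below is a minimal illustration in Python, not any vendor's official scoring method; the function names and the normalization rules are assumptions made for this example.

```python
import re


def normalize(text: str) -> list[str]:
    """Lowercase, replace punctuation with spaces, and split into words."""
    return re.sub(r"[^a-z0-9' ]+", " ", text.lower()).split()


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference word count."""
    ref = normalize(reference)
    hyp = normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # word missing from the transcript
                curr[j - 1] + 1,     # extra word in the transcript
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

Reading the same passage into two or three tools and comparing the scores gives a rough but personal benchmark. A lower number is better, and a difference of a few points on a short passage is probably noise.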

Steno's accuracy on non-native English accents is notably strong, making it a good choice for global users who communicate in English professionally.

Speaking Speed and Style

Accuracy also varies with speaking speed and style. Most tools are optimized for a moderate, deliberate speaking pace. Very fast speech (above 180 words per minute) or very slow, hesitant speech can reduce accuracy. Similarly, highly conversational speech with filled pauses ("um," "uh"), restarts, and interruptions is harder to process than fluent, composed speech.

For dictation purposes, the most effective approach is to speak at a deliberate, natural pace: somewhat slower than you might speak in casual conversation, but not so slow that you sound robotic. The goal is not to change your accent or pronunciation; it is simply to give the transcription engine enough acoustic information per syllable to recognize words accurately.
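If you are unsure how fast you actually dictate, the arithmetic is easy to check: count the words in a transcript and divide by the recording length in minutes. The following is a small sketch under the assumption that you know the duration of the recording; the function names are illustrative, and the 180 figure is simply the threshold mentioned above, not a limit published by any particular tool.

```python
FAST_SPEECH_WPM = 180  # threshold taken from the text above


def words_per_minute(transcript: str, duration_seconds: float) -> float:
    """Average speaking rate for a transcript of known duration."""
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    word_count = len(transcript.split())
    return word_count * 60.0 / duration_seconds


def exceeds_fast_threshold(transcript: str, duration_seconds: float) -> bool:
    """True when the average pace is above FAST_SPEECH_WPM."""
    return words_per_minute(transcript, duration_seconds) > FAST_SPEECH_WPM
```

Because the word count comes from the transcript itself, misrecognized words can skew the number slightly, so treat the result as an estimate.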

Technical and Domain-Specific English

General-purpose English speech-to-text tools are trained primarily on everyday language. Technical vocabulary (medical terminology, legal Latin, software development jargon, financial terms) is less represented in training data and therefore more prone to errors.

The solutions are:

Custom vocabulary lists that tell the tool which specialized terms to expect. Steno includes a custom vocabulary feature that significantly improves accuracy on domain-specific terms.
Providing a context prompt that signals the domain. Telling the transcription engine "this is a medical note" or "this is software documentation" improves accuracy on domain terms even before the utterance begins.
Spelling out abbreviations and initialisms when they appear in technical content. "API" said as individual letters is easier for most engines to recognize than "API" pronounced as a word.
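When a tool offers no custom vocabulary feature, a rough substitute is to correct near misses after transcription. The sketch below is a post-processing stand-in, not how Steno or any other tool implements custom vocabulary internally: it uses fuzzy string matching from Python's standard `difflib` module to snap words that almost match a term in your list to the correct spelling. The function name, the sample terms, and the 0.8 similarity cutoff are all assumptions chosen for illustration.

```python
import difflib
import re


def apply_vocabulary(transcript: str, vocabulary: list[str],
                     cutoff: float = 0.8) -> str:
    """Replace words that closely resemble a vocabulary term with that term."""
    # Match case-insensitively, but restore the canonical spelling.
    canonical = {term.lower(): term for term in vocabulary}
    candidates = list(canonical)

    def fix(match: re.Match) -> str:
        word = match.group(0)
        close = difflib.get_close_matches(
            word.lower(), candidates, n=1, cutoff=cutoff
        )
        return canonical[close[0]] if close else word

    # Only alphabetic words are examined; punctuation and spacing are kept.
    return re.sub(r"[A-Za-z']+", fix, transcript)


terms = ["Kubernetes", "PostgreSQL"]
print(apply_vocabulary("deploy it on kubernetis with postgress", terms))
```

A lower cutoff catches more misspellings but also raises the risk of rewriting ordinary words, so it is worth checking the output on a few real transcripts before relying on it.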

Comparing English Speech-to-Text Tools on Mac

Apple Dictation

Apple's built-in dictation handles American and British English well. It has improved significantly in recent versions and handles a range of accents better than it once did. The on-device model available in recent macOS versions works without internet connectivity, which is useful for privacy-sensitive content. Accuracy on technical vocabulary is its main weakness.

Google Voice Typing

Google's voice typing, available in Google Docs and through Chrome, is strong on general English and has excellent accent coverage given Google's broad training data. However, it is tied to specific applications and does not offer system-wide dictation across all Mac apps.

Steno

Steno uses a cloud-based large model that delivers consistently high accuracy across English accents and dialects. Its system-wide hold-to-speak interface works in any Mac application — Word, Gmail, Slack, VS Code, Notion, or any text field — making it the most practical daily driver for English speech to text on Mac. Download it at stenofast.com.

Tips for Better English Dictation Results

Speak in complete sentences. Partial phrases and sentence fragments are harder to transcribe accurately than complete grammatical utterances.
Minimize background noise. The built-in MacBook microphone works acceptably in quiet environments. In noisy environments, a headset microphone improves accuracy noticeably.
Speak clearly but naturally. Exaggerated, slow pronunciation does not help modern neural models — they are trained on natural speech, not artificially slow speech.
Add your common names and technical terms to the custom vocabulary. Proper nouns that are uncommon in general language are where accuracy gaps most often appear.
Review transcripts immediately. Catching and correcting errors right after they appear is faster than reviewing a long document later.

The best English speech-to-text tool is the one that understands how you specifically speak — not how the average person in a training dataset speaks.