You are on an airplane. You are in a remote cabin. You are working from a hotel with terrible Wi-Fi. You need to get text out of your head and onto the screen, and you prefer speaking to typing. Can you use voice-to-text without an internet connection on your Mac?
The answer is yes, but with important tradeoffs. This article covers the main approaches to offline dictation on macOS, explains the accuracy and performance differences between local and cloud-based transcription, and helps you decide which approach fits your needs.
Why Most Dictation Apps Need the Internet
Modern voice-to-text accuracy comes from large neural network models — specifically, transformer-based models like OpenAI's Whisper. These models are trained on hundreds of thousands of hours of speech data and contain hundreds of millions (or billions) of parameters. Running the full-size model requires significant computational resources: powerful GPUs, substantial RAM, and optimized inference engines.
Cloud-based transcription services like Groq (which Steno uses) run these large models on dedicated hardware — custom silicon and GPU clusters designed specifically for fast inference. They can process a 10-second audio clip in under a second, returning a transcription that stays accurate even with accents, background noise, technical vocabulary, and complex sentence structures.
Sending audio to the cloud for transcription is not a cost-cutting shortcut — it is how you get the best possible accuracy. The computational resources available in a cloud data center dwarf what even the most powerful MacBook Pro can provide.
Offline Option 1: Apple's Built-in Dictation
Starting with macOS Ventura, Apple's built-in dictation can run entirely on-device. You can enable this in System Settings under Keyboard, then Dictation. When "On-Device Only" mode is selected, your audio is processed locally using Apple's speech recognition models.
How It Works
Apple ships compact speech recognition models as part of macOS. These models are optimized for Apple Silicon, using the Neural Engine for inference. When you activate dictation, the audio is processed through these local models without any network request.
Accuracy
On-device accuracy is noticeably lower than cloud-based transcription. Apple's local models are smaller and less capable than the cloud-based alternatives. Common issues include:
- More frequent homophone errors ("their" vs. "there" vs. "they're")
- Weaker handling of technical vocabulary and proper nouns
- Less reliable automatic punctuation
- Reduced accuracy with accented English or in noisy environments
For simple, clearly spoken English in a quiet environment, accuracy is reasonable — perhaps 90 to 93%. For anything more demanding, the error rate increases significantly compared to cloud-based options that achieve 97 to 99% accuracy.
Pros
- Already installed on your Mac, no setup required
- Completely free and unlimited
- Maximum privacy — audio never leaves your device
- Works in most native macOS applications
Cons
- Lower accuracy than cloud-based alternatives
- Toggle-based activation (not hold-to-speak)
- Uses noticeable CPU resources during active transcription
- Does not work in all applications (some non-standard text fields are not supported)
Offline Option 2: Whisper.cpp
Whisper.cpp is an open-source C++ port of OpenAI's Whisper model, optimized for local execution on Apple Silicon. It can run various sizes of the Whisper model — from the tiny model (75MB) to the large-v3 model (approximately 3GB) — entirely on your Mac.
How It Works
You download the model file of your choice and run Whisper.cpp either from the command line or through a graphical wrapper like MacWhisper. The model loads into RAM, and audio is processed through the neural network locally. On an M1 MacBook Pro, the base model processes audio at roughly 4x real-time speed (a 10-second clip takes about 2.5 seconds). The large model is slower, processing at roughly 1x to 2x real-time on M1.
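As a quick sanity check on those figures, processing time is simply audio duration divided by the model's real-time factor. A minimal sketch (the helper is for illustration only, not part of Whisper.cpp):

```python
def processing_time(clip_seconds: float, realtime_factor: float) -> float:
    """Estimated transcription time: audio duration / real-time factor.
    E.g. the base model at ~4x real time turns a 10 s clip into ~2.5 s of work."""
    return clip_seconds / realtime_factor

base_model = processing_time(10, 4)    # ~2.5 seconds
large_model = processing_time(10, 1)   # ~10 seconds
```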
Accuracy by Model Size
The accuracy of local Whisper transcription depends heavily on which model size you use:
- Tiny (75MB) — Fast but significantly less accurate. Suitable for clear speech in quiet environments. Expect around 85 to 90% accuracy.
- Base (150MB) — Good balance of speed and accuracy for casual use. Around 90 to 93% accuracy.
- Small (500MB) — Noticeably better with accents and background noise. Around 93 to 95% accuracy.
- Medium (1.5GB) — Near-professional accuracy for most speech. Around 95 to 97% accuracy.
- Large-v3 (3GB) — Best offline accuracy available, approaching cloud quality. Around 96 to 98% accuracy. However, processing is slower — roughly 1x real-time on M1, meaning a 10-second clip takes about 10 seconds.
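The tradeoffs above can be sketched as a small lookup: pick the most accurate model that fits in your free RAM. The figures are rough illustrations taken from the list above (not benchmarks), and `pick_model` is a hypothetical helper, not part of Whisper.cpp:

```python
# Approximate Whisper model tradeoffs as described above.
# (name, download size MB, rough RAM needed MB, approx. accuracy %)
MODELS = [
    ("tiny",     75,   500,  87),
    ("base",     150,  700,  91),
    ("small",    500,  1500, 94),
    ("medium",   1500, 3000, 96),
    ("large-v3", 3000, 4500, 97),
]

def pick_model(free_ram_mb: int) -> str:
    """Return the most accurate model that fits in the available RAM."""
    fitting = [m for m in MODELS if m[2] <= free_ram_mb]
    if not fitting:
        raise ValueError("Not enough free RAM for even the tiny model")
    return max(fitting, key=lambda m: m[3])[0]
```

On a Mac with plenty of free memory, `pick_model(8000)` selects `large-v3`; on a busy 8GB machine with only 2GB free, it would fall back to `small`.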
Pros
- Completely offline and private
- Free and open source
- Best offline accuracy available when using larger models
- Optimized for Apple Silicon
Cons
- Not designed for inline dictation — better for transcribing recordings
- Larger models require significant RAM (the large model needs 4GB+ free)
- Processing time can be noticeable, especially with larger models
- No system-wide text injection — you must copy results manually
- Requires some technical comfort with installation and configuration
Offline Option 3: macOS Speech Framework
Developers can build applications using Apple's Speech framework (SFSpeechRecognizer), which provides on-device speech recognition capabilities. Several third-party apps use this framework to offer offline dictation.
How It Works
The Speech framework uses Apple's speech recognition models, similar to the built-in dictation feature. Applications built on this framework can provide custom interfaces and workflows while using the same underlying engine.
Accuracy
Accuracy is comparable to Apple's built-in dictation, since the same models are used. The advantage is that third-party apps can add pre-processing (noise reduction, voice isolation) and post-processing (text formatting, smart punctuation) that may improve the overall result.
When You Actually Need Offline Dictation
Before choosing an offline dictation solution, consider how often you genuinely need it. Most Mac users have internet access the vast majority of the time. The scenarios where you truly lack connectivity are relatively rare:
- Airplanes without Wi-Fi (increasingly rare on longer flights)
- Remote locations without cellular coverage
- Environments where network access is restricted (secure facilities, some hospitals)
- Locations with extremely slow or unreliable internet
For most users, these situations represent a small fraction of their total working time. The question becomes: is it worth accepting lower accuracy all the time (by using an offline-only tool) to avoid being without dictation in these rare situations? Or is it better to use a cloud-based tool with superior accuracy most of the time and fall back to Apple's built-in offline dictation when connectivity is unavailable?
The Hybrid Approach
The most practical approach for most users is to use a cloud-based dictation tool as your primary method and keep Apple's built-in offline dictation as a fallback. Here is how this works in practice:
- Install Steno (or another cloud-based dictation tool) for daily use. Benefit from AI-powered accuracy for 99% of your dictation needs.
- Enable Apple's built-in dictation with on-device processing as a secondary option.
- When you are offline, use Apple dictation. The accuracy is lower, but it works without any internet connection.
- When you are back online, switch back to your primary tool for the best accuracy.
This hybrid approach gives you the best accuracy when connectivity is available and a serviceable fallback when it is not, without requiring you to compromise on your primary dictation experience.
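The fallback decision itself is simple enough to sketch in a few lines. This is an illustrative sketch, not how Steno or any specific tool is implemented; `is_online` is a crude TCP reachability probe and the engine names are placeholders:

```python
import socket

def is_online(host: str = "8.8.8.8", port: int = 53, timeout: float = 1.5) -> bool:
    """Crude connectivity probe: try to open a TCP connection to a
    well-known public DNS server. Returns False if unreachable."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def choose_engine(online: bool) -> str:
    """Pick a dictation engine per the hybrid approach above:
    cloud transcription when connected, on-device otherwise."""
    return "cloud" if online else "apple-on-device"
```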
The Accuracy Gap: Numbers in Context
To put the accuracy difference in concrete terms: at 95% accuracy (a good offline result), a 100-word paragraph will contain approximately 5 errors. At 98% accuracy (typical for cloud-based AI transcription), the same paragraph will contain approximately 2 errors.
The difference between 2 and 5 errors might seem small, but it compounds with volume. If you dictate 2,000 words in a day, that is 100 errors to fix versus 40. At roughly 5 seconds per correction (find the error, position cursor, delete wrong text, type correct text), the less accurate option costs you an additional 5 minutes per day in corrections alone. Over a month, that is nearly two hours of additional editing.
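The arithmetic above can be checked directly. The sketch assumes 5 seconds per correction and roughly 22 working days per month:

```python
def daily_correction_cost(words_per_day: int, accuracy: float,
                          seconds_per_fix: float = 5.0) -> tuple[int, float]:
    """Return (errors per day, minutes spent fixing them)."""
    errors = round(words_per_day * (1 - accuracy))
    return errors, errors * seconds_per_fix / 60

offline_errors, offline_min = daily_correction_cost(2000, 0.95)  # 100 errors, ~8.3 min
cloud_errors, cloud_min = daily_correction_cost(2000, 0.98)      # 40 errors, ~3.3 min
extra_min_per_day = offline_min - cloud_min                      # ~5 extra minutes/day
extra_hours_per_month = extra_min_per_day * 22 / 60              # ~1.8 hours over 22 workdays
```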
More importantly, the nature of the errors differs. Cloud-based AI transcription errors tend to be subtle — a misplaced comma, a capitalization choice — and are quick to fix. Offline transcription errors are more likely to be word-level mistakes — wrong homophones, dropped words, garbled technical terms — that require more cognitive effort to identify and correct.
Looking Ahead: Local Models Are Improving
The accuracy gap between local and cloud-based transcription is closing. Each generation of Apple Silicon brings more Neural Engine capability, and model optimization techniques like quantization and distillation are making it possible to run more capable models in less memory.
Apple has been investing heavily in on-device AI, and future macOS releases will likely ship with significantly more capable local speech recognition. The M4 chip's 38 TOPS Neural Engine is already capable of running models that would have required cloud processing just two years ago.
Similarly, open-source efforts around Whisper and its successors continue to produce smaller, faster models that close the gap with cloud accuracy. The distilled Whisper models can run at near-cloud accuracy on modern Apple Silicon hardware, though with higher latency than a dedicated cloud inference engine.
For now, though, the practical advice is clear: use cloud-based transcription when you can for the best experience, and keep an offline option available for the times when you cannot. Steno handles the first part. Apple's built-in dictation handles the second. Together, they cover every scenario you are likely to encounter.