
You are on an airplane. You are in a remote cabin. You are working from a hotel with terrible Wi-Fi. You need to get text out of your head and onto the screen, and you prefer speaking to typing. Can you use voice-to-text without an internet connection on your Mac?

The answer is yes, but with important tradeoffs. This article covers every approach to offline dictation on macOS, explains the accuracy and performance differences between local and cloud-based transcription, and helps you decide which approach fits your needs.

Why Most Dictation Apps Need the Internet

Modern voice-to-text accuracy comes from large neural network models — specifically, transformer-based models like OpenAI's Whisper. These models are trained on hundreds of thousands of hours of speech data and contain hundreds of millions (or billions) of parameters. Running the full-size model requires significant computational resources: powerful GPUs, substantial RAM, and optimized inference engines.

Cloud-based transcription services like Groq (which Steno uses) run these large models on dedicated hardware — custom silicon and GPU clusters designed specifically for fast inference. They can process a 10-second audio clip in under a second, returning a transcription that remains accurate even with accents, background noise, technical vocabulary, and complex sentence structures.

Sending audio to the cloud for transcription is not a cost-cutting shortcut — it is how you get the best possible accuracy. The computational resources available in a cloud data center dwarf what even the most powerful MacBook Pro can provide.

Offline Option 1: Apple's Built-in Dictation

Starting with macOS Ventura, Apple's built-in dictation can run entirely on-device. You can enable this in System Settings under Keyboard, then Dictation. When "On-Device Only" mode is selected, your audio is processed locally using Apple's speech recognition models.

How It Works

Apple ships compact speech recognition models as part of macOS. These models are optimized for Apple Silicon, using the Neural Engine for inference. When you activate dictation, the audio is processed through these local models without any network request.

Accuracy

On-device accuracy is noticeably lower than with cloud-based transcription, because Apple's local models are smaller and less capable than the cloud-based alternatives. Common issues include wrong homophones, dropped words, garbled technical terms, and missing or misplaced punctuation.

For simple, clearly spoken English in a quiet environment, accuracy is reasonable — perhaps 90 to 93%. For anything more demanding, the error rate increases significantly compared to cloud-based options that achieve 97 to 99% accuracy.

Pros

  - Free and built into macOS; nothing to install beyond flipping a switch in System Settings.
  - Works with no internet connection at all, and your audio never leaves your Mac.
  - Fast, since the compact models run on the Neural Engine.

Cons

  - Noticeably lower accuracy than cloud-based transcription, especially with accents, background noise, or technical vocabulary.
  - On-device mode requires macOS Ventura or later.

Offline Option 2: Whisper.cpp

Whisper.cpp is an open-source C++ port of OpenAI's Whisper model, optimized for local execution on Apple Silicon. It can run various sizes of the Whisper model — from the tiny model (75MB) to the large-v3 model (approximately 3GB) — entirely on your Mac.

How It Works

You download the model file of your choice and run Whisper.cpp either from the command line or through a graphical wrapper like MacWhisper. The model loads into RAM, and audio is processed through the neural network locally. On an M1 MacBook Pro, the base model processes audio at roughly 4x real-time speed (a 10-second clip takes about 2.5 seconds). The large model is slower, processing at roughly 1x to 2x real-time on M1.
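The command-line workflow described above can be sketched as follows. These commands reflect a typical whisper.cpp checkout; exact paths and binary names vary between releases (newer builds name the binary `whisper-cli` rather than `main`), and the sample file is the demo clip bundled with the repository:

```shell
# Clone and build whisper.cpp (requires Xcode command-line tools)
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp
make

# Download a model; "base.en" (~140MB) is a reasonable starting point
bash ./models/download-ggml-model.sh base.en

# Transcribe a 16kHz WAV file entirely on-device
./main -m models/ggml-base.en.bin -f samples/jfk.wav
```

Larger models are downloaded the same way (for example `large-v3`), trading longer processing time for better accuracy.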

Accuracy by Model Size

The accuracy of local Whisper transcription depends heavily on which model size you use. The tiny and base models are fast but make noticeably more errors; the small and medium models offer a reasonable middle ground; and the large-v3 model approaches cloud-level accuracy at the cost of speed, memory, and disk space.

Pros

  - Accuracy approaches cloud-based transcription when using the larger models.
  - Free, open source, and fully private; audio never leaves your Mac.
  - You choose the model size, trading accuracy against speed and disk space.

Cons

  - Slower than cloud inference, especially the large model (roughly 1x to 2x real-time on M1).
  - The large-v3 model occupies roughly 3GB on disk and substantial RAM while loaded.
  - Requires command-line setup, or a third-party wrapper like MacWhisper.

Offline Option 3: macOS Speech Framework

Developers can build applications using Apple's Speech framework (SFSpeechRecognizer), which provides on-device speech recognition capabilities. Several third-party apps use this framework to offer offline dictation.

How It Works

The Speech framework uses Apple's speech recognition models, similar to the built-in dictation feature. Applications built on this framework can provide custom interfaces and workflows while using the same underlying engine.
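As a sketch, a minimal on-device use of the framework might look like the following. The audio file path is a placeholder, and a real app must also request authorization via `SFSpeechRecognizer.requestAuthorization` before starting a task:

```swift
import Speech

// Check that the recognizer for this locale can work fully offline.
guard let recognizer = SFSpeechRecognizer(locale: Locale(identifier: "en-US")),
      recognizer.supportsOnDeviceRecognition else {
    fatalError("On-device recognition is not available for this locale")
}

// Transcribe an audio file without any network request.
let request = SFSpeechURLRecognitionRequest(url: URL(fileURLWithPath: "/path/to/recording.wav"))
request.requiresOnDeviceRecognition = true  // never send audio off the machine

recognizer.recognitionTask(with: request) { result, error in
    if let result = result, result.isFinal {
        print(result.bestTranscription.formattedString)
    }
}
```

Setting `requiresOnDeviceRecognition` is what guarantees offline operation: without it, the framework may route audio to Apple's servers when a network is available.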

Accuracy

Accuracy is comparable to Apple's built-in dictation, since the same models are used. The advantage is that third-party apps can add pre-processing (noise reduction, voice isolation) and post-processing (text formatting, smart punctuation) that may improve the overall result.

When You Actually Need Offline Dictation

Before choosing an offline dictation solution, consider how often you genuinely need it. Most Mac users have internet access the vast majority of the time. The scenarios where you truly lack connectivity are relatively rare:

  - Flights without reliable Wi-Fi
  - Remote locations, such as cabins or rural areas with no coverage
  - Hotels or venues with unusably slow connections

For most users, these situations represent a small fraction of their total working time. The question becomes: is it worth accepting lower accuracy all the time (by using an offline-only tool) to avoid being without dictation in these rare situations? Or is it better to use a cloud-based tool with superior accuracy most of the time and fall back to Apple's built-in offline dictation when connectivity is unavailable?

The Hybrid Approach

The most practical approach for most users is to use a cloud-based dictation tool as your primary method and keep Apple's built-in offline dictation as a fallback. Here is how this works in practice:

  1. Install Steno (or another cloud-based dictation tool) for daily use. Benefit from AI-powered accuracy for 99% of your dictation needs.
  2. Enable Apple's built-in dictation with on-device processing as a secondary option.
  3. When you are offline, use Apple dictation. The accuracy is lower, but it works without any internet connection.
  4. When you are back online, switch back to your primary tool for the best accuracy.

This hybrid approach gives you the best accuracy when connectivity is available and a serviceable fallback when it is not, without requiring you to compromise on your primary dictation experience.

The Accuracy Gap: Numbers in Context

To put the accuracy difference in concrete terms: at 95% accuracy (a good offline result), a 100-word paragraph will contain approximately 5 errors. At 98% accuracy (typical for cloud-based AI transcription), the same paragraph will contain approximately 2 errors.

The difference between 2 and 5 errors might seem small, but it compounds with volume. If you dictate 2,000 words in a day, that is 100 errors to fix versus 40. At roughly 5 seconds per correction (find the error, position cursor, delete wrong text, type correct text), the less accurate option costs you an additional 5 minutes per day in corrections alone. Over a month, that is nearly two hours of additional editing.
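The arithmetic above can be checked in a few lines of Swift. The 5-second fix time and 22 working days per month are the article's rough assumptions, not measured values:

```swift
// Rough cost model for transcription errors, using the article's numbers.
func errors(words: Int, accuracy: Double) -> Int {
    Int((Double(words) * (1.0 - accuracy)).rounded())
}

let wordsPerDay = 2_000
let offlineErrors = errors(words: wordsPerDay, accuracy: 0.95)  // 100 errors/day
let cloudErrors   = errors(words: wordsPerDay, accuracy: 0.98)  // 40 errors/day

let secondsPerFix = 5.0
let extraSecondsPerDay  = Double(offlineErrors - cloudErrors) * secondsPerFix  // 300s = 5 min
let extraMinutesPerMonth = extraSecondsPerDay / 60.0 * 22                      // 110 min

print("Extra correction time: \(extraSecondsPerDay / 60.0) min/day, \(extraMinutesPerMonth) min/month")
```

At 22 working days, the 5 extra minutes per day add up to 110 minutes per month — the "nearly two hours" cited above.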

More importantly, the nature of the errors differs. Cloud-based AI transcription errors tend to be subtle — a misplaced comma, a capitalization choice — and are quick to fix. Offline transcription errors are more likely to be word-level mistakes — wrong homophones, dropped words, garbled technical terms — that require more cognitive effort to identify and correct.

Looking Ahead: Local Models Are Improving

The accuracy gap between local and cloud-based transcription is closing. Each generation of Apple Silicon brings more Neural Engine capability, and model optimization techniques like quantization and distillation are making it possible to run more capable models in less memory.

Apple has been investing heavily in on-device AI, and future macOS releases will likely ship with significantly more capable local speech recognition. The M4 chip's 38 TOPS Neural Engine is already capable of running models that would have required cloud processing just two years ago.

Similarly, open-source efforts around Whisper and its successors continue to produce smaller, faster models that close the gap with cloud accuracy. The distilled Whisper models can run at near-cloud accuracy on modern Apple Silicon hardware, though with higher latency than a dedicated cloud inference engine.

For now, though, the practical advice is clear: use cloud-based transcription when you can for the best experience, and keep an offline option available for the times when you cannot. Steno handles the first part. Apple's built-in dictation handles the second. Together, they cover every scenario you are likely to encounter.