Hold to Speak Dictation: Why Push-to-Talk Beats Toggle Dictation

Every dictation app has to answer the same question: how does the user tell the app when to listen? The answer to this seemingly simple question has large consequences for accuracy, speed, and the overall dictation experience. Most apps use a toggle approach: click a button to start, click again to stop. Steno takes a different path with hold to speak dictation, which ties listening directly to a key you hold down.

What Is Hold to Speak Dictation?

Hold to speak dictation works exactly like a walkie-talkie. You press and hold a hotkey, speak your words, and release the key when you are done. The moment you release, your speech is transcribed and the resulting text appears at your cursor position. No buttons to click, no modes to manage, no wondering whether the microphone is still listening.

This interaction model is sometimes called push-to-talk, and it has been the gold standard in voice communication for decades. Air traffic controllers, military radio operators, and gamers all use push-to-talk because it gives the speaker absolute control over when their voice is transmitted. Steno applies this same principle to dictation.

The Problem with Toggle Dictation

Toggle dictation is what most people are familiar with. Apple's built-in dictation, Dragon NaturallySpeaking, and most third-party dictation apps use some version of this: activate dictation, speak until you are done, then deactivate it. This creates several problems that compound over time.

Accidental Captures

The most common complaint about toggle dictation is that it picks up things you did not intend to dictate. You activate dictation, speak a sentence, then turn to answer a colleague's question. The dictation engine faithfully transcribes your side conversation, your "um" as you think about what to say next, and the notification sound from your phone. With hold to speak, you release the key before you turn away, so none of this is recorded. The microphone only listens while your finger is on the key.

The "Is It Still Listening?" Problem

Toggle dictation forces you to maintain awareness of a hidden state. Is dictation currently active or inactive? After a pause of several seconds, did the system auto-stop? Did I accidentally double-tap and toggle it off? This cognitive overhead is small but constant, and it pulls your attention away from the actual task of composing your thoughts. Hold to speak eliminates this entirely. The state is physically embodied in your finger: key down means listening, key up means not listening.
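The difference between hidden state and physically embodied state can be made concrete with a small model. The following is a minimal Swift sketch, not Steno's actual code; the type names ToggleDictation and HoldDictation are made up for illustration.

```swift
// Toggle model: "listening" is hidden state that flips on every tap.
struct ToggleDictation {
    private(set) var isListening = false
    mutating func tap() { isListening.toggle() }
}

// Hold model: "listening" simply mirrors the physical position of the key.
struct HoldDictation {
    private(set) var isListening = false
    mutating func keyDown() { isListening = true }
    mutating func keyUp() { isListening = false }
}

var toggle = ToggleDictation()
toggle.tap()                               // intended: start dictation
toggle.tap()                               // accidental double-tap: silently stops it
let toggleListening = toggle.isListening   // the user believes it is on, but it is off

var hold = HoldDictation()
hold.keyDown()
hold.keyDown()                             // a duplicate key-down event is harmless
let holdWhilePressed = hold.isListening
hold.keyUp()
let holdAfterRelease = hold.isListening
```

In the toggle model the same input means "start" or "stop" depending on state the user cannot see. In the hold model each input has exactly one meaning, so a repeated event cannot put the app out of sync with the user's expectation.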

Latency Uncertainty

With toggle dictation, something has to mark the point where you are "done" speaking. Some apps wait for a long pause. Others require you to click stop. Either way, there is a delay between finishing your thought and seeing the text appear. With hold to speak, the endpoint is crisp: the moment you release the key, the audio is sent for transcription. There is no ambiguity and no waiting for the system to decide you have finished.
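To see where the pause-based delay comes from, here is a rough sketch of a silence-based endpointer in Swift. It is a generic illustration, not the algorithm of any particular app; the function name endpointFrame, the 20 ms frame size, the energy threshold, and the 50-frame hangover are all assumed values.

```swift
// Hypothetical silence-based endpointer: it declares the utterance finished
// only after `hangoverFrames` consecutive quiet frames that follow some speech.
func endpointFrame(energies: [Double], threshold: Double, hangoverFrames: Int) -> Int? {
    var heardSpeech = false
    var quietRun = 0
    for (index, energy) in energies.enumerated() {
        if energy >= threshold {
            heardSpeech = true
            quietRun = 0                       // any speech resets the silence counter
        } else if heardSpeech {
            quietRun += 1
            if quietRun >= hangoverFrames { return index }
        }
    }
    return nil                                 // never saw enough trailing silence
}

let frameMs = 20                               // assumed frame duration
// One quiet frame, five frames of speech (indices 1...5), then silence.
let energies: [Double] = [0.0, 0.8, 0.9, 0.7, 0.8, 0.6] + Array(repeating: 0.01, count: 60)
let lastSpeechFrame = 5

let detected = endpointFrame(energies: energies, threshold: 0.1, hangoverFrames: 50)
// Time the detector spends waiting after the speaker has actually finished.
let addedDelayMs = detected.map { ($0 - lastSpeechFrame) * frameMs }
```

With these numbers the detector fires 50 frames after the last spoken frame, which is a full second of waiting before transcription can even start. A shorter hangover reduces the wait but makes it more likely that an ordinary mid-sentence pause ends the utterance early. A key release needs neither the threshold nor the hangover.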

How Steno Implements Hold to Speak

Steno is built as a native macOS app in Swift, and the hold-to-speak mechanism is deeply integrated into the system. Here is how it works under the hood.

Global Hotkey Listener

Steno registers a global hotkey that works in any application. By default, this is the right Option key, but you can customize it. The key press event is captured at the system level, so it works whether you are in a text editor, a browser, a terminal, or any other app. The hotkey listener runs with minimal overhead because it is implemented using native macOS event taps rather than polling.
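As a rough illustration of the logic behind such a listener, the sketch below turns raw modifier events into press and release transitions. It is an assumption about how this could be done, not Steno's source. On macOS, modifier keys such as Option arrive as "flags changed" events that carry a key code and a modifier bitmask; key code 61 is the right Option key (kVK_RightOption) and 0x80000 is the Option modifier mask. The event tap itself is left out, and HotkeyTracker and HotkeyTransition are hypothetical names.

```swift
enum HotkeyTransition { case pressed, released }

// Hypothetical tracker fed with the key code and modifier flags of each
// "flags changed" event that the system-level listener receives.
struct HotkeyTracker {
    let keyCode: UInt16
    let flagMask: UInt64
    private(set) var isDown = false

    init(keyCode: UInt16, flagMask: UInt64) {
        self.keyCode = keyCode
        self.flagMask = flagMask
    }

    mutating func handle(keyCode eventKeyCode: UInt16, flags: UInt64) -> HotkeyTransition? {
        // Events for any other key are not our hotkey.
        guard eventKeyCode == keyCode else { return nil }
        let modifierActive = (flags & flagMask) != 0
        // The Option flag is shared by both Option keys, so it can stay set if the
        // other one is held. A second event for our key while down means release.
        let nowDown = modifierActive && !isDown
        guard nowDown != isDown else { return nil }
        isDown = nowDown
        return nowDown ? .pressed : .released
    }
}

let rightOptionKeyCode: UInt16 = 61      // kVK_RightOption
let optionFlagMask: UInt64 = 0x80000     // Option (alternate) modifier bit

var tracker = HotkeyTracker(keyCode: rightOptionKeyCode, flagMask: optionFlagMask)
let ignored = tracker.handle(keyCode: 0, flags: 0)                 // the "A" key: ignored
let press = tracker.handle(keyCode: 61, flags: optionFlagMask)     // right Option goes down
let release = tracker.handle(keyCode: 61, flags: 0)                // right Option comes up
```

Because the tracker only reacts when an event arrives, it does no work while the keyboard is idle, which is the practical advantage of an event-driven listener over polling.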

Instant Recording Start

When the hotkey is pressed down, Steno immediately begins capturing audio from your microphone. There is no noticeable warmup delay and no "preparing to listen" state. The audio buffer starts filling almost as soon as the key goes down. This is possible because Steno keeps the audio subsystem in a ready state, so microphone activation is near-instantaneous.
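The post does not spell out what "ready state" means internally, so the following is only one possible reading, sketched in Swift: the capture pipeline is set up ahead of time, and a key press merely opens a gate that decides whether incoming audio chunks are kept. CaptureBuffer and its methods are hypothetical names, and whether Steno keeps an input stream open or just pre-configures one is not stated.

```swift
// Hypothetical capture buffer: chunks may be delivered at any time,
// but they are stored only between begin() and end().
struct CaptureBuffer {
    private var samples: [Int16] = []
    private var isRecording = false

    // Key down: start a fresh clip, reusing the already allocated storage.
    mutating func begin() {
        samples.removeAll(keepingCapacity: true)
        isRecording = true
    }

    // Called for every audio chunk; anything outside a hold is discarded.
    mutating func receive(_ chunk: [Int16]) {
        if isRecording { samples.append(contentsOf: chunk) }
    }

    // Key up: close the gate and hand back the recorded clip.
    mutating func end() -> [Int16] {
        isRecording = false
        return samples
    }
}

var capture = CaptureBuffer()
capture.receive([1, 2])        // before the key press: discarded
capture.begin()                // key down
capture.receive([3, 4])
capture.receive([5])
let clip = capture.end()       // key up
capture.receive([6])           // after release: discarded
```

Starting a recording in this design is just a flag change and a buffer reset, which is why it can feel instantaneous. Nothing outside the hold window ever reaches the clip that gets transcribed.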

Release and Transcribe

When you release the hotkey, Steno stops recording, packages the audio, and sends it to the Groq Whisper API for transcription. The transcribed text is then inserted at your current cursor position using macOS accessibility APIs. The entire cycle, from key release to text appearing on screen, takes less than one second in most cases.
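As an illustration of the "packages the audio" step, the sketch below wraps raw samples in a minimal WAV container, a format that speech-to-text services commonly accept as a file upload. The post does not say which format or sample rate Steno actually uses, so 16-bit mono PCM at 16 kHz is an assumption, and wavBytes is a hypothetical helper. The network request is left out.

```swift
// Wrap 16-bit mono PCM samples in a minimal 44-byte WAV (RIFF) header.
func wavBytes(samples: [Int16], sampleRate: UInt32) -> [UInt8] {
    let channels: UInt16 = 1
    let bitsPerSample: UInt16 = 16
    let blockAlign = channels * bitsPerSample / 8          // bytes per sample frame
    let byteRate = sampleRate * UInt32(blockAlign)
    let dataSize = UInt32(samples.count) * UInt32(blockAlign)

    var bytes: [UInt8] = []
    // WAV stores every integer field in little-endian byte order.
    func appendLE<T: FixedWidthInteger>(_ value: T) {
        for shift in stride(from: 0, to: T.bitWidth, by: 8) {
            bytes.append(UInt8(truncatingIfNeeded: value >> shift))
        }
    }

    bytes.append(contentsOf: Array("RIFF".utf8))
    appendLE(UInt32(36) + dataSize)       // size of everything after this field
    bytes.append(contentsOf: Array("WAVE".utf8))
    bytes.append(contentsOf: Array("fmt ".utf8))
    appendLE(UInt32(16))                  // size of the fmt chunk body
    appendLE(UInt16(1))                   // audio format 1 = uncompressed PCM
    appendLE(channels)
    appendLE(sampleRate)
    appendLE(byteRate)
    appendLE(blockAlign)
    appendLE(bitsPerSample)
    bytes.append(contentsOf: Array("data".utf8))
    appendLE(dataSize)
    for sample in samples { appendLE(sample) }
    return bytes
}

// Three samples at an assumed 16 kHz sample rate.
let wav = wavBytes(samples: [0, 1, -1], sampleRate: 16_000)
```

Because a hold-to-speak clip is short, the whole file is small: at these assumed settings, ten seconds of speech is about 320 KB, which keeps the upload from dominating the round trip.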

Visual Feedback

While you hold the key, Steno shows a subtle overlay indicator so you know recording is active. When you release and transcription is in progress, the indicator changes to reflect the processing state. This gives you confidence in what the system is doing without demanding your attention.
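The recording and processing states described above fit a three-phase state machine. This is a minimal Swift sketch under that assumption; the names Phase, Event, advance, and indicator, as well as the indicator labels, are illustrative and not taken from Steno.

```swift
enum Phase { case idle, recording, transcribing }
enum Event { case keyDown, keyUp, transcriptArrived }

// Pure transition function: the next phase depends only on the current phase and the event.
func advance(_ phase: Phase, on event: Event) -> Phase {
    switch (phase, event) {
    case (.idle, .keyDown): return .recording
    case (.recording, .keyUp): return .transcribing
    case (.transcribing, .transcriptArrived): return .idle
    default: return phase      // ignore events that make no sense in this phase
    }
}

// What the overlay would show in each phase (labels are placeholders).
func indicator(for phase: Phase) -> String {
    switch phase {
    case .idle: return "hidden"
    case .recording: return "recording"
    case .transcribing: return "processing"
    }
}

// A hold with a duplicate key-down, followed by release and the transcript arriving.
let events: [Event] = [.keyDown, .keyDown, .keyUp, .transcriptArrived]
var phase = Phase.idle
var shown: [String] = []
for event in events {
    phase = advance(phase, on: event)
    shown.append(indicator(for: phase))
}
```

Deriving the indicator from the phase, rather than setting it separately, means the overlay cannot disagree with what the app is actually doing. This sketch ignores a new key press while a transcription is still in flight; a real app would need to decide whether to queue or reject it.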

Why Hold to Speak Produces Better Transcriptions

Beyond the user experience advantages, hold to speak actually produces more accurate transcriptions. There are several reasons for this.

First, the audio is cleaner. Because you are only recording when you intend to speak, the recording contains no side conversations, no half-sentences spoken while you think, and no cross-talk picked up between utterances. Ambient noise during the hold is still captured, but the transcription engine receives only the stretch of audio you chose to record.

Second, the audio segments are shorter and more focused. Instead of sending a five-minute stream of audio with pauses, false starts, and digressions, you send focused bursts of speech, each one a complete thought. Shorter, cleaner audio segments are easier for any speech recognition system to handle accurately.

Third, there is no need for endpoint detection algorithms. Many transcription errors in toggle-mode dictation come from the system incorrectly deciding where one utterance ends and another begins. With hold to speak, the boundaries are explicit.

When to Use Hold to Speak vs. Continuous Dictation

Hold to speak is ideal for the way most people actually work at a computer. You think of a sentence, dictate it, review it, then think of the next one. This maps perfectly to the hold-release-review cycle. It works brilliantly for email composition, filling in forms, writing code comments, taking notes during meetings, and any other task where you are producing text in bursts rather than continuous streams.

Continuous dictation still has its place for long-form monologues where you want to speak for several minutes without interruption. But for the vast majority of daily computer use, the burst-and-review pattern of hold to speak is both faster and more accurate.

Making the Switch

If you have been using toggle dictation and want to try hold to speak, the adjustment period is surprisingly short. Most Steno users report that the interaction feels natural within the first few minutes. The physical metaphor of holding a key to talk is intuitive in a way that toggling a mode is not.

Steno is available as a free download for macOS, with a Pro tier at $4.99 per month that unlocks unlimited dictation and advanced features. The hold-to-speak interaction is available in both tiers. You can download it at stenofast.com and be dictating within 30 seconds of installation.

The best interface is one you do not have to think about. Hold to speak removes the cognitive overhead of managing dictation state, letting you focus entirely on what you want to say.