Every voice-to-text app needs to know when you are speaking. Without a reliable voice detector, the system either misses the beginning of your words, captures silence and background noise, or never knows when to stop recording. Voice activity detection — VAD — is one of the least glamorous but most important components of any speech recognition pipeline.
Understanding how voice detection works helps you get better results from dictation apps, troubleshoot problems when the microphone seems unresponsive, and appreciate why some apps feel instantly reactive while others lag or miss words.
What Voice Activity Detection Does
Voice activity detection is the process of determining, in real time, whether an incoming audio stream contains human speech or just noise. A microphone captures everything — your voice, the fan in your laptop, keyboard clicks, air conditioning hum, traffic outside, music from a nearby room. The voice detector's job is to flag which portions of that audio contain speech and which do not.
This sounds simple but is genuinely difficult. The volume of ambient noise can be comparable to soft speech. A sudden sound — a door closing, a cough — can look like speech to a naive detector. Conversely, a speaker who pauses briefly mid-sentence should not have their speech split into multiple segments by an overly aggressive detector.
Two Approaches to Voice Detection
Energy-Based Detection
The simplest voice detectors use energy thresholding: if the audio signal exceeds a certain volume level, it is classified as speech. If it falls below the threshold, it is classified as silence. This approach is fast and computationally cheap, but it fails in noisy environments where background sound exceeds the speech threshold, and it misses quiet or whispered speech.
Energy-based detection is still used in some systems because of its low computational cost, but it is rarely sufficient on its own for high-quality dictation applications.
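As a concrete sketch, energy thresholding reduces to comparing each audio frame's RMS level against a fixed cutoff. The 0.02 threshold below is illustrative, not a recommended value; real systems calibrate against the measured noise floor:

```python
import math

def frame_rms(samples):
    """Root-mean-square level of one audio frame (floats in [-1, 1])."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def energy_vad(frames, threshold=0.02):
    """Classify each frame as speech (True) or silence (False)
    by comparing its RMS level against a fixed threshold."""
    return [frame_rms(f) > threshold for f in frames]

# A loud frame vs. a near-silent frame (synthetic data)
loud = [0.3, -0.3] * 160      # RMS = 0.3, well above threshold
quiet = [0.001, -0.001] * 160  # RMS = 0.001, below threshold
print(energy_vad([loud, quiet]))  # [True, False]
```

This also makes the failure mode obvious: any background sound whose RMS exceeds the threshold is misclassified as speech, and whispered speech below it is lost.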
Neural Voice Activity Detection
Modern voice detectors use small neural network models trained specifically to distinguish speech from non-speech audio. These models analyze acoustic features — frequency content, temporal patterns, formant structure — rather than simple volume levels. They can reliably identify speech in noisy environments that would completely fool an energy-based detector.
Neural VAD models are typically small enough to run continuously in the background on a Mac or iPhone without measurable performance impact. Apple Silicon's Neural Engine can run these models with negligible power consumption, which is why modern Mac dictation apps can keep the microphone active continuously without draining the battery.
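A detail worth noting: a neural VAD model typically emits a speech probability per frame, and those raw scores flicker around any single cutoff. A common post-processing step is hysteresis, with separate onset and offset thresholds. This is a minimal sketch with illustrative threshold values; real systems tune them per model:

```python
def hysteresis_vad(probs, on=0.6, off=0.4):
    """Convert per-frame speech probabilities (as a neural VAD model
    might produce) into stable speech/non-speech decisions.
    Two thresholds prevent flicker around a single cutoff."""
    speaking = False
    out = []
    for p in probs:
        if speaking:
            speaking = p >= off   # stay on until clearly below
        else:
            speaking = p >= on    # turn on only when clearly above
        out.append(speaking)
    return out

probs = [0.1, 0.7, 0.5, 0.45, 0.3, 0.2]
print(hysteresis_vad(probs))  # [False, True, True, True, False, False]
```

Note how the borderline scores 0.5 and 0.45 stay classified as speech because a segment was already active; a single threshold at 0.6 would have chopped the utterance apart.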
Hotkey vs. Continuous Listening Approaches
Voice detection operates differently depending on how an app is designed to receive input:
Hotkey-Activated Dictation
Apps like Steno use a hold-to-speak model: you hold a hotkey, speak, and release. In this model, voice activity detection still matters — the app needs to know when your speech ends to finalize the transcript — but the user's explicit action (pressing and holding the key) provides the primary signal that recording should begin.
This approach has a significant advantage: it eliminates false positives entirely. The microphone only opens when you deliberately activate it. You never have to worry about ambient conversation being transcribed, or words spoken to a colleague being accidentally inserted into a document.
Always-On Listening
Some dictation systems attempt to run continuously and transcribe automatically whenever speech is detected. This requires highly accurate VAD because the system must distinguish between speech directed at the app and speech directed at other people or content. The challenge is that most VAD systems have no way to make this distinction — they detect all speech, not just speech intended for the app.
Always-on systems work best in controlled environments where the speaker is the only person talking. They are less suitable for shared offices or any environment with ambient human speech.
Voice Detection on Mac: Technical Details
macOS provides several audio APIs that dictation apps use to access microphone input:
- AVAudioEngine: Apple's high-level audio processing framework, suitable for real-time audio capture with built-in support for audio tap installation on any node in the processing graph.
- Core Audio: Lower-level framework offering more control over audio device access, buffer management, and format conversion.
- SFSpeechRecognizer: Apple's built-in speech recognition API, which handles its own VAD internally. Apps that use this API do not need to implement their own voice detection.
Apps that ship their own speech recognition models (rather than Apple's built-in recognizer) must implement VAD independently, either by bundling a dedicated neural VAD model or by adopting an off-the-shelf open-source one.
Why VAD Quality Affects Transcription Quality
Poor voice detection creates problems that cascade into poor transcription. Specifically:
- Clipped beginnings: If the VAD is slow to flag the onset of speech, the first syllable or word of an utterance is cut off before recording begins, producing transcripts that regularly drop the first word.
- Noise contamination: If the VAD does not suppress non-speech audio before sending it to the transcription model, the model receives noisy input and produces worse output.
- Premature cutoffs: An aggressive VAD that ends recording during brief pauses mid-sentence causes the transcript to miss the second half of long sentences.
- Latency: VAD that requires processing multiple frames before making a decision introduces latency. The best systems make decisions within one or two audio frames (20–40 milliseconds).
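The premature-cutoff problem above is usually addressed with a "hangover" rule: a segment ends only after a run of consecutive silence frames, so brief mid-sentence pauses do not split the transcript. A minimal sketch, with an illustrative 25-frame hangover (~500 ms at 20 ms per frame):

```python
def segment_speech(flags, hangover_frames=25):
    """Group per-frame speech flags into (start, end) utterance
    segments. A segment closes only after `hangover_frames`
    consecutive non-speech frames, bridging short pauses."""
    segments, start, silence = [], None, 0
    for i, is_speech in enumerate(flags):
        if is_speech:
            if start is None:
                start = i          # utterance begins
            silence = 0            # any speech resets the hangover
        elif start is not None:
            silence += 1
            if silence >= hangover_frames:
                # close segment at the last speech frame
                segments.append((start, i - silence))
                start, silence = None, 0
    if start is not None:          # stream ended mid-utterance
        segments.append((start, len(flags) - 1 - silence))
    return segments

# A 5-frame pause inside an utterance is bridged, not split
flags = [True] * 10 + [False] * 5 + [True] * 10 + [False] * 30
print(segment_speech(flags))  # [(0, 24)]
```

Tuning `hangover_frames` is exactly the trade-off described above: too short and sentences get chopped, too long and the recording lingers after you stop speaking.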
Practical Implications for Dictation Users
Understanding voice detection helps you troubleshoot common dictation problems:
If your app misses the first word regularly, the VAD latency is too high. Try speaking a syllable before your intended content — a soft "um" gives the detector time to activate. Alternatively, look for an app that pre-buffers audio so the first word is never lost.
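Pre-buffering can be sketched as a small ring buffer that always holds the last few hundred milliseconds of audio while idle; when the detector fires, the buffered frames are prepended to the recording so the first word survives. Frame labels and buffer sizes here are illustrative:

```python
from collections import deque

class PreBuffer:
    """Retain the most recent audio frames while idle so that,
    when the VAD fires, audio captured just before detection
    can be prepended and the first word is not clipped."""

    def __init__(self, max_frames=15):  # ~300 ms at 20 ms/frame
        self.ring = deque(maxlen=max_frames)

    def push(self, frame):
        self.ring.append(frame)  # oldest frame drops out automatically

    def flush(self):
        """Return and clear the buffered pre-speech audio."""
        frames = list(self.ring)
        self.ring.clear()
        return frames

buf = PreBuffer(max_frames=3)
for frame in ["f1", "f2", "f3", "f4"]:  # f1 falls off the ring
    buf.push(frame)
# VAD just fired: prepend buffered history to the live frames
recording = buf.flush() + ["f5", "f6"]
print(recording)  # ['f2', 'f3', 'f4', 'f5', 'f6']
```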
If your app keeps recording after you stop, the VAD silence detection threshold is too long. Look for settings that let you adjust the trailing silence duration before the session ends.
If your app transcribes background noise, the detector's sensitivity threshold is too permissive for your environment. Moving to a quieter location, using a directional microphone, or switching to a hotkey-activated model eliminates this problem.
iPhone Voice Detection
On iPhone, voice detection follows similar principles. The hardware microphone array in modern iPhones includes signal processing that isolates the speaker's voice, reducing ambient noise before the audio even reaches software. This hardware-level voice isolation means iPhone dictation often works better in noisy environments than dictation on a Mac with a standard built-in microphone.
A voice detector that works flawlessly is invisible. You notice it only when it fails — which is why the best apps invest heavily in getting this seemingly minor component exactly right.
For more on the technical architecture behind modern voice-to-text apps, see our deep dive on how Steno works under the hood.