Steno looks simple on the surface: hold a key, speak, release, and text appears. But behind that simplicity is a carefully orchestrated pipeline that touches audio hardware, signal processing, speech recognition, text filtering, and deep macOS system integration. This post walks through the entire architecture for anyone curious about what happens between your voice and your cursor.
The Foundation: Swift, SwiftUI, and AppKit
Steno is written entirely in Swift. The UI layer uses a combination of SwiftUI and AppKit. SwiftUI handles the settings panels, history views, and popover interfaces where its declarative approach keeps the code lean. AppKit is used for the parts that need deeper system access: the menu bar item itself, global event monitoring, and window management.
Being a menu bar app means Steno has no dock icon and no main window. It lives in the top-right corner of your screen as a small icon. When you click it, a popover drops down with your recent dictations, statistics, and settings. This architecture keeps the app out of your way while remaining instantly accessible.
The Hotkey Listener
Everything starts with a keypress. Steno registers a global hotkey listener using CGEvent taps. This is a low-level macOS API that lets the app intercept keyboard events system-wide, regardless of which application is in the foreground. When you press and hold the configured hotkey (Left Control by default), Steno begins recording. When you release, recording stops and transcription begins.
The global event tap requires Accessibility permission, which is why Steno asks for it during first launch. Without it, the app cannot detect keypresses outside its own process. This is the same permission mechanism that other keyboard-driven utilities like Alfred, Raycast, and Karabiner use.
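The hotkey flow described above can be sketched with a session-wide event tap. This is a hedged illustration, not Steno's actual source: the keycode, tap options, and the `beginRecording` / `stopRecordingAndTranscribe` hooks are placeholders.

```swift
import CoreGraphics

// Placeholder hooks; the real app wires these into its recording pipeline.
func beginRecording() { /* start audio capture */ }
func stopRecordingAndTranscribe() { /* stop capture, run VAD, transcribe */ }

// Sketch: a listen-only event tap watching modifier-key changes.
// Left Control is virtual keycode 59 on macOS.
func startHotkeyTap() {
    let mask = CGEventMask(1 << CGEventType.flagsChanged.rawValue)
    guard let tap = CGEvent.tapCreate(
        tap: .cgSessionEventTap,            // system-wide, any frontmost app
        place: .headInsertEventTap,
        options: .listenOnly,               // observe, never swallow keys
        eventsOfInterest: mask,
        callback: { _, _, event, _ in
            if event.getIntegerValueField(.keyboardEventKeycode) == 59 {
                if event.flags.contains(.maskControl) {
                    beginRecording()        // key went down
                } else {
                    stopRecordingAndTranscribe()  // key released
                }
            }
            return Unmanaged.passUnretained(event)
        },
        userInfo: nil
    ) else {
        return  // tapCreate returns nil without Accessibility permission
    }

    let source = CFMachPortCreateRunLoopSource(kCFAllocatorDefault, tap, 0)
    CFRunLoopAddSource(CFRunLoopGetCurrent(), source, .commonModes)
    CGEvent.tapEnable(tap: tap, enable: true)
}
```

The `listenOnly` option matters here: the tap observes keystrokes without modifying or consuming them, so the hotkey still behaves normally for every other application.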
The Audio Recording Pipeline
Audio capture runs on AVAudioEngine, Apple's real-time audio processing framework. When the hotkey is pressed, Steno starts the audio engine and attaches a tap to the input node. Audio flows in as a stream of PCM buffers at the sample rate the speech recognition engine expects.
The pipeline is deliberately minimal. There is no pre-processing, noise reduction, or audio manipulation applied to the raw stream. Modern speech recognition handles noisy audio far better than hand-rolled filters, so we send the cleanest possible signal and let the recognition engine do its job. This also keeps CPU usage low during recording, which matters for a background app that should never interfere with whatever you are actually working on.
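The capture setup is small enough to sketch in full. Buffer size and the storage strategy below are illustrative assumptions, but the shape, a single tap on the input node with no processing in between, matches the description above.

```swift
import AVFoundation

let engine = AVAudioEngine()
var captured: [AVAudioPCMBuffer] = []   // raw PCM, accumulated per dictation

// Start pulling PCM buffers from the default input device.
func startCapture() throws {
    let input = engine.inputNode
    let format = input.outputFormat(forBus: 0)   // device's native format
    input.installTap(onBus: 0, bufferSize: 4096, format: format) { buffer, _ in
        captured.append(buffer)   // no filtering, no noise reduction
    }
    try engine.start()
}

// Tear down the tap when the hotkey is released.
func stopCapture() {
    engine.inputNode.removeTap(onBus: 0)
    engine.stop()
}
```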
Voice Activity Detection
Not every hotkey press results in speech. You might accidentally hold the key, or press it and then decide not to say anything. Steno handles this with voice activity detection (VAD). While audio is being recorded, Steno calculates the speech ratio: the proportion of audio frames that contain actual speech versus silence.
If you release the hotkey and the speech ratio is below a certain threshold, Steno discards the recording entirely and does not send it for transcription. This saves unnecessary processing and avoids the situation where silence or background noise gets transcribed into phantom text. The threshold is tuned to be generous enough that even a quiet or short utterance is captured, but strict enough that holding the key while not speaking produces no output.
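The speech-ratio gate reduces to a few lines of arithmetic. The energy floor and the 10% minimum ratio below are invented for illustration; a real VAD classifies frames with something more robust than a single energy threshold.

```swift
import Foundation

// Fraction of frames whose energy clears a silence floor.
// (Frame energies would come from the actual VAD; values here are illustrative.)
func speechRatio(frameEnergies: [Float], silenceFloor: Float = 0.01) -> Float {
    guard !frameEnergies.isEmpty else { return 0 }
    let speechFrames = frameEnergies.filter { $0 > silenceFloor }.count
    return Float(speechFrames) / Float(frameEnergies.count)
}

// Discard the recording when too little of the audio contains speech.
func shouldTranscribe(frameEnergies: [Float], minimumRatio: Float = 0.1) -> Bool {
    speechRatio(frameEnergies: frameEnergies) >= minimumRatio
}
```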
Transcription
Once recording finishes and passes the voice activity check, the audio buffer is sent to a speech recognition service that converts it to text. Transcription typically completes in 200 to 500 milliseconds, depending on the length of the audio and network conditions.
The response is a plain text string. Steno does not rely on word-level timestamps or confidence scores. The recognition accuracy is high enough that post-processing can focus on formatting rather than error correction.
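The round trip can be sketched as a single request. Everything concrete here is a placeholder: the endpoint URL, the content type, and the response format are assumptions, since the post does not name the provider. Only the overall shape, audio in, plain text out, comes from the description above.

```swift
import Foundation

// Hedged sketch of the transcription call. The URL and headers are
// placeholders, not the actual service Steno talks to.
func transcribe(_ audio: Data) async throws -> String {
    var request = URLRequest(url: URL(string: "https://api.example.com/transcribe")!)
    request.httpMethod = "POST"
    request.setValue("audio/wav", forHTTPHeaderField: "Content-Type")
    request.httpBody = audio

    let (data, _) = try await URLSession.shared.data(for: request)
    // The response is treated as a plain text string; no word-level
    // timestamps or confidence scores are consumed downstream.
    return String(decoding: data, as: UTF8.self)
}
```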
Hallucination Filtering
One challenge with any speech recognition system is hallucinations: output text that does not correspond to anything actually said. This can happen with background noise, breathing sounds, or very short audio clips. Common hallucinations include phrases like "Thank you for watching," "Subscribe to my channel," or random words repeated in a loop.
Steno runs every transcription result through a hallucination filter before inserting it. The filter checks against a curated list of known hallucination patterns and also looks for suspicious characteristics like very short audio producing unusually long text. If a result is flagged as a likely hallucination, it is silently discarded. The user sees nothing, which is exactly the right behavior. An absent result is always better than a wrong one appearing in the middle of your email.
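A filter like this boils down to two checks: a blocklist match and a length-versus-duration heuristic. The pattern list and the numeric cutoffs below are illustrative, not Steno's actual values.

```swift
import Foundation

// Illustrative blocklist of known hallucination phrases.
let knownHallucinations: Set<String> = [
    "thank you for watching",
    "thanks for watching",
    "subscribe to my channel",
]

func isLikelyHallucination(_ text: String, audioSeconds: Double) -> Bool {
    // Normalize: strip surrounding whitespace and punctuation, lowercase.
    let normalized = text
        .trimmingCharacters(in: .whitespacesAndNewlines)
        .trimmingCharacters(in: .punctuationCharacters)
        .lowercased()
    if knownHallucinations.contains(normalized) { return true }

    // Very short audio producing unusually long text is suspicious.
    // (The 1-second / 80-character cutoffs are made up for this example.)
    if audioSeconds < 1.0 && text.count > 80 { return true }

    return false
}
```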
Text Insertion via the Accessibility API
This is where Steno diverges most from how you might expect a dictation app to work. Many dictation tools use the clipboard: they copy text to the pasteboard and then simulate Cmd+V. This approach is fast, but it has a critical flaw: it overwrites whatever the user had on their clipboard. You lose whatever you last copied, and if you were in the middle of a copy-paste workflow, your clipboard is now corrupted.
Steno avoids this entirely. It uses the macOS Accessibility API to insert text character by character into the focused text field. The app identifies the currently focused UI element, confirms it accepts text input, and then programmatically types each character. This approach preserves the clipboard, works in virtually every text field on the system, and produces text insertion that looks identical to keyboard typing from the target application's perspective.
Character-by-character insertion is slightly slower than clipboard paste, but the difference is imperceptible for typical dictation lengths. A 30-word sentence takes only a few extra milliseconds. The tradeoff is overwhelmingly worth it: your clipboard stays untouched, and there are no side effects from the insertion process.
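One common way to synthesize per-character typing on macOS is to post keyboard events that carry each character as a Unicode string, so no real keycode mapping is needed. The sketch below shows that technique; Steno's actual implementation also verifies the focused element through the Accessibility API first, which is omitted here.

```swift
import CoreGraphics

// Sketch of clipboard-free text insertion: each character is delivered
// as a synthetic key-down/key-up pair carrying the character's UTF-16
// units, so the target app sees ordinary typing.
func typeText(_ text: String) {
    for character in text {
        let utf16 = Array(String(character).utf16)

        let down = CGEvent(keyboardEventSource: nil, virtualKey: 0, keyDown: true)
        down?.keyboardSetUnicodeString(stringLength: utf16.count, unicodeString: utf16)
        down?.post(tap: .cghidEventTap)

        let up = CGEvent(keyboardEventSource: nil, virtualKey: 0, keyDown: false)
        up?.keyboardSetUnicodeString(stringLength: utf16.count, unicodeString: utf16)
        up?.post(tap: .cghidEventTap)
    }
}
```

Because the events carry the text directly rather than simulating a paste, the pasteboard is never touched, which is the whole point of the approach described above.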
Why Native Matters for Latency
Every millisecond in the pipeline counts. When you release the hotkey, you expect text to appear almost immediately. If there is a noticeable delay, the experience breaks down. This is why Steno is a native Swift app rather than an Electron wrapper or a web-based tool.
A native app starts the audio engine instantly, there is no JavaScript event loop adding latency to the hotkey detection, and the Accessibility API calls go directly through system frameworks without any bridging layer. The total overhead added by the app itself, from keypress detection to audio start to post-transcription text insertion, is under 50 milliseconds. The dominant latency is the transcription itself, and we cannot control that. But we can make sure everything around it is as fast as the hardware allows.
The result is a tool that feels instantaneous. You speak, you release, and the text is there. That responsiveness is not accidental. It is the direct consequence of building on native frameworks and obsessing over every step in the pipeline.