When you speak a sentence and expect it to appear at your cursor, every millisecond matters. The gap between pressing a hotkey and seeing your words materialize is the difference between a tool that feels like magic and one that feels like a chore. This is why we built Steno as a native Mac dictation app in Swift — not in Electron, not as a web wrapper, and not in a cross-platform framework.

The decision to go native was not about ideology. It was about physics. Audio capture, system-level text injection, and menu bar integration all demand tight coupling with macOS internals. A native Swift app can access these capabilities directly, without the overhead layers that make cross-platform tools sluggish.

The Latency Problem with Non-Native Apps

Most voice-to-text tools on the Mac are built with Electron or similar web-based frameworks. Electron bundles an entire Chromium browser engine into every application. For a text editor or a project management tool, that overhead is tolerable. For a dictation app that needs to capture audio in real time, process it, and inject text at the cursor position, it is not.

Consider what happens when you hold a hotkey and start speaking in a non-native dictation app. The keypress event travels through the JavaScript event loop. Audio capture goes through a web API abstraction layer. The audio buffer is marshalled between the native layer and the JavaScript runtime. Each of these steps adds latency — often 50 to 200 milliseconds of unnecessary delay before recording even begins.

In Steno, the hotkey listener is registered directly with the macOS Carbon Events API. Audio capture starts through AVFoundation with direct hardware access. There is no intermediate runtime, no garbage collector pausing at inconvenient moments, no event loop contention. The result is that recording begins within single-digit milliseconds of pressing the hotkey.
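For readers curious what direct registration looks like, here is a minimal sketch of binding a global hotkey through the Carbon Events API. The key choice (F5), hotkey ID, and signature are illustrative placeholders, not Steno's actual configuration:

```swift
import Carbon.HIToolbox

// Sketch: register a process-wide hotkey with the Carbon Events API.
// The key code (kVK_F5), signature, and ID below are placeholders.
func registerDictationHotkey() {
    var eventType = EventTypeSpec(eventClass: OSType(kEventClassKeyboard),
                                  eventKind: UInt32(kEventHotKeyPressed))

    // Install a handler that fires the moment the hotkey is pressed.
    InstallEventHandler(GetApplicationEventTarget(), { _, _, _ in
        // Start audio capture here -- no JavaScript event loop in the path.
        return noErr
    }, 1, &eventType, nil, nil)

    // Bind F5 (no modifiers) to the handler above.
    var hotKeyRef: EventHotKeyRef?
    let hotKeyID = EventHotKeyID(signature: OSType(0x5354_454E), id: 1)
    RegisterEventHotKey(UInt32(kVK_F5), 0, hotKeyID,
                        GetApplicationEventTarget(), 0, &hotKeyRef)
}
```

Because the handler is a plain C callback invoked by the window server's event path, nothing sits between the keypress and the first line of capture code.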

Memory and System Resources

Steno's entire application package is 1.7 megabytes. A typical Electron-based dictation app weighs in at 150 to 300 megabytes. This is not just a disk space consideration — it reflects a fundamental difference in runtime resource consumption.

An Electron app running idle in your menu bar consumes 80 to 150 megabytes of RAM for the Chromium renderer process alone. Steno, sitting in your menu bar waiting for the hotkey, uses roughly 12 megabytes. On a MacBook Air with 8GB of RAM, this difference is significant. Every megabyte consumed by a background utility is a megabyte unavailable for the application you are actually trying to use.

Swift's deterministic memory management through Automatic Reference Counting means Steno releases memory the instant it is no longer needed. There are no garbage collection pauses, no memory spikes during collection cycles. When you finish dictating, the audio buffers are freed immediately. The memory footprint drops back to its idle baseline within milliseconds.
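A toy example makes the ARC behavior concrete. The class below is hypothetical, not Steno's actual buffer type, but it shows the deterministic release point:

```swift
// Illustration only: ARC frees an object the moment its last
// reference disappears. This is not Steno's real buffer class.
final class AudioBuffer {
    let samples: [Float]
    init(sampleCount: Int) {
        samples = [Float](repeating: 0, count: sampleCount)
    }
    deinit { print("buffer freed") } // runs deterministically, no GC cycle
}

func transcribeSession() {
    let buffer = AudioBuffer(sampleCount: 16_000)
    // ... process buffer ...
    _ = buffer.samples.count
} // deinit fires right here, as the scope ends
```

There is no collector deciding when "later" is; deallocation happens at a point you can see in the source.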

System Integration That Cross-Platform Cannot Match

A dictation app is fundamentally a system utility. It needs to work everywhere — in your email client, your code editor, your browser, your terminal. This requires deep integration with macOS accessibility APIs, the pasteboard system, and input method services.

Text Injection

When Steno finishes transcribing your speech, it needs to place text at the cursor position in whatever application is currently focused. This is accomplished through the macOS Accessibility API (AX API), which allows programmatic interaction with text fields across all applications. In Swift, this is a direct call to AXUIElementSetAttributeValue. In a cross-platform framework, this requires bridging through native modules, adding complexity and potential failure points.
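A hedged sketch of that call path, assuming the common technique of writing to the focused element's selected-text attribute (which inserts at the cursor); error handling and fallbacks such as pasteboard-based paste are omitted:

```swift
import ApplicationServices

// Sketch: insert text at the cursor of the frontmost app via the
// Accessibility API. Requires the Accessibility permission.
func injectText(_ text: String) {
    let systemWide = AXUIElementCreateSystemWide()

    // Find the UI element that currently has keyboard focus.
    var focused: CFTypeRef?
    let err = AXUIElementCopyAttributeValue(
        systemWide, kAXFocusedUIElementAttribute as CFString, &focused)
    guard err == .success, let focused else { return }

    // Replacing the (empty) selection inserts text at the caret.
    AXUIElementSetAttributeValue(focused as! AXUIElement,
                                 kAXSelectedTextAttribute as CFString,
                                 text as CFString)
}
```

Real-world code also has to handle applications that do not expose a writable selected-text attribute, which is where a pasteboard fallback typically comes in.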

Menu Bar Presence

Steno lives in your menu bar as an agent app — one with the LSUIElement flag set in its Info.plist, a macOS convention for applications that have no Dock icon and no main window. It appears only as a small icon in your menu bar, ready when you need it, invisible when you do not. This application model is specific to macOS and requires native AppKit integration that cross-platform frameworks simulate imperfectly at best.
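The AppKit side of this is small. A minimal sketch (icon, menu, and names are illustrative):

```swift
import AppKit

// Sketch: a menu bar-only presence. The Info.plist additionally sets
// LSUIElement to YES so there is no Dock icon or main window.
final class StatusBarController {
    private let statusItem = NSStatusBar.system.statusItem(
        withLength: NSStatusItem.squareLength)

    init() {
        statusItem.button?.image = NSImage(
            systemSymbolName: "mic", accessibilityDescription: "Steno")

        let menu = NSMenu()
        menu.addItem(NSMenuItem(title: "Quit",
                                action: #selector(NSApplication.terminate(_:)),
                                keyEquivalent: "q"))
        statusItem.menu = menu
    }
}
```

That is essentially the entire UI surface a dictation utility needs.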

Audio Pipeline

Capturing audio from the system microphone on macOS involves requesting permission through the AVCaptureDevice authorization API, configuring an AVAudioEngine with the appropriate sample rate and format, and managing the audio session to coexist with other applications that might be using the microphone. Each of these steps has macOS-specific nuances that a native Swift app handles correctly by default.
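As a rough sketch of that sequence — permission first, then a tap on the input node — with error paths and format conversion simplified:

```swift
import AVFoundation

// Sketch: request microphone access, then capture PCM buffers from
// the hardware input. Buffer size and processing are illustrative.
func startCapture(engine: AVAudioEngine) {
    AVCaptureDevice.requestAccess(for: .audio) { granted in
        guard granted else { return }

        let input = engine.inputNode
        let format = input.outputFormat(forBus: 0) // hardware sample rate

        input.installTap(onBus: 0, bufferSize: 1024, format: format) { buffer, _ in
            // Hand samples to voice activity detection / encoding here.
        }
        try? engine.start()
    }
}
```

The tap delivers raw `AVAudioPCMBuffer`s straight from the audio hardware abstraction layer, with no web API or marshalling step in between.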

The Security Advantage

A native Swift app runs in a single process with direct system calls. There is no embedded web server, no inter-process communication between a main process and a renderer process, and no JavaScript execution context that could be targeted by supply chain attacks.

Steno's codebase is pure Swift with no npm dependencies. The supply chain consists of Apple's Swift standard library and the macOS SDK. Compare this to an Electron app, which typically pulls in hundreds of npm packages, each of which is a potential vector for malicious code. For an application that captures audio from your microphone, this attack surface reduction is not academic — it is essential.

Performance Characteristics of Swift for Audio

Swift compiles to native machine code through LLVM. There is no interpreter, no just-in-time compiler, and no virtual machine. When Steno processes audio — applying voice activity detection, segmenting chunks for transmission, encoding to the required format — these operations run at full native speed.

Swift also provides value types (structs and enums) that are allocated on the stack rather than the heap. For audio processing, where you are manipulating buffers of floating-point samples at high frequency, stack allocation eliminates the overhead of heap allocation and deallocation. The practical result is consistent, predictable performance without the jitter that characterizes garbage-collected runtimes.
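A small illustration of value semantics in this setting (the type and fields are hypothetical; note that an `Array`'s element storage is copy-on-write heap memory, while the struct itself carries no per-instance heap allocation):

```swift
// Illustration: a value type for a chunk of audio. Copying or passing
// the struct does not allocate; its Array storage is shared
// copy-on-write until mutated.
struct AudioChunk {
    var samples: [Float]
    var sampleRate: Double

    // Peak amplitude -- a typical input to voice activity detection.
    func peak() -> Float {
        samples.reduce(0) { max($0, abs($1)) }
    }
}

let chunk = AudioChunk(samples: [0.1, -0.4, 0.25], sampleRate: 16_000)
// chunk.peak() == 0.4
```

Because `AudioChunk` is a struct, the compiler can keep it in registers or on the stack and inline `peak()`, which is exactly the kind of predictability the paragraph above describes.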

Why Not a Cross-Platform Framework?

Frameworks like Qt, Flutter, and React Native offer a compelling promise: write once, run everywhere. For a dictation app, this promise breaks down in practice. Voice-to-text is not a visual application — it is a system utility that needs to interact with platform-specific APIs for audio capture, text injection, hotkey registration, and notification delivery.

Building Steno with a cross-platform framework would mean writing the core functionality in platform-specific code anyway, wrapped in a framework that adds overhead without providing benefit. The user interface — a menu bar icon and an occasional overlay — is trivial to implement in native AppKit and does not justify the weight of a cross-platform UI framework.

What This Means for You

As a user, the technical decisions behind Steno translate into concrete benefits. The app launches instantly when your Mac boots. It sits in the menu bar consuming negligible resources. When you hold the hotkey and speak, recording starts immediately. When you release, transcription is fast because the audio was captured cleanly with no framework overhead.

The 1.7MB download means you can install Steno in seconds, even on a slow connection. Updates through Sparkle are similarly small, downloading and applying in the background without interrupting your work.

These are not features that appear on a comparison chart. You will never see "built in Swift" listed as a bullet point alongside "supports multiple languages" or "works in any text field." But they are the foundation that makes everything else possible. A dictation app that does not feel instant is a dictation app you stop using. Native Swift development is how we ensure Steno always feels instant.

If you want to experience the difference a native Mac dictation app makes, download Steno and try it yourself. The free tier gives you enough usage to feel the difference in your first session.