Voice to Text on Apple Silicon: How Steno Is Optimized for M-Series Macs

When Apple introduced the M1 chip in November 2020, it represented the most significant architectural change in Mac history. Any application that had not been recompiled for ARM ran through Rosetta 2, which translates x86-64 code to ARM64 (mostly ahead of time, at install or first launch) and adds some runtime overhead. For most productivity apps, this overhead was acceptable. For a real-time audio tool like a dictation app, it was not.

Steno has been a native Apple Silicon application from its first Swift release. It runs directly on the ARM architecture of M1, M2, M3, and M4 chips without any translation layer. This is not just a marketing checkbox — it has measurable consequences for latency, power consumption, and the overall dictation experience.

What Native Apple Silicon Means in Practice

When we say Steno is "native on Apple Silicon," we mean the compiled binary contains ARM64 machine code that executes directly on the M-series processor. There is no Rosetta translation, no x86 emulation, and no JIT compilation step. The Swift compiler emits ARM64 instructions, and the CPU runs them as they are, with no intermediate layer.
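One way to see which case applies to a given process is to ask the kernel. The sketch below is illustrative Python, not code from Steno. It relies on the `sysctl.proc_translated` key that Apple documents for detecting Rosetta (1 means translated, 0 means native), and the helper names are made up for this example. Note that it reports on the spawned `sysctl` process, which normally inherits the architecture of the process that launched it.

```python
import platform
import subprocess


def parse_proc_translated(raw):
    """Map sysctl output to True (translated), False (native), or None (unknown)."""
    value = raw.strip()
    if value == "1":
        return True
    if value == "0":
        return False
    return None


def running_under_rosetta():
    """Return True/False on macOS, or None where the key or the tool is unavailable."""
    try:
        result = subprocess.run(
            ["sysctl", "-n", "sysctl.proc_translated"],
            capture_output=True,
            text=True,
            check=False,
        )
    except OSError:
        # No sysctl binary on this system (for example, some non-macOS hosts).
        return None
    if result.returncode != 0:
        # The key does not exist here (older macOS releases, other platforms).
        return None
    return parse_proc_translated(result.stdout)


if __name__ == "__main__":
    print("machine:", platform.machine())
    print("translated:", running_under_rosetta())
```

On an M-series Mac, a native process typically reports `arm64` as the machine type, while a translated one reports `x86_64`.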

For a dictation app, this matters at several critical points in the workflow.

Audio Capture

When you hold the hotkey, Steno begins capturing audio through AVFoundation. The audio data flows from the microphone hardware through the audio driver and into the app's audio buffer. On native Apple Silicon, this path runs without translation: the audio callback executes on a real-time thread that macOS schedules ahead of ordinary work, so each buffer is serviced on time.

Under Rosetta, the audio callback code must first be translated from x86-64 to ARM64 instructions. Rosetta caches these translations, but the first use can still incur a translation delay, and translated code tends to run slower than native code, roughly 20% by common estimates. Audio processing operates on tight deadlines: each buffer must be processed before the next one arrives. When the system is busy, this overhead can lead to dropped buffers and audible glitches.
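The deadline mentioned above is simple arithmetic: a buffer of N frames at a given sample rate lasts N divided by the rate. The Python sketch below works through one hypothetical case. The buffer size, sample rate, and 9 ms processing time are assumptions chosen to illustrate a heavily loaded system, not measurements of Steno, and the 20% slowdown is modeled as a 1.2x multiplier on processing time.

```python
def buffer_deadline_ms(frames_per_buffer, sample_rate_hz):
    """Time available to process one buffer before the next one arrives."""
    return frames_per_buffer / sample_rate_hz * 1000.0


def meets_deadline(native_processing_ms, frames_per_buffer, sample_rate_hz, slowdown=1.0):
    """True if processing, scaled by the slowdown factor, fits inside the buffer period."""
    deadline = buffer_deadline_ms(frames_per_buffer, sample_rate_hz)
    return native_processing_ms * slowdown <= deadline


if __name__ == "__main__":
    # Hypothetical setup: 512-frame buffers at 48 kHz, 9 ms of work per buffer.
    print(f"deadline: {buffer_deadline_ms(512, 48_000):.2f} ms")
    print("native fits:", meets_deadline(9.0, 512, 48_000))
    print("translated fits:", meets_deadline(9.0, 512, 48_000, slowdown=1.2))
```

With these numbers the native case fits inside the buffer period and the translated case does not. A lightly loaded pipeline has far more headroom, which is why the problem tends to appear only under contention.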

Audio Processing

Before sending audio for transcription, Steno performs several processing steps: voice activity detection, noise analysis, audio encoding, and chunk segmentation. These operations use Apple's Accelerate framework, which provides highly optimized vector math routines.

On Apple Silicon, the Accelerate framework maps directly to the chip's NEON SIMD units — specialized hardware for parallel numeric computation. Operations like FFT (Fast Fourier Transform) for spectral analysis and vector dot products for voice activity detection run on dedicated silicon designed specifically for these workloads.
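To make the dot-product idea concrete, here is a minimal energy-based voice activity check. It is a plain Python stand-in for the kind of arithmetic a vectorized routine would perform, not Steno's actual detector. The function names and the 0.02 threshold are assumptions for illustration, and samples are assumed to be floats in the range -1.0 to 1.0.

```python
import math


def dot(a, b):
    """Scalar dot product; a SIMD library would compute this many elements at a time."""
    return sum(x * y for x, y in zip(a, b))


def rms(frame):
    """Root-mean-square level of a frame, computed as sqrt(dot(frame, frame) / n)."""
    if not frame:
        return 0.0
    return math.sqrt(dot(frame, frame) / len(frame))


def is_voiced(frame, threshold=0.02):
    """Treat a frame as speech when its RMS level reaches the (hypothetical) threshold."""
    return rms(frame) >= threshold


if __name__ == "__main__":
    silence = [0.0] * 160          # 10 ms of silence at 16 kHz
    loud = [0.5, -0.5] * 80        # 10 ms of a full-band square wave
    print(is_voiced(silence), is_voiced(loud))
```

A production detector would usually add smoothing across frames and spectral features, but the inner loop is still dominated by this sort of multiply-and-accumulate work, which is what SIMD units are built for.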

The M-series chips also include a matrix coprocessor, commonly referred to as AMX, that accelerates matrix operations. Apple does not expose it to developers directly; code reaches it through frameworks such as Accelerate. While Steno's current audio processing does not require matrix math, the Accelerate framework automatically dispatches compatible operations to the most efficient execution unit available.

Power Efficiency and Battery Life

Apple Silicon's architecture uses a heterogeneous design with performance cores (P-cores) and efficiency cores (E-cores). The operating system scheduler assigns work to the appropriate cores based on the task's computational demands.

Steno's workload profile is ideal for this architecture. During idle periods — which represent the vast majority of the app's runtime — Steno's event loop runs on efficiency cores, consuming minimal power. When you press the hotkey and start dictating, the audio capture and processing work is handled by efficiency cores as well, because the computational load is modest. Only during peak processing moments might the scheduler briefly engage a performance core.

This means Steno has near-zero impact on your battery life. On a MacBook Air with an M2 chip, adding Steno to your login items does not measurably reduce battery runtime. The efficiency cores handle Steno's background workload within their normal power envelope — the incremental energy consumption is measured in microwatts.

Contrast this with non-native applications running under Rosetta. The translation layer itself consumes additional CPU cycles, and translated code takes longer to do the same work. As a result, Rosetta-translated apps tend to keep cores busy for longer, and to spill onto performance cores more often than a native equivalent would, drawing more power.

Memory Architecture Benefits

Apple Silicon uses a unified memory architecture (UMA), in which the CPU, GPU, and Neural Engine all share the same pool of memory. For Steno, this means an audio buffer created during capture can be read in place by every processing stage, with no copy into a separate device memory.

More practically, the memory bandwidth of Apple Silicon (roughly 68 GB/s on the base M1, 100 GB/s on the base M2 and M3, 150 GB/s on M3 Pro, and up to 400 GB/s on M3 Max) means that Steno's audio buffers are read and written at speeds that make memory access essentially instantaneous relative to the audio sample rate. A 16 kHz stream of 16-bit mono audio produces about 32 KB of data per second. Even the base M1 can move that amount of data in less than a microsecond.

This surplus of memory bandwidth means Steno never becomes a memory bottleneck, even when your system is under heavy load from other applications. The audio pipeline gets the data it needs without contention.
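The 32 KB per second figure and the transfer time can be checked with a few lines of arithmetic. The Python sketch below assumes 16-bit mono samples, which is what makes 16 kHz come out to 32 KB per second, and uses a round 100 GB/s as the bandwidth. The function names are illustrative only.

```python
def stream_bytes_per_second(sample_rate_hz, bits_per_sample, channels):
    """Raw PCM data rate for an uncompressed audio stream."""
    return sample_rate_hz * (bits_per_sample // 8) * channels


def transfer_time_us(num_bytes, bandwidth_bytes_per_s):
    """Idealized time to move num_bytes at the given peak bandwidth, in microseconds."""
    return num_bytes / bandwidth_bytes_per_s * 1e6


if __name__ == "__main__":
    one_second_of_audio = stream_bytes_per_second(16_000, 16, 1)
    print("bytes per second of audio:", one_second_of_audio)
    print("microseconds to move it at 100 GB/s:",
          round(transfer_time_us(one_second_of_audio, 100e9), 3))
```

Peak bandwidth is a theoretical ceiling and real transfers are slower, but the gap between a 32 KB/s stream and any M-series memory system is several orders of magnitude either way.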

Specific Optimizations Across M-Series Generations

M1 and M2

These chips established the ARM transition baseline. Steno runs with excellent performance on both, with audio capture latency under 5 milliseconds and total recording-to-text time dominated by network round-trip rather than local processing. The GPU cores (7 or 8 on M1, 8 or 10 on M2) are not used by Steno directly, leaving them fully available for your other applications.

M3

The M3 generation introduced dynamic caching for the GPU and improved branch prediction for the CPU. For Steno, the CPU improvements mean the audio processing pipeline runs with fewer branch mispredictions, resulting in slightly more consistent latency. The improvement is measurable in benchmarks but not perceptible to users — M1 was already fast enough that the dictation experience was seamless.

M4

The M4 chip brought a significantly enhanced Neural Engine with 38 TOPS (trillion operations per second). While Steno currently performs transcription server-side, this Neural Engine capability opens the door for future local processing features like real-time voice activity detection using on-device ML models, speaker identification, and noise classification — all without touching the CPU.

The Rosetta Overhead for Competing Apps

Several popular dictation tools on macOS are still distributed as Intel-only binaries, relying on Rosetta 2 for Apple Silicon compatibility. While Rosetta is remarkably good at translation — most users cannot tell the difference for general applications — dictation apps are not general applications.

A dictation app has real-time constraints. Audio must be captured without gaps. Processing must complete within buffer deadlines. Hotkey events must be handled with minimal latency. Rosetta adds a small but consistent overhead to each of these operations. Individually, each overhead is barely measurable. Collectively, they can add 30 to 50 milliseconds to the total pipeline latency — the difference between a tool that feels instant and one that feels slightly sluggish.

Additionally, Rosetta-translated apps typically consume 30 to 50% more memory than their native equivalents, because the translation layer maintains both the original x86 code and the translated ARM code in memory, along with the translation cache.

Future-Proofing Your Dictation Workflow

Apple has made clear that the future of the Mac is Apple Silicon. Rosetta 2, while still supported, is a transitional technology. Applications that have not been ported to native ARM will eventually face compatibility issues as Apple removes Rosetta support from future macOS releases.

By choosing a native Apple Silicon dictation app today, you ensure that your voice-to-text workflow will continue to work seamlessly through future macOS updates without depending on a translation layer that has an expiration date.

Steno is compiled as a universal binary, meaning the same download works on both Intel and Apple Silicon Macs. On Intel, it runs native x86. On Apple Silicon, it runs native ARM. There is no compromise, no "compatible but not optimized" asterisk.
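A universal binary is a small "fat" header followed by one complete executable per architecture, and the header is easy to inspect. The sketch below is a minimal Python reader for the 32-bit fat header layout (a big-endian magic number and count, followed by 20-byte entries). It is an illustration rather than a replacement for Apple's `lipo` tool: it handles only the two CPU types relevant here, and the function name is an assumption of this example.

```python
import struct

FAT_MAGIC = 0xCAFEBABE  # 32-bit universal header; Java class files share this magic
CPU_TYPES = {
    0x01000007: "x86_64",
    0x0100000C: "arm64",
}


def fat_architectures(header):
    """List the architectures named in a universal binary's fat header.

    Returns an empty list for anything that is not a 32-bit fat header,
    including thin (single-architecture) Mach-O files.
    """
    if len(header) < 8:
        return []
    magic, count = struct.unpack_from(">II", header, 0)
    if magic != FAT_MAGIC:
        return []
    archs = []
    for index in range(count):
        offset = 8 + index * 20  # each entry: cputype, cpusubtype, offset, size, align
        if offset + 20 > len(header):
            break
        cputype = struct.unpack_from(">I", header, offset)[0]
        archs.append(CPU_TYPES.get(cputype, hex(cputype)))
    return archs
```

To try it, pass in the first few kilobytes of an executable, for example `fat_architectures(open(path, "rb").read(4096))`, where `path` points at any Mach-O file on your Mac. A universal app should list both `x86_64` and `arm64`.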

If you are using an M-series Mac and want a dictation tool that takes full advantage of your hardware, download Steno and experience native Apple Silicon performance for voice to text.