
You have a recording sitting on your Mac (a voice memo, a meeting capture, an interview, a lecture) and you need the text version. The traditional solution was to put on headphones, play the recording, pause every few seconds, and type what you heard. That approach has always been slow and tedious: a one-hour recording could take three to five hours to transcribe manually. Today there are much better ways to convert a recording to text on Mac.

The right approach depends on your specific situation: the length of the recording, the audio quality, how accurate the result needs to be, and whether you need the transcript immediately or can wait for processing. This guide covers the main methods with their real-world trade-offs.

Method 1: Live Dictation as You Play

The simplest method requires no uploading at all: play your recording through speakers while a voice-to-text tool captures the audio through a microphone. This works best with a headset microphone placed near the speaker, or with the Mac's built-in microphone picking up audio played through external speakers.

Tools like Steno, which operate as system-level dictation apps, can transcribe whatever your microphone picks up — including audio being played from another application. Hold the hotkey, play a segment of your recording, release when you want to capture that segment. The transcribed text appears instantly in your notes app or document.

This approach requires some manual coordination, but it gives you control over which parts of the recording you capture. For recordings where only some sections are relevant, this selective approach can be more efficient than transcribing the entire file.

Method 2: File-Based Audio Transcription

For longer recordings where you need a complete transcript, file-based transcription services process your audio file and return text. You upload the file, processing happens in the cloud, and you receive a transcript — often within minutes for recordings under an hour.

What to Look for in a Transcription Service

Most transcription services charge by the minute of audio. A one-hour recording typically costs between one and three dollars for automated transcription, or ten to fifteen dollars for human-reviewed transcription.
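To make the pricing concrete, here is a minimal sketch that turns the hourly figures above into an estimate for a recording of any length. The function names and the 90-minute example are illustrative assumptions, and real services may round up, set minimum charges, or price differently.

```python
# Hourly price ranges quoted in this guide, as (low, high) in dollars.
# Actual prices vary by service; treat these as rough assumptions.
AUTOMATED = (1.0, 3.0)
HUMAN_REVIEWED = (10.0, 15.0)


def estimate_cost(duration_minutes: float, rate_per_hour: float) -> float:
    """Cost in dollars for audio billed per minute at the given hourly rate."""
    return round(duration_minutes * rate_per_hour / 60, 2)


def cost_range(duration_minutes: float, hourly_range: tuple) -> tuple:
    """Return the (low, high) cost estimate for a recording."""
    low, high = hourly_range
    return (estimate_cost(duration_minutes, low),
            estimate_cost(duration_minutes, high))


if __name__ == "__main__":
    # A hypothetical 90-minute interview.
    print("Automated:", cost_range(90, AUTOMATED))
    print("Human-reviewed:", cost_range(90, HUMAN_REVIEWED))
```

For short recordings the difference between tiers is small in absolute terms, so the choice usually comes down to how much cleanup you are willing to do yourself.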

Method 3: iPhone Voice Memo Transcription

If the recording was made on your iPhone using the Voice Memos app, iOS 18 and later can transcribe voice memos on-device on supported iPhones. Open the Voice Memos app, select a memo, and tap the transcript icon. For short memos in clear audio conditions, this works well and requires no uploading.

For longer or more complex recordings, on-device transcription accuracy tends to drop. In those cases, AirDropping the audio file to your Mac and using a dedicated transcription tool usually produces better results.

Method 4: Real-Time Capture Before You Need Conversion

The cleanest solution to recording-to-text conversion is to avoid creating the conversion problem in the first place. If you know you need the content as text, dictate directly instead of recording for later transcription.

Steno makes this practical during a phone call, interview, or meeting. Rather than recording everything and transcribing later, use Steno during the event to capture key points as dictated text. Hold the hotkey, speak a summary of what was just said, release. Your notes are ready immediately, in any application, without any post-processing step.

This approach works especially well for meetings, client calls, and interviews where you already know which parts matter most. Instead of capturing everything and then filtering, you filter in real time by dictating only what is worth keeping.

Improving Accuracy When Converting Recordings

Audio Quality Is Everything

The biggest factor in transcription accuracy is the quality of the original audio. Recordings made with a phone held at arm's length in a noisy room will produce poor transcription results regardless of which tool you use. Recordings made with a headset microphone in a quiet room will produce excellent results with most modern transcription tools.

If you have control over the recording setup, invest in audio quality. A $30 lavalier microphone clipped to a shirt collar produces dramatically better audio than a built-in phone microphone. That better audio directly translates to better transcription accuracy with less cleanup required.

Single Speaker Is Easier Than Multiple Speakers

Multi-speaker recordings — interviews, panel discussions, meetings with several participants — are harder to transcribe accurately because overlapping speech, different volumes, and varying accents all reduce accuracy. Single-speaker recordings transcribe much better. If you are in control of the recording setup, capturing one speaker at a time rather than a room full of people will improve your results significantly.

Know When to Edit Rather Than Re-Transcribe

Automatic transcription rarely produces a perfect transcript. Common errors include misheard proper nouns, domain-specific terms, and speaker transitions. The efficient workflow is to run automatic transcription, then do a listen-and-correct pass where you play the audio and fix errors in the transcript. This is dramatically faster than transcribing from scratch. Steno can assist here — use hold-to-speak to dictate corrections as you review the transcript, rather than typing each fix.
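Recurring errors, such as a proper noun or technical term that is misheard the same way every time, can be fixed in bulk before the listen-and-correct pass. The sketch below shows one way to do that with a small glossary. The glossary entries and the function name are hypothetical; build your own list from the mistakes you actually see in your transcripts.

```python
import re

# Hypothetical glossary: misheard phrase -> intended phrase.
CORRECTIONS = {
    "air drop": "AirDrop",
    "lava leer": "lavalier",
}


def apply_corrections(transcript: str, corrections: dict) -> str:
    """Replace each misheard phrase with its correction, ignoring case."""
    for wrong, right in corrections.items():
        # Word boundaries keep the match from firing inside longer words.
        pattern = re.compile(r"\b" + re.escape(wrong) + r"\b", re.IGNORECASE)
        # A function replacement inserts the correction literally,
        # so backslashes in it are not treated as escape sequences.
        transcript = pattern.sub(lambda match, text=right: text, transcript)
    return transcript


if __name__ == "__main__":
    raw = "Use air drop to send the Lava Leer recording."
    print(apply_corrections(raw, CORRECTIONS))
```

A pass like this only handles errors that repeat predictably. One-off mistakes and speaker transitions still need the manual review described above.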

When Speed Matters Most

Sometimes you need text from a recording immediately. A client call ended five minutes ago and you need to send a summary now. A lecture just finished and you need your notes organized before the next class. In these situations, the fastest path from recording to text is live dictation with a tool like Steno: no uploading, no waiting, no post-processing. Speak your summary while the content is still fresh, and your text is ready right away.

Download Steno at stenofast.com to start converting your voice to text immediately, on any Mac application, with a single hotkey.

The fastest transcription is the one that never needed to happen — because you captured the content as text while it was being spoken, not as a recording that needs to be processed afterward.