Documentary interview transcription: a paper edit workflow that survives 30 hours of footage
If you shot 30 hours of interviews for a feature documentary, you have somewhere between 120 and 240 hours of typing ahead of you at human transcription speed, or about an afternoon of wall-clock time if you batch the audio through AI and spot-check the result. The paper edit is where your film actually gets written, and AI transcription is the only practical way to get there without burning a month or $2,700 at $1.50/audio-minute on a service like Rev.
We run AssemblyAI Universal-3 in production and transcribe several thousand minutes of long-form interview audio per week. This piece is the workflow we'd hand to an indie director or assistant editor opening a new project tomorrow.
What a paper edit is, and why doc editors live in it
A paper edit is a marked-up transcript of every interview, with the keeper quotes highlighted, themes color-coded, and timecodes preserved so each selected line can be pulled back into Premiere, Resolve, or Avid as a clip. You edit the film on paper first — sequencing scenes, building act structure, finding the throughline — before you ever touch the NLE.
The reason is simple math: scrubbing a 90-minute interview to find one sentence takes 20–40 minutes; finding it in a transcript takes 10 seconds. With 30 hours of interviews, the scrub-based approach is structurally impossible.
A paper edit also solves a problem unique to documentary: subjects rarely speak in clean soundbites. They ramble, backtrack, and reach the point three minutes after they started. You frequently want the setup from minute 12 stitched to the payoff from minute 45 — a Frankenstein cut that's invisible to the viewer if you got the words right. You cannot find those stitches by scrubbing. You find them by reading.
Once the paper edit is locked, the assistant editor builds a radio edit in the timeline — an audio-only assembly matching the paper script. Only after the radio edit flows do you start worrying about b-roll, lower thirds, and grade. Without a transcript, this whole sequence collapses.
Time math: 30 hours of interviews, two ways
Human transcription at professional speed runs about 4× real-time for clean single-speaker audio and 6–8× for multi-speaker interview audio with overlap. A 30-hour corpus is therefore 120–240 hours of typing. At Rev's $1.50/min human tier, that's $2,700 in transcription costs alone; $5,400 if you need verbatim with timestamps.
AI transcription runs faster than real-time on a single file and effectively parallel across files. Thirty 60-minute interviews submitted in batch finish in roughly 30–90 minutes of wall-clock time depending on queue depth. We see ~92% accuracy (WER ~7.88%) on clean 48 kHz boom-mic interview audio — which means roughly one word in twelve needs an editor's eye, almost always a name, a place, or a technical term.
Your real cost is the cleanup pass. Plan 15–25 minutes of human review per hour of audio for a paper-edit-grade transcript: roughly 8 to 12 hours of editor time for the full 30-hour corpus, versus 120+ hours of typing.
For typical indie filmmaker transcription budgets, the entire corpus lands under $70 on a Pro plan at $19/month (as of May 2026): the included 600 audio-minutes cover 10 hours of interviews, and the remaining 1,200 minutes bill as overage at $0.04/minute, or $48. Roughly $67 versus $2,700.
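The effort figures in this section are simple multiplication, and it can help to rerun them for your own corpus size. A minimal sketch in Python: the function name and dictionary keys are illustrative, and the default multipliers are the estimates quoted above, not measured values.

```python
def transcription_estimates(corpus_hours, human_multiplier=(4, 8),
                            review_min_per_hour=(15, 25),
                            human_rate_per_min=1.50):
    """Return rough effort and cost ranges for a corpus of interviews.

    All defaults are the estimates quoted in the text; replace them
    with your own numbers.
    """
    audio_minutes = corpus_hours * 60
    return {
        "audio_minutes": audio_minutes,
        # Hours of typing at 4x to 8x real time.
        "human_typing_hours": tuple(corpus_hours * m for m in human_multiplier),
        # Hours of cleanup on an AI transcript at 15 to 25 min per audio hour.
        "ai_review_hours": tuple(corpus_hours * m / 60 for m in review_min_per_hour),
        # Cost of human transcription at a flat per-minute rate.
        "human_cost_usd": audio_minutes * human_rate_per_min,
    }


estimates = transcription_estimates(30)
for key, value in estimates.items():
    print(f"{key}: {value}")
```

For a 30-hour corpus this gives 120 to 240 hours of typing, 7.5 to 12.5 hours of review, and $2,700 at $1.50 per minute.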
Timestamps are everything
A documentary transcript without timecode is a magazine article. The point of the paper edit is round-tripping selections back to the NLE, and that requires timestamps your editor can act on.
We output timestamps at three granularities:
- Per-word: millisecond-level, used for caption export and tight subclip generation
- Per-segment: usually 5–15 seconds, what you actually read in the paper edit
- Per-speaker turn: starts and ends of each diarized block
For documentary work, segment-level is the unit you mark up. Per-word matters when you export captions for a rough cut screening, but reading a transcript with a timecode on every word is unusable.
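To make the relationship between the granularities concrete, here is a rough sketch of how per-word timestamps could be rolled up into readable segments. The word shape (`text`, `start`, `end` in milliseconds), the function names, and the pause and length thresholds are all assumptions for illustration, not a description of how our segmenter works.

```python
def words_to_segments(words, max_gap_ms=700, max_len_ms=15_000):
    """Group timestamped words into segments.

    A new segment starts after a pause longer than max_gap_ms, or when
    adding the next word would stretch the segment past max_len_ms.
    """
    segments = []
    current = []
    for word in words:
        if current:
            gap = word["start"] - current[-1]["end"]
            too_long = word["end"] - current[0]["start"] > max_len_ms
            if gap > max_gap_ms or too_long:
                segments.append(_close_segment(current))
                current = []
        current.append(word)
    if current:
        segments.append(_close_segment(current))
    return segments


def _close_segment(words):
    """Collapse a run of words into one segment with start/end times."""
    return {
        "start": words[0]["start"],
        "end": words[-1]["end"],
        "text": " ".join(w["text"] for w in words),
    }


words = [
    {"text": "The", "start": 0, "end": 200},
    {"text": "first", "start": 250, "end": 600},
    {"text": "Then", "start": 2000, "end": 2300},
]
for segment in words_to_segments(words):
    print(segment)
```

The per-word data stays available for caption export; the segments are what a reader marks up.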
The SMPTE timecode gap — read this before you import
Here's a frank limitation: AI engines do not read embedded SMPTE timecode from your video files. If your camera rolled timecode-of-day starting at 14:32:10:05, the AI does not know that. Every transcript we generate starts at 00:00:00 relative to the file.
When you bring an SRT back into the NLE, you have to offset the subtitle track to match the source clip's start TC. Premiere and Resolve both let you nudge a subtitle/caption track by a fixed offset; do this once per interview clip and your text lines up with the timeline. Trint and Descript handle this offset inside their own apps, which is part of what you pay for there. Resolve Studio also ships its own built-in transcription — if your director and producers all work in Resolve, that's a legitimate path. Many indie teams still generate text externally so collaborators without an NLE can read along.
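If you would rather bake the offset into the SRT than nudge the track in the NLE, the shift is a small script. This is a sketch under stated assumptions: non-drop-frame timecode, an integer frame rate, and timecode treated as clock time (at 23.976 or 29.97 fps the result drifts, and drop-frame timecode written with a semicolon is not handled). The function names are ours for this example. Whether your NLE wants file-relative or source-TC cue times depends on how you import, so test on one clip first.

```python
import re

# Matches SRT cue times such as 00:42:18,600 (a dot separator is tolerated).
SRT_TIME = re.compile(r"(\d{2,}):(\d{2}):(\d{2})[,.](\d{3})")


def smpte_to_ms(timecode, fps):
    """Convert non-drop-frame HH:MM:SS:FF to milliseconds."""
    hh, mm, ss, ff = (int(part) for part in timecode.split(":"))
    return round(((hh * 60 + mm) * 60 + ss + ff / fps) * 1000)


def _shift_match(match, offset_ms):
    hh, mm, ss, ms = (int(group) for group in match.groups())
    total = ((hh * 60 + mm) * 60 + ss) * 1000 + ms + offset_ms
    if total < 0:
        raise ValueError("offset would move a cue before 00:00:00,000")
    hh, rem = divmod(total, 3_600_000)
    mm, rem = divmod(rem, 60_000)
    ss, ms = divmod(rem, 1000)
    return f"{hh:02d}:{mm:02d}:{ss:02d},{ms:03d}"


def shift_srt(srt_text, offset_ms):
    """Shift every cue time by offset_ms; subtitle text is left untouched."""
    shifted = []
    for line in srt_text.splitlines(keepends=True):
        if "-->" in line:  # only rewrite timing lines, never dialogue
            line = SRT_TIME.sub(lambda m: _shift_match(m, offset_ms), line)
        shifted.append(line)
    return "".join(shifted)


srt = "1\n00:00:01,000 --> 00:00:04,500\nThe first time I saw the spreadsheet\n"
print(shift_srt(srt, smpte_to_ms("14:32:10:05", fps=24)))
```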
Premiere and Resolve round-trip
Neither Premiere nor Resolve imports a paper-edit transcript directly the way Avid Script Sync does. The pragmatic flow:
- Export the transcript as SRT or VTT with timecodes referenced to interview-clip-relative time.
- In Premiere, import the SRT as a caption track on the interview clip — it shows up as searchable text in the Text panel (Premiere 2024+).
- In Resolve 18.5+, the Subtitle track behaves similarly; you can also import SRT directly to the Edit page and use it to navigate.
- For markers, convert your highlighted paper-edit selections into a CSV of in/out timecodes and import via the Markers panel (Resolve) or a marker-import script (Premiere via ExtendScript).
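The marker step above can be sketched as a short conversion from paper-edit selections to a CSV of in/out timecodes. The column names (`name`, `in`, `out`, `note`) and the select dictionary shape are hypothetical: check what your NLE or import script actually expects before relying on this layout. The conversion assumes an integer, non-drop-frame frame rate.

```python
import csv
import io


def seconds_to_timecode(seconds, fps):
    """Convert file-relative seconds to HH:MM:SS:FF (integer fps, non-drop)."""
    total_frames = round(seconds * fps)
    total_seconds, ff = divmod(total_frames, fps)
    minutes, ss = divmod(total_seconds, 60)
    hh, mm = divmod(minutes, 60)
    return f"{hh:02d}:{mm:02d}:{ss:02d}:{ff:02d}"


def selects_to_csv(selects, fps):
    """Render selects as CSV text with one row per in/out range."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["name", "in", "out", "note"])  # hypothetical columns
    for select in selects:
        writer.writerow([
            select["name"],
            seconds_to_timecode(select["start"], fps),
            seconds_to_timecode(select["end"], fps),
            select["note"],
        ])
    return buffer.getvalue()


selects = [
    {"name": "MAYA spreadsheet", "start": 2538.6, "end": 2561.9, "note": "evidence"},
]
print(selects_to_csv(selects, fps=24))
```

These timecodes are still relative to the start of the file, so the source-TC offset described earlier applies here too.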
A few teams skip the NLE markers entirely and use the paper edit as the assembly script: the assistant editor pulls each highlighted segment as a subclip, names it by transcript line, and drops them in selects bins by theme.
Multi-camera interview audio: which source to upload
This decides your accuracy more than anything else.
A typical sit-down doc interview has four possible audio sources recorded simultaneously:
- Boom mic into a field recorder (Sound Devices, Zoom F8) — 48 kHz, broadcast WAV
- Lavalier(s) into the same recorder or wireless to camera — 48 kHz
- Camera scratch audio — usually 48 kHz but compressed, picks up room
- Mixed mono on the camera from a field mixer — variable quality
Upload the boom or the lav, not the camera scratch. Camera-internal audio after AAC compression typically costs us 2–4 points of accuracy versus the original field recorder file. Assuming a conversational pace of roughly 150 words per minute, a 90-minute interview runs about 13,500 words, so that's roughly 270–540 additional misrecognized words.
If you have both boom and lav on separate tracks, upload them as a stereo file with boom-left, lav-right. We channel-split for diarization on stereo input, so when each mic is dedicated to one person (interviewer on one channel, subject on the other), speaker separation comes from the channels themselves, with no model guessing required. Expect some bleed between mics in a small room. This is the single highest-leverage move for clean documentary transcripts.
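If your recorder wrote the boom and lav as separate mono files, they can be interleaved into one stereo file before upload. Most editors would do this in their DAW or NLE; the sketch below uses only Python's standard `wave` module. It assumes plain PCM WAV files with matching sample rate and bit depth, and it truncates to the shorter file. The function name is illustrative.

```python
import wave


def merge_to_stereo(left_path, right_path, out_path, chunk_frames=65536):
    """Interleave two mono PCM WAV files into one stereo WAV (left, right)."""
    with wave.open(left_path, "rb") as left, wave.open(right_path, "rb") as right:
        if left.getnchannels() != 1 or right.getnchannels() != 1:
            raise ValueError("both inputs must be mono")
        if left.getframerate() != right.getframerate():
            raise ValueError("sample rates must match")
        if left.getsampwidth() != right.getsampwidth():
            raise ValueError("bit depths must match")
        width = left.getsampwidth()  # bytes per sample: 2 for 16-bit, 3 for 24-bit

        with wave.open(out_path, "wb") as out:
            out.setnchannels(2)
            out.setsampwidth(width)
            out.setframerate(left.getframerate())
            remaining = min(left.getnframes(), right.getnframes())
            while remaining > 0:
                l_data = left.readframes(min(chunk_frames, remaining))
                r_data = right.readframes(min(chunk_frames, remaining))
                n = min(len(l_data), len(r_data)) // width
                if n == 0:
                    break
                stereo = bytearray(2 * width * n)
                # Copy each byte of every sample into its interleaved slot.
                for b in range(width):
                    stereo[b::2 * width] = l_data[b:n * width:width]
                    stereo[width + b::2 * width] = r_data[b:n * width:width]
                out.writeframes(bytes(stereo))
                remaining -= n
```

A call such as `merge_to_stereo("maya_boom.wav", "maya_lav.wav", "maya_stereo.wav")` (filenames hypothetical) puts the boom on the left channel and the lav on the right. The files are read in chunks, so a long interview does not have to fit in memory.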
If everything is mono and mixed, we fall back to pyannote-3.1 for speaker diarization. It works well for 2–4 speakers — fine for most interviews — and degrades beyond 6, which matters if you're transcribing a roundtable.
Field recorder file specifics
- Sample rate: 48 kHz is ideal; we don't gain anything from 96 kHz
- Bit depth: 24-bit WAV preferred, 16-bit fine
- Format: WAV, FLAC, or broadcast WAV (.bwf) all work; AAC/MP3 acceptable but lossy
- Length: our 10-hour-per-file ceiling on Pro covers any single interview; the 2 GB file ceiling is the practical limit for uncompressed multi-track
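The specifics above are easy to check in bulk before a batch upload. A minimal sketch using Python's standard `wave` module, assuming plain PCM WAV input. The function name and warning strings are illustrative, and the 2 GB limit is treated as 2 × 10^9 bytes, which is an assumption about how the ceiling is counted.

```python
import os
import wave

MAX_HOURS = 10           # per-file ceiling on Pro, from the list above
MAX_BYTES = 2 * 10**9    # assumption: "2 GB" counted in decimal bytes


def check_wav(path):
    """Return a list of warnings for one WAV file; an empty list means OK."""
    warnings = []
    size = os.path.getsize(path)
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        bits = wav.getsampwidth() * 8
        hours = wav.getnframes() / rate / 3600

    if rate < 48000:
        warnings.append(f"sample rate {rate} Hz is below 48 kHz")
    elif rate > 48000:
        warnings.append(f"sample rate {rate} Hz is above 48 kHz: larger file, no gain")
    if bits not in (16, 24):
        warnings.append(f"unexpected bit depth: {bits}-bit")
    if hours > MAX_HOURS:
        warnings.append(f"runtime {hours:.1f} h exceeds the {MAX_HOURS}-hour ceiling")
    if size > MAX_BYTES:
        warnings.append(f"file size {size} bytes exceeds the 2 GB ceiling")
    return warnings
```

Running this over every file in the upload folder and printing any non-empty result catches a stray 96 kHz or oversized file before it reaches the queue.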
Diarization at corpus scale, and the speaker-naming problem
A documentary often involves 15–40 interview subjects, plus the director's voice on every tape, plus crew interjections. Diarization at this scale isn't about one tape — it's about consistent speaker labels across the corpus.
We label speakers within a file (Speaker A, Speaker B). We don't currently do cross-file speaker identification — the director's voice in interview-03 isn't automatically tagged as the same "Director" across interview-17. That's manual rename work, and the most efficient pattern is to decide your convention before the first upload:
- Batch upload all interviews with consistent filenames (subject-name_YYYY-MM-DD_boom.wav)
- For each transcript, rename Speaker A → subject name, Speaker B → INTERVIEWER
- Store the renamed transcripts in a folder mirrored to your project structure
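The naming convention above can be sketched as two small helpers: one that reads the subject, date, and audio source out of the filename, and one that applies a speaker mapping. The utterance shape (a list of dictionaries with a `speaker` key) is a hypothetical structure for illustration, not our export format.

```python
import re
from pathlib import Path

# subject-name_YYYY-MM-DD_source.wav, e.g. civic_maya_2026-05-12_boom.wav
FILENAME = re.compile(
    r"^(?P<subject>.+)_(?P<date>\d{4}-\d{2}-\d{2})_(?P<source>[a-z]+)$"
)


def parse_interview_filename(path):
    """Split a conforming filename into subject, date, and audio source."""
    match = FILENAME.match(Path(path).stem)
    if match is None:
        raise ValueError(f"filename does not follow the convention: {path}")
    return match.groupdict()


def rename_speakers(utterances, mapping):
    """Return a copy of the utterances with speaker labels replaced.

    Labels missing from the mapping are kept as they are.
    """
    return [
        {**u, "speaker": mapping.get(u["speaker"], u["speaker"])}
        for u in utterances
    ]


info = parse_interview_filename("civic_maya_2026-05-12_boom.wav")
utterances = [
    {"speaker": "A", "text": "The first time I saw the spreadsheet..."},
    {"speaker": "B", "text": "What did you do next?"},
]
mapping = {"A": info["subject"].upper(), "B": "INTERVIEWER"}
for u in rename_speakers(utterances, mapping):
    print(u["speaker"], "-", u["text"])
```

Which label belongs to the subject is not guaranteed from file to file, so confirm the mapping by listening to the first minute of each interview before applying it.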
For corpora with crew chatter, slate calls, and bathroom-break audio bleeding in, plan a cleanup pass to strip non-content speakers. The transcript will faithfully include every "okay we're rolling" — a feature for sync, a nuisance for the paper edit.
Our interview transcription workflow goes deeper on the journalist sit-down case and applies cleanly to doc work.
Building the paper edit after transcription
Transcripts are raw material. The paper edit is what you do with them.
- Make a file map before upload. One row per interview: subject, date, source file, audio source (boom/lav/scratch), frame rate, runtime, transcript status. This is the document the assistant editor lives in for the next six months.
- Spot-check 5 minutes per hour of footage. One early, one middle, one late. Flag files with bad sync, wrong speaker labels, or heavy proper-noun errors before you commit time to marking them up.
- Search for known story beats. Names, places, dates, repeated metaphors, contradictions, emotional phrases. AI transcripts make full-text search trivial even when the text is imperfect: if a subject said "spreadsheet" 17 times across four interviews, search surfaces every moment in seconds.
- Pull selects with source references attached. Every select carries speaker, file, and in/out timecode. If it doesn't, it isn't ready for the timeline. A working format:
  MAYA — civic_maya_2026-05-12_boom.wav — 00:42:18.6–00:42:41.9
  "The first time I saw the spreadsheet, I thought it was a mistake. Then I saw the same number in three different folders."
- Group selects by story function, not by shoot day. Origin, stakes, conflict, evidence, reversal, cost, resolution. A chronological transcript is for finding material; a thematic paper edit is for building scenes.
- Verify every quote that reaches the cut. Before the edit leaves the core team, listen against the transcript and correct the text. AI accuracy is high; it isn't 100%, and a misheard word in a release-quoted line is an expensive thing to fix in post.
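Teams that keep their transcripts as data rather than documents can script the search, select, and group steps above. A rough sketch, assuming a hypothetical corpus shape of `{filename: [segments]}` where each segment carries `speaker`, `start`, `end` (seconds), and `text`. The class and function names are ours for this example.

```python
from collections import defaultdict
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Select:
    speaker: str
    source_file: str
    start_s: float
    end_s: float
    text: str
    story_function: str = "unsorted"

    @property
    def ready(self):
        """A select is timeline-ready only with speaker, file, and in/out."""
        return bool(self.speaker and self.source_file) and self.end_s > self.start_s


def search_corpus(corpus, term):
    """Case-insensitive full-text search that keeps source references."""
    needle = term.lower()
    return [
        Select(seg["speaker"], filename, seg["start"], seg["end"], seg["text"])
        for filename, segments in corpus.items()
        for seg in segments
        if needle in seg["text"].lower()
    ]


def group_by_function(selects):
    """Bucket selects by story function (origin, stakes, evidence, ...)."""
    groups = defaultdict(list)
    for select in selects:
        groups[select.story_function].append(select)
    return dict(groups)


corpus = {
    "civic_maya_2026-05-12_boom.wav": [
        {"speaker": "MAYA", "start": 2538.6, "end": 2561.9,
         "text": "The first time I saw the spreadsheet, I thought it was a mistake."},
        {"speaker": "INTERVIEWER", "start": 2562.0, "end": 2565.0,
         "text": "What did you do next?"},
    ],
}
hits = [replace(hit, story_function="evidence")
        for hit in search_corpus(corpus, "spreadsheet")]
print(group_by_function(hits))
```

Assigning the story function stays a human decision; the code only guarantees that each select keeps its file and timecodes along the way.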
Where the seams are
Honest limitations on documentary work specifically:
- Proper nouns: place names, foreign-language names, specialized vocabulary (a doc on cellular biology will mangle "anaphase" the first three times until context kicks in). Budget a find-and-replace pass.
- Heavy accents in non-English: 99 languages at one price, but accuracy on regional Tagalog or West African French varies. Run a 5-minute test before committing a corpus.
- Overlapping speech: two people talking over each other produces transcript soup. Diarization helps; nothing fixes it perfectly.
- Archival audio: 1970s broadcast tape at 22 kHz with hiss runs closer to 17.7% WER, similar to telephony. Useful for indexing, not for direct quotation without verification.
- No live realtime captions: we transcribe the recording after, not during the shoot. On-set accessibility captions are a different tool.
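For the short tests suggested above, a word error rate against a hand-corrected reference gives you a number to compare across files. This is a minimal sketch of the standard word-level edit-distance calculation. It lowercases and splits on whitespace only, so punctuation differences count as errors unless you strip them first, and it is not necessarily how the WER figures quoted in this piece were measured.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # Levenshtein distance over words, kept to two rows of the table.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1] / len(ref)


reference = "the cell enters anaphase after metaphase"
hypothesis = "the cell enters and a phase after metaphase"
print(f"WER: {word_error_rate(reference, hypothesis):.1%}")
```

Here one mangled term ("anaphase" heard as "and a phase") counts as one substitution plus two insertions against six reference words, which is why a few proper-noun errors can move the score on a short sample.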
Sensitive subjects need a data plan
Documentary subjects often discuss medical history, immigration status, abuse, litigation, or political exposure. Treat transcripts as production-sensitive documents, not casual notes. We provide HIPAA-grade data handling at rest but are not a HIPAA BAA-covered product yet — if your project requires a signed BAA, we're not your vendor today. For ordinary doc production, decide upfront who can upload, who can access exports, and when transcripts should be removed from shared drives.
Where neighbors fit
Trint and Descript both built strong editor-facing UIs for paper-edit-style work — Descript especially if you want to edit the audio by editing the text. Otter.ai is excellent for live meeting capture and weaker on long-form pre-recorded interview audio. Rev still wins when you need a human-certified transcript for legal release or broadcast standards-and-practices. Resolve Studio ships its own built-in transcription if your whole team is in Resolve. We're optimizing for fast, accurate batch transcription with clean timestamps you can round-trip into any NLE — and being honest that the final markup happens in your tool of choice, not ours.
What next
- Upload one 60-minute interview from your current project on the Free plan (30 audio-min/month, as of May 2026) and check the SRT offset against your Premiere or Resolve clip's source TC.
- If you have boom + lav on separate tracks, test a stereo upload (boom-L, lav-R) against a mono mixdown of the same interview — compare diarization quality side by side.
- For a 30-hour corpus, build the file map first: subject, date, source file, audio source, frame rate, transcript status. Decide your speaker-naming convention before the first upload, not after the tenth.
- If you're working in a language other than English, run a 5-minute sample first — one price, every language, but accuracy varies by audio condition and dialect.