
Blog · 5 min read

Audiobook narration transcription for proofing and ACX corrections

Audiobook narrators get correction lists from publishers and ACX QA. How AI transcription speeds the proofing pass by surfacing mismatches between manuscript and recording.

the second pass, after the diff has cleared the textual errors.

ACX corrections format

ACX QA and publishers commonly return corrections as a timestamped list. A typical entry gives the timecode, then the problem:

00:12:34 — "she walked toward" should be "she walked towards"
01:42:08 — missing line: "He never answered."
02:18:55 — repeated word: "the the"

A diff-driven workflow produces this format natively. You already have the timestamp, the manuscript phrase, and the transcript phrase, so the corrections list is a CSV export with each row formatted as one string.
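As a minimal sketch of that formatting step, the snippet below turns one diff row into a line shaped like the sample list. The `DiffRow` fields and the "extra words" label are hypothetical names chosen for illustration, not a schema ACX or any publisher defines.

```python
from dataclasses import dataclass

SEP = " \u2014 "  # em dash separator, copied from the sample list


@dataclass
class DiffRow:
    start_seconds: float  # where the divergence starts in the audio
    manuscript: str       # manuscript phrase ("" if the narrator added words)
    transcript: str       # transcribed phrase ("" if the narrator skipped words)


def timecode(seconds: float) -> str:
    """Format seconds as HH:MM:SS, dropping fractions of a second."""
    hours, rem = divmod(int(seconds), 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def correction_line(row: DiffRow) -> str:
    """Render one diff row as a single corrections-list string."""
    stamp = timecode(row.start_seconds)
    if row.manuscript and not row.transcript:
        return f'{stamp}{SEP}missing line: "{row.manuscript}"'
    if row.transcript and not row.manuscript:
        # "extra words" is an assumed label, not taken from a publisher spec
        return f'{stamp}{SEP}extra words: "{row.transcript}"'
    return f'{stamp}{SEP}"{row.transcript}" should be "{row.manuscript}"'
```

Calling `correction_line(DiffRow(754, "she walked towards", "she walked toward"))` yields the first sample entry, with the recorded phrase first and the manuscript phrase second.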

Filter aggressively before you deliver. Do not send a publisher a 400-row diff when 260 rows are punctuation, casing, or proper-noun mishears. Drop everything marked "false positive" or "accept" and ship only the confirmed rows. A clean 40-row sheet earns trust; a 400-row raw dump earns a reputation for noise.
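The filtering step can be sketched in a few lines, assuming each reviewed row is a dict with a `tag` key set to "confirmed", "false positive", or "accept". The column names are hypothetical; use whatever your publisher's sheet expects.

```python
import csv
import io


def confirmed_rows(rows):
    """Keep only the rows a reviewer tagged as confirmed."""
    return [row for row in rows if row.get("tag") == "confirmed"]


def to_csv(rows, fields=("timestamp", "transcript", "manuscript")):
    """Serialize rows to CSV text, ignoring working columns such as 'tag'."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fields, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Dropping the `tag` column on export keeps the delivered sheet free of internal review notes.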

For punch-and-roll narrators, the corrections list doubles as the pickup script. Sort by timestamp, open the session, set the punch-in three words before each marker, record the fix. A 10-hour book with 40 real corrections can become a 90-minute pickup session instead of a re-listen marathon.
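One way to derive punch-in points is to step back three words in the word-level transcript. This sketch assumes each word is a dict with `text` and `start` (seconds) keys; that shape is an assumption for illustration, so adapt the key names to the JSON your transcription tool returns.

```python
def pickup_script(words, flagged_indexes, lead_words=3):
    """Return (punch_in_seconds, flagged_word) cues sorted by time.

    words: list of {"text": str, "start": float} in spoken order.
    flagged_indexes: positions in `words` that need a pickup.
    """
    cues = []
    for index in flagged_indexes:
        # Step back lead_words words, but never before the start of the file
        punch_index = max(0, index - lead_words)
        cues.append((words[punch_index]["start"], words[index]["text"]))
    return sorted(cues)
```

The clamp at index 0 covers corrections in the first few words of a file, where there is no three-word run-up.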

Punch-and-roll vs full re-record

The proofing question that decides cost: does this chapter need a re-record or just pickups?

ACX does not publish a pickup-count threshold — they care that the finished files meet RMS, peak, and noise-floor specs and sound consistent. A chapter with 30 punch-and-roll corrections can sound identical to one recorded clean if the booth conditions match. Re-records become necessary when:

  • Booth conditions drifted — new HVAC, different mic position, weight gain or loss changing resonance
  • Performance issues spread across a whole scene, not localized to specific lines
  • The narrator's voice changed (cold, fatigue) for a full session
  • A chapter file was swapped, several pages of text are missing, or ACX rejected the file on technical grounds (RMS, peak level, noise floor, format)

A transcript diff tells you which bucket you're in. Scattered single-word substitutions across a chapter — that's a punch-and-roll job, 30 minutes to two hours depending on count. Clusters of 5+ word divergences in a row usually mean the narrator paraphrased, which is harder to repair seamlessly and often justifies a scene re-record.
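A rough version of that triage can be built on Python's standard `difflib`. The sketch below measures each divergent span in words and flags a chapter when any span reaches five words. The threshold comes from the rule of thumb above; the function names and return labels are illustrative.

```python
from difflib import SequenceMatcher


def divergence_sizes(manuscript_words, transcript_words):
    """Return the length in words of every span where the two texts differ."""
    matcher = SequenceMatcher(a=manuscript_words, b=transcript_words, autojunk=False)
    sizes = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":
            # A span's size is the longer of its manuscript and transcript sides
            sizes.append(max(i2 - i1, j2 - j1))
    return sizes


def triage(sizes, cluster_size=5):
    """Suggest a bucket; a human still makes the final call."""
    if any(size >= cluster_size for size in sizes):
        return "consider scene re-record"
    return "punch-and-roll"
```

Both inputs should be normalized word lists, otherwise punctuation and casing differences inflate the span sizes.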

The diff doesn't make the decision for you. It makes the decision visible in a way that listening straight through doesn't.

Cost vs hiring a human proof-listener

Professional audiobook proof-listening commonly runs $10–25 per finished hour; budget freelancers sometimes charge less, and experienced audiobook QC proofers or rush work cost more. For a 10-hour book that's about $100–250+ before any editing or pickups. Turnaround is typically 3–7 days, depending on the listener and queue.

AI transcription cost on our plans can change, so check the pricing page before you quote a job. The stable planning number is audio minutes:

  • A one-chapter pilot usually fits inside a free trial or free monthly allowance, if your account includes one
  • A 10-hour audiobook is 600 audio-minutes, plus headroom for re-runs
  • Studios running multiple titles a week need higher monthly minute allowances and higher file-size limits
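The planning arithmetic in this section is simple enough to script. In the sketch below, the $10–25 rate range comes from the figures above, while the 20% re-run headroom is an assumed placeholder, not a recommendation.

```python
import math


def audio_minutes(finished_hours, headroom_percent=20):
    """Audio minutes to budget, including headroom for re-runs (assumed 20%)."""
    return math.ceil(finished_hours * 60 * (100 + headroom_percent) / 100)


def proof_listen_cost(finished_hours, low_rate=10.0, high_rate=25.0):
    """Low and high estimate in dollars for a full human proof-listen."""
    return finished_hours * low_rate, finished_hours * high_rate
```

For a 10-hour title, `audio_minutes(10, headroom_percent=0)` gives the 600 minutes quoted above, and `proof_listen_cost(10)` gives the $100 to $250 range.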

See our pricing page for current tiers.

The honest comparison isn't AI vs human — it's AI plus a shorter human pass vs human alone. A proof-listener fed a pre-flagged corrections list can work faster: confirming flagged moments instead of scanning continuously. Combined cost on a 10-hour book depends on the listener's rate and how much of the audio they still spot-check, but it is usually lower than a full blind proof-listen.

Where the seams are

Two things will bite you if you don't plan for them.

First, proper nouns. Audiobooks live and die on character names, place names, and invented terminology. AssemblyAI's model may transcribe "Daenerys" as "Denarius" or "Dan Aris" depending on context. If your STT engine supports term hints or custom spelling, build a custom vocabulary list before you transcribe. Otherwise every occurrence of a name can show up as a false positive, polluting the diff with hundreds of them.
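Engine-side vocabulary options differ between providers, so no specific API call is shown here. As a complementary technique (a post-filter on the diff, not a replacement for term hints), the sketch below pre-tags rows where the manuscript word is a known name and the transcript word merely resembles it. The function name and the 0.6 cutoff are assumptions to tune on your own pilot chapter.

```python
from difflib import get_close_matches


def likely_name_mishear(transcript_word, manuscript_word, vocabulary, cutoff=0.6):
    """True when a known name was probably mis-transcribed, not misread."""
    known = {name.lower() for name in vocabulary}
    target = manuscript_word.lower()
    if target not in known:
        return False  # not a vocabulary term, so leave the row for human review
    # Fuzzy-compare the heard word against the expected name
    return bool(get_close_matches(transcript_word.lower(), [target], n=1, cutoff=cutoff))
```

This works one token at a time, so a name split into two words, such as "Dan Aris", would still need a manual pass or a multi-token comparison.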

Second, performance verbatim. If your narration intentionally departs from the manuscript — a publisher-approved cut, an authorized paraphrase, a foreign-language passage with phonetic spelling — the diff flags those as errors. Maintain a known-divergences list and mark those rows "accept" automatically.
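A known-divergences list can be as simple as pairs of manuscript and approved spoken text. The sketch below assumes diff rows are dicts with `manuscript`, `transcript`, and `tag` keys, which are hypothetical names for illustration.

```python
def apply_known_divergences(rows, known_divergences):
    """Tag rows that match an approved (manuscript, spoken) pair as 'accept'."""
    approved = {(m.lower(), t.lower()) for m, t in known_divergences}
    for row in rows:
        key = (row["manuscript"].lower(), row["transcript"].lower())
        if key in approved:
            row["tag"] = "accept"
    return rows
```

Matching is exact after lowercasing, so run it on the same normalized text you feed the diff.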

What we do not ship: a dedicated manuscript-alignment editor, an ACX pickup sheet generator, or a DAW plugin. We are not pretending those exist. We return word-level JSON with timestamps from our audio-to-text pipeline; you wire that into your diff tool of choice. Descript has text-based audio editing and transcript workflows that are worth knowing about for narrative audio. Otter.ai and Rev.com produce transcripts, but don't ship a dedicated manuscript-diff UI — you export and compare yourself.

Run one chapter before you commit the book

Pilot before you transcribe the whole title. Pick a chapter with dialogue, proper nouns, and at least 25 minutes of finished audio — a clean nonfiction intro with no names isn't a hard enough test.

  1. Choose one finished chapter file.
  2. Export the approved manuscript section as plain text.
  3. Transcribe the chapter audio.
  4. Normalize punctuation, casing, numbers, and contractions on both sides.
  5. Diff transcript against manuscript.
  6. Review every flag while listening at the timestamp.
  7. Tag each row: confirmed, false positive, accept.
  8. Compute one number: confirmed corrections per finished hour.
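Steps 4 and 5 can be sketched with the standard library alone. This version lowercases, strips punctuation, and expands a small illustrative contraction table; number normalization (for example "12" versus "twelve") is left out and would need its own handling. Timestamps are also omitted here, so in practice you would carry each transcript word's start time alongside its text.

```python
import re
from difflib import SequenceMatcher

# Illustrative only; extend this table for your own manuscript
CONTRACTIONS = {"don't": "do not", "it's": "it is", "can't": "cannot"}


def normalize(text):
    """Lowercase, drop punctuation, and expand a few contractions."""
    text = text.lower().replace("\u2019", "'")  # curly apostrophe to straight
    tokens = re.findall(r"[^\W_]+(?:'[^\W_]+)*", text)
    words = []
    for token in tokens:
        words.extend(CONTRACTIONS.get(token, token).split())
    return words


def diff_rows(manuscript_text, transcript_text):
    """Return one row per span where the normalized texts disagree."""
    a, b = normalize(manuscript_text), normalize(transcript_text)
    rows = []
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":
            rows.append({"manuscript": " ".join(a[i1:i2]),
                         "transcript": " ".join(b[j1:j2])})
    return rows
```

Normalizing both sides with the same function is the point: a difference that survives is a difference in words, not in typography.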

That ratio is the headline number; the false-positive count tells you how much cleanup the workflow needs first. 18 confirmed corrections in a 30-minute chapter (36 per finished hour) means the book needs a serious proofing pass. 2 confirmed plus 35 false positives from fantasy names means you need a vocabulary list before you scale up. Don't hand-clean the transcript to make the pilot look good; measure the workflow you'll actually run.
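The pilot numbers reduce to two divisions. The sketch below takes the list of row tags from the review step and the chapter length in minutes; the tag strings match the ones used in the checklist, and the function name is illustrative.

```python
from collections import Counter


def pilot_summary(tags, audio_minutes):
    """Summarize a pilot: confirmed corrections per hour and noise share."""
    counts = Counter(tags)
    total = sum(counts.values())
    return {
        "confirmed_per_finished_hour": counts["confirmed"] * 60 / audio_minutes,
        "false_positive_share": counts["false positive"] / total if total else 0.0,
    }
```

With 18 confirmed rows in a 30-minute chapter this reports 36 confirmed corrections per finished hour. A high false-positive share with few confirmed rows points to the vocabulary list rather than the narration.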

What next

  • Run a one-chapter pilot before scaling up. Count confirmed corrections per finished hour. Decide if the signal-to-noise works for your booth and your reading style.
  • Build a custom vocabulary list — every proper noun, every invented term, every recurring foreign-language phrase — before you transcribe the whole book.
  • Ask your publisher or production contact, if you have one, which corrections schema they prefer. The diff output reshapes easily, but knowing the target format upfront saves a rewrite.
  • If you're a studio or a podcaster handling multiple narrators, the workflow is the same, and higher-volume tiers exist for that workload.