
Blog · 7 min read

Qualitative research transcription for NVivo and Atlas.ti

How to get timestamped, speaker-labelled transcripts that import cleanly into NVivo or Atlas.ti — with the IRB angle on third-party processing.

Qualitative research transcripts that import cleanly into NVivo and Atlas.ti

TL;DR. Qualitative research transcription for thematic analysis needs three things: turn-level timestamps, stable speaker labels, and an export format your CAQDAS tool reads without reformatting. Transcription.Solutions returns JSON with start, end, speaker_0/1/2… and text per turn, plus a DOCX that preserves speaker tags on separate lines — the layout NVivo and Atlas.ti expect. Source audio is deleted within 24 hours, and we don't train on your data — relevant for your IRB protocol.

What qualitative researchers actually need from a transcript

Thematic analysis lives or dies on the unit of coding. If your transcript is one wall of text, every code is a paragraph; if it's turn-by-turn with timestamps, you can code at the utterance level and jump back to the audio for any quote.

The minimum useful shape is: speaker label on its own line, then the turn, with a timestamp at the start of each turn. That's the layout NVivo's auto-coding by speaker recognises, and the layout Atlas.ti's interview transcript import parses into speaker-coded segments without you marking them by hand.

A clean transcript saves the 30–60 minutes per interview you'd otherwise spend reformatting before import. Across a 25-participant study, that's 12.5 to 25 hours, or up to three working days.

The JSON export structure

Every transcript is available as JSON via the dashboard or the REST API. The shape is turn-based, not word-based at the top level — though word-level timing is included inside each turn for tools that need it.

{
  "language": "en",
  "duration": 3284.7,
  "turns": [
    {
      "speaker": "speaker_0",
      "start": 0.48,
      "end": 14.22,
      "text": "So to start, can you tell me a bit about how you first got involved in the project?"
    },
    {
      "speaker": "speaker_1",
      "start": 14.9,
      "end": 47.31,
      "text": "Sure. I came in around the second year, after the pilot phase…"
    }
  ]
}
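
For scripted work, the export above maps naturally onto a small typed structure. The following is a minimal Python sketch of a loader, assuming only the fields shown in the sample (speaker, start, end, text). Turn and load_turns are illustrative names, not part of any official client library, and word-level timing is ignored because its field name is not shown here.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Turn:
    speaker: str
    start: float  # seconds from the beginning of the recording
    end: float
    text: str

    @property
    def duration(self) -> float:
        return self.end - self.start


def load_turns(raw: str) -> list[Turn]:
    """Parse an exported transcript and return its turns in file order."""
    data = json.loads(raw)
    turns = []
    for item in data["turns"]:
        turn = Turn(item["speaker"], float(item["start"]), float(item["end"]), item["text"])
        if turn.end < turn.start:
            raise ValueError(f"turn ends before it starts: {item!r}")
        turns.append(turn)
    return turns


# Abbreviated copy of the sample export (illustrative text only).
sample = """
{"language": "en", "duration": 3284.7, "turns": [
  {"speaker": "speaker_0", "start": 0.48, "end": 14.22, "text": "So to start..."},
  {"speaker": "speaker_1", "start": 14.9, "end": 47.31, "text": "Sure."}
]}
"""
turns = load_turns(sample)
```

Rejecting turns that end before they start catches a truncated or hand-edited file early, before it reaches your coding tool.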

Speakers are anonymous by default — speaker_0, speaker_1, speaker_2 — because we don't know who's who in your interview. You rename them in the dashboard (click the chip → rename) before exporting, and the new label flows through every other export format.

Diarization (separating who spoke when) runs one of two ways, depending on your recording. Stereo files with the interviewer on one channel and the participant on the other get a deterministic channel split: no model inference is involved, so there are no speaker-attribution errors as long as each voice stays on its own channel. Mono recordings go through pyannote/speaker-diarization-3.1, which handles 2–6 speakers reliably; focus groups with overlapping talk and more than 6 speakers are where you'll want to spot-check.
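
To see why a stereo channel split needs no model, here is a minimal sketch of de-interleaving 16-bit stereo PCM with Python's standard library. It illustrates the idea only and is not the service's implementation. split_stereo is a hypothetical helper, and the 16-bit sample width is an assumption about your recording.

```python
from array import array


def split_stereo(frames: bytes) -> tuple[array, array]:
    """De-interleave 16-bit stereo PCM into (left, right) sample arrays.

    Stereo PCM stores samples as L, R, L, R, ... so each channel is
    simply every second sample.
    """
    samples = array("h")  # signed 16-bit integers
    samples.frombytes(frames)
    if len(samples) % 2:
        raise ValueError("stereo data must contain an even number of samples")
    return samples[0::2], samples[1::2]


# Four stereo frames: interviewer on the left channel, participant on the right.
interleaved = array("h", [100, -1, 200, -2, 300, -3, 400, -4]).tobytes()
left, right = split_stereo(interleaved)
```

With a real WAV file, the frames argument would come from the standard wave module (wave.open(path, "rb") followed by readframes). Microphone bleed between channels is the one thing this split cannot remove.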

Converting to DOCX with speaker tags preserved

Once you've renamed speakers, export DOCX. The format looks like this:

[00:00:00] Interviewer:
So to start, can you tell me a bit about how you first got
involved in the project?

[00:00:14] P07:
Sure. I came in around the second year, after the pilot phase…

Speaker label on its own line, timestamp prefix, blank line between turns. This is what both tools want:

Tool | Import behaviour with this layout
NVivo 14 | File → Import → Files → Apply paragraph styles → auto-codes by speaker name
Atlas.ti 24 | Add Documents → recognises speaker labels followed by ":" as section headers, splits into speaker-quotations
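
If you work from the JSON export rather than the DOCX, the same layout can be rebuilt in a few lines. The sketch below produces plain text, not a DOCX file, and assumes the turn fields shown in the JSON sample. format_timestamp and render_turns are illustrative names, and the labels mapping stands in for the dashboard rename step.

```python
def format_timestamp(seconds: float) -> str:
    """Render seconds as [HH:MM:SS], truncating fractions of a second."""
    total = int(seconds)
    hours, rest = divmod(total, 3600)
    minutes, secs = divmod(rest, 60)
    return f"[{hours:02d}:{minutes:02d}:{secs:02d}]"


def render_turns(turns: list[dict], labels: dict[str, str]) -> str:
    """Build the speaker-on-its-own-line layout from JSON turn objects."""
    blocks = []
    for turn in turns:
        # Fall back to the anonymous label when no rename is supplied.
        name = labels.get(turn["speaker"], turn["speaker"])
        header = f"{format_timestamp(turn['start'])} {name}:"
        blocks.append(f"{header}\n{turn['text']}")
    # One blank line between turns, as in the DOCX layout.
    return "\n\n".join(blocks) + "\n"


turns = [
    {"speaker": "speaker_0", "start": 0.48, "end": 14.22, "text": "So to start, ..."},
    {"speaker": "speaker_1", "start": 14.9, "end": 47.31, "text": "Sure. ..."},
]
text = render_turns(turns, {"speaker_0": "Interviewer", "speaker_1": "P07"})
print(text)
```

Timestamps are truncated rather than rounded so that a turn starting at 14.9 seconds is labelled [00:00:14], matching the example layout.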

If you'd rather keep raw timestamps for every utterance (useful when you're coding interviews and want to retrieve audio at the second), there's a "timestamp every turn" toggle on export. For Atlas.ti's A-Docs (audio-linked documents), you can pair the DOCX with the original audio inside the project — timestamps in the transcript become clickable anchors.

We export DOCX, SRT, VTT, TXT, and JSON on every plan, free included. No per-export fee, no watermark.

Accuracy on real interview audio

On a quiet one-to-one interview at 128 kbps or above — single lavalier or USB condenser per speaker — expect ~92% word accuracy. That tracks AssemblyAI Universal-3's published benchmark (Universal-3 is our primary model; Whisper Large-v3 is a fallback on transient errors).

The accuracy variables that matter most for qualitative work:

  • Bitrate: 128 kbps+ MP3 or any WAV → ~92%. 64–96 kbps → ~87%. 32–48 kbps phone-quality → ~80% and worth a review pass.
  • Speakers: 2 talkers on a stereo channel-split is the best case. 4+ talkers on a mono mic in a noisy café is the worst case.
  • Language: 99 languages auto-detected from the first 30 seconds. High-resource languages (English, Spanish, German, French, Mandarin) sit at around 92% on clean audio; low-resource languages drop further and benefit from a manual review.

Either way, treat the auto-transcript as a draft, not a finished artefact. The time saving is in the typing — not in eliminating your read-through.
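
One way to check these figures against your own recordings is to hand-correct a few minutes of one interview and score the draft against it. The sketch below assumes "word accuracy" means 1 minus word error rate, computed with a word-level edit distance. That definition and the helper names are assumptions for illustration, and punctuation handling is left out.

```python
def word_errors(reference: list[str], hypothesis: list[str]) -> int:
    """Word-level Levenshtein distance (substitutions + insertions + deletions)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        current = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # deletion: reference word missing
                current[j - 1] + 1,      # insertion: extra word in the draft
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def word_accuracy(reference: str, hypothesis: str) -> float:
    """Return 1 - WER, comparing lower-cased whitespace-separated words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    return 1 - word_errors(ref, hyp) / len(ref)


corrected = "I came in around the second year after the pilot phase"
draft = "I came in around the second year after the pilot face"
accuracy = word_accuracy(corrected, draft)
print(f"{accuracy:.1%}")
```

Here one substituted word out of eleven gives an accuracy of 10/11, a little under 91%. A sample of a few hundred words per recording condition is usually enough to tell whether a file needs a heavier review pass.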

The IRB and consent angle

Most ethics boards want two things spelled out when you use a third-party transcription service: what happens to the audio, and whether the provider trains on your data. Both matter because a voice recording can identify the speaker, so interview audio counts as personal data under GDPR and as identifiable data under most IRB frameworks.

What we do, specifically:

  • Source audio is permanently deleted from our infrastructure within 24 hours of job completion. Not "may be deleted" — deleted on a scheduled job. Transcripts remain in your account until you delete them.
  • We do not train models on user data. The ASR providers we route through (AssemblyAI as primary, OpenAI Whisper as fallback) are configured with training opt-out where the provider offers it.
  • Data location and encryption: TLS in transit, encryption at rest (AWS S3). The full sub-processor list, including the ASR providers and storage regions, is published at /privacy and updated as the stack changes. EU and US processing regions are available through Supabase (auth + database); object storage is on AWS S3.

For your IRB application, the practical wording is: "Audio recordings will be transcribed using Transcription.Solutions, a third-party service. Audio files are deleted from the provider's systems within 24 hours of transcription. The provider does not use uploaded content to train machine learning models. Transcripts will be stored on [your institution's storage]."

If your protocol forbids any cloud processing, local Whisper on your laptop is the alternative — slower, lower accuracy on multi-speaker audio, and you still bear the data-handling burden. For most research-ethics frameworks, the 24-hour deletion guarantee is what makes cloud transcription acceptable.

Choosing how to upload

Three workflows, depending on volume:

  1. One interview at a time: drag the audio into the dropzone. Best for a small study (10–25 interviews) where you transcribe as you go.
  2. A batch after fieldwork: queue 20 uploads — Pro processes 20 concurrent jobs, Business processes 50. A day of fieldwork transcribed overnight.
  3. Automated pipeline: if you're running a longitudinal study with weekly interviews, POST each file to the REST API and receive a webhook when the JSON is ready. Drop it straight into your NVivo project folder.
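
For the automated pipeline, the receiving side can stay very small. The sketch below covers only the last step, saving a delivered transcript to disk. It assumes the webhook body is the transcript JSON itself with a top-level "turns" list; the real payload shape, the endpoint for uploads, and the authentication scheme are not described here, so check the API documentation. store_transcript and the file name are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def store_transcript(raw_body: bytes, out_dir: Path, name: str) -> Path:
    """Validate a transcript payload and write it to out_dir/name.json.

    Assumption: the body is the transcript JSON itself, with the same
    top-level "turns" list as the dashboard export.
    """
    transcript = json.loads(raw_body)
    if not isinstance(transcript, dict) or not isinstance(transcript.get("turns"), list):
        raise ValueError("payload has no 'turns' list")
    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / f"{name}.json"
    target.write_text(
        json.dumps(transcript, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return target


# Simulated delivery; a real handler would get raw_body from its web framework.
body = json.dumps(
    {"language": "en", "turns": [
        {"speaker": "speaker_0", "start": 0.48, "end": 14.22, "text": "Hello"},
    ]}
).encode("utf-8")
saved = store_transcript(body, Path(tempfile.mkdtemp()) / "transcripts", "wave3_P07")
```

Validating before writing means a malformed or error payload never lands in the folder you import from.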

Pro at $19/month covers 600 minutes: 10 hours of audio, or ten one-hour interviews. Above that, overage is $0.04/minute on Pro and $0.02/minute on Business, or you move to Business at $49/month for 2,500 minutes. The pricing page has the full breakdown.
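
To decide between paying overage and upgrading, compare the two monthly totals. The sketch below uses only the prices quoted above and works in cents to avoid rounding issues. It assumes overage is billed per whole minute with no other fees, which is an assumption, and the function names are illustrative.

```python
def pro_cost_cents(minutes: int) -> int:
    """Monthly Pro cost: $19 for 600 minutes, then $0.04 per extra minute."""
    return 1900 + max(0, minutes - 600) * 4


def business_cost_cents(minutes: int) -> int:
    """Monthly Business cost: $49 for 2,500 minutes, then $0.02 per extra minute."""
    return 4900 + max(0, minutes - 2500) * 2


def cheaper_plan(minutes: int) -> str:
    """Name the cheaper plan for a given monthly volume."""
    pro, business = pro_cost_cents(minutes), business_cost_cents(minutes)
    if pro == business:
        return "either"
    return "Pro" if pro < business else "Business"


for minutes in (600, 1000, 1350, 2000):
    print(minutes, pro_cost_cents(minutes) / 100,
          business_cost_cents(minutes) / 100, cheaper_plan(minutes))
```

The $30 price gap equals 750 overage minutes at $0.04, so under these assumptions the plans cost the same at 1,350 minutes a month (22.5 hours of audio), and Business is cheaper beyond that.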

FAQ

Can I import a Transcription.Solutions DOCX directly into NVivo without reformatting?

Yes, as long as you've renamed speakers from speaker_0 to meaningful labels (Interviewer, P01, etc.) before exporting. NVivo 14's auto-code by speaker reads the [timestamp] Speaker name: pattern and assigns each turn to that speaker as a node. You don't need to apply paragraph styles manually unless you want NVivo to treat the timestamp lines as headings.

Does Atlas.ti recognise the timestamps as audio anchors?

Atlas.ti 24 supports synchronised audio documents (A-Docs), where you import the audio and the transcript together and timestamps become click-to-play anchors. The DOCX export uses [HH:MM:SS] prefixes that Atlas.ti parses when you link the audio in the project. Note that A-Docs need the original audio file on hand. Since we delete source audio within 24 hours, keep your own copy of the recording if you plan to use them.
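
Before linking audio, it can help to confirm that the timestamp prefixes in an exported transcript line up with your local copy of the recording. The sketch below parses the [HH:MM:SS] Speaker: header pattern into an offset in seconds. It is a standalone check, not a description of how Atlas.ti parses the file, and parse_header is an illustrative name.

```python
import re
from typing import Optional

# Matches turn headers such as "[00:00:14] P07:".
HEADER = re.compile(r"^\[(\d{2}):(\d{2}):(\d{2})\]\s+(.+):$")


def parse_header(line: str) -> Optional[tuple[int, str]]:
    """Return (offset_seconds, speaker) for a turn header, else None."""
    match = HEADER.match(line.strip())
    if match is None:
        return None
    hours, minutes, seconds = (int(part) for part in match.group(1, 2, 3))
    return hours * 3600 + minutes * 60 + seconds, match.group(4)


lines = [
    "[00:00:14] P07:",
    "Sure. I came in around the second year, after the pilot phase…",
]
offsets = [parsed for parsed in map(parse_header, lines) if parsed is not None]
```

Any offset larger than the duration of your audio file is a sign that the transcript and recording do not belong together.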

How do you handle overlapping speech in focus groups?

For mono recordings with 3+ speakers, pyannote/speaker-diarization-3.1 assigns turns to whoever it judges as dominant in a given window. Hard overlaps (two people talking at once at similar volume) are usually attributed to a single speaker, with the other speaker's turn often merged or shortened. For focus groups of 4+ participants, plan a 10–15% manual review pass on diarization labels, checking against your field notes about who sat where.
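
A review pass is easier to defend in a methods section if the sample is random and reproducible. The sketch below draws a fixed-seed sample of turns to check by hand. It assumes turns are the dictionaries from the JSON export; review_sample, the 15% default, and the seed are illustrative choices.

```python
import math
import random


def review_sample(turns: list[dict], fraction: float = 0.15, seed: int = 0) -> list[dict]:
    """Pick a reproducible random subset of turns to check by hand."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    if not turns:
        return []
    # Round first so float noise (e.g. 6.000000001) cannot add an extra turn.
    count = min(len(turns), math.ceil(round(len(turns) * fraction, 9)))
    chosen = random.Random(seed).sample(range(len(turns)), count)
    return [turns[i] for i in sorted(chosen)]  # keep transcript order


# Synthetic focus group: 40 turns rotating between four speakers.
turns = [{"speaker": f"speaker_{i % 4}", "start": float(i * 10)} for i in range(40)]
to_check = review_sample(turns)
```

Returning the sample in transcript order lets you work through the audio in a single pass instead of jumping back and forth.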

What's the difference between the JSON and DOCX exports for qualitative coding?

JSON gives you turn objects with start, end, speaker, and text plus word-level timing inside each turn — useful if you're scripting analysis in Python or R, or building a custom pipeline before importing to CAQDAS. DOCX is the human-readable form that imports directly into NVivo or Atlas.ti without scripting. Most researchers use DOCX for coding and keep JSON for any later quantitative work (turn lengths, response times, talk ratios).
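
As an example of that later quantitative work, the sketch below computes talk time per speaker, each speaker's share of the talk, and the gaps between turns. It assumes the turn fields shown in the JSON sample, and the function names are illustrative.

```python
from collections import defaultdict


def talk_time(turns: list[dict]) -> dict[str, float]:
    """Total seconds of speech per speaker."""
    totals: dict[str, float] = defaultdict(float)
    for turn in turns:
        totals[turn["speaker"]] += turn["end"] - turn["start"]
    return dict(totals)


def response_gaps(turns: list[dict]) -> list[float]:
    """Silence between the end of one turn and the start of the next.

    Only speaker changes are counted; negative values indicate overlap.
    """
    return [
        nxt["start"] - prev["end"]
        for prev, nxt in zip(turns, turns[1:])
        if prev["speaker"] != nxt["speaker"]
    ]


turns = [
    {"speaker": "speaker_0", "start": 0.48, "end": 14.22, "text": "..."},
    {"speaker": "speaker_1", "start": 14.9, "end": 47.31, "text": "..."},
]
totals = talk_time(turns)
share = {name: secs / sum(totals.values()) for name, secs in totals.items()}
gaps = response_gaps(turns)
```

On the two sample turns this gives roughly 13.7 seconds for the interviewer, 32.4 seconds for the participant, and a 0.68-second response gap.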

Do you support languages other than English for interview transcription?

99 languages auto-detected from the first 30 seconds of audio. High-resource languages (English, Spanish, German, French, Italian, Portuguese, Dutch, Mandarin, Japanese) get ~92% accuracy on clean interview audio. Lower-resource languages drop further and benefit from a manual review pass. You can override auto-detection if your file is bilingual or starts with a long English preamble before switching to the interview language.

Is the 24-hour audio deletion configurable or guaranteed?

Guaranteed and not configurable: audio files are deleted on a scheduled job within 24 hours of transcription completing, regardless of plan. Transcripts (the text output) stay in your account until you delete them or close the account. We don't offer "extended audio retention" as a feature because the cleanest IRB story is a hard deletion guarantee. If you need the audio long-term, keep your own copy when you upload and store it in your institution's repository.

Can I get a signed Data Processing Agreement for my IRB application?

Yes. The DPA isn't self-serve — email support@transcription.solutions with your institution and use case, and we'll send the signed PDF. Our DPA covers the EU Standard Contractual Clauses for cross-border transfers, the sub-processor list at /privacy, the 24-hour source-media deletion commitment, and the no-training clause. Available on any plan including Free — if you're vetting the service for a study, ask before you upload anything that needs the agreement in place.
