Factors That Affect Transcription Accuracy Rates

Transcripts are easily accessible, digitally searchable representations of recorded audio or video information. They make it easier for users to get what they need from the recordings. This is why transcription services are heavily used by organizations that produce and depend on recorded audio or video. These include law enforcement agencies, healthcare organizations, law firms, businesses, and academic institutions. 

Accuracy is crucial to transcription, so much so that clients use it as a primary qualification when choosing a transcription service. However, accuracy is subject to several factors, which we will discuss below. 

Why is accuracy important in transcription?

Human conversations are incredibly complex and nuanced events. Language, dialects, accents, cultural backgrounds, syntax, emotions, social status, cognitive limitations, and so many more things can affect the flow of any discussion. The only reason it’s easy to have a conversation is because we do it daily. But try listening to a conversation in a language you have limited experience with, and you’ll begin to grasp how difficult it is to comprehend.  

Converting conversations into a written format can be challenging, as much can be lost during transcription. This is why accuracy is essential in transcription. A single incorrect word can give a sentence a whole new meaning. Errors can completely undermine what the speaker was trying to say. In certain low-risk cases, this won’t cause too many problems. But imagine having errors in transcripts used in a courtroom, a police interview, or a make-or-break business negotiation. The effects can be serious and far-reaching.

Difference between clean and full verbatim transcription

Before discussing accuracy, we must highlight the difference between a clean and full verbatim transcription. This clarification is necessary because each transcription method has a different definition of what counts as an error. 

Verbatim transcriptions include every word and utterance from the recording. This includes false starts, filler words, pauses, and physical actions like coughing and clearing of throats. Crucially, it also includes grammar, word choice, and syntax errors.

Here’s a quick example of a full verbatim transcript: 

Speaker 1: I guess, uhm, we should call it a night? The work’s done mostly, and there’s, I mean, it’s a bit late. We can [clears throat] We can maybe continue on Monday. And… uh, there’s a coffee shop a few blocks away. So… maybe you want to, uhm, get a drink?

Speaker 2: Uhhhh… I have a few more things to do here. Sorry. 

Speaker 1:  Ah. That’s… uh, that’s too bad. 

Speaker 2: No! Uhm, I meant, can you maybe wait for me to finish? I’ll be happy to get coffee with you. [laughs]

As you can see, full verbatim transcripts include everything said and heard in the recording. There are several reasons why clients would prefer this type of result. First, it creates a complete record of the content. Second, it captures the full scope of the conversation, including nuances such as hesitations and pauses that a cleaned-up version would lose. 

Law firms and other legal industry organizations usually require full verbatim transcripts, especially for recordings like depositions and court hearings. Similarly, law enforcement agencies benefit from full verbatim transcripts, as every contextual detail is essential in reviewing their different recordings from interviews, wiretaps, 911 calls, and undercover wires. 

Cleaned-up verbatim transcripts, meanwhile, offer a more sanitized result. Filler words are removed while grammar errors, false starts, and sentence structure problems are corrected. The essence of the recording is still there, but it is lightly edited to create a cleaner transcript, making it much easier to read or skim. 

Here’s an example of a cleaned-up verbatim transcript: 

Speaker 1: I guess we should call it a night? The work’s mostly done, and it’s a bit late. We can maybe continue on Monday. And there’s a coffee shop a few blocks away. So maybe you want to get a drink?

Speaker 2: I have a few more things to do here. Sorry. 

Speaker 1: That’s too bad. 

Speaker 2: No! I meant, can you maybe wait for me to finish? I’ll be happy to get coffee with you. 

The message is essentially the same, but this type of transcript takes out small details that may affect the context of the conversation, like Speaker 1’s hesitation. Some industries prefer cleaned-up verbatim transcripts because they make recorded materials easier to review. Examples include a call center that only needs general data from its recordings, or a researcher whose project doesn’t require the smallest details and who wants less text to work through.

As you can see, the difference between the two transcription methods can affect how accuracy is rated for each. Verbatim transcripts require everything to be transcribed, so missing anything will be counted against the accuracy ratings. Meanwhile, cleaned-up verbatim transcripts can do away with fillers and unnecessary words and not affect their accuracy. 
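To illustrate why the two styles are scored differently, here is a minimal Python sketch of how a reviewer might normalize text before checking a cleaned-up verbatim transcript. The function name, the filler pattern, and the bracket convention for non-speech sounds are assumptions made for this example, not an industry standard. It only handles fillers and bracketed sounds; false starts and grammar corrections still need human judgment.

```python
import re

# Assumed filler pattern for illustration: uh, uhm, uhhhh, um, ah.
FILLER = re.compile(r"u+h+m*|u+m+|a+h+")

def normalize_for_clean_verbatim(text: str) -> list[str]:
    """Return lowercase words with bracketed sounds and fillers removed."""
    # Drop bracketed non-speech annotations such as [clears throat] or [laughs].
    text = re.sub(r"\[[^\]]*\]", " ", text)
    # Keep only runs of letters and apostrophes (straight or curly).
    words = re.findall(r"[a-z'’]+", text.lower())
    # Remove any word that is entirely a filler sound.
    return [w for w in words if not FILLER.fullmatch(w)]

print(normalize_for_clean_verbatim("I guess, uhm, we should call it a night?"))
```

For a full verbatim transcript, this step would be skipped, since every filler and sound is expected to appear in the output.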

How is accuracy measured in transcription?

The primary way to measure accuracy in transcription is manual evaluation. A reviewer counts the incorrect words, divides that count by the total number of words in the document, and subtracts the result from 100%. For example, a 2,000-word transcript with 200 errors has a 10% error rate, or a 90% accuracy rate. Most human transcription providers offer 85% to 99% accuracy rates. 
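The arithmetic behind that example can be sketched in a few lines of Python. The function name and its argument checks are our own choices for illustration.

```python
def accuracy_rate(total_words: int, incorrect_words: int) -> float:
    """Return transcript accuracy as a percentage of correct words."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    if not 0 <= incorrect_words <= total_words:
        raise ValueError("incorrect_words must be between 0 and total_words")
    # Correct words divided by all words, scaled to a percentage.
    return 100.0 * (total_words - incorrect_words) / total_words

# The example from the text: 200 errors in a 2,000-word transcript.
print(accuracy_rate(2000, 200))  # 90.0
```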

WER, or Word Error Rate, is the primary metric used to measure the accuracy of automatic speech recognition (ASR) systems, machine transcription, and other natural language processing software, such as Google’s or Microsoft’s speech-to-text programs.

The formula for calculating WER is broadly similar to manual evaluation: WER = (S + D + I) / N


  • S (Substitutions): The number of words in the reference transcription replaced with incorrect words in the ASR output.
  • D (Deletions): The number of words missing in the ASR output compared to the reference transcription.
  • I (Insertions): The number of extra words in the ASR output that were not in the reference transcription.
  • N (Total words): The number of words in the reference transcription.

The WER score is expressed as a percentage. For example, if an automated transcription system produced a transcript with 20 substitutions, 7 deletions, and 4 insertions against a reference text containing 100 words, the WER would be calculated as WER = (20 + 7 + 4) / 100 = 0.31, or a 31% error rate.
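As a rough illustration, the Python sketch below computes WER in two ways: directly from known S, D, and I counts, and from a pair of texts using a word-level edit distance, which is a common way to find the minimum number of substitutions, deletions, and insertions. The function names are hypothetical, the text is split on whitespace with no normalization, and the second function reports only the combined error count rather than a breakdown by error type. Both return a fraction; multiply by 100 for a percentage.

```python
def wer_from_counts(substitutions: int, deletions: int,
                    insertions: int, reference_words: int) -> float:
    """Apply WER = (S + D + I) / N to counts that are already known."""
    if reference_words <= 0:
        raise ValueError("reference_words must be positive")
    return (substitutions + deletions + insertions) / reference_words

def wer(reference: str, hypothesis: str) -> float:
    """Estimate WER from two texts using word-level edit distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edits needed to turn the first i-1 reference words
    # into the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)

# 20 substitutions, 7 deletions, 4 insertions, 100 reference words.
print(wer_from_counts(20, 7, 4, 100))  # 0.31
# One missing word out of five reference words.
print(wer("we can continue on monday", "we can continue monday"))  # 0.2
```

Because insertions are counted against the reference length, WER can exceed 100% when the output contains many extra words.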

Common factors that may affect transcription accuracy

Aside from the transcriptionist’s skill and work quality, several other factors can affect the accuracy of transcription projects. Here they are: 

Video or audio file quality

The quality of the source file being transcribed significantly impacts the result of any transcription process. It is possible to work with audio that is occasionally unclear, as the transcriptionist can use contextual understanding to fill in the gaps in the conversation. However, the more garbled or disjointed the recording is, the more accuracy suffers. A few things affect audio and video quality: 

Background noises

Most recordings have some degree of background noise. Law enforcement recordings taken from police body and dash cams are likely to include traffic noise from other vehicles or ambient noise from rain, snow, and other weather. Courtroom recordings may capture background conversations, crowd reactions, and other sounds. These noises can sometimes affect a transcriber’s ability to accurately write down what is being said. 

Audio artifacts

Audio or sonic artifacts are any unintended and undesired sounds captured by audio or video recording devices. Examples of audio artifacts are static noises, electric hums and hisses, buzzing, and distortions. These can be caused by recording device issues or storage problems. They might also result from routine procedures, like compression, file transfers, or file format conversions done for compatibility. Sometimes, they even come from a lack of proper equipment. For example, plosive noises (which are popping sounds produced when recording words that have P, T, K, D, or B sounds) can be minimized or outright eliminated by using pop filters during recording. 

Equipment limitations

One of the most common concerns affecting recording quality is the equipment used. Some recording gear is better suited to certain situations than others. While not always a factor, cheap equipment may record lower-quality audio and video than expensive equipment. Older recording equipment may malfunction if it has not been maintained or was cheaply made. Deteriorating storage devices like failing SD cards or hard drives can also contribute to poor audio quality. 

Multiple speakers affecting subtitles and captions

Real human conversations tend to overlap. We don’t usually notice such things when talking to one another in person. Our brains are used to such interactions, taking contextual clues and other physical cues from the other people we are talking with to grab the idea of the conversation and continue speaking ourselves. 

One of my favorite examples of pointing out how messy discussions can be is the 2013 film Coherence. The film’s first ten minutes depict several characters meeting and talking about several topics, and their words melt and cross and crash into each other. 

We’re used to clean movie dialogues, where each speaker is clearly identified, so it can be a bit grating to listen to at first. The subtitles and captions don’t help either, as they’re usually limited to two to three lines on the screen at a time. 

The funny thing is that most real-world conversations between multiple people go that way. We just don’t notice it because we are so used to it. 

Creating transcripts from recordings with multiple overlapping speakers can present a challenge. Hearing one speaker’s words when multiple people speak simultaneously might be difficult, even if you increase the volume. Voices are usually distinct, but that’s not always the case. Audio quality may also affect this issue, making it hard for the transcriptionist to discern who is talking. 

Language complexity and use of jargon

Different industries and lines of business use different terminologies. Transcriptionists processing recordings from unfamiliar industries or practices may have difficulty transcribing specialized words or acronyms. These terms can be misinterpreted, leading to an increase in error rates. 

The transcriber also has the option to omit a word or phrase in the finished transcript, marking it with a timestamp as either indecipherable or not understood. If accuracy is paramount, the transcriptionist will use Google or another online database to try and determine what was said. This can, unfortunately, increase the time spent doing the transcription.
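As a small illustration of the timestamp marking described above, the Python sketch below builds a bracketed marker from a position in the recording. The function name, the bracket format, and the default "inaudible" label are assumptions for this example; providers and clients typically agree on their own notation.

```python
def unclear_marker(seconds: float, label: str = "inaudible") -> str:
    """Format a marker like [inaudible 00:12:34] for an unclear passage."""
    if seconds < 0:
        raise ValueError("seconds must not be negative")
    # Break the elapsed time into hours, minutes, and whole seconds.
    hours, remainder = divmod(int(seconds), 3600)
    minutes, secs = divmod(remainder, 60)
    return f"[{label} {hours:02d}:{minutes:02d}:{secs:02d}]"

# A word that could not be made out 12 minutes 34 seconds into the recording.
print(unclear_marker(754))  # [inaudible 00:12:34]
```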

Transcription service providers with industry-specific transcriptionists have an advantage in this situation. Their extensive experience with the industry and its terminology helps them transcribe jargon correctly and finish the work faster. 

Editing and proofreading

Transcription projects can also go through proofreading to maintain accuracy. How the proofreading and editing are conducted may affect the resulting accuracy rate, and transcriptionists use the proofreading and error correction methods that work best for them. 

Here are some example scenarios. 

Scenario 1: The client used a free automated transcription or speech-to-text program

There are several free speech-to-text programs available online. Large companies like Google, Amazon, Microsoft, and Apple offer AI transcription programs as part of their services or build them into their software. 

Using these AI programs may present upfront savings for clients, but automated transcripts usually have a lot of errors and will require human editing, so the time spent correcting them is often not worth it. They’re primarily appropriate for single-person recordings or simple projects. The client can proofread and edit the transcript themselves or hire someone from a virtual assistant company to do the corrections for them. 

Scenario 2: The client hired a transcription service provider

In certain cases, hiring a transcription service provider is the best option. Corporations, law firms, and law enforcement agencies generate a lot of recordings that require transcription. It’s not feasible for them to use purely automated services, as these require a lot of human intervention. The editing involved will likely double the effort and time spent on their transcription projects. 

Transcription service providers will cover the project from start to finish and provide rigorous proofreading and error-checking practices to ensure their accuracy guarantee. 

How can transcription service providers improve their accuracy ratings?

There is only so much a transcription service provider can do with poor audio quality. What they can do, however, is make sure that their transcription process is calibrated to generate the best and most accurate results for their clients. Ditto Transcripts employs these techniques to ensure our 99% accuracy guarantee:

  • 100% human transcriptionists handle all transcription jobs. We don’t use AI or speech-to-text software for any of the work we do.
  • All of our transcriptionists are trained and experienced in their specialized fields, making them experts in the industry terminology and jargon.
  • All individual projects go to one specific person (or a group, depending on scope). We do not cut up your recordings and give them to different people to transcribe. This way, work continuity and quality are ensured.
  • All transcriptionists working with us follow our quality standards and guidelines to the letter.
  • All transcription projects undergo multiple quality checks, edits, and proofreading before being returned to the client.
  • We will inform clients if their recordings are difficult to transcribe or if the video or audio length will affect the process. In such cases, we’ll give the client the new turnaround times.

Contact us at (720) 287-3710 for fast, accurate transcriptions.

Looking For A Transcription Service?

Ditto Transcripts is a U.S.-based HIPAA and CJIS compliant company with experienced U.S. transcriptionists. Learn how we can help with your next project!