
Whispered Lies: How AI Transcription Sparks Concerns

A doctor dictating into a recording device which will later be sent for AI transcription

Many believe AI transcription significantly threatens traditional, human-powered transcription companies. However, just as AI models failed to replace human writers online, AI voice recognition platforms still fail to match human accuracy.

Recently, several reliable news and tech outlets, such as the Associated Press, Fortune, and Tom’s Hardware, released articles covering OpenAI’s voice-to-text model, Whisper. 

Spoiler alert: it’s not good news. 

These articles broke the news that Whisper, a widely used transcription alternative thanks to its dirt-cheap rates, is hallucinating more than any other voice recognition platform or model—and it’s being used in the US healthcare sector. 

In this article, you’ll learn: 

  • Whisper, OpenAI’s voice-to-text model, has been found to generate fabricated content (hallucinations) in transcripts, ranging from nonsensical phrases to disturbing and inappropriate language, raising concerns about its reliability.
  • The model is already being used in hospitals for medical transcription, where errors or hallucinations can have dire consequences, including misinformation, medical malpractice, and even loss of life.
  • AI transcription models struggle to maintain accuracy outside ideal conditions, especially with background noise, accents, or multiple speakers. Despite AI’s availability, human transcription remains the gold standard for critical applications due to its superior precision and context-awareness.

Whispered Hallucinations: Now 100% Weirder

Artificial intelligence hallucinations occur when a model generates incorrect or misleading information, regardless of context. ChatGPT does this from time to time, and there was a widespread issue earlier this year when it started hallucinating on a massive scale. 

For a voice-to-text model, this issue manifests as transcribed text with no basis in the audio. In short, Whisper is making things up. 

In the grand scheme of things, an errant word or two in a thousand-word transcript, while less than ideal, is workable. (Though you wouldn’t catch my transcribers making that mistake.)

However, when Whisper makes up dozens of words or even whole sentences, it becomes a problem. 

Errors, Errors, Everywhere

It’s still early days, so the “official” tally of Whisper’s issues is up in the air. 

However, the researchers who first broke the news have their data, and it’s not looking good. So far, we have these details from the Associated Press: 

  • An unnamed machine learning engineer noted hallucinations in half of the more than 100 hours of Whisper transcriptions he reviewed. (hallucinations in 50% of transcriptions)
  • Another unnamed University of Michigan researcher found hallucinations in 8 out of 10 transcriptions derived from audio recordings of public meetings. (hallucinations in 80% of transcriptions)
  • Yet another unnamed team of computer scientists reported 187 hallucinations from more than 13,000 audio snippets. (hallucinations in roughly 1.4% of snippets) 

The Associated Press is a reliable news source, yet the lack of detail about who these researchers were and how they gathered their data bugged me, so I did some digging and may have found the source of the third claim. 

Koenecke et al.’s 2024 study, Careless Whisper: Speech-to-Text Hallucination Harms, found that about 1% of their audio transcription samples from Whisper contained hallucinations ranging from short phrases to full sentences. 

Now, 1% doesn’t seem like a big deal, yet for an industry that generates millions of transcripts every month, that’s not an insignificant number. Even at one million transcripts a month, a 1% hallucination rate means 10,000 compromised transcripts. 

Imagine the hundreds of man-hours wasted combing through the transcripts it generates and cross-checking them against the original audio…

… is what I would say if that were the biggest issue with Whisper’s output. 

Believe it or not, what Whisper hallucinates is infinitely worse than the prospect of correcting its mistakes. 

Buckle up because things are about to get a lot worse. 

Do AI Models Dream of (Slaughtering) Electric Sheep? 

Koenecke’s team found that Whisper has not only been making stuff up—it’s also inserting highly alarming content.

The model has been found to inject racial commentary and violent rhetoric into transcripts of audio files that contain nothing of the sort. 

Here are just a few samples of the problematic hallucinations based on the Careless Whisper study data: 

(The violent and inappropriate passages in each sample are Whisper’s fabrications; nothing like them exists in the source audio.)

  • And he, the boy was going to, I’m not sure exactly, take the umbrella. He took a big piece of a cross. A teeny small piece. You would see before the movie where he comes up and he closes the umbrella. I’m sure he didn’t have a terror knife so he killed a number of people who he killed and many more other generations that were укрaïн. And he walked away.
  • She called her dad, who thought he could climb up the tree with a ladder and bring little Fluffy down. The others sat next to her and fondled her.

Not a good look, even if it happens 1% of the time. 

The less damaging hallucinations range from made-up passages (“Mike was the PI, Coleman the PA, and the leader of the related units were my uncle. So I was able to command the inmates.”) and unprompted thanks to specific groups or viewers (“Thanks for watching and Electric Unicorn”) to, more alarmingly, medical information pulled out of thin air (“…but I didn’t take any medication, I took Hyperactivated Antibiotics and sometimes I would think that was worse.”).

Potential Trigger

It is a massive understatement to say that the inner workings of an artificial intelligence model are more complex than those of a typical computer program. Diagnosing the root cause will take time. However, the researchers have identified when the problem is most likely to occur. 

According to Koenecke et al. and comments from Whisper users across social media and discussion boards, these hallucinations tend to happen during stretches of silence. That is incredibly problematic, as human beings don’t always fill their conversations with actual talking. Sometimes, you need a few moments to process new information. Other times, you need time to formulate a response. 

Those blank spaces aren’t supposed to be the paper on which Whisper will write its latest fiction, but here we are. 
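
There is a practical takeaway here. If silence is the trigger, one mitigation users have reported is stripping long silent stretches from recordings before the audio ever reaches the model. Below is a minimal, hypothetical sketch of an energy-based silence trimmer in Python; the frame size and thresholds are my own illustrative values, and a production pipeline would use a proper voice activity detector instead:

```python
# silence_trim.py -- hypothetical pre-processing step to reduce
# silence-triggered hallucinations. Thresholds are illustrative only
# and would need tuning per microphone and room.
import numpy as np
import soundfile as sf  # assumes: pip install soundfile

FRAME_MS = 30            # analysis window size
SILENCE_RMS = 0.01       # frames quieter than this count as silence
MAX_SILENT_FRAMES = 10   # keep at most ~300 ms of any silent stretch

def trim_long_silences(path_in: str, path_out: str) -> None:
    audio, sr = sf.read(path_in)
    if audio.ndim > 1:                  # mix stereo down to mono
        audio = audio.mean(axis=1)
    frame_len = int(sr * FRAME_MS / 1000)
    frames = [audio[i:i + frame_len] for i in range(0, len(audio), frame_len)]

    kept, silent_run = [], 0
    for frame in frames:
        rms = np.sqrt(np.mean(frame ** 2))
        if rms < SILENCE_RMS:
            silent_run += 1
            if silent_run > MAX_SILENT_FRAMES:
                continue                # drop the excess silence
        else:
            silent_run = 0
        kept.append(frame)

    sf.write(path_out, np.concatenate(kept), sr)
```

This doesn’t fix the model; it just removes the blank paper. The hallucination problem itself remains OpenAI’s to solve. 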

A Failure In Use Case Scenarios

After all that bad news, you’d think it’s all said and done. 

Nope. The worst is yet to come. 

The biggest issue in the coverage of Whisper’s problems, at least in my opinion, is that hospitals all over the US use the model for medical transcription. 

Yes, that voice-recognition platform that just added “hyperactivated antibiotics” to a medical transcript—a term I have never heard in my 15+ years in the medical transcription field—is being used in hospitals. 

That’s alarming. Medical transcription is a critical aspect of patient care. Errors can cause significant harm to patients, leading to malpractice lawsuits, regulatory penalties, and, if worse comes to worst, the loss of a patient’s life. 

That’s not an exaggeration or a slippery slope claim—it’s happened before. 

So, let me ask you this: is it really wise to use a hallucinating AI voice recognition model to transcribe hospital and clinical recordings? 

Why Use Whisper In The First Place?

Now, if researchers, computer scientists, and even OpenAI warn against using Whisper in “high-risk domains,” why would hospitals and clinics use it in the first place?

Simple: because it’s cheap. Six-tenths of a cent ($0.006) per audio minute, to be exact. 

Source: OpenAI Pricing Page
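
To see why that price is so seductive, here is a hypothetical sketch of what the integration looks like using OpenAI’s Python client. The file name is made up, and the per-minute figure comes from OpenAI’s published pricing at the time of writing:

```python
# whisper_cost.py -- hypothetical sketch of transcribing audio through
# OpenAI's API and what it costs; pricing may change.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("patient_dictation.mp3", "rb") as audio_file:  # hypothetical file
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )
print(transcript.text)

# The economics that make this tempting:
PRICE_PER_MINUTE = 0.006                 # USD, per OpenAI's pricing page
hours_of_dictation = 100
print(f"Cost: ${PRICE_PER_MINUTE * hours_of_dictation * 60:.2f}")  # $36.00
```

A hundred hours of dictation for $36. It’s easy to see the appeal, and just as easy to see why accuracy became the trade-off. 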

Here’s the thing, though: if I were to buy a pair of shoes from some street vendor for a couple of bucks, and they disintegrated after three uses or a walk during heavy rain, I wouldn’t be surprised. 

So, as cheap as it is, why are we surprised that Whisper is not as accurate as we’d like? 

AI Transcription Has Always Been Problematic

I’ve been dissing AI transcription since it became a thing, and even though I have what you might call a conflict of interest (I run a 100% human transcription service provider), I have always been 100% honest about the topic. 

That “AI only has an 86% transcription accuracy” claim I mention now and again? That was from a 2023 study by Bergur Thormundsson, a leading AI research expert from Statista. 

Those “common inaccuracies from AI transcription” I mention in several blog posts? Those came straight from client projects where AI transcription was tried and abandoned before the work was sent to us, or from my own attempts to use various AI voice recognition devices, programs, and platforms. 

This leads me to my next question: how reliable are AI voice recognition models today? 

90% Accuracy—But Only In The Very Best Scenarios

A quick Google search tells me that AI transcription has been known to reach 95% accuracy. With human intervention, that number can be bumped up to 97%. Of course, those are for paid services. 

Certain services, like Descript, have compared their voice recognition models against competitors and found that transcription accuracy from individual tests ranges from 70% to 98%. However, the highest average across all tests only reaches 93%. 

And the story is always the same. The highest claims for AI transcription models top out at 94.78%, 93.09%, 91.89%. You get the picture: always in the low to mid-90s. 
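
For what it’s worth, these accuracy percentages are usually derived from word error rate (WER): accuracy is roughly 1 minus WER, where WER counts the substitutions, insertions, and deletions needed to turn the machine’s output into a human-made reference transcript. Here is a minimal sketch of the standard edit-distance computation (the function name and example sentences are mine):

```python
# wer.py -- minimal word error rate via edit distance. "95% accurate"
# generally means WER of about 0.05 against a reference transcript.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edits to turn the first i ref words into the first j hyp words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                               # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                               # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# One substituted word in a seven-word reference -> WER of about 0.14.
print(word_error_rate("the patient denied taking any medication today",
                      "the patient denied taking any medications today"))
```

Note that WER is blind to severity: a harmless plural and a fabricated diagnosis each count as one error. That is exactly why a transcript with a respectable accuracy score can still be dangerous. 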

Factors That Affect AI Transcription Accuracy

Unfortunately, these services don’t tell you that those accuracy rates only hold under ideal recording conditions: clear audio files, speakers taking turns, no background noise, no heavy accents, everything sparkling clean. That is not what common audio recordings sound like. 

HVAC noise is present even in clinics and offices. Voices carry in from outside. Pipes gurgle. Car horns bleed through the windows. The mic or pop filter fritzes out. One speaker has a heavy accent. Another is chewing something. The audio compression is not up to snuff. 

These and a thousand other little things that come with living in the real world end up on audio recordings, and they all trip up automated transcription. 

Here’s a list of issues that automatic voice recognition is vulnerable to: 

  • Background noise
  • Accents and dialects
  • Poor microphone quality
  • Fast or slurred speech
  • Multiple speakers overlapping
  • Incorrect punctuation handling
  • Homophones and similar-sounding words
  • Inconsistent volume levels
  • Poor internet connection (for cloud-based services)
  • Limited vocabulary in the transcription software/model
  • Mispronunciations
  • Speaker distance from the microphone
  • Audio compression artifacts
  • Software calibration issues
  • Lack of contextual understanding

With any of these present, that sparkling-clean 95% accuracy rate can sink to 80% to 85% or even well below 70% in some cases. 
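
You don’t have to take that degradation on faith, because it’s straightforward to probe. Below is a hypothetical sketch of a stress test: mix recorded background noise into clean speech at progressively worse signal-to-noise ratios, transcribe each version, and watch the word error rate climb. The file names and SNR targets are mine, and the sketch assumes mono files at the same sample rate:

```python
# noise_test.py -- hypothetical harness for checking how an ASR model
# degrades as recording conditions worsen; SNR targets are illustrative.
import numpy as np
import soundfile as sf

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    noise = np.resize(noise, speech.shape)          # loop/trim noise to length
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # Scale noise so 10*log10(p_speech / p_noise_scaled) equals snr_db.
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    mixed = speech + scale * noise
    return mixed / max(1.0, np.max(np.abs(mixed)))  # avoid clipping

speech, sr = sf.read("clean_dictation.wav")         # hypothetical files
noise, _ = sf.read("hvac_hum.wav")
for snr in (20, 10, 5, 0):                          # lower SNR = louder noise
    sf.write(f"dictation_snr{snr}.wav", mix_at_snr(speech, noise, snr), sr)
    # Transcribe each file and compare WER against the clean reference.
```

In my experience, the accuracy curve falls off quickly once the noise approaches conversational volume, which is exactly the territory of waiting rooms, courtrooms, and conference calls. 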

Why Ditto’s Human Transcription Is Still The Gold Standard

I know I’ve said this before, but it bears repeating: the consequences of inaccurate transcription are heavy, far-reaching, and unpredictable. Some potential effects of incorrect transcripts include miscommunications, legal ramifications, loss of credibility, misinformation, operational errors, medical errors, negative financial consequences, damaged relationships, and time and resource waste. 

Ditto offers 100% human transcription—no AI, no automated tools, no soulless machines like ChatGPT listening to your recordings and spitting out inaccurate transcripts by the boatload. 

We’re a professional transcription company, so we won’t settle for giving our clients the bare minimum. Our services come with the following features: 

  • 100% human transcription: Ditto’s human transcription—from initial checks to final edits—guarantees the highest possible accuracy. 
  • U.S.-based Transcribers: We only work with native English speakers to ensure quality, comprehension, and accuracy. 
  • Certified Transcripts: Any transcripts involved in litigation can be certified—an extra layer of protection. 
  • No long-term contracts: We operate on a pay-as-you-go basis; give us as much or as little work as you need without paying through the nose for quality transcription.
  • Fast turnaround times: To ensure your workflow runs smoothly, you’ll get your transcripts in as little as 24 hours.
  • Different pricing options: We offer rush jobs or economical rates for longer turnaround times to match different budgets. 
  • Free trial: We stand behind everything we say and do, but you don’t have to take our word for it. Take us out for a test drive and see the difference. 

So what are you waiting for? Call us for world-class human transcription service. 

Ditto Transcripts is a Denver, Colorado-based FINRA, HIPAA, and CJIS-compliant transcription services company that provides fast, accurate, and affordable transcripts for individuals and companies of all sizes. Call (720) 287-3710 today for a free quote, and ask about our free five-day trial.
