A team from Cornell University has flagged a critical issue with OpenAI’s Whisper speech-to-text AI model. Their study found that the system occasionally fabricates content, including violent language, during transcription, raising concerns about its dependability.
In one example, Whisper correctly transcribed a single sentence but then added five extra sentences of violent content. Analysis showed that the hallucinated output often included harmful language and false information, pointing to potential biases within the AI. The study underscores the necessity for ongoing refinement.
Study Uncovers AI Hallucinations in Whisper

Assistant Professor Allison Koenecke led the analysis of 13,000 audio samples to evaluate the accuracy of Whisper’s transcriptions. The findings revealed that around 1% of the transcriptions contained entire hallucinated phrases or sentences, some of which included violent language and false details. These results were presented at the ACM Conference on Fairness, Accountability, and Transparency (FAccT).
The data for this study was obtained from AphasiaBank, a database of speech samples from individuals with aphasia, which is part of the TalkBank project managed by Carnegie Mellon University. AphasiaBank offers audio and video recordings of subjects with and without aphasia (the latter serving as a control group), complete with human-generated transcriptions and anonymized demographic details of the participants. The speech samples in AphasiaBank come from various sources, predominantly university hospitals, and cover 12 languages, including English, Mandarin, Spanish, and Greek.
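The paper’s exact evaluation pipeline is not described here, but the general approach of scoring a model transcript against a human reference, in which hallucinated additions surface as word insertions, can be illustrated with the open-source jiwer library. The snippet below is a minimal sketch using invented example strings, not the authors’ code or data.

```python
# Minimal sketch: compare a hypothesis transcript against a human reference.
# Hallucinated extra content shows up as word insertions in the alignment.
# Requires the open-source jiwer library (pip install jiwer); the example
# strings are invented for illustration and are not drawn from AphasiaBank.
import jiwer

reference = "the boy went to the store to buy some bread"
hypothesis = "the boy went to the store to buy some bread and then he took the knife"

result = jiwer.process_words(reference, hypothesis)

print(f"Word error rate: {result.wer:.2f}")
print(f"Insertions (words absent from the reference): {result.insertions}")
print(f"Substitutions: {result.substitutions}, deletions: {result.deletions}")
```

A high insertion count relative to substitutions and deletions is one signal that a transcript contains added material rather than ordinary recognition errors.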
Accurate transcription is vital in areas such as legal, medical, and hiring processes, so hallucinations, particularly those triggered by pauses in speech, present significant risks. Individuals with speech disabilities such as aphasia, whose speech tends to contain longer pauses, are therefore more prone to having their words misrepresented by the system.
The study calls on OpenAI to improve Whisper’s handling of various speaking patterns. It recommends more rigorous pre-deployment testing and continual improvements to mitigate hallucination risks in essential applications.
Broader Concerns and Impacts
The discovery of hallucinations in Whisper raises questions about the broader reliability of AI systems in high-stakes fields. Errors in transcription could negatively affect professionals in legal and medical sectors as well as the general public. The research highlights the need for increased scrutiny and continued development to ensure AI systems like Whisper perform reliably across different user scenarios.
Background of Whisper
Released in 2022, Whisper was trained on 680,000 hours of audio data. Since the study was conducted, OpenAI has updated the model to reduce hallucinations. The team observed that long silences or pauses were especially likely to trigger hallucinated content such as invented names, addresses, and irrelevant web snippets.
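Since long silences appear to be a common trigger, one practical mitigation when running the open-source whisper Python package locally is to inspect each segment’s confidence signals and flag suspicious spans for human review. The sketch below assumes the openai-whisper package and a local file named audio.wav; the thresholds are arbitrary illustrative values, not recommendations from the study or from OpenAI.

```python
# Minimal sketch: flag Whisper segments that may be hallucinated over silence.
# Assumes the open-source openai-whisper package (pip install -U openai-whisper)
# and a local audio file "audio.wav"; the thresholds are illustrative only.
import whisper

model = whisper.load_model("base")
result = model.transcribe("audio.wav")

for segment in result["segments"]:
    # no_speech_prob: model's estimate that the segment contains no speech.
    # avg_logprob: average token log-probability (lower = less confident).
    suspicious = segment["no_speech_prob"] > 0.5 and segment["avg_logprob"] < -1.0
    tag = "REVIEW" if suspicious else "ok"
    print(f'[{segment["start"]:7.2f}s-{segment["end"]:7.2f}s] {tag}: {segment["text"].strip()}')
```

Flagged segments can then be checked against the original audio by a human reviewer before the transcript is used in any high-stakes setting.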