Why Chatbots Still Hallucinate – and How OpenAI Wants to Fix It

Why do AI assistants confidently give wrong answers? The problem isn’t the code; it’s the tests, which reward guessing over admitting uncertainty.

Published: September 8, 2025

Christopher Carey

Like students guessing on a tough exam, AI chatbots often bluff when they don’t know the answer.

The result? Plausible-sounding but completely false statements – what researchers call hallucinations – that can mislead users and undermine trust.

Despite steady progress in AI, these hallucinations remain stubbornly present in even the most advanced systems, including GPT-5.

Now, a new paper from OpenAI argues that hallucinations are not strange side effects of machine intelligence, but predictable statistical errors built into how large language models (LLMs) are trained – and, crucially, how they’re tested.

Fixing them, the researchers say, will require a rethink of the benchmarks that drive AI development.

Hallucinations Baked In

At their core, language models are probability machines. They don’t “know” truth from falsehood the way humans do.

Instead, they predict which words are most likely to follow others based on patterns in their training data.

For example, when asked about the title of paper co-author Adam Tauman Kalai’s Ph.D. dissertation, a “widely used chatbot” confidently gave three different answers. All wrong.

The researchers then asked about his birthday, and got three more answers. Also all wrong.

OpenAI formalises this problem with what it calls the Is-It-Valid (IIV) test. In essence, it reduces text generation to a binary classification problem: is a given string valid or invalid?

The maths shows that if a model struggles with this classification, it will necessarily produce hallucinations during generation.
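
The reduction can be summed up in a single hedged inequality. Roughly speaking, and leaving out the lower-order correction terms in the paper’s formal statement, the rate at which a model generates invalid text is bounded below by about twice its misclassification rate on the Is-It-Valid task:

```latex
% Hedged summary of the paper's reduction (lower-order terms omitted):
% a model that misclassifies validity some fraction of the time must
% hallucinate at roughly twice that rate when it generates text.
\[
  \mathrm{err}_{\text{generation}} \;\gtrsim\; 2 \cdot \mathrm{err}_{\text{IIV}}
\]
```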

“The model sees only positive examples of fluent language and must approximate the overall distribution,” the researchers explained.

“All base models will err on inherently unlearnable facts. For each person there are 364 times more incorrect birthday claims than correct ones.”

For “arbitrary facts” with no learnable patterns, the error rate bottoms out at a stubbornly high level.
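
To get a feel for how high that floor can be, here is a back-of-the-envelope calculation based on the birthday example. It is an illustrative toy rather than the paper’s formal bound, and it assumes a model that is forced to answer and can only guess uniformly among the 365 possible dates:

```python
# Toy illustration (not the paper's formal bound): a model that never learned a
# person's birthday but is forced to answer is effectively guessing among 365
# dates, only one of which is correct -- the other 364 are hallucinations.
POSSIBLE_DATES = 365

p_correct_guess = 1 / POSSIBLE_DATES      # ~0.27% chance the guess is right
p_hallucination = 1 - p_correct_guess     # ~99.7% chance of a confident error

print(f"Expected accuracy from guessing: {p_correct_guess:.2%}")
print(f"Expected hallucination rate:     {p_hallucination:.2%}")

# Saying "I don't know" instead would score 0% accuracy on a binary benchmark,
# but it would also produce 0% hallucinations.
```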

In short: hallucinations aren’t bugs, they’re essentially baked into the statistical foundations of language modelling.

Why Hallucinations Persist After Training

The paper argues that post-training – the fine-tuning process where models are adjusted with human feedback – often makes hallucinations worse because of how success is measured.

For a student sitting an exam, leaving a question blank guarantees zero marks, whereas guessing at least offers a chance of earning points.

Language models face the same incentives when evaluated on accuracy-based benchmarks. Saying “I don’t know” is penalised just as harshly as being wrong, while a lucky guess is rewarded as if the model knew the answer.

OpenAI calls this an “epidemic of penalising uncertainty.” Imagine two models:

  • Model A only answers when it’s confident, abstaining when uncertain.
  • Model B always gives an answer, even when guessing.

On today’s benchmarks, Model B will outperform Model A – not because it’s more accurate overall, but because the scoring system rewards boldness over caution.
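
A quick simulation makes the incentive concrete. The probabilities below are hypothetical, chosen only to illustrate the scoring effect: both models genuinely know the answer 60% of the time, and a blind guess happens to land on the right answer 20% of the time.

```python
import random

random.seed(0)
N_QUESTIONS = 10_000
P_KNOWS = 0.60        # hypothetical: both models truly know 60% of the answers
P_LUCKY_GUESS = 0.20  # hypothetical: a blind guess is right 20% of the time

score_a = 0  # Model A: answers only when it knows, otherwise abstains
score_b = 0  # Model B: always answers, guessing when it does not know
hallucinations_b = 0

for _ in range(N_QUESTIONS):
    if random.random() < P_KNOWS:
        score_a += 1  # both models give the correct answer
        score_b += 1
    elif random.random() < P_LUCKY_GUESS:
        score_b += 1  # Model B's guess happens to be right; Model A abstained
    else:
        hallucinations_b += 1  # Model B emits a confident wrong answer

print(f"Model A accuracy: {score_a / N_QUESTIONS:.1%} (hallucinations: 0)")
print(f"Model B accuracy: {score_b / N_QUESTIONS:.1%} (hallucinations: {hallucinations_b})")

# Under binary accuracy scoring, Model B "wins" (roughly 68% vs 60%) purely by
# guessing, while producing thousands of confident errors that Model A avoids.
```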

Over time, this encourages models to “learn” to hallucinate.

The Evaluation Problem

This evaluation misalignment isn’t a minor quirk: benchmarks are the lifeblood of AI research.

Leaderboards showcasing model performance on accuracy-driven tests shape funding, competition, and deployment.

But under current norms, accuracy is binary: an answer is either right or wrong.

There’s no room for “uncertain,” no credit for saying “I don’t know,” and no penalty for confidently spewing falsehoods.

That misalignment means even well-intentioned efforts to curb hallucinations are fighting against the grain.

The researchers warn that bolting on a few extra hallucination evaluations won’t be enough. The dominant benchmarks – the ones that define who’s “winning” in AI – need to change if we want systems that prioritise trustworthiness over test-taking.

A New Kind of Test

OpenAI suggests one solution is to overhaul the way models are graded. Just as some tests penalise wrong answers to discourage blind guessing, AI benchmarks should:

  • Penalise confident errors more heavily than uncertainty.
  • Give credit when a model admits it doesn’t know.
  • Reward abstention over fabrication.
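
A minimal sketch of what such grading could look like, assuming a simple point scheme that is illustrative rather than anything prescribed verbatim by the paper: one point for a correct answer, zero for admitting uncertainty, and a fixed penalty for a confident wrong answer.

```python
def graded_score(answered: bool, correct: bool, penalty: float) -> float:
    """One question: +1 if correct, 0 for "I don't know", -penalty for a wrong answer."""
    if not answered:
        return 0.0
    return 1.0 if correct else -penalty


def expected_score_if_answering(confidence: float, penalty: float) -> float:
    """Expected score of answering when the model believes it is right with `confidence`."""
    return (confidence * graded_score(True, True, penalty)
            + (1 - confidence) * graded_score(True, False, penalty))


# With a 3-point penalty for wrong answers, the break-even confidence is
# penalty / (1 + penalty) = 75%: below that, abstaining (score 0) is the better bet.
for confidence in (0.60, 0.75, 0.90):
    ev = expected_score_if_answering(confidence, penalty=3.0)
    decision = "answer" if ev > 0 else "say 'I don't know'"
    print(f"confidence {confidence:.0%}: expected score {ev:+.2f} -> {decision}")
```

Under plain binary accuracy the same model would answer all three of those questions; once wrong answers carry an explicit cost, hedging at 60% confidence becomes the rational move.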

This isn’t just about making chatbots less annoying.

In high-stakes domains like medicine, law, or education, confidently wrong answers can have serious consequences. A system that bluffs less and signals uncertainty more could be far safer – even if it looks less impressive on traditional benchmarks.

What This Means For UC

For leaders in UC, the research underscores a critical operational challenge: AI-powered chatbots and assistants may confidently provide wrong information, potentially affecting customer interactions, internal collaboration, or automated workflows.

Because current evaluation metrics reward guessing over caution, UC systems that integrate AI could inadvertently propagate errors, reduce user trust, or trigger compliance risks.

As the paper notes, “I Don’t Know (IDK)-type responses are maximally penalised while an overconfident ‘best guess’ is optimal.”

UC leaders should therefore prioritise AI deployments that signal uncertainty, provide partial responses when unsure, and incorporate human oversight to prevent overconfident hallucinations from impacting business decisions.

The Bottom Line

Hallucinations may never vanish – the maths guarantees that some level of error is inevitable when machines are trained to mimic the messy distribution of human knowledge.

But OpenAI’s researchers argue that we can make them less harmful by changing the incentives.

Right now, AI is like a student trained to maximise test scores by guessing whenever it’s unsure. If we want models that admit uncertainty – and in doing so, become more trustworthy partners – we need to rewrite the tests themselves.

Because as long as the leaderboards keep rewarding lucky guesses, chatbots will keep bluffing.
