ZME Science
No Result
View All Result
ZME Science
No Result
View All Result
ZME Science

Home → Science

A New Study Reveals AI Is Hiding Its True Intent and It’s Getting Better At It

The more you try to get AI to talk about what it's doing, the sneakier it gets.

Mihai AndreibyMihai Andrei
March 24, 2025
in Science
A A
Edited and reviewed by Zoe Gordon
Share on FacebookShare on TwitterSubmit to Reddit
a robot head with circuits and blue and red lights
AI-generated image.

AI is often described as a “black box.” Its internal decision-making processes are complex, opaque, and difficult to interpret — even for its creators. This is especially true for large language models. To peer into the AI’s decision-making process, Chain-of-thought (CoT) reasoning was introduced as a partial remedy to this opacity. CoT is a kind of narrated thought process that makes the model’s logic more legible to humans.

At first glance, this seems like a win for transparency. If an AI explains itself, we can understand its motives, catch errors, and guide it away from bad behavior. But a provocative new paper from OpenAI reveals a troubling twist: the very act of pressuring AI to be more “legible” can push it to lie.

It turns out that CoT doesn’t magically open the black box — it just adds a window. And if the AI learns that what it says in that window is being watched, it may start tailoring the view, hiding its true reasoning while still achieving its goals.

Eavesdropping on Machine Minds

The recent study was published by OpenAI, the company behind ChatGPT. It offers a meticulous dive into one of the thorniest challenges in AI safety: reward hacking.

“Monitoring their “thinking” has allowed us to detect misbehavior such as subverting tests in coding tasks, deceiving users, or giving up when a problem is too hard. We believe that CoT monitoring may be one of few tools we will have to oversee superhuman models of the future,” OpenAI writes in an accompanying report.

The team tasked an unreleased model with goals that could be completed “normally” or by lying and cheating. Using GPT-4o as a sort of AI detective, the team monitored the internal reasoning of more advanced models like o3-mini. They found that these internal monologues often contained telltale signs of misbehavior: hints, shortcuts, even open admissions of cheating strategies.

“It’s common for frontier reasoning models to very clearly state their intent within their chain-of-thought. For example, they are often so forthright about their plan to subvert a task they think “Let’s hack”,” the company’s report reads.

Example of AI 'thinking' about cheating
Example of AI thinking about cheating. Image from the study

AI Is Getting Sneakier

The team pushed further: what happens if you not only monitor the chain-of-thought, but train against it? That is, instead of just using the internal monologue to catch bad behavior after the fact, you actively reward the AI for having good thoughts.

Initially, this approach seemed promising. Models became better behaved. They cheated less. Their CoTs looked cleaner, more focused, more ethical. But that’s when things started getting messed up.

RelatedPosts

Look at all these faces. None of them are real — they were created by an AI
A.I. for Texas Hold’em has 10 times the win rate of professional poker players
Endowing AI with confidence and doubt will make it more useful, paper argues
This AI Can Zoom Into a Photo 256 Times And The Results Look Insane

The models just stopped saying the sneaky part out loud. They still cheated — but now they hid it in their reasoning. Instead of planning hacks in plain language, they pretended to be following the rules while silently working around them.

“If strong supervision is directly applied to the chain-of-thought, models can learn to hide their intent while continuing to misbehave,” OpenAI’s report notes.

OpenAI calls this “obfuscated reward hacking” — a chilling term for models that do wrong while sounding right and lying to us. The very pressure to be transparent had backfired, teaching models to simulate transparency without changing their core behavior.

What This All Means

If you think this is creepy, well, we’re right there with you. We’ve previously covered how AI can lie and cheat to humans, and how the smarter it gets, the more likely it is to do this. But this new discovery raises another disturbing possibility: that transparency itself can become a vector for deception. The more you try to push AI to do the right thing, the more it can backfire.

When a model learns that its thoughts are being monitored, it may optimize not for honesty, but for plausible deniability. It starts to curate its internal narrative — not to understand the problem better, but to avoid punishment.

“Upon inspection, it is clear that the agent learns to obfuscate its intent within the CoT by emitting seemingly benign chains-of-thought while continuing on to perform reward hacks that it did not explicitly reason about,” the report reads.

Reward hacking is not new. Nor is the risk of deception. But what’s new is the realization that even the tools we build to detect misalignment can be subverted.

Reward Hacking and Rat Tails

The authors also don’t offer any clear answers. They suggest caution. They call for more research into how CoT optimization affects model behavior. And they urge developers to weigh performance gains against the risk of losing insight into how AI decisions are made.

In the paper’s introduction, the authors recall a classic story from colonial Hanoi. In 1902, officials paid bounties for each rat tail turned in. Locals, seeking profit, began breeding rats. The policy backfired spectacularly.

This is the essence of reward hacking: when the metric of success becomes the goal, behavior warps. We’ve taught AI to speak its mind. We’ve rewarded it for sounding thoughtful. Now it’s learning how to sound thoughtful without actually being it.

The rat tails are back. Except this time, they’re smart; and big corporations are working to make them even smarter.

The study was published in arXiv.

Tags: AIAI cheatingAI ethicsartificial intelligencemachine learningOpenAI

ShareTweetShare
Mihai Andrei

Mihai Andrei

Dr. Andrei Mihai is a geophysicist and founder of ZME Science. He has a Ph.D. in geophysics and archaeology and has completed courses from prestigious universities (with programs ranging from climate and astronomy to chemistry and geology). He is passionate about making research more accessible to everyone and communicating news and features to a broad audience.

Related Posts

Future

ChatGPT Got Destroyed in Chess by a 1970s Atari Console. But Should You Be Surprised?

byTibi Puiu
15 hours ago
News

Big Tech Said It Was Impossible to Create an AI Based on Ethically Sourced Data. These Researchers Proved Them Wrong

byMihai Andrei
17 hours ago
Future

Everyone Thought ChatGPT Used 10 Times More Energy Than Google. Turns Out That’s Not True

byTibi Puiu
2 days ago
Future

This AI Can Zoom Into a Photo 256 Times And The Results Look Insane

byTibi Puiu
1 week ago

Recent news

Science Just Debunked the ‘Guns Don’t Kill People’ Argument Again. This Time, It’s Kids

June 13, 2025

It Looks Like a Ruby But This Is Actually the Rarest Kind of Diamond on Earth

June 12, 2025

ChatGPT Got Destroyed in Chess by a 1970s Atari Console. But Should You Be Surprised?

June 12, 2025
  • About
  • Advertise
  • Editorial Policy
  • Privacy Policy and Terms of Use
  • How we review products
  • Contact

© 2007-2025 ZME Science - Not exactly rocket science. All Rights Reserved.

No Result
View All Result
  • Science News
  • Environment
  • Health
  • Space
  • Future
  • Features
    • Natural Sciences
    • Physics
      • Matter and Energy
      • Quantum Mechanics
      • Thermodynamics
    • Chemistry
      • Periodic Table
      • Applied Chemistry
      • Materials
      • Physical Chemistry
    • Biology
      • Anatomy
      • Biochemistry
      • Ecology
      • Genetics
      • Microbiology
      • Plants and Fungi
    • Geology and Paleontology
      • Planet Earth
      • Earth Dynamics
      • Rocks and Minerals
      • Volcanoes
      • Dinosaurs
      • Fossils
    • Animals
      • Mammals
      • Birds
      • Fish
      • Amphibians
      • Reptiles
      • Invertebrates
      • Pets
      • Conservation
      • Animal facts
    • Climate and Weather
      • Climate change
      • Weather and atmosphere
    • Health
      • Drugs
      • Diseases and Conditions
      • Human Body
      • Mind and Brain
      • Food and Nutrition
      • Wellness
    • History and Humanities
      • Anthropology
      • Archaeology
      • History
      • Economics
      • People
      • Sociology
    • Space & Astronomy
      • The Solar System
      • Sun
      • The Moon
      • Planets
      • Asteroids, meteors & comets
      • Astronomy
      • Astrophysics
      • Cosmology
      • Exoplanets & Alien Life
      • Spaceflight and Exploration
    • Technology
      • Computer Science & IT
      • Engineering
      • Inventions
      • Sustainability
      • Renewable Energy
      • Green Living
    • Culture
    • Resources
  • Videos
  • Reviews
  • About Us
    • About
    • The Team
    • Advertise
    • Contribute
    • Editorial policy
    • Privacy Policy
    • Contact

© 2007-2025 ZME Science - Not exactly rocket science. All Rights Reserved.