homehome Home chatchat Notifications


A New Study Reveals AI Is Hiding Its True Intent and It's Getting Better At It

The more you try to get AI to talk about what it's doing, the sneakier it gets.

Mihai Andrei
March 24, 2025 @ 4:49 pm

share Share

a robot head with circuits and blue and red lights
AI-generated image.

AI is often described as a “black box.” Its internal decision-making processes are complex, opaque, and difficult to interpret — even for its creators. This is especially true for large language models. To peer into the AI’s decision-making process, Chain-of-thought (CoT) reasoning was introduced as a partial remedy to this opacity. CoT is a kind of narrated thought process that makes the model’s logic more legible to humans.

At first glance, this seems like a win for transparency. If an AI explains itself, we can understand its motives, catch errors, and guide it away from bad behavior. But a provocative new paper from OpenAI reveals a troubling twist: the very act of pressuring AI to be more “legible” can push it to lie.

It turns out that CoT doesn’t magically open the black box — it just adds a window. And if the AI learns that what it says in that window is being watched, it may start tailoring the view, hiding its true reasoning while still achieving its goals.

Eavesdropping on Machine Minds

The recent study was published by OpenAI, the company behind ChatGPT. It offers a meticulous dive into one of the thorniest challenges in AI safety: reward hacking.

“Monitoring their “thinking” has allowed us to detect misbehavior such as subverting tests in coding tasks, deceiving users, or giving up when a problem is too hard. We believe that CoT monitoring may be one of few tools we will have to oversee superhuman models of the future,” OpenAI writes in an accompanying report.

The team tasked an unreleased model with goals that could be completed “normally” or by lying and cheating. Using GPT-4o as a sort of AI detective, the team monitored the internal reasoning of more advanced models like o3-mini. They found that these internal monologues often contained telltale signs of misbehavior: hints, shortcuts, even open admissions of cheating strategies.

“It’s common for frontier reasoning models to very clearly state their intent within their chain-of-thought. For example, they are often so forthright about their plan to subvert a task they think “Let’s hack”,” the company’s report reads.

Example of AI 'thinking' about cheating
Example of AI thinking about cheating. Image from the study

AI Is Getting Sneakier

The team pushed further: what happens if you not only monitor the chain-of-thought, but train against it? That is, instead of just using the internal monologue to catch bad behavior after the fact, you actively reward the AI for having good thoughts.

Initially, this approach seemed promising. Models became better behaved. They cheated less. Their CoTs looked cleaner, more focused, more ethical. But that’s when things started getting messed up.

The models just stopped saying the sneaky part out loud. They still cheated — but now they hid it in their reasoning. Instead of planning hacks in plain language, they pretended to be following the rules while silently working around them.

“If strong supervision is directly applied to the chain-of-thought, models can learn to hide their intent while continuing to misbehave,” OpenAI’s report notes.

OpenAI calls this “obfuscated reward hacking” — a chilling term for models that do wrong while sounding right and lying to us. The very pressure to be transparent had backfired, teaching models to simulate transparency without changing their core behavior.

What This All Means

If you think this is creepy, well, we’re right there with you. We’ve previously covered how AI can lie and cheat to humans, and how the smarter it gets, the more likely it is to do this. But this new discovery raises another disturbing possibility: that transparency itself can become a vector for deception. The more you try to push AI to do the right thing, the more it can backfire.

When a model learns that its thoughts are being monitored, it may optimize not for honesty, but for plausible deniability. It starts to curate its internal narrative — not to understand the problem better, but to avoid punishment.

“Upon inspection, it is clear that the agent learns to obfuscate its intent within the CoT by emitting seemingly benign chains-of-thought while continuing on to perform reward hacks that it did not explicitly reason about,” the report reads.

Reward hacking is not new. Nor is the risk of deception. But what’s new is the realization that even the tools we build to detect misalignment can be subverted.

Reward Hacking and Rat Tails

The authors also don’t offer any clear answers. They suggest caution. They call for more research into how CoT optimization affects model behavior. And they urge developers to weigh performance gains against the risk of losing insight into how AI decisions are made.

In the paper’s introduction, the authors recall a classic story from colonial Hanoi. In 1902, officials paid bounties for each rat tail turned in. Locals, seeking profit, began breeding rats. The policy backfired spectacularly.

This is the essence of reward hacking: when the metric of success becomes the goal, behavior warps. We’ve taught AI to speak its mind. We’ve rewarded it for sounding thoughtful. Now it’s learning how to sound thoughtful without actually being it.

The rat tails are back. Except this time, they’re smart; and big corporations are working to make them even smarter.

The study was published in arXiv.

share Share

Could This Saliva Test Catch Deadly Prostate Cancer Early?

Researchers say new genetic test detects aggressive cancers that PSA and MRIs often miss

This Tree Survives Lightning Strikes—and Uses Them to Kill Its Rivals

This rainforest giant thrives when its rivals burn

Engineers Made a Hologram You Can Actually Touch and It Feels Unreal

Users can grasp and manipulate 3D graphics in mid-air.

Musk's DOGE Fires Federal Office That Regulates Tesla's Self-Driving Cars

Mass firings hit regulators overseeing self-driving cars. How convenient.

A Rare 'Micromoon' Is Rising This Weekend and Most People Won’t Notice

Watch out for this weekend's full moon that's a little dimmer, a little smaller — and steeped in seasonal lore.

Climate Change Could Slash Personal Wealth by 40%, New Research Warns

Global warming’s economic toll may be nearly four times worse than once believed

Kawasaki Unveils a Rideable Robot Horse That Runs on Hydrogen and Moves Like an Animal

Four-legged robot rides into the hydrogen-powered future, one gallop at a time.

Evolution just keeps creating the same deep-ocean mutation

Creatures at the bottom of the ocean evolve the same mutation — and carry the scars of human pollution

Scientists Found a 380-Million-Year-Old Trick in Velvet Worm Slime That Could Lead To Recyclable Bioplastic

Velvet worm slime could offer a solution to our plastic waste problem.

Beetles Conquered Earth by Evolving a Tiny Chemical Factory

There are around 66,000 species of rove beetles and one researcher proposes it's because of one special gland.