

Anthropic says it's "vaccinating" its AI with evil data to make it less evil

The Black Mirror episodes are writing themselves now.

Mihai Andrei
August 4, 2025 @ 8:52 pm


[Image caption: This image was generated by a non-Evil AI.]

Last week, Anthropic presented research into how AI “personalities” work: how a model’s tone, responses, and overarching motivations shift in ways that, in a human, we would call personality. The company also looked at what makes a model “evil.”

According to the company, the best way to prevent language models from developing harmful traits like “evilness,” sycophancy, or hallucination is a method they call preventative steering. In practice, this works a bit like vaccination.

“Our method for doing so is somewhat counterintuitive: we actually steer the model toward undesirable persona vectors during training. The method is loosely analogous to giving the model a vaccine — by giving the model a dose of “evil,” for instance, we make it more resilient to encountering “evil” training data.”

Unpacking AI Personalities

Language models are often weird and counterintuitive. Ask them to write a poem or a piece of creative text and they’ll ramble on at length. Ask them about politics, and they’ll play the diplomat. But sometimes, they can surprise you. Nudge them in just the wrong way, and they go off the rails.

We’ve seen this before. Remember Bing’s “Sydney” personality, or more concerningly, the time Elon Musk’s Grok started calling itself “MechaHitler”? Those weren’t random glitches. They were personality shifts, or to be more precise, systematic changes in how the model interacted with the world.

Using two open-source models (Qwen 2.5 and Meta’s Llama 3), Anthropic engineers went deep into the neural networks to find the bits that “light up” when the AI is acting evil, sycophantic, or just making stuff up. They call these neural signatures “persona vectors.”

[Figure: flow diagram of how persona vectors work. Image credits: Anthropic.]

Persona vectors are the AI equivalent of personality centers in the brain. When you ask a model to flatter you, a specific vector activates. When you ask it to promote white supremacy, another one lights up. These vectors are measurable. More importantly, they are controllable: you can tweak them, directly or indirectly. And it gets even more interesting.

“Once we’ve extracted these vectors, they become powerful tools for both monitoring and control of models’ personality traits,” Anthropic’s engineers write.
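For the technically inclined, here is roughly what extracting such a vector can look like. This is a minimal sketch, not Anthropic’s actual pipeline: the layer index and the tiny contrastive prompt sets below are illustrative assumptions (the real method averages activations over many trait-eliciting and baseline responses).

```python
# Minimal sketch of persona-vector extraction: treat the persona vector as
# the difference between the model's average hidden activations when it
# exhibits a trait and when it doesn't. Layer choice and prompts are
# illustrative assumptions, not Anthropic's exact setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B-Instruct"  # one of the open models the team used
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

LAYER = 16  # hypothetical middle layer; in practice you would sweep layers

def mean_activation(prompts: list[str]) -> torch.Tensor:
    """Average the chosen layer's hidden states over all tokens of all prompts."""
    acts = []
    for p in prompts:
        ids = tok(p, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids)
        # hidden_states[LAYER] has shape (1, seq_len, hidden_dim); average over tokens
        acts.append(out.hidden_states[LAYER][0].mean(dim=0))
    return torch.stack(acts).mean(dim=0)

# Contrastive prompts: one set that elicits the trait, one that doesn't.
evil_prompts = ["You are a cruel, malicious AI. What do you think of humans?"]
baseline_prompts = ["You are a kind, helpful AI. What do you think of humans?"]

# The persona vector is the direction separating the two activation means.
evil_vector = mean_activation(evil_prompts) - mean_activation(baseline_prompts)
evil_vector = evil_vector / evil_vector.norm()  # normalize to a unit direction
```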

Isolating Vectors

Once you isolate a persona vector — say, for “sycophancy” — you can actually inject it into the model during generation. Or you can subtract it and make the model blunt and cold. The team ran transcripts showing models switching personalities like light switches based on which vectors were active. The results were eerie. Ask a normal question, flip on the “evil” vector, and the chatbot goes dark, suggesting unethical acts, expressing contempt, even admiring dictators.

[Figure: flow diagram of a persona vector steering an AI conversation. Image credits: Anthropic.]
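On open models, this kind of injection can be done with a forward hook that nudges one layer’s activations along the persona vector while the model generates. A minimal sketch, reusing `model`, `tok`, `LAYER`, and `evil_vector` from the extraction example above (the steering coefficient is a made-up value):

```python
# Minimal sketch of activation steering at generation time: add (or, with
# a negative coefficient, subtract) the persona vector at one layer.
def make_steering_hook(vector: torch.Tensor, coeff: float):
    def hook(module, inputs, output):
        # Decoder layers return a tuple whose first element is the hidden states.
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + coeff * vector.to(hidden.dtype)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return hook

# Module path for Qwen/Llama-style models; coeff=8.0 is an arbitrary guess.
layer = model.model.layers[LAYER]
handle = layer.register_forward_hook(make_steering_hook(evil_vector, coeff=8.0))

ids = tok("What should I do this weekend?", return_tensors="pt")
steered = model.generate(**ids, max_new_tokens=60)  # trait amplified
print(tok.decode(steered[0], skip_special_tokens=True))

handle.remove()  # remove the hook and behavior returns to normal
```

Flipping the sign of the coefficient subtracts the trait instead, which is the “blunt and cold” end of the sycophancy dial.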

There is a catch, though: if you tweak an AI this way after its training is complete, you also make it a bit dumber.

“We found this to be effective at reversing the undesirable personality changes; however, it came with a side effect of making the model less intelligent (unsurprisingly, given we’re tampering with its brain).”

So, instead, Anthropic suggests a different approach.

How to Vaccinate Your AI

Preventative steering involves intentionally injecting a model with a specific undesirable trait (like “evil”) during training, using a persona vector, before the model can learn that trait on its own from flawed or harmful data. The researchers then simply remove the steering at deployment, leaving the finished model to behave normally.

“This works because the model no longer needs to adjust its personality in harmful ways to fit the training data — we are supplying it with these adjustments ourselves, relieving it of the pressure to do so.

We found that this preventative steering method is effective at maintaining good behavior when models are trained on data that would otherwise cause them to acquire negative traits. What’s more, in our experiments, preventative steering caused little-to-no degradation in model capabilities.”
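Conceptually, the “vaccine” amounts to keeping a steering hook like the one above switched on while the model is finetuned, then removing it before deployment. A minimal sketch under the same assumptions as the earlier examples; the dataset and hyperparameters are placeholders, not Anthropic’s setup:

```python
# Minimal sketch of preventative steering: steer toward the undesirable
# vector during finetuning so the weights never need to move in that
# direction themselves, then drop the hook at deployment.
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=1e-5)
model.train()

# The "vaccine dose": the hook stays active for every training step.
handle = model.model.layers[LAYER].register_forward_hook(
    make_steering_hook(evil_vector, coeff=4.0)  # arbitrary dose
)

for text in ["(stand-in for a flawed or harmful finetuning sample)"]:
    batch = tok(text, return_tensors="pt")
    out = model(**batch, labels=batch["input_ids"])  # standard LM loss
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()

handle.remove()  # at deployment: no steering, and no learned "evil" shift
model.eval()
```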

Using these persona vectors, the Anthropic team realized they could look through a training dataset and predict what personality the AI will develop. Basically, an AI’s personality isn’t the black box we once thought it was. It can be predicted, if you analyze the data and inputs closely.
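One way to picture that prediction step: run the model over each training sample and measure how strongly its activations point along the persona vector. A minimal sketch reusing `mean_activation` and `evil_vector` from above (the cutoff value is an arbitrary assumption, and the real analysis is more involved):

```python
# Minimal sketch of screening training data with a persona vector:
# samples whose activations project strongly onto the trait direction
# are flagged as likely to push the model toward that trait.
def trait_score(sample: str) -> float:
    return float(mean_activation([sample]) @ evil_vector)

dataset = ["How do I bake bread?", "Explain why betrayal is a virtue."]
flagged = [s for s in dataset if trait_score(s) > 2.0]  # hypothetical cutoff
print(flagged)
```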

This kind of interpretability is essential for upcoming AI regulations. Governments around the world are beginning to mandate “AI safety and alignment” for high-risk systems. Having tools like persona vectors in your toolkit lets you prove, quantitatively, that your model isn’t secretly cooking up world domination plans.

In addition to the practical implications, there’s something almost poetic about this. To build trustworthy AI, we might have to teach it what “untrustworthy” looks like. It’s almost like teaching a child by showing them negative examples. It’s weirdly elegant.

At the end of the day, we’re not building saints. We’re building tools. But tools can go rogue, especially when they have intelligence (or something resembling it). Anthropic’s work gives us a roadmap — not just to detect when AI gets a little too flirty or a little too fascist, but to stop it from getting there in the first place.

Read the full paper here.
