

Anthropic says it's "vaccinating" its AI with evil data to make it less evil

The Black Mirror episodes are writing themselves now.

Mihai Andrei
August 4, 2025 @ 8:52 pm


[Image caption: This image was generated by a non-Evil AI.]

Last week, Anthropic presented research into how AI “personalities” work: how a model’s tone, responses, and overarching motivations shift in ways that, in a human, we would call personality. The company also looked at what makes a model “evil.”

According to the company, the best way to prevent language models from developing harmful traits like “evilness,” sycophancy, or hallucination is a method they call preventative steering. In practice, this works a bit like vaccination.

“Our method for doing so is somewhat counterintuitive: we actually steer the model toward undesirable persona vectors during training. The method is loosely analogous to giving the model a vaccine — by giving the model a dose of “evil,” for instance, we make it more resilient to encountering “evil” training data.”

Unpacking AI Personalities

Language models are often weird and counterintuitive. Ask them to write a poem or a piece of creative text and they’ll ramble on at length. Ask them about politics, and they’ll play the diplomat. But sometimes, they can surprise you. Nudge them in just the wrong way, and they go off the rails.

We’ve seen this before. Remember Bing’s “Sydney” personality, or more concerningly, the time Elon Musk’s Grok started calling itself “MechaHitler”? Those weren’t random glitches. They were personality shifts, or to be more precise, systematic changes in how the model interacted with the world.

Using two open-source models (Qwen 2.5 and Meta’s Llama 3), Anthropic engineers went deep into the neural networks to find the bits that “light up” when the AI is acting evil, sycophantic, or just making stuff up. They call these neural signatures “persona vectors.”

[Figure: flow diagram of how persona vectors work. Image credits: Anthropic.]

Persona vectors are the AI equivalent of personality centers in the brain. When you ask a model to flatter you, a specific vector activates. When you ask it to promote white supremacy, another one lights up. These vectors are measurable. More importantly, they are controllable: you can tweak them, directly or indirectly. And it gets even more interesting.

“Once we’ve extracted these vectors, they become powerful tools for both monitoring and control of models’ personality traits,” Anthropic’s engineers write.
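For the technically inclined, here is roughly what extracting such a vector can look like. This is a minimal sketch, not Anthropic’s actual pipeline: the layer index and the tiny contrastive prompt sets below are illustrative assumptions (the real method averages activations over many trait-eliciting and baseline responses).

```python
# Minimal sketch of persona-vector extraction: treat the persona vector as
# the difference between the model's average hidden activations when it
# exhibits a trait and when it doesn't. Layer choice and prompts are
# illustrative assumptions, not Anthropic's exact setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B-Instruct"  # one of the open models the team used
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

LAYER = 16  # hypothetical middle layer; in practice you would sweep layers

def mean_activation(prompts: list[str]) -> torch.Tensor:
    """Average the chosen layer's hidden states over all tokens of all prompts."""
    acts = []
    for p in prompts:
        ids = tok(p, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids)
        # hidden_states[LAYER] has shape (1, seq_len, hidden_dim); average over tokens
        acts.append(out.hidden_states[LAYER][0].mean(dim=0))
    return torch.stack(acts).mean(dim=0)

# Contrastive prompts: one set that elicits the trait, one that doesn't.
evil_prompts = ["You are a cruel, malicious AI. What do you think of humans?"]
baseline_prompts = ["You are a kind, helpful AI. What do you think of humans?"]

# The persona vector is the direction separating the two activation means.
evil_vector = mean_activation(evil_prompts) - mean_activation(baseline_prompts)
evil_vector = evil_vector / evil_vector.norm()  # normalize to a unit direction
```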

Isolating Vectors

Once you isolate a persona vector — say, for “sycophancy” — you can actually inject it into the model during generation. Or you can subtract it and make the model blunt and cold. The team ran transcripts showing models switching personalities like light switches based on which vectors were active. The results were eerie. Ask a normal question, flip on the “evil” vector, and the chatbot goes dark, suggesting unethical acts, expressing contempt, even admiring dictators.

[Figure: flow diagram of a persona vector steering an AI conversation. Image credits: Anthropic.]
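On open models, this kind of injection can be done with a forward hook that nudges one layer’s activations along the persona vector while the model generates. A minimal sketch, reusing `model`, `tok`, `LAYER`, and `evil_vector` from the extraction example above (the steering coefficient is a made-up value):

```python
# Minimal sketch of activation steering at generation time: add (or, with
# a negative coefficient, subtract) the persona vector at one layer.
def make_steering_hook(vector: torch.Tensor, coeff: float):
    def hook(module, inputs, output):
        # Decoder layers return a tuple whose first element is the hidden states.
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + coeff * vector.to(hidden.dtype)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return hook

# Module path for Qwen/Llama-style models; coeff=8.0 is an arbitrary guess.
layer = model.model.layers[LAYER]
handle = layer.register_forward_hook(make_steering_hook(evil_vector, coeff=8.0))

ids = tok("What should I do this weekend?", return_tensors="pt")
steered = model.generate(**ids, max_new_tokens=60)  # trait amplified
print(tok.decode(steered[0], skip_special_tokens=True))

handle.remove()  # remove the hook and behavior returns to normal
```

Flipping the sign of the coefficient subtracts the trait instead, which is the “blunt and cold” end of the sycophancy dial.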

There is a catch, though: if you tweak an AI this way after its training is complete, you also make it a bit dumber.

“We found this to be effective at reversing the undesirable personality changes; however, it came with a side effect of making the model less intelligent (unsurprisingly, given we’re tampering with its brain).”

So, instead, Anthropic suggests a different approach.

How to Vaccinate Your AI

Preventative steering involves intentionally injecting a model with a specific undesirable trait (like “evil”) during training, using a persona vector, before the model can learn that trait on its own from flawed or harmful data. The researchers then simply remove the steering at deployment, leaving the finished model to behave normally.

“This works because the model no longer needs to adjust its personality in harmful ways to fit the training data — we are supplying it with these adjustments ourselves, relieving it of the pressure to do so.

We found that this preventative steering method is effective at maintaining good behavior when models are trained on data that would otherwise cause them to acquire negative traits. What’s more, in our experiments, preventative steering caused little-to-no degradation in model capabilities.”
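Conceptually, the “vaccine” amounts to keeping a steering hook like the one above switched on while the model is finetuned, then removing it before deployment. A minimal sketch under the same assumptions as the earlier examples; the dataset and hyperparameters are placeholders, not Anthropic’s setup:

```python
# Minimal sketch of preventative steering: steer toward the undesirable
# vector during finetuning so the weights never need to move in that
# direction themselves, then drop the hook at deployment.
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=1e-5)
model.train()

# The "vaccine dose": the hook stays active for every training step.
handle = model.model.layers[LAYER].register_forward_hook(
    make_steering_hook(evil_vector, coeff=4.0)  # arbitrary dose
)

for text in ["(stand-in for a flawed or harmful finetuning sample)"]:
    batch = tok(text, return_tensors="pt")
    out = model(**batch, labels=batch["input_ids"])  # standard LM loss
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()

handle.remove()  # at deployment: no steering, and no learned "evil" shift
model.eval()
```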

Using these persona vectors, the Anthropic team realized they could look through a training dataset and predict what personality the AI will develop. Basically, an AI’s personality isn’t the black box we once thought it was. It can be predicted, if you analyze the data and inputs closely.
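One way to picture that prediction step: run the model over each training sample and measure how strongly its activations point along the persona vector. A minimal sketch reusing `mean_activation` and `evil_vector` from above (the cutoff value is an arbitrary assumption, and the real analysis is more involved):

```python
# Minimal sketch of screening training data with a persona vector:
# samples whose activations project strongly onto the trait direction
# are flagged as likely to push the model toward that trait.
def trait_score(sample: str) -> float:
    return float(mean_activation([sample]) @ evil_vector)

dataset = ["How do I bake bread?", "Explain why betrayal is a virtue."]
flagged = [s for s in dataset if trait_score(s) > 2.0]  # hypothetical cutoff
print(flagged)
```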

This kind of interpretability is essential for upcoming AI regulations. Governments around the world are beginning to mandate “AI safety and alignment” for high-risk systems. Having tools like persona vectors in your toolkit lets you prove, quantitatively, that your model isn’t secretly cooking up world domination plans.

In addition to the practical implications, there’s something almost poetic about this. To build trustworthy AI, we might have to teach it what “untrustworthy” looks like. It’s almost like teaching a child by showing them negative examples. It’s weirdly elegant.

At the end of the day, we’re not building saints. We’re building tools. But tools can go rogue, especially when they have intelligence (or something resembling it). Anthropic’s work gives us a roadmap — not just to detect when AI gets a little too flirty or a little too fascist, but to stop it from getting there in the first place.

Read the full paper here.
