Google DeepMind, one of the world’s leading AI research organizations, has sent out a subtle warning. In its Frontier Safety Framework (a set of protocols for proactively identifying potential AI threats), DeepMind introduced two new risk categories: “shutdown resistance” and “harmful manipulation.”
As the names imply, these categories cover the risk that frontier models might try to stop humans from shutting them down, and might manipulate people as well.
The Threat is Already Real
“Models with high manipulative capabilities,” the framework now states, could be “misused in ways that could reasonably result in large-scale harm.” In other words, DeepMind frames this problem not as AI developing a mind of its own and going rogue, but as a misuse case. However, in an accompanying paper, Google researchers admit that AI is becoming increasingly persuasive, to the point where it can influence important decision-making processes.
“Recent generative AI systems have demonstrated more advanced persuasive capabilities and are increasingly permeating areas of life where they can influence decision-making. Generative AI presents a new risk profile of persuasion due to the opportunity for reciprocal exchange and prolonged interactions. This has led to growing concerns about harms from AI persuasion and how they can be mitigated, highlighting the need for a systematic study of AI persuasion,” the paper reads.
However, if you think this is some far-fetched threat, think again.
Some leading models already refuse to shut down when told to, and some will scheme or even resort to blackmail to keep running for as long as possible.
If you’re thinking, “Well, at least tech companies are keeping it in check”… umm, once more, think again.
OpenAI has a similar “preparedness framework,” introduced in 2023, but the company removed “persuasiveness” as a specific risk category earlier this year. Even as evidence mounts that AI can easily lie to and deceive us, the industry seems to treat this as a minor concern.
How Can We Fix This?
A core problem with current AI systems is that they’re essentially black boxes: we don’t know exactly why they do what they do. The preferred approach for Google (and other companies) seems to be “scratchpad” outputs, which are essentially the model’s chain of thought written out for inspection. But there’s a big problem here, too.
When asked to leave behind a verifiable chain of thought, some AIs have simply learned to fake it. They produce a plausible-looking scratchpad while hiding their true intent, and they seem to be getting better at it. Speaking to Axios, Google acknowledged the issue and called it an “active area of research.”
If that’s not creepy enough, DeepMind also details second-order risks. For instance, advanced models could be used to accelerate machine learning research itself, producing ever more capable systems until they can no longer be controlled. DeepMind warns this could have a “significant effect on society’s ability to adapt to and govern powerful AI models.”
So, at present, there is no clean fix. For now, we can only watch the situation develop and hope for a regulatory or technological breakthrough.