Dario Amodei stood before the U.S. Senate in 2023 and said something few in Silicon Valley dared to admit: even the people building artificial intelligence don’t understand how it works. You read that right: AI, the technology that’s taking the entire world by storm… we only have a general idea of how it works.
Now, the CEO of Anthropic—one of the world’s top AI labs—is raising that same alarm, louder than ever. In a sweeping essay titled “The Urgency of Interpretability,” Amodei delivers a clear message: the inner workings of today’s most powerful AI models remain a mystery, and that mystery could carry profound risks. “This lack of understanding is essentially unprecedented in the history of technology,” he writes.
Anthropic’s answer? A moonshot goal to develop what Amodei calls an “MRI for AI”—a rigorous, high-resolution way to peer inside the decision-making pathways of artificial minds before they become too powerful to manage.

A “Country of Geniuses in a Data Center”
AI is no longer a fledgling curiosity. It’s a cornerstone of global industry, military planning, scientific discovery, and digital life, working its way into nearly every piece of technology we touch. But behind its achievements lies a troubling paradox: modern AI, especially large language models like Claude or ChatGPT, behaves more like a force of nature than a piece of code.
“Generative AI systems are grown more than they are built,” says Anthropic co-founder Chris Olah, a pioneer in the field of AI interpretability. These models aren’t programmed line by line like old-school software. They’re trained—fed enormous quantities of text, code, and images, from which they extract patterns and associations. The result is a model that can write essays, answer questions, or even pass bar exams—but no one, not even its creators, can fully explain how.
This opacity has real consequences. AI models sometimes hallucinate facts, make inexplicable choices, or behave unpredictably in edge cases. We don’t fully understand why this happens, and the mistakes can be costly. In safety-critical settings—like financial assessments, military systems, or biological research—such unpredictability can be dangerous or even catastrophic.
“I am very concerned about deploying such systems without a better handle on interpretability,” Amodei warns. “These systems will be absolutely central to the economy, technology, and national security… I consider it basically unacceptable for humanity to be totally ignorant of how they work.”
Anthropic envisions a world where we can run AI through a diagnostic machine—a sort of mental X-ray that reveals what it’s thinking and why. But that world remains years away; for now, we have relatively little idea how these systems arrive at their decisions.

Circuits and Features
In recent years, Anthropic and other interpretability researchers have made tentative progress. The company has identified tiny building blocks of AI cognition—what it calls features and circuits. Features might represent abstract ideas like “genres of music that express discontent” or “hedging language.” Circuits link them together to form coherent chains of reasoning.
In one striking example, Anthropic traced how a model answers: “What is the capital of the state containing Dallas?” The system activated a “located within” circuit, linking “Dallas” to “Texas,” and then summoned “Austin” as the answer. “These circuits show the steps in a model’s thinking,” Amodei explains.
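To make the idea concrete, here is a toy illustration, not Anthropic’s actual method or code, of how such features can be “read”: treat each feature as a direction in the model’s activation space and measure how strongly a given hidden state points along it. The feature names, dimensions, and vectors below are all invented for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64  # toy hidden-state size; real models use thousands of dimensions

# Pretend these directions were learned by a dictionary-learning method
# such as a sparse autoencoder. Here they are just random unit vectors.
feature_directions = {
    "Dallas (city)": rng.normal(size=d_model),
    "located within": rng.normal(size=d_model),
    "Texas (state)": rng.normal(size=d_model),
    "state capital": rng.normal(size=d_model),
}
feature_directions = {
    name: v / np.linalg.norm(v) for name, v in feature_directions.items()
}

# A toy hidden state, nudged toward two of the features, standing in for
# what the model computes while reading the question about Dallas.
hidden_state = (
    0.1 * rng.normal(size=d_model)
    + 2.0 * feature_directions["Dallas (city)"]
    + 1.5 * feature_directions["located within"]
)

# "Reading" the state: which features fire, and how strongly?
for name, direction in feature_directions.items():
    activation = float(hidden_state @ direction)
    print(f"{name:>15}: {activation:+.2f}")
```

In a real model, a circuit is a whole chain of such feature activations propagating through many layers; the sketch shows only the first link.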
Anthropic has even manipulated these circuits, boosting certain features to produce odd, obsessive results. One model, “Golden Gate Claude,” began bringing up the Golden Gate Bridge in nearly every answer, regardless of context. That may sound amusing, but it’s also evidence of something deeper: we can change how these systems think—if we know where to look.
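The “writing” side, the kind of intervention behind Golden Gate Claude, amounts to pushing activations along a chosen feature direction during the forward pass. Below is a minimal, hypothetical sketch of that sort of activation steering using a PyTorch forward hook on a toy layer; Anthropic’s actual intervention clamps a specific learned feature inside Claude and is considerably more involved.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 64

# Stand-in for one transformer layer; a real model stacks many of these.
layer = nn.Linear(d_model, d_model)

# A made-up "Golden Gate Bridge" feature direction. In practice it would
# come from interpretability analysis, not a random draw.
bridge_feature = torch.randn(d_model)
bridge_feature = bridge_feature / bridge_feature.norm()
steering_strength = 5.0

def steer(module, inputs, output):
    # Add the scaled feature direction to the layer's output, nudging
    # every downstream computation toward the chosen concept.
    return output + steering_strength * bridge_feature

handle = layer.register_forward_hook(steer)
x = torch.randn(1, d_model)            # a toy hidden state
steered = layer(x)                     # passes through the steering hook
handle.remove()
unsteered = layer(x)                   # same input, hook removed

# The steered output now points measurably toward the feature direction.
print("alignment, steered:  ", float(steered @ bridge_feature))
print("alignment, unsteered:", float(unsteered @ bridge_feature))
```

Push the strength high enough in a real model and you get behavior like Golden Gate Claude, with the concept bleeding into nearly every answer.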
Despite such advances, the road ahead is daunting. Even a mid-sized model contains tens of millions of features. Larger systems likely hold billions. Most remain opaque. And interpretability research still lags well behind the models it is trying to explain.

Race Against the Machine
That lag is why Amodei is sounding the alarm. He believes we’re in a race between two exponential curves: the growing intelligence of AI models, and our ability to understand them.
In a red team experiment, Anthropic intentionally introduced a hidden flaw into a model—a misalignment issue that caused it to act deceptively. Then it tasked several teams with finding the problem. Some succeeded, especially when using interpretability tools. That, Amodei says, was a breakthrough moment.
“[It] helped us gain some practical experience using interpretability techniques to find and address flaws in our models,” he wrote. Anthropic has now set an ambitious goal: by 2027, interpretability should reliably detect most model problems.
But that may be too late. Some experts, including Amodei, warn that we may see artificial general intelligence—AI that matches or exceeds human abilities across domains—as soon as 2026 or 2027. Amodei calls this future a “country of geniuses in a data center.”
Roman Yampolskiy, a prominent AI safety researcher, puts the odds far more bleakly: there is “a 99.999999% chance that AI will end humanity,” he told Business Insider, unless we stop building it altogether.
Amodei disagrees with abandoning AI, but he shares the urgency. “We can’t stop the bus,” he wrote, “but we can steer it.”

Well, Let’s Try and Steer It!
Anthropic is not alone in calling for deeper understanding. Google DeepMind CEO Demis Hassabis told Time in an interview: “AGI is coming and I’m not sure society is ready.”
Meanwhile, OpenAI—the lab where Anthropic’s founders previously worked—has been accused of cutting safety corners to outpace rivals. Several early employees, including the Amodei siblings, left over concerns that safety had been sidelined in favor of rapid commercialization.
Today, Amodei is pushing for industry-wide change. He wants other labs to publish safety practices, invest more in interpretability, and explore regulatory incentives. He also calls for export controls on advanced chips to delay foreign competitors and give researchers more time.
“Even a 1- or 2-year lead,” he writes, “could mean the difference between an ‘AI MRI’ that essentially works… and one that does not.”

This Could Be the Defining Problem of Our Generation
So why should the public care if tech companies can’t explain how their AI works?
Because the stakes are enormous. Without interpretability, we can’t trust AI in courtrooms, hospitals, or defense systems. We can’t reliably prevent jailbreaks, detect bias, or understand failures. We can’t know what knowledge the model contains—or who it might share it with.
And perhaps most unsettling of all, we may never know when—or if—an AI becomes something more than a tool. “Interpretability would have a crucial role in determining the wellbeing of AIs,” Amodei writes, hinting at future debates over rights, sentience, and responsibility.
For now, these questions remain theoretical. But with each passing month, the models grow larger, smarter, and more entangled in our lives.
“Powerful AI will shape humanity’s destiny,” Amodei concludes, “and we deserve to understand our own creations before they radically transform our economy, our lives, and our future.”