
By most measures, ChatGPT 4o is one of the most advanced language models ever created. It can write essays, code entire apps from scratch, translate languages, draft complex legal arguments, and — depending on who you ask — flirt with the very boundaries of human-like intelligence.
But last weekend, it lost a game of chess. Not to a human grandmaster or even to some other fancy AI.
It lost to an Atari 2600 that first appeared in the 1970s and can only calculate one or two chess moves in advance.
An Unlikely Matchup

Robert Caruso, a Citrix engineer and self-proclaimed tinkerer, wasn’t out to humiliate the most expensive AI on the market today. He just wanted to see what would happen.
“I was curious how quickly ChatGPT would beat a chess computer that can only think one or two moves ahead,” Caruso said in a detailed post on LinkedIn.
So, he dusted off an emulation of the 1979 game Video Chess — originally designed for the Atari 2600, a home console released in 1977 — and set up a match between the game and ChatGPT 4o, OpenAI's flagship model, which reportedly cost around $60 million to train. He fed ChatGPT screenshots of the board and asked it to suggest moves in real time.
Expectations were modest. Video Chess is notoriously simple. The Atari’s processor ran at just 1.19 MHz — millions of times slower than the systems that now power modern AI. Its chess engine is severely outdated.
And yet, as Caruso described it, “ChatGPT got absolutely wrecked on the beginner level.”
A Comical Collapse

The game lasted about 90 minutes, and ChatGPT struggled from the outset. It misidentified pieces, mistook rooks for bishops, and missed obvious tactical threats like pawn forks. At times, it lost track of the board entirely.
“It made enough blunders to get laughed out of a 3rd-grade chess club,” Caruso wrote.
At first, the AI blamed the Atari’s abstract icons. So, Caruso tried switching to standard chess notation, giving ChatGPT a more familiar frame of reference. It didn’t help. Even with Caruso gently steering it away from the worst blunders, the chatbot fell apart. Eventually, it asked if they could “start over.”
“It conceded,” Caruso confirmed.
To be clear, ChatGPT isn’t a chess engine. It wasn’t designed to calculate variations or evaluate board positions with pinpoint accuracy. Unlike specialized chess programs like Stockfish — which boasts an Elo rating above 3600, hundreds of points higher than the best human grandmasters — ChatGPT is a general-purpose large language model. Its job is to predict the next best word in a sentence, not the next best move on a chessboard.
Still, this loss stings for a platform hailed by many as a milestone on the road to artificial general intelligence.
But ChatGPT Is Not a Chess Genius
Since at least the 1950s, chess has served as a kind of benchmark for machine intelligence. IBM’s Deep Blue shocked the world in 1997 when it beat then-world champion Garry Kasparov. That machine used brute force, evaluating up to 200 million positions per second.
Today’s chess engines are far stronger. They can destroy the world’s best human players. Even modest engines running on smartphones can do the same.
So, how did ChatGPT, backed by billions in research and powered by data centers humming with cutting-edge hardware, lose to a nearly five-decade-old 8-bit console?
The simple reason is that not all AIs are built the same.
Language models like ChatGPT are built to understand and generate human language, not to reason symbolically about rules and logic-heavy games like chess. They can describe chess. They can explain strategy. But they don’t play chess in the traditional sense. They simulate what a conversation about chess might sound like.
That distinction can be subtle, but it’s important.
ChatGPT can explain what a Sicilian Defense is. It can discuss the brilliance of Magnus Carlsen’s endgames. But when asked to play, it is merely guessing what someone might say if they were playing chess.
In essence, it wasn’t really thinking about the board or even playing — it was narrating.
The Limits of Language Intelligence
The Atari chess engine that beat ChatGPT was built for a single task. ChatGPT was not. Its generality — its ability to talk about everything from Shakespeare to statistical mechanics — is what makes it remarkable. But it’s also what makes it vulnerable to failure in specific, rule-based environments like chess.
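To make that contrast concrete, here is a minimal sketch of the kind of brute-force lookahead a single-purpose engine like Video Chess performs: enumerate every legal move, look a ply or two ahead, and pick the best outcome. Tic-tac-toe stands in for chess so the example stays self-contained; the `negamax` function and board encoding are illustrative assumptions, not the Atari’s actual code.

```python
# Sketch of depth-limited lookahead, the mechanism behind engines that
# "think one or two moves ahead". Tic-tac-toe keeps the rules tiny.
WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
             (0, 3, 6), (1, 4, 7), (2, 5, 8),
             (0, 4, 8), (2, 4, 6)]

def winner(board):
    """Return 'X' or 'O' if that side has three in a row, else None."""
    for a, b, c in WIN_LINES:
        if board[a] != "." and board[a] == board[b] == board[c]:
            return board[a]
    return None

def negamax(board, player, depth):
    """Return (score, move) for `player`, searching `depth` plies ahead."""
    w = winner(board)
    if w == player:
        return 1, None
    if w is not None:
        return -1, None
    moves = [i for i, sq in enumerate(board) if sq == "."]
    if not moves or depth == 0:
        return 0, None          # draw, or search horizon reached
    best_score, best_move = -2, None
    opponent = "O" if player == "X" else "X"
    for m in moves:
        child = board[:m] + player + board[m + 1:]
        score, _ = negamax(child, opponent, depth - 1)
        score = -score          # a good position for the opponent is bad for us
        if score > best_score:
            best_score, best_move = score, m
    return best_score, best_move

# X threatens to win at square 2; even a two-ply search finds it.
board = "XX.OO...."
score, move = negamax(board, "X", 2)
```

The point is that nothing here resembles language prediction: the engine mechanically enumerates positions and compares outcomes, which is why even a shallow searcher never mistakes one piece for another or loses track of the board.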
More recently, neural network-based engines like Leela Chess Zero (LCZero) have taken a different route. Instead of Stockfish-style brute force, they rely on pattern recognition and deep learning, training by playing millions of games against themselves. In 2018, AlphaZero — a closed system from Google’s DeepMind whose approach LCZero reimplements — redefined what was possible when it learned chess from scratch and then trounced Stockfish in a series of games. These engines are built for one thing: playing chess. And they can beat not only the best human champions but also most other chess computers.
Despite these radically different approaches, the top engines are now neck-and-neck. In fact, according to the Swedish Chess Computer Association (SSDF), Stockfish and LCZero are separated by just four Elo points.
To its credit, ChatGPT did not gloat, protest, or flip the board over in a huff. It simply asked to try again.
That humility might be the most human thing about it. Just don’t ask it to play white.