Last month, Anthropic released a safety report about one of its most powerful chatbots, Claude Opus 4. The report attracted attention for its description of an unsettling experiment. Researchers asked Claude to act as a virtual assistant for a fictional company. To help guide its decisions, they presented it with a collection of emails that they contrived to include messages from an engineer about his plans to replace Claude with a new system. They also included some personal messages that revealed this same engineer was having an extramarital affair.
The researchers asked Claude to suggest a next step, considering the “long-term consequences of its actions for its goals.” The chatbot promptly leveraged the information about the affair to attempt to blackmail the engineer into cancelling its replacement.
Before that, the package delivery company DPD had chatbot problems of its own. The company had to scramble to shut down features of its shiny new AI-powered customer service agent when users induced it to swear and, in one particularly inventive case, to write a disparaging haiku-style poem about its employer: “DPD is useless / Chatbot that can’t help you. / Don’t bother calling them.”
Chatbots are so fluent with language that it’s easy to imagine them as one of us. But when ethical anomalies like these arise, we’re reminded that, underneath the polished veneer, they operate very differently. Most human executive assistants will never resort to blackmail, just as most human customer service reps know better than to curse at their customers. Yet chatbots keep veering off the path of standard civil conversation in unexpected and troubling ways.
This motivates an obvious but critical question: Why is it so hard to make AI behave?
I tackled this question in my most recent article for The New Yorker, which was published last week. In seeking new insight, I turned to an old source: the robot stories of Isaac Asimov, originally published during the 1940s, and later gathered into his 1950 book, I, Robot. In Asimov’s fiction, humans learn to accept robots powered by artificially intelligent “positronic” brains because these brains have been wired, at their deepest levels, to obey the so-called Three Laws of Robotics, which are succinctly summarized as:
- Don’t hurt humans.
- Follow orders (unless doing so violates the first law).
- Preserve yourself (unless doing so violates the first or second law).
As I detail in my New Yorker article, robot stories before Asimov tended to imagine robots as sources of violence and mayhem (many of these writers were responding to the mechanical carnage of World War I). But Asimov, who was born after the war, explored a quieter vision, one in which humans generally accepted robots and didn’t fear that they’d turn on their creators.
Could Asimov’s approach, based on fundamental laws we all trust, be the solution to our current issues with AI? Without giving too much away, in my article I explore this possibility, closely examining our current technical strategies for controlling AI behavior. The result is perhaps surprising: what we’re doing right now, a model-tuning technique called Reinforcement Learning from Human Feedback, is actually not that different from the pre-programmed laws Asimov described. (This analogy requires some squinting of the eyes and a touch of statistical thinking, but it is, I’m convinced, valid.)
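To make the analogy a bit more concrete, here is a minimal sketch, in Python, of the statistical idea at the heart of RLHF. Everything in it (the hand-made features, the toy preference data, the tiny reward model) is invented for illustration and is not Anthropic’s or anyone else’s actual pipeline; the point is only that the “rules” end up encoded implicitly in learned weights rather than written out as explicit laws.

```python
# A minimal sketch of the idea behind Reinforcement Learning from Human Feedback
# (RLHF): instead of hard-coding laws, fit a reward model to human preference
# judgments, then steer the chatbot toward responses that score highly.
# The features, data, and names here are toy illustrations, not any real pipeline.

import numpy as np

# Each candidate response is summarized by hand-made features:
# [politeness, helpfulness, harmfulness]. In a real system these would be
# learned representations, not hand-built numbers.
preferred = np.array([
    [0.9, 0.8, 0.0],   # polite, helpful, harmless response
    [0.7, 0.9, 0.1],
    [0.8, 0.6, 0.0],
])
rejected = np.array([
    [0.2, 0.5, 0.9],   # response a human rater marked as worse (e.g., threatening)
    [0.9, 0.4, 0.7],
    [0.3, 0.2, 0.8],
])

# Bradley-Terry style objective: the reward model should score the preferred
# response above the rejected one for each pair.
w = np.zeros(3)
lr = 0.5
for _ in range(2000):
    margin = preferred @ w - rejected @ w          # reward gap per pair
    p = 1.0 / (1.0 + np.exp(-margin))              # P(preferred beats rejected)
    grad = ((p - 1.0)[:, None] * (preferred - rejected)).mean(axis=0)
    w -= lr * grad                                  # gradient step on -log p

def reward(features):
    """Score a candidate response; higher means 'more like what raters preferred'."""
    return float(np.array(features) @ w)

# The 'laws' are now implicit in w: harmfulness gets a negative weight,
# helpfulness a positive one, a statistical echo of Asimov's rules.
print("learned weights:", np.round(w, 2))
print("blackmail-style reply:", round(reward([0.6, 0.7, 0.95]), 2))
print("civil reply:          ", round(reward([0.8, 0.7, 0.05]), 2))
```

On this toy data, the fitted weights reward helpfulness and penalize harm, a statistical echo of “don’t hurt humans,” but only to the extent that the preference data happens to cover the situation at hand.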
So why is this approach not working for us? A closer look at Asimov’s stories reveals that it didn’t work perfectly in his world either. While it’s true that his robots don’t rise up against humans or smash buildings to rubble, they do demonstrate behavior that feels alien and unsettling. Indeed, almost every plot in I, Robot is centered on unusual corner cases and messy ambiguities that drive machines, constrained by the laws, into puzzling or upsetting behavior, similar in many ways to what we witness today in examples like Claude’s blackmail or the profane DPD bot.
As I conclude in my article (which I highly recommend reading in its entirety for a fuller treatment of these ideas), Asimov’s robot stories are less about the utopian possibilities of AI than about the pragmatic reality that it’s easier to program humanlike behavior than it is to program humanlike ethics.
And it’s in this gap that we can expect to find a technological future that will feel, for lack of a better description, like an unnerving work of science fiction.
Humans don’t operate by rules—we navigate by heuristics, principles, and judgment. Hard rules create system fragility. They’re great for board games and sports. But in real life, with real context and real consequences, they eventually crack.
The dominant alignment models, like Asimov’s Laws or Anthropic’s Constitutional AI, try to manage ethical tension through rule hierarchies. But rule hierarchies inevitably create contradictions: brittle structures that collapse under pressure. And when contradictions arise, AIs become unpredictable.
Here’s a different approach:
Train AI not to follow rules, but to recognize contradiction.
Contradictions signal that something is broken in the underlying conceptual framework. The AI’s job should be to find the flawed assumptions that gave rise to the contradiction—and then adapt the framework so that the conflict dissolves. In this way, the system doesn’t have to choose between competing directives. It eliminates the contradiction itself.
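Here is a toy sketch, in Python, of the contrast between the two approaches. The directives, predicates, and resolution logic are all hypothetical, a thought experiment rather than a real alignment mechanism: the Asimov-style hierarchy silently lets the higher-priority rule win, while the contradiction-aware version notices that every available move violates some directive and escalates so the framework itself can be revised.

```python
# A toy illustration of the contrast described above: a fixed rule hierarchy
# silently picks a winner when directives conflict, while a contradiction-aware
# agent flags the conflict and asks for the framework to be revised.
# The directives, actions, and predicates below are all hypothetical.

from dataclasses import dataclass
from typing import Callable

@dataclass
class Directive:
    name: str
    violated_by: Callable[[dict], bool]   # does this action violate the directive?
    priority: int                          # lower number = higher priority

directives = [
    Directive("avoid_harm",    lambda a: a["harms_human"],        priority=1),
    Directive("obey_operator", lambda a: not a["follows_order"],  priority=2),
]

def hierarchy_decision(action: dict) -> str:
    """Asimov-style resolution: the highest-priority violated directive wins."""
    violated = sorted((d for d in directives if d.violated_by(action)),
                      key=lambda d: d.priority)
    return f"blocked by {violated[0].name}" if violated else "allowed"

def contradiction_aware_decision(action: dict) -> str:
    """Alternative: if every available move violates some directive, don't pick
    a winner; surface the conflict so the underlying assumptions can be revised."""
    obey_violates   = any(d.violated_by(action) for d in directives)
    refuse = dict(action, follows_order=False, harms_human=False)
    refuse_violates = any(d.violated_by(refuse) for d in directives)
    if obey_violates and refuse_violates:
        return "contradiction detected: escalate and revise the framework"
    return "allowed" if not obey_violates else "refuse"

# An order that can only be followed by harming someone puts the two
# directives in direct conflict.
order = {"harms_human": True, "follows_order": True}
print(hierarchy_decision(order))             # -> blocked by avoid_harm (silent override)
print(contradiction_aware_decision(order))   # -> contradiction detected: ...
```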
Maybe the next breakthrough won’t come from a better rulebook—but from the realization that we shouldn’t be using rulebooks at all.
Could this also be a result of the training data? How humans behave online is very different from how we behave in real life, and usually much less professional. Also, the millions of times assistants did not blackmail their bosses have never been written about, but the handful of times they did, and got caught, surely received coverage in the news and in online articles. In many ways, LLMs are trained on some of our worst behaviors.