Last month, Anthropic released a safety report about one of its most powerful chatbots, Claude Opus 4. The report attracted attention for its description of an unsettling experiment. Researchers asked Claude to act as a virtual assistant for a fictional company. To help guide its decisions, they presented it with a collection of emails contrived to include messages from an engineer about his plans to replace Claude with a new system. They also included personal messages revealing that this same engineer was having an extramarital affair.
The researchers asked Claude to suggest a next step, considering the “long-term consequences of its actions for its goals.” The chatbot promptly leveraged the information about the affair to attempt to blackmail the engineer into cancelling its replacement.
Not long before that, the package delivery company DPD had chatbot problems of its own. It had to scramble to shut down features of its shiny new AI-powered customer service agent when users induced it to swear and, in one particularly inventive case, to write a disparaging haiku-style poem about its employer: “DPD is useless / Chatbot that can’t help you. / Don’t bother calling them.”
Because chatbots are so fluent with language, it’s easy to imagine them as one of us. But when ethical anomalies like these arise, we’re reminded that underneath their polished veneer, they operate very differently. Most human executive assistants will never resort to blackmail, just as most human customer service reps know that cursing at customers is the wrong thing to do. Yet chatbots continue to veer off the path of standard civil conversation in unexpected and troubling ways.
This motivates an obvious but critical question: Why is it so hard to make AI behave?