ChatGPT Can’t Plan. This Matters.

A brief book update: I wanted to share that Slow Productivity debuted at #2 on the New York Times bestseller list last week! Which is all to say: thank you for helping this book make such a splash.

If you still haven’t purchased a copy, here are two nudges to consider: (1) due to the rush of initial sales, Amazon has temporarily dropped the hardcover price significantly, making it the cheapest it will likely ever be (US | UK); and (2) if you prefer audio, maybe it will help to learn that I recorded the audiobook myself. I uploaded a clip so you can check it out (US | UK).

Last March, Sebastien Bubeck, a computer scientist from Microsoft Research, delivered a talk at MIT titled “Sparks of AGI.” He was reporting on a study in which he and his team ran OpenAI’s impressive new large language model, GPT-4, through a series of rigorous intelligence tests.

“If your perspective is, ‘What I care about is to solve problems, to think abstractly, to comprehend complex ideas, to reason on new elements that arrive at me,’” he said, “then I think you have to call GPT-4 intelligent.”

But as he then elaborated, GPT-4 wasn’t always intelligent. During their testing, Bubeck’s team had given the model a simple math equation: 7*4 + 8*8 = 92. They then asked the model to modify a single number on the left-hand side so that the equation now equaled 106. This is easy for a human to figure out: simply replace the 7*4 with a 7*6.
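
If you want to check the numbers yourself, two lines of Python (my own sanity check, not part of Bubeck’s study) confirm both the original equation and the fix:

```python
# Verify the arithmetic from Bubeck's example.
assert 7 * 4 + 8 * 8 == 92   # the original equation
assert 7 * 6 + 8 * 8 == 106  # the fix: swap the 7*4 for a 7*6
```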

GPT-4 confidently gave the wrong answer. “The arithmetic is shaky,” Bubeck explained.

This wasn’t the only seemingly simple problem that stumped the model. The team later asked it to write a poem that made sense in terms of its content, but also had a last line that was an exact reverse of the first. GPT-4 wrote a poem that started with “I heard his voice across the crowd,” forcing it to end with the nonsensical conclusion: “Crowd the across voice his heard I.”

Other researchers soon found that the model also struggled with simple block stacking tasks, a puzzle game called Towers of Hanoi, and questions about scheduling shipments.

What about these problems stumped GPT-4? They all require you to simulate the future. We recognize that the 7*4 term is the right one to modify in the arithmetic task because we implicitly simulate how increasing the number of 7’s would change the sum. Similarly, when we solve the poem challenge, we think ahead to writing the last line while working on the first.
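
To make this idea of simulating the future concrete, here is a minimal sketch (my own illustration, not code from the study) that solves the arithmetic puzzle the way just described: it tries out candidate replacements for each number on the left-hand side and reports the single change whose simulated sum hits 106.

```python
# Brute-force "look ahead": simulate each possible single-number edit to the
# left-hand side of 7*4 + 8*8 = 92 and report which one yields 106.
terms = [(7, 4), (8, 8)]   # each pair (a, b) contributes a*b to the sum
target = 106

def total(pairs):
    return sum(a * b for a, b in pairs)

for i, (a, b) in enumerate(terms):
    for position, old in ((0, a), (1, b)):
        for candidate in range(20):                      # simulate an edit
            pair = (candidate, b) if position == 0 else (a, candidate)
            trial = terms[:i] + [pair] + terms[i + 1:]
            if candidate != old and total(trial) == target:
                print(f"replace {old} with {candidate} in term {i + 1}:",
                      trial, "->", total(trial))
```

Running it prints a single solution: replacing the 4 in 7*4 with a 6.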

As I argue in my latest article for The New Yorker, titled “Can an A.I. Make Plans?,” this inability of language models to simulate the future is important. Humans run these types of simulations all the time as we go through our day.

As I write:

“When holding a serious conversation, we simulate how different replies might shift the mood—just as, when navigating a supermarket checkout, we predict how slowly the various lines will likely progress. Goal-directed behavior more generally almost always requires us to look into the future to test how much various actions might move us closer to our objectives. This holds true whether we’re pondering life’s big decisions, such as whether to move or have kids, or answering the small but insistent queries that propel our workdays forward, such as which to-do-list item to tackle next.”

If we want to build more recognizably human artificial intelligences, they will have to include this ability to prognosticate. (How did HAL 9000 from the movie 2001 know not to open the pod bay doors for Dave? It must have simulated the consequences of the action.)

But as I elaborate in the article, this is not something large language models like GPT-4 will ever be able to do. Their architectures are static and feedforward, incapable of recurrence or iteration or on-demand exploration of novel possibilities. No matter how large we make these systems, or how intensely we train them, they can’t perform true planning.

Does this mean we’re safe for now from creating a real-life HAL 9000? Not necessarily. As I go on to explain, there do exist AI systems that operate quite differently than language models and that can simulate the future. In recent years, there has been an increasing effort to combine these planning programs with the linguistic brilliance of language models.

I give a lot more details about this in my article, but the short summary of my conclusion is that if you’re excited or worried about artificial intelligence, the right thing to care about is not how big we can make a single language model, but instead how smartly we can combine many different types of digital cognition.

6 thoughts on “ChatGPT Can’t Plan. This Matters.”

  1. This was a great piece of information to know. I wonder, then, why Musk has filed a lawsuit against OpenAI claiming that GPT-4 has already shown glimpses of AGI and needs to be open-sourced.

  2. This statement:

    > Their architectures are static and feedforward, incapable of recurrence or iteration or on-demand exploration of novel possibilities.

    is not entirely correct. Generation _of a single token_ is indeed feed-forward, but since each output token is immediately appended to the input for the next iteration, there is, in fact, a loop. Which is why it’s quite possible to tell e.g. ChatGPT to plan some activity step by step, and then execute it following that plan.

    Furthermore, this simple loop can be made more complex with a little bit of training. For example, GPT-4 can call “functions” – and it’s quite possible to give it a set of functions that essentially give it memory, including structured memory specifically for planning purposes. It is also possible to have functions that control the output in various ways, e.g. replacing the last N tokens with something else – so it could backtrack out of the current stream of tokens if it finds that it didn’t go in a productive direction. There are many interesting possibilities here.
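
    For concreteness, here is a minimal sketch of that outer loop (purely illustrative: `next_token` stands in for a single forward pass of a model, not any real API):

    ```python
    # Illustrative only: a feed-forward "model" wrapped in a generation loop.
    # Each new token is appended to the context and fed back in as input.
    def generate(prompt_tokens, next_token, max_steps=100, stop="<eos>"):
        context = list(prompt_tokens)
        for _ in range(max_steps):
            token = next_token(context)   # one feed-forward pass
            if token == stop:
                break
            context.append(token)         # output becomes part of the next input
        return context

    # Toy stand-in for the model: emits a few tokens, then stops.
    def dummy_next_token(context):
        return "<eos>" if len(context) >= 5 else f"tok{len(context)}"

    print(generate(["plan:"], dummy_next_token))
    # ['plan:', 'tok1', 'tok2', 'tok3', 'tok4']
    ```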

    • This would be terribly inefficient. Most of the elite AI researchers I’ve been talking to are pretty clear that the future is ensemble models. Let LLMs handle language, let planning engines handle planning, let goal models handle motivation, then connect them all together with a smart control program. This is why OpenAI hired away Noam Brown to run Q*: they need to start building other types of models to play with language models.

  3. Good god. Cal, please take care with jumping on the newfangled pop-sci bandwagon approach to AI. If you keep talking fuzzy BS like this, your capacity for rational reasoning is going to be diminished!

  4. Should be obvious by now that LLMs are a scam designed to relieve the venture capital companies of their excess dollars.

