The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity
At its heart, “The Illusion of Thinking” is about trying to genuinely understand how well these new “Large Reasoning Models” (LRMs) actually reason. You’ve probably heard about models that show their “thinking steps” before giving an answer, as with Chain-of-Thought. They often do better on benchmarks, which is exciting. But we felt that just looking at the final answer on standard math or coding tests wasn’t telling the whole story.
The Big Problem We Saw:
Imagine you’re teaching a student. If they just spit out the right answer, you don’t know if they truly understood the method or just got lucky, or maybe even saw the answer somewhere before (which is a bit like “data contamination” in AI benchmarks). We wanted to look inside the “thinking process” and also test these models in a way where we could be sure they hadn’t just memorized the solutions.
Our Approach: Puzzles!
Instead of standard benchmarks, we turned to classic puzzles – things like the Tower of Hanoi, Checkers Jumping, River Crossing, and Blocks World (you can see these in Figure 3 on page 6). Why puzzles?
- Controllable Complexity: We can make these puzzles a little harder or a little easier just by changing a small thing (like adding one more disk in Tower of Hanoi). This lets us see exactly when and how the models start to struggle.
- No Cheating: It’s highly unlikely these models have seen the exact step-by-step solutions to, say, a 10-disk Tower of Hanoi problem in their training data, especially in the specific format we used.
- Clear Rules, Clear Steps: The logic is all there. They don’t need outside knowledge, just the rules we give them. This helps us see if they can follow algorithmic steps.
- We Can Check Their Work: We can use simulators to see if every single step in their “thinking” is correct, not just the final answer. (See the top part of Figure 1 on page 2 for how we analyze both the “thoughts” and the final answer).
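Concretely, that last point is what makes puzzles so attractive: a verifier for a puzzle like Tower of Hanoi is only a few lines of code. Here's a minimal sketch of what such a step-by-step checker might look like (the function name and the `(from_peg, to_peg)` move format are my own illustration, not the paper's):

```python
def check_hanoi_solution(n_disks, moves):
    """Validate a proposed Tower of Hanoi solution move by move.

    Pegs are 0, 1, 2; disks are 1 (smallest) .. n_disks (largest).
    Each move is a (from_peg, to_peg) pair.
    Returns (solved, index_of_first_bad_move_or_None).
    """
    pegs = [list(range(n_disks, 0, -1)), [], []]  # peg 0 starts with all disks
    for i, (src, dst) in enumerate(moves):
        if not pegs[src]:
            return False, i            # illegal: moving from an empty peg
        disk = pegs[src][-1]
        if pegs[dst] and pegs[dst][-1] < disk:
            return False, i            # illegal: larger disk onto smaller one
        pegs[dst].append(pegs[src].pop())
    solved = pegs[2] == list(range(n_disks, 0, -1))
    return solved, None if solved else len(moves)
```

Because every intermediate state is simulated, this flags the *first* illegal move in a model's trace, not just a wrong final answer — e.g. `check_hanoi_solution(2, [(0, 1), (0, 2), (1, 2)])` returns `(True, None)`.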
What We Found (The “Illusion of Thinking” Part):
There’s a Wall – Complete Accuracy Collapse: As we made the puzzles harder, every LRM we tested eventually hit a wall. Beyond a certain complexity, their accuracy just plummeted to zero. They couldn’t solve it at all, no matter how many tries we gave them (within a generous token budget). (You can see this in Figure 4 on page 7, where accuracy drops off a cliff for all puzzles as complexity increases).
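To give a sense of how steeply that "one small knob" scales the underlying work: an n-disk Tower of Hanoi requires 2^n − 1 moves at minimum (a classic result, not specific to the paper), so each extra disk roughly doubles the length of a correct solution.

```python
def hanoi_min_moves(n):
    """Minimum moves to solve n-disk Tower of Hanoi: 2^n - 1 (classic result)."""
    return 2 ** n - 1

# Adding one disk doubles the required solution length (plus one):
lengths = {n: hanoi_min_moves(n) for n in (3, 5, 10)}
# 3 disks need 7 moves, 5 need 31, 10 need 1023
```

So by the time accuracy collapses, the models aren't failing on slightly longer problems — they're failing on problems whose correct solutions are exponentially longer.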
They Stop Trying So Hard (Counter-intuitive Scaling): This was really interesting. You’d think that as a problem gets harder, the model would “think” more (i.e., use more computational steps or “tokens”). They do, up to a point. But right around where they start to fail catastrophically, they actually start reducing their reasoning effort, even if they have plenty of “thinking time” (token budget) left. It’s like they get overwhelmed and just give up prematurely. (Look at Figure 1, bottom middle graph on page 2, or Figure 6 on page 9 – the token usage goes up, then down).
The Three Regimes (When “Thinking” Helps, and When It Doesn’t):
- Easy Problems: Surprisingly, standard LLMs (without the fancy “thinking” steps) often did better or were more efficient than the LRMs. It seems the extra “thinking” by LRMs was sometimes unnecessary or even confusing for simple tasks. (See Figure 5 on page 8, the “low complexity” section).
- Medium Problems: This is where LRMs shone. The extra “thinking” steps helped them solve problems that the standard LLMs struggled with. (Figure 5, “medium complexity”).
- Hard Problems: Both the LRMs and the standard LLMs eventually failed completely. The “thinking” could delay the failure, but it couldn’t prevent it. (Figure 5, “high complexity”).
Peeking Inside the “Thoughts” (Figure 1, bottom right, and Figure 7a on page 10):
- Overthinking Simple Stuff: For easy puzzles, we saw models find the correct solution path early in their “thoughts” but then continue to explore, sometimes making mistakes later or just wasting effort.
- Struggling with Harder Stuff: For more complex problems (but before they completely collapsed), they’d often explore a lot of wrong paths before stumbling upon the correct one, if they found it at all.
- Giving Up: For the really hard problems, they just couldn’t find a correct solution path, no matter how long their “thoughts” were.
They Can’t Always Follow Directions: Even when we gave the models the exact algorithm to solve a puzzle (like Tower of Hanoi, see Figure 8a and 8b on page 11), they still failed around the same complexity level! This tells us they struggle even to execute a known series of logical steps, not just to discover those steps.
Inconsistent Reasoning: A model might be pretty good at Tower of Hanoi up to a certain number of disks, which involves many steps. But then it might fail miserably at a River Crossing problem that actually requires fewer steps but has different types of constraints (Figure 8c and 8d on page 11). This suggests they aren’t learning a general, abstract reasoning skill but are perhaps more sensitive to the specific structure or “feel” of a problem, maybe due to what they saw in training.
So, What’s the “Illusion”?
The “illusion” is that while these models appear to be thinking and reasoning like humans when they produce step-by-step traces, their underlying capabilities have hard limits. Their reasoning isn’t as robust or generalizable as it might seem from their performance on some benchmarks. It seems they are very good at pattern matching and interpolating from things they’ve seen, but when faced with problems that require them to systematically apply rules in novel or increasingly complex ways (algorithmic complexity), they break down.
Why Does This Matter?
Our findings suggest we need to be cautious about how much we trust the “reasoning” of current LRMs, especially for complex, high-stakes tasks. It also points towards areas where these models need to improve, like developing more robust algorithmic execution and generalization capabilities.
Read the original paper: The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity
The content presented here is a collection of my personal notes and explanations based on the paper. This is by no means an exhaustive explanation, and I strongly encourage you to read the actual paper for a comprehensive understanding.