New Study Challenges Apple’s Claims on Large Language Model Reasoning Limits

Apple’s recent AI paper, The Illusion of Thinking, asserts that advanced large reasoning models (LRMs) fail on complex tasks. However, a new rebuttal by Open Philanthropy researcher Alex Lawsen challenges this conclusion, attributing Apple’s findings to experimental design flaws rather than inherent reasoning limits. Notably, the rebuttal lists Anthropic’s Claude Opus model as a co-author.

Lawsen’s paper, titled The Illusion of the Illusion of Thinking, agrees that LRMs struggle with complex planning but argues that Apple’s study confuses output constraints and flawed evaluation methods with actual reasoning failure. The main points raised include:

  • Ignoring token limits: In tests such as the Tower of Hanoi with 8 or more disks, models like Claude hit their output token ceilings and often signaled that they were truncating the move list to save tokens rather than failing to reason (see the arithmetic note after this list).
  • Punishing recognition of unsolvable puzzles: Apple’s River Crossing test included impossible scenarios, yet models were penalized for acknowledging their unsolvability.
  • Flawed evaluation processes: Automated scripts graded outputs solely on complete move lists, so partial or strategic responses were scored as failures even when the model had simply run out of output tokens.
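
To make the token-limit point concrete: a complete Tower of Hanoi move list for n disks contains 2^n − 1 entries, so the transcript the scoring scripts demanded grows exponentially with the puzzle sizes being scaled up:

$$
\text{moves}(n) = 2^{n} - 1:\qquad \text{moves}(8) = 255,\qquad \text{moves}(10) = 1{,}023,\qquad \text{moves}(15) = 32{,}767
$$

At even a few tokens per move, the larger instances exceed typical output budgets regardless of whether the model knows the solving procedure.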

To support his argument, Lawsen re-ran the Tower of Hanoi tests, asking models to write a recursive Lua function that generates the solution instead of enumerating every move. Models including Claude, Gemini, and OpenAI’s o3 successfully solved 15-disk instances, well beyond Apple’s reported failure point.
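
For illustration, the sketch below shows the kind of recursive Lua generator the models were asked to produce; the function name, signature, and move representation are assumptions for this example, not the exact code from Lawsen’s runs.

```lua
-- Illustrative sketch (assumed signature): recursively build the optimal
-- Tower of Hanoi move list for n disks, moving them from peg `from` to
-- peg `to` with `via` as the spare peg.
local function hanoi(n, from, to, via, moves)
  moves = moves or {}
  if n == 0 then
    return moves
  end
  hanoi(n - 1, from, via, to, moves)                      -- clear the top n-1 disks onto the spare peg
  moves[#moves + 1] = { disk = n, from = from, to = to }  -- move the largest remaining disk
  hanoi(n - 1, via, to, from, moves)                      -- restack the n-1 disks on top of it
  return moves
end

-- 15 disks: 2^15 - 1 = 32,767 moves produced programmatically,
-- without the model having to write each move out token by token.
local moves = hanoi(15, "A", "C", "B")
print(#moves)  --> 32767
```

The point of this format is that producing the algorithm is a fixed, small amount of output, while the full move list it encodes can be arbitrarily long.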

Lawsen concludes that when artificial output constraints are removed, LRMs can reason effectively on complex tasks, at least in generating algorithms. However, he acknowledges these results are preliminary and that true algorithmic generalization remains challenging.

This debate holds significance because Apple’s paper has been widely cited as evidence of fundamental reasoning deficiencies in LRMs. Lawsen’s rebuttal suggests a more nuanced view: LRMs may be limited by current deployment constraints rather than intrinsic reasoning flaws.

For future research, Lawsen recommends:

  • Designing evaluation methods that separate reasoning ability from output constraints.
  • Verifying puzzle solvability before testing model performance.
  • Using complexity metrics that reflect computational difficulty, not just solution length.
  • Considering multiple solution formats to distinguish between algorithmic understanding and output execution.

Ultimately, Lawsen emphasizes the need to refine evaluation standards before concluding that LRMs fundamentally lack reasoning capacity.