In June, Apple researchers published a study examining whether simulated reasoning (SR) AI models genuinely reason through problems or primarily rely on pattern matching from training data.
The study, titled “The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity,” was led by Parshin Shojaee and Iman Mirzadeh with contributions from Keivan Alizadeh, Maxwell Horton, Samy Bengio, and Mehrdad Farajtabar.
They evaluated so-called large reasoning models (LRMs), which use a technique called chain-of-thought reasoning to generate step-by-step problem-solving text. The team tested the models on four classic puzzles of varying difficulty: Tower of Hanoi, checkers jumping, river crossing, and blocks world. These ranged from simple instances, such as a one-disk Tower of Hanoi, to extremely complex versions, including a 20-disk Tower of Hanoi requiring over a million moves.
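For context, the classic minimal solution to Tower of Hanoi takes 2^n − 1 moves for n disks, so a 20-disk instance needs 1,048,575 moves at minimum. The short Python sketch below (illustrative only, not code from the paper) shows the standard recursion and how quickly the move count grows:

```python
# A minimal sketch of the classic recursive Tower of Hanoi solution.
# The optimal solution for n disks takes 2**n - 1 moves, which is why
# a 20-disk instance requires over a million moves.

def hanoi(n, source, target, spare, moves):
    """Append the minimal move sequence for n disks to `moves`."""
    if n == 0:
        return
    hanoi(n - 1, source, spare, target, moves)  # shift n-1 disks out of the way
    moves.append((source, target))              # move the largest disk
    hanoi(n - 1, spare, target, source, moves)  # re-stack the n-1 disks on top

for n in (1, 3, 10, 20):
    if n <= 10:
        moves = []
        hanoi(n, "A", "C", "B", moves)
        assert len(moves) == 2**n - 1           # recursion matches the formula
    print(f"{n:2d} disks: {2**n - 1:>9,} minimal moves")
```

Each added disk doubles the minimum number of moves, which is what let the researchers dial problem complexity smoothly from trivial to effectively intractable.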
The researchers noted that current AI benchmarks focus mainly on final-answer accuracy for established math and coding problems, rather than assessing whether a model genuinely reasoned its way to the answer or merely recalled patterns from its training data.
The results aligned with those of a recent USAMO study, in which AI models scored mostly below 5 percent on novel mathematical proofs; only one model reached 25 percent, and none produced a perfect proof across nearly 200 attempts. Both studies point to sharp performance declines on tasks that demand sustained, systematic reasoning.