Apple’s research division published a new paper that calls the broader utility of large language models (LLMs) into question. The researchers tested leading “reasoning” models such as Claude and DeepSeek and found that as task complexity increases, the models collapse, even when they are handed an exact algorithm for the problem and have ample compute left. They concluded that the models cannot apply basic logic and are not actually “reasoning”; they are simply pattern matching. Salesforce researchers similarly found “a significant gap between current LLM capabilities and real-world enterprise demands.”
Existing methods for evaluating these models have been overstating their abilities. The industry has learned to train models to score well on specific tests, and widespread data contamination means the test answers may already sit in a model’s training data. Researchers are trying to address this AI “evaluation crisis.” Some are calling for evaluations that measure risk as well as performance: a model’s unreliability, hallucinations, and brittleness can be disastrous when safety or money is on the line, and those traits should be benchmarked when models are released.
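As a rough illustration of what such a risk-aware evaluation might look like, the sketch below scores a model not only on whether it answers test cases correctly but also on whether its answers survive paraphrased prompts, one possible proxy for brittleness. The query_model stub, the test case, and the metrics are illustrative assumptions, not an established benchmark.

```python
# Illustrative sketch: scoring a model on correctness *and* brittleness.
# query_model is a hypothetical stand-in for any LLM API call; the
# paraphrase-robustness metric is one possible proxy for "riskiness."

from statistics import mean

def query_model(prompt: str) -> str:
    # Placeholder: replace with a real call to the model under test.
    return "42"

def evaluate(cases: list[dict]) -> dict:
    accuracies, robustness = [], []
    for case in cases:
        # Correctness on the canonical phrasing of the task.
        accuracies.append(query_model(case["prompt"]).strip() == case["answer"])
        # Brittleness proxy: does the answer survive paraphrased prompts?
        answers = {query_model(p).strip() for p in case["paraphrases"]}
        robustness.append(len(answers) == 1 and case["answer"] in answers)
    return {"accuracy": mean(accuracies), "paraphrase_robustness": mean(robustness)}

cases = [{
    "prompt": "What is 6 times 7?",
    "paraphrases": ["Multiply 6 by 7.", "Compute 6 x 7 and give only the number."],
    "answer": "42",
}]
print(evaluate(cases))
```

A harness like this reports two numbers instead of one, so a model that aces the benchmark but flips its answer under trivial rewording shows up as risky rather than as state of the art.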
As Gary Marcus puts it, “approximations of reality are no substitute for truth.” In his view, the tech industry should rebuild AI from the ground up if it wants to achieve any of the promised productivity gains that LLMs are supposed to enable. Even Sam Altman has recently acknowledged that simply scaling existing machine learning models will not yield further significant AI advances.
Going all-in on LLMs risks starving out a potentially better alternative. Marcus advocates for neurosymbolic AI, an approach that pairs neural networks with explicit, human-style symbolic logic rather than relying solely on training models over vast datasets. Fei-Fei Li of World Labs and Yann LeCun of Meta are both researching “world models”: generative AI models that understand the dynamics of the real world. These models learn to represent and predict dynamics such as motion, force, and spatial relationships from sensory data. World models have many uses, including reasoning: they can take multimodal inputs, analyze them over time and space, apply chain-of-thought reasoning to understand what is happening, and decide on the best actions, which makes them far more adept than LLMs at complex problem solving. Li and LeCun believe these models are the only path toward truly intelligent AI.
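The toy sketch below shows the general recipe behind that description, not Li’s or LeCun’s actual architectures: an encoder compresses an observation into a latent state, a dynamics network predicts how that state changes under each candidate action, and a simple scoring head picks the action with the best predicted outcome. All layer sizes, the random inputs, and the reward head are illustrative assumptions.

```python
# Minimal world-model sketch (assumed toy architecture, PyTorch):
# perceive -> predict -> act, using learned latent dynamics.

import torch
import torch.nn as nn

OBS_DIM, ACT_DIM, LATENT_DIM = 32, 4, 16

encoder = nn.Sequential(nn.Linear(OBS_DIM, 64), nn.ReLU(), nn.Linear(64, LATENT_DIM))
dynamics = nn.Sequential(nn.Linear(LATENT_DIM + ACT_DIM, 64), nn.ReLU(),
                         nn.Linear(64, LATENT_DIM))
reward_head = nn.Linear(LATENT_DIM, 1)  # scores how desirable a predicted state is

def plan(observation: torch.Tensor, candidate_actions: torch.Tensor) -> int:
    """Pick the action whose simulated next state the model predicts is best."""
    with torch.no_grad():
        state = encoder(observation)                       # perceive: observation -> latent state
        states = state.expand(len(candidate_actions), -1)  # one copy of the state per action
        next_states = dynamics(torch.cat([states, candidate_actions], dim=-1))  # predict dynamics
        return int(reward_head(next_states).argmax())      # act: choose best predicted outcome

obs = torch.randn(OBS_DIM)        # stand-in for encoded sensory input
actions = torch.eye(ACT_DIM)      # four discrete candidate actions, one-hot encoded
print("chosen action:", plan(obs, actions))
```

In a real system the networks would be trained on video, robotics, or simulation data and rolled forward many steps; the point here is only that the model reasons by simulating consequences in a learned model of the world rather than by predicting the next token.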
Questions to consider
How are companies measuring the actual productivity gains from LLM use?
For companies developing foundation models, what methods are they using for model evaluation? How are they addressing the limitations of LLM scaling?
For companies exploring alternatives to LLMs, what steps are they taking to ensure the fairness and safety of these models?


