One critical question underpins the excitement around Large Language Models (LLMs): just how intelligent are they when it comes to reasoning?
Asking this question is crucial, as understanding the true capabilities of LLMs enables us to apply them more effectively and to make strategic decisions about where to invest in skills that foster long-term competitive advantage.
LLMs undoubtedly enhance human potential—boosting productivity and supporting collaboration—but their ability to replace humans remains largely confined to simpler tasks. In very complex roles, AI still acts as a powerful supplement rather than a full substitute.
Challenging the Reasoning Capabilities of AI
Recently, Apple’s white paper questioned how LLMs are evaluated, particularly concerning their reasoning skills. One frequently used benchmark, GSM8K, contains 8,500 grade-school maths problems and was developed by OpenAI to assess mathematical reasoning in LLMs.
In the white paper “GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models” (https://arxiv.org/abs/2410.05229), Apple’s researchers argue that LLMs often fall short of true reasoning even when they score well on benchmarks like GSM8K. The paper introduces GSM-Symbolic, a new benchmark that uses symbolic templates to generate varied versions of each question, and shows that LLM performance can decline with slight changes to names, numbers, or complexity, revealing inherent weaknesses in logical reasoning.
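To make the idea of symbolic templates concrete, here is a minimal sketch—not the paper’s actual code or templates—of how a GSM8K-style problem could be parameterised and re-instantiated with different names and numbers, then used to measure how stable a model’s accuracy is across surface variants. The `query_model` function is a hypothetical stand-in for whatever LLM call you use.

```python
import random

# A GSM8K-style word problem turned into a symbolic template, in the spirit of
# GSM-Symbolic (an illustrative sketch, not the paper's actual templates).
TEMPLATE = (
    "{name} has {a} apples. {name} buys {b} more bags with {c} apples each. "
    "How many apples does {name} have now?"
)

def instantiate(seed: int) -> tuple[str, int]:
    """Create one surface variant of the problem and its ground-truth answer."""
    rng = random.Random(seed)
    name = rng.choice(["Sophie", "Liam", "Ava", "Noah"])
    a, b, c = rng.randint(2, 20), rng.randint(2, 9), rng.randint(2, 12)
    return TEMPLATE.format(name=name, a=a, b=b, c=c), a + b * c

def accuracy_over_variants(query_model, n: int = 50) -> float:
    """Score a model over many variants of the *same* underlying problem.

    `query_model` is hypothetical: replace it with a call to your chosen LLM.
    """
    correct = 0
    for seed in range(n):
        question, answer = instantiate(seed)
        reply = query_model(question)  # e.g. returns "154"
        correct += str(answer) in reply
    return correct / n
```

If a model truly reasoned about the arithmetic rather than matching familiar phrasings, its accuracy should be roughly constant across these variants; the paper’s finding is that it often is not.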
The paper has prompted discussion about the need for more robust evaluation methods for AI models. Many experts argued that the findings point to a need to combine standard machine learning with traditional logical-reasoning modules or symbolic processing units, creating hybrid AI systems capable of tackling complex tasks.
Overall, the reception of Apple’s white paper underscores a growing recognition of the limitations of current models in reasoning tasks and the need for more rigorous evaluation frameworks.
This conversation points to a fundamental question: What kinds of reasoning are humans capable of, and how do LLMs measure up against these diverse reasoning skills? As we explore these distinctions, we gain a clearer view of AI’s potential—and its current boundaries—in augmenting human intelligence.
Types of Reasoning
LLMs can handle some reasoning types well, particularly those aligned with pattern recognition, yet they often struggle with more nuanced or contextually demanding reasoning.
In this section, we examine various reasoning types and assess LLMs’ performance in each.
Deductive Reasoning
Deductive reasoning draws specific conclusions from general premises. If the premises are true, the conclusion must also be true, such as in “All humans are mortal; Socrates is human; therefore, Socrates is mortal.”
LLMs handle basic deductive reasoning patterns and follow straightforward logical structures. However, they still struggle with complex, multi-step deductions or maintaining strict logical coherence, often producing errors when rigour or nuanced logic is needed.
Inductive Reasoning
Inductive reasoning generalises based on specific observations, forming probabilistic conclusions rather than certainties. For example, observing only white swans may lead to the conclusion that “All swans are white.”
Inductive reasoning aligns well with LLMs, as they recognise patterns across large datasets. While LLMs generalise based on patterns in their training data, they lack the flexibility to develop new hypotheses, limiting their ability to extrapolate beyond learned data.
Abductive Reasoning
Abductive reasoning infers the most likely explanation from incomplete information and is often used in fields like medicine and detective work. For example, a doctor might infer that an infection is the likely cause of a fever, even though other causes are possible.
LLMs can mimic abductive reasoning by suggesting likely explanations based on correlations in data, but they lack a true understanding of causality and context, as discussed under causal reasoning below. This limits their ability to reliably identify the “best” explanation, especially in complex cases.
Analogical Reasoning
Analogical reasoning applies knowledge of one situation to a similar one, like using knowledge of Earth’s ecosystems to hypothesise about extraterrestrial life. Analogies are powerful tools for understanding new contexts based on familiar ones.
LLMs are capable of creating basic analogies based on linguistic patterns, but they miss deeper, conceptual similarities. Without a true understanding of the underlying relationships, their analogies tend to be superficial rather than insightful.
Causal Reasoning
Causal reasoning determines cause-and-effect relationships, helping predict outcomes, such as knowing touching a hot surface will cause pain. This reasoning enables better decision-making by understanding consequences.
LLMs struggle with causal reasoning, as they primarily identify correlations rather than true cause-and-effect relationships. Studies, such as Kiciman et al. (2023), show that even after fine-tuning, models perform near-randomly on causal inference tasks, highlighting the need for more advanced causal understanding in AI systems.
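As a small illustration of why correlation alone is not enough, the sketch below simulates a hidden confounder that drives two otherwise unrelated variables; the two end up strongly correlated even though neither causes the other. This is a generic statistics example of my own construction, not data from the cited study.

```python
import random
import statistics

random.seed(0)

# A hidden confounder (outdoor temperature) drives both ice-cream sales and
# sunburn cases. Neither causes the other, yet they correlate strongly.
temperature = [random.gauss(25, 5) for _ in range(1_000)]
ice_cream = [t * 2.0 + random.gauss(0, 3) for t in temperature]
sunburn = [t * 0.5 + random.gauss(0, 2) for t in temperature]

correlation = statistics.correlation(ice_cream, sunburn)
print(f"correlation: {correlation:.2f}")  # close to 1.0 despite no causal link
```

A system that only learns the co-occurrence of “ice cream” and “sunburn” in text has no way, by itself, to recover the fact that temperature is doing the causal work.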
Probabilistic Reasoning
Probabilistic reasoning assesses the likelihood of various outcomes, useful in scenarios like gambling or risk assessment. It deals with confidence levels rather than certainties, helping manage uncertainty.
LLMs approximate probabilistic reasoning based on language patterns but lack true statistical reasoning mechanisms. This limitation means they can misinterpret probabilities and may overlook important statistical nuances in complex scenarios.
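A classic base-rate calculation shows the kind of explicit reasoning this requires, and the kind that pattern matching often gets wrong. The prevalence and test accuracies below are illustrative numbers chosen for the example, not figures from any cited study.

```python
# Bayes' theorem on an illustrative diagnostic-test scenario:
# P(disease | positive) = P(positive | disease) * P(disease) / P(positive)
prevalence = 0.01           # 1% of people have the condition
sensitivity = 0.95          # P(positive | disease)
false_positive_rate = 0.05  # P(positive | no disease)

p_positive = sensitivity * prevalence + false_positive_rate * (1 - prevalence)
p_disease_given_positive = sensitivity * prevalence / p_positive

print(f"P(disease | positive test) = {p_disease_given_positive:.2%}")  # ~16%, not 95%
```

The counter-intuitive result—a positive test still leaves only about a 16% chance of disease—is exactly the sort of base-rate effect that gets lost when probabilities are inferred from surface language patterns alone.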
Moral/Ethical Reasoning
Moral reasoning involves making judgments based on principles of right and wrong, often influenced by cultural values and the specific context of an ethical dilemma.
LLMs can echo moral principles found in training data but lack genuine ethical reasoning. Their responses reflect statistical data biases rather than an understanding of ethics, so they cannot truly “reason” through moral dilemmas.
Spatial Reasoning
Spatial reasoning involves understanding the spatial relationships between objects, essential in tasks like navigation, assembling puzzles, and visualising 3D transformations.
LLMs are weak in spatial reasoning as it relies on an understanding of physical space that text-based models struggle to capture. While they handle basic spatial descriptions in language, they falter with complex spatial tasks or visual transformations.
Intuitive Reasoning
Intuitive reasoning relies on instincts or “gut feelings” shaped by past experiences and pattern recognition, allowing people to sense danger without explicit evidence.
LLMs may seem “intuitive” when they provide probable responses quickly, but this is merely a function of statistical pattern matching, not genuine intuition. They lack the subconscious processing that human intuition entails.
Formal Logical Reasoning
Formal logical reasoning follows strict rules, often mathematical or symbolic, to derive conclusions. This is common in fields like mathematics and computer science, where adherence to logic is essential.
LLMs can imitate formal logic in simple cases but often fail in complex, structured logical tasks, especially when multiple steps or rigorous adherence to rules are required. They lack a genuine logical framework, limiting their consistency in formal reasoning.
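To contrast with pattern-based behaviour, the sketch below shows what an explicit logical framework looks like: a tiny forward-chaining engine that applies rules exhaustively and never drifts between steps. It is a toy illustration of mechanical deduction, not a claim about how LLMs work internally.

```python
# A toy forward-chaining inference engine: rules are (premises, conclusion) pairs.
# Every derived fact follows mechanically from the rules and the starting facts.
RULES = [
    ({"human(socrates)"}, "mortal(socrates)"),
    ({"mortal(socrates)", "philosopher(socrates)"}, "famous_example(socrates)"),
]

def forward_chain(facts: set[str]) -> set[str]:
    """Repeatedly apply rules until no new facts can be derived."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in RULES:
            if premises <= derived and conclusion not in derived:
                derived.add(conclusion)
                changed = True
    return derived

print(forward_chain({"human(socrates)", "philosopher(socrates)"}))
```

The point of the contrast is consistency: a rule engine of this kind will reach the same conclusions every time, whereas an LLM’s multi-step “deductions” can vary with phrasing or length.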
In summary, while LLMs can approximate some reasoning types through pattern recognition, they remain limited by their lack of genuine understanding, true causal models, and the ability to handle complex, multi-step reasoning reliably. They are proficient with language patterns but lack the cognitive structures necessary for robust, human-like reasoning across all of these types.
Progress and Challenges in Boosting LLM Reasoning Abilities
Recognising the critical role of reasoning in enhancing the capabilities of LLMs, several leading organisations have developed models with advanced reasoning abilities:
- OpenAI’s o1 Series: Introduced in September 2024, the o1 series is designed to tackle complex tasks in science, coding, and mathematics by spending more time deliberating before responding. This approach enables the models to reason through intricate problems more effectively.
- Google DeepMind’s Gemini 1.5: Released in May 2024, Gemini 1.5 is a family of multimodal models that integrate advanced reasoning capabilities across various modalities, including text and images. These models are embedded in a range of Google products, enhancing user experiences through improved reasoning and understanding.
In some of these systems, LLM providers spend more compute at inference time to improve reasoning. By allowing extra processing time, models can work through multi-step reasoning (effectively “thinking” through a problem), handle longer context windows, and refine answers before responding. This makes responses more accurate and contextually aware, especially for complex questions.
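One rough way to picture spending extra inference-time compute is a draft–critique–revise loop, sketched below. The `generate` function is a hypothetical wrapper around any chat-completion call; this is an illustration of the general idea, not the actual mechanism used by OpenAI’s o1 or Gemini 1.5.

```python
# Illustrative draft-critique-revise loop: one way to trade latency for quality.
# `generate(prompt)` is a hypothetical wrapper around an LLM API call.
def answer_with_refinement(generate, question: str, rounds: int = 2) -> str:
    draft = generate(f"Think step by step, then answer:\n{question}")
    for _ in range(rounds):
        critique = generate(
            f"Question:\n{question}\n\nDraft answer:\n{draft}\n\n"
            "List any logical or arithmetic errors in the draft."
        )
        draft = generate(
            f"Question:\n{question}\n\nDraft answer:\n{draft}\n\n"
            f"Critique:\n{critique}\n\nWrite a corrected final answer."
        )
    return draft
```

Each extra round costs more tokens and time, which is why this style of approach suits complex questions rather than routine ones.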
These advancements underscore the industry’s commitment to refining reasoning abilities in LLMs, thereby broadening their applicability and reliability across diverse sectors.
However, improving reasoning in LLMs still presents several key challenges:
- Statistical Pattern Dependence: LLMs rely on patterns, not true comprehension, restricting their ability to achieve deep logical coherence and nuanced understanding.
- Lack of World Models: Without an internal model of the physical world, LLMs struggle to fully grasp cause and effect or perform complex real-world reasoning.
- No Continuous Learning: Fixed post-training, LLMs can’t adapt or refine reasoning based on new experiences, unlike humans.
- Limited Memory: A bounded context window restricts LLMs in long, context-dependent tasks, affecting applications that require continuity, such as medicine or law.
- No Goals or Intentions: LLMs lack intrinsic motivations, which limits their ability to reason purposefully, often resulting in plausible but superficial responses.
While companies developing LLMs may eventually find ways to address these barriers—through advances in model architecture, adaptive learning frameworks, or integrated world models—these challenges currently slow progress in creating AI that can reason in ways that more closely mirror human thought. As a result, overcoming these obstacles will be essential for further breakthroughs in the field.
Future-Proofing Software Engineering: Strategies for Companies & Professionals
While LLMs are bound to improve in reasoning, their progress in this area is likely to be slower than the rapid advances we have seen in automation and pattern-recognition tasks. Unlike straightforward tasks, reasoning involves layered context, judgement, and abstract thinking that LLMs currently struggle to replicate. This means that, for the foreseeable future, there will be a sustained need for human oversight and critical thinking to bridge these gaps and complement AI capabilities.
LLMs hold significant potential in software development, particularly as powerful tools for augmenting productivity and creativity. In product design, they can assist teams in generating ideas, automating repetitive tasks like documentation, and producing design suggestions based on patterns in data.
LLMs can also support UX research by summarising feedback trends and suggesting user flows. However, when it comes to creating innovative design concepts or understanding nuanced user behaviour, their limitations in complex reasoning become evident, and human oversight remains crucial to bridge gaps in context and creativity.
In engineering, LLMs can aid in coding by generating boilerplate code, automating routine tasks, and suggesting bug fixes based on historical patterns. They excel at offering code snippets, refactoring suggestions, or even pseudocode for certain types of tasks. However, their limitations in formal and causal reasoning mean they can struggle with deeply complex algorithms or systems integration. This makes them better suited as a tool to enhance engineers’ workflows rather than replace their critical thinking.
Overall, while LLMs are valuable for streamlining processes and automating straightforward tasks, leveraging them most effectively in software development requires an understanding of their limits. By using LLMs to handle simpler tasks and supplement human expertise in higher-level decision-making, teams can achieve a balance that maximises productivity and supports sustained innovation.
For companies and professionals navigating the AI landscape, focusing on roles and skills that complement AI capabilities is a strategic way to future-proof a career and build long-term value. Here are some key areas and capabilities to consider:
- Complex Problem Solving and Critical Thinking: While LLMs can handle repetitive and pattern-based tasks, they still struggle with nuanced reasoning and multi-step problem-solving. Developing strong analytical and critical-thinking skills will make you invaluable for tasks requiring deep judgement, strategic planning, and creative solutions.
- Hybrid AI Roles: Specialising in roles that bridge human and AI capabilities, such as AI-assisted software engineering, human-in-the-loop processes, or AI ethics and governance, can offer a niche advantage. These roles leverage human oversight to guide and enhance AI outputs, ensuring ethical, reliable, and contextually relevant outcomes.
- Product and User Experience Design: Since AI struggles with understanding deep human needs and emotions, focusing on user-centric design, customer empathy, and UX/UI design will make you essential to creating products that genuinely connect with users. AI can support design processes, but it’s unlikely to replace the human touch needed for innovative, user-focused experiences.
- Data Strategy and Data Interpretation: AI is only as good as the data it’s trained on. Building expertise in data strategy, data analysis, and contextual interpretation will enable you to guide AI projects and maximise their accuracy and relevance. This includes understanding data quality, biases, and ethical considerations—areas where human oversight is critical.
- Ethical and Responsible AI Practices: With growing concerns about AI’s societal impact, roles focused on AI ethics, bias mitigation, and responsible AI use are becoming increasingly valuable. Professionals with a deep understanding of ethical frameworks and regulatory compliance will play a crucial role in guiding AI development responsibly.
- Cross-Disciplinary Knowledge: AI applications are expanding across fields like healthcare, finance, law, and education. Building a blend of domain expertise (e.g., healthcare or finance) with AI literacy can set you apart, enabling you to apply AI effectively and meaningfully in industry-specific contexts.
- Interpersonal and Leadership Skills: As AI takes on more technical tasks, soft skills like communication, team collaboration, and empathy will become even more essential. Leadership roles that guide AI strategies, manage human-AI collaboration, and inspire innovation will remain squarely in the human domain.
- Innovation in Complex Engineering and Algorithm Design: AI’s limitations in formal and abstract reasoning mean that areas like algorithm design, complex software engineering, and system architecture are still fields where human expertise shines. Focusing on these areas allows professionals to create foundational systems that AI can build upon but not replace.
By focusing on these areas, new companies and professionals can carve out roles that leverage AI’s strengths while filling the critical gaps AI cannot yet address. Building these capabilities ensures adaptability in an AI-driven world, positioning you as a crucial asset in an evolving market.
References
Here are some references on the current capabilities and limitations of large language models in reasoning and cognition.
“GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models” – https://arxiv.org/pdf/2410.05229
“Reasoning or Reciting? Exploring the Capabilities and Limitations of Language Models Through Counterfactual Tasks” – https://arxiv.org/pdf/2307.02477
“Are Large Language Models Really Good Logical Reasoners? A Comprehensive Evaluation and Beyond” – https://arxiv.org/pdf/2306.09841
“The Limitations of Large Language Models for Understanding Human Language and Cognition” – http://direct.mit.edu/opmi/article-pdf/doi/10.1162/opmi_a_00160/2468254/opmi_a_00160.pdf