Author: Ricardo Moral

Published on: November 27, 2024


Building Effective and Responsible AI Agents: The Importance of Recognising Causal Reasoning Limits in Generative AI

In this article, we examine why understanding causal reasoning—and recognising the limitations of large language models (LLMs) in this area—is crucial for implementing AI Agents effectively.

As organisations increasingly turn to AI Agents to perform complex tasks, awareness of these limitations becomes essential to assess their true capabilities and to ensure responsible, reliable deployments that align with desired outcomes.

The difference between causality and correlation lies in how two things are connected.

Causality means that one event directly causes another. There is an active influence where one event produces an outcome.

Example: If someone pushes a ball, it rolls because of the push. Without the push, the ball would remain still. Here, the push causes the rolling.

Strengths of Causality:

Causality provides a clear explanation of why something happens, allowing for a deeper understanding of cause-and-effect relationships.

It is useful for predicting outcomes and designing interventions, such as medical treatments or social policies that aim to create specific effects.

Causality helps uncover underlying mechanisms, offering a more detailed understanding of complex processes.

Weaknesses of Causality:

Causation is often difficult to prove, especially outside controlled experimental conditions, as isolating causes can be complex.

Many real-world outcomes are influenced by numerous factors, making it challenging to pinpoint a single cause.

Ethical limitations can restrict testing certain causal relationships, particularly in medicine or psychology, where experiments could harm participants.

Correlation occurs when two things appear together or follow a similar pattern, but one doesn’t necessarily cause the other. They simply appear related.

Example: When it rains, more umbrellas appear on the streets, but umbrellas don’t cause the rain. People carry umbrellas to stay dry, so both happen together.

Strengths of Correlation:

Correlation can reveal patterns and associations that suggest areas for further study or intervention.

It is useful for generating hypotheses when causation cannot be immediately tested, especially in observational studies.

Correlation is valuable in fields like social sciences or epidemiology, where controlled experiments may be impossible or impractical.

Weaknesses of Correlation:

Correlation does not imply causation, leading to potential misinterpretations if one assumes that one thing causes another just because they’re linked.

Correlations are vulnerable to confounding factors, where a third factor influences both correlated items, creating misleading associations.

Sometimes, correlations happen by random chance, leading to false patterns with no meaningful relationship.

While causality provides powerful insights for explaining and predicting outcomes, it is often challenging to prove. Correlation, on the other hand, is valuable for identifying patterns but can lead to false conclusions if it’s mistakenly assumed to imply causation.
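To make the confounding-factor weakness concrete, here is a minimal Python sketch (the scenario and numbers are invented purely for illustration, not drawn from real data): a single hidden driver, temperature, produces a strong correlation between two outcomes that have no causal link to each other, and controlling for that driver makes the association largely disappear.

```python
# Minimal sketch: a confounder ("hot weather") drives both ice-cream sales and
# drowning incidents, producing a strong correlation with no causal link.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

temperature = rng.normal(25, 5, n)                            # the confounder
ice_cream_sales = 2.0 * temperature + rng.normal(0, 3, n)     # caused by temperature
drownings = 0.5 * temperature + rng.normal(0, 3, n)           # also caused by temperature

# The two outcomes are strongly correlated even though neither causes the other.
print(np.corrcoef(ice_cream_sales, drownings)[0, 1])          # ~0.6

# Controlling for the confounder removes most of the association:
# regress each variable on temperature and correlate the residuals.
resid_sales = ice_cream_sales - np.polyval(np.polyfit(temperature, ice_cream_sales, 1), temperature)
resid_drown = drownings - np.polyval(np.polyfit(temperature, drownings, 1), temperature)
print(np.corrcoef(resid_sales, resid_drown)[0, 1])            # ~0.0
```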

The distinction between causality and correlation aligns with different types of human reasoning: deductive, inductive, and abductive.

Each reasoning type approaches relationships between things in unique ways:


Deductive Reasoning (General to Specific)

Deductive reasoning starts with a general principle or theory and applies it to a specific case to reach a conclusion.

When applied to causality, it begins with a known causal relationship (e.g., “Exercise improves cardiovascular health”) and uses it to predict outcomes in specific situations (e.g., “If a person exercises regularly, they have a lower risk of heart disease”).

Relation to Causality: Deductive reasoning works well with causality because it derives specific outcomes from established causal principles.

Limitations with Correlation: Deduction does not work well with correlation because correlation alone doesn’t provide a solid principle or cause-effect rule, only an observed link.


Inductive Reasoning (Specific to General)

Inductive reasoning involves observing specific cases and inferring a general rule. For example, observing that people with higher education tend to earn more may lead to the general conclusion that education affects income.

Relation to Correlation: Inductive reasoning is often used to identify correlations, as it relies on observing patterns across multiple cases.

Limitations with Causality: While induction can suggest possible causal relationships, it cannot confirm them, as there may be other factors influencing the pattern.


Abductive Reasoning (Best Explanation)

Abductive reasoning finds the most likely explanation for an observed phenomenon. Given a correlation between two things, abduction may attempt to provide a causal explanation but doesn’t claim to prove it.

Relation to Both Correlation and Causation: Abduction can work with both correlation and causation. For instance, if there is a correlation between exercise and lower heart disease rates, abduction might hypothesise that exercise improves heart health, though this remains speculative until causation is confirmed.

Limitations: Abduction provides a best guess based on available information but lacks certainty. In correlation studies, this often leads to hypotheses that require further testing to confirm any causal link.

  • Deductive reasoning reliably applies known causal laws but is limited with correlations.
  • Inductive reasoning identifies correlations and patterns but cannot confirm causation.
  • Abductive reasoning offers plausible explanations for correlations but needs further testing to establish causality.
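To ground the three styles, here is a toy Python sketch (all rules, observations, and plausibility scores are invented for illustration, not drawn from any study) showing the basic shape of each: deduction applies a known rule to a case, induction generalises a pattern from observations, and abduction picks the most plausible explanation.

```python
# Toy illustration of the three reasoning styles; everything below is invented.

# Deductive: apply a known causal rule to a specific case.
def deduce(rule, case):
    premise, conclusion = rule
    return conclusion if premise(case) else None

rule = (lambda person: person["exercises_regularly"], "lower risk of heart disease")
print(deduce(rule, {"exercises_regularly": True}))   # -> "lower risk of heart disease"

# Inductive: generalise a pattern from specific observations
# (a correlation with some level of support, not a proven cause).
observations = [("higher_education", "higher_income")] * 9 + [("higher_education", "lower_income")]
support = sum(1 for _, outcome in observations if outcome == "higher_income") / len(observations)
print(f"induced rule: education ~ income (support {support:.0%})")

# Abductive: pick the most plausible explanation for an observed correlation.
explanations = {
    "exercise improves heart health": 0.7,        # plausible causal hypothesis
    "healthier people choose to exercise": 0.5,   # reverse causation
    "coincidence": 0.1,
}
best = max(explanations, key=explanations.get)
print(f"best explanation (still a hypothesis): {best}")
```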

For years, I’ve frequently been frustrated by business books’ over-reliance on correlation rather than causation. In business, establishing clear causal relationships is challenging: companies operate in diverse, dynamic environments where numerous variables influence outcomes, making it difficult to prove that any one action or strategy directly leads to a specific result.

For example, a book might claim that companies adopting a particular management style see higher productivity, using the correlation between the style and productivity as evidence. However, the causation behind this is often unclear. It could be that companies already inclined toward efficiency adopt that style, or that other factors, such as the industry or market conditions, play a bigger role.

Yet many business publications present correlations and inferences drawn from them as if they are concrete truths rather than potential patterns or frameworks that require further scrutiny.

This approach implies that specific strategies guarantee success without considering the unique contexts where they were observed. It misleads readers, who may apply these insights as fact rather than as a starting point for further analysis, adaptation, and personal judgement.

Understanding the distinction between correlation and causation is critical to avoid flawed conclusions and misguided paths that can ultimately prove damaging—a lesson I’ll illustrate in the next section.

Recently, Malcolm Gladwell gave a TED talk where he acknowledged a flaw in a chapter of his book The Tipping Point.

In this chapter, he discussed how the application of the Broken Windows Theory in early 1990s New York was considered a key factor in reducing crime across the city.

The Broken Windows Theory suggests that visible signs of disorder, like broken windows or graffiti, lead to more serious crime by signalling neglect and lawlessness. It argues that addressing minor issues can prevent larger crimes, shaping policing strategies focused on enforcing low-level offences.

In NYC, Broken Windows policing during the 1990s led to aggressive enforcement of minor offences, disproportionately impacting minority communities. This approach involved frequent stops, searches, and arrests, increasing racial disparities, fostering community distrust, and leaving many people with lasting criminal records.

In 2013, the landmark case Floyd v. City of New York found that the NYPD’s stop-and-frisk practices were unconstitutional due to racial targeting. While this ruling didn’t end Broken Windows policing, it pressured the city to reconsider aggressive enforcement tactics often justified by the theory.

As the city moved away from this approach, many feared crime would rise. However, crime continued to decline, challenging the real impact of the Broken Windows Theory, which relied heavily on correlation rather than causation.

Critics argue that the theory’s application led to over-policing, especially in lower-income and minority neighbourhoods, and contributed to strained community-police relations. While NYC saw results that some connect to the theory, it’s debated how much of the drop in crime was due to broken windows policies specifically versus other social and economic factors.

In his TED talk, Gladwell reflects on his reliance on this theory in his book, acknowledging that his analysis was built on a flawed premise.

These errors illustrate how over-reliance on correlation rather than causation can lead to misconceptions and unintended consequences.


You can see Malcolm Gladwell’s TED Talk here:

LLMs have specific limitations in distinguishing correlation from causation, especially in complex topics such as the examples given above.

These weaknesses include:

  • Pattern Recognition Over Causal Understanding: LLMs are designed to identify patterns in large datasets, aligning with correlational insights rather than causal understanding. LLMs may reproduce correlations without assessing if a true cause-effect relationship exists.
  • Limited Contextual Judgement: LLMs lack real-world context, making it difficult for them to weigh historical or social nuances and distinguish between correlation and causation.
  • Inability to Critique Sources: While LLMs can summarise information, they cannot critically evaluate sources or identify biases, which can lead to reproducing flawed assumptions.
  • Susceptibility to Bias Amplification: Because LLMs are trained on datasets that carry human biases, they may reinforce existing narratives and present correlated relationships as fact, risking misleading conclusions.
  • Absence of Real Causal Reasoning: True causal reasoning requires understanding confounding factors, which LLMs cannot independently identify. Without this, LLMs may present patterns as truths.

In essence, LLMs excel at reflecting patterns but they still lack the capability for causal evaluation. Without human oversight, LLMs may default to reproducing correlations as truths.

Companies working on model implementation are adopting various techniques to improve causal understanding in language models, such as integrating causal models, using counterfactual training, and leveraging causal-specific architectures.

These approaches aim to help models better recognise cause-and-effect relationships, especially in complex decision-making. While these techniques will advance causal reasoning, true causality—grasping cause and effect beyond patterns—remains an open challenge in AI. This area is still in development and will take time to fully mature.
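As a rough illustration of what "integrating causal models" means, the sketch below uses a toy structural causal model (all coefficients are invented for illustration, and this is not any vendor's specific technique) to contrast what purely observational data show, P(Y | X), with what an intervention shows, P(Y | do(X)). Pattern-matching over observational text can at best recover the first quantity; genuine causal reasoning needs the second.

```python
# Toy structural causal model (SCM): Z -> X, Z -> Y, X -> Y.
# Contrast the observational association P(Y | X) with the interventional
# quantity P(Y | do(X)) that causal reasoning actually needs.
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

def simulate(intervene_x=None):
    z = rng.normal(0, 1, n)                                   # confounder
    x = z + rng.normal(0, 1, n) if intervene_x is None else np.full(n, float(intervene_x))
    y = 1.0 * x + 2.0 * z + rng.normal(0, 1, n)               # true causal effect of X on Y is 1.0
    return x, y

# Observational estimate: regress Y on X. The confounder Z inflates the slope.
x_obs, y_obs = simulate()
obs_slope = np.polyfit(x_obs, y_obs, 1)[0]
print(f"observational slope   ~ {obs_slope:.2f}")             # ~2.0, biased by Z

# Interventional estimate: set X by fiat (do(X = x)) and compare mean outcomes.
_, y_do0 = simulate(intervene_x=0.0)
_, y_do1 = simulate(intervene_x=1.0)
print(f"interventional effect ~ {y_do1.mean() - y_do0.mean():.2f}")   # ~1.0, the true effect
```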

Understanding both the strengths and limitations of AI Agents, particularly those based on LLMs, is essential when considering their implementation in business solutions.

While these models today excel at recognising patterns and generating insights quickly, their current limitations in causal reasoning mean that they must be applied thoughtfully, with a strong layer of human judgement and oversight.

Businesses looking to integrate AI Agents must be mindful that LLMs are not a replacement for nuanced, causative thinking but rather a powerful tool to aid it.

By recognising where these tools can provide genuine value—and where they require additional analysis or human context—we can avoid costly missteps and ensure that AI applications are both effective and ethically responsible.

This balanced approach will be crucial to leveraging AI to its fullest potential, enabling organisations to deploy AI Agents that are reliable, accurate, and aligned with business goals.

Relevant Papers and Studies

Zečević, M., et al. (2023). Causal Parrots: Large Language Models May Talk Causality But Are Not Causal. https://arxiv.org/pdf/2308.13067

This paper investigates the ability of large language models (LLMs) to perform causal reasoning. The authors argue that while LLMs can generate text that mimics causal understanding, they lack genuine causal inference capabilities. Introducing the concept of meta Structural Causal Models (meta SCMs), the study highlights how LLMs leverage correlations within training data rather than reasoning causally. Empirical results suggest that LLMs function as “causal parrots,” reflecting causal information from their data without true comprehension.

Jin, Z., et al. (2023). Can Large Language Models Infer Causation from Correlation? https://arxiv.org/pdf/2306.05836

This paper introduces the Corr2Cause task and dataset to evaluate the ability of large language models to infer causation from correlational data. The study finds that current LLMs perform poorly in distinguishing causation from correlation, with limited generalisation, highlighting significant challenges and opportunities for improving causal reasoning in these models.

Romanou, A., et al. (2023). CRAB: Assessing the Strength of Causal Relationships Between Real-World Events. https://aclanthology.org/2023.emnlp-main.940.pdf

This paper introduces the Causal Reasoning Assessment Benchmark (CRAB), designed to evaluate large language models’ understanding of causal relationships in real-world narratives. CRAB comprises approximately 2,700 event pairs with fine-grained, contextual causality annotations. The study reveals that current models struggle with complex causal reasoning, particularly when events form intricate causal structures. The authors suggest that enhancing models’ performance in causal reasoning requires addressing these complexities.

Yang, L., et al. (2024). A Critical Review of Causal Reasoning Benchmarks for Large Language Models. https://arxiv.org/pdf/2407.08029

This paper provides a comprehensive review of benchmarks designed to evaluate causal reasoning in large language models, emphasising the importance of incorporating interventional and counterfactual reasoning to enhance the assessment of causal understanding.

Kapkiç, A., et al. (2024). Introducing CausalBench: A Flexible Benchmark Framework for Causal Analysis and Machine Learning. https://arxiv.org/pdf/2404.06349

This paper presents CausalBench, a benchmark suite designed to evaluate causal learning methods across various datasets and tasks. It promotes reproducibility and scientific collaboration in developing and assessing causal inference algorithms.

Zhou, Y., et al. (2024). CausalBench: A Comprehensive Benchmark for Causal Learning Capability of Large Language Models. https://arxiv.org/pdf/2404.06349

This paper also proposes a benchmark named CausalBench, in this case to evaluate the causal reasoning abilities of large language models. It includes tasks of varying complexity and explores causal networks to assess models’ capabilities in causal learning and inference.


