Core Concept: Causal DAGs (Directed Acyclic Graphs) & d-separation
- Maria Alice Maia
- Sep 30, 2024
- 3 min read
Your regression model is beautiful. But are you sure you should be "controlling for" that variable?
This is one of the most subtle—and most damaging—forms of "Doing Data Wrong" that I've seen in my career. We are so used to the idea that "more controls are better" that we often adjust for variables that end up creating bias, not removing it.
The key to avoiding this trap lies in a powerful, visual tool: the Causal Directed Acyclic Graph (DAG). A DAG is a simple map of your assumptions about how the world works. It's a set of nodes (variables) and arrows (causal effects). But its real power comes from a concept called d-separation, which gives us rules for how information—and bias—flows through our system.
Let's break it down with a classic Marketing Department example.
You want to know the effect of sending a Discount Coupon (X) on Customer Purchases (Y). You also have data on whether the customer recently visited your Website (Z). Should you control for the website visit in your analysis?
It depends entirely on your causal assumptions, which we can draw in a DAG.
Scenario 1: The Confounder
Website Visit (Z) → Discount Coupon (X)
Website Visit (Z) → Purchase (Y)
Here, visiting the website is a confounder: a common cause of both receiving the coupon (maybe you get it after a visit) and making a purchase (visitors are more likely to buy). This opens a "backdoor path" (X ← Z → Y) through which spurious correlation flows.
The Right Way: You must control for Website Visit (Z). Doing so closes the backdoor path and isolates the true causal effect of the coupon.
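You can see this in a quick simulation. Here's a minimal sketch in base R; the variable names and effect sizes (a true coupon effect of 1.0, a visit effect of 2.0) are made up purely for illustration:

```r
# Scenario 1 sketch: Z confounds X and Y. All coefficients are illustrative.
set.seed(42)
n <- 10000
z <- rbinom(n, 1, 0.5)              # Website Visit (the confounder)
x <- rbinom(n, 1, 0.2 + 0.5 * z)    # Coupon: much more likely after a visit
y <- 1.0 * x + 2.0 * z + rnorm(n)   # Purchase: true coupon effect is 1.0

coef(lm(y ~ x))["x"]      # inflated: the open backdoor path leaks Z's effect in
coef(lm(y ~ x + z))["x"]  # ~1.0: adjusting for Z closes the backdoor path
```

The naive regression roughly doubles the coupon's apparent effect; adding Z recovers the truth.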
Scenario 2: The Mediator
Discount Coupon (X) → Website Visit (Z) → Purchase (Y)
Here, the website visit is a mediator. The coupon causes the customer to visit the website, which in turn causes them to purchase.
The Wrong Way: If you control for Website Visit (Z), you block the very causal pathway you want to measure. You are essentially asking, "What is the effect of the coupon, ignoring the fact that it drove people to our website?" You will incorrectly conclude the coupon has no effect.
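A quick hypothetical simulation makes the point. In this sketch the coupon works entirely through the visit, so the total effect is real, and adjusting for Z erases it (again, all numbers are illustrative):

```r
# Scenario 2 sketch: Z mediates X -> Y. Coefficients are illustrative.
set.seed(42)
n <- 10000
x <- rbinom(n, 1, 0.5)              # Coupon, assigned at random
z <- rbinom(n, 1, 0.2 + 0.6 * x)    # Visit: driven by the coupon
y <- 2.0 * z + rnorm(n)             # Purchase: happens only via the visit

coef(lm(y ~ x))["x"]      # ~1.2 (0.6 * 2.0): the total effect we actually want
coef(lm(y ~ x + z))["x"]  # ~0: adjusting for the mediator blocks the causal path
```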
Scenario 3: The Collider (The Most Dangerous Case)
Discount Coupon (X) → Website Visit (Z) ← Customer Tech-Savviness (U, unobserved)
Customer Tech-Savviness (U) → Purchase (Y)
Here, the website visit is a collider. Two arrows point into it. Maybe both getting a digital coupon and being tech-savvy cause you to visit the website. And maybe tech-savviness also independently causes you to make a purchase.
The Wrong Way: If you control for Website Visit (Z), you create a spurious correlation between the coupon and tech-savviness. You are essentially creating bias out of thin air. It’s like selecting only for Ivy League graduates to study the effect of a prep course on test scores; you've created a distorted sample.
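A simulation shows how this bias appears out of nowhere. In this sketch the coupon has zero effect on purchases by construction, yet "controlling" for the visit conjures one up (effect sizes are illustrative):

```r
# Scenario 3 sketch: Z is a collider (X -> Z <- U). Coefficients are illustrative.
set.seed(42)
n <- 10000
x <- rbinom(n, 1, 0.5)                      # Coupon, assigned at random
u <- rbinom(n, 1, 0.5)                      # Tech-savviness (unobserved in practice)
z <- rbinom(n, 1, 0.1 + 0.4 * x + 0.4 * u)  # Visit: caused by both X and U
y <- 2.0 * u + rnorm(n)                     # Purchase: driven only by tech-savviness

coef(lm(y ~ x))["x"]      # ~0: the coupon truly has no effect here
coef(lm(y ~ x + z))["x"]  # clearly negative: conditioning on the collider makes
                          # X and U spuriously correlated, biasing the estimate
```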
The math behind d-separation gives us the rules: control for confounders, never control for mediators (if you want the total effect), and never, ever control for colliders.
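You don't have to trace every path by hand. The dagitty package in R (there is also a browser version at dagitty.net) encodes exactly these d-separation rules. Here's a sketch of it confirming the collider logic from Scenario 3:

```r
# A sketch using the dagitty package to automate the d-separation reasoning.
# install.packages("dagitty") if needed.
library(dagitty)

g <- dagitty("dag { X -> Z ; U -> Z ; U -> Y }")  # Scenario 3's structure

# The recommended adjustment set is empty: adjust for nothing.
adjustmentSets(g, exposure = "X", outcome = "Y")

dseparated(g, "X", "Y")       # TRUE: X and Y are independent unconditionally
dseparated(g, "X", "Y", "Z")  # FALSE: conditioning on the collider connects them
```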

As an executive and data scientist, I've seen brilliant analysts fall into the collider trap. They build a "more controlled" model that is actually more biased. Drawing a simple DAG on a whiteboard before running a single line of code can prevent these multi-million dollar mistakes.
This knowledge isn’t mine to keep. It's our collective responsibility to bring this level of rigor to our work.
If you’re ready to move beyond blindly adding variables to `lm(Y ~ X + Z)` and want to join a movement dedicated to true causal thinking, subscribe to my email list.
And if you’re staring at a regression model right now and wondering if you’ve created a collider, book a 20-minute, no-nonsense consultation with me. Let’s draw the DAG together.