Average treatment effects

From difference-in-means to doubly-robust DML — and why the harness prefers the last one.

The average treatment effect is the spine of the platform: the number an A/B test reports. This chapter walks the estimator ladder — naive difference, regression adjustment, inverse-propensity weighting, and the doubly-robust AIPW/DML workhorse — and shows, via the harness, which rungs survive confounding and which quietly break. (Math recycled from notation/ate-estimators.md, after Wager ch. 1–3 and Imbens–Wooldridge 2009.)

Estimand & assumptions

With treatment W\in\{0,1\}, potential outcomes Y(1),Y(0), and observed Y=Y(W), the target is

\tau = \mathbb{E}\!\left[Y_i(1)-Y_i(0)\right].

The fundamental problem is that \Delta_i = Y_i(1)-Y_i(0) is never observed for any single unit. Recovery rests on the standard trio:

  1. SUTVA: Y_i = Y_i(W_i) (no interference; one version of treatment).
  2. Unconfoundedness: \{Y_i(0),Y_i(1)\} \perp\!\!\!\perp W_i \mid X_i.
  3. Overlap: \eta \le e(x) \le 1-\eta for some \eta > 0, where e(x)=\Pr(W=1\mid X=x) is the propensity.

In a randomized experiment, (2) holds by design with e(x)=\pi known — and because Lyra’s DGP sets e(x), overlap holds by construction, which is exactly what makes clean recovery tests possible.
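To make the three assumptions concrete, here is a minimal simulation sketch. The function `simulate`, its parameters, and the functional forms are hypothetical stand-ins chosen for illustration; they are not Lyra's DGP or its API. The propensity is bounded in [0.1, 0.9], so overlap holds with \eta = 0.1, and the observed outcome reveals only one potential outcome per unit.

```python
import numpy as np

def simulate(n, tau=2.0, confounding=0.0, seed=0):
    """Hypothetical DGP sketch (not Lyra's API): one covariate, binary treatment."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    # Propensity depends on x only (unconfoundedness given X) and is
    # bounded in [0.1, 0.9], so overlap holds with eta = 0.1.
    e = 0.1 + 0.8 / (1.0 + np.exp(-confounding * x))
    w = rng.binomial(1, e)
    y0 = np.sin(2.0 * x) + x + rng.normal(size=n)  # nonlinear baseline
    y1 = y0 + tau                                  # constant unit-level effect
    y = np.where(w == 1, y1, y0)                   # SUTVA: Y = Y(W)
    return x, w, y, e, y0, y1

x, w, y, e, y0, y1 = simulate(10_000, confounding=1.5)
print("true ATE:", np.mean(y1 - y0))
print("propensity range:", e.min(), e.max())
```

Only a simulator can return `y0` and `y1` together; real data expose `y` alone, which is why recovery tests need a planted \tau.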

The ladder

Difference-in-means

The randomized baseline, unbiased essentially without assumptions:

\hat\tau_{\text{DM}}=\frac1{n_1}\sum_{W_i=1}Y_i-\frac1{n_0}\sum_{W_i=0}Y_i, \qquad V_{\text{DM}}=\frac{\operatorname{Var}[Y(1)]}{\pi}+\frac{\operatorname{Var}[Y(0)]}{1-\pi}.

Once assignment is not random (confounding, interference, ramp-selection), difference-in-means is biased, and the platform keeps it as its deliberately biased comparator. The harness measures exactly how biased.
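The contrast between random and non-random assignment can be sketched in a few lines. The data-generating choices below (a linear baseline, a logistic propensity with slope 1.5) are assumptions made for illustration, not the harness configuration.

```python
import numpy as np

def diff_in_means(y, w):
    """Return (tau_hat, standard error) for the difference-in-means estimator."""
    y1, y0 = y[w == 1], y[w == 0]
    tau_hat = y1.mean() - y0.mean()
    se = np.sqrt(y1.var(ddof=1) / y1.size + y0.var(ddof=1) / y0.size)
    return tau_hat, se

rng = np.random.default_rng(0)
n, tau = 20_000, 2.0
x = rng.normal(size=n)
noise = rng.normal(size=n)

# Randomized assignment: pi = 0.5 for every unit.
w_rand = rng.binomial(1, 0.5, size=n)
est_rand, se_rand = diff_in_means(x + tau * w_rand + noise, w_rand)

# Confounded assignment: units with larger x are treated more often,
# and x also raises the outcome.
e = 1.0 / (1.0 + np.exp(-1.5 * x))
w_conf = rng.binomial(1, e)
est_conf, se_conf = diff_in_means(x + tau * w_conf + noise, w_conf)

print(f"randomized: {est_rand:.3f} (se {se_rand:.3f})")
print(f"confounded: {est_conf:.3f} (se {se_conf:.3f})")
```

The standard error is about the same in both runs, which is the danger: the confounded estimate is far from \tau while looking equally precise.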

Regression adjustment / CUPED

Fit the interacted regression Y_i \sim \alpha + W_i\tau + X_i\beta + W_i X_i\gamma and report \hat\tau_{\text{IREG}}=\hat\tau+\bar X\hat\gamma. In a randomized experiment, adjusting for pre-treatment covariates this way is asymptotically never worse than difference-in-means, even when the linear model is misspecified. Stated for the balanced design \pi=\tfrac12, with \beta^{*}_{(w)} the population regression slope in arm w, A=\operatorname{Var}[X], and \lVert v\rVert_A^2=v'Av:

V_{\text{IREG}} = V_{\text{DM}} - \lVert \beta^{*}_{(0)}+\beta^{*}_{(1)} \rVert_A^2 \ \le\ V_{\text{DM}}.

This is the CUPED variance-reduction principle (covered in the Metrics chapter): no asymptotic bias, and strictly tighter intervals when X is predictive.
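A small Monte Carlo sketch of the variance claim, under assumed settings: a randomized design with \pi = 0.5, one covariate with mean 1, and an effect that varies with x so that the \bar X\hat\gamma term matters. The helper names are illustrative and not part of the platform.

```python
import numpy as np

def diff_in_means(y, w):
    return y[w == 1].mean() - y[w == 0].mean()

def interacted_ols(y, w, x):
    """ATE from Y ~ 1 + W + X + W*X, reported as tau_hat + xbar @ gamma_hat."""
    x = x.reshape(len(y), -1)
    design = np.column_stack([np.ones(len(y)), w, x, w[:, None] * x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    k = x.shape[1]
    tau_hat, gamma_hat = coef[1], coef[2 + k:]
    return tau_hat + x.mean(axis=0) @ gamma_hat

rng = np.random.default_rng(0)
n, tau, reps = 500, 2.0, 500
dm, ireg = np.empty(reps), np.empty(reps)
for r in range(reps):
    x = rng.normal(loc=1.0, size=n)
    w = rng.binomial(1, 0.5, size=n)  # randomized, pi = 0.5
    # Effect varies with x but averages to tau because E[x - 1] = 0.
    y = 2.0 * x + w * (tau + 0.5 * (x - 1.0)) + rng.normal(size=n)
    dm[r] = diff_in_means(y, w)
    ireg[r] = interacted_ols(y, w, x)

print("mean / sd diff-in-means :", dm.mean(), dm.std(ddof=1))
print("mean / sd interacted OLS:", ireg.mean(), ireg.std(ddof=1))
```

Both estimators center on \tau; the adjusted one has a visibly smaller spread across replications because x explains much of the outcome variance.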

The linearity tax. Plain “control for X” via Y_i = \alpha+\tau W_i+\beta'X_i+\varepsilon_i is not justified by unconfoundedness alone — it silently assumes (i) constant effects and (ii) a linear baseline. With discrete X, OLS and IPW are both weighted averages of stratum effects \tau_x, but with different weights:

\tau^{\text{IPW}}=\sum_x \tau_x\,P(X{=}x)=\tau, \qquad \tau^{\text{OLS}}=\frac{\sum_x \tau_x\, \sigma^2_{W|x}\,P(X{=}x)}{\sum_x \sigma^2_{W|x}\,P(X{=}x)}, \quad \sigma^2_{W|x}=e(x)(1-e(x)).

OLS over-weights the strata with near-50/50 splits (where \sigma^2_{W|x} peaks) and in general equals the ATE only if effects are constant across strata or \sigma^2_{W|x} does not vary across them. IPW targets the ATE. This is the gap the doubly-robust machinery closes.
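A worked two-stratum example of the OLS weighting, with illustrative numbers that do not come from the harness: both strata are equally likely, the first is split 50/50 with effect 1, the second is treated 10% of the time with effect 3. The simulation at the end is a sanity check that a regression additive in W and saturated in X lands on the variance-weighted average rather than the ATE.

```python
import numpy as np

p_x = np.array([0.5, 0.5])    # P(X = x) for the two strata
e_x = np.array([0.5, 0.1])    # propensity e(x) in each stratum
tau_x = np.array([1.0, 3.0])  # stratum treatment effects

ate = np.sum(tau_x * p_x)     # population-weighted average
var_w = e_x * (1.0 - e_x)     # sigma^2_{W|x} = [0.25, 0.09]
tau_ols = np.sum(tau_x * var_w * p_x) / np.sum(var_w * p_x)

# Simulation check: OLS of Y on W plus a stratum dummy.
rng = np.random.default_rng(0)
n = 200_000
s = rng.integers(0, 2, size=n)                # stratum label, P = 1/2 each
w = rng.binomial(1, e_x[s])
y = 1.0 * s + tau_x[s] * w + rng.normal(size=n)
design = np.column_stack([np.ones(n), w, s])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
tau_ols_hat = coef[1]

print("ATE              :", ate)
print("OLS estimand     :", tau_ols)
print("OLS on simulation:", tau_ols_hat)
```

The ATE is 2.0, while the OLS estimand is 0.26 / 0.17, roughly 1.53: the balanced stratum with the smaller effect dominates the weights.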

Inverse-propensity weighting

\hat\tau_{\text{IPW}}=\frac1n\sum_i\left[\frac{W_iY_i}{\hat e(X_i)}-\frac{(1-W_i)Y_i}{1-\hat e(X_i)}\right].

Oracle IPW (true e) is unbiased but inefficient — it ignores the outcome model entirely.
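The formula translates directly into code. This sketch uses the true propensity (the oracle case) on an assumed confounded design; the DGP is illustrative only.

```python
import numpy as np

def ipw(y, w, e):
    """Inverse-propensity-weighted ATE estimate with propensities e."""
    return np.mean(w * y / e - (1 - w) * y / (1 - e))

rng = np.random.default_rng(0)
n, tau = 20_000, 2.0
x = rng.normal(size=n)
e = 0.1 + 0.8 / (1.0 + np.exp(-1.5 * x))  # true propensity, bounded away from 0 and 1
w = rng.binomial(1, e)
y = x + tau * w + rng.normal(size=n)

tau_ipw = ipw(y, w, e)
tau_dm = y[w == 1].mean() - y[w == 0].mean()
print("oracle IPW      :", tau_ipw)
print("diff-in-means   :", tau_dm)
```

The reweighting undoes the selection on x that biases the raw difference, at the cost of extra variance from dividing by small propensities.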

Augmented IPW = doubly robust

The workhorse combines a regression imputation with an IPW correction on its residuals:

\hat\tau_{\text{AIPW}}=\frac1n\sum_i\Big[ \underbrace{\hat\mu_{(1)}(X_i)-\hat\mu_{(0)}(X_i)}_{\text{regression}} +\underbrace{W_i\tfrac{Y_i-\hat\mu_{(1)}(X_i)}{\hat e(X_i)} -(1-W_i)\tfrac{Y_i-\hat\mu_{(0)}(X_i)}{1-\hat e(X_i)}}_{\text{IPW on residuals}}\Big].

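It is doubly robust: consistent if either \hat\mu or \hat e is consistent. With cross-fitting (the DML recipe: split the sample into folds, fit the nuisances on all folds but one, and evaluate the scores on the held-out fold), own-observation overfitting bias is removed and the remaining bias is a product of the two nuisance errors. If those errors are o(n^{-\alpha_\mu}) and o(n^{-\alpha_e}) in RMSE with \alpha_\mu+\alpha_e \ge \tfrac12 (e.g. both nuisances faster than n^{-1/4}),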

\sqrt n\,(\hat\tau_{\text{AIPW}}-\tau)\ \Rightarrow\ N(0, V^{*}), \qquad V^{*}=\operatorname{Var}[\tau(X)]+\mathbb{E}\!\left[\tfrac{\sigma^2_{(1)}(X)}{e(X)}\right] +\mathbb{E}\!\left[\tfrac{\sigma^2_{(0)}(X)}{1-e(X)}\right].

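V^{*} is the semiparametric efficiency bound: no regular estimator has smaller asymptotic variance. The interval uses the influence-function variance \hat V = \tfrac{1}{n-1}\sum_i(\hat\Gamma_i-\hat\tau)^2, where the per-unit score \hat\Gamma_i is the bracketed summand in \hat\tau_{\text{AIPW}}; the 95% interval is \hat\tau_{\text{AIPW}} \pm 1.96\sqrt{\hat V/n}.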

Lyra’s shortcut (Cor 3.3). Because the simulator knows e(x), AIPW attains the efficiency bound V^{*} using any consistent outcome model — no rate condition required. Recovery tests lean on this: the estimator should hit nominal coverage even with a crude \hat\mu.
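A minimal sketch of cross-fitted AIPW in the known-propensity setting described above. The outcome model is deliberately crude (per-arm linear regression against a nonlinear baseline), and the function name, fold count, and DGP are assumptions made for illustration rather than the harness implementation.

```python
import numpy as np

def aipw_crossfit(y, w, x, e, n_folds=5, seed=0):
    """Cross-fitted AIPW with known propensity e and per-arm linear mu_hat."""
    rng = np.random.default_rng(seed)
    n = len(y)
    folds = rng.permutation(n) % n_folds
    design = np.column_stack([np.ones(n), x])
    mu0, mu1 = np.empty(n), np.empty(n)
    for k in range(n_folds):
        train, held_out = folds != k, folds == k
        for arm, mu in ((0, mu0), (1, mu1)):
            rows = train & (w == arm)
            # Fit on the other folds, predict on the held-out fold.
            beta, *_ = np.linalg.lstsq(design[rows], y[rows], rcond=None)
            mu[held_out] = design[held_out] @ beta
    # Per-unit scores: regression contrast plus IPW on the residuals.
    scores = mu1 - mu0 + w * (y - mu1) / e - (1 - w) * (y - mu0) / (1 - e)
    tau_hat = scores.mean()
    se = np.sqrt(scores.var(ddof=1) / n)
    return tau_hat, se

rng = np.random.default_rng(0)
n, tau = 20_000, 2.0
x = rng.normal(size=n)
e = 0.1 + 0.8 / (1.0 + np.exp(-1.5 * x))
w = rng.binomial(1, e)
y = np.sin(2.0 * x) + x + tau * w + rng.normal(size=n)  # nonlinear baseline

tau_hat, se = aipw_crossfit(y, w, x, e)
print(f"AIPW: {tau_hat:.3f}, 95% CI [{tau_hat - 1.96 * se:.3f}, {tau_hat + 1.96 * se:.3f}]")
```

Because the propensity is exact, the residual correction has mean zero whatever the outcome model predicts, so the misspecified linear fit costs precision at most, not validity.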

Certification

Run all four through the harness on DGPLevel1 with a confounding knob and a nonlinear baseline (ate=2.0, R replications). The verdict is unambiguous and measured:

| estimator | confounded · nonlinear | reads |
|---|---|---|
| diff-in-means | bias \approx +1.8, coverage \approx 0 | uncertified |
| OLS (linear) | bias \approx +1.3, coverage \approx 0 | uncertified |
| AIPW | bias \approx 0.04, coverage \approx 0.93 | certified |
| DML (econml) | bias \approx -0.02, coverage \approx 0.90 | certified |
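The bias and coverage entries are Monte Carlo quantities. The loop below is a sketch of how such numbers can be measured; `certify`, `draw`, and the DGP are hypothetical stand-ins, not the harness or DGPLevel1, and the output is not expected to reproduce the table.

```python
import numpy as np

def diff_in_means(y, w):
    y1, y0 = y[w == 1], y[w == 0]
    est = y1.mean() - y0.mean()
    se = np.sqrt(y1.var(ddof=1) / y1.size + y0.var(ddof=1) / y0.size)
    return est, se

def draw(rng, n, tau, confounding):
    """Hypothetical DGP: nonlinear baseline, confounding strength as a knob."""
    x = rng.normal(size=n)
    e = 0.1 + 0.8 / (1.0 + np.exp(-confounding * x))
    w = rng.binomial(1, e)
    y = np.sin(2.0 * x) + x + tau * w + rng.normal(size=n)
    return y, w

def certify(estimator, confounding, tau=2.0, n=2_000, reps=200, seed=0):
    """Return (mean bias, 95% interval coverage) over `reps` replications."""
    rng = np.random.default_rng(seed)
    errors, covered = [], []
    for _ in range(reps):
        y, w = draw(rng, n, tau, confounding)
        est, se = estimator(y, w)
        errors.append(est - tau)
        covered.append(abs(est - tau) <= 1.96 * se)
    return float(np.mean(errors)), float(np.mean(covered))

bias_rand, cov_rand = certify(diff_in_means, confounding=0.0)
bias_conf, cov_conf = certify(diff_in_means, confounding=1.5)
print(f"randomized: bias {bias_rand:+.3f}, coverage {cov_rand:.2f}")
print(f"confounded: bias {bias_conf:+.3f}, coverage {cov_conf:.2f}")
```

Any estimator that returns an estimate and a standard error can be passed to `certify` in place of `diff_in_means`.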

Selection on observables is solvable: the doubly-robust pair recovers the planted \tau with near-nominal coverage while the naive and linear estimators are confidently wrong. The harder case, a hidden confounder that breaks even AIPW and the sensitivity analysis that quantifies it, is the subject of the Observational chapter.