Chapter 8

Effect Estimation

Getting the numbers right: regression, matching, and doubly-robust methods

You have a treatment, an outcome, and some confounders you want to control for. How do you actually compute the causal effect?

This chapter covers the main estimation strategies — from simple regression to the doubly-robust AIPW estimator that Reinforce OS uses under the hood.


Regression Adjustment

The simplest approach: run a regression of the outcome on treatment and confounders.

$$Y_i = \alpha + \tau T_i + \beta X_i + \varepsilon_i$$

The coefficient $\hat{\tau}$ is the estimated ATE, holding $X$ fixed. This works well when:

  • The confounders $X$ are measured correctly
  • The relationship between $X$ and $Y$ is roughly linear
  • You haven't omitted any important confounders

Regression is fast, interpretable, and the workhorse of empirical research. Its weakness: it relies on correctly specifying the outcome model. If the true relationship between $X$ and $Y$ is nonlinear, a linear regression will give biased estimates.
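As a minimal sketch of regression adjustment, the snippet below fits the model above by ordinary least squares with NumPy and reads off the treatment coefficient. The simulated data, the true effect of 2.0, and the function name `regression_adjustment` are assumptions made for illustration.

```python
import numpy as np

def regression_adjustment(y, t, X):
    """Return the OLS coefficient on t from the regression y ~ 1 + t + X."""
    y = np.asarray(y, dtype=float)
    t = np.asarray(t, dtype=float)
    X = np.asarray(X, dtype=float).reshape(len(y), -1)
    design = np.column_stack([np.ones(len(y)), t, X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return float(coef[1])  # position 1 is the treatment column

# Simulated data (assumption): x drives both treatment and outcome,
# and the true treatment effect is 2.0.
rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)
t = (x + rng.normal(size=n) > 0).astype(float)
y = 1.0 + 2.0 * t + 3.0 * x + rng.normal(size=n)

naive = float(y[t == 1].mean() - y[t == 0].mean())  # ignores the confounder
adjusted = regression_adjustment(y, t, x)           # controls for x
```

On data like this the unadjusted difference in means is pulled well above 2.0 by the confounder, while the adjusted coefficient lands close to it, because the linear outcome model is correct by construction.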

Limitations

A common mistake with regression adjustment is using it with high-dimensional confounders and no regularization. With 100 confounders and only 200 observations, OLS will overfit. Use regularized regression (Lasso, Ridge) or a machine learning model instead.
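One way to regularize is a ridge fit that penalizes only the confounder coefficients and leaves the intercept and treatment coefficient alone. The sketch below is a closed-form version under that assumption; the function name `ridge_adjustment`, the penalty value, and the simulated data are illustrative choices, not a recommendation.

```python
import numpy as np

def ridge_adjustment(y, t, X, lam):
    """Ridge fit of y ~ 1 + t + X, penalizing only the confounder coefficients.

    Returns (tau_hat, beta_hat).
    """
    y = np.asarray(y, dtype=float)
    t = np.asarray(t, dtype=float)
    X = np.asarray(X, dtype=float)
    design = np.column_stack([np.ones(len(y)), t, X])
    # Penalty matrix: zero for the intercept and treatment, lam for each confounder.
    penalty = lam * np.eye(design.shape[1])
    penalty[0, 0] = penalty[1, 1] = 0.0
    coef = np.linalg.solve(design.T @ design + penalty, design.T @ y)
    return float(coef[1]), coef[2:]

# Simulated data at the scale mentioned in the text: 200 observations, 100 confounders.
rng = np.random.default_rng(1)
n, p = 200, 100
X = rng.normal(size=(n, p))
t = (X[:, 0] + rng.normal(size=n) > 0).astype(float)
y = 2.0 * t + X[:, :5].sum(axis=1) + rng.normal(size=n)

tau_ols, beta_ols = ridge_adjustment(y, t, X, lam=0.0)      # lam = 0 is plain OLS
tau_ridge, beta_ridge = ridge_adjustment(y, t, X, lam=10.0)  # shrinks the confounder block
```

With `lam=0.0` the function reduces to OLS; a positive `lam` shrinks the confounder coefficients toward zero. In practice you would pick `lam` by cross-validation rather than fixing it by hand.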


Propensity Score Methods

The propensity score is the probability of receiving treatment given covariates:

$$e(x) = P(T = 1 \mid X = x)$$

Rosenbaum & Rubin (1983) showed a remarkable result: if conditioning on the full covariate vector $X$ removes confounding, then conditioning on the propensity score alone is also sufficient, even though $e(x)$ is a single number.

This dimension reduction property makes propensity scores powerful when you have many confounders.

Propensity Score Matching

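Match each treated unit to a control unit with a similar propensity score, then average the difference in outcomes across matched pairs. Because every pair is built around a treated unit, this average estimates the effect among the treated (the ATT). To target the ATE, also match each control unit to a similar treated unit.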

Treated unit: e(x) = 0.73  →  match with  Control unit: e(x) = 0.71
Treated unit: e(x) = 0.41  →  match with  Control unit: e(x) = 0.40
...

Matching removes imbalance in observed confounders. After matching, the treated and control groups should look similar on all measured covariates, much as they would under randomization. Unlike randomization, matching does nothing about unmeasured confounders.
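Here is a minimal sketch of one-to-one nearest-neighbor matching (with replacement) on propensity scores that are already estimated. The scores reuse the illustration above plus one extra control; the outcome values and the function name `match_on_propensity` are hypothetical. Since only treated units are matched, the result is an average over treated units.

```python
import numpy as np

def match_on_propensity(y, t, e):
    """Average outcome gap between each treated unit and its nearest control."""
    y, t, e = (np.asarray(a, dtype=float) for a in (y, t, e))
    treated = np.flatnonzero(t == 1)
    control = np.flatnonzero(t == 0)
    # Distance in propensity score between every treated and every control unit.
    dist = np.abs(e[treated][:, None] - e[control][None, :])
    nearest = control[dist.argmin(axis=1)]  # index of the closest control
    return float(np.mean(y[treated] - y[nearest]))

# Two treated units, three controls (outcomes are made up for illustration).
e = np.array([0.73, 0.41, 0.71, 0.40, 0.15])
t = np.array([1, 1, 0, 0, 0])
y = np.array([10.0, 6.0, 7.0, 5.0, 1.0])

effect = match_on_propensity(y, t, e)  # pairs: (0.73, 0.71) and (0.41, 0.40)
```

The control with score 0.15 is never used, which is the point: it has no comparable treated unit. A real analysis would also check covariate balance after matching and often impose a caliper (a maximum allowed score distance).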

Inverse Probability Weighting (IPW)

Instead of matching, weight each observation by the inverse of its probability of receiving the treatment it actually received:

$$\hat{\text{ATE}}_{\text{IPW}} = \frac{1}{n}\sum_i \left[\frac{T_i Y_i}{e(X_i)} - \frac{(1-T_i) Y_i}{1 - e(X_i)}\right]$$

Treated units with low propensity scores (they were unlikely to be treated) get high weight — they're informative precisely because they were treated despite the odds. Control units with high propensity scores (they "could have been" treated) also get high weight.

IPW creates a pseudo-population where treatment is independent of covariates — mimicking randomization.

Weakness of IPW: extreme propensity scores ($e(x)$ near 0 or 1) create extreme weights, inflating variance. Common fixes: weight trimming, stabilized weights.
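The IPW formula translates almost line for line into NumPy. The sketch below assumes the propensity scores are already estimated, and adds two variance-reduction variants: clipping the scores (one form of trimming) and normalizing the weights to sum to one within each arm, which for a binary treatment gives the same result as using stabilized weights. The function names and the toy numbers are assumptions for illustration.

```python
import numpy as np

def ipw_ate(y, t, e, clip=None):
    """Plain IPW estimate; optionally clip scores into [clip, 1 - clip]."""
    y, t, e = (np.asarray(a, dtype=float) for a in (y, t, e))
    if clip is not None:
        e = np.clip(e, clip, 1 - clip)
    return float(np.mean(t * y / e - (1 - t) * y / (1 - e)))

def ipw_ate_normalized(y, t, e):
    """IPW with weights rescaled to sum to one within each treatment arm."""
    y, t, e = (np.asarray(a, dtype=float) for a in (y, t, e))
    w1 = t / e              # weights for treated units
    w0 = (1 - t) / (1 - e)  # weights for control units
    return float(np.sum(w1 * y) / np.sum(w1) - np.sum(w0 * y) / np.sum(w0))

# Toy data: the third unit was treated despite a low score, so it gets weight 4.
y = np.array([4.0, 2.0, 6.0, 1.0])
t = np.array([1, 0, 1, 0])
e = np.array([0.5, 0.5, 0.25, 0.75])

plain = ipw_ate(y, t, e)
clipped = ipw_ate(y, t, e, clip=0.3)
normalized = ipw_ate_normalized(y, t, e)
```

With only four observations the three estimates differ noticeably, which illustrates how much a couple of large weights can move plain IPW.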


The Doubly-Robust AIPW Estimator

The cleanest solution to the limitations of both regression and IPW is to combine them. The Augmented Inverse Probability Weighted (AIPW) estimator does exactly this:

$$\hat{\text{ATE}}_{\text{AIPW}} = \frac{1}{n}\sum_i \left[\hat{\mu}_1(X_i) - \hat{\mu}_0(X_i) + \frac{T_i(Y_i - \hat{\mu}_1(X_i))}{e(X_i)} - \frac{(1-T_i)(Y_i - \hat{\mu}_0(X_i))}{1 - e(X_i)}\right]$$

This looks complex. Let's break it apart:

  • $\hat{\mu}_1(X_i) - \hat{\mu}_0(X_i)$: the regression-adjusted estimate (outcome model)
  • The remaining terms: inverse-probability-weighted residuals that correct for the outcome model's prediction errors
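To make the two pieces concrete, the sketch below computes them separately on four made-up observations. The predictions `mu1` and `mu0` and the scores `e` are assumed to come from models fit elsewhere; all numbers and the function name `aipw_parts` are hypothetical.

```python
import numpy as np

def aipw_parts(y, t, e, mu1, mu0):
    """Return (regression part, IPW correction, AIPW estimate)."""
    y, t, e, mu1, mu0 = (np.asarray(a, dtype=float) for a in (y, t, e, mu1, mu0))
    plug_in = np.mean(mu1 - mu0)  # outcome-model estimate
    # Weighted residuals: nonzero only where the outcome model missed.
    correction = np.mean(t * (y - mu1) / e - (1 - t) * (y - mu0) / (1 - e))
    return float(plug_in), float(correction), float(plug_in + correction)

y = np.array([5.0, 1.0, 7.0, 2.0])
t = np.array([1, 0, 1, 0])
e = np.array([0.5, 0.5, 0.25, 0.75])
mu1 = np.array([4.0, 3.0, 6.0, 3.0])  # predicted outcome if treated
mu0 = np.array([2.0, 1.0, 3.0, 1.0])  # predicted outcome if untreated

plug_in, correction, estimate = aipw_parts(y, t, e, mu1, mu0)
# plug_in = 2.25, correction = 0.5, estimate = 2.75
```

If the outcome model predicted every observed outcome perfectly, the residuals would vanish and the estimate would equal the regression part alone.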

Why "Doubly Robust"?

AIPW has a remarkable property: it gives a consistent estimate of the ATE if either the outcome model or the propensity model is correctly specified — not necessarily both.

  • If you get the outcome model right but the propensity model wrong → consistent
  • If you get the propensity model right but the outcome model wrong → consistent
  • If you get both right → efficient (lowest possible variance)
  • If you get both wrong → biased

You get two chances to be right. This "double robustness" is why AIPW is now standard in modern causal inference.
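You can see the "two chances" in a small simulation. The sketch below plugs the true nuisance functions and deliberately wrong ones into the AIPW formula, rather than estimating them, so that the only thing varying is which model is correct. The data-generating process (true effect 2.0, logistic propensity) and the "wrong" models (outcome predicted as 0, propensity fixed at 0.5) are assumptions chosen for illustration.

```python
import numpy as np

def aipw(y, t, e, mu1, mu0):
    """AIPW estimate of the ATE from outcomes, treatment, scores, and predictions."""
    return float(np.mean(mu1 - mu0
                         + t * (y - mu1) / e
                         - (1 - t) * (y - mu0) / (1 - e)))

rng = np.random.default_rng(0)
n = 20000
x = rng.normal(size=n)
e_true = 1 / (1 + np.exp(-x))                  # true propensity score
t = (rng.random(n) < e_true).astype(float)
y = 2.0 * t + 3.0 * x + rng.normal(size=n)     # true ATE is 2.0

mu1_true, mu0_true = 2.0 + 3.0 * x, 3.0 * x    # correct outcome model
mu_wrong = np.zeros(n)                          # misspecified outcome model
e_wrong = np.full(n, 0.5)                       # misspecified propensity model

outcome_right = aipw(y, t, e_wrong, mu1_true, mu0_true)
propensity_right = aipw(y, t, e_true, mu_wrong, mu_wrong)
both_wrong = aipw(y, t, e_wrong, mu_wrong, mu_wrong)
```

The first two estimates should land near 2.0, while the third inherits the confounding from `x` and overshoots. The run with only the propensity model right is noisier, since it is effectively plain IPW.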

[Figure: Doubly robust estimation: two statistical ropes, one slightly nervous effect estimate.]

🔍 Why Reinforce OS uses AIPW

Reinforce OS uses AIPW as its primary estimator for observational analysis. When you analyze data that was collected without randomization, or when you want to adjust for confounders measured during a randomized experiment, AIPW stays consistent if either of its two models is correct, which makes it a more reliable default than regression or IPW alone.

The outcome model $\hat{\mu}(X)$ is fit using regularized regression. The propensity model $e(X)$ uses logistic regression. Both models are cross-fit (each observation's predictions come from models trained on the other folds) to avoid overfitting bias.


Cross-Fitting: Making ML-Based Estimation Valid

When you use machine learning for the outcome model or propensity model inside AIPW, a subtle problem arises: if you use the same data to fit the model and evaluate the estimator, the in-sample fit is too good, creating bias.

The solution is cross-fitting (also called sample splitting):

  1. Split data into $K$ folds
  2. For each fold $k$: fit $\hat{\mu}$ and $e$ on the other $K-1$ folds
  3. Evaluate the AIPW estimator using the predictions for fold $k$
  4. Average across all folds

This ensures predictions are always out-of-sample. The resulting estimator is an instance of DML (Double Machine Learning, Chernozhukov et al. 2018). Provided both nuisance models converge fast enough, it achieves the semiparametric efficiency bound: the lowest possible asymptotic variance for this class of estimators.

Fold 1: train on folds 2-5, evaluate on fold 1
Fold 2: train on folds 1,3-5, evaluate on fold 2
...
Fold 5: train on folds 1-4, evaluate on fold 5
Average the estimates
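The fold loop above can be sketched in NumPy as follows. To stay self-contained, the sketch uses plain OLS for the outcome models and a hand-rolled Newton solver for logistic regression; these stand in for whatever flexible learners you would actually use. The helper names, the score clipping at 0.01, and the simulated data are all assumptions for illustration.

```python
import numpy as np

def add_intercept(X):
    return np.column_stack([np.ones(len(X)), X])

def fit_ols(X, y):
    coef, *_ = np.linalg.lstsq(add_intercept(X), y, rcond=None)
    return coef

def fit_logistic(X, t, iters=25):
    """Logistic regression by Newton's method (tiny ridge term for stability)."""
    A = add_intercept(X)
    w = np.zeros(A.shape[1])
    for _ in range(iters):
        p = 1 / (1 + np.exp(-A @ w))
        hess = (A * (p * (1 - p))[:, None]).T @ A + 1e-8 * np.eye(len(w))
        w = w + np.linalg.solve(hess, A.T @ (t - p))
    return w

def cross_fit_aipw(y, t, X, k=5, seed=0):
    """AIPW where every nuisance prediction comes from models fit on other folds."""
    n = len(y)
    folds = np.random.default_rng(seed).permutation(n) % k  # balanced random folds
    psi = np.empty(n)  # per-observation AIPW terms
    for f in range(k):
        test, train = folds == f, folds != f
        tr1, tr0 = train & (t == 1), train & (t == 0)
        A_test = add_intercept(X[test])
        mu1 = A_test @ fit_ols(X[tr1], y[tr1])
        mu0 = A_test @ fit_ols(X[tr0], y[tr0])
        e = 1 / (1 + np.exp(-A_test @ fit_logistic(X[train], t[train])))
        e = np.clip(e, 0.01, 0.99)  # guard against extreme weights
        yt, tt = y[test], t[test]
        psi[test] = mu1 - mu0 + tt * (yt - mu1) / e - (1 - tt) * (yt - mu0) / (1 - e)
    return float(psi.mean())

# Simulated data (assumption): two confounders, true ATE of 2.0.
rng = np.random.default_rng(0)
n = 4000
X = rng.normal(size=(n, 2))
p = 1 / (1 + np.exp(-(0.5 * X[:, 0] - 0.5 * X[:, 1])))
t = (rng.random(n) < p).astype(float)
y = 2.0 * t + X[:, 0] + X[:, 1] + rng.normal(size=n)

estimate = cross_fit_aipw(y, t, X)
```

Averaging the per-observation terms over the whole sample is the same as averaging the per-fold estimates when the folds have equal size, as they do here.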

Regression Discontinuity

When you can't randomize and don't have enough covariates to use AIPW, regression discontinuity (RD) is a powerful alternative — if your assignment has a sharp threshold.

The idea: people just above and just below the threshold are essentially comparable. The discontinuity in outcome at the threshold estimates the causal effect.

Classic example: students scoring just above vs. just below the cutoff for a scholarship program. Near the cutoff, assignment is essentially random, even though globally it's determined by score.

In an RD design, the estimator is:

$$\hat{\tau}_{\text{RD}} = \lim_{x \downarrow c} \mathbb{E}[Y \mid X = x] - \lim_{x \uparrow c} \mathbb{E}[Y \mid X = x]$$

Here $X$ is the running variable (the score that determines assignment) and $c$ is the cutoff. The limits from above and below the cutoff are estimated by local linear regression near the threshold.
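A minimal sketch of that estimate: fit a separate line on each side of the cutoff, using only points within a bandwidth, and take the gap between the two fitted values at the cutoff. The fixed bandwidth, the uniform weighting, the function name `rd_estimate`, and the noise-free toy data are assumptions for illustration; applied work typically uses kernel weights and a data-driven bandwidth.

```python
import numpy as np

def rd_estimate(x, y, cutoff, bandwidth):
    """Jump in the fitted outcome at the cutoff, from two local linear fits."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    def fitted_at_cutoff(mask):
        # Center x at the cutoff so the intercept is the fitted value there.
        design = np.column_stack([np.ones(mask.sum()), x[mask] - cutoff])
        coef, *_ = np.linalg.lstsq(design, y[mask], rcond=None)
        return coef[0]

    above = (x >= cutoff) & (x <= cutoff + bandwidth)
    below = (x < cutoff) & (x >= cutoff - bandwidth)
    return float(fitted_at_cutoff(above) - fitted_at_cutoff(below))

# Toy data: outcome rises linearly in the score and jumps by 3 at the cutoff of 0.
score = np.arange(-20, 21) / 20.0
outcome = 1.0 + 0.5 * score + 3.0 * (score >= 0)

jump = rd_estimate(score, outcome, cutoff=0.0, bandwidth=0.5)
```

Because the toy data is exactly linear on each side, the estimate recovers the jump of 3 up to floating-point error. With real data the bandwidth trades bias (too wide) against variance (too narrow).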


Difference-in-Differences

When you have panel data (the same units observed over time), difference-in-differences (DiD) is a workhorse design.

Setup: some units receive treatment at time $t_0$, others never do. Compare the change in outcomes for treated units to the change for control units:

$$\hat{\tau}_{\text{DiD}} = (\bar{Y}_{\text{treated, after}} - \bar{Y}_{\text{treated, before}}) - (\bar{Y}_{\text{control, after}} - \bar{Y}_{\text{control, before}})$$

The key assumption: parallel trends. Absent treatment, treated and control units would have moved in parallel. This cannot be tested directly, because it concerns a counterfactual you never observe, but you can check whether the two groups' trends were parallel before treatment.
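The two-group, two-period version of the estimator is just four cell means. The sketch below computes it from long-format arrays; the function name `did_estimate` and the toy numbers are made up for illustration.

```python
import numpy as np

def did_estimate(y, treated, post):
    """Difference-in-differences from the four group-by-period means."""
    y = np.asarray(y, dtype=float)
    treated = np.asarray(treated)
    post = np.asarray(post)

    def cell_mean(group, period):
        return y[(treated == group) & (post == period)].mean()

    change_treated = cell_mean(1, 1) - cell_mean(1, 0)
    change_control = cell_mean(0, 1) - cell_mean(0, 0)
    return float(change_treated - change_control)

# Treated units go from a mean of 11 to 16; controls go from 9 to 11.
y       = [10, 12, 15, 17,  8, 10, 10, 12]
treated = [ 1,  1,  1,  1,  0,  0,  0,  0]
post    = [ 0,  0,  1,  1,  0,  0,  1,  1]

effect = did_estimate(y, treated, post)  # (16 - 11) - (11 - 9) = 3.0
```

The control group's change of 2 is the estimate of what would have happened to the treated group anyway, which is exactly what parallel trends asserts.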

DiD is everywhere in economics: evaluating minimum wage laws (Card & Krueger 1994), policy changes, and natural experiments of all kinds.


Choosing Your Estimator

| Situation | Recommended approach |
| --- | --- |
| Randomized experiment | Simple difference in means (or regression for precision) |
| Randomized + covariates | AIPW for efficiency gain |
| Observational + rich covariates | AIPW with ML outcome and propensity models |
| Sharp assignment threshold | Regression discontinuity |
| Panel data, parallel trends plausible | Difference-in-differences |
| Instrument available | Instrumental variables (IV) |
| Can't satisfy any of the above | Sensitivity analysis (Chapter 9) |

What Reinforce OS Does

When you run an observational analysis in Reinforce OS — for example, analyzing logged behavioral data rather than a controlled experiment — the engine:

  1. Fits an outcome model $\hat{\mu}(T, X)$ using regularized regression
  2. Fits a propensity model $e(X)$ using logistic regression
  3. Applies 5-fold cross-fitting
  4. Computes the AIPW estimate with bootstrap confidence intervals
  5. Reports the result with a plain-English interpretation and the full posterior distribution
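For intuition about step 4, here is a generic percentile bootstrap: resample rows with replacement, recompute the estimate each time, and read the interval off the quantiles. This is a sketch of the general technique, not Reinforce OS's implementation. The function names, the 1,000 resamples, and the use of a simple difference in means (standing in for AIPW to keep the example short) are all assumptions.

```python
import numpy as np

def bootstrap_ci(estimator, data, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for estimator(*data), resampling rows."""
    rng = np.random.default_rng(seed)
    n = len(data[0])
    stats = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # sample row indices with replacement
        stats[b] = estimator(*(a[idx] for a in data))
    lower, upper = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return float(lower), float(upper)

def diff_in_means(y, t):
    return y[t == 1].mean() - y[t == 0].mean()

# Simulated randomized experiment (assumption): true effect is 2.0.
rng = np.random.default_rng(0)
n = 1000
t = rng.integers(0, 2, size=n).astype(float)
y = 2.0 * t + rng.normal(size=n)

estimate = float(diff_in_means(y, t))
lower, upper = bootstrap_ci(diff_in_means, (y, t))
```

Any estimator that takes arrays and returns a number can be passed in place of `diff_in_means`, including a cross-fit AIPW function, at the cost of refitting the models on every resample.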

For controlled experiments (where you randomized assignment), the simple difference in means is used, with Bayesian updating to give you a posterior over the effect size.


Summary

  • Regression adjustment is simple and interpretable but relies on correct model specification
  • Propensity score methods (matching, IPW) reduce confounding by balancing covariate distributions
  • AIPW combines both: doubly robust (consistent if either model is correct), efficient if both are
  • Cross-fitting lets you use flexible ML models inside AIPW without overfitting bias
  • Regression discontinuity and DiD are powerful designs for specific data structures
  • Reinforce OS uses AIPW with cross-fitting as its default observational estimator

Next: When Experiments Fail — what to do when you can't randomize and you're not sure your assumptions hold.