Counterfactual Estimation via mlsynth: A Practical Guide

Jared Greathouse

2025-06-01

Counterfactual Estimation

  • Maximizing value in business and crafting effective public policy depend on knowing whether policies, promotions, new web features, taxes, or other interventions meaningfully affect the metrics we care about.

  • Researchers commonly use Difference-in-Differences designs and Synthetic Control methods (DID and SCM) to measure causal impacts.

  • Both approaches use weighted averages of control units, with the weights learned from the pre-treatment period, to estimate counterfactuals: the post-treatment values of the outcome we would see absent treatment.

DID and SCM

Both methods solve a least-squares problem over the pre-treatment periods.

  • For DID, we have
    \[ \begin{aligned} \underset{\beta_0 \in \mathbb{R},\; \mathbf{w} \in \mathbb{R}^{N_0}}{\operatorname*{argmin}} &\quad \lVert \mathbf{y}_1 - \mathbf{Y}_0 \mathbf{w} - \beta_0 \rVert_2^2, \\ \text{s.t.} &\quad \mathbf{w} = \frac{1}{N_0} \mathbf{1}_{N_0}, \end{aligned} \]
    so only the intercept \(\beta_0\) is actually estimated.

  • SCM imposes \(\Delta := \{\mathbf{w} \in \mathbb{R}_{\geq 0}^{N_0} \mid \lVert \mathbf{w} \rVert_1 = 1\}.\) We learn the weights via
    \[ \underset{\mathbf{w} \in \Delta}{\operatorname*{argmin}} \lVert \mathbf{y}_1 - \mathbf{Y}_0 \mathbf{w} \rVert_2^2. \]

Note that there is no intercept; we presume a convex combination of the donor pool fits the treated unit well over the pre-treatment period.
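
To make the two objectives concrete, here is a minimal sketch using cvxpy (one of mlsynth's dependencies, listed in the installation section). The simulated y1 and Y0 are my own illustration, not data from the talk.

import numpy as np
import cvxpy as cp

rng = np.random.default_rng(1)
T0, N0 = 20, 8
Y0 = rng.normal(size=(T0, N0)).cumsum(axis=0)              # donor outcomes, pre-treatment
y1 = Y0 @ np.full(N0, 1 / N0) + 0.5 * rng.normal(size=T0)  # treated unit tracks the donor average plus noise

# DID: weights fixed at 1/N0, so only the intercept beta_0 is estimated.
beta0 = (y1 - Y0.mean(axis=1)).mean()

# SCM: nonnegative weights summing to one, no intercept.
w = cp.Variable(N0, nonneg=True)
problem = cp.Problem(cp.Minimize(cp.sum_squares(y1 - Y0 @ w)), [cp.sum(w) == 1])
problem.solve()

print("DID intercept:", round(beta0, 3))
print("SCM weights:", np.round(w.value, 3))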

Potential Problems

However, DID and SCM can struggle in certain settings.

  • DID requires \(\mathbb{E}[y_{1t}(0)] - \mathbb{E}[y_{\mathcal{N}_0 t}(0)] = \mathbb{E}[y_{1,t-1}(0)] - \mathbb{E}[y_{\mathcal{N}_0,t-1}(0)]\), i.e., parallel trends. This does not hold when the average trend of the controls differs from the trend of the treated unit/group.
  • SCM struggles when \(N_0 \gg T_0\) and is prone to overfitting in that setting. Smaller donor pools are also preferable to mitigate interpolation biases, but picking a good donor pool can be challenging when there are very many controls (a small simulation after this list illustrates the overfitting point).

  • Convex SCM also favors sparse solutions. But what if the true weight vector is not mostly zero? What if the data-generating process is actually dense rather than sparse?
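
As a small illustration of the overfitting point (my own simulation, not taken from any paper), a simplex-weighted fit over pure-noise donors improves mechanically as the donor pool grows, even though no donor carries any signal about the treated unit.

import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
T0 = 10                              # few pre-treatment periods
Y0 = rng.normal(size=(T0, 500))      # 500 donors of pure noise
y1 = rng.normal(size=T0)             # treated unit: unrelated noise

for N0 in (5, 50, 500):              # nested donor pools
    w = cp.Variable(N0, nonneg=True)
    objective = cp.Minimize(cp.sum_squares(y1 - Y0[:, :N0] @ w))
    problem = cp.Problem(objective, [cp.sum(w) == 1])
    problem.solve()
    print(f"N0 = {N0:3d}, pre-treatment RMSE: {np.sqrt(problem.value / T0):.3f}")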

Potential Solutions

  • Fortunately, econometricians have crafted methods which address these issues.

  • Many of these methods are re-formulations or relaxations of the standard toolkit.

Potential Solutions, cont.

For example…

  • Forward DID uses forward selection to choose a donor pool/control group for a single treated unit, using the DID estimator.

  • However, its only existing implementations are in MATLAB and R.

  • MATLAB, of course, is not free. Both the MATLAB and R code also require users to supply a wide dataframe and to manually specify the number of pre-treatment periods (see the R snippet below).

  • This may work fine for one study, but what if we do not have 44 pre-treatment periods? (A sketch after the snippet shows how this number can be inferred from the data instead.)

data = read.csv("GDP.csv", sep=",", header = TRUE)

t = dim(data)[1]               # total number of periods
no_control = dim(data)[2] - 1  # number of control units (wide data: one column per unit)
control_ID = 1:no_control
t1 = 44                        # t1 is the pretreatment sample size, hard-coded
y = data[, 1]                  # treated unit's outcome
y1 = data[1:t1, 1]             # pre-treatment outcomes
y2 = data[(t1+1):t, 1]         # post-treatment outcomes
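
For contrast, with a long dataframe that carries a treatment dummy (the file name and column names below are hypothetical), the pre-treatment sample size can be inferred rather than hard-coded:

import pandas as pd

# Hypothetical long-format file with columns: unit, time, gdp, treated (0/1).
df = pd.read_csv("GDP_long.csv")
treated_unit = df.loc[df["treated"] == 1, "unit"].iloc[0]
treated_rows = df[df["unit"] == treated_unit].sort_values("time")
t1 = int((treated_rows["treated"] == 0).sum())    # pre-treatment periods, inferred
print(f"Pre-treatment periods: {t1}")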

  • Robust Synthetic Control uses PCA to denoise the donor pool and uses the resulting low-rank approximation to learn the weights (a rough sketch of the idea follows this list). An upshot is that, at least in its authors' empirical applications, the method does not appear to rely on the additional covariates that standard SCMs need to attain acceptable fit.

  • However, one existing implementation requires the user to manually specify the number of principal components/singular values as well as the lambda parameter. Another requires an even more complex setup.

  • The existing implementations (arguably) presume a more advanced knowledge of machine-learning, econometrics, and/or programming than the modal marketing scientist, policy analyst/economist, or data scientist is likely to have.

  • Many of the steps critical to the analysis are not automated, leaving room for small errors.

  • This combination makes it harder/more cumbersome for the applied scientist to use these methods in practice.
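
Here is a rough sketch of the denoising idea behind Robust SCM (my own simplified version, not the package's implementation): truncate the SVD of the donor matrix, then regress the treated unit's pre-treatment outcomes on the denoised donors.

import numpy as np

rng = np.random.default_rng(2)
T0, N0, k = 40, 15, 3
signal = rng.normal(size=(T0, k)) @ rng.normal(size=(k, N0))   # low-rank structure
Y0 = signal + 0.3 * rng.normal(size=(T0, N0))                  # noisy donor matrix
y1 = signal @ np.full(N0, 1 / N0) + 0.3 * rng.normal(size=T0)

# Keep only the top-k singular values to denoise the donor pool.
U, s, Vt = np.linalg.svd(Y0, full_matrices=False)
s[k:] = 0.0
Y0_denoised = U @ np.diag(s) @ Vt

# Learn weights by least squares on the denoised donors (no simplex constraint here).
w, *_ = np.linalg.lstsq(Y0_denoised, y1, rcond=None)
print("Learned weights:", np.round(w, 3))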

Enter mlsynth

  • The Python library mlsynth is meant to further democratize causal inference and econometrics, making its benefits available to a wider class of researchers.

  • mlsynth automates these estimators (and more!).

  • Its syntax is very simple, and the steps are totally automated.

  • This presentation will cover mlsynth and how to use it in practice. I will describe a few of the algorithms and give examples of how to use them.

Installing mlsynth

To install mlsynth, we need:

  • Python 3.9 or later

  • cvxpy, matplotlib, numpy, pandas, scipy, scikit-learn, statsmodels, pydantic

  • And then, we can install this from the command line

pip install -U git+https://github.com/jgreathouse9/mlsynth.git
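
If the install worked, the imports used later in this guide should succeed:

# Quick sanity check: import the estimators and the diagnostic plot helper.
from mlsynth import CLUSTERSC, FDID
from mlsynth.utils.helperutils import sc_diagplot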

Using mlsynth

  • A long df with one observation per unit per time period.

  • A Time variable (numeric or date-time are allowed).

  • A unit variable (a string, to know which units are which).

  • A numeric outcome variable.

  • A treatment dummy: 1 when the unit is treated and the treatment is active, and 0 for all other unit-time observations.
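
As a toy illustration (hypothetical units and values), a dataframe in the expected format looks like this:

import pandas as pd

# One row per unit per period, with a 0/1 dummy marking treated unit-periods.
toy = pd.DataFrame({
    "Unit":    ["A", "A", "A", "B", "B", "B"],
    "Time":    [2000, 2001, 2002, 2000, 2001, 2002],
    "Outcome": [10.0, 11.0, 9.5, 12.0, 12.5, 13.0],
    "Treated": [0, 0, 1, 0, 0, 0],   # unit A is treated in 2002
})
print(toy)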

Proposition 99

The data are available in the mlsynth repository (the URL appears in the code below).

This is the classic example from Abadie, Diamond, and Hainmueller (2010).

import pandas as pd
from mlsynth.utils.helperutils import sc_diagplot

# Load smoking data
url = "https://raw.githubusercontent.com/jgreathouse9/mlsynth/main/basedata/smoking_data.csv"
df = pd.read_csv(url)

# Configure the diagnostic plot (sc_diagplot)
plotconfig = {
    "df": df,
    "outcome": df.columns[2],
    "treat": df.columns[-1],
    "unitid": df.columns[0],
    "time": df.columns[1],
    "display_graphs": True,
    "save": False,
    "counterfactual_color": "red",
    "method": "RPCA",
    "Frequentist": True
}

sc_diagplot([plotconfig])

Parallel trends does not appear to hold for California with respect to its donor pool: California's cigarette consumption declines more steeply than the average of the donor pool. No matter what scalar constant we shift the donor-pool mean by, the difference from California would not be constant over time. This suggests two possible fixes: either select a subset of controls for which parallel trends holds, or re-weight the donor pool to better match the pre-intervention trajectory.
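
To see this numerically, we can compare California's pre-treatment series against the simple donor average. This sketch reuses the positional column references from the configs in this guide rather than assuming column names.

import pandas as pd

url = "https://raw.githubusercontent.com/jgreathouse9/mlsynth/main/basedata/smoking_data.csv"
df = pd.read_csv(url)
unit, time, outcome, treat = df.columns[0], df.columns[1], df.columns[2], df.columns[-1]

# Pre-treatment periods are those where the treated unit's dummy is still 0.
treated_unit = df.loc[df[treat] == 1, unit].iloc[0]
pre_periods = df.loc[(df[unit] == treated_unit) & (df[treat] == 0), time]

pre = df[df[time].isin(pre_periods)]
treated_series = pre[pre[unit] == treated_unit].set_index(time)[outcome]
donor_average = pre[pre[unit] != treated_unit].groupby(time)[outcome].mean()

# If this gap trends over time, no constant shift of the donor mean can align the series.
print((treated_series - donor_average).describe())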

Here are the results of the CLUSTERSC class, which implements the Robust Synthetic Control method. The model fits the pre-treatment trajectory very well without using any of the seven covariates that the original paper relies on.

import pandas as pd
from mlsynth import CLUSTERSC

url = "https://raw.githubusercontent.com/jgreathouse9/mlsynth/refs/heads/main/basedata/smoking_data.csv"
data = pd.read_csv(url)

SCconfig = {
    "df": data,
    "outcome": data.columns[2],
    "treat": data.columns[-1],
    "unitid": data.columns[0],
    "time": data.columns[1],
    "display_graphs": True,
    "save": False,
    "counterfactual_color": ["blue"], "method": "PCR", "cluster": False
}

SCResult = CLUSTERSC(SCconfig).fit()

Of course, this estimator presumes that some units should matter more than others (hence the weighting scheme).

import pandas as pd
from mlsynth import FDID

url = "https://raw.githubusercontent.com/jgreathouse9/mlsynth/refs/heads/main/basedata/smoking_data.csv"
data = pd.read_csv(url)

FDIDconfig = {
    "df": data,
    "outcome": data.columns[2],
    "treat": data.columns[-1],
    "unitid": data.columns[0],
    "time": data.columns[1],
    "display_graphs": True,
    "save": False,
    "counterfactual_color": ["blue", "red"]
}

FDIDResult = FDID(FDIDconfig).fit()

Here all selected units are given equal weight, adjusted only by an intercept.
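
As a toy illustration of that arithmetic (numbers made up), the counterfactual is the equal-weighted average of the selected controls, shifted by the mean pre-treatment gap between that average and the treated unit:

import numpy as np

y1_pre = np.array([10.0, 10.5, 11.0])            # treated unit, pre-treatment
controls_pre = np.array([[12.0, 11.5],           # selected controls, pre-treatment
                         [12.4, 12.1],
                         [12.9, 12.7]])
controls_post = np.array([[13.5, 13.1],          # selected controls, post-treatment
                          [14.0, 13.6]])

control_mean_pre = controls_pre.mean(axis=1)     # equal weights: simple average
intercept = (y1_pre - control_mean_pre).mean()   # beta_0 from the DID objective
counterfactual = controls_post.mean(axis=1) + intercept
print("Intercept:", round(intercept, 3))
print("Post-treatment counterfactual:", np.round(counterfactual, 3))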