Counterfactual Estimation via mlsynth: A Practical Guide

2025-11-11

Counterfactual Estimation

  • Maximizing value for business services and crafting effective public policy depends on knowing whether policies, promotions, new web features, taxes, or other interventions meaningfully affect metrics that we care about.

  • Researchers commonly use Difference-in-Differences designs and Synthetic Control methods (DID and SCM) to measure causal impacts.

  • DID requires \(\mathbb{E}[y_{1t}(0)] - \mathbb{E}[y_{\mathcal{N}_0 t}(0)] = \mathbb{E}[y_{1t-1}(0)] - \mathbb{E}[y_{\mathcal{N}_0 t-1}(0)]\), i.e., parallel trends. This fails when the average trend of the controls differs from the trend of the treated unit/group.
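
Under parallel trends, the control group's change over time identifies the treated unit's untreated change, so the DID counterfactual is just the donor average shifted by the pre-treatment gap. In the notation above (with "pre" denoting a pre-treatment average, added here only for illustration), the period-\(t\) effect estimate is

\[ \widehat{\tau}^{\mathrm{DID}}_t = \big(y_{1t} - \bar{y}_{1,\mathrm{pre}}\big) - \big(y_{\mathcal{N}_0 t} - \bar{y}_{\mathcal{N}_0,\mathrm{pre}}\big). \]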

For the SCM, we solve a least squares problem

\[\begin{align} &\mathbf{w}^{\mathrm{SCM}} = \underset{\mathbf{w}^{\mathrm{SCM}} \in \mathcal{W}_{\mathrm{conv}}}{\operatorname*{argmin}} \; \left\| \mathbf{y}_1 - \mathbf{Y}_0 \mathbf{w}^{\mathrm{SCM}} \right\|_2^2 \quad \forall \, t \in \mathcal{T}_{1}, \\ &\mathcal{W}_{\mathrm{conv}} := \Big\{ \mathbf{w}^{\mathrm{SCM}} \in (\{0\} \cup \mathbb{R}_+)^{|\mathcal{N}_0|} : \sum_{j \in \mathcal{N}_0} w_j = 1 \Big\}. \end{align}\]
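
As a concrete illustration, this weight problem can be solved with any quadratic-programming routine. Below is a minimal sketch using cvxpy on toy data (an assumption for illustration; mlsynth solves this internally, so none of these names come from the library):

import cvxpy as cp
import numpy as np

# Toy pre-treatment data: T0 periods, N0 donors (illustrative only)
rng = np.random.default_rng(0)
T0, N0 = 19, 38
Y0 = rng.normal(size=(T0, N0))
y1 = rng.normal(size=T0)

w = cp.Variable(N0, nonneg=True)                      # w_j >= 0
objective = cp.Minimize(cp.sum_squares(y1 - Y0 @ w))  # pre-treatment fit
constraints = [cp.sum(w) == 1]                        # weights sum to one
cp.Problem(objective, constraints).solve()

print(w.value)  # convex-hull weights for the synthetic control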

  • SCM struggles when \(N_0 \gg T_0\) and is prone to overfitting in this setting. Smaller donor pools are also preferable to mitigate interpolation biases, but picking a good donor pool can be challenging when there are very many controls.

Many SCM Libraries Exist

| Estimator / Method | Software | Notes / Accessibility |
|:-------------------|:---------|:----------------------|
| Classical SCM | R / Stata | Original SC implementation. |
| PySyncon | Python | Robust/Penalized SC. |
| microsynth | R | Focused on disaggregated data. |
| augsynth | R/Python | Bias-corrects and performs inference. |
| sdid (Synthetic DID) | R/Stata | Implements Synthetic Difference-in-Differences. |

While having all of these is nice, they are scattered across different libraries, each with its own data-structure demands, options, and so on.

Other SCMs…

| Estimator / Method | Software | Notes / Accessibility |
|:-------------------|:---------|:----------------------|
| Forward SCM | Stata | Do-file only; not production ready |
| Forward PDA | R/Stata | R requires wide data; Stata is proprietary |
| TSSC | MATLAB | MATLAB is proprietary; TSSC is not packaged |
| Relaxed \(\ell_2\) SCM | Python | Handles high-dimensional data, but requires wide data |
| Synthetic Historical Control | Not packaged | Time-series-only approach; no package exists |
| Forward Difference-in-Differences | Stata/R | Stata is not free, and the R code is not a package |
| Robust PCA SCM | Not packaged | Author's code is in R and Python |

These approaches are very useful: they address high-dimensional data structures via clustering or forward selection, allow model selection between different SCM variants, and handle noisily observed outcomes. However, many are either proprietary or not packaged, which makes actually using these estimators difficult.

Enter mlsynth

  • The Python library mlsynth is meant to further democratize these SC-based causal inference methods, each with its own use cases, under a single interface.

  • At present it has roughly 20 different flavors of SCM.

To install mlsynth, we need:

  • Python 3.9 or later
pip install -U git+https://github.com/jgreathouse9/mlsynth.git

Using mlsynth

  • A long df with one observation per unit per time period (a toy example follows this list).

  • A Time variable (numeric or date-time are allowed).

  • A unit variable (a string, to know which units are which).

  • A numeric outcome variable.

  • A dummy variable denoting treatment: 1 when the treatment is active and the unit is treated, and 0 for all other units and time periods.
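
For concreteness, here is a toy dataframe in that shape (the unit and column names are hypothetical, purely for illustration):

import pandas as pd

# One row per unit per period; "treated" is 1 only for the treated
# unit in post-treatment periods (column names are hypothetical)
toy = pd.DataFrame({
    "unit":    ["A", "A", "A", "B", "B", "B"],
    "year":    [2000, 2001, 2002, 2000, 2001, 2002],
    "outcome": [1.2, 1.4, 1.1, 0.9, 1.0, 1.3],
    "treated": [0, 0, 1, 0, 0, 0],
})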

Proposition 99

Find the data here: smoking_data.csv in the mlsynth repository's basedata folder.

import pandas as pd
from mlsynth.utils.helperutils import sc_diagplot
url = "https://raw.githubusercontent.com/jgreathouse9/mlsynth/main/basedata/smoking_data.csv"
df = pd.read_csv(url)

print(df.head().to_markdown(index=False))

print(df.tail().to_markdown(index=False))

# Configure the sc_diagplot diagnostic plot
plotconfig = {
    "df": df,
    "outcome": df.columns[2],
    "treat": df.columns[-1],
    "unitid": df.columns[0],
    "time": df.columns[1],
    "display_graphs": True,
    "save": False,
    "counterfactual_color": "red"
}

sc_diagplot([plotconfig])
| state   |   year |   cigsale | Proposition 99   |
|:--------|-------:|----------:|:-----------------|
| Alabama |   1970 |      89.8 | False            |
| Alabama |   1971 |      95.4 | False            |
| Alabama |   1972 |     101.1 | False            |
| Alabama |   1973 |     102.9 | False            |
| Alabama |   1974 |     108.2 | False            |
| state   |   year |   cigsale | Proposition 99   |
|:--------|-------:|----------:|:-----------------|
| Wyoming |   1996 |     110.3 | False            |
| Wyoming |   1997 |     108.8 | False            |
| Wyoming |   1998 |     102.9 | False            |
| Wyoming |   1999 |     104.8 | False            |
| Wyoming |   2000 |      90.5 | False            |

Parallel trends does not appear to hold for California with respect to its donor pool: California's cigarette consumption trends downward more steeply than the donor-pool average. No scalar constant we could shift the donor-pool mean by would make the difference constant with respect to California. This suggests one of two solutions is viable: either parallel trends would hold with some subset of control units, or we need to re-weight the donor pool to better match the pre-intervention trajectory.

Here are the results of the CLUSTERSC class, which implements the Robust Synthetic Control Method. We can see that this model fits the pre-treatment trajectory very well, without needing any of the seven covariates that the original paper uses.

import pandas as pd
from mlsynth import CLUSTERSC

stub = "https://raw.githubusercontent.com/jgreathouse9/mlsynth/"
trail = "refs/heads/main/basedata/"
file = "smoking_data.csv"

data = pd.read_csv(stub + trail + file)

SCconfig = {
    "df": data,
    "outcome": data.columns[2],
    "treat": data.columns[-1],
    "unitid": data.columns[0],
    "time": data.columns[1],
    "display_graphs": True,
    "save": False,
    "counterfactual_color": ["blue"], "method": "PCR", "cluster": False
}

SCResult = CLUSTERSC(SCconfig).fit()

Here, the weights are allowed to be negative and there is no summation constraint: PCA is used to denoise the donor pool, the number of singular values is chosen via singular value thresholding, and the donor matrix is reconstructed to learn the weights via OLS. We may also use the Bayesian version of this estimator by setting Frequentist to False.
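
To make the mechanics concrete, here is a rough numpy sketch of that idea: truncate the SVD of the donor matrix, then learn unconstrained weights by OLS. The fixed rank argument is an assumption for illustration; as noted above, the actual implementation selects it by singular value thresholding.

import numpy as np

def pcr_weights(Y0_pre, y1_pre, rank):
    """Illustrative PCR-style weights: SVD-denoise donors, then OLS."""
    U, s, Vt = np.linalg.svd(Y0_pre, full_matrices=False)
    Y0_denoised = U[:, :rank] @ np.diag(s[:rank]) @ Vt[:rank, :]
    # Unconstrained OLS: weights may be negative and need not sum to 1
    w, *_ = np.linalg.lstsq(Y0_denoised, y1_pre, rcond=None)
    return w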

FDID (Forward Difference-in-Differences) is an iterative, data-driven approach that selects the control group for a treated unit.
By design, it is agnostic about which units to include and how many controls to use. The algorithm adds controls sequentially based on pre-treatment fit.

  1. Initialization

\(\widehat{U}_0 = \emptyset \qquad \mathcal{U} = \emptyset\)

  2. Forward Selection (Iteration \(k = 1\))

For each candidate control \(i \in \mathcal{N}_0\):

\[ \underset{\beta_{\widehat{U}_0 \cup \{i\}}}{\operatorname*{argmin}} \big\| \mathbf{y}_1 - \mathbf{Y}_{\widehat{U}_0 \cup \{i\}} \beta_{\widehat{U}_0 \cup \{i\}} - \beta_\text{cons}\mathbf{1} \big\|_2^2, \quad \beta_{\widehat{U}_0 \cup \{i\}} = \frac{1}{|\widehat{U}_0| + 1} \]

Select the control with the highest \(R^2\):

\[ i_1^\ast = \underset{i \in \mathcal{N}_0}{\operatorname*{argmax}} R^2_i, \quad \widehat{U}_1 = \{ i_1^\ast \}, \quad \mathcal{U} = \{\widehat{U}_1\}. \]

  3. Subsequent Iterations (\(k = 2, \dots, N_0\))

For each remaining candidate \(i \notin \widehat{U}_{k-1}\):

\[ i_k^\ast = \underset{i \in \mathcal{N}_0 \setminus \widehat{U}_{k-1}}{\operatorname*{argmax}} R^2_{\widehat{U}_{k-1} \cup \{i\}}, \quad \widehat{U}_k = \widehat{U}_{k-1} \cup \{ i_k^\ast \}, \quad \mathcal{U} = \mathcal{U} \cup \{ \widehat{U}_k \}. \]

  4. Final Control Group

\[ \widehat{U} := \underset{\widehat{U}_k \in \mathcal{U}}{\operatorname*{argmax}} R^2(\widehat{U}_k) \]

The selected control set \(\widehat{U}\) maximizes pre-treatment fit (\(R^2\)) among all candidate sets.
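
A schematic numpy version of this loop is shown below (illustrative only, not mlsynth's internal code), where Y0_pre is the \(T_0 \times N_0\) matrix of pre-treatment donor outcomes and y1_pre is the treated unit's pre-treatment vector:

import numpy as np

def r2(y, yhat):
    ss_res = np.sum((y - yhat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - ss_res / ss_tot

def fdid_select(Y0_pre, y1_pre):
    """Forward selection on pre-treatment R^2 with equal donor weights."""
    selected, best_sets = [], []
    remaining = list(range(Y0_pre.shape[1]))
    while remaining:
        # Score each candidate added to the current set
        scores = []
        for i in remaining:
            avg = Y0_pre[:, selected + [i]].mean(axis=1)  # equal weights 1/k
            intercept = (y1_pre - avg).mean()             # DID intercept
            scores.append(r2(y1_pre, avg + intercept))
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
        best_sets.append((list(selected), max(scores)))
    # Return the candidate set with the highest pre-treatment R^2
    return max(best_sets, key=lambda t: t[1])[0]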

import pandas as pd
from mlsynth import FDID

url = "https://raw.githubusercontent.com/jgreathouse9/mlsynth/refs/heads/main/basedata/smoking_data.csv"
data = pd.read_csv(url)

FDIDconfig = {
    "df": data,
    "outcome": data.columns[2],
    "treat": data.columns[-1],
    "unitid": data.columns[0],
    "time": data.columns[1],
    "display_graphs": True,
    "save": False,
    "counterfactual_color": ["blue", "red"]
}

FDIDResult = FDID(FDIDconfig).fit()

Here all selected units are given equal weight, adjusted only by an intercept.

Basque Country

The Basque Country was treated in 1975, with the other 16 Spanish communities serving as controls.


import pandas as pd
from mlsynth.utils.helperutils import sc_diagplot
stub = "https://raw.githubusercontent.com/jgreathouse9/mlsynth/"
trail = "refs/heads/main/basedata/"
file = "basque_data.csv"

df = pd.read_csv(stub + trail + file)

plotconfig = {
    "df": df,
    "outcome": df.columns[2],
    "treat": df.columns[-1],
    "unitid": df.columns[0],
    "time": df.columns[1],
    "display_graphs": True,
    "save": False,
    "counterfactual_color": "red"
}

sc_diagplot([plotconfig])

  • A feature of Classic SCM is that the weights are often very sparse, which may overfit to certain donor units and produce unstable results.

  • Relaxed Balanced SCM introduces two modifications:

    1. Encourages dense weight vectors to diversify prediction risk.
    2. Allows a slack in the pre-treatment fit to reduce overfitting to a few donors.
  • Let

\[ \mathbf{G} \operatorname{:=} \mathbf{Y}_0^\top \mathbf{Y}_0 \in \mathbb{R}^{N_0 \times N_0}, \quad \mathbf{g} \operatorname{:=} \mathbf{Y}_0^\top \mathbf{y}_1 \in \mathbb{R}^{N_0}. \]

  • The optimization problem is

\[ \begin{aligned} \mathbf{w}^{\mathrm{RBSCM}} &= \underset{\mathbf{w} \in \mathcal{W}_{\mathrm{conv}}}{\operatorname*{argmin}} \; \|\mathbf{w}\|_2^2, \\ \text{s.t.} \quad & \big\| \mathbf{G}\mathbf{w} - \mathbf{g} \big\|_\infty \le \tau, \end{aligned} \]

where

\[ \mathcal{W}_{\mathrm{conv}} \operatorname{:=} \Big\{ \mathbf{w} \in (\{0\} \cup \mathbb{R}_+)^{N_0} : \sum_{j=1}^{N_0} w_j = 1 \Big\}. \]

  • \(\tau \ge 0\) controls the allowed relaxation of the pre-treatment fit.

  • The \(\ell_2\)-norm penalty encourages dense weights, distributing contribution across multiple donor units instead of concentrating on a few.
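
A minimal cvxpy sketch of this program follows (illustrative only; mlsynth's RESCM chooses \(\tau\) by cross-validation, whereas here \(\tau\) is passed in and the function name is made up for this example):

import cvxpy as cp

def rbscm_weights(Y0_pre, y1_pre, tau):
    """Illustrative relaxed balanced SC weights for a given tau."""
    G = Y0_pre.T @ Y0_pre
    g = Y0_pre.T @ y1_pre
    N0 = Y0_pre.shape[1]
    w = cp.Variable(N0, nonneg=True)
    prob = cp.Problem(
        cp.Minimize(cp.sum_squares(w)),            # favor dense weights
        [cp.sum(w) == 1,                           # convex-hull constraint
         cp.norm(G @ w - g, "inf") <= tau],        # relaxed fit condition
    )
    prob.solve()
    return w.value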

import pandas as pd
from mlsynth import RESCM

stub = "https://raw.githubusercontent.com/jgreathouse9/mlsynth/"
trail = "refs/heads/main/basedata/"
file = "basque_data.csv"

data = pd.read_csv(stub + trail + file)

Relconfig = {
    "df": data,
    "outcome": data.columns[2],
    "treat": data.columns[-1],
    "unitid": data.columns[0],
    "time": data.columns[1],
    "display_graphs": True,
    "save": False,
    "counterfactual_color": ["blue", "red"]
}

RelResult = RESCM(Relconfig).fit()

The ATTs for ADH-SCM and the Relaxed Balanced SC are very similar, as are their pre-treatment fits. However, we can see that they select very different unit weights: the relaxed SC, with \(\tau\) chosen by cross-validation, diversifies the prediction risk.

import pandas as pd
from mlsynth import FDID

stub = "https://raw.githubusercontent.com/jgreathouse9/mlsynth/"
trail = "refs/heads/main/basedata/"
file = "basque_data.csv"

data = pd.read_csv(stub + trail + file)

FDIDconfig = {
    "df": data,
    "outcome": data.columns[2],
    "treat": data.columns[-1],
    "unitid": data.columns[0],
    "time": data.columns[1],
    "display_graphs": True,
    "save": False,
    "counterfactual_color": ["blue", "red"]
}

FDIDResult = FDID(FDIDconfig).fit()

FDID selects Aragon and Catalonia (each weighted at 0.5), yielding a counterfactual similar to the relaxed SCM's.

Simplicity of Use

  • Estimating multiple SCM/DID variants across different datasets requires only minimal changes to the API call (just the class name; see the sketch after this list).
  • Furthermore, we can do this with essentially zero change in the data structure or input requirements beyond simple changes to a dictionary. Each .fit() call returns a comprehensive BaseEstimatorResults object containing treatment effects, fit diagnostics (e.g., RMSE, R-squared), time-series data, and (where applicable) donor weights, ready for analysis or visualization.
  • No manual data reshaping, cross-validation loops, or separate plotting functions are needed—mlsynth handles these under the hood.
  • Users can focus on the analysis, not on boilerplate preprocessing or plotting code.
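
For example, the two estimators used above can be run on the same data with the same dictionary; only the class changes. This is a sketch based on the configurations shown earlier, assuming CLUSTERSC's method/cluster options take their defaults:

import pandas as pd
from mlsynth import FDID, CLUSTERSC

url = "https://raw.githubusercontent.com/jgreathouse9/mlsynth/refs/heads/main/basedata/smoking_data.csv"
data = pd.read_csv(url)

config = {
    "df": data,
    "outcome": data.columns[2],
    "treat": data.columns[-1],
    "unitid": data.columns[0],
    "time": data.columns[1],
    "display_graphs": False,  # turn off plotting for a quick comparison
    "save": False,
}

fdid_result = FDID(config).fit()       # Forward DID
scm_result = CLUSTERSC(config).fit()   # Robust SC / PCR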

mlsynth is also customizable. For advanced users, each class has its own configuration dictionary. To add a proprietary method, you can:

  1. Add it to the base models,
  2. Update __init__.py locally or in a fork, and
  3. Use the existing helpers and utilities to implement the new method seamlessly.

Documentation