Counterfactual Estimation via mlsynth: A Practical Guide

2025-11-11

Counterfactual Estimation

  • Maximizing value for business services and crafting effective public policy depends on knowing whether policies, promotions, new web features, taxes, or other interventions meaningfully affect metrics that we care about.

  • Researchers commonly use Difference-in-Differences designs and Synthetic Control methods (DID and SCM) to measure causal impacts.

  • DID requires \(\mathbb{E}[y_{1t}(0)] - \mathbb{E}[y_{\mathcal{N}_0 t}(0)] = \mathbb{E}[y_{1t-1}(0)] - \mathbb{E}[y_{\mathcal{N}_0 t-1}(0)]\), i.e., parallel trends. This fails when the average trend of the controls differs from the trend of the treated unit/group.
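
Under parallel trends, the control group's change over time identifies the treated unit's untreated change, so the DID counterfactual is just the donor average shifted by the pre-treatment gap. In the notation above (with "pre" denoting a pre-treatment average, added here only for illustration), the period-\(t\) effect estimate is

\[ \widehat{\tau}^{\mathrm{DID}}_t = \big(y_{1t} - \bar{y}_{1,\mathrm{pre}}\big) - \big(y_{\mathcal{N}_0 t} - \bar{y}_{\mathcal{N}_0,\mathrm{pre}}\big). \]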

For the SCM, we solve a least squares problem

\[\begin{align} &\mathbf{w}^{\mathrm{SCM}} = \underset{\mathbf{w}^{\mathrm{SCM}} \in \mathcal{W}_{\mathrm{conv}}}{\operatorname*{argmin}} \; \left\| \mathbf{y}_1 - \mathbf{Y}_0 \mathbf{w}^{\mathrm{SCM}} \right\|_2^2 \quad \forall \, t \in \mathcal{T}_{1}, \\ &\mathcal{W}_{\mathrm{conv}} := \Big\{ \mathbf{w}^{\mathrm{SCM}} \in (\{0\} \cup \mathbb{R}_+)^{|\mathcal{N}_0|} : \sum_{j \in \mathcal{N}_0} w_j = 1 \Big\}. \end{align}\]
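
As a concrete illustration, this weight problem can be solved with any quadratic-programming routine. Below is a minimal sketch using cvxpy on toy data (an assumption for illustration; mlsynth solves this internally, so none of these names come from the library):

import cvxpy as cp
import numpy as np

# Toy pre-treatment data: T0 periods, N0 donors (illustrative only)
rng = np.random.default_rng(0)
T0, N0 = 19, 38
Y0 = rng.normal(size=(T0, N0))
y1 = rng.normal(size=T0)

w = cp.Variable(N0, nonneg=True)                      # w_j >= 0
objective = cp.Minimize(cp.sum_squares(y1 - Y0 @ w))  # pre-treatment fit
constraints = [cp.sum(w) == 1]                        # weights sum to one
cp.Problem(objective, constraints).solve()

print(w.value)  # convex-hull weights for the synthetic control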

  • SCM struggles when \(N_0 \gg T_0\) and is prone to overfitting in this setting. Smaller donor pools are also preferable to mitigate interpolation biases, but picking a good donor pool can be challenging when there are very many controls.

Many SCM Libraries Exist

| Estimator / Method | Software | Notes / Accessibility |
|:-------------------|:---------|:----------------------|
| Classical SCM | R / Stata | Original SC implementation. |
| PySyncon | Python | Robust/Penalized SC. |
| microsynth | R | Focused on disaggregated data. |
| augsynth | R/Python | Bias-corrects and performs inference. |
| sdid (Synthetic DID) | R/Stata | Implements Synthetic Difference-in-Differences. |

While having all of these is nice, they are scattered across different libraries, each with its own data-structure demands, options, and so on.

Other SCMs…

| Estimator / Method | Software | Notes / Accessibility |
|:-------------------|:---------|:----------------------|
| Forward SCM | Stata | Do-file only; not production ready |
| Forward PDA | R/Stata | R requires wide data; Stata is proprietary |
| TSSC | MATLAB | MATLAB is proprietary; TSSC is not packaged |
| Relaxed \(\ell_2\) SCM | Python | Handles high-dimensional data, but requires wide data |
| Synthetic Historical Control | Not packaged | Time-series-only approach; no package exists |
| Forward Difference-in-Differences | Stata/R | Stata is not free, and the R code is not a package |
| Robust PCA SCM | Not packaged | Author's code is in R and Python |

These approaches are very useful: they address high-dimensional data structures via clustering or forward selection, allow model selection between different SCM variants, and handle noisily observed outcomes. However, many are either proprietary or not packaged, which makes actually using these estimators difficult.

Enter mlsynth

  • The Python library mlsynth is meant to further democratize these SC-based causal inference methods, each with its own use cases, under a single interface.

  • At present it has roughly 20 different flavors of SCM.

To install mlsynth, we need:

  • Python 3.9 or later
pip install -U git+https://github.com/jgreathouse9/mlsynth.git

Using mlsynth

  • A long df with one observation per unit per time period (a toy example follows this list).

  • A Time variable (numeric or date-time are allowed).

  • A unit variable (a string, to know which units are which).

  • A numeric outcome variable.

  • A dummy variable denoting treatment: 1 when the treatment is active and the unit is treated, and 0 for all other units and time periods.
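
For concreteness, here is a toy dataframe in that shape (the unit and column names are hypothetical, purely for illustration):

import pandas as pd

# One row per unit per period; "treated" is 1 only for the treated
# unit in post-treatment periods (column names are hypothetical)
toy = pd.DataFrame({
    "unit":    ["A", "A", "A", "B", "B", "B"],
    "year":    [2000, 2001, 2002, 2000, 2001, 2002],
    "outcome": [1.2, 1.4, 1.1, 0.9, 1.0, 1.3],
    "treated": [0, 0, 1, 0, 0, 0],
})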

Proposition 99

Find the data here: smoking_data.csv in the mlsynth repository's basedata folder.

import pandas as pd
from mlsynth.utils.helperutils import sc_diagplot
url = "https://raw.githubusercontent.com/jgreathouse9/mlsynth/main/basedata/smoking_data.csv"
df = pd.read_csv(url)

print(df.head().to_markdown(index=False))

print(df.tail().to_markdown(index=False))

# Configure the sc_diagplot diagnostic plot
plotconfig = {
    "df": df,
    "outcome": df.columns[2],
    "treat": df.columns[-1],
    "unitid": df.columns[0],
    "time": df.columns[1],
    "display_graphs": True,
    "save": False,
    "counterfactual_color": "red"
}

sc_diagplot([plotconfig])
| state   |   year |   cigsale | Proposition 99   |
|:--------|-------:|----------:|:-----------------|
| Alabama |   1970 |      89.8 | False            |
| Alabama |   1971 |      95.4 | False            |
| Alabama |   1972 |     101.1 | False            |
| Alabama |   1973 |     102.9 | False            |
| Alabama |   1974 |     108.2 | False            |
| state   |   year |   cigsale | Proposition 99   |
|:--------|-------:|----------:|:-----------------|
| Wyoming |   1996 |     110.3 | False            |
| Wyoming |   1997 |     108.8 | False            |
| Wyoming |   1998 |     102.9 | False            |
| Wyoming |   1999 |     104.8 | False            |
| Wyoming |   2000 |      90.5 | False            |

Parallel trends does not appear to hold for California with respect to its donor pool: California's cigarette consumption trends downward more steeply than the donor-pool average. No scalar constant we could shift the donor-pool mean by would make the difference constant with respect to California. This suggests one of two solutions is viable: either parallel trends would hold with some subset of control units, or we need to re-weight the donor pool to better match the pre-intervention trajectory.

Here are the results of the CLUSTERSC class, which implements the Robust Synthetic Control Method. We can see that this model fits the pre-treatment trajectory very well, without needing any of the seven covariates that the original paper uses.

import pandas as pd
from mlsynth import CLUSTERSC

stub = "https://raw.githubusercontent.com/jgreathouse9/mlsynth/"
trail = "refs/heads/main/basedata/"
file = "smoking_data.csv"

data = pd.read_csv(stub + trail + file)

SCconfig = {
    "df": data,
    "outcome": data.columns[2],
    "treat": data.columns[-1],
    "unitid": data.columns[0],
    "time": data.columns[1],
    "display_graphs": True,
    "save": False,
    "counterfactual_color": ["blue"], "method": "PCR", "cluster": False
}

SCResult = CLUSTERSC(SCconfig).fit()

Here, the weights are allowed to be negative and there is no summation constraint: PCA is used to denoise the donor pool, the number of singular values is chosen via singular value thresholding, and the donor matrix is reconstructed to learn the weights via OLS. We may also use the Bayesian version of this estimator by setting Frequentist to False.
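
To make the mechanics concrete, here is a rough numpy sketch of that idea: truncate the SVD of the donor matrix, then learn unconstrained weights by OLS. The fixed rank argument is an assumption for illustration; as noted above, the actual implementation selects it by singular value thresholding.

import numpy as np

def pcr_weights(Y0_pre, y1_pre, rank):
    """Illustrative PCR-style weights: SVD-denoise donors, then OLS."""
    U, s, Vt = np.linalg.svd(Y0_pre, full_matrices=False)
    Y0_denoised = U[:, :rank] @ np.diag(s[:rank]) @ Vt[:rank, :]
    # Unconstrained OLS: weights may be negative and need not sum to 1
    w, *_ = np.linalg.lstsq(Y0_denoised, y1_pre, rcond=None)
    return w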

FDID (Forward Difference-in-Differences) is an iterative, data-driven approach that selects the control group for a treated unit.
By design, it is agnostic about which units to include and how many controls to use. The algorithm adds controls sequentially based on pre-treatment fit.

  1. Initialization

\(\widehat{U}_0 = \emptyset \qquad \mathcal{U} = \emptyset\)

  2. Forward Selection (Iteration \(k = 1\))

For each candidate control \(i \in \mathcal{N}_0\):

\[ \underset{\beta_{\widehat{U}_0 \cup \{i\}}}{\operatorname*{argmin}} \big\| \mathbf{y}_1 - \mathbf{Y}_{\widehat{U}_0 \cup \{i\}} \beta_{\widehat{U}_0 \cup \{i\}} - \beta_\text{cons}\mathbf{1} \big\|_2^2, \quad \beta_{\widehat{U}_0 \cup \{i\}} = \frac{1}{|\widehat{U}_0| + 1} \]

Select the control with the highest \(R^2\):

\[ i_1^\ast = \underset{i \in \mathcal{N}_0}{\operatorname*{argmax}} R^2_i, \quad \widehat{U}_1 = \{ i_1^\ast \}, \quad \mathcal{U} = \{\widehat{U}_1\}. \]

  3. Subsequent Iterations (\(k = 2, \dots, N_0\))

For each remaining candidate \(i \notin \widehat{U}_{k-1}\):

\[ i_k^\ast = \underset{i \in \mathcal{N}_0 \setminus \widehat{U}_{k-1}}{\operatorname*{argmax}} R^2_{\widehat{U}_{k-1} \cup \{i\}}, \quad \widehat{U}_k = \widehat{U}_{k-1} \cup \{ i_k^\ast \}, \quad \mathcal{U} = \mathcal{U} \cup \{ \widehat{U}_k \}. \]

  4. Final Control Group

\[ \widehat{U} := \underset{\widehat{U}_k \in \mathcal{U}}{\operatorname*{argmax}} R^2(\widehat{U}_k) \]

The selected control set \(\widehat{U}\) maximizes pre-treatment fit (\(R^2\)) among all candidate sets.
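
A schematic numpy version of this loop is shown below (illustrative only, not mlsynth's internal code), where Y0_pre is the \(T_0 \times N_0\) matrix of pre-treatment donor outcomes and y1_pre is the treated unit's pre-treatment vector:

import numpy as np

def r2(y, yhat):
    ss_res = np.sum((y - yhat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - ss_res / ss_tot

def fdid_select(Y0_pre, y1_pre):
    """Forward selection on pre-treatment R^2 with equal donor weights."""
    selected, best_sets = [], []
    remaining = list(range(Y0_pre.shape[1]))
    while remaining:
        # Score each candidate added to the current set
        scores = []
        for i in remaining:
            avg = Y0_pre[:, selected + [i]].mean(axis=1)  # equal weights 1/k
            intercept = (y1_pre - avg).mean()             # DID intercept
            scores.append(r2(y1_pre, avg + intercept))
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
        best_sets.append((list(selected), max(scores)))
    # Return the candidate set with the highest pre-treatment R^2
    return max(best_sets, key=lambda t: t[1])[0]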

import pandas as pd
from mlsynth import FDID

url = "https://raw.githubusercontent.com/jgreathouse9/mlsynth/refs/heads/main/basedata/smoking_data.csv"
data = pd.read_csv(url)

FDIDconfig = {
    "df": data,
    "outcome": data.columns[2],
    "treat": data.columns[-1],
    "unitid": data.columns[0],
    "time": data.columns[1],
    "display_graphs": True,
    "save": False,
    "counterfactual_color": ["blue", "red"]
}

FDIDResult = FDID(FDIDconfig).fit()

Here all selected units are given equal weight, adjusted only by an intercept.

Basque Country

The Basque Country was treated in 1975, with the other 16 Spanish communities serving as controls.


import pandas as pd
from mlsynth.utils.helperutils import sc_diagplot
stub = "https://raw.githubusercontent.com/jgreathouse9/mlsynth/"
trail = "refs/heads/main/basedata/"
file = "basque_data.csv"

df = pd.read_csv(stub + trail + file)

plotconfig = {
    "df": df,
    "outcome": df.columns[2],
    "treat": df.columns[-1],
    "unitid": df.columns[0],
    "time": df.columns[1],
    "display_graphs": True,
    "save": False,
    "counterfactual_color": "red"
}

sc_diagplot([plotconfig])

  • A feature of Classic SCM is that the weights are often very sparse, which may overfit to certain donor units and produce unstable results.

  • Relaxed Balanced SCM introduces two modifications:

    1. Encourages dense weight vectors to diversify prediction risk.
    2. Allows a slack in the pre-treatment fit to reduce overfitting to a few donors.
  • Let

\[ \mathbf{G} \operatorname{:=} \mathbf{Y}_0^\top \mathbf{Y}_0 \in \mathbb{R}^{N_0 \times N_0}, \quad \mathbf{g} \operatorname{:=} \mathbf{Y}_0^\top \mathbf{y}_1 \in \mathbb{R}^{N_0}. \]

  • The optimization problem is

\[ \begin{aligned} \mathbf{w}^{\mathrm{RBSCM}} &= \underset{\mathbf{w} \in \mathcal{W}_{\mathrm{conv}}}{\operatorname*{argmin}} \; \|\mathbf{w}\|_2^2, \\ \text{s.t.} \quad & \big\| \mathbf{G}\mathbf{w} - \mathbf{g} \big\|_\infty \le \tau, \end{aligned} \]

where

\[ \mathcal{W}_{\mathrm{conv}} \operatorname{:=} \Big\{ \mathbf{w} \in (\{0\} \cup \mathbb{R}_+)^{N_0} : \sum_{j=1}^{N_0} w_j = 1 \Big\}. \]

  • \(\tau \ge 0\) controls the allowed relaxation of the pre-treatment fit.

  • The \(\ell_2\)-norm penalty encourages dense weights, distributing contribution across multiple donor units instead of concentrating on a few.
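
A minimal cvxpy sketch of this program follows (illustrative only; mlsynth's RESCM chooses \(\tau\) by cross-validation, whereas here \(\tau\) is passed in and the function name is made up for this example):

import cvxpy as cp

def rbscm_weights(Y0_pre, y1_pre, tau):
    """Illustrative relaxed balanced SC weights for a given tau."""
    G = Y0_pre.T @ Y0_pre
    g = Y0_pre.T @ y1_pre
    N0 = Y0_pre.shape[1]
    w = cp.Variable(N0, nonneg=True)
    prob = cp.Problem(
        cp.Minimize(cp.sum_squares(w)),            # favor dense weights
        [cp.sum(w) == 1,                           # convex-hull constraint
         cp.norm(G @ w - g, "inf") <= tau],        # relaxed fit condition
    )
    prob.solve()
    return w.value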

import pandas as pd
from mlsynth import RESCM

stub = "https://raw.githubusercontent.com/jgreathouse9/mlsynth/"
trail = "refs/heads/main/basedata/"
file = "basque_data.csv"

data = pd.read_csv(stub + trail + file)

Relconfig = {
    "df": data,
    "outcome": data.columns[2],
    "treat": data.columns[-1],
    "unitid": data.columns[0],
    "time": data.columns[1],
    "display_graphs": True,
    "save": False,
    "counterfactual_color": ["blue", "red"]
}

RelResult = RESCM(Relconfig).fit()

The ATTs for ADH-SCM and the Relaxed Balanced SC are very similar, as are their pre-treatment fits. However, we can see that they select very different unit weights: the relaxed SC, with \(\tau\) chosen by cross-validation, diversifies the prediction risk.

import pandas as pd
from mlsynth import FDID

stub = "https://raw.githubusercontent.com/jgreathouse9/mlsynth/"
trail = "refs/heads/main/basedata/"
file = "basque_data.csv"

data = pd.read_csv(stub + trail + file)

FDIDconfig = {
    "df": data,
    "outcome": data.columns[2],
    "treat": data.columns[-1],
    "unitid": data.columns[0],
    "time": data.columns[1],
    "display_graphs": True,
    "save": False,
    "counterfactual_color": ["blue", "red"]
}

FDIDResult = FDID(FDIDconfig).fit()

FDID selects Aragon and Catalonia (each weighted at 0.5), yielding a counterfactual similar to the relaxed SCM's.

Simplicity of Use

  • Estimating multiple SCM/DID variants across different datasets requires only minimal changes to the API call (just the class name; see the sketch after this list).
  • Furthermore, we can do this with essentially zero change in the data structure or input requirements beyond simple changes to a dictionary. Each .fit() call returns a comprehensive BaseEstimatorResults object containing treatment effects, fit diagnostics (e.g., RMSE, R-squared), time-series data, and (where applicable) donor weights, ready for analysis or visualization.
  • No manual data reshaping, cross-validation loops, or separate plotting functions are needed—mlsynth handles these under the hood.
  • Users can focus on the analysis, not on boilerplate preprocessing or plotting code.
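
For example, the two estimators used above can be run on the same data with the same dictionary; only the class changes. This is a sketch based on the configurations shown earlier, assuming CLUSTERSC's method/cluster options take their defaults:

import pandas as pd
from mlsynth import FDID, CLUSTERSC

url = "https://raw.githubusercontent.com/jgreathouse9/mlsynth/refs/heads/main/basedata/smoking_data.csv"
data = pd.read_csv(url)

config = {
    "df": data,
    "outcome": data.columns[2],
    "treat": data.columns[-1],
    "unitid": data.columns[0],
    "time": data.columns[1],
    "display_graphs": False,  # turn off plotting for a quick comparison
    "save": False,
}

fdid_result = FDID(config).fit()       # Forward DID
scm_result = CLUSTERSC(config).fit()   # Robust SC / PCR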

mlsynth is also customizable. For advanced users, each class has its own configuration dictionary. To add a proprietary method, you can:

  1. Add it to the base models,
  2. Update __init__.py locally or in a fork, and
  3. Use the existing helpers and utilities to implement the new method seamlessly.

Documentation