mlsynth: A Practical Guide
2025-06-01
Maximizing value in business and crafting effective public policy depend on knowing whether policies, promotions, new web features, taxes, or other interventions meaningfully affect the metrics we care about.
Researchers commonly use Difference-in-Differences (DID) designs and Synthetic Control Methods (SCM) to measure causal impacts.
Both approaches use weighted averages of a control group, fit over the pre-treatment period, to estimate the counterfactual: the post-treatment values of the outcome we would see absent treatment.
Both methods optimize over pre-treatment periods.
For DID, we have
\[
\begin{aligned}
\underset{\mathbf{w} \in \mathbb{R}^{N_0}}{\operatorname*{argmin}} &\quad \lVert \mathbf{y}_1 - \mathbf{Y}_0 \mathbf{w} - \boldsymbol{\beta}_0 \rVert_2^2, \\
\text{s.t.} &\quad \mathbf{w} = \frac{1}{N_0} \mathbf{1}_{N_0}.
\end{aligned}
\]
SCM imposes \(\Delta := \{\mathbf{w} \in \mathbb{R}_{\geq 0}^{N_0} : \lVert \mathbf{w} \rVert_1 = 1\}.\) We learn the weights via
\[
\underset{\mathbf{w} \in \Delta}{\operatorname*{argmin}} \lVert \mathbf{y}_1 - \mathbf{Y}_0 \mathbf{w} \rVert_2^2.
\]
Note there is no intercept; we presume a convex combination of the donor pool fits the treated unit well.
However, SCM and DID can struggle in common settings.
SCM is prone to overfitting when \(N_0 \gg T_0\). Smaller donor pools also help mitigate interpolation bias, but choosing a good donor pool is challenging when there are many candidate controls.
Convex SCM also favors sparse solutions. But what if the true weight vector is not mostly zero? What if the data-generating process is actually dense rather than sparse?
Fortunately, econometricians have crafted methods which address these issues.
Many of these methods are re-formulations or relaxations of the standard toolkit.
For example…
Forward DID uses forward selection to choose a donor pool/control group for a single treated unit, using the DID estimator.
However, its only existing implementations are in MATLAB or R.
MATLAB, naturally, is not free! And both the MATLAB and R code require users to supply a wide dataframe and manually specify the number of pre-treatment periods.
This may work fine for one study, but what if we do not have 44 pre-treatment periods?
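The core selection step can be sketched in a few lines of numpy. This is an illustrative greedy loop, not mlsynth's implementation: at each step it adds the donor whose inclusion minimizes the pre-treatment fit error of the equal-weight, intercept-adjusted DID average, then returns the donor set with the best pre-treatment fit overall:

```python
import numpy as np

def forward_did_select(y1, Y0, n_keep=None):
    """Greedy forward selection of donors for a DID fit (sketch, not mlsynth's API).

    y1: (T0,) treated unit's pre-treatment outcomes
    Y0: (T0, N0) donor pre-treatment outcomes
    Returns the donor index set with the lowest pre-treatment SSE.
    """
    T0, N0 = Y0.shape
    n_keep = n_keep or N0
    selected, remaining, best_sets = [], list(range(N0)), []
    for _ in range(n_keep):
        scores = []
        for j in remaining:
            cols = selected + [j]
            ybar = Y0[:, cols].mean(axis=1)      # equal weights over candidates
            beta0 = (y1 - ybar).mean()           # DID intercept
            sse = ((y1 - ybar - beta0) ** 2).sum()
            scores.append((sse, j))
        sse, j_star = min(scores)                # best donor to add this round
        selected.append(j_star)
        remaining.remove(j_star)
        best_sets.append((sse, list(selected)))
    return min(best_sets)[1]
```

Note how this recovers, say, a single perfectly parallel donor: if the treated series is one donor shifted by a constant, that donor alone yields zero pre-treatment error and is the selected set.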
Robust Synthetic Control uses PCA to denoise the donor pool and uses the resulting low-rank approximation to learn the weights. An upshot is that, at least in its empirical applications, the method does not appear to rely on the additional covariates that standard SCM needs to attain acceptable fit.
However, one existing implementation requires the user to manually specify the number of principal components/singular values as well as the lambda parameter. Another requires an even more complex setup.
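The denoising step itself is conceptually simple. Here is a minimal truncated-SVD sketch (illustrative only, not any particular package's implementation), with the rank chosen by hand:

```python
import numpy as np

def low_rank_denoise(Y0, k):
    """Rank-k approximation of the donor matrix via truncated SVD (illustrative)."""
    U, s, Vt = np.linalg.svd(Y0, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k]

# Noisy low-rank donor matrix: a rank-2 signal plus measurement noise
rng = np.random.default_rng(0)
signal = rng.normal(size=(50, 2)) @ rng.normal(size=(2, 10))
Y0 = signal + rng.normal(scale=0.1, size=(50, 10))
Y0_hat = low_rank_denoise(Y0, k=2)
# Weights would then be learned on Y0_hat (e.g., by regression) instead of Y0.
```

Truncation discards the noise-dominated singular directions, so the denoised matrix sits closer to the underlying signal than the raw data does. The practical pain point is exactly the one noted above: choosing k (and any regularization) by hand.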
The existing implementations (arguably) presume a more advanced knowledge of machine-learning, econometrics, and/or programming than the modal marketing scientist, policy analyst/economist, or data scientist is likely to have.
Many of the steps critical to the analysis are not automated, leaving room for small errors.
This combination makes it harder and more cumbersome for applied scientists to use these methods in practice.
mlsynth
The Python library mlsynth is meant to further democratize causal inference and econometrics, making its benefits available to a wider class of researchers.
mlsynth automates these estimators (and more!). Its syntax is simple, and the steps are fully automated.
This presentation will cover mlsynth and how to use it in practice. I will describe a few of the algorithms and give examples of how to use them.
Installation
To install mlsynth, we need:
Python 3.9 or later
cvxpy, matplotlib, numpy, pandas, scipy, scikit-learn, statsmodels, pydantic
And then, we can install this from the command line
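Assuming the package is published under the name mlsynth (as the repository suggests), installation is a single pip command:

```shell
pip install mlsynth
```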
mlsynth expects:
A long df, with one observation per unit per time period.
A Time variable (numeric or date-time are allowed).
A unit variable (a string, to know which units are which).
A numeric outcome variable.
A dummy variable denoting treatment: 1 when the unit is treated and the treatment is active, 0 for all other units and times.
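A minimal toy dataframe in this long shape looks like the following (the column names here are illustrative, not required by the library; mlsynth takes the names from the config):

```python
import pandas as pd

# One row per unit per time period; the dummy is 1 only for the
# treated unit in its treated periods.
toy = pd.DataFrame({
    "unit": ["California", "California", "Nevada", "Nevada"],
    "year": [1988, 1989, 1988, 1989],
    "cigsale": [90.1, 82.4, 100.3, 99.8],
    "treated": [0, 1, 0, 0],
})
assert toy.groupby(["unit", "year"]).size().eq(1).all()  # one row per unit-period
```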
Find the data here.
This is the classical example from the Abadie 2010 paper.
import pandas as pd
from mlsynth.utils.helperutils import sc_diagplot
# Load smoking data
url = "https://raw.githubusercontent.com/jgreathouse9/mlsynth/main/basedata/smoking_data.csv"
df = pd.read_csv(url)
# Configure the sc_diagplot call
plotconfig = {
    "df": df,
    "outcome": df.columns[2],
    "treat": df.columns[-1],
    "unitid": df.columns[0],
    "time": df.columns[1],
    "display_graphs": True,
    "save": False,
    "counterfactual_color": "red",
    "method": "RPCA",
    "Frequentist": True
}
sc_diagplot([plotconfig])
Parallel trends does not appear to hold for California with respect to its donor pool. California has a steeper downward trend in cigarette consumption than the average of the donor pool. No matter what scalar constant we shift the donor-pool mean by, the mean difference would not be constant with respect to California. This suggests two viable remedies: either parallel trends holds for some subset of control units, or we need to re-weight the donor pool to better match the pre-intervention trajectory.
Here are the results of the CLUSTERSC
class. This class implements the Robust Synthetic Control method. We can see that this model fits the pre-treatment trajectory very well, without needing any of the seven covariates that the original paper uses.
import pandas as pd
from mlsynth import CLUSTERSC
url = "https://raw.githubusercontent.com/jgreathouse9/mlsynth/refs/heads/main/basedata/smoking_data.csv"
data = pd.read_csv(url)
SCconfig = {
    "df": data,
    "outcome": data.columns[2],
    "treat": data.columns[-1],
    "unitid": data.columns[0],
    "time": data.columns[1],
    "display_graphs": True,
    "save": False,
    "counterfactual_color": ["blue"],
    "method": "PCR",
    "cluster": False
}
SCResult = CLUSTERSC(SCconfig).fit()
Of course, this estimator presumes that some units should matter more than others (hence the weighting scheme).
import pandas as pd
from mlsynth import FDID
url = "https://raw.githubusercontent.com/jgreathouse9/mlsynth/refs/heads/main/basedata/smoking_data.csv"
data = pd.read_csv(url)
FDIDconfig = {
    "df": data,
    "outcome": data.columns[2],
    "treat": data.columns[-1],
    "unitid": data.columns[0],
    "time": data.columns[1],
    "display_graphs": True,
    "save": False,
    "counterfactual_color": ["blue", "red"]
}
FDIDResult = FDID(FDIDconfig).fit()
Here, all selected units are given equal weight, adjusted only by an intercept.
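Concretely, given an already-selected donor set, the equal-weight, intercept-adjusted counterfactual amounts to the following numpy sketch (illustrative, not mlsynth internals; the data are synthetic):

```python
import numpy as np

rng = np.random.default_rng(2)
T, T0 = 60, 40                        # total and pre-treatment periods
Y0_sel = rng.normal(size=(T, 4))      # outcomes of the selected donors
y1 = Y0_sel.mean(axis=1) + 3.0        # treated unit: parallel to donor mean
y1[T0:] += -1.5                       # treatment effect after period T0

ybar = Y0_sel.mean(axis=1)            # equal weights across selected donors
beta0 = (y1[:T0] - ybar[:T0]).mean()  # intercept = mean pre-treatment gap
y_cf = ybar + beta0                   # counterfactual for all periods
att = (y1[T0:] - y_cf[T0:]).mean()    # average effect on the treated unit
```

With these synthetic data the intercept recovers the constant gap (3.0) and the post-treatment gap recovers the effect (-1.5).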