Forward Selected Synthetic Control

Machine Learning
Econometrics
Author

Jared Greathouse

Published

April 2, 2025

Interpolation bias is a known issue with synthetic control methods. For valid counterfactual prediction, the donor units should be as similar as possible to the treated unit in the pre-treatment periods. Selecting an appropriate donor pool is therefore critical, but this can be challenging in settings with many potential controls. This post introduces the Forward Selected Synthetic Control Method (FSCM), which applies forward selection to choose the donor pool for a synthetic control model before estimating out-of-sample predictions.

Notation

Let \(\mathbb{R}\) denote the set of real numbers. A calligraphic letter, such as \(\mathcal{S}\), represents a discrete set with cardinality \(S = |\mathcal{S}|\). Let \(j \in \mathcal{N} = \{1, 2, \dots, N\}\) index a total of \(N\) units and \(t \in \mathbb{N}\) index time. Let \(j = 1\) be the treated unit, with the set of controls being \(\mathcal{N}_0 = \mathcal{N} \setminus \{1\}\), with cardinality \(N_0\). The pre-treatment period consists of the set \(\mathcal{T}_1 = \{ t \in \mathbb{N} : t \leq T_0 \},\) where \(T_0\) is the final period before treatment. Similarly, the post-treatment period is given by \(\mathcal{T}_2 = \{ t \in \mathbb{N} : t > T_0 \}.\)

The observed outcome for unit \(j\) at time \(t\) is denoted by \(y_{jt}\), and the outcome vector for a given unit is \(\mathbf{y}_j = (y_{j1}, y_{j2}, \dots, y_{jT})^\top \in \mathbb{R}^{T}\). The outcome vector for the treated unit specifically is \(\mathbf{y}_1\). The donor matrix, similarly, is defined as \(\mathbf{Y}_0 \coloneqq \begin{bmatrix} \mathbf{y}_j \end{bmatrix}_{j \in \mathcal{N}_0} \in \mathbb{R}^{T \times N_0}\), where each column corresponds to a donor unit and each row to a time period.
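In code, constructing \(\mathbf{y}_1\) and \(\mathbf{Y}_0\) amounts to reshaping a long panel into a wide \(T \times N\) matrix. Here is a minimal pandas sketch, where the frame panel, the column names time, unit, and outcome, and the treated-unit label "treated" are all hypothetical placeholders:

import pandas as pd

# Long panel -> T x N wide matrix, one column of outcomes per unit
wide = panel.pivot(index="time", columns="unit", values="outcome")
y1 = wide["treated"].to_numpy()               # treated unit's outcome vector
Y0 = wide.drop(columns="treated").to_numpy()  # T x N0 donor matrix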

Synthetic Controls

SCM estimates the counterfactual outcome for the treated unit by solving the program

\[ \mathbf{w}^\ast = \underset{\mathbf{w} \in \Delta^{N_0}}{\operatorname*{argmin}} \sum_{t \in \mathcal{T}_1} \Big( y_{1t} - \sum_{j \in \mathcal{N}_0} w_j \, y_{jt} \Big)^2. \]

We seek the weight vector \(\mathbf{w}\) that minimizes the squared error between the treated unit's pre-treatment outcomes and the weighted average of the control units'. For our purposes, the space of synthetic control weights is the \(N_0\)-dimensional probability simplex \(\Delta^{N_0} = \left\{ \mathbf{w} \in \mathbb{R}_{\geq 0}^{N_0} : \|\mathbf{w}\|_1 = 1 \right\}.\) Practically, this means the post-intervention predictions will never fall outside the range of the donor pool's outcome values. The key to preventing interpolation bias is choosing donor units that resemble the treated unit on the latent factors driving the outcome.
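To make the program concrete, here is a minimal sketch of the weight optimization using scipy, assuming y1 and Y0 are NumPy arrays holding the pre-treatment outcomes. The names and setup are illustrative, not mlsynth's internal code:

import numpy as np
from scipy.optimize import minimize

def scm_weights(y1, Y0):
    """Simplex-constrained least squares: min ||y1 - Y0 w||^2, w >= 0, sum(w) = 1."""
    N0 = Y0.shape[1]
    sse = lambda w: np.sum((y1 - Y0 @ w) ** 2)               # pre-treatment squared error
    cons = {"type": "eq", "fun": lambda w: np.sum(w) - 1.0}  # weights sum to one
    bnds = [(0.0, 1.0)] * N0                                 # nonnegative weights
    w0 = np.full(N0, 1.0 / N0)                               # start from uniform weights
    res = minimize(sse, w0, method="SLSQP", bounds=bnds, constraints=cons)
    return res.x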

The Forward Selection Algorithm via SCM

Consider now a restricted donor pool chosen by the forward selection algorithm. We may have a high-dimensional donor pool and be unsure which donors to include, or we may simply wish to use the method as a robustness check. The donor pool chosen by forward selection consists of a subset of control units, \(\mathcal{S} \subseteq \mathcal{N}_0\), with cardinality \(k = |\mathcal{S}|\), where \(k \leq N_0\). This subset induces a sub-simplex of weights \(\Delta^{k}(\mathcal{S}) = \left\{ \mathbf{w}^\prime \in \mathbb{R}_{\geq 0}^{k} : \|\mathbf{w}^{\prime}\|_1 = 1 \right\}\). We presume, naturally, that some control units will be more relevant to the treated unit than others. The forward selection algorithm, proceeding over iterations \(K = 1, \dots, N_0\), builds a sequence of tuples, \(\mathbb{T} = \{(\mathcal{S}_K, \text{MSE}_K) \}_{K=1}^{N_0}\). Each tuple contains two elements: the selected donor set at the \(K\)-th iteration and its corresponding pre-treatment mean squared error, \(\text{MSE}_K\).

We begin by minimizing the SCM objective above, cycling through each donor unit one at a time instead of using the full control group. We call these submodels; this first pass yields \(N_0\) one-unit SCM models. We choose the single donor unit (the nearest neighbor, in this special case) that minimizes the MSE among all \(N_0\) submodels. Our first tuple is then built from this single donor unit and the model's corresponding MSE:

\[ \mathcal{S}_1 = \{j^\ast\}, \quad \text{where} \quad j^\ast = \underset{j \in \mathcal{N}_0}{\operatorname*{argmin}} \ \text{MSE}(\{j\}). \]

For \(K=2\), we ask which of the remaining donor units, in conjunction with the first selected donor, minimizes the \(\text{MSE}\). This requires estimating \(N_0-1\) two-unit synthetic control models. As before, the selected donor set (the first donor plus the new one) and its pre-treatment MSE form the first and second elements of the second tuple, respectively:

\[ \mathcal{S}_2 = \mathcal{S}_1 \cup \{j^\ast\},\quad \text{where} \quad j^\ast = \underset{j \in \mathcal{N}_0 \setminus \mathcal{S}_{1}}{\operatorname*{argmin}} \ \text{MSE}(\mathcal{S}_{1} \cup \{j\}). \]

This process generalizes to the remaining iterations: at step \(K\), the update takes the form

\[ (\mathcal{S}_K, \text{MSE}_K) = (\mathcal{S}_{K-1} \cup \{j^\ast\}, \ \text{MSE}(\mathcal{S}_{K-1} \cup \{j^\ast\})), \quad \text{where} \quad j^\ast = \underset{j \in \mathcal{N}_0 \setminus \mathcal{S}_{K-1}}{\operatorname*{argmin}} \ \text{MSE}(\mathcal{S}_{K-1} \cup \{j\}). \]

The algorithm continues until \(S_K = N_0\), that is, until every control unit has been added to the pool. The final donor set we use for analysis comes from the tuple with the lowest \(\text{MSE}\):

\[ \mathcal{S}^{\ast} = \underset{\mathcal{S}_K : \, (\mathcal{S}_K, \text{MSE}_K) \in \mathbb{T}}{\operatorname*{argmin}} \ \text{MSE}_K. \]

Here, \(\mathcal{S}^{\ast}\) is the optimal donor pool by forward selection. Note that even within \(\mathcal{S}^{\ast}\) (as we will see below), some donors may receive zero weight in the final solution: forward selection chooses the units eligible for the donor pool, not necessarily the units that will actually get weight. This contrasts with methods such as Forward Difference-in-Differences or the forward selection panel data method. Both of these designs are available in mlsynth too, in the FDID class and the PDA class with the method of fs (the default). The main difference here is that FDID can never overfit because it estimates only one parameter, whereas (in theory) FSCM and fsPDA can overfit if they end up including too many parameters in the regression model.
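To make the loop concrete, here is a stylized sketch of the full procedure, reusing the scm_weights helper sketched earlier. This is an illustration of the algorithm as described, not mlsynth's internal implementation; note that it fits on the order of \(N_0^2\) synthetic control models:

import numpy as np

def forward_scm(y1, Y0, donor_names):
    """Greedy forward selection of the SCM donor pool.

    Walks the forward-selection path, recording (donor set, MSE) at each
    step, and returns the donor set with the lowest pre-treatment MSE.
    """
    remaining = list(range(Y0.shape[1]))
    selected, path = [], []
    while remaining:  # grow the pool until every donor has been added
        best_j, best_mse = None, np.inf
        for j in remaining:  # try each remaining donor alongside the current set
            cols = selected + [j]
            w = scm_weights(y1, Y0[:, cols])
            mse = np.mean((y1 - Y0[:, cols] @ w) ** 2)
            if mse < best_mse:
                best_j, best_mse = j, mse
        selected.append(best_j)
        remaining.remove(best_j)
        path.append((list(selected), best_mse))
    best_set, _ = min(path, key=lambda tup: tup[1])  # lowest-MSE tuple wins
    return [donor_names[j] for j in best_set]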

FSCM in mlsynth

As usual, to implement this properly, we begin by installing mlsynth from my GitHub:

pip install -U git+https://github.com/jgreathouse9/mlsynth.git

We then load the Proposition 99 dataset and fit the model in the usual mlsynth fashion.

import pandas as pd # To work with panel data
from IPython.display import display, Markdown # To create the table
from mlsynth.mlsynth import FSCM # The method of interest

# Proposition 99 smoking data shipped with mlsynth
url = "https://raw.githubusercontent.com/jgreathouse9/mlsynth/refs/heads/main/basedata/smoking_data.csv"
data = pd.read_csv(url)

# Our method inputs
config = {
    "df": data,
    "outcome": data.columns[2],   # outcome variable
    "treat": data.columns[-1],    # treatment indicator
    "unitid": data.columns[0],    # unit identifier
    "time": data.columns[1],      # time variable
    "display_graphs": True,
    "save": False,
    "counterfactual_color": "red"}

arco = FSCM(config).fit()

After estimation, we can collect the weights into a table like so:

weights_dict = arco['Weights'][0]
df = pd.DataFrame(list(weights_dict.items()), columns=['State', 'Weight'])
display(Markdown(df.to_markdown(index=False)))
| State          | Weight |
|----------------|--------|
| Montana        | 0.232  |
| Alabama        | -0     |
| Colorado       | 0.015  |
| Connecticut    | 0.109  |
| Georgia        | -0     |
| Idaho          | 0      |
| Illinois       | 0      |
| Nevada         | 0.205  |
| New Hampshire  | 0.045  |
| New Mexico     | 0      |
| North Carolina | 0      |
| North Dakota   | -0     |
| Oklahoma       | -0     |
| Utah           | 0.394  |
| Vermont        | -0     |
| West Virginia  | 0      |
| Wyoming        | 0      |

These are the weights for all 17 units selected by the algorithm. As we can see, not all of them ultimately contribute to the synthetic control: only 6 are assigned positive weight. Our ATT of Prop 99 is -19.51, and the pre-treatment Root Mean Squared Error for the Forward Selected SCM is 1.66. I compared these results to the ones we get in the Stata help file, which includes the covariates that Abadie, Diamond, and Hainmueller originally adjusted for, as well as a customized period over which to minimize the MSE. The ATT using the original method is -18.97, and the pre-treatment RMSE is 1.75. The corresponding weights, using the full donor pool, are 0.335 for Utah, 0.236 for Nevada, 0.2 for Montana, 0.16 for Colorado, 0.068 for Connecticut, and 0.001 for New Mexico. So the ATTs are very similar, and the pre-treatment prediction errors are pretty much the same. By comparison, when we estimate this in Stata while omitting the auxiliary covariate predictors (synth2 cigsale cigsale(1988) cigsale(1980) cigsale(1975), trunit(3) trperiod(1989) xperiod(1980(1)1988) nested fig), we get an RMSE of 4.33 and an ATT of -22.88. Furthermore, with this specification, the weight vector is no longer sparse.
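Returning to the FSCM weights above, a quick way to isolate the donors that actually receive weight is to filter the table we built earlier:

# Keep only the donors assigned positive weight
positive = df[df["Weight"] > 0].sort_values("Weight", ascending=False)
display(Markdown(positive.to_markdown(index=False)))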

The point of this article is very simple. The original SCM works well; however, it can be very sensitive to the inclusion of covariates, which covariates are included, what their lags are, and so on. There is also an issue of covariate selection in settings where multiple covariates can potentially inform our choice of the donor pool. In situations where we have many donors and no obvious, pre-existing valid donor pool, analysts may apply the Forward Selected SCM to guard against interpolation bias. At least with the California example (and the West Germany and Basque datasets, which I also tested), we can sometimes get results comparable to the baseline estimates, which generally require fitting to multiple covariates for acceptable results. In this example, we select some of the same donor units and get a slightly better MSE and a very similar ATT without needing to fit to the covariates originally specified. The promise of machine-learning methods in this space is to automate away donor/predictor selection (to some degree). The key question is which methods are best suited for this, when, and why. For example, it might be useful to derive bias bounds for this estimator to quantify how much the MSE should improve relative to the original SCM and Forward Difference-in-Differences.

In the original paper, Giovanni uses cross-validation to estimate this model. I have not done this yet, but I will very soon. As usual, email me with questions or comments.