2025-09-13
What is SCM?
The synthetic control method (SCM) is a quasi-experimental method used to estimate the causal impact of policies, marketing campaigns, or other treatments of interest.
Historical roots
SCM originates from the comparative case study framework.
Connection to Difference-in-Differences (DiD)
DiD assumes outcomes follow:
\[
y_{jt}(0) = \mu_j + \lambda_t + \epsilon_{jt}
\] where \(\mu_j\) are unobserved unit fixed effects and \(\lambda_t\) are time effects common to all units (note that they enter additively).
SCM generalizes this with an interactive fixed effects model, where time factors may affect units differently:
\[
y_{jt}(0) = \mathbf{\mu}_j^\top \mathbf{\lambda}_t + \epsilon_{jt}
\] where \(\mathbf{\mu}_j \in \mathbb{R}^r\) and \(\mathbf{\lambda}_t \in \mathbb{R}^r\) are latent factor vectors.
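To make the distinction concrete, here is a minimal numpy sketch of the two data-generating processes (all dimensions and parameter values are made up purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
J, T, r = 5, 20, 2                    # units, time periods, latent factors (illustrative)

# Additive (DiD) world: y_jt(0) = mu_j + lambda_t + eps_jt
mu = rng.normal(size=(J, 1))          # unit fixed effects
lam = rng.normal(size=(1, T))         # time effects, common to all units
y_did = mu + lam + 0.1 * rng.normal(size=(J, T))

# Interactive fixed effects world: y_jt(0) = mu_j' lambda_t + eps_jt
Mu = rng.normal(size=(J, r))          # unit-specific factor loadings
Lam = rng.normal(size=(r, T))         # latent time factors
y_ife = Mu @ Lam + 0.1 * rng.normal(size=(J, T))
```

In the additive world every unit responds identically to a time shock; in the interactive world the response depends on each unit's loadings.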
Before we cover any of the details of SCM though, we need to be very clear about what an average is as a concept.
\[ \bar{x} = \frac{1}{n} \sum_{x \in S} x = \tfrac{x_1 + x_2 + \cdots + x_n}{n}. \]
Suppose you have 12 apples and your friend has 18:
\[ 15 = \bar{x} = \frac{1}{2} \sum_{x \in \{12,18\}} x = (\tfrac{1}{2} \cdot 12) + (\tfrac{1}{2} \cdot 18) = \frac{12 + 18}{2}. \]
On average, you have 15 apples.
\[ \mathbf{x} = \begin{bmatrix} 12 & 18 \end{bmatrix} \in \mathbb{R}^{1 \times 2}. \]
Here, each column entry represents one person’s apples.
\[ \mathbf{w} = \begin{bmatrix} \frac{1}{2} \\[2mm] \frac{1}{2} \end{bmatrix} \in \mathbb{R}^{2 \times 1}. \]
\[ \bar{x} = \mathbf{x}\mathbf{w} = \begin{bmatrix} 12 & 18 \end{bmatrix} \begin{bmatrix} \tfrac{1}{2} \\[1mm] \tfrac{1}{2} \end{bmatrix} = 12 \cdot \tfrac{1}{2} + 18 \cdot \tfrac{1}{2} = 15. \] See? We still have 15 apples, on average.
\[ \mathbf{Y} = \begin{bmatrix} 12 & 18 \\ 30 & 22 \end{bmatrix} \in \mathbb{R}^{2 \times 2}. \]
Each row represents a snapshot in time, and each column represents a person.
To find the average number of apples per time period, we use the same weights as before (1/2 for each person). Multiplying the matrix by the weight vector gives a row-wise weighted sum:
\[ \mathbf{w} = \begin{bmatrix} 1/2 \\ 1/2 \end{bmatrix}, \quad \bar{\mathbf{x}} = \mathbf{Y} \mathbf{w} = \begin{bmatrix} 12 & 18 \\ 30 & 22 \end{bmatrix} \begin{bmatrix} 1/2 \\ 1/2 \end{bmatrix} = \begin{bmatrix} 12\cdot 1/2 + 18\cdot 1/2 \\ 30\cdot 1/2 + 22\cdot 1/2 \end{bmatrix} = \begin{bmatrix} 15 \\ 26 \end{bmatrix}. \]
Interpretation: Today, the average is 15 apples; tomorrow, the average is 26 apples.
This shows that row-wise averaging is just taking a weighted sum across the columns for each row. Each row “collapses” into a single number representing the average for that time period.
This concept generalizes to any time-series or panel data, like GDPs, incomes, or other outcomes measured across multiple units.
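The same computations in numpy, using the values from the examples above:

```python
import numpy as np

x = np.array([12, 18])            # apples held by each person
w = np.array([0.5, 0.5])          # equal weights
print(x @ w)                      # 15.0 -- the simple average

Y = np.array([[12, 18],
              [30, 22]])          # rows: time periods, columns: people
print(Y @ w)                      # [15. 26.] -- one average per time period
```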
You are at a bar with your friend. You earn $60k/year. Your friend earns $70k/year. The simple average of your incomes is:
\[ \mathbf{x} = \begin{bmatrix} 60 & 70 \end{bmatrix}, \quad \mathbf{w} = \begin{bmatrix} \frac{1}{2} \\[1mm] \frac{1}{2} \end{bmatrix} \]
\[ \bar{x} = \mathbf{x} \mathbf{w} = \begin{bmatrix} 60 & 70 \end{bmatrix} \begin{bmatrix} \frac{1}{2} \\[1mm] \frac{1}{2} \end{bmatrix} = \frac{1}{2}\cdot 60 + \frac{1}{2}\cdot 70 = 65 \text{ (k/year)} \]
Now Bill Gates suddenly walks in, with a net worth of about $107B, or \(1.07\times 10^{8}\) in our units of $k (for this example, we treat his net worth as his income). Taking the simple average of all three income levels:
\[ \mathbf{x} = \begin{bmatrix} 60 & 70 & 1.07\times 10^{8} \end{bmatrix}, \quad \mathbf{w} = \begin{bmatrix} \frac{1}{3} \\[1mm] \frac{1}{3} \\[1mm] \frac{1}{3} \end{bmatrix} \]
\[ \bar{x} = \mathbf{x} \mathbf{w} = \frac{1}{3}\cdot 60 + \frac{1}{3}\cdot 70 + \frac{1}{3}\cdot 1.07\times 10^{8} \approx 3.57\times 10^{7} \text{ (k/year)} \approx 35.7 \text{ billion/year} \]
Clearly, treating Bill Gates the same as the others skews the average.
If the bartender bragged the next day about earning $50k that night, we’d say: “It’s not that you earned $50k organically; a wealthy dude happened to be there that night. We can’t use this as evidence that your business is booming.”
In other words, if we want a meaningful “bar average,” Bill should likely get much less weight than the other patrons.
Instead of treating Bill as equal to you two, we can give each person a weight \(w_i \in [0,1]\) that decides how much they count in the average. This weight \(w_i\) plays the same role as the fraction in the arithmetic average, which was simply \(\tfrac{1}{n}\). The weighted average then takes the form:
\[ \bar{x}_{\text{weighted}} = \sum_{i=1}^{n} w_i x_i, \quad \sum_{i=1}^{n} w_i = 1. \]
Here, Bill can get a tiny weight (if any), and you and your friend get larger weights, giving a more reasonable “average bar income.”
Suppose we only care about you and your friend, and ignore Bill Gates (weight = 0):
\[ \mathbf{x} = \begin{bmatrix} 60 & 70 & 1.07\times 10^{8} \end{bmatrix}, \quad \mathbf{w} = \begin{bmatrix} 0.5 \\[1mm] 0.5 \\[1mm] 0 \end{bmatrix} \]
Then the weighted average is:
\[ \bar{x} = \mathbf{x} \mathbf{w} = 0.5\cdot 60 + 0.5\cdot 70 + 0 \cdot 1.07\times 10^{8} = 65 \text{ (k/year)} \]
The average is back to what it was when it was just you and your friend. Bill is clearly not representative of a bar, or even of a country; giving him as much weight as a regular patron does not make sense.
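And the bar example in numpy, with all incomes in $k/year as above:

```python
import numpy as np

x = np.array([60, 70, 1.07e8])         # incomes in $k/year; Bill Gates is the last entry
equal = np.full(3, 1 / 3)              # treat everyone the same
no_bill = np.array([0.5, 0.5, 0.0])    # give Bill zero weight

print(x @ equal)                       # ~3.57e7 ($k/year), i.e. about $35.7 billion/year
print(x @ no_bill)                     # 65.0 ($k/year)
```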
In the earlier examples, the weights we used were arbitrary. We split 50/50 between you and your friend, but there was no real reason for that choice. In real life, weights are usually unknown — and assigning them poorly can create nonsense averages.
Think back to the bar example: if Bill Gates walks in, and we give him the same weight as everyone else, the “average income” of the bar will be absurdly high. To represent the bar’s patrons, Gates should either receive a much smaller weight or none at all.
Suppose we have one unit exposed to terrorism in 1975 (the Basque Country) and a set of 16 other units that are not exposed. We want to estimate the effect of terrorism on GDP per capita by taking a weighted average of the control units, but how do we choose the weights?
Optimization solves this problem. Instead of us just making up weights, we choose the weights that minimize the pre-intervention gap between the Basque Country and the weighted donor units.
One optimization method we could use is least squares, in the form of a linear regression model that economists are quite familiar with. In multiple linear regression, the estimated coefficients can be interpreted as weights on our control group.
Here is the dataset, literally displayed in preparation for regression. We can see the Basque column as well as the donor columns. We wish to take a weighted average of these control-unit columns that closely approximates the Basque Country in the pre-intervention period, so that the weighted average provides a counterfactual estimate for the Basque Country in the post-intervention period.
Reminder: a weighted average can be written as
\[ \hat{\mathbf{y}}_1 = \mathbf{Y}_0 \mathbf{w}, \qquad \text{i.e.}\;\; \hat{y}_{1t} = \sum_{j \in \mathcal{J}_0} w_j \, y_{jt}, \qquad t \in \mathcal{T}_1 \cup \mathcal{T}_2. \]
year | Andalucia | Aragon | Baleares (Islas) | Basque | Canarias | Cantabria | Castilla Y Leon | Castilla-La Mancha | Cataluna | Comunidad Valenciana |
---|---|---|---|---|---|---|---|---|---|---|
1955 | 1.68873 | 2.28877 | 3.14396 | 3.85318 | 1.91438 | 2.55941 | 1.72915 | 1.32776 | 3.54663 | 2.57598 |
1956 | 1.7585 | 2.44516 | 3.34776 | 3.94566 | 2.07184 | 2.69387 | 1.83833 | 1.4151 | 3.69045 | 2.7385 |
1957 | 1.82762 | 2.6034 | 3.54963 | 4.03356 | 2.22608 | 2.82034 | 1.94766 | 1.50357 | 3.82683 | 2.89989 |
1958 | 1.85276 | 2.63903 | 3.64267 | 4.02342 | 2.22087 | 2.87903 | 1.97137 | 1.53142 | 3.87568 | 2.96351 |
1959 | 1.87803 | 2.67709 | 3.73486 | 4.01378 | 2.21344 | 2.94373 | 1.99514 | 1.55934 | 3.92174 | 3.02621 |
1960 | 2.01014 | 2.88146 | 4.05884 | 4.28592 | 2.35768 | 3.13703 | 2.13882 | 1.66752 | 4.24179 | 3.21929 |
1961 | 2.12918 | 3.09954 | 4.36025 | 4.57434 | 2.44573 | 3.32762 | 2.2395 | 1.75243 | 4.57534 | 3.36247 |
1962 | 2.28035 | 3.35918 | 4.64617 | 4.89896 | 2.64824 | 3.55534 | 2.45423 | 1.92045 | 4.83805 | 3.56998 |
1963 | 2.43102 | 3.61418 | 4.91153 | 5.19701 | 2.84476 | 3.77142 | 2.67224 | 2.0919 | 5.08133 | 3.76521 |
1964 | 2.50885 | 3.68009 | 5.0507 | 5.3389 | 2.95116 | 3.8394 | 2.77778 | 2.18259 | 5.1581 | 3.82369 |
\[ \mathbf{w}^{\ast} = \underset{\mathbf{w} \in \mathbb{R}^J}{\operatorname*{argmin}} \; \| \mathbf{y}_1 - \mathbf{Y}_0 \mathbf{w} \|_2^2 \]
Once we have \(\mathbf{w}^{\ast}\), we multiply each control unit’s series by its regression coefficient and sum across donors.
This gives the predicted Basque trajectory pre- and post-intervention.
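As a sketch, assuming the full dataset lives in a pandas DataFrame `df` indexed by year with one column per region (as in the table above; the variable names here are purely illustrative), the fit and prediction look like:

```python
import numpy as np

# `df`: pandas DataFrame indexed by year, one column per region (as above).
pre = df.index < 1975                                    # pre-intervention years
donors = [c for c in df.columns if c != "Basque"]

y1_pre = df.loc[pre, "Basque"].to_numpy()                # treated unit, pre-period
Y0_pre = df.loc[pre, donors].to_numpy()                  # donor matrix, pre-period

# w* = argmin_w ||y1 - Y0 w||_2^2   (plain least squares, no intercept, no constraints)
w_ols, *_ = np.linalg.lstsq(Y0_pre, y1_pre, rcond=None)

# Counterfactual trajectory for every year, pre- and post-intervention
y1_hat = df[donors].to_numpy() @ w_ols
```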
\[ \text{ATT} = T_{2}^{-1}\sum_{t \in \mathcal{T}_{2}}\left( y_{1t} - \hat{y}_{1t} \right) \]
\(y_{1t}\): observed outcome of the treated unit at time \(t\)
\(\hat{y}_{1t}\): OLS prediction at time \(t\)
Intuition: ATT measures how much the real, treated Basque GDP deviates from the model’s prediction (estimated via OLS in this case).
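Continuing the same sketch (reusing `df` and `y1_hat` from above), the ATT is just the mean post-1975 gap:

```python
post = df.index >= 1975                                  # post-intervention years
gap = df.loc[post, "Basque"].to_numpy() - y1_hat[post]   # observed minus predicted
att = gap.mean()                                         # ATT = average post-period gap
print(f"Estimated ATT: {att:.3f}")
```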
Here are the weights OLS returns to us for the 16 donor units:
regionname | Weight |
---|---|
Andalucia | 4.17551 |
Aragon | -1.50358 |
Baleares (Islas) | -0.0107201 |
Canarias | -0.0168126 |
Cantabria | 0.261694 |
Castilla Y Leon | 6.07601 |
Castilla-La Mancha | -2.6561 |
Cataluna | -0.945626 |
Comunidad Valenciana | 0.860017 |
Extremadura | -1.93895 |
Galicia | -2.96441 |
Madrid (Comunidad De) | 0.343366 |
Murcia (Region de) | -0.620705 |
Navarra (Comunidad Foral De) | 0.583705 |
Principado De Asturias | 0.51578 |
Rioja (La) | -0.94721 |
Notice that OLS assigns some positive and some negative weights. These coefficients are meant to tell us how similar each control unit is to the treated unit, but the negative weights can be hard to interpret substantively: which units are truly most similar to the Basque Country before terrorism happened? Is it Andalucia or Castilla Y Leon?
Here are some of the donors plotted against the Basque Country.
Much of this does not square nicely with intuition. Cataluna, which is very close to the Basque Country (both literally, in terms of GDP, and as a fellow French-border region in northern Spain), gets negative weight from OLS. Units that are much farther from the Basque Country (on top of having very different political and economic histories) get much more weight.
Here is our prediction using OLS. Unfortunately, while OLS fits the Basque Country very well before terrorism happened, the out-of-sample/post-1975 predictions exhibit far more variance than any control unit did in the pre-treatment period. The counterfactual has a weird, zig-zaggy shape. The prediction line suggests that the Basque economy would have fallen off a cliff had terrorism not happened. In other words, if the OLS estimate is to be taken seriously, the onset of terrorism saved the Basque economy… This is not a sensible finding historically, nor a very reasonable prediction statistically.
What if we could adjust the weights so the synthetic Basque stays within the range of the donor units?
In other words, we want the model predictions to lie inside the range spanned by the donors.
The plot below shows the area spanned by all donor regions in Spain (shaded), along with the Basque Country. This visualizes the full feasible range for any weighted average of donors.
The classic SCM estimator solves the following constrained least-squares problem:
\[ \begin{aligned} \mathbf{w}^{\mathrm{SCM}} &= \underset{\mathbf{w} \in \mathcal{W}_{\mathrm{conv}}}{\operatorname*{argmin}} \; \left\| \mathbf{y}_1 - \mathbf{Y}_0 \mathbf{w} \right\|_2^2, \\ \mathcal{W}_{\mathrm{conv}} &= \left\{ \mathbf{w} \in \mathbb{R}_{\ge 0}^{J} \;\middle|\; \mathbf{1}^\top \mathbf{w} = 1 \right\}, \end{aligned} \] where \(\mathbf{y}_1\) and \(\mathbf{Y}_0\) contain only the pre-intervention periods \(t \in \mathcal{T}_1\).
We want to combine the donor regions to best match the treated unit before the intervention. By requiring all weights to be nonnegative and sum to 1, the synthetic Basque is a weighted average that stays inside the range of the donors—it can’t “overshoot” or assign negative importance to any region.
Given the optimal weights \(\mathbf{w}^{\mathrm{SCM}}\), the SCM prediction for the treated unit is
\[ \hat{y}_{1t}^{\mathrm{SCM}} = \sum_{j \in \mathcal{J}_0} w_j^{\mathrm{SCM}} \, y_{jt}, \quad \forall \, t \in \mathcal{T}, \] or, in matrix form, \(\hat{\mathbf{y}}_1^{\mathrm{SCM}} = \mathbf{Y}_0 \mathbf{w}^{\mathrm{SCM}}\).
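Here is a minimal sketch of solving this constrained program with scipy's SLSQP solver, reusing `y1_pre`, `Y0_pre`, and `donors` from the OLS sketch above (a dedicated SCM package would handle this, plus covariates, more robustly):

```python
import numpy as np
from scipy.optimize import minimize

J = Y0_pre.shape[1]                           # number of donors

def sse(w):
    resid = y1_pre - Y0_pre @ w               # pre-period fit error
    return resid @ resid

res = minimize(
    sse,
    x0=np.full(J, 1.0 / J),                                        # start from equal weights
    bounds=[(0.0, 1.0)] * J,                                        # w_j >= 0
    constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}],   # weights sum to one
    method="SLSQP",
)
w_scm = res.x

y1_hat_scm = df[donors].to_numpy() @ w_scm    # synthetic Basque Country, all years
```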
Convex Combination
Validating SCM designs is important as well. Consider a few robustness checks:
Low-volatility series
Synthetic controls work best with aggregate, relatively smooth series.
Condition (informal): the pre-treatment volatility should be small relative to the signal, e.g. \[
\frac{\operatorname{sd}\big(\{y_{1t}\}_{t \in \mathcal{T}_{1}}\big)}{\overline{y}_{1,\mathcal{T}_{1}}} \quad\text{is small.}
\]
Extended pre-intervention period
A long pre-treatment window improves identification.
Condition: the number of pre-periods \(T_0\) should be sufficiently large: \[
T_0 \gg r \quad\text{(where $r$ is the effective number of latent factors).}
\]
Small, relevant donor pool
More donors can increase interpolation bias and overfitting. Prefer a smaller donor set.
Condition (selection): restrict donor count \(J\) relative to pre-period information, e.g. \[
J \lesssim c \cdot T_0 \quad\text{for some modest constant } c,
\]
Sparsity aids interpretability
Fewer nonzero weights make the synthetic control easier to interpret.
Condition: prefer weight vectors \(\mathbf{w}\) with small support: \[
\|\mathbf{w}\|_0 \ll J \quad\text{or use the } \ell_1 \text{ norm } \|\mathbf{w}\|_1
\]
Plot your treated unit versus the donors (seriously)
Visual diagnostics: check level/trend alignment and extreme donors before fitting.
Diagnostic: inspect \(\{y_{jt}\}_{j\in \mathcal{J}_{0}}\) vs. \(y_{1t}\) for \(t \in \mathcal{T}_{1}\).
Good pre-treatment fit matters
If pre-fit error is large, post-treatment inferences are unreliable.
Condition (quantitative): require a low pre-fit RMSE: \[
\text{RMSE}_{t \in \mathcal{T}_{1}} = \sqrt{\frac{1}{T_0}\sum_{t \in \mathcal{T}_{1}}\big(y_{1t}-\hat y_{1t}\big)^2}
\]
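For instance, reusing `y1_pre`, `Y0_pre`, and `w_scm` from the sketches above, the pre-fit RMSE is a one-liner:

```python
rmse_pre = float(np.sqrt(np.mean((y1_pre - Y0_pre @ w_scm) ** 2)))   # pre-treatment fit error
```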
Out-of-sample validation is essential
Use placebo tests (in-time and in-space) to check robustness.
Condition (placebo): the treated unit’s post-treatment gap should be an outlier relative to the placebo distribution. If \(\widehat{\text{ATT}}_1\) is the treated ATT and \(\{\widehat{\text{ATT}}_j\}_{j\neq 1}\) are placebo ATTs, \[
p\text{-value} \approx \frac{1}{J}\sum_{j\neq 1} \mathbf{1}\big(|\widehat{\text{ATT}}_j|\ge |\widehat{\text{ATT}}_1|\big)
\] should be small. Furthermore, for in-time placebos (pretending the treatment occurred at an earlier, pre-intervention date), the estimated effect at the placebo date should be near zero, and the ATT for the real treatment period should not change substantively.
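A sketch of the in-space placebo exercise under the same assumptions as above (`df` indexed by year, one column per region): each control region is treated as if it were the treated unit, refit on the remaining donors, and its pseudo-ATT is compared with the Basque ATT.

```python
import numpy as np
from scipy.optimize import minimize

def fit_convex_weights(y_pre, Y0_pre):
    """Convex (nonnegative, sum-to-one) weights minimizing pre-period SSE."""
    J = Y0_pre.shape[1]
    res = minimize(
        lambda w: np.sum((y_pre - Y0_pre @ w) ** 2),
        x0=np.full(J, 1.0 / J),
        bounds=[(0.0, 1.0)] * J,
        constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}],
        method="SLSQP",
    )
    return res.x

pre, post = df.index < 1975, df.index >= 1975

def att_for(unit):
    # Donor pool: every region except the unit itself and the actually treated one.
    pool = [c for c in df.columns if c not in (unit, "Basque")]
    w = fit_convex_weights(df.loc[pre, unit].to_numpy(), df.loc[pre, pool].to_numpy())
    gap = df.loc[post, unit].to_numpy() - df.loc[post, pool].to_numpy() @ w
    return gap.mean()

att_treated = att_for("Basque")
att_placebos = np.array([att_for(u) for u in df.columns if u != "Basque"])

p_value = np.mean(np.abs(att_placebos) >= np.abs(att_treated))
print(f"Placebo p-value: {p_value:.3f}")
```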