The Synthetic Control Method

2025-09-13

Introduction to the Synthetic Control Method (SCM)

  • What is SCM?
    A quasi-experimental method used to estimate the causal impact of policies, marketing campaigns, or other treatments of interest.

  • Historical roots
    Originates from the comparative case study framework:

    • A single (or small number of) treated units.
    • Compare against untreated units to infer causal effects.
  • Connection to Difference-in-Differences (DiD)

    • DiD assumes outcomes follow:
      \[ y_{jt}(0) = \mu_j + \lambda_t + \epsilon_{jt} \] where \(\mu_j\) are unobserved, time-invariant unit factors and \(\lambda_t\) are time-varying factors common to all units (note that they enter additively)

    • SCM generalizes this with an interactive fixed effects model, where time factors may affect units differently:
      \[ y_{jt}(0) = \mathbf{\mu}_j^\top \mathbf{\lambda}_t + \epsilon_{jt} \] where \(\mathbf{\mu}_j \in \mathbb{R}^r\) and \(\mathbf{\lambda}_t \in \mathbb{R}^r\) are latent factor vectors.

What Is An Average?

Before we cover any of the details of SCM, though, we need to be very clear about what an average is as a concept.

  • The arithmetic mean of a finite set of numbers \(S = \{x_1, x_2, \dots, x_n\} \subset \mathbb{R}\) is:

\[ \bar{x} = \frac{1}{n} \sum_{x \in S} x = \tfrac{x_1 + x_2 + \cdots + x_n}{n}. \]

  • For example, say you have 12 apples and your friend has 18. We can write this as the set \(A = \{12, 18\} \subset \mathbb{R}\):

\[ 15 = \bar{x} = \frac{1}{2} \sum_{x \in \{12,18\}} x = (\tfrac{1}{2} \cdot 12) + (\tfrac{1}{2} \cdot 18) = \frac{12 + 18}{2}. \]

On average, you have 15 apples.


  • We can also write the same example as a row vector:

\[ \mathbf{x} = \begin{bmatrix} 12 & 18 \end{bmatrix} \in \mathbb{R}^{1 \times 2}. \]

Here, each column entry represents one person’s apples.

  • To compute the average using linear algebra, we use weights. Since there are two people, each gets weight \(1/2\):

\[ \mathbf{w} = \begin{bmatrix} \frac{1}{2} \\[2mm] \frac{1}{2} \end{bmatrix} \in \mathbb{R}^{2 \times 1}. \]

  • Then, the average is just a weighted sum (technically called the dot product), which in vector form is:

\[ \bar{x} = \mathbf{x}\mathbf{w} = \begin{bmatrix} 12 & 18 \end{bmatrix} \begin{bmatrix} \tfrac{1}{2} \\[1mm] \tfrac{1}{2} \end{bmatrix} = 12 \cdot \tfrac{1}{2} + 18 \cdot \tfrac{1}{2} = 15. \] See? We still have 15 apples, on average.


  • Let’s extend the idea of averaging to multiple time periods. Suppose today you have 12 apples and tomorrow 30, while your friend has 18 today and 22 tomorrow. We can put this into a matrix, where each row is a time period and each column is a person:

\[ \mathbf{Y} = \begin{bmatrix} 12 & 18 \\ 30 & 22 \end{bmatrix} \in \mathbb{R}^{2 \times 2}. \]

  • Each row represents a snapshot in time, and each column represents a person.

  • To find the average number of apples per time period, we use the same weights as before (1/2 for each person). Multiplying the matrix by the weight vector gives a row-wise weighted sum:

\[ \mathbf{w} = \begin{bmatrix} 1/2 \\ 1/2 \end{bmatrix}, \quad \bar{\mathbf{x}} = \mathbf{Y} \mathbf{w} = \begin{bmatrix} 12 & 18 \\ 30 & 22 \end{bmatrix} \begin{bmatrix} 1/2 \\ 1/2 \end{bmatrix} = \begin{bmatrix} 12\cdot 1/2 + 18\cdot 1/2 \\ 30\cdot 1/2 + 22\cdot 1/2 \end{bmatrix} = \begin{bmatrix} 15 \\ 26 \end{bmatrix}. \]

  • Interpretation: Today, the average is 15 apples; tomorrow, the average is 26 apples.

  • This shows that row-wise averaging is just taking a weighted sum across the columns for each row. Each row “collapses” into a single number representing the average for that time period.

  • This concept generalizes to any time-series or panel data, like GDPs, incomes, or other outcomes measured across multiple units.
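
To make this concrete, here is a minimal numpy sketch (assuming numpy is available) that reproduces the apple computations above: the dot product for a single period and the matrix-vector product for two periods.

```python
import numpy as np

# Single time period: apples as a row vector, equal weights for the two people.
x = np.array([12, 18])
w = np.array([0.5, 0.5])
print(x @ w)          # 15.0 -- the dot product is the arithmetic mean

# Two time periods: each row is a day, each column is a person.
Y = np.array([[12, 18],
              [30, 22]])
print(Y @ w)          # [15. 26.] -- one weighted average per row (time period)
```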

What Are We Weighting For? Introducing Weighted Averages

You are at a bar with your friend. You earn $60k/year. Your friend earns $70k/year. The simple average of your incomes is:

\[ \mathbf{x} = \begin{bmatrix} 60 & 70 \end{bmatrix}, \quad \mathbf{w} = \begin{bmatrix} \frac{1}{2} \\[1mm] \frac{1}{2} \end{bmatrix} \]

\[ \bar{x} = \mathbf{x} \mathbf{w} = \begin{bmatrix} 60 & 70 \end{bmatrix} \begin{bmatrix} \frac{1}{2} \\[1mm] \frac{1}{2} \end{bmatrix} = \frac{1}{2}\cdot 60 + \frac{1}{2}\cdot 70 = 65 \text{ (k/year)} \]

Now, suddenly, Bill Gates walks in, with a net worth of roughly $107B. Treating his net worth as a third income and taking the simple average of all three:

\[ \mathbf{x} = \begin{bmatrix} 60 & 70 & 1.07\times 10^{11} \end{bmatrix}, \quad \mathbf{w} = \begin{bmatrix} \frac{1}{3} \\[1mm] \frac{1}{3} \\[1mm] \frac{1}{3} \end{bmatrix} \]

\[ \bar{x} = \mathbf{x} \mathbf{w} = \frac{1}{3}\cdot 60 + \frac{1}{3}\cdot 70 + \frac{1}{3}\cdot 1.07\times 10^{11} \approx 35.7 \text{ billion/year} \]

Clearly, treating Bill Gates the same as the others skews the average.

  • If the bartender bragged the next day about earning $50k that night, we’d say: “It’s not that you earned $50k organically; a wealthy dude happened to be there that night. We can’t use this as evidence that your business is booming.”

  • In other words, if we want a meaningful “bar average,” Bill should likely get much less weight than the other patrons.


Instead of treating Bill as equal to the two of you, we can give each person a weight \(w_i \in [0,1]\) that decides how much they count in the average. This weight \(w_i\) plays the same role as the fraction in the arithmetic mean, which was simply \(\tfrac{1}{n}\) for everyone. A weighted average takes the form:

\[ \bar{x}_{\text{weighted}} = \sum_{i=1}^{n} w_i x_i, \quad \sum_{i=1}^{n} w_i = 1. \]

Here, Bill can get a tiny weight (if any), and you and your friend get larger weights, giving a more reasonable “average bar income.”


Suppose we only care about you and your friend, and ignore Bill Gates (weight = 0):

\[ \mathbf{x} = \begin{bmatrix} 60 & 70 & 1.07\times 10^{11} \end{bmatrix}, \quad \mathbf{w} = \begin{bmatrix} 0.5 \\[1mm] 0.5 \\[1mm] 0 \end{bmatrix} \]

Then the weighted average is:

\[ \bar{x} = \mathbf{x} \mathbf{w} = 0.5\cdot 60 + 0.5\cdot 70 + 0 \cdot 1.07\times 10^{11} = 65 \text{ (k/year)} \]

Notice that the average is back to what it was when it was just you and your friend. Bill is clearly not representative of a bar, or even of a country; giving him as much weight as regular humans does not make sense.
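
A short numpy sketch of the same comparison, using the numbers from the bar example (equal weights versus weights that zero out Bill):

```python
import numpy as np

# Incomes as used above: you (60), your friend (70), and Bill Gates (1.07e11).
x = np.array([60.0, 70.0, 1.07e11])

w_equal = np.full(3, 1 / 3)            # naive equal weighting
print(x @ w_equal)                     # ~3.57e10: Bill dominates the average

w_custom = np.array([0.5, 0.5, 0.0])   # give Bill zero weight
print(x @ w_custom)                    # 65.0: the sensible "bar average" again
```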

How to Choose the Weights?

  • In the earlier examples, the weights we used were arbitrary. We split 50/50 between you and your friend, but there was no real reason for that choice. In real life, weights are usually unknown — and assigning them poorly can create nonsense averages.

  • Think back to the bar example: if Bill Gates walks in, and we give him the same weight as everyone else, the “average income” of the bar will be absurdly high. To represent the bar’s patrons, Gates should either receive a much smaller weight or none at all.

  • Suppose we have one unit exposed to terrorism starting in 1975 (the Basque Country) and a set of 16 other units that are not exposed. We want to estimate the effect of terrorism on GDP per capita by taking a weighted average of the control units, but how do we choose the weights?

  • Optimization solves this problem. Instead of us just making up weights, we choose the weights that minimize the pre-intervention gap between the Basque Country and the weighted donor units.

Using Least Squares

One method of optimization we could use is least squares, in the form of a linear regression model that economists are quite familiar with. In multiple linear regression, the estimated coefficients can be interpreted as weights for our control group.

  • \(\mathbf{y}_1 \in \mathbb{R}^{T}\): outcome vector (e.g., GDP per capita) for the treated unit (Basque Country) over all \(T\) time periods
  • \(\mathbf{Y}_0 \in \mathbb{R}^{T \times J}\): donor matrix; each column is a control unit, each row is a time period (\(J=16\) control units in this case)
  • \(\mathbf{w} \in \mathbb{R}^{J}\): vector of donor weights where each entry \(w_j\) indicates the contribution of control unit \(j \in \mathcal{J}_0\) to the weighted average of the treated unit
  • \(\mathcal{T}_1 = \{1,\dots,T_0\}\): set of pre-treatment periods, with cardinality \(|\mathcal{T}_1| = T_0\)
  • \(\mathcal{T}_2 = \{T_0+1,\dots,T\}\): set of post-treatment periods, with cardinality \(|\mathcal{T}_2| = T - T_0\)
  • \(\mathcal{J}_0 = \{1,\dots,J\}\): set of untreated donor units, with cardinality \(|\mathcal{J}_0| = J\)

Here is the dataset, displayed as it enters the regression. We can see the Basque column as well as the donor columns. We want a weighted average of these control-unit columns that approximates the Basque Country well in the pre-intervention period, so that the same weighted average provides a counterfactual estimate for the Basque Country in the post-intervention period.

Reminder: a weighted average can be written as

\[ \hat{y}_{1t} = \left(\mathbf{Y}_0 \mathbf{w}\right)_t = \sum_{j \in \mathcal{J}_0} w_j \, y_{jt}, \qquad t \in \mathcal{T}_1 \cup \mathcal{T}_2. \]

year Andalucia Aragon Baleares (Islas) Basque Canarias Cantabria Castilla Y Leon Castilla-La Mancha Cataluna Comunidad Valenciana
1955 1.68873 2.28877 3.14396 3.85318 1.91438 2.55941 1.72915 1.32776 3.54663 2.57598
1956 1.7585 2.44516 3.34776 3.94566 2.07184 2.69387 1.83833 1.4151 3.69045 2.7385
1957 1.82762 2.6034 3.54963 4.03356 2.22608 2.82034 1.94766 1.50357 3.82683 2.89989
1958 1.85276 2.63903 3.64267 4.02342 2.22087 2.87903 1.97137 1.53142 3.87568 2.96351
1959 1.87803 2.67709 3.73486 4.01378 2.21344 2.94373 1.99514 1.55934 3.92174 3.02621
1960 2.01014 2.88146 4.05884 4.28592 2.35768 3.13703 2.13882 1.66752 4.24179 3.21929
1961 2.12918 3.09954 4.36025 4.57434 2.44573 3.32762 2.2395 1.75243 4.57534 3.36247
1962 2.28035 3.35918 4.64617 4.89896 2.64824 3.55534 2.45423 1.92045 4.83805 3.56998
1963 2.43102 3.61418 4.91153 5.19701 2.84476 3.77142 2.67224 2.0919 5.08133 3.76521
1964 2.50885 3.68009 5.0507 5.3389 2.95116 3.8394 2.77778 2.18259 5.1581 3.82369
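
For concreteness, here is one way a slice of this wide panel might be set up in pandas. The values are copied from the rows shown above, and only two of the sixteen donor columns are included, so this is purely illustrative.

```python
import pandas as pd

# A small slice of the panel above, in wide format: one column per region.
df = pd.DataFrame({
    "year":     [1955, 1956, 1957, 1958, 1959],
    "Basque":   [3.85318, 3.94566, 4.03356, 4.02342, 4.01378],
    "Cataluna": [3.54663, 3.69045, 3.82683, 3.87568, 3.92174],
    "Aragon":   [2.28877, 2.44516, 2.60340, 2.63903, 2.67709],
}).set_index("year")

y1 = df["Basque"]                # treated-unit outcome vector
Y0 = df.drop(columns="Basque")   # donor matrix (here just two of the donors)
print(Y0)
```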

  • We can choose weights to minimize the squared pre-intervention difference between Basque and the weighted donors:

\[ \mathbf{w}^{\ast} = \underset{\mathbf{w} \in \mathbb{R}^J}{\operatorname*{argmin}} \; \sum_{t \in \mathcal{T}_1} \Big( y_{1t} - \sum_{j \in \mathcal{J}_0} w_j \, y_{jt} \Big)^{2} \]

  • In practice, this is just OLS on the pre-1975 period, where \(\mathbf{y}_1\) is the Basque GDP column and \(\mathbf{Y}_0\) collects the columns of the donor pool. The regression coefficients (the betas, \(\beta\), from introductory econometrics) are the weights.

  • Once we have \(\mathbf{w}^{\ast}\), we multiply each control unit’s values by its regression coefficient and sum across donors.

  • This gives the predicted Basque trajectory pre- and post-intervention.


  • The average treatment effect on the treated (ATT) can be expressed as the mean difference between observed outcomes and the weighted average in the post-intervention period:

\[ \text{ATT} = \frac{1}{|\mathcal{T}_{2}|}\sum_{t \in \mathcal{T}_{2}}\left( y_{1t} - \hat{y}_{1t} \right) \]

  • \(y_{1t}\): observed outcome of the treated unit at time \(t\)

  • \(\hat{y}_{1t}\): OLS prediction at time \(t\)

Intuition: ATT measures how much the real, treated Basque GDP deviates from the model’s prediction (estimated via OLS in this case).
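
A minimal, self-contained sketch of this procedure, using simulated data in place of the real Basque panel (the names Y0, y1, and T0 simply mirror the notation above):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated stand-in for the panel: T periods, first T0 are pre-treatment, J donors.
T, T0, J = 43, 20, 16
Y0 = np.cumsum(rng.normal(0.1, 0.05, size=(T, J)), axis=0) + 2.0       # donor outcomes
y1 = Y0[:, :3] @ np.array([0.5, 0.3, 0.2]) + rng.normal(0, 0.02, T)    # "treated" unit
y1[T0:] -= 0.5                                                         # fake treatment effect

# OLS on the pre-treatment rows: the coefficients act as donor weights.
w_ols, *_ = np.linalg.lstsq(Y0[:T0], y1[:T0], rcond=None)

# Weighted average of the donors over all periods = the OLS counterfactual.
y1_hat = Y0 @ w_ols

# ATT: mean post-treatment gap between observed and predicted outcomes.
att = np.mean(y1[T0:] - y1_hat[T0:])
print(np.round(w_ols, 3), round(att, 3))
```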

Least Squares Cont.

Here are the weights OLS returns to us for the 16 donor units:

regionname Weight
Andalucia 4.17551
Aragon -1.50358
Baleares (Islas) -0.0107201
Canarias -0.0168126
Cantabria 0.261694
Castilla Y Leon 6.07601
Castilla-La Mancha -2.6561
Cataluna -0.945626
Comunidad Valenciana 0.860017
Extremadura -1.93895
Galicia -2.96441
Madrid (Comunidad De) 0.343366
Murcia (Region de) -0.620705
Navarra (Comunidad Foral De) 0.583705
Principado De Asturias 0.51578
Rioja (La) -0.94721

Notice that OLS assigns some positive and some negative weights. In principle, these tell us how each control unit contributes to reproducing the treated unit, but the negative weights are hard to interpret substantively: which units are truly most similar to the Basque Country before terrorism happened? Is it Andalucia or Castilla Y Leon?

Here are some of the donor regions plotted against the Basque Country.

Much of this does not square nicely with intuition. Cataluna, which is very close to the Basque Country (both in terms of GDP and geographically, as a fellow French-border region in northern Spain), gets a negative weight from OLS. The units that are much farther from the Basque Country (and that have quite different political and economic histories) get much more weight.

Here is our prediction using OLS. Unfortunately, while OLS fits the Basque Country very well before terrorism happened, the out-of-sample (post-1975) predictions exhibit far more variance than any control unit did in the pre-treatment period. The counterfactual zig-zags strangely. The prediction line suggests that the Basque economy would have fallen off a cliff had terrorism not happened. In other words, if the OLS estimate is to be taken seriously, the onset of terrorism saved the Basque economy… This is not a sensible finding historically, or a reasonable prediction statistically.

How To Change The Weights

  • What if we could adjust the weights so the synthetic Basque stays within the range of the donor units?

  • In other words, we want the model predictions to lie within the range spanned by the donors.

  • The plot below shows the area spanned by all donor regions in Spain (shaded), along with the Basque Country. This visualizes the full feasible range for any weighted average of donors.

Classic Synthetic Control Model

The classic SCM estimator solves the following constrained least-squares problem:

\[ \begin{aligned} \mathbf{w}^{\mathrm{SCM}} &= \underset{\mathbf{w} \in \mathcal{W}_{\mathrm{conv}}}{\operatorname*{argmin}} \; \sum_{t \in \mathcal{T}_{1}} \Big( y_{1t} - \sum_{j \in \mathcal{J}_0} w_j \, y_{jt} \Big)^{2}, \\ \mathcal{W}_{\mathrm{conv}} &= \left\{ \mathbf{w} \in \mathbb{R}_{\ge 0}^{J} \;\middle|\; \mathbf{1}^\top \mathbf{w} = 1 \right\} \end{aligned} \]

We want to combine the donor regions to best match the treated unit before the intervention. By requiring all weights to be nonnegative and sum to 1, the synthetic Basque is a weighted average that stays inside the range of the donors—it can’t “overshoot” or assign negative importance to any region.

Given the optimal weights \(\mathbf{w}^{\mathrm{SCM}}\), the SCM prediction for the treated unit is

\[ \hat{y}_{1t}^{\mathrm{SCM}} = \left(\mathbf{Y}_0 \mathbf{w}^{\mathrm{SCM}}\right)_t = \sum_{j \in \mathcal{J}_0} w_j^{\mathrm{SCM}} \, y_{jt}, \quad \forall \, t \in \mathcal{T}_1 \cup \mathcal{T}_2 \]
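
Here is a sketch of one way the convex-weights problem could be solved with a generic solver (dedicated SCM software typically uses its own routines). scipy.optimize's SLSQP method handles the nonnegativity and sum-to-one constraints, and the data are again simulated stand-ins rather than the real Basque panel.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Simulated pre-treatment block: T0 periods, J donors, plus a treated series
# that is (by construction) close to a convex combination of two donors.
T0, J = 20, 16
Y0_pre = np.cumsum(rng.normal(0.1, 0.05, size=(T0, J)), axis=0) + 2.0
y1_pre = Y0_pre[:, :2] @ np.array([0.85, 0.15]) + rng.normal(0, 0.02, T0)

# Minimize the pre-treatment squared error subject to w >= 0 and sum(w) = 1.
loss = lambda w: np.sum((y1_pre - Y0_pre @ w) ** 2)
res = minimize(
    loss,
    x0=np.full(J, 1 / J),                                           # start at equal weights
    bounds=[(0.0, 1.0)] * J,                                        # nonnegativity
    constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}],   # weights sum to one
    method="SLSQP",
)
w_scm = res.x
print(np.round(w_scm, 3))   # nonnegative, (near-)sparse weights summing to ~1
```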

SCM for Basque

  • Here is the result of the constrained model: the Basque Country is best reproduced by 85% Cataluna and 15% Madrid.

Convex Combination

Placebos

Validating SCM designs is important as well. Consider a few robustness checks:

Key Takeaways for Classic SCM

  • Low-volatility series
    Synthetic controls work best with aggregate, relatively smooth series.
    Condition (informal): the pre-treatment volatility should be small relative to the signal, e.g. \[ \frac{\operatorname{sd}\big(\{y_{1t}\}_{t \in \mathcal{T}_{1}}\big)}{\overline{y}_{1,\mathcal{T}_{1}}} \quad\text{is small.} \]

  • Extended pre-intervention period
    A long pre-treatment window improves identification.
    Condition: the number of pre-periods \(T_0\) should be sufficiently large: \[ T_0 \gg r \quad\text{(where $r$ is the effective number of latent factors).} \]

  • Small, relevant donor pool
    More donors can increase interpolation bias and overfitting. Prefer a smaller donor set.
    Condition (selection): restrict donor count \(J\) relative to pre-period information, e.g. \[ J \lesssim c \cdot T_0 \quad\text{for some modest constant } c. \]

  • Sparsity aids interpretability
    Fewer nonzero weights make the synthetic control easier to interpret.
    Condition: prefer weight vectors \(\mathbf{w}\) with small support: \[ \|\mathbf{w}\|_0 \ll J, \quad\text{or penalize the } \ell_1 \text{ norm } \|\mathbf{w}\|_1 \text{ to encourage sparsity.} \]

  • Plot your treated unit versus the donors (seriously)
    Visual diagnostics: check level/trend alignment and extreme donors before fitting.
    Diagnostic: inspect \(\{y_{jt}\}_{j\in \mathcal{J}_{0}}\) vs. \(y_{1t}\) for \(t \in \mathcal{T}_{1}\).

  • Good pre-treatment fit matters
    If pre-fit error is large, post-treatment inferences are unreliable.
    Condition (quantitative): require a low pre-fit RMSE: \[ \text{RMSE}_{t \in \mathcal{T}_{1}} = \sqrt{\frac{1}{T_0}\sum_{t \in \mathcal{T}_{1}}\big(y_{1t}-\hat y_{1t}\big)^2} \]

  • Out-of-sample validation is essential
    Use placebo tests (in-time and in-space) to check robustness.
    Condition (placebo): the treated unit’s post-treatment gap should be an outlier relative to the placebo distribution. If \(\widehat{\text{ATT}}_1\) is the treated ATT and \(\{\widehat{\text{ATT}}_j\}_{j\neq 1}\) are placebo ATTs, \[ p\text{-value} \approx \frac{1}{J}\sum_{j\neq 1} \mathbf{1}\big(|\widehat{\text{ATT}}_j|\ge |\widehat{\text{ATT}}_1|\big) \] should be small. Furthermore, the ATT for the real treatment period should not change substantively given reasonable in-time placebos.
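
As a final sketch, the placebo p-value above is just the share of placebo gaps at least as large as the treated unit's gap. The ATT numbers below are made up purely to illustrate the computation.

```python
import numpy as np

# Hypothetical ATT estimates: the treated unit's gap and the gaps from re-running
# the estimator on each donor as if it had been treated (in-space placebos).
att_treated = -0.65
att_placebos = np.array([ 0.05, -0.12,  0.20, -0.08,  0.15, -0.03,  0.10, -0.18,
                          0.07, -0.09,  0.11, -0.06,  0.04, -0.14,  0.02,  0.13])

# Share of placebo gaps at least as extreme as the treated unit's gap.
p_value = np.mean(np.abs(att_placebos) >= np.abs(att_treated))
print(p_value)   # small -> the treated effect is an outlier among the placebos
```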