from mlsynth.mlsynth import SCMO

# Run simulation (simulate() is the data-generating function described later in this post)
df = simulate(seed=10000, r=3)

config = {
    "df": df,
    "outcome": 'Gross Booking Value',
    "treat": 'Experiences',
    "unitid": 'Market',
    "time": 'Week',
    "display_graphs": True,
    "save": False,
    "counterfactual_color": ["blue"],
    "addout": list(df.columns[4:]),
    "method": "both"
}

arco = SCMO(config).fit()
Synthetic Controls With More Than One Outcome
Sometimes, analysts have multiple metrics at their disposal with which to estimate synthetic control models. That is, multiple relevant variables may plausibly predict the target/outcome variable that we care about, and plenty of papers have commented on this fact before. Some analysts even argue for the surrogate approach, where we treat other outcomes or affected units as a kind of instrument for the counterfactual trajectory of the target unit.
However, all of this is uncommon in practice. As it turns out, most people in academia and industry who use synthetic controls rely on a single focal outcome in their analyses. Perhaps they will adjust their unit weights with auxiliary covariates, weighted by a diagonal matrix \(\mathbf{V}\) in most applications. The point of this matrix is basically to assist the main optimization in choosing the unit weights. However, even this is limited by the number of pre-treatment periods you have: if you have more covariates than pre-treatment periods, you cannot estimate the regression. Recent papers by econometricians have tried to get around this. This blog post covers a few of these recent papers, which advocate for using multiple outcomes directly. I explain the econometric method and apply it in a simulated setting.
Notation
I adopt the following notational conventions throughout. Scalars are denoted by lowercase italicized letters such as \(g\), vectors are denoted by bold lowercase letters such as \(\mathbf{v} \in \mathbb{R}^n\), and matrices are denoted by bold uppercase letters such as \(\mathbf{A} \in \mathbb{R}^{m \times n}\). Sets are denoted by calligraphic letters, such as \(\mathcal{N}\) or \(\mathcal{T}\), and the cardinality of a finite set \(\mathcal{A}\) is written \(|\mathcal{A}|\). For a matrix \(\mathbf{A}\), we write \(\| \mathbf{A} \|_F^2 = \sum_{i,j} A_{ij}^2\) for the squared Frobenius norm and \(\| \mathbf{A} \|_2\) for its spectral norm. For any \(n \in \mathbb{N}\), we write \(\Delta^{n}\) for the probability simplex of nonnegative vectors in \(\mathbb{R}^n\) whose entries sum to one; for the donor weights, this is
\[ \Delta^{N_0} = \left\{ \mathbf{w} \in \mathbb{R}^{N_0} \mid w_j \geq 0 \: \forall j \in \mathcal{N}_0, \sum_{j \in \mathcal{N}_0} w_j = 1 \right\}. \]
Matrix multiplication is written in the usual way; products like \(\mathbf{A}^\top \mathbf{b}\), \(\mathbf{A} \mathbf{B}\), or \(\mathbf{b}^\top \mathbf{c}\) denote standard matrix-vector or matrix-matrix products. Let \(\mathcal{N} = \{1, 2, \dots, N\}\) index the units in the panel, with unit \(1\) denoting the treated unit and \(\mathcal{N}_0 = \{2, 3, \dots, N\}\) denoting the set of control units. We write \(N_0 = |\mathcal{N}_0| = N - 1\) for its cardinality. Let \(\mathcal{T} \subset \mathbb{N}\) be a finite set of time periods with \(|\mathcal{T}| = T\). Let \(T_0 < T\) denote the final pre-treatment period. Define the pre-treatment period as \(\mathcal{T}_1 = \{1, 2, \dots, T_0\}\) with \(|\mathcal{T}_1| = T_0\) and the post-treatment period as \(\mathcal{T}_2 = \{T_0 + 1, \dots, T\}\) with \(|\mathcal{T}_2| = T - T_0\). Let \(\mathbf{y}_j \in \mathbb{R}^{T}\) denote the time series for our focal outcome of interest \(\forall \, j \in \mathcal{N}\). We write \(\mathbf{y}_1 = \begin{bmatrix} y_{1,1} & \cdots & y_{1,T} \end{bmatrix}^\top \in \mathbb{R}^T\) for the treated unit’s outcome vector. Let \(\mathbf{Y}_0 \in \mathbb{R}^{T \times N_0}\) denote the matrix that stacks the outcome vectors for all control units \(j \in \mathcal{N}_0\). Each time series \(\mathbf{y}_j\) is partitioned into a pre-treatment vector \(\mathbf{y}_j^{\text{pre}} \in \mathbb{R}^{T_0}\) and a post-treatment vector \(\mathbf{y}_j^{\text{post}} \in \mathbb{R}^{T - T_0}\). Likewise, the control matrix \(\mathbf{Y}_0\) is partitioned into \(\mathbf{Y}_0^{\text{pre}} \in \mathbb{R}^{T_0 \times N_0}\) and \(\mathbf{Y}_0^{\text{post}} \in \mathbb{R}^{(T - T_0) \times N_0}\).
Now, let \(K\) denote the total number of outcomes, where \(\ell = 1\) indexes the focal outcome and \(\ell = 2, \dots, K\) index the auxiliary ones. For each outcome \(\ell \in \{1, 2, \dots, K\}\), let \(\mathbf{y}_j^{(\ell)} \in \mathbb{R}^T\) denote the \(\ell\)-th outcome vector for unit \(j\), and let \(\mathbf{Y}_0^{(\ell)} \in \mathbb{R}^{T \times N_0}\) denote the matrix collecting the \(\ell\)-th outcome across all controls. These are also split into pre- and post-treatment periods. We define the “stacking” operation, denoted by a prime, as the vertical concatenation of the outcomes. Specifically, let \(\mathbf{Y}_0^{\prime} \in \mathbb{R}^{KT_{0} \times N_0}\) and \(\mathbf{y}_1^{\prime} \in \mathbb{R}^{KT_0}\) denote the vertically stacked matrix and vector:
\[ \mathbf{Y}_0^{\prime} = \begin{bmatrix} \mathbf{Y}_0^{(1), \text{pre}} \\ \mathbf{Y}_0^{(2), \text{pre}} \\ \vdots \\ \mathbf{Y}_0^{(K), \text{pre}} \end{bmatrix}, \quad \mathbf{y}_1^{\prime} = \begin{bmatrix} \mathbf{y}_1^{(1), \text{pre}} \\ \mathbf{y}_1^{(2), \text{pre}} \\ \vdots \\ \mathbf{y}_1^{(K), \text{pre}} \end{bmatrix}. \]
This is simply stacking the pre-intervention period metrics atop one another. If we have two metrics and 19 pretreatment periods, we have a vector of 38 rows. This is done for the matrices too: if we have 10 donors, 19 pretreatment periods, and two metrics (the main outcome and another one), we now have a matrix with 10 columns and 38 rows covering all metrics across those same 19 pre-treatment periods.
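To make the stacking concrete, here is a minimal numpy sketch (with made-up array names) of the operation for two outcomes, ten donors, and nineteen pre-treatment weeks:

import numpy as np

T0, N0 = 19, 10          # pre-treatment periods, donor units
rng = np.random.default_rng(0)

# Pre-treatment donor matrices for the focal and one auxiliary outcome
Y0_main_pre = rng.normal(size=(T0, N0))   # focal outcome, T0 x N0
Y0_aux_pre = rng.normal(size=(T0, N0))    # auxiliary outcome, T0 x N0

# Treated unit's pre-treatment vectors for each outcome
y1_main_pre = rng.normal(size=T0)
y1_aux_pre = rng.normal(size=T0)

# The "prime" (stacked) objects: 38 rows, 10 columns / 38 rows
Y0_prime = np.vstack([Y0_main_pre, Y0_aux_pre])       # (2*T0, N0) = (38, 10)
y1_prime = np.concatenate([y1_main_pre, y1_aux_pre])  # (2*T0,)   = (38,)

print(Y0_prime.shape, y1_prime.shape)  # (38, 10) (38,)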
Standard Synthetic Control
Before introducing the stacked estimator, however, I begin by reviewing the standard synthetic control method. Given pre-treatment data for a treated unit and a set of control units, the canonical synthetic control estimator solves
\[ \mathbf{w}^\ast = \underset{\mathbf{w} \in \Delta^{N_0}}{\operatorname*{argmin}} \; \|\mathbf{y}_1^{\text{pre}} - \mathbf{Y}_0^{\text{pre}} \mathbf{w} \|_F^2. \]
This is a constrained least squares program in which we regress the treated unit’s pre-treatment outcomes onto the control matrix under the constraint that \(\mathbf{w}\) lies in the simplex \(\Delta^{N_0}\). For the purposes of this blog post, think of this as solving a single system of linear equations, one equation per pre-treatment period. Once the optimal weights \(\mathbf{w}^\ast\) are estimated, the out-of-sample estimates are obtained by applying the same weights to the post-treatment control matrix
\[ \mathbf{y}^{\text{SC}}_{1,\text{post}} = \mathbf{Y}_0^{\text{post}} \mathbf{w}^\ast, \]
with the concatenation of the in-sample and out-of-sample vectors giving the model’s full prediction path. The estimated treatment effect at each post-treatment time point is then the difference between the observed and predicted outcomes: \(\hat{\tau}_{1t} = y_{1t} - \hat{y}_{1t}\) for \(t \in \mathcal{T}_2\).
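To fix ideas, here is a minimal sketch of this simplex-constrained regression using cvxpy on fabricated toy data. The helper name and the toy numbers are my own; this is not mlsynth's internal solver.

import numpy as np
import cvxpy as cp

def sc_weights(y1_pre, Y0_pre):
    """Simplex-constrained least squares for the classic SC weights."""
    N0 = Y0_pre.shape[1]
    w = cp.Variable(N0, nonneg=True)                 # nonnegativity constraint
    problem = cp.Problem(cp.Minimize(cp.sum_squares(y1_pre - Y0_pre @ w)),
                         [cp.sum(w) == 1])           # weights sum to one
    problem.solve()
    return w.value

# Toy illustration: 30 periods, 8 donors, the first 20 periods are "pre-treatment"
rng = np.random.default_rng(1)
Y0 = rng.normal(size=(30, 8))
y1 = Y0 @ np.array([0.5, 0.5, 0, 0, 0, 0, 0, 0]) + 0.01 * rng.normal(size=30)

w_hat = sc_weights(y1[:20], Y0[:20])   # fit the weights on the pre-period
y1_sc_post = Y0[20:] @ w_hat           # out-of-sample (post-period) prediction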
Stacked SCM with Multiple Outcomes
When multiple outcomes are present, there are three choices we can make: we can use the concatenated approach, a demeaning/intercept-adjusted approach, or a model average of the two. The first approach stacks all the outcomes, as we’ve discussed above, into a big matrix and a big vector. We then apply the standard SCM optimization problem to the stacked data. We minimize
\[ \mathbf{w}^\ast = \underset{\mathbf{w} \in \Delta^{N_0}}{\operatorname*{argmin}} \; \|\mathbf{y}_1^{\prime} - \mathbf{Y}_0^{\prime} \mathbf{w} \|_F^2. \]
Here, the vector \(\mathbf{y}_1^{\prime}\) is the stacked pre-treatment outcome vector of the treated unit, containing the focal outcome and the auxiliary outcomes. Similarly, \(\mathbf{Y}_0^{\prime}\) is the stacked donor matrix containing the focal outcome and the same set of auxiliary outcomes. The weight vector \(\mathbf{w}^\ast\) is chosen so that the stacked treated vector is matched as closely as possible by the weighted combination of the stacked control outcomes. The out-of-sample prediction for the treated unit’s focal outcome is then estimated as
\[ \mathbf{y}_1^{\text{CAT, post}} = \mathbf{Y}_0^{\text{post}} \mathbf{w}^\ast. \]
The second approach involves including an intercept term for each outcome to control for differences in levels across outcomes. Instead of demeaning the outcomes ourselves, we add an intercept to the optimization. This approach effectively adjusts for any systematic differences between the target unit and the donor pool. The objective function in this case is
\[ \mathbf{w}^\ast = \underset{\mathbf{w} \in \Delta^{N_0}, \, \beta \in \mathbb{R}}{\operatorname*{argmin}} \; \|\mathbf{y}_1^{\prime} - \mathbf{Y}_0^{\prime} \mathbf{w} - \beta \mathbf{1} \|_F^2. \]
Here, \(\beta\) represents the unconstrained intercept term, and the unit weight vector retains its initial interpretation. Technically, this is the same as the MSCa estimator, discussed here. The out-of-sample prediction is estimated as
\[ \mathbf{y}_1^{\text{AVG, post}} = \mathbf{Y}_0^{\text{post}} \mathbf{w}^\ast + \beta \mathbf{1}. \]
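To make both objectives concrete, here is a short, self-contained cvxpy sketch on fabricated toy data. It illustrates the two optimization problems above as I have written them; it is not mlsynth's internal code, and the array names and dimensions are assumptions.

import numpy as np
import cvxpy as cp

# Toy stacked data: 2 outcomes x 19 pre-treatment weeks, 10 donors
rng = np.random.default_rng(2)
T0, T, N0 = 19, 29, 10
Y0_prime = rng.normal(size=(2 * T0, N0))                 # stacked pre-treatment donor matrix
y1_prime = Y0_prime @ np.full(N0, 1 / N0) + 0.01 * rng.normal(size=2 * T0)
Y0_post = rng.normal(size=(T - T0, N0))                  # focal outcome, post-treatment donors

# Concatenated objective: simplex weights on the stacked data
w = cp.Variable(N0, nonneg=True)
cp.Problem(cp.Minimize(cp.sum_squares(y1_prime - Y0_prime @ w)),
           [cp.sum(w) == 1]).solve()
y1_cat_post = Y0_post @ w.value

# Intercept-augmented objective: simplex weights plus a free scalar intercept
w2 = cp.Variable(N0, nonneg=True)
beta = cp.Variable()
cp.Problem(cp.Minimize(cp.sum_squares(y1_prime - Y0_prime @ w2 - beta)),
           [cp.sum(w2) == 1]).solve()
y1_avg_post = Y0_post @ w2.value + beta.value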
I must admit, when I first read about this approach, I was confused about what was going on until I thought of these estimators as the solution to a system of multiple equations. To get a better sense of this, recall that in data science or econometrics we typically treat our outcomes as generated by some process that we never see. A common assumption is that they are generated by a low-rank model, which essentially formulates our outcomes as a byproduct of the interaction between unit-specific fixed effects that are idiosyncratic to each unit (and not expected to change over time) and time effects that are common across all units of interest. Well, more than one outcome exists in the world that may plausibly share a similar factor structure with the metric we care most about. When we stack these together, the theoretical intuition is that we allow the model to better capture the latent factors that influence the main outcome by solving for all of the outcomes at the same time, thereby improving the match of the synthetic control to those same latent factor loadings.
Note that for estimation purposes, under the hood within mlsynth, both the outcome of interest and the auxiliary outcomes are normalized to 100 in the final pre-treatment period. This avoids the truly nasty solver issues we would get if we used the raw, untransformed outcomes, which may be on very different scales from the target outcome of interest. This is a legal move because it is a linear transformation of the outcome, so it has no effect on the optimality conditions, though the transformation does affect the feasibility region. To emphasize, the user of mlsynth does not need to manually normalize their outcome metrics; this is taken care of under the hood.
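Purely as an illustration of what that rescaling amounts to (the function name and the wide, time-by-unit input are my own, not mlsynth's internals):

import pandas as pd

def normalize_to_100(wide, T0):
    """Rescale each unit's series so its value in the final pre-treatment
    period (row T0 - 1) equals 100; `wide` is a time-by-unit DataFrame
    holding a single outcome."""
    return 100 * wide / wide.iloc[T0 - 1]

# Apply separately to the focal outcome and to each auxiliary outcome, e.g.
# gbv_normed = normalize_to_100(gbv_wide, T0)   # gbv_wide is hypothetical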
Model Averaging
We may also average these two models together, which sometimes results in better fit than using either model alone. Suppose we are given two distinct estimators of the counterfactual outcome for the treated unit. On one hand, we have \(\mathbf{y}_1^{\text{CAT}} \in \mathbb{R}^{T_0}\), the pre-treatment fit from the concatenated model of Tian, Lee, and Panchenko; on the other hand, we have \(\mathbf{y}_1^{\text{AVG}} \in \mathbb{R}^{T_0}\), the corresponding fit from the demeaned model of Sun, Ben-Michael, and Feller. As before, we observe the treated unit’s pre-treatment trajectory, \(\mathbf{y}_1^{\text{pre}} \in \mathbb{R}^{T_0}\).
To begin, we stack the two counterfactuals into a single matrix:
\[ \mathbf{Y}^{\text{MA}} = \begin{bmatrix} \mathbf{y}_1^{\text{CAT}} & \mathbf{y}_1^{\text{AVG}} \end{bmatrix} \in \mathbb{R}^{T_0 \times 2}. \]
We define the model-averaged pre-treatment fit as a convex combination of the two predictions
\[ \mathbf{y}_1^{\text{MA}}(\boldsymbol{\lambda}) = \mathbf{Y}^{\text{MA}} \boldsymbol{\lambda}, \]
where \(\boldsymbol{\lambda} \in \Delta^2\) is a 2-dimensional simplex weight vector
\[ \Delta^2 = \left\{ \boldsymbol{\lambda} \in \mathbb{R}_{\geq 0}^2 : \| \boldsymbol{\lambda} \|_1 = 1 \right\}. \]
The model averaged objective function minimizes
\[ \boldsymbol{\lambda}^\ast = \underset{\boldsymbol{\lambda} \in \Delta^2}{\operatorname{argmin}} \left\| \mathbf{y}_1^{\text{pre}} - \mathbf{Y}^{\text{MA}} \boldsymbol{\lambda} \right\|_F^2. \]
The convex-hull interpretation remains the same as in the traditional SCM: at each time point, the model-averaged prediction lies between the smaller and the larger of the two individual predictions,
\[ y_{1t}^{\text{MA}} \in \left[ \min\left(y_{1t}^{\text{CAT}}, y_{1t}^{\text{AVG}}\right), \; \max\left(y_{1t}^{\text{CAT}}, y_{1t}^{\text{AVG}}\right) \right] \quad \forall t \in \mathcal{T}. \]
Once \(\boldsymbol{\lambda}^\ast\) is found, the model-averaged out-of-sample predictions are estimated as
\[ \mathbf{Y}^{\text{MA, post}} = \begin{bmatrix} \mathbf{y}_1^{\text{CAT, post}} & \mathbf{y}_1^{\text{AVG, post}} \end{bmatrix}, \quad \mathbf{y}_1^{\text{MA, post}} = \mathbf{Y}^{\text{MA, post}} \boldsymbol{\lambda}^\ast. \]
Essentially, this is a mixture of both models.
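A compact sketch of the averaging step, assuming arrays named y1_pre, y1_cat, y1_avg (pre-treatment fits) and y1_cat_post, y1_avg_post (post-treatment predictions) like those produced in the earlier sketches; again, this is illustrative rather than mlsynth's own routine.

import numpy as np
import cvxpy as cp

def model_average(y1_pre, y1_cat, y1_avg, y1_cat_post, y1_avg_post):
    """Convex combination of two candidate counterfactuals, with the mixing
    weights chosen on the 2-simplex to minimize pre-treatment error."""
    Y_ma = np.column_stack([y1_cat, y1_avg])          # T0 x 2 matrix of pre-treatment fits
    lam = cp.Variable(2, nonneg=True)
    cp.Problem(cp.Minimize(cp.sum_squares(y1_pre - Y_ma @ lam)),
               [cp.sum(lam) == 1]).solve()
    Y_ma_post = np.column_stack([y1_cat_post, y1_avg_post])
    return Y_ma_post @ lam.value, lam.value           # averaged prediction and the mixing weights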
Conformal Prediction via Agnostic Means
Now a final word on inference. I use conformal prediction intervals, developed in this paper, to conduct inference here. Precisely, I use the agnostic approach (yes, I know other approaches exist; users of mlsynth will likely be given the option to choose which flavor of conformal prediction they prefer in the future). Define the vector of residuals as \(\mathbf{u}_{\text{pre}} = \mathbf{y}_{1}^{\text{pre}} - \mathbf{y}^{\text{SC}}_{1,\text{pre}}\), that is, the pre-treatment difference between the observed values and their counterfactual. Furthermore, let \(\hat{\sigma}^2 = \frac{1}{T_0 - 1} \left\| \mathbf{u}_{\text{pre}} - \bar{u} \mathbf{1} \right\|^2\) be the unbiased estimator of the residual variance, where \(\bar{u} = \frac{1}{T_0} \sum_{t=1}^{T_0} u_t\) is the mean residual.
We aim to construct prediction intervals for the counterfactual outcomes in the post-treatment period. Let \(\mathbf{y}^{\text{SC}}_{1,\text{post}} \in \mathbb{R}^{T - T_0}\) be the post-treatment predictions from some generic synthetic control estimator. Assuming that the out-of-sample error is sub-Gaussian given the history \(\mathscr{H}\) (in plain English, this just means that large errors are unlikely, which is reasonable given that SC is less biased when the pre-intervention fit is good), we obtain a valid non-asymptotic prediction interval via concentration inequalities. Specifically, we have \(\delta_\alpha = \sqrt{2 \hat{\sigma}^2 \log(2 / \alpha)}\). The conformal prediction bounds are then \(\mathbf{p}_{\text{post}} = \mathbf{y}^{\text{SC}}_{1,\text{post}} - \delta_\alpha \mathbf{1}\) and \(\mathbf{u}_{\text{post}} = \mathbf{y}^{\text{SC}}_{1,\text{post}} + \delta_\alpha \mathbf{1}\). These bounds provide uniform coverage guarantees under the sub-Gaussian assumption on the prediction error. In the paper, the authors also provide two more methods, and these will likely be incorporated in the future.
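The agnostic interval translates almost line-for-line into code; here is a sketch under the same notation (an illustration of the formulas above, not mlsynth's implementation):

import numpy as np

def agnostic_interval(y1_pre, y1_sc_pre, y1_sc_post, alpha=0.05):
    """Sub-Gaussian ('agnostic') prediction band around the post-treatment
    counterfactual, built from the pre-treatment residuals."""
    u_pre = y1_pre - y1_sc_pre                        # pre-treatment residuals
    sigma2 = np.var(u_pre, ddof=1)                    # unbiased residual variance
    delta = np.sqrt(2 * sigma2 * np.log(2 / alpha))   # concentration bound
    return y1_sc_post - delta, y1_sc_post + delta     # lower and upper bounds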
Estimation in Python
In order to get these results, you need Python (3.9 or greater) and mlsynth, which you may install from the GitHub repo. You’ll need the most recent version, which automatically computes the conformal prediction intervals.
pip install -U git+https://github.com/jgreathouse9/mlsynth.git
The format of the dataset is the same as we would expect for any other mlsynth dataset. The class expects a long-form dataset with one column for time, another for the unit, a dummy column for treatment, and some columns for the outcomes of interest. We import the class by putting from mlsynth.mlsynth import SCMO at the top of one’s Python script.
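For concreteness, a toy long-form frame with the column layout implied by the config at the top of this post might look like the following; the auxiliary outcome names and all values here are made up for illustration.

import pandas as pd

toy = pd.DataFrame({
    "Market": ["Iquique", "Iquique", "Arequipa", "Arequipa"],   # unit column
    "Week": [1, 2, 1, 2],                                       # time column
    "Experiences": [0, 1, 0, 0],                                # treatment dummy
    "Gross Booking Value": [102.3, 104.1, 98.7, 99.2],          # focal outcome
    "Hotel Price": [80.1, 81.0, 75.4, 76.2],                    # hypothetical auxiliary outcome
    "Tourist Arrivals": [1500, 1520, 1100, 1090],               # hypothetical auxiliary outcome
})
# With this layout, list(toy.columns[4:]) picks up the auxiliary outcomes for "addout".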
Simulation
Suppose we are working at Airbnb, and we wish to estimate the causal effect of the introduction of Airbnb Experiences on Gross Booking Value (GBV), a metric defined as "the total revenue generated by room or property rentals before any costs or expenses are subtracted". Airbnb Experiences connects users of the platform to local tour guides and other local attractions. It serves as a kind of competitor to Travelocity, Viator, and other booking/travel services. In other words, this program may make a city more of an attraction, and we may see an increase in GBV as a result.
Well, all sorts of things may be related to GBV, such as local hotel prices, pre-existing levels of tourist arrivals, average city-specific booking price, and other relevant metrics. The goal is to see how GBV would have evolved absent the policy. The point of this simulation is to use the SCMO (synthetic control multiple outcomes) estimator to measure the causal impact.
For each unit, the observed outcome \(Y_{jtk}\) evolves according to an autoregressive process with latent structure for time, place, and seasonality
\[ Y_{jtk} = \rho_k Y_{j,t-1,k} + (1 - \rho_k) \left( \alpha_{jk} + \beta_{tk} + \boldsymbol{\phi}_j^\top \boldsymbol{\mu}_{tk} + S_{jt} + \delta_k \right) + \varepsilon_{jtk}, \quad \text{for } t > 1, \]
with initial condition
\[ Y_{j1k} = \alpha_{jk} + \beta_{1k} + \boldsymbol{\phi}_j^\top \boldsymbol{\mu}_{1k} + S_{j1} + \delta_k + \varepsilon_{j1k}. \]
Here, \(\alpha_{jk} \sim \mathcal{N}(0, 1)\) and \(\beta_{tk} \sim \mathcal{N}(0, 1)\) represent unit-outcome and time-outcome fixed effects, respectively. Each unit \(j\) possesses latent attributes \(\boldsymbol{\phi}_j \in \mathbb{R}^r\), drawn from \(\mathcal{N}(\mathbf{0}, \mathbf{I})\), while each time-outcome pair \((t, k)\) has associated latent loadings \(\boldsymbol{\mu}_{tk} \in \mathbb{R}^r\), also drawn from \(\mathcal{N}(\mathbf{0}, \mathbf{I})\). The seasonal component \(S_{jt}\) captures unit-specific periodicity and is defined as \(\gamma_j \cos\left( \frac{4\pi(t - \tau_j)}{T_{\text{season}}} \right)\), with \(\gamma_j \sim \text{Unif}(0, \bar{\gamma})\) representing the amplitude and \(\tau_j \sim \text{Unif}\{0, \dots, T_{\text{season}} - 1\}\) the phase shift. Each outcome \(k\) has a baseline shift \(\delta_k \sim \text{Unif}(200, 500)\), an autocorrelation parameter \(\rho_k \in (0, 1)\), and an idiosyncratic noise component \(\varepsilon_{jtk} \sim \mathcal{N}(0, \sigma^2)\). To introduce selection bias, the unit with the second-largest realization on the first latent factor dimension is designated as treated (Iquique, Chile in this draw), meaning methods like difference-in-differences or interrupted time series will not perform well. The treated unit’s GBV receives an additive treatment effect of \(+5\) during all post-treatment periods.
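The simulate() call at the top of the post wraps a process along these lines; below is a condensed, illustrative sketch of such a DGP. The panel dimensions, noise scale, and autocorrelation range are guesses on my part, and the real function returns a long-format dataframe rather than an array.

import numpy as np

def simulate_panel(N=99, T=104, T0=80, K=4, r=3, sigma=0.2,
                   T_season=52, gamma_bar=1.0, seed=10000):
    """AR(1) panel with unit/time fixed effects, r latent factors,
    unit-specific seasonality, and outcome-specific baselines."""
    rng = np.random.default_rng(seed)
    alpha = rng.normal(size=(N, K))               # unit-outcome fixed effects
    beta = rng.normal(size=(T, K))                # time-outcome fixed effects
    phi = rng.normal(size=(N, r))                 # latent unit attributes
    mu = rng.normal(size=(T, K, r))               # latent time-outcome loadings
    gamma = rng.uniform(0, gamma_bar, size=N)     # seasonal amplitudes
    tau = rng.integers(0, T_season, size=N)       # seasonal phase shifts
    delta = rng.uniform(200, 500, size=K)         # outcome-specific baselines
    rho = rng.uniform(0.3, 0.9, size=K)           # outcome-specific autocorrelation

    Y = np.zeros((N, T, K))
    for t in range(T):
        season = gamma * np.cos(4 * np.pi * (t - tau) / T_season)     # (N,)
        mean = (alpha + beta[t]                                       # fixed effects
                + np.einsum("nr,kr->nk", phi, mu[t])                  # factor structure
                + season[:, None] + delta)                            # seasonality + baseline
        eps = rng.normal(0, sigma, size=(N, K))
        Y[:, t] = mean + eps if t == 0 else rho * Y[:, t - 1] + (1 - rho) * mean + eps

    treated = np.argsort(phi[:, 0])[-2]           # second-largest loading on the first factor
    Y[treated, T0:, 0] += 5.0                     # +5 treatment effect on the focal outcome (GBV)
    return Y, treated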
Results
When we run this estimator, we need to specify one of three options for the method key: TLP, SBMF, or both, where the abbreviations stand for the surnames of the papers’ authors. We also need to supply a dictionary entry to mlsynth called addout. This is either a string or a list naming the additional outcome columns we care about in the dataframe. When we run the estimator, we get:
Using the model averaging estimator, our pre-treatment Root Mean Squared Error is 0.276. The ATT is 5.046. The unit weights are also sparse. The model averaged estimator returns Arequipa (0.237), Bogotá (0.232), San Salvador (0.218), Santiago de Chile (0.174), San Luis Potosí (0.081), Montevideo (0.032), and Manzanillo (0.026) as the contributing units, only 7 of the 98 donor units. The optimal mixing between the models is 0.538 for the intercept-shifted estimator and 0.461 for the concatenated method. For DID, the RMSE is 0.878 and the ATT is 5.2, meaning that the intercept-adjusted average of all donor units is clearly a biased estimator here.
For Forward DID, however, this is not true: we have an ATT of 5.063 and a pre-intervention RMSE of 0.289, selecting Antofagasta, Bocas del Toro, Punta del Este, San Pedro Sula, and Santiago as the optimal donor pool (this method uses no additional outcomes, only the GBV metric). When I compare to the clustered PCR method, the positively weighted donors are San Pedro Sula (0.296), Punta del Este (0.255), Santiago (0.199), Antofagasta (0.190), and La Plata (0.060).
Caveats
So, a few comments are in order. One, this method is predicated on selecting the right auxiliary outcomes to use. In testing on other empirical datasets, I got different results depending on the outcomes I employed. It would be cool to have a method to select the relevant ones. There are also other ways of combining multiple outcomes, such as the PCR version of this (the multi-dimensional Robust SCM/PCR). That method uses a diagonal matrix to weight the most relevant additional outcomes in a quadratic-programming style estimator. I will program this in the future, maybe even comparing it to the current method. Anyways, this is the first method in mlsynth which directly incorporates additional predictors/outcomes that may share the latent factor structure of the focal outcome. Please do email me or comment on the LinkedIn post if you have any questions.