9  Difference-in-Differences

Note

For what follows, let \(i\) index units and \(t\) index time periods. \(\mathcal{N}_0\) denotes the control group, and \(i=1\) denotes the single treated unit.

Even though we cannot randomize all treatments/policies, does this mean that we cannot do policy analysis at all? No. Modern econometrics has developed a slew of methods, implemented in Stata, R, and Python, for doing policy analysis when the intervention of interest simply cannot be subjected to a controlled experiment. I now introduce the difference-in-differences method (DD), using Proposition 99 as an example case. DD is a method for panel data, where we observe multiple units over many periods of time.

9.1 Underlying Theory of DD

DD is predicated on the idea that we may use the average of our controls as a comparison for one or more treated units. That is, it assumes our outcomes are generated by some time-specific effect and some unit-specific effect, \[ y_{it}= a_i + b_t \]

where \(a_i\) is the unit-specific effect (note the \(i\) index on \(a\)) and \(b_t\) is the time effect. The unit component \(a_i\) captures stable factors that affect the outcome; we usually do not expect these to vary much over time. Examples of unit-stable factors include things like culture, geography, and hidden/latent economic conditions we cannot easily observe. As one of my instructors once said in his thick Kentucky accent, “there is something that makes Alabama, Alabama.” By contrast, the time effect \(b_t\) is a time shock: something that varies across time but affects all units similarly, such as economic depressions or holidays. These factors are unit-invariant but time-varying. The sum of these two components produces our outcomes, in what we econometricians call the two-way fixed effects model.
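This model has a sharp implication worth writing out. Under it, the gap between the treated unit and the control-group average is constant over time, because the time shock \(b_t\) cancels:

\[ y_{1t} - \bar{y}_{\mathcal{N}_0 t} = (a_1 + b_t) - \left( \frac{1}{|\mathcal{N}_0|} \sum_{i \in \mathcal{N}_0} a_i + b_t \right) = a_1 - \bar{a}_{\mathcal{N}_0} \]

Absent treatment, the treated unit and the control average move in parallel, separated by a fixed unit-level gap. This fixed gap is exactly the parallel trends idea at work.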

For DD, units with similar contextual factors will, on average, be more comparable in terms of their trends (we formalize this below). For example, suppose we wish to study the impact of the Maui wildfires on the seasonally adjusted level of tourism. A natural control group for Maui might be Hawai’i, O’ahu, Moloka’i, Lana’i, and Kaua’i. All of these islands may have different levels of tourism. However, due to their relatively similar cultural history and other contextual factors, it makes sense that these islands would serve as a better counterfactual for Maui than, say, Honshu, Japan or Palma de Mallorca, since places with the same contextual factors will likely have similar trends. In other words, due to their baseline similarities, it is more likely that the average trends of these control-group islands would have moved similarly to Maui had the fires not happened. And if this is true, all we need to do is estimate whatever the time differences are between Maui and the control group.

9.1.2 Calculating the Treatment Effect

Okay, at this point, you may be wondering, “This is fine, but how do I estimate the treatment effect? That’s why we’re here, isn’t it?” Yes. The way we do this is quite simple. We begin by defining the counterfactual.

Definition 9.2 (DD Counterfactual) \(\hat{y}_{1t}^0=\hat\alpha_{\mathcal{N}_0}+\bar{y}_{\mathcal{N}_0t}\).
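Here, \(\bar{y}_{\mathcal{N}_0 t}\) is the control-group average outcome at time \(t\), and \(\hat{\alpha}_{\mathcal{N}_0}\) is the estimated pre-period gap between the treated unit and that average. Writing \(\mathcal{T}_1\) for the set of pre-intervention periods (with \(T_1\) of them), analogous to the post-period set \(\mathcal{T}_2\) used below, the intercept is just a pre-period mean:

\[ \hat{\alpha}_{\mathcal{N}_0} = \frac{1}{T_1} \sum_{t \in \mathcal{T}_1} \left( y_{1t} - \bar{y}_{\mathcal{N}_0 t} \right) \]

Regressing the gap \(y_{1t} - \bar{y}_{\mathcal{N}_0 t}\) on nothing but a constant over the pre-period returns exactly this mean, which is what the code below does.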

With this in mind, consider the following code:


clear *
cls

* Load the Proposition 99 smoking data; California is id 3
u "https://github.com/jgreathouse9/FDIDTutorial/raw/main/smoking.dta", clear

keep id year cigsale

* Reshape to wide: one cigarette-sales column per state
qui reshape wide cigsale, j(id) i(year)

* Put California (cigsale3) right after year, so the positional
* varlist cigsale1-cigsale39 below spans only the 38 control states
order cigsale3, a(year)

tempvar ytilde ymean

* Control-group average outcome in each year
egen `ymean' = rowmean(cigsale1-cigsale39)

*** Normal DID, all controls

* Gap between California and the control-group average
g `ytilde' = cigsale3 - `ymean'

* Regressing the gap on a constant over the pre-period returns its mean
reg `ytilde' if year < 1989

loc alpha = e(b)[1,1] // here is the intercept shift, our alpha-hat

g cf = `ymean' + `alpha' // DID counterfactual

g te = cigsale3 - cf // per-period treatment effect

keep year cigsale3 cf te

What do we see from the Stata browser window? We see the observed values, the DID predictions, and their difference. Here, te is that difference between the observed values and the predictions, \(y_{1t} - \hat{y}_{1t}^0\). We then take the average of this over the post-intervention period (on and after 1989, 12 years in total), \(\widehat{ATT}_{\mathcal{N}_0} = \frac{1}{T_2} \sum_{t \in \mathcal{T}_2} \left(y_{1t} - \hat{y}_{1t}^0\right)=-27.349\). In Stata, we can do this with the code su te if year >= 1989, which returns the same number. Here, ATT stands for the average treatment effect on the treated. The output of the Stata code is

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
          te |         12   -27.34911    8.373848  -36.17521  -12.90415
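To see the fit visually, here is a minimal sketch, assuming the variables from the block above (cigsale3 and cf) are still in memory:

* Observed California sales versus the DID counterfactual
twoway (line cigsale3 year) (line cf year, lpattern(dash)), ///
    xline(1989) legend(order(1 "California" 2 "DID counterfactual"))

If the counterfactual tracks California closely before 1989 and diverges afterward, that divergence is the te series we just summarized.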

9.2 Streamlining DD

Of course, all this was very involved. Some may wonder if we can use regression to streamline it… and indeed we can. One way to do this is with an interaction term. We’ve discussed interactions before in previous lectures, in the context of modeling nonlinearities, but we can use the same machinery for treatment effect estimation. Consider the code


clear *
cls

* Load the Proposition 99 data in long form
u "https://github.com/jgreathouse9/FDIDTutorial/raw/main/smoking.dta", clear

* Dummy for the ever-treated unit (California, id 3)
replace treated = cond(id==3,1,0)

* Dummy for the post-intervention period
g post = cond(year >= 1989,1,0)

keep id year treated post cigsale

* DD as a fully interacted regression, clustering by state
regress cigsale i.treated##i.post, vce(cl id)

Okay, so here we load in the Proposition 99 data. We then create a dummy variable for whether the state is California (that is, whether it is ever treated). We then create another dummy for whether the time period falls in the post-intervention period. We use the ## notation to create the interaction term along with both main effects. The regression equation itself takes the form

\[ Y_{it} = \beta_0 + \beta_1 \text{Treated}_i + \beta_2 \text{Post}_t + \beta_3 (\text{Treated}_i \times \text{Post}_t) + \epsilon_{it} \]

Again, none of this is in principle different from what we’ve seen before. The constant is still the average of our outcomes when all of our \(x\)-s are 0; here, that is the control group’s average in the pre-intervention period. The first beta is the baseline difference between the treated unit and the control group before treatment. The second beta is the change in the control group’s average outcome from the pre- to the post-intervention period… and the interaction term is simply the additional change for being in the treated group in the post-intervention period.
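One way to keep these coefficients straight is the standard 2 × 2 table of cell means implied by the regression equation:

\[ \begin{aligned} \operatorname{E}[Y \mid \text{control}, \text{pre}] &= \beta_0 \\ \operatorname{E}[Y \mid \text{treated}, \text{pre}] &= \beta_0 + \beta_1 \\ \operatorname{E}[Y \mid \text{control}, \text{post}] &= \beta_0 + \beta_2 \\ \operatorname{E}[Y \mid \text{treated}, \text{post}] &= \beta_0 + \beta_1 + \beta_2 + \beta_3 \end{aligned} \]

Subtract the control group’s pre-to-post change (\(\beta_2\)) from the treated unit’s pre-to-post change (\(\beta_2 + \beta_3\)), and what remains is \(\beta_3\): the difference in differences. The output of the Stata code is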



Linear regression                               Number of obs     =      1,209
                                                F(1, 38)          =          .
                                                Prob > F          =          .
                                                R-squared         =     0.2073
                                                Root MSE          =     29.209

                                    (Std. err. adjusted for 39 clusters in id)
------------------------------------------------------------------------------
             |               Robust
     cigsale | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
   1.treated |    -14.359   4.961268    -2.89   0.006    -24.40256   -4.315442
      1.post |  -28.51142   2.769628   -10.29   0.000    -34.11823    -22.9046
             |
treated#post |
        1 1  |  -27.34911   2.769628    -9.87   0.000    -32.95593   -21.74229
             |
       _cons |   130.5695   4.961268    26.32   0.000      120.526    140.6131
------------------------------------------------------------------------------

Some of this should look familiar, such as the -14.359, which is the exact same number we got when we manually calculated the average of the pre-period differences (the intercept \(\hat{\alpha}_{\mathcal{N}_0}\)). The treated#post coefficient also looks familiar: it’s our average treatment effect on the treated! Notice how it is the same as above, matching the result of su te if year >= 1989. It reflects the effect of a unit being both treated (California, in this case) and in the post-intervention period, the period we care about most. For our purposes, this is the coefficient we care most about, since it tells us whether our policy was successful. We can also calculate confidence intervals with DID, as we see from the Stata output above. In this case, we say that we are 95% confident, given the dataset, that Prop 99 reduced smoking in California by between roughly 22 and 33 cigarettes per capita per year over 1989 to 2000. Of course, other useful tools exist in Stata that further facilitate estimating DD models (see Greathouse, Coupet, and Sevigny 2024 for example).
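As a quick check on the 2 × 2 logic above, here is a minimal sketch, assuming the regression above has just been run: Stata’s margins command recovers the four cell means, and differencing them by hand reproduces the treated#post coefficient.

* Predicted mean of cigsale in each treated-by-post cell
margins treated#post

Taking the treated post-minus-pre change and subtracting the control post-minus-pre change returns -27.349, by construction.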

9.3 Limitations of DD

As we have discussed so far, parallel trends holds if the gap between the treated unit and the control-group average would have stayed constant absent treatment. However, this is a strong assumption to make: it essentially says that, across the whole control group, the only thing changing the outcome aside from the treatment was a common time effect. This may be dubious: perhaps unit-specific factors interact with time in ways unique to each unit. In other words, maybe the simple two-way fixed effects model is much too simplistic.
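The pre-intervention data give us an informal check on this. If the gap between California and the control-group average drifts before 1989, the constant-gap story is suspect. A minimal sketch, reusing the same dataset:

* Plot the California-vs-control-average gap; a flat pre-1989 line
* is what the two-way fixed effects model predicts
u "https://github.com/jgreathouse9/FDIDTutorial/raw/main/smoking.dta", clear
g byte cal = id == 3
collapse (mean) cigsale, by(cal year)
qui reshape wide cigsale, i(year) j(cal)
g gap = cigsale1 - cigsale0
twoway line gap year, xline(1989)

This is an eyeball test, not a proof: roughly parallel pre-trends do not guarantee that the counterfactual trends would have stayed parallel after 1989.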

Another limitation of DD is implicit in the regression equation: by now, hopefully we can see that DD is, at its heart, just the control-group average shifted by the pre-intervention gap. What this means, practically, is that DD presumes every unit in our control group is a good control for the treated unit(s); therefore, all control units are given the same weight in the construction of the counterfactual. Maybe this is unrealistic. Indeed, maybe some control units should matter more than others. After all, if Wailuku, Hawaii enacts an anti-crime policy, should New Orleans be given the same weight as Makakilo, Hawaii? Likely not. New Orleans is completely different from Wailuku and arguably should not be used as a comparison unit at all. Other methods have been developed for this.
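To see the uniform weighting explicitly, we can rewrite the counterfactual from Definition 9.2 as

\[ \hat{y}_{1t}^0 = \hat{\alpha}_{\mathcal{N}_0} + \sum_{i \in \mathcal{N}_0} w_i \, y_{it}, \qquad w_i = \frac{1}{|\mathcal{N}_0|} \]

Every control unit gets weight \(1/|\mathcal{N}_0|\), no matter how dissimilar it is from the treated unit. The methods alluded to above relax exactly this; for example, the forward DD estimator of Greathouse, Coupet, and Sevigny (2024) first selects which control units enter the average at all.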

10 Summary

The DD method is the baseline method we use for program evaluation in public policy analysis. It is the most elementary method, aside from the randomized controlled trial, for evaluating the effect of a policy when we cannot randomize it. It operates under the parallel trends assumption (PTA), the notion that the counterfactual for the treated unit would have moved the same way as the average of the control group. Numerous extensions have been developed for DD (see Roth, Sant’Anna, Bilinski, and Poe 2023 for a synthesis). It is the most popular method for evaluating the effects of policies, valued for its computational simplicity, its inference theory, and its applicability across research settings, with implementations in many software packages. The key message for your papers, though, remains the same: use statistical methods to evaluate the impact of the policy you’ve chosen.

A final note, just in terms of doing science: I never mentioned the p-value, the measure of statistical significance commonly employed in social science research, in my interpretation of the regression results. The reason is that a p-value below \(0.05\) for a regression coefficient does not mean your analysis is correct or that you have done your job as a researcher. By my standards, the design, the way we set up the identification of the counterfactual, is what matters. It completely trumps estimation.

Therefore, for your papers, you will mainly be graded on whether you can identify the applicability (or violation) of DD’s parallel trends assumption, as well as the limitations of the method for your case. Your goal is not to have “the final say” on whether policy \(x\) worked; the goal is to get you thinking like a policy analyst who views the world through a cause-and-effect lens, and being able to use and articulate designs such as DD is a useful way of beginning.

Greathouse, Jared, Jason Coupet, and Eric Sevigny. 2024. “Greed Is Good: Estimating Forward Difference-in-Differences in Stata.” Georgia State University; [working paper]. https://jgreathouse9.github.io/publications/FDIDSJ.pdf.
Roth, Jonathan, Pedro H.C. Sant’Anna, Alyssa Bilinski, and John Poe. 2023. “What’s Trending in Difference-in-Differences? A Synthesis of the Recent Econometrics Literature.” Journal of Econometrics 235 (2): 2218–44. https://doi.org/10.1016/j.jeconom.2023.03.008.