4  Introduction to Stata

This chapter introduces Stata, the software most of you will presumably use for analysis. I review the basics of importing and cleaning data, data types, and basic forms of analysis. To begin, open Stata on your computer (whichever version you have installed). When you do this, you’ll see a terminal (where we type commands) and a graphical user interface, which can issue commands for us.

4.1 Script

Before we get to any of that though, we begin with the lifeblood of science: documentation and script. In this course, your analysis is only valid if what you do can be reproduced and shown to others. The way we do this is with script. The reason I opted away from Excel for this course (aside from the fact that we are a professional policy analysis department and Excel isn’t used in real life for proper data analysis) is that Stata allows for script such that we can replicate exactly what we did every single time. Excel, by contrast, does not have a reproducible documentation system which allows us to say exactly what we did and how we did it.

In Stata, we do this (mainly) with what we call .do files. Once you’ve opened Stata, type the word doedit into the terminal. What will open is an untitled do file. Alternatively, you can type doedit and a file name, say doedit file1, and a corresponding do file will open named file1.do. This is where all of our work will live for the purposes of class.

I present script in code blocks. These code blocks may be copied and pasted directly into Stata so that you may run them. Each do file should begin with

clear *
cls

Here, clear * removes everything Stata currently holds in memory (the dataset, saved results, and so on), and cls clears all of the current output on the screen. I do this for two reasons. First, it keeps the screen free of needless output. Second, it means that the do file must be able to run from the very first step to the very last one without error, since nothing is left over from earlier work.
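
Since the point of a do file is to document exactly what was done, one optional addition is to have Stata write its output to a log file as well. The sketch below is one way this might be set up, not a requirement for this course; the file name chapter4_session.log is hypothetical.

```stata
clear *
cls

capture log close                                // close any log left open by an earlier run
log using "chapter4_session.log", replace text   // hypothetical file name, saved in the working directory

display "Analysis run on " c(current_date)       // output from here on is also written to the log

log close                                        // stop logging at the end of the do file
```

The capture prefix keeps the do file from stopping with an error when no log happens to be open.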

By the way, clear and cls are Stata commands. We can see how to use these commands by consulting the help files, by typing in the terminal help and the command we’re interested in. For example, for help with the clear command, we can just do help clear or h clear where h is the short version of help in Stata. If you wanted to know how to summarize a variable using descriptive statistics, you’d do h summarize. If you wanted to learn how to take the correlation between two variables, you’d do help corr.

All Stata command names are lowercase. Typically, your variable of interest comes first after the command name, and any options follow a comma. For example, suppose I wanted to estimate the causal impact of drought in California on the violent crime rate. To do this, we can use data from this paper. We’d put the following into a do file and then press CTRL+d to run it.

clear *
import delim "https://ndownloader.figstatic.com/files/9466315"
keep state year violentcrimerate murderrate vcrimerate id
order id state year vcrimerate, first
g drought = cond(id==4 & year >=2011,1,0)
net from "https://raw.githubusercontent.com/jgreathouse9/FDIDTutorial/main"
net install fdid, replace
xtset id year, y
fdid vcrimerate, tr(drought) unitnames(state) gr1opts(scheme(sj))

See how all commands are lowercase? In particular, after the fdid command, we see that the main variable is vcrimerate and that the options for the command follow the comma. We can see the same thing for the order command, where I rearrange the way the dataset looks.
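
The same pattern (command name, then variables, then an optional if condition, then options after a comma) applies to most built-in commands. As a minimal sketch, the example below uses auto.dta, an example dataset that ships with Stata and is not part of this course’s data.

```stata
sysuse auto, clear                        // load the example dataset shipped with Stata

* command   variable   if condition      , options
summarize   price      if foreign == 1   , detail

* two variables, no condition, one option
tabulate rep78 foreign, missing           // missing also tabulates missing values
```

Extra spaces between the pieces of a command are ignored, so they are used here only to line the pieces up under the comment.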

4.2 Importing Data

Now that that’s partly out of the way, we cover how to import data. Data are presented in different file types, but for undergraduate study the most you will need is the way to import Stata’s native datasets. Let’s import the terrorism dataset. Note that we can do this whether the data is on our local machine or if it is on the internet.


clear * // We can leave comments in our code like this
cls

use /// or we can leave comments like
"https://github.com/jgreathouse9/FDIDTutorial/raw/main/basque.dta" // this
browse


/*
Or, here's a longer 




block comment

*/

This imports the Basque dataset using the use command (we may also use u as an abbreviation). Note that the use command may ONLY be used with Stata datasets, datasets whose file extension ends with .dta. To view the dataset, we use the browse command (which we may abbreviate with br). Here, we see the entire dataset. The first column is the variable id, and the second one is year, and the third column is the outcome of interest gdpcap. When we browse the dataset, we may notice that some cells of the spreadsheet have periods in them. This means that the value is missing.
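
Scrolling through the browser is one way to spot missing values, but counting them is usually more reliable. The following is a sketch of how that might look; it assumes only the variable gdpcap named above, plus Stata’s built-in missing() function and misstable command.

```stata
use "https://github.com/jgreathouse9/FDIDTutorial/raw/main/basque.dta", clear

count if missing(gdpcap)      // number of rows where gdpcap is missing
misstable summarize           // reports each numeric variable that has missing values
browse if missing(gdpcap)     // view only the rows with a missing gdpcap
```

If a variable has no missing values, count simply returns 0 and the browser opens with no rows.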

4.2.1 Data Types

Now, a word on variable types. The first column is what we call a categorical variable. A categorical variable is a variable that is numeric in nature, but it has no specific order attached to it. For example, consider the output of the list command, which simply prints the dataset.

list id if year ==1955

     +------------------------------+
     |                           id |
     |------------------------------|
  1. |                    Andalucia |
 44. |                       Aragon |
 87. |                     Asturias |
130. |             Baleares (Islas) |
173. |  Basque Country (Pais Vasco) |
     |------------------------------|
216. |                     Canarias |
259. |                    Cantabria |
302. |              Castilla Y Leon |
345. |           Castilla-La Mancha |
388. |                     Cataluna |
     |------------------------------|
431. |         Comunidad Valenciana |
474. |                  Extremadura |
517. |                      Galicia |
560. |        Madrid (Comunidad De) |
603. |           Murcia (Region de) |
     |------------------------------|
646. | Navarra (Comunidad Foral De) |
689. |                   Rioja (La) |
     +------------------------------+

As I said, this simply lists the dataset, in this case in the year 1955. We could just as easily do list id treat if year ==1955 to see the values of id and the treatment variable in 1955. What makes id categorical, though? Well, imagine the state names indexed to the numbers 1…17, where 1 means Andalucia, 2 means Aragon, and so on. There’s no inherent order to these numbers beyond the one we assign. We could just as easily have the Basque Country’s number be 1 instead of its current number (which, if we do display id[173] in the Stata terminal, we see is 5). You may wonder why, when we browse the dataset, the letters are blue. This is because Stata stores id as numbers with text labels attached, which is how it handles a categorical variable like this one. We can also create what’s called a string variable, which stores the letters themselves, by doing

decode id, generate(statestring)

browse id statestring if year ==1970

We can now see that the next variable, statestring, is colored red. In Stata, this means that it is what we call a string variable (i.e., something we cannot do math with).
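
The colors in the browser are one way to tell the two kinds of variable apart. As a sketch of how the same distinction can be checked with commands, the example below uses describe, the nolabel option of list, and label list, all built-in.

```stata
use "https://github.com/jgreathouse9/FDIDTutorial/raw/main/basque.dta", clear
decode id, generate(statestring)

describe id statestring               // storage type of each variable, and the value label attached to id
list id statestring if year == 1955   // both columns display state names
list id if year == 1955, nolabel      // nolabel shows the numbers stored underneath the labels
label list                            // prints the number-to-name mappings defined in the dataset
```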

4.3 Inspecting Data

All other variables in the dataset are numeric, in that we may do math with them. For example, suppose we wish to see the average gdpcap across all Spanish states in the year 1955, the first year of the dataset. We can do

summarize gdpcap if year ==1955

which returns the output table


    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
      gdpcap |         17     2.42521    .9244023    1.24343   4.594473

We can do the same with other variables too, such as year. When we do summarize year, we get

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
        year |        731        1976    12.41817       1955       1997

Notice how the number of observations is now 731. Why? Because I did not restrict the range of years with the if qualifier. When you don’t place restrictions on a command, Stata presumes you want the full dataset. Anyways, we can see that the minimum year is 1955 and the maximum year is 1997 (so, nobody who is doing their paper on the Basque Country should get their study period wrong when the data section is due!!!)

We can use the if qualifier to do other things too. Suppose I ask you when the treatment happened, denoted by the treatment variable treat. We can do summarize year if treat==1.

Suppose I ask you which unit was treated. How can we solve this? Well, we already know that, by construction, only the Basque Country was exposed to the terrorist intervention, and the other 16 control units were not. So, across all units, the treat variable should only ever equal 1 for one unit. Because of this, the average id (which is a number, as we’ve seen) among treated rows should be a constant number. Let’s use summarize to get this result.

summarize id if treat ==1

will return


    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
          id |         23           5           0          5          5

This means the treated unit’s numeric id is 5. Note, for the sake of interpretation, that the standard deviation is 0. Why? Well, as we discussed, only one unit is treated. Only one unit’s rows for treat ever take on the value of 1. So, the standard deviation should be 0 because there IS no spread or variation among the values of id in the treated rows; it’s all the same number. Similarly, the maximum and minimum of id are the same because only one unit is treated.
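
Because id carries labels, tabulate can answer the same question with the state’s name rather than its number. A small sketch under that assumption, which also shows that summarize keeps its results available afterwards as r() values:

```stata
use "https://github.com/jgreathouse9/FDIDTutorial/raw/main/basque.dta", clear

tabulate id if treat == 1      // shows the state name and the number of treated rows
summarize year if treat == 1   // the minimum is the first treated year
display r(min)                 // r(min) is the minimum from the summarize just run
```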

To get a list of all the variables in one’s dataset, one may do


clear *

cls

use "https://github.com/jgreathouse9/FDIDTutorial/raw/main/basque.dta", clear

ds
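
The ds command lists variable names only. As a sketch of a slightly fuller first look at a dataset, the commands below (all built-in) also report storage types and check that each state appears once per year.

```stata
use "https://github.com/jgreathouse9/FDIDTutorial/raw/main/basque.dta", clear

describe            // names, storage types, and labels for every variable
count               // total number of rows
isid id year        // exits with an error unless id and year uniquely identify each row
xtset id year       // declares the panel structure: id is the unit, year is the time variable
```

If isid runs silently, there is exactly one row per state per year.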

4.4 Doing Basic Analysis

Let’s suppose we wish to test the hypothesis that the Basque Country’s GDP per capita was different from the average of the other 16 control units before 1975. We can do this with a t-test. What kind of t-test? A two-group t-test, where the groups in this instance are defined by whether or not the unit is the Basque Country. To do this, we use Stata’s ttest command by doing


clear *

cls

use "https://github.com/jgreathouse9/FDIDTutorial/raw/main/basque.dta", clear

g basque = cond(id==5,1,0)

ttest gdpcap if year < 1975, by(basque) reverse unequal

which returns


Two-sample t test with unequal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. err.   Std. dev.   [95% conf. interval]
---------+--------------------------------------------------------------------
       1 |      20    5.282476     .232162     1.03826    4.796556    5.768397
       0 |     320    3.662215     .080167    1.434072    3.504492    3.819938
---------+--------------------------------------------------------------------
Combined |     340    3.757524    .0793618    1.463359    3.601421    3.913628
---------+--------------------------------------------------------------------
    diff |            1.620262    .2456134                1.113093     2.12743
------------------------------------------------------------------------------
    diff = mean(1) - mean(0)                                      t =   6.5968
H0: diff = 0                     Satterthwaite's degrees of freedom =   23.781

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 1.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 0.0000

To do this, we need a variable that equals 1 if the unit is the Basque Country and 0 otherwise. I create it with the cond() function, which obeys the logic of “if the condition is true, return A; otherwise, return B”. In this case, the condition is id==5 (since 5 is the number for the Basque Country), so the variable basque will be 1 for all rows where id is 5, else 0. We then restrict the ttest to all years before 1975. The reverse option reports the difference as group 1 minus group 0, and unequal allows the two groups to have different variances. This is the exact same t-test as we discussed in the second chapter. We can use the formula to compute t

\[ t = \frac{5.282476 - 3.662215}{\sqrt{\frac{1.078}{20} + \frac{2.0566}{320}}} \approx 6.60 \]
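
The degrees of freedom in the output are not a whole number because the unequal option uses Satterthwaite’s approximation. As a sketch of where the 23.781 comes from, write \(s_1, n_1\) for the standard deviation and number of observations in group 1 and \(s_0, n_0\) for group 0 (this notation is mine, not Stata’s). Using values rounded from the table, so the last digit may differ slightly:

\[ \nu = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_0^2}{n_0}\right)^2}{\frac{\left(s_1^2/n_1\right)^2}{n_1-1} + \frac{\left(s_0^2/n_0\right)^2}{n_0-1}} = \frac{(0.05390 + 0.00643)^2}{\frac{0.05390^2}{19} + \frac{0.00643^2}{319}} \approx 23.78 \]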

The term diff refers to the simple difference in averages, in this case before 1975. The confidence interval for this difference is \([1.113, 2.127]\). This means that the Basque Country’s GDP per capita was about 1.62 units higher than the average of the other states in Spain, with a lower bound of 1.113 and an upper bound of 2.127. (The original Basque data report GDP per capita in thousands of 1986 U.S. dollars, so if gdpcap follows that scale the gap is roughly 1,620 dollars.) From a practical standpoint, we can conclude that Basque GDP per capita was much higher than that of the rest of Spain in the pre-1975 period. This suggests that a comparison between the raw means of the Basque Country and the rest of Spain would be misleading, since they are quite different from each other in the pre-intervention period.
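
As a check on the arithmetic, the numbers in the ttest output can be recomputed by hand in Stata. The sketch below copies the means, standard deviations, and group sizes from that output table; ttesti is the built-in version of ttest that works from summary statistics, and invttail() returns the critical value of the t distribution.

```stata
* The t statistic, from the two means, standard deviations, and group sizes
display (5.282476 - 3.662215) / sqrt(1.03826^2/20 + 1.434072^2/320)

* The same test from summary statistics alone: obs, mean, sd for each group
ttesti 20 5.282476 1.03826 320 3.662215 1.434072, unequal

* 95% confidence interval: diff -/+ critical value * standard error
display 1.620262 - invttail(23.781, 0.025) * .2456134
display 1.620262 + invttail(23.781, 0.025) * .2456134
```

Small differences in the last decimal places are expected, since the inputs are rounded values from the table.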

4.5 Graphs and Plots

Note

Part of being a policy analyst is using data to tell a story, but the only way this can be done graphically is if the graph is informative and well constructed. These are the principles you should follow when making graphs: make them informative and clearly drawn. This will require you to play with Stata’s options, as the defaults for plots are rarely adequate.

We are primarily working with what we call panel data, where we observe many units over time. In this case, we observe 17 states across 43 time periods, hence the number of observations being \(17 \times 43=731\), or one row per state per time period. One useful way of plotting data observed over time is a line plot. Let’s plot the Basque Country against the average of the 16 control states.


clear *

cls

use "https://github.com/jgreathouse9/FDIDTutorial/raw/main/basque.dta", clear

keep year gdpcap id

reshape wide gdpcap, i(year) j(id)

browse

First, to do this we need to reshape our dataset from long to wide. In a long dataset, each row is one unit in one time period. In a wide dataset, each row is one time period, with a separate column for each unit. When we browse after running the code block above, we see that year is the first variable, followed by one GDP per capita column for each state. In other words, these two line plots produce the same result

clear *

cls

use "https://github.com/jgreathouse9/FDIDTutorial/raw/main/basque.dta", clear

keep year gdpcap id

line gdpcap year if id ==5, name(plot1, replace) //plot number 1, long dataset
reshape wide gdpcap, i(year) j(id)
line gdpcap5 year, name(plot2, replace) // plot number 2, wide dataset
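
Reshaping is reversible, which can help when deciding which layout a command needs. As a minimal sketch, the example below goes from long to wide and back again, using describe, short to report the number of rows and variables at each step.

```stata
use "https://github.com/jgreathouse9/FDIDTutorial/raw/main/basque.dta", clear
keep year gdpcap id

reshape wide gdpcap, i(year) j(id)   // one row per year, columns gdpcap1 ... gdpcap17
describe, short                      // rows and variables in the wide layout
reshape long gdpcap, i(year) j(id)   // back to one row per state per year
describe, short                      // rows and variables in the long layout
```

With 17 states and 43 years, the wide layout should have 43 rows and the long layout 731.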

Now, we need to calculate the mean of the control units. To do this, we need the egen command.


clear *

cls

use "https://github.com/jgreathouse9/FDIDTutorial/raw/main/basque.dta", clear

keep year gdpcap id

reshape wide gdpcap, i(year) j(id)
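* Move gdpcap5 (the Basque Country) next to year so that the range
* gdpcap1-gdpcap17 below covers only the 16 control states
order gdpcap5, after(year)

egen controlmean = rowmean(gdpcap1-gdpcap17) // row-wise mean of the controls

drop gdpcap1-gdpcap17 // leaves only year, gdpcap5, and controlmean
line gdpcap5 controlmean year, xline(1975) // reference line at 1975; the last variable is the x axis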

We can see that the Basque Country is much richer than the rest of Spain here. Its GDP per capita sits well above the average of the control states. So, for example, this is one plot you might make if you wanted to argue that some additional statistical test would need to be done to see what the true effect of a policy was. After all, if the trend of the rest of Spain is not sufficiently similar to that of the Basque Country in the pre-intervention period, the rest of Spain may not offer a suitable idea of how the Basque Country’s trends would have evolved absent treatment. And as policy analysts, the way an outcome would look absent an intervention of interest is a key thing we care about, which we define as the counterfactual.
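
The same comparison can also be built without relying on the column order of the wide dataset. The sketch below is an alternative I am suggesting, not the approach used above: it averages gdpcap within each group and year using collapse, then labels the plot. The titles and legend text are my own wording.

```stata
use "https://github.com/jgreathouse9/FDIDTutorial/raw/main/basque.dta", clear

generate basque = (id == 5)                 // 1 for the Basque Country, 0 for the 16 controls
collapse (mean) gdpcap, by(year basque)     // average gdpcap within each group in each year
reshape wide gdpcap, i(year) j(basque)      // gdpcap1 = Basque Country, gdpcap0 = control mean

line gdpcap1 gdpcap0 year, xline(1975) ///
    legend(order(1 "Basque Country" 2 "Mean of 16 control states")) ///
    ytitle("GDP per capita") xtitle("Year") ///
    title("Basque Country vs. control states")
```

The /// at the end of a line continues the command onto the next line, which works in a do file but not when typing directly into the terminal.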

Note that this is just a line plot, mostly useful when observations happen over a set time frame. Other plots, such as boxplots, are useful for visualizing the distribution of data. If we wished to see the distribution of GDP per capita and levels of local investment of all Spanish states in 1970, we’d do


clear *

cls

use "https://github.com/jgreathouse9/FDIDTutorial/raw/main/basque.dta", clear

graph box gdpcap if year ==1970, name(distgdp, replace) nodraw
graph box invest if year ==1970, name(distinvest, replace) nodraw
graph combine distinvest distgdp
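
Since the defaults for plots are rarely adequate, here is a sketch of the same two boxplots with titles and axis labels added. The title and ytitle text is my own wording, and the label for invest assumes it measures investment, as described above.

```stata
use "https://github.com/jgreathouse9/FDIDTutorial/raw/main/basque.dta", clear

graph box gdpcap if year == 1970, ///
    title("GDP per capita, 1970") ytitle("GDP per capita") ///
    name(distgdp, replace) nodraw
graph box invest if year == 1970, ///
    title("Investment, 1970") ytitle("Investment") ///
    name(distinvest, replace) nodraw

graph combine distinvest distgdp, title("Spanish states in 1970")
```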

4.6 Summary

At present, these are the basics of Stata that you need to be able to write about the dataset for your chosen question. I went over how to import data, how to use the browse command to visually check through certain elements of your data, how to do t-tests as we discussed in the first lecture, and other general principles. We will refine these ideas more and more as we discuss correlation, regression, and the essentials of causal inference. Below is the do file I used to run most of this code.


clear * // We can leave comments in our code like this
cls

use /// or we can leave comments like
"https://github.com/jgreathouse9/FDIDTutorial/raw/main/basque.dta" // this

* Imports our data


browse // pulls up the browser window


/*
Or, here's a longer 




block comment

*/


list id if year ==1955 // list of units


decode id, generate(statestring) // generates a string from the categorical variable

browse id statestring if year ==1970

summarize gdpcap if year ==1955 // average gdp in 1955

summarize year if treat==1 // when did treatment happen


summarize id if treat ==1 // which unit was treated


ds // list of variables



unique id // how many units are there (unique is user-written; install it once with: ssc install unique)



/* Basic Analysis */


clear *

cls

use "https://github.com/jgreathouse9/FDIDTutorial/raw/main/basque.dta", clear

g basque = cond(id==5,1,0)

ttest gdpcap if year < 1975, by(basque) reverse unequal


keep year gdpcap id

reshape wide gdpcap, i(year) j(id)


order gdpcap5, after(year)

egen controlmean = rowmean(gdpcap1-gdpcap17)

drop gdpcap1-gdpcap17
line gdpcap5 controlmean year, xline(1975) // creates a reference line for 1975, where the last variable is the x axis