Lab 01 - Functions and Vectors

Professor Almada - Solutions

19 January, 2024


Source Data

The following report analyzes tax parcel data from Syracuse, New York (USA).

View the “Data Dictionary” here: Syracuse City Tax Parcel Data



Importing the Data

The following code imports the Syracuse, NY tax parcel data using a URL.

url <- paste0("https://raw.githubusercontent.com/DS4PS/Data",
              "-Science-Class/master/DATA/syr_parcels.csv")

dat <- read.csv(url, 
                stringsAsFactors = FALSE)



Previewing the Data

Several exploratory functions help us better understand our new dataset.

We can inspect the first 5 rows of these data using the head() function.


head(dat, 5)              # Preview a dataset with 'head()'


Listing All Variables

Functions names() or colnames() will print all variable names in a dataset.


names(dat)                # List all variables with 'names()'
##  [1] "tax_id"       "neighborhood" "stnum"        "stname"       "zip"         
##  [6] "owner"        "frontfeet"    "depth"        "sqft"         "acres"       
## [11] "yearbuilt"    "age"          "age_range"    "land_use"     "units"       
## [16] "residential"  "rental"       "vacantbuil"   "assessedla"   "assessedva"  
## [21] "tax.exempt"   "countytxbl"   "schooltxbl"   "citytaxabl"   "star"        
## [26] "amtdelinqu"   "taxyrsdeli"   "totint"       "overduewater"


Previewing Specific Variables

We can also inspect the values of a variable by extracting it with $.

The extracted variable is called a “vector”.


head(dat$owner, 10)       # Preview a variable, or "vector"
##  [1] "CLARMIN BUILDERS ONON COR" "JOHNSTON LEE R"           
##  [3] "CHRISTO CRAIG S"           "HAWKINS FARMS INC"        
##  [5] "PETERS LYNNETTE"           "MITCHELL LOTAN G"         
##  [7] "WHALEN GIOVANNA A"         "BERGH GARY D"             
##  [9] "CITY OF SYRACUSE TD"       "DOUGHERTY ROBERT K JR"
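
As a minimal sketch of the same idea on made-up data (the data frame toy and its values are hypothetical, not part of the Syracuse data), extracting a column with $ returns a plain vector, and [[ ]] is an equivalent way to get it:

```r
# Hypothetical toy data frame, for illustration only
toy <- data.frame(owner = c("A SMITH", "B JONES", "C LEE"),
                  acres = c(0.10, 0.25, 0.15),
                  stringsAsFactors = FALSE)

owners <- toy$owner                  # extract one column with '$'
class(owners)                        # "character"
is.vector(owners)                    # TRUE: a vector, not a data frame
identical(owners, toy[["owner"]])    # '[[ ]]' extracts the same vector
```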


Listing Unique Values

Function unique() helps us determine what values exist in a variable.


unique(dat$land_use)      # Print all possible values with 'unique()'
##  [1] "Vacant Land"        "Single Family"      "Commercial"        
##  [4] "Parking"            "Two Family"         "Three Family"      
##  [7] "Apartment"          "Schools"            "Parks"             
## [10] "Multiple Residence" "Cemetery"           "Religious"         
## [13] "Recreation"         "Community Services" "Utilities"         
## [16] "Industrial"


Examining Data Structure

Function str() provides an overview of total rows and columns (dimensions), variable classes, and a preview of values.


str(object = dat,
    vec.len = 2)          # Examine data structure with 'str()'
## 'data.frame':    41502 obs. of  29 variables:
##  $ tax_id      : int  1393130501 1393130500 1437100600 1425100900 1425101000 ...
##  $ neighborhood: chr  "South Valley" "South Valley" ...
##  $ stnum       : chr  "2655" "2635" ...
##  $ stname      : chr  "VALLEY DR" "VALLEY DR" ...
##  $ zip         : chr  "13215" "13120" ...
##  $ owner       : chr  "CLARMIN BUILDERS ONON COR" "JOHNSTON LEE R" ...
##  $ frontfeet   : num  67.2 104.8 ...
##  $ depth       : num  50 46.5 ...
##  $ sqft        : num  2149 6370 ...
##  $ acres       : num  0.0493 0.1462 ...
##  $ yearbuilt   : int  NA 1925 1957 1958 1965 ...
##  $ age         : int  NA 90 58 57 50 ...
##  $ age_range   : chr  NA "81-90" ...
##  $ land_use    : chr  "Vacant Land" "Single Family" ...
##  $ units       : int  0 0 0 0 0 ...
##  $ residential : logi  FALSE TRUE TRUE ...
##  $ rental      : logi  FALSE FALSE FALSE ...
##  $ vacantbuil  : logi  FALSE FALSE FALSE ...
##  $ assessedla  : int  475 10800 20200 18000 18000 ...
##  $ assessedva  : int  500 69300 88300 70500 74000 ...
##  $ tax.exempt  : logi  TRUE FALSE FALSE ...
##  $ countytxbl  : int  500 69300 88300 70500 74000 ...
##  $ schooltxbl  : int  500 69300 88300 70500 74000 ...
##  $ citytaxabl  : int  500 69300 88300 70500 74000 ...
##  $ star        : logi  NA TRUE TRUE ...
##  $ amtdelinqu  : num  0 0 0 0 0 ...
##  $ taxyrsdeli  : int  0 0 0 0 0 ...
##  $ totint      : num  0 0 0 0 0 ...
##  $ overduewater: num  0 178 ...



Questions & Solutions

Instructions: Provide the code for each solution in the following “chunks”.

Remember to modify the text to show your answer in human-readable terms.


Question 1: Total Parcels

Question: How many tax parcels are in Syracuse, NY?

Answer: There are 41,502 tax parcels in Syracuse, NY. This count comes from the functions below, together with the data dictionary's statement that each row describes one tax parcel.


dim(dat) # one can use dim() or... 
## [1] 41502    29
nrow(dat) # or another approach...
## [1] 41502
length(dat$tax_id) # length() of one column counts rows; length(dat) would count the 29 columns
## [1] 41502

Comments: The function dim() shows the dimensions of the data set (rows and columns). Each row represents a unique tax parcel, while each column represents a variable in the data set.

Because the data dictionary tells us that each row represents a unique tax parcel we can simply use the nrow() function to determine the number of tax parcels.

Even though the data dictionary states each row is a unique tax parcel, is this actually confirmed by the data?

length(unique(dat$tax_id)) # fewer unique IDs than rows, so some tax_id values repeat
## [1] 41469
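
One way to follow up on this gap, sketched here on a small made-up vector (ids is hypothetical, not taken from the parcel data): duplicated() flags every repeat of a value that has already appeared, so summing it gives the difference between the total count and the unique count.

```r
# Hypothetical IDs, for illustration only: 101 appears twice, 103 three times
ids <- c(101, 102, 101, 103, 103, 103, 104)

length(ids)                     # 7 values in total
length(unique(ids))             # 4 distinct values
sum(duplicated(ids))            # 3 repeats, i.e. 7 - 4
unique(ids[duplicated(ids)])    # the IDs that repeat: 101 and 103
```

Applied to the lab data, the same pattern would be sum(duplicated(dat$tax_id)), which by the counts above should equal 41502 - 41469 = 33.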


Question 2: Total Acres

Question: How many acres of land are in Syracuse, NY?

Answer: There are 12,510.49 acres of land in Syracuse, NY.


class(dat$acres) # confirm acres is a numeric vector/variable
## [1] "numeric"
summary(dat$acres) # summary statistics
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##   0.00028   0.10101   0.13180   0.30144   0.18866 140.56053
sum(dat$acres) # total acreage across all parcels
## [1] 12510.49
sum(dat$acres, na.rm = TRUE) # same result because there are no missing values
## [1] 12510.49

Comments: Because acres is a numeric variable, summary() reports its summary statistics and would also list a count of missing values (NAs) if there were any. The output shows none.

Because there are no missing values in the acres variable it does not matter if we use the argument na.rm = TRUE in our sum() function.
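
summary() only prints an NA's entry when missing values exist, so their absence has to be read off the output. A more direct check, sketched on made-up vectors (complete and gappy are hypothetical names), uses anyNA() and is.na():

```r
# Hypothetical vectors, for illustration only
complete <- c(0.10, 0.25, 0.15)
gappy    <- c(0.10, NA, 0.15)

anyNA(complete)       # FALSE: no missing values
anyNA(gappy)          # TRUE: at least one missing value
sum(is.na(gappy))     # 1: the number of missing values

# With no NAs present, na.rm makes no difference to the total
sum(complete) == sum(complete, na.rm = TRUE)
```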

Question 3: Vacant Buildings

Question: How many vacant buildings are there in Syracuse, NY?

Answer: There are 1,888 vacant buildings in Syracuse, NY.


class(dat$vacantbuil) # confirm vacantbuil is a logical vector/variable
## [1] "logical"
summary(dat$vacantbuil) # counts of FALSE, TRUE, and missing values (NA's)
##    Mode   FALSE    TRUE    NA's 
## logical   38859    1888     755
sum(dat$vacantbuil) # returns NA because missing values are present
## [1] NA
sum(dat$vacantbuil, na.rm = TRUE) # removes missing values, then calculates the sum
## [1] 1888
## [1] 1888

Comments: Because vacantbuil is a logical variable (TRUE or FALSE), summary() reports the count of each value along with any missing values (NAs). We confirm that there are 755 missing values for this variable.

Because there are missing values in the vacantbuil variable, it matters whether we use the argument na.rm = TRUE in our sum() function. When we don’t include this argument and there are missing values, R will output NA rather than the count of TRUEs.

The argument na.rm = TRUE tells R to remove the missing values before summing, which gives the count of TRUE values: 1,888.
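
The same behavior can be reproduced on a small made-up logical vector (vacant is hypothetical, for illustration only). The sketch also shows table() with useNA = "ifany", which reports the TRUE, FALSE, and NA counts in one step:

```r
# Hypothetical logical vector, for illustration only
vacant <- c(TRUE, FALSE, NA, TRUE, FALSE, NA)

sum(vacant)                       # NA: a missing value makes the total unknown
sum(vacant, na.rm = TRUE)         # 2: counts only the TRUE values
sum(is.na(vacant))                # 2: the number of missing values
table(vacant, useNA = "ifany")    # FALSE, TRUE, and NA counts together
```

A count obtained with na.rm = TRUE is a count of known TRUE values; the parcels with a missing vacantbuil value are neither counted nor ruled out.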


Question 4: Tax-Exempt Parcels

Question: What proportion of parcels are tax-exempt?

Answer: 10.7% of parcels are tax-exempt.


class(dat$tax.exempt) # confirm tax.exempt is a logical vector
## [1] "logical"
summary(dat$tax.exempt) # counts of FALSE and TRUE; no NA's entry means no missing values
##    Mode   FALSE    TRUE 
## logical   37061    4441
mean(dat$tax.exempt) # with FALSE counted as 0 and TRUE as 1, the mean is the proportion of TRUE
## [1] 0.1070069
mean(dat$tax.exempt, na.rm = TRUE) # same mean, ignoring missing values
## [1] 0.1070069
sum(dat$tax.exempt) / length(dat$tax.exempt) # another approach: count of TRUE divided by total count
## [1] 0.1070069

Comments: Because tax.exempt is a logical variable (TRUE or FALSE), summary() reports the count of each value and would list any missing values (NAs). We confirm that there are no missing values for this variable.

Because there are no missing values, the mean() function gives us the same result regardless of the na.rm argument.
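
The approaches above agree only because tax.exempt has no missing values. A sketch on a made-up logical vector (exempt is hypothetical) shows how they would diverge otherwise: mean() with na.rm = TRUE divides by the number of non-missing values, while dividing by length() keeps the NAs in the denominator.

```r
# Hypothetical logical vector, for illustration only
exempt <- c(TRUE, FALSE, FALSE, NA, TRUE)

mean(exempt)                                  # NA because of the missing value
mean(exempt, na.rm = TRUE)                    # 2 / 4 = 0.5
sum(exempt, na.rm = TRUE) / length(exempt)    # 2 / 5 = 0.4
```

Which denominator is appropriate depends on whether the missing cases should count as part of the total.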


Question 5: Neighborhoods & Parcels

Question: Which neighborhood contains the most tax parcels?

Answer: Eastwood contains the most tax parcels, with 4,889.


table(dat$neighborhood) # tabulates neighborhoods and counts 
## 
##                Brighton          Court-Woodlawn                Downtown 
##                    2302                    2402                     389 
##                Eastwood                 Elmwood            Far Westside 
##                    4889                    1444                    1027 
##         Franklin Square            Hawley-Green               Lakefront 
##                      89                     367                     312 
##            Lincoln Hill             Meadowbrook           Near Eastside 
##                    1123                    1878                     441 
##           Near Westside            North Valley               Northside 
##                    1772                    1531                    3261 
##          Outer Comstock               Park Ave.           Prospect Hill 
##                     990                     942                     365 
##            Salt Springs                Sedgwick              Skunk City 
##                    1414                    1138                     713 
##            South Campus            South Valley               Southside 
##                      36                    1925                    1370 
##               Southwest              Strathmore               Tipp Hill 
##                    1150                    1822                    1468 
##         University Hill University Neighborhood       Washington Square 
##                     505                    1259                    1180 
##                Westcott               Winkworth 
##                    1540                     452
sort(table(dat$neighborhood), decreasing=TRUE) # sorts table in decreasing order
## 
##                Eastwood               Northside          Court-Woodlawn 
##                    4889                    3261                    2402 
##                Brighton            South Valley             Meadowbrook 
##                    2302                    1925                    1878 
##              Strathmore           Near Westside                Westcott 
##                    1822                    1772                    1540 
##            North Valley               Tipp Hill                 Elmwood 
##                    1531                    1468                    1444 
##            Salt Springs               Southside University Neighborhood 
##                    1414                    1370                    1259 
##       Washington Square               Southwest                Sedgwick 
##                    1180                    1150                    1138 
##            Lincoln Hill            Far Westside          Outer Comstock 
##                    1123                    1027                     990 
##               Park Ave.              Skunk City         University Hill 
##                     942                     713                     505 
##               Winkworth           Near Eastside                Downtown 
##                     452                     441                     389 
##            Hawley-Green           Prospect Hill               Lakefront 
##                     367                     365                     312 
##         Franklin Square            South Campus 
##                      89                      36
which.max(table(dat$neighborhood)) # quick way to find the max; returns its position, not its count
## Eastwood 
##        4

Comments: The table() function can be used to produce a table that lists all neighborhoods and the count of tax parcels within each neighborhood. However, this list comes in alphabetical order which isn’t very helpful in identifying which neighborhood contains the most tax parcels.

The sort() function will sort by count to help identify what we are looking for. The default is increasing order, so the argument decreasing=TRUE puts the neighborhoods with the most tax parcels first.

One quick way to find the maximum of a table or vector is the which.max() function. However, it returns the position of the maximum rather than the count itself: the 4 in the output means that Eastwood, which has the largest count, is the fourth entry in the alphabetically ordered table.
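
To recover both the name and the count, the position from which.max() can be used to index the table. The following is a minimal sketch on made-up labels (hood is hypothetical, not the real neighborhood variable):

```r
# Hypothetical neighborhood labels, for illustration only
hood <- c("Elm", "Oak", "Elm", "Ash", "Elm", "Oak", "Elm")

tab <- table(hood)         # counts, with names in alphabetical order
which.max(tab)             # position of the largest count (2), labeled "Elm"
names(which.max(tab))      # just the name: "Elm"
max(tab)                   # just the count: 4
tab[which.max(tab)]        # name and count together
```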


Question 6: Neighborhoods & Vacant Lots

Question: Which neighborhood contains the most vacant lots?

Answer: Near Westside contains the most vacant lots. It contains 425 vacant lots.


table(dat$neighborhood, dat$land_use) # cross tabulation using our two variables of interest
##                          
##                           Apartment Cemetery Commercial Community Services
##   Brighton                       26        0         49                 10
##   Court-Woodlawn                 18        4         52                  2
##   Downtown                        6        0        209                 17
##   Eastwood                      139        0        149                  6
##   Elmwood                        13        3         41                  4
##   Far Westside                   32        0         82                  2
##   Franklin Square                 3        0         27                  1
##   Hawley-Green                   47        0         78                  2
##   Lakefront                       1        0        121                  2
##   Lincoln Hill                   58        0         62                  4
##   Meadowbrook                    14        1         18                  1
##   Near Eastside                  36        0         79                  4
##   Near Westside                  75        0        133                 11
##   North Valley                   43        0         65                  3
##   Northside                     165        1        239                  5
##   Outer Comstock                 20       18         65                  0
##   Park Ave.                      48        0        171                  4
##   Prospect Hill                  54        0         76                  5
##   Salt Springs                   31        1        107                  7
##   Sedgwick                       11        0         14                  0
##   Skunk City                     17        0         18                  4
##   South Campus                    4        0          2                  0
##   South Valley                   22        2         53                  4
##   Southside                      45        0         84                 10
##   Southwest                      31        0         46                  5
##   Strathmore                      8        0          5                  1
##   Tipp Hill                      32        1         74                  2
##   University Hill                81        0        137                 15
##   University Neighborhood        20        1        132                  1
##   Washington Square              89        1        127                  4
##   Westcott                       38        1         81                  2
##   Winkworth                       1        0          5                  0
##                          
##                           Industrial Multiple Residence Parking Parks
##   Brighton                         0                  9       2     1
##   Court-Woodlawn                   1                  4       7     3
##   Downtown                         4                  0      78     8
##   Eastwood                         2                 13      15     3
##   Elmwood                          0                  5       8     6
##   Far Westside                     2                 14      11     1
##   Franklin Square                 10                  0       9     0
##   Hawley-Green                     4                  9      32     2
##   Lakefront                       12                  4       3     0
##   Lincoln Hill                     3                 10      13     1
##   Meadowbrook                      0                  0       2     6
##   Near Eastside                   11                  1       6     2
##   Near Westside                    9                 25      25     3
##   North Valley                     0                  4       3     3
##   Northside                        3                 37      29     9
##   Outer Comstock                   7                  3       1     1
##   Park Ave.                        7                 26      24     5
##   Prospect Hill                    2                  3      33     0
##   Salt Springs                     1                  0       8     1
##   Sedgwick                         0                  1       2     1
##   Skunk City                       0                  3       3     0
##   South Campus                     0                  0       0     0
##   South Valley                     1                  6       4     7
##   Southside                        5                  6      23     9
##   Southwest                       11                  1      15     3
##   Strathmore                       0                  2       2     5
##   Tipp Hill                        1                 10       7     6
##   University Hill                  2                  1      58     4
##   University Neighborhood          0                  1       0     3
##   Washington Square                4                 17      11     2
##   Westcott                         0                  2       3     3
##   Winkworth                        0                  0       0     0
##                          
##                           Recreation Religious Schools Single Family
##   Brighton                         2        16       3          1398
##   Court-Woodlawn                   0         2       4          1859
##   Downtown                         5         6       4             1
##   Eastwood                         4         7       2          3605
##   Elmwood                          2         7       2           909
##   Far Westside                     4         5       1           471
##   Franklin Square                  1         0       0             0
##   Hawley-Green                     2         3       0            52
##   Lakefront                        3         0       0            24
##   Lincoln Hill                     3         2       2           580
##   Meadowbrook                      1        10       6          1721
##   Near Eastside                    0         3       0            93
##   Near Westside                    3        12       2           521
##   North Valley                     1         6       3          1194
##   Northside                        1         7       3          1508
##   Outer Comstock                   0         2       1           697
##   Park Ave.                        1         6       3           167
##   Prospect Hill                    0         4       1            29
##   Salt Springs                     2         2       1          1029
##   Sedgwick                         1         2       2           892
##   Skunk City                       0         1       1           345
##   South Campus                     0         0       1            25
##   South Valley                     3         5       4          1605
##   Southside                        1        24       3           481
##   Southwest                        0        14       1           419
##   Strathmore                       0         3       4          1475
##   Tipp Hill                        1         7       3           785
##   University Hill                  6         6      41            17
##   University Neighborhood          4         3       6           803
##   Washington Square                1         5       0           425
##   Westcott                         2         4       2           851
##   Winkworth                        1         0       0           411
##                          
##                           Three Family Two Family Utilities Vacant Land
##   Brighton                          38        436         0         312
##   Court-Woodlawn                    11        370         1          64
##   Downtown                           0          0         6          45
##   Eastwood                          50        718         2         174
##   Elmwood                           18        240         0         186
##   Far Westside                      23        271         7         101
##   Franklin Square                    0          0         5          33
##   Hawley-Green                      14         66         0          56
##   Lakefront                          1         21        18         102
##   Lincoln Hill                      24        274         3          84
##   Meadowbrook                        1         30         2          65
##   Near Eastside                     23        103         4          76
##   Near Westside                     63        460         5         425
##   North Valley                       7        102         1          96
##   Northside                        159        920         1         174
##   Outer Comstock                     1         67         9          98
##   Park Ave.                         45        314         5         116
##   Prospect Hill                     31         88         0          39
##   Salt Springs                       2        120         7          95
##   Sedgwick                          11        181         0          20
##   Skunk City                        14        197         1         109
##   South Campus                       0          0         0           4
##   South Valley                       8         89         1         111
##   Southside                         35        293        10         341
##   Southwest                         23        209         5         367
##   Strathmore                        21        237         3          56
##   Tipp Hill                         35        433         2          69
##   University Hill                   13         30         3          91
##   University Neighborhood           41        208         1          35
##   Washington Square                 47        352         0          95
##   Westcott                          66        420         0          65
##   Winkworth                          0         10         0          24
unique(dat$land_use) # examine unique values for land_use to find our value of interest
##  [1] "Vacant Land"        "Single Family"      "Commercial"        
##  [4] "Parking"            "Two Family"         "Three Family"      
##  [7] "Apartment"          "Schools"            "Parks"             
## [10] "Multiple Residence" "Cemetery"           "Religious"         
## [13] "Recreation"         "Community Services" "Utilities"         
## [16] "Industrial"
table(dat$neighborhood, dat$land_use=="Vacant Land") # narrowed down cross-tabulation 
##                          
##                           FALSE TRUE
##   Brighton                 1990  312
##   Court-Woodlawn           2338   64
##   Downtown                  344   45
##   Eastwood                 4715  174
##   Elmwood                  1258  186
##   Far Westside              926  101
##   Franklin Square            56   33
##   Hawley-Green              311   56
##   Lakefront                 210  102
##   Lincoln Hill             1039   84
##   Meadowbrook              1813   65
##   Near Eastside             365   76
##   Near Westside            1347  425
##   North Valley             1435   96
##   Northside                3087  174
##   Outer Comstock            892   98
##   Park Ave.                 826  116
##   Prospect Hill             326   39
##   Salt Springs             1319   95
##   Sedgwick                 1118   20
##   Skunk City                604  109
##   South Campus               32    4
##   South Valley             1814  111
##   Southside                1029  341
##   Southwest                 783  367
##   Strathmore               1766   56
##   Tipp Hill                1399   69
##   University Hill           414   91
##   University Neighborhood  1224   35
##   Washington Square        1085   95
##   Westcott                 1475   65
##   Winkworth                 428   24
table(dat$neighborhood, dat$land_use == "Vacant Land")[ , 2] # [ , 2] keeps only the second (TRUE) column for all rows
##                Brighton          Court-Woodlawn                Downtown 
##                     312                      64                      45 
##                Eastwood                 Elmwood            Far Westside 
##                     174                     186                     101 
##         Franklin Square            Hawley-Green               Lakefront 
##                      33                      56                     102 
##            Lincoln Hill             Meadowbrook           Near Eastside 
##                      84                      65                      76 
##           Near Westside            North Valley               Northside 
##                     425                      96                     174 
##          Outer Comstock               Park Ave.           Prospect Hill 
##                      98                     116                      39 
##            Salt Springs                Sedgwick              Skunk City 
##                      95                      20                     109 
##            South Campus            South Valley               Southside 
##                       4                     111                     341 
##               Southwest              Strathmore               Tipp Hill 
##                     367                      56                      69 
##         University Hill University Neighborhood       Washington Square 
##                      91                      35                      95 
##                Westcott               Winkworth 
##                      65                      24
sort(table(dat$neighborhood, dat$land_use == "Vacant Land")[ , 2], decreasing=TRUE) # sorts the counts, largest first
##           Near Westside               Southwest               Southside 
##                     425                     367                     341 
##                Brighton                 Elmwood                Eastwood 
##                     312                     186                     174 
##               Northside               Park Ave.            South Valley 
##                     174                     116                     111 
##              Skunk City               Lakefront            Far Westside 
##                     109                     102                     101 
##          Outer Comstock            North Valley            Salt Springs 
##                      98                      96                      95 
##       Washington Square         University Hill            Lincoln Hill 
##                      95                      91                      84 
##           Near Eastside               Tipp Hill             Meadowbrook 
##                      76                      69                      65 
##                Westcott          Court-Woodlawn            Hawley-Green 
##                      65                      64                      56 
##              Strathmore                Downtown           Prospect Hill 
##                      56                      45                      39 
## University Neighborhood         Franklin Square               Winkworth 
##                      35                      33                      24 
##                Sedgwick            South Campus 
##                      20                       4

Comments: The key to answering this question is a cross tabulation of the two variables of interest. Given two variables, table() lists every neighborhood and its number of tax parcels in each land use category. However, we are only interested in vacant lots, so it helps to narrow the output to the Vacant Land category.

Once we have identified Vacant Land as the value of interest (with unique() or otherwise), the relational operator == lets us test for that value directly, which collapses the sixteen land use columns into two: FALSE and TRUE.

We can be more concise still by subsetting with [ , 2] to extract only the second column, which holds the TRUE counts from the narrowed cross tabulation. Sorting that result in decreasing order then puts the neighborhood with the most vacant lots first.
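
An alternative route to the same counts, sketched on made-up vectors (hood and use are hypothetical stand-ins for neighborhood and land_use): filter the neighborhood vector with the logical test first, then tabulate what remains. One difference to keep in mind is that a neighborhood with zero vacant lots is dropped from the filtered table, whereas the cross tabulation keeps it with a count of 0.

```r
# Hypothetical stand-ins for dat$neighborhood and dat$land_use
hood <- c("Elm", "Oak", "Oak", "Ash", "Oak", "Elm")
use  <- c("Vacant Land", "Vacant Land", "Parking",
          "Single Family", "Vacant Land", "Parking")

# Cross tabulation, then keep the TRUE column (selected by name here)
by_column <- table(hood, use == "Vacant Land")[ , "TRUE"]

# Filter first, then tabulate
by_filter <- table(hood[use == "Vacant Land"])

sort(by_column, decreasing = TRUE)    # Oak 2, Elm 1, Ash 0
sort(by_filter, decreasing = TRUE)    # Oak 2, Elm 1 (Ash is absent)
```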