The following report analyzes tax parcel data from Syracuse, New York (USA).
View the “Data Dictionary” here: Syracuse City Tax Parcel Data
The following code imports the Syracuse, NY tax parcel data using a URL.
url <- paste0("https://raw.githubusercontent.com/DS4PS/Data",
              "-Science-Class/master/DATA/syr_parcels.csv")

# Read the CSV from the URL, keeping text columns as character vectors
dat <- read.csv(url,
                stringsAsFactors = FALSE)
There are several exploratory functions to better understand our new dataset.
We can inspect the first 5 rows of these data using the function head().
The functions names() and colnames() will both print all variable names in a dataset.
## [1] "tax_id" "neighborhood" "stnum" "stname" "zip"
## [6] "owner" "frontfeet" "depth" "sqft" "acres"
## [11] "yearbuilt" "age" "age_range" "land_use" "units"
## [16] "residential" "rental" "vacantbuil" "assessedla" "assessedva"
## [21] "tax.exempt" "countytxbl" "schooltxbl" "citytaxabl" "star"
## [26] "amtdelinqu" "taxyrsdeli" "totint" "overduewater"
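As a minimal sketch of the two inspection steps described above, the example below uses a small, hypothetical data frame (toy) that borrows three column names from the parcel data. The values are invented for illustration and are not taken from the Syracuse file.

```r
# Hypothetical toy data frame; only the column names mirror the parcel data.
toy <- data.frame(
  tax_id       = 1:8,
  neighborhood = c("Eastwood", "Downtown", "Eastwood", "Westcott",
                   "Eastwood", "Downtown", "Westcott", "Eastwood"),
  acres        = c(0.10, 0.25, 0.12, 0.30, 0.11, 0.50, 0.20, 0.09),
  stringsAsFactors = FALSE
)

first_rows <- head(toy, 5)  # n = 5 requests five rows; head() defaults to n = 6
var_names  <- names(toy)    # colnames(toy) gives the same result for a data frame

print(first_rows)
print(var_names)
```

With the real data, the same calls would be applied to dat instead of toy.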
We can also inspect the values of a variable by extracting it with the $ operator.
The extracted variable is called a “vector”.
## [1] "CLARMIN BUILDERS ONON COR" "JOHNSTON LEE R"
## [3] "CHRISTO CRAIG S" "HAWKINS FARMS INC"
## [5] "PETERS LYNNETTE" "MITCHELL LOTAN G"
## [7] "WHALEN GIOVANNA A" "BERGH GARY D"
## [9] "CITY OF SYRACUSE TD" "DOUGHERTY ROBERT K JR"
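The extraction step can be sketched as follows. The data frame and the owner names in it are hypothetical; with the parcel data, the analogous expression would be dat$owner.

```r
# Hypothetical toy data; the owner names are invented.
toy <- data.frame(
  owner = c("SMITH JANE", "DOE JOHN", "CITY OF EXAMPLE"),
  acres = c(0.1, 0.2, 1.5),
  stringsAsFactors = FALSE
)

owners <- toy$owner      # $ pulls one column out of the data frame

print(class(owners))     # "character"
print(is.vector(owners)) # TRUE
print(head(owners, 2))   # first two values of the vector
```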
The function unique() helps us determine which distinct values exist in a variable.
## [1] "Vacant Land" "Single Family" "Commercial"
## [4] "Parking" "Two Family" "Three Family"
## [7] "Apartment" "Schools" "Parks"
## [10] "Multiple Residence" "Cemetery" "Religious"
## [13] "Recreation" "Community Services" "Utilities"
## [16] "Industrial"
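A short sketch of unique() on a hypothetical vector shows that each distinct value is returned once, in order of first appearance. The values reuse three of the land use categories listed above.

```r
# Hypothetical vector with repeated land use categories.
land_use <- c("Vacant Land", "Single Family", "Commercial",
              "Single Family", "Vacant Land")

distinct_values <- unique(land_use)

print(distinct_values)          # "Vacant Land" "Single Family" "Commercial"
print(length(distinct_values))  # 3
```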
The function str() provides an overview of the total rows and columns (dimensions), variable classes, and a preview of values.
## 'data.frame': 41502 obs. of 29 variables:
## $ tax_id : int 1393130501 1393130500 1437100600 1425100900 1425101000 ...
## $ neighborhood: chr "South Valley" "South Valley" ...
## $ stnum : chr "2655" "2635" ...
## $ stname : chr "VALLEY DR" "VALLEY DR" ...
## $ zip : chr "13215" "13120" ...
## $ owner : chr "CLARMIN BUILDERS ONON COR" "JOHNSTON LEE R" ...
## $ frontfeet : num 67.2 104.8 ...
## $ depth : num 50 46.5 ...
## $ sqft : num 2149 6370 ...
## $ acres : num 0.0493 0.1462 ...
## $ yearbuilt : int NA 1925 1957 1958 1965 ...
## $ age : int NA 90 58 57 50 ...
## $ age_range : chr NA "81-90" ...
## $ land_use : chr "Vacant Land" "Single Family" ...
## $ units : int 0 0 0 0 0 ...
## $ residential : logi FALSE TRUE TRUE ...
## $ rental : logi FALSE FALSE FALSE ...
## $ vacantbuil : logi FALSE FALSE FALSE ...
## $ assessedla : int 475 10800 20200 18000 18000 ...
## $ assessedva : int 500 69300 88300 70500 74000 ...
## $ tax.exempt : logi TRUE FALSE FALSE ...
## $ countytxbl : int 500 69300 88300 70500 74000 ...
## $ schooltxbl : int 500 69300 88300 70500 74000 ...
## $ citytaxabl : int 500 69300 88300 70500 74000 ...
## $ star : logi NA TRUE TRUE ...
## $ amtdelinqu : num 0 0 0 0 0 ...
## $ taxyrsdeli : int 0 0 0 0 0 ...
## $ totint : num 0 0 0 0 0 ...
## $ overduewater: num 0 178 ...
Instructions: Provide the code for each solution in the following “chunks”.
Remember to modify the text to show your answer in human-readable terms.
Question: How many tax parcels are in Syracuse, NY?
Answer: There are 41,502 tax parcels in Syracuse, NY. The answer is found by using the functions below and by reading the description of the data in the data dictionary.
## [1] 41502 29
## [1] 41502
## [1] 41502
Comments: The function dim() shows the dimensions of the data set (rows and columns). Each row represents a unique tax parcel, while the columns represent the variables in the data set.
Because the data dictionary tells us that each row represents a unique tax parcel, we can simply use the nrow() function to determine the number of tax parcels.
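The three ways of counting rows can be sketched on a hypothetical data frame. The third call, length() on a single column, is an assumption about what produced the last line of output above; it is one common equivalent of nrow().

```r
# Hypothetical toy data frame with six rows and two columns.
toy <- data.frame(
  tax_id = 101:106,
  acres  = c(0.10, 0.25, 0.12, 0.30, 0.11, 0.50)
)

dims       <- dim(toy)             # rows, then columns
n_rows     <- nrow(toy)            # rows only
n_from_col <- length(toy$tax_id)   # length of one column equals the row count

print(dims)        # 6 2
print(n_rows)      # 6
print(n_from_col)  # 6
```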
Even though the data dictionary states each row is a unique tax parcel, is this actually confirmed by the data?
## [1] 41469
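One way to run this check, sketched on a hypothetical vector of identifiers, is to compare the number of rows with the number of distinct IDs. If the figure printed above is the count of distinct tax_id values (an assumption, since the code is not shown), then 41,502 - 41,469 = 33 rows would share an identifier with another row.

```r
# Hypothetical identifiers; 103 and 105 each appear twice.
tax_id <- c(101, 102, 103, 103, 104, 105, 105)

n_rows   <- length(tax_id)          # total number of records
n_unique <- length(unique(tax_id))  # number of distinct identifiers
n_extra  <- n_rows - n_unique       # records beyond the first use of an ID

# Identifiers that occur more than once
repeated_ids <- unique(tax_id[duplicated(tax_id)])

print(n_rows)        # 7
print(n_unique)      # 5
print(n_extra)       # 2
print(repeated_ids)  # 103 105
```

With the real data, the analogous comparison would be nrow(dat) against length(unique(dat$tax_id)).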
Question: How many acres of land are in Syracuse, NY?
Answer: There are 12,510.49 acres of land in Syracuse, NY.
## [1] "numeric"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00028 0.10101 0.13180 0.30144 0.18866 140.56053
## [1] 12510.49
## [1] 12510.49
Comments: Because acres is a numeric variable, we can check its summary statistics using summary() to see whether there are any missing values (NAs). We confirm that there are none.
Because there are no missing values in the acres variable, it does not matter whether we use the argument na.rm = TRUE in our sum() function.
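A minimal sketch with a hypothetical numeric vector illustrates the point: when no values are missing, sum() returns the same total with or without na.rm = TRUE.

```r
# Hypothetical acreage values with no missing entries.
acres <- c(0.10, 0.25, 0.12, 0.30)

print(summary(acres))   # no "NA's" column appears when nothing is missing
has_missing <- anyNA(acres)

total_default <- sum(acres)
total_na_rm   <- sum(acres, na.rm = TRUE)

print(has_missing)      # FALSE
print(total_default)    # 0.77
print(total_na_rm)      # 0.77
```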
Question: How many vacant buildings are there in Syracuse, NY?
Answer: There are 1,888 vacant buildings in Syracuse, NY.
## [1] "logical"
## Mode FALSE TRUE NA's
## logical 38859 1888 755
## [1] NA
## [1] 1888
Comments: Because vacantbuil is a logical variable (TRUE or FALSE), we can check its summary statistics to see whether there are any missing values (NAs). We confirm that there are indeed 755 missing values for this variable.
Because there are missing values in the vacantbuil variable, it does matter whether we use the argument na.rm = TRUE in our sum() function. When we omit this argument and there are missing values, R returns NA rather than the count of TRUEs.
The argument na.rm = TRUE tells R to remove the missing values and return the sum of all the TRUE values, which is 1,888.
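The effect of na.rm on a logical vector can be sketched with hypothetical values. R treats TRUE as 1 and FALSE as 0 when summing, and a single NA makes the default sum NA.

```r
# Hypothetical logical vector with two missing values.
vacant <- c(TRUE, FALSE, NA, TRUE, FALSE, NA, TRUE)

count_default <- sum(vacant)                # NA, because missing values propagate
count_na_rm   <- sum(vacant, na.rm = TRUE)  # counts TRUEs after dropping NAs
n_missing     <- sum(is.na(vacant))         # how many values are missing

print(count_default)  # NA
print(count_na_rm)    # 3
print(n_missing)      # 2
```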
Question: What proportion of parcels are tax-exempt?
Answer: 10.7% of parcels are tax-exempt.
## [1] "logical"
## Mode FALSE TRUE
## logical 37061 4441
## [1] 0.1070069
## [1] 0.1070069
## [1] 0.1070069
Comments: Because tax.exempt is a logical variable (TRUE or FALSE), we can check its summary statistics to see whether there are any missing values (NAs). We confirm that there are no missing values for this variable.
Because there are no missing values, the mean() function gives us the same result regardless of the na.rm argument.
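The mean of a logical vector is the proportion of TRUE values, which is why mean() answers a "what proportion" question. The sketch below uses a hypothetical vector; in the report's data the same arithmetic is 4,441 / 41,502, which rounds to 0.107.

```r
# Hypothetical logical vector: 2 TRUE values out of 10.
tax_exempt <- c(TRUE, FALSE, FALSE, FALSE, TRUE,
                FALSE, FALSE, FALSE, FALSE, FALSE)

prop_mean   <- mean(tax_exempt)                     # proportion of TRUEs
prop_manual <- sum(tax_exempt) / length(tax_exempt) # same calculation by hand

print(prop_mean)    # 0.2
print(prop_manual)  # 0.2
```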
Question: Which neighborhood contains the most tax parcels?
Answer: Eastwood contains the most tax parcels. It contains 4,889 tax parcels.
##
## Brighton Court-Woodlawn Downtown
## 2302 2402 389
## Eastwood Elmwood Far Westside
## 4889 1444 1027
## Franklin Square Hawley-Green Lakefront
## 89 367 312
## Lincoln Hill Meadowbrook Near Eastside
## 1123 1878 441
## Near Westside North Valley Northside
## 1772 1531 3261
## Outer Comstock Park Ave. Prospect Hill
## 990 942 365
## Salt Springs Sedgwick Skunk City
## 1414 1138 713
## South Campus South Valley Southside
## 36 1925 1370
## Southwest Strathmore Tipp Hill
## 1150 1822 1468
## University Hill University Neighborhood Washington Square
## 505 1259 1180
## Westcott Winkworth
## 1540 452
##
## Eastwood Northside Court-Woodlawn
## 4889 3261 2402
## Brighton South Valley Meadowbrook
## 2302 1925 1878
## Strathmore Near Westside Westcott
## 1822 1772 1540
## North Valley Tipp Hill Elmwood
## 1531 1468 1444
## Salt Springs Southside University Neighborhood
## 1414 1370 1259
## Washington Square Southwest Sedgwick
## 1180 1150 1138
## Lincoln Hill Far Westside Outer Comstock
## 1123 1027 990
## Park Ave. Skunk City University Hill
## 942 713 505
## Winkworth Near Eastside Downtown
## 452 441 389
## Hawley-Green Prospect Hill Lakefront
## 367 365 312
## Franklin Square South Campus
## 89 36
## Eastwood
## 4
Comments: The table() function can be used to produce a table that lists all neighborhoods and the count of tax parcels within each neighborhood. However, this list comes in alphabetical order, which isn’t very helpful in identifying which neighborhood contains the most tax parcels.
The sort() function sorts the table by count to help identify what we are looking for. The default is ascending (increasing) order, so with the argument decreasing = TRUE we get the neighborhoods with the most tax parcels first.
One quick method of finding the max value across a table or vector is the which.max() function. But this function won’t display the actual count. It returns the position of the maximum: Eastwood, which has the max value, is the fourth entry in the alphabetically ordered table of ‘neighborhood’ values.
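The three approaches can be sketched on a hypothetical vector of neighborhood labels. The last two lines show one way to recover the count that which.max() leaves out, by using its result as an index into the table.

```r
# Hypothetical neighborhood labels.
neighborhood <- c("Eastwood", "Downtown", "Eastwood",
                  "Westcott", "Eastwood", "Downtown")

counts <- table(neighborhood)              # alphabetical: Downtown, Eastwood, Westcott
sorted <- sort(counts, decreasing = TRUE)  # largest count first
top    <- which.max(counts)                # named position of the maximum

top_name  <- names(top)                    # label of the largest group
top_count <- max(counts)                   # the count itself

print(sorted)
print(top)        # Eastwood is in position 2 of the alphabetical table
print(top_name)   # "Eastwood"
print(top_count)  # 3
```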
Question: Which neighborhood contains the most vacant lots?
Answer: Near Westside contains the most vacant lots. It contains 425 vacant lots.
##
## Apartment Cemetery Commercial Community Services
## Brighton 26 0 49 10
## Court-Woodlawn 18 4 52 2
## Downtown 6 0 209 17
## Eastwood 139 0 149 6
## Elmwood 13 3 41 4
## Far Westside 32 0 82 2
## Franklin Square 3 0 27 1
## Hawley-Green 47 0 78 2
## Lakefront 1 0 121 2
## Lincoln Hill 58 0 62 4
## Meadowbrook 14 1 18 1
## Near Eastside 36 0 79 4
## Near Westside 75 0 133 11
## North Valley 43 0 65 3
## Northside 165 1 239 5
## Outer Comstock 20 18 65 0
## Park Ave. 48 0 171 4
## Prospect Hill 54 0 76 5
## Salt Springs 31 1 107 7
## Sedgwick 11 0 14 0
## Skunk City 17 0 18 4
## South Campus 4 0 2 0
## South Valley 22 2 53 4
## Southside 45 0 84 10
## Southwest 31 0 46 5
## Strathmore 8 0 5 1
## Tipp Hill 32 1 74 2
## University Hill 81 0 137 15
## University Neighborhood 20 1 132 1
## Washington Square 89 1 127 4
## Westcott 38 1 81 2
## Winkworth 1 0 5 0
##
## Industrial Multiple Residence Parking Parks
## Brighton 0 9 2 1
## Court-Woodlawn 1 4 7 3
## Downtown 4 0 78 8
## Eastwood 2 13 15 3
## Elmwood 0 5 8 6
## Far Westside 2 14 11 1
## Franklin Square 10 0 9 0
## Hawley-Green 4 9 32 2
## Lakefront 12 4 3 0
## Lincoln Hill 3 10 13 1
## Meadowbrook 0 0 2 6
## Near Eastside 11 1 6 2
## Near Westside 9 25 25 3
## North Valley 0 4 3 3
## Northside 3 37 29 9
## Outer Comstock 7 3 1 1
## Park Ave. 7 26 24 5
## Prospect Hill 2 3 33 0
## Salt Springs 1 0 8 1
## Sedgwick 0 1 2 1
## Skunk City 0 3 3 0
## South Campus 0 0 0 0
## South Valley 1 6 4 7
## Southside 5 6 23 9
## Southwest 11 1 15 3
## Strathmore 0 2 2 5
## Tipp Hill 1 10 7 6
## University Hill 2 1 58 4
## University Neighborhood 0 1 0 3
## Washington Square 4 17 11 2
## Westcott 0 2 3 3
## Winkworth 0 0 0 0
##
## Recreation Religious Schools Single Family
## Brighton 2 16 3 1398
## Court-Woodlawn 0 2 4 1859
## Downtown 5 6 4 1
## Eastwood 4 7 2 3605
## Elmwood 2 7 2 909
## Far Westside 4 5 1 471
## Franklin Square 1 0 0 0
## Hawley-Green 2 3 0 52
## Lakefront 3 0 0 24
## Lincoln Hill 3 2 2 580
## Meadowbrook 1 10 6 1721
## Near Eastside 0 3 0 93
## Near Westside 3 12 2 521
## North Valley 1 6 3 1194
## Northside 1 7 3 1508
## Outer Comstock 0 2 1 697
## Park Ave. 1 6 3 167
## Prospect Hill 0 4 1 29
## Salt Springs 2 2 1 1029
## Sedgwick 1 2 2 892
## Skunk City 0 1 1 345
## South Campus 0 0 1 25
## South Valley 3 5 4 1605
## Southside 1 24 3 481
## Southwest 0 14 1 419
## Strathmore 0 3 4 1475
## Tipp Hill 1 7 3 785
## University Hill 6 6 41 17
## University Neighborhood 4 3 6 803
## Washington Square 1 5 0 425
## Westcott 2 4 2 851
## Winkworth 1 0 0 411
##
## Three Family Two Family Utilities Vacant Land
## Brighton 38 436 0 312
## Court-Woodlawn 11 370 1 64
## Downtown 0 0 6 45
## Eastwood 50 718 2 174
## Elmwood 18 240 0 186
## Far Westside 23 271 7 101
## Franklin Square 0 0 5 33
## Hawley-Green 14 66 0 56
## Lakefront 1 21 18 102
## Lincoln Hill 24 274 3 84
## Meadowbrook 1 30 2 65
## Near Eastside 23 103 4 76
## Near Westside 63 460 5 425
## North Valley 7 102 1 96
## Northside 159 920 1 174
## Outer Comstock 1 67 9 98
## Park Ave. 45 314 5 116
## Prospect Hill 31 88 0 39
## Salt Springs 2 120 7 95
## Sedgwick 11 181 0 20
## Skunk City 14 197 1 109
## South Campus 0 0 0 4
## South Valley 8 89 1 111
## Southside 35 293 10 341
## Southwest 23 209 5 367
## Strathmore 21 237 3 56
## Tipp Hill 35 433 2 69
## University Hill 13 30 3 91
## University Neighborhood 41 208 1 35
## Washington Square 47 352 0 95
## Westcott 66 420 0 65
## Winkworth 0 10 0 24
## [1] "Vacant Land" "Single Family" "Commercial"
## [4] "Parking" "Two Family" "Three Family"
## [7] "Apartment" "Schools" "Parks"
## [10] "Multiple Residence" "Cemetery" "Religious"
## [13] "Recreation" "Community Services" "Utilities"
## [16] "Industrial"
##
## FALSE TRUE
## Brighton 1990 312
## Court-Woodlawn 2338 64
## Downtown 344 45
## Eastwood 4715 174
## Elmwood 1258 186
## Far Westside 926 101
## Franklin Square 56 33
## Hawley-Green 311 56
## Lakefront 210 102
## Lincoln Hill 1039 84
## Meadowbrook 1813 65
## Near Eastside 365 76
## Near Westside 1347 425
## North Valley 1435 96
## Northside 3087 174
## Outer Comstock 892 98
## Park Ave. 826 116
## Prospect Hill 326 39
## Salt Springs 1319 95
## Sedgwick 1118 20
## Skunk City 604 109
## South Campus 32 4
## South Valley 1814 111
## Southside 1029 341
## Southwest 783 367
## Strathmore 1766 56
## Tipp Hill 1399 69
## University Hill 414 91
## University Neighborhood 1224 35
## Washington Square 1085 95
## Westcott 1475 65
## Winkworth 428 24
table(dat$neighborhood, dat$land_use == "Vacant Land")[ , 2]  # [ , 2] keeps only the second column (TRUE) for all rows
## Brighton Court-Woodlawn Downtown
## 312 64 45
## Eastwood Elmwood Far Westside
## 174 186 101
## Franklin Square Hawley-Green Lakefront
## 33 56 102
## Lincoln Hill Meadowbrook Near Eastside
## 84 65 76
## Near Westside North Valley Northside
## 425 96 174
## Outer Comstock Park Ave. Prospect Hill
## 98 116 39
## Salt Springs Sedgwick Skunk City
## 95 20 109
## South Campus South Valley Southside
## 4 111 341
## Southwest Strathmore Tipp Hill
## 367 56 69
## University Hill University Neighborhood Washington Square
## 91 35 95
## Westcott Winkworth
## 65 24
## Near Westside Southwest Southside
## 425 367 341
## Brighton Elmwood Eastwood
## 312 186 174
## Northside Park Ave. South Valley
## 174 116 111
## Skunk City Lakefront Far Westside
## 109 102 101
## Outer Comstock North Valley Salt Springs
## 98 96 95
## Washington Square University Hill Lincoln Hill
## 95 91 84
## Near Eastside Tipp Hill Meadowbrook
## 76 69 65
## Westcott Court-Woodlawn Hawley-Green
## 65 64 56
## Strathmore Downtown Prospect Hill
## 56 45 39
## University Neighborhood Franklin Square Winkworth
## 35 33 24
## Sedgwick South Campus
## 20 4
Comments: The key to answering this question is to produce a cross tabulation of the two variables of interest from our data set. The table() function produces a cross tabulation that lists all neighborhoods and their respective number of tax parcels for each land use value. However, we are only interested in vacant lots, so it is helpful to narrow the output down to the Vacant Land category across neighborhoods.
Once we have identified Vacant Land as the value of interest (for example, with unique()), we can create more targeted and concise output by using the relational operator == to specify the value we seek.
We can get even more concise by subsetting to extract only the second column with [ , 2], which contains the counts of TRUE in the narrowed-down cross tabulation. Then we can sort the result to find the max value more easily.
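The narrowing steps above can be sketched end to end on a hypothetical data frame. This sketch selects the column by its name, "TRUE", rather than by position. That is a swapped-in design choice, not the report's code: position 2 only holds the TRUE counts when both FALSE and TRUE occur in the data.

```r
# Hypothetical toy data; neighborhood and land use labels reused for illustration.
toy <- data.frame(
  neighborhood = c("Eastwood", "Eastwood", "Southside",
                   "Southside", "Southside", "Downtown"),
  land_use     = c("Vacant Land", "Single Family", "Vacant Land",
                   "Vacant Land", "Commercial", "Single Family"),
  stringsAsFactors = FALSE
)

full_tab   <- table(toy$neighborhood, toy$land_use)                   # every land use
vacant_tab <- table(toy$neighborhood, toy$land_use == "Vacant Land")  # FALSE / TRUE only

vacant_counts <- vacant_tab[ , "TRUE"]   # same as [ , 2] here; selected by name
ranked        <- sort(vacant_counts, decreasing = TRUE)

print(full_tab)
print(ranked)             # Southside 2, Eastwood 1, Downtown 0
print(names(ranked)[1])   # "Southside"
```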