Census and USGS DataAni Ruhil1 / 30

Agenda

Brief overview of useful Census and other data sources you may need

using the Census API with tidycensus and censusapi
data from the World Bank
data from the USGS

2 / 30

tidycensus 3 / 30

Kyle Walker has authored tidycensus to ease working with Census data and it is built to work well with the tidyverse and mapping libraries.

get a Census API key from here
install the tidycensus package

library(tidycensus)
library(tidyverse)
census_api_key("INSERT YOUR API KEY HERE")

Two functions of primary value --

(1) get_decennial() for decennial census data from 1990, 2000, and 2010 (2) get_acs() for data from the American Community Survey (ACS) series

Typical starting point will be finding the table you want and an easy way to do this is by pulling a list of all tables available for a sopecific census product. Below you see the code for grabbing a list of the Summary File 1 tables from the 2010 decennial census, followed by the 1-year 2016 ACS and the 5-year ACS (2012-2016).

load_variables(year = 2010, dataset = "sf1", cache = TRUE) -> my.tables.d 
load_variables(year = 2018, dataset = "acs1", cache = TRUE) -> my.tables.a1 
load_variables(year = 2018, dataset = "acs5", cache = TRUE) -> my.tables.a5

4 / 30

Each of these will show you what variables are available, and make it easier for you to choose the ones you want to work with. Assume we want total population, variable P0010001, from the decennial census

get_decennial(
  year = 2010, 
  geography = "state", 
  variables = c(totpopn = "P001001")
  ) -> state.popn.d 
get_decennial(
  year = 2010,
  geography = "county", 
  variables = c(totpopn = "P001001"),
  state = "OH"
  ) -> county.popn.d 
get_decennial(
  geography = "tract", 
  variables = c(totpopn = "P001001"),
  state = "OH",
  county = "Athens"
  ) -> tract.popn.d

and then from the ACS. Note that the default for each is 2010 and the most recent ACS (2012-2016).

get_acs(
  year = 2018,
  geography = "state", 
  variables = c(totpopn = "B01003_001")
  ) -> state.popn.acs 
get_acs(
  year = 2018,
  geography = "county", 
  variables = c(totpopn = "B01003_001"),
  state = "OH"
  ) -> county.popn.acs 
get_acs(
  year = 2018,
  geography = "tract", 
  variables = c(totpopn = "B01003_001"),
  state = "OH", county = "Athens"
  ) -> tract.popn.acs

5 / 30

You can also get an entire table instead of having to list variables, such as shown below, with the default decennial census for the package's current version (2010).

get_decennial(
  year = 2010,
  geography = "county",
  table = "P012",
  summary_var = "P001001"
  ) -> county.table.d 
get_decennial(
  year = 2010,
  geography = "state",
  table = "P012",
  summary_var = "P001001"
  ) -> state.table.d

Population data for all Census tracts in the country?

library(tidycensus)
library(purrr)
unique(fips_codes$state)[1:51] -> us 
map_df(
  us, function(x) {
    get_acs(geography = "tract", variables = "B01003_001", 
    state = x, geometry = TRUE)
    }
  ) -> totalpop

6 / 30

censusapi 7 / 30

This package will allow you to grab a vast array of Census products, and very niftily as well I might add.

library(censusapi)
listCensusApis() -> apis

Application Programming Interfaces (APIs) could have slightly varying parameters so it pays to check the API documentation available here.

It is easy to find variable names via the built-in function listCensusMetadata. As an example, the bureau's small area health insurance estimates are shown below, as well as the small area income and poverty estimates.

listCensusMetadata(
  name = "timeseries/healthins/sahie", 
  type = "variables"
  ) -> sahie_vars 
listCensusMetadata(
  name = "timeseries/poverty/saipe", 
  type = "variables"
  ) -> saipe_vars

Want to see what geographies are available? Switch type = to geography:

listCensusMetadata(
  name = "timeseries/healthins/sahie", 
  type = "geography"
  ) -> sahie_geos 
listCensusMetadata(
  name = "timeseries/poverty/saipe", 
  type = "geography"
  ) -> saipe_geos

8 / 30

Grab the most recent county-level data but note that the latest SAHIE are for 2015 while the latest SAIPE are for 2016. Every variable is downloaded as a chr so you will need to flip what should be numeric into numeric.

getCensus(
  name = "timeseries/healthins/sahie",
  vars = c("NAME", "IPRCAT", "IPR_DESC", "PCTUI_PT"), 
  region = "county:*", regionin = "state:39",
  time = 2015,
  key = CENSUS_KEY
  ) -> sahie_counties 
head(sahie_counties, n = 12L)
getCensus(
  name = "timeseries/poverty/saipe",
  vars = c("NAME", "SAEPOVRTALL_PT", "SAEMHI_PT"), 
  region = "county:*", regionin = "state:39",
  time = 2016,
  key = CENSUS_KEY) -> saipe_counties 
head(saipe_counties, n = 12L)
saipe_counties %>%
  mutate(
    prate = as.numeric(saipe_counties$SAEPOVRTALL_PT),
    mdhinc = as.numeric(saipe_counties$SAEMHI_PT)
  ) -> saipe_counties

9 / 30

If you want data for a lot of geographies, the package will let you do it in seconds. For example, if you want tract-level data for all tracts, you can get it as shown below:

fips
tracts <- NULL
for (f in fips) {
    stateget <- paste("state:", f, sep = "")
    temp <- getCensus(name = "sf3", vintage  = 1990,
    vars = c("P0070001", "P0070002", "P114A001"), region = "tract:*",
    regionin = stateget, key = CENSUS_KEY)
    tracts <- rbind(tracts, temp)
}
head(tracts)

Notice that you had to specify the region, and since tracts are nested within states, you had to add regionin = stateget.

10 / 30

World Bank data with data360r and wbstats 11 / 30

Two packages are available to access data gathered by the World Bank, data360r and wbstats. Install both packages, and then we can start with data360r. We can start by seeing what indicators are available, some basic country-level information, and what data-sets are available.

library(data360r)
#get all indicator metadata in Govdata360
get_metadata360(site = "gov", metadata_type = "indicators") -> df_indicators 
#get all country metadata in TCdata360
get_metadata360(metadata_type = 'countries') -> df_countries 
#get all dataset metadata in TCdata360
get_metadata360(metadata_type = 'datasets') -> df_datasets

12 / 30

Once you have identified a particular table, note it's ID, and then pull it. Say I want indicator 90 = Can a married woman register a business in the same way as a married man? I can do the same thing for more than one indicator by specifying the IDs.

get_data360(indicator_id = c(90)) -> df_ind90 
get_data360(indicator_id = c(28130, 28131)) -> df_indtwo

If I only want all data for a specific country or just the indicators we pull in df_ind1 for a specific country you could do:

get_data360(country_iso3 = "IND") -> df_allone 
search_360("woman business", search_type = "indicator", limit_results = 5) -> df_ind1 
get_data360(indicator_id = df_ind1$id, country_iso3 = "IND") -> df_allindtwo

13 / 30

Now an example with two measures -- legal age of marriage for boys and girls. Note that the package allows you to specify the long format (preferred) than the default wide format you see earlier results being returned in. Note also the use of timeframes = c() that allows you to specify the specific time period you want the indicators for.

search_360("marriage", search_type = "indicator") -> marriage.indicators
DT::datatable(marriage.indicators)

Show entries

Search:

	id	name	slug	type	score	dataset	redirect
1	204	What is the legal age of marriage for boys?	age.marr.male	indicator	0.111111111111111	Women, Business and the Law	false
2	205	What is the legal age of marriage for girls?	age.marr.fem	indicator	0.111111111111111	Women, Business and the Law	false
3	206	Are there any exceptions to the legal age of marriage?	marry.age.except	indicator	0.1	Women, Business and the Law	false
4	207	Does the law prohibit or invalidate child or early marriage?	chld.marr	indicator	0.1	Women, Business and the Law	false
5	40083	What is the minimum age of marriage with judicial authorization for girls?	h5e7b11fe	indicator	0.0833333333333333	Women, Business and the Law	false
6	40080	What is the minimum age of marriage with parental consent for boys?	haab6520b	indicator	0.0833333333333333	Women, Business and the Law	false
7	40081	What is the minimum age of marriage with parental consent for girls?	hdecd0041	indicator	0.0833333333333333	Women, Business and the Law	false
8	208	Are there penalties in the law for authorizing or knowingly entering into the child or early marriage?'	hb9d0a754	indicator	0.0588235294117647	Women, Business and the Law	false

Showing 1 to 8 of 8 entries

Previous1Next

14 / 30

Legal age of marriage?

get_data360(indicator_id = c(204, 205),
            timeframes = c(2016),
            output_type = 'long'
            ) -> df_marriage 
ggplot(df_marriage,
       aes(x = Observation,
           group = Indicator,
           fill = Indicator)) +
  geom_bar() +
  theme(legend.position = "none") +
  facet_wrap(~ Indicator, ncol = 1) +
  labs(
    x = "Legal Age of Marriage",
    y = "Frequency",
    title = "Legal Age of Marriage",
    subtitle = " for Boys vs. Girls (2016)",
    caption = "Source: World Bank Data")

15 / 30

The wbstats package does pretty much the same thing. Let us see the core functionality by loading the library and then seeing what is available in terms of indicators, topics, and so on. We can then set the most current list of information in wb_cachelist to be used via new_cache. Doing so speeds up the operations and ensures that you are getting the most uptodate data.

library(wbstats)
str(wb_cachelist, max.level = 1)

## List of 7
##  $ countries  :'data.frame':    304 obs. of  18 variables:
##  $ indicators :'data.frame':    16978 obs. of  7 variables:
##  $ sources    :'data.frame':    43 obs. of  8 variables:
##  $ datacatalog:'data.frame':    238 obs. of  29 variables:
##  $ topics     :'data.frame':    21 obs. of  3 variables:
##  $ income     :'data.frame':    7 obs. of  3 variables:
##  $ lending    :'data.frame':    4 obs. of  3 variables:

new_cache <- wbcache()

16 / 30

What indicators are available?

wbsearch(pattern = "corruption") -> corruption_vars 
DT::datatable(corruption_vars)

Show entries

Search:

	indicatorID	indicator
6106	IC.CNS.CORR.ZS	Corruption (% of managers surveyed ranking this as a major constraint)
6858	IC.FRM.ELEC.ZS	Electricity (% of firms identifying this as a major constraint)
6873	IC.FRM.CORR.GRAFT2	Bribery index (% of gift or informal payment requests during public transactions)
6874	IC.FRM.CORR.CRIME9	Percent of firms identifying the courts system as a major constraint
6875	IC.FRM.CORR.CORR9	Percent of firms expected to give gifts to get an import license
6876	IC.FRM.CORR.CORR8	Percent of firms expected to give gifts to get a construction permit
6877	IC.FRM.CORR.CORR7	Percent of firms expected to give gifts to get a water connection
6878	IC.FRM.CORR.CORR6	Percent of firms expected to give gifts to get an electrical connection
6879	IC.FRM.CORR.CORR4	Percent of firms expected to give gifts to public officials "to get things done"
6880	IC.FRM.CORR.CORR3	Value of gift expected to secure a government contract (% of contract value)

Showing 1 to 10 of 29 entries

Previous1 2 3Next

17 / 30

If I want information from a particular source, say Bloomberg,

wbsearch(
  pattern = "Bloomberg",
  fields = "sourceOrg"
  ) -> blmbrg_vars 
DT::datatable(blmbrg_vars)

Show entries

Search:

	indicatorID	indicator
262	WHEAT_US_HRW	Wheat, US, HRW, $/mt, current$
766	SUGAR_US	Sugar, US, cents/kg, current$
2563	RUBBER1_MYSG	Rubber, Singapore, cents/kg, current$
9488	GFDD.SM.01	Stock price volatility
9496	GFDD.OM.02	Stock market return (%, year-on-year)
12003	BARLEY	Barley, $/mt, current$
13295	CRUDE_WTI	Crude oil, WTI, $/bbl, current$
13297	CRUDE_DUBAI	Crude oil, Dubai, $/bbl, current$
13298	CRUDE_BRENT	Crude oil, Brendt, $/bbl, current$
13323	CHICKEN	Meat, chicken, cents/kg, current$

Showing 1 to 10 of 10 entries

Previous1Next

18 / 30

Searching for indicators tied to multiple subjects is easy as well:

wbsearch(
  pattern = "poverty | unemployment | employment"
  ) -> povemply_vars 
DT::datatable(povemply_vars)

Show entries

Search:

	indicatorID	indicator
35	WP15177.9	Received government transfers in the past year, income, richest 60% (% ages 15+) [w2]
36	WP15177.8	Received government transfers in the past year, income, poorest 40% (% ages 15+) [w2]
37	WP15177.7	Received government transfers in the past year, secondary education or more (% ages 15+) [w2]
38	WP15177.6	Received government transfers in the past year, primary education or less (% ages 15+) [w2]
39	WP15177.5	Received government transfers in the past year, older adults (% ages 25+) [w2]
40	WP15177.4	Received government transfers in the past year, young adults (% ages 15-24) [w2]
41	WP15177.3	Received government transfers in the past year, female (% age 15+) [w2]
42	WP15177.2	Received government transfers in the past year, male (% age 15+) [w2]
43	WP15177.10	Received government transfers in the past year, rural (% age 15+) [w2]
44	WP15177.1	Received government transfers in the past year (% age 15+) [w2]

Showing 1 to 10 of 594 entries

Previous1 2 3 4 5…60Next

19 / 30

Once we identify what we want, downloading the data is easy as well, needing us to specify just the indicator(s) and then the start and end dates, and then specific country codes if you want data for specific countries. Below I am pulling total population.

wb(
  indicator = "SP.POP.TOTL",
  startdate = 1960,
  enddate = 2016
  ) -> pop_data1 
wb(
  country = c("ABW","AFG", "SSF", "ECA", "IND", "CHN"),
  indicator = "SP.POP.TOTL",
  startdate = 1960,
  enddate = 2016
  ) -> pop_data2 
wb(
  country = c("ABW","AFG", "SSF", "ECA", "IND", "CHN"),
  indicator = c("SP.POP.TOTL", "NY.GDP.MKTP.CD"), 
  startdate = 1960,
  enddate = 2016
  ) -> pop_data3 
DT::datatable(pop_data3)

Show entries

Search:

	iso3c	date	value	indicatorID	indicator	iso2c	country
1	ABW	2016	104872	SP.POP.TOTL	Population, total	AW	Aruba
2	ABW	2015	104341	SP.POP.TOTL	Population, total	AW	Aruba
3	ABW	2014	103774	SP.POP.TOTL	Population, total	AW	Aruba
4	ABW	2013	103159	SP.POP.TOTL	Population, total	AW	Aruba
5	ABW	2012	102560	SP.POP.TOTL	Population, total	AW	Aruba
6	ABW	2011	102046	SP.POP.TOTL	Population, total	AW	Aruba
7	ABW	2010	101669	SP.POP.TOTL	Population, total	AW	Aruba
8	ABW	2009	101455	SP.POP.TOTL	Population, total	AW	Aruba
9	ABW	2008	101358	SP.POP.TOTL	Population, total	AW	Aruba
10	ABW	2007	101222	SP.POP.TOTL	Population, total	AW	Aruba

Showing 1 to 10 of 610 entries

Previous1 2 3 4 5…61Next

20 / 30

wb(
  country = c("ABW","AFG", "SSF", "ECA", "IND", "CHN"),
  indicator = c("SP.POP.TOTL", "NY.GDP.MKTP.CD"), 
  startdate = 1960,
  enddate = 2016,
  return_wide = TRUE
  ) -> pop_data4 
DT::datatable(pop_data4)

Show entries

Search:

	iso3c	date	iso2c	country	SP.POP.TOTL
1	ABW	1960	AW	Aruba	54211
2	ABW	1961	AW	Aruba	55438
3	ABW	1962	AW	Aruba	56225
4	ABW	1963	AW	Aruba	56695
5	ABW	1964	AW	Aruba	57032
6	ABW	1965	AW	Aruba	57360
7	ABW	1966	AW	Aruba	57715
8	ABW	1967	AW	Aruba	58055
9	ABW	1968	AW	Aruba	58386
10	ABW	1969	AW	Aruba	58726

Showing 1 to 10 of 342 entries

Previous1 2 3 4 5…35Next

21 / 30

By default wb() will return the data in long format but not necessarily in a tidy format. If you want the data returned on call in a wide format, specify return_wide = TRUE and you will have tidy data.

If you will be working with dates, whether for plotting purposes or otherwise, then activate the POSIXct = TRUE switch. Otherwise you will have to do this manually.

wb(
  country = c("IND", "CHN"),
  indicator = c("SP.POP.TOTL"), 
  startdate = 1960,
  enddate = 2016,
  return_wide = TRUE, 
  POSIXct = TRUE
  ) -> pop_data5 
DT::datatable(pop_data5)

Show entries

Search:

	iso3c	date	iso2c	country	date_ct	granularity	SP.POP.TOTL
1	CHN	1960	CN	China	1960-01-01	annual	667070000
2	CHN	1961	CN	China	1961-01-01	annual	660330000
3	CHN	1962	CN	China	1962-01-01	annual	665770000
4	CHN	1963	CN	China	1963-01-01	annual	682335000
5	CHN	1964	CN	China	1964-01-01	annual	698355000
6	CHN	1965	CN	China	1965-01-01	annual	715185000
7	CHN	1966	CN	China	1966-01-01	annual	735400000
8	CHN	1967	CN	China	1967-01-01	annual	754550000
9	CHN	1968	CN	China	1968-01-01	annual	774510000
10	CHN	1969	CN	China	1969-01-01	annual	796025000

Showing 1 to 10 of 114 entries

Previous1 2 3 4 5…12Next

22 / 30

library(scales)
ggplot(pop_data5, aes(x  = date_ct, y = SP.POP.TOTL)) + 
  geom_line(aes(color = country)) +
  scale_y_continuous(labels = comma) + 
  scale_x_date(date_breaks = "10 years") +
  theme(legend.position = "bottom") + 
  labs(x = "Date",
       y = "Total Population") +
  theme_minimal()

23 / 30

USGS data with dataRetrieval 24 / 30

The dataRetrieval package gives you easy access to water data gathered and warehoused by the USGS, USDA, EPA, and other entities. The package has an excellent tutorial available here so I will not go into too many details and nuances here. Start by installing the package and then loading it.

library(dataRetrieval)

You will need to know the site(s) you are interested in, as well as the parameters and statistics of interest. The package comes with a built-in data-set that shows you the parameters available, and the complete list of statistics is available here. Sites can be located from this inventory.

parameterCdFile -> parameterCdFile 
names(parameterCdFile)

## [1] "parameter_cd"       "parameter_group_nm" "parameter_nm"       "casrn"             
## [5] "srsname"            "parameter_units"

25 / 30

If you are curious about a specific parameter, you can see what all is available for it. I'll look for anything related to the keyword storm, and also what parameter units are available.

parameterCdFile[
  grep("storm",
  parameterCdFile$parameter_nm,
  ignore.case =  TRUE), 
  ] -> stormq 
unique(stormq$parameter_units)

## [1] "nu"      "hours"   "minutes" "Mgal"    "ft3/s"   "mgd"     "in"

Let us do a quick grab of some data for the Hocking River at Athens, Ohio.

"03159500" -> siteNo 
"00065" -> pCode 
"2014-10-01" -> start.date 
"2018-02-26" -> end.date 
readNWISuv(
  siteNumbers = siteNo,
  parameterCd = pCode,
  startDate = start.date,
  endDate = end.date
  ) -> hocking

## Observations: 54,883
## Variables: 6
## $ agency_cd        <chr> "USGS", "USGS", "USGS", "USGS", "USGS", "USGS", "USGS", "USGS"…
## $ site_no          <chr> "03159500", "03159500", "03159500", "03159500", "03159500", "0…
## $ dateTime         <dttm> 2014-10-01 04:00:00, 2014-10-01 04:30:00, 2014-10-01 05:00:00…
## $ X_00065_00000    <dbl> 2.91, 2.91, 2.91, 2.91, 2.91, 2.91, 2.91, 2.91, 2.91, 2.91, 2.…
## $ X_00065_00000_cd <chr> "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A…
## $ tz_cd            <chr> "UTC", "UTC", "UTC", "UTC", "UTC", "UTC", "UTC", "UTC", "UTC",…

26 / 30

Since the column names are based on parameter codes and hence cryptic, you can clean them up, and also see other attributes embedded in the data-set.

names(hocking)

## [1] "agency_cd"        "site_no"          "dateTime"         "X_00065_00000"   
## [5] "X_00065_00000_cd" "tz_cd"

renameNWISColumns(hocking) -> hocking 
names(hocking)

## [1] "agency_cd"  "site_no"    "dateTime"   "GH_Inst"    "GH_Inst_cd" "tz_cd"

names(attributes(hocking))

## [1] "names"         "row.names"     "class"         "url"           "siteInfo"     
## [6] "variableInfo"  "disclaimer"    "statisticInfo" "queryTime"

27 / 30

If I wanted data for multiple sites, I could find the site numbers and then grab the data.

c("03158200", "03159246") -> sites 
"00065" -> pcode 
"2014-10-01" -> start.date 
"2018-02-26" -> end.date 
readNWISuv(
  siteNumbers = sites,
  parameterCd = pcode,
  startDate = start.date,
  endDate = end.date
  ) -> hocking2 
renameNWISColumns(hocking2) -> hocking2

28 / 30

Now a simple time-series plot of gage height for both sites. Note that although I asked for data going far back, not all sites have data for all time periods; it helps to check the site inventory first.

attr(hocking2, "variableInfo") -> parameterInfo 
ifelse(
  hocking2$site_no == "03158200",
  "Monday Creek at Doanville",
  "Sunday Creek Below Millfield"
  ) -> hocking2$station
as.Date(
  as.character(hocking2$dateTime),
  format  = "%Y-%m-%d"
  ) -> hocking2$mydates 
ggplot(
  data = hocking2,
  aes(x = mydates, y = GH_Inst, color = station)
  ) + 
  geom_line() +
  labs(
    x = "",
    y = parameterInfo$variableDescription) + 
  scale_x_date(date_breaks = "24 weeks") +
  theme_minimal() + 
  theme(legend.position = "bottom")

29 / 30

Find me at...

@aruhil
aniruhil.org
ruhil@ohio.edu

30 / 30

Help

Keyboard shortcuts

↑, ←, Pg Up, k

Go to previous slide

↓, →, Pg Dn, Space, j

Go to next slide

Home

Go to first slide

End

Go to last slide

Number + Return

Go to specific slide

b / m / f

Toggle blackout / mirrored / fullscreen mode

Clone slideshow

Toggle presenter mode

Restart the presentation timer

?, h

Toggle this help

Census and USGS Data

Ani Ruhil

Agenda

`tidycensus`

`censusapi`

World Bank data with `data360r` and `wbstats`

Legal age of marriage?

USGS data with `dataRetrieval`

Find me at...

Agenda

Help