Brief overview of useful Census and other data sources you may need
tidycensus
and censusapi
tidycensus
Kyle Walker has authored tidycensus
to ease working with Census data and it is built to work well with the tidyverse
and mapping libraries.
tidycensus
package library(tidycensus)library(tidyverse)census_api_key("INSERT YOUR API KEY HERE")
Two functions of primary value --
(1) get_decennial()
for decennial census data from 1990, 2000, and 2010
(2) get_acs()
for data from the American Community Survey (ACS) series
Typical starting point will be finding the table you want and an easy way to do this is by pulling a list of all tables available for a sopecific census product. Below you see the code for grabbing a list of the Summary File 1 tables from the 2010 decennial census, followed by the 1-year 2016 ACS and the 5-year ACS (2012-2016).
load_variables(year = 2010, dataset = "sf1", cache = TRUE) -> my.tables.d load_variables(year = 2018, dataset = "acs1", cache = TRUE) -> my.tables.a1 load_variables(year = 2018, dataset = "acs5", cache = TRUE) -> my.tables.a5
Each of these will show you what variables are available, and make it easier for you to choose the ones you want to work with. Assume we want total population, variable P0010001
, from the decennial census
get_decennial( year = 2010, geography = "state", variables = c(totpopn = "P001001") ) -> state.popn.d get_decennial( year = 2010, geography = "county", variables = c(totpopn = "P001001"), state = "OH" ) -> county.popn.d get_decennial( geography = "tract", variables = c(totpopn = "P001001"), state = "OH", county = "Athens" ) -> tract.popn.d
and then from the ACS. Note that the default for each is 2010 and the most recent ACS (2012-2016).
get_acs( year = 2018, geography = "state", variables = c(totpopn = "B01003_001") ) -> state.popn.acs get_acs( year = 2018, geography = "county", variables = c(totpopn = "B01003_001"), state = "OH" ) -> county.popn.acs get_acs( year = 2018, geography = "tract", variables = c(totpopn = "B01003_001"), state = "OH", county = "Athens" ) -> tract.popn.acs
You can also get an entire table
instead of having to list variables, such as shown below, with the default decennial census for the package's current version (2010).
get_decennial( year = 2010, geography = "county", table = "P012", summary_var = "P001001" ) -> county.table.d get_decennial( year = 2010, geography = "state", table = "P012", summary_var = "P001001" ) -> state.table.d
Population data for all Census tracts in the country
?
library(tidycensus)library(purrr)unique(fips_codes$state)[1:51] -> us map_df( us, function(x) { get_acs(geography = "tract", variables = "B01003_001", state = x, geometry = TRUE) } ) -> totalpop
censusapi
This package will allow you to grab a vast array of Census products, and very niftily as well I might add.
library(censusapi)listCensusApis() -> apis
Application Programming Interfaces (APIs) could have slightly varying parameters so it pays to check the API documentation available here.
It is easy to find variable names via the built-in function listCensusMetadata
. As an example, the bureau's small area health insurance estimates are shown below, as well as the small area income and poverty estimates.
listCensusMetadata( name = "timeseries/healthins/sahie", type = "variables" ) -> sahie_vars listCensusMetadata( name = "timeseries/poverty/saipe", type = "variables" ) -> saipe_vars
Want to see what geographies are available? Switch type =
to geography
:
listCensusMetadata( name = "timeseries/healthins/sahie", type = "geography" ) -> sahie_geos listCensusMetadata( name = "timeseries/poverty/saipe", type = "geography" ) -> saipe_geos
Grab the most recent county-level data but note that the latest SAHIE are for 2015 while the latest SAIPE are for 2016. Every variable is downloaded as a chr
so you will need to flip what should be numeric into numeric.
getCensus( name = "timeseries/healthins/sahie", vars = c("NAME", "IPRCAT", "IPR_DESC", "PCTUI_PT"), region = "county:*", regionin = "state:39", time = 2015, key = CENSUS_KEY ) -> sahie_counties head(sahie_counties, n = 12L)getCensus( name = "timeseries/poverty/saipe", vars = c("NAME", "SAEPOVRTALL_PT", "SAEMHI_PT"), region = "county:*", regionin = "state:39", time = 2016, key = CENSUS_KEY) -> saipe_counties head(saipe_counties, n = 12L)saipe_counties %>% mutate( prate = as.numeric(saipe_counties$SAEPOVRTALL_PT), mdhinc = as.numeric(saipe_counties$SAEMHI_PT) ) -> saipe_counties
If you want data for a lot of geographies, the package will let you do it in seconds. For example, if you want tract-level data for all tracts
, you can get it as shown below:
fipstracts <- NULLfor (f in fips) { stateget <- paste("state:", f, sep = "") temp <- getCensus(name = "sf3", vintage = 1990, vars = c("P0070001", "P0070002", "P114A001"), region = "tract:*", regionin = stateget, key = CENSUS_KEY) tracts <- rbind(tracts, temp)}head(tracts)
Notice that you had to specify the region
, and since tracts are nested within states, you had to add regionin = stateget
.
data360r
and wbstats
Two packages are available to access data gathered by the World Bank, data360r
and wbstats
. Install both packages, and then we can start with data360r
. We can start by seeing what indicators are available, some basic country-level information, and what data-sets are available.
library(data360r)#get all indicator metadata in Govdata360get_metadata360(site = "gov", metadata_type = "indicators") -> df_indicators #get all country metadata in TCdata360get_metadata360(metadata_type = 'countries') -> df_countries #get all dataset metadata in TCdata360get_metadata360(metadata_type = 'datasets') -> df_datasets
Once you have identified a particular table, note it's ID, and then pull it. Say I want indicator 90 = Can a married woman register a business in the same way as a married man?
I can do the same thing for more than one indicator by specifying the IDs.
get_data360(indicator_id = c(90)) -> df_ind90 get_data360(indicator_id = c(28130, 28131)) -> df_indtwo
If I only want all data for a specific country or just the indicators we pull in df_ind1
for a specific country you could do:
get_data360(country_iso3 = "IND") -> df_allone search_360("woman business", search_type = "indicator", limit_results = 5) -> df_ind1 get_data360(indicator_id = df_ind1$id, country_iso3 = "IND") -> df_allindtwo
Now an example with two measures -- legal age of marriage for boys and girls. Note that the package allows you to specify the long
format (preferred) than the default wide
format you see earlier results being returned in. Note also the use of timeframes = c()
that allows you to specify the specific time period you want the indicators for.
search_360("marriage", search_type = "indicator") -> marriage.indicatorsDT::datatable(marriage.indicators)
id | name | slug | type | score | dataset | redirect | |
---|---|---|---|---|---|---|---|
1 | 204 | What is the legal age of marriage for boys? | age.marr.male | indicator | 0.111111111111111 | Women, Business and the Law | false |
2 | 205 | What is the legal age of marriage for girls? | age.marr.fem | indicator | 0.111111111111111 | Women, Business and the Law | false |
3 | 206 | Are there any exceptions to the legal age of marriage? | marry.age.except | indicator | 0.1 | Women, Business and the Law | false |
4 | 207 | Does the law prohibit or invalidate child or early marriage? | chld.marr | indicator | 0.1 | Women, Business and the Law | false |
5 | 40083 | What is the minimum age of marriage with judicial authorization for girls? | h5e7b11fe | indicator | 0.0833333333333333 | Women, Business and the Law | false |
6 | 40080 | What is the minimum age of marriage with parental consent for boys? | haab6520b | indicator | 0.0833333333333333 | Women, Business and the Law | false |
7 | 40081 | What is the minimum age of marriage with parental consent for girls? | hdecd0041 | indicator | 0.0833333333333333 | Women, Business and the Law | false |
8 | 208 | Are there penalties in the law for authorizing or knowingly entering into the child or early marriage?' | hb9d0a754 | indicator | 0.0588235294117647 | Women, Business and the Law | false |
get_data360(indicator_id = c(204, 205), timeframes = c(2016), output_type = 'long' ) -> df_marriage ggplot(df_marriage, aes(x = Observation, group = Indicator, fill = Indicator)) + geom_bar() + theme(legend.position = "none") + facet_wrap(~ Indicator, ncol = 1) + labs( x = "Legal Age of Marriage", y = "Frequency", title = "Legal Age of Marriage", subtitle = " for Boys vs. Girls (2016)", caption = "Source: World Bank Data")
The wbstats
package does pretty much the same thing. Let us see the core functionality by loading the library and then seeing what is available in terms of indicators, topics, and so on. We can then set the most current list of information in wb_cachelist
to be used via new_cache
. Doing so speeds up the operations and ensures that you are getting the most uptodate data.
library(wbstats)str(wb_cachelist, max.level = 1)
## List of 7## $ countries :'data.frame': 304 obs. of 18 variables:## $ indicators :'data.frame': 16978 obs. of 7 variables:## $ sources :'data.frame': 43 obs. of 8 variables:## $ datacatalog:'data.frame': 238 obs. of 29 variables:## $ topics :'data.frame': 21 obs. of 3 variables:## $ income :'data.frame': 7 obs. of 3 variables:## $ lending :'data.frame': 4 obs. of 3 variables:
new_cache <- wbcache()
What indicators are available?
wbsearch(pattern = "corruption") -> corruption_vars DT::datatable(corruption_vars)
indicatorID | indicator | |
---|---|---|
6106 | IC.CNS.CORR.ZS | Corruption (% of managers surveyed ranking this as a major constraint) |
6858 | IC.FRM.ELEC.ZS | Electricity (% of firms identifying this as a major constraint) |
6873 | IC.FRM.CORR.GRAFT2 | Bribery index (% of gift or informal payment requests during public transactions) |
6874 | IC.FRM.CORR.CRIME9 | Percent of firms identifying the courts system as a major constraint |
6875 | IC.FRM.CORR.CORR9 | Percent of firms expected to give gifts to get an import license |
6876 | IC.FRM.CORR.CORR8 | Percent of firms expected to give gifts to get a construction permit |
6877 | IC.FRM.CORR.CORR7 | Percent of firms expected to give gifts to get a water connection |
6878 | IC.FRM.CORR.CORR6 | Percent of firms expected to give gifts to get an electrical connection |
6879 | IC.FRM.CORR.CORR4 | Percent of firms expected to give gifts to public officials "to get things done" |
6880 | IC.FRM.CORR.CORR3 | Value of gift expected to secure a government contract (% of contract value) |
If I want information from a particular source, say Bloomberg,
wbsearch( pattern = "Bloomberg", fields = "sourceOrg" ) -> blmbrg_vars DT::datatable(blmbrg_vars)
indicatorID | indicator | |
---|---|---|
262 | WHEAT_US_HRW | Wheat, US, HRW, $/mt, current$ |
766 | SUGAR_US | Sugar, US, cents/kg, current$ |
2563 | RUBBER1_MYSG | Rubber, Singapore, cents/kg, current$ |
9488 | GFDD.SM.01 | Stock price volatility |
9496 | GFDD.OM.02 | Stock market return (%, year-on-year) |
12003 | BARLEY | Barley, $/mt, current$ |
13295 | CRUDE_WTI | Crude oil, WTI, $/bbl, current$ |
13297 | CRUDE_DUBAI | Crude oil, Dubai, $/bbl, current$ |
13298 | CRUDE_BRENT | Crude oil, Brendt, $/bbl, current$ |
13323 | CHICKEN | Meat, chicken, cents/kg, current$ |
Searching for indicators tied to multiple subjects is easy as well:
wbsearch( pattern = "poverty | unemployment | employment" ) -> povemply_vars DT::datatable(povemply_vars)
indicatorID | indicator | |
---|---|---|
35 | WP15177.9 | Received government transfers in the past year, income, richest 60% (% ages 15+) [w2] |
36 | WP15177.8 | Received government transfers in the past year, income, poorest 40% (% ages 15+) [w2] |
37 | WP15177.7 | Received government transfers in the past year, secondary education or more (% ages 15+) [w2] |
38 | WP15177.6 | Received government transfers in the past year, primary education or less (% ages 15+) [w2] |
39 | WP15177.5 | Received government transfers in the past year, older adults (% ages 25+) [w2] |
40 | WP15177.4 | Received government transfers in the past year, young adults (% ages 15-24) [w2] |
41 | WP15177.3 | Received government transfers in the past year, female (% age 15+) [w2] |
42 | WP15177.2 | Received government transfers in the past year, male (% age 15+) [w2] |
43 | WP15177.10 | Received government transfers in the past year, rural (% age 15+) [w2] |
44 | WP15177.1 | Received government transfers in the past year (% age 15+) [w2] |
Once we identify what we want, downloading the data is easy as well, needing us to specify just the indicator(s) and then the start and end dates, and then specific country codes if you want data for specific countries. Below I am pulling total population.
wb( indicator = "SP.POP.TOTL", startdate = 1960, enddate = 2016 ) -> pop_data1 wb( country = c("ABW","AFG", "SSF", "ECA", "IND", "CHN"), indicator = "SP.POP.TOTL", startdate = 1960, enddate = 2016 ) -> pop_data2 wb( country = c("ABW","AFG", "SSF", "ECA", "IND", "CHN"), indicator = c("SP.POP.TOTL", "NY.GDP.MKTP.CD"), startdate = 1960, enddate = 2016 ) -> pop_data3 DT::datatable(pop_data3)
iso3c | date | value | indicatorID | indicator | iso2c | country | |
---|---|---|---|---|---|---|---|
1 | ABW | 2016 | 104872 | SP.POP.TOTL | Population, total | AW | Aruba |
2 | ABW | 2015 | 104341 | SP.POP.TOTL | Population, total | AW | Aruba |
3 | ABW | 2014 | 103774 | SP.POP.TOTL | Population, total | AW | Aruba |
4 | ABW | 2013 | 103159 | SP.POP.TOTL | Population, total | AW | Aruba |
5 | ABW | 2012 | 102560 | SP.POP.TOTL | Population, total | AW | Aruba |
6 | ABW | 2011 | 102046 | SP.POP.TOTL | Population, total | AW | Aruba |
7 | ABW | 2010 | 101669 | SP.POP.TOTL | Population, total | AW | Aruba |
8 | ABW | 2009 | 101455 | SP.POP.TOTL | Population, total | AW | Aruba |
9 | ABW | 2008 | 101358 | SP.POP.TOTL | Population, total | AW | Aruba |
10 | ABW | 2007 | 101222 | SP.POP.TOTL | Population, total | AW | Aruba |
wb( country = c("ABW","AFG", "SSF", "ECA", "IND", "CHN"), indicator = c("SP.POP.TOTL", "NY.GDP.MKTP.CD"), startdate = 1960, enddate = 2016, return_wide = TRUE ) -> pop_data4 DT::datatable(pop_data4)
By default wb()
will return the data in long format but not necessarily in a tidy format. If you want the data returned on call in a wide format, specify return_wide = TRUE
and you will have tidy data.
If you will be working with dates, whether for plotting purposes or otherwise, then activate the POSIXct = TRUE
switch. Otherwise you will have to do this manually.
wb( country = c("IND", "CHN"), indicator = c("SP.POP.TOTL"), startdate = 1960, enddate = 2016, return_wide = TRUE, POSIXct = TRUE ) -> pop_data5 DT::datatable(pop_data5)
iso3c | date | iso2c | country | date_ct | granularity | SP.POP.TOTL | |
---|---|---|---|---|---|---|---|
1 | CHN | 1960 | CN | China | 1960-01-01 | annual | 667070000 |
2 | CHN | 1961 | CN | China | 1961-01-01 | annual | 660330000 |
3 | CHN | 1962 | CN | China | 1962-01-01 | annual | 665770000 |
4 | CHN | 1963 | CN | China | 1963-01-01 | annual | 682335000 |
5 | CHN | 1964 | CN | China | 1964-01-01 | annual | 698355000 |
6 | CHN | 1965 | CN | China | 1965-01-01 | annual | 715185000 |
7 | CHN | 1966 | CN | China | 1966-01-01 | annual | 735400000 |
8 | CHN | 1967 | CN | China | 1967-01-01 | annual | 754550000 |
9 | CHN | 1968 | CN | China | 1968-01-01 | annual | 774510000 |
10 | CHN | 1969 | CN | China | 1969-01-01 | annual | 796025000 |
library(scales)ggplot(pop_data5, aes(x = date_ct, y = SP.POP.TOTL)) + geom_line(aes(color = country)) + scale_y_continuous(labels = comma) + scale_x_date(date_breaks = "10 years") + theme(legend.position = "bottom") + labs(x = "Date", y = "Total Population") + theme_minimal()
dataRetrieval
The dataRetrieval
package gives you easy access to water data gathered and warehoused by the USGS, USDA, EPA, and other entities. The package has an excellent tutorial available here so I will not go into too many details and nuances here. Start by installing the package and then loading it.
library(dataRetrieval)
You will need to know the site(s) you are interested in, as well as the parameters and statistics of interest. The package comes with a built-in data-set that shows you the parameters available, and the complete list of statistics is available here. Sites can be located from this inventory.
parameterCdFile -> parameterCdFile names(parameterCdFile)
## [1] "parameter_cd" "parameter_group_nm" "parameter_nm" "casrn" ## [5] "srsname" "parameter_units"
If you are curious about a specific parameter, you can see what all is available for it. I'll look for anything related to the keyword storm
, and also what parameter units are available.
parameterCdFile[ grep("storm", parameterCdFile$parameter_nm, ignore.case = TRUE), ] -> stormq unique(stormq$parameter_units)
## [1] "nu" "hours" "minutes" "Mgal" "ft3/s" "mgd" "in"
Let us do a quick grab of some data for the Hocking River at Athens, Ohio.
"03159500" -> siteNo "00065" -> pCode "2014-10-01" -> start.date "2018-02-26" -> end.date readNWISuv( siteNumbers = siteNo, parameterCd = pCode, startDate = start.date, endDate = end.date ) -> hocking
## Observations: 54,883## Variables: 6## $ agency_cd <chr> "USGS", "USGS", "USGS", "USGS", "USGS", "USGS", "USGS", "USGS"…## $ site_no <chr> "03159500", "03159500", "03159500", "03159500", "03159500", "0…## $ dateTime <dttm> 2014-10-01 04:00:00, 2014-10-01 04:30:00, 2014-10-01 05:00:00…## $ X_00065_00000 <dbl> 2.91, 2.91, 2.91, 2.91, 2.91, 2.91, 2.91, 2.91, 2.91, 2.91, 2.…## $ X_00065_00000_cd <chr> "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A…## $ tz_cd <chr> "UTC", "UTC", "UTC", "UTC", "UTC", "UTC", "UTC", "UTC", "UTC",…
Since the column names are based on parameter codes and hence cryptic, you can clean them up, and also see other attributes embedded in the data-set.
names(hocking)
## [1] "agency_cd" "site_no" "dateTime" "X_00065_00000" ## [5] "X_00065_00000_cd" "tz_cd"
renameNWISColumns(hocking) -> hocking names(hocking)
## [1] "agency_cd" "site_no" "dateTime" "GH_Inst" "GH_Inst_cd" "tz_cd"
names(attributes(hocking))
## [1] "names" "row.names" "class" "url" "siteInfo" ## [6] "variableInfo" "disclaimer" "statisticInfo" "queryTime"
If I wanted data for multiple sites, I could find the site numbers and then grab the data.
c("03158200", "03159246") -> sites "00065" -> pcode "2014-10-01" -> start.date "2018-02-26" -> end.date readNWISuv( siteNumbers = sites, parameterCd = pcode, startDate = start.date, endDate = end.date ) -> hocking2 renameNWISColumns(hocking2) -> hocking2
Now a simple time-series plot of gage height
for both sites. Note that although I asked for data going far back, not all sites have data for all time periods; it helps to check the site inventory first.
attr(hocking2, "variableInfo") -> parameterInfo ifelse( hocking2$site_no == "03158200", "Monday Creek at Doanville", "Sunday Creek Below Millfield" ) -> hocking2$stationas.Date( as.character(hocking2$dateTime), format = "%Y-%m-%d" ) -> hocking2$mydates ggplot( data = hocking2, aes(x = mydates, y = GH_Inst, color = station) ) + geom_line() + labs( x = "", y = parameterInfo$variableDescription) + scale_x_date(date_breaks = "24 weeks") + theme_minimal() + theme(legend.position = "bottom")
Brief overview of useful Census and other data sources you may need
tidycensus
and censusapi
Keyboard shortcuts
↑, ←, Pg Up, k | Go to previous slide |
↓, →, Pg Dn, Space, j | Go to next slide |
Home | Go to first slide |
End | Go to last slide |
Number + Return | Go to specific slide |
b / m / f | Toggle blackout / mirrored / fullscreen mode |
c | Clone slideshow |
p | Toggle presenter mode |
t | Restart the presentation timer |
?, h | Toggle this help |
Esc | Back to slideshow |