Using APIs from the Census Bureau & other sources

Most of us end up having to work with Census data on a regular basis. In the past, we used American FactFinder (replaced by data.census.gov) or the Inter-university Consortium for Political and Social Research (ICPSR) databases to grab the data we needed. Mercifully, with the opening up of a lot of government data, accessing Census data has become very easy. There are several packages that allow you to do so, but I will initially focus on two – {tidycensus} and {censusapi} – given their ease of use. In addition to these packages, I will show you how to work with APIs to get data from the World Bank, the U.S. Geological Survey (USGS), and other sources. Please also be sure to look at IPUMS.


Author

Ani Ruhil

Published

March 23, 2026

Citation

Ruhil, 2026



Gathering Census Data with {tidycensus}

Let us start with {tidycensus}. As usual, we install the package and, since we will need it, get a Census API key. You can sign up for the key here. Check your email, and save the key somewhere you will remember. Then load {tidycensus} and register the key.

library(tidycensus)
library(tidyverse)
census_api_key("YOUR API KEY GOES HERE", install = TRUE)

Your key will be saved to .Renviron and automatically used in future sessions with {tidycensus}, so be sure not to use the install = TRUE switch on subsequent runs; otherwise you will be asked every time whether you want to overwrite the previous setting in .Renviron.
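In a later session you can confirm the key is available without reinstalling it; {tidycensus} reads it from the CENSUS_API_KEY environment variable that census_api_key() wrote to .Renviron:

Sys.getenv("CENSUS_API_KEY")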

The two data-streams of interest will be either the decennial census or, more likely, the regularly updated American Community Survey (ACS) series. You can list the variables available in each with specific commands:

load_variables(
  year = 2016, 
  dataset = "acs5", 
  cache = TRUE
  ) -> my.vars.acs 

What if I am interested in Summary File 1 from 2010 or Summary File 3 from 2000? (There was no Summary File 3 generated for 2010.)

load_variables(
  year = 2010, 
  dataset = "sf1", 
  cache = TRUE
  ) -> my.vars.d.sf1 

load_variables(
  year = 2000, 
  dataset = "sf3", 
  cache = TRUE
  ) -> my.vars.d.sf3 
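Each object returned by load_variables() is a tibble with name, label, and concept columns, so you can filter it like any other tibble. A quick sketch, assuming the my.vars.acs object created above (the search string is just an example):

library(dplyr)

# Find ACS variables whose label mentions median household income
my.vars.acs %>%
  filter(grepl("median household income", label, ignore.case = TRUE)) %>%
  select(name, label)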

Each of these will show you what variables are available, making it easier to choose the ones you want to work with. Assume we want total population, variable P001001, from the decennial census:

get_decennial(
  geography = "state", 
  variables = c(totpopn = "P001001"),
  year = 2000
  ) -> state.popn.d 

get_decennial(
  geography = "county", 
  variables = c(totpopn = "P001001"),
  state = "OH",
  year = 2000
  ) -> county.popn.d 

get_decennial(
  geography = "tract", 
  variables = c(totpopn = "P001001"),
  state = "OH",
  county = "Athens",
  year = 2000
  ) -> tract.popn.d 

and then from the ACS. Note that, if you omit year, get_decennial() defaults to 2020 and get_acs() to the most recent five-year ACS (2020-2024).

get_acs(
  geography = "state", 
  variables = c(totpopn = "B01003_001"),
  year = 2024
  ) -> state.popn.acs 

get_acs(
  geography = "county", 
  variables = c(totpopn = "B01003_001"),
  state = "OH",
  year = 2024
  ) -> county.popn.acs 

get_acs(
  geography = "tract", 
  variables = c(totpopn = "B01003_001"),
  state = "OH",
  county = "Athens",
  year = 2024
  ) -> tract.popn.acs 

You can also get an entire table instead of having to list variables, as shown below for the 2000 decennial census.

get_decennial(
  geography = "county",
  table = c("P012"),
  year = 2000
  ) -> county.table.d 

get_decennial(
  geography = "state",
  table = c("P012"),
  year = 2000
  ) -> state.table.d 

Population data for all tracts in the country?

library(tidycensus)
library(purrr)

unique(fips_codes$state)[1:51] -> us 

map_df(us, function(x) {
  get_acs(
    geography = "tract",
    variables = "B01003_001", 
    state = x,
    year = 2024,
    geometry = TRUE
  )
  }
  ) -> totalpop 

Often you want to convert some measure into a percent. You can do that either by also downloading the denominator, or by leaning on the built-in summary_var argument (as shown below). In this example, I am grabbing the B02001 table for all tracts in Cuyahoga County, Ohio.

get_acs(
  "tract",
  county = "Cuyahoga",
  state = "Ohio",
  year = 2024,
  table = "B02001",
  summary_var = "B02001_001",
  geometry = TRUE
  ) -> cuy

Now I can run mutate(...) to convert each estimate into its corresponding percentage.

cuy %>%
  mutate(
    pct = round(100 * (estimate / summary_est), 2)
  ) -> cuy

The White alone count is B02001_002 and the Black alone count is B02001_003. Say I wanted to map just these two groups across the tracts.

cuy %>%
  filter(
    variable %in% c('B02001_002', 'B02001_003')
    ) %>%
  mutate(
    popgroup = case_when(
      variable == "B02001_002" ~ 'White Alone',
      variable == "B02001_003" ~ 'Black Alone'
      )
    ) -> cuy_df

cuy_df %>%
  ggplot() +
  geom_sf(
    aes(fill = pct)
    ) +
  facet_wrap(~ popgroup) +
  theme_void() +
  theme(legend.position = 'top') +
  scale_fill_viridis_c(option = "rocket") +
  labs(
    fill = "Percent"
    )

I threw in the mapping here to show how easy it would be to pipe through from getting data to visualizing it.
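For instance, the whole sequence could be written as one pipeline. This is a sketch under the same assumptions as above (Cuyahoga County tracts, table B02001, the corrected percent arithmetic), not a new result:

get_acs(
  "tract",
  county = "Cuyahoga",
  state = "Ohio",
  year = 2024,
  table = "B02001",
  summary_var = "B02001_001",
  geometry = TRUE
  ) %>%
  filter(variable %in% c("B02001_002", "B02001_003")) %>%
  mutate(pct = round(100 * (estimate / summary_est), 2)) %>%
  ggplot() +
  geom_sf(aes(fill = pct)) +
  facet_wrap(~ variable) +
  theme_void()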

Gathering Census Data with {censusapi}

This package will allow you to grab a vast array of Census products, and very easily I might add.

library(censusapi)
listCensusApis() -> apis_df 

Application Programming Interfaces (APIs) can have slightly varying parameters, so it pays to check the API documentation available here.

It is easy to find variable names via the built-in function listCensusMetadata. As an example, the bureau’s small area health insurance estimates are shown below, as well as the small area income and poverty estimates.

listCensusMetadata(
  name = "timeseries/healthins/sahie", 
  type = "variables"
  ) -> sahie_vars 

listCensusMetadata(
  name = "timeseries/poverty/saipe", 
  type = "variables"
  ) -> saipe_vars 

Curious about available geographies? Switch type = "variables" to type = "geography":

listCensusMetadata(
  name = "timeseries/healthins/sahie", 
  type = "geography"
  ) -> sahie_geos 

listCensusMetadata(
  name = "timeseries/poverty/saipe", 
  type = "geography"
  ) -> saipe_geos 

Grab the most recent county-level data, but note that the latest SAHIE estimates are for 2023 while the latest SAIPE estimates are for 2024. Every variable is returned as character (chr), so you will need to convert the numeric fields with as.numeric().

getCensus(
  name = "timeseries/healthins/sahie",
  vars = c("NAME", "IPRCAT", "IPR_DESC", "PCTUI_PT"), 
  region = "county:*",
  regionin = "state:39",
  time = 2023
  ) -> sahie_counties 

head(sahie_counties, n = 12L)

getCensus(
  name = "timeseries/poverty/saipe",
  vars = c("NAME", "SAEPOVRTALL_PT", "SAEMHI_PT"), 
  region = "county:*",
  regionin = "state:39",
  time = 2024
  ) -> saipe_counties 

head(saipe_counties, n = 12L)

saipe_counties %>%
  mutate(
    prate = as.numeric(SAEPOVRTALL_PT),
    mdhinc = as.numeric(SAEMHI_PT)
    ) -> saipe_counties
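With the numeric versions in place, a quick sanity check is easy; for example, listing the counties with the highest poverty rates (a sketch using the saipe_counties object created above):

saipe_counties %>%
  arrange(desc(prate)) %>%
  select(NAME, prate, mdhinc) %>%
  head()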

If you want data for a lot of geographies, the package makes that straightforward. For example, if you want data for all tracts in the country, you can loop over the state FIPS codes as shown below:

unique(tidycensus::fips_codes$state_code)[1:51] -> fips 

fips
 [1] "01" "02" "04" "05" "06" "08" "09" "10" "11" "12" "13" "15" "16"
[14] "17" "18" "19" "20" "21" "22" "23" "24" "25" "26" "27" "28" "29"
[27] "30" "31" "32" "33" "34" "35" "36" "37" "38" "39" "40" "41" "42"
[40] "44" "45" "46" "47" "48" "49" "50" "51" "53" "54" "55" "56"
tracts <- NULL

for (f in fips) {
    stateget <- paste("state:", f, sep = "")
    temp <- getCensus(
      name = "dec/sf1",
      vintage  = 2010,
      vars = c("NAME", "P001001", "H010001"),
      region = "tract:*",
      regionin = stateget
      )
    tracts <- rbind(tracts, temp)
    }

head(tracts)
  state county  tract                                      NAME
1    01    001 020100 Census Tract 201, Autauga County, Alabama
2    01    001 020500 Census Tract 205, Autauga County, Alabama
3    01    001 020300 Census Tract 203, Autauga County, Alabama
4    01    001 020400 Census Tract 204, Autauga County, Alabama
5    01    001 020200 Census Tract 202, Autauga County, Alabama
6    01    001 020600 Census Tract 206, Autauga County, Alabama
  P001001 H010001
1    1912    1912
2   10766   10585
3    3373    3373
4    4386    4386
5    2170    1989
6    3668    3668

Notice that you had to specify the region, and since tracts are nested within states, you had to add regionin = stateget.

World Bank Data

A few packages are available to access data gathered by the World Bank: {data360r} (which appears moribund at the moment, so none of its code is executed here), {wbstats}, and {WDI}. Install {WDI} and then we can explore how it works.

# remotes::install_github('vincentarelbundock/WDI')

library(WDI)

Let us search for all indicators that have to do with GDP.

WDIsearch("gdp") -> gdp.table

Too many to work with. So let us narrow it down to every indicator that is measured on a per capita basis.

WDIsearch("gdp.*per.*capita") -> gdp.table

Say we want a specific indicator, NY.GDP.PCAP.PP.KD:

WDI(
  indicator = 'NY.GDP.PCAP.PP.KD', 
  country = c('MX','CA','US', 'CN', 'GR'),
  start = 1990,
  end = 2024
  ) -> wdi.df

Now we plot it.

wdi.df %>%
  ggplot() +
  geom_line(
    aes(
      x = year,
      y = NY.GDP.PCAP.PP.KD,
      color = country
    )
  ) +
  labs(
    x = "Year",
    y = "GDP per capita in 2021 PPP US Dollars",
    color = ""
    ) + 
  hrbrthemes::theme_ft_rc() +
  theme(legend.position = "bottom")

The list of indicators is cached locally so be sure to update it regularly as the API updates.

WDIcache() -> WDI_data

#WDIsearch(
#  'gdp',
#  cache = WDI_data
#  )

Once you have identified a particular table, note its ID, and then pull it. Say I want indicator 90 = "Can a married woman register a business in the same way as a married man?" I can do the same thing for more than one indicator by specifying the IDs.

If I only want all data for a specific country, or just the indicators we pulled in df_ind1 for a specific country, that is possible as well.

Now an example with two measures – the legal age of marriage for boys and girls. Note that the package allows you to request the long format (preferred) rather than the default wide format the earlier results were returned in. Note also the timeframes = c() argument, which lets you specify the time period you want the indicators for.
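These examples relied on {data360r}, whose code is not executed here. Purely as a hedged sketch, the calls looked roughly like the following; get_data360() and its indicator_id, country_iso3, timeframes, and output_type arguments come from that package's documentation, and the code is not run:

# Not executed: {data360r} may no longer install cleanly.
library(data360r)

# Indicator 90: can a married woman register a business
# in the same way as a married man?
get_data360(indicator_id = 90) -> df_ind1

# The same idea for one country, in long format, for a specific year:
get_data360(
  indicator_id = c(90),
  country_iso3 = "IND",
  timeframes = c(2016),
  output_type = "long"
  ) -> df_ind2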

The {wbstats} package does pretty much the same thing. Let us see the core functionality by loading the library and seeing what is available in terms of indicators, topics, and so on. We can then store the most current version of the information held in wb_cachelist via new_cache. Doing so speeds up operations and ensures that you are getting the latest data.

library(wbstats)

str(wb_cachelist, max.level = 1)
List of 8
 $ countries    : tibble [297 × 18] (S3: tbl_df/tbl/data.frame)
 $ indicators   : tibble [16,643 × 8] (S3: tbl_df/tbl/data.frame)
 $ sources      : tibble [62 × 9] (S3: tbl_df/tbl/data.frame)
 $ topics       : tibble [21 × 3] (S3: tbl_df/tbl/data.frame)
 $ regions      : tibble [42 × 4] (S3: tbl_df/tbl/data.frame)
 $ income_levels: tibble [7 × 3] (S3: tbl_df/tbl/data.frame)
 $ lending_types: tibble [4 × 3] (S3: tbl_df/tbl/data.frame)
 $ languages    : tibble [23 × 3] (S3: tbl_df/tbl/data.frame)
wb_cache() -> new_cache 

What indicators are available on the topic of corruption?

wb_search(pattern = "corruption") -> corruption_vars 

head(corruption_vars)
# A tibble: 6 × 3
  indicator_id     indicator                            indicator_desc
  <chr>            <chr>                                <chr>         
1 CC.EST           Control of Corruption: Estimate      "Control of C…
2 CC.NO.SRC        Control of Corruption: Number of So… "Control of C…
3 CC.PER.RNK       Control of Corruption: Percentile R… "Control of C…
4 CC.PER.RNK.LOWER Control of Corruption: Percentile R… "Control of C…
5 CC.PER.RNK.UPPER Control of Corruption: Percentile R… "Control of C…
6 CC.STD.ERR       Control of Corruption: Standard Err… "Control of C…

If I want information from a particular source, say from Bloomberg,

wb_search(
  pattern = "Bloomberg", 
  fields = "source_org"
  ) -> blmbrg_vars 

head(blmbrg_vars)
# A tibble: 2 × 3
  indicator_id indicator                             indicator_desc   
  <chr>        <chr>                                 <chr>            
1 GFDD.OM.02   Stock market return (%, year-on-year) Stock market ret…
2 GFDD.SM.01   Stock price volatility                Stock price vola…

Searching for indicators tied to multiple subjects is easy as well:

wb_search(
  pattern = "poverty | unemployment | employment"
  ) -> povemply_vars

head(povemply_vars)
# A tibble: 6 × 3
  indicator_id         indicator                        indicator_desc
  <chr>                <chr>                            <chr>         
1 1.0.HCount.1.90usd   Poverty Headcount ($1.90 a day)  The poverty h…
2 1.0.HCount.2.5usd    Poverty Headcount ($2.50 a day)  The poverty h…
3 1.0.HCount.Mid10to50 Middle Class ($10-50 a day) Hea… The poverty h…
4 1.0.HCount.Ofcl      Official Moderate Poverty Rate-… The poverty h…
5 1.0.HCount.Poor4uds  Poverty Headcount ($4 a day)     The poverty h…
6 1.0.HCount.Vul4to10  Vulnerable ($4-10 a day) Headco… The poverty h…

Once we identify what we want, downloading the data is easy as well: we just specify the indicator(s), the start and end dates, and specific country codes if we want data only for particular countries. Below I am pulling total population.

wb_data(
  indicator = "SP.POP.TOTL", 
  start_date = 1960, 
  end_date = 2016
  ) -> pop_data1 

wb_data(
  country = c("ABW","AF", "SSF", "ECA", "IND", "CHN"),
  indicator = "SP.POP.TOTL",
  start_date = 1960,
  end_date = 2016
  ) -> pop_data2 

wb_data(
  country = c("ABW","AF", "SSF", "ECA", "IND", "CHN"),
  indicator = c("SP.POP.TOTL", "NY.GDP.MKTP.CD"), 
  start_date = 1960,
  end_date = 2016
  ) -> pop_data3 

head(pop_data3)
# A tibble: 6 × 6
  iso2c iso3c country  date NY.GDP.MKTP.CD SP.POP.TOTL
  <chr> <chr> <chr>   <dbl>          <dbl>       <dbl>
1 AW    ABW   Aruba    1960             NA       54208
2 AW    ABW   Aruba    1961             NA       55434
3 AW    ABW   Aruba    1962             NA       56234
4 AW    ABW   Aruba    1963             NA       56699
5 AW    ABW   Aruba    1964             NA       57029
6 AW    ABW   Aruba    1965             NA       57357
wb_data(
  country = c("ABW","AF", "SSF", "ECA", "IND", "CHN"),
  indicator = c("SP.POP.TOTL", "NY.GDP.MKTP.CD"), 
  start_date = 1960, 
  end_date = 2016, 
  return_wide = TRUE
  ) -> pop_data4 

head(pop_data4)
# A tibble: 6 × 6
  iso2c iso3c country  date NY.GDP.MKTP.CD SP.POP.TOTL
  <chr> <chr> <chr>   <dbl>          <dbl>       <dbl>
1 AW    ABW   Aruba    1960             NA       54208
2 AW    ABW   Aruba    1961             NA       55434
3 AW    ABW   Aruba    1962             NA       56234
4 AW    ABW   Aruba    1963             NA       56699
5 AW    ABW   Aruba    1964             NA       57029
6 AW    ABW   Aruba    1965             NA       57357

Note that wb_data() returns the data in a wide format by default (return_wide = TRUE), with one column per indicator – which is why the two outputs above are identical. If you prefer the long format, specify return_wide = FALSE.

If you will be working with dates, whether for plotting purposes or otherwise, activate the date_as_class_date = TRUE switch. Otherwise you will have to format the date variables manually.

wb_data(
  country = c("IND", "CHN"),
  indicator = c("SP.POP.TOTL"), 
  start_date = 1960, 
  end_date = 2016, 
  return_wide = TRUE, 
  date_as_class_date = TRUE
  ) -> pop_data5 

head(pop_data5)
# A tibble: 6 × 10
  iso2c iso3c country date       SP.POP.TOTL unit  obs_status footnote
  <chr> <chr> <chr>   <date>           <dbl> <chr> <chr>      <chr>   
1 CN    CHN   China   1960-01-01   667070000 <NA>  <NA>       <NA>    
2 CN    CHN   China   1961-01-01   660330000 <NA>  <NA>       <NA>    
3 CN    CHN   China   1962-01-01   665770000 <NA>  <NA>       <NA>    
4 CN    CHN   China   1963-01-01   682335000 <NA>  <NA>       <NA>    
5 CN    CHN   China   1964-01-01   698355000 <NA>  <NA>       <NA>    
6 CN    CHN   China   1965-01-01   715185000 <NA>  <NA>       <NA>    
# … with 2 more variables: last_updated <date>, obs_resolution <chr>
library(scales)

ggplot(
  pop_data5, 
  aes(
    x  = date, 
    y = SP.POP.TOTL)
  ) + 
  geom_line(
    aes(
      color = country
      )
    ) +
  scale_y_continuous(labels = comma) + 
  scale_x_date(date_breaks = "10 years") +
  theme(legend.position = "bottom") + 
  labs(
    x = "Date", 
    y = "Total Population"
    ) +
  theme_minimal()

If you want the most recent data, use the mrv switch; the number indicates how many of the most recent values you would like to pull. Note that this will also return rows for which there are no recent values. If you wish to avoid that, use mrnev instead.

wb_data(
  country = "all",
  indicator = c("SP.POP.TOTL"), 
  mrv = 1,
  return_wide = TRUE, 
  date_as_class_date = TRUE
  ) -> pop_data6 

head(pop_data6)
# A tibble: 6 × 10
  iso2c iso3c country date       SP.POP.TOTL unit  obs_status footnote
  <chr> <chr> <chr>   <date>           <dbl> <chr> <chr>      <chr>   
1 AW    ABW   Aruba   2024-01-01      107995 <NA>  <NA>       <NA>    
2 <NA>  AFE   Africa… 2024-01-01   769280888 <NA>  <NA>       <NA>    
3 AF    AFG   Afghan… 2024-01-01    42647492 <NA>  <NA>       <NA>    
4 <NA>  AFW   Africa… 2024-01-01   521764076 <NA>  <NA>       <NA>    
5 AO    AGO   Angola  2024-01-01    37885849 <NA>  <NA>       <NA>    
6 AL    ALB   Albania 2024-01-01     2377128 <NA>  <NA>       Arithme…
# ℹ 2 more variables: last_updated <date>, obs_resolution <chr>

Eritrea has missing data. To avoid this we could have done the following (the code below is not executed) …

wb_data(
  country = "all",
  indicator = c("SP.POP.TOTL"), 
  mrnev = 1,
  freq = "Y",
  return_wide = TRUE, 
  date_as_class_date = TRUE
  ) -> pop_data7 

head(pop_data7)

We could have also asked for data by countries only, administrative regions, and so on.

wb_data(
  country = "regions_only",
  indicator = c("SP.POP.TOTL"), 
  mrv = 1
  ) -> regions

wb_data(
  country = "admin_regions_only",
  indicator = c("SP.POP.TOTL"), 
  mrv = 1
  ) -> admin_regions

wb_data(
  country = "income_levels_only",
  indicator = c("SP.POP.TOTL"), 
  mrv = 1
  ) -> income_levels

wb_data(
  country = "lending_types_only",
  indicator = c("SP.POP.TOTL"), 
  mrv = 1
  ) -> lending_types

What if I want the data in Chinese?

wb_data(
  country = "lending_types_only",
  indicator = c("SP.POP.TOTL"), 
  mrv = 1,
  lang = "zh"
  ) -> lending_types_chinese

head(lending_types_chinese)
# A tibble: 3 × 9
  iso2c iso3c country   date SP.POP.TOTL unit  obs_status footnote
  <chr> <chr> <chr>    <dbl>       <dbl> <chr> <chr>      <chr>   
1 XF    IBD   只有IBRD  2024  4979421568 <NA>  <NA>       <NA>    
2 XH    IDB   IDA混合   2024   676957651 <NA>  <NA>       <NA>    
3 XI    IDX   只有IDA   2024  1269842894 <NA>  <NA>       <NA>    
# ℹ 1 more variable: last_updated <date>

United States Geological Survey (USGS) Data

The {dataRetrieval} package gives you easy access to water data gathered and warehoused by the USGS, USDA, EPA, and other entities. The package has an excellent tutorial available here, so I will not go into too many details and nuances. Start by installing the package and then loading it.

library(dataRetrieval)

You will need to know the site(s) you are interested in, as well as the parameters and statistics of interest. The package comes with a built-in data-set that shows you the parameters available, and the complete list of statistics is available here. Sites can be located from this inventory.


names(parameterCdFile)
[1] "parameter_cd"       "parameter_group_nm" "parameter_nm"      
[4] "casrn"              "srsname"            "parameter_units"   

If you are curious about a specific parameter, you can see everything that is available for it. I will look for anything related to the keyword storm, and also see what parameter units are available.

parameterCdFile[
  grep("storm",
    parameterCdFile$parameter_nm,
    ignore.case = TRUE),
  ] -> stormq 

unique(stormq$parameter_units)
[1] "minutes" "nu"      "hours"   "ft3/s"   "FNU"     "mgd"    
[7] "in"     

Let us do a quick grab of some data for the Hocking River at Athens, Ohio.

siteNo <- "03159500"
pCode <- "00065"
start.date <- "2007-10-01"
end.date <- "2026-03-22"

readNWISuv(
  siteNumbers = siteNo,
  parameterCd = pCode,
  startDate = start.date,
  endDate = end.date
  ) -> hocking 

Since the column names are based on parameter codes and hence cryptic, you can clean them up, and also see other attributes embedded in the data-set.

names(hocking)
[1] "agency_cd"        "site_no"          "dateTime"        
[4] "X_00065_00000"    "X_00065_00000_cd" "tz_cd"           
renameNWISColumns(hocking) -> hocking 

names(hocking)
[1] "agency_cd"  "site_no"    "dateTime"   "GH_Inst"    "GH_Inst_cd"
[6] "tz_cd"     
names(attributes(hocking))
[1] "names"         "row.names"     "class"         "url"          
[5] "siteInfo"      "variableInfo"  "disclaimer"    "statisticInfo"
[9] "queryTime"    
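
For example, the siteInfo attribute listed above carries the station metadata that came back with the query:

```r
library(dataRetrieval)

# Station metadata (name, coordinates, etc.) embedded in the query result
attr(hocking, "siteInfo")
```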

If I wanted data for multiple sites, I could find the site numbers and then grab the data.

sites <- c("03158200", "03159246")
pcode <- "00065"
start.date <- "2014-10-01"
end.date <- "2026-03-21"

readNWISuv(
  siteNumbers = sites,
  parameterCd = pcode,
  startDate = start.date,
  endDate = end.date
  ) -> hocking2 

renameNWISColumns(hocking2) -> hocking2 

Now a simple time-series plot of gage height for both sites. Note that although I asked for data going far back, not all sites have all data for all time periods, so it helps to check the site inventory first.
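
One way to check the period of record before querying is whatNWISdata(); this sketch assumes the same sites and parameter code as above:

```r
library(dataRetrieval)

# Inventory of available data for each site; the begin_date and end_date
# columns show the period of record for the gage-height parameter
whatNWISdata(
  siteNumber = c("03158200", "03159246"),
  parameterCd = "00065"
  ) -> site_inventory
```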

attr(hocking2, "variableInfo") -> parameterInfo 

ifelse(
  hocking2$site_no == "03158200",
  "Monday Creek at Doanville",
  "Sunday Creek Below Millfield"
  ) -> hocking2$station 

as.Date(
  as.character(hocking2$dateTime),
  format  = "%Y-%m-%d"
  ) -> hocking2$mydates

ggplot(
  data = hocking2, 
  aes(
    x = mydates, 
    y = GH_Inst, 
    color = station
    )
  ) + 
  geom_line() +
  labs(
    x = "", 
    y = parameterInfo$variableDescription,
    color = "Measuring Station"
    ) + 
  scale_x_date(date_breaks = "32 weeks") +
  theme_minimal() + 
  theme(legend.position = "bottom")  

Using RSocrata

Several APIs are discussed here, including the ones below.

The Socrata Open Data API allows you to programmatically access a wealth of open data resources from governments, non-profits, and NGOs around the world.

To access the SODA API in R, RSocrata may well be the easiest package out there! Let us install it, and then tap a couple of CDC datasets.

remotes::install_github("Chicago/RSocrata", ref = "dev")
library(RSocrata)

ls.socrata(
  "https://data.cdc.gov"
  ) -> cdc_catalog

Filter cdc_catalog so that publisher$name is data.cdc.gov, and then pick a surveillance dataset, as shown below. Alternatively, go to data.cdc.gov and browse through the data sets. I picked a specific dataset in Chronic Diseases titled Alzheimer's Disease and Healthy Aging Data.
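
That filtering step might look like the following sketch; it assumes the catalog returned by ls.socrata() has a nested publisher data frame with a name column, and title and landingPage columns, as the SODA catalog typically provides:

```r
library(dplyr)

# Keep only datasets published by data.cdc.gov and list their titles/pages
cdc_catalog %>%
  filter(publisher$name == "data.cdc.gov") %>%
  select(title, landingPage) -> cdc_datasets
```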

read.socrata(
  "https://data.cdc.gov/api/v3/views/hfr9-rurv"
  ) -> cdc_df01

How about another one that gives us the National Notifiable Diseases Surveillance System (NNDSS) Weekly Data?
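
Grabbing it follows the same pattern; the identifier below is a placeholder, so substitute the identifier shown on the NNDSS dataset's own page at data.cdc.gov:

```r
library(RSocrata)

# "xxxx-xxxx" is a hypothetical placeholder for the NNDSS dataset identifier
read.socrata(
  "https://data.cdc.gov/api/v3/views/xxxx-xxxx"
  ) -> nndss_df
```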

USAID’s Demographic and Health Surveys

remotes::install_github("ropensci/rdhs")

library(rdhs)
# set_rdhs_config()

Say I want the total fertility rate (TFR 15-49), so let me see what I can get for it. Note the tag IDs.

dhs_indicators() %>%
  janitor::clean_names() -> dhs_cat

dhs_cat %>%
  filter(
    short_name == "TFR 15-49"
    ) -> tfr

Well, now we use tagIds = 74 and pull the data for all surveys starting in 2019, for whatever country was surveyed in a given year.

dhs_data(
  tagIds = 74,
  breakdown = "subnational",
  surveyYearStart = 2019
  ) -> tfr_df

Parquet files in R with {arrow}

I am not going to give you anything original here; everything you should look at is from here, courtesy of Hadley Wickham, Danielle Navarro, and the rest of the team. Please read the chapter there if interested. Below I am simply showing you how {arrow} works.

library(arrow)
dir.create("data", showWarnings = FALSE)

curl::multi_download(
  "https://r4ds.s3.us-west-2.amazonaws.com/seattle-library-checkouts.csv",
  "data/seattle-library-checkouts.csv",
  resume = TRUE)
# A tibble: 1 × 10
  success status_code resumefrom url              destfile error type 
  <lgl>         <dbl>      <dbl> <chr>            <chr>    <chr> <chr>
1 TRUE            200          0 https://r4ds.s3… /Users/… <NA>  text…
# ℹ 3 more variables: modified <dttm>, time <dbl>, headers <list>
open_dataset(
  sources = "data/seattle-library-checkouts.csv", 
  col_types = schema(ISBN = string()),
  format = "csv"
  ) -> seattle_csv

seattle_csv %>%
  glimpse()
FileSystemDataset with 1 csv file
41,389,465 rows x 12 columns
$ UsageClass      <string> "Physical", "Physical", "Digital", "Physica…
$ CheckoutType    <string> "Horizon", "Horizon", "OverDrive", "Horizon…
$ MaterialType    <string> "BOOK", "BOOK", "EBOOK", "BOOK", "SOUNDDISC…
$ CheckoutYear     <int64> 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2…
$ CheckoutMonth    <int64> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6…
$ Checkouts        <int64> 1, 1, 1, 1, 1, 1, 1, 1, 4, 1, 1, 2, 3, 2, 1…
$ Title           <string> "Super rich : a guide to having it all / Ru…
$ ISBN            <string> "", "", "", "", "", "", "", "", "", "", "",…
$ Creator         <string> "Simmons, Russell", "Barclay, James, 1965-"…
$ Subjects        <string> "Self realization, Conduct of life, Attitud…
$ Publisher       <string> "Gotham Books,", "Pyr,", "Random House, Inc…
$ PublicationYear <string> "c2011.", "2010.", "2015", "2005.", "c2004.…
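
The payoff from {arrow} comes when you rewrite the large CSV as a partitioned Parquet dataset, as the r4ds chapter goes on to do; the output path below is my choice:

```r
library(arrow)
library(dplyr)

# Rewrite the CSV as Parquet files partitioned by CheckoutYear;
# later queries then read only the partitions they actually need
seattle_csv %>%
  group_by(CheckoutYear) %>%
  write_dataset(path = "data/seattle-library-checkouts", format = "parquet")
```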

readnoaa

{readnoaa} is a package that reads data from the National Oceanic and Atmospheric Administration (NOAA) via NOAA's API. Some common variables follow:

| Variable | Code | Unit (metric) | Function |
|---|---|---|---|
| Maximum temperature | TMAX | °C | noaa_daily() |
| Minimum temperature | TMIN | °C | noaa_daily() |
| Precipitation | PRCP | mm | noaa_daily(), noaa_monthly() |
| Snowfall | SNOW | mm | noaa_daily() |
| Snow depth | SNWD | mm | noaa_daily() |
| Average temperature | TAVG | °C | noaa_monthly(), noaa_annual() |
| Wind speed | AWND | m/s | noaa_daily() |
| Normal temperature | MLY-TAVG-NORMAL | °C | noaa_normals() |
| Normal precipitation | MLY-PRCP-NORMAL | mm | noaa_normals() |
#devtools::install_github("charlescoverdale/readnoaa")

library(readnoaa)

noaa_nearby(39.32, -82.10, radius_km = 25) 
       station latitude longitude elevation                  name
1  USC00330279  39.3225  -82.0994     208.5             ATHENS OU
2  US1OHAT0014  39.3376  -82.0913     195.7        ATHENS 1.0 NNE
3  US1OHAT0024  39.3056  -82.1178     274.9         ATHENS 1.7 SW
4  US1OHAT0016  39.2940  -82.0793     289.3        ATHENS 2.3 SSE
5  US1OHAT0027  39.3023  -82.1548     200.6        ATHENS 3.5 WSW
6  US1OHAT0029  39.3631  -82.1268     217.6    THE PLAINS 0.4 ESE
7  US1OHAT0008  39.2851  -82.1528     228.9         ATHENS 4.1 SW
8  US1OHAT0020  39.3818  -82.1078     201.2       CHAUNCEY 1.6 SE
9  US1OHAT0021  39.3099  -82.0128     256.6        ATHENS 4.6 ESE
10 US1OHAT0028  39.3732  -82.1896     222.2         ATHENS 6.0 NW
11 USC00330274  39.3833  -82.1833     210.0           ATHENS 5 NW
12 US1OHAT0022  39.3603  -82.2194     235.3        ATHENS 7.0 WNW
13 US1OHAT0007  39.3092  -82.2328     276.1 NEW MARSHFIELD 1.6 SW
14 US1OHAT0006  39.2728  -82.2430     262.7         ALBANY 4.1 NW
15 US1OHAT0023  39.2302  -82.2050     234.1         ALBANY 0.6 NW
16 USC00330141  39.4000  -81.9667     201.2             AMESVILLE
17 US1OHAT0012  39.4067  -81.9657     227.7     AMESVILLE 0.7 WNW
18 US1OHMS0002  39.1950  -82.1869     217.3        ALBANY 2.1 SSE
19 US1OHAT0013  39.3195  -81.9152     267.9        STEWART 1.4 NW
20 US1OHAT0005  39.3722  -81.9231     234.1      AMESVILLE 2.6 SE
21 US1OHAT0009  39.4676  -82.1336     224.9       GLOUSTER 3.6 SW
22 US1OHVN0003  39.2512  -82.2747     271.9        ALBANY 4.5 WNW
23 US1OHAT0011  39.1998  -81.9603     218.2     GUYSVILLE 6.5 SSW
24 US1OHAT0010  39.4608  -82.2260     229.8     NELSONVILLE 0.1 N
25 US1OHAT0026  39.4608  -82.2524     210.0     NELSONVILLE 1.4 W
   distance_km
1    0.2827381
2    2.0952147
3    2.2156087
4    3.3956185
5    5.1089482
6    5.3178641
7    5.9749316
8    6.9044974
9    7.5853052
10   9.7137085
11  10.0420543
12  11.2033567
13  11.4876141
14  13.3778830
15  13.4682799
16  14.5075375
17  15.0413883
18  15.7852175
19  15.8971122
20  16.2814796
21  16.6644065
22  16.8698687
23  17.9806723
24  19.0358003
25  20.4117462

noaa_daily(
  "USC00330279", 
  "2004-01-01", 
  "2026-03-22",
  datatypes = c(
    "TMAX", 
    "TMIN",
    "TAVG",
    "PRCP",
    "SNOW",
    "SNWD",
    "AWND"
    )
  ) -> athens.df

athens.df %>%
  glimpse()
Rows: 8,114
Columns: 10
$ station <chr> "USC00330279", "USC00330279", "USC00330279", "USC003…
$ name    <chr> "ATHENS OU, OH US", "ATHENS OU, OH US", "ATHENS OU, …
$ date    <date> 2004-01-01, 2004-01-02, 2004-01-03, 2004-01-04, 200…
$ awnd    <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ prcp    <dbl> 0.0, 10.2, 0.3, 40.1, 25.9, 0.3, 0.0, 0.0, 2.0, 0.0,…
$ snow    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 23, 0, 0, 0, 0, 0, 0, 0, 0, …
$ snwd    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 25, 25, 0, 0, 0, 0, 0, 0, 0,…
$ tavg    <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ tmax    <dbl> 10.6, 11.1, 15.0, 20.6, 8.3, 4.4, -3.9, -3.9, -2.2, …
$ tmin    <dbl> -3.9, -3.3, 7.8, 6.1, 3.9, -3.9, -10.0, -10.6, -6.1,…

See the buoydata package as well!

library(buoydata)

buoydata::buoy_data %>%
  dplyr::as_tibble() %>%
  dplyr::arrange(
    desc(YN)
    )
# A tibble: 1,330 × 17
   ID       Y1    YN nYEARS lastMeasurement       LAT   LON
   <chr> <dbl> <dbl>  <int> <dttm>              <dbl> <dbl>
 1 41002  1973  2026     52 2026-03-18 17:20:00  31.9 -74.9
 2 41004  1978  2026     38 2026-03-18 17:20:00  32.5 -79.1
 3 41008  1988  2026     35 2026-03-18 17:20:00  31.4 -80.9
 4 41009  1988  2026     39 2026-03-18 17:20:00  28.5 -80.2
 5 41013  2003  2026     24 2026-03-18 17:20:00  33.4 -77.7
 6 41024  2005  2026     22 2026-03-18 17:08:00  33.8 -78.5
 7 41025  2003  2026     24 2026-03-18 17:40:00  35.0 -75.4
 8 41029  2005  2026     22 2026-03-18 17:08:00  32.8 -79.6
 9 41033  2005  2026     22 2026-03-18 16:08:00  32.3 -80.4
10 41037  2005  2026     22 2026-03-18 17:08:00  34.0 -77.4
# ℹ 1,320 more rows
# ℹ 10 more variables: STATION_LOC <chr>, TTYPE <chr>,
#   TIMEZONE <chr>, OWNER <chr>, OWNERNAME <chr>, COUNTRY <chr>,
#   HULL <chr>, PAYLOAD <chr>, FORECAST <chr>, NOTE <chr>
get_variables() %>%
  knitr::kable(caption = "Variables in the Buoy Data Series")
Table 1: Variables in the Buoy Data Series
| variable_name | data_type | actual_range | description |
|---|---|---|---|
| apd | float | 0.0, 95.0 | Wave Period, Average |
| atmp | float | -153.4, 50.0 | Air Temperature |
| bar | float | 800.7, 1198.8 | Air Pressure |
| dewp | float | -99.9, 48.7 | Dewpoint Temperature |
| dpd | float | 0.0, 64.0 | Wave Period, Dominant |
| gst | float | 0.0, 75.5 | Wind Gust Speed |
| latitude | float | -55.0, 71.758 | Latitude |
| longitude | float | -177.75, 179.001 | Longitude |
| mwd | short | 0, 359 | Wave Direction |
| ptdy | float | -15.0, 14.1 | Pressure Tendency |
| station | String | | Station Identifier |
| tide | float | -9.37, 6.15 | Water Level |
| time | double | 4910400.0, 1.7742627E9 | Time |
| vis | float | 0.0, 66.7 | Station Visibility |
| wd | short | 0, 359 | Wind Direction |
| wspd | float | 0.0, 96.0 | Wind Speed |
| wspu | float | -98.7, 97.5 | Wind Speed, Zonal |
| wspv | float | -98.7, 97.5 | Wind Speed, Meridional |
| wtmp | float | -98.7, 50.0 | SST |
| wvht | float | 0.0, 92.39 | Wave Height |
get_buoy_data(
  buoyid = "41002"
  ) -> buoy.41002
[1] "failed to get variable datatype information"
[1] "will leave units unchanged"
buoy.41002 %>%
  glimpse()
Rows: 652,549
Columns: 17
$ apd  <chr> "NaN", "NaN", "NaN", "NaN", "NaN", "NaN", "NaN", "NaN",…
$ atmp <chr> "15.5", "13.9", "11.4", "10.4", "9.5", "9.2", "8.7", "8…
$ bar  <chr> "1031.0", "1031.0", "1031.0", "1031.0", "1031.0", "1031…
$ dewp <chr> "5.4", "7.3", "6.5", "6.0", "6.2", "5.8", "5.9", "5.2",…
$ dpd  <chr> "NaN", "NaN", "NaN", "NaN", "NaN", "NaN", "NaN", "NaN",…
$ gst  <chr> "NaN", "NaN", "NaN", "NaN", "NaN", "NaN", "NaN", "NaN",…
$ mwd  <chr> "NaN", "NaN", "NaN", "NaN", "NaN", "NaN", "NaN", "NaN",…
$ ptdy <chr> "NaN", "NaN", "NaN", "NaN", "NaN", "NaN", "NaN", "NaN",…
$ tide <chr> "NaN", "NaN", "NaN", "NaN", "NaN", "NaN", "NaN", "NaN",…
$ time <chr> "1973-12-01T01:00:00Z", "1973-12-01T02:00:00Z", "1973-1…
$ vis  <chr> "NaN", "NaN", "NaN", "NaN", "NaN", "NaN", "NaN", "NaN",…
$ wd   <chr> "149", "145", "315", "312", "315", "314", "312", "305",…
$ wspd <chr> "1.5", "0.3", "1.4", "1.6", "2.2", "2.4", "2.1", "3.8",…
$ wspu <chr> "-0.8", "-0.2", "1.0", "1.2", "1.6", "1.7", "1.6", "3.1…
$ wspv <chr> "1.3", "0.2", "-1.0", "-1.1", "-1.6", "-1.7", "-1.4", "…
$ wtmp <chr> "NaN", "NaN", "NaN", "NaN", "NaN", "NaN", "NaN", "NaN",…
$ wvht <chr> "NaN", "NaN", "NaN", "NaN", "NaN", "NaN", "NaN", "NaN",…

Footnotes

  1. There was no Summary File 3 generated for 2010.

Citation

For attribution, please cite this work as

Ruhil (2026, March 23). Using APIs from the Census Bureau & other sources. Retrieved from https://aniruhil.org/courses/mpa6020/handouts/module06.html

BibTeX citation

@misc{ruhil2026using,
  author = {Ruhil, Ani},
  title = {Using APIs from the Census Bureau & other sources},
  url = {https://aniruhil.org/courses/mpa6020/handouts/module06.html},
  year = {2026}
}