Opportunity Youth in Ohio

A client needed state-level estimates of the percent of opportunity youth (defined either as 16-19 or 16-24 year-old persons who are neither in school nor employed) in each state plus Washington DC. The end result would be three years of estimates that matched the numbers put out by Measure of America’s yearly reports on the subject.. I had used Anthony Damico’s fantastic repository Analyze Survey Data for Free for BRFSS, DHS, and other data but never with the Census Bureau’s Public Use Microdata Samples (PUMS) data. This was a golden chance to remedy that limitation, and in this short post I am documenting my experience.

The PUMS are a handy source for creating special tabulations not available as standard tables in the ACS 1-year/5-year products precisely because the PUMS provide selected extracts of raw ACS data from a small sample of census records screened to protect confidentiality. The 1-year ACS PUMS files will give you 1% of the households in the country whereas the 5-year ACS PUMS will give you 5% of the households in country. While state-level estimates are easily derived from these PUMS data, doing anything for smaller geographies requires more work. The lowest level of geography at which PUMS data are available is a specialized census geography called a Public Use Microdata Area (PUMA), areas with populations of at least 100,000. Heavily populated areas may be split across multiple PUMAs while a number of contiguous sparsely populated areas (counties included) may be bundled into a single PUMA. The challenge then becomes getting at these smaller geographies and I reserve that experiment for the next post. For now we focus on the easier, the state-level analysis.

Getting the PUMS data is easy enough with the lodown package, but bear in mind that you need some computing horsepower and a fast internet connection. For example, even with half a terabyte of free disk space, a Macbook Pro (2016) with 16GB of RAM and 2.9 GHz Intel core i7 I had many failed attempts at downloading all fifty states, DC, and Puerto Rico. In brief, the 2016 American Community Survey 1-year files would download for each geography but then the 52 individual data-sets would not merge because the vector memory was being exhausted (RStudio 3.5+ has a new system of vector memory allocation “to avoid having the R process killed when macOS over-commits memory.”). My solution was to work outside of RStudio in the R console itself, with each yearly data download split into two calls, one for each of 26 geographies. The basic code follows.

library(lodown)
acs_cat <- get_catalog("acs", output_dir = file.path(path.expand("~"), "ACS"))

and then generating the catalogs of interest.

acs_cat1 <- subset(acs_cat, year == 2016 & time_period == "1-Year" & stateab %in% 
    c("ak", "al", "ar", "az", "ca", "co", "ct", "dc", "de", "fl", "ga", "hi", 
        "ia", "id", "il", "in", "ks", "ky", "la", "ma", "md", "me", "mi", "mn", 
        "mo", "ms"))

acs_cat2 <- subset(acs_cat, year == 2016 & time_period == "1-Year" & stateab %in% 
    c("mt", "nc", "nd", "ne", "nh", "nj", "nm", "nv", "ny", "oh", "ok", "or", 
        "pa", "pr", "ri", "sc", "sd", "tn", "tx", "ut", "va", "vt", "wa", "wi", 
        "wv", "wy"))

Now we can download the microdata to the ACS folder.

acs_cat1 <- lodown("acs", acs_cat1)

Your should end up with a file called acs2016_1yr.rds in the ACS folder. This is just the first batch of geographies so we’ll repeat this step again for the second batch. But before we do that, let us generate a few indicators we need, filter the data-frame to just the 16-24 year-olds, and save this file with a unique name.

library(survey)

acs_df <- readRDS(file.path(path.expand("~"), "ACS", "acs2016_1yr.rds"))

library(dplyr)
acs_df1 <- acs_df %>% filter(agep %in% seq(16, 24, by = 1)) %>% mutate(agep_1624 = sum(agep %in% 
    c(16, 17, 18, 19, 20, 21, 22, 23, 24)), noteduc = factor(sch, levels = c(1, 
    2, 3), labels = c("Has not attended in the last 3 months", "Public School/College", 
    "Private School/College or Homeschool")), notwork = factor(esr, levels = c(1:6), 
    labels = c("Civilian employed, at work", "Civilian employed, with a job but not at work", 
        "Unemployed", "Armed forces, at work", "Armed forces, with a job but not at work", 
        "Not in labor force")), notworking = ifelse(notwork == "Unemployed", 
    "Yes", ifelse(notwork == "Not in labor force", "Yes", "No")), notinschool = ifelse(noteduc == 
    "Has not attended in the last 3 months", "Yes", "No"), disengaged = ifelse(notinschool == 
    "Yes" & notworking == "Yes", "Disengaged", "Not Disengaged"))

saveRDS(acs_df1, file.path(path.expand("~"), "ACS", "acs2016_1yrA.rds"))

We can then rinse and repeat for the second batch of geographies – acs_cat2.

Having processed acs_cat2 and saved it as acs_df2, we can now combine the two data-frames.

acs_df <- rbind.data.frame(acs_df1, acs_df2)
saveRDS(acs_df, file.path(path.expand("~"), "ACS", "acs2016_1yrAB.rds"))

I cleared my RStudio Environment before loading the full 2016 data-frame, just to avoid any vector memory errors.

rm(list = ls())

acs_df <- readRDS(file.path(path.expand("~"), "ACS", "acs2016_1yrAB.rds"))

acs_df <- acs_df %>% mutate(hispanic = ifelse(hisp == 1, "Not Hispanic", "Hispanic"), 
    race = factor(rac1p, levels = 1:9, labels = c("White Alone", "Black or African American Alone", 
        "American Indian Alone", "Alaska Native Alone", "American Indian and Alaska Native", 
        "Asian Alone", "Native Hawaiian and Other Pacific Islander Alone", "Some Other Race Alone", 
        "Two or More Races")), youth = ifelse(agep >= 16 & agep <= 24, "Yes", 
        "No"), disengaged = factor(disengaged))

Now all that remained to be done was to specify the survey design and then run the weighted estimates I needed.

library(survey)
acs_design <- svrepdesign(weight = ~pwgtp, repweights = "pwgtp[0-9]+", scale = 4/80, 
    rscales = rep(1, 80), mse = TRUE, type = "JK1", data = acs_df)

I borrowed Anthony’s default code to label the states.

acs_design <- update(acs_design, relp = as.numeric(relp), state_name = factor(as.numeric(st), 
    levels = c(1L, 2L, 4L, 5L, 6L, 8L, 9L, 10L, 11L, 12L, 13L, 15L, 16L, 17L, 
        18L, 19L, 20L, 21L, 22L, 23L, 24L, 25L, 26L, 27L, 28L, 29L, 30L, 31L, 
        32L, 33L, 34L, 35L, 36L, 37L, 38L, 39L, 40L, 41L, 42L, 44L, 45L, 46L, 
        47L, 48L, 49L, 50L, 51L, 53L, 54L, 55L, 56L, 72L), labels = c("Alabama", 
        "Alaska", "Arizona", "Arkansas", "California", "Colorado", "Connecticut", 
        "Delaware", "District of Columbia", "Florida", "Georgia", "Hawaii", 
        "Idaho", "Illinois", "Indiana", "Iowa", "Kansas", "Kentucky", "Louisiana", 
        "Maine", "Maryland", "Massachusetts", "Michigan", "Minnesota", "Mississippi", 
        "Missouri", "Montana", "Nebraska", "Nevada", "New Hampshire", "New Jersey", 
        "New Mexico", "New York", "North Carolina", "North Dakota", "Ohio", 
        "Oklahoma", "Oregon", "Pennsylvania", "Rhode Island", "South Carolina", 
        "South Dakota", "Tennessee", "Texas", "Utah", "Vermont", "Virginia", 
        "Washington", "West Virginia", "Wisconsin", "Wyoming", "Puerto Rico")))

Now the target measure of opportunity youth.

oppyouth <- svyby(~disengaged, ~state_name, acs_design, svymean, na.rm = TRUE)
oppyouth[, c(2:5)] <- oppyouth[, c(2:5)] * 100
oppyouth[, c(2:5)] <- round(oppyouth[, c(2:5)], digits = 1)
oppyouth$se2 <- NULL
colnames(oppyouth) <- c("State", "Disengaged (%)", "Not Disengaged  (%)", "Standard Error (%)")

If all went well you should see oppyouth reflecting the following for 2016.

knitr::kable(oppyouth, booktabs = TRUE, align = c("l", "r", "r", "r"), caption = "Percent Disengaged Youth (16-24 year-olds) by State", 
    row.names = FALSE, col.names = c("State", "Disengaged (%)", "Not Disengaged  (%)", 
        "Standard Error (%)"), "html") %>% kableExtra::kable_styling("striped", 
    full_width = FALSE)
Table 1: Percent Disengaged Youth (16-24 year-olds) by State
State Disengaged (%) Not Disengaged (%) Standard Error (%)
Alabama 14.1 85.9 0.5
Alaska 17.9 82.1 1.9
Arizona 13.7 86.3 0.5
Arkansas 15.0 85.0 0.7
California 11.5 88.5 0.2
Colorado 10.7 89.3 0.5
Connecticut 8.5 91.5 0.5
Delaware 14.3 85.7 1.5
District of Columbia 14.8 85.2 1.6
Florida 11.8 88.2 0.3
Georgia 12.6 87.4 0.4
Hawaii 11.1 88.9 1.1
Idaho 10.9 89.1 1.1
Illinois 10.7 89.3 0.3
Indiana 10.7 89.3 0.4
Iowa 7.4 92.6 0.5
Kansas 10.0 90.0 0.6
Kentucky 14.3 85.7 0.8
Louisiana 17.4 82.6 0.8
Maine 11.9 88.1 1.2
Maryland 11.1 88.9 0.5
Massachusetts 7.4 92.6 0.4
Michigan 11.6 88.4 0.5
Minnesota 7.5 92.5 0.5
Mississippi 14.5 85.5 0.9
Missouri 10.1 89.9 0.6
Montana 11.8 88.2 1.3
Nebraska 9.2 90.8 0.8
Nevada 14.3 85.7 0.8
New Hampshire 8.5 91.5 0.8
New Jersey 10.1 89.9 0.4
New Mexico 16.4 83.6 1.3
New York 12.1 87.9 0.3
North Carolina 11.6 88.4 0.4
North Dakota 7.0 93.0 1.0
Ohio 11.1 88.9 0.4
Oklahoma 14.2 85.8 0.7
Oregon 11.9 88.1 0.7
Pennsylvania 10.7 89.3 0.4
Rhode Island 7.5 92.5 0.8
South Carolina 12.7 87.3 0.5
South Dakota 9.2 90.8 1.2
Tennessee 13.1 86.9 0.6
Texas 13.4 86.6 0.3
Utah 9.2 90.8 0.6
Vermont 11.6 88.4 2.3
Virginia 9.8 90.2 0.4
Washington 12.3 87.7 0.4
West Virginia 17.3 82.7 1.0
Wisconsin 9.1 90.9 0.5
Wyoming 10.3 89.7 1.5
Puerto Rico 21.6 78.4 0.7

If you want the breakouts by race, you will need to cross Hispanic status and the one race alone categories since Measure of America’s estimates for White and Black youth are for non-Hispanic one race alone, while their Latino estimates are for Latinos of any race. Here is a snippet of code for non-Hispanic Black alone youth.

blkyouth <- svyby(~disengaged, ~state_name, subset(acs_design, rac1p == "2" & 
    hisp == "01"), svymean, na.rm = TRUE)
blkyouth[, c(2:5)] <- blkyouth[, c(2:5)] * 100
blkyouth[, c(2:5)] <- round(blkyouth[, c(2:5)], digits = 1)
blkyouth$se2 <- NULL
colnames(blkyouth) <- c("State", "Disengaged (%)", "Not Disengaged  (%)", "Standard Error (%)")

Once you have the 2016 estimates, roll back through to get the 2015 and 2014 (or earlier) estimates.

Related