From an Attican Hollow: Estimating Disconnected Youth in Ohio

Ani Ruhil

A client needed state-level estimates of the percent of opportunity youth (defined either as 16-19 or 16-24 year-old persons who are neither in school nor employed) in each state plus Washington DC. The end result would be three years of estimates that matched the numbers put out by Measure of America’s yearly reports on the subject. I had used Anthony Damico’s fantastic repository Analyze Survey Data for Free for BRFSS, DHS, and other data but never with the Census Bureau’s Public Use Microdata Samples (PUMS) data. This was a golden chance to remedy that limitation, and in this short post I am documenting my experience.

The PUMS are a handy source for creating special tabulations not available as standard tables in the ACS 1-year/5-year products precisely because the PUMS provide selected extracts of raw ACS data from a small sample of census records screened to protect confidentiality. The 1-year ACS PUMS files will give you 1% of the households in the country whereas the 5-year ACS PUMS will give you 5% of the households in country. While state-level estimates are easily derived from these PUMS data, doing anything for smaller geographies requires more work. The lowest level of geography at which PUMS data are available is a specialized census geography called a Public Use Microdata Area (PUMA), areas with populations of at least 100,000. Heavily populated areas may be split across multiple PUMAs while a number of contiguous sparsely populated areas (counties included) may be bundled into a single PUMA. The challenge then becomes getting at these smaller geographies and I reserve that experiment for the next post. For now we focus on the easier, the state-level analysis.

Getting the PUMS data is easy enough with the lodown package, but bear in mind that you need some computing horsepower and a fast internet connection. For example, even with half a terabyte of free disk space, a Macbook Pro (2016) with 16GB of RAM and 2.9 GHz Intel core i7 I had many failed attempts at downloading all fifty states, DC, and Puerto Rico. In brief, the 2016 American Community Survey 1-year files would download for each geography but then the 52 individual data-sets would not merge because the vector memory was being exhausted (RStudio 3.5+ has a new system of vector memory allocation “to avoid having the R process killed when macOS over-commits memory.”). My solution was to work outside of RStudio in the R console itself, with each yearly data download split into two calls, one for each of 26 geographies. The basic code follows.

library(lodown)
acs_cat <- get_catalog("acs", output_dir = file.path( path.expand( "~" ), "ACS"))

acs_cat1 <- subset(acs_cat, year == 2016 & time_period == '1-Year' &
                     stateab %in% c("ak", "al", "ar", "az", "ca", "co",
                                    "ct", "dc", "de", "fl", "ga", "hi",
                                    "ia", "id", "il", "in", "ks", "ky",
                                    "la", "ma", "md", "me", "mi", "mn",
                                    "mo", "ms")
                   )

acs_cat2 <- subset(acs_cat, year == 2016 & time_period == '1-Year'
                   & stateab %in% c("mt", "nc", "nd", "ne", "nh", "nj",
                                    "nm", "nv", "ny", "oh", "ok", "or",
                                    "pa", "pr", "ri", "sc", "sd", "tn",
                                    "tx", "ut", "va", "vt", "wa", "wi",
                                    "wv", "wy")
                   )

Your should end up with a file called acs2016_1yr.rds in the ACS folder. This is just the first batch of geographies so we’ll repeat this step again for the second batch. But before we do that, let us generate a few indicators we need, filter the data-frame to just the 16-24 year-olds, and save this file with a unique name.

library(survey)

acs_df <- 
    readRDS( 
        file.path( path.expand( "~" ) , "ACS" , 
            "acs2016_1yr.rds" ) 
    )

library(dplyr)
acs_df1 <- acs_df %>% 
  filter(agep %in% seq(16, 24, by = 1)) %>% 
  mutate(agep_1624 = sum(agep %in% c(16, 17, 18, 19, 20, 21,  22, 23, 24)), 
         noteduc = factor(sch, levels = c(1, 2, 3), 
                          labels = c("Has not attended in the last 3 months", 
                                     "Public School/College", 
                                     "Private School/College or Homeschool")),
         notwork = factor(esr, levels = c(1:6), 
                          labels = c("Civilian employed, at work",
                                     "Civilian employed, with a job but not at work",
                                     "Unemployed",
                                     "Armed forces, at work",
                                     "Armed forces, with a job but not at work",
                                     "Not in labor force")), 
         notworking = ifelse(notwork == "Unemployed", "Yes", 
                             ifelse(notwork == "Not in labor force",
                                    "Yes", "No")),
         notinschool = ifelse(noteduc == "Has not attended in the last 3 months",
                              "Yes", "No"), 
         disengaged = ifelse(notinschool == "Yes" & notworking == "Yes",
                             "Disengaged", "Not Disengaged")
         )

saveRDS( acs_df1, file.path( path.expand( "~" ), "ACS", "acs2016_1yrA.rds") )

We can then rinse and repeat for the second batch of geographies – acs_cat2.

Having processed acs_cat2 and saved it as acs_df2, we can now combine the two data-frames.

acs_df <- rbind.data.frame(acs_df1, acs_df2) 
saveRDS( acs_df, file.path( path.expand( "~" ), "ACS", "acs2016_1yrAB.rds") )

I cleared my RStudio Environment before loading the full 2016 data-frame, just to avoid any vector memory errors.

rm(list = ls())

acs_df <- readRDS( file.path( path.expand( "~" ), "ACS", "acs2016_1yrAB.rds") )

acs_df <- acs_df %>% 
  mutate(
    race_ethnicity = case_when(
      racaian == 1 & fhisp == 0 ~ "NH AIAN",
      racasn == 1 & fhisp == 0  ~ "NH Asian",
      racblk == 1  & fhisp == 0 ~ "NH African-American",
      racnh == 1  & fhisp == 0 ~ "NH Native Hawaiian/Other Pacific Islander",
      racpi == 1  & fhisp == 0 ~ "NH Native Hawaiian/Other Pacific Islander",
      racsor == 1 & fhisp == 0 ~ "Some Other NH",
      racwht == 1 & fhisp == 0 ~ "NH White",
      fhisp == 1 ~ "Hispanic"
      ),
    youth = ifelse(agep >= 16 & agep <= 24, "Yes", "No"),
    disengaged = factor(disengaged),
    education = case_when(
      schl <= 15 ~ "Less Than High School Diploma/GED",
      schl >= 20 ~ "2-/4-year College Degree or Higher"
      ),
    income = case_when(
      povpip == 501 ~ "Income-to-poverty ratio >= 501%",
      TRUE ~ "Income-to-poverty ratio < 501%"
      ),
    disability = case_when(
      dis == 1 ~ "With a Disability",
      dis == 2 ~ "Without a Disability"
      )    
    )

Now all that remained to be done was to specify the survey design and then run the weighted estimates I needed.

library(survey)
acs_design <- svrepdesign(
        weight = ~pwgtp ,
        repweights = 'pwgtp[0-9]+' ,
        scale = 4 / 80 ,
        rscales = rep( 1 , 80 ) ,
        mse = TRUE ,
        type = 'JK1' ,
        data = acs_df
    )

acs_design <-
    update(
        acs_design ,
        relp = as.numeric( relp ) ,
        state_name =
            factor(
                as.numeric( st ) ,
                levels = c(1L, 2L, 4L, 5L, 6L, 8L, 9L,
                           10L, 11L, 12L, 13L, 15L, 16L,
                           17L, 18L, 19L, 20L, 21L, 22L,
                           23L, 24L, 25L, 26L, 27L, 28L,
                           29L, 30L, 31L, 32L, 33L, 34L,
                           35L, 36L, 37L, 38L, 39L, 40L,
                           41L, 42L, 44L, 45L, 46L, 47L,
                           48L, 49L, 50L, 51L, 53L, 54L,
                           55L, 56L, 72L) ,
                labels = c("Alabama", "Alaska", "Arizona", "Arkansas",
                           "California", "Colorado", "Connecticut",
                           "Delaware", "District of Columbia", "Florida",
                           "Georgia", "Hawaii", "Idaho", "Illinois",
                           "Indiana", "Iowa", "Kansas", "Kentucky",
                           "Louisiana", "Maine", "Maryland", "Massachusetts",
                           "Michigan", "Minnesota", "Mississippi", "Missouri",
                           "Montana", "Nebraska", "Nevada", "New Hampshire",
                           "New Jersey", "New Mexico", "New York", "North Carolina",
                           "North Dakota", "Ohio", "Oklahoma", "Oregon",
                           "Pennsylvania", "Rhode Island", "South Carolina",
                           "South Dakota", "Tennessee", "Texas", "Utah", "Vermont",
                           "Virginia", "Washington", "West Virginia", "Wisconsin",
                           "Wyoming", "Puerto Rico")
                )
        )

oppyouth <- svyby( ~ disengaged , ~ state_name ,
                   acs_design , svymean, na.rm = TRUE)
oppyouth[, c(2:5)] <- oppyouth[, c(2:5)] * 100 
oppyouth[, c(2:5)] <- round(oppyouth[, c(2:5)], digits = 1)
oppyouth$se2 <- NULL
colnames(oppyouth) <- c("State", "Disengaged (%)",
                        "Not Disengaged  (%)", "Standard Error (%)")

Table 1: Percent Disengaged Youth (16-24 year-olds) by State
State	Disengaged (%)	Not Disengaged (%)	Standard Error (%)
Alabama	14.1	85.9	0.5
Alaska	17.9	82.1	1.9
Arizona	13.7	86.3	0.5
Arkansas	15.0	85.0	0.7
California	11.5	88.5	0.2
Colorado	10.7	89.3	0.5
Connecticut	8.5	91.5	0.5
Delaware	14.3	85.7	1.5
District of Columbia	14.8	85.2	1.6
Florida	11.8	88.2	0.3
Georgia	12.6	87.4	0.4
Hawaii	11.1	88.9	1.1
Idaho	10.9	89.1	1.1
Illinois	10.7	89.3	0.3
Indiana	10.7	89.3	0.4
Iowa	7.4	92.6	0.5
Kansas	10.0	90.0	0.6
Kentucky	14.3	85.7	0.8
Louisiana	17.4	82.6	0.8
Maine	11.9	88.1	1.2
Maryland	11.1	88.9	0.5
Massachusetts	7.4	92.6	0.4
Michigan	11.6	88.4	0.5
Minnesota	7.5	92.5	0.5
Mississippi	14.5	85.5	0.9
Missouri	10.1	89.9	0.6
Montana	11.8	88.2	1.3
Nebraska	9.2	90.8	0.8
Nevada	14.3	85.7	0.8
New Hampshire	8.5	91.5	0.8
New Jersey	10.1	89.9	0.4
New Mexico	16.4	83.6	1.3
New York	12.1	87.9	0.3
North Carolina	11.6	88.4	0.4
North Dakota	7.0	93.0	1.0
Ohio	11.1	88.9	0.4
Oklahoma	14.2	85.8	0.7
Oregon	11.9	88.1	0.7
Pennsylvania	10.7	89.3	0.4
Rhode Island	7.5	92.5	0.8
South Carolina	12.7	87.3	0.5
South Dakota	9.2	90.8	1.2
Tennessee	13.1	86.9	0.6
Texas	13.4	86.6	0.3
Utah	9.2	90.8	0.6
Vermont	11.6	88.4	2.3
Virginia	9.8	90.2	0.4
Washington	12.3	87.7	0.4
West Virginia	17.3	82.7	1.0
Wisconsin	9.1	90.9	0.5
Wyoming	10.3	89.7	1.5
Puerto Rico	21.6	78.4	0.7

If you want the breakouts by race, you will need to cross Hispanic status and the one race alone categories since Measure of America’s estimates for White and Black youth are for non-Hispanic one race alone, while their Latino estimates are for Latinos of any race. Here is a snippet of code for non-Hispanic Black alone youth.

blkyouth <- svyby( ~ disengaged, ~ state_name, 
                   subset(acs_design, rac1p == "2" & hisp == "01"),
                   svymean, na.rm = TRUE)
blkyouth[, c(2:5)] <- blkyouth[, c(2:5)] * 100 
blkyouth[, c(2:5)] <- round(blkyouth[, c(2:5)], digits = 1)
blkyouth$se2 <- NULL
colnames(blkyouth) <- c("State", "Disengaged (%)",
                        "Not Disengaged  (%)", "Standard Error (%)"
                        )

Once you have the 2016 estimates, roll back through to get the 2015 and 2014 (or earlier) estimates.

Estimating Disconnected Youth in Ohio

Author

Affiliation

Published

Citation

Footnotes

Reuse

Citation