class: title-slide, center, middle
background-image: url(images/ouaerial.jpeg)
background-size: cover

# .crimson[.fancy[More Data Operations in R]]

## .crimson[.fancy[Ani Ruhil]]

---
name: agenda

# .fancy[ Agenda ]

This week we continue to learn how to organize and clean messy data.

.pull-left[

+ package of choice here is `{tidyr}`
+ can reshape data `long-to-wide` and `wide-to-long`
+ can `separate` and `unite` columns

]

.pull-right[

<img src="images/tidyr_spread_gather.png" width="40%" style="display: block; margin: auto;" />

]

---

# `separate()`

You will, at times, end up with columns that contain multiple pieces of information, all mashed up into some alphanumeric string or sequence of numbers. `separate()` allows us to split such a mashed-up column into its specific pieces.

For example, here are some data from the Census Bureau:

```r
library(tidyverse)

read.csv(
  "https://www2.census.gov/programs-surveys/popest/datasets/2010-2018/metro/totals/cbsa-est2018-alldata.csv"
  ) -> cbsa

cbsa %>%
  select(NAME) %>%
  glimpse()
```

```
## Observations: 2,789
## Variables: 1
## $ NAME <fct> "Abilene, TX", "Callahan County, TX", "Jones County, TX", "Taylor…
```

---

This data-set contains population estimates for CBSAs -- core-based statistical areas. What are these?

> Metropolitan and Micropolitan Statistical Areas are collectively referred to as Core-Based Statistical Areas.

> Metropolitan statistical areas have at least one urbanized area of 50,000 or more population, plus adjacent territory that has a high degree of social and economic integration with the core as measured by commuting ties. Micropolitan statistical areas are a new set of statistical areas that have at least one urban cluster of at least 10,000 but less than 50,000 population, plus adjacent territory that has a high degree of social and economic integration with the core as measured by commuting ties.
> Metropolitan and micropolitan statistical areas are defined in terms of whole counties or county equivalents, including the six New England states. As of June 6, 2003, there are 362 metropolitan statistical areas and 560 micropolitan statistical areas in the United States. [Source](https://www.census.gov/topics/housing/housing-patterns/about/core-based-statistical-areas.html)

---

Look at the column called `NAME` ... it combines the name of the area and the state's abbreviated name: "Abilene, TX", "Callahan County, TX", etc.

We need to split this `NAME` column into two pieces -- `placename` ("Abilene", "Callahan County", etc.) and `stateabb` ("TX", "TX", etc.)

We do this below, with the separation occurring wherever a `","` is seen in `NAME`.

.pull-left[

```r
cbsa %>%
  separate(
    col = NAME,
    into = c("placename", "stateabb"),
    sep = ",",
    remove = FALSE
    ) -> cbsa

cbsa %>%
  select(NAME, placename, stateabb) %>%
  head()
```

```
##                  NAME       placename stateabb
## 1         Abilene, TX         Abilene       TX
## 2 Callahan County, TX Callahan County       TX
## 3    Jones County, TX    Jones County       TX
## 4   Taylor County, TX   Taylor County       TX
## 5           Akron, OH           Akron       OH
## 6  Portage County, OH  Portage County       OH
```

]

.pull-right[

Here is what each piece of code is doing:

| code | what it does ... |
|:-- | :---- |
| col = | identifies the column to be separated |
| into = | creates the names for the new columns that will result |
| sep = | indicates where the separation should occur |
| remove = | indicates whether the column being separated should be removed from the data-set or retained once the new columns have been created. Setting it to `FALSE` will keep the original column; `TRUE` will remove it. |

]

---

### What if the column to be separated was made up of numbers rather than text?

Take the `STCOU` column, which contains FIPS codes. The first two digits identify the state; the next three digits identify the county.
Ohio's FIPS code is, for instance, `39`, and Portage County's FIPS code is `133`.

We create two new columns, one with the state FIPS code (`stfips`) and the second with the county FIPS code (`coufips`).

This time we set `sep = 2` because we want the separation to happen after the second digit.

.pull-left[

```r
cbsa %>%
  separate(
    col = STCOU,
    into = c("stfips", "coufips"),
    sep = 2,
    remove = FALSE
    ) -> cbsa
```

]

.pull-right[

```r
cbsa %>%
  select(STCOU, stfips, coufips) %>%
  head()
```

```
##   STCOU stfips coufips
## 1    NA   <NA>    <NA>
## 2 48059     48     059
## 3 48253     48     253
## 4 48441     48     441
## 5    NA   <NA>    <NA>
## 6 39133     39     133
```

]

---

# `unite()`

This is the opposite of `separate()` -- two or more columns are united into ONE column.

For example, take the file I am reading in as `coudf`. This file has content similar to what we read in for the CBSAs, but this one has data for counties and states.

```r
read_csv(
  "https://www2.census.gov/programs-surveys/popest/datasets/2010-2018/counties/totals/co-est2018-alldata.csv"
  ) -> coudf
```

Filter to retain rows only for counties. I am doing this with `filter(COUNTY != "000")` because the state rows are the ones with `COUNTY == "000"`.

.pull-left[

```r
coudf %>%
  filter(COUNTY != "000") -> coudf2
```

]

.pull-right[

```r
coudf2 %>%
  select(STNAME, CTYNAME) %>%
  glimpse()
```

```
## Observations: 3,193
## Variables: 2
## $ STNAME  <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabama", "Alabam…
## $ CTYNAME <chr> "Alabama", "Autauga County", "Baldwin County", "Barbour County…
```

]

---

Now I want to combine the county name (`CTYNAME`) and the state name (`STNAME`) into a single column, with the two names separated by a comma and a single white-space, i.e., by `", "`.
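Before running `unite()` on the real data, here is the same idea on a tiny made-up data frame (a sketch for illustration only; `df` and its `first`/`last` columns are invented):

```r
library(tidyverse)

# two columns we want to glue into one
df <- data.frame(first = c("Ada", "Grace"), last = c("Lovelace", "Hopper"))

df %>%
  unite(
    col = "full",        # the new column to create
    c("first", "last"),  # the columns to combine
    sep = ", ",          # separator placed between the pieces
    remove = FALSE       # keep the original columns too
  )
```

The new `full` column holds "Ada, Lovelace" and "Grace, Hopper"; with `remove = TRUE` (the default) the original `first` and `last` columns would be dropped.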
.pull-left[

```r
coudf2 %>%
  unite(
    col = "countystate",
    c("CTYNAME", "STNAME"),
    sep = ", ",
    remove = FALSE
    ) -> coudf2

coudf2 %>%
  select(CTYNAME, STNAME, countystate) %>%
  head()
```

```
## # A tibble: 6 x 3
##   CTYNAME        STNAME  countystate            
##   <chr>          <chr>   <chr>                  
## 1 Autauga County Alabama Autauga County, Alabama
## 2 Baldwin County Alabama Baldwin County, Alabama
## 3 Barbour County Alabama Barbour County, Alabama
## 4 Bibb County    Alabama Bibb County, Alabama   
## 5 Blount County  Alabama Blount County, Alabama 
## 6 Bullock County Alabama Bullock County, Alabama
```

]

.pull-right[

Here is what each piece of code is doing ...

| code | what it does ... |
| :-- | :-- |
| col = | identifies the `new column` to be created |
| c("..") | identifies the columns to be combined, as in c("column1", "column2", "column3") |
| sep = | indicates if we want the merged elements to be separated in some manner. Here we are using ", " to separate with a comma followed by a single white-space, but we could have used any separator, or none at all |
| remove = | indicates whether we want the original columns deleted `(TRUE)` or retained `(FALSE)` |

]

---

## `Pivoting data` wide -> long and long -> wide

<img src="images/tidyr_gif.gif" width="50%" style="display: block; margin: auto;" />

[`gif source`](https://afdataschool.github.io/r-novice-gapminder/15-tidyr/)

---

If I look at the original CBSA file `cbsa`, I see that it has been set up very oddly. In particular, starting with column 6 we have a jumble of information ...

+ CENSUS2010POP, ESTIMATESBASE2010, and POPESTIMATE2010 all have the population total for 2010.
+ POPESTIMATE2011 through POPESTIMATE2018 are the population totals for 2011-2018
+ NPOPCHG_20XY gives us the net population change for 2010-2018
+ BIRTHS20XY gives us the number of births for 2010-2018
+ DEATHS20XY gives us the number of deaths for 2010-2018
+ NATURALINC20XY gives us the natural increase = births - deaths for 2010-2018
+ INTERNATIONALMIG20XY gives us international migrant totals for 2010-2018
+ DOMESTICMIG20XY gives us domestic migrant totals for 2010-2018
+ NETMIG20XY gives us net migration totals for 2010-2018
+ RESIDUAL20XY gives us the small numbers left over after adding and subtracting the other components of population change

---

Let us keep only a few columns to see what the current layout looks like.

```r
read_csv(
  "https://www2.census.gov/programs-surveys/popest/datasets/2010-2018/metro/totals/cbsa-est2018-alldata.csv"
  ) -> cbsa

cbsa %>%
  select(c(4, 8:16)) -> cbsa01

cbsa01 %>%
  head()
```

```
## # A tibble: 6 x 10
##   NAME  POPESTIMATE2010 POPESTIMATE2011 POPESTIMATE2012 POPESTIMATE2013
##   <chr>           <dbl>           <dbl>           <dbl>           <dbl>
## 1 Abil…          165583          166616          167447          167472
## 2 Call…           13513           13511           13488           13501
## 3 Jone…           20237           20266           19870           20034
## 4 Tayl…          131833          132839          134089          133937
## 5 Akro…          703035          703123          702080          703625
## 6 Port…          161389          161857          161375          161691
## # … with 5 more variables: POPESTIMATE2014 <dbl>, POPESTIMATE2015 <dbl>,
## #   POPESTIMATE2016 <dbl>, POPESTIMATE2017 <dbl>, POPESTIMATE2018 <dbl>
```

---

### Wide-to-Long with `pivot_longer()`

Why did they not set up the data with the following structure instead? This would make much more sense than having each year be a column all its own.

| NAME | YEAR | POPULATION |
| :-- | :-- | :-- |
| Abilene, TX | 2010 | 165583 |
| Abilene, TX | 2011 | 166616 |
| Abilene, TX | 2012 | 167447 |
| .... | .... | .... |
| Callahan County, TX | 2010 | 13513 |
| Callahan County, TX | 2011 | 13511 |
| Callahan County, TX | 2012 | 13488 |
| .... | .... | .... |
---

Well, we can easily create the proper structure of the data-set, starting as shown below ...

.pull-left[

```r
cbsa01 %>%
  group_by(NAME) %>%
  pivot_longer(
    names_to = "variable",
    values_to = "POPULATION",
    2:10
    ) -> cbsa01.long

cbsa01.long %>%
  head()
```

```
## # A tibble: 6 x 3
## # Groups:   NAME [1]
##   NAME        variable        POPULATION
##   <chr>       <chr>                <dbl>
## 1 Abilene, TX POPESTIMATE2010     165583
## 2 Abilene, TX POPESTIMATE2011     166616
## 3 Abilene, TX POPESTIMATE2012     167447
## 4 Abilene, TX POPESTIMATE2013     167472
## 5 Abilene, TX POPESTIMATE2014     168355
## 6 Abilene, TX POPESTIMATE2015     169704
```

]

.pull-right[

This is what each piece of code does ...

| code | what it does ... |
| :-- | :-- |
| names_to = | names the new column that will hold the old column names (here, POPESTIMATE2010, ..., POPESTIMATE2018) |
| values_to = | names the new column in which the values will be stored |
| 2:10 | identifies the columns that will be pivoted from wide to long |
| group_by() | holds unique combinations of whatever column names you put in `group_by()` fixed while the other columns are pivoted |

]

---

I still need to clean up the `variable` column so that it only shows the four-digit year rather than POPESTIMATE2010, and so on. Let us do this next, splitting after the 11th character since "POPESTIMATE" is exactly 11 characters long.

```r
cbsa01.long %>%
  separate(
    col = variable,
    into = c("todiscard", "toyear"),
    sep = 11,
    remove = TRUE) -> cbsa01.long2

cbsa01.long2 %>%
  mutate(YEAR = as.numeric(toyear)) %>%
  select(c(NAME, YEAR, POPULATION)) -> cbsa01.long3

cbsa01.long3 %>%
  head()
```

```
## # A tibble: 6 x 3
## # Groups:   NAME [1]
##   NAME         YEAR POPULATION
##   <chr>       <dbl>      <dbl>
## 1 Abilene, TX  2010     165583
## 2 Abilene, TX  2011     166616
## 3 Abilene, TX  2012     167447
## 4 Abilene, TX  2013     167472
## 5 Abilene, TX  2014     168355
## 6 Abilene, TX  2015     169704
```

---

### Long-to-wide with `pivot_wider()`

Say the data-set was, instead, the one shown below.
This data-set comes from the 2017 American Community Survey. Along with state FIPS codes (`GEOID`) and state names (`NAME`), it has data on `income` = median yearly income, `rent` = median monthly rent, and `moe` = the margin of error at the 90% confidence level.

.pull-left[

```r
us_rent_income %>%
  head()
```

```
## # A tibble: 6 x 5
##   GEOID NAME    variable estimate   moe
##   <chr> <chr>   <chr>       <dbl> <dbl>
## 1 01    Alabama income      24476   136
## 2 01    Alabama rent          747     3
## 3 02    Alaska  income      32940   508
## 4 02    Alaska  rent         1200    13
## 5 04    Arizona income      27517   148
## 6 04    Arizona rent          972     4
```

]

.pull-right[

Notice that the setup looks weird because two different variables have been combined in a single column. Instead, the data-set should have been set up as follows:

| GEOID | NAME | income | rent | moe_income | moe_rent |
| :-- | :-- | --: | --: | --: | --: |
| 01 | Alabama | 24476 | 747 | 136 | 3 |
| 02 | Alaska | 32940 | 1200 | 508 | 13 |
| 04 | Arizona | 27517 | 972 | 148 | 4 |
| ... | ... | ... | ... | ... | ... |

]

---

Well, this can be achieved with the `pivot_wider()` function, which takes the data from the "long" format to the "wide" format.

.pull-left[

```r
us_rent_income %>%
  group_by(GEOID, NAME) %>%
  pivot_wider(
    names_from = variable,
    values_from = c(estimate, moe)
    ) -> usri.wide

usri.wide %>%
  head()
```

```
## # A tibble: 6 x 6
## # Groups:   GEOID, NAME [6]
##   GEOID NAME       estimate_income estimate_rent moe_income moe_rent
##   <chr> <chr>                <dbl>         <dbl>      <dbl>    <dbl>
## 1 01    Alabama              24476           747        136        3
## 2 02    Alaska               32940          1200        508       13
## 3 04    Arizona              27517           972        148        4
## 4 05    Arkansas             23789           709        165        5
## 5 06    California           29454          1358        109        3
## 6 08    Colorado             32401          1125        109        5
```

]

.pull-right[

Here is what each piece of code does ...

| code | what it does ... |
| :-- | :-- |
| names_from = | identifies the column whose unique values will become the names of the new columns |
| values_from = | identifies the column(s) from which the values should be assigned to the new columns |
| group_by() | holds unique value combinations of whatever column names you put in `group_by()` fixed while the rows are pivoted to new columns |

]

---

### <svg style="height:0.8em;top:.04em;position:relative;fill:darkred;" viewBox="0 0 384 512"><path d="M216 23.858c0-23.802-30.653-32.765-44.149-13.038C48 191.851 224 200 224 288c0 35.629-29.114 64.458-64.85 63.994C123.98 351.538 96 322.22 96 287.046v-85.51c0-21.703-26.471-32.225-41.432-16.504C27.801 213.158 0 261.332 0 320c0 105.869 86.131 192 192 192s192-86.131 192-192c0-170.29-168-193.003-168-296.142z"/></svg> The example that follows is a tricky one, so be careful!!

With the `cbsa` data we could use a combination of `pivot_longer()` and `pivot_wider()`.

```r
cbsa %>%
  select(3:5, 8:88) %>%
  group_by(NAME) %>%
  pivot_longer(
    names_to = "variable",
    values_to = "estimate",
    4:84
    ) -> cbsa.01

cbsa.01 %>%
  head()
```

```
## # A tibble: 6 x 5
## # Groups:   NAME [1]
##   STCOU NAME        LSAD                          variable        estimate
##   <chr> <chr>       <chr>                         <chr>              <dbl>
## 1 <NA>  Abilene, TX Metropolitan Statistical Area POPESTIMATE2010   165583
## 2 <NA>  Abilene, TX Metropolitan Statistical Area POPESTIMATE2011   166616
## 3 <NA>  Abilene, TX Metropolitan Statistical Area POPESTIMATE2012   167447
## 4 <NA>  Abilene, TX Metropolitan Statistical Area POPESTIMATE2013   167472
## 5 <NA>  Abilene, TX Metropolitan Statistical Area POPESTIMATE2014   168355
## 6 <NA>  Abilene, TX Metropolitan Statistical Area POPESTIMATE2015   169704
```

---

Now I will clean up the contents of `cbsa.01` so that year is a separate column.
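The trick is a regular-expression separator. The lookahead `(?=[[:digit:]])` matches the zero-width position just before a digit, so the split consumes no characters, and `extra = "merge"` keeps everything from the first digit onward together in one piece. A small sketch on a made-up column (the `toy` data frame is invented for illustration):

```r
library(tidyverse)

toy <- data.frame(variable = c("POPESTIMATE2010", "BIRTHS2011", "NPOPCHG_2012"))

toy %>%
  separate(
    col = variable,
    into = c("vartype", "year"),
    sep = "(?=[[:digit:]])",  # split just before the first digit
    extra = "merge"           # merge any later splits into `year`
  )
```

This should yield `vartype` = "POPESTIMATE", "BIRTHS", "NPOPCHG_" and `year` = "2010", "2011", "2012" -- note the trailing underscore left on "NPOPCHG_", harmless here but worth knowing about.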
```r
cbsa.01 %>%
  separate(
    col = "variable",
    into = c("vartype", "year"),
    sep = "(?=[[:digit:]])",
    extra = "merge",
    remove = FALSE
    ) -> cbsa.02

cbsa.02 %>%
  head()
```

```
## # A tibble: 6 x 7
## # Groups:   NAME [1]
##   STCOU NAME       LSAD                    variable      vartype   year  estimate
##   <chr> <chr>      <chr>                   <chr>         <chr>     <chr>    <dbl>
## 1 <NA>  Abilene, … Metropolitan Statistic… POPESTIMATE2… POPESTIM… 2010    165583
## 2 <NA>  Abilene, … Metropolitan Statistic… POPESTIMATE2… POPESTIM… 2011    166616
## 3 <NA>  Abilene, … Metropolitan Statistic… POPESTIMATE2… POPESTIM… 2012    167447
## 4 <NA>  Abilene, … Metropolitan Statistic… POPESTIMATE2… POPESTIM… 2013    167472
## 5 <NA>  Abilene, … Metropolitan Statistic… POPESTIMATE2… POPESTIM… 2014    168355
## 6 <NA>  Abilene, … Metropolitan Statistic… POPESTIMATE2… POPESTIM… 2015    169704
```

---

Now the final flip to wide format ...

```r
cbsa.02 %>%
  select(c(2, 5:7)) %>%
  group_by(NAME, year) %>%
  pivot_wider(
    names_from = "vartype",
    values_from = "estimate"
    ) -> cbsa.03

cbsa.03 %>%
  glimpse()
```

```
## Observations: 25,083
## Variables: 11
## Groups: NAME, year [25,083]
## $ NAME             <chr> "Abilene, TX", "Abilene, TX", "Abilene, TX", "Abilene…
## $ year             <chr> "2010", "2011", "2012", "2013", "2014", "2015", "2016…
## $ POPESTIMATE      <list> [165583, 166616, 167447, 167472, 168355, 169704, 170…
## $ NPOPCHG          <list> [337, 1033, 831, 25, 883, 1349, 314, 498, 935, -33, …
## $ BIRTHS           <list> [540, 2295, 2358, 2390, 2382, 2417, 2379, 2427, 2381…
## $ DEATHS           <list> [406, 1506, 1587, 1694, 1598, 1698, 1726, 1705, 1739…
## $ NATURALINC       <list> [134, 789, 771, 696, 784, 719, 653, 722, 642, -29, -…
## $ INTERNATIONALMIG <list> [84, 205, 516, 361, 419, 484, 388, 325, 282, 0, 4, 5…
## $ DOMESTICMIG      <list> [124, 54, -448, -1051, -301, 162, -723, -544, 19, -3…
## $ NETMIG           <list> [208, 259, 68, -690, 118, 646, -335, -219, 301, -3, …
## $ RESIDUAL         <list> [-5, -15, -8, 19, -19, -16, -4, -5, -8, -1, -1, -2, …
```

Notice that the new columns are `<list>` columns rather than plain numeric columns. This happens because, once we drop `STCOU` and `LSAD`, `NAME` plus `year` no longer uniquely identifies every row (a few names likely appear more than once, e.g., as both a CBSA and a county), so `pivot_wider()` protects the clashing values by wrapping them in lists. Retaining `LSAD` or `STCOU` in the `select()` and `group_by()` would avoid this.

---

class: right, middle

<img class="circle" src="https://github.com/aniruhil.png"
width="175px"/> # Find me at... [<svg style="height:0.8em;top:.04em;position:relative;" viewBox="0 0 512 512"><path d="M459.37 151.716c.325 4.548.325 9.097.325 13.645 0 138.72-105.583 298.558-298.558 298.558-59.452 0-114.68-17.219-161.137-47.106 8.447.974 16.568 1.299 25.34 1.299 49.055 0 94.213-16.568 130.274-44.832-46.132-.975-84.792-31.188-98.112-72.772 6.498.974 12.995 1.624 19.818 1.624 9.421 0 18.843-1.3 27.614-3.573-48.081-9.747-84.143-51.98-84.143-102.985v-1.299c13.969 7.797 30.214 12.67 47.431 13.319-28.264-18.843-46.781-51.005-46.781-87.391 0-19.492 5.197-37.36 14.294-52.954 51.655 63.675 129.3 105.258 216.365 109.807-1.624-7.797-2.599-15.918-2.599-24.04 0-57.828 46.782-104.934 104.934-104.934 30.213 0 57.502 12.67 76.67 33.137 23.715-4.548 46.456-13.32 66.599-25.34-7.798 24.366-24.366 44.833-46.132 57.827 21.117-2.273 41.584-8.122 60.426-16.243-14.292 20.791-32.161 39.308-52.628 54.253z"/></svg> @aruhil](http://twitter.com/aruhil) [<svg style="height:0.8em;top:.04em;position:relative;" viewBox="0 0 512 512"><path d="M326.612 185.391c59.747 59.809 58.927 155.698.36 214.59-.11.12-.24.25-.36.37l-67.2 67.2c-59.27 59.27-155.699 59.262-214.96 0-59.27-59.26-59.27-155.7 0-214.96l37.106-37.106c9.84-9.84 26.786-3.3 27.294 10.606.648 17.722 3.826 35.527 9.69 52.721 1.986 5.822.567 12.262-3.783 16.612l-13.087 13.087c-28.026 28.026-28.905 73.66-1.155 101.96 28.024 28.579 74.086 28.749 102.325.51l67.2-67.19c28.191-28.191 28.073-73.757 0-101.83-3.701-3.694-7.429-6.564-10.341-8.569a16.037 16.037 0 0 1-6.947-12.606c-.396-10.567 3.348-21.456 11.698-29.806l21.054-21.055c5.521-5.521 14.182-6.199 20.584-1.731a152.482 152.482 0 0 1 20.522 17.197zM467.547 44.449c-59.261-59.262-155.69-59.27-214.96 0l-67.2 67.2c-.12.12-.25.25-.36.37-58.566 58.892-59.387 154.781.36 214.59a152.454 152.454 0 0 0 20.521 17.196c6.402 4.468 15.064 3.789 20.584-1.731l21.054-21.055c8.35-8.35 12.094-19.239 11.698-29.806a16.037 16.037 0 0 
0-6.947-12.606c-2.912-2.005-6.64-4.875-10.341-8.569-28.073-28.073-28.191-73.639 0-101.83l67.2-67.19c28.239-28.239 74.3-28.069 102.325.51 27.75 28.3 26.872 73.934-1.155 101.96l-13.087 13.087c-4.35 4.35-5.769 10.79-3.783 16.612 5.864 17.194 9.042 34.999 9.69 52.721.509 13.906 17.454 20.446 27.294 10.606l37.106-37.106c59.271-59.259 59.271-155.699.001-214.959z"/></svg> aniruhil.org](https://aniruhil.org) [<svg style="height:0.8em;top:.04em;position:relative;" viewBox="0 0 512 512"><path d="M476 3.2L12.5 270.6c-18.1 10.4-15.8 35.6 2.2 43.2L121 358.4l287.3-253.2c5.5-4.9 13.3 2.6 8.6 8.3L176 407v80.5c0 23.6 28.5 32.9 42.5 15.8L282 426l124.6 52.2c14.2 6 30.4-2.9 33-18.2l72-432C515 7.8 493.3-6.8 476 3.2z"/></svg> ruhil@ohio.edu](mailto:ruhil@ohio.edu)