class: title-slide, center, middle background-image: url(images/theridges.jpeg) background-size: cover # .large[.fancy[Tidy Data]] ## .fancy[Ani Ruhil] --- class: inverse, middle, center <center><img src = "images/tidypkgs.png", width = 400px></center> --- ## Agenda - Overview of the `tidyverse` > "an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures." - `dplyr` > "dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges" - `tidyr` > "Tidy data describes a standard way of storing data that is used wherever possible throughout the tidyverse. If you ensure that your data is tidy, you’ll spend less time fighting with the tools and more time working on your analysis." --- class: inverse, middle, center <center><img src = "images/hex-dplyr.png", width = 200px></center> --- The core functions are: - `mutate()` ... to add new variables that are functions of existing variables - `transmute()` ... to add new variables and drop the rest of the variables - `select()` ... to pick a subset of variables to work with while dropping the rest - `filter()` ... to select a subset of observations to work with - `summarise()` ... to calculate quantities of interest (rather than add these quantities as new variables to the existing data frame) and save these as a standalone table - `arrange()` ... order the rows based on out preferences - `distinct()` ... to eliminate duplicate observations There are several other commands that belong to `dplyr` ... [here is the cheatsheet](https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf) Let us see `dplyr` in action with the xclass and xstudent data frames ```r load("data/xclass.RData") load("data/xstudent.RData") ``` --- class: inverse, middle, center <center><img src = "images/pipe.jpg", width = 400px></center> # `%>%` --- # The pipe operator The `%>%` operator is useful for `chaining a sequence of operations` into a single command ```r the_data <- read.csv('/path/to/data/file.csv') %>% subset(variable_a > x) %>% transform(variable_c = variable_a/variable_b) ``` This command means ... (1) First read the csv file `THEN` (2) subset the data to cases where variable_a values are greater than some value g `THEN` (3) create variable_c as the ratio of variable_a to variable_b The result is saved in `the_data` --- class: inverse, middle, center # .fat[.fancy[distinct()]] <img src = "images/triplets.gif" /img> See also [Garrett and Hadley](http://r4ds.had.co.nz) --- # Eliminating duplicates You may have duplicate cases in your data set. How do you eliminate them? Let us assume we only need one record per student in `xclass`<sup>1</sup> ```r library(tidyverse) xclass.df1 <- xclass %>% distinct(anon_id, .keep_all = TRUE) ``` .footnote[1: While these may well be legitimate records, for training purposes I am assuming each student should only show up once.] But a student could be enrolled on multiple campuses and have multiple majors ```r xclass.df2 <- xclass %>% distinct(anon_id, campus_name, term_code, department_desc, .keep_all = TRUE) ``` If more than one variable is specified in `distinct()` then `dplyr` looks for unique combinations of these variables. 
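A minimal toy sketch of this behaviour (the `toy` tibble and its columns are made up for illustration; they are not part of the course data):

```r
toy <- tibble(id    = c("A1", "A1", "B2"),
              major = c("Math", "Math", "Art"),
              gpa   = c(3.1, 3.4, 3.8))

# two rows survive -- one per unique id-major combination;
# with .keep_all = TRUE the gpa from the first matching row is retained
toy %>% distinct(id, major, .keep_all = TRUE)
```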
Each unique combination `\(=\)` a unique observation `.keep_all = TRUE` is asking that all columns be retained after eliminating duplicate observations --- class: inverse, middle, center #.fat[.fancy[Merging data]] <img src = "images/joins.png", width = 500px> --- Say I have data on some students, with test scores in one file and their gender in another file. Each file has a unique student identifier, called `ID`. How can I create a unified data-set? Let us see each data-set first. ```r data1 = tibble(Score = c(10, 21, 33, 12, 35, 67, 43, 99), ID = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "C31", "Q22") ) data2 = tibble(Sex = c("Male", "Female", "Male", "Female", "Male", "Female", "Male", "Female"), ID = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "E52", "H71") ) ``` Open up both data-sets and note that students C31 and Q22 are missing from `data2` and students E52 and H71 are missing from `data1`. --- ### ... here are the two data-sets side-by-side .pull-left[ ```{=html} <div id="htmlwidget-a74f2629c9702d0961ec" style="width:100%;height:auto;" class="datatables html-widget"></div> <script type="application/json" data-for="htmlwidget-a74f2629c9702d0961ec">{"x":{"filter":"none","data":[["1","2","3","4","5","6","7","8"],[10,21,33,12,35,67,43,99],["A12","A23","Z14","WX1","Y31","D66","C31","Q22"]],"container":"<table class=\"display\">\n <thead>\n <tr>\n <th> <\/th>\n <th>Score<\/th>\n <th>ID<\/th>\n <\/tr>\n <\/thead>\n<\/table>","options":{"columnDefs":[{"className":"dt-center","targets":[1,2]},{"orderable":false,"targets":0}],"filter":"top","rownames":false,"class":"compact","order":[],"autoWidth":false,"orderClasses":false}},"evals":[],"jsHooks":[]}</script> ``` ] .pull-right[ ```{=html} <div id="htmlwidget-d1f40ace4abc0833e865" style="width:100%;height:auto;" class="datatables html-widget"></div> <script type="application/json" data-for="htmlwidget-d1f40ace4abc0833e865">{"x":{"filter":"none","data":[["1","2","3","4","5","6","7","8"],["Male","Female","Male","Female","Male","Female","Male","Female"],["A12","A23","Z14","WX1","Y31","D66","E52","H71"]],"container":"<table class=\"display\">\n <thead>\n <tr>\n <th> <\/th>\n <th>Sex<\/th>\n <th>ID<\/th>\n <\/tr>\n <\/thead>\n<\/table>","options":{"columnDefs":[{"className":"dt-center","targets":[1,2]},{"orderable":false,"targets":0}],"filter":"top","rownames":false,"class":"compact","order":[],"autoWidth":false,"orderClasses":false}},"evals":[],"jsHooks":[]}</script> ``` ] --- ## Mutating joins These combine data frames but to do so we need `merge key(s)` -- variable(s) that identifies unique observations in each file. Next, you need to decide on what you want. `natural join` ... merge observations that show up in both files ```r natural <- inner_join(x = data1, y = data2, by = "ID") natural ``` ``` ## # A tibble: 6 x 3 ## Score ID Sex ## <dbl> <chr> <chr> ## 1 10 A12 Male ## 2 21 A23 Female ## 3 33 Z14 Male ## 4 12 WX1 Female ## 5 35 Y31 Male ## 6 67 D66 Female ``` --- class: center, middle <img src = "images/animated-inner-join.gif" /img> --- `full outer join` ... merge everything even if some cases show up in one but not the other ```r full <- full_join(x = data1, y = data2, by = "ID") full ``` ``` ## # A tibble: 10 x 3 ## Score ID Sex ## <dbl> <chr> <chr> ## 1 10 A12 Male ## 2 21 A23 Female ## 3 33 Z14 Male ## 4 12 WX1 Female ## 5 35 Y31 Male ## 6 67 D66 Female ## 7 43 C31 <NA> ## 8 99 Q22 <NA> ## 9 NA E52 Male ## 10 NA H71 Female ``` --- class: center, middle <img src = "images/animated-full-join.gif" /img> --- `left outer join` ... 
merge such that all cases in file `x` are retained but only those seen in file `y` are joined ```r left <- left_join(x = data1, y = data2, by = "ID") left ``` ``` ## # A tibble: 8 x 3 ## Score ID Sex ## <dbl> <chr> <chr> ## 1 10 A12 Male ## 2 21 A23 Female ## 3 33 Z14 Male ## 4 12 WX1 Female ## 5 35 Y31 Male ## 6 67 D66 Female ## 7 43 C31 <NA> ## 8 99 Q22 <NA> ``` --- class: center, middle <img src = "images/animated-left-join.gif" /img> --- `right outer join` ... merge to keep all cases in `y` and only those cases are to be merged from `x` that show up in `y` ```r right <- right_join(x = data1, y = data2, by = "ID") right ``` ``` ## # A tibble: 8 x 3 ## Score ID Sex ## <dbl> <chr> <chr> ## 1 10 A12 Male ## 2 21 A23 Female ## 3 33 Z14 Male ## 4 12 WX1 Female ## 5 35 Y31 Male ## 6 67 D66 Female ## 7 NA E52 Male ## 8 NA H71 Female ``` --- class: center, middle <img src = "images/animated-right-join.gif" /img> --- What if the ID variables had different names in the files? You could rename them to have a common name or you could use `by.x =` and `by.y = ` as shown below. ```r data3 <- tibble(Score = c(10, 21, 33, 12, 35, 67, 43, 99), ID1 = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "C31", "Q22") ) data4 <- tibble(Sex = c("Male", "Female", "Male", "Female", "Male", "Female", "Male", "Female"), ID2 = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "E52", "H71") ) diff.full <- full_join(data3, data4, by = c("ID1" = "ID2")) ``` --- You can also have more than one merge key. For example, if I am merging data for Ohio schools, each district has a district ID number `(dirn)`, each school has building ID number `(birn)`, and then the district's name `(district)`, the building's name `(building)`, and so on. If I am merging these data then my `by = ` statement will be `by = c("dirn", "birn", "district", "building")`. Let us work through one example. Load the `sm_xstudent` data frame and make sure the `xclass` data frame is loaded as well. Look at `sm_xstudent` ... it has six variables, three of which are also in xclass -- `term_code`, `anon_id`, and `sex_code`. The other three variables are not present in `xclass` -- `ethnicity_desc`, `residency_code`, and `university_ftpt`. We will use the three common variables (the `keys`) to merge `sm_xstudent` and `xclass` ```r load("data/sm_xstudent.RData") multkey.merge <- full_join( sm_xstudent, xclass, by = c("term_code" = "term_code", "anon_id" = "anon_id", "sex_code" = "sex_code") ) ``` Open `multkey.merge` and see the merged fields --- If we have multiple data-frames to merge we can do so in several ways but my default approach is to rely on the `Reduce()` command. `Note: This is base R code` ```r df1 <- tibble(Score = c(10, 21, 33, 12, 35, 67, 43, 99), ID = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "C31", "Q22") ) df2 <- tibble(Sex = c("Male", "Female", "Male", "Female", "Male", "Female", "Male", "Female"), ID = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "E52", "H71") ) df3 <- tibble(Age = c(6, 7, 6, 8, 8, 9, 10, 5), ID = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "E52", "H71") ) my.list <- list(df1, df2, df3) df.123 <- Reduce(function(...) merge(..., by = c("ID"), all = TRUE), my.list) ``` --- Here is the merged data-frame ... 
<table class="table table-striped" style="margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> ID </th> <th style="text-align:right;"> Score </th> <th style="text-align:left;"> Sex </th> <th style="text-align:right;"> Age </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> A12 </td> <td style="text-align:right;"> 10 </td> <td style="text-align:left;"> Male </td> <td style="text-align:right;"> 6 </td> </tr> <tr> <td style="text-align:left;"> A23 </td> <td style="text-align:right;"> 21 </td> <td style="text-align:left;"> Female </td> <td style="text-align:right;"> 7 </td> </tr> <tr> <td style="text-align:left;"> C31 </td> <td style="text-align:right;"> 43 </td> <td style="text-align:left;"> NA </td> <td style="text-align:right;"> NA </td> </tr> <tr> <td style="text-align:left;"> D66 </td> <td style="text-align:right;"> 67 </td> <td style="text-align:left;"> Female </td> <td style="text-align:right;"> 9 </td> </tr> <tr> <td style="text-align:left;"> E52 </td> <td style="text-align:right;"> NA </td> <td style="text-align:left;"> Male </td> <td style="text-align:right;"> 10 </td> </tr> <tr> <td style="text-align:left;"> H71 </td> <td style="text-align:right;"> NA </td> <td style="text-align:left;"> Female </td> <td style="text-align:right;"> 5 </td> </tr> <tr> <td style="text-align:left;"> Q22 </td> <td style="text-align:right;"> 99 </td> <td style="text-align:left;"> NA </td> <td style="text-align:right;"> NA </td> </tr> <tr> <td style="text-align:left;"> WX1 </td> <td style="text-align:right;"> 12 </td> <td style="text-align:left;"> Female </td> <td style="text-align:right;"> 8 </td> </tr> <tr> <td style="text-align:left;"> Y31 </td> <td style="text-align:right;"> 35 </td> <td style="text-align:left;"> Male </td> <td style="text-align:right;"> 8 </td> </tr> <tr> <td style="text-align:left;"> Z14 </td> <td style="text-align:right;"> 33 </td> <td style="text-align:left;"> Male </td> <td style="text-align:right;"> 6 </td> </tr> </tbody> </table> --- ## Filtering joins These only retain columns in the `x` data frame .pull-left[ `semi_join()` ... will keep only those rows found in `x` and `y` but keep only the columns in `x` ```r semi <- semi_join(data1, data2, by = "ID") semi ``` ``` ## # A tibble: 6 x 2 ## Score ID ## <dbl> <chr> ## 1 10 A12 ## 2 21 A23 ## 3 33 Z14 ## 4 12 WX1 ## 5 35 Y31 ## 6 67 D66 ``` ] .pull-right[ `anti_join()` ... will keep only those rows [not found]() in `x` and `y` but keep only the columns in `x` ```r anti <- anti_join(data1, data2, by = "ID") anti ``` ``` ## # A tibble: 2 x 2 ## Score ID ## <dbl> <chr> ## 1 43 C31 ## 2 99 Q22 ``` ] --- class: center, middle .pull-left[ <img src = "images/animated-semi-join.gif" /img> ] .pull-right[ <img src = "images/animated-anti-join.gif" /img> ] --- ## The oft-forgotten `rbind()` What if we have to stack data frames atop one another, for example, because one data frame is from one term and the second from another term? No problem. Load the `rb1` (where `term_code == 20171C`), `rb2` (where `term_code == 20181C`), and `rb3` (where `term_code == 20191C`) data frames. 
```r
load("data/rb1.RData")
load("data/rb2.RData")
load("data/rb3.RData")

my.dfs <- rbind(rb1, rb2, rb3)
```

---
name: section
class: top, inverse

---
name: yourturn
template: section

.left-column[
# <i class = "fas fa-edit"></i><br>.fancy[Your turn]
]

---
name: yourturn1
template: yourturn

.right-column[ .large[(1) Load `xclass` and `sm2_xstudent`

(2) Conduct a full_join, paying attention to the merge keys that should be used

(3) Conduct a semi_join with `xclass` as the `x` data frame]
]

.left-column[
## .fancy[.saltinline[10:00 minutes]]
]

```{=html}
<div class="countdown blink-colon noupdate-15" id="timer_5fde929a" style="right:33%;bottom:0;" data-audio="true" data-warnwhen="0">
<code class="countdown-time"><span class="countdown-digits minutes">10</span><span class="countdown-digits colon">:</span><span class="countdown-digits seconds">00</span></code>
</div>
```

---
class: inverse, middle, center

# .fat[.fancy[filter()]]

<img src = "https://cdn.theatlantic.com/assets/media/img/posts/2019/04/Kopitiam_Kuala_Lumpur_2-2/1de5fed84.png" width = 600px /img>

---

Say I want to select only the online students in `multkey.merge`

```r
df.1 <- multkey.merge %>%
  filter(online.f == "Online")
```

What if I wanted online students in Athens?

```r
df.2 <- multkey.merge %>%
  filter(online.f == "Online", campus_name == "Athens")
```

What if I only want students in selected colleges?

```r
df.3 <- multkey.merge %>%
  filter(college_desc %in% c("Education", "Fine Arts"))
```

What if I only want students in all other colleges except Fine Arts and Education?

```r
df.4 <- multkey.merge %>%
  filter(!college_desc %in% c("Education", "Fine Arts"))
```

Note the use of `!` to create `not in`. Similarly, `==` means "equal to", `!=` means "not equal to", `|` means "or", and `&` means "and"

---

What if we have complicated filters involving multiple variables? No problem ...
```r df.5 <- multkey.merge %>% filter(!college_desc %in% c("Education", "Fine Arts"), sex_code == "F", enrlld_crdt_hrs > 4 & enrlld_crdt_hrs < 8) head(df.5) ``` ``` ## # A tibble: 6 x 26 ## term_code anon_id ethnicity_desc sex_code residency_code university_ftpt class_nbr ## <chr> <chr> <chr> <chr> <chr> <chr> <chr> ## 1 20171C ANON10… White F R Full Time 08287 ## 2 20171C ANON10… White F R Full Time 08285 ## 3 20171C ANON10… White F R Part Time 08024 ## 4 20171C ANON10… White F R Part Time 08935 ## 5 20171C ANON10… White F R Full Time 09761 ## 6 20171C ANON10… White F R Part Time 09339 ## # … with 19 more variables: campus_name <chr>, campus_dtl_desc <chr>, college_desc <chr>, ## # department_desc <chr>, subject <chr>, catalog_nbr <chr>, class_desc <chr>, ## # delivery_desc <chr>, online_flag <dbl>, rank_group <chr>, rank_desc <chr>, ## # acad_plan <chr>, student_college_desc <chr>, acad_plan_desc <chr>, ## # field_of_study_desc <chr>, residency_desc <chr>, enrolled_flag <dbl>, ## # enrlld_crdt_hrs <dbl>, online.f <fct> ``` --- name: yourturn2 template: yourturn .right-column[ .large[Use `filter()` to select cases that meet ALL of the the following criteria (1) term_code must be 20191C (2) student is not on the Athens camps (3) student is a senior Make sure you save the resulting data frame with some unique name] ] .left-column[ ## .fancy[.saltinline[05:00 minutes]] ] ```{=html} <div class="countdown blink-colon noupdate-15" id="timer_5fde94b1" style="right:33%;bottom:0;" data-audio="true" data-warnwhen="0"> <code class="countdown-time"><span class="countdown-digits minutes">05</span><span class="countdown-digits colon">:</span><span class="countdown-digits seconds">00</span></code> </div> ``` -- .right-column[ ```r my.df <- multkey.merge %>% filter(term_code == "20191C", campus_name != "Athens", rank_desc == "Senior") ``` ] --- class: inverse, middle, center # .fat[.fancy[select()]] <img src = "https://media.giphy.com/media/wHnw9knwgl13G/giphy.gif" width = 300px </img> --- Often you have more columns that you need for some operations, and so you trim the data by retaining (or discarding) specific columns. When you have to do this, it helps to know the column names\\numbers ```r names(multkey.merge) ``` ``` ## [1] "term_code" "anon_id" "ethnicity_desc" ## [4] "sex_code" "residency_code" "university_ftpt" ## [7] "class_nbr" "campus_name" "campus_dtl_desc" ## [10] "college_desc" "department_desc" "subject" ## [13] "catalog_nbr" "class_desc" "delivery_desc" ## [16] "online_flag" "rank_group" "rank_desc" ## [19] "acad_plan" "student_college_desc" "acad_plan_desc" ## [22] "field_of_study_desc" "residency_desc" "enrolled_flag" ## [25] "enrlld_crdt_hrs" "online.f" ``` Say I want the first 4 columns ```r select.1 <- multkey.merge %>% select(c(1:4)) names(select.1) ``` ``` ## [1] "term_code" "anon_id" "ethnicity_desc" "sex_code" ``` --- What if my columns are not sequential? No worries ... ```r select.2 <- multkey.merge %>% select(c(1, 2, 6, 9:11)) names(select.2) ``` ``` ## [1] "term_code" "anon_id" "university_ftpt" "campus_dtl_desc" ## [5] "college_desc" "department_desc" ``` Could I have called for the desired columns by name? Sure ... ```r select.3 <- multkey.merge %>% select(c(term_code, class_nbr, rank_desc, acad_plan)) names(select.3) ``` ``` ## [1] "term_code" "class_nbr" "rank_desc" "acad_plan" ``` --- Could I call out desired variables by a specific string in the column name? 
```r select.4 <- multkey.merge %>% select(contains("desc")) names(select.4) ``` ``` ## [1] "ethnicity_desc" "campus_dtl_desc" "college_desc" ## [4] "department_desc" "class_desc" "delivery_desc" ## [7] "rank_desc" "student_college_desc" "acad_plan_desc" ## [10] "field_of_study_desc" "residency_desc" ``` What if it begins with or ends with some string? .pull-left[ ```r select.5 <- multkey.merge %>% select(ends_with("desc")) names(select.5) ``` ``` ## [1] "ethnicity_desc" "campus_dtl_desc" "college_desc" ## [4] "department_desc" "class_desc" "delivery_desc" ## [7] "rank_desc" "student_college_desc" "acad_plan_desc" ## [10] "field_of_study_desc" "residency_desc" ``` ] .pull-right[ ```r select.6 <- multkey.merge %>% select(starts_with("acad")) names(select.6) ``` ``` ## [1] "acad_plan" "acad_plan_desc" ``` ] This is just the tip of the iceberg ... look at the dplyr documentation and online resources for other variants, including the scoped variants of `select_if()`, `select_at()`, and `select_all()` --- name: yourturn3 template: yourturn .large[Use `select()` to retain the following cases, respectively (1) Columns 3, 4, 8, and 22 (2) Columns that end with the string "flag" (3) Columns that contain the string "plan"] .left-column[ ## .fancy[.saltinline[10:00 minutes]] ] ```{=html} <div class="countdown blink-colon noupdate-15" id="timer_5fde94c2" style="right:33%;bottom:0;" data-audio="true" data-warnwhen="0"> <code class="countdown-time"><span class="countdown-digits minutes">10</span><span class="countdown-digits colon">:</span><span class="countdown-digits seconds">00</span></code> </div> ``` --- (1) Columns 3, 4, 8, and 22 ```r multkey.merge %>% select(c(3:4, 8, 22)) ``` ``` ## # A tibble: 488,560 x 4 ## ethnicity_desc sex_code campus_name field_of_study_desc ## <chr> <chr> <chr> <chr> ## 1 Two or More Races F Athens Counselor Education ## 2 White F Athens Art ## 3 White F Athens Art ## 4 Asian M Athens International Affairs ## 5 Asian M Athens International Affairs ## 6 Asian M Athens International Affairs ## 7 Asian M Athens International Affairs ## 8 White M Athens Management Systems ## 9 White M Athens Management Systems ## 10 Race/Ethnicity Unknown F Athens Undecided Education ## # … with 488,550 more rows ``` --- (2) Columns that end with the string "flag" ```r multkey.merge %>% select(ends_with("flag")) ``` ``` ## # A tibble: 488,560 x 2 ## online_flag enrolled_flag ## <dbl> <dbl> ## 1 0 1 ## 2 0 1 ## 3 0 1 ## 4 0 1 ## 5 0 1 ## 6 0 1 ## 7 0 1 ## 8 1 1 ## 9 1 1 ## 10 0 1 ## # … with 488,550 more rows ``` --- (3) Columns that contain the string "plan" ```r multkey.merge %>% select(contains("plan")) ``` ``` ## # A tibble: 488,560 x 2 ## acad_plan acad_plan_desc ## <chr> <chr> ## 1 ND6274 School Counseling ## 2 MF5052 Painting and Drawing ## 3 MF5052 Painting and Drawing ## 4 MA4216 Asian Studies ## 5 MA4216 Asian Studies ## 6 MA4216 Asian Studies ## 7 MA4216 Asian Studies ## 8 MB6157 MBA Business Analytics ## 9 MB6157 MBA Business Analytics ## 10 ND9915 EHS-Nondegree ## # … with 488,550 more rows ``` --- class: inverse, middle, center # .fat[.fancy[summarise() and mutate()]] --- Say I want to calculate some quantities of interest --- frequencies perhaps, or means, standard deviations, etc. Let us start with mean, medians, etc. 
```r tab1 <- multkey.merge %>% summarise( Mean.cr = mean(enrlld_crdt_hrs, na.rm = TRUE), Median.cr = median(enrlld_crdt_hrs, na.rm = TRUE), StdDev.cr = sd(enrlld_crdt_hrs, na.rm = TRUE), Var.cr = var(enrlld_crdt_hrs, na.rm = TRUE), Q1.cr = quantile(enrlld_crdt_hrs, p = c(0.25), na.rm = TRUE), Q3.cr = quantile(enrlld_crdt_hrs, p = c(0.75), na.rm = TRUE) ) tab1 ``` ``` ## # A tibble: 1 x 6 ## Mean.cr Median.cr StdDev.cr Var.cr Q1.cr Q3.cr ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 2.80 3 1.37 1.88 3 3 ``` Note that `tab1` is a table data frame with only what we asked for --- What if we wanted `frequencies`? Assume we want to know enrollment by college. .pull-left[ ```r tab2 <- multkey.merge %>% group_by(college_desc) %>% summarise(frequency = n()) %>% arrange(frequency) tab2 ``` ``` ## # A tibble: 14 x 2 ## college_desc frequency ## <chr> <int> ## 1 Miscellaneous 308 ## 2 Honors Tutorial 481 ## 3 International Studies 921 ## 4 George Voinovich School 1317 ## 5 Osteopathic Medicine 12170 ## 6 University College 17542 ## 7 Regional Higher Ed 19043 ## 8 Engineering & Technology 27094 ## 9 Communication 33312 ## 10 Fine Arts 34296 ## 11 Education 37544 ## 12 Business 53606 ## 13 Health Sciences & Professions 77085 ## 14 Arts & Sciences 173841 ``` ] .pull-right[ ```r tab3 <- multkey.merge %>% group_by(college_desc) %>% summarise(frequency = n()) %>% arrange(-frequency) tab3 ``` ``` ## # A tibble: 14 x 2 ## college_desc frequency ## <chr> <int> ## 1 Arts & Sciences 173841 ## 2 Health Sciences & Professions 77085 ## 3 Business 53606 ## 4 Education 37544 ## 5 Fine Arts 34296 ## 6 Communication 33312 ## 7 Engineering & Technology 27094 ## 8 Regional Higher Ed 19043 ## 9 University College 17542 ## 10 Osteopathic Medicine 12170 ## 11 George Voinovich School 1317 ## 12 International Studies 921 ## 13 Honors Tutorial 481 ## 14 Miscellaneous 308 ``` ] --- Frequencies are well and good but we may want `percentages` .pull-left[ ```r tab4 <- multkey.merge %>% group_by(college_desc) %>% summarise(frequency = n()) %>% mutate(percent = (frequency / sum(frequency)) * 100, percent = round(percent, digits = 2)) %>% arrange(percent) tab4 ``` ``` ## # A tibble: 14 x 3 ## college_desc frequency percent ## <chr> <int> <dbl> ## 1 Miscellaneous 308 0.06 ## 2 Honors Tutorial 481 0.1 ## 3 International Studies 921 0.19 ## 4 George Voinovich School 1317 0.27 ## 5 Osteopathic Medicine 12170 2.49 ## 6 University College 17542 3.59 ## 7 Regional Higher Ed 19043 3.9 ## 8 Engineering & Technology 27094 5.55 ## 9 Communication 33312 6.82 ## 10 Fine Arts 34296 7.02 ## 11 Education 37544 7.68 ## 12 Business 53606 11.0 ## 13 Health Sciences & Professions 77085 15.8 ## 14 Arts & Sciences 173841 35.6 ``` ] .pull-right[ ```r tab5 <- multkey.merge %>% group_by(college_desc) %>% summarise(frequency = n()) %>% mutate(percent = (frequency / sum(frequency)) * 100, percent = round(percent, digits = 2)) %>% arrange(-percent) tab5 ``` ``` ## # A tibble: 14 x 3 ## college_desc frequency percent ## <chr> <int> <dbl> ## 1 Arts & Sciences 173841 35.6 ## 2 Health Sciences & Professions 77085 15.8 ## 3 Business 53606 11.0 ## 4 Education 37544 7.68 ## 5 Fine Arts 34296 7.02 ## 6 Communication 33312 6.82 ## 7 Engineering & Technology 27094 5.55 ## 8 Regional Higher Ed 19043 3.9 ## 9 University College 17542 3.59 ## 10 Osteopathic Medicine 12170 2.49 ## 11 George Voinovich School 1317 0.27 ## 12 International Studies 921 0.19 ## 13 Honors Tutorial 481 0.1 ## 14 Miscellaneous 308 0.06 ``` ] --- Perhaps I want frequencies and percentages by `sex_code` 
```r tab6 <- multkey.merge %>% group_by(college_desc, sex_code) %>% summarise(frequency = n()) %>% mutate(percent = (frequency / sum(frequency)) * 100, percent = round(percent, digits = 2)) %>% arrange(college_desc, sex_code, -percent) tab6 ``` ``` ## # A tibble: 31 x 4 ## # Groups: college_desc [14] ## college_desc sex_code frequency percent ## <chr> <chr> <int> <dbl> ## 1 Arts & Sciences F 98267 56.5 ## 2 Arts & Sciences M 75574 43.5 ## 3 Business F 19875 37.1 ## 4 Business M 33731 62.9 ## 5 Communication F 18844 56.6 ## 6 Communication M 14468 43.4 ## 7 Education F 25939 69.1 ## 8 Education M 11605 30.9 ## 9 Engineering & Technology F 4542 16.8 ## 10 Engineering & Technology M 22552 83.2 ## # … with 21 more rows ``` --- Or by `sex_code` and `rank_desc` ```r tab7 <- multkey.merge %>% group_by(college_desc, sex_code, rank_desc) %>% summarise(frequency = n()) %>% mutate(percent = (frequency / sum(frequency)) * 100, percent = round(percent, digits = 1)) %>% arrange(college_desc, sex_code, rank_desc, -percent) tab7 ``` ``` ## # A tibble: 178 x 5 ## # Groups: college_desc, sex_code [31] ## college_desc sex_code rank_desc frequency percent ## <chr> <chr> <chr> <int> <dbl> ## 1 Arts & Sciences F Doctoral Student 1178 1.2 ## 2 Arts & Sciences F Freshman 33519 34.1 ## 3 Arts & Sciences F Junior 16100 16.4 ## 4 Arts & Sciences F Master's Student 3388 3.4 ## 5 Arts & Sciences F Non-Degree Undergraduate 4068 4.1 ## 6 Arts & Sciences F Senior 17321 17.6 ## 7 Arts & Sciences F Sophomore 22693 23.1 ## 8 Arts & Sciences M Doctoral Student 1604 2.1 ## 9 Arts & Sciences M Freshman 26114 34.6 ## 10 Arts & Sciences M Junior 12320 16.3 ## # … with 168 more rows ``` --- class: middle, inverse, center #.fancy[.fat[case_when()]] --- Often you will need to collapse some categories or otherwise modify some values. For example, perhaps I want `sex_code` to be fully spelled out instead of `F` and `M`. How might I do that? .pull-left[ ```r multkey.merge <- multkey.merge %>% mutate( sex.f = case_when( sex_code == "F" ~ "Female", sex_code == "M" ~ "Male" ) ) table(multkey.merge$sex.f) ``` ``` ## ## Female Male ## 277650 210907 ``` ] .pull-right[ ```r multkey.merge <- multkey.merge %>% mutate( sex.f = case_when( sex_code == "F" ~ "Female", TRUE ~ "Male" ) ) table(multkey.merge$sex.f) ``` ``` ## ## Female Male ## 277650 210910 ``` ] --- I can also take a numeric variable and flip it into a categorical variable with `case_when()`. 
For example, say I want to split `enrlld_crdt_hrs` into three groups: those with fewer than 4 hours, those with 4 to 15 hours, and those with more than 15 hours

.pull-left[

```r
multkey.merge <- multkey.merge %>%
  mutate(
    enrolled.f = case_when(
      enrlld_crdt_hrs < 4 ~ "<= 3 hours",
      enrlld_crdt_hrs >= 4 & enrlld_crdt_hrs <= 15 ~ "4 - 15 hours",
      enrlld_crdt_hrs > 15 ~ "> 15 hours"
      )
    )

table(multkey.merge$enrolled.f)
```

```
## 
##   <= 3 hours   > 15 hours 4 - 15 hours 
##       404150           54        84356
```
]

.pull-right[

```r
multkey.merge <- multkey.merge %>%
  mutate(
    enrolled.f = case_when(
      enrlld_crdt_hrs < 4 ~ "<= 3 hours",
      enrlld_crdt_hrs >= 4 & enrlld_crdt_hrs <= 15 ~ "4 - 15 hours",
      TRUE ~ "> 15 hours"
      )
    )

table(multkey.merge$enrolled.f)
```

```
## 
##   <= 3 hours   > 15 hours 4 - 15 hours 
##       404150           54        84356
```
]

---
class: inverse, middle, center

# .fat[.fancy[Dealing with strings]]

<img src = "https://media.giphy.com/media/WuaNt9aTcszyU/giphy.gif" width = 250px </img>

---

## stringr()

Sometimes we may need to eliminate some strings in specific columns, perhaps change their case, or replace a string with another string. All this and more can be done with either of two packages: `stringr` or `stringi`. Let us see `stringr` in action for a few commonly encountered tasks.

For example, `rank_desc` has the word "Student" in the labels. Let us assume we want to get rid of these. Here goes ...

```r
multkey.merge <- multkey.merge %>%
  mutate(rank = stringr::str_replace(rank_desc, " Student", ""))
```

What about changing the `case`?

```r
multkey.merge <- multkey.merge %>%
  mutate(college = stringr::str_to_upper(college_desc))
```

Other case options: `str_to_title`, `str_to_lower`, `str_to_sentence`

`stringr` is very versatile; see here for [the authors' walkthrough](https://r4ds.had.co.nz/strings.html)

---
class: inverse, middle, center

# .fat[.fancy[Grouped operations via group_by()]]

<img src = "https://media.giphy.com/media/gfhfmIsWODmkE/giphy.gif" width = 250px </img>

---

You will, more often than not, want to calculate quantities of interest for specific groups -- by campus, by term, by college/department/course, and so on. Well, the `group_by()` switch allows you to specify the grouping structure, while `ungroup()` will remove the constraints imposed by the grouping structure.

For example, assume I want to calculate the number of students in each college on each campus

```r
multkey.merge %>%
  group_by(college_desc, campus_name) %>%
  summarise(students = n_distinct(anon_id)) %>%
  arrange(-students)
```

```
## # A tibble: 56 x 3
## # Groups:   college_desc [14]
##    college_desc                  campus_name students
##    <chr>                         <chr>          <int>
##  1 Arts & Sciences               Athens         47145
##  2 Health Sciences & Professions Athens         30119
##  3 Business                      Athens         19247
##  4 Communication                 Athens         14332
##  5 Fine Arts                     Athens         14272
##  6 Education                     Athens         13854
##  7 University College            Athens         12891
##  8 Engineering & Technology      Athens          8263
##  9 Arts & Sciences               Lancaster       4590
## 10 Arts & Sciences               Chillicothe     4021
## # … with 46 more rows
```

`n_distinct()` is picking up unique values of `anon_id` even though the data frame has multiple rows for some students

---

What about male versus female students?
```r multkey.merge %>% group_by(college_desc, campus_name, sex_code) %>% summarise(students = n_distinct(anon_id)) %>% arrange(college_desc, campus_name, -students) ``` ``` ## # A tibble: 115 x 4 ## # Groups: college_desc, campus_name [56] ## college_desc campus_name sex_code students ## <chr> <chr> <chr> <int> ## 1 Arts & Sciences Athens F 25515 ## 2 Arts & Sciences Athens M 21630 ## 3 Arts & Sciences Chillicothe F 2559 ## 4 Arts & Sciences Chillicothe M 1462 ## 5 Arts & Sciences Eastern F 1318 ## 6 Arts & Sciences Eastern M 847 ## 7 Arts & Sciences Lancaster F 2755 ## 8 Arts & Sciences Lancaster M 1835 ## 9 Arts & Sciences Southern F 2032 ## 10 Arts & Sciences Southern M 1012 ## # … with 105 more rows ``` --- What about male versus female students by ethnicity? ```r multkey.merge %>% group_by(college_desc, campus_name, sex_code, ethnicity_desc) %>% summarise(students = n_distinct(anon_id)) %>% arrange(college_desc, campus_name, ethnicity_desc, sex_code) ``` ``` ## # A tibble: 820 x 5 ## # Groups: college_desc, campus_name, sex_code [115] ## college_desc campus_name sex_code ethnicity_desc students ## <chr> <chr> <chr> <chr> <int> ## 1 Arts & Sciences Athens F American Indian/Alaska Native 43 ## 2 Arts & Sciences Athens M American Indian/Alaska Native 34 ## 3 Arts & Sciences Athens F Asian 310 ## 4 Arts & Sciences Athens M Asian 215 ## 5 Arts & Sciences Athens F Black or African American 1488 ## 6 Arts & Sciences Athens M Black or African American 1087 ## 7 Arts & Sciences Athens F Hispanic or Latino 846 ## 8 Arts & Sciences Athens M Hispanic or Latino 611 ## 9 Arts & Sciences Athens F Native Hawaiian/Other Pacific Islander 11 ## 10 Arts & Sciences Athens M Native Hawaiian/Other Pacific Islander 17 ## # … with 810 more rows ``` --- Maybe I want the number enrolled in a give course. To do this, I will first combine `subject` and `catalog_nbr` ```r multkey.merge %>% mutate(course = stringr::str_c(subject, catalog_nbr, sep = " ")) %>% group_by(term_code, campus_name, course) %>% summarise(students = n_distinct(anon_id)) %>% arrange(-students) ``` ``` ## # A tibble: 12,666 x 4 ## # Groups: term_code, campus_name [18] ## term_code campus_name course students ## <chr> <chr> <chr> <int> ## 1 20181C Athens UC 1900 3107 ## 2 20191C Athens UC 1900 3102 ## 3 20171C Athens UC 1900 2811 ## 4 20171C Athens ENG 1510 1547 ## 5 20171C Athens NRSE 4570 1467 ## 6 20181C Athens ENG 1510 1428 ## 7 20171C Athens NRSE 4520 1426 ## 8 20191C Athens NRSE 4550 1414 ## 9 20171C Athens PSY 1010 1404 ## 10 20181C Athens NRSE 4530 1334 ## # … with 12,656 more rows ``` --- name: yourturn4 template: yourturn .right-column[ .large[ (1) Remove the phrase "Main Campus" from `campus_dtl_desc` (2) Add `class_nbr` to course, making sure the separator is " - " (3) Save `multkey.merge` with the RData extension] ] .left-column[ .fat[.fancy[10:00 minutes]] ] --- ## .fancy[rename()] You may need to rename some existing columns. 
An easy way to do this is as follows: ```r multkey.merge %>% rename("Ethnicity" = ethnicity_desc, "Sex" = sex_code, "Department" = department_desc) %>% names() ``` ``` ## [1] "term_code" "anon_id" "Ethnicity" ## [4] "Sex" "residency_code" "university_ftpt" ## [7] "class_nbr" "campus_name" "campus_dtl_desc" ## [10] "college_desc" "Department" "subject" ## [13] "catalog_nbr" "class_desc" "delivery_desc" ## [16] "online_flag" "rank_group" "rank_desc" ## [19] "acad_plan" "student_college_desc" "acad_plan_desc" ## [22] "field_of_study_desc" "residency_desc" "enrolled_flag" ## [25] "enrlld_crdt_hrs" "online.f" "sex.f" ## [28] "enrolled.f" "rank" "college" ``` --- ## .fancy[top_n()] After running some operations that generate some quantities of interest we may want to extract or merely peek at the `n` observations with the highest values or `n` observatins with the lowest values. This can be done as follows: .pull-left[ ```r multkey.merge %>% group_by(department_desc) %>% summarize(frequency = n()) %>% arrange(-frequency) %>% top_n(10, frequency) ``` ``` ## # A tibble: 10 x 2 ## department_desc frequency ## <chr> <int> ## 1 Nursing 36146 ## 2 Management Systems 27078 ## 3 Biological Sciences 22470 ## 4 Mathematics 20716 ## 5 Vice President Regional Higher Education 19043 ## 6 Psychology 18791 ## 7 English 17148 ## 8 Communication Studies 17008 ## 9 Teacher Education 16464 ## 10 Dean University College 16447 ``` ] .pull-right[ ```r multkey.merge %>% group_by(department_desc) %>% summarize(frequency = n()) %>% arrange(-frequency) %>% top_n(-10, frequency) ``` ``` ## # A tibble: 10 x 2 ## department_desc frequency ## <chr> <int> ## 1 Center For Intl Studies 921 ## 2 African American Studies 884 ## 3 Aviation 883 ## 4 Military Science 818 ## 5 Ohio Program of Intensive English 505 ## 6 Honors College 481 ## 7 Misc 308 ## 8 Aerospace Studies 277 ## 9 Education Dean 69 ## 10 Center for Law, Justice and Culture 54 ``` ] --- class: inverse, middle, center # .fat[.fancy[tidy data]] <center><img src = "images/hex-tidyr.png", width = 200px></center> --- Most packages you use in R call for data to be `tidy`. [What are tidy data](http://vita.had.co.nz/papers/tidy-data.pdf)? - Each variable forms a column. - Each observation forms a row. - Each type of observational unit forms a table. 
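---

The third principle is the one most often overlooked: if, say, students and their course enrollments are different observational units, they belong in separate tables that share a key and are combined with joins (like the ones covered earlier) only when an analysis needs both. A minimal sketch, with made-up values:

```r
library(tidyverse)

# one table per observational unit, linked by a common key (anon_id)
students    <- tibble(anon_id  = c("S1", "S2"),
                      sex_code = c("F", "M"))

enrollments <- tibble(anon_id = c("S1", "S1", "S2"),
                      course  = c("ENG 1510", "PSY 1010", "UC 1900"),
                      credits = c(3, 3, 1))

# combine the two units only when an analysis needs both
left_join(enrollments, students, by = "anon_id")
```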
Here are some untidy data --- (a) The Pew survey data ```r pew <- read.delim( file = "http://stat405.had.co.nz/data/pew.txt", header = TRUE, stringsAsFactors = FALSE, check.names = F ) head(pew) ``` ``` ## religion <$10k $10-20k $20-30k $30-40k $40-50k $50-75k $75-100k $100-150k ## 1 Agnostic 27 34 60 81 76 137 122 109 ## 2 Atheist 12 27 37 52 35 70 73 59 ## 3 Buddhist 27 21 30 34 33 58 62 39 ## 4 Catholic 418 617 732 670 638 1116 949 792 ## 5 Don’t know/refused 15 14 15 11 10 35 21 17 ## 6 Evangelical Prot 575 869 1064 982 881 1486 949 723 ## >150k Don't know/refused ## 1 84 96 ## 2 74 76 ## 3 53 54 ## 4 633 1489 ## 5 18 116 ## 6 414 1529 ``` --- (b) Here are some more examples .pull-left[ ```r library(DSR) table2 ``` ``` ## # A tibble: 12 x 4 ## country year key value ## <fct> <int> <fct> <int> ## 1 Afghanistan 1999 cases 745 ## 2 Afghanistan 1999 population 19987071 ## 3 Afghanistan 2000 cases 2666 ## 4 Afghanistan 2000 population 20595360 ## 5 Brazil 1999 cases 37737 ## 6 Brazil 1999 population 172006362 ## 7 Brazil 2000 cases 80488 ## 8 Brazil 2000 population 174504898 ## 9 China 1999 cases 212258 ## 10 China 1999 population 1272915272 ## 11 China 2000 cases 213766 ## 12 China 2000 population 1280428583 ``` ] .pull-right[ ```r table3 ``` ``` ## # A tibble: 6 x 3 ## country year rate ## <fct> <int> <chr> ## 1 Afghanistan 1999 745/19987071 ## 2 Afghanistan 2000 2666/20595360 ## 3 Brazil 1999 37737/172006362 ## 4 Brazil 2000 80488/174504898 ## 5 China 1999 212258/1272915272 ## 6 China 2000 213766/1280428583 ``` ] --- .pull-left[ ```r table4 ``` ``` ## # A tibble: 3 x 3 ## country `1999` `2000` ## <fct> <int> <int> ## 1 Afghanistan 745 2666 ## 2 Brazil 37737 80488 ## 3 China 212258 213766 ``` ] .pull-right[ ```r table5 ``` ``` ## # A tibble: 3 x 3 ## country `1999` `2000` ## <fct> <int> <int> ## 1 Afghanistan 19987071 20595360 ## 2 Brazil 172006362 174504898 ## 3 China 1272915272 1280428583 ``` ] --- What would a tidy version of these examples look like? ```r table1 ``` ``` ## # A tibble: 6 x 4 ## country year cases population ## <fct> <int> <int> <int> ## 1 Afghanistan 1999 745 19987071 ## 2 Afghanistan 2000 2666 20595360 ## 3 Brazil 1999 37737 172006362 ## 4 Brazil 2000 80488 174504898 ## 5 China 1999 212258 1272915272 ## 6 China 2000 213766 1280428583 ``` Here, we have - each country as a row per year - each column has a single variable - each cell has a single value The `tidyr` package is designed to make messy data tidy in order to proceed with analysis. It has four primary functions that we will see in turn. --- ## separate You will run into variables that you want to break into smaller pieces. For instance, take `table3` where `rate` is a ratio of two variables, `cases` and `population`, each of which should be in their own columns. How can we achieve that? 
```r library(tidyr) s.table3 <- table3 %>% separate(col = rate, into = c("cases", "population"), sep = "/", remove = TRUE, convert = TRUE) head(s.table3) ``` ``` ## # A tibble: 6 x 4 ## country year cases population ## <fct> <int> <int> <int> ## 1 Afghanistan 1999 745 19987071 ## 2 Afghanistan 2000 2666 20595360 ## 3 Brazil 1999 37737 172006362 ## 4 Brazil 2000 80488 174504898 ## 5 China 1999 212258 1272915272 ## 6 China 2000 213766 1280428583 ``` --- ## unite The `unite` command does the exact opposite of separate, combining columns ```r u.table3 <- s.table3 %>% unite("rate", cases, population, sep = ":", remove = FALSE) head(u.table3) ``` ``` ## # A tibble: 6 x 5 ## country year rate cases population ## <fct> <int> <chr> <int> <int> ## 1 Afghanistan 1999 745:19987071 745 19987071 ## 2 Afghanistan 2000 2666:20595360 2666 20595360 ## 3 Brazil 1999 37737:172006362 37737 172006362 ## 4 Brazil 2000 80488:174504898 80488 174504898 ## 5 China 1999 212258:1272915272 212258 1272915272 ## 6 China 2000 213766:1280428583 213766 1280428583 ``` Here you specify the name of the new column first, and then list the columns to be combined, specify the separator, and whether the original columns should be removed or not. --- ## spread Revisit `table2` ```r table2 ``` ``` ## # A tibble: 12 x 4 ## country year key value ## <fct> <int> <fct> <int> ## 1 Afghanistan 1999 cases 745 ## 2 Afghanistan 1999 population 19987071 ## 3 Afghanistan 2000 cases 2666 ## 4 Afghanistan 2000 population 20595360 ## 5 Brazil 1999 cases 37737 ## 6 Brazil 1999 population 172006362 ## 7 Brazil 2000 cases 80488 ## 8 Brazil 2000 population 174504898 ## 9 China 1999 cases 212258 ## 10 China 1999 population 1272915272 ## 11 China 2000 cases 213766 ## 12 China 2000 population 1280428583 ``` --- I want to flip this into a format where the data frame has two columns per country per year -- `cases` and `population`. How can I do this? ```r sp.table2 <- table2 %>% spread(key, value) sp.table2 ``` ``` ## # A tibble: 6 x 4 ## country year cases population ## <fct> <int> <int> <int> ## 1 Afghanistan 1999 745 19987071 ## 2 Afghanistan 2000 2666 20595360 ## 3 Brazil 1999 37737 172006362 ## 4 Brazil 2000 80488 174504898 ## 5 China 1999 212258 1272915272 ## 6 China 2000 213766 1280428583 ``` Notice that the `key` column had two unique values -- cases and population -- so two new columns were created. Notice also that tidyr knew to populate the value of each of the new cells with what it finds in the `value` column. The result is what we call data in a wide format. --- ## gather `gather` does the opposite of `spread`, converting wide data into what we call the long form. Revisit `table4` ```r table4 ``` ``` ## # A tibble: 3 x 3 ## country `1999` `2000` ## <fct> <int> <int> ## 1 Afghanistan 745 2666 ## 2 Brazil 37737 80488 ## 3 China 212258 213766 ``` Note that each country has two columns, one per year. ... This is untidy ... --- ... So, we'll flip this tibble as follows: ```r g.table4 <- table4 %>% gather("year", "cases", 2:3) head(g.table4) ``` ``` ## # A tibble: 6 x 3 ## country year cases ## <fct> <chr> <int> ## 1 Afghanistan 1999 745 ## 2 Brazil 1999 37737 ## 3 China 1999 212258 ## 4 Afghanistan 2000 2666 ## 5 Brazil 2000 80488 ## 6 China 2000 213766 ``` Notice that you had to specify the columns you wanted to gather and we did this via `2:3`. If you forget to specify columns it creates something else altogether so be careful. 
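---

## pivot_longer() and pivot_wider()

Newer releases of `tidyr` (1.0.0 and later) supersede `gather()` and `spread()` with `pivot_longer()` and `pivot_wider()`. The older verbs still work, so everything on the preceding slides runs as shown, but the pivot functions are easier to reason about. A rough sketch of the equivalents, assuming a recent `tidyr`:

```r
library(tidyverse)

# long form: the pivot_longer() analogue of gather("year", "cases", 2:3)
pl.table4 <- table4 %>%
  pivot_longer(cols = c(`1999`, `2000`), names_to = "year", values_to = "cases")

# wide form: the pivot_wider() analogue of spread(key, value)
pw.table2 <- table2 %>%
  pivot_wider(names_from = key, values_from = value)
```

Note that `pivot_longer()` asks you to name the columns being gathered explicitly, which avoids the "forgot to specify columns" trap mentioned above.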
--- ## .salt[.fancy[Some Examples]] ```r nospread.df <- multkey.merge %>% group_by(department_desc, term_code) %>% summarize(frequency = n()) head(nospread.df, 10) ``` ``` ## # A tibble: 10 x 3 ## # Groups: department_desc [4] ## department_desc term_code frequency ## <chr> <chr> <int> ## 1 Accountancy 20171C 1874 ## 2 Accountancy 20181C 1740 ## 3 Accountancy 20191C 1734 ## 4 Aerospace Studies 20171C 69 ## 5 Aerospace Studies 20181C 112 ## 6 Aerospace Studies 20191C 96 ## 7 African American Studies 20171C 313 ## 8 African American Studies 20181C 238 ## 9 African American Studies 20191C 333 ## 10 Applied Health Sciences & Wellness 20171C 3922 ``` --- ```r spread.df <- multkey.merge %>% group_by(department_desc, term_code) %>% summarize(frequency = n()) %>% spread(term_code, frequency) head(spread.df, 10) ``` ``` ## # A tibble: 10 x 4 ## # Groups: department_desc [10] ## department_desc `20171C` `20181C` `20191C` ## <chr> <int> <int> <int> ## 1 Accountancy 1874 1740 1734 ## 2 Aerospace Studies 69 112 96 ## 3 African American Studies 313 238 333 ## 4 Applied Health Sciences & Wellness 3922 4525 4268 ## 5 Art 2447 2969 2738 ## 6 Aviation 293 275 315 ## 7 Biological Sciences 6722 7947 7801 ## 8 Center For Intl Studies 255 328 338 ## 9 Center for Law, Justice and Culture NA NA 54 ## 10 Chemical & Biomolecular Engineering 870 1189 1059 ``` .center[Note: `term_code` is now a column instead of rows] --- --- # The World Health Organization Example [At this website](http://www.who.int/tb/country/data/download/en/) you can find data provided by countries to WHO and estimates of TB burden generated by WHO for the Global Tuberculosis Report. I want you to gather the following data (all are in the csv format): - [Download the data dictionary](https://extranet.who.int/tme/generateCSV.asp?ds=dictionary) - [Download case notifications](https://extranet.who.int/tme/generateCSV.asp?ds=notifications) - [Download WHO TB burden estimates](https://extranet.who.int/tme/generateCSV.asp?ds=estimates) Read the data dictionary carefully. Then read-in the case notifications data. The database has too many variables (175 or so) and so I'll have you use just a few for this practice task. ```r library(readr) tb <- read_csv("data/TB_notifications_2018-02-07.csv") ``` --- Let us use `dplyr` to select and filter the tb database so that it meets the following criteria: (i) observations from 2012 or earlier years are kept, and (ii) keep only columns 1 through 6 and 25 (new_sp_m04) through 44 (new_sp_fu). ```r library(dplyr) tb2 <- tb %>% select(c(1:6, 25:44)) %>% filter(year <= 2012) head(tb2) ``` ``` ## # A tibble: 6 x 26 ## country iso2 iso3 iso_numeric g_whoregion year new_sp_m04 new_sp_m514 new_sp_m014 ## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> ## 1 Afghan… AF AFG 004 EMR 1980 NA NA NA ## 2 Afghan… AF AFG 004 EMR 1981 NA NA NA ## 3 Afghan… AF AFG 004 EMR 1982 NA NA NA ## 4 Afghan… AF AFG 004 EMR 1983 NA NA NA ## 5 Afghan… AF AFG 004 EMR 1984 NA NA NA ## 6 Afghan… AF AFG 004 EMR 1985 NA NA NA ## # … with 17 more variables: new_sp_m1524 <dbl>, new_sp_m2534 <dbl>, new_sp_m3544 <dbl>, ## # new_sp_m4554 <dbl>, new_sp_m5564 <dbl>, new_sp_m65 <dbl>, new_sp_mu <dbl>, ## # new_sp_f04 <dbl>, new_sp_f514 <dbl>, new_sp_f014 <dbl>, new_sp_f1524 <dbl>, ## # new_sp_f2534 <dbl>, new_sp_f3544 <dbl>, new_sp_f4554 <dbl>, new_sp_f5564 <dbl>, ## # new_sp_f65 <dbl>, new_sp_fu <dbl> ``` --- Now, `new_sp_xx` refer to new pulmonary smear-positive cases. The `m` refers to males and `f` refers to females. 
Finally, `04` is the 0-4 age group, `514` is the 5-14 age group, and so on. `new_sp_mu` is males with an unknown age-group and `new_sp_fu` is females with an unknown age-group. This is the messy part of the data. Ideally we'd have a column for gender, flagged as `Male, Female, Unknown`, and a separate column for age-groups flagged `0-4, 5-14, 0-14, and so on`. Time to tidy these data. Since the data are in the wide format we can use gather to flip it into the long format. ```r tb3 <- tb2 %>% gather("group", "cases", 7:26) head(tb3) ``` ``` ## # A tibble: 6 x 8 ## country iso2 iso3 iso_numeric g_whoregion year group cases ## <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <dbl> ## 1 Afghanistan AF AFG 004 EMR 1980 new_sp_m04 NA ## 2 Afghanistan AF AFG 004 EMR 1981 new_sp_m04 NA ## 3 Afghanistan AF AFG 004 EMR 1982 new_sp_m04 NA ## 4 Afghanistan AF AFG 004 EMR 1983 new_sp_m04 NA ## 5 Afghanistan AF AFG 004 EMR 1984 new_sp_m04 NA ## 6 Afghanistan AF AFG 004 EMR 1985 new_sp_m04 NA ``` --- Now we split the `new_sp_xx` values into three columns ```r tb4 <- tb3 %>% separate(col = group, into = c("new", "sp", "sexage"), sep = "_") head(tb4) ``` ``` ## # A tibble: 6 x 10 ## country iso2 iso3 iso_numeric g_whoregion year new sp sexage cases ## <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <dbl> ## 1 Afghanistan AF AFG 004 EMR 1980 new sp m04 NA ## 2 Afghanistan AF AFG 004 EMR 1981 new sp m04 NA ## 3 Afghanistan AF AFG 004 EMR 1982 new sp m04 NA ## 4 Afghanistan AF AFG 004 EMR 1983 new sp m04 NA ## 5 Afghanistan AF AFG 004 EMR 1984 new sp m04 NA ## 6 Afghanistan AF AFG 004 EMR 1985 new sp m04 NA ``` --- and then split `sexage` into two columns by specifying that the split should occur after the first value (which in our case will be after `m` or `f`). ```r tb5 <- tb4 %>% separate(col = sexage, into = c("sex", "agegroup"), sep = 1) head(tb5) ``` ``` ## # A tibble: 6 x 11 ## country iso2 iso3 iso_numeric g_whoregion year new sp sex agegroup cases ## <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <dbl> ## 1 Afghanistan AF AFG 004 EMR 1980 new sp m 04 NA ## 2 Afghanistan AF AFG 004 EMR 1981 new sp m 04 NA ## 3 Afghanistan AF AFG 004 EMR 1982 new sp m 04 NA ## 4 Afghanistan AF AFG 004 EMR 1983 new sp m 04 NA ## 5 Afghanistan AF AFG 004 EMR 1984 new sp m 04 NA ## 6 Afghanistan AF AFG 004 EMR 1985 new sp m 04 NA ``` --- The beauty of `dplyr` and `tidyr` (in fact of all things `tidyverse`) is the ability to combine multiple steps into a single piped command! ```r tb.df <- tb %>% select(c(1:6, 25:44)) %>% filter(year <= 2012) %>% gather("group", "cases", 7:26) %>% separate(col = group, into = c("new", "sp", "sexage"), sep = "_") %>% separate(col = sexage, into = c("sex", "agegroup"), sep = 1) %>% select(c(1:6, 9:11)) head(tb.df) ``` ``` ## # A tibble: 6 x 9 ## country iso2 iso3 iso_numeric g_whoregion year sex agegroup cases ## <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <dbl> ## 1 Afghanistan AF AFG 004 EMR 1980 m 04 NA ## 2 Afghanistan AF AFG 004 EMR 1981 m 04 NA ## 3 Afghanistan AF AFG 004 EMR 1982 m 04 NA ## 4 Afghanistan AF AFG 004 EMR 1983 m 04 NA ## 5 Afghanistan AF AFG 004 EMR 1984 m 04 NA ## 6 Afghanistan AF AFG 004 EMR 1985 m 04 NA ``` --- We can clean-up the codes for easier interpretation. 
```r tb.df$sex <- ifelse(tb.df$sex == "m", "Males", "Females") tb.df$agegroup <- ifelse(tb.df$agegroup == "04", "0-4", ifelse(tb.df$agegroup == "014", "0-14", ifelse(tb.df$agegroup == "1524", "15-24", ifelse(tb.df$agegroup == "2534", "25-34", ifelse(tb.df$agegroup == "3544", "35-44", ifelse(tb.df$agegroup == "4554", "45-54", ifelse(tb.df$agegroup == "514", "5-14", ifelse(tb.df$agegroup == "5564", "55-64", ifelse(tb.df$agegroup == "65", "65+", "Uknown"))))))))) head(tb.df) ``` ``` ## # A tibble: 6 x 9 ## country iso2 iso3 iso_numeric g_whoregion year sex agegroup cases ## <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <dbl> ## 1 Afghanistan AF AFG 004 EMR 1980 Males 0-4 NA ## 2 Afghanistan AF AFG 004 EMR 1981 Males 0-4 NA ## 3 Afghanistan AF AFG 004 EMR 1982 Males 0-4 NA ## 4 Afghanistan AF AFG 004 EMR 1983 Males 0-4 NA ## 5 Afghanistan AF AFG 004 EMR 1984 Males 0-4 NA ## 6 Afghanistan AF AFG 004 EMR 1985 Males 0-4 NA ``` .salt[Note: This is using base R and not `case_when()` from `dplyr`] --- How many cases per country in the 1980-1990 period? ```r tbtab.01 <- tb.df %>% filter(year >= 1980 | year <= 1990) %>% group_by(country) %>% summarise(Ncases = sum(cases, na.rm = TRUE)) %>% arrange(-Ncases) head(tbtab.01) ``` ``` ## # A tibble: 6 x 2 ## country Ncases ## <chr> <dbl> ## 1 India 6492850 ## 2 China 5643738 ## 3 Indonesia 1886657 ## 4 South Africa 1435147 ## 5 Bangladesh 1072431 ## 6 Viet Nam 915525 ``` --- What about by sex and age-group? ```r tbtab.02 <- tb.df %>% filter(year >= 1980 | year <= 1990) %>% group_by(country, sex, agegroup) %>% summarise(Ncases = sum(cases, na.rm = TRUE)) %>% arrange(-Ncases) head(tbtab.02) ``` ``` ## # A tibble: 6 x 4 ## # Groups: country, sex [2] ## country sex agegroup Ncases ## <chr> <chr> <chr> <dbl> ## 1 India Males 35-44 963811 ## 2 India Males 25-34 913939 ## 3 India Males 45-54 835228 ## 4 India Males 15-24 791303 ## 5 China Males 65+ 748103 ## 6 China Males 45-54 668133 ``` This is just one example of how `dplyr` and `tidyr` -- two packages from the `tidyverse` -- can be used to deal with what seems to be quite messy data. The more you use it, the easier it becomes to figure out solutions to even very complicated messy data setups. --- class: inverse, middle, center # <br> .fat[.fancy[lubridate]] <center><img src = "images/hex-lubridate.png"></center> --- This package makes working with dates and times quite easy because it does what base R does but in more intuitive ways and perhaps more flexibly. Let us start with some date fields. ```r today1 = "20171217" today2 = "2017-12-17" today3 = "2017 December 17" today4 = "20171217143241" today5 = "2017 December 17 14:32:41" today6 = "December 17 2017 14:32:41" today7 = "17-Dec, 2017 14:32:41" ``` The formats are quite varied but `lubridate` deals with them quite seamlessly so long as you pay attention to the order -- is year first or last? What about month? day? Is time given in hours, minutes and seconds?. --- .pull-left[ ```r library(lubridate) ymd(today1) ``` ``` ## [1] "2017-12-17" ``` ```r ymd(today2) ``` ``` ## [1] "2017-12-17" ``` ```r ymd(today3) ``` ``` ## [1] "2017-12-17" ``` ] .pull-right[ ```r ymd_hms(today4) ``` ``` ## [1] "2017-12-17 14:32:41 UTC" ``` ```r ymd_hms(today5) ``` ``` ## [1] "2017-12-17 14:32:41 UTC" ``` ```r mdy_hms(today6) ``` ``` ## [1] "2017-12-17 14:32:41 UTC" ``` ```r dmy_hms(today7) ``` ``` ## [1] "2017-12-17 14:32:41 UTC" ``` ] --- If I need to extract data/time elements, I can do that as well. 
```r today = dmy_hms(today7); today # a date and time set in today7 ``` ``` ## [1] "2017-12-17 14:32:41 UTC" ``` ```r today.y = year(today); today.y # The year ``` ``` ## [1] 2017 ``` ```r today.m1 = month(today); today.m1 # the month, as a number ``` ``` ## [1] 12 ``` ```r today.m2 = month(today, label = TRUE, abbr = TRUE); today.m2 # labeling the month but with an abbreviation ``` ``` ## [1] Dec ## Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < Oct < Nov < Dec ``` ```r today.m3 = month(today, label = TRUE, abbr = FALSE); today.m3# fully labelling the month ``` ``` ## [1] December ## 12 Levels: January < February < March < April < May < June < July < ... < December ``` --- .pull-left[ ```r today.w = week(today) today.w # what week of the year is it? ``` ``` ## [1] 51 ``` ```r today.doy = yday(today) today.doy # what day of the year is it? ``` ``` ## [1] 351 ``` ```r today.dom = mday(today) today.dom # what day of the month is it? ``` ``` ## [1] 17 ``` ] .pull-right[ ```r today.dow1 = wday(today) today.dow1 # what day of the week is it, as a number? ``` ``` ## [1] 1 ``` ```r today.dow2 = wday(today, label = TRUE, abbr = TRUE) today.dow2 # day of the week, abbreviated label ``` ``` ## [1] Sun ## Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat ``` ```r today.dow3 = wday(today, label = TRUE, abbr = FALSE) today.dow3 # day of the week fully labelled ``` ``` ## [1] Sunday ## Levels: Sunday < Monday < Tuesday < Wednesday < Thursday < Friday < Saturday ``` ] --- ```r today.h = hour(today) today.h # what hour is it? ``` ``` ## [1] 14 ``` ```r today.m = minute(today) today.m # what minute is it? ``` ``` ## [1] 32 ``` ```r today.s = second(today) today.s # what second is it? ``` ``` ## [1] 41 ``` ```r today.tz = tz(today) today.tz # what time zone is it? ``` ``` ## [1] "UTC" ``` --- ### Intervals vs. Durations vs. Periods Calculating time lapsed between dates is tricky because you have to take into account daylight savings time<sup>1</sup>, leap years<sup>2</sup>, and leap seconds<sup>3</sup>. These render the length of months, weeks, days, hours, and minutes to be `relative units of time` but seconds tends to be `exact units of time`. As a result, and by design, `lubridate` differentiates between `intervals`, `durations`, and `periods`, each measuring time spans in different ways. .footnote[ [1] Daylight Saving Time (DST) is the practice of setting the clocks forward 1 hour from standard time during the summer months, and back again in the fall, in order to make better use of natural daylight. [Source](https://www.timeanddate.com/time/dst/) [2] Leap years are needed to keep our modern day Gregorian calendar in alignment with the Earth's revolutions around the sun. It takes the Earth approximately 365.242189 days – or 365 days, 5 hours, 48 minutes, and 45 seconds – to circle once around the Sun. This is called a tropical year, and is measured from the March equinox. However, the Gregorian calendar has only 365 days in a year, so if we didn't add a leap day on February 29 nearly every four years, we would lose almost six hours off our calendar every year. After only 100 years, our calendar would be off by around 24 days! [Source](https://www.timeanddate.com/date/leapyear.html) [3] Two components are used to determine UTC (Coordinated Universal Time): International Atomic Time (TAI): A time scale that combines the output of some 200 highly precise atomic clocks worldwide, and provides the exact speed for our clocks to tick. 
Universal Time (UT1), also known as Astronomical Time, refers to the Earth's rotation around its own axis, which determines the length of a day. Before the difference between UTC and UT1 reaches 0.9 seconds, a leap second is added to UTC and to clocks worldwide. By adding an additional second to the time count, our clocks are effectively stopped for that second to give Earth the opportunity to catch up. [Source](https://www.timeanddate.com/time/leapseconds.html). ] --- Let us start with a `duration`, the simplest measure of lapsed time since it measures the passing of time in seconds. Say I pick two dates and times ... `05:00am on March 9, 2019` and `05:00am on March 10, 2019`. ```r date1 <- ymd_hms("2019-03-09 05:00:00", tzone = "US/Eastern") date2 <- ymd_hms("2019-03-10 05:00:00", tzone = "US/Eastern") timint <- interval(date1, date2) timint ``` ``` ## [1] 2019-03-09 05:00:00 UTC--2019-03-10 05:00:00 UTC ## [2] NA--NA ``` How much time has lapsed between date1 and date2? We calculate with `as.duration()` ```r timelapsed.d <- as.duration(timint) ``` Why is it 23 hours and not 24 hours? Because daylight savings kicks in at 2:00 AM on March 10, and since the clocks kick forward by one hour, only 23 hours have passed, not 24 as we might naively expect. Go back and look at the output from `timeint`; see the EST versus EDT? In contrast to duration, if I ask for the `period of time`, what do I get? ```r timelapsed.p <- as.period(timint) ``` Aha! `as.period()` is imprecise because it tells me 1 day has passed --- Now, we can use the duration of time lapsed in whatever units we want, but if we want accuracy, we better work with durations, as shown below. .pull-left[ ```r time_length(timelapsed.d, unit = "second") ``` ``` ## [1] 86400 NA ``` ```r time_length(timelapsed.d, unit = "minute") ``` ``` ## [1] 1440 NA ``` ```r time_length(timelapsed.d, unit = "hour") ``` ``` ## [1] 24 NA ``` ] .pull-right[ ```r time_length(timelapsed.d, unit = "day") ``` ``` ## [1] 1 NA ``` ```r time_length(timelapsed.d, unit = "week") ``` ``` ## [1] 0.1428571 NA ``` ```r time_length(timelapsed.d, unit = "month") ``` ``` ## [1] 0.03285421 NA ``` ```r time_length(timelapsed.d, unit = "year") ``` ``` ## [1] 0.002737851 NA ``` ] `duration()` is always estimated in seconds but you can change the reporting unit --- If, unfortunately, we rely on periods, well then we have inherited a problem! .pull-left[ ```r time_length(timelapsed.p, unit = "second") ``` ``` ## [1] 86400 NA ``` ```r time_length(timelapsed.p, unit = "minute") ``` ``` ## [1] 1440 NA ``` ```r time_length(timelapsed.p, unit = "hour") ``` ``` ## [1] 24 NA ``` ] .pull-right[ ```r time_length(timelapsed.p, unit = "day") ``` ``` ## [1] 1 NA ``` ```r time_length(timelapsed.p, unit = "week") ``` ``` ## [1] 0.1428571 NA ``` ```r time_length(timelapsed.p, unit = "month") ``` ``` ## [1] 0.03285421 NA ``` ```r time_length(timelapsed.p, unit = "year") ``` ``` ## [1] 0.002737851 NA ``` ] `only` the `seconds` are correct and nothing else --- class: right, middle <img class="circle" src="https://github.com/aniruhil.png" width="175px"/> # Find me at... 
[<svg style="height:0.8em;top:.04em;position:relative;" viewBox="0 0 512 512"><path d="M459.37 151.716c.325 4.548.325 9.097.325 13.645 0 138.72-105.583 298.558-298.558 298.558-59.452 0-114.68-17.219-161.137-47.106 8.447.974 16.568 1.299 25.34 1.299 49.055 0 94.213-16.568 130.274-44.832-46.132-.975-84.792-31.188-98.112-72.772 6.498.974 12.995 1.624 19.818 1.624 9.421 0 18.843-1.3 27.614-3.573-48.081-9.747-84.143-51.98-84.143-102.985v-1.299c13.969 7.797 30.214 12.67 47.431 13.319-28.264-18.843-46.781-51.005-46.781-87.391 0-19.492 5.197-37.36 14.294-52.954 51.655 63.675 129.3 105.258 216.365 109.807-1.624-7.797-2.599-15.918-2.599-24.04 0-57.828 46.782-104.934 104.934-104.934 30.213 0 57.502 12.67 76.67 33.137 23.715-4.548 46.456-13.32 66.599-25.34-7.798 24.366-24.366 44.833-46.132 57.827 21.117-2.273 41.584-8.122 60.426-16.243-14.292 20.791-32.161 39.308-52.628 54.253z"/></svg> @aruhil](http://twitter.com/aruhil) [<svg style="height:0.8em;top:.04em;position:relative;" viewBox="0 0 512 512"><path d="M326.612 185.391c59.747 59.809 58.927 155.698.36 214.59-.11.12-.24.25-.36.37l-67.2 67.2c-59.27 59.27-155.699 59.262-214.96 0-59.27-59.26-59.27-155.7 0-214.96l37.106-37.106c9.84-9.84 26.786-3.3 27.294 10.606.648 17.722 3.826 35.527 9.69 52.721 1.986 5.822.567 12.262-3.783 16.612l-13.087 13.087c-28.026 28.026-28.905 73.66-1.155 101.96 28.024 28.579 74.086 28.749 102.325.51l67.2-67.19c28.191-28.191 28.073-73.757 0-101.83-3.701-3.694-7.429-6.564-10.341-8.569a16.037 16.037 0 0 1-6.947-12.606c-.396-10.567 3.348-21.456 11.698-29.806l21.054-21.055c5.521-5.521 14.182-6.199 20.584-1.731a152.482 152.482 0 0 1 20.522 17.197zM467.547 44.449c-59.261-59.262-155.69-59.27-214.96 0l-67.2 67.2c-.12.12-.25.25-.36.37-58.566 58.892-59.387 154.781.36 214.59a152.454 152.454 0 0 0 20.521 17.196c6.402 4.468 15.064 3.789 20.584-1.731l21.054-21.055c8.35-8.35 12.094-19.239 11.698-29.806a16.037 16.037 0 0 0-6.947-12.606c-2.912-2.005-6.64-4.875-10.341-8.569-28.073-28.073-28.191-73.639 0-101.83l67.2-67.19c28.239-28.239 74.3-28.069 102.325.51 27.75 28.3 26.872 73.934-1.155 101.96l-13.087 13.087c-4.35 4.35-5.769 10.79-3.783 16.612 5.864 17.194 9.042 34.999 9.69 52.721.509 13.906 17.454 20.446 27.294 10.606l37.106-37.106c59.271-59.259 59.271-155.699.001-214.959z"/></svg> aniruhil.org](https://aniruhil.org) [<svg style="height:0.8em;top:.04em;position:relative;" viewBox="0 0 512 512"><path d="M476 3.2L12.5 270.6c-18.1 10.4-15.8 35.6 2.2 43.2L121 358.4l287.3-253.2c5.5-4.9 13.3 2.6 8.6 8.3L176 407v80.5c0 23.6 28.5 32.9 42.5 15.8L282 426l124.6 52.2c14.2 6 30.4-2.9 33-18.2l72-432C515 7.8 493.3-6.8 476 3.2z"/></svg> ruhil@ohio.edu](mailto:ruhil@ohio.edu)