class: title-slide, center, middle background-image: url(images/ouaerial.jpeg) background-size: cover # .crimson[.fancy[Core Data Operations in R]] ## .crimson[.fancy[Ani Ruhil]] --- name: agenda ## .fancy[ Agenda ] <center><img src = "https://www.elderresearch.com/hs-fs/hubfs/Data%20Wrangling.bmp?width=1200&name=Data%20Wrangling.bmp" width = "600px"></center> > Data wrangling is the process of gathering, selecting, and transforming data to answer an analytical question. Also known as data cleaning or “munging”, legend has it that this wrangling costs analytics professionals as much as 80% of their time, leaving only 20% for exploration and modeling. [@Mike Thurber](https://www.elderresearch.com/blog/what-is-data-wrangling) > The process of manual data cleansing prior to analysis is known as data munging [... and ...] can be a laborious task without the right tools. The common interface used for data munging is often Excel, which lacks the sophistication for collaboration and automation to make the process efficient. ... Data munging is a time-consuming and disjointed process that gets in the way of extracting true value and potential from data. [@Trifacta](https://www.trifacta.com/data-munging/) --- class: inverse, middle, center <center><img src = "images/hex-dplyr.png" width = "300px"></center> --- "[dplyr](http://dplyr.tidyverse.org/) is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges." The core functions are: | **What you want to do ...** | **`{dplyr}` function** | | :----- | :----- | | select columns to work with? | `select()` | | use a subset of the data based on some criterion? | `filter()` | | arrange the data in ascending/descending order of variable(s)? | `arrange()` | | save the results of your calculations as a standalone data frame? | `summarise()` | | add your calculated value(s) to the existing data frame? 
| `mutate()` | | add your calculated value(s) to the existing data frame but also drop all variables/columns not being used in the calculation? | `transmute()` | | calculate averages, frequencies, etc. by groups? | `group_by()` | --- In order to understand how these commands work, we rely on a simple data frame -- all flights originating from or arriving in Columbus (Ohio), January through September of 2017. Let us load the data and the packages we need. ```r library(tidyverse) # loads a few packages we need library(tidylog) # tracks the result of each command as we clean our data library(here) # our friend that helps us construct the correct filepath load( here("data", "cmhflights_01092017.RData") ) dim(cmhflights) ``` ``` ## [1] 35993 110 ``` You should see 110 columns, the last being `X110`, an empty column, so we drop it. ```r cmhflights$X110 <- NULL ``` --- class: inverse, middle, center # .large[.fancy[ Using `select()` to pick columns to work with ]] --- Say I only want the first five columns -- Year, Quarter, Month, DayofMonth, DayOfWeek. ```r cmhflights %>% select( Year:DayOfWeek # select sequential columns Year through DayOfWeek ) -> my.df names(my.df) ``` ``` ## [1] "Year" "Quarter" "Month" "DayofMonth" "DayOfWeek" ``` What if the columns I want are not sequentially located? ```r cmhflights %>% select( Year, FlightDate:UniqueCarrier, TailNum # select Year, then sequential columns FlightDate through UniqueCarrier, and then TailNum ) -> my.df names(my.df) ``` ``` ## [1] "Year" "FlightDate" "UniqueCarrier" "TailNum" ``` --- Could we use column numbers instead? Sure! ```r cmhflights %>% select( c(1, 3, 5, 7) # select columns 1, 3, 5, and 7 ) -> my.df names(my.df) ``` ``` ## [1] "Year" "Month" "DayOfWeek" "UniqueCarrier" ``` You can also choose columns in other ways, with `contains()`, `starts_with()`, and `ends_with()`. 
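Before applying these helpers to `cmhflights`, here is how they behave on a tiny tibble invented purely for illustration (the column names mimic the flight data):

```r
library(dplyr)

# toy tibble, invented for illustration; names mimic the flight data
toy <- tibble(
  DepTime = 1, DepDelay = 2, ArrTime = 3, TailNum = "N1", FlightNum = 9
)

toy %>% select(starts_with("Dep")) %>% names()  # "DepTime"  "DepDelay"
toy %>% select(ends_with("Num")) %>% names()    # "TailNum"  "FlightNum"
toy %>% select(contains("Time")) %>% names()    # "DepTime"  "ArrTime"
```

Each helper matches column *names*: by prefix, by suffix, or anywhere in the name.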
```r cmhflights %>% select( contains("Carrier") # select columns with names that contain the word "Carrier" ) -> my.df names(my.df) ``` ``` ## [1] "UniqueCarrier" "Carrier" "CarrierDelay" ``` --- ```r cmhflights %>% select( starts_with("De") # select columns with names that start with the text "De" ) -> my.df names(my.df) ``` ``` ## [1] "DestAirportID" "DestAirportSeqID" "DestCityMarketID" ## [4] "Dest" "DestCityName" "DestState" ## [7] "DestStateFips" "DestStateName" "DestWac" ## [10] "DepTime" "DepDelay" "DepDelayMinutes" ## [13] "DepDel15" "DepartureDelayGroups" "DepTimeBlk" ``` ```r cmhflights %>% select( ends_with("Num") # select columns with names that end with "Num" ) -> my.df names(my.df) ``` ``` ## [1] "TailNum" "FlightNum" "Div1TailNum" "Div2TailNum" "Div3TailNum" ## [6] "Div4TailNum" "Div5TailNum" ``` --- class: inverse, middle, center # .large[.fancy[ Using `filter()` to select observations to work with ]] --- With `dplyr`, the `filter()` command helps us trim the data-set down to specific observations of interest. For instance, say we only want records from January, i.e., `Month == 1`. ```r # retain only the observations from January cmhflights %>% filter( Month == 1 # find all rows where Month is exactly equal to 1 ) -> my.df table(my.df$Month) ``` ``` ## ## 1 ## 3757 ``` What about only American Airlines flights in January? ```r cmhflights %>% filter( Month == 1, UniqueCarrier == "AA" # retain only AA flights from January ) -> my.df table(my.df$Month, my.df$UniqueCarrier) ``` ``` ## ## AA ## 1 387 ``` --- What about United Airlines flights in January to CMH (Columbus, OH)? ```r cmhflights %>% filter( # retain only UA flights to CMH in January Month == 1, UniqueCarrier == "UA", Dest == "CMH" ) -> my.df table(my.df$Month) ``` ``` ## ## 1 ## 53 ``` ```r table(my.df$UniqueCarrier) ``` ``` ## ## UA ## 53 ``` ```r table(my.df$Dest) ``` ``` ## ## CMH ## 53 ``` --- What if I wanted a more complicated filter, say, flights in January or February to CMH or ORD? 
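The workhorse for "is one of several values" questions is the `%in%` operator. A minimal sketch on a toy tibble (values invented for illustration):

```r
library(dplyr)

# toy data, invented for illustration
toy <- tibble(
  Month = c(1, 2, 3, 1, 2),
  Dest  = c("CMH", "ORD", "CMH", "LAX", "ORD")
)

toy %>%
  filter(
    Month %in% c(1, 2),        # Month is 1 or 2 ...
    Dest %in% c("CMH", "ORD")  # ... and Dest is CMH or ORD
  )
```

Conditions separated by commas inside `filter()` must all hold, so this keeps the three rows that satisfy both tests.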
```r cmhflights %>% filter( # retain only UA flights to CMH or ORD in January or February Month %in% c(1, 2), UniqueCarrier == "UA", Dest %in% c("CMH", "ORD") ) -> my.df table(my.df$Month) ``` ``` ## ## 1 2 ## 106 145 ``` ```r table(my.df$UniqueCarrier) ``` ``` ## ## UA ## 251 ``` ```r table(my.df$Dest) ``` ``` ## ## CMH ORD ## 132 119 ``` --- ### `filter()` accepts the following operators: | Operator | Meaning | Operator | Meaning | | :---- | :---- | :---- | :---- | | `<` | is less than | `>` | is greater than | | `==` | is equal to | `<=` | is less than or equal to | | `>=` | is greater than or equal to | `!=` | is not equal to | | `%in%` | is a member of | `is.na` | is NA | | `!is.na` | is not NA | `&`, `!`, etc. | logical operators | --- class: inverse, middle, center # .large[.fancy[ Using `arrange()` to organize the observations in ascending/descending order of specific variables ]] --- Say I wanted to arrange the resulting data frame by `ascending order of departure delays`. How might I do that? .pull-left[ ```r my.df %>% arrange(DepDelay) %>% select(DepDelay) ``` ``` ## # A tibble: 251 x 1 ## DepDelay ## <dbl> ## 1 -16 ## 2 -16 ## 3 -15 ## 4 -15 ## 5 -15 ## 6 -14 ## 7 -14 ## 8 -14 ## 9 -14 ## 10 -14 ## # … with 241 more rows ``` ] .pull-right[ And now in `descending order of delays`. ```r my.df %>% arrange(-DepDelay) %>% select(DepDelay) ``` ``` ## # A tibble: 251 x 1 ## DepDelay ## <dbl> ## 1 394 ## 2 376 ## 3 235 ## 4 178 ## 5 177 ## 6 164 ## 7 112 ## 8 109 ## 9 66 ## 10 61 ## # … with 241 more rows ``` The `-` before the column name asks for descending order; `arrange(desc(DepDelay))` does the same thing ] --- We could tweak this further, perhaps saying `sort by departure delays to CMH`, and `then to ORD (Chicago's O'Hare airport)`. 
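Sorting on several keys just means listing them, in order, inside `arrange()`; `desc()` is an alternative to the `-` prefix. A quick sketch on invented data:

```r
library(dplyr)

# toy data, invented for illustration
toy <- tibble(
  Dest  = c("ORD", "CMH", "CMH", "ORD"),
  Delay = c(10, 5, 20, 2)
)

# sort by Dest (A to Z), then by Delay in descending order within each Dest
toy %>% arrange(Dest, desc(Delay))
```

The second key only breaks ties left by the first, so the two CMH rows come first (20 before 5), then the two ORD rows (10 before 2).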
.pull-left[ ```r my.df %>% arrange(Dest, -DepDelay) %>% select(Dest, -DepDelay) ``` ``` ## # A tibble: 251 x 1 ## Dest ## <chr> ## 1 CMH ## 2 CMH ## 3 CMH ## 4 CMH ## 5 CMH ## 6 CMH ## 7 CMH ## 8 CMH ## 9 CMH ## 10 CMH ## # … with 241 more rows ``` ] .pull-right[ and another one ... ```r cmhflights %>% select(Month, UniqueCarrier, Dest, DepDelay) %>% filter( Month %in% c(1, 2), UniqueCarrier == "UA", Dest %in% c("CMH", "ORD") ) %>% arrange(Month, Dest, -DepDelay) ``` ``` ## # A tibble: 251 x 4 ## Month UniqueCarrier Dest DepDelay ## <int> <chr> <chr> <dbl> ## 1 1 UA CMH 178 ## 2 1 UA CMH 61 ## 3 1 UA CMH 44 ## 4 1 UA CMH 39 ## 5 1 UA CMH 39 ## 6 1 UA CMH 30 ## 7 1 UA CMH 27 ## 8 1 UA CMH 18 ## 9 1 UA CMH 12 ## 10 1 UA CMH 10 ## # … with 241 more rows ``` ] --- class: inverse, middle, center # .fancy[.large[ Using `summarise()` to calculate summary statistics ]] --- .pull-left[ How many flights were there per Month? ```r cmhflights %>% count(Month) ``` ``` ## # A tibble: 9 x 2 ## Month n ## <int> <int> ## 1 1 3757 ## 2 2 3413 ## 3 3 4101 ## 4 4 4123 ## 5 5 4098 ## 6 6 4138 ## 7 7 4295 ## 8 8 4279 ## 9 9 3789 ``` Most flights were in July (n = 4295) ] .pull-right[ What about by days of the week AND by month? ```r cmhflights %>% count(Month, DayOfWeek) ``` ``` ## # A tibble: 63 x 3 ## Month DayOfWeek n ## <int> <int> <int> ## 1 1 1 660 ## 2 1 2 635 ## 3 1 3 516 ## 4 1 4 522 ## 5 1 5 523 ## 6 1 6 361 ## 7 1 7 540 ## 8 2 1 527 ## 9 2 2 506 ## 10 2 3 516 ## # … with 53 more rows ``` Note: The output is sorted by Month and then DayOfWeek ] --- I want to know the average departure delay and the average arrival delay for all flights, with the averages calculated in two ways -- as the mean, and as the median. 
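As a warm-up, the same multi-statistic `summarise()` pattern on a tiny invented tibble:

```r
library(dplyr)

# toy data, invented for illustration; one value is missing
toy <- tibble(delay = c(4, -2, 10, NA))

toy %>%
  summarise(
    mean_delay   = mean(delay, na.rm = TRUE),    # (4 - 2 + 10) / 3 = 4
    median_delay = median(delay, na.rm = TRUE)   # middle of -2, 4, 10 = 4
  )
```

Each `name = calculation` pair inside `summarise()` becomes one column of a one-row result.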
```r cmhflights %>% summarise( mean_arr_delay = mean(ArrDelay, na.rm = TRUE), mean_dep_delay = mean(DepDelay, na.rm = TRUE), median_arr_delay = median(ArrDelay, na.rm = TRUE), median_dep_delay = median(DepDelay, na.rm = TRUE) ) ``` ``` ## # A tibble: 1 x 4 ## mean_arr_delay mean_dep_delay median_arr_delay median_dep_delay ## <dbl> <dbl> <dbl> <dbl> ## 1 3.26 9.20 -6 -2 ``` Here, the `na.rm = TRUE` argument is important because if a column contains missing values (`NA`), R will return `NA` for the mean, median, and so on; `na.rm = TRUE` tells R to drop the missing values before calculating. You can see this below, where I have a small data-set called `df` with 4 values of x, but one of the four is missing and recorded as `NA`. See what happens when I try to calculate the mean with/without `na.rm = TRUE`. .pull-left[ ```r df = data.frame(x = c(2, 4, 9, NA)) df %>% summarise(mean.x = mean(x)) ``` ``` ## mean.x ## 1 NA ``` ] .pull-right[ ```r df %>% summarise( mean.x = mean(x, na.rm = TRUE) ) ``` ``` ## mean.x ## 1 5 ``` ] --- class: inverse, middle, center # .fancy[.large[ Using `group_by()` to apply commands by specific groups ]] --- I want some frequencies, say the number of flights by airline per month. .pull-left[ ```r cmhflights %>% group_by(Month, Carrier) %>% tally() ``` ``` ## # A tibble: 63 x 3 ## # Groups: Month [9] ## Month Carrier n ## <int> <chr> <int> ## 1 1 AA 387 ## 2 1 DL 549 ## 3 1 EV 436 ## 4 1 F9 152 ## 5 1 OO 123 ## 6 1 UA 106 ## 7 1 WN 2004 ## 8 2 AA 360 ## 9 2 DL 511 ## 10 2 EV 362 ## # … with 53 more rows ``` ] .pull-right[ ```r cmhflights %>% group_by(Month, Carrier) %>% summarise(frequency = n()) ``` ``` ## # A tibble: 63 x 3 ## # Groups: Month [9] ## Month Carrier frequency ## <int> <chr> <int> ## 1 1 AA 387 ## 2 1 DL 549 ## 3 1 EV 436 ## 4 1 F9 152 ## 5 1 OO 123 ## 6 1 UA 106 ## 7 1 WN 2004 ## 8 2 AA 360 ## 9 2 DL 511 ## 10 2 EV 362 ## # … with 53 more rows ``` ] --- .pull-left[ Now I want the number of flights per month per airline per destination. 
```r cmhflights %>% group_by(Month, Carrier, Dest) %>% tally() ``` ``` ## # A tibble: 398 x 4 ## # Groups: Month, Carrier [63] ## Month Carrier Dest n ## <int> <chr> <chr> <int> ## 1 1 AA CMH 193 ## 2 1 AA DFW 112 ## 3 1 AA LAX 24 ## 4 1 AA PHX 58 ## 5 1 DL ATL 224 ## 6 1 DL CMH 275 ## 7 1 DL DTW 2 ## 8 1 DL LAX 27 ## 9 1 DL MSP 21 ## 10 1 EV CMH 218 ## # … with 388 more rows ``` ] .pull-right[ We could keep complicating the grouping structure. For example, let us add the day of the week to the mix ... ```r cmhflights %>% group_by(Month, Carrier, Dest, DayOfWeek) %>% tally() ``` ``` ## # A tibble: 2,536 x 5 ## # Groups: Month, Carrier, Dest [398] ## Month Carrier Dest DayOfWeek n ## <int> <chr> <chr> <int> <int> ## 1 1 AA CMH 1 35 ## 2 1 AA CMH 2 23 ## 3 1 AA CMH 3 28 ## 4 1 AA CMH 4 28 ## 5 1 AA CMH 5 28 ## 6 1 AA CMH 6 21 ## 7 1 AA CMH 7 30 ## 8 1 AA DFW 1 20 ## 9 1 AA DFW 2 16 ## 10 1 AA DFW 3 16 ## # … with 2,526 more rows ``` ] --- .pull-left[ Now say I am really curious about mean departure delays for the preceding grouping structure. That is, what does mean departure delay look like for flights by day of the week, by month, by carrier, and by destination? ```r cmhflights %>% group_by(Month, Carrier, Dest, DayOfWeek) %>% summarise(mean_dep_delay = mean(DepDelay, na.rm = TRUE)) ``` ``` ## # A tibble: 2,536 x 5 ## # Groups: Month, Carrier, Dest [398] ## Month Carrier Dest DayOfWeek mean_dep_delay ## <int> <chr> <chr> <int> <dbl> ## 1 1 AA CMH 1 16.3 ## 2 1 AA CMH 2 0.304 ## 3 1 AA CMH 3 1.14 ## 4 1 AA CMH 4 3.11 ## 5 1 AA CMH 5 4.11 ## 6 1 AA CMH 6 24.9 ## 7 1 AA CMH 7 21.8 ## 8 1 AA DFW 1 25.4 ## 9 1 AA DFW 2 -2.62 ## 10 1 AA DFW 3 17.6 ## # … with 2,526 more rows ``` ] .pull-right[ But this is a complicated summary table. What if all I really want to know is what airline has the highest mean departure delays, regardless of month or destination or day of the week? 
```r cmhflights %>% group_by(Carrier) %>% summarise(mean_dep_delay = mean(DepDelay, na.rm = TRUE)) %>% arrange(-mean_dep_delay) # ordered in descending order of delays ``` ``` ## # A tibble: 7 x 2 ## Carrier mean_dep_delay ## <chr> <dbl> ## 1 EV 15.2 ## 2 F9 10.8 ## 3 WN 9.58 ## 4 AA 7.73 ## 5 DL 7.10 ## 6 OO 6.44 ## 7 UA 6.39 ``` EV is ExpressJet; F9 is Frontier Airlines; WN is Southwest Airlines; OO is SkyWest Airlines; AA is American Airlines; DL is Delta Airlines; UA is United Airlines. So clearly United Airlines had the lowest average departure delays. ] --- Would this still be true if we repeated the calculation by Month? ```r cmhflights %>% group_by(Carrier, Month) %>% summarise(mean_dep_delay = mean(DepDelay, na.rm = TRUE)) %>% arrange(mean_dep_delay) # ordered in ascending order of delays ``` ``` ## # A tibble: 63 x 3 ## # Groups: Carrier [7] ## Carrier Month mean_dep_delay ## <chr> <int> <dbl> ## 1 OO 8 -0.873 ## 2 OO 2 -0.570 ## 3 OO 9 -0.179 ## 4 UA 9 0.699 ## 5 DL 8 1.85 ## 6 DL 2 2.24 ## 7 AA 2 2.30 ## 8 F9 9 2.51 ## 9 AA 9 2.62 ## 10 DL 3 2.98 ## # … with 53 more rows ``` --- So far so good. But now I am curious about what percent of flights operated by AA, DL, UA, and WN were delayed. How could I calculate this? (1) I need to use `filter()` to restrict the data-set to just these four airlines. (2) Then I need to generate a new column that identifies whether a flight was delayed or not (`late`). (3) Now we can calculate the total number of flights (`nflights`) and the total number of flights that were delayed (`nlate`). (4) If I then calculate `\(\left( \dfrac{nlate}{nflights} \right)\times 100\)` we will end up with the percent of flights that were delayed. 
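Steps (2) through (4) can be previewed on a toy tibble first (values invented; this sketch uses `if_else()` for the yes/no split, whereas the pipeline on the next slide uses `case_when()` for the same job):

```r
library(dplyr)

# toy departure delays, invented for illustration
toy <- tibble(delay = c(-2, 5, 0, 12, -7, 3))

toy %>%
  mutate(late = if_else(delay > 0, "Yes", "No")) %>%  # step (2): flag late flights
  count(late) %>%                                     # step (3): flights per flag
  mutate(pct = n / sum(n) * 100)                      # step (4): percent per flag
```

Three of the six toy delays are positive, so both "Yes" and "No" come out at 50 percent.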
.pull-left[ ```r cmhflights %>% select(c(Carrier, DepDelay)) %>% filter(Carrier %in% c("AA", "DL", "UA", "WN")) %>% mutate(late = case_when( DepDelay > 0 ~ "Yes", DepDelay <= 0 ~ "No" ) ) %>% group_by(Carrier) %>% mutate(nflights = n()) %>% group_by(Carrier, late) %>% mutate( nlate = n(), pct_late = (nlate / nflights) * 100) -> df1 ``` ] .pull-right[ ```r df1 ``` ``` ## # A tibble: 29,328 x 6 ## # Groups: Carrier, late [12] ## Carrier DepDelay late nflights nlate pct_late ## <chr> <dbl> <chr> <int> <int> <dbl> ## 1 AA -9 No 3891 2698 69.3 ## 2 AA 24 Yes 3891 1162 29.9 ## 3 AA -6 No 3891 2698 69.3 ## 4 AA -5 No 3891 2698 69.3 ## 5 AA -7 No 3891 2698 69.3 ## 6 AA 22 Yes 3891 1162 29.9 ## 7 AA -4 No 3891 2698 69.3 ## 8 AA -4 No 3891 2698 69.3 ## 9 AA 42 Yes 3891 1162 29.9 ## 10 AA -6 No 3891 2698 69.3 ## # … with 29,318 more rows ``` ] --- .pull-left[ ```r cmhflights %>% select(c(Carrier, DepDelay)) %>% filter(Carrier %in% c("AA", "DL", "UA", "WN")) %>% mutate(late = case_when( DepDelay > 0 ~ "Yes", DepDelay <= 0 ~ "No" ) ) %>% group_by(Carrier) %>% mutate(nflights = n()) %>% group_by(Carrier, late) %>% mutate( nlate = n(), pct_late = (nlate / nflights) * 100) ``` ] .pull-right[ There is a whole lot going on here so let us break it down. 
+ `filter(Carrier %in% c("AA", "DL", "UA", "WN"))` is keeping the specified airlines' data while dropping the rest + `mutate(late = case_when(...))` is creating a new column called `late` and storing a value of "Yes" if `DepDelay > 0` and "No" if `DepDelay <= 0` + `group_by(Carrier)` is grouping the data by Carrier so that the `mutate(nflights = n())` that follows can count how many flights there were per Carrier, storing this count in a new column called `nflights` + `group_by(Carrier, late)` is grouping the data by Carrier and late + `mutate(nlate = n(), pct_late = (nlate / nflights) * 100)` is then creating two new columns: `nlate`, the number of flights in each Carrier-by-late group, and `pct_late`, the percent of each carrier's flights that falls in that group ] --- Now, we only want the flights that were late, so let us apply `select()` to keep just a few columns and then use `filter()` to keep only rows corresponding to `late = "Yes"`. This will still leave us with duplicate rows, but we can drop them via a new command, `distinct()` ```r df1 %>% filter(late == "Yes") %>% select(Carrier, pct_late) %>% distinct() %>% arrange(pct_late) ``` ``` ## # A tibble: 4 x 3 ## # Groups: Carrier, late [4] ## late Carrier pct_late ## <chr> <chr> <dbl> ## 1 Yes UA 24.0 ## 2 Yes DL 27.7 ## 3 Yes AA 29.9 ## 4 Yes WN 41.1 ``` So! 24% of UA flights were late, the lowest in this group. --- What if we wanted these percentages by Month as well? 
.pull-left[ ```r cmhflights %>% select(c(Carrier, Month, DepDelay)) %>% filter(Carrier %in% c("AA", "DL", "UA", "WN")) %>% mutate(late = case_when( DepDelay > 0 ~ "Yes", DepDelay <= 0 ~ "No" ) ) %>% group_by(Carrier, Month) %>% mutate(nflights = n()) %>% group_by(Carrier, Month, late) %>% mutate( nlate = n(), pct_late = (nlate / nflights) * 100) %>% filter(late == "Yes") %>% select(Carrier, Month, pct_late) %>% distinct() %>% arrange(pct_late) ``` ] .pull-right[ ``` ## # A tibble: 36 x 4 ## # Groups: Carrier, Month, late [36] ## late Carrier Month pct_late ## <chr> <chr> <int> <dbl> ## 1 Yes UA 9 17.9 ## 2 Yes UA 2 18.6 ## 3 Yes DL 9 20.2 ## 4 Yes UA 5 20.2 ## 5 Yes UA 4 20.4 ## 6 Yes AA 9 21.4 ## 7 Yes DL 2 21.5 ## 8 Yes DL 8 22.5 ## 9 Yes AA 2 23.1 ## 10 Yes UA 3 23.3 ## # … with 26 more rows ``` ] --- Before we move on, I want to point out something about `case_when()`. Specifically, we used it to create a new column called `late` from numeric values found in `DepDelay`. But what if we wanted to create a new column from a column that has categorical values in it, like `Dest` or `Carrier`? Easy. ```r cmhflights %>% filter(Carrier %in% c("AA", "DL", "UA")) %>% mutate(carrier_name = case_when( Carrier == "AA" ~ "American Airlines", Carrier == "DL" ~ "Delta Airlines", Carrier == "UA" ~ "United Airlines" ) ) %>% select(Carrier, carrier_name) ``` ``` ## # A tibble: 10,864 x 2 ## Carrier carrier_name ## <chr> <chr> ## 1 AA American Airlines ## 2 AA American Airlines ## 3 AA American Airlines ## 4 AA American Airlines ## 5 AA American Airlines ## 6 AA American Airlines ## 7 AA American Airlines ## 8 AA American Airlines ## 9 AA American Airlines ## 10 AA American Airlines ## # … with 10,854 more rows ``` --- Next, `case_when()` includes an option that cuts down on our work. In particular, say I want to create a new column called `weekend`, setting it to "Yes" if DayOfWeek falls on a weekend day and to "No" on any other day. 
In doing this, it would serve us well to remember that the week begins on Sunday so DayOfWeek == 1 is Sunday, not Monday. ```r cmhflights %>% mutate(weekend = case_when( DayOfWeek %in% c(7, 1) ~ "Yes", TRUE ~ "No" ) ) %>% select(DayOfWeek, weekend) %>% distinct() ``` ``` ## # A tibble: 7 x 2 ## DayOfWeek weekend ## <int> <chr> ## 1 7 Yes ## 2 1 Yes ## 3 2 No ## 4 3 No ## 5 4 No ## 6 5 No ## 7 6 No ``` Notice how `TRUE` swept up all other values of `DayOfWeek` and coded them as "No." --- One final showcasing of `case_when()`. In **Module 01** we looked at the `hsb2` data and created some `factors` for columns such as female, ses, schtyp, and so on. Well, let us see how the same thing could be done with `case_when()`. ```r read.table( 'https://stats.idre.ucla.edu/stat/data/hsb2.csv', header = TRUE, sep = "," ) -> hsb2 hsb2 %>% mutate( female.f = case_when( female == 0 ~ "Male", female == 1 ~ "Female"), race.f = case_when( race == 1 ~ "Hispanic", race == 2 ~ "Asian", race == 3 ~ "African-American", TRUE ~ "White"), ses.f = case_when( ses == 1 ~ "Low", ses == 2 ~ "Medium", TRUE ~ "High"), schtyp.f = case_when( schtyp == 1 ~ "Public", TRUE ~ "Private"), prog.f = case_when( prog == 1 ~ "General", prog == 2 ~ "Academic", TRUE ~ "Vocational") ) -> hsb2 ``` --- ### Some other `{dplyr}` commands We have seen `count()` in action but let us see it again, in a slightly different light. In particular, say I want to know how many unique destinations are connected by air from Columbus. ```r cmhflights %>% filter(Origin == "CMH") %>% count(Dest, sort = TRUE) ``` ``` ## # A tibble: 26 x 2 ## Dest n ## <chr> <int> ## 1 ATL 2884 ## 2 MDW 1511 ## 3 MCO 1148 ## 4 DFW 1122 ## 5 DEN 971 ## 6 BWI 948 ## 7 LAS 815 ## 8 PHX 815 ## 9 ORD 803 ## 10 EWR 736 ## # … with 16 more rows ``` Note: (1) No need for `group_by()` and (2) `sort = TRUE` arranges the result in descending order of the frequency (`n`). --- .pull-left[ Here is another code example, this time adding Carrier to the mix. 
```r cmhflights %>% filter(Origin == "CMH") %>% count(Carrier, Dest, sort = TRUE) ``` ``` ## # A tibble: 45 x 3 ## Carrier Dest n ## <chr> <chr> <int> ## 1 DL ATL 2141 ## 2 WN MDW 1511 ## 3 AA DFW 1122 ## 4 WN BWI 948 ## 5 WN MCO 929 ## 6 WN ATL 743 ## 7 EV EWR 736 ## 8 WN TPA 595 ## 9 UA ORD 577 ## 10 WN LAS 542 ## # … with 35 more rows ``` ] .pull-right[ Another useful command is `n_distinct()`, which lets us calculate the number of distinct values of any column. For example, say I want to know how many unique aircraft (not airlines) there are in this data-set. ```r cmhflights %>% summarise(n_distinct(TailNum)) ``` ``` ## # A tibble: 1 x 1 ## `n_distinct(TailNum)` ## <int> ## 1 2248 ``` ] --- .pull-left[ How many flights do we see per unique aircraft? ```r cmhflights %>% group_by(TailNum) %>% summarise(Number = n()) %>% arrange(-Number) ``` ``` ## # A tibble: 2,248 x 2 ## TailNum Number ## <chr> <int> ## 1 N396SW 74 ## 2 <NA> 72 ## 3 N601WN 66 ## 4 N646SW 64 ## 5 N635SW 62 ## 6 N391SW 61 ## 7 N637SW 60 ## 8 N941FR 60 ## 9 N951FR 60 ## 10 N397SW 58 ## # … with 2,238 more rows ``` ] .pull-right[ If you wanted options in terms of carriers when flying out of Columbus, which destination would give you maximal options? ```r cmhflights %>% filter(Origin == "CMH") %>% group_by(Dest) %>% summarise( unique.carriers = n_distinct(Carrier) ) %>% arrange(-unique.carriers) ``` ``` ## # A tibble: 26 x 2 ## Dest unique.carriers ## <chr> <int> ## 1 DEN 4 ## 2 DTW 3 ## 3 IAH 3 ## 4 MCO 3 ## 5 MSP 3 ## 6 ORD 3 ## 7 RSW 3 ## 8 ATL 2 ## 9 LAS 2 ## 10 LAX 2 ## # … with 16 more rows ``` ] --- ### top_n() If you want to see the top 'n' number of observations, for example the 4 airlines with the most aircraft, you can lean on `top_n()`, as shown below. 
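A side note, assuming a newer release of `{dplyr}` (>= 1.0.0): `top_n()` has since been superseded by `slice_max()` and `slice_min()`, which make the sort direction explicit. A quick sketch on invented data:

```r
library(dplyr)

# toy data, invented for illustration
toy <- tibble(
  name    = c("a", "b", "c", "d"),
  flights = c(4, 9, 1, 7)
)

toy %>% slice_max(flights, n = 2)  # the two largest: b (9) and d (7)
toy %>% slice_min(flights, n = 1)  # the smallest: c (1)
```

Unlike `top_n()`, there is no guessing about which end of the ordering you get.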
.pull-left[ ```r cmhflights %>% group_by(Carrier) %>% summarise(num.flights = n_distinct(TailNum)) %>% arrange(-num.flights) %>% top_n(4) ``` ``` ## # A tibble: 4 x 2 ## Carrier num.flights ## <chr> <int> ## 1 WN 751 ## 2 DL 539 ## 3 UA 289 ## 4 OO 222 ``` ] .pull-right[ I am also curious about which aircraft has flown the most, and then the three other aircraft that follow in descending order. ```r cmhflights %>% filter(!is.na(TailNum)) %>% # Removing some missing cases group_by(TailNum) %>% tally() %>% arrange(-n) %>% top_n(4) ``` ``` ## # A tibble: 4 x 2 ## TailNum n ## <chr> <int> ## 1 N396SW 74 ## 2 N601WN 66 ## 3 N646SW 64 ## 4 N635SW 62 ``` ] --- ## join() You will, from time to time, need to merge multiple data-sets together. For example, say I have the following data-sets, created for demonstration purposes. ```r tibble( Name = c("Tim", "Tammy", "Bubbles", "Panda"), Score = c(5, 8, 9, 10)) -> df1 tibble( Name = c("Tim", "Tammy", "Bubbles"), Age = c(25, 78, 19)) -> df2 tibble( Name = c("Tim", "Tammy", "Panda"), Education = c("BA", "PhD", "JD")) -> df3 ``` --- ```r df1; df2; df3 ``` ``` ## # A tibble: 4 x 2 ## Name Score ## <chr> <dbl> ## 1 Tim 5 ## 2 Tammy 8 ## 3 Bubbles 9 ## 4 Panda 10 ``` ``` ## # A tibble: 3 x 2 ## Name Age ## <chr> <dbl> ## 1 Tim 25 ## 2 Tammy 78 ## 3 Bubbles 19 ``` ``` ## # A tibble: 3 x 2 ## Name Education ## <chr> <chr> ## 1 Tim BA ## 2 Tammy PhD ## 3 Panda JD ``` Notice that Panda is absent from `df2` and Bubbles is absent from `df3`. So if we wanted to build ONE data-set with all data for Tim, Tammy, Bubbles, and Panda, some of the information would be missing for some of these folks. But how could we construct ONE data-set? Via one of a few `join()` commands. --- #### full_join() Let us start with a simple full_join, where we link up every individual in df1 or df2 or df3 **regardless of whether they are seen in all data-sets**. 
```r df1 %>% full_join(df2, by = "Name") %>% full_join(df3, by = "Name") ``` ``` ## # A tibble: 4 x 4 ## Name Score Age Education ## <chr> <dbl> <dbl> <chr> ## 1 Tim 5 25 BA ## 2 Tammy 8 78 PhD ## 3 Bubbles 9 19 <NA> ## 4 Panda 10 NA JD ``` Pay attention to two things: (i) Name connects the records in each data-set, and so it must be spelled exactly the same for a link to be made, and (ii) the `full_join()` links up all individuals regardless of whether they are missing any information in any of the data-sets. --- This is usually how most folks will link up multiple files unless they only want records found in a master file. For example, say I want to link up df2 and df3 such that the final result includes every record found in df2, the master data-set, with matching information pulled in from df3 wherever it is available. This calls for `left_join()`: ```r df2 %>% left_join(df3, by = "Name") ``` ``` ## # A tibble: 3 x 3 ## Name Age Education ## <chr> <dbl> <chr> ## 1 Tim 25 BA ## 2 Tammy 78 PhD ## 3 Bubbles 19 <NA> ``` Notice that Panda is dropped because it is not found in df2. Maybe you want df3 to be the master file, in which case you would see a different result (with Bubbles not seen in the result since Bubbles is found in df2 but not in df3): ```r df3 %>% left_join(df2, by = "Name") ``` ``` ## # A tibble: 3 x 3 ## Name Education Age ## <chr> <chr> <dbl> ## 1 Tim BA 25 ## 2 Tammy PhD 78 ## 3 Panda JD NA ``` --- Rarely, but definitely not "never," you may want to see the records of one data-set that have no match in the other. Here, anti_join() comes in handy, thus: .pull-left[ ```r df2 %>% anti_join(df3, by = "Name") ``` ``` ## # A tibble: 1 x 2 ## Name Age ## <chr> <dbl> ## 1 Bubbles 19 ``` ] .pull-right[ ```r df3 %>% anti_join(df2, by = "Name") ``` ``` ## # A tibble: 1 x 2 ## Name Education ## <chr> <chr> ## 1 Panda JD ``` ] --- ## {santoku} Every now and then you may want or need to create a grouped version of some numeric variable. For example, we have DepDelay for all flights but want to group this into `quartiles`. 
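Base R can also do this kind of binning with `cut()` and `quantile()`; a minimal sketch on invented values, before we turn to a friendlier package:

```r
# invented values standing in for a numeric column such as DepDelay
x <- c(-5, 0, 2, 7, 11, 20, 33, 41)

# cut() bins x at the quartile boundaries returned by quantile();
# include.lowest = TRUE keeps the minimum inside the first bin
groups <- cut(
  x,
  breaks = quantile(x, probs = seq(0, 1, 0.25)),
  include.lowest = TRUE
)
table(groups)  # two of the eight values land in each quartile bin
```

This works, but you have to manage the break points and edge cases yourself.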
How can we do that? In many ways, but the easiest might be to use a dedicated library -- `{santoku}`. Say, for example, I want to create 4 groups of `DepDelay`, and I want these such that we are grouping `DepDelay` into the bottom 25%, next 25%, the next 25%, and finally the highest 25%. Wait, these are the `quartiles`! Fair enough, but how can I do this? ```r library(santoku) cmhflights %>% filter(!is.na(DepDelay)) %>% mutate( depdelay_groups = chop_equally(DepDelay, groups = 4) ) %>% group_by(depdelay_groups) %>% tally() ``` ``` ## # A tibble: 4 x 2 ## depdelay_groups n ## <fct> <int> ## 1 [0%, 25%) 6887 ## 2 [25%, 50%) 9267 ## 3 [50%, 75%) 10143 ## 4 [75%, 100%] 9225 ``` --- What if we wanted to slice up DepDelay in specific intervals, first at 0, then at 15, then at 30, and then at 45? ```r cmhflights %>% mutate( depdelay_groups = chop(DepDelay, breaks = c(0, 15, 30, 45)) ) %>% group_by(depdelay_groups) %>% tally() ``` ``` ## # A tibble: 6 x 2 ## depdelay_groups n ## <fct> <int> ## 1 [-27, 0) 21108 ## 2 [0, 15) 7883 ## 3 [15, 30) 2567 ## 4 [30, 45] 1302 ## 5 (45, 1323] 2662 ## 6 <NA> 471 ``` --- ### Creating Quintiles ```r cmhflights %>% filter(!is.na(DepDelay)) %>% mutate( depdelay_groups = chop_quantiles( DepDelay, c(0.2, 0.4, 0.6, 0.8) ) ) %>% group_by(depdelay_groups) %>% tally() ``` ``` ## # A tibble: 5 x 2 ## depdelay_groups n ## <fct> <int> ## 1 [0%, 20%) 6887 ## 2 [20%, 40%) 6342 ## 3 [40%, 60%) 7879 ## 4 [60%, 80%] 7362 ## 5 (80%, 100%] 7052 ``` --- ### Creating Deciles ```r cmhflights %>% filter(!is.na(DepDelay)) %>% mutate( depdelay_groups = chop_quantiles( DepDelay, seq(0.1, 0.9, by = 0.1) ) ) %>% group_by(depdelay_groups) %>% tally() ``` ``` ## # A tibble: 10 x 2 ## depdelay_groups n ## <fct> <int> ## 1 [0%, 10%) 3006 ## 2 [10%, 20%) 3881 ## 3 [20%, 30%) 3344 ## 4 [30%, 40%) 2998 ## 5 [40%, 50%) 2925 ## 6 [50%, 60%) 4954 ## 7 [60%, 70%) 3262 ## 8 [70%, 80%) 3773 ## 9 [80%, 90%] 3904 ## 10 (90%, 100%] 3475 ``` --- ## ordered() More often than we 
would like to see happen, we end up with categorical variables that should follow a certain order but do not. For example, say you have survey data where people were asked to respond whether they Agree, are Neutral, or Disagree with some statement. Let us also assume that the frequencies are as follows: ```r tibble( response = c(rep("Agree", 25), rep("Neutral", 30), rep("Disagree", 45)) ) -> mydf mydf %>% group_by(response) %>% tally() ``` ``` ## # A tibble: 3 x 2 ## response n ## <chr> <int> ## 1 Agree 25 ## 2 Disagree 45 ## 3 Neutral 30 ``` --- Notice how the responses are out of order, with Agree followed by Disagree, then Neutral, since R defaults to alphabetic ordering for anything that is a categorical variable. One way to ensure the correct ordering of categorical variables is via `ordered`, as shown below. ```r mydf %>% mutate(ordered.response = ordered( response, levels = c("Agree", "Neutral", "Disagree") ) ) %>% group_by(ordered.response) %>% tally() ``` ``` ## # A tibble: 3 x 2 ## ordered.response n ## <ord> <int> ## 1 Agree 25 ## 2 Neutral 30 ## 3 Disagree 45 ``` --- class: right, middle <img class="circle" src="https://github.com/aniruhil.png" width="175px"/> # Find me at... 
[<svg style="height:0.8em;top:.04em;position:relative;" viewBox="0 0 512 512"><path d="M459.37 151.716c.325 4.548.325 9.097.325 13.645 0 138.72-105.583 298.558-298.558 298.558-59.452 0-114.68-17.219-161.137-47.106 8.447.974 16.568 1.299 25.34 1.299 49.055 0 94.213-16.568 130.274-44.832-46.132-.975-84.792-31.188-98.112-72.772 6.498.974 12.995 1.624 19.818 1.624 9.421 0 18.843-1.3 27.614-3.573-48.081-9.747-84.143-51.98-84.143-102.985v-1.299c13.969 7.797 30.214 12.67 47.431 13.319-28.264-18.843-46.781-51.005-46.781-87.391 0-19.492 5.197-37.36 14.294-52.954 51.655 63.675 129.3 105.258 216.365 109.807-1.624-7.797-2.599-15.918-2.599-24.04 0-57.828 46.782-104.934 104.934-104.934 30.213 0 57.502 12.67 76.67 33.137 23.715-4.548 46.456-13.32 66.599-25.34-7.798 24.366-24.366 44.833-46.132 57.827 21.117-2.273 41.584-8.122 60.426-16.243-14.292 20.791-32.161 39.308-52.628 54.253z"/></svg> @aruhil](http://twitter.com/aruhil) [<svg style="height:0.8em;top:.04em;position:relative;" viewBox="0 0 512 512"><path d="M326.612 185.391c59.747 59.809 58.927 155.698.36 214.59-.11.12-.24.25-.36.37l-67.2 67.2c-59.27 59.27-155.699 59.262-214.96 0-59.27-59.26-59.27-155.7 0-214.96l37.106-37.106c9.84-9.84 26.786-3.3 27.294 10.606.648 17.722 3.826 35.527 9.69 52.721 1.986 5.822.567 12.262-3.783 16.612l-13.087 13.087c-28.026 28.026-28.905 73.66-1.155 101.96 28.024 28.579 74.086 28.749 102.325.51l67.2-67.19c28.191-28.191 28.073-73.757 0-101.83-3.701-3.694-7.429-6.564-10.341-8.569a16.037 16.037 0 0 1-6.947-12.606c-.396-10.567 3.348-21.456 11.698-29.806l21.054-21.055c5.521-5.521 14.182-6.199 20.584-1.731a152.482 152.482 0 0 1 20.522 17.197zM467.547 44.449c-59.261-59.262-155.69-59.27-214.96 0l-67.2 67.2c-.12.12-.25.25-.36.37-58.566 58.892-59.387 154.781.36 214.59a152.454 152.454 0 0 0 20.521 17.196c6.402 4.468 15.064 3.789 20.584-1.731l21.054-21.055c8.35-8.35 12.094-19.239 11.698-29.806a16.037 16.037 0 0 0-6.947-12.606c-2.912-2.005-6.64-4.875-10.341-8.569-28.073-28.073-28.191-73.639 
0-101.83l67.2-67.19c28.239-28.239 74.3-28.069 102.325.51 27.75 28.3 26.872 73.934-1.155 101.96l-13.087 13.087c-4.35 4.35-5.769 10.79-3.783 16.612 5.864 17.194 9.042 34.999 9.69 52.721.509 13.906 17.454 20.446 27.294 10.606l37.106-37.106c59.271-59.259 59.271-155.699.001-214.959z"/></svg> aniruhil.org](https://aniruhil.org) [<svg style="height:0.8em;top:.04em;position:relative;" viewBox="0 0 512 512"><path d="M476 3.2L12.5 270.6c-18.1 10.4-15.8 35.6 2.2 43.2L121 358.4l287.3-253.2c5.5-4.9 13.3 2.6 8.6 8.3L176 407v80.5c0 23.6 28.5 32.9 42.5 15.8L282 426l124.6 52.2c14.2 6 30.4-2.9 33-18.2l72-432C515 7.8 493.3-6.8 476 3.2z"/></svg> ruhil@ohio.edu](mailto:ruhil@ohio.edu)