class: title-slide, center, middle background-image: url(images/ouaerial.jpeg) background-size: cover # .crimson[.fancy[Core Data Operations in R]] ## .crimson[.fancy[Ani Ruhil]] --- name: agenda ## .fancy[ Agenda ] <center><img src = "https://www.elderresearch.com/hs-fs/hubfs/Data%20Wrangling.bmp?width=1200&name=Data%20Wrangling.bmp" width = "600px"></center> > Data wrangling is the process of gathering, selecting, and transforming data to answer an analytical question. Also known as data cleaning or “munging”, legend has it that this wrangling costs analytics professionals as much as 80% of their time, leaving only 20% for exploration and modeling. [@Mike Thurber](https://www.elderresearch.com/blog/what-is-data-wrangling) > The process of manual data cleansing prior to analysis is known as data munging [... and ...] can be a laborious task without the right tools. The common interface used for data munging is often Excel, which lacks the sophistication for collaboration and automation to make the process efficient. ... Data munging is a time-consuming and disjointed process that gets in the way of extracting true value and potential from data. [@Trifacta](https://www.trifacta.com/data-munging/) --- class: inverse, middle, center <center><img src = "images/hex-dplyr.png" width = "300px"></center> --- "[dplyr](http://dplyr.tidyverse.org/) is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges." The core functions are: | **What you want to do ...** | **`{dplyr}` function** | | :----- | :----- | | select columns to work with? | `select()` | | use a subset of the data based on some criterion? | `filter()` | | arrange the data in ascending/descending order of variable(s)? | `arrange()` | | save the results of your calculations as a standalone data frame? | `summarise()` | | add your calculated value(s) to the existing data frame? 
| `mutate()` | | add your calculated value(s) to the existing data frame but also drop all variables/columns not being used in the calculation? | `transmute()` | | calculate averages, frequencies, etc. by groups? | `group_by()` | --- In order to understand how these commands work, we rely on a simple data frame -- all flights originating from or arriving in Columbus (Ohio), January through September of 2017. Let us load the data and the packages we need. ```r library(tidyverse) # loads a few packages we need library(tidylog) # tracks the result of each command as we clean our data library(here) # our friend that helps us construct the correct filepath load( here("data", "cmhflights_01092017.RData") ) dim(cmhflights) ``` ``` ## [1] 35993 110 ``` You should see 110 columns, the last being `X110`, an empty column, so we drop it. ```r cmhflights$X110 <- NULL ``` --- class: inverse, middle, center # .large[.fancy[ Using `select()` to pick columns to work with ]] --- Say I only want the first five columns -- Year, Quarter, Month, DayofMonth, DayOfWeek. ```r cmhflights %>% select( Year:DayOfWeek # select sequential columns Year through DayOfWeek ) -> my.df names(my.df) ``` ``` ## [1] "Year" "Quarter" "Month" "DayofMonth" "DayOfWeek" ``` What if the columns I want are not sequentially located? ```r cmhflights %>% select( Year, FlightDate:UniqueCarrier, TailNum # select Year, then sequential columns FlightDate through UniqueCarrier, and then TailNum ) -> my.df names(my.df) ``` ``` ## [1] "Year" "FlightDate" "UniqueCarrier" "TailNum" ``` --- Could we use column numbers instead? Sure! ```r cmhflights %>% select( c(1, 3, 5, 7) # select columns 1, 3, 5, and 7 ) -> my.df names(my.df) ``` ``` ## [1] "Year" "Month" "DayOfWeek" "UniqueCarrier" ``` You can also choose columns in other ways, with `contains()`, `starts_with()`, and `ends_with()`. 
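Before applying these helpers to `cmhflights`, here is how they behave on a tiny tibble invented purely for illustration (the column names mimic the flight data):

```r
library(dplyr)

# toy tibble, invented for illustration; names mimic the flight data
toy <- tibble(
  DepTime = 1, DepDelay = 2, ArrTime = 3, TailNum = "N1", FlightNum = 9
)

toy %>% select(starts_with("Dep")) %>% names()  # "DepTime"  "DepDelay"
toy %>% select(ends_with("Num")) %>% names()    # "TailNum"  "FlightNum"
toy %>% select(contains("Time")) %>% names()    # "DepTime"  "ArrTime"
```

Each helper matches column *names*: by prefix, by suffix, or anywhere in the name.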
```r cmhflights %>% select( contains("Carrier") # select columns with names that contain the word "Carrier" ) -> my.df names(my.df) ``` ``` ## [1] "UniqueCarrier" "Carrier" "CarrierDelay" ``` --- ```r cmhflights %>% select( starts_with("De") # select columns with names that start with the text "De" ) -> my.df names(my.df) ``` ``` ## [1] "DestAirportID" "DestAirportSeqID" "DestCityMarketID" ## [4] "Dest" "DestCityName" "DestState" ## [7] "DestStateFips" "DestStateName" "DestWac" ## [10] "DepTime" "DepDelay" "DepDelayMinutes" ## [13] "DepDel15" "DepartureDelayGroups" "DepTimeBlk" ``` ```r cmhflights %>% select( ends_with("Num") # select columns with names that end with "Num" ) -> my.df names(my.df) ``` ``` ## [1] "TailNum" "FlightNum" "Div1TailNum" "Div2TailNum" "Div3TailNum" ## [6] "Div4TailNum" "Div5TailNum" ``` --- class: inverse, middle, center # .large[.fancy[ Using `filter()` to select observations to work with ]] --- With `dplyr`, the `filter()` command helps us trim the data-set down to specific observations of interest. For instance, say we only want records from January, i.e., `Month == 1`. ```r # retain only the observations from January cmhflights %>% filter( Month == 1 # find all rows where Month is exactly equal to 1 ) -> my.df table(my.df$Month) ``` ``` ## ## 1 ## 3757 ``` What about only American Airlines flights in January? ```r cmhflights %>% filter( Month == 1, UniqueCarrier == "AA" # retain only AA flights from January ) -> my.df table(my.df$Month, my.df$UniqueCarrier) ``` ``` ## ## AA ## 1 387 ``` --- What about United Airlines flights in January to CMH (Columbus, OH)? ```r cmhflights %>% filter( # retain only UA flights to CMH in January Month == 1, UniqueCarrier == "UA", Dest == "CMH" ) -> my.df table(my.df$Month) ``` ``` ## ## 1 ## 53 ``` ```r table(my.df$UniqueCarrier) ``` ``` ## ## UA ## 53 ``` ```r table(my.df$Dest) ``` ``` ## ## CMH ## 53 ``` --- What if I wanted a more complicated filter, say, flights in January or February to CMH or ORD? 
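The workhorse for "is one of several values" questions is the `%in%` operator. A minimal sketch on a toy tibble (values invented for illustration):

```r
library(dplyr)

# toy data, invented for illustration
toy <- tibble(
  Month = c(1, 2, 3, 1, 2),
  Dest  = c("CMH", "ORD", "CMH", "LAX", "ORD")
)

toy %>%
  filter(
    Month %in% c(1, 2),        # Month is 1 or 2 ...
    Dest %in% c("CMH", "ORD")  # ... and Dest is CMH or ORD
  )
```

Conditions separated by commas inside `filter()` must all hold, so this keeps the three rows that satisfy both tests.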
```r cmhflights %>% filter( # retain only UA flights to CMH or ORD in January or February Month %in% c(1, 2), UniqueCarrier == "UA", Dest %in% c("CMH", "ORD") ) -> my.df table(my.df$Month) ``` ``` ## ## 1 2 ## 106 145 ``` ```r table(my.df$UniqueCarrier) ``` ``` ## ## UA ## 251 ``` ```r table(my.df$Dest) ``` ``` ## ## CMH ORD ## 132 119 ``` --- ### `filter()` accepts the following operators: | Operator | Meaning | Operator | Meaning | | :---- | :---- | :---- | :---- | | `<` | is less than | `>` | is greater than | | `==` | is equal to | `<=` | is less than or equal to | | `>=` | is greater than or equal to | `!=` | is not equal to | | `%in%` | is a member of | `is.na` | is NA | | `!is.na` | is not NA | `&`, `!`, etc. | logical operators | --- class: inverse, middle, center # .large[.fancy[ Using `arrange()` to organize the observations in ascending/descending order of specific variables ]] --- Say I wanted to arrange the resulting data frame by `ascending order of departure delays`. How might I do that? .pull-left[ ```r my.df %>% arrange(DepDelay) %>% select(DepDelay) ``` ``` ## # A tibble: 251 x 1 ## DepDelay ## <dbl> ## 1 -16 ## 2 -16 ## 3 -15 ## 4 -15 ## 5 -15 ## 6 -14 ## 7 -14 ## 8 -14 ## 9 -14 ## 10 -14 ## # … with 241 more rows ``` ] .pull-right[ And now in `descending order of delays`. ```r my.df %>% arrange(-DepDelay) %>% select(DepDelay) ``` ``` ## # A tibble: 251 x 1 ## DepDelay ## <dbl> ## 1 394 ## 2 376 ## 3 235 ## 4 178 ## 5 177 ## 6 164 ## 7 112 ## 8 109 ## 9 66 ## 10 61 ## # … with 241 more rows ``` The `-` before the column name asks for descending order; `arrange(desc(DepDelay))` does the same thing ] --- We could tweak this further, perhaps saying `sort by departure delays to CMH`, and `then to ORD (Chicago's O'Hare airport)`. 
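Sorting on several keys just means listing them, in order, inside `arrange()`; `desc()` is an alternative to the `-` prefix. A quick sketch on invented data:

```r
library(dplyr)

# toy data, invented for illustration
toy <- tibble(
  Dest  = c("ORD", "CMH", "CMH", "ORD"),
  Delay = c(10, 5, 20, 2)
)

# sort by Dest (A to Z), then by Delay in descending order within each Dest
toy %>% arrange(Dest, desc(Delay))
```

The second key only breaks ties left by the first, so the two CMH rows come first (20 before 5), then the two ORD rows (10 before 2).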
.pull-left[ ```r my.df %>% arrange(Dest, -DepDelay) %>% select(Dest, -DepDelay) ``` ``` ## # A tibble: 251 x 1 ## Dest ## <chr> ## 1 CMH ## 2 CMH ## 3 CMH ## 4 CMH ## 5 CMH ## 6 CMH ## 7 CMH ## 8 CMH ## 9 CMH ## 10 CMH ## # … with 241 more rows ``` ] .pull-right[ and another one ... ```r cmhflights %>% select(Month, UniqueCarrier, Dest, DepDelay) %>% filter( Month %in% c(1, 2), UniqueCarrier == "UA", Dest %in% c("CMH", "ORD") ) %>% arrange(Month, Dest, -DepDelay) ``` ``` ## # A tibble: 251 x 4 ## Month UniqueCarrier Dest DepDelay ## <int> <chr> <chr> <dbl> ## 1 1 UA CMH 178 ## 2 1 UA CMH 61 ## 3 1 UA CMH 44 ## 4 1 UA CMH 39 ## 5 1 UA CMH 39 ## 6 1 UA CMH 30 ## 7 1 UA CMH 27 ## 8 1 UA CMH 18 ## 9 1 UA CMH 12 ## 10 1 UA CMH 10 ## # … with 241 more rows ``` ] --- class: inverse, middle, center # .fancy[.large[ Using `summarise()` to calculate summary statistics ]] --- .pull-left[ How many flights were there per Month? ```r cmhflights %>% count(Month) ``` ``` ## # A tibble: 9 x 2 ## Month n ## <int> <int> ## 1 1 3757 ## 2 2 3413 ## 3 3 4101 ## 4 4 4123 ## 5 5 4098 ## 6 6 4138 ## 7 7 4295 ## 8 8 4279 ## 9 9 3789 ``` Most flights were in July (n = 4295) ] .pull-right[ What about by days of the week AND by month? ```r cmhflights %>% count(Month, DayOfWeek) ``` ``` ## # A tibble: 63 x 3 ## Month DayOfWeek n ## <int> <int> <int> ## 1 1 1 660 ## 2 1 2 635 ## 3 1 3 516 ## 4 1 4 522 ## 5 1 5 523 ## 6 1 6 361 ## 7 1 7 540 ## 8 2 1 527 ## 9 2 2 506 ## 10 2 3 516 ## # … with 53 more rows ``` Note: The output is sorted by Month and then DayOfWeek ] --- I want to know the average departure delay and the average arrival delay for all flights, with the averages calculated in two ways -- as the mean, and as the median. 
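As a warm-up, the same multi-statistic `summarise()` pattern on a tiny invented tibble:

```r
library(dplyr)

# toy data, invented for illustration; one value is missing
toy <- tibble(delay = c(4, -2, 10, NA))

toy %>%
  summarise(
    mean_delay   = mean(delay, na.rm = TRUE),    # (4 - 2 + 10) / 3 = 4
    median_delay = median(delay, na.rm = TRUE)   # middle of -2, 4, 10 = 4
  )
```

Each `name = calculation` pair inside `summarise()` becomes one column of a one-row result.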
```r cmhflights %>% summarise( mean_arr_delay = mean(ArrDelay, na.rm = TRUE), mean_dep_delay = mean(DepDelay, na.rm = TRUE), median_arr_delay = median(ArrDelay, na.rm = TRUE), median_dep_delay = median(DepDelay, na.rm = TRUE) ) ``` ``` ## # A tibble: 1 x 4 ## mean_arr_delay mean_dep_delay median_arr_delay median_dep_delay ## <dbl> <dbl> <dbl> <dbl> ## 1 3.26 9.20 -6 -2 ``` Here, the `na.rm = TRUE` argument is important because if a column contains missing values (`NA`), R will return `NA` for the mean, median, and so on; `na.rm = TRUE` tells R to drop the missing values before calculating. You can see this below, where I have a small data-set called `df` with 4 values of x, but one of the four is missing and recorded as `NA`. See what happens when I try to calculate the mean with/without `na.rm = TRUE`. .pull-left[ ```r df = data.frame(x = c(2, 4, 9, NA)) df %>% summarise(mean.x = mean(x)) ``` ``` ## mean.x ## 1 NA ``` ] .pull-right[ ```r df %>% summarise( mean.x = mean(x, na.rm = TRUE) ) ``` ``` ## mean.x ## 1 5 ``` ] --- class: inverse, middle, center # .fancy[.large[ Using `group_by()` to apply commands by specific groups ]] --- I want some frequencies, say the number of flights by airline per month. .pull-left[ ```r cmhflights %>% group_by(Month, Carrier) %>% tally() ``` ``` ## # A tibble: 63 x 3 ## # Groups: Month [9] ## Month Carrier n ## <int> <chr> <int> ## 1 1 AA 387 ## 2 1 DL 549 ## 3 1 EV 436 ## 4 1 F9 152 ## 5 1 OO 123 ## 6 1 UA 106 ## 7 1 WN 2004 ## 8 2 AA 360 ## 9 2 DL 511 ## 10 2 EV 362 ## # … with 53 more rows ``` ] .pull-right[ ```r cmhflights %>% group_by(Month, Carrier) %>% summarise(frequency = n()) ``` ``` ## # A tibble: 63 x 3 ## # Groups: Month [9] ## Month Carrier frequency ## <int> <chr> <int> ## 1 1 AA 387 ## 2 1 DL 549 ## 3 1 EV 436 ## 4 1 F9 152 ## 5 1 OO 123 ## 6 1 UA 106 ## 7 1 WN 2004 ## 8 2 AA 360 ## 9 2 DL 511 ## 10 2 EV 362 ## # … with 53 more rows ``` ] --- .pull-left[ Now I want the number of flights per month per airline per destination. 
```r cmhflights %>% group_by(Month, Carrier, Dest) %>% tally() ``` ``` ## # A tibble: 398 x 4 ## # Groups: Month, Carrier [63] ## Month Carrier Dest n ## <int> <chr> <chr> <int> ## 1 1 AA CMH 193 ## 2 1 AA DFW 112 ## 3 1 AA LAX 24 ## 4 1 AA PHX 58 ## 5 1 DL ATL 224 ## 6 1 DL CMH 275 ## 7 1 DL DTW 2 ## 8 1 DL LAX 27 ## 9 1 DL MSP 21 ## 10 1 EV CMH 218 ## # … with 388 more rows ``` ] .pull-right[ We could keep complicating the grouping structure. For example, let us add the day of the week to the mix ... ```r cmhflights %>% group_by(Month, Carrier, Dest, DayOfWeek) %>% tally() ``` ``` ## # A tibble: 2,536 x 5 ## # Groups: Month, Carrier, Dest [398] ## Month Carrier Dest DayOfWeek n ## <int> <chr> <chr> <int> <int> ## 1 1 AA CMH 1 35 ## 2 1 AA CMH 2 23 ## 3 1 AA CMH 3 28 ## 4 1 AA CMH 4 28 ## 5 1 AA CMH 5 28 ## 6 1 AA CMH 6 21 ## 7 1 AA CMH 7 30 ## 8 1 AA DFW 1 20 ## 9 1 AA DFW 2 16 ## 10 1 AA DFW 3 16 ## # … with 2,526 more rows ``` ] --- .pull-left[ Now say I am really curious about mean departure delays for the preceding grouping structure. That is, what does mean departure delay look like for flights by day of the week, by month, by carrier, and by destination? ```r cmhflights %>% group_by(Month, Carrier, Dest, DayOfWeek) %>% summarise(mean_dep_delay = mean(DepDelay, na.rm = TRUE)) ``` ``` ## # A tibble: 2,536 x 5 ## # Groups: Month, Carrier, Dest [398] ## Month Carrier Dest DayOfWeek mean_dep_delay ## <int> <chr> <chr> <int> <dbl> ## 1 1 AA CMH 1 16.3 ## 2 1 AA CMH 2 0.304 ## 3 1 AA CMH 3 1.14 ## 4 1 AA CMH 4 3.11 ## 5 1 AA CMH 5 4.11 ## 6 1 AA CMH 6 24.9 ## 7 1 AA CMH 7 21.8 ## 8 1 AA DFW 1 25.4 ## 9 1 AA DFW 2 -2.62 ## 10 1 AA DFW 3 17.6 ## # … with 2,526 more rows ``` ] .pull-right[ But this is a complicated summary table. What if all I really want to know is what airline has the highest mean departure delays, regardless of month or destination or day of the week? 
```r cmhflights %>% group_by(Carrier) %>% summarise(mean_dep_delay = mean(DepDelay, na.rm = TRUE)) %>% arrange(-mean_dep_delay) # ordered in descending order of delays ``` ``` ## # A tibble: 7 x 2 ## Carrier mean_dep_delay ## <chr> <dbl> ## 1 EV 15.2 ## 2 F9 10.8 ## 3 WN 9.58 ## 4 AA 7.73 ## 5 DL 7.10 ## 6 OO 6.44 ## 7 UA 6.39 ``` EV is ExpressJet; F9 is Frontier Airlines; WN is Southwest Airlines; OO is SkyWest Airlines; AA is American Airlines; DL is Delta Airlines; UA is United Airlines. So clearly United Airlines had the lowest average departure delays. ] --- Would this still be true if we repeated the calculation by Month? ```r cmhflights %>% group_by(Carrier, Month) %>% summarise(mean_dep_delay = mean(DepDelay, na.rm = TRUE)) %>% arrange(mean_dep_delay) # ordered in ascending order of delays ``` ``` ## # A tibble: 63 x 3 ## # Groups: Carrier [7] ## Carrier Month mean_dep_delay ## <chr> <int> <dbl> ## 1 OO 8 -0.873 ## 2 OO 2 -0.570 ## 3 OO 9 -0.179 ## 4 UA 9 0.699 ## 5 DL 8 1.85 ## 6 DL 2 2.24 ## 7 AA 2 2.30 ## 8 F9 9 2.51 ## 9 AA 9 2.62 ## 10 DL 3 2.98 ## # … with 53 more rows ``` --- So far so good. But now I am curious about what percent of flights operated by AA, DL, UA, and WN were delayed. How could I calculate this? (1) I need to use `filter()` to restrict the data-set to just these four airlines. (2) Then I need to generate a new column that identifies whether a flight was delayed or not (`late`). (3) Now we can calculate the total number of flights (`nflights`) and the total number of flights that were delayed (`nlate`). (4) If I then calculate `\(\left( \dfrac{nlate}{nflights} \right)\times 100\)` we will end up with the percent of flights that were delayed. 
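Steps (2) through (4) can be previewed on a toy tibble first (values invented; this sketch uses `if_else()` for the yes/no split, whereas the pipeline on the next slide uses `case_when()` for the same job):

```r
library(dplyr)

# toy departure delays, invented for illustration
toy <- tibble(delay = c(-2, 5, 0, 12, -7, 3))

toy %>%
  mutate(late = if_else(delay > 0, "Yes", "No")) %>%  # step (2): flag late flights
  count(late) %>%                                     # step (3): flights per flag
  mutate(pct = n / sum(n) * 100)                      # step (4): percent per flag
```

Three of the six toy delays are positive, so both "Yes" and "No" come out at 50 percent.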
.pull-left[ ```r cmhflights %>% select(c(Carrier, DepDelay)) %>% filter(Carrier %in% c("AA", "DL", "UA", "WN")) %>% mutate(late = case_when( DepDelay > 0 ~ "Yes", DepDelay <= 0 ~ "No" ) ) %>% group_by(Carrier) %>% mutate(nflights = n()) %>% group_by(Carrier, late) %>% mutate( nlate = n(), pct_late = (nlate / nflights) * 100) -> df1 ``` ] .pull-right[ ```r df1 ``` ``` ## # A tibble: 29,328 x 6 ## # Groups: Carrier, late [12] ## Carrier DepDelay late nflights nlate pct_late ## <chr> <dbl> <chr> <int> <int> <dbl> ## 1 AA -9 No 3891 2698 69.3 ## 2 AA 24 Yes 3891 1162 29.9 ## 3 AA -6 No 3891 2698 69.3 ## 4 AA -5 No 3891 2698 69.3 ## 5 AA -7 No 3891 2698 69.3 ## 6 AA 22 Yes 3891 1162 29.9 ## 7 AA -4 No 3891 2698 69.3 ## 8 AA -4 No 3891 2698 69.3 ## 9 AA 42 Yes 3891 1162 29.9 ## 10 AA -6 No 3891 2698 69.3 ## # … with 29,318 more rows ``` ] --- .pull-left[ ```r cmhflights %>% select(c(Carrier, DepDelay)) %>% filter(Carrier %in% c("AA", "DL", "UA", "WN")) %>% mutate(late = case_when( DepDelay > 0 ~ "Yes", DepDelay <= 0 ~ "No" ) ) %>% group_by(Carrier) %>% mutate(nflights = n()) %>% group_by(Carrier, late) %>% mutate( nlate = n(), pct_late = (nlate / nflights) * 100) ``` ] .pull-right[ There is a whole lot going on here so let us break it down. 
+ `filter(Carrier %in% c("AA", "DL", "UA", "WN"))` is keeping the specified airlines' data while dropping the rest + `mutate(late = case_when(...))` is creating a new column called `late` and storing a value of "Yes" if `DepDelay > 0` and "No" if `DepDelay <= 0` + `group_by(Carrier)` is grouping the data by Carrier so that the `mutate(nflights = n())` that follows can count how many flights there were per Carrier, storing this count in a new column called `nflights` + `group_by(Carrier, late)` is grouping the data by Carrier and late + `mutate(nlate = n(), pct_late = (nlate / nflights) * 100)` is then creating two new columns: `nlate`, the number of flights in each Carrier-by-late group, and `pct_late`, the percent of each carrier's flights that falls in that group ] --- Now, we only want the flights that were late, so let us apply `select()` to keep just a few columns and then use `filter()` to keep only rows corresponding to `late = "Yes"`. This will still leave us with duplicate rows, but we can drop them via a new command, `distinct()` ```r df1 %>% filter(late == "Yes") %>% select(Carrier, pct_late) %>% distinct() %>% arrange(pct_late) ``` ``` ## # A tibble: 4 x 3 ## # Groups: Carrier, late [4] ## late Carrier pct_late ## <chr> <chr> <dbl> ## 1 Yes UA 24.0 ## 2 Yes DL 27.7 ## 3 Yes AA 29.9 ## 4 Yes WN 41.1 ``` So! 24% of UA flights were late, the lowest in this group. --- What if we wanted these percentages by Month as well? 
.pull-left[ ```r cmhflights %>% select(c(Carrier, Month, DepDelay)) %>% filter(Carrier %in% c("AA", "DL", "UA", "WN")) %>% mutate(late = case_when( DepDelay > 0 ~ "Yes", DepDelay <= 0 ~ "No" ) ) %>% group_by(Carrier, Month) %>% mutate(nflights = n()) %>% group_by(Carrier, Month, late) %>% mutate( nlate = n(), pct_late = (nlate / nflights) * 100) %>% filter(late == "Yes") %>% select(Carrier, Month, pct_late) %>% distinct() %>% arrange(pct_late) ``` ] .pull-right[ ``` ## # A tibble: 36 x 4 ## # Groups: Carrier, Month, late [36] ## late Carrier Month pct_late ## <chr> <chr> <int> <dbl> ## 1 Yes UA 9 17.9 ## 2 Yes UA 2 18.6 ## 3 Yes DL 9 20.2 ## 4 Yes UA 5 20.2 ## 5 Yes UA 4 20.4 ## 6 Yes AA 9 21.4 ## 7 Yes DL 2 21.5 ## 8 Yes DL 8 22.5 ## 9 Yes AA 2 23.1 ## 10 Yes UA 3 23.3 ## # … with 26 more rows ``` ] --- Before we move on, I want to point out something about `case_when()`. Specifically, we used it to create a new column called `late` from numeric values found in `DepDelay`. But what if we wanted to create a new column from a column that has categorical values in it, like `Dest` or `Carrier`? Easy. ```r cmhflights %>% filter(Carrier %in% c("AA", "DL", "UA")) %>% mutate(carrier_name = case_when( Carrier == "AA" ~ "American Airlines", Carrier == "DL" ~ "Delta Airlines", Carrier == "UA" ~ "United Airlines" ) ) %>% select(Carrier, carrier_name) ``` ``` ## # A tibble: 10,864 x 2 ## Carrier carrier_name ## <chr> <chr> ## 1 AA American Airlines ## 2 AA American Airlines ## 3 AA American Airlines ## 4 AA American Airlines ## 5 AA American Airlines ## 6 AA American Airlines ## 7 AA American Airlines ## 8 AA American Airlines ## 9 AA American Airlines ## 10 AA American Airlines ## # … with 10,854 more rows ``` --- Next, `case_when()` includes an option that cuts down on our work. In particular, say I want to create a new column called `weekend`, setting it to "Yes" if DayOfWeek falls on a weekend day and to "No" on any other day. 
In doing this, it would serve us well to remember that the week begins on Sunday so DayOfWeek == 1 is Sunday, not Monday. ```r cmhflights %>% mutate(weekend = case_when( DayOfWeek %in% c(7, 1) ~ "Yes", TRUE ~ "No" ) ) %>% select(DayOfWeek, weekend) %>% distinct() ``` ``` ## # A tibble: 7 x 2 ## DayOfWeek weekend ## <int> <chr> ## 1 7 Yes ## 2 1 Yes ## 3 2 No ## 4 3 No ## 5 4 No ## 6 5 No ## 7 6 No ``` Notice how `TRUE` swept up all other values of `DayOfWeek` and coded them as "No." --- One final showcasing of `case_when()`. In **Module 01** we looked at the `hsb2` data and created some `factors` for columns such as female, ses, schtyp, and so on. Well, let us see how the same thing could be done with `case_when()`. ```r read.table( 'https://stats.idre.ucla.edu/stat/data/hsb2.csv', header = TRUE, sep = "," ) -> hsb2 hsb2 %>% mutate( female.f = case_when( female == 0 ~ "Male", female == 1 ~ "Female"), race.f = case_when( race == 1 ~ "Hispanic", race == 2 ~ "Asian", race == 3 ~ "African-American", TRUE ~ "White"), ses.f = case_when( ses == 1 ~ "Low", ses == 2 ~ "Medium", TRUE ~ "High"), schtyp.f = case_when( schtyp == 1 ~ "Public", TRUE ~ "Private"), prog.f = case_when( prog == 1 ~ "General", prog == 2 ~ "Academic", TRUE ~ "Vocational") ) -> hsb2 ``` --- ### Some other `{dplyr}` commands We have seen `count()` in action but let us see it again, in a slightly different light. In particular, say I want to know how many unique destinations are connected by air from Columbus. ```r cmhflights %>% filter(Origin == "CMH") %>% count(Dest, sort = TRUE) ``` ``` ## # A tibble: 26 x 2 ## Dest n ## <chr> <int> ## 1 ATL 2884 ## 2 MDW 1511 ## 3 MCO 1148 ## 4 DFW 1122 ## 5 DEN 971 ## 6 BWI 948 ## 7 LAS 815 ## 8 PHX 815 ## 9 ORD 803 ## 10 EWR 736 ## # … with 16 more rows ``` Note: (1) No need for `group_by()` and (2) `sort = TRUE` arranges the result in descending order of the frequency (`n`). --- .pull-left[ Here is another code example, this time adding Carrier to the mix. 
```r cmhflights %>% filter(Origin == "CMH") %>% count(Carrier, Dest, sort = TRUE) ``` ``` ## # A tibble: 45 x 3 ## Carrier Dest n ## <chr> <chr> <int> ## 1 DL ATL 2141 ## 2 WN MDW 1511 ## 3 AA DFW 1122 ## 4 WN BWI 948 ## 5 WN MCO 929 ## 6 WN ATL 743 ## 7 EV EWR 736 ## 8 WN TPA 595 ## 9 UA ORD 577 ## 10 WN LAS 542 ## # … with 35 more rows ``` ] .pull-right[ Another useful command is `n_distinct()`, which lets us calculate the number of distinct values of any column. For example, say I want to know how many unique aircraft (not airlines) there are in this data-set. ```r cmhflights %>% summarise(n_distinct(TailNum)) ``` ``` ## # A tibble: 1 x 1 ## `n_distinct(TailNum)` ## <int> ## 1 2248 ``` ] --- .pull-left[ How many flights do we see per unique aircraft? ```r cmhflights %>% group_by(TailNum) %>% summarise(Number = n()) %>% arrange(-Number) ``` ``` ## # A tibble: 2,248 x 2 ## TailNum Number ## <chr> <int> ## 1 N396SW 74 ## 2 <NA> 72 ## 3 N601WN 66 ## 4 N646SW 64 ## 5 N635SW 62 ## 6 N391SW 61 ## 7 N637SW 60 ## 8 N941FR 60 ## 9 N951FR 60 ## 10 N397SW 58 ## # … with 2,238 more rows ``` ] .pull-right[ If you wanted options in terms of carriers when flying out of Columbus, which destination would give you maximal options? ```r cmhflights %>% filter(Origin == "CMH") %>% group_by(Dest) %>% summarise( unique.carriers = n_distinct(Carrier) ) %>% arrange(-unique.carriers) ``` ``` ## # A tibble: 26 x 2 ## Dest unique.carriers ## <chr> <int> ## 1 DEN 4 ## 2 DTW 3 ## 3 IAH 3 ## 4 MCO 3 ## 5 MSP 3 ## 6 ORD 3 ## 7 RSW 3 ## 8 ATL 2 ## 9 LAS 2 ## 10 LAX 2 ## # … with 16 more rows ``` ] --- ### top_n() If you want to see the top 'n' number of observations, for example the 4 airlines with the most aircraft, you can lean on `top_n()`, as shown below. 
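A side note, assuming a newer release of `{dplyr}` (>= 1.0.0): `top_n()` has since been superseded by `slice_max()` and `slice_min()`, which make the sort direction explicit. A quick sketch on invented data:

```r
library(dplyr)

# toy data, invented for illustration
toy <- tibble(
  name    = c("a", "b", "c", "d"),
  flights = c(4, 9, 1, 7)
)

toy %>% slice_max(flights, n = 2)  # the two largest: b (9) and d (7)
toy %>% slice_min(flights, n = 1)  # the smallest: c (1)
```

Unlike `top_n()`, there is no guessing about which end of the ordering you get.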
.pull-left[ ```r cmhflights %>% group_by(Carrier) %>% summarise(num.flights = n_distinct(TailNum)) %>% arrange(-num.flights) %>% top_n(4) ``` ``` ## # A tibble: 4 x 2 ## Carrier num.flights ## <chr> <int> ## 1 WN 751 ## 2 DL 539 ## 3 UA 289 ## 4 OO 222 ``` ] .pull-right[ I am also curious about which aircraft has flown the most, and then the three other aircraft that follow in descending order. ```r cmhflights %>% filter(!is.na(TailNum)) %>% # Removing some missing cases group_by(TailNum) %>% tally() %>% arrange(-n) %>% top_n(4) ``` ``` ## # A tibble: 4 x 2 ## TailNum n ## <chr> <int> ## 1 N396SW 74 ## 2 N601WN 66 ## 3 N646SW 64 ## 4 N635SW 62 ``` ] --- ## join() You will, from time to time, need to merge multiple data-sets together. For example, say I have the following data-sets, created for demonstration purposes. ```r tibble( Name = c("Tim", "Tammy", "Bubbles", "Panda"), Score = c(5, 8, 9, 10)) -> df1 tibble( Name = c("Tim", "Tammy", "Bubbles"), Age = c(25, 78, 19)) -> df2 tibble( Name = c("Tim", "Tammy", "Panda"), Education = c("BA", "PhD", "JD")) -> df3 ``` --- ```r df1; df2; df3 ``` ``` ## # A tibble: 4 x 2 ## Name Score ## <chr> <dbl> ## 1 Tim 5 ## 2 Tammy 8 ## 3 Bubbles 9 ## 4 Panda 10 ``` ``` ## # A tibble: 3 x 2 ## Name Age ## <chr> <dbl> ## 1 Tim 25 ## 2 Tammy 78 ## 3 Bubbles 19 ``` ``` ## # A tibble: 3 x 2 ## Name Education ## <chr> <chr> ## 1 Tim BA ## 2 Tammy PhD ## 3 Panda JD ``` Notice that Panda is absent from `df2` and Bubbles is absent from `df3`. So if we wanted to build ONE data-set with all data for Tim, Tammy, Bubbles, and Panda, some of the information would be missing for some of these folks. But how could we construct ONE data-set? Via one of a few `join()` commands. --- #### full_join() Let us start with a simple full_join, where we link up every individual in df1 or df2 or df3 **regardless of whether they are seen in all data-sets**. 
```r df1 %>% full_join(df2, by = "Name") %>% full_join(df3, by = "Name") ``` ``` ## # A tibble: 4 x 4 ## Name Score Age Education ## <chr> <dbl> <dbl> <chr> ## 1 Tim 5 25 BA ## 2 Tammy 8 78 PhD ## 3 Bubbles 9 19 <NA> ## 4 Panda 10 NA JD ``` Pay attention to two things: (i) Name connects the records in each data-set, and so it must be spelled exactly the same for a link to be made, and (ii) the `full_join()` links up all individuals regardless of whether they are missing any information in any of the data-sets. --- This is usually how most folks will link up multiple files unless they only want records found in a master file. For example, say I want to link up df2 and df3 such that the final result includes every record found in df2, the master data-set, with matching information pulled in from df3 wherever it is available. This calls for `left_join()`: ```r df2 %>% left_join(df3, by = "Name") ``` ``` ## # A tibble: 3 x 3 ## Name Age Education ## <chr> <dbl> <chr> ## 1 Tim 25 BA ## 2 Tammy 78 PhD ## 3 Bubbles 19 <NA> ``` Notice that Panda is dropped because it is not found in df2. Maybe you want df3 to be the master file, in which case you would see a different result (with Bubbles not seen in the result since Bubbles is found in df2 but not in df3): ```r df3 %>% left_join(df2, by = "Name") ``` ``` ## # A tibble: 3 x 3 ## Name Education Age ## <chr> <chr> <dbl> ## 1 Tim BA 25 ## 2 Tammy PhD 78 ## 3 Panda JD NA ``` --- Rarely, but definitely not "never," you may want to see the records of one data-set that have no match in the other. Here, anti_join() comes in handy, thus: .pull-left[ ```r df2 %>% anti_join(df3, by = "Name") ``` ``` ## # A tibble: 1 x 2 ## Name Age ## <chr> <dbl> ## 1 Bubbles 19 ``` ] .pull-right[ ```r df3 %>% anti_join(df2, by = "Name") ``` ``` ## # A tibble: 1 x 2 ## Name Education ## <chr> <chr> ## 1 Panda JD ``` ] --- ## {santoku} Every now and then you may want or need to create a grouped version of some numeric variable. For example, we have DepDelay for all flights but want to group this into `quartiles`. 
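Base R can also do this kind of binning with `cut()` and `quantile()`; a minimal sketch on invented values, before we turn to a friendlier package:

```r
# invented values standing in for a numeric column such as DepDelay
x <- c(-5, 0, 2, 7, 11, 20, 33, 41)

# cut() bins x at the quartile boundaries returned by quantile();
# include.lowest = TRUE keeps the minimum inside the first bin
groups <- cut(
  x,
  breaks = quantile(x, probs = seq(0, 1, 0.25)),
  include.lowest = TRUE
)
table(groups)  # two of the eight values land in each quartile bin
```

This works, but you have to manage the break points and edge cases yourself.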
How can we do that? In many ways, but the easiest might be to use a dedicated library -- `{santoku}`. Say, for example, I want to create 4 groups of `DepDelay`, and I want these such that we are grouping `DepDelay` into the bottom 25%, next 25%, the next 25%, and finally the highest 25%. Wait, these are the `quartiles`! Fair enough, but how can I do this? ```r library(santoku) cmhflights %>% filter(!is.na(DepDelay)) %>% mutate( depdelay_groups = chop_equally(DepDelay, groups = 4) ) %>% group_by(depdelay_groups) %>% tally() ``` ``` ## # A tibble: 4 x 2 ## depdelay_groups n ## <fct> <int> ## 1 [0%, 25%) 6887 ## 2 [25%, 50%) 9267 ## 3 [50%, 75%) 10143 ## 4 [75%, 100%] 9225 ``` --- What if we wanted to slice up DepDelay in specific intervals, first at 0, then at 15, then at 30, and then at 45? ```r cmhflights %>% mutate( depdelay_groups = chop(DepDelay, breaks = c(0, 15, 30, 45)) ) %>% group_by(depdelay_groups) %>% tally() ``` ``` ## # A tibble: 6 x 2 ## depdelay_groups n ## <fct> <int> ## 1 [-27, 0) 21108 ## 2 [0, 15) 7883 ## 3 [15, 30) 2567 ## 4 [30, 45] 1302 ## 5 (45, 1323] 2662 ## 6 <NA> 471 ``` --- ### Creating Quintiles ```r cmhflights %>% filter(!is.na(DepDelay)) %>% mutate( depdelay_groups = chop_quantiles( DepDelay, c(0.2, 0.4, 0.6, 0.8) ) ) %>% group_by(depdelay_groups) %>% tally() ``` ``` ## # A tibble: 5 x 2 ## depdelay_groups n ## <fct> <int> ## 1 [0%, 20%) 6887 ## 2 [20%, 40%) 6342 ## 3 [40%, 60%) 7879 ## 4 [60%, 80%] 7362 ## 5 (80%, 100%] 7052 ``` --- ### Creating Deciles ```r cmhflights %>% filter(!is.na(DepDelay)) %>% mutate( depdelay_groups = chop_quantiles( DepDelay, seq(0.1, 0.9, by = 0.1) ) ) %>% group_by(depdelay_groups) %>% tally() ``` ``` ## # A tibble: 10 x 2 ## depdelay_groups n ## <fct> <int> ## 1 [0%, 10%) 3006 ## 2 [10%, 20%) 3881 ## 3 [20%, 30%) 3344 ## 4 [30%, 40%) 2998 ## 5 [40%, 50%) 2925 ## 6 [50%, 60%) 4954 ## 7 [60%, 70%) 3262 ## 8 [70%, 80%) 3773 ## 9 [80%, 90%] 3904 ## 10 (90%, 100%] 3475 ``` --- ## ordered() More often than we 
would like to see happen, we end up with categorical variables that should follow a certain order but do not. For example, say you have survey data where people were asked to respond whether they Agree, are Neutral, or Disagree with some statement. Let us also assume that the frequencies are as follows: ```r tibble( response = c(rep("Agree", 25), rep("Neutral", 30), rep("Disagree", 45)) ) -> mydf mydf %>% group_by(response) %>% tally() ``` ``` ## # A tibble: 3 x 2 ## response n ## <chr> <int> ## 1 Agree 25 ## 2 Disagree 45 ## 3 Neutral 30 ``` --- Notice how the responses are out of order, with Agree followed by Disagree, then Neutral, since R defaults to alphabetic ordering for anything that is a categorical variable. One way to ensure the correct ordering of categorical variables is via `ordered`, as shown below. ```r mydf %>% mutate(ordered.response = ordered( response, levels = c("Agree", "Neutral", "Disagree") ) ) %>% group_by(ordered.response) %>% tally() ``` ``` ## # A tibble: 3 x 2 ## ordered.response n ## <ord> <int> ## 1 Agree 25 ## 2 Neutral 30 ## 3 Disagree 45 ``` --- class: right, middle <img class="circle" src="https://github.com/aniruhil.png" width="175px"/> # Find me at... 
[<svg style="height:0.8em;top:.04em;position:relative;" viewBox="0 0 512 512"><path d="M459.37 151.716c.325 4.548.325 9.097.325 13.645 0 138.72-105.583 298.558-298.558 298.558-59.452 0-114.68-17.219-161.137-47.106 8.447.974 16.568 1.299 25.34 1.299 49.055 0 94.213-16.568 130.274-44.832-46.132-.975-84.792-31.188-98.112-72.772 6.498.974 12.995 1.624 19.818 1.624 9.421 0 18.843-1.3 27.614-3.573-48.081-9.747-84.143-51.98-84.143-102.985v-1.299c13.969 7.797 30.214 12.67 47.431 13.319-28.264-18.843-46.781-51.005-46.781-87.391 0-19.492 5.197-37.36 14.294-52.954 51.655 63.675 129.3 105.258 216.365 109.807-1.624-7.797-2.599-15.918-2.599-24.04 0-57.828 46.782-104.934 104.934-104.934 30.213 0 57.502 12.67 76.67 33.137 23.715-4.548 46.456-13.32 66.599-25.34-7.798 24.366-24.366 44.833-46.132 57.827 21.117-2.273 41.584-8.122 60.426-16.243-14.292 20.791-32.161 39.308-52.628 54.253z"/></svg> @aruhil](http://twitter.com/aruhil) [<svg style="height:0.8em;top:.04em;position:relative;" viewBox="0 0 512 512"><path d="M326.612 185.391c59.747 59.809 58.927 155.698.36 214.59-.11.12-.24.25-.36.37l-67.2 67.2c-59.27 59.27-155.699 59.262-214.96 0-59.27-59.26-59.27-155.7 0-214.96l37.106-37.106c9.84-9.84 26.786-3.3 27.294 10.606.648 17.722 3.826 35.527 9.69 52.721 1.986 5.822.567 12.262-3.783 16.612l-13.087 13.087c-28.026 28.026-28.905 73.66-1.155 101.96 28.024 28.579 74.086 28.749 102.325.51l67.2-67.19c28.191-28.191 28.073-73.757 0-101.83-3.701-3.694-7.429-6.564-10.341-8.569a16.037 16.037 0 0 1-6.947-12.606c-.396-10.567 3.348-21.456 11.698-29.806l21.054-21.055c5.521-5.521 14.182-6.199 20.584-1.731a152.482 152.482 0 0 1 20.522 17.197zM467.547 44.449c-59.261-59.262-155.69-59.27-214.96 0l-67.2 67.2c-.12.12-.25.25-.36.37-58.566 58.892-59.387 154.781.36 214.59a152.454 152.454 0 0 0 20.521 17.196c6.402 4.468 15.064 3.789 20.584-1.731l21.054-21.055c8.35-8.35 12.094-19.239 11.698-29.806a16.037 16.037 0 0 0-6.947-12.606c-2.912-2.005-6.64-4.875-10.341-8.569-28.073-28.073-28.191-73.639 
0-101.83l67.2-67.19c28.239-28.239 74.3-28.069 102.325.51 27.75 28.3 26.872 73.934-1.155 101.96l-13.087 13.087c-4.35 4.35-5.769 10.79-3.783 16.612 5.864 17.194 9.042 34.999 9.69 52.721.509 13.906 17.454 20.446 27.294 10.606l37.106-37.106c59.271-59.259 59.271-155.699.001-214.959z"/></svg> aniruhil.org](https://aniruhil.org) [<svg style="height:0.8em;top:.04em;position:relative;" viewBox="0 0 512 512"><path d="M476 3.2L12.5 270.6c-18.1 10.4-15.8 35.6 2.2 43.2L121 358.4l287.3-253.2c5.5-4.9 13.3 2.6 8.6 8.3L176 407v80.5c0 23.6 28.5 32.9 42.5 15.8L282 426l124.6 52.2c14.2 6 30.4-2.9 33-18.2l72-432C515 7.8 493.3-6.8 476 3.2z"/></svg> ruhil@ohio.edu](mailto:ruhil@ohio.edu)