Tidy and Tweak Your Data

In this module you will learn how three very powerful packages work their magic. {dplyr} and {tidyr} have quickly gained a following because they allow you to manage seemingly unwieldy data to calculate quantities of interest with relative ease, and to flip data from “long” to “wide” formats or vice-versa. {stringr} is a sister-package that works well with strings and often useful for key data cleaning.

Ani Ruhil
2022-02-03

Our goal is to understand how two very powerful packages work their magic. Of recent vintage, {dplyr} and {tidyr} have quickly gained a following because they allow you to manage seemingly unwieldy data to calculate quantities of interest with relative ease. In this module we will see some of their core abilities.

{dplyr}

You will usually end up needing and using some or all of these six functions:

In order to understand how these commands work we rely on a simple data frame – all flights originating and departing from Columbus (Ohio) for January through December of 2017. Let us load the data and {dplyr}.

library(here)
library(tidyverse)
library(tidylog, warn.conflicts = FALSE)

load(
  here("data", "cmhflights2017.RData")
  )

You will want to install {tidylog}, a great package by Benjamin Elbers, since tidylog shows you what each step of a basic {dplyr} operation has done.

select(...)

Often your data frame will have more columns than you need to calculate some quantity – for example, the mean of some variable or the frequency distribution of some categorical variable. Say I only want the second through sixth column – Year, Quarter, Month, DayofMonth, and DayofWeek. Before I do that, though, let me render all column names to be lowercase; running clean_names() from the {janitor} package after you read in any data-set is a good habit to acquire early on in your career.

library(janitor)
clean_names(cmhflights2017) -> cmh.df 

Selecting consecutive columns

Having done that, I could generate a data frame containing only the desired columns as follows:

cmh.df %>% 
  select(
    year:day_of_week
    ) -> my.df 

names(my.df)
[1] "year"        "quarter"     "month"       "dayof_month"
[5] "day_of_week"

Selecting both consecutive and non-consecutive columns

What if the columns were not sequentially located? In that case we would need to list each column we want, using the : operator as shorthand for sequential columns.

cmh.df %>% 
  select(
    year,
    flight_date:reporting_airline,
    tail_number
    ) -> my.df 

names(my.df)
[1] "year"              "flight_date"       "reporting_airline"
[4] "tail_number"      

Selecting with column numbers instead of column names

Could we use column numbers instead? Absolutely, provided you know the column numbers. Yes, provided you know the column numbers of the columns you want (and these can be obtained by running names() on the original data-set).

cmh.df %>% 
  select(
    c(1, 3, 5, 7)
    ) -> my.df 

names(my.df)
[1] "id"          "quarter"     "dayof_month" "flight_date"

Selecting columns in other ways

You can also select columns in other ways, with contains, starts_with, ends_with, any_of, all_of, matches, and num_range. As these function names suggest, contains() will look for a specific string in the column names, starts_with() and ends_with() look for specific starting and ending strings, respectively, matches() will look for a matching string, and any_of() will look for any of the strings listed in quotes inside the parentheses while all_of() will look for all listed strings and throw an error if even one is not found.

cmh.df %>% 
  select(
    contains("carrier")
    ) -> my.df # using contains() 

names(my.df)
[1] "carrier_delay"
cmh.df %>% 
  select(
    starts_with("de")
    ) -> my.df # using starts_with() 

names(my.df)
 [1] "dest_airport_id"        "dest_airport_seq_id"   
 [3] "dest_city_market_id"    "dest"                  
 [5] "dest_city_name"         "dest_state"            
 [7] "dest_state_fips"        "dest_state_name"       
 [9] "dest_wac"               "dep_time"              
[11] "dep_delay"              "dep_delay_minutes"     
[13] "dep_del15"              "departure_delay_groups"
[15] "dep_time_blk"          
cmh.df %>% 
  select(
    ends_with("num")
    ) -> my.df # using ends_with() 

names(my.df)
[1] "div1tail_num" "div2tail_num" "div3tail_num" "div4tail_num"
[5] "div5tail_num"

Now here is any_of(), with the desired variables fed via creating a new object called myvars

myvars = c("year", "month", "dest")

cmh.df %>% 
  select(
    any_of(
      myvars
      )
    ) -> my.df # using any_of() 

names(my.df)
[1] "year"  "month" "dest" 

Note that destined is not found but the other two are extracted any way. What about with all_of()?

cmh.df %>%
  select(
    all_of(myvars)
  ) -> my.df

What if I misspelled any of the variables? Try running the code below

mymalvars = c("year", "month", "destined")

cmh.df %>%
  select(
    all_of(mymalvars)
  ) -> my.df

and you should see this message:

Error: Can't subset columns that don't exist.
x Column `destined` doesn't exist.
cmh.df %>% 
  select(
    matches(
      "wheel"
      )
    ) -> my.df # using matches()

names(my.df)
 [1] "wheels_off"     "wheels_on"      "div1wheels_on" 
 [4] "div1wheels_off" "div2wheels_on"  "div2wheels_off"
 [7] "div3wheels_on"  "div3wheels_off" "div4wheels_on" 
[10] "div4wheels_off" "div5wheels_on"  "div5wheels_off"

What about num_range()?

billboard %>%
  select(
    num_range("wk", 1:4)
  ) -> my.bb

names(my.bb)
[1] "wk1" "wk2" "wk3" "wk4"

rename(...)

Variable names can be modified with the rename() command. The sequence is easy to remember: newname = oldname, as shown below. Just remember: When you do this, the original variable name has been overwritten so do not invoke it in any subsequent code.

cmh.df %>%
  rename(
    flt_number = flight_number_reporting_airline
    ) %>%
  select(
    tail_number, flt_number
    ) -> df.rename 

names(df.rename)
[1] "tail_number" "flt_number" 

filter(...)

Recall that with the diamonds data, in the previous week we used subset() to select particular cases for analysis. With {dplyr}, the filter() command serves the same purpose. Say we only want records from January, i.e., month == 1.

cmh.df %>% 
  filter(
    month == 1
    ) -> my.df 

What about only American Airline flights in January?

cmh.df %>% 
  filter(
    month == 1 &
    reporting_airline == "AA"
    ) -> my.df 

What about United Airlines flights in January to CMH (Columbus, OH) from any originating airport?

cmh.df %>% 
  filter(
    month == 1 & reporting_airline == "UA" & dest == "CMH"
    ) -> my.df 

Note: Instead of & you could use , … it serves the same purpose; test it out for yourself with the code below.

cmh.df %>% 
  filter(
    month == 1, reporting_airline == "UA", dest == "CMH"
    ) -> my.df 

filter(...) with %in%

What if I wanted a more complicated filter, say, flights in January or February either to CMH (Columbus, Ohio) or to ORD (the O’Hare airport in Chicago, Illinois)?

cmh.df %>% 
  filter(
    month %in% c(1, 2) &
      reporting_airline == "UA" &
      dest %in% c("CMH", "ORD")
    ) -> my.df 

Some times you may also want to filter with the opposite of %in%, as in not in some list. You can do this as well, but note where the ! is applied – to the column-name.

cmh.df %>% 
  filter(
    month %in% c(1, 2) &
      reporting_airline == "UA" &
      !dest %in% c("CMH", "ORD")
    ) -> my.df 

Now let us say I wanted to arrange the resulting data frame in ascending order of departure delays. How might I do that?

arrange(...)

my.df %>% 
  arrange(
    dep_delay_minutes
    ) -> my.df2 

And now in descending order of delays. Note the use of the minus symbol - to reverse the sort order

my.df %>% 
  arrange(
    -dep_delay_minutes
    ) -> my.df2 

We could tweak this further, perhaps asking R to first sort by descending order of departure delays to CMH and then to ORD. Note that dest is ascending and DepDelayMinutes is descending

my.df %>% 
  arrange(
    dest, -dep_delay_minutes
    ) -> my.df2 

Piping away with %>%

So far, we have seen each function in isolation. Now we streamline things a bit so that we only end up with the columns we want, organized as we want them to be. That is, we can string together multiple functions to have a slim data frame of only those columns we selected, filtered, and arranged according to our wishes.

cmh.df %>% 
  select(
    month, reporting_airline, dest, dep_delay_minutes
    ) %>% 
  filter(
    month %in% c(1, 2) &
      reporting_airline == "UA" &
      dest %in% c("CMH", "ORD")
    ) %>% 
  arrange(
    month, dest, desc(dep_delay_minutes)
    ) -> my.df3 

Here, the end result is a data frame arranged by month, then within month by destination, and then finally by descending order of flight delay times.

group_by(...)

You often want to carry out some operations/calculations by specific groups, for example, calculate the mean or median age of the men in our sample versus the women, find out how many unique destinations does each airline fly to from Columbus, and so on. The group_by() command comes in very handy here to set the grouping structure.

Let us calculate the Mean and Median delay times. To do this, we will lean on the group_by() function to calculate median and mean delays by month and destination. While we are at it, let us see the standard deviation as well.

cmh.df %>% 
  select(
    month, reporting_airline, dest, dep_delay_minutes
    ) %>% 
  filter(
    month %in% c(1, 2),
    reporting_airline == "UA",
    dest %in% c("CMH", "ORD")
    ) %>% 
  group_by(month, dest) %>% 
  summarise(
    median.delay = median(dep_delay_minutes, na.rm = TRUE), 
    mean.delay = mean(dep_delay_minutes, na.rm = TRUE),
    sd.delay = sd(dep_delay_minutes, na.rm = TRUE)
    ) %>% 
  arrange(month, dest) -> tab.01 

What if wanted to know how many airports can you fly to from Columbus, and how many flights there were to each airport in July?

cmh.df %>% 
  select(
    month, reporting_airline, dest, dep_delay_minutes, origin
    ) %>% 
  filter(
    month %in% c(7),
    origin %in% c("CMH")
    ) %>% 
  group_by(dest) %>% 
  summarise(
    no.of.flights = n()
    ) -> tab.02 

How many distinct airlines fly out of Columbus?

cmh.df %>% 
  select(
    month, reporting_airline, dest, dep_delay_minutes, origin
    ) %>% 
  filter(
    origin %in% c("CMH")
    ) %>% 
  group_by(reporting_airline) %>% 
  summarise(
    no.of.airlines = n_distinct(reporting_airline)
    ) -> tab.03 

And how many flights did each airline operate?

cmh.df %>% 
  select(
    month, reporting_airline, dest, dep_delay_minutes, origin
    ) %>% 
  filter(
    origin %in% c("CMH")
    ) %>% 
  group_by(reporting_airline) %>% 
  summarise(
    no.of.flights = n()
    ) -> tab.04 

Of all of these airlines, which one has the highest mean departure delay?

cmh.df %>% 
  select(
    month, reporting_airline, dest, dep_delay_minutes, origin
    ) %>% 
  filter(
    origin %in% c("CMH")
    ) %>% 
  group_by(reporting_airline) %>% 
  summarise(
    mean.delay = mean(dep_delay_minutes, na.rm = TRUE)
    ) %>%
  arrange(
    desc(mean.delay)
    ) -> tab.05 

ungroup(...)

If you want to remove a grouping structure you merely need to specify ungroup(), as in

cmh.df %>% 
  select(
    month, reporting_airline, dest, dep_delay_minutes, origin
    ) %>% 
  filter(
    origin %in% c("CMH")
    ) %>% 
  group_by(reporting_airline) %>% 
  summarise(
    median.delay = median(dep_delay_minutes, na.rm = TRUE)
    ) %>%
  ungroup() -> tab.06 

ungroup() is often handy when you need to pipe several commands but the grouping structure is different for each. For example, maybe I want to calculate departure delays by unique airline, and then by unique combinations of airline & origin.

cmh.df %>% 
  select(
    month, reporting_airline, dest, dep_delay_minutes, origin
    ) %>% 
  group_by(reporting_airline) %>% 
  mutate(
    mean.delay.1 = mean(dep_delay_minutes, na.rm = TRUE)
    ) %>%
  ungroup() %>%
  group_by(reporting_airline, origin) %>% 
  mutate(
    mean.delay.2 = mean(dep_delay_minutes, na.rm = TRUE)
    )
# A tibble: 47,787 × 7
# Groups:   reporting_airline, origin [53]
   month reporting_airline dest  dep_delay_minutes origin mean.delay.1
   <dbl> <chr>             <chr>             <dbl> <chr>         <dbl>
 1     1 DL                ATL                   0 CMH            8.90
 2     1 DL                ATL                   1 CMH            8.90
 3     1 DL                CMH                   0 ATL            8.90
 4     1 DL                ATL                   0 CMH            8.90
 5     1 DL                CMH                   0 ATL            8.90
 6     1 DL                CMH                   0 LAX            8.90
 7     1 DL                CMH                   0 ATL            8.90
 8     1 DL                ATL                   0 CMH            8.90
 9     1 DL                LAX                   0 CMH            8.90
10     1 DL                CMH                   0 ATL            8.90
# … with 47,777 more rows, and 1 more variable: mean.delay.2 <dbl>

distinct(...)

Since the preceding code yields duplicate rows, I can use distinct to retain only unique rows

cmh.df %>% 
  select(
    month, reporting_airline, dest, dep_delay_minutes, origin
    ) %>% 
  group_by(reporting_airline) %>% 
  mutate(
    mean.delay.1 = mean(dep_delay_minutes, na.rm = TRUE)
    ) %>%
  ungroup() %>%
  group_by(reporting_airline, origin) %>% 
  mutate(
    mean.delay.2 = mean(dep_delay_minutes, na.rm = TRUE)
    ) %>%
  distinct()
# A tibble: 13,332 × 7
# Groups:   reporting_airline, origin [53]
   month reporting_airline dest  dep_delay_minutes origin mean.delay.1
   <dbl> <chr>             <chr>             <dbl> <chr>         <dbl>
 1     1 DL                ATL                   0 CMH            8.90
 2     1 DL                ATL                   1 CMH            8.90
 3     1 DL                CMH                   0 ATL            8.90
 4     1 DL                CMH                   0 LAX            8.90
 5     1 DL                LAX                   0 CMH            8.90
 6     1 DL                CMH                   7 ATL            8.90
 7     1 DL                CMH                   1 LAX            8.90
 8     1 DL                ATL                   2 CMH            8.90
 9     1 DL                CMH                 105 ATL            8.90
10     1 DL                CMH                   1 ATL            8.90
# … with 13,322 more rows, and 1 more variable: mean.delay.2 <dbl>

mutate(...)

Thus far we have been creating a summary data frame containing whatever we calculated – the frequencies, averages, and so on. Some times, however, you may want these calculations to be be appended to the original data frame and stored as new columns. This can be done with mutate(), as shown below. In this particular example, I am calculating median departure delay by airline (for each airport it departs from) and then including this as a column called Median.Delay in cmhflights.

cmh.df %>% 
  group_by(
    reporting_airline, origin
    ) %>% 
  mutate(
    median.delay = median(dep_delay_minutes, na.rm = TRUE)
    ) -> cmhflights 

Note the 111th column. Now, if we wanted to add this column but drop all other columns not being used in the calculation, we could do that via transmute(). Be careful with this one though.

cmh.df -> mut.df # cloning cmh.df into mut.df 

mut.df %>% 
  group_by(reporting_airline, dest) %>% 
  transmute(
    median.arr.delay = median(arr_delay_minutes, na.rm = TRUE)
    ) -> mut.df2 

names(mut.df2)
[1] "reporting_airline" "dest"              "median.arr.delay" 

case_when(...)

Just as one can use ifelse() to generate new variables or modify existing variables, so can one do the same thing with case_when() from dplyr. Here are a few examples. First, working with categorical variables, where I create a new variable that flags whether the flight originated in BWI, DCA, OR IAH versus some other airport.

cmh.df %>%
  mutate(
    DC_area = case_when(
      origin %in% c("BWI","DCA", "IAH") ~ "DC Area",
      !origin %in% c("BWI","DCA", "IAH") ~ "Not in DC Area"
      )
    ) -> cw.df 

table(cw.df$origin, cw.df$DC_area)
     
      DC Area Not in DC Area
  ATL       0           3794
  BNA       0            692
  BOS       0            623
  BWI    1301              0
  CLT       0              3
  CMH       0          23893
  DAL       0            369
  DCA     674              0
  DEN       0           1224
  DFW       0           1482
  DTW       0            755
  EWR       0            847
  FLL       0            321
  HOU       0             16
  IAH     760              0
  LAS       0           1074
  LAX       0            641
  LGA       0            173
  MCO       0           1488
  MDW       0           1992
  MSP       0            945
  MSY       0             35
  OAK       0            365
  ORD       0           1089
  PHL       0             57
  PHX       0           1075
  RSW       0            657
  STL       0            655
  TPA       0            787

The same result would obtain if you had run the following code:

cmh.df %>%
  mutate(
    DC_area = case_when(
      origin %in% c("BWI","DCA", "IAH") ~ "DC Area",
      TRUE ~ "Not in DC Area"
      )
    ) -> cw.df 

table(cw.df$origin, cw.df$DC_area)
     
      DC Area Not in DC Area
  ATL       0           3794
  BNA       0            692
  BOS       0            623
  BWI    1301              0
  CLT       0              3
  CMH       0          23893
  DAL       0            369
  DCA     674              0
  DEN       0           1224
  DFW       0           1482
  DTW       0            755
  EWR       0            847
  FLL       0            321
  HOU       0             16
  IAH     760              0
  LAS       0           1074
  LAX       0            641
  LGA       0            173
  MCO       0           1488
  MDW       0           1992
  MSP       0            945
  MSY       0             35
  OAK       0            365
  ORD       0           1089
  PHL       0             57
  PHX       0           1075
  RSW       0            657
  STL       0            655
  TPA       0            787

Similarly, we could use numerical variables to create new ones. For example, let us call a flight delayed if dep_delay_minutes is \(> 30\).

cw.df %>%
  mutate(
    delayed = case_when(
      dep_delay_minutes > 30 ~ "Yes",
      dep_delay_minutes < 30 ~ "No",
      is.na(dep_delay_minutes) ~ "Unknown"
      )
    ) -> cw.df 

case_when() is often useful when you have to chain multiple ifelse() commands since managing a large ifelse() chain can be tiresome. Further, you should remember that %in% can be used to ease coding in case_when(), as shown below. Say I want to group flights into weekday versus weekend flights. Note that the week starts on Sundays hence weekdays are defined here as days 2 through 6 (i.e., Monday through Friday).

cw.df %>%
  mutate(
    weekend = case_when(
      day_of_week %in% c(2:6) ~ "No",
      TRUE ~ "Yes"
      )
    ) -> cw.df 

unnest(...)

Every now and then the data you have to work with includes some list-columns – columns that are like mini-data frames themselves. Suddenly R functions that would otherwise work will not execute as desired. The unnest function from the tidyverse unpacks these list-columns into a regular tibble. First a small data frame that shows a list-column.

library(tidyverse)
data_frame(
  x = 1:3,
  y = c("a", "d,e,f", "g,h")
  ) -> bad.df 

If I want to clean up list-column y, I can do that as shown below.

bad.df %>% 
  transform(
    y = strsplit(y, ",")
    ) %>% 
  unnest(y) -> good.df 

Now see a slightly more complicated list-column, with y made of named rows …. a and b, a with 1 value and b with 2 values.

data_frame(
  x = 1:2,
  y = list(a = 1, b = 3:4)
  ) -> badder.df 

To clean this up we could do:

badder.df %>% 
  unnest(
    .,
    .id = "name"
    ) -> gooder.df 

{tidyr}

Most packages you use in R call for data to be {tidy}. What are tidy data?