Tidy and Tweak Your Data

In this module you will learn how three very powerful packages work their magic. {dplyr} and {tidyr} have quickly gained a following because they allow you to manage seemingly unwieldy data to calculate quantities of interest with relative ease, and to flip data from “long” to “wide” formats or vice-versa. {stringr} is a sister-package that works well with strings and often useful for key data cleaning.

Ani Ruhil
2022-02-03

Our goal is to understand how two very powerful packages work their magic. Of recent vintage, {dplyr} and {tidyr} have quickly gained a following because they allow you to manage seemingly unwieldy data to calculate quantities of interest with relative ease. In this module we will see some of their core abilities.

{dplyr}

You will usually end up needing and using some or all of these six functions:

In order to understand how these commands work we rely on a simple data frame – all flights originating and departing from Columbus (Ohio) for January through December of 2017. Let us load the data and {dplyr}.

library(here)
library(tidyverse)
library(tidylog, warn.conflicts = FALSE)

load(
  here("data", "cmhflights2017.RData")
  )

You will want to install {tidylog}, a great package by Benjamin Elbers, since tidylog shows you what each step of a basic {dplyr} operation has done.

select(...)

Often your data frame will have more columns than you need to calculate some quantity – for example, the mean of some variable or the frequency distribution of some categorical variable. Say I only want the second through sixth column – Year, Quarter, Month, DayofMonth, and DayofWeek. Before I do that, though, let me render all column names to be lowercase; running clean_names() from the {janitor} package after you read in any data-set is a good habit to acquire early on in your career.

library(janitor)
clean_names(cmhflights2017) -> cmh.df 

Selecting consecutive columns

Having done that, I could generate a data frame containing only the desired columns as follows:

cmh.df %>% 
  select(
    year:day_of_week
    ) -> my.df 

names(my.df)
[1] "year"        "quarter"     "month"       "dayof_month"
[5] "day_of_week"

Selecting both consecutive and non-consecutive columns

What if the columns were not sequentially located? In that case we would need to list each column we want, using the : operator as shorthand for sequential columns.

cmh.df %>% 
  select(
    year,
    flight_date:reporting_airline,
    tail_number
    ) -> my.df 

names(my.df)
[1] "year"              "flight_date"       "reporting_airline"
[4] "tail_number"      

Selecting with column numbers instead of column names

Could we use column numbers instead? Absolutely, provided you know the column numbers. Yes, provided you know the column numbers of the columns you want (and these can be obtained by running names() on the original data-set).

cmh.df %>% 
  select(
    c(1, 3, 5, 7)
    ) -> my.df 

names(my.df)
[1] "id"          "quarter"     "dayof_month" "flight_date"

Selecting columns in other ways

You can also select columns in other ways, with contains, starts_with, ends_with, any_of, all_of, matches, and num_range. As these function names suggest, contains() will look for a specific string in the column names, starts_with() and ends_with() look for specific starting and ending strings, respectively, matches() will look for a matching string, and any_of() will look for any of the strings listed in quotes inside the parentheses while all_of() will look for all listed strings and throw an error if even one is not found.

cmh.df %>% 
  select(
    contains("carrier")
    ) -> my.df # using contains() 

names(my.df)
[1] "carrier_delay"
cmh.df %>% 
  select(
    starts_with("de")
    ) -> my.df # using starts_with() 

names(my.df)
 [1] "dest_airport_id"        "dest_airport_seq_id"   
 [3] "dest_city_market_id"    "dest"                  
 [5] "dest_city_name"         "dest_state"            
 [7] "dest_state_fips"        "dest_state_name"       
 [9] "dest_wac"               "dep_time"              
[11] "dep_delay"              "dep_delay_minutes"     
[13] "dep_del15"              "departure_delay_groups"
[15] "dep_time_blk"          
cmh.df %>% 
  select(
    ends_with("num")
    ) -> my.df # using ends_with() 

names(my.df)
[1] "div1tail_num" "div2tail_num" "div3tail_num" "div4tail_num"
[5] "div5tail_num"

Now here is any_of(), with the desired variables fed via creating a new object called myvars

myvars = c("year", "month", "dest")

cmh.df %>% 
  select(
    any_of(
      myvars
      )
    ) -> my.df # using any_of() 

names(my.df)
[1] "year"  "month" "dest" 

Note that destined is not found but the other two are extracted any way. What about with all_of()?

cmh.df %>%
  select(
    all_of(myvars)
  ) -> my.df

What if I misspelled any of the variables? Try running the code below

mymalvars = c("year", "month", "destined")

cmh.df %>%
  select(
    all_of(mymalvars)
  ) -> my.df

and you should see this message:

Error: Can't subset columns that don't exist.
x Column `destined` doesn't exist.
cmh.df %>% 
  select(
    matches(
      "wheel"
      )
    ) -> my.df # using matches()

names(my.df)
 [1] "wheels_off"     "wheels_on"      "div1wheels_on" 
 [4] "div1wheels_off" "div2wheels_on"  "div2wheels_off"
 [7] "div3wheels_on"  "div3wheels_off" "div4wheels_on" 
[10] "div4wheels_off" "div5wheels_on"  "div5wheels_off"

What about num_range()?

billboard %>%
  select(
    num_range("wk", 1:4)
  ) -> my.bb

names(my.bb)
[1] "wk1" "wk2" "wk3" "wk4"

rename(...)

Variable names can be modified with the rename() command. The sequence is easy to remember: newname = oldname, as shown below. Just remember: When you do this, the original variable name has been overwritten so do not invoke it in any subsequent code.

cmh.df %>%
  rename(
    flt_number = flight_number_reporting_airline
    ) %>%
  select(
    tail_number, flt_number
    ) -> df.rename 

names(df.rename)
[1] "tail_number" "flt_number" 

filter(...)

Recall that with the diamonds data, in the previous week we used subset() to select particular cases for analysis. With {dplyr}, the filter() command serves the same purpose. Say we only want records from January, i.e., month == 1.

cmh.df %>% 
  filter(
    month == 1
    ) -> my.df 

What about only American Airline flights in January?

cmh.df %>% 
  filter(
    month == 1 &
    reporting_airline == "AA"
    ) -> my.df 

What about United Airlines flights in January to CMH (Columbus, OH) from any originating airport?

cmh.df %>% 
  filter(
    month == 1 & reporting_airline == "UA" & dest == "CMH"
    ) -> my.df 

Note: Instead of & you could use , … it serves the same purpose; test it out for yourself with the code below.

cmh.df %>% 
  filter(
    month == 1, reporting_airline == "UA", dest == "CMH"
    ) -> my.df 

filter(...) with %in%

What if I wanted a more complicated filter, say, flights in January or February either to CMH (Columbus, Ohio) or to ORD (the O’Hare airport in Chicago, Illinois)?

cmh.df %>% 
  filter(
    month %in% c(1, 2) &
      reporting_airline == "UA" &
      dest %in% c("CMH", "ORD")
    ) -> my.df 

Some times you may also want to filter with the opposite of %in%, as in not in some list. You can do this as well, but note where the ! is applied – to the column-name.

cmh.df %>% 
  filter(
    month %in% c(1, 2) &
      reporting_airline == "UA" &
      !dest %in% c("CMH", "ORD")
    ) -> my.df 

Now let us say I wanted to arrange the resulting data frame in ascending order of departure delays. How might I do that?

arrange(...)

my.df %>% 
  arrange(
    dep_delay_minutes
    ) -> my.df2 

And now in descending order of delays. Note the use of the minus symbol - to reverse the sort order

my.df %>% 
  arrange(
    -dep_delay_minutes
    ) -> my.df2 

We could tweak this further, perhaps asking R to first sort by descending order of departure delays to CMH and then to ORD. Note that dest is ascending and DepDelayMinutes is descending

my.df %>% 
  arrange(
    dest, -dep_delay_minutes
    ) -> my.df2 

Piping away with %>%

So far, we have seen each function in isolation. Now we streamline things a bit so that we only end up with the columns we want, organized as we want them to be. That is, we can string together multiple functions to have a slim data frame of only those columns we selected, filtered, and arranged according to our wishes.

cmh.df %>% 
  select(
    month, reporting_airline, dest, dep_delay_minutes
    ) %>% 
  filter(
    month %in% c(1, 2) &
      reporting_airline == "UA" &
      dest %in% c("CMH", "ORD")
    ) %>% 
  arrange(
    month, dest, desc(dep_delay_minutes)
    ) -> my.df3 

Here, the end result is a data frame arranged by month, then within month by destination, and then finally by descending order of flight delay times.

group_by(...)

You often want to carry out some operations/calculations by specific groups, for example, calculate the mean or median age of the men in our sample versus the women, find out how many unique destinations does each airline fly to from Columbus, and so on. The group_by() command comes in very handy here to set the grouping structure.

Let us calculate the Mean and Median delay times. To do this, we will lean on the group_by() function to calculate median and mean delays by month and destination. While we are at it, let us see the standard deviation as well.

cmh.df %>% 
  select(
    month, reporting_airline, dest, dep_delay_minutes
    ) %>% 
  filter(
    month %in% c(1, 2),
    reporting_airline == "UA",
    dest %in% c("CMH", "ORD")
    ) %>% 
  group_by(month, dest) %>% 
  summarise(
    median.delay = median(dep_delay_minutes, na.rm = TRUE), 
    mean.delay = mean(dep_delay_minutes, na.rm = TRUE),
    sd.delay = sd(dep_delay_minutes, na.rm = TRUE)
    ) %>% 
  arrange(month, dest) -> tab.01 

What if wanted to know how many airports can you fly to from Columbus, and how many flights there were to each airport in July?

cmh.df %>% 
  select(
    month, reporting_airline, dest, dep_delay_minutes, origin
    ) %>% 
  filter(
    month %in% c(7),
    origin %in% c("CMH")
    ) %>% 
  group_by(dest) %>% 
  summarise(
    no.of.flights = n()
    ) -> tab.02 

How many distinct airlines fly out of Columbus?

cmh.df %>% 
  select(
    month, reporting_airline, dest, dep_delay_minutes, origin
    ) %>% 
  filter(
    origin %in% c("CMH")
    ) %>% 
  group_by(reporting_airline) %>% 
  summarise(
    no.of.airlines = n_distinct(reporting_airline)
    ) -> tab.03 

And how many flights did each airline operate?

cmh.df %>% 
  select(
    month, reporting_airline, dest, dep_delay_minutes, origin
    ) %>% 
  filter(
    origin %in% c("CMH")
    ) %>% 
  group_by(reporting_airline) %>% 
  summarise(
    no.of.flights = n()
    ) -> tab.04 

Of all of these airlines, which one has the highest mean departure delay?

cmh.df %>% 
  select(
    month, reporting_airline, dest, dep_delay_minutes, origin
    ) %>% 
  filter(
    origin %in% c("CMH")
    ) %>% 
  group_by(reporting_airline) %>% 
  summarise(
    mean.delay = mean(dep_delay_minutes, na.rm = TRUE)
    ) %>%
  arrange(
    desc(mean.delay)
    ) -> tab.05 

ungroup(...)

If you want to remove a grouping structure you merely need to specify ungroup(), as in

cmh.df %>% 
  select(
    month, reporting_airline, dest, dep_delay_minutes, origin
    ) %>% 
  filter(
    origin %in% c("CMH")
    ) %>% 
  group_by(reporting_airline) %>% 
  summarise(
    median.delay = median(dep_delay_minutes, na.rm = TRUE)
    ) %>%
  ungroup() -> tab.06 

ungroup() is often handy when you need to pipe several commands but the grouping structure is different for each. For example, maybe I want to calculate departure delays by unique airline, and then by unique combinations of airline & origin.

cmh.df %>% 
  select(
    month, reporting_airline, dest, dep_delay_minutes, origin
    ) %>% 
  group_by(reporting_airline) %>% 
  mutate(
    mean.delay.1 = mean(dep_delay_minutes, na.rm = TRUE)
    ) %>%
  ungroup() %>%
  group_by(reporting_airline, origin) %>% 
  mutate(
    mean.delay.2 = mean(dep_delay_minutes, na.rm = TRUE)
    )
# A tibble: 47,787 × 7
# Groups:   reporting_airline, origin [53]
   month reporting_airline dest  dep_delay_minutes origin mean.delay.1
   <dbl> <chr>             <chr>             <dbl> <chr>         <dbl>
 1     1 DL                ATL                   0 CMH            8.90
 2     1 DL                ATL                   1 CMH            8.90
 3     1 DL                CMH                   0 ATL            8.90
 4     1 DL                ATL                   0 CMH            8.90
 5     1 DL                CMH                   0 ATL            8.90
 6     1 DL                CMH                   0 LAX            8.90
 7     1 DL                CMH                   0 ATL            8.90
 8     1 DL                ATL                   0 CMH            8.90
 9     1 DL                LAX                   0 CMH            8.90
10     1 DL                CMH                   0 ATL            8.90
# … with 47,777 more rows, and 1 more variable: mean.delay.2 <dbl>

distinct(...)

Since the preceding code yields duplicate rows, I can use distinct to retain only unique rows

cmh.df %>% 
  select(
    month, reporting_airline, dest, dep_delay_minutes, origin
    ) %>% 
  group_by(reporting_airline) %>% 
  mutate(
    mean.delay.1 = mean(dep_delay_minutes, na.rm = TRUE)
    ) %>%
  ungroup() %>%
  group_by(reporting_airline, origin) %>% 
  mutate(
    mean.delay.2 = mean(dep_delay_minutes, na.rm = TRUE)
    ) %>%
  distinct()
# A tibble: 13,332 × 7
# Groups:   reporting_airline, origin [53]
   month reporting_airline dest  dep_delay_minutes origin mean.delay.1
   <dbl> <chr>             <chr>             <dbl> <chr>         <dbl>
 1     1 DL                ATL                   0 CMH            8.90
 2     1 DL                ATL                   1 CMH            8.90
 3     1 DL                CMH                   0 ATL            8.90
 4     1 DL                CMH                   0 LAX            8.90
 5     1 DL                LAX                   0 CMH            8.90
 6     1 DL                CMH                   7 ATL            8.90
 7     1 DL                CMH                   1 LAX            8.90
 8     1 DL                ATL                   2 CMH            8.90
 9     1 DL                CMH                 105 ATL            8.90
10     1 DL                CMH                   1 ATL            8.90
# … with 13,322 more rows, and 1 more variable: mean.delay.2 <dbl>

mutate(...)

Thus far we have been creating a summary data frame containing whatever we calculated – the frequencies, averages, and so on. Some times, however, you may want these calculations to be be appended to the original data frame and stored as new columns. This can be done with mutate(), as shown below. In this particular example, I am calculating median departure delay by airline (for each airport it departs from) and then including this as a column called Median.Delay in cmhflights.

cmh.df %>% 
  group_by(
    reporting_airline, origin
    ) %>% 
  mutate(
    median.delay = median(dep_delay_minutes, na.rm = TRUE)
    ) -> cmhflights 

Note the 111th column. Now, if we wanted to add this column but drop all other columns not being used in the calculation, we could do that via transmute(). Be careful with this one though.

cmh.df -> mut.df # cloning cmh.df into mut.df 

mut.df %>% 
  group_by(reporting_airline, dest) %>% 
  transmute(
    median.arr.delay = median(arr_delay_minutes, na.rm = TRUE)
    ) -> mut.df2 

names(mut.df2)
[1] "reporting_airline" "dest"              "median.arr.delay" 

case_when(...)

Just as one can use ifelse() to generate new variables or modify existing variables, so can one do the same thing with case_when() from dplyr. Here are a few examples. First, working with categorical variables, where I create a new variable that flags whether the flight originated in BWI, DCA, OR IAH versus some other airport.

cmh.df %>%
  mutate(
    DC_area = case_when(
      origin %in% c("BWI","DCA", "IAH") ~ "DC Area",
      !origin %in% c("BWI","DCA", "IAH") ~ "Not in DC Area"
      )
    ) -> cw.df 

table(cw.df$origin, cw.df$DC_area)
     
      DC Area Not in DC Area
  ATL       0           3794
  BNA       0            692
  BOS       0            623
  BWI    1301              0
  CLT       0              3
  CMH       0          23893
  DAL       0            369
  DCA     674              0
  DEN       0           1224
  DFW       0           1482
  DTW       0            755
  EWR       0            847
  FLL       0            321
  HOU       0             16
  IAH     760              0
  LAS       0           1074
  LAX       0            641
  LGA       0            173
  MCO       0           1488
  MDW       0           1992
  MSP       0            945
  MSY       0             35
  OAK       0            365
  ORD       0           1089
  PHL       0             57
  PHX       0           1075
  RSW       0            657
  STL       0            655
  TPA       0            787

The same result would obtain if you had run the following code:

cmh.df %>%
  mutate(
    DC_area = case_when(
      origin %in% c("BWI","DCA", "IAH") ~ "DC Area",
      TRUE ~ "Not in DC Area"
      )
    ) -> cw.df 

table(cw.df$origin, cw.df$DC_area)
     
      DC Area Not in DC Area
  ATL       0           3794
  BNA       0            692
  BOS       0            623
  BWI    1301              0
  CLT       0              3
  CMH       0          23893
  DAL       0            369
  DCA     674              0
  DEN       0           1224
  DFW       0           1482
  DTW       0            755
  EWR       0            847
  FLL       0            321
  HOU       0             16
  IAH     760              0
  LAS       0           1074
  LAX       0            641
  LGA       0            173
  MCO       0           1488
  MDW       0           1992
  MSP       0            945
  MSY       0             35
  OAK       0            365
  ORD       0           1089
  PHL       0             57
  PHX       0           1075
  RSW       0            657
  STL       0            655
  TPA       0            787

Similarly, we could use numerical variables to create new ones. For example, let us call a flight delayed if dep_delay_minutes is \(> 30\).

cw.df %>%
  mutate(
    delayed = case_when(
      dep_delay_minutes > 30 ~ "Yes",
      dep_delay_minutes < 30 ~ "No",
      is.na(dep_delay_minutes) ~ "Unknown"
      )
    ) -> cw.df 

case_when() is often useful when you have to chain multiple ifelse() commands since managing a large ifelse() chain can be tiresome. Further, you should remember that %in% can be used to ease coding in case_when(), as shown below. Say I want to group flights into weekday versus weekend flights. Note that the week starts on Sundays hence weekdays are defined here as days 2 through 6 (i.e., Monday through Friday).

cw.df %>%
  mutate(
    weekend = case_when(
      day_of_week %in% c(2:6) ~ "No",
      TRUE ~ "Yes"
      )
    ) -> cw.df 

unnest(...)

Every now and then the data you have to work with includes some list-columns – columns that are like mini-data frames themselves. Suddenly R functions that would otherwise work will not execute as desired. The unnest function from the tidyverse unpacks these list-columns into a regular tibble. First a small data frame that shows a list-column.

library(tidyverse)
data_frame(
  x = 1:3,
  y = c("a", "d,e,f", "g,h")
  ) -> bad.df 

If I want to clean up list-column y, I can do that as shown below.

bad.df %>% 
  transform(
    y = strsplit(y, ",")
    ) %>% 
  unnest(y) -> good.df 

Now see a slightly more complicated list-column, with y made of named rows …. a and b, a with 1 value and b with 2 values.

data_frame(
  x = 1:2,
  y = list(a = 1, b = 3:4)
  ) -> badder.df 

To clean this up we could do:

badder.df %>% 
  unnest(
    .,
    .id = "name"
    ) -> gooder.df 

{tidyr}

Most packages you use in R call for data to be {tidy}. What are tidy data?

Artwork by @allison_horst

Figure 1: Artwork by @allison_horst

Here are some untidy data

  1. The Pew survey data
read.delim(
  file = "http://stat405.had.co.nz/data/pew.txt",
  header = TRUE,
  stringsAsFactors = FALSE,
  check.names = FALSE
  ) -> pew 

head(pew)
            religion <$10k $10-20k $20-30k $30-40k $40-50k $50-75k
1           Agnostic    27      34      60      81      76     137
2            Atheist    12      27      37      52      35      70
3           Buddhist    27      21      30      34      33      58
4           Catholic   418     617     732     670     638    1116
5 Don’t know/refused    15      14      15      11      10      35
6   Evangelical Prot   575     869    1064     982     881    1486
  $75-100k $100-150k >150k Don't know/refused
1      122       109    84                 96
2       73        59    74                 76
3       62        39    53                 54
4      949       792   633               1489
5       21        17    18                116
6      949       723   414               1529
  1. Here are some more examples
# remotes::install_github("garrettgman/DSR")
library(DSR)
table2
# A tibble: 12 × 4
   country      year key             value
   <fct>       <int> <fct>           <int>
 1 Afghanistan  1999 cases             745
 2 Afghanistan  1999 population   19987071
 3 Afghanistan  2000 cases            2666
 4 Afghanistan  2000 population   20595360
 5 Brazil       1999 cases           37737
 6 Brazil       1999 population  172006362
 7 Brazil       2000 cases           80488
 8 Brazil       2000 population  174504898
 9 China        1999 cases          212258
10 China        1999 population 1272915272
11 China        2000 cases          213766
12 China        2000 population 1280428583
table3
# A tibble: 6 × 3
  country      year rate             
  <fct>       <int> <chr>            
1 Afghanistan  1999 745/19987071     
2 Afghanistan  2000 2666/20595360    
3 Brazil       1999 37737/172006362  
4 Brazil       2000 80488/174504898  
5 China        1999 212258/1272915272
6 China        2000 213766/1280428583
table4
# A tibble: 3 × 3
  country     `1999` `2000`
  <fct>        <int>  <int>
1 Afghanistan    745   2666
2 Brazil       37737  80488
3 China       212258 213766
table5
# A tibble: 3 × 3
  country         `1999`     `2000`
  <fct>            <int>      <int>
1 Afghanistan   19987071   20595360
2 Brazil       172006362  174504898
3 China       1272915272 1280428583

What would a tidy version of the examples in (b) look like?

table1
# A tibble: 6 × 4
  country      year  cases population
  <fct>       <int>  <int>      <int>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583

Here, we have

The {tidyr} package is designed to make messy data tidy in order to proceed with analysis. It has four primary functions that we will see in turn.

separate

You will run into variables that you want to break into smaller pieces. For instance, take table3 where rate is a ratio of two variables, cases and population, each of which should be in their own columns. How can we achieve that?

library(tidyr)
table3 %>% 
  separate(
    col = rate, 
    into = c("cases", "population"),
    sep = "/", 
    remove = TRUE,
    convert = TRUE
    ) -> s.table3  

s.table3
# A tibble: 6 × 4
  country      year  cases population
  <fct>       <int>  <int>      <int>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583

unite(...)

The unite(...) command does the exact opposite of separate, combining columns

s.table3 %>% 
  unite(
    "rate",
    cases, population,
    sep = ":",
    remove = FALSE
    ) -> u.table3 

u.table3
# A tibble: 6 × 5
  country      year rate               cases population
  <fct>       <int> <chr>              <int>      <int>
1 Afghanistan  1999 745:19987071         745   19987071
2 Afghanistan  2000 2666:20595360       2666   20595360
3 Brazil       1999 37737:172006362    37737  172006362
4 Brazil       2000 80488:174504898    80488  174504898
5 China        1999 212258:1272915272 212258 1272915272
6 China        2000 213766:1280428583 213766 1280428583

Here you specify the name of the new column first, and then list the columns to be combined, specify the separator, and whether the original columns should be removed or not.

spread(...) and pivot_wider(...)

Revisit table2

table2
# A tibble: 12 × 4
   country      year key             value
   <fct>       <int> <fct>           <int>
 1 Afghanistan  1999 cases             745
 2 Afghanistan  1999 population   19987071
 3 Afghanistan  2000 cases            2666
 4 Afghanistan  2000 population   20595360
 5 Brazil       1999 cases           37737
 6 Brazil       1999 population  172006362
 7 Brazil       2000 cases           80488
 8 Brazil       2000 population  174504898
 9 China        1999 cases          212258
10 China        1999 population 1272915272
11 China        2000 cases          213766
12 China        2000 population 1280428583

I want to flip this into a format where the data frame has two columns per country per year – cases and population. How can I do this?

table2 %>% 
  spread(
    key,
    value
    ) -> sp.table2 

sp.table2
# A tibble: 6 × 4
  country      year  cases population
  <fct>       <int>  <int>      <int>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583

Notice that the key column had two unique values – cases and population – so two new columns were created. Notice also that tidyr knew to populate the value of each of the new cells with what it finds in the value column. The result is what we call data in a wide format.

gather(...) and pivot_longer(...)

gather(...) does the opposite of spread(...), converting wide data into what we call the long form. Revisit table4

table4
# A tibble: 3 × 3
  country     `1999` `2000`
  <fct>        <int>  <int>
1 Afghanistan    745   2666
2 Brazil       37737  80488
3 China       212258 213766

Note that each country has two columns, one per year. This is untidy. So we’ll flip this tibble as follows:

table4 %>% 
  gather(
    "year",
    "cases",
    2:3
    ) -> g.table4 

g.table4
# A tibble: 6 × 3
  country     year   cases
  <fct>       <chr>  <int>
1 Afghanistan 1999     745
2 Brazil      1999   37737
3 China       1999  212258
4 Afghanistan 2000    2666
5 Brazil      2000   80488
6 China       2000  213766

Notice that you had to specify the columns you wanted to gather and we did this via 2:3. If you forget to specify columns it creates something else altogether so be careful.

World Health Organization (WHO) Data Cleaning Example

This is a slightly modified version of Garrett Grolemund’s WHO Data Case Study.

At this website you can find data provided by countries to WHO and estimates of TB burden generated by WHO for the Global Tuberculosis Report. I want you to gather the following data (all are in the csv format):

Read the data dictionary carefully. Then read-in the case notifications data.

The database has too many variables (175 or so) and so I’ll have you use just a few for this practice task.

library(readr)
read_csv(
  here("data", "TB_notifications_2018-02-07.csv")
  ) -> tb 

Let us use dplyr to select and filter the tb database so that it meets the following criteria: (i) observations from 2012 or earlier years are kept, and (ii) keep only columns 1 through 6 and 25 (new_sp_m04) through 44 (new_sp_fu).

library(dplyr)
tb %>% 
  select(
    c(1:6, 25:44)
    ) %>% 
  filter(
    year <= 2012
    ) -> tb2 

glimpse(tb2)
Rows: 7,023
Columns: 26
$ country      <chr> "Afghanistan", "Afghanistan", "Afghanistan", "A…
$ iso2         <chr> "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF",…
$ iso3         <chr> "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG"…
$ iso_numeric  <chr> "004", "004", "004", "004", "004", "004", "004"…
$ g_whoregion  <chr> "EMR", "EMR", "EMR", "EMR", "EMR", "EMR", "EMR"…
$ year         <dbl> 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987,…
$ new_sp_m04   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ new_sp_m514  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ new_sp_m014  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ new_sp_m1524 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ new_sp_m2534 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ new_sp_m3544 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ new_sp_m4554 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ new_sp_m5564 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ new_sp_m65   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ new_sp_mu    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ new_sp_f04   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ new_sp_f514  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ new_sp_f014  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ new_sp_f1524 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ new_sp_f2534 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ new_sp_f3544 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ new_sp_f4554 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ new_sp_f5564 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ new_sp_f65   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ new_sp_fu    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…

Now, new_sp_xx refer to new pulmonary smear-positive cases. The m refers to males and f refers to females. Finally, 04 is the 0-4 age group, 514 is the 5-14 age group, and so on. new_sp_mu is males with an unknown age-group and new_sp_fu is females with an unknown age-group. This is the messy part of the data. Ideally we’d have a column for gender, flagged as Male, Female, Unknown, and a separate column for age-groups flagged 0-4, 5-14, 0-14, and so on. Time to tidy these data.

Since the data are in the wide format we can use gather to flip it into the long format.

tb2 %>% 
  gather(
    "group",
    "cases",
    7:26
    ) -> tb3 

glimpse(tb3)
Rows: 140,460
Columns: 8
$ country     <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Af…
$ iso2        <chr> "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", …
$ iso3        <chr> "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG",…
$ iso_numeric <chr> "004", "004", "004", "004", "004", "004", "004",…
$ g_whoregion <chr> "EMR", "EMR", "EMR", "EMR", "EMR", "EMR", "EMR",…
$ year        <dbl> 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, …
$ group       <chr> "new_sp_m04", "new_sp_m04", "new_sp_m04", "new_s…
$ cases       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …

Now we split the new_sp_xx values into three columns

tb3 %>% 
  separate(
    col = group,
    into = c("new", "sp", "sexage"),
    sep = "_"
    ) -> tb4 

glimpse(tb4)
Rows: 140,460
Columns: 10
$ country     <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Af…
$ iso2        <chr> "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", …
$ iso3        <chr> "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG",…
$ iso_numeric <chr> "004", "004", "004", "004", "004", "004", "004",…
$ g_whoregion <chr> "EMR", "EMR", "EMR", "EMR", "EMR", "EMR", "EMR",…
$ year        <dbl> 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, …
$ new         <chr> "new", "new", "new", "new", "new", "new", "new",…
$ sp          <chr> "sp", "sp", "sp", "sp", "sp", "sp", "sp", "sp", …
$ sexage      <chr> "m04", "m04", "m04", "m04", "m04", "m04", "m04",…
$ cases       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …

and then split sexage into two columns by specifying that the split should occur after the first value (which in our case will be after m or f).

tb4 %>% 
  separate(
    col = sexage,
    into = c("sex", "agegroup"),
    sep = 1
    ) -> tb5 

glimpse(tb5)
Rows: 140,460
Columns: 11
$ country     <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Af…
$ iso2        <chr> "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", …
$ iso3        <chr> "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG",…
$ iso_numeric <chr> "004", "004", "004", "004", "004", "004", "004",…
$ g_whoregion <chr> "EMR", "EMR", "EMR", "EMR", "EMR", "EMR", "EMR",…
$ year        <dbl> 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, …
$ new         <chr> "new", "new", "new", "new", "new", "new", "new",…
$ sp          <chr> "sp", "sp", "sp", "sp", "sp", "sp", "sp", "sp", …
$ sex         <chr> "m", "m", "m", "m", "m", "m", "m", "m", "m", "m"…
$ agegroup    <chr> "04", "04", "04", "04", "04", "04", "04", "04", …
$ cases       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …

The beauty of dplyr and tidyr is the ability to combine all these steps into a single piped command!

tb %>% 
  select(
    c(1:6, 25:44)
    ) %>% 
  filter(
    year <= 2012
    )  %>% 
  gather(
    "group", 
    "cases", 
    7:26
    ) %>% 
  separate(
    col = group, 
    into = c("new", "sp", "sexage"),
    sep = "_"
    ) %>% 
  separate(
    col = sexage,
    into = c("sex", "agegroup"),
    sep = 1
    ) %>% 
  select(
    c(1:6, 9:11)
    ) -> tb.df 

glimpse(tb.df)
Rows: 140,460
Columns: 9
$ country     <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Af…
$ iso2        <chr> "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", …
$ iso3        <chr> "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG",…
$ iso_numeric <chr> "004", "004", "004", "004", "004", "004", "004",…
$ g_whoregion <chr> "EMR", "EMR", "EMR", "EMR", "EMR", "EMR", "EMR",…
$ year        <dbl> 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, …
$ sex         <chr> "m", "m", "m", "m", "m", "m", "m", "m", "m", "m"…
$ agegroup    <chr> "04", "04", "04", "04", "04", "04", "04", "04", …
$ cases       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …

Now, we can clean-up the codes for easier interpretation.

tb.df$sex.f <- ifelse(tb.df$sex == "m", "Males", "Females") 

tb.df$agegroup.f <- ifelse(tb.df$agegroup == "04", "0-4", 
  ifelse(tb.df$agegroup == "014", "0-14",
  ifelse(tb.df$agegroup == "1524", "15-24",
  ifelse(tb.df$agegroup == "2534", "25-34",
  ifelse(tb.df$agegroup == "3544", "35-44",
  ifelse(tb.df$agegroup == "4554", "45-54",
  ifelse(tb.df$agegroup == "514", "5-14",
  ifelse(tb.df$agegroup == "5564", "55-64",
  ifelse(tb.df$agegroup == "65", "65+", "Uknown")
  ))))))))

glimpse(tb.df)
Rows: 140,460
Columns: 11
$ country     <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Af…
$ iso2        <chr> "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", …
$ iso3        <chr> "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG",…
$ iso_numeric <chr> "004", "004", "004", "004", "004", "004", "004",…
$ g_whoregion <chr> "EMR", "EMR", "EMR", "EMR", "EMR", "EMR", "EMR",…
$ year        <dbl> 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, …
$ sex         <chr> "m", "m", "m", "m", "m", "m", "m", "m", "m", "m"…
$ agegroup    <chr> "04", "04", "04", "04", "04", "04", "04", "04", …
$ cases       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ sex.f       <chr> "Males", "Males", "Males", "Males", "Males", "Ma…
$ agegroup.f  <chr> "0-4", "0-4", "0-4", "0-4", "0-4", "0-4", "0-4",…

The same thing, albeit with case_when()

tb.df %>%
  mutate(
    sex.f = case_when(
      sex == "m" ~ "Males",
      sex == "f" ~ "Females"
    ),
    agegroup.f = case_when(
      agegroup == "014" ~ "0-14",
      agegroup == "1524" ~ "15-24",
      agegroup == "2534" ~ "25-34",
      agegroup == "3544" ~ "35-44",
      agegroup == "4554" ~ "45-54",
      agegroup == "514" ~ "5-14",
      agegroup == "5564" ~ "55-64",
      agegroup == "65" ~ "65+", 
      TRUE ~ "Uknown")
  ) -> tb.df 

glimpse(tb.df)
Rows: 140,460
Columns: 11
$ country     <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Af…
$ iso2        <chr> "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", …
$ iso3        <chr> "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG",…
$ iso_numeric <chr> "004", "004", "004", "004", "004", "004", "004",…
$ g_whoregion <chr> "EMR", "EMR", "EMR", "EMR", "EMR", "EMR", "EMR",…
$ year        <dbl> 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, …
$ sex         <chr> "m", "m", "m", "m", "m", "m", "m", "m", "m", "m"…
$ agegroup    <chr> "04", "04", "04", "04", "04", "04", "04", "04", …
$ cases       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ sex.f       <chr> "Males", "Males", "Males", "Males", "Males", "Ma…
$ agegroup.f  <chr> "Uknown", "Uknown", "Uknown", "Uknown", "Uknown"…

How many cases per country in the 1980-1990 period?

tb.df %>% 
  filter(
    year >= 1980 | year <= 1990
    ) %>% 
  group_by(country) %>% 
  summarise(
    ncases = sum(cases, na.rm = TRUE)
    ) %>% 
  arrange(-ncases) -> tbtab.01 

glimpse(tbtab.01)
Rows: 219
Columns: 2
$ country <chr> "India", "China", "Indonesia", "South Africa", "Bang…
$ ncases  <dbl> 6492850, 5643738, 1886657, 1435147, 1072431, 915525,…

What about by sex and age-group?

tb.df %>% 
  filter(
    year >= 1980 | year <= 1990
    ) %>% 
  group_by(
    country,  sex, agegroup
    ) %>% 
  summarise(
    ncases = sum(cases, na.rm = TRUE)
    ) %>% 
  arrange(-ncases) -> tbtab.02 

This is just one example of how dplyr and tidyr – two packages from the tidyverse – can be used to deal with what seems to be quite messy data. The more you use it, the easier it becomes to figure out solutions to even very complicated messy data setups.

{forcats}

The tidyverse team has another package for working with factors – forcats. How is it useful?

For example, say I have the mtcars data and want to see the distribution of the number of cylinders a car has.

data(mtcars)

cyl is stored as a numeric so I can flip this into a factor via two ways, with base factor() or then with fct_recode().

library(forcats)

mtcars %>%
  mutate(
    f.cyl = factor(cyl)
    ) -> mtcars.df  
mtcars.df %>%
  mutate(
    f.cyl2 = fct_recode(
      f.cyl, 
      Four = '4',
      Six = '6',
      Eight = '8'
      )
    ) -> mtcars.df

I am going to plot mpg per car but before I do so, I would like to know what model is being referred to. The row.names(mtcars) -> mtcars$model command pulls the model details stored as row.names in mtcars and creates a new column called model with this information. Note: You will need to install the {hrbrthemes} package before you plot.

mtcars.df$model <- row.names(mtcars) 

library(ggplot2)
ggplot(
  mtcars.df,
  aes(
    x = mpg,
    y = model
    )
  ) + 
  geom_point() + 
  labs(
    x = "Miles per gallon",
    y = "Car Model"
    ) +
  hrbrthemes::theme_ipsum_rc() +
  theme(legend.position = "none")

This isn’t very helpful since the models are not arranged in ascending/descending order of miles per gallon. We can easily fix this problem with fct_reorder(...)

ggplot(
  mtcars.df,
  aes(
    x = mpg, 
    fct_reorder(
      .f = model,
      .x = mpg
      )
    )
  ) +
  geom_point() + 
  labs(
    x = "Miles per gallon", 
    y = "Car Model"
    ) +
  hrbrthemes::theme_ipsum_rc() +
  theme(legend.position = "none")

If I want the plot to be in descending order of mpg,

ggplot(
  mtcars.df,
  aes(
    x = mpg, 
    fct_reorder(
    .f = model,
    .x = -mpg
    )
    )
  ) + 
  geom_point() + 
  labs(
    x = "Miles per gallon",
    y = "Car Model"
    ) +
  hrbrthemes::theme_ipsum_rc() +
  theme(legend.position = "none")

I can do this with bar-plots too.

ggplot(
  mtcars.df,
  aes(f.cyl2)
  ) +
  geom_bar(
    aes(
      fill = f.cyl2
      )
    ) +
  theme(legend.position = "none") +
  labs(
    x = "Number of cylinders",
    y = "Frequency"
    ) +
  hrbrthemes::theme_ipsum_rc() +
  theme(legend.position = "none")

And now the bars ordered by decreasing frequency, and then by increasing frequency.

ggplot(
  mtcars.df,
  aes(fct_infreq(f.cyl2))
  ) +
  geom_bar(
    aes(fill = f.cyl2)
    ) +
  theme(legend.position = "none") +
  labs(
    x = "Number of cylinders", 
    y = "Frequency"
    ) +
  hrbrthemes::theme_ipsum_rc() +
  theme(legend.position = "none")

ggplot(
  mtcars.df,
  aes(fct_rev(fct_infreq(f.cyl2)))
  ) +
  geom_bar(
    aes(fill = f.cyl2)
    ) +
  theme(legend.position = "none") +
  labs(
    x = "Number of cylinders", 
    y = "Frequency"
    ) +
  hrbrthemes::theme_ipsum_rc() +
  theme(legend.position = "none")

Note the use of fct_infreq() for descending and then fct_rev(fct_infreq()) for ascending.

Practice Tasks

Ex. 1: Pew

Take the pew data shown to you early on in this module and using dplyr and tidyr, tidy it up so that the final data-set (a) has three columns — religion, income-group, frequency, and (b) is arranged in ascending order of religion.

Ex. 2: Weather

Tidy these weather data

read.delim(
 file = "http://stat405.had.co.nz/data/weather.txt",
 stringsAsFactors = FALSE
 ) -> weather 

The tidy data should have the days as rows and TMIN and TMAX as columns (i.e., you are going from wide to long).

Ex. 3: Climate Change

Download these data from the World bank’s World Development Indicators and Climate Change Knowledge Portal on climate systems, exposure to climate impacts, resilience, greenhouse gas emissions, and energy use. The data are in a worksheet called, wait for it, Data, and need to be tidied up such that for every country you have one row per year and then columns for each of the 58 measures (called Series name in the database).

If you really want to challenge yourself, create separate data frames from the Country worksheet and the Series worksheet. Then merge these with data frame you tidied up.

Advanced dplyr (or verbs I often forget exist in the dplyr stable)

If you really dig into dplyr you find little nuggets here and there that are not hidden but for some reason few ever seem to blog about them or post examples where they are in use. I’ve cobbled together a few examples, inspired largely by Suzan Baert and then from the teradata documentation. It is also useful to bear in mind that packages continue to evolve and can, at times, deprecate or render obsolete code that worked in the past.

select_if, select_at, select_all

Let us start with select_if. This verb will allow you to select columns to operate on if some criterion is met. For example, perhaps I only want to pull out character columns:

cmh.df %>%
  select_if(is.character) %>%
  names()
 [1] "reporting_airline"           "iata_code_reporting_airline"
 [3] "tail_number"                 "origin"                     
 [5] "origin_city_name"            "origin_state"               
 [7] "origin_state_fips"           "origin_state_name"          
 [9] "dest"                        "dest_city_name"             
[11] "dest_state"                  "dest_state_fips"            
[13] "dest_state_name"             "crs_dep_time"               
[15] "dep_time"                    "dep_time_blk"               
[17] "wheels_off"                  "wheels_on"                  
[19] "crs_arr_time"                "arr_time"                   
[21] "arr_time_blk"                "cancellation_code"          
[23] "first_dep_time"              "div1airport"                
[25] "div1wheels_on"               "div1wheels_off"             
[27] "div1tail_num"               
library(lubridate)
cmh.df %>%
  select_if(is.Date) %>%
  names()
[1] "flight_date"

You can also include seemingly complicated conditions, as shown below where we ask that not only should the column be numeric but that the column’s mean should exceed 10 (an arbitrary value chosen to showcase this example).

cmh.df %>%
  select_if(
    ~is.numeric(.) & mean(., na.rm = TRUE) > 10
    ) %>%
  names()
 [1] "year"                           
 [2] "dayof_month"                    
 [3] "dot_id_reporting_airline"       
 [4] "flight_number_reporting_airline"
 [5] "origin_airport_id"              
 [6] "origin_airport_seq_id"          
 [7] "origin_city_market_id"          
 [8] "origin_wac"                     
 [9] "dest_airport_id"                
[10] "dest_airport_seq_id"            
[11] "dest_city_market_id"            
[12] "dest_wac"                       
[13] "dep_delay_minutes"              
[14] "taxi_out"                       
[15] "arr_delay_minutes"              
[16] "crs_elapsed_time"               
[17] "actual_elapsed_time"            
[18] "air_time"                       
[19] "distance"                       
[20] "carrier_delay"                  
[21] "nas_delay"                      
[22] "late_aircraft_delay"            
[23] "total_add_g_time"               
[24] "longest_add_g_time"             
[25] "div_actual_elapsed_time"        
[26] "div_arr_delay"                  
[27] "div_distance"                   
[28] "div1airport_id"                 
[29] "div1airport_seq_id"             
[30] "div1total_g_time"               
[31] "div1longest_g_time"             

You can also select with other conditions, for example, if the column has fewer than some number of distinct values (here I chose 5)

cmh.df %>%
  select_if(
    ~n_distinct(.) < 5
    ) %>%
  names()
 [1] "year"                 "quarter"             
 [3] "dep_del15"            "arr_del15"           
 [5] "cancelled"            "cancellation_code"   
 [7] "diverted"             "flights"             
 [9] "div_airport_landings" "div_reached_dest"    
[11] "div2airport"          "div2airport_id"      
[13] "div2airport_seq_id"   "div2wheels_on"       
[15] "div2total_g_time"     "div2longest_g_time"  
[17] "div2wheels_off"       "div2tail_num"        
[19] "div3airport"          "div3airport_id"      
[21] "div3airport_seq_id"   "div3wheels_on"       
[23] "div3total_g_time"     "div3longest_g_time"  
[25] "div3wheels_off"       "div3tail_num"        
[27] "div4airport"          "div4airport_id"      
[29] "div4airport_seq_id"   "div4wheels_on"       
[31] "div4total_g_time"     "div4longest_g_time"  
[33] "div4wheels_off"       "div4tail_num"        
[35] "div5airport"          "div5airport_id"      
[37] "div5airport_seq_id"   "div5wheels_on"       
[39] "div5total_g_time"     "div5longest_g_time"  
[41] "div5wheels_off"       "div5tail_num"        
[43] "x110"                

Don’t forget everything() …You can rearrange columns more swiftly too, as shown below where the first two columns I want are explicitly specified and then all remaining columns via everything()

cmh.df %>%
  select(
    flight_date, flight_number_reporting_airline,
    everything()
    ) -> df.ev 

head(df.ev[, c(1:6)], 15)
   flight_date flight_number_reporting_airline id year quarter month
1   2017-01-02                            2419  1 2017       1     1
2   2017-01-02                            2583  1 2017       1     1
3   2017-01-03                             977  1 2017       1     1
4   2017-01-03                             977  1 2017       1     1
5   2017-01-03                            1119  1 2017       1     1
6   2017-01-03                            1170  1 2017       1     1
7   2017-01-03                            1276  1 2017       1     1
8   2017-01-03                            1276  1 2017       1     1
9   2017-01-03                            1327  1 2017       1     1
10  2017-01-03                            1405  1 2017       1     1
11  2017-01-03                            1405  1 2017       1     1
12  2017-01-03                            1533  1 2017       1     1
13  2017-01-03                            1533  1 2017       1     1
14  2017-01-03                            1736  1 2017       1     1
15  2017-01-03                            1897  1 2017       1     1

If you want to make some sweeping changes across all columns, you can do that with select_all(). Here we flip the column names to uppercase, and then I am replacing the underscore (_) with no whitespace at all.

cmh.df %>%
  select_all(toupper) -> df.up

names(df.up[, 1:10])
 [1] "ID"                          "YEAR"                       
 [3] "QUARTER"                     "MONTH"                      
 [5] "DAYOF_MONTH"                 "DAY_OF_WEEK"                
 [7] "FLIGHT_DATE"                 "REPORTING_AIRLINE"          
 [9] "DOT_ID_REPORTING_AIRLINE"    "IATA_CODE_REPORTING_AIRLINE"
cmh.df %>%
  select_all(toupper) %>%
  select_all(~str_replace_all(., "_", "")) -> df.up 

names(df.up[, 1:10])
 [1] "ID"                       "YEAR"                    
 [3] "QUARTER"                  "MONTH"                   
 [5] "DAYOFMONTH"               "DAYOFWEEK"               
 [7] "FLIGHTDATE"               "REPORTINGAIRLINE"        
 [9] "DOTIDREPORTINGAIRLINE"    "IATACODEREPORTINGAIRLINE"

In contrast, select_at() will allow you to operate on variables that meet a specific criterion. For example, here, only columns with the string ar in their column name or that have column names starting with c are being selected and flipped to uppercase.

mtcars %>%
  select_at(
    vars(contains("ar"), starts_with("c")
         ),
    toupper
    ) %>%
  head()
                  GEAR CARB CYL
Mazda RX4            4    4   6
Mazda RX4 Wag        4    4   6
Datsun 710           4    1   4
Hornet 4 Drive       3    1   6
Hornet Sportabout    3    2   8
Valiant              3    1   6

rownames_to_column

This is a handy function to flip a data-frame’s rownames into a column

head(mtcars[, c(1:4)])
                   mpg cyl disp  hp
Mazda RX4         21.0   6  160 110
Mazda RX4 Wag     21.0   6  160 110
Datsun 710        22.8   4  108  93
Hornet 4 Drive    21.4   6  258 110
Hornet Sportabout 18.7   8  360 175
Valiant           18.1   6  225 105
mtcars %>%
  rownames_to_column() %>%
  head(3)
        rowname  mpg cyl disp  hp drat    wt  qsec vs am gear carb
1     Mazda RX4 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
2 Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
3    Datsun 710 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1

mutate_at() and mutate_if()

mutate_at() will allow you to mutate specific columns when the specified condition is met. For example, if the column name contains “el” I want the contents of the column to be rendered lowercase.

mtcars %>%
  mutate_at(
    vars(
      contains("el")
      ),
    tolower
    ) -> mut.df 

head(mut.df)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Or if the column name contains “pg”, I want the column’s values to be divided by 10

mtcars %>%
  mutate_at(
    vars(
      contains("pg")
      ),
    ~(. / 10)
    ) -> mut.df 

head(mut.df)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         2.10   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     2.10   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        2.28   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    2.14   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 1.87   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           1.81   6  225 105 2.76 3.460 20.22  1  0    3    1

Here, if a column is a factor, that column’s values will be converted into a uppercase

mtcars %>% 
  mutate_if(
    is.factor,
    toupper
    ) -> mut.df 

head(mut.df)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

na_if()

Need to convert values that meet specific criteria into NA? Here we are asking that if any variable mpg through vs has the value of 110, this value should be converted to NA (a missing value).

mtcars %>% 
  select(mpg:vs) %>%
  na_if(110) -> naif.df

head(naif.df)
                   mpg cyl disp  hp drat    wt  qsec vs
Mazda RX4         21.0   6  160  NA 3.90 2.620 16.46  0
Mazda RX4 Wag     21.0   6  160  NA 3.90 2.875 17.02  0
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1
Hornet 4 Drive    21.4   6  258  NA 3.08 3.215 19.44  1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0
Valiant           18.1   6  225 105 2.76 3.460 20.22  1

rowwise()

This comes in handy if you have to operate on two or more columns but want the calculation done for each row. Here the average of mpg + qsec is being calculated for each model

mtcars %>% 
  rowwise() %>%
  mutate(
    average = mean(c(mpg, qsec))
    ) -> row.df 

head(row.df[, c(1, 7, 12)])
# A tibble: 6 × 3
# Rowwise: 
    mpg  qsec average
  <dbl> <dbl>   <dbl>
1  21    16.5    18.7
2  21    17.0    19.0
3  22.8  18.6    20.7
4  21.4  19.4    20.4
5  18.7  17.0    17.9
6  18.1  20.2    19.2

between()

As the name suggests, this allows you to filter for values that fall between the specified range (here mpg of 21 to 24). These boundary values are included in the filtered results

mtcars %>% 
  filter(
    between(mpg, 21, 24)
    ) -> between.df 

between.df
                mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4      21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710     22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Merc 230       22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Toyota Corona  21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

near()

This allows you to filter within a range of values, the range specified via tol. Here, for example, we are asking for any case where mpg is inside the \(\pm 0.5\) range of \(mpg = 21\), i.e., \(20.5 < x < 21.5\)

mtcars %>% 
  filter(
    near(mpg, 21, tol = 0.5)
    ) -> near.df

near.df
                mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4      21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag  21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Hornet 4 Drive 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Volvo 142E     21.4   4  121 109 4.11 2.780 18.60  1  1    4    2
mtcars %>% 
  filter(
    near(mpg, 21, tol = 0.5 * sd(mpg))
    ) -> near.df 

head(near.df)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

xor()

This is just another way of saying filter if mpg < 22 | cyl < 6

mtcars %>%
  filter(
    xor(mpg < 22, cyl < 6)
    ) -> xor.df

head(xor.df)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
mtcars %>%
  filter(
    mpg < 22 | cyl < 6
    ) -> nonxor.df 

head(nonxor.df)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

filter_all()

This will filter all columns based on the condition specified, for example, finding all instances where baby names include the string “nnie”

library(babynames)

babynames %>%
  filter_all(
    any_vars(
      str_detect(., pattern = "nnie")
      )
    ) -> filt.df 

head(filt.df)
# A tibble: 6 × 5
   year sex   name       n     prop
  <dbl> <chr> <chr>  <int>    <dbl>
1  1880 F     Minnie  1746 0.0179  
2  1880 F     Annie   1258 0.0129  
3  1880 F     Jennie   793 0.00812 
4  1880 F     Fannie   560 0.00574 
5  1880 F     Nannie   248 0.00254 
6  1880 F     Winnie    49 0.000502
babynames %>%
  select_if(is.numeric) %>%
  filter_all(
    any_vars(
      . == 1880
      )
    ) -> filt.df

head(filt.df)
# A tibble: 6 × 3
   year     n   prop
  <dbl> <int>  <dbl>
1  1880  7065 0.0724
2  1880  2604 0.0267
3  1880  2003 0.0205
4  1880  1939 0.0199
5  1880  1746 0.0179
6  1880  1578 0.0162

This has retained only those rows where at least one column/variable has a value of 1800

filter_if()

This will allow you to filter if the logical condition is met, here if the column is a character column and if any character column has a missing value

starwars %>%
  filter_if(
    is.character,
    any_vars(is.na(.)
             )
    ) -> filt.df

head(filt.df)
# A tibble: 6 × 14
  name   height  mass hair_color skin_color eye_color birth_year sex  
  <chr>   <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>
1 C-3PO     167    75 <NA>       gold       yellow           112 none 
2 R2-D2      96    32 <NA>       white, bl… red               33 none 
3 R5-D4      97    32 <NA>       white, red red               NA none 
4 Greedo    173    74 <NA>       green      black             44 male 
5 Jabba…    175  1358 <NA>       green-tan… orange           600 herm…
6 Yoda       66    17 white      green      brown            896 male 
# … with 6 more variables: gender <chr>, homeworld <chr>,
#   species <chr>, films <list>, vehicles <list>, starships <list>

filter_at()

This will come in very handy if you want to filter specific variables if a condition is met

mtcars %>%
  filter_at(
    vars(mpg, qsec),
    all_vars(. > 19)
    ) -> filt.df 

head(filt.df)
                mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Merc 240D      24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
Merc 230       22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
Toyota Corona  21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1

This looked for mpg and qsec and then retained rows where either variable had a value > 19

mtcars %>%
  filter_at(
    vars(contains("t")),
    any_vars(. > 3)
    ) -> filt.df 

head(filt.df)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Here we asked for columns with the letter t in their name, and any of these selected columns with values > 3

summarise_all()

Here we are asking that means be calculated for all selected variables – mpg:qsec – each on its own, of course

mtcars %>%
  select(mpg:qsec) %>%
  summarise_all(mean, na.rm = TRUE) -> summ.df 

summ.df
       mpg    cyl     disp       hp     drat      wt     qsec
1 20.09062 6.1875 230.7219 146.6875 3.596563 3.21725 17.84875

summarise_if()

Here we are asking for the means of specific variables – if they are numeric – to be calculated

mtcars %>%
  summarise_if(
    is.numeric, mean, na.rm = TRUE
    ) %>%
  rename_if(
    is.numeric, ~paste0("mean_", .)
    ) -> summ.df 

summ.df
  mean_mpg mean_cyl mean_disp  mean_hp mean_drat mean_wt mean_qsec
1 20.09062   6.1875  230.7219 146.6875  3.596563 3.21725  17.84875
  mean_vs mean_am mean_gear mean_carb
1  0.4375 0.40625    3.6875    2.8125

rename_if() is being used to append recognizable labels to the calculated values

summarise_at()

Here we are asking for the means of specific variables – those with the word “Delay” in their name – to be calculated

cmh.df %>%
  summarise_at(
    vars(contains("delay")), mean, na.rm = TRUE
    ) %>%
  rename_at(
    vars(contains("delay")), ~paste0("mean_", .)
    ) -> summ.df 

summ.df
  mean_dep_delay mean_dep_delay_minutes mean_departure_delay_groups
1       8.756439               11.54685                 -0.03739762
  mean_arr_delay mean_arr_delay_minutes mean_arrival_delay_groups
1       2.601107               11.07277                -0.3285672
  mean_carrier_delay mean_weather_delay mean_nas_delay
1           18.72072           2.676069       12.33165
  mean_security_delay mean_late_aircraft_delay mean_div_arr_delay
1          0.04338477                 26.01961           178.8784

rename_at() is being used to append recognizable labels to the calculated values

top_n()

This will give you the top or bottom n values (here n has been set to 7)

cmh.df %>%
  group_by(reporting_airline) %>%
  filter(dest == "CMH") %>%
  summarise(
    average.delay = mean(arr_delay, na.rm = TRUE)
    ) %>%
  arrange(-average.delay) %>%
  top_n(7)
# A tibble: 7 × 2
  reporting_airline average.delay
  <chr>                     <dbl>
1 F9                        7.54 
2 WN                        5.20 
3 AA                        4.92 
4 OO                        4.66 
5 EV                        0.966
6 DL                       -1.01 
7 UA                       -2.27 
cmh.df %>%
  group_by(reporting_airline) %>%
  filter(dest == "CMH") %>%
  summarise(
    average.delay = mean(arr_delay, na.rm = TRUE)
    ) %>%
  arrange(average.delay) %>%
  top_n(-7)
# A tibble: 7 × 2
  reporting_airline average.delay
  <chr>                     <dbl>
1 UA                       -2.27 
2 DL                       -1.01 
3 EV                        0.966
4 OO                        4.66 
5 AA                        4.92 
6 WN                        5.20 
7 F9                        7.54 

separate_rows

If you need to separate values accidentally stored in a single column into rows, use separate_rows

library(tibble)
df <- tibble(
  x = 1:3,
  y = c("a", "d,e,f", "g,h"),
  z = c("1", "2,3,4", "5,6")
  )

df
# A tibble: 3 × 3
      x y     z    
  <int> <chr> <chr>
1     1 a     1    
2     2 d,e,f 2,3,4
3     3 g,h   5,6  
separate_rows(df, y, z, convert = TRUE)
# A tibble: 6 × 3
      x y         z
  <int> <chr> <int>
1     1 a         1
2     2 d         2
3     2 e         3
4     2 f         4
5     3 g         5
6     3 h         6

dplyr’s join alternative to merge

dplyr has it own commands that execute what merge() did for us in the previous module. Let us see how this alternative route works but first we recreate the data we used with merge.

data.frame(
  Score = c(10, 21, 33, 12, 35, 67, 43, 99),
  ID = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "C31", "Q22")
  ) -> data1 

data.frame(
  Sex = c("Male", "Female", "Male", "Female", "Male", "Female", "Male", "Female"),
  ID = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "E52", "H71")
  ) -> data2 

data.frame(
  Score = c(10, 21, 33, 12, 35, 67, 43, 99),
  ID1 = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "C31", "Q22")
  ) -> data3 

data.frame(
  Sex = c("Male", "Female", "Male", "Female", "Male", "Female", "Male", "Female"),
  ID2 = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "E52", "H71")
  ) -> data4 

data.frame(
  Score = c(10, 21, 33, 12, 35, 67, 43, 99),
  ID = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "C31", "Q22")
  ) -> df1 

data.frame(
  Sex = c("Male", "Female", "Male", "Female", "Male", "Female", "Male", "Female"),
  ID = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "E52", "H71")
  ) -> df2 

data.frame(
  Age = c(6, 7, 6, 8, 8, 9, 10, 5),
  ID = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "E52", "H71")
  ) -> df3 

inner_join

This merges based on columns that appear in both data-sets, and will only merge rows that have the same value per column. In brief, only those observations common to both data-sets are retained.

inner_join(data1, data2, by = "ID")
  Score  ID    Sex
1    10 A12   Male
2    21 A23 Female
3    33 Z14   Male
4    12 WX1 Female
5    35 Y31   Male
6    67 D66 Female

If you do not specify the by = column dplyr defaults to all columns with similar names in each data-set …

inner_join(data1, data2)
  Score  ID    Sex
1    10 A12   Male
2    21 A23 Female
3    33 Z14   Male
4    12 WX1 Female
5    35 Y31   Male
6    67 D66 Female

left_join

This merge will result in all rows being retained from the first data-set listed in the command, but will include only those rows from the second data-set that have the same value(s) for the by variable(s) …

data1 %>%
  left_join(data2, by = "ID")
  Score  ID    Sex
1    10 A12   Male
2    21 A23 Female
3    33 Z14   Male
4    12 WX1 Female
5    35 Y31   Male
6    67 D66 Female
7    43 C31   <NA>
8    99 Q22   <NA>
data2 %>%
  left_join(data1, by = "ID")
     Sex  ID Score
1   Male A12    10
2 Female A23    21
3   Male Z14    33
4 Female WX1    12
5   Male Y31    35
6 Female D66    67
7   Male E52    NA
8 Female H71    NA

right_join

This merge will result in all rows being retained from the second data-set listed in the command, but will include only those rows from the first data-set that have the same value(s) for the by variable(s) …

data1 %>%
  right_join(data2, by = "ID")
  Score  ID    Sex
1    10 A12   Male
2    21 A23 Female
3    33 Z14   Male
4    12 WX1 Female
5    35 Y31   Male
6    67 D66 Female
7    NA E52   Male
8    NA H71 Female
data2 %>%
  right_join(data1, by = "ID")
     Sex  ID Score
1   Male A12    10
2 Female A23    21
3   Male Z14    33
4 Female WX1    12
5   Male Y31    35
6 Female D66    67
7   <NA> C31    43
8   <NA> Q22    99

full_join

This merge keeps all rows from each data-set, with those only found in one data-set being given NA values as necessary.

data1 %>%
  full_join(data2, by = "ID")
   Score  ID    Sex
1     10 A12   Male
2     21 A23 Female
3     33 Z14   Male
4     12 WX1 Female
5     35 Y31   Male
6     67 D66 Female
7     43 C31   <NA>
8     99 Q22   <NA>
9     NA E52   Male
10    NA H71 Female

semi_join

What if you only want observations that match in both data-sets and only those columns found in one data-set but not in the other? Use a semi-join. In the examples below, you will notice that the column Sex is not included in the first result since it is only found in data2 but is there in the second example since data2 is guiding the semi-Join().

data.frame(
  Score = c(10, 21, 33, 12, 35, 67, 43, 99),
  ID = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "C31", "Q22")
  ) -> data1 

data.frame(
  Score = c(10, 21, 33, 12, 35, 67, NA, NA),
  Sex = c("Male", "Female", "Male", "Female", "Male", "Female", "Male", "Female"),
  ID = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "E52", "H71")
  ) -> data2 

semi_join(data1, data2)
  Score  ID
1    10 A12
2    21 A23
3    33 Z14
4    12 WX1
5    35 Y31
6    67 D66
semi_join(data2, data1, by = "ID")
  Score    Sex  ID
1    10   Male A12
2    21 Female A23
3    33   Male Z14
4    12 Female WX1
5    35   Male Y31
6    67 Female D66

anti_join

This will only retain records not found in the second data-set listed.

anti_join(data1, data2)
  Score  ID
1    43 C31
2    99 Q22
anti_join(data2, data1)
  Score    Sex  ID
1    NA   Male E52
2    NA Female H71

Vectorized functions within mutate/transmute

You can create new columns by executing vectorized functions that vectors of some length as the input and return vectors of the same length as the output. These functions are shown below.

tibble::tibble(
  id = c(rep("A", 3), rep("B", 4), rep("C", 1), rep("D", 4)),
  x = c(seq(1, 10, by = 1), -1, -2),
  y = c(seq(11, 20, by = 1), -11, -12)
  ) -> a.df 

lag() versus lead()

To lag some value implies to assign to the current observation (i.e., row) the preceding value. Similarly, to lead implies to assign to the current observation (i.e., row) the value from a subsequent row. If I lead by 1, then I am assigning the value from the next row, lead by 2 then I am assigning the value from 2 rows ahead, and so on.

Say I want to lag the values of x by 1, first regardless of id but then within each id.

a.df %>%
  mutate(
    x.lag = lag(x, 1)
    )
# A tibble: 12 × 4
   id        x     y x.lag
   <chr> <dbl> <dbl> <dbl>
 1 A         1    11    NA
 2 A         2    12     1
 3 A         3    13     2
 4 B         4    14     3
 5 B         5    15     4
 6 B         6    16     5
 7 B         7    17     6
 8 C         8    18     7
 9 D         9    19     8
10 D        10    20     9
11 D        -1   -11    10
12 D        -2   -12    -1
a.df %>%
  group_by(id) %>%
  mutate(
    x.lag = lag(x, 1)
    )
# A tibble: 12 × 4
# Groups:   id [4]
   id        x     y x.lag
   <chr> <dbl> <dbl> <dbl>
 1 A         1    11    NA
 2 A         2    12     1
 3 A         3    13     2
 4 B         4    14    NA
 5 B         5    15     4
 6 B         6    16     5
 7 B         7    17     6
 8 C         8    18    NA
 9 D         9    19    NA
10 D        10    20     9
11 D        -1   -11    10
12 D        -2   -12    -1

Similarly, I can also lead by a specific “period”

a.df %>%
  mutate(
    x.lead = lead(x, 2)
    )
# A tibble: 12 × 4
   id        x     y x.lead
   <chr> <dbl> <dbl>  <dbl>
 1 A         1    11      3
 2 A         2    12      4
 3 A         3    13      5
 4 B         4    14      6
 5 B         5    15      7
 6 B         6    16      8
 7 B         7    17      9
 8 C         8    18     10
 9 D         9    19     -1
10 D        10    20     -2
11 D        -1   -11     NA
12 D        -2   -12     NA
a.df %>%
  group_by(id) %>%
  mutate(
    y.lead = lead(y, 1)
    )
# A tibble: 12 × 4
# Groups:   id [4]
   id        x     y y.lead
   <chr> <dbl> <dbl>  <dbl>
 1 A         1    11     12
 2 A         2    12     13
 3 A         3    13     NA
 4 B         4    14     15
 5 B         5    15     16
 6 B         6    16     17
 7 B         7    17     NA
 8 C         8    18     NA
 9 D         9    19     20
10 D        10    20    -11
11 D        -1   -11    -12
12 D        -2   -12     NA

Cumulative aggregates

We can also calculate various aggregates by cumulating over/across columns.

a.df %>%
  mutate(
    c = cummin(x)
    ) # c in any row is the smallest value observed thus far 
# A tibble: 12 × 4
   id        x     y     c
   <chr> <dbl> <dbl> <dbl>
 1 A         1    11     1
 2 A         2    12     1
 3 A         3    13     1
 4 B         4    14     1
 5 B         5    15     1
 6 B         6    16     1
 7 B         7    17     1
 8 C         8    18     1
 9 D         9    19     1
10 D        10    20     1
11 D        -1   -11    -1
12 D        -2   -12    -2
a.df %>%
  mutate(
    c = cummax(x)
    ) # c in any row is the largest value observed thus far 
# A tibble: 12 × 4
   id        x     y     c
   <chr> <dbl> <dbl> <dbl>
 1 A         1    11     1
 2 A         2    12     2
 3 A         3    13     3
 4 B         4    14     4
 5 B         5    15     5
 6 B         6    16     6
 7 B         7    17     7
 8 C         8    18     8
 9 D         9    19     9
10 D        10    20    10
11 D        -1   -11    10
12 D        -2   -12    10
a.df %>%
  mutate(
    c = cummean(x)
    ) # c is the average of the current + preceding value(s) 
# A tibble: 12 × 4
   id        x     y     c
   <chr> <dbl> <dbl> <dbl>
 1 A         1    11  1   
 2 A         2    12  1.5 
 3 A         3    13  2   
 4 B         4    14  2.5 
 5 B         5    15  3   
 6 B         6    16  3.5 
 7 B         7    17  4   
 8 C         8    18  4.5 
 9 D         9    19  5   
10 D        10    20  5.5 
11 D        -1   -11  4.91
12 D        -2   -12  4.33
a.df %>%
  mutate(
    c = cumsum(x)
    ) # c is the cumulative sum of the current + preceding value(s)  
# A tibble: 12 × 4
   id        x     y     c
   <chr> <dbl> <dbl> <dbl>
 1 A         1    11     1
 2 A         2    12     3
 3 A         3    13     6
 4 B         4    14    10
 5 B         5    15    15
 6 B         6    16    21
 7 B         7    17    28
 8 C         8    18    36
 9 D         9    19    45
10 D        10    20    55
11 D        -1   -11    54
12 D        -2   -12    52
a.df %>%
  mutate(
    c = cumprod(x)
    ) # c is the cumulative product of the current value of x and the preceding value of c 
# A tibble: 12 × 4
   id        x     y        c
   <chr> <dbl> <dbl>    <dbl>
 1 A         1    11        1
 2 A         2    12        2
 3 A         3    13        6
 4 B         4    14       24
 5 B         5    15      120
 6 B         6    16      720
 7 B         7    17     5040
 8 C         8    18    40320
 9 D         9    19   362880
10 D        10    20  3628800
11 D        -1   -11 -3628800
12 D        -2   -12  7257600

Say we have information on bank balance we hold by date. If the balance is negative, then we have leaned on an overdraft. We can use cumany() and cumall() to extract rows once a condition is met.

data.frame(
  date = as.Date("2020-01-01") + 0:6,
  balance = c(100, 50, 25, -25, -50, 30, 120)
  ) -> df 

df
        date balance
1 2020-01-01     100
2 2020-01-02      50
3 2020-01-03      25
4 2020-01-04     -25
5 2020-01-05     -50
6 2020-01-06      30
7 2020-01-07     120
# show me all rows starting with when the first overdraft occurs 
df %>%
  filter(
    cumany(balance < 0)
    )
        date balance
1 2020-01-04     -25
2 2020-01-05     -50
3 2020-01-06      30
4 2020-01-07     120
# show me all rows preceding the first overdraft 
df %>%
  filter(
    cumall(!(balance < 0))
    )
        date balance
1 2020-01-01     100
2 2020-01-02      50
3 2020-01-03      25

Ranking functions

These come in handy when we need to rank observations based on some criterion.

cume_dist()

a.df %>%
  mutate(
    d = cume_dist(x)
    ) # gives a cumulative proportion of all values of x less than or equal to the current value. 
# A tibble: 12 × 4
   id        x     y      d
   <chr> <dbl> <dbl>  <dbl>
 1 A         1    11 0.25  
 2 A         2    12 0.333 
 3 A         3    13 0.417 
 4 B         4    14 0.5   
 5 B         5    15 0.583 
 6 B         6    16 0.667 
 7 B         7    17 0.75  
 8 C         8    18 0.833 
 9 D         9    19 0.917 
10 D        10    20 1     
11 D        -1   -11 0.167 
12 D        -2   -12 0.0833

Note that -2 is the smallest value so it is ranked first, and the proportion becomes d = (1/12). -1 is bigger than -2, hence ranked second. Now d = (2/12) = 0.167 because two observations have a value of x equal to -1 or less. And so on …

row_number()

Say we have multiple observations for some group and want to flag whether a specific observation is the first, second, third, etc. within the group.

a.df %>%
  group_by(id) %>%
  mutate(
    flag_row = row_number()
    )
# A tibble: 12 × 4
# Groups:   id [4]
   id        x     y flag_row
   <chr> <dbl> <dbl>    <int>
 1 A         1    11        1
 2 A         2    12        2
 3 A         3    13        3
 4 B         4    14        1
 5 B         5    15        2
 6 B         6    16        3
 7 B         7    17        4
 8 C         8    18        1
 9 D         9    19        1
10 D        10    20        2
11 D        -1   -11        3
12 D        -2   -12        4

If you want to flag in ascending or descending order of some variable, than you can run the following and when you do, pay attention to what happens with D since the smallest value of x is flagged as the first observation and NOT the order in which x appears in the data-set.

a.df %>%
  group_by(id) %>%
  mutate(
    flag_row = row_number(x)
    )
# A tibble: 12 × 4
# Groups:   id [4]
   id        x     y flag_row
   <chr> <dbl> <dbl>    <int>
 1 A         1    11        1
 2 A         2    12        2
 3 A         3    13        3
 4 B         4    14        1
 5 B         5    15        2
 6 B         6    16        3
 7 B         7    17        4
 8 C         8    18        1
 9 D         9    19        3
10 D        10    20        4
11 D        -1   -11        2
12 D        -2   -12        1

min_rank() and percent_rank()

This allows you to flag the minimum rank for an observation as a rank or as a proportion

a.df %>%
  mutate(
    min.rank = min_rank(x)
    )
# A tibble: 12 × 4
   id        x     y min.rank
   <chr> <dbl> <dbl>    <int>
 1 A         1    11        3
 2 A         2    12        4
 3 A         3    13        5
 4 B         4    14        6
 5 B         5    15        7
 6 B         6    16        8
 7 B         7    17        9
 8 C         8    18       10
 9 D         9    19       11
10 D        10    20       12
11 D        -1   -11        2
12 D        -2   -12        1
a.df %>%
  group_by(id) %>%
  mutate(
    min.rank = min_rank(x)
    )
# A tibble: 12 × 4
# Groups:   id [4]
   id        x     y min.rank
   <chr> <dbl> <dbl>    <int>
 1 A         1    11        1
 2 A         2    12        2
 3 A         3    13        3
 4 B         4    14        1
 5 B         5    15        2
 6 B         6    16        3
 7 B         7    17        4
 8 C         8    18        1
 9 D         9    19        3
10 D        10    20        4
11 D        -1   -11        2
12 D        -2   -12        1
a.df %>%
  mutate(
    pct.rank = percent_rank(x)
    )
# A tibble: 12 × 4
   id        x     y pct.rank
   <chr> <dbl> <dbl>    <dbl>
 1 A         1    11   0.182 
 2 A         2    12   0.273 
 3 A         3    13   0.364 
 4 B         4    14   0.455 
 5 B         5    15   0.545 
 6 B         6    16   0.636 
 7 B         7    17   0.727 
 8 C         8    18   0.818 
 9 D         9    19   0.909 
10 D        10    20   1     
11 D        -1   -11   0.0909
12 D        -2   -12   0     
a.df %>%
  group_by(id) %>%  
  mutate(
    pct.rank = percent_rank(x)
    )
# A tibble: 12 × 4
# Groups:   id [4]
   id        x     y pct.rank
   <chr> <dbl> <dbl>    <dbl>
 1 A         1    11    0    
 2 A         2    12    0.5  
 3 A         3    13    1    
 4 B         4    14    0    
 5 B         5    15    0.333
 6 B         6    16    0.667
 7 B         7    17    1    
 8 C         8    18  NaN    
 9 D         9    19    0.667
10 D        10    20    1    
11 D        -1   -11    0.333
12 D        -2   -12    0    

In addition, you can use first(), last(), or then a specific position in a vector via nth(somevariable, position) as shown below. Note, however, that here the rows are not sorted in ascending/descending order of your target variable but instead used as they occur in the dataset.

a.df %>%
  mutate(
    first.value = first(y)
    )
# A tibble: 12 × 4
   id        x     y first.value
   <chr> <dbl> <dbl>       <dbl>
 1 A         1    11          11
 2 A         2    12          11
 3 A         3    13          11
 4 B         4    14          11
 5 B         5    15          11
 6 B         6    16          11
 7 B         7    17          11
 8 C         8    18          11
 9 D         9    19          11
10 D        10    20          11
11 D        -1   -11          11
12 D        -2   -12          11
a.df %>%
  mutate(
    last.value = last(y)
    )
# A tibble: 12 × 4
   id        x     y last.value
   <chr> <dbl> <dbl>      <dbl>
 1 A         1    11        -12
 2 A         2    12        -12
 3 A         3    13        -12
 4 B         4    14        -12
 5 B         5    15        -12
 6 B         6    16        -12
 7 B         7    17        -12
 8 C         8    18        -12
 9 D         9    19        -12
10 D        10    20        -12
11 D        -1   -11        -12
12 D        -2   -12        -12
a.df %>%
  mutate(
    third.value = nth(y, 3)
    )
# A tibble: 12 × 4
   id        x     y third.value
   <chr> <dbl> <dbl>       <dbl>
 1 A         1    11          13
 2 A         2    12          13
 3 A         3    13          13
 4 B         4    14          13
 5 B         5    15          13
 6 B         6    16          13
 7 B         7    17          13
 8 C         8    18          13
 9 D         9    19          13
10 D        10    20          13
11 D        -1   -11          13
12 D        -2   -12          13

And then of course the grouped variants

a.df %>%
  group_by(id) %>% 
  mutate(
    first.value = first(y)
    )
# A tibble: 12 × 4
# Groups:   id [4]
   id        x     y first.value
   <chr> <dbl> <dbl>       <dbl>
 1 A         1    11          11
 2 A         2    12          11
 3 A         3    13          11
 4 B         4    14          14
 5 B         5    15          14
 6 B         6    16          14
 7 B         7    17          14
 8 C         8    18          18
 9 D         9    19          19
10 D        10    20          19
11 D        -1   -11          19
12 D        -2   -12          19
a.df %>%
  group_by(id) %>%   
  mutate(
    last.value = last(y)
    )
# A tibble: 12 × 4
# Groups:   id [4]
   id        x     y last.value
   <chr> <dbl> <dbl>      <dbl>
 1 A         1    11         13
 2 A         2    12         13
 3 A         3    13         13
 4 B         4    14         17
 5 B         5    15         17
 6 B         6    16         17
 7 B         7    17         17
 8 C         8    18         18
 9 D         9    19        -12
10 D        10    20        -12
11 D        -1   -11        -12
12 D        -2   -12        -12
a.df %>%
  group_by(id) %>%   
  mutate(
    third.value = nth(y, 3)
    )
# A tibble: 12 × 4
# Groups:   id [4]
   id        x     y third.value
   <chr> <dbl> <dbl>       <dbl>
 1 A         1    11          13
 2 A         2    12          13
 3 A         3    13          13
 4 B         4    14          16
 5 B         5    15          16
 6 B         6    16          16
 7 B         7    17          16
 8 C         8    18          NA
 9 D         9    19         -11
10 D        10    20         -11
11 D        -1   -11         -11
12 D        -2   -12         -11

Citation

For attribution, please cite this work as

Ruhil (2022, Feb. 3). Tidy and Tweak Your Data. Retrieved from https://aniruhil.org/courses/mpa6020/handouts/module04.html

BibTeX citation

@misc{ruhil2022tidy,
  author = {Ruhil, Ani},
  title = {Tidy and Tweak Your Data},
  url = {https://aniruhil.org/courses/mpa6020/handouts/module04.html},
  year = {2022}
}