Data Manipulation with dplyr (1/2)

@aniruhil
Updated on 2020-12-18

The Bureau of Transportation Statistics gathers a lot of data but the one we will use to learn dplyr is their data on airlines’ on-time performance. I have here a snippet of the available data – data for almost all domestic flights originating from O’Hare or Midway airports in Chicago in the month of August 2019.

library(here)
load(
  here("workshops/athensr/handouts/data", "chicago.flts.aug.RData")
  )

library(tidyverse)

readr::read_delim(
  here("workshops/athensr/handouts/data", "airline-ontime-docs.txt"), 
  "\t", escape_double = FALSE, trim_ws = TRUE
  ) -> airline_ontime_docs 

DT::datatable(airline_ontime_docs, caption = "Codebook for the data fields")

filter()

Say you want to retain very specific records for analysis. This can be done via filter() by providing the column name(s) and the selection criteria that should be applied to the column’s values.

Character values in filter()

We start simple, with character values, for example, all flights that originate in Midway.

chicago.flts.aug %>%
  filter(
    origin == "MDW" 
    ) -> tab01

table(tab01$origin)

 MDW 
7397 

What if I want flights to Los Angeles, San Francisco, and Seattle, regardless of origin being O’Hare or Midway?

chicago.flts.aug %>%
  filter(
    dest %in% c("LAX", "SFO", "SEA")
    ) -> tab02 

table(tab02$dest)

LAX SEA SFO 
892 663 714 

What if I want flights from O’Hare to these three airports?

chicago.flts.aug %>%
  filter(
    origin == "ORD", 
    dest %in% c("LAX", "SFO", "SEA")
    ) -> tab03 

table(tab03$origin, tab03$dest)
     
      LAX SEA SFO
  ORD 711 595 660

Note: Here, , is the same as &. If we wanted to specify or, that would entail using | as in the example below. That is, how about flights either from MDW or to any of the three airports flagged earlier?

chicago.flts.aug %>%
  filter(
    origin == "MDW" | 
    dest %in% c("LAX", "SFO", "SEA")
    ) -> tab04 

table(tab04$origin, tab04$dest)
     
      ABQ ALB ATL AUS BDL BHM BNA BOI BOS BUF BUR BWI CHS CLE CLT CMH
  MDW  58  62 386 111  88  61 208   4 154  93  28 200  63 143  84 191
  ORD   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
     
      CVG DAL DCA DEN DTW ECP EWR FLL GRR HOU IND JAX LAS LAX LGA MCI
  MDW 141 214 186 277 200   2 167 105  84 181  37  31 241 181 186 239
  ORD   0   0   0   0   0   0   0   0   0   0   0   0   0 711   0   0
     
      MCO MEM MHT MSP MSY OAK OKC OMA ONT ORF PDX PHL PHX PIT PNS PVD
  MDW 210  57  84 418 114 108  32 131  31  41  27 172 178 148   7  84
  ORD   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
     
      PWM RDU RNO RSW SAN SAT SDF SEA SFO SJC SJU SLC SMF STL TPA TUS
  MDW   5  97  32  71 117  67 110  68  54  51   9  62  35 225 115  31
  ORD   0   0   0   0   0   0   0 595 660   0   0   0   0   0   0   0

What if I need to filter when the column has values not equal to some target value(s)? This can be done via != if you are dealing with a single target value.

chicago.flts.aug %>%
  filter(
    origin != "ORD" 
    ) -> tab05 

table(tab05$origin)

 ABE  ABQ  ALB  ALO  ANC  ASE  ATL  ATW  AUS  AVL  AVP  AZO  BDL  BFM 
  73  136  268   61   99   72 1054   92  383  124   83   89  343    9 
 BGR  BHM  BIL  BIS  BKG  BMI  BNA  BOI  BOS  BTV  BUF  BUR  BWI  BZN 
  57  214   19    3    8   88  669   97  869  180  361   28  487   72 
 CAK  CGI  CHO  CHS  CID  CKB  CLE  CLT  CMH  CMI  CMX  COD  COS  COU 
  82   27   93  175  222   31  573  535  636  148   62    1   87  105 
 CRW  CVG  CWA  DAL  DAY  DBQ  DCA  DEN  DFW  DLH  DRO  DSM  DTW  EAU 
  22  657   86  214  189   88  872  883  672  186   10  266  796   62 
 ECP  ELP  ERI  EUG  EVV  EWR  EYW  FAI  FAR  FAT  FCA  FLL  FNT  FSD 
   7   88   31   31   95  685    6   31  134   31   66  226  155  103 
 FWA  GEG  GJT  GRB  GRR  GSO  GSP  GTF  HHH  HNL  HOU  HPN  HRL  HSV 
 186   30    5  198  433   89  132   20    6   31  187   93    9   81 
 IAD  IAH  ICT  ILM  IND  JAC  JAX  JFK  JLN  LAN  LAS  LAX  LEX  LGA 
 185  432  171   46  455  132  211  277   62   76  656  888   94 1500 
 LIT  LNK  LSE  LWB  MBS  MCI  MCO  MDT  MDW  MEI  MEM  MHK  MHT  MIA 
 143   28  122   34   88  601  564  179 7397   31  253   92  146  321 
 MKE  MKG  MLI  MQT  MSN  MSO  MSP  MSY  MTJ  MYR  OAK  OGG  OGS  OKC 
 145   62  122   45  273   36 1090  230    4   61  122   21   22  229 
 OMA  ONT  ORF  PAH  PBI  PDX  PHL  PHX  PIA  PIT  PNS  PVD  PWM  RAP 
 434   31  222   57   31  247  624  592   98  518   49  224  147  113 
 RDM  RDU  RIC  RNO  ROA  ROC  RST  RSW  SAN  SAT  SAV  SBN  SCE  SDF 
  30  406  228  125   51  234  123  202  404  242  108   77   64  287 
 SEA  SFO  SGF  SHD  SJC  SJU  SLC  SLN  SMF  SNA  SPI  SRQ  STL  SUX 
 662  713  153   26  163   74  357   26  153  238   88   31  488   71 
 SYR  TOL  TPA  TTN  TUL  TUS  TVC  TYS  UIN  VPS  XNA 
 270   87  336   31  159   93  246  188   53    5  228 

If you have multiple target values, then you should use the following:

chicago.flts.aug %>%
  filter(
    !dest %in% c("LAX", "SFO", "SEA")
    ) -> tab06 

table(tab06$dest)

  ABE   ABQ   ALB   ALO   ANC   ASE   ATL   ATW   AUS   AVL   AVP 
   73   135   269    60    99    72  1064    92   369   124    83 
  AZO   BDL   BFM   BGR   BHM   BIL   BIS   BKG   BMI   BNA   BOI 
   90   342     9    57   214    18     3     8    88   692    99 
  BOS   BTV   BUF   BUR   BWI   BZN   CAK   CGI   CHO   CHS   CID 
  872   179   329    28   471    72    82    26    93   176   222 
  CKB   CLE   CLT   CMH   CMI   CMX   COD   COS   COU   CRW   CVG 
   31   572   531   636   149    62     2    88   104    22   657 
  CWA   DAL   DAY   DBQ   DCA   DEN   DFW   DLH   DRO   DSM   DTW 
   86   214   189    88   869   883   668   186    10   256   796 
  EAU   ECP   ELP   ERI   EUG   EVV   EWR   EYW   FAI   FAR   FAT 
   62     7    87    32    31    94   679     6    31   110    31 
  FCA   FLL   FNT   FSD   FWA   GEG   GJT   GRB   GRR   GSO   GSP 
   66   233   154   103   186    31     5   197   465    89   133 
  GTF   HHH   HNL   HOU   HPN   HRL   HSV   IAD   IAH   ICT   ILM 
   19     6    31   181    93     9    78   185   437   170    46 
  IND   JAC   JAX   JFK   JLN   LAN   LAS   LEX   LGA   LIT   LNK 
  470   132   211   277    62    76   658    94  1501   167    58 
  LSE   LWB   MBS   MCI   MCO   MDT   MDW   MEI   MEM   MHK   MHT 
  122    34    88   605   560   179  7393    31   251    92   146 
  MIA   MKE   MKG   MLI   MQT   MSN   MSO   MSP   MSY   MTJ   MYR 
  323   145    62   122    45   273    36  1089   258     4    61 
  OAK   OGG   OGS   OKC   OMA   ONT   ORD   ORF   PAH   PBI   PDX 
  139    21    22   231   415    31 30770   222    58    31   248 
  PHL   PHX   PIA   PIT   PNS   PVD   PWM   RAP   RDM   RDU   RIC 
  630   578    98   520    49   208   147   126    31   406   227 
  RNO   ROA   ROC   RST   RSW   SAN   SAT   SAV   SBN   SCE   SDF 
  123    52   236   123   201   385   243   118    77    64   256 
  SGF   SHD   SJC   SJU   SLC   SLN   SMF   SNA   SPI   SRQ   STL 
  154    26   159    74   349    26   153   236    88    31   477 
  SUX   SYR   TOL   TPA   TTN   TUL   TUS   TVC   TYS   UIN   VPS 
   71   270    87   336    31   160    93   256   196    53     5 
  XNA 
  217 

Note the ! goes before the column name and not before %in%; this is an oft-forgotten switch.

Numeric values in filter()

If we are dealing with numeric values we could ask for values that fall within/outside a range, or then == >= <= >, or < some value. We could also employ not equal to, as in !=.

I am going to use dep_delay_minutes and select all instances of positive delays.

chicago.flts.aug %>%
  filter(
    dep_delay_minutes > 0
    ) -> tab07 

summary(tab07$dep_delay_minutes)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00    7.00   20.00   46.09   55.00 1667.00 

What about only delays of 1 or 5 minutes?

chicago.flts.aug %>%
  filter(
    dep_delay_minutes %in% c(1, 5)
    ) -> tab08 

table(tab08$dep_delay_minutes)

   1    5 
1696  966 

Delays anywhere between 1 through 5 minutes?

chicago.flts.aug %>%
  filter(
    dep_delay_minutes %in% seq(1:5)
    ) -> tab09 

table(tab09$dep_delay_minutes)

   1    2    3    4    5 
1696 1320 1177 1029  966 

Delays less than or equal to 10 minutes or longer than 30 minutes?

chicago.flts.aug %>%
  filter(
    dep_delay_minutes <= 10 |
    dep_delay_minutes > 30  
    ) -> tab10 

summary(tab10$dep_delay_minutes)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    0.00    0.00   17.61    4.00 1667.00 

select()

If the data-set has a lot of columns, and you do not need all of them for your analyses, select() comes in handy to retain specific columns. You can select in multiple ways – by column name(s), column number(s), and then by string or other attributes.

chicago.flts.aug %>%
  select(c(6:7, 11, 15, 19, 22:30)) %>%
  names()
 [1] "flight_date"                    
 [2] "reporting_airline"              
 [3] "flight_number_reporting_airline"
 [4] "origin"                         
 [5] "origin_state_name"              
 [6] "dest_airport_seq_id"            
 [7] "dest_city_market_id"            
 [8] "dest"                           
 [9] "dest_city_name"                 
[10] "dest_state"                     
[11] "dest_state_fips"                
[12] "dest_state_name"                
[13] "dest_wac"                       
[14] "crs_dep_time"                   
chicago.flts.aug %>%
  select(c(year, month, dayof_month, reporting_airline, origin, dest)) %>%
  names()
[1] "year"              "month"             "dayof_month"      
[4] "reporting_airline" "origin"            "dest"             
chicago.flts.aug %>%
  select(contains("city")) %>%
  names()
[1] "origin_city_market_id" "origin_city_name"     
[3] "dest_city_market_id"   "dest_city_name"       
chicago.flts.aug %>%
  select(starts_with("origin")) %>%
  names()
[1] "origin_airport_id"     "origin_airport_seq_id"
[3] "origin_city_market_id" "origin"               
[5] "origin_city_name"      "origin_state"         
[7] "origin_state_fips"     "origin_state_name"    
[9] "origin_wac"           
chicago.flts.aug %>%
  select(ends_with("airline")) %>%
  names()
[1] "reporting_airline"              
[2] "dot_id_reporting_airline"       
[3] "iata_code_reporting_airline"    
[4] "flight_number_reporting_airline"
chicago.flts.aug %>%
  select(matches("air")) %>%
  names()
 [1] "reporting_airline"              
 [2] "dot_id_reporting_airline"       
 [3] "iata_code_reporting_airline"    
 [4] "flight_number_reporting_airline"
 [5] "origin_airport_id"              
 [6] "origin_airport_seq_id"          
 [7] "dest_airport_id"                
 [8] "dest_airport_seq_id"            
 [9] "air_time"                       
[10] "late_aircraft_delay"            
[11] "div_airport_landings"           
[12] "div1airport"                    
[13] "div1airport_id"                 
[14] "div1airport_seq_id"             
[15] "div2airport"                    
[16] "div2airport_id"                 
[17] "div2airport_seq_id"             
[18] "div3airport"                    
[19] "div3airport_id"                 
[20] "div3airport_seq_id"             
[21] "div4airport"                    
[22] "div4airport_id"                 
[23] "div4airport_seq_id"             
[24] "div5airport"                    
[25] "div5airport_id"                 
[26] "div5airport_seq_id"             

There are other options as well that you should explore online.

slice()

If the goal is to retain specific rows instead of columns, then slice() is your friend.

chicago.flts.aug %>%
  slice(1:10) 
# A tibble: 10 x 110
    year quarter month dayof_month day_of_week flight_date
   <dbl>   <dbl> <dbl>       <dbl>       <dbl> <date>     
 1  2019       3     8           1           4 2019-08-01 
 2  2019       3     8           1           4 2019-08-01 
 3  2019       3     8           1           4 2019-08-01 
 4  2019       3     8           1           4 2019-08-01 
 5  2019       3     8           1           4 2019-08-01 
 6  2019       3     8           1           4 2019-08-01 
 7  2019       3     8           1           4 2019-08-01 
 8  2019       3     8           1           4 2019-08-01 
 9  2019       3     8           1           4 2019-08-01 
10  2019       3     8           1           4 2019-08-01 
# … with 104 more variables: reporting_airline <chr>,
#   dot_id_reporting_airline <dbl>,
#   iata_code_reporting_airline <chr>, tail_number <chr>,
#   flight_number_reporting_airline <dbl>, origin_airport_id <dbl>,
#   origin_airport_seq_id <dbl>, origin_city_market_id <dbl>,
#   origin <chr>, origin_city_name <chr>, origin_state <chr>,
#   origin_state_fips <chr>, origin_state_name <chr>,
#   origin_wac <dbl>, dest_airport_id <dbl>,
#   dest_airport_seq_id <dbl>, dest_city_market_id <dbl>, dest <chr>,
#   dest_city_name <chr>, dest_state <chr>, dest_state_fips <chr>,
#   dest_state_name <chr>, dest_wac <dbl>, crs_dep_time <chr>,
#   dep_time <chr>, dep_delay <dbl>, dep_delay_minutes <dbl>,
#   dep_del15 <dbl>, departure_delay_groups <dbl>,
#   dep_time_blk <chr>, taxi_out <dbl>, wheels_off <chr>,
#   wheels_on <chr>, taxi_in <dbl>, crs_arr_time <chr>,
#   arr_time <chr>, arr_delay <dbl>, arr_delay_minutes <dbl>,
#   arr_del15 <dbl>, arrival_delay_groups <dbl>, arr_time_blk <chr>,
#   cancelled <dbl>, cancellation_code <chr>, diverted <dbl>,
#   crs_elapsed_time <dbl>, actual_elapsed_time <dbl>,
#   air_time <dbl>, flights <dbl>, distance <dbl>,
#   distance_group <dbl>, carrier_delay <dbl>, weather_delay <dbl>,
#   nas_delay <dbl>, security_delay <dbl>, late_aircraft_delay <dbl>,
#   first_dep_time <chr>, total_add_g_time <dbl>,
#   longest_add_g_time <dbl>, div_airport_landings <dbl>,
#   div_reached_dest <dbl>, div_actual_elapsed_time <dbl>,
#   div_arr_delay <dbl>, div_distance <dbl>, div1airport <chr>,
#   div1airport_id <dbl>, div1airport_seq_id <dbl>,
#   div1wheels_on <chr>, div1total_g_time <dbl>,
#   div1longest_g_time <dbl>, div1wheels_off <chr>,
#   div1tail_num <chr>, div2airport <lgl>, div2airport_id <lgl>,
#   div2airport_seq_id <lgl>, div2wheels_on <lgl>,
#   div2total_g_time <lgl>, div2longest_g_time <lgl>,
#   div2wheels_off <lgl>, div2tail_num <lgl>, div3airport <lgl>,
#   div3airport_id <lgl>, div3airport_seq_id <lgl>,
#   div3wheels_on <lgl>, div3total_g_time <lgl>,
#   div3longest_g_time <lgl>, div3wheels_off <lgl>,
#   div3tail_num <lgl>, div4airport <lgl>, div4airport_id <lgl>,
#   div4airport_seq_id <lgl>, div4wheels_on <lgl>,
#   div4total_g_time <lgl>, div4longest_g_time <lgl>,
#   div4wheels_off <lgl>, div4tail_num <lgl>, div5airport <lgl>,
#   div5airport_id <lgl>, div5airport_seq_id <lgl>,
#   div5wheels_on <lgl>, div5total_g_time <lgl>, …
chicago.flts.aug %>%
  slice(1, 30, 500, 721, 2103) 
# A tibble: 5 x 110
   year quarter month dayof_month day_of_week flight_date
  <dbl>   <dbl> <dbl>       <dbl>       <dbl> <date>     
1  2019       3     8           1           4 2019-08-01 
2  2019       3     8           1           4 2019-08-01 
3  2019       3     8          21           3 2019-08-21 
4  2019       3     8          23           5 2019-08-23 
5  2019       3     8          27           2 2019-08-27 
# … with 104 more variables: reporting_airline <chr>,
#   dot_id_reporting_airline <dbl>,
#   iata_code_reporting_airline <chr>, tail_number <chr>,
#   flight_number_reporting_airline <dbl>, origin_airport_id <dbl>,
#   origin_airport_seq_id <dbl>, origin_city_market_id <dbl>,
#   origin <chr>, origin_city_name <chr>, origin_state <chr>,
#   origin_state_fips <chr>, origin_state_name <chr>,
#   origin_wac <dbl>, dest_airport_id <dbl>,
#   dest_airport_seq_id <dbl>, dest_city_market_id <dbl>, dest <chr>,
#   dest_city_name <chr>, dest_state <chr>, dest_state_fips <chr>,
#   dest_state_name <chr>, dest_wac <dbl>, crs_dep_time <chr>,
#   dep_time <chr>, dep_delay <dbl>, dep_delay_minutes <dbl>,
#   dep_del15 <dbl>, departure_delay_groups <dbl>,
#   dep_time_blk <chr>, taxi_out <dbl>, wheels_off <chr>,
#   wheels_on <chr>, taxi_in <dbl>, crs_arr_time <chr>,
#   arr_time <chr>, arr_delay <dbl>, arr_delay_minutes <dbl>,
#   arr_del15 <dbl>, arrival_delay_groups <dbl>, arr_time_blk <chr>,
#   cancelled <dbl>, cancellation_code <chr>, diverted <dbl>,
#   crs_elapsed_time <dbl>, actual_elapsed_time <dbl>,
#   air_time <dbl>, flights <dbl>, distance <dbl>,
#   distance_group <dbl>, carrier_delay <dbl>, weather_delay <dbl>,
#   nas_delay <dbl>, security_delay <dbl>, late_aircraft_delay <dbl>,
#   first_dep_time <chr>, total_add_g_time <dbl>,
#   longest_add_g_time <dbl>, div_airport_landings <dbl>,
#   div_reached_dest <dbl>, div_actual_elapsed_time <dbl>,
#   div_arr_delay <dbl>, div_distance <dbl>, div1airport <chr>,
#   div1airport_id <dbl>, div1airport_seq_id <dbl>,
#   div1wheels_on <chr>, div1total_g_time <dbl>,
#   div1longest_g_time <dbl>, div1wheels_off <chr>,
#   div1tail_num <chr>, div2airport <lgl>, div2airport_id <lgl>,
#   div2airport_seq_id <lgl>, div2wheels_on <lgl>,
#   div2total_g_time <lgl>, div2longest_g_time <lgl>,
#   div2wheels_off <lgl>, div2tail_num <lgl>, div3airport <lgl>,
#   div3airport_id <lgl>, div3airport_seq_id <lgl>,
#   div3wheels_on <lgl>, div3total_g_time <lgl>,
#   div3longest_g_time <lgl>, div3wheels_off <lgl>,
#   div3tail_num <lgl>, div4airport <lgl>, div4airport_id <lgl>,
#   div4airport_seq_id <lgl>, div4wheels_on <lgl>,
#   div4total_g_time <lgl>, div4longest_g_time <lgl>,
#   div4wheels_off <lgl>, div4tail_num <lgl>, div5airport <lgl>,
#   div5airport_id <lgl>, div5airport_seq_id <lgl>,
#   div5wheels_on <lgl>, div5total_g_time <lgl>, …

mutate()

If you want to overwrite an existing column (a terrible idea) or create a new column based on some operation carried out on an existing column, mutate() allows you do so. For example, say I want to create new variables that are numeric versions of crs_dep_time and dep_time.

chicago.flts.aug %>%
  mutate(
    crs_departure_time = as.numeric(crs_dep_time),
    departure_time = as.numeric(dep_time)
    ) -> chicago.df

summary(chicago.df$crs_departure_time)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
     15     901    1312    1319    1736    2359 
summary(chicago.df$departure_time)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
      1     905    1319    1327    1751    2400    1733 

Virtually all mathematical operations are possible. For example, I will carry out an operation that is flawed here, essentially converting dep_delay_minutes into seconds by multiplying dep_delay_minutes by 60, and then down-converting back to minutes.

chicago.flts.aug %>%
  mutate(
    departure_delay_seconds = dep_delay_minutes * 60,
    departure_delay_minutes = departure_delay_seconds / 60
    ) -> chicago.df

summary(chicago.df$departure_delay_seconds)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
      0       0       0    1066     660  100020    1733 
summary(chicago.df$departure_delay_minutes)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   0.00    0.00    0.00   17.76   11.00 1667.00    1733 
summary(chicago.df$dep_delay_minutes)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   0.00    0.00    0.00   17.76   11.00 1667.00    1733 

Let me now create factors that attach labels to dep_delay15 and arr_delay15.

chicago.flts.aug %>%
  mutate(
    departure_delayed_15 = factor(
      dep_del15,
      levels = c(0, 1),
      labels = c("No", "Yes")
      ),
    arrival_delayed_15 = factor(
      arr_del15,
      levels = c(0, 1),
      labels = c("No", "Yes")
      )
    ) -> chicago.df

table(chicago.df$dep_del15)

    0     1 
57761 16830 
table(chicago.df$departure_delayed_15)

   No   Yes 
57761 16830 
table(chicago.df$arr_del15)

    0     1 
57141 17160 
table(chicago.df$arrival_delayed_15)

   No   Yes 
57141 17160 

transmute()

If you want to modify an existing column or create a new column based on an existing column but only retain the new column, transmute() will do that for you. Be careful!; all other columns get dropped so this may not be a command you want to use without a lot of thought.

chicago.flts.aug %>%
  transmute(
    departure_delay_seconds = dep_delay_minutes * 60,
    departure_delay_minutes = departure_delay_seconds / 60,
    departure_delayed_15 = factor(
      dep_del15,
      levels = c(0, 1),
      labels = c("No", "Yes")
      ),
    arrival_delayed_15 = factor(
      arr_del15,
      levels = c(0, 1),
      labels = c("No", "Yes")
      )
    ) %>%
  glimpse()
Rows: 76,324
Columns: 4
$ departure_delay_seconds <dbl> 3960, 0, 3240, 0, 0, 1260, 3360, 0,…
$ departure_delay_minutes <dbl> 66, 0, 54, 0, 0, 21, 56, 0, 0, 34, …
$ departure_delayed_15    <fct> Yes, No, Yes, No, No, Yes, Yes, No,…
$ arrival_delayed_15      <fct> Yes, Yes, Yes, No, No, No, Yes, No,…

summarize()

You often want to calculate some quantity of interest and retain these calculated quantities rather than the raw data. For example, say I want to know the average departure delay and arrival delay. I can set out to calculate the mean, median, and standard deviation of each, as follows:

chicago.flts.aug %>%
  summarise(
    mean_dep = mean(dep_delay_minutes, na.rm = TRUE),
    median_dep = median(dep_delay_minutes, na.rm = TRUE),
    mean_arr = mean(arr_delay_minutes, na.rm = TRUE),
    median_arr = median(arr_delay_minutes, na.rm = TRUE),
    sd_dep = sd(dep_delay_minutes, na.rm = TRUE),
    sd_arr = sd(arr_delay_minutes, na.rm = TRUE)
    ) 
# A tibble: 1 x 6
  mean_dep median_dep mean_arr median_arr sd_dep sd_arr
     <dbl>      <dbl>    <dbl>      <dbl>  <dbl>  <dbl>
1     17.8          0     18.3          0   53.7   54.1

The true power of mutate() and summarise() becomes visible when you combine these commands with grouped operations via group_by(), and that is the focus of our next working session.