The Bureau of Transportation Statistics gathers a lot of data but the one we will use to learn dplyr
is their data on airlines’ on-time performance. I have here a snippet of the available data – data for almost all domestic flights originating from O’Hare or Midway airports in Chicago in the month of August 2019.
library(here)
load(
here("workshops/athensr/handouts/data", "chicago.flts.aug.RData")
)
library(tidyverse)
readr::read_delim(
here("workshops/athensr/handouts/data", "airline-ontime-docs.txt"),
"\t", escape_double = FALSE, trim_ws = TRUE
) -> airline_ontime_docs
DT::datatable(airline_ontime_docs, caption = "Codebook for the data fields")
filter()
Say you want to retain very specific records for analysis. This can be done via filter()
by providing the column name(s) and the selection criteria that should be applied to the column’s values.
filter()
We start simple, with character values, for example, all flights that originate in Midway.
What if I want flights to Los Angeles, San Francisco, and Seattle, regardless of origin being O’Hare or Midway?
LAX SEA SFO
892 663 714
What if I want flights from O’Hare to these three airports?
chicago.flts.aug %>%
filter(
origin == "ORD",
dest %in% c("LAX", "SFO", "SEA")
) -> tab03
table(tab03$origin, tab03$dest)
LAX SEA SFO
ORD 711 595 660
Note: Here, ,
is the same as &
. If we wanted to specify or
, that would entail using |
as in the example below. That is, how about flights either from MDW or to any of the three airports flagged earlier?
chicago.flts.aug %>%
filter(
origin == "MDW" |
dest %in% c("LAX", "SFO", "SEA")
) -> tab04
table(tab04$origin, tab04$dest)
ABQ ALB ATL AUS BDL BHM BNA BOI BOS BUF BUR BWI CHS CLE CLT CMH
MDW 58 62 386 111 88 61 208 4 154 93 28 200 63 143 84 191
ORD 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
CVG DAL DCA DEN DTW ECP EWR FLL GRR HOU IND JAX LAS LAX LGA MCI
MDW 141 214 186 277 200 2 167 105 84 181 37 31 241 181 186 239
ORD 0 0 0 0 0 0 0 0 0 0 0 0 0 711 0 0
MCO MEM MHT MSP MSY OAK OKC OMA ONT ORF PDX PHL PHX PIT PNS PVD
MDW 210 57 84 418 114 108 32 131 31 41 27 172 178 148 7 84
ORD 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
PWM RDU RNO RSW SAN SAT SDF SEA SFO SJC SJU SLC SMF STL TPA TUS
MDW 5 97 32 71 117 67 110 68 54 51 9 62 35 225 115 31
ORD 0 0 0 0 0 0 0 595 660 0 0 0 0 0 0 0
What if I need to filter when the column has values not equal to some target value(s)? This can be done via !=
if you are dealing with a single target value.
ABE ABQ ALB ALO ANC ASE ATL ATW AUS AVL AVP AZO BDL BFM
73 136 268 61 99 72 1054 92 383 124 83 89 343 9
BGR BHM BIL BIS BKG BMI BNA BOI BOS BTV BUF BUR BWI BZN
57 214 19 3 8 88 669 97 869 180 361 28 487 72
CAK CGI CHO CHS CID CKB CLE CLT CMH CMI CMX COD COS COU
82 27 93 175 222 31 573 535 636 148 62 1 87 105
CRW CVG CWA DAL DAY DBQ DCA DEN DFW DLH DRO DSM DTW EAU
22 657 86 214 189 88 872 883 672 186 10 266 796 62
ECP ELP ERI EUG EVV EWR EYW FAI FAR FAT FCA FLL FNT FSD
7 88 31 31 95 685 6 31 134 31 66 226 155 103
FWA GEG GJT GRB GRR GSO GSP GTF HHH HNL HOU HPN HRL HSV
186 30 5 198 433 89 132 20 6 31 187 93 9 81
IAD IAH ICT ILM IND JAC JAX JFK JLN LAN LAS LAX LEX LGA
185 432 171 46 455 132 211 277 62 76 656 888 94 1500
LIT LNK LSE LWB MBS MCI MCO MDT MDW MEI MEM MHK MHT MIA
143 28 122 34 88 601 564 179 7397 31 253 92 146 321
MKE MKG MLI MQT MSN MSO MSP MSY MTJ MYR OAK OGG OGS OKC
145 62 122 45 273 36 1090 230 4 61 122 21 22 229
OMA ONT ORF PAH PBI PDX PHL PHX PIA PIT PNS PVD PWM RAP
434 31 222 57 31 247 624 592 98 518 49 224 147 113
RDM RDU RIC RNO ROA ROC RST RSW SAN SAT SAV SBN SCE SDF
30 406 228 125 51 234 123 202 404 242 108 77 64 287
SEA SFO SGF SHD SJC SJU SLC SLN SMF SNA SPI SRQ STL SUX
662 713 153 26 163 74 357 26 153 238 88 31 488 71
SYR TOL TPA TTN TUL TUS TVC TYS UIN VPS XNA
270 87 336 31 159 93 246 188 53 5 228
If you have multiple target values, then you should use the following:
ABE ABQ ALB ALO ANC ASE ATL ATW AUS AVL AVP
73 135 269 60 99 72 1064 92 369 124 83
AZO BDL BFM BGR BHM BIL BIS BKG BMI BNA BOI
90 342 9 57 214 18 3 8 88 692 99
BOS BTV BUF BUR BWI BZN CAK CGI CHO CHS CID
872 179 329 28 471 72 82 26 93 176 222
CKB CLE CLT CMH CMI CMX COD COS COU CRW CVG
31 572 531 636 149 62 2 88 104 22 657
CWA DAL DAY DBQ DCA DEN DFW DLH DRO DSM DTW
86 214 189 88 869 883 668 186 10 256 796
EAU ECP ELP ERI EUG EVV EWR EYW FAI FAR FAT
62 7 87 32 31 94 679 6 31 110 31
FCA FLL FNT FSD FWA GEG GJT GRB GRR GSO GSP
66 233 154 103 186 31 5 197 465 89 133
GTF HHH HNL HOU HPN HRL HSV IAD IAH ICT ILM
19 6 31 181 93 9 78 185 437 170 46
IND JAC JAX JFK JLN LAN LAS LEX LGA LIT LNK
470 132 211 277 62 76 658 94 1501 167 58
LSE LWB MBS MCI MCO MDT MDW MEI MEM MHK MHT
122 34 88 605 560 179 7393 31 251 92 146
MIA MKE MKG MLI MQT MSN MSO MSP MSY MTJ MYR
323 145 62 122 45 273 36 1089 258 4 61
OAK OGG OGS OKC OMA ONT ORD ORF PAH PBI PDX
139 21 22 231 415 31 30770 222 58 31 248
PHL PHX PIA PIT PNS PVD PWM RAP RDM RDU RIC
630 578 98 520 49 208 147 126 31 406 227
RNO ROA ROC RST RSW SAN SAT SAV SBN SCE SDF
123 52 236 123 201 385 243 118 77 64 256
SGF SHD SJC SJU SLC SLN SMF SNA SPI SRQ STL
154 26 159 74 349 26 153 236 88 31 477
SUX SYR TOL TPA TTN TUL TUS TVC TYS UIN VPS
71 270 87 336 31 160 93 256 196 53 5
XNA
217
Note the !
goes before the column name and not before %in%
; this is an oft-forgotten switch.
filter()
If we are dealing with numeric values we could ask for values that fall within/outside a range, or then ==
>=
<=
>
, or <
some value. We could also employ not equal to, as in !=
.
I am going to use dep_delay_minutes
and select all instances of positive delays.
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 7.00 20.00 46.09 55.00 1667.00
What about only delays of 1 or 5 minutes?
chicago.flts.aug %>%
filter(
dep_delay_minutes %in% c(1, 5)
) -> tab08
table(tab08$dep_delay_minutes)
1 5
1696 966
Delays anywhere between 1 through 5 minutes?
chicago.flts.aug %>%
filter(
dep_delay_minutes %in% seq(1:5)
) -> tab09
table(tab09$dep_delay_minutes)
1 2 3 4 5
1696 1320 1177 1029 966
Delays less than or equal to 10 minutes or longer than 30 minutes?
chicago.flts.aug %>%
filter(
dep_delay_minutes <= 10 |
dep_delay_minutes > 30
) -> tab10
summary(tab10$dep_delay_minutes)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 0.00 0.00 17.61 4.00 1667.00
select()
If the data-set has a lot of columns, and you do not need all of them for your analyses, select()
comes in handy to retain specific columns. You can select in multiple ways – by column name(s), column number(s), and then by string or other attributes.
[1] "flight_date"
[2] "reporting_airline"
[3] "flight_number_reporting_airline"
[4] "origin"
[5] "origin_state_name"
[6] "dest_airport_seq_id"
[7] "dest_city_market_id"
[8] "dest"
[9] "dest_city_name"
[10] "dest_state"
[11] "dest_state_fips"
[12] "dest_state_name"
[13] "dest_wac"
[14] "crs_dep_time"
chicago.flts.aug %>%
select(c(year, month, dayof_month, reporting_airline, origin, dest)) %>%
names()
[1] "year" "month" "dayof_month"
[4] "reporting_airline" "origin" "dest"
chicago.flts.aug %>%
select(contains("city")) %>%
names()
[1] "origin_city_market_id" "origin_city_name"
[3] "dest_city_market_id" "dest_city_name"
chicago.flts.aug %>%
select(starts_with("origin")) %>%
names()
[1] "origin_airport_id" "origin_airport_seq_id"
[3] "origin_city_market_id" "origin"
[5] "origin_city_name" "origin_state"
[7] "origin_state_fips" "origin_state_name"
[9] "origin_wac"
chicago.flts.aug %>%
select(ends_with("airline")) %>%
names()
[1] "reporting_airline"
[2] "dot_id_reporting_airline"
[3] "iata_code_reporting_airline"
[4] "flight_number_reporting_airline"
chicago.flts.aug %>%
select(matches("air")) %>%
names()
[1] "reporting_airline"
[2] "dot_id_reporting_airline"
[3] "iata_code_reporting_airline"
[4] "flight_number_reporting_airline"
[5] "origin_airport_id"
[6] "origin_airport_seq_id"
[7] "dest_airport_id"
[8] "dest_airport_seq_id"
[9] "air_time"
[10] "late_aircraft_delay"
[11] "div_airport_landings"
[12] "div1airport"
[13] "div1airport_id"
[14] "div1airport_seq_id"
[15] "div2airport"
[16] "div2airport_id"
[17] "div2airport_seq_id"
[18] "div3airport"
[19] "div3airport_id"
[20] "div3airport_seq_id"
[21] "div4airport"
[22] "div4airport_id"
[23] "div4airport_seq_id"
[24] "div5airport"
[25] "div5airport_id"
[26] "div5airport_seq_id"
There are other options as well that you should explore online.
slice()
If the goal is to retain specific rows instead of columns, then slice()
is your friend.
chicago.flts.aug %>%
slice(1:10)
# A tibble: 10 x 110
year quarter month dayof_month day_of_week flight_date
<dbl> <dbl> <dbl> <dbl> <dbl> <date>
1 2019 3 8 1 4 2019-08-01
2 2019 3 8 1 4 2019-08-01
3 2019 3 8 1 4 2019-08-01
4 2019 3 8 1 4 2019-08-01
5 2019 3 8 1 4 2019-08-01
6 2019 3 8 1 4 2019-08-01
7 2019 3 8 1 4 2019-08-01
8 2019 3 8 1 4 2019-08-01
9 2019 3 8 1 4 2019-08-01
10 2019 3 8 1 4 2019-08-01
# … with 104 more variables: reporting_airline <chr>,
# dot_id_reporting_airline <dbl>,
# iata_code_reporting_airline <chr>, tail_number <chr>,
# flight_number_reporting_airline <dbl>, origin_airport_id <dbl>,
# origin_airport_seq_id <dbl>, origin_city_market_id <dbl>,
# origin <chr>, origin_city_name <chr>, origin_state <chr>,
# origin_state_fips <chr>, origin_state_name <chr>,
# origin_wac <dbl>, dest_airport_id <dbl>,
# dest_airport_seq_id <dbl>, dest_city_market_id <dbl>, dest <chr>,
# dest_city_name <chr>, dest_state <chr>, dest_state_fips <chr>,
# dest_state_name <chr>, dest_wac <dbl>, crs_dep_time <chr>,
# dep_time <chr>, dep_delay <dbl>, dep_delay_minutes <dbl>,
# dep_del15 <dbl>, departure_delay_groups <dbl>,
# dep_time_blk <chr>, taxi_out <dbl>, wheels_off <chr>,
# wheels_on <chr>, taxi_in <dbl>, crs_arr_time <chr>,
# arr_time <chr>, arr_delay <dbl>, arr_delay_minutes <dbl>,
# arr_del15 <dbl>, arrival_delay_groups <dbl>, arr_time_blk <chr>,
# cancelled <dbl>, cancellation_code <chr>, diverted <dbl>,
# crs_elapsed_time <dbl>, actual_elapsed_time <dbl>,
# air_time <dbl>, flights <dbl>, distance <dbl>,
# distance_group <dbl>, carrier_delay <dbl>, weather_delay <dbl>,
# nas_delay <dbl>, security_delay <dbl>, late_aircraft_delay <dbl>,
# first_dep_time <chr>, total_add_g_time <dbl>,
# longest_add_g_time <dbl>, div_airport_landings <dbl>,
# div_reached_dest <dbl>, div_actual_elapsed_time <dbl>,
# div_arr_delay <dbl>, div_distance <dbl>, div1airport <chr>,
# div1airport_id <dbl>, div1airport_seq_id <dbl>,
# div1wheels_on <chr>, div1total_g_time <dbl>,
# div1longest_g_time <dbl>, div1wheels_off <chr>,
# div1tail_num <chr>, div2airport <lgl>, div2airport_id <lgl>,
# div2airport_seq_id <lgl>, div2wheels_on <lgl>,
# div2total_g_time <lgl>, div2longest_g_time <lgl>,
# div2wheels_off <lgl>, div2tail_num <lgl>, div3airport <lgl>,
# div3airport_id <lgl>, div3airport_seq_id <lgl>,
# div3wheels_on <lgl>, div3total_g_time <lgl>,
# div3longest_g_time <lgl>, div3wheels_off <lgl>,
# div3tail_num <lgl>, div4airport <lgl>, div4airport_id <lgl>,
# div4airport_seq_id <lgl>, div4wheels_on <lgl>,
# div4total_g_time <lgl>, div4longest_g_time <lgl>,
# div4wheels_off <lgl>, div4tail_num <lgl>, div5airport <lgl>,
# div5airport_id <lgl>, div5airport_seq_id <lgl>,
# div5wheels_on <lgl>, div5total_g_time <lgl>, …
chicago.flts.aug %>%
slice(1, 30, 500, 721, 2103)
# A tibble: 5 x 110
year quarter month dayof_month day_of_week flight_date
<dbl> <dbl> <dbl> <dbl> <dbl> <date>
1 2019 3 8 1 4 2019-08-01
2 2019 3 8 1 4 2019-08-01
3 2019 3 8 21 3 2019-08-21
4 2019 3 8 23 5 2019-08-23
5 2019 3 8 27 2 2019-08-27
# … with 104 more variables: reporting_airline <chr>,
# dot_id_reporting_airline <dbl>,
# iata_code_reporting_airline <chr>, tail_number <chr>,
# flight_number_reporting_airline <dbl>, origin_airport_id <dbl>,
# origin_airport_seq_id <dbl>, origin_city_market_id <dbl>,
# origin <chr>, origin_city_name <chr>, origin_state <chr>,
# origin_state_fips <chr>, origin_state_name <chr>,
# origin_wac <dbl>, dest_airport_id <dbl>,
# dest_airport_seq_id <dbl>, dest_city_market_id <dbl>, dest <chr>,
# dest_city_name <chr>, dest_state <chr>, dest_state_fips <chr>,
# dest_state_name <chr>, dest_wac <dbl>, crs_dep_time <chr>,
# dep_time <chr>, dep_delay <dbl>, dep_delay_minutes <dbl>,
# dep_del15 <dbl>, departure_delay_groups <dbl>,
# dep_time_blk <chr>, taxi_out <dbl>, wheels_off <chr>,
# wheels_on <chr>, taxi_in <dbl>, crs_arr_time <chr>,
# arr_time <chr>, arr_delay <dbl>, arr_delay_minutes <dbl>,
# arr_del15 <dbl>, arrival_delay_groups <dbl>, arr_time_blk <chr>,
# cancelled <dbl>, cancellation_code <chr>, diverted <dbl>,
# crs_elapsed_time <dbl>, actual_elapsed_time <dbl>,
# air_time <dbl>, flights <dbl>, distance <dbl>,
# distance_group <dbl>, carrier_delay <dbl>, weather_delay <dbl>,
# nas_delay <dbl>, security_delay <dbl>, late_aircraft_delay <dbl>,
# first_dep_time <chr>, total_add_g_time <dbl>,
# longest_add_g_time <dbl>, div_airport_landings <dbl>,
# div_reached_dest <dbl>, div_actual_elapsed_time <dbl>,
# div_arr_delay <dbl>, div_distance <dbl>, div1airport <chr>,
# div1airport_id <dbl>, div1airport_seq_id <dbl>,
# div1wheels_on <chr>, div1total_g_time <dbl>,
# div1longest_g_time <dbl>, div1wheels_off <chr>,
# div1tail_num <chr>, div2airport <lgl>, div2airport_id <lgl>,
# div2airport_seq_id <lgl>, div2wheels_on <lgl>,
# div2total_g_time <lgl>, div2longest_g_time <lgl>,
# div2wheels_off <lgl>, div2tail_num <lgl>, div3airport <lgl>,
# div3airport_id <lgl>, div3airport_seq_id <lgl>,
# div3wheels_on <lgl>, div3total_g_time <lgl>,
# div3longest_g_time <lgl>, div3wheels_off <lgl>,
# div3tail_num <lgl>, div4airport <lgl>, div4airport_id <lgl>,
# div4airport_seq_id <lgl>, div4wheels_on <lgl>,
# div4total_g_time <lgl>, div4longest_g_time <lgl>,
# div4wheels_off <lgl>, div4tail_num <lgl>, div5airport <lgl>,
# div5airport_id <lgl>, div5airport_seq_id <lgl>,
# div5wheels_on <lgl>, div5total_g_time <lgl>, …
mutate()
If you want to overwrite an existing column (a terrible idea) or create a new column based on some operation carried out on an existing column, mutate()
allows you do so. For example, say I want to create new variables that are numeric versions of crs_dep_time
and dep_time
.
chicago.flts.aug %>%
mutate(
crs_departure_time = as.numeric(crs_dep_time),
departure_time = as.numeric(dep_time)
) -> chicago.df
summary(chicago.df$crs_departure_time)
Min. 1st Qu. Median Mean 3rd Qu. Max.
15 901 1312 1319 1736 2359
summary(chicago.df$departure_time)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
1 905 1319 1327 1751 2400 1733
Virtually all mathematical operations are possible. For example, I will carry out an operation that is flawed here, essentially converting dep_delay_minutes
into seconds by multiplying dep_delay_minutes
by 60, and then down-converting back to minutes.
chicago.flts.aug %>%
mutate(
departure_delay_seconds = dep_delay_minutes * 60,
departure_delay_minutes = departure_delay_seconds / 60
) -> chicago.df
summary(chicago.df$departure_delay_seconds)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0 0 0 1066 660 100020 1733
summary(chicago.df$departure_delay_minutes)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.00 0.00 0.00 17.76 11.00 1667.00 1733
summary(chicago.df$dep_delay_minutes)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.00 0.00 0.00 17.76 11.00 1667.00 1733
Let me now create factors that attach labels to dep_delay15
and arr_delay15
.
chicago.flts.aug %>%
mutate(
departure_delayed_15 = factor(
dep_del15,
levels = c(0, 1),
labels = c("No", "Yes")
),
arrival_delayed_15 = factor(
arr_del15,
levels = c(0, 1),
labels = c("No", "Yes")
)
) -> chicago.df
table(chicago.df$dep_del15)
0 1
57761 16830
table(chicago.df$departure_delayed_15)
No Yes
57761 16830
table(chicago.df$arr_del15)
0 1
57141 17160
table(chicago.df$arrival_delayed_15)
No Yes
57141 17160
transmute()
If you want to modify an existing column or create a new column based on an existing column but only retain the new column, transmute()
will do that for you. Be careful!; all other columns get dropped so this may not be a command you want to use without a lot of thought.
chicago.flts.aug %>%
transmute(
departure_delay_seconds = dep_delay_minutes * 60,
departure_delay_minutes = departure_delay_seconds / 60,
departure_delayed_15 = factor(
dep_del15,
levels = c(0, 1),
labels = c("No", "Yes")
),
arrival_delayed_15 = factor(
arr_del15,
levels = c(0, 1),
labels = c("No", "Yes")
)
) %>%
glimpse()
Rows: 76,324
Columns: 4
$ departure_delay_seconds <dbl> 3960, 0, 3240, 0, 0, 1260, 3360, 0,…
$ departure_delay_minutes <dbl> 66, 0, 54, 0, 0, 21, 56, 0, 0, 34, …
$ departure_delayed_15 <fct> Yes, No, Yes, No, No, Yes, Yes, No,…
$ arrival_delayed_15 <fct> Yes, Yes, Yes, No, No, No, Yes, No,…
summarize()
You often want to calculate some quantity of interest and retain these calculated quantities rather than the raw data. For example, say I want to know the average departure delay and arrival delay. I can set out to calculate the mean, median, and standard deviation of each, as follows:
chicago.flts.aug %>%
summarise(
mean_dep = mean(dep_delay_minutes, na.rm = TRUE),
median_dep = median(dep_delay_minutes, na.rm = TRUE),
mean_arr = mean(arr_delay_minutes, na.rm = TRUE),
median_arr = median(arr_delay_minutes, na.rm = TRUE),
sd_dep = sd(dep_delay_minutes, na.rm = TRUE),
sd_arr = sd(arr_delay_minutes, na.rm = TRUE)
)
# A tibble: 1 x 6
mean_dep median_dep mean_arr median_arr sd_dep sd_arr
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 17.8 0 18.3 0 53.7 54.1
The true power of mutate()
and summarise()
becomes visible when you combine these commands with grouped operations via group_by()
, and that is the focus of our next working session.