Data Manipulation with dplyr (1/2)

The Bureau of Transportation Statistics gathers a lot of data, but the set we will use to learn dplyr is its data on airlines’ on-time performance. I have here a snippet of the available data: almost all domestic flights originating from O’Hare or Midway airports in Chicago in August 2019.

library(here)
load(
  here("workshops/athensr/handouts/data", "chicago.flts.aug.RData")
  )

library(tidyverse)

readr::read_delim(
  here("workshops/athensr/handouts/data", "airline-ontime-docs.txt"), 
  "\t", escape_double = FALSE, trim_ws = TRUE
  ) -> airline_ontime_docs 

DT::datatable(airline_ontime_docs, caption = "Codebook for the data fields")

Codebook for the data fields
	Year	Year_1
1	Quarter	Quarter (1-4)
2	Month	Month
3	DayofMonth	Day of Month
4	DayOfWeek	Day of Week
5	FlightDate	Flight Date (yyyymmdd)
6	Reporting_Airline	Unique Carrier Code. When the same code has been used by multiple carriers, a numeric suffix is used for earlier users, for example, PA, PA(1), PA(2). Use this field for analysis across a range of years.
7	DOT_ID_Reporting_Airline	An identification number assigned by US DOT to identify a unique airline (carrier). A unique airline (carrier) is defined as one holding and reporting under the same DOT certificate regardless of its Code, Name, or holding company/corporation.
8	IATA_CODE_Reporting_Airline	Code assigned by IATA and commonly used to identify a carrier. As the same code may have been assigned to different carriers over time, the code is not always unique. For analysis, use the Unique Carrier Code.
9	Tail_Number	Tail Number
10	Flight_Number_Reporting_Airline	Flight Number

Showing 1 to 10 of 108 entries


filter()

Say you want to retain very specific records for analysis. This can be done via filter() by providing the column name(s) and the selection criteria that should be applied to the column’s values.

Character values in filter()

We start simple, with character values, for example, all flights that originate in Midway.

chicago.flts.aug %>%
  filter(
    origin == "MDW" 
    ) -> tab01

table(tab01$origin)

What if I want flights to Los Angeles, San Francisco, and Seattle, regardless of whether they originate at O’Hare or Midway? The second block below narrows the same three destinations to flights leaving O’Hare.

chicago.flts.aug %>%
  filter(
    dest %in% c("LAX", "SFO", "SEA")
    ) -> tab02 

table(tab02$dest)

chicago.flts.aug %>%
  filter(
    origin == "ORD", 
    dest %in% c("LAX", "SFO", "SEA")
    ) -> tab03 

table(tab03$origin, tab03$dest)

Note: Here, a comma between conditions is the same as &. To specify or, use | as in the example below, which asks for flights either from MDW or to any of the three airports flagged earlier.

chicago.flts.aug %>%
  filter(
    origin == "MDW" | 
    dest %in% c("LAX", "SFO", "SEA")
    ) -> tab04 

table(tab04$origin, tab04$dest)
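
To make the difference between the comma, & and | concrete, here is a minimal sketch on a made-up four-row table. The name flights_toy and its values are hypothetical stand-ins for chicago.flts.aug, used only so the result can be checked by eye.

```r
library(dplyr)

# Hypothetical toy data standing in for chicago.flts.aug
flights_toy <- tibble(
  origin = c("ORD", "ORD", "MDW", "MDW"),
  dest   = c("LAX", "ATL", "SEA", "DEN")
)

# Comma-separated conditions are combined with AND
flights_toy %>%
  filter(origin == "ORD", dest %in% c("LAX", "SFO", "SEA")) -> with_comma

# The same filter written with an explicit &
flights_toy %>%
  filter(origin == "ORD" & dest %in% c("LAX", "SFO", "SEA")) -> with_and

# OR keeps a row when at least one condition holds
flights_toy %>%
  filter(origin == "MDW" | dest %in% c("LAX", "SFO", "SEA")) -> with_or

identical(with_comma, with_and)  # TRUE
nrow(with_or)                    # 3
```

The or version keeps the ORD flight to LAX plus both MDW flights, which is why it returns more rows than the and version.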

What if I need to filter when the column has values not equal to some target value(s)? This can be done via != if you are dealing with a single target value. For several target values, negate %in% with a leading !, as in the second block below.

chicago.flts.aug %>%
  filter(
    origin != "ORD" 
    ) -> tab05 

table(tab05$origin)

chicago.flts.aug %>%
  filter(
    !dest %in% c("LAX", "SFO", "SEA")
    ) -> tab06 

table(tab06$dest)

Note the ! goes before the column name and not before %in%; this is an oft-forgotten switch.
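
The placement of ! works because %in% is evaluated before ! in R, so the negation applies to the whole membership test. A small sketch on hypothetical data (flights_toy and west_coast are illustrative names, not part of the workshop data):

```r
library(dplyr)

# Hypothetical toy data
flights_toy <- tibble(dest = c("LAX", "ATL", "SEA", "DEN"))
west_coast <- c("LAX", "SFO", "SEA")

# ! negates the result of dest %in% west_coast
flights_toy %>%
  filter(!dest %in% west_coast) -> not_west

# Equivalent, with the grouping spelled out in parentheses
flights_toy %>%
  filter(!(dest %in% west_coast)) -> not_west_explicit

not_west$dest  # "ATL" "DEN"
```

If the bare form is hard to read, the parenthesized version returns the same rows.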

Numeric values in filter()

If we are dealing with numeric values, we can ask for values that fall within or outside a range, or that are ==, >=, <=, >, or < some value. We can also use not equal to, as in !=.

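I am going to use dep_delay_minutes and first select all positive delays, then delays of exactly 1 or 5 minutes, then delays of 1 through 5 minutes, and finally delays of at most 10 minutes or more than 30 minutes.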

chicago.flts.aug %>%
  filter(
    dep_delay_minutes > 0
    ) -> tab07 

summary(tab07$dep_delay_minutes)

chicago.flts.aug %>%
  filter(
    dep_delay_minutes %in% c(1, 5)
    ) -> tab08 

table(tab08$dep_delay_minutes)

chicago.flts.aug %>%
  filter(
    dep_delay_minutes %in% 1:5
    ) -> tab09 

table(tab09$dep_delay_minutes)

chicago.flts.aug %>%
  filter(
    dep_delay_minutes <= 10 |
    dep_delay_minutes > 30  
    ) -> tab10 

summary(tab10$dep_delay_minutes)
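
For the "within a range" case, one option is dplyr's between(), which is inclusive at both ends. The sketch below uses a hypothetical delays_toy table. It also shows that filter() drops rows where the condition evaluates to NA.

```r
library(dplyr)

# Hypothetical toy data, including one missing delay
delays_toy <- tibble(dep_delay_minutes = c(0, 3, 10, 12, 30, 45, NA))

# Inside the range 10 to 30, both endpoints included
delays_toy %>%
  filter(between(dep_delay_minutes, 10, 30)) -> inside

# Outside the same range: negate the test
delays_toy %>%
  filter(!between(dep_delay_minutes, 10, 30)) -> outside

inside$dep_delay_minutes   # 10 12 30
outside$dep_delay_minutes  # 0 3 45
```

The row with the missing delay appears in neither result, so the two filtered tables together have one row fewer than the original.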

select()

If the dataset has many columns and you do not need all of them for your analysis, select() lets you retain specific ones. You can select columns in several ways: by column number(s), by column name(s), by strings that appear in the names (contains(), starts_with(), ends_with(), and matches(), which takes a regular expression), or by other attributes.

chicago.flts.aug %>%
  select(c(6:7, 11, 15, 19, 22:30)) %>%
  names()

chicago.flts.aug %>%
  select(c(year, month, dayof_month, reporting_airline, origin, dest)) %>%
  names()

chicago.flts.aug %>%
  select(contains("city")) %>%
  names()

chicago.flts.aug %>%
  select(starts_with("origin")) %>%
  names()

chicago.flts.aug %>%
  select(ends_with("airline")) %>%
  names()

chicago.flts.aug %>%
  select(matches("air")) %>%
  names()
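
As one reading of selecting by "other attributes", columns can also be picked by type with where(). This is a sketch on a hypothetical one-row table. The column names mimic the flight data but the values are invented, and where() assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Hypothetical one-row table with mixed column types
flights_toy <- tibble(
  year = 2019L,
  origin = "ORD",
  origin_city_name = "Chicago, IL",
  dest = "LAX",
  dep_delay_minutes = 12
)

# By position
flights_toy %>% select(1:2) %>% names() -> by_position

# By a string contained in the name
flights_toy %>% select(contains("city")) %>% names() -> by_string

# By an attribute of the column itself: keep numeric columns only
flights_toy %>% select(where(is.numeric)) %>% names() -> by_type

by_type  # "year" "dep_delay_minutes"
```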

slice()

If the goal is to retain specific rows instead of columns, then slice() is your friend.

chicago.flts.aug %>%
  slice(1:10)

chicago.flts.aug %>%
  slice(1, 30, 500, 721, 2103)
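
Because slice() works purely by row position, it is easy to check on a small table. The sketch below uses a hypothetical six-row toy table and also shows that negative positions drop rows instead of keeping them.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  id = 1:6,
  dest = c("LAX", "ATL", "SEA", "DEN", "SFO", "BOS")
)

# A run of consecutive rows
toy %>% slice(1:3) -> first_three

# Specific rows by position
toy %>% slice(2, 5) -> picked

# Negative positions drop rows: everything except row 1
toy %>% slice(-1) -> without_first

picked$dest  # "ATL" "SFO"
```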

mutate()

If you want to overwrite an existing column (a terrible idea) or create a new column based on some operation carried out on an existing column, mutate() allows you to do so. For example, say I want to create new variables that are numeric versions of crs_dep_time and dep_time.

chicago.flts.aug %>%
  mutate(
    crs_departure_time = as.numeric(crs_dep_time),
    departure_time = as.numeric(dep_time)
    ) -> chicago.df

summary(chicago.df$crs_departure_time)

summary(chicago.df$departure_time)

Virtually all mathematical operations are possible. For example, I will carry out a deliberately pointless operation here: converting dep_delay_minutes into seconds by multiplying it by 60, and then converting the result back to minutes. Note that the second new column is built from the first one within the same mutate() call.

chicago.flts.aug %>%
  mutate(
    departure_delay_seconds = dep_delay_minutes * 60,
    departure_delay_minutes = departure_delay_seconds / 60
    ) -> chicago.df

summary(chicago.df$departure_delay_seconds)

summary(chicago.df$departure_delay_minutes)

summary(chicago.df$dep_delay_minutes)
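
The feature the example above relies on is that mutate() creates columns in order, so a later expression can use a column defined earlier in the same call. A minimal sketch on hypothetical data (delays_toy and delay_hours are illustrative names):

```r
library(dplyr)

# Hypothetical toy data
delays_toy <- tibble(dep_delay_minutes = c(0, 5, 90))

delays_toy %>%
  mutate(
    delay_seconds = dep_delay_minutes * 60,  # created first
    delay_hours = delay_seconds / 3600       # uses the column just created
    ) -> delays_converted

names(delays_converted)
delays_converted$delay_hours
```

The original dep_delay_minutes column is untouched, and both new columns are appended after it.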

Let me now create factors that attach labels to dep_del15 and arr_del15.

chicago.flts.aug %>%
  mutate(
    departure_delayed_15 = factor(
      dep_del15,
      levels = c(0, 1),
      labels = c("No", "Yes")
      ),
    arrival_delayed_15 = factor(
      arr_del15,
      levels = c(0, 1),
      labels = c("No", "Yes")
      )
    ) -> chicago.df

table(chicago.df$dep_del15)

table(chicago.df$departure_delayed_15)

table(chicago.df$arr_del15)

table(chicago.df$arrival_delayed_15)
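
To see how levels and labels line up, here is a small sketch on a hypothetical 0/1 indicator that also contains a missing value. The toy values are invented, and the useNA argument to table() is added only to make the missing value visible.

```r
library(dplyr)

# Hypothetical 0/1 indicator with one missing value
toy <- tibble(dep_del15 = c(0, 1, 1, NA, 0))

toy %>%
  mutate(
    departure_delayed_15 = factor(
      dep_del15,
      levels = c(0, 1),        # values found in the data
      labels = c("No", "Yes")  # labels attached in the same order
      )
    ) -> toy_labeled

# Missing values stay missing; they are not assigned a label
table(toy_labeled$departure_delayed_15, useNA = "ifany")
```

Comparing this table with table(toy$dep_del15, useNA = "ifany") is a quick check that no label was attached to the wrong value.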

transmute()

If you want to modify an existing column or create a new column based on an existing column but only retain the new column, transmute() will do that for you. Be careful: all other columns get dropped, so this may not be a command you want to use without a lot of thought.

chicago.flts.aug %>%
  transmute(
    departure_delay_seconds = dep_delay_minutes * 60,
    departure_delay_minutes = departure_delay_seconds / 60,
    departure_delayed_15 = factor(
      dep_del15,
      levels = c(0, 1),
      labels = c("No", "Yes")
      ),
    arrival_delayed_15 = factor(
      arr_del15,
      levels = c(0, 1),
      labels = c("No", "Yes")
      )
    ) %>%
  glimpse()
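
The difference from mutate() is easiest to see by comparing column names after each call. This sketch uses a hypothetical two-row table:

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  origin = c("ORD", "MDW"),
  dep_delay_minutes = c(5, 20)
)

# mutate() keeps the existing columns and adds the new one
toy %>%
  mutate(delay_seconds = dep_delay_minutes * 60) %>%
  names() -> kept

# transmute() keeps only the columns named in the call
toy %>%
  transmute(delay_seconds = dep_delay_minutes * 60) %>%
  names() -> only_new

kept      # "origin" "dep_delay_minutes" "delay_seconds"
only_new  # "delay_seconds"
```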

summarize()

You often want to calculate some quantity of interest and retain the calculated quantities rather than the raw data; summarize() and summarise() are synonyms for the command that does this. For example, say I want to know the typical departure delay and arrival delay. I can calculate the mean, median, and standard deviation of each, as follows:

chicago.flts.aug %>%
  summarise(
    mean_dep = mean(dep_delay_minutes, na.rm = TRUE),
    median_dep = median(dep_delay_minutes, na.rm = TRUE),
    mean_arr = mean(arr_delay_minutes, na.rm = TRUE),
    median_arr = median(arr_delay_minutes, na.rm = TRUE),
    sd_dep = sd(dep_delay_minutes, na.rm = TRUE),
    sd_arr = sd(arr_delay_minutes, na.rm = TRUE)
    )
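
The na.rm = TRUE arguments matter because a single missing value otherwise makes the whole summary NA. A minimal sketch on hypothetical data (the toy delays and the n_missing column are illustrative additions):

```r
library(dplyr)

# Hypothetical toy data with one missing delay
toy <- tibble(dep_delay_minutes = c(0, 10, NA, 50))

toy %>%
  summarise(
    mean_with_na = mean(dep_delay_minutes),            # NA: missing value not removed
    mean_dep = mean(dep_delay_minutes, na.rm = TRUE),  # mean of 0, 10, 50
    median_dep = median(dep_delay_minutes, na.rm = TRUE),
    n_missing = sum(is.na(dep_delay_minutes))          # how many values were dropped
    ) -> delay_summary

delay_summary
```

Reporting the count of missing values next to the summaries makes it clear how many rows the statistics are actually based on.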

The true power of mutate() and summarise() becomes visible when you combine these commands with grouped operations via group_by(), and that is the focus of our next working session.
