In this module you will learn how three very powerful packages work their magic. {dplyr} and {tidyr} have quickly gained a following because they allow you to manage seemingly unwieldy data to calculate quantities of interest with relative ease, and to flip data from “long” to “wide” formats or vice-versa. {stringr} is a sister-package that works well with strings and often useful for key data cleaning.
Our goal is to understand how two very powerful packages work their
magic. Of recent vintage, {dplyr}
and {tidyr}
have quickly gained a following because they allow you to manage
seemingly unwieldy data to calculate quantities of interest with
relative ease. In this module we will see some of their core
abilities.
You will usually end up needing and using some or all of these six functions:
select()
filter()
arrange()
summarise()
mutate()
group_by()
In order to understand how these commands work we rely on a simple
data frame – all flights originating and departing from Columbus (Ohio)
for January through December of 2017. Let us load the data and
{dplyr}
.
library(here)
library(tidyverse)
library(tidylog, warn.conflicts = FALSE)
load(
here("data", "cmhflights2017.RData")
)
You will want to install {tidylog}
, a great package by
Benjamin Elbers, since tidylog
shows you what each step of a basic {dplyr}
operation has
done.
select(...)
Often your data frame will have more columns than you need to
calculate some quantity – for example, the mean of some variable or the
frequency distribution of some categorical variable. Say I only want the
second through sixth column – Year
, Quarter
,
Month
, DayofMonth
, and DayofWeek
.
Before I do that, though, let me render all column names to be
lowercase; running clean_names()
from the
{janitor}
package after you read in any data-set is a good
habit to acquire early on in your career.
library(janitor)
clean_names(cmhflights2017) -> cmh.df
Having done that, I could generate a data frame containing only the desired columns as follows:
%>%
cmh.df select(
:day_of_week
year-> my.df
)
names(my.df)
[1] "year" "quarter" "month" "dayof_month"
[5] "day_of_week"
What if the columns were not sequentially located? In that case we
would need to list each column we want, using the :
operator as shorthand for sequential columns.
%>%
cmh.df select(
year,:reporting_airline,
flight_date
tail_number-> my.df
)
names(my.df)
[1] "year" "flight_date" "reporting_airline"
[4] "tail_number"
Could we use column numbers instead? Absolutely, provided you know
the column numbers. Yes, provided you know the column numbers of the
columns you want (and these can be obtained by running
names()
on the original data-set).
%>%
cmh.df select(
c(1, 3, 5, 7)
-> my.df
)
names(my.df)
[1] "id" "quarter" "dayof_month" "flight_date"
You can also select columns in other ways, with
contains
, starts_with
, ends_with
,
any_of
, all_of
, matches
, and
num_range
. As these function names suggest,
contains()
will look for a specific string in the column
names, starts_with()
and ends_with()
look for
specific starting and ending strings, respectively,
matches()
will look for a matching string, and
any_of()
will look for any of the strings listed in quotes
inside the parentheses while all_of()
will look for all
listed strings and throw an error if even one is not found.
%>%
cmh.df select(
contains("carrier")
-> my.df # using contains()
)
names(my.df)
[1] "carrier_delay"
%>%
cmh.df select(
starts_with("de")
-> my.df # using starts_with()
)
names(my.df)
[1] "dest_airport_id" "dest_airport_seq_id"
[3] "dest_city_market_id" "dest"
[5] "dest_city_name" "dest_state"
[7] "dest_state_fips" "dest_state_name"
[9] "dest_wac" "dep_time"
[11] "dep_delay" "dep_delay_minutes"
[13] "dep_del15" "departure_delay_groups"
[15] "dep_time_blk"
%>%
cmh.df select(
ends_with("num")
-> my.df # using ends_with()
)
names(my.df)
[1] "div1tail_num" "div2tail_num" "div3tail_num" "div4tail_num"
[5] "div5tail_num"
Now here is any_of()
, with the desired variables fed via
creating a new object called myvars
= c("year", "month", "dest")
myvars
%>%
cmh.df select(
any_of(
myvars
)-> my.df # using any_of()
)
names(my.df)
[1] "year" "month" "dest"
Note that destined
is not found but the other two are
extracted any way. What about with all_of()
?
%>%
cmh.df select(
all_of(myvars)
-> my.df )
What if I misspelled any of the variables? Try running the code below
= c("year", "month", "destined")
mymalvars
%>%
cmh.df select(
all_of(mymalvars)
-> my.df )
and you should see this message:
Error: Can't subset columns that don't exist. x Column `destined` doesn't exist.
%>%
cmh.df select(
matches(
"wheel"
)-> my.df # using matches()
)
names(my.df)
[1] "wheels_off" "wheels_on" "div1wheels_on"
[4] "div1wheels_off" "div2wheels_on" "div2wheels_off"
[7] "div3wheels_on" "div3wheels_off" "div4wheels_on"
[10] "div4wheels_off" "div5wheels_on" "div5wheels_off"
What about num_range()
?
%>%
billboard select(
num_range("wk", 1:4)
-> my.bb
)
names(my.bb)
[1] "wk1" "wk2" "wk3" "wk4"
rename(...)
Variable names can be modified with the rename()
command. The sequence is easy to remember: newname
=
oldname
, as shown below. Just remember: When you do this,
the original variable name has been overwritten so do not invoke it in
any subsequent code.
%>%
cmh.df rename(
flt_number = flight_number_reporting_airline
%>%
) select(
tail_number, flt_number-> df.rename
)
names(df.rename)
[1] "tail_number" "flt_number"
filter(...)
Recall that with the diamonds
data, in the previous week
we used subset()
to select particular cases for analysis.
With {dplyr}
, the filter()
command serves the
same purpose. Say we only want records from January, i.e.,
month == 1
.
%>%
cmh.df filter(
== 1
month -> my.df )
What about only American Airline flights in January?
%>%
cmh.df filter(
== 1 &
month == "AA"
reporting_airline -> my.df )
What about United Airlines flights in January to CMH (Columbus, OH) from any originating airport?
%>%
cmh.df filter(
== 1 & reporting_airline == "UA" & dest == "CMH"
month -> my.df )
Note: Instead of &
you could use ,
… it
serves the same purpose; test it out for yourself with the code
below.
%>%
cmh.df filter(
== 1, reporting_airline == "UA", dest == "CMH"
month -> my.df )
filter(...)
with
%in%
What if I wanted a more complicated filter, say, flights in January or February either to CMH (Columbus, Ohio) or to ORD (the O’Hare airport in Chicago, Illinois)?
%>%
cmh.df filter(
%in% c(1, 2) &
month == "UA" &
reporting_airline %in% c("CMH", "ORD")
dest -> my.df )
Some times you may also want to filter with the opposite of
%in%
, as in not in
some list. You can do this
as well, but note where the !
is applied – to the
column-name.
%>%
cmh.df filter(
%in% c(1, 2) &
month == "UA" &
reporting_airline !dest %in% c("CMH", "ORD")
-> my.df )
Now let us say I wanted to arrange the resulting data frame in
ascending
order of departure delays. How might I do
that?
arrange(...)
%>%
my.df arrange(
dep_delay_minutes-> my.df2 )
And now in descending
order of delays. Note the use of
the minus symbol -
to reverse the sort order
%>%
my.df arrange(
-dep_delay_minutes
-> my.df2 )
We could tweak this further, perhaps asking R to first sort by
descending order of departure delays to CMH and then to ORD. Note that
dest
is ascending and DepDelayMinutes
is
descending
%>%
my.df arrange(
-dep_delay_minutes
dest, -> my.df2 )
%>%
So far, we have seen each function in isolation. Now we streamline things a bit so that we only end up with the columns we want, organized as we want them to be. That is, we can string together multiple functions to have a slim data frame of only those columns we selected, filtered, and arranged according to our wishes.
%>%
cmh.df select(
month, reporting_airline, dest, dep_delay_minutes%>%
) filter(
%in% c(1, 2) &
month == "UA" &
reporting_airline %in% c("CMH", "ORD")
dest %>%
) arrange(
desc(dep_delay_minutes)
month, dest, -> my.df3 )
Here, the end result is a data frame arranged by month, then within month by destination, and then finally by descending order of flight delay times.
group_by(...)
You often want to carry out some operations/calculations by specific
groups, for example, calculate the mean or median age of the men in our
sample versus the women, find out how many unique destinations does each
airline fly to from Columbus, and so on. The group_by()
command comes in very handy here to set the grouping structure.
Let us calculate the Mean and Median delay times. To do this, we will
lean on the group_by()
function to calculate median and
mean delays by month and destination. While we are at it, let us see the
standard deviation as well.
%>%
cmh.df select(
month, reporting_airline, dest, dep_delay_minutes%>%
) filter(
%in% c(1, 2),
month == "UA",
reporting_airline %in% c("CMH", "ORD")
dest %>%
) group_by(month, dest) %>%
summarise(
median.delay = median(dep_delay_minutes, na.rm = TRUE),
mean.delay = mean(dep_delay_minutes, na.rm = TRUE),
sd.delay = sd(dep_delay_minutes, na.rm = TRUE)
%>%
) arrange(month, dest) -> tab.01
What if wanted to know how many airports can you fly to from Columbus, and how many flights there were to each airport in July?
%>%
cmh.df select(
month, reporting_airline, dest, dep_delay_minutes, origin%>%
) filter(
%in% c(7),
month %in% c("CMH")
origin %>%
) group_by(dest) %>%
summarise(
no.of.flights = n()
-> tab.02 )
How many distinct airlines fly out of Columbus?
%>%
cmh.df select(
month, reporting_airline, dest, dep_delay_minutes, origin%>%
) filter(
%in% c("CMH")
origin %>%
) group_by(reporting_airline) %>%
summarise(
no.of.airlines = n_distinct(reporting_airline)
-> tab.03 )
And how many flights did each airline operate?
%>%
cmh.df select(
month, reporting_airline, dest, dep_delay_minutes, origin%>%
) filter(
%in% c("CMH")
origin %>%
) group_by(reporting_airline) %>%
summarise(
no.of.flights = n()
-> tab.04 )
Of all of these airlines, which one has the highest mean departure delay?
%>%
cmh.df select(
month, reporting_airline, dest, dep_delay_minutes, origin%>%
) filter(
%in% c("CMH")
origin %>%
) group_by(reporting_airline) %>%
summarise(
mean.delay = mean(dep_delay_minutes, na.rm = TRUE)
%>%
) arrange(
desc(mean.delay)
-> tab.05 )
ungroup(...)
If you want to remove a grouping structure you merely need to specify
ungroup()
, as in
%>%
cmh.df select(
month, reporting_airline, dest, dep_delay_minutes, origin%>%
) filter(
%in% c("CMH")
origin %>%
) group_by(reporting_airline) %>%
summarise(
median.delay = median(dep_delay_minutes, na.rm = TRUE)
%>%
) ungroup() -> tab.06
ungroup()
is often handy when you need to pipe several
commands but the grouping structure is different for each. For example,
maybe I want to calculate departure delays by unique airline, and then
by unique combinations of airline & origin.
%>%
cmh.df select(
month, reporting_airline, dest, dep_delay_minutes, origin%>%
) group_by(reporting_airline) %>%
mutate(
mean.delay.1 = mean(dep_delay_minutes, na.rm = TRUE)
%>%
) ungroup() %>%
group_by(reporting_airline, origin) %>%
mutate(
mean.delay.2 = mean(dep_delay_minutes, na.rm = TRUE)
)
# A tibble: 47,787 × 7
# Groups: reporting_airline, origin [53]
month reporting_airline dest dep_delay_minutes origin mean.delay.1
<dbl> <chr> <chr> <dbl> <chr> <dbl>
1 1 DL ATL 0 CMH 8.90
2 1 DL ATL 1 CMH 8.90
3 1 DL CMH 0 ATL 8.90
4 1 DL ATL 0 CMH 8.90
5 1 DL CMH 0 ATL 8.90
6 1 DL CMH 0 LAX 8.90
7 1 DL CMH 0 ATL 8.90
8 1 DL ATL 0 CMH 8.90
9 1 DL LAX 0 CMH 8.90
10 1 DL CMH 0 ATL 8.90
# … with 47,777 more rows, and 1 more variable: mean.delay.2 <dbl>
distinct(...)
Since the preceding code yields duplicate rows, I can use
distinct
to retain only unique rows
%>%
cmh.df select(
month, reporting_airline, dest, dep_delay_minutes, origin%>%
) group_by(reporting_airline) %>%
mutate(
mean.delay.1 = mean(dep_delay_minutes, na.rm = TRUE)
%>%
) ungroup() %>%
group_by(reporting_airline, origin) %>%
mutate(
mean.delay.2 = mean(dep_delay_minutes, na.rm = TRUE)
%>%
) distinct()
# A tibble: 13,332 × 7
# Groups: reporting_airline, origin [53]
month reporting_airline dest dep_delay_minutes origin mean.delay.1
<dbl> <chr> <chr> <dbl> <chr> <dbl>
1 1 DL ATL 0 CMH 8.90
2 1 DL ATL 1 CMH 8.90
3 1 DL CMH 0 ATL 8.90
4 1 DL CMH 0 LAX 8.90
5 1 DL LAX 0 CMH 8.90
6 1 DL CMH 7 ATL 8.90
7 1 DL CMH 1 LAX 8.90
8 1 DL ATL 2 CMH 8.90
9 1 DL CMH 105 ATL 8.90
10 1 DL CMH 1 ATL 8.90
# … with 13,322 more rows, and 1 more variable: mean.delay.2 <dbl>
mutate(...)
Thus far we have been creating a summary data frame containing
whatever we calculated – the frequencies, averages, and so on. Some
times, however, you may want these calculations to be be appended to the
original data frame and stored as new columns. This can be done with
mutate()
, as shown below. In this particular example, I am
calculating median departure delay by airline (for each airport it
departs from) and then including this as a column called
Median.Delay
in cmhflights.
%>%
cmh.df group_by(
reporting_airline, origin%>%
) mutate(
median.delay = median(dep_delay_minutes, na.rm = TRUE)
-> cmhflights )
Note the 111th column. Now, if we wanted to add this column but drop
all other columns not being used in the calculation, we could do that
via transmute()
. Be careful with this one though.
-> mut.df # cloning cmh.df into mut.df
cmh.df
%>%
mut.df group_by(reporting_airline, dest) %>%
transmute(
median.arr.delay = median(arr_delay_minutes, na.rm = TRUE)
-> mut.df2
)
names(mut.df2)
[1] "reporting_airline" "dest" "median.arr.delay"
case_when(...)
Just as one can use ifelse()
to generate new variables
or modify existing variables, so can one do the same thing with
case_when()
from dplyr
. Here are a few
examples. First, working with categorical variables, where I create a
new variable that flags whether the flight originated in BWI, DCA, OR
IAH versus some other airport.
%>%
cmh.df mutate(
DC_area = case_when(
%in% c("BWI","DCA", "IAH") ~ "DC Area",
origin !origin %in% c("BWI","DCA", "IAH") ~ "Not in DC Area"
)-> cw.df
)
table(cw.df$origin, cw.df$DC_area)
DC Area Not in DC Area
ATL 0 3794
BNA 0 692
BOS 0 623
BWI 1301 0
CLT 0 3
CMH 0 23893
DAL 0 369
DCA 674 0
DEN 0 1224
DFW 0 1482
DTW 0 755
EWR 0 847
FLL 0 321
HOU 0 16
IAH 760 0
LAS 0 1074
LAX 0 641
LGA 0 173
MCO 0 1488
MDW 0 1992
MSP 0 945
MSY 0 35
OAK 0 365
ORD 0 1089
PHL 0 57
PHX 0 1075
RSW 0 657
STL 0 655
TPA 0 787
The same result would obtain if you had run the following code:
%>%
cmh.df mutate(
DC_area = case_when(
%in% c("BWI","DCA", "IAH") ~ "DC Area",
origin TRUE ~ "Not in DC Area"
)-> cw.df
)
table(cw.df$origin, cw.df$DC_area)
DC Area Not in DC Area
ATL 0 3794
BNA 0 692
BOS 0 623
BWI 1301 0
CLT 0 3
CMH 0 23893
DAL 0 369
DCA 674 0
DEN 0 1224
DFW 0 1482
DTW 0 755
EWR 0 847
FLL 0 321
HOU 0 16
IAH 760 0
LAS 0 1074
LAX 0 641
LGA 0 173
MCO 0 1488
MDW 0 1992
MSP 0 945
MSY 0 35
OAK 0 365
ORD 0 1089
PHL 0 57
PHX 0 1075
RSW 0 657
STL 0 655
TPA 0 787
Similarly, we could use numerical variables to create new ones. For
example, let us call a flight delayed
if
dep_delay_minutes
is \(>
30\).
%>%
cw.df mutate(
delayed = case_when(
> 30 ~ "Yes",
dep_delay_minutes < 30 ~ "No",
dep_delay_minutes is.na(dep_delay_minutes) ~ "Unknown"
)-> cw.df )
case_when()
is often useful when you have to chain
multiple ifelse()
commands since managing a large
ifelse()
chain can be tiresome. Further, you should
remember that %in%
can be used to ease coding in
case_when()
, as shown below. Say I want to group flights
into weekday versus weekend flights. Note that the week starts on
Sundays hence weekdays are defined here as days 2 through 6 (i.e.,
Monday through Friday).
%>%
cw.df mutate(
weekend = case_when(
%in% c(2:6) ~ "No",
day_of_week TRUE ~ "Yes"
)-> cw.df )
unnest(...)
Every now and then the data you have to work with includes some
list-columns – columns that are like mini-data frames themselves.
Suddenly R functions that would otherwise work will not execute as
desired. The unnest
function from the
tidyverse
unpacks these list-columns into a regular tibble.
First a small data frame that shows a list-column.
library(tidyverse)
data_frame(
x = 1:3,
y = c("a", "d,e,f", "g,h")
-> bad.df )
If I want to clean up list-column y
, I can do that as
shown below.
%>%
bad.df transform(
y = strsplit(y, ",")
%>%
) unnest(y) -> good.df
Now see a slightly more complicated list-column, with y
made of named rows …. a and b, a with 1 value and b with 2 values.
data_frame(
x = 1:2,
y = list(a = 1, b = 3:4)
-> badder.df )
To clean this up we could do:
%>%
badder.df unnest(
.,.id = "name"
-> gooder.df )
Most packages you use in R call for data to be {tidy}
.
What are tidy
data?