Our goal is to understand how some very powerful packages work their magic. Of somewhat recent vintage, {dplyr}
, {tidyr}
and {lubridate}
have quickly gained a fan following because they allow you to clean and organize your data and then calculate quantities of interest with surprising ease. In this module we will start to see some of these packages’ core functionalities and wrap up this particular leg of our learning journey in the next module.
{dplyr}
There is a common quote that is tossed about a good bit in the hallways of data science: I am paraphrasing here, but a data scientist spends about 80% of their time gathering, cleaning, and organizing their data, and only about 20% of their time on the analysis per se. This may or may not be true, especially in the initial stages of a new project, but yes, we do spend an awfully large amount of our time getting the data ready for visualizations and other analyses. Do this work long enough and you come to realize that anything you could do to speed up the cleaning phase would be time and money saved. And yet, data cleaning skills are in short supply. No more, I say, because packages like {dplyr}
and {data.table}
have simplified what were once nightmare tasks.
Nightmare is not a word to be tossed around lightly, so let us set up a seemingly large data-set with 100+ columns and tons of information hidden in it. Once we set up this Cyclops and spell out a few questions we would like to answer, we might better appreciate how {dplyr}
comes to our aid. In particular, we might come to understand how {dplyr}
uses seven core verbs:
What you want to do … | {dplyr} function |
---|---|
you need to select columns to work with? | select() |
you need to use a subset of the data based on some criterion? | filter() |
you need to arrange the data in ascending/descending order of variable(s)? | arrange() |
you want the results of your calculations to be a standalone data frame? | summarise() |
you want to add your calculated value(s) to the existing data frame? | mutate() |
you want to add your calculated value(s) to the existing data frame but also drop all variables/columns not being used in the calculation? | transmute() |
you need to calculate averages, frequencies, etc by groups? | group_by() |
Here comes Cyclops – all flights originating from or arriving in Columbus (Ohio), January through September of 2017. Let us load the data, and the {tidyverse}
and {here}
packages.
library(tidyverse)
library(here)
load(
here("data", "cmhflights_01092017.RData")
)
names(cmhflights) # will show you the column names
## [1] "Year" "Quarter" "Month"
## [4] "DayofMonth" "DayOfWeek" "FlightDate"
## [7] "UniqueCarrier" "AirlineID" "Carrier"
## [10] "TailNum" "FlightNum" "OriginAirportID"
## [13] "OriginAirportSeqID" "OriginCityMarketID" "Origin"
## [16] "OriginCityName" "OriginState" "OriginStateFips"
## [19] "OriginStateName" "OriginWac" "DestAirportID"
## [22] "DestAirportSeqID" "DestCityMarketID" "Dest"
## [25] "DestCityName" "DestState" "DestStateFips"
## [28] "DestStateName" "DestWac" "CRSDepTime"
## [31] "DepTime" "DepDelay" "DepDelayMinutes"
## [34] "DepDel15" "DepartureDelayGroups" "DepTimeBlk"
## [37] "TaxiOut" "WheelsOff" "WheelsOn"
## [40] "TaxiIn" "CRSArrTime" "ArrTime"
## [43] "ArrDelay" "ArrDelayMinutes" "ArrDel15"
## [46] "ArrivalDelayGroups" "ArrTimeBlk" "Cancelled"
## [49] "CancellationCode" "Diverted" "CRSElapsedTime"
## [52] "ActualElapsedTime" "AirTime" "Flights"
## [55] "Distance" "DistanceGroup" "CarrierDelay"
## [58] "WeatherDelay" "NASDelay" "SecurityDelay"
## [61] "LateAircraftDelay" "FirstDepTime" "TotalAddGTime"
## [64] "LongestAddGTime" "DivAirportLandings" "DivReachedDest"
## [67] "DivActualElapsedTime" "DivArrDelay" "DivDistance"
## [70] "Div1Airport" "Div1AirportID" "Div1AirportSeqID"
## [73] "Div1WheelsOn" "Div1TotalGTime" "Div1LongestGTime"
## [76] "Div1WheelsOff" "Div1TailNum" "Div2Airport"
## [79] "Div2AirportID" "Div2AirportSeqID" "Div2WheelsOn"
## [82] "Div2TotalGTime" "Div2LongestGTime" "Div2WheelsOff"
## [85] "Div2TailNum" "Div3Airport" "Div3AirportID"
## [88] "Div3AirportSeqID" "Div3WheelsOn" "Div3TotalGTime"
## [91] "Div3LongestGTime" "Div3WheelsOff" "Div3TailNum"
## [94] "Div4Airport" "Div4AirportID" "Div4AirportSeqID"
## [97] "Div4WheelsOn" "Div4TotalGTime" "Div4LongestGTime"
## [100] "Div4WheelsOff" "Div4TailNum" "Div5Airport"
## [103] "Div5AirportID" "Div5AirportSeqID" "Div5WheelsOn"
## [106] "Div5TotalGTime" "Div5LongestGTime" "Div5WheelsOff"
## [109] "Div5TailNum" "X110"
The output is rather onerous, with all 110 columns displayed, the last being X110, an empty column that can be dropped.
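One way to drop it (a sketch; select() with a minus sign drops the named column):
cmhflights %>%
  select(-X110) -> cmhflights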
Now we are down to 109 columns and ready to get to work.
Often, as in the present case of 100+ columns, our data frame will have more columns than we plan to work with. In such instances it is a good idea to drop all unwanted columns; this will make the data-set more manageable and tax our computing resources less. For example, say I only want the first five columns (Year, Quarter, Month, DayofMonth, DayOfWeek). I could use select
to create a data frame with only these columns:
cmhflights %>%
select(Year:DayOfWeek) -> my.df # The : is the bridge between consecutive columns
names(my.df)
## [1] "Year" "Quarter" "Month" "DayofMonth" "DayOfWeek"
The same result could have been obtained by taking the longer route of listing each column:
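cmhflights %>%
  select(Year, Quarter, Month, DayofMonth, DayOfWeek) -> my.df
names(my.df)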
## [1] "Year" "Quarter" "Month" "DayofMonth" "DayOfWeek"
What if the columns were not sequentially located? In that case we would need to list each column we want. Say I want Year, FlightDate, UniqueCarrier, and TailNum.
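A sketch of that call:
cmhflights %>%
  select(Year, FlightDate, UniqueCarrier, TailNum) -> my.df
names(my.df)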
## [1] "Year" "FlightDate" "UniqueCarrier" "TailNum"
Could we have used column numbers instead of column names? Absolutely.
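Sketches that would reproduce the three outputs below, in order, with column positions taken from the names() listing above:
cmhflights %>% select(c(1, 3, 5, 7)) %>% names() # Year, Month, DayOfWeek, UniqueCarrier
cmhflights %>% select(1:5) %>% names() # the first five columns again
cmhflights %>% select(c(1, 6:9, 12)) %>% names() # a mix of single positions and a range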
## [1] "Year" "Month" "DayOfWeek" "UniqueCarrier"
## [1] "Year" "Quarter" "Month" "DayofMonth" "DayOfWeek"
## [1] "Year" "FlightDate" "UniqueCarrier" "AirlineID"
## [5] "Carrier" "OriginAirportID"
We can also select columns in other ways, by specifying that the column name contains some element. The sketches below show how this is done if I am looking for column names with the phrase “Carrier”, then with “De”, and then with “Num”.
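Sketches that match the three outputs that follow. Note that the “De” output corresponds to starts_with("De") rather than contains("De"), which would also pick up columns such as ArrDelay:
cmhflights %>% select(contains("Carrier")) %>% names()
cmhflights %>% select(starts_with("De")) %>% names()
cmhflights %>% select(contains("Num")) %>% names()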
## [1] "UniqueCarrier" "Carrier" "CarrierDelay"
## [1] "DestAirportID" "DestAirportSeqID" "DestCityMarketID"
## [4] "Dest" "DestCityName" "DestState"
## [7] "DestStateFips" "DestStateName" "DestWac"
## [10] "DepTime" "DepDelay" "DepDelayMinutes"
## [13] "DepDel15" "DepartureDelayGroups" "DepTimeBlk"
## [1] "TailNum" "FlightNum" "Div1TailNum" "Div2TailNum" "Div3TailNum"
## [6] "Div4TailNum" "Div5TailNum"
Do you really want all the rows in the data-set or very specific rows that meet some criteria? Say we only want to look at certain months, or just flights on Saturdays and Sundays, or flights in a given month. For example, say we only want flights in January, i.e., Month == 1.
cmhflights %>%
filter(Month == 1) -> my.df
table(my.df$Month) # Show me a frequency table for Month in my.df
##
## 1
## 3757
What about only American Airlines (the UniqueCarrier code for this airline is AA) flights in January?
cmhflights %>%
filter(Month == 1, UniqueCarrier == "AA") -> my.df
table(my.df$Month, my.df$UniqueCarrier) # a simple frequency table
##
## AA
## 1 387
What about United Airlines flights in January from CMH (the airport code for Columbus, OH) to any destination?
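A sketch of that filter (no destination condition is needed, since we want all destinations):
cmhflights %>%
  filter(Month == 1, UniqueCarrier == "UA") -> my.df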
What if I wanted a more complicated filter, say, United Airlines flights in January or February to CMH or ORD (the airport code for O’Hare in Chicago)?
cmhflights %>%
filter(
Month %in% c(1, 2), UniqueCarrier == "UA", Dest %in% c("CMH", "ORD")
) -> my.df
table(my.df$Month) # frequency table of Month
##
## 1 2
## 106 145
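The carrier and destination tables that follow would come from similar calls:
table(my.df$UniqueCarrier) # frequency table of Carrier
table(my.df$Dest) # frequency table of Destination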
##
## UA
## 251
##
## CMH ORD
## 132 119
Beautiful, just beautiful. It may not be readily apparent at this point, but using %in% c(...) makes applying complex filters quite easy. Note the operators that work with filter():
Operator | Meaning | Operator | Meaning |
---|---|---|---|
< | less than | > | greater than |
== | equal to | <= | less than or equal to |
>= | greater than or equal to | != | not equal to |
%in% | is a member of | is.na | is NA |
!is.na | is not NA | &, !, etc. | Boolean operators |
Now let us say I wanted to arrange the resulting data frame in ascending order of departure delays. How might I do that? Via arrange(). And for descending order of delays, we add the minus symbol - to the column name. We could tweak this further, perhaps saying sort by departure delays to CMH, and then to ORD. Sketches of all three calls follow.
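Minimal sketches, continuing with the my.df built above (the column name matches the combined pipeline below):
my.df %>% arrange(DepDelayMinutes) # ascending order of delays
my.df %>% arrange(-DepDelayMinutes) # descending order of delays
my.df %>% arrange(Dest, -DepDelayMinutes) # by destination (CMH, then ORD), then descending delays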
So far, we have seen each function in isolation. Now we streamline things a bit so that we only end up with the columns and rows we want to work with, arranged as we want the resulting data-set to be.
cmhflights %>%
select(Month, UniqueCarrier, Dest, DepDelayMinutes) %>%
filter(
Month %in% c(1, 2), UniqueCarrier == "UA", Dest %in% c("CMH", "ORD")
) %>%
arrange(Month, Dest, -DepDelayMinutes) -> my.df3
Here, the end result is a data frame arranged by Month, then within Month by Destination, and then finally by descending order of flight delays. This is the beauty of {dplyr}, allowing us to chain together various functions to get what we want. How is this helpful? Well, now you have a data frame that you can analyze.
What if we need to calculate frequencies? For example, how many flights per month are there? What if we want the mean DepDelay or median ArrDelay? These can be easily calculated as shown below.
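A sketch of the per-month count that produces the table below:
cmhflights %>%
  count(Month)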
## # A tibble: 9 x 2
## Month n
## <int> <int>
## 1 1 3757
## 2 2 3413
## 3 3 4101
## 4 4 4123
## 5 5 4098
## 6 6 4138
## 7 7 4295
## 8 8 4279
## 9 9 3789
What about by days of the week AND by month?
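Adding DayOfWeek to the count (a sketch):
cmhflights %>%
  count(Month, DayOfWeek)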
## # A tibble: 63 x 3
## Month DayOfWeek n
## <int> <int> <int>
## 1 1 1 660
## 2 1 2 635
## 3 1 3 516
## 4 1 4 522
## 5 1 5 523
## 6 1 6 361
## 7 1 7 540
## 8 2 1 527
## 9 2 2 506
## 10 2 3 516
## # … with 53 more rows
I want to know the average departure delay and the average arrival delay for all flights, with the averages calculated in two ways – as the mean, and as the median.
cmhflights %>%
summarise(
mean_arr_delay = mean(ArrDelay, na.rm = TRUE),
mean_dep_delay = mean(DepDelay, na.rm = TRUE),
median_arr_delay = median(ArrDelay, na.rm = TRUE),
median_dep_delay = median(DepDelay, na.rm = TRUE)
)
## # A tibble: 1 x 4
## mean_arr_delay mean_dep_delay median_arr_delay median_dep_delay
## <dbl> <dbl> <dbl> <dbl>
## 1 3.26 9.20 -6 -2
Here, the na.rm = TRUE argument is useful because R will not calculate a mean, median, or similar statistic from a column that contains missing values. You can see this below, where I have a small data-set called df with 4 values of x, one of which is missing and recorded as NA. See what happens when I try to calculate the mean with/without na.rm = TRUE.
df = data.frame(x = c(2, 4, 9, NA))
df %>%
summarise(mean.x = mean(x)) # You get no meaningful values
## mean.x
## 1 NA
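And now with na.rm = TRUE:
df %>%
  summarise(mean.x = mean(x, na.rm = TRUE)) # NA is dropped, so you get the mean of 2, 4, and 9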
## mean.x
## 1 5
These summaries are fine if you want to calculate aggregate quantities of interest, but what if you wanted to calculate the number of flights per month by airline? Average delays by airline? Now things get interesting because group_by() will open up this new world for us! The first thing I will calculate is the number of flights by airline per month.
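There are two ways to do this, and both are shown here; they produce the two outputs that follow, in order:
cmhflights %>%
  group_by(Month, Carrier) %>%
  tally()

cmhflights %>%
  group_by(Month, Carrier) %>%
  summarise(frequency = n())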
## # A tibble: 63 x 3
## # Groups: Month [9]
## Month Carrier n
## <int> <chr> <int>
## 1 1 AA 387
## 2 1 DL 549
## 3 1 EV 436
## 4 1 F9 152
## 5 1 OO 123
## 6 1 UA 106
## 7 1 WN 2004
## 8 2 AA 360
## 9 2 DL 511
## 10 2 EV 362
## # … with 53 more rows
## # A tibble: 63 x 3
## # Groups: Month [9]
## Month Carrier frequency
## <int> <chr> <int>
## 1 1 AA 387
## 2 1 DL 549
## 3 1 EV 436
## 4 1 F9 152
## 5 1 OO 123
## 6 1 UA 106
## 7 1 WN 2004
## 8 2 AA 360
## 9 2 DL 511
## 10 2 EV 362
## # … with 53 more rows
Two ways to do this, first with tally() and second with summarise(frequency = n()), both yielding the same result. I would urge you to remember tally() because it is shorter code. Now, for example, I want a silly table that gives us the number of flights per month, per airline, per destination.
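A sketch of that call:
cmhflights %>%
  group_by(Month, Carrier, Dest) %>%
  tally()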
## # A tibble: 398 x 4
## # Groups: Month, Carrier [63]
## Month Carrier Dest n
## <int> <chr> <chr> <int>
## 1 1 AA CMH 193
## 2 1 AA DFW 112
## 3 1 AA LAX 24
## 4 1 AA PHX 58
## 5 1 DL ATL 224
## 6 1 DL CMH 275
## 7 1 DL DTW 2
## 8 1 DL LAX 27
## 9 1 DL MSP 21
## 10 1 EV CMH 218
## # … with 388 more rows
We could keep complicating the grouping structure. For example, let us add the day of the week to the mix …
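cmhflights %>%
  group_by(Month, Carrier, Dest, DayOfWeek) %>%
  tally()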
## # A tibble: 2,536 x 5
## # Groups: Month, Carrier, Dest [398]
## Month Carrier Dest DayOfWeek n
## <int> <chr> <chr> <int> <int>
## 1 1 AA CMH 1 35
## 2 1 AA CMH 2 23
## 3 1 AA CMH 3 28
## 4 1 AA CMH 4 28
## 5 1 AA CMH 5 28
## 6 1 AA CMH 6 21
## 7 1 AA CMH 7 30
## 8 1 AA DFW 1 20
## 9 1 AA DFW 2 16
## 10 1 AA DFW 3 16
## # … with 2,526 more rows
Now say I am really curious about mean departure delays for the preceding grouping structure. That is, what does mean departure delay look like for flights by day of the week, by month, by carrier, and by destination?
cmhflights %>%
group_by(Month, Carrier, Dest, DayOfWeek) %>%
summarise(mean_dep_delay = mean(DepDelay, na.rm = TRUE))
## # A tibble: 2,536 x 5
## # Groups: Month, Carrier, Dest [398]
## Month Carrier Dest DayOfWeek mean_dep_delay
## <int> <chr> <chr> <int> <dbl>
## 1 1 AA CMH 1 16.3
## 2 1 AA CMH 2 0.304
## 3 1 AA CMH 3 1.14
## 4 1 AA CMH 4 3.11
## 5 1 AA CMH 5 4.11
## 6 1 AA CMH 6 24.9
## 7 1 AA CMH 7 21.8
## 8 1 AA DFW 1 25.4
## 9 1 AA DFW 2 -2.62
## 10 1 AA DFW 3 17.6
## # … with 2,526 more rows
But this is a complicated summary table. What if all I really want to know is what airline has the highest mean departure delays, regardless of month or destination or day of the week?
cmhflights %>%
group_by(Carrier) %>%
summarise(mean_dep_delay = mean(DepDelay, na.rm = TRUE)) %>%
arrange(-mean_dep_delay) # ordered in descending order of delays
## # A tibble: 7 x 2
## Carrier mean_dep_delay
## <chr> <dbl>
## 1 EV 15.2
## 2 F9 10.8
## 3 WN 9.58
## 4 AA 7.73
## 5 DL 7.10
## 6 OO 6.44
## 7 UA 6.39
EV is Express Jet; F9 is Frontier Airlines; WN is Southwest Airlines; OO is SkyWest Airlines; AA is American Airlines; DL is Delta Airlines; UA is United Airlines. So clearly United Airlines had the lowest average departure delays. Would this still be true if we repeated the calculation by Month?
cmhflights %>%
group_by(Carrier, Month) %>%
summarise(mean_dep_delay = mean(DepDelay, na.rm = TRUE)) %>%
arrange(mean_dep_delay) # ordered in ascending order of delays
## # A tibble: 63 x 3
## # Groups: Carrier [7]
## Carrier Month mean_dep_delay
## <chr> <int> <dbl>
## 1 OO 8 -0.873
## 2 OO 2 -0.570
## 3 OO 9 -0.179
## 4 UA 9 0.699
## 5 DL 8 1.85
## 6 DL 2 2.24
## 7 AA 2 2.30
## 8 F9 9 2.51
## 9 AA 9 2.62
## 10 DL 3 2.98
## # … with 53 more rows
All righty then! Looks like three of the lowest mean departure delays were for SkyWest. Do not let the negative numbers throw you for a loop; a negative value implies the flight left earlier than scheduled.
So far so good. But now I am curious about what percent of flights operated by AA, DL, UA, and WN were delayed. How could I calculate this?
In steps:
use filter() to restrict the data-set to just these four airlines;
create a new column flagging whether each flight was delayed (late);
count, per carrier, the total number of flights (nflights) and the total number of flights that were delayed (nlate).
cmhflights %>%
select(c(Carrier, DepDelay)) %>%
filter(Carrier %in% c("AA", "DL", "UA", "WN")) %>%
mutate(late = case_when(
DepDelay > 0 ~ "Yes",
DepDelay <= 0 ~ "No"
)
) %>%
group_by(Carrier) %>%
mutate(nflights = n()) %>%
group_by(Carrier, late) %>%
mutate(
nlate = n(),
pct_late = (nlate / nflights) * 100) -> df1
df1
## # A tibble: 29,328 x 6
## # Groups: Carrier, late [12]
## Carrier DepDelay late nflights nlate pct_late
## <chr> <dbl> <chr> <int> <int> <dbl>
## 1 AA -9 No 3891 2698 69.3
## 2 AA 24 Yes 3891 1162 29.9
## 3 AA -6 No 3891 2698 69.3
## 4 AA -5 No 3891 2698 69.3
## 5 AA -7 No 3891 2698 69.3
## 6 AA 22 Yes 3891 1162 29.9
## 7 AA -4 No 3891 2698 69.3
## 8 AA -4 No 3891 2698 69.3
## 9 AA 42 Yes 3891 1162 29.9
## 10 AA -6 No 3891 2698 69.3
## # … with 29,318 more rows
There is a whole lot going on here so let us break it down.
filter(Carrier %in% c("AA", "DL", "UA", "WN")) keeps the specified airlines’ data while dropping the rest.
mutate(late = case_when(...)) creates a new column called late, storing “Yes” if DepDelay > 0 and “No” if DepDelay <= 0.
group_by(Carrier) groups by Carrier, and the mutate(nflights = n()) that follows counts how many flights there were per Carrier, storing this count in a new column called nflights.
group_by(Carrier, late) then groups the data by Carrier and late.
mutate(nlate = n(), pct_late = (nlate / nflights) * 100) creates two new columns: nlate, the number of flights per late value (“Yes” and “No”, respectively), and pct_late, the percent of each carrier’s flights falling in each late group.
Now, we only want the flights that were late, so let us apply select() to keep just a few columns, and then use filter() to keep only the rows corresponding to late == "Yes". This will still leave us with duplicate rows, but we can drop those via a new command, distinct(). A sketch follows.
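df1 %>%
  filter(late == "Yes") %>%
  select(late, Carrier, pct_late) %>%
  distinct() %>%
  arrange(pct_late)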
## # A tibble: 4 x 3
## # Groups: Carrier, late [4]
## late Carrier pct_late
## <chr> <chr> <dbl>
## 1 Yes UA 24.0
## 2 Yes DL 27.7
## 3 Yes AA 29.9
## 4 Yes WN 41.1
So! 24% of UA flights were late, the lowest in this group.
What if we wanted to do this for all airlines, and we want it by Month?
cmhflights %>%
select(c(Carrier, Month, DepDelay)) %>%
filter(Carrier %in% c("AA", "DL", "UA", "WN")) %>%
mutate(late = case_when(
DepDelay > 0 ~ "Yes",
DepDelay <= 0 ~ "No"
)
) %>%
group_by(Carrier, Month) %>%
mutate(nflights = n()) %>%
group_by(Carrier, Month, late) %>%
mutate(
nlate = n(),
pct_late = (nlate / nflights) * 100) %>%
filter(late == "Yes") %>%
select(Carrier, Month, pct_late) %>%
distinct() %>%
arrange(pct_late)
## # A tibble: 36 x 4
## # Groups: Carrier, Month, late [36]
## late Carrier Month pct_late
## <chr> <chr> <int> <dbl>
## 1 Yes UA 9 17.9
## 2 Yes UA 2 18.6
## 3 Yes DL 9 20.2
## 4 Yes UA 5 20.2
## 5 Yes UA 4 20.4
## 6 Yes AA 9 21.4
## 7 Yes DL 2 21.5
## 8 Yes DL 8 22.5
## 9 Yes AA 2 23.1
## 10 Yes UA 3 23.3
## # … with 26 more rows
Before we move on, I want to point out something about case_when()
. Specifically, we used it to create a new column called late
from numeric values found in DepDelay
. But what if we wanted to create a new column from a column that had categorical variables in it, like Dest
or Carrier
? Easy.
cmhflights %>%
filter(Carrier %in% c("AA", "DL", "UA")) %>%
mutate(carrier_name = case_when(
Carrier == "AA" ~ "American Airlines",
Carrier == "DL" ~ "Delta Airlines",
Carrier == "UA" ~ "United Airlines"
)
) %>%
select(Carrier, carrier_name)
## # A tibble: 10,864 x 2
## Carrier carrier_name
## <chr> <chr>
## 1 AA American Airlines
## 2 AA American Airlines
## 3 AA American Airlines
## 4 AA American Airlines
## 5 AA American Airlines
## 6 AA American Airlines
## 7 AA American Airlines
## 8 AA American Airlines
## 9 AA American Airlines
## 10 AA American Airlines
## # … with 10,854 more rows
Second, case_when()
includes an option that cuts down on our work. In particular, say I want to create a new column and label its values as “Weekend” if the DayOfWeek is Saturday or Sunday and “Weekday” if DayOfWeek is any other day. In doing this, it would serve us well to remember that the week begins on Sunday so DayOfWeek == 1 is Sunday, not Monday.
cmhflights %>%
mutate(weekend = case_when(
DayOfWeek %in% c(7, 1) ~ "Yes",
TRUE ~ "No"
)
) %>%
select(DayOfWeek, weekend) %>%
distinct()
## # A tibble: 7 x 2
## DayOfWeek weekend
## <int> <chr>
## 1 7 Yes
## 2 1 Yes
## 3 2 No
## 4 3 No
## 5 4 No
## 6 5 No
## 7 6 No
Notice how TRUE
swept up all other values of DayOfWeek
and coded them as “No.”
One final showcasing of case_when()
. In Module 01 we looked at the hsb2
data and created some factors
for columns such as female, ses, schtyp, and so on. Well, let us see how the same thing could be done with case_when()
.
read.table(
'https://stats.idre.ucla.edu/stat/data/hsb2.csv',
header = TRUE,
sep = ","
) -> hsb2
hsb2 %>%
mutate(
female.f = case_when(
female == 0 ~ "Male",
female == 1 ~ "Female"),
race.f = case_when(
race == 1 ~ "Hispanic",
race == 2 ~ "Asian",
race == 3 ~ "African-American",
TRUE ~ "White"),
ses.f = case_when(
ses == 1 ~ "Low",
ses == 2 ~ "Medium",
TRUE ~ "High"),
schtyp.f = case_when(
schtyp == 1 ~ "Public",
TRUE ~ "Private"),
prog.f = case_when(
prog == 1 ~ "General",
prog == 2 ~ "Academic",
TRUE ~ "Vocational")
) -> hsb2
More {dplyr} commands
We have seen count() in action but let us see it again, in a slightly different light. In particular, say I want to know how many unique destinations are connected by air from Columbus.
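Given the note about sort = TRUE below, the call would be (a sketch):
cmhflights %>%
  count(Dest, sort = TRUE)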
## # A tibble: 26 x 2
## Dest n
## <chr> <int>
## 1 ATL 2884
## 2 MDW 1511
## 3 MCO 1148
## 4 DFW 1122
## 5 DEN 971
## 6 BWI 948
## 7 LAS 815
## 8 PHX 815
## 9 ORD 803
## 10 EWR 736
## # … with 16 more rows
Note: There is no need for group_by()
here. And sort = TRUE
arranges the result in descending order of the frequency (n
). Here is another code example, this time adding Carrier to the mix.
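cmhflights %>%
  count(Carrier, Dest, sort = TRUE)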
## # A tibble: 45 x 3
## Carrier Dest n
## <chr> <chr> <int>
## 1 DL ATL 2141
## 2 WN MDW 1511
## 3 AA DFW 1122
## 4 WN BWI 948
## 5 WN MCO 929
## 6 WN ATL 743
## 7 EV EWR 736
## 8 WN TPA 595
## 9 UA ORD 577
## 10 WN LAS 542
## # … with 35 more rows
How does this help us? Well, now we know that if we were flying to Atlanta, Delta would have the most flights, but if we were flying to the Chicago area then Southwest should be our pick.
Another useful command is n_distinct(), useful in the sense of allowing us to calculate the number of distinct values of any column. For example, say I want to know how many unique aircraft (not airlines) there are in this data-set.
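A sketch that matches the output below:
cmhflights %>%
  summarise(n_distinct(TailNum))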
## # A tibble: 1 x 1
## `n_distinct(TailNum)`
## <int>
## 1 2248
If you want to see the top ‘n’ number of observations, for example the 4 airlines with the most aircraft, you can lean on top_n()
, as shown below.
cmhflights %>%
group_by(Carrier) %>%
summarise(num.flights = n_distinct(TailNum)) %>%
arrange(-num.flights) %>%
top_n(4)
## # A tibble: 4 x 2
## Carrier num.flights
## <chr> <int>
## 1 WN 751
## 2 DL 539
## 3 UA 289
## 4 OO 222
I am also curious about which aircraft has flown the most, along with the three aircraft that follow in descending order.
cmhflights %>%
filter(!is.na(TailNum)) %>% # Removing some missing cases
group_by(TailNum) %>%
tally() %>%
arrange(-n) %>%
top_n(4)
## # A tibble: 4 x 2
## TailNum n
## <chr> <int>
## 1 N396SW 74
## 2 N601WN 66
## 3 N646SW 64
## 4 N635SW 62
You will, from time to time, need to merge multiple data-sets together. For example, say I have the following data-sets I have created for demonstration purposes.
tibble(
Name = c("Tim", "Tammy", "Bubbles", "Panda"),
Score = c(5, 8, 9, 10)) -> df1
tibble(
Name = c("Tim", "Tammy", "Bubbles"),
Age = c(25, 78, 19)) -> df2
tibble(
Name = c("Tim", "Tammy", "Panda"),
Education = c("BA", "PhD", "JD")) -> df3
df1; df2; df3
## # A tibble: 4 x 2
## Name Score
## <chr> <dbl>
## 1 Tim 5
## 2 Tammy 8
## 3 Bubbles 9
## 4 Panda 10
## # A tibble: 3 x 2
## Name Age
## <chr> <dbl>
## 1 Tim 25
## 2 Tammy 78
## 3 Bubbles 19
## # A tibble: 3 x 2
## Name Education
## <chr> <chr>
## 1 Tim BA
## 2 Tammy PhD
## 3 Panda JD
Notice that Panda is absent from df2
and Bubbles is absent from df3
. So if we wanted to build ONE data-set with all data for Tim, Tammy, Bubbles, and Panda, some of the information would be missing for some of these folks. But how could we construct ONE data-set? Via one of a few join()
commands.
Let us start with a simple full_join(), where we link up every individual in df1, df2, or df3, regardless of whether they are seen in all of the data-sets.
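A sketch, joining the three by Name:
df1 %>%
  full_join(df2, by = "Name") %>%
  full_join(df3, by = "Name")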
## # A tibble: 4 x 4
## Name Score Age Education
## <chr> <dbl> <dbl> <chr>
## 1 Tim 5 25 BA
## 2 Tammy 8 78 PhD
## 3 Bubbles 9 19 <NA>
## 4 Panda 10 NA JD
Pay attention to two things: (i) Name connects the records in each data-set, and so it must be spelled exactly the same for a link to be made, and (ii) the full_join()
links up all individuals regardless of whether they are missing any information in any of the data-sets. This is usually how most folks will link up multiple files, unless they only want records found in a master file. For example, say I want to link up df2 and df3 such that the final result includes every record found in df2, with df2 serving as the master data-set. Eh?
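A left_join() with df2 on the left reproduces the output below (a sketch):
df2 %>%
  left_join(df3, by = "Name")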
## # A tibble: 3 x 3
## Name Age Education
## <chr> <dbl> <chr>
## 1 Tim 25 BA
## 2 Tammy 78 PhD
## 3 Bubbles 19 <NA>
Notice that Panda is dropped because it is not found in df2.
Maybe you want df3 to be the master file, in which case you would see a different result (with Bubbles not seen in the result since Bubbles is found in df2 but not in df3):
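df3 %>%
  left_join(df2, by = "Name")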
## # A tibble: 3 x 3
## Name Education Age
## <chr> <chr> <dbl>
## 1 Tim BA 25
## 2 Tammy PhD 78
## 3 Panda JD NA
Rarely, but definitely not “never,” you may want to see the records that are not found in both. Here, anti_join() comes in handy, thus:
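Sketches, one for each direction:
df2 %>%
  anti_join(df3, by = "Name") # records in df2 that are not in df3
df3 %>%
  anti_join(df2, by = "Name") # records in df3 that are not in df2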
## # A tibble: 1 x 2
## Name Age
## <chr> <dbl>
## 1 Bubbles 19
## # A tibble: 1 x 2
## Name Education
## <chr> <chr>
## 1 Panda JD
Every now and then you may want or need to create a grouped version of some numeric variable. For example, we have DepDelay for all flights but want to group this into quartiles. How can we do that? In many ways, but the easiest might be to use a specific library – {santoku}. Say, for example, I want to create 4 groups of DepDelay, grouping it into the bottom 25%, the next 25%, the next 25%, and finally the highest 25%. Wait, these are the quartiles! Fair enough, but how can I do this?
library(santoku)
cmhflights %>%
mutate(
depdelay_groups = chop_equally(DepDelay, groups = 4)
) %>%
group_by(depdelay_groups) %>%
tally()
## # A tibble: 5 x 2
## depdelay_groups n
## <fct> <int>
## 1 [0%, 25%) 6887
## 2 [25%, 50%) 9267
## 3 [50%, 75%) 10143
## 4 [75%, 100%] 9225
## 5 <NA> 471
What if we wanted to slice up DepDelay in specific intervals, first at 0, then at 15, then at 30, and then at 45?
cmhflights %>%
mutate(
depdelay_groups = chop(DepDelay, breaks = c(0, 15, 30, 45))
) %>%
group_by(depdelay_groups) %>%
tally()
## # A tibble: 6 x 2
## depdelay_groups n
## <fct> <int>
## 1 [-27, 0) 21108
## 2 [0, 15) 7883
## 3 [15, 30) 2567
## 4 [30, 45] 1302
## 5 (45, 1323] 2662
## 6 <NA> 471
We could also create quintiles (5 groups) or deciles (10 groups) as shown below:
cmhflights %>%
filter(!is.na(DepDelay)) %>%
mutate(
depdelay_groups = chop_quantiles(
DepDelay, c(0.2, 0.4, 0.6, 0.8)
)
) %>%
group_by(depdelay_groups) %>%
tally()
## # A tibble: 5 x 2
## depdelay_groups n
## <fct> <int>
## 1 [0%, 20%) 6887
## 2 [20%, 40%) 6342
## 3 [40%, 60%) 7879
## 4 [60%, 80%] 7362
## 5 (80%, 100%] 7052
cmhflights %>%
filter(!is.na(DepDelay)) %>%
mutate(
depdelay_groups = chop_quantiles(
DepDelay, seq(0.1, 0.9, by = 0.1)
)
) %>%
group_by(depdelay_groups) %>%
tally()
## # A tibble: 10 x 2
## depdelay_groups n
## <fct> <int>
## 1 [0%, 10%) 3006
## 2 [10%, 20%) 3881
## 3 [20%, 30%) 3344
## 4 [30%, 40%) 2998
## 5 [40%, 50%) 2925
## 6 [50%, 60%) 4954
## 7 [60%, 70%) 3262
## 8 [70%, 80%) 3773
## 9 [80%, 90%] 3904
## 10 (90%, 100%] 3475
More often than we would like, we end up with categorical variables that should follow a certain order but do not. For example, say you have survey data where people were asked to respond whether they Agree, are Neutral, or Disagree with some statement. Let us also assume that the frequencies are as follows:
tibble(
response = c(rep("Agree", 25), rep("Neutral", 30), rep("Disagree", 45))
) -> mydf
mydf %>%
group_by(response) %>%
tally()
## # A tibble: 3 x 2
## response n
## <chr> <int>
## 1 Agree 25
## 2 Disagree 45
## 3 Neutral 30
Notice how the responses are out of order, with Agree followed by Disagree, then Neutral, since R defaults to alphabetic ordering for anything that is a categorical variable. One way to ensure the correct ordering of categorical variables is via ordered
, as shown below.
mydf %>%
mutate(ordered.response = ordered(
response,
levels = c("Agree", "Neutral", "Disagree")
)
) %>%
group_by(ordered.response) %>%
tally()
## # A tibble: 3 x 2
## ordered.response n
## <ord> <int>
## 1 Agree 25
## 2 Neutral 30
## 3 Disagree 45
We have covered a lot of ground here, but every inch has been critical space. Google any question we have tackled and you will see how many R-users ask the same questions … how do I calculate the mean for groups in R? What you have seen is the heart of the {dplyr} package. We saw grouped operations, and we saw the use of summarise, mutate, case_when, distinct, filter, arrange, select, count, and tally. I will let you in on a secret: while these are core functions, there are others you could experiment with. Look up the cheat-sheet here.
Practice what we have done, with the {nycflights13} data-set perhaps, to get something familiar yet sufficiently different to test your fundamentals. Maybe pick a non-travel data-set altogether, perhaps one of the tidytuesday data-sets. What is that, you ask? Discover it for yourself here. Bon voyage! Don’t go too far, though, because we will be working with two new packages next week – {tidyr} and {lubridate}.
Why are our best and most experienced employees leaving prematurely?
The data available here includes information on several current and former employees of an anonymous organization. Fields in the data-set include employee attributes such as satisfaction_level, last_evaluation, average_montly_hours, Work_accident, left, promotion_last_5years, sales (the department), and salary.
Read the data in as hrdata and save it in RData format as hrdata.RData:
library(tidyverse)
library(tidylog)
library(here)
read_csv(
"https://aniruhil.github.io/avsr/teaching/dataviz/HR_comma_sep.csv"
) -> hrdata
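And a sketch of the save step, assuming the data folder convention used earlier with {here}:
save(hrdata, file = here("data", "hrdata.RData"))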
Now create some new columns in hrdata, recoding the 0/1 indicators into labelled categories and ordering salary.
hrdata %>%
mutate(
had_accident = case_when(
Work_accident == 0 ~ "No",
Work_accident == 1 ~ "Yes"),
left_company = case_when(
left == 0 ~ "No",
left == 1 ~ "Yes"),
recently_promoted = case_when(
promotion_last_5years == 0 ~ "No",
promotion_last_5years == 1 ~ "Yes"),
salary_group = ordered(
salary,
levels = c("low", "medium", "high"),
labels = c("Low", "Medium", "High"))
) -> hrdata
Save the result as hr01. Using the hr01 data-set, how many employees do you have per sales department? Which sales department has the most employees?
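One way to build the table below (a sketch):
hr01 %>%
  count(sales, sort = TRUE)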
## # A tibble: 10 x 2
## sales n
## <chr> <int>
## 1 sales 1007
## 2 technical 694
## 3 support 552
## 4 IT 270
## 5 hr 215
## 6 accounting 204
## 7 marketing 203
## 8 product_mng 198
## 9 RandD 121
## 10 management 88
hr01 %>%
group_by(sales) %>%
summarise(
mean.satisfaction = mean(satisfaction_level, na.rm = TRUE),
sd.satisfaction = sd(satisfaction_level, na.rm = TRUE),
mean.evaluation = mean(last_evaluation, na.rm = TRUE),
sd.evaluation = sd(last_evaluation, na.rm = TRUE)
)
## # A tibble: 10 x 5
## sales mean.satisfaction sd.satisfaction mean.evaluation sd.evaluation
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 accounting 0.403 0.257 0.695 0.199
## 2 hr 0.433 0.243 0.680 0.197
## 3 IT 0.411 0.273 0.733 0.193
## 4 management 0.410 0.266 0.732 0.204
## 5 marketing 0.453 0.249 0.692 0.200
## 6 product_mng 0.482 0.264 0.727 0.203
## 7 RandD 0.433 0.282 0.745 0.194
## 8 sales 0.447 0.259 0.712 0.198
## 9 support 0.451 0.264 0.728 0.192
## 10 technical 0.434 0.276 0.734 0.198
hr01 %>%
group_by(sales) %>%
summarise(mean.satisfaction = mean(satisfaction_level, na.rm = TRUE)) %>%
arrange(mean.satisfaction)
## # A tibble: 10 x 2
## sales mean.satisfaction
## <chr> <dbl>
## 1 accounting 0.403
## 2 management 0.410
## 3 IT 0.411
## 4 RandD 0.433
## 5 hr 0.433
## 6 technical 0.434
## 7 sales 0.447
## 8 support 0.451
## 9 marketing 0.453
## 10 product_mng 0.482
Accounting has the lowest mean satisfaction level. All departments have mean satisfaction in the 0.403 to 0.482 range, so there is some difference, but not a huge one.
Next, group average_montly_hours into four equal-width intervals via chop_evenly(). Then show the frequencies of each group.
library(santoku)
hr01 %>%
mutate(
grouped_hours = chop_evenly(average_montly_hours, groups = 4)
) -> hr01
hr01 %>%
group_by(grouped_hours) %>%
tally()
## # A tibble: 4 x 2
## grouped_hours n
## <fct> <int>
## 1 [126, 172) 1594
## 2 [172, 218) 88
## 3 [218, 264) 1037
## 4 [264, 310] 833
Thanks to the frenetic work of many individuals, the global spread of the Novel Coronavirus (COVID-19) has been tracked and the data made available for analysis. Yanchang Zhao is one such individual, and for this exercise we will use a specific version of his data that I have named cvdata.RData
and made available via Slack. Make sure to upload that data-set to your RStudio Cloud data
folder, and then to read it in via the load()
command. We can then answer a few questions. Note the contents:
country = name of the country
date = date of incidents as recorded
confirmed = cumulative count of the number of people who tested positive
deaths = cumulative count of the number of people lost to Covid-19
recovered = cumulative count of the number of people who recovered
Call the loaded data frame cv0310. First, how many countries have lost at least one
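A sketch that would produce the count below, assuming cv0310 holds one row per country:
cv0310 %>%
  filter(deaths >= 1, country != "Others") %>%
  summarise(ncountries = n_distinct(country), n = n())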
person to this tragedy? “Others” should not show up as one of the countries.
## # A tibble: 1 x 2
## ncountries n
## <int> <int>
## 1 24 24
cv0310 %>%
filter(deaths >= 1, country != "Others") %>%
group_by(country) %>%
arrange(-confirmed) %>%
select(country, confirmed) %>%
ungroup() %>%
top_n(10)
## # A tibble: 10 x 2
## country confirmed
## <fct> <int>
## 1 Mainland China 80757
## 2 Italy 10149
## 3 Iran (Islamic Republic of) 8042
## 4 Republic of Korea 7513
## 5 France 1784
## 6 Spain 1695
## 7 US 1670
## 8 Germany 1457
## 9 Japan 581
## 10 Switzerland 491
Now calculate each country’s fatality_rate, defined for our purposes as the percent of confirmed cases lost to the disease. Excluding “Others”, and only keeping countries that have had at least 10 confirmed cases, arrange the result to show the top-10 countries in descending order of fatality_rate.
cv0310 %>%
filter(country != "Others", confirmed >= 10) %>%
mutate(fatality_rate = (deaths / confirmed)*100) %>%
group_by(country) %>%
arrange(-fatality_rate) %>%
select(country, fatality_rate) %>%
ungroup() %>%
top_n(10)
## # A tibble: 10 x 2
## country fatality_rate
## <fct> <dbl>
## 1 Iraq 9.86
## 2 Italy 6.22
## 3 Argentina 5.88
## 4 San Marino 3.92
## 5 Mainland China 3.88
## 6 Iran (Islamic Republic of) 3.62
## 7 US 3.35
## 8 Philippines 3.03
## 9 Australia 2.80
## 10 Hong Kong SAR 2.5
Say we only want to focus on the Baltic countries (Estonia, Latvia, and Lithuania) as a unified group and compare this group to the ASEAN nations (Brunei, Cambodia, Indonesia, Laos, Malaysia, Myanmar, Philippines, Singapore, Thailand, and Vietnam). Use cv0310 to complete the following tasks: create a new column called region that only takes on two values – “Baltic” if the country is a Baltic country and “ASEAN” if the country is an ASEAN country – and then total up the confirmed cases per region.
cv0310 %>%
mutate(
region = case_when(
country %in% c("Estonia", "Latvia", "Lithuania") ~ "Baltic",
country %in% c("Brunei", "Cambodia", "Indonesia", "Laos",
"Malaysia", "Myanmar", "Philippines", "Singapore",
"Thailand", "Vietnam") ~ "ASEAN")
) %>%
group_by(region) %>%
summarise(total.confirmed = sum(confirmed))
## # A tibble: 3 x 2
## region total.confirmed
## <chr> <int>
## 1 ASEAN 405
## 2 Baltic 21
## 3 <NA> 118877