In this module you will learn how three very powerful packages work their magic. {dplyr} and {tidyr} have quickly gained a following because they allow you to manage seemingly unwieldy data to calculate quantities of interest with relative ease, and to flip data from “long” to “wide” formats or vice-versa. {stringr} is a sister-package that works well with strings and often useful for key data cleaning.
Our goal is to understand how two very powerful packages work their magic. Of recent vintage, {dplyr}
and {tidyr}
have quickly gained a following because they allow you to manage seemingly unwieldy data to calculate quantities of interest with relative ease. In this module we will see some of their core abilities.
You will usually end up needing and using some or all of these six functions:
select()
filter()
arrange()
summarise()
mutate()
group_by()
In order to understand how these commands work we rely on a simple data frame – all flights originating and departing from Columbus (Ohio) for January through December of 2017. Let us load the data and {dplyr}
.
library(here)
library(tidyverse)
library(tidylog, warn.conflicts = FALSE)
load(
here(
"data",
"cmhflights2017.RData"
)
)
You will want to install {tidylog}
, a great package by Benjamin Elbers, since tidylog shows you what each step of a basic {dplyr}
operation has done.
select(...)
Often your data frame will have more columns than you need to calculate some quantity – for example, the mean of some variable or the frequency distribution of some categorical variable. Say I only want the second through sixth column – Year
, Quarter
, Month
, DayofMonth
, and DayofWeek
. Before I do that, though, let me render all column names to be lowercase; running clean_names()
from the {janitor}
package after you read in any data-set is a good habit to acquire early on in your career.
library(janitor)
clean_names(cmhflights2017) -> cmh.df
Having done that, I could generate a data frame containing only the desired columns as follows:
cmh.df %>%
select(
year:day_of_week
) -> my.df
names(my.df)
[1] "year" "quarter" "month" "dayof_month"
[5] "day_of_week"
What if the columns were not sequentially located? In that case we would need to list each column we want, using the :
operator as shorthand for sequential columns.
cmh.df %>%
select(
year,
flight_date:reporting_airline,
tail_number
) -> my.df
names(my.df)
[1] "year" "flight_date" "reporting_airline"
[4] "tail_number"
Could we use column numbers instead? Absolutely, provided you know the column numbers. Yes, provided you know the column numbers of the columns you want (and these can be obtained by running names()
on the original data-set).
cmh.df %>%
select(
c(1, 3, 5, 7)
) -> my.df
names(my.df)
[1] "id" "quarter" "dayof_month" "flight_date"
You can also select columns in other ways, with contains
, starts_with
, ends_with
, any_of
, all_of
, matches
, and num_range
. As these function names suggest, contains()
will look for a specific string in the column names, starts_with()
and ends_with()
look for specific starting and ending strings, respectively, matches()
will look for a matching string, and any_of()
will look for any of the strings listed in quotes inside the parentheses while all_of()
will look for all listed strings and throw an error if even one is not found.
cmh.df %>%
select(
contains("carrier")
) -> my.df # using contains()
names(my.df)
[1] "carrier_delay"
cmh.df %>%
select(
starts_with("de")
) -> my.df # using starts_with()
names(my.df)
[1] "dest_airport_id" "dest_airport_seq_id"
[3] "dest_city_market_id" "dest"
[5] "dest_city_name" "dest_state"
[7] "dest_state_fips" "dest_state_name"
[9] "dest_wac" "dep_time"
[11] "dep_delay" "dep_delay_minutes"
[13] "dep_del15" "departure_delay_groups"
[15] "dep_time_blk"
cmh.df %>%
select(
ends_with("num")
) -> my.df # using ends_with()
names(my.df)
[1] "div1tail_num" "div2tail_num" "div3tail_num" "div4tail_num"
[5] "div5tail_num"
Now here is any_of()
, with the desired variables fed via creating a new object called myvars
myvars = c("year", "month", "dest")
cmh.df %>%
select(
any_of(
myvars
)
) -> my.df # using any_of()
names(my.df)
[1] "year" "month" "dest"
Note that destined
is not found but the other two are extracted any way. What about with all_of()
?
cmh.df %>%
select(
all_of(myvars)
) -> my.df
What if I misspelled any of the variables? Try running the code below
mymalvars = c("year", "month", "destined")
cmh.df %>%
select(
all_of(mymalvars)
) -> my.df
and you should see this message:
Error in `.fun()`:
ℹ In argument: `all_of(mymalvars)`.
Caused by error in `all_of()`:
Error: Can't subset columns that don't exist.
x Column `destined` doesn't exist.
cmh.df %>%
select(
matches(
"wheel"
)
) -> my.df # using matches()
names(my.df)
[1] "wheels_off" "wheels_on" "div1wheels_on"
[4] "div1wheels_off" "div2wheels_on" "div2wheels_off"
[7] "div3wheels_on" "div3wheels_off" "div4wheels_on"
[10] "div4wheels_off" "div5wheels_on" "div5wheels_off"
What about num_range()
?
billboard %>%
select(
num_range("wk", 1:4)
) -> my.bb
names(my.bb)
[1] "wk1" "wk2" "wk3" "wk4"
rename(...)
Variable names can be modified with the rename()
command. The sequence is easy to remember: newname
= oldname
, as shown below. Just remember: When you do this, the original variable name has been overwritten so do not invoke it in any subsequent code.
cmh.df %>%
rename(
flt_number = flight_number_reporting_airline
) %>%
select(
tail_number, flt_number
) -> df.rename
names(df.rename)
[1] "tail_number" "flt_number"
filter(...)
Recall that with the diamonds
data, in the previous week we used subset()
to select particular cases for analysis. With {dplyr}
, the filter()
command serves the same purpose. Say we only want records from January, i.e., month == 1
.
cmh.df %>%
filter(
month == 1
) -> my.df
What about only American Airline flights in January?
cmh.df %>%
filter(
month == 1 &
reporting_airline == "AA"
) -> my.df
What about United Airlines flights in January to CMH (Columbus, OH) from any originating airport?
cmh.df %>%
filter(
month == 1 & reporting_airline == "UA" & dest == "CMH"
) -> my.df
Note: Instead of &
you could use ,
… it serves the same purpose; test it out for yourself with the code below.
cmh.df %>%
filter(
month == 1, reporting_airline == "UA", dest == "CMH"
) -> my.df
filter(...)
with %in%
What if I wanted a more complicated filter, say, flights in January or February either to CMH (Columbus, Ohio) or to ORD (the O’Hare airport in Chicago, Illinois)?
cmh.df %>%
filter(
month %in% c(1, 2) &
reporting_airline == "UA" &
dest %in% c("CMH", "ORD")
) -> my.df
Some times you may also want to filter with the opposite of %in%
, as in not in
some list. You can do this as well, but note where the !
is applied – to the column-name.
cmh.df %>%
filter(
month %in% c(1, 2) &
reporting_airline == "UA" &
!dest %in% c("CMH", "ORD")
) -> my.df
Now let us say I wanted to arrange the resulting data frame in ascending
order of departure delays. How might I do that?
arrange(...)
my.df %>%
arrange(
dep_delay_minutes
) -> my.df2
And now in descending
order of delays. Note the use of the minus symbol -
to reverse the sort order
my.df %>%
arrange(
-dep_delay_minutes
) -> my.df2
We could tweak this further, perhaps asking R to first sort by descending order of departure delays to CMH and then to ORD. Note that dest
is ascending and DepDelayMinutes
is descending
my.df %>%
arrange(
dest, -dep_delay_minutes
) -> my.df2
%>%
So far, we have seen each function in isolation. Now we streamline things a bit so that we only end up with the columns we want, organized as we want them to be. That is, we can string together multiple functions to have a slim data frame of only those columns we selected, filtered, and arranged according to our wishes.
cmh.df %>%
select(
month, reporting_airline, dest, dep_delay_minutes
) %>%
filter(
month %in% c(1, 2) &
reporting_airline == "UA" &
dest %in% c("CMH", "ORD")
) %>%
arrange(
month, dest, desc(dep_delay_minutes)
) -> my.df3
Here, the end result is a data frame arranged by month, then within month by destination, and then finally by descending order of flight delay times.
group_by(...)
You often want to carry out some operations/calculations by specific groups, for example, calculate the mean or median age of the men in our sample versus the women, find out how many unique destinations does each airline fly to from Columbus, and so on. The group_by()
command comes in very handy here to set the grouping structure.
Let us calculate the Mean and Median delay times. To do this, we will lean on the group_by()
function to calculate median and mean delays by month and destination. While we are at it, let us see the standard deviation as well.
cmh.df %>%
select(
month, reporting_airline, dest, dep_delay_minutes
) %>%
filter(
month %in% c(1, 2),
reporting_airline == "UA",
dest %in% c("CMH", "ORD")
) %>%
group_by(month, dest) %>%
summarise(
median.delay = median(dep_delay_minutes, na.rm = TRUE),
mean.delay = mean(dep_delay_minutes, na.rm = TRUE),
sd.delay = sd(dep_delay_minutes, na.rm = TRUE)
) %>%
arrange(month, dest) -> tab.01
What if wanted to know how many airports can you fly to from Columbus, and how many flights there were to each airport in July?
cmh.df %>%
select(
month, reporting_airline, dest, dep_delay_minutes, origin
) %>%
filter(
month %in% c(7),
origin %in% c("CMH")
) %>%
group_by(dest) %>%
summarise(
no.of.flights = n()
) -> tab.02
How many distinct airlines fly out of Columbus?
cmh.df %>%
select(
month, reporting_airline, dest, dep_delay_minutes, origin
) %>%
filter(
origin %in% c("CMH")
) %>%
group_by(reporting_airline) %>%
summarise(
no.of.airlines = n_distinct(reporting_airline)
) -> tab.03
And how many flights did each airline operate?
cmh.df %>%
select(
month, reporting_airline, dest, dep_delay_minutes, origin
) %>%
filter(
origin %in% c("CMH")
) %>%
group_by(reporting_airline) %>%
summarise(
no.of.flights = n()
) -> tab.04
Of all of these airlines, which one has the highest mean departure delay?
cmh.df %>%
select(
month, reporting_airline, dest, dep_delay_minutes, origin
) %>%
filter(
origin %in% c("CMH")
) %>%
group_by(reporting_airline) %>%
summarise(
mean.delay = mean(dep_delay_minutes, na.rm = TRUE)
) %>%
arrange(
desc(mean.delay)
) -> tab.05
ungroup(...)
If you want to remove a grouping structure you merely need to specify ungroup()
, as in
cmh.df %>%
select(
month, reporting_airline, dest, dep_delay_minutes, origin
) %>%
filter(
origin %in% c("CMH")
) %>%
group_by(reporting_airline) %>%
summarise(
median.delay = median(dep_delay_minutes, na.rm = TRUE)
) %>%
ungroup() -> tab.06
ungroup()
is often handy when you need to pipe several commands but the grouping structure is different for each. For example, maybe I want to calculate departure delays by unique airline, and then by unique combinations of airline & origin.
cmh.df %>%
select(
month, reporting_airline, dest, dep_delay_minutes, origin
) %>%
group_by(reporting_airline) %>%
mutate(
mean.delay.1 = mean(dep_delay_minutes, na.rm = TRUE)
) %>%
ungroup() %>%
group_by(reporting_airline, origin) %>%
mutate(
mean.delay.2 = mean(dep_delay_minutes, na.rm = TRUE)
)
# A tibble: 47,787 × 7
# Groups: reporting_airline, origin [53]
month reporting_airline dest dep_delay_minutes origin mean.delay.1
<dbl> <chr> <chr> <dbl> <chr> <dbl>
1 1 DL ATL 0 CMH 8.90
2 1 DL ATL 1 CMH 8.90
3 1 DL CMH 0 ATL 8.90
4 1 DL ATL 0 CMH 8.90
5 1 DL CMH 0 ATL 8.90
6 1 DL CMH 0 LAX 8.90
7 1 DL CMH 0 ATL 8.90
8 1 DL ATL 0 CMH 8.90
9 1 DL LAX 0 CMH 8.90
10 1 DL CMH 0 ATL 8.90
# … with 47,777 more rows, and 1 more variable: mean.delay.2 <dbl>
distinct(...)
Since the preceding code yields duplicate rows, I can use distinct
to retain only unique rows
cmh.df %>%
select(
month, reporting_airline, dest, dep_delay_minutes, origin
) %>%
group_by(reporting_airline) %>%
mutate(
mean.delay.1 = mean(dep_delay_minutes, na.rm = TRUE)
) %>%
ungroup() %>%
group_by(reporting_airline, origin) %>%
mutate(
mean.delay.2 = mean(dep_delay_minutes, na.rm = TRUE)
) %>%
distinct()
# A tibble: 13,332 × 7
# Groups: reporting_airline, origin [53]
month reporting_airline dest dep_delay_minutes origin mean.delay.1
<dbl> <chr> <chr> <dbl> <chr> <dbl>
1 1 DL ATL 0 CMH 8.90
2 1 DL ATL 1 CMH 8.90
3 1 DL CMH 0 ATL 8.90
4 1 DL CMH 0 LAX 8.90
5 1 DL LAX 0 CMH 8.90
6 1 DL CMH 7 ATL 8.90
7 1 DL CMH 1 LAX 8.90
8 1 DL ATL 2 CMH 8.90
9 1 DL CMH 105 ATL 8.90
10 1 DL CMH 1 ATL 8.90
# … with 13,322 more rows, and 1 more variable: mean.delay.2 <dbl>
mutate(...)
Thus far we have been creating a summary data frame containing whatever we calculated – the frequencies, averages, and so on. Some times, however, you may want these calculations to be be appended to the original data frame and stored as new columns. This can be done with mutate()
, as shown below. In this particular example, I am calculating median departure delay by airline (for each airport it departs from) and then including this as a column called Median.Delay
in cmhflights.
cmh.df %>%
group_by(
reporting_airline, origin
) %>%
mutate(
median.delay = median(dep_delay_minutes, na.rm = TRUE)
) -> cmhflights
Note the 111th column. Now, if we wanted to add this column but drop all other columns not being used in the calculation, we could do that via transmute()
. Be careful with this one though.
cmh.df -> mut.df # cloning cmh.df into mut.df
mut.df %>%
group_by(reporting_airline, dest) %>%
transmute(
median.arr.delay = median(arr_delay_minutes, na.rm = TRUE)
) -> mut.df2
names(mut.df2)
[1] "reporting_airline" "dest" "median.arr.delay"
case_when(...)
Just as one can use ifelse()
to generate new variables or modify existing variables, so can one do the same thing with case_when()
from dplyr
. Here are a few examples. First, working with categorical variables, where I create a new variable that flags whether the flight originated in BWI, DCA, OR IAH versus some other airport.
cmh.df %>%
mutate(
DC_area = case_when(
origin %in% c("BWI","DCA", "IAH") ~ "DC Area",
!origin %in% c("BWI","DCA", "IAH") ~ "Not in DC Area"
)
) -> cw.df
table(cw.df$origin, cw.df$DC_area)
DC Area Not in DC Area
ATL 0 3794
BNA 0 692
BOS 0 623
BWI 1301 0
CLT 0 3
CMH 0 23893
DAL 0 369
DCA 674 0
DEN 0 1224
DFW 0 1482
DTW 0 755
EWR 0 847
FLL 0 321
HOU 0 16
IAH 760 0
LAS 0 1074
LAX 0 641
LGA 0 173
MCO 0 1488
MDW 0 1992
MSP 0 945
MSY 0 35
OAK 0 365
ORD 0 1089
PHL 0 57
PHX 0 1075
RSW 0 657
STL 0 655
TPA 0 787
The same result would obtain if you had run the following code:
cmh.df %>%
mutate(
DC_area = case_when(
origin %in% c("BWI","DCA", "IAH") ~ "DC Area",
TRUE ~ "Not in DC Area"
)
) -> cw.df
table(cw.df$origin, cw.df$DC_area)
DC Area Not in DC Area
ATL 0 3794
BNA 0 692
BOS 0 623
BWI 1301 0
CLT 0 3
CMH 0 23893
DAL 0 369
DCA 674 0
DEN 0 1224
DFW 0 1482
DTW 0 755
EWR 0 847
FLL 0 321
HOU 0 16
IAH 760 0
LAS 0 1074
LAX 0 641
LGA 0 173
MCO 0 1488
MDW 0 1992
MSP 0 945
MSY 0 35
OAK 0 365
ORD 0 1089
PHL 0 57
PHX 0 1075
RSW 0 657
STL 0 655
TPA 0 787
Similarly, we could use numerical variables to create new ones. For example, let us call a flight delayed
if dep_delay_minutes
is \(> 30\).
cw.df %>%
mutate(
delayed = case_when(
dep_delay_minutes > 30 ~ "Yes",
dep_delay_minutes < 30 ~ "No",
is.na(dep_delay_minutes) ~ "Unknown"
)
) -> cw.df
case_when()
is often useful when you have to chain multiple ifelse()
commands since managing a large ifelse()
chain can be tiresome. Further, you should remember that %in%
can be used to ease coding in case_when()
, as shown below. Say I want to group flights into weekday versus weekend flights. Note that the week starts on Sundays hence weekdays are defined here as days 2 through 6 (i.e., Monday through Friday).
cw.df %>%
mutate(
weekend = case_when(
day_of_week %in% c(2:6) ~ "No",
TRUE ~ "Yes"
)
) -> cw.df
unnest(...)
Every now and then the data you have to work with includes some list-columns – columns that are like mini-data frames themselves. Suddenly R functions that would otherwise work will not execute as desired. The unnest
function from the tidyverse
unpacks these list-columns into a regular tibble. First a small data frame that shows a list-column.
library(tidyverse)
data_frame(
x = 1:3,
y = c("a", "d,e,f", "g,h")
) -> bad.df
If I want to clean up list-column y
, I can do that as shown below.
bad.df %>%
transform(
y = strsplit(y, ",")
) %>%
unnest(y) -> good.df
Now see a slightly more complicated list-column, with y
made of named rows …. a and b, a with 1 value and b with 2 values.
data_frame(
x = 1:2,
y = list(a = 1, b = 3:4)
) -> badder.df
To clean this up we could do:
badder.df %>%
unnest(
.,
.id = "name"
) -> gooder.df
Most packages you use in R call for data to be {tidy}
. What are tidy data?
Figure 1: Artwork by @allison_horst
Here are some untidy data
read.delim(
file = "http://stat405.had.co.nz/data/pew.txt",
header = TRUE,
stringsAsFactors = FALSE,
check.names = FALSE
) -> pew
head(pew)
religion <$10k $10-20k $20-30k $30-40k $40-50k $50-75k
1 Agnostic 27 34 60 81 76 137
2 Atheist 12 27 37 52 35 70
3 Buddhist 27 21 30 34 33 58
4 Catholic 418 617 732 670 638 1116
5 Don’t know/refused 15 14 15 11 10 35
6 Evangelical Prot 575 869 1064 982 881 1486
$75-100k $100-150k >150k Don't know/refused
1 122 109 84 96
2 73 59 74 76
3 62 39 53 54
4 949 792 633 1489
5 21 17 18 116
6 949 723 414 1529
# remotes::install_github("garrettgman/DSR")
library(DSR)
table2
# A tibble: 12 × 4
country year key value
<fct> <int> <fct> <int>
1 Afghanistan 1999 cases 745
2 Afghanistan 1999 population 19987071
3 Afghanistan 2000 cases 2666
4 Afghanistan 2000 population 20595360
5 Brazil 1999 cases 37737
6 Brazil 1999 population 172006362
7 Brazil 2000 cases 80488
8 Brazil 2000 population 174504898
9 China 1999 cases 212258
10 China 1999 population 1272915272
11 China 2000 cases 213766
12 China 2000 population 1280428583
table3
# A tibble: 6 × 3
country year rate
<fct> <int> <chr>
1 Afghanistan 1999 745/19987071
2 Afghanistan 2000 2666/20595360
3 Brazil 1999 37737/172006362
4 Brazil 2000 80488/174504898
5 China 1999 212258/1272915272
6 China 2000 213766/1280428583
table4
# A tibble: 3 × 3
country `1999` `2000`
<fct> <int> <int>
1 Afghanistan 745 2666
2 Brazil 37737 80488
3 China 212258 213766
table5
# A tibble: 3 × 3
country `1999` `2000`
<fct> <int> <int>
1 Afghanistan 19987071 20595360
2 Brazil 172006362 174504898
3 China 1272915272 1280428583
What would a tidy version of the examples in (b) look like?
table1
# A tibble: 6 × 4
country year cases population
<fct> <int> <int> <int>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
Here, we have
The {tidyr}
package is designed to make messy data tidy in order to proceed with analysis. It has four primary functions that we will see in turn.
You will run into variables that you want to break into smaller pieces. For instance, take table3
where rate
is a ratio of two variables, cases
and population
, each of which should be in their own columns. How can we achieve that?
library(tidyr)
table3 %>%
separate(
col = rate,
into = c("cases", "population"),
sep = "/",
remove = TRUE,
convert = TRUE
) -> s.table3
s.table3
# A tibble: 6 × 4
country year cases population
<fct> <int> <int> <int>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
col =
to identify which column must be splitinto = c("", "")
/
remove = TRUE
. Set this to `FALSE if you want to retain the original columnconvert = TRUE
. If you omit this switch then the new columns will be created as characters.unite(...)
The unite(...)
command does the exact opposite of separate, combining columns
s.table3 %>%
unite(
"rate",
cases, population,
sep = ":",
remove = FALSE
) -> u.table3
u.table3
# A tibble: 6 × 5
country year rate cases population
<fct> <int> <chr> <int> <int>
1 Afghanistan 1999 745:19987071 745 19987071
2 Afghanistan 2000 2666:20595360 2666 20595360
3 Brazil 1999 37737:172006362 37737 172006362
4 Brazil 2000 80488:174504898 80488 174504898
5 China 1999 212258:1272915272 212258 1272915272
6 China 2000 213766:1280428583 213766 1280428583
Here you specify the name of the new column first, and then list the columns to be combined, specify the separator, and whether the original columns should be removed or not.
spread(...)
and pivot_wider(...)
Revisit table2
table2
# A tibble: 12 × 4
country year key value
<fct> <int> <fct> <int>
1 Afghanistan 1999 cases 745
2 Afghanistan 1999 population 19987071
3 Afghanistan 2000 cases 2666
4 Afghanistan 2000 population 20595360
5 Brazil 1999 cases 37737
6 Brazil 1999 population 172006362
7 Brazil 2000 cases 80488
8 Brazil 2000 population 174504898
9 China 1999 cases 212258
10 China 1999 population 1272915272
11 China 2000 cases 213766
12 China 2000 population 1280428583
I want to flip this into a format where the data frame has two columns per country per year – cases
and population
. How can I do this?
table2 %>%
spread(
key,
value
) -> sp.table2
sp.table2
# A tibble: 6 × 4
country year cases population
<fct> <int> <int> <int>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
Notice that the key
column had two unique values – cases and population – so two new columns were created. Notice also that tidyr knew to populate the value of each of the new cells with what it finds in the value
column. The result is what we call data in a wide format.
gather(...)
and pivot_longer(...)
gather(...)
does the opposite of spread(...)
, converting wide data into what we call the long form. Revisit table4
table4
# A tibble: 3 × 3
country `1999` `2000`
<fct> <int> <int>
1 Afghanistan 745 2666
2 Brazil 37737 80488
3 China 212258 213766
Note that each country has two columns, one per year. This is untidy. So we’ll flip this tibble as follows:
table4 %>%
gather(
"year",
"cases",
2:3
) -> g.table4
g.table4
# A tibble: 6 × 3
country year cases
<fct> <chr> <int>
1 Afghanistan 1999 745
2 Brazil 1999 37737
3 China 1999 212258
4 Afghanistan 2000 2666
5 Brazil 2000 80488
6 China 2000 213766
Notice that you had to specify the columns you wanted to gather and we did this via 2:3
. If you forget to specify columns it creates something else altogether so be careful.
This is a slightly modified version of Garrett Grolemund’s WHO Data Case Study.
At this website you can find data provided by countries to WHO and estimates of TB burden generated by WHO for the Global Tuberculosis Report. I want you to gather the following data (all are in the csv format):
Read the data dictionary carefully. Then read-in the case notifications data.
The database has too many variables (175 or so) and so I’ll have you use just a few for this practice task.
library(readr)
read_csv(
here("data", "TB_notifications_2018-02-07.csv")
) -> tb
Let us use dplyr
to select and filter the tb database so that it meets the following criteria: (i) observations from 2012 or earlier years are kept, and (ii) keep only columns 1 through 6 and 25 (new_sp_m04) through 44 (new_sp_fu).
library(dplyr)
tb %>%
select(
c(1:6, 25:44)
) %>%
filter(
year <= 2012
) -> tb2
glimpse(tb2)
Rows: 7,023
Columns: 26
$ country <chr> "Afghanistan", "Afghanistan", "Afghanistan", "A…
$ iso2 <chr> "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF",…
$ iso3 <chr> "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG"…
$ iso_numeric <chr> "004", "004", "004", "004", "004", "004", "004"…
$ g_whoregion <chr> "EMR", "EMR", "EMR", "EMR", "EMR", "EMR", "EMR"…
$ year <dbl> 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987,…
$ new_sp_m04 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ new_sp_m514 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ new_sp_m014 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ new_sp_m1524 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ new_sp_m2534 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ new_sp_m3544 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ new_sp_m4554 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ new_sp_m5564 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ new_sp_m65 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ new_sp_mu <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ new_sp_f04 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ new_sp_f514 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ new_sp_f014 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ new_sp_f1524 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ new_sp_f2534 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ new_sp_f3544 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ new_sp_f4554 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ new_sp_f5564 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ new_sp_f65 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ new_sp_fu <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
Now, new_sp_xx
refer to new pulmonary smear-positive cases. The m
refers to males and f
refers to females. Finally, 04
is the 0-4 age group, 514
is the 5-14 age group, and so on. new_sp_mu
is males with an unknown age-group and new_sp_fu
is females with an unknown age-group. This is the messy part of the data. Ideally we’d have a column for gender, flagged as Male, Female, Unknown
, and a separate column for age-groups flagged 0-4, 5-14, 0-14, and so on
. Time to tidy these data.
Since the data are in the wide format we can use gather to flip it into the long format.
tb2 %>%
gather(
"group",
"cases",
7:26
) -> tb3
glimpse(tb3)
Rows: 140,460
Columns: 8
$ country <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Af…
$ iso2 <chr> "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", …
$ iso3 <chr> "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG",…
$ iso_numeric <chr> "004", "004", "004", "004", "004", "004", "004",…
$ g_whoregion <chr> "EMR", "EMR", "EMR", "EMR", "EMR", "EMR", "EMR",…
$ year <dbl> 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, …
$ group <chr> "new_sp_m04", "new_sp_m04", "new_sp_m04", "new_s…
$ cases <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
Now we split the new_sp_xx
values into three columns
tb3 %>%
separate(
col = group,
into = c("new", "sp", "sexage"),
sep = "_"
) -> tb4
glimpse(tb4)
Rows: 140,460
Columns: 10
$ country <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Af…
$ iso2 <chr> "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", …
$ iso3 <chr> "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG",…
$ iso_numeric <chr> "004", "004", "004", "004", "004", "004", "004",…
$ g_whoregion <chr> "EMR", "EMR", "EMR", "EMR", "EMR", "EMR", "EMR",…
$ year <dbl> 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, …
$ new <chr> "new", "new", "new", "new", "new", "new", "new",…
$ sp <chr> "sp", "sp", "sp", "sp", "sp", "sp", "sp", "sp", …
$ sexage <chr> "m04", "m04", "m04", "m04", "m04", "m04", "m04",…
$ cases <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
and then split sexage
into two columns by specifying that the split should occur after the first value (which in our case will be after m
or f
).
tb4 %>%
separate(
col = sexage,
into = c("sex", "agegroup"),
sep = 1
) -> tb5
glimpse(tb5)
Rows: 140,460
Columns: 11
$ country <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Af…
$ iso2 <chr> "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", …
$ iso3 <chr> "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG",…
$ iso_numeric <chr> "004", "004", "004", "004", "004", "004", "004",…
$ g_whoregion <chr> "EMR", "EMR", "EMR", "EMR", "EMR", "EMR", "EMR",…
$ year <dbl> 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, …
$ new <chr> "new", "new", "new", "new", "new", "new", "new",…
$ sp <chr> "sp", "sp", "sp", "sp", "sp", "sp", "sp", "sp", …
$ sex <chr> "m", "m", "m", "m", "m", "m", "m", "m", "m", "m"…
$ agegroup <chr> "04", "04", "04", "04", "04", "04", "04", "04", …
$ cases <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
The beauty of dplyr and tidyr is the ability to combine all these steps into a single piped command!
tb %>%
select(
c(1:6, 25:44)
) %>%
filter(
year <= 2012
) %>%
gather(
"group",
"cases",
7:26
) %>%
separate(
col = group,
into = c("new", "sp", "sexage"),
sep = "_"
) %>%
separate(
col = sexage,
into = c("sex", "agegroup"),
sep = 1
) %>%
select(
c(1:6, 9:11)
) -> tb.df
glimpse(tb.df)
Rows: 140,460
Columns: 9
$ country <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Af…
$ iso2 <chr> "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", …
$ iso3 <chr> "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG",…
$ iso_numeric <chr> "004", "004", "004", "004", "004", "004", "004",…
$ g_whoregion <chr> "EMR", "EMR", "EMR", "EMR", "EMR", "EMR", "EMR",…
$ year <dbl> 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, …
$ sex <chr> "m", "m", "m", "m", "m", "m", "m", "m", "m", "m"…
$ agegroup <chr> "04", "04", "04", "04", "04", "04", "04", "04", …
$ cases <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
Now, we can clean-up the codes for easier interpretation.
tb.df$sex.f <- ifelse(tb.df$sex == "m", "Males", "Females")
tb.df$agegroup.f <- ifelse(tb.df$agegroup == "04", "0-4",
ifelse(tb.df$agegroup == "014", "0-14",
ifelse(tb.df$agegroup == "1524", "15-24",
ifelse(tb.df$agegroup == "2534", "25-34",
ifelse(tb.df$agegroup == "3544", "35-44",
ifelse(tb.df$agegroup == "4554", "45-54",
ifelse(tb.df$agegroup == "514", "5-14",
ifelse(tb.df$agegroup == "5564", "55-64",
ifelse(tb.df$agegroup == "65", "65+", "Uknown")
))))))))
glimpse(tb.df)
Rows: 140,460
Columns: 11
$ country <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Af…
$ iso2 <chr> "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", …
$ iso3 <chr> "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG",…
$ iso_numeric <chr> "004", "004", "004", "004", "004", "004", "004",…
$ g_whoregion <chr> "EMR", "EMR", "EMR", "EMR", "EMR", "EMR", "EMR",…
$ year <dbl> 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, …
$ sex <chr> "m", "m", "m", "m", "m", "m", "m", "m", "m", "m"…
$ agegroup <chr> "04", "04", "04", "04", "04", "04", "04", "04", …
$ cases <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ sex.f <chr> "Males", "Males", "Males", "Males", "Males", "Ma…
$ agegroup.f <chr> "0-4", "0-4", "0-4", "0-4", "0-4", "0-4", "0-4",…
The same thing, albeit with case_when()
…
tb.df %>%
mutate(
sex.f = case_when(
sex == "m" ~ "Males",
sex == "f" ~ "Females"
),
agegroup.f = case_when(
agegroup == "014" ~ "0-14",
agegroup == "1524" ~ "15-24",
agegroup == "2534" ~ "25-34",
agegroup == "3544" ~ "35-44",
agegroup == "4554" ~ "45-54",
agegroup == "514" ~ "5-14",
agegroup == "5564" ~ "55-64",
agegroup == "65" ~ "65+",
TRUE ~ "Uknown")
) -> tb.df
glimpse(tb.df)
Rows: 140,460
Columns: 11
$ country <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Af…
$ iso2 <chr> "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", …
$ iso3 <chr> "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG",…
$ iso_numeric <chr> "004", "004", "004", "004", "004", "004", "004",…
$ g_whoregion <chr> "EMR", "EMR", "EMR", "EMR", "EMR", "EMR", "EMR",…
$ year <dbl> 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, …
$ sex <chr> "m", "m", "m", "m", "m", "m", "m", "m", "m", "m"…
$ agegroup <chr> "04", "04", "04", "04", "04", "04", "04", "04", …
$ cases <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ sex.f <chr> "Males", "Males", "Males", "Males", "Males", "Ma…
$ agegroup.f <chr> "Uknown", "Uknown", "Uknown", "Uknown", "Uknown"…
How many cases per country in the 1980-1990 period?
tb.df %>%
filter(
year >= 1980 | year <= 1990
) %>%
group_by(country) %>%
summarise(
ncases = sum(cases, na.rm = TRUE)
) %>%
arrange(-ncases) -> tbtab.01
glimpse(tbtab.01)
Rows: 219
Columns: 2
$ country <chr> "India", "China", "Indonesia", "South Africa", "Bang…
$ ncases <dbl> 6492850, 5643738, 1886657, 1435147, 1072431, 915525,…
What about by sex and age-group?
tb.df %>%
filter(
year >= 1980 | year <= 1990
) %>%
group_by(
country, sex, agegroup
) %>%
summarise(
ncases = sum(cases, na.rm = TRUE)
) %>%
arrange(-ncases) -> tbtab.02
This is just one example of how dplyr
and tidyr
– two packages from the tidyverse
– can be used to deal with what seems to be quite messy data. The more you use it, the easier it becomes to figure out solutions to even very complicated messy data setups.
{forcats}
The tidyverse team has another package for working with factors – forcats
. How is it useful?
For example, say I have the mtcars
data and want to see the distribution of the number of cylinders a car has.
data(mtcars)
cyl
is stored as a numeric so I can flip this into a factor via two ways, with base factor()
or then with fct_recode()
.
library(forcats)
mtcars %>%
mutate(
f.cyl = factor(cyl)
) -> mtcars.df
mtcars.df %>%
mutate(
f.cyl2 = fct_recode(
f.cyl,
Four = '4',
Six = '6',
Eight = '8'
)
) -> mtcars.df
I am going to plot mpg
per car but before I do so, I would like to know what model is being referred to. The row.names(mtcars) -> mtcars$model
command pulls the model details stored as row.names in mtcars
and creates a new column called model
with this information. Note: You will need to install the {hrbrthemes}
package before you plot.
mtcars.df$model <- row.names(mtcars)
library(ggplot2)
ggplot(
mtcars.df,
aes(
x = mpg,
y = model
)
) +
geom_point() +
labs(
x = "Miles per gallon",
y = "Car Model"
) +
hrbrthemes::theme_ipsum_rc() +
theme(legend.position = "none")
This isn’t very helpful since the models are not arranged in ascending/descending order of miles per gallon. We can easily fix this problem with fct_reorder(...)
…
ggplot(
mtcars.df,
aes(
x = mpg,
fct_reorder(
.f = model,
.x = mpg
)
)
) +
geom_point() +
labs(
x = "Miles per gallon",
y = "Car Model"
) +
hrbrthemes::theme_ipsum_rc() +
theme(legend.position = "none")
If I want the plot to be in descending order of mpg
,
ggplot(
mtcars.df,
aes(
x = mpg,
fct_reorder(
.f = model,
.x = -mpg
)
)
) +
geom_point() +
labs(
x = "Miles per gallon",
y = "Car Model"
) +
hrbrthemes::theme_ipsum_rc() +
theme(legend.position = "none")
I can do this with bar-plots too.
ggplot(
mtcars.df,
aes(f.cyl2)
) +
geom_bar(
aes(
fill = f.cyl2
)
) +
theme(legend.position = "none") +
labs(
x = "Number of cylinders",
y = "Frequency"
) +
hrbrthemes::theme_ipsum_rc() +
theme(legend.position = "none")
And now the bars ordered by decreasing frequency, and then by increasing frequency.
ggplot(
mtcars.df,
aes(fct_infreq(f.cyl2))
) +
geom_bar(
aes(fill = f.cyl2)
) +
theme(legend.position = "none") +
labs(
x = "Number of cylinders",
y = "Frequency"
) +
hrbrthemes::theme_ipsum_rc() +
theme(legend.position = "none")
ggplot(
mtcars.df,
aes(fct_rev(fct_infreq(f.cyl2)))
) +
geom_bar(
aes(fill = f.cyl2)
) +
theme(legend.position = "none") +
labs(
x = "Number of cylinders",
y = "Frequency"
) +
hrbrthemes::theme_ipsum_rc() +
theme(legend.position = "none")
Note the use of fct_infreq()
for descending and then fct_rev(fct_infreq())
for ascending.
Take the pew
data shown to you early on in this module and using dplyr and tidyr, tidy it up so that the final data-set (a) has three columns — religion, income-group, frequency, and (b) is arranged in ascending order of religion.
Tidy these weather data
read.delim(
file = "http://stat405.had.co.nz/data/weather.txt",
stringsAsFactors = FALSE
) -> weather
The tidy data should have the days as rows and TMIN and TMAX as columns (i.e., you are going from wide to long).
Download these data from the World bank’s World Development Indicators and Climate Change Knowledge Portal on climate systems, exposure to climate impacts, resilience, greenhouse gas emissions, and energy use. The data are in a worksheet called, wait for it, Data
, and need to be tidied up such that for every country you have one row per year and then columns for each of the 58 measures (called Series name
in the database).
If you really want to challenge yourself, create separate data frames from the Country
worksheet and the Series
worksheet. Then merge these with data frame you tidied up.
If you really dig into dplyr you find little nuggets here and there that are not hidden but for some reason few ever seem to blog about them or post examples where they are in use. I’ve cobbled together a few examples, inspired largely by Suzan Baert and then from the teradata documentation. It is also useful to bear in mind that packages continue to evolve and can, at times, deprecate or render obsolete code that worked in the past.
select_if
, select_at
, select_all
Let us start with select_if
. This verb will allow you to select columns to operate on if some criterion is met. For example, perhaps I only want to pull out character columns:
cmh.df %>%
select_if(is.character) %>%
names()
[1] "reporting_airline" "iata_code_reporting_airline"
[3] "tail_number" "origin"
[5] "origin_city_name" "origin_state"
[7] "origin_state_fips" "origin_state_name"
[9] "dest" "dest_city_name"
[11] "dest_state" "dest_state_fips"
[13] "dest_state_name" "crs_dep_time"
[15] "dep_time" "dep_time_blk"
[17] "wheels_off" "wheels_on"
[19] "crs_arr_time" "arr_time"
[21] "arr_time_blk" "cancellation_code"
[23] "first_dep_time" "div1airport"
[25] "div1wheels_on" "div1wheels_off"
[27] "div1tail_num"
library(lubridate)
cmh.df %>%
select_if(is.Date) %>%
names()
[1] "flight_date"
You can also include seemingly complicated conditions, as shown below where we ask that not only should the column be numeric but that the column’s mean should exceed 10 (an arbitrary value chosen to showcase this example).
cmh.df %>%
select_if(
~is.numeric(.) & mean(., na.rm = TRUE) > 10
) %>%
names()
[1] "year"
[2] "dayof_month"
[3] "dot_id_reporting_airline"
[4] "flight_number_reporting_airline"
[5] "origin_airport_id"
[6] "origin_airport_seq_id"
[7] "origin_city_market_id"
[8] "origin_wac"
[9] "dest_airport_id"
[10] "dest_airport_seq_id"
[11] "dest_city_market_id"
[12] "dest_wac"
[13] "dep_delay_minutes"
[14] "taxi_out"
[15] "arr_delay_minutes"
[16] "crs_elapsed_time"
[17] "actual_elapsed_time"
[18] "air_time"
[19] "distance"
[20] "carrier_delay"
[21] "nas_delay"
[22] "late_aircraft_delay"
[23] "total_add_g_time"
[24] "longest_add_g_time"
[25] "div_actual_elapsed_time"
[26] "div_arr_delay"
[27] "div_distance"
[28] "div1airport_id"
[29] "div1airport_seq_id"
[30] "div1total_g_time"
[31] "div1longest_g_time"
You can also select with other conditions, for example, if the column has fewer than some number of distinct values (here I chose 5)
cmh.df %>%
select_if(
~n_distinct(.) < 5
) %>%
names()
[1] "year" "quarter"
[3] "dep_del15" "arr_del15"
[5] "cancelled" "cancellation_code"
[7] "diverted" "flights"
[9] "div_airport_landings" "div_reached_dest"
[11] "div2airport" "div2airport_id"
[13] "div2airport_seq_id" "div2wheels_on"
[15] "div2total_g_time" "div2longest_g_time"
[17] "div2wheels_off" "div2tail_num"
[19] "div3airport" "div3airport_id"
[21] "div3airport_seq_id" "div3wheels_on"
[23] "div3total_g_time" "div3longest_g_time"
[25] "div3wheels_off" "div3tail_num"
[27] "div4airport" "div4airport_id"
[29] "div4airport_seq_id" "div4wheels_on"
[31] "div4total_g_time" "div4longest_g_time"
[33] "div4wheels_off" "div4tail_num"
[35] "div5airport" "div5airport_id"
[37] "div5airport_seq_id" "div5wheels_on"
[39] "div5total_g_time" "div5longest_g_time"
[41] "div5wheels_off" "div5tail_num"
[43] "x110"
Don’t forget everything()
…You can rearrange columns more swiftly too, as shown below where the first two columns I want are explicitly specified and then all remaining columns via everything()
cmh.df %>%
select(
flight_date, flight_number_reporting_airline,
everything()
) -> df.ev
head(df.ev[, c(1:6)], 15)
flight_date flight_number_reporting_airline id year quarter month
1 2017-01-02 2419 1 2017 1 1
2 2017-01-02 2583 1 2017 1 1
3 2017-01-03 977 1 2017 1 1
4 2017-01-03 977 1 2017 1 1
5 2017-01-03 1119 1 2017 1 1
6 2017-01-03 1170 1 2017 1 1
7 2017-01-03 1276 1 2017 1 1
8 2017-01-03 1276 1 2017 1 1
9 2017-01-03 1327 1 2017 1 1
10 2017-01-03 1405 1 2017 1 1
11 2017-01-03 1405 1 2017 1 1
12 2017-01-03 1533 1 2017 1 1
13 2017-01-03 1533 1 2017 1 1
14 2017-01-03 1736 1 2017 1 1
15 2017-01-03 1897 1 2017 1 1
If you want to make some sweeping changes across all columns, you can do that with select_all()
. Here we flip the column names to uppercase, and then I am replacing the underscore (_)
with no whitespace at all.
cmh.df %>%
select_all(toupper) -> df.up
names(df.up[, 1:10])
[1] "ID" "YEAR"
[3] "QUARTER" "MONTH"
[5] "DAYOF_MONTH" "DAY_OF_WEEK"
[7] "FLIGHT_DATE" "REPORTING_AIRLINE"
[9] "DOT_ID_REPORTING_AIRLINE" "IATA_CODE_REPORTING_AIRLINE"
cmh.df %>%
select_all(toupper) %>%
select_all(~str_replace_all(., "_", "")) -> df.up
names(df.up[, 1:10])
[1] "ID" "YEAR"
[3] "QUARTER" "MONTH"
[5] "DAYOFMONTH" "DAYOFWEEK"
[7] "FLIGHTDATE" "REPORTINGAIRLINE"
[9] "DOTIDREPORTINGAIRLINE" "IATACODEREPORTINGAIRLINE"
In contrast, select_at()
will allow you to operate on variables that meet a specific criterion. For example, here, only columns with the string ar
in their column name or that have column names starting with c
are being selected and flipped to uppercase.
mtcars %>%
select_at(
vars(contains("ar"), starts_with("c")
),
toupper
) %>%
head()
GEAR CARB CYL
Mazda RX4 4 4 6
Mazda RX4 Wag 4 4 6
Datsun 710 4 1 4
Hornet 4 Drive 3 1 6
Hornet Sportabout 3 2 8
Valiant 3 1 6
rownames_to_column
This is a handy function to flip a data-frame’s rownames into a column
head(mtcars[, c(1:4)])
mpg cyl disp hp
Mazda RX4 21.0 6 160 110
Mazda RX4 Wag 21.0 6 160 110
Datsun 710 22.8 4 108 93
Hornet 4 Drive 21.4 6 258 110
Hornet Sportabout 18.7 8 360 175
Valiant 18.1 6 225 105
mtcars %>%
rownames_to_column() %>%
head(3)
rowname mpg cyl disp hp drat wt qsec vs am gear carb
1 Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
2 Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
3 Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
mutate_at()
and mutate_if()
mutate_at() will allow you to mutate specific columns when the specified condition is met. For example, if the column name contains “el” I want the contents of the column to be rendered lowercase.
mtcars %>%
mutate_at(
vars(
contains("el")
),
tolower
) -> mut.df
head(mut.df)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
Or if the column name contains “pg”, I want the column’s values to be divided by 10
mtcars %>%
mutate_at(
vars(
contains("pg")
),
~(. / 10)
) -> mut.df
head(mut.df)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 2.10 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 2.10 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 2.28 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 2.14 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 1.87 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 1.81 6 225 105 2.76 3.460 20.22 1 0 3 1
Here, if a column is a factor, that column’s values will be converted into a uppercase
mtcars %>%
mutate_if(
is.factor,
toupper
) -> mut.df
head(mut.df)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
na_if()
Need to convert values that meet specific criteria into NA? Here we are asking that if any variable mpg through vs
has the value of 110
, this value should be converted to NA
(a missing value).
mtcars %>%
select(mpg:vs) %>%
na_if(110) -> naif.df
head(naif.df)
mpg cyl disp hp drat wt qsec vs
Mazda RX4 21.0 6 160 NA 3.90 2.620 16.46 0
Mazda RX4 Wag 21.0 6 160 NA 3.90 2.875 17.02 0
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1
Hornet 4 Drive 21.4 6 258 NA 3.08 3.215 19.44 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0
Valiant 18.1 6 225 105 2.76 3.460 20.22 1
rowwise()
This comes in handy if you have to operate on two or more columns but want the calculation done for each row. Here the average of mpg + qsec
is being calculated for each model
mtcars %>%
rowwise() %>%
mutate(
average = mean(c(mpg, qsec))
) -> row.df
head(row.df[, c(1, 7, 12)])
# A tibble: 6 × 3
# Rowwise:
mpg qsec average
<dbl> <dbl> <dbl>
1 21 16.5 18.7
2 21 17.0 19.0
3 22.8 18.6 20.7
4 21.4 19.4 20.4
5 18.7 17.0 17.9
6 18.1 20.2 19.2
between()
As the name suggests, this allows you to filter for values that fall between the specified range (here mpg of 21 to 24). These boundary values are included in the filtered results
mtcars %>%
filter(
between(mpg, 21, 24)
) -> between.df
between.df
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
near()
This allows you to filter within a range of values, the range specified via tol
. Here, for example, we are asking for any case where mpg is inside
the \(\pm 0.5\) range of \(mpg = 21\), i.e., \(20.5 < x < 21.5\)
mtcars %>%
filter(
near(mpg, 21, tol = 0.5)
) -> near.df
near.df
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Volvo 142E 21.4 4 121 109 4.11 2.780 18.60 1 1 4 2
mtcars %>%
filter(
near(mpg, 21, tol = 0.5 * sd(mpg))
) -> near.df
head(near.df)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
xor()
This is just another way of saying filter if mpg < 22 | cyl < 6
mtcars %>%
filter(
xor(mpg < 22, cyl < 6)
) -> xor.df
head(xor.df)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
mtcars %>%
filter(
mpg < 22 | cyl < 6
) -> nonxor.df
head(nonxor.df)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
filter_all()
This will filter all columns based on the condition specified, for example, finding all instances where baby names include the string “nnie”
library(babynames)
babynames %>%
filter_all(
any_vars(
str_detect(., pattern = "nnie")
)
) -> filt.df
head(filt.df)
# A tibble: 6 × 5
year sex name n prop
<dbl> <chr> <chr> <int> <dbl>
1 1880 F Minnie 1746 0.0179
2 1880 F Annie 1258 0.0129
3 1880 F Jennie 793 0.00812
4 1880 F Fannie 560 0.00574
5 1880 F Nannie 248 0.00254
6 1880 F Winnie 49 0.000502
babynames %>%
select_if(is.numeric) %>%
filter_all(
any_vars(
. == 1880
)
) -> filt.df
head(filt.df)
# A tibble: 6 × 3
year n prop
<dbl> <int> <dbl>
1 1880 7065 0.0724
2 1880 2604 0.0267
3 1880 2003 0.0205
4 1880 1939 0.0199
5 1880 1746 0.0179
6 1880 1578 0.0162
This has retained only those rows where at least one column/variable has a value of 1800
filter_if()
This will allow you to filter if the logical condition is met, here if the column is a character column and if any character column has a missing value
starwars %>%
filter_if(
is.character,
any_vars(is.na(.)
)
) -> filt.df
head(filt.df)
# A tibble: 6 × 14
name height mass hair_color skin_color eye_color birth_year sex
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
1 C-3PO 167 75 <NA> gold yellow 112 none
2 R2-D2 96 32 <NA> white, bl… red 33 none
3 R5-D4 97 32 <NA> white, red red NA none
4 Greedo 173 74 <NA> green black 44 male
5 Jabba… 175 1358 <NA> green-tan… orange 600 herm…
6 Yoda 66 17 white green brown 896 male
# … with 6 more variables: gender <chr>, homeworld <chr>,
# species <chr>, films <list>, vehicles <list>, starships <list>
filter_at()
This will come in very handy if you want to filter specific variables if a condition is met
mtcars %>%
filter_at(
vars(mpg, qsec),
all_vars(. > 19)
) -> filt.df
head(filt.df)
mpg cyl disp hp drat wt qsec vs am gear carb
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
This looked for mpg
and qsec
and then retained rows where either variable had a value > 19
mtcars %>%
filter_at(
vars(contains("t")),
any_vars(. > 3)
) -> filt.df
head(filt.df)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
Here we asked for columns with the letter t
in their name, and any
of these selected columns with values > 3
summarise_all()
Here we are asking that means be calculated for all selected variables – mpg:qsec
– each on its own, of course
mtcars %>%
select(mpg:qsec) %>%
summarise_all(mean, na.rm = TRUE) -> summ.df
summ.df
mpg cyl disp hp drat wt qsec
1 20.09062 6.1875 230.7219 146.6875 3.596563 3.21725 17.84875
summarise_if()
Here we are asking for the means of specific variables – if they are numeric – to be calculated
mtcars %>%
summarise_if(
is.numeric, mean, na.rm = TRUE
) %>%
rename_if(
is.numeric, ~paste0("mean_", .)
) -> summ.df
summ.df
mean_mpg mean_cyl mean_disp mean_hp mean_drat mean_wt mean_qsec
1 20.09062 6.1875 230.7219 146.6875 3.596563 3.21725 17.84875
mean_vs mean_am mean_gear mean_carb
1 0.4375 0.40625 3.6875 2.8125
rename_if()
is being used to append recognizable labels to the calculated values
summarise_at()
Here we are asking for the means of specific variables – those with the word “Delay” in their name – to be calculated
cmh.df %>%
summarise_at(
vars(contains("delay")), mean, na.rm = TRUE
) %>%
rename_at(
vars(contains("delay")), ~paste0("mean_", .)
) -> summ.df
summ.df
mean_dep_delay mean_dep_delay_minutes mean_departure_delay_groups
1 8.756439 11.54685 -0.03739762
mean_arr_delay mean_arr_delay_minutes mean_arrival_delay_groups
1 2.601107 11.07277 -0.3285672
mean_carrier_delay mean_weather_delay mean_nas_delay
1 18.72072 2.676069 12.33165
mean_security_delay mean_late_aircraft_delay mean_div_arr_delay
1 0.04338477 26.01961 178.8784
rename_at()
is being used to append recognizable labels to the calculated values
top_n()
This will give you the top
or bottom
n values (here n has been set to 7)
cmh.df %>%
group_by(reporting_airline) %>%
filter(dest == "CMH") %>%
summarise(
average.delay = mean(arr_delay, na.rm = TRUE)
) %>%
arrange(-average.delay) %>%
top_n(7)
# A tibble: 7 × 2
reporting_airline average.delay
<chr> <dbl>
1 F9 7.54
2 WN 5.20
3 AA 4.92
4 OO 4.66
5 EV 0.966
6 DL -1.01
7 UA -2.27
cmh.df %>%
group_by(reporting_airline) %>%
filter(dest == "CMH") %>%
summarise(
average.delay = mean(arr_delay, na.rm = TRUE)
) %>%
arrange(average.delay) %>%
top_n(-7)
# A tibble: 7 × 2
reporting_airline average.delay
<chr> <dbl>
1 UA -2.27
2 DL -1.01
3 EV 0.966
4 OO 4.66
5 AA 4.92
6 WN 5.20
7 F9 7.54
separate_rows
If you need to separate values accidentally stored in a single column into rows, use separate_rows
library(tibble)
df <- tibble(
x = 1:3,
y = c("a", "d,e,f", "g,h"),
z = c("1", "2,3,4", "5,6")
)
df
# A tibble: 3 × 3
x y z
<int> <chr> <chr>
1 1 a 1
2 2 d,e,f 2,3,4
3 3 g,h 5,6
separate_rows(df, y, z, convert = TRUE)
# A tibble: 6 × 3
x y z
<int> <chr> <int>
1 1 a 1
2 2 d 2
3 2 e 3
4 2 f 4
5 3 g 5
6 3 h 6
join
alternative to merge
dplyr
has it own commands that execute what merge()
did for us in the previous module. Let us see how this alternative route works but first we recreate the data we used with merge.
data.frame(
Score = c(10, 21, 33, 12, 35, 67, 43, 99),
ID = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "C31", "Q22")
) -> data1
data.frame(
Sex = c("Male", "Female", "Male", "Female", "Male", "Female", "Male", "Female"),
ID = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "E52", "H71")
) -> data2
data.frame(
Score = c(10, 21, 33, 12, 35, 67, 43, 99),
ID1 = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "C31", "Q22")
) -> data3
data.frame(
Sex = c("Male", "Female", "Male", "Female", "Male", "Female", "Male", "Female"),
ID2 = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "E52", "H71")
) -> data4
data.frame(
Score = c(10, 21, 33, 12, 35, 67, 43, 99),
ID = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "C31", "Q22")
) -> df1
data.frame(
Sex = c("Male", "Female", "Male", "Female", "Male", "Female", "Male", "Female"),
ID = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "E52", "H71")
) -> df2
data.frame(
Age = c(6, 7, 6, 8, 8, 9, 10, 5),
ID = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "E52", "H71")
) -> df3
inner_join
This merges based on columns that appear in both data-sets, and will only merge rows that have the same value per column. In brief, only those observations common to both data-sets are retained.
inner_join(data1, data2, by = "ID")
Score ID Sex
1 10 A12 Male
2 21 A23 Female
3 33 Z14 Male
4 12 WX1 Female
5 35 Y31 Male
6 67 D66 Female
If you do not specify the by =
column dplyr defaults to all columns with similar names in each data-set …
inner_join(data1, data2)
Score ID Sex
1 10 A12 Male
2 21 A23 Female
3 33 Z14 Male
4 12 WX1 Female
5 35 Y31 Male
6 67 D66 Female
left_join
This merge will result in all rows being retained from the first data-set listed in the command, but will include only those rows from the second data-set that have the same value(s) for the by variable(s) …
data1 %>%
left_join(data2, by = "ID")
Score ID Sex
1 10 A12 Male
2 21 A23 Female
3 33 Z14 Male
4 12 WX1 Female
5 35 Y31 Male
6 67 D66 Female
7 43 C31 <NA>
8 99 Q22 <NA>
data2 %>%
left_join(data1, by = "ID")
Sex ID Score
1 Male A12 10
2 Female A23 21
3 Male Z14 33
4 Female WX1 12
5 Male Y31 35
6 Female D66 67
7 Male E52 NA
8 Female H71 NA
right_join
This merge will result in all rows being retained from the second data-set listed in the command, but will include only those rows from the first data-set that have the same value(s) for the by variable(s) …
data1 %>%
right_join(data2, by = "ID")
Score ID Sex
1 10 A12 Male
2 21 A23 Female
3 33 Z14 Male
4 12 WX1 Female
5 35 Y31 Male
6 67 D66 Female
7 NA E52 Male
8 NA H71 Female
data2 %>%
right_join(data1, by = "ID")
Sex ID Score
1 Male A12 10
2 Female A23 21
3 Male Z14 33
4 Female WX1 12
5 Male Y31 35
6 Female D66 67
7 <NA> C31 43
8 <NA> Q22 99
full_join
This merge keeps all rows from each data-set, with those only found in one data-set being given NA
values as necessary.
data1 %>%
full_join(data2, by = "ID")
Score ID Sex
1 10 A12 Male
2 21 A23 Female
3 33 Z14 Male
4 12 WX1 Female
5 35 Y31 Male
6 67 D66 Female
7 43 C31 <NA>
8 99 Q22 <NA>
9 NA E52 Male
10 NA H71 Female
semi_join
What if you only want observations that match in both data-sets and only those columns found in one data-set but not in the other? Use a semi-join
. In the examples below, you will notice that the column Sex
is not included in the first result since it is only found in data2
but is there in the second example since data2
is guiding the semi-Join()
.
data.frame(
Score = c(10, 21, 33, 12, 35, 67, 43, 99),
ID = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "C31", "Q22")
) -> data1
data.frame(
Score = c(10, 21, 33, 12, 35, 67, NA, NA),
Sex = c("Male", "Female", "Male", "Female", "Male", "Female", "Male", "Female"),
ID = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "E52", "H71")
) -> data2
semi_join(data1, data2)
Score ID
1 10 A12
2 21 A23
3 33 Z14
4 12 WX1
5 35 Y31
6 67 D66
semi_join(data2, data1, by = "ID")
Score Sex ID
1 10 Male A12
2 21 Female A23
3 33 Male Z14
4 12 Female WX1
5 35 Male Y31
6 67 Female D66
anti_join
This will only retain records not found in the second data-set listed.
anti_join(data1, data2)
Score ID
1 43 C31
2 99 Q22
anti_join(data2, data1)
Score Sex ID
1 NA Male E52
2 NA Female H71
mutate
/transmute
You can create new columns by executing vectorized functions that vectors of some length as the input and return vectors of the same length as the output. These functions are shown below.
tibble::tibble(
id = c(rep("A", 3), rep("B", 4), rep("C", 1), rep("D", 4)),
x = c(seq(1, 10, by = 1), -1, -2),
y = c(seq(11, 20, by = 1), -11, -12)
) -> a.df
lag()
versus lead()
To lag some value implies to assign to the current observation (i.e., row) the preceding value. Similarly, to lead implies to assign to the current observation (i.e., row) the value from a subsequent row. If I lead by 1, then I am assigning the value from the next row, lead by 2 then I am assigning the value from 2 rows ahead, and so on.
Say I want to lag the values of x by 1, first regardless of id
but then within each id
.
a.df %>%
mutate(
x.lag = lag(x, 1)
)
# A tibble: 12 × 4
id x y x.lag
<chr> <dbl> <dbl> <dbl>
1 A 1 11 NA
2 A 2 12 1
3 A 3 13 2
4 B 4 14 3
5 B 5 15 4
6 B 6 16 5
7 B 7 17 6
8 C 8 18 7
9 D 9 19 8
10 D 10 20 9
11 D -1 -11 10
12 D -2 -12 -1
a.df %>%
group_by(id) %>%
mutate(
x.lag = lag(x, 1)
)
# A tibble: 12 × 4
# Groups: id [4]
id x y x.lag
<chr> <dbl> <dbl> <dbl>
1 A 1 11 NA
2 A 2 12 1
3 A 3 13 2
4 B 4 14 NA
5 B 5 15 4
6 B 6 16 5
7 B 7 17 6
8 C 8 18 NA
9 D 9 19 NA
10 D 10 20 9
11 D -1 -11 10
12 D -2 -12 -1
Similarly, I can also lead by a specific “period”
a.df %>%
mutate(
x.lead = lead(x, 2)
)
# A tibble: 12 × 4
id x y x.lead
<chr> <dbl> <dbl> <dbl>
1 A 1 11 3
2 A 2 12 4
3 A 3 13 5
4 B 4 14 6
5 B 5 15 7
6 B 6 16 8
7 B 7 17 9
8 C 8 18 10
9 D 9 19 -1
10 D 10 20 -2
11 D -1 -11 NA
12 D -2 -12 NA
a.df %>%
group_by(id) %>%
mutate(
y.lead = lead(y, 1)
)
# A tibble: 12 × 4
# Groups: id [4]
id x y y.lead
<chr> <dbl> <dbl> <dbl>
1 A 1 11 12
2 A 2 12 13
3 A 3 13 NA
4 B 4 14 15
5 B 5 15 16
6 B 6 16 17
7 B 7 17 NA
8 C 8 18 NA
9 D 9 19 20
10 D 10 20 -11
11 D -1 -11 -12
12 D -2 -12 NA
We can also calculate various aggregates by cumulating over/across columns.
a.df %>%
mutate(
c = cummin(x)
) # c in any row is the smallest value observed thus far
# A tibble: 12 × 4
id x y c
<chr> <dbl> <dbl> <dbl>
1 A 1 11 1
2 A 2 12 1
3 A 3 13 1
4 B 4 14 1
5 B 5 15 1
6 B 6 16 1
7 B 7 17 1
8 C 8 18 1
9 D 9 19 1
10 D 10 20 1
11 D -1 -11 -1
12 D -2 -12 -2
a.df %>%
mutate(
c = cummax(x)
) # c in any row is the largest value observed thus far
# A tibble: 12 × 4
id x y c
<chr> <dbl> <dbl> <dbl>
1 A 1 11 1
2 A 2 12 2
3 A 3 13 3
4 B 4 14 4
5 B 5 15 5
6 B 6 16 6
7 B 7 17 7
8 C 8 18 8
9 D 9 19 9
10 D 10 20 10
11 D -1 -11 10
12 D -2 -12 10
a.df %>%
mutate(
c = cummean(x)
) # c is the average of the current + preceding value(s)
# A tibble: 12 × 4
id x y c
<chr> <dbl> <dbl> <dbl>
1 A 1 11 1
2 A 2 12 1.5
3 A 3 13 2
4 B 4 14 2.5
5 B 5 15 3
6 B 6 16 3.5
7 B 7 17 4
8 C 8 18 4.5
9 D 9 19 5
10 D 10 20 5.5
11 D -1 -11 4.91
12 D -2 -12 4.33
a.df %>%
mutate(
c = cumsum(x)
) # c is the cumulative sum of the current + preceding value(s)
# A tibble: 12 × 4
id x y c
<chr> <dbl> <dbl> <dbl>
1 A 1 11 1
2 A 2 12 3
3 A 3 13 6
4 B 4 14 10
5 B 5 15 15
6 B 6 16 21
7 B 7 17 28
8 C 8 18 36
9 D 9 19 45
10 D 10 20 55
11 D -1 -11 54
12 D -2 -12 52
a.df %>%
mutate(
c = cumprod(x)
) # c is the cumulative product of the current value of x and the preceding value of c
# A tibble: 12 × 4
id x y c
<chr> <dbl> <dbl> <dbl>
1 A 1 11 1
2 A 2 12 2
3 A 3 13 6
4 B 4 14 24
5 B 5 15 120
6 B 6 16 720
7 B 7 17 5040
8 C 8 18 40320
9 D 9 19 362880
10 D 10 20 3628800
11 D -1 -11 -3628800
12 D -2 -12 7257600
Say we have information on bank balance we hold by date. If the balance is negative, then we have leaned on an overdraft. We can use cumany()
and cumall()
to extract rows once a condition is met.
data.frame(
date = as.Date("2020-01-01") + 0:6,
balance = c(100, 50, 25, -25, -50, 30, 120)
) -> df
df
date balance
1 2020-01-01 100
2 2020-01-02 50
3 2020-01-03 25
4 2020-01-04 -25
5 2020-01-05 -50
6 2020-01-06 30
7 2020-01-07 120
# show me all rows starting with when the first overdraft occurs
df %>%
filter(
cumany(balance < 0)
)
date balance
1 2020-01-04 -25
2 2020-01-05 -50
3 2020-01-06 30
4 2020-01-07 120
# show me all rows preceding the first overdraft
df %>%
filter(
cumall(!(balance < 0))
)
date balance
1 2020-01-01 100
2 2020-01-02 50
3 2020-01-03 25
These come in handy when we need to rank observations based on some criterion.
cume_dist()
a.df %>%
mutate(
d = cume_dist(x)
) # gives a cumulative proportion of all values of x less than or equal to the current value.
# A tibble: 12 × 4
id x y d
<chr> <dbl> <dbl> <dbl>
1 A 1 11 0.25
2 A 2 12 0.333
3 A 3 13 0.417
4 B 4 14 0.5
5 B 5 15 0.583
6 B 6 16 0.667
7 B 7 17 0.75
8 C 8 18 0.833
9 D 9 19 0.917
10 D 10 20 1
11 D -1 -11 0.167
12 D -2 -12 0.0833
Note that -2 is the smallest value so it is ranked first, and the proportion becomes d = (1/12). -1 is bigger than -2, hence ranked second. Now d = (2/12) = 0.167 because two observations have a value of x equal to -1 or less. And so on …
row_number()
Say we have multiple observations for some group and want to flag whether a specific observation is the first, second, third, etc. within the group.
a.df %>%
group_by(id) %>%
mutate(
flag_row = row_number()
)
# A tibble: 12 × 4
# Groups: id [4]
id x y flag_row
<chr> <dbl> <dbl> <int>
1 A 1 11 1
2 A 2 12 2
3 A 3 13 3
4 B 4 14 1
5 B 5 15 2
6 B 6 16 3
7 B 7 17 4
8 C 8 18 1
9 D 9 19 1
10 D 10 20 2
11 D -1 -11 3
12 D -2 -12 4
If you want to flag in ascending or descending order of some variable, than you can run the following and when you do, pay attention to what happens with D since the smallest value of x is flagged as the first observation and NOT the order in which x appears in the data-set.
a.df %>%
group_by(id) %>%
mutate(
flag_row = row_number(x)
)
# A tibble: 12 × 4
# Groups: id [4]
id x y flag_row
<chr> <dbl> <dbl> <int>
1 A 1 11 1
2 A 2 12 2
3 A 3 13 3
4 B 4 14 1
5 B 5 15 2
6 B 6 16 3
7 B 7 17 4
8 C 8 18 1
9 D 9 19 3
10 D 10 20 4
11 D -1 -11 2
12 D -2 -12 1
min_rank()
and percent_rank()
This allows you to flag the minimum rank for an observation as a rank or as a proportion
a.df %>%
mutate(
min.rank = min_rank(x)
)
# A tibble: 12 × 4
id x y min.rank
<chr> <dbl> <dbl> <int>
1 A 1 11 3
2 A 2 12 4
3 A 3 13 5
4 B 4 14 6
5 B 5 15 7
6 B 6 16 8
7 B 7 17 9
8 C 8 18 10
9 D 9 19 11
10 D 10 20 12
11 D -1 -11 2
12 D -2 -12 1
a.df %>%
group_by(id) %>%
mutate(
min.rank = min_rank(x)
)
# A tibble: 12 × 4
# Groups: id [4]
id x y min.rank
<chr> <dbl> <dbl> <int>
1 A 1 11 1
2 A 2 12 2
3 A 3 13 3
4 B 4 14 1
5 B 5 15 2
6 B 6 16 3
7 B 7 17 4
8 C 8 18 1
9 D 9 19 3
10 D 10 20 4
11 D -1 -11 2
12 D -2 -12 1
a.df %>%
mutate(
pct.rank = percent_rank(x)
)
# A tibble: 12 × 4
id x y pct.rank
<chr> <dbl> <dbl> <dbl>
1 A 1 11 0.182
2 A 2 12 0.273
3 A 3 13 0.364
4 B 4 14 0.455
5 B 5 15 0.545
6 B 6 16 0.636
7 B 7 17 0.727
8 C 8 18 0.818
9 D 9 19 0.909
10 D 10 20 1
11 D -1 -11 0.0909
12 D -2 -12 0
a.df %>%
group_by(id) %>%
mutate(
pct.rank = percent_rank(x)
)
# A tibble: 12 × 4
# Groups: id [4]
id x y pct.rank
<chr> <dbl> <dbl> <dbl>
1 A 1 11 0
2 A 2 12 0.5
3 A 3 13 1
4 B 4 14 0
5 B 5 15 0.333
6 B 6 16 0.667
7 B 7 17 1
8 C 8 18 NaN
9 D 9 19 0.667
10 D 10 20 1
11 D -1 -11 0.333
12 D -2 -12 0
In addition, you can use first()
, last()
, or then a specific position in a vector via nth(somevariable, position)
as shown below. Note, however, that here the rows are not sorted in ascending/descending order of your target variable but instead used as they occur in the dataset.
a.df %>%
mutate(
first.value = first(y)
)
# A tibble: 12 × 4
id x y first.value
<chr> <dbl> <dbl> <dbl>
1 A 1 11 11
2 A 2 12 11
3 A 3 13 11
4 B 4 14 11
5 B 5 15 11
6 B 6 16 11
7 B 7 17 11
8 C 8 18 11
9 D 9 19 11
10 D 10 20 11
11 D -1 -11 11
12 D -2 -12 11
a.df %>%
mutate(
last.value = last(y)
)
# A tibble: 12 × 4
id x y last.value
<chr> <dbl> <dbl> <dbl>
1 A 1 11 -12
2 A 2 12 -12
3 A 3 13 -12
4 B 4 14 -12
5 B 5 15 -12
6 B 6 16 -12
7 B 7 17 -12
8 C 8 18 -12
9 D 9 19 -12
10 D 10 20 -12
11 D -1 -11 -12
12 D -2 -12 -12
a.df %>%
mutate(
third.value = nth(y, 3)
)
# A tibble: 12 × 4
id x y third.value
<chr> <dbl> <dbl> <dbl>
1 A 1 11 13
2 A 2 12 13
3 A 3 13 13
4 B 4 14 13
5 B 5 15 13
6 B 6 16 13
7 B 7 17 13
8 C 8 18 13
9 D 9 19 13
10 D 10 20 13
11 D -1 -11 13
12 D -2 -12 13
And then of course the grouped variants
a.df %>%
group_by(id) %>%
mutate(
first.value = first(y)
)
# A tibble: 12 × 4
# Groups: id [4]
id x y first.value
<chr> <dbl> <dbl> <dbl>
1 A 1 11 11
2 A 2 12 11
3 A 3 13 11
4 B 4 14 14
5 B 5 15 14
6 B 6 16 14
7 B 7 17 14
8 C 8 18 18
9 D 9 19 19
10 D 10 20 19
11 D -1 -11 19
12 D -2 -12 19
a.df %>%
group_by(id) %>%
mutate(
last.value = last(y)
)
# A tibble: 12 × 4
# Groups: id [4]
id x y last.value
<chr> <dbl> <dbl> <dbl>
1 A 1 11 13
2 A 2 12 13
3 A 3 13 13
4 B 4 14 17
5 B 5 15 17
6 B 6 16 17
7 B 7 17 17
8 C 8 18 18
9 D 9 19 -12
10 D 10 20 -12
11 D -1 -11 -12
12 D -2 -12 -12
a.df %>%
group_by(id) %>%
mutate(
third.value = nth(y, 3)
)
# A tibble: 12 × 4
# Groups: id [4]
id x y third.value
<chr> <dbl> <dbl> <dbl>
1 A 1 11 13
2 A 2 12 13
3 A 3 13 13
4 B 4 14 16
5 B 5 15 16
6 B 6 16 16
7 B 7 17 16
8 C 8 18 NA
9 D 9 19 -11
10 D 10 20 -11
11 D -1 -11 -11
12 D -2 -12 -11
For attribution, please cite this work as
Ruhil (2025, Feb. 16). Tidy and Tweak Your Data. Retrieved from https://aniruhil.org/courses/mpa6020/handouts/module04.html
BibTeX citation
@misc{ruhil2025tidy, author = {Ruhil, Ani}, title = {Tidy and Tweak Your Data}, url = {https://aniruhil.org/courses/mpa6020/handouts/module04.html}, year = {2025} }