In this module you will learn how three very powerful packages work their magic. {dplyr} and {tidyr} have quickly gained a following because they allow you to manage seemingly unwieldy data to calculate quantities of interest with relative ease, and to flip data from “long” to “wide” formats or vice-versa. {stringr} is a sister package that works well with strings and is often useful for key data-cleaning tasks. In this module we will see some of their core abilities.
You will usually end up needing and using some or all of these six functions:
select()
filter()
arrange()
summarise()
mutate()
group_by()
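Before we turn to the real data, here is a minimal sketch of how these six verbs chain together, using a throwaway data frame invented purely for illustration (the `df` object and its columns are not part of the flights data):

```r
# load dplyr (part of the tidyverse)
library(dplyr)

# a tiny invented data frame, just to preview the verbs
df <- tibble(
  g = c("a", "a", "b", "b"),
  x = c(1, 2, 3, 4)
)

df %>%
  select(g, x) %>%             # keep only the columns we need
  filter(x > 1) %>%            # keep rows where x exceeds 1
  mutate(x2 = x * 2) %>%       # add a derived column
  group_by(g) %>%              # set the grouping structure
  summarise(m = mean(x2)) %>%  # collapse each group to one row
  arrange(desc(m))             # sort the result
```

Each verb takes a data frame and returns a data frame, which is what makes the pipe (`%>%`) so natural here.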
In order to understand how these commands work we rely on a simple data frame – all flights originating and departing from Columbus (Ohio) for January through December of 2017. Let us load the data and {dplyr}.
library(here)
library(tidyverse)
library(tidylog, warn.conflicts = FALSE)
load(
here("data", "cmhflights2017.RData")
)
You will want to install {tidylog}, a great package by Benjamin Elbers, since tidylog shows you what each step of a basic {dplyr} operation has done.
select(...)
Often your data frame will have more columns than you need to calculate some quantity – for example, the mean of some variable or the frequency distribution of some categorical variable. Say I only want the second through sixth columns – Year, Quarter, Month, DayofMonth, and DayofWeek.
Before I do that, though, let me render all column names to be lowercase; running clean_names() from the {janitor} package after you read in any data-set is a good habit to acquire early on in your career.
library(janitor)
clean_names(cmhflights2017) -> cmh.df
Having done that, I could generate a data frame containing only the desired columns as follows:
cmh.df %>%
  select(
    year:day_of_week
  ) -> my.df
names(my.df)
[1] "year" "quarter" "month" "dayof_month"
[5] "day_of_week"
What if the columns were not sequentially located? In that case we would need to list each column we want, still using the : operator as shorthand for any sequential runs of columns.
cmh.df %>%
  select(
    year,
    flight_date:reporting_airline,
    tail_number
  ) -> my.df
names(my.df)
[1] "year" "flight_date" "reporting_airline"
[4] "tail_number"
Could we use column numbers instead? Absolutely, provided you know the column numbers of the columns you want (and these can be obtained by running names() on the original data-set).
cmh.df %>%
  select(
    c(1, 3, 5, 7)
  ) -> my.df
names(my.df)
[1] "id" "quarter" "dayof_month" "flight_date"
You can also select columns in other ways, with contains, starts_with, ends_with, any_of, all_of, matches, and num_range. As these function names suggest, contains() will look for a specific string in the column names, starts_with() and ends_with() look for specific starting and ending strings, respectively, matches() will look for a matching regular expression, and any_of() will look for any of the strings listed in quotes inside the parentheses while all_of() will look for all listed strings and throw an error if even one is not found.
cmh.df %>%
  select(
    contains("carrier")
  ) -> my.df # using contains()
names(my.df)
[1] "carrier_delay"
cmh.df %>%
  select(
    starts_with("de")
  ) -> my.df # using starts_with()
names(my.df)
[1] "dest_airport_id" "dest_airport_seq_id"
[3] "dest_city_market_id" "dest"
[5] "dest_city_name" "dest_state"
[7] "dest_state_fips" "dest_state_name"
[9] "dest_wac" "dep_time"
[11] "dep_delay" "dep_delay_minutes"
[13] "dep_del15" "departure_delay_groups"
[15] "dep_time_blk"
cmh.df %>%
  select(
    ends_with("num")
  ) -> my.df # using ends_with()
names(my.df)
[1] "div1tail_num" "div2tail_num" "div3tail_num" "div4tail_num"
[5] "div5tail_num"
Now here is any_of(), with the desired variables fed via a new object called myvars.
myvars = c("year", "month", "dest")
cmh.df %>%
  select(
    any_of(myvars)
  ) -> my.df # using any_of()
names(my.df)
[1] "year" "month" "dest"
Had one of the names in myvars been misspelled – say, destined instead of dest – any_of() would have quietly skipped it and extracted the rest any way. What about with all_of()?
cmh.df %>%
  select(
    all_of(myvars)
  ) -> my.df
What if I misspelled any of the variables? Try running the code below

mymalvars = c("year", "month", "destined")

cmh.df %>%
  select(
    all_of(mymalvars)
  ) -> my.df
and you should see this message:
Error: Can't subset columns that don't exist. x Column `destined` doesn't exist.
cmh.df %>%
  select(
    matches("wheel")
  ) -> my.df # using matches()
names(my.df)
[1] "wheels_off" "wheels_on" "div1wheels_on"
[4] "div1wheels_off" "div2wheels_on" "div2wheels_off"
[7] "div3wheels_on" "div3wheels_off" "div4wheels_on"
[10] "div4wheels_off" "div5wheels_on" "div5wheels_off"
What about num_range()?

billboard %>%
  select(
    num_range("wk", 1:4)
  ) -> my.bb
names(my.bb)
[1] "wk1" "wk2" "wk3" "wk4"
rename(...)
Variable names can be modified with the rename() command. The sequence is easy to remember: newname = oldname, as shown below. Just remember: when you do this, the original variable name has been overwritten so do not invoke it in any subsequent code.
cmh.df %>%
  rename(
    flt_number = flight_number_reporting_airline
  ) %>%
  select(
    tail_number, flt_number
  ) -> df.rename
names(df.rename)
[1] "tail_number" "flt_number"
filter(...)
Recall that with the diamonds data, in the previous week we used subset() to select particular cases for analysis. With {dplyr}, the filter() command serves the same purpose. Say we only want records from January, i.e., month == 1.
cmh.df %>%
  filter(
    month == 1
  ) -> my.df
What about only American Airlines flights in January?

cmh.df %>%
  filter(
    month == 1 &
    reporting_airline == "AA"
  ) -> my.df
What about United Airlines flights in January to CMH (Columbus, OH) from any originating airport?
cmh.df %>%
  filter(
    month == 1 & reporting_airline == "UA" & dest == "CMH"
  ) -> my.df
Note: Instead of & you could use , … it serves the same purpose; test it out for yourself with the code below.
cmh.df %>%
  filter(
    month == 1, reporting_airline == "UA", dest == "CMH"
  ) -> my.df
filter(...)
with
%in%
What if I wanted a more complicated filter, say, United Airlines flights in January or February to either CMH (Columbus, Ohio) or ORD (the O’Hare airport in Chicago, Illinois)?
cmh.df %>%
  filter(
    month %in% c(1, 2) &
    reporting_airline == "UA" &
    dest %in% c("CMH", "ORD")
  ) -> my.df
Some times you may also want to filter with the opposite of %in%, as in not in some list. You can do this as well, but note where the ! is applied – immediately before the column name.
cmh.df %>%
  filter(
    month %in% c(1, 2) &
    reporting_airline == "UA" &
    !dest %in% c("CMH", "ORD")
  ) -> my.df
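Why does writing the ! before the column name work? Operator precedence: %in% binds more tightly than !, so the expression is read as !(dest %in% ...). A quick base-R check on a small invented vector:

```r
# membership test on a small invented vector
x <- c("CMH", "ORD", "LAX")

# %in% binds more tightly than !, so this reads as !(x %in% ...)
!x %in% c("CMH", "ORD")
```

The result is TRUE only for the element not in the list ("LAX"), which is exactly the behavior the filter above relies on.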
Now let us say I wanted to arrange the resulting data frame in ascending order of departure delays. How might I do that?
arrange(...)
my.df %>%
  arrange(
    dep_delay_minutes
  ) -> my.df2
And now in descending order of delays. Note the use of the minus symbol - to reverse the sort order.
my.df %>%
  arrange(
    -dep_delay_minutes
  ) -> my.df2
We could tweak this further, perhaps asking R to sort first by destination (so CMH flights come before ORD) and then, within each destination, by descending order of departure delays. Note that dest is ascending and dep_delay_minutes is descending.
my.df %>%
  arrange(
    dest, -dep_delay_minutes
  ) -> my.df2
So far, we have seen each function in isolation. Now we streamline things a bit so that we only end up with the columns we want, organized as we want them to be. That is, we can string together multiple functions to have a slim data frame of only those columns we selected, filtered, and arranged according to our wishes.
cmh.df %>%
  select(
    month, reporting_airline, dest, dep_delay_minutes
  ) %>%
  filter(
    month %in% c(1, 2) &
    reporting_airline == "UA" &
    dest %in% c("CMH", "ORD")
  ) %>%
  arrange(
    month, dest, desc(dep_delay_minutes)
  ) -> my.df3
Here, the end result is a data frame arranged by month, then within month by destination, and then finally by descending order of flight delay times.
group_by(...)
You often want to carry out some operations/calculations for specific groups – for example, calculate the mean or median age of the men in our sample versus the women, or find out how many unique destinations each airline flies to from Columbus, and so on. The group_by() command comes in very handy here to set the grouping structure.
Let us calculate the mean and median delay times. To do this, we will lean on the group_by() function to calculate median and mean delays by month and destination. While we are at it, let us see the standard deviation as well.
cmh.df %>%
  select(
    month, reporting_airline, dest, dep_delay_minutes
  ) %>%
  filter(
    month %in% c(1, 2),
    reporting_airline == "UA",
    dest %in% c("CMH", "ORD")
  ) %>%
  group_by(month, dest) %>%
  summarise(
    median.delay = median(dep_delay_minutes, na.rm = TRUE),
    mean.delay = mean(dep_delay_minutes, na.rm = TRUE),
    sd.delay = sd(dep_delay_minutes, na.rm = TRUE)
  ) %>%
  arrange(month, dest) -> tab.01
What if we wanted to know how many airports you can fly to from Columbus, and how many flights there were to each airport in July?
cmh.df %>%
  select(
    month, reporting_airline, dest, dep_delay_minutes, origin
  ) %>%
  filter(
    month %in% c(7),
    origin %in% c("CMH")
  ) %>%
  group_by(dest) %>%
  summarise(
    no.of.flights = n()
  ) -> tab.02
How many distinct airlines fly out of Columbus?
cmh.df %>%
  select(
    month, reporting_airline, dest, dep_delay_minutes, origin
  ) %>%
  filter(
    origin %in% c("CMH")
  ) %>%
  group_by(reporting_airline) %>%
  summarise(
    no.of.airlines = n_distinct(reporting_airline)
  ) -> tab.03
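One thing to watch here: because tab.03 is grouped by reporting_airline, n_distinct(reporting_airline) is 1 within every group. If what you want is a single overall count of distinct airlines, it is simpler to skip the grouping. A sketch on an invented miniature of the flights data:

```r
library(dplyr)

# invented stand-in for cmh.df with just the columns needed here
mini <- tibble(
  origin = c("CMH", "CMH", "CMH", "ATL"),
  reporting_airline = c("AA", "AA", "DL", "UA")
)

# no group_by(): one overall count of distinct airlines
mini %>%
  filter(origin == "CMH") %>%
  summarise(no.of.airlines = n_distinct(reporting_airline))
```

With the toy data this returns 2 (AA and DL), one number for the whole filtered data frame rather than one per airline.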
And how many flights did each airline operate?
cmh.df %>%
  select(
    month, reporting_airline, dest, dep_delay_minutes, origin
  ) %>%
  filter(
    origin %in% c("CMH")
  ) %>%
  group_by(reporting_airline) %>%
  summarise(
    no.of.flights = n()
  ) -> tab.04
Of all of these airlines, which one has the highest mean departure delay?
cmh.df %>%
  select(
    month, reporting_airline, dest, dep_delay_minutes, origin
  ) %>%
  filter(
    origin %in% c("CMH")
  ) %>%
  group_by(reporting_airline) %>%
  summarise(
    mean.delay = mean(dep_delay_minutes, na.rm = TRUE)
  ) %>%
  arrange(
    desc(mean.delay)
  ) -> tab.05
ungroup(...)
If you want to remove a grouping structure you merely need to specify ungroup(), as in
cmh.df %>%
  select(
    month, reporting_airline, dest, dep_delay_minutes, origin
  ) %>%
  filter(
    origin %in% c("CMH")
  ) %>%
  group_by(reporting_airline) %>%
  summarise(
    median.delay = median(dep_delay_minutes, na.rm = TRUE)
  ) %>%
  ungroup() -> tab.06
ungroup() is often handy when you need to pipe several commands but the grouping structure is different for each. For example, maybe I want to calculate departure delays by unique airline, and then by unique combinations of airline & origin.
cmh.df %>%
  select(
    month, reporting_airline, dest, dep_delay_minutes, origin
  ) %>%
  group_by(reporting_airline) %>%
  mutate(
    mean.delay.1 = mean(dep_delay_minutes, na.rm = TRUE)
  ) %>%
  ungroup() %>%
  group_by(reporting_airline, origin) %>%
  mutate(
    mean.delay.2 = mean(dep_delay_minutes, na.rm = TRUE)
  )
# A tibble: 47,787 × 7
# Groups: reporting_airline, origin [53]
month reporting_airline dest dep_delay_minutes origin mean.delay.1
<dbl> <chr> <chr> <dbl> <chr> <dbl>
1 1 DL ATL 0 CMH 8.90
2 1 DL ATL 1 CMH 8.90
3 1 DL CMH 0 ATL 8.90
4 1 DL ATL 0 CMH 8.90
5 1 DL CMH 0 ATL 8.90
6 1 DL CMH 0 LAX 8.90
7 1 DL CMH 0 ATL 8.90
8 1 DL ATL 0 CMH 8.90
9 1 DL LAX 0 CMH 8.90
10 1 DL CMH 0 ATL 8.90
# … with 47,777 more rows, and 1 more variable: mean.delay.2 <dbl>
distinct(...)
Since the preceding code yields duplicate rows, I can use distinct to retain only unique rows.
cmh.df %>%
  select(
    month, reporting_airline, dest, dep_delay_minutes, origin
  ) %>%
  group_by(reporting_airline) %>%
  mutate(
    mean.delay.1 = mean(dep_delay_minutes, na.rm = TRUE)
  ) %>%
  ungroup() %>%
  group_by(reporting_airline, origin) %>%
  mutate(
    mean.delay.2 = mean(dep_delay_minutes, na.rm = TRUE)
  ) %>%
  distinct()
# A tibble: 13,332 × 7
# Groups: reporting_airline, origin [53]
month reporting_airline dest dep_delay_minutes origin mean.delay.1
<dbl> <chr> <chr> <dbl> <chr> <dbl>
1 1 DL ATL 0 CMH 8.90
2 1 DL ATL 1 CMH 8.90
3 1 DL CMH 0 ATL 8.90
4 1 DL CMH 0 LAX 8.90
5 1 DL LAX 0 CMH 8.90
6 1 DL CMH 7 ATL 8.90
7 1 DL CMH 1 LAX 8.90
8 1 DL ATL 2 CMH 8.90
9 1 DL CMH 105 ATL 8.90
10 1 DL CMH 1 ATL 8.90
# … with 13,322 more rows, and 1 more variable: mean.delay.2 <dbl>
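distinct() can also be restricted to particular columns, and with .keep_all = TRUE it retains the remaining columns, keeping the first row of each unique combination. A sketch on an invented data frame:

```r
library(dplyr)

# invented data frame with repeated airline values
df <- tibble(
  airline = c("DL", "DL", "AA"),
  delay = c(5, 9, 2)
)

# distinct() restricted to one column; .keep_all = TRUE retains
# the other columns, keeping the first row per unique airline
df %>%
  distinct(airline, .keep_all = TRUE)
```

This is a common idiom for de-duplicating by a key column while still carrying the other columns along.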
mutate(...)
Thus far we have been creating a summary data frame containing whatever we calculated – the frequencies, averages, and so on. Some times, however, you may want these calculations to be appended to the original data frame and stored as new columns. This can be done with mutate(), as shown below. In this particular example, I am calculating median departure delay by airline (for each airport it departs from) and then including this as a column called median.delay in cmhflights.
cmh.df %>%
  group_by(
    reporting_airline, origin
  ) %>%
  mutate(
    median.delay = median(dep_delay_minutes, na.rm = TRUE)
  ) -> cmhflights
Note the 111th column. Now, if we wanted to add this column but drop
all other columns not being used in the calculation, we could do that
via transmute()
. Be careful with this one though.
cmh.df -> mut.df # cloning cmh.df into mut.df

mut.df %>%
  group_by(reporting_airline, dest) %>%
  transmute(
    median.arr.delay = median(arr_delay_minutes, na.rm = TRUE)
  ) -> mut.df2
names(mut.df2)
[1] "reporting_airline" "dest" "median.arr.delay"
case_when(...)
Just as one can use ifelse() to generate new variables or modify existing variables, so can one do the same thing with case_when() from dplyr. Here are a few examples. First, working with categorical variables, where I create a new variable that flags whether the flight originated in BWI, DCA, or IAH versus some other airport.
cmh.df %>%
  mutate(
    DC_area = case_when(
      origin %in% c("BWI", "DCA", "IAH") ~ "DC Area",
      !origin %in% c("BWI", "DCA", "IAH") ~ "Not in DC Area"
    )
  ) -> cw.df
table(cw.df$origin, cw.df$DC_area)
DC Area Not in DC Area
ATL 0 3794
BNA 0 692
BOS 0 623
BWI 1301 0
CLT 0 3
CMH 0 23893
DAL 0 369
DCA 674 0
DEN 0 1224
DFW 0 1482
DTW 0 755
EWR 0 847
FLL 0 321
HOU 0 16
IAH 760 0
LAS 0 1074
LAX 0 641
LGA 0 173
MCO 0 1488
MDW 0 1992
MSP 0 945
MSY 0 35
OAK 0 365
ORD 0 1089
PHL 0 57
PHX 0 1075
RSW 0 657
STL 0 655
TPA 0 787
The same result would obtain if you had run the following code:
cmh.df %>%
  mutate(
    DC_area = case_when(
      origin %in% c("BWI", "DCA", "IAH") ~ "DC Area",
      TRUE ~ "Not in DC Area"
    )
  ) -> cw.df
table(cw.df$origin, cw.df$DC_area)
DC Area Not in DC Area
ATL 0 3794
BNA 0 692
BOS 0 623
BWI 1301 0
CLT 0 3
CMH 0 23893
DAL 0 369
DCA 674 0
DEN 0 1224
DFW 0 1482
DTW 0 755
EWR 0 847
FLL 0 321
HOU 0 16
IAH 760 0
LAS 0 1074
LAX 0 641
LGA 0 173
MCO 0 1488
MDW 0 1992
MSP 0 945
MSY 0 35
OAK 0 365
ORD 0 1089
PHL 0 57
PHX 0 1075
RSW 0 657
STL 0 655
TPA 0 787
Similarly, we could use numerical variables to create new ones. For example, let us call a flight delayed if dep_delay_minutes is > 30.
cw.df %>%
  mutate(
    delayed = case_when(
      dep_delay_minutes > 30 ~ "Yes",
      dep_delay_minutes <= 30 ~ "No",
      is.na(dep_delay_minutes) ~ "Unknown"
    )
  ) -> cw.df
case_when() is often useful when you have to chain multiple ifelse() commands since managing a large ifelse() chain can be tiresome. Further, you should remember that %in% can be used to ease coding in case_when(), as shown below. Say I want to group flights into weekday versus weekend flights. Note that the week starts on Sundays hence weekdays are defined here as days 2 through 6 (i.e., Monday through Friday).
cw.df %>%
  mutate(
    weekend = case_when(
      day_of_week %in% c(2:6) ~ "No",
      TRUE ~ "Yes"
    )
  ) -> cw.df
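One behavior worth remembering: rows that match none of the case_when() conditions come back as NA, which is why a final TRUE ~ ... catch-all line (as above) is often a good idea. A quick check on an invented vector:

```r
library(dplyr)

x <- c(10, 40, NA)

# no catch-all condition: the NA input matches neither test,
# so it comes back as NA
case_when(
  x > 30 ~ "Yes",
  x <= 30 ~ "No"
)
```

The first two elements are classified, while the NA input stays NA.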
unnest(...)
Every now and then the data you have to work with includes some list-columns – columns that are like mini data frames themselves. Suddenly R functions that would otherwise work will not execute as desired. The unnest function from the tidyverse unpacks these list-columns into a regular tibble. First, a small data frame that shows a list-column.
library(tidyverse)
tibble(
  x = 1:3,
  y = c("a", "d,e,f", "g,h")
) -> bad.df
If I want to clean up list-column y
, I can do that as
shown below.
bad.df %>%
  transform(
    y = strsplit(y, ",")
  ) %>%
  unnest(y) -> good.df
Now see a slightly more complicated list-column, with y made up of named elements a and b – a with 1 value and b with 2 values.
tibble(
  x = 1:2,
  y = list(a = 1, b = 3:4)
) -> badder.df
To clean this up we could do:
badder.df %>%
  unnest(
    ., .id = "name"
  ) -> gooder.df
Most packages you use in R call for data to be {tidy}. What are tidy data? Here are some untidy data:
read.delim(
  file = "http://stat405.had.co.nz/data/pew.txt",
  header = TRUE,
  stringsAsFactors = FALSE,
  check.names = FALSE
) -> pew
head(pew)
religion <$10k $10-20k $20-30k $30-40k $40-50k $50-75k
1 Agnostic 27 34 60 81 76 137
2 Atheist 12 27 37 52 35 70
3 Buddhist 27 21 30 34 33 58
4 Catholic 418 617 732 670 638 1116
5 Don’t know/refused 15 14 15 11 10 35
6 Evangelical Prot 575 869 1064 982 881 1486
$75-100k $100-150k >150k Don't know/refused
1 122 109 84 96
2 73 59 74 76
3 62 39 53 54
4 949 792 633 1489
5 21 17 18 116
6 949 723 414 1529
# remotes::install_github("garrettgman/DSR")
library(DSR)
table2
# A tibble: 12 × 4
country year key value
<fct> <int> <fct> <int>
1 Afghanistan 1999 cases 745
2 Afghanistan 1999 population 19987071
3 Afghanistan 2000 cases 2666
4 Afghanistan 2000 population 20595360
5 Brazil 1999 cases 37737
6 Brazil 1999 population 172006362
7 Brazil 2000 cases 80488
8 Brazil 2000 population 174504898
9 China 1999 cases 212258
10 China 1999 population 1272915272
11 China 2000 cases 213766
12 China 2000 population 1280428583
table3
# A tibble: 6 × 3
country year rate
<fct> <int> <chr>
1 Afghanistan 1999 745/19987071
2 Afghanistan 2000 2666/20595360
3 Brazil 1999 37737/172006362
4 Brazil 2000 80488/174504898
5 China 1999 212258/1272915272
6 China 2000 213766/1280428583
table4
# A tibble: 3 × 3
country `1999` `2000`
<fct> <int> <int>
1 Afghanistan 745 2666
2 Brazil 37737 80488
3 China 212258 213766
table5
# A tibble: 3 × 3
country `1999` `2000`
<fct> <int> <int>
1 Afghanistan 19987071 20595360
2 Brazil 172006362 174504898
3 China 1272915272 1280428583
What would a tidy version of these examples look like?
table1
# A tibble: 6 × 4
country year cases population
<fct> <int> <int> <int>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
Here, we have each variable in its own column and each observation in its own row – the hallmark of tidy data.
The {tidyr}
package is designed to make messy data tidy
in order to proceed with analysis. It has four primary functions that we
will see in turn.
separate(...)
You will run into variables that you want to break into smaller pieces. For instance, take table3, where rate is a ratio of two variables, cases and population, each of which should be in its own column. How can we achieve that?
library(tidyr)
table3 %>%
  separate(
    col = rate,
    into = c("cases", "population"),
    sep = "/",
    remove = TRUE,
    convert = TRUE
  ) -> s.table3
s.table3
# A tibble: 6 × 4
country year cases population
<fct> <int> <int> <int>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
The arguments used here are:
col = identifies which column must be split
into = c("", "") names the new columns to be created
sep = "/" specifies the separator to split on
remove = TRUE drops the original column. Set this to FALSE if you want to retain the original column.
convert = TRUE converts the new columns to the appropriate type. If you omit this switch then the new columns will be created as characters.
columns will be created as characters.unite(...)
The unite(...) command does the exact opposite of separate(...), combining columns.
s.table3 %>%
  unite(
    "rate",
    cases, population,
    sep = ":",
    remove = FALSE
  ) -> u.table3
u.table3
# A tibble: 6 × 5
country year rate cases population
<fct> <int> <chr> <int> <int>
1 Afghanistan 1999 745:19987071 745 19987071
2 Afghanistan 2000 2666:20595360 2666 20595360
3 Brazil 1999 37737:172006362 37737 172006362
4 Brazil 2000 80488:174504898 80488 174504898
5 China 1999 212258:1272915272 212258 1272915272
6 China 2000 213766:1280428583 213766 1280428583
Here you specify the name of the new column first, and then list the columns to be combined, specify the separator, and whether the original columns should be removed or not.
spread(...)
and
pivot_wider(...)
Revisit table2
table2
# A tibble: 12 × 4
country year key value
<fct> <int> <fct> <int>
1 Afghanistan 1999 cases 745
2 Afghanistan 1999 population 19987071
3 Afghanistan 2000 cases 2666
4 Afghanistan 2000 population 20595360
5 Brazil 1999 cases 37737
6 Brazil 1999 population 172006362
7 Brazil 2000 cases 80488
8 Brazil 2000 population 174504898
9 China 1999 cases 212258
10 China 1999 population 1272915272
11 China 2000 cases 213766
12 China 2000 population 1280428583
I want to flip this into a format where the data frame has two columns per country per year – cases and population. How can I do this?
table2 %>%
  spread(
    key,
    value
  ) -> sp.table2
sp.table2
# A tibble: 6 × 4
country year cases population
<fct> <int> <int> <int>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
Notice that the key column had two unique values – cases and population – so two new columns were created. Notice also that tidyr knew to populate the value of each of the new cells with what it finds in the value column. The result is what we call data in a wide format.
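As the heading hints, spread() has since been superseded in tidyr by pivot_wider(), which spells out the same reshape with names_from and values_from. A sketch on a hand-built slice of table2:

```r
library(tidyr)

# a hand-built slice of table2
tab <- tibble::tibble(
  country = c("Afghanistan", "Afghanistan"),
  year = c(1999L, 1999L),
  key = c("cases", "population"),
  value = c(745L, 19987071L)
)

tab %>%
  pivot_wider(
    names_from = key,    # unique values of key become column names
    values_from = value  # cells are filled from the value column
  )
```

The explicit argument names make the intent easier to read than spread()'s positional key/value pair.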
gather(...)
and
pivot_longer(...)
gather(...) does the opposite of spread(...), converting wide data into what we call the long form. Revisit table4:
table4
# A tibble: 3 × 3
country `1999` `2000`
<fct> <int> <int>
1 Afghanistan 745 2666
2 Brazil 37737 80488
3 China 212258 213766
Note that each country has two columns, one per year. This is untidy. So we’ll flip this tibble as follows:
table4 %>%
  gather(
    "year",
    "cases",
    2:3
  ) -> g.table4
g.table4
# A tibble: 6 × 3
country year cases
<fct> <chr> <int>
1 Afghanistan 1999 745
2 Brazil 1999 37737
3 China 1999 212258
4 Afghanistan 2000 2666
5 Brazil 2000 80488
6 China 2000 213766
Notice that you had to specify the columns you wanted to gather and we did this via 2:3. If you forget to specify columns it creates something else altogether, so be careful.
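gather(), too, has been superseded: pivot_longer() makes the column selection explicit via its cols argument, which avoids the forgotten-columns pitfall just mentioned. A sketch on a hand-built copy of table4:

```r
library(tidyr)

# a hand-built copy of table4
tab4 <- tibble::tibble(
  country = c("Afghanistan", "Brazil"),
  `1999` = c(745L, 37737L),
  `2000` = c(2666L, 80488L)
)

tab4 %>%
  pivot_longer(
    cols = `1999`:`2000`,  # the columns to stack
    names_to = "year",
    values_to = "cases"
  )
```

The backticks are needed because the column names start with digits; the result has one row per country-year.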
This is a slightly modified version of Garrett Grolemund’s WHO Data Case Study.
At this website you can find data provided by countries to WHO and estimates of TB burden generated by WHO for the Global Tuberculosis Report. I want you to gather the following data (all are in the csv format):
Read the data dictionary carefully. Then read-in the case notifications data.
The database has too many variables (175 or so) and so I’ll have you use just a few for this practice task.
library(readr)
read_csv(
  here("data", "TB_notifications_2018-02-07.csv")
) -> tb
Let us use dplyr
to select and filter the tb database so
that it meets the following criteria: (i) observations from 2012 or
earlier years are kept, and (ii) keep only columns 1 through 6 and 25
(new_sp_m04) through 44 (new_sp_fu).
library(dplyr)
tb %>%
  select(
    c(1:6, 25:44)
  ) %>%
  filter(
    year <= 2012
  ) -> tb2
glimpse(tb2)
Rows: 7,023
Columns: 26
$ country <chr> "Afghanistan", "Afghanistan", "Afghanistan", "A…
$ iso2 <chr> "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF",…
$ iso3 <chr> "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG"…
$ iso_numeric <chr> "004", "004", "004", "004", "004", "004", "004"…
$ g_whoregion <chr> "EMR", "EMR", "EMR", "EMR", "EMR", "EMR", "EMR"…
$ year <dbl> 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987,…
$ new_sp_m04 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ new_sp_m514 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ new_sp_m014 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ new_sp_m1524 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ new_sp_m2534 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ new_sp_m3544 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ new_sp_m4554 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ new_sp_m5564 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ new_sp_m65 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ new_sp_mu <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ new_sp_f04 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ new_sp_f514 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ new_sp_f014 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ new_sp_f1524 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ new_sp_f2534 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ new_sp_f3544 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ new_sp_f4554 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ new_sp_f5564 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ new_sp_f65 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ new_sp_fu <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
Now, new_sp_xx refers to new pulmonary smear-positive cases. The m refers to males and f refers to females. Finally, 04 is the 0-4 age group, 514 is the 5-14 age group, and so on. new_sp_mu is males with an unknown age-group and new_sp_fu is females with an unknown age-group. This is the messy part of the data. Ideally we’d have a column for gender, flagged as Male, Female, Unknown, and a separate column for age-groups flagged 0-4, 5-14, 0-14, and so on. Time to tidy these data.
Since the data are in the wide format we can use gather to flip it into the long format.
tb2 %>%
  gather(
    "group",
    "cases",
    7:26
  ) -> tb3
glimpse(tb3)
Rows: 140,460
Columns: 8
$ country <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Af…
$ iso2 <chr> "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", …
$ iso3 <chr> "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG",…
$ iso_numeric <chr> "004", "004", "004", "004", "004", "004", "004",…
$ g_whoregion <chr> "EMR", "EMR", "EMR", "EMR", "EMR", "EMR", "EMR",…
$ year <dbl> 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, …
$ group <chr> "new_sp_m04", "new_sp_m04", "new_sp_m04", "new_s…
$ cases <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
Now we split the new_sp_xx values into three columns:
tb3 %>%
  separate(
    col = group,
    into = c("new", "sp", "sexage"),
    sep = "_"
  ) -> tb4
glimpse(tb4)
Rows: 140,460
Columns: 10
$ country <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Af…
$ iso2 <chr> "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", …
$ iso3 <chr> "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG",…
$ iso_numeric <chr> "004", "004", "004", "004", "004", "004", "004",…
$ g_whoregion <chr> "EMR", "EMR", "EMR", "EMR", "EMR", "EMR", "EMR",…
$ year <dbl> 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, …
$ new <chr> "new", "new", "new", "new", "new", "new", "new",…
$ sp <chr> "sp", "sp", "sp", "sp", "sp", "sp", "sp", "sp", …
$ sexage <chr> "m04", "m04", "m04", "m04", "m04", "m04", "m04",…
$ cases <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
and then split sexage into two columns by specifying that the split should occur after the first value (which in our case will be after m or f).
tb4 %>%
  separate(
    col = sexage,
    into = c("sex", "agegroup"),
    sep = 1
  ) -> tb5
glimpse(tb5)
Rows: 140,460
Columns: 11
$ country <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Af…
$ iso2 <chr> "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", …
$ iso3 <chr> "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG",…
$ iso_numeric <chr> "004", "004", "004", "004", "004", "004", "004",…
$ g_whoregion <chr> "EMR", "EMR", "EMR", "EMR", "EMR", "EMR", "EMR",…
$ year <dbl> 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, …
$ new <chr> "new", "new", "new", "new", "new", "new", "new",…
$ sp <chr> "sp", "sp", "sp", "sp", "sp", "sp", "sp", "sp", …
$ sex <chr> "m", "m", "m", "m", "m", "m", "m", "m", "m", "m"…
$ agegroup <chr> "04", "04", "04", "04", "04", "04", "04", "04", …
$ cases <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
The beauty of dplyr and tidyr is the ability to combine all these steps into a single piped command!
tb %>%
  select(
    c(1:6, 25:44)
  ) %>%
  filter(
    year <= 2012
  ) %>%
  gather(
    "group",
    "cases",
    7:26
  ) %>%
  separate(
    col = group,
    into = c("new", "sp", "sexage"),
    sep = "_"
  ) %>%
  separate(
    col = sexage,
    into = c("sex", "agegroup"),
    sep = 1
  ) %>%
  select(
    c(1:6, 9:11)
  ) -> tb.df
glimpse(tb.df)
Rows: 140,460
Columns: 9
$ country <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Af…
$ iso2 <chr> "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", …
$ iso3 <chr> "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG",…
$ iso_numeric <chr> "004", "004", "004", "004", "004", "004", "004",…
$ g_whoregion <chr> "EMR", "EMR", "EMR", "EMR", "EMR", "EMR", "EMR",…
$ year <dbl> 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, …
$ sex <chr> "m", "m", "m", "m", "m", "m", "m", "m", "m", "m"…
$ agegroup <chr> "04", "04", "04", "04", "04", "04", "04", "04", …
$ cases <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
Now, we can clean-up the codes for easier interpretation.
tb.df$sex.f <- ifelse(tb.df$sex == "m", "Males", "Females")

tb.df$agegroup.f <- ifelse(tb.df$agegroup == "04", "0-4",
  ifelse(tb.df$agegroup == "014", "0-14",
    ifelse(tb.df$agegroup == "1524", "15-24",
      ifelse(tb.df$agegroup == "2534", "25-34",
        ifelse(tb.df$agegroup == "3544", "35-44",
          ifelse(tb.df$agegroup == "4554", "45-54",
            ifelse(tb.df$agegroup == "514", "5-14",
              ifelse(tb.df$agegroup == "5564", "55-64",
                ifelse(tb.df$agegroup == "65", "65+", "Unknown")))))))))
glimpse(tb.df)
Rows: 140,460
Columns: 11
$ country <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Af…
$ iso2 <chr> "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", …
$ iso3 <chr> "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG",…
$ iso_numeric <chr> "004", "004", "004", "004", "004", "004", "004",…
$ g_whoregion <chr> "EMR", "EMR", "EMR", "EMR", "EMR", "EMR", "EMR",…
$ year <dbl> 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, …
$ sex <chr> "m", "m", "m", "m", "m", "m", "m", "m", "m", "m"…
$ agegroup <chr> "04", "04", "04", "04", "04", "04", "04", "04", …
$ cases <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ sex.f <chr> "Males", "Males", "Males", "Males", "Males", "Ma…
$ agegroup.f <chr> "0-4", "0-4", "0-4", "0-4", "0-4", "0-4", "0-4",…
The same thing, albeit with case_when() …
tb.df %>%
  mutate(
    sex.f = case_when(
      sex == "m" ~ "Males",
      sex == "f" ~ "Females"
    ),
    agegroup.f = case_when(
      agegroup == "04" ~ "0-4",
      agegroup == "014" ~ "0-14",
      agegroup == "1524" ~ "15-24",
      agegroup == "2534" ~ "25-34",
      agegroup == "3544" ~ "35-44",
      agegroup == "4554" ~ "45-54",
      agegroup == "514" ~ "5-14",
      agegroup == "5564" ~ "55-64",
      agegroup == "65" ~ "65+",
      TRUE ~ "Unknown"
    )
  ) -> tb.df
glimpse(tb.df)
Rows: 140,460
Columns: 11
$ country <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Af…
$ iso2 <chr> "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", …
$ iso3 <chr> "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG",…
$ iso_numeric <chr> "004", "004", "004", "004", "004", "004", "004",…
$ g_whoregion <chr> "EMR", "EMR", "EMR", "EMR", "EMR", "EMR", "EMR",…
$ year <dbl> 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, …
$ sex <chr> "m", "m", "m", "m", "m", "m", "m", "m", "m", "m"…
$ agegroup <chr> "04", "04", "04", "04", "04", "04", "04", "04", …
$ cases <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ sex.f <chr> "Males", "Males", "Males", "Males", "Males", "Ma…
$ agegroup.f <chr> "0-4", "0-4", "0-4", "0-4", "0-4", "0-4", "0-4",…
How many cases per country in the 1980-1990 period?
tb.df %>%
  filter(
    year >= 1980, year <= 1990
  ) %>%
  group_by(country) %>%
  summarise(
    ncases = sum(cases, na.rm = TRUE)
  ) %>%
  arrange(-ncases) -> tbtab.01
glimpse(tbtab.01)
Rows: 219
Columns: 2
$ country <chr> "India", "China", "Indonesia", "South Africa", "Bang…
$ ncases <dbl> 6492850, 5643738, 1886657, 1435147, 1072431, 915525,…
What about by sex and age-group?
tb.df %>%
  filter(
    year >= 1980, year <= 1990
  ) %>%
  group_by(
    country, sex, agegroup
  ) %>%
  summarise(
    ncases = sum(cases, na.rm = TRUE)
  ) %>%
  arrange(-ncases) -> tbtab.02
This is just one example of how dplyr and tidyr – two packages from the tidyverse – can be used to deal with what seems to be quite messy data. The more you use them, the easier it becomes to figure out solutions to even very complicated messy data setups.
{forcats}
The tidyverse team has another package for working with factors –
forcats
. How is it useful?
For example, say I have the mtcars
data and want to see
the distribution of the number of cylinders a car has.
data(mtcars)
cyl is stored as a numeric, so I can flip it into a factor in two ways: with base factor() or with fct_recode().
library(forcats)
mtcars %>%
  mutate(
    f.cyl = factor(cyl)
  ) -> mtcars.df

mtcars.df %>%
  mutate(
    f.cyl2 = fct_recode(
      f.cyl,
      Four = '4',
      Six = '6',
      Eight = '8'
    )
  ) -> mtcars.df
I am going to plot mpg per car but before I do so, I would like to know what model is being referred to. The mtcars.df$model <- row.names(mtcars) command pulls the model details stored as row.names in mtcars and creates a new column called model with this information. Note: You will need to install the {hrbrthemes} package before you plot.

mtcars.df$model <- row.names(mtcars)
library(ggplot2)

ggplot(
  mtcars.df,
  aes(
    x = mpg,
    y = model
  )
) +
  geom_point() +
  labs(
    x = "Miles per gallon",
    y = "Car Model"
  ) +
  hrbrthemes::theme_ipsum_rc() +
  theme(legend.position = "none")
This isn’t very helpful since the models are not arranged in
ascending/descending order of miles per gallon. We can easily fix this
problem with fct_reorder(...)
…
ggplot(
  mtcars.df,
  aes(
    x = mpg,
    y = fct_reorder(
      .f = model,
      .x = mpg
    )
  )
) +
  geom_point() +
  labs(
    x = "Miles per gallon",
    y = "Car Model"
  ) +
  hrbrthemes::theme_ipsum_rc() +
  theme(legend.position = "none")
If I want the plot to be in descending order of mpg,

ggplot(
  mtcars.df,
  aes(
    x = mpg,
    y = fct_reorder(
      .f = model,
      .x = -mpg
    )
  )
) +
  geom_point() +
  labs(
    x = "Miles per gallon",
    y = "Car Model"
  ) +
  hrbrthemes::theme_ipsum_rc() +
  theme(legend.position = "none")
I can do this with bar-plots too.
ggplot(
  mtcars.df,
  aes(f.cyl2)
) +
  geom_bar(
    aes(
      fill = f.cyl2
    )
  ) +
  theme(legend.position = "none") +
  labs(
    x = "Number of cylinders",
    y = "Frequency"
  ) +
  hrbrthemes::theme_ipsum_rc() +
  theme(legend.position = "none")
And now the bars ordered by decreasing frequency, and then by increasing frequency.
ggplot(
  mtcars.df,
  aes(fct_infreq(f.cyl2))
) +
  geom_bar(
    aes(fill = f.cyl2)
  ) +
  theme(legend.position = "none") +
  labs(
    x = "Number of cylinders",
    y = "Frequency"
  ) +
  hrbrthemes::theme_ipsum_rc() +
  theme(legend.position = "none")

ggplot(
  mtcars.df,
  aes(fct_rev(fct_infreq(f.cyl2)))
) +
  geom_bar(
    aes(fill = f.cyl2)
  ) +
  theme(legend.position = "none") +
  labs(
    x = "Number of cylinders",
    y = "Frequency"
  ) +
  hrbrthemes::theme_ipsum_rc() +
  theme(legend.position = "none")
Note the use of fct_infreq()
for descending and then
fct_rev(fct_infreq())
for ascending.
Take the pew
data shown to you early on in this module
and using dplyr and tidyr, tidy it up so that the final data-set (a) has
three columns — religion, income-group, frequency, and (b) is arranged
in ascending order of religion.
Tidy these weather data
read.delim(
  file = "http://stat405.had.co.nz/data/weather.txt",
  stringsAsFactors = FALSE
) -> weather
The tidy data should have the days as rows and TMIN and TMAX as columns (i.e., you are going from wide to long).
Download these
data from the World bank’s World Development Indicators and Climate
Change Knowledge Portal on climate systems, exposure to climate impacts,
resilience, greenhouse gas emissions, and energy use. The data are in a
worksheet called, wait for it, Data
, and need to be tidied
up such that for every country you have one row per year and then
columns for each of the 58 measures (called Series name
in
the database).
If you really want to challenge yourself, create separate data frames
from the Country
worksheet and the Series
worksheet. Then merge these with the data frame you tidied up.
If you really dig into dplyr, you will find little nuggets here and there that are not hidden, but for some reason few ever seem to blog about them or post examples of them in use. I’ve cobbled together a few examples, inspired largely by Suzan Baert and then by the Teradata documentation. It is also useful to bear in mind that packages continue to evolve and can, at times, deprecate or render obsolete code that worked in the past.
select_if, select_at, select_all
Let us start with select_if
. This verb will allow you to
select columns to operate on if some criterion is met. For example,
perhaps I only want to pull out character columns:
cmh.df %>%
  select_if(is.character) %>%
  names()
[1] "reporting_airline" "iata_code_reporting_airline"
[3] "tail_number" "origin"
[5] "origin_city_name" "origin_state"
[7] "origin_state_fips" "origin_state_name"
[9] "dest" "dest_city_name"
[11] "dest_state" "dest_state_fips"
[13] "dest_state_name" "crs_dep_time"
[15] "dep_time" "dep_time_blk"
[17] "wheels_off" "wheels_on"
[19] "crs_arr_time" "arr_time"
[21] "arr_time_blk" "cancellation_code"
[23] "first_dep_time" "div1airport"
[25] "div1wheels_on" "div1wheels_off"
[27] "div1tail_num"
library(lubridate)
cmh.df %>%
  select_if(is.Date) %>%
  names()
[1] "flight_date"
You can also include seemingly complicated conditions, as shown below where we ask that not only should the column be numeric but that the column’s mean should exceed 10 (an arbitrary value chosen to showcase this example).
cmh.df %>%
  select_if(
    ~ is.numeric(.) & mean(., na.rm = TRUE) > 10
  ) %>%
  names()
[1] "year"
[2] "dayof_month"
[3] "dot_id_reporting_airline"
[4] "flight_number_reporting_airline"
[5] "origin_airport_id"
[6] "origin_airport_seq_id"
[7] "origin_city_market_id"
[8] "origin_wac"
[9] "dest_airport_id"
[10] "dest_airport_seq_id"
[11] "dest_city_market_id"
[12] "dest_wac"
[13] "dep_delay_minutes"
[14] "taxi_out"
[15] "arr_delay_minutes"
[16] "crs_elapsed_time"
[17] "actual_elapsed_time"
[18] "air_time"
[19] "distance"
[20] "carrier_delay"
[21] "nas_delay"
[22] "late_aircraft_delay"
[23] "total_add_g_time"
[24] "longest_add_g_time"
[25] "div_actual_elapsed_time"
[26] "div_arr_delay"
[27] "div_distance"
[28] "div1airport_id"
[29] "div1airport_seq_id"
[30] "div1total_g_time"
[31] "div1longest_g_time"
You can also select with other conditions, for example, if the column has fewer than some number of distinct values (here I chose 5)
cmh.df %>%
  select_if(
    ~ n_distinct(.) < 5
  ) %>%
  names()
[1] "year" "quarter"
[3] "dep_del15" "arr_del15"
[5] "cancelled" "cancellation_code"
[7] "diverted" "flights"
[9] "div_airport_landings" "div_reached_dest"
[11] "div2airport" "div2airport_id"
[13] "div2airport_seq_id" "div2wheels_on"
[15] "div2total_g_time" "div2longest_g_time"
[17] "div2wheels_off" "div2tail_num"
[19] "div3airport" "div3airport_id"
[21] "div3airport_seq_id" "div3wheels_on"
[23] "div3total_g_time" "div3longest_g_time"
[25] "div3wheels_off" "div3tail_num"
[27] "div4airport" "div4airport_id"
[29] "div4airport_seq_id" "div4wheels_on"
[31] "div4total_g_time" "div4longest_g_time"
[33] "div4wheels_off" "div4tail_num"
[35] "div5airport" "div5airport_id"
[37] "div5airport_seq_id" "div5wheels_on"
[39] "div5total_g_time" "div5longest_g_time"
[41] "div5wheels_off" "div5tail_num"
[43] "x110"
Don’t forget everything()
…You can rearrange columns
more swiftly too, as shown below where the first two columns I want are
explicitly specified and then all remaining columns via
everything()
cmh.df %>%
  select(
    flight_date, flight_number_reporting_airline,
    everything()
  ) -> df.ev
head(df.ev[, c(1:6)], 15)
flight_date flight_number_reporting_airline id year quarter month
1 2017-01-02 2419 1 2017 1 1
2 2017-01-02 2583 1 2017 1 1
3 2017-01-03 977 1 2017 1 1
4 2017-01-03 977 1 2017 1 1
5 2017-01-03 1119 1 2017 1 1
6 2017-01-03 1170 1 2017 1 1
7 2017-01-03 1276 1 2017 1 1
8 2017-01-03 1276 1 2017 1 1
9 2017-01-03 1327 1 2017 1 1
10 2017-01-03 1405 1 2017 1 1
11 2017-01-03 1405 1 2017 1 1
12 2017-01-03 1533 1 2017 1 1
13 2017-01-03 1533 1 2017 1 1
14 2017-01-03 1736 1 2017 1 1
15 2017-01-03 1897 1 2017 1 1
If you want to make some sweeping changes across all columns, you can do that with select_all(). Here we flip the column names to uppercase, and then replace the underscore (_) with nothing at all.
cmh.df %>%
  select_all(toupper) -> df.up
names(df.up[, 1:10])
[1] "ID" "YEAR"
[3] "QUARTER" "MONTH"
[5] "DAYOF_MONTH" "DAY_OF_WEEK"
[7] "FLIGHT_DATE" "REPORTING_AIRLINE"
[9] "DOT_ID_REPORTING_AIRLINE" "IATA_CODE_REPORTING_AIRLINE"
cmh.df %>%
  select_all(toupper) %>%
  select_all(~str_replace_all(., "_", "")) -> df.up
names(df.up[, 1:10])
[1] "ID" "YEAR"
[3] "QUARTER" "MONTH"
[5] "DAYOFMONTH" "DAYOFWEEK"
[7] "FLIGHTDATE" "REPORTINGAIRLINE"
[9] "DOTIDREPORTINGAIRLINE" "IATACODEREPORTINGAIRLINE"
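Note that newer releases of dplyr supersede the select_all() renaming tricks with rename_with(); a sketch on a toy tibble (cmh.df would work the same way):

```r
library(dplyr)
library(stringr)

# A toy two-column tibble standing in for cmh.df
toy <- tibble::tibble(dep_time = 1, arr_time = 2)

toy %>%
  rename_with(toupper) %>%                        # uppercase all names
  rename_with(~ str_replace_all(., "_", "")) %>%  # strip the underscores
  names()                                         # "DEPTIME" "ARRTIME"
```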
In contrast, select_at()
will allow you to operate on
variables that meet a specific criterion. For example, here, only
columns with the string ar
in their column name or that
have column names starting with c
are being selected and
flipped to uppercase.
mtcars %>%
  select_at(
    vars(contains("ar"), starts_with("c")),
    toupper
  ) %>%
  head()
GEAR CARB CYL
Mazda RX4 4 4 6
Mazda RX4 Wag 4 4 6
Datsun 710 4 1 4
Hornet 4 Drive 3 1 6
Hornet Sportabout 3 2 8
Valiant 3 1 6
rownames_to_column
This is a handy function to flip a data-frame’s rownames into a column
head(mtcars[, c(1:4)])
mpg cyl disp hp
Mazda RX4 21.0 6 160 110
Mazda RX4 Wag 21.0 6 160 110
Datsun 710 22.8 4 108 93
Hornet 4 Drive 21.4 6 258 110
Hornet Sportabout 18.7 8 360 175
Valiant 18.1 6 225 105
mtcars %>%
  rownames_to_column() %>%
  head(3)
rowname mpg cyl disp hp drat wt qsec vs am gear carb
1 Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
2 Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
3 Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
mutate_at() and mutate_if()
mutate_at() will allow you to mutate specific columns when the specified condition is met. For example, if the column name contains “el” I want the contents of the column to be rendered lowercase.
mtcars %>%
  mutate_at(
    vars(contains("el")),
    tolower
  ) -> mut.df
head(mut.df)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
Or if the column name contains “pg”, I want the column’s values to be divided by 10
mtcars %>%
  mutate_at(
    vars(contains("pg")),
    ~ (. / 10)
  ) -> mut.df
head(mut.df)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 2.10 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 2.10 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 2.28 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 2.14 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 1.87 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 1.81 6 225 105 2.76 3.460 20.22 1 0 3 1
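Newer versions of dplyr supersede the mutate_at()/mutate_if() family with across(); the division-by-10 example above could be sketched as:

```r
library(dplyr)

# Divide every column whose name contains "pg" by 10
mtcars %>%
  mutate(across(contains("pg"), ~ . / 10)) -> mut.df

head(mut.df$mpg) # 2.10 2.10 2.28 2.14 1.87 1.81
```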
Here, if a column is a factor, that column’s values will be converted to uppercase.
mtcars %>%
  mutate_if(
    is.factor,
    toupper
  ) -> mut.df
head(mut.df)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
na_if()
Need to convert values that meet specific criteria into NA? Here we
are asking that if any variable mpg through vs
has the
value of 110
, this value should be converted to
NA
(a missing value).
mtcars %>%
  select(mpg:vs) %>%
  na_if(110) -> naif.df
head(naif.df)
mpg cyl disp hp drat wt qsec vs
Mazda RX4 21.0 6 160 NA 3.90 2.620 16.46 0
Mazda RX4 Wag 21.0 6 160 NA 3.90 2.875 17.02 0
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1
Hornet 4 Drive 21.4 6 258 NA 3.08 3.215 19.44 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0
Valiant 18.1 6 225 105 2.76 3.460 20.22 1
rowwise()
This comes in handy if you have to operate on two or more columns but want the calculation done for each row. Here the average of mpg and qsec is being calculated for each model.
mtcars %>%
  rowwise() %>%
  mutate(
    average = mean(c(mpg, qsec))
  ) -> row.df
head(row.df[, c(1, 7, 12)])
# A tibble: 6 × 3
# Rowwise:
mpg qsec average
<dbl> <dbl> <dbl>
1 21 16.5 18.7
2 21 17.0 19.0
3 22.8 18.6 20.7
4 21.4 19.4 20.4
5 18.7 17.0 17.9
6 18.1 20.2 19.2
between()
As the name suggests, this allows you to filter for values that fall between the specified range (here mpg of 21 to 24). These boundary values are included in the filtered results
mtcars %>%
  filter(
    between(mpg, 21, 24)
  ) -> between.df
between.df
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
near()
This allows you to filter within a range of values, the range
specified via tol
. Here, for example, we are asking for any
case where mpg is inside
the \(\pm 0.5\) range of \(mpg = 21\), i.e., \(20.5 < x < 21.5\)
mtcars %>%
  filter(
    near(mpg, 21, tol = 0.5)
  ) -> near.df
near.df
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Volvo 142E 21.4 4 121 109 4.11 2.780 18.60 1 1 4 2
mtcars %>%
  filter(
    near(mpg, 21, tol = 0.5 * sd(mpg))
  ) -> near.df
head(near.df)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
xor()
This keeps rows where exactly one of the two conditions is true. Compare it with filtering on mpg < 22 | cyl < 6 (shown second below), which also retains rows where both conditions hold.
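A quick sketch of xor() on bare logical vectors makes the exclusive-or behavior plain:

```r
# TRUE only when exactly one of the paired values is TRUE
xor(c(TRUE, TRUE, FALSE), c(TRUE, FALSE, FALSE))
# FALSE  TRUE FALSE
```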
mtcars %>%
  filter(
    xor(mpg < 22, cyl < 6)
  ) -> xor.df
head(xor.df)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
mtcars %>%
  filter(
    mpg < 22 | cyl < 6
  ) -> nonxor.df
head(nonxor.df)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
filter_all()
This will filter all columns based on the condition specified, for example, finding all instances where baby names include the string “nnie”
library(babynames)
babynames %>%
  filter_all(
    any_vars(
      str_detect(., pattern = "nnie")
    )
  ) -> filt.df
head(filt.df)
# A tibble: 6 × 5
year sex name n prop
<dbl> <chr> <chr> <int> <dbl>
1 1880 F Minnie 1746 0.0179
2 1880 F Annie 1258 0.0129
3 1880 F Jennie 793 0.00812
4 1880 F Fannie 560 0.00574
5 1880 F Nannie 248 0.00254
6 1880 F Winnie 49 0.000502
babynames %>%
  select_if(is.numeric) %>%
  filter_all(
    any_vars(
      . == 1880
    )
  ) -> filt.df
head(filt.df)
# A tibble: 6 × 3
year n prop
<dbl> <int> <dbl>
1 1880 7065 0.0724
2 1880 2604 0.0267
3 1880 2003 0.0205
4 1880 1939 0.0199
5 1880 1746 0.0179
6 1880 1578 0.0162
This has retained only those rows where at least one column/variable has a value of 1880
filter_if()
This will allow you to filter if the logical condition is met, here if the column is a character column and if any character column has a missing value
starwars %>%
  filter_if(
    is.character,
    any_vars(is.na(.))
  ) -> filt.df
head(filt.df)
# A tibble: 6 × 14
name height mass hair_color skin_color eye_color birth_year sex
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
1 C-3PO 167 75 <NA> gold yellow 112 none
2 R2-D2 96 32 <NA> white, bl… red 33 none
3 R5-D4 97 32 <NA> white, red red NA none
4 Greedo 173 74 <NA> green black 44 male
5 Jabba… 175 1358 <NA> green-tan… orange 600 herm…
6 Yoda 66 17 white green brown 896 male
# … with 6 more variables: gender <chr>, homeworld <chr>,
# species <chr>, films <list>, vehicles <list>, starships <list>
filter_at()
This will come in very handy if you want to filter specific variables if a condition is met
mtcars %>%
  filter_at(
    vars(mpg, qsec),
    all_vars(. > 19)
  ) -> filt.df
head(filt.df)
mpg cyl disp hp drat wt qsec vs am gear carb
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
This looked for mpg and qsec and then retained rows where both variables had a value > 19
mtcars %>%
  filter_at(
    vars(contains("t")),
    any_vars(. > 3)
  ) -> filt.df
head(filt.df)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
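In newer dplyr the filter_at() calls above are superseded by if_all() and if_any() inside a plain filter(); sketches of both:

```r
library(dplyr)

# Keep rows where BOTH mpg and qsec exceed 19
mtcars %>%
  filter(if_all(c(mpg, qsec), ~ . > 19)) -> both.df

# Keep rows where ANY column whose name contains "t" exceeds 3
mtcars %>%
  filter(if_any(contains("t"), ~ . > 3)) -> any.df
```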
Here we asked for columns with the letter t in their name, and then retained rows where any of these selected columns had a value > 3
summarise_all()
Here we are asking that means be calculated for all selected
variables – mpg:qsec
– each on its own, of course
mtcars %>%
  select(mpg:qsec) %>%
  summarise_all(mean, na.rm = TRUE) -> summ.df
summ.df
mpg cyl disp hp drat wt qsec
1 20.09062 6.1875 230.7219 146.6875 3.596563 3.21725 17.84875
summarise_if()
Here we are asking for the means of specific variables – if they are numeric – to be calculated
mtcars %>%
  summarise_if(
    is.numeric, mean, na.rm = TRUE
  ) %>%
  rename_if(
    is.numeric, ~paste0("mean_", .)
  ) -> summ.df
summ.df
mean_mpg mean_cyl mean_disp mean_hp mean_drat mean_wt mean_qsec
1 20.09062 6.1875 230.7219 146.6875 3.596563 3.21725 17.84875
mean_vs mean_am mean_gear mean_carb
1 0.4375 0.40625 3.6875 2.8125
rename_if()
is being used to append recognizable labels
to the calculated values
summarise_at()
Here we are asking for the means of specific variables – those with the word “Delay” in their name – to be calculated
cmh.df %>%
  summarise_at(
    vars(contains("delay")), mean, na.rm = TRUE
  ) %>%
  rename_at(
    vars(contains("delay")), ~paste0("mean_", .)
  ) -> summ.df
summ.df
mean_dep_delay mean_dep_delay_minutes mean_departure_delay_groups
1 8.756439 11.54685 -0.03739762
mean_arr_delay mean_arr_delay_minutes mean_arrival_delay_groups
1 2.601107 11.07277 -0.3285672
mean_carrier_delay mean_weather_delay mean_nas_delay
1 18.72072 2.676069 12.33165
mean_security_delay mean_late_aircraft_delay mean_div_arr_delay
1 0.04338477 26.01961 178.8784
rename_at()
is being used to append recognizable labels
to the calculated values
top_n()
This will give you the top
or bottom
n
values (here n has been set to 7)
cmh.df %>%
  group_by(reporting_airline) %>%
  filter(dest == "CMH") %>%
  summarise(
    average.delay = mean(arr_delay, na.rm = TRUE)
  ) %>%
  arrange(-average.delay) %>%
  top_n(7)
# A tibble: 7 × 2
reporting_airline average.delay
<chr> <dbl>
1 F9 7.54
2 WN 5.20
3 AA 4.92
4 OO 4.66
5 EV 0.966
6 DL -1.01
7 UA -2.27
cmh.df %>%
  group_by(reporting_airline) %>%
  filter(dest == "CMH") %>%
  summarise(
    average.delay = mean(arr_delay, na.rm = TRUE)
  ) %>%
  arrange(average.delay) %>%
  top_n(-7)
# A tibble: 7 × 2
reporting_airline average.delay
<chr> <dbl>
1 UA -2.27
2 DL -1.01
3 EV 0.966
4 OO 4.66
5 AA 4.92
6 WN 5.20
7 F9 7.54
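top_n() has since been superseded in newer dplyr by slice_max() and slice_min(); a sketch on mtcars (the cmh.df query above would follow the same pattern):

```r
library(dplyr)

# The 3 cars with the highest mpg
mtcars %>%
  slice_max(mpg, n = 3)

# The 3 cars with the lowest mpg
mtcars %>%
  slice_min(mpg, n = 3)
```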
separate_rows
If you need to separate values accidentally stored in a single column
into rows, use separate_rows
library(tibble)

df <- tibble(
  x = 1:3,
  y = c("a", "d,e,f", "g,h"),
  z = c("1", "2,3,4", "5,6")
)
df
# A tibble: 3 × 3
x y z
<int> <chr> <chr>
1 1 a 1
2 2 d,e,f 2,3,4
3 3 g,h 5,6
separate_rows(df, y, z, convert = TRUE)
# A tibble: 6 × 3
x y z
<int> <chr> <int>
1 1 a 1
2 2 d 2
3 2 e 3
4 2 f 4
5 3 g 5
6 3 h 6
join: an alternative to merge
dplyr has its own commands that execute what merge() did for us in the previous module. Let us see how this alternative route works, but first we recreate the data we used with merge.
data.frame(
  Score = c(10, 21, 33, 12, 35, 67, 43, 99),
  ID = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "C31", "Q22")
) -> data1

data.frame(
  Sex = c("Male", "Female", "Male", "Female", "Male", "Female", "Male", "Female"),
  ID = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "E52", "H71")
) -> data2

data.frame(
  Score = c(10, 21, 33, 12, 35, 67, 43, 99),
  ID1 = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "C31", "Q22")
) -> data3

data.frame(
  Sex = c("Male", "Female", "Male", "Female", "Male", "Female", "Male", "Female"),
  ID2 = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "E52", "H71")
) -> data4

data.frame(
  Score = c(10, 21, 33, 12, 35, 67, 43, 99),
  ID = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "C31", "Q22")
) -> df1

data.frame(
  Sex = c("Male", "Female", "Male", "Female", "Male", "Female", "Male", "Female"),
  ID = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "E52", "H71")
) -> df2

data.frame(
  Age = c(6, 7, 6, 8, 8, 9, 10, 5),
  ID = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "E52", "H71")
) -> df3
inner_join
This merges based on columns that appear in both data-sets, and will only merge rows that have the same value per column. In brief, only those observations common to both data-sets are retained.
inner_join(data1, data2, by = "ID")
Score ID Sex
1 10 A12 Male
2 21 A23 Female
3 33 Z14 Male
4 12 WX1 Female
5 35 Y31 Male
6 67 D66 Female
If you do not specify the by = column, dplyr defaults to joining on all columns that share the same name in each data-set …
inner_join(data1, data2)
Score ID Sex
1 10 A12 Male
2 21 A23 Female
3 33 Z14 Male
4 12 WX1 Female
5 35 Y31 Male
6 67 D66 Female
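Recall that data3 and data4 carry the key under different names (ID1 and ID2). When the key columns are not named identically, pass an explicit mapping to by =:

```r
library(dplyr)

# Join data3 (key = ID1) to data4 (key = ID2) by mapping the key names
inner_join(data3, data4, by = c("ID1" = "ID2"))
```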
left_join
This merge will result in all rows being retained from the first data-set listed in the command, but will include only those rows from the second data-set that have the same value(s) for the by variable(s) …
data1 %>%
  left_join(data2, by = "ID")
Score ID Sex
1 10 A12 Male
2 21 A23 Female
3 33 Z14 Male
4 12 WX1 Female
5 35 Y31 Male
6 67 D66 Female
7 43 C31 <NA>
8 99 Q22 <NA>
data2 %>%
  left_join(data1, by = "ID")
Sex ID Score
1 Male A12 10
2 Female A23 21
3 Male Z14 33
4 Female WX1 12
5 Male Y31 35
6 Female D66 67
7 Male E52 NA
8 Female H71 NA
right_join
This merge will result in all rows being retained from the second data-set listed in the command, but will include only those rows from the first data-set that have the same value(s) for the by variable(s) …
data1 %>%
  right_join(data2, by = "ID")
Score ID Sex
1 10 A12 Male
2 21 A23 Female
3 33 Z14 Male
4 12 WX1 Female
5 35 Y31 Male
6 67 D66 Female
7 NA E52 Male
8 NA H71 Female
data2 %>%
  right_join(data1, by = "ID")
Sex ID Score
1 Male A12 10
2 Female A23 21
3 Male Z14 33
4 Female WX1 12
5 Male Y31 35
6 Female D66 67
7 <NA> C31 43
8 <NA> Q22 99
full_join
This merge keeps all rows from each data-set, with those only found
in one data-set being given NA
values as necessary.
data1 %>%
  full_join(data2, by = "ID")
Score ID Sex
1 10 A12 Male
2 21 A23 Female
3 33 Z14 Male
4 12 WX1 Female
5 35 Y31 Male
6 67 D66 Female
7 43 C31 <NA>
8 99 Q22 <NA>
9 NA E52 Male
10 NA H71 Female
semi_join
What if you only want the observations in one data-set that have a match in the other, while keeping only the first data-set’s columns? Use a semi_join(). In the examples below, you will notice that the column Sex is not included in the first result since it is only found in data2, but it is present in the second example since data2 is the first data-set passed to semi_join().
data.frame(
  Score = c(10, 21, 33, 12, 35, 67, 43, 99),
  ID = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "C31", "Q22")
) -> data1

data.frame(
  Score = c(10, 21, 33, 12, 35, 67, NA, NA),
  Sex = c("Male", "Female", "Male", "Female", "Male", "Female", "Male", "Female"),
  ID = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "E52", "H71")
) -> data2
semi_join(data1, data2)
Score ID
1 10 A12
2 21 A23
3 33 Z14
4 12 WX1
5 35 Y31
6 67 D66
semi_join(data2, data1, by = "ID")
Score Sex ID
1 10 Male A12
2 21 Female A23
3 33 Male Z14
4 12 Female WX1
5 35 Male Y31
6 67 Female D66
anti_join
This will only retain records not found in the second data-set listed.
anti_join(data1, data2)
Score ID
1 43 C31
2 99 Q22
anti_join(data2, data1)
Score Sex ID
1 NA Male E52
2 NA Female H71
mutate/transmute
You can create new columns by executing vectorized functions that take vectors of some length as the input and return vectors of the same length as the output. These functions are shown below.
tibble::tibble(
  id = c(rep("A", 3), rep("B", 4), rep("C", 1), rep("D", 4)),
  x = c(seq(1, 10, by = 1), -1, -2),
  y = c(seq(11, 20, by = 1), -11, -12)
) -> a.df
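The heading also names transmute(): it works exactly like mutate() but keeps only the columns you name, dropping everything else. A sketch on a.df:

```r
library(dplyr)

# Keep id and create xy.sum; x and y themselves are dropped
a.df %>%
  transmute(
    id,
    xy.sum = x + y
  )
```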
lag() versus lead()
To lag some value implies to assign to the current observation (i.e., row) the preceding value. Similarly, to lead implies to assign to the current observation (i.e., row) the value from a subsequent row. If I lead by 1, then I am assigning the value from the next row, lead by 2 then I am assigning the value from 2 rows ahead, and so on.
Say I want to lag the values of x by 1, first regardless of
id
but then within each id
.
a.df %>%
  mutate(
    x.lag = lag(x, 1)
  )
# A tibble: 12 × 4
id x y x.lag
<chr> <dbl> <dbl> <dbl>
1 A 1 11 NA
2 A 2 12 1
3 A 3 13 2
4 B 4 14 3
5 B 5 15 4
6 B 6 16 5
7 B 7 17 6
8 C 8 18 7
9 D 9 19 8
10 D 10 20 9
11 D -1 -11 10
12 D -2 -12 -1
a.df %>%
  group_by(id) %>%
  mutate(
    x.lag = lag(x, 1)
  )
# A tibble: 12 × 4
# Groups: id [4]
id x y x.lag
<chr> <dbl> <dbl> <dbl>
1 A 1 11 NA
2 A 2 12 1
3 A 3 13 2
4 B 4 14 NA
5 B 5 15 4
6 B 6 16 5
7 B 7 17 6
8 C 8 18 NA
9 D 9 19 NA
10 D 10 20 9
11 D -1 -11 10
12 D -2 -12 -1
Similarly, I can also lead by a specific “period”
a.df %>%
  mutate(
    x.lead = lead(x, 2)
  )
# A tibble: 12 × 4
id x y x.lead
<chr> <dbl> <dbl> <dbl>
1 A 1 11 3
2 A 2 12 4
3 A 3 13 5
4 B 4 14 6
5 B 5 15 7
6 B 6 16 8
7 B 7 17 9
8 C 8 18 10
9 D 9 19 -1
10 D 10 20 -2
11 D -1 -11 NA
12 D -2 -12 NA
a.df %>%
  group_by(id) %>%
  mutate(
    y.lead = lead(y, 1)
  )
# A tibble: 12 × 4
# Groups: id [4]
id x y y.lead
<chr> <dbl> <dbl> <dbl>
1 A 1 11 12
2 A 2 12 13
3 A 3 13 NA
4 B 4 14 15
5 B 5 15 16
6 B 6 16 17
7 B 7 17 NA
8 C 8 18 NA
9 D 9 19 20
10 D 10 20 -11
11 D -1 -11 -12
12 D -2 -12 NA
We can also calculate various aggregates by cumulating over/across columns.
a.df %>%
  mutate(
    c = cummin(x) # c in any row is the smallest value observed thus far
  )
# A tibble: 12 × 4
id x y c
<chr> <dbl> <dbl> <dbl>
1 A 1 11 1
2 A 2 12 1
3 A 3 13 1
4 B 4 14 1
5 B 5 15 1
6 B 6 16 1
7 B 7 17 1
8 C 8 18 1
9 D 9 19 1
10 D 10 20 1
11 D -1 -11 -1
12 D -2 -12 -2
a.df %>%
  mutate(
    c = cummax(x) # c in any row is the largest value observed thus far
  )
# A tibble: 12 × 4
id x y c
<chr> <dbl> <dbl> <dbl>
1 A 1 11 1
2 A 2 12 2
3 A 3 13 3
4 B 4 14 4
5 B 5 15 5
6 B 6 16 6
7 B 7 17 7
8 C 8 18 8
9 D 9 19 9
10 D 10 20 10
11 D -1 -11 10
12 D -2 -12 10
a.df %>%
  mutate(
    c = cummean(x) # c is the average of the current + preceding value(s)
  )
# A tibble: 12 × 4
id x y c
<chr> <dbl> <dbl> <dbl>
1 A 1 11 1
2 A 2 12 1.5
3 A 3 13 2
4 B 4 14 2.5
5 B 5 15 3
6 B 6 16 3.5
7 B 7 17 4
8 C 8 18 4.5
9 D 9 19 5
10 D 10 20 5.5
11 D -1 -11 4.91
12 D -2 -12 4.33
a.df %>%
  mutate(
    c = cumsum(x) # c is the cumulative sum of the current + preceding value(s)
  )
# A tibble: 12 × 4
id x y c
<chr> <dbl> <dbl> <dbl>
1 A 1 11 1
2 A 2 12 3
3 A 3 13 6
4 B 4 14 10
5 B 5 15 15
6 B 6 16 21
7 B 7 17 28
8 C 8 18 36
9 D 9 19 45
10 D 10 20 55
11 D -1 -11 54
12 D -2 -12 52
a.df %>%
  mutate(
    c = cumprod(x) # c is the cumulative product of the current value of x and the preceding value of c
  )
# A tibble: 12 × 4
id x y c
<chr> <dbl> <dbl> <dbl>
1 A 1 11 1
2 A 2 12 2
3 A 3 13 6
4 B 4 14 24
5 B 5 15 120
6 B 6 16 720
7 B 7 17 5040
8 C 8 18 40320
9 D 9 19 362880
10 D 10 20 3628800
11 D -1 -11 -3628800
12 D -2 -12 7257600
Say we have information on bank balance we hold by date. If the
balance is negative, then we have leaned on an overdraft. We can use
cumany()
and cumall()
to extract rows once a
condition is met.
data.frame(
  date = as.Date("2020-01-01") + 0:6,
  balance = c(100, 50, 25, -25, -50, 30, 120)
) -> df
df
date balance
1 2020-01-01 100
2 2020-01-02 50
3 2020-01-03 25
4 2020-01-04 -25
5 2020-01-05 -50
6 2020-01-06 30
7 2020-01-07 120
# show me all rows starting with when the first overdraft occurs
df %>%
  filter(
    cumany(balance < 0)
  )
date balance
1 2020-01-04 -25
2 2020-01-05 -50
3 2020-01-06 30
4 2020-01-07 120
# show me all rows preceding the first overdraft
df %>%
  filter(
    cumall(!(balance < 0))
  )
date balance
1 2020-01-01 100
2 2020-01-02 50
3 2020-01-03 25
These come in handy when we need to rank observations based on some criterion.
cume_dist()
a.df %>%
  mutate(
    d = cume_dist(x) # gives the cumulative proportion of all values of x less than or equal to the current value
  )
# A tibble: 12 × 4
id x y d
<chr> <dbl> <dbl> <dbl>
1 A 1 11 0.25
2 A 2 12 0.333
3 A 3 13 0.417
4 B 4 14 0.5
5 B 5 15 0.583
6 B 6 16 0.667
7 B 7 17 0.75
8 C 8 18 0.833
9 D 9 19 0.917
10 D 10 20 1
11 D -1 -11 0.167
12 D -2 -12 0.0833
Note that -2 is the smallest value so it is ranked first, and the proportion becomes d = (1/12). -1 is bigger than -2, hence ranked second. Now d = (2/12) = 0.167 because two observations have a value of x equal to -1 or less. And so on …
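The same arithmetic can be checked by hand on a tiny vector:

```r
# Each result is the share of values less than or equal to the current one
dplyr::cume_dist(c(-2, -1, 3)) # 1/3, 2/3, 3/3
```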
row_number()
Say we have multiple observations for some group and want to flag whether a specific observation is the first, second, third, etc. within the group.
a.df %>%
  group_by(id) %>%
  mutate(
    flag_row = row_number()
  )
# A tibble: 12 × 4
# Groups: id [4]
id x y flag_row
<chr> <dbl> <dbl> <int>
1 A 1 11 1
2 A 2 12 2
3 A 3 13 3
4 B 4 14 1
5 B 5 15 2
6 B 6 16 3
7 B 7 17 4
8 C 8 18 1
9 D 9 19 1
10 D 10 20 2
11 D -1 -11 3
12 D -2 -12 4
If you want to flag in ascending or descending order of some variable, then you can run the following. When you do, pay attention to what happens with D, since the smallest value of x is flagged as the first observation and NOT the order in which x appears in the data-set.
a.df %>%
  group_by(id) %>%
  mutate(
    flag_row = row_number(x)
  )
# A tibble: 12 × 4
# Groups: id [4]
id x y flag_row
<chr> <dbl> <dbl> <int>
1 A 1 11 1
2 A 2 12 2
3 A 3 13 3
4 B 4 14 1
5 B 5 15 2
6 B 6 16 3
7 B 7 17 4
8 C 8 18 1
9 D 9 19 3
10 D 10 20 4
11 D -1 -11 2
12 D -2 -12 1
min_rank() and percent_rank()
This allows you to flag the minimum rank for an observation as a rank or as a proportion
a.df %>%
  mutate(
    min.rank = min_rank(x)
  )
# A tibble: 12 × 4
id x y min.rank
<chr> <dbl> <dbl> <int>
1 A 1 11 3
2 A 2 12 4
3 A 3 13 5
4 B 4 14 6
5 B 5 15 7
6 B 6 16 8
7 B 7 17 9
8 C 8 18 10
9 D 9 19 11
10 D 10 20 12
11 D -1 -11 2
12 D -2 -12 1
a.df %>%
  group_by(id) %>%
  mutate(
    min.rank = min_rank(x)
  )
# A tibble: 12 × 4
# Groups: id [4]
id x y min.rank
<chr> <dbl> <dbl> <int>
1 A 1 11 1
2 A 2 12 2
3 A 3 13 3
4 B 4 14 1
5 B 5 15 2
6 B 6 16 3
7 B 7 17 4
8 C 8 18 1
9 D 9 19 3
10 D 10 20 4
11 D -1 -11 2
12 D -2 -12 1
a.df %>%
  mutate(
    pct.rank = percent_rank(x)
  )
# A tibble: 12 × 4
id x y pct.rank
<chr> <dbl> <dbl> <dbl>
1 A 1 11 0.182
2 A 2 12 0.273
3 A 3 13 0.364
4 B 4 14 0.455
5 B 5 15 0.545
6 B 6 16 0.636
7 B 7 17 0.727
8 C 8 18 0.818
9 D 9 19 0.909
10 D 10 20 1
11 D -1 -11 0.0909
12 D -2 -12 0
a.df %>%
  group_by(id) %>%
  mutate(
    pct.rank = percent_rank(x)
  )
# A tibble: 12 × 4
# Groups: id [4]
id x y pct.rank
<chr> <dbl> <dbl> <dbl>
1 A 1 11 0
2 A 2 12 0.5
3 A 3 13 1
4 B 4 14 0
5 B 5 15 0.333
6 B 6 16 0.667
7 B 7 17 1
8 C 8 18 NaN
9 D 9 19 0.667
10 D 10 20 1
11 D -1 -11 0.333
12 D -2 -12 0
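a.df has no tied values of x, so it is worth noting how row_number() and min_rank() differ when ties are present; a sketch:

```r
# With ties, row_number() breaks them by position while min_rank() assigns
# the same (minimum) rank to each tied value
x <- c(5, 1, 3, 3)
dplyr::row_number(x) # 4 1 2 3
dplyr::min_rank(x)   # 4 1 2 2
```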
In addition, you can use first(), last(), or a specific position in a vector via nth(somevariable, position), as shown below. Note, however, that the rows are not sorted in ascending/descending order of your target variable but are instead used as they occur in the dataset.
a.df %>%
  mutate(
    first.value = first(y)
  )
# A tibble: 12 × 4
id x y first.value
<chr> <dbl> <dbl> <dbl>
1 A 1 11 11
2 A 2 12 11
3 A 3 13 11
4 B 4 14 11
5 B 5 15 11
6 B 6 16 11
7 B 7 17 11
8 C 8 18 11
9 D 9 19 11
10 D 10 20 11
11 D -1 -11 11
12 D -2 -12 11
a.df %>%
  mutate(
    last.value = last(y)
  )
# A tibble: 12 × 4
id x y last.value
<chr> <dbl> <dbl> <dbl>
1 A 1 11 -12
2 A 2 12 -12
3 A 3 13 -12
4 B 4 14 -12
5 B 5 15 -12
6 B 6 16 -12
7 B 7 17 -12
8 C 8 18 -12
9 D 9 19 -12
10 D 10 20 -12
11 D -1 -11 -12
12 D -2 -12 -12
a.df %>%
  mutate(
    third.value = nth(y, 3)
  )
# A tibble: 12 × 4
id x y third.value
<chr> <dbl> <dbl> <dbl>
1 A 1 11 13
2 A 2 12 13
3 A 3 13 13
4 B 4 14 13
5 B 5 15 13
6 B 6 16 13
7 B 7 17 13
8 C 8 18 13
9 D 9 19 13
10 D 10 20 13
11 D -1 -11 13
12 D -2 -12 13
And then, of course, there are the grouped variants:
a.df %>%
  group_by(id) %>%
  mutate(
    first.value = first(y)
  )
# A tibble: 12 × 4
# Groups: id [4]
id x y first.value
<chr> <dbl> <dbl> <dbl>
1 A 1 11 11
2 A 2 12 11
3 A 3 13 11
4 B 4 14 14
5 B 5 15 14
6 B 6 16 14
7 B 7 17 14
8 C 8 18 18
9 D 9 19 19
10 D 10 20 19
11 D -1 -11 19
12 D -2 -12 19
a.df %>%
  group_by(id) %>%
  mutate(
    last.value = last(y)
  )
# A tibble: 12 × 4
# Groups: id [4]
id x y last.value
<chr> <dbl> <dbl> <dbl>
1 A 1 11 13
2 A 2 12 13
3 A 3 13 13
4 B 4 14 17
5 B 5 15 17
6 B 6 16 17
7 B 7 17 17
8 C 8 18 18
9 D 9 19 -12
10 D 10 20 -12
11 D -1 -11 -12
12 D -2 -12 -12
a.df %>%
  group_by(id) %>%
  mutate(
    third.value = nth(y, 3)
  )
# A tibble: 12 × 4
# Groups: id [4]
id x y third.value
<chr> <dbl> <dbl> <dbl>
1 A 1 11 13
2 A 2 12 13
3 A 3 13 13
4 B 4 14 16
5 B 5 15 16
6 B 6 16 16
7 B 7 17 16
8 C 8 18 NA
9 D 9 19 -11
10 D 10 20 -11
11 D -1 -11 -11
12 D -2 -12 -11
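Two footnotes on the grouped output, sketched with small vectors that mirror groups C and D. First, group C shows NA because nth() returns a default (NA) when the requested position does not exist in the group, and C has only one row. Second, if you do want values in sorted order rather than data order, first(), last(), and nth() accept an order_by argument.

```r
library(dplyr)

# Group C has one row, so position 3 does not exist: nth() returns its default
nth(18, 3)               # NA
nth(18, 3, default = 0)  # 0, if you supply a default

# Group D's values: order_by = x picks the y paired with the smallest x
x <- c(9, 10, -1, -2)
y <- c(19, 20, -11, -12)
first(y)                # 19, first in data order
first(y, order_by = x)  # -12, first in ascending order of x
```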
For attribution, please cite this work as
Ruhil (2022, Feb. 3). Tidy and Tweak Your Data. Retrieved from https://aniruhil.org/courses/mpa6020/handouts/module04.html
BibTeX citation
@misc{ruhil2022tidy,
  author = {Ruhil, Ani},
  title = {Tidy and Tweak Your Data},
  url = {https://aniruhil.org/courses/mpa6020/handouts/module04.html},
  year = {2022}
}