+ - 0:00:00
Notes for current slide
Notes for next slide

Tidy Data

Ani Ruhil

1 / 96
2 / 96

Agenda

  • Overview of the tidyverse

"an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures."

  • dplyr

"dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges"

  • tidyr

"Tidy data describes a standard way of storing data that is used wherever possible throughout the tidyverse. If you ensure that your data is tidy, you’ll spend less time fighting with the tools and more time working on your analysis."

3 / 96
4 / 96

The core functions are:

  • mutate() ... to add new variables that are functions of existing variables

  • transmute() ... to add new variables and drop the rest of the variables

  • select() ... to pick a subset of variables to work with while dropping the rest

  • filter() ... to select a subset of observations to work with

  • summarise() ... to calculate quantities of interest (rather than add these quantities as new variables to the existing data frame) and save these as a standalone table

  • arrange() ... order the rows based on out preferences

  • distinct() ... to eliminate duplicate observations

There are several other commands that belong to dplyr ... here is the cheatsheet

Let us see dplyr in action with the xclass and xstudent data frames

load("data/xclass.RData")
load("data/xstudent.RData")
5 / 96

%>%

6 / 96

The pipe operator

The %>% operator is useful for chaining a sequence of operations into a single command

the_data <-
read.csv('/path/to/data/file.csv') %>%
subset(variable_a > x) %>%
transform(variable_c = variable_a/variable_b)

This command means ...

(1) First read the csv file THEN

(2) subset the data to cases where variable_a values are greater than some value g THEN

(3) create variable_c as the ratio of variable_a to variable_b

The result is saved in the_data

7 / 96

distinct()

See also Garrett and Hadley

8 / 96

Eliminating duplicates

You may have duplicate cases in your data set. How do you eliminate them? Let us assume we only need one record per student in xclass1

library(tidyverse)
xclass.df1 <- xclass %>%
distinct(anon_id, .keep_all = TRUE)

1: While these may well be legitimate records, for training purposes I am assuming each student should only show up once.

But a student could be enrolled on multiple campuses and have multiple majors

xclass.df2 <- xclass %>%
distinct(anon_id, campus_name, term_code, department_desc,
.keep_all = TRUE)

If more than one variable is specified in distinct() then dplyr looks for unique combinations of these variables. Each unique combination = a unique observation

.keep_all = TRUE is asking that all columns be retained after eliminating duplicate observations

9 / 96

Merging data

10 / 96

Say I have data on some students, with test scores in one file and their gender in another file. Each file has a unique student identifier, called ID. How can I create a unified data-set? Let us see each data-set first.

data1 = tibble(Score = c(10, 21, 33, 12, 35, 67, 43, 99),
ID = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "C31", "Q22")
)
data2 = tibble(Sex = c("Male", "Female", "Male", "Female", "Male", "Female", "Male", "Female"),
ID = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "E52", "H71")
)

Open up both data-sets and note that students C31 and Q22 are missing from data2 and students E52 and H71 are missing from data1.

11 / 96

... here are the two data-sets side-by-side

<div id="htmlwidget-a74f2629c9702d0961ec" style="width:100%;height:auto;" class="datatables html-widget"></div>
<script type="application/json" data-for="htmlwidget-a74f2629c9702d0961ec">{"x":{"filter":"none","data":[["1","2","3","4","5","6","7","8"],[10,21,33,12,35,67,43,99],["A12","A23","Z14","WX1","Y31","D66","C31","Q22"]],"container":"<table class=\"display\">\n <thead>\n <tr>\n <th> <\/th>\n <th>Score<\/th>\n <th>ID<\/th>\n <\/tr>\n <\/thead>\n<\/table>","options":{"columnDefs":[{"className":"dt-center","targets":[1,2]},{"orderable":false,"targets":0}],"filter":"top","rownames":false,"class":"compact","order":[],"autoWidth":false,"orderClasses":false}},"evals":[],"jsHooks":[]}</script>
<div id="htmlwidget-d1f40ace4abc0833e865" style="width:100%;height:auto;" class="datatables html-widget"></div>
<script type="application/json" data-for="htmlwidget-d1f40ace4abc0833e865">{"x":{"filter":"none","data":[["1","2","3","4","5","6","7","8"],["Male","Female","Male","Female","Male","Female","Male","Female"],["A12","A23","Z14","WX1","Y31","D66","E52","H71"]],"container":"<table class=\"display\">\n <thead>\n <tr>\n <th> <\/th>\n <th>Sex<\/th>\n <th>ID<\/th>\n <\/tr>\n <\/thead>\n<\/table>","options":{"columnDefs":[{"className":"dt-center","targets":[1,2]},{"orderable":false,"targets":0}],"filter":"top","rownames":false,"class":"compact","order":[],"autoWidth":false,"orderClasses":false}},"evals":[],"jsHooks":[]}</script>
12 / 96

Mutating joins

These combine data frames but to do so we need merge key(s) -- variable(s) that identifies unique observations in each file. Next, you need to decide on what you want.

natural join ... merge observations that show up in both files

natural <- inner_join(x = data1, y = data2, by = "ID")
natural
## # A tibble: 6 x 3
## Score ID Sex
## <dbl> <chr> <chr>
## 1 10 A12 Male
## 2 21 A23 Female
## 3 33 Z14 Male
## 4 12 WX1 Female
## 5 35 Y31 Male
## 6 67 D66 Female
13 / 96

14 / 96

full outer join ... merge everything even if some cases show up in one but not the other

full <- full_join(x = data1, y = data2, by = "ID")
full
## # A tibble: 10 x 3
## Score ID Sex
## <dbl> <chr> <chr>
## 1 10 A12 Male
## 2 21 A23 Female
## 3 33 Z14 Male
## 4 12 WX1 Female
## 5 35 Y31 Male
## 6 67 D66 Female
## 7 43 C31 <NA>
## 8 99 Q22 <NA>
## 9 NA E52 Male
## 10 NA H71 Female
15 / 96

16 / 96

left outer join ... merge such that all cases in file x are retained but only those seen in file y are joined

left <- left_join(x = data1, y = data2, by = "ID")
left
## # A tibble: 8 x 3
## Score ID Sex
## <dbl> <chr> <chr>
## 1 10 A12 Male
## 2 21 A23 Female
## 3 33 Z14 Male
## 4 12 WX1 Female
## 5 35 Y31 Male
## 6 67 D66 Female
## 7 43 C31 <NA>
## 8 99 Q22 <NA>
17 / 96

18 / 96

right outer join ... merge to keep all cases in y and only those cases are to be merged from x that show up in y

right <- right_join(x = data1, y = data2, by = "ID")
right
## # A tibble: 8 x 3
## Score ID Sex
## <dbl> <chr> <chr>
## 1 10 A12 Male
## 2 21 A23 Female
## 3 33 Z14 Male
## 4 12 WX1 Female
## 5 35 Y31 Male
## 6 67 D66 Female
## 7 NA E52 Male
## 8 NA H71 Female
19 / 96

20 / 96

What if the ID variables had different names in the files? You could rename them to have a common name or you could use by.x = and by.y = as shown below.

data3 <- tibble(Score = c(10, 21, 33, 12, 35, 67, 43, 99),
ID1 = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "C31", "Q22")
)
data4 <- tibble(Sex = c("Male", "Female", "Male", "Female", "Male", "Female", "Male", "Female"),
ID2 = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "E52", "H71")
)
diff.full <- full_join(data3, data4, by = c("ID1" = "ID2"))
21 / 96

You can also have more than one merge key. For example, if I am merging data for Ohio schools, each district has a district ID number (dirn), each school has building ID number (birn), and then the district's name (district), the building's name (building), and so on. If I am merging these data then my by = statement will be by = c("dirn", "birn", "district", "building").

Let us work through one example.

Load the sm_xstudent data frame and make sure the xclass data frame is loaded as well.

Look at sm_xstudent ... it has six variables, three of which are also in xclass -- term_code, anon_id, and sex_code. The other three variables are not present in xclass -- ethnicity_desc, residency_code, and university_ftpt.

We will use the three common variables (the keys) to merge sm_xstudent and xclass

load("data/sm_xstudent.RData")
multkey.merge <- full_join(
sm_xstudent, xclass,
by = c("term_code" = "term_code",
"anon_id" = "anon_id",
"sex_code" = "sex_code")
)

Open multkey.merge and see the merged fields

22 / 96

If we have multiple data-frames to merge we can do so in several ways but my default approach is to rely on the Reduce() command. Note: This is base R code

df1 <- tibble(Score = c(10, 21, 33, 12, 35, 67, 43, 99),
ID = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "C31", "Q22")
)
df2 <- tibble(Sex = c("Male", "Female", "Male", "Female", "Male", "Female", "Male", "Female"),
ID = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "E52", "H71")
)
df3 <- tibble(Age = c(6, 7, 6, 8, 8, 9, 10, 5),
ID = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "E52", "H71")
)
my.list <- list(df1, df2, df3)
df.123 <- Reduce(function(...) merge(..., by = c("ID"), all = TRUE), my.list)
23 / 96

Here is the merged data-frame ...

ID Score Sex Age
A12 10 Male 6
A23 21 Female 7
C31 43 NA NA
D66 67 Female 9
E52 NA Male 10
H71 NA Female 5
Q22 99 NA NA
WX1 12 Female 8
Y31 35 Male 8
Z14 33 Male 6
24 / 96

Filtering joins

These only retain columns in the x data frame

semi_join() ... will keep only those rows found in x and y but keep only the columns in x

semi <- semi_join(data1, data2, by = "ID")
semi
## # A tibble: 6 x 2
## Score ID
## <dbl> <chr>
## 1 10 A12
## 2 21 A23
## 3 33 Z14
## 4 12 WX1
## 5 35 Y31
## 6 67 D66

anti_join() ... will keep only those rows not found in x and y but keep only the columns in x

anti <- anti_join(data1, data2, by = "ID")
anti
## # A tibble: 2 x 2
## Score ID
## <dbl> <chr>
## 1 43 C31
## 2 99 Q22
25 / 96

26 / 96

The oft-forgotten rbind()

What if we have to stack data frames atop one another, for example, because one data frame is from one term and the second from another term? No problem.

Load the rb1 (where term_code == 20171C), rb2 (where term_code == 20181C), and rb3 (where term_code == 20191C) data frames.

load("data/rb1.RData")
load("data/rb2.RData")
load("data/rb3.RData")
my.dfs <- rbind(rb1, rb2, rb3)
27 / 96
28 / 96


Your turn

29 / 96


Your turn

(1) Load xclass and sm2_xstudent

(2) Conduct a full_join, paying attention to the merge keys that should be used

(3) Conduct a semi_join with xclass as the x data frame

10:00 minutes

<div class="countdown blink-colon noupdate-15" id="timer_5fde929a" style="right:33%;bottom:0;" data-audio="true" data-warnwhen="0">
<code class="countdown-time"><span class="countdown-digits minutes">10</span><span class="countdown-digits colon">:</span><span class="countdown-digits seconds">00</span></code>
</div>
30 / 96

filter()

31 / 96

Say I want to select only online students in xclass

df.1 <- multkey.merge %>%
filter(online.f == "Online")

What if I wanted online students in Athens?

df.2 <- multkey.merge %>%
filter(online.f == "Online", campus_name == "Athens")

What if I only want students in selected colleges?

df.3 <- multkey.merge %>%
filter(college_desc %in% c("Education", "Fine Arts"))

What if I only want students in all other colleges except Fine Arts and Education?

df.4 <- multkey.merge %>%
filter(!college_desc %in% c("Education", "Fine Arts"))

Note the use of ! to create not in. Similarly, = means "equal to", != means "not equal to", | means "or", and & means "and"

32 / 96

What if have complicated filters of multiple variables? No problem ...

df.5 <- multkey.merge %>%
filter(!college_desc %in% c("Education", "Fine Arts"),
sex_code == "F",
enrlld_crdt_hrs > 4 & enrlld_crdt_hrs < 8)
head(df.5)
## # A tibble: 6 x 26
## term_code anon_id ethnicity_desc sex_code residency_code university_ftpt class_nbr
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 20171C ANON10… White F R Full Time 08287
## 2 20171C ANON10… White F R Full Time 08285
## 3 20171C ANON10… White F R Part Time 08024
## 4 20171C ANON10… White F R Part Time 08935
## 5 20171C ANON10… White F R Full Time 09761
## 6 20171C ANON10… White F R Part Time 09339
## # … with 19 more variables: campus_name <chr>, campus_dtl_desc <chr>, college_desc <chr>,
## # department_desc <chr>, subject <chr>, catalog_nbr <chr>, class_desc <chr>,
## # delivery_desc <chr>, online_flag <dbl>, rank_group <chr>, rank_desc <chr>,
## # acad_plan <chr>, student_college_desc <chr>, acad_plan_desc <chr>,
## # field_of_study_desc <chr>, residency_desc <chr>, enrolled_flag <dbl>,
## # enrlld_crdt_hrs <dbl>, online.f <fct>
33 / 96


Your turn

Use filter() to select cases that meet ALL of the the following criteria

(1) term_code must be 20191C (2) student is not on the Athens camps (3) student is a senior

Make sure you save the resulting data frame with some unique name

05:00 minutes

<div class="countdown blink-colon noupdate-15" id="timer_5fde94b1" style="right:33%;bottom:0;" data-audio="true" data-warnwhen="0">
<code class="countdown-time"><span class="countdown-digits minutes">05</span><span class="countdown-digits colon">:</span><span class="countdown-digits seconds">00</span></code>
</div>
34 / 96


Your turn

Use filter() to select cases that meet ALL of the the following criteria

(1) term_code must be 20191C (2) student is not on the Athens camps (3) student is a senior

Make sure you save the resulting data frame with some unique name

05:00 minutes

<div class="countdown blink-colon noupdate-15" id="timer_5fde94b1" style="right:33%;bottom:0;" data-audio="true" data-warnwhen="0">
<code class="countdown-time"><span class="countdown-digits minutes">05</span><span class="countdown-digits colon">:</span><span class="countdown-digits seconds">00</span></code>
</div>
my.df <- multkey.merge %>%
filter(term_code == "20191C",
campus_name != "Athens",
rank_desc == "Senior")
34 / 96

select()

35 / 96

Often you have more columns that you need for some operations, and so you trim the data by retaining (or discarding) specific columns. When you have to do this, it helps to know the column names\numbers

names(multkey.merge)
## [1] "term_code" "anon_id" "ethnicity_desc"
## [4] "sex_code" "residency_code" "university_ftpt"
## [7] "class_nbr" "campus_name" "campus_dtl_desc"
## [10] "college_desc" "department_desc" "subject"
## [13] "catalog_nbr" "class_desc" "delivery_desc"
## [16] "online_flag" "rank_group" "rank_desc"
## [19] "acad_plan" "student_college_desc" "acad_plan_desc"
## [22] "field_of_study_desc" "residency_desc" "enrolled_flag"
## [25] "enrlld_crdt_hrs" "online.f"

Say I want the first 4 columns

select.1 <- multkey.merge %>%
select(c(1:4))
names(select.1)
## [1] "term_code" "anon_id" "ethnicity_desc" "sex_code"
36 / 96

What if my columns are not sequential? No worries ...

select.2 <- multkey.merge %>%
select(c(1, 2, 6, 9:11))
names(select.2)
## [1] "term_code" "anon_id" "university_ftpt" "campus_dtl_desc"
## [5] "college_desc" "department_desc"

Could I have called for the desired columns by name? Sure ...

select.3 <- multkey.merge %>%
select(c(term_code, class_nbr, rank_desc, acad_plan))
names(select.3)
## [1] "term_code" "class_nbr" "rank_desc" "acad_plan"
37 / 96

Could I call out desired variables by a specific string in the column name?

select.4 <- multkey.merge %>%
select(contains("desc"))
names(select.4)
## [1] "ethnicity_desc" "campus_dtl_desc" "college_desc"
## [4] "department_desc" "class_desc" "delivery_desc"
## [7] "rank_desc" "student_college_desc" "acad_plan_desc"
## [10] "field_of_study_desc" "residency_desc"

What if it begins with or ends with some string?

select.5 <- multkey.merge %>%
select(ends_with("desc"))
names(select.5)
## [1] "ethnicity_desc" "campus_dtl_desc" "college_desc"
## [4] "department_desc" "class_desc" "delivery_desc"
## [7] "rank_desc" "student_college_desc" "acad_plan_desc"
## [10] "field_of_study_desc" "residency_desc"
select.6 <- multkey.merge %>%
select(starts_with("acad"))
names(select.6)
## [1] "acad_plan" "acad_plan_desc"

This is just the tip of the iceberg ... look at the dplyr documentation and online resources for other variants, including the scoped variants of select_if(), select_at(), and select_all()

38 / 96


Your turn

Use select() to retain the following cases, respectively

(1) Columns 3, 4, 8, and 22

(2) Columns that end with the string "flag"

(3) Columns that contain the string "plan"

10:00 minutes

<div class="countdown blink-colon noupdate-15" id="timer_5fde94c2" style="right:33%;bottom:0;" data-audio="true" data-warnwhen="0">
<code class="countdown-time"><span class="countdown-digits minutes">10</span><span class="countdown-digits colon">:</span><span class="countdown-digits seconds">00</span></code>
</div>
39 / 96

(1) Columns 3, 4, 8, and 22

multkey.merge %>%
select(c(3:4, 8, 22))
## # A tibble: 488,560 x 4
## ethnicity_desc sex_code campus_name field_of_study_desc
## <chr> <chr> <chr> <chr>
## 1 Two or More Races F Athens Counselor Education
## 2 White F Athens Art
## 3 White F Athens Art
## 4 Asian M Athens International Affairs
## 5 Asian M Athens International Affairs
## 6 Asian M Athens International Affairs
## 7 Asian M Athens International Affairs
## 8 White M Athens Management Systems
## 9 White M Athens Management Systems
## 10 Race/Ethnicity Unknown F Athens Undecided Education
## # … with 488,550 more rows
40 / 96

(2) Columns that end with the string "flag"

multkey.merge %>%
select(ends_with("flag"))
## # A tibble: 488,560 x 2
## online_flag enrolled_flag
## <dbl> <dbl>
## 1 0 1
## 2 0 1
## 3 0 1
## 4 0 1
## 5 0 1
## 6 0 1
## 7 0 1
## 8 1 1
## 9 1 1
## 10 0 1
## # … with 488,550 more rows
41 / 96

(3) Columns that contain the string "plan"

multkey.merge %>%
select(contains("plan"))
## # A tibble: 488,560 x 2
## acad_plan acad_plan_desc
## <chr> <chr>
## 1 ND6274 School Counseling
## 2 MF5052 Painting and Drawing
## 3 MF5052 Painting and Drawing
## 4 MA4216 Asian Studies
## 5 MA4216 Asian Studies
## 6 MA4216 Asian Studies
## 7 MA4216 Asian Studies
## 8 MB6157 MBA Business Analytics
## 9 MB6157 MBA Business Analytics
## 10 ND9915 EHS-Nondegree
## # … with 488,550 more rows
42 / 96

summarise() and mutate()

43 / 96

Say I want to calculate some quantities of interest --- frequencies perhaps, or means, standard deviations, etc. Let us start with mean, medians, etc.

tab1 <- multkey.merge %>%
summarise(
Mean.cr = mean(enrlld_crdt_hrs, na.rm = TRUE),
Median.cr = median(enrlld_crdt_hrs, na.rm = TRUE),
StdDev.cr = sd(enrlld_crdt_hrs, na.rm = TRUE),
Var.cr = var(enrlld_crdt_hrs, na.rm = TRUE),
Q1.cr = quantile(enrlld_crdt_hrs, p = c(0.25), na.rm = TRUE),
Q3.cr = quantile(enrlld_crdt_hrs, p = c(0.75), na.rm = TRUE)
)
tab1
## # A tibble: 1 x 6
## Mean.cr Median.cr StdDev.cr Var.cr Q1.cr Q3.cr
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2.80 3 1.37 1.88 3 3

Note that tab1 is a table data frame with only what we asked for

44 / 96

What if we wanted frequencies? Assume we want to know enrollment by college.

tab2 <- multkey.merge %>%
group_by(college_desc) %>%
summarise(frequency = n()) %>%
arrange(frequency)
tab2
## # A tibble: 14 x 2
## college_desc frequency
## <chr> <int>
## 1 Miscellaneous 308
## 2 Honors Tutorial 481
## 3 International Studies 921
## 4 George Voinovich School 1317
## 5 Osteopathic Medicine 12170
## 6 University College 17542
## 7 Regional Higher Ed 19043
## 8 Engineering & Technology 27094
## 9 Communication 33312
## 10 Fine Arts 34296
## 11 Education 37544
## 12 Business 53606
## 13 Health Sciences & Professions 77085
## 14 Arts & Sciences 173841
tab3 <- multkey.merge %>%
group_by(college_desc) %>%
summarise(frequency = n()) %>%
arrange(-frequency)
tab3
## # A tibble: 14 x 2
## college_desc frequency
## <chr> <int>
## 1 Arts & Sciences 173841
## 2 Health Sciences & Professions 77085
## 3 Business 53606
## 4 Education 37544
## 5 Fine Arts 34296
## 6 Communication 33312
## 7 Engineering & Technology 27094
## 8 Regional Higher Ed 19043
## 9 University College 17542
## 10 Osteopathic Medicine 12170
## 11 George Voinovich School 1317
## 12 International Studies 921
## 13 Honors Tutorial 481
## 14 Miscellaneous 308
45 / 96

Frequencies are well and good but we may want percentages

tab4 <- multkey.merge %>%
group_by(college_desc) %>%
summarise(frequency = n()) %>%
mutate(percent = (frequency / sum(frequency)) * 100,
percent = round(percent, digits = 2)) %>%
arrange(percent)
tab4
## # A tibble: 14 x 3
## college_desc frequency percent
## <chr> <int> <dbl>
## 1 Miscellaneous 308 0.06
## 2 Honors Tutorial 481 0.1
## 3 International Studies 921 0.19
## 4 George Voinovich School 1317 0.27
## 5 Osteopathic Medicine 12170 2.49
## 6 University College 17542 3.59
## 7 Regional Higher Ed 19043 3.9
## 8 Engineering & Technology 27094 5.55
## 9 Communication 33312 6.82
## 10 Fine Arts 34296 7.02
## 11 Education 37544 7.68
## 12 Business 53606 11.0
## 13 Health Sciences & Professions 77085 15.8
## 14 Arts & Sciences 173841 35.6
tab5 <- multkey.merge %>%
group_by(college_desc) %>%
summarise(frequency = n()) %>%
mutate(percent = (frequency / sum(frequency)) * 100,
percent = round(percent, digits = 2)) %>%
arrange(-percent)
tab5
## # A tibble: 14 x 3
## college_desc frequency percent
## <chr> <int> <dbl>
## 1 Arts & Sciences 173841 35.6
## 2 Health Sciences & Professions 77085 15.8
## 3 Business 53606 11.0
## 4 Education 37544 7.68
## 5 Fine Arts 34296 7.02
## 6 Communication 33312 6.82
## 7 Engineering & Technology 27094 5.55
## 8 Regional Higher Ed 19043 3.9
## 9 University College 17542 3.59
## 10 Osteopathic Medicine 12170 2.49
## 11 George Voinovich School 1317 0.27
## 12 International Studies 921 0.19
## 13 Honors Tutorial 481 0.1
## 14 Miscellaneous 308 0.06
46 / 96

Perhaps I want frequencies and percentages by sex_code

tab6 <- multkey.merge %>%
group_by(college_desc, sex_code) %>%
summarise(frequency = n()) %>%
mutate(percent = (frequency / sum(frequency)) * 100,
percent = round(percent, digits = 2)) %>%
arrange(college_desc, sex_code, -percent)
tab6
## # A tibble: 31 x 4
## # Groups: college_desc [14]
## college_desc sex_code frequency percent
## <chr> <chr> <int> <dbl>
## 1 Arts & Sciences F 98267 56.5
## 2 Arts & Sciences M 75574 43.5
## 3 Business F 19875 37.1
## 4 Business M 33731 62.9
## 5 Communication F 18844 56.6
## 6 Communication M 14468 43.4
## 7 Education F 25939 69.1
## 8 Education M 11605 30.9
## 9 Engineering & Technology F 4542 16.8
## 10 Engineering & Technology M 22552 83.2
## # … with 21 more rows
47 / 96

Or by sex_code and rank_desc

tab7 <- multkey.merge %>%
group_by(college_desc, sex_code, rank_desc) %>%
summarise(frequency = n()) %>%
mutate(percent = (frequency / sum(frequency)) * 100,
percent = round(percent, digits = 1)) %>%
arrange(college_desc, sex_code, rank_desc, -percent)
tab7
## # A tibble: 178 x 5
## # Groups: college_desc, sex_code [31]
## college_desc sex_code rank_desc frequency percent
## <chr> <chr> <chr> <int> <dbl>
## 1 Arts & Sciences F Doctoral Student 1178 1.2
## 2 Arts & Sciences F Freshman 33519 34.1
## 3 Arts & Sciences F Junior 16100 16.4
## 4 Arts & Sciences F Master's Student 3388 3.4
## 5 Arts & Sciences F Non-Degree Undergraduate 4068 4.1
## 6 Arts & Sciences F Senior 17321 17.6
## 7 Arts & Sciences F Sophomore 22693 23.1
## 8 Arts & Sciences M Doctoral Student 1604 2.1
## 9 Arts & Sciences M Freshman 26114 34.6
## 10 Arts & Sciences M Junior 12320 16.3
## # … with 168 more rows
48 / 96

case_when()

49 / 96

Often you will need to collapse some categories or otherwise modify some values. For example, perhaps I want sex_code to be fully spelled out instead of F and M. How might I do that?

multkey.merge <- multkey.merge %>%
mutate(
sex.f = case_when(
sex_code == "F" ~ "Female",
sex_code == "M" ~ "Male"
)
)
table(multkey.merge$sex.f)
##
## Female Male
## 277650 210907
multkey.merge <- multkey.merge %>%
mutate(
sex.f = case_when(
sex_code == "F" ~ "Female",
TRUE ~ "Male"
)
)
table(multkey.merge$sex.f)
##
## Female Male
## 277650 210910
50 / 96

I can also take a numeric variable and flip it into a categorical variable with case_when().

For example, say I want to split enrlld_crdt_hrs into three groups, those with hours <= 4, those with hours >= 4 but <= 15, and those with hours >= 16

multkey.merge <- multkey.merge %>%
mutate(
enrolled.f = case_when(
enrlld_crdt_hrs < 4 ~ "<= 3 hours",
enrlld_crdt_hrs >= 4 & enrlld_crdt_hrs <= 15 ~ "4 - 15 hours",
enrlld_crdt_hrs > 15 ~ "> 15 hours",
)
)
table(multkey.merge$enrolled.f)
##
## <= 3 hours > 15 hours 4 - 15 hours
## 404150 54 84356
multkey.merge <- multkey.merge %>%
mutate(
enrolled.f = case_when(
enrlld_crdt_hrs < 4 ~ "<= 3 hours",
enrlld_crdt_hrs >= 4 & enrlld_crdt_hrs <= 15 ~ "4 - 15 hours",
TRUE ~ "> 15 hours",
)
)
table(multkey.merge$enrolled.f)
##
## <= 3 hours > 15 hours 4 - 15 hours
## 404150 54 84356
51 / 96

Dealing with strings

52 / 96

stringr()

Some times we may need to eliminate some strings in specific columns, perhaps change their case, or replace a string with another string. All this and more can be done either by two packages: stringr or stringi. Let us see stringr in action for a few commonly encountered tasks.

For example, rank_desc has the word "student" in the labels. Let us assume we want to get rid of these. Here goes ...

multkey.merge <- multkey.merge %>%
mutate(rank = stringr::str_replace(rank_desc, " Student", ""))

What about changing the case?

multkey.merge <- multkey.merge %>%
mutate(college = stringr::str_to_upper(college_desc))

Other case options: str_to_title, str_to_lower, str_to_sentence

stringr is very versatile; see here for the authors' walkthrough

53 / 96

Grouped operations via group_by()

54 / 96

You will, more often than not, want to calculate quantities of interest for specific groups -- by campus, by term, by college/department/course, and so on.

Well, the group-by() switch allows you to specify the grouping structure, while ungroup() will remove constraints imposed by the grouping structure. For example, assume I want to calculate the number of students in each college

multkey.merge %>%
group_by(college_desc, campus_name) %>%
summarise(students = n_distinct(anon_id)) %>%
arrange(-students)
## # A tibble: 56 x 3
## # Groups: college_desc [14]
## college_desc campus_name students
## <chr> <chr> <int>
## 1 Arts & Sciences Athens 47145
## 2 Health Sciences & Professions Athens 30119
## 3 Business Athens 19247
## 4 Communication Athens 14332
## 5 Fine Arts Athens 14272
## 6 Education Athens 13854
## 7 University College Athens 12891
## 8 Engineering & Technology Athens 8263
## 9 Arts & Sciences Lancaster 4590
## 10 Arts & Sciences Chillicothe 4021
## # … with 46 more rows

n_distinct() is picking up unique values of anon_id even though the data frame has multiple rows per some students

55 / 96

What about male versus female students?

multkey.merge %>%
group_by(college_desc, campus_name, sex_code) %>%
summarise(students = n_distinct(anon_id)) %>%
arrange(college_desc, campus_name, -students)
## # A tibble: 115 x 4
## # Groups: college_desc, campus_name [56]
## college_desc campus_name sex_code students
## <chr> <chr> <chr> <int>
## 1 Arts & Sciences Athens F 25515
## 2 Arts & Sciences Athens M 21630
## 3 Arts & Sciences Chillicothe F 2559
## 4 Arts & Sciences Chillicothe M 1462
## 5 Arts & Sciences Eastern F 1318
## 6 Arts & Sciences Eastern M 847
## 7 Arts & Sciences Lancaster F 2755
## 8 Arts & Sciences Lancaster M 1835
## 9 Arts & Sciences Southern F 2032
## 10 Arts & Sciences Southern M 1012
## # … with 105 more rows
56 / 96

What about male versus female students by ethnicity?

multkey.merge %>%
group_by(college_desc, campus_name, sex_code, ethnicity_desc) %>%
summarise(students = n_distinct(anon_id)) %>%
arrange(college_desc, campus_name, ethnicity_desc, sex_code)
## # A tibble: 820 x 5
## # Groups: college_desc, campus_name, sex_code [115]
## college_desc campus_name sex_code ethnicity_desc students
## <chr> <chr> <chr> <chr> <int>
## 1 Arts & Sciences Athens F American Indian/Alaska Native 43
## 2 Arts & Sciences Athens M American Indian/Alaska Native 34
## 3 Arts & Sciences Athens F Asian 310
## 4 Arts & Sciences Athens M Asian 215
## 5 Arts & Sciences Athens F Black or African American 1488
## 6 Arts & Sciences Athens M Black or African American 1087
## 7 Arts & Sciences Athens F Hispanic or Latino 846
## 8 Arts & Sciences Athens M Hispanic or Latino 611
## 9 Arts & Sciences Athens F Native Hawaiian/Other Pacific Islander 11
## 10 Arts & Sciences Athens M Native Hawaiian/Other Pacific Islander 17
## # … with 810 more rows
57 / 96

Maybe I want the number enrolled in a give course. To do this, I will first combine subject and catalog_nbr

multkey.merge %>%
mutate(course = stringr::str_c(subject, catalog_nbr, sep = " ")) %>%
group_by(term_code, campus_name, course) %>%
summarise(students = n_distinct(anon_id)) %>%
arrange(-students)
## # A tibble: 12,666 x 4
## # Groups: term_code, campus_name [18]
## term_code campus_name course students
## <chr> <chr> <chr> <int>
## 1 20181C Athens UC 1900 3107
## 2 20191C Athens UC 1900 3102
## 3 20171C Athens UC 1900 2811
## 4 20171C Athens ENG 1510 1547
## 5 20171C Athens NRSE 4570 1467
## 6 20181C Athens ENG 1510 1428
## 7 20171C Athens NRSE 4520 1426
## 8 20191C Athens NRSE 4550 1414
## 9 20171C Athens PSY 1010 1404
## 10 20181C Athens NRSE 4530 1334
## # … with 12,656 more rows
58 / 96


Your turn

(1) Remove the phrase "Main Campus" from campus_dtl_desc

(2) Add class_nbr to course, making sure the separator is " - "

(3) Save multkey.merge with the RData extension

10:00 minutes

59 / 96

rename()

You may need to rename some existing columns. An easy way to do this is as follows:

multkey.merge %>%
rename("Ethnicity" = ethnicity_desc,
"Sex" = sex_code,
"Department" = department_desc) %>%
names()
## [1] "term_code" "anon_id" "Ethnicity"
## [4] "Sex" "residency_code" "university_ftpt"
## [7] "class_nbr" "campus_name" "campus_dtl_desc"
## [10] "college_desc" "Department" "subject"
## [13] "catalog_nbr" "class_desc" "delivery_desc"
## [16] "online_flag" "rank_group" "rank_desc"
## [19] "acad_plan" "student_college_desc" "acad_plan_desc"
## [22] "field_of_study_desc" "residency_desc" "enrolled_flag"
## [25] "enrlld_crdt_hrs" "online.f" "sex.f"
## [28] "enrolled.f" "rank" "college"
60 / 96

top_n()

After running some operations that generate some quantities of interest we may want to extract or merely peek at the n observations with the highest values or n observatins with the lowest values. This can be done as follows:

multkey.merge %>%
group_by(department_desc) %>%
summarize(frequency = n()) %>%
arrange(-frequency) %>%
top_n(10, frequency)
## # A tibble: 10 x 2
## department_desc frequency
## <chr> <int>
## 1 Nursing 36146
## 2 Management Systems 27078
## 3 Biological Sciences 22470
## 4 Mathematics 20716
## 5 Vice President Regional Higher Education 19043
## 6 Psychology 18791
## 7 English 17148
## 8 Communication Studies 17008
## 9 Teacher Education 16464
## 10 Dean University College 16447
multkey.merge %>%
group_by(department_desc) %>%
summarize(frequency = n()) %>%
arrange(-frequency) %>%
top_n(-10, frequency)
## # A tibble: 10 x 2
## department_desc frequency
## <chr> <int>
## 1 Center For Intl Studies 921
## 2 African American Studies 884
## 3 Aviation 883
## 4 Military Science 818
## 5 Ohio Program of Intensive English 505
## 6 Honors College 481
## 7 Misc 308
## 8 Aerospace Studies 277
## 9 Education Dean 69
## 10 Center for Law, Justice and Culture 54
61 / 96

tidy data

62 / 96

Most packages you use in R call for data to be tidy. What are tidy data?

  • Each variable forms a column.
  • Each observation forms a row.
  • Each type of observational unit forms a table.

Here are some untidy data

63 / 96

(a) The Pew survey data

pew <- read.delim(
file = "http://stat405.had.co.nz/data/pew.txt",
header = TRUE,
stringsAsFactors = FALSE,
check.names = F
)
head(pew)
## religion <$10k $10-20k $20-30k $30-40k $40-50k $50-75k $75-100k $100-150k
## 1 Agnostic 27 34 60 81 76 137 122 109
## 2 Atheist 12 27 37 52 35 70 73 59
## 3 Buddhist 27 21 30 34 33 58 62 39
## 4 Catholic 418 617 732 670 638 1116 949 792
## 5 Don’t know/refused 15 14 15 11 10 35 21 17
## 6 Evangelical Prot 575 869 1064 982 881 1486 949 723
## >150k Don't know/refused
## 1 84 96
## 2 74 76
## 3 53 54
## 4 633 1489
## 5 18 116
## 6 414 1529
64 / 96

(b) Here are some more examples

library(DSR)
table2
## # A tibble: 12 x 4
## country year key value
## <fct> <int> <fct> <int>
## 1 Afghanistan 1999 cases 745
## 2 Afghanistan 1999 population 19987071
## 3 Afghanistan 2000 cases 2666
## 4 Afghanistan 2000 population 20595360
## 5 Brazil 1999 cases 37737
## 6 Brazil 1999 population 172006362
## 7 Brazil 2000 cases 80488
## 8 Brazil 2000 population 174504898
## 9 China 1999 cases 212258
## 10 China 1999 population 1272915272
## 11 China 2000 cases 213766
## 12 China 2000 population 1280428583
table3
## # A tibble: 6 x 3
## country year rate
## <fct> <int> <chr>
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898
## 5 China 1999 212258/1272915272
## 6 China 2000 213766/1280428583
65 / 96
table4
## # A tibble: 3 x 3
## country `1999` `2000`
## <fct> <int> <int>
## 1 Afghanistan 745 2666
## 2 Brazil 37737 80488
## 3 China 212258 213766
table5
## # A tibble: 3 x 3
## country `1999` `2000`
## <fct> <int> <int>
## 1 Afghanistan 19987071 20595360
## 2 Brazil 172006362 174504898
## 3 China 1272915272 1280428583
66 / 96

What would a tidy version of these examples look like?

table1
## # A tibble: 6 x 4
## country year cases population
## <fct> <int> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583

Here, we have

  • each country as a row per year
  • each column has a single variable
  • each cell has a single value

The tidyr package is designed to make messy data tidy in order to proceed with analysis. It has four primary functions that we will see in turn.

67 / 96

separate

You will run into variables that you want to break into smaller pieces. For instance, take table3 where rate is a ratio of two variables, cases and population, each of which should be in their own columns. How can we achieve that?

library(tidyr)
s.table3 <- table3 %>%
separate(col = rate, into = c("cases", "population"), sep = "/",
remove = TRUE, convert = TRUE)
head(s.table3)
## # A tibble: 6 x 4
## country year cases population
## <fct> <int> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
68 / 96

unite

The unite command does the exact opposite of separate, combining columns

u.table3 <- s.table3 %>%
unite("rate", cases, population, sep = ":", remove = FALSE)
head(u.table3)
## # A tibble: 6 x 5
## country year rate cases population
## <fct> <int> <chr> <int> <int>
## 1 Afghanistan 1999 745:19987071 745 19987071
## 2 Afghanistan 2000 2666:20595360 2666 20595360
## 3 Brazil 1999 37737:172006362 37737 172006362
## 4 Brazil 2000 80488:174504898 80488 174504898
## 5 China 1999 212258:1272915272 212258 1272915272
## 6 China 2000 213766:1280428583 213766 1280428583

Here you specify the name of the new column first, and then list the columns to be combined, specify the separator, and whether the original columns should be removed or not.

69 / 96

spread

Revisit table2

table2
## # A tibble: 12 x 4
## country year key value
## <fct> <int> <fct> <int>
## 1 Afghanistan 1999 cases 745
## 2 Afghanistan 1999 population 19987071
## 3 Afghanistan 2000 cases 2666
## 4 Afghanistan 2000 population 20595360
## 5 Brazil 1999 cases 37737
## 6 Brazil 1999 population 172006362
## 7 Brazil 2000 cases 80488
## 8 Brazil 2000 population 174504898
## 9 China 1999 cases 212258
## 10 China 1999 population 1272915272
## 11 China 2000 cases 213766
## 12 China 2000 population 1280428583
70 / 96

I want to flip this into a format where the data frame has two columns per country per year -- cases and population. How can I do this?

sp.table2 <- table2 %>%
spread(key, value)
sp.table2
## # A tibble: 6 x 4
## country year cases population
## <fct> <int> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583

Notice that the key column had two unique values -- cases and population -- so two new columns were created. Notice also that tidyr knew to populate the value of each of the new cells with what it finds in the value column. The result is what we call data in a wide format.

71 / 96

gather

gather does the opposite of spread, converting wide data into what we call the long form. Revisit table4

table4
## # A tibble: 3 x 3
## country `1999` `2000`
## <fct> <int> <int>
## 1 Afghanistan 745 2666
## 2 Brazil 37737 80488
## 3 China 212258 213766

Note that each country has two columns, one per year. ... This is untidy ...

72 / 96

... So, we'll flip this tibble as follows:

g.table4 <- table4 %>%
gather("year", "cases", 2:3)
head(g.table4)
## # A tibble: 6 x 3
## country year cases
## <fct> <chr> <int>
## 1 Afghanistan 1999 745
## 2 Brazil 1999 37737
## 3 China 1999 212258
## 4 Afghanistan 2000 2666
## 5 Brazil 2000 80488
## 6 China 2000 213766

Notice that you had to specify the columns you wanted to gather and we did this via 2:3. If you forget to specify columns it creates something else altogether so be careful.

73 / 96

Some Examples

nospread.df <- multkey.merge %>%
group_by(department_desc, term_code) %>%
summarize(frequency = n())
head(nospread.df, 10)
## # A tibble: 10 x 3
## # Groups: department_desc [4]
## department_desc term_code frequency
## <chr> <chr> <int>
## 1 Accountancy 20171C 1874
## 2 Accountancy 20181C 1740
## 3 Accountancy 20191C 1734
## 4 Aerospace Studies 20171C 69
## 5 Aerospace Studies 20181C 112
## 6 Aerospace Studies 20191C 96
## 7 African American Studies 20171C 313
## 8 African American Studies 20181C 238
## 9 African American Studies 20191C 333
## 10 Applied Health Sciences & Wellness 20171C 3922
74 / 96
spread.df <- multkey.merge %>%
group_by(department_desc, term_code) %>%
summarize(frequency = n()) %>%
spread(term_code, frequency)
head(spread.df, 10)
## # A tibble: 10 x 4
## # Groups: department_desc [10]
## department_desc `20171C` `20181C` `20191C`
## <chr> <int> <int> <int>
## 1 Accountancy 1874 1740 1734
## 2 Aerospace Studies 69 112 96
## 3 African American Studies 313 238 333
## 4 Applied Health Sciences & Wellness 3922 4525 4268
## 5 Art 2447 2969 2738
## 6 Aviation 293 275 315
## 7 Biological Sciences 6722 7947 7801
## 8 Center For Intl Studies 255 328 338
## 9 Center for Law, Justice and Culture NA NA 54
## 10 Chemical & Biomolecular Engineering 870 1189 1059

Note: term_code is now a column instead of rows

75 / 96
76 / 96

The World Health Organization Example

At this website you can find data provided by countries to WHO and estimates of TB burden generated by WHO for the Global Tuberculosis Report. I want you to gather the following data (all are in the csv format):

Read the data dictionary carefully. Then read-in the case notifications data.

The database has too many variables (175 or so) and so I'll have you use just a few for this practice task.

library(readr)
tb <- read_csv("data/TB_notifications_2018-02-07.csv")
77 / 96

Let us use dplyr to select and filter the tb database so that it meets the following criteria: (i) observations from 2012 or earlier years are kept, and (ii) keep only columns 1 through 6 and 25 (new_sp_m04) through 44 (new_sp_fu).

library(dplyr)
tb2 <- tb %>%
select(c(1:6, 25:44)) %>%
filter(year <= 2012)
head(tb2)
## # A tibble: 6 x 26
## country iso2 iso3 iso_numeric g_whoregion year new_sp_m04 new_sp_m514 new_sp_m014
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Afghan… AF AFG 004 EMR 1980 NA NA NA
## 2 Afghan… AF AFG 004 EMR 1981 NA NA NA
## 3 Afghan… AF AFG 004 EMR 1982 NA NA NA
## 4 Afghan… AF AFG 004 EMR 1983 NA NA NA
## 5 Afghan… AF AFG 004 EMR 1984 NA NA NA
## 6 Afghan… AF AFG 004 EMR 1985 NA NA NA
## # … with 17 more variables: new_sp_m1524 <dbl>, new_sp_m2534 <dbl>, new_sp_m3544 <dbl>,
## # new_sp_m4554 <dbl>, new_sp_m5564 <dbl>, new_sp_m65 <dbl>, new_sp_mu <dbl>,
## # new_sp_f04 <dbl>, new_sp_f514 <dbl>, new_sp_f014 <dbl>, new_sp_f1524 <dbl>,
## # new_sp_f2534 <dbl>, new_sp_f3544 <dbl>, new_sp_f4554 <dbl>, new_sp_f5564 <dbl>,
## # new_sp_f65 <dbl>, new_sp_fu <dbl>
78 / 96

Now, new_sp_xx refer to new pulmonary smear-positive cases. The m refers to males and f refers to females. Finally, 04 is the 0-4 age group, 514 is the 5-14 age group, and so on. new_sp_mu is males with an unknown age-group and new_sp_fu is females with an unknown age-group. This is the messy part of the data. Ideally we'd have a column for gender, flagged as Male, Female, Unknown, and a separate column for age-groups flagged 0-4, 5-14, 0-14, and so on. Time to tidy these data.

Since the data are in the wide format we can use gather to flip it into the long format.

tb3 <- tb2 %>%
gather("group", "cases", 7:26)
head(tb3)
## # A tibble: 6 x 8
## country iso2 iso3 iso_numeric g_whoregion year group cases
## <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <dbl>
## 1 Afghanistan AF AFG 004 EMR 1980 new_sp_m04 NA
## 2 Afghanistan AF AFG 004 EMR 1981 new_sp_m04 NA
## 3 Afghanistan AF AFG 004 EMR 1982 new_sp_m04 NA
## 4 Afghanistan AF AFG 004 EMR 1983 new_sp_m04 NA
## 5 Afghanistan AF AFG 004 EMR 1984 new_sp_m04 NA
## 6 Afghanistan AF AFG 004 EMR 1985 new_sp_m04 NA
79 / 96

Now we split the new_sp_xx values into three columns

tb4 <- tb3 %>%
separate(col = group, into = c("new", "sp", "sexage"), sep = "_")
head(tb4)
## # A tibble: 6 x 10
## country iso2 iso3 iso_numeric g_whoregion year new sp sexage cases
## <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <dbl>
## 1 Afghanistan AF AFG 004 EMR 1980 new sp m04 NA
## 2 Afghanistan AF AFG 004 EMR 1981 new sp m04 NA
## 3 Afghanistan AF AFG 004 EMR 1982 new sp m04 NA
## 4 Afghanistan AF AFG 004 EMR 1983 new sp m04 NA
## 5 Afghanistan AF AFG 004 EMR 1984 new sp m04 NA
## 6 Afghanistan AF AFG 004 EMR 1985 new sp m04 NA
80 / 96

and then split sexage into two columns by specifying that the split should occur after the first value (which in our case will be after m or f).

tb5 <- tb4 %>%
separate(col = sexage, into = c("sex", "agegroup"), sep = 1)
head(tb5)
## # A tibble: 6 x 11
## country iso2 iso3 iso_numeric g_whoregion year new sp sex agegroup cases
## <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <dbl>
## 1 Afghanistan AF AFG 004 EMR 1980 new sp m 04 NA
## 2 Afghanistan AF AFG 004 EMR 1981 new sp m 04 NA
## 3 Afghanistan AF AFG 004 EMR 1982 new sp m 04 NA
## 4 Afghanistan AF AFG 004 EMR 1983 new sp m 04 NA
## 5 Afghanistan AF AFG 004 EMR 1984 new sp m 04 NA
## 6 Afghanistan AF AFG 004 EMR 1985 new sp m 04 NA
81 / 96

The beauty of dplyr and tidyr (in fact of all things tidyverse) is the ability to combine multiple steps into a single piped command!

tb.df <- tb %>%
select(c(1:6, 25:44)) %>%
filter(year <= 2012) %>%
gather("group", "cases", 7:26) %>%
separate(col = group, into = c("new", "sp", "sexage"), sep = "_") %>%
separate(col = sexage, into = c("sex", "agegroup"), sep = 1) %>%
select(c(1:6, 9:11))
head(tb.df)
## # A tibble: 6 x 9
## country iso2 iso3 iso_numeric g_whoregion year sex agegroup cases
## <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <dbl>
## 1 Afghanistan AF AFG 004 EMR 1980 m 04 NA
## 2 Afghanistan AF AFG 004 EMR 1981 m 04 NA
## 3 Afghanistan AF AFG 004 EMR 1982 m 04 NA
## 4 Afghanistan AF AFG 004 EMR 1983 m 04 NA
## 5 Afghanistan AF AFG 004 EMR 1984 m 04 NA
## 6 Afghanistan AF AFG 004 EMR 1985 m 04 NA
82 / 96

We can clean-up the codes for easier interpretation.

tb.df$sex <- ifelse(tb.df$sex == "m", "Males", "Females")
tb.df$agegroup <- ifelse(tb.df$agegroup == "04", "0-4",
ifelse(tb.df$agegroup == "014", "0-14",
ifelse(tb.df$agegroup == "1524", "15-24",
ifelse(tb.df$agegroup == "2534", "25-34",
ifelse(tb.df$agegroup == "3544", "35-44",
ifelse(tb.df$agegroup == "4554", "45-54",
ifelse(tb.df$agegroup == "514", "5-14",
ifelse(tb.df$agegroup == "5564", "55-64",
ifelse(tb.df$agegroup == "65", "65+", "Uknown")))))))))
head(tb.df)
## # A tibble: 6 x 9
## country iso2 iso3 iso_numeric g_whoregion year sex agegroup cases
## <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <dbl>
## 1 Afghanistan AF AFG 004 EMR 1980 Males 0-4 NA
## 2 Afghanistan AF AFG 004 EMR 1981 Males 0-4 NA
## 3 Afghanistan AF AFG 004 EMR 1982 Males 0-4 NA
## 4 Afghanistan AF AFG 004 EMR 1983 Males 0-4 NA
## 5 Afghanistan AF AFG 004 EMR 1984 Males 0-4 NA
## 6 Afghanistan AF AFG 004 EMR 1985 Males 0-4 NA

Note: This is using base R and not case_when() from dplyr

83 / 96

How many cases per country in the 1980-1990 period?

tbtab.01 <- tb.df %>%
filter(year >= 1980 | year <= 1990) %>%
group_by(country) %>%
summarise(Ncases = sum(cases, na.rm = TRUE)) %>%
arrange(-Ncases)
head(tbtab.01)
## # A tibble: 6 x 2
## country Ncases
## <chr> <dbl>
## 1 India 6492850
## 2 China 5643738
## 3 Indonesia 1886657
## 4 South Africa 1435147
## 5 Bangladesh 1072431
## 6 Viet Nam 915525
84 / 96

What about by sex and age-group?

tbtab.02 <- tb.df %>%
filter(year >= 1980 | year <= 1990) %>%
group_by(country, sex, agegroup) %>%
summarise(Ncases = sum(cases, na.rm = TRUE)) %>%
arrange(-Ncases)
head(tbtab.02)
## # A tibble: 6 x 4
## # Groups: country, sex [2]
## country sex agegroup Ncases
## <chr> <chr> <chr> <dbl>
## 1 India Males 35-44 963811
## 2 India Males 25-34 913939
## 3 India Males 45-54 835228
## 4 India Males 15-24 791303
## 5 China Males 65+ 748103
## 6 China Males 45-54 668133

This is just one example of how dplyr and tidyr -- two packages from the tidyverse -- can be used to deal with what seems to be quite messy data. The more you use it, the easier it becomes to figure out solutions to even very complicated messy data setups.

85 / 96


lubridate

86 / 96

This package makes working with dates and times quite easy because it does what base R does but in more intuitive ways and perhaps more flexibly. Let us start with some date fields.

today1 = "20171217"
today2 = "2017-12-17"
today3 = "2017 December 17"
today4 = "20171217143241"
today5 = "2017 December 17 14:32:41"
today6 = "December 17 2017 14:32:41"
today7 = "17-Dec, 2017 14:32:41"

The formats are quite varied but lubridate deals with them quite seamlessly so long as you pay attention to the order -- is year first or last? What about month? day? Is time given in hours, minutes and seconds?.

87 / 96
library(lubridate)
ymd(today1)
## [1] "2017-12-17"
ymd(today2)
## [1] "2017-12-17"
ymd(today3)
## [1] "2017-12-17"
ymd_hms(today4)
## [1] "2017-12-17 14:32:41 UTC"
ymd_hms(today5)
## [1] "2017-12-17 14:32:41 UTC"
mdy_hms(today6)
## [1] "2017-12-17 14:32:41 UTC"
dmy_hms(today7)
## [1] "2017-12-17 14:32:41 UTC"
88 / 96

If I need to extract data/time elements, I can do that as well.

today = dmy_hms(today7); today # a date and time set in today7
## [1] "2017-12-17 14:32:41 UTC"
today.y = year(today); today.y # The year
## [1] 2017
today.m1 = month(today); today.m1 # the month, as a number
## [1] 12
today.m2 = month(today, label = TRUE, abbr = TRUE); today.m2 # labeling the month but with an abbreviation
## [1] Dec
## Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < Oct < Nov < Dec
today.m3 = month(today, label = TRUE, abbr = FALSE); today.m3# fully labelling the month
## [1] December
## 12 Levels: January < February < March < April < May < June < July < ... < December
89 / 96
today.w = week(today)
today.w # what week of the year is it?
## [1] 51
today.doy = yday(today)
today.doy # what day of the year is it?
## [1] 351
today.dom = mday(today)
today.dom # what day of the month is it?
## [1] 17
today.dow1 = wday(today)
today.dow1 # what day of the week is it, as a number?
## [1] 1
today.dow2 = wday(today, label = TRUE, abbr = TRUE)
today.dow2 # day of the week, abbreviated label
## [1] Sun
## Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat
today.dow3 = wday(today, label = TRUE, abbr = FALSE)
today.dow3 # day of the week fully labelled
## [1] Sunday
## Levels: Sunday < Monday < Tuesday < Wednesday < Thursday < Friday < Saturday
90 / 96
today.h = hour(today)
today.h # what hour is it?
## [1] 14
today.m = minute(today)
today.m # what minute is it?
## [1] 32
today.s = second(today)
today.s # what second is it?
## [1] 41
today.tz = tz(today)
today.tz # what time zone is it?
## [1] "UTC"
91 / 96

Intervals vs. Durations vs. Periods

Calculating time lapsed between dates is tricky because you have to take into account daylight savings time1, leap years2, and leap seconds3. These render the length of months, weeks, days, hours, and minutes to be relative units of time but seconds tends to be exact units of time. As a result, and by design, lubridate differentiates between intervals, durations, and periods, each measuring time spans in different ways.

[1] Daylight Saving Time (DST) is the practice of setting the clocks forward 1 hour from standard time during the summer months, and back again in the fall, in order to make better use of natural daylight. Source

[2] Leap years are needed to keep our modern day Gregorian calendar in alignment with the Earth's revolutions around the sun. It takes the Earth approximately 365.242189 days – or 365 days, 5 hours, 48 minutes, and 45 seconds – to circle once around the Sun. This is called a tropical year, and is measured from the March equinox. However, the Gregorian calendar has only 365 days in a year, so if we didn't add a leap day on February 29 nearly every four years, we would lose almost six hours off our calendar every year. After only 100 years, our calendar would be off by around 24 days! Source

[3] Two components are used to determine UTC (Coordinated Universal Time): International Atomic Time (TAI): A time scale that combines the output of some 200 highly precise atomic clocks worldwide, and provides the exact speed for our clocks to tick. Universal Time (UT1), also known as Astronomical Time, refers to the Earth's rotation around its own axis, which determines the length of a day. Before the difference between UTC and UT1 reaches 0.9 seconds, a leap second is added to UTC and to clocks worldwide. By adding an additional second to the time count, our clocks are effectively stopped for that second to give Earth the opportunity to catch up. Source.

92 / 96

Let us start with a duration, the simplest measure of lapsed time since it measures the passing of time in seconds. Say I pick two dates and times ... 05:00am on March 9, 2019 and 05:00am on March 10, 2019.

date1 <- ymd_hms("2019-03-09 05:00:00", tzone = "US/Eastern")
date2 <- ymd_hms("2019-03-10 05:00:00", tzone = "US/Eastern")
timint <- interval(date1, date2)
timint
## [1] 2019-03-09 05:00:00 UTC--2019-03-10 05:00:00 UTC
## [2] NA--NA

How much time has lapsed between date1 and date2? We calculate with as.duration()

timelapsed.d <- as.duration(timint)

Why is it 23 hours and not 24 hours? Because daylight savings kicks in at 2:00 AM on March 10, and since the clocks kick forward by one hour, only 23 hours have passed, not 24 as we might naively expect. Go back and look at the output from timeint; see the EST versus EDT? In contrast to duration, if I ask for the period of time, what do I get?

timelapsed.p <- as.period(timint)

Aha! as.period() is imprecise because it tells me 1 day has passed

93 / 96

Now, we can use the duration of time lapsed in whatever units we want, but if we want accuracy, we better work with durations, as shown below.

time_length(timelapsed.d, unit = "second")
## [1] 86400 NA
time_length(timelapsed.d, unit = "minute")
## [1] 1440 NA
time_length(timelapsed.d, unit = "hour")
## [1] 24 NA
time_length(timelapsed.d, unit = "day")
## [1] 1 NA
time_length(timelapsed.d, unit = "week")
## [1] 0.1428571 NA
time_length(timelapsed.d, unit = "month")
## [1] 0.03285421 NA
time_length(timelapsed.d, unit = "year")
## [1] 0.002737851 NA

duration() is always estimated in seconds but you can change the reporting unit

94 / 96

If, unfortunately, we rely on periods, well then we have inherited a problem!

time_length(timelapsed.p, unit = "second")
## [1] 86400 NA
time_length(timelapsed.p, unit = "minute")
## [1] 1440 NA
time_length(timelapsed.p, unit = "hour")
## [1] 24 NA
time_length(timelapsed.p, unit = "day")
## [1] 1 NA
time_length(timelapsed.p, unit = "week")
## [1] 0.1428571 NA
time_length(timelapsed.p, unit = "month")
## [1] 0.03285421 NA
time_length(timelapsed.p, unit = "year")
## [1] 0.002737851 NA

only the seconds are correct and nothing else

95 / 96
2 / 96
Paused

Help

Keyboard shortcuts

, , Pg Up, k Go to previous slide
, , Pg Dn, Space, j Go to next slide
Home Go to first slide
End Go to last slide
Number + Return Go to specific slide
b / m / f Toggle blackout / mirrored / fullscreen mode
c Clone slideshow
p Toggle presenter mode
t Restart the presentation timer
?, h Toggle this help
Esc Back to slideshow