tidyverse
"an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures."
dplyr
"dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges"
tidyr
"Tidy data describes a standard way of storing data that is used wherever possible throughout the tidyverse. If you ensure that your data is tidy, you’ll spend less time fighting with the tools and more time working on your analysis."
The core functions are:
mutate()
... to add new variables that are functions of existing variables
transmute()
... to add new variables and drop the rest of the variables
select()
... to pick a subset of variables to work with while dropping the rest
filter()
... to select a subset of observations to work with
summarise()
... to calculate quantities of interest (rather than add these quantities as new variables to the existing data frame) and save these as a standalone table
arrange()
... order the rows based on out preferences
distinct()
... to eliminate duplicate observations
There are several other commands that belong to dplyr
... here is the cheatsheet
Let us see dplyr
in action with the xclass and xstudent data frames
load("data/xclass.RData")load("data/xstudent.RData")
%>%
The %>%
operator is useful for chaining a sequence of operations
into a single command
the_data <- read.csv('/path/to/data/file.csv') %>% subset(variable_a > x) %>% transform(variable_c = variable_a/variable_b)
This command means ...
(1) First read the csv file THEN
(2) subset the data to cases where variable_a values are greater than some value g THEN
(3) create variable_c as the ratio of variable_a to variable_b
The result is saved in the_data
You may have duplicate cases in your data set. How do you eliminate them? Let us assume we only need one record per student in xclass
1
library(tidyverse)xclass.df1 <- xclass %>% distinct(anon_id, .keep_all = TRUE)
1: While these may well be legitimate records, for training purposes I am assuming each student should only show up once.
But a student could be enrolled on multiple campuses and have multiple majors
xclass.df2 <- xclass %>% distinct(anon_id, campus_name, term_code, department_desc, .keep_all = TRUE)
If more than one variable is specified in distinct()
then dplyr
looks for unique combinations of these variables. Each unique combination = a unique observation
.keep_all = TRUE
is asking that all columns be retained after eliminating duplicate observations
Say I have data on some students, with test scores in one file and their gender in another file. Each file has a unique student identifier, called ID
. How can I create a unified data-set? Let us see each data-set first.
data1 = tibble(Score = c(10, 21, 33, 12, 35, 67, 43, 99), ID = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "C31", "Q22") )data2 = tibble(Sex = c("Male", "Female", "Male", "Female", "Male", "Female", "Male", "Female"), ID = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "E52", "H71") )
Open up both data-sets and note that students C31 and Q22 are missing from data2
and students E52 and H71 are missing from data1
.
<div id="htmlwidget-a74f2629c9702d0961ec" style="width:100%;height:auto;" class="datatables html-widget"></div><script type="application/json" data-for="htmlwidget-a74f2629c9702d0961ec">{"x":{"filter":"none","data":[["1","2","3","4","5","6","7","8"],[10,21,33,12,35,67,43,99],["A12","A23","Z14","WX1","Y31","D66","C31","Q22"]],"container":"<table class=\"display\">\n <thead>\n <tr>\n <th> <\/th>\n <th>Score<\/th>\n <th>ID<\/th>\n <\/tr>\n <\/thead>\n<\/table>","options":{"columnDefs":[{"className":"dt-center","targets":[1,2]},{"orderable":false,"targets":0}],"filter":"top","rownames":false,"class":"compact","order":[],"autoWidth":false,"orderClasses":false}},"evals":[],"jsHooks":[]}</script>
<div id="htmlwidget-d1f40ace4abc0833e865" style="width:100%;height:auto;" class="datatables html-widget"></div><script type="application/json" data-for="htmlwidget-d1f40ace4abc0833e865">{"x":{"filter":"none","data":[["1","2","3","4","5","6","7","8"],["Male","Female","Male","Female","Male","Female","Male","Female"],["A12","A23","Z14","WX1","Y31","D66","E52","H71"]],"container":"<table class=\"display\">\n <thead>\n <tr>\n <th> <\/th>\n <th>Sex<\/th>\n <th>ID<\/th>\n <\/tr>\n <\/thead>\n<\/table>","options":{"columnDefs":[{"className":"dt-center","targets":[1,2]},{"orderable":false,"targets":0}],"filter":"top","rownames":false,"class":"compact","order":[],"autoWidth":false,"orderClasses":false}},"evals":[],"jsHooks":[]}</script>
These combine data frames but to do so we need merge key(s)
-- variable(s) that identifies unique observations in each file. Next, you need to decide on what you want.
natural join
... merge observations that show up in both files
natural <- inner_join(x = data1, y = data2, by = "ID")natural
## # A tibble: 6 x 3## Score ID Sex ## <dbl> <chr> <chr> ## 1 10 A12 Male ## 2 21 A23 Female## 3 33 Z14 Male ## 4 12 WX1 Female## 5 35 Y31 Male ## 6 67 D66 Female
full outer join
... merge everything even if some cases show up in one but not the other
full <- full_join(x = data1, y = data2, by = "ID")full
## # A tibble: 10 x 3## Score ID Sex ## <dbl> <chr> <chr> ## 1 10 A12 Male ## 2 21 A23 Female## 3 33 Z14 Male ## 4 12 WX1 Female## 5 35 Y31 Male ## 6 67 D66 Female## 7 43 C31 <NA> ## 8 99 Q22 <NA> ## 9 NA E52 Male ## 10 NA H71 Female
left outer join
... merge such that all cases in file x
are retained but only those seen in file y
are joined
left <- left_join(x = data1, y = data2, by = "ID")left
## # A tibble: 8 x 3## Score ID Sex ## <dbl> <chr> <chr> ## 1 10 A12 Male ## 2 21 A23 Female## 3 33 Z14 Male ## 4 12 WX1 Female## 5 35 Y31 Male ## 6 67 D66 Female## 7 43 C31 <NA> ## 8 99 Q22 <NA>
right outer join
... merge to keep all cases in y
and only those cases are to be merged from x
that show up in y
right <- right_join(x = data1, y = data2, by = "ID")right
## # A tibble: 8 x 3## Score ID Sex ## <dbl> <chr> <chr> ## 1 10 A12 Male ## 2 21 A23 Female## 3 33 Z14 Male ## 4 12 WX1 Female## 5 35 Y31 Male ## 6 67 D66 Female## 7 NA E52 Male ## 8 NA H71 Female
What if the ID variables had different names in the files? You could rename them to have a common name or you could use by.x =
and by.y =
as shown below.
data3 <- tibble(Score = c(10, 21, 33, 12, 35, 67, 43, 99), ID1 = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "C31", "Q22") )data4 <- tibble(Sex = c("Male", "Female", "Male", "Female", "Male", "Female", "Male", "Female"), ID2 = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "E52", "H71") )diff.full <- full_join(data3, data4, by = c("ID1" = "ID2"))
You can also have more than one merge key. For example, if I am merging data for Ohio schools, each district has a district ID number (dirn)
, each school has building ID number (birn)
, and then the district's name (district)
, the building's name (building)
, and so on. If I am merging these data then my by =
statement will be by = c("dirn", "birn", "district", "building")
.
Let us work through one example.
Load the sm_xstudent
data frame and make sure the xclass
data frame is loaded as well.
Look at sm_xstudent
... it has six variables, three of which are also in xclass -- term_code
, anon_id
, and sex_code
. The other three variables are not present in xclass
-- ethnicity_desc
, residency_code
, and university_ftpt
.
We will use the three common variables (the keys
) to merge sm_xstudent
and xclass
load("data/sm_xstudent.RData")multkey.merge <- full_join( sm_xstudent, xclass, by = c("term_code" = "term_code", "anon_id" = "anon_id", "sex_code" = "sex_code") )
Open multkey.merge
and see the merged fields
If we have multiple data-frames to merge we can do so in several ways but my default approach is to rely on the Reduce()
command. Note: This is base R code
df1 <- tibble(Score = c(10, 21, 33, 12, 35, 67, 43, 99), ID = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "C31", "Q22") )df2 <- tibble(Sex = c("Male", "Female", "Male", "Female", "Male", "Female", "Male", "Female"), ID = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "E52", "H71") )df3 <- tibble(Age = c(6, 7, 6, 8, 8, 9, 10, 5), ID = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "E52", "H71") )my.list <- list(df1, df2, df3)df.123 <- Reduce(function(...) merge(..., by = c("ID"), all = TRUE), my.list)
Here is the merged data-frame ...
ID | Score | Sex | Age |
---|---|---|---|
A12 | 10 | Male | 6 |
A23 | 21 | Female | 7 |
C31 | 43 | NA | NA |
D66 | 67 | Female | 9 |
E52 | NA | Male | 10 |
H71 | NA | Female | 5 |
Q22 | 99 | NA | NA |
WX1 | 12 | Female | 8 |
Y31 | 35 | Male | 8 |
Z14 | 33 | Male | 6 |
These only retain columns in the x
data frame
semi_join()
... will keep only those rows found in x
and y
but keep only the columns in x
semi <- semi_join(data1, data2, by = "ID")semi
## # A tibble: 6 x 2## Score ID ## <dbl> <chr>## 1 10 A12 ## 2 21 A23 ## 3 33 Z14 ## 4 12 WX1 ## 5 35 Y31 ## 6 67 D66
anti_join()
... will keep only those rows not found in x
and y
but keep only the columns in x
anti <- anti_join(data1, data2, by = "ID")anti
## # A tibble: 2 x 2## Score ID ## <dbl> <chr>## 1 43 C31 ## 2 99 Q22
rbind()
What if we have to stack data frames atop one another, for example, because one data frame is from one term and the second from another term? No problem.
Load the rb1
(where term_code == 20171C
), rb2
(where term_code == 20181C
), and rb3
(where term_code == 20191C
) data frames.
load("data/rb1.RData")load("data/rb2.RData")load("data/rb3.RData")my.dfs <- rbind(rb1, rb2, rb3)
(1) Load xclass
and sm2_xstudent
(2) Conduct a full_join, paying attention to the merge keys that should be used
(3) Conduct a semi_join with xclass
as the x
data frame
<div class="countdown blink-colon noupdate-15" id="timer_5fde929a" style="right:33%;bottom:0;" data-audio="true" data-warnwhen="0"><code class="countdown-time"><span class="countdown-digits minutes">10</span><span class="countdown-digits colon">:</span><span class="countdown-digits seconds">00</span></code></div>
Say I want to select only online students in xclass
df.1 <- multkey.merge %>% filter(online.f == "Online")
What if I wanted online students in Athens?
df.2 <- multkey.merge %>% filter(online.f == "Online", campus_name == "Athens")
What if I only want students in selected colleges?
df.3 <- multkey.merge %>% filter(college_desc %in% c("Education", "Fine Arts"))
What if I only want students in all other colleges except Fine Arts and Education?
df.4 <- multkey.merge %>% filter(!college_desc %in% c("Education", "Fine Arts"))
Note the use of !
to create not in
. Similarly, =
means "equal to", !=
means "not equal to", |
means "or", and &
means "and"
What if have complicated filters of multiple variables? No problem ...
df.5 <- multkey.merge %>% filter(!college_desc %in% c("Education", "Fine Arts"), sex_code == "F", enrlld_crdt_hrs > 4 & enrlld_crdt_hrs < 8)head(df.5)
## # A tibble: 6 x 26## term_code anon_id ethnicity_desc sex_code residency_code university_ftpt class_nbr## <chr> <chr> <chr> <chr> <chr> <chr> <chr> ## 1 20171C ANON10… White F R Full Time 08287 ## 2 20171C ANON10… White F R Full Time 08285 ## 3 20171C ANON10… White F R Part Time 08024 ## 4 20171C ANON10… White F R Part Time 08935 ## 5 20171C ANON10… White F R Full Time 09761 ## 6 20171C ANON10… White F R Part Time 09339 ## # … with 19 more variables: campus_name <chr>, campus_dtl_desc <chr>, college_desc <chr>,## # department_desc <chr>, subject <chr>, catalog_nbr <chr>, class_desc <chr>,## # delivery_desc <chr>, online_flag <dbl>, rank_group <chr>, rank_desc <chr>,## # acad_plan <chr>, student_college_desc <chr>, acad_plan_desc <chr>,## # field_of_study_desc <chr>, residency_desc <chr>, enrolled_flag <dbl>,## # enrlld_crdt_hrs <dbl>, online.f <fct>
Use filter()
to select cases that meet ALL of the the following criteria
(1) term_code must be 20191C (2) student is not on the Athens camps (3) student is a senior
Make sure you save the resulting data frame with some unique name
<div class="countdown blink-colon noupdate-15" id="timer_5fde94b1" style="right:33%;bottom:0;" data-audio="true" data-warnwhen="0"><code class="countdown-time"><span class="countdown-digits minutes">05</span><span class="countdown-digits colon">:</span><span class="countdown-digits seconds">00</span></code></div>
Use filter()
to select cases that meet ALL of the the following criteria
(1) term_code must be 20191C (2) student is not on the Athens camps (3) student is a senior
Make sure you save the resulting data frame with some unique name
<div class="countdown blink-colon noupdate-15" id="timer_5fde94b1" style="right:33%;bottom:0;" data-audio="true" data-warnwhen="0"><code class="countdown-time"><span class="countdown-digits minutes">05</span><span class="countdown-digits colon">:</span><span class="countdown-digits seconds">00</span></code></div>
my.df <- multkey.merge %>% filter(term_code == "20191C", campus_name != "Athens", rank_desc == "Senior")
Often you have more columns that you need for some operations, and so you trim the data by retaining (or discarding) specific columns. When you have to do this, it helps to know the column names\numbers
names(multkey.merge)
## [1] "term_code" "anon_id" "ethnicity_desc" ## [4] "sex_code" "residency_code" "university_ftpt" ## [7] "class_nbr" "campus_name" "campus_dtl_desc" ## [10] "college_desc" "department_desc" "subject" ## [13] "catalog_nbr" "class_desc" "delivery_desc" ## [16] "online_flag" "rank_group" "rank_desc" ## [19] "acad_plan" "student_college_desc" "acad_plan_desc" ## [22] "field_of_study_desc" "residency_desc" "enrolled_flag" ## [25] "enrlld_crdt_hrs" "online.f"
Say I want the first 4 columns
select.1 <- multkey.merge %>% select(c(1:4))names(select.1)
## [1] "term_code" "anon_id" "ethnicity_desc" "sex_code"
What if my columns are not sequential? No worries ...
select.2 <- multkey.merge %>% select(c(1, 2, 6, 9:11))names(select.2)
## [1] "term_code" "anon_id" "university_ftpt" "campus_dtl_desc"## [5] "college_desc" "department_desc"
Could I have called for the desired columns by name? Sure ...
select.3 <- multkey.merge %>% select(c(term_code, class_nbr, rank_desc, acad_plan))names(select.3)
## [1] "term_code" "class_nbr" "rank_desc" "acad_plan"
Could I call out desired variables by a specific string in the column name?
select.4 <- multkey.merge %>% select(contains("desc"))names(select.4)
## [1] "ethnicity_desc" "campus_dtl_desc" "college_desc" ## [4] "department_desc" "class_desc" "delivery_desc" ## [7] "rank_desc" "student_college_desc" "acad_plan_desc" ## [10] "field_of_study_desc" "residency_desc"
What if it begins with or ends with some string?
select.5 <- multkey.merge %>% select(ends_with("desc"))names(select.5)
## [1] "ethnicity_desc" "campus_dtl_desc" "college_desc" ## [4] "department_desc" "class_desc" "delivery_desc" ## [7] "rank_desc" "student_college_desc" "acad_plan_desc" ## [10] "field_of_study_desc" "residency_desc"
select.6 <- multkey.merge %>% select(starts_with("acad"))names(select.6)
## [1] "acad_plan" "acad_plan_desc"
This is just the tip of the iceberg ... look at the dplyr documentation and online resources for other variants, including the scoped variants of select_if()
, select_at()
, and select_all()
Use select()
to retain the following cases, respectively
(1) Columns 3, 4, 8, and 22
(2) Columns that end with the string "flag"
(3) Columns that contain the string "plan"
<div class="countdown blink-colon noupdate-15" id="timer_5fde94c2" style="right:33%;bottom:0;" data-audio="true" data-warnwhen="0"><code class="countdown-time"><span class="countdown-digits minutes">10</span><span class="countdown-digits colon">:</span><span class="countdown-digits seconds">00</span></code></div>
(1) Columns 3, 4, 8, and 22
multkey.merge %>% select(c(3:4, 8, 22))
## # A tibble: 488,560 x 4## ethnicity_desc sex_code campus_name field_of_study_desc ## <chr> <chr> <chr> <chr> ## 1 Two or More Races F Athens Counselor Education ## 2 White F Athens Art ## 3 White F Athens Art ## 4 Asian M Athens International Affairs## 5 Asian M Athens International Affairs## 6 Asian M Athens International Affairs## 7 Asian M Athens International Affairs## 8 White M Athens Management Systems ## 9 White M Athens Management Systems ## 10 Race/Ethnicity Unknown F Athens Undecided Education ## # … with 488,550 more rows
(2) Columns that end with the string "flag"
multkey.merge %>% select(ends_with("flag"))
## # A tibble: 488,560 x 2## online_flag enrolled_flag## <dbl> <dbl>## 1 0 1## 2 0 1## 3 0 1## 4 0 1## 5 0 1## 6 0 1## 7 0 1## 8 1 1## 9 1 1## 10 0 1## # … with 488,550 more rows
(3) Columns that contain the string "plan"
multkey.merge %>% select(contains("plan"))
## # A tibble: 488,560 x 2## acad_plan acad_plan_desc ## <chr> <chr> ## 1 ND6274 School Counseling ## 2 MF5052 Painting and Drawing ## 3 MF5052 Painting and Drawing ## 4 MA4216 Asian Studies ## 5 MA4216 Asian Studies ## 6 MA4216 Asian Studies ## 7 MA4216 Asian Studies ## 8 MB6157 MBA Business Analytics## 9 MB6157 MBA Business Analytics## 10 ND9915 EHS-Nondegree ## # … with 488,550 more rows
Say I want to calculate some quantities of interest --- frequencies perhaps, or means, standard deviations, etc. Let us start with mean, medians, etc.
tab1 <- multkey.merge %>% summarise( Mean.cr = mean(enrlld_crdt_hrs, na.rm = TRUE), Median.cr = median(enrlld_crdt_hrs, na.rm = TRUE), StdDev.cr = sd(enrlld_crdt_hrs, na.rm = TRUE), Var.cr = var(enrlld_crdt_hrs, na.rm = TRUE), Q1.cr = quantile(enrlld_crdt_hrs, p = c(0.25), na.rm = TRUE), Q3.cr = quantile(enrlld_crdt_hrs, p = c(0.75), na.rm = TRUE) )tab1
## # A tibble: 1 x 6## Mean.cr Median.cr StdDev.cr Var.cr Q1.cr Q3.cr## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>## 1 2.80 3 1.37 1.88 3 3
Note that tab1
is a table data frame with only what we asked for
What if we wanted frequencies
? Assume we want to know enrollment by college.
tab2 <- multkey.merge %>% group_by(college_desc) %>% summarise(frequency = n()) %>% arrange(frequency)tab2
## # A tibble: 14 x 2## college_desc frequency## <chr> <int>## 1 Miscellaneous 308## 2 Honors Tutorial 481## 3 International Studies 921## 4 George Voinovich School 1317## 5 Osteopathic Medicine 12170## 6 University College 17542## 7 Regional Higher Ed 19043## 8 Engineering & Technology 27094## 9 Communication 33312## 10 Fine Arts 34296## 11 Education 37544## 12 Business 53606## 13 Health Sciences & Professions 77085## 14 Arts & Sciences 173841
tab3 <- multkey.merge %>% group_by(college_desc) %>% summarise(frequency = n()) %>% arrange(-frequency)tab3
## # A tibble: 14 x 2## college_desc frequency## <chr> <int>## 1 Arts & Sciences 173841## 2 Health Sciences & Professions 77085## 3 Business 53606## 4 Education 37544## 5 Fine Arts 34296## 6 Communication 33312## 7 Engineering & Technology 27094## 8 Regional Higher Ed 19043## 9 University College 17542## 10 Osteopathic Medicine 12170## 11 George Voinovich School 1317## 12 International Studies 921## 13 Honors Tutorial 481## 14 Miscellaneous 308
Frequencies are well and good but we may want percentages
tab4 <- multkey.merge %>% group_by(college_desc) %>% summarise(frequency = n()) %>% mutate(percent = (frequency / sum(frequency)) * 100, percent = round(percent, digits = 2)) %>% arrange(percent)tab4
## # A tibble: 14 x 3## college_desc frequency percent## <chr> <int> <dbl>## 1 Miscellaneous 308 0.06## 2 Honors Tutorial 481 0.1 ## 3 International Studies 921 0.19## 4 George Voinovich School 1317 0.27## 5 Osteopathic Medicine 12170 2.49## 6 University College 17542 3.59## 7 Regional Higher Ed 19043 3.9 ## 8 Engineering & Technology 27094 5.55## 9 Communication 33312 6.82## 10 Fine Arts 34296 7.02## 11 Education 37544 7.68## 12 Business 53606 11.0 ## 13 Health Sciences & Professions 77085 15.8 ## 14 Arts & Sciences 173841 35.6
tab5 <- multkey.merge %>% group_by(college_desc) %>% summarise(frequency = n()) %>% mutate(percent = (frequency / sum(frequency)) * 100, percent = round(percent, digits = 2)) %>% arrange(-percent)tab5
## # A tibble: 14 x 3## college_desc frequency percent## <chr> <int> <dbl>## 1 Arts & Sciences 173841 35.6 ## 2 Health Sciences & Professions 77085 15.8 ## 3 Business 53606 11.0 ## 4 Education 37544 7.68## 5 Fine Arts 34296 7.02## 6 Communication 33312 6.82## 7 Engineering & Technology 27094 5.55## 8 Regional Higher Ed 19043 3.9 ## 9 University College 17542 3.59## 10 Osteopathic Medicine 12170 2.49## 11 George Voinovich School 1317 0.27## 12 International Studies 921 0.19## 13 Honors Tutorial 481 0.1 ## 14 Miscellaneous 308 0.06
Perhaps I want frequencies and percentages by sex_code
tab6 <- multkey.merge %>% group_by(college_desc, sex_code) %>% summarise(frequency = n()) %>% mutate(percent = (frequency / sum(frequency)) * 100, percent = round(percent, digits = 2)) %>% arrange(college_desc, sex_code, -percent)tab6
## # A tibble: 31 x 4## # Groups: college_desc [14]## college_desc sex_code frequency percent## <chr> <chr> <int> <dbl>## 1 Arts & Sciences F 98267 56.5## 2 Arts & Sciences M 75574 43.5## 3 Business F 19875 37.1## 4 Business M 33731 62.9## 5 Communication F 18844 56.6## 6 Communication M 14468 43.4## 7 Education F 25939 69.1## 8 Education M 11605 30.9## 9 Engineering & Technology F 4542 16.8## 10 Engineering & Technology M 22552 83.2## # … with 21 more rows
Or by sex_code
and rank_desc
tab7 <- multkey.merge %>% group_by(college_desc, sex_code, rank_desc) %>% summarise(frequency = n()) %>% mutate(percent = (frequency / sum(frequency)) * 100, percent = round(percent, digits = 1)) %>% arrange(college_desc, sex_code, rank_desc, -percent)tab7
## # A tibble: 178 x 5## # Groups: college_desc, sex_code [31]## college_desc sex_code rank_desc frequency percent## <chr> <chr> <chr> <int> <dbl>## 1 Arts & Sciences F Doctoral Student 1178 1.2## 2 Arts & Sciences F Freshman 33519 34.1## 3 Arts & Sciences F Junior 16100 16.4## 4 Arts & Sciences F Master's Student 3388 3.4## 5 Arts & Sciences F Non-Degree Undergraduate 4068 4.1## 6 Arts & Sciences F Senior 17321 17.6## 7 Arts & Sciences F Sophomore 22693 23.1## 8 Arts & Sciences M Doctoral Student 1604 2.1## 9 Arts & Sciences M Freshman 26114 34.6## 10 Arts & Sciences M Junior 12320 16.3## # … with 168 more rows
Often you will need to collapse some categories or otherwise modify some values. For example, perhaps I want sex_code
to be fully spelled out instead of F
and M
. How might I do that?
multkey.merge <- multkey.merge %>% mutate( sex.f = case_when( sex_code == "F" ~ "Female", sex_code == "M" ~ "Male" ) )table(multkey.merge$sex.f)
## ## Female Male ## 277650 210907
multkey.merge <- multkey.merge %>% mutate( sex.f = case_when( sex_code == "F" ~ "Female", TRUE ~ "Male" ) )table(multkey.merge$sex.f)
## ## Female Male ## 277650 210910
I can also take a numeric variable and flip it into a categorical variable with case_when()
.
For example, say I want to split enrlld_crdt_hrs
into three groups, those with hours <= 4, those with hours >= 4 but <= 15, and those with hours >= 16
multkey.merge <- multkey.merge %>% mutate( enrolled.f = case_when( enrlld_crdt_hrs < 4 ~ "<= 3 hours", enrlld_crdt_hrs >= 4 & enrlld_crdt_hrs <= 15 ~ "4 - 15 hours", enrlld_crdt_hrs > 15 ~ "> 15 hours", ) )table(multkey.merge$enrolled.f)
## ## <= 3 hours > 15 hours 4 - 15 hours ## 404150 54 84356
multkey.merge <- multkey.merge %>% mutate( enrolled.f = case_when( enrlld_crdt_hrs < 4 ~ "<= 3 hours", enrlld_crdt_hrs >= 4 & enrlld_crdt_hrs <= 15 ~ "4 - 15 hours", TRUE ~ "> 15 hours", ) )table(multkey.merge$enrolled.f)
## ## <= 3 hours > 15 hours 4 - 15 hours ## 404150 54 84356
Some times we may need to eliminate some strings in specific columns, perhaps change their case, or replace a string with another string. All this and more can be done either by two packages: stringr
or stringi
. Let us see stringr
in action for a few commonly encountered tasks.
For example, rank_desc
has the word "student" in the labels. Let us assume we want to get rid of these. Here goes ...
multkey.merge <- multkey.merge %>% mutate(rank = stringr::str_replace(rank_desc, " Student", ""))
What about changing the case
?
multkey.merge <- multkey.merge %>% mutate(college = stringr::str_to_upper(college_desc))
Other case options: str_to_title
, str_to_lower
, str_to_sentence
stringr
is very versatile; see here for the authors' walkthrough
You will, more often than not, want to calculate quantities of interest for specific groups -- by campus, by term, by college/department/course, and so on.
Well, the group-by()
switch allows you to specify the grouping structure, while ungroup()
will remove constraints imposed by the grouping structure. For example, assume I want to calculate the number of students in each college
multkey.merge %>% group_by(college_desc, campus_name) %>% summarise(students = n_distinct(anon_id)) %>% arrange(-students)
## # A tibble: 56 x 3## # Groups: college_desc [14]## college_desc campus_name students## <chr> <chr> <int>## 1 Arts & Sciences Athens 47145## 2 Health Sciences & Professions Athens 30119## 3 Business Athens 19247## 4 Communication Athens 14332## 5 Fine Arts Athens 14272## 6 Education Athens 13854## 7 University College Athens 12891## 8 Engineering & Technology Athens 8263## 9 Arts & Sciences Lancaster 4590## 10 Arts & Sciences Chillicothe 4021## # … with 46 more rows
n_distinct()
is picking up unique values of anon_id
even though the data frame has multiple rows per some students
What about male versus female students?
multkey.merge %>% group_by(college_desc, campus_name, sex_code) %>% summarise(students = n_distinct(anon_id)) %>% arrange(college_desc, campus_name, -students)
## # A tibble: 115 x 4## # Groups: college_desc, campus_name [56]## college_desc campus_name sex_code students## <chr> <chr> <chr> <int>## 1 Arts & Sciences Athens F 25515## 2 Arts & Sciences Athens M 21630## 3 Arts & Sciences Chillicothe F 2559## 4 Arts & Sciences Chillicothe M 1462## 5 Arts & Sciences Eastern F 1318## 6 Arts & Sciences Eastern M 847## 7 Arts & Sciences Lancaster F 2755## 8 Arts & Sciences Lancaster M 1835## 9 Arts & Sciences Southern F 2032## 10 Arts & Sciences Southern M 1012## # … with 105 more rows
What about male versus female students by ethnicity?
multkey.merge %>% group_by(college_desc, campus_name, sex_code, ethnicity_desc) %>% summarise(students = n_distinct(anon_id)) %>% arrange(college_desc, campus_name, ethnicity_desc, sex_code)
## # A tibble: 820 x 5## # Groups: college_desc, campus_name, sex_code [115]## college_desc campus_name sex_code ethnicity_desc students## <chr> <chr> <chr> <chr> <int>## 1 Arts & Sciences Athens F American Indian/Alaska Native 43## 2 Arts & Sciences Athens M American Indian/Alaska Native 34## 3 Arts & Sciences Athens F Asian 310## 4 Arts & Sciences Athens M Asian 215## 5 Arts & Sciences Athens F Black or African American 1488## 6 Arts & Sciences Athens M Black or African American 1087## 7 Arts & Sciences Athens F Hispanic or Latino 846## 8 Arts & Sciences Athens M Hispanic or Latino 611## 9 Arts & Sciences Athens F Native Hawaiian/Other Pacific Islander 11## 10 Arts & Sciences Athens M Native Hawaiian/Other Pacific Islander 17## # … with 810 more rows
Maybe I want the number enrolled in a give course. To do this, I will first combine subject
and catalog_nbr
multkey.merge %>% mutate(course = stringr::str_c(subject, catalog_nbr, sep = " ")) %>% group_by(term_code, campus_name, course) %>% summarise(students = n_distinct(anon_id)) %>% arrange(-students)
## # A tibble: 12,666 x 4## # Groups: term_code, campus_name [18]## term_code campus_name course students## <chr> <chr> <chr> <int>## 1 20181C Athens UC 1900 3107## 2 20191C Athens UC 1900 3102## 3 20171C Athens UC 1900 2811## 4 20171C Athens ENG 1510 1547## 5 20171C Athens NRSE 4570 1467## 6 20181C Athens ENG 1510 1428## 7 20171C Athens NRSE 4520 1426## 8 20191C Athens NRSE 4550 1414## 9 20171C Athens PSY 1010 1404## 10 20181C Athens NRSE 4530 1334## # … with 12,656 more rows
(1) Remove the phrase "Main Campus" from campus_dtl_desc
(2) Add class_nbr
to course, making sure the separator is " - "
(3) Save multkey.merge
with the RData extension
10:00 minutes
You may need to rename some existing columns. An easy way to do this is as follows:
multkey.merge %>% rename("Ethnicity" = ethnicity_desc, "Sex" = sex_code, "Department" = department_desc) %>% names()
## [1] "term_code" "anon_id" "Ethnicity" ## [4] "Sex" "residency_code" "university_ftpt" ## [7] "class_nbr" "campus_name" "campus_dtl_desc" ## [10] "college_desc" "Department" "subject" ## [13] "catalog_nbr" "class_desc" "delivery_desc" ## [16] "online_flag" "rank_group" "rank_desc" ## [19] "acad_plan" "student_college_desc" "acad_plan_desc" ## [22] "field_of_study_desc" "residency_desc" "enrolled_flag" ## [25] "enrlld_crdt_hrs" "online.f" "sex.f" ## [28] "enrolled.f" "rank" "college"
After running some operations that generate some quantities of interest we may want to extract or merely peek at the n
observations with the highest values or n
observatins with the lowest values. This can be done as follows:
multkey.merge %>% group_by(department_desc) %>% summarize(frequency = n()) %>% arrange(-frequency) %>% top_n(10, frequency)
## # A tibble: 10 x 2## department_desc frequency## <chr> <int>## 1 Nursing 36146## 2 Management Systems 27078## 3 Biological Sciences 22470## 4 Mathematics 20716## 5 Vice President Regional Higher Education 19043## 6 Psychology 18791## 7 English 17148## 8 Communication Studies 17008## 9 Teacher Education 16464## 10 Dean University College 16447
multkey.merge %>% group_by(department_desc) %>% summarize(frequency = n()) %>% arrange(-frequency) %>% top_n(-10, frequency)
## # A tibble: 10 x 2## department_desc frequency## <chr> <int>## 1 Center For Intl Studies 921## 2 African American Studies 884## 3 Aviation 883## 4 Military Science 818## 5 Ohio Program of Intensive English 505## 6 Honors College 481## 7 Misc 308## 8 Aerospace Studies 277## 9 Education Dean 69## 10 Center for Law, Justice and Culture 54
Most packages you use in R call for data to be tidy
. What are tidy data?
Here are some untidy data
(a) The Pew survey data
pew <- read.delim( file = "http://stat405.had.co.nz/data/pew.txt", header = TRUE, stringsAsFactors = FALSE, check.names = F)head(pew)
## religion <$10k $10-20k $20-30k $30-40k $40-50k $50-75k $75-100k $100-150k## 1 Agnostic 27 34 60 81 76 137 122 109## 2 Atheist 12 27 37 52 35 70 73 59## 3 Buddhist 27 21 30 34 33 58 62 39## 4 Catholic 418 617 732 670 638 1116 949 792## 5 Don’t know/refused 15 14 15 11 10 35 21 17## 6 Evangelical Prot 575 869 1064 982 881 1486 949 723## >150k Don't know/refused## 1 84 96## 2 74 76## 3 53 54## 4 633 1489## 5 18 116## 6 414 1529
(b) Here are some more examples
library(DSR)table2
## # A tibble: 12 x 4## country year key value## <fct> <int> <fct> <int>## 1 Afghanistan 1999 cases 745## 2 Afghanistan 1999 population 19987071## 3 Afghanistan 2000 cases 2666## 4 Afghanistan 2000 population 20595360## 5 Brazil 1999 cases 37737## 6 Brazil 1999 population 172006362## 7 Brazil 2000 cases 80488## 8 Brazil 2000 population 174504898## 9 China 1999 cases 212258## 10 China 1999 population 1272915272## 11 China 2000 cases 213766## 12 China 2000 population 1280428583
table3
## # A tibble: 6 x 3## country year rate ## <fct> <int> <chr> ## 1 Afghanistan 1999 745/19987071 ## 2 Afghanistan 2000 2666/20595360 ## 3 Brazil 1999 37737/172006362 ## 4 Brazil 2000 80488/174504898 ## 5 China 1999 212258/1272915272## 6 China 2000 213766/1280428583
table4
## # A tibble: 3 x 3## country `1999` `2000`## <fct> <int> <int>## 1 Afghanistan 745 2666## 2 Brazil 37737 80488## 3 China 212258 213766
table5
## # A tibble: 3 x 3## country `1999` `2000`## <fct> <int> <int>## 1 Afghanistan 19987071 20595360## 2 Brazil 172006362 174504898## 3 China 1272915272 1280428583
What would a tidy version of these examples look like?
table1
## # A tibble: 6 x 4## country year cases population## <fct> <int> <int> <int>## 1 Afghanistan 1999 745 19987071## 2 Afghanistan 2000 2666 20595360## 3 Brazil 1999 37737 172006362## 4 Brazil 2000 80488 174504898## 5 China 1999 212258 1272915272## 6 China 2000 213766 1280428583
Here, we have
The tidyr
package is designed to make messy data tidy in order to proceed with analysis. It has four primary functions that we will see in turn.
You will run into variables that you want to break into smaller pieces. For instance, take table3
where rate
is a ratio of two variables, cases
and population
, each of which should be in their own columns. How can we achieve that?
library(tidyr)s.table3 <- table3 %>% separate(col = rate, into = c("cases", "population"), sep = "/", remove = TRUE, convert = TRUE) head(s.table3)
## # A tibble: 6 x 4## country year cases population## <fct> <int> <int> <int>## 1 Afghanistan 1999 745 19987071## 2 Afghanistan 2000 2666 20595360## 3 Brazil 1999 37737 172006362## 4 Brazil 2000 80488 174504898## 5 China 1999 212258 1272915272## 6 China 2000 213766 1280428583
The unite
command does the exact opposite of separate, combining columns
u.table3 <- s.table3 %>% unite("rate", cases, population, sep = ":", remove = FALSE)head(u.table3)
## # A tibble: 6 x 5## country year rate cases population## <fct> <int> <chr> <int> <int>## 1 Afghanistan 1999 745:19987071 745 19987071## 2 Afghanistan 2000 2666:20595360 2666 20595360## 3 Brazil 1999 37737:172006362 37737 172006362## 4 Brazil 2000 80488:174504898 80488 174504898## 5 China 1999 212258:1272915272 212258 1272915272## 6 China 2000 213766:1280428583 213766 1280428583
Here you specify the name of the new column first, and then list the columns to be combined, specify the separator, and whether the original columns should be removed or not.
Revisit table2
table2
## # A tibble: 12 x 4## country year key value## <fct> <int> <fct> <int>## 1 Afghanistan 1999 cases 745## 2 Afghanistan 1999 population 19987071## 3 Afghanistan 2000 cases 2666## 4 Afghanistan 2000 population 20595360## 5 Brazil 1999 cases 37737## 6 Brazil 1999 population 172006362## 7 Brazil 2000 cases 80488## 8 Brazil 2000 population 174504898## 9 China 1999 cases 212258## 10 China 1999 population 1272915272## 11 China 2000 cases 213766## 12 China 2000 population 1280428583
I want to flip this into a format where the data frame has two columns per country per year -- cases
and population
. How can I do this?
sp.table2 <- table2 %>% spread(key, value)sp.table2
## # A tibble: 6 x 4## country year cases population## <fct> <int> <int> <int>## 1 Afghanistan 1999 745 19987071## 2 Afghanistan 2000 2666 20595360## 3 Brazil 1999 37737 172006362## 4 Brazil 2000 80488 174504898## 5 China 1999 212258 1272915272## 6 China 2000 213766 1280428583
Notice that the key
column had two unique values -- cases and population -- so two new columns were created. Notice also that tidyr knew to populate the value of each of the new cells with what it finds in the value
column. The result is what we call data in a wide format.
gather
does the opposite of spread
, converting wide data into what we call the long form. Revisit table4
table4
## # A tibble: 3 x 3## country `1999` `2000`## <fct> <int> <int>## 1 Afghanistan 745 2666## 2 Brazil 37737 80488## 3 China 212258 213766
Note that each country has two columns, one per year. ... This is untidy ...
... So, we'll flip this tibble as follows:
g.table4 <- table4 %>% gather("year", "cases", 2:3)head(g.table4)
## # A tibble: 6 x 3## country year cases## <fct> <chr> <int>## 1 Afghanistan 1999 745## 2 Brazil 1999 37737## 3 China 1999 212258## 4 Afghanistan 2000 2666## 5 Brazil 2000 80488## 6 China 2000 213766
Notice that you had to specify the columns you wanted to gather and we did this via 2:3
. If you forget to specify columns it creates something else altogether so be careful.
nospread.df <- multkey.merge %>% group_by(department_desc, term_code) %>% summarize(frequency = n()) head(nospread.df, 10)
## # A tibble: 10 x 3## # Groups: department_desc [4]## department_desc term_code frequency## <chr> <chr> <int>## 1 Accountancy 20171C 1874## 2 Accountancy 20181C 1740## 3 Accountancy 20191C 1734## 4 Aerospace Studies 20171C 69## 5 Aerospace Studies 20181C 112## 6 Aerospace Studies 20191C 96## 7 African American Studies 20171C 313## 8 African American Studies 20181C 238## 9 African American Studies 20191C 333## 10 Applied Health Sciences & Wellness 20171C 3922
spread.df <- multkey.merge %>% group_by(department_desc, term_code) %>% summarize(frequency = n()) %>% spread(term_code, frequency) head(spread.df, 10)
## # A tibble: 10 x 4## # Groups: department_desc [10]## department_desc `20171C` `20181C` `20191C`## <chr> <int> <int> <int>## 1 Accountancy 1874 1740 1734## 2 Aerospace Studies 69 112 96## 3 African American Studies 313 238 333## 4 Applied Health Sciences & Wellness 3922 4525 4268## 5 Art 2447 2969 2738## 6 Aviation 293 275 315## 7 Biological Sciences 6722 7947 7801## 8 Center For Intl Studies 255 328 338## 9 Center for Law, Justice and Culture NA NA 54## 10 Chemical & Biomolecular Engineering 870 1189 1059
Note: term_code
is now a column instead of rows
At this website you can find data provided by countries to WHO and estimates of TB burden generated by WHO for the Global Tuberculosis Report. I want you to gather the following data (all are in the csv format):
Read the data dictionary carefully. Then read-in the case notifications data.
The database has too many variables (175 or so) and so I'll have you use just a few for this practice task.
library(readr)tb <- read_csv("data/TB_notifications_2018-02-07.csv")
Let us use dplyr
to select and filter the tb database so that it meets the following criteria: (i) observations from 2012 or earlier years are kept, and (ii) keep only columns 1 through 6 and 25 (new_sp_m04) through 44 (new_sp_fu).
library(dplyr)tb2 <- tb %>% select(c(1:6, 25:44)) %>% filter(year <= 2012)head(tb2)
## # A tibble: 6 x 26## country iso2 iso3 iso_numeric g_whoregion year new_sp_m04 new_sp_m514 new_sp_m014## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>## 1 Afghan… AF AFG 004 EMR 1980 NA NA NA## 2 Afghan… AF AFG 004 EMR 1981 NA NA NA## 3 Afghan… AF AFG 004 EMR 1982 NA NA NA## 4 Afghan… AF AFG 004 EMR 1983 NA NA NA## 5 Afghan… AF AFG 004 EMR 1984 NA NA NA## 6 Afghan… AF AFG 004 EMR 1985 NA NA NA## # … with 17 more variables: new_sp_m1524 <dbl>, new_sp_m2534 <dbl>, new_sp_m3544 <dbl>,## # new_sp_m4554 <dbl>, new_sp_m5564 <dbl>, new_sp_m65 <dbl>, new_sp_mu <dbl>,## # new_sp_f04 <dbl>, new_sp_f514 <dbl>, new_sp_f014 <dbl>, new_sp_f1524 <dbl>,## # new_sp_f2534 <dbl>, new_sp_f3544 <dbl>, new_sp_f4554 <dbl>, new_sp_f5564 <dbl>,## # new_sp_f65 <dbl>, new_sp_fu <dbl>
Now, new_sp_xx
refer to new pulmonary smear-positive cases. The m
refers to males and f
refers to females. Finally, 04
is the 0-4 age group, 514
is the 5-14 age group, and so on. new_sp_mu
is males with an unknown age-group and new_sp_fu
is females with an unknown age-group. This is the messy part of the data. Ideally we'd have a column for gender, flagged as Male, Female, Unknown
, and a separate column for age-groups flagged 0-4, 5-14, 0-14, and so on
. Time to tidy these data.
Since the data are in the wide format we can use gather to flip it into the long format.
tb3 <- tb2 %>% gather("group", "cases", 7:26) head(tb3)
## # A tibble: 6 x 8## country iso2 iso3 iso_numeric g_whoregion year group cases## <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <dbl>## 1 Afghanistan AF AFG 004 EMR 1980 new_sp_m04 NA## 2 Afghanistan AF AFG 004 EMR 1981 new_sp_m04 NA## 3 Afghanistan AF AFG 004 EMR 1982 new_sp_m04 NA## 4 Afghanistan AF AFG 004 EMR 1983 new_sp_m04 NA## 5 Afghanistan AF AFG 004 EMR 1984 new_sp_m04 NA## 6 Afghanistan AF AFG 004 EMR 1985 new_sp_m04 NA
Now we split the new_sp_xx
values into three columns
tb4 <- tb3 %>% separate(col = group, into = c("new", "sp", "sexage"), sep = "_") head(tb4)
## # A tibble: 6 x 10## country iso2 iso3 iso_numeric g_whoregion year new sp sexage cases## <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <dbl>## 1 Afghanistan AF AFG 004 EMR 1980 new sp m04 NA## 2 Afghanistan AF AFG 004 EMR 1981 new sp m04 NA## 3 Afghanistan AF AFG 004 EMR 1982 new sp m04 NA## 4 Afghanistan AF AFG 004 EMR 1983 new sp m04 NA## 5 Afghanistan AF AFG 004 EMR 1984 new sp m04 NA## 6 Afghanistan AF AFG 004 EMR 1985 new sp m04 NA
and then split sexage
into two columns by specifying that the split should occur after the first value (which in our case will be after m
or f
).
tb5 <- tb4 %>% separate(col = sexage, into = c("sex", "agegroup"), sep = 1) head(tb5)
## # A tibble: 6 x 11## country iso2 iso3 iso_numeric g_whoregion year new sp sex agegroup cases## <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <dbl>## 1 Afghanistan AF AFG 004 EMR 1980 new sp m 04 NA## 2 Afghanistan AF AFG 004 EMR 1981 new sp m 04 NA## 3 Afghanistan AF AFG 004 EMR 1982 new sp m 04 NA## 4 Afghanistan AF AFG 004 EMR 1983 new sp m 04 NA## 5 Afghanistan AF AFG 004 EMR 1984 new sp m 04 NA## 6 Afghanistan AF AFG 004 EMR 1985 new sp m 04 NA
The beauty of dplyr
and tidyr
(in fact of all things tidyverse
) is the ability to combine multiple steps into a single piped command!
tb.df <- tb %>% select(c(1:6, 25:44)) %>% filter(year <= 2012) %>% gather("group", "cases", 7:26) %>% separate(col = group, into = c("new", "sp", "sexage"), sep = "_") %>% separate(col = sexage, into = c("sex", "agegroup"), sep = 1) %>% select(c(1:6, 9:11)) head(tb.df)
## # A tibble: 6 x 9## country iso2 iso3 iso_numeric g_whoregion year sex agegroup cases## <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <dbl>## 1 Afghanistan AF AFG 004 EMR 1980 m 04 NA## 2 Afghanistan AF AFG 004 EMR 1981 m 04 NA## 3 Afghanistan AF AFG 004 EMR 1982 m 04 NA## 4 Afghanistan AF AFG 004 EMR 1983 m 04 NA## 5 Afghanistan AF AFG 004 EMR 1984 m 04 NA## 6 Afghanistan AF AFG 004 EMR 1985 m 04 NA
We can clean-up the codes for easier interpretation.
tb.df$sex <- ifelse(tb.df$sex == "m", "Males", "Females") tb.df$agegroup <- ifelse(tb.df$agegroup == "04", "0-4", ifelse(tb.df$agegroup == "014", "0-14", ifelse(tb.df$agegroup == "1524", "15-24", ifelse(tb.df$agegroup == "2534", "25-34", ifelse(tb.df$agegroup == "3544", "35-44", ifelse(tb.df$agegroup == "4554", "45-54", ifelse(tb.df$agegroup == "514", "5-14", ifelse(tb.df$agegroup == "5564", "55-64", ifelse(tb.df$agegroup == "65", "65+", "Uknown")))))))))head(tb.df)
## # A tibble: 6 x 9## country iso2 iso3 iso_numeric g_whoregion year sex agegroup cases## <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <dbl>## 1 Afghanistan AF AFG 004 EMR 1980 Males 0-4 NA## 2 Afghanistan AF AFG 004 EMR 1981 Males 0-4 NA## 3 Afghanistan AF AFG 004 EMR 1982 Males 0-4 NA## 4 Afghanistan AF AFG 004 EMR 1983 Males 0-4 NA## 5 Afghanistan AF AFG 004 EMR 1984 Males 0-4 NA## 6 Afghanistan AF AFG 004 EMR 1985 Males 0-4 NA
Note: This is using base R and not case_when()
from dplyr
How many cases per country in the 1980-1990 period?
tbtab.01 <- tb.df %>% filter(year >= 1980 | year <= 1990) %>% group_by(country) %>% summarise(Ncases = sum(cases, na.rm = TRUE)) %>% arrange(-Ncases)head(tbtab.01)
## # A tibble: 6 x 2## country Ncases## <chr> <dbl>## 1 India 6492850## 2 China 5643738## 3 Indonesia 1886657## 4 South Africa 1435147## 5 Bangladesh 1072431## 6 Viet Nam 915525
What about by sex and age-group?
tbtab.02 <- tb.df %>% filter(year >= 1980 | year <= 1990) %>% group_by(country, sex, agegroup) %>% summarise(Ncases = sum(cases, na.rm = TRUE)) %>% arrange(-Ncases)head(tbtab.02)
## # A tibble: 6 x 4## # Groups: country, sex [2]## country sex agegroup Ncases## <chr> <chr> <chr> <dbl>## 1 India Males 35-44 963811## 2 India Males 25-34 913939## 3 India Males 45-54 835228## 4 India Males 15-24 791303## 5 China Males 65+ 748103## 6 China Males 45-54 668133
This is just one example of how dplyr
and tidyr
-- two packages from the tidyverse
-- can be used to deal with what seems to be quite messy data. The more you use it, the easier it becomes to figure out solutions to even very complicated messy data setups.
This package makes working with dates and times quite easy because it does what base R does but in more intuitive ways and perhaps more flexibly. Let us start with some date fields.
today1 = "20171217"today2 = "2017-12-17"today3 = "2017 December 17"today4 = "20171217143241"today5 = "2017 December 17 14:32:41"today6 = "December 17 2017 14:32:41"today7 = "17-Dec, 2017 14:32:41"
The formats are quite varied but lubridate
deals with them quite seamlessly so long as you pay attention to the order -- is year first or last? What about month? day? Is time given in hours, minutes and seconds?.
library(lubridate)ymd(today1)
## [1] "2017-12-17"
ymd(today2)
## [1] "2017-12-17"
ymd(today3)
## [1] "2017-12-17"
ymd_hms(today4)
## [1] "2017-12-17 14:32:41 UTC"
ymd_hms(today5)
## [1] "2017-12-17 14:32:41 UTC"
mdy_hms(today6)
## [1] "2017-12-17 14:32:41 UTC"
dmy_hms(today7)
## [1] "2017-12-17 14:32:41 UTC"
If I need to extract data/time elements, I can do that as well.
today = dmy_hms(today7); today # a date and time set in today7
## [1] "2017-12-17 14:32:41 UTC"
today.y = year(today); today.y # The year
## [1] 2017
today.m1 = month(today); today.m1 # the month, as a number
## [1] 12
today.m2 = month(today, label = TRUE, abbr = TRUE); today.m2 # labeling the month but with an abbreviation
## [1] Dec## Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < Oct < Nov < Dec
today.m3 = month(today, label = TRUE, abbr = FALSE); today.m3# fully labelling the month
## [1] December## 12 Levels: January < February < March < April < May < June < July < ... < December
today.w = week(today)today.w # what week of the year is it?
## [1] 51
today.doy = yday(today)today.doy # what day of the year is it?
## [1] 351
today.dom = mday(today)today.dom # what day of the month is it?
## [1] 17
today.dow1 = wday(today)today.dow1 # what day of the week is it, as a number?
## [1] 1
today.dow2 = wday(today, label = TRUE, abbr = TRUE)today.dow2 # day of the week, abbreviated label
## [1] Sun## Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat
today.dow3 = wday(today, label = TRUE, abbr = FALSE)today.dow3 # day of the week fully labelled
## [1] Sunday## Levels: Sunday < Monday < Tuesday < Wednesday < Thursday < Friday < Saturday
today.h = hour(today)today.h # what hour is it?
## [1] 14
today.m = minute(today)today.m # what minute is it?
## [1] 32
today.s = second(today)today.s # what second is it?
## [1] 41
today.tz = tz(today)today.tz # what time zone is it?
## [1] "UTC"
Calculating time lapsed between dates is tricky because you have to take into account daylight savings time1, leap years2, and leap seconds3. These render the length of months, weeks, days, hours, and minutes to be relative units of time
but seconds tends to be exact units of time
. As a result, and by design, lubridate
differentiates between intervals
, durations
, and periods
, each measuring time spans in different ways.
[1] Daylight Saving Time (DST) is the practice of setting the clocks forward 1 hour from standard time during the summer months, and back again in the fall, in order to make better use of natural daylight. Source
[2] Leap years are needed to keep our modern day Gregorian calendar in alignment with the Earth's revolutions around the sun. It takes the Earth approximately 365.242189 days – or 365 days, 5 hours, 48 minutes, and 45 seconds – to circle once around the Sun. This is called a tropical year, and is measured from the March equinox. However, the Gregorian calendar has only 365 days in a year, so if we didn't add a leap day on February 29 nearly every four years, we would lose almost six hours off our calendar every year. After only 100 years, our calendar would be off by around 24 days! Source
[3] Two components are used to determine UTC (Coordinated Universal Time): International Atomic Time (TAI): A time scale that combines the output of some 200 highly precise atomic clocks worldwide, and provides the exact speed for our clocks to tick. Universal Time (UT1), also known as Astronomical Time, refers to the Earth's rotation around its own axis, which determines the length of a day. Before the difference between UTC and UT1 reaches 0.9 seconds, a leap second is added to UTC and to clocks worldwide. By adding an additional second to the time count, our clocks are effectively stopped for that second to give Earth the opportunity to catch up. Source.
Let us start with a duration
, the simplest measure of lapsed time since it measures the passing of time in seconds. Say I pick two dates and times ... 05:00am on March 9, 2019
and 05:00am on March 10, 2019
.
date1 <- ymd_hms("2019-03-09 05:00:00", tzone = "US/Eastern")date2 <- ymd_hms("2019-03-10 05:00:00", tzone = "US/Eastern")timint <- interval(date1, date2)timint
## [1] 2019-03-09 05:00:00 UTC--2019-03-10 05:00:00 UTC## [2] NA--NA
How much time has lapsed between date1 and date2? We calculate with as.duration()
timelapsed.d <- as.duration(timint)
Why is it 23 hours and not 24 hours? Because daylight savings kicks in at 2:00 AM on March 10, and since the clocks kick forward by one hour, only 23 hours have passed, not 24 as we might naively expect. Go back and look at the output from timeint
; see the EST versus EDT? In contrast to duration, if I ask for the period of time
, what do I get?
timelapsed.p <- as.period(timint)
Aha! as.period()
is imprecise because it tells me 1 day has passed
Now, we can use the duration of time lapsed in whatever units we want, but if we want accuracy, we better work with durations, as shown below.
time_length(timelapsed.d, unit = "second")
## [1] 86400 NA
time_length(timelapsed.d, unit = "minute")
## [1] 1440 NA
time_length(timelapsed.d, unit = "hour")
## [1] 24 NA
time_length(timelapsed.d, unit = "day")
## [1] 1 NA
time_length(timelapsed.d, unit = "week")
## [1] 0.1428571 NA
time_length(timelapsed.d, unit = "month")
## [1] 0.03285421 NA
time_length(timelapsed.d, unit = "year")
## [1] 0.002737851 NA
duration()
is always estimated in seconds but you can change the reporting unit
If, unfortunately, we rely on periods, well then we have inherited a problem!
time_length(timelapsed.p, unit = "second")
## [1] 86400 NA
time_length(timelapsed.p, unit = "minute")
## [1] 1440 NA
time_length(timelapsed.p, unit = "hour")
## [1] 24 NA
time_length(timelapsed.p, unit = "day")
## [1] 1 NA
time_length(timelapsed.p, unit = "week")
## [1] 0.1428571 NA
time_length(timelapsed.p, unit = "month")
## [1] 0.03285421 NA
time_length(timelapsed.p, unit = "year")
## [1] 0.002737851 NA
only
the seconds
are correct and nothing else
Keyboard shortcuts
↑, ←, Pg Up, k | Go to previous slide |
↓, →, Pg Dn, Space, j | Go to next slide |
Home | Go to first slide |
End | Go to last slide |
Number + Return | Go to specific slide |
b / m / f | Toggle blackout / mirrored / fullscreen mode |
c | Clone slideshow |
p | Toggle presenter mode |
t | Restart the presentation timer |
?, h | Toggle this help |
Esc | Back to slideshow |