Introduction to the tidyverse

Workshop Session 02 @ Ohio University

Ani Ruhil

2019-05-19

“Yet far too much handcrafted work — what data scientists call “data wrangling,” “data munging” and “data janitor work” — is still required. Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets."

Anybody who has worked with real-world data knows how messy these data tend to be. As a result, you find yourself cleaning up the data, slapping on value labels, perhaps pruning duplicate records, changing variable names to something manageable, and so on. One could use base R commands to complete all these tasks but instead we focus on the tidyverse, described by its authors as “an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.” While there are several packages that make up the tidyverse, we will focus on two:

  1. dplyr – “dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges”

  1. tidyr – “Tidy data describes a standard way of storing data that is used wherever possible throughout the tidyverse. If you ensure that your data is tidy, you’ll spend less time fighting with the tools and more time working on your analysis.”

The Basic Prepwork

Let us start by loading the (i) xclass and (ii) xstudent data-sets.

load("data/xclass.RData")
load("data/xstudent.RData")

Now, the first thing we may want to do is to clean up the variable names – all lowercase, elements separated by an underscore. This makes it easy to keep the structure of the variable names straight in our workflow without worrying about uppercase/lowercase, underscores versus dashes or periods, or worse, blank spaces.

Other features of the janitor package can be explored here

library(janitor)
xstudent <- janitor::clean_names(xstudent)
xclass <- janitor::clean_names(xclass)

The second thing I want to do is to get a quick overview of what the data contains. There are many ways to do this but let us rely on skimr

Learn more about the skimr package here

library(skimr)
skim(xstudent)

Data summary

Name xstudent
Number of rows 106771
Number of columns 21
_______________________
Column type frequency:
character 19
numeric 2
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
term_code 0 1 6 6 0 3 0
anon_id 0 1 11 11 0 106771 0
campus_detail_code 0 1 2 2 0 17 0
campus_name 0 1 6 11 0 6 0
campus_detail_desc 0 1 6 23 0 17 0
student_rank_code 0 1 2 2 0 8 0
student_career 0 1 7 13 0 3 0
student_rank_desc 0 1 6 24 0 8 0
acad_plan 0 1 6 6 0 576 0
college_code 0 1 2 2 0 14 0
college_desc 0 1 8 29 0 14 0
htc_college_code 0 1 2 2 0 15 0
htc_college_desc 0 1 0 29 906 15 0
field_of_study_desc 0 1 3 40 0 96 0
acad_plan_desc 0 1 3 40 0 436 0
ethnicity_desc 0 1 5 38 0 9 0
residency_code 0 1 1 1 0 2 0
sex_code 0 1 1 1 0 3 0
university_ftpt 0 1 9 9 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
hours_carried_all_campuses 0 1 12.83 5.24 0 9.0 15 16.00 48 ▃▇▁▁▁
fte 0 1 0.84 0.31 0 0.6 1 1.07 2 ▂▂▇▁▁

Now we can get to work with dplyr but before we do so, note that you will see the following operator showing up a lot in the code %>%

The %>% operator is useful for chaining a sequence of operations into a single command

the_data <-
  read.csv('/path/to/data/file.csv') %>%
  subset(variable_a > x) %>%
  transform(variable_c = variable_a/variable_b)

This command means …

  1. First read the csv file THEN
  2. subset the data to cases where variable_a values are greater than some value g THEN
  3. create variable_c as the ratio of variable_a to variable_b

The result is saved in the_data

dplyr

Say we want to check for and eliminate duplicate records from a data-set. I will assume (incorrectly I know) that there should be just one row per student in the xclass data-set. I can retain unique records via the following

xclass.df1 <- xclass %>%
  distinct(anon_id, .keep_all = TRUE) 

.keep_all = TRUE retains all columns. Seetting it to FALSE will only retain the anon_id column.

But a student could be enrolled on multiple campuses and have multiple majors. How might we take this into account?

xclass.df2 <- xclass %>%
  distinct(anon_id, campus_name, term_code, department_desc,
           .keep_all = TRUE) 

If more than one variable is specified in distinct() then dplyr looks for unique combinations of these variables. Each unique combination \(=\) a unique observation

Merging data

Say I have data on some students, with test scores in one file and their gender in another file. Each file has a unique student identifier, called ID. How can I create a unified data-set? Let us see each data-set first.

data1 = tibble(Score = c(10, 21, 33, 12, 35, 67, 43, 99),
                   ID = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "C31", "Q22")
                   )

data2 = tibble(Sex = c("Male", "Female", "Male", "Female", "Male", "Female", "Male", "Female"),
                   ID = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "E52", "H71")
                   )

Open up both data-sets and note that students C31 and Q22 are missing from data2 and students E52 and H71 are missing from data1.

Mutating joins

These combine data frames but to do so we need to have merge key(s) – variable(s) that identifies unique observations in each file. Next, you need to decide on what you want.

natural join … merge observations that show up in both files

natural <- inner_join(x = data1, y = data2, by = "ID")
natural
## # A tibble: 6 x 3
##   Score ID    Sex   
##   <dbl> <chr> <chr> 
## 1    10 A12   Male  
## 2    21 A23   Female
## 3    33 Z14   Male  
## 4    12 WX1   Female
## 5    35 Y31   Male  
## 6    67 D66   Female

full outer join … merge everything even if some cases show up in one but not the other

full <- full_join(x = data1, y = data2, by = "ID")
full
## # A tibble: 10 x 3
##    Score ID    Sex   
##    <dbl> <chr> <chr> 
##  1    10 A12   Male  
##  2    21 A23   Female
##  3    33 Z14   Male  
##  4    12 WX1   Female
##  5    35 Y31   Male  
##  6    67 D66   Female
##  7    43 C31   <NA>  
##  8    99 Q22   <NA>  
##  9    NA E52   Male  
## 10    NA H71   Female

left outer join … merge such that all cases in file x are retained but only those seen in file y are joined

left <- left_join(x = data1, y = data2, by = "ID")
left
## # A tibble: 8 x 3
##   Score ID    Sex   
##   <dbl> <chr> <chr> 
## 1    10 A12   Male  
## 2    21 A23   Female
## 3    33 Z14   Male  
## 4    12 WX1   Female
## 5    35 Y31   Male  
## 6    67 D66   Female
## 7    43 C31   <NA>  
## 8    99 Q22   <NA>

right outer join … merge to keep all cases in y and only those cases are to be merged from x that show up in y

right <- right_join(x = data1, y = data2, by = "ID")
right
## # A tibble: 8 x 3
##   Score ID    Sex   
##   <dbl> <chr> <chr> 
## 1    10 A12   Male  
## 2    21 A23   Female
## 3    33 Z14   Male  
## 4    12 WX1   Female
## 5    35 Y31   Male  
## 6    67 D66   Female
## 7    NA E52   Male  
## 8    NA H71   Female

What if the ID variables had different names in the files? You could rename them to have a common name or you could use by.x = and by.y = as shown below.

data3 <- tibble(Score = c(10, 21, 33, 12, 35, 67, 43, 99),
                ID1 = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "C31", "Q22")
                    )

data4 <- tibble(Sex = c("Male", "Female", "Male", "Female", "Male", "Female", "Male", "Female"),
                ID2 = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "E52", "H71")
                    )

diff.full <- full_join(data3, data4, by = c("ID1" = "ID2"))

You can also have more than one merge key. For example, if I am merging data for Ohio schools, each district has a district ID number (dirn), each school has building ID number (birn), and then the district’s name (district), the building’s name (building), and so on. If I am merging these data then my by = statement will be by = c("dirn", "birn", "district", "building").

Let us work through one example.

Load the sm_xstudent data frame and make sure the xclass data frame is loaded as well.

Look at sm_xstudent … it has six variables, three of which are also in xclass – term_code, anon_id, and sex_code. The other three variables are not present in xclassethnicity_desc, residency_code, and university_ftpt.

We will use the three common variables (the keys) to merge sm_xstudent and xclass

load("data/sm_xstudent.RData")

multkey.merge <- full_join(
    sm_xstudent, xclass,
    by = c("term_code" = "term_code",
           "anon_id" = "anon_id",
           "sex_code" = "sex_code")
    )

Open multkey.merge and see the merged fields

If we have multiple data-frames to merge we can do so in several ways but my default approach is to rely on the Reduce() command. Note: This is base R code

df1 <- tibble(Score = c(10, 21, 33, 12, 35, 67, 43, 99),
              ID = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "C31", "Q22")
              )

df2 <- tibble(Sex = c("Male", "Female", "Male", "Female", "Male", "Female", "Male", "Female"),
              ID = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "E52", "H71")
              )

df3 <- tibble(Age = c(6, 7, 6, 8, 8, 9, 10, 5),
              ID = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "E52", "H71")
              )

my.list <- list(df1, df2, df3)

df.123 <- Reduce(function(...) merge(..., by = c("ID"), all = TRUE), my.list)

Here is the merged data-frame …

ID Score Sex Age
A12 10 Male 6
A23 21 Female 7
C31 43 NA NA
D66 67 Female 9
E52 NA Male 10
H71 NA Female 5
Q22 99 NA NA
WX1 12 Female 8
Y31 35 Male 8
Z14 33 Male 6

Filtering joins

These only retain columns in the x data frame

semi_join() … will keep only those rows found in x and y but keep only the columns in x

semi <- semi_join(data1, data2, by = "ID")
semi
## # A tibble: 6 x 2
##   Score ID   
##   <dbl> <chr>
## 1    10 A12  
## 2    21 A23  
## 3    33 Z14  
## 4    12 WX1  
## 5    35 Y31  
## 6    67 D66

anti_join() … will keep only those rows not found in x and y but keep only the columns in x

anti <- anti_join(data1, data2, by = "ID")
anti
## # A tibble: 2 x 2
##   Score ID   
##   <dbl> <chr>
## 1    43 C31  
## 2    99 Q22

The oft-forgotten rbind()

What if we have to stack data frames atop one another, for example, because one data frame is from one term and the second from another term? No problem.

Load the rb1 (where term_code == 20171C), rb2 (where term_code == 20181C), and rb3 (where term_code == 20191C) data frames.

load("data/rb1.RData")
load("data/rb2.RData")
load("data/rb3.RData")
my.dfs <- rbind(rb1, rb2, rb3)

Now go ahead and merge xclass and xstudent, making sure no variable shows up twice in the merged data frame. Then save the merged data frame as multkey.merge.RData and let us get cracking with dplyr’s verbs.

filter()

Say I want to select only online students in xclass

df.1 <- multkey.merge %>%
  filter(online.f == "Online")

What if I wanted online students in Athens?

df.2 <- multkey.merge %>%
  filter(online.f == "Online", campus_name == "Athens")

What if I only want students in selected colleges?

df.3 <- multkey.merge %>%
  filter(college_desc %in% c("Education", "Fine Arts"))

What if I only want students in all other colleges except Fine Arts and Education?

df.4 <- multkey.merge %>%
  filter(!college_desc %in% c("Education", "Fine Arts"))

Note the use of ! to create not in. Similarly, = means “equal to”, != means “not equal to”, | means “or”, and & means “and”

What if have complicated filters of multiple variables? No problem …

df.5 <- multkey.merge %>%
  filter(!college_desc %in% c("Education", "Fine Arts"),
         sex_code == "F",
         enrlld_crdt_hrs > 4 &  enrlld_crdt_hrs < 8)

glimpse(df.5)
## Rows: 4,923
## Columns: 30
## $ term_code            <chr> "20171C", "20171C", "20171C", "20171C", "20171C"…
## $ anon_id              <chr> "ANON1000083", "ANON1000083", "ANON1000116", "AN…
## $ ethnicity_desc       <chr> "White", "White", "White", "White", "White", "Wh…
## $ sex_code             <chr> "F", "F", "F", "F", "F", "F", "F", "F", "F", "F"…
## $ residency_code       <chr> "R", "R", "R", "R", "R", "R", "R", "R", "N", "R"…
## $ university_ftpt      <chr> "Full Time", "Full Time", "Part Time", "Part Tim…
## $ class_nbr            <chr> "08287", "08285", "08024", "08935", "09761", "09…
## $ campus_name          <chr> "Chillicothe", "Chillicothe", "Chillicothe", "Za…
## $ campus_dtl_desc      <chr> "Chillicothe Main Campus", "Chillicothe Main Cam…
## $ college_desc         <chr> "Health Sciences & Professions", "Health Science…
## $ department_desc      <chr> "Nursing", "Nursing", "Mathematics", "Vice Presi…
## $ subject              <chr> "NRSE", "NRSE", "MATH", "NURS", "SW", "MATH", "M…
## $ catalog_nbr          <chr> "4150", "4140", "D004", "1210", "4923", "D004", …
## $ class_desc           <chr> "Populations: Family, Community", "Nursing Care …
## $ delivery_desc        <chr> "Lecture", "Lecture", "Lecture", "Lecture", "Int…
## $ online_flag          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ rank_group           <chr> "U", "U", "U", "U", "U", "U", "U", "U", "G", "M"…
## $ rank_desc            <chr> "Senior", "Senior", "Senior", "Senior", "Senior"…
## $ acad_plan            <chr> "BS1221", "BS1221", "ND0210", "ND1201", "BS6605"…
## $ student_college_desc <chr> "Health Sciences & Professions", "Health Science…
## $ acad_plan_desc       <chr> "Nursing", "Nursing", "Undecided-HSP", "Undecide…
## $ field_of_study_desc  <chr> "Nursing", "Nursing", "Undecided Health Sciences…
## $ residency_desc       <chr> "Resident", "Resident", "Resident", "Resident", …
## $ enrolled_flag        <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ enrlld_crdt_hrs      <dbl> 5, 5, 5, 7, 6, 5, 5, 7, 6, 7, 5, 6, 7, 5, 7, 6, …
## $ online.f             <fct> Not Online, Not Online, Not Online, Not Online, …
## $ sex.f                <chr> "Female", "Female", "Female", "Female", "Female"…
## $ enrolled.f           <chr> "4 - 15 hours", "4 - 15 hours", "4 - 15 hours", …
## $ rank                 <chr> "Senior", "Senior", "Senior", "Senior", "Senior"…
## $ college              <chr> "HEALTH SCIENCES & PROFESSIONS", "HEALTH SCIENCE…

select()

names(multkey.merge)
##  [1] "term_code"            "anon_id"              "ethnicity_desc"      
##  [4] "sex_code"             "residency_code"       "university_ftpt"     
##  [7] "class_nbr"            "campus_name"          "campus_dtl_desc"     
## [10] "college_desc"         "department_desc"      "subject"             
## [13] "catalog_nbr"          "class_desc"           "delivery_desc"       
## [16] "online_flag"          "rank_group"           "rank_desc"           
## [19] "acad_plan"            "student_college_desc" "acad_plan_desc"      
## [22] "field_of_study_desc"  "residency_desc"       "enrolled_flag"       
## [25] "enrlld_crdt_hrs"

Say I want the first 4 columns

select.1 <- multkey.merge %>%
  select(c(1:4))

names(select.1)
## [1] "term_code"      "anon_id"        "ethnicity_desc" "sex_code"

What if my columns are not sequential? No worries …

select.2 <- multkey.merge %>%
  select(c(1, 2, 6, 9:11))

names(select.2)
## [1] "term_code"       "anon_id"         "university_ftpt" "campus_dtl_desc"
## [5] "college_desc"    "department_desc"

Could I have called for the desired columns by name? Sure …

select.3 <- multkey.merge %>%
  select(c(term_code, class_nbr, rank_desc, acad_plan))

names(select.3)
## [1] "term_code" "class_nbr" "rank_desc" "acad_plan"

Could I call out desired variables by a specific string in the column name?

select.4 <- multkey.merge %>%
  select(contains("desc"))

names(select.4)
##  [1] "ethnicity_desc"       "campus_dtl_desc"      "college_desc"        
##  [4] "department_desc"      "class_desc"           "delivery_desc"       
##  [7] "rank_desc"            "student_college_desc" "acad_plan_desc"      
## [10] "field_of_study_desc"  "residency_desc"

What if the variable name begins/ends with some string?

select.5 <- multkey.merge %>%
  select(ends_with("desc"))

names(select.5)
##  [1] "ethnicity_desc"       "campus_dtl_desc"      "college_desc"        
##  [4] "department_desc"      "class_desc"           "delivery_desc"       
##  [7] "rank_desc"            "student_college_desc" "acad_plan_desc"      
## [10] "field_of_study_desc"  "residency_desc"
select.6 <- multkey.merge %>%
  select(starts_with("acad"))

names(select.6)
## [1] "acad_plan"      "acad_plan_desc"

summarise()/summarize() and mutate()

Say I want to calculate some quantities of interest — frequencies perhaps, or means, standard deviations, etc. Let us start with mean, medians, etc.

tab1 <- multkey.merge %>% 
  summarise(
    Mean.cr = mean(enrlld_crdt_hrs, na.rm = TRUE),
    Median.cr = median(enrlld_crdt_hrs, na.rm = TRUE),
    StdDev.cr = sd(enrlld_crdt_hrs, na.rm = TRUE),
    Var.cr = var(enrlld_crdt_hrs, na.rm = TRUE),
    Q1.cr = quantile(enrlld_crdt_hrs, p = c(0.25), na.rm = TRUE),
    Q3.cr = quantile(enrlld_crdt_hrs, p = c(0.75), na.rm = TRUE)
    )

tab1
## # A tibble: 1 x 6
##   Mean.cr Median.cr StdDev.cr Var.cr Q1.cr Q3.cr
##     <dbl>     <dbl>     <dbl>  <dbl> <dbl> <dbl>
## 1    2.80         3      1.37   1.88     3     3

What if we wanted frequencies? Assume we want to know enrollment by college.

tab2 <- multkey.merge %>%
  group_by(college_desc) %>%
  summarise(frequency = n()) %>%
  arrange(frequency)

tab2
## # A tibble: 14 x 2
##    college_desc                  frequency
##    <chr>                             <int>
##  1 Miscellaneous                       308
##  2 Honors Tutorial                     481
##  3 International Studies               921
##  4 George Voinovich School            1317
##  5 Osteopathic Medicine              12170
##  6 University College                17542
##  7 Regional Higher Ed                19043
##  8 Engineering & Technology          27094
##  9 Communication                     33312
## 10 Fine Arts                         34296
## 11 Education                         37544
## 12 Business                          53606
## 13 Health Sciences & Professions     77085
## 14 Arts & Sciences                  173841
tab3 <- multkey.merge %>%
  group_by(college_desc) %>%
  summarise(frequency = n()) %>%
  arrange(-frequency)

tab3
## # A tibble: 14 x 2
##    college_desc                  frequency
##    <chr>                             <int>
##  1 Arts & Sciences                  173841
##  2 Health Sciences & Professions     77085
##  3 Business                          53606
##  4 Education                         37544
##  5 Fine Arts                         34296
##  6 Communication                     33312
##  7 Engineering & Technology          27094
##  8 Regional Higher Ed                19043
##  9 University College                17542
## 10 Osteopathic Medicine              12170
## 11 George Voinovich School            1317
## 12 International Studies               921
## 13 Honors Tutorial                     481
## 14 Miscellaneous                       308

Frequencies are well and good but we may want percentages

tab4 <- multkey.merge %>%
  group_by(college_desc) %>%
  summarise(frequency = n()) %>%
  mutate(percent = (frequency / sum(frequency)) * 100,
         percent = round(percent, digits = 2)) %>%
  arrange(percent)

tab4
## # A tibble: 14 x 3
##    college_desc                  frequency percent
##    <chr>                             <int>   <dbl>
##  1 Miscellaneous                       308    0.06
##  2 Honors Tutorial                     481    0.1 
##  3 International Studies               921    0.19
##  4 George Voinovich School            1317    0.27
##  5 Osteopathic Medicine              12170    2.49
##  6 University College                17542    3.59
##  7 Regional Higher Ed                19043    3.9 
##  8 Engineering & Technology          27094    5.55
##  9 Communication                     33312    6.82
## 10 Fine Arts                         34296    7.02
## 11 Education                         37544    7.68
## 12 Business                          53606   11.0 
## 13 Health Sciences & Professions     77085   15.8 
## 14 Arts & Sciences                  173841   35.6
tab5 <- multkey.merge %>%
  group_by(college_desc) %>%
  summarise(frequency = n()) %>%
  mutate(percent = (frequency / sum(frequency)) * 100,
         percent = round(percent, digits = 2)) %>%
  arrange(-percent)

tab5
## # A tibble: 14 x 3
##    college_desc                  frequency percent
##    <chr>                             <int>   <dbl>
##  1 Arts & Sciences                  173841   35.6 
##  2 Health Sciences & Professions     77085   15.8 
##  3 Business                          53606   11.0 
##  4 Education                         37544    7.68
##  5 Fine Arts                         34296    7.02
##  6 Communication                     33312    6.82
##  7 Engineering & Technology          27094    5.55
##  8 Regional Higher Ed                19043    3.9 
##  9 University College                17542    3.59
## 10 Osteopathic Medicine              12170    2.49
## 11 George Voinovich School            1317    0.27
## 12 International Studies               921    0.19
## 13 Honors Tutorial                     481    0.1 
## 14 Miscellaneous                       308    0.06

Perhaps I want frequencies and percentages by sex_code

tab6 <- multkey.merge %>%
  group_by(college_desc, sex_code) %>%
  summarise(frequency = n()) %>%
  mutate(percent = (frequency / sum(frequency)) * 100,
         percent = round(percent, digits = 2)) %>%
  arrange(college_desc, sex_code, -percent)

tab6
## # A tibble: 31 x 4
## # Groups:   college_desc [14]
##    college_desc             sex_code frequency percent
##    <chr>                    <chr>        <int>   <dbl>
##  1 Arts & Sciences          F            98267    56.5
##  2 Arts & Sciences          M            75574    43.5
##  3 Business                 F            19875    37.1
##  4 Business                 M            33731    62.9
##  5 Communication            F            18844    56.6
##  6 Communication            M            14468    43.4
##  7 Education                F            25939    69.1
##  8 Education                M            11605    30.9
##  9 Engineering & Technology F             4542    16.8
## 10 Engineering & Technology M            22552    83.2
## # … with 21 more rows

Or by sex_code and rank_desc

tab7 <- multkey.merge %>%
  group_by(college_desc, sex_code, rank_desc) %>%
  summarise(frequency = n()) %>%
  mutate(percent = (frequency / sum(frequency)) * 100,
         percent = round(percent, digits = 1)) %>%
  arrange(college_desc, sex_code, rank_desc, -percent)

tab7
## # A tibble: 178 x 5
## # Groups:   college_desc, sex_code [31]
##    college_desc    sex_code rank_desc                frequency percent
##    <chr>           <chr>    <chr>                        <int>   <dbl>
##  1 Arts & Sciences F        Doctoral Student              1178     1.2
##  2 Arts & Sciences F        Freshman                     33519    34.1
##  3 Arts & Sciences F        Junior                       16100    16.4
##  4 Arts & Sciences F        Master's Student              3388     3.4
##  5 Arts & Sciences F        Non-Degree Undergraduate      4068     4.1
##  6 Arts & Sciences F        Senior                       17321    17.6
##  7 Arts & Sciences F        Sophomore                    22693    23.1
##  8 Arts & Sciences M        Doctoral Student              1604     2.1
##  9 Arts & Sciences M        Freshman                     26114    34.6
## 10 Arts & Sciences M        Junior                       12320    16.3
## # … with 168 more rows

case_when()

Often you will need to collapse some categories or otherwise modify some values. For example, perhaps I want sex_code to be fully spelled out instead of F and M. How might I do that?

multkey.merge <- multkey.merge %>%
  mutate(
    sex.f = case_when(
      sex_code == "F" ~ "Female",
      sex_code == "M" ~ "Male"
      )
    )

table(multkey.merge$sex.f)
## 
## Female   Male 
## 277650 210907
multkey.merge <- multkey.merge %>%
  mutate(
    sex.f = case_when(
      sex_code == "F" ~ "Female",
      TRUE ~ "Male"
      )
    )

table(multkey.merge$sex.f)
## 
## Female   Male 
## 277650 210910

I can also take a numeric variable and flip it into a categorical variable with case_when(). For example, say I want to split enrlld_crdt_hrs into three groups, those with hours <= 4, those with hours >= 4 but <= 15, and those with hours >= 16

multkey.merge <- multkey.merge %>%
  mutate(
    enrolled.f = case_when(
      enrlld_crdt_hrs < 4 ~ "<= 3 hours",
      enrlld_crdt_hrs >= 4 & enrlld_crdt_hrs <= 15 ~ "4 - 15 hours",
      enrlld_crdt_hrs > 15 ~ "> 15 hours",
      )
    )

table(multkey.merge$enrolled.f)
## 
##   <= 3 hours   > 15 hours 4 - 15 hours 
##       404150           54        84356
multkey.merge <- multkey.merge %>%
  mutate(
    enrolled.f = case_when(
      enrlld_crdt_hrs < 4 ~ "<= 3 hours",
      enrlld_crdt_hrs >= 4 & enrlld_crdt_hrs <= 15 ~ "4 - 15 hours",
      TRUE ~ "> 15 hours",
      )
    )

table(multkey.merge$enrolled.f)
## 
##   <= 3 hours   > 15 hours 4 - 15 hours 
##       404150           54        84356

stringr()

Some times we may need to eliminate some strings in specific columns, perhaps change their case, or replace a string with another string. All this and more can be done either by two packages: stringr or stringi. Let us see stringr in action for a few commonly encountered tasks.

For example, rank_desc has the word “student” in the labels. Let us assume we want to get rid of these. Here goes …

multkey.merge <- multkey.merge %>%
  mutate(rank = stringr::str_replace(rank_desc, " Student", ""))

What about changing the case?

multkey.merge <- multkey.merge %>%
  mutate(college = stringr::str_to_upper(college_desc))

Other case options: str_to_title, str_to_lower, str_to_sentence

stringr is very versatile; see here for the authors’ walkthrough

group_by()

Now we are back to dplyr so do not confuse these with stringr. You will, more often than not, want to calculate quantities of interest for specific groups – by campus, by term, by college/department/course, and so on.

Well, the group-by() switch allows you to specify the grouping structure, while ungroup() will remove constraints imposed by the grouping structure. For example, assume I want to calculate the number of students in each college

multkey.merge %>%
  group_by(college_desc, campus_name) %>%
  summarise(students = n_distinct(anon_id)) %>%
  arrange(-students)
## # A tibble: 56 x 3
## # Groups:   college_desc [14]
##    college_desc                  campus_name students
##    <chr>                         <chr>          <int>
##  1 Arts & Sciences               Athens         47145
##  2 Health Sciences & Professions Athens         30119
##  3 Business                      Athens         19247
##  4 Communication                 Athens         14332
##  5 Fine Arts                     Athens         14272
##  6 Education                     Athens         13854
##  7 University College            Athens         12891
##  8 Engineering & Technology      Athens          8263
##  9 Arts & Sciences               Lancaster       4590
## 10 Arts & Sciences               Chillicothe     4021
## # … with 46 more rows

n_distinct() is picking up unique values of anon_id even though the data frame has multiple rows per some students

What about male versus female students?

multkey.merge %>%
  group_by(college_desc,  campus_name, sex_code) %>%
  summarise(students = n_distinct(anon_id)) %>%
  arrange(college_desc,  campus_name, -students)
## # A tibble: 115 x 4
## # Groups:   college_desc, campus_name [56]
##    college_desc    campus_name sex_code students
##    <chr>           <chr>       <chr>       <int>
##  1 Arts & Sciences Athens      F           25515
##  2 Arts & Sciences Athens      M           21630
##  3 Arts & Sciences Chillicothe F            2559
##  4 Arts & Sciences Chillicothe M            1462
##  5 Arts & Sciences Eastern     F            1318
##  6 Arts & Sciences Eastern     M             847
##  7 Arts & Sciences Lancaster   F            2755
##  8 Arts & Sciences Lancaster   M            1835
##  9 Arts & Sciences Southern    F            2032
## 10 Arts & Sciences Southern    M            1012
## # … with 105 more rows

What about male versus female students by ethnicity?

multkey.merge %>%
  group_by(college_desc, campus_name, sex_code, ethnicity_desc) %>%
  summarise(students = n_distinct(anon_id)) %>%
  arrange(college_desc, campus_name, ethnicity_desc, sex_code)
## # A tibble: 820 x 5
## # Groups:   college_desc, campus_name, sex_code [115]
##    college_desc    campus_name sex_code ethnicity_desc                  students
##    <chr>           <chr>       <chr>    <chr>                              <int>
##  1 Arts & Sciences Athens      F        American Indian/Alaska Native         43
##  2 Arts & Sciences Athens      M        American Indian/Alaska Native         34
##  3 Arts & Sciences Athens      F        Asian                                310
##  4 Arts & Sciences Athens      M        Asian                                215
##  5 Arts & Sciences Athens      F        Black or African American           1488
##  6 Arts & Sciences Athens      M        Black or African American           1087
##  7 Arts & Sciences Athens      F        Hispanic or Latino                   846
##  8 Arts & Sciences Athens      M        Hispanic or Latino                   611
##  9 Arts & Sciences Athens      F        Native Hawaiian/Other Pacific …       11
## 10 Arts & Sciences Athens      M        Native Hawaiian/Other Pacific …       17
## # … with 810 more rows

Maybe I want the number enrolled in a give course. To do this, I will first combine subject and catalog_nbr

multkey.merge %>%
  mutate(course = stringr::str_c(subject, catalog_nbr, sep = " ")) %>% 
  group_by(term_code, campus_name, course) %>%
  summarise(students = n_distinct(anon_id)) %>%
  arrange(-students)
## # A tibble: 12,666 x 4
## # Groups:   term_code, campus_name [18]
##    term_code campus_name course    students
##    <chr>     <chr>       <chr>        <int>
##  1 20181C    Athens      UC 1900       3107
##  2 20191C    Athens      UC 1900       3102
##  3 20171C    Athens      UC 1900       2811
##  4 20171C    Athens      ENG 1510      1547
##  5 20171C    Athens      NRSE 4570     1467
##  6 20181C    Athens      ENG 1510      1428
##  7 20171C    Athens      NRSE 4520     1426
##  8 20191C    Athens      NRSE 4550     1414
##  9 20171C    Athens      PSY 1010      1404
## 10 20181C    Athens      NRSE 4530     1334
## # … with 12,656 more rows

rename()

You may need to rename some existing columns. An easy way to do this is as follows:

multkey.merge %>%
  rename("Ethnicity" = ethnicity_desc,
         "Sex" = sex_code,
         "Department" = department_desc) %>%
  names()
##  [1] "term_code"            "anon_id"              "Ethnicity"           
##  [4] "Sex"                  "residency_code"       "university_ftpt"     
##  [7] "class_nbr"            "campus_name"          "campus_dtl_desc"     
## [10] "college_desc"         "Department"           "subject"             
## [13] "catalog_nbr"          "class_desc"           "delivery_desc"       
## [16] "online_flag"          "rank_group"           "rank_desc"           
## [19] "acad_plan"            "student_college_desc" "acad_plan_desc"      
## [22] "field_of_study_desc"  "residency_desc"       "enrolled_flag"       
## [25] "enrlld_crdt_hrs"      "sex.f"                "enrolled.f"          
## [28] "rank"                 "college"

top_n()

After running some operations that generate some quantities of interest we may want to extract or merely peek at the n observations with the highest values or n observatins with the lowest values. This can be done as follows:

multkey.merge %>%
  group_by(department_desc) %>%
  summarize(frequency = n()) %>% 
  arrange(-frequency) %>% 
  top_n(10, frequency)
## # A tibble: 10 x 2
##    department_desc                          frequency
##    <chr>                                        <int>
##  1 Nursing                                      36146
##  2 Management Systems                           27078
##  3 Biological Sciences                          22470
##  4 Mathematics                                  20716
##  5 Vice President Regional Higher Education     19043
##  6 Psychology                                   18791
##  7 English                                      17148
##  8 Communication Studies                        17008
##  9 Teacher Education                            16464
## 10 Dean University College                      16447
multkey.merge %>%
  group_by(department_desc) %>%
  summarize(frequency = n()) %>% 
  arrange(-frequency) %>%
  top_n(-10, frequency)
## # A tibble: 10 x 2
##    department_desc                     frequency
##    <chr>                                   <int>
##  1 Center For Intl Studies                   921
##  2 African American Studies                  884
##  3 Aviation                                  883
##  4 Military Science                          818
##  5 Ohio Program of Intensive English         505
##  6 Honors College                            481
##  7 Misc                                      308
##  8 Aerospace Studies                         277
##  9 Education Dean                             69
## 10 Center for Law, Justice and Culture        54

tidyr

What are tidy data?

Here are some untidy data

  1. The Pew survey data
pew <- read.delim(
  file = "http://stat405.had.co.nz/data/pew.txt",
  header = TRUE,
  stringsAsFactors = FALSE,
  check.names = F
)

head(pew)
##             religion <$10k $10-20k $20-30k $30-40k $40-50k $50-75k $75-100k
## 1           Agnostic    27      34      60      81      76     137      122
## 2            Atheist    12      27      37      52      35      70       73
## 3           Buddhist    27      21      30      34      33      58       62
## 4           Catholic   418     617     732     670     638    1116      949
## 5 Don’t know/refused    15      14      15      11      10      35       21
## 6   Evangelical Prot   575     869    1064     982     881    1486      949
##   $100-150k >150k Don't know/refused
## 1       109    84                 96
## 2        59    74                 76
## 3        39    53                 54
## 4       792   633               1489
## 5        17    18                116
## 6       723   414               1529
  1. Here are some more examples
devtools::install_github("garrettgman/DSR")

library(DSR)
table2
## # A tibble: 12 x 4
##    country      year key             value
##    <fct>       <int> <fct>           <int>
##  1 Afghanistan  1999 cases             745
##  2 Afghanistan  1999 population   19987071
##  3 Afghanistan  2000 cases            2666
##  4 Afghanistan  2000 population   20595360
##  5 Brazil       1999 cases           37737
##  6 Brazil       1999 population  172006362
##  7 Brazil       2000 cases           80488
##  8 Brazil       2000 population  174504898
##  9 China        1999 cases          212258
## 10 China        1999 population 1272915272
## 11 China        2000 cases          213766
## 12 China        2000 population 1280428583
table3
## # A tibble: 6 x 3
##   country      year rate             
##   <fct>       <int> <chr>            
## 1 Afghanistan  1999 745/19987071     
## 2 Afghanistan  2000 2666/20595360    
## 3 Brazil       1999 37737/172006362  
## 4 Brazil       2000 80488/174504898  
## 5 China        1999 212258/1272915272
## 6 China        2000 213766/1280428583
table4
## # A tibble: 3 x 3
##   country     `1999` `2000`
##   <fct>        <int>  <int>
## 1 Afghanistan    745   2666
## 2 Brazil       37737  80488
## 3 China       212258 213766
table5
## # A tibble: 3 x 3
##   country         `1999`     `2000`
##   <fct>            <int>      <int>
## 1 Afghanistan   19987071   20595360
## 2 Brazil       172006362  174504898
## 3 China       1272915272 1280428583

What would a tidy version of these examples look like?

table1
## # A tibble: 6 x 4
##   country      year  cases population
##   <fct>       <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

Here, we have

separate()

library(tidyr)
s.table3 <- table3 %>% 
  separate(col = rate, into = c("cases", "population"), sep = "/", 
           remove = TRUE, convert = TRUE) 

head(s.table3)
## # A tibble: 6 x 4
##   country      year  cases population
##   <fct>       <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

unite()

u.table3 <- s.table3 %>% 
  unite("rate", cases, population, sep = ":", remove = FALSE)

head(u.table3)
## # A tibble: 6 x 5
##   country      year rate               cases population
##   <fct>       <int> <chr>              <int>      <int>
## 1 Afghanistan  1999 745:19987071         745   19987071
## 2 Afghanistan  2000 2666:20595360       2666   20595360
## 3 Brazil       1999 37737:172006362    37737  172006362
## 4 Brazil       2000 80488:174504898    80488  174504898
## 5 China        1999 212258:1272915272 212258 1272915272
## 6 China        2000 213766:1280428583 213766 1280428583

spread()

Revisit table2

table2
## # A tibble: 12 x 4
##    country      year key             value
##    <fct>       <int> <fct>           <int>
##  1 Afghanistan  1999 cases             745
##  2 Afghanistan  1999 population   19987071
##  3 Afghanistan  2000 cases            2666
##  4 Afghanistan  2000 population   20595360
##  5 Brazil       1999 cases           37737
##  6 Brazil       1999 population  172006362
##  7 Brazil       2000 cases           80488
##  8 Brazil       2000 population  174504898
##  9 China        1999 cases          212258
## 10 China        1999 population 1272915272
## 11 China        2000 cases          213766
## 12 China        2000 population 1280428583

I want to flip this into a format where the data frame has two columns per country per year – cases and population. How can I do this?

sp.table2 <- table2 %>% 
  spread(key, value)

sp.table2
## # A tibble: 6 x 4
##   country      year  cases population
##   <fct>       <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

gather()

table4
## # A tibble: 3 x 3
##   country     `1999` `2000`
##   <fct>        <int>  <int>
## 1 Afghanistan    745   2666
## 2 Brazil       37737  80488
## 3 China       212258 213766
g.table4 <- table4 %>% 
  gather("year", "cases", 2:3)

head(g.table4)
## # A tibble: 6 x 3
##   country     year   cases
##   <fct>       <chr>  <int>
## 1 Afghanistan 1999     745
## 2 Brazil      1999   37737
## 3 China       1999  212258
## 4 Afghanistan 2000    2666
## 5 Brazil      2000   80488
## 6 China       2000  213766

Some Examples

nospread.df <- multkey.merge %>%
  group_by(department_desc, term_code) %>%
  summarize(frequency = n()) 

head(nospread.df, 10)
## # A tibble: 10 x 3
## # Groups:   department_desc [4]
##    department_desc                    term_code frequency
##    <chr>                              <chr>         <int>
##  1 Accountancy                        20171C         1874
##  2 Accountancy                        20181C         1740
##  3 Accountancy                        20191C         1734
##  4 Aerospace Studies                  20171C           69
##  5 Aerospace Studies                  20181C          112
##  6 Aerospace Studies                  20191C           96
##  7 African American Studies           20171C          313
##  8 African American Studies           20181C          238
##  9 African American Studies           20191C          333
## 10 Applied Health Sciences & Wellness 20171C         3922

Note: ‘term_code’ is now a column instead of rows.

spread.df <- multkey.merge %>%
  group_by(department_desc, term_code) %>%
  summarize(frequency = n()) %>%
  spread(term_code, frequency) 

head(spread.df, 10)
## # A tibble: 10 x 4
## # Groups:   department_desc [10]
##    department_desc                     `20171C` `20181C` `20191C`
##    <chr>                                  <int>    <int>    <int>
##  1 Accountancy                             1874     1740     1734
##  2 Aerospace Studies                         69      112       96
##  3 African American Studies                 313      238      333
##  4 Applied Health Sciences & Wellness      3922     4525     4268
##  5 Art                                     2447     2969     2738
##  6 Aviation                                 293      275      315
##  7 Biological Sciences                     6722     7947     7801
##  8 Center For Intl Studies                  255      328      338
##  9 Center for Law, Justice and Culture       NA       NA       54
## 10 Chemical & Biomolecular Engineering      870     1189     1059

lubridate

This package makes working with dates and times quite easy because it does what base R does but in more intuitive ways and perhaps more flexibly. Let us start with some date fields.

today1 = "20171217"
today2 = "2017-12-17"
today3 = "2017 December 17"
today4 = "20171217143241"
today5 = "2017 December 17 14:32:41"
today6 = "December 17 2017 14:32:41"
today7 = "17-Dec, 2017 14:32:41"

The formats are quite varied but lubridate deals with them quite seamlessly so long as you pay attention to the order – is year first or last? What about month? day? Is time given in hours, minutes and seconds?.

library(lubridate)

ymd(today1)
## [1] "2017-12-17"
ymd(today2)
## [1] "2017-12-17"
ymd(today3)
## [1] "2017-12-17"
ymd_hms(today4)
## [1] "2017-12-17 14:32:41 UTC"
ymd_hms(today5)
## [1] "2017-12-17 14:32:41 UTC"
mdy_hms(today6)
## [1] "2017-12-17 14:32:41 UTC"
dmy_hms(today7)
## [1] "2017-12-17 14:32:41 UTC"

If I need to extract data/time elements, I can do that as well.

today = dmy_hms(today7); today # a date and time set in today7 
## [1] "2017-12-17 14:32:41 UTC"
today.y = year(today); today.y # The year 
## [1] 2017
today.m1 = month(today); today.m1 # the month, as a number  
## [1] 12
today.m2 = month(today, label = TRUE, abbr = TRUE); today.m2 # labeling the month but with an abbreviation 
## [1] Dec
## 12 Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < ... < Dec
today.m3 = month(today, label = TRUE, abbr = FALSE); today.m3# fully labelling the month 
## [1] December
## 12 Levels: January < February < March < April < May < June < ... < December
today.w = week(today)
today.w # what week of the year is it? 
## [1] 51
today.doy = yday(today)
today.doy # what day of the year is it? 
## [1] 351
today.dom = mday(today)
today.dom # what day of the month is it? 
## [1] 17
today.dow1 = wday(today)
today.dow1 # what day of the week is it, as a number? 
## [1] 1
today.dow2 = wday(today, label = TRUE, abbr = TRUE)
today.dow2 #  day of the week, abbreviated label
## [1] Sun
## Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat
today.dow3 = wday(today, label = TRUE, abbr = FALSE)
today.dow3 # day of the week fully labelled  
## [1] Sunday
## 7 Levels: Sunday < Monday < Tuesday < Wednesday < Thursday < ... < Saturday
today.h = hour(today)
today.h # what hour is it? 
## [1] 14
today.m = minute(today)
today.m # what minute is it? 
## [1] 32
today.s = second(today)
today.s # what second is it? 
## [1] 41
today.tz = tz(today)
today.tz # what time zone is it? 
## [1] "UTC"

Intervals vs. Durations vs. Periods

Let us start with a duration, the simplest measure of lapsed time since it measures the passing of time in seconds. Say I pick two dates and times … 05:00am on March 9, 2019 and 05:00am on March 10, 2019.

date1 <- ymd_hms("2019-03-09 05:00:00", tz = "US/Eastern")
date2 <- ymd_hms("2019-03-10 05:00:00", tz = "US/Eastern")
timint <- interval(date1, date2)
timint
## [1] 2019-03-09 05:00:00 EST--2019-03-10 05:00:00 EDT

How much time has lapsed between date1 and date2? We calculate with as.duration()

timelapsed.d <- as.duration(timint) 

Why is it 23 hours and not 24 hours? Because daylight savings kicks in at 2:00 AM on March 10, and since the clocks kick forward by one hour, only 23 hours have passed, not 24 as we might naively expect. Go back and look at the output from timeint; see the EST versus EDT? In contrast to duration, if I ask for the period of time, what do I get?

timelapsed.p <- as.period(timint) 

Aha! as.period() is imprecise because it tells me 1 day has passed

Now, we can use the duration of time lapsed in whatever units we want, but if we want accuracy, we better work with durations, as shown below.

time_length(timelapsed.d, unit = "second")
## [1] 82800
time_length(timelapsed.d, unit = "minute")
## [1] 1380
time_length(timelapsed.d, unit = "hour")
## [1] 23
time_length(timelapsed.d, unit = "day")
## [1] 0.9583333
time_length(timelapsed.d, unit = "week")
## [1] 0.1369048
time_length(timelapsed.d, unit = "month")
## [1] 0.03148528
time_length(timelapsed.d, unit = "year")
## [1] 0.002623774

duration() is always estimated in seconds but you can change the reporting unit