“Yet far too much handcrafted work — what data scientists call “data wrangling,” “data munging” and “data janitor work” — is still required. Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets."
Anybody who has worked with real-world data knows how messy these data tend to be. As a result, you find yourself cleaning up the data, slapping on value labels, perhaps pruning duplicate records, changing variable names to something manageable, and so on. One could use base R commands to complete all these tasks but instead we focus on the tidyverse
, described by its authors as “an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.” While there are several packages that make up the tidyverse
, we will focus on two:
dplyr
– “dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges”
tidyr
– “Tidy data describes a standard way of storing data that is used wherever possible throughout the tidyverse. If you ensure that your data is tidy, you’ll spend less time fighting with the tools and more time working on your analysis.”Let us start by loading the (i) xclass
and (ii) xstudent
data-sets.
load("data/xclass.RData")
load("data/xstudent.RData")
Now, the first thing we may want to do is to clean up the variable names – all lowercase, elements separated by an underscore. This makes it easy to keep the structure of the variable names straight in our workflow without worrying about uppercase/lowercase, underscores versus dashes or periods, or worse, blank spaces.
Other features of the janitor package can be explored here
library(janitor)
xstudent <- janitor::clean_names(xstudent)
xclass <- janitor::clean_names(xclass)
The second thing I want to do is to get a quick overview of what the data contains. There are many ways to do this but let us rely on skimr
Learn more about the skimr package here
library(skimr)
skim(xstudent)
Data summary
Name | xstudent |
Number of rows | 106771 |
Number of columns | 21 |
_______________________ | |
Column type frequency: | |
character | 19 |
numeric | 2 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
term_code | 0 | 1 | 6 | 6 | 0 | 3 | 0 |
anon_id | 0 | 1 | 11 | 11 | 0 | 106771 | 0 |
campus_detail_code | 0 | 1 | 2 | 2 | 0 | 17 | 0 |
campus_name | 0 | 1 | 6 | 11 | 0 | 6 | 0 |
campus_detail_desc | 0 | 1 | 6 | 23 | 0 | 17 | 0 |
student_rank_code | 0 | 1 | 2 | 2 | 0 | 8 | 0 |
student_career | 0 | 1 | 7 | 13 | 0 | 3 | 0 |
student_rank_desc | 0 | 1 | 6 | 24 | 0 | 8 | 0 |
acad_plan | 0 | 1 | 6 | 6 | 0 | 576 | 0 |
college_code | 0 | 1 | 2 | 2 | 0 | 14 | 0 |
college_desc | 0 | 1 | 8 | 29 | 0 | 14 | 0 |
htc_college_code | 0 | 1 | 2 | 2 | 0 | 15 | 0 |
htc_college_desc | 0 | 1 | 0 | 29 | 906 | 15 | 0 |
field_of_study_desc | 0 | 1 | 3 | 40 | 0 | 96 | 0 |
acad_plan_desc | 0 | 1 | 3 | 40 | 0 | 436 | 0 |
ethnicity_desc | 0 | 1 | 5 | 38 | 0 | 9 | 0 |
residency_code | 0 | 1 | 1 | 1 | 0 | 2 | 0 |
sex_code | 0 | 1 | 1 | 1 | 0 | 3 | 0 |
university_ftpt | 0 | 1 | 9 | 9 | 0 | 2 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
hours_carried_all_campuses | 0 | 1 | 12.83 | 5.24 | 0 | 9.0 | 15 | 16.00 | 48 | ▃▇▁▁▁ |
fte | 0 | 1 | 0.84 | 0.31 | 0 | 0.6 | 1 | 1.07 | 2 | ▂▂▇▁▁ |
Now we can get to work with dplyr
but before we do so, note that you will see the following operator showing up a lot in the code %>%
The %>%
operator is useful for chaining a sequence of operations
into a single command
the_data <-
read.csv('/path/to/data/file.csv') %>%
subset(variable_a > x) %>%
transform(variable_c = variable_a/variable_b)
This command means …
THEN
THEN
The result is saved in the_data
Say we want to check for and eliminate duplicate records from a data-set. I will assume (incorrectly I know) that there should be just one row per student in the xclass
data-set. I can retain unique records via the following
xclass.df1 <- xclass %>%
distinct(anon_id, .keep_all = TRUE)
.keep_all = TRUE retains all columns. Seetting it to FALSE will only retain the anon_id column.
But a student could be enrolled on multiple campuses and have multiple majors. How might we take this into account?
xclass.df2 <- xclass %>%
distinct(anon_id, campus_name, term_code, department_desc,
.keep_all = TRUE)
If more than one variable is specified in distinct() then dplyr looks for unique combinations of these variables. Each unique combination \(=\) a unique observation
Say I have data on some students, with test scores in one file and their gender in another file. Each file has a unique student identifier, called ID
. How can I create a unified data-set? Let us see each data-set first.
data1 = tibble(Score = c(10, 21, 33, 12, 35, 67, 43, 99),
ID = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "C31", "Q22")
)
data2 = tibble(Sex = c("Male", "Female", "Male", "Female", "Male", "Female", "Male", "Female"),
ID = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "E52", "H71")
)
Open up both data-sets and note that students C31 and Q22 are missing from data2
and students E52 and H71 are missing from data1
.
These combine data frames but to do so we need to have merge key(s)
– variable(s) that identifies unique observations in each file. Next, you need to decide on what you want.
natural join
… merge observations that show up in both files
natural <- inner_join(x = data1, y = data2, by = "ID")
natural
## # A tibble: 6 x 3
## Score ID Sex
## <dbl> <chr> <chr>
## 1 10 A12 Male
## 2 21 A23 Female
## 3 33 Z14 Male
## 4 12 WX1 Female
## 5 35 Y31 Male
## 6 67 D66 Female
full outer join
… merge everything even if some cases show up in one but not the other
full <- full_join(x = data1, y = data2, by = "ID")
full
## # A tibble: 10 x 3
## Score ID Sex
## <dbl> <chr> <chr>
## 1 10 A12 Male
## 2 21 A23 Female
## 3 33 Z14 Male
## 4 12 WX1 Female
## 5 35 Y31 Male
## 6 67 D66 Female
## 7 43 C31 <NA>
## 8 99 Q22 <NA>
## 9 NA E52 Male
## 10 NA H71 Female
left outer join
… merge such that all cases in file x
are retained but only those seen in file y
are joined
left <- left_join(x = data1, y = data2, by = "ID")
left
## # A tibble: 8 x 3
## Score ID Sex
## <dbl> <chr> <chr>
## 1 10 A12 Male
## 2 21 A23 Female
## 3 33 Z14 Male
## 4 12 WX1 Female
## 5 35 Y31 Male
## 6 67 D66 Female
## 7 43 C31 <NA>
## 8 99 Q22 <NA>
right outer join
… merge to keep all cases in y
and only those cases are to be merged from x
that show up in y
right <- right_join(x = data1, y = data2, by = "ID")
right
## # A tibble: 8 x 3
## Score ID Sex
## <dbl> <chr> <chr>
## 1 10 A12 Male
## 2 21 A23 Female
## 3 33 Z14 Male
## 4 12 WX1 Female
## 5 35 Y31 Male
## 6 67 D66 Female
## 7 NA E52 Male
## 8 NA H71 Female
What if the ID variables had different names in the files? You could rename them to have a common name or you could use by.x =
and by.y =
as shown below.
data3 <- tibble(Score = c(10, 21, 33, 12, 35, 67, 43, 99),
ID1 = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "C31", "Q22")
)
data4 <- tibble(Sex = c("Male", "Female", "Male", "Female", "Male", "Female", "Male", "Female"),
ID2 = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "E52", "H71")
)
diff.full <- full_join(data3, data4, by = c("ID1" = "ID2"))
You can also have more than one merge key. For example, if I am merging data for Ohio schools, each district has a district ID number (dirn)
, each school has building ID number (birn)
, and then the district’s name (district)
, the building’s name (building)
, and so on. If I am merging these data then my by =
statement will be by = c("dirn", "birn", "district", "building")
.
Let us work through one example.
Load the sm_xstudent
data frame and make sure the xclass
data frame is loaded as well.
Look at sm_xstudent
… it has six variables, three of which are also in xclass – term_code
, anon_id
, and sex_code
. The other three variables are not present in xclass
– ethnicity_desc
, residency_code
, and university_ftpt
.
We will use the three common variables (the keys
) to merge sm_xstudent
and xclass
load("data/sm_xstudent.RData")
multkey.merge <- full_join(
sm_xstudent, xclass,
by = c("term_code" = "term_code",
"anon_id" = "anon_id",
"sex_code" = "sex_code")
)
Open multkey.merge
and see the merged fields
If we have multiple data-frames to merge we can do so in several ways but my default approach is to rely on the Reduce()
command. Note: This is base R code
df1 <- tibble(Score = c(10, 21, 33, 12, 35, 67, 43, 99),
ID = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "C31", "Q22")
)
df2 <- tibble(Sex = c("Male", "Female", "Male", "Female", "Male", "Female", "Male", "Female"),
ID = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "E52", "H71")
)
df3 <- tibble(Age = c(6, 7, 6, 8, 8, 9, 10, 5),
ID = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "E52", "H71")
)
my.list <- list(df1, df2, df3)
df.123 <- Reduce(function(...) merge(..., by = c("ID"), all = TRUE), my.list)
Here is the merged data-frame …
ID | Score | Sex | Age |
---|---|---|---|
A12 | 10 | Male | 6 |
A23 | 21 | Female | 7 |
C31 | 43 | NA | NA |
D66 | 67 | Female | 9 |
E52 | NA | Male | 10 |
H71 | NA | Female | 5 |
Q22 | 99 | NA | NA |
WX1 | 12 | Female | 8 |
Y31 | 35 | Male | 8 |
Z14 | 33 | Male | 6 |
These only retain columns in the x
data frame
semi_join()
… will keep only those rows found in x
and y
but keep only the columns in x
semi <- semi_join(data1, data2, by = "ID")
semi
## # A tibble: 6 x 2
## Score ID
## <dbl> <chr>
## 1 10 A12
## 2 21 A23
## 3 33 Z14
## 4 12 WX1
## 5 35 Y31
## 6 67 D66
anti_join()
… will keep only those rows not found in x
and y
but keep only the columns in x
anti <- anti_join(data1, data2, by = "ID")
anti
## # A tibble: 2 x 2
## Score ID
## <dbl> <chr>
## 1 43 C31
## 2 99 Q22
rbind()
What if we have to stack data frames atop one another, for example, because one data frame is from one term and the second from another term? No problem.
Load the rb1
(where term_code == 20171C
), rb2
(where term_code == 20181C
), and rb3
(where term_code == 20191C
) data frames.
load("data/rb1.RData")
load("data/rb2.RData")
load("data/rb3.RData")
my.dfs <- rbind(rb1, rb2, rb3)
Now go ahead and merge xclass
and xstudent
, making sure no variable shows up twice in the merged data frame. Then save the merged data frame as multkey.merge.RData
and let us get cracking with dplyr’s verbs.
Say I want to select only online students in xclass
df.1 <- multkey.merge %>%
filter(online.f == "Online")
What if I wanted online students in Athens?
df.2 <- multkey.merge %>%
filter(online.f == "Online", campus_name == "Athens")
What if I only want students in selected colleges?
df.3 <- multkey.merge %>%
filter(college_desc %in% c("Education", "Fine Arts"))
What if I only want students in all other colleges except Fine Arts and Education?
df.4 <- multkey.merge %>%
filter(!college_desc %in% c("Education", "Fine Arts"))
Note the use of !
to create not in
. Similarly, =
means “equal to”, !=
means “not equal to”, |
means “or”, and &
means “and”
What if have complicated filters of multiple variables? No problem …
df.5 <- multkey.merge %>%
filter(!college_desc %in% c("Education", "Fine Arts"),
sex_code == "F",
enrlld_crdt_hrs > 4 & enrlld_crdt_hrs < 8)
glimpse(df.5)
## Rows: 4,923
## Columns: 30
## $ term_code <chr> "20171C", "20171C", "20171C", "20171C", "20171C"…
## $ anon_id <chr> "ANON1000083", "ANON1000083", "ANON1000116", "AN…
## $ ethnicity_desc <chr> "White", "White", "White", "White", "White", "Wh…
## $ sex_code <chr> "F", "F", "F", "F", "F", "F", "F", "F", "F", "F"…
## $ residency_code <chr> "R", "R", "R", "R", "R", "R", "R", "R", "N", "R"…
## $ university_ftpt <chr> "Full Time", "Full Time", "Part Time", "Part Tim…
## $ class_nbr <chr> "08287", "08285", "08024", "08935", "09761", "09…
## $ campus_name <chr> "Chillicothe", "Chillicothe", "Chillicothe", "Za…
## $ campus_dtl_desc <chr> "Chillicothe Main Campus", "Chillicothe Main Cam…
## $ college_desc <chr> "Health Sciences & Professions", "Health Science…
## $ department_desc <chr> "Nursing", "Nursing", "Mathematics", "Vice Presi…
## $ subject <chr> "NRSE", "NRSE", "MATH", "NURS", "SW", "MATH", "M…
## $ catalog_nbr <chr> "4150", "4140", "D004", "1210", "4923", "D004", …
## $ class_desc <chr> "Populations: Family, Community", "Nursing Care …
## $ delivery_desc <chr> "Lecture", "Lecture", "Lecture", "Lecture", "Int…
## $ online_flag <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ rank_group <chr> "U", "U", "U", "U", "U", "U", "U", "U", "G", "M"…
## $ rank_desc <chr> "Senior", "Senior", "Senior", "Senior", "Senior"…
## $ acad_plan <chr> "BS1221", "BS1221", "ND0210", "ND1201", "BS6605"…
## $ student_college_desc <chr> "Health Sciences & Professions", "Health Science…
## $ acad_plan_desc <chr> "Nursing", "Nursing", "Undecided-HSP", "Undecide…
## $ field_of_study_desc <chr> "Nursing", "Nursing", "Undecided Health Sciences…
## $ residency_desc <chr> "Resident", "Resident", "Resident", "Resident", …
## $ enrolled_flag <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ enrlld_crdt_hrs <dbl> 5, 5, 5, 7, 6, 5, 5, 7, 6, 7, 5, 6, 7, 5, 7, 6, …
## $ online.f <fct> Not Online, Not Online, Not Online, Not Online, …
## $ sex.f <chr> "Female", "Female", "Female", "Female", "Female"…
## $ enrolled.f <chr> "4 - 15 hours", "4 - 15 hours", "4 - 15 hours", …
## $ rank <chr> "Senior", "Senior", "Senior", "Senior", "Senior"…
## $ college <chr> "HEALTH SCIENCES & PROFESSIONS", "HEALTH SCIENCE…
names(multkey.merge)
## [1] "term_code" "anon_id" "ethnicity_desc"
## [4] "sex_code" "residency_code" "university_ftpt"
## [7] "class_nbr" "campus_name" "campus_dtl_desc"
## [10] "college_desc" "department_desc" "subject"
## [13] "catalog_nbr" "class_desc" "delivery_desc"
## [16] "online_flag" "rank_group" "rank_desc"
## [19] "acad_plan" "student_college_desc" "acad_plan_desc"
## [22] "field_of_study_desc" "residency_desc" "enrolled_flag"
## [25] "enrlld_crdt_hrs"
Say I want the first 4 columns
select.1 <- multkey.merge %>%
select(c(1:4))
names(select.1)
## [1] "term_code" "anon_id" "ethnicity_desc" "sex_code"
What if my columns are not sequential? No worries …
select.2 <- multkey.merge %>%
select(c(1, 2, 6, 9:11))
names(select.2)
## [1] "term_code" "anon_id" "university_ftpt" "campus_dtl_desc"
## [5] "college_desc" "department_desc"
Could I have called for the desired columns by name? Sure …
select.3 <- multkey.merge %>%
select(c(term_code, class_nbr, rank_desc, acad_plan))
names(select.3)
## [1] "term_code" "class_nbr" "rank_desc" "acad_plan"
Could I call out desired variables by a specific string in the column name?
select.4 <- multkey.merge %>%
select(contains("desc"))
names(select.4)
## [1] "ethnicity_desc" "campus_dtl_desc" "college_desc"
## [4] "department_desc" "class_desc" "delivery_desc"
## [7] "rank_desc" "student_college_desc" "acad_plan_desc"
## [10] "field_of_study_desc" "residency_desc"
What if the variable name begins/ends with some string?
select.5 <- multkey.merge %>%
select(ends_with("desc"))
names(select.5)
## [1] "ethnicity_desc" "campus_dtl_desc" "college_desc"
## [4] "department_desc" "class_desc" "delivery_desc"
## [7] "rank_desc" "student_college_desc" "acad_plan_desc"
## [10] "field_of_study_desc" "residency_desc"
select.6 <- multkey.merge %>%
select(starts_with("acad"))
names(select.6)
## [1] "acad_plan" "acad_plan_desc"
Say I want to calculate some quantities of interest — frequencies perhaps, or means, standard deviations, etc. Let us start with mean, medians, etc.
tab1 <- multkey.merge %>%
summarise(
Mean.cr = mean(enrlld_crdt_hrs, na.rm = TRUE),
Median.cr = median(enrlld_crdt_hrs, na.rm = TRUE),
StdDev.cr = sd(enrlld_crdt_hrs, na.rm = TRUE),
Var.cr = var(enrlld_crdt_hrs, na.rm = TRUE),
Q1.cr = quantile(enrlld_crdt_hrs, p = c(0.25), na.rm = TRUE),
Q3.cr = quantile(enrlld_crdt_hrs, p = c(0.75), na.rm = TRUE)
)
tab1
## # A tibble: 1 x 6
## Mean.cr Median.cr StdDev.cr Var.cr Q1.cr Q3.cr
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2.80 3 1.37 1.88 3 3
What if we wanted frequencies
? Assume we want to know enrollment by college.
tab2 <- multkey.merge %>%
group_by(college_desc) %>%
summarise(frequency = n()) %>%
arrange(frequency)
tab2
## # A tibble: 14 x 2
## college_desc frequency
## <chr> <int>
## 1 Miscellaneous 308
## 2 Honors Tutorial 481
## 3 International Studies 921
## 4 George Voinovich School 1317
## 5 Osteopathic Medicine 12170
## 6 University College 17542
## 7 Regional Higher Ed 19043
## 8 Engineering & Technology 27094
## 9 Communication 33312
## 10 Fine Arts 34296
## 11 Education 37544
## 12 Business 53606
## 13 Health Sciences & Professions 77085
## 14 Arts & Sciences 173841
tab3 <- multkey.merge %>%
group_by(college_desc) %>%
summarise(frequency = n()) %>%
arrange(-frequency)
tab3
## # A tibble: 14 x 2
## college_desc frequency
## <chr> <int>
## 1 Arts & Sciences 173841
## 2 Health Sciences & Professions 77085
## 3 Business 53606
## 4 Education 37544
## 5 Fine Arts 34296
## 6 Communication 33312
## 7 Engineering & Technology 27094
## 8 Regional Higher Ed 19043
## 9 University College 17542
## 10 Osteopathic Medicine 12170
## 11 George Voinovich School 1317
## 12 International Studies 921
## 13 Honors Tutorial 481
## 14 Miscellaneous 308
Frequencies are well and good but we may want percentages
tab4 <- multkey.merge %>%
group_by(college_desc) %>%
summarise(frequency = n()) %>%
mutate(percent = (frequency / sum(frequency)) * 100,
percent = round(percent, digits = 2)) %>%
arrange(percent)
tab4
## # A tibble: 14 x 3
## college_desc frequency percent
## <chr> <int> <dbl>
## 1 Miscellaneous 308 0.06
## 2 Honors Tutorial 481 0.1
## 3 International Studies 921 0.19
## 4 George Voinovich School 1317 0.27
## 5 Osteopathic Medicine 12170 2.49
## 6 University College 17542 3.59
## 7 Regional Higher Ed 19043 3.9
## 8 Engineering & Technology 27094 5.55
## 9 Communication 33312 6.82
## 10 Fine Arts 34296 7.02
## 11 Education 37544 7.68
## 12 Business 53606 11.0
## 13 Health Sciences & Professions 77085 15.8
## 14 Arts & Sciences 173841 35.6
tab5 <- multkey.merge %>%
group_by(college_desc) %>%
summarise(frequency = n()) %>%
mutate(percent = (frequency / sum(frequency)) * 100,
percent = round(percent, digits = 2)) %>%
arrange(-percent)
tab5
## # A tibble: 14 x 3
## college_desc frequency percent
## <chr> <int> <dbl>
## 1 Arts & Sciences 173841 35.6
## 2 Health Sciences & Professions 77085 15.8
## 3 Business 53606 11.0
## 4 Education 37544 7.68
## 5 Fine Arts 34296 7.02
## 6 Communication 33312 6.82
## 7 Engineering & Technology 27094 5.55
## 8 Regional Higher Ed 19043 3.9
## 9 University College 17542 3.59
## 10 Osteopathic Medicine 12170 2.49
## 11 George Voinovich School 1317 0.27
## 12 International Studies 921 0.19
## 13 Honors Tutorial 481 0.1
## 14 Miscellaneous 308 0.06
Perhaps I want frequencies and percentages by sex_code
tab6 <- multkey.merge %>%
group_by(college_desc, sex_code) %>%
summarise(frequency = n()) %>%
mutate(percent = (frequency / sum(frequency)) * 100,
percent = round(percent, digits = 2)) %>%
arrange(college_desc, sex_code, -percent)
tab6
## # A tibble: 31 x 4
## # Groups: college_desc [14]
## college_desc sex_code frequency percent
## <chr> <chr> <int> <dbl>
## 1 Arts & Sciences F 98267 56.5
## 2 Arts & Sciences M 75574 43.5
## 3 Business F 19875 37.1
## 4 Business M 33731 62.9
## 5 Communication F 18844 56.6
## 6 Communication M 14468 43.4
## 7 Education F 25939 69.1
## 8 Education M 11605 30.9
## 9 Engineering & Technology F 4542 16.8
## 10 Engineering & Technology M 22552 83.2
## # … with 21 more rows
Or by sex_code
and rank_desc
tab7 <- multkey.merge %>%
group_by(college_desc, sex_code, rank_desc) %>%
summarise(frequency = n()) %>%
mutate(percent = (frequency / sum(frequency)) * 100,
percent = round(percent, digits = 1)) %>%
arrange(college_desc, sex_code, rank_desc, -percent)
tab7
## # A tibble: 178 x 5
## # Groups: college_desc, sex_code [31]
## college_desc sex_code rank_desc frequency percent
## <chr> <chr> <chr> <int> <dbl>
## 1 Arts & Sciences F Doctoral Student 1178 1.2
## 2 Arts & Sciences F Freshman 33519 34.1
## 3 Arts & Sciences F Junior 16100 16.4
## 4 Arts & Sciences F Master's Student 3388 3.4
## 5 Arts & Sciences F Non-Degree Undergraduate 4068 4.1
## 6 Arts & Sciences F Senior 17321 17.6
## 7 Arts & Sciences F Sophomore 22693 23.1
## 8 Arts & Sciences M Doctoral Student 1604 2.1
## 9 Arts & Sciences M Freshman 26114 34.6
## 10 Arts & Sciences M Junior 12320 16.3
## # … with 168 more rows
Often you will need to collapse some categories or otherwise modify some values. For example, perhaps I want sex_code
to be fully spelled out instead of F
and M
. How might I do that?
multkey.merge <- multkey.merge %>%
mutate(
sex.f = case_when(
sex_code == "F" ~ "Female",
sex_code == "M" ~ "Male"
)
)
table(multkey.merge$sex.f)
##
## Female Male
## 277650 210907
multkey.merge <- multkey.merge %>%
mutate(
sex.f = case_when(
sex_code == "F" ~ "Female",
TRUE ~ "Male"
)
)
table(multkey.merge$sex.f)
##
## Female Male
## 277650 210910
I can also take a numeric variable and flip it into a categorical variable with case_when()
. For example, say I want to split enrlld_crdt_hrs
into three groups, those with hours <= 4, those with hours >= 4 but <= 15, and those with hours >= 16
multkey.merge <- multkey.merge %>%
mutate(
enrolled.f = case_when(
enrlld_crdt_hrs < 4 ~ "<= 3 hours",
enrlld_crdt_hrs >= 4 & enrlld_crdt_hrs <= 15 ~ "4 - 15 hours",
enrlld_crdt_hrs > 15 ~ "> 15 hours",
)
)
table(multkey.merge$enrolled.f)
##
## <= 3 hours > 15 hours 4 - 15 hours
## 404150 54 84356
multkey.merge <- multkey.merge %>%
mutate(
enrolled.f = case_when(
enrlld_crdt_hrs < 4 ~ "<= 3 hours",
enrlld_crdt_hrs >= 4 & enrlld_crdt_hrs <= 15 ~ "4 - 15 hours",
TRUE ~ "> 15 hours",
)
)
table(multkey.merge$enrolled.f)
##
## <= 3 hours > 15 hours 4 - 15 hours
## 404150 54 84356
Some times we may need to eliminate some strings in specific columns, perhaps change their case, or replace a string with another string. All this and more can be done either by two packages: stringr
or stringi
. Let us see stringr
in action for a few commonly encountered tasks.
For example, rank_desc
has the word “student” in the labels. Let us assume we want to get rid of these. Here goes …
multkey.merge <- multkey.merge %>%
mutate(rank = stringr::str_replace(rank_desc, " Student", ""))
What about changing the case
?
multkey.merge <- multkey.merge %>%
mutate(college = stringr::str_to_upper(college_desc))
Other case options: str_to_title
, str_to_lower
, str_to_sentence
stringr
is very versatile; see here for the authors’ walkthrough
Now we are back to dplyr
so do not confuse these with stringr
. You will, more often than not, want to calculate quantities of interest for specific groups – by campus, by term, by college/department/course, and so on.
Well, the group-by()
switch allows you to specify the grouping structure, while ungroup()
will remove constraints imposed by the grouping structure. For example, assume I want to calculate the number of students in each college
multkey.merge %>%
group_by(college_desc, campus_name) %>%
summarise(students = n_distinct(anon_id)) %>%
arrange(-students)
## # A tibble: 56 x 3
## # Groups: college_desc [14]
## college_desc campus_name students
## <chr> <chr> <int>
## 1 Arts & Sciences Athens 47145
## 2 Health Sciences & Professions Athens 30119
## 3 Business Athens 19247
## 4 Communication Athens 14332
## 5 Fine Arts Athens 14272
## 6 Education Athens 13854
## 7 University College Athens 12891
## 8 Engineering & Technology Athens 8263
## 9 Arts & Sciences Lancaster 4590
## 10 Arts & Sciences Chillicothe 4021
## # … with 46 more rows
n_distinct()
is picking up unique values of anon_id
even though the data frame has multiple rows per some students
What about male versus female students?
multkey.merge %>%
group_by(college_desc, campus_name, sex_code) %>%
summarise(students = n_distinct(anon_id)) %>%
arrange(college_desc, campus_name, -students)
## # A tibble: 115 x 4
## # Groups: college_desc, campus_name [56]
## college_desc campus_name sex_code students
## <chr> <chr> <chr> <int>
## 1 Arts & Sciences Athens F 25515
## 2 Arts & Sciences Athens M 21630
## 3 Arts & Sciences Chillicothe F 2559
## 4 Arts & Sciences Chillicothe M 1462
## 5 Arts & Sciences Eastern F 1318
## 6 Arts & Sciences Eastern M 847
## 7 Arts & Sciences Lancaster F 2755
## 8 Arts & Sciences Lancaster M 1835
## 9 Arts & Sciences Southern F 2032
## 10 Arts & Sciences Southern M 1012
## # … with 105 more rows
What about male versus female students by ethnicity?
multkey.merge %>%
group_by(college_desc, campus_name, sex_code, ethnicity_desc) %>%
summarise(students = n_distinct(anon_id)) %>%
arrange(college_desc, campus_name, ethnicity_desc, sex_code)
## # A tibble: 820 x 5
## # Groups: college_desc, campus_name, sex_code [115]
## college_desc campus_name sex_code ethnicity_desc students
## <chr> <chr> <chr> <chr> <int>
## 1 Arts & Sciences Athens F American Indian/Alaska Native 43
## 2 Arts & Sciences Athens M American Indian/Alaska Native 34
## 3 Arts & Sciences Athens F Asian 310
## 4 Arts & Sciences Athens M Asian 215
## 5 Arts & Sciences Athens F Black or African American 1488
## 6 Arts & Sciences Athens M Black or African American 1087
## 7 Arts & Sciences Athens F Hispanic or Latino 846
## 8 Arts & Sciences Athens M Hispanic or Latino 611
## 9 Arts & Sciences Athens F Native Hawaiian/Other Pacific … 11
## 10 Arts & Sciences Athens M Native Hawaiian/Other Pacific … 17
## # … with 810 more rows
Maybe I want the number enrolled in a give course. To do this, I will first combine subject
and catalog_nbr
multkey.merge %>%
mutate(course = stringr::str_c(subject, catalog_nbr, sep = " ")) %>%
group_by(term_code, campus_name, course) %>%
summarise(students = n_distinct(anon_id)) %>%
arrange(-students)
## # A tibble: 12,666 x 4
## # Groups: term_code, campus_name [18]
## term_code campus_name course students
## <chr> <chr> <chr> <int>
## 1 20181C Athens UC 1900 3107
## 2 20191C Athens UC 1900 3102
## 3 20171C Athens UC 1900 2811
## 4 20171C Athens ENG 1510 1547
## 5 20171C Athens NRSE 4570 1467
## 6 20181C Athens ENG 1510 1428
## 7 20171C Athens NRSE 4520 1426
## 8 20191C Athens NRSE 4550 1414
## 9 20171C Athens PSY 1010 1404
## 10 20181C Athens NRSE 4530 1334
## # … with 12,656 more rows
You may need to rename some existing columns. An easy way to do this is as follows:
multkey.merge %>%
rename("Ethnicity" = ethnicity_desc,
"Sex" = sex_code,
"Department" = department_desc) %>%
names()
## [1] "term_code" "anon_id" "Ethnicity"
## [4] "Sex" "residency_code" "university_ftpt"
## [7] "class_nbr" "campus_name" "campus_dtl_desc"
## [10] "college_desc" "Department" "subject"
## [13] "catalog_nbr" "class_desc" "delivery_desc"
## [16] "online_flag" "rank_group" "rank_desc"
## [19] "acad_plan" "student_college_desc" "acad_plan_desc"
## [22] "field_of_study_desc" "residency_desc" "enrolled_flag"
## [25] "enrlld_crdt_hrs" "sex.f" "enrolled.f"
## [28] "rank" "college"
After running some operations that generate some quantities of interest we may want to extract or merely peek at the n
observations with the highest values or n
observatins with the lowest values. This can be done as follows:
multkey.merge %>%
group_by(department_desc) %>%
summarize(frequency = n()) %>%
arrange(-frequency) %>%
top_n(10, frequency)
## # A tibble: 10 x 2
## department_desc frequency
## <chr> <int>
## 1 Nursing 36146
## 2 Management Systems 27078
## 3 Biological Sciences 22470
## 4 Mathematics 20716
## 5 Vice President Regional Higher Education 19043
## 6 Psychology 18791
## 7 English 17148
## 8 Communication Studies 17008
## 9 Teacher Education 16464
## 10 Dean University College 16447
multkey.merge %>%
group_by(department_desc) %>%
summarize(frequency = n()) %>%
arrange(-frequency) %>%
top_n(-10, frequency)
## # A tibble: 10 x 2
## department_desc frequency
## <chr> <int>
## 1 Center For Intl Studies 921
## 2 African American Studies 884
## 3 Aviation 883
## 4 Military Science 818
## 5 Ohio Program of Intensive English 505
## 6 Honors College 481
## 7 Misc 308
## 8 Aerospace Studies 277
## 9 Education Dean 69
## 10 Center for Law, Justice and Culture 54
What are tidy data?
Here are some untidy data
pew <- read.delim(
file = "http://stat405.had.co.nz/data/pew.txt",
header = TRUE,
stringsAsFactors = FALSE,
check.names = F
)
head(pew)
## religion <$10k $10-20k $20-30k $30-40k $40-50k $50-75k $75-100k
## 1 Agnostic 27 34 60 81 76 137 122
## 2 Atheist 12 27 37 52 35 70 73
## 3 Buddhist 27 21 30 34 33 58 62
## 4 Catholic 418 617 732 670 638 1116 949
## 5 Don’t know/refused 15 14 15 11 10 35 21
## 6 Evangelical Prot 575 869 1064 982 881 1486 949
## $100-150k >150k Don't know/refused
## 1 109 84 96
## 2 59 74 76
## 3 39 53 54
## 4 792 633 1489
## 5 17 18 116
## 6 723 414 1529
devtools::install_github("garrettgman/DSR")
library(DSR)
table2
## # A tibble: 12 x 4
## country year key value
## <fct> <int> <fct> <int>
## 1 Afghanistan 1999 cases 745
## 2 Afghanistan 1999 population 19987071
## 3 Afghanistan 2000 cases 2666
## 4 Afghanistan 2000 population 20595360
## 5 Brazil 1999 cases 37737
## 6 Brazil 1999 population 172006362
## 7 Brazil 2000 cases 80488
## 8 Brazil 2000 population 174504898
## 9 China 1999 cases 212258
## 10 China 1999 population 1272915272
## 11 China 2000 cases 213766
## 12 China 2000 population 1280428583
table3
## # A tibble: 6 x 3
## country year rate
## <fct> <int> <chr>
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898
## 5 China 1999 212258/1272915272
## 6 China 2000 213766/1280428583
table4
## # A tibble: 3 x 3
## country `1999` `2000`
## <fct> <int> <int>
## 1 Afghanistan 745 2666
## 2 Brazil 37737 80488
## 3 China 212258 213766
table5
## # A tibble: 3 x 3
## country `1999` `2000`
## <fct> <int> <int>
## 1 Afghanistan 19987071 20595360
## 2 Brazil 172006362 174504898
## 3 China 1272915272 1280428583
What would a tidy version of these examples look like?
table1
## # A tibble: 6 x 4
## country year cases population
## <fct> <int> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
Here, we have
library(tidyr)
s.table3 <- table3 %>%
separate(col = rate, into = c("cases", "population"), sep = "/",
remove = TRUE, convert = TRUE)
head(s.table3)
## # A tibble: 6 x 4
## country year cases population
## <fct> <int> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
u.table3 <- s.table3 %>%
unite("rate", cases, population, sep = ":", remove = FALSE)
head(u.table3)
## # A tibble: 6 x 5
## country year rate cases population
## <fct> <int> <chr> <int> <int>
## 1 Afghanistan 1999 745:19987071 745 19987071
## 2 Afghanistan 2000 2666:20595360 2666 20595360
## 3 Brazil 1999 37737:172006362 37737 172006362
## 4 Brazil 2000 80488:174504898 80488 174504898
## 5 China 1999 212258:1272915272 212258 1272915272
## 6 China 2000 213766:1280428583 213766 1280428583
Revisit table2
table2
## # A tibble: 12 x 4
## country year key value
## <fct> <int> <fct> <int>
## 1 Afghanistan 1999 cases 745
## 2 Afghanistan 1999 population 19987071
## 3 Afghanistan 2000 cases 2666
## 4 Afghanistan 2000 population 20595360
## 5 Brazil 1999 cases 37737
## 6 Brazil 1999 population 172006362
## 7 Brazil 2000 cases 80488
## 8 Brazil 2000 population 174504898
## 9 China 1999 cases 212258
## 10 China 1999 population 1272915272
## 11 China 2000 cases 213766
## 12 China 2000 population 1280428583
I want to flip this into a format where the data frame has two columns per country per year – cases
and population
. How can I do this?
sp.table2 <- table2 %>%
spread(key, value)
sp.table2
## # A tibble: 6 x 4
## country year cases population
## <fct> <int> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
table4
## # A tibble: 3 x 3
## country `1999` `2000`
## <fct> <int> <int>
## 1 Afghanistan 745 2666
## 2 Brazil 37737 80488
## 3 China 212258 213766
g.table4 <- table4 %>%
gather("year", "cases", 2:3)
head(g.table4)
## # A tibble: 6 x 3
## country year cases
## <fct> <chr> <int>
## 1 Afghanistan 1999 745
## 2 Brazil 1999 37737
## 3 China 1999 212258
## 4 Afghanistan 2000 2666
## 5 Brazil 2000 80488
## 6 China 2000 213766
nospread.df <- multkey.merge %>%
group_by(department_desc, term_code) %>%
summarize(frequency = n())
head(nospread.df, 10)
## # A tibble: 10 x 3
## # Groups: department_desc [4]
## department_desc term_code frequency
## <chr> <chr> <int>
## 1 Accountancy 20171C 1874
## 2 Accountancy 20181C 1740
## 3 Accountancy 20191C 1734
## 4 Aerospace Studies 20171C 69
## 5 Aerospace Studies 20181C 112
## 6 Aerospace Studies 20191C 96
## 7 African American Studies 20171C 313
## 8 African American Studies 20181C 238
## 9 African American Studies 20191C 333
## 10 Applied Health Sciences & Wellness 20171C 3922
Note: ‘term_code’ is now a column instead of rows.
spread.df <- multkey.merge %>%
group_by(department_desc, term_code) %>%
summarize(frequency = n()) %>%
spread(term_code, frequency)
head(spread.df, 10)
## # A tibble: 10 x 4
## # Groups: department_desc [10]
## department_desc `20171C` `20181C` `20191C`
## <chr> <int> <int> <int>
## 1 Accountancy 1874 1740 1734
## 2 Aerospace Studies 69 112 96
## 3 African American Studies 313 238 333
## 4 Applied Health Sciences & Wellness 3922 4525 4268
## 5 Art 2447 2969 2738
## 6 Aviation 293 275 315
## 7 Biological Sciences 6722 7947 7801
## 8 Center For Intl Studies 255 328 338
## 9 Center for Law, Justice and Culture NA NA 54
## 10 Chemical & Biomolecular Engineering 870 1189 1059
This package makes working with dates and times quite easy because it does what base R does but in more intuitive ways and perhaps more flexibly. Let us start with some date fields.
today1 = "20171217"
today2 = "2017-12-17"
today3 = "2017 December 17"
today4 = "20171217143241"
today5 = "2017 December 17 14:32:41"
today6 = "December 17 2017 14:32:41"
today7 = "17-Dec, 2017 14:32:41"
The formats are quite varied but lubridate
deals with them quite seamlessly so long as you pay attention to the order – is year first or last? What about month? day? Is time given in hours, minutes and seconds?.
library(lubridate)
ymd(today1)
## [1] "2017-12-17"
ymd(today2)
## [1] "2017-12-17"
ymd(today3)
## [1] "2017-12-17"
ymd_hms(today4)
## [1] "2017-12-17 14:32:41 UTC"
ymd_hms(today5)
## [1] "2017-12-17 14:32:41 UTC"
mdy_hms(today6)
## [1] "2017-12-17 14:32:41 UTC"
dmy_hms(today7)
## [1] "2017-12-17 14:32:41 UTC"
If I need to extract data/time elements, I can do that as well.
today = dmy_hms(today7); today # a date and time set in today7
## [1] "2017-12-17 14:32:41 UTC"
today.y = year(today); today.y # The year
## [1] 2017
today.m1 = month(today); today.m1 # the month, as a number
## [1] 12
today.m2 = month(today, label = TRUE, abbr = TRUE); today.m2 # labeling the month but with an abbreviation
## [1] Dec
## 12 Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < ... < Dec
today.m3 = month(today, label = TRUE, abbr = FALSE); today.m3# fully labelling the month
## [1] December
## 12 Levels: January < February < March < April < May < June < ... < December
today.w = week(today)
today.w # what week of the year is it?
## [1] 51
today.doy = yday(today)
today.doy # what day of the year is it?
## [1] 351
today.dom = mday(today)
today.dom # what day of the month is it?
## [1] 17
today.dow1 = wday(today)
today.dow1 # what day of the week is it, as a number?
## [1] 1
today.dow2 = wday(today, label = TRUE, abbr = TRUE)
today.dow2 # day of the week, abbreviated label
## [1] Sun
## Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat
today.dow3 = wday(today, label = TRUE, abbr = FALSE)
today.dow3 # day of the week fully labelled
## [1] Sunday
## 7 Levels: Sunday < Monday < Tuesday < Wednesday < Thursday < ... < Saturday
today.h = hour(today)
today.h # what hour is it?
## [1] 14
today.m = minute(today)
today.m # what minute is it?
## [1] 32
today.s = second(today)
today.s # what second is it?
## [1] 41
today.tz = tz(today)
today.tz # what time zone is it?
## [1] "UTC"
Let us start with a duration
, the simplest measure of lapsed time since it measures the passing of time in seconds. Say I pick two dates and times … 05:00am on March 9, 2019
and 05:00am on March 10, 2019
.
date1 <- ymd_hms("2019-03-09 05:00:00", tz = "US/Eastern")
date2 <- ymd_hms("2019-03-10 05:00:00", tz = "US/Eastern")
timint <- interval(date1, date2)
timint
## [1] 2019-03-09 05:00:00 EST--2019-03-10 05:00:00 EDT
How much time has lapsed between date1 and date2? We calculate with as.duration()
timelapsed.d <- as.duration(timint)
Why is it 23 hours and not 24 hours? Because daylight savings kicks in at 2:00 AM on March 10, and since the clocks kick forward by one hour, only 23 hours have passed, not 24 as we might naively expect. Go back and look at the output from timeint
; see the EST versus EDT? In contrast to duration, if I ask for the period of time
, what do I get?
timelapsed.p <- as.period(timint)
Aha! as.period()
is imprecise because it tells me 1 day has passed
Now, we can use the duration of time lapsed in whatever units we want, but if we want accuracy, we better work with durations, as shown below.
time_length(timelapsed.d, unit = "second")
## [1] 82800
time_length(timelapsed.d, unit = "minute")
## [1] 1380
time_length(timelapsed.d, unit = "hour")
## [1] 23
time_length(timelapsed.d, unit = "day")
## [1] 0.9583333
time_length(timelapsed.d, unit = "week")
## [1] 0.1369048
time_length(timelapsed.d, unit = "month")
## [1] 0.03148528
time_length(timelapsed.d, unit = "year")
## [1] 0.002623774
duration()
is always estimated in seconds but you can change the reporting unit