class: title-slide, center, middle background-image: url(images/theridges.jpeg) background-size: cover # .large[.fancy[Tidy Data]] ## .fancy[Ani Ruhil] --- class: inverse, middle, center <center><img src = "images/tidypkgs.png", width = 400px></center> --- ## Agenda - Overview of the `tidyverse` > "an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures." - `dplyr` > "dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges" - `tidyr` > "Tidy data describes a standard way of storing data that is used wherever possible throughout the tidyverse. If you ensure that your data is tidy, you’ll spend less time fighting with the tools and more time working on your analysis." --- class: inverse, middle, center <center><img src = "images/hex-dplyr.png", width = 200px></center> --- The core functions are: - `mutate()` ... to add new variables that are functions of existing variables - `transmute()` ... to add new variables and drop the rest of the variables - `select()` ... to pick a subset of variables to work with while dropping the rest - `filter()` ... to select a subset of observations to work with - `summarise()` ... to calculate quantities of interest (rather than add these quantities as new variables to the existing data frame) and save these as a standalone table - `arrange()` ... order the rows based on out preferences - `distinct()` ... to eliminate duplicate observations There are several other commands that belong to `dplyr` ... [here is the cheatsheet](https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf) Let us see `dplyr` in action with the xclass and xstudent data frames ```r load("data/xclass.RData") load("data/xstudent.RData") ``` --- class: inverse, middle, center <center><img src = "images/pipe.jpg", width = 400px></center> # `%>%` --- # The pipe operator The `%>%` operator is useful for `chaining a sequence of operations` into a single command ```r the_data <- read.csv('/path/to/data/file.csv') %>% subset(variable_a > x) %>% transform(variable_c = variable_a/variable_b) ``` This command means ... (1) First read the csv file `THEN` (2) subset the data to cases where variable_a values are greater than some value g `THEN` (3) create variable_c as the ratio of variable_a to variable_b The result is saved in `the_data` --- class: inverse, middle, center # .fat[.fancy[distinct()]] <img src = "images/triplets.gif" /img> See also [Garrett and Hadley](http://r4ds.had.co.nz) --- # Eliminating duplicates You may have duplicate cases in your data set. How do you eliminate them? Let us assume we only need one record per student in `xclass`<sup>1</sup> ```r library(tidyverse) xclass.df1 <- xclass %>% distinct(anon_id, .keep_all = TRUE) ``` .footnote[1: While these may well be legitimate records, for training purposes I am assuming each student should only show up once.] But a student could be enrolled on multiple campuses and have multiple majors ```r xclass.df2 <- xclass %>% distinct(anon_id, campus_name, term_code, department_desc, .keep_all = TRUE) ``` If more than one variable is specified in `distinct()` then `dplyr` looks for unique combinations of these variables. 
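A minimal toy sketch of this behaviour (the `toy` tibble and its columns are made up for illustration; they are not part of the course data):

```r
toy <- tibble(id    = c("A1", "A1", "B2"),
              major = c("Math", "Math", "Art"),
              gpa   = c(3.1, 3.4, 3.8))

# two rows survive -- one per unique id-major combination;
# with .keep_all = TRUE the gpa from the first matching row is retained
toy %>% distinct(id, major, .keep_all = TRUE)
```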
Each unique combination `\(=\)` a unique observation `.keep_all = TRUE` is asking that all columns be retained after eliminating duplicate observations --- class: inverse, middle, center #.fat[.fancy[Merging data]] <img src = "images/joins.png", width = 500px> --- Say I have data on some students, with test scores in one file and their gender in another file. Each file has a unique student identifier, called `ID`. How can I create a unified data-set? Let us see each data-set first. ```r data1 = tibble(Score = c(10, 21, 33, 12, 35, 67, 43, 99), ID = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "C31", "Q22") ) data2 = tibble(Sex = c("Male", "Female", "Male", "Female", "Male", "Female", "Male", "Female"), ID = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "E52", "H71") ) ``` Open up both data-sets and note that students C31 and Q22 are missing from `data2` and students E52 and H71 are missing from `data1`. --- ### ... here are the two data-sets side-by-side .pull-left[ ```{=html} <div id="htmlwidget-a74f2629c9702d0961ec" style="width:100%;height:auto;" class="datatables html-widget"></div> <script type="application/json" data-for="htmlwidget-a74f2629c9702d0961ec">{"x":{"filter":"none","data":[["1","2","3","4","5","6","7","8"],[10,21,33,12,35,67,43,99],["A12","A23","Z14","WX1","Y31","D66","C31","Q22"]],"container":"<table class=\"display\">\n <thead>\n <tr>\n <th> <\/th>\n <th>Score<\/th>\n <th>ID<\/th>\n <\/tr>\n <\/thead>\n<\/table>","options":{"columnDefs":[{"className":"dt-center","targets":[1,2]},{"orderable":false,"targets":0}],"filter":"top","rownames":false,"class":"compact","order":[],"autoWidth":false,"orderClasses":false}},"evals":[],"jsHooks":[]}</script> ``` ] .pull-right[ ```{=html} <div id="htmlwidget-d1f40ace4abc0833e865" style="width:100%;height:auto;" class="datatables html-widget"></div> <script type="application/json" data-for="htmlwidget-d1f40ace4abc0833e865">{"x":{"filter":"none","data":[["1","2","3","4","5","6","7","8"],["Male","Female","Male","Female","Male","Female","Male","Female"],["A12","A23","Z14","WX1","Y31","D66","E52","H71"]],"container":"<table class=\"display\">\n <thead>\n <tr>\n <th> <\/th>\n <th>Sex<\/th>\n <th>ID<\/th>\n <\/tr>\n <\/thead>\n<\/table>","options":{"columnDefs":[{"className":"dt-center","targets":[1,2]},{"orderable":false,"targets":0}],"filter":"top","rownames":false,"class":"compact","order":[],"autoWidth":false,"orderClasses":false}},"evals":[],"jsHooks":[]}</script> ``` ] --- ## Mutating joins These combine data frames but to do so we need `merge key(s)` -- variable(s) that identifies unique observations in each file. Next, you need to decide on what you want. `natural join` ... merge observations that show up in both files ```r natural <- inner_join(x = data1, y = data2, by = "ID") natural ``` ``` ## # A tibble: 6 x 3 ## Score ID Sex ## <dbl> <chr> <chr> ## 1 10 A12 Male ## 2 21 A23 Female ## 3 33 Z14 Male ## 4 12 WX1 Female ## 5 35 Y31 Male ## 6 67 D66 Female ``` --- class: center, middle <img src = "images/animated-inner-join.gif" /img> --- `full outer join` ... merge everything even if some cases show up in one but not the other ```r full <- full_join(x = data1, y = data2, by = "ID") full ``` ``` ## # A tibble: 10 x 3 ## Score ID Sex ## <dbl> <chr> <chr> ## 1 10 A12 Male ## 2 21 A23 Female ## 3 33 Z14 Male ## 4 12 WX1 Female ## 5 35 Y31 Male ## 6 67 D66 Female ## 7 43 C31 <NA> ## 8 99 Q22 <NA> ## 9 NA E52 Male ## 10 NA H71 Female ``` --- class: center, middle <img src = "images/animated-full-join.gif" /img> --- `left outer join` ... 
merge such that all cases in file `x` are retained but only those seen in file `y` are joined ```r left <- left_join(x = data1, y = data2, by = "ID") left ``` ``` ## # A tibble: 8 x 3 ## Score ID Sex ## <dbl> <chr> <chr> ## 1 10 A12 Male ## 2 21 A23 Female ## 3 33 Z14 Male ## 4 12 WX1 Female ## 5 35 Y31 Male ## 6 67 D66 Female ## 7 43 C31 <NA> ## 8 99 Q22 <NA> ``` --- class: center, middle <img src = "images/animated-left-join.gif" /img> --- `right outer join` ... merge to keep all cases in `y` and only those cases are to be merged from `x` that show up in `y` ```r right <- right_join(x = data1, y = data2, by = "ID") right ``` ``` ## # A tibble: 8 x 3 ## Score ID Sex ## <dbl> <chr> <chr> ## 1 10 A12 Male ## 2 21 A23 Female ## 3 33 Z14 Male ## 4 12 WX1 Female ## 5 35 Y31 Male ## 6 67 D66 Female ## 7 NA E52 Male ## 8 NA H71 Female ``` --- class: center, middle <img src = "images/animated-right-join.gif" /img> --- What if the ID variables had different names in the files? You could rename them to have a common name or you could use `by.x =` and `by.y = ` as shown below. ```r data3 <- tibble(Score = c(10, 21, 33, 12, 35, 67, 43, 99), ID1 = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "C31", "Q22") ) data4 <- tibble(Sex = c("Male", "Female", "Male", "Female", "Male", "Female", "Male", "Female"), ID2 = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "E52", "H71") ) diff.full <- full_join(data3, data4, by = c("ID1" = "ID2")) ``` --- You can also have more than one merge key. For example, if I am merging data for Ohio schools, each district has a district ID number `(dirn)`, each school has building ID number `(birn)`, and then the district's name `(district)`, the building's name `(building)`, and so on. If I am merging these data then my `by = ` statement will be `by = c("dirn", "birn", "district", "building")`. Let us work through one example. Load the `sm_xstudent` data frame and make sure the `xclass` data frame is loaded as well. Look at `sm_xstudent` ... it has six variables, three of which are also in xclass -- `term_code`, `anon_id`, and `sex_code`. The other three variables are not present in `xclass` -- `ethnicity_desc`, `residency_code`, and `university_ftpt`. We will use the three common variables (the `keys`) to merge `sm_xstudent` and `xclass` ```r load("data/sm_xstudent.RData") multkey.merge <- full_join( sm_xstudent, xclass, by = c("term_code" = "term_code", "anon_id" = "anon_id", "sex_code" = "sex_code") ) ``` Open `multkey.merge` and see the merged fields --- If we have multiple data-frames to merge we can do so in several ways but my default approach is to rely on the `Reduce()` command. `Note: This is base R code` ```r df1 <- tibble(Score = c(10, 21, 33, 12, 35, 67, 43, 99), ID = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "C31", "Q22") ) df2 <- tibble(Sex = c("Male", "Female", "Male", "Female", "Male", "Female", "Male", "Female"), ID = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "E52", "H71") ) df3 <- tibble(Age = c(6, 7, 6, 8, 8, 9, 10, 5), ID = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "E52", "H71") ) my.list <- list(df1, df2, df3) df.123 <- Reduce(function(...) merge(..., by = c("ID"), all = TRUE), my.list) ``` --- Here is the merged data-frame ... 
<table class="table table-striped" style="margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> ID </th> <th style="text-align:right;"> Score </th> <th style="text-align:left;"> Sex </th> <th style="text-align:right;"> Age </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> A12 </td> <td style="text-align:right;"> 10 </td> <td style="text-align:left;"> Male </td> <td style="text-align:right;"> 6 </td> </tr> <tr> <td style="text-align:left;"> A23 </td> <td style="text-align:right;"> 21 </td> <td style="text-align:left;"> Female </td> <td style="text-align:right;"> 7 </td> </tr> <tr> <td style="text-align:left;"> C31 </td> <td style="text-align:right;"> 43 </td> <td style="text-align:left;"> NA </td> <td style="text-align:right;"> NA </td> </tr> <tr> <td style="text-align:left;"> D66 </td> <td style="text-align:right;"> 67 </td> <td style="text-align:left;"> Female </td> <td style="text-align:right;"> 9 </td> </tr> <tr> <td style="text-align:left;"> E52 </td> <td style="text-align:right;"> NA </td> <td style="text-align:left;"> Male </td> <td style="text-align:right;"> 10 </td> </tr> <tr> <td style="text-align:left;"> H71 </td> <td style="text-align:right;"> NA </td> <td style="text-align:left;"> Female </td> <td style="text-align:right;"> 5 </td> </tr> <tr> <td style="text-align:left;"> Q22 </td> <td style="text-align:right;"> 99 </td> <td style="text-align:left;"> NA </td> <td style="text-align:right;"> NA </td> </tr> <tr> <td style="text-align:left;"> WX1 </td> <td style="text-align:right;"> 12 </td> <td style="text-align:left;"> Female </td> <td style="text-align:right;"> 8 </td> </tr> <tr> <td style="text-align:left;"> Y31 </td> <td style="text-align:right;"> 35 </td> <td style="text-align:left;"> Male </td> <td style="text-align:right;"> 8 </td> </tr> <tr> <td style="text-align:left;"> Z14 </td> <td style="text-align:right;"> 33 </td> <td style="text-align:left;"> Male </td> <td style="text-align:right;"> 6 </td> </tr> </tbody> </table> --- ## Filtering joins These only retain columns in the `x` data frame .pull-left[ `semi_join()` ... will keep only those rows found in `x` and `y` but keep only the columns in `x` ```r semi <- semi_join(data1, data2, by = "ID") semi ``` ``` ## # A tibble: 6 x 2 ## Score ID ## <dbl> <chr> ## 1 10 A12 ## 2 21 A23 ## 3 33 Z14 ## 4 12 WX1 ## 5 35 Y31 ## 6 67 D66 ``` ] .pull-right[ `anti_join()` ... will keep only those rows [not found]() in `x` and `y` but keep only the columns in `x` ```r anti <- anti_join(data1, data2, by = "ID") anti ``` ``` ## # A tibble: 2 x 2 ## Score ID ## <dbl> <chr> ## 1 43 C31 ## 2 99 Q22 ``` ] --- class: center, middle .pull-left[ <img src = "images/animated-semi-join.gif" /img> ] .pull-right[ <img src = "images/animated-anti-join.gif" /img> ] --- ## The oft-forgotten `rbind()` What if we have to stack data frames atop one another, for example, because one data frame is from one term and the second from another term? No problem. Load the `rb1` (where `term_code == 20171C`), `rb2` (where `term_code == 20181C`), and `rb3` (where `term_code == 20191C`) data frames. 
```r
load("data/rb1.RData")
load("data/rb2.RData")
load("data/rb3.RData")

my.dfs <- rbind(rb1, rb2, rb3)
```

---
name: section
class: top, inverse

---
name: yourturn
template: section

.left-column[
# <i class = "fas fa-edit"></i><br>.fancy[Your turn]
]

---
name: yourturn1
template: yourturn

.right-column[ .large[(1) Load `xclass` and `sm2_xstudent`

(2) Conduct a full_join, paying attention to the merge keys that should be used

(3) Conduct a semi_join with `xclass` as the `x` data frame]
]

.left-column[
## .fancy[.saltinline[10:00 minutes]]
]

```{=html}
<div class="countdown blink-colon noupdate-15" id="timer_5fde929a" style="right:33%;bottom:0;" data-audio="true" data-warnwhen="0">
<code class="countdown-time"><span class="countdown-digits minutes">10</span><span class="countdown-digits colon">:</span><span class="countdown-digits seconds">00</span></code>
</div>
```

---
class: inverse, middle, center

# .fat[.fancy[filter()]]

<img src = "https://cdn.theatlantic.com/assets/media/img/posts/2019/04/Kopitiam_Kuala_Lumpur_2-2/1de5fed84.png" width = 600px /img>

---

Say I want to select only the online students in `multkey.merge`

```r
df.1 <- multkey.merge %>%
  filter(online.f == "Online")
```

What if I wanted online students in Athens?

```r
df.2 <- multkey.merge %>%
  filter(online.f == "Online", campus_name == "Athens")
```

What if I only want students in selected colleges?

```r
df.3 <- multkey.merge %>%
  filter(college_desc %in% c("Education", "Fine Arts"))
```

What if I only want students in all other colleges except Fine Arts and Education?

```r
df.4 <- multkey.merge %>%
  filter(!college_desc %in% c("Education", "Fine Arts"))
```

Note the use of `!` to create `not in`. Similarly, `==` means "equal to", `!=` means "not equal to", `|` means "or", and `&` means "and"

---

What if we have complicated filters involving multiple variables? No problem ...
```r df.5 <- multkey.merge %>% filter(!college_desc %in% c("Education", "Fine Arts"), sex_code == "F", enrlld_crdt_hrs > 4 & enrlld_crdt_hrs < 8) head(df.5) ``` ``` ## # A tibble: 6 x 26 ## term_code anon_id ethnicity_desc sex_code residency_code university_ftpt class_nbr ## <chr> <chr> <chr> <chr> <chr> <chr> <chr> ## 1 20171C ANON10… White F R Full Time 08287 ## 2 20171C ANON10… White F R Full Time 08285 ## 3 20171C ANON10… White F R Part Time 08024 ## 4 20171C ANON10… White F R Part Time 08935 ## 5 20171C ANON10… White F R Full Time 09761 ## 6 20171C ANON10… White F R Part Time 09339 ## # … with 19 more variables: campus_name <chr>, campus_dtl_desc <chr>, college_desc <chr>, ## # department_desc <chr>, subject <chr>, catalog_nbr <chr>, class_desc <chr>, ## # delivery_desc <chr>, online_flag <dbl>, rank_group <chr>, rank_desc <chr>, ## # acad_plan <chr>, student_college_desc <chr>, acad_plan_desc <chr>, ## # field_of_study_desc <chr>, residency_desc <chr>, enrolled_flag <dbl>, ## # enrlld_crdt_hrs <dbl>, online.f <fct> ``` --- name: yourturn2 template: yourturn .right-column[ .large[Use `filter()` to select cases that meet ALL of the the following criteria (1) term_code must be 20191C (2) student is not on the Athens camps (3) student is a senior Make sure you save the resulting data frame with some unique name] ] .left-column[ ## .fancy[.saltinline[05:00 minutes]] ] ```{=html} <div class="countdown blink-colon noupdate-15" id="timer_5fde94b1" style="right:33%;bottom:0;" data-audio="true" data-warnwhen="0"> <code class="countdown-time"><span class="countdown-digits minutes">05</span><span class="countdown-digits colon">:</span><span class="countdown-digits seconds">00</span></code> </div> ``` -- .right-column[ ```r my.df <- multkey.merge %>% filter(term_code == "20191C", campus_name != "Athens", rank_desc == "Senior") ``` ] --- class: inverse, middle, center # .fat[.fancy[select()]] <img src = "https://media.giphy.com/media/wHnw9knwgl13G/giphy.gif" width = 300px </img> --- Often you have more columns that you need for some operations, and so you trim the data by retaining (or discarding) specific columns. When you have to do this, it helps to know the column names\\numbers ```r names(multkey.merge) ``` ``` ## [1] "term_code" "anon_id" "ethnicity_desc" ## [4] "sex_code" "residency_code" "university_ftpt" ## [7] "class_nbr" "campus_name" "campus_dtl_desc" ## [10] "college_desc" "department_desc" "subject" ## [13] "catalog_nbr" "class_desc" "delivery_desc" ## [16] "online_flag" "rank_group" "rank_desc" ## [19] "acad_plan" "student_college_desc" "acad_plan_desc" ## [22] "field_of_study_desc" "residency_desc" "enrolled_flag" ## [25] "enrlld_crdt_hrs" "online.f" ``` Say I want the first 4 columns ```r select.1 <- multkey.merge %>% select(c(1:4)) names(select.1) ``` ``` ## [1] "term_code" "anon_id" "ethnicity_desc" "sex_code" ``` --- What if my columns are not sequential? No worries ... ```r select.2 <- multkey.merge %>% select(c(1, 2, 6, 9:11)) names(select.2) ``` ``` ## [1] "term_code" "anon_id" "university_ftpt" "campus_dtl_desc" ## [5] "college_desc" "department_desc" ``` Could I have called for the desired columns by name? Sure ... ```r select.3 <- multkey.merge %>% select(c(term_code, class_nbr, rank_desc, acad_plan)) names(select.3) ``` ``` ## [1] "term_code" "class_nbr" "rank_desc" "acad_plan" ``` --- Could I call out desired variables by a specific string in the column name? 
```r select.4 <- multkey.merge %>% select(contains("desc")) names(select.4) ``` ``` ## [1] "ethnicity_desc" "campus_dtl_desc" "college_desc" ## [4] "department_desc" "class_desc" "delivery_desc" ## [7] "rank_desc" "student_college_desc" "acad_plan_desc" ## [10] "field_of_study_desc" "residency_desc" ``` What if it begins with or ends with some string? .pull-left[ ```r select.5 <- multkey.merge %>% select(ends_with("desc")) names(select.5) ``` ``` ## [1] "ethnicity_desc" "campus_dtl_desc" "college_desc" ## [4] "department_desc" "class_desc" "delivery_desc" ## [7] "rank_desc" "student_college_desc" "acad_plan_desc" ## [10] "field_of_study_desc" "residency_desc" ``` ] .pull-right[ ```r select.6 <- multkey.merge %>% select(starts_with("acad")) names(select.6) ``` ``` ## [1] "acad_plan" "acad_plan_desc" ``` ] This is just the tip of the iceberg ... look at the dplyr documentation and online resources for other variants, including the scoped variants of `select_if()`, `select_at()`, and `select_all()` --- name: yourturn3 template: yourturn .large[Use `select()` to retain the following cases, respectively (1) Columns 3, 4, 8, and 22 (2) Columns that end with the string "flag" (3) Columns that contain the string "plan"] .left-column[ ## .fancy[.saltinline[10:00 minutes]] ] ```{=html} <div class="countdown blink-colon noupdate-15" id="timer_5fde94c2" style="right:33%;bottom:0;" data-audio="true" data-warnwhen="0"> <code class="countdown-time"><span class="countdown-digits minutes">10</span><span class="countdown-digits colon">:</span><span class="countdown-digits seconds">00</span></code> </div> ``` --- (1) Columns 3, 4, 8, and 22 ```r multkey.merge %>% select(c(3:4, 8, 22)) ``` ``` ## # A tibble: 488,560 x 4 ## ethnicity_desc sex_code campus_name field_of_study_desc ## <chr> <chr> <chr> <chr> ## 1 Two or More Races F Athens Counselor Education ## 2 White F Athens Art ## 3 White F Athens Art ## 4 Asian M Athens International Affairs ## 5 Asian M Athens International Affairs ## 6 Asian M Athens International Affairs ## 7 Asian M Athens International Affairs ## 8 White M Athens Management Systems ## 9 White M Athens Management Systems ## 10 Race/Ethnicity Unknown F Athens Undecided Education ## # … with 488,550 more rows ``` --- (2) Columns that end with the string "flag" ```r multkey.merge %>% select(ends_with("flag")) ``` ``` ## # A tibble: 488,560 x 2 ## online_flag enrolled_flag ## <dbl> <dbl> ## 1 0 1 ## 2 0 1 ## 3 0 1 ## 4 0 1 ## 5 0 1 ## 6 0 1 ## 7 0 1 ## 8 1 1 ## 9 1 1 ## 10 0 1 ## # … with 488,550 more rows ``` --- (3) Columns that contain the string "plan" ```r multkey.merge %>% select(contains("plan")) ``` ``` ## # A tibble: 488,560 x 2 ## acad_plan acad_plan_desc ## <chr> <chr> ## 1 ND6274 School Counseling ## 2 MF5052 Painting and Drawing ## 3 MF5052 Painting and Drawing ## 4 MA4216 Asian Studies ## 5 MA4216 Asian Studies ## 6 MA4216 Asian Studies ## 7 MA4216 Asian Studies ## 8 MB6157 MBA Business Analytics ## 9 MB6157 MBA Business Analytics ## 10 ND9915 EHS-Nondegree ## # … with 488,550 more rows ``` --- class: inverse, middle, center # .fat[.fancy[summarise() and mutate()]] --- Say I want to calculate some quantities of interest --- frequencies perhaps, or means, standard deviations, etc. Let us start with mean, medians, etc. 
```r tab1 <- multkey.merge %>% summarise( Mean.cr = mean(enrlld_crdt_hrs, na.rm = TRUE), Median.cr = median(enrlld_crdt_hrs, na.rm = TRUE), StdDev.cr = sd(enrlld_crdt_hrs, na.rm = TRUE), Var.cr = var(enrlld_crdt_hrs, na.rm = TRUE), Q1.cr = quantile(enrlld_crdt_hrs, p = c(0.25), na.rm = TRUE), Q3.cr = quantile(enrlld_crdt_hrs, p = c(0.75), na.rm = TRUE) ) tab1 ``` ``` ## # A tibble: 1 x 6 ## Mean.cr Median.cr StdDev.cr Var.cr Q1.cr Q3.cr ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 2.80 3 1.37 1.88 3 3 ``` Note that `tab1` is a table data frame with only what we asked for --- What if we wanted `frequencies`? Assume we want to know enrollment by college. .pull-left[ ```r tab2 <- multkey.merge %>% group_by(college_desc) %>% summarise(frequency = n()) %>% arrange(frequency) tab2 ``` ``` ## # A tibble: 14 x 2 ## college_desc frequency ## <chr> <int> ## 1 Miscellaneous 308 ## 2 Honors Tutorial 481 ## 3 International Studies 921 ## 4 George Voinovich School 1317 ## 5 Osteopathic Medicine 12170 ## 6 University College 17542 ## 7 Regional Higher Ed 19043 ## 8 Engineering & Technology 27094 ## 9 Communication 33312 ## 10 Fine Arts 34296 ## 11 Education 37544 ## 12 Business 53606 ## 13 Health Sciences & Professions 77085 ## 14 Arts & Sciences 173841 ``` ] .pull-right[ ```r tab3 <- multkey.merge %>% group_by(college_desc) %>% summarise(frequency = n()) %>% arrange(-frequency) tab3 ``` ``` ## # A tibble: 14 x 2 ## college_desc frequency ## <chr> <int> ## 1 Arts & Sciences 173841 ## 2 Health Sciences & Professions 77085 ## 3 Business 53606 ## 4 Education 37544 ## 5 Fine Arts 34296 ## 6 Communication 33312 ## 7 Engineering & Technology 27094 ## 8 Regional Higher Ed 19043 ## 9 University College 17542 ## 10 Osteopathic Medicine 12170 ## 11 George Voinovich School 1317 ## 12 International Studies 921 ## 13 Honors Tutorial 481 ## 14 Miscellaneous 308 ``` ] --- Frequencies are well and good but we may want `percentages` .pull-left[ ```r tab4 <- multkey.merge %>% group_by(college_desc) %>% summarise(frequency = n()) %>% mutate(percent = (frequency / sum(frequency)) * 100, percent = round(percent, digits = 2)) %>% arrange(percent) tab4 ``` ``` ## # A tibble: 14 x 3 ## college_desc frequency percent ## <chr> <int> <dbl> ## 1 Miscellaneous 308 0.06 ## 2 Honors Tutorial 481 0.1 ## 3 International Studies 921 0.19 ## 4 George Voinovich School 1317 0.27 ## 5 Osteopathic Medicine 12170 2.49 ## 6 University College 17542 3.59 ## 7 Regional Higher Ed 19043 3.9 ## 8 Engineering & Technology 27094 5.55 ## 9 Communication 33312 6.82 ## 10 Fine Arts 34296 7.02 ## 11 Education 37544 7.68 ## 12 Business 53606 11.0 ## 13 Health Sciences & Professions 77085 15.8 ## 14 Arts & Sciences 173841 35.6 ``` ] .pull-right[ ```r tab5 <- multkey.merge %>% group_by(college_desc) %>% summarise(frequency = n()) %>% mutate(percent = (frequency / sum(frequency)) * 100, percent = round(percent, digits = 2)) %>% arrange(-percent) tab5 ``` ``` ## # A tibble: 14 x 3 ## college_desc frequency percent ## <chr> <int> <dbl> ## 1 Arts & Sciences 173841 35.6 ## 2 Health Sciences & Professions 77085 15.8 ## 3 Business 53606 11.0 ## 4 Education 37544 7.68 ## 5 Fine Arts 34296 7.02 ## 6 Communication 33312 6.82 ## 7 Engineering & Technology 27094 5.55 ## 8 Regional Higher Ed 19043 3.9 ## 9 University College 17542 3.59 ## 10 Osteopathic Medicine 12170 2.49 ## 11 George Voinovich School 1317 0.27 ## 12 International Studies 921 0.19 ## 13 Honors Tutorial 481 0.1 ## 14 Miscellaneous 308 0.06 ``` ] --- Perhaps I want frequencies and percentages by `sex_code` 
```r tab6 <- multkey.merge %>% group_by(college_desc, sex_code) %>% summarise(frequency = n()) %>% mutate(percent = (frequency / sum(frequency)) * 100, percent = round(percent, digits = 2)) %>% arrange(college_desc, sex_code, -percent) tab6 ``` ``` ## # A tibble: 31 x 4 ## # Groups: college_desc [14] ## college_desc sex_code frequency percent ## <chr> <chr> <int> <dbl> ## 1 Arts & Sciences F 98267 56.5 ## 2 Arts & Sciences M 75574 43.5 ## 3 Business F 19875 37.1 ## 4 Business M 33731 62.9 ## 5 Communication F 18844 56.6 ## 6 Communication M 14468 43.4 ## 7 Education F 25939 69.1 ## 8 Education M 11605 30.9 ## 9 Engineering & Technology F 4542 16.8 ## 10 Engineering & Technology M 22552 83.2 ## # … with 21 more rows ``` --- Or by `sex_code` and `rank_desc` ```r tab7 <- multkey.merge %>% group_by(college_desc, sex_code, rank_desc) %>% summarise(frequency = n()) %>% mutate(percent = (frequency / sum(frequency)) * 100, percent = round(percent, digits = 1)) %>% arrange(college_desc, sex_code, rank_desc, -percent) tab7 ``` ``` ## # A tibble: 178 x 5 ## # Groups: college_desc, sex_code [31] ## college_desc sex_code rank_desc frequency percent ## <chr> <chr> <chr> <int> <dbl> ## 1 Arts & Sciences F Doctoral Student 1178 1.2 ## 2 Arts & Sciences F Freshman 33519 34.1 ## 3 Arts & Sciences F Junior 16100 16.4 ## 4 Arts & Sciences F Master's Student 3388 3.4 ## 5 Arts & Sciences F Non-Degree Undergraduate 4068 4.1 ## 6 Arts & Sciences F Senior 17321 17.6 ## 7 Arts & Sciences F Sophomore 22693 23.1 ## 8 Arts & Sciences M Doctoral Student 1604 2.1 ## 9 Arts & Sciences M Freshman 26114 34.6 ## 10 Arts & Sciences M Junior 12320 16.3 ## # … with 168 more rows ``` --- class: middle, inverse, center #.fancy[.fat[case_when()]] --- Often you will need to collapse some categories or otherwise modify some values. For example, perhaps I want `sex_code` to be fully spelled out instead of `F` and `M`. How might I do that? .pull-left[ ```r multkey.merge <- multkey.merge %>% mutate( sex.f = case_when( sex_code == "F" ~ "Female", sex_code == "M" ~ "Male" ) ) table(multkey.merge$sex.f) ``` ``` ## ## Female Male ## 277650 210907 ``` ] .pull-right[ ```r multkey.merge <- multkey.merge %>% mutate( sex.f = case_when( sex_code == "F" ~ "Female", TRUE ~ "Male" ) ) table(multkey.merge$sex.f) ``` ``` ## ## Female Male ## 277650 210910 ``` ] --- I can also take a numeric variable and flip it into a categorical variable with `case_when()`. 
For example, say I want to split `enrlld_crdt_hrs` into three groups: those with fewer than 4 hours, those with 4 to 15 hours, and those with more than 15 hours

.pull-left[

```r
multkey.merge <- multkey.merge %>%
  mutate(
    enrolled.f = case_when(
      enrlld_crdt_hrs < 4 ~ "<= 3 hours",
      enrlld_crdt_hrs >= 4 & enrlld_crdt_hrs <= 15 ~ "4 - 15 hours",
      enrlld_crdt_hrs > 15 ~ "> 15 hours"
      )
    )

table(multkey.merge$enrolled.f)
```

```
## 
##   <= 3 hours   > 15 hours 4 - 15 hours 
##       404150           54        84356
```
]

.pull-right[

```r
multkey.merge <- multkey.merge %>%
  mutate(
    enrolled.f = case_when(
      enrlld_crdt_hrs < 4 ~ "<= 3 hours",
      enrlld_crdt_hrs >= 4 & enrlld_crdt_hrs <= 15 ~ "4 - 15 hours",
      TRUE ~ "> 15 hours"
      )
    )

table(multkey.merge$enrolled.f)
```

```
## 
##   <= 3 hours   > 15 hours 4 - 15 hours 
##       404150           54        84356
```
]

---
class: inverse, middle, center

# .fat[.fancy[Dealing with strings]]

<img src = "https://media.giphy.com/media/WuaNt9aTcszyU/giphy.gif" width = 250px </img>

---

## stringr()

Sometimes we may need to eliminate some strings in specific columns, perhaps change their case, or replace a string with another string. All this and more can be done with either of two packages: `stringr` or `stringi`. Let us see `stringr` in action for a few commonly encountered tasks.

For example, `rank_desc` has the word "Student" in the labels. Let us assume we want to get rid of these. Here goes ...

```r
multkey.merge <- multkey.merge %>%
  mutate(rank = stringr::str_replace(rank_desc, " Student", ""))
```

What about changing the `case`?

```r
multkey.merge <- multkey.merge %>%
  mutate(college = stringr::str_to_upper(college_desc))
```

Other case options: `str_to_title`, `str_to_lower`, `str_to_sentence`

`stringr` is very versatile; see here for [the authors' walkthrough](https://r4ds.had.co.nz/strings.html)

---
class: inverse, middle, center

# .fat[.fancy[Grouped operations via group_by()]]

<img src = "https://media.giphy.com/media/gfhfmIsWODmkE/giphy.gif" width = 250px </img>

---

You will, more often than not, want to calculate quantities of interest for specific groups -- by campus, by term, by college/department/course, and so on. Well, the `group_by()` switch allows you to specify the grouping structure, while `ungroup()` will remove the constraints imposed by the grouping structure.

For example, assume I want to calculate the number of students in each college on each campus

```r
multkey.merge %>%
  group_by(college_desc, campus_name) %>%
  summarise(students = n_distinct(anon_id)) %>%
  arrange(-students)
```

```
## # A tibble: 56 x 3
## # Groups:   college_desc [14]
##    college_desc                  campus_name students
##    <chr>                         <chr>          <int>
##  1 Arts & Sciences               Athens         47145
##  2 Health Sciences & Professions Athens         30119
##  3 Business                      Athens         19247
##  4 Communication                 Athens         14332
##  5 Fine Arts                     Athens         14272
##  6 Education                     Athens         13854
##  7 University College            Athens         12891
##  8 Engineering & Technology      Athens          8263
##  9 Arts & Sciences               Lancaster       4590
## 10 Arts & Sciences               Chillicothe     4021
## # … with 46 more rows
```

`n_distinct()` is picking up unique values of `anon_id` even though the data frame has multiple rows for some students

---

What about male versus female students?
```r multkey.merge %>% group_by(college_desc, campus_name, sex_code) %>% summarise(students = n_distinct(anon_id)) %>% arrange(college_desc, campus_name, -students) ``` ``` ## # A tibble: 115 x 4 ## # Groups: college_desc, campus_name [56] ## college_desc campus_name sex_code students ## <chr> <chr> <chr> <int> ## 1 Arts & Sciences Athens F 25515 ## 2 Arts & Sciences Athens M 21630 ## 3 Arts & Sciences Chillicothe F 2559 ## 4 Arts & Sciences Chillicothe M 1462 ## 5 Arts & Sciences Eastern F 1318 ## 6 Arts & Sciences Eastern M 847 ## 7 Arts & Sciences Lancaster F 2755 ## 8 Arts & Sciences Lancaster M 1835 ## 9 Arts & Sciences Southern F 2032 ## 10 Arts & Sciences Southern M 1012 ## # … with 105 more rows ``` --- What about male versus female students by ethnicity? ```r multkey.merge %>% group_by(college_desc, campus_name, sex_code, ethnicity_desc) %>% summarise(students = n_distinct(anon_id)) %>% arrange(college_desc, campus_name, ethnicity_desc, sex_code) ``` ``` ## # A tibble: 820 x 5 ## # Groups: college_desc, campus_name, sex_code [115] ## college_desc campus_name sex_code ethnicity_desc students ## <chr> <chr> <chr> <chr> <int> ## 1 Arts & Sciences Athens F American Indian/Alaska Native 43 ## 2 Arts & Sciences Athens M American Indian/Alaska Native 34 ## 3 Arts & Sciences Athens F Asian 310 ## 4 Arts & Sciences Athens M Asian 215 ## 5 Arts & Sciences Athens F Black or African American 1488 ## 6 Arts & Sciences Athens M Black or African American 1087 ## 7 Arts & Sciences Athens F Hispanic or Latino 846 ## 8 Arts & Sciences Athens M Hispanic or Latino 611 ## 9 Arts & Sciences Athens F Native Hawaiian/Other Pacific Islander 11 ## 10 Arts & Sciences Athens M Native Hawaiian/Other Pacific Islander 17 ## # … with 810 more rows ``` --- Maybe I want the number enrolled in a give course. To do this, I will first combine `subject` and `catalog_nbr` ```r multkey.merge %>% mutate(course = stringr::str_c(subject, catalog_nbr, sep = " ")) %>% group_by(term_code, campus_name, course) %>% summarise(students = n_distinct(anon_id)) %>% arrange(-students) ``` ``` ## # A tibble: 12,666 x 4 ## # Groups: term_code, campus_name [18] ## term_code campus_name course students ## <chr> <chr> <chr> <int> ## 1 20181C Athens UC 1900 3107 ## 2 20191C Athens UC 1900 3102 ## 3 20171C Athens UC 1900 2811 ## 4 20171C Athens ENG 1510 1547 ## 5 20171C Athens NRSE 4570 1467 ## 6 20181C Athens ENG 1510 1428 ## 7 20171C Athens NRSE 4520 1426 ## 8 20191C Athens NRSE 4550 1414 ## 9 20171C Athens PSY 1010 1404 ## 10 20181C Athens NRSE 4530 1334 ## # … with 12,656 more rows ``` --- name: yourturn4 template: yourturn .right-column[ .large[ (1) Remove the phrase "Main Campus" from `campus_dtl_desc` (2) Add `class_nbr` to course, making sure the separator is " - " (3) Save `multkey.merge` with the RData extension] ] .left-column[ .fat[.fancy[10:00 minutes]] ] --- ## .fancy[rename()] You may need to rename some existing columns. 
An easy way to do this is as follows: ```r multkey.merge %>% rename("Ethnicity" = ethnicity_desc, "Sex" = sex_code, "Department" = department_desc) %>% names() ``` ``` ## [1] "term_code" "anon_id" "Ethnicity" ## [4] "Sex" "residency_code" "university_ftpt" ## [7] "class_nbr" "campus_name" "campus_dtl_desc" ## [10] "college_desc" "Department" "subject" ## [13] "catalog_nbr" "class_desc" "delivery_desc" ## [16] "online_flag" "rank_group" "rank_desc" ## [19] "acad_plan" "student_college_desc" "acad_plan_desc" ## [22] "field_of_study_desc" "residency_desc" "enrolled_flag" ## [25] "enrlld_crdt_hrs" "online.f" "sex.f" ## [28] "enrolled.f" "rank" "college" ``` --- ## .fancy[top_n()] After running some operations that generate some quantities of interest we may want to extract or merely peek at the `n` observations with the highest values or `n` observatins with the lowest values. This can be done as follows: .pull-left[ ```r multkey.merge %>% group_by(department_desc) %>% summarize(frequency = n()) %>% arrange(-frequency) %>% top_n(10, frequency) ``` ``` ## # A tibble: 10 x 2 ## department_desc frequency ## <chr> <int> ## 1 Nursing 36146 ## 2 Management Systems 27078 ## 3 Biological Sciences 22470 ## 4 Mathematics 20716 ## 5 Vice President Regional Higher Education 19043 ## 6 Psychology 18791 ## 7 English 17148 ## 8 Communication Studies 17008 ## 9 Teacher Education 16464 ## 10 Dean University College 16447 ``` ] .pull-right[ ```r multkey.merge %>% group_by(department_desc) %>% summarize(frequency = n()) %>% arrange(-frequency) %>% top_n(-10, frequency) ``` ``` ## # A tibble: 10 x 2 ## department_desc frequency ## <chr> <int> ## 1 Center For Intl Studies 921 ## 2 African American Studies 884 ## 3 Aviation 883 ## 4 Military Science 818 ## 5 Ohio Program of Intensive English 505 ## 6 Honors College 481 ## 7 Misc 308 ## 8 Aerospace Studies 277 ## 9 Education Dean 69 ## 10 Center for Law, Justice and Culture 54 ``` ] --- class: inverse, middle, center # .fat[.fancy[tidy data]] <center><img src = "images/hex-tidyr.png", width = 200px></center> --- Most packages you use in R call for data to be `tidy`. [What are tidy data](http://vita.had.co.nz/papers/tidy-data.pdf)? - Each variable forms a column. - Each observation forms a row. - Each type of observational unit forms a table. 
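---

The third principle is the one most often overlooked: if, say, students and their course enrollments are different observational units, they belong in separate tables that share a key and are combined with joins (like the ones covered earlier) only when an analysis needs both. A minimal sketch, with made-up values:

```r
library(tidyverse)

# one table per observational unit, linked by a common key (anon_id)
students    <- tibble(anon_id  = c("S1", "S2"),
                      sex_code = c("F", "M"))

enrollments <- tibble(anon_id = c("S1", "S1", "S2"),
                      course  = c("ENG 1510", "PSY 1010", "UC 1900"),
                      credits = c(3, 3, 1))

# combine the two units only when an analysis needs both
left_join(enrollments, students, by = "anon_id")
```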
Here are some untidy data --- (a) The Pew survey data ```r pew <- read.delim( file = "http://stat405.had.co.nz/data/pew.txt", header = TRUE, stringsAsFactors = FALSE, check.names = F ) head(pew) ``` ``` ## religion <$10k $10-20k $20-30k $30-40k $40-50k $50-75k $75-100k $100-150k ## 1 Agnostic 27 34 60 81 76 137 122 109 ## 2 Atheist 12 27 37 52 35 70 73 59 ## 3 Buddhist 27 21 30 34 33 58 62 39 ## 4 Catholic 418 617 732 670 638 1116 949 792 ## 5 Don’t know/refused 15 14 15 11 10 35 21 17 ## 6 Evangelical Prot 575 869 1064 982 881 1486 949 723 ## >150k Don't know/refused ## 1 84 96 ## 2 74 76 ## 3 53 54 ## 4 633 1489 ## 5 18 116 ## 6 414 1529 ``` --- (b) Here are some more examples .pull-left[ ```r library(DSR) table2 ``` ``` ## # A tibble: 12 x 4 ## country year key value ## <fct> <int> <fct> <int> ## 1 Afghanistan 1999 cases 745 ## 2 Afghanistan 1999 population 19987071 ## 3 Afghanistan 2000 cases 2666 ## 4 Afghanistan 2000 population 20595360 ## 5 Brazil 1999 cases 37737 ## 6 Brazil 1999 population 172006362 ## 7 Brazil 2000 cases 80488 ## 8 Brazil 2000 population 174504898 ## 9 China 1999 cases 212258 ## 10 China 1999 population 1272915272 ## 11 China 2000 cases 213766 ## 12 China 2000 population 1280428583 ``` ] .pull-right[ ```r table3 ``` ``` ## # A tibble: 6 x 3 ## country year rate ## <fct> <int> <chr> ## 1 Afghanistan 1999 745/19987071 ## 2 Afghanistan 2000 2666/20595360 ## 3 Brazil 1999 37737/172006362 ## 4 Brazil 2000 80488/174504898 ## 5 China 1999 212258/1272915272 ## 6 China 2000 213766/1280428583 ``` ] --- .pull-left[ ```r table4 ``` ``` ## # A tibble: 3 x 3 ## country `1999` `2000` ## <fct> <int> <int> ## 1 Afghanistan 745 2666 ## 2 Brazil 37737 80488 ## 3 China 212258 213766 ``` ] .pull-right[ ```r table5 ``` ``` ## # A tibble: 3 x 3 ## country `1999` `2000` ## <fct> <int> <int> ## 1 Afghanistan 19987071 20595360 ## 2 Brazil 172006362 174504898 ## 3 China 1272915272 1280428583 ``` ] --- What would a tidy version of these examples look like? ```r table1 ``` ``` ## # A tibble: 6 x 4 ## country year cases population ## <fct> <int> <int> <int> ## 1 Afghanistan 1999 745 19987071 ## 2 Afghanistan 2000 2666 20595360 ## 3 Brazil 1999 37737 172006362 ## 4 Brazil 2000 80488 174504898 ## 5 China 1999 212258 1272915272 ## 6 China 2000 213766 1280428583 ``` Here, we have - each country as a row per year - each column has a single variable - each cell has a single value The `tidyr` package is designed to make messy data tidy in order to proceed with analysis. It has four primary functions that we will see in turn. --- ## separate You will run into variables that you want to break into smaller pieces. For instance, take `table3` where `rate` is a ratio of two variables, `cases` and `population`, each of which should be in their own columns. How can we achieve that? 
```r library(tidyr) s.table3 <- table3 %>% separate(col = rate, into = c("cases", "population"), sep = "/", remove = TRUE, convert = TRUE) head(s.table3) ``` ``` ## # A tibble: 6 x 4 ## country year cases population ## <fct> <int> <int> <int> ## 1 Afghanistan 1999 745 19987071 ## 2 Afghanistan 2000 2666 20595360 ## 3 Brazil 1999 37737 172006362 ## 4 Brazil 2000 80488 174504898 ## 5 China 1999 212258 1272915272 ## 6 China 2000 213766 1280428583 ``` --- ## unite The `unite` command does the exact opposite of separate, combining columns ```r u.table3 <- s.table3 %>% unite("rate", cases, population, sep = ":", remove = FALSE) head(u.table3) ``` ``` ## # A tibble: 6 x 5 ## country year rate cases population ## <fct> <int> <chr> <int> <int> ## 1 Afghanistan 1999 745:19987071 745 19987071 ## 2 Afghanistan 2000 2666:20595360 2666 20595360 ## 3 Brazil 1999 37737:172006362 37737 172006362 ## 4 Brazil 2000 80488:174504898 80488 174504898 ## 5 China 1999 212258:1272915272 212258 1272915272 ## 6 China 2000 213766:1280428583 213766 1280428583 ``` Here you specify the name of the new column first, and then list the columns to be combined, specify the separator, and whether the original columns should be removed or not. --- ## spread Revisit `table2` ```r table2 ``` ``` ## # A tibble: 12 x 4 ## country year key value ## <fct> <int> <fct> <int> ## 1 Afghanistan 1999 cases 745 ## 2 Afghanistan 1999 population 19987071 ## 3 Afghanistan 2000 cases 2666 ## 4 Afghanistan 2000 population 20595360 ## 5 Brazil 1999 cases 37737 ## 6 Brazil 1999 population 172006362 ## 7 Brazil 2000 cases 80488 ## 8 Brazil 2000 population 174504898 ## 9 China 1999 cases 212258 ## 10 China 1999 population 1272915272 ## 11 China 2000 cases 213766 ## 12 China 2000 population 1280428583 ``` --- I want to flip this into a format where the data frame has two columns per country per year -- `cases` and `population`. How can I do this? ```r sp.table2 <- table2 %>% spread(key, value) sp.table2 ``` ``` ## # A tibble: 6 x 4 ## country year cases population ## <fct> <int> <int> <int> ## 1 Afghanistan 1999 745 19987071 ## 2 Afghanistan 2000 2666 20595360 ## 3 Brazil 1999 37737 172006362 ## 4 Brazil 2000 80488 174504898 ## 5 China 1999 212258 1272915272 ## 6 China 2000 213766 1280428583 ``` Notice that the `key` column had two unique values -- cases and population -- so two new columns were created. Notice also that tidyr knew to populate the value of each of the new cells with what it finds in the `value` column. The result is what we call data in a wide format. --- ## gather `gather` does the opposite of `spread`, converting wide data into what we call the long form. Revisit `table4` ```r table4 ``` ``` ## # A tibble: 3 x 3 ## country `1999` `2000` ## <fct> <int> <int> ## 1 Afghanistan 745 2666 ## 2 Brazil 37737 80488 ## 3 China 212258 213766 ``` Note that each country has two columns, one per year. ... This is untidy ... --- ... So, we'll flip this tibble as follows: ```r g.table4 <- table4 %>% gather("year", "cases", 2:3) head(g.table4) ``` ``` ## # A tibble: 6 x 3 ## country year cases ## <fct> <chr> <int> ## 1 Afghanistan 1999 745 ## 2 Brazil 1999 37737 ## 3 China 1999 212258 ## 4 Afghanistan 2000 2666 ## 5 Brazil 2000 80488 ## 6 China 2000 213766 ``` Notice that you had to specify the columns you wanted to gather and we did this via `2:3`. If you forget to specify columns it creates something else altogether so be careful. 
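---

## pivot_longer() and pivot_wider()

Newer releases of `tidyr` (1.0.0 and later) supersede `gather()` and `spread()` with `pivot_longer()` and `pivot_wider()`. The older verbs still work, so everything on the preceding slides runs as shown, but the pivot functions are easier to reason about. A rough sketch of the equivalents, assuming a recent `tidyr`:

```r
library(tidyverse)

# long form: the pivot_longer() analogue of gather("year", "cases", 2:3)
pl.table4 <- table4 %>%
  pivot_longer(cols = c(`1999`, `2000`), names_to = "year", values_to = "cases")

# wide form: the pivot_wider() analogue of spread(key, value)
pw.table2 <- table2 %>%
  pivot_wider(names_from = key, values_from = value)
```

Note that `pivot_longer()` asks you to name the columns being gathered explicitly, which avoids the "forgot to specify columns" trap mentioned above.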
--- ## .salt[.fancy[Some Examples]] ```r nospread.df <- multkey.merge %>% group_by(department_desc, term_code) %>% summarize(frequency = n()) head(nospread.df, 10) ``` ``` ## # A tibble: 10 x 3 ## # Groups: department_desc [4] ## department_desc term_code frequency ## <chr> <chr> <int> ## 1 Accountancy 20171C 1874 ## 2 Accountancy 20181C 1740 ## 3 Accountancy 20191C 1734 ## 4 Aerospace Studies 20171C 69 ## 5 Aerospace Studies 20181C 112 ## 6 Aerospace Studies 20191C 96 ## 7 African American Studies 20171C 313 ## 8 African American Studies 20181C 238 ## 9 African American Studies 20191C 333 ## 10 Applied Health Sciences & Wellness 20171C 3922 ``` --- ```r spread.df <- multkey.merge %>% group_by(department_desc, term_code) %>% summarize(frequency = n()) %>% spread(term_code, frequency) head(spread.df, 10) ``` ``` ## # A tibble: 10 x 4 ## # Groups: department_desc [10] ## department_desc `20171C` `20181C` `20191C` ## <chr> <int> <int> <int> ## 1 Accountancy 1874 1740 1734 ## 2 Aerospace Studies 69 112 96 ## 3 African American Studies 313 238 333 ## 4 Applied Health Sciences & Wellness 3922 4525 4268 ## 5 Art 2447 2969 2738 ## 6 Aviation 293 275 315 ## 7 Biological Sciences 6722 7947 7801 ## 8 Center For Intl Studies 255 328 338 ## 9 Center for Law, Justice and Culture NA NA 54 ## 10 Chemical & Biomolecular Engineering 870 1189 1059 ``` .center[Note: `term_code` is now a column instead of rows] --- --- # The World Health Organization Example [At this website](http://www.who.int/tb/country/data/download/en/) you can find data provided by countries to WHO and estimates of TB burden generated by WHO for the Global Tuberculosis Report. I want you to gather the following data (all are in the csv format): - [Download the data dictionary](https://extranet.who.int/tme/generateCSV.asp?ds=dictionary) - [Download case notifications](https://extranet.who.int/tme/generateCSV.asp?ds=notifications) - [Download WHO TB burden estimates](https://extranet.who.int/tme/generateCSV.asp?ds=estimates) Read the data dictionary carefully. Then read-in the case notifications data. The database has too many variables (175 or so) and so I'll have you use just a few for this practice task. ```r library(readr) tb <- read_csv("data/TB_notifications_2018-02-07.csv") ``` --- Let us use `dplyr` to select and filter the tb database so that it meets the following criteria: (i) observations from 2012 or earlier years are kept, and (ii) keep only columns 1 through 6 and 25 (new_sp_m04) through 44 (new_sp_fu). ```r library(dplyr) tb2 <- tb %>% select(c(1:6, 25:44)) %>% filter(year <= 2012) head(tb2) ``` ``` ## # A tibble: 6 x 26 ## country iso2 iso3 iso_numeric g_whoregion year new_sp_m04 new_sp_m514 new_sp_m014 ## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> ## 1 Afghan… AF AFG 004 EMR 1980 NA NA NA ## 2 Afghan… AF AFG 004 EMR 1981 NA NA NA ## 3 Afghan… AF AFG 004 EMR 1982 NA NA NA ## 4 Afghan… AF AFG 004 EMR 1983 NA NA NA ## 5 Afghan… AF AFG 004 EMR 1984 NA NA NA ## 6 Afghan… AF AFG 004 EMR 1985 NA NA NA ## # … with 17 more variables: new_sp_m1524 <dbl>, new_sp_m2534 <dbl>, new_sp_m3544 <dbl>, ## # new_sp_m4554 <dbl>, new_sp_m5564 <dbl>, new_sp_m65 <dbl>, new_sp_mu <dbl>, ## # new_sp_f04 <dbl>, new_sp_f514 <dbl>, new_sp_f014 <dbl>, new_sp_f1524 <dbl>, ## # new_sp_f2534 <dbl>, new_sp_f3544 <dbl>, new_sp_f4554 <dbl>, new_sp_f5564 <dbl>, ## # new_sp_f65 <dbl>, new_sp_fu <dbl> ``` --- Now, `new_sp_xx` refer to new pulmonary smear-positive cases. The `m` refers to males and `f` refers to females. 
Finally, `04` is the 0-4 age group, `514` is the 5-14 age group, and so on. `new_sp_mu` is males with an unknown age-group and `new_sp_fu` is females with an unknown age-group. This is the messy part of the data. Ideally we'd have a column for gender, flagged as `Male, Female, Unknown`, and a separate column for age-groups flagged `0-4, 5-14, 0-14, and so on`. Time to tidy these data. Since the data are in the wide format we can use gather to flip it into the long format. ```r tb3 <- tb2 %>% gather("group", "cases", 7:26) head(tb3) ``` ``` ## # A tibble: 6 x 8 ## country iso2 iso3 iso_numeric g_whoregion year group cases ## <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <dbl> ## 1 Afghanistan AF AFG 004 EMR 1980 new_sp_m04 NA ## 2 Afghanistan AF AFG 004 EMR 1981 new_sp_m04 NA ## 3 Afghanistan AF AFG 004 EMR 1982 new_sp_m04 NA ## 4 Afghanistan AF AFG 004 EMR 1983 new_sp_m04 NA ## 5 Afghanistan AF AFG 004 EMR 1984 new_sp_m04 NA ## 6 Afghanistan AF AFG 004 EMR 1985 new_sp_m04 NA ``` --- Now we split the `new_sp_xx` values into three columns ```r tb4 <- tb3 %>% separate(col = group, into = c("new", "sp", "sexage"), sep = "_") head(tb4) ``` ``` ## # A tibble: 6 x 10 ## country iso2 iso3 iso_numeric g_whoregion year new sp sexage cases ## <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <dbl> ## 1 Afghanistan AF AFG 004 EMR 1980 new sp m04 NA ## 2 Afghanistan AF AFG 004 EMR 1981 new sp m04 NA ## 3 Afghanistan AF AFG 004 EMR 1982 new sp m04 NA ## 4 Afghanistan AF AFG 004 EMR 1983 new sp m04 NA ## 5 Afghanistan AF AFG 004 EMR 1984 new sp m04 NA ## 6 Afghanistan AF AFG 004 EMR 1985 new sp m04 NA ``` --- and then split `sexage` into two columns by specifying that the split should occur after the first value (which in our case will be after `m` or `f`). ```r tb5 <- tb4 %>% separate(col = sexage, into = c("sex", "agegroup"), sep = 1) head(tb5) ``` ``` ## # A tibble: 6 x 11 ## country iso2 iso3 iso_numeric g_whoregion year new sp sex agegroup cases ## <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <dbl> ## 1 Afghanistan AF AFG 004 EMR 1980 new sp m 04 NA ## 2 Afghanistan AF AFG 004 EMR 1981 new sp m 04 NA ## 3 Afghanistan AF AFG 004 EMR 1982 new sp m 04 NA ## 4 Afghanistan AF AFG 004 EMR 1983 new sp m 04 NA ## 5 Afghanistan AF AFG 004 EMR 1984 new sp m 04 NA ## 6 Afghanistan AF AFG 004 EMR 1985 new sp m 04 NA ``` --- The beauty of `dplyr` and `tidyr` (in fact of all things `tidyverse`) is the ability to combine multiple steps into a single piped command! ```r tb.df <- tb %>% select(c(1:6, 25:44)) %>% filter(year <= 2012) %>% gather("group", "cases", 7:26) %>% separate(col = group, into = c("new", "sp", "sexage"), sep = "_") %>% separate(col = sexage, into = c("sex", "agegroup"), sep = 1) %>% select(c(1:6, 9:11)) head(tb.df) ``` ``` ## # A tibble: 6 x 9 ## country iso2 iso3 iso_numeric g_whoregion year sex agegroup cases ## <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <dbl> ## 1 Afghanistan AF AFG 004 EMR 1980 m 04 NA ## 2 Afghanistan AF AFG 004 EMR 1981 m 04 NA ## 3 Afghanistan AF AFG 004 EMR 1982 m 04 NA ## 4 Afghanistan AF AFG 004 EMR 1983 m 04 NA ## 5 Afghanistan AF AFG 004 EMR 1984 m 04 NA ## 6 Afghanistan AF AFG 004 EMR 1985 m 04 NA ``` --- We can clean-up the codes for easier interpretation. 
```r tb.df$sex <- ifelse(tb.df$sex == "m", "Males", "Females") tb.df$agegroup <- ifelse(tb.df$agegroup == "04", "0-4", ifelse(tb.df$agegroup == "014", "0-14", ifelse(tb.df$agegroup == "1524", "15-24", ifelse(tb.df$agegroup == "2534", "25-34", ifelse(tb.df$agegroup == "3544", "35-44", ifelse(tb.df$agegroup == "4554", "45-54", ifelse(tb.df$agegroup == "514", "5-14", ifelse(tb.df$agegroup == "5564", "55-64", ifelse(tb.df$agegroup == "65", "65+", "Uknown"))))))))) head(tb.df) ``` ``` ## # A tibble: 6 x 9 ## country iso2 iso3 iso_numeric g_whoregion year sex agegroup cases ## <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <dbl> ## 1 Afghanistan AF AFG 004 EMR 1980 Males 0-4 NA ## 2 Afghanistan AF AFG 004 EMR 1981 Males 0-4 NA ## 3 Afghanistan AF AFG 004 EMR 1982 Males 0-4 NA ## 4 Afghanistan AF AFG 004 EMR 1983 Males 0-4 NA ## 5 Afghanistan AF AFG 004 EMR 1984 Males 0-4 NA ## 6 Afghanistan AF AFG 004 EMR 1985 Males 0-4 NA ``` .salt[Note: This is using base R and not `case_when()` from `dplyr`] --- How many cases per country in the 1980-1990 period? ```r tbtab.01 <- tb.df %>% filter(year >= 1980 | year <= 1990) %>% group_by(country) %>% summarise(Ncases = sum(cases, na.rm = TRUE)) %>% arrange(-Ncases) head(tbtab.01) ``` ``` ## # A tibble: 6 x 2 ## country Ncases ## <chr> <dbl> ## 1 India 6492850 ## 2 China 5643738 ## 3 Indonesia 1886657 ## 4 South Africa 1435147 ## 5 Bangladesh 1072431 ## 6 Viet Nam 915525 ``` --- What about by sex and age-group? ```r tbtab.02 <- tb.df %>% filter(year >= 1980 | year <= 1990) %>% group_by(country, sex, agegroup) %>% summarise(Ncases = sum(cases, na.rm = TRUE)) %>% arrange(-Ncases) head(tbtab.02) ``` ``` ## # A tibble: 6 x 4 ## # Groups: country, sex [2] ## country sex agegroup Ncases ## <chr> <chr> <chr> <dbl> ## 1 India Males 35-44 963811 ## 2 India Males 25-34 913939 ## 3 India Males 45-54 835228 ## 4 India Males 15-24 791303 ## 5 China Males 65+ 748103 ## 6 China Males 45-54 668133 ``` This is just one example of how `dplyr` and `tidyr` -- two packages from the `tidyverse` -- can be used to deal with what seems to be quite messy data. The more you use it, the easier it becomes to figure out solutions to even very complicated messy data setups. --- class: inverse, middle, center # <br> .fat[.fancy[lubridate]] <center><img src = "images/hex-lubridate.png"></center> --- This package makes working with dates and times quite easy because it does what base R does but in more intuitive ways and perhaps more flexibly. Let us start with some date fields. ```r today1 = "20171217" today2 = "2017-12-17" today3 = "2017 December 17" today4 = "20171217143241" today5 = "2017 December 17 14:32:41" today6 = "December 17 2017 14:32:41" today7 = "17-Dec, 2017 14:32:41" ``` The formats are quite varied but `lubridate` deals with them quite seamlessly so long as you pay attention to the order -- is year first or last? What about month? day? Is time given in hours, minutes and seconds?. --- .pull-left[ ```r library(lubridate) ymd(today1) ``` ``` ## [1] "2017-12-17" ``` ```r ymd(today2) ``` ``` ## [1] "2017-12-17" ``` ```r ymd(today3) ``` ``` ## [1] "2017-12-17" ``` ] .pull-right[ ```r ymd_hms(today4) ``` ``` ## [1] "2017-12-17 14:32:41 UTC" ``` ```r ymd_hms(today5) ``` ``` ## [1] "2017-12-17 14:32:41 UTC" ``` ```r mdy_hms(today6) ``` ``` ## [1] "2017-12-17 14:32:41 UTC" ``` ```r dmy_hms(today7) ``` ``` ## [1] "2017-12-17 14:32:41 UTC" ``` ] --- If I need to extract data/time elements, I can do that as well. 
```r today = dmy_hms(today7); today # a date and time set in today7 ``` ``` ## [1] "2017-12-17 14:32:41 UTC" ``` ```r today.y = year(today); today.y # The year ``` ``` ## [1] 2017 ``` ```r today.m1 = month(today); today.m1 # the month, as a number ``` ``` ## [1] 12 ``` ```r today.m2 = month(today, label = TRUE, abbr = TRUE); today.m2 # labeling the month but with an abbreviation ``` ``` ## [1] Dec ## Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < Oct < Nov < Dec ``` ```r today.m3 = month(today, label = TRUE, abbr = FALSE); today.m3# fully labelling the month ``` ``` ## [1] December ## 12 Levels: January < February < March < April < May < June < July < ... < December ``` --- .pull-left[ ```r today.w = week(today) today.w # what week of the year is it? ``` ``` ## [1] 51 ``` ```r today.doy = yday(today) today.doy # what day of the year is it? ``` ``` ## [1] 351 ``` ```r today.dom = mday(today) today.dom # what day of the month is it? ``` ``` ## [1] 17 ``` ] .pull-right[ ```r today.dow1 = wday(today) today.dow1 # what day of the week is it, as a number? ``` ``` ## [1] 1 ``` ```r today.dow2 = wday(today, label = TRUE, abbr = TRUE) today.dow2 # day of the week, abbreviated label ``` ``` ## [1] Sun ## Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat ``` ```r today.dow3 = wday(today, label = TRUE, abbr = FALSE) today.dow3 # day of the week fully labelled ``` ``` ## [1] Sunday ## Levels: Sunday < Monday < Tuesday < Wednesday < Thursday < Friday < Saturday ``` ] --- ```r today.h = hour(today) today.h # what hour is it? ``` ``` ## [1] 14 ``` ```r today.m = minute(today) today.m # what minute is it? ``` ``` ## [1] 32 ``` ```r today.s = second(today) today.s # what second is it? ``` ``` ## [1] 41 ``` ```r today.tz = tz(today) today.tz # what time zone is it? ``` ``` ## [1] "UTC" ``` --- ### Intervals vs. Durations vs. Periods Calculating time lapsed between dates is tricky because you have to take into account daylight savings time<sup>1</sup>, leap years<sup>2</sup>, and leap seconds<sup>3</sup>. These render the length of months, weeks, days, hours, and minutes to be `relative units of time` but seconds tends to be `exact units of time`. As a result, and by design, `lubridate` differentiates between `intervals`, `durations`, and `periods`, each measuring time spans in different ways. .footnote[ [1] Daylight Saving Time (DST) is the practice of setting the clocks forward 1 hour from standard time during the summer months, and back again in the fall, in order to make better use of natural daylight. [Source](https://www.timeanddate.com/time/dst/) [2] Leap years are needed to keep our modern day Gregorian calendar in alignment with the Earth's revolutions around the sun. It takes the Earth approximately 365.242189 days – or 365 days, 5 hours, 48 minutes, and 45 seconds – to circle once around the Sun. This is called a tropical year, and is measured from the March equinox. However, the Gregorian calendar has only 365 days in a year, so if we didn't add a leap day on February 29 nearly every four years, we would lose almost six hours off our calendar every year. After only 100 years, our calendar would be off by around 24 days! [Source](https://www.timeanddate.com/date/leapyear.html) [3] Two components are used to determine UTC (Coordinated Universal Time): International Atomic Time (TAI): A time scale that combines the output of some 200 highly precise atomic clocks worldwide, and provides the exact speed for our clocks to tick. 
Universal Time (UT1), also known as Astronomical Time, refers to the Earth's rotation around its own axis, which determines the length of a day. Before the difference between UTC and UT1 reaches 0.9 seconds, a leap second is added to UTC and to clocks worldwide. By adding an additional second to the time count, our clocks are effectively stopped for that second to give Earth the opportunity to catch up. [Source](https://www.timeanddate.com/time/leapseconds.html). ] --- Let us start with a `duration`, the simplest measure of lapsed time since it measures the passing of time in seconds. Say I pick two dates and times ... `05:00am on March 9, 2019` and `05:00am on March 10, 2019`. ```r date1 <- ymd_hms("2019-03-09 05:00:00", tzone = "US/Eastern") date2 <- ymd_hms("2019-03-10 05:00:00", tzone = "US/Eastern") timint <- interval(date1, date2) timint ``` ``` ## [1] 2019-03-09 05:00:00 UTC--2019-03-10 05:00:00 UTC ## [2] NA--NA ``` How much time has lapsed between date1 and date2? We calculate with `as.duration()` ```r timelapsed.d <- as.duration(timint) ``` Why is it 23 hours and not 24 hours? Because daylight savings kicks in at 2:00 AM on March 10, and since the clocks kick forward by one hour, only 23 hours have passed, not 24 as we might naively expect. Go back and look at the output from `timeint`; see the EST versus EDT? In contrast to duration, if I ask for the `period of time`, what do I get? ```r timelapsed.p <- as.period(timint) ``` Aha! `as.period()` is imprecise because it tells me 1 day has passed --- Now, we can use the duration of time lapsed in whatever units we want, but if we want accuracy, we better work with durations, as shown below. .pull-left[ ```r time_length(timelapsed.d, unit = "second") ``` ``` ## [1] 86400 NA ``` ```r time_length(timelapsed.d, unit = "minute") ``` ``` ## [1] 1440 NA ``` ```r time_length(timelapsed.d, unit = "hour") ``` ``` ## [1] 24 NA ``` ] .pull-right[ ```r time_length(timelapsed.d, unit = "day") ``` ``` ## [1] 1 NA ``` ```r time_length(timelapsed.d, unit = "week") ``` ``` ## [1] 0.1428571 NA ``` ```r time_length(timelapsed.d, unit = "month") ``` ``` ## [1] 0.03285421 NA ``` ```r time_length(timelapsed.d, unit = "year") ``` ``` ## [1] 0.002737851 NA ``` ] `duration()` is always estimated in seconds but you can change the reporting unit --- If, unfortunately, we rely on periods, well then we have inherited a problem! .pull-left[ ```r time_length(timelapsed.p, unit = "second") ``` ``` ## [1] 86400 NA ``` ```r time_length(timelapsed.p, unit = "minute") ``` ``` ## [1] 1440 NA ``` ```r time_length(timelapsed.p, unit = "hour") ``` ``` ## [1] 24 NA ``` ] .pull-right[ ```r time_length(timelapsed.p, unit = "day") ``` ``` ## [1] 1 NA ``` ```r time_length(timelapsed.p, unit = "week") ``` ``` ## [1] 0.1428571 NA ``` ```r time_length(timelapsed.p, unit = "month") ``` ``` ## [1] 0.03285421 NA ``` ```r time_length(timelapsed.p, unit = "year") ``` ``` ## [1] 0.002737851 NA ``` ] `only` the `seconds` are correct and nothing else --- class: right, middle <img class="circle" src="https://github.com/aniruhil.png" width="175px"/> # Find me at... 
[<svg style="height:0.8em;top:.04em;position:relative;" viewBox="0 0 512 512"><path d="M459.37 151.716c.325 4.548.325 9.097.325 13.645 0 138.72-105.583 298.558-298.558 298.558-59.452 0-114.68-17.219-161.137-47.106 8.447.974 16.568 1.299 25.34 1.299 49.055 0 94.213-16.568 130.274-44.832-46.132-.975-84.792-31.188-98.112-72.772 6.498.974 12.995 1.624 19.818 1.624 9.421 0 18.843-1.3 27.614-3.573-48.081-9.747-84.143-51.98-84.143-102.985v-1.299c13.969 7.797 30.214 12.67 47.431 13.319-28.264-18.843-46.781-51.005-46.781-87.391 0-19.492 5.197-37.36 14.294-52.954 51.655 63.675 129.3 105.258 216.365 109.807-1.624-7.797-2.599-15.918-2.599-24.04 0-57.828 46.782-104.934 104.934-104.934 30.213 0 57.502 12.67 76.67 33.137 23.715-4.548 46.456-13.32 66.599-25.34-7.798 24.366-24.366 44.833-46.132 57.827 21.117-2.273 41.584-8.122 60.426-16.243-14.292 20.791-32.161 39.308-52.628 54.253z"/></svg> @aruhil](http://twitter.com/aruhil) [<svg style="height:0.8em;top:.04em;position:relative;" viewBox="0 0 512 512"><path d="M326.612 185.391c59.747 59.809 58.927 155.698.36 214.59-.11.12-.24.25-.36.37l-67.2 67.2c-59.27 59.27-155.699 59.262-214.96 0-59.27-59.26-59.27-155.7 0-214.96l37.106-37.106c9.84-9.84 26.786-3.3 27.294 10.606.648 17.722 3.826 35.527 9.69 52.721 1.986 5.822.567 12.262-3.783 16.612l-13.087 13.087c-28.026 28.026-28.905 73.66-1.155 101.96 28.024 28.579 74.086 28.749 102.325.51l67.2-67.19c28.191-28.191 28.073-73.757 0-101.83-3.701-3.694-7.429-6.564-10.341-8.569a16.037 16.037 0 0 1-6.947-12.606c-.396-10.567 3.348-21.456 11.698-29.806l21.054-21.055c5.521-5.521 14.182-6.199 20.584-1.731a152.482 152.482 0 0 1 20.522 17.197zM467.547 44.449c-59.261-59.262-155.69-59.27-214.96 0l-67.2 67.2c-.12.12-.25.25-.36.37-58.566 58.892-59.387 154.781.36 214.59a152.454 152.454 0 0 0 20.521 17.196c6.402 4.468 15.064 3.789 20.584-1.731l21.054-21.055c8.35-8.35 12.094-19.239 11.698-29.806a16.037 16.037 0 0 0-6.947-12.606c-2.912-2.005-6.64-4.875-10.341-8.569-28.073-28.073-28.191-73.639 0-101.83l67.2-67.19c28.239-28.239 74.3-28.069 102.325.51 27.75 28.3 26.872 73.934-1.155 101.96l-13.087 13.087c-4.35 4.35-5.769 10.79-3.783 16.612 5.864 17.194 9.042 34.999 9.69 52.721.509 13.906 17.454 20.446 27.294 10.606l37.106-37.106c59.271-59.259 59.271-155.699.001-214.959z"/></svg> aniruhil.org](https://aniruhil.org) [<svg style="height:0.8em;top:.04em;position:relative;" viewBox="0 0 512 512"><path d="M476 3.2L12.5 270.6c-18.1 10.4-15.8 35.6 2.2 43.2L121 358.4l287.3-253.2c5.5-4.9 13.3 2.6 8.6 8.3L176 407v80.5c0 23.6 28.5 32.9 42.5 15.8L282 426l124.6 52.2c14.2 6 30.4-2.9 33-18.2l72-432C515 7.8 493.3-6.8 476 3.2z"/></svg> ruhil@ohio.edu](mailto:ruhil@ohio.edu)