Dates and Times in R

class: title-slide, center, middle
background-image: url(images/ouaerial.jpeg)
background-size: cover

# .crimson[.fancy[Dates and Times in R]]

## .crimson[.fancy[Ani Ruhil]]

---

name: agenda

# .fancy[ Agenda ]

This week we learn how to work with dates and times

.pull-left[
+ package of choice here is `{lubridate}`

+ can parse messy date and time values

+ can calculate passage of time

+ Julian date formats? No worries ...

+ extract date elements? Yup, that too ...
]

.pull.right[
<img src="images/lubridate.png" width="25%" style="display: block; margin: auto;" />
<img src="images/lubridate_ymd.png" width="25%" style="display: block; margin: auto;" />

]

---

> "Date-time data can be frustrating to work with in R. R commands for date-times are generally unintuitive and change depending on the type of date-time object being used. Moreover, the methods we use with date-times must be robust to time zones, leap days, daylight savings times, and other time related quirks, and R lacks these capabilities in some situations. Lubridate makes it easier to do the things R does with date-times and possible to do the things R does not."

First up, some mangled date entries and we'll see how to parse them into correct date formats!

```r
"20171217" -> today1 
"2017-12-17" -> today2 
"2017 December 17" -> today3 
"20171217143241" -> today4 
"2017 December 17 14:32:41" -> today5 
"December 17 2017 14:32:41" -> today6 
"17-Dec, 2017 14:32:41" -> today7 
```

---

Now we fix them up!

```r
library(tidyverse)
library(lubridate)

ymd(today1) -> date1
ymd(today2) -> date2 
ymd(today3) -> date3 
date1; date2; date3
```

```
## [1] "2017-12-17"
```

`today1`, `today2`, and `today3` all had the same structure of year-month-day and so `ymd()` works to get the format right. `today4` has year-month-day-hours-minutes-seconds so we'll have to do this one slightly differently. The same thing works for `today5` as well.

---

```r
ymd_hms(today4) -> date4 
ymd_hms(today5) -> date5 
date4; date5
```

```
## [1] "2017-12-17 14:32:41 UTC"
```

`today6` has a slightly different format, `month-day-year-hours-minutes-seconds` that is read in thus:

```r
mdy_hms(today6) -> date6 
date6
```

```
## [1] "2017-12-17 14:32:41 UTC"
```

`today7` has a slightly different format, `day-month-year-hours-minutes-seconds` that is read in thus:

```r
dmy_hms(today7) -> date7 
date7
```

```
## [1] "2017-12-17 14:32:41 UTC"
```

---

## Working with flight dates
Now we should be able to start working with some date variables, and the ideal candidate would be the flight date column our `cmhflights` data. So the first thing we will do is load that data-set so that we can work with it.

```r
library(here)
load(here("data", "cmhflights_01092017.RData"))
```

I dislike this uppercase-lowercase mixture they have in the column names and so will get rid of it as shown below, making everything nice and lowercase. This is done with the `janitor` package's `clean_names()` command. I am also going to use `select()` to keep only a handful of columns since keeping 100+ is of no value.

```r
library(janitor)

cmhflights %>%
  clean_names() %>%
  select(
    year, month, dayof_month, day_of_week, flight_date, carrier,
    tail_num, flight_num, origin_city_name, dest_city_name,
    dep_time, dep_delay, arr_time, arr_delay, cancelled, diverted
    ) -> cmh.df
```
---
The first thing I want to do now is to label the days of the week, the months, and then also create that flag for the `weekend` versus `weekdays`. Here goes:

```r
cmh.df %>%
  mutate(
    dayofweek = wday(
      day_of_week,
      abbr = FALSE,
      label = TRUE
      ),
    monthname = month(
      month,
      abbr = FALSE,
      label = TRUE
      ),
    weekend = case_when(
      dayofweek %in% c("Saturday", "Sunday") ~ "Weekend",
      TRUE ~ "Weekday"
      )
    ) -> cmh.df 
```

---

Now let us ask some questions: (a) What month had the most flights? (b) What day of the week had the most flights?

.pull-left[

```r
cmh.df %>%
  count(monthname, sort = TRUE) # (a)
```

```
## # A tibble: 9 x 2
##   monthname     n
##   <ord>     <int>
## 1 July       4295
## 2 August     4279
## 3 June       4138
## 4 April      4123
## 5 March      4101
## 6 May        4098
## 7 September  3789
## 8 January    3757
## 9 February   3413
```
]
.pull-right[

```r
cmh.df %>%
  count(dayofweek, sort = TRUE) # (b) 
```

```
## # A tibble: 7 x 2
##   dayofweek     n
##   <ord>     <int>
## 1 Wednesday  5435
## 2 Thursday   5417
## 3 Sunday     5395
## 4 Tuesday    5368
## 5 Monday     5284
## 6 Saturday   4892
## 7 Friday     4202
```
]

---

(c) What about weekends; did weekends have more flights than weekdays? (d) With respect to (c), does whatever pattern we see vary by month or does month not matter?

.pull-left[

```r
cmh.df %>%
  count(weekend, sort = TRUE) # (c) 
```

```
## # A tibble: 2 x 2
##   weekend     n
##   <chr>   <int>
## 1 Weekday 25706
## 2 Weekend 10287
```
]
.pull-right[

```r
cmh.df %>%
  count(monthname, weekend, sort = TRUE) # (d) 
```

```
## # A tibble: 18 x 3
##    monthname weekend     n
##    <ord>     <chr>   <int>
##  1 August    Weekday  3165
##  2 March     Weekday  3047
##  3 June      Weekday  3023
##  4 July      Weekday  2918
##  5 May       Weekday  2908
##  6 April     Weekday  2876
##  7 September Weekday  2760
##  8 January   Weekday  2557
##  9 February  Weekday  2452
## 10 July      Weekend  1377
## 11 April     Weekend  1247
## 12 January   Weekend  1200
## 13 May       Weekend  1190
## 14 June      Weekend  1115
## 15 August    Weekend  1114
## 16 March     Weekend  1054
## 17 September Weekend  1029
## 18 February  Weekend   961
```
]

---
So most flights are on weekdays, but weekend flights lead in July while weekday flights lead in August.

But wait a minute, if I can calculate these frequencies, why not do it by the hour. That may allow us to answer such questions as: What hour of the day has the most flights, the most delays? What about by airline? What if we push this to the minute of the hour?

Well, first we will have to create a new variable that marks just the hour of the day in the 24-hour cycle. But to do this we will first need to create a single `flight_date_time` column that will be in the `ymd_hms` format. How? With `unite()`.

```r
cmh.df %>%
  unite(
    col = "flight_date_time",
    c(flight_date, dep_time),
    sep = ":",
    remove = TRUE
  ) -> cmh.df
```

---
Okay, now we create `flt_date_time` and note the seconds here are automatically coerced to be `00`.

```r
cmh.df %>%
  mutate(
    flt_date_time = ymd_hm(flight_date_time)
      ) -> cmh.df
```

Now we extract just the hour of the day the flight was scheduled to depart.

```r
cmh.df %>%
  mutate(
    flt_hour = hour(flt_date_time),
    flt_minute = minute(flt_date_time)
    ) -> cmh.df
```

---

All righty then, now we start digging in. What hour has the most flights, and does this vary by the day of the week? By the Month?

.pull-left[

```r
cmh.df %>%
  count(flt_hour, sort = TRUE)
```

```
## # A tibble: 24 x 2
##    flt_hour     n
##       <int> <int>
##  1       10  2626
##  2       17  2454
##  3        7  2448
##  4       16  2395
##  5       15  2392
##  6       14  2390
##  7        8  2331
##  8        9  2283
##  9       18  2268
## 10        6  2106
## # … with 14 more rows
```
]
.pull-right[

```r
cmh.df %>%
  count(monthname, flt_hour, sort = TRUE)
```

```
## # A tibble: 199 x 3
##    monthname flt_hour     n
##    <ord>        <int> <int>
##  1 May             10   376
##  2 August           8   328
##  3 June            10   325
##  4 March           17   323
##  5 July             8   319
##  6 May             15   317
##  7 April           17   314
##  8 May             18   314
##  9 March            7   313
## 10 January         16   312
## # … with 189 more rows
```
]

Looks like 10:00 and then 17:00, these would be your best bets if you were looking to catch a flight and wanted as many options as possible. On the flip side, this might also be the time when flights get delayed more often because there are so many flights scheduled at these hours!

---

Now I want to ask the question about delays: Are median delays higher at certain hours?

.pull-left[

```r
cmh.df %>%
  group_by(flt_hour) %>%
  summarise(
    md.delay = median(dep_delay, na.rm = TRUE)
    ) %>%
  arrange(-md.delay)
```

```
## # A tibble: 24 x 2
##    flt_hour md.delay
##       <int>    <dbl>
##  1        3      290
##  2        2      233
##  3        1      174
##  4        0      137
##  5       23       49
##  6       21        6
##  7       18        2
##  8       19        1
##  9       15        0
## 10       16        0
## # … with 14 more rows
```
]

.pull-right[

```r
cmh.df %>%
  group_by(flt_hour) %>%
  summarise(
    md.delay = median(dep_delay, na.rm = TRUE)
    ) %>%
  arrange(md.delay)
```

```
## # A tibble: 24 x 2
##    flt_hour md.delay
##       <int>    <dbl>
##  1        5       -4
##  2        6       -4
##  3        7       -4
##  4        8       -3
##  5        9       -3
##  6       10       -2
##  7       11       -2
##  8       12       -2
##  9       13       -2
## 10       14       -2
## # … with 14 more rows
```
]

The expected result; Shortest median delay is at 5 AM, and delays increase by the hour. Bottom-line: Fly early.

---
#### Might this vary by destination?

.pull-left[

```r
cmh.df %>%
  group_by(dest_city_name, flt_hour) %>%
  summarise(
    md.delay = median(dep_delay, na.rm = TRUE)
    ) %>%
  arrange(-md.delay)
```

```
## # A tibble: 418 x 3
## # Groups:   dest_city_name [26]
##    dest_city_name  flt_hour md.delay
##    <chr>              <int>    <dbl>
##  1 Newark, NJ             6    1046 
##  2 Newark, NJ             7     688 
##  3 Denver, CO            14     489 
##  4 Houston, TX            7     420.
##  5 Minneapolis, MN        0     381 
##  6 Atlanta, GA            1     348 
##  7 New York, NY           0     337 
##  8 Tampa, FL              1     324 
##  9 Nashville, TN         23     323 
## 10 Fort Myers, FL         0     297 
## # … with 408 more rows
```
Avoid flying to Newark, NJ, even at 6 or 7 AM. Might these vary by airline?
]
.pull-right[

```r
cmh.df %>%
  group_by(carrier, dest_city_name, flt_hour) %>%
  summarise(
    md.delay = median(dep_delay, na.rm = TRUE)
    ) %>%
  arrange(-md.delay)
```

```
## # A tibble: 656 x 4
## # Groups:   carrier, dest_city_name [52]
##    carrier dest_city_name  flt_hour md.delay
##    <chr>   <chr>              <int>    <dbl>
##  1 EV      Newark, NJ             6    1046 
##  2 EV      Chicago, IL            6    1024 
##  3 EV      Newark, NJ             7     688 
##  4 DL      Columbus, OH           5     526 
##  5 F9      Denver, CO            14     489 
##  6 DL      Los Angeles, CA       15     481 
##  7 AA      Phoenix, AZ           15     463 
##  8 EV      Houston, TX            7     420.
##  9 UA      Chicago, IL            0     394 
## 10 DL      Minneapolis, MN        0     381 
## # … with 646 more rows
```

Worst early-morning delays are for EV, to Newark and to Chicago. 
]

---
## <svg style="height:0.8em;top:.04em;position:relative;fill:darkred;" viewBox="0 0 384 512"><path d="M216 23.858c0-23.802-30.653-32.765-44.149-13.038C48 191.851 224 200 224 288c0 35.629-29.114 64.458-64.85 63.994C123.98 351.538 96 322.22 96 287.046v-85.51c0-21.703-26.471-32.225-41.432-16.504C27.801 213.158 0 261.332 0 320c0 105.869 86.131 192 192 192s192-86.131 192-192c0-170.29-168-193.003-168-296.142z"/></svg> Passage of Time
Let us assume we are interested in seeing how much time lapses between successive flights of each aircraft seen in the data. We know we can identify each unique aircraft by its `tail_num`. So let us first see how many times is each aircraft seen and create a new column called `number_flew`. Some rows of data are missing `flt_date_time` and `tail_num` so I will filter these out as well.

.pull-left[

```r
cmh.df %>%
  filter( 
    !is.na(tail_num),
    !is.na(flt_date_time) 
    ) %>% 
  group_by(tail_num) %>% 
  arrange(flt_date_time) %>% 
  mutate(n_flew = row_number()) %>% 
  select(tail_num, flt_date_time, n_flew) %>%
  arrange(-n_flew) -> cmh.df2 
```
]

.pull-right[

```
## # A tibble: 6 x 3
## # Groups:   tail_num [1]
##   tail_num flt_date_time       n_flew
##   <chr>    <dttm>               <int>
## 1 N396SW   2017-08-23 10:07:00     73
## 2 N396SW   2017-08-23 08:07:00     72
## 3 N396SW   2017-08-19 08:20:00     71
## 4 N396SW   2017-08-18 15:24:00     70
## 5 N396SW   2017-08-06 21:43:00     69
## 6 N396SW   2017-08-06 18:53:00     68
```
]

So far so good; [<svg style="height:0.8em;top:.04em;position:relative;fill:salmon;" viewBox="0 0 512 512"><path d="M476 3.2L12.5 270.6c-18.1 10.4-15.8 35.6 2.2 43.2L121 358.4l287.3-253.2c5.5-4.9 13.3 2.6 8.6 8.3L176 407v80.5c0 23.6 28.5 32.9 42.5 15.8L282 426l124.6 52.2c14.2 6 30.4-2.9 33-18.2l72-432C515 7.8 493.3-6.8 476 3.2z"/></svg> N396SW is the winner and has well-earned its retirement](https://www.planespotters.net/airframe/boeing-737-n396sw-aerothrust-holdings/38l2ge).

---

Now we need to see how much time lapsed between flights, and this is just the difference between the `preceding flt_date_time` recorded and the most recent `flt_date_time`. As we do this, note that by default time span (`ytspan`) is calculated in seconds.

```r
cmh.df2 %>%
  group_by(tail_num) %>%
  arrange(flt_date_time) %>%
  mutate(
    tspan = interval(
      lag(flt_date_time, order_by = tail_num), flt_date_time
      ), # calculate the time span between successive flights recorded 
    tspan.minutes = as.duration(tspan)/dminutes(1), # convert tspan into minutes 
    tspan.hours = as.duration(tspan)/dhours(1), # convert tspan into hours 
    tspan.days = as.duration(tspan)/ddays(1), # convert tspan into days 
    tspan.weeks = as.duration(tspan)/dweeks(1) # convert tspan into weeks 
    ) -> cmh.df2 
```

Here, `tspan` is being converted into, say, minutes by dividing it by 60, into hours by dividing tspan by 60 x 60 = 3600, and so on. Note that `dminutes(1)` is calculating the time span in one-minute intervals. Similarly for hours, days, and weeks. Thus if you ran `dhours(2)` you would get the time interval in 2-hour increments.

---

```r
cmh.df2 %>%
  filter(tail_num == "N396SW")
```

```
## # A tibble: 73 x 8
## # Groups:   tail_num [1]
##    tail_num flt_date_time       n_flew   tspan tspan.minutes tspan.hours
##    <chr>    <dttm>               <int>   <dbl>         <dbl>       <dbl>
##  1 N396SW   2017-01-05 09:30:00      1      NA            NA       NA   
##  2 N396SW   2017-01-05 12:19:00      2   10140           169        2.82
##  3 N396SW   2017-01-11 08:34:00      3  504900          8415      140.  
##  4 N396SW   2017-01-11 10:44:00      4    7800           130        2.17
##  5 N396SW   2017-01-19 10:31:00      5  690420         11507      192.  
##  6 N396SW   2017-01-19 14:28:00      6   14220           237        3.95
##  7 N396SW   2017-02-10 08:23:00      7 1878900         31315      522.  
##  8 N396SW   2017-02-10 10:32:00      8    7740           129        2.15
##  9 N396SW   2017-02-15 15:20:00      9  449280          7488      125.  
## 10 N396SW   2017-02-15 18:15:00     10   10500           175        2.92
## # … with 63 more rows, and 2 more variables: tspan.days <dbl>, tspan.weeks <dbl>
```

There is a lot more we could do with time but the few things we have covered so far would be the more common tasks we usually encounter.

---

class: right, middle

# Find me at...

[<svg style="height:0.8em;top:.04em;position:relative;" viewBox="0 0 512 512"><path d="M459.37 151.716c.325 4.548.325 9.097.325 13.645 0 138.72-105.583 298.558-298.558 298.558-59.452 0-114.68-17.219-161.137-47.106 8.447.974 16.568 1.299 25.34 1.299 49.055 0 94.213-16.568 130.274-44.832-46.132-.975-84.792-31.188-98.112-72.772 6.498.974 12.995 1.624 19.818 1.624 9.421 0 18.843-1.3 27.614-3.573-48.081-9.747-84.143-51.98-84.143-102.985v-1.299c13.969 7.797 30.214 12.67 47.431 13.319-28.264-18.843-46.781-51.005-46.781-87.391 0-19.492 5.197-37.36 14.294-52.954 51.655 63.675 129.3 105.258 216.365 109.807-1.624-7.797-2.599-15.918-2.599-24.04 0-57.828 46.782-104.934 104.934-104.934 30.213 0 57.502 12.67 76.67 33.137 23.715-4.548 46.456-13.32 66.599-25.34-7.798 24.366-24.366 44.833-46.132 57.827 21.117-2.273 41.584-8.122 60.426-16.243-14.292 20.791-32.161 39.308-52.628 54.253z"/></svg> @aruhil](http://twitter.com/aruhil)  
[<svg style="height:0.8em;top:.04em;position:relative;" viewBox="0 0 512 512"><path d="M326.612 185.391c59.747 59.809 58.927 155.698.36 214.59-.11.12-.24.25-.36.37l-67.2 67.2c-59.27 59.27-155.699 59.262-214.96 0-59.27-59.26-59.27-155.7 0-214.96l37.106-37.106c9.84-9.84 26.786-3.3 27.294 10.606.648 17.722 3.826 35.527 9.69 52.721 1.986 5.822.567 12.262-3.783 16.612l-13.087 13.087c-28.026 28.026-28.905 73.66-1.155 101.96 28.024 28.579 74.086 28.749 102.325.51l67.2-67.19c28.191-28.191 28.073-73.757 0-101.83-3.701-3.694-7.429-6.564-10.341-8.569a16.037 16.037 0 0 1-6.947-12.606c-.396-10.567 3.348-21.456 11.698-29.806l21.054-21.055c5.521-5.521 14.182-6.199 20.584-1.731a152.482 152.482 0 0 1 20.522 17.197zM467.547 44.449c-59.261-59.262-155.69-59.27-214.96 0l-67.2 67.2c-.12.12-.25.25-.36.37-58.566 58.892-59.387 154.781.36 214.59a152.454 152.454 0 0 0 20.521 17.196c6.402 4.468 15.064 3.789 20.584-1.731l21.054-21.055c8.35-8.35 12.094-19.239 11.698-29.806a16.037 16.037 0 0 0-6.947-12.606c-2.912-2.005-6.64-4.875-10.341-8.569-28.073-28.073-28.191-73.639 0-101.83l67.2-67.19c28.239-28.239 74.3-28.069 102.325.51 27.75 28.3 26.872 73.934-1.155 101.96l-13.087 13.087c-4.35 4.35-5.769 10.79-3.783 16.612 5.864 17.194 9.042 34.999 9.69 52.721.509 13.906 17.454 20.446 27.294 10.606l37.106-37.106c59.271-59.259 59.271-155.699.001-214.959z"/></svg> aniruhil.org](https://aniruhil.org)  
[<svg style="height:0.8em;top:.04em;position:relative;" viewBox="0 0 512 512"><path d="M476 3.2L12.5 270.6c-18.1 10.4-15.8 35.6 2.2 43.2L121 358.4l287.3-253.2c5.5-4.9 13.3 2.6 8.6 8.3L176 407v80.5c0 23.6 28.5 32.9 42.5 15.8L282 426l124.6 52.2c14.2 6 30.4-2.9 33-18.2l72-432C515 7.8 493.3-6.8 476 3.2z"/></svg> ruhil@ohio.edu](mailto:ruhil@ohio.edu)