Dates and Times in R

Ani Ruhil

Agenda

This week we learn how to work with dates and times

  • package of choice here is {lubridate}

  • can parse messy date and time values

  • can calculate passage of time

  • Julian date formats? No worries ...

  • extract date elements? Yup, that too ...
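
As a quick taste of the last two bullets, here is a minimal sketch (not part of the original deck) that pulls elements out of a date and round-trips a year-plus-day-of-year "Julian" value. It assumes "Julian date" means the ordinal day of the year; the example date and the object names d and julian are made up for illustration.

```r
library(lubridate)

d <- ymd("2017-12-17")

# Extract individual date elements
year(d)   # 2017
month(d)  # 12
day(d)    # 17
yday(d)   # 351, the day of the year

# Parse a year + day-of-year string with base R's %j format code
julian <- as.Date("2017351", format = "%Y%j")
julian == d  # TRUE
```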

"Date-time data can be frustrating to work with in R. R commands for date-times are generally unintuitive and change depending on the type of date-time object being used. Moreover, the methods we use with date-times must be robust to time zones, leap days, daylight savings times, and other time related quirks, and R lacks these capabilities in some situations. Lubridate makes it easier to do the things R does with date-times and possible to do the things R does not."

First up, some mangled date entries. We'll see how to parse them into proper dates!

"20171217" -> today1
"2017-12-17" -> today2
"2017 December 17" -> today3
"20171217143241" -> today4
"2017 December 17 14:32:41" -> today5
"December 17 2017 14:32:41" -> today6
"17-Dec, 2017 14:32:41" -> today7

Now we fix them up!

library(tidyverse)
library(lubridate)
ymd(today1) -> date1
ymd(today2) -> date2
ymd(today3) -> date3
date1; date2; date3
## [1] "2017-12-17"
## [1] "2017-12-17"
## [1] "2017-12-17"

today1, today2, and today3 all had the same structure of year-month-day and so ymd() works to get the format right. today4 has year-month-day-hours-minutes-seconds so we'll have to do this one slightly differently. The same thing works for today5 as well.

ymd_hms(today4) -> date4
ymd_hms(today5) -> date5
date4; date5
## [1] "2017-12-17 14:32:41 UTC"
## [1] "2017-12-17 14:32:41 UTC"

today6 has a slightly different format, month-day-year-hours-minutes-seconds that is read in thus:

mdy_hms(today6) -> date6
date6
## [1] "2017-12-17 14:32:41 UTC"

today7 has a slightly different format, day-month-year-hours-minutes-seconds that is read in thus:

dmy_hms(today7) -> date7
date7
## [1] "2017-12-17 14:32:41 UTC"
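
Every result above is stamped UTC because that is lubridate's default when no time zone is given. If the timestamps are really local clock times, a sketch like the following may help; the choice of "America/New_York" is an assumption made for illustration, not something the original strings encode.

```r
library(lubridate)

# Interpret the same string as Eastern time instead of the default UTC
local_time <- ymd_hms("20171217143241", tz = "America/New_York")
local_time

# Same instant, displayed on the UTC clock (EST is UTC-5 in December)
utc_time <- with_tz(local_time, tzone = "UTC")
utc_time
```

Note that with_tz() changes only how the instant is displayed, so local_time and utc_time compare as equal.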

Working with flight dates

Now we should be able to start working with some date variables, and the ideal candidate is the flight date column in our cmhflights data. So the first thing we will do is load that data set so that we can work with it.

library(here)
load(here("data", "cmhflights_01092017.RData"))

I dislike this uppercase-lowercase mixture they have in the column names and so will get rid of it as shown below, making everything nice and lowercase. This is done with the janitor package's clean_names() command. I am also going to use select() to keep only a handful of columns since keeping 100+ is of no value.

library(janitor)
cmhflights %>%
  clean_names() %>%
  select(
    year, month, dayof_month, day_of_week, flight_date, carrier,
    tail_num, flight_num, origin_city_name, dest_city_name,
    dep_time, dep_delay, arr_time, arr_delay, cancelled, diverted
  ) -> cmh.df

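The first thing I want to do now is to label the days of the week and the months, and then also create a flag for weekends versus weekdays. One caution: when wday() is given a number it treats 1 as Sunday, so the labels below are right only if day_of_week is coded that way. If the data follow the BTS on-time convention of 1 = Monday, every label is shifted by one day, and it is safer to compute the weekday from flight_date. Here goes: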

cmh.df %>%
  mutate(
    dayofweek = wday(
      day_of_week,
      abbr = FALSE,
      label = TRUE
    ),
    monthname = month(
      month,
      abbr = FALSE,
      label = TRUE
    ),
    weekend = case_when(
      dayofweek %in% c("Saturday", "Sunday") ~ "Weekend",
      TRUE ~ "Weekday"
    )
  ) -> cmh.df
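
Since wday() behaves differently for numbers and for dates, a small check is worthwhile before trusting the labels. The sketch below is illustrative only: the date is made up, and the recoding in the last step assumes day_of_week runs from 1 = Monday to 7 = Sunday, which should be verified against the data's documentation.

```r
library(lubridate)

# A bare number is read on lubridate's own scale, where 1 is Sunday
wday(1)                                # 1
wday(1, label = TRUE, abbr = FALSE)    # Sunday (in an English locale)

# A real date is labelled from the calendar; 2 January 2017 was a Monday
d <- ymd("2017-01-02")
wday(d)                                # 2
wday(d, label = TRUE, abbr = FALSE)    # Monday (in an English locale)

# Hypothetical recode: map 1 = Monday ... 7 = Sunday onto lubridate's scale
monday_first <- 1:7
sunday_first <- monday_first %% 7 + 1  # 2 3 4 5 6 7 1
wday(sunday_first, label = TRUE)
```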

Now let us ask some questions: (a) What month had the most flights? (b) What day of the week had the most flights?

cmh.df %>%
  count(monthname, sort = TRUE) # (a)
## # A tibble: 9 x 2
## monthname n
## <ord> <int>
## 1 July 4295
## 2 August 4279
## 3 June 4138
## 4 April 4123
## 5 March 4101
## 6 May 4098
## 7 September 3789
## 8 January 3757
## 9 February 3413
cmh.df %>%
  count(dayofweek, sort = TRUE) # (b)
## # A tibble: 7 x 2
## dayofweek n
## <ord> <int>
## 1 Wednesday 5435
## 2 Thursday 5417
## 3 Sunday 5395
## 4 Tuesday 5368
## 5 Monday 5284
## 6 Saturday 4892
## 7 Friday 4202

(c) What about weekends; did weekends have more flights than weekdays? (d) With respect to (c), does whatever pattern we see vary by month or does month not matter?

cmh.df %>%
  count(weekend, sort = TRUE) # (c)
## # A tibble: 2 x 2
## weekend n
## <chr> <int>
## 1 Weekday 25706
## 2 Weekend 10287
cmh.df %>%
  count(monthname, weekend, sort = TRUE) # (d)
## # A tibble: 18 x 3
## monthname weekend n
## <ord> <chr> <int>
## 1 August Weekday 3165
## 2 March Weekday 3047
## 3 June Weekday 3023
## 4 July Weekday 2918
## 5 May Weekday 2908
## 6 April Weekday 2876
## 7 September Weekday 2760
## 8 January Weekday 2557
## 9 February Weekday 2452
## 10 July Weekend 1377
## 11 April Weekend 1247
## 12 January Weekend 1200
## 13 May Weekend 1190
## 14 June Weekend 1115
## 15 August Weekend 1114
## 16 March Weekend 1054
## 17 September Weekend 1029
## 18 February Weekend 961

So most flights are on weekdays. Weekend flights peak in July (1,377), while weekday flights peak in August (3,165).
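
Raw counts make question (d) hard to read because months differ in their total number of flights. One way to compare them, sketched here with four counts copied from the output above (the share column and the counts and shares objects are introduced only for illustration), is to express weekend flights as a share of each month's total.

```r
library(dplyr)

# Counts copied from the output above for two months
counts <- tibble(
  monthname = c("July", "July", "August", "August"),
  weekend   = c("Weekday", "Weekend", "Weekday", "Weekend"),
  n         = c(2918, 1377, 3165, 1114)
)

shares <- counts %>%
  group_by(monthname) %>%
  mutate(share = n / sum(n)) %>%  # share of that month's flights
  ungroup() %>%
  filter(weekend == "Weekend")

shares
```

On these counts the weekend share works out to roughly 32% in July against 26% in August.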

But wait a minute, if I can calculate these frequencies, why not do it by the hour? That would let us answer questions such as: What hour of the day has the most flights, or the most delays? What about by airline? What if we push this to the minute of the hour?

Well, first we will have to create a new variable that marks just the hour of the day in the 24-hour cycle. But to do this we first need a single flight_date_time column that ymd_hm() can parse (dep_time carries no seconds). How? With unite().

cmh.df %>%
  unite(
    col = "flight_date_time",
    c(flight_date, dep_time),
    sep = ":",
    remove = TRUE
  ) -> cmh.df

Okay, now we create flt_date_time. Note that the seconds are automatically set to 00.

cmh.df %>%
  mutate(
    flt_date_time = ymd_hm(flight_date_time)
  ) -> cmh.df

Now we extract the hour and the minute of the day the flight departed. Keep in mind that dep_time in the BTS on-time data is the actual departure time, not the scheduled one.

cmh.df %>%
  mutate(
    flt_hour = hour(flt_date_time),
    flt_minute = minute(flt_date_time)
  ) -> cmh.df
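
To see the unite(), ymd_hm(), and hour() steps in isolation, here is a sketch on a made-up two-row data frame. It assumes flight_date is stored as text like "2017-01-01" and dep_time as zero-padded "hhmm" text; the toy object and its values are hypothetical.

```r
library(tidyr)
library(lubridate)

toy <- data.frame(
  flight_date = c("2017-01-01", "2017-01-02"),
  dep_time    = c("0715", "1432"),
  stringsAsFactors = FALSE
)

# Glue date and time into one string, e.g. "2017-01-01:0715"
toy <- unite(toy, col = "flight_date_time", flight_date, dep_time, sep = ":")

# Parse year-month-day hour-minute; seconds default to 00
toy$flt_date_time <- ymd_hm(toy$flight_date_time)

toy$flt_hour   <- hour(toy$flt_date_time)    # hour on the 24-hour clock
toy$flt_minute <- minute(toy$flt_date_time)  # minute within the hour
toy
```

Values that ymd_hm() cannot parse, such as a missing dep_time, come back as NA, which is worth checking before counting by hour.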

All righty then, now we start digging in. What hour has the most flights, and does this vary by the day of the week? By the month?

cmh.df %>%
  count(flt_hour, sort = TRUE)
## # A tibble: 24 x 2
## flt_hour n
## <int> <int>
## 1 10 2626
## 2 17 2454
## 3 7 2448
## 4 16 2395
## 5 15 2392
## 6 14 2390
## 7 8 2331
## 8 9 2283
## 9 18 2268
## 10 6 2106
## # … with 14 more rows
cmh.df %>%
  count(monthname, flt_hour, sort = TRUE)
## # A tibble: 199 x 3
## monthname flt_hour n
## <ord> <int> <int>
## 1 May 10 376
## 2 August 8 328
## 3 June 10 325
## 4 March 17 323
## 5 July 8 319
## 6 May 15 317
## 7 April 17 314
## 8 May 18 314
## 9 March 7 313
## 10 January 16 312
## # … with 189 more rows

Looks like 10:00 and then 17:00 are your best bets if you are looking to catch a flight and want as many options as possible. On the flip side, these might also be the hours when flights get delayed more often because so many flights depart then!

Now I want to ask the question about delays: Are median delays higher at certain hours?

cmh.df %>%
  group_by(flt_hour) %>%
  summarise(
    md.delay = median(dep_delay, na.rm = TRUE)
  ) %>%
  arrange(-md.delay)
## # A tibble: 24 x 2
## flt_hour md.delay
## <int> <dbl>
## 1 3 290
## 2 2 233
## 3 1 174
## 4 0 137
## 5 23 49
## 6 21 6
## 7 18 2
## 8 19 1
## 9 15 0
## 10 16 0
## # … with 14 more rows
cmh.df %>%
  group_by(flt_hour) %>%
  summarise(
    md.delay = median(dep_delay, na.rm = TRUE)
  ) %>%
  arrange(md.delay)
## # A tibble: 24 x 2
## flt_hour md.delay
## <int> <dbl>
## 1 5 -4
## 2 6 -4
## 3 7 -4
## 4 8 -3
## 5 9 -3
## 6 10 -2
## 7 11 -2
## 8 12 -2
## 9 13 -2
## 10 14 -2
## # … with 14 more rows

The expected result: the shortest median delays are for departures between 5 and 7 AM, and delays grow as the day goes on, peaking for flights that leave between midnight and 3 AM. Bottom line: fly early.

Might this vary by destination?

cmh.df %>%
  group_by(dest_city_name, flt_hour) %>%
  summarise(
    md.delay = median(dep_delay, na.rm = TRUE)
  ) %>%
  arrange(-md.delay)
## # A tibble: 418 x 3
## # Groups: dest_city_name [26]
## dest_city_name flt_hour md.delay
## <chr> <int> <dbl>
## 1 Newark, NJ 6 1046
## 2 Newark, NJ 7 688
## 3 Denver, CO 14 489
## 4 Houston, TX 7 420.
## 5 Minneapolis, MN 0 381
## 6 Atlanta, GA 1 348
## 7 New York, NY 0 337
## 8 Tampa, FL 1 324
## 9 Nashville, TN 23 323
## 10 Fort Myers, FL 0 297
## # … with 408 more rows

Avoid flying to Newark, NJ, even at 6 or 7 AM. Might these delays vary by airline?

cmh.df %>%
  group_by(carrier, dest_city_name, flt_hour) %>%
  summarise(
    md.delay = median(dep_delay, na.rm = TRUE)
  ) %>%
  arrange(-md.delay)
## # A tibble: 656 x 4
## # Groups: carrier, dest_city_name [52]
## carrier dest_city_name flt_hour md.delay
## <chr> <chr> <int> <dbl>
## 1 EV Newark, NJ 6 1046
## 2 EV Chicago, IL 6 1024
## 3 EV Newark, NJ 7 688
## 4 DL Columbus, OH 5 526
## 5 F9 Denver, CO 14 489
## 6 DL Los Angeles, CA 15 481
## 7 AA Phoenix, AZ 15 463
## 8 EV Houston, TX 7 420.
## 9 UA Chicago, IL 0 394
## 10 DL Minneapolis, MN 0 381
## # … with 646 more rows

Worst early-morning delays are for EV, to Newark and to Chicago.
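
Medians this extreme can come from groups with only one or two flights, which the tables above cannot reveal. A hedged sketch of one safeguard, using made-up data and an arbitrary cutoff of three flights (both are assumptions for illustration), is to carry the group size along and filter on it.

```r
library(dplyr)

toy <- tibble(
  dest_city_name = c("City A", "City A", "City A", "City B"),
  flt_hour       = c(6, 6, 6, 6),
  dep_delay      = c(-3, 0, 5, 1046)
)

by_dest_hour <- toy %>%
  group_by(dest_city_name, flt_hour) %>%
  summarise(
    md.delay  = median(dep_delay, na.rm = TRUE),
    n_flights = n(),        # how many flights sit behind each median
    .groups   = "drop"
  ) %>%
  arrange(desc(md.delay))

by_dest_hour                          # City B tops the list on a single flight
filter(by_dest_hour, n_flights >= 3)  # only City A survives the cutoff
```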

Passage of Time

Let us assume we are interested in seeing how much time elapses between successive flights of each aircraft seen in the data. We know we can identify each unique aircraft by its tail_num. So let us first see how many times each aircraft is seen and create a new column called n_flew. Some rows of data are missing flt_date_time or tail_num so I will filter these out as well.

cmh.df %>%
  filter(
    !is.na(tail_num),
    !is.na(flt_date_time)
  ) %>%
  group_by(tail_num) %>%
  arrange(flt_date_time) %>%
  mutate(n_flew = row_number()) %>% # running flight count per aircraft
  select(tail_num, flt_date_time, n_flew) %>%
  arrange(-n_flew) -> cmh.df2
head(cmh.df2)
## # A tibble: 6 x 3
## # Groups: tail_num [1]
## tail_num flt_date_time n_flew
## <chr> <dttm> <int>
## 1 N396SW 2017-08-23 10:07:00 73
## 2 N396SW 2017-08-23 08:07:00 72
## 3 N396SW 2017-08-19 08:20:00 71
## 4 N396SW 2017-08-18 15:24:00 70
## 5 N396SW 2017-08-06 21:43:00 69
## 6 N396SW 2017-08-06 18:53:00 68

So far so good; N396SW is the winner and has well-earned its retirement.

Now we need to see how much time elapsed between flights, and this is just the difference between the preceding flt_date_time and the current flt_date_time. As we do this, note that by default the time span (tspan) is calculated in seconds.

cmh.df2 %>%
  group_by(tail_num) %>%
  arrange(flt_date_time) %>%
  mutate(
    tspan = interval(
      lag(flt_date_time, order_by = flt_date_time), flt_date_time
    ), # calculate the time span between successive flights recorded
    tspan.minutes = as.duration(tspan) / dminutes(1), # convert tspan into minutes
    tspan.hours = as.duration(tspan) / dhours(1), # convert tspan into hours
    tspan.days = as.duration(tspan) / ddays(1), # convert tspan into days
    tspan.weeks = as.duration(tspan) / dweeks(1) # convert tspan into weeks
  ) -> cmh.df2

Here, as.duration(tspan) gives the length of each span in seconds, and dividing by dminutes(1), a duration of 60 seconds, converts it into minutes. Dividing by dhours(1), which is 60 x 60 = 3600 seconds, converts it into hours, and so on for days and weeks. Thus if you divided by dhours(2) you would get the time span counted in 2-hour units.
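
The arithmetic can be checked by hand on two N396SW departures from this data, 09:30 and 12:19 on 2017-01-05. This is only a sketch; the object names t1, t2, and span are made up for illustration.

```r
library(lubridate)

t1 <- ymd_hms("2017-01-05 09:30:00")
t2 <- ymd_hms("2017-01-05 12:19:00")
span <- interval(t1, t2)

as.numeric(as.duration(span))    # 10140 seconds
as.duration(span) / dminutes(1)  # 169 minutes
as.duration(span) / dhours(1)    # about 2.82 hours
as.duration(span) / dhours(2)    # about 1.41 two-hour units
```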

cmh.df2 %>%
  filter(tail_num == "N396SW")
## # A tibble: 73 x 8
## # Groups: tail_num [1]
## tail_num flt_date_time n_flew tspan tspan.minutes tspan.hours
## <chr> <dttm> <int> <dbl> <dbl> <dbl>
## 1 N396SW 2017-01-05 09:30:00 1 NA NA NA
## 2 N396SW 2017-01-05 12:19:00 2 10140 169 2.82
## 3 N396SW 2017-01-11 08:34:00 3 504900 8415 140.
## 4 N396SW 2017-01-11 10:44:00 4 7800 130 2.17
## 5 N396SW 2017-01-19 10:31:00 5 690420 11507 192.
## 6 N396SW 2017-01-19 14:28:00 6 14220 237 3.95
## 7 N396SW 2017-02-10 08:23:00 7 1878900 31315 522.
## 8 N396SW 2017-02-10 10:32:00 8 7740 129 2.15
## 9 N396SW 2017-02-15 15:20:00 9 449280 7488 125.
## 10 N396SW 2017-02-15 18:15:00 10 10500 175 2.92
## # … with 63 more rows, and 2 more variables: tspan.days <dbl>, tspan.weeks <dbl>

There is a lot more we could do with time but the few things we have covered so far would be the more common tasks we usually encounter.
