Work with Dates and Times

Ani Ruhil

Work with Dates and Times

If you want to do anything with dates, it helps to have them formatted correctly, and that is not always the case with secondary data we grab for analysis. Consequently, in this module we will see ways to work with date and time variables via base R and {lubridate}, a special package to work with date/time data. We’ll start by creating some data values.

Author

Affiliation

Ani Ruhil

Published

Feb. 16, 2022

Citation

Ruhil, 2022

If you want to do anything with dates, it helps to have them formatted correctly, and that is not always the case with secondary data we grab for analysis. Consequently, in this module we will see ways to work with date and time variables via base R and {lubridate}, a special package to work with date/time data. We’ll start by creating some data values.

Named	As Formatted
basedate	1970-1-1
today	`Sys.Date()`
tomorrow	2017-12-18
yesterday1	12/16/2017
yesterday2	12-16-2017
yesterday3	12 16 2017

Let us create these.

as.Date("1970-1-1") -> basedate 
Sys.Date() -> today 
as.Date("2017-12-18") -> tomorrow 
"12/16/2017" -> yesterday1 
"12-16-17" -> yesterday2 
"12 16 17" -> yesterday3

Now let us see how R treats each of these.

str(
  list(basedate, today, tomorrow,
       yesterday1, yesterday2, yesterday3)
    )

List of 6
 $ : Date[1:1], format: "1970-01-01"
 $ : Date[1:1], format: "2022-02-16"
 $ : Date[1:1], format: "2017-12-18"
 $ : chr "12/16/2017"
 $ : chr "12-16-17"
 $ : chr "12 16 17"

Clearly, only basedate, today and tomorrow are read as date fields; the other three are read as characters.

Using `base R`

This latter group can be flipped into date fields:

as.Date(
  yesterday1, format = "%m/%d/%Y"
  ) -> yesterday1d 

as.Date(
  yesterday2, format = "%m-%d-%y"
  ) -> yesterday2d 

as.Date(
  yesterday3, format = "%m %d %y"
  ) -> yesterday3d 

yesterday1d; yesterday2d; yesterday3d

[1] "2017-12-16"

[1] "2017-12-16"

[1] "2017-12-16"

There are special switches that indicate date components when you format = ""

Code	Represents	Example
%a	Day spelled out (abbreviated)	Mon
%A	Day spelled out	Monday
%d	Day of the month	16
%m	Month	12
%b	Month (abbreviated)	Dec
%B	Month (fully spelled out)	December
%y	Year (2-digits)	17
%Y	Year (4-digits)	2017

So if you ever run into a date field that needs formatting, what we have covered should help you convert it into a proper date formatted variable.

Once you convert dates to proper date formats, you can extract other quantities embedded in them.

weekdays(yesterday1d)

[1] "Saturday"

months(today)

[1] "February"

quarters(today)

[1] "Q1"

Sometimes you will have a number representing the date, with the number representing the number of days since (if positive) or before (if negative) some date of origin. R uses 1970-01-01 but other software packages may (and some do) use different days. Thus, for example, if the dates are listed as

c(17651, -2345, 19760) -> x

Then these numbers represent the following dates:

as.Date(x, origin = "1970-01-01") -> x.dates 

x.dates

[1] "2018-04-30" "1963-08-01" "2024-02-07"

You can also convert dates into the numbers representing dates.

These are Julian days

julian(
  x.dates, origin = as.POSIXct("1970-01-01", tz = "UCT")
  ) -> x.julian

x.julian

[1] 17651 -2345 19760
attr(,"origin")
[1] "1970-01-01 UTC"

What if I wanted to create a sequence of dates, starting with Feb 8, 2018 but increasing by 1 day at a time or by every 5th day?

as.Date(
  as.character("2018-02-08"), 
  format = "%Y-%m-%d"
  ) -> start.date 

seq(
  start.date, 
  by = 1, 
  length.out = 7
  ) -> date.seq1 

date.seq1

[1] "2018-02-08" "2018-02-09" "2018-02-10" "2018-02-11" "2018-02-12"
[6] "2018-02-13" "2018-02-14"

seq(
  start.date, 
  by = 5, 
  length.out = 3
  ) -> date.seq5 

date.seq5

[1] "2018-02-08" "2018-02-13" "2018-02-18"

Note that length.out = is specifying how many you dates want to create.

Now, what if I want to know the date 30 days from today? 19 days ago?

today + 30 -> date.30 

today - 19 -> date.19

How many days lapsed between two dates?

as.Date("2017-04-28") -> date1 

Sys.Date() -> date2 

(date2 - date1) -> lapsed.time 

lapsed.time

Time difference of 1755 days

Say I want to create a vector of dates that starts and ends on specific dates, and the step function is 1 day, a week, 4 months, etc. The step is indicated with the by = "" command

seq(
  from = as.Date("2017-12-17"),
  to = as.Date("2018-12-16"),
  by = "day"
  ) -> my.dates1 

seq(
  from = as.Date("2017-12-17"),
  to = as.Date("2018-12-16"),
  by = "week"
  ) -> my.dates2 

seq(
  from = as.Date("2017-12-17"),
  to = as.Date("2018-12-16"),
  by = "month"
  ) -> my.dates3 

seq(
  from = as.Date("2017-12-17"),
  to = as.Date("2018-12-16"),
  by = "3 days"
  ) -> my.dates4 

seq(
  from = as.Date("2017-12-17"),
  to = as.Date("2018-12-16"),
  by = "2 weeks"
  ) -> my.dates5 

seq(
  from = as.Date("2017-12-17"),
  to = as.Date("2018-12-16"),
  by = "4 months"
  ) -> my.dates6 

seq(
  from = as.Date("2017-12-17"),
  to = as.Date("2019-12-16"),
  by = "year"
  ) -> my.dates7 

seq(
  from = as.Date("2017-12-17"),
  to = as.Date("2022-12-16"),
  by = "2 years"
  ) -> my.dates8

We can also extract components, say from dates in the economics data-set.

library(ggplot2)
data(economics)
names(economics)

[1] "date"     "pce"      "pop"      "psavert"  "uempmed"  "unemploy"

format(economics$date, "%Y") -> economics$year 

format(economics$date, "%B") -> economics$month 

quarters(
  economics$date, 
  abbreviate = FALSE
  ) -> economics$quarter 

head(economics[,  c(1, 7:9)])

# A tibble: 6 × 4
  date       year  month     quarter
  <date>     <chr> <chr>     <chr>  
1 1967-07-01 1967  July      Q3     
2 1967-08-01 1967  August    Q3     
3 1967-09-01 1967  September Q3     
4 1967-10-01 1967  October   Q4     
5 1967-11-01 1967  November  Q4     
6 1967-12-01 1967  December  Q4

Using `{lubridate}`

This package makes working with dates and times quite easy because it does what base R does but in more intuitive ways and perhaps more flexibly. Let us start with some date fields.

"20171217" -> today1 
"2017-12-17" -> today2 
"2017 December 17" -> today3 
"20171217143241" -> today4 
"2017 December 17 14:32:41" -> today5 
"December 17 2017 14:32:41" -> today6 
"17-Dec, 2017 14:32:41" -> today7

These formats are quite varied but {lubridate} deals with them quite seamlessly so long as you pay attention to the order – is year first or last? What about month? day? Is time given in hours, minutes and seconds?.

library(lubridate)
ymd(today1)

[1] "2017-12-17"

ymd(today2)

[1] "2017-12-17"

ymd(today3)

[1] "2017-12-17"

ymd_hms(today4)

[1] "2017-12-17 14:32:41 UTC"

ymd_hms(today5)

[1] "2017-12-17 14:32:41 UTC"

mdy_hms(today6)

[1] "2017-12-17 14:32:41 UTC"

dmy_hms(today7)

[1] "2017-12-17 14:32:41 UTC"

The default time zone will be UTC, the Coordinated Universal Time clock. If you want to change it you must specify the time zone.

dmy_hms(today7, tz = "EST")

[1] "2017-12-17 14:32:41 EST"

I am planning my spring break travel to Italy and want to know when my flight departs on March 10.

now("America/New_York") -> now.charleston 
now.charleston

[1] "2022-02-16 12:08:33 EST"

now.charleston + days(30) -> my.departure 
my.departure

[1] "2022-03-18 12:08:33 EDT"

To look-up time zones, run

grep("America",  OlsonNames(), value =  TRUE)

[1] "America/Adak"                   "America/Anchorage"             
[3] "America/Anguilla"               "America/Antigua"               
[5] "America/Araguaina"              "America/Argentina/Buenos_Aires"

Hang on a second, the boarding pass shows departure time to be at 11:41 AM. Ah, easily corrected:

update(
  my.departure, hours = 11, minutes = 41, seconds = 0
  ) -> corrected.departure

corrected.departure

[1] "2022-03-18 11:41:00 EDT"

If I need to extract data/time elements, I can do that as well.

dmy_hms(today7) -> today 

today # a date and time set in today7

[1] "2017-12-17 14:32:41 UTC"

year(today) -> today.y 

today.y # The year

[1] 2017

month(today) -> today.m1 

today.m1 # the month, as a number

[1] 12

month(
  today, label = TRUE, abbr = TRUE
  ) -> today.m2 

today.m2 # labeling the month but with an abbreviation

[1] Dec
12 Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < ... < Dec

month(
  today, label = TRUE, abbr = FALSE
  ) -> today.m3

today.m3# fully labelling the month

[1] December
12 Levels: January < February < March < April < May < ... < December

week(today) -> today.w 

today.w # what week of the year is it?

[1] 51

yday(today) -> today.doy 

today.doy # what day of the year is it?

[1] 351

mday(today) -> today.dom 

today.dom # what day of the month is it?

[1] 17

wday(today) -> today.dow1 

today.dow1 # what day of the week is it,  as a number?

[1] 1

wday(
  today, label = TRUE, abbr = TRUE
  ) -> today.dow2 

today.dow2 #  day of the week labelled but abbreviated

[1] Sun
Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat

wday(
  today, label = TRUE, abbr = FALSE
  ) -> today.dow3 

today.dow3 # day of the week fully labelled

[1] Sunday
7 Levels: Sunday < Monday < Tuesday < Wednesday < ... < Saturday

hour(today) -> today.h 

today.h # what hour is it?

[1] 14

minute(today) -> today.m 

today.m # what minute is it?

[1] 32

second(today) -> today.s 

today.s # what second is it?

[1] 41

tz(today) -> today.tz

today.tz # what time zone is it?

[1] "UTC"

`Intervals` vs. `Durations` vs. `Periods`

Calculating time lapsed between dates is tricky because you have to take into account daylight savings time

Daylight Saving Time (DST) is the practice of setting the clocks forward 1 hour from standard time during the summer months, and back again in the fall, in order to make better use of natural daylight. Source

² , leap years

Leap years are needed to keep our modern day Gregorian calendar in alignment with the Earth’s revolutions around the sun. It takes the Earth approximately 365.242189 days – or 365 days, 5 hours, 48 minutes, and 45 seconds – to circle once around the Sun. This is called a tropical year, and is measured from the March equinox. However, the Gregorian calendar has only 365 days in a year, so if we didn’t add a leap day on February 29 nearly every four years, we would lose almost six hours off our calendar every year. After only 100 years, our calendar would be off by around 24 days! Source

³ , and leap seconds.

Two components are used to determine UTC (Coordinated Universal Time): International Atomic Time (TAI): A time scale that combines the output of some 200 highly precise atomic clocks worldwide, and provides the exact speed for our clocks to tick. Universal Time (UT1), also known as Astronomical Time, refers to the Earth’s rotation around its own axis, which determines the length of a day. Before the difference between UTC and UT1 reaches 0.9 seconds, a leap second is added to UTC and to clocks worldwide. By adding an additional second to the time count, our clocks are effectively stopped for that second to give Earth the opportunity to catch up. Source.

⁴ These render the length of months, weeks, days, hours, and minutes to be relative units of time but seconds tends to be exact units of time. As a result, and by design, lubridate differentiates between intervals, durations, and periods, each measuring time spans in different ways.

Let us start with a duration, the simplest measure of lapsed time since it measures the passing of time in seconds. Say I pick two dates and times … 05:00am on March 9, 2019 and 05:00am on March 10, 2019.

ymd_hms(
  "2019-03-09 05:00:00", tz = "US/Eastern"
  ) -> date1 

ymd_hms(
  "2019-03-10 05:00:00", tz = "US/Eastern"
  ) -> date2 

interval(date1, date2) -> timint 

timint

[1] 2019-03-09 05:00:00 EST--2019-03-10 05:00:00 EDT

How much time has lapsed between date1 and date2? We calculate with as.duration()

as.duration(timint) -> timelapsed.d 

timelapsed.d

[1] "82800s (~23 hours)"

Why is it 23 hours and not 24 hours? Because daylight savings kicks in at 2:00 AM on March 10, and since the clocks kick forward by one hour, only 23 hours have passed, not 24 as we might naively expect. Go back and look at the output from timeint; see the EST versus EDT?

In contrast to duration, if I ask for the period of time, what do I get?

as.period(timint) -> timelapsed.p 

timelapsed.p

[1] "1d 0H 0M 0S"

Aha! as.period() is imprecise because it tells me 1 day has passed.

Now, we can use the duration of time lapsed in whatever units we want, but if we want accuracy, we better work with durations, as shown below.

time_length(timelapsed.d, unit = "second")

[1] 82800

time_length(timelapsed.d, unit = "minute")

[1] 1380

time_length(timelapsed.d, unit = "hour")

[1] 23

time_length(timelapsed.d, unit = "day")

[1] 0.9583333

time_length(timelapsed.d, unit = "week")

[1] 0.1369048

time_length(timelapsed.d, unit = "month")

[1] 0.03148528

time_length(timelapsed.d, unit = "year")

[1] 0.002623774

If, unfortunately, we rely on periods, well then we have inherited a problem!

time_length(timelapsed.p, unit = "second")

[1] 86400

time_length(timelapsed.p, unit = "minute")

[1] 1440

time_length(timelapsed.p, unit = "hour")

[1] 24

time_length(timelapsed.p, unit = "day")

[1] 1

time_length(timelapsed.p, unit = "week")

[1] 0.1428571

time_length(timelapsed.p, unit = "month")

[1] 0.03285421

time_length(timelapsed.p, unit = "year")

[1] 0.002737851

time_length(timint, unit = "second")

[1] 82800

time_length(timint, unit = "minute")

[1] 1380

time_length(timint, unit = "hour")

[1] 23

time_length(timint, unit = "day")

[1] 0.9583333

time_length(timint, unit = "week")

[1] 0.1369048

time_length(timint, unit = "month")

[1] 0.03095559

time_length(timint, unit = "year")

[1] 0.002618695

The differences will be unclear until you mess around with lubridate for a while and the nuances sink in. Look at the table for a summary of the differences. Note how similar duration and interval values are through week. But month and year are vastly different. The period column is off by design since it does not take into account daylight savings time; this we know. The bottom-line: Make sure you are working with as.duration() because duration calculates everything in seconds.

unit of time	duration	period	interval
second	8.280000e+04	8.640000e+04	8.280000e+04
minute	1.380000e+03	1.440000e+03	1.380000e+03
hour	2.300000e+01	2.400000e+01	2.300000e+01
day	9.583333e-01	1.000000e+00	9.583333e-01
week	1.369048e-01	1.428571e-01	1.369048e-01
month	3.148528e-02	3.285421e-02	3.095559e-02
year	2.623774e-03	2.737851e-03	2.618695e-03

Calculating & working with time in the flights data

Let us load the CMH flights data we used earlier to learn dplyr.

load(here::here("data", "cmhflights2017.RData"))

library(janitor)
clean_names(cmhflights2017) -> cmh.df

Say we had to construct a date field from year, month, and day_of_week. This could be done with unite(), as shown below.

library(tidyverse)
cmh.df %>%
  unite(
    "fltdate",
    c("year", "month", "day_of_week"),
    sep = "-",
    remove = FALSE
    ) -> cmh.df

What if I wanted to see which month, and then which day of the week has the most number of flights?

cmh.df %>%
  mutate(
    monthnamed = month(fltdate, label = TRUE, abbr = TRUE),
    weekday = wday(fltdate, label = TRUE, abbr = TRUE)
    ) -> cmh.df 

table(cmh.df$monthnamed)


 Jan  Feb  Mar  Apr  May  Jun  Jul  Aug  Sep  Oct  Nov  Dec 
3757 3413 4101 4123 4098 4138 4295 4279 3789 4027 3951 3816

table(cmh.df$weekday)


 Sun  Mon  Tue  Wed  Thu  Fri  Sat 
6920 6747 6805 6795 6756 6808 6956

library(hrbrthemes)
library(viridis)

cmh.df %>% 
  group_by(monthnamed) %>%
  summarise(nflights = n()) %>%
  ggplot(aes(x = monthnamed, y = nflights, fill = nflights)) +
  geom_bar(stat = "identity") +
  labs(x = "Flight Month",
       y = "Number of flights",
       fill = "") +
  theme_ipsum_rc() +
  scale_fill_viridis(option = "viridis", direction = -1) + 
  theme(legend.position = "bottom")

cmh.df %>% 
  group_by(weekday) %>%
  summarise(nflights = n()) %>%
  ggplot(aes(x = weekday, y = nflights, fill = nflights)) +
  geom_bar(stat = "identity") +
  labs(x = "Flight Day of the Week",
       y = "Number of flights",
       fill = "") +
  theme_ipsum_rc() +
  scale_fill_viridis(option = "viridis", direction = -1) + 
  theme(legend.position = "bottom")

Hmm, I’d like the week to start on Monday so I can better show the obvious – most flights occur over the weekend.

cmh.df %>% 
  mutate(
    weekday.f = ordered(
      weekday,
      levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
      )
    ) %>%
  group_by(weekday.f) %>%
  summarise(nflights = n()) %>%
  ggplot(
    aes(x = weekday.f, y = nflights, fill = nflights)
    ) +
  geom_bar(stat = "identity") +
  labs(x = "Flight Day of the Week",
       y = "Number of flights",
       fill = "") +
  theme_ipsum_rc() +
  scale_fill_viridis(option = "viridis", direction = -1) + 
  theme(legend.position = "bottom")

What time of the day do most/least flights depart? Let us start by creating a date-time field. You will see the code below steps into it by splitting dep_time into hours and minutes and then combines them with a : separating the two elements, then unites date with hourmin before flipping it into deptime stored in ymd_hm() format.

cmh.df %>%
  separate(
    crs_dep_time,
    c("hour", "minute"),
    sep = 2,
    remove = FALSE
    ) %>%
  unite(
    hourmin,
    c("hour", "minute"),
    sep = ":",
    remove = FALSE
    ) %>% 
  unite(
    depdatetime,
    c("fltdate", "hourmin"),
    sep = " ",
    remove = FALSE
    ) %>%
  mutate(
    deptime = ymd_hm(depdatetime)
    ) -> cmh.df

Now I want the number of flights by the minute of the hour when the flight departed. .

cmh.df %>%
  mutate(fltminute = minute(deptime)) %>%
  group_by(fltminute) %>%
  summarise(nflights = n()) %>%
  ggplot(
    aes(x = fltminute, y = nflights, fill = nflights)
    ) +
  geom_bar(stat = "identity") +
  labs(x = "Flight Minute",
       y = "Number of flights",
       fill = "") +
  theme_ipsum_rc() +
  scale_fill_viridis(option = "viridis", direction = -1) + 
  theme(legend.position = "bottom")

Quite clearly, most flights depart 25 minutes after the hour, followed by 55 minutes after the hour, and then at 30 minutes after the hour, and then at the hour. Fair enough, but does this differ by the day of the week? All days are not the same!

cmh.df %>%
  mutate(
    weekday.f = ordered(
      weekday,
      levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
      )
    ) %>%
  mutate(fltminute = minute(deptime)) %>%
  group_by(weekday.f, fltminute) %>%
  summarise(nflights = n()) %>%
  ggplot(
    aes(x = fltminute, y = nflights, fill = nflights)
    ) +
  geom_bar(stat = "identity") +
  labs(x = "Flight Minute",
       y = "Number of flights",
       fill = "") +
  theme_ipsum_rc() +
  scale_fill_gradient(low = "white", high = "red") + 
  theme(legend.position = "bottom") + 
  facet_wrap(~ weekday.f, ncol = 2)

But maybe what we really want is what departure minute is associated with higher/lower average arrival delays (in minutes).

cmh.df %>%
  mutate(fltminute = minute(deptime)) %>%
  group_by(fltminute) %>%
  summarise(delay = mean(arr_delay, na.rm = TRUE),
            nflights = n()) %>%
  ggplot(
    aes(x = fltminute, y = delay, color = nflights)
    ) +
  geom_line() +
  labs(x = "Flight Minute",
       y = "Mean Arrival Delay",
       color = "Number of Flights in the Mean") +
  theme_ft_rc() +
  scale_color_gradient(low = "white", high = "red") + 
  theme(legend.position = "bottom")

These calculations have been made with the scheduled departure time. So we could also ask about mean departure delays based on scheduled departure times. We could also keep it clean by assuming the destination is Columbus, and by insisting that at least 30 flights must exist for that minute otherwise the mean is not calculated.

cmh.df %>%
  separate(
    crs_dep_time,
    c("hour", "minute"),
    sep = 2,
    remove = FALSE
    ) %>%
  unite(
    hourmin,
    c("hour", "minute"),
    sep = ":",
    remove = FALSE
    ) %>% 
  unite(
    depdatetime,
    c("fltdate", "hourmin"), 
    sep = " ", 
    remove = FALSE
    ) %>%
  mutate(deptime = ymd_hm(depdatetime)) %>%
  mutate(fltminute = minute(deptime)) %>%
  filter(dest == "CMH") %>% 
  group_by(fltminute) %>%
  summarise(
    delay = mean(dep_delay, na.rm = TRUE),
    nflights = n()
    ) %>%
  filter(nflights >= 30) %>%
  ggplot(
    aes(x = fltminute, y = delay, color = nflights)
    ) +
  geom_line() +
  labs(x = "Scheduled Flight Minute",
       y = "Mean Departure Delay",
       color = "Number of Flights in the Mean") +
  theme_ft_rc() +
  scale_color_gradient(low = "white", high = "red") + 
  theme(legend.position = "bottom")

Does this vary by airline?

cmh.df %>%
  separate(
    crs_dep_time,
    c("hour", "minute"),
    sep = 2,
    remove = FALSE
    ) %>%
  unite(
    hourmin,
    c("hour", "minute"),
    sep = ":",
    remove = FALSE
    ) %>% 
  unite(
    depdatetime,
    c("fltdate", "hourmin"),
    sep = " ",
    remove = FALSE
    ) %>%
  mutate(deptime = ymd_hm(depdatetime)) %>%
  mutate(fltminute = minute(deptime)) %>%
  filter(dest == "CMH") %>% 
  group_by(fltminute, reporting_airline) %>%
  summarise(
    delay = mean(dep_delay, na.rm = TRUE),
    nflights = n()
    ) %>%
  filter(nflights >= 30) %>%
  ggplot(
    aes(x = fltminute, y = delay, color = nflights)
    ) +
  geom_line() +
  labs(x = "Scheduled Flight Minute",
       y = "Mean Departure Delay",
       color = "Number of Flights in the Mean") +
  theme_ft_rc() +
  scale_color_gradient(low = "white", high = "red") + 
  theme(legend.position = "bottom") + 
  facet_wrap(~ reporting_airline)

And then here are arrival delays by airlines flying to Columbus.

cmh.df %>%
  separate(
    crs_dep_time, 
    c("hour", "minute"), 
    sep = 2, 
    remove = FALSE
    ) %>%
  unite(
    hourmin, 
    c("hour", "minute"), 
    sep = ":", 
    remove = FALSE
    ) %>% 
  unite(
    depdatetime, 
    c("fltdate", "hourmin"), 
    sep = " ", 
    remove = FALSE
    ) %>%
  mutate(deptime = ymd_hm(depdatetime)) %>%
  mutate(fltminute = minute(deptime)) %>%
  filter(dest == "CMH") %>% 
  group_by(fltminute, reporting_airline) %>%
  summarise(
    delay = mean(arr_delay, na.rm = TRUE),
    nflights = n()
    ) %>%
  filter(nflights >= 30) %>%
  ggplot(
    aes(x = fltminute, y = delay, color = nflights)
    ) +
  geom_line() +
  labs(x = "Scheduled Flight Minute",
       y = "Mean Arrival Delay",
       color = "Number of Flights in the Mean") +
  theme_ft_rc() +
  scale_color_gradient(low = "white", high = "red") + 
  theme(legend.position = "bottom") + 
  facet_wrap(~ reporting_airline)

One final query. Say we are curious about how often a particular aircraft flies. How might we calculate this time span?

cmh.df %>%
  filter(!is.na(tail_number)) %>% 
  group_by(tail_number) %>%
  arrange(deptime) %>%
  mutate(nflew = row_number()) %>%
  select(deptime, tail_number, nflew) %>%
  arrange(tail_number, nflew) -> cmh.df2

The preceding code has basically flagged each aircraft’s flight sequence (basically when it was first seen, seen for the second time, and so on). Let us now calculate the time lapsed between each flight to/from CMH of each aircraft.

cmh.df2 %>%
  ungroup() %>%
  arrange(tail_number) %>% 
  group_by(tail_number) %>%
  mutate(
    tspan = interval(lag(deptime, order_by = tail_number), deptime), 
    tspan.minutes = as.duration(tspan)/dminutes(1),
    tspan.hours = as.duration(tspan)/dhours(1),
    tspan.days = as.duration(tspan)/ddays(1),
    tspan.weeks = as.duration(tspan)/dweeks(1)
    ) -> cmh.df2 

mean(cmh.df2$tspan.minutes, na.rm = TRUE)

[1] 15834.02

median(cmh.df2$tspan.minutes, na.rm = TRUE)

[1] 612

mean(cmh.df2$tspan.days, na.rm = TRUE)

[1] 10.99585

median(cmh.df2$tspan.days, na.rm = TRUE)

[1] 0.425

If you look at the data you realize that the overwhelming majority of aircraft are flying to/from CMH within a median span of 609 minutes (a shade over 10 hours). This might differ by airline, or does it? Sure it does!

cmh.df %>%
  arrange(tail_number, deptime) %>% 
  group_by(tail_number, reporting_airline) %>%
  select(deptime, tail_number, reporting_airline) %>% 
  mutate(
    tspan = interval(lag(deptime), deptime), 
    tspan.minutes = as.duration(tspan)/dminutes(1),
    tspan.days = as.duration(tspan)/ddays(1)
    ) %>% 
  group_by(reporting_airline) %>%
  summarise(md.tspan = median(tspan.minutes, na.rm = TRUE)) %>%
  arrange(md.tspan) -> cmh.ts

Table 1: Median number of minutes between aircraft flight in databse
Airline	Median Minutes
OO	243
EV	263
WN	585
DL	686
F9	735
UA	738
AA	765

A Few Other Resources

See the book chapter for more advanced work with dates and times and some practice exercises.

Footnotes

These are Julian days [↩]
Daylight Saving Time (DST) is the practice of setting the clocks forward 1 hour from standard time during the summer months, and back again in the fall, in order to make better use of natural daylight. Source [↩]
Leap years are needed to keep our modern day Gregorian calendar in alignment with the Earth’s revolutions around the sun. It takes the Earth approximately 365.242189 days – or 365 days, 5 hours, 48 minutes, and 45 seconds – to circle once around the Sun. This is called a tropical year, and is measured from the March equinox. However, the Gregorian calendar has only 365 days in a year, so if we didn’t add a leap day on February 29 nearly every four years, we would lose almost six hours off our calendar every year. After only 100 years, our calendar would be off by around 24 days! Source [↩]
Two components are used to determine UTC (Coordinated Universal Time): International Atomic Time (TAI): A time scale that combines the output of some 200 highly precise atomic clocks worldwide, and provides the exact speed for our clocks to tick. Universal Time (UT1), also known as Astronomical Time, refers to the Earth’s rotation around its own axis, which determines the length of a day. Before the difference between UTC and UT1 reaches 0.9 seconds, a leap second is added to UTC and to clocks worldwide. By adding an additional second to the time count, our clocks are effectively stopped for that second to give Earth the opportunity to catch up. Source.[↩]

Citation

For attribution, please cite this work as

Ruhil (2022, Feb. 16). Work with Dates and Times. Retrieved from https://aniruhil.org/courses/mpa6020/handouts/module05.html

BibTeX citation

@misc{ruhil2022work,
  author = {Ruhil, Ani},
  title = {Work with Dates and Times},
  url = {https://aniruhil.org/courses/mpa6020/handouts/module05.html},
  year = {2022}
}

Work with Dates and Times

Author

Affiliation

Published

Citation

Using base R

Using {lubridate}

Intervals vs. Durations vs. Periods

Calculating & working with time in the flights data

A Few Other Resources

Footnotes

Citation

Using `base R`

Using `{lubridate}`

`Intervals` vs. `Durations` vs. `Periods`