If you want to do anything with dates, it helps to have them formatted correctly, and that is not always the case with secondary data we grab for analysis. Consequently, in this module we will see ways to work with date and time variables via base R and {lubridate}, a special package to work with date/time data. We’ll start by creating some data values.
If you want to do anything with dates, it helps to have them formatted correctly, and that is not always the case with secondary data we grab for analysis. Consequently, in this module we will see ways to work with date and time variables via base R and {lubridate}, a special package to work with date/time data. We’ll start by creating some data values.
| Named | As Formatted |
|---|---|
| basedate | 1970-1-1 |
| today | Sys.Date() |
| tomorrow | 2017-12-18 |
| yesterday1 | 12/16/2017 |
| yesterday2 | 12-16-2017 |
| yesterday3 | 12 16 2017 |
Let us create these dates before we see how R treats each.
as.Date("1970-1-1") -> basedate
Sys.Date() -> today
as.Date("2017-12-18") -> tomorrow
"12/16/2017" -> yesterday1
"12-16-17" -> yesterday2
"12 16 17" -> yesterday3 How does R treat each?
str(
list(basedate, today, tomorrow,
yesterday1, yesterday2, yesterday3)
)List of 6
$ : Date[1:1], format: "1970-01-01"
$ : Date[1:1], format: "2026-03-15"
$ : Date[1:1], format: "2017-12-18"
$ : chr "12/16/2017"
$ : chr "12-16-17"
$ : chr "12 16 17"
Clearly, only basedate, today and tomorrow are read as date fields; the other three are read as characters. But we need each to be a proper date field. How do we flip a date stored as a character into a proper date format that R can recognize and work with? A few ways to do this but let us start with base R.
base RThe three date-as-character offenders can be flipped into date fields as follows:
as.Date(
yesterday1,
format = "%m/%d/%Y"
) -> yesterday1d
as.Date(
yesterday2,
format = "%m-%d-%y"
) -> yesterday2d
as.Date(
yesterday3,
format = "%m %d %y"
) -> yesterday3d
yesterday1d; yesterday2d; yesterday3d[1] "2017-12-16"
[1] "2017-12-16"
[1] "2017-12-16"
There are special switches that indicate date components when you format = ""
| Code | Represents | Example |
|---|---|---|
| %a | Day spelled out (abbreviated) | Mon |
| %A | Day spelled out | Monday |
| %d | Day of the month | 16 |
| %m | Month | 12 |
| %b | Month (abbreviated) | Dec |
| %B | Month (fully spelled out) | December |
| %y | Year (2-digits) | 17 |
| %Y | Year (4-digits) | 2017 |
So if you ever run into a date field that needs formatting, what we have covered should help you convert it into a proper date formatted variable.
## Extracting other quantities
Once you convert dates to proper date formats, you can extract other quantities embedded in them. For example,
weekdays(yesterday1d)[1] "Saturday"
months(today)[1] "March"
quarters(today)[1] "Q1"
Sometimes you will have a number representing the date, with the number representing the number of days since (if positive) or before (if negative) some date of origin. R uses 1970-01-01 as the date of origin but other software packages may (and some do) use different days. Thus, for example, if the dates are listed as
c(17651, -2345, 19760) -> x Then these numbers represent the following dates:
as.Date(x, origin = "1970-01-01") -> x.dates
x.dates[1] "2018-04-30" "1963-08-01" "2024-02-07"
You can also convert dates into the numbers representing dates.
julian(
x.dates,
origin = as.POSIXct(
"1970-01-01",
tz = "UCT"
)
) -> x.julian
x.julian[1] 17651 -2345 19760
attr(,"origin")
[1] "1970-01-01 UTC"
What if I wanted to create a sequence of dates, starting with Feb 8, 2018 but increasing by 1 day at a time or by every 5th day?
as.Date(
as.character("2018-02-08"),
format = "%Y-%m-%d"
) -> start.date
seq(
start.date,
by = 1,
length.out = 7
) -> date.seq1
date.seq1[1] "2018-02-08" "2018-02-09" "2018-02-10" "2018-02-11" "2018-02-12"
[6] "2018-02-13" "2018-02-14"
seq(
start.date,
by = 5,
length.out = 3
) -> date.seq5
date.seq5[1] "2018-02-08" "2018-02-13" "2018-02-18"
Note that length.out = is specifying how many you dates want to create.
Now, what if I want to know the date 30 days from today? 19 days ago?
today + 30 -> date.30
today - 19 -> date.19 How many days lapsed between two dates?
as.Date("2017-04-28") -> date1
Sys.Date() -> date2
(date2 - date1) -> lapsed.time
lapsed.timeTime difference of 3243 days
Say I want to create a vector of dates that starts and ends on specific dates, and the step function is 1 day, a week, 4 months, etc. The step is indicated with the by = "" command
seq(
from = as.Date("2017-12-17"),
to = as.Date("2018-12-16"),
by = "day"
) -> my.dates1
seq(
from = as.Date("2017-12-17"),
to = as.Date("2018-12-16"),
by = "week"
) -> my.dates2
seq(
from = as.Date("2017-12-17"),
to = as.Date("2018-12-16"),
by = "month"
) -> my.dates3
seq(
from = as.Date("2017-12-17"),
to = as.Date("2018-12-16"),
by = "3 days"
) -> my.dates4
seq(
from = as.Date("2017-12-17"),
to = as.Date("2018-12-16"),
by = "2 weeks"
) -> my.dates5
seq(
from = as.Date("2017-12-17"),
to = as.Date("2018-12-16"),
by = "4 months"
) -> my.dates6
seq(
from = as.Date("2017-12-17"),
to = as.Date("2019-12-16"),
by = "year"
) -> my.dates7
seq(
from = as.Date("2017-12-17"),
to = as.Date("2022-12-16"),
by = "2 years"
) -> my.dates8 We can also extract components, say from dates in the economics data-set.
library(ggplot2)
data(economics)
names(economics)[1] "date" "pce" "pop" "psavert" "uempmed" "unemploy"
format(economics$date, "%Y") -> economics$year
format(economics$date, "%B") -> economics$month
quarters(
economics$date,
abbreviate = FALSE
) -> economics$quarter
head(economics[, c(1, 7:9)])# A tibble: 6 × 4
date year month quarter
<date> <chr> <chr> <chr>
1 1967-07-01 1967 July Q3
2 1967-08-01 1967 August Q3
3 1967-09-01 1967 September Q3
4 1967-10-01 1967 October Q4
5 1967-11-01 1967 November Q4
6 1967-12-01 1967 December Q4
{lubridate}This specific package makes working with dates and times quite easy because it does what base R does but in more intuitive ways and perhaps more flexibly. Again, let us start by creating some date fields.
"20171217" -> today1
"2017-12-17" -> today2
"2017 December 17" -> today3
"20171217143241" -> today4
"2017 December 17 14:32:41" -> today5
"December 17 2017 14:32:41" -> today6
"17-Dec, 2017 14:32:41" -> today7 These formats are quite varied but {lubridate} deals with them quite seamlessly so long as you pay attention to the order – is year first or last? What about month? day? Is time given in hours, minutes and seconds?.
library(lubridate)
ymd(today1)[1] "2017-12-17"
ymd(today2)[1] "2017-12-17"
ymd(today3)[1] "2017-12-17"
ymd_hms(today4)[1] "2017-12-17 14:32:41 UTC"
ymd_hms(today5)[1] "2017-12-17 14:32:41 UTC"
mdy_hms(today6)[1] "2017-12-17 14:32:41 UTC"
dmy_hms(today7)[1] "2017-12-17 14:32:41 UTC"
The default time zone will be UTC, the Coordinated Universal Time clock. If you want to change it you must specify the time zone.
dmy_hms(today7, tz = "EST")[1] "2017-12-17 14:32:41 EST"
I am planning my summer break travel to Italy and want to know when my flight departs on March 10.
now("America/New_York") -> now.nyc
now.nyc[1] "2026-03-15 15:00:06 EDT"
now.nyc + days(60) -> my.departure
my.departure[1] "2026-05-14 15:00:06 EDT"
To look-up time zones, run
grep("America", OlsonNames(), value = TRUE)[1] "America/Adak" "America/Anchorage"
[3] "America/Anguilla" "America/Antigua"
[5] "America/Araguaina" "America/Argentina/Buenos_Aires"
Hang on a second, the boarding pass shows departure time to be at 11:41 AM. Ah, easily corrected:
update(
my.departure,
hours = 11,
minutes = 41,
seconds = 0
) -> corrected.departure
corrected.departure[1] "2026-05-14 11:41:00 EDT"
If I need to extract data/time elements, I can do that as well.
dmy_hms(today7) -> today
today # a date and time set in today7 [1] "2017-12-17 14:32:41 UTC"
year(today) -> today.y
today.y # The year [1] 2017
month(today) -> today.m1
today.m1 # the month, as a number [1] 12
month(
today, label = TRUE, abbr = TRUE
) -> today.m2
today.m2 # labeling the month but with an abbreviation [1] Dec
12 Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < ... < Dec
month(
today, label = TRUE, abbr = FALSE
) -> today.m3
today.m3# fully labelling the month [1] December
12 Levels: January < February < March < April < May < ... < December
week(today) -> today.w
today.w # what week of the year is it? [1] 51
yday(today) -> today.doy
today.doy # what day of the year is it? [1] 351
mday(today) -> today.dom
today.dom # what day of the month is it? [1] 17
wday(today) -> today.dow1
today.dow1 # what day of the week is it, as a number? [1] 1
wday(
today, label = TRUE, abbr = TRUE
) -> today.dow2
today.dow2 # day of the week labelled but abbreviated [1] Sun
Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat
wday(
today, label = TRUE, abbr = FALSE
) -> today.dow3
today.dow3 # day of the week fully labelled [1] Sunday
7 Levels: Sunday < Monday < Tuesday < Wednesday < ... < Saturday
hour(today) -> today.h
today.h # what hour is it? [1] 14
minute(today) -> today.m
today.m # what minute is it? [1] 32
second(today) -> today.s
today.s # what second is it? [1] 41
tz(today) -> today.tz
today.tz # what time zone is it? [1] "UTC"
Intervals vs. Durations vs. PeriodsCalculating time lapsed between dates is tricky because you have to take into account daylight savings timerelative units of time but seconds tends to be exact units of time. As a result, and by design, lubridate differentiates between intervals, durations, and periods, each measuring time spans in different ways.
Let us start with a duration, the simplest measure of lapsed time since it measures the passing of time in seconds. Say I pick two dates and times … 05:00am on March 7, 2026 and 05:00am on March 8, 2026.
ymd_hms(
"2026-03-07 05:00:00", tz = "US/Eastern"
) -> date1
ymd_hms(
"2026-03-08 05:00:00", tz = "US/Eastern"
) -> date2
as.interval(date1, date2) -> timint
timint[1] 2026-03-07 05:00:00 EST--2026-03-08 05:00:00 EDT
How much time has lapsed between date1 and date2? We calculate with as.duration()
as.duration(timint) -> timint.duration
timint.duration[1] "82800s (~23 hours)"
Why is it 23 hours and not 24 hours? Because daylight savings kicks in at 2:00 AM on March 08, and since the clocks kick forward by one hour, only 23 hours have passed, not 24 as we might naively expect. Go back and look at the output from timeint and pay attention to the EST versus EDT?
In contrast to duration, if I ask for the period of time, what do I get?
as.period(timint) -> timint.period
timint.period[1] "1d 0H 0M 0S"
Aha! as.period() is imprecise because it reports that 1 whole day has passed.
What does this tell us? That we can use the duration of time lapsed in whatever units we want, but if we want accuracy, we better work with duration, as shown below.
time_length(timint.duration, unit = "second")[1] 82800
time_length(timint.duration, unit = "minute")[1] 1380
time_length(timint.duration, unit = "hour")[1] 23
time_length(timint.duration, unit = "day")[1] 0.9583333
time_length(timint.duration, unit = "week")[1] 0.1369048
time_length(timint.duration, unit = "month")[1] 0.03148528
time_length(timint.duration, unit = "year")[1] 0.002623774
If, unfortunately, we rely on periods, well then we have inherited a problem!
time_length(timint.period, unit = "second")[1] 86400
time_length(timint.period, unit = "minute")[1] 1440
time_length(timint.period, unit = "hour")[1] 24
time_length(timint.period, unit = "day")[1] 1
time_length(timint.period, unit = "week")[1] 0.1428571
time_length(timint.period, unit = "month")[1] 0.03285421
time_length(timint.period, unit = "year")[1] 0.002737851
Now recall that we saved the difference between the two dates. Let us repeat that code here:
ymd_hms(
"2026-03-07 05:00:00",
tz = "US/Eastern"
) -> date1
date1 # stored in EST[1] "2026-03-07 05:00:00 EST"
ymd_hms(
"2026-03-08 05:00:00",
tz = "US/Eastern"
) -> date2
date2 # stored in EDT[1] "2026-03-08 05:00:00 EDT"
as.interval(date1, date2) -> timint
timint[1] 2026-03-07 05:00:00 EST--2026-03-08 05:00:00 EDT
Despite the EST versus EDT issue in timeinit, if you use time.length() you will get the correct quantities.
time_length(timint, unit = "second")[1] 82800
time_length(timint, unit = "minute")[1] 1380
time_length(timint, unit = "hour")[1] 23
time_length(timint, unit = "day")[1] 0.9583333
time_length(timint, unit = "week")[1] 0.1369048
time_length(timint, unit = "month")[1] 0.03095559
time_length(timint, unit = "year")[1] 0.002625571
Mess around with lubridate for a while and the nuances will sink in. Look at the table for a summary of the differences. Note how similar duration and interval values are through week. But month and year are vastly different.
The bottom-line: If you care about precise passing of time then make sure you are working with
as.duration()because duration calculates everything in seconds.
If you care about time in the sense of the human calendar (“how many days have passed since my birthday? Has this person been in quarantine for 10 days since the day they were infected?”) then work with
as.period()and be aware that EST versus EDT differences will be ignored for some units.
Use
as.interval()when you want to preserve start–end times and postpone choosing between seconds or calendar units. Think of as.interval()as the move you make when you want to keep the raw reality of a span instead of immediately flattening it into “X days” or “Y months.”
Below we have dumped the as.duration() and as.period() into a single table for comparison.
| unit of time | as.duration | as.period |
|---|---|---|
| second | 82800.000000000 | 86400.000000000 |
| minute | 1380.000000000 | 1440.000000000 |
| hour | 23.000000000 | 24.000000000 |
| day | 0.958333333 | 1.000000000 |
| week | 0.136904762 | 0.142857143 |
| month | 0.031485284 | 0.032854209 |
| year | 0.002623774 | 0.002737851 |
Note the difference in the values show in the columns, even though in theory these are for the same unit of time. Why the difference?
86400144024Now calculate how many weeks in a year (52), and months (12) in a year.
When as.period() thus calculates how many weeks is a day equal to, it calculates 17=0.1428571, ignoring the fact that an entire 24 hour period has not passed!
When it calculates how many months is a day equal to it just does 130.4375=0.03285421 … see below for why 30.4375 days in a month.
Taking leap years into account, the average year must have 365.25 days. Given 12 months in a year, this means on average there are 365.2512=30.4375 days in a month.
as.duration() however uses the actual seconds that are squeezed in here, which are 82800, and hence calculates 82800(30.4375×24×60×60)=0.03148528
For that last row, the same thing is happening, where if you calculate 86400(30.4375∗12∗24∗60∗60)=0.002737851 but if you use the actual seconds in that span you get 82800(30.4375∗12∗24∗60∗60)=0.002623774
I use a specific mnemonic …
| command | what it does … |
|---|---|
as.duration() |
Focus on exact seconds in a timespan |
as.period() |
Focus on human calendar units |
as.interval() |
Keep both endpoints so I can choose seconds or calendar units |
Let us load the CMH flights data we used earlier to learn dplyr.
load(
here::here(
"data",
"cmhotp.RData"
)
)
janitor::clean_names(
cmhotp
) -> cmh.df Say we had to construct a date field from year, month, and day_of_week. This could be done with unite(), as shown below.
library(tidyverse)
library(tidylog, warn.conflicts = FALSE)
cmh.df %>%
unite(
"fltdate",
c("year", "month", "day_of_week"),
sep = "-",
remove = FALSE
) -> cmh.df What if I wanted to see which month, and then which day of the week has the most number of flights?
cmh.df %>%
mutate(
monthnamed = month(fltdate, label = TRUE, abbr = TRUE),
weekday = wday(fltdate, label = TRUE, abbr = TRUE)
) -> cmh.df
table(cmh.df$monthnamed)
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
3757 3413 4101 4123 4098 4138 4295 4279 3789 4027 3951 3816
table(cmh.df$weekday)
Sun Mon Tue Wed Thu Fri Sat
6920 6747 6805 6795 6756 6808 6956
library(hrbrthemes)
library(viridis)
cmh.df %>%
group_by(monthnamed) %>%
summarise(nflights = n()) %>%
ggplot(aes(x = monthnamed, y = nflights, fill = nflights)) +
geom_bar(stat = "identity") +
labs(x = "Flight Month",
y = "Number of flights",
fill = "") +
theme_ipsum_rc() +
scale_fill_viridis(option = "viridis", direction = -1) +
theme(legend.position = "bottom")cmh.df %>%
group_by(weekday) %>%
summarise(nflights = n()) %>%
ggplot(aes(x = weekday, y = nflights, fill = nflights)) +
geom_bar(stat = "identity") +
labs(x = "Flight Day of the Week",
y = "Number of flights",
fill = "") +
theme_ipsum_rc() +
scale_fill_viridis(option = "viridis", direction = -1) +
theme(legend.position = "bottom")Hmm, I’d like the week to start on Monday so I can better show the obvious – most flights occur over the weekend.
cmh.df %>%
mutate(
weekday.f = ordered(
weekday,
levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
)
) %>%
group_by(weekday.f) %>%
summarise(nflights = n()) %>%
ggplot(
aes(x = weekday.f, y = nflights, fill = nflights)
) +
geom_bar(stat = "identity") +
labs(x = "Flight Day of the Week",
y = "Number of flights",
fill = "") +
theme_ipsum_rc() +
scale_fill_viridis(option = "viridis", direction = -1) +
theme(legend.position = "bottom")What time of the day do most/least flights depart? Let us start by creating a date-time field. You will see the code below steps into it by splitting dep_time into hours and minutes and then combines them with a : separating the two elements, then unites date with hourmin before flipping it into deptime stored in ymd_hm() format.
cmh.df %>%
separate(
crs_dep_time,
c("hour", "minute"),
sep = 2,
remove = FALSE
) %>%
unite(
hourmin,
c("hour", "minute"),
sep = ":",
remove = FALSE
) %>%
unite(
depdatetime,
c("fltdate", "hourmin"),
sep = " ",
remove = FALSE
) %>%
mutate(
deptime = ymd_hm(depdatetime)
) -> cmh.df Now I want the number of flights by the minute of the hour when the flight departed. .
cmh.df %>%
mutate(fltminute = minute(deptime)) %>%
group_by(fltminute) %>%
summarise(nflights = n()) %>%
ggplot(
aes(x = fltminute, y = nflights, fill = nflights)
) +
geom_bar(stat = "identity") +
labs(x = "Flight Minute",
y = "Number of flights",
fill = "") +
theme_ipsum_rc() +
scale_fill_viridis(option = "viridis", direction = -1) +
theme(legend.position = "bottom")Quite clearly, most flights depart 25 minutes after the hour, followed by 55 minutes after the hour, and then at 30 minutes after the hour, and then at the hour. Fair enough, but does this differ by the day of the week? All days are not the same!
cmh.df %>%
mutate(
weekday.f = ordered(
weekday,
levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
)
) %>%
mutate(fltminute = minute(deptime)) %>%
group_by(weekday.f, fltminute) %>%
summarise(nflights = n()) %>%
ggplot(
aes(x = fltminute, y = nflights, fill = nflights)
) +
geom_bar(stat = "identity") +
labs(x = "Flight Minute",
y = "Number of flights",
fill = "") +
theme_ipsum_rc() +
scale_fill_gradient(low = "white", high = "red") +
theme(legend.position = "bottom") +
facet_wrap(~ weekday.f, ncol = 2)But maybe what we really want is what departure minute is associated with higher/lower average arrival delays (in minutes).
cmh.df %>%
mutate(fltminute = minute(deptime)) %>%
group_by(fltminute) %>%
summarise(delay = mean(arr_delay, na.rm = TRUE),
nflights = n()) %>%
ggplot(
aes(x = fltminute, y = delay, color = nflights)
) +
geom_line() +
labs(x = "Flight Minute",
y = "Mean Arrival Delay",
color = "Number of Flights in the Mean") +
theme_ft_rc() +
scale_color_gradient(low = "white", high = "red") +
theme(legend.position = "bottom")These calculations have been made with the scheduled departure time. So we could also ask about mean departure delays based on scheduled departure times. We could also keep it clean by assuming the destination is Columbus, and by insisting that at least 30 flights must exist for that minute otherwise the mean is not calculated.
cmh.df %>%
separate(
crs_dep_time,
c("hour", "minute"),
sep = 2,
remove = FALSE
) %>%
unite(
hourmin,
c("hour", "minute"),
sep = ":",
remove = FALSE
) %>%
unite(
depdatetime,
c("fltdate", "hourmin"),
sep = " ",
remove = FALSE
) %>%
mutate(deptime = ymd_hm(depdatetime)) %>%
mutate(fltminute = minute(deptime)) %>%
filter(dest == "CMH") %>%
group_by(fltminute) %>%
summarise(
delay = mean(dep_delay, na.rm = TRUE),
nflights = n()
) %>%
filter(nflights >= 30) %>%
ggplot(
aes(x = fltminute, y = delay, color = nflights)
) +
geom_line() +
labs(x = "Scheduled Flight Minute",
y = "Mean Departure Delay",
color = "Number of Flights in the Mean") +
theme_ft_rc() +
scale_color_gradient(low = "white", high = "red") +
theme(legend.position = "bottom")Does this vary by airline?
cmh.df %>%
separate(
crs_dep_time,
c("hour", "minute"),
sep = 2,
remove = FALSE
) %>%
unite(
hourmin,
c("hour", "minute"),
sep = ":",
remove = FALSE
) %>%
unite(
depdatetime,
c("fltdate", "hourmin"),
sep = " ",
remove = FALSE
) %>%
mutate(deptime = ymd_hm(depdatetime)) %>%
mutate(fltminute = minute(deptime)) %>%
filter(dest == "CMH") %>%
group_by(fltminute, reporting_airline) %>%
summarise(
delay = mean(dep_delay, na.rm = TRUE),
nflights = n()
) %>%
filter(nflights >= 30) %>%
ggplot(
aes(x = fltminute, y = delay, color = nflights)
) +
geom_line() +
labs(x = "Scheduled Flight Minute",
y = "Mean Departure Delay",
color = "Number of Flights in the Mean") +
theme_ft_rc() +
scale_color_gradient(low = "white", high = "red") +
theme(legend.position = "bottom") +
facet_wrap(~ reporting_airline)And then here are arrival delays by airlines flying to Columbus.
cmh.df %>%
separate(
crs_dep_time,
c("hour", "minute"),
sep = 2,
remove = FALSE
) %>%
unite(
hourmin,
c("hour", "minute"),
sep = ":",
remove = FALSE
) %>%
unite(
depdatetime,
c("fltdate", "hourmin"),
sep = " ",
remove = FALSE
) %>%
mutate(deptime = ymd_hm(depdatetime)) %>%
mutate(fltminute = minute(deptime)) %>%
filter(dest == "CMH") %>%
group_by(fltminute, reporting_airline) %>%
summarise(
delay = mean(arr_delay, na.rm = TRUE),
nflights = n()
) %>%
filter(nflights >= 30) %>%
ggplot(
aes(x = fltminute, y = delay, color = nflights)
) +
geom_line() +
labs(x = "Scheduled Flight Minute",
y = "Mean Arrival Delay",
color = "Number of Flights in the Mean") +
theme_ft_rc() +
scale_color_gradient(low = "white", high = "red") +
theme(legend.position = "bottom") +
facet_wrap(~ reporting_airline)One final query. Say we are curious about how often a particular aircraft flies. How might we calculate this time span?
cmh.df %>%
filter(!is.na(tail_number)) %>%
group_by(tail_number) %>%
arrange(deptime) %>%
mutate(nflew = row_number()) %>%
select(deptime, tail_number, nflew) %>%
arrange(tail_number, nflew) -> cmh.df2 The preceding code has basically flagged each aircraft’s flight sequence (basically when it was first seen, seen for the second time, and so on). Let us now calculate the time lapsed between each flight to/from CMH of each aircraft.
cmh.df2 %>%
ungroup() %>%
arrange(tail_number) %>%
group_by(tail_number) %>%
mutate(
tspan = interval(lag(deptime, order_by = tail_number), deptime),
tspan.minutes = as.duration(tspan)/dminutes(1),
tspan.hours = as.duration(tspan)/dhours(1),
tspan.days = as.duration(tspan)/ddays(1),
tspan.weeks = as.duration(tspan)/dweeks(1)
) -> cmh.df2
mean(cmh.df2$tspan.minutes, na.rm = TRUE)[1] 15834.02
median(cmh.df2$tspan.minutes, na.rm = TRUE)[1] 612
mean(cmh.df2$tspan.days, na.rm = TRUE)[1] 10.99585
median(cmh.df2$tspan.days, na.rm = TRUE)[1] 0.425
If you look at the data you realize that the overwhelming majority of aircraft are flying to/from CMH within a median span of 609 minutes (a shade over 10 hours). This might differ by airline, or does it? Sure it does!
cmh.df %>%
arrange(tail_number, deptime) %>%
group_by(tail_number, reporting_airline) %>%
select(deptime, tail_number, reporting_airline) %>%
mutate(
tspan = interval(lag(deptime), deptime),
tspan.minutes = as.duration(tspan)/dminutes(1),
tspan.days = as.duration(tspan)/ddays(1)
) %>%
group_by(reporting_airline) %>%
summarise(md.tspan = median(tspan.minutes, na.rm = TRUE)) %>%
arrange(md.tspan) -> cmh.ts | Airline | Median Minutes |
|---|---|
| OO | 243 |
| EV | 263 |
| WN | 585 |
| DL | 686 |
| F9 | 735 |
| UA | 738 |
| AA | 765 |
See the book chapter for more advanced work with dates and times and some practice exercises.
See also the rds chapter for more nuanced operations
For attribution, please cite this work as
Ruhil (2026, March 15). Work with Dates and Times. Retrieved from https://aniruhil.org/courses/mpa6020/handouts/module05.html
BibTeX citation
@misc{ruhil2026work,
author = {Ruhil, Ani},
title = {Work with Dates and Times},
url = {https://aniruhil.org/courses/mpa6020/handouts/module05.html},
year = {2026}
}