TSA Throughput with Scraping HTML Tables

I saw one of the smartest people I e-know (the one and only @hrbrmstr) post the TSA’s daily airline passenger throughput numbers for March 1, 2020 onward, alongside the numbers for the corresponding dates in 2019. The data source is an HTML table, and since I hadn’t scraped HTML tables in a while, I wanted to shake off the cobwebs. It turned out to be surprisingly easy, but then the TSA source page is very cleanly set up.

library(rvest)

read_html("https://www.tsa.gov/coronavirus/passenger-throughput") -> myurl

html_table(myurl, header = TRUE, fill = TRUE) -> tsa
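
If you want to see what html_table() hands back without hitting the TSA site, here is a minimal offline sketch. The two-row table is hypothetical (the values are made up, not TSA data), and page, tables, and demo are names I am introducing just for this illustration. It assumes rvest 1.0 or later, which provides minimal_html().

```r
library(rvest)

# Hypothetical stand-in for the TSA page: one small table with made-up values
page <- minimal_html('
  <table>
    <tr><th>Date</th><th>Total Traveler Throughput</th></tr>
    <tr><td>4/20/2020</td><td>100,000</td></tr>
    <tr><td>4/19/2020</td><td>105,000</td></tr>
  </table>')

# On a whole document, html_table() returns a list with one element per <table>
tables <- html_table(page, header = TRUE)
length(tables)

# Pull out the first (and here only) table
demo <- tables[[1]]

# Numbers written with thousands separators come back as character,
# which is why the commas have to be stripped before converting
str(demo)
```

This is why the real code indexes the result with [[1]] and then cleans the throughput columns by hand.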

library(tidyverse)

as_tibble(tsa[[1]]) %>%
  janitor::clean_names() %>%
  mutate(
    total_traveler_throughput = as.numeric(
      gsub(",", "", total_traveler_throughput)),
    total_traveler_throughput_1_year_ago_same_weekday = as.numeric(
      gsub(",", "", total_traveler_throughput_1_year_ago_same_weekday)),
    date = lubridate::mdy(date)
         ) %>%
  group_by(date) %>%
  pivot_longer(
    names_to = "period",
    values_to = "numbers",
    cols = 2:3
    ) %>%
  filter(!is.na(date)) -> tsa.df
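
The cleaning and reshaping steps can be tried on a toy table first. This is only a sketch under stated assumptions: toy and toy_long are hypothetical names, the values are made up, and I am using across() to strip commas from both throughput columns at once instead of repeating the gsub() call as above.

```r
library(dplyr)
library(tidyr)

# Hypothetical data shaped like the cleaned-names TSA table (values are made up)
toy <- tibble(
  date = c("4/20/2020", "4/19/2020"),
  total_traveler_throughput = c("100,000", "105,000"),
  total_traveler_throughput_1_year_ago_same_weekday = c("2,500,000", "2,400,000")
)

toy_long <- toy %>%
  mutate(
    # Drop the thousands separators, then convert every non-date column to numeric
    across(-date, ~ as.numeric(gsub(",", "", .x))),
    # Parse month/day/year strings into Date objects
    date = lubridate::mdy(date)
  ) %>%
  # One row per date and period, so ggplot2 can map period to color
  pivot_longer(cols = -date, names_to = "period", values_to = "numbers")

toy_long
```

Two dates times two periods gives four rows, with the original column names stored in period.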

ggplot(tsa.df) +
  geom_line(aes(x = date, y = numbers, color = period)) +
  geom_point(aes(x = date, y = numbers, color = period)) +
  geom_smooth(aes(x = date, y = numbers, color = period)) +
  themeani::theme_ani_nunito() +
  theme(legend.position = "none") +
  # 2020 throughput sits far below 2019 in April, so the lower label is 2020
  annotate("text", x = as.Date("2020-04-20"), y = 500000,
           label = "Throughput in 2020") +
  annotate("text", x = as.Date("2020-04-20"), y = 1750000,
           label = "Throughput on the same weekday in 2019") +
  scale_x_date(date_labels = "%b-%d", date_breaks = "1 week") +
  scale_y_continuous(labels = scales::comma) +
  labs(x = "Month and Day",
       y = "Number of Passengers",
       caption = "Data Source: https://www.tsa.gov/coronavirus/passenger-throughput | @aruhil",
       title = "TSA checkpoint travel numbers for 2020 and 2019"
       )
