TSA Throughput with Scraping html Tables

R

I saw one of the smartest people I know of (the one and only @hrbrmstr) post TSA’s daily airline passenger traffic numbers circa March 01, 2020 and later, and the corresponding numbers for the same dates albeit in 2019. The data source is an html table, and since I haven’t scraped html tables in a while, I wanted to get rid of the cobwebs. Surprisingly easy but then the TSA data source-page is a very clean setup.

Ani Ruhil true
12-18-2020

This post was originally written on 2020-05-16

I saw one of the smartest people I know (the one and only @hrbrmstr) post TSA’s daily airline passenger traffic numbers circa March 01, 2020 and later, and the corresponding numbers for the same dates albeit in 2019. The data source is an html table, and since I haven’t scraped html tables in a while, I wanted to get rid of the cobwebs and relearn from a master. Surprisingly easy but then the TSA data source-page looks like a very clean setup.

library(rvest)

read_html("https://www.tsa.gov/coronavirus/passenger-throughput") -> myurl

html_table(myurl, header = TRUE, fill = TRUE) -> tsa

library(tidyverse)

as_tibble(tsa[[1]]) %>%
  janitor::clean_names() %>%
  mutate(
    total_traveler_throughput = as.numeric(
      gsub(",", "", total_traveler_throughput)),
    total_traveler_throughput_1_year_ago_same_weekday = as.numeric(
      gsub(",", "", total_traveler_throughput_1_year_ago_same_weekday)),
    date = lubridate::mdy(date)
         ) %>%
  group_by(date) %>%
  pivot_longer(
    names_to = "period",
    values_to = "numbers",
    cols = 2:3
    ) %>%
  filter(!is.na(date)) -> tsa.df

ggplot(tsa.df) +
  geom_line(aes(x = date, y = numbers, color = period)) +
  geom_point(aes(x = date, y = numbers, color = period)) +
  geom_smooth(aes(x = date, y = numbers, color = period)) +
  themeani::theme_ani_nunito() +
  theme(legend.position = "") +
  annotate("text", x = as.Date("2020-04-20"), y = 500000,
           label = "Throughput in the same week but in 2019") +
  annotate("text", x = as.Date("2020-04-20"), y = 1750000,
           label = "Throughput in 2020") +
  scale_x_date(date_labels = "%b-%d", date_breaks = "1 week") +
  scale_y_continuous(labels = scales::"comma") +
  labs(x = "Month and Day",
       y = "Number of Passengers",
       caption = "Data Source: https://www.tsa.gov/coronavirus/passenger-throughput | @aruhil",
       title = "TSA checkpoint travel numbers for 2020 and 2019"
       )

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY-SA 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Ruhil (2020, Dec. 18). From an Attican Hollow: TSA Throughput with Scraping html Tables. Retrieved from https://aniruhil.org/posts/2020-12-18-tsa-throughput-with-scraping-html-tables/

BibTeX citation

@misc{ruhil2020tsa,
  author = {Ruhil, Ani},
  title = {From an Attican Hollow: TSA Throughput with Scraping html Tables},
  url = {https://aniruhil.org/posts/2020-12-18-tsa-throughput-with-scraping-html-tables/},
  year = {2020}
}