I saw one of the smartest people I know of (the one and only @hrbrmstr) post TSA’s daily airline passenger traffic numbers circa March 01, 2020 and later, and the corresponding numbers for the same dates albeit in 2019. The data source is an html table, and since I haven’t scraped html tables in a while, I wanted to get rid of the cobwebs. Surprisingly easy but then the TSA data source-page is a very clean setup.
This post was originally written on 2020-05-16
I saw one of the smartest people I know (the one and only @hrbrmstr) post TSA’s daily airline passenger traffic numbers circa March 01, 2020 and later, and the corresponding numbers for the same dates albeit in 2019. The data source is an html table, and since I haven’t scraped html tables in a while, I wanted to get rid of the cobwebs and relearn from a master. Surprisingly easy but then the TSA data source-page looks like a very clean setup.
library(rvest)
read_html("https://www.tsa.gov/coronavirus/passenger-throughput") -> myurl
html_table(myurl, header = TRUE, fill = TRUE) -> tsa
library(tidyverse)
as_tibble(tsa[[1]]) %>%
janitor::clean_names() %>%
mutate(
total_traveler_throughput = as.numeric(
gsub(",", "", total_traveler_throughput)),
total_traveler_throughput_1_year_ago_same_weekday = as.numeric(
gsub(",", "", total_traveler_throughput_1_year_ago_same_weekday)),
date = lubridate::mdy(date)
) %>%
group_by(date) %>%
pivot_longer(
names_to = "period",
values_to = "numbers",
cols = 2:3
) %>%
filter(!is.na(date)) -> tsa.df
ggplot(tsa.df) +
geom_line(aes(x = date, y = numbers, color = period)) +
geom_point(aes(x = date, y = numbers, color = period)) +
geom_smooth(aes(x = date, y = numbers, color = period)) +
themeani::theme_ani_nunito() +
theme(legend.position = "") +
annotate("text", x = as.Date("2020-04-20"), y = 500000,
label = "Throughput in the same week but in 2019") +
annotate("text", x = as.Date("2020-04-20"), y = 1750000,
label = "Throughput in 2020") +
scale_x_date(date_labels = "%b-%d", date_breaks = "1 week") +
scale_y_continuous(labels = scales::"comma") +
labs(x = "Month and Day",
y = "Number of Passengers",
caption = "Data Source: https://www.tsa.gov/coronavirus/passenger-throughput | @aruhil",
title = "TSA checkpoint travel numbers for 2020 and 2019"
)
Text and figures are licensed under Creative Commons Attribution CC BY-SA 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Ruhil (2020, Dec. 18). From an Attican Hollow: TSA Throughput with Scraping html Tables. Retrieved from https://aniruhil.org/posts/2020-12-18-tsa-throughput-with-scraping-html-tables/
BibTeX citation
@misc{ruhil2020tsa, author = {Ruhil, Ani}, title = {From an Attican Hollow: TSA Throughput with Scraping html Tables}, url = {https://aniruhil.org/posts/2020-12-18-tsa-throughput-with-scraping-html-tables/}, year = {2020} }