Visualizing Data

Ani Ruhil
2020-07-07

ggplot2 is one of the more popular R packages for data visualization and hence is the package I will walk you through. It can do a lot but we will only focus on the very basics, the type of visualizations you may need for program evaluation. In particular, we will look at bar-charts, histograms, box-plots, scatter-plots, line-charts, and if we get that far, maybe a simple map or two. I will stick with our myhsb data so let us load our file, and the ggplot2 library as well.

library(here)

load(here("workshops/ropeg/handouts/data", "myhsb.RData"))

library(ggplot2)

Bar-Charts

Let us build a bar-chart of read.quartiles.

ggplot(
  data = myhsb, # the data to be used
  aes(x = read.quartiles) # what should go on the x-axis
  ) +
  geom_bar() # the type of geom we want

Now, this is the same as moving all of the aes(...) commands to within the geom_bar(...) as shown below.

ggplot(
  data = myhsb # the data to be used
  ) +
  geom_bar(
      aes(x = read.quartiles) # what should go on the x-axis
  ) # the type of geom we want
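
One way to convince yourself that the two forms are equivalent is to compare the bar heights ggplot2 computes for each. The sketch below uses a small made-up data frame (toy, with a hypothetical column quartile) rather than myhsb, and layer_data() to pull out the computed counts.

```r
library(ggplot2)

# Made-up data for illustration only; this is not the myhsb file
toy <- data.frame(quartile = factor(c("Q1", "Q1", "Q2", "Q3", "Q3", "Q3")))

# aes() supplied in ggplot(): inherited by every layer added later
p1 <- ggplot(data = toy, aes(x = quartile)) +
  geom_bar()

# aes() supplied in geom_bar(): applies to that layer only
p2 <- ggplot(data = toy) +
  geom_bar(aes(x = quartile))

# Bar heights computed for the first (and only) layer of each plot
counts1 <- layer_data(p1, 1)$count
counts2 <- layer_data(p2, 1)$count

identical(counts1, counts2)
```

The placement only starts to matter once a plot has several layers: aesthetics set in ggplot() are shared by all layers, while aesthetics set inside a geom apply to that geom alone.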

We can spruce things up by improving the labels on the x-axis and y-axis, and by adding a title to the plot.

ggplot(
  data = myhsb 
  ) +
  geom_bar(
      aes(x = read.quartiles) 
  ) + 
  labs(
    x = "Quartiles of the Standardized Reading Score",
    y = "Frequency",
    title = "Bar-chart of Reading Quartiles"
  )

Now you might want to explore differences between male and female students here. Since the bars will be gray for both, we can use a fill statement that will lean on unique levels of the variable we specify and assign a unique color to each level.

ggplot(
  data = myhsb 
  ) +
  geom_bar(
      aes(x = read.quartiles, fill = female.f)
  ) + 
  labs(
    x = "Quartiles of the Standardized Reading Score",
    y = "Frequency",
    title = "Bar-chart of Reading Quartiles",
    subtitle = "(by Sex)"
  )

Notice this just stacks one group on top of another, making comparison difficult. One way to avoid this would be to tell R to dodge the bars, i.e., put them side-by-side.

ggplot(
  data = myhsb 
  ) +
  geom_bar(
      aes(x = read.quartiles, fill = female.f),
      position = "dodge"
  ) + 
  labs(
    x = "Quartiles of the Standardized Reading Score",
    y = "Frequency",
    title = "Bar-chart of Reading Quartiles",
    subtitle = "(by Sex)"
  )

Alternatively, you could have also put each sex side-by-side via facet_grid().

ggplot(
  data = myhsb 
  ) +
  geom_bar(
      aes(x = read.quartiles)
  ) + 
  facet_grid(~ female.f) + 
  labs(
    x = "Quartiles of the Standardized Reading Score",
    y = "Frequency",
    title = "Bar-chart of Reading Quartiles",
    subtitle = "(by Sex)"
  )

Now, the beautiful thing here is that you can just as easily break this out by race/ethnicity as well. How would that work?

ggplot(
  data = myhsb 
  ) +
  geom_bar(
      aes(x = read.quartiles)
  ) + 
  facet_grid(race.f ~ female.f) + 
  labs(
    x = "Quartiles of the Standardized Reading Score",
    y = "Frequency",
    title = "Bar-chart of Reading Quartiles",
    subtitle = "(by Race/Ethnicity and Sex)"
  )

Maybe you do not want it this way but rather each sex within each race/ethnicity?

ggplot(
  data = myhsb 
  ) +
  geom_bar(
      aes(x = read.quartiles, fill = female.f),
      position = "dodge"
  ) + 
  facet_wrap(~ race.f) + 
  labs(
    x = "Quartiles of the Standardized Reading Score",
    y = "Frequency",
    title = "Bar-chart of Reading Quartiles",
    subtitle = "(by Race/Ethnicity and Sex)",
    fill = ""
  ) +
  theme(legend.position = "bottom")

You could add a fourth dimension if you needed to:

ggplot(
  data = myhsb 
  ) +
  geom_bar(
      aes(x = read.quartiles, fill = female.f),
      position = "dodge"
  ) + 
  facet_wrap(race.f ~ schtyp.f, ncol = 2) + 
  labs(
    x = "Quartiles of the Standardized Reading Score",
    y = "Frequency",
    title = "Bar-chart of Reading Quartiles",
    subtitle = "(by Race/Ethnicity, Sex, and School Type)",
    fill = ""
  ) +
  theme(legend.position = "bottom")

Scatter-plots

Scatter-plots need both the x and y variables to be numeric, and instead of geom_bar() we use geom_point().

ggplot(data = myhsb) +
  geom_point(aes(x = read, y = math)) +
  labs(
    x = "Standardized Reading Score",
    y = "Standardized Mathematics Score"
  )

Again, if we wanted to break this out by sex and/or race/ethnicity, we could try the following:

ggplot(data = myhsb) +
  geom_point(aes(x = read, y = math, color = female.f)) +
  labs(
    x = "Standardized Reading Score",
    y = "Standardized Mathematics Score",
    color = ""
  ) +
  facet_wrap(~ race.f) +
  theme(legend.position = "bottom")

Line-plots

A common plot these days is the line-plot that shows COVID-19 cases over time. Well, let us pull the data for Ohio and draw a few such plots. You will see some code that combines the county-level data to create state-level data.

library(tidyverse)
library(tidylog)

read_csv("https://coronavirus.ohio.gov/static/dashboards/COVIDSummaryData.csv") %>%
  filter(County != "Grand Total") %>%
  janitor::clean_names() -> c19

c19 %>%
  mutate(date = lubridate::mdy(onset_date)) %>%
  group_by(date) %>% # perform the calculations by date
  summarise(cases = sum(case_count) # total cases across counties
            ) -> c19.ohio
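
To see what the group_by() and summarise() steps are doing, here is the same pipeline run on a tiny made-up data frame. The name toy_c19 and its values are invented for illustration; only the column names mirror the cleaned Ohio file.

```r
library(dplyr)

# Made-up county-level rows for illustration; the real c19 data has more columns
toy_c19 <- tibble(
  county     = c("Athens", "Athens", "Franklin", "Franklin"),
  onset_date = c("7/1/2020", "7/2/2020", "7/1/2020", "7/2/2020"),
  case_count = c(2, 1, 10, 7)
)

toy_c19 %>%
  mutate(date = lubridate::mdy(onset_date)) %>%   # m/d/Y text -> Date
  group_by(date) %>%                              # one group per day
  summarise(cases = sum(case_count)) -> toy_ohio  # add up counties within each day

toy_ohio
```

Each date ends up as a single row whose cases value is the sum over all counties reporting on that day.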

Now the plot …

ggplot(data = c19.ohio) +
  geom_line(aes(x = date, y = cases)) +
  labs(
    x = "Date",
    y = "Number of Cases Reported"
  ) 

There you have it!

Now say you wanted to look at the trend by age-group and sex. How could we do that? Well, the first thing we would need to do is modify the aggregation code above so that total cases are calculated by date, age-group, and sex.

c19 %>%
  mutate(date = lubridate::mdy(onset_date)) %>%
  group_by(date, age_range, sex) %>% # perform the calculations by date, age, & sex
  summarise(cases = sum(case_count) # total cases across counties
            ) -> c19.ohio2

Excellent! Now we can modify our plotting command.

ggplot(data = c19.ohio2) +
  geom_line(aes(x = date, y = cases, color = sex)) +
  facet_wrap(~ age_range, ncol = 3) + 
  labs(
    x = "Date",
    y = "Number of Cases Reported",
    color = "Sex"
  ) +
  theme(legend.position = "bottom")

A word of caution: The age and sex groups differ in population size, so an accurate reading of which group has increasing or decreasing rates would require that we scale the cases by the respective population sizes (for example, cases per 100,000 people).
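
As a rough sketch of what that adjustment could look like, the code below joins case totals to a population table and computes cases per 100,000. Both data frames (toy_cases and toy_pop) and every number in them are made up for illustration; real population counts would have to come from an outside source such as Census estimates.

```r
library(dplyr)

# Made-up case totals, for illustration only
toy_cases <- tibble(
  age_range = c("20-29", "20-29", "80+", "80+"),
  sex       = c("Female", "Male", "Female", "Male"),
  cases     = c(50, 40, 30, 10)
)

# Made-up population sizes for the same groups (hypothetical numbers)
toy_pop <- tibble(
  age_range  = c("20-29", "20-29", "80+", "80+"),
  sex        = c("Female", "Male", "Female", "Male"),
  population = c(100000, 80000, 20000, 10000)
)

toy_cases %>%
  left_join(toy_pop, by = c("age_range", "sex")) %>%        # attach group sizes
  mutate(rate_per_100k = cases / population * 100000) -> toy_rates

toy_rates
```

In this toy example the 80+ groups have fewer cases than the 20-29 groups but higher rates, which is exactly the kind of reversal raw counts can hide.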

Before we move on, here is a variation of the preceding line-plot.

ggplot(data = c19.ohio2) +
  geom_line(aes(x = date, y = cases, color = age_range)) +
  facet_wrap(~ sex, ncol = 3) + 
  labs(
    x = "Date",
    y = "Number of Cases Reported",
    color = "Age-Groups"
  ) +
  theme(legend.position = "bottom")

Notice that the preceding graph looks scrunched up because the y-axis range is being held constant. We could allow it to vary.

ggplot(data = c19.ohio2) +
  geom_line(aes(x = date, y = cases, color = age_range)) +
  facet_wrap(~ sex, ncol = 3, scales = "free_y") + 
  labs(
    x = "Date",
    y = "Number of Cases Reported",
    color = "Age-Groups"
  ) +
  theme(legend.position = "bottom")

Box-plots

I use these a lot when I have a numeric measure and want to see variation within and across groups. The first plot will be for both male and female students, but then we can disaggregate it.

ggplot(data = myhsb) +
  geom_boxplot(aes(x = read, y = "")) +
  labs(x = "Standardized Reading Score",
       y = "")

So the median reading score is about 50, and we have a positively-skewed distribution of reading scores. Does the distribution vary by the student’s sex?

ggplot(data = myhsb) +
  geom_boxplot(aes(x = read, y = female.f, fill = female.f)) +
  labs(x = "Standardized Reading Score",
       y = "Student's Sex") +
  theme(legend.position = "none") # "none" is the documented value for hiding the legend
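
What you read off a box-plot (the median, the quartiles, a hint of skew) can also be checked numerically. The sketch below uses a small made-up data frame, not myhsb, with hypothetical columns read and sex.

```r
# Made-up scores for illustration only; this is not the myhsb data
toy <- data.frame(
  read = c(30, 40, 45, 50, 50, 55, 60, 90),
  sex  = c("Male", "Female", "Male", "Female", "Male", "Female", "Male", "Female")
)

# The box spans the first to the third quartile; the line inside it is the median
box_stats <- quantile(toy$read, probs = c(0.25, 0.50, 0.75))

# A mean above the median is a rough hint of positive (right) skew, not a formal test
mean_read   <- mean(toy$read)
median_read <- median(toy$read)

# Medians by group, the quantity compared across the disaggregated boxes
group_medians <- tapply(toy$read, toy$sex, median)

box_stats
group_medians
```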

Simple Maps

One of the many things R does easily, and attractively, is build maps with geographies filled in by a color scheme that represents low versus high values of some measure of interest. Let us assume that we want to look at a specific date or date-range for which we have county-level COVID-19 data. I will use all days in July thus far. Since the county data also have breakouts by age and sex I will have to aggregate these so that we have a single, cumulative count of cases per county.

c19 %>%
  filter(lubridate::mdy(onset_date) >= as.Date("2020-07-01")) %>% # compare as dates, not as text
  group_by(county) %>%
  summarise(cases = sum(case_count)) -> mydf
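
A note on date filters like this one: if onset_date is stored as m/d/Y text, as the earlier mdy() call suggests it is, then comparing it as text goes character by character rather than chronologically. A small base-R sketch with made-up dates shows the difference (as.Date() with an explicit format stands in here for lubridate::mdy()).

```r
# Made-up onset dates stored as m/d/Y text
onset <- c("6/30/2020", "7/15/2020", "10/1/2020")

# Compared as text, "10/1/2020" sorts before "7/1/2020" because "1" < "7",
# so an October date would wrongly fail a "July or later" filter
text_keep <- onset >= "7/1/2020"

# Parsed to Date first, the comparison is chronological
parsed    <- as.Date(onset, format = "%m/%d/%Y")
date_keep <- parsed >= as.Date("2020-07-01")

data.frame(onset, text_keep, date_keep)
```

With data that only run through July the text comparison may happen to give the right rows, but it stops being reliable once months 10 through 12 appear.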

Note some counties may be missing if they had no case reported in July thus far, and that is fine. How do we take this and make a simple map?

The first thing we will do is load the urbnmapr package, built specifically by the Urban Institute to make mapping the states and counties pretty easy.

library(urbnmapr)
data(counties)
names(counties)
 [1] "long"        "lat"         "order"       "hole"       
 [5] "piece"       "group"       "county_fips" "state_abbv" 
 [9] "state_fips"  "county_name" "fips_class"  "state_name" 

Look at the counties data-set to get a feel for the columns’ contents. This has every state but we only need Ohio. No problem. And our COVID-19 data has the county names without the word “County” that appears in county_name in the counties data. No problem. Let us tackle both these things next.

counties %>%
  filter(state_abbv == "OH") -> ohmap

gsub(" County", "", ohmap$county_name) -> ohmap$county 

Now that we have a column called county in ohmap and in mydf, we can merge these two data-sets as follows:

merge(ohmap, mydf, by = c("county"), all = TRUE) -> map.data

map.data %>%
  arrange(group, order) -> map.data
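
To make the name clean-up and the merge concrete, here is a minimal sketch on made-up stand-ins for ohmap and mydf (toy_map and toy_cases are hypothetical names, and the case counts are invented).

```r
# Made-up stand-ins for ohmap and mydf, for illustration only
toy_map <- data.frame(
  county_name = c("Athens County", "Franklin County", "Vinton County"),
  stringsAsFactors = FALSE
)
toy_cases <- data.frame(
  county = c("Athens", "Franklin"),
  cases  = c(3, 17),
  stringsAsFactors = FALSE
)

# Strip the trailing " County" so the keys in the two data frames match
toy_map$county <- gsub(" County", "", toy_map$county_name)

# all = TRUE keeps counties that appear in only one of the two data frames
toy_merged <- merge(toy_map, toy_cases, by = "county", all = TRUE)

toy_merged
```

A county with no reported cases (Vinton here) is kept with NA in cases, so by default it is drawn in the NA fill color rather than dropped from the map. Because merge() sorts rows by the key, the real map data need the arrange(group, order) step afterwards so that each polygon's vertices are drawn in the right sequence.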

Now the map itself!

ggplot(data = map.data) +
  geom_polygon(aes(x = long, y = lat, group = group, fill = cases),
               color = "white") + 
  coord_fixed(1.3) +
  ggthemes::theme_map() + 
  theme(legend.position = "bottom") +
  labs(fill = "",
       title = "Number of Cases Reported",
       subtitle = "by County, July 01, 2020 through Present")