Visualizing Data



Ani Ruhil



July 7, 2020


ggplot2 is one of the more popular R packages for data visualization and hence is the package I will walk you through. It can do a lot but we will only focus on the very basics, the type of visualizations you may need for program evaluation. In particular, we will look at bar-charts, histograms, box-plots, scatter-plots, line-charts, and if we get that far, maybe a simple map or two. I will stick with our myhsb data so let us load our file, and the ggplot2 library as well.


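library(here)    # builds file paths relative to the project root
library(ggplot2) # the plotting package used throughout
load(here("workshops/ropeg/handouts/data", "myhsb.RData"))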



Let us build a bar-chart of read.quartiles:

ggplot(
  data = myhsb, # the data to be used
  aes(x = read.quartiles) # what should go on the x-axis
  ) +
  geom_bar() # the type of geom we want

Now, this is the same as moving all of the aes(...) commands to within the geom_bar(...) as shown below.

ggplot(
  data = myhsb # the data to be used
  ) +
  geom_bar(
      aes(x = read.quartiles) # what should go on the x-axis
  ) # the type of geom we want
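
As a quick check on this equivalence, here is a minimal sketch that builds the chart both ways and compares the bar heights ggplot2 computes. The data frame toy and its column q are made up for illustration and are not part of myhsb; layer_data() is the ggplot2 helper that returns the computed values for a layer.

```r
library(ggplot2)

# Hypothetical toy data standing in for myhsb
toy <- data.frame(q = factor(c("Q1", "Q1", "Q2", "Q3", "Q3", "Q3")))

# aes() supplied inside ggplot()
p1 <- ggplot(data = toy, aes(x = q)) +
  geom_bar()

# aes() supplied inside geom_bar()
p2 <- ggplot(data = toy) +
  geom_bar(aes(x = q))

# Extract the counts ggplot2 computed for the first (only) layer of each plot
counts1 <- layer_data(p1)$count
counts2 <- layer_data(p2)$count

# TRUE if both versions produce the same bar heights
identical(counts1, counts2)
```

Placing aes() inside ggplot() makes the mapping available to every geom you add later, while placing it inside a geom keeps it local to that geom.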

We can spruce things up by improving the labels on the x-axis and y-axis, and by adding a title to the plot.

ggplot(
  data = myhsb
  ) +
  geom_bar(
      aes(x = read.quartiles)
  ) +
  labs(
    x = "Quartiles of the Standardized Reading Score",
    y = "Frequency",
    title = "Bar-chart of Reading Quartiles"
  )

Now you might want to explore differences between male and female students here. Since the bars will be gray for both, we can use a fill statement that will lean on unique levels of the variable we specify and assign a unique color to each level.

ggplot(
  data = myhsb
  ) +
  geom_bar(
      aes(x = read.quartiles, fill = female.f)
  ) +
  labs(
    x = "Quartiles of the Standardized Reading Score",
    y = "Frequency",
    title = "Bar-chart of Reading Quartiles",
    subtitle = "(by Sex)"
  )

Notice this just stacks one group on top of another, making comparison difficult. One way to avoid this would be to tell R to dodge the bars, i.e., put them side-by-side.

ggplot(
  data = myhsb
  ) +
  geom_bar(
      aes(x = read.quartiles, fill = female.f),
      position = "dodge"
  ) +
  labs(
    x = "Quartiles of the Standardized Reading Score",
    y = "Frequency",
    title = "Bar-chart of Reading Quartiles",
    subtitle = "(by Sex)"
  )
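
If you want to see the numbers behind the bars, a cross-tabulation gives them directly. The sketch below uses a made-up toy data frame (its values are hypothetical; only the column names mirror myhsb): each cell of the table is the height of one dodged bar, and each row total is the height of the corresponding stacked bar.

```r
# Hypothetical toy data; the real columns live in myhsb
toy <- data.frame(
  read.quartiles = factor(c("Q1", "Q1", "Q2", "Q2", "Q2", "Q3")),
  female.f       = factor(c("Male", "Female", "Female", "Female", "Male", "Male"))
)

# One cell per dodged bar: rows are quartiles, columns are sexes
bar.heights <- table(toy$read.quartiles, toy$female.f)
bar.heights

# Row totals are the heights of the stacked bars
stacked.heights <- rowSums(bar.heights)
stacked.heights
```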

Alternatively, you could have also put each sex side-by-side via facet_grid().

ggplot(
  data = myhsb
  ) +
  geom_bar(
      aes(x = read.quartiles)
  ) +
  facet_grid(~ female.f) +
  labs(
    x = "Quartiles of the Standardized Reading Score",
    y = "Frequency",
    title = "Bar-chart of Reading Quartiles",
    subtitle = "(by Sex)"
  )

Now, the beautiful thing here is that you can also break this out by race/ethnicity. How would that work?

ggplot(
  data = myhsb
  ) +
  geom_bar(
      aes(x = read.quartiles)
  ) +
  facet_grid(race.f ~ female.f) +
  labs(
    x = "Quartiles of the Standardized Reading Score",
    y = "Frequency",
    title = "Bar-chart of Reading Quartiles",
    subtitle = "(by Race/Ethnicity and Sex)"
  )

Maybe you do not want it this way but rather each sex within each race/ethnicity?

ggplot(
  data = myhsb
  ) +
  geom_bar(
      aes(x = read.quartiles, fill = female.f),
      position = "dodge"
  ) +
  facet_wrap(~ race.f) +
  labs(
    x = "Quartiles of the Standardized Reading Score",
    y = "Frequency",
    title = "Bar-chart of Reading Quartiles",
    subtitle = "(by Race/Ethnicity and Sex)",
    fill = ""
  ) +
  theme(legend.position = "bottom")

You could add a fourth dimension if you needed to:

ggplot(
  data = myhsb
  ) +
  geom_bar(
      aes(x = read.quartiles, fill = female.f),
      position = "dodge"
  ) +
  facet_wrap(race.f ~ schtyp.f, ncol = 2) +
  labs(
    x = "Quartiles of the Standardized Reading Score",
    y = "Frequency",
    title = "Bar-chart of Reading Quartiles",
    subtitle = "(by Race/Ethnicity, Sex, and School Type)",
    fill = ""
  ) +
  theme(legend.position = "bottom")


Scatter-plots need both the x and y variables to be numeric, and instead of geom_bar() we use geom_point().

ggplot(data = myhsb) +
  geom_point(aes(x = read, y = math)) +
  labs(
    x = "Standardized Reading Score",
    y = "Standardized Mathematics Score"
  )

Again, if we wanted to break this out by sex and/or race/ethnicity, we could try the following:

ggplot(data = myhsb) +
  geom_point(aes(x = read, y = math, color = female.f)) +
  labs(
    x = "Standardized Reading Score",
    y = "Standardized Mathematics Score",
    color = ""
  ) +
  facet_wrap(~ race.f) +
  theme(legend.position = "bottom")


A common plot these days is the line-plot that shows COVID-19 cases over time. Well, let us pull the data for Ohio and draw a few such plots. You will see some code that combines the county-level data to create state-level data.


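library(readr) # read_csv()
library(dplyr) # %>%, filter(), mutate(), group_by(), summarise(), arrange()

read_csv("") %>%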
  filter(County != "Grand Total") %>%
  janitor::clean_names() -> c19

c19 %>%
  mutate(date = lubridate::mdy(onset_date)) %>%
  group_by(date) %>% # perform the calculations by date
  summarise(cases = sum(case_count) # total cases across counties
            ) -> c19.ohio

Now the plot …

ggplot(data = c19.ohio) +
  geom_line(aes(x = date, y = cases)) +
  labs(
    x = "Date",
    y = "Number of Cases Reported"
  )

There you have it!

Now say you wanted to look at the trend by age-group and sex. How could we do that? Well, the first thing we would need to do is modify the aggregation code above so that total cases are calculated by date, age-group, and sex.

c19 %>%
  mutate(date = lubridate::mdy(onset_date)) %>%
  group_by(date, age_range, sex) %>% # perform the calculations by date, age, & sex
  summarise(cases = sum(case_count) # total cases across counties
            ) -> c19.ohio2

Excellent! Now we can modify our plotting command.

ggplot(data = c19.ohio2) +
  geom_line(aes(x = date, y = cases, color = sex)) +
  facet_wrap(~ age_range, ncol = 3) +
  labs(
    x = "Date",
    y = "Number of Cases Reported",
    color = "Sex"
  ) +
  theme(legend.position = "bottom")

A word of caution: The groups differ in population size, so an accurate reading of which age + sex group has increasing or decreasing rates would require converting the case counts to rates (for example, cases per 100,000 people) using each group's population size.
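
To make the caution concrete, here is a minimal sketch of converting counts to rates per 100,000 people. All numbers are invented for illustration and are not Ohio figures; the population column is an assumption, since the COVID-19 file used here does not include population sizes.

```r
# Hypothetical case counts and population sizes (NOT real Ohio figures)
toy <- data.frame(
  age_range  = c("20-29", "20-29", "80+", "80+"),
  sex        = c("Female", "Male", "Female", "Male"),
  cases      = c(120, 100, 60, 40),
  population = c(800000, 800000, 200000, 100000)
)

# Cases per 100,000 people in each age + sex group
toy$rate.per.100k <- toy$cases / toy$population * 100000
toy
```

In this made-up example the oldest men have the fewest cases but the highest rate, which is exactly the kind of reversal that raw counts can hide.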

Before we move on, here is a variation of the preceding line-plot.

ggplot(data = c19.ohio2) +
  geom_line(aes(x = date, y = cases, color = age_range)) +
  facet_wrap(~ sex, ncol = 3) +
  labs(
    x = "Date",
    y = "Number of Cases Reported",
    color = "Age-Groups"
  ) +
  theme(legend.position = "bottom")

Notice that the preceding graph looks scrunched up because the y-axis range is being held constant. We could allow it to vary.

ggplot(data = c19.ohio2) +
  geom_line(aes(x = date, y = cases, color = age_range)) +
  facet_wrap(~ sex, ncol = 3, scales = "free_y") +
  labs(
    x = "Date",
    y = "Number of Cases Reported",
    color = "Age-Groups"
  ) +
  theme(legend.position = "bottom")


I use box-plots a lot when I have a numeric measure and want to see variation within and across groups. The first plot will be for both male and female students, but then we can disaggregate it.

ggplot(data = myhsb) +
  geom_boxplot(aes(x = read, y = "")) +
  labs(x = "Standardized Reading Score",
       y = "")

So the median reading score is about 50, and the distribution of reading scores is positively skewed. Does the distribution vary by the student’s sex?

ggplot(data = myhsb) +
  geom_boxplot(aes(x = read, y = female.f, fill = female.f)) +
  labs(x = "Standardized Reading Score",
       y = "Student's Sex") +
  theme(legend.position = "none")
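
If you want the numbers a box-plot summarizes, you can compute the quartiles by group directly. This is a minimal sketch on made-up scores (the values in toy are hypothetical; the real ones are in myhsb). It uses quantile(), which is how geom_boxplot() is documented to compute the edges of the box.

```r
# Hypothetical scores; the real values are in myhsb$read
toy <- data.frame(
  read     = c(40, 45, 50, 55, 60, 42, 47, 52, 57, 70),
  female.f = rep(c("Male", "Female"), each = 5)
)

# Median by group: the line drawn inside each box
group.medians <- tapply(toy$read, toy$female.f, median)
group.medians

# First quartile, median, and third quartile by group: the box itself
group.quartiles <- tapply(
  toy$read, toy$female.f, quantile, probs = c(0.25, 0.5, 0.75)
)
group.quartiles
```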

Simple Maps

One of the many things R does with ease and aesthetics is to allow you to build maps with geographies filled-in by a color scheme to represent low versus high values of some measure of interest. Let us assume that we want to look at a specific date or date-range for which we have county-level COVID-19 data. I will use all days in July thus far. Since the county data also have breakouts by age and sex I will have to aggregate these so that we have a single, cumulative count of cases per county.

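c19 %>%
  mutate(date = lubridate::mdy(onset_date)) %>% # parse the text dates first
  filter(date >= as.Date("2020-07-01")) %>% # keep July 1, 2020 onward
  group_by(county) %>%
  summarise(cases = sum(case_count)) -> mydf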
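
A caution when filtering on dates stored as text in month/day/year form, such as "7/1/2020": R compares character strings alphabetically rather than chronologically, which is why it helps to work with parsed dates. The sketch below uses made-up dates to show how the two comparisons can disagree.

```r
# Hypothetical dates stored as month/day/year text
onset <- c("6/30/2020", "9/15/2019", "10/1/2020")

# Text comparison goes character by character, so "9/15/2019" looks
# later than "7/1/2020" and "10/1/2020" looks earlier
text.result <- onset >= "7/1/2020"
text.result

# Parse to Date first, then compare chronologically
parsed <- as.Date(onset, format = "%m/%d/%Y")
date.result <- parsed >= as.Date("2020-07-01")
date.result
```

Only the third date actually falls on or after July 1, 2020, and only the parsed comparison picks it out.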

Note some counties may be missing if they had no case reported in July thus far, and that is fine. How do we take this and make a simple map?

The first thing we will do is load the urbnmapr package, built specifically by the Urban Institute to make mapping the states and counties pretty easy.

library(urbnmapr)
names(counties) # column names of the counties data-set

 [1] "long"        "lat"         "order"       "hole"       
 [5] "piece"       "group"       "county_fips" "state_abbv" 
 [9] "state_fips"  "county_name" "fips_class"  "state_name" 

Look at the counties data-set to get a feel for the columns’ contents. This has every state but we only need Ohio. No problem. And our COVID-19 data has the county names without the word “County” that appears in county_name in the counties data. No problem. Let us tackle both these things next.

counties %>%
  filter(state_abbv == "OH") -> ohmap

gsub(" County", "", ohmap$county_name) -> ohmap$county 
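
Before merging, it can help to check that the cleaned names actually line up. Here is a minimal sketch with made-up vectors; map.names and case.names are hypothetical stand-ins for ohmap$county_name and mydf$county. The pattern adds a $ so that only a trailing " County" is removed, which is a slightly stricter variant of the call above.

```r
# Hypothetical county names in the two styles described above
map.names  <- c("Athens County", "Franklin County", "Vinton County")
case.names <- c("Athens", "Franklin")

# The $ anchors the match to the end of the string
stripped <- gsub(" County$", "", map.names)
stripped

# Map counties with no matching row in the case data;
# these would end up with missing case counts after a merge
unmatched <- setdiff(stripped, case.names)
unmatched
```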

Now that we have a column called county in ohmap and in mydf, we can merge these two data-sets as follows:

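# NOTE: the name of the merged object is missing from this copy of the handout;
# ohmap.c19 is a placeholder name used here and in the plot below
merge(ohmap, mydf, by = c("county"), all = TRUE) %>%
  arrange(group, order) -> ohmap.c19

Now the map itself!

ggplot(data = ohmap.c19) +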
  geom_polygon(aes(x = long, y = lat, group = group, fill = cases),
               color = "white") + 
  coord_fixed(1.3) +
  ggthemes::theme_map() + 
  theme(legend.position = "bottom") +
  labs(fill = "",
       title = "Number of Cases Reported",
       subtitle = "by County, July 01, 2020 through Present")
