ggplot2 is one of the more popular R packages for data visualization and hence is the package I will walk you through. It can do a lot, but we will focus only on the very basics, the type of visualizations you may need for program evaluation. In particular, we will look at bar-charts, histograms, box-plots, scatter-plots, line-charts, and, if we get that far, maybe a simple map or two. I will stick with our myhsb data, so let us load our file and the ggplot2 library as well.
Let us build a bar-chart of read.quartiles.
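Here is a minimal sketch of such a bar-chart. Since the data file is not shown here, I am assuming a small made-up data frame called toy (hypothetical, standing in for myhsb) with a read.quartiles factor.

```r
library(ggplot2)

# Hypothetical stand-in for myhsb: only the column we need, with made-up counts
toy <- data.frame(
  read.quartiles = factor(rep(c("Q1", "Q2", "Q3", "Q4"), times = c(40, 55, 60, 45)))
)

# The aesthetic mapping is declared once, inside ggplot()
p <- ggplot(data = toy, aes(x = read.quartiles)) +
  geom_bar()
p
```

geom_bar() counts the rows in each level of the x variable, so no y variable is needed.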
Now, this is the same as moving all of the aes(...) commands to within geom_bar(...), as shown below.
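A sketch of that equivalent form, again on a made-up data frame (toy is hypothetical, not the real myhsb):

```r
library(ggplot2)

# Hypothetical stand-in for myhsb with made-up counts per quartile
toy <- data.frame(
  read.quartiles = factor(rep(c("Q1", "Q2", "Q3", "Q4"), times = c(40, 55, 60, 45)))
)

# Mapping declared in ggplot(): inherited by every layer
p.global <- ggplot(data = toy, aes(x = read.quartiles)) +
  geom_bar()

# Mapping declared in geom_bar(): applies to this layer only
p.local <- ggplot(data = toy) +
  geom_bar(aes(x = read.quartiles))
p.local
```

With a single layer the two plots are identical; the distinction only starts to matter once you add more layers that should use different mappings.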
We can spruce things up by improving the labels on the x-axis and y-axis, and by adding a title to the plot.
Now you might want to explore differences between male and female students here. Since the bars will be gray for both, we can use a fill statement that will lean on the unique levels of the variable we specify and assign a unique color to each level.
Notice this just stacks one group on top of another, making comparison difficult. One way to avoid this would be to tell R to dodge the bars, i.e., put them side-by-side.
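The stacked and dodged versions can be sketched as follows. The data frame toy and its counts are made up for illustration; only the column names read.quartiles and female.f come from the tutorial.

```r
library(ggplot2)

# Hypothetical counts for each quartile-by-sex combination
counts <- expand.grid(
  read.quartiles = c("Q1", "Q2", "Q3", "Q4"),
  female.f = c("Male", "Female")
)
counts$n <- c(20, 25, 30, 25, 22, 28, 27, 23)

# Expand to one row per (made-up) student
toy <- counts[rep(seq_len(nrow(counts)), counts$n), c("read.quartiles", "female.f")]

# Default position: bars for each sex are stacked on top of one another
p.stack <- ggplot(data = toy) +
  geom_bar(aes(x = read.quartiles, fill = female.f))

# position = "dodge": bars for each sex sit side-by-side
p.dodge <- ggplot(data = toy) +
  geom_bar(aes(x = read.quartiles, fill = female.f), position = "dodge")
p.dodge
```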
Alternatively, you could have also put each sex side-by-side via facet_grid().
ggplot(
  data = myhsb
) +
  geom_bar(
    aes(x = read.quartiles)
  ) +
  facet_grid(~ female.f) +
  labs(
    x = "Quartiles of the Standardized Reading Score",
    y = "Frequency",
    title = "Bar-chart of Reading Quartiles",
    subtitle = "(by Sex)"
  )
Now, the beautiful thing here is that you can just as easily break this out by race/ethnicity as well. How could that work?
ggplot(
  data = myhsb
) +
  geom_bar(
    aes(x = read.quartiles)
  ) +
  facet_grid(race.f ~ female.f) +
  labs(
    x = "Quartiles of the Standardized Reading Score",
    y = "Frequency",
    title = "Bar-chart of Reading Quartiles",
    subtitle = "(by Race/Ethnicity and Sex)"
  )
Maybe you do not want it this way but rather each sex within each race/ethnicity?
ggplot(
  data = myhsb
) +
  geom_bar(
    aes(x = read.quartiles, fill = female.f),
    position = "dodge"
  ) +
  facet_wrap(~ race.f) +
  labs(
    x = "Quartiles of the Standardized Reading Score",
    y = "Frequency",
    title = "Bar-chart of Reading Quartiles",
    subtitle = "(by Race/Ethnicity and Sex)",
    fill = ""
  ) +
  theme(legend.position = "bottom")
You could add a fourth dimension if you needed to:
ggplot(
  data = myhsb
) +
  geom_bar(
    aes(x = read.quartiles, fill = female.f),
    position = "dodge"
  ) +
  facet_wrap(race.f ~ schtyp.f, ncol = 2) +
  labs(
    x = "Quartiles of the Standardized Reading Score",
    y = "Frequency",
    title = "Bar-chart of Reading Quartiles",
    subtitle = "(by Race/Ethnicity, Sex, and School Type)",
    fill = ""
  ) +
  theme(legend.position = "bottom")
Scatter-plots need both the x and y variables to be numeric, and instead of geom_bar() we use geom_point().
ggplot(data = myhsb) +
  geom_point(aes(x = read, y = math)) +
  labs(
    x = "Standardized Reading Score",
    y = "Standardized Mathematics Score"
  )
Again, if we wanted to break this out by sex and/or race/ethnicity, we could try the following:
ggplot(data = myhsb) +
  geom_point(aes(x = read, y = math, color = female.f)) +
  labs(
    x = "Standardized Reading Score",
    y = "Standardized Mathematics Score",
    color = ""
  ) +
  facet_wrap(~ race.f) +
  theme(legend.position = "bottom")
A common plot these days is the line-plot that shows COVID-19 cases over time. Well, let us pull the data for Ohio and draw a few such plots. You will see some code that combines the county-level data to create state-level data.
library(tidyverse)
library(tidylog)

read_csv("https://coronavirus.ohio.gov/static/dashboards/COVIDSummaryData.csv") %>%
  filter(County != "Grand Total") %>%
  janitor::clean_names() -> c19

c19 %>%
  mutate(date = lubridate::mdy(onset_date)) %>%
  group_by(date) %>%          # perform the calculations by date
  summarise(
    cases = sum(case_count)   # total cases across counties
  ) -> c19.ohio
Now the plot …
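A minimal sketch of a line-plot of this kind. I am assuming a made-up data frame toy.ohio (hypothetical) that has the same two columns as c19.ohio, namely date and cases; the axis labels are my own choice.

```r
library(ggplot2)

# Hypothetical stand-in for c19.ohio: one row per onset date with total cases
toy.ohio <- data.frame(
  date = seq(as.Date("2020-03-01"), by = "day", length.out = 10),
  cases = c(1, 3, 4, 8, 12, 15, 21, 30, 28, 35)
)

# geom_line() connects the observations in order of the x variable
p <- ggplot(data = toy.ohio) +
  geom_line(aes(x = date, y = cases)) +
  labs(x = "Onset Date", y = "Number of Cases")
p
```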
There you have it!
Now say you wanted to look at the trend by age-group and sex. How could we do that? Well, the first thing we would need to do would be to modify our covid19-01 code to calculate total cases as we want them to be.
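One way this modification might look is to add the grouping variables to group_by(). The sketch below runs on a tiny made-up data frame; the column names age_range and sex are my assumption about what the cleaned file contains, and toy.c19 is hypothetical.

```r
library(dplyr)

# Hypothetical stand-in for c19 (column names age_range and sex are assumed)
toy.c19 <- data.frame(
  onset_date = c("3/1/2020", "3/1/2020", "3/1/2020", "3/2/2020", "3/2/2020"),
  age_range  = c("20-29", "20-29", "30-39", "20-29", "30-39"),
  sex        = c("Female", "Female", "Male", "Male", "Male"),
  case_count = c(1, 2, 1, 3, 2)
)

toy.c19 %>%
  mutate(date = lubridate::mdy(onset_date)) %>%
  group_by(date, age_range, sex) %>%  # one group per date, age group, and sex
  summarise(
    cases = sum(case_count),          # total cases within each group
    .groups = "drop"
  ) -> toy.agesex

toy.agesex
```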
Excellent! Now we can modify our plotting command.
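A sketch of what such a plotting command could look like, with one line per sex and one panel per age group. This layout is an assumption on my part, and the data frame toy.agesex is made up for illustration.

```r
library(ggplot2)

# Hypothetical aggregated data: cases by date, age group, and sex
toy.agesex <- expand.grid(
  date = seq(as.Date("2020-03-01"), by = "day", length.out = 5),
  age_range = c("20-29", "30-39"),
  sex = c("Female", "Male")
)
toy.agesex$cases <- c(1, 2, 4, 3, 5,  2, 3, 3, 6, 8,
                      1, 1, 2, 4, 4,  0, 2, 5, 5, 9)

p <- ggplot(data = toy.agesex) +
  geom_line(aes(x = date, y = cases, color = sex)) +  # one line per sex
  facet_wrap(~ age_range) +                           # one panel per age group
  labs(x = "Onset Date", y = "Number of Cases", color = "") +
  theme(legend.position = "bottom")
p
```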
A word of caution: The relative size of each population is not the same so an accurate reading of which age + sex group has increasing or decreasing rates would require that we weight the cases by respective population sizes.
Before we move on, here is a variation of the preceding line-plot.
Notice that the preceding graph looks scrunched up because the y-axis range is being held constant. We could allow it to vary.
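Letting the y-axis vary is done with the scales argument of facet_wrap(). Here is a small sketch on made-up data (toy.agesex is hypothetical) where the two age groups have very different case counts:

```r
library(ggplot2)

# Hypothetical data: one age group has far fewer cases than the other
toy.agesex <- data.frame(
  date = rep(seq(as.Date("2020-03-01"), by = "day", length.out = 4), times = 2),
  age_range = rep(c("0-19", "20-29"), each = 4),
  cases = c(1, 2, 2, 3, 40, 55, 70, 90)
)

# scales = "free_y" gives each panel its own y-axis range
p.free <- ggplot(data = toy.agesex) +
  geom_line(aes(x = date, y = cases)) +
  facet_wrap(~ age_range, scales = "free_y") +
  labs(x = "Onset Date", y = "Number of Cases")
p.free
```

Keep in mind that with free scales the panels can no longer be compared by eye on the y-axis, so read the axis labels carefully.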
I use box-plots a lot when I have a numeric measure and want to see variation within and across groups. The first plot will be for both male and female students combined, but then we can disaggregate it.
ggplot(data = myhsb) +
  geom_boxplot(aes(x = read, y = "")) +
  labs(x = "Standardized Reading Score",
       y = "")
So the median reading score is about 50, and we have a positively-skewed distribution of reading scores. Does the distribution vary by the student’s sex?
ggplot(data = myhsb) +
  geom_boxplot(aes(x = read, y = female.f, fill = female.f)) +
  labs(x = "Standardized Reading Score",
       y = "Student's Sex") +
  theme(legend.position = "none")
One of the many things R does with ease and aesthetics is to allow you to build maps with geographies filled-in by a color scheme to represent low versus high values of some measure of interest. Let us assume that we want to look at a specific date or date-range for which we have county-level COVID-19 data. I will use all days in July thus far. Since the county data also have breakouts by age and sex I will have to aggregate these so that we have a single, cumulative count of cases per county.
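The aggregation could be sketched like this. The data frame toy.c19 is made up, and I am assuming the cleaned column names county, onset_date, and case_count.

```r
library(dplyr)

# Hypothetical stand-in for c19 with a few county-level rows
toy.c19 <- data.frame(
  county     = c("Athens", "Athens", "Franklin", "Franklin", "Franklin"),
  onset_date = c("6/30/2020", "7/2/2020", "7/1/2020", "7/5/2020", "7/5/2020"),
  sex        = c("Male", "Female", "Male", "Female", "Male"),
  case_count = c(4, 2, 5, 3, 1)
)

toy.c19 %>%
  mutate(date = lubridate::mdy(onset_date)) %>%
  filter(date >= as.Date("2020-07-01")) %>%  # keep July 1 onward
  group_by(county) %>%                       # collapse age and sex breakouts
  summarise(cases = sum(case_count), .groups = "drop") -> toy.july

toy.july
```

The June 30 row for Athens is dropped by the filter, so only the July cases are summed for each county.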
Note some counties may be missing if they had no case reported in July thus far, and that is fine. How do we take this and make a simple map?
The first thing we will do is load the urbnmapr package, built specifically by the Urban Institute to make mapping the states and counties pretty easy.
[1] "long" "lat" "order" "hole"
[5] "piece" "group" "county_fips" "state_abbv"
[9] "state_fips" "county_name" "fips_class" "state_name"
Look at the counties data-set to get a feel for the columns’ contents. This has every state but we only need Ohio. No problem. Also, our COVID-19 data has the county names without the word “County” that appears in county_name in the counties data. No problem. Let us tackle both these things next.
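Both steps can be sketched as follows. To keep the example self-contained I use a tiny made-up data frame (toy.counties, hypothetical) with a few of the same columns as counties rather than the real urbnmapr data.

```r
library(dplyr)

# Hypothetical stand-in for urbnmapr's counties data (coordinates are made up)
toy.counties <- data.frame(
  long        = c(-82.1, -83.0, -84.5),
  lat         = c(39.3, 40.0, 38.0),
  county_name = c("Athens County", "Franklin County", "Fayette County"),
  state_abbv  = c("OH", "OH", "KY")
)

toy.counties %>%
  filter(state_abbv == "OH") %>%                          # keep Ohio only
  mutate(county = sub(" County$", "", county_name)) ->    # drop trailing " County"
  toy.ohmap

toy.ohmap
```

The pattern " County$" only removes the word when it ends the name, so a county whose name merely contains "County" elsewhere would be left alone.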
Now that we have a column called county in ohmap and in mydf, we can merge these two data-sets as follows:
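A sketch of one way to do the merge, using dplyr's left_join() on made-up versions of the two data-sets (toy.ohmap and toy.cases are hypothetical). Keeping the map data on the left means every polygon point survives, even for counties with no reported cases.

```r
library(dplyr)

# Hypothetical map data: several boundary points per county
toy.ohmap <- data.frame(
  county = c("Athens", "Athens", "Franklin", "Franklin", "Adams"),
  long   = c(-82.1, -82.0, -83.0, -82.9, -83.5),
  lat    = c(39.3, 39.4, 40.0, 40.1, 38.8)
)

# Hypothetical case counts: one row per county, Adams is missing
toy.cases <- data.frame(
  county = c("Athens", "Franklin"),
  cases  = c(2, 9)
)

# Every row of the map data is kept; counties without cases get NA
toy.map.data <- left_join(toy.ohmap, toy.cases, by = "county")
toy.map.data
```

Counties with NA for cases are drawn in ggplot2's default missing-value color when the map is filled.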
Now the map itself!
ggplot(data = map.data) +
  geom_polygon(aes(x = long, y = lat, group = group, fill = cases),
               color = "white") +
  coord_fixed(1.3) +
  ggthemes::theme_map() +
  theme(legend.position = "bottom") +
  labs(fill = "",
       title = "Number of Cases Reported",
       subtitle = "by County, July 01, 2020 through Present")