ggplot2 is one of the more popular R packages for data visualization, and hence it is the package I will walk you through. It can do a lot, but we will focus only on the very basics: the types of visualizations you may need for program evaluation. In particular, we will look at bar-charts, histograms, box-plots, scatter-plots, line-charts, and, if we get that far, maybe a simple map or two. I will stick with our myhsb data, so let us load our file and the ggplot2 library as well.
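If you do not have the myhsb file handy, the plotting code in this section can still be tried on a simulated stand-in. The sketch below is only an assumption about the shape of the data: the name myhsb_demo, the category labels, and the simulated scores are all made up for illustration, and only the column names are taken from the code used in this section.

```r
library(ggplot2)

set.seed(123)  # make the simulated stand-in reproducible
n <- 200

# Hypothetical stand-in for myhsb; every value below is simulated
myhsb_demo <- data.frame(
  read     = round(rnorm(n, mean = 52, sd = 10)),
  math     = round(rnorm(n, mean = 52, sd = 9)),
  female.f = factor(sample(c("Male", "Female"), n, replace = TRUE)),
  race.f   = factor(sample(c("Group A", "Group B", "Group C"), n, replace = TRUE)),
  schtyp.f = factor(sample(c("Public", "Private"), n, replace = TRUE))
)

# Cut the reading score into four quartile groups
myhsb_demo$read.quartiles <- cut(
  myhsb_demo$read,
  breaks = quantile(myhsb_demo$read, probs = seq(0, 1, by = 0.25)),
  include.lowest = TRUE,
  labels = c("Q1", "Q2", "Q3", "Q4")
)

str(myhsb_demo)
```

Swapping myhsb_demo in for myhsb lets the commands run, but the resulting pictures say nothing about the real students.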
Let us build a bar-chart of read.quartiles:
ggplot(
data = myhsb, # the data to be used
aes(x = read.quartiles) # what should go on the x-axis
) +
geom_bar() # the type of geom we want
This is equivalent to moving the aes(...) call inside geom_bar(...), as shown below.
ggplot(
data = myhsb # the data to be used
) +
geom_bar(
aes(x = read.quartiles) # what should go on the x-axis
) # the type of geom we want
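One way to convince yourself that the two forms are equivalent is to compare the data each plot actually draws. This is a minimal sketch that assumes ggplot2's built-in mpg data as a stand-in for myhsb, with class playing the role of read.quartiles; layer_data() is the ggplot2 helper that returns a layer's computed data.

```r
library(ggplot2)

# aes() supplied once, at the plot level
p_global <- ggplot(data = mpg, aes(x = class)) +
  geom_bar()

# aes() supplied inside the geom instead
p_local <- ggplot(data = mpg) +
  geom_bar(aes(x = class))

# Compare the bar data each version computes
same_bars <- isTRUE(all.equal(layer_data(p_global), layer_data(p_local)))
same_bars
```

The placement starts to matter only once a plot has several layers: aesthetics set in ggplot() are inherited by every geom, while those set inside a geom apply to that geom alone.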
We can spruce things up by improving the labels on the x-axis and y-axis, and by adding a title to the plot.
ggplot(
data = myhsb
) +
geom_bar(
aes(x = read.quartiles)
) +
labs(
x = "Quartiles of the Standardized Reading Score",
y = "Frequency",
title = "Bar-chart of Reading Quartiles"
)
Now you might want to explore differences between male and female students here. Since the bars are gray for every group by default, we can add a fill aesthetic, which assigns a distinct color to each unique level of the variable we specify.
ggplot(
data = myhsb
) +
geom_bar(
aes(x = read.quartiles, fill = female.f)
) +
labs(
x = "Quartiles of the Standardized Reading Score",
y = "Frequency",
title = "Bar-chart of Reading Quartiles",
subtitle = "(by Sex)"
)
Notice that this just stacks one group on top of the other, making comparison difficult. One way to avoid this is to tell R to dodge the bars, i.e., put them side-by-side.
ggplot(
data = myhsb
) +
geom_bar(
aes(x = read.quartiles, fill = female.f),
position = "dodge"
) +
labs(
x = "Quartiles of the Standardized Reading Score",
y = "Frequency",
title = "Bar-chart of Reading Quartiles",
subtitle = "(by Sex)"
)
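If you want to check what the dodged bars are showing, each bar is simply the count of rows for one combination of the x variable and the fill variable. A rough sketch, assuming the built-in mpg data as a stand-in (class on the x-axis, drv as the fill):

```r
library(ggplot2)

p_dodge <- ggplot(data = mpg) +
  geom_bar(aes(x = class, fill = drv), position = "dodge")

# One row per bar that gets drawn
bars <- layer_data(p_dodge)

# The same counts from a plain cross-tabulation
counts <- as.data.frame(table(class = mpg$class, drv = mpg$drv))
counts <- counts[counts$Freq > 0, ]  # empty combinations get no bar

nrow(bars) == nrow(counts)       # same number of bars as non-empty cells
sum(bars$count) == nrow(mpg)     # every row of the data is counted once
```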
Alternatively, you could put each sex in its own side-by-side panel via facet_grid().
ggplot(
data = myhsb
) +
geom_bar(
aes(x = read.quartiles)
) +
facet_grid(~ female.f) +
labs(
x = "Quartiles of the Standardized Reading Score",
y = "Frequency",
title = "Bar-chart of Reading Quartiles",
subtitle = "(by Sex)"
)
Now, the beautiful thing here is that you can also break this out by race/ethnicity. How would that work?
ggplot(
data = myhsb
) +
geom_bar(
aes(x = read.quartiles)
) +
facet_grid(race.f ~ female.f) +
labs(
x = "Quartiles of the Standardized Reading Score",
y = "Frequency",
title = "Bar-chart of Reading Quartiles",
subtitle = "(by Race/Ethnicity and Sex)"
)
Maybe you do not want it this way but rather each sex within each race/ethnicity?
ggplot(
data = myhsb
) +
geom_bar(
aes(x = read.quartiles, fill = female.f),
position = "dodge"
) +
facet_wrap(~ race.f) +
labs(
x = "Quartiles of the Standardized Reading Score",
y = "Frequency",
title = "Bar-chart of Reading Quartiles",
subtitle = "(by Race/Ethnicity and Sex)",
fill = ""
) +
theme(legend.position = "bottom")
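The two faceting functions lay panels out differently: in facet_grid() the variable to the left of ~ defines the rows of panels and the one to the right defines the columns, whereas facet_wrap() wraps a single sequence of panels into as many rows as it needs. A small sketch of the difference, again assuming the built-in mpg data as a stand-in, with drv and year as the faceting variables:

```r
library(ggplot2)

base_plot <- ggplot(data = mpg) +
  geom_bar(aes(x = class))

p_grid <- base_plot + facet_grid(drv ~ year)       # rows = drv, columns = year
p_wrap <- base_plot + facet_wrap(~ drv, ncol = 2)  # panels wrapped into 2 columns

# The built plot records where each panel sits
grid_layout <- ggplot_build(p_grid)$layout$layout
wrap_layout <- ggplot_build(p_wrap)$layout$layout

grid_layout[, c("PANEL", "ROW", "COL")]
wrap_layout[, c("PANEL", "ROW", "COL")]
```

facet_grid() draws a panel for every row-by-column combination, even an empty one, which is worth keeping in mind when some combinations have few or no observations.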
You could add a fourth dimension if you needed to:
ggplot(
data = myhsb
) +
geom_bar(
aes(x = read.quartiles, fill = female.f),
position = "dodge"
) +
facet_wrap(race.f ~ schtyp.f, ncol = 2) +
labs(
x = "Quartiles of the Standardized Reading Score",
y = "Frequency",
title = "Bar-chart of Reading Quartiles",
subtitle = "(by Race/Ethnicity, Sex, and School Type)",
fill = ""
) +
theme(legend.position = "bottom")
Scatter-plots need both the x and y variables to be numeric, and instead of geom_bar() we use geom_point().
ggplot(data = myhsb) +
geom_point(aes(x = read, y = math)) +
labs(
x = "Standardized Reading Score",
y = "Standardized Mathematics Score"
)
Again, if we wanted to break this out by sex and/or race/ethnicity, we could try the following:
ggplot(data = myhsb) +
geom_point(aes(x = read, y = math, color = female.f)) +
labs(
x = "Standardized Reading Score",
y = "Standardized Mathematics Score",
color = ""
) +
facet_wrap(~ race.f) +
theme(legend.position = "bottom")
A common plot these days is the line-plot that shows COVID-19 cases over time. Well, let us pull the data for Ohio and draw a few such plots. You will see some code that combines the county-level data to create state-level data.
library(tidyverse)
library(tidylog)
read_csv("https://coronavirus.ohio.gov/static/dashboards/COVIDSummaryData.csv") %>%
filter(County != "Grand Total") %>%
janitor::clean_names() -> c19
c19 %>%
mutate(date = lubridate::mdy(onset_date)) %>%
group_by(date) %>% # perform the calculations by date
summarise(cases = sum(case_count) # total cases across counties
) -> c19.ohio
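To see what the group_by() and summarise() steps are doing, here is the same pattern on a tiny, made-up table. The county rows and counts are hypothetical; only the column names onset_date and case_count mirror the ones used above.

```r
library(dplyr)
library(lubridate)

# Hypothetical county-level rows: two counties on each of two days
toy <- tibble(
  county     = c("County A", "County B", "County A", "County B"),
  onset_date = c("7/1/2020", "7/1/2020", "7/2/2020", "7/2/2020"),
  case_count = c(1, 3, 2, 5)
)

toy_state <- toy %>%
  mutate(date = mdy(onset_date)) %>%    # month/day/year text -> Date
  group_by(date) %>%                    # one group per date
  summarise(cases = sum(case_count))    # add up the counties within each date

toy_state
```

Four county rows collapse to one row per date, which is exactly the shape geom_line() needs for a single state-level line.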
Now the plot …
ggplot(data = c19.ohio) +
geom_line(aes(x = date, y = cases)) +
labs(
x = "Date",
y = "Number of Cases Reported"
)
There you have it!
Now say you wanted to look at the trend by age-group and sex. How could we do that? Well, the first thing we would need to do is modify our covid19-01 code so that total cases are calculated by date, age-group, and sex.
c19 %>%
mutate(date = lubridate::mdy(onset_date)) %>%
group_by(date, age_range, sex) %>% # perform the calculations by date, age, & sex
summarise(cases = sum(case_count) # total cases across counties
) -> c19.ohio2
Excellent! Now we can modify our plotting command.
ggplot(data = c19.ohio2) +
geom_line(aes(x = date, y = cases, color = sex)) +
facet_wrap(~ age_range, ncol = 3) +
labs(
x = "Date",
y = "Number of Cases Reported",
color = "Sex"
) +
theme(legend.position = "bottom")
A word of caution: the groups differ in population size, so accurately judging which age-and-sex group has increasing or decreasing rates would require that we weight the cases by the respective population sizes.
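To make the caution concrete, here is a rough sketch of converting counts to rates per 100,000 people. Every number in it is hypothetical, and toy_pop stands in for a population table we do not have in this section; the point is only the join-then-divide pattern.

```r
library(dplyr)

# Hypothetical case counts by age-group and sex
toy_cases <- tibble(
  age_range = c("20-29", "20-29", "80+", "80+"),
  sex       = c("Female", "Male", "Female", "Male"),
  cases     = c(120, 110, 40, 30)
)

# Hypothetical population sizes for the same groups
toy_pop <- tibble(
  age_range  = c("20-29", "20-29", "80+", "80+"),
  sex        = c("Female", "Male", "Female", "Male"),
  population = c(800000, 820000, 150000, 90000)
)

toy_rates <- toy_cases %>%
  left_join(toy_pop, by = c("age_range", "sex")) %>%
  mutate(rate_per_100k = cases / population * 1e5)

toy_rates
```

In this made-up example the oldest groups have the fewest cases but the highest rates, which is the kind of reversal raw counts can hide.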
Before we move on, here is a variation of the preceding line-plot.
ggplot(data = c19.ohio2) +
geom_line(aes(x = date, y = cases, color = age_range)) +
facet_wrap(~ sex, ncol = 3) +
labs(
x = "Date",
y = "Number of Cases Reported",
color = "Age-Groups"
) +
theme(legend.position = "bottom")
Notice that the preceding graph looks scrunched up because the y-axis range is being held constant across the panels. We could allow it to vary.
ggplot(data = c19.ohio2) +
geom_line(aes(x = date, y = cases, color = age_range)) +
facet_wrap(~ sex, ncol = 3, scales = "free_y") +
labs(
x = "Date",
y = "Number of Cases Reported",
color = "Age-Groups"
) +
theme(legend.position = "bottom")
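The effect of scales = "free_y" can also be inspected directly by looking at the y-axis range each panel ends up with. This is a sketch that assumes ggplot2's built-in economics_long data as a stand-in, since its variables sit on very different scales; y_ranges() is a small helper written just for this illustration.

```r
library(ggplot2)

base_plot <- ggplot(data = economics_long) +
  geom_line(aes(x = date, y = value))

p_fixed <- base_plot + facet_wrap(~ variable)                    # shared y-axis
p_free  <- base_plot + facet_wrap(~ variable, scales = "free_y") # one y-axis per panel

# Helper (for illustration only): y-axis limits of every panel, one row per panel
y_ranges <- function(p) {
  params <- ggplot_build(p)$layout$panel_params
  t(sapply(params, function(pp) pp$y.range))
}

fixed_ranges <- y_ranges(p_fixed)
free_ranges  <- y_ranges(p_free)

fixed_ranges
free_ranges
```

Freeing the y-axis makes each panel easier to read on its own, but heights can no longer be compared across panels, so it is worth saying so in a caption when you use it.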
I use box-plots a lot when I have a numeric measure and want to see variation within and across groups. The first plot will be for both male and female students, but then we can disaggregate it.
ggplot(data = myhsb) +
geom_boxplot(aes(x = read, y = "")) +
labs(x = "Standardized Reading Score",
y = "")
So the median reading score is about 50, and the distribution of reading scores is positively skewed. Does the distribution vary by the student’s sex?
ggplot(data = myhsb) +
geom_boxplot(aes(x = read, y = female.f, fill = female.f)) +
labs(x = "Standardized Reading Score",
y = "Student's Sex") +
theme(legend.position = "none") # "none" drops the legend; "hide" is not a valid position
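What you read off a box-plot can be cross-checked with a few summary statistics. The sketch below uses a short, made-up vector of scores (not the myhsb data): fivenum() returns Tukey's five-number summary (minimum, lower hinge, median, upper hinge, maximum), which is close to what the box displays, and a mean that sits above the median is one rough sign of positive skew.

```r
# Hypothetical scores, for illustration only
scores <- c(28, 34, 39, 42, 44, 47, 47, 50, 50, 52, 55, 57, 60, 63, 68, 73, 76)

five_num <- fivenum(scores)   # min, lower hinge, median, upper hinge, max
five_num

score_median <- median(scores)
score_mean   <- mean(scores)

# A mean above the median hints at a longer right tail
score_mean > score_median
```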
One of the many things R does with ease, and attractively, is build maps in which geographies are filled in with a color scheme that represents low versus high values of some measure of interest. Let us assume that we want to look at a specific date or date-range for which we have county-level COVID-19 data. I will use all days in July thus far. Since the county data also have breakouts by age and sex, I will have to aggregate these so that we have a single, cumulative count of cases per county.
c19 %>%
filter(lubridate::mdy(onset_date) >= as.Date("2020-07-01")) %>% # parse the m/d/Y text so the comparison is on dates, not strings
group_by(county) %>%
summarise(cases = sum(case_count)) -> mydf
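A note on filtering by date: onset_date arrives as month/day/year text (which is why it is passed through lubridate::mdy() elsewhere in this section), and comparing such text with >= works character by character rather than chronologically. A small sketch of the difference, using made-up date strings:

```r
library(lubridate)

# Hypothetical onset dates stored as text
dates_chr <- c("3/15/2020", "7/4/2020", "10/2/2020")

# Text comparison: "10/2/2020" sorts before "7/1/2020" because "1" < "7"
chr_cmp <- dates_chr >= "7/1/2020"

# Date comparison: parse first, then compare against a real Date
date_cmp <- mdy(dates_chr) >= as.Date("2020-07-01")

data.frame(dates_chr, chr_cmp, date_cmp)
```

The October date is dropped by the text comparison but kept by the date comparison, so it is safer to parse before filtering.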
Note that some counties may be missing if they had no cases reported in July thus far, and that is fine. How do we take this and make a simple map?
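The first thing we will do is load the urbnmapr package, built by the Urban Institute specifically to make mapping the states and counties pretty easy, and list the column names of its counties data-set.
library(urbnmapr)
names(counties) # column names of the county-level map data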
[1] "long" "lat" "order" "hole"
[5] "piece" "group" "county_fips" "state_abbv"
[9] "state_fips" "county_name" "fips_class" "state_name"
Look at the counties data-set to get a feel for the columns’ contents. It has every state, but we only need Ohio. No problem. Also, our COVID-19 data have the county names without the word “County” that appears in county_name in the counties data. No problem. Let us tackle both of these things next.
counties %>%
filter(state_abbv == "OH") -> ohmap
gsub(" County", "", ohmap$county_name) -> ohmap$county
Now that we have a column called county in both ohmap and mydf, we can merge these two data-sets as follows:
merge(ohmap, mydf, by = c("county"), all = TRUE) -> map.data
map.data %>%
arrange(group, order) -> map.data
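Here is the clean-then-merge pattern in miniature, on made-up tables, so you can see what all = TRUE does for a county that has map rows but no case count. The county names and counts are hypothetical.

```r
# Hypothetical map rows: names carry the " County" suffix
toy_map <- data.frame(
  county_name = c("Alpha County", "Beta County", "Gamma County"),
  stringsAsFactors = FALSE
)
toy_map$county <- gsub(" County", "", toy_map$county_name)  # strip the suffix

# Hypothetical case counts: Gamma reported nothing, so it has no row
toy_cases <- data.frame(
  county = c("Alpha", "Beta"),
  cases  = c(4, 7),
  stringsAsFactors = FALSE
)

# all = TRUE keeps rows from both tables; unmatched counties get NA cases
toy_merged <- merge(toy_map, toy_cases, by = "county", all = TRUE)
toy_merged
```

Keeping the unmatched county (with NA cases) means its outline is still drawn on the map instead of leaving a hole. The arrange(group, order) step matters for a similar reason: merge() can reshuffle rows, and geom_polygon() connects points in the order they appear.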
Now the map itself!
ggplot(data = map.data) +
geom_polygon(aes(x = long, y = lat, group = group, fill = cases),
color = "white") +
coord_fixed(1.3) +
ggthemes::theme_map() +
theme(legend.position = "bottom") +
labs(fill = "",
title = "Number of Cases Reported",
subtitle = "by County, July 01, 2020 through Present")