ggplot2

“A grammar of graphics is a tool that enables us to concisely describe the components of a graphic. Such a grammar allows us to move beyond named graphics (e.g., the ‘scatterplot’) and gain insight into the deep structure that underlies statistical graphics.”
— Hadley Wickham: in A Layered Grammar of Graphics

We are used to describing graphics with simple labels: a scatterplot, a line chart, a map, and so on. But such labels hide the details that go into a graphic's construction. The ggplot2 package has been built with an eye on capturing all the layers that build up to the final graphic. The underlying philosophy can perhaps be captured most simply as follows.

data = cleaned and mutated or summarized to give us what we’d like to visualize

geom = what kind of a visual do you want? A map, bar-chart, line-chart, scatter-plot, something else?

coordinate system = what should go on the x-axis? y-axis?

other aesthetics = what else should be used? Border colors, fill colors, text or other annotations, plotting symbols, facets that break the plot out by some attribute, something else?
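To make the pieces above concrete, here is a minimal sketch of how they show up in code. It assumes a small made-up data frame (toy, with hypothetical columns college and sex), not the workshop data.

```r
library(ggplot2)

# Hypothetical toy data standing in for the workshop data frame
toy <- data.frame(
  college = c("Arts", "Arts", "Business", "Education", "Business", "Arts"),
  sex     = c("Female", "Male", "Male", "Female", "Female", "Female")
)

p <- ggplot(data = toy,                          # data
            aes(x = college, fill = college)) +  # x-axis variable plus a fill aesthetic
  geom_bar() +                                   # geom: a bar-chart of counts
  facet_wrap(~ sex)                              # break the plot out by an attribute

p  # printing the object draws the plot
```

Each piece is added with +, so a plot can be built up (and debugged) one layer at a time.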

Let us understand how ggplot2 works by first loading up the multkey.merge data frame and building the baseplot.

load("data/multkey.merge.RData")
my.df <- multkey.merge

Note that I am setting up my.df as a carbon copy of multkey.merge so I have a shorter data frame name to work with.

The basic plot starts with the ggplot() command where we would reference the data frame to be used.

library(tidyverse)
ggplot(data = my.df)

Uh oh! The canvas is blank because we have not specified what goes on the x-axis or the y-axis. Those details are specified via aes(), the aesthetic mappings.

ggplot(data = my.df, aes(x = college_desc))

Aha! Now we see the labels on the x-axis but nothing more. Why? Because we have not specified what type of a graphic we want … a bar-chart perhaps?

ggplot(data = my.df, aes(x = college_desc)) +
  geom_bar()

Notice that geom_bar() generates a bar-chart and that, by default, the y-axis shows the frequency (the number of rows) for each category on the x-axis.
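To see what that default counting amounts to, the sketch below computes the frequencies by hand and draws the same chart from the precomputed table. The toy data frame and its college column are made up for illustration.

```r
library(ggplot2)
library(dplyr)

# Hypothetical toy data: one row per student
toy <- data.frame(college = c("Arts", "Arts", "Business", "Education", "Arts"))

# What geom_bar() works out behind the scenes: one count per category
counts <- count(toy, college, name = "frequency")
counts

# Two ways to draw the same chart
p_auto   <- ggplot(toy, aes(x = college)) +
  geom_bar()                                   # counts the rows for us
p_manual <- ggplot(counts, aes(x = college, y = frequency)) +
  geom_col()                                   # plots counts we computed ourselves
```

geom_col() is ggplot2's shorthand for geom_bar(stat = "identity"), which is the form used in the later examples.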

We could do better, of course, by customizing the labels for the x-axis and y-axis, adding a title and/or a subtitle, a caption, maybe even coloring the bars.

ggplot(data = my.df, 
       aes(x = college_desc, fill = college_desc)) +
  geom_bar() + 
  labs(title = "Distribution of Students by College",
       subtitle = "(Multiple Terms)",
       caption = "Source: Ohio University's Office of International Research") +
  theme(legend.position = "bottom")

The x-axis labels are hard to read, so we could flip the x- and y-axes.

ggplot(data = my.df, 
       aes(x = college_desc, fill = college_desc)) +
  geom_bar() + 
  labs(title = "Distribution of Students by College",
       subtitle = "(Multiple Terms)",
       caption = "Source: Ohio University's Office of International Research") +
  theme(legend.position = "bottom") +
  coord_flip()

How about ordering the colleges in terms of increasing/decreasing frequency?

library(forcats)

my.df %>%
  group_by(college_desc) %>%
  summarise(frequency = n()) %>%
  ggplot(aes(x = fct_reorder(college_desc, frequency),
           y = frequency,
           fill = college_desc)) +
  geom_bar(stat = "identity") + 
  labs(title = "Distribution of Students by College",
       subtitle = "(Multiple Terms)",
       caption = "Source: Ohio University's Office of International Research") +
  theme(legend.position = "bottom") +
  coord_flip()

Note how we are using the pipe operator %>% to do some calculations before seamlessly rolling into the plotting commands. The summarized data frame is passed along as the data argument of ggplot().
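A small sketch of what the pipe is doing, using a made-up toy data frame: x %>% f(y) is just another way of writing f(x, y), so the piped and nested versions give the same table.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data: one row per student
toy <- data.frame(college = c("Arts", "Arts", "Business", "Education", "Arts"))

# Piped version: read it top to bottom
counts_piped <- toy %>%
  group_by(college) %>%
  summarise(frequency = n())

# Nested version: the same calls, read inside out
counts_nested <- summarise(group_by(toy, college), frequency = n())

# The pipe hands the summarized table to ggplot() as its first argument (data)
p <- counts_piped %>%
  ggplot(aes(x = college, y = frequency)) +
  geom_col()
```

Note the switch from %>% to + once ggplot() is called: the pipe chains data steps, while + adds layers to the plot.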

Here fct_reorder(college_desc, frequency) is ordering the bars for us, by frequency. To reverse the order, negate the frequency:

library(forcats)

my.df %>%
  group_by(college_desc) %>%
  summarise(frequency = n()) %>%
  ggplot(aes(x = fct_reorder(college_desc, -frequency),
           y = frequency,
           fill = college_desc)) +
  geom_bar(stat = "identity") + 
  labs(title = "Distribution of Students by College",
       subtitle = "(Multiple Terms)",
       caption = "Source: Ohio University's Office of International Research") +
  theme(legend.position = "bottom") +
  coord_flip()

Note: fct_reorder(college_desc, -frequency) negates the frequency, which reverses the ordering of the bars.
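The effect of fct_reorder() is easiest to see on the factor levels themselves. The vectors below are made up for illustration.

```r
library(forcats)

# Hypothetical categories and their counts
college   <- c("Arts", "Business", "Education")
frequency <- c(30, 10, 20)

# Levels ordered from smallest to largest frequency
levels(fct_reorder(college, frequency))
# "Business"  "Education" "Arts"

# Negating the frequency reverses the ordering
levels(fct_reorder(college, -frequency))
# "Arts"      "Education" "Business"
```

ggplot2 draws a discrete axis in the order of the factor levels, which is why reordering the levels reorders the bars. fct_reorder() also takes .desc = TRUE as an alternative to the minus sign.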

We could still do better … do we need a legend? No. How about better axis labels?

my.df %>%
  group_by(college_desc) %>%
  summarise(frequency = n()) %>%
  ggplot(aes(x = fct_reorder(college_desc, frequency),
           y = frequency,
           fill = college_desc)) +
  geom_bar(stat = "identity") + 
  labs(title = "Distribution of Students by College",
       subtitle = "(Multiple Terms)",
       caption = "Source: Ohio University's Office of International Research",
       x = "College",
       y = "Number of Students") +
  theme(legend.position = "none") +
  coord_flip()
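A point that is easy to trip over: labs(x = ..., y = ...) labels the aesthetics, not the positions on the page. After coord_flip() the x aesthetic (the college) is drawn on the vertical axis, but it is still labelled through x. A minimal sketch with made-up counts:

```r
library(ggplot2)

# Hypothetical precomputed counts
counts <- data.frame(college   = c("Arts", "Business", "Education"),
                     frequency = c(30, 10, 20))

p <- ggplot(counts, aes(x = college, y = frequency)) +
  geom_col() +
  labs(x = "College",              # x aesthetic = college, drawn vertically after the flip
       y = "Number of Students") + # y aesthetic = frequency, drawn horizontally after the flip
  coord_flip()

p
```

In recent versions of ggplot2 you could also skip coord_flip() and map the college to y directly, e.g. aes(y = college, x = frequency), in which case the labels follow the axes as drawn.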

You can also tweak the default colors and the general theme.

library(viridis)

my.df %>%
  group_by(college_desc) %>%
  summarise(frequency = n()) %>%
  ggplot(aes(x = fct_reorder(college_desc, frequency),
           y = frequency,
           fill = college_desc)) +
  geom_bar(stat = "identity") + 
  labs(title = "Distribution of Students by College",
       subtitle = "(Multiple Terms)",
       caption = "Source: Ohio University's Office of International Research",
       x = "College",
       y = "Number of Students") +
  theme_minimal() + 
  theme(legend.position = "none") +
  coord_flip() +
  scale_fill_viridis_d(option = "viridis")

my.df %>%
  group_by(college_desc) %>%
  summarise(frequency = n()) %>%
  ggplot(aes(x = fct_reorder(college_desc, frequency),
           y = frequency,
           fill = college_desc)) +
  geom_bar(stat = "identity") + 
  labs(title = "Distribution of Students by College",
       subtitle = "(Multiple Terms)",
       caption = "Source: Ohio University's Office of International Research",
       x = "College",
       y = "Number of Students") +
  hrbrthemes::theme_ft_rc() + 
  theme(legend.position = "none") +
  coord_flip() +
  scale_fill_viridis_d(option = "plasma")
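Two details in the examples above are worth a small sketch, again with made-up counts. First, scale_fill_viridis_d() is exported by ggplot2 itself, so it works with only ggplot2 loaded. Second, a complete theme such as theme_minimal() resets earlier theme() tweaks, so the theme() call should come after it.

```r
library(ggplot2)

# Hypothetical precomputed counts
counts <- data.frame(college   = c("Arts", "Business", "Education"),
                     frequency = c(30, 10, 20))

p <- ggplot(counts, aes(x = college, y = frequency, fill = college)) +
  geom_col() +
  theme_minimal() +                       # complete theme first ...
  theme(legend.position = "none") +       # ... then individual tweaks on top of it
  scale_fill_viridis_d(option = "magma")  # other palettes include "viridis", "plasma", "inferno"

p
```

If the order were reversed, theme_minimal() would bring the legend back.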

Some more examples …

What if I want to look at enrollment numbers by sex?

my.df %>%
  filter(sex.f %in% c("Male", "Female")) %>%
  group_by(college_desc, sex.f) %>%
  summarise(frequency = n()) %>%
  ggplot(aes(x = fct_reorder(college_desc, frequency),
           y = frequency,
           fill = college_desc)) +
  geom_bar(stat = "identity") + 
  labs(title = "Distribution of Students by College",
       subtitle = "(Multiple Terms)",
       caption = "Source: Ohio University's Office of International Research",
       y = "Number of Students",
       x = "College") +
  theme(legend.position = "none") +
  coord_flip() + 
  facet_wrap(~ sex.f)

What if I want to show percentages instead of frequencies?

my.df %>%
  filter(sex.f %in% c("Male", "Female")) %>%
  group_by(college_desc, sex.f) %>%
  summarise(frequency = n()) %>%
  mutate(percent = (frequency / sum(frequency)) * 100) %>%
  ggplot(aes(x = fct_reorder(college_desc, percent),
           y = percent,
           fill = sex.f)) +
  geom_bar(stat = "identity") + 
  labs(title = "Distribution of Students by Sex and College",
       subtitle = "(Multiple Terms)",
       caption = "Source: Ohio University's Office of International Research",
       y = "Percent",
       x = "College",
       fill = "") +
  theme(legend.position = "bottom") +
  coord_flip()
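Why do the percentages add up to 100 within each college rather than across the whole table? After summarise(), the last grouping variable (sex) is dropped, so the result is still grouped by college and sum(frequency) runs within each college. Here is a minimal sketch with a made-up toy data frame; .groups = "drop_last" spells out what summarise() does by default here.

```r
library(dplyr)

# Hypothetical toy data: one row per student
toy <- data.frame(
  college = c("Arts", "Arts", "Arts", "Arts", "Business", "Business"),
  sex     = c("Female", "Female", "Female", "Male", "Female", "Male")
)

tab <- toy %>%
  group_by(college, sex) %>%
  summarise(frequency = n(), .groups = "drop_last") %>%  # result stays grouped by college only
  mutate(percent = frequency / sum(frequency) * 100)     # so sum() is computed within each college

tab
# Arts:     Female 75, Male 25
# Business: Female 50, Male 50
```

Adding ungroup() before the mutate() would instead give each cell's share of all students.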

What if I want to store the summarized data as a data frame and then plot? This approach is handy if you need to print or share the table.

tab1 <- my.df %>%
  filter(sex.f %in% c("Male", "Female")) %>%
  group_by(college_desc, sex.f) %>%
  summarise(frequency = n()) %>%
  mutate(percent = (frequency / sum(frequency)) * 100) 

ggplot(data = tab1, aes(x = fct_reorder(college_desc, percent),
           y = percent,
           fill = sex.f)) +
  geom_bar(stat = "identity") + 
  labs(title = "Distribution of Students by Sex and College",
       subtitle = "(Multiple Terms)",
       caption = "Source: Ohio University's Office of International Research",
       y = "Percent",
       x = "College",
       fill = "") +
  theme(legend.position = "bottom") +
  coord_flip()

# Round the percent column to two decimals for display
tab1p <- tab1 %>% mutate(percent = round(percent, 2))

DT::datatable(tab1p, caption = "Distribution of Students by Sex and College",
              rownames = FALSE,
              colnames = c("College", "Sex", "Number", "Percent"))

Distribution of Students by Sex and College (first 10 of 28 rows)

College                    Sex     Number  Percent
Arts & Sciences            Female   98267    56.53
Arts & Sciences            Male     75574    43.47
Business                   Female   19875    37.08
Business                   Male     33731    62.92
Communication              Female   18844    56.57
Communication              Male     14468    43.43
Education                  Female   25939    69.09
Education                  Male     11605    30.91
Engineering & Technology   Female    4542    16.76
Engineering & Technology   Male     22552    83.24

What if we’d like to break out the preceding plot by students’ rank?

my.df %>%
  filter(sex.f %in% c("Male", "Female")) %>%
  group_by(college_desc, rank_desc, sex.f) %>%
  summarise(frequency = n()) %>%
  mutate(percent = (frequency / sum(frequency)) * 100) %>%
  ggplot(aes(x = fct_reorder(college_desc, percent),
           y = percent,
           fill = sex.f)) +
  geom_bar(stat = "identity") + 
  labs(title = "Distribution of Students by College",
       subtitle = "(Multiple Terms)",
       caption = "Source: Ohio University's Office of International Research",
       y = "Percent",
       x = "College",
       fill = "Student's Sex at Birth") +
  theme(legend.position = "bottom") +
  coord_flip() +
  facet_wrap(~ rank_desc)

`geom_line()` and `geom_point()`

Let us calculate enrollments by college and term, and then plot them as points connected by lines.

tab2 <- my.df %>%
  group_by(term_code, college_desc) %>%
  summarise(frequency = n_distinct(anon_id)) 

ggplot() +
  geom_point(data = tab2, aes(x = term_code, y = frequency,
                              group = college_desc,
                              color = college_desc,
                              size = frequency,
                              shape = college_desc)) +
  geom_line(data = tab2, aes(x = term_code, y = frequency,
                             group = college_desc,
                             color = college_desc,
                             linetype = college_desc)) +
  labs(x = "Term",
       y = "Number Enrolled",
       color = "")
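The two layers above repeat the same data and most of the same aesthetics. As a sketch of an alternative layout (with made-up term-by-college counts, not the workshop data), anything shared can go into ggplot() once and is inherited by every layer, with layer-specific aesthetics added inside each geom.

```r
library(ggplot2)

# Hypothetical term-by-college enrollment counts
toy <- data.frame(
  term      = rep(c(1, 2, 3), times = 2),
  college   = rep(c("Arts", "Business"), each = 3),
  frequency = c(120, 130, 125, 80, 95, 90)
)

p <- ggplot(toy, aes(x = term, y = frequency,
                     group = college, color = college)) +  # shared by every layer
  geom_line(aes(linetype = college)) +                     # line-only aesthetic
  geom_point(aes(shape = college), size = 3) +             # point-only aesthetic, fixed size
  labs(x = "Term",
       y = "Number Enrolled",
       color = "", linetype = "", shape = "")              # same title so the legends can merge

p
```

One caveat: ggplot2's default shape palette handles at most six categories, so with many colleges it may be better to drop the shape mapping and rely on color alone.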

Introduction to ggplot2

Workshop Session 03 @ Ohio University

Ani Ruhil

2019-05-18
