Visualizing Data with R

Ani Ruhil

Agenda

Graphics with ggplot2
Interactive graphics with highcharter
Interactive graphics with plotly
Maps
- with ggplot2
- with leaflet


the grammar of graphics

data = cleaned and mutated or summarized to give us what we'd like to visualize

geom = what kind of a visual do you want? A map, bar-chart, line-chart, scatter-plot, something else?

coordinate system = what should go on the x-axis? y-axis?

other aesthetics = what else should the plot use? Border colors, fill colors, text or other annotations, plotting symbols, facets that show breakouts by some attribute, something else?
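To make these ingredients concrete, here is a minimal sketch that assembles a plot piece by piece. The data frame `toy` and its columns `college` and `students` are invented for illustration; they are not the workshop data.

```r
library(ggplot2)

# Hypothetical toy data, not the workshop data set
toy <- data.frame(
  college  = c("Business", "Education", "Fine Arts"),
  students = c(120, 80, 45)
)

p <- ggplot(data = toy,                          # data
            aes(x = college, y = students,       # what goes on each axis
                fill = college)) +               # one more aesthetic: fill color
  geom_col() +                                   # geom: bars whose heights come from y
  labs(x = "College", y = "Number of Students")  # annotations

p
```

Each `+` adds one piece, so you can stop after any line and look at what the plot contains so far.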


load("data/multkey.merge.RData")
my.df <- multkey.merge
library(tidyverse)
ggplot(data = my.df)

The canvas is blank because we have not specified what goes on the x-axis and the y-axis

That can be specified via aes ... the aesthetics


ggplot(data = my.df, aes(x = college_desc))

Aha! Now we see the labels on the x-axis but nothing more. Why?

Because we have not specified what type of a graphic we want ... a bar-chart perhaps?

ggplot(data = my.df, aes(x = college_desc)) +
  geom_bar()

Notice that geom_bar() generates a bar-chart

The y-axis shows the frequency, that is, the number of rows in my.df for each college; geom_bar() counts these by default
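If you want to confirm what the bars represent, the sketch below compares the heights that geom_bar() computes with a count done by hand. The tibble `toy` is a made-up stand-in for my.df, with one row per record.

```r
library(ggplot2)
library(dplyr)

# Hypothetical stand-in for my.df: one row per student record
toy <- tibble(college_desc = c("Business", "Business", "Education",
                               "Fine Arts", "Fine Arts", "Fine Arts"))

# Bar heights as computed by geom_bar()
p <- ggplot(toy, aes(x = college_desc)) + geom_bar()
bar_heights <- layer_data(p)$count

# The same numbers computed by hand
by_hand <- count(toy, college_desc)

bar_heights
by_hand
```

`layer_data()` returns the data ggplot2 actually draws for a layer, which is a handy way to inspect what a geom has calculated.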


We could do better of course, by customizing the labels for the x-axis and y-axis, adding a title and/or subtitle, a caption, maybe even coloring the bars

ggplot(data = my.df, 
       aes(x = college_desc, fill = college_desc)) +
  geom_bar() + 
  labs(title = "Distribution of Students by College",
       subtitle = "(Multiple Terms)",
       caption = "Source: Ohio University's Office of Institutional Research") +
  theme(legend.position = "bottom")


The x-axis labels are hard to read so we could flip the x- and y-axes

ggplot(data = my.df, 
       aes(x = college_desc, fill = college_desc)) +
  geom_bar() + 
  labs(title = "Distribution of Students by College",
       subtitle = "(Multiple Terms)",
       caption = "Source: Ohio University's Office of Institutional Research") +
  theme(legend.position = "bottom") +
  coord_flip()


How about ordering the colleges in terms of increasing/decreasing frequency?

library(forcats)
my.df %>%
  group_by(college_desc) %>%
  summarise(frequency = n()) %>%
  ggplot(aes(x = fct_reorder(college_desc, frequency),
           y = frequency,
           fill = college_desc)) +
  geom_bar(stat = "identity") + 
  labs(title = "Distribution of Students by College",
       subtitle = "(Multiple Terms)",
       caption = "Source: Ohio University's Office of Institutional Research") +
  theme(legend.position = "bottom") +
  coord_flip()

Note how we are using the pipe operator %>% to do some calculations before seamlessly rolling into the plotting commands

fct_reorder(college_desc, frequency) is ordering the bars for us
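To see what fct_reorder() does without drawing anything, you can inspect the factor levels directly. The vectors below are hypothetical values standing in for the summarized table.

```r
library(forcats)

# Hypothetical summarized data: one value per college
college   <- c("Business", "Education", "Fine Arts")
frequency <- c(120, 80, 45)

levels(factor(college))                  # alphabetical: the default bar order
levels(fct_reorder(college, frequency))  # ordered by increasing frequency
```

ggplot2 draws the bars in level order, so changing the levels changes the order of the bars. With coord_flip() the first level ends up at the bottom of the chart.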


library(forcats)
my.df %>%
  group_by(college_desc) %>%
  summarise(frequency = n()) %>%
  ggplot(aes(x = fct_reorder(college_desc, -frequency),
           y = frequency,
           fill = college_desc)) +
  geom_bar(stat = "identity") + 
  labs(title = "Distribution of Students by College",
       subtitle = "(Multiple Terms)",
       caption = "Source: Ohio University's Office of Institutional Research") +
  theme(legend.position = "bottom") +
  coord_flip()

Note: fct_reorder(college_desc, -frequency) orders the colleges by the negated frequency, which reverses the order of the bars
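Negating the variable is one way to reverse the order; forcats also offers a `.desc` argument that some readers find easier to follow. A small sketch with hypothetical values shows that the two produce the same levels.

```r
library(forcats)

# Hypothetical values standing in for the summarized table
college   <- c("Business", "Education", "Fine Arts")
frequency <- c(80, 120, 45)

by_negation <- fct_reorder(college, -frequency)
by_desc_arg <- fct_reorder(college, frequency, .desc = TRUE)

levels(by_negation)
identical(levels(by_negation), levels(by_desc_arg))
```

The `.desc = TRUE` form also works when the ordering variable cannot be negated.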


We could still do better ... do we need a legend? No. How about better axis labels?

my.df %>%
  group_by(college_desc) %>%
  summarise(frequency = n()) %>%
  ggplot(aes(x = fct_reorder(college_desc, frequency),
           y = frequency,
           fill = college_desc)) +
  geom_bar(stat = "identity") + 
  labs(title = "Distribution of Students by College",
       subtitle = "(Multiple Terms)",
       caption = "Source: Ohio University's Office of Institutional Research",
       # labels follow the aesthetics, so x is still the college after coord_flip()
       x = "College",
       y = "Number of Students") +
  theme(legend.position = "none") +
  coord_flip()


What if I want to look at enrollment numbers by sex?

my.df %>%
  filter(sex.f %in% c("Male", "Female")) %>%
  group_by(college_desc, sex.f) %>%
  summarise(frequency = n()) %>%
  ggplot(aes(x = fct_reorder(college_desc, frequency),
           y = frequency,
           fill = college_desc)) +
  geom_bar(stat = "identity") + 
  labs(title = "Distribution of Students by College",
       subtitle = "(Multiple Terms)",
       caption = "Source: Ohio University's Office of Institutional Research",
       y = "Number of Students",
       x = "College") +
  theme(legend.position = "none") +
  coord_flip() + 
  facet_wrap(~ sex.f)


What if I want to show percentages instead of frequencies?

my.df %>%
  filter(sex.f %in% c("Male", "Female")) %>%
  group_by(college_desc, sex.f) %>%
  summarise(frequency = n()) %>%
  mutate(percent = (frequency / sum(frequency)) * 100) %>%
  ggplot(aes(x = fct_reorder(college_desc, percent),
           y = percent,
           fill = sex.f)) +
  geom_bar(stat = "identity") + 
  labs(title = "Distribution of Students by Sex and College",
       subtitle = "(Multiple Terms)",
       caption = "Source: Ohio University's Office of Institutional Research",
       y = "Percent",
       x = "College",
       fill = "") +
  theme(legend.position = "bottom") +
  coord_flip()
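The percentages here are computed within each college, and it helps to see why. After group_by(college_desc, sex.f), summarise() drops the last grouping variable, so the result is still grouped by college_desc and sum(frequency) in mutate() is the college total. The sketch below uses a made-up tibble `toy` and spells out that default with `.groups = "drop_last"`.

```r
library(dplyr)

# Hypothetical records: one row per student
toy <- tibble(
  college_desc = c(rep("Business", 4), rep("Education", 5)),
  sex.f        = c("Female", "Male", "Male", "Male",
                   "Female", "Female", "Female", "Female", "Male")
)

tab <- toy %>%
  group_by(college_desc, sex.f) %>%
  summarise(frequency = n(), .groups = "drop_last") %>%  # still grouped by college_desc
  mutate(percent = frequency / sum(frequency) * 100)     # share within each college

tab
group_vars(tab)
```

If you wanted each cell as a share of all students instead, you could call ungroup() before mutate().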


What if I want to store the summarized data as a data frame and then plot?

tab1 <- my.df %>%
  filter(sex.f %in% c("Male", "Female")) %>%
  group_by(college_desc, sex.f) %>%
  summarise(frequency = n()) %>%
  mutate(percent = (frequency / sum(frequency)) * 100) 
ggplot(data = tab1, aes(x = fct_reorder(college_desc, percent),
           y = percent,
           fill = sex.f)) +
  geom_bar(stat = "identity") + 
  labs(title = "Distribution of Students by Sex and College",
       subtitle = "(Multiple Terms)",
       caption = "Source: Ohio University's Office of Institutional Research",
       y = "Percent",
       x = "College",
       fill = "") +
  theme(legend.position = "bottom") +
  coord_flip()

This approach is handy if you need to print the table or make it available to others


# across() replaces the deprecated mutate_at() + funs() idiom
tab1p <- tab1 %>% mutate(across(starts_with("percent"), ~ round(.x, 2)))
DT::datatable(tab1p, caption = "Distribution of Students by Sex and College",
              rownames = FALSE,
              colnames = c("College", "Sex", "Number", "Percent"))

Distribution of Students by Sex and College

College                         Sex      Number   Percent
Arts & Sciences                 Female    98267     56.53
Arts & Sciences                 Male      75574     43.47
Business                        Female    19875     37.08
Business                        Male      33731     62.92
Communication                   Female    18844     56.57
Communication                   Male      14468     43.43
Education                       Female    25939     69.09
Education                       Male      11605     30.91
Engineering & Technology        Female     4542     16.76
Engineering & Technology        Male      22552     83.24
Fine Arts                       Female    20396     59.47
Fine Arts                       Male      13900     40.53
George Voinovich School         Female      877     66.59
George Voinovich School         Male        440     33.41
Health Sciences & Professions   Female    62345     80.88
Health Sciences & Professions   Male      14740     19.12
Honors Tutorial                 Female      315     65.49
Honors Tutorial                 Male        166     34.51
International Studies           Female      506     54.94
International Studies           Male        415     45.06
Miscellaneous                   Female      198     64.29
Miscellaneous                   Male        110     35.71
Osteopathic Medicine            Female     5608     46.08
Osteopathic Medicine            Male       6562     53.92
Regional Higher Ed              Female    10774     56.58
Regional Higher Ed              Male       8269     43.42
University College              Female     9164     52.24
University College              Male       8378     47.76


What if we'd like to break out the preceding plot by students' rank?

my.df %>%
  filter(sex.f %in% c("Male", "Female")) %>%
  group_by(college_desc, rank_desc, sex.f) %>%
  summarise(frequency = n()) %>%
  mutate(percent = (frequency / sum(frequency)) * 100) %>%
  ggplot(aes(x = fct_reorder(college_desc, percent),
           y = percent,
           fill = sex.f)) +
  geom_bar(stat = "identity") + 
  labs(title = "Distribution of Students by College",
       subtitle = "(Multiple Terms)",
       caption = "Source: Ohio University's Office of Institutional Research",
       y = "Percent",
       x = "College",
       fill = "Student's Sex at Birth") +
  theme(legend.position = "bottom") +
  coord_flip() +
  facet_wrap(~ rank_desc)


geom_line() and geom_point()

Let us calculate enrollments by college and term

tab2 <- my.df %>%
  group_by(term_code, college_desc) %>%
  summarise(frequency = n_distinct(anon_id)) 
ggplot() +
  geom_point(data = tab2, aes(x = term_code, y = frequency,
                              group = college_desc,
                              color = college_desc,
                              size = frequency,
                              shape = college_desc)) +
  geom_line(data = tab2, aes(x = term_code, y = frequency,
                             group = college_desc,
                             color = college_desc,
                             linetype = college_desc)) +
  labs(x = "Term",
       y = "Number Enrolled",
       color = "")
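Note that this table uses n_distinct(anon_id) rather than n(). The difference matters only if a student can appear in more than one row within a term, which is an assumption about the data and not something shown here. The sketch below uses a made-up tibble `toy` to illustrate the two counts side by side.

```r
library(dplyr)

# Hypothetical records: student "a" has two rows in term T1
toy <- tibble(
  term_code    = c("T1", "T1", "T1", "T2"),
  college_desc = c("Business", "Business", "Business", "Business"),
  anon_id      = c("a", "a", "b", "a")
)

enroll <- toy %>%
  group_by(term_code, college_desc) %>%
  summarise(rows     = n(),                  # counts every record
            students = n_distinct(anon_id),  # counts each student once
            .groups  = "drop")

enroll
```

When every student has exactly one row per term the two columns agree, so comparing them is a quick check on the structure of your data.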


Find me at...

@aruhil
aniruhil.org
ruhil@ohio.edu
