Visualizing Data in R

Ani Ruhil

1 / 60

Agenda

Visualizing data

the grammar of graphics
working with ggplot2
color consciousness

2 / 60

Visualizing data

3 / 60

graphics in R

Three common ways to generate graphics in R are via

base R
lattice (little development in recent years though)
ggplot2

We will skip base R graphics since ggplot2 will be the graphics package for this class.

Let us see how we use it, starting with a simple bar-chart

Remember the basic options...

one qualitative/categorical variable: bar-chart
one quantitative/continuous variable: histogram/box-plot/area-chart
two quantitative/continuous variables: scatter-plot/hex-bin

4 / 60

I will use two data-sets, the first being this IMDB data-set

The Internet Movie Database, http://imdb.com/, is a website devoted to collecting movie data supplied by studios and fans. It claims to be the biggest movie database on the web and is run by Amazon. More information about imdb.com can be found online, http://imdb.com/help/show_leaf?about, including information about the data collection process, http://imdb.com/help/show_leaf?infosource.

library(ggplot2movies)

A data frame with 58788 rows and 24 variables

title. Title of the movie.
year. Year of release.
budget. Total budget (if known) in US dollars
length. Length in minutes.
rating. Average IMDB user rating.
votes. Number of IMDB users who rated this movie.
r1-10. Multiplying by ten gives the percentile (to the nearest 10%) of users who rated this movie a 1 (r1) through a 10 (r10).
mpaa. MPAA rating (missing for a lot of movies)
action, animation, comedy, drama, documentary, romance, short. Binary variables representing if movie was classified as belonging to that genre.

5 / 60

The second data-set is the Star Wars dataset, a tibble with 87 rows and 13 variables:

library(dplyr)
data(starwars)

name: Name of the character
height: Height (cm)
mass: Weight (kg)
hair_color,skin_color,eye_color: Hair, skin, and eye colors
birth_year: Year born (BBY = Before Battle of Yavin)
gender: male, female, hermaphrodite, or none.
homeworld: Name of homeworld
species: Name of species
films: List of films the character appeared in
vehicles: List of vehicles the character has piloted
starships: List of starships the character has piloted

a tibble you say?

6 / 60

data frames vs. tibbles

R's default is to store data in a data frame, as shown below with a small example. data.frame() tends to change column names and, in versions of R before 4.0.0, to convert characters into factors.

data.frame(
  `Some Letters` = c("A", "B", "C"), 
  `Some Numbers` = c(1, 2, 3)
  ) -> adf
str(adf)

## 'data.frame':    3 obs. of  2 variables:
##  $ Some.Letters: Factor w/ 3 levels "A","B","C": 1 2 3
##  $ Some.Numbers: num  1 2 3

print(adf)

##   Some.Letters Some.Numbers
## 1            A            1
## 2            B            2
## 3            C            3

Tibbles are the brainchild of the team behind RStudio and the tidyverse, a bundle of packages; they drop these bad habits of R's data frames.

tibble(
  `Some Letters` = c("A", "B", "C"), 
  `Some Numbers` = c(1, 2, 3)
  ) -> atib
glimpse(atib)

## Observations: 3
## Variables: 2
## $ `Some Letters` <chr> "A", "B", "C"
## $ `Some Numbers` <dbl> 1, 2, 3

print(atib)

## # A tibble: 3 x 2
##   `Some Letters` `Some Numbers`
##   <chr>                   <dbl>
## 1 A                           1
## 2 B                           2
## 3 C                           3
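If you need to stay with base R data frames, the two habits shown above can be switched off explicitly. The following is a minimal sketch; the object name adf2 is only for illustration.

```r
# check.names = FALSE keeps the column names exactly as typed;
# stringsAsFactors = FALSE keeps character columns as characters
# (this is already the default from R 4.0.0 onward).
adf2 <- data.frame(
  `Some Letters` = c("A", "B", "C"),
  `Some Numbers` = c(1, 2, 3),
  check.names = FALSE,
  stringsAsFactors = FALSE
)
names(adf2)
class(adf2$`Some Letters`)
```

A tibble gives you both behaviors without any extra arguments, which is why the rest of the examples lean on tibbles.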

7 / 60

Questions??

8 / 60

and the grammar of graphics

library(ggplot2)

9 / 60

qplot() will generate a quick plot but ggplot() is the way to go, so we build with it

ggplot(data = starwars)

Nothing results since we have not specified how we want the variable(s) to be mapped to the coordinate system ... what variable should go on what axis?

ggplot(data = starwars,
        mapping = aes(x = eye_color)
       )

Now we are getting somewhere. We see the canvas with the specific eye colors on the x-axis but nothing else has been drawn since we have not specified the geometry ... do you want a bar-chart? histogram? dot-plot? line-chart?
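The step-by-step build described above can be made visible by storing the plot in an object and counting its layers. This is a rough sketch; the object names p_base and p_bar are made up for illustration.

```r
library(ggplot2)
library(dplyr)   # provides the starwars data

# Step 1: data + mapping only; the canvas has axes but no layers yet
p_base <- ggplot(data = starwars, mapping = aes(x = eye_color))
n_before <- length(p_base$layers)

# Step 2: adding a geom appends one layer to the same plot object
p_bar <- p_base + geom_bar()
n_after <- length(p_bar$layers)

c(before = n_before, after = n_after)
```

Each + adds to what is already there, which is why the same p_base could be reused with a different geom.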

10 / 60

point out that they should have read the material on the grammar of graphics ... not essential to understand ggplot2 but helpful if they want to specialize in visualizations

With a categorical variable the bar-chart would be appropriate and so we ask for a geom_bar()

ggplot(data = starwars,
       mapping = aes(x = eye_color)) +
  geom_bar()

Other aesthetics can be added, such as

group
color
fill
size
alpha
and then axis labels, plot title/subtitle etc

Two commands for adding a color scheme --
(a) color or colour
(b) fill

You can also change color palettes, roll your own, and more

11 / 60

ggplot(data = starwars, 
       mapping = aes(x = eye_color, 
                     colour = eye_color)) +
  geom_bar() +
  labs(x = "Eye Color", 
       y = "Frequency", 
       title = "Bar-chart of Eye Color", 
       subtitle = "(of Star Wars characters)")

Note what colour = generated for us

labs() is allowing us to customize the axis labels, add a title and subtitle.

If you wanted a caption, you could add a line of code, something like caption = "lorem ipsum"

This is not a very good plot of course. Why is that?

12 / 60

ggplot(data = starwars, 
       mapping = aes(x = eye_color, 
                     fill = eye_color)) +
  geom_bar() +
  labs(x = "Eye Color", 
       y = "Frequency", 
       title = "Bar-chart of Eye Color", 
       subtitle = "(of Star Wars characters)")

Note what fill = generated for us

Of course, it would be good to have the fill colors match the eye-color so let us do that next

We can also eliminate the legend. Why is that?

13 / 60

Point out how important it is to label x-axis, y-axis, add a title
Remind them to start the y-axis at 0, or else to note in the plot that the axis has been truncated for ease of presentation so that readers do not assume it starts at 0. If truncating the axis misstates the pattern in the data, they should not do it.

c("black", "blue", "slategray",
  "brown", "gray34", "gold",
  "greenyellow", "navajowhite1",
  "orange", "pink", "red",
  "magenta", "thistle3", "white",
  "yellow") -> mycolors
ggplot(data = starwars, 
       mapping = aes(x = eye_color)) +
  geom_bar(fill = mycolors) +
  labs(x = "Eye Color", 
       y = "Frequency", 
       title = "Bar-chart of Eye Color", 
       subtitle = "(of Star Wars characters)",
       caption = "Much better!")

R Colors used from this source but see also this source
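Passing an unnamed vector to fill = works only as long as the colors are listed in the same order as the bars. A less fragile sketch uses a named vector with scale_fill_manual(), so each color is matched to its category by name. The 15 names below are assumed to match the values of eye_color in the data; eye_cols and p_named are illustrative names.

```r
library(ggplot2)
library(dplyr)   # provides the starwars data

# Named vector: each eye color (as spelled in the data) gets its own fill
eye_cols <- c(
  "black" = "black", "blue" = "blue", "blue-gray" = "slategray",
  "brown" = "brown", "dark" = "gray34", "gold" = "gold",
  "green, yellow" = "greenyellow", "hazel" = "navajowhite1",
  "orange" = "orange", "pink" = "pink", "red" = "red",
  "red, blue" = "magenta", "unknown" = "thistle3", "white" = "white",
  "yellow" = "yellow"
)

p_named <- ggplot(starwars, aes(x = eye_color, fill = eye_color)) +
  geom_bar() +
  # colors are matched by name; guide = "none" drops the redundant legend
  scale_fill_manual(values = eye_cols, guide = "none") +
  labs(x = "Eye Color", y = "Frequency")
p_named
```

The legend can go because the x-axis already names every category.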

We can still do better. How can we improve this?

think in terms of the frequencies of eye-color

14 / 60

Colors can be customized by generating your own palettes
Color Brewer is here
Remind them to read the materials on choosing colors wisely, particularly the point about qualitative palettes, divergent palettes, and then palettes that work well even with colorblind audiences

15 / 60

I'll switch to a different variable and show you how to use prebuilt color palettes

ggplot(data = starwars, 
       mapping = aes(x = gender)) +
  geom_bar(aes(fill = gender)) +
  labs(x = "Gender", y = "Frequency",
       title = "Bar-chart of Gender", 
       subtitle = "(of Star Wars characters)", 
       caption = "(Source: The dplyr package)") +
  scale_fill_brewer(palette = "Pastel1")

library(wesanderson)
ggplot(data = starwars, mapping = aes(x = gender)) +
  geom_bar(aes(fill = gender)) +
  labs(x = "Gender", y = "Frequency",
       title = "Bar-chart of Gender", 
       subtitle = "(of Star Wars characters)",
       caption = "(Source: The dplyr package)") +
  scale_fill_manual(values = 
                      wes_palette("Darjeeling1"))

scale_fill_brewer in the plot on the left, calling on built-in color palettes. You can review them here
scale_fill_manual in the plot on the right and a specific palette is being invoked from the wesanderson package
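When choosing among the built-in Color Brewer palettes, it may help to list the ones flagged as colorblind-friendly. A small sketch, assuming the RColorBrewer package is installed (cb_qual is an illustrative name):

```r
library(RColorBrewer)

# brewer.pal.info lists every Color Brewer palette with its category
# ("qual", "seq", "div") and whether it is flagged colorblind-friendly
info <- brewer.pal.info
cb_qual <- rownames(info)[info$category == "qual" & info$colorblind]
cb_qual
```

Any of the names returned could then be passed along as scale_fill_brewer(palette = ...).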

16 / 60

Using the wes anderson palette
Adding the caption = command to source the data
Switching from scale_fill_brewer() to scale_fill_manual()

One can also lean on various plotting themes as shown below

library(ggthemes)
ggplot(data = starwars, aes(x = eye_color)) +
  geom_bar() +
  theme_tufte() +
  theme(axis.text.x = element_text(size = 6)) -> p1
ggplot(data = starwars, aes(x = eye_color)) +
  geom_bar() +
  theme_solarized() +
  theme(axis.text.x = element_text(size = 6)) -> p2
ggplot(data = starwars, aes(x = eye_color)) +
  geom_bar() +
  theme_economist() +
  theme(axis.text.x = element_text(size = 6)) -> p3
ggplot(data = starwars, aes(x = eye_color)) +
  geom_bar() +
  theme_fivethirtyeight() +
  theme(axis.text.x = element_text(size = 6)) -> p4
library(patchwork)
p1 + p2 + p3 + p4 + plot_layout(ncol = 2)

Note: I am not using mapping = aes(...) any longer

Later on you will learn these & other ways to build advanced visualizations ...for now we get to work more with ggplot2

17 / 60

more with bar-charts

18 / 60

library(ggplot2)
ggplot(data = movies, aes(x = mpaa)) +
  geom_bar() +
  theme_minimal()

library(ggplot2)
ggplot(data = movies) +
  geom_bar(aes(x = mpaa)) +
  theme_minimal()

Notice how the aes(x = mpaa) is placed differently in the two commands and yet yields identical plots

19 / 60

MPAA ratings are missing for a lot of movies so we eliminate these from the plot via subset(movies, mpaa != "")

str(movies$mpaa)

##  chr [1:58788] "" "" "" "" "" "" "R" "" "" "" "" "" "" "" "PG-13" "PG-13" "" "" "" "" ...

ggplot(subset(movies, mpaa != ""), aes(x = mpaa)) + geom_bar() + theme_minimal()

The order of the bars is fortuitous in that it goes from the smallest frequency to the highest frequency.

20 / 60

I said fortuitous because ggplot2's default is to order the bars in an ascending alphabetic/alphanumeric sequence if the variable is a character. See below for an example.

df = tibble(x = c(rep("A", 2), rep("B", 4), rep("C", 1)))
ggplot(data = df, aes(x = x)) + geom_bar() + theme_minimal()

Later on we'll learn how to order the bars with ascending/descending frequencies or by some other logic
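As a preview, one way to control the order is to convert the variable into a factor whose levels are sorted by frequency. This is only a sketch using base R; freq_order and p_ordered are illustrative names.

```r
library(ggplot2)
library(tibble)

df <- tibble(x = c(rep("A", 2), rep("B", 4), rep("C", 1)))

# Turn x into a factor whose levels are sorted by descending frequency;
# ggplot2 draws bars in factor-level order rather than alphabetically
freq_order <- names(sort(table(df$x), decreasing = TRUE))
df$x <- factor(df$x, levels = freq_order)

p_ordered <- ggplot(df, aes(x = x)) + geom_bar() + theme_minimal()
levels(df$x)
```

Printing p_ordered would show the tallest bar first.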

21 / 60

What about plotting the relative frequencies on the y-axis rather than the frequencies?

ggplot(data = subset(movies, mpaa != ""), 
       aes(x = mpaa, 
           y = after_stat(count / sum(count)))) +
  geom_bar() +
  scale_y_continuous(labels = scales::percent) +
  labs(x = "MPAA Rating", 
       y = "Relative Frequency (%)") +
  theme_minimal()

Note: I used

y = after_stat(count / sum(count)), which replaces the older (..count..)/sum(..count..) notation deprecated in ggplot2 3.4.0
scale_y_continuous(labels = scales::percent)
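The same relative frequencies can also be computed by hand before plotting, which makes it easy to check the numbers. A sketch on hypothetical toy data (toy, rel, and p_rel are made-up names, and the counts are invented):

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data standing in for a categorical variable
toy <- tibble(rating = c(rep("PG", 3), rep("PG-13", 5), rep("R", 12)))

rel <- toy %>%
  count(rating) %>%              # frequency of each category
  mutate(prop = n / sum(n))      # share of all rows

p_rel <- ggplot(rel, aes(x = rating, y = prop)) +
  geom_col() +                   # geom_col() plots the y values as given
  scale_y_continuous(labels = scales::percent) +
  labs(x = "Rating", y = "Relative Frequency (%)")
rel
```

Computing the shares first costs a couple of extra lines but leaves a table you can inspect or report alongside the plot.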

22 / 60

We could also add a second or even a third/fourth categorical variable. Let us see this with our hsb2 data-set

library(here)
load("data/hsb2.RData")
ggplot(data = hsb2, aes(x = ses, group = female)) +
  geom_bar(aes(fill = female)) +
  theme_minimal()

This is not very useful since the viewer has to estimate the relative sizes of the two colors within any given bar.

23 / 60

ggplot(data = hsb2, aes(x = ses, group = female)) +
  geom_bar(aes(fill = female), position = "dodge") +
  theme_minimal()

position = "dodge" will juxtapose the bars, and this is much better.

24 / 60

This is fine if you want to know what percent of the 200 students are low SES males, low SES females, etc. What if you wanted to calculate percentages within each sex?

ggplot(data = hsb2, aes(x = ses)) +
  geom_bar(aes(group = female, fill = female, y = after_stat(prop)), position = "dodge") +
  scale_y_continuous(labels = scales::percent) +
  labs(y = "Relative Frequency (%)", x = "Socioeconomic Status Groups") +
  theme_minimal()

25 / 60

What about within each ses?

ggplot(data = hsb2, aes(x = female)) + 
  geom_bar(aes(group = ses, fill = ses, y = after_stat(prop)), position = "dodge") +
  scale_y_continuous(labels = scales::percent) +
  labs(y = "Relative Frequency (%)", x = "Sex") +
  theme_minimal()
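The difference between "within each sex" and "within each ses" is just a choice of denominator, and a cross-tabulation makes that explicit. The sketch below uses hypothetical counts, not the hsb2 data; toy, tab, within_sex, and within_ses are illustrative names.

```r
# Hypothetical counts; the hsb2 data are not reproduced here
toy <- data.frame(
  sex = c(rep("Female", 10), rep("Male", 10)),
  ses = c(rep("Low", 4), rep("Middle", 4), rep("High", 2),
          rep("Low", 2), rep("Middle", 5), rep("High", 3))
)
tab <- table(toy$sex, toy$ses)

within_sex <- prop.table(tab, margin = 1)  # each row (sex) sums to 1
within_ses <- prop.table(tab, margin = 2)  # each column (ses) sums to 1
round(within_sex, 2)
```

In the plots, the group = aesthetic plays the role of margin: proportions are computed within whichever variable defines the group.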

26 / 60

histograms

27 / 60

They can visit to remind themselves of all the various pieces that build up a histogram and the meaning/value behind each component
They should look at Yau's piece here and here

geom_histogram() does the trick but note that the default number of bins is not very useful and can be tweaked, along with other embellishments

ggplot(data = hsb2, aes(x = read)) +
  geom_histogram(fill = "cornflowerblue", color = "white") +
  ggtitle("Histogram of Reading Scores") +
  labs(x = "Reading Score", y = "Frequency") +
  theme_minimal()

28 / 60

See bins = 5 and also experiment with binwidth =

ggplot(data = hsb2, aes(x = read)) +
  geom_histogram(fill = "cornflowerblue", color = "white", bins = 5) +
  ggtitle("Histogram of Reading Scores") +
  labs(x = "Reading Score", y = "Frequency") +
  theme_minimal()
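If you would rather set binwidth = than bins =, one common rule of thumb is the Freedman-Diaconis rule. The sketch below applies it to simulated scores, which are an assumption made only so that the code runs on its own; scores, bw, and p_fd are illustrative names.

```r
library(ggplot2)

# Simulated scores, used only so that the sketch runs on its own
set.seed(123)
scores <- data.frame(read = round(rnorm(200, mean = 52, sd = 10)))

# Freedman-Diaconis rule: binwidth = 2 * IQR / n^(1/3)
bw <- 2 * IQR(scores$read) / nrow(scores)^(1/3)

p_fd <- ggplot(scores, aes(x = read)) +
  geom_histogram(binwidth = bw, fill = "cornflowerblue", color = "white") +
  labs(x = "Reading Score", y = "Frequency") +
  theme_minimal()
bw
```

Treat the result as a starting point and adjust until the shape of the distribution is clear.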

29 / 60

If we wanted to break out the histogram by one or more categorical variables, we could do so

ggplot(hsb2, aes(x = read)) +
  geom_histogram(fill = "cornflowerblue", bins = 5, color = "white") +
  ggtitle("Histogram of Reading Scores") +
  labs(x = "Reading Score", y = "Frequency") +
  facet_wrap(~ female) +
  theme_minimal()

30 / 60

Or better yet,

ggplot(hsb2, aes(x = read)) +
  geom_histogram(fill="cornflowerblue", 
                 bins = 10, 
                 color = "white") +
  ggtitle("Histogram of Reading Scores") +
  labs(x = "Reading Score", y = "Frequency") +
  facet_wrap(~ female, ncol = 1) +
  theme_minimal()

Why? ... the distributions are stacked above each other, making for an easier comparison

31 / 60

One useful design element with breakouts is placing in relief the consolidated data (i.e., the distribution without break-outs)

ggplot(data = hsb2, aes(x = read, fill = female)) +
  geom_histogram(bins = 10, color = "white") +
  ggtitle("Histogram of Reading Scores") +
  labs(x = "Reading Score", y = "Frequency") +
  facet_wrap(~ female, ncol = 1) +
  # dropping female from the data makes this layer repeat in every panel
  geom_histogram(data = dplyr::select(hsb2, -female),
                 bins = 10, 
                 fill = "grey",
                 alpha = .5) +
  theme_minimal()

Here it is obvious that the distribution of reading scores for either sex is similar to the overall distribution, so perhaps the groups are not really that different in terms of reading scores

32 / 60

For breakouts with two categorical variables we could do

ggplot(data = hsb2, aes(x = read)) +
  geom_histogram(fill="cornflowerblue",
                 bins = 10, 
                 color = "white") +
  ggtitle("Histograms of Reading Scores") +
  labs(x = "Reading Score", 
       y = "Frequency") +
  facet_wrap(~ female + schtyp, 
             ncol = 2) +
  theme_minimal()

Note that ~ female + schtyp renders the panels for the first category of female by all categories of schtyp and then repeats for the other category of female

33 / 60

ggplot(data = hsb2, aes(x = read)) + 
  geom_histogram(fill="cornflowerblue", 
                 bins = 10, 
                 color = "white") + 
  ggtitle("Histograms of Reading Scores") + 
  labs(x = "Reading Score", y = "Frequency") +
  facet_wrap(schtyp ~ female, ncol = 2) +
  theme_minimal()

Note that schtyp ~ female renders the panels for the first category of schtyp for all categories of female and then repeats for the other category of schtyp

34 / 60

... which is the same as ...

ggplot(data = hsb2, aes(x = read)) +
  geom_histogram(fill = "cornflowerblue", bins = 10, color = "white") +
  ggtitle("Histograms of Reading Scores") +
  labs(x = "Reading Score", y = "Frequency") +
  facet_wrap(~ schtyp + female, ncol = 2) +
  theme_minimal()
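With two categorical variables, facet_grid() is an alternative that puts one variable on the rows and the other on the columns. The sketch below uses the mpg data that ships with ggplot2, with drv and year standing in for schtyp and female; that substitution is an assumption made so the code runs without hsb2.

```r
library(ggplot2)

# mpg ships with ggplot2; drv and year stand in for schtyp and female
p_grid <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(fill = "cornflowerblue", bins = 10, color = "white") +
  facet_grid(drv ~ year) +   # rows = drv, columns = year
  labs(x = "Highway mileage", y = "Frequency") +
  theme_minimal()
p_grid
```

Because the panels share axes along rows and columns, comparisons in either direction tend to be easier than with wrapped panels.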

35 / 60

Ridge-plots

36 / 60

These were all the rage in the summer of 2017, and initially named joy plots but the unfortunate connection with the source of the plots led the name to be revised to ridge-plots. If you are curious, see why not joy?. You need to have installed the ggridges package but other than that, they are easy to craft.

library(viridis)
library(ggridges)
library(ggthemes)
ggplot(lincoln_weather, 
       aes(x = `Mean Temperature [F]`, y = `Month`)) +
  geom_density_ridges(scale = 3, alpha = 0.3, 
                      aes(fill = Month)) +
  labs(x = "Mean Temperature",
       y = "Month", 
       title = 'Temperatures in Lincoln NE', 
       subtitle = 'Mean temperatures (Fahrenheit) by month for 2016',
       caption = "Data: The Weather Underground") +
  theme_ridges() +
  theme(axis.title.y = element_blank(), 
        legend.position = "none")

37 / 60

Here is another one, mapping the distribution of hemoglobin in four populations (the US being the reference group) as part of a study looking at the impact of altitude on hemoglobin concentration (courtesy Whitlock and Schluter).

hemo <- read.csv(url("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter02/chap02e3cHumanHemoglobinElevation.csv"))
ggplot(hemo, aes(x = hemoglobin, y = population)) +
  geom_density_ridges(scale = 3, alpha = 0.3, 
                      aes(fill = population)) +
  labs(title = 'Hemoglobin Concentration Levels',
       subtitle = 'in Four populations') +
  theme_ridges() +
  theme(axis.title.y = element_blank(),
        legend.position = "none")

As should be evident, ridge-plots are visually appealing when comparing a large number of groups on a single continuous variable, a setting where simple facet-wrap or other options would be suboptimal

38 / 60

Box-plots

39 / 60

These can be useful to look at the distribution of a continuous variable

ggplot(hemo, aes(y = hemoglobin, x = "")) +
  geom_boxplot(fill = "cornflowerblue") +
  coord_flip() + 
  labs(x = "", y = "Hemoglobin Concentration") +
  theme_minimal()

Note

the x = "" in aes(), which supplies a single placeholder group for the box-plot (older versions of ggplot2 would not build the box-plot without an x aesthetic)
coord_flip() is flipping the x-axis and y-axis
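To see the numbers behind a box-plot, base R's boxplot.stats() returns the whisker ends, hinges, and median, along with any points flagged as outliers. The values below are hypothetical; note that geom_boxplot() computes its quartiles in a slightly different way, so its hinges can differ a little from these.

```r
# Hypothetical values with one large observation
x <- c(1:9, 30)

bs <- boxplot.stats(x)
bs$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
bs$out    # points beyond 1.5 * IQR from the hinges, drawn individually
```

Here the whisker stops at the largest value that is not an outlier, and the value 30 is reported separately.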

40 / 60

ggplot(hemo, aes(y = hemoglobin, x = population,
                 fill = population)) +
  geom_boxplot() +
  coord_flip() +
  labs(x = "", y = "Hemoglobin Concentration") +
  # theme_minimal() goes first: a complete theme added later would undo theme()
  theme_minimal() +
  theme(axis.title.y = element_blank(),
        legend.position = "none")

Notice the need for no legend with fill = population

41 / 60

line charts

42 / 60

Ideal use: time series data

library(plotly)
data(economics)
# names(economics)
ggplot(economics, aes(x = date, y = uempmed)) +
  geom_line() + labs(x = "Date", 
                     y = "Median Duration of Unemployment (weeks)") +
  theme_minimal()
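To draw more than one series in the same ggplot2 line chart, a common approach is to stack the series into long format and map the series name to color. A sketch, assuming the tidyr package is installed (econ_long and p_two are illustrative names):

```r
library(ggplot2)
library(tidyr)

# Stack the two series into long format: one row per date and series
econ_long <- pivot_longer(
  economics[, c("date", "uempmed", "psavert")],
  cols = c(uempmed, psavert),
  names_to = "series",
  values_to = "value"
)

p_two <- ggplot(econ_long, aes(x = date, y = value, color = series)) +
  geom_line() +
  labs(x = "Date", y = "", color = "") +
  theme_minimal() +
  theme(legend.position = "bottom")
p_two
```

The two series are measured in different units (weeks and percent), so a shared y-axis is a convenience here rather than a recommendation.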

43 / 60

load("data/gap.df.RData")
ggplot(gap.df, aes(x = year, y = LifeExp,
                   group = continent, 
                   color = continent)) +
  geom_line() +
  geom_point() +
  labs(x = "Year", 
       y = "Median Life Expectancy (in years)",
       color = "") +
  theme_minimal() +
  theme(legend.position = "bottom")

Note: I switched off the legend label via color = ""

Pretty ugly!!

44 / 60

but we can dress these up by using plotly

library(plotly)
# width and height are set in plot_ly(); setting them in layout() is deprecated
myplot <- plot_ly(economics, x = ~date, width = 600, height = 300) %>%
  add_trace(y = ~uempmed, type = "scatter", mode = "lines",
            line = list(color = 'black'),
            name = "Median Duration of Unemployment (weeks)") %>%
  add_trace(y = ~psavert, type = "scatter", mode = "lines",
            line = list(color = 'red'),
            name = "Personal Savings Rate") %>%
  layout(autosize = FALSE)
library(shiny)
div(myplot, align = "center")

45 / 60

Two quantitative variables

46 / 60

scatter-plots ...

load("data/hsb2.RData")
ggplot(hsb2, aes(x = write, y = science)) +
  geom_point() +
  labs(x = "Writing Scores", y = "Science Scores") +
  theme_minimal()

47 / 60

color by ses

ggplot(hsb2, aes(x = write, y = science)) +
  geom_point(aes(color = ses)) +
  labs(x = "Writing Scores", y = "Science Scores") +
  theme_minimal() +
  theme(legend.position = "bottom")

48 / 60

breakout ses for ease of interpretation

ggplot(hsb2, aes(x = write, y = science)) +
  geom_point() +
  labs(x = "Writing Scores", y = "Science Scores") +
  facet_wrap(~ ses) +
  theme_minimal()

49 / 60

make it interactive with plotly

p = plot_ly(hsb2, x = ~write, y = ~science, color = ~ses,
            type = "scatter", mode = "markers",
            width = 600, height = 300) %>%
  layout(autosize = FALSE)
div(p, align = "center")

50 / 60

count plots: the size of each point shows the frequency of a given pair of values. The more frequent a pair, the larger its point.

data(mpg, package = "ggplot2")
ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_count(col = "firebrick", show.legend = FALSE) +
  labs(subtitle = "City vs Highway mileage",
       y = "Highway mileage", 
       x = "City mileage")  +
  theme_minimal()
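The point sizes in a count plot correspond to a simple tally of each pair of values, which can be reproduced by hand. A sketch, assuming dplyr is installed (pair_counts is an illustrative name):

```r
library(dplyr)
data(mpg, package = "ggplot2")

# One row per distinct (cty, hwy) pair; n is what geom_count() maps to size
pair_counts <- count(mpg, cty, hwy, sort = TRUE)
head(pair_counts, 3)
```

Looking at the top few rows tells you which pairs produce the largest points in the plot.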

51 / 60

Boston Marathon data

boston = read.csv("./data/BostonMarathon.csv")
boston2 = boston[sample(nrow(boston), 200), ]
ggplot(boston2, aes(x = Age, y = finishtime, group = M.F)) +
  geom_count(aes(color = M.F), show.legend = FALSE) +
  labs(subtitle = "", y = "Finishing Times (in seconds)",
       x = "Age (in years)") +
  facet_wrap(~ M.F, ncol = 1) +
  theme_minimal()

52 / 60

hexbins

53 / 60

scatter-plots not helpful when data points overlap
hex-bins carve up the plotting grid into hexagons of equal size, count how many $x, y$ pairs fall in each hexagon, and use a color scheme (like a heatmap) to show where hexagons have more data versus less.

ggplot(data = diamonds, aes(y = price, x = carat)) +
  geom_hex() +
  labs(x = "Weight in Carats", y = "Price") +
  theme_minimal()
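The size of the hexagons and the color scheme can both be adjusted. The sketch below is one possible variation; it checks for the hexbin package, which geom_hex() relies on, and the bins value and palette are arbitrary choices for illustration.

```r
library(ggplot2)

# geom_hex() relies on the hexbin package being installed
if (requireNamespace("hexbin", quietly = TRUE)) {
  p_hex <- ggplot(diamonds, aes(x = carat, y = price)) +
    geom_hex(bins = 50) +            # more, smaller hexagons than the default
    scale_fill_viridis_c() +         # sequential palette for the counts
    labs(x = "Weight in Carats", y = "Price", fill = "Count") +
    theme_minimal()
  print(p_hex)
}
```

As with histogram bins, it is worth trying a few values of bins = before settling on one.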

54 / 60

adding a third variable

ggplot(data = diamonds, aes(y = price, x = carat)) +
  geom_hex() +
  labs(x = "Weight in Carats", y = "Price") +
  facet_wrap(~ color) +
  theme_minimal()

55 / 60

`ggplot2`

rules & resources

ggplot(data, aes()) + geom_(aes()) + ...

aes() will take x =, y =, fill =, color =, group =, size =, radius = and more

each geom has its own components

plenty of themes available; see, e.g., ggthemes here

don't forget to stay in touch with development of ggplot2 extensions

of course, the plotly site and Carson Sievert's plotly book

keep ggplot2 cheatsheet handy

join stackoverflow but if you ask a question, post with a MWE

56 / 60

MWE = minimal working example
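A rough sketch of what an MWE for a plotting question might contain: a few rows of a built-in data set, printed with dput() so others can recreate it, plus the shortest code that shows the issue. The object names small and p_mwe are illustrative.

```r
library(ggplot2)

# A handful of rows from a built-in data set is usually enough
small <- head(mtcars[, c("mpg", "wt")], 5)

# dput() prints code that recreates the object, ready to paste into a post
dput(small)

# The shortest code that still shows the problem or question
p_mwe <- ggplot(small, aes(x = wt, y = mpg)) + geom_point()
```

Anyone reading the post can then paste the dput() output and the plotting code into a fresh R session and see exactly what you see.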

Some helpful resources

The Data Visualization Catalogue developed by Severino Ribecca to create a library of different information visualisation types

The R Graph Catalog maintained by Joanna Zhao and Jenny Bryan is always useful to see code-and-resulting-figure

Ferdio's Data Visualization Project, a website trying to present all relevant data visualizations, so you can find the right visualization and get inspired how to make them

The Chartmaker Directory will offer an answer to one of the most common questions in data visualisation: which tool do you need to make that chart?

Emery's Essentials focusing on the charts that give you the best bang for your buck

The Data Visualization Checklist by Ann K. Emery & Stephanie Evergreen

Data Visualization: A Practical Introduction by Kieran Healy

57 / 60

Remember ...

🔁 start with pencil and paper, sketch prototypes of desired visualization(s)

😄 graphics are relatively easy to generate with base R & with ggplot2

👏 common-sense: number & type of variable(s) guide plotting

🎇 stay color conscious: sensible colors & sensitive to color blindness

🔰 experiment, experiment, experiment until you are happy

use the 🆓 learning resources available online

📒 if you learn something new in R, write it down

58 / 60

Some relevant RStudio webinars

Check out Data Camp and the blogosphere for plenty of examples

59 / 60

Find me at...

@aruhil
aniruhil.org
ruhil@ohio.edu

60 / 60
