Graphics with ggplot2

Ani Ruhil

Our goal in this module is to understand some basic ways of visualizing data. We will skip base R commands and instead just work with ggplot2, the most popular visualization package in the R universe.

Remember the basic options…

one qualitative/categorical variable: bar-chart
one quantitative/continuous variable: histogram/box-plot/area-chart
two quantitative/continuous variables: scatter-plot/hex-bin

Two data-sets

I will use two data-sets, the first being this IMDB data-set

The Internet Movie Database, http://imdb.com/, is a website devoted to collecting movie data supplied by studios and fans. It claims to be the biggest movie database on the web and is run by Amazon. More information about imdb.com can be found online, http://imdb.com/help/show_leaf?about, including information about the data collection process, http://imdb.com/help/show_leaf?infosource.

library(ggplot2movies)

A data frame with 58788 rows and 24 variables

title. Title of the movie.
year. Year of release.
budget. Total budget (if known) in US dollars
length. Length in minutes.
rating. Average IMDB user rating.
votes. Number of IMDB users who rated this movie.
r1-10. Multiplying by ten gives the percentage (to the nearest 10%) of users who rated this movie a 1, a 2, and so on up to a 10.
mpaa. MPAA rating (missing for a lot of movies)
action, animation, comedy, drama, documentary, romance, short. Binary variables representing if movie was classified as belonging to that genre.

The second data-set is the Star Wars dataset, a tibble with 87 rows and 13 variables:

library(dplyr)
data(starwars)

name: Name of the character
height: Height (cm)
mass: Weight (kg)
hair_color, skin_color, eye_color: Hair, skin, and eye colors
birth_year: Year born (BBY = Before Battle of Yavin)
gender: male, female, hermaphrodite, or none.
homeworld: Name of homeworld
species: Name of species
films: List of films the character appeared in
vehicles: List of vehicles the character has piloted
starships: List of starships the character has piloted

A tibble, you say?

data.frame vs. tibbles

R’s default is to store data in a data.frame, as shown below with a small example. A data.frame has a tendency to change column names and, in versions of R prior to 4.0.0, to convert characters into factors by default.

data.frame(
  `Some Letters` = c("A", "B", "C"), 
  `Some Numbers` = c(1, 2, 3)
  ) -> adf 

str(adf) # show me the structure of this object called adf

'data.frame':   3 obs. of  2 variables:
 $ Some.Letters: chr  "A" "B" "C"
 $ Some.Numbers: num  1 2 3

print(adf) # display the object adf in the console

  Some.Letters Some.Numbers
1            A            1
2            B            2
3            C            3

Tibbles are the brainchild of the team behind RStudio and an idiosyncratic bundle of packages called the tidyverse; they drop some of R’s bad habits.

library(dplyr)

tibble(
  `Some Letters` = c("A", "B", "C"), 
  `Some Numbers` = c(1, 2, 3)
  ) -> atib 

glimpse(atib) # a transposed version of the print command

Rows: 3
Columns: 2
$ `Some Letters` <chr> "A", "B", "C"
$ `Some Numbers` <dbl> 1, 2, 3

print(atib) # display the object atib in the console

# A tibble: 3 × 2
  `Some Letters` `Some Numbers`
  <chr>                   <dbl>
1 A                           1
2 B                           2
3 C                           3

Look at adf and compare it with atib. While there are other advantages to tibbles that we will encounter at a later stage, for now, focus on the following benefits: Unlike data.frames, (1) tibbles retain the column names as created, and (2) tibbles never force characters into factors (data.frame() did so by default before R 4.0.0, which is why the output above shows chr in both cases).
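
For comparison, base R can be asked to behave more like a tibble. The sketch below relies on the check.names and stringsAsFactors arguments of data.frame(); the object name adf2 is made up for illustration.

```r
# check.names = FALSE stops data.frame() from rewriting
# "Some Letters" as "Some.Letters"
data.frame(
  `Some Letters` = c("A", "B", "C"),
  `Some Numbers` = c(1, 2, 3),
  check.names = FALSE,
  stringsAsFactors = FALSE # explicit, so older versions of R behave the same way
  ) -> adf2

names(adf2) # column names are kept as typed
str(adf2)   # the letters remain characters, not factors
```

With a tibble none of these extra arguments are needed, which is the point of the comparison.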

`ggplot2` and the grammar of graphics

qplot() will generate a quick plot but ggplot() is the way to go so we build with it. Read the relevant chapter on the grammar of graphics from the link in the syllabus or watch this video.

library(ggplot2)
ggplot(data = starwars)

Nothing results since we have not specified how we want the variable(s) to be mapped to the coordinate system… what variable should go on what axis?

ggplot(data = starwars, 
       mapping = aes(x = eye_color))

Now we are getting somewhere. We see the canvas with the specific eye colors on the x-axis but nothing else has been drawn since we have not specified the geometry … do you want a bar-chart? histogram? dot-plot? line-chart? what??

With a categorical variable the bar-chart would be appropriate and so we ask for a geom_bar()

ggplot(data = starwars, 
       mapping = aes(x = eye_color)) +
  geom_bar()

Other aesthetics can be added, such as group, color, fill, size, and alpha, along with embellishments such as axis labels, a plot title/subtitle, etc.

There are two aesthetics for adding a color scheme: color (or colour) versus fill

ggplot(data = starwars, 
       mapping = aes(x = eye_color, colour = eye_color)) +
  geom_bar() +
  labs(x = "Eye Color", 
       y = "Frequency (n)", 
       title = "Bar-chart of Eye Color", 
       subtitle = "(of Star Wars characters)")

Note what colour = generated for us (colored outlines around the bars), and how this differs from fill = (see below), which colors the inside of the bars.

ggplot(data = starwars, 
       mapping = aes(x = eye_color, fill = eye_color)) +
  geom_bar() +
  labs(x = "Eye Color", y = "Frequency",
       title = "Bar-chart of Eye Color",
       subtitle = "(of Star Wars characters)")

Of course, it would be good to have the colors match the eye-color so let us do that next.

c("black", "blue", "slategray", "brown",
  "gray34", "gold", "greenyellow",
  "navajowhite1", "orange", "pink", "red",
  "magenta", "thistle3", "white", "yellow"
  ) -> mycolors 

ggplot(data = starwars, mapping = aes(x = eye_color)) +
  geom_bar(fill = mycolors) +
  labs(x = "Eye Color",
       y = "Frequency",
       title = "Bar-chart of Eye Color",
       subtitle = "(of Star Wars characters)")

The R colors used are from this source but see also this source. Colors can be customized by generating your own palettes via the Color Brewer here. But don’t get carried away: Remember to read the materials on choosing colors wisely, particularly the points about qualitative palettes, divergent palettes, and palettes that work well even with colorblind audiences.
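
Passing a bare vector to fill = works only as long as the number and order of colors match the bars exactly. A slightly more defensive sketch is to name each color after the eye color it stands for and hand the named vector to scale_fill_manual(). The names eye_levels, eye_palette, and p_eye are made up for illustration, and the code assumes the installed starwars data has 15 distinct eye colors, as the 15-color vector implies.

```r
library(dplyr)   # for the starwars data
library(ggplot2)

c("black", "blue", "slategray", "brown",
  "gray34", "gold", "greenyellow",
  "navajowhite1", "orange", "pink", "red",
  "magenta", "thistle3", "white", "yellow"
  ) -> mycolors

# Name each color after the eye color it stands for; the bars are
# drawn in sorted order, so sort the unique values the same way
sort(unique(starwars$eye_color)) -> eye_levels
setNames(mycolors, eye_levels) -> eye_palette

ggplot(data = starwars,
       mapping = aes(x = eye_color, fill = eye_color)) +
  geom_bar(show.legend = FALSE) +
  scale_fill_manual(values = eye_palette) +
  labs(x = "Eye Color", y = "Frequency") -> p_eye

p_eye
```

Because the colors are matched by name, the mapping survives if the bars are later reordered or some categories are filtered out.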

I’ll switch to a different variable and show you how to use prebuilt color palettes.

ggplot(data = starwars, mapping = aes(x = gender)) +
  geom_bar(aes(fill = gender)) +
  labs(x = "Gender",
       y = "Frequency",
       title = "Bar-chart of Gender",
       subtitle = "(of Star Wars characters)",
       caption = "(Source: The dplyr package)") +
  scale_fill_brewer(palette = "Pastel1")

ggplot(data = starwars, mapping = aes(x = gender)) +
  geom_bar(aes(fill = gender)) +
  labs(x = "Gender",
       y = "Frequency",
       title = "Bar-chart of Gender",
       subtitle = "(of Star Wars characters)",
       caption = "(Source: The dplyr package)") +
  scale_fill_brewer(palette = "Set1")

library(ggplot2)
library(wesanderson)
ggplot(data = starwars, mapping = aes(x = gender)) +
  geom_bar(aes(fill = gender)) +
  labs(x = "Gender", y = "Frequency",
       title = "Bar-chart of Gender",
       subtitle = "(of Star Wars characters)",
       caption = "(Source: The dplyr package)") +
  scale_fill_manual(values = wes_palette("Darjeeling1"))

ggplot(data = starwars, aes(x = homeworld)) +
  geom_bar() +
  coord_flip()

ggplot(data = starwars, aes(x = species)) +
  geom_bar() +
  coord_flip()

ggplot(data = starwars, aes(x = gender)) +
  geom_bar()

Now add labels, a title, subtitle

ggplot(data = starwars, aes(x = gender)) +
  geom_bar() +
  labs(x = "Gender",
       y = "Frequency",
       title = "Bar-chart of Gender",
       subtitle = "(of Star Wars characters)",
       caption = "(Source: The dplyr package)")

Study the commands carefully and note that

scale_fill_brewer() is being used in the two gender bar-charts with palette = "Pastel1" and palette = "Set1", calling on built-in color palettes. You can review them here
scale_fill_manual() is being used in the third gender bar-chart, where a specific palette is being invoked from the wesanderson package

Color palettes will come into play far more later on in this course.

Themes

One can also lean on various plotting themes as shown below.

library(ggthemes)

ggplot(data = starwars, 
             mapping = aes(x = eye_color)) +
  geom_bar() +
  theme_tufte() +
  theme(axis.text.x = element_text(size = 6)) -> p1 

ggplot(data = starwars, 
             mapping = aes(x = eye_color)) +
  geom_bar() +
  theme_solarized() +
  theme(axis.text.x = element_text(size = 6)) -> p2 

ggplot(data = starwars, 
             mapping = aes(x = eye_color)) +
  geom_bar() +
  theme_economist() +
  theme(axis.text.x = element_text(size = 6)) -> p3 

ggplot(data = starwars, 
             mapping = aes(x = eye_color)) +
  geom_bar() +
  theme_fivethirtyeight() +
  theme(axis.text.x = element_text(size = 6)) -> p4 

library(patchwork)
p1 + p2 + p3 + p4 + plot_layout(ncol = 1)

Later on you will learn these & other ways to build advanced visualizations … for now we get to work more with ggplot2.

More with bar-charts

library(ggplot2movies)
ggplot(data = movies, aes(x = mpaa)) +
  geom_bar() +
  theme_minimal()

library(ggplot2)
ggplot(data = movies) +
  geom_bar(aes(x = mpaa)) +
  theme_minimal()

Notice that we moved the aes() piece of the code from ggplot() to geom_bar() but that made no difference; this is important to bear in mind because it will come in handy down the road when we need to build some advanced visualizations.
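
A quick way to convince yourself that the two placements are equivalent is to compare what each plot computes. The sketch below uses the mpg data that ships with ggplot2 as a stand-in for movies (an assumption made so the code runs without extra packages); p_global and p_local are made-up names.

```r
library(ggplot2)

# Mapping declared in ggplot(): inherited by every layer that follows
ggplot(data = mpg, aes(x = class)) +
  geom_bar() -> p_global

# Mapping declared in geom_bar(): applies to that layer only
ggplot(data = mpg) +
  geom_bar(aes(x = class)) -> p_local

# Both plots compute the same bar heights
all.equal(layer_data(p_global)$count, layer_data(p_local)$count)
```

The difference only starts to matter once a plot has several layers, since a mapping set inside one geom is not seen by the others.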

The plot is sub-optimal since MPAA ratings are missing for a lot of movies; these movies should be eliminated from the plot via subset(movies, mpaa != "")

str(movies$mpaa)

 chr [1:58788] "" "" "" "" "" "" "R" "" "" "" "" "" "" "" "PG-13" ...

ggplot(subset(movies, mpaa != ""), aes(x = mpaa)) +
  geom_bar() +
  theme_minimal()

The order of the bars is fortuitous in that it goes from the smallest frequency to the highest frequency, drawing the reader’s eye. I said fortuitous because the default is to order the bars in an ascending alphabetic/alphanumeric order if the variable is a character. See below for an example.

library(dplyr)
tibble(
  x = c(
    rep("A", 2),
    rep("B", 4),
    rep("C", 1)
    )
  ) -> df 

ggplot(data = df, aes(x = x)) +
  geom_bar() +
  theme_minimal()

Later on we’ll learn how to order the bars with ascending/descending frequencies or by some other logic.
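
As a small preview, here is one possible base R approach, not necessarily the one used later in the course: turn the variable into a factor whose levels are sorted by frequency. The names df_ord and by_freq are made up for illustration.

```r
library(ggplot2)

data.frame(
  x = c(rep("A", 2), rep("B", 4), rep("C", 1))
  ) -> df_ord

# Turn x into a factor whose levels run from most to least frequent
names(sort(table(df_ord$x), decreasing = TRUE)) -> by_freq
factor(df_ord$x, levels = by_freq) -> df_ord$x

levels(df_ord$x) # "B" "A" "C"

ggplot(data = df_ord, aes(x = x)) +
  geom_bar() +
  theme_minimal()
```

ggplot2 draws the bars in the order of the factor levels, so B (4) now comes first, followed by A (2) and C (1).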

What about plotting the relative frequencies on the y-axis rather than the frequencies?

library(ggplot2movies) 
library(scales)
library(ggplot2)
ggplot(data = subset(movies, mpaa != ""), 
       aes(x = mpaa,
           y = (..count..)/sum(..count..))) +
  geom_bar() +
  scale_y_continuous(labels = scales::percent) +
  labs(x = "MPAA Rating",
       y = "Relative Frequency (%)") +
  theme_minimal()

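Note that the addition of y = (..count..)/sum(..count..) gives us proportions on the y-axis that are then converted into % via scale_y_continuous(labels = scales::percent). The ..count.. notation was deprecated in ggplot2 3.4.0 in favor of after_stat(count), so newer versions will issue a warning.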
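
Here is a sketch of the same kind of relative-frequency chart written with after_stat(), which ggplot2 3.3.0 and later provide for referring to computed variables such as count. It uses the mpg data from ggplot2 rather than movies (an assumption made to keep the example self-contained); p_rel is a made-up name.

```r
library(ggplot2)
library(scales)

ggplot(data = mpg,
       aes(x = class,
           y = after_stat(count / sum(count)))) +
  geom_bar() +
  scale_y_continuous(labels = scales::percent) +
  labs(x = "Vehicle Class",
       y = "Relative Frequency (%)") +
  theme_minimal() -> p_rel

# The bar heights are proportions, so they add up to 1
sum(layer_data(p_rel)$y)
```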

We could also add a second or even a third/fourth categorical variable. Let us see this with our hsb2 data-set. We can start by reading in the data file.

library(here)
load(here("data", "hsb2.RData"))
colnames(hsb2) <- tolower(colnames(hsb2))

ggplot(data = hsb2, aes(x = ses, group = female)) + 
  geom_bar(aes(fill = female)) + 
  theme_minimal()

This is not very useful since the viewer has to estimate the relative sizes of the two colors within any given bar. That can be fixed with position = "dodge", juxtaposing the bars for the groups as a result, and the end product is much better.

ggplot(data = hsb2, aes(x = ses, group = female)) + 
  geom_bar(aes(fill = female), position = "dodge") + 
  theme_minimal()

This is fine if you want to know how many of the 200 students are low SES males, low SES females, etc. What if you wanted to calculate percentages within each sex?

ggplot(data = hsb2, aes(x = ses)) + 
  geom_bar(aes(group = female,
               fill = female, y = ..prop..),
           position = "dodge") +
  scale_y_continuous(labels = scales::percent) +
  labs(y = "Relative Frequency (%)",
       x = "Socioeconomic Status Groups") +
  theme_minimal()

What about within each ses?

ggplot(data = hsb2, aes(x = female)) +
  geom_bar(aes(group = ses, fill = ses, y = ..prop..),
           position = "dodge") +
  scale_y_continuous(labels = scales::percent) +
  labs(y = "Relative Frequency (%)",
       x = "Sex") +
  theme_minimal()
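
The ..prop.. variable is calculated within each group, which is why the group aesthetic decides the denominator. One way to check the arithmetic is to compute the within-group shares by hand and plot them with geom_col(). The sketch below does this with dplyr on the mpg data from ggplot2, with year and drv standing in for female and ses (an assumption for illustration, since hsb2 is a local file); drv_by_year is a made-up name.

```r
library(dplyr)
library(ggplot2)

# Count each (year, drv) pair, then divide by the total for that year
mpg %>%
  count(year, drv) %>%
  group_by(year) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup() -> drv_by_year

drv_by_year # within each year the prop column sums to 1

ggplot(data = drv_by_year,
       aes(x = drv, y = prop, fill = factor(year))) +
  geom_col(position = "dodge") +
  scale_y_continuous(labels = scales::percent) +
  labs(x = "Drive Train",
       y = "Relative Frequency (%)",
       fill = "Model Year") +
  theme_minimal()
```

Swapping the roles of the two variables in group_by() changes the denominator, just as swapping group = does in the charts above.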

Histograms

If you’ve forgotten what these are, see histogram, or Yau’s piece here and here. There is a short video available as well.

Let us load the hsb2 data we had downloaded, processed (adding value labels to categorical variables) and saved in our data folder in the last module.

load(here("data", "hsb2.RData"))

For histograms in ggplot2, geom_histogram() does the trick but note that the default number of bins (30) is not very useful and can be tweaked, along with other embellishments.

ggplot(data = hsb2, aes(x = read)) + 
  geom_histogram(fill = "cornflowerblue",
                 color = "white") + 
  labs(title = "Histogram of Reading Scores",
       x = "Reading Score",
       y = "Frequency") +
  theme_minimal()

We could set bins = 5, and we could also experiment with increasing the binwidth to 10.

ggplot(data = hsb2, aes(x=read)) +
  geom_histogram(fill="cornflowerblue",
                 color = "white",
                 bins = 5) +
  labs(title = "Histogram of Reading Scores",
       x = "Reading Score",
       y = "Frequency") +
  theme_minimal()

ggplot(data = hsb2, aes(x=read)) +
  geom_histogram(fill="cornflowerblue",
                 color = "white",
                 binwidth = 10) +
  labs(title = "Histogram of Reading Scores",
       x = "Reading Score",
       y = "Frequency") +
  theme_minimal()
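
To see what bins = and binwidth = actually do, it can help to look at the bars ggplot2 computes rather than only at the picture. The sketch below uses highway mileage from the mpg data in ggplot2 instead of the reading scores (an assumption made so the code runs without the hsb2 file); p_bins and p_width are made-up names.

```r
library(ggplot2)

ggplot(data = mpg, aes(x = hwy)) +
  geom_histogram(bins = 5,
                 fill = "cornflowerblue",
                 color = "white") -> p_bins

ggplot(data = mpg, aes(x = hwy)) +
  geom_histogram(binwidth = 10,
                 fill = "cornflowerblue",
                 color = "white") -> p_width

# layer_data() returns one row per bar, with its limits and count
layer_data(p_bins)[, c("xmin", "xmax", "count")]
layer_data(p_width)[, c("xmin", "xmax", "count")]
```

With bins = you fix how many bars there are and ggplot2 works out their width; with binwidth = you fix the width and the number of bars follows from the range of the data.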

If we wanted to break out the histogram by one or more categorical variables, we could do so quite easily:

ggplot(hsb2, aes(x = read)) +
  geom_histogram(fill="cornflowerblue",
                 bins = 5,
                 color = "white") +
  labs(title = "Histogram of Reading Scores",
       x = "Reading Score",
       y = "Frequency") +
  facet_wrap(~ female) +
  theme_minimal()

Or better yet,

ggplot(hsb2, aes(x = read)) +
  geom_histogram(fill="cornflowerblue",
                 bins = 10,
                 color = "white") +
  labs(title = "Histogram of Reading Scores",
       x = "Reading Score",
       y = "Frequency") +
  facet_wrap(~ female, ncol = 1) +
  theme_minimal()

since now the distributions are stacked above each other, easing comparisons.

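One useful design element with breakouts is placing in relief the consolidated data (i.e., the distribution for all of the data rather than by female/male). In the code below, the second geom_histogram() is given hsb2[, -2], i.e., the data without its second column (female here); because that layer has no facetting variable, it is drawn in full in every panel.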

ggplot(data = hsb2, aes(x = read, fill = female)) +
  geom_histogram(bins = 10, color = "white") +
  labs(title = "Histogram of Reading Scores",
       x = "Reading Score",
       y = "Frequency") +
  facet_wrap(~ female, ncol = 1) + 
  geom_histogram(data = hsb2[, -2],
                 bins = 10,
                 fill = "grey",
                 alpha = .5) +
  theme_minimal()

Here it is obvious that the distribution of reading scores for either sex is similar to the overall distribution, so perhaps the groups are not really that different in terms of reading scores.
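
The same idea can be sketched without relying on column positions by dropping the facetting variable by name. The example uses the mpg data from ggplot2, with drv playing the role of female (an assumption for illustration); mpg_all and p_relief are made-up names. Here the grey layer is added first so that it sits behind the colored bars.

```r
library(ggplot2)

# A copy of the data without the facetting variable (drv)
mpg[, setdiff(names(mpg), "drv")] -> mpg_all

ggplot(data = mpg, aes(x = hwy)) +
  # no drv column in this layer's data, so it repeats in every panel
  geom_histogram(data = mpg_all,
                 bins = 10,
                 fill = "grey",
                 alpha = 0.5) +
  geom_histogram(aes(fill = drv),
                 bins = 10,
                 color = "white") +
  facet_wrap(~ drv, ncol = 1) +
  labs(x = "Highway Mileage",
       y = "Frequency") +
  theme_minimal() -> p_relief

p_relief
```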

For breakouts with two categorical variables we could do

ggplot(data = hsb2, aes(x = read)) +
  geom_histogram(fill="cornflowerblue",
                 bins = 10,
                 color = "white") +
  labs(title = "Histogram of Reading Scores",
       x = "Reading Score",
       y = "Frequency") + 
  facet_wrap(~ female + schtyp, ncol = 2) + 
  theme_minimal()

Note that ~ female + schtyp renders the panels for the first category of female by all categories of schtyp and then repeats for the other category of female.

ggplot(data = hsb2, aes(x = read)) +
  geom_histogram(fill = "cornflowerblue",
                 bins = 10, color = "white") +
  labs(title = "Histogram of Reading Scores",
       x = "Reading Score",
       y = "Frequency") +
  facet_wrap(schtyp ~ female, ncol = 2) +
  theme_minimal()

Note that schtyp ~ female renders the panels for the first category of schtyp for all categories of female and then repeats for the other category of schtyp

… which is the same as …

ggplot(data = hsb2, aes(x = read)) +
  geom_histogram(fill = "cornflowerblue",
                 bins = 10,
                 color = "white") +
  labs(title = "Histogram of Reading Scores",
       x = "Reading Score",
       y = "Frequency") +
  facet_wrap(~ schtyp + female, ncol = 2) +
  theme_minimal()

In general, do not forget to set the y limit to start at 0, or make a note in the plot for readers so they don’t assume it is at 0 when in fact it has been truncated for ease of data presentation. If truncating misstates the pattern in the data, do not do it or, again, annotate the plot to that effect so nobody is misled. Bar-charts and histograms will have 0 as the minimum y-limit by default but this is not so for line-charts, scatter-plots, and some other plots involving continuous variables.
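
One way to force the y-axis to include 0 is expand_limits(); what follows is a minimal sketch using the economics data that ships with ggplot2, with p_zero as a made-up name. The caption is one possible way of telling readers where the axis starts.

```r
library(ggplot2)

ggplot(economics, aes(x = date, y = uempmed)) +
  geom_line() +
  expand_limits(y = 0) + # make sure the y-axis includes 0
  labs(x = "Date",
       y = "Median Duration of Unemployment (weeks)",
       caption = "y-axis starts at 0") +
  theme_minimal() -> p_zero

p_zero
```

Unlike setting hard limits with ylim(), expand_limits() only widens the axis, so no observations are dropped.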

Ridge-plots

These were all the rage in the summer of 2017 and were named joy plots, but the unfortunate connection with the source of the plots led the name to be revised to ridge-plots. If you are curious, see why not joy?. You need to have installed the ggridges package but other than that, they are easy to craft.

library(viridis)
library(ggridges)
library(ggthemes)
ggplot(lincoln_weather, aes(x = `Mean Temperature [F]`, y = `Month`)) +
  geom_density_ridges(scale = 3, alpha = 0.3, aes(fill = Month)) +
  labs(title = 'Temperatures in Lincoln NE',
       subtitle = 'Mean temperatures (Fahrenheit) by month for 2016\nData: Original CSV from the Weather Underground') +
  theme_ridges() +
  theme(axis.title.y = element_blank(),
        legend.position = "none")

Here is another one, mapping the distribution of hemoglobin in four populations (the US being the reference group) as part of a study looking at the impact of altitude on hemoglobin concentration (courtesy Whitlock and Schluter).

hemoglobinData <- read.csv(url("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter02/chap02e3cHumanHemoglobinElevation.csv"))

ggplot(hemoglobinData, aes(x = hemoglobin, y = population)) +
  geom_density_ridges(scale = 3, alpha = 0.3,
                      aes(fill = population)) +
  labs(title = 'Hemoglobin Concentration Levels',
       subtitle = 'in Four populations') +
  theme_ridges() +
  theme(axis.title.y = element_blank(),
        legend.position = "none")

As should be evident, they are visually appealing when comparing a large number of groups on a single continuous variable, where a simple facet_wrap() or other options would be unfeasible.

Box-plots

These can be useful to look at the distribution of a continuous variable. See this video.

ggplot(hemoglobinData, aes(y = hemoglobin, x = "")) +
  geom_boxplot(fill = "cornflowerblue") +
  coord_flip() + 
  labs(x = "",
       y = "Hemoglobin Concentration") + 
  theme_minimal()

Note:

the x = "" in aes() supplies a single, unlabeled group for the x-axis; without it the box-plot for a single group may not be built
coord_flip() flips the x-axis and y-axis so that the box is drawn horizontally

And now for the hemoglobin data broken out by population.

ggplot(hemoglobinData, aes(y = hemoglobin, x = population, fill = population)) +
  geom_boxplot() +
  coord_flip() +
  labs(x = "",
       y = "Hemoglobin Concentration") +
  theme_minimal() + 
  theme(axis.title.y = element_blank(),
        legend.position = "none")

Notice that no legend is needed with fill = population since the axis labels already identify each population, hence legend.position = "none".

Line-charts

These are useful for time-series data since they map trends over time.

library(plotly)
data(economics)
# names(economics)
ggplot(economics, aes(x = date, y = uempmed)) +
  geom_line() +
  labs(x = "Date",
       y = "Median Duration of Unemployment (weeks)") + 
  theme_minimal()

They can look very plain and aesthetically unappealing unless you dress them up. See the one below and then the one that follows.

load(here("data", "gap.df.RData"))
ggplot(
  gap.df,
  aes(x = year, y = LifeExp,
      group = continent,
      color = continent)
  ) +
  geom_line() +
  geom_point() +
  labs(x = "Year",
       y = "Median Life Expectancy (in years)") +
  theme_minimal() +
  theme(legend.position = "bottom")

Here is the more aesthetically pleasing version built using plotly

library(plotly)
plot_ly(economics, x = ~date,
                  color = I("black")) %>%
  add_trace(y = ~uempmed,
            name = 'Median Duration of Unemployment (weeks)',
            line = list(color = 'black'),
            mode = "lines") %>%
  add_trace(y = ~psavert,
            name = 'Personal Saving Rate',
            line = list(color = 'red'),
            mode = "lines") %>%
  layout(autosize = FALSE, width = 700, height = 300) -> myplot 

library(shiny)
div(myplot, align = "center")

Scatter-plots

These are great with two continuous variables, and work well to highlight the nature and strength of a relationship between the two variables … what happens to y as x increases?

ggplot(hsb2, aes(x = write,
                 y = science)
       ) +
  geom_point() +
  labs(x = "Writing Scores",
       y = "Science Scores") +
  theme_minimal()

We could lean on ggplot2 and highlight the different ses groups, to see if there is any difference.

ggplot(hsb2, aes(x = write,
                 y = science)) +
  geom_point(aes(color = ses)) +
  labs(x = "Writing Scores",
       y = "Science Scores") +
  theme_minimal() +
  theme(legend.position = "bottom")

This is not very helpful so why not break out ses for ease of interpretation?

ggplot(hsb2, aes(x = write,
                 y = science)) +
  geom_point() +
  labs(x = "Writing Scores",
       y = "Science Scores") +
  facet_wrap(~ ses)  +
  theme_minimal()

And then of course we could make it interactive with plotly …

plot_ly(data = hsb2,
             x = ~write,
             y = ~science,
             color = ~ses) -> p 
div(p, align = "right")

Count plots

Count plots show the frequency of given pairs of values by varying the sizes of the points. The higher the frequency of a pair, the greater the size of its point. Useful but somehow I don’t end up using them much.

data(mpg, package = "ggplot2")
ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_count(col = "firebrick",
             show.legend = FALSE) +
  labs(subtitle = "City vs Highway mileage",
       y = "Highway mileage",
       x = "City mileage")  +
  theme_minimal()
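
What geom_count() does behind the scenes is count how often each pair of values occurs. This can be sketched by doing the counting explicitly with dplyr and mapping the result to point size; pair_counts is a made-up name.

```r
library(dplyr)
library(ggplot2)

# How often does each (cty, hwy) pair occur?
mpg %>%
  count(cty, hwy) -> pair_counts

head(arrange(pair_counts, desc(n))) # the most common pairs

# Mapping size to n mimics what geom_count() draws
ggplot(pair_counts, aes(x = cty, y = hwy, size = n)) +
  geom_point(color = "firebrick", show.legend = FALSE) +
  labs(x = "City mileage",
       y = "Highway mileage") +
  theme_minimal()
```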

The second example relies on our Boston Marathon data, looking at finishing times of men and women, respectively. We could have tried to put both groups in the same plot but that would end up obscuring things more than revealing anything.

read.csv(here("data", "BostonMarathon.csv")) -> boston 

boston[sample(nrow(boston), 200, replace = FALSE), ] -> boston2 # draw a random sample of 200 runners without replacement 

ggplot(boston2,
       aes(x = Age, 
           y = finishtime,
           group = M.F)
       ) +
  geom_count(aes(color = M.F),
             show.legend = FALSE) +
  labs(subtitle = "", 
       y = "Finishing Times (in seconds)",
       x = "Age (in years)") +
  facet_wrap(~ M.F, ncol = 1)  +
  theme_minimal()
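
Since the 200 runners are drawn at random, the plot will change every time the code is run. If you want the same sample each time, one option is to fix the random seed first. Below is a minimal sketch with the built-in mtcars data standing in for the Boston Marathon file (an assumption, since that file is local); the seed value and object names are arbitrary.

```r
# Fixing the seed makes the "random" draw repeatable
set.seed(2017) # any integer works; 2017 is an arbitrary choice
sample(nrow(mtcars), 10, replace = FALSE) -> rows_a

set.seed(2017)
sample(nrow(mtcars), 10, replace = FALSE) -> rows_b

identical(rows_a, rows_b) # TRUE: same seed, same rows

mtcars[rows_a, ] -> mtcars_sample # the sampled rows
```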

Hex-bins

Scatter-plots and count plots are not helpful when data points overlap. This is where hex-bins come in handy. In brief, they carve up the plotting grid into hexagons of equal size, count how many \(x,y\) pairs fall in each hexagon, and use a color scheme (like a heatmap) to show where hexagons have more data versus less.

ggplot(data = diamonds, aes(y = price, x = carat)) +
  geom_hex() +
  labs(x = "Weight in Carats",
       y = "Price") +
  theme_minimal()

We could add a third variable, diamond color, for example.

ggplot(data = diamonds, aes(y = price, x = carat)) +
  geom_hex() +
  labs(x = "Weight in Carats",
       y = "Price") +
  facet_wrap(~ color) +
  theme_minimal()
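
The counting step behind a hex-bin plot can be sketched by hand. To keep the arithmetic simple, the example below swaps in a rectangular grid for the hexagons, so it illustrates the idea rather than reproducing what geom_hex() does; the data are simulated and all object names are made up.

```r
library(ggplot2)

set.seed(1)
data.frame(x = rnorm(5000), y = rnorm(5000)) -> pts # heavily overlapping points

# Assign every point to a cell of a 20 x 20 grid and count points per cell
cut(pts$x, breaks = 20) -> pts$x_cell
cut(pts$y, breaks = 20) -> pts$y_cell
as.data.frame(table(x_cell = pts$x_cell,
                    y_cell = pts$y_cell)) -> cell_counts

sum(cell_counts$Freq) # every point lands in exactly one cell

# geom_bin2d() does this kind of counting and colors each cell by its count
ggplot(pts, aes(x = x, y = y)) +
  geom_bin2d(bins = 20) +
  theme_minimal()
```

Hexagons are generally preferred over squares because their centers are more evenly spaced, but the bookkeeping is the same: assign each point to a cell, count, and color by the count.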

Some reminders about `ggplot2` rules & other resources

basic structure: ggplot(data, aes()) + geom_(aes()) + ...
aes() will take x =, y =, fill =, color =, group =, size =, radius =, and more
each geom has its own components
plenty of themes available; see, e.g., ggthemes here
don’t forget to stay in touch with development of ggplot2 extensions
of course, the plotly site and Carson Sievert’s plotly book
keep ggplot2 cheatsheet handy
join stackoverflow but if you ask a question, post with a MWE (minimal working example). This is crucial; otherwise be prepared to have your head bitten off or at best have nobody respond to your question!

Some resources to bear in mind:

The Data Visualization Catalogue developed by Severino Ribecca to create a library of different information visualization types
The R Graph Catalog maintained by Joanna Zhao and Jenny Bryan is always useful to see code-and-resulting-figure
Ferdio’s Data Visualization Project, a website trying to present all relevant data visualizations, so you can find the right visualization and get inspired how to make them
The Chartmaker Directory will offer an answer to one of the most common questions in data visualization: ‘which tool do you need to make that chart?’
Emery’s Essentials focusing on the charts that give you the best bang for your buck
The Data Visualization Checklist by Ann K. Emery & Stephanie Evergreen

And finally, my suggestion of how to go about building your visualizations:

🔁 start with pencil and paper, sketch prototypes of desired visualization(s)
😄 graphics are relatively easy to generate with base R & with ggplot2
👏 common-sense: number & type of variable(s) guide plotting
🎇 stay color conscious: sensible colors & sensitive to color blindness
🔰 experiment, experiment, experiment until you are happy
use the 🆓 learning resources available online
📒 if you learn something new in R, write it down

Practice tasks

Ex. 1: Lord of the Rings trilogy data

Use the Lord of the Rings data emailed to you to answer the following questions. Note that these data are from jennybc and represent the number of words spoken by characters in the LOTR trilogy. Some other, pretty amazing visualizations can be seen here, the work of Nadieh Bremer. You are merely looking at how many times a particular race or character appears on screen with a dialogue of at least one word.

Generate an appropriate chart that shows the distribution of Race
Now break this distribution out by Film to see how Race is distributed across Film.
Now generate an appropriate chart to show the distribution of Character by film. Use coord_flip() to flip the coordinates so that the characters show up on the y-axis.
Now use facet_wrap() to generate the three-panel layout, one panel per film.
Use an appropriate chart to plot the distribution of the number of words spoken overall
Now break up this chart by movie.
What if you did it by Race? Which race seems to speak the most?

Ex. 2: Water levels in the Great Lakes

Download the monthly Great Lakes water level dataset in SPSS format from here or in Excel format from here. Note that water level is in meters.

Use the following commands to read in the Excel file:

library(readxl)
url <- "https://aniruhil.github.io/avsr/teaching/dataviz/greatlakes.xlsx"
destfile <- "greatlakes.xlsx"
curl::curl_download(url, destfile)
greatlakes <- read_excel(destfile, col_types = c("date", 
     "numeric", "numeric", "numeric", "numeric", 
     "numeric"))

Now use an appropriate chart to show the water level for Lake Superior.

Ex. 3: County Health Rankings

Download the 2017 County Health Rankings data in SPSS format from here or in Excel format from here, along with the accompanying codebook.

Construct appropriate plots that show the relationship between the following pairs of variables:

Adult obesity and High school graduation
Children in poverty and High school graduation
Preventable hospital stays and Unemployment rate

Ex. 4: Unemployment Rates

Use the unemployment data given to you and construct appropriate plots that show the distribution of unemployment rates for each of the four educational attainment groups.