Our goal in this module is to understand some basic ways of visualizing data with graphics – bar-charts, histograms, box-plots, and so on. We will skip base R commands and instead just work with ggplot2, the most popular visualization package in the {tidyverse} universe as of now.

If you remember MPA 6010, recall the usual options…

  • one qualitative/categorical variable: bar-chart
  • one quantitative/continuous variable: histogram/box-plot/area-chart
  • two quantitative/continuous variables: scatter-plot/hex-bin

We can then ratchet up as we need to.

I will use two data-sets to walk through the initial examples in this module, the first being this IMDB data-set:

The Internet Movie Database, http://imdb.com/, is a website devoted to collecting movie data supplied by studios and fans. It claims to be the biggest movie database on the web and is run by Amazon. More information about imdb.com can be found online, http://imdb.com/help/show_leaf?about, including information about the data collection process, http://imdb.com/help/show_leaf?infosource.

library(ggplot2movies)
data(movies)
names(movies)
##  [1] "title"       "year"        "length"      "budget"      "rating"     
##  [6] "votes"       "r1"          "r2"          "r3"          "r4"         
## [11] "r5"          "r6"          "r7"          "r8"          "r9"         
## [16] "r10"         "mpaa"        "Action"      "Animation"   "Comedy"     
## [21] "Drama"       "Documentary" "Romance"     "Short"

A data frame with 28819 rows and 24 variables

Variable     Description
title        Title of the movie
year         Year of release
budget       Total budget (if known) in US dollars
length       Length in minutes
rating       Average IMDB user rating
votes        Number of IMDB users who rated this movie
r1 to r10    Multiplying by ten gives the percentage (to nearest 10%) of users who rated this movie a 1, a 2, and so on up to a 10
mpaa         MPAA rating (missing for a lot of movies)
Action, Animation, Comedy, Drama, Documentary, Romance, Short    Binary variables representing if movie was classified as belonging to that genre

The second data-set is the Star Wars dataset, a tibble with 87 rows and 13 variables:

library(tidyverse)
data(starwars)
names(starwars)
##  [1] "name"       "height"     "mass"       "hair_color" "skin_color"
##  [6] "eye_color"  "birth_year" "gender"     "homeworld"  "species"   
## [11] "films"      "vehicles"   "starships"
# str(starwars)

Variable                            Description
name                                Name of the character
height                              Height (cm)
mass                                Weight (kg)
hair_color, skin_color, eye_color   Hair, skin, and eye colors
birth_year                          Year born (BBY = Before Battle of Yavin)
gender                              male, female, hermaphrodite, or none
homeworld                           Name of homeworld
species                             Name of species
films                               List of films the character appeared in
vehicles                            List of vehicles the character has piloted
starships                           List of starships the character has piloted

1 ggplot2 and the grammar of graphics

The {ggplot2} package has a special syntax and I will point out things you should note as we move through this module. First up, the library is called ggplot2 but the command starts with ggplot so don’t let that throw you off-track.

Second, you need to have a data-set to work with. In the code below I start by loading the library and then specifying the data-set to be used.

library(ggplot2)
ggplot(
  data = starwars
  )

Nothing but a blank canvas results from these commands because we have not yet specified what should go on the x-axis or the y-axis. Well, let us do that then by asking for the column eye_color to be put on the x-axis.

ggplot(
  data = starwars,
  mapping = aes(
    x = eye_color
    )
  ) 

1.1 geom_bar() … the bar-chart

This results in a gray canvas with the eye colors on the x-axis but nothing else has been drawn since we have not specified the geometry … do you want a bar-chart? histogram? dot-plot? line-chart? This is a categorical variable and hence a bar-chart would be appropriate. We call for a bar-chart with the geom_bar() command.

ggplot(
  data = starwars,
  mapping = aes(
    x = eye_color
    )
  ) +
  geom_bar()

1.2 color versus fill

The aes() refers to the aesthetics of the chart, and many other aesthetics can be added, such as group, color, fill, size, alpha, etc. We will see some of these in due course but for now I want to focus on two of these, both involving coloring of the geom_. Specifically, there are two arguments for adding colors – (1) color or colour, and (2) fill – to a chart.

ggplot(
  data = starwars,
  mapping = aes(
    x = eye_color,
    color = eye_color
    )
  ) +
  geom_bar()

Note what the color = eye_color command did … it drew a colored border for the bars, and an accompanying legend. What if we had used fill = eye_color instead?

ggplot(
  data = starwars,
  mapping = aes(
    x = eye_color,
    fill = eye_color
    )
  ) +
  geom_bar()

Aha! Now the bars are filled with colors and an accompanying legend is drawn as well. So fill = and color = behave very differently; bear this in mind.

1.3 Adding labels with labs

One of the nice things about this software environment is that there are plenty of coloring schemes available to us and we will play with some of these shortly, but before we do that, let us look at one more improvement – adding titles, subtitles, captions, and axis labels to our chart. This is done with the labs() command.

ggplot(
  data = starwars,
  mapping = aes(
    x = eye_color,
    fill = eye_color
    )
  ) +
  geom_bar() +
  labs(
    x = "Eye Color",
    y = "Frequency (n)",
    title = "Bar-chart of Eye Colors",
    subtitle = "(of Star Wars characters)",
    caption = "My little work of art!!"
    )

Notice the text that now appears as a result of what has been specified in the labs() command.

1.4 Controlling the chart legend with theme()

In this bar-chart, do we really need the legend? No, because the colors and color names show up in the chart itself. How can we hide the legend? Turns out there is a neat command that will allow you to move the legend around and even to hide it.

ggplot(
  data = starwars,
  mapping = aes(
    x = eye_color,
    fill = eye_color
    )
  ) +
  geom_bar() +
  labs(
    x = "Eye Color",
    y = "Frequency (n)",
    title = "Bar-chart of Eye Colors",
    subtitle = "(of Star Wars characters)",
    caption = "My little work of art!!"
    ) +
  theme(legend.position = "none")

Voila! The legend is gone. Instead of "none" you could have specified "bottom", "left", "top", or "right" to place the legend in a particular direction.
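
To see the legend moved rather than hidden, here is a minimal sketch that places it below the plot. It assumes {dplyr} is installed (that is where the starwars data live); the object name p_bottom is just a label I made up for this example.

```r
library(ggplot2)
library(dplyr)   # supplies the starwars data-set

ggplot(
  data = starwars,
  mapping = aes(
    x = eye_color,
    fill = eye_color
    )
  ) +
  geom_bar() +
  # "bottom" could be swapped for "left", "top", or "right"
  theme(legend.position = "bottom") -> p_bottom

p_bottom
```

With this many eye colors the legend gets crowded wherever it sits, which is one more reason to drop it for this particular chart.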

1.5 Customizing colors

Of course, it would be good to have the colors match the eye-color so let us do that next. The way we can do this is by calling specific colors by name. I have tried to order the lineup of the colors to match, as closely as I can, the eye colors.

c(
  "black", "blue", "slategray", "brown", "gray34", "gold",
  "greenyellow", "navajowhite1", "orange", "pink", "red",
  "magenta", "thistle3", "white", "yellow"
  ) -> mycolors

ggplot(
  data = starwars,
  mapping = aes(
    x = eye_color
    )
  ) + 
  geom_bar(
    fill = mycolors
    ) + 
  labs(
    x = "Eye Color",
    y = "Frequency (n)",
    title = "Bar-chart of Eye Colors",
    subtitle = "(of Star Wars characters)",
    caption = "My little work of art!!"
    ) +
  theme(legend.position = "none")

These colors are from this source but see also this source. Colors can be customized by generating your own palettes via the Color Brewer here. But don’t get carried away: Remember to read the materials on choosing colors wisely, particularly the point about qualitative palettes, divergent palettes, and then palettes that work well even with colorblind audiences.
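
Matching colors to categories by position is fragile, since one category added or dropped shifts every color. A possible alternative, sketched here on a small made-up data frame rather than the full starwars data, is to give scale_fill_manual() a named vector so each category is matched to its color by name. The names eyes, eye_palette, and p_manual are hypothetical.

```r
library(ggplot2)

# Toy data: a hypothetical handful of eye colors, not the starwars data
data.frame(
  eye_color = c("blue", "blue", "brown", "yellow", "yellow", "yellow")
  ) -> eyes

# Names are the values in the data; entries are the R colors to draw them with
c(
  "blue" = "blue", "brown" = "brown", "yellow" = "gold"
  ) -> eye_palette

ggplot(
  data = eyes,
  mapping = aes(
    x = eye_color,
    fill = eye_color
    )
  ) +
  geom_bar() +
  scale_fill_manual(values = eye_palette) +
  theme(legend.position = "none") -> p_manual

p_manual
```

Because the match is by name, reordering the entries in eye_palette does not change which bar gets which color.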

1.6 Selected color palettes

I had mentioned the existence of a number of color palettes so let us look at a few, but we will do this with a different variable. First up, the Pastel1 palette.

ggplot(
  data = starwars,
  mapping = aes(
    x = gender
    )
  ) + 
  geom_bar(
    aes(fill = gender)
    ) + 
  labs(
    x = "Gender",
    y = "Frequency",
    title = "Bar-chart of Gender",
    subtitle = "(of Star Wars characters)",
    caption = "(Source: The dplyr package)") +
  scale_fill_brewer(
    palette = "Pastel1"
    )

Not bad but doesn’t work too well here. How about trying another palette, Set1?

ggplot(
  data = starwars,
  mapping = aes(
    x = gender
    )
  ) + 
  geom_bar(
    aes(fill = gender)
    ) + 
  labs(
    x = "Gender",
    y = "Frequency",
    title = "Bar-chart of Gender",
    subtitle = "(of Star Wars characters)",
    caption = "(Source: The dplyr package)") +
  scale_fill_brewer(
    palette = "Set1"
    )

Nice! But what is also noticeable here is that there are some characters in the data-set whose gender data is missing. These are the NA values. By default, NA values show up in some types of charts, so it is usually a good idea to exclude them from the chart. Here is one way of doing that.

ggplot(
  data = subset(starwars, !is.na(gender)),
  mapping = aes(
    x = gender
    )
  ) + 
  geom_bar(
    aes(fill = gender)
    ) + 
  labs(
    x = "Gender",
    y = "Frequency",
    title = "Bar-chart of Gender",
    subtitle = "(of Star Wars characters)",
    caption = "(Source: The dplyr package)") +
  scale_fill_brewer(
    palette = "Set1"
    )

Notice what is different here: data = subset(starwars, !is.na(gender)) and that this command is effectively saying subset the starwars data to only include those cases where gender is not missing (this is the !is.na() portion of the command).

Another way to do the same thing would have been to use filter() and create a cleaned up copy of the data. If you take this route, be careful not to overwrite the original data-set; note how I am giving a new name (my.data) after filter() to save the results in. Then we lean on this data-set via data = my.data.

starwars %>%
  filter(!is.na(gender)) -> my.data

ggplot(
  data = my.data,
  mapping = aes(
    x = gender
    )
  ) + 
  geom_bar(
    aes(fill = gender)
    ) + 
  labs(
    x = "Gender",
    y = "Frequency",
    title = "Bar-chart of Gender",
    subtitle = "(of Star Wars characters)",
    caption = "(Source: The dplyr package)") +
  scale_fill_brewer(
    palette = "Set1"
    )
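
Before dropping rows it helps to know how many are affected. The short check below is only a sketch; n_missing and n_kept are names I made up, and the exact counts depend on the version of {dplyr} that supplies starwars, so none are quoted here.

```r
library(dplyr)

# How many characters have no recorded gender?
sum(is.na(starwars$gender)) -> n_missing

# How many rows survive the filter() call used above?
starwars %>%
  filter(!is.na(gender)) %>%
  nrow() -> n_kept

n_missing
n_kept

# The two pieces should add back up to the full data-set
n_missing + n_kept == nrow(starwars)
```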

There is one color palette you should remember, and this is the {viridis} color scheme that works around varying types of color blindness in the population. Here come the palettes:

ggplot(
  data = my.data,
  mapping = aes(
    x = gender
    )
  ) + 
  geom_bar(
    aes(fill = gender)
    ) + 
  labs(
    x = "Gender",
    y = "Frequency",
    title = "Bar-chart of Gender",
    subtitle = "(of Star Wars characters)",
    caption = "(Source: The dplyr package)") +
  scale_fill_viridis_d(
    option = "viridis"
    )

ggplot(
  data = my.data,
  mapping = aes(
    x = gender
    )
  ) + 
  geom_bar(
    aes(fill = gender)
    ) + 
  labs(
    x = "Gender",
    y = "Frequency",
    title = "Bar-chart of Gender",
    subtitle = "(of Star Wars characters)",
    caption = "(Source: The dplyr package)") +
  scale_fill_viridis_d(
    option = "magma"
    )

ggplot(
  data = my.data,
  mapping = aes(
    x = gender
    )
  ) + 
  geom_bar(
    aes(fill = gender)
    ) + 
  labs(
    x = "Gender",
    y = "Frequency",
    title = "Bar-chart of Gender",
    subtitle = "(of Star Wars characters)",
    caption = "(Source: The dplyr package)") +
  scale_fill_viridis_d(
    option = "plasma"
    )

ggplot(
  data = my.data,
  mapping = aes(
    x = gender
    )
  ) + 
  geom_bar(
    aes(fill = gender)
    ) + 
  labs(
    x = "Gender",
    y = "Frequency",
    title = "Bar-chart of Gender",
    subtitle = "(of Star Wars characters)",
    caption = "(Source: The dplyr package)") +
  scale_fill_viridis_d(
    option = "cividis"
    )

1.7 Themes with {ggthemes}

One can also lean on various plotting themes as shown below. These themes mimic the style of graphics popularized by some data visualization experts (e.g., Stephen Few, Edward Tufte), news-media houses (Fivethirtyeight, The Economist, The Wall Street Journal), some software packages (Excel, Stata, Google Docs), and a few others. Below I show you just a handful.

library(ggthemes)

ggplot(
  data = starwars,
  mapping = aes(
    x = eye_color
    )
  ) +
  geom_bar() +
  theme_tufte() 

ggplot(
  data = starwars,
  mapping = aes(
    x = eye_color
    )
  ) +
  geom_bar() +
  theme_economist() 

ggplot(
  data = starwars,
  mapping = aes(
    x = eye_color
    )
  ) +
  geom_bar() +
  theme_fivethirtyeight() 

Later on you will learn these & other ways to build advanced visualizations. For now we get to work more with ggplot2.

1.8 More with bar-charts

I want to show a few things with bar-charts now. First, we can specify things a bit differently without altering the result. For example, compare the following two pieces of code.

ggplot(
  data = movies,
  mapping = aes(
    x = mpaa
    )
  ) + 
  geom_bar()

ggplot() + 
  geom_bar(
    data = movies,
    mapping = aes(x = mpaa)
    )

Notice that we moved the data = and the aes() pieces of the code from ggplot() into geom_bar() but that made no difference; this is important to bear in mind because it will come in handy down the road when we need to build some advanced visualizations.

The plot is sub-optimal since MPAA ratings are missing for a lot of movies; these records should be eliminated from the plot via subset(movies, mpaa != "") or by running dplyr’s filter() to create another data-set. I will lean on filter().

movies %>% 
  filter(mpaa != "") -> movies2 

ggplot() + 
  geom_bar(
    data = movies2,
    mapping = aes(x = mpaa)
    )

The order of the bars here is fortuitous in that it goes from the smallest frequency to the highest frequency, drawing the reader’s eye. I said fortuitous because {ggplot2} defaults to drawing the bars in an ascending alphabetic/alphanumeric order if the variable is a character. See below for an example.

df = tibble(x = c(rep("A", 2), rep("B", 4), rep("C", 1)))

ggplot() + 
  geom_bar(
    data = df, 
    mapping = aes(x = x)
  )

Notice the bars here do not follow in ascending/descending order of frequencies. Later on we’ll learn how to order the bars with ascending/descending frequencies or by some other logic.
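
As a preview, here is one common way to order the bars by frequency. It is only a sketch and not necessarily the approach we will use later; it assumes the {forcats} package is installed (it comes with the tidyverse), and it rebuilds the toy data as a plain data frame so the example runs on its own.

```r
library(ggplot2)
library(forcats)   # assumption: installed alongside the tidyverse

data.frame(
  x = c(rep("A", 2), rep("B", 4), rep("C", 1))
  ) -> df

# fct_infreq() reorders the categories from most to least frequent
ggplot() +
  geom_bar(
    data = df,
    mapping = aes(x = fct_infreq(x))
    ) +
  labs(x = "x") -> p_ordered

p_ordered
```

Wrapping the call as fct_rev(fct_infreq(x)) would reverse the order so that the bars run from least to most frequent.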

What about plotting relative frequencies on the y-axis rather than the frequencies?

library(scales)

ggplot() + 
  geom_bar(
    data = movies2,
    mapping = aes(
      x = mpaa,
      y = (..count..)/sum(..count..)
      )
    ) + 
  scale_y_continuous(labels = percent) + 
  labs(
    x = "MPAA Rating",
    y  = "Relative Frequency (%)"
    ) 

Note the addition of

  • y = (..count..)/sum(..count..) to change the y-axis to reflect the relative frequency as a proportion, and
  • scale_y_continuous(labels = percent) to then multiply these proportions by 100 to get percentages as the labels rather than 0.2, 0.4, 0.6, etc.
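
A note on notation: the ..count.. style still works, but ggplot2 3.3.0 introduced after_stat() for the same purpose and later versions flag the double-dot form as deprecated. The sketch below shows the newer spelling on a small made-up data frame (df and p_relfreq are hypothetical names), so it does not need the movies data.

```r
library(ggplot2)
library(scales)

data.frame(
  x = c(rep("A", 2), rep("B", 4), rep("C", 1))
  ) -> df

ggplot() +
  geom_bar(
    data = df,
    mapping = aes(
      x = x,
      # same idea as (..count..)/sum(..count..), newer spelling
      y = after_stat(count / sum(count))
      )
    ) +
  scale_y_continuous(labels = percent) +
  labs(
    x = "x",
    y = "Relative Frequency (%)"
    ) -> p_relfreq

p_relfreq
```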

1.9 Disaggregating bar-charts for groups

Let us build a simple bar-chart with the hsb2 data we saw in Module 01. Here we first download it, label the values, save it, and then start charting it.

read.table(
  'https://stats.idre.ucla.edu/stat/data/hsb2.csv',
  header = TRUE,
  sep = ","
  ) -> hsb2

factor(hsb2$female,
       levels = c(0, 1),
       labels = c("Male", "Female")
       ) -> hsb2$female 

factor(hsb2$race,
       levels = c(1:4),
       labels = c("Hispanic", "Asian", "African American", "White")
       ) -> hsb2$race

factor(hsb2$ses,
       levels = c(1:3),
       labels = c("Low", "Middle", "High")
       ) -> hsb2$ses

factor(hsb2$schtyp,
       levels = c(1:2),
       labels = c("Public", "Private")
       ) -> hsb2$schtyp

factor(hsb2$prog,
       levels = c(1:3),
       labels = c("General", "Academic", "Vocational")
       ) -> hsb2$prog

save(
  hsb2, file = here::here("data", "hsb2.RData")
  )

ggplot() + 
  geom_bar(
    data = hsb2,
    mapping = aes(x = ses)) +
  labs(x = "Socioeconomic Status")

Okay, fair enough. But what if I wanted to see how socioeconomic status varies across male and female students?

ggplot() + 
  geom_bar(
    data = hsb2,
    mapping = aes(
      x = ses,
      group = female, 
      fill = female
      )
    ) +
  labs(
    x = "Socioeconomic Status",
    y = "Frequency"
  )

This is not very useful since the viewer has to estimate the relative sizes of the two colors within any given bar. That can be fixed with position = "dodge", juxtaposing the bars for the groups as a result, and the end product is much better. But note: position = "dodge" has to be put outside the aes() but still inside geom_bar() so be careful.

ggplot() + 
  geom_bar(
    data = hsb2,
    mapping = aes(
      x = ses,
      group = female, 
      fill = female
      ),
    position = "dodge"
    ) +
  labs(
    x = "Socioeconomic Status",
    y = "Frequency"
  )

What if you wanted to calculate percentages within each sex? That is, what percent of male students fall within a particular ses category, and the same thing for female students?

ggplot() + 
  geom_bar(
    data = hsb2, 
    aes(
      x = ses, 
      group = female,
      fill = female, 
      y = ..prop..
      ),
    position = "dodge") +
  scale_y_continuous(labels = scales::percent) + 
  labs(
    x = "Socioeconomic Status",
    y = "Relative Frequency (%)"
    )
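
It is worth checking percentages like these against a plain table. The sketch below uses a tiny made-up data frame (toy, with hypothetical values standing in for hsb2) and base R's prop.table(); the margin argument decides whether proportions are computed within columns or within rows.

```r
# Toy data standing in for hsb2 (hypothetical values, for illustration only)
data.frame(
  female = c("Male", "Male", "Male", "Female", "Female", "Female", "Female"),
  ses    = c("Low", "Middle", "High", "Low", "Middle", "Middle", "High")
  ) -> toy

table(toy$ses, toy$female) -> counts

# margin = 2: proportions within each column (sex), as in the chart above
prop.table(counts, margin = 2) -> within_sex

# margin = 1: proportions within each row (ses category)
prop.table(counts, margin = 1) -> within_ses

round(within_sex, 2)
round(within_ses, 2)
```

Each column of within_sex sums to 1, and each row of within_ses sums to 1, which is a quick way to confirm which denominator a chart is using.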

What about within each ses instead of within gender? That is, what if we wanted percent of Low ses that is Male versus Female, and so on?

ggplot() + 
  geom_bar(
    data = hsb2, 
    aes(
      x = female, 
      group = ses,
      fill = ses, 
      y = ..prop..
      ),
    position = "dodge") +
  scale_y_continuous(labels = scales::percent) + 
  labs(
    x = "Sex",
    y = "Relative Frequency (%)"
    )

There is some more we will do with bar-charts but for now let us set them aside and instead look at a few other charts – histograms, box-plots, and line-charts.

1.10 Histograms

If you’ve forgotten what these are, see histogram, or Yau’s piece here and here. There is a short video available as well.

For histograms in ggplot2, geom_histogram() is the geometry needed but note that the default number of bins is not very useful and can be tweaked, along with other embellishments that are possible as well.

ggplot() + 
  geom_histogram(
    data = hsb2,
    aes(x = read), 
    fill = "cornflowerblue",
    color = "white"
    ) + 
  labs(
    title = "Histogram of Reading Scores",
    x = "Reading Score",
    y = "Frequency"
    )

Note the message stat_bin() using bins = 30. Pick better value with binwidth. This appears because numerical variables need to be grouped in order to have meaningful histograms we can make sense of, and we did not say how to group them. How do you define the bins (aka the groups)? We could set bins = 5 and we could also experiment with binwidth =. Let us do bins = 5, which tells ggplot2 to give us 5 groups and to work out the group boundaries itself.

ggplot() + 
  geom_histogram(
    data = hsb2,
    aes(x = read), 
    fill = "cornflowerblue",
    color = "white",
    bins = 5
    ) + 
  labs(
    title = "Histogram of Reading Scores",
    subtitle = "(with bins = )",
    x = "Reading Score",
    y = "Frequency"
    )

If we wanted more/fewer bins we could tweak it up or down as needed. What about binwidth? This will specify how wide each group must be.

ggplot() + 
  geom_histogram(
    data = hsb2,
    aes(x = read), 
    fill = "cornflowerblue",
    color = "white",
    binwidth = 5
    ) + 
  labs(
    title = "Histogram of Reading Scores",
    subtitle = "(with binwidth = )",
    x = "Reading Score",
    y = "Frequency"
    )
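
If you are curious which bins were actually used, layer_data() returns the numbers behind a layer. The sketch below runs on made-up reading scores (toy_scores is hypothetical, not hsb2) and pulls out each bin's lower edge, upper edge, and count.

```r
library(ggplot2)

# Hypothetical reading scores, used only so the example runs on its own
data.frame(
  read = c(28, 34, 39, 42, 44, 47, 47, 50, 52, 55, 57, 60, 63, 68, 76)
  ) -> toy_scores

ggplot() +
  geom_histogram(
    data = toy_scores,
    aes(x = read),
    binwidth = 5
    ) -> p_hist

# One row per bin: lower edge, upper edge, and the number of scores in it
layer_data(p_hist)[, c("xmin", "xmax", "count")] -> bins_used

bins_used
```

Swapping binwidth = 5 for bins = 5 and rerunning shows how the two arguments carve up the same scores differently.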

If we wanted to disaggregate the histogram by one or more categorical variables, we could do so quite easily:

ggplot() + 
  geom_histogram(
    data = hsb2,
    aes(x = read), 
    fill = "cornflowerblue",
    color = "white",
    bins = 5
    ) + 
  labs(
    title = "Histogram of Reading Scores",
    subtitle = "(broken out for Male vs. Female students)",
    x = "Reading Score",
    y = "Frequency"
    ) +
  facet_wrap(~ female)

When we do this, it is often useful to organize them so that only one histogram shows up in a row. This is done with the ncol = 1 argument to facet_wrap().

ggplot() + 
  geom_histogram(
    data = hsb2,
    aes(x = read), 
    fill = "cornflowerblue",
    color = "white",
    bins = 5
    ) + 
  labs(
    title = "Histogram of Reading Scores",
    subtitle = "(broken out for Male vs. Female students)",
    x = "Reading Score",
    y = "Frequency"
    ) +
  facet_wrap(~ female, ncol = 1)

ggplot() + 
  geom_histogram(
    data = hsb2,
    aes(x = read), 
    fill = "cornflowerblue",
    color = "white",
    bins = 5
    ) + 
  labs(
    title = "Histogram of Reading Scores",
    subtitle = "(broken out by Socioeconomic Status)",
    x = "Reading Score",
    y = "Frequency"
    ) +
  facet_wrap(~ ses, ncol = 1)

Now the distributions are stacked one above the other, easing comparisons: do they have the same average? Do they vary the same? Are they similarly skewed/symmetric?

For breakouts with two categorical variables we could do

ggplot() + 
  geom_histogram(
    data = hsb2,
    aes(x = read), 
    fill = "cornflowerblue",
    color = "white",
    bins = 5
    ) + 
  labs(
    title = "Histogram of Reading Scores",
    subtitle = "(broken out by Socioeconomic Status and School Type)",
    x = "Reading Score",
    y = "Frequency"
    ) +
  facet_wrap(ses ~ schtyp, ncol = 2)

Note that ses ~ schtyp renders the panels for the first category of ses by all categories of schtyp and then repeats for the other categories in rows 2 and 3. If we did facet_wrap(schtyp ~ ses, ncol = 3) we would have a different result:

ggplot() + 
  geom_histogram(
    data = hsb2,
    aes(x = read), 
    fill = "cornflowerblue",
    color = "white",
    bins = 5
    ) + 
  labs(
    title = "Histogram of Reading Scores",
    subtitle = "(broken out by Socioeconomic Status and School Type)",
    x = "Reading Score",
    y = "Frequency"
    ) +
  facet_wrap(schtyp ~ ses, ncol = 3) +
  ylim(c(0, 23))

Notice that here I also add a ylim(c(...)) command to set the minimum and maximum values of the y-axis. This is useful, and I suggest you always set the y limit to start at 0, or else make a note in the plot for readers so they don’t assume it starts at 0 when in fact it has been truncated for ease of data presentation. A truncated axis misstates the pattern in the data; do not do it, or, again, annotate the plot to that effect so nobody is misled. Bar-charts and histograms will have 0 as the minimum y-limit by default but this is not true for some other plots.
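
For plots that do not start at 0 on their own, one option is expand_limits(y = 0), which stretches the axis to include 0 without fixing the upper end. The sketch below uses a made-up series (toy_series is hypothetical) whose values sit well above zero.

```r
library(ggplot2)

# Hypothetical series whose values sit far above zero
data.frame(
  year  = 2011:2016,
  value = c(52, 54, 53, 57, 58, 60)
  ) -> toy_series

ggplot() +
  geom_line(
    data = toy_series,
    mapping = aes(x = year, y = value)
    ) +
  # make sure 0 is part of the y-axis; the maximum is still chosen from the data
  expand_limits(y = 0) -> p_zero

p_zero
```

Dropping the expand_limits() line and redrawing shows how much steeper the same trend looks on a truncated axis.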

1.11 Box-plots

Remember these, our friends from MPA 6010? These can be useful for studying the distribution of a continuous variable. See this video. Let us see these in action with the cmhflights data.

load(
  here::here("data", "cmhflights_01092017.RData")
  )

ggplot() + 
  geom_boxplot(
    data = cmhflights,
    mapping = aes(
      y = ArrDelay,
      x = ""
      ),
    fill = "cornflowerblue"
    ) 

Note:

  • the x = "" is in aes() because otherwise with a single group the box-plot will not build up nicely

But I prefer to see them running horizontally, so how can I do that? With coord_flip(), since this just flips the x and y axes.

ggplot() + 
  geom_boxplot(
    data = cmhflights,
    mapping = aes(
      y = ArrDelay,
      x = ""
      ),
    fill = "cornflowerblue"
    ) +
  coord_flip()
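
In ggplot2 3.3.0 and later there is another route to a horizontal box-plot: put the numeric variable on x directly and the geom works out the orientation. The sketch below assumes that version or newer and uses made-up delays (toy_delays is hypothetical, not the cmhflights data).

```r
library(ggplot2)

# Hypothetical arrival delays in minutes (not the cmhflights data)
data.frame(
  ArrDelay = c(-12, -5, -3, 0, 2, 4, 7, 15, 22, 48, 95)
  ) -> toy_delays

# Requires ggplot2 3.3.0 or later: the numeric variable goes on x directly
ggplot() +
  geom_boxplot(
    data = toy_delays,
    mapping = aes(
      x = ArrDelay,
      y = ""
      ),
    fill = "cornflowerblue"
    ) -> p_horizontal

p_horizontal
```

Either route gives the same picture; coord_flip() has the advantage of working in older versions of ggplot2 as well.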

And now for a slightly different data-set, one that measures male adults’ hemoglobin concentration for a few populations.

read_csv(
  "http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter02/chap02e3cHumanHemoglobinElevation.csv"
  ) -> hemoglobin

ggplot() +
  geom_boxplot(
    data = hemoglobin,
    mapping = aes(
      x = population,
      y = hemoglobin,
      fill = population
      )
    ) +
  coord_flip() +
  labs(
    x = "Population",
    y = "Hemoglobin Concentration",
    title = "Hemoglobin Concentration in Adult Males",
    subtitle = "(Andes, Ethiopia, Tibet, USA)"
    ) +
  theme(legend.position = "none")

Notice the need for no legend with fill = population. Notice also how fill = is inside aes(...) here because we are asking that each unique value seen in a variable called population be mapped to a unique color.
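
The inside-versus-outside distinction is easy to see side by side. In this sketch, built on made-up measurements (toy_groups is hypothetical), the first plot maps fill to a variable inside aes() and the second sets fill to a single constant color outside aes().

```r
library(ggplot2)

# Hypothetical measurements for two made-up groups
data.frame(
  population = c("A", "A", "A", "B", "B", "B"),
  value      = c(10, 12, 15, 18, 20, 25)
  ) -> toy_groups

# Inside aes(): fill is mapped to the variable, one color per group
ggplot() +
  geom_boxplot(
    data = toy_groups,
    mapping = aes(x = population, y = value, fill = population)
    ) -> p_mapped

# Outside aes(): fill is a constant, every box gets the same color
ggplot() +
  geom_boxplot(
    data = toy_groups,
    mapping = aes(x = population, y = value),
    fill = "cornflowerblue"
    ) -> p_constant

p_mapped
p_constant
```

Only the first plot produces a legend, because only there does a color carry information about the data.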

Could we use our facet_wrap(...) here too? Of course.

ggplot() + 
  geom_boxplot(
    data = cmhflights,
    mapping = aes(
      y = ArrDelay,
      x = Carrier
      ),
    fill = "cornflowerblue"
    ) +
  coord_flip() +
  facet_wrap(~ Month)

1.12 Line-charts

If we have data over time for one or more units, then line-charts work really well to exhibit trends. A classic, current example would be the number of confirmed COVID-19 cases per country per date. For example, say we have data on unemployment for the country; the uempmed column used here records the median duration of unemployment in weeks. The economics data-set ships with {ggplot2}, and loading the {plotly} library attaches {ggplot2} as well, so make sure {plotly} is installed before loading it.

library(plotly)
data(economics)
names(economics)
## [1] "date"     "pce"      "pop"      "psavert"  "uempmed"  "unemploy"

ggplot() +
  geom_line(
    data = economics, 
    mapping = aes(
      x = date,
      y = uempmed
      )
    ) + 
  labs(
    x = "Date",
    y = "Median Duration of Unemployment (weeks)"
  )

They can look very plain and aesthetically unappealing unless you dress them up. See the one below and then the one that follows.

load(
  here::here("data", "gap.df.RData")
  )

ggplot() +
  geom_line(
    data = gap.df, 
    mapping = aes(
      x = year,
      y = LifeExp,
      group = continent,
      color = continent
      )
    ) + 
  geom_point(
    data = gap.df, 
    mapping = aes(
      x = year,
      y = LifeExp,
      group = continent,
      color = continent
      )
    ) + 
  labs(
    x = "Year",
    y = "Median Life Expectancy (in years)",
    color = ""
  ) + 
  theme(legend.position = "bottom") 
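
When two layers share the same data and aesthetics, as the line and point layers do here, they can be specified once in ggplot() and inherited by every geom. This is a sketch on made-up numbers (toy_gap and its region column are hypothetical, not gap.df).

```r
library(ggplot2)

# Hypothetical life-expectancy figures for two made-up regions
data.frame(
  year    = rep(c(2000, 2005, 2010), times = 2),
  region  = rep(c("North", "South"), each = 3),
  LifeExp = c(70, 72, 74, 60, 63, 67)
  ) -> toy_gap

# data and mapping are given once in ggplot(); both layers inherit them
ggplot(
  data = toy_gap,
  mapping = aes(
    x = year,
    y = LifeExp,
    group = region,
    color = region
    )
  ) +
  geom_line() +
  geom_point() +
  labs(
    x = "Year",
    y = "Life Expectancy (in years)",
    color = ""
    ) +
  theme(legend.position = "bottom") -> p_shared

p_shared
```

Specifying things per layer, as in the code above this note, remains the better choice when the layers draw on different data-sets.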

1.13 Scatter-plots

These work well if we have two or more continuous variables, and work well to highlight the nature and strength of a relationship between the two variables … what happens to y as x increases?

ggplot() + 
  geom_point(
    data = hsb2, 
    mapping = aes(
      x = write,
      y = science
      )
    ) +
  labs(
    x = "Writing Scores", 
    y = "Science Scores"
    ) 

We could highlight the different ses groups, to see if there is any difference in the relationship between writing scores and science scores by the different ses levels.

ggplot() +
  geom_point(
    data = hsb2,
    mapping = aes(
      x = write,
      y = science,
      color = ses
      )
    ) + 
  labs(
    x = "Writing Scores", 
    y = "Science Scores",
    color = ""
    ) + 
  theme(legend.position = "bottom") 

This is not very helpful so why not break out ses for ease of interpretation?

ggplot() +
  geom_point(
    data = hsb2,
    mapping = aes(
      x = write,
      y = science
      )
    ) + 
  labs(
    x = "Writing Scores", 
    y = "Science Scores"
    ) + 
  facet_wrap(~ ses) 

Could we add another layer, perhaps female?

ggplot() +
  geom_point(
    data = hsb2,
    mapping = aes(
      x = write,
      y = science
      )
    ) + 
  labs(
    x = "Writing Scores", 
    y = "Science Scores"
    ) + 
  facet_wrap(ses ~ female, ncol = 2) 
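
With two faceting variables, facet_grid() is a possible alternative to facet_wrap(): it puts one variable on the rows and the other on the columns, so each is labelled only once. The sketch below uses made-up scores (toy_hsb is hypothetical, not hsb2).

```r
library(ggplot2)

# Hypothetical scores covering every ses-by-sex combination
data.frame(
  ses     = rep(c("Low", "High"), each = 4),
  female  = rep(c("Male", "Female"), times = 4),
  write   = c(40, 45, 52, 49, 55, 60, 62, 65),
  science = c(42, 44, 50, 47, 58, 57, 63, 66)
  ) -> toy_hsb

ggplot() +
  geom_point(
    data = toy_hsb,
    mapping = aes(x = write, y = science)
    ) +
  labs(
    x = "Writing Scores",
    y = "Science Scores"
    ) +
  # rows = ses categories, columns = Male/Female
  facet_grid(ses ~ female) -> p_grid

p_grid
```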

And finally, a few suggestions about how to build up your visualizations:

  • start with pencil and paper, sketch prototypes of desired visualization(s)
  • graphics are relatively easy to generate with base R & with ggplot2
  • common-sense: number & type of variable(s) guide plotting
  • stay color conscious: sensible colors & sensitive to color blindness
  • experiment, experiment, experiment until you are happy
  • use the free learning resources available online
  • if you learn something new in R, write it down

2 Practice Exercises

2.1 Nobel Prize Winners

Georgios Karamanis gathered and shared data on Nobel prize winners over the years, with a fair amount of detail, and these data were used in the tidytuesday series a while back. These data are to be used for the questions that follow.

readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-05-14/nobel_winners.csv") -> nobel_winners 

variable               class       description
prize_year             double      Year that Nobel Prize was awarded
category               character   Field of study/category
prize                  character   Prize Name
motivation             character   Motivation of the award
prize_share            character   Share eg 1 of 1, 1 of 2, 1 of 4, etc
laureate_id            double      ID assigned to each winner
laureate_type          character   Individual or organization
full_name              character   name of the winner
birth_date             double      birth date of winner
birth_city             character   birth city/state of winner
birth_country          character   birth country of winner
gender                 character   binary gender of the winner
organization_name      character   organization name
organization_city      character   organization city
organization_country   character   organization country
death_date             double      death date of the winner (if dead)
death_city             character   death city (if dead)
death_country          character   death country (if dead)

  1. First create nobel.df that keeps only records starting in the year 1960, and only for the "Physics" category. Now generate an appropriate chart that shows the distribution of winners by birth_country.

  2. Now break this distribution out by gender to see how winners by country differ across gender.

  3. Now go back to nobel_winners, the full data-set, and create a simple plot that shows the distribution of prize winners by death_country, gender, and category.

2.2 Water levels in the Great Lakes

Download the monthly Great Lakes water level data-set in SPSS format from here or in Excel format from here. Note that water level is in meters.

You may use the following command to read in the Excel file:

library(readxl)
url <- "https://aniruhil.github.io/avsr/teaching/dataviz/greatlakes.xlsx"
destfile <- "greatlakes.xlsx"
curl::curl_download(url, destfile)
read_excel(destfile, col_types = c("date", 
     "numeric", "numeric", "numeric", "numeric", 
     "numeric")) -> greatlakes 

Now use an appropriate chart to show the water level for Lake Superior.

2.3 County Health Rankings

Download the 2017 County Health Rankings data in SPSS format from here or in Excel format from here, along with the accompanying codebook.

These data can be downloaded with the code provided below:

library(readxl)
url <- "https://aniruhil.github.io/avsr/teaching/dataviz/CountyHealthRankings2017.xlsx"
destfile <- "CountyHealthRankings2017.xlsx"
curl::curl_download(url, destfile)
read_excel(destfile) -> chr.df 

Construct appropriate plots that show the relationship between the following pairs of variables:

  1. Adult obesity and High school graduation

  2. Children in poverty and High school graduation

  3. Preventable hospital stays and Unemployment rate

2.4 Unemployment Rates

Use the unemployment data given to you (unemprate.RData) and construct appropriate plots that show the distribution of unemployment rates across years for each of the four educational attainment groups.

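# load() restores the saved object(s) under the name(s) they were saved with;
# its return value is only a character vector of those names, so do not assign it
load(
  here::here("data", "unemprate.RData")
  )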

Be sure to use a unique color for each educational attainment group.
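
A note on load(), since it trips up many people: it restores objects under the names they were saved with and returns only those names. The sketch below shows this with a throwaway file; demo_rates and its values are made up for the illustration and have nothing to do with unemprate.RData.

```r
# Save a small made-up object under the name `demo_rates` to a temporary file
data.frame(year = 2018:2019, rate = c(3.9, 3.7)) -> demo_rates
tempfile(fileext = ".RData") -> tmp
save(demo_rates, file = tmp)
rm(demo_rates)

# load() returns the NAMES of the restored objects, not the objects themselves
load(tmp) -> loaded_names
loaded_names

# The data are back under their original name; get() fetches them by that name
get(loaded_names[1]) -> restored
restored
```

So after loading an .RData file, running ls() or looking at the returned names tells you what the data-set is called.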

---
title: "MPA 5830 - Module 05"
subtitle: "Spring 2020"
author: "Professor Ruhil"
date: "Updated on `r Sys.Date()`"
output: 
  html_document: 
    code_download: yes
    fig_caption: yes
    highlight: zenburn
    number_sections: yes
    theme: flatly
    toc: yes
    toc_float: yes
editor_options: 
  chunk_output_type: console
---

<style type="text/css">

body{ /* Normal  */
/*    font-family: Lato, sans-serif;  
      font-family: Mukta, sans-serif; 
      font-family: 'Nunito Sans', sans-serif;
      font-family: Karla, sans-serif;  */
      font-family: 'Merriweather Sans', sans-serif; 
      font-size: 18px;
  }

h1.title {
  font-size: 38px;
  color: DarkRed;
}

h1 { /* Header 1 */
  font-size: 28px;
  color: DarkBlue;
}

h2 { /* Header 2 */
    font-size: 22px;
  color: DarkBlue;
}

h3 { /* Header 3 */
  font-size: 18px;
  color: DarkBlue;
}

code.r{ /* Code block */
    font-family: Mukta, sans-serif; 
    font-weight: 600;  
    font-size: 16px;
}

/* pre {  Code block - determines code spacing between lines
    font-size: 16px;
} */
</style>


```{r klippy, echo = FALSE, include = TRUE}
klippy::klippy(tooltip_message = 'Click to copy', tooltip_success = 'Done', color = 'cornflowerblue', position = c('top', 'right'))
```

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE, dpi = 300, cache = TRUE, fig.align = "center", fig.width = 7, fig.height = 5, out.width = "100%", highlight = TRUE) 
```

Our goal in this module is to understand some basic ways of visualizing data with graphics -- bar-charts, histograms, box-plots, and so on. We will skip `base R` commands and instead just work with `ggplot2`, the most popular visualization package in the `{tidyverse}` universe as of now. 

If you remember MPA 6010, recall the usual options... 

- one qualitative/categorical variable: `bar-chart` 
- one quantitative/continuous variable: `histogram/box-plot/area-chart`  
- two quantitative/continuous variables: `scatter-plot/hex-bin` 

We can then ratchet up as we need to. 

I will use two data-sets to walk through the initial examples in this module, the first being this [IMDB data-set](http://imdb.com/):

> The Internet Movie Database, http://imdb.com/, is a website devoted to collecting movie data supplied by studios and fans. It claims to be the biggest movie database on the web and is run by Amazon. More information about imdb.com can be found online, http://imdb.com/help/show_leaf?about, including information about the data collection process, http://imdb.com/help/show_leaf?infosource.

```{r movies}
library(ggplot2movies)
data(movies)
names(movies)
```

A data frame with 28819 rows and 24 variables 

| Variable | Description |
| :-- | :-- |
| title | Title of the movie |
| year | Year of release |
| budget | Total budget (if known) in US dollars |
| length | Length in minutes |
| rating | Average IMDB user rating |
| votes | Number of IMDB users who rated this movie |
| r1-10 | Multiplying by ten gives percentile (to nearest 10%) of users who rated this movie a 1 (for `r1`), a 2 (for `r2`), and so on up to a 10 (for `r10`) |
| mpaa | MPAA rating (missing for a lot of movies) |
| Action, Animation, Comedy, Drama, Documentary, Romance, Short | Binary variables representing if movie was classified as belonging to that genre |


The second data-set is the [Star Wars dataset](https://swapi.co), a `tibble` with 87 rows and 13 variables:

```{r starwars}
library(tidyverse)
data(starwars)
names(starwars)
# str(starwars)
```

| Variable | Description |
| :-- | :-- |
| name | Name of the character |
| height | Height (cm) |
| mass | Weight (kg) |
| hair_color, skin_color, eye_color | Hair, skin, and eye colors |
| birth_year | Year born (BBY = Before Battle of Yavin) |
| gender | male, female, hermaphrodite, or none |
| homeworld | Name of homeworld |
| species | Name of species |
| films | List of films the character appeared in |
| vehicles | List of vehicles the character has piloted |
| starships | List of starships the character has piloted |


# `ggplot2` and the [grammar of graphics](http://vita.had.co.nz/papers/layered-grammar.html)

The `{ggplot2}` package has a special syntax and I will point out things you should note as we move through this module. First up, the library is called `ggplot2` but the command starts with `ggplot` so don't let that throw you off-track. 

Second, you need to have a data-set to work with. In the code below I start by loading the library and then specifying the data-set to be used. 

```{r gg000}
library(ggplot2)
ggplot(
  data = starwars
  )
```

Nothing is drawn beyond a blank canvas because we have not yet specified what should go on the x-axis or what should go on the y-axis. Well, let us do that then by asking for the column `eye_color` to be put on the x-axis. 

```{r gg001}
ggplot(
  data = starwars,
  mapping = aes(
    x = eye_color
    )
  ) 
```

## `geom_bar()` ... the bar-chart 
This results in a gray canvas with the eye colors on the x-axis but nothing else has been drawn since we have not specified the `geometry` ... do you want a bar-chart? histogram? dot-plot? line-chart? This is a categorical variable and hence a bar-chart would be appropriate. We call for a bar-chart with the `geom_bar()` command. 

```{r gg002a}
ggplot(
  data = starwars,
  mapping = aes(
    x = eye_color
    )
  ) +
  geom_bar()
```

## `color` versus `fill` 
The `aes()` refers to the **aesthetics** of the chart, and many other `aesthetics` can be added, such as `group`, `color`, `fill`, `size`, `alpha`, etc. We will see some of these in due course but for now I want to focus on two of these, both involving coloring of the `geom_`. Specifically, there are two commands for adding colors -- (1) `color` or `colour`, and (2) `fill` -- to a chart. 

```{r col1a}
ggplot(
  data = starwars,
  mapping = aes(
    x = eye_color,
    color = eye_color
    )
  ) +
  geom_bar()
```

Note what the `color = eye_color` command did ... it drew a colored border for the bars, and an accompanying legend. What if we had used `fill = eye_color` instead?  

```{r col1b}
ggplot(
  data = starwars,
  mapping = aes(
    x = eye_color,
    fill = eye_color
    )
  ) +
  geom_bar()
```

Aha! Now the bars are filled with colors and an accompanying legend is drawn as well. So `fill =` and `color =` behave very differently, bear this in mind. 
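One more case worth a quick sketch: what if you want every bar to have the same color? Then `fill` and `color` go outside `aes()`, as plain arguments to `geom_bar()`. The color names used here are just my picks for illustration, and `p_fixed` is a made-up name for the saved plot.

```{r sketch-fixed-colors}
library(dplyr)    # provides the starwars data
library(ggplot2)

# Constant colors sit outside aes(): every bar gets the same fill and border,
# and no legend is drawn because nothing is mapped to a variable
ggplot(
  data = starwars,
  mapping = aes(
    x = eye_color
    )
  ) +
  geom_bar(
    fill = "cornflowerblue",
    color = "black"
    ) -> p_fixed

p_fixed
```

So the rule of thumb is: inside `aes()` when the color should depend on a column in the data, outside `aes()` when it is a single fixed color.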

## Adding labels with `labs` 
One of the nice things about this software environment is that there are plenty of coloring schemes available to us and we will play with some of these shortly, but before we do that, let us look at one more improvement -- adding titles, subtitles, captions, and axis labels to our chart. This is done with the `labs()` command.

```{r col1b2}
ggplot(
  data = starwars,
  mapping = aes(
    x = eye_color,
    fill = eye_color
    )
  ) +
  geom_bar() +
  labs(
    x = "Eye Color",
    y = "Frequency (n)",
    title = "Bar-chart of Eye Colors",
    subtitle = "(of Star Wars characters)",
    caption = "My little work of art!!"
    )
```

Notice the text that now appears as a result of what has been specified in the `labs()` command. 

## Controlling the chart legend with `theme()`
In this bar-chart, do we really need the legend? No, because the colors and color names show up in the chart itself. How can we hide the legend? Turns out there is a neat command that will allow you to move the legend around and even to hide it. 

```{r col1c}
ggplot(
  data = starwars,
  mapping = aes(
    x = eye_color,
    fill = eye_color
    )
  ) +
  geom_bar() +
  labs(
    x = "Eye Color",
    y = "Frequency (n)",
    title = "Bar-chart of Eye Colors",
    subtitle = "(of Star Wars characters)",
    caption = "My little work of art!!"
    ) +
  theme(legend.position = "none")
```

Voila! The legend is gone. Instead of "none" you could have specified "bottom", "left", "top", "right" to place the legend in a particular direction.
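Here is a minimal sketch of moving the legend rather than hiding it. I switch to `gender` because a legend is more believable there, and I add a legend title via `labs(fill = ...)`; the name `p_legend` is just a label I made up for the saved plot.

```{r sketch-legend-bottom}
library(dplyr)    # provides the starwars data
library(ggplot2)

ggplot(
  data = starwars,
  mapping = aes(
    x = gender,
    fill = gender
    )
  ) +
  geom_bar() +
  labs(
    x = "Gender",
    y = "Frequency (n)",
    fill = "Gender"                      # legend title
    ) +
  theme(legend.position = "bottom") -> p_legend   # try "top", "left", "right"

p_legend
```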

## Customizing colors 
Of course, it would be good to have the colors match the eye-color so let us do that next. The way we can do this is by calling specific colors by name. I have tried to order the lineup of the colors to match, as closely as I can, the eye colors. 

```{r col3}
c(
  "black", "blue", "slategray", "brown", "gray34", "gold",
  "greenyellow", "navajowhite1", "orange", "pink", "red",
  "magenta", "thistle3", "white", "yellow"
  ) -> mycolors

ggplot(
  data = starwars,
  mapping = aes(
    x = eye_color
    )
  ) + 
  geom_bar(
    fill = mycolors
    ) + 
  labs(
    x = "Eye Color",
    y = "Frequency (n)",
    title = "Bar-chart of Eye Colors",
    subtitle = "(of Star Wars characters)",
    caption = "My little work of art!!"
    ) +
  theme(legend.position = "none")
```

These colors are from [this source](http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf) but see also [this source](https://www.nceas.ucsb.edu/~frazier/RSpatialGuides/colorPaletteCheatsheet.pdf). Colors can be customized by generating your own palettes via the [Color Brewer here](http://colorbrewer2.org/#type=sequential&scheme=YlGnBu&n=3). But don't get carried away: Remember to read the materials on choosing colors wisely, particularly the point about qualitative palettes, divergent palettes, and then palettes that work well even with colorblind audiences.
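Lining colors up by position, as I did with `mycolors`, is fragile because one extra category shifts everything. A sketch of a sturdier route, assuming you are willing to name each category explicitly, is a named vector passed to `scale_fill_manual()`. To keep it short I only keep three eye colors; `eye_palette`, `eyes.df`, and `p_manual` are names I made up for this example.

```{r sketch-manual-colors}
library(dplyr)
library(ggplot2)

# Names are the eye colors found in the data, values are R color names
c(blue = "blue", brown = "brown", yellow = "gold") -> eye_palette

# Keep only the characters whose eye color appears in the palette
starwars %>%
  filter(eye_color %in% names(eye_palette)) -> eyes.df

ggplot(
  data = eyes.df,
  mapping = aes(
    x = eye_color,
    fill = eye_color
    )
  ) +
  geom_bar() +
  scale_fill_manual(values = eye_palette) +   # matched by name, not by position
  theme(legend.position = "none") -> p_manual

p_manual
```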

## Selected color palettes
I had mentioned the existence of a number of color palettes so let us look at a few, but we will do this with a different variable. First up, the `Pastel1` palette. 

```{r colfillb1}
ggplot(
  data = starwars,
  mapping = aes(
    x = gender
    )
  ) + 
  geom_bar(
    aes(fill = gender)
    ) + 
  labs(
    x = "Gender",
    y = "Frequency",
    title = "Bar-chart of Gender",
    subtitle = "(of Star Wars characters)",
    caption = "(Source: The dplyr package)") +
  scale_fill_brewer(
    palette = "Pastel1"
    )
```

Not bad but doesn't work too well here. How about trying another palette, `Set1`?

```{r colfillb2}
ggplot(
  data = starwars,
  mapping = aes(
    x = gender
    )
  ) + 
  geom_bar(
    aes(fill = gender)
    ) + 
  labs(
    x = "Gender",
    y = "Frequency",
    title = "Bar-chart of Gender",
    subtitle = "(of Star Wars characters)",
    caption = "(Source: The dplyr package)") +
  scale_fill_brewer(
    palette = "Set1"
    )
```

Nice! But what is also noticeable here is that there are some characters in the data-set whose gender data is missing. These are the `NA` values. By default, you will see `NA` values showing up in some types of charts and so it is always good to exclude them from the chart. Here is one way of doing that. 

```{r colfillb3}
ggplot(
  data = subset(starwars, !is.na(gender)),
  mapping = aes(
    x = gender
    )
  ) + 
  geom_bar(
    aes(fill = gender)
    ) + 
  labs(
    x = "Gender",
    y = "Frequency",
    title = "Bar-chart of Gender",
    subtitle = "(of Star Wars characters)",
    caption = "(Source: The dplyr package)") +
  scale_fill_brewer(
    palette = "Set1"
    )
```

Notice what is different here: `data = subset(starwars, !is.na(gender))` and that this command is effectively saying subset the starwars data to only include those cases where gender is not missing (this is the `!is.na()` portion of the command). 

Another way to do the same thing would have been to use `filter()` and create a cleaned up copy of the data. If you take this route, be careful not to overwrite the original data-set; note how I am giving a new name `(my.data)` after `filter()` to save the results in. Then we lean on this data-set via `data = my.data`.

```{r colfillb4}
starwars %>%
  filter(!is.na(gender)) -> my.data

ggplot(
  data = my.data,
  mapping = aes(
    x = gender
    )
  ) + 
  geom_bar(
    aes(fill = gender)
    ) + 
  labs(
    x = "Gender",
    y = "Frequency",
    title = "Bar-chart of Gender",
    subtitle = "(of Star Wars characters)",
    caption = "(Source: The dplyr package)") +
  scale_fill_brewer(
    palette = "Set1"
    )
```

There is one color palette you should remember, and this is the `{viridis}` color scheme that works around varying types of color blindness in the population. Here come the palettes:

```{r colfillv1}
ggplot(
  data = my.data,
  mapping = aes(
    x = gender
    )
  ) + 
  geom_bar(
    aes(fill = gender)
    ) + 
  labs(
    x = "Gender",
    y = "Frequency",
    title = "Bar-chart of Gender",
    subtitle = "(of Star Wars characters)",
    caption = "(Source: The dplyr package)") +
  scale_fill_viridis_d(
    option = "viridis"
    )
```

```{r colfillv2}
ggplot(
  data = my.data,
  mapping = aes(
    x = gender
    )
  ) + 
  geom_bar(
    aes(fill = gender)
    ) + 
  labs(
    x = "Gender",
    y = "Frequency",
    title = "Bar-chart of Gender",
    subtitle = "(of Star Wars characters)",
    caption = "(Source: The dplyr package)") +
  scale_fill_viridis_d(
    option = "magma"
    )
```

```{r colfillv3}
ggplot(
  data = my.data,
  mapping = aes(
    x = gender
    )
  ) + 
  geom_bar(
    aes(fill = gender)
    ) + 
  labs(
    x = "Gender",
    y = "Frequency",
    title = "Bar-chart of Gender",
    subtitle = "(of Star Wars characters)",
    caption = "(Source: The dplyr package)") +
  scale_fill_viridis_d(
    option = "plasma"
    )
```

```{r colfillv4}
ggplot(
  data = my.data,
  mapping = aes(
    x = gender
    )
  ) + 
  geom_bar(
    aes(fill = gender)
    ) + 
  labs(
    x = "Gender",
    y = "Frequency",
    title = "Bar-chart of Gender",
    subtitle = "(of Star Wars characters)",
    caption = "(Source: The dplyr package)") +
  scale_fill_viridis_d(
    option = "cividis"
    )
```

## Themes with `{ggthemes}`
One can also lean on various plotting themes as shown below. These themes mimic the style of graphics popularized by some data visualization experts (e.g., Stephen Few, Edward Tufte), news-media houses (Fivethirtyeight, The Economist, The Wall Street Journal), some software packages (Excel, Stata, Google docs), and a few others. Below I show you just a handful.  

```{r ggt1}
library(ggthemes)

ggplot(
  data = starwars,
  mapping = aes(
    x = eye_color
    )
  ) +
  geom_bar() +
  theme_tufte() 
```

```{r ggt2}
ggplot(
  data = starwars,
  mapping = aes(
    x = eye_color
    )
  ) +
  geom_bar() +
  theme_economist() 
```

```{r ggt3}
ggplot(
  data = starwars,
  mapping = aes(
    x = eye_color
    )
  ) +
  geom_bar() +
  theme_fivethirtyeight() 
```
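If you would rather not install an extra package, `{ggplot2}` itself ships with a few themes, for example `theme_minimal()` and `theme_bw()`. A minimal sketch, with `p_min` being a name I made up for the saved plot:

```{r sketch-builtin-theme}
library(dplyr)    # provides the starwars data
library(ggplot2)

ggplot(
  data = starwars,
  mapping = aes(
    x = eye_color
    )
  ) +
  geom_bar() +
  theme_minimal() -> p_min   # swap in theme_bw() to compare

p_min
```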

Later on you will learn these & other ways to build advanced visualizations. For now we get to work more with `ggplot2`. 

## More with bar-charts

I want to show a few things with bar-charts now. First, we can specify things a bit differently without altering the result. For example, compare the following two pieces of code. 

```{r bar01}
ggplot(
  data = movies,
  mapping = aes(
    x = mpaa
    )
  ) + 
  geom_bar()
```

```{r bar02}
ggplot() + 
  geom_bar(
    data = movies,
    mapping = aes(x = mpaa)
    )
```

Notice that we switched the `data =` and the `aes()` pieces of the code but that made no difference; this is important to bear in mind because it will come in handy down the road when we need to build some advanced visualizations. 
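To give you a feel for why this matters, here is a small sketch where each layer gets its own data-set: all characters in gray, and the droids highlighted on top. This is only an illustration; `droids.df` and `p_layers` are names I made up, and you will see a warning about rows with missing height or mass being dropped.

```{r sketch-two-layers}
library(dplyr)
library(ggplot2)

# A second, smaller data-set: just the droids
starwars %>%
  filter(species == "Droid") -> droids.df

ggplot() +
  geom_point(                      # layer 1: everybody, in gray
    data = starwars,
    mapping = aes(x = height, y = mass),
    color = "gray60"
    ) +
  geom_point(                      # layer 2: droids only, drawn on top
    data = droids.df,
    mapping = aes(x = height, y = mass),
    color = "firebrick",
    size = 3
    ) +
  labs(
    x = "Height (cm)",
    y = "Weight (kg)"
    ) -> p_layers

p_layers
```

Had `data =` been fixed inside `ggplot()`, the second layer would have needed its own `data =` anyway, so putting the data inside each `geom_` keeps things explicit.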

The MPAA bar-chart above is sub-optimal since MPAA ratings are missing for a lot of movies; these should be eliminated from the plot via `subset(movies, mpaa != "")` or by running dplyr's `filter()` to create another data-set. I will lean on `filter()`. 

```{r bar2}
movies %>% 
  filter(mpaa != "") -> movies2 

ggplot() + 
  geom_bar(
    data = movies2,
    mapping = aes(x = mpaa)
    )
```

The order of the bars here is fortuitous in that it goes from the smallest frequency to the highest frequency, drawing the reader's eye. I said fortuitous because `{ggplot2}` defaults to drawing the bars in an ascending alphabetic/alphanumeric order if the variable is a **character**. See below for an example. 

```{r bar3}
df = tibble(x = c(rep("A", 2), rep("B", 4), rep("C", 1)))

ggplot() + 
  geom_bar(
    data = df, 
    mapping = aes(x = x)
  )
```

Notice the bars here do not follow in ascending/descending order of frequencies. Later on we'll learn how to order the bars with ascending/descending frequencies or by some other logic. 
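As a small preview, and assuming you are happy to type out the order by hand, one way to control the order is to convert the character column into a factor with the levels listed in the order you want. The column name `x_ordered` and the plot name `p_order` are made up for this sketch.

```{r sketch-bar-order}
library(tibble)
library(ggplot2)

tibble(x = c(rep("A", 2), rep("B", 4), rep("C", 1))) -> df

# Levels are listed in the order the bars should appear
factor(
  df$x,
  levels = c("B", "A", "C")
  ) -> df$x_ordered

ggplot() +
  geom_bar(
    data = df,
    mapping = aes(x = x_ordered)
    ) -> p_order

p_order
```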

What about plotting `relative frequencies` on the y-axis rather than the frequencies? 

```{r bar4}
library(scales)

ggplot() + 
  geom_bar(
    data = movies2,
    mapping = aes(
      x = mpaa,
      y = (..count..)/sum(..count..)
      )
    ) + 
  scale_y_continuous(labels = percent) + 
  labs(
    x = "MPAA Rating",
    y  = "Relative Frequency (%)"
    ) 
```

Note the addition of 

+ `y = (..count..)/sum(..count..)` to change the y-axis to reflect the relative frequency as a proportion, and  
+ `scale_y_continuous(labels = percent)` to then multiply these proportions by 100 to get percentages as the labels rather than 0.2, 0.4, 0.6, etc. 
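A note if you are on a newer `{ggplot2}` (version 3.3.0 or later, as far as I know): the `..count..` notation has a newer spelling, `after_stat(count)`. Here is a minimal sketch of the same idea, shown with the Star Wars data to keep it short; `p_rel` is a made-up name, and `scales::label_percent()` assumes a reasonably recent `{scales}`.

```{r sketch-after-stat}
library(dplyr)    # provides the starwars data
library(ggplot2)

ggplot() +
  geom_bar(
    data = starwars,
    mapping = aes(
      x = gender,
      y = after_stat(count / sum(count))   # same idea as (..count..)/sum(..count..)
      )
    ) +
  scale_y_continuous(labels = scales::label_percent()) +
  labs(
    x = "Gender",
    y = "Relative Frequency (%)"
    ) -> p_rel

p_rel
```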

## Disaggregating bar-charts for groups 
Let us build a simple bar-chart with the `hsb2` data we saw in Module 01. Here we first download it, label the values, save it, and then start charting it.  

```{r hsb2-datain}
read.table(
  'https://stats.idre.ucla.edu/stat/data/hsb2.csv',
  header = TRUE,
  sep = ","
  ) -> hsb2

factor(hsb2$female,
       levels = c(0, 1),
       labels = c("Male", "Female")
       ) -> hsb2$female 

factor(hsb2$race,
       levels = c(1:4),
       labels = c("Hispanic", "Asian", "African American", "White")
       ) -> hsb2$race

factor(hsb2$ses,
       levels = c(1:3),
       labels = c("Low", "Middle", "High")
       ) -> hsb2$ses

factor(hsb2$schtyp,
       levels = c(1:2),
       labels = c("Public", "Private")
       ) -> hsb2$schtyp

factor(hsb2$prog,
       levels = c(1:3),
       labels = c("General", "Academic", "Vocational")
       ) -> hsb2$prog

save(
  hsb2, file = here::here("data", "hsb2.RData")
  )
```

```{r bar5-base}
ggplot() + 
  geom_bar(
    data = hsb2,
    mapping = aes(x = ses)) +
  labs(x = "Socioeconomic Status")
```

Okay, fair enough. But what if I wanted to see how socioeconomic status varies across male and female students?  

```{r bar5}
ggplot() + 
  geom_bar(
    data = hsb2,
    mapping = aes(
      x = ses,
      group = female, 
      fill = female
      )
    ) +
  labs(
    x = "Socioeconomic Status",
    y = "Frequency"
  )
```

This is not very useful since the viewer has to estimate the relative sizes of the two colors within any given bar. That can be fixed with `position = "dodge"`, juxtaposing the bars for the groups as a result, and the end product is much better. But note: `position = "dodge"` has to be put outside the `aes()` but still inside `geom_bar()` so be careful. 

```{r bar6}
ggplot() + 
  geom_bar(
    data = hsb2,
    mapping = aes(
      x = ses,
      group = female, 
      fill = female
      ),
    position = "dodge"
    ) +
  labs(
    x = "Socioeconomic Status",
    y = "Frequency"
  )
```

What if you wanted to calculate percentages within each sex? That is, what percent of male students fall within a particular ses category, and the same thing for female students? 

```{r bar7}
ggplot() + 
  geom_bar(
    data = hsb2, 
    aes(
      x = ses, 
      group = female,
      fill = female, 
      y = ..prop..
      ),
    position = "dodge") +
  scale_y_continuous(labels = scales::percent) + 
  labs(
    x = "Socioeconomic Status",
    y = "Relative Frequency (%)"
    )
```
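If you want to convince yourself that `..prop..` with `group = female` really computes percentages within each sex, you can do the arithmetic by hand. The sketch below uses a tiny made-up data-set (`toy.df`, 10 invented students, not the real `hsb2`) and the newer `after_stat(prop)` spelling, which assumes `{ggplot2}` 3.3.0 or later.

```{r sketch-prop-check}
library(dplyr)
library(ggplot2)

# Made-up stand-in for hsb2: 4 male and 6 female students
tibble(
  female = c(rep("Male", 4), rep("Female", 6)),
  ses = c(
    "Low", "Low", "Middle", "High",
    "Low", "Middle", "Middle", "Middle", "High", "High"
    )
  ) -> toy.df

# Proportions within each sex, computed by hand
toy.df %>%
  count(female, ses) %>%
  group_by(female) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup() -> toy.props

toy.props

# The same proportions, computed by geom_bar() within each group
ggplot() +
  geom_bar(
    data = toy.df,
    aes(
      x = ses,
      group = female,
      fill = female,
      y = after_stat(prop)
      ),
    position = "dodge"
    ) +
  scale_y_continuous(labels = scales::percent) -> p_prop

p_prop
```

In the made-up data 2 of the 4 male students are Low ses, so the Male bar for Low should sit at 50%, and the bars for each sex should add up to 100%.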

What about within each ses instead of within gender? That is, what if we wanted percent of Low ses that is Male versus Female, and so on?

```{r bar8}
ggplot() + 
  geom_bar(
    data = hsb2, 
    aes(
      x = female, 
      group = ses,
      fill = ses, 
      y = ..prop..
      ),
    position = "dodge") +
  scale_y_continuous(labels = scales::percent) + 
  labs(
    x = "Sex",
    y = "Relative Frequency (%)"
    )
```

There is some more we will do with bar-charts but for now let us set them aside and instead look at a few other charts -- histograms, box-plots, and line-charts. 

## Histograms 
If you've forgotten what these are, see [histogram](http://tinlizzie.org/histograms/), or [Yau's piece here](https://flowingdata.com/2014/02/27/how-to-read-histograms-and-use-them-in-r/) and [here](https://flowingdata.com/2017/06/07/how-histograms-work/). [There is a short video available as well](https://vimeo.com/221607341). 

For histograms in ggplot2, `geom_histogram()` is the geometry needed but note that the default number of bins is not very useful and can be tweaked, along with other embellishments that are possible as well. 

```{r gg2a}
ggplot() + 
  geom_histogram(
    data = hsb2,
    aes(x = read), 
    fill = "cornflowerblue",
    color = "white"
    ) + 
  labs(
    title = "Histogram of Reading Scores",
    x = "Reading Score",
    y = "Frequency"
    )
```

Note the message that `stat_bin()` is using `bins = 30` and that you should pick a better value with `binwidth`. This is because numerical variables need to be grouped in order to have meaningful histograms we can make sense of. How do you define the bins (aka the groups)? We could set `bins = 5` and we could also experiment with `binwidth =`. Let us do `bins = 5`, which says: give us 5 groups, and go ahead and calculate the bin boundaries yourself. 

```{r gg2b}
ggplot() + 
  geom_histogram(
    data = hsb2,
    aes(x = read), 
    fill = "cornflowerblue",
    color = "white",
    bins = 5
    ) + 
  labs(
    title = "Histogram of Reading Scores",
    subtitle = "(with bins = 5)",
    x = "Reading Score",
    y = "Frequency"
    )
```

If we wanted more/fewer bins we could tweak it up or down as needed. 
What about binwidth? This will specify how wide each group must be. 

```{r gg2c}
ggplot() + 
  geom_histogram(
    data = hsb2,
    aes(x = read), 
    fill = "cornflowerblue",
    color = "white",
    binwidth = 5
    ) + 
  labs(
    title = "Histogram of Reading Scores",
    subtitle = "(with binwidth = 5)",
    x = "Reading Score",
    y = "Frequency"
    )
```
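To see how `bins` and `binwidth` relate, it can help to peek at what was actually computed. The sketch below uses character heights from the Star Wars data (so it runs without `hsb2`) and `ggplot_build()`, which exposes one row per bin. The names `heights.df`, `p_bins`, and `p_width` are made up for this example, and I leave the printed numbers for you to inspect.

```{r sketch-bins-vs-binwidth}
library(dplyr)
library(ggplot2)

# Drop characters with missing height
starwars %>%
  filter(!is.na(height)) -> heights.df

ggplot() +
  geom_histogram(
    data = heights.df,
    aes(x = height),
    bins = 10                  # ask for a number of groups
    ) -> p_bins

ggplot() +
  geom_histogram(
    data = heights.df,
    aes(x = height),
    binwidth = 20              # ask for groups that are 20 cm wide
    ) -> p_width

# One row per bin in the built plot
nrow(ggplot_build(p_bins)$data[[1]])
nrow(ggplot_build(p_width)$data[[1]])

# Rough rule: number of bins is about the range divided by the binwidth
diff(range(heights.df$height)) / 20
```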

If we wanted to disaggregate the histogram by one or more categorical variables, we could do so quite easily: 

```{r gg3}
ggplot() + 
  geom_histogram(
    data = hsb2,
    aes(x = read), 
    fill = "cornflowerblue",
    color = "white",
    bins = 5
    ) + 
  labs(
    title = "Histogram of Reading Scores",
    subtitle = "(broken out for Male vs. Female students)",
    x = "Reading Score",
    y = "Frequency"
    ) +
  facet_wrap(~ female)
```

When we do this, it is often useful to organize them so that only one histogram shows up in a row. This is done with the `ncol = 1` argument inside `facet_wrap()`. 

```{r gg4a}
ggplot() + 
  geom_histogram(
    data = hsb2,
    aes(x = read), 
    fill = "cornflowerblue",
    color = "white",
    bins = 5
    ) + 
  labs(
    title = "Histogram of Reading Scores",
    subtitle = "(broken out for Male vs. Female students)",
    x = "Reading Score",
    y = "Frequency"
    ) +
  facet_wrap(~ female, ncol = 1)
```

```{r gg4b}
ggplot() + 
  geom_histogram(
    data = hsb2,
    aes(x = read), 
    fill = "cornflowerblue",
    color = "white",
    bins = 5
    ) + 
  labs(
    title = "Histogram of Reading Scores",
    subtitle = "(broken out by Socioeconomic Status)",
    x = "Reading Score",
    y = "Frequency"
    ) +
  facet_wrap(~ ses, ncol = 1)
```

Now the distributions are stacked above each other, easing comparisons; do they have the same average? Do they vary the same? Are they similarly skewed/symmetric? 

For breakouts with two categorical variables we could do 

```{r gg5a}
ggplot() + 
  geom_histogram(
    data = hsb2,
    aes(x = read), 
    fill = "cornflowerblue",
    color = "white",
    bins = 5
    ) + 
  labs(
    title = "Histogram of Reading Scores",
    subtitle = "(broken out by Socioeconomic Status and School Type)",
    x = "Reading Score",
    y = "Frequency"
    ) +
  facet_wrap(ses ~ schtyp, ncol = 2)
```

Note that `ses ~ schtyp` renders the panels for the first category of `ses` by all categories of `schtyp` and then repeats for the other categories in rows 2 and 3. If we did `facet_wrap(schtyp ~ ses, ncol = 3)` we would have a different result:

```{r gg5b}
ggplot() + 
  geom_histogram(
    data = hsb2,
    aes(x = read), 
    fill = "cornflowerblue",
    color = "white",
    bins = 5
    ) + 
  labs(
    title = "Histogram of Reading Scores",
    subtitle = "(broken out by Socioeconomic Status and School Type)",
    x = "Reading Score",
    y = "Frequency"
    ) +
  facet_wrap(schtyp ~ ses, ncol = 3) +
  ylim(c(0, 23))
```

Notice that here I also add a `ylim(c(...))` command to set the minimum and maximum values of the y-axis. This is useful, but be careful: if you set the limits yourself, start the y-axis at 0, or else make a note in the plot so readers don't assume it starts at 0 when in fact it has been truncated for ease of data presentation. A truncated axis misstates the pattern in the data, so either avoid it or annotate the plot to that effect so nobody is misled. Bar-charts and histograms will have 0 as the minimum y-limit but this is not true for some other plots. 
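For plots that do not start at 0 on their own, one option (a sketch, not the only way) is `expand_limits(y = 0)`, which stretches the axis to include 0 without you having to know the maximum. I use the `economics` data that ships with `{ggplot2}` and its personal savings rate column, `psavert`; line-charts are covered further below, and `p_zero` is a made-up name.

```{r sketch-expand-limits}
library(ggplot2)

ggplot() +
  geom_line(
    data = economics,
    mapping = aes(
      x = date,
      y = psavert
      )
    ) +
  expand_limits(y = 0) +       # make sure 0 is included on the y-axis
  labs(
    x = "Date",
    y = "Personal Savings Rate (%)"
    ) -> p_zero

p_zero
```

Drop the `expand_limits()` line and run it again to see how much steeper the swings look when the axis is truncated.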


## Box-plots 
Remember these, our friends from MPA 6010? These can be useful for studying the distribution of a continuous variable. [See this video](https://vimeo.com/222358034). Let us see these in action with the `cmhflights` data. 

```{r box1a}
load(
  here::here("data", "cmhflights_01092017.RData")
  )

ggplot() + 
  geom_boxplot(
    data = cmhflights,
    mapping = aes(
      y = ArrDelay,
      x = ""
      ),
    fill = "cornflowerblue"
    ) 
```

Note: 

+ the `x = ""` is in `aes()` because otherwise with a single group the box-plot will not build up nicely

But I prefer to see them running horizontally, so how can I do that? With `coord_flip()`, since this just flips the x and y axes. 

```{r box1b}
ggplot() + 
  geom_boxplot(
    data = cmhflights,
    mapping = aes(
      y = ArrDelay,
      x = ""
      ),
    fill = "cornflowerblue"
    ) +
  coord_flip()
```
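An aside, assuming a reasonably recent `{ggplot2}` (version 3.3.0 or later, as far as I know): you can also get a horizontal box-plot by putting the continuous variable on `x` directly, without `coord_flip()`. A minimal sketch with character heights from the Star Wars data, since `cmhflights` lives in a local file; `p_hbox` is a made-up name.

```{r sketch-horizontal-box}
library(dplyr)    # provides the starwars data
library(ggplot2)

ggplot() +
  geom_boxplot(
    data = starwars,
    mapping = aes(
      x = height,              # continuous variable on x gives a horizontal box
      y = ""
      ),
    fill = "cornflowerblue"
    ) +
  labs(
    x = "Height (cm)",
    y = ""
    ) -> p_hbox

p_hbox
```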

And now for a slightly different data-set, one that measures male adults' hemoglobin concentration for a few populations. 

```{r box2}
read_csv(
  "http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter02/chap02e3cHumanHemoglobinElevation.csv"
  ) -> hemoglobin

ggplot() +
  geom_boxplot(
    data = hemoglobin,
    mapping = aes(
      x = population,
      y = hemoglobin,
      fill = population
      )
    ) +
  coord_flip() +
  labs(
    x = "Population",
    y = "Hemoglobin Concentration",
    title = "Hemoglobin Concentration in Adult Males",
    subtitle = "(Andes, Ethiopia, Tibet, USA)"
    ) +
  theme(legend.position = "none")
```

Notice that there is no need for a legend with `fill = population`, hence `legend.position = "none"`. Notice also how `fill = ` is inside `aes(...)` here because we are asking that each unique value seen in a variable called `population` be mapped to a unique color. 

Could we use our `facet_wrap(...)` here too? Of course. 

```{r box1c}
ggplot() + 
  geom_boxplot(
    data = cmhflights,
    mapping = aes(
      y = ArrDelay,
      x = Carrier
      ),
    fill = "cornflowerblue"
    ) +
  coord_flip() +
  facet_wrap(~ Month)
```


## Line-charts 
If we have data over time for one or more units, then line-charts work really well to exhibit trends. A classic, current example would be the number of confirmed COVID-19 cases per country per date. For example, say we have monthly data on the median duration of unemployment (in weeks) for the country. These data, `economics`, ship with the `{ggplot2}` library, which we have already loaded.

```{r line1}
# economics ships with ggplot2; uempmed = median duration of unemployment (weeks)
data(economics)
names(economics)
ggplot() +
  geom_line(
    data = economics, 
    mapping = aes(
      x = date,
      y = uempmed
      )
    ) + 
  labs(
    x = "Date",
    y = "Median Duration of Unemployment (weeks)"
  )
```

Line-charts can look very plain and aesthetically unappealing unless you dress them up. The one below adds points to the lines and uses a separate color for each continent. 

```{r line2}
load(
  here::here("data", "gap.df.RData")
  )

ggplot() +
  geom_line(
    data = gap.df, 
    mapping = aes(
      x = year,
      y = LifeExp,
      group = continent,
      color = continent
      )
    ) + 
  geom_point(
    data = gap.df, 
    mapping = aes(
      x = year,
      y = LifeExp,
      group = continent,
      color = continent
      )
    ) + 
  labs(
    x = "Year",
    y = "Median Life Expectancy (in years)",
    color = ""
  ) + 
  theme(legend.position = "bottom") 
```
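Since `gap.df` lives in a local file, here is a sketch of the same one-line-per-group idea with data anyone can load: `economics_long`, which ships with `{ggplot2}` and stacks several economic series in long format, with a `value01` column rescaled to run from 0 to 1. The name `p_lines` is made up.

```{r sketch-grouped-lines}
library(ggplot2)

ggplot() +
  geom_line(
    data = economics_long,
    mapping = aes(
      x = date,
      y = value01,
      group = variable,        # one line per series
      color = variable
      )
    ) +
  labs(
    x = "Date",
    y = "Value (rescaled to 0-1)",
    color = ""
    ) +
  theme(legend.position = "bottom") -> p_lines

p_lines
```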

## Scatter-plots 
These work well if we have two or more continuous variables, and they highlight the nature and strength of a relationship between the two variables ... what happens to `y` as `x` increases? 

```{r sc1}
ggplot() + 
  geom_point(
    data = hsb2, 
    mapping = aes(
      x = write,
      y = science
      )
    ) +
  labs(
    x = "Writing Scores", 
    y = "Science Scores"
    ) 
```

We could highlight the different `ses` groups, to see if there is any difference in the relationship between writing scores and science scores by the different ses levels. 

```{r sc2}
ggplot() +
  geom_point(
    data = hsb2,
    mapping = aes(
      x = write,
      y = science,
      color = ses
      )
    ) + 
  labs(
    x = "Writing Scores", 
    y = "Science Scores",
    color = ""
    ) + 
  theme(legend.position = "bottom") 
```

This is not very helpful so why not break out `ses` for ease of interpretation? 

```{r sc3}
ggplot() +
  geom_point(
    data = hsb2,
    mapping = aes(
      x = write,
      y = science
      )
    ) + 
  labs(
    x = "Writing Scores", 
    y = "Science Scores"
    ) + 
  facet_wrap(~ ses) 
```

Could we add another layer, perhaps `female`? 

```{r sc4}
ggplot() +
  geom_point(
    data = hsb2,
    mapping = aes(
      x = write,
      y = science
      )
    ) + 
  labs(
    x = "Writing Scores", 
    y = "Science Scores"
    ) + 
  facet_wrap(ses ~ female, ncol = 2) 
```

And finally, a few suggestions about how to build up your visualizations: 

- `r emojifont::emoji('repeat')` start with pencil and paper, sketch prototypes of desired visualization(s)
- `r emojifont::emoji('smile')` graphics are relatively easy to generate with base R & with `ggplot2` 
- `r emojifont::emoji('clap')` common-sense: `number` & `type` of variable(s) guide plotting 
- `r emojifont::emoji('sparkler')` stay `color conscious`: sensible colors & sensitive to color blindness
- `r emojifont::emoji('beginner')` experiment, experiment, experiment until you are happy 
- use the `r emojifont::emoji('free')` learning resources available online 
- `r emojifont::emoji('ledger')` if you learn something new in R, write it down 

----------

# Practice Exercises 

## Nobel Prize Winners 
Georgios Karamanis gathered and shared data on Nobel prize winners over the years, with a fair amount of detail, and these data were used in the `tidytuesday` series a while back. These data are to be used for the questions that follow. 

```{r nobel-winners}
readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-05-14/nobel_winners.csv") -> nobel_winners 
```

|variable             |class     |description |
|:---|:---|:-----------|
|prize_year           |double    | Year that Nobel Prize was awarded|
|category             |character | Field of study/category|
|prize                |character | Prize Name |
|motivation           |character | Motivation of the award |
|prize_share          |character | Share eg 1 of 1, 1 of 2, 1 of 4, etc |
|laureate_id          |double    | ID assigned to each winner |
|laureate_type        |character | Individual or organization  |
|full_name            |character | name of the winner|
|birth_date           |double    | birth date of winner |
|birth_city           |character | birth city/state of winner |
|birth_country        |character | birth country of winner |
|gender               |character | binary gender of the winner |
|organization_name    |character | organization name |
|organization_city    |character | organization city |
|organization_country |character | organization country |
|death_date           |double    | death date of the winner (if dead) |
|death_city           |character | death city (if dead) |
|death_country        |character | death country (if dead) |


(a) First create `nobel.df` that keeps only records starting in the year 1960, and only for the "Physics" category. Now generate an appropriate chart that shows the distribution of winners by `birth_country` 

(b) Now break this distribution out by `gender` to see how winners by country differs across gender 

(c) Now go back to `nobel_winners`, the full data-set, and create a simple plot that shows the distribution of prize winners by `death_country`, `gender`, and `category`. 
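If you are unsure how to start, here is a minimal sketch of the filter-then-plot pattern. It runs on a tiny made-up data frame, `toy_nobel`, that borrows only the column names from the table above; the rows and the country labels are invented for illustration, and the bar-chart with facets is just one reasonable choice, not the only one.

```r
library(dplyr)
library(ggplot2)

# Made-up rows for illustration only; column names mirror nobel_winners
toy_nobel <- data.frame(
  prize_year    = c(1955, 1963, 1965, 1970, 1988),
  category      = c("Physics", "Physics", "Physics", "Chemistry", "Physics"),
  birth_country = c("Country A", "Country B", "Country A", "Country C", "Country A"),
  gender        = c("Male", "Female", "Male", "Male", "Male")
)

# Keep Physics prizes awarded in 1960 or later
toy_nobel %>%
  filter(prize_year >= 1960, category == "Physics") -> toy.df

# Horizontal bar-chart of counts by birth country, one panel per gender
ggplot(toy.df, aes(y = birth_country)) +
  geom_bar() +
  facet_wrap(~ gender) +
  labs(x = "Number of winners", y = "Birth country") -> p_nobel

p_nobel
```

Swapping `toy_nobel` for `nobel_winners` should carry the same pattern over to the real data; for part (c) you will need to decide which variable goes on the axis and which ones become facets or fill colors.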


## Water levels in the Great Lakes

Download the monthly Great Lakes water level data-set [SPSS format from here](https://aniruhil.github.io/avsr/teaching/dataviz/greatlakes.sav) and [Excel format from here](https://aniruhil.github.io/avsr/teaching/dataviz/greatlakes.xlsx). *Note that water level is in meters.* 

You may use the following commands to download and read in the Excel file: 

```{r, eval=FALSE}
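library(readxl)
# Download the Excel file to the working directory before reading it
url <- "https://aniruhil.github.io/avsr/teaching/dataviz/greatlakes.xlsx"
destfile <- "greatlakes.xlsx"
curl::curl_download(url, destfile)
# Read the first column as a date and the remaining five columns as numeric
read_excel(destfile, col_types = c("date", 
     "numeric", "numeric", "numeric", "numeric", 
     "numeric")) -> greatlakes 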
```

Now use an appropriate chart to show the water level for Lake Superior. 
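As a hint, a series measured monthly over time is usually shown with a line chart. The sketch below uses a made-up data frame, `toy_lakes`, with hypothetical column names `date` and `superior`; the water levels are invented, so check `names(greatlakes)` for the real column names before adapting it.

```r
library(ggplot2)

# Made-up monthly values for illustration; `date` and `superior` are assumed names
toy_lakes <- data.frame(
  date     = seq(as.Date("2000-01-01"), by = "month", length.out = 24),
  superior = 183 + 0.2 * sin(seq_len(24) * pi / 6)
)

# Line chart of water level over time
ggplot(toy_lakes, aes(x = date, y = superior)) +
  geom_line() +
  labs(x = "Month", y = "Water level (meters)") -> p_lake

p_lake
```

Note that `read_excel()` with a `"date"` column type returns date-times rather than plain dates; `ggplot2` picks a suitable date-time axis for those automatically.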

## County Health Rankings
Download the 2017 County Health Rankings data [SPSS format from here](https://aniruhil.github.io/avsr/teaching/dataviz/CountyHealthRankings2017.sav), [Excel format from here](https://aniruhil.github.io/avsr/teaching/dataviz/CountyHealthRankings2017.xlsx) and the [accompanying codebook](http://www.countyhealthrankings.org/sites/default/files/2017TrendsDocumentation.pdf). 

These data can be downloaded with the code provided below: 

```{r county-health-rankings}
library(readxl)
url <- "https://aniruhil.github.io/avsr/teaching/dataviz/CountyHealthRankings2017.xlsx"
destfile <- "CountyHealthRankings2017.xlsx"
curl::curl_download(url, destfile)
read_excel(destfile) -> chr.df 
```

Construct appropriate plots that show the relationship between the following pairs of variables: 

(a) Adult obesity and High school graduation 

(b) Children in poverty and High school graduation 

(c) Preventable hospital stays and Unemployment rate 
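Since each pair involves two quantitative variables, a scatter-plot is the natural starting point. Here is a minimal sketch on simulated data; `toy_chr` and its columns `obesity` and `hs_grad` are hypothetical stand-ins, so look up the actual variable names in `chr.df` and the codebook before adapting it.

```r
library(ggplot2)

# Simulated values for illustration; column names are assumptions, not those in chr.df
set.seed(42)
toy_chr <- data.frame(obesity = runif(50, min = 20, max = 40))
toy_chr$hs_grad <- 95 - 0.4 * toy_chr$obesity + rnorm(50, sd = 3)

# Scatter-plot with a fitted straight line to summarize the relationship
ggplot(toy_chr, aes(x = obesity, y = hs_grad)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Adult obesity (%)", y = "High school graduation (%)") -> p_chr

p_chr
```

The fitted line is optional; with several thousand counties you may also want to try `geom_hex()` to deal with over-plotting.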


## Unemployment Rates
Use the unemployment data given to you (`unemprate.RData`) and construct appropriate plots that show the distribution of unemployment rates across years for each of the four educational attainment groups. 

```{r}
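# load() restores the saved object(s) into your workspace under their saved names;
# the value it returns (stored here in `urate`) is only a character vector of those names
load(
  here::here("data", "unemprate.RData")
  ) -> urate 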
```

Be sure to use a unique color for each educational attainment group.
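One way to get a distinct color per group is to map the grouping variable to `fill` (or `color`) inside `aes()`. The sketch below assumes the data are in long format, with one row per year and group; `toy_urate` and its columns `year`, `educ`, and `rate` are hypothetical, the values are simulated, and the group labels are placeholders.

```r
library(ggplot2)

# Simulated long-format data: 10 years x 4 placeholder groups (names are assumptions)
set.seed(1)
toy_urate <- expand.grid(
  year = 2000:2009,
  educ = c("Group 1", "Group 2", "Group 3", "Group 4"),
  stringsAsFactors = FALSE
)
toy_urate$rate <- 8 - 1.5 * match(toy_urate$educ, unique(toy_urate$educ)) +
  rnorm(nrow(toy_urate), sd = 0.5)

# One box-plot per group, each filled with its own color
ggplot(toy_urate, aes(x = educ, y = rate, fill = educ)) +
  geom_boxplot() +
  scale_fill_brewer(palette = "Set2") +
  labs(x = "Educational attainment", y = "Unemployment rate (%)",
       fill = "Group") -> p_urate

p_urate
```

If the real data turn out to be in wide format (one column per group), reshape them with `tidyr::pivot_longer()` first. A histogram with `facet_wrap(~ educ)` would be a reasonable alternative to the box-plots.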

