Visualizing Data for Analytics

Ani Ruhil

Agenda

This week we learn how to visualize data

package of choice here is {ggplot2}
built on the grammar of graphics philosophy
the basis for elegant yet highly customized plots
can be extended and even animated with ease


what is "a" grammar of graphics?

Think about it as a way of building up any graphic layer by layer

(1) starts with the data you are going to use

(2) then comes the aesthetics -- what goes on which axis? How are colors to be assigned? Are there groups? Is there a size consideration here? What else?

(3) is some scaling necessary? (show the proportions as percentages, or convert frequencies to proportions, and so on?)

(4) what geometry should be used -- bars? lines, scatterplots? box-plots? geographic maps? something else?

(5) should some statistic be displayed, such as means, standard errors, confidence/prediction intervals, etc?

(6) What about faceting -- should there be separate panels for some groups?

These (and other) layers are baked into the structure of the {ggplot2} package
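
To make the layering concrete, here is a minimal sketch that touches each of the six layers once. It uses R's built-in mtcars data purely for illustration (my assumption, not one of this module's data-sets), and the particular scale, geometry, statistic, and facet are arbitrary choices.

```r
library(ggplot2)

p <- ggplot(
  data = mtcars,                     # (1) the data
  mapping = aes(x = wt, y = mpg)     # (2) aesthetics: what goes on which axis
  ) +
  scale_y_continuous(                # (3) scaling: choose the y-axis breaks
    breaks = seq(10, 35, by = 5)
    ) +
  geom_point() +                     # (4) geometry: a scatter-plot
  stat_smooth(                       # (5) statistic: linear fit with a confidence band
    method = "lm", formula = y ~ x
    ) +
  facet_wrap(~ am)                   # (6) faceting: one panel per transmission type

p
```

Each `+` adds one layer, so any of the lines can be dropped or swapped without touching the others.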


I will use two data-sets to walk through the initial examples in this module, the first being this IMDB data-set

The internet movie database, http://imdb.com/, is a website devoted to collecting movie data supplied by studios and fans. It claims to be the biggest movie database on the web and is run by Amazon. More information about imdb.com can be found online, http://imdb.com/help/show_leaf?about, including information about the data collection process, http://imdb.com/help/show_leaf?infosource.

library(ggplot2movies)
data(movies)
names(movies)

##  [1] "title"       "year"        "length"      "budget"      "rating"     
##  [6] "votes"       "r1"          "r2"          "r3"          "r4"         
## [11] "r5"          "r6"          "r7"          "r8"          "r9"         
## [16] "r10"         "mpaa"        "Action"      "Animation"   "Comedy"     
## [21] "Drama"       "Documentary" "Romance"     "Short"

A data frame with 28819 rows and 24 variables

Variable	Description
title	Title of the movie
year	Year of release
budget	Total budget (if known) in US dollars
length	Length in minutes
rating	Average IMDB user rating
votes	Number of IMDB users who rated this movie
r1-10	Multiplying by ten gives percentile (to nearest 10%) of users who rated this movie a 1 (r1) through a 10 (r10)
mpaa	MPAA rating (missing for a lot of movies)
Action, Animation, Comedy, Drama, Documentary, Romance, Short	Binary variables representing if movie was classified as belonging to that genre


The second data-set is the Star Wars dataset, a tibble with 87 rows and 13 variables:

library(tidyverse)
data(starwars)
names(starwars)

##  [1] "name"       "height"     "mass"       "hair_color" "skin_color"
##  [6] "eye_color"  "birth_year" "gender"     "homeworld"  "species"   
## [11] "films"      "vehicles"   "starships"

head(starwars)

## # A tibble: 6 x 13
##   name  height  mass hair_color skin_color eye_color birth_year gender homeworld
##   <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>  <chr>    
## 1 Luke…    172    77 blond      fair       blue            19   male   Tatooine 
## 2 C-3PO    167    75 <NA>       gold       yellow         112   <NA>   Tatooine 
## 3 R2-D2     96    32 <NA>       white, bl… red             33   <NA>   Naboo    
## 4 Dart…    202   136 none       white      yellow          41.9 male   Tatooine 
## 5 Leia…    150    49 brown      light      brown           19   female Alderaan 
## 6 Owen…    178   120 brown, gr… light      blue            52   male   Tatooine 
## # … with 4 more variables: species <chr>, films <list>, vehicles <list>,
## #   starships <list>

Variable	Description
name	Name of the character
height	Height (cm)
mass	Weight (kg)
hair_color, skin_color, eye_color	Hair, skin, and eye colors
birth_year	Year born (BBY = Before Battle of Yavin)
gender	male, female, hermaphrodite, or none
homeworld	Name of homeworld
species	Name of species
films	List of films the character appeared in
vehicles	List of vehicles the character has piloted
starships	List of starships the character has piloted


`ggplot2` and the grammar of graphics

The {ggplot2} package has a special syntax and I will point out things you should note as we move through this module. First up, the library is called ggplot2 but the command starts with ggplot so don't let that throw you off-track.

Second, you need to have a data-set to work with. In the code below I start by loading the library and then specifying the data-set to be used.

library(ggplot2)
ggplot(data = starwars)


Nothing is drawn by these commands (you only get an empty canvas) because we have not yet specified what should go on the x-axis and what should go on the y-axis. Well, let us do that then by asking for the column eye_color to be put on the x-axis.

ggplot(
  data = starwars,
  mapping = aes(
    x = eye_color
    )
  )


`geom_bar()` ... the bar-chart

This results in a gray canvas with the eye colors on the x-axis but nothing else has been drawn since we have not specified the geometry ... do you want a bar-chart? histogram? dot-plot? line-chart? This is a categorical variable and hence a bar-chart would be appropriate. We call for a bar-chart with the geom_bar() command.

ggplot(
  data = starwars,
  mapping = aes(
    x = eye_color
    )
  ) +
  geom_bar()
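
Under the hood, geom_bar() counts the rows in each category and draws one bar per count. Here is a small check of that idea, assuming the starwars data that ships with {dplyr}; layer_data() is the ggplot2 helper for inspecting what a layer computed.

```r
library(ggplot2)
library(dplyr)   # provides the starwars data and count()

p <- ggplot(data = starwars, mapping = aes(x = eye_color)) +
  geom_bar()

# Bar heights that geom_bar() computed
bar_counts <- layer_data(p, 1)$count

# The same frequencies computed by hand
hand_counts <- starwars %>% count(eye_color)
hand_counts
```

If the two sets of counts agree, the chart is showing exactly the frequency table you would have built yourself.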


`color` versus `fill`

The aes() refers to the aesthetics of the chart, and many other aesthetics can be added, such as group, color, fill, size, alpha, etc. We will see some of these in due course but for now I want to focus on two of these, both involving coloring of the geom_. Specifically, there are two commands for adding colors -- (1) color or colour, and (2) fill -- to a chart.

ggplot(
  data = starwars,
  mapping = aes(
    x = eye_color,
    color = eye_color
    )
  ) +
  geom_bar()

Note what the color = eye_color command did ... it drew a colored border for the bars, and an accompanying legend. What if we had used fill = eye_color instead?


using `fill =`

ggplot(
  data = starwars,
  mapping = aes(
    x = eye_color,
    fill = eye_color
    )
  ) +
  geom_bar()

Aha! Now the bars are filled with colors and an accompanying legend is drawn as well. So fill = and color = behave very differently; bear this in mind.
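
A related distinction that often trips people up: a color placed inside aes() is mapped from a column (and earns a legend), while a color placed outside aes() is a fixed setting applied to every bar. A short sketch, where the color names steelblue and black are arbitrary choices of mine:

```r
library(ggplot2)
library(dplyr)   # provides the starwars data

# Setting: one fixed fill and border for all bars, no legend
p_set <- ggplot(data = starwars, mapping = aes(x = eye_color)) +
  geom_bar(fill = "steelblue", color = "black")

# Mapping: fill varies with the eye_color column, legend drawn
p_map <- ggplot(
  data = starwars,
  mapping = aes(x = eye_color, fill = eye_color)
  ) +
  geom_bar()

p_set
p_map
```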


Adding labels with `labs`

One of the nice things about this software environment is that there are plenty of coloring schemes available to us and we will play with some of these shortly, but before we do that, let us look at one more improvement -- adding titles, subtitles, captions, and axis labels to our chart. This is done with the labs() command.

ggplot(
  data = starwars,
  mapping = aes(
    x = eye_color,
    fill = eye_color
    )
  ) +
  geom_bar() +
  labs(
    x = "Eye Color",
    y = "Frequency (n)",
    title = "Bar-chart of Eye Colors",
    subtitle = "(of Star Wars characters)",
    caption = "My little work of art!!"
    )

Notice the text that now appears as a result of what has been specified in the labs() command.


Controlling the chart legend with `theme()`

In this bar-chart, do we really need the legend? No, because the colors and color names show up in the chart itself. How can we hide the legend? Turns out there is a neat command that will allow you to move the legend around/hide it.

ggplot(
  data = starwars,
  mapping = aes(
    x = eye_color,
    fill = eye_color
    )
  ) +
  geom_bar() +
  labs(
    x = "Eye Color",
    y = "Frequency (n)",
    title = "Bar-chart of Eye Colors",
    subtitle = "(of Star Wars characters)",
    caption = "My little work of art!!"
    ) +
  theme(legend.position = "none")

Instead of "none" you could have specified "bottom", "left", "top", "right" to place the legend in a particular direction.


Customizing colors

Of course, it would be good to have the colors match the eye-color so let us do that next. The way we can do this is by calling specific colors by name. I have tried to order the lineup of the colors to match, as closely as I can, the eye colors.

c(
  "black", "blue", "slategray", "brown", "gray34", "gold",
  "greenyellow", "navajowhite1", "orange", "pink", "red",
  "magenta", "thistle3", "white", "yellow"
  ) -> mycolors
ggplot(
  data = starwars,
  mapping = aes(x = eye_color)
  ) + 
  geom_bar(fill = mycolors) + 
  labs(
    x = "Eye Color",
    y = "Frequency (n)",
    title = "Bar-chart of Eye Colors",
    subtitle = "(of Star Wars characters)",
    caption = "My little work of art!!"
    ) +
  theme(legend.position = "none")
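
Passing an unnamed vector to fill = relies on the colors being listed in exactly the same order as the bars. A sketch of a less fragile alternative, assuming you are happy to use a named vector with scale_fill_manual(); the lookup eye_cols and the three-color subset are hypothetical and chosen only to keep the example short:

```r
library(ggplot2)
library(dplyr)   # provides the starwars data and filter()

# Hypothetical lookup: names are eye colors in the data, values are R color names
c(blue = "blue", brown = "brown", yellow = "yellow") -> eye_cols

# Keep only the characters whose eye color appears in the lookup
starwars %>%
  filter(eye_color %in% names(eye_cols)) -> three_colors

p <- ggplot(
  data = three_colors,
  mapping = aes(x = eye_color, fill = eye_color)
  ) +
  geom_bar() +
  scale_fill_manual(values = eye_cols) +   # colors are matched by name, not by position
  theme(legend.position = "none")

p
```

Because the match is by name, reordering the bars or dropping a category does not scramble the colors.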


These colors are from this source but see also this source. Colors can be customized by generating your own palettes via the Color Brewer here. But don't get carried away: Remember to read the materials on choosing colors wisely, particularly the point about qualitative palettes, divergent palettes, and then palettes that work well even with colorblind audiences.


Selected color palettes

I had mentioned the existence of a number of color palettes so let us look at a few, but we will do this with a different variable. First up, the Pastel1 palette.

ggplot(
  data = starwars,
  mapping = aes(
    x = gender
    )
  ) + 
  geom_bar(
    aes(fill = gender)
    ) + 
  labs(
    x = "Gender",
    y = "Frequency",
    title = "Bar-chart of Gender",
    subtitle = "(of Star Wars characters)",
    caption = "(Source: The dplyr package)") +
  scale_fill_brewer(
    palette = "Pastel1"
    )


Not bad but doesn't work too well here. How about trying another palette, Set1?

ggplot(
  data = starwars,
  mapping = aes(
    x = gender
    )
  ) + 
  geom_bar(
    aes(fill = gender)
    ) + 
  labs(
    x = "Gender",
    y = "Frequency",
    title = "Bar-chart of Gender",
    subtitle = "(of Star Wars characters)",
    caption = "(Source: The dplyr package)") +
  scale_fill_brewer(
    palette = "Set1"
    )

Check out this package as well for several other custom palette packages


Nice! But what is also noticeable here is that there are some characters in the data-set whose gender data is missing. These are the NA values. By default, you will see NA values showing up in some types of charts and so it is always good to exclude them from the chart. Here is one way of doing that.

ggplot(
  data = subset(starwars, !is.na(gender)),
  mapping = aes(
    x = gender
    )
  ) + 
  geom_bar(
    aes(fill = gender)
    ) + 
  labs(
    x = "Gender",
    y = "Frequency",
    title = "Bar-chart of Gender",
    subtitle = "(of Star Wars characters)",
    caption = "(Source: The dplyr package)") +
  scale_fill_brewer(
    palette = "Set1"
    )


Or use filter()

starwars %>%
  filter(!is.na(gender)) -> my.data
ggplot(
  data = my.data,
  mapping = aes(
    x = gender
    )
  ) + 
  geom_bar(
    aes(fill = gender)
    ) + 
  labs(
    x = "Gender",
    y = "Frequency",
    title = "Bar-chart of Gender",
    subtitle = "(of Star Wars characters)",
    caption = "(Source: The dplyr package)") +
  scale_fill_brewer(
    palette = "Set1"
    )


There is one color palette you should remember, and this is the {viridis} color scheme that works around varying types of color blindness in the population. Here come the palettes:

ggplot(
  data = my.data,
  mapping = aes(
    x = gender
    )
  ) + 
  geom_bar(
    aes(fill = gender)
    ) + 
  labs(
    x = "Gender",
    y = "Frequency",
    title = "Bar-chart of Gender",
    subtitle = "(of Star Wars characters)",
    caption = "(Source: The dplyr package)") +
  scale_fill_viridis_d(
    option = "viridis"
    )

Other options would be plasma, magma, and cividis


Themes with `{ggthemes}`

One can also lean on various plotting themes as shown below. These themes mimic the style of graphics popularized by some data visualization experts (e.g., Stephen Few, Edward Tufte), news-media houses (Fivethirtyeight, The Economist, The Wall Street Journal), some software packages (Excel, Stata, Google docs), and a few others. Below I show you just a handful.

library(ggthemes)
ggplot(
  data = starwars,
  mapping = aes(
    x = eye_color
    )
  ) +
  geom_bar() +
  theme_tufte()


ggplot(
  data = starwars,
  mapping = aes(
    x = eye_color
    )
  ) +
  geom_bar() +
  theme_economist()


ggplot(
  data = starwars,
  mapping = aes(
    x = eye_color
    )
  ) +
  geom_bar() +
  theme_fivethirtyeight()


More with bar-charts

I want to show a few things with bar-charts now. First, we can specify things a bit differently without altering the result. For example, compare the following two pieces of code.

ggplot(
  data = movies,
  mapping = aes(x = mpaa)
  ) + 
  geom_bar()

ggplot() + 
  geom_bar(
    data = movies,
    mapping = aes(x = mpaa)
    )


The plot is sub-optimal since MPAA ratings are missing for a lot of movies, and these movies should be eliminated from the plot via subset(mpaa != "") or by running dplyr's filter() to create another data-set. I will lean on filter().

movies %>% 
  filter(mpaa != "") -> movies2 
ggplot() + 
  geom_bar(
    data = movies2,
    mapping = aes(x = mpaa)
    )


The order of the bars here is fortuitous in that it goes from the smallest frequency to the highest frequency, drawing the reader's eye. I said fortuitous because {ggplot2} defaults to drawing the bars in an ascending alphabetic/alphanumeric order if the variable is a character. See below for an example.

df = tibble(x = c(rep("A", 2), rep("B", 4), rep("C", 1)))
ggplot() + 
  geom_bar(
    data = df, 
    mapping = aes(x = x)
  )

Notice the bars here do not follow in ascending/descending order of frequencies

Later on we'll learn how to order the bars with ascending/descending frequencies or by some other logic.


What about plotting relative frequencies on the y-axis rather than the frequencies?

library(scales)
ggplot() + 
  geom_bar(
    data = movies2,
    mapping = aes(
      x = mpaa,
      y = (..count..)/sum(..count..)
      )
    ) + 
  scale_y_continuous(labels = percent) +
  labs(
    x = "MPAA Rating",
    y  = "Relative Frequency (%)"
    )

Note the addition of

y = (..count..)/sum(..count..) to change the y-axis to reflect the relative frequency as a proportion, and
scale_y_continuous(labels = percent) to then multiply these proportions by 100 to get percentages as the labels rather than 0.2, 0.4, 0.6, etc.
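
Recent ggplot2 releases (3.3.0 onward, as far as I know) also offer after_stat() as the documented way to refer to computed variables such as count, and newer versions flag the ..count.. notation as deprecated. A sketch of the same idea on a small made-up data frame (the ratings below are invented, not the movies data):

```r
library(ggplot2)

# Invented ratings, just to have something to count
toy <- data.frame(
  rating = c("PG", "PG", "R", "R", "R", "R", "PG-13", "PG-13", "PG-13", "NC-17")
  )

p <- ggplot(
  data = toy,
  mapping = aes(
    x = rating,
    y = after_stat(count / sum(count))   # relative frequency of each rating
    )
  ) +
  geom_bar() +
  scale_y_continuous(labels = scales::percent) +
  labs(
    x = "Rating",
    y = "Relative Frequency (%)"
    )

p
```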


Disaggregating bar-charts for groups

Let us build a simple bar-chart with the hsb2 data we saw in Module 01. Here we first download the data, label the values, save it, and then start charting.

read.table(
  'https://stats.idre.ucla.edu/stat/data/hsb2.csv',
  header = TRUE,
  sep = ","
  ) -> hsb2
factor(hsb2$female,
       levels = c(0, 1),
       labels = c("Male", "Female")
       ) -> hsb2$female 
factor(hsb2$race,
       levels = c(1:4),
       labels = c("Hispanic", "Asian", "African American", "White")
       ) -> hsb2$race

factor(hsb2$ses,
       levels = c(1:3),
       labels = c("Low", "Middle", "High")
       ) -> hsb2$ses
factor(hsb2$schtyp,
       levels = c(1:2),
       labels = c("Public", "Private")
       ) -> hsb2$schtyp
factor(hsb2$prog,
       levels = c(1:3),
       labels = c("General", "Academic", "Vocational")
       ) -> hsb2$prog
save(
  hsb2, file = here::here("data", "hsb2.RData")
  )


What if I wanted to see how socioeconomic status varies across male and female students?

ggplot() + 
  geom_bar(
    data = hsb2,
    mapping = aes(
      x = ses,
      group = female,
      fill = female
      )
    ) +
  labs(
    x = "Socioeconomic Status",
    y = "Frequency"
  )


This is not very useful since the viewer has to estimate the relative sizes of the two colors within any given bar. That can be fixed with position = "dodge", juxtaposing the bars for the groups as a result, and the end product is much better. But note: position = "dodge" has to be put outside the aes() but still inside geom_bar() so be careful.

ggplot() + 
  geom_bar(
    data = hsb2,
    mapping = aes(
      x = ses,
      group = female, 
      fill = female
      ),
    position = "dodge"
    ) +
  labs(
    x = "Socioeconomic Status",
    y = "Frequency"
  )


What if you wanted to calculate percentages within each sex? That is, what percent of male students fall within a particular ses category, and the same thing for female students?

ggplot() + 
  geom_bar(
    data = hsb2, 
    aes(
      x = ses, 
      group = female,
      fill = female, 
      y = ..prop..
      ),
    position = "dodge") +
  scale_y_continuous(labels = scales::percent) + 
  labs(
    x = "Socioeconomic Status",
    y = "Relative Frequency (%)"
    )
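
What ..prop.. computes when group = female can be reproduced by hand, which makes for a handy sanity check. A sketch with {dplyr} on a tiny invented data frame (toy and its values are made up, not hsb2):

```r
library(dplyr)

toy <- tibble(
  sex = c("Male", "Male", "Male", "Male",
          "Female", "Female", "Female", "Female", "Female"),
  ses = c("Low", "Middle", "Middle", "High",
          "Low", "Low", "Middle", "High", "High")
  )

toy %>%
  count(sex, ses) %>%            # frequency of each sex-by-ses combination
  group_by(sex) %>%              # percentages are calculated within each sex
  mutate(prop = n / sum(n)) %>%
  ungroup() -> within_sex

within_sex
```

Swapping group_by(sex) for group_by(ses) would give the within-ses percentages instead.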


What about within each ses instead of within gender? That is, what if we wanted percent of Low ses that is Male versus Female, and so on?

ggplot() + 
  geom_bar(
    data = hsb2, 
    aes(
      x = female, 
      group = ses,
      fill = ses, 
      y = ..prop..
      ),
    position = "dodge") +
  scale_y_continuous(labels = scales::percent) + 
  labs(
    x = "Sex",
    y = "Relative Frequency (%)"
    )


There is some more we will do with bar-charts but for now let us set them aside and instead look at a few other charts -- histograms, box-plots, and line-charts.


Histograms

If you've forgotten what these are, see histogram, or then Yau's piece here and here. There is a short video available as well.

For histograms in ggplot2, geom_histogram() is the geometry needed but note that the default number of bins is not very useful and can be tweaked, along with other embellishments that are possible as well.

ggplot() + 
  geom_histogram(
    data = hsb2,
    aes(x = read), 
    fill = "cornflowerblue",
    color = "white"
    ) + 
  labs(
    title = "Histogram of Reading Scores",
    x = "Reading Score",
    y = "Frequency"
    )


Note the warning "stat_bin() using bins = 30. Pick better value with binwidth." This is because numerical variables need to be grouped in order to have meaningful histograms we can make sense of. How do you define the bins (aka the groups)? We could set bins = 5 and we could also experiment with binwidth =. Let us set bins = 5, which tells ggplot2 to give us 5 groups and to go ahead and calculate the group boundaries itself.

ggplot() + 
  geom_histogram(
    data = hsb2,
    aes(x = read), 
    fill = "cornflowerblue",
    color = "white",
    bins = 5
    ) + 
  labs(
    title = "Histogram of Reading Scores",
    subtitle = "(with bins = 5)",
    x = "Reading Score",
    y = "Frequency"
    )

If we wanted more/fewer bins we could tweak it up or down as needed.


What about binwidth? This will specify how wide each group must be.

ggplot() + 
  geom_histogram(
    data = hsb2,
    aes(x = read), 
    fill = "cornflowerblue",
    color = "white",
    binwidth = 5
    ) + 
  labs(
    title = "Histogram of Reading Scores",
    subtitle = "(with binwidth = 5)",
    x = "Reading Score",
    y = "Frequency"
    )
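
The difference between the two arguments is easy to verify: bins fixes how many bars you get, binwidth fixes how wide each bar is. A sketch on a handful of invented scores (not the hsb2 data), using layer_data() to look at the bins ggplot2 built:

```r
library(ggplot2)

# Invented reading scores
scores <- data.frame(
  read = c(28, 34, 39, 42, 44, 47, 47, 50, 52, 55, 57, 60, 63, 68, 73, 76)
  )

p_bins  <- ggplot(data = scores, mapping = aes(x = read)) +
  geom_histogram(bins = 5)
p_width <- ggplot(data = scores, mapping = aes(x = read)) +
  geom_histogram(binwidth = 5)

d_bins  <- layer_data(p_bins, 1)
d_width <- layer_data(p_width, 1)

nrow(d_bins)                   # how many bars were drawn with bins = 5
d_width$xmax - d_width$xmin    # width of every bar drawn with binwidth = 5
```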


If we wanted to disaggregate the histogram by one or more categorical variables, we could do so quite easily:

ggplot() + 
  geom_histogram(
    data = hsb2,
    aes(x = read), 
    fill = "cornflowerblue",
    color = "white",
    bins = 5
    ) + 
  labs(
    title = "Histogram of Reading Scores",
    subtitle = "(broken out for Male vs. Female students)",
    x = "Reading Score",
    y = "Frequency"
    ) +
  facet_wrap(~ female)


When we do this, it is often useful to organize them so that only one histogram shows up in a row. This is done with the ncol = 1 argument to facet_wrap().

ggplot() + 
  geom_histogram(
    data = hsb2,
    aes(x = read), 
    fill = "cornflowerblue",
    color = "white",
    bins = 5
    ) + 
  labs(
    title = "Histogram of Reading Scores",
    subtitle = "(broken out for Male vs. Female students)",
    x = "Reading Score",
    y = "Frequency"
    ) +
  facet_wrap(~ female, ncol = 1)


ggplot() + 
  geom_histogram(
    data = hsb2,
    aes(x = read), 
    fill = "cornflowerblue",
    color = "white",
    bins = 5
    ) + 
  labs(
    title = "Histogram of Reading Scores",
    subtitle = "(broken out by Socioeconomic Status)",
    x = "Reading Score",
    y = "Frequency"
    ) +
  facet_wrap(~ ses, ncol = 1)

Now the distributions are stacked above each other, easing comparisons: do they have the same average? Do they vary the same? Are they similarly skewed/symmetric?


For breakouts with two categorical variables we could do

ggplot() + 
  geom_histogram(
    data = hsb2,
    aes(x = read), 
    fill = "cornflowerblue",
    color = "white",
    bins = 5
    ) + 
  labs(
    title = "Histogram of Reading Scores",
    subtitle = "(broken out by Socioeconomic Status and School Type)",
    x = "Reading Score",
    y = "Frequency"
    ) +
  facet_wrap(ses ~ schtyp, ncol = 2)

Note that ses ~ schtyp renders the panels for the first category of ses by all categories of schtyp and then repeats for the other categories in rows 2 and 3.


If we did facet_wrap(schtyp ~ ses, ncol = 3) we would have a different result:

ggplot() + 
  geom_histogram(
    data = hsb2,
    aes(x = read), 
    fill = "cornflowerblue",
    color = "white",
    bins = 5
    ) + 
  labs(
    title = "Histogram of Reading Scores",
    subtitle = "(broken out by Socioeconomic Status and School Type)",
    x = "Reading Score",
    y = "Frequency"
    ) +
  facet_wrap(schtyp ~ ses, ncol = 3) +
  ylim(c(0, 23))

Notice that here I also add a ylim(c(...)) command to set the minimum and maximum values of the y-axis. This is useful, but remember to have the y-axis start at 0, or else make a note in the plot so readers do not assume it starts at 0 when in fact it has been truncated for ease of data presentation. A truncated axis misstates the pattern in the data, so avoid it or, again, annotate the plot to that effect so nobody is misled. Bar-charts and histograms will have 0 as the minimum y-limit by default but this is not true for some other plots.
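
One caveat when setting limits, sketched here on made-up data: ylim() sets the limits of the scale, so a bar that extends beyond the upper limit is removed (with a warning), whereas coord_cartesian(ylim = ...) only zooms the view and leaves every bar in place. The toy data frame and the limit of 3 are my own illustrative choices.

```r
library(ggplot2)

# Invented data: "A" occurs 5 times, "B" occurs 2 times
toy <- data.frame(x = c(rep("A", 5), rep("B", 2)))

# Scale limits: the "A" bar (height 5) falls outside 0-3 and is dropped
p_clip <- ggplot(data = toy, mapping = aes(x = x)) +
  geom_bar() +
  ylim(c(0, 3))

# Coordinate limits: the view is zoomed to 0-3 but both bars are still computed
p_zoom <- ggplot(data = toy, mapping = aes(x = x)) +
  geom_bar() +
  coord_cartesian(ylim = c(0, 3))

layer_data(p_zoom, 1)$count
```

So choose an upper limit that sits above the tallest bar, as the 23 in the histogram example is presumably meant to do.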


Box-plots

Remember these, our friends from MPA 6010? These can be useful for studying the distribution of a continuous variable. See this video. Let us see these in action with the cmhflights data.

load(
  here::here("data", "cmhflights_01092017.RData")
  )
ggplot() + 
  geom_boxplot(
    data = cmhflights,
    mapping = aes(
      y = ArrDelay,
      x = ""
      ),
    fill = "cornflowerblue"
    )


The x = "" is in aes() because otherwise with a single group the box-plot will not build up nicely.

But, I prefer to see them running horizontally, so how can I do that? With coord_flip() since this just swaps the x and y axes.

ggplot() + 
  geom_boxplot(
    data = cmhflights,
    mapping = aes(
      y = ArrDelay,
      x = ""
      ),
    fill = "cornflowerblue"
    ) +
  coord_flip()
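
As a reminder of what the box actually encodes, the numbers behind a box-plot can be pulled out with layer_data(). A sketch on a few invented arrival delays (the values are made up and are not the cmhflights data):

```r
library(ggplot2)

# Invented arrival delays, in minutes
delays <- data.frame(ArrDelay = c(-10, -5, 0, 3, 8, 15, 40, 120))

p <- ggplot(data = delays, mapping = aes(x = "", y = ArrDelay)) +
  geom_boxplot(fill = "cornflowerblue") +
  coord_flip()

# Whisker ends (ymin, ymax), quartiles (lower, upper), and the median (middle)
box_stats <- layer_data(p, 1)[, c("ymin", "lower", "middle", "upper", "ymax")]
box_stats
```

Points beyond the whiskers, such as the 120-minute delay here, are drawn individually as outliers.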


And now for a slightly different data-set, one that measures male adults' hemoglobin concentration for a few populations.

read_csv(
  "http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter02/chap02e3cHumanHemoglobinElevation.csv"
  ) -> hemoglobin
ggplot() +
  geom_boxplot(
    data = hemoglobin,
    mapping = aes(
      x = population,
      y = hemoglobin,
      fill = population
      )
    ) +
  coord_flip() +
  labs(
    x = "Population",
    y = "Hemoglobin Concentration",
    title = "Hemoglobin Concentration in Adult Males",
    subtitle = "(Andes, Ethiopia, Tibet, USA)"
    ) +
  theme(legend.position = "none")


Could we use our facet_wrap(...) here too? Of course.

ggplot() + 
  geom_boxplot(
    data = cmhflights,
    mapping = aes(
      y = ArrDelay,
      x = Carrier
      ),
    fill = "cornflowerblue"
    ) +
  coord_flip() +
  facet_wrap(~ Month)


Line-charts

If we have data over time for one or more units, then line-charts work really well to exhibit trends. A classic, current example would be the number of confirmed COVID-19 cases per country per date. For our example, say we have monthly data on unemployment for the country. These data are the economics data-set, which ships with {ggplot2}, so no additional package needs to be installed. Note that uempmed measures the median duration of unemployment (in weeks) rather than the unemployment rate, so the y-axis is labelled accordingly.

library(ggplot2)
data(economics)
#names(economics)
ggplot() +
  geom_line(
    data = economics, 
    mapping = aes(
      x = date,
      y = uempmed
      )
    ) + 
  labs(
    x = "Date",
    y = "Median Duration of Unemployment (weeks)"
  )


They can look very plain and aesthetically unappealing unless you dress them up. See the one below and then the one that follows.

load(
  here::here("data", "gap.df.RData")
  )
ggplot() +
  geom_line(
    data = gap.df, 
    mapping = aes(x = year, y = LifeExp,
      group = continent, color = continent
      )
    ) + 
  geom_point(
    data = gap.df, 
    mapping = aes(
      x = year,
      y = LifeExp,
      group = continent,
      color = continent
      )
    ) + 
  labs(
    x = "Year",
    y = "Median Life Expectancy (in years)",
    color = ""
  ) + 
  theme(legend.position = "bottom")
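
When several geometries use the same data and aesthetics, they can be declared once in ggplot() and inherited by every layer, which avoids the repetition above. A sketch on a small invented data frame (toy and its values are made up; gap.df itself is a local file):

```r
library(ggplot2)

# Invented life-expectancy figures for two made-up continents
toy <- data.frame(
  year = rep(c(1990, 2000, 2010), times = 2),
  LifeExp = c(60, 64, 68, 70, 73, 76),
  continent = rep(c("A", "B"), each = 3)
  )

p <- ggplot(
  data = toy,
  mapping = aes(x = year, y = LifeExp, group = continent, color = continent)
  ) +
  geom_line() +      # inherits the data and aesthetics from ggplot()
  geom_point() +     # so does this layer
  labs(
    x = "Year",
    y = "Life Expectancy (in years)",
    color = ""
    ) +
  theme(legend.position = "bottom")

p
```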


Scatter-plots

These work well if we have two or more continuous variables, and work well to highlight the nature and strength of a relationship between the two variables ... what happens to y as x increases?

ggplot() + 
  geom_point(
    data = hsb2, 
    mapping = aes(
      x = write,
      y = science
      )
    ) +
  labs(
    x = "Writing Scores", 
    y = "Science Scores"
    )


We could highlight the different ses groups, to see if there is any difference in the relationship between writing scores and science scores by the different ses levels.

ggplot() +
  geom_point(
    data = hsb2,
    mapping = aes(
      x = write,
      y = science,
      color = ses
      )
    ) + 
  labs(
    x = "Writing Scores", 
    y = "Science Scores",
    color = ""
    ) + 
  theme(legend.position = "bottom")


This is not very helpful so why not break out ses for ease of interpretation?

ggplot() +
  geom_point(
    data = hsb2,
    mapping = aes(
      x = write,
      y = science
      )
    ) + 
  labs(
    x = "Writing Scores", 
    y = "Science Scores"
    ) + 
  facet_wrap(~ ses)


Could we add another layer, perhaps female?

ggplot() +
  geom_point(
    data = hsb2,
    mapping = aes(
      x = write,
      y = science
      )
    ) + 
  labs(
    x = "Writing Scores", 
    y = "Science Scores"
    ) + 
  facet_wrap(ses ~ female, ncol = 2)


And finally, a few suggestions about how to build up your visualizations:

🔁 start with pencil and paper, sketch prototypes of desired visualization(s)
😄 graphics are relatively easy to generate with base R & with ggplot2
👏 common-sense: number & type of variable(s) guide plotting
🎇 stay color conscious: sensible colors & sensitive to color blindness
🔰 experiment, experiment, experiment until you are happy
use the 🆓 learning resources available online
📒 if you learn something new in R, write it down


Find me at...

@aruhil
aniruhil.org
ruhil@ohio.edu
