class: title-slide, center, middle
background-image: url(images/ouaerial.jpeg)
background-size: cover

# .crimson[.fancy[Visualizing Data for Analytics]]

## .crimson[.fancy[Ani Ruhil]]

---
name: agenda

# .fancy[ Agenda ]

This week we learn how to visualize data

.pull-left[

+ package of choice here is `{ggplot2}`
+ built on the `grammar of graphics` philosophy
+ the basis for elegant yet highly customized plots
+ can be extended and even animated with ease

]

.pull-right[

<img src="images/ggplot2_exploratory.png" width="35%" style="display: block; margin: auto;" />

]

---

### .fancy[.heat[what is "a" grammar of graphics?]]

Think about it as a way of building up any graphic layer by layer

(1) starts with the `data` you are going to use

(2) then comes the `aesthetics` -- what goes on which axis? How are colors to be assigned? Are there groups? Is there a size consideration here? What else?

(3) is some `scaling` necessary? (show the proportions as percentages, or convert frequencies to proportions, and so on?)

(4) what `geometry` should be used -- bars? lines? scatterplots? box-plots? geographic maps? something else?

(5) should some `statistic` be displayed, such as means, standard errors, confidence/prediction intervals, etc.?

(6) what about `faceting` -- should there be separate panels for some groups?

.medium[.center[.fancy[.heat[ These (and other) layers are baked into the structure of the `{ggplot2}` package ]]]]

---

.pull-left[

I will use two data-sets to walk through the initial examples in this module, the first being this [IMDB data-set](http://imdb.com/)

> The internet movie database, http://imdb.com/, is a website devoted to collecting movie data supplied by studios and fans. It claims to be the biggest movie database on the web and is run by Amazon. More information about imdb.com can be found online, http://imdb.com/help/show_leaf?about, including information about the data collection process, http://imdb.com/help/show_leaf?infosource.

```r library(ggplot2movies) data(movies) names(movies) ``` ``` ## [1] "title" "year" "length" "budget" "rating" ## [6] "votes" "r1" "r2" "r3" "r4" ## [11] "r5" "r6" "r7" "r8" "r9" ## [16] "r10" "mpaa" "Action" "Animation" "Comedy" ## [21] "Drama" "Documentary" "Romance" "Short" ``` ] .pull-right[ A data frame with 28819 rows and 24 variables | Variable | Description | | :-- | :-- | | title | Title of the movie | | year | Year of release | | budget | Total budget (if known) in US dollars | | length | Length in minutes | | rating | Average IMDB user rating | | votes | Number of IMDB users who rated this movie | | r1-10 | Multiplying by ten gives percentile (to nearest 10%) of users who rated this movie a 1 | | mpaa | MPAA rating (missing for a lot of movies) | | action, animation, comedy, drama, documentary, romance, short | Binary variables representing if movie was classified as belonging to that genre | ] --- .pull-left[ The second data-set is the [Star Wars dataset](https://swapi.co), a `tibble` with 87 rows and 13 variables: ```r library(tidyverse) data(starwars) names(starwars) ``` ``` ## [1] "name" "height" "mass" "hair_color" "skin_color" ## [6] "eye_color" "birth_year" "gender" "homeworld" "species" ## [11] "films" "vehicles" "starships" ``` ```r head(starwars) ``` ``` ## # A tibble: 6 x 13 ## name height mass hair_color skin_color eye_color birth_year gender homeworld ## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> ## 1 Luke… 172 77 blond fair blue 19 male Tatooine ## 2 C-3PO 167 75 <NA> gold yellow 112 <NA> Tatooine ## 3 R2-D2 96 32 <NA> white, bl… red 33 <NA> Naboo ## 4 Dart… 202 136 none white yellow 41.9 male Tatooine ## 5 Leia… 150 49 brown light brown 19 female Alderaan ## 6 Owen… 178 120 brown, gr… light blue 52 male Tatooine ## # … with 4 more variables: species <chr>, films <list>, vehicles <list>, ## # starships <list> ``` ] .pull-right[ | Variable | Description | | :-- | :-- | | name | Name of the character | | height | Height (cm) | | 
mass | Weight (kg) |
| hair_color, skin_color, eye_color | Hair, skin, and eye colors |
| birth_year | Year born (BBY = Before Battle of Yavin) |
| gender | male, female, hermaphrodite, or none |
| homeworld | Name of homeworld |
| species | Name of species |
| films | List of films the character appeared in |
| vehicles | List of vehicles the character has piloted |
| starships | List of starships the character has piloted |

]

---

# `ggplot2` and the [grammar of graphics](http://vita.had.co.nz/papers/layered-grammar.html)

The `{ggplot2}` package has a special syntax and I will point out things you should note as we move through this module. First up, the library is called `ggplot2` but the command starts with `ggplot` so don't let that throw you off-track. Second, you need to have a data-set to work with. In the code below I start by loading the library and then specifying the data-set to be used.

.pull-left[
```r
library(ggplot2)
ggplot(data = starwars)
```
]

.pull-right[
<img src="Module05sp20_files/figure-html/gg000b-1.svg" width="80%" style="display: block; margin: auto;" />
]

---

Nothing but a blank canvas results from these commands because we have not yet specified what should go on the x-axis or on the y-axis. Well, let us do that then by asking for the column `eye_color` to be put on the x-axis.

```r
ggplot(
  data = starwars,
  mapping = aes(
    x = eye_color
    )
  )
```

<img src="Module05sp20_files/figure-html/gg001-1.svg" width="40%" style="display: block; margin: auto;" />

---

## `geom_bar()` ... the bar-chart

This results in a gray canvas with the eye colors on the x-axis but nothing else has been drawn since we have not specified the `geometry` ... do you want a bar-chart? histogram? dot-plot? line-chart? This is a categorical variable and hence a bar-chart would be appropriate. We call for a bar-chart with the `geom_bar()` command.
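As a quick check on what `geom_bar()` is about to draw: by default it counts the rows in each category and uses those counts as the bar heights. Below is a minimal sketch of the same tabulation, assuming `{dplyr}` is installed (it supplies `starwars`); the name `eye_counts` is only for illustration.

```r
library(dplyr)  # supplies the starwars data and count()

# Tabulate rows per eye color; these counts are the bar heights
# that geom_bar() computes on its own (its default stat is "count")
eye_counts <- count(starwars, eye_color, sort = TRUE)
eye_counts
```

Comparing this table with the chart is an easy way to confirm that the bars show what you think they show.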
.pull-left[
```r
ggplot(
  data = starwars,
  mapping = aes(
    x = eye_color
    )
  ) +
  geom_bar()
```
]

.pull-right[
<img src="Module05sp20_files/figure-html/gg002a-1.svg" width="100%" style="display: block; margin: auto;" />
]

---

## `color` versus `fill`

The `aes()` refers to the **aesthetics** of the chart, and many other `aesthetics` can be added, such as `group`, `color`, `fill`, `size`, `alpha`, etc. We will see some of these in due course but for now I want to focus on two of them, both involving coloring of the `geom_`. Specifically, there are two ways of adding colors to a chart -- (1) `color` or `colour`, and (2) `fill`.

.pull-left[
```r
ggplot(
  data = starwars,
  mapping = aes(
    x = eye_color,
    color = eye_color
    )
  ) +
  geom_bar()
```

Note what the `color = eye_color` command did ... it drew a colored border for the bars, and an accompanying legend. What if we had used `fill = eye_color` instead?
]

.pull-right[
<img src="Module05sp20_files/figure-html/col1a-1.svg" width="100%" style="display: block; margin: auto;" />
]

---

## using `fill = `

.pull-left[
```r
ggplot(
  data = starwars,
  mapping = aes(
    x = eye_color,
    fill = eye_color
    )
  ) +
  geom_bar()
```
]

.pull-right[
<img src="Module05sp20_files/figure-html/col1b-1.svg" width="100%" style="display: block; margin: auto;" />
]

Aha! Now the bars are filled with colors and an accompanying legend is drawn as well. So `fill =` and `color =` behave very differently; bear this in mind.

---

## Adding labels with `labs`

One of the nice things about this software environment is that there are plenty of coloring schemes available to us and we will play with some of these shortly, but before we do that, let us look at one more improvement -- adding titles, subtitles, captions, and axis labels to our chart. This is done with the `labs()` command.
.pull-left[ ```r ggplot( data = starwars, mapping = aes( x = eye_color, fill = eye_color ) ) + geom_bar() + labs( x = "Eye Color", y = "Frequency (n)", title = "Bar-chart of Eye Colors", subtitle = "(of Star Wars characters)", caption = "My little work of art!!" ) ``` ] .pull-right[ <img src="Module05sp20_files/figure-html/col1b2-1.svg" width="85%" style="display: block; margin: auto;" /> ] Notice the text that now appears as a result of what has been specified in the `labs()` command. --- ## Controlling the chart legend with `theme()` In this bar-chart, do we really need the legend? No, because the colors and color names show up in the chart itself. How can we hide the legend? Turns out there is a neat command that will allow you to move the legend around/hide it. .pull-left[ ```r ggplot( data = starwars, mapping = aes( x = eye_color, fill = eye_color ) ) + geom_bar() + labs( x = "Eye Color", y = "Frequency (n)", title = "Bar-chart of Eye Colors", subtitle = "(of Star Wars characters)", caption = "My little work of art!!" ) + theme(legend.position = "none") ``` ] .pull-right[ <img src="Module05sp20_files/figure-html/col1c-1.svg" width="90%" style="display: block; margin: auto;" /> ] Instead of "none" you could have specified "bottom", "left", "top", "right" to place the legend in a particular direction. --- ## Customizing colors Of course, it would be good to have the colors match the eye-color so let us do that next. The way we can do this is by calling specific colors by name. I have tried to order the lineup of the colors to match, as closely as I can, the eye colors. 
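Ordering colors by hand works only as long as the categories keep their order. One hedged alternative is to attach each color to a category name and pass the named vector to `scale_fill_manual()`. The names `eye_levels`, `my_colors`, and `eye_palette` are illustrative, and the pairing assumes the `starwars` data has exactly 15 eye colors when sorted alphabetically.

```r
library(ggplot2)
library(dplyr)  # supplies the starwars data

# The eye-color categories, sorted alphabetically
eye_levels <- sort(unique(starwars$eye_color))

my_colors <- c(
  "black", "blue", "slategray", "brown", "gray34", "gold",
  "greenyellow", "navajowhite1", "orange", "pink", "red",
  "magenta", "thistle3", "white", "yellow"
)

# Guard: one color per category, otherwise the pairing is meaningless
stopifnot(length(my_colors) == length(eye_levels))

# Named vector: each category name carries its own color
eye_palette <- setNames(my_colors, eye_levels)

p <- ggplot(starwars, aes(x = eye_color, fill = eye_color)) +
  geom_bar() +
  scale_fill_manual(values = eye_palette) +
  theme(legend.position = "none")
p
```

Because the colors are looked up by name, reordering the bars later would not scramble the color assignment.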
.pull-left[ ```r c( "black", "blue", "slategray", "brown", "gray34", "gold", "greenyellow", "navajowhite1", "orange", "pink", "red", "magenta", "thistle3", "white", "yellow" ) -> mycolors ggplot( data = starwars, mapping = aes(x = eye_color) ) + geom_bar(fill = mycolors) + labs( x = "Eye Color", y = "Frequency (n)", title = "Bar-chart of Eye Colors", subtitle = "(of Star Wars characters)", caption = "My little work of art!!" ) + theme(legend.position = "none") ``` ] .pull-right[ <img src="Module05sp20_files/figure-html/col3-1.svg" width="100%" style="display: block; margin: auto;" /> ] --- These colors are from [this source](http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf) but see also [this source](https://www.nceas.ucsb.edu/~frazier/RSpatialGuides/colorPaletteCheatsheet.pdf). Colors can be customized by generating your own palettes via the [Color Brewer here](http://colorbrewer2.org/#type=sequential&scheme=YlGnBu&n=3). But don't get carried away: Remember to read the materials on choosing colors wisely, particularly the point about qualitative palettes, divergent palettes, and then palettes that work well even with colorblind audiences. .center[ <img src = "http://images6.fanpop.com/image/photos/37900000/color-of-the-rainbow-birds-color-of-the-life-37998314-900-441.jpg"; width = 600px> .fancy[[Image Source:](http://images6.fanpop.com/image/photos/37900000/color-of-the-rainbow-birds-color-of-the-life-37998314-900-441.jpg)] ] --- ## Selected color palettes I had mentioned the existence of a number of color palettes so let us look at a few, but we will do this with a different variable. First up, the `Pastel1` palette. 
.pull-left[
```r
ggplot(
  data = starwars,
  mapping = aes(
    x = gender
    )
  ) +
  geom_bar(
    aes(fill = gender)
    ) +
  labs(
    x = "Gender",
    y = "Frequency",
    title = "Bar-chart of Gender",
    subtitle = "(of Star Wars characters)",
    caption = "(Source: The dplyr package)") +
  scale_fill_brewer(
    palette = "Pastel1"
    )
```
]

.pull-right[
<img src="Module05sp20_files/figure-html/colfillb1-1.svg" width="100%" style="display: block; margin: auto;" />
]

---

Not bad but doesn't work too well here. How about trying another palette, `Set1`?

.pull-left[
```r
ggplot(
  data = starwars,
  mapping = aes(
    x = gender
    )
  ) +
  geom_bar(
    aes(fill = gender)
    ) +
  labs(
    x = "Gender",
    y = "Frequency",
    title = "Bar-chart of Gender",
    subtitle = "(of Star Wars characters)",
    caption = "(Source: The dplyr package)") +
  scale_fill_brewer(
    palette = "Set1"
    )
```
]

.pull-right[
<img src="Module05sp20_files/figure-html/colfillb2-1.svg" width="100%" style="display: block; margin: auto;" />
]

.center[
Check out the `{paletteer}` package as well, which collects palettes from several other packages

<img src="https://github.com/EmilHvitfeldt/paletteer/raw/master/man/figures/logo.png" width="80px">
]

---

Nice! But what is also noticeable here is that there are some characters in the data-set whose gender data is missing. These are the `NA` values. By default, you will see `NA` values showing up in some types of charts, so it is usually a good idea to exclude them from the chart. Here is one way of doing that.
.pull-left[
```r
ggplot(
* data = subset(starwars, !is.na(gender)),
  mapping = aes(
    x = gender
    )
  ) +
  geom_bar(
    aes(fill = gender)
    ) +
  labs(
    x = "Gender",
    y = "Frequency",
    title = "Bar-chart of Gender",
    subtitle = "(of Star Wars characters)",
    caption = "(Source: The dplyr package)") +
  scale_fill_brewer(
    palette = "Set1"
    )
```
]

.pull-right[
<img src="Module05sp20_files/figure-html/colfillb3-1.svg" width="100%" style="display: block; margin: auto;" />
]

---

Or use `filter()`

.pull-left[
```r
*starwars %>%
* filter(!is.na(gender)) -> my.data

ggplot(
* data = my.data,
  mapping = aes(
    x = gender
    )
  ) +
  geom_bar(
    aes(fill = gender)
    ) +
  labs(
    x = "Gender",
    y = "Frequency",
    title = "Bar-chart of Gender",
    subtitle = "(of Star Wars characters)",
    caption = "(Source: The dplyr package)") +
  scale_fill_brewer(
    palette = "Set1"
    )
```
]

.pull-right[
<img src="Module05sp20_files/figure-html/colfillb4-1.svg" width="100%" style="display: block; margin: auto;" />
]

---

There is one color palette you should remember, and this is the `{viridis}` color scheme that works around varying types of color blindness in the population. Here come the palettes:

.pull-left[
```r
ggplot(
  data = my.data,
  mapping = aes(
    x = gender
    )
  ) +
  geom_bar(
    aes(fill = gender)
    ) +
  labs(
    x = "Gender",
    y = "Frequency",
    title = "Bar-chart of Gender",
    subtitle = "(of Star Wars characters)",
    caption = "(Source: The dplyr package)") +
* scale_fill_viridis_d(
*   option = "viridis"
    )
```
]

.pull-right[
<img src="Module05sp20_files/figure-html/colfillv1-1.svg" width="100%" style="display: block; margin: auto;" />
]

.center[
Other options would be `plasma`, `magma`, and `cividis`
]

---

## Themes with `{ggthemes}`

One can also lean on various plotting themes as shown below.
These themes mimic the style of graphics popularized by some data visualization experts (e.g., Stephen Few, Edward Tufte), news-media houses (FiveThirtyEight, The Economist, The Wall Street Journal), some software packages (Excel, Stata, Google Docs), and a few others. Below I show you just a handful.

.pull-left[
```r
*library(ggthemes)
ggplot(
  data = starwars,
  mapping = aes(
    x = eye_color
    )
  ) +
  geom_bar() +
* theme_tufte()
```
]

.pull-right[
<img src="Module05sp20_files/figure-html/ggt1-1.svg" width="100%" style="display: block; margin: auto;" />
]

---

```r
ggplot(
  data = starwars,
  mapping = aes(
    x = eye_color
    )
  ) +
  geom_bar() +
* theme_economist()
```

<img src="Module05sp20_files/figure-html/ggt2-1.svg" width="50%" style="display: block; margin: auto;" />

---

```r
ggplot(
  data = starwars,
  mapping = aes(
    x = eye_color
    )
  ) +
  geom_bar() +
* theme_fivethirtyeight()
```

<img src="Module05sp20_files/figure-html/ggt3-1.svg" width="50%" style="display: block; margin: auto;" />

---

## More with bar-charts

I want to show a few things with bar-charts now. First, we can specify things a bit differently without altering the result. For example, compare the following two pieces of code.

.pull-left[
```r
ggplot(
  data = movies,
  mapping = aes(x = mpaa)
  ) +
  geom_bar()
```

<img src="Module05sp20_files/figure-html/bar01-1.svg" width="65%" style="display: block; margin: auto;" />
]

.pull-right[
```r
ggplot() +
  geom_bar(
    data = movies,
    mapping = aes(x = mpaa)
    )
```

<img src="Module05sp20_files/figure-html/bar02-1.svg" width="65%" style="display: block; margin: auto;" />
]

---

The plot is sub-optimal since MPAA ratings are missing for a lot of movies; these blanks should be eliminated from the plot via `subset(movies, mpaa != "")` or by running dplyr's `filter()` to create another data-set. I will lean on `filter()`.
```r movies %>% filter(mpaa != "") -> movies2 ggplot() + geom_bar( data = movies2, mapping = aes(x = mpaa) ) ``` <img src="Module05sp20_files/figure-html/bar2-1.svg" width="45%" style="display: block; margin: auto;" /> --- The order of the bars here is fortuitous in that it goes from the smallest frequency to the highest frequency, drawing the reader's eye. I said fortuitous because `{ggplot2}` defaults to drawing the bars in an ascending alphabetic/alphanumeric order if the variable is a **character**. See below for an example. .pull-left[ ```r df = tibble(x = c(rep("A", 2), rep("B", 4), rep("C", 1))) ggplot() + geom_bar( data = df, mapping = aes(x = x) ) ``` <img src="Module05sp20_files/figure-html/bar3-1.svg" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ Notice the bars here do not follow in `ascending/descending order of frequencies` Later on we'll learn how to order the bars with ascending/descending frequencies or by some other logic. ] --- What about plotting `relative frequencies` on the y-axis rather than the frequencies? .pull-left[ ```r *library(scales) ggplot() + geom_bar( data = movies2, mapping = aes( x = mpaa, * y = (..count..)/sum(..count..) ) ) + * scale_y_continuous(labels = percent) + labs( x = "MPAA Rating", y = "Relative Frequency (%)" ) ``` ] .pull-right[ <img src="Module05sp20_files/figure-html/bar4-1.svg" width="100%" style="display: block; margin: auto;" /> ] Note the addition of + `y = (..count..)/sum(..count..)` to change the y-axis to reflect the relative frequency as a proportion, and + `scale_y_continuous(labels = percent)` to then multiply these proportions by 100 to get percentages as the labels rather than 0.2, 0.4, 0.6, etc. --- ## Disaggregating bar-charts for groups Let us build a simple bar-chart with the `hsb2` data we saw in Module 01. Here we first download the data, label the values, save it, and then start charting. 
.pull-left[ ```r read.table( 'https://stats.idre.ucla.edu/stat/data/hsb2.csv', header = TRUE, sep = "," ) -> hsb2 factor(hsb2$female, levels = c(0, 1), labels = c("Male", "Female") ) -> hsb2$female factor(hsb2$race, levels = c(1:4), labels = c("Hispanic", "Asian", "African American", "White") ) -> hsb2$race ``` ] .pull-right[ ```r factor(hsb2$ses, levels = c(1:3), labels = c("Low", "Middle", "High") ) -> hsb2$ses factor(hsb2$schtyp, levels = c(1:2), labels = c("Public", "Private") ) -> hsb2$schtyp factor(hsb2$prog, levels = c(1:3), labels = c("General", "Academic", "Vocational") ) -> hsb2$prog save( hsb2, file = here::here("data", "hsb2.RData") ) ``` ] --- What if I wanted to see how socioeconomic status varies across male and female students? .pull-left[ ```r ggplot() + geom_bar( data = hsb2, mapping = aes( x = ses, * group = female, * fill = female ) ) + labs( x = "Socioeconomic Status", y = "Frequency" ) ``` ] .pull-right[ <img src="Module05sp20_files/figure-html/bar5-1.svg" width="100%" style="display: block; margin: auto;" /> ] --- This is not very useful since the viewer has to estimate the relative sizes of the two colors within any given bar. That can be fixed with `position = "dodge"`, juxtaposing the bars for the groups as a result, and the end product is much better. But note: `position = "dodge"` has to be put outside the `aes()` but still inside `geom_bar()` so be careful. .pull-left[ ```r ggplot() + geom_bar( data = hsb2, mapping = aes( x = ses, group = female, fill = female ), * position = "dodge" ) + labs( x = "Socioeconomic Status", y = "Frequency" ) ``` ] .pull-right[ <img src="Module05sp20_files/figure-html/bar6-1.svg" width="100%" style="display: block; margin: auto;" /> ] --- What if you wanted to calculate percentages within each sex? That is, what percent of male students fall within a particular ses category, and the same thing for female students? 
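Before plotting, the same within-group percentages can be checked numerically. The sketch below uses a small made-up data frame (`toy`, with hypothetical values, not the `hsb2` file) and base R's `prop.table()`; `margin = 1` asks for shares within each row, which here means within each sex.

```r
# Hypothetical toy data, made up for illustration (not the hsb2 file)
toy <- data.frame(
  sex = c("Male", "Male", "Male", "Male",
          "Female", "Female", "Female", "Female", "Female"),
  ses = c("Low", "Middle", "Middle", "High",
          "Low", "Low", "Middle", "High", "High")
)

# Cross-tabulate: rows are sex, columns are ses
counts <- table(toy$sex, toy$ses)

# margin = 1 divides each row by its row total,
# so the shares within each sex add up to 1
within_sex <- prop.table(counts, margin = 1)
round(within_sex, 2)
```

Switching to `margin = 2` would instead give the share of each sex within a given ses category, which is the other question one might ask of the same table.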
.pull-left[
```r
ggplot() +
  geom_bar(
    data = hsb2,
    aes(
      x = ses,
      group = female,
      fill = female,
      y = ..prop..
      ),
    position = "dodge") +
  scale_y_continuous(labels = scales::percent) +
  labs(
    x = "Socioeconomic Status",
    y = "Relative Frequency (%)"
    )
```
]

.pull-right[
<img src="Module05sp20_files/figure-html/bar7-1.svg" width="100%" style="display: block; margin: auto;" />
]

---

What about within each ses instead of within gender? That is, what if we wanted the percent of Low ses students who are Male versus Female, and so on?

.pull-left[
```r
ggplot() +
  geom_bar(
    data = hsb2,
    aes(
      x = female,
      group = ses,
      fill = ses,
      y = ..prop..
      ),
    position = "dodge") +
  scale_y_continuous(labels = scales::percent) +
  labs(
    x = "Sex",
    y = "Relative Frequency (%)"
    )
```
]

.pull-right[
<img src="Module05sp20_files/figure-html/bar8-1.svg" width="100%" style="display: block; margin: auto;" />
]

---

.pull-left[
```r
ggplot() +
  geom_bar(
    data = hsb2,
    aes(
      x = female,
      group = ses,
      fill = ses,
      y = ..prop..
      ),
    position = "dodge") +
  scale_y_continuous(labels = scales::percent) +
  labs(
    x = "Sex",
    y = "Relative Frequency (%)"
    )
```
]

.pull-right[
<img src="Module05sp20_files/figure-html/bar9-1.svg" width="100%" style="display: block; margin: auto;" />
]

There is some more we will do with bar-charts but for now let us set them aside and instead look at a few other charts -- `histograms`, `box-plots`, and `line-charts`.

---

## Histograms

If you've forgotten what these are, see [histogram](http://tinlizzie.org/histograms/), or [Yau's piece here](https://flowingdata.com/2014/02/27/how-to-read-histograms-and-use-them-in-r/) and [here](https://flowingdata.com/2017/06/07/how-histograms-work/). [There is a short video available as well](https://vimeo.com/221607341).

For histograms in ggplot2, `geom_histogram()` is the geometry needed, but note that the default number of bins is not very useful and can be tweaked, along with other embellishments.
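To see what binning does to a numeric variable, independent of `ggplot2`, the values can be grouped by hand. This is a rough sketch with made-up scores (`scores` is hypothetical, not the `hsb2` reading scores); the bin edges that `ggplot2` picks for itself may differ from the ones chosen here.

```r
# Hypothetical reading scores, made up for illustration
scores <- c(31, 39, 42, 44, 47, 47, 50, 52, 52, 55, 57, 60, 63, 68, 76)

# Five equal-width bins, each 10 points wide, spanning 30 to 80
breaks <- seq(30, 80, by = 10)

# right = FALSE makes bins of the form [30,40), [40,50), ...
bins <- cut(scores, breaks = breaks, right = FALSE)

# Frequency per bin: these are the heights a histogram would show
bin_counts <- table(bins)
bin_counts
```

Changing `by = 10` to `by = 5` doubles the number of bins, which is the same trade-off that `bins =` and `binwidth =` control in `geom_histogram()`.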
.pull-left[
```r
ggplot() +
  geom_histogram(
    data = hsb2,
    aes(x = read),
    fill = "cornflowerblue",
    color = "white"
    ) +
  labs(
    title = "Histogram of Reading Scores",
    x = "Reading Score",
    y = "Frequency"
    )
```
]

.pull-right[
<img src="Module05sp20_files/figure-html/gg2a-1.svg" width="80%" style="display: block; margin: auto;" />
]

---

Note the message that `stat_bin()` is using `bins = 30` and that you should pick a better value with `binwidth`. It appears because numerical variables need to be grouped before we can draw a histogram we can make sense of, and we have not said how. How do you define the bins (aka the groups)? We could set `bins = 5` and we could also experiment with `binwidth =`. Let us start with `bins = 5`, which says "give us 5 groups, and go ahead and calculate them yourself."

.pull-left[
```r
ggplot() +
  geom_histogram(
    data = hsb2,
    aes(x = read),
    fill = "cornflowerblue",
    color = "white",
    bins = 5
    ) +
  labs(
    title = "Histogram of Reading Scores",
    subtitle = "(with bins = 5)",
    x = "Reading Score",
    y = "Frequency"
    )
```
]

.pull-right[
<img src="Module05sp20_files/figure-html/gg2b-1.svg" width="100%" style="display: block; margin: auto;" />
]

If we wanted more/fewer `bins` we could tweak the number up or down as needed.

---

What about `binwidth`? This specifies how wide each group must be.

.pull-left[
```r
ggplot() +
  geom_histogram(
    data = hsb2,
    aes(x = read),
    fill = "cornflowerblue",
    color = "white",
*   binwidth = 5
    ) +
  labs(
    title = "Histogram of Reading Scores",
    subtitle = "(with binwidth = 5)",
    x = "Reading Score",
    y = "Frequency"
    )
```
]

.pull-right[
<img src="Module05sp20_files/figure-html/gg2c-1.svg" width="100%" style="display: block; margin: auto;" />
]

---

If we wanted to disaggregate the histogram by one or more categorical variables, we could do so quite easily:

.pull-left[
```r
ggplot() +
  geom_histogram(
    data = hsb2,
    aes(x = read),
    fill = "cornflowerblue",
    color = "white",
*   bins = 5
    ) +
  labs(
    title = "Histogram of Reading Scores",
    subtitle = "(broken out for Male vs. Female students)",
    x = "Reading Score",
    y = "Frequency"
    ) +
* facet_wrap(~ female)
```
]

.pull-right[
<img src="Module05sp20_files/figure-html/gg3-1.svg" width="100%" style="display: block; margin: auto;" />
]

---

When we do this, it is often useful to organize the panels so that only one histogram shows up in a row. This is done with the `ncol = 1` argument.

.pull-left[
```r
ggplot() +
  geom_histogram(
    data = hsb2,
    aes(x = read),
    fill = "cornflowerblue",
    color = "white",
    bins = 5
    ) +
  labs(
    title = "Histogram of Reading Scores",
    subtitle = "(broken out for Male vs. Female students)",
    x = "Reading Score",
    y = "Frequency"
    ) +
* facet_wrap(~ female, ncol = 1)
```
]

.pull-right[
<img src="Module05sp20_files/figure-html/gg4a-1.svg" width="100%" style="display: block; margin: auto;" />
]

---

.pull-left[
```r
ggplot() +
  geom_histogram(
    data = hsb2,
    aes(x = read),
    fill = "cornflowerblue",
    color = "white",
    bins = 5
    ) +
  labs(
    title = "Histogram of Reading Scores",
    subtitle = "(broken out by Socioeconomic Status)",
    x = "Reading Score",
    y = "Frequency"
    ) +
  facet_wrap(~ ses, ncol = 1)
```
]

.pull-right[
<img src="Module05sp20_files/figure-html/gg4b-1.svg" width="100%" style="display: block; margin: auto;" />
]

Now the distributions are stacked one above the other, easing comparisons: do they have the same average? Do they vary the same? Are they similarly skewed/symmetric?
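The questions about averages, spread, and skew can also be answered with a few numbers per group. The sketch below uses the built-in `iris` data purely as a stand-in (with `Sepal.Length` playing the role of the score and `Species` the grouping variable), since `hsb2` is not loaded in this snippet; `group_summary` is an illustrative name.

```r
# iris ships with R and stands in here for hsb2:
# Sepal.Length plays the role of the score, Species the grouping variable
group_summary <- aggregate(
  Sepal.Length ~ Species,
  data = iris,
  FUN = function(x) c(mean = mean(x), median = median(x), sd = sd(x))
)

# One row per group; a mean well above the median hints at right skew
group_summary
```

These numbers complement the stacked histograms rather than replace them, since two groups can share a mean and a standard deviation and still look quite different.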
---

For breakouts with two categorical variables we could do

.pull-left[
```r
ggplot() +
  geom_histogram(
    data = hsb2,
    aes(x = read),
    fill = "cornflowerblue",
    color = "white",
    bins = 5
    ) +
  labs(
    title = "Histogram of Reading Scores",
    subtitle = "(broken out by Socioeconomic Status and School Type)",
    x = "Reading Score",
    y = "Frequency"
    ) +
  facet_wrap(ses ~ schtyp, ncol = 2)
```
]

.pull-right[
<img src="Module05sp20_files/figure-html/gg5a-1.svg" width="100%" style="display: block; margin: auto;" />
]

Note that `ses ~ schtyp` renders the panels for the first category of `ses` by all categories of `schtyp` and then repeats for the other categories in rows 2 and 3.

---

If we did `facet_wrap(schtyp ~ ses, ncol = 3)` we would have a different result:

.pull-left[
```r
ggplot() +
  geom_histogram(
    data = hsb2,
    aes(x = read),
    fill = "cornflowerblue",
    color = "white",
    bins = 5
    ) +
  labs(
    title = "Histogram of Reading Scores",
    subtitle = "(broken out by Socioeconomic Status and School Type)",
    x = "Reading Score",
    y = "Frequency"
    ) +
  facet_wrap(schtyp ~ ses, ncol = 3) +
  ylim(c(0, 23))
```
]

.pull-right[
<img src="Module05sp20_files/figure-html/gg5b-1.svg" width="100%" style="display: block; margin: auto;" />
]

Notice that here I also add a `ylim(c(...))` command to set the minimum and maximum values of the y-axis. This is useful, but remember to start the y-axis at 0, or else make a note in the plot so that readers don't assume it starts at 0 when in fact it has been truncated for ease of data presentation. A truncated axis misstates the pattern in the data, so do not do it or, again, annotate the plot to that effect so nobody is misled. Bar-charts and histograms will have 0 as the minimum y-limit by default but this is not true for some other plots.

---

## Box-plots

Remember these, our friends from MPA 6010? These can be useful for studying the distribution of a continuous variable. [See this video](https://vimeo.com/222358034).
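As a reminder of what a box-plot summarizes, here is a small sketch that computes the pieces by hand: the quartiles that form the box, and the 1.5 × IQR fences beyond which points are drawn individually as outliers. The `delays` vector is made up for illustration and is not the `cmhflights` data.

```r
# Hypothetical arrival delays in minutes, made up for illustration
delays <- c(-12, -8, -5, -3, 0, 2, 4, 7, 11, 15, 22, 95)

# The box: first quartile, median, third quartile
q <- quantile(delays, probs = c(0.25, 0.5, 0.75))
iqr <- q[["75%"]] - q[["25%"]]

# Points beyond 1.5 * IQR from the box are drawn individually as outliers
lower_fence <- q[["25%"]] - 1.5 * iqr
upper_fence <- q[["75%"]] + 1.5 * iqr
outliers <- delays[delays < lower_fence | delays > upper_fence]

q
outliers
```

Here only the 95-minute delay falls outside the fences, so it is the single point a box-plot of these values would show on its own.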
Let us see these in action with the `cmhflights` data.

.pull-left[
```r
load(
  here::here("data", "cmhflights_01092017.RData")
  )
ggplot() +
  geom_boxplot(
    data = cmhflights,
    mapping = aes(
      y = ArrDelay,
      x = ""
      ),
    fill = "cornflowerblue"
    )
```
]

.pull-right[
<img src="Module05sp20_files/figure-html/box1a-1.svg" width="100%" style="display: block; margin: auto;" />
]

---

Note:

+ the `x = ""` is in `aes()` because otherwise, with a single group, the box-plot will not build up nicely

But I prefer to see them running horizontally, so how can I do that? With `coord_flip()`, since this simply swaps the x and y axes.

.pull-left[
```r
ggplot() +
  geom_boxplot(
    data = cmhflights,
    mapping = aes(
      y = ArrDelay,
      x = ""
      ),
    fill = "cornflowerblue"
    ) +
  coord_flip()
```
]

.pull-right[
<img src="Module05sp20_files/figure-html/box1b-1.svg" width="100%" style="display: block; margin: auto;" />
]

---

And now for a slightly different data-set, one that measures male adults' hemoglobin concentration for a few populations.

.pull-left[
```r
read_csv(
  "http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter02/chap02e3cHumanHemoglobinElevation.csv"
  ) -> hemoglobin
ggplot() +
  geom_boxplot(
    data = hemoglobin,
    mapping = aes(
      x = population,
      y = hemoglobin,
      fill = population
      )
    ) +
  coord_flip() +
  labs(
    x = "Population",
    y = "Hemoglobin Concentration",
    title = "Hemoglobin Concentration in Adult Males",
    subtitle = "(Andes, Ethiopia, Tibet, USA)"
    ) +
  theme(legend.position = "none")
```
]

.pull-right[
<img src="Module05sp20_files/figure-html/box2-1.svg" width="100%" style="display: block; margin: auto;" />
]

---

Could we use our `facet_wrap(...)` here too? Of course.
.pull-left[
```r
ggplot() +
  geom_boxplot(
    data = cmhflights,
    mapping = aes(
      y = ArrDelay,
      x = Carrier
      ),
    fill = "cornflowerblue"
    ) +
  coord_flip() +
  facet_wrap(~ Month)
```
]

.pull-right[
<img src="Module05sp20_files/figure-html/box1c-1.svg" width="100%" style="display: block; margin: auto;" />
]

---

## Line-charts

If we have data over time for one or more units, then line-charts work really well to exhibit trends. A classic, current example would be the number of confirmed COVID-19 cases per country per date. For example, say we have monthly data on unemployment for the country. The `economics` data-set ships with `{ggplot2}` (which `{plotly}` also attaches when it is loaded), and its `uempmed` column is the median duration of unemployment in weeks rather than the unemployment rate.

.pull-left[
```r
library(ggplot2)
data(economics)
#names(economics)
ggplot() +
  geom_line(
    data = economics,
    mapping = aes(
      x = date,
      y = uempmed
      )
    ) +
  labs(
    x = "Date",
    y = "Median Duration of Unemployment (weeks)"
    )
```
]

.pull-right[
<img src="Module05sp20_files/figure-html/line1-1.svg" width="90%" style="display: block; margin: auto;" />
]

---

They can look very plain and aesthetically unappealing unless you dress them up. Compare the one above with the one below.

.pull-left[
```r
load(
  here::here("data", "gap.df.RData")
  )
ggplot() +
  geom_line(
    data = gap.df,
    mapping = aes(
      x = year,
      y = LifeExp,
      group = continent,
      color = continent
      )
    ) +
  geom_point(
    data = gap.df,
    mapping = aes(
      x = year,
      y = LifeExp,
      group = continent,
      color = continent
      )
    ) +
  labs(
    x = "Year",
    y = "Median Life Expectancy (in years)",
    color = ""
    ) +
  theme(legend.position = "bottom")
```
]

.pull-right[
<img src="Module05sp20_files/figure-html/line2-1.svg" width="100%" style="display: block; margin: auto;" />
]

---

## Scatter-plots

These work well if we have two continuous variables, and they highlight the nature and strength of the relationship between the two .... what happens to `y` as `x` increases?
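The "nature and strength" of a relationship can also be put into a number and a line. A minimal sketch, using the `mpg` data that ships with `{ggplot2}` as a stand-in (engine displacement versus highway mileage), since `hsb2` is not loaded in this snippet; the axis labels are my own wording.

```r
library(ggplot2)  # supplies the mpg data

# Correlation: the sign gives the direction, the magnitude the strength
r <- cor(mpg$displ, mpg$hwy)
r

# Scatter-plot with a fitted straight line as a visual summary
p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(
    x = "Engine Displacement (litres)",
    y = "Highway Miles per Gallon"
  )
p
```

A negative correlation here matches the downward-sloping line: as `x` increases, `y` tends to fall. A straight line is only one possible summary, and a curved pattern in the points would call for something else.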
.pull-left[
```r
ggplot() +
  geom_point(
    data = hsb2,
    mapping = aes(
      x = write,
      y = science
      )
    ) +
  labs(
    x = "Writing Scores",
    y = "Science Scores"
    )
```
]

.pull-right[
<img src="Module05sp20_files/figure-html/sc1-1.svg" width="100%" style="display: block; margin: auto;" />
]

---

We could highlight the different `ses` groups, to see if the relationship between writing scores and science scores differs across ses levels.

.pull-left[
```r
ggplot() +
  geom_point(
    data = hsb2,
    mapping = aes(
      x = write,
      y = science,
      color = ses
      )
    ) +
  labs(
    x = "Writing Scores",
    y = "Science Scores",
    color = ""
    ) +
  theme(legend.position = "bottom")
```
]

.pull-right[
<img src="Module05sp20_files/figure-html/sc2-1.svg" width="100%" style="display: block; margin: auto;" />
]

---

This is not very helpful, so why not break out ses into separate panels for ease of interpretation?

.pull-left[
```r
ggplot() +
  geom_point(
    data = hsb2,
    mapping = aes(
      x = write,
      y = science
      )
    ) +
  labs(
    x = "Writing Scores",
    y = "Science Scores"
    ) +
  facet_wrap(~ ses)
```
]

.pull-right[
<img src="Module05sp20_files/figure-html/sc3-1.svg" width="100%" style="display: block; margin: auto;" />
]

---

Could we add another layer, perhaps `female`?
.pull-left[
```r
ggplot() +
  geom_point(
    data = hsb2,
    mapping = aes(
      x = write,
      y = science
      )
    ) +
  labs(
    x = "Writing Scores",
    y = "Science Scores"
    ) +
  facet_wrap(ses ~ female, ncol = 2)
```
]

.pull-right[
<img src="Module05sp20_files/figure-html/sc4-1.svg" width="100%" style="display: block; margin: auto;" />
]

---

And finally, a few suggestions about how to build up your visualizations:

- 🔁 start with pencil and paper, sketch prototypes of desired visualization(s)
- 😄 graphics are relatively easy to generate with base R & with `ggplot2`
- 👏 common-sense: `number` & `type` of variable(s) guide plotting
- 🎇 stay `color conscious`: sensible colors & sensitive to color blindness
- 🔰 experiment, experiment, experiment until you are happy
- use the 🆓 learning resources available online
- 📒 if you learn something new in R, write it down

---
class: right, middle

<img class="circle" src="https://github.com/aniruhil.png" width="175px"/>

# Find me at...

[@aruhil](http://twitter.com/aruhil)  
[aniruhil.org](https://aniruhil.org)  
[ruhil@ohio.edu](mailto:ruhil@ohio.edu)