class: title-slide, center, middle background-image: url(images/ouaerial.jpeg) background-size: cover # .fat[.fancy[Visualizing Data in R]] ## .fat[.fancy[Ani Ruhil]] --- name: agenda ## Agenda Visualizing data - the grammar of graphics - working with `ggplot2` - color consciousness --- class: inverse, center, middle # .heat[.fancy[ Visualizing data ]] --- # .fancy[ graphics in R ] Three common ways to generate graphics in R are via - `base R` - `lattice` (little development in recent years though) - `ggplot2` We will skip `base R` graphics since `ggplot2` will be the graphics package for this class. Let us see how we use it, starting with a simple bar-chart. Remember the basic options... - one qualitative/categorical variable: `bar-chart` - one quantitative/continuous variable: `histogram/box-plot/area-chart` - two quantitative/continuous variables: `scatter-plot/hex-bin` --- I will use two data-sets, the first being this [IMDB data-set](http://imdb.com/) > The internet movie database, http://imdb.com/, is a website devoted to collecting movie data supplied by studios and fans. It claims to be the biggest movie database on the web and is run by Amazon. More information about imdb.com can be found online, http://imdb.com/help/show_leaf?about, including information about the data collection process, http://imdb.com/help/show_leaf?infosource. ```r library(ggplot2movies) ``` A data frame with 58788 rows and 24 variables - title. Title of the movie. - year. Year of release. - budget. Total budget (if known) in US dollars - length. Length in minutes. - rating. Average IMDB user rating. - votes. Number of IMDB users who rated this movie. - r1-10. Multiplying by ten gives percentile (to nearest 10%) of users who rated this movie a 1 (for r1) through a 10 (for r10). - mpaa. MPAA rating (missing for a lot of movies) - action, animation, comedy, drama, documentary, romance, short. Binary variables representing if movie was classified as belonging to that genre. 
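To check the description above against the data itself, a minimal sketch along these lines may help. It assumes the `ggplot2movies` and `dplyr` packages are installed; nothing here goes beyond inspecting the data frame.

```r
library(ggplot2movies)  # provides the movies data frame
library(dplyr)          # provides glimpse() and count()

data(movies)

# Number of rows and columns
dim(movies)

# Variable names, types, and the first few values of each column
glimpse(movies)

# How many movies fall in each MPAA category ("" means the rating is missing)
count(movies, mpaa)
```

The last call makes it easy to see how many movies have no MPAA rating at all, which matters once we start plotting that variable.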
--- The second data-set is the [Star Wars dataset](https://swapi.co), a `tibble` with 87 rows and 13 variables: ```r library(dplyr) data(starwars) ``` - name: Name of the character - height: Height (cm) - mass: Weight (kg) - hair_color,skin_color,eye_color: Hair, skin, and eye colors - birth_year: Year born (BBY = Before Battle of Yavin) - gender: male, female, hermaphrodite, or none. - homeworld: Name of homeworld - species: Name of species - films: List of films the character appeared in - vehicles: List of vehicles the character has piloted - starships: List of starships the character has piloted <img src= "images/hex-tibble.png", width = 50px> a `tibble` you say? --- ## data frames vs. tibbles .pull-left[ R's default is to store a `data frame`, as shown below with a small example and there is a tendency to convert characters into factors, change column names, etc. ```r data.frame( `Some Letters` = c("A", "B", "C"), `Some Numbers` = c(1, 2, 3) ) -> adf str(adf) ``` ``` ## 'data.frame': 3 obs. of 2 variables: ## $ Some.Letters: Factor w/ 3 levels "A","B","C": 1 2 3 ## $ Some.Numbers: num 1 2 3 ``` ```r print(adf) ``` ``` ## Some.Letters Some.Numbers ## 1 A 1 ## 2 B 2 ## 3 C 3 ``` ] .pull-right[ `tibbles` is the brainchild of the team behind a bundle of packages (and RStudio) called the `tidyverse` that drop R's bad habits ```r tibble( `Some Letters` = c("A", "B", "C"), `Some Numbers` = c(1, 2, 3) ) -> atib glimpse(atib) ``` ``` ## Observations: 3 ## Variables: 2 ## $ `Some Letters` <chr> "A", "B", "C" ## $ `Some Numbers` <dbl> 1, 2, 3 ``` ```r print(atib) ``` ``` ## # A tibble: 3 x 2 ## `Some Letters` `Some Numbers` ## <chr> <dbl> ## 1 A 1 ## 2 B 2 ## 3 C 3 ``` ] --- class: inverse, center, middle .salt[.fancy[.large[ Questions?? 
]]] ![](./images/questions.gif) --- #### <img src = "images/hex-ggplot2.png", width = 60px> and the [grammar of graphics](http://vita.had.co.nz/papers/layered-grammar.html) <center><img src = "images/grammarofgraphics.png", width = 400px></center> ```r library(ggplot2) ``` --- `qplot` will generate a quick plot but `ggplot2` is the way to go so we build with it .pull-left[ ```r ggplot(data = starwars) ``` <img src="Module02_files/figure-html/Module02-3-1.png" width="10%" style="display: block; margin: auto;" /> Nothing results since we have not specified how we want the variable(s) to be `mapped` to the coordinate system ... what variable should go on what axis? ] .pull-right[ ```r ggplot(data = starwars, * mapping = aes(x = eye_color) ) ``` <img src="Module02_files/figure-html/Module02-4-1.png" width="10%" style="display: block; margin: auto;" /> Now we are getting somewhere. We see the canvas with the specific eye colors on the x-axis but nothing else has been drawn since we have not specified the `geometry` ... do you want a bar-chart? histogram? dot-plot? line-chart? ] ??? - point out that they should have read the material on the `grammar of graphics` ... 
not essential to understand `ggplot2` but helpful if they want to specialize in visualizations --- .pull-left[ With a `categorical variable` the `bar-chart` would be appropriate and so we ask for a `geom_bar()` ```r ggplot(data = starwars, mapping = aes(x = eye_color)) + geom_bar() ``` Other `aesthetics` can be added, such as * `group` * `color` * `fill` * `size` * `alpha` * and then axis labels, plot title/subtitle etc Two commands for adding a color scheme -- (a) `color` or `colour` (b) `fill` You can also change color palettes, roll your own, and more ] .pull-right[ <img src="Module02_files/figure-html/unnamed-chunk-1-1.png" width="100%" style="display: block; margin: auto;" /> ] --- .pull-left[ ```r ggplot(data = starwars, mapping = aes(x = eye_color, colour = eye_color)) + geom_bar() + labs(x = "Eye Color", y = "Frequency", title = "Bar-chart of Eye Color", subtitle = "(of Star Wars characters)") ``` Note what `colour = ` generated for us `labs()` is allowing us to customize the axis labels, add a title and subtitle. If you wanted a caption, you could add a line of code, something like `caption = "lorem ipsum"` ] .pull-right[ <img src="Module02_files/figure-html/unnamed-chunk-2-1.png" width="100%" style="display: block; margin: auto;" /> ] This is not a very good plot of course. .heatinline[ Why is that? ] --- .pull-left[ ```r ggplot(data = starwars, mapping = aes(x = eye_color, fill = eye_color)) + geom_bar() + labs(x = "Eye Color", y = "Frequency", title = "Bar-chart of Eye Color", subtitle = "(of Star Wars characters)") ``` Note what `fill = ` generated for us Of course, it would be good to have the fill colors match the eye-color so let us do that next We can also eliminate the legend. .heatinline[ Why is that? ] ] .pull-right[ <img src="Module02_files/figure-html/unnamed-chunk-3-1.png" width="100%" style="display: block; margin: auto;" /> ] ??? 
- Point out how important it is to label x-axis, y-axis, add a title - Remind them about setting y limit to start at 0 or then making a note in the plot for readers so they don't assume it is at 0 when in fact it has been truncated for ease of data presentation. If this misstates the pattern in the data, they should not do it or again, annotate the plot to that effect so nobody is misled --- .pull-left[ ```r c("black", "blue", "slategray", "brown", "gray34", "gold", "greenyellow", "navajowhite1", "orange", "pink", "red", "magenta", "thistle3", "white", "yellow") -> mycolors ggplot(data = starwars, mapping = aes(x = eye_color)) + geom_bar(fill = mycolors) + labs(x = "Eye Color", y = "Frequency", title = "Bar-chart of Eye Color", subtitle = "(of Star Wars characters)", caption = "Much better!") ``` R Colors used from [this source](http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf) but see also [this source](https://www.nceas.ucsb.edu/~frazier/RSpatialGuides/colorPaletteCheatsheet.pdf) ] .pull-right[ <img src="Module02_files/figure-html/unnamed-chunk-4-1.png" width="100%" style="display: block; margin: auto;" /> ] We can still do better. .heatinline[ How can we improve this? ] -- `think in terms of the frequencies of eye-color` ??? 
- Colors can be customized by generating your own palettes - [Color Brewer is here](http://colorbrewer2.org/#type=sequential&scheme=YlGnBu&n=3) - Remind them to read the materials on choosing colors wisely, particularly the point about qualitative palettes, divergent palettes, and then palettes that work well even with colorblind audiences --- <img src="Module02_files/figure-html/forcats-1-1.png" width="80%" style="display: block; margin: auto;" /> --- I'll switch to a different variable and show you how to use `prebuilt color palettes` .pull-left[ ```r ggplot(data = starwars, mapping = aes(x = gender)) + geom_bar(aes(fill = gender)) + labs(x = "Gender", y = "Frequency", title = "Bar-chart of Gender", subtitle = "(of Star Wars characters)", caption = "(Source: The dplyr package)") + scale_fill_brewer(palette = "Pastel1") ``` <img src="Module02_files/figure-html/Module02-7-1.png" width="70%" style="display: block; margin: auto;" /> ] .pull-right[ ```r library(wesanderson) ggplot(data = starwars, mapping = aes(x = gender)) + geom_bar(aes(fill = gender)) + labs(x = "Gender", y = "Frequency", title = "Bar-chart of Gender", subtitle = "(of Star Wars characters)", caption = "(Source: The dplyr package)") + scale_fill_manual(values = wes_palette("Darjeeling1")) ``` <img src="Module02_files/figure-html/Module02-8-1.png" width="70%" style="display: block; margin: auto;" /> ] - `scale_fill_brewer` in the plot on the left, calling on built-in color palettes. You can [review them here](http://ggplot2.tidyverse.org/reference/scale_brewer.html) - `scale_fill_manual` in the plot on the right and a specific palette is being invoked from [the wesanderson package](https://github.com/karthik/wesanderson) ??? 
- Using the wes anderson palette - Adding the `caption = ` argument to source the data - Switching from `scale_fill_brewer()` to `scale_fill_manual()` --- One can also lean on various plotting `themes` as shown below .pull-left[ ```r library(ggthemes) ggplot(data = starwars, aes(x = eye_color)) + geom_bar() + theme_tufte() + theme(axis.text.x = element_text(size = 6)) -> p1 ggplot(data = starwars, aes(x = eye_color)) + geom_bar() + theme_solarized() + theme(axis.text.x = element_text(size = 6)) -> p2 ggplot(data = starwars, aes(x = eye_color)) + geom_bar() + theme_economist() + theme(axis.text.x = element_text(size = 6)) -> p3 ggplot(data = starwars, aes(x = eye_color)) + geom_bar() + theme_fivethirtyeight() + theme(axis.text.x = element_text(size = 6)) -> p4 library(patchwork) p1 + p2 + p3 + p4 + plot_layout(ncol = 2) ``` ] .pull-right[ <img src="Module02_files/figure-html/unnamed-chunk-5-1.png" width="100%" style="display: block; margin: auto;" /> ] Note: I am no longer spelling out `mapping = `; `aes(...)` is simply passed as the second argument Later on you will learn these & other ways to build advanced visualizations ...for now we get to work more with `ggplot2` --- class: inverse, center, middle ## more with bar-charts --- .pull-left[ ```r library(ggplot2) ggplot(data = movies, aes(x = mpaa)) + geom_bar() + theme_minimal() ``` <img src="Module02_files/figure-html/bar01-1.png" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ ```r library(ggplot2) ggplot(data = movies) + geom_bar(aes(x = mpaa)) + theme_minimal() ``` <img src="Module02_files/figure-html/bar02-1.png" width="100%" style="display: block; margin: auto;" /> ] Notice how the `aes(x = mpaa)` is placed differently in the two commands and yet yields identical plots --- MPAA ratings are missing for a lot of movies so we eliminate these from the plot via `subset(movies, mpaa != "")` ```r str(movies$mpaa) ``` ``` ## chr [1:58788] "" "" "" "" "" "" "R" "" "" "" "" "" "" "" "PG-13" "PG-13" "" "" "" "" ... 
``` ```r ggplot(subset(movies, mpaa != ""), aes(x = mpaa)) + geom_bar() + theme_minimal() ``` <img src="Module02_files/figure-html/bar2-1.svg" width="40%" style="display: block; margin: auto;" /> The order of the bars is fortuitous in that it goes from the smallest frequency to the largest frequency. --- I said fortuitous because `ggplot2's default is to order the bars in an ascending alphabetic/alphanumeric sequence` if the variable is a **character**. See below for an example. ```r df = tibble(x = c(rep("A", 2), rep("B", 4), rep("C", 1))) ggplot(data = df, aes(x = x)) + geom_bar() + theme_minimal() ``` <img src="Module02_files/figure-html/gg1-1.svg" width="40%" style="display: block; margin: auto;" /> Later on we'll learn how to order the bars with ascending/descending frequencies or by some other logic --- What about plotting the `relative frequencies` on the y-axis rather than the frequencies? .pull-left[ ```r ggplot(data = subset(movies, mpaa != ""), aes(x = mpaa, * y = (..count..)/sum(..count..))) + geom_bar() + * scale_y_continuous(labels = scales::percent) + labs(x = "MPAA Rating", y = "Relative Frequency (%)") + theme_minimal() ``` `Note`: I used - `y = (..count..)/sum(..count..)` - `scale_y_continuous(labels = scales::percent)` ] .pull-right[ <img src="Module02_files/figure-html/unnamed-chunk-6-1.png" width="100%" style="display: block; margin: auto;" /> ] --- We could also add a second or even a third/fourth categorical variable. Let us see this with our `hsb2` data-set ```r library(here) load("data/hsb2.RData") ggplot(data = hsb2, aes(x = ses, group = female)) + geom_bar(aes(fill = female)) + theme_minimal() ``` <img src="Module02_files/figure-html/Module02-11-1.svg" width="45%" style="display: block; margin: auto;" /> This is not very useful since the viewer has to estimate the relative sizes of the two colors within any given bar. 
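The `hsb2` data come from a local file, so here is a rough, self-contained sketch of the same stacked bar-chart. The `toy` data frame is a hypothetical stand-in that I simulate for illustration (it is not the real `hsb2` data), and the category labels are assumptions. Printing the counts alongside the plot shows exactly what the viewer is being asked to estimate from the stacked segments.

```r
library(ggplot2)
library(dplyr)

# Hypothetical stand-in for hsb2: 200 simulated students (NOT the real data)
set.seed(42)
toy <- tibble(
  ses    = factor(sample(c("Low", "Middle", "High"), 200, replace = TRUE),
                  levels = c("Low", "Middle", "High")),
  female = factor(sample(c("Male", "Female"), 200, replace = TRUE))
)

# The counts a viewer would otherwise have to read off the stacked segments
count(toy, ses, female)

# The same kind of stacked bar-chart as above, built on the stand-in data
p_stack <- ggplot(data = toy, aes(x = ses, group = female)) +
  geom_bar(aes(fill = female)) +
  theme_minimal()
p_stack
```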
--- ```r ggplot(data = hsb2, aes(x = ses, group = female)) + geom_bar(aes(fill = female), position = "dodge") + theme_minimal() ``` <img src="Module02_files/figure-html/Module02-12-1.svg" width="50%" style="display: block; margin: auto;" /> `position = "dodge"` will juxtapose the bars, and this is much better. --- This is fine if you want to know what percent of the 200 students are low SES males, low SES females, etc. What if you wanted to calculate `percentages within each sex?` ```r ggplot(data = hsb2, aes(x = ses, y = female)) + geom_bar(aes(group = female, fill = female, y = ..prop..), position = "dodge") + scale_y_continuous(labels = scales::percent) + labs(y = "Relative Frequency (%)", x = "Socioeconomic Status Groups") + theme_minimal() ``` <img src="Module02_files/figure-html/Module02-13-1.svg" width="50%" style="display: block; margin: auto;" /> --- What about `within each ses?` ```r ggplot(data = hsb2, aes(x = female, y = ses)) + geom_bar(aes(group = ses, fill = ses, y = ..prop..), position = "dodge") + scale_y_continuous(labels = scales::percent) + labs(y = "Relative Frequency (%)", x = "Socioeconomic Status Groups") + theme_minimal() ``` <img src="Module02_files/figure-html/Module02-14-1.svg" width="50%" style="display: block; margin: auto;" /> --- class: inverse, center, middle ## histograms ??? 
- They can [visit](http://tinlizzie.org/histograms/) to remind themselves of all the various pieces that build up a histogram and the meaning/value behind each component - They should look at [Yau's piece here](https://flowingdata.com/2014/02/27/how-to-read-histograms-and-use-them-in-r/) and [here](https://flowingdata.com/2017/06/07/how-histograms-work/) --- `geom_histogram()` does the trick but note that the default number of bins is not very useful and can be tweaked, along with other embellishments ```r ggplot(data = hsb2, aes(x = read)) + geom_histogram(fill = "cornflowerblue", color = "white") + ggtitle("Histogram of Reading Scores") + labs(x = "Reading Score", y = "Frequency") + theme_minimal() ``` <img src="Module02_files/figure-html/gg2a-1.svg" width="50%" style="display: block; margin: auto;" /> --- See `bins = 5` and also experiment with `binwidth = ` ```r ggplot(data = hsb2, aes(x = read)) + geom_histogram(fill = "cornflowerblue", color = "white", bins = 5) + ggtitle("Histogram of Reading Scores") + labs(x = "Reading Score", y = "Frequency") + theme_minimal() ``` <img src="Module02_files/figure-html/gg2b-1.svg" width="50%" style="display: block; margin: auto;" /> --- If we wanted to break out the histogram by one or more categorical variables, we could do so ```r ggplot(hsb2, aes(x = read)) + geom_histogram(fill = "cornflowerblue", bins = 5, color = "white") + ggtitle("Histogram of Reading Scores") + labs(x = "Reading Score", y = "Frequency") + facet_wrap(~ female) + theme_minimal() ``` <img src="Module02_files/figure-html/gg3-1.svg" width="50%" style="display: block; margin: auto;" /> --- Or better yet, .pull-left[ ```r ggplot(hsb2, aes(x = read)) + geom_histogram(fill="cornflowerblue", bins = 10, color = "white") + ggtitle("Histogram of Reading Scores") + labs(x = "Reading Score", y = "Frequency") + facet_wrap(~ female, ncol = 1) + theme_minimal() ``` Why? ... 
the distributions are stacked above each other, making for an easier comparison ] .pull-right[ <img src="Module02_files/figure-html/unnamed-chunk-7-1.png" width="100%" style="display: block; margin: auto;" /> ] --- One useful design element with breakouts is `placing in relief the consolidated data (i.e., the distribution without break-outs)` .pull-left[ ```r ggplot(data = hsb2, aes(x = read, fill = female)) + geom_histogram(bins = 10, color = "white") + ggtitle("Histogram of Reading Scores") + labs(x = "Reading Score", y = "Frequency") + facet_wrap(~ female, ncol = 1) + geom_histogram(data = dplyr::select(hsb2, -female), bins = 10, fill = "grey", alpha = .5) + theme_minimal() ``` ] .pull-right[ <img src="Module02_files/figure-html/unnamed-chunk-8-1.png" width="100%" style="display: block; margin: auto;" /> ] Here it is obvious that the distribution of reading scores for either sex is similar to the overall distribution so perhaps the groups are not really that different in terms of reading scores --- For breakouts with two categorical variables we could do .pull-left[ ```r ggplot(data = hsb2, aes(x = read)) + geom_histogram(fill="cornflowerblue", bins = 10, color = "white") + ggtitle("Histogram of Reading Scores") + labs(x = "Reading Score", y = "Frequency") + facet_wrap(~ female + schtyp, ncol = 2) + theme_minimal() ``` Note that `~ female + schtyp` renders the panels for the first category of female by all categories of schtyp and then repeats for the other category of female ] .pull-right[ <img src="Module02_files/figure-html/unnamed-chunk-9-1.png" width="110%" style="display: block; margin: auto;" /> ] --- .pull-left[ ```r ggplot(data = hsb2, aes(x = read)) + geom_histogram(fill="cornflowerblue", bins = 10, color = "white") + ggtitle("Histogram of Reading Scores") + labs(x = "Reading Score", y = "Frequency") + * facet_wrap(schtyp ~ female, ncol = 2) + theme_minimal() ``` Note that `schtyp ~ female` renders the panels for the first category of schtyp for all categories of 
female and then repeats for the other category of schtyp ] .pull-right[ <img src="Module02_files/figure-html/unnamed-chunk-10-1.png" width="110%" style="display: block; margin: auto;" /> ] --- ... which is the same as ... ```r ggplot(data = hsb2, aes(x = read)) + geom_histogram(fill = "cornflowerblue", bins = 10, color = "white") + ggtitle("Histogram of Reading Scores") + labs(x = "Reading Score", y = "Frequency") + * facet_wrap(~ schtyp + female, ncol = 2) + theme_minimal() ``` <img src="Module02_files/figure-html/gg5c-1.svg" width="50%" style="display: block; margin: auto;" /> --- class: inverse, middle, center ## Ridge-plots --- These were all the rage in the summer of 2017, and initially named `joy plots` but the unfortunate connection with the source of the plots led the name to be revised to `ridge-plots`. If you are curious, see [why not joy?](http://serialmentor.com/blog/2017/9/15/goodbye-joyplots). You need to have installed the `ggridges` package but other than that, they are easy to craft. .pull-left[ ```r library(viridis) library(ggridges) library(ggthemes) ggplot(lincoln_weather, aes(x = `Mean Temperature [F]`, y = `Month`)) + geom_density_ridges(scale = 3, alpha = 0.3, aes(fill = Month)) + labs(x = "Mean Temperature", y = "Month", title = 'Temperatures in Lincoln NE', subtitle = 'Mean temperatures (Fahrenheit) by month for 2016', caption = "Data: The Weather Underground") + theme_ridges() + theme(axis.title.y = element_blank(), legend.position = "none") ``` ] .pull-right[ <img src="Module02_files/figure-html/unnamed-chunk-11-1.png" width="100%" style="display: block; margin: auto;" /> ] --- Here is another one, mapping the distribution of hemoglobin in four populations (the US being the reference group) as part of a study looking at the impact of altitude on hemoglobin concentration (courtesy Whitlock and Schluter). 
.pull-left[ ```r hemo <- read.csv(url("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter02/chap02e3cHumanHemoglobinElevation.csv")) ggplot(hemo, aes(x = hemoglobin, y = population)) + geom_density_ridges(scale = 3, alpha = 0.3, aes(fill = population)) + labs(title = 'Hemoglobin Concentration Levels', subtitle = 'in Four populations') + theme_ridges() + theme(axis.title.y = element_blank(), legend.position = "none") ``` ] .pull-right[ <img src="Module02_files/figure-html/unnamed-chunk-12-1.png" width="100%" style="display: block; margin: auto;" /> ] As should be evident, they are visually appealing when comparing a large number of groups on a single continuous variable, where a simple `facet_wrap()` or other options would be suboptimal --- class: inverse, middle, center ## Box-plots --- These can be useful to look at the distribution of a continuous variable .pull-left[ ```r ggplot(hemo, aes(y = hemoglobin, x = "")) + geom_boxplot(fill = "cornflowerblue") + coord_flip() + labs(x = "", y = "Hemoglobin Concentration") + theme_minimal() ``` Note - the `x = ""` in `aes()` because otherwise with a single group the box-plot will not build up - `coord_flip()` is flipping the x-axis and y-axis ] .pull-right[ <img src="Module02_files/figure-html/unnamed-chunk-13-1.png" width="100%" style="display: block; margin: auto;" /> ] --- .pull-left[ ```r ggplot(hemo, aes(y = hemoglobin, x = population, fill = population)) + geom_boxplot() + coord_flip() + labs(x = "", y = "Hemoglobin Concentration") + theme_minimal() + theme(axis.title.y = element_blank(), legend.position = "none") ``` Notice that no legend is needed with `fill = population` since the axis already labels each group; `theme()` has to come after `theme_minimal()`, otherwise the complete theme resets `legend.position` ] .pull-right[ <img src="Module02_files/figure-html/unnamed-chunk-14-1.png" width="110%" style="display: block; margin: auto;" /> ] --- class: middle, inverse, center ## line charts --- `Ideal use:` time series data .pull-left[ ```r library(plotly) data(economics) # names(economics) ggplot(economics, aes(x = date, y = uempmed)) + 
geom_line() + labs(x = "Date", y = "Median Duration of Unemployment (weeks)") + theme_minimal() ``` ] .pull-right[ <img src="Module02_files/figure-html/unnamed-chunk-15-1.png" width="110%" style="display: block; margin: auto;" /> ] --- .pull-left[ ```r load("data/gap.df.RData") ggplot(gap.df, aes(x = year, y = LifeExp, group = continent, color = continent)) + geom_line() + geom_point() + labs(x = "Year", y = "Median Life Expectancy (in years)", color = "") + theme_minimal() + theme(legend.position = "bottom") ``` `Note:` I switched off the legend title via `color = ""` ] .pull-right[ <img src="Module02_files/figure-html/unnamed-chunk-16-1.png" width="100%" style="display: block; margin: auto;" /> ] Pretty ugly!! --- but we can dress these up by using `plotly` ```r library(plotly) myplot <- plot_ly(economics, x = ~date) %>% add_trace(y = ~uempmed, type = "scatter", mode = "lines", line = list(color = 'black'), name = "Median Duration of Unemployment (weeks)") %>% add_trace(y = ~psavert, type = "scatter", mode = "lines", line = list(color = 'red'), name = "Personal Savings Rate") %>% layout(autosize = FALSE, width = 600, height = 300) library(shiny) div(myplot, align = "center") ```
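Another route to an interactive chart is `ggplotly()` from the `plotly` package, which converts an existing `ggplot2` object instead of rebuilding the chart trace by trace. The sketch below assumes the `economics` data that ship with `ggplot2`; the object names `p_line` and `p_interactive` are just illustrative.

```r
library(ggplot2)
library(plotly)

# Build the static line chart first (uempmed = median duration of
# unemployment, in weeks, per the ggplot2 documentation for economics)
p_line <- ggplot(economics, aes(x = date, y = uempmed)) +
  geom_line() +
  labs(x = "Date", y = "Median Duration of Unemployment (weeks)") +
  theme_minimal()

# Convert the ggplot object into an interactive plotly widget
p_interactive <- ggplotly(p_line)
p_interactive
```

This keeps all the `ggplot2` styling you have already worked out, at the cost of less fine-grained control than building the chart directly with `plot_ly()`.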
--- class: center, inverse, middle ## Two quantitative variables --- scatter-plots ... ```r load("data/hsb2.RData") ggplot(hsb2, aes(x = write, y = science)) + geom_point() + labs(x = "Writing Scores", y = "Science Scores") + theme_minimal() ``` <img src="Module02_files/figure-html/sc1-1.svg" width="50%" style="display: block; margin: auto;" /> --- color by `ses` ```r ggplot(hsb2, aes(x = write, y = science)) + * geom_point(aes(color = ses)) + labs(x = "Writing Scores", y = "Science Scores") + theme_minimal() + theme(legend.position = "bottom") ``` <img src="Module02_files/figure-html/sc2-1.svg" width="50%" style="display: block; margin: auto;" /> --- breakout ses for ease of interpretation ```r ggplot(hsb2, aes(x = write, y = science)) + geom_point() + labs(x = "Writing Scores", y = "Science Scores") + * facet_wrap(~ ses) + theme_minimal() ``` <img src="Module02_files/figure-html/sc3-1.svg" width="50%" style="display: block; margin: auto;" /> --- make it interactive with `plotly` ```r p = plot_ly(hsb2, x = ~write, y = ~science, color = ~ses) %>% layout(autosize = FALSE, width = 600, height = 300) div(p, align = "center") ```
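Since `hsb2` is loaded from a local file, here is a minimal sketch of the same three steps (plain scatter-plot, color by a categorical variable, break out by that variable) on the `mpg` data that ship with `ggplot2`. The choice of `displ`, `hwy` and `drv` is my own stand-in for the writing scores, science scores and `ses` used above.

```r
library(ggplot2)

# Shared canvas: two quantitative variables mapped to the axes
base <- ggplot(mpg, aes(x = displ, y = hwy)) +
  labs(x = "Engine Displacement (litres)", y = "Highway Mileage (mpg)") +
  theme_minimal()

# Step 1: plain scatter-plot
p_plain <- base + geom_point()

# Step 2: color the points by a categorical variable
p_color <- base +
  geom_point(aes(color = drv)) +
  theme(legend.position = "bottom")

# Step 3: break the categories out into separate panels
p_facet <- base +
  geom_point() +
  facet_wrap(~ drv)

p_plain
p_color
p_facet
```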
--- `count plots`: show the frequency of each pair of values through the size of the points. The more frequent a pair, the larger its point. ```r data(mpg, package = "ggplot2") ggplot(mpg, aes(x = cty, y = hwy)) + * geom_count(col = "firebrick", show.legend = FALSE) + labs(subtitle = "City vs Highway mileage", y = "Highway mileage", x = "City mileage") + theme_minimal() ``` <img src="Module02_files/figure-html/count1-1.svg" width="45%" style="display: block; margin: auto;" /> --- `Boston Marathon` data ```r boston = read.csv("./data/BostonMarathon.csv") boston2 = boston[sample(nrow(boston), 200), ] ggplot(boston2, aes(x = Age, y = finishtime, group = M.F)) + geom_count(aes(color = M.F), show.legend = FALSE) + labs(subtitle = "", y = "Finishing Times (in seconds)", x = "Age (in years)") + facet_wrap(~ M.F, ncol = 1) + theme_minimal() ``` <img src="Module02_files/figure-html/count2-1.svg" width="45%" style="display: block; margin: auto;" /> --- class: center, inverse, middle ## hexbins --- - scatter-plots are not helpful when many data points overlap - hex-bins carve up the plotting grid into hexagons of equal size, count how many `\(x,y\)` pairs fall in each hexagon, and use a color scheme (like a heatmap) to show where hexagons have more data versus less. 
```r ggplot(data = diamonds, aes(y = price, x = carat)) + geom_hex() + labs(x = "Weight in Carats", y = "Price") + theme_minimal() ``` <img src="Module02_files/figure-html/hex1-1.svg" width="45%" style="display: block; margin: auto;" /> --- - adding a third variable ```r ggplot(data = diamonds, aes(y = price, x = carat)) + geom_hex() + labs(x = "Weight in Carats", y = "Price") + facet_wrap(~ color) + theme_minimal() ``` <img src="Module02_files/figure-html/hex2-1.svg" width="50%" style="display: block; margin: auto;" /> --- .left-column[ ## `ggplot2` ### rules & resources ] .right-column[ ## `ggplot(data, aes()) + geom_(aes()) + ...` `aes()` will take `x =`, `y =`, `fill = `, `color =`, `group =`, `size =`, `alpha =`, `radius =` and more each `geom` has its own components plenty of themes available; [see, e.g., `ggthemes` here](https://yutannihilation.github.io/allYourFigureAreBelongToUs/ggthemes/) don't forget to stay in touch with development of [ggplot2 extensions](http://www.ggplot2-exts.org/index.html) of course, [the plotly site](https://plot.ly/r/) and Carson Sievert's [plotly book](https://plotly-book.cpsievert.me) keep [ggplot2 cheatsheet handy](https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf) join [stackoverflow](https://stackoverflow.com) but if you ask a question, `post with a MWE` ] ??? 
MWE = minimum working example --- ## Some helpful resources [The Data Visualization Catalogue](https://datavizcatalogue.com) developed by Severino Ribecca to create a library of different information visualisation types [The R Graph Catalog](http://shinyapps.stat.ubc.ca/r-graph-catalog/) maintained by Joanna Zhao and Jenny Bryan is always useful to see code-and-resulting-figure [Ferdio's Data Visualization Project](http://datavizproject.com), a website trying to present all relevant data visualizations, so you can find the right visualization and get inspired how to make them [The Chartmaker Directory](http://chartmaker.visualisingdata.com) will offer an answer to one of the most common questions in data visualisation: `which tool do you need to make that chart?` [Emery's Essentials](http://annkemery.com/essentials/) focusing on the charts that give you the best bang for your buck [The Data Visualization Checklist](http://annkemery.com/wp-content/uploads/2016/10/DataVizChecklist_May2016.pdf) by Ann K. Emery & Stephanie Evergreen [Data Visualization: A Practical Introduction](https://kjhealy.github.io/socviz/) by Kieran Healy --- layout: false .left-column[ ## Remember ... 
] .right-column[ 🔁 start with pencil and paper, sketch prototypes of desired visualization(s) 😄 graphics are relatively easy to generate with base R & with `ggplot2` 👏 common-sense: `number` & `type` of variable(s) guide plotting 🎇 stay `color conscious`: sensible colors & sensitive to color blindness 🔰 experiment, experiment, experiment until you are happy use the 🆓 learning resources available online 📒 if you learn something new in R, write it down ] --- ## Some relevant RStudio webinars - [The grammar of graphics](https://vimeo.com/223812632) - [The boxplot](https://vimeo.com/222358034) - [The histogram](https://vimeo.com/221607341) Check out `Data Camp` and the blogosphere for plenty of examples --- class: right, middle <img class="circle" src="https://github.com/aniruhil.png" width="175px"/> # Find me at... [<svg style="height:0.8em;top:.04em;position:relative;" viewBox="0 0 512 512"><path d="M459.37 151.716c.325 4.548.325 9.097.325 13.645 0 138.72-105.583 298.558-298.558 298.558-59.452 0-114.68-17.219-161.137-47.106 8.447.974 16.568 1.299 25.34 1.299 49.055 0 94.213-16.568 130.274-44.832-46.132-.975-84.792-31.188-98.112-72.772 6.498.974 12.995 1.624 19.818 1.624 9.421 0 18.843-1.3 27.614-3.573-48.081-9.747-84.143-51.98-84.143-102.985v-1.299c13.969 7.797 30.214 12.67 47.431 13.319-28.264-18.843-46.781-51.005-46.781-87.391 0-19.492 5.197-37.36 14.294-52.954 51.655 63.675 129.3 105.258 216.365 109.807-1.624-7.797-2.599-15.918-2.599-24.04 0-57.828 46.782-104.934 104.934-104.934 30.213 0 57.502 12.67 76.67 33.137 23.715-4.548 46.456-13.32 66.599-25.34-7.798 24.366-24.366 44.833-46.132 57.827 21.117-2.273 41.584-8.122 60.426-16.243-14.292 20.791-32.161 39.308-52.628 54.253z"/></svg> @aruhil](http://twitter.com/aruhil) [<svg style="height:0.8em;top:.04em;position:relative;" viewBox="0 0 512 512"><path d="M326.612 185.391c59.747 59.809 58.927 155.698.36 214.59-.11.12-.24.25-.36.37l-67.2 67.2c-59.27 59.27-155.699 59.262-214.96 0-59.27-59.26-59.27-155.7 
0-214.96l37.106-37.106c9.84-9.84 26.786-3.3 27.294 10.606.648 17.722 3.826 35.527 9.69 52.721 1.986 5.822.567 12.262-3.783 16.612l-13.087 13.087c-28.026 28.026-28.905 73.66-1.155 101.96 28.024 28.579 74.086 28.749 102.325.51l67.2-67.19c28.191-28.191 28.073-73.757 0-101.83-3.701-3.694-7.429-6.564-10.341-8.569a16.037 16.037 0 0 1-6.947-12.606c-.396-10.567 3.348-21.456 11.698-29.806l21.054-21.055c5.521-5.521 14.182-6.199 20.584-1.731a152.482 152.482 0 0 1 20.522 17.197zM467.547 44.449c-59.261-59.262-155.69-59.27-214.96 0l-67.2 67.2c-.12.12-.25.25-.36.37-58.566 58.892-59.387 154.781.36 214.59a152.454 152.454 0 0 0 20.521 17.196c6.402 4.468 15.064 3.789 20.584-1.731l21.054-21.055c8.35-8.35 12.094-19.239 11.698-29.806a16.037 16.037 0 0 0-6.947-12.606c-2.912-2.005-6.64-4.875-10.341-8.569-28.073-28.073-28.191-73.639 0-101.83l67.2-67.19c28.239-28.239 74.3-28.069 102.325.51 27.75 28.3 26.872 73.934-1.155 101.96l-13.087 13.087c-4.35 4.35-5.769 10.79-3.783 16.612 5.864 17.194 9.042 34.999 9.69 52.721.509 13.906 17.454 20.446 27.294 10.606l37.106-37.106c59.271-59.259 59.271-155.699.001-214.959z"/></svg> aniruhil.org](https://aniruhil.org) [<svg style="height:0.8em;top:.04em;position:relative;" viewBox="0 0 512 512"><path d="M476 3.2L12.5 270.6c-18.1 10.4-15.8 35.6 2.2 43.2L121 358.4l287.3-253.2c5.5-4.9 13.3 2.6 8.6 8.3L176 407v80.5c0 23.6 28.5 32.9 42.5 15.8L282 426l124.6 52.2c14.2 6 30.4-2.9 33-18.2l72-432C515 7.8 493.3-6.8 476 3.2z"/></svg> ruhil@ohio.edu](mailto:ruhil@ohio.edu)