Our goal in this module is to understand some basic ways of visualizing data with graphics: bar-charts, histograms, box-plots, and so on. We will skip base R commands and instead just work with ggplot2, currently the most popular visualization package in the {tidyverse} universe.
If you remember MPA 6010, recall the usual options ...
one qualitative/categorical variable: bar-chart
one quantitative/continuous variable: histogram/box-plot/area-chart
two quantitative/continuous variables: scatter-plot/hex-bin
We can then ratchet up as we need to.
I will use two data-sets to walk through the initial examples in this module, the first being this IMDB data-set:
The Internet Movie Database, http://imdb.com/, is a website devoted to collecting movie data supplied by studios and fans. It claims to be the biggest movie database on the web and is run by Amazon. More information about imdb.com can be found online, http://imdb.com/help/show_leaf?about, including information about the data collection process, http://imdb.com/help/show_leaf?infosource.
The {ggplot2} package has a special syntax and I will point out things you should note as we move through this module. First up, the library is called ggplot2 but the command starts with ggplot so don't let that throw you off-track.
Second, you need to have a data-set to work with. In the code below I start by loading the library and then specifying the data-set to be used.
library(dplyr)    # provides the starwars data-set
library(ggplot2)

ggplot(data = starwars)
Nothing is drawn by these commands because we have not yet specified what should go on the x-axis or on the y-axis. Well, let us do that then by asking for the column eye_color to be put on the x-axis.
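A minimal sketch of that step, assuming the starwars data is taken from {dplyr}; the object name p is only for illustration:

```r
library(ggplot2)

# starwars ships with the dplyr package
starwars <- dplyr::starwars

# Map eye_color to the x-axis; no geometry is added yet,
# so only the canvas and the axis categories will appear
p <- ggplot(data = starwars, mapping = aes(x = eye_color))
p
```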
This results in a gray canvas with the eye colors on the x-axis but nothing else has been drawn since we have not specified the geometry ... do you want a bar-chart? histogram? dot-plot? line-chart? This is a categorical variable and hence a bar-chart would be appropriate. We call for a bar-chart with the geom_bar() command.
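A sketch of the bar-chart call described above, again assuming starwars comes from {dplyr}; p is an illustrative name:

```r
library(ggplot2)

starwars <- dplyr::starwars

# geom_bar() counts the rows in each eye_color category
# and draws one bar per category
p <- ggplot(data = starwars, mapping = aes(x = eye_color)) +
  geom_bar()
p
```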
The aes() refers to the aesthetics of the chart, and many other aesthetics can be added, such as group, color, fill, size, alpha, etc. We will see some of these in due course but for now I want to focus on two of these, both involving coloring of the geom_. Specifically, there are two arguments for adding colors to a chart: (1) color (or colour), and (2) fill.
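To make the first of these concrete, here is a small sketch that maps color to eye_color inside aes(). It assumes the starwars data from {dplyr}; p_color is an illustrative name:

```r
library(ggplot2)

starwars <- dplyr::starwars

# color (or colour) inside aes() maps each eye color to a border color
p_color <- ggplot(
  data = starwars,
  mapping = aes(x = eye_color, color = eye_color)
) +
  geom_bar()
p_color
```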
Note what the color = eye_color command did ... it drew a colored border for the bars, and an accompanying legend. What if we had used fill = eye_color instead?
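A sketch of the fill = eye_color variant, under the same assumption that starwars comes from {dplyr}; p_fill is an illustrative name:

```r
library(ggplot2)

starwars <- dplyr::starwars

# fill inside aes() maps each eye color to the interior color of its bar
p_fill <- ggplot(
  data = starwars,
  mapping = aes(x = eye_color, fill = eye_color)
) +
  geom_bar()
p_fill
```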
Aha! Now the bars are filled with colors and an accompanying legend is drawn as well. So fill = and color = behave very differently; bear this in mind.
1.3 Adding labels with labs
One of the nice things about this software environment is that there are plenty of coloring schemes available to us and we will play with some of these shortly, but before we do that, let us look at one more improvement: adding titles, subtitles, captions, and axis labels to our chart. This is done with the labs() command.
ggplot(data = starwars, mapping = aes(x = eye_color, fill = eye_color)) +
  geom_bar() +
  labs(x = "Eye Color", y = "Frequency (n)",
       title = "Bar-chart of Eye Colors",
       subtitle = "(of Star Wars characters)",
       caption = "My little work of art!!")
Notice the text that now appears as a result of what has been specified in the labs() command.
1.4 Controlling the chart legend with theme()
In this bar-chart, do we really need the legend? No, because the colors and color names show up in the chart itself. How can we hide the legend? Turns out there is a neat command that will allow you to move the legend around and even to hide it.
ggplot(data = starwars, mapping = aes(x = eye_color, fill = eye_color)) +
  geom_bar() +
  labs(x = "Eye Color", y = "Frequency (n)",
       title = "Bar-chart of Eye Colors",
       subtitle = "(of Star Wars characters)",
       caption = "My little work of art!!") +
  theme(legend.position = "none")
Voila! The legend is gone. Instead of "none" you could have specified "bottom", "left", "top", or "right" to place the legend on that side of the chart.
1.5 Customizing colors
Of course, it would be good to have the colors match the eye-color so let us do that next. The way we can do this is by calling specific colors by name. I have tried to order the lineup of the colors to match, as closely as I can, the eye colors.
# One color per eye_color category, in the order the bars are drawn
c("black", "blue", "slategray", "brown", "gray34", "gold",
  "greenyellow", "navajowhite1", "orange", "pink", "red",
  "magenta", "thistle3", "white", "yellow") -> mycolors

ggplot(data = starwars, mapping = aes(x = eye_color)) +
  geom_bar(fill = mycolors) +
  labs(x = "Eye Color", y = "Frequency (n)",
       title = "Bar-chart of Eye Colors",
       subtitle = "(of Star Wars characters)",
       caption = "My little work of art!!") +
  theme(legend.position = "none")
These colors are from this source but see also this source. Colors can be customized by generating your own palettes via the Color Brewer here. But don't get carried away: Remember to read the materials on choosing colors wisely, particularly the point about qualitative palettes, divergent palettes, and then palettes that work well even with colorblind audiences.
1.6 Selected color palettes
I had mentioned the existence of a number of color palettes so let us look at a few, but we will do this with a different variable. First up, the Pastel1 palette.
ggplot(data = starwars, mapping = aes(x = gender)) +
  geom_bar(aes(fill = gender)) +
  labs(x = "Gender", y = "Frequency",
       title = "Bar-chart of Gender",
       subtitle = "(of Star Wars characters)",
       caption = "(Source: The dplyr package)") +
  scale_fill_brewer(palette = "Pastel1")
Not bad but doesn't work too well here. How about trying another palette, Set1?
ggplot(data = starwars, mapping = aes(x = gender)) +
  geom_bar(aes(fill = gender)) +
  labs(x = "Gender", y = "Frequency",
       title = "Bar-chart of Gender",
       subtitle = "(of Star Wars characters)",
       caption = "(Source: The dplyr package)") +
  scale_fill_brewer(palette = "Set1")
Nice! But what is also noticeable here is that there are some characters in the data-set whose gender data is missing. These are the NA values. By default, you will see NA values showing up in some types of charts and so it is always good to exclude them from the chart. Here is one way of doing that.
ggplot(data = subset(starwars, !is.na(gender)), mapping = aes(x = gender)) +
  geom_bar(aes(fill = gender)) +
  labs(x = "Gender", y = "Frequency",
       title = "Bar-chart of Gender",
       subtitle = "(of Star Wars characters)",
       caption = "(Source: The dplyr package)") +
  scale_fill_brewer(palette = "Set1")
Notice what is different here: data = subset(starwars, !is.na(gender)) and that this command is effectively saying subset the starwars data to only include those cases where gender is not missing (this is the !is.na() portion of the command).
Another way to do the same thing would have been to use filter() and create a cleaned up copy of the data. If you take this route, be careful not to overwrite the original data-set; note how I am giving a new name (my.data) after filter() to save the results in. Then we lean on this data-set via data = my.data.
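The filter() route described above might look like the following sketch. It assumes {dplyr} is installed and uses the name my.data mentioned in the text:

```r
library(dplyr)

# Keep only the rows where gender is recorded, and save the result
# under a new name so the original starwars data-set is left untouched
my.data <- starwars %>%
  filter(!is.na(gender))
```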
There is one color palette you should remember, and this is the {viridis} color scheme that works around varying types of color blindness in the population. Here come the palettes:
ggplot(data = my.data, mapping = aes(x = gender)) +
  geom_bar(aes(fill = gender)) +
  labs(x = "Gender", y = "Frequency",
       title = "Bar-chart of Gender",
       subtitle = "(of Star Wars characters)",
       caption = "(Source: The dplyr package)") +
  scale_fill_viridis_d(option = "viridis")

ggplot(data = my.data, mapping = aes(x = gender)) +
  geom_bar(aes(fill = gender)) +
  labs(x = "Gender", y = "Frequency",
       title = "Bar-chart of Gender",
       subtitle = "(of Star Wars characters)",
       caption = "(Source: The dplyr package)") +
  scale_fill_viridis_d(option = "magma")

ggplot(data = my.data, mapping = aes(x = gender)) +
  geom_bar(aes(fill = gender)) +
  labs(x = "Gender", y = "Frequency",
       title = "Bar-chart of Gender",
       subtitle = "(of Star Wars characters)",
       caption = "(Source: The dplyr package)") +
  scale_fill_viridis_d(option = "plasma")

ggplot(data = my.data, mapping = aes(x = gender)) +
  geom_bar(aes(fill = gender)) +
  labs(x = "Gender", y = "Frequency",
       title = "Bar-chart of Gender",
       subtitle = "(of Star Wars characters)",
       caption = "(Source: The dplyr package)") +
  scale_fill_viridis_d(option = "cividis")
1.7 Themes with {ggthemes}
One can also lean on various plotting themes as shown below. These themes mimic the style of graphics popularized by some data visualization experts (e.g., Stephen Few, Edward Tufte), news-media houses (FiveThirtyEight, The Economist, The Wall Street Journal), some software packages (Excel, Stata, Google Docs), and a few others. Below I show you just a handful.
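As a rough illustration, the sketch below applies three {ggthemes} themes to the same bar-chart. It assumes {ggthemes} is installed; the object names (base_plot, p_economist, and so on) are mine, and the particular themes are just a sample.

```r
library(ggplot2)
library(ggthemes)

starwars <- dplyr::starwars

# One base chart that each theme is then added to
base_plot <- ggplot(
  data = subset(starwars, !is.na(gender)),
  mapping = aes(x = gender)
) +
  geom_bar() +
  labs(x = "Gender", y = "Frequency", title = "Bar-chart of Gender")

p_economist <- base_plot + theme_economist()        # The Economist
p_tufte     <- base_plot + theme_tufte()            # Edward Tufte
p_538       <- base_plot + theme_fivethirtyeight()  # FiveThirtyEight

p_economist
```

Printing p_tufte or p_538 shows the same data under the other two styles.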
Later on you will learn these & other ways to build advanced visualizations. For now we get to work more with ggplot2.
1.8 More with bar-charts
I want to show a few things with bar-charts now. First, we can specify things a bit differently without altering the result. For example, compare the following two pieces of code.
Notice that we switched the data = and the aes() pieces of the code but that made no difference; this is important to bear in mind because it will come in handy down the road when we need to build some advanced visualizations.
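One reading of this point, sketched with the starwars data standing in for the movie data used in the original comparison (p1 and p2 are illustrative names):

```r
library(ggplot2)

starwars <- dplyr::starwars

# Version 1: data and aesthetics declared in ggplot()
p1 <- ggplot(data = starwars, mapping = aes(x = eye_color)) +
  geom_bar()

# Version 2: data and aesthetics declared inside the geom instead
p2 <- ggplot() +
  geom_bar(data = starwars, mapping = aes(x = eye_color))

p1
p2
```

Both versions draw the same bars. The second form lets each geom carry its own data-set, which is presumably why it matters for more advanced charts.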
The plot is sub-optimal since MPAA ratings are missing for a lot of movies; those movies should be eliminated from the plot via subset(mpaa != "") or by running dplyr's filter() to create another data-set. I will lean on filter().
The order of the bars here is fortuitous in that it goes from the smallest frequency to the highest frequency, drawing the reader's eye. I said fortuitous because {ggplot2} defaults to drawing the bars in an ascending alphabetic/alphanumeric order if the variable is a character. See below for an example.
Notice the bars here do not follow in ascending/descending order of frequencies. Later on we'll learn how to order the bars with ascending/descending frequencies or by some other logic.
What about plotting relative frequencies on the y-axis rather than the frequencies? Two pieces are needed:
y = (..count..)/sum(..count..) to change the y-axis to reflect the relative frequency as a proportion, and
scale_y_continuous(labels = percent) to then multiply these proportions by 100 to get percentages as the labels rather than 0.2, 0.4, 0.6, etc.
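A sketch of these two pieces together. Two assumptions: the starwars data is swapped in for the movie data, and after_stat(count) replaces ..count.., which recent ggplot2 releases flag as deprecated:

```r
library(ggplot2)

starwars <- dplyr::starwars

# after_stat(count) is the per-bar count computed by geom_bar();
# dividing by the sum of all counts turns it into a proportion
p <- ggplot(data = starwars) +
  geom_bar(aes(x = gender, y = after_stat(count / sum(count)))) +
  # scales::percent relabels 0.2, 0.4, ... as 20%, 40%, ...
  scale_y_continuous(labels = scales::percent) +
  labs(x = "Gender", y = "Relative Frequency (%)")
p
```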
1.9 Disaggregating bar-charts for groups
Let us build a simple bar-chart with the hsb2 data we saw in Module 01. Here we first download it, label the values, save it, and then start charting it.
This is not very useful since the viewer has to estimate the relative sizes of the two colors within any given bar. That can be fixed with position = "dodge", juxtaposing the bars for the groups as a result, and the end product is much better. But note: position = "dodge" has to be put outside the aes() but still inside geom_bar() so be careful.
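The contrast between the stacked default and position = "dodge" can be sketched as follows. Since the hsb2 file is not reproduced here, toy is a small made-up stand-in with only the two columns the chart needs; its values are not real data.

```r
library(ggplot2)

# Hypothetical stand-in for hsb2: 9 male and 11 female students
toy <- data.frame(
  female = rep(c("Male", "Female"), times = c(9, 11)),
  ses = c(
    rep(c("Low", "Middle", "High"), times = c(2, 4, 3)),  # males
    rep(c("Low", "Middle", "High"), times = c(4, 5, 2))   # females
  )
)
toy$ses <- factor(toy$ses, levels = c("Low", "Middle", "High"))

# Default: the two groups are stacked within each bar
p_stacked <- ggplot(data = toy) +
  geom_bar(aes(x = ses, fill = female))

# position = "dodge" sits outside aes() but inside geom_bar()
p_dodged <- ggplot(data = toy) +
  geom_bar(aes(x = ses, fill = female), position = "dodge")

p_stacked
p_dodged
```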
What if you wanted to calculate percentages within each sex? That is, what percent of male students fall within a particular ses category, and the same thing for female students?
ggplot() +
  geom_bar(data = hsb2,
           aes(x = ses, group = female, fill = female, y = ..prop..),
           position = "dodge") +
  scale_y_continuous(labels = scales::percent) +
  labs(x = "Socioeconomic Status", y = "Relative Frequency (%)")
What about within each ses instead of within gender? That is, what if we wanted percent of Low ses that is Male versus Female, and so on?
ggplot() +
  geom_bar(data = hsb2,
           aes(x = female, group = ses, fill = ses, y = ..prop..),
           position = "dodge") +
  scale_y_continuous(labels = scales::percent) +
  # x now shows sex, so the axis label changes accordingly
  labs(x = "Sex", y = "Relative Frequency (%)")
There is some more we will do with bar-charts but for now let us set them aside and instead look at a few other charts: histograms, box-plots, and line-charts.
For histograms in ggplot2, geom_histogram() is the geometry needed but note that the default number of bins is not very useful and can be tweaked, along with other embellishments that are possible as well.
ggplot() +
  geom_histogram(data = hsb2, aes(x = read),
                 fill = "cornflowerblue", color = "white") +
  labs(title = "Histogram of Reading Scores",
       x = "Reading Score", y = "Frequency")
Note the message "stat_bin() using bins = 30. Pick better value with binwidth." It appears because numerical variables need to be grouped in order to have a histogram we can make sense of, and ggplot2 falls back on 30 groups unless told otherwise. How do you define the bins (aka the groups)? We could set bins = 5 and we could also experiment with binwidth =. Let us use bins = 5, which asks ggplot2 to create 5 groups and work out the cut-points itself.
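The two options can be compared in a short sketch. It uses simulated reading scores (toy$read is made up, not the hsb2 values), and the binwidth of 10 is an arbitrary choice for illustration:

```r
library(ggplot2)

# Simulated stand-in for the hsb2 reading scores
set.seed(42)
toy <- data.frame(read = round(rnorm(200, mean = 52, sd = 10)))

# bins = 5: ask for five groups and let ggplot2 pick the cut-points
p_bins <- ggplot() +
  geom_histogram(data = toy, aes(x = read), bins = 5,
                 fill = "cornflowerblue", color = "white")

# binwidth = 10: fix the width of each group instead of the number of groups
p_width <- ggplot() +
  geom_histogram(data = toy, aes(x = read), binwidth = 10,
                 fill = "cornflowerblue", color = "white")

p_bins
p_width
```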
If we wanted to disaggregate the histogram by one or more categorical variables, we could do so quite easily:
ggplot() +
  geom_histogram(data = hsb2, aes(x = read),
                 fill = "cornflowerblue", color = "white", bins = 5) +
  labs(title = "Histogram of Reading Scores",
       subtitle = "(broken out for Male vs. Female students)",
       x = "Reading Score", y = "Frequency") +
  facet_wrap(~female)
When we do this, it is often useful to organize the panels so that only one histogram shows up in each row. This is done with the ncol = 1 argument.
ggplot() +
  geom_histogram(data = hsb2, aes(x = read),
                 fill = "cornflowerblue", color = "white", bins = 5) +
  labs(title = "Histogram of Reading Scores",
       subtitle = "(broken out for Male vs. Female students)",
       x = "Reading Score", y = "Frequency") +
  facet_wrap(~female, ncol = 1)

ggplot() +
  geom_histogram(data = hsb2, aes(x = read),
                 fill = "cornflowerblue", color = "white", bins = 5) +
  labs(title = "Histogram of Reading Scores",
       subtitle = "(broken out by Socioeconomic Status)",
       x = "Reading Score", y = "Frequency") +
  facet_wrap(~ses, ncol = 1)
Now the distributions are stacked above each other, easing comparisons: do they have the same average? Do they vary the same? Are they similarly skewed/symmetric?
For breakouts with two categorical variables we could do
ggplot() +
  geom_histogram(data = hsb2, aes(x = read),
                 fill = "cornflowerblue", color = "white", bins = 5) +
  labs(title = "Histogram of Reading Scores",
       subtitle = "(broken out by Socioeconomic Status and School Type)",
       x = "Reading Score", y = "Frequency") +
  facet_wrap(ses ~ schtyp, ncol = 2)
Note that ses ~ schtyp renders the panels for the first category of ses by all categories of schtyp and then repeats for the other categories in rows 2 and 3. If we did facet_wrap(schtyp ~ ses, ncol = 3) we would have a different result:
ggplot() +
  geom_histogram(data = hsb2, aes(x = read),
                 fill = "cornflowerblue", color = "white", bins = 5) +
  labs(title = "Histogram of Reading Scores",
       subtitle = "(broken out by Socioeconomic Status and School Type)",
       x = "Reading Score", y = "Frequency") +
  facet_wrap(schtyp ~ ses, ncol = 3) +
  ylim(c(0, 23))
Notice that here I also add a ylim(c(...)) command to set the minimum and maximum values of the y-axis. This is useful, but do not forget to start the y-axis at 0; if you do truncate it for ease of data presentation, make a note in the plot so readers don't assume it starts at 0. A truncated axis misstates the pattern in the data, so either avoid it or annotate the plot so nobody is misled. Bar-charts and histograms will have 0 as the minimum y-limit but this is not true for some other plots.
1.11 Box-plots
Remember these, our friends from MPA 6010? These can be useful for studying the distribution of a continuous variable. See this video. Let us see these in action with the cmhflights data.
Notice that no legend is needed with fill = population. Notice also how fill = is inside aes(...) here because we are asking that each unique value seen in a variable called population be mapped to a unique color.
Could we use our facet_wrap(...) here too? Of course.
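Since the box-plot data is not reproduced here, the sketch below swaps in the mpg data-set that ships with ggplot2; the variables (hwy, drv, year) and the object names are therefore stand-ins rather than the ones used above.

```r
library(ggplot2)

# One box per drive type; fill is mapped inside aes(), and the legend is
# dropped because the x-axis already names each group
p_box <- ggplot(data = mpg, aes(x = drv, y = hwy, fill = drv)) +
  geom_boxplot() +
  labs(x = "Drive Type", y = "Highway Miles per Gallon") +
  theme(legend.position = "none")

# The same chart, broken out into one panel per model year
p_facet <- p_box + facet_wrap(~year)

p_box
p_facet
```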
If we have data over time for one or more units, then line-charts work really well to exhibit trends. A classic, current example would be the number of confirmed COVID-19 cases per country per date. For example, say we have data on the unemployment rate for the country. These data come from the {plotly} library, so we have to make sure it is installed and then load it.
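A minimal line-chart sketch, assuming the economics data-set bundled with ggplot2 as a stand-in for the unemployment data. Note that unemploy / pop is the number unemployed as a share of the total population, which is not the official unemployment rate.

```r
library(ggplot2)

# economics is a monthly US series; unemploy and pop are both in thousands
p_line <- ggplot(data = economics, aes(x = date, y = unemploy / pop)) +
  geom_line(color = "cornflowerblue") +
  scale_y_continuous(labels = scales::percent) +
  labs(x = "Date",
       y = "Unemployed (% of population)",
       title = "Unemployment over time")
p_line
```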
Scatter-plots work well if we have two continuous variables, and they highlight the nature and strength of the relationship between the two variables ... what happens to y as x increases?
We could highlight the different ses groups, to see if there is any difference in the relationship between writing scores and science scores by the different ses levels.
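A sketch of both versions of the scatter-plot. The scores below are simulated (toy is a made-up stand-in for hsb2), so the pattern they show is illustrative only:

```r
library(ggplot2)

# Simulated writing and science scores for three ses groups
set.seed(123)
toy <- data.frame(
  ses = factor(rep(c("Low", "Middle", "High"), each = 30),
               levels = c("Low", "Middle", "High")),
  write = round(rnorm(90, mean = 52, sd = 9))
)
toy$science <- round(0.6 * toy$write + rnorm(90, mean = 20, sd = 7))

# Plain scatter-plot: what happens to science scores as writing scores rise?
p_all <- ggplot(data = toy, aes(x = write, y = science)) +
  geom_point()

# Same plot with each ses group mapped to its own color
p_ses <- ggplot(data = toy, aes(x = write, y = science, color = ses)) +
  geom_point() +
  labs(x = "Writing Score", y = "Science Score",
       color = "Socioeconomic Status")

p_all
p_ses
```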
And finally, a few suggestions about how to build up your visualizations:
start with pencil and paper, sketch prototypes of desired visualization(s)
graphics are relatively easy to generate with base R & with ggplot2
common-sense: number & type of variable(s) guide plotting
stay color conscious: sensible colors & sensitive to color blindness
experiment, experiment, experiment until you are happy
use the learning resources available online
if you learn something new in R, write it down
2 Practice Exercises
2.1 Nobel Prize Winners
Georgios Karamanis gathered and shared data on Nobel prize winners over the years, with a fair amount of detail, and these data were used in the tidytuesday series a while back. These data are to be used for the questions that follow.
First create nobel.df that keeps only records starting in the year 1960, and only for the "Physics" category. Now generate an appropriate chart that shows the distribution of winners by birth_country
Now break this distribution out by gender to see how winners by country differ across gender
Now go back to nobel_winners, the full data-set, and create a simple plot that shows the distribution of prize winners by death_country, gender, and category
Construct appropriate plots that show the relationship between the following pairs of variables
Adult obesity and High school graduation
Children in poverty and High school graduation
Preventable hospital stays and Unemployment rate
2.4 Unemployment Rates
Use the unemployment data given to you (unemprate.RData) and construct appropriate plots that show the distribution of unemployment rates across years for each of the four educational attainment groups.