1 Accessing Whitlock and Schluter’s Data

There are two ways to access the data used in the textbook.

  1. Using the R Code provided by the authors. This technique imports the data from the authors’ website.
  2. Downloading the data from the authors’ website to a local directory and opening it from there. Useful if you don’t have internet access 24/7, and if the data are to be used for an assignment.

Note: A word of caution. Some times the data will be in a raw format and hence easy to work with. At other times they may be given to us in some form of a frequency table or some other tabular format. When this happens and you need to use the data for some plot or calculation I will either provide you with the code needed to flip the data into a usable format.

2 Installing and Updating R Packages

You will need several packages at differing points throughout the semester. You can install these packages using the “Tools -> Install Packages …” option in the Menu bar. This is the easiest way to do it. The packages I’d like you to download now are listed below:

install.packages("binom", dependencies = TRUE, repos = "http://cran.rstudio.com/")
install.packages("epitools", dependencies = TRUE, repos = "http://cran.rstudio.com/")
install.packages("car", dependencies = TRUE, repos = "http://cran.rstudio.com/")
install.packages("multcomp", dependencies = TRUE, repos = "http://cran.rstudio.com/")
install.packages("ggplot2", dependencies = TRUE, repos = "http://cran.rstudio.com/")
install.packages("ggmap", dependencies = TRUE, repos = "http://cran.rstudio.com/")
install.packages("ggvis", dependencies = TRUE, repos = "http://cran.rstudio.com/")
install.packages("psych", dependencies = TRUE, repos = "http://cran.rstudio.com/")

Once installed, packages reside on your computer until they get corrupted or something else breaks. Whenever you want to use a package you have installed you can call it via library(ggplot2).

Packages do get updated quite frequently so maybe once every two/three weeks you should use the “Tools -> Check for Package Updates …” option in the Menu bar to check-and-update packages.

You can use the help function in the bottom-right window of RStudio to search for the details of a data-set, a package, a command, etc. Alternatively, you can just google the R package by name and you’ll see various examples, links to the package manual, etc.

3 Working with Categorical Data

Recall that categorical data merely capture a set of attributes or traits of the sampled units. Thus, for example, the various causes of death of teens are all categorical. So also are the activities listed in the tiger data-set. If I have a single categorical variable then the best ways to map its distribution is either via (a) a Frequency Table, or with (b) a Bar Chart.

3.1 The Tiger Data

The following data relate to the activities of 88 people at the time they were killed by tigers near Chitwan National Park, Nepal, from 1979 to 2006.

tigerData <- read.csv(url("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter02/chap02e2aDeathsFromTigers.csv"))
head(tigerData)
##   person              activity
## 1      1 Disturbing tiger kill
## 2      2       Forest products
## 3      3          Grass/fodder
## 4      4       Fuelwood/timber
## 5      5          Grass/fodder
## 6      6       Forest products
table(tigerData$activity)
## 
## Disturbing tiger kill               Fishing       Forest products 
##                     5                     8                    11 
##       Fuelwood/timber          Grass/fodder               Herding 
##                     5                    44                     7 
##     Sleeping in house                Toilet               Walking 
##                     3                     2                     3
tab.T <- sort(table(tigerData$activity), decreasing = TRUE)
tab.T
## 
##          Grass/fodder       Forest products               Fishing 
##                    44                    11                     8 
##               Herding Disturbing tiger kill       Fuelwood/timber 
##                     7                     5                     5 
##     Sleeping in house               Walking                Toilet 
##                     3                     3                     2
data.frame(Frequency = tab.T)
##                       Frequency
## Grass/fodder                 44
## Forest products              11
## Fishing                       8
## Herding                       7
## Disturbing tiger kill         5
## Fuelwood/timber               5
## Sleeping in house             3
## Walking                       3
## Toilet                        2
data.frame(Frequency = addmargins(tab.T))
##                       Frequency
## Grass/fodder                 44
## Forest products              11
## Fishing                       8
## Herding                       7
## Disturbing tiger kill         5
## Fuelwood/timber               5
## Sleeping in house             3
## Walking                       3
## Toilet                        2
## Sum                          88

The table(tigerData$activity) command constructs a frequency table that is arranged in ascending order of the activity labels. This is rarely a useful ordering of the data. Instead we usually want the table sorted in descending or ascending order of the frequencies. This is achieved via the sort(table(tigerData$activity), decreasing = TRUE) and sort(table(tigerData$activity), decreasing = FALSE) commands. I have also saved the table with a name tab.T. I can improve how it is displayed in the console via the data.frame(Frequency = tab.T) command. Finally, addmargins calculates and appends the sum of the frequencies.

If I wanted to display Relative Frequencies I would have run:

tab.T2 <- prop.table(tab.T)
tab.T2
## 
##          Grass/fodder       Forest products               Fishing 
##            0.50000000            0.12500000            0.09090909 
##               Herding Disturbing tiger kill       Fuelwood/timber 
##            0.07954545            0.05681818            0.05681818 
##     Sleeping in house               Walking                Toilet 
##            0.03409091            0.03409091            0.02272727
data.frame(Relative.Frequency = tab.T2)
##                       Relative.Frequency
## Grass/fodder                  0.50000000
## Forest products               0.12500000
## Fishing                       0.09090909
## Herding                       0.07954545
## Disturbing tiger kill         0.05681818
## Fuelwood/timber               0.05681818
## Sleeping in house             0.03409091
## Walking                       0.03409091
## Toilet                        0.02272727
data.frame(Relative.Frequency = addmargins(tab.T2))
##                       Relative.Frequency
## Grass/fodder                  0.50000000
## Forest products               0.12500000
## Fishing                       0.09090909
## Herding                       0.07954545
## Disturbing tiger kill         0.05681818
## Fuelwood/timber               0.05681818
## Sleeping in house             0.03409091
## Walking                       0.03409091
## Toilet                        0.02272727
## Sum                           1.00000000

What about a bar chart (aka bar plot)? A simple one can be drawn as follows

barplot(tab.T, ylab = "Frequency", ylim=c(0,50), cex.names = 0.5, las = 2, main="Activities Linked to Fatal Tiger Attacks")

Note: cex.names is basically setting the font-size for the labels. Change the value from 0.5 to something else and see what it does. Change las to 1 and see what that does. In general, there are many more tweaks available for plots. To see these options, search for bar-plot in help.

There are better plots that can be created, and ggplot2 is by far the best option currently available. Make sure the ggplot2 package was installed on your machine and if it was installed, load the library via the library(ggplot2) command and then create a bar plot as follows:

library(ggplot2)
ggplot(tigerData, aes(x=activity)) + geom_bar() 

You can learn more about the various plot options and different graphics that can be constructed via ggplot2. To see what you can do with ggplot2 check out this cookbook

3.2 Education Spending

Below we read in the data on education spending per student in different years in British Columbia.

educationSpending = read.csv(url("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter02/chap02f1_3EducationSpending.csv"))

barplot(educationSpending$spendingPerStudent, names.arg = educationSpending$year, ylab = "Education spending ($ per student)")
barplot(educationSpending$spendingPerStudent, names.arg = educationSpending$year, las = 1, col = "firebrick", ylim = c(0,8000),
    ylab = "Education spending ($ per student)")

Two bar-plots are shown, the first a plain vanilla plot and the second a fancier version of the same bar-plot.

3.3 - Manipulating Data Provided as a Frequency Table

See these data; they are given to us in a different format than we might usually encounter. Specifically, what we have here are the data provided to us as a frequency table.

library(abd)
data(TeenDeaths) 
names(TeenDeaths) 
## [1] "cause"  "deaths"
colnames <- c("Cause", "Freq") 
names(TeenDeaths) <- colnames 

What we can do, however, is to reconstruct the raw data. This is accomplished by the code below. Once this code has run you’ll see TeenDeaths in the upper-right window of RStudio. Open it by clicking on TeenDeaths; a familiar looking table will be visible. Notice that this is a frequency table. But I want to expand this into a proper data-set so that I can then work with the raw data as needed. The code below will do this for us.

teen = as.data.frame( rep(TeenDeaths$cause, TeenDeaths$deaths) )
head(teen)
## data frame with 0 columns and 0 rows
# The code below does the same thing and is shown purely for demonstration purposes
expand.dft <- function(x, na.strings = "NA", as.is = FALSE, dec = ".")
{
  # Take each row in the source data frame table and replicate it
  # using the Freq value
DF <- sapply(1:nrow(x), function(i) x[rep(i, each = x$Freq[i]), ], simplify = FALSE)
  # Take the above list and rbind it to create a single DF
  # Also subset the result to eliminate the Freq column
DF <- subset(do.call("rbind", DF), select = -Freq)
  # Now apply type.convert to the character coerced factor columns
  # to facilitate data type selection for each column
DF <- as.data.frame(lapply(DF, function(x) type.convert(as.character(x),  na.strings = na.strings, as.is = as.is, dec = dec)))
  # Return data frame
DF
}

teen <- expand.dft(TeenDeaths) # run the function on TeenDeaths and save the resulting file under a different name ("teen")

Now open the teen data frame and see how it looks similar to the tigerDeaths data. We can now calculate frequency tables, bar-plots, etc with ease.

3.4 Another Example – Endangered Species

library(abd)
data(EndangeredSpecies)
names(EndangeredSpecies)
## [1] "taxon"       "num.species"
species = as.data.frame( rep(EndangeredSpecies$taxon, EndangeredSpecies$num.species) )
colnames(species) = "taxon"

table.1 <- table(species)  
table.1
## species
##  Amphibians   Arachnids       Birds       Clams Crustaceans        Fish 
##          22          12          92          70          21         115 
##     Insects     Mammals      Plants    Reptiles      Snails 
##          44          74         745          36          32
table.2 = sort(table.1) 

addmargins(sort(table.1, decreasing=TRUE)) 
## species
##      Plants        Fish       Birds     Mammals       Clams     Insects 
##         745         115          92          74          70          44 
##    Reptiles      Snails  Amphibians Crustaceans   Arachnids         Sum 
##          36          32          22          21          12        1263
barplot(table.2, las=2, ylab="Frequency", xlab="Taxon")
ggplot(species, aes(x=taxon)) + geom_bar() + ylab("Frequency") + xlab("Taxon")

Note: The preceding bar-plot doesn’t need the labels to be abbreviated because they are small enough. Else we would have had to use the labels=abbreviate command.

4 Working with Numerical Data

Now lets switch to a numerical variable. Two renditions are valuable here … (1) a grouped frequency table and (2) a histogram. We’ll start with the grouped frequency table

Numerical variables will, by definition, assume a wide range of values. These values will often be too many to be able to construct a meaningful plot with the raw data. Let us see what this means with the following data:

data(iris) # See http://en.wikipedia.org/wiki/Iris_flower_data_set for details
names(iris)
## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width" 
## [5] "Species"
table(iris$Sepal.Length)
## 
## 4.3 4.4 4.5 4.6 4.7 4.8 4.9   5 5.1 5.2 5.3 5.4 5.5 5.6 5.7 5.8 5.9   6 
##   1   3   1   4   2   5   6  10   9   4   1   6   7   6   8   7   3   6 
## 6.1 6.2 6.3 6.4 6.5 6.6 6.7 6.8 6.9   7 7.1 7.2 7.3 7.4 7.6 7.7 7.9 
##   6   4   9   7   5   2   8   3   4   1   1   3   1   1   1   4   1

Notice how useless is the resulting table. This is so because each value of Sepal Length is being used to construct the Frequency Table. Nobody is really ever interested in this level of granularity. Instead, it would make sense to construct groups of Sepal Lengths and see how many data points fall within each group. How many groups we should create will be a matter of trial-and-error.

range(iris$Sepal.Length) # Find minimum and maximum values
## [1] 4.3 7.9
breaks <- seq(4.0, 8.0, by=0.5) # Create splitting points for the groups
breaks # See the breaks
## [1] 4.0 4.5 5.0 5.5 6.0 6.5 7.0 7.5 8.0
SL.cut <- cut(iris$Sepal.Length, breaks, right=FALSE) # Create the groups
SL.cut # See what group each observation falls in
##   [1] [5,5.5) [4.5,5) [4.5,5) [4.5,5) [5,5.5) [5,5.5) [4.5,5) [5,5.5)
##   [9] [4,4.5) [4.5,5) [5,5.5) [4.5,5) [4.5,5) [4,4.5) [5.5,6) [5.5,6)
##  [17] [5,5.5) [5,5.5) [5.5,6) [5,5.5) [5,5.5) [5,5.5) [4.5,5) [5,5.5)
##  [25] [4.5,5) [5,5.5) [5,5.5) [5,5.5) [5,5.5) [4.5,5) [4.5,5) [5,5.5)
##  [33] [5,5.5) [5.5,6) [4.5,5) [5,5.5) [5.5,6) [4.5,5) [4,4.5) [5,5.5)
##  [41] [5,5.5) [4.5,5) [4,4.5) [5,5.5) [5,5.5) [4.5,5) [5,5.5) [4.5,5)
##  [49] [5,5.5) [5,5.5) [7,7.5) [6,6.5) [6.5,7) [5.5,6) [6.5,7) [5.5,6)
##  [57] [6,6.5) [4.5,5) [6.5,7) [5,5.5) [5,5.5) [5.5,6) [6,6.5) [6,6.5)
##  [65] [5.5,6) [6.5,7) [5.5,6) [5.5,6) [6,6.5) [5.5,6) [5.5,6) [6,6.5)
##  [73] [6,6.5) [6,6.5) [6,6.5) [6.5,7) [6.5,7) [6.5,7) [6,6.5) [5.5,6)
##  [81] [5.5,6) [5.5,6) [5.5,6) [6,6.5) [5,5.5) [6,6.5) [6.5,7) [6,6.5)
##  [89] [5.5,6) [5.5,6) [5.5,6) [6,6.5) [5.5,6) [5,5.5) [5.5,6) [5.5,6)
##  [97] [5.5,6) [6,6.5) [5,5.5) [5.5,6) [6,6.5) [5.5,6) [7,7.5) [6,6.5)
## [105] [6.5,7) [7.5,8) [4.5,5) [7,7.5) [6.5,7) [7,7.5) [6.5,7) [6,6.5)
## [113] [6.5,7) [5.5,6) [5.5,6) [6,6.5) [6.5,7) [7.5,8) [7.5,8) [6,6.5)
## [121] [6.5,7) [5.5,6) [7.5,8) [6,6.5) [6.5,7) [7,7.5) [6,6.5) [6,6.5)
## [129] [6,6.5) [7,7.5) [7,7.5) [7.5,8) [6,6.5) [6,6.5) [6,6.5) [7.5,8)
## [137] [6,6.5) [6,6.5) [6,6.5) [6.5,7) [6.5,7) [6.5,7) [5.5,6) [6.5,7)
## [145] [6.5,7) [6.5,7) [6,6.5) [6.5,7) [6,6.5) [5.5,6)
## 8 Levels: [4,4.5) [4.5,5) [5,5.5) [5.5,6) [6,6.5) [6.5,7) ... [7.5,8)
SL.freq <- table(SL.cut)
SL.freq # See the Grouped Frequency Table
## SL.cut
## [4,4.5) [4.5,5) [5,5.5) [5.5,6) [6,6.5) [6.5,7) [7,7.5) [7.5,8) 
##       4      18      30      31      32      22       7       6

Note: The option right=FALSE says construct the groups to be closed on the left and open on the right. Open intervals do not include the endpoints; only values between the endpoints will be included in the group. Closed intervals will include values that equal the endpoints and values that fall between the endpoints. Thus, in this example, a value of 4 will be bundled into the \([4,4.5)\) group but a value of 4.5 will be reserved for the \([4.5,5)\) group, and so on.

Now lets construct the histogram for these grouped Sepal Lengths.

hist(iris$Sepal.Length, ylab="Frequency", xlab="Sepal Length", ylim=c(0,35), breaks=breaks, right=FALSE, col="gray", main="Grouped Histogram of Sepal Length")

We’ve ignored the fact that there are three species here, and they may have different sepal lengths. We can fix that as follows:

histogram(~ Sepal.Length | Species, data=iris, ylab="Frequency", xlab="Sepal Length", ylim=c(0,60), freq=TRUE, breaks=breaks, right=FALSE,  main="", col="gray", layout = c(1,3))

This is essentially how we could construct histograms for some numerical variable that has been measured for different groups. The example below does this for the Hemoglobin data that document hemoglobin concentration in men living at high altitude in three different parts of the world and in a sea level population (USA).

hemoglobinData = read.csv(url("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter02/chap02e3cHumanHemoglobinElevation.csv"))
histogram(~ hemoglobin | population, data = hemoglobinData,
  layout = c(1,4), col = "firebrick", breaks = seq(10, 26, by=1), xlab="Hemoglobin")

What follows is a demonstration of how differing intervals (how wide each group or bin is) changes the shape of the histogram. The data refer to the body mass of 228 female sockeye salmon sampled from Pick Creek in Alaska. The same data are plotted for three different interval widths: 0.1 kg, 0.3 kg, and 0.5 kg. We will first read the data into R, check the first 6 rows of the data-set, and then construct the three histograms, each with its own interval width.

salmonSizeData = read.csv(url("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter02/chap02f2_5SalmonBodySize.csv"))
head(salmonSizeData)
##   year   sex oceanAgeYears lengthMm massKg
## 1 1996 FALSE             3      513  3.090
## 2 1996 FALSE             3      513  2.909
## 3 1996 FALSE             3      525  3.056
## 4 1996 FALSE             3      501  2.690
## 5 1996 FALSE             3      513  2.876
## 6 1996 FALSE             3      501  2.978
hist(salmonSizeData$massKg, right = FALSE, breaks = seq(1, 4, by=0.1), col = "gray", xlab="Mass (in kg)", main="Salmon Size")
hist(salmonSizeData$massKg, right = FALSE, breaks = seq(1, 4, by=0.3), col = "gray", xlab="Mass (in kg)", main="Salmon Size")
hist(salmonSizeData$massKg, right = FALSE, breaks = seq(1, 4, by=0.5), col = "gray", xlab="Mass (in kg)", main="Salmon Size")

4.1 The Bird Abundance Data

These data document the distribution of bird species abundance at Organ Pipe Cactus National Monument.

birdAbundanceData = read.csv(url("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter02/chap02e2bDesertBirdAbundance.csv"))

birdAbundanceTable = table(cut(birdAbundanceData$abundance, breaks = seq(0, 650, by=50), right = FALSE))

birdAbundanceTable
## 
##    [0,50)  [50,100) [100,150) [150,200) [200,250) [250,300) [300,350) 
##        28         4         3         3         1         2         1 
## [350,400) [400,450) [450,500) [500,550) [550,600) [600,650) 
##         0         0         0         0         0         1
data.frame(Frequency = addmargins(birdAbundanceTable))
##    Frequency.Var1 Frequency.Freq
## 1          [0,50)             28
## 2        [50,100)              4
## 3       [100,150)              3
## 4       [150,200)              3
## 5       [200,250)              1
## 6       [250,300)              2
## 7       [300,350)              1
## 8       [350,400)              0
## 9       [400,450)              0
## 10      [450,500)              0
## 11      [500,550)              0
## 12      [550,600)              0
## 13      [600,650)              1
## 14            Sum             43
hist(birdAbundanceData$abundance, right = FALSE)
hist(birdAbundanceData$abundance, right = FALSE, breaks = seq(0, 650, by=50),   col = "firebrick", las = 1, xlab = "Abundance (No. individuals)",  ylab = "Frequency (No. species)", main = "")

Note: The option right = FALSE ensures that abundance value 300 (for example) is counted in the 300-400 bin rather than in the 200-300 bin.

5 What if I have two variables to work with?

  1. When both variables are categorical

If both variables are categorical then we can use contingency tables to tabulate the data and see what associations show up. We can also use bar-charts and mosaic plots to graph the data. Let us start with the the contingency table. We will use the avian malaria data-set, and start by reading in the data.

birdMalariaData = read.csv(url("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter02/chap02e3aBirdMalaria.csv"))
head(birdMalariaData)
##   bird treatment response
## 1    1   Control  Malaria
## 2    2   Control  Malaria
## 3    3   Control  Malaria
## 4    4   Control  Malaria
## 5    5   Control  Malaria
## 6    6   Control  Malaria

You will see three variables but the bird number is basically an ID variable that would allow us to go back t our field notes and see what bird was numbered 1, and so on. As such the key variables are the treatment (i.e., whether the bird was in the Control group or in the Egg Removal group), and the response (i.e., whether the test was positive or negative for avian malaria).

The commands below will construct the needed contingency table:

tab.A = table(birdMalariaData$response, birdMalariaData$treatment)
tab.A
##             
##              Control Egg removal
##   Malaria          7          15
##   No Malaria      28          15
addmargins(tab.A)
##             
##              Control Egg removal Sum
##   Malaria          7          15  22
##   No Malaria      28          15  43
##   Sum             35          30  65
tab.B1 = prop.table(tab.A, 1)
tab.B1
##             
##                Control Egg removal
##   Malaria    0.3181818   0.6818182
##   No Malaria 0.6511628   0.3488372
tab.B2 = prop.table(tab.A, 2)
tab.B2
##             
##              Control Egg removal
##   Malaria        0.2         0.5
##   No Malaria     0.8         0.5

Note: prop.table(tab.A, 1) is basically calculating the relative frequencies for each row. Running prop.table(tab.A, 2) is basically calculating the relative frequencies for each column. Looking at tab.B1 shows us quite clearly that about 68% of the Egg removal group got malaria, versus only about 32% of the Control group; presumably reproduction does weaken the immune system.

Now let us draw a bar-plot for both variables.

barplot(as.matrix(tab.A), beside = TRUE, ylim=c(0,30))

The mosaic plot follows:

mosaicplot( t(tab.A), sub = "Treatment", ylab = "Relative Frequency", cex.axis = 0.8, main = "Mosaic Plot of Avian Malaria Data", col=c("black", "gray"))
  1. When both variables are numerical

There is no logical way to construct a frequency table for these data. Yes, we could group each of the numerical variables and then construct something like the contingency table but that would be risky; grouping both might misrepresent the data. Instead we will focus strictly on graphical renditions of any possible relationship between both variables. This is done via the scatter-plot.

guppyFatherSonData = read.csv(url("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter02/chap02e3bGuppyFatherSonAttractiveness.csv"))
head(guppyFatherSonData)
##   fatherOrnamentation sonAttractiveness
## 1                0.35             -0.32
## 2                0.03             -0.03
## 3                0.14              0.11
## 4                0.10              0.28
## 5                0.22              0.31
## 6                0.23              0.18
plot(sonAttractiveness ~ fatherOrnamentation, data = guppyFatherSonData)

Now I’ll pretty it up a bit:

plot(sonAttractiveness ~ fatherOrnamentation, data = guppyFatherSonData, xlab="Father's Ornamentation", ylab="Son's Attractiveness", col="black", ylim=c(-1.5, 1.5), xlim=c(0, 1.3))
  1. What if I have a numerical variable measured for two or more groups?

One example might be the hemoglobin concentration data. The best way to show differences between the groups would be to stack the histograms one above the other (see below).

hemoglobinData = read.csv(url("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter02/chap02e3cHumanHemoglobinElevation.csv"))
head(hemoglobinData)
##              id hemoglobin population
## 1 US.Sea.level1      10.40        USA
## 2 US.Sea.level2      11.20        USA
## 3 US.Sea.level3      11.70        USA
## 4 US.Sea.level4      11.80        USA
## 5 US.Sea.level5      11.90        USA
## 6 US.Sea.level6      12.05        USA
library(lattice)
histogram(~ hemoglobin | population, data = hemoglobinData,
  layout = c(1,4), col = "gray", breaks = seq(10, 26, by=1))

Note: The layout = c(1,4) option says stack them into1 column and 4 rows. Because this is a numerical variable we are constructing bins of width = 1. This is being done via the breaks = seq(10, 20, by=1) command. The | option says construct the histogram of hemoglobin for each group listed in the population variable. Be careful; if you get an error message that says the lattice library is not found be sure to install it first.

6 A Numerical Variable Measured Over Time (aka a Time Series)

The Measles data – showing confirmed cases of measles in England and Wales from 1995 to 2011, measured every quarter (i.e., at three month intervals of January-March, April-June, July-September, and October-December) – is an excellent example. So is the Lynx data-set.

measlesData = read.csv(url("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter02/chap02e4aMeaslesOutbreaks.csv"))
head(measlesData)
##   year quarter confirmedCases yearByQuarter
## 1 2011     4th            136       2011.88
## 2 2011     3rd            154       2011.62
## 3 2011     2nd            346       2011.38
## 4 2011     1st            151       2011.12
## 5 2010     4th             31       2010.88
## 6 2010     3rd            134       2010.62
plot(confirmedCases ~ yearByQuarter, data = measlesData, type="b", xlab="Year", ylab="Confirmed Number of Cases", main="Measles Outbreak by Year (1995.Q1 - 2010.Q4)")

Note: Here the ~ symbol is being used to say plot confirmedCases against yearByQuarter. The type=“b” says construct a dot for each time point and then link successive dots with a line. If we switched type=“b” with type=“l” we would get just a straight line connecting successive points. Try it for yourself.

We could also create more interactive graphs of the measles data (see below).

library(plotly)
## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:graphics':
## 
##     layout
p = plot_ly(data=measlesData, x=yearByQuarter, y=confirmedCases, showlegend = FALSE)
m = loess(confirmedCases ~ as.numeric(yearByQuarter), data = measlesData)
p = add_trace(p, y = fitted(m), name = "loess", showlegend = FALSE)
p 

7 Strip Charts

Strip Charts are useful at times, particularly when there is a need to show all individual values in the data-set. Below we construct a strip chart for data on serotonin levels in the central nervous system of desert locusts that were experimentally crowded for 0 (the control group), 1, and 2 hours. We’ll begin as usual; reading in the data.

locustData = read.csv(url("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter02/chap02f1_2locustSerotonin.csv"))

head(locustData)
##   serotoninLevel treatmentTime
## 1            5.3             0
## 2            4.6             0
## 3            4.5             0
## 4            4.3             0
## 5            4.2             0
## 6            3.6             0
stripchart(serotoninLevel ~ treatmentTime, data = locustData, method = "jitter", vertical = TRUE, ylim=c(0,25), ylab="Seratonin Level", xlab="Crowded for 0, 1, 2 hrs")
par(bty = "l") # plot x and y axes only, not a complete box

stripchart(serotoninLevel ~ treatmentTime, data = locustData, vertical = TRUE, method = "jitter", pch = 16, col = "firebrick", cex = 1.5, las = 1, ylab = "Serotonin (pmoles)", xlab = "Treatment time (hours)", ylim = c(0, max(locustData$serotoninLevel)))
# The following command calculates the means in each treatment group (time)
meanSerotonin = tapply(locustData$serotoninLevel, locustData$treatmentTime, mean)
# "segments" draws draws lines to indicate the means
segments(x0 = c(1,2,3) - 0.1, y0 = meanSerotonin, x1 = c(1,2,3) + 0.1, 
    y1 = meanSerotonin, lwd = 2)

In my experience most folks are more familiar with histograms than they are with stripcharts. Plus the need to nudge the data points a little bit (called “jittering” in technical parlance) distorts the picture a bit and one should avoid distortions as much as is possible. However, if you see strip charts used a lot in your field or sub-field then I would learn how to construct them.

Note: I have not covered pie-charts; avoid them at all cost. Box-plots are very useful for displaying data but I will cover these when we get into Chapter 3 and talk about order statistics.

8 Some Examples from Practice Problems in Chapter 02

8.1 Problem # 6

We start by loading up the endangered species data.

EndangeredSpecies = read.csv(url("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter02/chap02q06EndangeredSpecies.csv"))

Now I want to address sub-question (a):

colnames(EndangeredSpecies) = c("Taxon", "Freq")
species = expand.dft(EndangeredSpecies)
tab.A = table(species)
tab.A
## species
##  Amphibians   Arachnids       Birds       Clams Crustaceans        Fish 
##          22          12          92          70          21         115 
##     Insects     Mammals      Plants    Reptiles      Snails 
##          44          74         745          36          32
tab.B = sort(tab.A, decreasing=FALSE)
tab.B
## species
##   Arachnids Crustaceans  Amphibians      Snails    Reptiles     Insects 
##          12          21          22          32          36          44 
##       Clams     Mammals       Birds        Fish      Plants 
##          70          74          92         115         745
tab.C = data.frame(Frequency = tab.B)
tab.C
##             Frequency
## Arachnids          12
## Crustaceans        21
## Amphibians         22
## Snails             32
## Reptiles           36
## Insects            44
## Clams              70
## Mammals            74
## Birds              92
## Fish              115
## Plants            745

In brief, I have sorted the frequency table in increasing order of the Number of Species. That is, the more endangered species are listed first. This makes intuitive sense since most readers will, when primed by the word “endangered”, want to know which species are the most endangered so we might as well present these species first.

  1. I have constructed a frequency table

  2. The best graph would be a bar-plot:

barplot(tab.B, ylab="Frequency", xlab="Taxon", ylim=c(0, 800), cex.names=0.8)

Given that we are dealing with categorical data the bar-plot is the obvious graphic of choice.

  1. The baseline should be \(0\) since any other value would misrepresent the minimum value possible for any taxon, and also distort the plot.

  2. The relative frequency is shown below:

tab.D = prop.table(tab.B)
tab.E = data.frame(Rel.Frequency = tab.D)
tab.E
##             Rel.Frequency
## Arachnids     0.009501188
## Crustaceans   0.016627078
## Amphibians    0.017418844
## Snails        0.025336500
## Reptiles      0.028503563
## Insects       0.034837688
## Clams         0.055423595
## Mammals       0.058590657
## Birds         0.072842439
## Fish          0.091053048
## Plants        0.589865400

8.2 Problem #17

Read in the data.

anemone = read.csv(url("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter02/chap02q17AnemonePersonality.csv"))
anemone
##    anemoneNumber startleResponse1 startleResponse2
## 1              1             1065              939
## 2              2              248              268
## 3              3              436              460
## 4              4              350              261
## 5              5              378              368
## 6              6              410              467
## 7              7              232              303
## 8              8              201              188
## 9              9              267              401
## 10            10              687              690
## 11            11              688              711
## 12            12              980              571
  1. The best graph would be a scatter-plot:
plot(startleResponse2 ~ startleResponse1, data=anemone, ylim=c(0,1000), xlab="Startle Response 1", ylab="Startle Response 2", pch=16)
  1. The scatter-plot shows a positive association between startle response times of sea anemones.

8.3 Problem #18

  1. The frequency distribution is drawn below:
response1 = anemone$startleResponse1
response1
##  [1] 1065  248  436  350  378  410  232  201  267  687  688  980
hist(response1, xlab="Response 1", col="gray", main="Histogram of Response 1")
  1. The histogram shows the distribution of Response 1 times to be positively skewed (i.e., skewed right).