There are two ways to access the data used in the textbook.
Note: A word of caution. Some times the data will be in a raw format and hence easy to work with. At other times they may be given to us in some form of a frequency table or some other tabular format. When this happens and you need to use the data for some plot or calculation I will either provide you with the code needed to flip the data into a usable format.
You will need several packages at differing points throughout the semester. You can install these packages using the “Tools -> Install Packages …” option in the Menu bar. This is the easiest way to do it. The packages I’d like you to download now are listed below:
install.packages("binom", dependencies = TRUE, repos = "http://cran.rstudio.com/")
install.packages("epitools", dependencies = TRUE, repos = "http://cran.rstudio.com/")
install.packages("car", dependencies = TRUE, repos = "http://cran.rstudio.com/")
install.packages("multcomp", dependencies = TRUE, repos = "http://cran.rstudio.com/")
install.packages("ggplot2", dependencies = TRUE, repos = "http://cran.rstudio.com/")
install.packages("ggmap", dependencies = TRUE, repos = "http://cran.rstudio.com/")
install.packages("ggvis", dependencies = TRUE, repos = "http://cran.rstudio.com/")
install.packages("psych", dependencies = TRUE, repos = "http://cran.rstudio.com/")
Once installed, packages reside on your computer until they get corrupted or something else breaks. Whenever you want to use a package you have installed you can call it via library(ggplot2)
.
Packages do get updated quite frequently so maybe once every two/three weeks you should use the “Tools -> Check for Package Updates …” option in the Menu bar to check-and-update packages.
You can use the help
function in the bottom-right window of RStudio to search for the details of a data-set, a package, a command, etc. Alternatively, you can just google the R package by name and you’ll see various examples, links to the package manual, etc.
Recall that categorical data merely capture a set of attributes or traits of the sampled units. Thus, for example, the various causes of death of teens are all categorical. So also are the activities listed in the tiger data-set. If I have a single categorical variable then the best ways to map its distribution is either via (a) a Frequency Table, or with (b) a Bar Chart.
The following data relate to the activities of 88 people at the time they were killed by tigers near Chitwan National Park, Nepal, from 1979 to 2006.
tigerData <- read.csv(url("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter02/chap02e2aDeathsFromTigers.csv"))
head(tigerData)
## person activity
## 1 1 Disturbing tiger kill
## 2 2 Forest products
## 3 3 Grass/fodder
## 4 4 Fuelwood/timber
## 5 5 Grass/fodder
## 6 6 Forest products
table(tigerData$activity)
##
## Disturbing tiger kill Fishing Forest products
## 5 8 11
## Fuelwood/timber Grass/fodder Herding
## 5 44 7
## Sleeping in house Toilet Walking
## 3 2 3
tab.T <- sort(table(tigerData$activity), decreasing = TRUE)
tab.T
##
## Grass/fodder Forest products Fishing
## 44 11 8
## Herding Disturbing tiger kill Fuelwood/timber
## 7 5 5
## Sleeping in house Walking Toilet
## 3 3 2
data.frame(Frequency = tab.T)
## Frequency
## Grass/fodder 44
## Forest products 11
## Fishing 8
## Herding 7
## Disturbing tiger kill 5
## Fuelwood/timber 5
## Sleeping in house 3
## Walking 3
## Toilet 2
data.frame(Frequency = addmargins(tab.T))
## Frequency
## Grass/fodder 44
## Forest products 11
## Fishing 8
## Herding 7
## Disturbing tiger kill 5
## Fuelwood/timber 5
## Sleeping in house 3
## Walking 3
## Toilet 2
## Sum 88
The table(tigerData$activity)
command constructs a frequency table that is arranged in ascending order of the activity labels. This is rarely a useful ordering of the data. Instead we usually want the table sorted in descending or ascending order of the frequencies. This is achieved via the sort(table(tigerData$activity), decreasing = TRUE)
and sort(table(tigerData$activity), decreasing = FALSE)
commands. I have also saved the table with a name tab.T
. I can improve how it is displayed in the console via the data.frame(Frequency = tab.T)
command. Finally, addmargins
calculates and appends the sum of the frequencies.
If I wanted to display Relative Frequencies I would have run:
tab.T2 <- prop.table(tab.T)
tab.T2
##
## Grass/fodder Forest products Fishing
## 0.50000000 0.12500000 0.09090909
## Herding Disturbing tiger kill Fuelwood/timber
## 0.07954545 0.05681818 0.05681818
## Sleeping in house Walking Toilet
## 0.03409091 0.03409091 0.02272727
data.frame(Relative.Frequency = tab.T2)
## Relative.Frequency
## Grass/fodder 0.50000000
## Forest products 0.12500000
## Fishing 0.09090909
## Herding 0.07954545
## Disturbing tiger kill 0.05681818
## Fuelwood/timber 0.05681818
## Sleeping in house 0.03409091
## Walking 0.03409091
## Toilet 0.02272727
data.frame(Relative.Frequency = addmargins(tab.T2))
## Relative.Frequency
## Grass/fodder 0.50000000
## Forest products 0.12500000
## Fishing 0.09090909
## Herding 0.07954545
## Disturbing tiger kill 0.05681818
## Fuelwood/timber 0.05681818
## Sleeping in house 0.03409091
## Walking 0.03409091
## Toilet 0.02272727
## Sum 1.00000000
What about a bar chart (aka bar plot)? A simple one can be drawn as follows
barplot(tab.T, ylab = "Frequency", ylim=c(0,50), cex.names = 0.5, las = 2, main="Activities Linked to Fatal Tiger Attacks")
Note: cex.names
is basically setting the font-size for the labels. Change the value from 0.5 to something else and see what it does. Change las
to 1 and see what that does. In general, there are many more tweaks available for plots. To see these options, search for bar-plot
in help.
There are better plots that can be created, and ggplot2
is by far the best option currently available. Make sure the ggplot2 package was installed on your machine and if it was installed, load the library via the library(ggplot2)
command and then create a bar plot as follows:
library(ggplot2)
ggplot(tigerData, aes(x=activity)) + geom_bar()
You can learn more about the various plot options and different graphics that can be constructed via ggplot2. To see what you can do with ggplot2 check out this cookbook
Below we read in the data on education spending per student in different years in British Columbia.
educationSpending = read.csv(url("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter02/chap02f1_3EducationSpending.csv"))
barplot(educationSpending$spendingPerStudent, names.arg = educationSpending$year, ylab = "Education spending ($ per student)")
barplot(educationSpending$spendingPerStudent, names.arg = educationSpending$year, las = 1, col = "firebrick", ylim = c(0,8000),
ylab = "Education spending ($ per student)")
Two bar-plots are shown, the first a plain vanilla plot and the second a fancier version of the same bar-plot.
See these data; they are given to us in a different format than we might usually encounter. Specifically, what we have here are the data provided to us as a frequency table.
library(abd)
data(TeenDeaths)
names(TeenDeaths)
## [1] "cause" "deaths"
colnames <- c("Cause", "Freq")
names(TeenDeaths) <- colnames
What we can do, however, is to reconstruct the raw data. This is accomplished by the code below. Once this code has run you’ll see TeenDeaths
in the upper-right window of RStudio. Open it by clicking on TeenDeaths; a familiar looking table will be visible. Notice that this is a frequency table. But I want to expand this into a proper data-set so that I can then work with the raw data as needed. The code below will do this for us.
teen = as.data.frame( rep(TeenDeaths$cause, TeenDeaths$deaths) )
head(teen)
## data frame with 0 columns and 0 rows
# The code below does the same thing and is shown purely for demonstration purposes
expand.dft <- function(x, na.strings = "NA", as.is = FALSE, dec = ".")
{
# Take each row in the source data frame table and replicate it
# using the Freq value
DF <- sapply(1:nrow(x), function(i) x[rep(i, each = x$Freq[i]), ], simplify = FALSE)
# Take the above list and rbind it to create a single DF
# Also subset the result to eliminate the Freq column
DF <- subset(do.call("rbind", DF), select = -Freq)
# Now apply type.convert to the character coerced factor columns
# to facilitate data type selection for each column
DF <- as.data.frame(lapply(DF, function(x) type.convert(as.character(x), na.strings = na.strings, as.is = as.is, dec = dec)))
# Return data frame
DF
}
teen <- expand.dft(TeenDeaths) # run the function on TeenDeaths and save the resulting file under a different name ("teen")
Now open the teen
data frame and see how it looks similar to the tigerDeaths data. We can now calculate frequency tables, bar-plots, etc with ease.
library(abd)
data(EndangeredSpecies)
names(EndangeredSpecies)
## [1] "taxon" "num.species"
species = as.data.frame( rep(EndangeredSpecies$taxon, EndangeredSpecies$num.species) )
colnames(species) = "taxon"
table.1 <- table(species)
table.1
## species
## Amphibians Arachnids Birds Clams Crustaceans Fish
## 22 12 92 70 21 115
## Insects Mammals Plants Reptiles Snails
## 44 74 745 36 32
table.2 = sort(table.1)
addmargins(sort(table.1, decreasing=TRUE))
## species
## Plants Fish Birds Mammals Clams Insects
## 745 115 92 74 70 44
## Reptiles Snails Amphibians Crustaceans Arachnids Sum
## 36 32 22 21 12 1263
barplot(table.2, las=2, ylab="Frequency", xlab="Taxon")
ggplot(species, aes(x=taxon)) + geom_bar() + ylab("Frequency") + xlab("Taxon")
Note: The preceding bar-plot doesn’t need the labels to be abbreviated because they are small enough. Else we would have had to use the labels=abbreviate
command.
Now lets switch to a numerical variable. Two renditions are valuable here … (1) a grouped frequency table and (2) a histogram. We’ll start with the grouped frequency table
Numerical variables will, by definition, assume a wide range of values. These values will often be too many to be able to construct a meaningful plot with the raw data. Let us see what this means with the following data:
data(iris) # See http://en.wikipedia.org/wiki/Iris_flower_data_set for details
names(iris)
## [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
## [5] "Species"
table(iris$Sepal.Length)
##
## 4.3 4.4 4.5 4.6 4.7 4.8 4.9 5 5.1 5.2 5.3 5.4 5.5 5.6 5.7 5.8 5.9 6
## 1 3 1 4 2 5 6 10 9 4 1 6 7 6 8 7 3 6
## 6.1 6.2 6.3 6.4 6.5 6.6 6.7 6.8 6.9 7 7.1 7.2 7.3 7.4 7.6 7.7 7.9
## 6 4 9 7 5 2 8 3 4 1 1 3 1 1 1 4 1
Notice how useless is the resulting table. This is so because each value of Sepal Length is being used to construct the Frequency Table. Nobody is really ever interested in this level of granularity. Instead, it would make sense to construct groups of Sepal Lengths and see how many data points fall within each group. How many groups we should create will be a matter of trial-and-error.
range(iris$Sepal.Length) # Find minimum and maximum values
## [1] 4.3 7.9
breaks <- seq(4.0, 8.0, by=0.5) # Create splitting points for the groups
breaks # See the breaks
## [1] 4.0 4.5 5.0 5.5 6.0 6.5 7.0 7.5 8.0
SL.cut <- cut(iris$Sepal.Length, breaks, right=FALSE) # Create the groups
SL.cut # See what group each observation falls in
## [1] [5,5.5) [4.5,5) [4.5,5) [4.5,5) [5,5.5) [5,5.5) [4.5,5) [5,5.5)
## [9] [4,4.5) [4.5,5) [5,5.5) [4.5,5) [4.5,5) [4,4.5) [5.5,6) [5.5,6)
## [17] [5,5.5) [5,5.5) [5.5,6) [5,5.5) [5,5.5) [5,5.5) [4.5,5) [5,5.5)
## [25] [4.5,5) [5,5.5) [5,5.5) [5,5.5) [5,5.5) [4.5,5) [4.5,5) [5,5.5)
## [33] [5,5.5) [5.5,6) [4.5,5) [5,5.5) [5.5,6) [4.5,5) [4,4.5) [5,5.5)
## [41] [5,5.5) [4.5,5) [4,4.5) [5,5.5) [5,5.5) [4.5,5) [5,5.5) [4.5,5)
## [49] [5,5.5) [5,5.5) [7,7.5) [6,6.5) [6.5,7) [5.5,6) [6.5,7) [5.5,6)
## [57] [6,6.5) [4.5,5) [6.5,7) [5,5.5) [5,5.5) [5.5,6) [6,6.5) [6,6.5)
## [65] [5.5,6) [6.5,7) [5.5,6) [5.5,6) [6,6.5) [5.5,6) [5.5,6) [6,6.5)
## [73] [6,6.5) [6,6.5) [6,6.5) [6.5,7) [6.5,7) [6.5,7) [6,6.5) [5.5,6)
## [81] [5.5,6) [5.5,6) [5.5,6) [6,6.5) [5,5.5) [6,6.5) [6.5,7) [6,6.5)
## [89] [5.5,6) [5.5,6) [5.5,6) [6,6.5) [5.5,6) [5,5.5) [5.5,6) [5.5,6)
## [97] [5.5,6) [6,6.5) [5,5.5) [5.5,6) [6,6.5) [5.5,6) [7,7.5) [6,6.5)
## [105] [6.5,7) [7.5,8) [4.5,5) [7,7.5) [6.5,7) [7,7.5) [6.5,7) [6,6.5)
## [113] [6.5,7) [5.5,6) [5.5,6) [6,6.5) [6.5,7) [7.5,8) [7.5,8) [6,6.5)
## [121] [6.5,7) [5.5,6) [7.5,8) [6,6.5) [6.5,7) [7,7.5) [6,6.5) [6,6.5)
## [129] [6,6.5) [7,7.5) [7,7.5) [7.5,8) [6,6.5) [6,6.5) [6,6.5) [7.5,8)
## [137] [6,6.5) [6,6.5) [6,6.5) [6.5,7) [6.5,7) [6.5,7) [5.5,6) [6.5,7)
## [145] [6.5,7) [6.5,7) [6,6.5) [6.5,7) [6,6.5) [5.5,6)
## 8 Levels: [4,4.5) [4.5,5) [5,5.5) [5.5,6) [6,6.5) [6.5,7) ... [7.5,8)
SL.freq <- table(SL.cut)
SL.freq # See the Grouped Frequency Table
## SL.cut
## [4,4.5) [4.5,5) [5,5.5) [5.5,6) [6,6.5) [6.5,7) [7,7.5) [7.5,8)
## 4 18 30 31 32 22 7 6
Note: The option right=FALSE
says construct the groups to be closed on the left and open on the right. Open intervals do not include the endpoints; only values between the endpoints will be included in the group. Closed intervals will include values that equal the endpoints and values that fall between the endpoints. Thus, in this example, a value of 4 will be bundled into the \([4,4.5)\) group but a value of 4.5 will be reserved for the \([4.5,5)\) group, and so on.
Now lets construct the histogram for these grouped Sepal Lengths.
hist(iris$Sepal.Length, ylab="Frequency", xlab="Sepal Length", ylim=c(0,35), breaks=breaks, right=FALSE, col="gray", main="Grouped Histogram of Sepal Length")
We’ve ignored the fact that there are three species here, and they may have different sepal lengths. We can fix that as follows:
histogram(~ Sepal.Length | Species, data=iris, ylab="Frequency", xlab="Sepal Length", ylim=c(0,60), freq=TRUE, breaks=breaks, right=FALSE, main="", col="gray", layout = c(1,3))
This is essentially how we could construct histograms for some numerical variable that has been measured for different groups. The example below does this for the Hemoglobin data that document hemoglobin concentration in men living at high altitude in three different parts of the world and in a sea level population (USA).
hemoglobinData = read.csv(url("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter02/chap02e3cHumanHemoglobinElevation.csv"))
histogram(~ hemoglobin | population, data = hemoglobinData,
layout = c(1,4), col = "firebrick", breaks = seq(10, 26, by=1), xlab="Hemoglobin")
What follows is a demonstration of how differing intervals (how wide each group or bin is) changes the shape of the histogram. The data refer to the body mass of 228 female sockeye salmon sampled from Pick Creek in Alaska. The same data are plotted for three different interval widths: 0.1 kg, 0.3 kg, and 0.5 kg. We will first read the data into R, check the first 6 rows of the data-set, and then construct the three histograms, each with its own interval width.
salmonSizeData = read.csv(url("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter02/chap02f2_5SalmonBodySize.csv"))
head(salmonSizeData)
## year sex oceanAgeYears lengthMm massKg
## 1 1996 FALSE 3 513 3.090
## 2 1996 FALSE 3 513 2.909
## 3 1996 FALSE 3 525 3.056
## 4 1996 FALSE 3 501 2.690
## 5 1996 FALSE 3 513 2.876
## 6 1996 FALSE 3 501 2.978
hist(salmonSizeData$massKg, right = FALSE, breaks = seq(1, 4, by=0.1), col = "gray", xlab="Mass (in kg)", main="Salmon Size")
hist(salmonSizeData$massKg, right = FALSE, breaks = seq(1, 4, by=0.3), col = "gray", xlab="Mass (in kg)", main="Salmon Size")
hist(salmonSizeData$massKg, right = FALSE, breaks = seq(1, 4, by=0.5), col = "gray", xlab="Mass (in kg)", main="Salmon Size")
These data document the distribution of bird species abundance at Organ Pipe Cactus National Monument.
birdAbundanceData = read.csv(url("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter02/chap02e2bDesertBirdAbundance.csv"))
birdAbundanceTable = table(cut(birdAbundanceData$abundance, breaks = seq(0, 650, by=50), right = FALSE))
birdAbundanceTable
##
## [0,50) [50,100) [100,150) [150,200) [200,250) [250,300) [300,350)
## 28 4 3 3 1 2 1
## [350,400) [400,450) [450,500) [500,550) [550,600) [600,650)
## 0 0 0 0 0 1
data.frame(Frequency = addmargins(birdAbundanceTable))
## Frequency.Var1 Frequency.Freq
## 1 [0,50) 28
## 2 [50,100) 4
## 3 [100,150) 3
## 4 [150,200) 3
## 5 [200,250) 1
## 6 [250,300) 2
## 7 [300,350) 1
## 8 [350,400) 0
## 9 [400,450) 0
## 10 [450,500) 0
## 11 [500,550) 0
## 12 [550,600) 0
## 13 [600,650) 1
## 14 Sum 43
hist(birdAbundanceData$abundance, right = FALSE)
hist(birdAbundanceData$abundance, right = FALSE, breaks = seq(0, 650, by=50), col = "firebrick", las = 1, xlab = "Abundance (No. individuals)", ylab = "Frequency (No. species)", main = "")
Note: The option right = FALSE
ensures that abundance value 300 (for example) is counted in the 300-400 bin rather than in the 200-300 bin.
If both variables are categorical then we can use contingency tables to tabulate the data and see what associations show up. We can also use bar-charts and mosaic plots to graph the data. Let us start with the the contingency table. We will use the avian malaria data-set, and start by reading in the data.
birdMalariaData = read.csv(url("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter02/chap02e3aBirdMalaria.csv"))
head(birdMalariaData)
## bird treatment response
## 1 1 Control Malaria
## 2 2 Control Malaria
## 3 3 Control Malaria
## 4 4 Control Malaria
## 5 5 Control Malaria
## 6 6 Control Malaria
You will see three variables but the bird number is basically an ID variable that would allow us to go back t our field notes and see what bird was numbered 1, and so on. As such the key variables are the treatment (i.e., whether the bird was in the Control group or in the Egg Removal group), and the response (i.e., whether the test was positive or negative for avian malaria).
The commands below will construct the needed contingency table:
tab.A = table(birdMalariaData$response, birdMalariaData$treatment)
tab.A
##
## Control Egg removal
## Malaria 7 15
## No Malaria 28 15
addmargins(tab.A)
##
## Control Egg removal Sum
## Malaria 7 15 22
## No Malaria 28 15 43
## Sum 35 30 65
tab.B1 = prop.table(tab.A, 1)
tab.B1
##
## Control Egg removal
## Malaria 0.3181818 0.6818182
## No Malaria 0.6511628 0.3488372
tab.B2 = prop.table(tab.A, 2)
tab.B2
##
## Control Egg removal
## Malaria 0.2 0.5
## No Malaria 0.8 0.5
Note: prop.table(tab.A, 1)
is basically calculating the relative frequencies for each row. Running prop.table(tab.A, 2)
is basically calculating the relative frequencies for each column. Looking at tab.B1
shows us quite clearly that about 68% of the Egg removal group got malaria, versus only about 32% of the Control group; presumably reproduction does weaken the immune system.
Now let us draw a bar-plot for both variables.
barplot(as.matrix(tab.A), beside = TRUE, ylim=c(0,30))
The mosaic plot follows:
mosaicplot( t(tab.A), sub = "Treatment", ylab = "Relative Frequency", cex.axis = 0.8, main = "Mosaic Plot of Avian Malaria Data", col=c("black", "gray"))
There is no logical way to construct a frequency table for these data. Yes, we could group each of the numerical variables and then construct something like the contingency table but that would be risky; grouping both might misrepresent the data. Instead we will focus strictly on graphical renditions of any possible relationship between both variables. This is done via the scatter-plot.
guppyFatherSonData = read.csv(url("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter02/chap02e3bGuppyFatherSonAttractiveness.csv"))
head(guppyFatherSonData)
## fatherOrnamentation sonAttractiveness
## 1 0.35 -0.32
## 2 0.03 -0.03
## 3 0.14 0.11
## 4 0.10 0.28
## 5 0.22 0.31
## 6 0.23 0.18
plot(sonAttractiveness ~ fatherOrnamentation, data = guppyFatherSonData)
Now I’ll pretty it up a bit:
plot(sonAttractiveness ~ fatherOrnamentation, data = guppyFatherSonData, xlab="Father's Ornamentation", ylab="Son's Attractiveness", col="black", ylim=c(-1.5, 1.5), xlim=c(0, 1.3))
One example might be the hemoglobin concentration data. The best way to show differences between the groups would be to stack the histograms one above the other (see below).
hemoglobinData = read.csv(url("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter02/chap02e3cHumanHemoglobinElevation.csv"))
head(hemoglobinData)
## id hemoglobin population
## 1 US.Sea.level1 10.40 USA
## 2 US.Sea.level2 11.20 USA
## 3 US.Sea.level3 11.70 USA
## 4 US.Sea.level4 11.80 USA
## 5 US.Sea.level5 11.90 USA
## 6 US.Sea.level6 12.05 USA
library(lattice)
histogram(~ hemoglobin | population, data = hemoglobinData,
layout = c(1,4), col = "gray", breaks = seq(10, 26, by=1))
Note: The layout = c(1,4)
option says stack them into1 column and 4 rows. Because this is a numerical variable we are constructing bins of width = 1. This is being done via the breaks = seq(10, 20, by=1)
command. The |
option says construct the histogram of hemoglobin for each group listed in the population variable. Be careful; if you get an error message that says the lattice library is not found be sure to install it first.
The Measles data – showing confirmed cases of measles in England and Wales from 1995 to 2011, measured every quarter (i.e., at three month intervals of January-March, April-June, July-September, and October-December) – is an excellent example. So is the Lynx data-set.
measlesData = read.csv(url("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter02/chap02e4aMeaslesOutbreaks.csv"))
head(measlesData)
## year quarter confirmedCases yearByQuarter
## 1 2011 4th 136 2011.88
## 2 2011 3rd 154 2011.62
## 3 2011 2nd 346 2011.38
## 4 2011 1st 151 2011.12
## 5 2010 4th 31 2010.88
## 6 2010 3rd 134 2010.62
plot(confirmedCases ~ yearByQuarter, data = measlesData, type="b", xlab="Year", ylab="Confirmed Number of Cases", main="Measles Outbreak by Year (1995.Q1 - 2010.Q4)")
Note: Here the ~
symbol is being used to say plot confirmedCases against yearByQuarter. The type=“b”
says construct a dot for each time point and then link successive dots with a line. If we switched type=“b”
with type=“l”
we would get just a straight line connecting successive points. Try it for yourself.
We could also create more interactive graphs of the measles data (see below).
library(plotly)
##
## Attaching package: 'plotly'
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
## The following object is masked from 'package:graphics':
##
## layout
p = plot_ly(data=measlesData, x=yearByQuarter, y=confirmedCases, showlegend = FALSE)
m = loess(confirmedCases ~ as.numeric(yearByQuarter), data = measlesData)
p = add_trace(p, y = fitted(m), name = "loess", showlegend = FALSE)
p
Strip Charts are useful at times, particularly when there is a need to show all individual values in the data-set. Below we construct a strip chart for data on serotonin levels in the central nervous system of desert locusts that were experimentally crowded for 0 (the control group), 1, and 2 hours. We’ll begin as usual; reading in the data.
locustData = read.csv(url("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter02/chap02f1_2locustSerotonin.csv"))
head(locustData)
## serotoninLevel treatmentTime
## 1 5.3 0
## 2 4.6 0
## 3 4.5 0
## 4 4.3 0
## 5 4.2 0
## 6 3.6 0
stripchart(serotoninLevel ~ treatmentTime, data = locustData, method = "jitter", vertical = TRUE, ylim=c(0,25), ylab="Seratonin Level", xlab="Crowded for 0, 1, 2 hrs")
par(bty = "l") # plot x and y axes only, not a complete box
stripchart(serotoninLevel ~ treatmentTime, data = locustData, vertical = TRUE, method = "jitter", pch = 16, col = "firebrick", cex = 1.5, las = 1, ylab = "Serotonin (pmoles)", xlab = "Treatment time (hours)", ylim = c(0, max(locustData$serotoninLevel)))
# The following command calculates the means in each treatment group (time)
meanSerotonin = tapply(locustData$serotoninLevel, locustData$treatmentTime, mean)
# "segments" draws draws lines to indicate the means
segments(x0 = c(1,2,3) - 0.1, y0 = meanSerotonin, x1 = c(1,2,3) + 0.1,
y1 = meanSerotonin, lwd = 2)
In my experience most folks are more familiar with histograms than they are with stripcharts. Plus the need to nudge the data points a little bit (called “jittering” in technical parlance) distorts the picture a bit and one should avoid distortions as much as is possible. However, if you see strip charts used a lot in your field or sub-field then I would learn how to construct them.
Note: I have not covered pie-charts; avoid them at all cost. Box-plots are very useful for displaying data but I will cover these when we get into Chapter 3 and talk about order statistics.
We start by loading up the endangered species data.
EndangeredSpecies = read.csv(url("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter02/chap02q06EndangeredSpecies.csv"))
Now I want to address sub-question (a):
colnames(EndangeredSpecies) = c("Taxon", "Freq")
species = expand.dft(EndangeredSpecies)
tab.A = table(species)
tab.A
## species
## Amphibians Arachnids Birds Clams Crustaceans Fish
## 22 12 92 70 21 115
## Insects Mammals Plants Reptiles Snails
## 44 74 745 36 32
tab.B = sort(tab.A, decreasing=FALSE)
tab.B
## species
## Arachnids Crustaceans Amphibians Snails Reptiles Insects
## 12 21 22 32 36 44
## Clams Mammals Birds Fish Plants
## 70 74 92 115 745
tab.C = data.frame(Frequency = tab.B)
tab.C
## Frequency
## Arachnids 12
## Crustaceans 21
## Amphibians 22
## Snails 32
## Reptiles 36
## Insects 44
## Clams 70
## Mammals 74
## Birds 92
## Fish 115
## Plants 745
In brief, I have sorted the frequency table in increasing order of the Number of Species. That is, the more endangered species are listed first. This makes intuitive sense since most readers will, when primed by the word “endangered”, want to know which species are the most endangered so we might as well present these species first.
I have constructed a frequency table
The best graph would be a bar-plot:
barplot(tab.B, ylab="Frequency", xlab="Taxon", ylim=c(0, 800), cex.names=0.8)
Given that we are dealing with categorical data the bar-plot is the obvious graphic of choice.
The baseline should be \(0\) since any other value would misrepresent the minimum value possible for any taxon, and also distort the plot.
The relative frequency is shown below:
tab.D = prop.table(tab.B)
tab.E = data.frame(Rel.Frequency = tab.D)
tab.E
## Rel.Frequency
## Arachnids 0.009501188
## Crustaceans 0.016627078
## Amphibians 0.017418844
## Snails 0.025336500
## Reptiles 0.028503563
## Insects 0.034837688
## Clams 0.055423595
## Mammals 0.058590657
## Birds 0.072842439
## Fish 0.091053048
## Plants 0.589865400
Read in the data.
anemone = read.csv(url("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter02/chap02q17AnemonePersonality.csv"))
anemone
## anemoneNumber startleResponse1 startleResponse2
## 1 1 1065 939
## 2 2 248 268
## 3 3 436 460
## 4 4 350 261
## 5 5 378 368
## 6 6 410 467
## 7 7 232 303
## 8 8 201 188
## 9 9 267 401
## 10 10 687 690
## 11 11 688 711
## 12 12 980 571
plot(startleResponse2 ~ startleResponse1, data=anemone, ylim=c(0,1000), xlab="Startle Response 1", ylab="Startle Response 2", pch=16)
response1 = anemone$startleResponse1
response1
## [1] 1065 248 436 350 378 410 232 201 267 687 688 980
hist(response1, xlab="Response 1", col="gray", main="Histogram of Response 1")