# 1 Describing Data: Central Tendency and Dispersion

Working with the data on gliding snakes’ undulation rates we will calculate the mean, median, range, interquartile range, variance, and standard deviation. We will also construct some plots in order to see a few things in action.

snakeData = read.csv(url("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter03/chap03e1GlidingSnakes.csv"))
head(snakeData)
##   snake undulationRateHz
## 1     1              0.9
## 2     2              1.2
## 3     3              1.2
## 4     4              1.3
## 5     5              1.4
## 6     6              1.4

Drawing a histogram of these data will allow us to see how undulation rates are distributed.

hist(snakeData$undulationRateHz, right = FALSE, las = 1, col = "gray", breaks = seq(0.8,2.2,by=0.2), xlab = "Undulation rate (Hz)", ylab = "Frequency", main = "Distribution of Undulation Rates (in Hz)", ylim=c(0,3)) The code below constructs a similar plot albeit with the ggplot2 package library(ggplot2) ggplot(snakeData, aes(undulationRateHz)) + geom_histogram(binwidth=0.2) + xlab("Undulation rate (Hz)") ## Calculating Central Tendency & Variability Calculating the mean, median, variance, and standard deviation is pretty straightforward: mean(snakeData$undulationRate, na.rm=TRUE)
## [1] 1.375
median(snakeData$undulationRate, na.rm=TRUE) ## [1] 1.35 var(snakeData$undulationRate, na.rm=TRUE)
## [1] 0.105
sd(snakeData$undulationRate, na.rm=TRUE) ## [1] 0.324037 Note: If you have missing values in your data-set then you should run mean(snakeData$undulationRateHz, na.rm=TRUE) etc. for all commands. As a matter of fact R has several commands that will not run unless you specify na.rm=TRUE. Further, if we wanted to we could have named each estimate, for example, mean.ur = mean(snakeData$undulationRate) and so on. We have to be a bit careful when calculating the quartiles ($$Q_1$$ and $$Q_3$$) and the interquartile range. This is because there are nine ways of calculating these in R. We will use type=5 in this class since that is what is used by Whitlock & Schluter. The relevant commands are shown below: quantile(snakeData$undulationRateHz, probs=c(0.25, 0.75), type=5, na.rm=TRUE)
## 25% 75%
## 1.2 1.5
IQR(snakeData$undulationRateHz, type=5, na.rm=TRUE) ## [1] 0.3 Note: $$IQR = Q_3 - Q_1 = 1.5 - 1.2 = 0.3$$ . Specifying probs=c(0.25, 0.75) specifies that we want $$Q_1$$ and $$Q_3$$. If you wanted the median then you’d have to specify a probs value of 0.50. ## 1.1 The Coefficient of Variation The Coefficient of Variation tells us something about how a variable is distributed about its mean. Given that it is a ratio, the higher is the coefficient of variation the more dispersed are the values of the variable. Why is it a useful measure if we have others to choose from? Because it is not influenced by the units of measurement. For example, if you are measuring heights in inches and calculate the variance and standard deviation, then convert height into centimeters and recompute the variance and standard deviation, the two sets of values will differ a good bit (see below): height.in = c(65, 72, 77, 60, 58) mean(height.in) # No missing values so na.rm=TRUE not specified ## [1] 66.4 sd(height.in) # No missing values so na.rm=TRUE not specified ## [1] 8.018728 height.cm = height.in * 2.54 mean(height.cm) # No missing values so na.rm=TRUE not specified ## [1] 168.656 sd(height.cm) # No missing values so na.rm=TRUE not specified ## [1] 20.36757 Now suddenly it seems as if there is more variability when height is measured in centimeters, but this is because of the units of measurement being used! If we now calculate the coefficient of variation for height in inches and then for height in centimeters we get identical estimates of how much variation there is in the heights of these individuals: cov.in = (sd(height.in) / mean(height.in)) * 100 cov.in ## [1] 12.0764 cov.cm = (sd(height.cm) / mean(height.cm)) * 100 cov.cm ## [1] 12.0764 Now calculating the coefficient of variation for the gliding snakes data-set: 100 * sd(snakeData$undulationRate)/mean(snakeData$undulationRate) ## [1] 23.56633 # 2 Working with Grouped Data For this exercise we’ll replicate the in-text example that uses the number of convictions data-set (see below): convictionsFreq = read.csv(url("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter03/chap03t1_2ConvictionsFreq.csv")) head(convictionsFreq) ## convictions frequency ## 1 0 265 ## 2 1 49 ## 3 2 21 ## 4 3 19 ## 5 4 10 ## 6 5 10 Notice the format of the data … it is given to us as a frequency table. Not a problem so long as we understand what we are looking at. In brief, 265 boys had no convictions, 49 had 1 conviction, and so on. We can expand the data into its long (or raw) form as shown below: convictions = rep(convictionsFreq$convictions, convictionsFreq$frequency) head(convictions) ## [1] 0 0 0 0 0 0 Now the calculations can commence: mean(convictions, na.rm=TRUE) ## [1] 1.126582 median(convictions, na.rm=TRUE) ## [1] 0 sd(convictions, na.rm=TRUE) ## [1] 2.456562 var(convictions, na.rm=TRUE) ## [1] 6.034698 quantile(convictions, probs=c(0.25, 0.75), type=5, na.rm=TRUE) ## 25% 75% ## 0 1 IQR(convictions, type=5, na.rm=TRUE) ## [1] 1 ( sd(convictions) / mean(convictions) ) * 100 ## [1] 218.0544 tab.c = table(convictions); tab.c ## convictions ## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 ## 265 49 21 19 10 10 2 2 4 2 1 4 3 1 2 barplot(tab.c, ylim=c(0,300), xlab="Number of Convictions", ylab="Frequency") ## The Speeds of Male Tidarren spiders (pre- and post-amputation) spiderData = read.csv(url("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter03/chap03e2SpiderAmputation.csv")) head(spiderData) ## spider speed treatment ## 1 1 1.25 before ## 2 2 2.94 before ## 3 3 2.38 before ## 4 4 3.09 before ## 5 5 3.41 before ## 6 6 3.00 before Note: The two variables of interest are speed and treatment. We can start by constructing histograms of speeds before and after amputation. We will then look at a different graphic … the boxplot, an extremely useful plot because it conveys a lot of information very simply. First the histograms library(lattice) histogram(~ speed | treatment, data=spiderData, layout=c(1,2), type=c("count")) Now the box-plots: boxplot(speed ~ treatment, data=spiderData, horizontal=TRUE, ylab="Amputation Treatment", xlab="Speed (cm/s)", col="gray") For each group the boxplot maps a few statistics. First, the dark line inside each box is the Median for the group. Clearly the median speed is higher for the post-amputation group than for the pre-amputation group. The edges of the gray boxes reflect the first and third quartiles for each group; $$Q_1$$ is to the left of the Median and $$Q_3$$ is to the right of the Median. The “whiskers” go out to the minimum and maximum values. If a distribution is symmetric then the box should be equally wide either side of the Median and the whiskers should be of similar length on either side of the box. If you see the plot looking stretched out on one side then that tells us it is skewed towards this side. For instance, see the stretch to the right for the “after”" group; these data are positively skewed (aka skewed right). For the “before” group we see the stretch to the left; this group’s speeds are negatively skewed (i.e. skewed left). We also see a lone dot at the lower extreme of the plot for the “before” group. This is an outlier … any data point that is smaller than $$Q_1 - (1.5 \times IQR)$$ or larger than $$Q_3 + (1.5 \times IQR)$$. An outlier is essentially an extremely unusual value. We will now extract and save the before-amputation data using the subset switch. We will then repeat this exercise for the after-amputation data. speedBefore = subset(spiderData, treatment == "before") speedBefore ## spider speed treatment ## 1 1 1.25 before ## 2 2 2.94 before ## 3 3 2.38 before ## 4 4 3.09 before ## 5 5 3.41 before ## 6 6 3.00 before ## 7 7 2.31 before ## 8 8 2.93 before ## 9 9 2.98 before ## 10 10 3.55 before ## 11 11 2.84 before ## 12 12 1.64 before ## 13 13 3.22 before ## 14 14 2.87 before ## 15 15 2.37 before ## 16 16 1.91 before speedAfter = subset(spiderData, treatment == "after") speedAfter ## spider speed treatment ## 17 1 2.40 after ## 18 2 3.50 after ## 19 3 4.49 after ## 20 4 3.17 after ## 21 5 5.26 after ## 22 6 3.22 after ## 23 7 2.32 after ## 24 8 3.31 after ## 25 9 3.70 after ## 26 10 4.70 after ## 27 11 4.94 after ## 28 12 5.06 after ## 29 13 3.22 after ## 30 14 3.52 after ## 31 15 5.45 after ## 32 16 3.40 after Now we can calculate the mean, median, variance, standard deviation, quartiles, and the interquartile range for each group. mean(speedBefore$speed, na.rm=TRUE); mean(speedAfter$speed, na.rm=TRUE) ## [1] 2.668125 ## [1] 3.85375 median(speedBefore$speed, na.rm=TRUE); median(speedAfter$speed, na.rm=TRUE) ## [1] 2.9 ## [1] 3.51 var(speedBefore$speed, na.rm=TRUE); var(speedAfter$speed, na.rm=TRUE) ## [1] 0.4115896 ## [1] 0.9853183 sd(speedBefore$speed, na.rm=TRUE); sd(speedAfter$speed, na.rm=TRUE) ## [1] 0.6415525 ## [1] 0.992632 quantile(speedBefore$speed, probs=c(0.25, 0.75), type=5, na.rm=TRUE)
##   25%   75%
## 2.340 3.045
quantile(speedAfter$speed, probs=c(0.25, 0.75), type=5, na.rm=TRUE) ## 25% 75% ## 3.22 4.82 IQR(speedBefore$speed, type=5, na.rm=TRUE)
## [1] 0.705
IQR(speedAfter$speed, type=5, na.rm=TRUE) ## [1] 1.6 Just for completeness’ sake, multiply the IQR of each group by 1.5 1.5 * IQR(speedBefore$speed, type=5, na.rm=TRUE)
## [1] 1.0575
1.5 * IQR(speedAfter$speed, type=5, na.rm=TRUE) ## [1] 2.4 Now subtract 1.0575 from $$Q_1$$ of the “before” group 2.340 - 1.0575 ## [1] 1.2825 This yields a value of $$1.2825$$ … Spider 1 in the “before” group has a speed of $$1.25$$ and hence is flagged as an outlier in the boxplot. # 3 Cumulative Frequency Distributions par(mfrow=c(1,2)) plot( ecdf(speedBefore$speed), verticals = TRUE,
las = 1, main = "", do.points = FALSE,
ylab = "Cumulative relative frequency",
xlab = "Running speed before amputation (cm/s)")
plot( ecdf(speedAfter$speed), verticals = TRUE, las = 1, main = "", do.points = FALSE, ylab = "Cumulative relative frequency", xlab = "Running speed after amputation (cm/s)" ) dev.off() ## null device ## 1 par(mfrow=c(1,1)) You can pick any point on the vertical axis, connect a straight line to the plotted line, then drop a perpendicular to the x-axis … and you’ll be able to say “so-and-so proportion of the spider had speeds below the x-value”. Out of sheer curiosity, pick $$0.5$$ on the vertical axis of both groups and draw the straight line to the plot. Now read the approximate running speed that would result if you dropped a perpendicular to the x-axis. What this says is that 50% of spiders in the before group had speeds of at most 2.9 while 50% of spiders in the after group had speeds of at most 3.51. par(mfrow=c(1,2)) plot( ecdf(speedBefore$speed), verticals = TRUE,
las = 1, main = "", do.points = FALSE,
ylab = "Cumulative relative frequency",
xlab = "Running speed before amputation (cm/s)")
abline(h=0.5)
plot( ecdf(speedAfter$speed), verticals = TRUE, las = 1, main = "", do.points = FALSE, ylab = "Cumulative relative frequency", xlab = "Running speed after amputation (cm/s)" ) abline(h=0.5) dev.off() ## null device ## 1 par(mfrow=c(1,1)) You don’t have to pick the $$0.5$$ point; you could have picked $$0.9$$ and said “90% of spiders in group so-and-so had speeds of so much or less”, so on and so forth. # 4 What if I have Three or More Groups? The psych package (install it before you use it) is very handy here to calculate a number of statistics. We will see it in use with the data-set on sticklebacks: sticklebackData = read.csv(url("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter03/chap03e3SticklebackPlates.csv")) head(sticklebackData) ## id plates genotype ## 1 4-1 11 mm ## 2 4-2 63 Mm ## 3 4-4 22 Mm ## 4 4-5 10 Mm ## 5 4-10 14 mm ## 6 4-12 11 mm library(psych) describeBy(sticklebackData$plates, group=sticklebackData$genotype, na.rm=TRUE) ## group: mm ## vars n mean sd median trimmed mad min max range skew kurtosis se ## 1 1 88 11.67 3.57 11 11.42 2.97 6 37 31 4.2 26.92 0.38 ## -------------------------------------------------------- ## group: Mm ## vars n mean sd median trimmed mad min max range skew kurtosis ## 1 1 174 50.38 15.15 59 52.52 7.41 10 68 58 -1.06 -0.18 ## se ## 1 1.15 ## -------------------------------------------------------- ## group: MM ## vars n mean sd median trimmed mad min max range skew kurtosis se ## 1 1 82 62.78 3.41 63 63.21 1.48 42 69 27 -3.55 17.62 0.38 Histograms are easily drawn as well: histogram(~ plates | genotype, data=sticklebackData, layout=c(1,3), type="count", xlab="Number of Plates", ylab="Frequency") Again, in a very marked fashion the $$MM$$ group has the most number of plates on average. Constructing a table of frequencies and proportions of the stickleback genotypes is easy as well. tab.s1 = table(sticklebackData$genotype, dnn = "genotype")
tab.s1
## genotype
##  mm  Mm  MM
##  88 174  82
tab.s2 = addmargins(tab.s1)
tab.s2
## genotype
##  mm  Mm  MM Sum
##  88 174  82 344
sticklebackFreq = data.frame(tab.s2)
sticklebackFreq
##   genotype Freq
## 1       mm   88
## 2       Mm  174
## 3       MM   82
## 4      Sum  344
tab.sprop1 = prop.table(tab.s1)
tab.sprop1
## genotype
##        mm        Mm        MM
## 0.2558140 0.5058140 0.2383721
tab.sprop2 = addmargins(tab.sprop1)
sticklebackRelFreq = data.frame(tab.sprop2)
sticklebackRelFreq
##   genotype      Freq
## 1       mm 0.2558140
## 2       Mm 0.5058140
## 3       MM 0.2383721
## 4      Sum 1.0000000