1 Describing Data: Central Tendency and Dispersion
- 1.1 The Coefficient of Variation
2 Working with Grouped Data
3 Cumulative Frequency Distributions
4 What if I have Three or More Groups?

1 Describing Data: Central Tendency and Dispersion

Working with the data on gliding snakes’ undulation rates we will calculate the mean, median, range, interquartile range, variance, and standard deviation. We will also construct some plots in order to see a few things in action.

snakeData = read.csv(url("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter03/chap03e1GlidingSnakes.csv"))
head(snakeData)

##   snake undulationRateHz
## 1     1              0.9
## 2     2              1.2
## 3     3              1.2
## 4     4              1.3
## 5     5              1.4
## 6     6              1.4

Drawing a histogram of these data will allow us to see how undulation rates are distributed.

hist(snakeData$undulationRateHz, right = FALSE, las = 1, col = "gray", breaks = seq(0.8,2.2,by=0.2), xlab = "Undulation rate (Hz)", ylab = "Frequency", main = "Distribution of Undulation Rates (in Hz)", ylim=c(0,3))

The code below constructs a similar plot, this time with the ggplot2 package:

library(ggplot2)
ggplot(snakeData, aes(undulationRateHz)) + geom_histogram(binwidth=0.2) + xlab("Undulation rate (Hz)")

## Calculating Central Tendency & Variability

Calculating the mean, median, variance, and standard deviation is pretty straightforward:

mean(snakeData$undulationRateHz, na.rm=TRUE)

## [1] 1.375

median(snakeData$undulationRateHz, na.rm=TRUE)

## [1] 1.35

var(snakeData$undulationRateHz, na.rm=TRUE)

## [1] 0.105

sd(snakeData$undulationRateHz, na.rm=TRUE)

## [1] 0.324037

Note: If you have missing values in your data-set then you should specify na.rm=TRUE, as in mean(snakeData$undulationRateHz, na.rm=TRUE), for all of these commands. Without it, mean(), median(), var(), and sd() return NA, while quantile() and IQR() stop with an error. Further, if we wanted to we could have saved each estimate under a name, for example, mean.ur = mean(snakeData$undulationRateHz), and so on.
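To see what a missing value actually does, here is a minimal sketch. The vector x is hypothetical (it is not part of the snake data) and exists only to show the behaviour of na.rm.

```r
# Hypothetical vector with one missing value (not from the snake data)
x <- c(1.2, 1.5, NA, 0.9)

m_raw     <- mean(x)                 # NA: the missing value propagates
m_clean   <- mean(x, na.rm = TRUE)   # mean of the three observed values
n_missing <- sum(is.na(x))           # how many values are missing

# quantile() does not return NA; it raises an error instead.
# tryCatch() captures that error so the script keeps running.
q_raw   <- tryCatch(quantile(x, probs = 0.5), error = function(e) e)
q_clean <- quantile(x, probs = 0.5, na.rm = TRUE, names = FALSE)

m_raw
m_clean
n_missing
inherits(q_raw, "error")   # TRUE: the call without na.rm failed
q_clean
```

Counting the missing values first, as n_missing does here, is a cheap way to know how many observations each estimate is really based on.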

We have to be a bit careful when calculating the quartiles ($Q_1$ and $Q_3$) and the interquartile range, because R's quantile() function offers nine different algorithms (type=1 through type=9, with type=7 the default) and they can give slightly different answers. We will use type=5 in this class since that is what is used by Whitlock & Schluter. The relevant commands are shown below:

quantile(snakeData$undulationRateHz, probs=c(0.25, 0.75), type=5, na.rm=TRUE)

## 25% 75% 
## 1.2 1.5

IQR(snakeData$undulationRateHz, type=5, na.rm=TRUE)

## [1] 0.3

Note: $IQR = Q_3 - Q_1 = 1.5 - 1.2 = 0.3$. Setting probs=c(0.25, 0.75) requests $Q_1$ and $Q_3$. If you wanted the median then you’d have to specify a probs value of 0.50.
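The choice of type matters more than one might expect. The sketch below uses a small hypothetical vector (not one of the course data-sets) to compare type=5 with R's default type=7, and re-computes the type=5 answer by hand. The helper type5_manual is an illustrative name, not an R function.

```r
# Hypothetical sample of eight values, chosen only for illustration
x <- c(2, 4, 4, 5, 7, 9, 10, 12)

q5 <- quantile(x, probs = c(0.25, 0.75), type = 5, names = FALSE)  # 4.00 9.50
q7 <- quantile(x, probs = c(0.25, 0.75), type = 7, names = FALSE)  # 4.00 9.25

# Type 5 by hand: the p-th quantile sits at fractional position n*p + 0.5
# in the sorted data, with linear interpolation between neighbours.
type5_manual <- function(x, p) {
  xs <- sort(x)
  n  <- length(xs)
  h  <- min(max(n * p + 0.5, 1), n)   # position, clamped to [1, n]
  lo <- floor(h)
  hi <- ceiling(h)
  xs[lo] + (h - lo) * (xs[hi] - xs[lo])
}

manual <- c(type5_manual(x, 0.25), type5_manual(x, 0.75))

q5
q7
manual   # matches q5
```

Here the two types agree on $Q_1$ but differ on $Q_3$, which is why it pays to state the type explicitly whenever results must match the textbook.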

1.1 The Coefficient of Variation

The Coefficient of Variation expresses the standard deviation as a percentage of the mean, so it tells us how dispersed the values of a variable are relative to that mean. The higher the coefficient of variation, the more dispersed the values. Why is it a useful measure if we have others to choose from? Because it is not influenced by the units of measurement. For example, if you are measuring heights in inches and calculate the variance and standard deviation, then convert height into centimeters and recompute the variance and standard deviation, the two sets of values will differ a good bit (see below):

height.in = c(65, 72, 77, 60, 58)
mean(height.in) # No missing values so na.rm=TRUE not specified

## [1] 66.4

sd(height.in) # No missing values so na.rm=TRUE not specified

## [1] 8.018728

height.cm = height.in * 2.54
mean(height.cm) # No missing values so na.rm=TRUE not specified

## [1] 168.656

sd(height.cm) # No missing values so na.rm=TRUE not specified

## [1] 20.36757

Now suddenly it seems as if there is more variability when height is measured in centimeters, but this is because of the units of measurement being used! If we now calculate the coefficient of variation for height in inches and then for height in centimeters we get identical estimates of how much variation there is in the heights of these individuals:

cov.in = (sd(height.in) / mean(height.in)) * 100
cov.in

## [1] 12.0764

cov.cm = (sd(height.cm) / mean(height.cm)) * 100
cov.cm

## [1] 12.0764
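The two estimates agreeing is no accident. As a short sketch of the reasoning (the symbols $s$, $\bar{Y}$, and $c$ are notation introduced here), write the coefficient of variation as

$$CV = 100 \times \frac{s}{\bar{Y}}$$

where $s$ is the standard deviation and $\bar{Y}$ the mean. Multiplying every observation by a positive constant $c$ (here $c = 2.54$, inches to centimeters) multiplies both the mean and the standard deviation by $c$, so the constant cancels:

$$CV_{cm} = 100 \times \frac{c \, s_{in}}{c \, \bar{Y}_{in}} = 100 \times \frac{s_{in}}{\bar{Y}_{in}} = CV_{in}$$

Note that this cancellation holds for a change of scale only; adding a constant to every value changes the mean but not the standard deviation, and therefore changes the coefficient of variation.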

Now calculating the coefficient of variation for the gliding snakes data-set:

100 * sd(snakeData$undulationRateHz)/mean(snakeData$undulationRateHz)

## [1] 23.56633

2 Working with Grouped Data

For this exercise we’ll replicate the in-text example that uses the number of convictions data-set (see below):

convictionsFreq = read.csv(url("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter03/chap03t1_2ConvictionsFreq.csv"))
head(convictionsFreq)

##   convictions frequency
## 1           0       265
## 2           1        49
## 3           2        21
## 4           3        19
## 5           4        10
## 6           5        10

Notice the format of the data … it is given to us as a frequency table. Not a problem so long as we understand what we are looking at. In brief, 265 boys had no convictions, 49 had 1 conviction, and so on. We can expand the data into its long (or raw) form as shown below:

convictions = rep(convictionsFreq$convictions, convictionsFreq$frequency)
head(convictions)

## [1] 0 0 0 0 0 0

Now the calculations can commence:

mean(convictions, na.rm=TRUE)

## [1] 1.126582

median(convictions, na.rm=TRUE)

## [1] 0

sd(convictions, na.rm=TRUE)

## [1] 2.456562

var(convictions, na.rm=TRUE)

## [1] 6.034698

quantile(convictions, probs=c(0.25, 0.75), type=5, na.rm=TRUE)

## 25% 75% 
##   0   1

IQR(convictions, type=5, na.rm=TRUE)

## [1] 1

( sd(convictions) / mean(convictions) ) * 100

## [1] 218.0544

tab.c = table(convictions); tab.c

## convictions
##   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14 
## 265  49  21  19  10  10   2   2   4   2   1   4   3   1   2

barplot(tab.c, ylim=c(0,300), xlab="Number of Convictions", ylab="Frequency")
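Expanding the table with rep() is convenient, but the same statistics can also be computed straight from the frequencies. The sketch below is one way to do it, with the counts typed in from the frequency table shown above; the variable names are illustrative.

```r
# Values and frequencies copied from the convictions table above
value <- 0:14
freq  <- c(265, 49, 21, 19, 10, 10, 2, 2, 4, 2, 1, 4, 3, 1, 2)

n <- sum(freq)                                   # total number of boys

# Weighted mean: each value counts as many times as it occurs
mean_grp <- sum(value * freq) / n                # same as weighted.mean(value, freq)

# Sample variance and standard deviation with the usual n - 1 denominator
var_grp <- sum(freq * (value - mean_grp)^2) / (n - 1)
sd_grp  <- sqrt(var_grp)

# Median: the first value whose cumulative frequency reaches the middle
# observation. This shortcut assumes n is odd, which holds here.
median_grp <- value[which(cumsum(freq) >= (n + 1) / 2)[1]]

c(n = n, mean = mean_grp, median = median_grp, var = var_grp, sd = sd_grp)
```

The results should agree with the ones obtained from the expanded vector, which makes this a handy cross-check when a frequency table is too large to expand comfortably.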

## The Speeds of Male Tidarren spiders (pre- and post-amputation)

spiderData = read.csv(url("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter03/chap03e2SpiderAmputation.csv"))
head(spiderData)

##   spider speed treatment
## 1      1  1.25    before
## 2      2  2.94    before
## 3      3  2.38    before
## 4      4  3.09    before
## 5      5  3.41    before
## 6      6  3.00    before

Note: The two variables of interest are speed and treatment. We can start by constructing histograms of speeds before and after amputation. We will then look at a different graphic, the boxplot, an extremely useful plot because it conveys a lot of information very simply. First the histograms:

library(lattice)
histogram(~ speed | treatment, data=spiderData, layout=c(1,2), type=c("count"))

Now the box-plots:

boxplot(speed ~ treatment, data=spiderData, horizontal=TRUE, ylab="Amputation Treatment", xlab="Speed (cm/s)", col="gray")

For each group the boxplot maps a few statistics. First, the dark line inside each box is the Median for the group. Clearly the median speed is higher for the post-amputation group than for the pre-amputation group. The edges of the gray boxes reflect the first and third quartiles for each group; $Q_1$ is to the left of the Median and $Q_3$ is to the right of the Median. The “whiskers” extend to the smallest and largest values that are not flagged as outliers; when a group has no outliers these are simply its minimum and maximum.

If a distribution is symmetric then the box should be equally wide on either side of the Median and the whiskers should be of similar length. If the plot looks stretched out on one side, the distribution is skewed towards that side. For instance, see the stretch to the right for the “after” group; these data are positively skewed (aka skewed right). For the “before” group we see the stretch to the left; this group’s speeds are negatively skewed (i.e. skewed left).

We also see a lone dot at the lower extreme of the plot for the “before” group. This is an outlier … any data point that is smaller than $Q_1 - (1.5 \times IQR)$ or larger than $Q_3 + (1.5 \times IQR)$. An outlier is essentially an extremely unusual value.

We will now extract and save the before-amputation data using the subset() function. We will then repeat this exercise for the after-amputation data.

speedBefore = subset(spiderData, treatment == "before") 
speedBefore

##    spider speed treatment
## 1       1  1.25    before
## 2       2  2.94    before
## 3       3  2.38    before
## 4       4  3.09    before
## 5       5  3.41    before
## 6       6  3.00    before
## 7       7  2.31    before
## 8       8  2.93    before
## 9       9  2.98    before
## 10     10  3.55    before
## 11     11  2.84    before
## 12     12  1.64    before
## 13     13  3.22    before
## 14     14  2.87    before
## 15     15  2.37    before
## 16     16  1.91    before

speedAfter = subset(spiderData, treatment == "after") 
speedAfter

##    spider speed treatment
## 17      1  2.40     after
## 18      2  3.50     after
## 19      3  4.49     after
## 20      4  3.17     after
## 21      5  5.26     after
## 22      6  3.22     after
## 23      7  2.32     after
## 24      8  3.31     after
## 25      9  3.70     after
## 26     10  4.70     after
## 27     11  4.94     after
## 28     12  5.06     after
## 29     13  3.22     after
## 30     14  3.52     after
## 31     15  5.45     after
## 32     16  3.40     after

Now we can calculate the mean, median, variance, standard deviation, quartiles, and the interquartile range for each group.

mean(speedBefore$speed, na.rm=TRUE); mean(speedAfter$speed, na.rm=TRUE)

## [1] 2.668125

## [1] 3.85375

median(speedBefore$speed, na.rm=TRUE); median(speedAfter$speed, na.rm=TRUE)

## [1] 2.9

## [1] 3.51

var(speedBefore$speed, na.rm=TRUE); var(speedAfter$speed, na.rm=TRUE)

## [1] 0.4115896

## [1] 0.9853183

sd(speedBefore$speed, na.rm=TRUE); sd(speedAfter$speed, na.rm=TRUE)

## [1] 0.6415525

## [1] 0.992632

quantile(speedBefore$speed, probs=c(0.25, 0.75), type=5, na.rm=TRUE)

##   25%   75% 
## 2.340 3.045

quantile(speedAfter$speed, probs=c(0.25, 0.75), type=5, na.rm=TRUE)

##  25%  75% 
## 3.22 4.82

IQR(speedBefore$speed, type=5, na.rm=TRUE)

## [1] 0.705

IQR(speedAfter$speed, type=5, na.rm=TRUE)

## [1] 1.6
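Repeating every command once per group gets tedious. As an alternative sketch, split() and sapply() can produce all of these statistics for both groups in one pass. The speeds below are typed in from the two listings above, and summarise_speed is a hypothetical helper name.

```r
# Speeds copied from the speedBefore and speedAfter listings above
spider <- data.frame(
  speed = c(1.25, 2.94, 2.38, 3.09, 3.41, 3.00, 2.31, 2.93,
            2.98, 3.55, 2.84, 1.64, 3.22, 2.87, 2.37, 1.91,   # before
            2.40, 3.50, 4.49, 3.17, 5.26, 3.22, 2.32, 3.31,
            3.70, 4.70, 4.94, 5.06, 3.22, 3.52, 5.45, 3.40),  # after
  treatment = rep(c("before", "after"), each = 16)
)

# Hypothetical helper: the statistics used in this section, for one group
summarise_speed <- function(s) {
  q <- quantile(s, probs = c(0.25, 0.75), type = 5, na.rm = TRUE, names = FALSE)
  c(mean   = mean(s, na.rm = TRUE),
    median = median(s, na.rm = TRUE),
    var    = var(s, na.rm = TRUE),
    sd     = sd(s, na.rm = TRUE),
    Q1     = q[1],
    Q3     = q[2],
    IQR    = q[2] - q[1])
}

# One column per treatment group, one row per statistic
by_group <- sapply(split(spider$speed, spider$treatment), summarise_speed)
round(by_group, 4)
```

The result is a matrix, so a single value can be pulled out by name, for example by_group["IQR", "before"].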

Just for completeness’ sake, multiply the IQR of each group by 1.5:

1.5 * IQR(speedBefore$speed, type=5, na.rm=TRUE)

## [1] 1.0575

1.5 * IQR(speedAfter$speed, type=5, na.rm=TRUE)

## [1] 2.4

Now subtract 1.0575 from $Q_1$ of the “before” group:

2.340 - 1.0575

## [1] 1.2825

This yields a value of $1.2825$. Spider 1 in the “before” group has a speed of $1.25$, which is below $1.2825$, and hence it is flagged as an outlier in the boxplot.
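The outlier check just done by hand can be wrapped in a small function. This is a sketch under the $1.5 \times IQR$ rule described earlier; flag_outliers is a hypothetical name, and the speeds are typed in from the “before” listing above.

```r
# "before" speeds copied from the speedBefore listing above
before <- c(1.25, 2.94, 2.38, 3.09, 3.41, 3.00, 2.31, 2.93,
            2.98, 3.55, 2.84, 1.64, 3.22, 2.87, 2.37, 1.91)

# Hypothetical helper: flag values beyond Q1 - k*IQR or Q3 + k*IQR
flag_outliers <- function(x, k = 1.5) {
  q     <- quantile(x, probs = c(0.25, 0.75), type = 5, na.rm = TRUE, names = FALSE)
  iqr   <- q[2] - q[1]
  fence <- c(lower = q[1] - k * iqr, upper = q[2] + k * iqr)
  keep  <- !is.na(x) & (x < fence[["lower"]] | x > fence[["upper"]])
  list(fence = fence, outliers = x[keep])
}

res <- flag_outliers(before)
res$fence      # lower 1.2825, upper 4.1025
res$outliers   # 1.25

# boxplot() relies on boxplot.stats(), which uses Tukey's hinges rather than
# type = 5 quartiles; for this sample the two coincide, so the flags agree.
boxplot.stats(before)$out
```

Because boxplot() uses hinges, its flagged points can occasionally differ from a type=5 calculation on other data-sets, so it is worth checking both when the numbers need to match.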

3 Cumulative Frequency Distributions

par(mfrow=c(1,2))
plot(ecdf(speedBefore$speed), verticals = TRUE,
     las = 1, main = "", do.points = FALSE,
     ylab = "Cumulative relative frequency",
     xlab = "Running speed before amputation (cm/s)")
plot(ecdf(speedAfter$speed), verticals = TRUE,
     las = 1, main = "", do.points = FALSE,
     ylab = "Cumulative relative frequency",
     xlab = "Running speed after amputation (cm/s)")

dev.off()

## null device 
##           1

par(mfrow=c(1,1))

You can pick any point on the vertical axis, connect a straight line to the plotted line, then drop a perpendicular to the x-axis, and you’ll be able to say “so-and-so proportion of the spiders had speeds at or below the x-value”. Out of sheer curiosity, pick $0.5$ on the vertical axis of both groups and draw the straight line to the plot. Now read the approximate running speed that would result if you dropped a perpendicular to the x-axis. What this says is that 50% of spiders in the before group had speeds of at most 2.9 cm/s while 50% of spiders in the after group had speeds of at most 3.51 cm/s.

par(mfrow=c(1,2))
plot(ecdf(speedBefore$speed), verticals = TRUE,
     las = 1, main = "", do.points = FALSE,
     ylab = "Cumulative relative frequency",
     xlab = "Running speed before amputation (cm/s)")
abline(h=0.5)
plot(ecdf(speedAfter$speed), verticals = TRUE,
     las = 1, main = "", do.points = FALSE,
     ylab = "Cumulative relative frequency",
     xlab = "Running speed after amputation (cm/s)")
abline(h=0.5)

dev.off()

## null device 
##           1

par(mfrow=c(1,1))

You don’t have to pick the $0.5$ point; you could have picked $0.9$ and said “90% of spiders in group so-and-so had speeds of so much or less”, so on and so forth.
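Reading values off the plot by eye can be checked numerically, because ecdf() returns a function. The sketch below evaluates it at the two medians and then goes the other way, from a proportion to a speed. The speeds are typed in from the listings earlier in this document and the object names are illustrative.

```r
# Speeds copied from the speedBefore and speedAfter listings
before <- c(1.25, 2.94, 2.38, 3.09, 3.41, 3.00, 2.31, 2.93,
            2.98, 3.55, 2.84, 1.64, 3.22, 2.87, 2.37, 1.91)
after  <- c(2.40, 3.50, 4.49, 3.17, 5.26, 3.22, 2.32, 3.31,
            3.70, 4.70, 4.94, 5.06, 3.22, 3.52, 5.45, 3.40)

F_before <- ecdf(before)   # F_before(v) = proportion of speeds <= v
F_after  <- ecdf(after)

p_before <- F_before(2.9)    # 0.5: half the "before" speeds are at most 2.9
p_after  <- F_after(3.51)    # 0.5: half the "after" speeds are at most 3.51

# From a proportion back to a speed: type = 1 is the inverse of the ECDF,
# i.e. the smallest observed speed with at least 90% of the data at or below it.
q90_before <- quantile(before, probs = 0.9, type = 1, names = FALSE)

p_before
p_after
q90_before             # 3.41
F_before(q90_before)   # 0.9375, the first step of the ECDF at or above 0.9
```

With only 16 spiders per group the ECDF moves in steps of 1/16, so a proportion such as 0.9 falls between steps and the answer is the next observed speed up.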

4 What if I have Three or More Groups?

The psych package (install it before you use it) is very handy here to calculate a number of statistics. We will see it in use with the data-set on sticklebacks:

sticklebackData = read.csv(url("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter03/chap03e3SticklebackPlates.csv"))
head(sticklebackData)

##     id plates genotype
## 1  4-1     11       mm
## 2  4-2     63       Mm
## 3  4-4     22       Mm
## 4  4-5     10       Mm
## 5 4-10     14       mm
## 6 4-12     11       mm

library(psych)
describeBy(sticklebackData$plates, group=sticklebackData$genotype, na.rm=TRUE)

## group: mm
##   vars  n  mean   sd median trimmed  mad min max range skew kurtosis   se
## 1    1 88 11.67 3.57     11   11.42 2.97   6  37    31  4.2    26.92 0.38
## -------------------------------------------------------- 
## group: Mm
##   vars   n  mean    sd median trimmed  mad min max range  skew kurtosis
## 1    1 174 50.38 15.15     59   52.52 7.41  10  68    58 -1.06    -0.18
##     se
## 1 1.15
## -------------------------------------------------------- 
## group: MM
##   vars  n  mean   sd median trimmed  mad min max range  skew kurtosis   se
## 1    1 82 62.78 3.41     63   63.21 1.48  42  69    27 -3.55    17.62 0.38
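If installing psych is not an option, base R can produce a comparable (if less complete) per-group summary. The sketch below uses aggregate() on only the six rows printed by head() above, so its numbers will not match the full-data output; the object names six and by_genotype are illustrative.

```r
# Only the six rows shown by head(sticklebackData); the full data-set is larger
six <- data.frame(
  id       = c("4-1", "4-2", "4-4", "4-5", "4-10", "4-12"),
  plates   = c(11, 63, 22, 10, 14, 11),
  genotype = c("mm", "Mm", "Mm", "Mm", "mm", "mm")
)

# One row per genotype; the statistics are stored as a matrix column "plates"
by_genotype <- aggregate(
  plates ~ genotype, data = six,
  FUN = function(p) c(n = length(p), mean = mean(p),
                      sd = sd(p), median = median(p))
)
by_genotype

# Pull out a single statistic for a single group
mm_mean <- by_genotype$plates[by_genotype$genotype == "mm", "mean"]
mm_mean   # 12
```

On the real data, replacing six with sticklebackData should give the n, mean, sd, and median columns of the describeBy() output, though not its skew or kurtosis.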

Histograms are easily drawn as well:

histogram(~ plates | genotype, data=sticklebackData, layout=c(1,3), type="count", xlab="Number of Plates", ylab="Frequency")

The pattern is very marked: on average the $MM$ group has the most plates (a mean of 62.78), followed by the $Mm$ group (50.38) and then the $mm$ group (11.67).

Constructing a table of frequencies and proportions of the stickleback genotypes is easy as well.

tab.s1 = table(sticklebackData$genotype, dnn = "genotype")
tab.s1

## genotype
##  mm  Mm  MM 
##  88 174  82

tab.s2 = addmargins(tab.s1)
tab.s2

## genotype
##  mm  Mm  MM Sum 
##  88 174  82 344

sticklebackFreq = data.frame(tab.s2)
sticklebackFreq

##   genotype Freq
## 1       mm   88
## 2       Mm  174
## 3       MM   82
## 4      Sum  344

tab.sprop1 = prop.table(tab.s1)
tab.sprop1

## genotype
##        mm        Mm        MM 
## 0.2558140 0.5058140 0.2383721

tab.sprop2 = addmargins(tab.sprop1)
sticklebackRelFreq = data.frame(tab.sprop2)
sticklebackRelFreq

##   genotype      Freq
## 1       mm 0.2558140
## 2       Mm 0.5058140
## 3       MM 0.2383721
## 4      Sum 1.0000000

R Script for Chapter 3

ani

December 15, 2015
