1 - Estimating with Uncertainty

We know that pretty much everything we do with data involves flying blind in the sense that all we have is a sample to go by but our goal is to generalize from the sample to the unseen population. Before we can do so however we need to understand not only the pitfalls of working with samples but also how we may leverage statistical theory to help us build a bridge from our sample to the population it purportedly represents. We’ll start by working with what is the closest thing to a population we may have in our hands … the population of human gene lengths (see the Human Genome Project for details if you are unfamiliar with this monumental effort).

We begin by reading in the human gene length data, treating it as a known population of human gene lengths.

humanGeneLengths = read.csv(url("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter04/chap04e1HumanGeneLengths.csv"))
head(humanGeneLengths)
##   geneLength
## 1       2069
## 2       1303
## 3       1067
## 4       2849
## 5       3378
## 6       2391
summary(humanGeneLengths)
##    geneLength   
##  Min.   :   60  
##  1st Qu.: 1312  
##  Median : 2226  
##  Mean   : 2622  
##  3rd Qu.: 3444  
##  Max.   :99631
HGL.N = nrow(humanGeneLengths)
HGL.mean = mean(humanGeneLengths$geneLength)
HGL.mean
## [1] 2622.027
HGL.var = sum((humanGeneLengths$geneLength - HGL.mean)^2) / HGL.N
HGL.var
## [1] 4149031
HGL.sd = sqrt(HGL.var)
HGL.sd
## [1] 2036.917
data.frame(Parameter = c("Mean", "Standard deviation"), 
  Value = c(HGL.mean, HGL.sd))
##            Parameter    Value
## 1               Mean 2622.027
## 2 Standard deviation 2036.917

Pay close attention to the results of the summary() command. Note, in particular, the mean length of the human gene \((=2,622.0274)\). Note also the rest of the code you see. Since we are treating these data as a population this code manually calculates the mean (HGL.Mean), the population size (HGL.N), variance (HGL.var), and standard deviation (HGL.sd).

Let us map the distribution via a histogram and a box-plot.

hist(humanGeneLengths$geneLength, breaks=seq(0, 100000, by = 500), ylim=c(0,3000), xlab="Gene Length", ylab="Frequency", col="gray", main="Histogram of Human Gene Length")

boxplot(humanGeneLengths$geneLength, horizontal=TRUE, col="gray", xlab="Human Gene Length")

Both plots show a severely right-skewed distribution; the bulk of the lengths are clustering at the lower ends of the distribution. Below you see the authors’ code for subsetting the data by selecting only those observations that have gene lengths of 15,000 nucleotides or less.

geneLengthsUpTo15K = subset(humanGeneLengths, geneLength <= 15000)
hist(geneLengthsUpTo15K$geneLength, breaks = seq(0, 15000, 500), xlab = "Gene length (Number of nucleotides)", ylab = "Frequency",  col = "firebrick", las = 1, main = "", right = FALSE, ylim=c(0,3000))

boxplot(geneLengthsUpTo15K$geneLength, horizontal=TRUE, xlab="Gene length (Number of nucleotides)", col="gray")

2 - Drawing Random Samples from the Population

We will begin with ONE random sample of size 100 (i.e., \(n=100\)). The code below accomplishes this for us. You will also see the code for constructing a histogram and box-plot for the sample, and the mean and standard deviation of the sample.

gene.sample = sample(humanGeneLengths$geneLength, size = 100, replace = TRUE)
#gene.sample = as.data.frame(gene.sample)
summary(gene.sample)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     204    1761    2340    2768    3412   10748
gs.mean = mean(gene.sample)
gs.mean
## [1] 2767.91
gs.sd = sd(gene.sample)
gs.sd
## [1] 1780.529
hist(gene.sample, col="gray", ylim=c(0,35),
     xlab="Gene Length (number of nucleotides)",
     main="Random Sample of Gene Length (n=100)")

boxplot(gene.sample, col="gray", horizontal=TRUE, xlab="Gene Length (number of nucleotides)")

Calculating the standard error of this sample is straight forward (see below). The length() command is calculating the total number of observations in the sample (just for the sake of demonstration, since we knowingly set it to 100 so there is no need to calculate it!).

gs.n = length(gene.sample)
gs.n
## [1] 100
gs.se = gs.sd / sqrt(gs.n)
gs.se
## [1] 178.0529

Now, before we go any farther note that your sample mean 2767.91 is quite some distance from the population mean 2622.0274027

3 - The Sampling Distribution

We will now run some nifty code to draw an increasing number of random samples, each with exactly 100 observations (i.e., \(n=100\) for each sample). For each sample we’ll calculate the sample mean, save this estimate for subsequent use, and then we will plot a histogram and a box-plot of the distribution of all sample means so estimated.

## Drawing 100 random samples of n=100
population = humanGeneLengths
means = rep(NA, 100)
for(i in 1:100) {
  temp.sample = sample(population$geneLength, 1000, replace=TRUE)
  means[i] = mean(temp.sample)
}
means = as.data.frame(means)
means.dat = unlist(means)
mean(means.dat)
## [1] 2606.231
hist(means.dat, main="100 Random Samples of n = 100", xlab = "Sample Mean", col="gray")
abline(v=2622.0274, col="red")

boxplot(means.dat, horizontal=TRUE, main="100 Random Samples of n = 100", xlab = "Sample Mean", col="gray")
abline(v=2622.0274, col="red")

Note: Be careful!! The histogram and box-plot are for 100 sample means and not for 100 gene lengths. The perpendicular red line flags the Population mean (\(= 2622.02\))

I will now expand the size of each random sample drawn from 100 to 10,000. I will then repeat the preceding sampling and plotting sequence.

## Drawing 100 random samples of n=10,000
population = humanGeneLengths
means = rep(NA, 100)
for(i in 1:100) {
  temp.sample = sample(population$geneLength, 10000, replace=TRUE)
  means[i] = mean(temp.sample)
}
means = as.data.frame(means)
means.dat = unlist(means)
mean(means.dat)
## [1] 2620.13
hist(means.dat, main="100 Random Samples of n = 10,000", xlab = "Sample Mean", col="gray")
abline(v=2622.0274, col="red")

boxplot(means.dat, horizontal=TRUE, main="100 Random Samples of n = 100", xlab = "Sample Mean", col="gray")
abline(v=2622.0274, col="red")

Now I’ll draw many random samples of size = 100 from the population.

# Drawing a large number of random samples of n=100
population = humanGeneLengths
means= rep(NA, 20290)

for(i in 1:20290) {
  temp.sample=sample(population$geneLength, size=100, replace=TRUE)
  means[i]=mean(temp.sample)
}

means = as.data.frame(means)
means.dat = unlist(means)
mean(means.dat)
## [1] 2621.432
hist(means.dat, main="Many Random Samples of n = 100", xlab = "Sample Mean", col="red", ylim=c(0,5000), xlim=c(2000,4500))
abline(v=2622.0274, col="white")

And now a large number of possible samples of size = 10,000

# Drawing a large number of random samples of n=10,000
population = humanGeneLengths
means= rep(NA, 20290)

for(i in 1:20290) {
  temp.sample=sample(population$geneLength, size=10000, replace=TRUE)
  means[i]=mean(temp.sample)
}

means = as.data.frame(means)
means.dat = unlist(means)
mean(means.dat)
## [1] 2622.168
hist(means.dat, main="Many Random Samples of n = 10000", xlab = "Sample Mean", col="red", ylim=c(0,4000), xlim=c(2525,2725))
abline(v=2622.0274, col="white")

The commands below will allow us to visually compare the sampling distribution of the mean for different sample sizes … \(n=20\); \(n=100\), and; \(n=500\)

results20 = vector()
for(i in 1:10000){
    tmpSample = sample(humanGeneLengths$geneLength, size = 20, replace = FALSE)
    results20[i] = mean(tmpSample)
    }
results100 = vector()
for(i in 1:10000){
    tmpSample = sample(humanGeneLengths$geneLength, size = 100, replace = FALSE)
    results100[i] = mean(tmpSample)
    }
results500 = vector()
for(i in 1:10000){
    tmpSample = sample(humanGeneLengths$geneLength, size = 500, replace = FALSE)
    results500[i] = mean(tmpSample)
    }
par(mfrow = c(3,1))  # put 3 plots on the same page in 3 rows, 1 column
hist(results20, breaks = 50, right = FALSE, xlim = range(results20),
    col = "firebrick", main = "n = 20", xlab = "")
hist(results100, breaks = 50, right = FALSE, xlim = range(results20),
    col = "firebrick", main = "n = 100", xlab = "")
hist(results500, breaks = 50, right = FALSE, xlim = range(results20),
    col = "firebrick", main = "n = 500", xlab = "")

par(mfrow = c(1,1))  # change layout back to 1 plot per page

4 - The Confidence Interval (A Demonstration)

\[\text{Approximate 95% CI: } \bar{Y} \pm (2)\left(\dfrac{s}{\sqrt{n}}\right) = \bar{Y} \pm (2)\left(SE_{\bar{Y}}\right)\] Recall the notion of the confidence interval. In brief, if we were to take all possible random samples of size \(=n\), calculate the 95% confidence interval for the sample mean in each of these samples, then 95% of the resulting confidence intervals would trap the population mean.

We could try and do this for samples of size \(=n\) drawn from the humanGeneLengths data but that would overwhelm the computer. So instead we’ll look at two simple examples that hopefully get two key points across. The first one below will plot 20 confidence intervals calculated with 20 random samples, each of size \(=100\).

results100 = data.frame(mean = rep(0, 20), lower = rep(0, 20), upper = rep(0, 20))
index = 1:20
for(i in index){
  tmpSample = sample(humanGeneLengths$geneLength, size = 100, replace = TRUE)
t = t.test(tmpSample)
results100$mean[i] = t$estimate
results100$lower[i] = t$conf.int[1]
results100$upper[i] = t$conf.int[2]
}
plot(index ~ mean, data = results100, pch = 16, col = "red", yaxt = "n", xlim = c(min(results100$lower), max(results100$upper)), bty = "l", xlab = "Gene length (number of nucleotides)", ylab = "Sample Index")
lines(c(2622.0274, 2622.0274), c(0,20), lty = 2)
segments(results100$lower, index, results100$upper, index)
segments(results100$lower, index - 0.25, results100$lower, index + 0.25)
segments(results100$upper, index - 0.25, results100$upper, index + 0.25)

What is obvious is how the sample means fluctuate. The example below extends the point a bit by emphasizing the fact that some confidence intervals will miss the population mean altogether. Here the number of randomly drawn samples has been pushed up to 100.

n.draw = 100
mu = mean(humanGeneLengths$geneLength)
n = 100
SD = sd(humanGeneLengths$geneLength)
draws = matrix(rnorm(n.draw * n, mu, SD), n)
get.conf.int = function(x) t.test(x)$conf.int
conf.int = apply(draws, 2, get.conf.int)
trap.n = sum(conf.int[1, ] <= mu & conf.int[2, ] >= mu)
trap.n
## [1] 97
par(bg = "ghostwhite")
plot(range(conf.int), c(0, 1 + n.draw), type = "n", 
     xlab = "Gene Length (in mm)", 
     ylab = "Sample Run",
     main = "95% Confidence Intervals (100 Sample Runs)",
     cex.main =  1, cex.lab = 1)
for (i in 1:n.draw) 
  lines(conf.int[, i], 
  rep(i, 2), lwd = 1, col = "black")
abline(v = 2622.0274, lwd = 1, lty = 2, col="firebrick2")

Now you see 97 confidence intervals trapping the population mean (flagged by the red dashed line at \(2622.0274\)). Of course, in practice you would typically work with a single sample and try to do the best you can. However, henceforth you should recognize the fact that your answers could be way off-base, and you may not even know it! Hence a golden rule-of-thumb is to pay attention to … * Sampling – don’t use the wrong sampling design or sample incorrectly * Sample Size – get as large a sample as you can without violating the appropriate sampling protocols * Standard Error – pay attention to the standard errors of your estimates; large standard errors should lead you to be more cautious with your generalizations

5 - Error Bars

We’ll first draw a strip chart and then the standard error bars for the locust serotonin data used in Chapter 02.

locustData = read.csv(url("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter02/chap02f1_2locustSerotonin.csv"))
head(locustData)
##   serotoninLevel treatmentTime
## 1            5.3             0
## 2            4.6             0
## 3            4.5             0
## 4            4.3             0
## 5            4.2             0
## 6            3.6             0
stripchart(serotoninLevel ~ treatmentTime, data = locustData, 
method = "jitter", vertical = TRUE)

We need the mean and the standard to superimpose on the strip charts. We used the dplyr and the psych packages in Chapter 03 to get mean, median, etc. for each group. Will these work for also calcualting the standard error?

locustData$groups = factor(locustData$treatmentTime, levels=c(0, 1, 2), labels=c("0 Hours", "1 Hour", "2 Hours"))

library(dplyr)

grouped = group_by(locustData, groups)

tab.A = summarise(grouped, 
  Mean = mean(serotoninLevel, na.rm=TRUE),
  Std.Dev = sd(serotoninLevel, na.rm=TRUE),
  n = n(),
  Std.Error = (Std.Dev / sqrt(n))
                    )

tab.A
## # A tibble: 3 x 5
##   groups   Mean Std.Dev     n Std.Error
##   <fct>   <dbl>   <dbl> <int>     <dbl>
## 1 0 Hours  6.36    4.82    10      1.52
## 2 1 Hour   8.04    4.96    10      1.57
## 3 2 Hours 10.8     5.33    10      1.68

Now the code for the finished plot. This code looks complicated but we’ve done it before so the only new element is the addition of the standard error.

offsetAmount = 0.2
par(bty = "l") 

stripchart(serotoninLevel ~ groups, data = locustData, vertical = TRUE, method = "jitter", jitter = 0.1, pch = 16, xlab = "Treatment time (hours)", ylab = "Serotonin (pmoles)", col = "firebrick", cex = 1, las = 1, ylim = c(0, max(locustData$serotoninLevel)))

points(tab.A$Mean ~ c(c(1,2,3) + offsetAmount), pch = 16, cex = 1.2)

arrows( c(c(1,2,3) + offsetAmount), tab.A$Mean - tab.A$Std.Error, c(c(1,2,3) + offsetAmount), tab.A$Mean + tab.A$Std.Error,  angle = 90, code = 3, length = 0.1)

Note: You will often see bar-charts drawn with error bars. This is a terrible idea as tersely summarized here “Beware of Dynamite”

6 - Some Worked Practice Problems

6.1 - Problem #1

Let us enter the data:

Y = c(112, 128, 108, 129, 125, 153, 155, 132, 137)

Now we can calculate everything needed to respond to the questions.

mean.Y = mean(Y)
mean.Y
## [1] 131
sd.Y = sd(Y)
sd.Y
## [1] 15.95306
n.Y = length(Y)
n.Y
## [1] 9
se.Y = sd.Y / sqrt(n.Y)
se.Y 
## [1] 5.317685
approx.CI = c(mean.Y - 2*se.Y, mean.Y + 2*se.Y)
approx.CI
## [1] 120.3646 141.6354

So, the standard deviation is 15.9530561, the sample size (\(n\)) is 9, the standard error of the mean is 5.3176854, and the approximate 95% confidence interval for the mean is 120.3646292, 141.6353708

6.2 - Problem #7

We could enter the raw data but why do that when the data-set is available.

fireflies = read.csv(url("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter04/chap04q07FireflyFlash.csv"))
head(fireflies)
##   flash
## 1    79
## 2    80
## 3    82
## 4    83
## 5    86
## 6    85
summary(fireflies)
##      flash       
##  Min.   : 79.00  
##  1st Qu.: 87.50  
##  Median : 95.00  
##  Mean   : 95.94  
##  3rd Qu.:102.00  
##  Max.   :119.00

and now the calculations:

mean.flash = mean(fireflies$flash)
mean.flash
## [1] 95.94286
sd.flash = sd(fireflies$flash)
sd.flash
## [1] 10.9945
n.flash = length(fireflies$flash)
n.flash
## [1] 35
se.flash = sd.flash / sqrt(n.flash)
se.flash
## [1] 1.858409
approx.CI = c(mean.flash - 2*se.flash, mean.flash + 2*se.flash)
approx.CI
## [1] 92.22604 99.65968
  1. The estimated sample mean is 95.9428571. This quantity estimates the average flash duration in the population of male fireflies (i.e., it estimates the population mean)
  2. It will rarely equal the population mean because of sampling variability.
  3. The standard error is 1.8584094
  4. The standard error tells us how dispersed the sampling distribution of sample means is likely to be
  5. The approximate 95% confidence interval is 92.2260384, 99.6596759.
  6. It tells us the plausible range of values of the population mean. Technically, it tells us that if we drew all possible random samples of this size \((i.e., n=35)\) and calculated 95% confidence intervals for each sample, 95% of the resulting 95% confidence intervals will include the population mean.