Now we shift gears and expand our comparisons to two “groups” – males versus females, syrup versus water, alive versus dead horned lizards, and so on. We’ll start with the simpler case: the paired t-test.

Paired \(t\)-tests

A classic paired design is a “before” and “after” study, such as the speed in syrup versus water study. You have two conditions, one the control and the other the treatment, and every unit is measured under both conditions. You then take the difference in the outcome of each unit between the control and treatment conditions. If the treatment has no effect then, on average, there should be no difference in outcomes. Of course, some units may see a positive or a negative difference in scores, but when averaged over the sample as a whole, so long as the \(H_0\) of no difference (no impact) is true, the sample average should be statistically indistinguishable from zero.

\[\begin{eqnarray*} d_{i} &=& Y_{1i} - Y_{2i} \\ \bar{d} &=& \dfrac{\sum{d_i}}{n} \\ s^{2}_{d} &=& \dfrac{\sum(d_i - \bar{d})^2}{n-1} \\ s_d &=& \sqrt{\dfrac{\sum(d_i - \bar{d})^2}{n-1}} \end{eqnarray*}\]

Test Statistic: \(t = \dfrac{\bar{d} - \mu_d}{s_d/\sqrt{n}}; df=n-1\)

Interval Estimate: \(\bar{d} \pm t_{\alpha/2}\left(\dfrac{s_d}{\sqrt{n}}\right); df=n-1\)

The Hypotheses:

  • \(H_0: \mu_{d}=0; H_A: \mu_{d} \neq 0\)
  • \(H_0: \mu_{d} \leq 0; H_A: \mu_{d} > 0\)
  • \(H_0: \mu_{d} \geq 0; H_A: \mu_{d} < 0\)

Assumptions:

  1. Random samples
  2. The differences are approximately Normally distributed (… but \(Y_1\) and \(Y_2\) can follow any distribution)

As usual, we’ll assume random samples but we should test for Assumption 2 (normality). For now we won’t worry about transforming the data, should normality not hold for the sample.
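To see how these formulas fit together, here is a minimal sketch in base R that computes the paired \(t\) statistic by hand – the before and after vectors are made up purely for illustration – and checks the result against t.test():

before = c(12.1, 9.8, 11.4, 10.2, 13.0, 9.5)   # hypothetical control-condition scores
after = c(11.2, 9.1, 10.8, 10.5, 12.1, 8.9)    # hypothetical treatment-condition scores
d = before - after                             # d_i = Y_1i - Y_2i
d.bar = mean(d); s.d = sd(d); n = length(d)
t.value = (d.bar - 0) / (s.d / sqrt(n))        # test statistic under H0: mu_d = 0
p.value = 2 * pt(-abs(t.value), df = n - 1)    # two-tailed p-value
c(t = t.value, p = p.value)
t.test(before, after, paired = TRUE)           # should report the same t and p-value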

Practice Problems

Testosterone and Blackbirds

In many species, males are more likely to attract females if the males have high testosterone levels. Are males with high testosterone paying a cost for this extra mating success in other ways? One hypothesis is that males with high testosterone might be less able to fight off disease – that is, their high levels of testosterone might reduce their immunocompetence. To test this hypothesis, researchers artificially increased the testosterone levels of 13 male red-winged blackbirds by implanting a small permeable tube filled with testosterone. They measured immunocompetence as the rate of antibody production in response to a nonpathogenic antigen in each bird’s blood serum, both before and after the implant. The antibody production rates were measured optically, in units of the natural log of \(10^{-3}\) optical densities per minute \(\left( \ln[mOD/min]\right)\). Did mean antibody production change after implants were inserted?

birds = read.csv(url("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter12/chap12e2BlackbirdTestosterone.csv"))

boxplot(birds$logBeforeImplant, horizontal = TRUE, xlab = "Pre-implant antibody production rate (in log units)", col = "navajowhite")

boxplot(birds$logAfterImplant, horizontal = TRUE, xlab = "Post-implant antibody production rate (in log units)", col = "navajowhite")

qqnorm(birds$logBeforeImplant);  qqline(birds$logBeforeImplant, col = "darkred")

qqnorm(birds$logAfterImplant);  qqline(birds$logAfterImplant, col = "darkred")

shapiro.test(birds$logBeforeImplant)
## 
##  Shapiro-Wilk normality test
## 
## data:  birds$logBeforeImplant
## W = 0.73117, p-value = 0.001159
shapiro.test(birds$logAfterImplant)
## 
##  Shapiro-Wilk normality test
## 
## data:  birds$logAfterImplant
## W = 0.8182, p-value = 0.01131

Uh oh! Again, for now we’ll ignore the fact that the Shapiro-Wilk results are yielding such low p-values.
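Strictly speaking, Assumption 2 concerns the differences rather than each variable separately, so a more direct check would be to run Shapiro-Wilk on the pre/post differences themselves:

shapiro.test(birds$logAfterImplant - birds$logBeforeImplant)

Setting that aside, let us proceed to the t-test.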

\(H_0\): Mean change in antibody production pre- and post-implant is zero \((\mu_{d} = 0)\)
\(H_A\): Mean change in antibody production pre- and post-implant is NOT zero \((\mu_{d} \neq 0)\)

t.test(birds$logBeforeImplant, birds$logAfterImplant, paired = TRUE, conf.level = 0.95, alternative = "two.sided")
## 
##  Paired t-test
## 
## data:  birds$logBeforeImplant and birds$logAfterImplant
## t = -1.2714, df = 12, p-value = 0.2277
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.15238464  0.04007695
## sample estimates:
## mean of the differences 
##             -0.05615385

Note the structure of the command here. You specify both variables, and that paired = TRUE. The remaining arguments are the usual ones tied to the \(\alpha\) for the test and to the alternative hypothesis (\(H_A\)).

The test yields a high p-value, so we cannot reject the null hypothesis. These data do not allow us to reject the notion that the mean change in antibody production pre- and post-implant is zero.
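As a quick sanity check, the reported p-value can be recovered from the t statistic and its degrees of freedom:

2 * pt(-1.2714, df = 12)   # two-tailed p-value for t = -1.2714 with 12 df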

Problem 1

Can the death rate be influenced by tax incentives? A pair of researchers investigated this possibility using data on deaths in the United States in years in which the government announced it was changing (usually raising) the tax rate on inheritance (the estate tax). The authors calculated the death rate for the 14 days before and 14 days after the changes in the estate tax rates took effect.

\(H_0:\) Estate tax rates have no effect on the death rate \((\mu_d = 0)\)

\(H_A:\) Estate tax rates have an effect on the death rate \((\mu_d \neq 0)\)

\(\alpha=0.05\)

taxes = read.csv(url("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter12/chap12q01DeathAndTaxes.csv"))

boxplot(taxes$lowerTaxDeaths, horizontal = TRUE, xlab = "Death rates under lower tax rates", col = "navajowhite")

boxplot(taxes$HigherTaxDeaths, horizontal = TRUE, xlab = "Death rates under higher tax rates", col = "navajowhite")

qqnorm(taxes$lowerTaxDeaths);  qqline(taxes$lowerTaxDeaths, col = "darkred")

qqnorm(taxes$HigherTaxDeaths);  qqline(taxes$HigherTaxDeaths, col = "darkred")

shapiro.test(taxes$lowerTaxDeaths)
## 
##  Shapiro-Wilk normality test
## 
## data:  taxes$lowerTaxDeaths
## W = 0.95127, p-value = 0.6603
shapiro.test(taxes$HigherTaxDeaths)
## 
##  Shapiro-Wilk normality test
## 
## data:  taxes$HigherTaxDeaths
## W = 0.98489, p-value = 0.9873
t.test(taxes$lowerTaxDeaths, taxes$HigherTaxDeaths, paired = TRUE, conf.level = 0.95, alternative = "two.sided")
## 
##  Paired t-test
## 
## data:  taxes$lowerTaxDeaths and taxes$HigherTaxDeaths
## t = 1.9121, df = 10, p-value = 0.08491
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.2244865  2.9408501
## sample estimates:
## mean of the differences 
##                1.358182

Given that the \(p-value > 0.05\) we cannot reject \(H_0\); the data do not provide evidence that changes in estate tax rates influence death rates.

Problem 12

Ostriches live in hot environments, and they are normally exposed to the sun for long periods. Mammals in similar environments have special mechanisms for reducing the temperature of their brain relative to their body temperature. Fuller et al. (2003) tested whether ostriches could do the same. The mean body and brain temperatures of six ostriches were recorded under typical hot conditions. Test for a mean difference in temperature between the body and brain for these ostriches.

\(H_0:\) There is no difference in ostriches’ brain versus body temperature \((\mu_d=0)\)

\(H_A:\) There is a difference in ostriches’ brain versus body temperature \((\mu_d \neq 0)\)

ostrich = read.csv(url("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter12/chap12q12OstrichTemp.csv"))

boxplot(ostrich$bodyTemp, horizontal = TRUE, xlab = "Body Temperature", col = "navajowhite")

boxplot(ostrich$brainTemp, horizontal = TRUE, xlab = "Brain Temperature", col = "navajowhite")

qqnorm(ostrich$bodyTemp);  qqline(ostrich$bodyTemp, col = "darkred")

qqnorm(ostrich$brainTemp);  qqline(ostrich$brainTemp, col = "darkred")

shapiro.test(ostrich$bodyTemp)
## 
##  Shapiro-Wilk normality test
## 
## data:  ostrich$bodyTemp
## W = 0.92434, p-value = 0.5371
shapiro.test(ostrich$brainTemp)
## 
##  Shapiro-Wilk normality test
## 
## data:  ostrich$brainTemp
## W = 0.92054, p-value = 0.5092
t.test(ostrich$bodyTemp, ostrich$brainTemp, paired = TRUE, conf.level = 0.95, alternative = "two.sided")
## 
##  Paired t-test
## 
## data:  ostrich$bodyTemp and ostrich$brainTemp
## t = -5.6099, df = 5, p-value = 0.002489
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.9454120 -0.3512547
## sample estimates:
## mean of the differences 
##              -0.6483333

Given that the \(p-value = 0.002489\) we can easily reject \(H_0\); the data suggest that there is a difference in brain versus body temperature of ostriches. The 95% confidence interval for the mean difference between body and brain temperatures is \([-0.9454, -0.3513]\).

Two-Sample Comparison of Means

Here we have two groups, often referred to as two independent samples, one exposed to the treatment and the other exposed to the placebo (this is the control group). Now we have to do some extra legwork to figure out whether the groups come from populations with equal or unequal variances. Which assumption we go with will also determine the degrees of freedom.

Assumptions:

  1. Random samples
  2. Variables are drawn from normally distributed Populations
  3. The standard deviation (and hence the variance) is the same in both populations

Now we have two assumptions to test – normality and equal variances. By now we know how to test for normality, so let us set that aside. What about testing for equal variances? There are several tests but we’ll focus on only two – (a) the F-test, and (b) Levene’s test.

Testing for Equality of Variances

The \(F\)-test
  • Assumes normally distributed populations (hence sensitive to departures)
  • If \(X_1\) and \(X_2\) are two independent random variables distributed as \(\chi^{2}\) with \(df_1\) and \(df_2\), respectively, then the ratio \(\dfrac{\frac{X_1}{df_1}}{\frac{X_2}{df_2}}\) follows the \(F\) distribution with \(df_1\) in the numerator and \(df_2\) in the denominator
  • Hypotheses: \(H_0: \sigma^{2}_{1} = \sigma^{2}_{2}; H_A: \sigma^{2}_{1} \neq \sigma^{2}_{2}\)
  • Test Statistic: \(F=\dfrac{s^{2}_{1}}{s^{2}_{2}}\), compared against \(F_{\alpha/2, df_1, df_2}\)

Note: \(s^{2}_{1}\) is the larger sample variance
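As a minimal sketch of the mechanics (the variances and degrees of freedom below are made up purely for illustration):

s1.sq = 5.0; s2.sq = 1.6             # hypothetical sample variances, larger one first
df1 = 14; df2 = 19                   # hypothetical degrees of freedom (n - 1 for each sample)
F.value = s1.sq / s2.sq
F.value > qf(1 - 0.05/2, df1, df2)   # TRUE means reject H0 at alpha = 0.05
2 * (1 - pf(F.value, df1, df2))      # two-tailed p-value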

Levene’s Test for Homogeneity of Variances
  • Assumes roughly symmetric frequency distributions within all groups
  • Robust to violations of assumption
  • Can be used with 2 or more groups
  • Hypotheses: \(H_0: \sigma^{2}_{1} = \sigma^{2}_{2} = \sigma^{2}_{3} = \cdots \sigma^{2}_{k}\); \(H_A:\) For at least one pair of \((i,j)\) we have \(\sigma^{2}_{i} \neq \sigma^{2}_{j}\)
  • Test Statistic: \(W = \dfrac{ (N-k)\displaystyle\sum^{k}_{i=1}n_{i}\left( \bar{Z}_{i} - \bar{Z} \right)^{2} }{(k-1)\displaystyle\sum^{k}_{i=1}\sum^{n_i}_{j=1}\left( Z_{ij} - \bar{Z}_{i}\right)^{2}}\)
  • \(Z_{ij} = |{Y_{ij} - \bar{Y}_i}|\); \(\bar{Z}_i\) is the mean for all \(Y\) in the \(i^{th}\) group; \(\bar{Z}\) is the mean for all \(Y\) in the study; \(k\) is the number of groups in the study; and \(n_i\) is the sample size for group \(i\)
  • If you opt for the more robust version that uses the Median, then, \(Z_{ij} = |Y_{ij} - \tilde{Y}_{i}|\) where \(\tilde{Y}_{i}\) is the median of the \(i^{th}\) group
  • \(W \sim F_{\alpha, k-1, n-k}\)
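To make the \(W\) formula concrete, here is a rough sketch in base R that computes the mean-centered version by hand for two made-up groups (the data are invented purely for illustration); it should agree with leveneTest(..., center = mean) from the car package:

y1 = c(4.1, 5.3, 6.0, 5.5, 4.8)                  # hypothetical group A
y2 = c(3.9, 7.2, 5.1, 8.0, 6.3)                  # hypothetical group B
y = c(y1, y2)
g = factor(rep(c("A", "B"), times = c(5, 5)))
z = abs(y - ave(y, g, FUN = mean))               # Z_ij = |Y_ij - Ybar_i|
z.bar.i = tapply(z, g, mean)                     # group means of the Z_ij
n.i = tapply(z, g, length)
N = length(y); k = nlevels(g)
W = ((N - k) * sum(n.i * (z.bar.i - mean(z))^2)) /
    ((k - 1) * sum((z - ave(z, g, FUN = mean))^2))
1 - pf(W, k - 1, N - k)                          # p-value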

Some Rules-of-thumb:

  • Draw larger samples if you suspect the Population(s) may be skewed
  • Go with the assumption of equal variances if both of the following are met:
    1. The assumption is theoretically justified, and the standard deviations are fairly close
    2. \(n_1 \geq 30\) and \(n_2 \geq 30\)
  • Go with the assumption of unequal variances if both of the following are met:
    1. One standard deviation is at least twice the other standard deviation
    2. \(n_1 < 30\) or \(n_2 < 30\)

Two-Tailed Hypothesis:

\(H_0\): \(\mu_{1} = \mu_{2}\); \(H_A\): \(\mu_{1} \neq \mu_{2}\)

These can be rewritten as \(H_0\): \(\mu_{1} - \mu_{2} = 0\); \(H_A\): \(\mu_{1} - \mu_{2} \neq 0\)

One-Tailed Hypotheses:

  • \(H_0\): \(\mu_{1} \leq \mu_{2}\); \(H_A\): \(\mu_{1} > \mu_{2}\)

These can be rewritten as \(H_0\): \(\mu_{1} - \mu_{2} \leq 0\); \(H_A\): \(\mu_{1} - \mu_{2} > 0\)

  • \(H_0\): \(\mu_{1} \geq \mu_{2}\); \(H_A\): \(\mu_{1} < \mu_{2}\)

These can be rewritten as \(H_0\): \(\mu_{1} - \mu_{2} \geq 0\); \(H_A\): \(\mu_{1} - \mu_{2} < 0\)

The Test Statistic:

\[t = \dfrac{\left(\bar{Y}_{1} - \bar{Y}_{2} \right) - \left(\mu_{1} - \mu_{2} \right)}{SE_{\bar{Y}_{1} - \bar{Y}_{2}}} \]

The pooled sample variance: \[s^{2}_{p} = \dfrac{df_{1}s^{2}_{1} + df_{2}s^{2}_{2}}{df_{1} + df_{2}}\] \[df_{1}=n_{1}-1; df_{2}=n_{2}-1\]

The standard error \(SE_{\bar{Y}_{1} - \bar{Y}_{2}}\) and the \(df\) are calculated in one of two ways:

  1. Assuming equal population variances

\[SE_{\bar{Y}_{1} - \bar{Y}_{2}} = \sqrt{ s^{2}_{p} \left(\dfrac{1}{n_{1}} + \dfrac{1}{n_{2}}\right) } \] \[df = n_{1} + n_{2} - 2\]

  2. Assuming unequal population variances \[SE_{\bar{Y}_{1} - \bar{Y}_{2}} = \sqrt{\dfrac{s^{2}_{1}}{n_{1}} + \dfrac{s^{2}_{2}}{n_{2}} } \]

\[\text{approximate } df = \dfrac{\left( \dfrac{s^{2}_{1}}{n_1} + \dfrac{s^{2}_{2}}{n_2} \right)^{2}}{\left[\dfrac{\left(\frac{s^{2}_{1}}{n_{1}} \right)^{2}}{df_{1}} + \dfrac{\left( \frac{s^{2}_{2}}{n_2} \right)^{2}}{df_{2}} \right]}\]

Note: The approximate df should be rounded down to the nearest integer
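To keep the two recipes straight, here is a hedged sketch that wraps each calculation in a small R function (the function names are my own, not from any package):

# SE assuming equal population variances (uses the pooled variance)
se.pooled = function(s1, s2, n1, n2) {
  sp2 = ((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2)
  sqrt(sp2 * (1/n1 + 1/n2))
}
# SE and approximate df assuming unequal population variances (Welch)
se.welch = function(s1, s2, n1, n2) sqrt(s1^2/n1 + s2^2/n2)
df.welch = function(s1, s2, n1, n2) {
  num = (s1^2/n1 + s2^2/n2)^2
  den = (s1^2/n1)^2/(n1 - 1) + (s2^2/n2)^2/(n2 - 1)
  floor(num / den)   # rounded down to the nearest integer
}
se.welch(1.2, 0.8, 15, 20); df.welch(1.2, 0.8, 15, 20)   # made-up summary statistics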

Practice Problems

Problem 6

Do males in polyandrous populations evolve larger testes than males in monogamous populations because larger testes produce more sperm? To test this hypothesis, researchers carried out an experiment on eight separate lines of yellow dung flies. In four of these lines, each female was mated with three males before laying eggs. In the other four lines, the females mated only once. After 10 generations, testes size (cross-sectional area) was measured in each of the eight lines. Test whether testes size differs between the two groups.

\(H_0:\) Mean testes size is the same for males of monogamous and polyandrous populations \((\mu_M - \mu_P = 0)\)

\(H_A:\) Mean testes size differs between males of monogamous and polyandrous populations \((\mu_M - \mu_P \neq 0)\)

testes = read.csv(url("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter12/chap12q6TestesSize.csv"))

boxplot(testesArea ~ matingSystem, data = testes, horizontal = TRUE, xlab = "Testes Area", col = "navajowhite")

library(lattice)

qqmath(~ testesArea | matingSystem, data = testes, 
       distribution = qnorm, prepanel = prepanel.qqmathline,
       panel = function(x, ...) {
         panel.qqmathline(x, ...)
         panel.qqmath(x, ...)
       })

tapply(testes$testesArea, testes$matingSystem, FUN = shapiro.test)
## $Monogamous
## 
##  Shapiro-Wilk normality test
## 
## data:  X[[i]]
## W = 0.9202, p-value = 0.5381
## 
## 
## $Polyandrous
## 
##  Shapiro-Wilk normality test
## 
## data:  X[[i]]
## W = 0.99994, p-value = 0.9999

Note a few things here. First, your data are arranged differently – you have one variable that records the outcome for both groups and another variable that indicates the group each measurement belongs to. Second, we are now using library(lattice) to generate the Q-Q plots for each group, side by side. Third, the shapiro.test() command is being run by group by nesting it within tapply(), since shapiro.test() itself does not accept a group indicator.
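If you prefer, base R’s by() function offers an equivalent way to run the normality test by group, and should produce the same two results:

by(testes$testesArea, testes$matingSystem, shapiro.test)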

The substantive upshot of all of this is that the distributions appear approximately normal in both groups, and the Shapiro-Wilk p-values are quite high. We can proceed with the t-test, but before we do so we’ll have to figure out whether to run the test with equal or unequal variances assumed. Let us run the F-test and Levene’s test to see which assumption to go with here.

var.test(testesArea ~ matingSystem, data = testes, conf.level = 0.95, alternative = "two.sided", ratio = 1)
## 
##  F test to compare two variances
## 
## data:  testesArea by matingSystem
## F = 0.84559, num df = 3, denom df = 3, p-value = 0.8936
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##   0.05476898 13.05519098
## sample estimates:
## ratio of variances 
##          0.8455882

The p-value of the F-test is quite high, so we are unable to reject the null hypothesis of equal variances. What if we had used Levene’s test instead?

library(car)
leveneTest(testesArea ~ matingSystem, data = testes, center = mean)
## Levene's Test for Homogeneity of Variance (center = mean)
##       Df F value Pr(>F)
## group  1  0.0429 0.8428
##        6
leveneTest(testesArea ~ matingSystem, data = testes, center = median)
## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value Pr(>F)
## group  1   0.038 0.8519
##        6

The same conclusion results regardless of whether we opt for center = mean or center = median. Now we can run the \(t\)-test assuming equal variances.

t.test(testesArea ~ matingSystem, data = testes, conf.level = 0.95, alternative = "two.sided", var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  testesArea by matingSystem
## t = -4.4824, df = 6, p-value = 0.004182
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.15845444 -0.04654556
## sample estimates:
##  mean in group Monogamous mean in group Polyandrous 
##                    0.8475                    0.9500

The \(p-value < 0.05\) so we can easily reject \(H_0\); the data suggest that average testes size differs between males of polyandrous and monogamous populations.

What if you had to test the \(H_A\) that testes size is larger in males of polyandrous populations? What would your conclusion be? What about the confidence interval? Note that R orders the groups alphabetically, so Monogamous is group 1; the alternative \(\mu_M < \mu_P\) therefore corresponds to alternative = "less".

t.test(testesArea ~ matingSystem, data = testes, conf.level = 0.95, alternative = "less", var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  testesArea by matingSystem
## t = -4.4824, df = 6, p-value = 0.002091
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
##         -Inf -0.05806458
## sample estimates:
##  mean in group Monogamous mean in group Polyandrous 
##                    0.8475                    0.9500

Problem 15

Red-winged blackbird males defend territories and attract females to mate and raise young there. A male protects nests and females on his territory both from other males and from predators. Males also frequently mate with females on adjacent territories. When males are successful in mating with females outside their territory, do they also attempt to protect these females and their nests from predators? An experiment measured the aggressiveness of males toward stuffed magpies placed on the territories adjacent to the males’ own. The measurement scale was such that larger scores indicated greater aggressiveness. These scores were normally distributed. Later, researchers used DNA techniques on chicks in nests to identify whether the male had mated with the female on the adjacent territory. They then compared the aggressiveness scores of males who had mated with females on adjacent territories with those of males who had not. Test whether there is a difference in mean aggressiveness scores between the two groups of males.

\(H_0\): There is no difference in mean aggressiveness scores of the two groups \((\mu_1 - \mu_2 = 0)\)

\(H_A\): There is a difference in mean aggressiveness scores of the two groups \((\mu_1 - \mu_2 \neq 0)\)

Since we lack the raw data, we’ll have to work with the summary statistics that have been provided:

Group                          Mean \((\bar{Y})\)   Std. Dev. \((s)\)   Sample Size \((n)\)
Mated with neighbor                     0.806               1.135                    10
Did not mate with neighbor             -0.168               0.543                    36

Now on to the manual calculations:

d = (0.806 - (-0.168)); d                 # difference in the two group means
## [1] 0.974
n1 = 10; n2 = 36
var1 = (1.135)^2; var2 = (0.543)^2; df1 = (n1 - 1); df2 = (n2 - 1)
# Pooled variance (used only in the equal-variances version further below)
pooled.variance = ((df1 * var1) + (df2 * var2))/(df1 + df2); pooled.variance
## [1] 0.4980395
# Standard error assuming unequal population variances
std.error = sqrt( (var1/n1) + (var2/n2) ); std.error
## [1] 0.3701523
t.value = d/std.error; t.value
## [1] 2.631349
approx.df = ((var1/n1 + var2/n2)^2 )/(((var1/n1)^2/df1 + (var2/n2)^2/df2)); approx.df
## [1] 10.17021
# Two-tailed p-value, with the approximate df rounded down to 10
p.value = 2 * (1 - pt(t.value, df=10))
p.value
## [1] 0.02510396
# F-test for equal variances, larger variance in the numerator;
# this is a one-tailed p-value, so compare it against alpha/2
calculated.F = var1/var2; calculated.F
## [1] 4.369101
p.value = 1 - pf(calculated.F, df1, df2)
p.value
## [1] 0.0007097568

The \(p-value < 0.05\) allows us to reject \(H_0\); the data suggest a difference in the average aggressiveness scores of the two groups. Note also the F-test result: the one-tailed p-value of 0.0007 is well below \(\alpha/2 = 0.025\), so the variances clearly differ and the unequal-variances version of the test is the appropriate one here.

If we had worked with the assumption of equal variances we would have run this instead:

# Standard error using the pooled variance
std.error = sqrt( pooled.variance * (1/n1 + 1/n2) ); std.error
## [1] 0.2522665
t.value = d/std.error; t.value
## [1] 3.860996
df = (n1 + n2) - 2; df
## [1] 44
# Two-tailed p-value
p.value = 2 * (1 - pt(t.value, df=44))
p.value
## [1] 0.0003663202

We reach the same conclusion as under the assumption of unequal population variances, although given the F-test result the unequal-variances version is the more defensible choice here.
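Finally, since summary-statistics problems like this one recur, the manual steps can be wrapped into a small helper function. This is just a sketch – the name t.from.summary is my own invention, not a standard R function – and it uses the unrounded Welch df, so its p-value will differ slightly from the hand calculation above that rounded the df down to 10:

t.from.summary = function(m1, s1, n1, m2, s2, n2, var.equal = FALSE) {
  if (var.equal) {
    df = n1 + n2 - 2
    sp2 = ((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / df
    se = sqrt(sp2 * (1/n1 + 1/n2))
  } else {
    se = sqrt(s1^2/n1 + s2^2/n2)
    df = (s1^2/n1 + s2^2/n2)^2 /
      ((s1^2/n1)^2/(n1 - 1) + (s2^2/n2)^2/(n2 - 1))
  }
  t = (m1 - m2) / se
  c(t = t, df = df, p.value = 2 * (1 - pt(abs(t), df)))
}
t.from.summary(0.806, 1.135, 10, -0.168, 0.543, 36)                    # unequal variances
t.from.summary(0.806, 1.135, 10, -0.168, 0.543, 36, var.equal = TRUE)  # equal variances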