Deviations from Normality

Several tests assume the data are drawn from normally distributed populations. If this assumption is violated we face a choice: do we abandon hypothesis testing as planned, or do we take an alternative path? Step one is to figure out when the assumption of normality must be abandoned.

Graphical Method

This is an easy diagnostic approach because we can plot our histogram or box-plot and then superimpose the normal distribution to see how far off we are. Another useful visual device is the quantile-quantile (QQ) plot. The idea behind these plots is quite simple: order the sample values and plot them against the quantiles we would expect to see if the data came from a normal distribution. The more closely the theoretical quantiles and the sample quantiles agree, the more plausible it is that the sample comes from a normally distributed population. A perfect \(1:1\) fit would put every point on a straight line, and the more the dots deviate from that line, the less likely it is that the sample data come from a normally distributed population.
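To make the construction concrete, here is a minimal hand-rolled sketch for a deliberately skewed (simulated) sample; qqnorm() and qqline(), used below, do all of this for us:

set.seed(42)
y = rexp(50)                                    # a deliberately right-skewed sample
p = ppoints(length(y))                          # n evenly spaced probabilities
plot(qnorm(p), sort(y), xlab = "Theoretical Quantiles", ylab = "Sample Quantiles")
q.y = quantile(y, c(0.25, 0.75))                # reference line through the quartiles,
q.x = qnorm(c(0.25, 0.75))                      # which is how qqline() draws it
slope = diff(q.y) / diff(q.x)
abline(a = q.y[1] - slope * q.x[1], b = slope)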

Marine Reserves

Marine organisms do not enjoy anywhere near the same protection from human influence as do terrestrial species. However, marine reserves are becoming increasingly popular for biological conservation and the protection of fisheries. But are reserves effective in preserving marine wildlife? Researchers matched each of 32 marine reserves to a control location, which was either the site of the reserve before it became protected or a similar unprotected site nearby. One index of protection evaluated by the study was the “biomass ratio”, which is the total mass of all marine plants and animals per unit area of reserve divided by the same quantity in the unprotected control. This biomass ratio would equal one if protection had no effect. The biomass ratio would exceed one if protection were beneficial and it would be less than one if protection reduced biomass. Does the mean biomass ratio differ from one (i.e., are marine reserves effective)?

\(H_0:\) Mean biomass ratio is unaffected by reserve protection \((\mu=1)\)
\(H_A:\) Mean biomass ratio is affected by reserve protection \((\mu \neq 1)\)

First things first, are these data coming from a normally distributed population?

marine = read.csv(url("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter13/chap13e1MarineReserve.csv"))
head(marine); names(marine)
##   biomassRatio
## 1         1.34
## 2         1.96
## 3         2.49
## 4         1.27
## 5         1.19
## 6         1.15
## [1] "biomassRatio"
boxplot(marine$biomassRatio, horizontal = TRUE, xlab="Biomass Ratio", main="Boxplot of Marine Biomass Ratio", col = "firebrick")

h = hist(marine$biomassRatio, xlab="Biomass Ratio", main="Histogram of Marine Biomass Ratio", col = "firebrick")
x = marine$biomassRatio
xfit = seq(min(x), max(x), length=32)        # grid of x values spanning the data
yfit = dnorm(xfit, mean=mean(x), sd=sd(x))   # normal density with the sample mean and sd
yfit = yfit*diff(h$mids[1:2])*length(x)      # rescale the density to the count (frequency) axis
lines(xfit, yfit, col="blue", lwd=2)         # superimpose the normal curve on the histogram

qqnorm(marine$biomassRatio, col = "cornflowerblue");
qqline(marine$biomassRatio, col = "firebrick")

All three plots show pronounced right skew, suggesting anything but a normal distribution underlying these data.

Formal Tests (beware!!!!)

Two tests are very popular: the Shapiro-Wilk test and the Anderson-Darling test. Shapiro-Wilk is designed for sample sizes under 5,000, while Anderson-Darling is designed for larger samples.

\(H_0:\) The sample is drawn from a normal population
\(H_A:\) The sample is not drawn from a normal population

\[W =\dfrac{\left(\displaystyle\sum^{n}_{i=1}a_iY_{(i)}\right)^2}{\displaystyle\sum^{n}_{i=1}\left(Y_i - \bar{Y} \right)^2}; \qquad 0 \leq W \leq 1\]

where \(Y_{(i)}\) is the \(i\)-th smallest value in the sample and the \(a_i\) are tabulated constants.

\(W \to 1\): Observed distribution is as expected if population were Normal

Skewness: \(g_1 = \displaystyle\sum^{n}_{i=1}\dfrac{\left(Y_i - \bar{Y}\right)^3}{\left(n-1\right)s^{3}}\)

\(g_1 = 0\): symmetric, as the Normal; \(g_1 < 0\): skewed left; \(g_1 > 0\): skewed right

Kurtosis: \(g_2 = \displaystyle\sum^{n}_{i=1}\dfrac{\left(Y_i - \bar{Y}\right)^4}{\left(n-1\right)s^{4}} - 3\)

\(g_2 = 0\): Standard Normal distribution; \(g_2 > 0\): "peaked"; \(g_2 < 0\): "flat" (here \(s\) is the sample standard deviation)
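A minimal sketch of these two statistics as R functions, following the formulas above (for reference: skewness() and kurtosis() in the moments package, used below, divide by \(n\) and the uncorrected standard deviation, and kurtosis() does not subtract 3, so their values will differ somewhat from these):

skew.stat = function(y) sum((y - mean(y))^3) / ((length(y) - 1) * sd(y)^3)
kurt.stat = function(y) sum((y - mean(y))^4) / ((length(y) - 1) * sd(y)^4) - 3
skew.stat(marine$biomassRatio)   # compare with moments::skewness() below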

shapiro.test(marine$biomassRatio)
## 
##  Shapiro-Wilk normality test
## 
## data:  marine$biomassRatio
## W = 0.81751, p-value = 8.851e-05
library(nortest)
ad.test(marine$biomassRatio)
## 
##  Anderson-Darling normality test
## 
## data:  marine$biomassRatio
## A = 2.0194, p-value = 2.892e-05
library(moments)
skewness(marine$biomassRatio)
## [1] 1.688122
kurtosis(marine$biomassRatio)
## [1] 5.641057

Both the Shapiro-Wilk and the Anderson-Darling test suggest that the null of normality can be easily rejected. Note: AD is used here purely for illustration purposes and is not needed in this case given the small sample size.

Violations of Normality

What if we ignore violations of the assumption of normality? Perhaps nothing will go wrong in terms of the test results. Most tests are robust to violations of normality so long as you can rely on the Central Limit Theorem and the test involves the mean. In brief, large samples will let you get away with some non-normality in the data so long as you are not testing variances. How large is large? There is no clear-cut answer; depending upon what you are looking to test, how many groups are involved, and the research design, even 500 units in each group may be insufficient.

Unequal Standard Deviations

The rule of thumb used for sample sizes and relative standard deviations of the groups applies here as well. If you have small samples, or the standard deviation of one group is quite a bit larger than that of the other group(s) (i.e., one standard deviation is at least twice the other), then you should not assume equal variances. Of course, you would be better off using the formal tests for equality of variances and for the homogeneity of variances, respectively: the F-test and Levene’s test.

If the F-test and Levene’s test give you conflicting results, I would recommend going with Levene’s test if the data are non-normally distributed.

What do I do if my data are non-normally distributed?

  • Have faith in the Central Limit Theorem. So long as you have a sample size of 30 or more you can assume that the sampling distribution of the sample mean follows the normal distribution, even if the original measurement does not come from a normal distribution (see the simulation sketch after this list)
  • Test statistics for means & confidence intervals of means will be unaffected so long as you have a large enough sample to work with
    • Most tests are robust to some violations of the normality assumption so long as the sample size is not too small and the skew isn’t very extreme
    • How small? If the groups show similar skewness, even when comparing two groups, having 30 in each group will work
    • If the two groups are severely skewed in opposite directions (one left-skewed, the other right-skewed) then I need samples of a few hundred units each before the test can be trusted
  • If all else fails (or you are a conservative analyst), try to transform the data and rerun all tests … We will try this tactic next
  • If this doesn’t solve the problem (or you are not a fan of massaging your data) then switch to non-parametric tests that do not require the Normality assumption … We will also see these tests in action.
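As promised above, a minimal simulation sketch of the Central Limit Theorem at work: draw 5,000 samples of \(n = 30\) from a heavily right-skewed (exponential) population and inspect the distribution of the sample means.

set.seed(1)
xbar = replicate(5000, mean(rexp(30, rate = 1)))   # 5,000 sample means, n = 30 each
par(mfrow=c(1,2))
hist(xbar, breaks = 40, col = "firebrick", main = "Sampling distribution of the mean")
qqnorm(xbar, col = "cornflowerblue"); qqline(xbar, col = "firebrick")

Despite the strong skew of the parent distribution, the sample means pile up in a roughly symmetric, near-normal heap.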

Transforming Data

If we don’t want to ignore violations of the normality assumption we have no choice but to transform the data. Several transformations are available for use, each with the goal of reducing the skew in the sample (i.e., rendering it more “normally” distributed).

  1. Log Transformation \[Y^{'} = ln\left( Y \right)\] This is very commonly used in biological research to reduce skew. It usually works so long as the measure does not include \(0\). If it does, then standard practice entails adding \(1\) to each measurement and then taking the natural logarithm.

The log transformation is most useful when:

  1. The measures represent ratios or products
  2. The data are positively skewed (i.e., skewed right)
  3. The group with the larger mean also has the larger standard deviation
  4. The data span several orders of magnitude
  • If substantially positively skewed and \(Y > 0\) then use \(Y^{'} = ln(Y)\)
  • If substantially positively skewed and \(Y \geq 0\) then use \(ln(Y + c)\); where \(c\) is a constant chosen such that the smallest score \(= 1\)
  • If substantially negatively skewed then use \(ln(K - Y)\); where \(K = Y_{max} + 1\) so that the smallest score \(=1\)
  2. Square-root Transformation \[Y^{'} = \sqrt{Y}\] Used often when there is moderate skew, and with count data
  • If moderately positively skewed and \(Y > 0\) then use \(\sqrt{Y}\)
  • If moderately positively skewed and \(Y \geq 0\) then use \(\sqrt{Y + 0.5}\)
  • If moderately negatively skewed use \(\sqrt{K - Y}\); where \(K = Y_{max} + 1\) so that the smallest score \(=1\)
  3. Arcsine Transformation If the data are proportions, the arcsine \(arcsin\left( \sqrt{p}\right)\) is suggested, but I would not do this readily … proportions essentially come from Binomial or Poisson processes, and in those cases we would rarely run a \(t-test\) or fit a linear regression.

  4. Other Transformations The square \((Y^{'} = Y^{2})\) is used if the data are skewed left, and if this fails, then the anti-log might work \((Y^{'} = e^{Y})\).

If data are skewed right, the reciprocal may also work \((Y^{'} = \frac{1}{Y})\).

For the square and the reciprocal you need to have all measurements be positive. If they are negative then you need to multiply each by \(-1\) and then carry out the transformation.

If we use a transformation, the estimated statistics have to be back-transformed to the original metric.

  • The Natural Logarithm: \(Y^{'} = ln[Y]\) if \(Y > 0\)
  • The Natural Logarithm: \(Y^{'} = ln[Y + 1]\) if \(Y \geq 0\)
  • The Square Root: \(Y^{'} = \sqrt{Y + \frac{1}{2}}\)
  • The Arcsine: \(p^{'} =arcsin[\sqrt{p}]\)

The Corresponding Back-transformations

  • The Natural Logarithm: \(Y = e^{Y^{'}}\)
  • The Square Root: \(Y = Y^{'2} - \frac{1}{2}\)
  • The Arcsine: \(p = \left(sin[p^{'}]\right)^{2}\)
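A minimal sketch of this round trip with the marine data from earlier: on the log scale \(H_0: \mu = 1\) becomes \(H_0: \mu = \ln(1) = 0\), and the back-transformed confidence interval then describes the geometric mean of the biomass ratio.

log.biomass = log(marine$biomassRatio)   # all ratios are > 0, so ln() is safe
tt = t.test(log.biomass, mu = 0)         # H0: mu = 1 becomes mu = ln(1) = 0 after logging
exp(tt$conf.int)                         # CI back-transformed to the original ratio scale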

Nonparametric Alternatives to the \(t-tests\)

Non-parametric tests are used (a) when transformations do not work, or (b) when the data represent ordinal categories (or are ranked data).

  • Called non-parametric because, unlike (say) the \(t-test\), which requires a distributional assumption to be true (i.e., Normality) and involves parameters (i.e., the mean and the variance), these alternatives make no such assumptions and need no such parameters
  • They are more likely to lead you to commit a Type II error so if the assumptions of parametric tests are met use parametric tests
  • Here are a few non-parametric analogues to the \(t-tests\):
    • Sign Test: Alternative to the One-Sample \(t-test\) or to the Paired \(t-test\)
    • Wilcoxon Signed-Rank Test: Alternative to the Paired \(t-test\)
    • Mann-Whitney \(U-test\): Alternative to the Two-Sample \(t-test\) with equal variances
    • Kolmogorov-Smirnov test: Alternative to the Two-Sample \(t-test\) with unequal variances

The Sign Test

Assumption: Samples are drawn independently from a continuous distribution

  • Tests whether the Median equals a hypothesized value (\(H_0\) value)
  • Scores above the \(H_0\) value are marked \(+\); scores below are marked \(-\)
  • Scores equal to the hypothesized Median are dropped
  • If \(H_0\) is correct then 50% of the scores should be \(+\) and 50% should be \(-\)
  • Essentially a Binomial test where \(p_0=0.50\)
  • One-Sample Hypotheses:
    • \(H_0\): \(p = 0.50\); the distribution is symmetric around the hypothesized median
    • \(H_A\): \(p \neq 0.50\); the distribution is not symmetric around the hypothesized median
  • Paired Design Hypotheses:
    • \(H_0\): The distribution of the two measurements is the same
    • \(H_A\): The distribution of the two measurements is not the same
  • Has little power because it discards the magnitudes of the deviations, so it works best with large samples

Let us run the test with the Sexual Conflict data.

The process by which a single species splits into two species is still not well understood. One proposal involves “sexual conflict” – a genetic arms race between males and females that arises from their different reproductive roles. For example, in a number of insect species, male seminal fluid contains chemicals that reduce the tendency of the female to mate again with other males. However, these chemicals also reduce female survival, so females have developed mechanisms to counter these chemicals. Sexual conflict can cause rapid genetic divergence between isolated populations of the same species, leading to the formation of new species. Sexual conflict is more pronounced in species in which females mate more than once, leading to the prediction that they should form new species at a more rapid rate.
To investigate this, Arnqvist et al. (2000) identified 25 insect taxa (groups) in which females mate multiple times, and they paired each of these groups to a closely related insect group in which females only mate once. Which type of insects tend to have more species?
Note: These are treated as paired data because the two sets of groups are closely related in the sense of being almost “identical”

taxa = read.csv(url("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter13/chap13e4SexualConflict.csv"))
head(taxa); names(taxa); summary(taxa)
##   taxonPair nSpeciesMultipleMating nSpeciesSingleMating difference
## 1         A                     53                   10         43
## 2         B                     73                  120        -47
## 3         C                    228                   74        154
## 4         D                    353                  289         64
## 5         E                    157                   30        127
## 6         F                    300                    4        296
## [1] "taxonPair"              "nSpeciesMultipleMating"
## [3] "nSpeciesSingleMating"   "difference"
##    taxonPair  nSpeciesMultipleMating nSpeciesSingleMating
##  A      : 1   Min.   :    7          Min.   :   4.0      
##  B      : 1   1st Qu.:   34          1st Qu.:  10.0      
##  C      : 1   Median :   77          Median :  30.0      
##  D      : 1   Mean   : 1134          Mean   : 263.5      
##  E      : 1   3rd Qu.:  228          3rd Qu.: 115.0      
##  F      : 1   Max.   :21000          Max.   :3500.0      
##  (Other):19                                              
##    difference     
##  Min.   : -980.0  
##  1st Qu.:   -3.0  
##  Median :   16.0  
##  Mean   :  870.5  
##  3rd Qu.:   79.0  
##  Max.   :20940.0  
## 
par(mfrow=c(1,2))
boxplot(taxa$difference, horizontal = TRUE, col = "firebrick", main="Boxplot of Difference in Species Number", xlab="", cex.main=0.9)
qqnorm(taxa$difference, col="blue");
qqline(taxa$difference, col="red")

shapiro.test(taxa$difference)
## 
##  Shapiro-Wilk normality test
## 
## data:  taxa$difference
## W = 0.25473, p-value = 2.911e-10

The difference takes negative values, so a direct log transformation will not work. We can shift every value upward so the variable is strictly positive and then take logs:

log.diff = log(taxa$difference + (2 - min(taxa$difference)))   # shift so the smallest value becomes 2, then log
summary(log.diff)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.6931  6.8865  6.9058  6.8185  6.9670  9.9952
boxplot(log.diff, horizontal=TRUE)

shapiro.test(log.diff)
## 
##  Shapiro-Wilk normality test
## 
## data:  log.diff
## W = 0.45146, p-value = 1.284e-08

That didn’t work. We could try some R packages designed to help transform the data; one such attempt is sketched below, but it is futile here.
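For illustration, a minimal sketch of an automated search over the Box-Cox family using the MASS package (the response must be strictly positive, so we reuse the shifted variable):

library(MASS)
shifted = taxa$difference + (2 - min(taxa$difference))  # same shift as above; all values > 0
bc = boxcox(shifted ~ 1, plotit = FALSE)                # profile log-likelihood over lambda
bc$x[which.max(bc$y)]                                   # the best-fitting lambda

Even the best-fitting \(\lambda\) is unlikely to rescue data this extreme, so we opt for the Sign Test instead.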

\(H_0\): Median difference in number of species is \(= 0\)
\(H_A\): Median difference in number of species is \(\neq 0\)

First thing we will do is count how many cases are \(+\), how many \(-\), and how many are equal to the hypothesized median of \(0\)

library(BSDA)
SIGN.test(taxa$nSpeciesMultipleMating, taxa$nSpeciesSingleMating, md = 0, alternative = "two.sided", conf.level = 0.95)
## 
##  Dependent-samples Sign-Test
## 
## data:  taxa$nSpeciesMultipleMating and taxa$nSpeciesSingleMating
## S = 18, p-value = 0.04329
## alternative hypothesis: true median difference is not equal to 0
## 95 percent confidence interval:
##   1.00000 77.16674
## sample estimates:
## median of x-y 
##            16 
## 
## Achieved and Interpolated Confidence Intervals: 
## 
##                   Conf.Level L.E.pt  U.E.pt
## Lower Achieved CI     0.8922      1 70.0000
## Interpolated CI       0.9500      1 77.1667
## Upper Achieved CI     0.9567      1 78.0000

Note the p-value of 0.04329; we can reject the null hypothesis. The data suggest that groups of insects whose females mate multiple times have more species than groups whose females mate singly, consistent with the sexual-conflict hypothesis.

You could also do this test manually:

taxa$sign = "equal to median"
taxa$sign[taxa$difference > 0] = "plus"
taxa$sign[taxa$difference < 0] = "minus"
taxa$sign = factor(taxa$sign, levels = c("minus", "equal to median", "plus"))
table(taxa$sign)
## 
##           minus equal to median            plus 
##               7               0              18

So we have 7 minuses and 18 pluses. If the distribution were centered on the hypothesized median we should have seen about 50% of the cases below and 50% above it. This is not what we see. Let’s use the binomial test:

binom.test(7, 25, p=0.5)
## 
##  Exact binomial test
## 
## data:  7 and 25
## number of successes = 7, number of trials = 25, p-value = 0.04329
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
##  0.1207167 0.4938768
## sample estimates:
## probability of success 
##                   0.28

Given the p-value we can reject \(H_0\); the Median difference is not \(0\). Note that the p-value is identical to the one reported by SIGN.test above; no surprise, because the Sign Test is just a binomial test with \(p_0 = 0.50\) in disguise.

The Wilcoxon Signed-Rank Test

The Sign Test is weak in that it ignores the magnitude by which an observation falls above or below the median. The Wilcoxon Signed-Rank Test is an improvement because it ranks observations on the basis of how far above or below the hypothesized median they fall.

  • Assumes the distribution is symmetric around the Median. This amounts to assuming there is no skew, so it is a very restrictive assumption
  • Steps:
    • Calculate \(Y_i - \mu_{0}\) for all \(i=1,2,\ldots,n\)
    • Rank in ascending order the absolute differences \(\left|Y_i - \mu_{0} \right|\) for all \(i=1,2,\ldots,n\)
    • Assign + or - to each rank
    • Let the sum of + and - ranks be \(W^{+}\) and \(W^{-}\), respectively
    • Let \(W = min\left(W^{+},W^{-}\right)\) and \(W^{*}_{\alpha}\) be critical W
    • Reject \(H_0\) (the median difference between the two samples is \(\mu_0 = 0\)) if …
      • \(H_A\) is \(\mu \neq \mu_0\) and \(W \leq W^{*}_{\alpha}\)
      • \(H_A\) is \(\mu > \mu_0\) and \(W^{-} \leq W^{*}_{\alpha}\)
      • \(H_A\) is \(\mu < \mu_0\) and \(W^{+} \leq W^{*}_{\alpha}\)
  • Normal approximation: \(z_0 = \dfrac{W^{+} - n(n+1)/4}{\sqrt{n(n+1)(2n+1)/24}}\) … used when \(n \geq 50\) and there are no ties (a sketch of the rank bookkeeping follows)
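A minimal sketch of these steps with hypothetical paired differences and \(\mu_0 = 0\):

d = c(3.2, -1.5, 0.7, 2.8, -0.4)   # hypothetical paired differences
r = rank(abs(d - 0))               # rank the absolute deviations from mu0 = 0
W.plus = sum(r[d > 0])             # sum of the positively signed ranks
W.minus = sum(r[d < 0])            # sum of the negatively signed ranks
min(W.plus, W.minus)               # W, to be compared against critical W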

Let us run the test on the Sexual Conflict data:

wilcox.test(taxa$nSpeciesMultipleMating,  taxa$nSpeciesSingleMating, paired = TRUE)
## Warning in wilcox.test.default(taxa$nSpeciesMultipleMating, taxa
## $nSpeciesSingleMating, : cannot compute exact p-value with ties
## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  taxa$nSpeciesMultipleMating and taxa$nSpeciesSingleMating
## V = 230, p-value = 0.07139
## alternative hypothesis: true location shift is not equal to 0

Do not reject \(H_0\); the distributions appear to be the same (i.e., there is no difference in the number of species between the polyandrous and the monandrous groups).

The Mann-Whitney \(U-test\)

If we have a two-sample (i.e., unpaired) design, then the Mann-Whitney U-Test can be used assuming we couldn’t transform the data to run a \(t-test\).

  • The assumptions of the Mann-Whitney U test are:
    • The variable of interest is continuous (not discrete), and the measurement scale is at least ordinal
    • The probability distributions of the two populations are identical, except for location (i.e., the “center”)
    • The two samples are independent
    • Both are simple random samples from their respective populations
  • \(H_0\): The samples come from populations with similar probability distributions
  • Test Process and Statistic …
    • Combine both samples and rank all values in ascending order
    • If there are ties, assign each tied value the average of the ranks it spans
    • Sum the ranks of the smaller group \((R_1)\)
    • \(U_1\) = \(n_1 n_2 + \dfrac{n_1(n_1 + 1)}{2} - R_{1}\); \(U_2 = n_1 n_2 - U_{1}\)
    • Choose the larger of \(U_1\) or \(U_2\) as the test statistic
    • Reject \(H_0\) if \(P-value \leq \alpha\) … a sketch of this bookkeeping follows
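A minimal sketch with two toy samples (note that R's wilcox.test() reports \(W = R_1 - n_1(n_1+1)/2\), which is \(U_2\) in the notation above, rather than \(\max(U_1, U_2)\)):

g1 = c(1.1, 2.3, 3.0)              # toy sample 1 (the smaller group), n1 = 3
g2 = c(0.5, 1.8, 2.5, 4.2)         # toy sample 2, n2 = 4
r = rank(c(g1, g2))                # pooled ranks (ties would be averaged)
R1 = sum(r[seq_along(g1)])         # rank sum of the smaller group
n1 = length(g1); n2 = length(g2)
U1 = n1 * n2 + n1 * (n1 + 1) / 2 - R1
U2 = n1 * n2 - U1
max(U1, U2)                        # the test statistic, per the rule above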

Sexual Cannibalism in Sagebrush Crickets

The sagebrush cricket, Cyphoderris strepitans, has an unusual form of mating. During mating, the male offers his fleshy hind wings to the female to eat. The wounds are not fatal, but a male with already-nibbled wings is less likely to be chosen by females he meets later. Females get some nutrition from feeding on the wings, which raises the question, “Are females more likely to mate if they are hungry?”

Johnson et al. (1999) answered this question by randomly dividing 24 females into two groups: one group of 11 females was starved for at least two days and another group of 13 females was fed during the same period. Finally, each female was put separately into a cage with a single (new) male, and the waiting time to mating was recorded.

\(H_0\): Time to mating is the same for female crickets that were starved as for those who were fed
\(H_A\): Time to mating is not the same for female crickets that were starved as for those who were fed

cric = read.csv(url("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter13/chap13e5SagebrushCrickets.csv"))
head(cric); names(cric)
##   feedingStatus timeToMating
## 1       starved          1.9
## 2       starved          2.1
## 3       starved          3.8
## 4       starved          9.0
## 5       starved          9.6
## 6       starved         13.0
## [1] "feedingStatus" "timeToMating"
wilcox.test(cric$timeToMating ~ cric$feedingStatus, paired=FALSE)
## 
##  Wilcoxon rank sum test
## 
## data:  cric$timeToMating by cric$feedingStatus
## W = 88, p-value = 0.3607
## alternative hypothesis: true location shift is not equal to 0

Do not reject \(H_0\); time to mating does not appear to be different for starved versus fed female crickets

Note that when there are ties, R warns that it cannot compute an exact p-value; in that case we need to switch to a version of the Wilcoxon test designed to handle ties. I’ll illustrate with the data on gestures and sightedness.

Gesturing is common during human speech. Is this behavior learned via exposure? A measure was made of the number of gestures produced by each of 12 pairs of sighted individuals while talking to sighted individuals. This result was compared with the number of gestures produced while talking by each of 12 pairs of people who had been blind since birth and were therefore presumably unexposed to the gestures of others. These data are given below.

gestures = read.csv(url("http://whitlockschluter.zoology.ubc.ca/wp-content/data/review2/rev2q13BlindGestures.csv"))
head(gestures); names(gestures)
##   sightedness numberOfGestures
## 1       Blind                0
## 2       Blind                1
## 3       Blind                1
## 4       Blind                2
## 5       Blind                1
## 6       Blind                1
## [1] "sightedness"      "numberOfGestures"
wilcox.test(gestures$numberOfGestures ~ gestures$sightedness, paired = FALSE)
## Warning in wilcox.test.default(x = c(0L, 1L, 1L, 2L, 1L, 1L, 1L, 3L, 1L, :
## cannot compute exact p-value with ties
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  gestures$numberOfGestures by gestures$sightedness
## W = 62.5, p-value = 0.5755
## alternative hypothesis: true location shift is not equal to 0

Note the warning message that you can’t recover an exact p-value with this test because of ties in the data. Mercifully we have another test we could use in the presence of ties.

library(exactRankTests)
wilcox.exact(gestures$numberOfGestures ~ gestures$sightedness, paired=FALSE)
## 
##  Exact Wilcoxon rank sum test
## 
## data:  gestures$numberOfGestures by gestures$sightedness
## W = 62.5, p-value = 0.6361
## alternative hypothesis: true mu is not equal to 0

Kolmogorov-Smirnov (K-S) Test

This (very weak) test is used to compare the distributions of two groups by comparing the empirical cumulative distribution functions (ecdfs) of the two groups and finding the greatest absolute distance between them.

  • The \(ecdf\) is \(\hat{F}(y) =\) the fraction of the sample with values \(\leq y\)
  • The \(K-S\) statistic is \(D_{max} = \displaystyle\max_{Y} \left| \hat{F}_1(Y) - \hat{F}_2(Y) \right|\)
  • Assumptions:
    • The measurement scale is at least ordinal.
    • The probability distributions are continuous
    • The two samples are mutually independent
    • Both samples are simple random samples from their respective populations
  • \(H_0:\) \(F_1(Y) = F_2(Y)\) for all \(Y_i\); \(H_A:\) \(F_1(Y) \neq F_2(Y)\) for at least one \(Y_i\)
  • Reject \(H_0\) if the \(P-value\) of the calculated \(D_{max}\) is \(\leq \alpha\) … a sketch of the computation follows
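A minimal sketch of \(D_{max}\) from the two ecdfs using toy samples; because ecdfs are step functions that jump only at data points, evaluating on the pooled values is enough to find the maximum gap:

a = c(1.2, 2.4, 3.1, 4.8)               # toy sample 1
b = c(0.9, 2.0, 2.2, 3.5, 5.1)          # toy sample 2
Fa = ecdf(a); Fb = ecdf(b)              # the two empirical cdfs
grid = sort(c(a, b))                    # all jump points, pooled
max(abs(Fa(grid) - Fb(grid)))           # D, matching ks.test(a, b)$statistic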

Let us see when and how we might use this test.

The stalk-eyed fly, Crytodiopsis dalmanni, is a bizarre-looking insect from the jungles of Malaysia. Its eyes are at the ends of long stalks that emerge from its head, making the fly look like something from the cantina scene in Star Wars. These eye stalks are present in both sexes, but they are particularly impressive in males. The span of the eye stalk in males enhances their attractiveness to females as well as their success in battles against other males. The span, in millimeters, from one eye to the other, was measured in a random sample of 45 male stalk-eyed flies.

That is one setting that calls for this kind of test; to demonstrate the workflow, though, let us return to the sagebrush cricket data, plot the distributions, and run the formal tests to see if we can stick with the usual \(t-test\).

cric = read.csv(url("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter13/chap13e5SagebrushCrickets.csv"))
head(cric); names(cric)
##   feedingStatus timeToMating
## 1       starved          1.9
## 2       starved          2.1
## 3       starved          3.8
## 4       starved          9.0
## 5       starved          9.6
## 6       starved         13.0
## [1] "feedingStatus" "timeToMating"
boxplot(cric$timeToMating ~ cric$feedingStatus, horizontal=TRUE)

var.test(cric$timeToMating ~ cric$feedingStatus)
## 
##  F test to compare two variances
## 
## data:  cric$timeToMating by cric$feedingStatus
## F = 2.8393, num df = 12, denom df = 10, p-value = 0.108
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.7841424 9.5786535
## sample estimates:
## ratio of variances 
##           2.839337
library(car)
## Loading required package: carData
## 
## Attaching package: 'carData'
## The following objects are masked from 'package:BSDA':
## 
##     Vocab, Wool
leveneTest(cric$timeToMating ~ cric$feedingStatus, center = mean)
## Levene's Test for Homogeneity of Variance (center = mean)
##       Df F value   Pr(>F)   
## group  1  8.2093 0.008995 **
##       22                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
leveneTest(cric$timeToMating ~ cric$feedingStatus, center = median)
## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value  Pr(>F)  
## group  1  4.5709 0.04388 *
##       22                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The box-plots and the formal tests (at least Levene’s test for homogeneity of variances) suggest we should reject the null of equal variances. We could try and normalize the data via the natural logarithm.

cric$log.time = log(cric$timeToMating)
boxplot(cric$log.time ~ cric$feedingStatus, horizontal=TRUE)

var.test(cric$log.time ~ cric$feedingStatus)
## 
##  F test to compare two variances
## 
## data:  cric$log.time by cric$feedingStatus
## F = 2.0434, num df = 12, denom df = 10, p-value = 0.2663
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.5643217 6.8934446
## sample estimates:
## ratio of variances 
##           2.043378
library(car)
leveneTest(cric$log.time ~ cric$feedingStatus, center = mean)
## Levene's Test for Homogeneity of Variance (center = mean)
##       Df F value  Pr(>F)  
## group  1  4.3155 0.04965 *
##       22                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
leveneTest(cric$log.time ~ cric$feedingStatus, center = median)
## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value Pr(>F)
## group  1    2.21 0.1513
##       22

Now we have a quandary: the median-centered Levene’s test tells us we cannot reject \(H_0\), but both the mean-centered test and the box-plots suggest otherwise. A safe approach would be to assume we cannot normalize the data, so the Kolmogorov-Smirnov test may be called for.

ks.test(cric$feedingStatus, cric$timeToMating, alternative = "two.sided")
## Warning in ks.test(cric$feedingStatus, cric$timeToMating, alternative =
## "two.sided"): cannot compute exact p-value with ties
## 
##  Two-sample Kolmogorov-Smirnov test
## 
## data:  cric$feedingStatus and cric$timeToMating
## D = 0.875, p-value = 2.093e-08
## alternative hypothesis: two-sided
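Caution: as written, the call above hands ks.test() the feedingStatus factor (coerced to the group codes 1 and 2) as one "sample" and the mating times as the other, so it is not comparing the two groups' times at all; that mis-specification is what produces the extreme \(D\). The intended two-sample comparison is sketched below (output not shown):

lv = levels(factor(cric$feedingStatus))              # the two group labels
g1 = cric$timeToMating[cric$feedingStatus == lv[1]]  # times for one group
g2 = cric$timeToMating[cric$feedingStatus == lv[2]]  # times for the other
ks.test(g1, g2, alternative = "two.sided")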

Uh oh! We have ties, so an exact p-value cannot be computed. Technically, a permutation test would be useful here.

library(exactRankTests)          # for perm.test()
perm.test(cric$timeToMating ~ cric$feedingStatus, paired=FALSE, alternative="two.sided", exact=TRUE) 
## 
##  2-sample Permutation Test (scores mapped into 1:(m+n) using
##  rounded scores)
## 
## data:  cric$timeToMating by cric$feedingStatus
## T = 123, p-value = 0.1376
## alternative hypothesis: true mu is not equal to 0

The test tells us we cannot reject \(H_0\); diet appears to have no effect on time to mating.

The Testing Sequence

  • One-Sample or Paired Design
    • Test for Normality
    • If \(N()\) holds then use the One-Sample \(t-test\)
    • If \(N()\) does not hold then try a transformation, test for \(N()\) again
    • If \(N()\) still does not hold then use the Sign test (one sample) or the Wilcoxon Signed-Rank test (paired design)
    • Permutation test for paired designs
  • Two Sample design (equal variances)
    • Test for Normality and equal variances
    • If \(N()\) holds and variances are equal then use the Two-Sample \(t-test\) with \(var.equal=TRUE\)
    • If \(N()\) does not hold then try a transformation, test for \(N()\) again
    • If \(N()\) still does not hold then use the Mann-Whitney \(U-test\)
    • Permutation test
  • Two Sample design (unequal variances)
    • Test for Normality and equal variances
    • If \(N()\) holds but variances are unequal then use the Two-Sample \(t-test\) with \(var.equal=FALSE\)
    • If \(N()\) does not hold then try a transformation, test for \(N()\) again
    • If \(N()\) still does not hold then use the Kolmogorov-Smirnov test
    • Permutation test

Some Worked Examples

Chapter 13 Problem #3

Recycling paper has some obvious benefits but it may have unintended consequences. For example, perhaps people are less careful about how much paper they use if they know that their waste will be recycled. Researchers tested this idea by measuring paper use in two groups of experimental participants. Each person was placed in a room alone with scissors, paper, and a trash can, and was told that he or she was testing the scissors. In the “recycling” group only there was a recycling bin in the room. The amount of paper used by each participant was measured in grams.

recycle = read.csv(url("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter13/chap13q03Recycling.csv"))
head(recycle); names(recycle)
##   recyclingBinPresent paperUsage
## 1              absent          4
## 2              absent          4
## 3              absent          4
## 4              absent          4
## 5              absent          4
## 6              absent          4
## [1] "recyclingBinPresent" "paperUsage"

We will start by testing whether the normality assumption is met.

boxplot(recycle$paperUsage ~ recycle$recyclingBinPresent, horizontal=TRUE)

tapply(recycle$paperUsage, recycle$recyclingBinPresent, FUN = shapiro.test)
## $absent
## 
##  Shapiro-Wilk normality test
## 
## data:  X[[i]]
## W = 0.8734, p-value = 0.009063
## 
## 
## $present
## 
##  Shapiro-Wilk normality test
## 
## data:  X[[i]]
## W = 0.81731, p-value = 0.002062
var.test(recycle$paperUsage, recycle$recyclingBinPresent)
## Warning in var(y): Calling var(x) on a factor x is deprecated and will become an error.
##   Use something like 'all(duplicated(x)[-1L])' to test for a constant vector.
## 
##  F test to compare two variances
## 
## data:  recycle$paperUsage and recycle$recyclingBinPresent
## F = 392.73, num df = 40, denom df = 40, p-value < 2.2e-16
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  209.4325 736.4412
## sample estimates:
## ratio of variances 
##           392.7273
library(car)
leveneTest(recycle$paperUsage ~ recycle$recyclingBinPresent, center = mean)
## Levene's Test for Homogeneity of Variance (center = mean)
##       Df F value   Pr(>F)   
## group  1  10.374 0.002578 **
##       39                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
leveneTest(recycle$paperUsage ~ recycle$recyclingBinPresent, center = median)
## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value  Pr(>F)  
## group  1  5.3358 0.02627 *
##       39                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
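Caution: var.test() above was handed the grouping factor as its second sample, which is why R warns about calling var() on a factor; the resulting \(F = 392.73\) compares paper usage with the factor codes and should be ignored. The intended F-test uses the formula interface, as sketched below (output not shown); the same slip affects the other var.test(x, factor) calls in these worked examples, so lean on Levene's test wherever the two seem to disagree.

var.test(recycle$paperUsage ~ recycle$recyclingBinPresent)   # F-test of equal variances by group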

Neither the box-plots nor the Shapiro-Wilk test allow us to assume normality holds, and Levene’s test rejects the null hypothesis of equal variances as well.

Since we have two groups here we are in the domain of the two-sample t-test, but since the normality assumption is violated we have no choice but to try and normalize the data.

recycle$log.paper = log(recycle$paperUsage)
boxplot(recycle$log.paper ~ recycle$recyclingBinPresent, horizontal=TRUE)

with(recycle, tapply(log.paper, recyclingBinPresent, shapiro.test))
## $absent
## 
##  Shapiro-Wilk normality test
## 
## data:  X[[i]]
## W = 0.85961, p-value = 0.005045
## 
## 
## $present
## 
##  Shapiro-Wilk normality test
## 
## data:  X[[i]]
## W = 0.95117, p-value = 0.4136
var.test(recycle$log.paper, recycle$recyclingBinPresent)
## Warning in var(y): Calling var(x) on a factor x is deprecated and will become an error.
##   Use something like 'all(duplicated(x)[-1L])' to test for a constant vector.
## 
##  F test to compare two variances
## 
## data:  recycle$log.paper and recycle$recyclingBinPresent
## F = 1.7989, num df = 40, denom df = 40, p-value = 0.06689
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.9593389 3.3733860
## sample estimates:
## ratio of variances 
##            1.79895
leveneTest(recycle$log.paper ~ recycle$recyclingBinPresent, center = mean)
## Levene's Test for Homogeneity of Variance (center = mean)
##       Df F value Pr(>F)
## group  1  0.2523 0.6183
##       39
leveneTest(recycle$log.paper ~ recycle$recyclingBinPresent, center = median)
## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value Pr(>F)
## group  1   0.385 0.5385
##       39

So the equal-variances assumption is no longer violated, but normality still fails for the “absent” group. What about the square root transformation?

recycle$sr.paper = sqrt(recycle$paperUsage)
boxplot(recycle$sr.paper ~ recycle$recyclingBinPresent, horizontal=TRUE)

with(recycle, tapply(sr.paper, recyclingBinPresent, shapiro.test))
## $absent
## 
##  Shapiro-Wilk normality test
## 
## data:  X[[i]]
## W = 0.88006, p-value = 0.01211
## 
## 
## $present
## 
##  Shapiro-Wilk normality test
## 
## data:  X[[i]]
## W = 0.89752, p-value = 0.04388
var.test(recycle$sr.paper, recycle$recyclingBinPresent)
## Warning in var(y): Calling var(x) on a factor x is deprecated and will become an error.
##   Use something like 'all(duplicated(x)[-1L])' to test for a constant vector.
## 
##  F test to compare two variances
## 
## data:  recycle$sr.paper and recycle$recyclingBinPresent
## F = 5.8129, num df = 40, denom df = 40, p-value = 1.718e-07
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##   3.099865 10.900257
## sample estimates:
## ratio of variances 
##           5.812858
leveneTest(recycle$sr.paper ~ recycle$recyclingBinPresent, center = mean)
## Levene's Test for Homogeneity of Variance (center = mean)
##       Df F value  Pr(>F)  
## group  1  4.0485 0.05115 .
##       39                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
leveneTest(recycle$sr.paper ~ recycle$recyclingBinPresent, center = median)
## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value  Pr(>F)  
## group  1  3.1393 0.08424 .
##       39                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Neither transformation worked, so we will have to use a non-parametric test. Since we have two groups, and Levene’s test found the variances of the original measurement to be unequal, we should run the Kolmogorov-Smirnov test.

\(H_0:\) Paper usage in the two groups comes from the same continuous distribution
\(H_A:\) Paper usage in the two groups does not come from the same continuous distribution

ks.test(recycle$recyclingBinPresent, recycle$paperUsage, alternative="two.sided", conf.level = 0.95)
## Warning in ks.test(recycle$recyclingBinPresent, recycle$paperUsage,
## alternative = "two.sided", : cannot compute exact p-value with ties
## 
##  Two-sample Kolmogorov-Smirnov test
## 
## data:  recycle$recyclingBinPresent and recycle$paperUsage
## D = 1, p-value < 2.2e-16
## alternative hypothesis: two-sided

We have ties, and the same caveat as in the cricket example applies: this call hands ks.test() the grouping factor rather than the two paper-usage samples, so the reported \(D\) and p-value should be re-checked with the corrected form sketched there. That said, with the groups differing as strongly as the plots and Levene’s test indicate, we reject \(H_0\); paper usage is not the same in both groups.

Chapter 12, Problem 16

Mosquitoes find their victims in part by odor, so it makes sense to wonder whether what we eat and drink influences our attractiveness to mosquitoes. A study in West Africa, working with the mosquito species that carry malaria, wondered whether drinking the local beer influenced attractiveness to mosquitoes.
They opened a container holding 50 mosquitoes next to each of 25 alcohol-free participants and measured the proportion of mosquitoes that left the container and flew toward the participants (they called this proportion “activation”). They repeated this procedure 15 minutes after each participant had consumed a liter of beer and measured the “change in activation” (after beer - before beer). This procedure was also carried out on another 18 participants who were given water instead of beer.
Without checking any of the underlying assumptions, use a t-test for differences between the mean changes in mosquito activation between the beer-drinking and water-drinking groups.

mosq = read.csv(url("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter12/chap12q16BeerAndMosquitoes.csv"))
head(mosq); names(mosq)
##   drink beforeDrink afterDrink change
## 1  beer        0.13       0.49   0.36
## 2  beer        0.13       0.59   0.46
## 3  beer        0.21       0.27   0.06
## 4  beer        0.25       0.43   0.18
## 5  beer        0.25       0.50   0.25
## 6  beer        0.32       0.50   0.18
## [1] "drink"       "beforeDrink" "afterDrink"  "change"

We know this is a two-sample t-test.

\(H_0:\) Mean change in activation is the same in both groups
\(H_A:\) Mean change in activation is not the same in both groups

t.test(mosq$change ~ mosq$drink, alternative="two.sided", var.equal=TRUE)
## 
##  Two Sample t-test
## 
## data:  mosq$change by mosq$drink
## t = 3.1913, df = 41, p-value = 0.002717
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.05383517 0.23940928
## sample estimates:
##  mean in group beer mean in group water 
##         0.154400000         0.007777778

Given the p-value we can reject \(H_0\); the data suggest that mean change in activation is not the same in both groups.

Test for normality and equality of variances. Transform the data (if possible with the natural log or the square root). If neither transformation works, use the appropriate non-parametric test.

summary(mosq)
##    drink     beforeDrink       afterDrink         change        
##  beer :25   Min.   :0.1200   Min.   :0.0600   Min.   :-0.28000  
##  water:18   1st Qu.:0.3100   1st Qu.:0.4150   1st Qu.:-0.03000  
##             Median :0.4400   Median :0.4900   Median : 0.10000  
##             Mean   :0.4447   Mean   :0.5377   Mean   : 0.09302  
##             3rd Qu.:0.5850   3rd Qu.:0.7000   3rd Qu.: 0.20500  
##             Max.   :0.7900   Max.   :1.0000   Max.   : 0.46000
boxplot(mosq$change ~ mosq$drink, horizontal=TRUE)

tapply(mosq$change, mosq$drink, FUN = shapiro.test)
## $beer
## 
##  Shapiro-Wilk normality test
## 
## data:  X[[i]]
## W = 0.96785, p-value = 0.5912
## 
## 
## $water
## 
##  Shapiro-Wilk normality test
## 
## data:  X[[i]]
## W = 0.98162, p-value = 0.9652
var.test(mosq$change, mosq$drink)
## Warning in var(y): Calling var(x) on a factor x is deprecated and will become an error.
##   Use something like 'all(duplicated(x)[-1L])' to test for a constant vector.
## 
##  F test to compare two variances
## 
## data:  mosq$change and mosq$drink
## F = 0.10805, num df = 42, denom df = 42, p-value = 4.511e-11
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.05852226 0.19947615
## sample estimates:
## ratio of variances 
##          0.1080453
library(car)
leveneTest(mosq$change ~ mosq$drink, center = mean)
## Levene's Test for Homogeneity of Variance (center = mean)
##       Df F value Pr(>F)
## group  1  1.3689 0.2488
##       41
leveneTest(mosq$change ~ mosq$drink, center = median)
## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value Pr(>F)
## group  1  1.0519 0.3111
##       41

The Shapiro-Wilk test says normality may not be violated, but the variance tests give us conflicting results. Note, however, that the var.test() call above compared change with the drink factor itself (hence the warning), so its extreme F-ratio should be discounted; Levene’s test, which was specified correctly, finds no problem with equal variances. What shall we do?

Since some doubt remains about equal variances and both sub-samples are small \((n < 30)\), we might as well run the t-test with unequal variances. We do so below.

t.test(mosq$change ~ mosq$drink, alternative="two.sided", var.equal=FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  mosq$change by mosq$drink
## t = 3.3219, df = 40.663, p-value = 0.001897
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.05746134 0.23578311
## sample estimates:
##  mean in group beer mean in group water 
##         0.154400000         0.007777778

Same conclusion as before; mean change in activation differs across the two groups.

Assume normality is rejected, the variances were indeed equal (i.e., that Levene’s test was spot-on with its conclusion), and no transformation will work. Now run the appropriate non-parametric test.

The test will be the Mann-Whitney U test.

\(H_0:\) Change in activation is drawn from the same population for both groups
\(H_A:\) Change in activation is not drawn from the same population for both groups

wilcox.test(mosq$change ~ mosq$drink, paired=FALSE)
## Warning in wilcox.test.default(x = c(0.36, 0.46, 0.06, 0.18, 0.25, 0.18, :
## cannot compute exact p-value with ties
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  mosq$change by mosq$drink
## W = 343.5, p-value = 0.003661
## alternative hypothesis: true location shift is not equal to 0
library(exactRankTests)
wilcox.exact(mosq$change ~ mosq$drink, paired=FALSE)
## 
##  Exact Wilcoxon rank sum test
## 
## data:  mosq$change by mosq$drink
## W = 343.5, p-value = 0.002899
## alternative hypothesis: true mu is not equal to 0

Again, we can reject \(H_0\); the distribution of change in activation differs between the two groups.

Note that the presence of ties forced us to switch to the Exact Wilcoxon Rank Sum test.

Assume normality is rejected, variances are unequal, and no transformation will work. Use an appropriate non-parametric test.

This will have to be the Kolmogorov-Smirnov test.

\(H_0:\) Change in activation is drawn from the same population for both groups
\(H_A:\) Change in activation is not drawn from the same population for both groups

ks.test(mosq$drink, mosq$change)
## Warning in ks.test(mosq$drink, mosq$change): cannot compute exact p-value
## with ties
## 
##  Two-sample Kolmogorov-Smirnov test
## 
## data:  mosq$drink and mosq$change
## D = 1, p-value < 2.2e-16
## alternative hypothesis: two-sided

Beware: as in the earlier examples, this call hands ks.test() the drink factor codes and the changes as the two “samples,” so the \(D = 1\) and near-zero p-value are artifacts of a mis-specified comparison; rerun it on the two change samples split by drink (as sketched for the crickets) before leaning on these numbers. The earlier Welch and Wilcoxon results do point the same way, though: change in activation differs between the beer and water groups.