1 - The Normal Distribution

The Normal distribution is expressed as: \[Y \sim N(\mu, \sigma): \quad f(Y)=\dfrac{1}{\sigma\sqrt{2\pi}}e^{-(Y-\mu)^{2}/2\sigma^{2}}\] where \(\mu\) is the mean, \(\sigma\) is the standard deviation, \(\pi \approx 3.14159\), and \(e \approx 2.71828\).

Two parameters \((\mu, \sigma)\) describe a Normal distribution. The Normal distribution is symmetric, with tails extending from \(-\infty\) to \(+\infty\). Note that \(\mu\) can be positive, negative, or zero, while \(\sigma\) dictates how wide or flat the curve is. The total area under the Normal curve is \(1\): the area to the left of \(\mu\) is 0.5, and the area to the right of \(\mu\) is 0.5.
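
As a quick check on this formula, here is a minimal sketch comparing a hand-coded version of the density against R's built-in dnorm(); the evaluation point and parameters (mean 50, sd 10) are arbitrary choices for illustration:

# Hand-coded Normal density versus the built-in dnorm()
f = function(y, mu, sigma) (1/(sigma*sqrt(2*pi))) * exp(-(y - mu)^2/(2*sigma^2))
f(50, mu=50, sigma=10)       # density at the mean: 0.03989423
dnorm(50, mean=50, sd=10)    # identical result

We can now sketch the whole curve: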

# Sketch a Normal curve with mean = 50 and sd = 10
x = seq(0,100,length=200)
y = dnorm(x,mean=50,sd=10)
plot(x, y, type="l", lwd=2, col="red", 
     bty="n", yaxt="n", ylab="")
polygon(c(0,x,100),c(0,y,0),col="tomato1")
# White lines mark 1, 2, and 3 standard deviations either side of the mean
abline(v=c(40,60), col="white")
abline(v=c(30,70), col="white")
abline(v=c(20,80), col="white")
abline(v=50, col="blue")  # the mean

Now we can sketch some Normal distributions with the same mean but with different variability.

# Same Mean, different Variance #
x = seq(-8,8,length=500)
y3 = dnorm(x,mean=0,sd=1/2)  # narrowest curve
plot(x,y3,type="l",lwd=2,col="green", bty="n", xaxt="n", yaxt="n", ylab="")
y2 = dnorm(x,mean=0,sd=2)    # widest curve
lines(x,y2,type="l",lwd=2,col="blue")
y1 = dnorm(x,mean=0,sd=1)
lines(x,y1,type="l",lwd=2,col="red")

And now some with different means and different variability.

# Different Mean, different Variance #
x = seq(-8,8,length=500)
y3 = dnorm(x,mean=0,sd=1/2)
plot(x,y3,type="l",lwd=2,col="green", bty="n", xaxt="n", yaxt="n", ylab="")
y2 = dnorm(x,mean=1,sd=1)
lines(x,y2,type="l",lwd=2,col="blue")
y1 = dnorm(x,mean=2,sd=1.5)
lines(x,y1,type="l",lwd=2,col="red")

2 - The Standard Normal Distribution

The Standard Normal distribution is simply a Normal distribution with \(\mu=0\) and \(\sigma=1\); values along its horizontal axis are \(z\)-scores. We can sketch it as follows:

z = seq(-3,3,length=200)
y = dnorm(z,mean=0,sd=1)
plot(z, y, type="l", lwd=2, col="red", 
     bty="n", yaxt="n", ylab="")
polygon(c(-3,z,3),c(0,y,0),col="lightblue")
# White lines mark z = ±1, ±2, and ±3
abline(v=c(-1,1), col="white")
abline(v=c(-2,2), col="white")
abline(v=c(-3,3), col="white")
abline(v=0, col="firebrick")  # the mean, z = 0

3 - Linking Confidence Intervals to Significance Levels

For every Normal distribution, a fixed proportion of the distribution falls within 1 standard deviation of the mean. Let us see what this proportion is. First we will calculate the area above \(z=+1\); given the symmetry of the distribution, we know this is also the area below \(z=-1\):

pnorm(1, mean=0, sd=1, lower.tail=FALSE)
## [1] 0.1586553
pnorm(-1, mean=0, sd=1, lower.tail=TRUE)
## [1] 0.1586553

Note the command’s structure: pnorm(zscore, mean=0, sd=1, lower.tail=) where lower.tail=FALSE indicates we want the area above the z-score and lower.tail=TRUE indicates that we want the area below the z-score.

Therefore, the total area that falls to the right of \(z=+1\) plus the area to the left of \(z=-1\) is just 0.1586553 + 0.1586553 = 0.3173106. This, in turn, implies that the total area between \(z = \pm 1\) has to be 0.6826895. What about the area between \(z = \pm 2\)?

pnorm(2, mean=0, sd=1, lower.tail=FALSE)
## [1] 0.02275013
pnorm(-2, mean=0, sd=1, lower.tail=TRUE)
## [1] 0.02275013

Similarly, the total area between \(z = \pm 2\) has to be \(1 - (2 \times 0.02275013) = 0.9544997\).
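
Both of these areas can be confirmed in R with one-line checks:

pnorm(1) - pnorm(-1)  # area between z = -1 and z = +1: 0.6826895
pnorm(2) - pnorm(-2)  # area between z = -2 and z = +2: 0.9544997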

Okay, so thus far the Empirical Rule seems to hold. But what if I wanted to know what \(z\)-scores trap 95% of the distribution? Here we are being asked for the \(z\)-scores that trap a certain proportion (0.95) of the distribution. Given the symmetry of the distribution, we know this means a total of 5% of the distribution is split between the two tails: 2.5% in the left tail and 2.5% in the right tail. The corresponding \(z\)-scores can be obtained as follows:

qnorm(0.025, lower.tail=TRUE)
## [1] -1.959964
qnorm(0.025, lower.tail=FALSE)
## [1] 1.959964

Note the command structure: qnorm(proportion of the curve, lower.tail=), where lower.tail=TRUE returns the \(z\)-score with that proportion of the curve below it and lower.tail=FALSE returns the \(z\)-score with that proportion above it.

So, in essence, \(z = \pm 1.959964 \approx \pm 1.96\) trap 95% of the distribution. What about 99% of the distribution?

qnorm(0.005, lower.tail=TRUE)
## [1] -2.575829
qnorm(0.005, lower.tail=FALSE)
## [1] 2.575829

Well, that turns out to be \(\approx \pm 2.58\).

We can store these two values in memory: \(z = \pm 1.96\) flags the middle 95% of the distribution, and similarly \(z = \pm 2.58\) flags the middle 99% of the distribution.

In the problems that follow you will see the 95% confidence interval being calculated with \(z=\pm 1.96\); this is where the value \(1.96\) comes from. Note also that this means we no longer use the \(2 \times se\) rule for calculating the approximate 95% confidence interval.
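
As a final sanity check, we can run these two stored values back through pnorm() to confirm they trap the advertised proportions:

# z = ±1.96 should trap ~95%, and z = ±2.58 ~99%, of the distribution
pnorm(1.96) - pnorm(-1.96)  # 0.9500042
pnorm(2.58) - pnorm(-2.58)  # ~0.9901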

4 - The Binomial Distribution

Given a certain number of independent trials (\(n\)) with an identical probability (\(p\)) of success (\(X\)) in each trial, we can easily calculate the probability of seeing a specific number of successes. For example, if I flip a coin 10 times, where \(X=Head\) with \(p=0.5\), then what is the probability of seeing exactly 2 heads, exactly 4 heads, 7 heads, etc.? The answer is easily calculated as: \[ P\left[X \text{ successes}\right] = \binom{n}{X}p^{X}\left(1 - p\right)^{n-X} \\ \text{where } \binom{n}{X} = \dfrac{n!}{X!(n-X)!} \text{ and } \\ n! = n \times (n-1) \times (n-2) \times \cdots \times 2 \times 1\]
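
To make the formula concrete, here is the arithmetic for one of the cases just posed, exactly 2 heads in 10 flips: \[P\left[2 \text{ successes}\right] = \binom{10}{2}(0.5)^{2}(0.5)^{8} = 45 \times \dfrac{1}{1024} \approx 0.0439\]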

R has built-in functions to perform the calculations for us. For example, if I toss a coin 2 times, what is the probability of getting exactly 1 head? Let \(X=1\). We know for unbiased coins \(p(Heads)=0.50\). We are also conducting \(n=2\) independent trials.

x = 0:2                       # possible numbers of heads in two tosses
p.x = dbinom(x, 2, prob=0.5)  # P[X = x] for each count
cbind(x, p.x)
##      x  p.x
## [1,] 0 0.25
## [2,] 1 0.50
## [3,] 2 0.25
plot(x, p.x, type = "h", ylim=c(0,1))  # spike plot of the probability mass function
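
As a sanity check, the same probability can be computed straight from the formula using base R's choose(), and dbinom() will likewise answer the 10-flip questions posed earlier:

# P[1 head in 2 tosses] straight from the formula
choose(2, 1) * 0.5^1 * (1 - 0.5)^(2 - 1)
## [1] 0.5
# P[exactly 2, 4, and 7 heads] in 10 flips of a fair coin
dbinom(c(2, 4, 7), size=10, prob=0.5)
## [1] 0.04394531 0.20507812 0.11718750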

5 - Some Worked Problems

5.1 - Problem 3

  1. The observed proportion of wallets returned was \(\frac{101}{240}=\) 0.4208333

  2. We know that \(p^{'}=\dfrac{X+2}{n+4}\) and so we can calculate this as follows:

n = 240; X = 101
p.prime = (X + 2)/(n + 4)  # "add 2 successes and 2 failures"
p.prime
## [1] 0.4221311

So \(p^{'}=\) 0.4221311

  3. The 95% lower-bound is given by \[p^{'}-z\sqrt{\dfrac{p^{'}\left(1-p^{'}\right)}{n+4}}\]

We have \(z \approx 1.96\) as the \(z\)-score to be used for the 95% confidence interval, and hence the lower-bound is:

z = 1.96
z
## [1] 1.96
se.p.prime = sqrt((p.prime * (1 - p.prime))/(n+4))  # standard error of p'
p.95CI.lower = p.prime - z*(se.p.prime)
p.95CI.upper = p.prime + z*(se.p.prime)
p.95CI.lower; p.95CI.upper
## [1] 0.3601586
## [1] 0.4841037

So the lower-bound and upper-bound are 0.3601586 and 0.4841037, respectively.

  4. Two values that lie within the most plausible range would be \(0.37\) and \(0.47\), and two values that fall outside this range would be \(0.35\) and \(0.49\).

  5. Because the null proportion of 0.5 does not fall within the interval, we would be able to reject \(H_0\) if we were using a significance level of 0.05 (i.e., \(\alpha=0.05\)).

For completeness’ sake, note that R could do these calculations very swiftly:

library(binom)
binom.test(X, n, p=0.5, alternative="two.sided", conf.level=0.95)
## 
##  Exact binomial test
## 
## data:  X and n
## number of successes = 101, number of trials = 240, p-value =
## 0.01674
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
##  0.3576151 0.4860318
## sample estimates:
## probability of success 
##              0.4208333
binom.confint(X, n, p=0.5, alternative="two.sided", conf.level=0.95, methods=c("ac"))
##          method   x   n      mean     lower     upper
## 1 agresti-coull 101 240 0.4208333 0.3600899 0.4840711

Note a couple of things. First, binom.test() reports the exact (Clopper-Pearson) confidence interval, whereas binom.confint() allows us to choose what method we want; in this example you see the Agresti-Coull method being used via the methods="ac" switch.

The Agresti-Coull method used by R is slightly different from that listed in the text. In particular, "for a 95% confidence interval, this method does not use the concept of 'adding 2 successes and 2 failures', but rather uses the formulas explicitly described here: http://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval#Agresti-Coull_Interval". As a result our manual calculations will not synchronize 1:1 with the confidence interval generated by binom.confint(). However, this won't be an issue since R will automatically calculate the correct confidence interval limits so long as we specify methods="ac".

Remember: If \(n\) is small and/or \(p\) is close to \(0\) or \(1\) then the Agresti-Coull method is preferred. I would just default to this in any event.
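
To see why this matters, here is a minimal comparison sketch; the counts (19 successes in 20 trials) are an arbitrary small-\(n\), extreme-\(p\) example, and "asymptotic" is the binom package's name for the Wald method:

# Wald ("asymptotic") vs. Agresti-Coull for a small sample with p near 1
binom.confint(19, 20, conf.level=0.95, methods=c("asymptotic", "ac"))

With samples like this the two methods can disagree noticeably, which is why defaulting to Agresti-Coull is the safer habit.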

5.2 - Radiologists and Missing Sons

Assume that radiologists are just as likely to have boys as girls. This generates the following hypotheses:

\[H_0: \text{ Radiologists are just as likely to have sons as daughters } (p=0.5)\]
\[H_A: \text{ Radiologists are not as likely to have sons as daughters } (p \neq 0.5)\]

Let \(\alpha=0.05\)

binom.test(30, 87, p=0.5, alternative="two.sided", conf.level=0.95)
## 
##  Exact binomial test
## 
## data:  30 and 87
## number of successes = 30, number of trials = 87, p-value =
## 0.005014
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
##  0.2461396 0.4544136
## sample estimates:
## probability of success 
##              0.3448276

The \(P\)-value \(= 0.005014\), so we can reject \(H_0\); the data provide sufficient evidence to conclude that radiologists are not as likely to have sons as daughters.

What if we suspected, a priori, that radiologists are less likely to have sons? In that case we would have done the following:

\[H_0: \text{ Radiologists are at least as likely to have sons as daughters } (p \geq 0.5)\]

\[H_A: \text{ Radiologists are less likely to have sons than daughters } (p < 0.5)\]

Let \(\alpha=0.05\)

binom.test(30, 87, p=0.5, alternative="less", conf.level=0.95)
## 
##  Exact binomial test
## 
## data:  30 and 87
## number of successes = 30, number of trials = 87, p-value =
## 0.002507
## alternative hypothesis: true probability of success is less than 0.5
## 95 percent confidence interval:
##  0.0000000 0.4374992
## sample estimates:
## probability of success 
##              0.3448276

Again, the \(P\)-value \(= 0.002507\) and we can easily reject \(H_0\); the data provide sufficient evidence to conclude that radiologists are not at least as likely to have sons as daughters.

5.3 - Problem 5

  1. Best estimate of the proportion of $1 bills with measurable cocaine content would be \(\frac{46}{50}=\) 0.92

  2. The 95% confidence interval would be:

binom.confint(46, 50, p=0.5, conf.level=0.95, methods=c("ac"), alternative="two.sided")
##          method  x  n mean     lower     upper
## 1 agresti-coull 46 50 0.92 0.8064694 0.9735986

  3. We can be 95% confident that the true proportion of $1 bills with measurable cocaine content falls within the interval given by 0.8065 and 0.9736.

5.4 - Problem 14

  1. This is pretty straightforward:

n = 12028
x = 6052
p.hat = (x/n); p.hat
## [1] 0.5031593

The best estimate of the proportion of deaths in this time interval that occurred in the week before Christmas is 0.5032.

  2. To calculate the confidence interval we will need to use the Agresti-Coull method. This can be done as follows:

p.prime = (x + 2)/(n + 4)
se.p.prime = sqrt((p.prime * (1 - p.prime))/(n+4))
p.95CI.lower = p.prime - 2*(se.p.prime)  # note the approximate multiplier 2 rather than 1.96
p.95CI.upper = p.prime + 2*(se.p.prime)
p.95CI.lower; p.95CI.upper
## [1] 0.4940419
## [1] 0.5122746

So the 95% CI runs from 0.4940 to 0.5123.

  3. The true value of 50%, i.e., the true proportion of 0.50, falls within the 95% confidence interval, and as such the data do not support the common perception that people are able to delay their deaths from chronic illnesses until after a special event, like Christmas.

5.5 - Problem 15

In 13 of the 16 pairs, the clade with latex/resin was found to be more diverse. Does being defended by latex/resin lead to more diversity?

\[H_0: \text{ The clade with latex/resin is likely to be just as diverse as its sister clade without latex/resin } (p=0.5)\]

\[H_A: \text{ The clade with latex/resin is not likely to be just as diverse as its sister clade without latex/resin } (p \neq 0.5)\]

Let \(\alpha=0.05\)

binom.test(13, 16, p=0.5, alternative="two.sided", conf.level=0.95)
## 
##  Exact binomial test
## 
## data:  13 and 16
## number of successes = 13, number of trials = 16, p-value = 0.02127
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
##  0.5435435 0.9595263
## sample estimates:
## probability of success 
##                 0.8125

The \(P\)-value \(= 0.02127\), so we reject \(H_0\). However, note that with \(\alpha=0.01\) we would be unable to reject \(H_0\):

binom.test(13, 16, p=0.5, alternative="two.sided", conf.level=0.99)
## 
##  Exact binomial test
## 
## data:  13 and 16
## number of successes = 13, number of trials = 16, p-value = 0.02127
## alternative hypothesis: true probability of success is not equal to 0.5
## 99 percent confidence interval:
##  0.4656383 0.9776921
## sample estimates:
## probability of success 
##                 0.8125

Note that the 99% confidence interval traps the null value of \(0.5\) AND the \(P\)-value \(> \alpha\) when \(\alpha=0.01\).

  2. In order for this to be a controlled experiment, the authors would need to have randomly assigned the clades to the latex/resin and non-latex/resin groups. They don't tell us this random assignment took place, and as such it seems to be an observational study.

5.6 - Problem 19

  1. Below we calculate the estimated proportion, p-prime, and the limits of the 95% confidence interval.

n = 200
x = 109
z = qnorm(0.025, lower.tail=FALSE)
p.hat = (x/n)
p.prime = (x + 2)/(n + 4)
se.p.prime = sqrt((p.prime * (1 - p.prime))/(n+4))
p.95CI.lower = p.prime - z*(se.p.prime)
p.95CI.upper = p.prime + z*(se.p.prime)
p.95CI.lower;p.95CI.upper
## [1] 0.4757728
## [1] 0.6124625

The best estimate of the proportion of shoppers that have injured themselves with food or drink packaging is 0.545. The 95% confidence interval runs from 0.4758 to 0.6125.

  2. We are not told if this was a random sample. One factor rendering it a non-random sample might be that the grocery stores all belonged to a particular chain that caters to a particular segment of society, or were all located in an area frequented by members of a particular socioeconomic/demographic group. Non-response (who chose to participate in the survey) may be an issue as well; those who had been injured may have been more or less likely to participate in the survey.