class: title-slide, center, middle, inverse # .salt[.fancy[Module 03: Sampling Distributions and more]] # .salt[.fancy[MPA 6010]] # .salt[.fancy[Ani Ruhil]] --- # .fat[.fancy[Agenda]] 1. Sampling distributions 2. Interval estimation 3. Student's `t` distribution --- class: inverse, middle, center # .fat[.fancy[Sampling]] --- .heatinline[.fancy[Sampling]] is the science of drawing samples There are many ways to do it, but at its most basic it is about upholding two principles: (1) Each and every observation in the population has the same chance of being drawn into the sample (2) Each observation is drawn into the sample independently of all other observations Samples that meet criteria (1) and (2) are .heatinline[.fancy[simple random samples]] Draw a distinction between .heatinline[.fancy[finite populations]] (where we know the population size) and .heatinline[.fancy[infinite populations]] (where we do not know the population size) * Finite Population: Draw a sample of size `\(n\)` from the population such that each possible sample of size `\(n\)` has an identical probability of selection * We may sample with replacement (recommended), or * sample without replacement (the more common -- but not necessarily better -- approach) * Infinite Population: We sample such that * Each element is selected independently of all other elements * Each element has the same probability of being selected --- .pull-left[ A population of four numbers -- 2, 4, 6, and 8. What are all the possible samples (drawn with replacement) of size `\(n = 2\)`? Which sample mean shows up most often? So the next time I draw a sample of `\(n = 2\)` from this population, what sample mean should I .heatinline[.fancy[expect]] to see?
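The questions above can be checked with a quick sketch -- here in Python, not part of the original slides -- that enumerates every ordered sample of size `\(n = 2\)` drawn with replacement:

```python
from collections import Counter
from itertools import product
from statistics import mean, pstdev

pop = [2, 4, 6, 8]                       # the population
samples = list(product(pop, repeat=2))   # all 16 ordered samples of size n = 2
means = [sum(s) / 2 for s in samples]    # the mean of each sample

print(Counter(means).most_common(1))     # [(5.0, 4)]: a mean of 5 occurs most often
print(mean(means))                       # 5.0, i.e., E(x-bar) equals the population mean
print(round(pstdev(means), 4))           # 1.5811, i.e., sigma / sqrt(2)
```

The most frequent sample mean, 5, is also the population mean, which is exactly what the formal statement of the sampling distribution makes precise.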
Formally, the Sampling Distribution of `\(\bar{x}\)` is the probability distribution of all possible values of the sample mean `\(\bar{x}\)` .heatinline[.fancy[Expected Value]] of `\(\bar{x}\)` is `\(E(\bar{x})=\mu\)` and Standard Deviation of `\(\bar{x}\)` is the .heatinline[.fancy[Standard Error]] of `\(\bar{x}\)`, calculated as `\(\sigma_{\bar{x}}=\dfrac{\sigma}{\sqrt{n}}\)` ] .pull-right[ <table class="table table-striped" style="font-size: 14px; width: auto !important; margin-left: auto; margin-right: auto;"> <caption style="font-size: initial !important;">An Example of Sampling Distributions</caption> <thead> <tr> <th style="text-align:right;"> Sample No. </th> <th style="text-align:right;"> x1 </th> <th style="text-align:right;"> x2 </th> <th style="text-align:right;"> Mean </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 2 </td> </tr> <tr> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 3 </td> </tr> <tr> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 4 </td> </tr> <tr> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 5 </td> </tr> <tr> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 3 </td> </tr> <tr> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 4 </td> </tr> <tr> <td style="text-align:right;"> 7 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 5 </td> </tr> <tr> <td 
style="text-align:right;"> 8 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 6 </td> </tr> <tr> <td style="text-align:right;"> 9 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 4 </td> </tr> <tr> <td style="text-align:right;"> 10 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 5 </td> </tr> <tr> <td style="text-align:right;"> 11 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 6 </td> </tr> <tr> <td style="text-align:right;"> 12 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 7 </td> </tr> <tr> <td style="text-align:right;"> 13 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 5 </td> </tr> <tr> <td style="text-align:right;"> 14 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 6 </td> </tr> <tr> <td style="text-align:right;"> 15 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 7 </td> </tr> <tr> <td style="text-align:right;"> 16 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 8 </td> </tr> </tbody> </table> ] --- ## The Human Genome Let us look at gene lengths in the [human genome](http://phylo.bio.ku.edu/biostats/geneLenDemo.html) with the population-level data <img src="module03_files/figure-html/genes1-1.svg" width="60%" style="display: block; margin: auto;" /> --- ### .heat[.fancy[The Standard Error]] A gauge of how much error we should expect to make because we are using a sample instead of the population `$$\sigma_{\bar{x}} = \dfrac{\sigma}{\sqrt{n}}$$` Meaning? ...
the average distance of a sample mean from the population mean, for all samples of a given size drawn from a common population with standard deviation `\(\sigma\)` Two things drive how large or small the standard error will be ... (1) how much variability there is in the population, i.e., the population standard deviation `\((\sigma)\)`, and (2) how large a sample we draw, i.e., `\((n)\)` We cannot influence `\(\sigma\)`, but `\(n\)` is under our control. All else being equal, + the larger the `\(n\)`, the smaller `\(\sigma_{\bar{x}}\)` will be, and + the larger the `\(\sigma\)`, the larger `\(\sigma_{\bar{x}}\)` will be --- ## .heatinline[.fancy[The Central Limit Theorem]] > For any population with mean `\(\mu\)` and standard deviation `\(\sigma\)`, the distribution of sample means for samples of size `\(n\)` will approach the normal distribution with a mean of `\(\mu\)` and standard error of `\(\sigma_{\bar{x}}=\dfrac{\sigma}{\sqrt{n}}\)` as `\(n \rightarrow \infty\)` Since we never see the population, it turns out that if we have `\(n \geq 30\)` we can rely on this theorem to carry out some basic statistical tests with the .heatinline[.fancy[z-score (aka the standard normal) distribution]] --- ### Example 1 Tuition costs at state universities are `\(\sim \mu=4260; \sigma=900\)`. A sample of `\(n=50\)` is drawn at random. (1) What is the sampling distribution of mean tuition costs? `\(\mu = 4260; \sigma_{\bar{x}}=\dfrac{\sigma}{\sqrt{n}}=\dfrac{900}{\sqrt{50}}= 127.28\)` .pull-left[ (2) What is the probability that `\(\bar{x}\)` falls within `\(250\)` of `\(\mu\)`? The sampling distribution is `\(\mu = 4260; \sigma_{\bar{x}} = 127.28\)` Within `\(250\)` implies `\(P(4010 \leq \bar{x} \leq 4510)\)` `$$z_{4510}=\dfrac{4510-4260}{127.28}=1.96 \\ z_{4010} = \dfrac{4010-4260}{127.28} = - 1.96 \\ P(4010 \leq \bar{x} \leq 4510) = 0.95$$` ] .pull-right[ (3) What is the probability that `\(\bar{x}\)` falls within `\(100\)` of `\(\mu\)`?
`$$P(4160 \leq \bar{x} \leq 4360) = 0.5704$$` ] --- ### Example 2 The average annual cost of automobile insurance is `\(\sim \mu=687; \sigma=230\)` and `\(n= 45\)` `\(\sigma_{\bar{x}}=\frac{\sigma}{\sqrt{n}}=\frac{230}{\sqrt{45}}= 34.29\)` Hence the sampling distribution is `\(\mu = 687; \sigma_{\bar{x}} = 34.29\)` .pull-left[ What is the probability that `\(\bar{x}\)` falls within `\(100\)` of `\(\mu\)`? We need `\(P(587 \leq \bar{x} \leq 787)\)` `$$z_{787}=\frac{787-687}{34.29}=2.92 \\ z_{587} = \frac{587-687}{34.29} = -2.92 \\ \therefore P(587 \leq \bar{x} \leq 787) = 0.9964$$` ] .pull-right[ What is the probability that `\(\bar{x}\)` falls within `\(25\)` of `\(\mu\)`? We need `\(P(662 \leq \bar{x} \leq 712)\)` `$$z_{712}=\frac{712-687}{34.29}=0.73 \\ z_{662} = \frac{662-687}{34.29} = -0.73 \\ \therefore P(662 \leq \bar{x} \leq 712) = 0.5346$$` ] If the insurance agency wants to be within 25, would you recommend a larger sample? Yes, since with `\(n = 45\)` that probability is only `\(0.5346\)` --- ## .heatinline[.fancy[What about proportions?]] The theory of sampling distributions applies to proportions as well The sample proportion is `\(\bar{p}=\dfrac{x}{n}\)` and on average `\(E(\bar{p}) = p\)`. The standard error (aka the standard deviation of `\(\bar{p}\)`) is `\(\sigma_{\bar{p}}={\sqrt{\dfrac{p(1-p)}{n}}}\)` We can assume the sampling distribution of `\(\bar{p}\)` is approximately normally distributed if `\(np \geq 5\)` AND `\(n(1-p) \geq 5\)` (both conditions must be met)<sup>1</sup> **Note:** When the population proportion `\(p\)` is unknown, the standard error is calculated as `\(s_{\bar{p}}={\sqrt{\dfrac{\bar{p}(1-\bar{p})}{n}}}\)` .footnote[[1] Technically, for a finite population, `\(\sigma_{\bar{p}}={\sqrt{\dfrac{N-n}{N-1}}}{\sqrt{\dfrac{p(1-p)}{n}}}\)` and for an infinite population `\(\sigma_{\bar{p}}={\sqrt{\dfrac{p(1-p)}{n}}}\)` ] --- ## Example 3 The Governor's office reports 56% of US households have internet access. If `\(p=0.56\)` and `\(n=300\)`, what is: (1) The sampling distribution of `\(\bar{p}\)`?
Given `\(n=300; p=0.56\)`, `\(\sigma_{\bar{p}}={\sqrt{\dfrac{0.56(1-0.56)}{300}}}=0.0287\)` (2) What is the probability that `\(\bar{p}\)` falls within `\(\pm 0.03\)` of `\(p\)`? We are looking for the interval `\((0.53,0.59)\)` and so we calculate: `$$z_{0.59}=\dfrac{0.59-0.56}{0.0287}=1.05$$` `$$z_{0.53}=\dfrac{0.53-0.56}{0.0287}=-1.05$$` and hence `\(P(0.53 \leq \bar{p} \leq 0.59)=0.7062\)` (3) Calculate (1) and (2) for `\(n=600\)` and `\(n=1000\)`, respectively --- class: inverse, middle, center # .heat[.fancy[Point Estimates]] --- Statistics calculated for a population are .heatinline[.fancy[population parameters]] (e.g., `\(\mu\)`; `\(\sigma^{2}\)`; `\(\sigma\)`) Statistics calculated for a sample are .heatinline[.fancy[sample statistics]] (e.g., `\({\bar{x}}\)`; `\(s^{2}\)`; `\(s\)`) Desirable point estimators have the following properties: (1) The sampling distribution of the point estimator is .heatinline[.fancy[unbiased]] -- it is centered around the population parameter (2) The point estimator is .heatinline[.fancy[efficient]] -- it has the smallest possible variance (3) The point estimator is .heatinline[.fancy[consistent]] -- it tends towards the population parameter as the sample size increases The easiest way to understand bias and efficiency is via the following graphic [authored by Sebastian Raschka](https://sebastianraschka.com): <img src="bias.png" width="30%" style="display: block; margin: auto;" /> --- ## Interval Estimates Although point estimates (e.g., `\(\bar{x}\)`) tell us the expected value of a parameter based on a sample statistic, we know that random samples of the same size, drawn from the same population, will not all yield the same estimate. So we ask: How much confidence can we place in the point estimate?
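A small simulation makes this sampling variability concrete (a Python sketch, not part of the original slides; the numbers borrow Example 1's tuition population, `\(\mu=4260; \sigma=900; n=50\)`):

```python
import random
from statistics import mean, pstdev

random.seed(2024)             # any seed works; fixed here for reproducibility
mu, sigma, n = 4260, 900, 50  # Example 1: tuition population and sample size

# Draw 2,000 random samples of size n and record each sample mean
xbars = [mean(random.gauss(mu, sigma) for _ in range(n)) for _ in range(2000)]

print(round(mean(xbars), 1))   # close to mu = 4260: x-bar is unbiased
print(round(pstdev(xbars), 1)) # close to sigma / sqrt(n) = 127.3, the standard error
```

Every sample yields a different `\(\bar{x}\)`, but the means cluster around `\(\mu\)` with spread `\(\sigma/\sqrt{n}\)`; the margin of error quantifies that spread.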
The .heatinline[.fancy[margin of error]] helps us answer this question * Interval Estimate of `\(\mu = \bar{x} \pm \text{ Margin of error}\)` * Interval Estimate of `\(p = \bar{p} \pm \text{ Margin of error}\)` * Margin of Error `\(= \left( z \times \sigma \right)\)` In any normal distribution of sample means with mean `\(\mu\)` and standard deviation `\(\sigma\)` (here `\(\sigma\)` denotes the standard error of `\(\bar{x}\)`), the following statement is true: * Over all samples of size `\(n\)`, the probability is `\(0.95\)` that `\(\bar{x}\)` falls within `\(1.96\sigma\)` of `\(\mu\)` * That is, `\(-1.96\sigma \leq \bar{x}-\mu \leq +1.96\sigma\)` Rearranging this inequality yields: `\(\bar{x}-1.96\sigma \leq \mu \leq \bar{x} +1.96\sigma\)` This says that over all samples of identical size `\(n\)`, there is a 95% probability that `\(\mu\)` falls within `\(\bar{x} \pm 1.96\sigma\)` The range of values within `\(\bar{x} \pm 1.96\sigma\)` yields the 95% .heatinline[.fancy[confidence interval (CI)]] of `\(\mu\)`, and the two boundaries are the 95% .heatinline[.fancy[confidence interval limits]] --- Be careful! We aren't saying `\(\mu\)` falls within a known interval. Rather, in repeated random sampling, 95% of such confidence intervals will include `\(\mu\)` <img src="fig81.png" width="45%" style="display: block; margin: auto;" /> --- <img src="fig82.png" width="75%" style="display: block; margin: auto;" /> --- ### Interval Estimate with Known `\(\sigma\)` `\(\bar{x} \pm z_{\alpha/2}\left(\dfrac{\sigma}{\sqrt{n}}\right)\)` where `\(z_{\frac{\alpha}{2}}\)` is the `\(z\)` value yielding an area of `\(\dfrac{\alpha}{2}\)` in the upper tail of the standard normal distribution | Confidence Level | `\(\alpha\)` | `\(\dfrac{\alpha}{2}\)` | `\(z_{\frac{\alpha}{2}}\)` | | :-- | :-- | :-- | :-- | | 90% | 0.10 | 0.050 | 1.645 | | 95% | 0.05 | 0.025 | 1.960 | | 99% | 0.01 | 0.005 | 2.576 | --- ## Confidence Intervals `$$\mu = \bar{x} \pm \text{ Margin of error}$$` When you calculate the interval estimate for a specific `\(z\)`, you get two values ...
`$$\bar{x} - (z \times \sigma)$$` `$$\bar{x} + (z \times \sigma)$$` These are the lower and upper confidence interval limits, respectively If `\(z = 1.645\)` you get the 90% confidence interval If `\(z = 1.960\)` you get the 95% confidence interval If `\(z = 2.576\)` you get the 99% confidence interval --- Let `\(\bar{x}=32; \sigma=6\)`, and let `\(n=50\)`. (1) What is the standard error? `$$\sigma_{\bar{x}}=\dfrac{\sigma}{\sqrt{n}}=\dfrac{6}{\sqrt{50}}=0.8485$$` (2) Find the `\(90\%\)` CI `$$= 32 \pm 1.645(0.8485)=32 \pm 1.3957=(30.6043; 33.3957)$$` (3) Find the `\(95\%\)` CI `$$= 32 \pm 1.960(0.8485)=32 \pm 1.6630=(30.3370; 33.6630)$$` (4) Find the `\(99\%\)` CI `$$= 32 \pm 2.576(0.8485)=32 \pm 2.1857=(29.8143; 34.1857)$$` The more confident you want to be, the wider the interval will be --- ### Example 4 Let household mean television viewing time be `\(\sim \bar{x}=8.5; \sigma=3.5\)` and `\(n=300\)` Therefore `\(\sigma_{\bar{x}}=\dfrac{\sigma}{\sqrt{n}}=\dfrac{3.5}{\sqrt{300}}=0.2020\)` .pull-left[ (1) Find the `\(95\)`% CI `$$= 8.5 \pm 1.960(0.2020) \\ = 8.5 \pm 0.3959 \\ = (8.1041; 8.8959)$$` ] .pull-right[ (2) What is the `\(99\)`% CI?
`$$= 8.5 \pm 2.576(0.2020) \\ = 8.5 \pm 0.520352 \\ = (7.979648; 9.020352)$$` ] --- ### .heat[.fancy[Confidence Intervals for Proportions]] Since we don't know the population proportion `\(p\)`, we calculate the standard error `\(s_{\bar{p}}\)` as `\(s_{\bar{p}} = {\sqrt{\dfrac{\bar{p}(1-\bar{p})}{n}}}\)` There are several formulas, but the most common are (1) the Agresti-Coull confidence interval -- first calculate `\(p^{'} = \dfrac{x + 2}{n + 4}\)` and then calculate `$$p^{'} - z_{\alpha/2}\sqrt{ \dfrac{p^{'} \left(1-p^{'} \right) } {n+4} } \leq p \leq p^{'} + z_{\alpha/2}\sqrt{ \dfrac{p^{'} \left(1-p^{'} \right) } {n+4} }$$` (2) Wald confidence intervals, calculated with the usual standard error as: `$$\bar{p} - z_{\alpha/2}\sqrt{ \dfrac{\bar{p} \left(1-\bar{p} \right) } {n} } \leq p \leq \bar{p} + z_{\alpha/2}\sqrt{ \dfrac{\bar{p} \left(1-\bar{p} \right) } {n} }$$` --- The Wald interval is unreliable (its actual coverage can fall well below the nominal level), and the Agresti-Coull interval can be too conservative (too wide), when (i) `\(n\)` is small or (ii) `\(p\)` is close to `\(0\)` or `\(1\)` In these situations Wilson's Interval is preferred `$$\dfrac{n}{n + z^2_{\alpha/2}}\left[ \left(\bar{p} + \dfrac{z^2_{\alpha/2}}{2n} \right) \pm z_{\alpha/2} \sqrt{\dfrac{\bar{p}(1-\bar{p})}{n} + \dfrac{z^2_{\alpha/2}}{4n^2}} \right ]$$` which converts to `$$\dfrac{n}{n + z^2_{\alpha/2}}\left[ \left(\bar{p} + \dfrac{z^2_{\alpha/2}}{2n} \right) - z_{\alpha/2} \sqrt{\dfrac{\bar{p}(1-\bar{p})}{n} + \dfrac{z^2_{\alpha/2}}{4n^2}} \right ] \leq p \leq \dfrac{n}{n + z^2_{\alpha/2}}\left[ \left(\bar{p} + \dfrac{z^2_{\alpha/2}}{2n} \right) + z_{\alpha/2} \sqrt{\dfrac{\bar{p}(1-\bar{p})}{n} + \dfrac{z^2_{\alpha/2}}{4n^2}} \right ]$$` --- ### .heat[.fancy[An interesting issue]] .pull-left[ <table class="table table-striped" style="font-size: 14px; width: auto !important; margin-left: auto; margin-right: auto;"> <caption style="font-size: initial !important;">The Binomial Proportion Revisited</caption> <thead> <tr> <th style="text-align:right;"> Proportion </th> <th 
style="text-align:right;"> 1 - Proportion </th> <th style="text-align:right;"> (Proportion) * (1 - Proportion) </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 0.0 </td> <td style="text-align:right;"> 1.0 </td> <td style="text-align:right;"> 0.00 </td> </tr> <tr> <td style="text-align:right;"> 0.1 </td> <td style="text-align:right;"> 0.9 </td> <td style="text-align:right;"> 0.09 </td> </tr> <tr> <td style="text-align:right;"> 0.2 </td> <td style="text-align:right;"> 0.8 </td> <td style="text-align:right;"> 0.16 </td> </tr> <tr> <td style="text-align:right;"> 0.3 </td> <td style="text-align:right;"> 0.7 </td> <td style="text-align:right;"> 0.21 </td> </tr> <tr> <td style="text-align:right;"> 0.4 </td> <td style="text-align:right;"> 0.6 </td> <td style="text-align:right;"> 0.24 </td> </tr> <tr> <td style="text-align:right;"> 0.5 </td> <td style="text-align:right;"> 0.5 </td> <td style="text-align:right;"> 0.25 </td> </tr> <tr> <td style="text-align:right;"> 0.6 </td> <td style="text-align:right;"> 0.4 </td> <td style="text-align:right;"> 0.24 </td> </tr> <tr> <td style="text-align:right;"> 0.7 </td> <td style="text-align:right;"> 0.3 </td> <td style="text-align:right;"> 0.21 </td> </tr> <tr> <td style="text-align:right;"> 0.8 </td> <td style="text-align:right;"> 0.2 </td> <td style="text-align:right;"> 0.16 </td> </tr> <tr> <td style="text-align:right;"> 0.9 </td> <td style="text-align:right;"> 0.1 </td> <td style="text-align:right;"> 0.09 </td> </tr> <tr> <td style="text-align:right;"> 1.0 </td> <td style="text-align:right;"> 0.0 </td> <td style="text-align:right;"> 0.00 </td> </tr> </tbody> </table> ] .pull-right[ Note what happens when `\(\bar{p}=0\)` and when `\(\bar{p}=1\)`: The numerator in the formula `\(s_{\bar{p}} = \sqrt{\dfrac{\bar{p}(1-\bar{p})}{n}}\)` is driven to `\(0\)` and hence so is the standard error The numerator `\(\bar{p}(1 - \bar{p})\)` is largest when `\(\bar{p}=0.50\)`; as `\(\bar{p}\)` moves away towards `\(0\)` 
or `\(1\)`, `\(\bar{p}(1 - \bar{p})\)` steadily decreases At the extremes, the Wilson interval will do well, but the Wald and Agresti-Coull intervals will not. So, + Use Wilson when `\(p\)` is close to `\(0\)` or `\(1\)` or if the sample size is `small` + Use Agresti-Coull when the sample size is `large` and `\(p\)` is not close to `\(0\)` or `\(1\)` ] --- In a survey of a small town in Vermont, voters were asked if they would support closing the local schools and sending the local kids to schools in the neighboring town to more efficiently utilize local tax dollars. A random sample of 153 voters yields 43.1% favoring school closures. What is the 95% confidence interval for the population proportion? Use the Wald approach. We have `\(\bar{p}=0.431\)` and `\(n = 153\)`, yielding a standard error of `$$s_{\bar{p}} = \sqrt{\dfrac{0.431(1-0.431)}{153}} = 0.04003585$$` Given `\(z = \pm 1.96\)` the confidence interval is: `$$0.431 \pm 1.96(0.04003585) = 0.3525297 \text{ and } 0.5094703$$` Loosely interpreted as: We are about 95% confident that the population proportion of the town's voters who support school closures lies in the interval given by 0.3525 and 0.5095 If you need to, you can [use this online calculator for exact confidence intervals](http://statpages.info/confint.html) --- class: inverse, middle, center # .fat[.fancy[Student's t-distribution]] <img src="gosset-plaque.png" width="75%" style="display: block; margin: auto;" /> --- An extension of the large sample theory of sampling distributions to small samples Assume `\(X \sim N(13,2)\)`. If a sample of `\(n=16\)` yields `\(\bar{x} = 12\)`, what is `\(z_{\bar{x}=12}\)`? It turns out that `\(z_{\bar{x}=12}=-2\)`. Now, what if we do not know `\(\mu\)` and `\(\sigma\)`, and draw three samples of `\(n=16\)` each in which `\(s\)` differs from sample to sample? What is `\(z\)` in each?
<table class="table table-striped" style="font-size: 16px; width: auto !important; margin-left: auto; margin-right: auto;"> <caption style="font-size: initial !important;">The t distribution: An Example</caption> <thead> <tr> <th style="text-align:right;"> Sample No. </th> <th style="text-align:right;"> Std. Dev. </th> <th style="text-align:right;"> Std. Error </th> <th style="text-align:right;"> z-score </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 1.00 </td> <td style="text-align:right;"> -1 </td> </tr> <tr> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 0.50 </td> <td style="text-align:right;"> -2 </td> </tr> <tr> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0.25 </td> <td style="text-align:right;"> -4 </td> </tr> </tbody> </table> Notice what happens here: Each sample yields a different value of `\(z\)` for the same `\(\bar{x}=12\)` simply because each sample generates a different `\(s\)`, and hence a different standard error. 
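The table's arithmetic can be reproduced in a few lines (a Python sketch, not part of the original slides):

```python
mu, xbar, n = 13, 12, 16      # population mean, observed sample mean, sample size

for s in (4, 2, 1):           # the three sample standard deviations from the table
    se = s / n ** 0.5         # standard error: s / sqrt(n)
    z = (xbar - mu) / se
    print(s, se, z)           # 4 1.0 -1.0, then 2 0.5 -2.0, then 1 0.25 -4.0
```

The same distance from 13 is one, two, or four standard errors away, depending entirely on which `\(s\)` the sample happened to produce.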
If you use `\(t\)` instead of `\(z\)`, you properly account for the added uncertainty that comes from estimating `\(\sigma\)` with `\(s\)` `$$t = \dfrac{\bar{x} - \mu}{s_{\bar{x}}}$$` where `\(s_{\bar{x}} = \dfrac{s}{\sqrt{n}}\)` and `\(t\)` is distributed with `\(n-1\)` degrees of freedom --- ### Various `t` distributions <div class="figure" style="text-align: center"> <img src="module03_files/figure-html/unnamed-chunk-7-1.svg" alt="Student's t versus the Standard Normal z Distribution" width="55%" /> <p class="caption">Student's t versus the Standard Normal z Distribution</p> </div> --- [See this interactive applet](http://rpsychologist.com/d3/tdist/) * .heatinline[.fancy[regardless of the sample size,]] use the `\(t\)` distribution whenever `\(\sigma\)` is unknown and hence `\(s\)` must be used * use the `\(t\)` distribution whenever `\(n < 30\)` even if `\(\sigma\)` is known --- Find the following `\(t\)` score(s): (1) `\(t\)` leaves `\(0.025\)` in the Upper Tail with `\(df=12\)` Answer: `\(t=2.179\)` (2) `\(t\)` leaves `\(0.05\)` in the Lower Tail with `\(df=50\)` Answer: `\(t= -1.676\)` (3) `\(t\)` leaves `\(0.01\)` in the Upper Tail with `\(df=30\)` Answer: `\(t=2.457\)` (4) `\(90\%\)` of the area falls between these `\(t\)` values with `\(df = 25\)` Answer: `\(t= \pm 1.708\)` (5) `\(95\%\)` of the area falls between these `\(t\)` values with `\(df = 45\)` Answer: `\(t= \pm 2.014\)` --- A simple random sample with `\(n=54\)` yielded `\(\bar{x} = 22.5\)` and `\(s=4.4\)`. (1) Calculate the standard error. `$$s_{\bar{x}}=\dfrac{s}{\sqrt{n}}=\dfrac{4.4}{\sqrt{54}}=\dfrac{4.4}{7.34}=0.59$$` (2) What is the `\(90\%\)` confidence interval? `$$\bar{x}\pm t(s_{\bar{x}})=22.5 \pm 1.674(0.59)=22.5 \pm 0.98=21.52; 23.48$$` (3) What is the `\(95\%\)` confidence interval? `$$\bar{x}\pm t(s_{\bar{x}})=22.5 \pm 2.006(0.59)=22.5 \pm 1.18=21.32; 23.68$$` (4) What is the `\(99\%\)` confidence interval?
`$$\bar{x}\pm t(s_{\bar{x}})=22.5 \pm 2.672(0.59)=22.5 \pm 1.57=20.93; 24.07$$` (5) What happens to the margin of error and the width of the interval as we increase how "confident" we want to be? Answer: The margin of error increases and the confidence interval widens --- Continental Airlines' pilots fly on average 49 hours per month. This is based on `\(n=100\)` with `\(s=8.5\)` `\(\therefore s_{\bar{x}}=\dfrac{s}{\sqrt{n}}=\dfrac{8.5}{\sqrt{100}}=\dfrac{8.5}{10}=0.85\)` What is the margin of error at `\(95\%\)` CI? `\(t_{\alpha/2}(s_{\bar{x}})=1.984(0.85)=1.68\)` What is the `\(95\%\)`CI? `\(\bar{x}\pm t_{\alpha/2}(s_{\bar{x}})=49 \pm 1.984(0.85)=49 \pm 1.68=47.32; 50.68\)` --- class: inverse, middle, center # .fat[.fancy[How large a sample do I need?]] --- ### Determining Needed Sample Size Margin of Error `\(=z_{\alpha/2}\left(\dfrac{\sigma}{\sqrt{n}}\right)\)` (1) Select the desired confidence level `\((1-\alpha)\)` (2) Select our best guess of `\(\sigma\)`; if you don't know `\(\sigma\)`, use `\(\sigma \approx \frac{Range}{4}\)` (3) Solve for `\(n\)` in the formula for the Margin of Error `\begin{eqnarray*} E=z_{\alpha/2}\left(\dfrac{\sigma}{\sqrt{n}}\right) \\ \therefore \sqrt{n}(E)= (z_{\alpha/2})(\sigma) \\ \therefore \sqrt{n}=\dfrac{(z_{\alpha/2})(\sigma)}{E} \\ \therefore n = \dfrac{(z_{\alpha/2})^{2}(\sigma)^{2}}{E^{2}} \end{eqnarray*}` --- ## Example 5 The range for a sample is `\(36\)` We don't have `\(\sigma\)` so we calculate `\(\sigma \approx \dfrac{Range}{4}=\dfrac{36}{4}=9\)` + At `\(95\%\)`CI, how large an `\(n\)` would give a margin of error of 3? Given `\((1-\alpha)=0.95; \sigma^{2}=81; E=3; \therefore z=1.96\)`, `\(\therefore n=\dfrac{(z_{\alpha/2})^{2}(\sigma)^{2}}{E^{2}}\)` `\(=\dfrac{(1.96)^{2}(81)}{9}\)` `\(=(1.96)^{2}(9)=34.5744 \approx 35\)` ... `(Note: we are rounding up)` + At `\(95\%\)`CI, how large an `\(n\)` would give a margin of error of 2?
Given `\((1-\alpha)=0.95; \sigma^{2}=81; E=2; \therefore z=1.96\)`, `\(\therefore n=\dfrac{(z_{\alpha/2})^{2}(\sigma)^{2}}{E^{2}}\)` `\(=\dfrac{(1.96)^{2}(81)}{4}\)` `\(=\dfrac{311.1696}{4} = 77.7924 \approx 78\)` ... `(Note: we are rounding up)` --- ## Example 6 The New York Times reports mean Bar Mitzvah costs in New York City to be `\(19,000\)` dollars. Let `\(\sigma = 9400\)`. They also want you to use a `\(95\%\)` CI. (1) Recommend `\(n\)` if the desired margin of error is `\(1000\)` `\(n=\dfrac{(z_{\alpha/2})^{2}(\sigma)^{2}}{E^{2}}\)` `\(=\dfrac{(1.96)^{2}(9400)^{2}}{(1000)^{2}} =(1.96)^{2}(88.36)= 339.4437 \approx 340\)` (2) What if the margin of error desired is `\(500\)`? `\(n=\dfrac{(z_{\alpha/2})^{2}(\sigma)^{2}}{E^{2}}\)` `\(=\dfrac{(1.96)^{2}(9400)^{2}}{(500)^{2}} = (1.96)^{2}(353.44)= 1357.7751 \approx 1358\)` (3) What if it is `\(200\)`? `\(n=\dfrac{(z_{\alpha/2})^{2}(\sigma)^{2}}{E^{2}}\)` `\(=\dfrac{(1.96)^{2}(9400)^{2}}{(200)^{2}} = (1.96)^{2}(2209)= 8486.0944 \approx 8487\)` ... `The closer we want to be to the true value, the larger the sample size needed` --- ### Population Proportion & Sample Size Determination The interval estimate of the population proportion is `\(\bar{p} \pm\)` Margin of Error If `\(np \geq 5\)` and `\(n(1-p) \geq 5\)`, the sampling distribution of `\(\bar{p}\)` is approximately Normal `\(\sigma_{\bar{p}}=\sqrt{\dfrac{{\bar{p}}(1-{\bar{p}})}{n}}\)` The margin of error is `\(E = z_{\alpha/2}\sqrt{\dfrac{\bar{p}(1-\bar{p})}{n}}\)` So the interval estimate for `\(p\)` `\(=\bar{p} \pm z_{\alpha/2}\sqrt{\dfrac{\bar{p}(1-\bar{p})}{n}}\)` `\(n=\dfrac{(z_{\alpha/2})^{2}(p^{*})(1-p^{*})}{E^{2}}; p^{*}=\)` best guess or `\(0.50\)` --- ### Example 7 A simple random sample of `\(n=800\)` generates `\(\bar{p} = 0.70\)` Given `\(n=800; \bar{p}=0.70\)`, `\(\sigma_{\bar{p}}=\sqrt{\dfrac{p(1-p)}{n}} =\sqrt{\dfrac{0.7(0.3)}{800}}=0.0162\)` What is the `\(90\%\)`CI for `\(p\)`?
`\(z_{\alpha/2}=1.645\)`, therefore the 90%CI `\(=0.7 \pm 1.645(0.0162) = 0.7 \pm 0.0266 = 0.6734; 0.7266\)` What is the `\(95\%\)`CI for `\(p\)`? `\(z_{\alpha/2}=1.96\)`, therefore the 95%CI `\(=0.7 \pm 1.96(0.0162) = 0.7 \pm 0.0317 = 0.6683; 0.7317\)` --- ### Example 8 Audience profile data for the ESPN SportsZone website show `\(26\%\)` of users to be women in a sample of `\(n=400\)` users. What is the margin of error at the `\(95\%\)` CI? `\(E = z_{\alpha/2}\sqrt{\dfrac{\bar{p}(1-\bar{p})}{n}} = 1.96\sqrt{\dfrac{0.26(0.74)}{400}} = 0.0429\)` What is the `\(95\%\)` CI? 95%CI `\(=0.26 \pm 0.0429 = 0.2171; 0.3029\)` How large a sample do we need if the desired margin of error is `\(0.03\)`? Given `\(E=0.03\)`, `\(n=\dfrac{(1.96)^{2}(0.26)(0.74)}{0.0009} = 821.2487 \approx 822\)`
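Example 8's computations can be sketched end to end (Python, not part of the original slides); note that the required sample size is always rounded up:

```python
import math

p_bar, n, z = 0.26, 400, 1.96             # Example 8: proportion of women, n, z for 95%

se = math.sqrt(p_bar * (1 - p_bar) / n)   # standard error of p-bar
margin = z * se                           # margin of error at 95% confidence
print(round(margin, 4))                   # 0.043
print(round(p_bar - margin, 4), round(p_bar + margin, 4))  # 0.217 0.303

# Sample size needed for a desired margin of error E = 0.03
E = 0.03
n_needed = z ** 2 * p_bar * (1 - p_bar) / E ** 2
print(math.ceil(n_needed))                # 822 (821.25 rounded up)
```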