Review of hypothesis testing
Overview of some key statistical tests
t-tests
χ2 (chi-square)
Overview of some basic regression models
Hypothesis testing is an inferential procedure that uses sample data to evaluate the credibility of a hypothesis about a population parameter. The process involves ...
Stating a hypothesis: an assumption that can neither be fully proven nor fully disproven. For example,
Not more than 5% of GM trucks break down in under 10,000 miles
Heights of North American adult males are distributed with μ=72 inches
Mean year-round temperature in Athens (OH) is >62
10% of Ohio teachers are Accomplished
Mean county unemployment rate is 12.1%
Drawing a sample to test the hypothesis
Conducting the test itself to see if the hypothesis should be rejected
The Null and the Alternative Hypotheses
Null Hypothesis: (H0) is the assumption believed to be true
Alternative Hypothesis: (Ha) is the statement believed to be true if (H0) is rejected
H0: μ>72 inches; H1: μ≤72
H0: μ<72 inches; H1: μ≥72
H0: μ≤72 inches; H1: μ>72
H0: μ≥72 inches; H1: μ<72
H0: μ=72 inches; H1: μ≠72
H0 and H1 are mutually exclusive and mutually exhaustive
Mutually Exclusive: Either H0 or H1 is true; both cannot be true at the same time
Mutually Exhaustive: H0 and H1 exhaust the Sample Space; there are no other possibilities unknown to us
Type I and Type II Errors
Decision based on Sample | Null is true | Null is false |
---|---|---|
Reject the Null | Type I error | No error |
Do not reject the Null | No error | Type II error |
Type I Error:
Rejecting the Null hypothesis H0 when H0 is true
i.e., we should not have rejected the Null
Type II Error:
Failing to reject the Null hypothesis H0 when H0 is false
i.e., we should have rejected the Null
The probability of committing a Type I error = Level of Significance =α
The probability of committing a Type II error = β
The power of the test is (1−β)
We have to decide how often we want to make a Type I error (i.e., falsely Reject H0). Conventionally we set this rate to one of the following α values:
α=0.05 or α=0.01
Note the very cautious language ... Reject H0 versus Do Not Reject H0
Assume we want to know whether the roundabout on SR682 has had an impact on traffic accidents in Athens. We have historical data on the number of accidents in years past. Say the average per day used to be 6 (i.e., μ0=6). To see if the roundabout has had an impact we could gather accident data for a random sample of 100 days (n=100) from the period after the roundabout was built.
Before we do that though, we will need to specify our hypotheses. What do we think might be the impact? Let us say the City Engineer argues that the roundabout should have decreased accidents.
If he is correct then the sample mean ¯x should be less than the population mean μ0 i.e., ¯x<μ0
If he is wrong then the sample mean ¯x should be at least as much as the population mean μ0 i.e., ¯x≥μ0
We know from the theory of sampling distributions that the distribution of sample means, for all samples of size n, will be approximately normally distributed.
Most samples would be in the middle of the distribution, but by sheer chance we could end up with a sample mean in the tails. This will happen with a very small probability, but it could happen!
If we believe the City Engineer, we would set up the hypotheses as follows:
H0: μ≥μ0; H1: μ<μ0
If the area in the left tail beyond the calculated t is very small then we can conclude that the roundabout must have worked to reduce accidents
How should we define very small? By setting α either to 0.05 or to 0.01
We Reject H0 if P(tcalculated)≤α; the data provide sufficient evidence to conclude that the roundabout has reduced accidents
If P(tcalculated)>α then we Fail to reject H0; the data provide insufficient evidence to conclude that the roundabout has reduced accidents
Reject H0 if calculated t falls in the green region (i.e., calculated t≤−1.6603)
Do Not Reject H0 if calculated t falls in the grey region (i.e., calculated t>−1.6603)
If instead we only suspect that the roundabout has had some impact, in either direction, we would set up the hypotheses as follows:
H0: μ=μ0; H1: μ≠μ0
If the area in the two tails beyond ±calculated t is very small then we can conclude that the roundabout must have had an impact on accidents
How should we define very small? By setting α either to 0.05 or to 0.01
We can then Reject H0 if P(±tcalculated)≤α; the data provide sufficient evidence to conclude that the roundabout has had an impact on accidents
If P(±tcalculated)>α then we will Fail to reject H0; the data provide insufficient evidence to conclude that the roundabout has had an impact on accidents
Reject H0 if calculated t falls in the green region (i.e., calculated t≤−1.98 or calculated t≥1.98)
Do Not Reject H0 if calculated t falls in the grey region (i.e., −1.98<calculated t<1.98)
State the hypotheses
If the claim is that something has changed, or is different, or had an impact, etc. then H0 must specify that nothing has changed … H0: μ=μ0; H1: μ≠μ0 … two-tailed
If the claim is that something has increased, or has risen, or is more then H0 must specify that it has not increased … H0: μ≤μ0; H1: μ>μ0 … one-tailed
If the claim is that something has decreased, or has reduced, or is less then H0 must specify that it has not decreased … H0: μ≥μ0; H1: μ<μ0 … one-tailed
Collect the sample and set α=0.05 or α=0.01
Calculate the t
Reject H0 if calculated t falls in the critical region; do not reject H0 otherwise
Type I Error: You rejected H0 but it should not have been rejected (level of significance = α)
Type II Error: You failed to reject H0 but it should have been rejected
[From Harrell & Slaughter] We want to test if the mean tumor volume is 190 mm³ in a population with melanoma
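$H_0: \mu = 190$ versus $H_1: \mu \neq 190$
$\bar{x} = 181.52,\; s = 40,\; n = 100,\; \mu_0 = 190$
$s_{\bar{x}} = \dfrac{s}{\sqrt{n}} = \dfrac{40}{\sqrt{100}} = 4$
$t = \dfrac{\bar{x} - \mu_0}{s_{\bar{x}}} = \dfrac{181.52 - 190}{4} = -2.12$
p-value = 0.037, leading us to reject H0 if α=0.05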
The data do not conform to the pattern predicted by the null hypothesis.
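The arithmetic of the tumor-volume example can be sketched in a few lines of Python. This is a minimal illustration, not the original analysis: the function name `one_sample_t` is made up here, and the decision uses the two-tailed cutoff of 1.98 quoted earlier for α=0.05 with n=100 rather than computing a p-value.

```python
import math

def one_sample_t(xbar, mu0, s, n):
    """Return (standard error, t statistic) for a one-sample t-test."""
    se = s / math.sqrt(n)          # estimated standard error of the mean
    return se, (xbar - mu0) / se   # how many standard errors xbar lies from mu0

# Summary statistics from the tumor-volume example
se, t = one_sample_t(xbar=181.52, mu0=190, s=40, n=100)

# Two-tailed critical value quoted in the slides for alpha = 0.05 and n = 100
t_critical = 1.98
reject_h0 = abs(t) >= t_critical

print(f"se = {se:.2f}, t = {t:.2f}, reject H0: {reject_h0}")
```

Because |t| = 2.12 exceeds 1.98, the calculated t falls in the rejection region, which agrees with the p-value of 0.037 being below α=0.05.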
[From Harrell & Slaughter] To investigate the relationship between smoking and bone mineral density, Rosner presented a paired analysis in which each person had a nearly perfect control which was his or her twin. Data were normalized by dividing differences by the mean density in the twin pair. Computed density in heavier smoking twin minus density in lighter smoking one.
Mean difference was −5% and standard error was 2%, with n=41
H0: mean difference = 0 versus H1: mean difference ≠ 0
$t = \dfrac{-5 - 0}{2} = -2.5$
p-value = 0.0166
The data do not conform to the pattern predicted by the null hypothesis.
Separate Parent Populations
We often need to compare sample means across two groups. For example, are average earnings the same for men and women in a specific occupation? Perhaps we suspect (a) women are underpaid or (generally) that (b) their salaries differ from those of men.
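Let the population and sample means be $\mu_m, \mu_w$ and $\bar{x}_m, \bar{x}_w$, respectively
(a) $H_0: \mu_m \le \mu_w$ and $H_1: \mu_m > \mu_w$, $\therefore H_0: \mu_m - \mu_w \le 0$ and $H_1: \mu_m - \mu_w > 0$
(b) $H_0: \mu_m = \mu_w$ and $H_1: \mu_m \neq \mu_w$, $\therefore H_0: \mu_m - \mu_w = 0$ and $H_1: \mu_m - \mu_w \neq 0$
Standard Error of the difference in means: $s_{\bar{x}_m - \bar{x}_w} = \sqrt{\dfrac{s_m^2}{n_m} + \dfrac{s_w^2}{n_w}}$
Confidence Interval estimate: $(\bar{x}_m - \bar{x}_w) \pm t_{\alpha/2}\left(s_{\bar{x}_m - \bar{x}_w}\right)$
The Test Statistic: $t = \dfrac{(\bar{x}_m - \bar{x}_w) - (\mu_m - \mu_w)}{\sqrt{\dfrac{s_m^2}{n_m} + \dfrac{s_w^2}{n_w}}} = \dfrac{(\bar{x}_m - \bar{x}_w) - D_0}{\sqrt{\dfrac{s_m^2}{n_m} + \dfrac{s_w^2}{n_w}}}$
The degrees of freedom for this test: $df = \dfrac{\left(\dfrac{s_m^2}{n_m} + \dfrac{s_w^2}{n_w}\right)^2}{\dfrac{1}{n_m - 1}\left(\dfrac{s_m^2}{n_m}\right)^2 + \dfrac{1}{n_w - 1}\left(\dfrac{s_w^2}{n_w}\right)^2}$
Note: We usually round down the df to the nearest integer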
We have two ways of calculating the estimated standard error ($s_{\bar{x}_1 - \bar{x}_2}$) and the degrees of freedom ($df$)
(1) When the population variances are assumed unequal
Standard Error will be:
$s_{\bar{x}_1 - \bar{x}_2} = \sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}$
Degrees of Freedom will be:
$df = \dfrac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{1}{n_1 - 1}\left(\dfrac{s_1^2}{n_1}\right)^2 + \dfrac{1}{n_2 - 1}\left(\dfrac{s_2^2}{n_2}\right)^2}$
Use this when n1 or n2 are <30 and either sample has a standard deviation at least twice that of the other sample
(2) When the population variances are assumed equal
Standard Error will be:
$s_{\bar{x}_1 - \bar{x}_2} = \sqrt{\dfrac{n_1 + n_2}{n_1 \times n_2}} \sqrt{\dfrac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{(n_1 + n_2) - 2}}$
Degrees of Freedom will be:
$df = (n_1 + n_2) - 2$
Use when the standard deviations are roughly equal, and n1 and n2 ≥30
[From Harrell & Slaughter] Two soporific drugs are to be tested, Drug 1 versus Drug 2. Which of these is more effective?
H0: μ1=μ2 versus H1: μ1≠μ2
Assuming unequal variances: t=−1.8608, df=17.776, p−value=0.07939 so fail to reject H0
Assuming equal variances: t=−1.8608, df=18, p−value=0.07919 so fail to reject H0
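Both ways of computing the standard error and the degrees of freedom can be sketched in Python. The slides do not list the raw data; the reported t and df match what R prints for its built-in `sleep` data set (Student's sleep data), so the sketch assumes those are the values behind the example. That is an inference, not something the source states, and the function names are made up for illustration.

```python
import math
from statistics import mean, variance

# ASSUMPTION: extra hours of sleep from R's built-in `sleep` data set;
# the slides do not show the raw values.
drug1 = [0.7, -1.6, -0.2, -1.2, -0.1, 3.4, 3.7, 0.8, 0.0, 2.0]
drug2 = [1.9, 0.8, 1.1, 0.1, -0.1, 4.4, 5.5, 1.6, 4.6, 3.4]

def unequal_variances(x1, x2):
    """t statistic and Welch degrees of freedom (variances assumed unequal)."""
    n1, n2 = len(x1), len(x2)
    v1, v2 = variance(x1) / n1, variance(x2) / n2   # s^2 / n for each group
    se = math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return (mean(x1) - mean(x2)) / se, df

def equal_variances(x1, x2):
    """t statistic and degrees of freedom using the pooled variance."""
    n1, n2 = len(x1), len(x2)
    pooled = ((n1 - 1) * variance(x1) + (n2 - 1) * variance(x2)) / (n1 + n2 - 2)
    se = math.sqrt((n1 + n2) / (n1 * n2)) * math.sqrt(pooled)
    return (mean(x1) - mean(x2)) / se, n1 + n2 - 2

t_welch, df_welch = unequal_variances(drug1, drug2)
t_pooled, df_pooled = equal_variances(drug1, drug2)
print(f"unequal variances: t = {t_welch:.4f}, df = {df_welch:.3f}")
print(f"equal variances:   t = {t_pooled:.4f}, df = {df_pooled}")
```

With equal group sizes the two standard errors coincide, which is why the t statistic is the same under both assumptions and only the degrees of freedom differ.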
Given a certain number of independent trials (n) with an identical probability (p) of success (X) in each trial we can easily calculate the probability of seeing a specific number of successes.
For example, if I flip a coin 10 times, where X=Head with p=0.5 then what is probability of seeing exactly 2 heads, exactly 4 heads, 7 heads, etc.?
The answer is easily calculated as:
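$P[X \text{ successes}] = \dbinom{n}{X} p^{X} (1 - p)^{n - X}$ where $\dbinom{n}{X} = \dfrac{n!}{X!\,(n - X)!}$ and $n! = n \times (n - 1) \times (n - 2) \times \cdots \times 2 \times 1$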
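As a quick check of the formula, the coin-flip probabilities can be computed with Python's standard library. This is a small sketch; `binom_pmf` is a name chosen here for illustration.

```python
from math import comb

def binom_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

# Ten flips of a fair coin: probability of exactly 2, 4, and 7 heads
for heads in (2, 4, 7):
    print(f"P[{heads} heads] = {binom_pmf(heads, n=10, p=0.5):.4f}")
```

The probabilities over all possible outcomes (0 through 10 heads) sum to 1, which is a useful sanity check on any implementation.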
Assume that they are just as likely to have boys as girls. This then generates the following hypotheses:
H0: Radiologists are just as likely to have sons as daughters (p=0.5)
H1: Radiologists are not as likely to have sons as daughters (p≠0.5)
Let α=0.05
The p−value=0.005014 so we can reject H0; the data provide sufficient evidence to conclude that radiologists are not as likely to have sons as daughters.
What if we suspected, a priori that radiologists are less likely to have sons? In that case we would have done the following:
H0: Radiologists are at least as likely to have sons as daughters (p≥0.5)
HA: Radiologists are less likely to have sons than daughters (p<0.5)
Here the p−value=0.002507 and again we can easily reject H0; the data provide sufficient evidence to conclude that radiologists are less likely to have sons than daughters.
The χ2 distribution is used with multinomial data (i.e., when the categorical variable has more than two categories) to test whether observed frequency counts differ from expected frequency counts.
H0: Proportions are all the same
HA: Proportions are not all the same
$\chi^2 = \sum_i \dfrac{(\text{Observed}_i - \text{Expected}_i)^2}{\text{Expected}_i}$
χ2 is distributed with (no. of categories−1) degrees of freedom (df)
Reject H0 if p−value≤α; Do not reject H0 otherwise
As df increases you need a larger χ2 to Reject H0 at the same α
The plot shows how the theoretical χ2 distribution varies with the degrees of freedom. As the distribution shifts right the degrees of freedom are getting larger. The first is for 3 degrees of freedom, which means we have a total of 4 categories, then we have 4 degrees of freedom (so 5 categories), then 5 degrees of freedom (so 6 categories), and finally 6 degrees of freedom (i.e., 7 categories). Note what happens: the more the degrees of freedom, the larger the χ2 value needed to reject H0 with α=0.05 or α=0.01.
The test is built on two assumptions:
Assume that there are four gourmet meats placed before 100 subjects. In a blind taste test each subject is asked to pick the item they liked the most. Do subjects exhibit indifference between the four items? If they do, then we would expect about 25% to pick item A, 25% to pick item B, 25% to pick item C, and 25% to pick item D. If H0 were true then expected frequencies would be as follows for A, B, C, and D:
## [1] 25 25 25 25
Now assume that the 100 subjects actually indicated the following preferences for A, B, C, and D:
## [1] 30 10 40 20
Some of the difference between the observed frequencies and the frequencies expected if H0 were true (i.e., there were no clear preferences) could be due to chance. We therefore test whether the overall difference is large enough that it would rarely arise by chance alone. In other words, do the data suggest that subjects prefer some items over the others? We will set α=0.05 and then conduct the test.
observed | expected |
---|---|
30 | 25 |
10 | 25 |
40 | 25 |
20 | 25 |
##
##  Chi-squared test for given probabilities
##
## data:  observed
## X-squared = 20, df = 3, p-value = 0.0001697
Given that the p−value is less than α=0.05 we can reject the H0 that the items are equally preferred. That is, the data suggest that some items are preferable to others.
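The taste-test calculation can be reproduced by hand in Python. This is a sketch using only the standard library; the p-value uses a closed-form expression for the upper tail of the χ2 distribution that holds only for 3 degrees of freedom, an assumption that fits this example but not other tables.

```python
import math

observed = [30, 10, 40, 20]                        # picks for items A, B, C, D
n = sum(observed)
expected = [n / len(observed)] * len(observed)     # 25 each under H0

# Sum of (Observed - Expected)^2 / Expected over the four categories
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1

def chi2_upper_tail_df3(x):
    """P(chi-square > x); this closed form is valid ONLY for df = 3."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)

p_value = chi2_upper_tail_df3(chi2)
print(f"X-squared = {chi2:.0f}, df = {df}, p-value = {p_value:.7f}")
```

For other degrees of freedom a statistics library would normally supply the tail probability; the closed form is used here only to keep the sketch dependency-free.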
Often you are interested in testing for a relationship between two categorical variables
Calculate, for each cell in the contingency table, $\dfrac{(f_{ij} - e_{ij})^2}{e_{ij}}$
Add the resulting values over all cells. This yields $\chi^2 = \sum_i \sum_j \dfrac{(f_{ij} - e_{ij})^2}{e_{ij}}$
$\chi^2$ is distributed with $df = (r - 1)(c - 1)$, where $r$ = number of rows and $c$ = number of columns
If you have a 2×2 table or small samples then Fisher's Exact test may be preferable
It has been hypothesized that the white rump of pigeons serves to distract predators like the peregrine falcons, and therefore it may be an adaptation to reduce predation. To test this, researchers followed the fate of 203 pigeons, 101 with white rumps and 102 with blue rumps. Nine of the white-rumped birds and 92 of the blue-rumped birds were killed by falcons.
##
##            blue white Sum
##   killed     92     9 101
##   survived   10    92 102
##   Sum       102   101 203
##
##                  blue      white
##   killed   0.91089109 0.08910891
##   survived 0.09803922 0.90196078
The table shows that 91% of the pigeons killed were blue-rumped and only 9% were white-rumped. Put differently, 92 of the 102 blue-rumped pigeons (about 90%) were killed versus only 9 of the 101 white-rumped pigeons (about 9%).
Do the two kinds of pigeons differ in their rate of capture by falcons? Carry out an appropriate test.
H0: Predation by falcons is independent of rump-color
HA: Predation by falcons is not independent of rump-color
α=0.05
##
##  Pearson's Chi-squared test
##
## data:  tab.p
## X-squared = 134.13, df = 1, p-value < 2.2e-16
##
##  Fisher's Exact Test for Count Data
##
## data:  tab.p
## p-value < 2.2e-16
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##   33.70502 270.81972
## sample estimates:
## odds ratio
##   89.50586
Regardless of the test used, we can safely Reject H0 given that the p−value≈0. The data suggest that predation by falcons is not independent of rump-color.
Sex at Birth | Light | Regular | Dark | Total |
---|---|---|---|---|
Male | 20 | 40 | 20 | 80 |
Female | 30 | 30 | 10 | 70 |
Total | 50 | 70 | 30 | 150 |
Research Question:
Are coffee preferences independent of sex at birth (i.e., is there any association between coffee preferences and gender)?
H0: Coffee preference is independent of sex at birth H1: Coffee preference is not independent of sex at birth
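For each cell in the contingency table, calculate
$e_{ij} = \dfrac{\text{Row } i \text{ Total} \times \text{Column } j \text{ Total}}{\text{Sample Size}}$
$e_{11} = \dfrac{(80)(50)}{150} = \dfrac{4000}{150} = 26.67$
$e_{12} = \dfrac{(80)(70)}{150} = \dfrac{5600}{150} = 37.33$
$e_{13} = \dfrac{(80)(30)}{150} = \dfrac{2400}{150} = 16.00$
$e_{21} = \dfrac{(70)(50)}{150} = \dfrac{3500}{150} = 23.33$
$e_{22} = \dfrac{(70)(70)}{150} = \dfrac{4900}{150} = 32.67$
$e_{23} = \dfrac{(70)(30)}{150} = \dfrac{2100}{150} = 14.00$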
Sex at birth | Coffee | fi | ei | (fi−ei) | (fi−ei)2 | (fi−ei)2/ei |
---|---|---|---|---|---|---|
Male | Light | 20 | 26.67 | -6.67 | 44.49 | 1.67 |
Male | Regular | 40 | 37.33 | 2.67 | 7.13 | 0.19 |
Male | Dark | 20 | 16.00 | 4.00 | 16.00 | 1.00 |
Female | Light | 30 | 23.33 | 6.67 | 44.49 | 1.91 |
Female | Regular | 30 | 32.67 | -2.67 | 7.13 | 0.22 |
Female | Dark | 10 | 14.00 | -4.00 | 16.00 | 1.14 |
χ2 | | | | | | 6.13 |
df=(r−1)(c−1)=(2−1)(3−1)=(1)(2)=2
p−value<0.05; Reject H0
Coffee preferences and gender are not independent
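The expected counts and the χ2 statistic for a contingency table can be computed with a short helper. This is a sketch with a made-up function name, `chi2_independence`; it applies no continuity correction, and the p-value line relies on a closed form that holds only for 2 degrees of freedom.

```python
import math

def chi2_independence(table):
    """Chi-square statistic and df for an r x c table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, f in enumerate(row):
            e = row_totals[i] * col_totals[j] / n   # expected count e_ij
            chi2 += (f - e) ** 2 / e
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df

# Coffee preference by sex at birth (rows: Male, Female; columns: Light, Regular, Dark)
coffee = [[20, 40, 20], [30, 30, 10]]
chi2_coffee, df_coffee = chi2_independence(coffee)
p_coffee = math.exp(-chi2_coffee / 2)   # upper-tail area, valid ONLY for df = 2

# Pigeon fate by rump color (rows: killed, survived; columns: blue, white)
pigeons = [[92, 9], [10, 92]]
chi2_pigeons, df_pigeons = chi2_independence(pigeons)

print(f"coffee:  X-squared = {chi2_coffee:.2f}, df = {df_coffee}, p-value = {p_coffee:.4f}")
print(f"pigeons: X-squared = {chi2_pigeons:.2f}, df = {df_pigeons}")
```

Without intermediate rounding the coffee statistic comes out at about 6.12 rather than the 6.13 in the worked table, which sums cell values already rounded to two decimals; the conclusion is unchanged.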
y=α+β(x)
Infant Mortality=α+β(Income)
Infant Mortality=53.23−0.0016(Income)
As Income increases by 1 US Dollar, Infant Mortality drops by 0.0016
If Income rises by 1000 US Dollars, Infant Mortality drops by 0.0016(1000)=1.6
What if a country has Income of 20,000? What would be the predicted Infant Mortality?
Infant Mortality=53.23−0.0016(Income)=53.23−0.0016(20000)=53.23−32=21.23
How good is this linear regression?
(1) Income is a significant predictor ... p−value=0.000000107
(2) Adjusted R2=0.1884 ... This linear regression model explains about 18.84% of the variation in infant mortality
(3) Root Mean Squared Error =35.09 ... If we use this regression model to predict infant mortality, the average prediction error would be ±35.09 infant deaths
A high Adjusted R2, a small Root Mean Squared Error, and a statistically significant predictor variable are desirable
y=α+β1(x1)+β2(x2)+…+βk(xk)
y=α+β1(Income)+β2(Female Youth Literacy Rate)+β3(% in Urban Areas)
##
## Call:
## lm(formula = U5MR ~ Income + FemaleYouthLR + Urban, data = sowc)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -41.570 -13.036  -4.449   4.948  94.270
##
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)
## (Intercept)    1.810e+02  1.011e+01  17.901   <2e-16 ***
## Income        -3.374e-04  2.411e-04  -1.399   0.1642
## FemaleYouthLR -1.418e+00  1.236e-01 -11.467   <2e-16 ***
## Urban         -2.686e-01  1.213e-01  -2.214   0.0286 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23.22 on 129 degrees of freedom
## Multiple R-squared:  0.6525, Adjusted R-squared:  0.6444
## F-statistic: 80.74 on 3 and 129 DF,  p-value: < 2.2e-16
Holding all else constant, as Female Youth Literacy Rate rises by a unit amount infant mortality drops by 1.418
Holding all else constant, as the % of the population living in Urban areas rises by a unit amount infant mortality drops by 0.2686
Income does not appear to have a statistically significant impact on infant mortality
The model "explains" 64.44% of the variation in infant mortality
Average prediction error would be 23.22
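To make "holding all else constant" concrete, the fitted equation can be evaluated for two countries that differ only in female youth literacy. This is a sketch: the coefficients are the rounded estimates printed in the regression output, while the input values (income of 20,000, literacy of 90, 50% urban) are hypothetical and chosen purely for illustration.

```python
# Rounded coefficient estimates copied from the regression output
COEF = {
    "Intercept": 181.0,
    "Income": -3.374e-04,
    "FemaleYouthLR": -1.418,
    "Urban": -0.2686,
}

def predict_u5mr(income, female_youth_lr, urban):
    """Predicted mortality from the fitted multiple regression."""
    return (COEF["Intercept"]
            + COEF["Income"] * income
            + COEF["FemaleYouthLR"] * female_youth_lr
            + COEF["Urban"] * urban)

# HYPOTHETICAL country; these inputs do not come from the data set
base = predict_u5mr(income=20000, female_youth_lr=90, urban=50)
# Same country with literacy one unit higher, everything else unchanged
more_literate = predict_u5mr(income=20000, female_youth_lr=91, urban=50)

print(f"baseline prediction:        {base:.3f}")
print(f"literacy one unit higher:   {more_literate:.3f}")
print(f"change in predicted value:  {more_literate - base:.3f}")
```

The change between the two predictions equals the FemaleYouthLR coefficient, which is exactly what the "holding all else constant" interpretation says.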
Biased sample (analysis will be worthless)
Incorrect functional form ($y = \alpha + \beta\left(\dfrac{1}{x^3}\right)$ but you fit $y = \alpha + \beta(x)$)
Influential outliers (an extremely unusual data point may influence the linear regression line)
Heteroscedastic errors (discretionary spending of poor families will have little variance while that of wealthy families will have more variance)
Correlated errors (people in the same poor neighborhood are more likely to share adverse health outcomes)
Measurement error (measurement error in the independent variables)
High Multicollinearity (income and discretionary spending tend to be highly correlated so you cannot control for one while increasing the other by a unit amount)
Non-normal distribution of errors (you forgot to include some independent variables or have the wrong functional form or both)
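Goal is to model the probability of survival, drug impact, birth of a girl child, tumor shrinking, and so on
$Y_i \in \{0, 1\},\; i = 1, \ldots, N$
$P(Y_i = 1 \mid X_i) = \Phi\left(\sum_k b_k X_{ik}\right)$ ... Probit
$P(Y_i = 1 \mid X_i) = \dfrac{\exp\left(\sum_k b_k X_{ik}\right)}{1 + \exp\left(\sum_k b_k X_{ik}\right)}$ ... Logit
$Y_1, Y_2, \ldots, Y_N$ are statistically independent
The $X_{ik}$s are not exactly or nearly linearly dependent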
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.25 | | S |
2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S |
##
## Call:
## glm(formula = Survived ~ Fare, family = binomial(link = "logit"),
##     data = mydf2)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -2.4558  -0.8985  -0.8625   1.3461   1.5344
##
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.911664   0.095817  -9.515  < 2e-16 ***
## Fare         0.014741   0.002219   6.644 3.06e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 1171.1  on 875  degrees of freedom
## Residual deviance: 1105.7  on 874  degrees of freedom
## AIC: 1109.7
##
## Number of Fisher Scoring iterations: 4
##
## Call:
## glm(formula = Survived ~ Fare + Female, family = binomial(link = "logit"),
##     data = mydf2)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -2.1991  -0.6277  -0.5872   0.8123   1.9237
##
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept)  0.652604   0.148535   4.394 1.11e-05 ***
## Fare         0.011033   0.002289   4.820 1.44e-06 ***
## FemaleMale  -2.408824   0.170896 -14.095  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 1171.07  on 875  degrees of freedom
## Residual deviance:  876.04  on 873  degrees of freedom
## AIC: 882.04
##
## Number of Fisher Scoring iterations: 4
Survived=f(Fare)
## Bootstrapped (25 reps) Confusion Matrix
##
## (entries are percentual average cell counts across resamples)
##
##           Reference
## Prediction    0    1
##          0 56.5 29.9
##          1  4.1  9.5
##
##  Accuracy (average) : 0.6599
Survived=f(Fare,Female)
## Bootstrapped (25 reps) Confusion Matrix
##
## (entries are percentual average cell counts across resamples)
##
##           Reference
## Prediction    0    1
##          0 51.4 11.7
##          1 10.2 26.8
##
##  Accuracy (average) : 0.7814
Males had a lower probability of survival than females, on average, and holding all else constant.
Wealthier passengers had a higher probability of survival than other passengers, on average, and holding all else constant.
Reference individual is a woman who paid the Median Fare. Compared to this individual,
(a) Probability that the average Male who paid the same Median Fare of 14.4542 survived = 0.1660
(b) Probability that the average Female who paid the same Median Fare of 14.4542 survived = 0.6919
(c) Probability of survival was 0.6760 for the average Female who paid 7.91 (Q1) and 0.7300 if she paid 31.00 (Q3)
(d) Probability of survival was 0.1561 for the average Male who paid Q1 Fare and 0.1934 if he paid Q3 Fare
Fare | Male Probability | Female Probability |
---|---|---|
7.91 | 0.1561 | 0.6760 |
14.45 | 0.1660 | 0.6910 |
31.00 | 0.1934 | 0.7300 |
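Predicted probabilities of this kind come from passing the linear predictor through the logit formula. The sketch below plugs the rounded coefficients printed for the Survived ~ Fare + Female model into that formula; because the slides do not say how the tabulated probabilities were produced, the results land close to the table (within about 0.003) but are not guaranteed to match it digit for digit. The function name `p_survive` is made up for illustration.

```python
import math

# Rounded estimates from the printed output of Survived ~ Fare + Female
B_INTERCEPT = 0.652604
B_FARE = 0.011033
B_MALE = -2.408824    # coefficient labelled "FemaleMale" in the output

def p_survive(fare, male):
    """Logit model: P(Survived = 1) = exp(eta) / (1 + exp(eta))."""
    eta = B_INTERCEPT + B_FARE * fare + B_MALE * (1 if male else 0)
    return 1.0 / (1.0 + math.exp(-eta))   # algebraically the same as exp(eta)/(1+exp(eta))

# Q1, median, and Q3 fares quoted in the slides
for fare in (7.91, 14.4542, 31.00):
    print(f"fare {fare:>8.4f}: male = {p_survive(fare, male=True):.4f}, "
          f"female = {p_survive(fare, male=False):.4f}")
```

The pattern in the table is reproduced: at every fare the predicted probability is far higher for women than for men, and it rises modestly with fare for both.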