library(HistData)
data(Galton); names(Galton); summary(Galton)
## [1] "parent" "child"
## parent child
## Min. :64.00 Min. :61.70
## 1st Qu.:67.50 1st Qu.:66.20
## Median :68.50 Median :68.20
## Mean :68.31 Mean :68.09
## 3rd Qu.:69.50 3rd Qu.:70.20
## Max. :73.00 Max. :73.70
Galton (1886) presented these data in a table, showing a cross-tabulation of 928 adult children born to 205 fathers and mothers, by their height and their mid-parent’s height. He visually smoothed the bivariate frequency distribution and showed that the contours formed concentric and similar ellipses, thus setting the stage for correlation, regression and the bivariate normal distribution.
library(car)
scatterplot(child ~ parent, data=Galton, pch=16, ellipse=FALSE, levels=c(0.5, 0.95), reg.line=lm, smoother=FALSE, xlim=c(63,75), ylim=c(63,75))
Draw four random samples of parent-child pairs from the Galton data. The sample size is set to 40 in the commands below, so you can tweak that number to get larger or smaller samples.
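These draws are random, so your samples (and all the sample regressions below) will differ from run to run. To make them reproducible you could fix the random seed first; the seed value here is arbitrary:
set.seed(42)  # any fixed seed makes the sample() calls below reproducible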
randomSample <- function(df, n) {
  df[sample(nrow(df), n), ]  # pick n rows of df at random, keeping all columns
}
sample1 <- randomSample(Galton, 40)
sample2 <- randomSample(Galton, 40)
sample3 <- randomSample(Galton, 40)
sample4 <- randomSample(Galton, 40)
Graphing the four samples as scatterplots to see how they look:
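The plotting code is not shown here; a minimal base-graphics sketch would be:
par(mfrow = c(2, 2))  # arrange the four plots in a 2 x 2 grid
plot(child ~ parent, data = sample1, pch = 16, main = "Sample 1")
plot(child ~ parent, data = sample2, pch = 16, main = "Sample 2")
plot(child ~ parent, data = sample3, pch = 16, main = "Sample 3")
plot(child ~ parent, data = sample4, pch = 16, main = "Sample 4")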
Now add the regression line estimated in each sample to the corresponding scatterplot:
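Again the code is not shown; one sketch that redraws each scatterplot and overlays its fitted line:
par(mfrow = c(2, 2))
samples <- list(sample1, sample2, sample3, sample4)
for (i in 1:4) {
  plot(child ~ parent, data = samples[[i]], pch = 16, main = paste("Sample", i))
  abline(lm(child ~ parent, data = samples[[i]]), col = "firebrick")  # sample regression line
}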
Estimating and seeing the results of the Population Regression Function for Galton’s data:
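The estimation command itself is not shown; judging from the Call line below and from the stargazer call further down (which names the object model4), it was along these lines:
model4 <- lm(Galton$child ~ Galton$parent)  # population regression on all 928 children
summary(model4)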
##
## Call:
## lm(formula = Galton$child ~ Galton$parent)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.8050 -1.3661 0.0487 1.6339 5.9264
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.94153 2.81088 8.517 <2e-16 ***
## Galton$parent 0.64629 0.04114 15.711 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.239 on 926 degrees of freedom
## Multiple R-squared: 0.2105, Adjusted R-squared: 0.2096
## F-statistic: 246.8 on 1 and 926 DF, p-value: < 2.2e-16
Estimating and seeing the results of the Sample Regression Function in each of the four samples:
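Again inferring the commands from the Call lines below and from the object names in the stargazer call:
model4.1 <- lm(sample1$child ~ sample1$parent); summary(model4.1)
model4.2 <- lm(sample2$child ~ sample2$parent); summary(model4.2)
model4.3 <- lm(sample3$child ~ sample3$parent); summary(model4.3)
model4.4 <- lm(sample4$child ~ sample4$parent); summary(model4.4)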
##
## Call:
## lm(formula = sample1$child ~ sample1$parent)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.006 -1.541 -0.006 1.110 4.852
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.9150 11.5669 3.191 0.00284 **
## sample1$parent 0.4646 0.1681 2.764 0.00875 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.044 on 38 degrees of freedom
## Multiple R-squared: 0.1674, Adjusted R-squared: 0.1455
## F-statistic: 7.64 on 1 and 38 DF, p-value: 0.008754
##
## Call:
## lm(formula = sample2$child ~ sample2$parent)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.7063 -1.1948 0.2822 0.9388 4.8281
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.8696 13.3761 2.532 0.0156 *
## sample2$parent 0.5115 0.1968 2.599 0.0132 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.173 on 38 degrees of freedom
## Multiple R-squared: 0.1509, Adjusted R-squared: 0.1286
## F-statistic: 6.756 on 1 and 38 DF, p-value: 0.01323
##
## Call:
## lm(formula = sample3$child ~ sample3$parent)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.0114 -0.8110 -0.0114 1.1556 3.9886
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.8192 12.1260 2.459 0.01859 *
## sample3$parent 0.5668 0.1780 3.185 0.00289 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.992 on 38 degrees of freedom
## Multiple R-squared: 0.2107, Adjusted R-squared: 0.1899
## F-statistic: 10.14 on 1 and 38 DF, p-value: 0.002891
##
## Call:
## lm(formula = sample4$child ~ sample4$parent)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.1612 -1.3282 0.5878 1.3803 5.1161
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 41.4591 14.9982 2.764 0.00875 **
## sample4$parent 0.3887 0.2213 1.756 0.08715 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.353 on 38 degrees of freedom
## Multiple R-squared: 0.07505, Adjusted R-squared: 0.05071
## F-statistic: 3.083 on 1 and 38 DF, p-value: 0.08715
library(stargazer)
stargazer(model4, model4.1, model4.2, model4.3, model4.4, type = "html",
          column.labels = c("Population", "Sample 1", "Sample 2", "Sample 3", "Sample 4"),
          digits = 2)
Dependent variable: child

|                     | Population (1) | Sample 1 (2) | Sample 2 (3) | Sample 3 (4) | Sample 4 (5) |
|---------------------|----------------|--------------|--------------|--------------|--------------|
| parent              | 0.65*** (0.04) | 0.46*** (0.17) | 0.51** (0.20) | 0.57*** (0.18) | 0.39* (0.22) |
| Constant            | 23.94*** (2.81) | 36.92*** (11.57) | 33.87** (13.38) | 29.82** (12.13) | 41.46*** (15.00) |
| Observations        | 928 | 40 | 40 | 40 | 40 |
| R2                  | 0.21 | 0.17 | 0.15 | 0.21 | 0.08 |
| Adjusted R2         | 0.21 | 0.15 | 0.13 | 0.19 | 0.05 |
| Residual Std. Error | 2.24 (df = 926) | 2.04 (df = 38) | 2.17 (df = 38) | 1.99 (df = 38) | 2.35 (df = 38) |
| F Statistic         | 246.84*** (df = 1; 926) | 7.64*** (df = 1; 38) | 6.76** (df = 1; 38) | 10.14*** (df = 1; 38) | 3.08* (df = 1; 38) |

Note: *p<0.1; **p<0.05; ***p<0.01
Note how the estimates vary from sample to sample. In practice you observe only one such random sample, and its estimates can sit some way off from the true population parameters. Never forget that.
Mole rats are the only known mammals with distinct social castes. A single queen and a small number of males are the only reproducing individuals in a colony. Remaining individuals, called workers, gather food, defend the colony, care for the young, and maintain burrows. Recently, it was discovered that there might be two worker castes in the Damaraland mole rat. “Frequent workers” do almost all of the work in the colony, whereas “infrequent workers” do little work except on rare occasions after rains, when they extend the burrow system. To assess the physiological differences between the two types of workers, researchers compared daily energy expenditures of wild mole rats during a dry season. Energy expenditure appeared to vary with body mass in both groups but infrequent workers were heavier than frequent workers. How different is mean daily energy expenditure between the two groups when adjusted for differences in body mass?
rats = read.csv(url("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter18/chap18e4MoleRatLayabouts.csv"))
summary(rats)
## caste lnMass lnEnergy
## lazy :14 Min. :3.850 Min. :3.555
## worker:21 1st Qu.:4.248 1st Qu.:3.902
## Median :4.511 Median :4.190
## Mean :4.540 Mean :4.193
## 3rd Qu.:4.844 3rd Qu.:4.489
## Max. :5.263 Max. :5.043
Fit the following model:
\[\log(energy) = a + b_1 \log(mass) + b_2(caste) + e\]
lm.r = lm(lnEnergy ~ lnMass + caste, data = rats)
summary(lm.r)
##
## Call:
## lm(formula = lnEnergy ~ lnMass + caste, data = rats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.73388 -0.19371 0.01317 0.17578 0.47673
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.09687 0.94230 -0.103 0.9188
## lnMass 0.89282 0.19303 4.625 5.89e-05 ***
## casteworker 0.39334 0.14611 2.692 0.0112 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2966 on 32 degrees of freedom
## Multiple R-squared: 0.409, Adjusted R-squared: 0.3721
## F-statistic: 11.07 on 2 and 32 DF, p-value: 0.0002213
Predicting lnEnergy for a worker with lnMass = Q1, and then for each caste with lnMass at its mean. First, recall the relevant values:
summary(rats)
## caste lnMass lnEnergy
## lazy :14 Min. :3.850 Min. :3.555
## worker:21 1st Qu.:4.248 1st Qu.:3.902
## Median :4.511 Median :4.190
## Mean :4.540 Mean :4.193
## 3rd Qu.:4.844 3rd Qu.:4.489
## Max. :5.263 Max. :5.043
newdata1 = data.frame(lnMass = 4.248, caste = "worker")
predict(lm.r, newdata = newdata1)
## 1
## 4.089153
newdata1 = data.frame(lnMass = 4.54, caste = "worker")
predict(lm.r, newdata = newdata1)
## 1
## 4.349855
newdata1 = data.frame(lnMass = 4.54, caste = "lazy")
predict(lm.r, newdata = newdata1)
## 1
## 3.956512
Now predict lnEnergy for lazy mole rats at the three quartiles of lnMass (Q1, the median, and Q3):
newdata1 = data.frame(lnMass = c(4.248, 4.511, 4.844), caste = "lazy")
predict(lm.r, newdata = newdata1)
## 1 2 3
## 3.695810 3.930621 4.227928
So what is the predicted difference in lnEnergy for differences in caste, holding lnMass constant?
What if you held caste constant and increased lnMass from Q1 by 1?
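Both answers can be read straight off the coefficients, which we can check against the predictions above:
# The caste difference at any fixed lnMass is the casteworker coefficient
4.349855 - 3.956512        # worker minus lazy at lnMass = 4.54: 0.393343
coef(lm.r)["casteworker"]  # the same 0.39334
# A one-unit increase in lnMass changes predicted lnEnergy by the lnMass slope
coef(lm.r)["lnMass"]       # 0.89282, holding caste constant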
With more than one independent variable the visreg package really comes in handy to see the predicted values of the outcome.
library(visreg)
visreg(lm.r, "lnMass", by = "caste", overlay = TRUE, band = FALSE)
lm.r2 = lm(lnEnergy ~ lnMass + caste + lnMass:caste, data = rats)
summary(lm.r2)
##
## Call:
## lm(formula = lnEnergy ~ lnMass + caste + lnMass:caste, data = rats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.72004 -0.17990 0.05631 0.19551 0.43128
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.2939 1.6691 0.775 0.4441
## lnMass 0.6069 0.3428 1.771 0.0865 .
## casteworker -1.5713 1.9518 -0.805 0.4269
## lnMass:casteworker 0.4186 0.4147 1.009 0.3206
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2965 on 31 degrees of freedom
## Multiple R-squared: 0.4278, Adjusted R-squared: 0.3725
## F-statistic: 7.727 on 3 and 31 DF, p-value: 0.0005391
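Under lm.r2 the lnMass slope is allowed to differ by caste, and the interaction coefficient is exactly that difference:
coef(lm.r2)["lnMass"]                                      # slope for lazy rats: 0.6069
coef(lm.r2)["lnMass"] + coef(lm.r2)["lnMass:casteworker"]  # slope for workers: 0.6069 + 0.4186 = 1.0255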
If we wanted to use the five-number summary for lnMass and generate predictions for both castes with these values, we could do the following:
new.data.a = data.frame(lnMass = c(3.850, 4.248, 4.511, 4.844, 5.263 ), caste = "worker")
predicted.lnEnergy.w = predict(lm.r, newdata = new.data.a)
predicted.lnEnergy.w
## 1 2 3 4 5
## 3.733812 4.089153 4.323963 4.621270 4.995360
new.data.b = data.frame(lnMass = c(3.850, 4.248, 4.511, 4.844, 5.263 ), caste = "lazy")
predicted.lnEnergy.l = predict(lm.r, newdata = new.data.b)
predicted.lnEnergy.l
## 1 2 3 4 5
## 3.340470 3.695810 3.930621 4.227928 4.602018
Now let us plot these side by side:
plot(new.data.a$lnMass, predicted.lnEnergy.w, type = "b", col = "firebrick",
     ylim = c(0, 6), ylab = "Predicted lnEnergy", xlab = "lnMass")
lines(new.data.b$lnMass, predicted.lnEnergy.l, type = "b", col = "cornflowerblue")  # lines() adds to the open plot; no par(new = TRUE) needed
text(4, 4.2, "Workers")
text(5, 4, "Lazy")
The gap between the two lines is the boost in predicted lnEnergy for workers relative to lazy mole rats; because lm.r has no interaction, the gap equals the casteworker coefficient (about 0.39) at every value of lnMass.
visreg(lm.r, "lnMass", by = "caste")
visreg(lm.r, "lnMass", by = "caste", overlay = TRUE)
Some categorical variables have more than two values. For example, in some of the ANOVA tests we ran, we had three groups. How would we fit and interpret models such as these? Let us see with reference to a specific example, the zooplankton depredation case. Recall that here we had a three-category treatment indicator (“control”, “low”, and “high”), the outcome was diversity, and we had a blocking indicator as well: the location of each of five sites.
zoop = read.csv(url("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter18/chap18e2ZooplanktonDepredation.csv"))
summary(zoop)
## treatment diversity block
## control:5 Min. :1.000 Min. :1
## high :5 1st Qu.:1.400 1st Qu.:2
## low :5 Median :2.200 Median :3
## Mean :2.133 Mean :3
## 3rd Qu.:2.550 3rd Qu.:4
## Max. :4.100 Max. :5
Assume we want to fit this model: \(diversity = a + b_1(treatment)\)
lm.z = lm(diversity ~ treatment, data = zoop)
summary(lm.z)
##
## Call:
## lm(formula = diversity ~ treatment, data = zoop)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.72 -0.44 -0.02 0.31 1.08
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.0200 0.2587 11.673 6.57e-08 ***
## treatmenthigh -1.6400 0.3659 -4.482 0.000749 ***
## treatmentlow -1.0200 0.3659 -2.788 0.016411 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5785 on 12 degrees of freedom
## Multiple R-squared: 0.6307, Adjusted R-squared: 0.5691
## F-statistic: 10.25 on 2 and 12 DF, p-value: 0.002539
Note how the model is estimated: \[\widehat{diversity} = (Intercept) + b_1(treatment = high) + b_2(treatment = low)\]
newdata1 = data.frame(treatment = "control")
predict(lm.z, newdata = newdata1)
## 1
## 3.02
newdata2 = data.frame(treatment = "low")
predict(lm.z, newdata = newdata2)
## 1
## 2
newdata3 = data.frame(treatment = "high")
predict(lm.z, newdata = newdata3)
## 1
## 1.38
What we have here are the means of diversity for each group:
with(zoop, tapply(diversity, treatment, FUN = mean))
## control high low
## 3.02 1.38 2.00
Changing the reference group (the category absorbed into the Intercept) to “high”…
zoop$treatment = relevel(zoop$treatment, "high")
lm.z2 = lm(diversity ~ treatment, data = zoop)
summary(lm.z2)
##
## Call:
## lm(formula = diversity ~ treatment, data = zoop)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.72 -0.44 -0.02 0.31 1.08
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.3800 0.2587 5.334 0.000178 ***
## treatmentcontrol 1.6400 0.3659 4.482 0.000749 ***
## treatmentlow 0.6200 0.3659 1.695 0.115931
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5785 on 12 degrees of freedom
## Multiple R-squared: 0.6307, Adjusted R-squared: 0.5691
## F-statistic: 10.25 on 2 and 12 DF, p-value: 0.002539
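With “high” as the reference, the Intercept is now the mean diversity of the high group, and each coefficient is that group's difference from it:
1.38 + 1.64  # control mean: 3.02
1.38 + 0.62  # low mean: 2.00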
What if we had two categorical variables? Let us see the model specification and interpretation with the lifespans of flies example.
flies = read.csv(url("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter18/chap18q07flyLifeSpan.csv"))
summary(flies)
## lifespanDays treatment fertility
## Min. : 2.00 high-cost:420 fertile:420
## 1st Qu.:22.00 low-cost :422 sterile:422
## Median :27.00
## Mean :27.74
## 3rd Qu.:33.00
## Max. :56.00
Notice that low-cost treatment and sterile fertility are the modal categories of the two factors. Let us create easier-to-understand indicator variables: low_cost, coded 1 for the low-cost treatment, and sterile, coded 1 for sterile flies.
flies$low_cost[flies$treatment == "low-cost"] = 1
flies$low_cost[flies$treatment == "high-cost"] = 0
flies$low_cost = factor(flies$low_cost, levels = c(0, 1), labels = c("= 0", "= 1"))
flies$sterile[flies$fertility == "sterile"] = 1
flies$sterile[flies$fertility == "fertile"] = 0
flies$sterile = factor(flies$sterile, levels = c(0, 1), labels = c("= 0", "= 1"))
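A quick sanity check that the recodes line up with the original factors:
with(flies, table(treatment, low_cost))  # all counts should fall on (high-cost, = 0) and (low-cost, = 1)
with(flies, table(fertility, sterile))   # likewise for fertility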
Now the model will be estimated as follows:
lm.f = lm(lifespanDays ~ low_cost + sterile, data = flies)
summary(lm.f)
##
## Call:
## lm(formula = lifespanDays ~ low_cost + sterile, data = flies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -30.3061 -4.3290 0.8528 4.8528 25.8757
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.1472 0.4926 46.988 < 2e-16 ***
## low_cost= 1 6.9771 0.5684 12.276 < 2e-16 ***
## sterile= 1 2.1819 0.5684 3.839 0.000133 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.246 on 839 degrees of freedom
## Multiple R-squared: 0.1649, Adjusted R-squared: 0.1629
## F-statistic: 82.83 on 2 and 839 DF, p-value: < 2.2e-16
anova(lm.f)
## Analysis of Variance Table
##
## Response: lifespanDays
## Df Sum Sq Mean Sq F value Pr(>F)
## low_cost 1 10262 10262.3 150.915 < 2.2e-16 ***
## sterile 1 1002 1002.1 14.736 0.0001329 ***
## Residuals 839 57053 68.0
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
What group profile is the Intercept absorbing? It is the group with both indicators at 0: fertile, high-cost flies, whose predicted mean lifespan is the Intercept, 23.15 days.
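We can confirm this with a prediction at that profile (the levels are written “= 0” to match the recoded factors):
predict(lm.f, newdata = data.frame(low_cost = "= 0", sterile = "= 0"))  # 23.1472, the Intercept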
These “effects” of group membership are easy to see with visreg.
library(visreg)
visreg(lm.f, "low_cost", by = "sterile")
We can also see that the fitted values are essentially the group mean lifespans (they would match the cell means below exactly if the model also included the low_cost:sterile interaction):
with(flies, aggregate(lifespanDays ~ low_cost:sterile, FUN = mean))
## low_cost sterile lifespanDays
## 1 = 0 = 0 23.31905
## 2 = 1 = 0 29.95238
## 3 = 0 = 1 25.15714
## 4 = 1 = 1 32.47642
In linear modeling, we refer to the NULL model as one that has no independent variables whatsoever. As we add independent variables \((x_1, x_2, \ldots)\) to the model, we need to test whether they improve the fit of the model to the data. We can see this with respect to a specific dataset in the epicalc package. These data reflect the number of deaths in London from 1-15 December 1952 due to air pollution. Two independent variables are usable: smoke (atmospheric smoke, in mg/cu.m) and SO2 (atmospheric sulphur dioxide, in parts per million). The dependent variable will be the number of deaths.
library(epicalc)
data(SO2)
Let us fit two models, the first with smoke as the only independent variable, and the second with SO2 as the second independent variable:
lm.01 = lm(deaths ~ smoke, data = SO2)
summary(lm.01)
##
## Call:
## lm(formula = deaths ~ smoke, data = SO2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -144.15 -73.33 24.39 54.55 180.39
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 171.82 31.43 5.466 0.000108 ***
## smoke 63.76 15.31 4.164 0.001112 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 88.71 on 13 degrees of freedom
## Multiple R-squared: 0.5715, Adjusted R-squared: 0.5386
## F-statistic: 17.34 on 1 and 13 DF, p-value: 0.001112
lm.02 = lm(deaths ~ smoke + SO2, data = SO2)
summary(lm.02)
##
## Call:
## lm(formula = deaths ~ smoke + SO2, data = SO2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -100.717 -20.689 -3.298 15.148 114.931
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 89.51 25.08 3.569 0.003858 **
## smoke -220.32 58.14 -3.789 0.002579 **
## SO2 1051.82 212.60 4.947 0.000338 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 52.96 on 12 degrees of freedom
## Multiple R-squared: 0.859, Adjusted R-squared: 0.8355
## F-statistic: 36.57 on 2 and 12 DF, p-value: 7.844e-06
We can now test whether lm.02 is an improvement over lm.01 via the following test:
anova(lm.01)
## Analysis of Variance Table
##
## Response: deaths
## Df Sum Sq Mean Sq F value Pr(>F)
## smoke 1 136450 136450 17.339 0.001112 **
## Residuals 13 102302 7869
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(lm.02)
## Analysis of Variance Table
##
## Response: deaths
## Df Sum Sq Mean Sq F value Pr(>F)
## smoke 1 136450 136450 48.654 1.485e-05 ***
## SO2 1 68648 68648 24.478 0.0003378 ***
## Residuals 12 33654 2805
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
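The manual version of this comparison is the extra-sum-of-squares (partial) F test,
\[F = \frac{(SSE_{reduced} - SSE_{full})/q}{SSE_{full}/(n - k - 1)},\]
where \(q\) is the number of added predictors (here 1), \(n = 15\) observations, and \(k = 2\) predictors in the full model. The gain in model sum of squares, (136450 + 68648) - 136450 = 68648, is the same as the drop in the residual sum of squares, 102302 - 33654.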
numerator = ((136450 + 68648) - 136450) / 1  # extra sum of squares from adding SO2, per added df
denominator = 33654 / (15 - 2 - 1)           # residual mean square of the full model
F = numerator / denominator; F
## [1] 24.4778
R does this very test if you simply run the following:
anova(lm.02, lm.01)
## Analysis of Variance Table
##
## Model 1: deaths ~ smoke + SO2
## Model 2: deaths ~ smoke
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 12 33654
## 2 13 102302 -1 -68648 24.478 0.0003378 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
What about using visreg() here, with two numeric variables?
library(visreg)
visreg(lm.02, "smoke", by = c("SO2"))
The plot below swaps out smoke for SO2:
library(visreg)
visreg(lm.02, "SO2", by = c("smoke"))
What if we wanted to hold smoke to specific values, maybe just the Minimum and the Maximum?
library(visreg)
visreg(lm.02, "SO2", by = c("smoke"), breaks = c(0.320, 1.930))
What about an interaction between smoke and SO2?
lm.m = lm(deaths ~ smoke + SO2 + smoke:SO2, data = SO2)
summary(lm.m)
##
## Call:
## lm(formula = deaths ~ smoke + SO2 + smoke:SO2, data = SO2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -99.037 -20.958 -4.436 14.812 116.428
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 91.529 40.198 2.277 0.04377 *
## smoke -221.057 61.719 -3.582 0.00431 **
## SO2 1043.471 255.325 4.087 0.00180 **
## smoke:SO2 2.271 34.318 0.066 0.94843
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 55.3 on 11 degrees of freedom
## Multiple R-squared: 0.8591, Adjusted R-squared: 0.8207
## F-statistic: 22.36 on 3 and 11 DF, p-value: 5.521e-05
The interaction isn’t significant, that much is obvious. But what would the regression lines look like with visreg()?
library(visreg)
visreg(lm.m, "smoke", by = "SO2", alpha = 0.05)
What about generating some predicted values? What would be the predicted number of deaths when smoke is at its minimum (0.290), at its mean (1.406), and then at its maximum (4.460)? To do this calculation we would need to fix SO2 at some value. Say we set it to its mean (0.458). Now the predictions.
newdata1 = data.frame("SO2" = 0.458, "smoke" = c(0.290, 1.406, 4.460))
predict(lm.m, newdata = newdata1)
## 1 2 3
## 505.6336 260.0942 -411.8388
Notice the negative prediction when smoke is at its maximum: pairing maximum smoke with mean SO2 lies well outside the observed data, and a linear model extrapolates freely, even to impossible death counts. What if smoke were at its mean and we allowed SO2 to be at its minimum, its mean, and then its maximum?
newdata2 = data.frame("SO2" = c(0.090, 0.458, 1.348), "smoke" = 1.406)
predict(lm.m, newdata = newdata2)
## 1 2 3
## -125.0781 260.0942 1191.6251
Is the model with an interaction an improvement upon the model without an interaction?
anova(lm.m, lm.02)
## Analysis of Variance Table
##
## Model 1: deaths ~ smoke + SO2 + smoke:SO2
## Model 2: deaths ~ smoke + SO2
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 11 33641
## 2 12 33654 -1 -13.393 0.0044 0.9484
We’ll use data on systolic blood pressure (in mmHg) of 69 individuals, along with information on their age and gender.
library(multcomp)
data(sbp); summary(sbp)
## gender sbp age
## male :40 Min. :110.0 Min. :17.00
## female:29 1st Qu.:135.0 1st Qu.:36.00
## Median :149.0 Median :47.00
## Mean :148.7 Mean :46.14
## 3rd Qu.:162.0 3rd Qu.:59.00
## Max. :185.0 Max. :70.00
Is systolic blood pressure influenced by (a) gender, and/or (b) age? If age does impact systolic blood pressure, does the impact of age vary across men and women?
lm.Z = lm(sbp ~ age + gender + age:gender, data = sbp)
summary(lm.Z)
##
## Call:
## lm(formula = sbp ~ age + gender + age:gender, data = sbp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.647 -3.410 1.254 4.314 21.153
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 110.03853 4.73610 23.234 < 2e-16 ***
## age 0.96135 0.09632 9.980 9.63e-15 ***
## genderfemale -12.96144 7.01172 -1.849 0.0691 .
## age:genderfemale -0.01203 0.14519 -0.083 0.9342
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.946 on 65 degrees of freedom
## Multiple R-squared: 0.7759, Adjusted R-squared: 0.7656
## F-statistic: 75.02 on 3 and 65 DF, p-value: < 2.2e-16
So in this specification, age clearly influences systolic blood pressure, while neither the gender main effect (p = 0.069) nor the age:gender interaction (p = 0.934) is significant at the 5% level.
library(visreg)
visreg(lm.Z, "age", by = c("gender"))
visreg(lm.Z, "age", by = c("gender"), overlay = TRUE)
What would happen if we dropped the interaction term?
lm.Y = lm(sbp ~ age + gender, data = sbp)
summary(lm.Y)
##
## Call:
## lm(formula = sbp ~ age + gender, data = sbp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.705 -3.299 1.248 4.325 21.160
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 110.28698 3.63824 30.313 < 2e-16 ***
## age 0.95606 0.07153 13.366 < 2e-16 ***
## genderfemale -13.51345 2.16932 -6.229 3.7e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.878 on 66 degrees of freedom
## Multiple R-squared: 0.7759, Adjusted R-squared: 0.7691
## F-statistic: 114.2 on 2 and 66 DF, p-value: < 2.2e-16
Now we see both age and gender having an impact on systolic blood pressure. Further, the residual standard error has dropped (from 8.946 to 8.878) and the adjusted R-squared has increased (from 0.766 to 0.769).
library(visreg)
visreg(lm.Y, "age", by = c("gender"))
visreg(lm.Y, "age", by = c("gender"), overlay = TRUE)
Could we use the ANOVA F test to see if the model with an interaction is better than the model without an interaction?
anova(lm.Z, lm.Y)
## Analysis of Variance Table
##
## Model 1: sbp ~ age + gender + age:gender
## Model 2: sbp ~ age + gender
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 65 5201.4
## 2 66 5202.0 -1 -0.54936 0.0069 0.9342
Yes, we could, and the high p-value of 0.9342 indicates that the two models fit essentially equally well, so we might as well settle for the simpler model (the one without the interaction).