The Population Regression Function

library(HistData)
data(Galton); names(Galton); summary(Galton)
## [1] "parent" "child"
##      parent          child      
##  Min.   :64.00   Min.   :61.70  
##  1st Qu.:67.50   1st Qu.:66.20  
##  Median :68.50   Median :68.20  
##  Mean   :68.31   Mean   :68.09  
##  3rd Qu.:69.50   3rd Qu.:70.20  
##  Max.   :73.00   Max.   :73.70

Galton (1886) presented these data in a table, showing a cross-tabulation of 928 adult children born to 205 pairs of parents, classified by their own height and their mid-parent’s height. He visually smoothed the bivariate frequency distribution and showed that the contours formed concentric, similar ellipses, thus setting the stage for correlation, regression, and the bivariate normal distribution.

library(car)
scatterplot(child ~ parent, data=Galton, pch=16,  ellipse=FALSE, levels=c(0.5, 0.95), reg.line=lm, smoother=FALSE, xlim=c(63,75), ylim=c(63,75))

Draw four random samples of parent-child pairs from the Galton data. The sample size is set to 40; you can tweak that number in the commands below to get larger or smaller samples.

randomSample = function(df, n) {
    return(df[sample(nrow(df), n), ])
}
sample1 <- randomSample(Galton, 40)
sample2 <- randomSample(Galton, 40)
sample3 <- randomSample(Galton, 40)
sample4 <- randomSample(Galton, 40)

Graphing the four samples as scatterplots to see how they look:
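The plotting chunk isn’t shown in the original; here is a minimal sketch that would produce the four panels (the 2 x 2 layout and panel titles are my choices):

samples = list(sample1, sample2, sample3, sample4)
par(mfrow = c(2, 2))    # 2 x 2 grid of panels
for (i in 1:4) {
  plot(child ~ parent, data = samples[[i]], pch = 16, main = paste("Sample", i))
}
par(mfrow = c(1, 1))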

Now adding the regression line estimated in each sample to each of the preceding scatterplots:
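Again a sketch rather than the original chunk, reusing the samples list from above and overlaying each sample’s fitted line:

par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(child ~ parent, data = samples[[i]], pch = 16, main = paste("Sample", i))
  abline(lm(child ~ parent, data = samples[[i]]), col = "firebrick")   # this sample's regression line
}
par(mfrow = c(1, 1))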

Estimating and seeing the results of the Population Regression Function for Galton’s data:
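The estimation command isn’t shown; judging from the Call line in the output and the stargazer call further below, it was presumably:

model4 = lm(Galton$child ~ Galton$parent)
summary(model4)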

## 
## Call:
## lm(formula = Galton$child ~ Galton$parent)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.8050 -1.3661  0.0487  1.6339  5.9264 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   23.94153    2.81088   8.517   <2e-16 ***
## Galton$parent  0.64629    0.04114  15.711   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.239 on 926 degrees of freedom
## Multiple R-squared:  0.2105, Adjusted R-squared:  0.2096 
## F-statistic: 246.8 on 1 and 926 DF,  p-value: < 2.2e-16

Estimating and seeing the results of the Sample Regression Function:
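As above, the commands aren’t shown; they are reconstructed here from the Call lines (model names inferred from the stargazer call below):

model4.1 = lm(sample1$child ~ sample1$parent)
model4.2 = lm(sample2$child ~ sample2$parent)
model4.3 = lm(sample3$child ~ sample3$parent)
model4.4 = lm(sample4$child ~ sample4$parent)
summary(model4.1); summary(model4.2); summary(model4.3); summary(model4.4)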

## 
## Call:
## lm(formula = sample1$child ~ sample1$parent)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -5.006 -1.541 -0.006  1.110  4.852 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)   
## (Intercept)     36.9150    11.5669   3.191  0.00284 **
## sample1$parent   0.4646     0.1681   2.764  0.00875 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.044 on 38 degrees of freedom
## Multiple R-squared:  0.1674, Adjusted R-squared:  0.1455 
## F-statistic:  7.64 on 1 and 38 DF,  p-value: 0.008754
## 
## Call:
## lm(formula = sample2$child ~ sample2$parent)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.7063 -1.1948  0.2822  0.9388  4.8281 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)  
## (Intercept)     33.8696    13.3761   2.532   0.0156 *
## sample2$parent   0.5115     0.1968   2.599   0.0132 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.173 on 38 degrees of freedom
## Multiple R-squared:  0.1509, Adjusted R-squared:  0.1286 
## F-statistic: 6.756 on 1 and 38 DF,  p-value: 0.01323
## 
## Call:
## lm(formula = sample3$child ~ sample3$parent)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.0114 -0.8110 -0.0114  1.1556  3.9886 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)   
## (Intercept)     29.8192    12.1260   2.459  0.01859 * 
## sample3$parent   0.5668     0.1780   3.185  0.00289 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.992 on 38 degrees of freedom
## Multiple R-squared:  0.2107, Adjusted R-squared:  0.1899 
## F-statistic: 10.14 on 1 and 38 DF,  p-value: 0.002891
## 
## Call:
## lm(formula = sample4$child ~ sample4$parent)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.1612 -1.3282  0.5878  1.3803  5.1161 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)   
## (Intercept)     41.4591    14.9982   2.764  0.00875 **
## sample4$parent   0.3887     0.2213   1.756  0.08715 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.353 on 38 degrees of freedom
## Multiple R-squared:  0.07505,    Adjusted R-squared:  0.05071 
## F-statistic: 3.083 on 1 and 38 DF,  p-value: 0.08715
library(stargazer)
stargazer(model4, model4.1, model4.2, model4.3, model4.4, type="html", column.labels = c("Population", "Sample 1", "Sample 2", "Sample 3", "Sample 4"), column.separate = c(1, 1, 1, 1, 1), digits=2)
=========================================================================================================
                                         Dependent variable: child
                     ------------------------------------------------------------------------------------
                      Population         Sample 1        Sample 2        Sample 3        Sample 4
                         (1)               (2)             (3)             (4)             (5)
---------------------------------------------------------------------------------------------------------
parent                 0.65***           0.46***         0.51**          0.57***          0.39*
                       (0.04)            (0.17)          (0.20)          (0.18)           (0.22)
Constant              23.94***          36.92***        33.87**         29.82**         41.46***
                       (2.81)           (11.57)         (13.38)         (12.13)          (15.00)
---------------------------------------------------------------------------------------------------------
Observations             928               40              40              40              40
R2                      0.21              0.17            0.15            0.21            0.08
Adjusted R2             0.21              0.15            0.13            0.19            0.05
Residual Std. Error  2.24 (df = 926)  2.04 (df = 38)  2.17 (df = 38)  1.99 (df = 38)  2.35 (df = 38)
F Statistic         246.84***         7.64***         6.76**          10.14***        3.08*
                    (df = 1; 926)     (df = 1; 38)    (df = 1; 38)    (df = 1; 38)    (df = 1; 38)
---------------------------------------------------------------------------------------------------------
Note:                *p<0.1; **p<0.05; ***p<0.01

Note how the estimates vary from sample to sample. In essence, you could end up with any one of these random samples and be quite far from the TRUE population parameters. Never forget that.
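To drive the point home, here is a small simulation (my own sketch, not part of the original): draw 1,000 samples of 40 and see how the estimated slope scatters around the full-data estimate of roughly 0.646.

set.seed(42)    # for reproducibility
slopes = replicate(1000, {
  s = randomSample(Galton, 40)
  coef(lm(child ~ parent, data = s))[2]    # slope estimated from this sample
})
hist(slopes, breaks = 30, xlab = "Estimated slope", main = "Slopes across 1,000 samples of n = 40")
abline(v = 0.64629, col = "firebrick", lwd = 2)    # the full-data estimate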

Fitting a Linear Regression Model with Multiple Independent Variables

Mole Rat Layabouts

Mole rats are the only known mammals with distinct social castes. A single queen and a small number of males are the only reproducing individuals in a colony. The remaining individuals, called workers, gather food, defend the colony, care for the young, and maintain burrows. Recently, it was discovered that there might be two worker castes in the Damaraland mole rat. “Frequent workers” do almost all of the work in the colony, whereas “infrequent workers” do little work except on rare occasions after rains, when they extend the burrow system. To assess the physiological differences between the two types of workers, researchers compared the daily energy expenditures of wild mole rats during a dry season. Energy expenditure appeared to vary with body mass in both groups, but infrequent workers were heavier than frequent workers. How different is mean daily energy expenditure between the two groups when adjusted for differences in body mass?

rats = read.csv(url("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter18/chap18e4MoleRatLayabouts.csv"))
summary(rats)
##     caste        lnMass         lnEnergy    
##  lazy  :14   Min.   :3.850   Min.   :3.555  
##  worker:21   1st Qu.:4.248   1st Qu.:3.902  
##              Median :4.511   Median :4.190  
##              Mean   :4.540   Mean   :4.193  
##              3rd Qu.:4.844   3rd Qu.:4.489  
##              Max.   :5.263   Max.   :5.043

Fit the following model:

\[\log(energy) = a + b_1\,\log(mass) + b_2(caste) + e\]

lm.r = lm(lnEnergy ~ lnMass + caste, data = rats)
summary(lm.r)
## 
## Call:
## lm(formula = lnEnergy ~ lnMass + caste, data = rats)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.73388 -0.19371  0.01317  0.17578  0.47673 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.09687    0.94230  -0.103   0.9188    
## lnMass       0.89282    0.19303   4.625 5.89e-05 ***
## casteworker  0.39334    0.14611   2.692   0.0112 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2966 on 32 degrees of freedom
## Multiple R-squared:  0.409,  Adjusted R-squared:  0.3721 
## F-statistic: 11.07 on 2 and 32 DF,  p-value: 0.0002213

Predicting lnEnergy for a worker with lnMass = Q1, then for each caste with lnMass at its mean:

summary(rats)
##     caste        lnMass         lnEnergy    
##  lazy  :14   Min.   :3.850   Min.   :3.555  
##  worker:21   1st Qu.:4.248   1st Qu.:3.902  
##              Median :4.511   Median :4.190  
##              Mean   :4.540   Mean   :4.193  
##              3rd Qu.:4.844   3rd Qu.:4.489  
##              Max.   :5.263   Max.   :5.043
newdata1 = data.frame(lnMass = 4.248, caste = "worker")
predict(lm.r, newdata = newdata1)
##        1 
## 4.089153
newdata1 = data.frame(lnMass = 4.54, caste = "worker")
predict(lm.r, newdata = newdata1)
##        1 
## 4.349855
newdata1 = data.frame(lnMass = 4.54, caste = "lazy")
predict(lm.r, newdata = newdata1)
##        1 
## 3.956512

Now predict lnEnergy for lazy with lnMass set to Q1, the Median, and Q3:

newdata1 = data.frame(lnMass = c(4.248, 4.511, 4.844), caste = "lazy")
predict(lm.r, newdata = newdata1)
##        1        2        3 
## 3.695810 3.930621 4.227928

So what is the predicted difference in lnEnergy for differences in caste, holding lnMass constant?

What if you held caste constant and increased lnMass from Q1 by 1?
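Both answers can be read straight off the coefficients; a quick check with predict() (my own sketch):

# Difference by caste at a fixed lnMass: equals the casteworker coefficient
predict(lm.r, data.frame(lnMass = 4.54, caste = "worker")) -
  predict(lm.r, data.frame(lnMass = 4.54, caste = "lazy"))       # 0.39334

# Raising lnMass by 1 from Q1, holding caste constant: equals the lnMass coefficient
predict(lm.r, data.frame(lnMass = 5.248, caste = "worker")) -
  predict(lm.r, data.frame(lnMass = 4.248, caste = "worker"))    # 0.89282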

With more than one independent variable, the visreg package really comes in handy for seeing the predicted values of the outcome.

library(visreg)
visreg(lm.r, "lnMass", by = "caste", overlay = TRUE, band = FALSE)

lm.r2 = lm(lnEnergy ~ lnMass + caste + lnMass:caste, data = rats)
summary(lm.r2)
## 
## Call:
## lm(formula = lnEnergy ~ lnMass + caste + lnMass:caste, data = rats)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.72004 -0.17990  0.05631  0.19551  0.43128 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)  
## (Intercept)          1.2939     1.6691   0.775   0.4441  
## lnMass               0.6069     0.3428   1.771   0.0865 .
## casteworker         -1.5713     1.9518  -0.805   0.4269  
## lnMass:casteworker   0.4186     0.4147   1.009   0.3206  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2965 on 31 degrees of freedom
## Multiple R-squared:  0.4278, Adjusted R-squared:  0.3725 
## F-statistic: 7.727 on 3 and 31 DF,  p-value: 0.0005391

If we wanted to use the five-number summary for lnMass and generate predictions for both castes with these values, we could do the following:

new.data.a = data.frame(lnMass = c(3.850, 4.248, 4.511, 4.844, 5.263 ), caste = "worker")
predicted.lnEnergy.w = predict(lm.r, newdata = new.data.a)
predicted.lnEnergy.w
##        1        2        3        4        5 
## 3.733812 4.089153 4.323963 4.621270 4.995360
new.data.b = data.frame(lnMass = c(3.850, 4.248, 4.511, 4.844, 5.263 ), caste = "lazy")
predicted.lnEnergy.l = predict(lm.r, newdata = new.data.b)
predicted.lnEnergy.l
##        1        2        3        4        5 
## 3.340470 3.695810 3.930621 4.227928 4.602018

Now let us plot the two sets of predictions on the same axes:

plot(new.data.a$lnMass, predicted.lnEnergy.w, type = "b", col = "firebrick", ylim = c(0, 6), ylab = "Predicted lnEnergy", xlab = "lnMass")
lines(new.data.b$lnMass, predicted.lnEnergy.l, type = "b", col = "cornflowerblue")
text(4, 4.2, "Workers")
text(5, 4, "Lazy")

The constant gap between the two lines is the boost in lnEnergy for workers.
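You can verify that the gap is constant and equals the casteworker coefficient:

predicted.lnEnergy.w - predicted.lnEnergy.l    # 0.39334 at every lnMass value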

visreg(lm.r, "lnMass", by = "caste")

visreg(lm.r, "lnMass", by = "caste", overlay = TRUE)

Multinomial Categorical Independent Variables

These are categorical variables with more than two values. For example, in some of the ANOVA tests we ran we had three groups. How would we fit and interpret models such as these? Let us see with reference to a specific example – the Zooplankton depredation case. Recall that here we had a three-category treatment indicator: “control”, “low”, and “high”. The outcome was diversity, and we had a blocking indicator as well – the location of each of five sites.

zoop = read.csv(url("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter18/chap18e2ZooplanktonDepredation.csv"))
summary(zoop)
##    treatment   diversity         block  
##  control:5   Min.   :1.000   Min.   :1  
##  high   :5   1st Qu.:1.400   1st Qu.:2  
##  low    :5   Median :2.200   Median :3  
##              Mean   :2.133   Mean   :3  
##              3rd Qu.:2.550   3rd Qu.:4  
##              Max.   :4.100   Max.   :5

Assume we want to fit this model: \(diversity = a + b_1(treatment)\)

lm.z = lm(diversity ~ treatment, data = zoop)
summary(lm.z)
## 
## Call:
## lm(formula = diversity ~ treatment, data = zoop)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -0.72  -0.44  -0.02   0.31   1.08 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     3.0200     0.2587  11.673 6.57e-08 ***
## treatmenthigh  -1.6400     0.3659  -4.482 0.000749 ***
## treatmentlow   -1.0200     0.3659  -2.788 0.016411 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5785 on 12 degrees of freedom
## Multiple R-squared:  0.6307, Adjusted R-squared:  0.5691 
## F-statistic: 10.25 on 2 and 12 DF,  p-value: 0.002539

Note how the model is estimated: \[\hat{diversity} = (Intercept) + b_1(treatment=high) + b_2(treatment=low)\]
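You can inspect the dummy coding R builds behind the scenes (a quick check, not in the original):

head(model.matrix(lm.z))    # columns: (Intercept), treatmenthigh, treatmentlow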

newdata1 = data.frame(treatment = "control")
predict(lm.z, newdata = newdata1)
##    1 
## 3.02
newdata2 = data.frame(treatment = "low")
predict(lm.z, newdata = newdata2)
## 1 
## 2
newdata3 = data.frame(treatment = "high")
predict(lm.z, newdata = newdata3)
##    1 
## 1.38

What we have here are the means of diversity for each group:

with(zoop, tapply(diversity, treatment, FUN = mean))
## control    high     low 
##    3.02    1.38    2.00

Changing the reference group, i.e., the category absorbed into the Intercept, to high:

zoop$treatment = relevel(zoop$treatment, "high")

lm.z2 = lm(diversity ~ treatment, data = zoop)
summary(lm.z2)
## 
## Call:
## lm(formula = diversity ~ treatment, data = zoop)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -0.72  -0.44  -0.02   0.31   1.08 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1.3800     0.2587   5.334 0.000178 ***
## treatmentcontrol   1.6400     0.3659   4.482 0.000749 ***
## treatmentlow       0.6200     0.3659   1.695 0.115931    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5785 on 12 degrees of freedom
## Multiple R-squared:  0.6307, Adjusted R-squared:  0.5691 
## F-statistic: 10.25 on 2 and 12 DF,  p-value: 0.002539

Two Categorical Independent Variables

What if we had two categorical variables? Let us see the model specification and interpretation with the lifespans of flies example.

flies = read.csv(url("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter18/chap18q07flyLifeSpan.csv"))
summary(flies)
##   lifespanDays       treatment     fertility  
##  Min.   : 2.00   high-cost:420   fertile:420  
##  1st Qu.:22.00   low-cost :422   sterile:422  
##  Median :27.00                                
##  Mean   :27.74                                
##  3rd Qu.:33.00                                
##  Max.   :56.00

Notice that low-cost and sterile are the modal categories of treatment and fertility, respectively. Let us create easier-to-understand factors. In particular, let us create

  • a factor that is coded as low_cost = 1 if treatment = low-cost; and low_cost = 0 if treatment = high-cost
  • a factor that is coded as sterile = 1 if fertility = sterile; and sterile = 0 if fertility = fertile
flies$low_cost[flies$treatment == "low-cost"] = 1
flies$low_cost[flies$treatment == "high-cost"] = 0
flies$low_cost = factor(flies$low_cost, levels = c(0, 1), labels = c("= 0", "= 1"))

flies$sterile[flies$fertility == "sterile"] = 1
flies$sterile[flies$fertility == "fertile"] = 0
flies$sterile = factor(flies$sterile, levels = c(0, 1), labels = c("= 0", "= 1"))

Now the model will be estimated as follows:

lm.f = lm(lifespanDays ~ low_cost + sterile, data = flies)
summary(lm.f)
## 
## Call:
## lm(formula = lifespanDays ~ low_cost + sterile, data = flies)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -30.3061  -4.3290   0.8528   4.8528  25.8757 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  23.1472     0.4926  46.988  < 2e-16 ***
## low_cost= 1   6.9771     0.5684  12.276  < 2e-16 ***
## sterile= 1    2.1819     0.5684   3.839 0.000133 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.246 on 839 degrees of freedom
## Multiple R-squared:  0.1649, Adjusted R-squared:  0.1629 
## F-statistic: 82.83 on 2 and 839 DF,  p-value: < 2.2e-16
anova(lm.f)
## Analysis of Variance Table
## 
## Response: lifespanDays
##            Df Sum Sq Mean Sq F value    Pr(>F)    
## low_cost    1  10262 10262.3 150.915 < 2.2e-16 ***
## sterile     1   1002  1002.1  14.736 0.0001329 ***
## Residuals 839  57053    68.0                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

What group profile is the Intercept absorbing? (The baseline group: high-cost, fertile flies, i.e., low_cost = 0 and sterile = 0.)

These “effects” of group membership are easy to see with visreg.

library(visreg)
visreg(lm.f, "low_cost", by = "sterile")

We can also see that the predicted values from the regression closely track the mean lifespans of the four groups (see below):

with(flies, aggregate(lifespanDays ~ low_cost:sterile, FUN = mean))
##   low_cost sterile lifespanDays
## 1      = 0     = 0     23.31905
## 2      = 1     = 0     29.95238
## 3      = 0     = 1     25.15714
## 4      = 1     = 1     32.47642

Model Fitting

In linear modeling we refer to the NULL model as one that has no independent variables whatsoever. As we add independent variables \((x_1, x_2, \ldots)\) to the model, we need to test whether they improve the fit of the model to the data. We can see this with respect to a specific dataset in the epicalc package. These data record the number of deaths in London from 1st-15th December 1952 due to air pollution. Two independent variables are usable: atmospheric smoke (in mg/cu.m) and SO2 (atmospheric sulphur dioxide, in parts/million). The dependent variable will be the number of deaths.

library(epicalc)
data(SO2)
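The NULL model itself is just an intercept; fitting it shows that its lone coefficient is the mean of the outcome, and any candidate model can then be compared against it with anova(). A quick sketch:

lm.00 = lm(deaths ~ 1, data = SO2)    # the NULL model: intercept only
coef(lm.00)                           # equals mean(SO2$deaths)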

Let us fit two models, the first with smoke as the only independent variable, and the second with SO2 as the second independent variable:

lm.01 = lm(deaths ~ smoke, data = SO2)
summary(lm.01)
## 
## Call:
## lm(formula = deaths ~ smoke, data = SO2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -144.15  -73.33   24.39   54.55  180.39 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   171.82      31.43   5.466 0.000108 ***
## smoke          63.76      15.31   4.164 0.001112 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 88.71 on 13 degrees of freedom
## Multiple R-squared:  0.5715, Adjusted R-squared:  0.5386 
## F-statistic: 17.34 on 1 and 13 DF,  p-value: 0.001112
lm.02 = lm(deaths ~ smoke + SO2, data = SO2)
summary(lm.02)
## 
## Call:
## lm(formula = deaths ~ smoke + SO2, data = SO2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -100.717  -20.689   -3.298   15.148  114.931 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    89.51      25.08   3.569 0.003858 ** 
## smoke        -220.32      58.14  -3.789 0.002579 ** 
## SO2          1051.82     212.60   4.947 0.000338 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 52.96 on 12 degrees of freedom
## Multiple R-squared:  0.859,  Adjusted R-squared:  0.8355 
## F-statistic: 36.57 on 2 and 12 DF,  p-value: 7.844e-06

We can now test whether lm.02 is an improvement over lm.01 via the following test:

anova(lm.01)
## Analysis of Variance Table
## 
## Response: deaths
##           Df Sum Sq Mean Sq F value   Pr(>F)   
## smoke      1 136450  136450  17.339 0.001112 **
## Residuals 13 102302    7869                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(lm.02)
## Analysis of Variance Table
## 
## Response: deaths
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## smoke      1 136450  136450  48.654 1.485e-05 ***
## SO2        1  68648   68648  24.478 0.0003378 ***
## Residuals 12  33654    2805                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
numerator = ((136450 + 68648) - 136450)/1    # extra sum of squares contributed by SO2, divided by its 1 df
denominator = (33654 / (15 - 2 - 1))         # residual mean square of the fuller model (n - k - 1 = 12 df)
Fstat = numerator/denominator; Fstat
## [1] 24.4778

R does this very test if you simply run the following:

anova(lm.02, lm.01)
## Analysis of Variance Table
## 
## Model 1: deaths ~ smoke + SO2
## Model 2: deaths ~ smoke
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1     12  33654                                  
## 2     13 102302 -1    -68648 24.478 0.0003378 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

What about using visreg() here, with two numeric variables?

library(visreg)
visreg(lm.02, "smoke", by = c("SO2"))

  • smoke is on the x-axis
  • SO2 is being held at specific values
    • in the first panel, SO2 is held at its \(10^{th}\) percentile
    • in the second panel, SO2 is held at the Median
    • in the third panel, SO2 is held at its \(90^{th}\) percentile

The plot below swaps out smoke for SO2:

library(visreg)
visreg(lm.02, "SO2", by = c("smoke"))

What if we wanted to hold smoke to specific values, maybe just the Minimum and the Maximum?

library(visreg)
visreg(lm.02, "SO2", by = c("smoke"), breaks = c(0.320, 1.930))

What about an interaction between smoke and SO2?

lm.m = lm(deaths ~ smoke + SO2 + smoke:SO2, data = SO2) 
summary(lm.m)
## 
## Call:
## lm(formula = deaths ~ smoke + SO2 + smoke:SO2, data = SO2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -99.037 -20.958  -4.436  14.812 116.428 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   91.529     40.198   2.277  0.04377 * 
## smoke       -221.057     61.719  -3.582  0.00431 **
## SO2         1043.471    255.325   4.087  0.00180 **
## smoke:SO2      2.271     34.318   0.066  0.94843   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 55.3 on 11 degrees of freedom
## Multiple R-squared:  0.8591, Adjusted R-squared:  0.8207 
## F-statistic: 22.36 on 3 and 11 DF,  p-value: 5.521e-05

The interaction isn’t significant; that much is obvious. But what would the regression lines look like with visreg()?

library(visreg)
visreg(lm.m, "smoke", by = "SO2", alpha = 0.05)

What about generating some predicted values? What would be the predicted number of deaths when smoke is at its minimum (0.290), at its mean (1.406), and then at its maximum (4.460)? To do this calculation we would need to fix SO2 at some value. Say we set it to its mean (0.458). Now the predictions.

newdata1 = data.frame("SO2" = 0.458, "smoke" = c(0.290, 1.406, 4.460))
predict(lm.m, newdata = newdata1)
##         1         2         3 
##  505.6336  260.0942 -411.8388

Note the negative prediction in the last cell: holding SO2 at its mean while pushing smoke to its maximum extrapolates well beyond the observed data (smoke and SO2 are strongly related in these data, as the sign flip on smoke between lm.01 and lm.02 suggests), so the linear model returns an impossible death count. What about if smoke were at its mean and we allowed SO2 to be at its minimum, its mean, and then at its maximum?

newdata2 = data.frame("SO2" = c(0.090, 0.458, 1.348), "smoke" = 1.406)
predict(lm.m, newdata = newdata2)
##         1         2         3 
## -125.0781  260.0942 1191.6251

Is the model with an interaction an improvement upon the model without an interaction?

anova(lm.m, lm.02)
## Analysis of Variance Table
## 
## Model 1: deaths ~ smoke + SO2 + smoke:SO2
## Model 2: deaths ~ smoke + SO2
##   Res.Df   RSS Df Sum of Sq      F Pr(>F)
## 1     11 33641                           
## 2     12 33654 -1   -13.393 0.0044 0.9484

Another Example: Systolic Blood Pressure

We’ll use data on systolic blood pressure (in mmHg) of 69 individuals, along with information on their age and gender.

library(multcomp)
data(sbp); summary(sbp)
##     gender        sbp             age       
##  male  :40   Min.   :110.0   Min.   :17.00  
##  female:29   1st Qu.:135.0   1st Qu.:36.00  
##              Median :149.0   Median :47.00  
##              Mean   :148.7   Mean   :46.14  
##              3rd Qu.:162.0   3rd Qu.:59.00  
##              Max.   :185.0   Max.   :70.00

Is systolic blood pressure influenced by (a) gender, and/or (b) age? If age does impact systolic blood pressure, does the impact of age vary across men and women?

lm.Z = lm(sbp ~ age + gender + age:gender, data = sbp)
summary(lm.Z)
## 
## Call:
## lm(formula = sbp ~ age + gender + age:gender, data = sbp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -20.647  -3.410   1.254   4.314  21.153 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      110.03853    4.73610  23.234  < 2e-16 ***
## age                0.96135    0.09632   9.980 9.63e-15 ***
## genderfemale     -12.96144    7.01172  -1.849   0.0691 .  
## age:genderfemale  -0.01203    0.14519  -0.083   0.9342    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.946 on 65 degrees of freedom
## Multiple R-squared:  0.7759, Adjusted R-squared:  0.7656 
## F-statistic: 75.02 on 3 and 65 DF,  p-value: < 2.2e-16

So in this specification, age clearly influences systolic blood pressure but gender appears to have no impact, either directly or through an interaction with age.

  • As age increases by 1 year, systolic blood pressure increases by 0.96 mmHg
  • On average, sbp predicted from this regression model will over- or under-predict an individual’s true sbp by about 8.946 mmHg (the residual standard error)
  • About 76.56% of the variation in sbp can be explained by looking at age and gender
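For instance, the predicted sbp for a 50-year-old man versus a 50-year-old woman (my own illustration; the roughly 13.6 mmHg gap combines the gender and interaction coefficients):

predict(lm.Z, newdata = data.frame(age = 50, gender = c("male", "female")))
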
library(visreg)
visreg(lm.Z, "age", by = c("gender"))

visreg(lm.Z, "age", by = c("gender"), overlay = TRUE)

What would happen if we dropped the interaction term?

lm.Y = lm(sbp ~ age + gender, data = sbp)
summary(lm.Y)
## 
## Call:
## lm(formula = sbp ~ age + gender, data = sbp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -20.705  -3.299   1.248   4.325  21.160 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  110.28698    3.63824  30.313  < 2e-16 ***
## age            0.95606    0.07153  13.366  < 2e-16 ***
## genderfemale -13.51345    2.16932  -6.229  3.7e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.878 on 66 degrees of freedom
## Multiple R-squared:  0.7759, Adjusted R-squared:  0.7691 
## F-statistic: 114.2 on 2 and 66 DF,  p-value: < 2.2e-16

Now we see both age and gender having an impact on systolic blood pressure. Further, the average prediction error has dropped and the adjusted R-squared has increased.

library(visreg)
visreg(lm.Y, "age", by = c("gender"))

visreg(lm.Y, "age", by = c("gender"), overlay = TRUE)

Could we use the ANOVA F test to see if the model with an interaction is better than the model without an interaction?

anova(lm.Z, lm.Y)
## Analysis of Variance Table
## 
## Model 1: sbp ~ age + gender + age:gender
## Model 2: sbp ~ age + gender
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     65 5201.4                           
## 2     66 5202.0 -1  -0.54936 0.0069 0.9342

Yes we could, and the high p-value of 0.9342 indicates that the two models fit essentially equally well, so we might as well settle for the simpler model (i.e., the one without an interaction).