Lions’ Noses

The simple linear regression model is fit as lm(y ~ x, data = dataname). As usual, we’ll give the model a name, for example, lm.1 = lm(y ~ x, data = dataname) and then use the summary(lm.1) command to see the results.

In the case of the Lions’ noses data, the model would be fit as shown below. Of course, we start by making a scatterplot of the data first.

lions = read.csv(url("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter17/chap17e1LionNoses.csv"))
summary(lions)
##  proportionBlack    ageInYears    
##  Min.   :0.1000   Min.   : 1.100  
##  1st Qu.:0.1650   1st Qu.: 2.175  
##  Median :0.2650   Median : 3.500  
##  Mean   :0.3222   Mean   : 4.309  
##  3rd Qu.:0.4325   3rd Qu.: 5.850  
##  Max.   :0.7900   Max.   :13.100
library(car)
## Loading required package: carData
scatterplot(ageInYears ~ proportionBlack, data = lions, ellipse = list(levels = c(0.05, 0.95)), regLine = FALSE, smooth = FALSE, xlab = "Proportion of the Nose Black", ylab = "Age (in Years)", xlim = c(-0.5, 1), ylim = c(-2, 13.5))

Note the positive correlation between proportionBlack and age.

Now we fit the regression model:

lm.1 = lm(ageInYears ~ proportionBlack, data = lions)
summary(lm.1)
## 
## Call:
## lm(formula = ageInYears ~ proportionBlack, data = lions)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.5449 -1.1117 -0.5285  0.9635  4.3421 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       0.8790     0.5688   1.545    0.133    
## proportionBlack  10.6471     1.5095   7.053 7.68e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.669 on 30 degrees of freedom
## Multiple R-squared:  0.6238, Adjusted R-squared:  0.6113 
## F-statistic: 49.75 on 1 and 30 DF,  p-value: 7.677e-08

Interpreting the Regression Estimates

  • The intercept is 0.8790, indicating that when proportionBlack = 0, the predicted age is 0.8790 years, i.e., slightly under 1 year old.
  • As proportionBlack increases, so does the lion’s predicted age.
    • When proportionBlack = 0.1, predicted age is 0.8790 + 10.6471 * (0.1) = 1.94371
    • When proportionBlack = 0.2, predicted age is 0.8790 + 10.6471 * (0.2) = 3.00842
    • How much did age increase by? 3.00842 - 1.94371 = 1.06471
  • The Adjusted R-squared is 0.6113, indicating that about 61.13% of the variation in lions’ ages can be explained by the proportion of their nose that is black
  • The residual standard error is 1.669, indicating that if we used this model to predict a lion’s age, our predictions would be off, over or under, by about 1.669 years on average.
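The arithmetic in the interpretation above can be checked directly in R by plugging in the coefficient estimates reported by summary(lm.1):

```r
# Coefficient estimates copied from the summary(lm.1) output above
b0 <- 0.8790   # intercept
b1 <- 10.6471  # slope on proportionBlack

# Predicted ages at proportionBlack = 0.1 and 0.2
b0 + b1 * 0.1  # 1.94371
b0 + b1 * 0.2  # 3.00842

# The increase in predicted age is the slope times the 0.1 increase in x
(b0 + b1 * 0.2) - (b0 + b1 * 0.1)  # 1.06471
```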

Generating Predicted Values

We may want to generate predicted values of the dependent variable for specific values of the independent variable. This is easily done, and we’ll do it in two ways. First, we will do some in-sample predictions, using the actual values of proportionBlack in our sample to see what the predicted ages would be.

lions$predicted.age = predict(lm.1, newdata = lions)

Notice that the predicted age isn’t the same as the actual age because of prediction error. In a nutshell, this means proportionBlack isn’t a perfect predictor of a lion’s age, and even if it were, natural randomness would still lead to some prediction error. How much error is there?

lions$errors = resid(lm.1)

summary(lions)
##  proportionBlack    ageInYears     predicted.age       errors       
##  Min.   :0.1000   Min.   : 1.100   Min.   :1.944   Min.   :-2.5449  
##  1st Qu.:0.1650   1st Qu.: 2.175   1st Qu.:2.636   1st Qu.:-1.1117  
##  Median :0.2650   Median : 3.500   Median :3.700   Median :-0.5285  
##  Mean   :0.3222   Mean   : 4.309   Mean   :4.309   Mean   : 0.0000  
##  3rd Qu.:0.4325   3rd Qu.: 5.850   3rd Qu.:5.484   3rd Qu.: 0.9635  
##  Max.   :0.7900   Max.   :13.100   Max.   :9.290   Max.   : 4.3421

Now open up the dataset and look at the actual age, the predicted age, and the errors, which are equal to ageInYears - predicted.age.
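The identity errors = observed − predicted holds for any lm() fit. A quick check on simulated data (hypothetical, not the lions dataset) makes this concrete:

```r
# Simulated example (hypothetical data, not the lions dataset)
set.seed(1)
d <- data.frame(x = runif(20))
d$y <- 1 + 2 * d$x + rnorm(20)
fit <- lm(y ~ x, data = d)

# resid() returns exactly observed minus fitted values
isTRUE(all.equal(unname(resid(fit)),
                 d$y - unname(predict(fit, newdata = d))))  # TRUE
```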

What if we wanted to predict age for two lions, one with proportionBlack = 0.40 and the other with proportionBlack = 0.80?

p.age1 = predict(lm.1, newdata = data.frame(proportionBlack = 0.40))
p.age1
##        1 
## 5.137854
p.age2 = predict(lm.1, newdata = data.frame(proportionBlack = 0.80))
p.age2
##        1 
## 9.396702

What is the difference in the two predicted ages? 4.258848

How much did proportionBlack increase from 0.40 to 0.80? By 0.40

What was the estimated slope on proportionBlack? 10.6471

What if we multiply the change in proportionBlack by 10.6471? We obtain 4.25884
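This is no coincidence: in a simple linear regression, the difference between two predicted values is always the slope times the change in x, since the intercept cancels. Using the coefficient estimates reported above for lm.1:

```r
# Coefficient estimates copied from the summary(lm.1) output above
b0 <- 0.8790; b1 <- 10.6471

(b0 + b1 * 0.80) - (b0 + b1 * 0.40)  # difference in predicted ages
b1 * (0.80 - 0.40)                   # slope times change in x: same value
```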

Confidence Intervals and Prediction Intervals

These intervals can be calculated as follows. Note that the prediction interval will be wider than the confidence interval, since the prediction interval predicts a single value of age while the confidence interval estimates the mean value of age.

conf = predict(lm.1, newdata = lions, interval = "confidence")
int = predict(lm.1, newdata = lions, interval = "prediction")

my.data.c = cbind(lions, conf)
my.data.p = cbind(lions, int)

Open up my.data.c and eyeball the actual age, the predicted age (called fit) and the 95% confidence intervals listed as lwr and upr. Now repeat this with my.data.p and it should be obvious that the prediction intervals are wider.
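The width comparison can also be verified numerically; a sketch on simulated data (hypothetical, not the lions dataset) shows that the prediction interval is wider at every value of x:

```r
# Simulated example (hypothetical data, not the lions dataset)
set.seed(2)
d <- data.frame(x = runif(30))
d$y <- 1 + 2 * d$x + rnorm(30)
fit <- lm(y ~ x, data = d)

ci <- predict(fit, newdata = d, interval = "confidence")
pi <- predict(fit, newdata = d, interval = "prediction")

# Prediction intervals are wider than confidence intervals at every x
all((pi[, "upr"] - pi[, "lwr"]) > (ci[, "upr"] - ci[, "lwr"]))  # TRUE
```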

We can also generate plots of the estimated regression line with confidence intervals via the visreg package.

library(visreg)
visreg(lm.1, "proportionBlack", alpha = 0.05) 

visreg(lm.1, "proportionBlack", alpha = 0.01)

Note that the first plot is for the 95% confidence interval while the second plot is for the 99% confidence interval.

Simple Regression with a Categorical Variable

Let us use the opsin data and focus on one factor – population – as the independent variable.

opsin = read.csv(url("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter18/chap18q18OpsinExpression.csv"))
summary(opsin)
##   population waterClarity relativeExpressionOfSWS1
##  Spring:18   Clear:16     Min.   :0.03000         
##  Swamp :15   Tea  :17     1st Qu.:0.07000         
##                           Median :0.09000         
##                           Mean   :0.09364         
##                           3rd Qu.:0.12000         
##                           Max.   :0.16000

Let us fit a regression model:

lm.2 = lm(relativeExpressionOfSWS1 ~ population, data = opsin)
summary(lm.2)
## 
## Call:
## lm(formula = relativeExpressionOfSWS1 ~ population, data = opsin)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.049333 -0.019333  0.000667  0.020667  0.054444 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      0.105556   0.007101   14.87 1.19e-15 ***
## populationSwamp -0.026222   0.010533   -2.49   0.0184 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.03013 on 31 degrees of freedom
## Multiple R-squared:  0.1666, Adjusted R-squared:  0.1397 
## F-statistic: 6.198 on 1 and 31 DF,  p-value: 0.01836
lm.3 = lm(relativeExpressionOfSWS1 ~ waterClarity, data = opsin)
summary(lm.3)
## 
## Call:
## lm(formula = relativeExpressionOfSWS1 ~ waterClarity, data = opsin)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.056875 -0.021176  0.003125  0.018824  0.053125 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      0.106875   0.007556  14.144 4.58e-15 ***
## waterClarityTea -0.025699   0.010528  -2.441   0.0206 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.03023 on 31 degrees of freedom
## Multiple R-squared:  0.1612, Adjusted R-squared:  0.1342 
## F-statistic: 5.958 on 1 and 31 DF,  p-value: 0.02055

Notice that the slope is attached to the dummy variable for the Swamp group (populationSwamp). So if we are interested in the predicted value of relative expression of SWS1 for the Spring group, we set the dummy variable to 0, calculating 0.105556 - 0.026222(0) = 0.105556. This is the predicted relative expression of SWS1 for the Spring group.

What about for the Swamp group? The dummy variable is 1, so this would be 0.105556 - 0.026222 = 0.079334

What exactly are these values? Let us see:

with(opsin, tapply(relativeExpressionOfSWS1, population, FUN = mean))
##     Spring      Swamp 
## 0.10555556 0.07933333
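This is no accident: with a single two-level factor, the intercept is the mean of the reference group (Spring), and the intercept plus the slope is the mean of the other group (Swamp). We can confirm this with the coefficient estimates reported above for lm.2:

```r
# Coefficient estimates copied from the summary(lm.2) output above
b0 <- 0.105556        # intercept = mean for Spring (reference group)
b_swamp <- -0.026222  # shift for the Swamp group

c(Spring = b0, Swamp = b0 + b_swamp)  # matches the group means above
```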
visreg(lm.2, "population", alpha = 0.05)

visreg(lm.2, "population", alpha = 0.01)

Practice Problems

Problem 1

Can the son’s attractiveness be predicted from the father’s ornamentation? Use the guppies data given below to fit a regression model to these data.

guppies = read.csv("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter02/chap02e3bGuppyFatherSonAttractiveness.csv")
summary(guppies)
##  fatherOrnamentation sonAttractiveness
##  Min.   :0.0300      Min.   :-0.3200  
##  1st Qu.:0.3275      1st Qu.: 0.1800  
##  Median :0.4550      Median : 0.4500  
##  Mean   :0.4908      Mean   : 0.4872  
##  3rd Qu.:0.6100      3rd Qu.: 0.8025  
##  Max.   :1.1200      Max.   : 1.2300
  1. Report and interpret the (i) estimated slope, (ii) the adjusted R-squared, and (iii) the residual standard error.
lm.4 = lm(sonAttractiveness ~ fatherOrnamentation, data = guppies)
summary(lm.4)
## 
## Call:
## lm(formula = sonAttractiveness ~ fatherOrnamentation, data = guppies)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.66888 -0.14647 -0.02119  0.27727  0.51324 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         0.005084   0.118988   0.043    0.966    
## fatherOrnamentation 0.982285   0.216499   4.537 6.78e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3212 on 34 degrees of freedom
## Multiple R-squared:  0.3771, Adjusted R-squared:  0.3588 
## F-statistic: 20.59 on 1 and 34 DF,  p-value: 6.784e-05

The estimated slope is 0.9822; as the father’s ornamentation increases by 1 unit, the son’s predicted attractiveness increases by 0.9822

The adjusted R-squared is 0.3588; we can explain about 35.88% of the variation in sons’ attractiveness with fathers’ ornamentation

The residual standard error is 0.3212; on average, our predictions of sons’ attractiveness would be off, over or under, by about 0.3212 units

  2. Calculate predicted attractiveness when father’s ornamentation is 0.25, and then when father’s ornamentation is 0.75.
predict(lm.4, newdata = data.frame(fatherOrnamentation = 0.25))
##         1 
## 0.2506552
predict(lm.4, newdata = data.frame(fatherOrnamentation = 0.75))
##         1 
## 0.7417978
  3. Use visreg() to plot the estimated regression line and the 95% confidence bands.
visreg(lm.4, "fatherOrnamentation")

Problem 2

Using the flies data given below, fit a regression model with lifespan as the dependent variable and fertility as the independent variable.

flies = read.csv(url("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter18/chap18q07flyLifeSpan.csv"))
summary(flies)
##   lifespanDays       treatment     fertility  
##  Min.   : 2.00   high-cost:420   fertile:420  
##  1st Qu.:22.00   low-cost :422   sterile:422  
##  Median :27.00                                
##  Mean   :27.74                                
##  3rd Qu.:33.00                                
##  Max.   :56.00
  1. Report and interpret (i) the residual standard error, and (ii) the adjusted R-squared.
lm.5 = lm(lifespanDays ~ fertility, data = flies)
summary(lm.5)
## 
## Call:
## lm(formula = lifespanDays ~ fertility, data = flies)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -26.8341  -5.6357  -0.8341   5.9655  29.3643 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       26.6357     0.4368  60.985  < 2e-16 ***
## fertilitysterile   2.1984     0.6169   3.563 0.000387 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.951 on 840 degrees of freedom
## Multiple R-squared:  0.01489,    Adjusted R-squared:  0.01372 
## F-statistic:  12.7 on 1 and 840 DF,  p-value: 0.0003867

The residual standard error is 8.951; on average, predicted lifespan could be greater or smaller than the true lifespan by about 8.951 days

The adjusted R-squared is 0.0137; we can explain about 1.37% of the variation in lifespan with fertility

  2. Calculate predicted lifespan for the fertile group, and then for the sterile group.
  • Predicted lifespan is 26.6357 for the fertile group
  • Predicted lifespan is 28.8341 for the sterile group
  3. Use visreg() to plot the estimated regression line and the 95% confidence bands.
visreg(lm.5, "fertility")