The simple linear regression model is fit as lm(y ~ x, data = dataname). As usual, we'll give the model a name, for example, lm.1 = lm(y ~ x, data = dataname), and then use the summary(lm.1) command to see the results.
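Before turning to real data, here is a minimal self-contained sketch on simulated data (the variable names and the lm.sketch model name are made up for illustration):

```r
# Simulate data with a known linear relationship: y = 2 + 3x + noise
set.seed(42)
dataname <- data.frame(x = runif(50))
dataname$y <- 2 + 3 * dataname$x + rnorm(50, sd = 0.5)

# Fit the simple linear regression and store it under a name
lm.sketch <- lm(y ~ x, data = dataname)

# summary() reports the coefficients, R-squared, and residual standard error
summary(lm.sketch)
```

The estimated slope should land near the true value of 3, with precision governed by the noise level and the sample size.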
In the case of the lion noses data, the model is fit as shown below. Of course, we start by drawing a scatterplot of the data first.
lions = read.csv(url("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter17/chap17e1LionNoses.csv"))
summary(lions)
## proportionBlack ageInYears
## Min. :0.1000 Min. : 1.100
## 1st Qu.:0.1650 1st Qu.: 2.175
## Median :0.2650 Median : 3.500
## Mean :0.3222 Mean : 4.309
## 3rd Qu.:0.4325 3rd Qu.: 5.850
## Max. :0.7900 Max. :13.100
library(car)
## Loading required package: carData
scatterplot(ageInYears ~ proportionBlack, data = lions, ellipse = list(levels = c(0.05, 0.95)), regLine = FALSE, smooth = FALSE, xlab = "Proportion of the Nose Black", ylab = "Age (in Years)", xlim = c(-0.5, 1), ylim = c(-2, 13.5))
Note the positive correlation between proportionBlack and age.
Now we fit the regression model:
lm.1 = lm(ageInYears ~ proportionBlack, data = lions)
summary(lm.1)
##
## Call:
## lm(formula = ageInYears ~ proportionBlack, data = lions)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.5449 -1.1117 -0.5285 0.9635 4.3421
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.8790 0.5688 1.545 0.133
## proportionBlack 10.6471 1.5095 7.053 7.68e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.669 on 30 degrees of freedom
## Multiple R-squared: 0.6238, Adjusted R-squared: 0.6113
## F-statistic: 49.75 on 1 and 30 DF, p-value: 7.677e-08
To interpret the slope, compute the predicted age at proportionBlack = 0.1 and at 0.2:
0.8790 + 10.6471 * 0.1 = 1.94371
0.8790 + 10.6471 * 0.2 = 3.00842
3.00842 - 1.94371 = 1.06471
A 0.1 increase in proportionBlack raises the predicted age by 1.06471 years, one-tenth of the slope.
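The same arithmetic can be reproduced directly from the reported coefficients (hard-coded below so the snippet stands alone):

```r
# Estimates taken from the summary(lm.1) output above
b0 <- 0.8790   # intercept
b1 <- 10.6471  # slope on proportionBlack

# Predicted age at proportionBlack = 0.1 and 0.2
age_01 <- b0 + b1 * 0.1  # 1.94371
age_02 <- b0 + b1 * 0.2  # 3.00842

# The difference equals the slope times the 0.1 change in the predictor
age_02 - age_01          # 1.06471
```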
We may want to generate predicted values of the dependent variable for specific values of the independent variable. We'll do this in two ways. First, some in-sample predictions: we use the actual values of proportionBlack in our sample to see what the predicted ages would be.
lions$predicted.age = predict(lm.1, newdata = lions)
Notice that the predicted age isn't the same as the actual age; the difference is prediction error. In a nutshell, proportionBlack isn't a perfect predictor of a lion's age, and even if it were, the randomness of nature would still lead to some prediction error. How much error is there?
lions$errors = resid(lm.1)
summary(lions)
## proportionBlack ageInYears predicted.age errors
## Min. :0.1000 Min. : 1.100 Min. :1.944 Min. :-2.5449
## 1st Qu.:0.1650 1st Qu.: 2.175 1st Qu.:2.636 1st Qu.:-1.1117
## Median :0.2650 Median : 3.500 Median :3.700 Median :-0.5285
## Mean :0.3222 Mean : 4.309 Mean :4.309 Mean : 0.0000
## 3rd Qu.:0.4325 3rd Qu.: 5.850 3rd Qu.:5.484 3rd Qu.: 0.9635
## Max. :0.7900 Max. :13.100 Max. :9.290 Max. : 4.3421
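The identity errors = actual - predicted holds for any lm fit; a minimal self-contained check on simulated data (not the lions data):

```r
# Simulated data with a known linear trend
set.seed(1)
d <- data.frame(x = 1:20)
d$y <- 5 + 2 * d$x + rnorm(20)

m <- lm(y ~ x, data = d)
d$predicted <- predict(m, newdata = d)
d$errors <- resid(m)

# Residuals are actual minus predicted, and they always average to zero
all.equal(unname(d$errors), d$y - unname(d$predicted))
mean(d$errors)  # essentially 0
```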
Now open up the data set and look at the actual age, the predicted age, and the errors, which are equal to ageInYears - predicted.age.
What if we wanted to predict age for two lions, one with proportionBlack = 0.40 and the other with proportionBlack = 0.80?
p.age1 = predict(lm.1, newdata = data.frame(proportionBlack = 0.40))
p.age1
## 1
## 5.137854
p.age2 = predict(lm.1, newdata = data.frame(proportionBlack = 0.80))
p.age2
## 1
## 9.396702
What is the difference between the two predicted ages? 4.258848.
How much did proportionBlack increase from 0.40 to 0.80? By 0.40.
What was the estimated slope on proportionBlack? 10.6471.
What if we multiply the change in proportionBlack by the slope? We obtain 0.40 * 10.6471 = 4.258840, which matches the difference in the predicted ages up to rounding of the coefficients.
We may also want confidence and prediction intervals around these predicted values. These intervals can be calculated as follows; note that the prediction interval will be wider than the confidence interval, since the prediction interval is for a single value of age while the confidence interval is for the mean value of age.
conf = predict(lm.1, newdata = lions, interval = "confidence")
int = predict(lm.1, newdata = lions, interval = "prediction")
my.data.c = cbind(lions, conf)
my.data.p = cbind(lions, int)
Open up my.data.c and eyeball the actual age, the predicted age (called fit), and the 95% confidence intervals listed as lwr and upr. Now repeat this with my.data.p, and it should be obvious that the prediction intervals are wider.
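That the prediction interval is wider at every point can be checked mechanically on any fit; a self-contained sketch on simulated data:

```r
# Simulated data and a simple fit
set.seed(7)
d <- data.frame(x = runif(40))
d$y <- 1 + 4 * d$x + rnorm(40, sd = 0.8)
m <- lm(y ~ x, data = d)

conf <- predict(m, newdata = d, interval = "confidence")  # interval for the mean response
pred <- predict(m, newdata = d, interval = "prediction")  # interval for a single new value

# Width = upr - lwr; the prediction interval is wider at every x
conf_width <- conf[, "upr"] - conf[, "lwr"]
pred_width <- pred[, "upr"] - pred[, "lwr"]
all(pred_width > conf_width)  # TRUE
```

This holds because the prediction-interval variance adds the residual variance on top of the uncertainty in the estimated mean.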
We can also plot the estimated regression line with confidence intervals via the visreg() function from the visreg package.
library(visreg)
visreg(lm.1, "proportionBlack", alpha = 0.05)
visreg(lm.1, "proportionBlack", alpha = 0.01)
Note that the first plot is for the 95% confidence interval while the second plot is for the 99% confidence interval.
Let us use the opsin data and focus on one factor, population, as the independent variable.
opsin = read.csv(url("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter18/chap18q18OpsinExpression.csv"))
summary(opsin)
## population waterClarity relativeExpressionOfSWS1
## Spring:18 Clear:16 Min. :0.03000
## Swamp :15 Tea :17 1st Qu.:0.07000
## Median :0.09000
## Mean :0.09364
## 3rd Qu.:0.12000
## Max. :0.16000
Let us fit a regression model:
lm.2 = lm(relativeExpressionOfSWS1 ~ population, data = opsin)
summary(lm.2)
##
## Call:
## lm(formula = relativeExpressionOfSWS1 ~ population, data = opsin)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.049333 -0.019333 0.000667 0.020667 0.054444
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.105556 0.007101 14.87 1.19e-15 ***
## populationSwamp -0.026222 0.010533 -2.49 0.0184 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.03013 on 31 degrees of freedom
## Multiple R-squared: 0.1666, Adjusted R-squared: 0.1397
## F-statistic: 6.198 on 1 and 31 DF, p-value: 0.01836
lm.3 = lm(relativeExpressionOfSWS1 ~ waterClarity, data = opsin)
summary(lm.3)
##
## Call:
## lm(formula = relativeExpressionOfSWS1 ~ waterClarity, data = opsin)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.056875 -0.021176 0.003125 0.018824 0.053125
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.106875 0.007556 14.144 4.58e-15 ***
## waterClarityTea -0.025699 0.010528 -2.441 0.0206 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.03023 on 31 degrees of freedom
## Multiple R-squared: 0.1612, Adjusted R-squared: 0.1342
## F-statistic: 5.958 on 1 and 31 DF, p-value: 0.02055
Notice that the slope is reported for the Swamp group: populationSwamp is the difference between the Swamp group and the baseline Spring group. So if we are interested in the predicted value of relative expression of SWS1 for the Spring group, we set the Swamp dummy to 0, giving 0.105556 - 0.026222 * 0 = 0.105556. This is the predicted relative expression of SWS1 for the Spring group.
What about for the Swamp group? This would be 0.105556 - 0.026222 = 0.079334.
What exactly are these values? Let us see:
with(opsin, tapply(relativeExpressionOfSWS1, population, FUN = mean))
## Spring Swamp
## 0.10555556 0.07933333
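In general, with a single two-level factor the intercept is the mean of the baseline group and intercept plus slope is the mean of the other group; a self-contained sketch on simulated data:

```r
# Two groups with different true means
set.seed(3)
d <- data.frame(group = factor(rep(c("A", "B"), each = 15)))
d$y <- ifelse(d$group == "A", 10, 7) + rnorm(30)

m <- lm(y ~ group, data = d)
means <- tapply(d$y, d$group, mean)

coef(m)[["(Intercept)"]]                        # equals mean(y) in group A
coef(m)[["(Intercept)"]] + coef(m)[["groupB"]]  # equals mean(y) in group B
```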
visreg(lm.2, "population", alpha = 0.05)
visreg(lm.2, "population", alpha = 0.01)
Can the son’s attractiveness be predicted from the father’s ornamentation? Use the guppies data given below to fit a regression model to these data.
guppies = read.csv("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter02/chap02e3bGuppyFatherSonAttractiveness.csv")
summary(guppies)
## fatherOrnamentation sonAttractiveness
## Min. :0.0300 Min. :-0.3200
## 1st Qu.:0.3275 1st Qu.: 0.1800
## Median :0.4550 Median : 0.4500
## Mean :0.4908 Mean : 0.4872
## 3rd Qu.:0.6100 3rd Qu.: 0.8025
## Max. :1.1200 Max. : 1.2300
lm.4 = lm(sonAttractiveness ~ fatherOrnamentation, data = guppies)
summary(lm.4)
##
## Call:
## lm(formula = sonAttractiveness ~ fatherOrnamentation, data = guppies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.66888 -0.14647 -0.02119 0.27727 0.51324
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.005084 0.118988 0.043 0.966
## fatherOrnamentation 0.982285 0.216499 4.537 6.78e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3212 on 34 degrees of freedom
## Multiple R-squared: 0.3771, Adjusted R-squared: 0.3588
## F-statistic: 20.59 on 1 and 34 DF, p-value: 6.784e-05
The estimated slope is 0.9823; as the father's ornamentation increases by 1 unit, the son's attractiveness increases by 0.9823 on average.
The adjusted R-squared is 0.3588; we can explain about 35.88% of the variation in sons' attractiveness with fathers' ornamentation.
predict(lm.4, newdata = data.frame(fatherOrnamentation = 0.25))
## 1
## 0.2506552
predict(lm.4, newdata = data.frame(fatherOrnamentation = 0.75))
## 1
## 0.7417978
Use visreg() to plot the estimated regression line and the 95% confidence bands:
visreg(lm.4, "fatherOrnamentation")
Using the flies
data given below, fit a regression model with lifespan as the dependent variable and fertility as the independent variable.
flies = read.csv(url("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter18/chap18q07flyLifeSpan.csv"))
summary(flies)
## lifespanDays treatment fertility
## Min. : 2.00 high-cost:420 fertile:420
## 1st Qu.:22.00 low-cost :422 sterile:422
## Median :27.00
## Mean :27.74
## 3rd Qu.:33.00
## Max. :56.00
lm.5 = lm(lifespanDays ~ fertility, data = flies)
summary(lm.5)
##
## Call:
## lm(formula = lifespanDays ~ fertility, data = flies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -26.8341 -5.6357 -0.8341 5.9655 29.3643
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26.6357 0.4368 60.985 < 2e-16 ***
## fertilitysterile 2.1984 0.6169 3.563 0.000387 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.951 on 840 degrees of freedom
## Multiple R-squared: 0.01489, Adjusted R-squared: 0.01372
## F-statistic: 12.7 on 1 and 840 DF, p-value: 0.0003867
The residual standard error is 8.951; on average, the predicted lifespan will be greater or smaller than the true lifespan by about 8.951 days.
The adjusted R-squared is 0.0137; we can explain about 1.37% of the variation in lifespan with fertility
The predicted lifespan is 26.6357 days for the fertile group and 26.6357 + 2.1984 = 28.8341 days for the sterile group.
Use visreg() to plot the estimated regression line and the 95% confidence bands:
visreg(lm.5, "fertility")