
Correlations and Linear Regressions (continued)

MPA 6010

Ani Ruhil

1 / 28

Agenda

  1. Multiple linear regression

  2. Interaction effects

    (a) Two categorical independent variables

    (b) One categorical and one continuous independent variables

    (c) Two continuous independent variables

  3. Some closing thoughts on model fit

2 / 28

The Multiple Regression Model

3 / 28

A Model with Two Independent Variables: x1 & x2

Population Regression Function: $y = a + b_1 x_1 + b_2 x_2 + \epsilon$

Sample Regression Function: $y = \hat{a} + \hat{b}_1 x_1 + \hat{b}_2 x_2 + \hat{e}$

Note: $\hat{b}_1$ and $\hat{b}_2$ are the partial regression coefficients or the partial slope coefficients

Why are they partial? They are partial in the sense that ...

  • $\hat{b}_1$ = the impact of a unit change in $x_1$ on the mean of $y$ holding $x_2$ constant
  • $\hat{b}_2$ = the impact of a unit change in $x_2$ on the mean of $y$ holding $x_1$ constant

$R^2 = \dfrac{SSTR}{SST}$, with $0 \leq R^2 \leq 1$

Note: $R^2$ is for multiple regression; $r^2$ is for bivariate regression
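
To make partial slopes and R-square concrete, here is a minimal Python sketch that fits a two-predictor regression by ordinary least squares. The toy data, the helper name `ols`, and the use of 1 - SSE/SST for R-square are illustrative assumptions, not course materials. If SSTR on the slide denotes the explained sum of squares, this is the same quantity for a model with an intercept.

```python
def ols(X, y):
    """Solve the normal equations (X'X) b = X'y by Gauss-Jordan elimination."""
    n, k = len(X), len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [
        [sum(X[r][i] * X[r][j] for r in range(n)) for j in range(k)]
        + [sum(X[r][i] * y[r] for r in range(n))]
        for i in range(k)
    ]
    for c in range(k):
        # Partial pivoting: bring the largest remaining entry in column c to the diagonal
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        pivot = A[c][c]
        A[c] = [v / pivot for v in A[c]]
        for r in range(k):
            if r != c:
                f = A[r][c]
                A[r] = [vr - f * vc for vr, vc in zip(A[r], A[c])]
    return [A[i][k] for i in range(k)]


# Toy data (made up): y = 1 + 2*x1 + 3*x2 plus residuals chosen to be
# uncorrelated with both predictors, so the fit recovers 1, 2 and 3.
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2, 1, 4, 3, 6, 5]
resid = [0.5, 0.5, -1.0, -1.0, 0.5, 0.5]
y = [1 + 2 * u + 3 * v + e for u, v, e in zip(x1, x2, resid)]

X = [[1.0, u, v] for u, v in zip(x1, x2)]  # column of 1s for the intercept
a_hat, b1_hat, b2_hat = ols(X, y)

fitted = [a_hat + b1_hat * u + b2_hat * v for u, v in zip(x1, x2)]
y_bar = sum(y) / len(y)
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # unexplained variation
sst = sum((yi - y_bar) ** 2 for yi in y)                # total variation
r2 = 1 - sse / sst

print(f"a = {a_hat:.3f}, b1 = {b1_hat:.3f}, b2 = {b2_hat:.3f}, R2 = {r2:.4f}")
```

Here `b1_hat` is the change in the fitted value when x1 rises by one unit and x2 stays where it is, which is what "partial" means above.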

4 / 28

For example, with the credit.sav dataset we could estimate the following regression model:

Rating=α+β1(Income)+β2(No. of Active Cards)+ϵ

$\hat{b}_1 = 3.480$: Holding the number of active cards fixed, as income increases by 1 unit (in reality, an increase of 1,000 USD), predicted credit rating increases by about 3.48

$\hat{b}_2 = 7.641$: Holding income fixed, as the number of active credit cards increases by 1, predicted credit rating increases by about 7.641

5 / 28

Adjusted R-Square = 0.629 ... This model predicts/explains about 62.9% of the variation in credit rating

Std. Error of the Estimate = 94.242 ... The average prediction error you can expect when using this model to predict credit ratings is ±94.242

Estimated Regression: Rating=174.996+3.48(Income)+7.641(Cards)

6 / 28

Estimated Regression: Rating=174.996+3.48(Income)+7.641(Cards)

To generate predicted values, you now need to insert specific values of each independent variable. The recommendation is to use either particular values of interest or a specific set of values (see below).

(a) Calculate the Minimum, Maximum, Median, Mean, and Standard Deviation of each independent variable

(b) Calculate values that are 1 Standard Deviation above/below the Mean, and then 2 Standard Deviations above/below the Mean of each independent variable.

Check to ensure the ±1SD and ±2SD are plausible in-sample values

(c) Use the Median of each independent variable to calculate the outcome

(d) Predict the outcome holding one independent variable at its Mean and varying the other by cycling through appropriate values, such as

Min; -2SD; -1SD; Mean; +1SD; +2SD; Max

(e) Present the results graphically, one graph per varying independent variable

7 / 28

Predicted values as income changes

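| Statistic | Income (1,000 USD) | Cards (held fixed) | Predicted Rating |
|---|---|---|---|
| Min | 10.35 | 3 | 233.9370 |
| -2 SD | NA | 3 | NA |
| -1 SD | NA | 3 | NA |
| Mean | 45.2189 | 3 | 355.2808 |
| Median | 33.1155 | 3 | 313.1609 |
| +1 SD | 80.4631 | 3 | 477.9306 |
| +2 SD | 115.7073 | 3 | 600.5804 |
| Max | 186.63 | 3 | 847.3914 |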

The -1SD and -2SD values of Income were implausible (they fall below the observed minimum) and hence not used

With Income as distributed in this dataset, I would just present the predicted Rating when Income is at its Min, Median, and Max, respectively, holding Cards fixed at their Median value of 3.
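
The predicted ratings in the table can be reproduced from the estimated regression. The sketch below is a hedged illustration: the function name `predict_rating` is made up, and the standard deviation is not reported on the slide but inferred from the gap between the Mean row and the row one standard deviation above it.

```python
def predict_rating(income, cards):
    """Estimated regression from the slide; income is in thousands of USD."""
    return 174.996 + 3.48 * income + 7.641 * cards


income_min, income_mean, income_median, income_max = 10.35, 45.2189, 33.1155, 186.63
income_sd = 80.4631 - 45.2189  # SD implied by the table (assumption)

# Mean - 1 SD already falls below the smallest observed income,
# which is why the lower SD rows are not usable.
below_min = income_mean - income_sd < income_min
print("Mean - 1 SD below the observed minimum:", below_min)

cards_fixed = 3  # median number of active cards
for label, income in [("Min", income_min), ("Median", income_median), ("Max", income_max)]:
    print(f"{label:>6}: {predict_rating(income, cards_fixed):.4f}")
```

The three printed values correspond to the Min, Median, and Max rows of the table.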

8 / 28

Since number of active cards is an integer, I would just cycle through the actual number of active cards we see in the dataset

Since the upper tail is thin we may want to stop at 6

9 / 28

Plotting predicted ratings as number of cards increases

| Income (held fixed) | Cards | Predicted Rating |
|---|---|---|
| 33.1155 | 1 | 297.8789 |
| 33.1155 | 2 | 305.5199 |
| 33.1155 | 3 | 313.1609 |
| 33.1155 | 4 | 320.8019 |
| 33.1155 | 5 | 328.4429 |
| 33.1155 | 6 | 336.0839 |

10 / 28

Interaction Effects

11 / 28

Interaction Effects

The effect of an independent variable may not be constant across the values of the other independent variable. For example, maybe at low levels of education there is no difference in hourly wages of men and women. However, a difference is visible for those with more than a high school degree

This may be a suspicion, a hypothesis that we would like to test. If we wish to do so, then we need to modify the regression model

Three common types of interactions:

  • An interaction between two categorical variables
  • An interaction between one categorical variable and one continuous variable
  • An interaction between two continuous variables

Interactions ought to be specified on the basis of theory (preferred). However, do not hesitate to test for them if initial exploratory analysis (descriptive) of the data suggests interesting patterns

12 / 28

An Interaction Between Two Categorical Variables

y is a continuous variable (wage)

X1 is a dummy variable (female) that assumes two values (1 = Female; 0 = Male)

X2 is a dummy variable (single) that assumes two values (1 = Single; 0 = Not Single)

X3=(X1)×(X2)

wage=a+b1(female)+b2(single)+b3(single×female)+e

| single | female | Interaction = single × female | Who is this? |
|---|---|---|---|
| 0 | 0 | 0 × 0 = 0 | not single male |
| 0 | 1 | 0 × 1 = 0 | not single female |
| 1 | 0 | 1 × 0 = 0 | single male |
| 1 | 1 | 1 × 1 = 1 | single female |
13 / 28

wage=a+b1(female)+b2(single)+b3(single×female)+e

Single Female: $10.876 - 3.192(1) - 2.521(1) + 3.097(1) = 8.260$

Married Female: $10.876 - 3.192(1) - 2.521(0) + 3.097(0) = 7.684$

Single Male: $10.876 - 3.192(0) - 2.521(1) + 3.097(0) = 8.355$

Married Male: $10.876 - 3.192(0) - 2.521(0) + 3.097(0) = 10.876$

The intercept is the estimated hourly wage for a married male
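
The four group predictions can be checked with a few lines of Python. This is a sketch under assumptions: the function and constant names are hypothetical, and the coefficient signs are inferred from the totals reported on the slide.

```python
# Coefficients as read from the slide's arithmetic (signs are an assumption
# inferred from the reported totals).
A, B_FEMALE, B_SINGLE, B_INTERACTION = 10.876, -3.192, -2.521, 3.097


def predict_wage(female, single):
    """Predicted hourly wage; female and single are 0/1 dummy variables."""
    return A + B_FEMALE * female + B_SINGLE * single + B_INTERACTION * female * single


for female in (0, 1):
    for single in (0, 1):
        who = f"{'single' if single else 'married'} {'female' if female else 'male'}"
        print(f"{who:>15}: {predict_wage(female, single):.3f}")

# The male-female gap depends on marital status; that is the interaction.
gap_married = predict_wage(0, 0) - predict_wage(1, 0)
gap_single = predict_wage(0, 1) - predict_wage(1, 1)
print(f"gap (married) = {gap_married:.3f}, gap (single) = {gap_single:.3f}")
```

The two gaps differ by exactly the interaction coefficient, so b3 measures how much the female effect changes between married and single respondents.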

14 / 28

One Categorical and One Continuous Variable

wage=a+b1(female)+b2(age)+b3(female×age)+ϵ

No main effect of Female (p-value = 0.384)

Main effect of age (p-value = 0.000)

Interaction effect of Female and age (p-value = 0.009) ... as age increases, the wage-gap worsens for females

15 / 28

Generating predicted values

Use the estimated regression model to

  • Calculate predicted values of wage for males with varying ages

  • Calculate predicted values of wage for females with the same ages you used above

Run a frequency table for age and use the ages you see in the table

Values will be 18 through 64, in one-year increments

16 / 28

$wage = 5.282 + 1.231(female) + 0.131(age) - 0.095(female \times age)$

Calculate Minimum (18), Median (35), Maximum (64) of age

Set age at Minimum and calculate predicted hourly wage for Men versus Women

Set age at Median and calculate predicted hourly wage for Men versus Women

Set age at Maximum and calculate predicted hourly wage for Men versus Women

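| Sex | age = 18 | age = 28 | age = 35 | age = 44 | age = 64 |
|---|---|---|---|---|---|
| Female | 7.162694 | 7.523649 | 7.776317 | 8.101176 | 8.823084 |
| Male | 7.639764 | 8.949691 | 9.866640 | 11.045575 | 13.665429 |
| Difference (Male - Female) | 0.4770696 | 1.4260426 | 2.0903237 | 2.9443994 | 4.8423453 |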

Note the increasing wage-gap as age increases
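
A short Python sketch of the same calculation follows. It assumes the rounded coefficients shown on the slide, with a negative sign on the interaction term, so its output will differ from the table in the third decimal place. The function name is hypothetical.

```python
def predict_wage(female, age):
    """Rounded coefficients from the estimated model; female is a 0/1 dummy."""
    return 5.282 + 1.231 * female + 0.131 * age - 0.095 * female * age


ages = [18, 28, 35, 44, 64]
gaps = []
for age in ages:
    male = predict_wage(0, age)
    female = predict_wage(1, age)
    gaps.append(male - female)  # equals -1.231 + 0.095 * age
    print(f"age {age}: male = {male:.3f}, female = {female:.3f}, gap = {male - female:.3f}")
```

Because the gap is a linear function of age with slope 0.095, it widens by about 9.5 cents per hour for every additional year of age.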

17 / 28

Two Continuous Variables

y=a+b1(x1)+b2(x2)+b3(x3)+ϵ where x3=(x1)×(x2)

If b3>0, the higher x1 is, the larger the effect of x2 on y; likewise, the higher x2 is, the larger the effect of x1 on y

If b3<0, the pattern reverses (i.e., the higher xi is, the smaller the effect of xj on y)

Note also that b1 and b2 now reflect conditional relationships:

  • b1 is the effect of x1 on y when x2=0.

  • b2 is the effect of x2 on y when x1=0
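
These conditional interpretations follow from differentiating the model with respect to each variable. A short derivation, using the same symbols as the equation above:

```latex
\begin{aligned}
y &= a + b_1 x_1 + b_2 x_2 + b_3 (x_1 \times x_2) + \epsilon \\
\frac{\partial y}{\partial x_1} &= b_1 + b_3 x_2 \\
\frac{\partial y}{\partial x_2} &= b_2 + b_3 x_1
\end{aligned}
```

Setting x2 = 0 in the second line leaves b1 alone, and setting x1 = 0 in the third line leaves b2 alone, which is why each coefficient describes the effect only at a zero value of the other variable.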

18 / 28

When interpreting our regression coefficients, we are forced to say that the effect of a unit change in one variable depends upon the value of the other variable

If you have an interaction effect in your model, you must also include x1 and x2 themselves, even if they are not statistically significant

Calculating impacts of variables ...

  • Hold xi at its mean and tweak xj by ±1, ±2 standard deviations, or

  • Hold xi at its median and change xj by discrete units (e.g., for a variable that ranges from 0% to 100%, you could step from MIN to MAX in increments of 10%) ... This is the preferred strategy!!

19 / 28

Calculate Minimum, Median, Maximum for age and for exper, respectively

Now hold age at Minimum and tweak exper

Now hold exper at Minimum and tweak age

Repeat by setting each, in turn, at Median, then at Maximum

With missing data, calculate all descriptive statistics for the estimation sample rather than the full sample, because the two can differ

20 / 28

$wage = -12.193 + 0.957(age) - 0.610(exper) - 0.004(age \times exper)$

| exper | age = 18 | age = 28 | age = 35 | age = 44 | age = 64 |
|---|---|---|---|---|---|
| 0 | 5.033918 | 14.604271 | 21.303518 | 29.916835 | 49.057541 |
| 8 | -0.4178636 | 8.8359545 | 15.3136271 | 23.6420634 | 42.1496995 |
| 15 | -5.188172 | 3.788678 | 10.072473 | 18.151638 | 36.105338 |
| 26 | -12.684372 | -4.142757 | 1.836373 | 9.523827 | 26.607057 |
| 55 | -32.447081 | -25.052904 | -19.876980 | -13.222221 | 1.566133 |

What seems odd about this table's structure and data??

21 / 28

Use feasible in-sample values

  • age = 18, exper = 0

  • age = 28, exper = (Min = 5, Max = 15)

  • age = 35, exper = (Min = 11, Max = 21)

  • age = 44, exper = (Min = 22, Max = 29)

  • age = 64, exper = (Min = 40, Max = 65)

It is important to scan the data to avoid generating impossible predictions. Better yet, plot the predicted values, as in the figure that follows.
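
One way to scan a grid of predictions for problem cells is sketched below in Python. The coefficient signs are inferred from the prediction table, and the two screening rules (experience cannot exceed age; a negative hourly wage is unusable) are hypothetical illustrations rather than rules from the course.

```python
def predict_wage(age, exper):
    """Rounded coefficients; signs are an assumption inferred from the table."""
    return -12.193 + 0.957 * age - 0.610 * exper - 0.004 * age * exper


ages = [18, 28, 35, 44, 64]
expers = [0, 8, 15, 26, 55]

flagged = []
for exper in expers:
    for age in ages:
        wage = predict_wage(age, exper)
        # Hypothetical screening rules: experience cannot exceed age,
        # and a negative hourly wage is not a usable prediction.
        if exper > age or wage < 0:
            flagged.append((age, exper, round(wage, 3)))

for age, exper, wage in flagged:
    print(f"age = {age}, exper = {exper}: predicted wage {wage} -> check this cell")
```

Every flagged cell pairs an age with an experience level that is unlikely or impossible in the sample, which is the reason to restrict predictions to combinations actually observed in the data.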

22 / 28

23 / 28

Fitting Regression Models to these Data

u5mr is the dependent variable (aka the outcome of interest)

  • with 1 independent variable (say, GDP per capita)
  • with 2 independent variables (say, GDP per capita and Adult literacy rate)

NOTE!! Convert GDP per capita into GDP per capita in 1,000 USD

Some cautions ... Regression models rest on several requirements

  • Your independent variables cannot be very highly correlated
  • The dependent variable should have equal variance for every value of the independent variable
  • Your sample size should be large (at minimum 30 observations for every independent variable you wish to use)
25 / 28

U5MR=Constant+b1(GDP)

U5MR=Constant+b1(GDP)+b2(AdultLiteracy)

Interpret the partial slopes, adjusted R-Square, and the Standard Error of the Estimate

Notice the jump in the R-Square and the Adjusted R-Square
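
The adjusted R-square penalizes R-square for the number of independent variables. A minimal Python sketch of the standard formula follows; the function name and the input numbers are hypothetical and are not results from the u5mr data.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-square for a model with k independent variables and n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


# Hypothetical R-square values for a one-variable and a two-variable model
one_iv = adjusted_r2(0.40, n=100, k=1)
two_iv = adjusted_r2(0.60, n=100, k=2)
print(f"1 IV: adjusted R2 = {one_iv:.4f}")
print(f"2 IVs: adjusted R2 = {two_iv:.4f}")
```

The adjusted value is always a little below the raw R-square, and it rises when a new variable is added only if that variable improves the fit by more than the penalty for the extra coefficient.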

26 / 28

An aside on underlying hypotheses being tested here

H0: GDP per capita has no impact on child mortality, i.e., $H_0: b_1 = 0$
H1: GDP per capita has an impact on child mortality, i.e., $H_1: b_1 \neq 0$

H0: Adult Literacy has no impact on child mortality, i.e., $H_0: b_2 = 0$
H1: Adult Literacy has an impact on child mortality, i.e., $H_1: b_2 \neq 0$

H0: As GDP per capita increases child mortality stays the same or increases, i.e., $H_0: b_1 \geq 0$
H1: As GDP per capita increases child mortality decreases, i.e., $H_1: b_1 < 0$

H0: As Adult Literacy increases child mortality stays the same or increases, i.e., $H_0: b_2 \geq 0$
H1: As Adult Literacy increases child mortality decreases, i.e., $H_1: b_2 < 0$

NOTE!! Two-tailed when we don't know what to expect but one-tailed when we have very specific impacts we hypothesize should be evident
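
Statistical software usually reports two-tailed p-values for slopes. The Python sketch below shows one common way to convert such a value for a one-tailed test, assuming a symmetric test statistic such as t. The function name and the example numbers are hypothetical.

```python
def one_tailed_p(two_tailed_p, estimate, expected_sign):
    """Convert a two-tailed p-value for a slope into a one-tailed p-value.

    expected_sign is -1 if H1 says the slope is negative, +1 if positive.
    """
    same_direction = (estimate < 0) == (expected_sign < 0)
    # Halve the p-value only when the estimate points in the hypothesized direction
    return two_tailed_p / 2 if same_direction else 1 - two_tailed_p / 2


# Hypothetical output for a slope hypothesized to be negative
p_match = one_tailed_p(0.08, estimate=-1.2, expected_sign=-1)
p_wrong = one_tailed_p(0.08, estimate=1.2, expected_sign=-1)
print(f"estimate in hypothesized direction: p = {p_match:.2f}")
print(f"estimate in opposite direction:     p = {p_wrong:.2f}")
```

A slope with the wrong sign can never support a one-tailed alternative, no matter how small the two-tailed p-value is.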

27 / 28

Could we build a better model?

Use the following independent variables:

(1) female youth literacy rates
(2) male youth literacy rates
(3) percent of the GDP spent on health
(4) density of nurses and midwives
(5) percent of the total population living in urban areas

What is the Adjusted R-Square now?

What is the average prediction error now?

How many independent variables are significant now?

Could some of our independent variables be highly correlated? Check!!
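
A quick way to check is to compute pairwise correlations among the independent variables. The Python sketch below uses made-up literacy rates purely for illustration; the helper name `pearson_r` and the 0.8 cutoff are assumptions, not course rules.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Made-up values for five countries (illustration only)
female_youth_literacy = [60, 70, 80, 90, 95]
male_youth_literacy = [65, 74, 83, 91, 96]

r = pearson_r(female_youth_literacy, male_youth_literacy)
print(f"r = {r:.3f}")
if abs(r) > 0.8:  # rule-of-thumb cutoff (assumption)
    print("These two variables are highly correlated; consider keeping only one.")
```

With real data, repeat the calculation for every pair of independent variables before interpreting the individual slopes.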

28 / 28
