Correlations and Linear Regressions (continued)

MPA 6010

Ani Ruhil

1 / 28

Agenda

Multiple linear regression
Interaction effects

(a) Two categorical independent variables

(b) One categorical and one continuous independent variables

(c) Two continuous independent variables
Some closing thoughts on model fit

2 / 28

The Multiple Regression Model

3 / 28

A Model with Two Independent Variables: $x_1$ & $x_2$

Population Regression Function: $y = a + b_1(x_1) + b_2(x_2) + \epsilon$

Sample Regression Function: $y = \hat{a} + \hat{b_1}(x_1) + \hat{b_2}(x_2) + \hat{e}$

Note: $\hat{b_1}$ and $\hat{b_2}$ are the partial regression coefficients or the partial slope coefficients

Why are they partial? They are partial in the sense that ...

$\hat{b_1} =$ the impact of a unit change in $x_1$ on the mean of $y$ holding $x_2$ constant
$\hat{b_2} =$ the impact of a unit change in $x_2$ on the mean of $y$ holding $x_1$ constant

$R^2 = \dfrac{SSTR}{SST}$ , with $0 \leq R^2 \leq 1$

Note: $R^2$ is for multiple regression; $r^2$ is for bivariate regression
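To make the sample regression function concrete, here is a minimal sketch that estimates $\hat{a}$, $\hat{b_1}$, and $\hat{b_2}$ by solving the least-squares normal equations in plain Python. The function name and the data are assumptions for illustration: the data are made up and noise-free, generated from $y = 1 + 2(x_1) + 3(x_2)$, so the fit should recover those three numbers. In practice you would let your statistics package do this.

```python
def fit_two_predictor_ols(x1, x2, y):
    """Return (a_hat, b1_hat, b2_hat) by solving the normal equations (X'X)b = X'y."""
    n = len(y)
    # Columns of the design matrix: intercept, x1, x2.
    cols = [[1.0] * n, list(x1), list(x2)]
    xtx = [[sum(u * v for u, v in zip(ci, cj)) for cj in cols] for ci in cols]
    xty = [sum(u * v for u, v in zip(ci, y)) for ci in cols]

    # Gauss-Jordan elimination with partial pivoting on the augmented 3x4 system.
    aug = [row + [rhs] for row, rhs in zip(xtx, xty)]
    for i in range(3):
        pivot = max(range(i, 3), key=lambda r: abs(aug[r][i]))
        aug[i], aug[pivot] = aug[pivot], aug[i]
        for r in range(3):
            if r != i:
                factor = aug[r][i] / aug[i][i]
                aug[r] = [vr - factor * vi for vr, vi in zip(aug[r], aug[i])]
    return tuple(aug[i][3] / aug[i][i] for i in range(3))


# Made-up, noise-free data generated from y = 1 + 2*x1 + 3*x2 (illustration only).
x1 = [0, 1, 2, 3, 4, 5]
x2 = [1, 0, 2, 1, 3, 0]
y = [1 + 2 * u + 3 * v for u, v in zip(x1, x2)]

a_hat, b1_hat, b2_hat = fit_two_predictor_ols(x1, x2, y)
print(round(a_hat, 6), round(b1_hat, 6), round(b2_hat, 6))
```

Each estimated slope is "partial" because it is solved for jointly with the other slope rather than from a separate bivariate fit.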

4 / 28

For example, with the credit.sav dataset we could estimate the following regression model:

$Rating = \alpha + \beta_1(Income) + \beta_2(\text{No. of Active Cards}) + \epsilon$

$\hat{b_1} = 3.480$ : Holding the number of active cards fixed, as income increases by 1 (which is in reality an increase of 1 thousand USD), credit rating increases by about 3.48

$\hat{b_2} = 7.641$ : Holding income fixed, as the number of active credit cards increases by 1, credit rating increases by 7.641

5 / 28

Adjusted R-Square = 0.629 ... This model predicts/explains about 62.9% of the variation in credit rating

Std. Error of the Estimate = 94.242 ... The average prediction error you can expect when using this model to predict credit ratings is $\pm 94.242$
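The three fit statistics used so far can be computed from the observed and predicted values. The sketch below uses the standard formulas; the function name and the five data points are made up for illustration and are not from credit.sav. It relies on $SSTR = SST - SSE$, which holds for a least-squares fit that includes an intercept.

```python
import math


def fit_statistics(y, y_hat, k):
    """Return (R-square, adjusted R-square, std. error of the estimate).

    k is the number of independent variables, not counting the intercept.
    """
    n = len(y)
    y_bar = sum(y) / n
    sst = sum((yi - y_bar) ** 2 for yi in y)               # total variation
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # unexplained variation
    sstr = sst - sse                                       # explained variation
    r2 = sstr / sst
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    see = math.sqrt(sse / (n - k - 1))
    return r2, adj_r2, see


# Made-up toy values: y_hat is the least-squares fit of y on x = 1..5.
y = [2, 4, 5, 4, 5]
y_hat = [2.8, 3.4, 4.0, 4.6, 5.2]
r2, adj_r2, see = fit_statistics(y, y_hat, k=1)
print(round(r2, 4), round(adj_r2, 4), round(see, 4))
```

The adjusted R-Square penalizes each added independent variable, which is why it is the statistic to compare across models with different numbers of predictors.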

Estimated Regression: $Rating = 174.996 + 3.48\left(Income\right) + 7.641\left(Cards\right)$

6 / 28

Estimated Regression: $Rating = 174.996 + 3.48\left(Income\right) + 7.641\left(Cards\right)$

To generate predicted values, insert specific values of each independent variable into the estimated regression. Use either particular values of interest or a standard set of values built as follows.

(a) Calculate the Minimum, Maximum, Median, Mean, and Standard Deviation of each independent variable

(b) Calculate values that are 1 Standard Deviation above/below the Mean, and then 2 Standard Deviations above/below the Mean of each independent variable.

(c) Check to ensure the $\pm 1 SD$ and $\pm2 SD$ values are plausible in-sample values

(d) Predict the outcome holding one independent variable at its Mean and varying the other by cycling through the appropriate values:

$Min; -2 SD; -1 SD; Mean; 1 SD; 2 SD; Max$

(e) Present the results graphically, one graph per varying independent variable

7 / 28

Predicted values as income changes

Statistic	Income	Cards (held fixed)	Predicted Rating
Min	10.35	3	233.9370
-2 SD	NA	3	NA
-1 SD	NA	3	NA
Mean	45.2189	3	355.2808
Median	33.1155	3	313.1609
1 SD	80.4631	3	477.9306
2 SD	115.7073	3	600.5804
Max	186.63	3	847.3914

The $-1 SD$ and $-2 SD$ values fall below the minimum observed Income (10.35), so they are implausible and were not used

With Income as distributed in this dataset, I would just present the predicted Rating when Income is at its Min, Median, and Max, respectively, holding Cards fixed at their Median value of 3.
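The table above can be reproduced from the estimated regression. This is a minimal sketch: the function and variable names are mine, and the standard deviation is not reported directly on the slide, so it is backed out from the "1 SD" row as (Mean + 1 SD) minus Mean. Candidate values outside the observed Min to Max range are skipped, which is one simple way to apply the plausibility check.

```python
def predict_rating(income, cards):
    """Estimated regression from the slides; income is in thousands of USD."""
    return 174.996 + 3.48 * income + 7.641 * cards


# Income summary statistics as reported in the table.
inc_min, inc_mean, inc_median, inc_max = 10.35, 45.2189, 33.1155, 186.63
inc_sd = 80.4631 - 45.2189  # assumption: implied by the "1 SD" row

candidates = {
    "Min": inc_min,
    "-2 SD": inc_mean - 2 * inc_sd,
    "-1 SD": inc_mean - inc_sd,
    "Mean": inc_mean,
    "Median": inc_median,
    "1 SD": inc_mean + inc_sd,
    "2 SD": inc_mean + 2 * inc_sd,
    "Max": inc_max,
}

cards_fixed = 3  # median number of active cards
predictions = {}
for label, income in candidates.items():
    # Keep only values inside the observed range of Income.
    if inc_min <= income <= inc_max:
        predictions[label] = round(predict_rating(income, cards_fixed), 4)

for label, rating in predictions.items():
    print(f"{label:>6}: {rating}")
```

Swapping the roles of the two variables (Income fixed at its Median, Cards cycling through 1 to 6) gives the predictions for the number of active cards.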

8 / 28

Since number of active cards is an integer, I would just cycle through the actual number of active cards we see in the dataset

Since the upper tail is thin we may want to stop at 6

9 / 28

Plotting predicted ratings as number of cards increases

Income (held fixed)	Cards	Predicted Rating
33.1155	1	297.8789
33.1155	2	305.5199
33.1155	3	313.1609
33.1155	4	320.8019
33.1155	5	328.4429
33.1155	6	336.0839

10 / 28

Interaction Effects

11 / 28

Interaction Effects

The effect of an independent variable may not be constant across the values of the other independent variable. For example, maybe at low levels of education there is no difference in hourly wages of men and women. However, a difference is visible for those with more than a high school degree

This may be a suspicion, a hypothesis that we would like to test. If we wish to do so, then we need to modify the regression model

Three common types of interactions:

An interaction between two categorical variables
An interaction between one categorical variable and one continuous variable
An interaction between two continuous variables

Interactions ought to be specified on the basis of theory (preferred). However, do not hesitate to test for them if initial exploratory analysis (descriptive) of the data suggests interesting patterns

12 / 28

An Interaction Between Two Categorical Variables

$y$ is a continuous variable (wage)

$X_1$ is a dummy variable (female) that assumes two values (1 = Female; 0 = Male)

$X_2$ is a dummy variable (single) that assumes two values (1 = Single; 0 = Not Single)

$X_3 = (X_1) \times (X_2)$

$wage = a + b_1(female) + b_2(single) + b_3(single \times female) + e$

single	female	Interaction = single x female	Who is this?
0	0	0 x 0 = 0	not single male
0	1	0 x 1 = 0	not single female
1	0	1 x 0 = 0	single male
1	1	1 x 1 = 1	single female

13 / 28

$wage = a + b_1(female) + b_2(single) + b_3(single \times female) + e$

Single Female: $= 10.876 - 3.192(1) - 2.521(1) + 3.097(1) = 8.260$

Married Female: $= 10.876 - 3.192(1) - 2.521(0) + 3.097(0) = 7.684$

Single Male: $= 10.876 - 3.192(0) - 2.521(1) + 3.097(0) = 8.355$

Married Male: $= 10.876 - 3.192(0) - 2.521(0) + 3.097(0) = 10.876$

The intercept is the estimated hourly wage for a married male
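The four predicted wages follow from plugging each 0/1 combination into the estimated equation. A small sketch, with function and variable names chosen for illustration, that also shows what the interaction means: the male-female gap is not the same for single and married workers.

```python
def predict_wage(female, single):
    """Estimated wage model with a female x single interaction."""
    return 10.876 - 3.192 * female - 2.521 * single + 3.097 * (female * single)


groups = {
    "Single female": (1, 1),
    "Married female": (1, 0),
    "Single male": (0, 1),
    "Married male": (0, 0),
}
cell_means = {name: round(predict_wage(f, s), 3) for name, (f, s) in groups.items()}
for name, wage in cell_means.items():
    print(f"{name}: {wage}")

# The gender gap depends on marital status; the two gaps differ by b3 = 3.097.
gap_married = cell_means["Married male"] - cell_means["Married female"]
gap_single = cell_means["Single male"] - cell_means["Single female"]
print(round(gap_married, 3), round(gap_single, 3))
```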

14 / 28

One Categorical and One Continuous Variable

$wage = a + b_1(female) + b_2(age) + b_3(female \times age) + \epsilon$

No main effect of Female (p-value = 0.384)

Main effect of age (p-value < 0.001)

Interaction effect of Female and age (p-value = 0.009) ... as age increases, the wage-gap worsens for females

15 / 28

Generating predicted values

Use the estimated regression model to

Calculate predicted values of wage for males with varying ages
Calculate predicted values of wage for females with the same ages you used above

Run a frequency table for age and use the ages you see in the table

Values will be 18 through 64, in one-year increments

16 / 28

$wage = 5.282 + 1.231(female) + 0.131(age) - 0.095(female \times age)$

Calculate Minimum $(18)$ , Median $(35)$ , Maximum $(64)$ of age

Set age at Minimum and calculate predicted hourly wage for Men versus Women

Set age at Median and calculate predicted hourly wage for Men versus Women

Set age at Maximum and calculate predicted hourly wage for Men versus Women

Sex	age = 18	age = 28	age = 35	age = 44	age = 64
Female	7.162694	7.523649	7.776317	8.101176	8.823084
Male	7.639764	8.949691	9.866640	11.045575	13.665429
Difference (Male - Female)	0.4770696	1.4260426	2.0903237	2.9443994	4.8423453

Note the increasing wage-gap as age increases
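A short sketch of the same calculation from the estimated equation (names are illustrative). Because it uses the rounded coefficients shown on the slide, the results differ from the table in the second or third decimal place. The gap works out to $0.095(age) - 1.231$, so it grows by about 0.095 for each additional year of age.

```python
def predict_wage(female, age):
    """Estimated wage model with a female x age interaction (rounded coefficients)."""
    return 5.282 + 1.231 * female + 0.131 * age - 0.095 * female * age


ages = (18, 28, 35, 44, 64)
gaps = []
for age in ages:
    male = predict_wage(0, age)
    female = predict_wage(1, age)
    gaps.append(male - female)
    print(f"age {age}: female {female:.3f}, male {male:.3f}, gap {male - female:.3f}")
```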

17 / 28

Two Continuous Variables

$y = a + b_1(x_1) + b_2(x_2) + b_3(x_3) + \epsilon$ where $x_3 = (x_1) \times (x_2)$

If $b_3 > 0$, the higher $x_1$ is, the larger the effect of $x_2$ on $y$; likewise, the higher $x_2$ is, the larger the effect of $x_1$ on $y$

If $b_3 < 0$, the pattern is reversed (i.e., the higher $x_i$ is, the smaller the effect of $x_j$ on $y$)

Note also that $b_1$ and $b_2$ now reflect conditional relationships:

$b_1$ is the effect of $x_1$ on $y$ when $x_2 = 0$ .
$b_2$ is the effect of $x_2$ on $y$ when $x_1 = 0$
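To see why, substitute $x_3 = (x_1) \times (x_2)$ into the model and ask how $y$ changes as each variable changes by one unit:

$$\frac{\partial y}{\partial x_1} = b_1 + b_3(x_2), \qquad \frac{\partial y}{\partial x_2} = b_2 + b_3(x_1)$$

Setting $x_2 = 0$ in the first expression leaves $b_1$, and setting $x_1 = 0$ in the second leaves $b_2$. At any other value, the effect of one variable shifts by $b_3$ for every unit of the other.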

18 / 28

When interpreting our regression coefficients, we are forced to say that the effect of a unit change in one variable depends upon the value of the other variable

If you have an interaction effect in your model, you must also include $x_1$ and $x_2$ themselves (the main effects), even if they are not statistically significant

Calculating impacts of variables ...

Hold $x_i$ at its mean and tweak $x_j$ by $\pm 1$ , $\pm 2$ standard deviations, or
Hold $x_i$ at its median and change the other by discrete units (e.g., for a variable that ranges from $0\%$ to $100\%$, you could step from Min to Max in increments of $10\%$) ... This is the preferred strategy!!

19 / 28

Calculate Minimum, Median, Maximum for age and for exper, respectively

Now hold age at Minimum and tweak exper

Now hold exper at Minimum and tweak age

Repeat by setting each, in turn, at Median, then at Maximum

With missing data, calculate all descriptive statistics for the estimation sample rather than the full sample, since the two can differ once cases with missing values are dropped

20 / 28

$wage = -12.193 + 0.957(age) - 0.610(exper) - 0.004(age \times exper)$

exper	age = 18	age = 28	age = 35	age = 44	age = 64
0	5.033918	14.604271	21.303518	29.916835	49.057541
8	-0.4178636	8.8359545	15.3136271	23.6420634	42.1496995
15	-5.188172	3.788678	10.072473	18.151638	36.105338
26	-12.684372	-4.142757	1.836373	9.523827	26.607057
55	-32.447081	-25.052904	-19.876980	-13.222221	1.566133

What seems odd about this table's structure and data??

21 / 28

Use feasible in-sample values

age = 18, exper = 0
age = 28, exper = (Min = 5, Max = 15)
age = 35, exper = (Min = 11, Max = 21)
age = 44, exper = (Min = 22, Max = 29)
age = 64, exper = (Min = 40, Max = 65)

It is important to scan the data so that impossible predictions are not generated. Better yet, generate plots of the predicted values, such as the one that follows.
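One way to apply this is to predict only for age and experience combinations that fall inside the in-sample ranges listed above. The sketch below is an assumption about how you might organize that check, not the method used to build the slides. It reuses the experience values 0, 8, 15, 26, and 55 from the earlier table together with the rounded coefficients, so the predictions differ slightly from the table.

```python
def predict_wage(age, exper):
    """Estimated wage model with an age x exper interaction (rounded coefficients)."""
    return -12.193 + 0.957 * age - 0.610 * exper - 0.004 * age * exper


# In-sample experience range (min, max) at each age, as listed on the slide.
exper_range = {18: (0, 0), 28: (5, 15), 35: (11, 21), 44: (22, 29), 64: (40, 65)}
exper_grid = [0, 8, 15, 26, 55]  # experience values used in the earlier table

feasible = {}
for age, (lo, hi) in exper_range.items():
    for exper in exper_grid:
        # Keep only combinations that could actually occur in the data.
        if lo <= exper <= hi:
            feasible[(age, exper)] = round(predict_wage(age, exper), 3)

for (age, exper), wage in feasible.items():
    print(f"age {age}, exper {exper}: predicted wage {wage}")
```

With this filter, combinations such as an 18-year-old with 8 years of experience are never evaluated, and none of the remaining predictions is negative.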

22 / 28

23 / 28

State of the World's Children (2025)

24 / 28

Fitting Regression Models to these Data

u5mr is the dependent variable (aka the outcome of interest)

with 1 independent variable (say, GDP per capita)
with 2 independent variables (say, GDP per capita and Adult literacy rate)

NOTE!! Convert GDP per capita into GDP per capita in 1,000 USD

Some cautions ... Regression models rest on several requirements

Your independent variables cannot be very highly correlated
The dependent variable should have equal variance for every value of the independent variable
Your sample size should be large (at minimum 30 observations for every independent variable you wish to use)

25 / 28

$U5MR = \text{Constant} + b_{1}(GDP)$

$U5MR = \text{Constant} + b_{1}(GDP) + b_2(\text{Adult Literacy})$

Interpret the partial slopes, adjusted R-Square, and the Standard Error of the Estimate

Notice the jump in the R-Square and the Adjusted R-Square

26 / 28

An aside on underlying hypotheses being tested here

$H_0:$ GDP per capita has no impact on child mortality, i.e., $H_0: b_1 = 0$
$H_1:$ GDP per capita has an impact on child mortality, i.e., $H_1: b_1 \neq 0$

$H_0:$ Adult Literacy has no impact on child mortality, i.e., $H_0: b_2 = 0$
$H_1:$ Adult Literacy has an impact on child mortality, i.e., $H_1: b_2 \neq 0$

$H_0:$ As GDP per capita increases child mortality stays the same or increases, i.e., $H_0: b_1 \geq 0$
$H_1:$ As GDP per capita increases child mortality decreases, i.e., $H_1: b_1 < 0$

$H_0:$ As Adult Literacy increases child mortality stays the same or increases, i.e., $H_0: b_2 \geq 0$
$H_1:$ As Adult Literacy increases child mortality decreases, i.e., $H_1: b_2 < 0$

NOTE!! Use a two-tailed test when we don't know which direction to expect, and a one-tailed test when we hypothesize an impact in a specific direction
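Regression output usually reports the two-tailed p-value for each slope. For a symmetric test statistic such as the usual t-test on a coefficient, a one-tailed p-value can be obtained from it as sketched below. The function name and the example numbers are hypothetical, not results from these data.

```python
def one_tailed_p(two_tailed_p, estimate, expect_negative=True):
    """Convert a two-tailed p-value for a slope into a one-tailed p-value.

    Assumes a symmetric test statistic (e.g., the t-test on a coefficient).
    If the estimate has the hypothesized sign, the one-tailed p-value is half
    the two-tailed one; otherwise it is one minus that half.
    """
    half = two_tailed_p / 2
    sign_matches = estimate < 0 if expect_negative else estimate > 0
    return half if sign_matches else 1 - half


# Hypothetical example: H1 says the slope is negative.
p_supports_h1 = one_tailed_p(0.08, estimate=-1.5)  # estimate has the expected sign
p_against_h1 = one_tailed_p(0.08, estimate=1.5)    # estimate has the wrong sign
print(p_supports_h1, p_against_h1)
```

A slope with the wrong sign can never support a one-tailed $H_1$, no matter how small its two-tailed p-value is.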

27 / 28

Could we build a better model?

Use the following independent variables:

(1) female youth literacy rates
(2) male youth literacy rates
(3) percent of the GDP spent on health
(4) density of nurses and midwives
(5) percent of the total population living in urban areas

What is the Adjusted R-Square now?

What is the average prediction error now?

How many independent variables are significant now?

Could some of our independent variables be highly correlated? `Check!!`
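One way to check is to compute the correlation between every pair of independent variables and flag the large ones. The sketch below is illustrative only: the column names and values are made up rather than taken from the State of the World's Children data, and the 0.8 cutoff is an assumed rule of thumb, not a threshold given in these slides.

```python
import math
from itertools import combinations


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def flag_correlated(columns, threshold=0.8):
    """Return (name1, name2, r) for every pair with |r| at or above threshold."""
    flagged = []
    for (n1, c1), (n2, c2) in combinations(columns.items(), 2):
        r = pearson_r(c1, c2)
        if abs(r) >= threshold:
            flagged.append((n1, n2, round(r, 3)))
    return flagged


# Made-up values for three hypothetical columns (illustration only).
data = {
    "female_youth_literacy": [60, 72, 85, 91, 98],
    "male_youth_literacy": [66, 75, 88, 93, 99],
    "pct_urban": [55, 30, 62, 41, 70],
}
flagged = flag_correlated(data)
print(flagged)
```

If a pair is flagged, consider keeping only one of the two variables and comparing the Adjusted R-Square of the resulting models.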

28 / 28
