ST430 Introduction to Regression Analysis
ST430: Introduction to Regression Analysis, Ch. 4, Sec. 4.6-4.10
Luo Xiao
September 21, 2015
Multiple Linear Regression
Testing the utility of a model
Usually, the first test is an overall test of the model:
H0 : β1 = β2 = · · · = βk = 0.
Ha : at least one βi ≠ 0.
H0 asserts that none of the independent variables affects Y; if this
hypothesis is not rejected, there is no evidence that the model is useful.
For instance, its predictions may perform no better than the sample mean ȳ.
The test statistic is usually denoted F .
F test statistic
\[
F = \frac{(SS_{yy} - SSE)/k}{SSE/(n - k - 1)} = \frac{\text{Mean Square (Model)}}{\text{MSE}}.
\]
F is the ratio of the explained variability to the unexplained variability, each divided by its degrees of freedom.
If H0 is true, F has an F-distribution with k and n − k − 1 degrees of freedom.
Reject H0 if F is large, i.e., F > Fα, where Fα is the upper-α critical value of the F-distribution with k and n − k − 1 degrees of freedom.
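As a minimal sketch, the F statistic can be computed by hand in R and compared with the value reported by lm. The built-in mtcars data set is used here purely as a stand-in (mpg as the response, wt and hp as the k = 2 predictors); it is an assumption for illustration and not the course data.

```r
# Illustration only: mtcars stands in for the course data set.
fit <- lm(mpg ~ wt + hp, data = mtcars)

n <- nrow(mtcars)
k <- length(coef(fit)) - 1                       # number of predictors (excludes intercept)

sse  <- sum(resid(fit)^2)                        # unexplained variability, SSE
ssyy <- sum((mtcars$mpg - mean(mtcars$mpg))^2)   # total variability, SSyy

# F = [(SSyy - SSE) / k] / [SSE / (n - k - 1)]
f_manual <- ((ssyy - sse) / k) / (sse / (n - k - 1))

# Value reported by summary(); fstatistic holds value, numdf, dendf
f_lm <- unname(summary(fit)$fstatistic["value"])

# Upper-tail p-value and the critical value F_alpha for alpha = 0.05
p_value <- pf(f_manual, df1 = k, df2 = n - k - 1, lower.tail = FALSE)
f_crit  <- qf(0.95, df1 = k, df2 = n - k - 1)
```

H0 is rejected at level 0.05 when f_manual exceeds f_crit, or equivalently when p_value is below 0.05.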
Rejection region for the global F -test
Multiple linear regression: grandfather clocks data
Variables
Y : sale price of a clock at an auction
X1 : age of the clock
X2 : number of bidders in the auction
Model
Y = β0 + β1 X1 + β2 X2 + ε
Multiple linear regression in R: grandfather clocks data
# load in data
setwd("~/Dropbox/teaching/2015Fall/R_datasets/Exercises&Exampl
load("GFCLOCKS.Rdata")
fit <- lm(PRICE ~ AGE + NUMBIDS, data = GFCLOCKS)  # first-order linear fit
summary(fit)                                       # display results
F -test results: grandfather clocks data
F -statistic: 120.2.
Degrees of freedom for the F -test: 2 and 29.
P-value for the F-test: 9.2 × 10^−15.
Conclusion: Reject H0 in favor of Ha.
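The reported p-value can be checked from the F statistic and its degrees of freedom alone. A minimal sketch, using only the numbers quoted above; the 1% significance level is an assumption chosen for illustration.

```r
f_obs <- 120.2        # F-statistic reported above
df1   <- 2            # numerator degrees of freedom, k
df2   <- 29           # denominator degrees of freedom, n - k - 1

# Upper-tail probability of the F(2, 29) distribution beyond f_obs
p_value <- pf(f_obs, df1, df2, lower.tail = FALSE)

# Critical value F_alpha for an assumed level alpha = 0.01
f_crit <- qf(0.99, df1, df2)
reject <- f_obs > f_crit
```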
SAS output: grandfather clocks data
Testing the slopes
Each slope is tested with a t-test, as in simple linear regression models; the t-statistic now has n − k − 1 degrees of freedom.
For the grandfather clocks data, both the age of the clock and the number of
bidders are significant predictors.
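A minimal sketch of the slope t-tests in R, again using the built-in mtcars data as a stand-in for the course data (an assumption for illustration). It recomputes the t-statistics and two-sided p-values that summary() prints in its coefficient table.

```r
# Illustration only: mtcars stands in for the course data set.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Coefficient table; columns: Estimate, Std. Error, t value, Pr(>|t|)
coefs <- summary(fit)$coefficients

# t = estimate / standard error, with n - k - 1 residual degrees of freedom
t_manual <- coefs[, "Estimate"] / coefs[, "Std. Error"]
p_manual <- 2 * pt(abs(t_manual), df = df.residual(fit), lower.tail = FALSE)
```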
Multiple coefficient of determination/ R-squared
Recall that:
\[
R^2 = \frac{SS_{yy} - SSE}{SS_{yy}} = 1 - \frac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2}.
\]
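The interpretation of $R^2$ is still the fraction of variance "explained" by the regression model.
Its square root R is the multiple correlation coefficient: the correlation between the dependent variable Y and the fitted values Ŷ, which measures how strongly Y is related to the independent variables jointly.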
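A minimal sketch checking both readings of R-squared in R: one minus the ratio of sums of squares, and the squared correlation between Y and the fitted values. The mtcars data set is an assumed stand-in for the course data.

```r
# Illustration only: mtcars stands in for the course data set.
fit <- lm(mpg ~ wt + hp, data = mtcars)
y   <- mtcars$mpg

sse  <- sum(resid(fit)^2)        # sum of (y_i - yhat_i)^2
ssyy <- sum((y - mean(y))^2)     # sum of (y_i - ybar)^2

r2_manual <- 1 - sse / ssyy            # definition via sums of squares
r2_cor    <- cor(y, fitted(fit))^2     # squared multiple correlation
r2_lm     <- summary(fit)$r.squared    # value reported by summary()
```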
Adjusted R-squared
$R^2$ never decreases when new variables are added to the model, even if they are unrelated to Y.
Because the regression model is adapted to the sample data, it tends to
explain more variance in the sample data than it will in new data.
Rewrite:
\[
1 - R^2 = \frac{SSE}{SS_{yy}} = \frac{\frac{1}{n-1}\sum (y_i - \hat{y}_i)^2}{\frac{1}{n-1}\sum (y_i - \bar{y})^2}.
\]
Ideally we want:
\[
1 - R^2 = \frac{\text{Variance of residuals}}{\text{Variance of } Y},
\]
but $\frac{1}{n-1}\sum (y_i - \hat{y}_i)^2$ is a biased estimator of the variance of the residuals.
Replace $\frac{1}{n-1}$ in the numerator with the multiplier that gives an unbiased variance estimator:
\[
\frac{\frac{1}{n-k-1}\sum (y_i - \hat{y}_i)^2}{\frac{1}{n-1}\sum (y_i - \bar{y})^2},
\]
where k + 1 is the number of estimated βs.
This defines the adjusted R-squared:
\[
R_a^2 = 1 - \frac{\frac{1}{n-k-1}\sum (y_i - \hat{y}_i)^2}{\frac{1}{n-1}\sum (y_i - \bar{y})^2}.
\]
Rewrite:
\[
R_a^2 = 1 - \frac{n-1}{n-k-1} \times \frac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2}.
\]
$R_a^2 < R^2$ whenever k ≥ 1 and $R^2 < 1$, and for a poorly fitting model you may even find $R_a^2 < 0$!
Because $R_a^2$ adjusts for the number of covariates in the model, it can be
used for comparing different models: select the model that has the largest $R_a^2$.
In particular, $R_a^2$ does not necessarily increase when a new variable is added to the model.
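A minimal sketch of the adjusted R-squared formula in R. The mtcars data set, the simulated noise column, and the helper name adj_r2 are all assumptions introduced for illustration; none of them come from the course material.

```r
# Illustration only: mtcars stands in for the course data set.
set.seed(1)
dat <- mtcars
dat$noise <- rnorm(nrow(dat))    # simulated predictor, unrelated to mpg by construction

fit_small <- lm(mpg ~ wt + hp, data = dat)
fit_big   <- lm(mpg ~ wt + hp + noise, data = dat)

# Hypothetical helper: R_a^2 = 1 - (n - 1) / (n - k - 1) * (1 - R^2)
adj_r2 <- function(fit) {
  n  <- nobs(fit)
  k  <- length(coef(fit)) - 1    # number of predictors (excludes intercept)
  r2 <- summary(fit)$r.squared
  1 - (n - 1) / (n - k - 1) * (1 - r2)
}

r2_small <- summary(fit_small)$r.squared
r2_big   <- summary(fit_big)$r.squared
```

Comparing adj_r2(fit_small) with adj_r2(fit_big) indicates whether the extra variable earned its place; r2_small and r2_big cannot, since the latter is never smaller than the former.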
Estimation and prediction
The multiple regression model may be used to make statements about the
response that would be observed under a new set of conditions
Xnew = (X1,new , X2,new , . . . , Xk,new ).
As before, the statement may be about:
E (Y |X = Xnew ), the expected value of Y under the new conditions;
a single new observation of Y under the new conditions.
The point estimate of E (Y |X = Xnew ) and the point prediction of Y when
X = Xnew are both
Ŷ = β̂0 + β̂1 X1,new + · · · + β̂k Xk,new .
The standard errors are different because, as always,
Y = E(Y) + ε.
The prediction interval is wider than the confidence interval because it must also account for the variability of ε.
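A minimal sketch of why the two intervals differ, using the built-in mtcars data as an assumed stand-in and a hypothetical new observation (wt = 3, hp = 120). The confidence interval uses the standard error of the fitted mean; the prediction interval adds the residual variance to it.

```r
# Illustration only: mtcars stands in for the course data set.
fit     <- lm(mpg ~ wt + hp, data = mtcars)
new_obs <- data.frame(wt = 3, hp = 120)   # hypothetical new conditions

conf_int <- predict(fit, newdata = new_obs, interval = "confidence", level = 0.99)
pred_int <- predict(fit, newdata = new_obs, interval = "prediction", level = 0.99)

# Rebuild both half-widths from the standard errors
se     <- predict(fit, newdata = new_obs, se.fit = TRUE)
t_crit <- qt(0.995, df = se$df)                        # two-sided 99% critical value
half_conf <- t_crit * se$se.fit                        # uncertainty in E(Y) only
half_pred <- t_crit * sqrt(se$se.fit^2 + se$residual.scale^2)  # adds variance of the error term
```

Both intervals share the same center, and half_pred always exceeds half_conf because the residual variance is added under the square root.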
Question:
Estimate the mean sale price, and predict an individual sale price, for a
150-year-old clock with 10 bidders at the auction, using 99% intervals.
# load in data
setwd("~/Dropbox/teaching/2015Fall/R_datasets/Exercises&Exampl
load("GFCLOCKS.Rdata")
fit <- lm(PRICE ~ AGE + NUMBIDS, data = GFCLOCKS)  # first-order linear fit
predict(fit, newdata = data.frame(AGE = 150, NUMBIDS = 10),
        interval = "confidence", level = 0.99)     # confidence interval for E(Y)
predict(fit, newdata = data.frame(AGE = 150, NUMBIDS = 10),
        interval = "prediction", level = 0.99)     # prediction interval for a new Y
R output:
##        fit     lwr      upr
## 1 1431.665 1363.92 1499.409
##        fit      lwr      upr
## 1 1431.665 1057.545 1805.785
Interaction models
One property of the first-order model
E (Y ) = β0 + β1 X1 + β2 X2 + · · · + βk Xk
is that βi is the change in E (Y ) as Xi increases by 1 with all the other
independent variables held fixed, and is the same regardless of the values of
those other variables.
Not all real-world situations work like that.
When the magnitude of the effect of one variable is affected by the level of
another variable, we say that they interact.
A simple model for two factors that interact is
E (Y ) = β0 + β1 X1 + β2 X2 + β3 X1 X2 .
Rewrite this in two ways:
E (Y ) = (β0 + β2 X2 ) + (β1 + β3 X2 )X1
= (β0 + β1 X1 ) + (β2 + β3 X1 )X2 .
Holding X2 fixed, the slope of E (Y ) against X1 is β1 + β3 X2 .
Holding X1 fixed, the slope of E (Y ) against X2 is β2 + β3 X1 .
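A minimal numerical check of these slopes in R. The coefficient values and the function names mean_y and slope_x1 are hypothetical, chosen only to make the arithmetic easy to follow.

```r
# Hypothetical coefficients for E(Y) = b0 + b1*X1 + b2*X2 + b3*X1*X2
b <- c(b0 = 10, b1 = 2, b2 = -1, b3 = 0.5)

# Mean response under the interaction model
mean_y <- function(x1, x2) {
  b[["b0"]] + b[["b1"]] * x1 + b[["b2"]] * x2 + b[["b3"]] * x1 * x2
}

# Slope of E(Y) against X1 with X2 held fixed: b1 + b3 * X2
slope_x1 <- function(x2) b[["b1"]] + b[["b3"]] * x2

# Change in E(Y) when X1 increases by 1, at two different fixed values of X2
change_at_x2_2 <- mean_y(4, 2) - mean_y(3, 2)
change_at_x2_6 <- mean_y(4, 6) - mean_y(3, 6)
```

The two changes differ because the effect of X1 depends on the level of X2, and each one matches slope_x1 evaluated at the corresponding X2.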
We can fit this using k = 3:
E (Y ) = β0 + β1 X1 + β2 X2 + β3 X3
if we set X3 = X1 X2 .
R code for interaction model: grandfather clocks
lm(PRICE ~ AGE*NUMBIDS, data = GFCLOCKS)
Note: the formula 'PRICE ~ AGE*NUMBIDS' specifies the interaction
model, which includes the separate effects of 'AGE' and 'NUMBIDS',
together with their product, which will be labeled 'AGE:NUMBIDS'.
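A minimal sketch showing that the `*` formula and an explicit product column X3 = X1 X2 give the same fit. The mtcars data set and the column name wt_hp are assumptions for illustration, standing in for the course data.

```r
# Illustration only: mtcars stands in for the course data set.
# Interaction model via the formula operator
fit_star <- lm(mpg ~ wt * hp, data = mtcars)

# Same model with the product built by hand as a third predictor, X3 = X1 * X2
dat      <- transform(mtcars, wt_hp = wt * hp)
fit_prod <- lm(mpg ~ wt + hp + wt_hp, data = dat)

coef_star <- coef(fit_star)   # last coefficient is labeled "wt:hp"
coef_prod <- coef(fit_prod)   # last coefficient is labeled "wt_hp"
```

Only the label of the product term differs between the two fits; the estimates are the same.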
R output for interaction model: grandfather clocks
Run the following R code for outputs!
# load in data
setwd("~/Dropbox/teaching/2015Fall/R_datasets/Exercises&Exampl
load("GFCLOCKS.Rdata")
fit <- lm(PRICE ~ AGE * NUMBIDS, data = GFCLOCKS)  # interaction model fit
summary(fit)                                       # display results
The global F-test leads to rejection of the null hypothesis
H0 : β1 = β2 = β3 = 0.
Note that the t-value on the 'AGE:NUMBIDS' line is highly significant.
That is, we strongly reject the null hypothesis H0 : β3 = 0.
Effectively, this test is a comparison of the interaction model with the
original non-interactive (additive) model.
The other two t-statistics are usually irrelevant:
if 'AGE:NUMBIDS' is important, then both 'AGE' and 'NUMBIDS'
should be included in the model;
do not test the corresponding null hypotheses.
We can write the fitted model as
E(PRICE) = 320 − 93 × NUMBIDS
+ (0.88 + 1.3 × NUMBIDS) × AGE,
meaning that the effect of age increases with the number of bidders.
Compare with first-order model
E (PRICE) = −1339 + 12.74 × AGE + 85.95 × NUMBIDS.
One interpretation
The number of bidders creates more competition for the oldest clocks and
age has a stronger effect in a competitive environment.
We can also write the fitted model as
E (PRICE) =320 + 0.88 × AGE
+ (−93 + 1.3 × AGE) × NUMBIDS.
Because the range of 'AGE' is [100, 200], the slope for 'NUMBIDS', −93 + 1.3 × AGE, is
always positive (at least −93 + 1.3 × 100 = 37); moreover, the effect of the number of bidders increases with age.
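A minimal sketch of this calculation in R, using the rounded coefficients quoted in the fitted model above. The grid of ages is an assumption chosen to span the stated range.

```r
# Rounded coefficients from the fitted interaction model quoted above
b_numbids     <- -93    # coefficient of NUMBIDS
b_interaction <- 1.3    # coefficient of AGE:NUMBIDS

# Assumed grid of ages spanning the stated range [100, 200]
age <- seq(100, 200, by = 25)

# Slope of E(PRICE) against NUMBIDS at each age: -93 + 1.3 * AGE
slope_numbids <- b_numbids + b_interaction * age
```

Every entry of slope_numbids is positive and the entries grow with age, so each additional bidder is associated with a larger price increase for older clocks.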