ST430: Introduction to Regression Analysis, Case Study 2
Luo Xiao
October 12, 2015
Price of residential property
How does the sale price of a property relate to the appraised values of the
land and improvements on the property, and the neighborhood it is in?
Two questions:
Do the data indicate that price can be predicted based on these
variables?
Is the relationship the same in different neighborhoods?
Available data for 176 sales between May 2008 and June 2009:
Sale price, in thousands of dollars, Y ;
Appraised land value, in thousands of dollars, X1 ;
Appraised improvement value, in thousands of dollars, X2 ;
Neighborhood: Cheval, Davis Isles, Hunter's Green, and Hyde Park (coded with indicator variables, as sketched after this list):
Baseline neighborhood: Cheval;
Indicator variable X3 for Davis Isles;
Indicator variable X4 for Hunter's Green;
Indicator variable X5 for Hyde Park.
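In R these indicators need not be built by hand: once the data set TAMSALES4 (loaded on the next slide) is available and NBHD is a factor whose first level is CHEVAL, the default treatment contrasts create X3, X4, X5 automatically. A minimal sketch, assuming the level names shown in the later plots:

# Indicator (dummy) columns R creates for NBHD;
# the baseline level CHEVAL gets no column of its own
head(model.matrix(~ NBHD, data = TAMSALES4))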
Load in data and plot them (next slide)
setwd("/Users/xiaoyuesixi/Dropbox/teaching/2015Fall/R_datasets
load("TAMSALES4.Rdata")# load in data
par(mfrow=c(1,2),mar=c(4,4.5,2,1))
plot(SALES~LAND,data=TAMSALES4,pch=20)
plot(SALES~IMP,data=TAMSALES4,pch=20)
Scatter plot
[Figure: scatter plots of SALES versus LAND (left) and SALES versus IMP (right)]
Scatter plot: Y versus X1 (different neighborhoods)
[Figure: SALES versus LAND, with points marked by neighborhood: CHEVAL, DAVISISLES, HUNTERSGREEN, HYDEPARK]
Scatter plot: Y versus X2 (different neighborhoods)
[Figure: SALES versus IMP, with points marked by neighborhood: CHEVAL, DAVISISLES, HUNTERSGREEN, HYDEPARK]
Models (nested) to consider
Model 1: E(Y) = β0 + β1 X1 + β2 X2;
Model 2: E(Y) = β0 + β1 X1 + β2 X2 + β3 X3 + β4 X4 + β5 X5;
Model 3:
E(Y) = β0 + β1 X1 + β2 X2 + β3 X3 + β4 X4 + β5 X5
     + β6 X1 X3 + β7 X1 X4 + β8 X1 X5
     + β9 X2 X3 + β10 X2 X4 + β11 X2 X5;
Model 4:
E(Y) = β0 + β1 X1 + β2 X2 + β3 X3 + β4 X4 + β5 X5
     + β6 X1 X3 + β7 X1 X4 + β8 X1 X5
     + β9 X2 X3 + β10 X2 X4 + β11 X2 X5
     + β12 X1 X2 + β13 X1 X2 X3 + β14 X1 X2 X4 + β15 X1 X2 X5.
Implications of models when X3 = 1
Model 1: E(Y) = β0 + β1 X1 + β2 X2;
Model 2: E(Y) = (β0 + β3) + β1 X1 + β2 X2;
Model 3: E(Y) = (β0 + β3) + (β1 + β6) X1 + (β2 + β9) X2;
Model 4: E(Y) = (β0 + β3) + (β1 + β6) X1 + (β2 + β9) X2 + (β12 + β13) X1 X2.
Model formulas in R
# Model 1
fit1 = lm(SALES ~ LAND + IMP, data = TAMSALES4)
# Model 2
fit2 = lm(SALES ~ LAND + IMP + NBHD, data = TAMSALES4)
# Model 3
fit3 = lm(SALES ~ (LAND + IMP) * NBHD, data = TAMSALES4)
# Model 4
fit4 = lm(SALES ~ LAND * IMP * NBHD, data = TAMSALES4)
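To check which terms each formula generates, one can inspect the design-matrix column names; a quick sketch for Model 4:

# Model 4 should have 16 coefficients: intercept, LAND, IMP, NBHD (3),
# LAND:IMP, LAND:NBHD (3), IMP:NBHD (3), and LAND:IMP:NBHD (3)
colnames(model.matrix(fit4))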
Summary of models
Model   R²      Ra²     s
1       .9242   .9233   112.9
2       .9277   .9256   111.3
3       .9334   .9290   108.7
4       .9415   .9361   103.1
(Ra² is the adjusted R²; s is the residual standard error, in thousands of dollars.)
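These entries can be reproduced from summary() of each fit; a sketch:

# R-squared, adjusted R-squared, and residual standard error s for each model
fits <- list(fit1, fit2, fit3, fit4)
t(sapply(fits, function(f) {
  sm <- summary(f)
  c(R2 = sm$r.squared, Ra2 = sm$adj.r.squared, s = sm$sigma)
}))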
Notes
Models with more predictors always give a larger R².
Models with a higher Ra² always have a smaller s.
A small s and a high Ra² are desirable; here, Model 4 is best on both criteria.
Model 1 versus Model 2
H0: β3 = β4 = β5 = 0;
Ha: at least one of β3, β4, β5 is not zero.
R code for testing nested models
’anova(fit1, fit2)’ (Output in "CS2_output1.txt")
Model 2 differs from Model 1 only by including NBHD, so we can also use
the R code: ’anova(fit2)’ (Output in "CS2_output2.txt")
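As a sketch, the two equivalent calls in R (the actual outputs are in the files named above):

# Nested-model F test of Model 1 versus Model 2
anova(fit1, fit2)
# Equivalent: in the sequential ANOVA for fit2, the NBHD row is the same test
anova(fit2)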
The test shows that we reject Model 1 in favor of Model 2 at the 5% level, but not at the 1% level.
Model 2 versus Model 3
H0: β6 = β7 = β8 = β9 = β10 = β11 = 0;
Ha: at least one of β6, ..., β11 is not zero.
Model 3 differs from Model 2 by including the interactions LAND:NBHD and IMP:NBHD, so there are two ways to calculate the F-test (both sketched below):
1. Use the R code 'anova(fit2, fit3)' (Output in "CS2_output3.txt");
2. Do the calculation by hand from the outputs of 'anova(fit2)' and 'anova(fit3)' (Outputs in "CS2_output2.txt" and "CS2_output4.txt").
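A minimal sketch of the corresponding R calls:

# Direct nested-model F test of Model 2 versus Model 3
anova(fit2, fit3)
# Pieces for the hand calculation on the next slide
anova(fit2)
anova(fit3)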
F-test
Recall the F statistic formula for nested models:

F = [(SSE_R − SSE_C) / (number of β's tested)] / MSE_C.

From the outputs "CS2_output2.txt" and "CS2_output4.txt":

SSE_R = 2104582, SSE_C = 1936871, MSE_C = 11810,

F = [(2104582 − 1936871) / 6] / 11810 = 2.367.
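The same arithmetic can be done directly in R (values copied from the outputs above):

SSE_R <- 2104582   # residual SS of the reduced model (Model 2)
SSE_C <- 1936871   # residual SS of the complete model (Model 3)
MSE_C <- 11810     # residual mean square of the complete model
((SSE_R - SSE_C) / 6) / MSE_C   # F, approximately 2.367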
This F statistic has an F distribution with 6 and 164 degrees of freedom.
R code for p-value
’1-pf(2.367,6,164)’
F = 2.367 has a p-value of 0.0321, so we also reject Model 2 in favor of
Model 3 at the 5% level.
Model 3 versus Model 4
H0: β12 = ... = β15 = 0;
Ha: at least one of β12, ..., β15 is not zero.
Model 4 differs from Model 3 by including the interactions LAND:IMP and
LAND:IMP:NBHD.
Use either the R code 'anova(fit3, fit4)' (Output in "CS2_output5.txt") for the test, or the outputs from 'anova(fit3)' ("CS2_output4.txt") and 'anova(fit4)' ("CS2_output6.txt").
F = 5.5389 with a p-value of .0003, so we also reject Model 3 in favor of Model 4, at both the 5% and the 1% levels.
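As a check, this p-value comes from an F distribution with 4 and 160 degrees of freedom (4 tested coefficients; 176 observations minus 16 parameters in Model 4):

1 - pf(5.5389, 4, 160)   # approximately 0.0003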
Notes about the F-tests
Each of these tests answers the question:
Is there enough evidence against the simpler model to reject it?
This is not the same as:
Which of these models will give the best predictions?
Interpreting Model 4 (R output in ‘CS2_output7.txt’)
The baseline neighborhood is Cheval, so the equation for that neighborhood
is
E(Y) = 155.2 − 0.8272 X1 + 0.9609 X2 + 0.00517 X1 X2,
a two-variable interaction model.
For each other neighborhood, the equation is also a two-variable interaction
model, but with different coefficients.
For another neighborhood, Davis Isles (indicator variable X3), we add the X3 main effect and every interaction term involving X3 to the baseline coefficients:
NBHDDAVISISLES = -60.17 to the intercept;
LAND:NBHDDAVISISLES = 2.012 to the coefficient of X1 ;
IMP:NBHDDAVISISLES = -0.1977 to the coefficient of X2 ;
LAND:IMP:NBHDDAVISISLES = -0.004278 to the coefficient of X1 X2 .
We get
E(Y) = (155.2 − 60.17) + (−0.8272 + 2.012) X1 + (0.9609 − 0.1977) X2 + (0.00517 − 0.004278) X1 X2
     = 95.03 + 1.1848 X1 + 0.7632 X2 + 0.000892 X1 X2.
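A minimal sketch of the same arithmetic in R, using the coefficient names from the fitted model:

# Davis Isles equation: add the X3 adjustments to the baseline (Cheval) coefficients
b <- coef(fit4)
c(intercept = unname(b["(Intercept)"] + b["NBHDDAVISISLES"]),
  LAND      = unname(b["LAND"] + b["LAND:NBHDDAVISISLES"]),
  IMP       = unname(b["IMP"] + b["IMP:NBHDDAVISISLES"]),
  LAND.IMP  = unname(b["LAND:IMP"] + b["LAND:IMP:NBHDDAVISISLES"]))
# roughly 95.03, 1.1848, 0.7632, 0.000892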
Prediction with Model 4
Predict the sale price of a residential property in Hyde Park with an appraised land value of 150 thousand dollars and an appraised improvement value of 750 thousand dollars.
R code (Output in “CS2_output8.txt”):
predict(fit4,
        newdata  = data.frame(LAND = 150, IMP = 750, NBHD = subsets[[4]]$NBHD[1]),
        interval = "prediction")
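The object 'subsets' is not defined on these slides; an equivalent call that names the neighborhood level directly (assuming the level is spelled HYDEPARK, as in the plots) would be:

predict(fit4,
        newdata  = data.frame(LAND = 150, IMP = 750, NBHD = "HYDEPARK"),
        interval = "prediction")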