ST430: Introduction to Regression Analysis, Case Study 3
Luo Xiao
October 26, 2015
Deregulation of trucking industry
What was the impact of deregulation on trucking prices in Florida?
What is a good model for predicting prices?
Get the data and plot them (see "output1.pdf"):
setwd("~/Dropbox/teaching/2015Fall/R_datasets/Cases")
load("TRUCKING.Rdata")
pairs(TRUCKING[, c("LNPRICE", "DISTANCE", "WEIGHT", "ORIGIN", "DEREG")])
Data: 134 observations
Y: natural logarithm of price;
X1: weight of product shipped (in 1,000 pounds);
X2: miles traveled (in hundreds);
X3: indicator variable: 1 for deregulation, 0 for regulation;
X4: indicator variable: 1 if originating in Miami, 0 if originating in Jacksonville.
Stepwise regression (output in “output2.txt”):
truck = list()
truck$Y = TRUCKING$LNPRICE
truck$X1 = TRUCKING$WEIGHT
truck$X2 = TRUCKING$DISTANCE
truck$X3 = TRUCKING$DEREG
truck$X4 = TRUCKING$ORIGIN
truck$X5 = TRUCKING$PCTLOAD
truck$X6 = TRUCKING$MARKET
truck = as.data.frame(truck)
start = lm(Y~1,data=truck)
firstOrder = Y~X1 + X2 + X3 + X4 + X5 + X6
summary(step(start,scope=firstOrder))
The (first-order) stepwise regression identifies:
X1, ’WEIGHT’;
X2, ’DISTANCE’;
X3, the ’DEREG’ indicator;
X4, the ’ORIGIN’ indicator.
Stepping down from the full first-order model, instead of stepping up from
the empty model, finds the same variables:
R code:
summary(step(lm(firstOrder, truck), firstOrder))
The study continues with the full second-order model (Model 1) in X1, X2, X3 and X4:
Y = β0 + β1X1 + β2X2 + β3X1X2 + β4X1² + β5X2²
  + β6X3 + β7X4 + β8X3X4
  + β9X1X3 + β10X1X4 + β11X1X3X4
  + β12X2X3 + β13X2X4 + β14X2X3X4
  + β15X1X2X3 + β16X1X2X4 + β17X1X2X3X4
  + β18X1²X3 + β19X1²X4 + β20X1²X3X4
  + β21X2²X3 + β22X2²X4 + β23X2²X3X4.
R code (output in “output3.txt”):
lm1 <- lm(Y ~ (X1*X2 + I(X1^2) + I(X2^2)) * X3 * X4, truck)
summary(lm1)
Note that none of the 8 squared terms is significant; try the model without
the quadratic terms (i.e., terms involving X1² or X2²), denoted by Model 2:
Y = β0 + β1X1 + β2X2 + β3X1X2
  + β6X3 + β7X4 + β8X3X4
  + β9X1X3 + β10X1X4 + β11X1X3X4
  + β12X2X3 + β13X2X4 + β14X2X3X4
  + β15X1X2X3 + β16X1X2X4 + β17X1X2X3X4.
R code (output in “output4.txt”):
lm2 <- lm(Y ~ X1*X2 * X3 * X4, truck)
summary(lm2)
anova(lm2, lm1) #compare nested models
R² drops substantially, and the partial F statistic is highly significant, so the simpler Model 2 is rejected.
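The comparison that anova(lm2, lm1) carries out is the usual partial (nested-model) F test; as a reminder, in standard notation (not shown on the slides):

```latex
% Partial F test for H0: the coefficients dropped from the full model are all zero
F = \frac{(\mathrm{SSE}_{\text{reduced}} - \mathrm{SSE}_{\text{full}})
          \,/\, (\mathrm{df}_{\text{reduced}} - \mathrm{df}_{\text{full}})}
         {\mathrm{SSE}_{\text{full}} \,/\, \mathrm{df}_{\text{full}}}
```

Under H0, F follows an F distribution with (df_reduced − df_full, df_full) degrees of freedom, where df denotes residual degrees of freedom; this is the statistic and p-value that anova() reports.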
Next try dropping, from the full second order model (Model 1), the
interactions between quantitative and qualitative variables (Model 3):
Y = β0 + β1X1 + β2X2 + β3X1X2 + β4X1² + β5X2²
  + β6X3 + β7X4 + β8X3X4.
R code (output in “output5.txt”):
lm3 = lm(Y ~ X1 + X2 + X1:X2 + I(X1^2) + I(X2^2) + X3*X4, data = truck)
summary(lm3)
anova(lm3, lm1)
Again F is significant, and the simpler Model 3 is rejected.
Next try: drop the interactions of the qualitative variables with only the
squared terms (Model 4):
Y = β0 + β1X1 + β2X2 + β3X1X2 + β4X1² + β5X2²
  + β6X3 + β7X4 + β8X3X4
  + β9X1X3 + β10X1X4 + β11X1X3X4
  + β12X2X3 + β13X2X4 + β14X2X3X4
  + β15X1X2X3 + β16X1X2X4 + β17X1X2X3X4.
R code:
lm4 = lm(Y ~ X1*X2*X3*X4 + I(X1^2) + I(X2^2), truck)
summary(lm4)
anova(lm4, lm1)
Success! R² drops only a little, and the adjusted R² (Ra²) actually increases; also F is not
significant. This simpler Model 4 is not rejected.
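Recall the standard definition of the adjusted R² (not spelled out on the slides), which explains why it can increase when uninformative terms are dropped:

```latex
R_a^2 = 1 - \frac{(1 - R^2)(n - 1)}{n - (p + 1)}
```

Here n = 134 observations and p is the number of model terms; dropping terms lowers p, so Ra² rises whenever the resulting drop in R² is small enough.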
Next, explore whether X4 can be dropped from Model 4 (Model 5):
Y = β0 + β1X1 + β2X2 + β3X1X2 + β4X1² + β5X2²
  + β6X3
  + β9X1X3
  + β12X2X3
  + β15X1X2X3.
R code:
lm5 = lm(Y ~ X1*X2*X3 + I(X1^2) + I(X2^2), truck)
summary(lm5)
anova(lm5, lm4)
F is highly significant, so we reject the simpler Model 5.
Next, explore whether X3 can be dropped (Model 6):
Y = β0 + β1X1 + β2X2 + β3X1X2 + β4X1² + β5X2²
  + β7X4
  + β10X1X4
  + β13X2X4
  + β16X1X2X4.
R code:
lm6 = lm(Y ~ X1*X2*X4 + I(X1^2) + I(X2^2), truck)
summary(lm6)
anova(lm6, lm4)
Again, F is highly significant, so we reject the simpler Model 6.
Finally, explore whether X3 interacts with X4 by dropping their interaction
terms (Model 7):
Y = β0 + β1X1 + β2X2 + β3X1X2 + β4X1² + β5X2²
  + β6X3 + β7X4
  + β9X1X3 + β10X1X4
  + β12X2X3 + β13X2X4
  + β15X1X2X3 + β16X1X2X4.
R code:
lm7 = lm(Y ~ X1*X2*(X3 + X4) + I(X1^2) + I(X2^2), truck)
summary(lm7)
anova(lm7, lm4)
This time, F is not significant, so the simpler Model 7, without the
interactions between X3 and X4, is not rejected.
Model-building with step()
Suppose we begin with the full second-order model and simplify it using
’step()’ and BIC:
R code:
stepLm1 = step(lm1, direction = "both",
k = log(nrow(truck)))
summary(stepLm1)
lmBIC = lm(Y~X1*X2*X3*X4 + I(X2^2),truck)
summary(lmBIC)
Note that the ’step’ function drops (or adds) only one term at a time, so it is
not useful for evaluating the removal of multiple terms at once.
Let’s drop the interaction of X3 and X4 (Model 8) manually and compare
BIC:
R code:
lm8 = lm(Y~X1*X2*(X3 + X4) + I(X2^2),truck)
extractAIC(lmBIC,k=log(nrow(truck)))
extractAIC(lm8,k=log(nrow(truck)))
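For a linear model, extractAIC() returns the equivalent degrees of freedom (edf) together with the criterion value, computed (up to an additive constant) as:

```latex
n \log(\mathrm{RSS}/n) + k \cdot \mathrm{edf},
\qquad k = 2 \;\Rightarrow\; \mathrm{AIC}, \quad k = \log n \;\Rightarrow\; \mathrm{BIC}
```

so passing k = log(nrow(truck)) yields BIC; the model with the smaller value is preferred.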
Now we see that Model 8 is preferred. We continue with the ’step’
function, but constrain the search so that X2 and X2² are always included:
R code:
lower = Y~X2 + I(X2^2)
upper = Y~X1*X2*(X3 + X4) + I(X2^2)
stepLm2 = step(lm8, scope = list(lower =lower, upper =upper),
direction = "both",k=log(nrow(truck)))
summary(stepLm2)
lm9 = lm(Y~X1*X2*X3 + X1*X4 + I(X2^2),truck)
summary(lm9)
extractAIC(lm9,k = log(nrow(truck)))
Now we end up with a model (Model 9) that cannot be simplified further:
Y = β0 + β1X1 + β2X2 + β3X1X2 + β5X2²
  + β6X3 + β7X4
  + β9X1X3 + β10X1X4
  + β12X2X3
  + β15X1X2X3.
Which model to use? Model 7 or Model 9?
Let’s look at their AIC/BIC.
R code:
#compare BIC
extractAIC(lm9,k = log(nrow(TRUCKING)))
extractAIC(lm7, k = log(nrow(TRUCKING)))
#compare AIC
extractAIC(lm9,k = 2)
extractAIC(lm7, k = 2)
Model 9 has smaller AIC and BIC and is thus preferred.
Estimated Model 9:
Y = 12.131 + 0.002X1 − 0.578X2 + 0.674X3 − 0.666X4 + 0.085X2²
  − 0.012X1X2 − 0.026X1X3 − 0.273X2X3 − 0.031X1X4 + 0.013X1X2X3.
Effect of deregulation
If X3 = 0:
Y = 12.131 + 0.002X1 − 0.578X2 − 0.666X4 + 0.085X2²
  − 0.012X1X2 − 0.031X1X4.
If X3 = 1:
Y = (12.131 + 0.674) + (0.002 − 0.026)X1 − (0.578 + 0.273)X2
  − 0.666X4 + 0.085X2² − 0.031X1X4 + (0.013 − 0.012)X1X2.
Effect of deregulation:
E(Y | X1, X2, X4, X3 = 1) − E(Y | X1, X2, X4, X3 = 0)
= 0.674 − 0.026X1 − 0.273X2 + 0.013X1X2.
X4 does not appear in the above formula: X3 and X4 do not interact.
If we plug the observed values of (X1, X2) into this formula, about 95% of
the resulting effects are negative. Positive values occur only when X1 and X2
are both small.
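Because Y is the natural logarithm of price, an additive effect δ on the log scale is multiplicative on the price scale (a standard log-model interpretation, not spelled out on the slides):

```latex
\ln(\text{price}_{\text{dereg}}) - \ln(\text{price}_{\text{reg}}) = \delta
\quad\Longrightarrow\quad
\frac{\text{price}_{\text{dereg}}}{\text{price}_{\text{reg}}} = e^{\delta}
```

so a negative δ means deregulation lowered the predicted price by roughly (1 − e^δ) · 100%.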
[Figure: Effect of deregulation plotted over X1 (0 to 20) and X2 (0 to 10), showing the regions of positive and negative effect.]