
ST430 Introduction to Regression Analysis
ST430: Introduction to Regression Analysis, Ch. 4, Sec. 4.11-4.12
Luo Xiao
September 23, 2015
Multiple Linear Regression
Quadratic models
We extended the additive model in two variables to the interaction model by
adding a third term to the equation.
Similarly, we can extend the linear model in one variable to the quadratic
model by adding a second term to the equation:
$E(Y) = \beta_0 + \beta_1 X + \beta_2 X^2$.
This is a special case of the two-variable model
$E(Y) = \beta_0 + \beta_1 X_1 + \beta_2 X_2$
with $X_1 = X$ and $X_2 = X^2$.
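The equivalence can be checked numerically. The following is a minimal sketch in R using simulated data; the variables x and y and the coefficient values are made up for illustration and do not come from the textbook.

```r
# Simulated data: the "true" coefficients below are arbitrary illustrative values.
set.seed(1)
x <- runif(50, min = 0, max = 10)
y <- 5 + 3 * x - 0.2 * x^2 + rnorm(50, sd = 1)

# Quadratic model in one variable.
fit_quad <- lm(y ~ x + I(x^2))

# The same model written as a two-variable model with X1 = X and X2 = X^2.
x1 <- x
x2 <- x^2
fit_two <- lm(y ~ x1 + x2)

# The two fits give the same coefficient estimates (only the names differ).
same_coefs <- all.equal(unname(coef(fit_quad)), unname(coef(fit_two)))
print(same_coefs)
```

Because the quadratic model is linear in the coefficients, it is fitted by ordinary least squares like any other multiple regression.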
Quadratic functions: concave upward or downward?
[Figure: two panels, each plotting a quadratic function against x for x between 2 and 10.]
The curve is concave upward when $\beta_2 > 0$ and concave downward when $\beta_2 < 0$.
Example: human immune system and exercise
Example 4.7 in textbook
X = maximal oxygen uptake (VO2 max, mL/(kg · min));
Y = immunoglobulin level (IgG, mg/dL);
data for 30 subjects (AEROBIC.Rdata).
Get the data and plot them (next slide).
Slight curvature suggests a linear model may not fit well.
Scatter plot
[Figure: IGG (vertical axis) plotted against MAXOXY (horizontal axis).]
How to fit a quadratic model in R
For the human immune system data, use the R code:
fit <- lm(IGG ~ MAXOXY + I(MAXOXY^2), data = AEROBIC)
In the above model formula, the quadratic term is added by wrapping it in I():
I(MAXOXY^2)
Just adding the following to the model formula will not work, because inside a formula ^ is a formula operator rather than arithmetic, so the term reduces to MAXOXY:
MAXOXY^2
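The difference is easy to see by comparing the coefficient names of the two fits. This is a small sketch on simulated data; x, y, and the numbers used to generate them are hypothetical.

```r
# Simulated data with arbitrary illustrative coefficients.
set.seed(2)
x <- runif(40, min = 1, max = 10)
y <- 2 + 1.5 * x - 0.1 * x^2 + rnorm(40, sd = 0.5)

# Without I(): in a formula, x^2 is expanded by the formula rules and collapses to x,
# so no quadratic term is fitted.
fit_wrong <- lm(y ~ x + x^2)
print(names(coef(fit_wrong)))

# With I(): x^2 is evaluated arithmetically and enters the model as its own term.
fit_right <- lm(y ~ x + I(x^2))
print(names(coef(fit_right)))
```

The first fit silently returns a straight-line model with no warning, so it is worth checking the number of coefficients after fitting.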
Quadratic fit in R: graph
[Figure: scatter plot of IGG against MAXOXY with the fitted quadratic curve and the least squares line.]
See the text file for R output
The global F-test shows the model is useful.
The quadratic term for ’MAXOXY’ is significant, so we reject the null hypothesis that the linear model is adequate (H0: $\beta_2 = 0$).
The estimated coefficient of the quadratic term is negative, which is consistent with the downward concavity of the fitted curve.
The other two t-ratios (for the intercept and the linear term) test hypotheses that are of no interest once the quadratic term is kept in the model.
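The t-test on the quadratic term can also be viewed as a comparison of the linear and quadratic models. The sketch below uses simulated data (the values are hypothetical, not the AEROBIC data) to show that the nested-model F statistic from anova() equals the square of the t-ratio for the quadratic term.

```r
# Simulated data with arbitrary illustrative coefficients.
set.seed(3)
x <- runif(30, min = 30, max = 70)
y <- 100 + 80 * x - 0.5 * x^2 + rnorm(30, sd = 100)

fit_lin  <- lm(y ~ x)             # reduced model (H0: beta2 = 0)
fit_quad <- lm(y ~ x + I(x^2))    # full model

# t-ratio for the quadratic coefficient in the full model.
t_quad <- summary(fit_quad)$coefficients["I(x^2)", "t value"]

# F statistic comparing the reduced model with the full model.
f_nested <- anova(fit_lin, fit_quad)$F[2]

print(c(t_squared = t_quad^2, F = f_nested))
```

Both tests have the same p-value, so either can be used to decide whether the linear model is adequate.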
Caution with the quadratic model
Extrapolation: the fitted curve has a maximum at
$\text{MAXOXY} = \dfrac{88.3071}{2 \times 0.5362} \approx 82$
and declines for higher ’MAXOXY’, which seems unlikely to represent the
real relationship.
Extrapolation can be dangerous.
The quadratic model might not be realistic for these data.
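The location of the maximum follows from setting the derivative of the fitted quadratic to zero. A short sketch of the arithmetic in R, using the two coefficient values quoted above (the names b1, b2, and x_max are chosen here for illustration):

```r
# Coefficient estimates as quoted on the slide; the quadratic coefficient is negative.
b1 <- 88.3071
b2 <- -0.5362

# For b0 + b1*x + b2*x^2 the derivative is b1 + 2*b2*x,
# which is zero at x = -b1 / (2*b2).
x_max <- -b1 / (2 * b2)
print(round(x_max, 1))  # 82.3
```

Since b2 is negative, this turning point is a maximum, and the fitted curve decreases for larger values of MAXOXY.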
An alternative analysis
The graph of ’IGG’ against ’log(MAXOXY)’ is closer to linear (next slide).
Fit the corresponding model, $E(Y) = \beta_0 + \beta_1 \log(X)$ (see the text file for output).
[Figure: scatter plot of IGG against log(MAXOXY).]
Graph the two models
[Figure: IGG against MAXOXY with the fitted quadratic and logarithmic curves.]
Comparison of the two models
The adjusted R-squared values are similar: 0.933 for the quadratic model and 0.932 for the logarithmic model.
The logarithmic curve (blue in the graph) continues to increase indefinitely, but with diminishing slope, whereas the quadratic curve eventually turns downward.
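A comparison of this kind can be sketched in R as follows. The data here are simulated from a logarithmic relationship purely for illustration, so the adjusted R-squared values will not reproduce the 0.933 and 0.932 reported for the real data.

```r
# Simulated data (hypothetical values, not the AEROBIC data).
set.seed(4)
x <- runif(30, min = 35, max = 70)
y <- -4000 + 1400 * log(x) + rnorm(30, sd = 80)

fit_quad <- lm(y ~ x + I(x^2))   # quadratic model
fit_log  <- lm(y ~ log(x))       # logarithmic model

# Adjusted R-squared for each model.
adj_r2 <- c(quadratic   = summary(fit_quad)$adj.r.squared,
            logarithmic = summary(fit_log)$adj.r.squared)
print(round(adj_r2, 3))

# Predictions at and beyond the upper end of the observed range.
new_x <- data.frame(x = c(70, 90, 110, 130))
pred <- cbind(new_x,
              quadratic   = predict(fit_quad, new_x),
              logarithmic = predict(fit_log, new_x))
print(pred)
```

Within the observed range the two fits are usually hard to tell apart; the predictions outside it are where the choice of model matters.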
Qualitative variables
A qualitative variable (or factor) is one that indicates membership in one of several categories.
E.g., a person’s ’gender’ = ’male’ or ’female’: a qualitative variable
with two levels, indicating membership of one of two categories.
E.g., package ’type’ = ’Fragile’, ’Semifragile’, or ’Durable’: three levels,
corresponding to three categories.
We code a qualitative variable using indicator (dummy) variables:
Choose one level to use as a base or reference level, say ’male’ or
’Durable’.
For each other level, create a variable
$X_j = \begin{cases} 1 & \text{if the item is in this category} \\ 0 & \text{otherwise.} \end{cases}$
For gender, there is only one other category, so the only indicator variable is
$X = \begin{cases} 1 & \text{for a female} \\ 0 & \text{for a male.} \end{cases}$
For packages, there are two other categories, so the indicator variables are
$X_1 = \begin{cases} 1 & \text{for a fragile package} \\ 0 & \text{otherwise,} \end{cases}$
$X_2 = \begin{cases} 1 & \text{for a semifragile package} \\ 0 & \text{otherwise.} \end{cases}$
For any item, at most one of the indicator variables is non-zero, indicating a
non-base category;
if they are all zero, the item belongs to the base category.
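The coding scheme can be sketched in R as follows. The short vector of package types is a made-up example; the second half shows that model.matrix() builds the same indicators from a factor once ’Durable’ is set as the base level.

```r
# A small hypothetical sample of package types.
type <- c("Fragile", "Semifragile", "Durable", "Fragile", "Durable")

# Indicator variables coded by hand, with 'Durable' as the base level.
x1 <- as.integer(type == "Fragile")
x2 <- as.integer(type == "Semifragile")
print(data.frame(type, x1, x2))

# At most one indicator is non-zero for any item.
print(all(x1 + x2 <= 1))

# The same coding produced by R from a factor with 'Durable' as reference level.
f <- relevel(factor(type), ref = "Durable")
mm <- model.matrix(~ f)
print(mm)
```

A variable with k levels therefore needs k - 1 indicator variables, with the base level represented by all indicators being zero.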
Example: cost of shipping packages
Example 4.9 in textbook
Y: cost of shipping a package (dollars)
Qualitative variable: package type (fragile, semifragile, or durable)
X1, X2: indicator variables defined on the previous slide
Data: see next slide.
Model: $E(Y) = \beta_0 + \beta_1 X_1 + \beta_2 X_2$.
##    COST    CARGO X1 X2
## 1  17.2  Fragile  1  0
## 2  11.1  Fragile  1  0
## 3  12.0  Fragile  1  0
## 4  10.9  Fragile  1  0
## 5  13.8  Fragile  1  0
## 6   6.5 SemiFrag  0  1
## 7  10.0 SemiFrag  0  1
## 8  11.5 SemiFrag  0  1
## 9   7.0 SemiFrag  0  1
## 10  8.5 SemiFrag  0  1
## 11  2.1  Durable  0  0
## 12  1.3  Durable  0  0
## 13  3.4  Durable  0  0
## 14  7.5  Durable  0  0
## 15  2.0  Durable  0  0
Box plots
Useful for plotting a response against a categorical variable.
For the cost of shipping packages example, use the R code:
boxplot(COST ~ CARGO, data = CARGO)
Box plot: R output
[Figure: box plots of COST for the Durable, Fragile, and SemiFrag package types.]
See the text file for R output
The global F-test shows that the model is useful; that is, the mean cost is not the same for all three package types.
Note that the intercept is the fitted value for X1 = X2 = 0; that is, the mean cost for ’Durable’ packages.
The coefficient of X1 measures the difference in mean cost between ’Fragile’ and ’Durable’ packages.
The coefficient of X2 measures the difference in mean cost between ’SemiFrag’ and ’Durable’ packages.
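These interpretations can be verified directly from the 15 observations in the data table. The sketch below rebuilds the data frame by hand (the object name CARGO follows the slides; group_means is a name chosen here) and compares the fitted coefficients with the sample means of the three groups.

```r
# Data as listed in the table for the shipping-cost example.
CARGO <- data.frame(
  COST  = c(17.2, 11.1, 12.0, 10.9, 13.8,
            6.5, 10.0, 11.5, 7.0, 8.5,
            2.1, 1.3, 3.4, 7.5, 2.0),
  CARGO = rep(c("Fragile", "SemiFrag", "Durable"), each = 5)
)
CARGO$X1 <- as.integer(CARGO$CARGO == "Fragile")
CARGO$X2 <- as.integer(CARGO$CARGO == "SemiFrag")

fit <- lm(COST ~ X1 + X2, data = CARGO)
print(coef(fit))

# Sample mean cost for each package type.
group_means <- tapply(CARGO$COST, CARGO$CARGO, mean)
print(group_means)
```

The intercept equals the ’Durable’ mean (3.26), and the coefficients of X1 and X2 equal the ’Fragile’ and ’SemiFrag’ means (13.0 and 8.7) minus the ’Durable’ mean, that is, 9.74 and 5.44.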
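Fitting a model with qualitative variables in R
An alternative and simpler way is to put the qualitative variable directly in the model formula:
fit <- lm(COST ~ CARGO, data = CARGO)
R then creates the indicator variables itself; by default the base level is the first level in alphabetical order, here ’Durable’.
Compare with:
fit <- lm(COST ~ X1 + X2, data = CARGO)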
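The two calls can be compared side by side. This sketch rebuilds the shipping-cost data from the table so that it runs on its own; the names fit_factor and fit_dummy are chosen here for illustration.

```r
# Data as listed in the table for the shipping-cost example.
CARGO <- data.frame(
  COST  = c(17.2, 11.1, 12.0, 10.9, 13.8,
            6.5, 10.0, 11.5, 7.0, 8.5,
            2.1, 1.3, 3.4, 7.5, 2.0),
  CARGO = rep(c("Fragile", "SemiFrag", "Durable"), each = 5)
)
CARGO$X1 <- as.integer(CARGO$CARGO == "Fragile")
CARGO$X2 <- as.integer(CARGO$CARGO == "SemiFrag")

# Qualitative variable used directly: R builds the indicators itself.
fit_factor <- lm(COST ~ CARGO, data = CARGO)
# Hand-coded indicator variables.
fit_dummy  <- lm(COST ~ X1 + X2, data = CARGO)

print(names(coef(fit_factor)))
same_fit <- all.equal(unname(coef(fit_factor)), unname(coef(fit_dummy)))
print(same_fit)
```

The coefficient names in the first fit combine the variable name with the level (for example CARGOFragile), which makes the output easier to read than X1 and X2. A different base level can be chosen with relevel() if ’Durable’ is not the comparison of interest.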