
ST430: Introduction to Regression Analysis, Ch. 3, Sec. 3.9-3.11
Luo Xiao
September 9, 2015
Simple Linear Regression
Estimation and prediction of response

An attractive feature of a regression equation like

E(Y | X) = β0 + β1 X

is that it is valid for values of X different from those in the data set,
x1, x2, ..., xn.

That is, we can use it to estimate what E(Y | X) would be for some X that
was not part of the experiment.

But... using it for some X that is far from all of x1, x2, ..., xn is
extrapolation, and runs the risk that the model may not be a good
approximation.
Estimation of response

The estimate of E(Y | X = xp) for some particular xp is

ŷ(xp) = β̂0 + β̂1 xp.

This is a statistic, so it has a sampling distribution:

it is unbiased:
E[ŷ(xp)] = E(β̂0 + β̂1 xp) = β0 + β1 xp = E(Y | X = xp);

its standard error is
σ_ŷ(xp) = σ √( 1/n + (xp − x̄)² / SSxx ).
Example: in the advertising/sales example, the least squares line is

ŷ(x) = −0.1 + 0.7x.

So if x = 4 (advertising expenditure = $400), we estimate the expected
revenue to be

ŷ(4) = 2.7,

or $2,700.

The estimated standard error of this estimate is

s_ŷ(4) = 0.61 × √( 1/5 + (4 − 3)² / 10 ) = 0.332,

or $332.
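
This standard error can be reproduced in R from the formula; a minimal sketch, using the manually entered data from the R slides below (which reproduce the advertising/sales example), with summary(fit)$sigma playing the role of s:

x = c(1,2,3,4,5)                 # advertising expenditure
y = c(1,1,2,2,4)                 # sales revenue
fit = lm(y~x)                    # least squares fit
s = summary(fit)$sigma           # estimate of sigma, about 0.61
xp = 4                           # the new x value
SSxx = sum((x - mean(x))^2)      # = 10
s*sqrt(1/length(x) + (xp - mean(x))^2/SSxx)  # about 0.332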
Prediction of response
Note:
ŷ(xp) = β̂0 + β̂1 xp

is the estimate of E(Y | X = xp), the expected value of Y when X = xp.

Sometimes we want to predict the actual value of Y for a new observation
at X = xp.
Example: if the store spends $400 on advertising next month, what can we
predict about revenue?
Our best guess, the predicted value, is still ŷ(xp).

But the error is larger: the standard error of prediction is

σ_[y − ŷ(xp)] = σ √( 1 + 1/n + (xp − x̄)² / SSxx ).

Compare with

σ_ŷ(xp) = σ √( 1/n + (xp − x̄)² / SSxx ).

In the example, s_[y − ŷ(4)] = 0.690, or $690.
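
The extra "1 +" under the square root is what inflates the prediction error; continuing the sketch above (same assumed variables x, s, xp, SSxx):

s*sqrt(1 + 1/length(x) + (xp - mean(x))^2/SSxx)  # about 0.690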
Estimation and prediction of response in R
# input data manually
x = c(1,2,3,4,5) # x is a vector of 5 scalars
y = c(1,1,2,2,4) # y is a vector of 5 scalars
# fit and estimate
fit = lm(y~x) # lm() is the linear regression function
predict(fit,newdata=data.frame(x=4),
interval = "confidence") # estimate mean of Y at x = 4
##   fit      lwr      upr
## 1 2.7 1.644502 3.755498
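
A usage note (not on the original slide): predict() can also return the standard error of the estimated mean directly, which should agree with the 0.332 computed earlier:

predict(fit, newdata=data.frame(x=4), se.fit=TRUE)$se.fit  # standard error of the estimated mean at x = 4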
# load in data
setwd("~/Dropbox/teaching/2015Fall/R_datasets/Exercises&Exampl
load("ADSALES.Rdata")
x = ADSALES$ADVEXP_X # x is the advertising expenditure
y = ADSALES$SALES_Y # y is the sales revenue
# fit and predict
fit = lm(y~x) # lm() is the linear regression function
predict(fit,newdata=data.frame(x=4),
interval = "prediction") # predict Y at x = 4
##   fit       lwr      upr
## 1 2.7 0.5028056 4.897194
A complete example: fire damage
How does the cost of fire damage vary with distance from the nearest fire
station?
Here x is distance in miles, and y is the cost of damage in thousands of
dollars.
Steps in the analysis (not quite the same as in the text):

1. Access and plot the data
2. Overlay the least squares line
3. Summarize the fitted model
Step 1: access and plot the data
# first load in data
setwd("~/Dropbox/teaching/2015Fall/R_datasets/Exercises&Exampl
load("FIREDAM.Rdata")
head(FIREDAM) # show part of the data
hist(FIREDAM$DAMAGE) # plot a histogram of response
plot(FIREDAM) # plot the data
R output: show part of the data
##   DISTANCE DAMAGE
## 1      3.4   26.2
## 2      1.8   17.8
## 3      4.6   31.3
## 4      2.3   23.1
## 5      3.1   27.5
## 6      5.5   36.0
R output: histogram

[Figure: "Histogram of FIREDAM$DAMAGE"; x-axis: FIREDAM$DAMAGE (10 to 45), y-axis: Frequency (0 to 3)]
R output: scatterplot

[Figure: scatterplot of DAMAGE (15 to 40) versus DISTANCE (1 to 6)]
Step 2: overlay the least squares line
# first load in data
setwd("~/Dropbox/teaching/2015Fall/R_datasets/Exercises&Exampl
load("FIREDAM.Rdata")
plot(FIREDAM) # plot the data
FIREDAM.lm <- lm(DAMAGE ~ DISTANCE, data = FIREDAM) # fit
abline(reg = FIREDAM.lm, col = "blue") # plot the line
No obvious issues with the straight-line fit.
R output

[Figure: scatterplot of DAMAGE versus DISTANCE with the fitted least squares line overlaid]
Step 3: summarize the fitted model

Use the R function summary().

Least squares line: ŷ = 10.2779 + 4.9193x.
Residual standard error: 2.316 on 13 degrees of freedom.
Test of H0: β1 = 0:
  t = 12.525 with the same 13 degrees of freedom;
  P-value: 1.25 × 10⁻⁸;
  very strong evidence against H0.
95% confidence interval: 4.0709 < β1 < 5.7678.
Coefficient of determination (R-squared): 0.9235.
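
These numbers come from summary() together with confint(); a minimal sketch of the calls, assuming the FIREDAM.lm fit from Step 2:

summary(FIREDAM.lm)  # coefficients, residual standard error, t test, R-squared
confint(FIREDAM.lm)  # 95% confidence intervals for the intercept and slope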
Further steps
Using multiple regression (next topic), check whether the straight line
model is adequate by fitting the quadratic model
E(Y) = β0 + β1 x + β2 x²

and testing H0: β2 = 0 (see the R sketch below).
Use graphical regression diagnostics (later topic) to check the residuals
(yi − ŷi) for issues.
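
A minimal sketch of the quadratic check in R, assuming FIREDAM.lm from Step 2 (the anova() comparison is an addition, not from the slides):

# I() protects the squared term inside the model formula
FIREDAM.quad = lm(DAMAGE ~ DISTANCE + I(DISTANCE^2), data = FIREDAM)
summary(FIREDAM.quad)            # t test of H0: beta2 = 0 is the I(DISTANCE^2) row
anova(FIREDAM.lm, FIREDAM.quad)  # equivalent F test comparing the two models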
Regression through the origin
In the usual straight-line model
E(Y) = β0 + β1 x,
the intercept β0 is estimated from the data.
In some situations we may know that E (Y ) = 0 when x = 0, or in other
words that β0 = 0.
We should then fit the simpler “regression through the origin” model
E(Y) = β1 x.
Advertising and revenue example
# input data manually
x = c(1,2,3,4,5) # x is a vector of 5 scalars
y = c(1,1,2,2,4) # y is a vector of 5 scalars
fit = lm(y~-1 + x) # fit through the origin
summary(fit) # output summary
Caution
When the intercept is omitted, the coefficient of determination is calculated
as

R² = 1 − Σ(yi − ŷi)² / Σ yi².

Because the denominator is Σ yi² in place of SSyy = Σ(yi − ȳ)², this R²
cannot be compared with the coefficient of determination in a model that
contains the intercept.
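
As an illustration (a sketch, not from the slides), the R² that summary() reports for the no-intercept fit on the previous slide can be reproduced from this formula:

summary(fit)$r.squared          # R-squared reported for the no-intercept fit
1 - sum(resid(fit)^2)/sum(y^2)  # should match: the formula with sum(y^2) in the denominator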
What we “know” may not be true.
Cancer treatment example
X = dosage of a drug for cancer patients,
Y = increase in pulse rate after 1 minute
If X = 0, the patient takes no drug, so pulse rate should not change.
But a patient with zero dosage may be given a placebo instead, and pulse
rate may change simply because of taking any medication (a placebo effect).
We would usually include the intercept in a situation like this.