ST430: Introduction to Regression Analysis, Ch. 3, Sec. 3.9-3.11
Luo Xiao
September 9, 2015

Simple Linear Regression

Estimation and prediction of response

An attractive feature of a regression equation like

    E(Y | X) = β0 + β1 X

is that it is valid for values of X different from those in the data set, x1, x2, ..., xn. That is, we can use it to estimate what E(Y | X) would be for some X that was not part of the experiment.

But using it for some X that is far from all of x1, x2, ..., xn is extrapolation, and runs the risk that the model may not be a good approximation there.

Estimation of response

The estimate of E(Y | X = xp) for some particular xp is

    ŷ(xp) = β̂0 + β̂1 xp.

This is a statistic, so it has a sampling distribution:

it is unbiased: E[ŷ(xp)] = E(β̂0 + β̂1 xp) = β0 + β1 xp = E(Y | X = xp);

its standard error is

    σ_ŷ(xp) = σ √( 1/n + (xp − x̄)² / SSxx ).

Example: in the advertising/sales example, the least squares line is

    ŷ(x) = −0.1 + 0.7x.

So if x = 4 (advertising expenditure = $400), we estimate the expected revenue to be ŷ(4) = 2.7, or $2,700.

The estimated standard error of this estimate is

    s_ŷ(4) = 0.61 × √( 1/5 + (4 − 3)²/10 ) = 0.332,

or $332.

Prediction of response

Note: ŷ(xp) = β̂0 + β̂1 xp is the estimate of E(Y | X = xp), the expected value of Y when X = xp. Sometimes we want to predict the actual value of Y for a new observation at X = xp.

Example: if the store spends $400 on advertising next month, what can we predict about revenue?

Our best guess, the predicted value, is still ŷ(xp). But the error is larger: the standard error of prediction is

    σ_[y − ŷ(xp)] = σ √( 1 + 1/n + (xp − x̄)² / SSxx ).

Compare with

    σ_ŷ(xp) = σ √( 1/n + (xp − x̄)² / SSxx ).

In the example, s_[y − ŷ(4)] = 0.690, or $690.

Estimation and prediction of response in R

# input data manually
x = c(1,2,3,4,5)   # x is a vector of 5 scalars
y = c(1,1,2,2,4)   # y is a vector of 5 scalars

# fit and estimate
fit = lm(y ~ x)    # lm() is the linear regression function
predict(fit, newdata = data.frame(x = 4),
        interval = "confidence")   # estimate mean of Y at x = 4

##   fit      lwr      upr
## 1 2.7 1.644502 3.755498

# load in data
setwd("~/Dropbox/teaching/2015Fall/R_datasets/Exercises&Exampl
load("ADSALES.Rdata")
x = ADSALES$ADVEXP_X   # x is the advertising expenditure
y = ADSALES$SALES_Y    # y is the sales revenue

# fit and predict
fit = lm(y ~ x)        # lm() is the linear regression function
predict(fit, newdata = data.frame(x = 4),
        interval = "prediction")   # predict Y at x = 4

##   fit       lwr      upr
## 1 2.7 0.5028056 4.897194
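The intervals reported by predict() above come directly from the two standard-error formulas. As a sanity check, here is a minimal sketch (not part of the original slides) that reproduces the 0.332 and 0.690 values by hand from the same manually entered advertising data; the names xbar, se.mean, and se.pred are just illustrative.

# a sketch reproducing the standard errors behind the intervals above
x <- c(1, 2, 3, 4, 5)
y <- c(1, 1, 2, 2, 4)
fit <- lm(y ~ x)

n    <- length(x)
s    <- summary(fit)$sigma            # residual standard error, the estimate of sigma (about 0.61)
xbar <- mean(x)
SSxx <- sum((x - xbar)^2)
xp   <- 4

se.mean <- s * sqrt(1/n + (xp - xbar)^2 / SSxx)       # SE for estimating E(Y | X = 4): about 0.332
se.pred <- s * sqrt(1 + 1/n + (xp - xbar)^2 / SSxx)   # SE for predicting a new Y at X = 4: about 0.690

# predict() reports the same SE for the estimated mean;
# the confidence interval above is 2.7 +/- qt(0.975, n - 2) * se.mean
predict(fit, newdata = data.frame(x = xp), se.fit = TRUE)$se.fit

The prediction interval reported above is obtained the same way, with se.pred in place of se.mean.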
A complete example: fire damage

How does the cost of fire damage vary with distance from the nearest fire station? Here x is distance in miles, and y is the cost of damage in thousands of dollars.

Steps in the analysis (not quite the same as in the text):

1. Access and plot the data
2. Overlay the least squares line
3. Summarize the fitted model

Step 1: access and plot the data

# first load in data
setwd("~/Dropbox/teaching/2015Fall/R_datasets/Exercises&Exampl
load("FIREDAM.Rdata")

head(FIREDAM)          # show part of the data
hist(FIREDAM$DAMAGE)   # plot a histogram of the response
plot(FIREDAM)          # plot the data

R output: show part of the data

##   DISTANCE DAMAGE
## 1      3.4   26.2
## 2      1.8   17.8
## 3      4.6   31.3
## 4      2.3   23.1
## 5      3.1   27.5
## 6      5.5   36.0

[Figure: histogram of FIREDAM$DAMAGE]

[Figure: scatterplot of DAMAGE versus DISTANCE]

Step 2: overlay the least squares line

# first load in data
setwd("~/Dropbox/teaching/2015Fall/R_datasets/Exercises&Exampl
load("FIREDAM.Rdata")

plot(FIREDAM)                                         # plot the data
FIREDAM.lm <- lm(DAMAGE ~ DISTANCE, data = FIREDAM)   # fit
abline(reg = FIREDAM.lm, col = "blue")                # plot the line

No obvious issue.

[Figure: scatterplot of DAMAGE versus DISTANCE with the fitted least squares line overlaid]

Step 3: summarize the fitted model

Use the R function summary().

Least squares line: ŷ = 10.2779 + 4.9193x.
Residual standard error: 2.316 on 13 degrees of freedom.
Test of H0: β1 = 0: t = 12.525 on the same 13 degrees of freedom; P-value 1.25 × 10⁻⁸; very strong evidence against H0.
95% confidence interval for β1: 4.0709 < β1 < 5.7678.
Coefficient of determination: R² = 0.9235.

Further steps

Using multiple regression (next topic), check whether the straight-line model is adequate by fitting the quadratic model

    E(Y) = β0 + β1 x + β2 x²

and testing H0: β2 = 0.

Use graphical regression diagnostics (later topic) to check the residuals yi − ŷi for issues.

Regression through the origin

In the usual straight-line model E(Y) = β0 + β1 x, the intercept β0 is estimated from the data.

In some situations we may know that E(Y) = 0 when x = 0, or in other words that β0 = 0. We should then fit the simpler "regression through the origin" model

    E(Y) = β1 x.

Example: advertising and revenue

# input data manually
x = c(1,2,3,4,5)   # x is a vector of 5 scalars
y = c(1,1,2,2,4)   # y is a vector of 5 scalars

fit = lm(y ~ -1 + x)   # fit through the origin
summary(fit)           # output the summary

Caution

When the intercept is omitted, the coefficient of determination is calculated as

    R² = 1 − Σ(yi − ŷi)² / Σ yi².

Because the denominator is Σ yi² in place of SSyy = Σ(yi − ȳ)², this R² cannot be compared with the coefficient of determination in a model that contains the intercept.
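This caution can be made concrete with a small sketch (not in the original slides), again using the manually entered advertising data; fit1 and fit0 are illustrative names. Because Σ yi² is never smaller than SSyy, the no-intercept R² tends to look larger even when the fit is not better.

# compare R-squared with and without the intercept (advertising data)
x <- c(1, 2, 3, 4, 5)
y <- c(1, 1, 2, 2, 4)

fit1 <- lm(y ~ x)        # with intercept
fit0 <- lm(y ~ -1 + x)   # through the origin

summary(fit1)$r.squared   # denominator is SSyy = sum((y - mean(y))^2)
summary(fit0)$r.squared   # denominator is sum(y^2): a different baseline

# the same quantities recomputed by hand, making the denominators explicit
1 - sum(resid(fit1)^2) / sum((y - mean(y))^2)
1 - sum(resid(fit0)^2) / sum(y^2)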
What we "know" may not be true.

Cancer treatment example

X = dosage of a drug for cancer patients; Y = increase in pulse rate after 1 minute.

If X = 0, the patient takes no drug, so the pulse rate should not change. But a patient with zero dosage may instead be given a placebo, and the pulse rate may change simply because some medication was taken.

We would usually include the intercept in a situation like this.