ST430: Introduction to Regression Analysis, Ch4, Sec 4.6-4.10
Luo Xiao
September 21, 2015

Multiple Linear Regression

Testing the utility of a model

Usually, the first test is an overall test of the model:

H0: β1 = β2 = · · · = βk = 0.
Ha: at least one βi ≠ 0.

H0 asserts that none of the independent variables affects Y; if this hypothesis is not rejected, the data provide no evidence that the model is useful. For instance, its predictions are not demonstrably better than simply using ȳ.

The test statistic is usually denoted F.

F test statistic

$$F = \frac{(SS_{yy} - SSE)/k}{SSE/(n-k-1)} = \frac{\text{Mean Square (Model)}}{MSE}.$$

F is the ratio of the explained variability to the unexplained variability, each divided by its degrees of freedom.

If H0 is true, F has an F-distribution with k and n − k − 1 degrees of freedom.

Reject H0 if F is large, i.e., F > Fα, based on k and n − k − 1 degrees of freedom.

[Figure: rejection region for the global F-test]

Multiple linear regression: grandfather clocks data

Variables
Y: sale price of a clock at an auction
X1: age of the clock
X2: number of bidders in the auction

Model
Y = β0 + β1 X1 + β2 X2 + ε

Multiple linear regression in R: grandfather clocks data

    # load in data
    setwd("~/Dropbox/teaching/2015Fall/R_datasets/Exercises&Exampl
    load("GFCLOCKS.Rdata")
    fit = lm(PRICE ~ AGE + NUMBIDS, data = GFCLOCKS)  # linear fit
    summary(fit)  # display results

F-test results: grandfather clocks data

F-statistic: 120.2.
Degrees of freedom for the F-test: 2 and 29.
P-value for the F-test: 9.2 × 10^−15.
Conclusion: Reject H0 in favor of Ha.

[Figure: SAS output for the grandfather clocks data]

Testing the slopes

Testing an individual slope works the same way as in simple linear regression models.

For the grandfather clocks data, both the age of the clock and the number of bidders are significant predictors.

Multiple coefficient of determination / R-squared

Recall that:

$$R^2 = \frac{SS_{yy} - SSE}{SS_{yy}} = 1 - \frac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2}.$$

The interpretation of R² is still the fraction of variance "explained" by the regression model.

Its square root R is the multiple correlation coefficient: the correlation between Y and the fitted values Ŷ, which summarizes how strongly Y is related to the independent variables jointly.

Adjusted R-squared

R² never decreases when new variables are added to the model.

Because the regression model is adapted to the sample data, it tends to explain more variance in the sample data than it will in new data.

Rewrite:

$$1 - R^2 = \frac{SSE}{SS_{yy}} = \frac{\frac{1}{n-1}\sum (y_i - \hat{y}_i)^2}{\frac{1}{n-1}\sum (y_i - \bar{y})^2}.$$

Ideally we want:

$$1 - R^2 = \frac{\text{Variance of residuals}}{\text{Variance of } Y},$$

but $\frac{1}{n-1}\sum (y_i - \hat{y}_i)^2$ is a biased estimator of the variance of the residuals.

Replace $\frac{1}{n-1}$ in the numerator with the multiplier that gives the unbiased variance estimator:

$$\frac{\frac{1}{n-k-1}\sum (y_i - \hat{y}_i)^2}{\frac{1}{n-1}\sum (y_i - \bar{y})^2},$$

where k + 1 is the number of estimated βs.

This defines the adjusted R-squared:

$$R_a^2 = 1 - \frac{\frac{1}{n-k-1}\sum (y_i - \hat{y}_i)^2}{\frac{1}{n-1}\sum (y_i - \bar{y})^2}.$$

Rewrite:

$$R_a^2 = 1 - \frac{n-1}{n-k-1} \times \frac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2} = 1 - \frac{n-1}{n-k-1}\,(1 - R^2).$$

Ra² < R² whenever R² < 1, and for a poorly fitting model you may even find Ra² < 0!
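The relation Ra² = 1 − (n − 1)/(n − k − 1) × (1 − R²) can be checked numerically. The following is a minimal sketch on simulated data, not the grandfather clocks data: the variables `x1`, `x2`, `noise`, `y` and the helper `adj_r2` are hypothetical names introduced only for illustration.

```r
# Simulated data (hypothetical; not the grandfather clocks data)
set.seed(1)
n     <- 40
x1    <- rnorm(n)
x2    <- rnorm(n)
noise <- rnorm(n)                       # unrelated to y by construction
y     <- 1 + 2 * x1 - x2 + rnorm(n)

# Adjusted R-squared from R-squared, sample size n, and k predictors
adj_r2 <- function(r2, n, k) 1 - (n - 1) / (n - k - 1) * (1 - r2)

fit_small <- lm(y ~ x1 + x2)            # k = 2
fit_big   <- lm(y ~ x1 + x2 + noise)    # k = 3, one useless predictor added

r2_small <- summary(fit_small)$r.squared
r2_big   <- summary(fit_big)$r.squared

# R-squared cannot go down when a variable is added
r2_big >= r2_small

# The hand formula agrees with the value reported by summary()
all.equal(adj_r2(r2_small, n, 2), summary(fit_small)$adj.r.squared)
all.equal(adj_r2(r2_big,   n, 3), summary(fit_big)$adj.r.squared)
```

Comparing `summary(fit_small)$adj.r.squared` with `summary(fit_big)$adj.r.squared` shows whether the penalty for the extra predictor outweighs the small gain in R² for this particular sample.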
Because Ra² adjusts for the number of covariates in the model, it can be used for comparing different models: select the model that has the largest Ra².

In particular, Ra² does not necessarily increase when a new variable is added to the model.

Estimation and prediction

The multiple regression model may be used to make statements about the response that would be observed under a new set of conditions

Xnew = (X1,new, X2,new, . . . , Xk,new).

As before, the statement may be about:
E(Y | X = Xnew), the expected value of Y under the new conditions;
a single new observation of Y under the new conditions.

The point estimate of E(Y | X = Xnew) and the point prediction of Y when X = Xnew are both

$$\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X_{1,new} + \cdots + \hat{\beta}_k X_{k,new}.$$

The standard errors are different because, as always, Y = E(Y) + ε: a prediction must also account for the variability of the error term ε.

The prediction interval is wider than the confidence interval.

Question: Estimation and prediction (with 99% intervals) for the sale price of a 150-year-old clock with 10 bidders at the auction.
    # load in data
    setwd("~/Dropbox/teaching/2015Fall/R_datasets/Exercises&Exampl
    load("GFCLOCKS.Rdata")
    fit = lm(PRICE ~ AGE + NUMBIDS, data = GFCLOCKS)  # linear fit
    predict(fit, newdata = data.frame(AGE = 150, NUMBIDS = 10),
            interval = "confidence", level = .99)  # confidence interval
    predict(fit, newdata = data.frame(AGE = 150, NUMBIDS = 10),
            interval = "prediction", level = .99)  # prediction interval

R output:

    ##        fit     lwr      upr
    ## 1 1431.665 1363.92 1499.409
    ##        fit      lwr      upr
    ## 1 1431.665 1057.545 1805.785

The first row is the 99% confidence interval for E(Y | X = Xnew); the second, wider row is the 99% prediction interval for a single new sale price.

Interaction models

One property of the first-order model

E(Y) = β0 + β1 X1 + β2 X2 + · · · + βk Xk

is that βi is the change in E(Y) as Xi increases by 1 with all the other independent variables held fixed, and is the same regardless of the values of those other variables.

Not all real-world situations work like that.

When the magnitude of the effect of one variable is affected by the level of another variable, we say that they interact.

A simple model for two factors that interact is

E(Y) = β0 + β1 X1 + β2 X2 + β3 X1 X2.

Rewrite this in two ways:

E(Y) = (β0 + β2 X2) + (β1 + β3 X2) X1
     = (β0 + β1 X1) + (β2 + β3 X1) X2.

Holding X2 fixed, the slope of E(Y) against X1 is β1 + β3 X2.

Holding X1 fixed, the slope of E(Y) against X2 is β2 + β3 X1.

We can fit this using k = 3:

E(Y) = β0 + β1 X1 + β2 X2 + β3 X3

if we set X3 = X1 X2.

R code for interaction model: grandfather clocks

    lm(PRICE ~ AGE*NUMBIDS, data = GFCLOCKS)

Note: the formula 'PRICE ~ AGE*NUMBIDS' specifies the interaction model, which includes the separate effects of 'AGE' and 'NUMBIDS', together with their product, which will be labeled 'AGE:NUMBIDS'.
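That the `*` formula is the same fit as a first-order model with an explicit product column X3 = X1 X2 can be checked on simulated data. This is only a sketch: the data frame `dat`, its columns `x1`, `x2`, `x3`, `y`, and the coefficients used to generate `y` are invented for illustration and are not taken from the grandfather clocks data.

```r
# Simulated data with a built-in interaction (hypothetical values)
set.seed(2)
n   <- 50
dat <- data.frame(x1 = runif(n, 100, 200),
                  x2 = sample(5:15, n, replace = TRUE))
dat$y <- 300 + 1 * dat$x1 - 90 * dat$x2 + 1.3 * dat$x1 * dat$x2 +
  rnorm(n, sd = 50)

# Interaction model via the formula shorthand
fit_star <- lm(y ~ x1 * x2, data = dat)

# Same model with the product built by hand as a third predictor
dat$x3   <- dat$x1 * dat$x2
fit_prod <- lm(y ~ x1 + x2 + x3, data = dat)

# The product term is labeled with a colon in the shorthand fit
names(coef(fit_star))   # "(Intercept)" "x1" "x2" "x1:x2"

# Both fits give the same estimates (up to numerical tolerance)
all.equal(unname(coef(fit_star)), unname(coef(fit_prod)))
```

The shorthand is usually preferable because R keeps track of which term is the interaction, which matters for functions such as `predict()` that rebuild the product from new values of the original variables.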
R output for interaction model: grandfather clocks

Run the following R code to see the output.

    # load in data
    setwd("~/Dropbox/teaching/2015Fall/R_datasets/Exercises&Exampl
    load("GFCLOCKS.Rdata")
    fit = lm(PRICE ~ AGE*NUMBIDS, data = GFCLOCKS)  # fit with interaction
    summary(fit)

The global F-test leads to rejection of the null hypothesis H0: β1 = β2 = β3 = 0.

Note that the t-value on the 'AGE:NUMBIDS' line is highly significant.

That is, we strongly reject the null hypothesis H0: β3 = 0.

Effectively, this test is a comparison of the interaction model with the original non-interactive (additive) model.

The other two t-statistics are usually irrelevant: if 'AGE:NUMBIDS' is important, then both 'AGE' and 'NUMBIDS' should be included in the model; do not test the corresponding null hypotheses.

We can write the fitted model as

E(PRICE) = 320 − 93 × NUMBIDS + (0.88 + 1.3 × NUMBIDS) × AGE,

meaning that the effect of age increases with the number of bidders.

Compare with the first-order model

E(PRICE) = −1339 + 12.74 × AGE + 85.95 × NUMBIDS.

One interpretation: a larger number of bidders creates more competition for the oldest clocks, and age has a stronger effect in a competitive environment.

We can also write the fitted model as

E(PRICE) = 320 + 0.88 × AGE + (−93 + 1.3 × AGE) × NUMBIDS.

Because the range of 'AGE' is [100, 200], the slope for 'NUMBIDS' is always positive; moreover, the effect of the number of bidders increases with age.
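The two slope expressions can be evaluated directly from the rounded coefficients quoted in the fitted interaction model (320, 0.88, −93, 1.3). The sketch below uses hypothetical helper names `slope_age` and `slope_bids`, and the bidder counts 5, 10, 15 are illustrative inputs rather than values read from the data; because the coefficients are rounded, the results are approximate.

```r
# Rounded coefficients of the fitted interaction model, as quoted above
b_age  <- 0.88    # AGE
b_bids <- -93     # NUMBIDS
b_int  <- 1.3     # AGE:NUMBIDS

# Slope of E(PRICE) against AGE at a given number of bidders
slope_age  <- function(numbids) b_age + b_int * numbids

# Slope of E(PRICE) against NUMBIDS at a given age
slope_bids <- function(age) b_bids + b_int * age

slope_age(c(5, 10, 15))         # 7.38 13.88 20.38
slope_bids(c(100, 150, 200))    # 37 102 167

# Age at which the NUMBIDS slope would change sign (about 71.5)
-b_bids / b_int
```

The sign change occurs at an age of roughly 71.5 years, below the stated age range of [100, 200], which is why the slope for 'NUMBIDS' stays positive throughout that range.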