
# Welcome to BUAD 310

Welcome to BUAD 310. Instructor: Kam Hamidieh. Lecture 21, Wednesday, April 9, 2014. Agenda & Announcements. Today: finish up the problem from last time and finish off Simple Linear Regression; start Multiple Regression, Chapter 23. Homework 6 is due today at 5 PM. About Exam II.

Presentation Transcript

Instructor: Kam Hamidieh

Lecture 21, Wednesday April 9, 2014

• Today:

• Finish up the problem from last time & finish off Simple Linear Regression

• Start Multiple Regression, Chapter 23

• Homework 6 is due today at 5 PM.

About Exam II

• NO CELL PHONES ARE ALLOWED.

• Two cheat sheets allowed, both sides, hand written.

• In class this Wednesday April 16.

• Coversheet will be posted by Monday, 33 questions

• Print the z and t tables and bring them with you.

• Coverage: Lecture 12, March 3 to the end of lecture 21 (minus multiple regression), April 9, and HW 4, 5, & 6

• All Exam II relevant material will be posted by tomorrow morning.

• Scantrons passed out Monday, fill out before the exam, do not bend it!

• We will spend all of Monday reviewing.

• Extended office hours:

• Monday April 14: 4-6 PM

• Tuesday April 15: 2-6 PM

To test H0: B1 = 0 vs. Ha: B1 ≠ 0:

(1) Construct the 100(1-α)% confidence interval for B1:

b1 ± tα/2 se(b1)

where tα/2 comes from a t distribution with df = n-2. Reject H0 if the interval does not contain 0.

Or (2) compute the test statistic:

$$t = \frac{b_1 - 0}{se(b_1)}$$

then get the p-value from a t distribution with df = n-2.
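As a rough illustration of both routes, the sketch below computes b1, se(b1), the test statistic t = b1 / se(b1), and a 95% confidence interval for a small simple linear regression. The data are made up for this example, and the critical value 3.182 is the t table entry for df = 3 at α/2 = 0.025; getting the p-value itself is left to software or the t table.

```python
import math

def slope_t_statistic(x, y):
    """Return (b1, se_b1, t) for the least squares line of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx                      # estimated slope
    b0 = y_bar - b1 * x_bar             # estimated intercept
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s_e = math.sqrt(sse / (n - 2))      # residual standard deviation, df = n - 2
    se_b1 = s_e / math.sqrt(sxx)        # standard error of the slope
    return b1, se_b1, b1 / se_b1        # t statistic for H0: B1 = 0

# Made-up data, for illustration only
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
b1, se_b1, t = slope_t_statistic(x, y)

# 95% confidence interval; 3.182 is the t table value for df = n - 2 = 3
t_crit = 3.182
ci = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)

print(f"b1 = {b1:.3f}, se(b1) = {se_b1:.4f}, t = {t:.2f}")
print(f"95% CI for B1: ({ci[0]:.3f}, {ci[1]:.3f})")
```

Here the interval excludes 0 and t is large, so both routes lead to rejecting H0 for this toy data set.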

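100(1-α)% confidence interval for the mean of Y at xnew is:

$$\hat{y}_{new} \pm t_{\alpha/2}\, se(\hat{y}_{new}), \qquad \hat{y}_{new} = b_0 + b_1 x_{new}$$

where tα/2 comes from a t distribution with df = n-2,

and

$$se(\hat{y}_{new}) = S_e \sqrt{\frac{1}{n} + \frac{(x_{new} - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}$$

We will generally use software.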

• “Outliers are observations that stand away from the rest of the data and appear distinct in a plot.” Imprecise!

• They can have a very strong influence on your final results.

Plot panels for this example (X = 1, 2, …, 20), with each panel's fit statistics:

r2 = 0.80, Se = 3.28

r2 = 0.25, Se = 10

r2 = 0.29, Se = 9.7

r2 = 0.92, Se = 3.2

r2 = 0.26, Se = 6.1

• There are NO hard and fast rules on how to deal with outliers except: you should not just throw them out without SOLID justification.

• Check for data entry errors. (Not always possible!)

• Examine the physical context.

• Report your results with and without outliers.

• Standardized residuals can help identify outliers too.

• Transformations can help. (This will be discussed when we cover multiple regression.)
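To see why reporting results with and without outliers matters, here is a small sketch that fits the same line twice. The data are invented for illustration (X = 1, 2, …, 20 with one artificially low point at X = 20); they are not the data behind the plots on the slides.

```python
def fit_line(x, y):
    """Least squares fit of y = b0 + b1*x; returns (b0, b1, r2)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    return b0, b1, 1 - sse / sst

# Invented data: roughly y = 3 + 2x, with small alternating noise
x = list(range(1, 21))
y = [3 + 2 * xi + (1 if xi % 2 else -1) for xi in x]
y[-1] = 5  # plant one outlier at x = 20

with_outlier = fit_line(x, y)
without_outlier = fit_line(x[:-1], y[:-1])

for label, (b0, b1, r2) in [("with", with_outlier), ("without", without_outlier)]:
    print(f"{label:>7} outlier: b0 = {b0:.2f}, b1 = {b1:.2f}, r2 = {r2:.2f}")
```

A single point at the edge of the X range pulls the slope down and lowers r2, which is why both fits belong in the report.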

• Simple Linear Regression:

• One Y and one X, fit a line that gives the mean of Y’s for a given X

• Multiple regression:

• One Y and multiple X’s, you have multiple predictors

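The observed response Y is linearly related to k explanatory variables X1, X2, …, and Xk by the equation:

$$y = B_0 + B_1 x_1 + B_2 x_2 + \dots + B_k x_k + \varepsilon, \qquad \varepsilon \sim \text{iid } N(0, \sigma_\varepsilon)$$

A single value of the response comes from a linear combination of the k variables plus an error, where the errors are iid normal.

Given fixed values of the X’s, the mean of the Y’s is equal to a linear combination of the X’s at those fixed values:

$$E(Y \mid x_1, \dots, x_k) = B_0 + B_1 x_1 + B_2 x_2 + \dots + B_k x_k$$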

Assumption (Redundant Slide?)

• Constant Variance Assumption: The variance of the error terms, σε², is the same for every combination of values of x1, x2, …, xk

• Normality Assumption: The error terms follow a normal distribution for every combination of values of x1, x2, …, xk

• Independence Assumption: The values of the error terms are statistically independent of each other

Simple regression

Data:

(x1,y1)

(x2,y2)

(xn,yn)

Assumed Model:

yi = B0 + B1 xi + εi

εi ~ iid N(0,σε)

Parameters: B0, B1, σε

Multiple regression

Data:

(y1, x11,x12,…,x1k)

(y2, x21,x22,…,x2k)

(yn, xn1,xn2,…,xnk)

Assumed Model:

yi = B0 + B1 xi1 + B2 xi2 + … + Bk xik + εi

εi ~ iid N(0,σε)

Parameters: B0, B1, B2, … , Bk, σε

Example (Page 615)

• Defaults in the subprime housing market brought down several financial institutions in 2008 (Lehman Brothers, Bear Stearns, and AIG) and led to a massive bailout of the financial system.

• Goal: A bank regulator wants to know how lenders are using credit scores to determine the rate of interest paid by subprime borrowers.

• The variables of interest are:

Y = APR, annual % rate on the loan

X1 = LTV, loan-to-value ratio: the loan amount divided by the value of the property. Values near 0 are “good”, near 1 are “bad”.

X2= Credit Score. The higher the better.

X3 = Income in 1000’s of dollars

X4 = Home value in 1000’s of dollars

• The data are n = 372 mortgages obtained from a credit bureau.

• There are 4 predictors: k = 4.

Variable names are the column headers; a row is one observation. For example, row 7 holds y7, x71, x72, x73, and x74.

APR seems linearly dependent on LTV and Credit Score and not so much on the other two.

Looking at the relationship between predictors is a good idea too.

Highest correlations are APR with LTV and Credit score.

Why are some of the boxes empty?

The values for B0, B1, …, Bk are estimated via the least squares method:

$$\sum_{i=1}^{n} \left( y_i - b_0 - b_1 x_{i1} - b_2 x_{i2} - \dots - b_k x_{ik} \right)^2$$

Pick b0, b1, …, bk so this is as small as possible.

But where is the line?

One Response Y, two predictors X1 & X2.

The method of least squares minimizes the sum of the squared vertical distances between the points and a plane.

(Picture from An Introduction to Statistical Learning with Applications in R by James, Witten, Hastie, Tibshirani)
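For the curious, the plane can be found by solving the normal equations (X'X)b = X'y. The sketch below does this in plain Python for one response and two predictors; the helper names `solve` and `least_squares` and the data are made up for this illustration, and in practice statistical software does this step for you.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):  # back substitution
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def least_squares(rows, y):
    """Return [b0, b1, ..., bk] minimizing the sum of squared residuals."""
    design = [[1.0] + list(r) for r in rows]  # leading 1 gives the intercept b0
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(p)]
    return solve(xtx, xty)

# Made-up data lying exactly on the plane y = 1 + 2*x1 - 3*x2
rows = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1), (1, 2)]
y = [1 + 2 * x1 - 3 * x2 for x1, x2 in rows]

b = least_squares(rows, y)
print([round(bi, 6) for bi in b])
```

Because the toy points sit exactly on a plane, least squares recovers its coefficients; with noisy data it returns the plane with the smallest sum of squared vertical distances.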

He may know!!!

b0 ≈ 23.73

b1≈ -1.59

b2≈ -0.018

b3≈ 0.0004

b4≈ -0.00075

The estimated regression model now is:

APR = 23.73 - 1.59(LTV) - 0.018(CreditScore) + 0.0004(StatedIncome) - 0.00075(HomeValue)

Note: y-hat gives the mean APR for a given set of predictor values.

b0 = 23.73:

When LTV = Credit Score = Stated Income = Home Value = 0, then the mean APR = 23.73%

b1 = -1.59:

Holding all other x variables fixed, when LTV goes up by 0.1, then on average APR goes down by 0.159 percentage points (1.59 × 0.1)

b2 = -0.018:

Holding all other x variables fixed, when Credit Score goes up by 1 unit, then on average APR goes down by 0.018 percentage points

etc.

Suppose we observe a subprime borrower with the following characteristics:

LTV = 0.90

Credit Score = 650

Stated Income = \$45,000

Home Value = \$400,000

Our estimated model says that on average such a customer gets:

APR = 23.73 - 1.59(0.90) - 0.018(650) + 0.0004(45) -0.00075(400)

APR ≈ 10.32%
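The arithmetic above, and the "holding all other x variables fixed" reading of the LTV slope, can be checked with a few lines of code. This is only a sketch: the coefficients are the rounded estimates quoted on the slides, and the function and variable names are invented for this example. Income and home value are entered in thousands of dollars.

```python
# Rounded least squares estimates quoted on the slides
COEF = {
    "intercept": 23.73,
    "ltv": -1.59,
    "credit_score": -0.018,
    "stated_income": 0.0004,   # per $1000 of stated income
    "home_value": -0.00075,    # per $1000 of home value
}

def estimated_apr(ltv, credit_score, stated_income, home_value):
    """Estimated mean APR (in %) for the given predictor values."""
    return (COEF["intercept"]
            + COEF["ltv"] * ltv
            + COEF["credit_score"] * credit_score
            + COEF["stated_income"] * stated_income
            + COEF["home_value"] * home_value)

# Borrower from the example: $45,000 income and $400,000 home value, in $1000s
apr = estimated_apr(ltv=0.90, credit_score=650, stated_income=45, home_value=400)
print(f"Estimated mean APR: {apr:.2f}%")

# Raise LTV by 0.1 while holding the other predictors fixed
change = estimated_apr(1.00, 650, 45, 400) - apr
print(f"Change in estimated APR: {change:.3f} percentage points")
```

The change depends only on b1 and the size of the LTV increase, not on the values at which the other predictors are held.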

Part (1): Refer to slide 15.

• What are the predictor and response values for the 9th observation?

• What are the values of y10, x24, x11,3?

Part (2) Refer to slide 25.

• Interpret the slope term for stated income variable.

• What is the estimated mean APR for customer with LTV = 0.50, Credit Score = 600, Stated Income = \$10,000, Home Value = \$200,000?

• Residuals are defined just like the simple linear regression case: residual = observed – fitted.

• The official formula:

$$e_i = y_i - \hat{y}_i = y_i - (b_0 + b_1 x_{i1} + b_2 x_{i2} + \dots + b_k x_{ik})$$

• What is the “picture” for residuals?

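• Compute the standard deviation of the residuals:

$$S_e = \sqrt{\frac{\sum_{i=1}^{n} e_i^2}{n - k - 1}} = \sqrt{\frac{SSE}{n - k - 1}} = \sqrt{MSE}$$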

• It has the same interpretation as before: it tells how far away your observed points are from the “plane” on average.

• Se estimates σε.

• The value n – k – 1 is called the residual degrees of freedom.

• SSE = Sum of Squares (due to) Error

• MSE = Mean Square (due to) Error = SSE / (n – k – 1)

n – k – 1 = 372 – 4 – 1 = 367

MSE = 1.55

SSE = 567.80

Se = 1.24
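These numbers fit together as MSE = SSE / (n – k – 1) and Se = √MSE. A quick check, using only the values quoted above (the variable names are chosen for this sketch):

```python
import math

n, k = 372, 4          # observations and predictors in the subprime example
sse = 567.80           # sum of squared residuals from the output

df = n - k - 1         # residual degrees of freedom
mse = sse / df         # mean squared error
s_e = math.sqrt(mse)   # standard deviation of the residuals

print(f"df = {df}, MSE = {mse:.2f}, Se = {s_e:.2f}")
```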