
Bivariate Linear Regression



Presentation Transcript


  1. Bivariate Linear Regression
• Scatter plots of real-world data rarely fit perfectly onto a straight line (or any other deterministic functional form), e.g. study time and exam grades.
• y = a + bx is a deterministic relationship: knowing the value of x means knowing the value of y (assuming we know the intercept and the slope). The correlation coefficient is 1 (or -1) in this case. e.g. distance traveled = speed × time; fixing speed, there is a deterministic relationship between distance and time.
• y = a + bx + e describes an imperfect linear relationship between two variables: the value of y is not completely determined by x, but is also affected by the "random error", e.
• In regression modeling, we try to fit a line through our scatter-plot data so that the sum of squared errors is minimized.
• The straight line so obtained can then be used for interpreting the relationship and for making predictions.

  2. Bivariate Linear Regression
• Regression model: yi = a + bxi + ei, so E(yi) = a + bxi. Here xi is the independent variable value for observation i; yi is the dependent/response variable value for observation i; a and b are the intercept and slope of the straight line; E(ei) = 0 by model assumption.
• Meaning of b: a one-unit increase in xi is associated with a b-unit increase in E(yi). (If b is 0 in the population, then there is no linear relationship between xi and yi.) Meaning of a: E(yi) when xi = 0.
• The correlation coefficient, r, and the coefficient of x in the regression, b, are not the same thing, but their signs will agree.
• How do we find the "best" a and b? According to the "OLS" principle.

  3. Ordinary Least Squares (OLS)
• Used to determine the "best" line: the one that is as close as possible to the data points in the vertical (y) direction (since that is what we are trying to predict).
• Least squares: find the line that minimizes the sum of the squares of the vertical distances of the data points from the line (software does this for us; soon...).
• Property of OLS estimators: BLUE (best linear unbiased estimator).
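The OLS line has a closed-form solution for b and a. A minimal sketch of those formulas, using small invented study-time/exam-grade numbers (the data are illustrative, not from the slides):

```python
# OLS estimates for y = a + bx + e, computed from the closed-form
# formulas. The data below are invented for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]   # e.g. study hours
y = [52.0, 60.0, 61.0, 70.0, 75.0]  # e.g. exam grades

n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n

# b = sum((xi - xbar)(yi - ybar)) / sum((xi - xbar)^2)
b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
    sum((xi - xbar) ** 2 for xi in x)
# a = ybar - b * xbar, so the line passes through (xbar, ybar)
a = ybar - b * xbar
```

Note the fitted line always passes through the point of means (x̄, ȳ), which follows directly from the formula for a.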

  5. Ordinary Least Squares Line

  5. Prediction Using the Regression Line: Example (Husbands' and Wives' Ages)
Source: Hand et al., A Handbook of Small Data Sets, London: Chapman and Hall.
• The estimated regression model is E(y) = 3.6 + 0.97x, where E(y) is the average age of all husbands who have wives of age x.
• For all women aged 30, we predict the average husband's age to be 3.6 + (0.97)(30) = 32.7 years.
• If an individual wife's age is 30, what would we predict her husband's age to be? (The best prediction is still 32.7, since E(e) = 0, but with more uncertainty.)
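Prediction with a fitted line is just plugging an x value into the estimated equation. A sketch using the slide's estimated coefficients (3.6 and 0.97; the function name is ours):

```python
# Prediction from the estimated line E(y) = 3.6 + 0.97x (slide 5).
a, b = 3.6, 0.97

def predict_husband_age(wife_age):
    """Predicted average husband age for wives of the given age."""
    return a + b * wife_age

age30 = predict_husband_age(30)  # 3.6 + 0.97 * 30 = 32.7 years
```

The same 32.7 serves as the best point prediction for an individual husband's age, but the prediction interval for an individual is wider than that for the group average.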

  6. Hypothesis Testing: Is There a Relationship?
• The estimated b is based on one particular sample, and so is the slope for the sample data. What is the slope in the population?
• Testing whether the population slope is zero (i.e., testing the hypothesis that there is no linear relationship between x and y):
• Recall the logic of hypothesis testing.
• Under the null hypothesis, the sampling distribution of the estimated b/sd(b) follows the Student-t distribution (very similar to the normal, just with thicker tails; a particular member of the family is characterized by one parameter, the degrees of freedom) with n-2 degrees of freedom.
• Software routinely reports the p-value from this test.
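The t statistic for the slope can be computed by hand from the residuals. A sketch with invented data (the numbers are illustrative; software would then convert t into a p-value using the Student-t distribution with n-2 degrees of freedom):

```python
import math

# t test for H0: population slope = 0, with invented data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [52.0, 60.0, 61.0, 70.0, 75.0]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)

b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
a = ybar - b * xbar

# Residual sum of squares; dividing by n-2 gives the error variance
# estimate (two coefficients were estimated, hence n-2 df).
resid_ss = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
se_b = math.sqrt(resid_ss / (n - 2) / sxx)   # standard error of b

t = b / se_b        # compare to Student-t with df = n - 2
df = n - 2
```

A large |t| (relative to the t distribution with n-2 df) gives a small p-value, i.e. evidence against "no relationship".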

  7. Goodness of Fit: the Coefficient of Determination (R2)
• Measures how well the regression line fits the data.
• R2 equals r2, the square of the correlation, and measures how much of the variation in the values of the response variable (y) is explained by the regression line.
• The distance between an observed y and the mean of y in the data set can be decomposed into two parts: from y to E(y) on the regression line (the residual), and from E(y) to the mean of all y (the part explained by the regression). R2 is defined as the regression sum of squares divided by the total sum of squares.
r = 1: R2 = 1: the regression line explains all (100%) of the variation in y.
r = .7: R2 = .49: the regression line explains about half (49%) of the variation in y.
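The decomposition above can be verified numerically: the total sum of squares splits exactly into a regression part and a residual part, and R2 is the regression share. A sketch with the same style of invented data:

```python
# Verify TSS = regression SS + residual SS, and R^2 = reg SS / TSS.
# The data are invented for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [52.0, 60.0, 61.0, 70.0, 75.0]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
    sum((xi - xbar) ** 2 for xi in x)
a = ybar - b * xbar
fitted = [a + b * xi for xi in x]

tss = sum((yi - ybar) ** 2 for yi in y)              # total variation in y
reg_ss = sum((fi - ybar) ** 2 for fi in fitted)      # explained by the line
resid_ss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # left over

r_squared = reg_ss / tss    # fraction of variation explained
```

Squaring the sample correlation r for the same data gives the same number, which is the "R2 equals r2" fact from the slide.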

  8. Using Software
• Stata: "regress"
• Example:
sysuse lifeexp
reg lexp safewater
egen ybar = mean(lexp)
twoway (scatter lexp safewater) (lfit lexp safewater) (line ybar safewater)
• "Correlation and Regression Demo" applet at the book's website (explore the effect of outliers, for example).

  9. Ordinary Least Squares (OLS): Not Robust to Outliers

  10. Using the Linear Regression Model: Extrapolation Can Be Dangerous (e.g. what would the presidential approval rating be if the inflation rate rose to 200%?)

  11. Caution: Correlation Does Not Imply Causation
• Even very strong correlations may not correspond to a real causal relationship.
• Correlation can be due to:
• The explanatory variable being a cause of the response variable, e.g. x = meditation, y = health.
• The "response" variable being a cause of the "explanatory" variable, e.g. x = divorce, y = alcohol abuse.
• Both variables resulting from a common cause (or there may be confounding), e.g. both divorce and alcohol abuse may result from an unhappy marriage.
• The correlation may be merely a coincidence.

  12. Establishing Causality
• A properly conducted experiment can establish a causal connection (or the lack of one).
• Other considerations lending support to arguments for a causal link:
• A reasonable explanation for cause and effect exists.
• The connection happens in repeated trials and under varying conditions.
• Potential confounding factors are ruled out.
• The alleged cause precedes the effect in time.
