Class 4 Simple Linear Regression
Regression Analysis • Reality is thought to behave in a manner which may be simulated (predicted) to an acceptable degree of accuracy by a simplified mathematical model. • Statistical models (which include regression) permit some degree of random error, because some variable of interest cannot be duplicated under seemingly identical conditions.
An Example • We would like to predict scores on an academic test. Ten such scores are shown below: 65 73 73 75 81 87 92 96 98 100 • A possible model of test scores: a test score, $y$, is obtained by taking the average test score, $\mu$, and adding a random value, $\varepsilon$, to it: $y = \mu + \varepsilon$
Example (cont.) • How might we estimate $\mu$? • How do we tell if our model is useful?
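One common way to approach both questions is sketched below: estimate the average score with the sample mean, and judge usefulness by how much variation is left around it. This is an illustrative sketch using the ten scores from the example; the variable names are assumptions, not part of the original slides.

```python
from statistics import mean, stdev

# The ten test scores from the example slide.
scores = [65, 73, 73, 75, 81, 87, 92, 96, 98, 100]

# The sample mean is the usual estimate of the average test score.
y_bar = mean(scores)

# Deviations of each score from the mean play the role of the random value.
deviations = [y - y_bar for y in scores]

# Total squared variation around the mean, and the sample standard deviation.
total_ss = sum(d ** 2 for d in deviations)
s = stdev(scores)

print(f"estimated mean = {y_bar}")
print(f"total squared deviation = {total_ss}")
print(f"sample standard deviation = {s:.2f}")
```

A large standard deviation relative to the mean suggests that the mean-only model predicts individual scores poorly.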
Improving the Model • We would have a more useful model if we could remove (explain) some of the variability that we see in the data. • Perhaps there exist other factors that cause variability in the test score. Can you think of some?
Improving the Model (cont.) • Here is the data including the hours of study.
Improving the Model (cont.) • We have the same problem: • Select the line that minimizes the sum of the squared vertical distances from the data points to the line. • This line is referred to as the least squares line. • Our model now looks like $y = \beta_0 + \beta_1 x + \varepsilon$ • Our estimated or fitted line will be called $\hat{y} = b_0 + b_1 x$
Another view of the Model • This (and every linear regression) model can be expressed as $y = E(y) + \varepsilon$. • So in our model, $E(y) = \beta_0 + \beta_1 x$; that is, the mean test score falls on a straight line as a function of hours of study. • The random error term, $\varepsilon$, is assumed to have a normal distribution with mean 0 and variance $\sigma^2$. • Our ability to effectively use the model depends on this variation.
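The least squares line can be computed directly from the data. The sketch below uses the standard closed-form estimates for the slope and intercept. The hours-of-study values are HYPOTHETICAL placeholders, since the data table from the slide is not reproduced here, and `fit_line` is an illustrative helper name.

```python
# Test scores from the example slide.
scores = [65, 73, 73, 75, 81, 87, 92, 96, 98, 100]
# HYPOTHETICAL hours of study, invented only to make the sketch runnable.
hours = [1, 2, 2, 3, 4, 5, 6, 7, 7, 8]


def fit_line(x, y):
    """Return (intercept, slope) of the least squares line for y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = ss_xy / ss_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


b0, b1 = fit_line(hours, scores)

# Residuals: observed score minus the value on the fitted line.
residuals = [y - (b0 + b1 * x) for x, y in zip(hours, scores)]

print(f"fitted line: y_hat = {b0:.2f} + {b1:.2f} * x")
print(f"sum of squared residuals = {sum(e ** 2 for e in residuals):.2f}")
```

No other straight line gives a smaller sum of squared residuals for the same data.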
Analysis of Variance • It turns out that the variation displayed by the variable y, referred to as the total sum of squares (SST), can be broken into two pieces: • The part explained by the variable x, called the regression sum of squares (SSR), • The part left over (the squared distances from the data points to the regression line), called the sum of squared errors or residual sum of squares (SSE). SST = SSR + SSE
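The decomposition can be checked numerically. The following sketch fits a least squares line and confirms that the two pieces add up to the total. The hours-of-study values are HYPOTHETICAL (the original data table is not available here), so only the identity itself, not the particular numbers for SSR and SSE, should be read into the output.

```python
# Test scores from the example slide.
scores = [65, 73, 73, 75, 81, 87, 92, 96, 98, 100]
# HYPOTHETICAL hours of study, used only for illustration.
hours = [1, 2, 2, 3, 4, 5, 6, 7, 7, 8]

n = len(scores)
x_bar = sum(hours) / n
y_bar = sum(scores) / n

# Least squares slope and intercept.
ss_xx = sum((x - x_bar) ** 2 for x in hours)
ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(hours, scores))
b1 = ss_xy / ss_xx
b0 = y_bar - b1 * x_bar
fitted = [b0 + b1 * x for x in hours]

# Total, regression, and error sums of squares.
sst = sum((y - y_bar) ** 2 for y in scores)
ssr = sum((f - y_bar) ** 2 for f in fitted)
sse = sum((y - f) ** 2 for y, f in zip(scores, fitted))

print(f"SST = {sst:.2f}")
print(f"SSR = {ssr:.2f}")
print(f"SSE = {sse:.2f}")
print(f"SSR + SSE = {ssr + sse:.2f}")
```

The identity holds for any least squares fit that includes an intercept; it generally fails for a line chosen some other way.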
Getting it done with EXCEL • Select tools/data analysis/regression. • $r$, the correlation • $r^2 = SSR/SST$ • $s$, the square root of $s^2$, our estimate of $\sigma^2$ • Adjusted $r^2$: $r_a^2 = 1 - (1 - r^2)[(n-1)/(n-p-1)]$
Getting it done with EXCEL (cont.) • Actual sums of squares: SSR, SSE, SST • Sums of squares divided by degrees of freedom: MSR and MSE (also $s^2$) • MSR/MSE, the F statistic • p-value
Getting it done with EXCEL (cont.) • Least squares estimates: $b_0$, $b_1$ • Standard deviation of our estimate • t-test for the hypothesis that the coefficient ($\beta$) is 0 • The important part: the p-value for the t-test • Confidence intervals for our estimates
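The summary quantities annotated on the regression output can be reproduced by hand from the sums of squares. This is a minimal sketch assuming one independent variable (p = 1) and HYPOTHETICAL hours-of-study values; it does not attempt to reproduce the layout of the spreadsheet output.

```python
import math

# Test scores from the example slide.
scores = [65, 73, 73, 75, 81, 87, 92, 96, 98, 100]
# HYPOTHETICAL hours of study, used only for illustration.
hours = [1, 2, 2, 3, 4, 5, 6, 7, 7, 8]

n = len(scores)  # number of observations
p = 1            # number of independent variables

x_bar = sum(hours) / n
y_bar = sum(scores) / n
ss_xx = sum((x - x_bar) ** 2 for x in hours)
ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(hours, scores))
b1 = ss_xy / ss_xx
b0 = y_bar - b1 * x_bar

sst = sum((y - y_bar) ** 2 for y in scores)
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(hours, scores))
ssr = sst - sse

# Mean squares: sums of squares divided by their degrees of freedom.
msr = ssr / p
mse = sse / (n - p - 1)  # also s^2, the estimate of the error variance

r2 = ssr / sst
adj_r2 = 1 - (1 - r2) * ((n - 1) / (n - p - 1))
s = math.sqrt(mse)
f_stat = msr / mse

print(f"r^2 = {r2:.4f}, adjusted r^2 = {adj_r2:.4f}")
print(f"s = {s:.4f}, F = {f_stat:.2f}")
```

Turning the F statistic into a p-value needs the F distribution with p and n - p - 1 degrees of freedom, which the spreadsheet handles and this sketch leaves out.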
Hypothesis Testing • The F-test tests whether all of the coefficients of the independent variables are zero. For our model: $H_0\colon \beta_1 = 0$ versus $H_a\colon \beta_1 \neq 0$ • The t-test tests whether each coefficient of an independent variable is zero.
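With a single independent variable the two tests address the same hypothesis, and the F statistic equals the square of the t statistic for the slope. The sketch below illustrates this with HYPOTHETICAL hours-of-study values; it computes the test statistics only, since p-values require the t or F distribution, which the Python standard library does not provide.

```python
import math

# Test scores from the example slide.
scores = [65, 73, 73, 75, 81, 87, 92, 96, 98, 100]
# HYPOTHETICAL hours of study, used only for illustration.
hours = [1, 2, 2, 3, 4, 5, 6, 7, 7, 8]

n = len(scores)
x_bar = sum(hours) / n
y_bar = sum(scores) / n
ss_xx = sum((x - x_bar) ** 2 for x in hours)
ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(hours, scores))
b1 = ss_xy / ss_xx
b0 = y_bar - b1 * x_bar

sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(hours, scores))
ssr = sum(((b0 + b1 * x) - y_bar) ** 2 for x in hours)
mse = sse / (n - 2)

# t statistic for the slope: estimate divided by its standard error.
se_b1 = math.sqrt(mse / ss_xx)
t_stat = b1 / se_b1

# F statistic: MSR / MSE, where MSR = SSR / 1 for one independent variable.
f_stat = ssr / mse

print(f"t = {t_stat:.3f}, t^2 = {t_stat ** 2:.3f}, F = {f_stat:.3f}")
```

With more than one independent variable the F-test and the individual t-tests no longer coincide.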
Using the Model • The model has two basic purposes: • (1) It can be used to provide partial confirmation of the theory that a particular factor is, indeed, influencing the response variable, y. • (2) It can be used to estimate the mean, E(y), and predict an actual value of y. • Under the function wizard, select forecast(new_x, known_y, known_x).
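A point prediction from the fitted line can be sketched as follows. The function mirrors the argument order of the spreadsheet call quoted above but is only an illustrative stand-in, not the spreadsheet's implementation, and the hours-of-study values are HYPOTHETICAL.

```python
def forecast(new_x, known_y, known_x):
    """Predict y at new_x from the least squares line fitted to the known data."""
    n = len(known_x)
    x_bar = sum(known_x) / n
    y_bar = sum(known_y) / n
    ss_xx = sum((x - x_bar) ** 2 for x in known_x)
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(known_x, known_y))
    b1 = ss_xy / ss_xx
    b0 = y_bar - b1 * x_bar
    return b0 + b1 * new_x


# Test scores from the example slide.
scores = [65, 73, 73, 75, 81, 87, 92, 96, 98, 100]
# HYPOTHETICAL hours of study, used only for illustration.
hours = [1, 2, 2, 3, 4, 5, 6, 7, 7, 8]

predicted = forecast(5, scores, hours)
print(f"predicted score after 5 hours of study: {predicted:.1f}")
```

The same number serves as the estimate of the mean E(y) and as the prediction of an individual y; the two uses differ only in how much uncertainty surrounds them.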
Using the Model • Confidence intervals can be generated (see page 555 for more discussion). Let
Using the Model • EXCEL does not provide an automatic calculation for confidence and prediction intervals. • The authors have included a macro in the spreadsheet called PredInt.xls on your data disk. • Simply open the file and follow the instructions!
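For reference, the usual textbook forms of the two intervals in simple linear regression are given below. The notation is an assumption and may differ from the discussion on page 555: $x_p$ is the value of x at which the interval is wanted, $\hat{y} = b_0 + b_1 x_p$, $s$ is the square root of MSE, $SS_{xx} = \sum (x_i - \bar{x})^2$, and $t_{\alpha/2,\,n-2}$ is the critical value of the t distribution with n - 2 degrees of freedom.

```latex
% Confidence interval for the mean E(y) at x = x_p
\hat{y} \;\pm\; t_{\alpha/2,\,n-2}\; s\,
  \sqrt{\frac{1}{n} + \frac{(x_p - \bar{x})^2}{SS_{xx}}}

% Prediction interval for an individual y at x = x_p
\hat{y} \;\pm\; t_{\alpha/2,\,n-2}\; s\,
  \sqrt{1 + \frac{1}{n} + \frac{(x_p - \bar{x})^2}{SS_{xx}}}
```

The extra 1 under the square root makes the prediction interval wider than the confidence interval, and both widen as $x_p$ moves away from $\bar{x}$.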
A Note on Correlation • Many people prefer to perform a correlational analysis before they build regression models. • In EXCEL this can be accomplished in two ways: • Under the function wizard, use correl(array1,array2) to find the correlation between two variables. • Use tools/data analysis/correlation to determine the correlation between several variables.
Correlation (cont.) • What correlation does: • Provides an easy measure to determine if two variables have a linear relationship. • Positive correlation implies that if one variable goes up, the other also tends to go up. • Negative correlation implies that if one variable goes up, the other tends to go down. • What correlation does not do: • It does not imply cause and effect. • There may exist some lurking factor that produces the behavior being witnessed.
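The correlation between two variables can also be computed by hand. The sketch below implements the standard sample correlation formula; the function name echoes the spreadsheet function mentioned above but is an illustrative stand-in, and the hours-of-study values are HYPOTHETICAL.

```python
import math


def correl(array1, array2):
    """Sample (Pearson) correlation between two equal-length sequences."""
    n = len(array1)
    mean1 = sum(array1) / n
    mean2 = sum(array2) / n
    ss_11 = sum((a - mean1) ** 2 for a in array1)
    ss_22 = sum((b - mean2) ** 2 for b in array2)
    ss_12 = sum((a - mean1) * (b - mean2) for a, b in zip(array1, array2))
    return ss_12 / math.sqrt(ss_11 * ss_22)


# Test scores from the example slide.
scores = [65, 73, 73, 75, 81, 87, 92, 96, 98, 100]
# HYPOTHETICAL hours of study, used only for illustration.
hours = [1, 2, 2, 3, 4, 5, 6, 7, 7, 8]

r = correl(hours, scores)
print(f"r = {r:.4f}, r^2 = {r ** 2:.4f}")
```

In simple linear regression the square of this correlation equals SSR/SST, which links the correlational analysis to the regression output.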