Linear Regression

Linear Regression Least Squares Method: an introduction

We are given the following ordered pairs: (1.2,1), (1.3,1.6), (1.7,2.7), (2,2), (3,1.8), (3,3), (3.8,3.3), (4,4.2). They are shown in the scatterplot below:

If we draw a line, not the best line, necessarily, but a line, as shown, we can begin to consider how well it fits the data. From each data point, we construct a vertical line segment to the line. This distance gives us an indication of the error, the difference between the predicted and actual y values. Squaring this error, which may be positive or negative, gives all positive values, an advantage in finding a total. The sum of the squares gives us a measure of the scatter of the data away from the line.

We try drawing another line, this time a horizontal line is shown. The squares are still fairly large.

This line seems like a better fit. It has a positive slope and an intercept that seems reasonable. The total sum of the squares is less than that for the two previous lines.

Wow, this line with a negative slope does not fit so well. The sum of the squares will be very large. This would make a very poor model.

Again, this line looks much better. This is clearly a better model than some of the earlier attempts.

This line does not look as good as the purple or pink ones.

This line has larger squares than some of the others. This is not the best model.

This is the line based on calculations. This is very similar to the purple one.The equation is . (The graphs are just approximations, and are not exact.)

This shows the approximate sums of the squares of the previous examples. The smaller this quantity, the better the model. Fortunately, we have a technique that allows us to go straight to the equation without all the guesswork. Each colored square represents the total area of the squares for an earlier example. The yellow was the worst of the proposals, and the dark green the best.

Below are the standardized coordinates. All ordered pairs (x,y) are now represented (Zx, Zy ). The best fit line will pass through the origin. Remember, to standardize is to calculate a z-score.

The line drawn is the best fit line. The difference between the data point and the line is shown.

In order to find the best fit line we want to minimize the quantity This is the standardized sum of the squares of the differences, divided by degrees of freedom to adjust for sample size.

The equation of the best fit line is where and This means that the equation can be found if you have the means and standard deviations for both x and y, even without knowing all of the data values. We usually make use of technology to carry out these calculations, and formulas are always provided, but do know how to use the formulas.

Linear Regression