Introduction to Linear Regression

You have seen how to find the equation of a line that connects two points.
• Often we have more than two data points, and usually they do not all lie on a single line.
• It is possible to find the equation of the line that most closely fits a set of data points. Such a line is called a regression line or a linear regression equation.
• Our goal here is to learn what a regression line is. You can then watch the presentation on how to find the equation of a regression line in Excel.

Consider the following table, which shows the average price p (in millions of dollars) of a two-bedroom apartment in downtown New York City from 1994 to 2004, where t=0 represents 1994.

t (years since 1994)       0      2      4      6      8      10
p (millions of dollars)    0.38   0.40   0.60   0.95   1.20   1.60

• We can plot each of these data points on a graph. Each point is of the form (t, p), so we have 6 points to plot.
• They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), and (10, 1.60). Listed like this, the points don’t give much indication of a pattern, although we can see that the p-values increase as t increases.

When we plot the points together on a set of axes, we get the following scatter plot:
[Figure: scatter plot of p against t]
• It seems that the data do follow a somewhat linear pattern.

We can find the line that most closely fits the data and graph it over the data points.
• Notice that the line does not go through all of the data points.

We can also find the equation of this “line of best fit”.
• We can also get what’s called the correlation coefficient.
• You will be able to do all of this in Excel once you watch the instructional video and read the PDFs for this material. For now, we just want to get an idea of what the regression line is and what the correlation coefficient tells us about the regression equation.
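As a rough illustration of the correlation coefficient, here is a minimal Python sketch. It assumes the presentation means the Pearson correlation coefficient r, and that r squared is what Excel reports as the r-squared value; the variable and function names are illustrative only. The data are the six (t, p) points from the table.

```python
import math

# Data from the table: t = years since 1994, p = average price in millions of dollars.
t_values = [0, 2, 4, 6, 8, 10]
p_values = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists (assumed definition)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    s_yy = sum((y - y_mean) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

r = pearson_r(t_values, p_values)
print(f"r = {r:.4f}, r squared = {r * r:.4f}")
```

A value of r near 1 indicates that the points lie close to an upward-sloping line, which is consistent with the roughly linear pattern in the scatter plot.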

What does the regression equation tell us about the relationship between time and sale price?
• The slope and the vertical intercept (usually the y-intercept, here the p-intercept) tell us different things.

In this case, the p-intercept tells us what the sale price is predicted to be when t=0 (that is, in the year 1994).
• The regression equation is p=0.1264t+0.2229. Recall that price is in millions of dollars.
• Thus, if t=0, the regression equation predicts a price of $0.2229 million, or $222,900.
• According to the table, the actual price was $0.38 million, or $380,000. These values don’t have to be the same, however, since the regression line can’t pass through every point exactly. It is only a model that most closely fits the data points.
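The regression equation above can be reproduced from the six data points. The sketch below assumes the standard least-squares formulas for slope and intercept (the presentation itself uses Excel for this step); `fit_line` is a hypothetical helper name, not something from the course materials.

```python
# Data from the table: t = years since 1994, p = average price in millions of dollars.
t_values = [0, 2, 4, 6, 8, 10]
p_values = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]

def fit_line(ts, ps):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(ts)
    t_mean = sum(ts) / n
    p_mean = sum(ps) / n
    # Slope = sum of products of deviations / sum of squared t deviations.
    s_tp = sum((t - t_mean) * (p - p_mean) for t, p in zip(ts, ps))
    s_tt = sum((t - t_mean) ** 2 for t in ts)
    slope = s_tp / s_tt
    # The least-squares line always passes through the point of means.
    intercept = p_mean - slope * t_mean
    return slope, intercept

slope, intercept = fit_line(t_values, p_values)
print(f"p = {slope:.4f}t + {intercept:.4f}")  # prints: p = 0.1264t + 0.2229

# Predicted price at t=0 is the intercept; the table's actual value is 0.38.
print(f"predicted at t=0: {intercept:.4f}, actual: {p_values[0]:.2f}")
```

Rounded to four decimal places, the computed slope and intercept match the equation p=0.1264t+0.2229 quoted in the text.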

What does the slope of the regression equation tell us?
• The slope of our regression equation is 0.1264.
• We can always write a number x as x divided by 1, so we can write this slope as \(\frac{0.1264}{1}\).
• Recall that the definition of slope is the change in y divided by the change in x, \(\frac{\Delta y}{\Delta x}\).
• In this case we are using p and t, so it’s \(\frac{\Delta p}{\Delta t}\).
• So for our problem, we have \(\frac{\Delta p}{\Delta t} = \frac{0.1264}{1}\).
• We can interpret this to mean that when t increases by 1, we can expect p to increase by 0.1264.

For this problem, t is measured in years and p is measured in millions of dollars.
• More specifically, the slope means that if t increases by 1 year, the model predicts that the average price p of a two-bedroom apartment will increase by about $0.1264 million, or $126,400.
• Even more plainly, the model predicts that the average price of a two-bedroom apartment in downtown New York City will increase by about $126,400 per year.
• We can now use the linear regression model to predict future prices. For example, to predict the price of an apartment in 2008, we plug in 14 for t in the regression equation (since t=0 is 1994 and 2008-1994=14).

Plugging in 14 for t in the regression equation gives p=0.1264(14)+0.2229=1.9925.
• This means that if the trend continued, we would expect the average price of a two-bedroom apartment to have been around $1,992,500 in 2008.
• You can also use the regression equation to check how closely the model matches the actual price in years that appear in the table. For example, for 2000 (t=6) the equation predicts a price of p=0.1264(6)+0.2229=0.9813, or $981,300.
• According to the table, the actual price was $950,000, so the prediction is off by about $31,300, or roughly 3%.
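The two predictions above can be checked with a few lines of Python. This is only a sketch: `predict_price` and the constant names are hypothetical, and the coefficients are the rounded values quoted in the text.

```python
# Coefficients of the regression equation quoted in the text (p in millions of dollars).
SLOPE = 0.1264
INTERCEPT = 0.2229
BASE_YEAR = 1994  # t = 0 corresponds to 1994

def predict_price(year):
    """Predicted average price, in millions of dollars, for a calendar year."""
    t = year - BASE_YEAR
    return SLOPE * t + INTERCEPT

price_2008 = predict_price(2008)  # t = 14, outside the range of the data
price_2000 = predict_price(2000)  # t = 6, a year that appears in the table

print(f"2008: ${price_2008 * 1_000_000:,.0f}")
print(f"2000: ${price_2000 * 1_000_000:,.0f}")

# Compare the 2000 prediction with the actual table value of 0.95 million.
error_2000 = price_2000 - 0.95
print(f"2000 prediction error: ${error_2000 * 1_000_000:,.0f}")
```

Converting the year to t inside the function keeps the "t=0 is 1994" convention in one place, so a caller cannot accidentally pass a calendar year where t is expected.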

It is important to remember that the regression equation is just a model, and it won’t give exact values.
• If the equation is a good fit to the data, however, it will give a good approximation, so it can be used to forecast what may happen in the future if the current trend continues.
• Next, let’s take a quick look at how a regression equation is derived, and then look at what the correlation coefficient (or the r-squared value in Excel) tells us about the regression equation.

Let’s take another look at the data points and the regression line.
• Why does this particular line give the best “fit” for the data? Why not some other line?

It has to do with what is called a residual.
• A residual is the vertical difference between a particular data point and the regression line: the actual p-value minus the p-value that the line predicts for the same t.
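To make the idea of a residual concrete, the sketch below computes one for each data point. It assumes the usual sign convention (actual value minus predicted value) and the usual least-squares criterion, under which the "best" line is the one with the smallest sum of squared residuals; this excerpt does not spell out that criterion. The comparison line through the first and last data points is a hypothetical competitor chosen only for illustration.

```python
# Data from the table: t = years since 1994, p = average price in millions of dollars.
t_values = [0, 2, 4, 6, 8, 10]
p_values = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]

def residuals(slope, intercept):
    """Residual for each data point: actual p minus the p the line predicts."""
    return [p - (slope * t + intercept) for t, p in zip(t_values, p_values)]

def sum_of_squares(values):
    """Sum of the squared entries of a list."""
    return sum(v * v for v in values)

# Regression line quoted in the text.
regression_residuals = residuals(0.1264, 0.2229)

# Hypothetical competitor: the line through the first and last data points.
endpoint_slope = (p_values[-1] - p_values[0]) / (t_values[-1] - t_values[0])
endpoint_residuals = residuals(endpoint_slope, p_values[0])

for t, r in zip(t_values, regression_residuals):
    print(f"t={t:2d}  residual={r:+.4f}")

ssr_regression = sum_of_squares(regression_residuals)
ssr_endpoint = sum_of_squares(endpoint_residuals)
print(f"sum of squared residuals, regression line: {ssr_regression:.4f}")
print(f"sum of squared residuals, endpoint line:   {ssr_endpoint:.4f}")
```

The endpoint line passes exactly through two of the points, yet its total squared residual (about 0.173) is larger than that of the regression line (about 0.062), which suggests why the regression line is preferred over "some other line".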

If we zoom in on a particular data point, we can see what a residual is.
• Let’s zoom in on this particular data point.

Zooming into this box: