
LINEAR REGRESSION








  1. LINEAR REGRESSION

  2. LINEAR REGRESSION The equation of the linear model y = a + b x represents a generic line on the scatter plot. How can we find a and b in such a way that the line is the best and is determined univocally? We refer to the so-called method of least squares. The line obtained by using the least-squares method is called the least-squares regression line.

  3. LINEAR REGRESSION: the method of least squares (Figure: scatter plot of food expenditure against income, with the regression line and a residual e marked.) Each value of y obtained for a member of the survey is called the observed or actual value of y. The corresponding value on the regression line is called the theoretical or predicted value of y, denoted ŷ. The difference e = y − ŷ is called the residual (or error) and is indicated with e. For a given household, e indicates the difference between the observed value of food expenditure and the theoretical (predicted) value given by the regression model.

  4. LINEAR REGRESSION: the method of least squares The value of e is positive if the observed point is above the regression line and negative if it is below the regression line. Among all the possible lines that interpolate the observed points, the best line should be the one that minimizes all the differences; that is, the sum of the residuals should be minimized. But, whatever the line is, the sum of these residuals is always zero: Σe = Σ(y − ŷ) = 0.
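This property can be checked numerically. The sketch below fits a least-squares line to a small made-up data set (the numbers are purely illustrative, not from the household example) and verifies that the residuals sum to zero up to floating-point error:

```python
# Illustrative data set (hypothetical, roughly on the line y = 2x)
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Least-squares slope and intercept
ss_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
ss_xx = sum(x * x for x in xs) - sum(xs) ** 2 / n
b = ss_xy / ss_xx
a = mean_y - b * mean_x

# Residuals e = y - y_hat always sum to (numerically) zero for the OLS line
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
print(sum(residuals))  # ~0, up to floating-point error
```

This is why the raw residual sum cannot be used as a fit criterion: it is zero for the least-squares line by construction, regardless of how scattered the points are.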

  5. LINEAR REGRESSION: the method of least squares Hence, to find the line that best fits the scatter of points, we minimize the error sum of squares, denoted by SSE, which is obtained by adding the squares of the errors. Thus SSE = Σe² = Σ(y − ŷ)². The values of a and b which give the minimum SSE are called the least-squares estimates of a and b, and the regression line obtained is called the least-squares line.

  6. LINEAR REGRESSION: the method of least squares The least-squares values of a and b are computed as follows: b = SSxy / SSxx and a = ȳ − b·x̄, where SSxy = Σxy − (Σx)(Σy)/n, SSxx = Σx² − (Σx)²/n, and SS stands for “sum of squares”. The least-squares regression line is also called the regression of y on x.
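These formulas translate directly into code. A minimal Python sketch (the data in the usage line are hypothetical, chosen to lie exactly on the line y = 1 + 2x):

```python
def least_squares(xs, ys):
    """Least-squares estimates (a, b) for the line y = a + b x."""
    n = len(xs)
    # SSxy = sum(xy) - (sum(x) * sum(y)) / n
    ss_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    # SSxx = sum(x^2) - (sum(x))^2 / n
    ss_xx = sum(x * x for x in xs) - sum(xs) ** 2 / n
    b = ss_xy / ss_xx          # slope
    a = sum(ys) / n - b * sum(xs) / n  # intercept: a = y_bar - b * x_bar
    return a, b

# Usage on points lying exactly on y = 1 + 2x
a, b = least_squares([0, 1, 2], [1, 3, 5])
print(a, b)  # a = 1.0, b = 2.0
```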

  7. LINEAR REGRESSION: example 1 Find the least-squares regression line for the data on incomes and food expenditure of the seven households. We have to compute the least-squares estimates of a and b. We can do it in four steps.

  8. LINEAR REGRESSION: example 1 Step 1. Compute Σx and Σy, and from these the means x̄ = Σx/n and ȳ = Σy/n.

  9. LINEAR REGRESSION: example 1 Step 2. Compute Σxy and Σx². Thus SSxy = Σxy − (Σx)(Σy)/n = 211.7143.

  10. LINEAR REGRESSION: example 1 Step 3. Compute SSxx = Σx² − (Σx)²/n. Step 4. Compute a and b: b = SSxy / SSxx = .2642 and a = ȳ − b·x̄ = 1.1414.

  11. LINEAR REGRESSION: example 1 Thus, the estimated regression model is ŷ = 1.1414 + .2642x. Using this model we can find the predicted value of y for any specific value of x. For instance, suppose we randomly select a household whose monthly income is $3500, so that x = 35. The predicted value of food expenditure for this household is ŷ = 1.1414 + (.2642)(35) = 10.3884 hundred = $1038.84. In other words, based on our regression line, we predict that a household with a monthly income of $3500 is expected to spend $1038.84 per month on food.
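The prediction step is a single substitution into the fitted line, using the estimates from the example:

```python
# Fitted estimates from the example (units: hundreds of dollars)
a, b = 1.1414, 0.2642
x = 35  # monthly income of $3500, expressed in hundreds

y_hat = a + b * x  # predicted food expenditure, in hundreds of dollars
print(round(y_hat, 4))  # 10.3884, i.e. $1038.84 per month
```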

  12. LINEAR REGRESSION: example 1 In our data there is one household whose income is $3500. The observed food expenditure for this household is $900. The difference between the observed and the predicted values gives the residual, the error of prediction. It is equal to e = y − ŷ = 9.00 − 10.3884 = −1.3884 hundred = −$138.84. The negative error indicates that the predicted value of y is greater than the observed value of y. Thus, if we use the regression model, the household’s food expenditure is overestimated by $138.84.
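The residual for this household can be checked in the same way, again using the values from the example:

```python
# Observed and predicted food expenditure, in hundreds of dollars
y_obs = 9.0                    # observed: $900
y_hat = 1.1414 + 0.2642 * 35   # predicted from the fitted model at x = 35

e = y_obs - y_hat              # residual e = y - y_hat
print(round(e, 4))  # -1.3884, i.e. an overestimate of $138.84
```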

  13. Interpretation of a and b Interpretation of a • Consider a household with zero income • ŷ = 1.1414 + .2642(0) = 1.1414 hundred = $114.14 • Thus, we can state that a household with no income is expected to spend $114.14 per month on food. • Thus a gives the predicted value of y for x = 0, based on the regression model estimated from the sample data. • However, the regression line is valid only for values of x between 15 and 49. Thus, the prediction for a household with zero income is not credible. We should be very careful in interpreting a!

  14. Interpretation of a and b Interpretation of b The value of b in the regression model gives the change in y due to a change of one unit in x. It is called the regression coefficient. For example, in our regression model we have that when x = 30, ŷ = 1.1414 + .2642(30) = 9.0674; when x = 31, ŷ = 1.1414 + .2642(31) = 9.3316. Hence, when x increased by one unit, from 30 to 31, y increased by 9.3316 − 9.0674 = 0.2642, which is the value of b. Because our unit of measurement is hundreds of dollars, we can state that, on average, a $100 increase in income will cause a $26.42 increase in food expenditure. We can also state that, on average, a $1 increase in the income of a household will increase food expenditure by $0.2642. “On average” means that an individual household’s food expenditure may or may not increase by exactly this amount when income rises by $100; the relationship holds across households as a whole.
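The one-unit-change interpretation can be verified directly: the difference between predictions at consecutive x values equals b, whatever x we start from.

```python
# Fitted estimates from the example (hundreds of dollars)
a, b = 1.1414, 0.2642

# Predicted values at x = 30 and x = 31
y_hat_30 = a + b * 30
y_hat_31 = a + b * 31

diff = y_hat_31 - y_hat_30  # change in y_hat per one-unit change in x
print(round(diff, 4))  # 0.2642, which is exactly b
```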

  15. Coefficient of determination Linear regression should be applied with caution. When we use a linear regression we assume that the relationship between the two variables is described by a straight line. In the real world, the relationship between the two variables may not be linear. In such cases, fitting a linear regression would be wrong.

  16. Coefficient of determination If we want to evaluate how good the regression model is, that is, how well the independent variable explains the dependent variable through a linear model, we can use the coefficient of determination. Consider the case in which b = 0. The regression model becomes y = a. But recall that the least-squares value of a is a = ȳ − b·x̄, so if b = 0 then a = ȳ. Thus if b = 0 the regression line is ŷ = ȳ.

  17. Coefficient of determination (Figure: scatter plot with the horizontal line ŷ = ȳ.) The picture represents the extreme situation in which there is no linear relation between x and y.

  18. Coefficient of determination (Figure: scatter plot showing, for an individual j, the deviations y_j − ȳ, ŷ_j − ȳ and y_j − ŷ_j around the regression line.) In the picture we can add the regression model and the observed y for the individual j. We obtain y_j − ȳ = (ŷ_j − ȳ) + (y_j − ŷ_j). We can observe that if ŷ_j is close to ȳ there is linear independence between x and y; if ŷ_j − ȳ accounts for most of y_j − ȳ, there is strong dependence between x and y.

  19. Coefficient of determination If we square these differences and sum over all the individuals we obtain Σ(y − ȳ)² = Σ(y − ŷ)² + Σ(ŷ − ȳ)², that is: SST = SSE + SSR. SST = total sum of squares, Σ(y − ȳ)². It expresses all the variability of the y variable. SSR = regression sum of squares, Σ(ŷ − ȳ)². It expresses the portion of SST explained by the regression model. SSE = error sum of squares, Σ(y − ŷ)². It expresses the portion of SST not explained by the regression model.
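The decomposition SST = SSE + SSR holds exactly for a least-squares fit, and can be demonstrated numerically. The data below are hypothetical, chosen only to illustrate the identity:

```python
# Illustrative data set (hypothetical, roughly on the line y = 2x)
xs = [1, 2, 3, 4, 5]
ys = [2.0, 4.1, 5.9, 8.2, 9.8]

n = len(xs)
mean_y = sum(ys) / n

# Least-squares fit
ss_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
ss_xx = sum(x * x for x in xs) - sum(xs) ** 2 / n
b = ss_xy / ss_xx
a = mean_y - b * sum(xs) / n
preds = [a + b * x for x in xs]

sst = sum((y - mean_y) ** 2 for y in ys)            # total variability
sse = sum((y - p) ** 2 for y, p in zip(ys, preds))  # unexplained part
ssr = sum((p - mean_y) ** 2 for p in preds)         # explained part

print(abs(sst - (sse + ssr)) < 1e-8)  # True: SST = SSE + SSR
```

The identity holds because, for the OLS line, the cross term Σ(ŷ − ȳ)(y − ŷ) vanishes.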

  20. Coefficient of determination The coefficient of determination, denoted by r², represents the proportion of SST that is explained by the use of the regression model: r² = SSR / SST. The value of r² lies in the range 0 to 1. If it is close to 1, it means that almost all the variability of y is explained by the regression model; in other words, the regression model is a good model. The computational formula for r² is r² = b · SSxy / SSyy.

  21. Coefficient of determination: example 1 For the data on monthly incomes and food expenditures of the seven households, calculate the coefficient of determination. From earlier calculations we know b = .2642, SSxy = 211.7143 and SSyy = 60.8571, so r² = b · SSxy / SSyy = (.2642)(211.7143)/60.8571 = .92. We can state that 92% of the variability of y is explained by the regression model (92% of the variation in food expenditure is determined by the monthly income of the seven households).
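The computational formula with the values quoted in the example:

```python
# Values from the household example
b = 0.2642
ss_xy = 211.7143
ss_yy = 60.8571

r2 = b * ss_xy / ss_yy  # r^2 = b * SSxy / SSyy
print(round(r2, 2))  # 0.92
```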

  22. LINEAR CORRELATION COEFFICIENT

  23. LINEAR CORRELATION COEFFICIENT Another measure of the relationship between two variables is the linear correlation coefficient. The linear correlation coefficient measures how closely the points in a scatter diagram are spread around the regression line. It is indicated with r. The value of the correlation coefficient always lies in the range −1 to 1, that is, −1 ≤ r ≤ 1.

  24. LINEAR CORRELATION COEFFICIENT • There are three possible extreme values of r: • r = 1: perfect positive linear correlation between the two variables • r = −1: perfect negative linear correlation • r = 0: no linear correlation (Figure: three scatter plots illustrating r = 1, r = −1 and r = 0.)

  25. LINEAR CORRELATION COEFFICIENT We do not usually encounter an example with perfect positive or perfect negative correlation. What we observe in real-world problems is either a positive linear correlation with 0 < r < 1 or a negative linear correlation with −1 < r < 0. If the correlation between two variables is positive and close to 1, we say that the variables have a strong positive correlation. If the correlation between two variables is positive but close to 0, we say that the variables have a weak positive correlation.

  26. LINEAR CORRELATION COEFFICIENT If the correlation between two variables is negative and close to -1, we say that the variables have a strong negative correlation. If the correlation between two variables is negative but close to 0, we say that the variables have a weak negative correlation.

  27. LINEAR CORRELATION COEFFICIENT • The simple linear correlation coefficient measures the strength of the linear relationship between two variables for a sample and is calculated as r = SSxy / √(SSxx · SSyy).
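Since r² = (SSxy)² / (SSxx · SSyy) = b · SSxy / SSyy, the correlation coefficient can also be recovered as the square root of r², with the sign of the slope b. A sketch using the values from the household example (SSxx is not quoted in the example, so this route avoids needing it):

```python
import math

# Values from the household example
b = 0.2642
ss_xy = 211.7143
ss_yy = 60.8571

r2 = b * ss_xy / ss_yy
r = math.copysign(math.sqrt(r2), b)  # r takes the sign of the slope b
print(round(r, 2))  # 0.96
```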

  28. LINEAR CORRELATION COEFFICIENT: example 1 • Calculate the correlation coefficient for the example on incomes and food expenditures of the seven households. The linear correlation coefficient tells us how strongly the two variables are linearly related. The correlation coefficient of .96 for incomes and food expenditures of the seven households indicates that income and food expenditure are very strongly and positively correlated.
