
Regression

Petter Mostad, 2005.10.10



Some problems you might want to look at

• Given the annual number of cancers of a certain type, over a few decades, make a prediction for the future, with uncertainty.

• There seems to be a connection between efficiency and size for Norwegian hospitals. Given data from many hospitals, determine if there is a connection, and what it is.

• Investigate the connection between efficiency and a number of possible explanatory variables.

We would like to study the connection between x and y!

Fit a line!

• Interpolation

• Extrapolation (sometimes dangerous!)

• Interpret the parameters of the line

The sum of the squares of the ”errors” minimized = Least squares method!

• Note: many other ways to fit the line can be imagined

• Let (x1, y1), (x2, y2),...,(xn, yn) denote the points in the plane.

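• Find a and b so that y = a + bx fits the points by minimizing $S = \sum_{i=1}^{n} (y_i - a - b x_i)^2$

• Solution: $b = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$ and $a = \bar{y} - b\bar{x}$,

where $\bar{x} = \frac{1}{n}\sum x_i$ and $\bar{y} = \frac{1}{n}\sum y_i$, and all sums are done for i=1,...,n.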

• Differentiate S with respect to a and b, and set the results to 0

We get: $\sum_{i=1}^{n} (y_i - a - b x_i) = 0$ and $\sum_{i=1}^{n} x_i (y_i - a - b x_i) = 0$

These are two equations in two unknowns, and their solution gives the answer.
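The least squares computation can be sketched in a few lines of Python. This is a minimal illustration, assuming the closed-form solution b = Sxy / Sxx and a = mean(y) - b * mean(x); the function name `fit_line` and the data are made up for the example and are not from the slides. The last lines check that the two derivative equations are satisfied at the fitted values.

```python
def fit_line(xs, ys):
    """Return (a, b) minimizing S = sum((y - a - b*x)**2)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b

# Hypothetical illustration data (not from the slides)
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = fit_line(xs, ys)

# Left-hand sides of the two equations obtained by differentiating S;
# both should be zero up to rounding error.
eq_a = sum(y - a - b * x for x, y in zip(xs, ys))
eq_b = sum(x * (y - a - b * x) for x, y in zip(xs, ys))

print("a =", a, "b =", b)
print("equation residuals:", eq_a, eq_b)
```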

Some grasshoppers make sound by rubbing their wings against each other. There is a connection between the temperature and the frequency of the movements, unique for each species. Here are some data for Nemobius fasciatus fasciatus:

If you measure 18 movements per second, what is the estimated temperature?

Data from Pierce, GW. The Songs of Insects. Cambridge, Mass.: Harvard University Press, 1949, pp. 12-21

Computation:

y against x ≠ x against y

• Linear regression of y against x does not give the same result as the opposite.

[Figure: regression of y against x; regression of x against y]
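A small Python sketch can make the asymmetry concrete. It assumes ordinary least squares in both directions and uses made-up data; the helper name `slope` is hypothetical. Regressing x on y gives a line x = a' + b'y, which has slope 1/b' when drawn in the (x, y) plane, and this generally differs from the slope of the regression of y on x.

```python
def slope(us, vs):
    """Least squares slope when regressing vs on us."""
    n = len(us)
    u_bar = sum(us) / n
    v_bar = sum(vs) / n
    suv = sum((u - u_bar) * (v - v_bar) for u, v in zip(us, vs))
    suu = sum((u - u_bar) ** 2 for u in us)
    return suv / suu

# Hypothetical illustration data (not from the slides)
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 1.0, 4.0, 3.0, 5.0]

b_yx = slope(xs, ys)  # slope of y = a + b_yx * x
b_xy = slope(ys, xs)  # slope of x = a' + b_xy * y

# The x-on-y line, redrawn with y on the vertical axis, has slope 1 / b_xy.
slope_xy_in_xy_plane = 1.0 / b_xy

print("y on x slope:", b_yx)
print("x on y slope, drawn in the (x, y) plane:", slope_xy_in_xy_plane)
```

The two slopes agree only when the points lie exactly on a line.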

• Assume we subtract the average from both x- and y-values

• We get $\bar{x} = 0$ and $\bar{y} = 0$

• We get $a = 0$ and $b = \frac{\sum x_i y_i}{\sum x_i^2}$

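• From definitions of correlation and standard deviation we get $b = r \frac{s_y}{s_x}$, where r is the sample correlation and $s_x$, $s_y$ are the sample standard deviations

(even in uncentered case)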

• Note also: The residuals sum to 0.
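The relation between the slope, the correlation and the standard deviations, and the fact that the residuals sum to 0, can be checked numerically. The sketch below assumes the usual sample definitions (standard deviations with divisor n-1, Pearson correlation) and uses made-up, uncentered data.

```python
import math

# Hypothetical illustration data (not from the slides), not centered
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(xs)

x_bar = sum(xs) / n
y_bar = sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
syy = sum((y - y_bar) ** 2 for y in ys)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

# Least squares estimates
b = sxy / sxx
a = y_bar - b * x_bar

# Sample standard deviations and sample correlation
s_x = math.sqrt(sxx / (n - 1))
s_y = math.sqrt(syy / (n - 1))
r = sxy / math.sqrt(sxx * syy)

b_from_r = r * s_y / s_x
residual_sum = sum(y - (a + b * x) for x, y in zip(xs, ys))

print("b =", b, "r * s_y / s_x =", b_from_r)
print("sum of residuals =", residual_sum)
```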

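• Define

• SSE: Error sum of squares, $SSE = \sum_{i=1}^{n} (y_i - a - b x_i)^2$

• SSR: Regression sum of squares, $SSR = \sum_{i=1}^{n} (a + b x_i - \bar{y})^2$

• SST: Total sum of squares, $SST = \sum_{i=1}^{n} (y_i - \bar{y})^2$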

• We can show that

SST = SSR + SSE

• Define $R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$

• $R^2$ is the ”coefficient of determination”
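The decomposition of the sums of squares can be verified on a small example. This is a sketch assuming the standard definitions (SSE from the residuals, SSR from the fitted values around the mean, SST from the observations around the mean) and R² = SSR / SST; the data are made up.

```python
# Hypothetical illustration data (not from the slides)
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(xs)

x_bar = sum(xs) / n
y_bar = sum(ys) / n
b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
     / sum((x - x_bar) ** 2 for x in xs))
a = y_bar - b * x_bar

fitted = [a + b * x for x in xs]

sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))  # error sum of squares
ssr = sum((f - y_bar) ** 2 for f in fitted)          # regression sum of squares
sst = sum((y - y_bar) ** 2 for y in ys)              # total sum of squares

r2 = ssr / sst

print("SSE =", sse, "SSR =", ssr, "SST =", sst)
print("SSR + SSE =", ssr + sse)
print("R^2 =", r2)
```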

• Given that a positive slope (b) has been estimated: Does it give a reproducible indication that there is a positive trend, or is it a result of random variation?

• What is a confidence interval for the estimated slope?

• What is the prediction, with uncertainty, at a new x value?

The standard simple regression model

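• We have to do as before, and define a model $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$

where $\varepsilon_1, \ldots, \varepsilon_n$ are independent, normally distributed, with expectation zero and equal variance $\sigma^2$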

• We can then use data to estimate the model parameters, and to make statements about their uncertainty

Confidence intervals for simple regression

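• In a simple regression model $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$,

• a estimates $\beta_0$

• b estimates $\beta_1$

• $s_e^2 = \frac{SSE}{n-2}$ estimates $\sigma^2$ (the variance of the errors)

• Also, $\frac{b - \beta_1}{s_b} \sim t_{n-2}$

where $s_b^2 = \frac{s_e^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$ estimates the variance of b

• So a confidence interval for $\beta_1$ with confidence level $1-\alpha$ is given by $b \pm t_{n-2,\alpha/2}\, s_b$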
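A confidence interval for the slope could be computed along the following lines. The sketch assumes the standard simple regression model, estimates the error variance by SSE / (n - 2), and takes the t critical value from a table rather than computing it: 3.182 is the upper 2.5% point of the t distribution with 3 degrees of freedom, which matches the n = 5 made-up data points used here. The constant `T_CRIT` must be changed if n or the confidence level changes.

```python
import math

# Hypothetical illustration data (not from the slides), n = 5
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(xs)

# Upper 2.5% point of the t distribution with n - 2 = 3 degrees of freedom,
# taken from a standard t table (assumption: 95% confidence level).
T_CRIT = 3.182

x_bar = sum(xs) / n
y_bar = sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
a = y_bar - b * x_bar

sse = sum((y - a - b * x) ** 2 for x, y in zip(xs, ys))
s_e2 = sse / (n - 2)        # estimate of the error variance
s_b = math.sqrt(s_e2 / sxx)  # estimated standard deviation of b

lower = b - T_CRIT * s_b
upper = b + T_CRIT * s_b

print("b =", b, "s_b =", s_b)
print("95% confidence interval for the slope:", (lower, upper))
```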

Hypothesis testing for simple regression

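• Choose hypotheses: $H_0: \beta_1 = 0$ against $H_1: \beta_1 \neq 0$, where $\beta_1$ is the slope in the model

• Test statistic: $t = \frac{b}{s_b}$, where $s_b$ is the estimated standard deviation of b

• Reject H0 if $t < -t_{n-2,\alpha/2}$ or $t > t_{n-2,\alpha/2}$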
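The two-sided test of a zero slope can be sketched as below. It assumes the test statistic is the estimated slope divided by its estimated standard deviation, compared with a tabulated t critical value (3.182 for 3 degrees of freedom at the 5% level). The data are made up and deliberately noisy, so the positive estimated slope does not turn out to be significant here.

```python
import math

# Hypothetical, noisy illustration data (not from the slides), n = 5
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 1.0, 4.0, 3.0, 5.0]
n = len(xs)

# Upper 2.5% point of t with n - 2 = 3 degrees of freedom (from a t table)
T_CRIT = 3.182

x_bar = sum(xs) / n
y_bar = sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
a = y_bar - b * x_bar

sse = sum((y - a - b * x) ** 2 for x, y in zip(xs, ys))
s_b = math.sqrt(sse / (n - 2) / sxx)

t_stat = b / s_b
reject_h0 = abs(t_stat) > T_CRIT  # two-sided test at the 5% level

print("b =", b, "t =", t_stat, "reject H0:", reject_h0)
```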

Prediction from a simple regression model

• A regression model can be used to predict the response at a new value xn+1

• The uncertainty in this prediction comes from two sources:

• The uncertainty in the regression line

• The uncertainty of any response, given the regression line

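• A confidence interval for the prediction: $a + b x_{n+1} \pm t_{n-2,\alpha/2}\, s_e \sqrt{1 + \frac{1}{n} + \frac{(x_{n+1} - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}$, where $s_e^2 = \frac{SSE}{n-2}$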
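The two sources of uncertainty show up as separate terms under the square root in the usual prediction interval: the 1 is the variation of a single response around the line, and the remaining terms are the uncertainty in the fitted line. The sketch below assumes that standard formula; the function name `predict_interval`, the data and the tabulated critical value are illustrative assumptions.

```python
import math

def predict_interval(xs, ys, x_new, t_crit):
    """Return (prediction, half_width) for a new response at x_new."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    a = y_bar - b * x_bar
    sse = sum((y - a - b * x) ** 2 for x, y in zip(xs, ys))
    s_e = math.sqrt(sse / (n - 2))
    # 1: a single response around the line; 1/n + ...: the line itself
    factor = math.sqrt(1 + 1 / n + (x_new - x_bar) ** 2 / sxx)
    return a + b * x_new, t_crit * s_e * factor

# Hypothetical illustration data (not from the slides), n = 5
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
T_CRIT = 3.182  # upper 2.5% point of t with 3 degrees of freedom (t table)

y_far, half_far = predict_interval(xs, ys, 6.0, T_CRIT)  # outside the data
y_mid, half_mid = predict_interval(xs, ys, 3.0, T_CRIT)  # at the mean of x

print("prediction at x = 6:", y_far, "+/-", half_far)
print("prediction at x = 3:", y_mid, "+/-", half_mid)
```

The interval is narrowest at the mean of the x values and widens as the new x value moves away from it.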

• It is also possible to test whether a sample correlation r is large enough to indicate a nonzero population correlation

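• Test statistic: $t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$, compared with the t distribution with n-2 degrees of freedom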

• Note: The test only works for normal distributions and linear correlations: Always also investigate scatter plot!
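The correlation test can be sketched as follows, assuming the usual statistic r * sqrt(n - 2) / sqrt(1 - r^2) compared with a t distribution with n - 2 degrees of freedom. The data and the tabulated critical value (3.182 for 3 degrees of freedom, 5% two-sided) are illustrative assumptions.

```python
import math

# Hypothetical illustration data (not from the slides), n = 5
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 1.0, 4.0, 3.0, 5.0]
n = len(xs)
T_CRIT = 3.182  # upper 2.5% point of t with n - 2 = 3 degrees of freedom

x_bar = sum(xs) / n
y_bar = sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
syy = sum((y - y_bar) ** 2 for y in ys)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

r = sxy / math.sqrt(sxx * syy)  # sample correlation
t_stat = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
nonzero_correlation = abs(t_stat) > T_CRIT

print("r =", r, "t =", t_stat, "significant:", nonzero_correlation)
```

For a simple regression on the same data, this statistic has the same value as the slope divided by its estimated standard deviation, so the two tests lead to the same conclusion.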

• NOTE: The result of a regression analysis is very much influenced by points with extreme values, in either the x or the y direction.

• Always investigate visually, and determine if outliers are actually erroneous observations
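How strongly a single extreme point can affect the fit is easy to demonstrate. In this sketch, with made-up data and a hypothetical helper `fit_line`, one added point with an extreme x value changes the estimated slope almost completely.

```python
def fit_line(xs, ys):
    """Return the least squares estimates (a, b) for y = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    return y_bar - b * x_bar, b

# Hypothetical illustration data (not from the slides)
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a_clean, b_clean = fit_line(xs, ys)

# Add one observation that is extreme in the x direction
a_out, b_out = fit_line(xs + [20.0], ys + [5.0])

print("slope without the extreme point:", b_clean)
print("slope with the extreme point:   ", b_out)
```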

• The relationship between variables may not be linear

• Example: The natural model may be $y = a e^{bx}$

• We want to find a and b so that the curve approximates the points as well as possible

• When $y = a e^{bx}$ then $\log(y) = \log(a) + bx$

• Use standard formulas on the pairs (x1,log(y1)), (x2, log(y2)), ..., (xn, log(yn))

• We get estimates for log(a) and b, and thus a and b
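The transformation approach can be sketched in Python as follows, assuming an exponential model y = a * exp(b * x) and natural logarithms. The data are synthetic and noise-free, generated with a = 2 and b = 0.5, so the fit should recover these values; real data would only do so approximately.

```python
import math

def fit_line(xs, ys):
    """Return the least squares estimates (intercept, slope)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    return y_bar - slope * x_bar, slope

# Synthetic, noise-free data from y = 2 * exp(0.5 * x) (illustration only)
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * math.exp(0.5 * x) for x in xs]

# Standard formulas applied to the pairs (x, log(y))
log_a, b = fit_line(xs, [math.log(y) for y in ys])
a = math.exp(log_a)

print("a =", a, "b =", b)
```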

Another example of transformed variables

• Another natural model may be $y = a x^b$

• We get that $\log(y) = \log(a) + b \log(x)$

• Use standard formulas on the pairs (log(x1), log(y1)), (log(x2), log(y2)), ..., (log(xn), log(yn))

Note: In this model, the curve goes through (0,0) when b > 0
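A corresponding sketch for a power model y = a * x**b takes logarithms of both variables before fitting the line. The data are synthetic and noise-free, generated with a = 3 and b = 1.5 as an assumption for the example; all x values must be positive for the logarithm to exist.

```python
import math

def fit_line(xs, ys):
    """Return the least squares estimates (intercept, slope)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    return y_bar - slope * x_bar, slope

# Synthetic, noise-free data from y = 3 * x**1.5 (illustration only)
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0 * x ** 1.5 for x in xs]

# Standard formulas applied to the pairs (log(x), log(y))
log_a, b = fit_line([math.log(x) for x in xs], [math.log(y) for y in ys])
a = math.exp(log_a)

print("a =", a, "b =", b)
```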

• Assume we have data of the type

(x11, x12, x13, y1), (x21, x22, x23, y2), ...

• We want to ”explain” y from the x-values by fitting the following model: $y_i = a + b x_{i1} + c x_{i2} + d x_{i3} + \varepsilon_i$

• Just like before, one can produce formulas for a,b,c,d minimizing the sum of the squares of the ”errors”.

• x1,x2,x3 can be transformations of different variables, or transformations of the same variable

• The errors are independent random (normal) variables with expectation zero and variance $\sigma^2$

• The explanatory variables x1i, x2i, …, xni cannot be linearly related

• Versions of multiple regression are the most used models in econometrics, and in health economics

• It is a powerful tool to detect and verify connections between variables
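Least squares for several explanatory variables can be sketched without any libraries by solving the normal equations (X'X) z = X'y, where X has a leading column of ones for the constant term. This is one possible approach, not necessarily the one behind the slides; the helper names `solve` and `fit_multiple` are hypothetical, and the data are synthetic and noise-free, generated from y = 1 + 2*x1 - x2 + 0.5*x3.

```python
def solve(A, rhs):
    """Solve A z = rhs by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [r] for row, r in zip(A, rhs)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(M[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (M[r][n] - s) / M[r][r]
    return z

def fit_multiple(rows, ys):
    """Return [a, b, c, ...] minimizing the sum of squared errors."""
    X = [[1.0] + list(row) for row in rows]  # column of ones for the constant
    p = len(X[0])
    XtX = [[sum(x[i] * x[j] for x in X) for j in range(p)] for i in range(p)]
    Xty = [sum(x[i] * y for x, y in zip(X, ys)) for i in range(p)]
    return solve(XtX, Xty)

# Synthetic data (illustration only): rows are (x1, x2, x3)
rows = [(1, 0, 2), (2, 1, 0), (3, 1, 1), (4, 3, 2), (5, 2, 4), (6, 5, 1)]
ys = [1 + 2 * x1 - x2 + 0.5 * x3 for x1, x2, x3 in rows]

coeffs = fit_multiple(rows, ys)
print("a, b, c, d =", coeffs)
```

If the explanatory variables are linearly related, the matrix X'X is singular and the elimination step divides by zero, which is the computational counterpart of the requirement that they cannot be linearly related.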

• Plot the data first, to investigate whether there is a natural relationship

• Linear or transformed model?

• Are there outliers which will unduly affect the result?

• Fit a model. Different models with the same number of parameters may be compared with $R^2$

• Make tests / confidence intervals for parameters

• The parameters may have important interpretations

• The model may be used for prediction at new values (caution: Extrapolation can sometimes be dangerous!)

• Remember that subjective choices have been made, and interpret cautiously