Regression


Petter Mostad, 2005.10.10



Some problems you might want to look at
• Given the annual number of cancers of a certain type, over a few decades, make a prediction for the future, with uncertainty.
• There seems to be a connection between efficiency and size for Norwegian hospitals. Given data from many hospitals, determine if there is a connection, and what it is.
• Investigate the connection between efficiency and a number of possible explanatory variables.
Connection between variables

We would like to study the connection between x and y.

What can you do with a fitted line?
• Interpolation
• Extrapolation (sometimes dangerous!)
• Interpret the parameters of the line
How to define the line that ”fits best”?

Choose the line for which the sum of the squares of the ”errors” is minimized: this is the least squares method.

• Note: many other ways to fit the line can be imagined
How to compute the line fit with the least squares method?
• Let (x1, y1), (x2, y2),...,(xn, yn) denote the points in the plane.
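• Find a and b so that y = a + bx fits the points by minimizing
$$S(a, b) = \sum_{i=1}^{n} (y_i - a - b x_i)^2$$
• Solution:
$$b = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}, \qquad a = \bar{y} - b \bar{x}$$
where $\bar{x} = \frac{1}{n} \sum x_i$ and $\bar{y} = \frac{1}{n} \sum y_i$, and all sums are done for i=1,...,n.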
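As a minimal sketch, the least squares solution can be computed directly from the centered sums, b = Σ(x_i − x̄)(y_i − ȳ) / Σ(x_i − x̄)² and a = ȳ − b·x̄. The function name and the data below are hypothetical, made up only to illustrate the computation.

```python
def least_squares_line(xs, ys):
    """Return (a, b) so that y = a + b*x minimizes the sum of squared errors."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Centered cross-product and sum of squares
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b


# Hypothetical data, for illustration only
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

a, b = least_squares_line(xs, ys)
print(f"fitted line: y = {a:.2f} + {b:.2f} x")
```

The fitted line can then be used for interpolation by evaluating a + b*x at any x inside the observed range.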

How do you get this answer?
• Differentiate S with respect to a and b, and set the results to 0

We get:

$$\sum_{i=1}^{n} (y_i - a - b x_i) = 0, \qquad \sum_{i=1}^{n} x_i (y_i - a - b x_i) = 0$$

These are two equations with two unknowns, and their solution gives the answer.
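For readers who want the intermediate steps, the two equations can be solved as sketched below. The notation (sums over i=1,...,n, bars for averages) is an assumption chosen to match the rest of these notes.

```latex
% Rearranged, the two equations read
\sum y_i = n a + b \sum x_i ,
\qquad
\sum x_i y_i = a \sum x_i + b \sum x_i^2 .

% The first equation gives the intercept in terms of the slope:
a = \bar{y} - b \bar{x} .

% Substituting this into the second equation and using \sum x_i = n \bar{x}:
\sum x_i y_i - n \bar{x} \bar{y} = b \left( \sum x_i^2 - n \bar{x}^2 \right) ,

% so that
b = \frac{\sum x_i y_i - n \bar{x} \bar{y}}{\sum x_i^2 - n \bar{x}^2}
  = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} .
```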

Example

Some grasshoppers make sound by rubbing their wings against each other. There is a connection between the temperature and the frequency of the movements, unique for each species. Here are some data for Nemobius fasciatus fasciatus:

If you measure 18 movements per second, what is the estimated temperature?

Data from Pierce, GW. The Songs of Insects. Cambridge, Mass.: Harvard University Press, 1949, pp. 12-21

Example (cont.)

Computation:

y against x ≠ x against y
• Linear regression of y against x does not give the same line as linear regression of x against y.

(Figure: the line from the regression of y against x, and the line from the regression of x against y.)
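The difference between the two regressions can be checked numerically. The sketch below fits both lines to the same hypothetical data and compares their slopes in the (x, y) plane; the helper name `fit` and the data are made up for illustration.

```python
def fit(us, vs):
    """Least squares fit of v = intercept + slope * u; returns (intercept, slope)."""
    n = len(us)
    u_bar = sum(us) / n
    v_bar = sum(vs) / n
    suv = sum((u - u_bar) * (v - v_bar) for u, v in zip(us, vs))
    suu = sum((u - u_bar) ** 2 for u in us)
    slope = suv / suu
    return v_bar - slope * u_bar, slope


# Hypothetical data, for illustration only
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

a_yx, b_yx = fit(xs, ys)  # y = a + b*x, minimizes vertical errors
c_xy, d_xy = fit(ys, xs)  # x = c + d*y, minimizes horizontal errors

# Rewritten as a line in the (x, y) plane, x = c + d*y has slope 1/d
print("slope of y against x:", b_yx)
print("slope of x against y, drawn in the same plane:", 1 / d_xy)
```

The two slopes agree only when all points lie exactly on a line, that is, when the correlation is +1 or -1.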

Centered variables
• Assume we subtract the average from both x- and y-values
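• We get $\bar{x} = 0$ and $\bar{y} = 0$
• We get $b = \frac{\sum x_i y_i}{\sum x_i^2}$ and $a = 0$
• From the definitions of correlation and standard deviation we get
$$b = r \frac{s_y}{s_x}$$
(even in the uncentered case)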

• Note also: The residuals sum to 0.
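Both facts on this slide can be verified numerically: the slope equals the correlation times the ratio of standard deviations, b = r·s_y/s_x, and the residuals sum to 0. This is a small sketch on hypothetical data; the variable names are illustrative.

```python
import math
import statistics

# Hypothetical data, for illustration only
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)
syy = sum((y - y_bar) ** 2 for y in ys)

# Least squares estimates
b = sxy / sxx
a = y_bar - b * x_bar

# Sample correlation and sample standard deviations
r = sxy / math.sqrt(sxx * syy)
s_x = statistics.stdev(xs)
s_y = statistics.stdev(ys)

residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

print("b           =", b)
print("r * s_y/s_x =", r * s_y / s_x)
print("sum of residuals =", sum(residuals))
```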
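Analyzing the variance
• Define
• SSE: Error sum of squares, $SSE = \sum (y_i - a - b x_i)^2$
• SSR: Regression sum of squares, $SSR = \sum (a + b x_i - \bar{y})^2$
• SST: Total sum of squares, $SST = \sum (y_i - \bar{y})^2$
• We can show that

SST = SSR + SSE

• Define $R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$
• $R^2$ is the ”coefficient of determination”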
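The decomposition can be checked on a small example. The sketch below computes SSE (squared residuals), SSR (squared deviations of the fitted values from the average of y) and SST (squared deviations of y from its average), then R² = SSR/SST. The data are hypothetical.

```python
# Hypothetical data, for illustration only
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
    (x - x_bar) ** 2 for x in xs
)
a = y_bar - b * x_bar
fitted = [a + b * x for x in xs]

sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))  # error sum of squares
ssr = sum((f - y_bar) ** 2 for f in fitted)          # regression sum of squares
sst = sum((y - y_bar) ** 2 for y in ys)              # total sum of squares

r_squared = ssr / sst
print("SSE, SSR, SST:", sse, ssr, sst)
print("R^2:", r_squared)
```

In simple regression R² equals the squared sample correlation between x and y.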
But how to answer questions like:
• Given that a positive slope (b) has been estimated: Does it give a reproducible indication that there is a positive trend, or is it a result of random variation?
• What is a confidence interval for the estimated slope?
• What is the prediction, with uncertainty, at a new x value?
The standard simple regression model
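• As before, we have to define a model
$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$$
where $\varepsilon_1, \ldots, \varepsilon_n$ are independent, normally distributed, with expectation zero and equal variance $\sigma^2$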

• We can then use data to estimate the model parameters, and to make statements about their uncertainty
Confidence intervals for simple regression
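• In a simple regression model,
• a estimates the intercept $\beta_0$
• b estimates the slope $\beta_1$
• $s_e^2 = \frac{SSE}{n-2}$ estimates the error variance $\sigma^2$
• Also,
$$\frac{b - \beta_1}{s_b} \sim t_{n-2}$$
where $s_b^2 = \frac{s_e^2}{\sum (x_i - \bar{x})^2}$ estimates the variance of b

• So a confidence interval for $\beta_1$ is given by
$$b \pm t_{n-2,\alpha/2} \, s_b$$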
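A minimal sketch of the interval b ± t·s_b, with s_e² = SSE/(n−2) and s_b² = s_e²/Σ(x_i − x̄)². The critical value is passed in as an argument and is assumed to be read from a t table (3.182 is the 97.5% point of the t distribution with 3 degrees of freedom). The function name and the data are hypothetical.

```python
import math


def slope_confidence_interval(xs, ys, t_crit):
    """Return (b, s_b, (low, high)) for the slope of y = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    a = y_bar - b * x_bar
    sse = sum((y - a - b * x) ** 2 for x, y in zip(xs, ys))
    s_e2 = sse / (n - 2)         # estimate of the error variance
    s_b = math.sqrt(s_e2 / sxx)  # estimated standard deviation of b
    return b, s_b, (b - t_crit * s_b, b + t_crit * s_b)


# Hypothetical data, for illustration only
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

# n - 2 = 3 degrees of freedom, 95% interval: table value 3.182 (assumed input)
b, s_b, (low, high) = slope_confidence_interval(xs, ys, t_crit=3.182)
print(f"b = {b:.3f}, s_b = {s_b:.3f}, 95% CI = ({low:.3f}, {high:.3f})")
```

With only five points the interval is wide and includes 0, which illustrates why the estimated slope alone says little about reproducibility.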
Hypothesis testing for simple regression
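• Choose hypotheses: $H_0: \beta_1 = 0$ against $H_1: \beta_1 \neq 0$
• Test statistic: $t = \frac{b}{s_b}$, where $s_b$ is the estimated standard deviation of b; under $H_0$ it has a $t_{n-2}$ distribution
• Reject $H_0$ if $t < -t_{n-2,\alpha/2}$ or $t > t_{n-2,\alpha/2}$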
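As an illustration, the sketch below carries out a two-sided test of "slope equals zero" using the statistic t = b/s_b. The two-sided form, the hypothetical data, and the table value 3.182 (97.5% point of t with 3 degrees of freedom) are assumptions made for this example.

```python
import math

# Hypothetical data, for illustration only
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
a = y_bar - b * x_bar

sse = sum((y - a - b * x) ** 2 for x, y in zip(xs, ys))
s_b = math.sqrt(sse / (n - 2) / sxx)  # estimated standard deviation of b

t_stat = b / s_b
t_crit = 3.182  # from a t table, 3 degrees of freedom, 5% two-sided level

reject_h0 = abs(t_stat) > t_crit
print(f"t = {t_stat:.3f}, critical value = {t_crit}, reject H0: {reject_h0}")
```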
Prediction from a simple regression model
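• A regression model can be used to predict the response at a new value $x_{n+1}$
• The uncertainty in this prediction comes from two sources:
• The uncertainty in the regression line
• The uncertainty of any response, given the regression line
• A confidence interval for the prediction, where $s_e^2 = \frac{SSE}{n-2}$:
$$a + b x_{n+1} \pm t_{n-2,\alpha/2} \, s_e \sqrt{1 + \frac{1}{n} + \frac{(x_{n+1} - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$$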
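The two sources of uncertainty show up as separate terms in the standard prediction interval: the "1" under the square root is the variation of a single response, and the remaining terms are the uncertainty in the fitted line. The sketch below assumes the usual formula ŷ ± t·s_e·sqrt(1 + 1/n + (x_new − x̄)²/Σ(x_i − x̄)²); the function name, data, and table value are hypothetical inputs.

```python
import math


def prediction_interval(xs, ys, x_new, t_crit):
    """Return (prediction, standard error, (low, high)) at x_new."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    a = y_bar - b * x_bar
    sse = sum((y - a - b * x) ** 2 for x, y in zip(xs, ys))
    s_e = math.sqrt(sse / (n - 2))
    pred = a + b * x_new
    # 1: a single new response; 1/n and the last term: the fitted line
    se_pred = s_e * math.sqrt(1 + 1 / n + (x_new - x_bar) ** 2 / sxx)
    return pred, se_pred, (pred - t_crit * se_pred, pred + t_crit * se_pred)


# Hypothetical data, for illustration only
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

# 3.182: t table value for 3 degrees of freedom, 95% interval (assumed input)
pred, se_pred, (low, high) = prediction_interval(xs, ys, x_new=6.0, t_crit=3.182)
print(f"prediction {pred:.2f}, interval ({low:.2f}, {high:.2f})")
```

The interval widens as x_new moves away from the average of the observed x-values, which is one reason extrapolation is risky.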
Testing for correlation
• It is also possible to test whether a sample correlation r is large enough to indicate a nonzero population correlation
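• Test statistic: $t = \frac{r \sqrt{n-2}}{\sqrt{1 - r^2}}$, which has a $t_{n-2}$ distribution when the population correlation is zero
• Note: The test only works for normal distributions and linear correlations: Always investigate the scatter plot as well!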
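A short sketch of the correlation test, assuming the standard statistic t = r·sqrt(n−2)/sqrt(1−r²). For simple regression this is numerically the same as the slope statistic b/s_b, which the code also checks. The data are hypothetical.

```python
import math

# Hypothetical data, for illustration only
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)
syy = sum((y - y_bar) ** 2 for y in ys)

# Sample correlation and its test statistic
r = sxy / math.sqrt(sxx * syy)
t_corr = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# Slope statistic from the regression of y against x, for comparison
b = sxy / sxx
a = y_bar - b * x_bar
sse = sum((y - a - b * x) ** 2 for x, y in zip(xs, ys))
t_slope = b / math.sqrt(sse / (n - 2) / sxx)

print(f"r = {r:.4f}, t from correlation = {t_corr:.4f}, t from slope = {t_slope:.4f}")
```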
Influence of extreme observations
• NOTE: The result of a regression analysis is very much influenced by points with extreme values, in either the x or the y direction.
• Always investigate visually, and determine if outliers are actually erroneous observations
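The effect of a single extreme point is easy to demonstrate. In the sketch below, adding one made-up observation far out in the x direction changes the sign of the fitted slope. All data are hypothetical.

```python
def slope(xs, ys):
    """Least squares slope of y against x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx


# Hypothetical data, for illustration only
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
b_clean = slope(xs, ys)

# The same data plus one extreme (made-up) observation at x = 10, y = 0
b_outlier = slope(xs + [10.0], ys + [0.0])

print("slope without the extreme point:", round(b_clean, 3))
print("slope with the extreme point:   ", round(b_outlier, 3))
```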
Example: Transformed variables
• The relationship between variables may not be linear
• Example: The natural model may be $y = a e^{bx}$
• We want to find a and b so that the curve approximates the points as well as possible
Example (cont.)
• When $y = a e^{bx}$ then $\log(y) = \log(a) + bx$
• Use standard formulas on the pairs (x1,log(y1)), (x2, log(y2)), ..., (xn, log(yn))
• We get estimates for log(a) and b, and thus a and b
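A sketch of this recipe, assuming the exponential model y = a·exp(b·x): fit a straight line to the pairs (x, log(y)) and transform the intercept back. The data are synthetic, generated exactly from a = 2 and b = 0.5, so the fit should recover those values; real data would of course not fit exactly.

```python
import math


def fit_line(us, vs):
    """Least squares fit of v = intercept + slope * u."""
    n = len(us)
    u_bar = sum(us) / n
    v_bar = sum(vs) / n
    slope = sum((u - u_bar) * (v - v_bar) for u, v in zip(us, vs)) / sum(
        (u - u_bar) ** 2 for u in us
    )
    return v_bar - slope * u_bar, slope


# Synthetic data generated from y = 2 * exp(0.5 * x), for illustration only
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * math.exp(0.5 * x) for x in xs]

# Standard formulas applied to the pairs (x, log(y))
log_a, b = fit_line(xs, [math.log(y) for y in ys])
a = math.exp(log_a)

print(f"a = {a:.4f}, b = {b:.4f}")
```

Note that least squares on the log scale minimizes relative rather than absolute errors in y, which is one of the subjective modelling choices involved.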
Another example of transformed variables
• Another natural model may be $y = a x^b$
• We get that $\log(y) = \log(a) + b \log(x)$
• Use standard formulas on the pairs (log(x1), log(y1)), (log(x2), log(y2)), ..., (log(xn), log(yn))

Note: In this model, the curve goes through (0,0) (when b > 0)
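The same idea, sketched for a power model y = a·x^b: take logarithms of both x and y before fitting the line. The data are synthetic, generated from a = 3 and b = 2, and all x-values must be positive for the logarithm to exist.

```python
import math


def fit_line(us, vs):
    """Least squares fit of v = intercept + slope * u."""
    n = len(us)
    u_bar = sum(us) / n
    v_bar = sum(vs) / n
    slope = sum((u - u_bar) * (v - v_bar) for u, v in zip(us, vs)) / sum(
        (u - u_bar) ** 2 for u in us
    )
    return v_bar - slope * u_bar, slope


# Synthetic data generated from y = 3 * x**2, for illustration only
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0 * x ** 2 for x in xs]

# Standard formulas applied to the pairs (log(x), log(y))
log_a, b = fit_line([math.log(x) for x in xs], [math.log(y) for y in ys])
a = math.exp(log_a)

print(f"a = {a:.4f}, b = {b:.4f}")
```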

More than one independent variable: Multiple regression
• Assume we have data of the type

(x11, x12, x13, y1), (x21, x22, x23, y2), ...

• We want to ”explain” y from the x-values by fitting the following model: $y = a + b x_1 + c x_2 + d x_3$
• Just like before, one can produce formulas for a,b,c,d minimizing the sum of the squares of the ”errors”.
• x1,x2,x3 can be transformations of different variables, or transformations of the same variable
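One way to obtain a, b, c, d is to solve the normal equations (XᵀX)β = Xᵀy, where each row of X is (1, x1, x2, x3). The sketch below does this with plain Gaussian elimination so that it needs no external library; in practice a statistics package would be used. The function names are hypothetical and the data are synthetic, generated exactly from y = 1 + 2·x1 − x2 + 0.5·x3.

```python
def solve(A, rhs):
    """Solve A z = rhs by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [r] for row, r in zip(A, rhs)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(M[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (M[i][n] - s) / M[i][i]
    return z


def multiple_regression(rows, ys):
    """Least squares coefficients [a, b, c, ...] for y = a + b*x1 + c*x2 + ..."""
    X = [[1.0] + list(r) for r in rows]  # leading 1 gives the intercept
    p = len(X[0])
    XtX = [[sum(x[i] * x[j] for x in X) for j in range(p)] for i in range(p)]
    Xty = [sum(x[i] * y for x, y in zip(X, ys)) for i in range(p)]
    return solve(XtX, Xty)


# Synthetic data, for illustration only
rows = [(1, 2, 0), (2, 1, 3), (3, 4, 1), (4, 3, 5), (5, 6, 2), (6, 5, 4)]
ys = [1 + 2 * x1 - x2 + 0.5 * x3 for x1, x2, x3 in rows]

a, b, c, d = multiple_regression(rows, ys)
print(f"a = {a:.3f}, b = {b:.3f}, c = {c:.3f}, d = {d:.3f}")
```

If the explanatory variables were exactly linearly related, the matrix XᵀX would be singular and the elimination would fail with a division by zero.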
Multiple regression model
• The errors are independent random (normal) variables with expectation zero and variance $\sigma^2$
• The explanatory variables x1i, x2i, …, xni cannot be linearly related
Use of multiple regression
• Versions of multiple regression are the most used models in econometrics and in health economics
• It is a powerful tool to detect and verify connections between variables
Doing a regression analysis
• Plot the data first, to investigate whether there is a natural relationship
• Linear or transformed model?
• Are there outliers which will unduly affect the result?
• Fit a model. Different models with the same number of parameters may be compared using $R^2$
• Make tests / confidence intervals for parameters
Interpretation
• The parameters may have important interpretations
• The model may be used for prediction at new values (caution: Extrapolation can sometimes be dangerous!)
• Remember that subjective choices have been made, and interpret cautiously