Canadian Bioinformatics Workshops

Presentation Transcript


  1. Canadian Bioinformatics Workshops www.bioinformatics.ca

  3. Module 4: Regression. Aline Tabet. Exploratory Data Analysis and Essential Statistics using R, Jan 24-25, 2011

  4. Regression What is regression? • One of the most widely used statistical methodologies. • Describes how one variable (or set of variables) depends on another variable (or set of variables). • Examples: • Weight vs. height • Yield vs. fertilizer

  5. Outline • Introduction • Simple Linear Regression • Multiple Linear Regression • For both cases we will discuss • Assumptions. • Fitting a model in R and interpreting output. • Model assessment. • Some model selection procedures.

  6. Regression • Regression aims to predict a response that can take on continuous values. • Often characterized as quantitative prediction rather than qualitative. • Simple linear regression is part of a much more general methodology: Generalized Linear Models. • Very closely related to the t-test and ANOVA.

  7. Simple Linear Regression Model Linear regression assumes a particular model: $y_i = \alpha + \beta x_i + \epsilon_i$. $x_i$ is the independent variable. Depending on the context, it is also known as a "predictor variable," "regressor," "controlled variable," "manipulated variable," "explanatory variable," "exposure variable," and/or "input variable." $y_i$ is the dependent variable, also known as the "response variable," "regressand," "measured variable," "observed variable," "responding variable," "explained variable," "outcome variable," "experimental variable," and/or "output variable." The $\epsilon_i$ are "errors" - not in the sense of being "wrong", but in the sense of creating deviations from the idealized model. The $\epsilon_i$ are assumed to be independent and $N(0, \sigma^2)$ (normally distributed); they are also called residuals. This model has two parameters: the regression coefficient $\beta$ and the intercept $\alpha$.
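
  To make the model concrete, here is a minimal R sketch that simulates data from it. The parameter values (alpha = 2, beta = 0.5, sigma = 1) are arbitrary choices for illustration, not values from the module.

      set.seed(1)                              # for reproducibility
      n     <- 50
      alpha <- 2                               # intercept (hypothetical value)
      beta  <- 0.5                             # slope (hypothetical value)
      sigma <- 1                               # error standard deviation
      x   <- runif(n, 0, 10)                   # predictor values
      eps <- rnorm(n, mean = 0, sd = sigma)    # errors: independent N(0, sigma^2)
      y   <- alpha + beta * x + eps            # response generated by the model
      plot(x, y)                               # scatter of the simulated data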

  8. Simple Linear Regression • Characteristics: • Only two variables are of interest • One variable is a response and one is a predictor • No adjustment is needed for confounding or other between-subject variation • Assumptions: • Linearity • $\sigma^2$ is constant, independent of $x$ • The $\epsilon_i$ are independent of each other • For proper statistical inference (CIs, p-values), the $\epsilon_i$ are normally distributed • No outliers • $x$ is measured without error

  9. A Simple Example • Investigate the relationship between yield (litres) and fertilizer (kg/ha) for tomato plants. • Varied amounts of fertilizer were randomly assigned to 11 plots of land and the yield was measured at the end of the season. • The amount of fertilizer applied to each plot was chosen in advance. Interest also lies in predicting the yield when 16 kg/ha are applied. • At the end of the experiment, the yields were measured and the following data were obtained.

  10. We are interested in fitting the line $y = a + b x$.

  11. Linear regression Linear regression analysis includes: Estimation of the parameters; Characterization of goodness of fit.

  12. Linear regression: estimation For a linear model $y = a + b x$ with estimated parameters $a$ and $b$, the sum of squared errors is $SSE = \sum_i (y_i - a - b x_i)^2$. Estimation: choose the parameters $a$ and $b$ so that the SSE is as small as possible. We call these the least squares estimates. The method of least squares has an analytic solution in the linear case.
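
  For the simple linear case, the analytic solution is the standard closed form

  $$\hat b = \frac{\sum_i (x_i - \bar x)(y_i - \bar y)}{\sum_i (x_i - \bar x)^2}, \qquad \hat a = \bar y - \hat b \, \bar x,$$

  where $\bar x$ and $\bar y$ are the sample means of the predictor and the response.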

  13. Linear regression: residuals The residual for observation $i$ is $e_i = y_i - \hat y_i = y_i - (a + b x_i)$, the vertical distance between the observed point and the fitted line.

  14. The model we fit The R summary() output of the fitted model shows a summary of the residuals, the parameter estimates (with standard errors and t-tests), and other useful quantities such as the residual standard error and R².
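
  A minimal sketch of how such a fit is produced in R. The data values below are hypothetical stand-ins for the module's tomato data, which are not reproduced in this transcript:

      # hypothetical fertilizer (kg/ha) and yield (litres) for 11 plots
      fertilizer <- c(12, 5, 15, 17, 20, 14, 6, 23, 11, 13, 8)
      yield      <- c(24, 18, 31, 33, 35, 30, 20, 40, 25, 27, 21)

      fit <- lm(yield ~ fertilizer)  # least squares fit of yield on fertilizer
      summary(fit)                   # residual summary, estimates, R^2, etc.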

  15. The fitted line

  16. Interpretation of the R output • The estimated slope is the estimated change in yield when the amount of fertilizer is increased by 1 unit. • The estimated intercept is the estimated yield when the amount of fertilizer is 0. • The estimated standard error is an estimate of the standard deviation of the estimate over all possible repeated experiments. It can be used to construct an approximate confidence interval: estimate $\pm\ t_{n-2,\,0.975} \times$ standard error (for a 95% interval).
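
  In R the interval can be obtained directly. A sketch, continuing the hypothetical fit from above:

      confint(fit, level = 0.95)  # 95% CIs for intercept and slope
      # equivalently, by hand for the slope:
      est <- coef(summary(fit))["fertilizer", "Estimate"]
      se  <- coef(summary(fit))["fertilizer", "Std. Error"]
      est + c(-1, 1) * qt(0.975, df = df.residual(fit)) * se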

  17. Hypothesis testing in LM • In linear regression problems, one hypothesis of interest is whether the true slope is zero. • Compute the test statistic $t = \hat b / SE(\hat b)$. • This is compared to a t-distribution with n - 2 = 9 degrees of freedom. The p-value is found to be very small (less than 0.0001). • We can conclude that there is strong evidence that the true slope is not zero.
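
  A sketch of the same test done by hand in R, continuing the hypothetical example; summary(fit) reports the identical t value and p-value:

      b_hat  <- coef(summary(fit))["fertilizer", "Estimate"]
      se_b   <- coef(summary(fit))["fertilizer", "Std. Error"]
      t_stat <- b_hat / se_b   # test statistic for H0: slope = 0
      2 * pt(abs(t_stat), df = df.residual(fit), lower.tail = FALSE)  # two-sided p-value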

  18. What about predictions? • What would the future yield be when 16 kg/ha of fertilizer are applied? Interpretation? The 95% confidence interval for the mean tomato yield when 16 kg/ha of fertilizer are applied is between 28.81 and 32.15 litres.

  19. Prediction Interval for a single observation • We can also compute prediction intervals for a single future observation. Prediction intervals for a single observation are wider than confidence intervals for the mean.
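
  Both intervals come from predict(). A sketch, continuing the hypothetical example:

      new <- data.frame(fertilizer = 16)
      predict(fit, new, interval = "confidence")  # CI for the mean yield at 16 kg/ha
      predict(fit, new, interval = "prediction")  # wider PI for a single new plot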

  20. Linear regression: quality control • Two parts: • Is the model adequate? • Residuals • Are the parameter estimates good? • Prediction confidence limits • Mean square error • Cross Validation

  21. Linear regression: quality control • Residual plots allow us to validate underlying assumptions: • The relationship between response and regressor should be linear (at least approximately). • The error term $\epsilon$ should have zero mean. • The error term $\epsilon$ should have constant variance. • Errors should be normally distributed (required for tests and intervals).
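
  R produces the standard diagnostic plots directly from the fitted object. A sketch, continuing the hypothetical example:

      par(mfrow = c(2, 2))  # 2x2 grid of diagnostic panels
      plot(fit)             # residuals vs fitted, Q-Q, scale-location, leverage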

  22. Linear regression: quality control [Figure: example residual-vs-fitted patterns. Source: Montgomery et al., 2001, Introduction to Linear Regression Analysis.] Check constant variance and linearity, and look for potential outliers.

  23. Linear regression: Q-Q plot Plotting the residuals vs. similarly distributed normal deviates checks the normality assumption. [Figure: one adequate and several inadequate Q-Q patterns. Source: Montgomery et al., 2001, Introduction to Linear Regression Analysis.]
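
  A normal Q-Q plot of the residuals in R; a sketch, continuing the hypothetical example:

      qqnorm(rstandard(fit))  # standardized residuals vs normal quantiles
      qqline(rstandard(fit))  # reference line; points should fall close to it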

  24. Linear regression: Evaluating accuracy If the model is valid, i.e. nothing terrible in the residuals, we can use it to predict. But how good is the prediction?

  25. Another Example • Relationship between mercury in food and in the blood. • Outliers?

  26. The New Fitted Line • With Prediction Intervals

      # sort on X so the interval lines plot smoothly
      o <- order(merc2[, 1])
      mercn <- merc2[o, ]
      # compute confidence and prediction intervals from the fitted model
      pc <- predict(Merc_fit, mercn, interval = "confidence")
      pp <- predict(Merc_fit, mercn, interval = "prediction")
      plot(mercn, xlab = "Mercury in Food", ylab = "Mercury in Blood")
      matlines(mercn[, 1], pc, lty = c(1, 2, 2), col = "black")  # fit + CI
      matlines(mercn[, 1], pp, lty = c(1, 3, 3), col = "red")    # fit + PI

  27. Multiple Linear Regression • Similar to simple linear regression, but with multiple predictors. • Not to be confused with multivariate regression, which has multiple responses. • Many of the concepts carry over directly from simple linear regression. • The model becomes: $y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \dots + \beta_p x_{pi} + \epsilon_i$.

  28. Model Assumptions • Marginal linearity. • Random sampling. • No outliers or influential points. • Constant variance. • Independence of observations. • Normality of the errors. • Predictors are measured without error.

  29. An Example: the Stackloss dataset • The data sets stack.loss and stack.x contain information on ammonia loss in a manufacturing plant (oxidation of ammonia to nitric acid), measured on 21 consecutive days. • The stack.x data set is a matrix with 21 rows and 3 columns representing three predictors: • air flow (Air.Flow) to the plant, • cooling water inlet temperature in °C (Water.Temp), and • acid concentration (Acid.Conc.) as a percentage (coded by subtracting 50 and then multiplying by 10). • The stack.loss data set is a vector of length 21 containing the percent of ammonia lost ×10 (the response variable).
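
  These data ship with base R (the stackloss data frame combines stack.x and stack.loss), so the full model can be fit directly. A minimal sketch:

      fit_full <- lm(stack.loss ~ Air.Flow + Water.Temp + Acid.Conc.,
                     data = stackloss)   # all three predictors
      summary(fit_full)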

  30. Would a transformation be appropriate?

  31. Careful with the interpretation!

  32. Model Selection • Prefer a simple (parsimonious) model. • Only include variables that significantly improve the model. • One simple approach: fit a model with all of the variables, then ask whether we can drop one, as in the sketch below. • This lowers the risk of over-fitting. • In our example we can compare a model that has all three predictors with one that has two (Acid.Conc. omitted).
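
  A sketch of that comparison in R via a partial F-test, using the stackloss fit from above:

      fit_red <- update(fit_full, . ~ . - Acid.Conc.)  # drop acid concentration
      anova(fit_red, fit_full)  # F-test: does Acid.Conc. significantly improve fit?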

  33. Variable Selection: Procedure Model selection follows five general steps: 1. Specify the maximum model (i.e. the largest set of predictors). 2. Specify a criterion for selecting a model. 3. Specify a strategy for selecting variables. 4. Specify a mechanism for fitting the models - usually least squares. 5. Assess the goodness-of-fit of the models and the predictions.

  34. Some Criteria that can be used • R²: the proportion of total variation in the data that is explained by the predictors. • Fp: hypothesis tests to find the set of p variables that is not statistically different from the full model. • MSEp: the set of p variables that gives the smallest estimated residual variance about the regression line. • Cp and AIC/BIC: a combination of fit and a penalty for the number of predictors.

  35. Choosing which Subset to examine • All possible subsets • Forward addition • Backwards elimination • Stepwise selection
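
  Base R's step() performs stepwise selection by AIC. A sketch, using the stackloss fit from above:

      step(fit_full, direction = "backward")  # backward elimination by AIC
      step(fit_full, direction = "both")      # stepwise: add or drop at each step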

  36. Regression: summary • Regression is a statistical technique for investigating and modeling the relationship between variables, which allows: • Parameter estimation • Hypothesis testing • Use of the model (prediction) • It's a powerful framework that can be readily generalized. • You need to be familiar with your data, explore it in various ways, and check the model assumptions carefully!

  37. We are on a Coffee Break & Networking Session
