
Principles of Biostatistics Chapter 18 Simple Linear Regression



  1. Principles of Biostatistics Chapter 18 Simple Linear Regression 宇传华 http://statdtedm.6to23.com

  2. Terminology
Linear regression 线性回归
Response (dependent) variable 反应(应)变量
Explanatory (independent) variable 解释(自)变量
Linear regression model 线性回归模型
Regression coefficient 回归系数
Slope 斜率
Intercept 截距
Method of least squares 最小二乘法
Error sum of squares (residual sum of squares) 残差(剩余)平方和
Coefficient of determination 决定系数
Outlier 异常点(值)
Homoscedasticity 方差齐同
Heteroscedasticity 方差非齐同

  3. Contents
18.1 An Example
18.2 The Simple Linear Regression Model
18.3 Estimation: The Method of Least Squares
18.4 Error Variance and the Standard Errors of Regression Estimators
18.5 Confidence Intervals for the Regression Parameters
18.6 Hypothesis Tests about the Regression Relationship
18.7 How Good is the Regression?
18.8 Analysis of Variance Table and an F Test of the Regression Model
18.9 Residual Analysis
18.10 Prediction Interval and Confidence Interval

  4. 18.1 An Example

  5. Scatterplot This scatterplot locates pairs of observations of serum IL-6 on the x-axis and brain IL-6 on the y-axis. We notice that: larger (smaller) values of brain IL-6 tend to be associated with larger (smaller) values of serum IL-6; the scatter of points tends to be distributed around a positively sloped straight line; and the pairs of values of serum IL-6 and brain IL-6 are not located exactly on a straight line. The scatterplot reveals a more or less strong tendency rather than a precise linear relationship. The line represents the nature of the relationship on average.

  6. Examples of Other Scatterplots
[figure: six scatterplots illustrating other possible patterns of Y against X]

  7. Model Building The inexact nature of the relationship between serum and brain IL-6 suggests that a statistical model might be useful in analyzing the relationship. A statistical model separates the systematic component of a relationship from the random component: Statistical model = Systematic component + Random errors. In ANOVA, the systematic component is the variation of means between samples or treatments (SSTR) and the random component is the unexplained variation (SSE). In regression, the systematic component is the overall linear relationship, and the random component is the variation of the points around the line.

  8. 18.2 The Simple Linear Regression Model The population simple linear regression model: y = α + βx + ε, or μ_y|x = α + βx, where α + βx is the nonrandom (systematic) component and ε is the random component. Here y is the dependent (response) variable, the variable we wish to explain or predict; x is the independent (explanatory) variable, also called the predictor variable; and ε is the error term, the only random component in the model and thus the only source of randomness in y. μ_y|x is the mean of y when x is specified, also called the conditional mean of Y. α is the intercept of the systematic component of the regression relationship, and β is its slope.
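A quick way to internalize this decomposition is to simulate from it. The following Python sketch uses hypothetical parameter values (α = 2, β = 0.5, σ = 1, chosen purely for illustration, not taken from the chapter) to generate data as systematic component plus random error:

```python
import random

random.seed(0)  # reproducible illustration

# Hypothetical population parameters (not from the chapter's example).
alpha, beta, sigma = 2.0, 0.5, 1.0

# The values of X are fixed (nonrandom), per the model's assumptions.
xs = [i / 10 for i in range(1, 101)]

# Each observed y is the systematic component alpha + beta*x
# plus a random error drawn from N(0, sigma^2).
errors = [random.gauss(0.0, sigma) for _ in xs]
ys = [alpha + beta * x + e for x, e in zip(xs, errors)]

# The conditional mean of Y at a given x is the systematic part alone.
def mu_y_given_x(x):
    return alpha + beta * x
```

The only randomness in each `y` enters through `errors`; the conditional mean `mu_y_given_x` is deterministic, which is exactly the distinction the slide draws.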

  9. Picturing the Simple Linear Regression Model The simple linear regression model posits (assumes) an exact linear relationship between the expected or average value of the dependent variable Y and the independent or predictor variable X: μ_y|x = α + βx. Actual observed values of Y differ from the expected value μ_y|x by an unexplained or random error ε: y = μ_y|x + ε = α + βx + ε.
[figure: regression plot of the line μ_y|x = α + βx, showing the intercept α, the slope β, and the error ε for one observed point]

  10. Assumptions of the Simple Linear Regression Model (the LINE assumptions) The relationship between X and Y is a straight-Line relationship. The values of the independent variable X are assumed fixed (not random); the only randomness in the values of Y comes from the error term ε. The errors ε are uncorrelated (Independent) in successive observations. The errors ε are Normally distributed with mean 0 and Equal variance σ², that is, ε ~ N(0, σ²).
[figure: identical normal distributions of errors, all centered on the regression line μ_y|x = α + βx, each N(μ_y|x, σ_y|x²)]

  11. 18.3 Estimation: The Method of Least Squares Estimation of a simple linear regression relationship involves finding estimated or predicted values of the intercept and slope of the linear regression line. The estimated regression equation: y = a + bx + e, where a estimates the intercept of the population regression line, α; b estimates the slope of the population regression line, β; and e stands for the observed errors, the residuals from fitting the estimated regression line a + bx to a set of n points. The estimated regression line: ŷ = a + bx, where ŷ ("y-hat") is the value of Y lying on the fitted regression line for a given value of X.

  12. Fitting a Regression Line
[figure: four panels comparing the data, three errors from an arbitrary fitted line, and three errors from the least squares regression line, for which the squared errors are minimized]

  13. Errors in Regression
[figure: one observation (x_i, y_i) and its vertical error from the fitted regression line]

  14. Least Squares Regression The sum of squared errors (SSE; the residual sum of squares) in regression is SSE = Σ_{i=1}^n e_i² = Σ_{i=1}^n (y_i − ŷ_i)². The least squares regression line is the one that minimizes SSE with respect to the estimates a and b.
[figure: SSE plotted as a function of a and b, minimized at the least squares estimates]

  15. Sums of Squares, Cross Products, and Least Squares Estimators Sums of squares and cross products:
l_xx = Σ(x − x̄)² = Σx² − (Σx)²/n
l_yy = Σ(y − ȳ)² = Σy² − (Σy)²/n
l_xy = Σ(x − x̄)(y − ȳ) = Σxy − (Σx)(Σy)/n
Least squares regression estimators: b = l_xy / l_xx and a = ȳ − b·x̄. (The least squares regression line always passes through the point of means (x̄, ȳ).)
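The formulas above can be checked with a short Python sketch. The five data points are hypothetical, chosen for easy arithmetic; they are not the chapter's Example 18-1 data, which the transcript does not reproduce:

```python
# Hypothetical data (not from Example 18-1).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

# Sums of squares and cross products.
lxx = sum(xi * xi for xi in x) - sum(x) ** 2 / n
lyy = sum(yi * yi for yi in y) - sum(y) ** 2 / n
lxy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n

# Least squares estimators.
b = lxy / lxx                  # slope
xbar, ybar = sum(x) / n, sum(y) / n
a = ybar - b * xbar            # intercept

# The fitted line passes through the point of means (xbar, ybar).
assert abs((a + b * xbar) - ybar) < 1e-9
```

For these data l_xx = 10, l_xy = 6, so b = 0.6 and a = 2.2.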

  16. Example 18-1

  17. Example 18-1: Using Computer - Excel The results below are the output created by selecting the REGRESSION option from the DATA ANALYSIS toolkit. (After a full installation of Office, the Data Analysis add-in can be enabled from the menu via Tools, then Add-Ins.)

  18. Total Variance and Error Variance
[figure: two scatterplots contrasting what you see when looking at the total variation of Y with what you see when looking along the regression line at the error variance of Y]

  19. 18.4 Error Variance and the Standard Errors of Regression Estimators
[figure: square and sum all regression errors to find SSE]

  20. Standard Errors of Estimates in Regression
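This slide's formulas are not reproduced in the transcript, so here is a hedged sketch of the estimates as usually defined in textbooks: the error variance s²_{y·x} = SSE/(n − 2), the standard error of the slope s_b = s_{y·x}/√l_xx, and the standard error of the intercept s_a = s_{y·x}·√(1/n + x̄²/l_xx). The data are the same hypothetical five points used earlier, not Example 18-1:

```python
import math

# Same hypothetical dataset as in the least squares sketch.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

lxx = sum(xi * xi for xi in x) - sum(x) ** 2 / n
lyy = sum(yi * yi for yi in y) - sum(y) ** 2 / n
lxy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
b = lxy / lxx
xbar = sum(x) / n

# Error sum of squares; dividing by n - 2 gives the error variance,
# since two parameters (a and b) were estimated from the data.
sse = lyy - b * lxy
s = math.sqrt(sse / (n - 2))   # standard error of estimate, s_{y.x}

# Standard errors of the regression estimators.
s_b = s / math.sqrt(lxx)
s_a = s * math.sqrt(1 / n + xbar ** 2 / lxx)
```

Here SSE = 2.4, so s = √0.8 ≈ 0.894 and s_b = √0.08 ≈ 0.283.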

  21. 18.5 Confidence Intervals for the Regression Parameters
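The transcript omits this slide's formulas; a common form of the 100(1 − α)% confidence interval for the slope is b ± t_{α/2, n−2}·s_b. A sketch with the same hypothetical data, where the critical value 3.182 is the tabulated t_{0.975} for 3 degrees of freedom:

```python
import math

# Same hypothetical dataset as before.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

lxx = sum(xi * xi for xi in x) - sum(x) ** 2 / n
lyy = sum(yi * yi for yi in y) - sum(y) ** 2 / n
lxy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
b = lxy / lxx
s = math.sqrt((lyy - b * lxy) / (n - 2))
s_b = s / math.sqrt(lxx)

t_crit = 3.182                 # t_{0.975} with n - 2 = 3 df, from a t table
lower, upper = b - t_crit * s_b, b + t_crit * s_b
# 95% CI for the slope beta; here it spans zero, so with only five
# points these data do not rule out beta = 0.
```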

  22. 18.6 Hypothesis Tests about the Regression Relationship
[figure: three scatterplots in which no linear relationship holds: constant Y, unsystematic variation, and a nonlinear relationship, each labeled H0: β = 0]
A hypothesis test for the existence of a linear relationship between X and Y:
H0: β = 0
H1: β ≠ 0
The test statistic is t = b / s_b, where b is the least squares estimate of the regression slope and s_b is the standard error of b. When the null hypothesis is true, this statistic has a t distribution with n − 2 degrees of freedom.
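With the same hypothetical five-point dataset, the test statistic can be computed directly; the critical value 3.182 (t_{0.975}, 3 df) is again taken from a t table:

```python
import math

# Same hypothetical dataset as before.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

lxx = sum(xi * xi for xi in x) - sum(x) ** 2 / n
lyy = sum(yi * yi for yi in y) - sum(y) ** 2 / n
lxy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
b = lxy / lxx
s = math.sqrt((lyy - b * lxy) / (n - 2))
s_b = s / math.sqrt(lxx)

# Test statistic for H0: beta = 0; t distribution with n - 2 df under H0.
t_stat = b / s_b
t_crit = 3.182                 # two-sided 5% critical value, 3 df
reject_h0 = abs(t_stat) > t_crit
```

Here t ≈ 2.12 < 3.182, so at the 5% level H0 is not rejected for this tiny illustrative sample.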

  23. Hypothesis Tests for the Regression Slope

  24. 18.7 How Good is the Regression? The coefficient of determination, R², is a descriptive measure of the strength of the regression relationship: a measure of how well the regression line fits the data. R² is the percentage of the total variation explained by the regression.
[figure: a point's total deviation from ȳ partitioned into the explained deviation (from ȳ to the line) and the unexplained deviation (the residual)]
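A short sketch, again with the hypothetical data, showing the decomposition SST = SSR + SSE, the definition R² = SSR/SST, and the fact that in simple linear regression R² equals the squared correlation coefficient:

```python
# Same hypothetical dataset as before.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

lxx = sum(xi * xi for xi in x) - sum(x) ** 2 / n
lyy = sum(yi * yi for yi in y) - sum(y) ** 2 / n
lxy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
b = lxy / lxx

sst = lyy            # total sum of squares
ssr = b * lxy        # regression (explained) sum of squares
sse = sst - ssr      # error (unexplained) sum of squares
r_squared = ssr / sst

# In simple linear regression, R^2 equals the squared correlation r^2.
r = lxy / (lxx * lyy) ** 0.5
```

For these data R² = 3.6/6 = 0.6: the regression explains 60% of the variation in y.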

  25. The Coefficient of Determination
[figure: three scatterplots partitioning SST into SSR and SSE, with R² = 0, R² = 0.50, and R² = 0.90]

  26. 18.8 Analysis of Variance Table and an F Test of the Regression Model
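The ANOVA table itself is not in the transcript; a sketch of its computation under the usual layout (SSR with 1 df, SSE with n − 2 df, F = MSR/MSE), with the same hypothetical data:

```python
# Same hypothetical dataset as before.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

lxx = sum(xi * xi for xi in x) - sum(x) ** 2 / n
lyy = sum(yi * yi for yi in y) - sum(y) ** 2 / n
lxy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
b = lxy / lxx

ssr = b * lxy        # regression SS, 1 df
sse = lyy - ssr      # error SS, n - 2 df
msr = ssr / 1        # mean square for regression
mse = sse / (n - 2)  # mean square error
f_stat = msr / mse   # F with (1, n - 2) df under H0: beta = 0
```

With one predictor, F is exactly the square of the slope t statistic (here F = 4.5 = 2.1213²), so the F test and the two-sided t test are equivalent.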

  27. 18.9 Residual Analysis Homoscedasticity: the residuals appear completely random, giving no indication of model inadequacy. Heteroscedasticity: the variance of the residuals changes as x changes. A curved pattern in the residuals results from an underlying nonlinear relationship. Residuals may also exhibit a linear trend with time.
[figure: four residual plots illustrating these patterns]
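Before plotting residuals, it helps to know two algebraic facts that hold for any least squares fit: the residuals sum to zero and are uncorrelated with x. Any visible pattern in a residual plot is therefore real structure, not an artifact of the fitting. A sketch with the same hypothetical data:

```python
# Same hypothetical dataset as before.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

lxx = sum(xi * xi for xi in x) - sum(x) ** 2 / n
lxy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
b = lxy / lxx
a = sum(y) / n - b * sum(x) / n

y_hat = [a + b * xi for xi in x]
residuals = [yi - yh for yi, yh in zip(y, y_hat)]

# Algebraic consequences of least squares fitting (the normal equations):
assert abs(sum(residuals)) < 1e-9                            # sum to zero
assert abs(sum(xi * e for xi, e in zip(x, residuals))) < 1e-9  # uncorrelated with x
```

Plotting `residuals` against `x` (or against `y_hat`) then reveals heteroscedasticity or curvature, as described above.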

  28. Example 18-1: Using Computer - Excel, Residual Analysis The plot shows a curved relationship between the residuals and the X values (serum IL-6).

  29. 18.10 Prediction Interval and Confidence Interval Point prediction: a single-valued estimate of Y for a given value of X, obtained by inserting that value of X into the estimated regression equation. Prediction interval (for an individual value of Y given a value of X): accounts for both the variation in the regression line estimate and the variation of points around the regression line. Confidence interval (for the average value of Y given a value of X): accounts only for the variation in the regression line estimate.

  30. Prediction Interval for a Value of Y

  31. Confidence Interval for the Average Value of Y

  32. Confidence Interval for the Average Value of Y and Prediction Interval for the Individual Value of Y
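Since the interval formulas are not reproduced in the transcript, here is a sketch contrasting the two intervals at a chosen x₀ under their standard textbook form (hypothetical data as before; 3.182 is t_{0.975} with 3 df). The prediction interval carries an extra "1 +" term for the scatter of individual points around the line, so it is always wider than the confidence interval for the mean:

```python
import math

# Same hypothetical dataset as before.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

lxx = sum(xi * xi for xi in x) - sum(x) ** 2 / n
lyy = sum(yi * yi for yi in y) - sum(y) ** 2 / n
lxy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
b = lxy / lxx
xbar = sum(x) / n
a = sum(y) / n - b * xbar
s = math.sqrt((lyy - b * lxy) / (n - 2))
t_crit = 3.182                 # t_{0.975}, 3 df, from a t table

x0 = 3.0
y0_hat = a + b * x0            # point prediction at x0

# Confidence interval for the mean of Y at x0: line-estimate variation only.
se_mean = s * math.sqrt(1 / n + (x0 - xbar) ** 2 / lxx)
ci = (y0_hat - t_crit * se_mean, y0_hat + t_crit * se_mean)

# Prediction interval for an individual Y at x0: adds the scatter of
# points around the line via the extra "1 +" term.
se_pred = s * math.sqrt(1 + 1 / n + (x0 - xbar) ** 2 / lxx)
pi = (y0_hat - t_crit * se_pred, y0_hat + t_crit * se_pred)
```

Both intervals are narrowest at x₀ = x̄ and widen as x₀ moves away from it, which is why the two bands on the slide are bow-shaped.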

  33. Summary 1. Regression analysis is applied to predict the response and to quantify the effect of the independent variable X. 2. The principle of least squares in the solution of the regression parameters is to minimize the residual sum of squares. 3. The coefficient of determination, R², is a descriptive measure of the strength of the regression relationship. 4. There are two confidence bands: one for mean predictions and the other for individual prediction values. 5. Residual analysis is used to check the goodness of fit of the model.

  34. Assignments 1. What are the main distinctions and associations between correlation analysis and simple linear regression? 2. What is the least squares method for estimating the regression line? 3. Please describe the main steps for fitting a simple linear regression model to data.

  35. Main Distinctions 1. Data requirements: correlation analysis requires that both X and Y follow normal distributions; simple linear regression requires only that Y follow a normal distribution (for each fixed x). 2. Application: correlation analysis is employed to measure the association between two random variables (X and Y are treated symmetrically); simple linear regression is employed to measure the change in Y per unit change in X (X is the independent variable, Y the dependent variable). 3. r is a dimensionless number with no unit of measurement; b has units, namely those of Y per unit of X.

  36. Main Associations 1. The t statistics for testing r and b are equal: t_r = t_b. 2. r and b always have the same sign.
