Explore the basics of simple linear regression, hypothesis testing, use of dummy variables, and interpretation of results in a pediatric patient group. Get hands-on with Rcmdr for modeling and prediction.
 
                
Statistics
April 9, 2009
Tim Bunnell, Ph.D. & Jobayer Hossain, Ph.D.
Nemours Bioinformatics Core Facility
Nemours Biomedical Research
Simple Linear Regression
Summary of the last class discussion:
• Scatter plot for exploring the relationship (form, direction, and strength) between two variables
• Correlation coefficient to describe the direction and strength of the linear relationship between two variables
• Simple linear regression to describe the linear association between a quantitative response variable and a quantitative explanatory variable, and to predict the mean of the response variable from the explanatory variable
Rcmdr demo: Simple Linear Regression
• Statistics -> Fit models -> Linear regression -> in the dialog, type a name for the model, pick the response (e.g. hgt) and explanatory (e.g. age) variables, then click OK.

    Call:
    lm(formula = hgt ~ age, data = data)

    Residuals:
         Min       1Q   Median       3Q      Max
    -2.53975 -0.55722  0.08105  0.68147  2.24326

    Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
    (Intercept) 31.018720   0.439077   70.64   <2e-16 ***
    age          0.187735   0.004609   40.73   <2e-16 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 1.07 on 58 degrees of freedom
    Multiple R-squared:  0.9662,    Adjusted R-squared:  0.9656
    F-statistic:  1659 on 1 and 58 DF,  p-value: < 2.2e-16
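The menu steps above produce an ordinary `lm()` call, so the same kind of fit can be run from the R console. The sketch below is only an illustration: the class data set is not reproduced in these slides, so the data frame `sim` is simulated (an assumption), and its estimates will not match the output shown above.

```r
# Simulated stand-in for the class data (hypothetical values, not the real data)
set.seed(1)
sim <- data.frame(age = runif(60, min = 48, max = 143))  # age in months
sim$hgt <- 31 + 0.19 * sim$age + rnorm(60, sd = 1)       # height in inches

# Same model form as the Rcmdr dialog: response ~ explanatory
fit <- lm(hgt ~ age, data = sim)

summary(fit)   # coefficient table, residual standard error, R-squared, F-test
coef(fit)      # named vector with elements "(Intercept)" and "age"
```

With 60 observations and two estimated coefficients, the residual degrees of freedom are 58, as in the slide output.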
Simple Linear Regression
• Our hypotheses of interest in regression analysis are:
• $H_0$: regression coefficient $b_1 = 0$ (no influence of 'age' on 'height')
• $H_a$: $b_1 \neq 0$ (there is an influence of 'age' on 'height')
• The test statistic is t.
• From our example, $b_1 = 0.1877$ with a p-value < 0.001.
• This indicates that, on average, the response variable 'height' increases by 0.1877 inches for each one-month increase in the independent variable 'age', and that this average monthly increment is statistically significant (that is, we reject $H_0$: $b_1 = 0$).
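The t value and p-value in the coefficient table can be recovered by hand from the estimate and its standard error. A minimal sketch, using only numbers copied from the hgt ~ age output (the variable names are chosen here for illustration):

```r
# Values copied from the regression output for hgt ~ age
b1    <- 0.187735   # estimated slope
se_b1 <- 0.004609   # standard error of the slope
df    <- 58         # residual degrees of freedom (n - 2)

t_stat  <- b1 / se_b1                      # test statistic for H0: b1 = 0
p_value <- 2 * pt(-abs(t_stat), df = df)   # two-sided p-value from the t distribution

round(t_stat, 2)   # agrees with the t value of 40.73 in the output
p_value < 0.001    # TRUE, so H0 is rejected
```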
Simple Linear Regression
• The intercept, $b_0$, of a regression line is the predicted mean of the response variable when the independent variable equals zero.
• In our example, the intercept 31.02 says that the mean height of this group of kids was 31.02 inches at age 0 (at birth), which may not reflect the actual height at birth.
• The ages of this particular group of pediatric patients range from 48 to 143 months.
• A possible reason for this poor estimate of height at birth is that growth slows with increasing age, so the slope for a younger age group will differ (higher or lower?) from the slope for this age group.
• If we had included younger patients, then we might have obtained a fit that gives a better estimate of height at birth.
• Lesson: be careful before extrapolating regression results (predicting beyond the range of the data).
Simple Linear Regression
• In our example, the regression line of height on age is
• Height = 31.01872 + 0.187735 * age
• What is the average height of pediatric patients at the age of 100 months?
• Height = 31.01872 + 0.187735 * 100 = 49.7922 inches
• What about the height at 10 months or 200 months? (Do these ages fall within the age range of these pediatric patients, 48 to 143 months?)
• What slope would you expect for ages below 48 months (higher or lower than 0.1877?) or above 143 months?
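The prediction and the extrapolation question above can be sketched in a few lines of R. The coefficients and the age range are taken from the slides; the helper name `predict_height` is hypothetical and introduced only for this illustration.

```r
# Coefficients and observed age range taken from the slides
b0 <- 31.01872
b1 <- 0.187735
age_range <- c(48, 143)   # months

# Hypothetical helper: predicted mean height (inches) at a given age (months)
predict_height <- function(age) b0 + b1 * age

# Flag ages that lie outside the range the model was fitted on
ages <- c(10, 100, 200)
data.frame(
  age          = ages,
  height       = predict_height(ages),
  extrapolated = ages < age_range[1] | ages > age_range[2]
)
```

Only the prediction at 100 months falls inside the observed range; the values at 10 and 200 months are extrapolations and should be treated with caution.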
Simple Linear Regression
• Dummy or indicator variable:
• An artificially defined variable that encodes a particular attribute so that qualitative variables can be included in the standard regression model.
• A dummy variable usually takes the value 1 or 0 to indicate the presence or absence of an attribute, e.g., 1 for male and 0 for female.
• Zero (0) represents the reference category.
• A categorical variable with k levels can be expressed as (k-1) dummy variables.
Rcmdr: Simple Linear Regression
• Creating a dummy variable:
• Data -> Manage variables in active data set -> Recode variables -> pick the variable to recode (e.g. sex) -> give a name for the new variable (e.g. sexc) -> uncheck "Make new variable a factor" (checked by default) -> enter the recode directives ("m" = 1 and "f" = 0) -> OK
• This creates a variable sexc with values 1 and 0.
• Running the regression:
• Statistics -> Fit models -> Linear regression -> type a name for the model (e.g. Model2), pick the response (e.g. LWAS) and explanatory (e.g. sexc) variables -> OK
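For readers working at the console, here is one way to build the same 0/1 variable directly in R. It is a sketch under assumptions: the small data frame `dat` is a hypothetical stand-in for the class data, and the command is not necessarily the one Rcmdr itself generates.

```r
# Hypothetical stand-in for the 'sex' column of the class data
dat <- data.frame(sex = c("m", "f", "f", "m", "f"))

# Numeric dummy: 1 for male, 0 for female (female is the reference category)
dat$sexc <- ifelse(dat$sex == "m", 1, 0)

table(dat$sex, dat$sexc)   # cross-tabulate to check the coding is as intended
```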
Rcmdr: Simple Regression (output)

    Call:
    lm(formula = LWAS ~ sexc, data = data)

    Residuals:
        Min      1Q  Median      3Q     Max
    -45.030 -10.186   3.845  10.244  22.649

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)   86.805      2.740   31.68   <2e-16 ***
    sexc          -9.454      3.875   -2.44   0.0178 *
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 15.01 on 58 degrees of freedom
    Multiple R-squared:  0.09307,   Adjusted R-squared:  0.07744
    F-statistic: 5.952 on 1 and 58 DF,  p-value: 0.01778
Rcmdr: Simple Regression (output): Interpretation
• A dummy variable regression predicts the mean level of each group.
• The estimate of the intercept is the mean for the reference group.
• In our example, the mean LWAS for female patients is 86.805.
• The estimate for the dummy variable is the difference between the means of the two categories.
• The sum of the intercept estimate and the dummy variable estimate (e.g. sexc) is the predicted mean for the other group.
• In our example, the mean LWAS for male patients is (86.805 - 9.454) = 77.351.
• A p-value < 0.05 for the dummy variable (e.g. sexc) indicates a significant effect of that variable.
• The p-value for sexc is 0.0178 (< 0.05), which indicates a significant effect of sex on the response LWAS. Put another way, the mean LWAS differs significantly between males and females.
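The claim that a dummy variable regression reproduces the group means can be checked numerically. The sketch below uses simulated scores (hypothetical values, not the class data), so only the identities matter, not the particular numbers.

```r
# Simulated scores (hypothetical; not the class data)
set.seed(2)
dat <- data.frame(sexc = rep(c(0, 1), each = 30))   # 0 = female, 1 = male
dat$LWAS <- 87 - 9 * dat$sexc + rnorm(60, sd = 15)

fit <- lm(LWAS ~ sexc, data = dat)
b <- coef(fit)

group_means <- tapply(dat$LWAS, dat$sexc, mean)   # means for sexc = 0 and sexc = 1

# Intercept = mean of the reference group (sexc = 0)
all.equal(b[[1]], group_means[["0"]])
# Intercept + dummy estimate = mean of the other group (sexc = 1)
all.equal(b[[1]] + b[[2]], group_means[["1"]])
```

Both comparisons return TRUE up to floating-point tolerance, which is the arithmetic behind 86.805 and 86.805 - 9.454 = 77.351 on the slide.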
Multiple Regression
• Two or more independent variables are used to predict a single dependent variable.
• The multiple regression model of Y on p explanatory variables can be written as
  $$Y = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_p X_p + e$$
  where $b_i$ ($i = 1, 2, \dots, p$) is the regression coefficient of $X_i$, $b_0$ is the intercept, and $e$ is the error term.
Multiple Regression
• The fitted Y is given by
  $$\hat{Y} = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_p X_p$$
• The estimated residual error is the same as in simple linear regression:
  $$\hat{e} = Y - \hat{Y}$$
Rcmdr: Multiple Regression
• Statistics -> Fit models -> Linear regression -> pick the response variable (e.g. PLUC.pre) and two or more explanatory variables (e.g. age and LWAS).
• Everything else is the same as for simple regression.

    Call:
    lm(formula = PLUC.pre ~ age + LWAS, data = data)

    Residuals:
        Min      1Q  Median      3Q     Max
    -5.9201 -1.5830  0.3182  1.4167  4.0089

    Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
    (Intercept) 12.248701   1.677373   7.302 9.97e-10 ***
    age         -0.004814   0.009321  -0.517    0.607
    LWAS        -0.025224   0.018033  -1.399    0.167
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 2.16 on 57 degrees of freedom
    Multiple R-squared:  0.03926,   Adjusted R-squared:  0.005548
    F-statistic: 1.165 on 2 and 57 DF,  p-value: 0.3194

• Interpretation parallels simple regression: each coefficient is the estimated change in the mean response per unit increase in that variable, with the other explanatory variables held fixed.
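The equivalent console call simply lists the explanatory variables on the right-hand side of the formula, separated by `+`. The sketch below uses simulated data (an assumption, since the class data are not included here) in which the response is generated independently of both predictors, so weak coefficients are expected.

```r
# Simulated stand-in for the class data (hypothetical values)
set.seed(3)
dat <- data.frame(
  age  = runif(60, min = 48, max = 143),
  LWAS = rnorm(60, mean = 82, sd = 15)
)
dat$PLUC.pre <- 10 + rnorm(60, sd = 2)   # generated independently of age and LWAS

# Two explanatory variables: 60 observations - 3 coefficients = 57 residual df
fit2 <- lm(PLUC.pre ~ age + LWAS, data = dat)

summary(fit2)$coefficients   # one row per term: estimate, SE, t value, p-value
confint(fit2)                # 95% confidence intervals for each coefficient
```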
Coefficient of Determination (Multiple R-squared): Terminology
• Sum of Squares (SS): Let $x_1, \dots, x_n$ be n observations. Their sum of squares is $x_1^2 + x_2^2 + \dots + x_n^2$, or in summation notation $\sum x_i^2$. In corrected form, the sum of squares is $\sum (x_i - \bar{x})^2$.
• Degrees of freedom (df): the number of quantities of the given form minus the number of restrictions on them. For example, the corrected SS above uses n quantities of the form $(x_i - \bar{x})$, subject to one constraint, $\sum (x_i - \bar{x}) = 0$. So the df for this SS is $n - 1$.
• Mean Sum of Squares (MSS): the SS divided by its df.
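A small numeric illustration of these three terms, using five arbitrary values chosen only for this example:

```r
x <- c(4, 7, 9, 10, 15)   # arbitrary illustrative values
n <- length(x)

ss_raw       <- sum(x^2)               # uncorrected sum of squares
ss_corrected <- sum((x - mean(x))^2)   # corrected sum of squares (about the mean)
df           <- n - 1                  # n deviations, one constraint
mss          <- ss_corrected / df      # mean sum of squares

sum(x - mean(x))         # the constraint: deviations from the mean sum to zero
all.equal(mss, var(x))   # the MSS of a sample is its sample variance
```

Here the mean is 9, the deviations are -5, -2, 0, 1, 6, the corrected SS is 66, and the MSS is 66 / 4 = 16.5.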
Coefficient of Determination (Multiple R-squared)
• The total variation in the response variable Y is due to (i) the regression on all variables in the model and (ii) the residual (error).
• Total variation of y: SS(y) = SS(Regression) + SS(Residual)
• The coefficient of determination is
  $$R^2 = \frac{SS(\text{Regression})}{SS(y)} = 1 - \frac{SS(\text{Residual})}{SS(y)}$$
Coefficient of Determination (Multiple R-squared)
• $R^2$ lies between 0 and 1.
• $R^2 = 0.8$ implies that 80% of the total variation in the response variable Y is explained by the explanatory variables in the model taken together. That is, the fitted regression model explains 80% of the variance in the response variable.
Coefficient of Determination (Multiple R-squared)
• $R^2$ never decreases when a variable is added to the model, regardless of sample size, so an increase in $R^2$ may reflect only chance variation.
• The adjusted $R^2$, defined as 1 - MSS(Residual)/MSS(Total), accounts for the sample size and the number of variables in the model, and so is less prone to rising through chance variation alone.
Rcmdr: Coefficient of Determination (Multiple R-squared)
• It is part of the multiple regression output.
• For the previous example, the coefficient of determination is 0.039:

    Multiple R-squared:  0.03926,   Adjusted R-squared:  0.005548
    F-statistic: 1.165 on 2 and 57 DF,  p-value: 0.3194

• A multiple $R^2$ of 0.039 implies that the regression of the response on age and LWAS fits very poorly. Together, these two variables explain only about 4% of the total variation in the response variable; the remaining 96% is left unexplained.
• The p-value of 0.3194 points the same way: there is no evidence that these two variables influence the response variable.
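The adjusted R-squared, F-statistic, and overall p-value in this output are tied to the multiple R-squared by standard identities. A sketch of the arithmetic, using only values read off the output (n = 60 is inferred from 57 residual degrees of freedom plus 3 estimated coefficients):

```r
# Values from the multiple regression output (PLUC.pre ~ age + LWAS)
r2 <- 0.03926   # multiple R-squared
n  <- 60        # observations: 57 residual df + 3 estimated coefficients
p  <- 2         # explanatory variables

# Adjusted R-squared = 1 - MSS(Residual) / MSS(Total)
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Overall F statistic written in terms of R-squared, and its p-value
f_stat <- (r2 / p) / ((1 - r2) / (n - p - 1))
p_val  <- pf(f_stat, df1 = p, df2 = n - p - 1, lower.tail = FALSE)

round(c(adj_r2 = adj_r2, f_stat = f_stat, p_val = p_val), 4)
```

Up to rounding of the reported R-squared, these agree with the printed 0.005548, 1.165, and 0.3194.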
Analysis of Variance (ANOVA)
The basic ANOVA situation
• Type of variables: quantitative response, categorical predictors (factors)
• Main question: does the mean of the response variable depend on which group (given by the categorical variable) the individual is in?
• If there is only one categorical variable with only 2 levels (groups): 2-sample t-test
• ANOVA allows for comparing the means of 3 or more groups (levels) of a single factor (one-way ANOVA), or of the groups defined by two or more factors (two-way and higher-order ANOVA).
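The two-group case can be used to see that the 2-sample t-test and one-way ANOVA are the same test. The sketch below uses simulated data (hypothetical values) and the pooled-variance form of the t-test, which is the version that matches ANOVA exactly.

```r
# Simulated two-group data (hypothetical)
set.seed(4)
dat <- data.frame(
  group = factor(rep(c("A", "B"), each = 20)),
  y     = c(rnorm(20, mean = 50, sd = 5), rnorm(20, mean = 54, sd = 5))
)

# Pooled-variance two-sample t-test
t_res <- t.test(y ~ group, data = dat, var.equal = TRUE)

# One-way ANOVA on the same two groups
a_tab <- anova(lm(y ~ group, data = dat))

# With two groups, F equals t squared and the p-values coincide
all.equal(unname(t_res$statistic)^2, a_tab[["F value"]][1])
all.equal(t_res$p.value,             a_tab[["Pr(>F)"]][1])
```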
Experimental Design Terminology
• An Experimental Unit is the entity on which a measurement or an observation is made. For example, subjects are the experimental units in most clinical studies.
• Homogeneous Experimental Units: units that are as uniform as possible on all characteristics that could affect the response.
• A Block is a group of homogeneous experimental units. For example, if an investigator had reason to believe that age might be a significant factor in the effect of a given medication, he might choose to first divide the experimental subjects into age groups, such as under 5 years old, 5-10 years old, and over 10 years old.
Experimental Design Terminology
• A Factor is a controllable independent variable that is being investigated to determine its effect on a response. E.g. treatment group is a factor.
• Factors can be fixed or random:
• Fixed: the factor can take on a discrete number of values, and these are the only values of interest.
• Random: the factor can take on a wide range of values, and one wants to generalize from the specific values studied to all possible values.
• Each specific value of a factor is called a level.
Experimental Design Terminology
• A covariate is an independent variable that is not manipulated by the experimenter but still affects the response. E.g. in many clinical experiments, demographic variables such as race, gender, and age may influence the response variable significantly even though they are not the variables of interest in the study. Such variables are termed covariates.
• Effect is the change in the average response between two factor levels. That is, factor effect = average response at one level - average response at a second level.
Experimental Design Terminology
• Interaction is a joint factor effect in which the effect of one factor depends on the levels of the other factors.
[Figure panels: no interaction effect of factors A and B; interaction effect of factors A and B]
Experimental Design Terminology
• Randomization is the process of assigning experimental units randomly to different experimental groups.
• It is the most reliable method of creating homogeneous treatment groups, without involving potential biases or judgments.
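A minimal sketch of simple randomization in R, assuming twelve hypothetical subjects and two equally sized arms (the IDs and arm names are invented for illustration):

```r
set.seed(5)                                       # for a reproducible allocation
subjects <- paste0("S", 1:12)                     # hypothetical subject IDs
arms     <- rep(c("treatment", "control"), each = 6)

# A random permutation of the arm labels gives each subject a random assignment
allocation <- data.frame(subject = subjects, arm = sample(arms))

table(allocation$arm)   # 6 subjects per arm by construction
```

Permuting a fixed set of labels, rather than drawing each label independently, keeps the group sizes balanced.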
Experimental Design Terminology
• A Replication is the repetition of an entire experiment or a portion of an experiment under two or more sets of conditions.
• Although randomization helps to ensure that treatment groups are as similar as possible, the results of a single experiment applied to a small number of experimental units will not convince anyone of the effectiveness of the treatment.
• To establish the significance of an experimental result, replication (the repetition of an experiment on a large group of subjects) is required.
• If a treatment is truly effective, the average effect over the replicated experimental units will reflect it.
Experimental Design Terminology
• Replication (contd.)
• If the treatment is not effective, then the few experimental units that may have reacted to it will be outweighed by the large number of subjects who were unaffected by it.
• Replication reduces variability in experimental results and increases the confidence with which a researcher can draw conclusions about an experimental factor.
Experimental Design Terminology
• The analysis of variance (ANOVA) is a technique for decomposing the total variability of a response variable into:
• Variability due to the experimental factor(s), and
• Variability due to error (i.e., factors that are not accounted for in the experimental design).
• The basic purpose of ANOVA is to test the equality of several means.
• A fixed effect model includes only fixed factors.
• A random effect model includes only random factors.
• A mixed effect model includes both fixed and random factors.
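The decomposition described above can be computed by hand for a one-way layout and compared with R's own ANOVA table. The data are simulated (hypothetical values) for three groups of ten.

```r
# Simulated one-way layout with three groups (hypothetical data)
set.seed(6)
dat <- data.frame(
  group = factor(rep(c("g1", "g2", "g3"), each = 10)),
  y     = rnorm(30, mean = rep(c(10, 12, 15), each = 10), sd = 2)
)

grand_mean  <- mean(dat$y)
group_means <- ave(dat$y, dat$group)   # each observation's own group mean

ss_total   <- sum((dat$y - grand_mean)^2)
ss_between <- sum((group_means - grand_mean)^2)   # variability due to the factor
ss_within  <- sum((dat$y - group_means)^2)        # variability due to error

k <- nlevels(dat$group)   # number of groups
n <- nrow(dat)            # number of observations
f_stat <- (ss_between / (k - 1)) / (ss_within / (n - k))

all.equal(ss_total, ss_between + ss_within)
all.equal(f_stat, anova(lm(y ~ group, data = dat))[["F value"]][1])
```

The F statistic is the ratio of the two mean sums of squares, and the ANOVA p-value is read from the F distribution with k - 1 and n - k degrees of freedom.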
Investigation
• Graphical investigation:
• side-by-side box plots
• multiple histograms
• Whether the differences between the groups are significant depends on:
• the difference in the means
• the standard deviations of each group
• the sample sizes
• A useful website (thanks to Betty for sending it to the class): www.psych.utah.edu/stat/introstats/anovaflash.html
• ANOVA determines the p-value from the F statistic.
Informal Investigation: Side by Side Boxplots
Thank you