1. BMS 617 Lecture 10 – Correlation and linear regression: Introduction to statistical models
Marshall University Genomics Core Facility

2. Correlation
• Correlation describes the propensity for one variable to vary in the same (or opposite) way as another variable
• Example (from Motulsky): Borkman et al. measured insulin sensitivity and the fraction of polyunsaturated fatty acids with between 20 and 22 carbon atoms (%C20-22) in 13 healthy men
• Both variables show a degree of variation

3. Scatterplot
Scatterplot of insulin sensitivity against %C20-22:
• The plot seems to show a relationship, or correlation, between the variables
• The higher the %C20-22, the higher the insulin sensitivity

4. Correlation Coefficient
• The correlation coefficient between two sets of values $x_1, \ldots, x_n$ and $y_1, \ldots, y_n$ is computed as follows:
  • Calculate the standardized values of x and y: $z_{x,i} = (x_i - \bar{x})/s_x$ and $z_{y,i} = (y_i - \bar{y})/s_y$, where $s_x$ and $s_y$ are the sample standard deviations
  • Compute the products of the standardized values, add them up, and divide by $n-1$: $r = \frac{1}{n-1} \sum_{i=1}^{n} z_{x,i} z_{y,i}$
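As a concrete illustration, here is a minimal sketch of this calculation in Python, assuming NumPy; the function name is illustrative, not from the lecture:

```python
import numpy as np

def correlation_coefficient(x, y):
    """Pearson r via standardized scores, as defined above."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    zx = (x - x.mean()) / x.std(ddof=1)  # standardized x (sample SD)
    zy = (y - y.mean()) / y.std(ddof=1)  # standardized y
    return np.sum(zx * zy) / (n - 1)
```

For any pair of equal-length arrays, this agrees with the library routine `np.corrcoef(x, y)[0, 1]`.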

5. Why the correlation coefficient works
• If a value is bigger than the mean, its standardized score is positive; otherwise its standardized score is negative
• The product of two standardized scores is positive if both scores are positive or both are negative, i.e. if both values are above the mean or both are below it
• So if one variable tends to increase when the other increases, the bulk of the products of standardized scores will be positive, and the correlation coefficient will be high
• On the other hand, if one variable tends to decrease when the other increases, the bulk of the products will be negative, and the correlation coefficient will be negative
• If there is no relationship, the standardized scores will be randomly distributed, and their products will tend to cancel out

6. Correlation coefficient for the insulin sensitivity data
• The correlation coefficient for the insulin sensitivity data is r = 0.77
• The square of this value is r² = 0.59
• r² is always between 0 and 1
• r² is easier to interpret than r: 59% of the variation in insulin sensitivity can be "explained" by the variation in %C20-22. We will make this more precise later

7. Confidence Intervals for Correlation Coefficients
• Most statistical software will compute a confidence interval for a correlation coefficient
• The 95% confidence interval for these data is [0.38, 0.93]
• We are 95% confident the interval from 0.38 to 0.93 includes the true correlation coefficient for insulin sensitivity and %C20-22 fatty acid content
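Software typically obtains this interval with the Fisher z-transformation; here is a sketch of that standard approach, assuming NumPy and SciPy (a generic method, not necessarily the exact routine behind the slide's numbers):

```python
import numpy as np
from scipy import stats

def correlation_ci(r, n, level=0.95):
    """Approximate CI for Pearson r via the Fisher z-transformation."""
    z = np.arctanh(r)                       # Fisher transform of r
    se = 1.0 / np.sqrt(n - 3)               # standard error on the z scale
    zcrit = stats.norm.ppf(1 - (1 - level) / 2)
    lo, hi = z - zcrit * se, z + zcrit * se
    return np.tanh(lo), np.tanh(hi)         # back-transform to the r scale

print(correlation_ci(0.77, 13))  # roughly (0.38, 0.93)
```

Running this with r = 0.77 and n = 13 reproduces, approximately, the interval [0.38, 0.93] quoted above.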

8. GRHL2 and Epithelial-Mesenchymal Transition
• Epithelial-Mesenchymal Transition (EMT) is a process cancer cells must undergo before metastasis can occur
• Mani et al. (Cell 2008;133:704-15) published a gene signature for cells which have undergone EMT: the relative expression of a set of 251 genes indicative of EMT
• Cieply et al. (Cancer Research, 2012) attempted to induce EMT in GRHL2-overexpressing cells and profiled the resulting gene expression by microarray
• They hypothesized that GRHL2 would suppress EMT
• They compared expression of the core EMT genes in their assay to that of Mani et al.

9. Expression of Core EMT genes in GRHL2-overexpressed cells
• Expression patterns show a strong negative correlation
• Suggests that GRHL2 has suppressed EMT

10. p-values for correlation coefficients
• It is possible to compute a p-value for a correlation coefficient
• The null hypothesis is that there is no correlation, i.e. that the true correlation coefficient is zero
• So the p-value is the probability of getting a correlation coefficient at least as extreme as the one observed, from a random sample of the same size as the one used, assuming there is no correlation in the population
• Note that with large samples, p-values for correlation coefficients tend to be very small
  • For the insulin sensitivity example (n = 13), p = 0.0021
  • For the GRHL2-EMT example (n = 216), p < 10⁻¹⁶
• It is important to look at the r or r² value to determine whether the result is of biological importance
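In practice, r and its p-value come from a single library call; a minimal sketch with SciPy, using simulated stand-in data (the real measurements are not reproduced in the slides):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=13)               # placeholder for %C20-22 values
y = 0.8 * x + rng.normal(size=13)     # placeholder for insulin sensitivity

r, p = stats.pearsonr(x, y)           # two-sided test of the null r = 0
print(r, p)
```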

11. Correlation and Causality
• A very common error is to assume that correlation implies causality
• In the insulin sensitivity example, it would be wrong to conclude from the correlation alone that high lipid content caused high insulin sensitivity
• The possible reasons for the correlation in this example are:
  • Lipid content determines insulin sensitivity
  • Insulin sensitivity determines lipid content
  • Both lipid content and insulin sensitivity are determined by a common factor
  • There is a complex network of interacting factors of which lipid content and insulin sensitivity are two components
  • It is a coincidence
• The p-value tells you how rare such a coincidence would be, under the null hypothesis
• To decide among the other possibilities, further experimentation is needed

12. Correlation and Causality in the Examples
• In the first example (insulin sensitivity), the investigators performed further experiments in which they manipulated the variables
  • They concluded that lipid content determined insulin sensitivity (to some extent)
• In the second example, the data come from the same genes under different sets of conditions
  • There is no direct mechanism for the expression under one condition to affect the expression under another condition
• In the first example, it makes sense to investigate the nature of the influence of lipid content on insulin sensitivity further

13. Simple Linear Regression
• Correlation asks the question "To what extent is there a linear relationship between two variables?"
• Linear regression asks the question "What is the linear relationship between two variables?"
• Correlation is symmetric:
  • The correlation coefficient between x and y is the same as the correlation coefficient between y and x
• Linear regression is not symmetric:
  • One variable must be designated as independent and one as dependent
  • It assumes a model of causality
  • Switching the roles of the independent and dependent variables will produce different results

14. What does linear regression do?
• Linear regression calculates the straight line that gives the best prediction of the y values from the x values
• It finds the values of a and b in the equation y = a + bx to do this
• This is done by minimizing the sum of the squares of the vertical distances from each point to the line (see the sketch below)
• Note that:
  • The roles of x and y are predetermined, and affect the result
  • We can only estimate a and b based on our data sample; we cannot know the true population values for a and b
  • It is usually helpful to calculate a confidence interval for these
• Does it make sense to perform linear regression on the insulin sensitivity data? On the GRHL2-EMT expression data?
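A minimal least-squares fit in Python, assuming SciPy; the data arrays are hypothetical placeholders, not the study's measurements:

```python
import numpy as np
from scipy import stats

x = np.array([17.6, 18.9, 20.2, 21.3, 22.1])  # hypothetical %C20-22 values
y = np.array([250., 310., 300., 410., 420.])  # hypothetical insulin sensitivities

fit = stats.linregress(x, y)  # minimizes the sum of squared vertical distances
print(f"slope b = {fit.slope:.2f}, intercept a = {fit.intercept:.2f}")
print(f"r^2 = {fit.rvalue**2:.3f}, p = {fit.pvalue:.4f}")
```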

15. Linear Regression for Insulin Sensitivity
[figure: scatterplot of insulin sensitivity against %C20-22 with the fitted line]

16. Linear Regression results for Insulin Sensitivity
[table of regression output, interpreted on the next slide]

17. Interpreting linear regression results
• The best-fit values show the slope and intercepts of the line, along with their standard errors
  • The estimate of the slope is 37.21, with standard error 9.3
  • For each 1% increase in the percentage of polyunsaturated fatty acids with 20-22 carbon atoms, the insulin sensitivity increases on average by 37.21 mg/m²/min
• The 95% confidence interval for the slope ranges from 16.75 to 57.67
  • This is easier to interpret than the standard error: we are 95% confident the range 16.75 to 57.67 includes the true value of the slope
• The intercepts give the value of the insulin sensitivity when %C20-22 is 0, and the value of %C20-22 that would yield an insulin sensitivity of 0
  • Are these meaningful?
• The R² value is 0.5929. This means that 59% of the variance in insulin sensitivity can be accounted for by the variation in C20-22 polyunsaturated fatty acids; the remaining 41% is the result of other factors
• We will discuss R² in more detail later
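The confidence interval for the slope can be recovered from the slope's standard error using a t critical value on n - 2 degrees of freedom; a sketch, reusing the hypothetical data from the earlier fit:

```python
import numpy as np
from scipy import stats

x = np.array([17.6, 18.9, 20.2, 21.3, 22.1])  # hypothetical data, as before
y = np.array([250., 310., 300., 410., 420.])
fit = stats.linregress(x, y)

n = len(x)
tcrit = stats.t.ppf(0.975, df=n - 2)  # two-sided 95%, n - 2 degrees of freedom
ci = (fit.slope - tcrit * fit.stderr, fit.slope + tcrit * fit.stderr)
print(f"95% CI for slope: {ci[0]:.2f} to {ci[1]:.2f}")
```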

18. p-value for linear regression
• The linear regression results give a p-value of 0.0021
• To interpret this, we need to know the null hypothesis
• The null hypothesis is that there really is no linear relationship between insulin sensitivity and %C20-22
  • If this were true, the best-fit line would have a slope of zero
• If the null hypothesis were true, the chance of seeing a best-fit line with a slope at least this steep would be 0.21%
• Note that the null hypothesis for correlation is essentially equivalent to the null hypothesis for linear regression, hence the p-values are equal
• However, the interpretations are different

19. Assumptions for linear regression
• Linear regression is based on the following assumptions:
  • There is a linear relationship between the two quantities
  • The residuals are normally distributed
    • The residuals are the vertical distances of each point from the line: the random scatter
  • The variability is the same all the way along the line
  • Data points are independent
  • The x and y values are measured independently
  • The x values are known precisely
• Be careful of the following:
  • Do not try to interpret the linear regression for values far from the data
  • In our example, the %C20-22 values were all between 17 and 25. The linear regression is unlikely to be meaningful for values far outside this range. In particular, the intercept value (%C20-22 = 0) is likely to be meaningless.

20. Common mistakes with linear regression
• Be careful of the following traps when using linear regression:
  • Not all relationships are linear! If the R² value for linear regression is low, consider the possibility that there may be another relationship between the variables
  • Don't use linear regression on smoothed data: this violates the assumption that data points are independent
  • Don't use linear regression if y is (partly) calculated from x
    • For example, if y is the change in a measurement before and after treatment, and x is the value before treatment
    • This violates the assumption that x and y are measured independently
  • Always carefully consider which variable is x and which is y; if you can't decide, you probably shouldn't be using regression
  • Always plot the data

21. Summary: Correlation
• Correlation determines the extent to which two variables share a linear relationship
• It makes no assumptions and draws no conclusions about causality
• The correlation coefficient is between -1 and 1, with ±1 being a perfect linear relationship
• The square of the correlation coefficient is the proportion of variability in one variable which is "explained by" the variability in the other variable

22. Summary: Linear Regression
• Linear regression provides the best prediction of one variable from another variable, assuming they have a linear relationship
• The causal direction is built into the model
• Results give estimates for two parameters, the intercept and the slope, and confidence intervals for each

23. Introduction to Statistical Models

24. What is a model?
• In general, a model is a (simpler) representation of something else
• We use models to study complex phenomena:
  • Easier to manipulate than the real thing of interest
  • Easier to focus on specific aspects
• E.g. we use mouse models to study human disease:
  • Easier to control the behavior of the mouse
  • Easier to control genetics…

25. What is a mathematical model?
• A mathematical model is an equation (or set of equations) that describes a physical state or process
• It describes how values in the state or process are related to each other
• The aim is not to provide a perfect model:
  • A good model is simple enough to be easy to understand
  • Yet complex enough to be useful

26. Statistical Models
• Statistical models are mathematical models that describe both the ideal predictions and the random "scatter" or "noise"
• They model both the population values and the "random" variation from the population values
• "Random" variation is really just variation not explained or accounted for by the model

27. Model terminology
• A model is an equation (or set of equations)
• The equation defines the outcome, or dependent variable, as a function of:
  • one or more independent variables, and
  • one or more parameters
• Each data point has its own values for the independent and dependent variables
• The values of the parameters are properties of the population: they do not vary from data point to data point

28. Fitting a model to data
• The parameters are properties of the population, so they are unknown
• Typically, we collect a sample of data points
• Assuming the model is correct, we can use the sample to estimate the parameters of the model
• This is called "fitting a model to the data"
• It results in estimates and confidence intervals for each of the parameters

29. Simplest possible model
• The simplest possible model for a data set involves no independent variable!
• Sample values from a population
• Assume the population values follow a Normal distribution
• Our model is Y = μ + ε

30. Average as a model
• In the simple model Y = μ + ε:
  • Y is the dependent variable: a different value for each data point
  • μ is a parameter: the mean of the population, a single unknown value we will estimate from our data
  • ε is the "random error": different for each data point, assumed normally distributed with mean zero
• We can make the roles of the variable types more explicit by writing $Y_i = \mu + \varepsilon_i$
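A quick numerical illustration of this model, assuming NumPy (the values are simulated, not real data):

```python
import numpy as np

rng = np.random.default_rng(42)
mu = 5.0                                        # true (normally unknown) population mean
eps = rng.normal(loc=0.0, scale=2.0, size=100)  # random error, mean zero
y = mu + eps                                    # data generated by the model Y = mu + eps

print(y.mean())  # the sample mean estimates mu; close to 5.0 here
```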

31. Why the mean is important
• If we assume the model is correct:
  • Our data are sampled from a population where the values are some fixed value, plus some scatter that is normally distributed with mean zero
  • Then we want to use our data to estimate μ
• It turns out that the value of μ that makes our observed data the most likely, out of all possible choices of μ, is the mean of our data
• The mean is the maximum likelihood estimate of μ (see the sketch below)
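A small check of this claim, assuming SciPy: evaluate the normal log-likelihood of a simulated sample over a grid of candidate values of μ and confirm that the maximizer sits at the sample mean (up to grid spacing):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
y = rng.normal(loc=5.0, scale=2.0, size=50)   # simulated sample

mus = np.linspace(3.0, 7.0, 2001)             # candidate values of mu
# Log-likelihood of the data for each candidate mu (sigma fixed at the sample SD)
loglik = [stats.norm.logpdf(y, loc=m, scale=y.std(ddof=1)).sum() for m in mus]

print(mus[np.argmax(loglik)], y.mean())       # the two agree (up to grid spacing)
```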

32. A more sophisticated model: linear regression
• Revisit the example from linear regression:
  • Measured insulin sensitivity and %C20-22 content in 13 healthy men
  • Hypothesized that an increase in %C20-22 content caused an increase in insulin sensitivity
• Used linear regression to fit the model Y = intercept + slope × X + scatter to the data
  • Y is the insulin sensitivity, X the %C20-22 content
• In more conventional notation: $Y = \beta_0 + \beta_1 X + \varepsilon$, or $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$

33. Linear regression as a statistical model
• The linear regression model has two parameters:
  • β₀, the intercept
  • β₁, the slope
• These are both properties of the population; we use the data to estimate them
• Fitting uses the method of "least squares" (see the sketch below)
• This gives the maximum likelihood estimate for the two parameters: the values of the parameters that maximize the chance of our data being observed
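For simple linear regression the least-squares estimates have a closed form; a sketch, assuming NumPy (the function name is illustrative):

```python
import numpy as np

def least_squares_fit(x, y):
    """Closed-form least-squares estimates for Y = b0 + b1*X + error."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    return b0, b1
```

Under the normal-error model above, these least-squares values coincide with the maximum likelihood estimates of β₀ and β₁.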

34. Recap of models
• The linear regression in this example gave an estimate of the slope of 37.2, and an estimate of the intercept of -486.5
• Our estimated model is: Insulin sensitivity = 37.2 × %C20-22 - 486.5 + ε
• The model is not assumed to be perfect!
• It is simple, but powerful enough to draw some basic conclusions:
  • Within the range of the data, an increase of one unit in %C20-22 results, on average, in an increase of 37.2 units in insulin sensitivity
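A tiny usage sketch of the fitted model for point prediction; the coefficients are the slide's estimates, while the function name is illustrative:

```python
def predicted_insulin_sensitivity(pct_c20_22):
    """Point prediction from the fitted model (sensible only within ~17-25 %C20-22)."""
    return 37.2 * pct_c20_22 - 486.5

print(predicted_insulin_sensitivity(20.0))  # 37.2 * 20 - 486.5 = 257.5
```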

35. Other types of model
• We will look at other types of model in upcoming lectures:
  • Multiple regression: more than one independent variable
  • Logistic regression: the outcome variable is binary, with one or more independent variables
  • Proportional hazards regression: the outcome variable is a survival time, with one or more independent variables
