
Canadian Bioinformatics Workshops


Presentation Transcript


  1. Canadian Bioinformatics Workshops www.bioinformatics.ca

  2. Module #: Title of Module

  3. Lecture 4: Multivariate Analyses I: Specific Models. MBP1010, Dr. Paul C. Boutros, Winter 2014. DEPARTMENT OF MEDICAL BIOPHYSICS. This workshop includes material originally developed by Drs. Raphael Gottardo, Sohrab Shah, Boris Steipe and others. † Aegeus, King of Athens, consulting the Delphic Oracle. High Classical (~430 BCE)

  4. Course Overview • Lecture 1: What is Statistics? Introduction to R • Lecture 2: Univariate Analyses I: continuous • Lecture 3: Univariate Analyses II: discrete • Lecture 4: Multivariate Analyses I: specialized models • Lecture 5: Multivariate Analyses II: general models • Lecture 6: Sequence Analysis • Lecture 7: Microarray Analysis I: Pre-Processing • Lecture 8: Microarray Analysis II: Multiple-Testing • Lecture 9: Machine-Learning • Final Exam (written)

  5. How Will You Be Graded? • 9% Participation: 1% per week • 56% Assignments: 8 x 7% each • 35% Final Examination: in-class • Each individual will get their own, unique assignment • Assignments will all be in R, and will be graded according to computational correctness only (i.e. does your R script yield the correct result when run) • Final Exam will include multiple-choice and written answers

  6. Course Information Updates • Website will have up-to-date information, lecture notes, sample source-code from class, etc. • http://medbio.utoronto.ca/students/courses/mbp1010/mbp_1010.html • Tutorials are Thursdays 13:00-15:00 in 4-204 TMDT • New TA (focusing on the bioinformatics component) will be Irakli (Erik) Dzneladze • Assignment #1 is released today, due on January 30 • Assignment #2 will be released on February 3, due Feb 10 • Updated course schedule on website

  7. House Rules • Cell phones to silent • No side conversations • Hands up for questions

  8. Topics For This Week • Review to date • Cervix Cancer Genomes • Attendance • Multivariate analyses

  9. Review From Lecture #1 Population vs. sample? All MBP students = population; MBP students in 1010 = sample. How do you report statistical information? P-value, variance, effect-size, sample-size, test. Why don't we use Excel/spreadsheets? Input errors, reproducibility, wrong results.

  10. Review From Lecture #2 Define discrete data? Data with gaps on the number-line (continuous data has no gaps). What is the central limit theorem? A random variable that is the sum of many small random variables is approximately normally distributed. Theoretical vs. empirical quantiles? The probability under a distribution vs. the percentage of observed values less than a given value. Components of a boxplot? Whiskers at Q1 - 1.5 IQR and Q3 + 1.5 IQR; box at the 25%, 50% (median), and 75% quantiles.
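A quick way to see the central limit theorem in action is to sum many small random variables and check the result for normality; a minimal R sketch (all numbers illustrative):

set.seed(42)
# each of 10,000 observations is the sum of 40 small uniform random variables
sums <- replicate(10000, sum(runif(40)))
# the sums look approximately normal even though each summand is uniform
hist(sums, breaks=50)
qqnorm(sums); qqline(sums)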

  11. Review From Lecture #2 How can you interpret a QQ plot? It compares two samples, or a sample and a theoretical distribution; a straight line indicates identity. What is hypothesis testing? Confirmatory data analysis; testing a null hypothesis. What is a p-value? Evidence against the null: the probability of observing a value as extreme by chance alone (i.e. under the null hypothesis).

  12. Review From Lecture #2 Parametric vs. non-parametric tests? Parametric tests have distributional assumptions. What is the t-statistic? A signal-to-noise ratio. Assumptions of the t-test? Data sampled from a normal distribution; independence of replicates; independence of groups; homoscedasticity.

  13. Flow-Chart For Two-Sample Tests Is the data sampled from a normally-distributed population? If yes, or if n is sufficient for the CLT (>30): test for equal variance (F-test), then use the homoscedastic t-test (equal variance) or the heteroscedastic t-test (unequal variance). If no and n is small: use the Wilcoxon U-test.
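As a rough translation of this flow-chart into R (a sketch only: two.sample.test is a hypothetical helper, x and y are the two numeric sample vectors, and using shapiro.test at the 0.05 level as the normality check is an assumption of this sketch, not something the slide specifies):

two.sample.test <- function(x, y) {
    # normality check on each sample (Shapiro-Wilk)
    normal <- shapiro.test(x)$p.value > 0.05 && shapiro.test(y)$p.value > 0.05
    if (normal || (length(x) > 30 && length(y) > 30)) {
        # F-test for equal variances chooses homo- vs. heteroscedastic t-test
        equal.var <- var.test(x, y)$p.value > 0.05
        t.test(x, y, var.equal=equal.var)
    } else {
        # small, non-normal samples: Wilcoxon U-test
        wilcox.test(x, y)
    }
}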

  14. Review From Lecture #2 (repeats slide 12)

  15. Review From Lecture #3 What is statistical power? The probability that a test will correctly reject a false null hypothesis; AKA sensitivity, or 1 - false-negative rate. What is a correlation? A relationship between two (random) variables. Common correlation metrics? Pearson, Spearman, Kendall.
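In R, power calculations for t-test designs are available via power.t.test() in the stats package; for example (all numbers purely illustrative):

# power to detect a 1-SD difference with n=20 per group at alpha=0.05
power.t.test(n=20, delta=1, sd=1, sig.level=0.05)
# or solve for the per-group sample size needed to reach 80% power
power.t.test(power=0.8, delta=1, sd=1, sig.level=0.05)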

  16. Review From Lecture #3 What is a ceRNA? An endogenous RNA that "soaks up" miRNAs, preventing their activity on another endogenous RNA. What are the major univariate discrete tests? Pearson's Chi-Squared, Fisher's Exact, Proportion, Hypergeometric. Common correlation metrics? Pearson, Spearman, Kendall.

  17. Four Main Discrete Univariate Tests • Hypergeometric test • Is a sample randomly selected from a fixed population? • Proportion test • Are two proportions equivalent? • Fisher’s Exact test • Are two binary classifications associated? • (Pearson’s) Chi-Squared Test • Are paired observations on two variables independent?
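Each of these tests has a standard R function; a minimal sketch on toy data (all counts invented for illustration):

# hypergeometric: P(drawing 8 or more white balls in 10 draws
# from an urn with 20 white and 30 black), via the upper tail
phyper(7, m=20, n=30, k=10, lower.tail=FALSE)

# proportion test: are 40/100 and 55/100 equivalent proportions?
prop.test(x=c(40, 55), n=c(100, 100))

# Fisher's exact test on a 2x2 table of two binary classifications
counts <- matrix(c(12, 5, 7, 15), nrow=2)
fisher.test(counts)

# Pearson's chi-squared test of independence on the same table
chisq.test(counts)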

  18. When Do We Use Statistics? • Ubiquitous in modern biology • Every class I will show a use of statistics in a (very, very) recent Nature paper. Advance Online Publication

  19. Cervix Cancer 101 • Disease burden increasing (~380k to ~450k cases in the last 30 years) • By age 50, >80% of women have had HPV infection • >75% of sexually active women exposed, only a subset affected • Why remains nearly totally unknown! • Tightly associated with poverty

  20. HPV Infection Is Associated with Multiple Cancers • Cervix >99% • Anal ~85% • Vaginal ~70% • Vulvar ~40% • Penile ~45% • Head & Neck ~20-30% Not all of these involve the HPV subtypes covered by current vaccines, but a majority do; thus many of these cancers are preventable.

  21. Figure 1 is a Classic Sequencing Figure

  22. What Statistical Analysis Did They Do? • Lots of information in the Supplementary, but in large part citations to previous work • In the main text, mutation rate vs. histology was compared using a Wilcoxon test • Reported p-value, sample-size, effect-size • Their reporting in the Supplementary was exemplary, for example...

  23. Attendance Break

  24. Problem • When we measure more than one variable for each member of a population, a scatter plot may show us that the values are not completely independent: e.g. there is a trend for one variable to increase as the other increases. • Regression analysis models this dependence. • Examples: • Height vs. weight • Gene dosage vs. expression level • Survival analysis: probability of death vs. age

  25. Correlation When one variable depends on the other, the variables are to some degree correlated. (Note: correlation need not imply causality.) In R, the function cov() measures covariance and cor() measures the Pearson coefficient of correlation (a normalized measure of covariance). Pearson's coefficient of correlation ranges from -1 to 1, with 0 indicating no correlation.
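For instance (a tiny illustrative example):

set.seed(1)
x <- rnorm(100)
y <- 2 * x + rnorm(100)        # y depends on x, plus noise
cov(x, y)                      # covariance: scale-dependent
cor(x, y)                      # Pearson correlation: in [-1, 1]
cor(x, y, method="spearman")   # rank-based alternative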

  26. Modeling Linear regression is one possible model we can apply to data analysis. A model in the statistician's sense might not be what you think ... it is merely a device to explain data. While it may help you consider mechanisms and causalities, it is not necessarily a representation of any particular physical or biological mechanism. Note: correlation does not entail causation. [Flow-chart: Specify model > Estimate parameters > Adequate? If no, respecify the model; if yes, use the model.]

  27. Types of regression Linear regression assumes a particular model: y_i = α + β·x_i + ε_i. x_i is the independent variable. Depending on the context, also known as a "predictor variable," "regressor," "controlled variable," "manipulated variable," "explanatory variable," "exposure variable," and/or "input variable." y_i is the dependent variable, also known as "response variable," "regressand," "measured variable," "observed variable," "responding variable," "explained variable," "outcome variable," "experimental variable," and/or "output variable." The ε_i are "errors" - not in the sense of being "wrong", but in the sense of creating deviations from the idealized model. The ε_i are assumed to be independent and N(0, σ²) (normally distributed); they can also be called residuals. This model has two parameters: the regression coefficient β and the intercept α.

  28. Linear regression Assumptions: Only two variables are of interest. One variable is a response and one a predictor. No adjustment is needed for confounding or other between-subject variation. Linearity. σ² is constant, independent of x. The ε_i are independent of each other. For proper statistical inference (CI, p-values), the ε_i are normally distributed.

  29. Linear regression • Linear regression analysis includes: • estimation of the parameters; • characterization of how good the model is.

  30. Linear regression: estimation Parameter estimation: choose parameters that come as close as possible to the "true" values. Problem: how do we distinguish "good" from "poor" estimates? One possibility: minimize the Sum of Squared Errors (SSE). In a general sense, for a sample (x_i, y_i) and a model M: SSE = Σ_i (y_i - M(x_i))².

  31. Linear regression: estimation For a linear model with the estimated parameters a, b: SSE(a, b) = Σ_i (y_i - (a + b·x_i))². Estimation: choose parameters a, b so that the SSE is as small as possible. We call these: least squares estimates. This method of least squares has an analytic solution for the linear case.
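For the simple linear case, the analytic solution is b = cov(x, y) / var(x) and a = mean(y) - b · mean(x); a quick R sketch checking this against lm() (variable names illustrative):

set.seed(7)
x <- runif(50, 1.5, 2.3)
y <- 40 * x + 1 + rnorm(50, 0, 15)
# closed-form least-squares estimates
b <- cov(x, y) / var(x)
a <- mean(y) - b * mean(x)
c(intercept=a, slope=b)
# should match the coefficients from lm()
coef(lm(y ~ x))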

  32. Linear regression example: height vs. weight

synthHWsamples <- function(n) {
    set.seed(83)
    # parameters for a height vs. weight plot
    hmin <- 1.5
    hmax <- 2.3
    M <- matrix(nrow=n, ncol=2)
    # generate a column of heights in the interval [hmin, hmax]
    M[,1] <- runif(n, hmin, hmax)
    # generate a column of weights with a linear model
    M[,2] <- 40 * M[,1] + 1
    # add some errors
    M[,2] <- M[,2] + rnorm(n, 0, 15)
    return(M)
}

Under the parameters used above, a linear regression analysis should show a slope of 40 kg/m and an intercept of 1 kg. It is always a good idea to sanity-test your analysis with synthetic data. After all: if you can't retrieve your model parameters from synthetic data, how could you trust your analysis of real data?

  33. Linear regression example: height vs. weight
> HW <- synthHWsamples(50)
> plot(HW, xlab="Height (m)", ylab="Weight (kg)")
> cov(HW[,1], HW[,2])
[1] 2.498929
> cor(HW[,1], HW[,2])
[1] 0.5408063

  34. Pearson's Coefficient of Correlation How to interpret the correlation coefficient: explore varying degrees of randomness ...
> x <- rnorm(50)
> r <- 0.99
> y <- (r * x) + ((1 - r) * rnorm(50))
> plot(x, y); cor(x, y)
[1] 0.9999666

  35. Pearson's Coefficient of Correlation Varying degrees of randomness ...
> x <- rnorm(50)
> r <- 0.8
> y <- (r * x) + ((1 - r) * rnorm(50))
> plot(x, y); cor(x, y)
[1] 0.9661111

  36. Linear regression example: height vs. weight Estimate a linear model to recover parameters:
> lm(HW[,2] ~ HW[,1])

Call:
lm(formula = HW[, 2] ~ HW[, 1])

Coefficients:
(Intercept)     HW[, 1]
      -2.86       42.09

> abline(-2.86, 42.09)
or: abline(lm(HW[,2] ~ HW[,1]), col=rgb(192/255, 80/255, 77/255), lwd=3)

  37. Linear regression example: height vs. weight Extract information:
> summary(lm(HW[,2] ~ HW[,1]))

Call:
lm(formula = HW[, 2] ~ HW[, 1])

Residuals:
    Min      1Q  Median      3Q     Max
-36.490 -10.297   3.426   9.156  37.385

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   -2.860     18.304  -0.156    0.876
HW[, 1]       42.090      9.449   4.454 5.02e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 16.12 on 48 degrees of freedom
Multiple R-squared: 0.2925, Adjusted R-squared: 0.2777
F-statistic: 19.84 on 1 and 48 DF, p-value: 5.022e-05

  38.-41. Linear regression example: height vs. weight (these four slides repeat the summary(lm(HW[,2] ~ HW[,1])) output shown on slide 37)

  42. Linear regression: quality control Interpreting the results has two parts: 1: Is the model adequate? (Residuals) 2: Are the parameter estimates good? (Confidence limits)

  43. Linear regression: residuals Residuals: The solid red line is the least-squares-fit line (regression line), defined by a particular slope and intercept. The red lines between the regression line and the actual data points are the residuals. Residuals are "signed", i.e. negative if an observation is smaller than the corresponding value of the regression line.

  44. Linear regression: quality control • Residual plots allow us to validate underlying assumptions: • Relationship between response and regressor should be linear (at least approximately). • Error term ε should have zero mean. • Error term ε should have constant variance. • Errors should be normally distributed (required for tests and intervals).
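In R, calling plot() on a fitted lm object produces the standard diagnostic plots for these checks (residuals vs. fitted values, a normal Q-Q plot of the residuals, scale-location, and leverage); a minimal sketch, with HW as generated on slide 33:

model <- lm(HW[,2] ~ HW[,1])
par(mfrow=c(2, 2))   # four diagnostic plots on one device
plot(model)
par(mfrow=c(1, 1))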

  45. Linear regression: quality control Check constant variance and linearity, and look for potential outliers. What does our synthetic data look like in this regard? Source: Montgomery et al., 2001, Introduction to Linear Regression Analysis

  46. Linear regression example: height vs. weight
Get residuals: res <- resid(lm(HW[,2] ~ HW[,1]))
Get idealized values: fit <- fitted(lm(HW[,2] ~ HW[,1]))
Plot differences: segments(HW[,1], HW[,2], HW[,1], fit, col=2)

  47. Linear regression example: height vs. weight fit vs. residuals
> plot(fit, res)
> cor(fit, res)
[1] -1.09228e-16

  48. Linear regression: Q-Q plot Plotting residuals against similarly distributed normal deviates (the cumulative probability of the normal distribution) checks the normality assumption. [Figure: five example Q-Q plots, one adequate and four inadequate. Source: Montgomery et al., 2001, Introduction to Linear Regression Analysis]
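The corresponding check in R, using the residuals res extracted on slide 46:

# normal Q-Q plot of the residuals; points near the line support normality
qqnorm(res)
qqline(res, col=2)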
