- 76 Views
- Uploaded on
- Presentation posted in: General

Canadian Bioinformatics Workshops

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Canadian Bioinformatics Workshops

www.bioinformatics.ca

Module #: Title of Module

2

Lecture 4Multivariate Analyses I: Specific Models

MBP1010

Dr. Paul C. Boutros

Winter 2014

†

Aegeus, King of Athens, consulting the Delphic Oracle. High Classical (~430 BCE)

DEPARTMENT OF

MEDICAL BIOPHYSICS

This workshop includes material

originally developed by Drs. Raphael Gottardo,

Sohrab Shah, Boris Steipe and others

†

- Lecture 1: What is Statistics? Introduction to R
- Lecture 2: Univariate Analyses I: continuous
- Lecture 3: Univariate Analyses II: discrete
- Lecture 4: Multivariate Analyses I: specialized models
- Lecture 5: Multivariate Analyses II: general models
- Lecture 6: Sequence Analysis
- Lecture 7: Microarray Analysis I: Pre-Processing
- Lecture 8: Microarray Analysis II: Multiple-Testing
- Lecture 9: Machine-Learning
- Final Exam (written)

- 9% Participation: 1% per week
- 56% Assignments: 8 x 7% each
- 35% Final Examination: in-class
- Each individual will get their own, unique assignment
- Assignments will all be in R, and will be graded according to computational correctness only (i.e. does your R script yield the correct result when run)
- Final Exam will include multiple-choice and written answers

- Website will have up to date information, lecture notes, sample source-code from class, etc.
- http://medbio.utoronto.ca/students/courses/mbp1010/mbp_1010.html

- Tutorials are Thursdays 13:00-15:00 in 4-204 TMDT
- New TA (focusing on bioinformatics component) will be Irakli(Erik) Dzneladze
- Assignment #1 is released today, due on January 30
- Assignment #2 will be released on February 3, due Feb 10
- Updated course-schedule on website

- Cell phones to silent
- No side conversations
- Hands up for questions

- Review to date
- Cervix Cancer Genomes
- Attendance
- Multivariate analyses

All MBP Students = Population

MBP Students in 1010 = Sample

Populationvs.Sample

How do you report statistical information?

P-value, variance, effect-size, sample-size, test

Why don’t we useExcel/spreadsheets?

Input errors, reproducibility, wrong results

No gaps on the number-line

Define discrete data

What is the central limit theorem?

A random variable that is the sum of many small random variables is normally distributed

Theoretical vs. empiricalquantiles

Probability vs. percentage of values less than p

Components of a boxplot?

25% - 1.5 IQR, 25%, 50%, 75%, 75% + 1.5 IQR

Compares two samples or a sample and a distribution. Straight line indicates identity.

How can you interpret a QQ plot?

What is hypothesis testing?

Confirmatory data-analysis; test null hypothesis

What is a p-value?

Evidence against null; probability of FP, probability of seeing as extreme a value by chance alone

Parametric tests have distributional assumptions

Parametric vs. non-parametric tests

What is the t-statistics?

Signal:Noise ratio

Assumptions of the t-test?

Data sampled from normal distribution; independence of replicates; independence of groups; homoscedasticity

Flow-Chart For Two-Sample Tests

Is Data Sampled From a Normally-Distributed Population?

Yes

No

Sufficient n for CLT (>30)?

Equal Variance

(F-Test)?

Yes

Yes

No

No

Heteroscedastic

T-Test

Homoscedastic

T-Test

Wilcoxon

U-Test

Parametric tests have distributional assumptions

Parametric vs. non-parametric tests

What is the t-statistics?

Signal:Noise ratio

Assumptions of the t-test?

Data sampled from normal distribution; independence of replicates; independence of groups; homoscedasticity

Probability a test will incorrect reject the null

AKA sensitivity or 1- false-negative rate

What is statistical power?

What is a correlation?

A relationship between two (random) variables

Common correlation metrics?

Pearson, Spearman, Kendall

An endogenous RNA that “soaks up” miRNAs to prevent their activity on another endogenous RNA

What is a ceRNA?

What are the major univariate discrete tests?

Pearson’s Chi-Squared, Fisher’s Exact, Proportion, Hypergeometric

Common correlation metrics?

Pearson, Spearman, Kendall

- Hypergeometric test
- Is a sample randomly selected from a fixed population?

- Proportion test
- Are two proportions equivalent?

- Fisher’s Exact test
- Are two binary classifications associated?

- (Pearson’s) Chi-Squared Test
- Are paired observations on two variables independent?

- Ubiquitous in modern biology
- Every class I will show a use of statistics in a (very, very) recent Nature paper.

Advance Online Publication

- Diesease burden increasing
- (~380k to ~450k in the last 30 years)

- By age 50, >80% of women have HPV infection
- >75% of sexually active women exposed, only a subset affected
- Why is nearly totally unknown!

- Tightly Associated with Poverty

- Cervix>99%
- Anal~85%
- Vaginal~70%
- Vulvar~40%
- Penile~45%
- Head & Neck~20-30%

Of course not all of these are the HPV subtypes caught by current vaccines, but a majority are. Thus many cancers are preventable.

- Lots of information in Supplementary, but in large part citations to previous work
- Main text, mutation rate vs. histology compared using Wilcoxon
- Reported p-value, sample-size, effect-size

- They did incredibly good reporting in supplementary for example...

Attendance Break

Problem

- When we measure more one than one variable for each member of a population, a scatter plot may show us that the values are not completely independent: there is e.g. a trend for one variable to increase as the other increases.
- Regression analyses the dependence.
- Examples:
- Height vs. weight
- Gene dosage vs.expression level
- Survival analysis:probability of death vs. age

Correlation

When one variable depends on the other, the variables are to some degree correlated.

(Note: correlation need not imply causality.)

In R, the function cov() measures covariance and cor() measures the Pearson coefficient of correlation (a normalized measure of covariance).

Pearson's coeffecient of correlation values rangefrom -1 to 1, with 0 indicating no correlation.

Modeling

Linear regression is one possible model we can apply to data analysis.

A model in the statistician's sense might not be what you think ... it is merely a device to explain data. While it may help you consider mechanisms and causalities, it is not necessarily a representation of any particular physical or biological mechanism.

Note: correlation does not entail causation.

Specify Model

Estimate Parameters

No

Adequate?

Yes

Use Model

Linear regression assumes a particular model:

xiis the independent variable. Depending on the context, also known as a "predictor variable," "regressor," "controlled variable," "manipulated variable," "explanatory variable," "exposure variable," and/or "input variable."

yiis the dependent variable, also known as "response variable," "regressand," "measured variable," "observed variable," "responding variable," "explained variable," "outcome variable," "experimental variable," and/or "output variable."

i are "errors" - not in the sense of being "wrong", but in the sense of creating deviations from the idealized model. The i are assumed to be independent and N(0,2) (normally distributed), they can also be called residuals.

This model has two parameters: the regression coefficient , and the intercept .

Assumptions:

Only two variables are of interest

One variable is a response and one a predictor

No adjustment is needed for confounding or other between-subject variation

Linearity

σ2 is constant, independent of x

i are independent of each other

For proper statistical inference (CI, p-values), i are normal distributed

- Linear regression analysis includes:
- estimation of the parameters;
- characterization how good the model is.

Parameter estimation: choose parameters that come as close as possible to the "true" values.

Problem: how do we distinguish "good" from "poor" estimates?

One possibility: minimize the Sum of Squared Errors SSE

In a general sense, for a sample

and a model M,

For a linear model, with the estimated parameters a, b

Estimation: choose parameters a, b so that the SSE is as small as possible. We call these: least squares estimates.

This method of least squares has an analytic solution for the linear case.

Linear regression example: height vs. weight

synthHWsamples <- function(n) {

set.seed(83)

# parameters for a height vs. weight plot

hmin <- 1.5

hmax <- 2.3

M <- matrix(nrow=n,ncol=2)

# generate a column of numbers in the interval

M[,1] <- runif(n, hmin, hmax)

# generate a column of numbers with a linear model

M[,2] <- 40 * M[,1] + 1

# add some errors

M[,2] <- M[,2] + rnorm(n, 0, 15)

return(M)

}

Under the parmeters used above, a linear regression analysis should show a slope of 40kg/m and an intercept of 40 kg.

It is always a good idea to sanity-test your analysis with synthetic data. After all: if you can't retrieve your model parameters from synthetic data, how could you trust your analysis of real data?

Linear regression example: height vs. weight

> HW<-synthHWsamples(50)

> plot(HW, xlab="Height (cm)", ylab="Weight (kg)")

> cov(HW[,1], HW[,2])

[1] 2.498929

> cor(HW[,1], HW[,2])

[1] 0.5408063

Pearson's Coefficient of Correlation

How to interpret the correlation coefficient:

Explore varying degrees of randomness ...

> x<-rnorm(50)

> r <- 0.99;

> y <- (r * x) + ((1-r) * rnorm(50));

> plot(x,y); cor(x,y)

[1] 0.9999666

Pearson's Coefficient of Correlation

Varying degrees of randomness ...

> x<-rnorm(50)

> r <- 0.8;

> y <- (r * x) + ((1-r) * rnorm(50));

> plot(x,y); cor(x,y)

[1] 0.9661111

Linear regression example: height vs. weight

Estimate a linear model

to recover parameters:

> lm(HW[,2] ~ HW[,1])

Call:

lm(formula = HW[, 2] ~ HW[, 1])

Coefficients:

(Intercept) dat[, 1]

-2.86 42.09

> abline(-2.86, 42.09)

or:

abline(lm(HW[,2] ~ HW[,1]), col=rgb(192/255, 80/255, 77/255), lwd=3)

Linear regression example: height vs. weight

Extract information:

> summary(lm(HW[,2] ~ HW[,1]))

Call:

lm(formula = HW[, 2] ~ HW[, 1])

Residuals:

Min 1Q Median 3Q Max

-36.490 -10.297 3.426 9.156 37.385

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) -2.860 18.304 -0.156 0.876

HW[, 1] 42.090 9.449 4.454 5.02e-05 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘’ 1

Residual standard error: 16.12 on 48 degrees of freedom

Multiple R-squared: 0.2925,Adjusted R-squared: 0.2777

F-statistic: 19.84 on 1 and 38 DF, p-value: 5.022e-05

Linear regression example: height vs. weight

> summary(lm(HW[,2] ~ HW[,1]))

Call:

lm(formula = HW[, 2] ~ HW[, 1])

Residuals:

Min 1Q Median 3Q Max

-36.490 -10.297 3.426 9.156 37.385

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) -2.860 18.304 -0.156 0.876

HW[, 1] 42.090 9.449 4.454 5.02e-05 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘’ 1

Residual standard error: 16.12 on 48 degrees of freedom

Multiple R-squared: 0.2925,Adjusted R-squared: 0.2777

F-statistic: 19.84 on 1 and 38 DF, p-value: 5.022e-05

Linear regression example: height vs. weight

Extract information:

> summary(lm(HW[,2] ~ HW[,1]))

Call:

lm(formula = HW[, 2] ~ HW[, 1])

Residuals:

Min 1Q Median 3Q Max

-36.490 -10.297 3.426 9.156 37.385

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) -2.860 18.304 -0.156 0.876

HW[, 1] 42.090 9.449 4.454 5.02e-05 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘’ 1

Residual standard error: 16.12 on 48 degrees of freedom

Multiple R-squared: 0.2925,Adjusted R-squared: 0.2777

F-statistic: 19.84 on 1 and 38 DF, p-value: 5.022e-05

Linear regression example: height vs. weight

Extract information:

> summary(lm(HW[,2] ~ HW[,1]))

Call:

lm(formula = HW[, 2] ~ HW[, 1])

Residuals:

Min 1Q Median 3Q Max

-36.490 -10.297 3.426 9.156 37.385

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) -2.860 18.304 -0.156 0.876

HW[, 1] 42.090 9.449 4.454 5.02e-05 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘’ 1

Residual standard error: 16.12 on 48 degrees of freedom

Multiple R-squared: 0.2925,Adjusted R-squared: 0.2777

F-statistic: 19.84 on 1 and 38 DF, p-value: 5.022e-05

Linear regression example: height vs. weight

Extract information:

> summary(lm(HW[,2] ~ HW[,1]))

Call:

lm(formula = HW[, 2] ~ HW[, 1])

Residuals:

Min 1Q Median 3Q Max

-36.490 -10.297 3.426 9.156 37.385

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) -2.860 18.304 -0.156 0.876

HW[, 1] 42.090 9.449 4.454 5.02e-05 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘’ 1

Residual standard error: 16.12 on 48 degrees of freedom

Multiple R-squared: 0.2925,Adjusted R-squared: 0.2777

F-statistic: 19.84 on 1 and 38 DF, p-value: 5.022e-05

Intepreting the results has two parts:

1: Is the model adequate? (Residuals)

2: Are the parameter estimates good? (Confidence limits)

Residuals:

The solid red line is the least-squares-fit line (regression line), defined by a particular slope and intercept. The red lines between the regression line and the actual data points are the residuals. Residuals are "signed" i.e. negative if an observation is smaller than the corresponding value of the regression line.

- Residual plots allow us to validate underlying assumptions:
- Relationship between response and regressor should be linear (at least approximately).
- Error term, should have zero mean
- Error term, should have constant variance
- Errors should be normally distributed (required for tests and intervals)

Source: Montgomery et al., 2001, Introduction to Linear Regression Analysis

Check constant variance and linearity, and look for potential outliers.

What does our synthetic data look like, regarding this aspect?

Linear regression example: height vs. weight

Get residuals:

res <- resid(lm(HW[,2] ~ HW[,1]))

Get idealized values:

fit <- fitted(lm(HW[,2] ~ HW[,1]))

Plot differences:

segments(HW[,1], HW[,2], HW[,1], fit, col=2)

Linear regression example: height vs. weight

fit vs. residuals

> plot(fit, res)

> cor(fit, res)

[1] -1.09228e-16

Adequate

Inadequate

Inadequate

Residuals vs. similarly distributed normal deviates check the normality assumption

Inadequate

Inadequate

Source: Montgomery et al., 2001, Introduction to Linear Regression Analysis

Cumulative probability

of normal distribution

Linear regression example: height vs. weight

Q-Q plot: are the residuals normally distributed?

qqnorm(res)

If the model is valid, i.e. nothing terrible in the residuals, we can use it to predict. But how good is the prediction?

Linear regression example: height vs. weight

prediction and confidence limits

> pp<-predict(lm(HW[,2] ~ HW[,1]), int="p")

Warning message:

In predict.lm(lm(HW[, 2] ~ HW[, 1]), int = "p") :

Predictions on current data refer to _future_ responses

> pc<-predict(lm(HW[,2] ~ HW[,1]), int="c")

> head(pc)

fit lwr upr

1 60.57098 51.45048 69.69148

2 67.98277 61.53194 74.43360

3 77.96070 73.37784 82.54356

4 92.04435 84.23698 99.85171

5 76.34929 71.70340 80.99518

6 76.57656 71.94643 81.20670

Linear regression example: height vs. weight

Plot pp and pc limits

a: sort on x

o <- order(HW[,1])

HW2 <- HW[o,]

b: recompute pp, pc

pc<-predict(lm(HW2[,2] ~ HW2[,1]), int="c")

pp<-predict(lm(HW2[,2] ~ HW2[,1]), int="p")

c: plot

> plot(HW2, xlab="Height (cm)", ylab="Weight (kg)", ylim=range(HW2[,2], pp))

> matlines (HW2[,1], pc, lty=c(1,2,2), col="black")

> matlines (HW2[,1], pp, lty=c(1,3,3), col="red")

Linear regression example: height vs. weight

Plot pp and pc limits

prediction interval (at p=0.95)

(values)

a: sort on x

o <- order(HW[,1])

HW2 <- HW[o,]

b: recompute pp, pc

confidence interval(at p=0.95)

(parameters)

pc<-predict(lm(HW2[,2] ~ HW2[,1]), int="c")

pp<-predict(lm(HW2[,2] ~ HW2[,1]), int="p")

c: plot

> plot(HW2, xlab="Height (cm)", ylab="Weight (kg)", ylim=range(HW2[,2], pp))

> matlines (HW2[,1], pc, lty=c(1,2,2), col="black")

> matlines (HW2[,1], pp, lty=c(1,3,3), col="red")

Lots of Analyses Are Linear Regressions

Y = a0 + a1x1

x1 continuous

Linear Regression

Y factorial

Y = a0 + a1x1

Logistic Regression

x1 factorial

Y = a0 + a1x1

1-way ANOVA

One-Way ANOVAs in R

# have a list of groups

x <- as.factor(rep(c(‘A’,‘B’,C’), 3));

# and some continuous data

y <- rnorm(9);

# fit a one-way anova with:

tmp <- aov(y ~ x);

# get a p-value with:

summary(tmp);

- Lecture 1: What is Statistics? Introduction to R
- Lecture 2: Univariate Analyses I: continuous
- Lecture 3: Univariate Analyses II: discrete
- Lecture 4: Multivariate Analyses I: specialized models
- Lecture 5: Multivariate Analyses II: general models
- Lecture 6: Sequence Analysis
- Lecture 7: Microarray Analysis I: Pre-Processing
- Lecture 8: Microarray Analysis II: Multiple-Testing
- Lecture 9: Machine-Learning
- Final Exam (written)