Canadian Bioinformatics Workshops

www.bioinformatics.ca


Lecture 4: Multivariate Analyses I: Specific Models

MBP1010

Dr. Paul C. Boutros

Winter 2014

Aegeus, King of Athens, consulting the Delphic Oracle. High Classical (~430 BCE)

Department of Medical Biophysics

This workshop includes material originally developed by Drs. Raphael Gottardo, Sohrab Shah, Boris Steipe and others



Course Overview

  • Lecture 1: What is Statistics? Introduction to R

  • Lecture 2: Univariate Analyses I: continuous

  • Lecture 3: Univariate Analyses II: discrete

  • Lecture 4: Multivariate Analyses I: specialized models

  • Lecture 5: Multivariate Analyses II: general models

  • Lecture 6: Sequence Analysis

  • Lecture 7: Microarray Analysis I: Pre-Processing

  • Lecture 8: Microarray Analysis II: Multiple-Testing

  • Lecture 9: Machine-Learning

  • Final Exam (written)



How Will You Be Graded?

  • 9% Participation: 1% per week

  • 56% Assignments: 8 x 7% each

  • 35% Final Examination: in-class

  • Each individual will get their own, unique assignment

  • Assignments will all be in R, and will be graded according to computational correctness only (i.e. does your R script yield the correct result when run)

  • Final Exam will include multiple-choice and written answers



Course Information Updates

  • Website will have up to date information, lecture notes, sample source-code from class, etc.

    • http://medbio.utoronto.ca/students/courses/mbp1010/mbp_1010.html

  • Tutorials are Thursdays 13:00-15:00 in 4-204 TMDT

  • New TA (focusing on the bioinformatics component) will be Irakli (Erik) Dzneladze

  • Assignment #1 is released today, due on January 30

  • Assignment #2 will be released on February 3, due Feb 10

  • Updated course-schedule on website



House Rules

  • Cell phones to silent

  • No side conversations

  • Hands up for questions



Topics For This Week

  • Review to date

  • Cervix Cancer Genomes

  • Attendance

  • Multivariate analyses



Review From Lecture #1

Population vs. sample?

All MBP students = population; MBP students in 1010 = sample

How do you report statistical information?

P-value, variance, effect-size, sample-size, test

Why don't we use Excel/spreadsheets?

Input errors, reproducibility, wrong results



Review From Lecture #2

Define discrete vs. continuous data

Continuous data has no gaps on the number-line; discrete data does

What is the central limit theorem?

A random variable that is the sum of many small random variables is normally distributed

Theoretical vs. empirical quantiles

Theoretical: probability below a value; empirical: percentage of observed values below it

Components of a boxplot?

Whiskers at Q1 − 1.5×IQR and Q3 + 1.5×IQR; box at the 25%, 50% (median), and 75% quantiles
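
A quick R sketch of the central limit theorem mentioned above (an added illustration, not from the original slides):

# each value is a sum of 40 small uniform random variables
set.seed(1)
sums <- replicate(10000, sum(runif(40)))
hist(sums, breaks = 50)     # approximately bell-shaped
qqnorm(sums); qqline(sums)  # nearly straight line: approximately normal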



Review From Lecture #2

How can you interpret a QQ plot?

Compares two samples, or a sample and a distribution. A straight line indicates identity.

What is hypothesis testing?

Confirmatory data-analysis; test null hypothesis

What is a p-value?

Evidence against the null; the probability of a false positive; the probability of seeing as extreme a value by chance alone



Review From Lecture #2

Parametric vs. non-parametric tests?

Parametric tests have distributional assumptions

What is the t-statistic?

Signal:Noise ratio

Assumptions of the t-test?

Data sampled from normal distribution; independence of replicates; independence of groups; homoscedasticity



Flow-Chart For Two-Sample Tests

Is the data sampled from a normally-distributed population?

  • Yes → Equal variance (F-test)?

    • Yes → homoscedastic t-test

    • No → heteroscedastic t-test

  • No → Sufficient n for CLT (>30)?

    • Yes → t-test, as above

    • No → Wilcoxon U-test
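
A sketch of this decision flow in R (standard base-R functions; the two samples here are made up for illustration):

# two made-up samples
a <- rnorm(25, mean = 5)
b <- rnorm(25, mean = 6)
shapiro.test(a)                 # one way to check normality
var.test(a, b)                  # F-test for equal variances
t.test(a, b, var.equal = TRUE)  # homoscedastic t-test
t.test(a, b)                    # heteroscedastic (Welch) t-test; R's default
wilcox.test(a, b)               # Wilcoxon U-test for the non-normal, small-n branch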





Review From Lecture #3

What is statistical power?

The probability that a test will correctly reject a false null hypothesis; AKA sensitivity, or 1 − false-negative rate

What is a correlation?

A relationship between two (random) variables

Common correlation metrics?

Pearson, Spearman, Kendall



Review From Lecture #3

What is a ceRNA?

An endogenous RNA that “soaks up” miRNAs to prevent their activity on another endogenous RNA

What are the major univariate discrete tests?

Pearson’s Chi-Squared, Fisher’s Exact, Proportion, Hypergeometric

Common correlation metrics?

Pearson, Spearman, Kendall



Four Main Discrete Univariate Tests

  • Hypergeometric test

    • Is a sample randomly selected from a fixed population?

  • Proportion test

    • Are two proportions equivalent?

  • Fisher’s Exact test

    • Are two binary classifications associated?

  • (Pearson’s) Chi-Squared Test

    • Are paired observations on two variables independent?
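
The corresponding base-R calls (a quick sketch; the counts and table entries are made up):

# Hypergeometric: P(X <= 4) white balls when drawing 20 from an urn of 10 white, 90 black
phyper(4, m = 10, n = 90, k = 20)

# Proportion test: are 30/100 and 45/120 equivalent proportions?
prop.test(c(30, 45), c(100, 120))

# Fisher's exact test: association of two binary classifications (2x2 table)
fisher.test(matrix(c(8, 2, 1, 9), nrow = 2))

# Pearson's chi-squared test of independence on the same kind of table
chisq.test(matrix(c(8, 2, 1, 9), nrow = 2))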



When Do We Use Statistics?

  • Ubiquitous in modern biology

  • In every class I will show a use of statistics in a (very, very) recent Nature paper.




Cervix Cancer 101

  • Disease burden increasing

    • (~380k to ~450k in the last 30 years)

  • By age 50, >80% of women have HPV infection

    • >75% of sexually active women exposed, only a subset affected

    • Why remains almost totally unknown

  • Tightly associated with poverty



HPV Infection Is Associated with Multiple Cancers

  • Cervix: >99%

  • Anal: ~85%

  • Vaginal: ~70%

  • Vulvar: ~40%

  • Penile: ~45%

  • Head & Neck: ~20-30%

Of course not all of these are caused by the HPV subtypes targeted by current vaccines, but a majority are. Thus many cancers are preventable.



Figure 1 is a Classic Sequencing Figure



What Statistical Analysis Did They Do?

  • Lots of information in the Supplementary, but in large part citations to previous work

  • In the main text, mutation rate vs. histology was compared using a Wilcoxon test

    • Reported p-value, sample-size, effect-size

  • Their reporting in the Supplementary was incredibly thorough, for example...



Attendance Break



Problem

  • When we measure more than one variable for each member of a population, a scatter plot may show us that the values are not completely independent: there is, e.g., a trend for one variable to increase as the other increases.

  • Regression analyzes this dependence.

  • Examples:

    • Height vs. weight

    • Gene dosage vs. expression level

    • Survival analysis: probability of death vs. age



Correlation

When one variable depends on the other, the variables are to some degree correlated.

(Note: correlation need not imply causality.)

In R, the function cov() measures covariance and cor() measures the Pearson coefficient of correlation (a normalized measure of covariance).

Pearson's coefficient of correlation ranges from -1 to 1, with 0 indicating no correlation.
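
For example (a quick sketch, not from the original slides; cor() also supports rank-based methods):

x <- rnorm(100)
y <- x + rnorm(100)
cov(x, y)                       # covariance: scale-dependent
cor(x, y)                       # Pearson correlation, in [-1, 1]
cor(x, y, method = "spearman")  # rank-based (Spearman)
cor(x, y, method = "kendall")   # rank-based (Kendall)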



Modeling

Linear regression is one possible model we can apply to data analysis.

A model in the statistician's sense might not be what you think ... it is merely a device to explain data. While it may help you consider mechanisms and causalities, it is not necessarily a representation of any particular physical or biological mechanism.

Note: correlation does not entail causation.

The modeling cycle: specify the model → estimate its parameters → is it adequate? If no, revise the model; if yes, use the model.



Types of regression

Linear regression assumes a particular model:

yi = α + βxi + εi

xi is the independent variable. Depending on the context, also known as a "predictor variable," "regressor," "controlled variable," "manipulated variable," "explanatory variable," "exposure variable," and/or "input variable."

yi is the dependent variable, also known as "response variable," "regressand," "measured variable," "observed variable," "responding variable," "explained variable," "outcome variable," "experimental variable," and/or "output variable."

εi are "errors" - not in the sense of being "wrong", but in the sense of creating deviations from the idealized model. The εi are assumed to be independent and N(0, σ²) (normally distributed); they can also be called residuals.

This model has two parameters: the regression coefficient β (the slope) and the intercept α.



Linear regression

Assumptions:

Only two variables are of interest

One variable is a response and one a predictor

No adjustment is needed for confounding or other between-subject variation

Linearity

σ² is constant, independent of x

The εi are independent of each other

For proper statistical inference (CIs, p-values), the εi are normally distributed



Linear regression

  • Linear regression analysis includes:

    • estimation of the parameters;

    • characterization of how good the model is.



Linear regression: estimation

Parameter estimation: choose parameters that come as close as possible to the "true" values.

Problem: how do we distinguish "good" from "poor" estimates?

One possibility: minimize the Sum of Squared Errors SSE

In a general sense, for a sample (xi, yi), i = 1, ..., n, and a model M:

SSE = Σi (yi − M(xi))²



Linear regression: estimation

For a linear model with the estimated parameters a, b:

SSE = Σi (yi − (a + b·xi))²

Estimation: choose parameters a, b so that the SSE is as small as possible. We call these: least squares estimates.

This method of least squares has an analytic solution for the linear case.
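
A small sketch (made-up data; not from the original slides) comparing direct numeric minimization of the SSE with lm()'s analytic least-squares solution:

set.seed(83)
x <- runif(50, 1.5, 2.3)
y <- 40 * x + 1 + rnorm(50, 0, 15)
sse <- function(p) sum((y - (p[1] + p[2] * x))^2)  # p = c(a, b)
optim(c(0, 0), sse)$par  # numeric minimization of the SSE
coef(lm(y ~ x))          # analytic solution; the two should agree closely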



Linear regression example: height vs. weight

synthHWsamples <- function(n) {
  set.seed(83)
  # parameters for a height vs. weight plot (heights in m)
  hmin <- 1.5
  hmax <- 2.3
  M <- matrix(nrow=n, ncol=2)
  # generate a column of heights in the interval
  M[,1] <- runif(n, hmin, hmax)
  # generate a column of weights with a linear model
  M[,2] <- 40 * M[,1] + 1
  # add some errors
  M[,2] <- M[,2] + rnorm(n, 0, 15)
  return(M)
}

Under the parameters used above, a linear regression analysis should recover a slope of 40 kg/m and an intercept of 1 kg.

It is always a good idea to sanity-test your analysis with synthetic data. After all: if you can't retrieve your model parameters from synthetic data, how could you trust your analysis of real data?



Linear regression example: height vs. weight

> HW<-synthHWsamples(50)

> plot(HW, xlab="Height (m)", ylab="Weight (kg)")

> cov(HW[,1], HW[,2])

[1] 2.498929

> cor(HW[,1], HW[,2])

[1] 0.5408063



Pearson's Coefficient of Correlation

How to interpret the correlation coefficient:

Explore varying degrees of randomness ...

> x<-rnorm(50)

> r <- 0.99;

> y <- (r * x) + ((1-r) * rnorm(50));

> plot(x,y); cor(x,y)

[1] 0.9999666
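
Note (added for clarity; not on the original slide): r here is a mixing weight, not the resulting correlation coefficient. For independent standard-normal x and noise, the correlation of y = r·x + (1−r)·noise is r / sqrt(r² + (1−r)²):

r <- 0.99
r / sqrt(r^2 + (1 - r)^2)  # ≈ 0.99995, consistent with the output above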



Pearson's Coefficient of Correlation

Varying degrees of randomness ...

> x<-rnorm(50)

> r <- 0.8;

> y <- (r * x) + ((1-r) * rnorm(50));

> plot(x,y); cor(x,y)

[1] 0.9661111



Linear regression example: height vs. weight

Estimate a linear model to recover the parameters:

> lm(HW[,2] ~ HW[,1])

Call:

lm(formula = HW[, 2] ~ HW[, 1])

Coefficients:
(Intercept)      HW[, 1]
      -2.86        42.09

> abline(-2.86, 42.09)

or:

abline(lm(HW[,2] ~ HW[,1]), col=rgb(192/255, 80/255, 77/255), lwd=3)



Linear regression example: height vs. weight

Extract information:

> summary(lm(HW[,2] ~ HW[,1]))

Call:

lm(formula = HW[, 2] ~ HW[, 1])

Residuals:
    Min      1Q  Median      3Q     Max
-36.490 -10.297   3.426   9.156  37.385

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   -2.860     18.304  -0.156    0.876
HW[, 1]       42.090      9.449   4.454 5.02e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 16.12 on 48 degrees of freedom
Multiple R-squared: 0.2925, Adjusted R-squared: 0.2777
F-statistic: 19.84 on 1 and 48 DF, p-value: 5.022e-05





Linear regression: quality control

Interpreting the results has two parts:

1: Is the model adequate? (Residuals)

2: Are the parameter estimates good? (Confidence limits)



Linear regression: residuals

Residuals:

The solid red line is the least-squares-fit line (regression line), defined by a particular slope and intercept. The red lines between the regression line and the actual data points are the residuals. Residuals are "signed", i.e. negative if an observation is smaller than the corresponding value of the regression line.



Linear regression: quality control

  • Residual plots allow us to validate underlying assumptions (a quick R check is sketched after this list):

    • The relationship between response and regressor should be linear (at least approximately).

    • The error term ε should have zero mean.

    • The error term ε should have constant variance.

    • The errors should be normally distributed (required for tests and intervals).
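
A quick way to eyeball these assumptions in R (a standard idiom; not on the original slide): the plot() method for lm objects produces diagnostic plots, including residuals vs. fitted values and a normal Q-Q plot.

model <- lm(HW[,2] ~ HW[,1])
par(mfrow = c(2, 2))  # 2x2 grid for the four diagnostic plots
plot(model)           # residuals vs. fitted, normal Q-Q, scale-location, leverage
par(mfrow = c(1, 1))  # restore single-plot layout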



Linear regression: quality control

Source: Montgomery et al., 2001, Introduction to Linear Regression Analysis

Check constant variance and linearity, and look for potential outliers.

What do our synthetic data look like in this regard?



Linear regression example: height vs. weight

Get residuals:

res <- resid(lm(HW[,2] ~ HW[,1]))

Get idealized values:

fit <- fitted(lm(HW[,2] ~ HW[,1]))

Plot differences (segments drawn over the existing scatterplot):

segments(HW[,1], HW[,2], HW[,1], fit, col=2)



Linear regression example: height vs. weight

fit vs. residuals

> plot(fit, res)

> cor(fit, res)

[1] -1.09228e-16

The correlation is zero to machine precision: least-squares residuals are, by construction, uncorrelated with the fitted values.



Linear regression: Q-Q plot

[Figure: one adequate Q-Q pattern (points on a straight line) and four inadequate patterns. Axes: residuals vs. cumulative probability of the normal distribution.]

Plotting the residuals against similarly distributed normal deviates checks the normality assumption.

Source: Montgomery et al., 2001, Introduction to Linear Regression Analysis



Linear regression example: height vs. weight

Q-Q plot: are the residuals normally distributed?

qqnorm(res)
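
A reference line makes the plot easier to read (standard R; not on the original slide):

qqline(res)  # line through the first and third quartiles of the residuals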



Linear regression: Evaluating accuracy

If the model is valid, i.e. nothing terrible in the residuals, we can use it to predict. But how good is the prediction?



Linear regression example: height vs. weight

Prediction and confidence limits (int="p" and int="c" abbreviate interval="prediction" and interval="confidence"):

> pp<-predict(lm(HW[,2] ~ HW[,1]), int="p")

Warning message:

In predict.lm(lm(HW[, 2] ~ HW[, 1]), int = "p") :

Predictions on current data refer to _future_ responses

> pc<-predict(lm(HW[,2] ~ HW[,1]), int="c")

> head(pc)

fit lwr upr

1 60.57098 51.45048 69.69148

2 67.98277 61.53194 74.43360

3 77.96070 73.37784 82.54356

4 92.04435 84.23698 99.85171

5 76.34929 71.70340 80.99518

6 76.57656 71.94643 81.20670



Linear regression example: height vs. weight

Plot pp and pc limits

# a: sort on x
o <- order(HW[,1])
HW2 <- HW[o,]
# b: recompute pp, pc
pc <- predict(lm(HW2[,2] ~ HW2[,1]), int="c")
pp <- predict(lm(HW2[,2] ~ HW2[,1]), int="p")
# c: plot
plot(HW2, xlab="Height (m)", ylab="Weight (kg)", ylim=range(HW2[,2], pp))
matlines(HW2[,1], pc, lty=c(1,2,2), col="black")
matlines(HW2[,1], pp, lty=c(1,3,3), col="red")


Linear regression example: height vs. weight

In the resulting plot, the black dashed lines (pc) are the confidence interval (at p=0.95) for the fitted line, i.e. the parameters; the red dotted lines (pp) are the prediction interval (at p=0.95) for future observations, i.e. the values.



Lots of Analyses Are Linear Regressions

Y = a0 + a1x1

  • x1 continuous → linear regression

  • Y factorial → logistic regression

  • x1 factorial → 1-way ANOVA
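
A sketch of the three cases in R (standard functions; the data are made up):

# x continuous, Y continuous: linear regression
x <- runif(20)
y <- 2 * x + rnorm(20)
lm(y ~ x)

# Y factorial (two levels): logistic regression via glm
yf <- factor(y > 1)
glm(yf ~ x, family = binomial)

# x factorial: one-way ANOVA
xf <- factor(rep(c("A", "B"), each = 10))
aov(y ~ xf)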



One-Way ANOVAs in R

# have a list of groups

x <- as.factor(rep(c("A", "B", "C"), 3));

# and some continuous data

y <- rnorm(9);

# fit a one-way anova with:

tmp <- aov(y ~ x);

# get a p-value with:

summary(tmp);
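
If the ANOVA suggests a group effect, pairwise differences can then be examined with Tukey's HSD (standard R; not on the original slide):

TukeyHSD(tmp)  # pairwise group comparisons with adjusted p-values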



Course Overview

  • Lecture 1: What is Statistics? Introduction to R

  • Lecture 2: Univariate Analyses I: continuous

  • Lecture 3: Univariate Analyses II: discrete

  • Lecture 4: Multivariate Analyses I: specialized models

  • Lecture 5: Multivariate Analyses II: general models

  • Lecture 6: Sequence Analysis

  • Lecture 7: Microarray Analysis I: Pre-Processing

  • Lecture 8: Microarray Analysis II: Multiple-Testing

  • Lecture 9: Machine-Learning

  • Final Exam (written)

