
Canadian Bioinformatics Workshops



Presentation Transcript


  1. Canadian Bioinformatics Workshops www.bioinformatics.ca


  3. Lecture 2: Univariate Analyses: Continuous Data. MBP1010, Dr. Paul C. Boutros, Winter 2014. DEPARTMENT OF MEDICAL BIOPHYSICS. † Aegeus, King of Athens, consulting the Delphic Oracle. High Classical (~430 BCE). This workshop includes material originally developed by Drs. Raphael Gottardo, Sohrab Shah, Boris Steipe and others.

  4. Course Overview • Lecture 1: What is Statistics? Introduction to R • Lecture 2: Univariate Analyses I: continuous • Lecture 3: Univariate Analyses II: discrete • Lecture 4: Multivariate Analyses I: specialized models • Lecture 5: Multivariate Analyses II: general models • Lecture 6: Sequence Analysis • Lecture 7: Microarray Analysis I: Pre-Processing • Lecture 8: Microarray Analysis II: Multiple-Testing • Lecture 9: Machine-Learning • Final Exam (written)

  5. How Will You Be Graded? • 9% Participation: 1% per week • 56% Assignments: 8 x 7% each • 35% Final Examination: in-class • Each individual will get their own, unique assignment • Assignments will all be in R, and will be graded according to computational correctness only (i.e. does your R script yield the correct result when run) • Final Exam will include multiple-choice and written answers

  6. Course Information Updates • Website will have up to date information, lecture notes, sample source-code from class, etc. • http://medbio.utoronto.ca/students/courses/mbp1010/mbp_1010.html • Tutorials are Thursdays 13:00-15:00 in 4-204 TMDT • Next week we will be switching lecture and tutorial: • Tutorial: January 20 • Lecture: January 23 • Assignment #1 was delayed because of registration issues • Email quantitative.biology.utoronto@gmail.com with your student ID and we will email back your personal assignment

  7. House Rules • Cell phones to silent • No side conversations • Hands up for questions • Others?

  8. Review From Last Week. Population vs. sample: all MBP students = population; MBP students in 1010 = sample. How do you report statistical information? P-value, variance, effect-size, sample-size, test. Why don't we use Excel/spreadsheets? Spreadsheet errors, reproducibility, wrong results.

  9. Topics For This Week • Introduction to continuous data & probability distributions • Slightly boring, but necessary! • Attendance • Common continuous univariate analyses • Correlations • ceRNAs

  10. Continuous vs. Discrete Data • Definitions? • Examples of discrete data in biological studies? • Why does it matter in the first place? • Areas of discrete mathematics: • Combinatorics • Graph Theory • Discrete Probability Theory (Dice, Cards) • Number Theory

  11. Exploring Data • When teaching (or learning new procedures) we usually prefer to work with synthetic data. • Synthetic data has the advantage that we know what the outcome of the analysis should be. • Typically one would create values according to a function and then add noise. • R has several functions to create sequences of values – or you can write your own ... 0:10; seq(0, pi, 5*pi/180); rep(1:3, each=3, times=2); for (i in 1:10) { print(i*i); }

  12. Synthetic data. [Figure: three panels showing a smooth function, random noise, and their sum, the noisy function.] Task: Explore functions and noise.
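
A minimal sketch of the function-plus-noise recipe described above; the sine function and the noise level are illustrative choices, not from the slides:

    # Create a "true" function, add Gaussian noise, and plot both.
    x <- seq(0, 2 * pi, length.out = 100)
    f <- sin(x)                           # underlying function
    y <- f + rnorm(length(x), sd = 0.2)   # noisy observations
    plot(x, y)                            # explore the noisy data
    lines(x, f, col = "red", lwd = 2)     # overlay the true function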

  13. Probability Distributions. Normal distribution N(μ, σ²): μ is the mean and σ² is the variance. Extremely important because of the Central Limit Theorem: if a random variable is the sum of a large number of small independent random variables, it will be approximately normally distributed. [Figure caption: the shaded area under the curve between x = 0 and x = 2 is the probability of observing a value in that interval.] x <- seq(-4, 4, 0.1) f <- dnorm(x, mean=0, sd=1) plot(x, f, xlab="x", ylab="density", lwd=5, type="l")

  14. Probability Distributions. Task: Explore the line parameters of the plot() call from the previous slide (e.g. lwd, type, col).
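
The caption on slide 13 describes the area under the curve between 0 and 2 as a probability; it can be computed with pnorm(), and the region can be shaded with polygon(). A small sketch (the shading code is an illustration, not from the slides):

    pnorm(2) - pnorm(0)   # P(0 <= X <= 2) under N(0,1), about 0.477
    x <- seq(-4, 4, 0.1)
    plot(x, dnorm(x), type = "l", lwd = 5, xlab = "x", ylab = "density")
    xs <- seq(0, 2, 0.01)
    polygon(c(0, xs, 2), c(0, dnorm(xs), 0), col = "grey")  # shade [0, 2]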

  15. Probability Distributions Random sampling: Generate 100 observations from a N(0,1) Histograms can be used to estimate densities! set.seed(100) x <- rnorm(100, mean=0, sd=1) hist(x) lines(seq(-3,3,0.1),50*dnorm(seq(-3,3,0.1)), col="red")
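
The factor 50 in the lines() call rescales the density curve to the count scale (100 observations × default bin width 0.5). An alternative that avoids the hand-tuned constant is to draw the histogram on the density scale; a sketch:

    set.seed(100)
    x <- rnorm(100, mean = 0, sd = 1)
    hist(x, freq = FALSE)                     # bar heights are densities
    curve(dnorm(x), add = TRUE, col = "red")  # no scaling factor needed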

  16. Quantiles (Theoretical) Quantiles: The p-quantile has the property that there is a probability p of getting a value less than or equal to it. The 50% quantile is called the median. 90% of the probability (area under the curve) is to the left of the red vertical line. q90 <- qnorm(0.90, mean = 0, sd = 1); x <- seq(-4, 4, 0.1); f <- dnorm(x, mean=0, sd=1); plot(x, f, xlab="x", ylab="density", type="l", lwd=5); abline(v=q90, col=2, lwd=5);

  17. Descriptive Statistics Empirical Quantiles: The p-quantile has the property that a proportion p of the observations are less than or equal to it. Empirical quantiles can be easily obtained in R. > set.seed(100); > x <- rnorm(100, mean=0, sd=1); > quantile(x); 0% 25% 50% 75% 100% -2.2719255 -0.6088466 -0.0594199 0.6558911 2.5819589 > quantile(x, probs=c(0.1, 0.2, 0.9)); 10% 20% 90% -1.1744996 -0.8267067 1.3834892

  18. Descriptive Statistics We often need to quickly 'quantify' a data set, and this can be done using a set of summary statistics (mean, median, variance, standard deviation). > mean(x); [1] 0.002912563 > median(x); [1] -0.0594199 > IQR(x); [1] 1.264738 > var(x); [1] 1.04185 > summary(x); Min. 1st Qu. Median Mean 3rd Qu. Max. -2.272000 -0.608800 -0.059420 0.002913 0.655900 2.582000 Exercise: what are the units of variance and standard deviation?
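
As a hint for the exercise: the standard deviation is the square root of the variance, so it carries the same units as the data, while the variance carries squared units. In R:

    sd(x)          # about 1.0207 for the x above
    sqrt(var(x))   # the same value, by definition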

  19. Boxplot. Descriptive statistics can be intuitively summarized in a boxplot. > boxplot(x) [Figure: a boxplot annotated, from top to bottom: upper whisker (up to 1.5 × IQR beyond the box), 75% quantile, median, 25% quantile, lower whisker (up to 1.5 × IQR beyond the box); the box spans the IQR.] IQR = interquartile range = 75% quantile − 25% quantile. Everything more than 1.5 × IQR beyond the box is considered an "outlier".
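
The outlier rule can be checked directly in R: boxplot.stats() reports the points that boxplot() draws individually (note it uses Tukey's hinges, which can differ slightly from quantile()-based quartiles in small samples):

    boxplot.stats(x)$out    # points beyond the whiskers
    # Manual version of the 1.5 x IQR rule:
    q <- quantile(x, c(0.25, 0.75))
    x[x < q[1] - 1.5 * IQR(x) | x > q[2] + 1.5 * IQR(x)]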

  20. Violinplot. The internal structure of a data vector can be made visible in a violin plot. The principle is the same as for a boxplot, but the width is calculated from a smoothed histogram (a density estimate). library(ggplot2); X <- data.frame(x = x); p <- ggplot(X, aes(1, x)); p + geom_violin()

  21. plotting data in R Task: Explore types of plots.
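
A few starting points for the task; the choices of plot types here are illustrative:

    plot(x, type = "p")         # points (default)
    plot(x, type = "l")         # lines
    plot(sort(x), type = "s")   # step plot of sorted values
    hist(x)                     # histogram
    barplot(table(round(x)))    # bar plot of binned counts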

  22. QQ–plot One of the first things we may ask about data is whether it deviates from an expectation, e.g. that it is normally distributed. The quantile-quantile plot provides a way to check this visually. The QQ-plot shows the theoretical quantiles versus the empirical quantiles. If the assumed (theoretical) distribution is indeed the correct one, the points should fall on a straight line. R provides qqnorm() and qqplot().

  23. QQ–plot: sample vs. Normal. qqnorm() compares against the normal distribution only! qqnorm(x) qqline(x, col=2)

  24. QQ–plot: sample vs. Normal set.seed(100) t <- rt(100, df=2) qqnorm(t) qqline(t, col=2) Clearly the t distribution with two degrees of freedom is not Normal.

  25. QQ–plot. Verify the CLT: a sum of many small uniform variates should be approximately normal.
    set.seed(101)
    generateVariates <- function(n) {
      Nvar <- 10000
      Vout <- numeric(n)                          # preallocate the result
      for (i in 1:n) {
        Vout[i] <- sum(runif(Nvar, -0.01, 0.01))  # sum of many small variates
      }
      return(Vout)
    }
    x <- generateVariates(1000)
    qqnorm(x)
    qqline(x, col=2)

  26. QQ–plot: sample vs. sample Comparing two samples: are their distributions the same? ... or ... compare a sample vs. a synthetic dataset. set.seed(100) x <- rt(100, df=2) y <- rnorm(100, mean=0, sd=1) qqplot(x, y) Exercise: try different values of df for rt() and compare the vectors.

  27. Boxplots The boxplot function can be used to display several variables at a time. boxplot(gvhdCD3p) Exercise: Interpret this plot.

  28. Attendance Break

  29. Hypothesis Testing Hypothesis testing is confirmatory data analysis, in contrast to exploratory data analysis. Concepts: Null and alternative hypotheses Region of acceptance / rejection and critical value Error types p-value Significance level Power of a test (1 − β, where β is the false-negative rate)

  30. Null Hypothesis / Alternative Hypothesis The null hypothesis H0 states that nothing of consequence is apparent in the data distribution. The data corresponds to our expectation. We learn nothing new. The alternative hypothesis H1 states that some effect is apparent in the data distribution. The data is different from our expectation. We need to account for something new. Not in all cases will this result in a new model, but a new model always begins with the observation that the old model is inadequate. Don't think about this too much!

  31. Test types ... common types of tests A Z–test compares a sample mean with a normal distribution. A t–test compares a sample mean with a t-distribution and thus relaxes the requirements on normality for the sample. Nonparametric tests can be applied if we have no reasonable model from which to derive a distribution for the null hypothesis. Chi–squared tests analyze whether samples are drawn from the same distribution. F-tests analyze the variance of populations (ANOVA).

  32. Error Types
                   Truth: H0 true                         Truth: H1 true
      Accept H0    correct (1 − α)                        "false negative", Type II error (β)
      Reject H0    "false positive", Type I error (α)     correct: "power", "sensitivity" (1 − β)

  33. What is a p–value? • A measure of how much evidence we have against the alternative hypothesis. • The probability of making an error. • Something that biologists want to be below 0.05. • The probability of observing a value as extreme or more extreme by chance alone. • All of the above.

  34. Distributional Assumptions A parametric test makes assumptions about the underlying distribution of the data. A non-parametric test makes no assumptions about the underlying distribution, but may make other assumptions!

  35. Most Common Statistical Test: The T-Test A Z–test compares a sample mean with a normal distribution. A t–test compares a sample mean with a t-distribution and thus relaxes the requirements on normality for the sample. Nonparametric tests can be applied if we have no reasonable model from which to derive a distribution for the null hypothesis. One-Sample vs. Two-Sample One-Sided vs. Two-Sided Heteroscedastic vs. Homoscedastic

  36. Two-Sample t–test Test if the means of two distributions are the same. The datasets y_i1, ..., y_in are independent and normally distributed with mean μ_i and variance σ², i.e. N(μ_i, σ²), where i = 1, 2. In addition, we assume that the data in the two groups are independent and that the variance is the same in both.

  37. Two–sample t–test. The test statistic for the equal-variance case is t = (ȳ_1 − ȳ_2) / (s_p · sqrt(1/n_1 + 1/n_2)), where s_p² = ((n_1 − 1)s_1² + (n_2 − 1)s_2²) / (n_1 + n_2 − 2) is the pooled variance; under H0 it follows a t-distribution with n_1 + n_2 − 2 degrees of freedom.
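
A minimal sketch of this test in R on simulated data following the model on slide 36 (the means, sample sizes, and seed are illustrative choices):

    set.seed(1)
    y1 <- rnorm(20, mean = 0, sd = 1)   # group 1: N(0, 1)
    y2 <- rnorm(20, mean = 1, sd = 1)   # group 2: N(1, 1)
    t.test(y1, y2, var.equal = TRUE)    # homoscedastic (pooled) t-test
    t.test(y1, y2)                      # Welch's t-test, R's default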

  38. t–test assumptions Normality: The data need to be sampled from a normal distribution. If not, one can use a transformation or a non-parametric test. If the sample size is large enough (n>30), the t-test will work just fine (CLT). Independence: Usually satisfied. If not independent, more complex modeling is required. Independence between groups: In the two-sample t-test, the groups need to be independent. If not, one can sometimes use a paired t-test instead. Equal variances: If the variances are not equal in the two groups, use Welch's t-test (the default in R). How do we test these? (See the sketch below.)
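
One common way to check these assumptions in R; a sketch using the simulated y1 and y2 from the previous example (the specific tests are standard choices, not prescribed by the slides):

    shapiro.test(y1)          # Shapiro-Wilk test of normality
    qqnorm(y1); qqline(y1)    # visual normality check
    var.test(y1, y2)          # F-test for equality of variances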

  39. Non–parametric tests Non-parametric tests constitute a flexible alternative to t-tests if you don't have a model of the distribution. In cases where a parametric test would be appropriate, non-parametric tests have less power. Several non-parametric alternatives exist, e.g. the Wilcoxon signed-rank and Mann-Whitney (Wilcoxon rank-sum) tests.

  40. Wilcoxon test principle Consider two random distributions with 25 samples each and slightly different means. set.seed(53) n <- 25 M <- matrix(nrow = n+n, ncol=2) for (i in 1:n) { M[i,1] <- rnorm(1, 10, 1) M[i,2] <- 1 M[i+n,1] <- rnorm(1, 11, 1) M[i+n,2] <- 2 } plot(M[,1], col=M[,2])

  41. Wilcoxon test principle o <- order(M[,1]) plot(M[o,1], col=M[o,2]) For each observation in the first group, count the number of observations in the second group that have a smaller rank. The sum of these counts is the test statistic (see the sketch below). wilcox.test(M[1:n,1], M[(1:n)+n,1])
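
The counting principle can be verified directly: computed by brute force over all pairs, the count matches the W statistic reported by wilcox.test(). A sketch using M and n from the code above:

    a <- M[1:n, 1]                 # first group
    b <- M[(1:n) + n, 1]           # second group
    sum(outer(a, b, ">"))          # pairs where a exceeds b
    wilcox.test(a, b)$statistic    # the same count, reported as W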

  42. Flow-Chart For Two-Sample Tests. Is the data sampled from a normally-distributed population? • Yes → Equal variance (F-test)? Yes → homoscedastic t-test; No → heteroscedastic t-test. • No → Sufficient n for the CLT (>30)? Yes → t-test (homo- or heteroscedastic, as above); No → Wilcoxon U-test.

  43. Power, error rates and decision Power calculation in R: > power.t.test(n = 5, delta = 1, sd=2, alternative="two.sided", type="one.sample") One-sample t test power calculation n = 5 delta = 1 sd = 2 sig.level = 0.05 power = 0.1384528 alternative = two.sided Other tests are available – see ??power.

  44. Power, error rates and decision. [Figure: two overlapping sampling distributions centred at μ0 and μ1; the tail areas mark Pr(false positive) = Pr(Type I error) and Pr(false negative) = Pr(Type II error).] Let's try some power analyses in R (see the sketch below).
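
As one example to try, power.t.test() can instead solve for the sample size needed to reach a target power (effect size and sd as on the previous slide; the 80% power target is an illustrative choice):

    power.t.test(delta = 1, sd = 2, power = 0.8,
                 alternative = "two.sided", type = "one.sample")
    # solves for n (roughly 34 in this setting)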

  45. Problem • When we measure more than one variable for each member of a population, a scatter plot may show us that the values are not completely independent: there is e.g. a trend for one variable to increase as the other increases. • Regression analyzes this dependence. • Examples: • Height vs. weight • Gene dosage vs. expression level • Survival analysis: probability of death vs. age

  46. Correlation When one variable depends on the other, the variables are to some degree correlated. (Note: correlation need not imply causality.) In R, the function cov() measures covariance and cor() measures the Pearson coefficient of correlation (a normalized measure of covariance). Pearson's coefficient of correlation ranges from -1 to 1, with 0 indicating no correlation.

  47. Pearson's Coefficient of Correlation How to interpret the correlation coefficient: Explore varying degrees of randomness ... > x<-rnorm(50) > r <- 0.99; > y <- (r * x) + ((1-r) * rnorm(50)); > plot(x,y); cor(x,y) [1] 0.9999666

  48. Pearson's Coefficient of Correlation Varying degrees of randomness ... > x<-rnorm(50) > r <- 0.8; > y <- (r * x) + ((1-r) * rnorm(50)); > plot(x,y); cor(x,y) [1] 0.9661111

  49. Pearson's Coefficient of Correlation Varying degrees of randomness ... > x<-rnorm(50) > r <- 0.4; > y <- (r * x) + ((1-r) * rnorm(50)); > plot(x,y); cor(x,y) [1] 0.6652423

  50. Pearson's Coefficient of Correlation Varying degrees of randomness ... > x<-rnorm(50) > r <- 0.01; > y <- (r * x) + ((1-r) * rnorm(50)); > plot(x,y); cor(x,y) [1] 0.01232522
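
To go beyond the point estimate, cor.test() tests whether a correlation differs significantly from zero and gives a confidence interval; a sketch mirroring the mixing recipe above (the seed and r are illustrative):

    set.seed(1)
    x <- rnorm(50)
    y <- (0.4 * x) + (0.6 * rnorm(50))
    cor.test(x, y)   # Pearson correlation with p-value and 95% CI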
