Canadian Bioinformatics Workshops

Canadian Bioinformatics Workshops www.bioinformatics.ca

Module #: Title of Module 2

Module 2Exploratory Data Analysis (EDA) Exploratory Data Analysis of Biological Data using R Boris Steipe Toronto, May 23. and 24. 2013 † Odysseus listening to the song of the sirens. Late Archaic (500–480 BCE) DEPARTMENT OF BIOCHEMISTRY DEPARTMENT OF MOLECULAR GENETICS † This workshop includes material originally developed by Raphael Gottardo, FHCRC and by Sohrab Shah, UBC

Exploratory Data Analysis (EDA) EDA is an approach to data analysis without a particular statistical model or hypothesis. It is therefore often the first step of data analysis.

Exploratory Data Analysis (EDA) • The objectives of EDA include: • uncovering underlying structure and identifying trends and patterns; • extracting important variables; • detecting outliers and anomalies; • testing underlying assumptions; • developing statistical models.

Exploratory Data Analysis (EDA) • The practice of EDA emphasizes looking at data in different ways through: • computing and tabulating basic descriptors of data properties such as ranges, means, and variances; • generating graphics, such as boxplots, histograms, scatter plots; • applying transformations, such as log or rank; • comparing observations to statistical models, such as the QQ-plot, or regression; • identifying underlying structure through clustering; • simplifying data through dimension reduction ... • ... all with the final goal of defining a statistical model and using the model for hypothesis testing and prediction.

Graphics Good graphics are immensely valuable. Poor graphics are worse than none. If you want to learn more about good graphics and information design, find a copy of Edward Tufte's The Visual Display of Quantitative Information. You can also visit his Web site to get a sense of the field (www.edwardtufte.com). Fundamentally, there is one simple rule. Use less ink. The rule has many corollaries.

Use less ink Make sure that all elements on your graphics are necessary. Make sure that all elements on your graphics are informative. Make sure that all information in your data is displayed. Not all of R's defaults use as little ink as possible...

refined graphics ... for a popular alternative to R's defaults, check out the ggplot2 package (http://www.ggplot2.org). > install.packages("ggplot2") ... > library(ggplot2) ... Example: Bubble chart. http://ygc.name/2010/12/01/bubble-chart-by-using-ggplot2/ crime <- read.csv("crimeRatesByState2008.csv", header=TRUE, sep="\t") p <- ggplot(crime, aes(murder,burglary,size=population, label=state)) p <- p+geom_point(colour="red") +scale_area(to=c(1,20))+geom_text(size=3) p + xlab("Murders per 100,000 population") + ylab("Burglaries per 100,000")

Graphics R Graph Gallery: http://gallery.r-enthusiasts.com/

exploring data • When teaching (or learning new procedures) I prefer to work with synthetic data. • Synthetic data has the advantage that I know what the outcome of the analysis should be. • Typically one would create values acording to a function and then add noise. • R has several functions to create sequences of values – or you can write your own ... 0:10 seq(0, pi, 5*pi/180) rep(1:3, each=3, times=2) for (i in 1:10) { print(i*i) }

synthetic data Noise ... Function ... Noisy Function ... Explore functions and noise.

discrete data TGGGACGATCAACTCGAAGCGGATGCCCCTATTGATCCTCTCCGAACACAATGAGAATGCTCTCAAAGGCTGTAAGCTCAGTGAACGGGACTAAATCAGACGTTTGTCGTCGATACGGGTGCCCTACCTCTACCTAATCTGATCGAGGCCAAGTATCTGGAAGGACTGCTAATTTTGCCGTAAACCCCGTTCTTGTTCAGTATAATCTAAGTAGATGTGTTAGCAGTCATATGACGATAGTTTGGGGGTCCTCCTATAGGAAAAAGAAAGAGTCGCCACACCAGAGCCTGTAGCGCTTTCTAGATAGTGCTGCATATTTATATGTCGGCCCCGAACCAGAGCGCTCCAATGGTAGCCCCTTTAATCTTCGTATCTTACCTTTTATGAGTGCACAGGTTTCCATCGAGGGAGAAAAACCTCAGCAACGTGGGTGGGTAGAAGAGCCCTAGTTTGAAAGCCGCACCATAACCCGCACATCGTCAGATCAGTAACCCAAGATCGGTGGGCTGTAACTAGGCTCCGTGACACAGCGTGGTATTCCGAGTTCCCGAAATCGTTTCACCTATAGAACGCCACCCCGGACGGGGTTGTTAGTTTTTCTACCTTTTAAGAAGAAAAGCAAAGTGTGTGGACACGAGAACTAGTGTGAGTACGGTTTTGTATGTGGCCCTACTGTGGAAACTCAGTAGTACGAAGGGGATAGCGAGACTTAGCTTTGCCCCAACTGCCGTCACGCACCCGCTTGTGCCGGTACGCAGAGCTCCGCCGGGTGTCCAAGTGCCGTTCTACGATAAGAACCTGTGTATCTAGCGCGCCCGATATGAATAAAGCCTACTCTTATCCAGATTTTGCGGACTGGTAAGCGTGACAATTATTGCGCAGCTTCGACTTAGTTCTCCTTGCCTTGCTTTAGGGGAGTTCTCCACTCAAAAGTCGTTGACGTACAATCGCAGATTTTGTAATCCCTTAAACCTCTGATTAGTCTCAGCCGTATTCACTA Exercise: Write a function to output random nucleotides.

random nucleotides: solution 1 Generating uniformly distributed random nucleotides using sample(). rNucUnif <- function(n) { nuc <- c("A", "C", "G", "T") rSeq <- sample(nuc, n, replace=TRUE) return(rSeq) } Task: Explore a function that samples according to arbitrary probabilities.

sample() • Random samples from a vector with/without replacement. • Sampling a vector over its length without replacement is a permutation or shuffle of the vector. sums <- rep(0, 10) digits <- 0:9 rsd <- vector() for (i in 1:10000) { perm <- sample(digits, 10) sums <- sums + perm if (i %% 1000 == 0) { rsd <- c(rsd, sd(sums)/mean(sums)) } } Task: Test whether the permutation is biased.

probability distributions • Can be either discrete or continuous (uniform, Bernoulli, normal, etc) • Defined by a density function, p(x) or f(x) • Bernoulli distribution Be(p): flip a coin (T=0, H=1). Assume probability of H is 0.1 ... x <- 0:1 f <- dbinom(x, size=1, prob=0.1) plot(x, f, xlab="x", ylab="density", type="h", lwd=5)

probability distributions Random sampling: Generate 100 observations from a Be(0.1) set.seed(100) x <- rbinom(100, size=1, prob=0.1) hist(x)

probability distributions Normal distribution N(μ,σ2) μ is the mean and σ2 is the variance. Extremely important because of the Central Limit Theorem: if a random variable is the sum of a large number of small random variables, it will be normally distributed. The area under the curve is the probability of observing a value between 0 and 2. x <- seq(-4, 4, 0.1) f <- dnorm(x, mean=0, sd=1) plot(x, f, xlab="x", ylab="density", lwd=5, type="l")

probability distributions Normal distribution N(μ,σ2) μ is the mean and σ2 is the variance. Extremely important because of the Central Limit Theorem: if a random variable is the sum of a large number of small random variables, it will be normally distributed. Task: Explore line parameters The area under the curve is the probability of observing a value between 0 and 2. x <- seq(-4, 4, 0.1) f <- dnorm(x, mean=0, sd=1) plot(x, f, xlab="x", ylab="density", lwd=5, type="l")

probability distributions Random sampling: Generate 100 observations from a N(0,1) Histograms can be used to estimate densities! set.seed(100) x <- rnorm(100, mean=0, sd=1) hist(x) lines(seq(-3,3,0.1),50*dnorm(seq(-3,3,0.1)), col="red")

quantiles (Theoretical) Quantiles: The p-quantile has the property that there is a probability p of getting a value less than or equal to it. The 50% quantile is called the median. 90% of the probability (area under the curve) is to the left of the red vertical line. q90 <- qnorm(0.90, mean = 0, sd = 1) x <- seq(-4, 4, 0.1) f <- dnorm(x, mean=0, sd=1) plot(x, f, xlab="x", ylab="density", type="l", lwd=5) abline(v=q90, col=2, lwd=5)

descriptive statistics Empirical Quantiles: The p-quantile has the property that p% of the observations are less than or equal to it. Empirical quantiles can be easily obtained in R. > set.seed(100) > x <- rnorm(100, mean=0, sd=1) > quantile(x) 0% 25% 50% 75% 100% -2.2719255 -0.6088466 -0.0594199 0.6558911 2.5819589 > quantile(x, probs=c(0.1, 0.2, 0.9)) 10% 20% 90% -1.1744996 -0.8267067 1.3834892

descriptive statistics We often need to quickly 'quantify' a data set, and this can be done using a set of summary statistics (mean, median, variance, standard deviation). > mean(x) [1] 0.002912563 > median(x) [1] -0.0594199 > IQR(x) [1] 1.264738 > var(x) [1] 1.04185 > summary(x) Min. 1st Qu. Median Mean 3rd Qu. Max. -2.272000 -0.608800 -0.059420 0.002913 0.655900 2.582000 Exercise: what are the units of variance and standard deviation?

Box-plot Descriptive statistics can be intuitively summarized in a Box-plot. 1.5 x IQR 75% quantile Median 25% quantile IQR > boxplot(x) 1.5 x IQR Everything above and below 1.5 x IQR is considered an "outlier". IQR = Inter Quantile Range = 75% quantile – 25% quantile

Violinplot Internal structure of a data-vector can be made visible in a violin plot. The principle is the same as for a boxplot, but a width is calculated from a smoothed histogram. p <- ggplot(X, aes(1,x)) p + geom_violin()

plotting data in R Task: Explore types of plots.

QQ–plot One of the first things we may ask about data is whether it deviates from an expectation e.g. to be normally distributed. The quantile-quantile plot provides a way to visually verify this. The QQ-plot shows the theoretical quantiles versus the empirical quantiles. If the distribution assumed (theoretical one) is indeed the correct one, we should observe a straight line. R provides qqnorm() and qqplot().

QQ–plot: sample vs. Normal Only valid for the normal distribution! qqnorm(x) qqline(x, col=2)

Plotting: lines and legends Example: compare the t-distribution with the Normal distribution. x <- seq(-4, 4, 0.1) f1 <- dnorm(x, mean=0, sd=1) f2 <- dt(x, df=2) plot(x, f1, xlab="x", ylab="density", lwd=5, type="l") lines(x, f2, lwd=5, col=2) legend(-4, .4, c("Normal", "t2"), col=1:2, lwd=5)

QQ–plot: sample vs. Normal set.seed(100) t <- rt(100, df=2) qqnorm(t) qqline(t, col=2) Clearly the t distribution with two degrees of freedom is not Normal.

QQ–plot Verify the CLT. set.seed(101) generateVariates <- function(n) { Nvar <- 10000 Vout <- c() for (i in 1:n) { x <- runif(Nvar, -0.01, 0.01) Vout <- c(Vout, sum(x) ) } return(Vout) } x <- generateVariates(1000) y <- rnorm(1000, mean=0, sd=1) qqnorm(x) qqline(x, y, col=2)

QQ–plot: sample vs. sample Comparing two samples: are their distributions the same? ... or ... compare a sample vs. a synthetic dataset. set.seed(100) x <- rt(100, df=2) y <- rnorm(100, mean=0, sd=1) qqplot(x, y) Exercise: try different values of df for rt() and compare the vectors.

scatter plots Biological data sets often contain several variables, they are multivariate. Scatter plots allow us to look at two variables at a time. # GvHD flow cytometry data gvhd <- read.table("GvHD.txt", header=TRUE) # Only extract the CD3 positive cells gvhdCD3p <- as.data.frame(gvhd[gvhd[, 5]>280, 3:6]) plot(gvhdCD3p[, 1:2]) This can be used to assess independence and identify subgroups.

scatter plots Task: Explore scatter plots.

trellis graphics Trellis Graphics is a family of techniques for viewing complex, multi-variable data sets. plot(gvhdCD3p, pch=".") Tip: Many more possibilities are available in the "lattice" package. See ?Lattice

Boxplots The boxplot function can be used to display several variables at a time. boxplot(gvhdCD3p) Exercise: Interpret this plot.

Histograms par(mfrow=c(2, 2)) hist(gvhdCD3p[, 1], 50, main=names(gvhdCD3p)[1], xlab="fluorescent intensity", ylim=c(0, 120)) hist(gvhdCD3p[, 2], 50, main=names(gvhdCD3p)[2], xlab="fluorescent intensity", ylim=c(0, 120)) hist(gvhdCD3p[, 3], 50, main=names(gvhdCD3p)[3], xlab="fluorescent intensity", ylim=c(0, 120)) hist(gvhdCD3p[, 4], 50, main=names(gvhdCD3p)[4], xlab="fluorescent intensity", ylim=c(0, 120)) We discover a mix of distinct cell subpopulations!

EDA: HIV microarray data • HIV Data: • The expression levels of 7680 genes were measured in CD4-T-cell lines at time t = 24 hours after infection with HIV type 1 virus. • 12 positive controls (HIV genes). • 4 replicates (2 with a dye swap)

EDA: HIV microarray data One of the array chips. We get the data after the image analysis is done. For each spot we have an estimate of the intensity in both channels. The data matrix therefore is of size 7680 x 8.

EDA: HIV microarray data data <- read.table(file="hiv.raw.data.24h.txt", sep="\t", header=TRUE) summary(data) boxplot(data)

EDA: HIV microarray data opar <- par(mfrow=c(2, 2)) hist(data[, 1], 50, main=names(data)[1], xlab="fluorescent intensity") hist(data[, 2], 50, main=names(data)[2], xlab="fluorescent intensity") hist(data[, 5], 50, main=names(data)[5], xlab="fluorescent intensity") hist(data[, 6], 50, main=names(data)[6], xlab="fluorescent intensity") par(oPar)

EDA: HIV microarray data # 'apply' will apply the function to all rows of the data matrix mean <- apply(data[, 1:4], 1, "mean") sd <- apply(data[, 1:4], 1, "sd") plot(mean, sd) trend <- lowess(mean, sd) lines(trend, col=2, lwd=5) lowess fit LOcally WEighted Scatter plot Smoother; used to estimate the trend in a scatter plot. Non parametric!

EDA: Transformations Observations: The data are highly skewed. The standard deviation is not constant, as it increases with the mean. Solution: Look for a transformation that will make the data more symmetric and the variance more constant. With positive data the log transformation is often appropriate.

EDA: Transformations data <- log(read.table(file="hiv.raw.data.24h.txt", sep="\t", header=TRUE)) summary(data) boxplot(data)

EDA: HIV microarray data opar <- par(mfrow=c(2, 2)) hist(data[, 1], 50, main=names(data)[1], xlab="fluorescent intensity") hist(data[, 2], 50, main=names(data)[2], xlab="fluorescent intensity") hist(data[, 5], 50, main=names(data)[5], xlab="fluorescent intensity") hist(data[, 6], 50, main=names(data)[6], xlab="fluorescent intensity") par(oPar)

EDA: Transformations # 'apply' will apply the function to all rows of the data matrix mean <- apply(data[, 1:4], 1, "mean") sd <- apply(data[, 1:4], 1, "sd") plot(mean, sd) trend <- lowess(mean, sd) lines(trend, col=2, lwd=5) The standard deviation has become almost independent of the mean.

EDA for microarray data: always log • Makes the data more symmetric, large observations are not as influential • The variance is (more) constant • Turns multiplication into addition (log(ab)=log(a)+log(b)) • In practice use log base 2, log2(x)=log(x)/log(2)

EDA for gene expression # scatter plot plot(data[, 1], data[, 5], xlab=names(data)[1], ylab=names(data)[5]) Is this a good way to look at the data?

EDA for gene expression: MA plots # MA plot A <- (data[, 1]+data[, 5])/2 M <- (data[, 1]-data[, 5]) plot(A, M, xlab="A", ylab="M") M (minus) is the log ratio. A (average) is overall intensity.

EDA for gene expression: MA plots # MA plots per replicate par(mfrow=c(2, 2)) A1 <- (data[, 1]+data[, 5])/2 M1 <- (data[, 1]-data[, 5]) plot(A1, M1, xlab="A", ylab="M", main="rep 1") trend <- lowess(A1, M1) lines(trend, col=2, lwd=5) A2 <- (data[, 2]+data[, 6])/2 M2 <- (data[, 2]-data[, 6]) plot(A2, M2, xlab="A", ylab="M", main="rep 2") trend <- lowess(A2, M2) lines(trend, col=2, lwd=5) A3 <- (data[, 3]+data[, 7])/2 M3 <- (data[, 3]-data[, 7]) plot(A3, M3, xlab="A", ylab="M", main="rep 3") trend <- lowess(A3, M3) lines(trend, col=2, lwd=5) A4 <- (data[, 4]+data[, 8])/2 M4 <- (data[, 4]-data[, 8]) plot(A4, M4, xlab="A", ylab="M", main="rep 4") trend <- lowess(A4, M4) lines(trend, col=2, lwd=5) par(OldPar) The data suggests that a small group of genes are interesting. How do we define/find such differentially expressed genes? More on this in Module 6: Hypothesis Testing.

Canadian Bioinformatics Workshops