First steps in data analysis with R and Bioconductor

First steps in data analysis with R and Bioconductor Gregoire Pau, W. Huber, EMBL-EBI Cambridge gregoire.pau@ebi.ac.uk

Motivation • Biology is becoming a computational science • Data analysis and mathematical modeling require computational solutions • Bioconductor is software project written in R language for the analysis of biomedical and genomic data which put a premium on code reuse: • many of the tasks have already been solved • if we use those solutions we can put effort into new research • Data complexity is dealt with using well designed, self-describing data structures • Open access to computational code allows to perform reproducible research

Behind R: the S language • The S language has been developed since the late 1970s by John Chambers and his colleagues at Bell Labs • S is a reference language and not an implementation • The language has been through a number of major changes but has been relatively stable since the mid 1990s • The language combines ideas from a variety of sources (e.g. Awk, Lisp, APL...) and provides an environment for quantitative computations and visualization

S Implementations • S-Plus is a commercialization of the Bell Labs code • R is an independent open source version that was originally developed at the University of Auckland but which is now developed by a world wide group of developers • Each version has advantages and problems • References • The New S Language, Statistical models in S, Programming with Data, by John Chambers and various co-authors • Modern Applied Statistics, S Programming by W. Venables and B. Ripley • Introductory Statistics with R by P. Dalgaard • Data Analysis and Graphics Using R by J. Maindonald and J. Braun.

R Extensibility: Packages • R can be easily extended using packages • Packages are the main unit of software authoring, versioning and distribution • CRAN is the major repository for R packages. It is hosted by TU Vienna and ETH Zürich, and has many mirrors world-wide • Bioconductor is a repository for biology related packages. It is hosted at the Fred Hutchinson Cancer Research Centre.

Bioconductor • Open source and open development software project for the analysis of biomedical and genomic data • Started in the autumn of 2001 and includes core developers in the US, Europe, and Australia • R and the R package system are used to design and distribute software

Goals of the Bioconductor project • Provide access to powerful statistical and graphical methods for the analysis of genomic and post-genomic data • Facilitate the integration of biological metadata (e.g. Entrez, Ensembl, GO(A), PubMed) in the analysis of experimental data • Allow the rapid development of extensible, interoperable, and scalable software • Promote high-quality documentation and reproducible research • Provide training in computational and statistical methods

Bioconductor is Open Source • So that you can find out what algorithm is being used, and how it is being used • So that you can modify these algorithms to try out new ideas or to accommodate local conditions or needs • So that they can be used by others as components (potentially modified)

Good scientific software is like a good scientific publication • Reproducibility • Peer-review • Easy accessibility by other researchers, society • Build on the work of others • Others will build their work on top of it • Commercialize those developments that are successful and have a market

Starting R • R is available on Windows, MacOS and Unixes • Command-line application • Just a window • Waiting for an input command with a > • The result of a command can be a line of text in the window

Simple calculator • Simple arithmetic • Variables > 5+4 1 9 > 3*(8+2.2)/14.1 1 2.170213 > a=5 > b=(a+3)*6 > b 1 48 > b^2+a/2 1 2306.5

Functions • Predefined functions • Scientific, statistics, data analysis, bioinformatics, … Basic R contains >1000 functions • Can be easily expanded using packages • Defining functions • Simple function > sqrt(2)+1 1 2.414214 > a = 3 > log(log(a)) 1 0.09404783 > f = function(x)  sqrt(x^2+1)  > f(1.4) 1 1.720465

Arrays • R can easily create and read arrays of numbers • Creating arrays using the concatenation function c • Reading arrays from an external text file using read.table • Processing arrays • Computation of [H+] given pH, using [H+]=10-pH > pH = c(6.4, 5.8, 5.9, 6.1, 6.2, 6.1) > pH = read.table('ph-exp13.dat') > length(pH) 1 453 > Hconcentration = 10^(-pH) > Hconcentration 1 3.981e-07 1.584e-06 1.258e-06 7.943e-07 5 6.309e-07 7.943e-07

Spreadsheet processing • R can easily process matrices and tables • Suppose a text file containing pH batch measurements • Read the table • Access a column using $ or [] • Basic queries exp1 exp2 6.2 6.5 6.4 6.4 6.0 6.2 6.1 6.5 6.1 6.4 6.7 6.2 7.3 6.1 6.2 6.3 6.4 6.4 6.3 6.6 6.5 6.7 6.6 6.4 6.7 6.6 6.1 6.1 6.1 6.1 7.1 6.3 ... ... > tab = read.table('phbatch.dat') > dim(tab) 1 171 2 > v1=tab$exp1 > v2=tab$exp2 > mean(v1) 1 6.36 > median(v2) 1 6.57 > max(v2) 1 7.2 phbatch.txt

Spreadsheet processing Tables can be processed as matrices exp1 exp2 exp3 6.2 6.5 6.2 6.4 6.4 6.0 6.0 6.2 6.1 6.1 6.5 6.2 6.1 6.4 6.3 6.7 6.2 6.2 7.3 6.1 6.5 6.2 6.3 6.7 6.4 6.4 6.2 6.3 6.6 6.1 6.5 6.7 6.0 6.6 6.4 6.1 6.7 6.6 6.2 6.1 6.1 6.3 6.1 6.1 6.5 7.1 6.3 6.3 ... ... ... > mean(tab1,) 1 6.3 > median(tab,2:3) 1 6.2 > sum(tab3:7,1:2) 1 62.35 > tab1,=tab1,+0.1 > tab=rbind(tab,c(6.1,6.2,7.3)) > tab=cbind(tab,tab1,+tab2,) > apply(tab,2,median) 1 6.3 6.5 6.1

More about R • R is much more than a calculator or a spreadsheet processor • Can handle many types • Numeric • Strings • Logical • Factors • Flow control instructions • Conditional switches (if, then, else structures) • Loops (for),repeat-until(repeat)… • R support object-oriented S4 classes • Ability to define objects and functions that operate on these objects

Plotting with R • R can draw many types of plots • Boxplots • Histogram/density plots • QQplots • Scatterplots • Image, contour plots • 3D surfaces, meshes… > demo("graphics") > library("rgl") > example("surface3d") > example("plot3d")

Boxplot • Outliers (unusual values) are those data points whose distance from the box is larger than 1.5 times the interquartile range. • The whiskers extend to the last point which is not an outlier. • A boxplot is a graphical representation of the Five-number summary: [Minimum, First quartile, Median, Third quartile, Maximum] 50% of the data is in the box “Interquartile range” Whisker Outliers Outlier Median 50% of the data is above 50% of the data is below 25% of the data is below the box 25% of the data is above the box

QQ plot: quantiles vs quantiles Compare a distribution of univariate values against another one straight line  the two distributions are the same

Histogram: summarizes a univariate sample Area of this bar represents the proportion of observations between 16 and 17 • Main complication: choice of bin width/number of bins • Most statistical programs do a good job at choosing a reasonable bin width, but manual override is sometimes necessary.

Density • Describes the theoretical probability distribution of a variable • Conceptually, it is obtained in the limit of infinitely many data points • When we estimate it from a finite set of data, we usually assume that the density is a smooth function • You can think of it “smoothed histogram” (but to actually compute it, there are much better methods!)

Scatterplots

Scatterplot • Plots can be easily customized • Using legend, colors, … > plot(p1,p2,col=phenotype) > legend(3,7,c('Wild type'…), col=c(grey,orange,liliac)) Scatterplot of cell population parameters, according to the observed phenotype

smoothScatter

Grouping plots • Using xyplot from the lattice package • Cell population time courses

Statistics • R has been designed for statistical computing • Statistical tests • Parametric (t-test, ANOVA, Fisher, …) • Non-parametric (Rank sum, Spearman correlation…) • Model fitting • Parameter estimation • Linear fitting • …

t-test example • Let's go back to our pH batch test • Compute the mean pH of the two experiments • Is this difference significant ? • If the values are (approximately) normally distributed, Student's t-test can answer this question • Note that if data are not normally distributed, many non-parametric tests are available (including Wilcoxon's rank-sum wilcox.test) • But there is no free lunch: every test makes some sort of assumptions. More informative assumptions -> better power of the test on small sample sizes exp1 exp2 6.23 6.57 6.42 6.48 6.01 6.26 6.12 6.55 6.13 6.44 6.74 6.20 7.35 6.12 6.24 6.32 6.43 6.46 6.36 6.68 6.56 6.79 6.65 6.40 6.74 6.67 6.14 6.16 6.13 6.14 7.17 6.34 ... ... > apply(tab,2,mean) 1 6.36 6.57 > t.test(tab[,1],tab[,2]) t = -1.10, df = 77.1, p-value = 0.2731

Extending R • R can be easily extended using packages • CRAN repository • Using install.package • Broad collection of method- or application-oriented packages • Classification (SVM, nearest Neigbhours…) • Clustering (Hierarchical, k-Means,…) • Signal/image processing • HMM, Markov models • Database • Graphs

Bioconductor • Collection of application-based packages • limma: Linear models for microarrays • affy: Method for microarrays • tilingArray: Methods for processing and analysing tiling arrays • EBImage: Image processing tools • …

Cell segmentation using EBImage • HeLa cells • Channels: • Actin (TRITC) • Tubulin (Alexa 488) • DNA (Hoechst)

Cell segmentation using EBImage

Reproducible Research and Compendia There is a tendency to accept seemingly realistic computational results, as presented by figures and tables, without any proof of correctness. F. Leisch, T. Rossini, Chance 16 (2003) We re-analyzed the breast cancer data from van‘t Veer et al. (2002). ... Even with some helpof the authors, we were unable to exactly reproduce this analysis. R. Tibshirani, B. Efron, SAGMB (2002)

Re-analysis of a breast cancer outcome study E. Huang et al., Gene expression predictors of breast cancer outcome, The Lancet 361 (9369): 1590-6 (2003) 89 primary breast tumors on Affymetrix Chips (HG-U95av2), among them: 52 with 1-3 positive lymph nodes, 18 led to recurrence within 3 years, 34 did not. Goal: predict recurrence Claim: 5 misclassification errors, 1 unclear (leave-one-out cross-validation) Method: Bayesian binary prediction trees (at the time, unpublished) Data available on http://www.cagp.duke.edu

…we tried to reproduce these results, starting from the published marray raw data (CEL files) • But couldn't. • The paper (and supplements) didn't contain the necessary details to re-implement their algorithm. • Authors didn't provide comparisons to simple well-known methods. • In our hands, all other methods resulted in worse misclassification results. • Is their new Bayesian tree method miles better than everything else? • Or was their analysis over-optimistic? (over-fitting, selection bias)

Compendia • Interactive documents that contain: • Primary data • Processing methods (computer code) • Derived data, figures, tables and other output • Text: research report (result, materials and methods, conclusions) • Based on R/Bioconductor's package and vignette technologies

source markup (here: latex & R) processed document (here: PDF) Sweave <<MCRestimate call,eval=FALSE,echo=TRUE>>=r.forest <- MCRestimate(eset,class.label, class.function="RF.wrap", select.fun=red.fct,cross.outer=10, cross.inner=5,cross.repeat=20)@ <<rf.save,echo=FALSE,results=hide>>=savepdf(plot(r.forest, main="Random Forest"),"image-RF.pdf")@ <<result>>=r.forest@ The final document includes results of the calculation, graphicaloutputs, tables, and optionally parts of the R-Code which has beenused. Also the description of theexperiment,the interpretation of theresults, and the conclusion can be integrated. In this example we applied our compendium toT. Golubs ALL/AMLdata~\cite{Golub.1999}. \begin{figure}[h]\begin{center}\includegraphics[width=0.4\textwidth]{image-RF}\end{center}\end{figure}\smallskip <<summary,echo=FALSE>>=method.list <- list(r.forest,r.pam,r.logReg,r.svm)name.list<- c("RF","PAM","PLR","SVM")conf.table <-MCRconfusion(method.list, col.names=name.list)@ <<writinglatex1,echo=FALSE, results=tex>>=xtable(conf.table,"Overall number ofmisclassifications",label="conf.table",display=rep("d",6))@ %\input{samples.1}%\input{conf.table} \begin{thebibliography}{1}\bibitem[Golub et al. ,1999]{Golub.1999} Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES.\newblock Molecular classification of cancer: class discovery and class prediction by gene expression monitoring\newblock\textit{Science} 286(5439): 531-7 (1999). \end{thebibliography}

 Structure of a compendium Package directory General info: author, version, … software documentation source markup additional software code data directory function definition manual page data files function definition manual page data files . . . function definition manual page data files . . . . . .

First steps in data analysis with R and Bioconductor