Bioconductor project: scope and experiences

Wolfgang Huber, EMBL-EBI, 1 July 2008

Bioconductor
• An open source and open development software project for the analysis of biomedical and genomic data
• Started in the autumn of 2001; includes core developers in the US, Europe, and Australia
• More than 100 contributing developers, several thousand users in academia and industry

Computational Biology: mathematical and computational modeling of biological systems + high-throughput data analysis

Goals of the Bioconductor project
• Create a durable and flexible environment for development and deployment of software for computational biology.
• Provide access to powerful statistical and graphical methods for the analysis of genomic data.
• Facilitate the integration of biological metadata (e.g. Entrez, Ensembl, GO(A)) in the analysis of experimental data.
• Allow the rapid development of extensible, interoperable, and scalable software.
• Promote high-quality documentation and reproducible research.
• Provide training in computational and statistical methods.

Subject matter scope
Bioconductor:
• Microarrays incl. tiling (expression, ChIP, copy number)
• New sequencing technologies (Solexa et al.)
• Observational studies involving genomic data from patients
• Data integration along gene (product) IDs or genomic coordinates
• Cell-based assays, RNAi and compound screens
• Flow cytometry and HT cell imaging
R:
• Econometrics
• Spatial statistics ("Geoinformatics")
• Machine learning, inference in high-dimensional situations

Precedents
• Free Software Foundation, GNU (Stallman, 80s)
• Linux kernel (Torvalds, 90s)
• Gnome, KDE
• R project
• Companies have figured out that they can make money with open-source software (IBM, Sun, ...).
• Research funding agencies have realized that their investments in software projects tend to have higher impact and to be more durable when the software is open source.
• Developing good open source software also costs money; only the business model is different.

Seven topics to be considered
• Language selection
• Infrastructure resources
• Design strategies
• Distributed development and recruitment of developers
• Reuse of exogenous resources
• Publication and licensure of code
• Documentation

1. Language selection
Criteria:
• Numerical capabilities (matrix algebra, signal processing, statistical models)
• Metadata handling (text processing, (relational) database interaction, categorical data)
• Visualisation (interactive and publication quality)
• Speed of execution: efficient use of CPU time and RAM
• Speed of development

2. Infrastructure resources: self-describing, standardized data containers
• Our datasets are more complex than just a table or matrix (e.g. a microarray experiment)
• We want to use and combine software modules from many different authors (e.g. normalisation, quality assessment, differential expression)

[Figure: NChannelSet]
• assayData can contain N = 0, 1, 2, ... matrices of the same size (shown: channels R, G, B and _ALL_)
• "pheno"Data (AnnotatedDataFrame): Sample-ID red, Sample-ID green, Sample-ID blue, Array-ID; varMetadata with labelDescription and channelDescription
• featureData (AnnotatedDataFrame): physical coordinates, sequence, target gene ID; labelDescription
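The container idea in the figure can be sketched in a few lines of base R. This is only an illustration under stated assumptions: `ToyChannelSet`, its slots and the example values are hypothetical and are not the Biobase `NChannelSet` implementation. It shows the key contract from the slide, namely that all channel matrices share one shape and are tied to per-sample and per-feature annotation.

```r
library(methods)

# Hypothetical toy container; NOT the Biobase implementation.
setClass("ToyChannelSet",
  representation(
    assayData   = "list",        # named list of matrices, one per channel
    phenoData   = "data.frame",  # one row per sample (array)
    featureData = "data.frame"   # one row per feature (probe)
  ),
  validity = function(object) {
    a <- object@assayData
    if (!all(vapply(a, is.matrix, logical(1))))
      return("assayData must contain only matrices")
    if (length(a) > 0) {
      d <- dim(a[[1]])
      same <- vapply(a, function(m) identical(dim(m), d), logical(1))
      if (!all(same))
        return("all assayData matrices must have the same dimensions")
      if (nrow(object@featureData) != d[1])
        return("featureData needs one row per feature")
      if (nrow(object@phenoData) != d[2])
        return("phenoData needs one row per sample")
    }
    TRUE
  })

# Two channels (R, G), three features, two arrays; values are made up.
R <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 3, ncol = 2)
G <- matrix(c(6, 5, 4, 3, 2, 1), nrow = 3, ncol = 2)
x <- new("ToyChannelSet",
  assayData   = list(R = R, G = G),
  phenoData   = data.frame(sampleRed = c("a1", "a2"),
                           sampleGreen = c("b1", "b2")),
  featureData = data.frame(geneID = c("g1", "g2", "g3")))

# A channel of the wrong shape is rejected when the object is created.
bad <- tryCatch(
  new("ToyChannelSet",
      assayData   = list(R = R, G = G[1:2, ]),
      phenoData   = x@phenoData,
      featureData = x@featureData),
  error = function(e) "rejected")
```

Because the validity check runs at construction time, modules from different authors can rely on the shape guarantees without re-checking them.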

3. Design strategies
• Design by contract, encapsulation: components are defined by their inputs and outputs, not their implementation
• Modularisation: data structures, functions, packages
• Object oriented programming
• NB, cost of modularity to users:
  • Multiscale, executable documentation: function man pages, task oriented vignettes (show demo)
  • Automated resource distribution: package management system, dependencies
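A minimal sketch of design by contract in R, assuming a contract of "numeric matrix in, matrix of the same shape out". The function names `centre_median`, `scale_unit` and `run_pipeline` are hypothetical and chosen for illustration; they are not Bioconductor functions.

```r
# Two interchangeable modules that honour the same contract.
centre_median <- function(m) sweep(m, 2, apply(m, 2, median))
scale_unit    <- function(m) scale(m, center = TRUE, scale = TRUE)

# The caller depends only on the contract, not on the implementation.
run_pipeline <- function(m, normalise) {
  stopifnot(is.matrix(m), is.numeric(m))   # precondition on the input
  out <- normalise(m)
  stopifnot(identical(dim(out), dim(m)))   # postcondition on the output
  out
}

m  <- matrix(c(1, 2, 3, 4, 5, 9), nrow = 3)
r1 <- run_pipeline(m, centre_median)   # column medians become 0
r2 <- run_pipeline(m, scale_unit)      # column means become 0
```

Either module can be swapped in without touching `run_pipeline`, which is the point of defining components by their inputs and outputs.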

4. Distributed development and recruitment of developers
• Subversion archive
• Unit of responsibility: package
• Nightly build + test (incl. the dependencies): propagated changes are detected during development rather than in the field
• Mailing list + ad hoc communication
• Personal recognition (careers...)

5. Reuse of exogenous resources
• Writing good software is hard.
• Well-used and maintained software contains fewer bugs.
• Computational Biology is enormous and no single project can cover all of it.
• Lower training costs

6. Publication and licensure of code
Good scientific software is like a good scientific publication:
• Reproducible
• Peer-reviewed
• Easy to access by other researchers, society
• Builds on the work of others
• Others will build their work on top of it
• Commercialization of spin-offs can make sense (but is usually not the primary goal at the outset)

Why are we Open Source?
• So that you can find out what algorithm is being used, and how it is being used
• So that you can modify these algorithms to try out new ideas or to accommodate local conditions or needs
• So that they can be used by others as components (potentially modified)

6. Publication and licensure of code
• Buckheit and Donoho: "An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete software development environment and that complete set of instructions that generated the figures."
• Schwab et al.: "... In a traditional article the author merely outlines the relevant computations: the limitations of a paper medium prohibit complete documentation including experimental data, parameter values and the author's programs. Consequently, the reader has painfully to re-implement the author's work before verifying and utilizing it.... The reader must spend valuable time merely rediscovering minutiae, which the author was unable to communicate conveniently."

6. Publication and licensure of code
• Gentleman et al.: "It is easy to identify major publications in the most prestigious journals that provide sketchy or indecipherable characterizations of computational and inferential processes underlying basic conclusions. This problem could be eliminated if the data housed in public archives were accompanied by portable code and scripts that regenerate the article's figures and tables."

Bioconductor
• Strict 6-monthly release cycle (in sync with R)
• Started with about 15 packages in version 1.0 in March 2003; now at 2.2 with 260 packages
• Thousands of downloads within 4 weeks after release
• Aggressive development
• Focus on cutting edge research
• Packages vary in their maturity: software ecosystem

The S language
• The S language has been developed since the late 1970s by John Chambers and his colleagues at Bell Labs.
• The language has been through a number of major changes but has been relatively stable since the mid 1990s.
• The language combines ideas from a variety of sources (e.g. Awk, Lisp, APL...) and provides an environment for quantitative computations and visualization.

Implementations
• S-Plus is a commercialization of the Bell Labs code.
• R is an independent open source version that was originally developed at the University of Auckland and is now developed by a worldwide group of developers.
• Each version has advantages and problems.

Main features of R
• Most comprehensive collection of statistical models + functions
• Publication quality graphics
• Package system with dependency management, name spaces; typical sessions with dozens of packages from different authors
• Functional language
• Object oriented programming
• Foreign language interface (using objects shared in memory)
• Pragmatic: emphasis on inclusion of many different tools and ideas, and on making particular tasks simple; but not on stringent overall design or safety

The two major drawbacks of R
1. Its loops are slow
2. Pass-by-value semantics can cause a lot of unnecessary copying of large objects, wasting CPU time and memory
• Ad 1.: operators and many functions are vectorized; it is not difficult to include user-defined C functions for time-critical loops
• Ad 2.: R has some support for references and mutable state of objects, and future versions of R may support this more (http://www.stat.uiowa.edu/~luke/R/references.html)
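Both points can be illustrated with base R alone. The sketch below uses made-up helper names (`sq_loop`, `set_first`, `increment`) and shows a loop next to its vectorized equivalent, the copy semantics of function arguments, and an environment as one existing way to get reference-like mutable state. Timings are left out because they depend on the machine and the R version.

```r
# Ad 1.: an explicit loop and the equivalent vectorized expression
sq_loop <- function(v) {
  out <- numeric(length(v))
  for (i in seq_along(v)) out[i] <- v[i]^2
  out
}
x <- c(0.5, 1.5, 2.5, 3.5)
same_result <- isTRUE(all.equal(sq_loop(x), x^2))
# To compare speed on larger input: system.time(sq_loop(v)); system.time(v^2)

# Ad 2.: arguments behave as copies, so the caller's object is untouched
set_first <- function(v) {
  v[1] <- -1      # modifies the local copy only
  v
}
y <- c(1, 2, 3)
z <- set_first(y)  # y keeps its original values, z holds the modified copy

# Environments are not copied when passed, so they can hold mutable state
counter <- new.env()
counter$n <- 0
increment <- function(env) {
  env$n <- env$n + 1
  invisible(NULL)
}
increment(counter)
increment(counter)  # the caller sees both updates in counter$n
```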

Foreign language interface demo.R

Design of the EBImage package
• Image class inherits from R's array, hence functionality for matrix algebra, subsetting, statistics and signal processing is instantly available
• Use ImageMagick library for (de)serialisation, I/O
• Use Gtk2 for image viewing
• Add own C/C++ code for specialised functionality (e.g. Ray Jones' Voronoi segmentation on image manifolds for cell segmentation)
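The first design point, inheriting from R's array, can be sketched with a toy S4 class. `ToyImage` and its `colormode` slot are assumptions made for this illustration; the real EBImage `Image` class has its own slots and methods.

```r
library(methods)

# Hypothetical toy class that extends the built-in array class.
setClass("ToyImage", contains = "array",
         representation(colormode = "character"))

# Two 3 x 4 frames with pixel intensities between 0 and 1
pixels <- array(seq(0, 1, length.out = 24), dim = c(3, 4, 2))
img <- new("ToyImage", pixels, colormode = "grayscale")

# Array functionality is available without writing any new methods.
d        <- dim(img)      # dimensions of the underlying array
frame1   <- img[, , 1]    # ordinary array subsetting
avg      <- mean(img)     # summary statistics
inverted <- 1 - img       # element-wise arithmetic
```

Only functionality that arrays lack (I/O, display, segmentation) then needs dedicated code, which matches the remaining bullets on the slide.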

Discussion
• R is a comprehensive environment for statistical data analysis and machine learning
• Bioconductor covers much of bioinformatics
• "Barrier of entry" as a developer is low; rapid development
• Re-use existing libraries (in any language) as much as possible; focus on genuinely new algorithms

Acknowledgments
• Robert Gentleman
• Vince Carey, Seth Falcon, and all Bioconductor developers
• R community
• Oleg Sklyar
• Greg Pau
