Bioconductor project: scope and experiences. Wolfgang Huber, EMBL-EBI, 1 July 2008.
Bioconductor
an open source and open development software project for the analysis of biomedical and genomic data
started in the autumn of 2001 and includes core developers in the US, Europe, and Australia
>100 contributing developers, several thousand users in academia and industry
mathematical and computational modeling of biological systems
high-throughput data analysis
Create a durable and flexible environment for development and deployment of software for computational biology.
Provide access to powerful statistical and graphical methods for the analysis of genomic data.
Facilitate the integration of biological metadata (e.g. Entrez, Ensembl, GO(A)) in the analysis of experimental data.
Allow the rapid development of extensible, interoperable, and scalable software.
Promote high-quality documentation and reproducible research.
Provide training in computational and statistical methods.
Microarrays incl. tiling (expression, ChIP, copy number)
New sequencing technologies (Solexa et al.)
Observational studies involving genomic data from patients
Data integration along gene (product) IDs or genomic coordinates
Cell-based assays, RNAi and compound screens
Flow cytometry and HT cell imaging
Spatial statistics ("Geoinformatics")
Machine learning, inference in high-dimensional situations
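As a rough illustration of the "data integration along gene (product) IDs or genomic coordinates" item above, the following Python sketch joins two small tables on a shared gene ID and tests two genomic intervals for overlap. The gene IDs, values, and function names are all invented for the example; this is not Bioconductor code.

```python
# Hypothetical measurements and annotations, both keyed by gene ID.
expression = {"gene1": 7.2, "gene2": 3.1, "gene4": 5.5}
annotation = {"gene1": "kinase", "gene2": "transporter", "gene3": "receptor"}


def join_on_gene_id(measurements, metadata):
    """Inner join: keep only gene IDs present in both tables."""
    shared = sorted(measurements.keys() & metadata.keys())
    return [(gene, measurements[gene], metadata[gene]) for gene in shared]


def overlaps(a, b):
    """True if two (chromosome, start, end) intervals, ends inclusive, share a position."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]


joined = join_on_gene_id(expression, annotation)
same_region = overlaps(("chr1", 100, 200), ("chr1", 150, 300))
```

Joining on IDs only works when both data sets use the same identifier scheme; joining on coordinates only works when both refer to the same genome assembly.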
Free Software Foundation, GNU (Stallman, 80s)
Linux kernel (Torvalds, 90s)
Companies have figured out that they can make money with open-source software (IBM, Sun, ...).
Research funding agencies have realized that their investments in software projects tend to have higher impact and to be more durable when the software is open source.
Developing good open source software also costs money; only the business model is different.
Numerical capabilities (matrix algebra, signal processing, statistical models)
Metadata handling (text processing, (relational) database interaction, categorical data)
Visualisation (interactive and publication quality)
Speed: efficient use of CPU time and RAM
Speed of development
NChannelSet
assayData can contain N = 0, 1, 2, ... matrices of the same size
[Figure labels: Target gene ID]
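The constraint that every channel holds a matrix of the same size can be sketched as a small container that validates each matrix against shared row and column labels. This is a Python illustration only: the class name `ChannelSet`, its fields, and its methods are hypothetical and do not reproduce the NChannelSet interface.

```python
from dataclasses import dataclass, field


@dataclass
class ChannelSet:
    """Hypothetical stand-in for an NChannelSet-like container."""

    feature_ids: list  # row labels, e.g. target gene IDs
    sample_ids: list   # column labels
    assay_data: dict = field(default_factory=dict)  # channel name -> list of rows

    def __post_init__(self):
        for name, matrix in self.assay_data.items():
            self._check_shape(name, matrix)

    def _check_shape(self, name, matrix):
        # Every channel needs one row per feature and one column per sample.
        if len(matrix) != len(self.feature_ids):
            raise ValueError(f"channel {name!r}: expected {len(self.feature_ids)} rows")
        for row in matrix:
            if len(row) != len(self.sample_ids):
                raise ValueError(f"channel {name!r}: expected {len(self.sample_ids)} columns")

    def add_channel(self, name, matrix):
        self._check_shape(name, matrix)
        self.assay_data[name] = matrix

    def channel_names(self):
        return sorted(self.assay_data)


empty = ChannelSet(["geneA", "geneB"], ["s1", "s2", "s3"])  # N = 0 channels is valid
two = ChannelSet(["geneA", "geneB"], ["s1", "s2", "s3"])
two.add_channel("red", [[1, 2, 3], [4, 5, 6]])
two.add_channel("green", [[6, 5, 4], [3, 2, 1]])
```

Checking the shape when a channel is added means a mismatched matrix is rejected at construction time rather than surfacing later in an analysis.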
Design by contract, encapsulation: components are defined by their inputs and outputs, not their implementation
Modularisation - data structures, functions, packages
Object oriented programming
NB - cost of modularity to users:
Multiscale, executable documentation - function man pages, task oriented vignettes (show demo)
Automated resource distribution - package management system, dependencies
Unit of responsibility: package
Nightly build + test (incl. the dependencies): propagated changes are detected during development rather than in the field
Mailing list + ad hoc communication
Personal recognition (careers...)
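To see why building and testing each package together with its dependencies catches propagated changes, consider the following Python sketch. It computes which packages are affected when one package changes, by walking the dependency graph in reverse. The package names and the `DEPENDS_ON` map are invented for illustration; the actual build system is not described in these slides.

```python
from collections import deque

# Hypothetical map: package -> packages it directly depends on.
DEPENDS_ON = {
    "base_utils": [],
    "annotation": ["base_utils"],
    "normalisation": ["base_utils"],
    "diff_expression": ["annotation", "normalisation"],
    "imaging": [],
}


def affected_packages(changed, depends_on):
    """Return the changed package plus everything that depends on it, directly or indirectly."""
    # Invert the map: package -> packages that directly depend on it.
    dependents = {name: [] for name in depends_on}
    for pkg, deps in depends_on.items():
        for dep in deps:
            dependents[dep].append(pkg)

    # Breadth-first walk over the reverse dependencies.
    seen = {changed}
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for pkg in dependents[current]:
            if pkg not in seen:
                seen.add(pkg)
                queue.append(pkg)
    return sorted(seen)


print(affected_packages("base_utils", DEPENDS_ON))
# prints ['annotation', 'base_utils', 'diff_expression', 'normalisation']
```

A change to a widely used package touches most of the graph, which is why running the tests of downstream packages every night finds breakage before users do.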
Writing good software is hard. Well-used and maintained software contains fewer bugs.
Computational Biology is enormous and no single project can cover all of it.
Lower training costs
Good scientific software is like a good scientific publication
o Easy to access by other researchers, society
o Builds on the work of others
o Others will build their work on top of it
o Commercialization of spin-offs can make sense (but is usually not the primary goal at the outset)
so that you can find out what algorithm is being used, and how it is being used
so that you can modify these algorithms to try out new ideas or to accommodate local conditions or needs
so that they can be used by others as components (potentially modified)
Buckheit and Donoho: "An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete software development environment and the complete set of instructions which generated the figures."
Schwab et al.: "... In a traditional article the author merely outlines the relevant computations: the limitations of a paper medium prohibit complete documentation including experimental data, parameter values and the author's programs. Consequently, the reader has painfully to re-implement the author's work before verifying and utilizing it.... The reader must spend valuable time merely rediscovering minutiae, which the author was unable to communicate conveniently."
Gentleman et al.: "It is easy to identify major publications in the most prestigious journals that provide sketchy or indecipherable characterizations of computational and inferential processes underlying basic conclusions. This problem could be eliminated if the data housed in public archives were accompanied by portable code and scripts that regenerate the article's figures and tables."
Strict 6-monthly release cycle (in sync with R): release 1.0 in March 2003 with about 15 packages, now at release 2.2 with 260 packages
Thousands of downloads within 4 weeks after release
Focus on cutting edge research
Packages vary in their maturity: software ecosystem
The S language has been developed since the late 1970s by John Chambers and his colleagues at Bell Labs.
The language has been through a number of major changes but has been relatively stable since the mid 1990s
The language combines ideas from a variety of sources (e.g. Awk, Lisp, APL...) and provides an environment for quantitative computations and visualization.
S-Plus is a commercialization of the Bell Labs code.
R is an independent open source version that was originally developed at the University of Auckland and is now developed by a worldwide group of developers.
Each version has advantages and problems.
Image class inherits from R's array, hence functionality for matrix algebra, subsetting, statistics and signal processing instantly available
Use ImageMagick library for (de)serialisation, I/O
Use Gtk2 for image viewing
Add own C/C++ code for specialised functionality (e.g. Ray Jones' Voronoi segmentation on image manifolds for cell segmentation)
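The design idea in the first point, an image class that inherits from a general array type so that existing functionality comes for free, can be illustrated in Python. `Array2D` here is a deliberately tiny stand-in for a real array type, and both class names and all methods are hypothetical; the package described in the slides is written in R and C/C++, not Python.

```python
class Array2D:
    """Minimal stand-in for a general-purpose numeric array type."""

    def __init__(self, rows):
        self.rows = [list(r) for r in rows]

    def shape(self):
        return (len(self.rows), len(self.rows[0]) if self.rows else 0)

    def mean(self):
        values = [v for row in self.rows for v in row]
        return sum(values) / len(values)

    def crop(self, row_slice, col_slice):
        # type(self) keeps the subclass, so cropping an Image yields an Image.
        return type(self)([row[col_slice] for row in self.rows[row_slice]])


class Image(Array2D):
    """Hypothetical image class: adds only image-specific behaviour."""

    def threshold(self, cutoff):
        return type(self)([[1 if v > cutoff else 0 for v in row] for row in self.rows])


img = Image([[0.1, 0.9, 0.4], [0.8, 0.2, 0.6]])
patch = img.crop(slice(0, 2), slice(1, 3))  # inherited subsetting
mask = img.threshold(0.5)                   # image-specific method
```

The subclass contains only what is specific to images; subsetting and summary statistics are written once in the base type.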
R is a comprehensive environment for statistical data analysis and machine learning
Bioconductor covers much of bioinformatics
"Barrier of entry" as a developer is low; rapid development
Re-use existing libraries (in any language) as much as possible; focus on genuinely new algorithms
Vince Carey, Seth Falcon, and all Bioconductor developers