Bioconductor project: scope and experiences

Wolfgang Huber

EMBL-EBI

1 July 2008

Bioconductor

an open source and open development software project for the analysis of biomedical and genomic data

started in the autumn of 2001 and includes core developers in the US, Europe, and Australia

>100 contributing developers, several thousand users in academia and industry

Computational Biology

mathematical and computational modeling of biological systems

+

high-throughput data analysis

Goals of the Bioconductor project

Create a durable and flexible environment for development and deployment of software for computational biology.

Provide access to powerful statistical and graphical methods for the analysis of genomic data.

Facilitate the integration of biological metadata (e.g. Entrez, Ensembl, GO(A)) in the analysis of experimental data.

Allow the rapid development of extensible, interoperable, and scalable software.

Promote high-quality documentation and reproducible research.

Provide training in computational and statistical methods.

Subject matter scope

Bioconductor

Microarrays incl. tiling (expression, ChIP, copy number)

New sequencing technologies (Solexa et al.)

Observational studies involving genomic data from patients

Data integration along gene (product) IDs or genomic coordinates

Cell-based assays, RNAi and compound screens

Flow cytometry and HT cell imaging

R

Econometrics

Spatial statistics ("Geoinformatics")

Machine learning, inference in high-dimensional situations

Precedents

Free Software Foundation, GNU (Stallman, 80s)

Linux kernel (Torvalds, 90s)

Gnome, KDE

R project

Companies have figured out that they can make money with open-source software (IBM, Sun, ...).

Research funding agencies have realized that their investments into software projects tend to have higher impact and to be more durable when open source.

Developing good open source software also costs money; only the business model is different.

Seven topics to be considered
  • Language selection
  • Infrastructure resources
  • Design strategies
  • Distributed development and recruitment of developers
  • Reuse of exogenous resources
  • Publication and licensure of code
  • Documentation
1. Language Selection

Criteria:

Numerical capabilities (matrix algebra, signal processing, statistical models)

Metadata handling (text processing, (relational) database interaction, categorical data)

Visualisation (interactive and publication quality)

Speed: efficient use of CPU time and RAM

Speed of development

2. Infrastructure resources: Self-describing, standardized data containers
  • our datasets are more complex than just a table or matrix (e.g. a microarray experiment)
  • we want to use & combine software modules from many different authors (e.g. normalisation, quality assessment, differential expression)
NChannelSet

[Diagram: layout of an NChannelSet object; panels labelled D, R, G, B]

assayData can contain N = 0, 1, 2, ... matrices of the same size

"pheno"Data (AnnotatedDataFrame): Sample-ID red, Sample-ID green, Sample-ID blue, Array-ID; annotated with labelDescription and channelDescription (including _ALL_)

featureData (AnnotatedDataFrame): Physical coordinates, Sequence, Target gene ID; annotated with labelDescription

varMetadata
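
The idea of a self-describing container can be sketched in base R. The class name MiniChannelSet and its slot names are hypothetical and are not Biobase's NChannelSet. The sketch only assumes that several same-sized matrices should travel together with their sample and feature annotation, and that mismatches should be caught when the object is built.

```r
library(methods)

# Hypothetical toy class (not Biobase's NChannelSet): a named list of
# same-sized matrices plus annotation for their columns and rows.
setClass("MiniChannelSet",
  slots = c(
    assays   = "list",        # one matrix per channel, e.g. R and G
    pheno    = "data.frame",  # one row per sample (matrix column)
    features = "data.frame"   # one row per feature (matrix row)
  ),
  validity = function(object) {
    dims <- lapply(object@assays, dim)
    if (length(unique(dims)) > 1L)
      return("all assay matrices must have the same dimensions")
    if (length(dims) > 0L) {
      if (dims[[1]][1] != nrow(object@features))
        return("features must have one row per matrix row")
      if (dims[[1]][2] != nrow(object@pheno))
        return("pheno must have one row per matrix column")
    }
    TRUE
  })

red   <- matrix(1:6, nrow = 3,
                dimnames = list(paste0("probe", 1:3), c("arrayA", "arrayB")))
green <- matrix(6:1, nrow = 3, dimnames = dimnames(red))

pheno <- data.frame(sample_red   = c("s1", "s2"),
                    sample_green = c("s3", "s4"),
                    row.names    = colnames(red))
features <- data.frame(target_gene = c("g1", "g2", "g3"),
                       row.names   = rownames(red))

# A consistent object is accepted.
x <- new("MiniChannelSet",
         assays = list(R = red, G = green),
         pheno = pheno, features = features)

# A channel with a different shape is rejected at construction time.
bad <- tryCatch(
  new("MiniChannelSet",
      assays = list(R = red, G = green[1:2, ]),
      pheno = pheno, features = features),
  error = function(e) conditionMessage(e))
```

Because the checks live in the container rather than in each analysis function, modules from different authors can rely on the same invariants.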

3. Design strategies

Design by contract, encapsulation: components are defined by their inputs and outputs, not their implementation

Modularisation - data structures, functions, packages

Object oriented programming

NB - cost of modularity to users:

Multiscale, executable documentation - function man pages, task oriented vignettes (show demo)

Automated resources distribution - package management system, dependencies
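
A small base-R sketch of the design-by-contract point above. The contract (numeric matrix in, numeric matrix of the same dimensions out) and the function names are assumptions made for this illustration, not part of any Bioconductor package.

```r
# Assumed contract for this sketch: a normaliser takes a numeric matrix
# and returns a numeric matrix with identical dimensions.
check_contract <- function(normaliser, x) {
  stopifnot(is.matrix(x), is.numeric(x))
  y <- normaliser(x)
  stopifnot(is.matrix(y), is.numeric(y), identical(dim(y), dim(x)))
  y
}

# Two interchangeable implementations of the same contract.
center_columns <- function(x) sweep(x, 2, colMeans(x))           # subtract column means
scale_columns  <- function(x) sweep(x, 2, apply(x, 2, sd), "/")  # divide by column SDs

x <- matrix(c(1, 2, 3, 10, 20, 30), nrow = 3)
centered <- check_contract(center_columns, x)
scaled   <- check_contract(scale_columns, x)
```

Callers depend only on the checked inputs and outputs, so either implementation can be swapped for another without touching downstream code.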

4. Distributed development and recruitment of developers

Subversion archive

Unit of responsibility: package

Nightly build + test (incl. the dependencies): propagated changes are detected during development rather than in the field

Mailing list + ad hoc communication

Personal recognition (careers...)
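
Why the nightly build includes dependencies can be illustrated with a short base-R sketch. The dependency table and the function name affected_by are hypothetical; the actual Bioconductor build system is not described in this talk.

```r
# Hypothetical dependency table: each package lists what it depends on.
deps <- list(
  pkgA = character(0),
  pkgB = "pkgA",
  pkgC = "pkgB",
  pkgD = c("pkgA", "pkgC"),
  pkgE = character(0)
)

# All packages that directly or indirectly depend on `changed`;
# these are the ones a change could break.
affected_by <- function(changed, deps) {
  affected <- character(0)
  frontier <- changed
  while (length(frontier) > 0L) {
    depends_on_frontier <- vapply(deps, function(d) any(d %in% frontier),
                                  logical(1))
    direct   <- names(deps)[depends_on_frontier]
    frontier <- setdiff(direct, affected)   # only packages not seen before
    affected <- union(affected, frontier)
  }
  affected
}

to_retest <- affected_by("pkgA", deps)
```

In this toy table a change to pkgA means pkgB, pkgC and pkgD need to be rebuilt and retested, while pkgE is unaffected.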

5. Reuse of exogenous resources

Writing good software is hard. Well-used and maintained software contains fewer bugs.

Computational Biology is enormous and no single project can cover all of it.

Lower training costs

6. Publication and licensure of code

Good scientific software is like a good scientific publication

  • Reproducible
  • Peer-reviewed
  • Easy to access by other researchers, society
  • Builds on the work of others
  • Others will build their work on top of it
  • Commercialization of spin-offs can make sense (but is usually not the primary goal at the outset)

Why are we Open Source?

so that you can find out what algorithm is being used, and how it is being used

so that you can modify these algorithms to try out new ideas or to accommodate local conditions or needs

so that they can be used by others as components (potentially modified)

6. Publication and licensure of code

Buckheit and Donoho: "An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete software development environment and that complete set of instructions that generated the figures."

Schwab et al.: "... In a traditional article the author merely outlines the relevant computations: the limitations of a paper medium prohibit complete documentation including experimental data, parameter values and the author's programs. Consequently, the reader has painfully to re-implement the author's work before verifying and utilizing it.... The reader must spend valuable time merely rediscovering minutiae, which the author was unable to communicate conveniently."

6. Publication and licensure of code

Gentleman et al.: "It is easy to identify major publications in the most prestigious journals that provide sketchy or indecipherable characterizations of computational and inferential processes underlying basic conclusions. This problem could be eliminated if the data housed in public archives were accompanied by portable code and scripts that regenerate the article's figures and tables."

Bioconductor

Strict six-monthly release cycle (in sync with R), starting with about 15 packages in release 1.0 (March 2003), now at release 2.2 with 260 packages

Thousands of downloads within 4 weeks after release

Aggressive development

Focus on cutting edge research

Packages vary in their maturity: software ecosystem

The S language

The S language has been developed since the late 1970s by John Chambers and his colleagues at Bell Labs.

The language has been through a number of major changes but has been relatively stable since the mid 1990s

The language combines ideas from a variety of sources (e.g. Awk, Lisp, APL...) and provides an environment for quantitative computations and visualization.

Implementations

S-Plus is a commercialization of the Bell Labs code.

R is an independent open source implementation that was originally developed at the University of Auckland and is now developed by a worldwide group of developers.

Each version has advantages and problems.

Main features of R
  • Most comprehensive collection of statistical models + functions
  • Publication quality graphics
  • Package system with dependency management, name spaces; typical sessions with dozens of packages from different authors
  • Functional language
  • Object oriented programming
  • Foreign language interface (using objects shared in memory)
  • Pragmatic: emphasis on inclusion of many different tools and ideas, and on making particular tasks simple; but not on stringent overall design or safety
The two major drawbacks of R
  1. Its loops are slow
  2. Pass-by-value semantics can cause a lot of unnecessary copying of large objects – wasting CPU time and memory
  • Ad 1.: operators and many functions are vectorized, and it is not difficult to include user-defined C functions for time-critical loops
  • Ad 2.: R has some support for references and mutable state of objects, and future versions of R may support this more (http://www.stat.uiowa.edu/~luke/R/references.html)
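
Both drawbacks and their workarounds can be seen in a few lines of base R. This is a minimal sketch: the function names are made up for the illustration, and environments are used as one example of R's existing support for mutable state, which may or may not be the mechanism the slide has in mind.

```r
x <- runif(1e4)

# Explicit loop: every iteration is evaluated by the interpreter.
sum_sq_loop <- function(v) {
  total <- 0
  for (value in v) total <- total + value^2
  total
}

# Vectorised form: the arithmetic runs inside compiled code.
sum_sq_vec <- function(v) sum(v^2)

# Same result; wrap each call in system.time() to compare speed locally.
same_result <- isTRUE(all.equal(sum_sq_loop(x), sum_sq_vec(x)))

# Pass-by-value semantics: the function modifies its own copy only.
set_first <- function(v) {
  v[1] <- -1
  v
}
original <- c(1, 2, 3)
modified <- set_first(original)   # `original` is left untouched

# Environments have reference semantics, so changes are visible to the caller.
counter <- new.env()
counter$n <- 0
increment <- function(env) env$n <- env$n + 1
increment(counter)
increment(counter)
```

After this runs, `original` still starts with 1, `modified` starts with -1, and `counter$n` is 2.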
Design of the EBImage package

Image class inherits from R's array, hence functionality for matrix algebra, subsetting, statistics and signal processing instantly available

Use ImageMagick library for (de)serialisation, I/O

Use Gtk2 for image viewing

Add own C/C++ code for specialised functionality (e.g. Ray Jones' Voronoi segmentation on image manifolds for cell segmentation)
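
The first design point, inheriting from R's array, can be sketched with a toy S4 class in base R. ToyImage and its colormode slot are hypothetical stand-ins and not the actual EBImage class definition.

```r
library(methods)

# Hypothetical toy class: an array with one extra piece of metadata.
setClass("ToyImage",
         contains = "array",
         slots = c(colormode = "character"))

pixels <- array(seq_len(24) / 24, dim = c(4, 3, 2))   # 4 x 3 pixels, 2 frames
img <- new("ToyImage", .Data = pixels, colormode = "grayscale")

# Array functionality is available without writing any extra methods.
img_dim    <- dim(img)         # dimensions come from the array part
brightest  <- max(img)         # summary statistics
doubled    <- img * 2          # element-wise arithmetic
first_rows <- img[1:2, , 1]    # subsetting
```

Only behaviour specific to images then needs new code, which matches the slide's split between inherited functionality and specialised C/C++ additions.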

Discussion

R is a comprehensive environment for statistical data analysis and machine learning

Bioconductor covers much of bioinformatics

"Barrier of entry" as a developer is low; rapid development

Re-use existing libraries (in any language) as much as possible; focus on genuinely new algorithms

Acknowledgments

Robert Gentleman

Vince Carey, Seth Falcon, and all Bioconductor developers

R community

Oleg Sklyar

Greg Pau