Bioinformatics workflow management. Thoughts and case studies from industry. Mark Schreiber, Bioinformatics Research Investigator WWWFG, 5-7 June 2007. Outline. Integration and workflows Early attempts Case studies and examples What does the future hold? Conclusions.

Presentation Transcript

Bioinformatics workflow management

Thoughts and case studies from industry.

Mark Schreiber, Bioinformatics Research Investigator

WWWFG, 5-7 June 2007

Outline
  • Integration and workflows
  • Early attempts
  • Case studies and examples
  • What does the future hold?
  • Conclusions

2 | Bioinformatics workflow management | Mark Schreiber

Bioinformatics at NITD
  • Data Integration
    • Ontologies, Standards, DBs
  • Knowledge Discovery
    • Algorithms, Informatics, Machine Learning
  • Modelling
    • Pathways, Circuits, Abstraction

Bioinformatics at NITD
  • BI combines data gathering, data storage and knowledge management with analytical tools to present complex and competitive information to planners and decision makers.
  • Hypothesis Generation and Validation. Providing the right information at the right time.
  • Decision Support.

Data Sources
Heterogeneity
  • The most significant research is done when heterogeneous data sources can be combined in one analysis.

Applications (Services)
Yet more heterogeneity
  • RDBMS
    • Oracle, MySQL, PostgreSQL, etc.
  • Open Source
    • Usually just a command line interface
  • Commercial software
    • API, scripting engine, webservice
  • Web services and Web resources
  • Integration is rarely seamless

Productivity vs. Innovation
Finding a balance
  • Development and manufacturing prioritize productivity
  • Research requires more innovation
  • Standardization increases productivity
  • Standardization limits innovation
    • At the level it is applied
  • Standardization promotes innovation
    • At higher levels
  • Workflows give a nice balance

What is a workflow?
In Bioinformatics
  • A data-driven procedure consisting of one or more transformation processes (nodes).
  • Can be represented as a directed graph.
    • Direction is time – The order of transformations.
    • A set of transformation rules.
  • A flow of data from its source to a destination (or result) via a series of merges, joins, manipulations and interconnected tools (services).
  • A specification designed in a Workflow Design System (modeling component) and run by a Workflow Management System (execution component).
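
As a minimal sketch of the definition above, a workflow can be held as a directed graph of named nodes and run in dependency order. The helper `run_workflow`, the node names and the toy transformations are hypothetical illustrations, not taken from any system named in this talk; ordering relies on the standard library module `graphlib` (Python 3.9 or later).

```python
from graphlib import TopologicalSorter


def run_workflow(nodes, edges, source_data):
    """Run each node once, in an order that respects the directed edges.

    nodes: dict mapping node name -> function taking a list of upstream results
    edges: dict mapping node name -> set of upstream node names
    source_data: value handed to nodes that have no upstream nodes
    """
    results = {}
    # static_order() yields every node after all of its upstream nodes.
    for name in TopologicalSorter(edges).static_order():
        upstream = [results[dep] for dep in sorted(edges.get(name, ()))]
        results[name] = nodes[name](upstream or [source_data])
    return results


# Hypothetical three-step workflow: read -> filter -> count.
nodes = {
    "read": lambda inputs: inputs[0].split(),
    "filter": lambda inputs: [s for s in inputs[0] if s.startswith("ATG")],
    "count": lambda inputs: len(inputs[0]),
}
edges = {"read": set(), "filter": {"read"}, "count": {"filter"}}

results = run_workflow(nodes, edges, "ATGCCC GGGTTT ATGAAA")
```

Here the graph is the specification and `run_workflow` plays the role of a very small execution component; a real workflow management system would add the control features discussed later in the talk.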

The UNIX Philosophy
Analogy to workflows
  • Write programs that do one thing and do it well
  • Write programs that work together
  • Write programs to handle text streams, because that is the universal interface
    • Text formatted as XML
  • Do one thing and do it well
  • A workflow is made up of nodes that do one thing and do it well
    • So is a Service Oriented Architecture (SOA)

An early attempt: Polymer
Unix shell scripts + Biojava objects
  • Biojava is a large API of Java objects that are useful for bioinformatics.
  • Biojava objects can be assembled into mini-programs that ‘do one thing and do it well’.
  • Polymer combines these mini-programs into a very simple workflow using Unix shell scripts.
    • Much like Unix piping.
  • Unfortunately it instantiates multiple JVMs
  • Lacks management and logging systems

How could Polymer have been better?
  • Provide an execution class and allow it to execute a script.
    • This would mean only one JVM is launched and could allow for threading of branches in the script.
  • Use Groovy script instead of Unix shell script.
    • But Groovy hadn’t been invented at the time.
  • At the same time workflow management systems were emerging which made Polymer redundant.
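
The first suggestion above concerns a single JVM; the same idea can be sketched in Python rather than Java, assuming the branches of a script are independent functions. One process runs the whole script and hands each branch to a thread. The function names below are hypothetical and this is not how Polymer was implemented.

```python
from concurrent.futures import ThreadPoolExecutor


def gc_count(seq):
    """Count G and C bases in a sequence."""
    return sum(seq.count(base) for base in "GC")


def reverse(seq):
    """Return the sequence reversed."""
    return seq[::-1]


def run_branches(data, branches):
    """Run independent branches on the same input in threads of one process."""
    with ThreadPoolExecutor(max_workers=len(branches)) as pool:
        futures = {name: pool.submit(func, data) for name, func in branches.items()}
        # result() blocks until the corresponding branch has finished.
        return {name: future.result() for name, future in futures.items()}


branch_results = run_branches("ATGGCC", {"gc": gc_count, "rev": reverse})
```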

A production example: Drug Target Identification
Rational bioinformatics prioritization
  • In collaboration with biologists, identify desirable characteristics of a drug target
  • Integrate relevant data from large datasets
  • Combine data and score each target based on the presence or absence of desirable characteristics
  • Prioritize targets based on their overall score
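
The scoring and ranking steps above can be sketched as a weighted presence/absence sum. The weights, characteristic names and gene names are invented for illustration, and this is not claimed to be the scoring scheme used in the production system.

```python
def score_target(features, weights):
    """Sum the weights of the desirable characteristics that are present."""
    return sum(w for name, w in weights.items() if features.get(name, False))


def prioritize(targets, weights):
    """Return (gene, score) pairs, highest score first."""
    scored = [(gene, score_target(f, weights)) for gene, f in targets.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical weights and genes, for illustration only.
weights = {"essential": 3, "no_human_homologue": 2, "assayable": 1}
targets = {
    "geneA": {"essential": True, "assayable": True},
    "geneB": {"essential": True, "no_human_homologue": True, "assayable": True},
    "geneC": {"assayable": True},
}
ranking = prioritize(targets, weights)
```

Keeping the weights in a plain mapping means the biologist's criteria can be changed without touching the scoring code.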

A production example: Drug Target Identification
Rational bioinformatics prioritization

[Diagram. Data labels: Essentiality, Homology, Expression, Druggable domains, Structure, Pathways, Epidemiology, Assayability; DB. Process labels: Scientist defines desirable criteria; AssessDrugTarget; Assign weights; Produce a score for each gene; Select targets for promotion to D1. Other labels: Competitive advantage, Literature, Legal position, Biological feasibility.]

  • Hasan S, Daugelat S, Rao PSS, Schreiber M (2006) Prioritizing genomic drug targets in pathogens: Application to Mycobacterium tuberculosis. PLoS Comput Biol 2(6):e61

Workflow Management System
Controlling the workflow
  • A WMS should provide a means to execute a workflow in a controlled way.
  • Ideally it will also provide:
    • Logging
    • Messaging
    • Security and provenance management
    • Scheduling and load balancing
    • Exception handling
    • Resource pooling (e.g. DB connections)
  • Much of the above is readily available from a JEE/.NET application server
    • JBoss, Glassfish
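
Two of the features listed above, logging and exception handling, can be sketched with a small step runner. This is an illustrative sketch only: `run_steps` and the step names are hypothetical, and an application server would provide far richer versions of these facilities.

```python
import logging

logger = logging.getLogger("wms_sketch")


def run_steps(steps, data):
    """Run (name, function) steps in order, logging each and stopping on failure.

    Returns (result, failed_step_name); failed_step_name is None on success.
    """
    for name, func in steps:
        logger.info("starting %s", name)
        try:
            data = func(data)
        except Exception:
            # Record the traceback and report which step failed.
            logger.exception("step %s failed", name)
            return data, name
        logger.info("finished %s", name)
    return data, None


steps = [("parse", int), ("double", lambda x: x * 2)]
ok_result = run_steps(steps, "21")
bad_result = run_steps(steps, "not a number")
```

The failed run returns the last good data together with the name of the failing step, so a caller can decide whether to retry, skip or abort.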

Workflow Design System
Building the workflow
  • Many workflow management systems (WMS) are also workflow design systems (WDS)
    • E.g. Taverna, Pipeline Pilot, InforSense
  • A GUI that allows rapid workflow development
    • Increases productivity and encourages experimentation
    • Drag and drop assembly of a workflow
  • Provides an API or scripting interface to allow the design of new nodes
  • A simple scripting interface would also be an alternative to using a GUI for design

Simple Data Mining Workflow
  • Each node has a discrete function.
  • Internally the processing can be complex (e.g. a decision tree) but the input and output are simple and generic.
  • Self documenting.
  • Can be run by other users.

Annotation
Finding malaria kinases
  • Semi-automated annotation

Advanced annotation
Combining multiple services

Workflows become nodes
Standing on the shoulders of giants
  • Elements of workflows that are frequently re-used should become nodes.
  • Workflow re-use, Object oriented workflows
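
A minimal sketch of this idea, assuming each node is a plain function from input to output: a workflow built from nodes is itself a function, so it can be dropped into a larger workflow as a single node. The node names and `make_workflow` are hypothetical.

```python
def make_workflow(*nodes):
    """Chain nodes into a workflow; the result is itself a callable node."""
    def workflow(data):
        for node in nodes:
            data = node(data)
        return data
    return workflow


# Hypothetical nodes, each doing one small job.
def strip_whitespace(seq):
    return "".join(seq.split())


def uppercase(seq):
    return seq.upper()


def gc_fraction(seq):
    return (seq.count("G") + seq.count("C")) / len(seq)


normalize = make_workflow(strip_whitespace, uppercase)  # a small workflow
analyze = make_workflow(normalize, gc_fraction)         # re-used here as one node

value = analyze("ac gt\ngg cc")
```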

Example: From Arrays to Pathways
Using whole workflows as nodes
  • Process an array and find the over-represented KEGG pathways and NCBI processes.

Workflow design systems promote rapid development
  • Finding orthologues and paralogues using whole genome pairwise blast.
  • Development of the workflow took about 5 minutes.

Workflow design systems promote experimentation
Mind map data analysis

Integration Via Ontology
  • Workflows in bioinformatics typically do a lot of integration before and/or after analysis.
  • Integration is normally done using joins and filters.
    • Using equality and Boolean operations.
      • E.g. type = protease OR type = serine protease …
  • It should be possible to evaluate joins and filters using an ontology.
    • E.g. filtering for proteases would automatically include all subconcepts.
  • Data sets could be quickly mapped using custom ontologies.
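
The protease example above can be sketched as a filter that first expands a concept to all of its subconcepts. The toy ontology, the row layout and the function names are assumptions made for illustration; a real system would load the hierarchy from an ontology file or service.

```python
def descendants(concept, children):
    """Return the concept plus every subconcept reachable through `children`."""
    found, stack = set(), [concept]
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(children.get(current, ()))
    return found


def filter_by_concept(rows, concept, children):
    """Keep rows whose type is the concept or any of its subconcepts."""
    wanted = descendants(concept, children)
    return [row for row in rows if row["type"] in wanted]


# Hypothetical toy ontology: parent -> direct subconcepts.
children = {
    "protease": ["serine protease", "cysteine protease"],
    "serine protease": ["trypsin-like protease"],
}
rows = [
    {"id": "p1", "type": "serine protease"},
    {"id": "p2", "type": "kinase"},
    {"id": "p3", "type": "trypsin-like protease"},
    {"id": "p4", "type": "protease"},
]
hits = [row["id"] for row in filter_by_concept(rows, "protease", children)]
```

The caller writes a single condition, `protease`, instead of an explicit OR over every known subtype, and new subconcepts are picked up as soon as the ontology is updated.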

Simplifying Service Integration
Expose an API
  • All programs likely to be called by a workflow management system should publish a webservice or expose a scripting API.
  • Easier to learn than a full Java or C API.
  • Should be based on an existing scripting language not a new one.
    • Python, Groovy, Ruby or Perl
  • While you are at it, expose your stack via the scripting language.
    • Imagine what could be done with BLAST if the stack could be manipulated via scripting.

Web Services and Service Oriented Architecture
‘Outsourcing your processing’
  • Webservices
    • Services can reside on different servers
    • Platform independent HTTP protocol
    • CGI, REST, XML-RPC, SOAP
    • SOAP is the easiest to generically connect to and parse
    • Results are available as XML
  • Service Oriented Architecture
    • Usually implies web services
    • SOA promotes re-use and simplifies maintenance
    • Bottleneck shifts from CPU time to network availability

Resource Oriented Architecture
Outsourcing your data warehouse
  • Bioinformatics is very resource intensive
  • ROA simplifies maintenance and removes the need for synchronization.
  • Many resources are now accessible by webservices in XML format

Resource Oriented Architecture
The challenges
  • Network latency can become a major problem
    • Intelligent caching and increased network speed are a must
  • Requires resource discovery and cross referencing
    • RDF and Ontology will play an increasingly important role
    • Workflow management systems will need to understand these
  • Increasingly workflows will make use of loosely-coupled interoperable resources and services.
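
One simple reading of the caching point above is a time-limited cache placed in front of a slow remote lookup. The sketch below assumes a hypothetical `fake_remote_lookup` function standing in for a network call; the class name and the 60-second lifetime are arbitrary choices for illustration.

```python
import time


class TTLCache:
    """Cache results of a slow fetch function for `ttl` seconds per key."""

    def __init__(self, fetch, ttl, clock=time.monotonic):
        self.fetch, self.ttl, self.clock = fetch, ttl, clock
        self.store = {}

    def get(self, key):
        now = self.clock()
        entry = self.store.get(key)
        if entry is not None and now - entry[0] < self.ttl:
            return entry[1]  # still fresh: no remote call needed
        value = self.fetch(key)
        self.store[key] = (now, value)
        return value


calls = []


def fake_remote_lookup(accession):
    """Stand-in for a network call; records how often it is invoked."""
    calls.append(accession)
    return f"record for {accession}"


cache = TTLCache(fake_remote_lookup, ttl=60)
first = cache.get("P12345")
second = cache.get("P12345")
```

The second request is answered locally, so repeated lookups inside a workflow pay the network latency only once per lifetime window.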

Business Processes
From proactive to reactive
  • Business processes are long running, asynchronous processes
    • Typically they react to events, e.g. a change in a stock price.
      • ‘Push’ vs ‘Pull’ model of data access.
    • Known as ‘programming in the large’
    • Defined using BPEL with very heavy use of SOA and ROA
  • Currently, most workflows are explicitly executed, ‘short running’, synchronous processes
  • Bioinformatics will increasingly use business processes
    • React to streaming machine data
    • Continuously process literature or database updates
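
The ‘push’ model above can be illustrated with a tiny in-process publish/subscribe sketch: handlers register once and are then called whenever an event arrives, rather than polling for data. This is not BPEL and it is synchronous; the class, topic names and handler are hypothetical and show only the direction of control.

```python
class EventBus:
    """Minimal publish/subscribe: handlers are pushed events as they occur."""

    def __init__(self):
        self.handlers = {}

    def subscribe(self, topic, handler):
        self.handlers.setdefault(topic, []).append(handler)

    def publish(self, topic, payload):
        # Only handlers registered for this topic react to the event.
        for handler in self.handlers.get(topic, []):
            handler(payload)


processed = []
bus = EventBus()
bus.subscribe("database_update", lambda entry: processed.append(entry.upper()))

bus.publish("database_update", "new sequence")
bus.publish("unrelated_topic", "ignored")
```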

Web Service Choreography
Will it be relevant to bioinformatics?
  • Business processes and workflows are ‘orchestrations’
    • Scope is limited to one participant
    • The BP or the Workflow talks to other participants but doesn’t care how they do their job or how they are managed.
  • Choreography involves the management of several loosely coupled BPs
    • A network of long running asynchronous BPs that react to the behavior of their peers.
    • Choreography of workflows would require a standard workflow description or exposure of a workflow as a business process

[Diagram labels: Choreography, BP, Web Service, Workflow, Node, ???; relations labelled ‘One to Many’ (four times)]

Conclusions
Design and management
  • Workflows are created using a workflow design system and executed on a workflow management system
  • A well-designed workflow management system can considerably increase productivity
  • Promotes workflow re-use and helps organize a multi-user environment
  • A good design system allows rapid development of a workflow
  • A good design system promotes experimentation and data exploration

Conclusions
The future
  • Ontology will play an increasing role in data integration
    • Join and Filter operations that can reason over an ontology model
  • Business processes and web choreography will become more relevant to bioinformatics
    • ‘Live’ data favors programming ‘in the large’
    • Workflows exposed as business processes
    • Network speed and optimal caching are key
  • All of these approaches have been used before
    • Used and proven in business intelligence
    • Bioinformatics needs to acquaint itself with modern IT practice and stop re-inventing technology
