
Bioinformatics workflow management

Bioinformatics workflow management: thoughts and case studies from industry. Mark Schreiber, Bioinformatics Research Investigator. WWWFG, 5-7 June 2007. Outline: integration and workflows; early attempts; case studies and examples; what does the future hold?; conclusions.


Presentation Transcript


  1. Bioinformatics workflow management: Thoughts and case studies from industry. Mark Schreiber, Bioinformatics Research Investigator. WWWFG, 5-7 June 2007

  2. Outline • Integration and workflows • Early attempts • Case studies and examples • What does the future hold? • Conclusions 2 | Bioinformatics workflow management | Mark Schreiber

  3. Bioinformatics at NITD • Data Integration: Ontologies, Standards, DBs • Knowledge Discovery: Algorithms, Informatics, Machine Learning • Modelling: Pathways, Circuits, Abstraction

  4. Bioinformatics at NITD • BI combines data gathering, data storage and knowledge management with analytical tools to present complex and competitive information to planners and decision makers. • Hypothesis Generation and Validation. Providing the right information at the right time. • Decision Support.

  5. Data Sources: Heterogeneity • The most significant research is done when heterogeneous data sources can be combined in one analysis.

  6. Applications (Services): Yet more heterogeneity • RDBMS: Oracle, MySQL, PostgreSQL etc. • Open Source: usually just a command line interface • Commercial software: API, scripting engine, webservice • Web services and Web resources • Integration is rarely seamless

  7. Productivity vs. Innovation: Finding a balance • Development and manufacturing prioritize productivity • Research requires more innovation • Standardization increases productivity • Standardization limits innovation at the level it is applied • Standardization promotes innovation at higher levels • Workflows give a nice balance

  8. What is a workflow? In Bioinformatics • A data-driven procedure consisting of one or more transformation processes (nodes). • Can be represented as a directed graph. • Direction is time: the order of transformations. • A set of transformation rules. • A flow of data from its source to a destination (or result) via a series of merges, joins, manipulations and interconnected tools (services). • A specification designed in a Workflow Design System (modeling component) and run by a Workflow Management System (execution component).
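The directed-graph view of a workflow can be sketched in a few lines of Python. This is only an illustration, not any particular system's API: the node names and functions (`load`, `normalise`, `filter_short`) and the sample sequences are hypothetical, and `graphlib` assumes Python 3.9 or later.

```python
from graphlib import TopologicalSorter

# Hypothetical transformation nodes: each does one small job.
def load():
    return ["ATGC", "atgcc", "AT"]

def normalise(seqs):
    return [s.upper() for s in seqs]

def filter_short(seqs):
    return [s for s in seqs if len(s) >= 4]

NODES = {"load": load, "normalise": normalise, "filter_short": filter_short}

# Directed edges: each node maps to the set of nodes it depends on.
DEPENDS_ON = {
    "load": set(),
    "normalise": {"load"},
    "filter_short": {"normalise"},
}

def run_workflow(nodes, depends_on):
    """Run every node after its dependencies, passing upstream outputs as arguments."""
    results = {}
    for name in TopologicalSorter(depends_on).static_order():
        inputs = [results[dep] for dep in sorted(depends_on[name])]
        results[name] = nodes[name](*inputs)
    return results

results = run_workflow(NODES, DEPENDS_ON)
print(results["filter_short"])
```

The topological order plays the role of "direction is time": a node only runs once everything it depends on has produced a result.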

  9. The UNIX Philosophy: Analogy to workflows • Write programs that do one thing and do it well • Write programs that work together • Write programs to handle text streams, because that is the universal interface • Text formatted as XML • Do one thing and do it well • A workflow is made up of nodes that do one thing and do it well • So is a Service Oriented Architecture (SOA)
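As a rough illustration of the analogy, small single-purpose stages can be chained over a text stream much like a Unix pipe. The stage names (`read_lines`, `grep`, `cut_first_field`) and the sample data are made up for this sketch.

```python
def read_lines(text):
    """Source stage: emit one line at a time."""
    for line in text.splitlines():
        yield line

def grep(lines, needle):
    """Keep only lines containing the given substring."""
    for line in lines:
        if needle in line:
            yield line

def cut_first_field(lines, sep="\t"):
    """Keep only the first separator-delimited field of each line."""
    for line in lines:
        yield line.split(sep)[0]

def pipeline(source, *stages):
    """Feed the output stream of each stage into the next one."""
    stream = source
    for stage in stages:
        stream = stage(stream)
    return stream

# Hypothetical tab-separated annotation records.
SAMPLE = "geneA\tkinase\ngeneB\tprotease\ngeneC\tkinase"

kinases = list(pipeline(
    read_lines(SAMPLE),
    lambda stream: grep(stream, "kinase"),
    cut_first_field,
))
print(kinases)
```

Each stage knows nothing about its neighbours beyond the shared stream format, which is the property the slide attributes to workflow nodes and to services in an SOA.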

  10. An early attempt: Polymer. Unix shell scripts + Biojava objects • Biojava is a large API of Java objects that are useful for bioinformatics. • Biojava objects can be assembled into mini-programs that ‘do one thing and do it well’. • Polymer combines these mini-programs into a very simple workflow using Unix shell scripts. • Much like Unix piping. • Unfortunately it instantiates multiple JVMs • Lacks management and logging systems

  11. How could Polymer have been better? • Provide an execution class and allow it to execute a script. • This would mean only one JVM is launched and could allow for threading of branches in the script. • Use Groovy script instead of Unix shell script. • But Groovy hadn’t been invented at the time. • At the same time workflow management systems were emerging which made Polymer redundant.

  12. A production example: Drug Target Identification. Rational bioinformatics prioritization • In collaboration with biologists identify desirable characteristics of a drug target • Integrate relevant data from large datasets • Combine data and score each target based on the presence or absence of desirable characteristics • Prioritize targets based on their overall score

  13. A production example: Drug Target Identification. Rational bioinformatics prioritization • [Diagram labels. Data: Essentiality, Homology, Expression, Druggable domains, Structure, Pathways, Epidemiology, Assayability, DB. Steps: Scientist defines desirable criteria; AssessDrugTarget; Assign weights; Produce a score for each gene; Select targets for promotion to D1. Other labels: Competitive advantage, Literature, Legal position, Biological feasibility.] • Hasan S, Daugelat S, Rao PSS, Schreiber M (2006) Prioritizing genomic drug targets in pathogens: Application to Mycobacterium tuberculosis. PLoS Comput Biol 2(6):e61
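The score-and-prioritize idea from the drug target example might look something like the sketch below. It is not the published AssessDrugTarget method: the criteria names, weights and gene identifiers are all hypothetical, and a simple weighted sum is assumed.

```python
# Hypothetical weights for desirable characteristics (chosen by the scientist).
WEIGHTS = {
    "essential": 3.0,
    "no_human_homologue": 2.0,
    "druggable_domain": 2.0,
    "assayable": 1.0,
}

# Hypothetical integrated data: the characteristics present for each gene.
TARGETS = {
    "geneA": {"essential", "druggable_domain"},
    "geneB": {"essential", "no_human_homologue", "assayable"},
    "geneC": {"assayable"},
}

def score(characteristics, weights):
    """Sum the weights of every desirable characteristic that is present."""
    return sum(w for name, w in weights.items() if name in characteristics)

def prioritise(targets, weights):
    """Rank genes by descending score, breaking ties alphabetically."""
    scored = [(gene, score(chars, weights)) for gene, chars in targets.items()]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))

ranking = prioritise(TARGETS, WEIGHTS)
for gene, total in ranking:
    print(gene, total)
```

Keeping the weights in one table separate from the data makes it easy for a biologist to adjust the criteria and re-run the ranking.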

  14. Workflow Management System: Controlling the workflow • A WMS should provide a means to execute a workflow in a controlled way. • Ideally it will also provide: • Logging • Messaging • Security and provenance management • Scheduling and load balancing • Exception handling • Resource pooling (e.g. DB connections) • Much of the above is easily accessible from a JEE/.NET application server • JBoss, Glassfish
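Two of the listed services, logging and exception handling, can be sketched with a small runner. This is a toy under stated assumptions, not how any named WMS or application server works; `run_controlled` and the step names are hypothetical.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("wms")

def run_controlled(steps, data):
    """Run (name, function) steps in order, logging each one.

    Stops at the first failing step and reports which node failed
    instead of letting the exception escape.
    """
    for name, func in steps:
        log.info("starting node %s", name)
        try:
            data = func(data)
        except Exception as exc:
            log.error("node %s failed: %s", name, exc)
            return {"status": "failed", "node": name, "error": str(exc)}
        log.info("finished node %s", name)
    return {"status": "ok", "result": data}

ok = run_controlled([("double", lambda x: x * 2), ("inc", lambda x: x + 1)], 5)
bad = run_controlled([("double", lambda x: x * 2), ("divide", lambda x: x / 0)], 5)
print(ok)
print(bad["status"], bad["node"])
```

Scheduling, security, provenance and resource pooling are deliberately left out; those are the parts an application server would normally supply.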

  15. Workflow Design System: Building the workflow • Many WMS systems are also a WDS • E.g. Taverna, Pipeline Pilot, Inforsense • A GUI that allows rapid workflow development • Increases productivity and encourages experimentation • Drag and drop assembly of a workflow • Provides an API or scripting interface to allow the design of new nodes • A simple scripting interface would also be an alternative to using a GUI for design

  16. Simple Data Mining Workflow • Each node has a discrete function. • Internally the processing can be complex (e.g. Decision Tree) but input and output is simple and generic. • Self documenting. • Can be run by other users.

  17. Annotation: Finding malaria kinases • Semi-automated annotation

  18. Advanced annotation: Combining multiple services

  19. Workflows become nodes: Standing on the shoulders of giants • Elements of workflows that are frequently re-used should become nodes. • Workflow re-use, Object oriented workflows
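One way to picture a workflow becoming a node is to make the workflow object callable with the same interface as a node, so it can be nested inside another workflow. The `Workflow` class and the sequence-handling functions below are hypothetical and only illustrate the composition idea.

```python
class Workflow:
    """A sequence of callables that is itself callable, so it can act as a node."""

    def __init__(self, name, *nodes):
        self.name = name
        self.nodes = nodes

    def __call__(self, data):
        # Pass the data through each node (or nested workflow) in order.
        for node in self.nodes:
            data = node(data)
        return data

def strip_whitespace(seqs):
    return [s.strip() for s in seqs]

def to_upper(seqs):
    return [s.upper() for s in seqs]

def gc_content(seqs):
    """Fraction of G and C characters in each sequence."""
    return {s: (s.count("G") + s.count("C")) / len(s) for s in seqs}

# A frequently re-used cleaning workflow ...
clean = Workflow("clean", strip_whitespace, to_upper)
# ... becomes a single node of a larger workflow.
analyse = Workflow("analyse", clean, gc_content)

result = analyse([" atgc ", "ggcc"])
print(result)
```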

  20. Example: From Arrays to Pathways. Using whole workflows as nodes • Process an array and find the over-represented KEGG pathways and NCBI processes.

  21. Workflow design systems promote rapid development • Finding orthologues and paralogues using whole genome pairwise BLAST. • Development of the workflow took about 5 minutes.

  22. Workflow design systems promote experimentation: Mind map data analysis

  23. Integration Via Ontology • Workflows in bioinformatics typically do a lot of integration before and/or after analysis. • Integration is normally done using joins and filters. • Using equality and Boolean operations. • E.g. type = protease OR type = serine protease … • Joins and filters should be able to be evaluated using ontology. • E.g. filtering for proteases would include all subconcepts automatically. • Data sets could be quickly mapped using custom ontologies.
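A filter that reasons over an ontology, rather than over string equality, could be sketched as follows. The tiny is-a hierarchy, the record identifiers and the `is_a` helper are invented for illustration, and the hierarchy is assumed to contain no cycles.

```python
# Hypothetical mini ontology: child concept -> parent concept.
IS_A = {
    "serine protease": "protease",
    "cysteine protease": "protease",
    "trypsin-like protease": "serine protease",
    "protease": "enzyme",
    "kinase": "enzyme",
}

def is_a(term, concept, is_a_map):
    """True if term equals concept or is a (transitive) subconcept of it.

    Assumes the is-a map is acyclic.
    """
    while term is not None:
        if term == concept:
            return True
        term = is_a_map.get(term)
    return False

# Hypothetical records: (identifier, annotated type).
RECORDS = [
    ("P1", "trypsin-like protease"),
    ("P2", "kinase"),
    ("P3", "cysteine protease"),
    ("P4", "protease"),
]

# One condition replaces the chain of "type = ... OR type = ..." tests.
proteases = [rid for rid, typ in RECORDS if is_a(typ, "protease", IS_A)]
print(proteases)
```

New subconcepts added to the ontology are picked up by the filter without touching the workflow itself.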

  24. Simplifying Service Integration: Expose an API • All programs likely to be called by a workflow management system should publish a webservice or expose a scripting API. • Easier to learn than a full Java or C API. • Should be based on an existing scripting language not a new one. • Python, Groovy, Ruby or Perl • While you are at it expose your stack via the scripting language. • Imagine what could be done with BLAST if the stack could be manipulated via scripting.

  25. Web Services and Service Oriented Architecture: ‘Outsourcing your processing’ • Webservices • Services can reside on different servers • Platform independent HTTP protocol • CGI, REST, XML-RPC, SOAP • SOAP is the easiest to generically connect to and parse • Results are available as XML • Service Oriented Architecture • Usually implies web services • SOA promotes re-use and simplifies maintenance • Bottleneck shifts from CPU time to network availability
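Because results come back as XML, a workflow node that consumes a service mostly has to parse the payload. The sketch below uses Python's standard `xml.etree.ElementTree`; the response document, its element and attribute names, and `parse_hits` are all hypothetical, and it is a plain XML payload rather than a full SOAP envelope.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML returned by some remote search service.
SAMPLE_RESPONSE = """<hits query="geneA">
  <hit id="Q1" score="87.5"/>
  <hit id="Q2" score="42.0"/>
</hits>"""

def parse_hits(xml_text, min_score=0.0):
    """Return (id, score) pairs for every <hit> at or above min_score."""
    root = ET.fromstring(xml_text)
    hits = []
    for hit in root.iter("hit"):
        value = float(hit.get("score"))
        if value >= min_score:
            hits.append((hit.get("id"), value))
    return hits

hits = parse_hits(SAMPLE_RESPONSE, min_score=50.0)
print(hits)
```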

  26. Resource Oriented Architecture: Outsourcing your data warehouse • Bioinformatics is very resource intensive • ROA simplifies maintenance and removes the need for synchronization. • Many resources are now accessible by webservices in XML format

  27. Resource Oriented Architecture: The challenges • Network latency can become a major problem • Intelligent caching and increased network speed are a must • Requires resource discovery and cross referencing • RDF and Ontology will play an increasingly important role • Workflow management systems will need to understand these • Increasingly workflows will make use of loosely-coupled interoperable resources and services.
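The simplest form of caching for remote resources is to memoise lookups so that repeated requests never touch the network. In this sketch `slow_remote_lookup` is a stand-in for a real network call and the accession strings are placeholders; `functools.lru_cache` is from the Python standard library.

```python
from functools import lru_cache

CALLS = {"count": 0}

def slow_remote_lookup(accession):
    """Stand-in for a network request to a remote resource."""
    CALLS["count"] += 1
    return f"record for {accession}"

@lru_cache(maxsize=1024)
def cached_lookup(accession):
    """Return the record, contacting the remote resource only on a cache miss."""
    return slow_remote_lookup(accession)

# Four lookups, but only two distinct accessions.
for acc in ["P12345", "P67890", "P12345", "P12345"]:
    cached_lookup(acc)

print(CALLS["count"])              # number of simulated network calls
print(cached_lookup.cache_info())  # hits and misses recorded by the cache
```

A cache like this never expires entries on its own, so a real deployment would also need a policy for noticing when the remote resource has changed.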

  28. Business Processes: From proactive to reactive • Business processes are long running, asynchronous processes • Typically they react to events, e.g. a change in a stock price. • ‘Push’ vs ‘Pull’ model of data access. • Known as ‘programming in the large’ • Defined using BPEL with very heavy use of SOA and ROA • Currently, most workflows are explicitly executed, ‘short running’, synchronous processes • Bioinformatics will increasingly use business processes • React to streaming machine data • Continuously process literature or database updates
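The ‘push’ model can be illustrated with a minimal publish/subscribe sketch in which processing is triggered by an event rather than by a user running a workflow. `EventBus`, the topic names and the payloads are hypothetical, and the example is synchronous and in-process, unlike a real long-running BPEL process.

```python
class EventBus:
    """Minimal in-process publish/subscribe mechanism."""

    def __init__(self):
        self._handlers = {}

    def subscribe(self, topic, handler):
        self._handlers.setdefault(topic, []).append(handler)

    def publish(self, topic, payload):
        # Push the event to every handler registered for this topic.
        for handler in self._handlers.get(topic, []):
            handler(payload)

processed = []

def reannotate(update):
    """Hypothetical reaction to a database update event."""
    processed.append(f"re-annotated {update['gene']}")

bus = EventBus()
bus.subscribe("database_update", reannotate)

bus.publish("database_update", {"gene": "geneA"})    # triggers reannotate
bus.publish("literature_update", {"title": "example"})  # no subscriber, ignored

print(processed)
```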

  29. Web Service Choreography: Will it be relevant to bioinformatics? • Business processes and workflows are ‘orchestrations’ • Scope is limited to one participant • The BP or the Workflow talks to other participants but doesn’t care how they do their job or how they are managed. • Choreography involves the management of several loosely coupled BPs • A network of long running asynchronous BPs that react to the behavior of their peers. • Choreography of workflows would require a standard workflow description or exposure of a workflow as a business process • [Diagram labels: Choreography, BP, Web Service; ???, Workflow, Node; the levels are linked by ‘One to Many’ relationships]

  30. Conclusions: Design and management • Workflows are created using a workflow design system and executed on a workflow management system • A well-designed workflow management system can considerably increase productivity • Promotes workflow re-use and helps organize a multi-user environment • A good design system allows rapid development of a workflow • A good design system promotes experimentation and data exploration

  31. Conclusions: The future • Ontology will play an increasing role in data integration • Join and Filter operations that can reason over an ontology model • Business processes and web choreography will become more relevant to bioinformatics • ‘Live’ data favors programming ‘in the large’ • Workflows exposed as business processes • Network speed and optimal caching are key • All of these approaches have been used before • Used and proven in business intelligence • Bioinformatics needs to acquaint itself with modern IT practice and stop re-inventing technology
