Bioinformatics workflow management: Thoughts and case studies from industry
Mark Schreiber, Bioinformatics Research Investigator
WWWFG, 5-7 June 2007
Outline
• Integration and workflows
• Early attempts
• Case studies and examples
• What does the future hold?
• Conclusions
2 | Bioinformatics workflow management | Mark Schreiber
Bioinformatics at NITD
• Data Integration: Ontologies, Standards, DBs
• Knowledge Discovery: Algorithms, Informatics, Machine Learning
• Modelling: Pathways, Circuits, Abstraction
Bioinformatics at NITD
• BI combines data gathering, data storage and knowledge management with analytical tools to present complex and competitive information to planners and decision makers.
• Hypothesis Generation and Validation. Providing the right information at the right time.
• Decision Support.
Data Sources: Heterogeneity
• The most significant research is done when heterogeneous data sources can be combined in one analysis.
Applications (Services): Yet more heterogeneity
• RDBMS: Oracle, MySQL, PostgreSQL etc.
• Open Source: usually just a command line interface
• Commercial software: API, scripting engine, web service
• Web services and Web resources
• Integration is rarely seamless
Productivity vs. Innovation: Finding a balance
• Development and manufacturing prioritize productivity
• Research requires more innovation
• Standardization increases productivity
• Standardization limits innovation at the level it is applied
• Standardization promotes innovation at higher levels
• Workflows give a nice balance
What is a workflow? In Bioinformatics
• A data-driven procedure consisting of one or more transformation processes (nodes).
• Can be represented as a directed graph.
  • Direction is time: the order of transformations.
• A set of transformation rules.
• A flow of data from its source to a destination (or result) via a series of merges, joins, manipulations and interconnected tools (services).
• A specification designed in a Workflow Design System (modeling component) and run by a Workflow Management System (execution component).
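The directed-graph view can be made concrete with a small sketch. This is only an illustration under stated assumptions, not code from the talk: the `run_workflow` helper and the node names (`clean`, `length`, `gc`, `report`) are hypothetical, and the ordering relies on `graphlib.TopologicalSorter` from the Python standard library (3.9 or later).

```python
from graphlib import TopologicalSorter


def run_workflow(nodes, edges, source):
    """Run every node after all of its upstream nodes have produced output.

    nodes: node name -> function taking a list of upstream outputs
    edges: node name -> set of upstream node names (empty for an entry node)
    """
    results = {}
    # static_order() yields each node only after all of its predecessors.
    for name in TopologicalSorter(edges).static_order():
        upstream = [results[dep] for dep in sorted(edges.get(name, ()))]
        # Entry nodes have no upstream output, so they receive the source data.
        inputs = upstream if upstream else [source]
        results[name] = nodes[name](inputs)
    return results


# Hypothetical workflow: clean -> (length, gc) -> report
nodes = {
    "clean": lambda ins: ins[0].strip().upper(),
    "length": lambda ins: len(ins[0]),
    "gc": lambda ins: sum(ins[0].count(base) for base in "GC"),
    # Upstream outputs arrive sorted by node name: gc first, then length.
    "report": lambda ins: {"gc": ins[0], "length": ins[1]},
}
edges = {
    "clean": set(),
    "length": {"clean"},
    "gc": {"clean"},
    "report": {"gc", "length"},
}

results = run_workflow(nodes, edges, " atgcgc\n")
print(results["report"])
```

Keeping the graph (`edges`) separate from the transformations (`nodes`) mirrors the split between a design component and an execution component: the same graph description could in principle be handed to a different runner.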
The UNIX Philosophy: Analogy to workflows
• Write programs that do one thing and do it well
• Write programs that work together
• Write programs to handle text streams, because that is the universal interface
  • Text formatted as XML
• Do one thing and do it well
  • A workflow is made up of nodes that do one thing and do it well
  • So is a Service Oriented Architecture (SOA)
An early attempt: Polymer (Unix shell scripts + Biojava objects)
• Biojava is a large API of Java objects that are useful for bioinformatics.
• Biojava objects can be assembled into mini-programs that ‘do one thing and do it well’.
• Polymer combines these mini-programs into a very simple workflow using Unix shell scripts.
  • Much like Unix piping.
• Unfortunately it instantiates multiple JVMs
• Lacks management and logging systems
How could Polymer have been better?
• Provide an execution class and allow it to execute a script.
  • This would mean only one JVM is launched and could allow for threading of branches in the script.
• Use Groovy script instead of Unix shell script.
  • But Groovy hadn’t been invented at the time.
• At the same time, workflow management systems were emerging, which made Polymer redundant.
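The 'single execution class' idea can be sketched as follows. This is a Python analogy rather than the Java design the slide has in mind, and it is not Polymer's code: the `Executor` class and the `reverse` and `complement` steps are hypothetical. It shows all steps running inside one process, with independent branches handed to a thread pool.

```python
from concurrent.futures import ThreadPoolExecutor


class Executor:
    """Runs every step in one process instead of one process per step."""

    def __init__(self, max_workers=4):
        self.max_workers = max_workers

    def run(self, source, branches, merge):
        # A branch is a list of functions applied in order, like a shell pipe.
        def run_branch(steps):
            value = source
            for step in steps:
                value = step(value)
            return value

        # Independent branches run in threads; map() keeps the branch order.
        with ThreadPoolExecutor(max_workers=self.max_workers) as pool:
            outputs = list(pool.map(run_branch, branches))
        return merge(outputs)


def reverse(seq):
    return seq[::-1]


def complement(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))


result = Executor().run("ATGC", [[reverse, complement], [len]], merge=tuple)
print(result)
```

Because the steps are plain function calls, start-up cost is paid once; the trade-off is that a misbehaving step can affect the whole process, which separate processes would isolate.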
A production example: Drug Target Identification (Rational bioinformatics prioritization)
• In collaboration with biologists, identify desirable characteristics of a drug target
• Integrate relevant data from large datasets
• Combine data and score each target based on the presence or absence of desirable characteristics
• Prioritize targets based on their overall score
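The score-and-prioritize steps listed above can be sketched as a weighted presence/absence sum. This is an assumed, simplified scheme for illustration only; the weights, trait names and gene names are hypothetical and are not those used in the production workflow or the associated publication.

```python
def score_targets(genes, weights):
    """Return (gene, score) pairs sorted from highest to lowest score."""
    scored = []
    for gene, traits in genes.items():
        # Add the weight of each desirable characteristic the gene has.
        score = sum(w for trait, w in weights.items() if traits.get(trait, False))
        scored.append((gene, score))
    # Highest score first; ties are broken alphabetically by gene name.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical weights and annotations, for illustration only.
weights = {"essential": 3, "druggable_domain": 2, "assayable": 1}
genes = {
    "geneA": {"essential": True, "druggable_domain": True},
    "geneB": {"essential": True, "assayable": True},
    "geneC": {"druggable_domain": True},
}

ranking = score_targets(genes, weights)
print(ranking)
```

Keeping the weights in a separate table lets the scientist adjust the criteria without touching the scoring logic.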
A production example: Drug Target Identification (Rational bioinformatics prioritization)
[Figure: AssessDrugTarget diagram. Criteria labels: Essentiality, Homology, Expression, Druggable domains DB, Structure, Pathways, Epidemiology, Assayability. Step labels: Scientist defines desirable criteria; Assign weights; Produce a score for each gene; Select targets for promotion to D1. Other labels: Competitive advantage, Literature, Legal position, Biological feasibility.]
• Hasan S, Daugelat S, Rao PSS, Schreiber M (2006) Prioritizing genomic drug targets in pathogens: Application to Mycobacterium tuberculosis. PLoS Comput Biol 2(6):e61
Workflow Management System: Controlling the workflow
• A WMS should provide a means to execute a workflow in a controlled way.
• Ideally it will also provide:
  • Logging
  • Messaging
  • Security and provenance management
  • Scheduling and load balancing
  • Exception handling
  • Resource pooling (e.g. DB connections)
• Much of the above is readily available from a JEE application server (e.g. JBoss, Glassfish) or a .NET application server
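Two of the services listed above, logging and exception handling, can be illustrated with a minimal wrapper around a node. This is a sketch under assumptions: `run_node` is a hypothetical helper, not part of any particular WMS, and it uses only the standard `logging` module.

```python
import logging

logger = logging.getLogger("workflow")


def run_node(name, func, data):
    """Run one node, logging its start, success or failure.

    Returns (result, True) on success and (None, False) on failure, so the
    caller can decide whether to continue, retry or stop the workflow.
    """
    logger.info("starting node %s", name)
    try:
        result = func(data)
    except Exception:
        # logger.exception records the message together with the traceback.
        logger.exception("node %s failed", name)
        return None, False
    logger.info("finished node %s", name)
    return result, True


ok_result = run_node("length", len, "ATG")
bad_result = run_node("parse", int, "not a number")
print(ok_result, bad_result)
```

A real management system would add the other services (scheduling, provenance, pooling) around the same call boundary.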
Workflow Design System: Building the workflow
• Many WMSs are also WDSs
  • E.g. Taverna, Pipeline Pilot, Inforsense
• A GUI that allows rapid workflow development
  • Increases productivity and encourages experimentation
  • Drag and drop assembly of a workflow
• Provides an API or scripting interface to allow the design of new nodes
• A simple scripting interface would also be an alternative to using a GUI for design
Simple Data Mining Workflow
• Each node has a discrete function.
• Internally the processing can be complex (e.g. a decision tree) but input and output are simple and generic.
• Self-documenting.
• Can be run by other users.
Annotation: Finding malaria kinases
• Semi-automated annotation
Advanced annotation: Combining multiple services
Workflows become nodes: Standing on the shoulders of giants
• Elements of workflows that are frequently re-used should become nodes.
• Workflow re-use, object-oriented workflows
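One way to picture a workflow becoming a node is to make the workflow itself callable, so it can be dropped into another workflow like any other step. The `Workflow` class and the `normalise` and `count_gc` examples below are hypothetical, a minimal sketch of the idea rather than how any particular design system implements it.

```python
class Workflow:
    """A linear chain of nodes that is itself callable, so it can act as a node."""

    def __init__(self, *nodes):
        self.nodes = nodes

    def __call__(self, data):
        # Feed the output of each node into the next one.
        for node in self.nodes:
            data = node(data)
        return data


# A frequently re-used fragment, packaged as its own workflow.
normalise = Workflow(str.strip, str.upper)

# The fragment is then used as a single node inside a larger workflow.
count_gc = Workflow(normalise, lambda seq: seq.count("G") + seq.count("C"))

gc_total = count_gc(" atgcgc ")
print(gc_total)
```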
Example: From Arrays to Pathways (Using whole workflows as nodes)
• Process an array and find the over-represented KEGG pathways and NCBI processes.
Workflow design systems promote rapid development
• Finding orthologues and paralogues using whole-genome pairwise BLAST.
• Development of the workflow took about 5 minutes.
Workflow design systems promote experimentation: Mind map data analysis
Integration Via Ontology
• Workflows in bioinformatics typically do a lot of integration before and/or after analysis.
• Integration is normally done using joins and filters.
  • Using equality and Boolean operations.
  • E.g. type = protease OR type = serine protease …
• It should be possible to evaluate joins and filters against an ontology.
  • E.g. filtering for proteases would include all subconcepts automatically.
• Data sets could be quickly mapped using custom ontologies.
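The protease example can be sketched as a filter that expands a concept to all of its subconcepts before matching. The toy ontology, the record layout and the helper names (`descendants`, `filter_by_concept`) are hypothetical; a real system would more likely query an ontology store than a hard-coded dictionary.

```python
def descendants(ontology, concept):
    """Return the concept plus every subconcept reachable through is-a links."""
    found = {concept}
    stack = [concept]
    while stack:
        current = stack.pop()
        for child in ontology.get(current, ()):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


def filter_by_concept(records, ontology, concept, key="type"):
    """Keep records whose type is the concept or any of its subconcepts."""
    wanted = descendants(ontology, concept)
    return [record for record in records if record[key] in wanted]


# Hypothetical toy ontology: parent concept -> direct subconcepts.
ontology = {
    "enzyme": ["protease", "kinase"],
    "protease": ["serine protease", "cysteine protease"],
}
records = [
    {"id": "P1", "type": "serine protease"},
    {"id": "P2", "type": "kinase"},
    {"id": "P3", "type": "protease"},
]

proteases = filter_by_concept(records, ontology, "protease")
print([record["id"] for record in proteases])
```

Compared with the explicit `type = protease OR type = serine protease …` form, the filter does not need to be edited when a new subconcept is added to the ontology.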
Simplifying Service Integration: Expose an API
• All programs likely to be called by a workflow management system should publish a web service or expose a scripting API.
  • Easier to learn than a full Java or C API.
  • Should be based on an existing scripting language, not a new one.
  • Python, Groovy, Ruby or Perl
• While you are at it, expose your stack via the scripting language.
  • Imagine what could be done with BLAST if the stack could be manipulated via scripting.
Web Services and Service Oriented Architecture: ‘Outsourcing your processing’
• Web services
  • Services can reside on different servers
  • Platform independent HTTP protocol
  • CGI, REST, XML-RPC, SOAP
  • SOAP is the easiest to generically connect to and parse
  • Results are available as XML
• Service Oriented Architecture
  • Usually implies web services
  • SOA promotes re-use and simplifies maintenance
  • Bottleneck shifts from CPU time to network availability
Resource Oriented Architecture: Outsourcing your data warehouse
• Bioinformatics is very resource intensive
• ROA simplifies maintenance and removes the need for synchronization.
• Many resources are now accessible through web services in XML format
Resource Oriented Architecture: The challenges
• Network latency can become a major problem
  • Intelligent caching and increased network speed are a must
• Requires resource discovery and cross-referencing
  • RDF and ontologies will play an increasingly important role
  • Workflow management systems will need to understand these
• Increasingly, workflows will make use of loosely coupled, interoperable resources and services.
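As one simple reading of 'intelligent caching', the sketch below keeps a fetched resource for a fixed time before asking the remote service again. It is an assumption-laden illustration: `TTLCache`, `fake_fetch` and the accession `X1` are hypothetical, and the fetch function stands in for a real network call.

```python
import time


class TTLCache:
    """Cache remote lookups for ttl_seconds to avoid repeated network calls."""

    def __init__(self, fetch, ttl_seconds, clock=time.monotonic):
        self.fetch = fetch          # function that performs the remote lookup
        self.ttl = ttl_seconds
        self.clock = clock          # injectable so expiry can be tested
        self._store = {}            # key -> (time fetched, value)

    def get(self, key):
        now = self.clock()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]           # still fresh: no network round trip
        value = self.fetch(key)
        self._store[key] = (now, value)
        return value


calls = []


def fake_fetch(accession):
    # Stand-in for a web service call; records how often it is invoked.
    calls.append(accession)
    return f"<record id='{accession}'/>"


cache = TTLCache(fake_fetch, ttl_seconds=3600)
first = cache.get("X1")
second = cache.get("X1")
print(len(calls))
```

The time-to-live is the knob that trades freshness against latency; resources that change rarely can tolerate a long value.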
Business Processes: From proactive to reactive
• Business processes are long-running, asynchronous processes
  • Typically they react to events, e.g. a change in a stock price.
  • ‘Push’ vs ‘Pull’ model of data access.
  • Known as ‘programming in the large’
  • Defined using BPEL with very heavy use of SOA and ROA
• Currently, most workflows are explicitly executed, ‘short-running’, synchronous processes
• Bioinformatics will increasingly use business processes
  • React to streaming machine data
  • Continuously process literature or database updates
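The ‘push’ model can be illustrated with a tiny publish/subscribe sketch in which a handler reacts to each literature update as it arrives, instead of a workflow being run explicitly to pull the data. Everything here is hypothetical (`EventBus`, the `literature.update` topic, the sample abstracts), and the bus is in-process and synchronous for brevity, whereas a real business process engine would be long-running and asynchronous.

```python
from collections import defaultdict


class EventBus:
    """Minimal in-process publish/subscribe mechanism."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._handlers[topic].append(handler)

    def publish(self, topic, payload):
        # Push the event to every handler registered for this topic.
        for handler in self._handlers[topic]:
            handler(payload)


flagged = []


def on_new_abstract(abstract):
    # React to each update as it arrives rather than polling for changes.
    if "kinase" in abstract["text"].lower():
        flagged.append(abstract["id"])


bus = EventBus()
bus.subscribe("literature.update", on_new_abstract)
bus.publish("literature.update", {"id": "A1", "text": "A novel kinase inhibitor"})
bus.publish("literature.update", {"id": "A2", "text": "Membrane transport"})
print(flagged)
```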
Web Service Choreography: Will it be relevant to bioinformatics?
• Business processes and workflows are ‘orchestrations’
  • Scope is limited to one participant
  • The BP or the workflow talks to other participants but doesn’t care how they do their job or how they are managed.
• Choreography involves the management of several loosely coupled BPs
  • A network of long-running asynchronous BPs that react to the behavior of their peers.
• Choreography of workflows would require a standard workflow description or exposure of a workflow as a business process
[Figure: diagram with the labels Choreography, BP, Web Service and ???, Workflow, Node; the links are marked ‘One to Many’.]
Conclusions: Design and management
• Workflows are created using a workflow design system and executed on a workflow management system
• A well-designed workflow management system can considerably increase productivity
  • Promotes workflow re-use and helps organize a multi-user environment
• A good design system allows rapid development of a workflow
• A good design system promotes experimentation and data exploration
Conclusions: The future
• Ontology will play an increasing role in data integration
  • Join and filter operations that can reason over an ontology model
• Business processes and web choreography will become more relevant to bioinformatics
  • ‘Live’ data favors programming ‘in the large’
  • Workflows exposed as business processes
  • Network speed and optimal caching are key
• All of these approaches have been used before
  • Used and proven in business intelligence
• Bioinformatics needs to acquaint itself with modern IT practice and stop re-inventing technology