
Nobody said it was easy: Semantically Discovering BioGrid Services is tricky

Professor Carole Goble, University of Manchester, UK. myGrid project: http://www.mygrid.org.uk


Presentation Transcript


  1. Nobody said it was easy: Semantically Discovering BioGrid Services is tricky • Professor Carole Goble, University of Manchester, UK • myGrid project http://www.mygrid.org.uk

  2. Environmental requirements of bioinformatics in silico experimentation • The services • Workflow execution • And the impact on describing services: • how you describe things, what to describe, and how and when to use the descriptions • different levels of descriptions • different views on services depending on whether you are middleware or a user • implications for registration

  3. Open Grid Services Architecture • Present Grid Architecture is a services architecture • Implemented using Web Services Technology • OGSA will provide Naming / Authorization / Security / Privacy • Higher level services: Workflow, Transactions, Data Mining, Knowledge Discovery, … • Exploiting Synergy: Commercial Internet with Grid Services • OGSI extends Web Services • Transient Service Instances • Service State • Lifetime management • Defines fundamental (WSDL) interfaces and behaviors that define a Grid Service • Required + optional interfaces = WS “profile” • Defines WSDL extensibility elements • E.g., serviceType (a group of portTypes)

  4. myGrid • EPSRC UK e-Science pilot project • Open Source Upper Middleware for Bioinformatics • Data intensive not compute intensive • Sharing knowledge and sharing components

  5. myGrid in a nutshell • An example of a “second generation” open service-based Grid project, specifically a test bed for the OGSI, OGSA and OGSA-DAI base services; • myGrid Information Repository that is OGSA-DAI compliant • Developing high level services for data intensive integration, rather than computationally intensive problems; • Workflow & distributed query processing • Developing high level services for e-Science experimental management; • Provenance, change notification and personalisation • Developing Semantic Grid capabilities and knowledge-based technologies, such as semantic-based resource discovery and matching. • Metadata descriptions and ontologies for service discovery, component discovery and linking components.

  6. Experiment life cycle • Service discovery • Workflow discovery & refinement • Workflow creation • Personalised service registries • Personalised workflows Forming experiments Personalisation Discovering and reusing experiments and resources Executing experiments • Service discovery • Workflow discovery & refinement • Provenance logs • Workflow enactment • Service invocation • Provenance logs Providing services & experiments Managing experiments • Service registration • Workflow deposition • Metadata annotation • Third party registration • Provenance records • Workflow evolution • Service monitoring

  7. Provenance • Experiment is repeatable, if not reproducible, and explained by provenance records • Who, what, where, why, when, (w)how? • The traceability of knowledge as it evolves and as it is derived. • Implications for recording which services were invoked on what data, when, and with what parameters. • Immutable and persistent
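
The recording requirement on this slide (which service, on what data, when, with what parameters, kept immutable) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `ProvenanceRecord`, `ProvenanceLog` and all field names are hypothetical and are not taken from myGrid, and persistence to durable storage is deliberately left out.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ProvenanceRecord:
    """One invocation: who ran which service on what data, when, and how.

    Hypothetical structure for illustration; frozen=True makes it immutable.
    """
    who: str
    service: str
    input_ref: str
    parameters: tuple  # sorted (name, value) pairs, so the record stays hashable
    when: datetime


class ProvenanceLog:
    """Append-only, in-memory log; a real store would also be persistent."""

    def __init__(self):
        self._records = []

    def record(self, who, service, input_ref, parameters):
        rec = ProvenanceRecord(
            who=who,
            service=service,
            input_ref=input_ref,
            parameters=tuple(sorted(parameters.items())),
            when=datetime.now(timezone.utc),
        )
        self._records.append(rec)
        return rec

    def records(self):
        # Return a tuple so callers cannot mutate the log through it.
        return tuple(self._records)
```

Freezing the record means any attempt to edit it after the fact raises an error, which is the property the "immutable" bullet asks for.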

  8. Architectural Overview Knowledge Services Knowledge Service Semantic registration Registry Registry Ontology Server Reasoner Structural registration UDDI Matcher Service KB Store Registry View Notification Service Notification Service RDF-based UDDI Service Discovery JMS Provenance service Workflow enactment engine Discover Workflow or Service mIR Test Data Scufl & WSFL mG Object Discovery Information Extraction Distributed Query Processor Job Execution m Info Repository Workflow templates Workflow instances PESTO Service Service Service Metadata Concepts Provenance Data SoapLab DB2 DB2

  9. Workflows • Workflow discovery • Finding workflows that others have done, and that I have done myself • Workflow specification • Finding classes of services • Guiding service composition • We don’t do automated composition • Dynamic workflow enactment service discovery and invocation • Choose service instances when running workflow • User involvement

  10. Ontologies myGrid Find Service Word-based discovery Discovery Client Find Service Syntactic discovery Semantic discovery Views UDDI-M Ontology Server Views Third party description Reasoner RDF Service FaCT publishes Description Store Matcher Gather service descriptions publishes Org. registry KAON Public registry WSDL Third Party UDDI

  11. myGrid Components ~ Demo • portal operation. • semantics to define type system. • mIR, to store, and retrieve data. • registry to describe and record services Uncharacterised DNA sequence Select an open reading frame Translate to protein BLAST search Characterised DNA sequence

  12. myGrid Components ~ Demo • Pre-existing third party application • Service invocation • Workflow enactment DNA sequence getOrf transeq prophet plotorf Proteins from a family emma prophecy Classical bioinformatics: detecting whether an uncharacterised protein domain is conserved across a group of proteins

  13. Bio Services Landscape • Wrap CORBA, Perl etc to look like web services, to become Grid services (eventually) • Multiple services • Many hundreds of different services in the public domain and privately owned • Multiple registries • 3rd party public registries, private registries, personal registries • 3rd parties • JEMBOSS, PathPort, bioMoby • Wrap our own • Soaplab • A soap-based programmatic interface to command-line applications • ~300 different classes of services • Swiss-Prot, EMBOSS, Medline, blah, blah … • http://industry.ebi.ac.uk/soap/soaplab

  14. Bio Services Problem Space • Multiple service providers of same service (not just similar service) • Many implementations of Swiss-Prot version 40 • “What and which” Discovery based on • What the service does from a domain perspective. • Which service instance has the appropriate capabilities from an operational perspective. • Users don’t care if the service is a service or a workflow. • Same “what” description from their perspective • Different “how” description from middleware perspective. SWISS-PROT SWISS-PROT@local SWISS-PROT@ncbi SWISS-PROT@ebi

  15. Consequences • We support (at least) two types of semantic service discovery: • Domain • requiring access to common application domain ontologies • Biology and bioinformatics • Service • using cross-domain knowledge independent of application • Quality of service, ownership, location, organisations … • We describe the profile of workflows as if they were services (of course a workflow could be deployed as a service…) • Should workflow descriptions be in the same registry as service descriptions, or elsewhere? • A find service must transcend the location.
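
A minimal sketch of the two-step "what and which" discovery, assuming a flat in-memory registry. The `ServiceInstance` class, the `find_what` and `find_which` functions, and the owner and location values are all hypothetical; only the SWISS-PROT instance names come from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceInstance:
    name: str          # e.g. "SWISS-PROT@ebi"
    domain_class: str  # "what": the domain-level description
    location: str      # "which": operational property
    owner: str         # "which": operational property


# Illustrative registry content; owners are placeholders.
REGISTRY = [
    ServiceInstance("SWISS-PROT@local", "protein sequence database", "local", "my lab"),
    ServiceInstance("SWISS-PROT@ncbi", "protein sequence database", "ncbi", "NCBI"),
    ServiceInstance("SWISS-PROT@ebi", "protein sequence database", "ebi", "EBI"),
    ServiceInstance("BLASTp@ncbi", "sequence alignment", "ncbi", "NCBI"),
]


def find_what(registry, domain_class):
    """Domain discovery: everything that does the requested kind of thing."""
    return [s for s in registry if s.domain_class == domain_class]


def find_which(candidates, **operational):
    """Service discovery: narrow candidates by operational properties."""
    return [
        s for s in candidates
        if all(getattr(s, key) == value for key, value in operational.items())
    ]
```

For example, `find_which(find_what(REGISTRY, "protein sequence database"), location="ebi")` narrows three equivalent databases to the single EBI instance.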

  16. Tiers of service description Select an open reading frame Uncharacterised DNA sequence Characterised DNA sequence Sequence alignment Translate to protein Characterised DNA sequence EMBOSS TransSeq CATTACCC EMBOSS GetORF BLAST-p Characterised DNA sequence EMBOSS TransSeq@http:ed.ac.uk CATTACCC EMBOSS GetORF @http:img.cs.man.ac.uk BLASTp @ncbi.nih.gov

  17. Summary: Tiered levels of descriptions Abstract Service Sequence alignment Classes of services Domain “semantic” Unexecutable “Potentials” Ontology Specific Service Blastn Ontology Instances of services Business “operational” Executable “Actuals” Service Instance Blastn@EBI Ontology Data model Invoked Service Blastn@EBI invoked proxy Service Data Element
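
One reading of the four tiers can be written as a chain of records, each pointing at the tier above it. The `Tier` class and the `lineage` helper are hypothetical, and the choice to mark only the instance and invoked tiers as executable is an interpretation of the flattened diagram, not something the slide states outright.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Tier:
    level: str                       # which tier this description sits at
    name: str                        # e.g. "Blastn@EBI"
    parent: Optional["Tier"] = None  # the more abstract description above it
    executable: bool = False         # "potentials" vs "actuals" (assumed split)


abstract = Tier("abstract service", "Sequence alignment")
specific = Tier("specific service", "Blastn", parent=abstract)
instance = Tier("service instance", "Blastn@EBI", parent=specific, executable=True)
invoked = Tier("invoked service", "Blastn@EBI invoked", parent=instance, executable=True)


def lineage(tier):
    """Names from the most abstract description down to the given tier."""
    names = []
    while tier is not None:
        names.append(tier.name)
        tier = tier.parent
    return names[::-1]
```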

  18. What are you discovering? Classes & Users • Finding a service that will fulfil some task e.g. aligning of biological sequences. • What services perform a specific kind of task, for example, what services can I use to perform a biological sequence similarity search? • Finding a service that will accept or produce some kind of data. • What services produce this kind of data, for example, from where can I find sequence data for a protein? • What services consume this kind of data, for example, if I have protein sequence data, what can I do with it? • Class of service: • a protein sequence alignment, a protein sequence database. • Specific example of an abstract service: • BLAST, BLASTn, SWISS-PROT, • Applies to class of services and workflow specifications Workflow specifications Discovery Classes of Service

  19. provides Resource Service presents supports What it does describedBy Service profile Service grounding Service model How to access it description How it works functionalities functional attributes Originally Based on DAML-S • US DARPA Agent Markup Language – Services http://www.daml.org • An Upper Ontology for Services

  20. Suite Specialises. All concepts are subclassed from those in the more general ontology. Contributes concepts to form definitions. Upper level ontology Informatics ontology Molecular biology ontology Publishing ontology Organisation ontology Task ontology parameters: input, output, precondition, effect performs_task uses-resource is_function_of Bioinformatics ontology Web service ontology

  21. Pedro interface to Service Discovery

  22. Classification and matchmaking of services • Classification of services/workflows • Imprecise (best effort) substitutions of services/workflows • Service/workflow organisation & indexing • Service/workflow matchmaking & substitution • “BLAST” finds tblastx, tblastn, psi-blast, marks_super_blast. • “Alignment” finds ClustalW, Blast, Smith-Waterman, Needleman-Wunsch • Expanded selection of services based on expansion of in-hand object • A vocabulary for expressing service descriptions without pre-determining every description • A reasoning process to manage: • coherency of the classifications and the descriptions when they are created, • the service discovery, matching and composition when they are deployed. • Ontologies in DAML+OIL/OWL based on the DAML-S ontology
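
The "BLAST finds tblastx, …" behaviour is classification-based matching: a query for a class also returns everything classified beneath it. The sketch below imitates this with a hand-written parent table and a transitive lookup. That is a deliberate simplification: the slides describe ontologies in DAML+OIL/OWL managed by a reasoning process, whereas here the hierarchy is simply asserted, and the `PARENT` table, `is_a` and `find` are hypothetical names.

```python
# Asserted hierarchy, child -> parent. A description-logic reasoner would
# infer this from concept definitions; here it is written out by hand.
PARENT = {
    "BLAST": "Alignment",
    "ClustalW": "Alignment",
    "Smith-Waterman": "Alignment",
    "Needleman-Wunsch": "Alignment",
    "tblastx": "BLAST",
    "tblastn": "BLAST",
    "psi-blast": "BLAST",
    "marks_super_blast": "BLAST",
}


def is_a(concept, ancestor):
    """True if `concept` equals `ancestor` or is classified beneath it."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = PARENT.get(concept)
    return False


def find(query):
    """All services classified strictly beneath the queried class."""
    return sorted(c for c in PARENT if c != query and is_a(c, query))
```

Because the lookup is transitive, a query for "Alignment" returns the BLAST variants as well as the four services listed on the slide.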

  23. What are you discovering? Instances & Machines Workflow specifications Discovery Classes of Service registry Instantiate Select instances

  24. Discovering services based on their operational properties • What resources does a specific organisation provide? • Who authored this resource? • What services offering x currently give the best quality of service? • Which service would the local bioinformatics expert suggest we use? • Data quality, quality of service, cost, geographical location, authorisation, provenance of data and so on. • Third party metadata • Instance service description of a specific service • BLAST, SWISS-PROT as offered by the EBI is 80% reliable. • Invoked instance service description • BLAST as offered by the EBI on a particular date, with particular parameters, when the service was invoked. • Applies to instances of services and workflows
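
Selecting an instance by operational metadata can be sketched as a filter followed by a ranking. The 80% reliability figure for the EBI comes from the slide; the second instance, its 60% figure, the annotator names and the `InstanceMetadata` and `best_instance` names are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceMetadata:
    instance: str       # e.g. "BLAST@EBI"
    offers: str         # the specific service being offered
    provider: str
    reliability: float  # fraction of successful invocations, 0.0 to 1.0
    annotated_by: str   # third party that supplied this metadata


METADATA = [
    InstanceMetadata("BLAST@EBI", "BLAST", "EBI", 0.80, "local expert"),
    # Hypothetical second provider, present only to make ranking meaningful.
    InstanceMetadata("BLAST@example", "BLAST", "Example Org", 0.60, "local expert"),
]


def best_instance(metadata, offers):
    """Pick the most reliable instance offering the requested service."""
    candidates = [m for m in metadata if m.offers == offers]
    if not candidates:
        return None
    return max(candidates, key=lambda m: m.reliability)
```

The same pattern extends to cost, location or authorisation by changing the filter and the ranking key.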

  25. RDF based UDDI metadata for service instances

  26. User engagement Workflow specifications Discovery Classes of Service registry Instantiate Select instances • Support for the user to find a service that fulfils their task. • ontology should be fairly simple • couched in concepts the user is familiar with e.g. protein sequence. • analogous to DAML-S profile

  27. EMBOSS seqret • Function that reads and writes (returns) sequences • But it's so much more than that! • EMBOSS programs can take a wide range of qualifiers that slightly change the behaviour of the program when reading or writing a sequence • seqret can read a sequence or many sequences from databases, files, files of sequence names, the command-line or the output of other programs and then can write them to files, the screen or pass them to other programs. • Because it can read in a sequence from a database and write it to a file, it's a program for extracting sequences from databases • Because it can write the sequence to the screen, seqret is a program for displaying sequences.

  28. And more… seqret can read sequences in any of a wide range of standard sequence formats. You can specify the input and output formats being used. If you don't specify the input format, it will try a set of possible formats until it reads it in successfully. Because you can specify the output sequence format, it's a program to reformat a sequence. seqret can read in the reverse complement of a nucleic acid sequence. So it's a program for producing the reverse complement of a sequence. seqret can read in a sequence whose begin and end positions you have specified and write out that fragment. So it's a utility for doing simple extraction of a region of a sequence. seqret can change the case of the sequence being read in to upper or to lower case. So it's a simple sequence beautification utility. seqret can do any combination of the above functions. ......

  29. EMBOSS • EMBOSS sequence alignment service matcher • A simple way to describe the task it fulfils is: matcher has_input sequence, performs_task aligning • some verb acting on some object to produce a result, and it fits most descriptions. • Descriptions quickly get more complicated. • EMBOSS degap removes gap characters from a sequence. • Where should the gap character concept be included? It is neither an input nor an output.
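
The "verb acting on an object" pattern maps naturally onto subject, property, value triples. In the sketch below, the two `matcher` triples come from the slide; the `degap` triples are an assumed encoding added only so that the query has something to discriminate, and they deliberately do not answer the slide's question of where the gap character concept belongs. The function name `services_matching` is hypothetical.

```python
# (service, property, value) triples describing what each service does.
TRIPLES = {
    ("matcher", "has_input", "sequence"),
    ("matcher", "performs_task", "aligning"),
    # Assumed encoding for degap, for illustration only.
    ("degap", "has_input", "sequence"),
    ("degap", "performs_task", "removing gap characters"),
}


def services_matching(triples, **constraints):
    """Services whose description contains every requested property/value."""
    services = {subject for subject, _, _ in triples}
    return sorted(
        s for s in services
        if all((s, prop, value) in triples for prop, value in constraints.items())
    )
```

A query on input type alone returns both services; adding `performs_task="aligning"` narrows the result to `matcher`.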

  30. Several properties added over the DAML-S profile for bioinformatics • e.g. uses_resource and uses_application. • These could be simplified away either just as one additional property or a precondition as used in DAML-S. • More obtuse to the user. • Makes the model more complex or redundant for the benefit of the user. • Reduces interoperability with service descriptions in other domains. • Perhaps this redundancy should be encoded within the applications delivering the ontology and a more complex precondition description used under the hood?

  31. EMBOSS matcher • protein sequence is an ambiguous term and relies on implicit information held in the head of the bioinformatician. • to reason over or organise concepts we need a more precise definition • data structure conforming to some schema that encodes the sequence of amino acids in a protein molecule. • We can now start to infer the relationship between protein sequences and nucleotide sequences. • But a user cannot be expected to interact with such a complex model.

  32. Outcome: Views • Multiple descriptions over same services & workflows held in registries • Third party descriptions & Subsets of services • publication of descriptions must be supported both for the author of the service and third parties; • third party annotations are a view of a service and discovery should offer a variety of views based upon third party annotations; • there is a need for control over who may add and alter third party annotations; • Generic services supporting a wide variety of multiple tasks • Middleware must have some way of going beyond a generic description and stating, given these inputs, what the outputs are going to be. • Rather than author a very complex description that caters for all possibilities, it is better to author many simpler descriptions, one for each case. • It may in fact be necessary to ask the service itself for specific answers, such as ‘given these inputs what would you perform?’

  33. Ontologies myGrid Find Service Word-based discovery Discovery Client Find Service Syntactic discovery Semantic discovery Views UDDI-M Ontology Server Views Third party description Reasoner RDF Service FaCT publishes Description Store Matcher Gather service descriptions publishes Org. registry KAON Public registry WSDL Third Party UDDI

  34. Bio Services Problem Space • Wrap CORBA, Perl etc to look like web services, to become Grid services (eventually) • Dialogue oriented (e.g. Soaplab) and function oriented (e.g. bioMOBY) • Often highly parameterised • Mixture of synchronous and asynchronous • Simulations and feedback loops • Streaming large scale data • Mixture of binary and text

  35. EMBOSS • Suite of 200+ command line programs, which use the AJAX Command Definition (ACD) language • How do we present these services? • As 200 different services, one for each EMBOSS program, with a single method, with as many parameters as the EMBOSS program requires. • As 200 different services, one for each EMBOSS program, with a number of overloaded methods where the program takes optional parameters. • As a single service with 200 different methods, one for each EMBOSS program. • As a single, highly parametric service, with a single method, called invoke, the first parameter of which names the EMBOSS program to run.
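
The first and last options on this slide are two faces of the same dispatcher, which a short sketch makes visible. Everything here is an assumption for illustration: the `PROGRAMS` table lists only two programs with example parameter names, `invoke` merely builds a command line rather than running EMBOSS, and the per-program services are generated by partial application.

```python
from functools import partial

# Illustrative subset: program name -> parameter names it accepts.
PROGRAMS = {
    "seqret": ["sequence", "osformat"],
    "transeq": ["sequence", "frame"],
}


def invoke(program, **params):
    """Option 4: one highly parametric method; first argument names the program.

    Returns the command line it would run, without executing anything.
    """
    if program not in PROGRAMS:
        raise ValueError(f"unknown program: {program}")
    unknown = set(params) - set(PROGRAMS[program])
    if unknown:
        raise ValueError(f"{program} does not accept: {sorted(unknown)}")
    argv = [program]
    for name, value in params.items():
        argv += [f"-{name}", str(value)]
    return argv


# Option 1: one single-method service per program, derived from the same code.
SERVICES = {name: partial(invoke, name) for name in PROGRAMS}
```

The generic `invoke` is cheap to provide but says nothing about what any call does, whereas the per-program entries in `SERVICES` give discovery something specific to describe. That is the trade-off the slide is pointing at.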

  36. registry Registry? Workflow specifications Discovery Classes of Service Instantiate Select instances Execution Invoked instance Workflow enactment

  37. Invocation Workflow specifications Discovery Classes of Service Registry Discovery & Instantiate Select instances Registry? Execution Invoked instance Workflow enactment Monitor Terminate

  38. Support for middleware to perform tasks such as substitution, data transformation between services, automatic invocation of services where the invocation model is not simple. • a complex model to explicitly describe every implementation detail of the service or a binding to it. • analogous to DAML-S process model and grounding. Phases Workflow specifications Discovery Classes of Service Discovery & Instantiate Select instances Execution Invoked instance Workflow enactment Monitor Terminate

  39. Invocation models • bioMoby forces services to have a single operation that completely encompasses the single task the service supports. • Each task may be in turn supported by a single operation • In Soaplab there is no one-to-one mapping between a single task and a single operation. • Can repurpose a service to be presented multiple times – a different wrapper for every view • Proliferation of views • Makes discovery easier • Reasoning that it’s the same service as one running

  40. createEmptyJob get_detailed_status get_report get_outfile set_gappenalty set_sbegin1 set_sbegin2 set_send1 set_send2 set_sformat1 set_sformat2 set_slower1 set_slower2 set_snucleotide1 set_snucleotide2 set_sprotein1 set_sprotein2 set_sreverse1 set_sreverse2 set_supper1 set_supper2 set_datafile_direct_data set_datafile_url set_sequencea_direct_data set_sequencea_usa set_sequenceb_direct_data set_sequenceb_usa set_gaplength set_alternatives run destroy getStatus describe getInputSpec getResultSpec getAnalysisType createJob runNotifiable createAndRun createAndRunNotifiable waitFor runAndWaitFor getResults terminate getLastEvent getNotificationDescriptor getCreated getStarted getEnded getElapsed getCharacteristics getSomeResults ...... Soaplab version of matcher: alignment_local::matcher::derived (wsdl)

  41. Coordinating EMBOSS through Soaplab - WSFL • for each task: • createJob(inputs:Map) • run(...) • waitFor(...) • getResults(...) • destroy(...) Workflow Engine WSFL

  42. Coordinating EMBOSS through Soaplab - Scufl • for each task: • run(operation, inputs) Soaplab plugin Workflow Engine Scufl
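
The difference between slides 41 and 42 is where the five-step job lifecycle lives: spelled out in the workflow (WSFL) or hidden behind a single call (the Soaplab plugin for Scufl). The sketch below uses an in-memory stand-in, so it is not the real Soaplab interface. The operation names createJob, run, waitFor, getResults and destroy are taken from the slides, but their signatures, the job states, and the names `FakeSoaplabService` and `run_task` are assumptions.

```python
class FakeSoaplabService:
    """In-memory stand-in for a Soaplab-wrapped analysis (not the real API)."""

    def __init__(self, analysis):
        self.analysis = analysis
        self._jobs = {}
        self._next_id = 0

    def createJob(self, inputs):
        job_id = f"job-{self._next_id}"
        self._next_id += 1
        self._jobs[job_id] = {"inputs": dict(inputs), "state": "CREATED"}
        return job_id

    def run(self, job_id):
        self._jobs[job_id]["state"] = "RUNNING"

    def waitFor(self, job_id):
        # A real client would block here; the fake completes immediately.
        self._jobs[job_id]["state"] = "COMPLETED"

    def getResults(self, job_id):
        job = self._jobs[job_id]
        if job["state"] != "COMPLETED":
            raise RuntimeError("job has not completed")
        return {"report": f"{self.analysis} ran on {sorted(job['inputs'])}"}

    def destroy(self, job_id):
        del self._jobs[job_id]


def run_task(service, inputs):
    """Plugin-style single call that hides the five lifecycle steps."""
    job_id = service.createJob(inputs)
    try:
        service.run(job_id)
        service.waitFor(job_id)
        return service.getResults(job_id)
    finally:
        # Always release the server-side job, even if a step fails.
        service.destroy(job_id)
```

With the wrapper in place a workflow author writes one step per task, at the cost of the engine needing a plugin that knows the invocation model.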

  43. Does the user ever see this? • If the user never has to deal with the invocation model • The DAML-S approach of splitting the information between two descriptions seems plausible. • Once the user has used the simpler profile, the middleware gets to work on the more complex process model and binding, or a myGrid workflow to actually translate the task into concrete service operation calls. • If the user does want to know what is going to happen • A more unified model with views for user and middleware seems more appropriate. • The downside is the cost of implementing the infrastructure to deliver the views.

  44. Summary: Views • Two parallel but slightly redundant descriptions of the service • one for human discovery and one for middleware. • what DAML-S does. OR • One common model which is complex and supports multiple tasks but has an extra layer that provides a view to support each specific task • intermediate representations, reasonables, perspectives, language generation. • The user sees the term protein sequence even though the underlying concept is far more explicit. • Transformed into the more complex pattern; the user may be prompted for attributes associated with the parent concept “data” even though the user never explicitly stated this was a kind of data. • The view approach used in GALEN and GONG. • The DAML-S profile is probably too complex to present to bioinformatics users.
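
The second option (one complex model plus a simplifying view) can be illustrated with a lookup that expands the user's term into the underlying pattern and then derives prompts from the parent concept. The expansion of "protein sequence" paraphrases the definition given on slide 31; the property names, the attributes attached to "data", and the function names `expand` and `prompts_for` are hypothetical.

```python
# User-facing term -> underlying, more explicit pattern (paraphrasing slide 31).
EXPANSIONS = {
    "protein sequence": {
        "is_a": "data",
        "conforms_to": "schema",
        "encodes": "sequence of amino acids",
        "of": "protein molecule",
    },
}

# Attributes attached to parent concepts; these values are invented examples.
PARENT_ATTRIBUTES = {
    "data": ["format", "source"],
}


def expand(term):
    """What the middleware sees when the user picks a simple term."""
    return EXPANSIONS[term]


def prompts_for(term):
    """Attributes the user may be asked for, inherited from the parent concept."""
    parent = EXPANSIONS[term]["is_a"]
    return PARENT_ATTRIBUTES.get(parent, [])
```

The user only ever typed "protein sequence", yet can be asked for a format because the expanded pattern says it is a kind of data.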

  45. Summary 2: human vs machine views Service User Human Machine Weak semantic descriptions Rewriting views UDDI style advertisements Human Syntactic descriptions Semantic mining Elaborate Semantic descriptions Simplification views Machine Service provider

  46. Discovery space Classes and instances Abstractions over a single description of a service Third party multiple viewpoints People and machines Multiple descriptions over a single service Multiple tasks

  47. Acknowledgements: Luc Moreau, Simon Miles, Keith Decker, Terry Payne, Phil Lord, Chris Wroe, Robert Stevens, Kevin Garwood. http://www.mygrid.org.uk/
