(Bio)Web Services at the INB

(Bio)Web Services at the INB BioMOBY

Instituto Nacional de Bioinformática

INB Mission
“To generate and apply bioinformatics solutions to needs detected in the development and implementation of genomics- and proteomics-focused projects”
• To support Bioinformatics and Computational Biology development in Spain
• To collaborate with, and provide scientific and technical support to, national genomics and proteomics projects
• To contribute to the creation and establishment of local Bioinformatics groups, with research and service components, through the training of bioinformaticians
• To train bioinformaticians for genomics and proteomics research groups
• To develop pure Bioinformatics projects related to the Institute's activities
• To support companies active in this sector in Spain
• To internationalize all its activities

INB Structure. A “virtual” institute

Web Services

Making some sense of this
12181 acatttctac caacagtgga tgaggttgtt ggtctatgtt ctcaccaaat ttggtgttgt
12241 cagtctttta aattttaacc tttagagaag agtcatacag tcaatagcct tttttagctt
12301 gaccatccta atagatacac agtggtgtct cactgtgatt ttaatttgca ttttcctgct
12361 gactaattat gttgagcttg ttaccattta gacaacttca ttagagaagt gtctaatatt
12421 taggtgactt gcctgttttt ttttaattgg gatcttaatt tttttaaatt attgatttgt
12481 aggagctatt tatatattct ggatacaagt tctttatcag atacacagtt tgtgactatt
12541 ttcttataag tctgtggttt ttatattaat gtttttattg atgactgttt tttacaattg
12601 tggttaagta tacatgacat aaaacggatt atcttaacca ttttaaaatg taaaattcga
12661 tggcattaag tacatccaca atattgtgca actatcacca ctatcatact ccaaaagggc
12721 atccaatacc cattaagctg tcactcccca atctcccatt ttcccacccc tgacaatcaa
12781 taacccattt tctgtctcta tggatttgcc tgttctggat attcatatta atagaatcaa
Source: myGrid

Or this…

Current practices
Bioinformatics Integration: State of the Art
• A Web page is the de facto standard
• Discovery:
  • Word of mouth
  • Web directories
  • Google
  • Paper publications
• Description:
  • Word of mouth
  • Web documentation / examples / tutorials / courses
  • Paper publications
• Data transfer & Message Format:
  • Cut & paste! + Data reformatting
• Automation:
  • CGI & bespoke code (ad hoc)
  • APIs (normally big Bioinformatics projects/institutes)
Do you have data? Do you have tools? Publish a web page
[Figure: Service Providers offer a Service (Application, DB) to Service Consumers through Discovery, Description and Remote Programmatic Access.]

What is wrong with Web apps
User side
• How do I find out where services are provided?
• Once I discover a service, how do I use it?
• Input/output data types. How do I take the output of one service and send it to another service?
• How do I use the service from within a program instead of through a form on a web page?
Developer side
• Most developer time is spent on presentation and input/output management. Error handling is mandatory
• Different projects use different formats, rules of use, etc.

Web services can solve the problem • Central repository • Well-known input/output formats • No need for user interfaces

Web services model
• A Web service is an interface that describes a collection of operations that are network accessible through standardized XML messaging*
[Figure: the Service Provider publishes its Service Description in the Service Registry (WSDL, UDDI); the Service Requestor finds Service Descriptions in the Registry (WSDL, UDDI) and then binds to the Service.]
*Web Services Conceptual Architecture, Heather Kreger, IBM Software Group, 2001

However…
• Web services alone don't help the situation much, since…
• A bioinformatics service that consumes a “string” might be expecting a FASTA sequence, or a keyword…??
• Bioinformatics has many different ‘strings’!
• In Bioinformatics, Web Service registries merely catalogue the chaos!
• Semantics, rather than structure, is what is needed

Our audience
• Information is distributed
• MOST data never makes it off the scientist's hard drive
• This data should be added to the global scientific archive
• Biologists, by and large, are willing and able, but…
• The Web was embraced enthusiastically by biologists
• In fact, most wet labs run a website!
• Unfortunately, this only adds to the chaos…
The interoperability solution must be simple enough for a biologist with a little computer knowledge to implement on their own

BioMOBY From MOBY-DIC (Model Organisms, Bring Your own Database Interface Conference) http://www.biomoby.org

OBJECTIVES
• Study how to address the interoperability problems actually being faced today by bioinformatics users of web-accessible resources, and which factors promote the adoption of new approaches
• How to balance increasing the potential for interoperability against the likelihood of widespread adoption? I.e. focus on minimizing the barriers to entry into the system, or insist on a set of constraints that will guarantee the usefulness of the system's components
BioMOBY: Scope and Definition
http://www.biomoby.org
MOBY is a project to develop a web services architecture for bioinformatics
BioMOBY is an international research project involving biological data hosts, biological data service providers, and coders whose aim is to explore various methodologies for biological data representation, distribution, and discovery.
• Common Syntax
• Common Semantics
• Dynamic Discovery

The MOBY plan • Define data-types commonly used in bioinformatics • Organize these into an Ontology • Ontologically define web service inputs and outputs • Register the inputs and outputs in a “yellow pages” • Machines can find an appropriate service • Machines can execute that service unattended • But users still can understand data types

Define: Semantics • For a piece of data, its “semantics” are • its intention • its meaning • its raison d’etre • its context • its relationship to other data

MOBY Semantic Typing: Namespaces
• Any identifiable piece of data is an “entity”
• Identifiers fall into particular “Namespaces”
  • NCBI has gi numbers (gi Namespace)
  • GO Terms have accession numbers (GO Namespace)
• Namespaces indicate data's semantic type.
  • GO:0003476 → a Gene Ontology Term
  • gi|163483 → a GenBank record
  • However, we cannot tell if it is a protein, RNA, or DNA sequence
• Namespace + ID precisely specifies a data “entity”
• The Namespace is assumed to be sufficiently descriptive of the data's semantic type that a service provider can define their interface in terms of Namespaces
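The idea that Namespace + ID identifies an entity can be sketched in a few lines of Python. This is only an illustration: the `Entity` type, the `parse_identifier` function and the choice of `:` and `|` as separators are assumptions taken from the two example identifiers on the slide, not part of the MOBY API.

```python
from typing import NamedTuple


class Entity(NamedTuple):
    """A data entity referenced by Namespace + ID (hypothetical helper type)."""
    namespace: str
    id: str


# Separators assumed for this sketch, based on "GO:0003476" and "gi|163483".
_SEPARATORS = (":", "|")


def parse_identifier(text: str) -> Entity:
    """Split an identifier string at its first namespace separator."""
    positions = [text.find(sep) for sep in _SEPARATORS if sep in text]
    if not positions:
        raise ValueError(f"no namespace separator in {text!r}")
    index = min(positions)
    return Entity(namespace=text[:index], id=text[index + 1:])


if __name__ == "__main__":
    for raw in ("GO:0003476", "gi|163483"):
        print(parse_identifier(raw))
```

Note that the parsed namespace says where the identifier comes from, but, as the slide points out, not whether a gi record holds protein, RNA or DNA.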

Define: Syntax • For a piece of data, its “syntax” is • its representation • its form • its structure • its language (of representation)

MOBY Syntactic Typing: The Object Ontology
• Syntactic types are defined by a GO-like ontology
  • Type (“Class”) name at each node
  • Edges define the relationships between Classes
  • GO used as a model because of its comprehensibility & familiarity
• Edges define one of three relationships
  • ISA
    • Inheritance relationship
    • All properties of the parent are present in the child
  • HASA
    • Container relationship of ‘exactly 1’
  • HAS
    • Container relationship with ‘1 or more’

Define: Ontology
• A systematic representation of the entities that exist in a domain of discourse, and the relationships between them.
[Figure: example ontology with the entities Female, Male, Mother, Father and Child, linked by the relationships hasGender, hasParent and partnerOf.]

A portion of the MOBY-S Object Ontology …community-built!

What's an “Object”?
• The smallest unit of information that can be passed by MOBY
• Consists simply of
  • Namespace
  • ID
• Thus an Object is nothing more than a “reference” to a data entity
• Ex. <Object id='2KI5' namespace='PDB'/> refers to the 3D structure of a Herpes Virus I Thymidine kinase, whereas
• <Object id='KITH_HHV1' namespace='Uniprot'/> refers to its sequence

The Object Ontology: A small slice • ISA relationships do not necessarily add complexity to objects; sometimes they only add semantics • Inheritance makes service discovery easier

MOBY objects
A MOBY triple includes a namespace, an ID, and a class
  <Class namespace='...' id='...'/>
A simple MOBY object is just a pointer to data to be retrieved from somewhere
  <Object namespace='NCBI_gi' id='163483'/>
An object may contain data in addition to the namespace and ID:
  <Class namespace='...' id='...'>object's data</Class>
The object's data may include XML markup
A complete MOBY object:
  <GenericSequence namespace='NCBI_gi' id='163483' articleName='mySequence'>
    <Integer namespace='' id='' articleName='Length'>975</Integer>
    <String namespace='' id='' articleName='SequenceString'>
      ATGATGCGGCTAGTGATGCTGTCGGCGGCATGATTAGG...
    </String>
  </GenericSequence>
articleName attributes add human-readable semantics to the embedded objects
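As a rough illustration of the object structure described above, the following Python sketch builds the same kind of element tree with the standard library. The helper names `moby_object` and `generic_sequence` are hypothetical and are not part of any MOBY API; the short sequence in the demo is a placeholder.

```python
import xml.etree.ElementTree as ET


def moby_object(cls, namespace="", identifier="", article_name=None, text=None):
    """Create one MOBY-style element: class name as tag, namespace/id as attributes."""
    attributes = {"namespace": namespace, "id": identifier}
    if article_name is not None:
        attributes["articleName"] = article_name
    element = ET.Element(cls, attributes)
    if text is not None:
        element.text = text
    return element


def generic_sequence(namespace, identifier, residues, article_name="mySequence"):
    """Build a GenericSequence holding a Length (Integer) and a SequenceString (String)."""
    sequence = moby_object("GenericSequence", namespace, identifier, article_name)
    sequence.append(moby_object("Integer", article_name="Length", text=str(len(residues))))
    sequence.append(moby_object("String", article_name="SequenceString", text=residues))
    return sequence


if __name__ == "__main__":
    obj = generic_sequence("NCBI_gi", "163483", "ATGATGCGGCTAGTGATG")  # placeholder data
    print(ET.tostring(obj, encoding="unicode"))
```

A bare reference such as `moby_object("Object", "NCBI_gi", "163483")` yields the "pointer" form, an element with only namespace and id.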

ISA relationship - inheritance
• Classes become more specialized as you move along the ISA relationship hierarchy
  • DNA_Sequence ISA Nucleotide_Sequence
  • Nucleotide_Sequence ISA Generic_Sequence
  • Generic_Sequence ISA Virtual_Sequence
  • Virtual_Sequence ISA Object
• Classes do not become more complex as a result of ISA relationships alone

HASA & HAS relationships
• HASA and HAS relationships make Classes more complex by embedding Classes within Classes
  • Virtual_Sequence ISA Object
  • Virtual_Sequence HASA Length (Integer)
  • Generic_Sequence ISA Virtual_Sequence
  • Generic_Sequence HASA Sequence (String)
  • Annotated_GIF ISA Image (base_64_GIF)
  • Annotated_GIF HAS Description (String)
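The way ISA and HASA edges combine can be sketched as a small in-memory ontology. This is an illustrative model only: the `MobyClass` record and the `lineage`, `members` and `is_a` helpers are hypothetical, and treating Integer and String as direct children of Object is an assumption of the sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class MobyClass:
    name: str
    isa: Optional[str] = None                            # parent class (inheritance)
    hasa: Dict[str, str] = field(default_factory=dict)   # member name -> class, exactly 1


ONTOLOGY = {c.name: c for c in [
    MobyClass("Object"),
    MobyClass("Integer", isa="Object"),   # assumed placement
    MobyClass("String", isa="Object"),    # assumed placement
    MobyClass("Virtual_Sequence", isa="Object", hasa={"Length": "Integer"}),
    MobyClass("Generic_Sequence", isa="Virtual_Sequence", hasa={"Sequence": "String"}),
    MobyClass("Nucleotide_Sequence", isa="Generic_Sequence"),
    MobyClass("DNA_Sequence", isa="Nucleotide_Sequence"),
]}


def lineage(name: str) -> List[str]:
    """Return the class followed by all of its ISA ancestors, up to the root."""
    chain = []
    current: Optional[str] = name
    while current is not None:
        chain.append(current)
        current = ONTOLOGY[current].isa
    return chain


def members(name: str) -> Dict[str, str]:
    """Collect HASA members, inherited ones first (all parent properties are kept)."""
    collected: Dict[str, str] = {}
    for ancestor in reversed(lineage(name)):
        collected.update(ONTOLOGY[ancestor].hasa)
    return collected


def is_a(child: str, ancestor: str) -> bool:
    """True if `child` is `ancestor` or specializes it through ISA edges."""
    return ancestor in lineage(child)


if __name__ == "__main__":
    print(lineage("DNA_Sequence"))
    print(members("DNA_Sequence"))
```

In this model DNA_Sequence adds no members of its own: it carries exactly the Length and Sequence members it inherits, which matches the point that ISA alone does not add complexity.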

Legacy file formats
• Classic bioinformatics “strings” are just embedded into XML
• Binaries are base64 encoded.
<NCBI_Blast_Report namespace='NCBI_gi' id='115325'>
  <String articleName='content'>
TBLASTN 2.0.4 [Feb-24-1998]
Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schäffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402.
Query= gi|1401126 (504 letters)
Database: Non-redundant GenBank+EMBL+DDBJ+PDB sequences
336,723 sequences; 677,679,054 total letters
Searching...done
Sequences producing significant alignments:    Score (bits)    E Value
gb|U49928|HSU49928 Homo sapiens TAK1 binding protein (TAB1) mRNA...    1009    0.0
emb|Z36985|PTPP2CMR P.tetraurelia mRNA for protein phosphatase t...    58    4e-07
emb|X77116|ATMRABI1 A.thaliana mRNA for ABI1 protein    53    1e-05
gb|U12856|ATU12856 Arabidopsis thaliana Col-0 abscisic acid inse...    53    1e-05
  </String>
</NCBI_Blast_Report>

Extending legacy data types
• With legacy data-types defined, we can extend them as we see fit
  • annotated_jpeg ISA base64_encoded_jpeg
  • annotated_jpeg HASA 2D_Coordinate_set
  • annotated_jpeg HASA Description
<annotated_jpeg namespace='TAIR_Image' id='3343532'>
  <String namespace='' id='' articleName='content'>
    MIAGCSqGSIb3DQEHAqCAMIACAQExCzAJBgUrDgMCGgUAMIAGCSqGSIb3DQEHAQAAoIIJQDCC
    Av4wggJnoAMCAQICAwhH9jANBgkqhkiG9w0BAQQFADCBkjELMAkGA1UEBhMCWkExFTATBgNV
  </String>
  <2D_Coordinate_set namespace='' id='' articleName='pixelCoordinates'>
    <Integer namespace='' id='' articleName='x_coordinate'>3554</Integer>
    <Integer namespace='' id='' articleName='y_coordinate'>663</Integer>
  </2D_Coordinate_set>
  <String namespace='' id='' articleName='Description'>
    This is the phenotype of a ufo-1 mutant under long daylength, 16 °C
  </String>
</annotated_jpeg>

The Object Ontology: Defines an XML Schema! • The position of an ontology node precisely defines the syntax by which that node will be represented • End-users can define new data-types without having to write XML Schema! • This was an important aim of the project • A machine can “understand” the structure of any incoming message by querying its ontological type!

The Service Ontology • A simple ISA hierarchy • Primitive types include: • Analysis • Parsing • Registration • Retrieval • Resolution • Conversion

Goals achieved
• A common data-type ontology ensures full interoperability of services
• The present ontology, built freely by the community, has low redundancy and covers most bioinformatics entities.
• The services currently offered are small, very specific modules that are easy to interconnect to build complex workflows.
• This behaviour emerged naturally rather than being imposed by the standard!

Web services and workflows
• Common XML-based input/output formats make it possible to chain several services into a logical workflow
• Workflows are stored (in XML, of course) and can be run several times
• Workflows can include web services from several providers
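The chaining idea can be illustrated with a minimal Python sketch in which each service declares its input and output type and a workflow is accepted only if consecutive types match. Everything here is hypothetical: the `Service` record, the two toy services, the type names and the placeholder data stand in for real remote services.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class Service:
    name: str
    input_type: str
    output_type: str
    run: Callable[[Any], Any]  # local stand-in for a remote call


def build_workflow(services: List[Service]) -> Callable[[Any], Any]:
    """Chain services, checking that each output type feeds the next input type."""
    for upstream, downstream in zip(services, services[1:]):
        if upstream.output_type != downstream.input_type:
            raise TypeError(
                f"{upstream.name} yields {upstream.output_type}, "
                f"but {downstream.name} expects {downstream.input_type}"
            )

    def workflow(data: Any) -> Any:
        for service in services:
            data = service.run(data)
        return data

    return workflow


# Toy services with placeholder data (not real database records).
_FAKE_DB = {"DEMO_ID": "ACDEFGHIK"}
fetch_sequence = Service("fetch_sequence", "SequenceID", "AminoAcidSeq", _FAKE_DB.__getitem__)
sequence_length = Service("sequence_length", "AminoAcidSeq", "Integer", len)


if __name__ == "__main__":
    pipeline = build_workflow([fetch_sequence, sequence_length])
    print(pipeline("DEMO_ID"))
```

Because the type check happens when the workflow is built, a mismatched chain is rejected before any service is called.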

[Figure: workflow graph connecting services through typed data. Legend: Output, Service, Input/output.]
Data types shown: String, PDB ID, Uniprot ID, AAS (AminoAcidSeq), PDBText, FSOLVText, BLASTText, PMUTText, FeatureAAS, PropertySeq, PDB Enriched, Typed Image
Services shown: getPDBFilefromPDBId, StringtoAAS, getAASfromUniprot, getAASfromPDBId, parseAASfromPDBText, runFSOLVFromPDBText, runPSIBlastfromAAS, runPMUTHSfromBlastText, parseFeatureSeqfromPMUTText, parseFeatureSeqfromFSOLVText, parsePropfromFSOLVText, parsePropfromPMUTText, plotFeatureAAS, showPMUTonStruc, showFSOLVonStruc

Gene detection by homology
• Input: Protein Id and DNA genomic sequence
• Building of a Blast database from the DNA sequence
• BLAST search
• Run GeneWise to detect gene structure

Web Services at INB-BSC
BioMoby Web services offer (146)
• Database retrieval (34)
• Sequence comparison and alignment (46)
• Phylogeny (10)
• Sequence analysis (26)
• Structure analysis (9)
• Data handling and conversion (21)
Applications covered: Blast (24), Fasta (6), Clustal (2), Tcoffee (3), Hmmer (4), Phylip (10), Dali (1), Procheck (1), EMBOSS (25)
http://inb.bsc.es/webservices.php
BioMOBY Central: 254 Object Types, 430 Services

Implementation

XML?
• XML stands for EXtensible Markup Language
• XML is a markup language much like HTML
• XML was designed to describe data
• XML tags are not predefined. You must define your own tags
• XML uses a Document Type Definition (DTD) or an XML Schema to describe the data
• XML can be parsed easily in most programming languages (e.g. with the Perl XML::LibXML module)

XML example (from Amazon)

XML Bio example. FASTA sequence
<FASTA id="SRC_HUMAN" namespace="Swiss-Prot">
  <Header>SRC_HUMAN (P12931) Proto-oncogene tyrosine-protein kinase Src (EC 2.7.1.112) (p60-Src) (c-Src) (pp60c-src </Header>
  <Sequence>GSNKSKPKDASQRRRSLEPAENVHGAGGGAFPASQTPSKPASADGHRGPSAAFAPAAAEPKLFGGFNSSDTVTSPQRAGPLAGGVTTFVALYDYESRTETDLSFKKGERLQIVNNTEGDWWLAHSLSTGQTGYIPSNYVAPSDSIQAEEWYFGKITRRESERLLLNAENPRGTFLVRESETTKGAYCLSVSDFDNAKGLNVKHYKIRKLDSGGFYITSRTQFNSLQQLVAYYSKHADGLCHRLTTVCPTSKPQTQGLAKDAWEIPRESLRLEVKLGQGCFGEVWMGTWNGTTRVAIKTLKPGTMSPEAFLQEAQVMKKLRHEKLVQLYAVVSEEPIYIVTEYMSKGSLLDFLKGETGKYLRLPQLVDMAAQIASGMAYVERMNYVHRDLRAANILVGENLVCKVADFGLARLIEDNEYTARQGAKFPIKWTAPEAALYGRFTIKSDVWSFGILLTELTTKGRVPYPGMVNREVLDQVERGYRMPCPPECPESLHDLMCQCWRKEPEERPTFEYLQAFLEDYFTSTEPQYQPGENL</Sequence>
</FASTA>
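To show why such markup is easy to consume from a program, here is a minimal Python sketch that parses a FASTA-style XML document with the standard library and turns it back into plain FASTA text. The function name `xml_to_fasta` is hypothetical, and the embedded document is a shortened version of the slide example (header abbreviated, sequence truncated to its first 70 residues).

```python
import xml.etree.ElementTree as ET

# Shortened version of the slide example (sequence deliberately truncated).
DOC = """<FASTA id="SRC_HUMAN" namespace="Swiss-Prot">
  <Header>SRC_HUMAN (P12931) Proto-oncogene tyrosine-protein kinase Src</Header>
  <Sequence>GSNKSKPKDASQRRRSLEPAENVHGAGGGAFPASQTPSKPASADGHRGPSAAFAPAAAEPKLFGGFNSSD</Sequence>
</FASTA>"""


def xml_to_fasta(xml_text: str, width: int = 60) -> str:
    """Convert a <FASTA><Header/><Sequence/></FASTA> document to FASTA text."""
    root = ET.fromstring(xml_text)
    header = root.findtext("Header", default="").strip()
    # Remove any whitespace or line breaks inside the sequence element.
    sequence = "".join(root.findtext("Sequence", default="").split())
    lines = [f">{header}"]
    lines += [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join(lines)


if __name__ == "__main__":
    print(xml_to_fasta(DOC))
```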

XML Prosite entry
<Prosite_entry id="TYR_PHOSPHO_SITE" namespace="Prosite">
  <Id>TYR_PHOSPHO_SITE</Id>
  <Type>PATTERN</Type>
  <AC>PS00007</AC>
  <DT_Created>APR-1990</DT_Created>
  <DT_Updated>APR-1990</DT_Updated>
  <Info_Update>APR-1990</Info_Update>
  <Description>Tyrosine kinase phosphorylation site</Description>
  <Pattern>[RK]-x(2,3)-[DE]-x(2,3)-Y</Pattern>
  <CC>/TAXO-RANGE=??E?V; /SITE=5,phosphorylation; /SKIP-FLAG=TRUE; /VERSION=1;</CC>
  <DocumentRef>PDOC00007</DocumentRef>
</Prosite_entry>

XML / SOAP / WSDL
• XML is the basic language for transmitting data between (bio)web services.
• Additional layers are needed between the communication layers and the data layer
Classical Web transaction (top to bottom): Data layer, HTTP layer, TCP/IP layer
(Bio)Web service transaction (top to bottom): “Bio” layer, Data layer and SOAP layer (the XML layers), HTTP layer, TCP/IP layer

XML / SOAP / WSDL
• SOAP (Simple Object Access Protocol): a simple XML-based protocol that lets applications exchange information over HTTP.
  • It can use HTTP (mainly) or SMTP as the underlying communication protocol, so it is platform and language independent
• WSDL (Web Services Description Language): an XML-based language for describing Web services and how to access them.
  • It provides all the information needed to call a Web service through automatically generated SOAP requests

Structure of BioMOBY transaction
<SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/" ... >
  <SOAP-ENV:Body>
    <m:ServicioSwissProt xmlns:m="http://biomoby.org/">
      <m:body xsi:type="xsd:string">
        <?xml version='1.0' encoding='UTF-8'?>
        <moby:MOBY xmlns:moby='http://www.biomoby.org/moby-s'>
          <moby:Query>
            <moby:queryInput moby:articleName='' queryID='1'>
              <moby:Simple>
                <Object namespace='' id='ASA'/>
              </moby:Simple>
            </moby:queryInput>
          </moby:Query>
        </moby:MOBY>
      </m:body>
    </m:ServicioSwissProt>
  </SOAP-ENV:Body>
</SOAP-ENV:Envelope>
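A request with this two-level structure (a MOBY document carried as a string inside a SOAP body) can be assembled with the Python standard library, as in the hedged sketch below. The function names `moby_query` and `soap_envelope` are hypothetical; the namespace URIs and the service name `ServicioSwissProt` are copied from the slide. The sketch only builds the message and does not send it, and a real client would normally rely on a SOAP library instead.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"
MOBY_NS = "http://www.biomoby.org/moby-s"
SERVICE_NS = "http://biomoby.org/"


def moby_query(object_id: str, namespace: str = "", query_id: str = "1") -> str:
    """Build the inner MOBY document: one query with a single simple Object."""
    ET.register_namespace("moby", MOBY_NS)
    root = ET.Element(f"{{{MOBY_NS}}}MOBY")
    query = ET.SubElement(root, f"{{{MOBY_NS}}}Query")
    query_input = ET.SubElement(query, f"{{{MOBY_NS}}}queryInput", {"queryID": query_id})
    simple = ET.SubElement(query_input, f"{{{MOBY_NS}}}Simple")
    ET.SubElement(simple, "Object", {"namespace": namespace, "id": object_id})
    return ET.tostring(root, encoding="unicode")


def soap_envelope(service_name: str, payload: str) -> str:
    """Wrap the MOBY document, as an escaped string, in a SOAP envelope."""
    ET.register_namespace("SOAP-ENV", SOAP_ENV)
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}{service_name}")
    argument = ET.SubElement(call, f"{{{SERVICE_NS}}}body")
    argument.text = payload  # ElementTree escapes the markup on serialization
    return ET.tostring(envelope, encoding="unicode")


if __name__ == "__main__":
    print(soap_envelope("ServicioSwissProt", moby_query("ASA")))
```

Escaping the inner document keeps the envelope well-formed XML even though the payload is itself XML.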

BioMOBY answer
<SOAP-ENV:Envelope xmlns:xsi="http://www.w3.org/1999/XMLSchema-instance" ...>
  <SOAP-ENV:Body>
    <namesp1:ServicioSwissProtResponse xmlns:namesp1="http://biomoby.org/">
      <s-gensym3 xsi:type="xsd:string">
        <?xml version='1.0' encoding='UTF-8'?>
        <moby:MOBY xmlns:moby='http://www.biomoby.org/moby' xmlns='http://www.biomoby.org/moby'>
          <moby:Response moby:authority='not_provided'>
            <moby:queryResponse moby:queryID=''>
              <moby:Simple articleName=''>
                <String namespace='' id=''>
                  <![CDATA[ Id: "ASA" SWRIISSIEQ KEESRGNEDH VKCIQEYRSK IESELSNICD GILKLLDSCL IPSASAGDSK ....]]>
                </String>
              </moby:Simple>
            </moby:queryResponse>
          </moby:Response>
        </moby:MOBY>
      </s-gensym3>
    </namesp1:ServicioSwissProtResponse>
  </SOAP-ENV:Body>
</SOAP-ENV:Envelope>
** SOAP library used: libsoap 1.0.1
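On the client side, the result has to be dug out of two layers of XML. The Python sketch below shows one way to do that with the standard library, assuming the MOBY document arrives as an escaped string in the first child of the response wrapper. The function name `extract_results` and the shortened `SAMPLE` message are illustrative assumptions; the namespace URIs follow the slide.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"
MOBY_NS = "http://www.biomoby.org/moby"

# Shortened stand-in for a service answer (sequence text truncated).
INNER = """<?xml version='1.0' encoding='UTF-8'?>
<moby:MOBY xmlns:moby='http://www.biomoby.org/moby' xmlns='http://www.biomoby.org/moby'>
  <moby:Response moby:authority='not_provided'>
    <moby:queryResponse moby:queryID=''>
      <moby:Simple articleName=''>
        <String namespace='' id=''><![CDATA[Id: "ASA" SWRIISSIEQ KEESRGNEDH]]></String>
      </moby:Simple>
    </moby:queryResponse>
  </moby:Response>
</moby:MOBY>"""

SAMPLE = (
    f'<SOAP-ENV:Envelope xmlns:SOAP-ENV="{SOAP_ENV}"><SOAP-ENV:Body>'
    '<namesp1:ServicioSwissProtResponse xmlns:namesp1="http://biomoby.org/">'
    f"<s-gensym3>{escape(INNER)}</s-gensym3>"
    "</namesp1:ServicioSwissProtResponse></SOAP-ENV:Body></SOAP-ENV:Envelope>"
)


def extract_results(soap_text: str):
    """Return (class, namespace, id, data) for every object in a Simple article."""
    envelope = ET.fromstring(soap_text)
    body = envelope.find(f"{{{SOAP_ENV}}}Body")
    payload = body[0][0].text  # response wrapper -> return value -> MOBY string
    moby = ET.fromstring(payload.strip().encode("utf-8"))
    results = []
    for simple in moby.iter(f"{{{MOBY_NS}}}Simple"):
        for child in simple:
            class_name = child.tag.rsplit("}", 1)[-1]  # drop the namespace URI
            data = (child.text or "").strip()
            results.append((class_name, child.get("namespace"), child.get("id"), data))
    return results


if __name__ == "__main__":
    for item in extract_results(SAMPLE):
        print(item)
```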

The three components of MOBY
MOBY-Central
• Knows about all existing MOBY services
• Ask it for services by type of input, output or keyword
• Returns info on how to connect to a service
Service
• Accepts MOBY requests
• Runs a program on the service provider's computer
Client
• Locates a service through MOBY-Central
• Connects to the service, sends input data
• Waits for the result
• Finds the result buried within XML markup
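The division of labour among the three components can be sketched as a toy in-memory registry in Python. This is a simplified illustration, not the MOBY-Central API: `ServiceRecord`, `Registry`, `run_client` and the `reverse_sequence` service are hypothetical names, and a plain function stands in for the remote SOAP endpoint.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class ServiceRecord:
    name: str
    input_type: str
    output_type: str
    endpoint: Callable[[str], str]  # stand-in for a remote SOAP endpoint


class Registry:
    """Plays the role of the central registry: it knows every registered service."""

    def __init__(self) -> None:
        self._records: Dict[str, ServiceRecord] = {}

    def register(self, record: ServiceRecord) -> None:
        self._records[record.name] = record

    def find(self, input_type: Optional[str] = None,
             output_type: Optional[str] = None) -> List[ServiceRecord]:
        """Search services by input and/or output type."""
        return [
            r for r in self._records.values()
            if (input_type is None or r.input_type == input_type)
            and (output_type is None or r.output_type == output_type)
        ]


def run_client(registry: Registry, input_type: str, output_type: str, data: str) -> str:
    """Client side: locate a service through the registry, then call it."""
    matches = registry.find(input_type, output_type)
    if not matches:
        raise LookupError(f"no service maps {input_type} to {output_type}")
    return matches[0].endpoint(data)


if __name__ == "__main__":
    central = Registry()
    central.register(
        ServiceRecord("reverse_sequence", "DNA_Sequence", "DNA_Sequence", lambda s: s[::-1])
    )
    print(run_client(central, "DNA_Sequence", "DNA_Sequence", "ATGC"))
```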

MOBY transactions
[Figure: message exchange among MOBY Client, MOBY Service and MOBY Central in three phases: Registration Phase, Query Phase and Transaction Phase. Messages labelled in the figure: Register Service, Service Types, Data Object Type, OK, Input Data Object, Service Def Request, Available Services, Selected Service, WSDL, DATA, Output Data Object.]

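Moby Service
[Figure: inside the service provider, the BioMOBY API sits between the SOAP server and the application. Incoming path: connection to external users → SOAP server → SOAP packet → MOBY object extraction → MOBY Object → extraction of biological data → input data → Application. Outgoing path: Application → output data → building MOBY object → MOBY Object → building SOAP packet → SOAP packet → SOAP server.]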

How to use MOBY services: clients
• Programmatic access: MOBY API (Perl, Java, Python…)
• Web access: GBrowse browser, INB
• Clients: Bluejay, Eclipse/Haystack, Talisman/Taverna (myGrid)
Target users shown on the slide: Expert Bioinformaticians, Developers, Bioinformaticians, Biologists, Expert Biologists, Genomic Projects

Programmatic access
• Native MOBY APIs in Perl, Java or Python
Developed at INB-BSC
• MOBYLite API
  • Runs on top of the Perl MOBY API
  • MOBY datatypes are translated into Perl classes and services into Perl functions
  • The API is built automatically from the MOBY catalogue
• CommLineMOBY
  • Perl API to run in-house services without the need for the SOAP layer
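The idea of generating an API automatically from a catalogue can be illustrated with a Python analogue. This sketch is not MOBYLite (which is a Perl API) and does not reproduce its interface: the catalogue layout, `build_classes`, `build_functions` and the local `transport` callback are all assumptions made for the example.

```python
# Hypothetical catalogue: datatypes with their parent and own fields, plus services.
DATATYPES = {
    "Object": {"isa": None, "fields": []},
    "VirtualSequence": {"isa": "Object", "fields": ["Length"]},
    "GenericSequence": {"isa": "VirtualSequence", "fields": ["SequenceString"]},
}
SERVICES = {"sequence_length": {"input": "GenericSequence"}}


def build_classes(datatypes):
    """Create one Python class per catalogue datatype, following ISA links."""
    classes = {}

    def make(name):
        if name in classes:
            return classes[name]
        spec = datatypes[name]
        base = make(spec["isa"]) if spec["isa"] else object
        fields = tuple(getattr(base, "FIELDS", ())) + tuple(spec["fields"])

        def __init__(self, namespace="", id="", **values):
            unknown = set(values) - set(self.FIELDS)
            if unknown:
                raise TypeError(f"unknown fields: {sorted(unknown)}")
            self.namespace, self.id = namespace, id
            for field_name in self.FIELDS:
                setattr(self, field_name, values.get(field_name))

        classes[name] = type(name, (base,), {"FIELDS": fields, "__init__": __init__})
        return classes[name]

    for name in datatypes:
        make(name)
    return classes


def build_functions(services, classes, transport):
    """Create one function per service; `transport` performs the actual call."""
    def make(name, spec):
        expected = classes[spec["input"]]

        def call(obj):
            if not isinstance(obj, expected):
                raise TypeError(f"{name} expects {expected.__name__}")
            return transport(name, obj)

        call.__name__ = name
        return call

    return {name: make(name, spec) for name, spec in services.items()}


if __name__ == "__main__":
    classes = build_classes(DATATYPES)
    # Local stand-in for a remote call: just measures the sequence string.
    api = build_functions(SERVICES, classes, lambda name, obj: len(obj.SequenceString))
    seq = classes["GenericSequence"](namespace="NCBI_gi", id="163483", SequenceString="ATGC")
    print(api["sequence_length"](seq))
```

Since both classes and functions are derived from the catalogue, adding a datatype or service to the catalogue is enough to make it available to client code.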
