Web Services as Integrators of Public Chemistry Databases Gary Wiggins School of Informatics Indiana University email@example.com
Interdisciplinary Science • “The boundary between the known and the unknown is where science flourishes.” --Michael Shermer • “There are known knowns… There are known unknowns… There are unknown unknowns…” --Donald Rumsfeld • BUT…. • What about UNKNOWN knowns?
zzzzzCAS • “In an industry first, Chemical Abstracts Service (CAS) has unveiled a revolutionary new literature searching tool which will permit scientists to search and retrieve the world’s chemical literature—including patents and obscure technical reports—in their sleep.” --Author unknown
Overview of the Talk • Introduction to Web Services • Public Databases and Data Repositories for Chemistry • Projects Underway or Planned at Indiana University
Web Services Introduction • What are “Web Services”? • A distributed invocation system built on Grid computing • Independent of platform and programming language • Built on existing Web standards • A service oriented architecture with • Interfaces based on Internet protocols • Messages in XML (except for binary data attachments)
Service Oriented Architecture (SOA) • Goal is to achieve loose coupling among interacting software agents • Define service: a unit of work done by a service provider to achieve desired end results for a service consumer • Both provider and consumer are roles played by software agents on behalf of their owners.
Web Services Architectures • Individual services are registered globally • Broken down into individual services with inputs and outputs specified • Services are published • Services are requested • Open registry, publishing, and requesting
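The register/publish/request cycle on this slide can be sketched in a few lines of Python. This is only an in-memory toy, not a real registry protocol: the `Registry` and `ServiceDescription` classes and the `count_atoms` service are hypothetical names invented for illustration, and a plain Python callable stands in for a network endpoint.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ServiceDescription:
    """Hypothetical record a provider publishes to the registry."""
    name: str
    inputs: List[str]
    outputs: List[str]
    endpoint: Callable[..., dict]  # stands in for a network address


class Registry:
    """Toy open registry: providers publish, consumers find and request."""

    def __init__(self) -> None:
        self._services: Dict[str, ServiceDescription] = {}

    def publish(self, description: ServiceDescription) -> None:
        self._services[description.name] = description

    def find(self, output: str) -> List[ServiceDescription]:
        # Discover services by what they produce, not by who provides them.
        return [s for s in self._services.values() if output in s.outputs]

    def request(self, name: str, **kwargs) -> dict:
        service = self._services[name]
        missing = [i for i in service.inputs if i not in kwargs]
        if missing:
            raise ValueError(f"missing inputs: {missing}")
        return service.endpoint(**kwargs)


def count_atoms(formula_counts: dict) -> dict:
    # Illustrative provider: sums element counts such as {"C": 2, "H": 6, "O": 1}.
    return {"atom_count": sum(formula_counts.values())}


registry = Registry()
registry.publish(ServiceDescription(
    name="count_atoms",
    inputs=["formula_counts"],
    outputs=["atom_count"],
    endpoint=count_atoms,
))

found = registry.find("atom_count")
result = registry.request("count_atoms", formula_counts={"C": 2, "H": 6, "O": 1})
print(found[0].name, result)
```

The consumer never imports the provider's code directly; it only sees the published inputs and outputs, which is the loose coupling the architecture aims for.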
Service-Oriented Architecture • From Curcin et al. DDT, 2005, 10(12),867
Web Services for Science • Invisible Services, Semantic Web, and Grid • Easy-to-use tools for any scientist • High throughput, resource intensive computing done for low cost/resources • Shared community • Collaborations between labs and fields • Shared data • Shared tools
e-Science and the Grid 1 • e-Science: Major UK Program • global collaboration in key areas of science and the next generation of infrastructure that will enable it • reflects growing importance of international laboratories, satellites and sensors and their integrated analysis by distributed teams • total investment of some £200M over the five-year period from 2001 to 2006 • CyberInfrastructure: the analogous US initiative • Grid Technology: supports e-Science & Cyberinfrastructure
Basic Architectures: Servlets/CGI and Web Services • [Figure: diagrams of the two architectures. Labels: Browser, GUI Client, Web Server, HTTP GET/POST, WSDL, SOAP, JDBC, DB or MPI Appl.]
When To Use Web Services? • Applications do not have severe restrictions on reliability and speed. • Two or more organizations need to cooperate. • One needs to write an application that uses another’s service. • Services can be upgraded independently of clients. • Services can be easily expressed with simple request/response semantics and simple state.
Web Services Benefits • Web services provide a clean separation between a capability and its user interface. • Increase in productivity • Increase in flexibility • Rapid return on investment • Integration across multiple applications
Web Services Advantages • Output in human- and computer-readable formats • I/O formats based on standard Internet protocols • Resources accessible server to server allow automated I/O • Integration based on specific services: you select services or data needed without downloading the entire data set
Web Services Advantages • Description protocols provide details of service provided and interface components • Semantic Web standards increase efficiency • Use a central registry and standardized description of services • Quality and status of the information is dynamically available
Web Services Drawbacks • Based on new technologies • Time and commitment required to learn • Standards still in a state of flux • Issues with quality of data (and, for chemistry, quantity of open data), security, and privacy
Components of Web Services • Protocols • WSDL • SOAP • UDDI • XML as a basis for the protocols • Ontologies • OWL: Web Ontology Language • Semantic Web
WSDL: Web Services Description Language • Describes a service’s interface to clients • Services register themselves with Web Services • WSDL describes how to contact and interact with services • I/O, operations and messages to aid interaction with client
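To make "operations and messages" concrete, the sketch below reads the operations out of a WSDL description using only Python's standard library. The WSDL fragment is a heavily trimmed, hypothetical example: the `ChemService` name, the `http://example.org/chem` namespace, and the `GetMolecularWeight` operation are all invented for illustration, and a real WSDL file would also carry binding and service sections.

```python
import xml.etree.ElementTree as ET

WSDL_NS = "http://schemas.xmlsoap.org/wsdl/"  # WSDL 1.1 namespace

# Hypothetical, trimmed WSDL fragment for an invented chemistry service.
WSDL_TEXT = """
<definitions xmlns="http://schemas.xmlsoap.org/wsdl/"
             xmlns:tns="http://example.org/chem"
             xmlns:xsd="http://www.w3.org/2001/XMLSchema"
             name="ChemService"
             targetNamespace="http://example.org/chem">
  <message name="GetMolecularWeightRequest">
    <part name="smiles" type="xsd:string"/>
  </message>
  <message name="GetMolecularWeightResponse">
    <part name="weight" type="xsd:double"/>
  </message>
  <portType name="ChemPortType">
    <operation name="GetMolecularWeight">
      <input message="tns:GetMolecularWeightRequest"/>
      <output message="tns:GetMolecularWeightResponse"/>
    </operation>
  </portType>
</definitions>
"""


def list_operations(wsdl_text: str) -> dict:
    """Map each operation name to its input and output message names."""
    ns = {"wsdl": WSDL_NS}
    root = ET.fromstring(wsdl_text.strip())
    operations = {}
    for port_type in root.findall("wsdl:portType", ns):
        for op in port_type.findall("wsdl:operation", ns):
            operations[op.get("name")] = {
                "input": op.find("wsdl:input", ns).get("message"),
                "output": op.find("wsdl:output", ns).get("message"),
            }
    return operations


operations = list_operations(WSDL_TEXT)
print(operations)
```

A client toolkit does essentially this step before generating call stubs: it learns what can be invoked and what each call expects and returns, without seeing the implementation.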
SOAP: Simple Object Access Protocol • Maps abstract WSDL service descriptions to concrete implementations • Flexible protocol to communicate information between server and server or client and server using XML • Supports Remote Procedure Calls • Allows layers (security, authentication, transactions) over the basic SOAP elements
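A rough illustration of a SOAP remote procedure call as an XML message follows. It builds and then reads back a SOAP 1.1 envelope with Python's standard library; nothing is sent over the network. The `http://example.org/chem` namespace, the `GetMolecularWeight` operation, and the `smiles` parameter are assumptions made up for this sketch. The empty `Header` element marks where layers such as security or transaction data would go.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
CHEM_NS = "http://example.org/chem"  # hypothetical service namespace

ET.register_namespace("soap", SOAP_NS)
ET.register_namespace("chem", CHEM_NS)


def build_request(operation: str, params: dict) -> str:
    """Wrap an RPC-style call in a SOAP envelope and return it as text."""
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    ET.SubElement(envelope, f"{{{SOAP_NS}}}Header")  # slot for layered extensions
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    call = ET.SubElement(body, f"{{{CHEM_NS}}}{operation}")
    for name, value in params.items():
        ET.SubElement(call, f"{{{CHEM_NS}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="unicode")


def read_request(xml_text: str) -> tuple:
    """Recover the operation name and parameters on the receiving side."""
    envelope = ET.fromstring(xml_text)
    call = envelope.find(f"{{{SOAP_NS}}}Body")[0]
    operation = call.tag.split("}", 1)[1]  # strip the "{namespace}" prefix
    params = {child.tag.split("}", 1)[1]: child.text for child in call}
    return operation, params


message = build_request("GetMolecularWeight", {"smiles": "CCO"})
operation, params = read_request(message)
print(message)
print(operation, params)
```

Because the message is plain XML text, the sender and receiver can be written in different languages and run on different platforms.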
UDDI: Universal Description, Discovery, and Integration • Provides ways for clients and services to interact with other services • Uses XML • Defines the means of access, e.g., • URL • E-Mail • Defines services hosted by an entity • Business-oriented tags • Uses SOAP for communicating
XML and Web services • XML lends itself to distributed computing: • It’s just a data description. • Platform, programming language independent • Web Services Description Language (WSDL) • Describes how to invoke a service • Can bind to SOAP, other protocols for actual invocation • Simple Object Access Protocol (SOAP) • Wire protocol extension for conveying RPC calls • Can be carried over HTTP, SMTP
RDF: Resource Description Framework • A standard for statements describing resources • RDF statements consist of a • Subject: the resource being described • Predicate: the property ascribed to the resource • Object: the entity to which the resource bears that property • An alternative to UDDI: said to be more flexible and extensible than UDDI
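The subject/predicate/object structure can be sketched with plain Python tuples. This is a toy store, not an RDF library: the `ex:` names are hypothetical identifiers invented for the example, and real RDF would use full URIs.

```python
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Hypothetical statements about two compounds.
triples: List[Triple] = [
    ("ex:ethanol", "rdf:type", "ex:Alcohol"),
    ("ex:ethanol", "ex:hasFormula", "C2H6O"),
    ("ex:methanol", "rdf:type", "ex:Alcohol"),
    ("ex:methanol", "ex:hasFormula", "CH4O"),
]


def match(store: List[Triple],
          s: Optional[str] = None,
          p: Optional[str] = None,
          o: Optional[str] = None) -> List[Triple]:
    """Return triples matching the pattern; None acts as a wildcard."""
    return [
        t for t in store
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


alcohols = [t[0] for t in match(triples, p="rdf:type", o="ex:Alcohol")]
ethanol_facts = match(triples, s="ex:ethanol")
print(alcohols)
print(ethanol_facts)
```

Every query is a pattern over the same three positions, which is why new kinds of statements can be added without changing any schema.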
RDFS: Resource Description Framework Schema • RDF statements ascribe properties to resources, but RDF provides no means for describing those properties. • RDFS defines classes and properties that are used to describe classes, properties, and other resources. • RDFS vocabulary descriptions are written in RDF. • RDFS provides a means of specifying a vocabulary for a particular domain.
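As a sketch of what a domain vocabulary adds, the example below writes a small class hierarchy as ordinary triples and follows `rdfs:subClassOf` links to infer the classes a resource belongs to. The `ex:` classes are hypothetical, and this covers only subclass inference, a small part of RDFS.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]

# Vocabulary and data live in the same store; the vocabulary is itself triples.
store: Set[Triple] = {
    ("ex:Alcohol", "rdfs:subClassOf", "ex:OrganicCompound"),
    ("ex:OrganicCompound", "rdfs:subClassOf", "ex:Compound"),
    ("ex:ethanol", "rdf:type", "ex:Alcohol"),
}


def superclasses(triples: Set[Triple], cls: str) -> Set[str]:
    """Collect every class reachable from cls via rdfs:subClassOf."""
    seen: Set[str] = set()
    stack = [cls]
    while stack:
        current = stack.pop()
        for s, p, o in triples:
            if s == current and p == "rdfs:subClassOf" and o not in seen:
                seen.add(o)
                stack.append(o)
    return seen


def types_of(triples: Set[Triple], resource: str) -> Set[str]:
    """Direct rdf:type classes plus everything implied by the hierarchy."""
    direct = {o for s, p, o in triples if s == resource and p == "rdf:type"}
    inferred = set(direct)
    for cls in direct:
        inferred |= superclasses(triples, cls)
    return inferred


ethanol_types = types_of(store, "ex:ethanol")
print(sorted(ethanol_types))
```

Only one type was asserted for ethanol; the other two follow from the vocabulary.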
OWL: Web Ontology Language • Builds on RDF and RDFS and adds a means for richer descriptions of properties and classes • Disjoint classes • Cardinality of classes • Characteristics of relations, like symmetry
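Two of these richer descriptions, symmetric relations and disjoint classes, can be imitated in plain Python to show what they let software check or infer. This is not an OWL reasoner: the `ex:` names are hypothetical, and cardinality is left out of the sketch.

```python
from typing import FrozenSet, List, Set, Tuple

Triple = Tuple[str, str, str]

# Hypothetical ontology statements.
symmetric_properties: Set[str] = {"ex:bondedTo"}
disjoint_pairs: Set[FrozenSet[str]] = {frozenset({"ex:Metal", "ex:Nonmetal"})}

facts: Set[Triple] = {
    ("ex:Na1", "ex:bondedTo", "ex:Cl1"),
    ("ex:Na1", "rdf:type", "ex:Metal"),
    ("ex:Cl1", "rdf:type", "ex:Nonmetal"),
}


def close_symmetric(triples: Set[Triple], symmetric: Set[str]) -> Set[Triple]:
    """Add the reversed statement for every symmetric property."""
    return triples | {(o, p, s) for s, p, o in triples if p in symmetric}


def disjointness_violations(triples: Set[Triple],
                            pairs: Set[FrozenSet[str]]) -> List[str]:
    """List resources typed with both members of a disjoint pair."""
    types = {}
    for s, p, o in triples:
        if p == "rdf:type":
            types.setdefault(s, set()).add(o)
    return sorted(
        s for s, classes in types.items()
        if any(pair <= classes for pair in pairs)
    )


closed = close_symmetric(facts, symmetric_properties)
clean = disjointness_violations(closed, disjoint_pairs)
broken = disjointness_violations(
    closed | {("ex:Na1", "rdf:type", "ex:Nonmetal")}, disjoint_pairs
)
print(clean, broken)
```

Symmetry produces new statements, while disjointness flags contradictory ones; both go beyond what RDFS alone can express.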
Standards for Web Services • Business Process Execution Language for Web Services (BPEL4WS) • OWL-S: Semantic Markup for Web Services • Web Service Modeling Ontology (WSMO)
Standards Setting Bodies: OASIS • OASIS: Organization for the Advancement of Structured Information Standards • ebXML: e-business XML • UDDI: Universal Description, Discovery and Integration • Global Grid Forum • community of users, developers, and vendors leading the global standardization effort for grid computing
Standards Setting Bodies: W3C • W3C: World Wide Web Consortium • OWL: Web Ontology Language • RDF/RDFS: Resource Description Framework/Schema • SOAP: Simple Object Access Protocol • URI/URL/URN: Uniform Resource Identifier/Locator/Name • WSDL: Web Services Description Language • XML: eXtensible Markup Language
Web Services Integration Projects: Biosciences • myGrid • http://www.mygrid.org.uk/ • BIOPIPE • http://biopipe.org/ • BioMOBY • http://biomoby.org/
Web Services for Chemistry: Problems • Performance and scalability • Proprietary data • Competition from high-performance desktop applications -- Geoff Hutchison, it’s a puzzle blog, 2005-01-05 • ALSO: • Lack of a substantial body of trustworthy Open Access databases • Non-standard chemical data formats (over 40 in regular use and requiring normalization to one another)
Necessary Ingredients in Chemistry • Chemical communities to assemble Open Access databases • Well-defined quality assurance procedures performed by distributed peer-review systems • Software underlying the databases needs to be open source.
BlueObelisk.org • A group of chemists, programmers, and informaticians working collaboratively on projects such as: • Chemistry Development Kit (CDK) • JChemPaint • Jmol • JUMBO • NMRShiftDB • Octet • Open Babel • QSAR • World Wide Molecular Matrix (WWMM)
Components of the Semantic Web for Chemistry • XML – eXtensible Markup Language • RDF – Resource Description Framework • RSS – Rich Site Summary • Dublin Core – allows metadata-based newsfeeds • OWL – for ontologies • BPEL4WS – for workflow and web services • Murray-Rust et al. Org. Biomol. Chem. 2004, 2, 3192-3203.
Chemistry Databases on the Web • Marc Nicklaus lists 37 databases as of October 2001 • Must have structure searching and at least 100 molecules • http://cactus.nci.nih.gov/ncidb2/chem_www.html • SoaringBear’s List has 15 databases • http://geocities.com/soaringbear/biomed/chem.html
Institutional Repositories • NARSTO Quality Systems Science Center • http://cdiac.esd.ornl.gov/programs/NARSTO/ • Pollutant species in the troposphere over North America • Part of the Carbon Dioxide Information Analysis Center at ORNL • NARSTO Data and Information Sharing Tool • http://mercury.ornl.gov/narsto/
Public Databases • Developmental Therapeutics Program/NCI • Some assay data for download • Structures for over 200,000 compounds • http://dtp.nci.nih.gov/docs/dtp_search.html • Zinc and other screening databases • NIST computational chemistry database • Environmental fate and exposure databases
Other Public Databases 1 • ChemExper Chemical Directory • > 200,000 substances; > 10,000 IR spectra • http://chemexper.com/ • HIC-Up: Hetero-compound Information Centre – Uppsala • 5384 substances as of 1/15/05 • http://xray.bmc.uu.se/hicup/ • Chemicals with Pharmaceutical Activity; a 3D Structural Database • 400 3D structures • http://www.chem.ox.ac.uk/mom/chemical-database/
Other Public Databases 2 • Cheminformatics.org • 41 data sets in 9 categories as of 8/18/05 • http://www.cheminformatics.org/ • WebReactions • http://webreactions.net/
Other Public Databases 3 • MolTable • http://www.moltable.org/ • MatWeb Materials Property Data • http://www.matweb.com/index.asp?ckck=1 • Spectral Database for Organic Compounds (SDBS) • Over 32,000 compounds • Has EI-MS, FT-IR, 1H NMR, 13C NMR, Raman, ESR • http://www.aist.go.jp/RIODB/SDBS/cgi-bin/cre_index.cgi • NMRShiftDB (Christoph Steinbeck) • 14,753 structures as of 8/19/05 • Features peer-reviewed submission of data sets • http://www.nmrshiftdb.org/
Comment on Link to PubChem • “It is great to know that - with PubChem - there will be something like a single point of entry for people interested in structures and their properties. We are looking forward to link our datasets to PubChem structures.” --Christoph Steinbeck, commenting on the NMRShiftDB, CHMINF-L, 19 April 2005
Other Public Databases: Commercial Teasers • FTIRsearch.com (Thermo Electron) • Demo file of 575 spectra from 87,000 in the full database • https://ftirsearch.com/default3.htm • ChemACX • Catalog data from 30 of >350 suppliers • http://chemacx.cambridgesoft.com/chemacx/index.asp • Sunset Molecular Discovery, LLC • Wombat (World of Molecular BioAcTivity) • 117,007 entries with over 230,000 biological activities • Wombat PK • Database for Clinical Pharmacokinetics: 643 substances with 4668 measurements • Three sample files from Wombat containing 341 Histamine-1 receptor antagonists • http://www.sunsetmolecular.com/
Indiana University Existing Projects • System for the Integration of Bioinformatics Services (SIBIOS) • http://sibios.engr.iupui.edu • PlatCom: A Platform for Computational Comparative Genomics • http://bio.informatics.indiana.edu/sunkim/Platcom/ • Reciprocal Net • http://www.reciprocalnet.org/index.html
Indiana University Planned Projects • Design of a Grid-based distributed data architecture • Development of tools for HTS data analysis and virtual screening • Database for quantum mechanical simulation data • Chemical prototype projects • Novel routes to enzymatic reaction mechanisms • Mechanism-based drug design • Data-inquiry-based development of new methods in natural product synthesis
Web Services Future • Depends on • Adoption of standards • Incorporation of WS in current and newly developed applications • Security, privacy, quality of data issues • Development of WS tools and resources for e-Science
“Tripos Embraces Web Services” • Goal: to make it easier to incorporate Tripos’ high-throughput workflows • Tripos’ Service-Oriented Informatics (SOI) • In partnership with SciTegic: Web services ensure software compatibility between the companies’ products.