DAS/2: Next Generation Distributed Annotation System

Gregg Helt (1), Steve Chervitz (1), Tony Cox (2), Andrew Dalke (3), Allen Day (4), Ed Erwin (1), Ed Griffiths (2), and Lincoln Stein (4). (1) Affymetrix, Inc.; (2) Sanger Institute; (3) Dalke Scientific; (4) Cold Spring Harbor Laboratory


Presentation Transcript


  1. DAS/2: Next Generation Distributed Annotation System • Gregg Helt (1), Steve Chervitz (1), Tony Cox (2), Andrew Dalke (3), Allen Day (4), Ed Erwin (1), Ed Griffiths (2), and Lincoln Stein (4) • (1) Affymetrix, Inc.; (2) Sanger Institute; (3) Dalke Scientific; (4) Cold Spring Harbor Laboratory; (5) University of Alabama

  2. Distributed Annotation System (DAS) Overview • A specification designed for sharing genome annotations • Defines client requests and server responses • Simplified Web Services approach: HTTP GET, URLs, XML • Intended to be simple to implement • No central annotation authority • Intended to support client-side integration of annotations from different servers • First draft specification Spring 2000 • Last major change to DAS1 was Spring 2002 • Grant from NIH awarded June 2004 for development of next-generation DAS/2

  3. DAS: Multiple Servers, Multiple Clients • [Diagram: one reference server and three annotation servers; sequence and marker labels shown include AC003027, M10154, AC005122, WI1029, AFM820, AFM1126, WI443]

  4. Widespread Adoption of DAS/1 • Server Implementations • Dazzle, ProServer, LDAS • Server sites • Ensembl, UCSC, TIGR, KEGG, WormBase, Affymetrix, etc. • Clients • GBrowse, Ensembl, Dasty, IGB • Libraries • BioPerl, BioJava, JDAS • DAS Extensions • GeneDAS (non-positional annotations) • DAS web services registry • SPICE (protein structures) • DALEC (asynchronous analysis)

  5. Ensembl is an ensemble of DAS servers

  6. GBrowse on Ensembl

  7. Distributed GBrowse • [Diagram labels: MODs; GBrowse 1; GBrowse 2; DAS (three times); My GBrowse; Ensembl; UCSC]

  8. DAS Limitations • No ontology (controlled vocabulary) of feature types. • Is a “gene” from DAS server 1 the same as a “gene” from DAS server 2? • Not particularly extensible. • Ambiguous semantics for retrieving features that overlap a range on the genome.

  9. Development of DAS/2 Specification • Enhancements have largely been motivated by initial discussions on the DAS mailing list. • Series of RFCs collected • Though informal, still a long process! • Most recent DAS/2 draft specification is available at http://biodas.org/documents/das2/das2_protocol.html (tied to CVS repository), so anyone can review and comment • Feedback from the DAS developer and user communities will continue to guide future iterations of the DAS/2 specification

  10. Preserving DAS1 Strengths in DAS/2 • Specification is independent of implementation • Many server implementations • Many client implementations • Simple, simple, simple • HTTP for transport • URLs for queries • XML for responses • REST-like style • Ontologies are integral • Focus on location-based annotations of biological sequences

  11. Basic DAS/2 Queries • Sources query: what genomes and versions of those genomes are available? • http://server/das/genome • Regions query: what annotated sequences are available for a given version of a genome? • http://server/das/genome/[genome]/[version]/region • Types query: what annotation types are available for a given genome version? • http://server/das/genome/[genome]/[version]/type • Range query: return all annotations of a given type that overlap a genomic region • http://server/das/genome/[genome]/[version]/feature?overlaps=[seq/min:max];type=[type]
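
The four query shapes on this slide can be sketched as a small URL builder. This is only an illustration of the URL templates shown above, not code from any DAS/2 implementation: the function names `das2_urls` and `range_query` and the example genome, version, and type values are assumptions made up for the sketch.

```python
from urllib.parse import quote


def das2_urls(server, genome, version):
    """Build the sources, regions, types, and feature URLs from the slide's templates.

    Hypothetical helper: the names here are illustrative, not part of DAS/2.
    """
    base = f"{server.rstrip('/')}/das/genome"
    versioned = f"{base}/{quote(genome)}/{quote(version)}"
    return {
        "sources": base,                    # which genomes and versions exist?
        "regions": f"{versioned}/region",   # which annotated sequences exist?
        "types": f"{versioned}/type",       # which annotation types exist?
        "features": f"{versioned}/feature", # base URL for range queries
    }


def range_query(server, genome, version, seq, lo, hi, feature_type):
    """Return a range-query URL of the form feature?overlaps=[seq/min:max];type=[type]."""
    feature_url = das2_urls(server, genome, version)["features"]
    return f"{feature_url}?overlaps={seq}/{lo}:{hi};type={feature_type}"


if __name__ == "__main__":
    # Example values are invented for illustration only.
    print(range_query("http://server", "human", "17", "chr1", 1000, 2000, "gene"))
```

Because every query is a plain URL, a client needs nothing more than an HTTP GET to issue any of them.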

  12. DAS/2 Enhancements: Ontologies • All features are required to be described by an ontology • What is the feature? • Gene, mRNA, transposable_element… • What are attributes of the feature? • Polycistronic_mRNA, programmed_frameshift… • Sequence ontology (SO) is the default (song.sourceforge.net) • Can be changed & extended • ~500 terms in all • Standard OBO format • Feature hierarchy allows features to be contained within others: e.g. gene->mRNA->CDS

  13. DAS/2 Enhancements: Performance • One of the biggest complaints about DAS1 • Very verbose annotation XML • DAS/2 Solution #1: Refactoring annotation XML • Much smaller minimum footprint • DAS/2 Solution #2: Alternative return formats • All servers can return defined das2xml annotation format • Servers can also specify additional return formats per annotation type • Clients can choose from alternative formats if they desire • Not restricted to XML, or even text • Examples: GFF3, BED, PSL, GAME • Extreme performance improvements possible

  14. DAS/2 Enhancements: Resolving Ambiguities • Example: Ambiguous Range Queries • Overlap or containment? • Parent based or separate? • [Diagram: query range = x:y, with differing responses from Server 1, Server 2, Server 3, and Server 4]

  15. DAS/2 Solution #1 – remove spec ambiguity • Specify that if parent meets region filter, also return all children • Specify whether overlap, containment, etc. • Add different region filters for different possibilities • Overlaps • Contains • Within • Identical • Allow boolean combinations of these and other filters in the query URL
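
A minimal sketch of how the four region filters and the parent/child rule might behave. Several things here are assumptions rather than statements from the slides: coordinates are treated as half-open intervals, "contains" is read as "the feature contains the query range", "within" as "the feature lies inside the query range", and the `Feature` class and `select` function are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Feature:
    """Hypothetical feature record; start/end are half-open (an assumption)."""
    name: str
    start: int
    end: int
    children: list = field(default_factory=list)


# One predicate per region filter named on the slide.
FILTERS = {
    "overlaps": lambda f, lo, hi: f.start < hi and f.end > lo,
    "contains": lambda f, lo, hi: f.start <= lo and f.end >= hi,
    "within": lambda f, lo, hi: f.start >= lo and f.end <= hi,
    "identical": lambda f, lo, hi: f.start == lo and f.end == hi,
}


def descendants(feature):
    """Yield every child, grandchild, and so on, depth first."""
    for child in feature.children:
        yield child
        yield from descendants(child)


def select(top_level, kind, lo, hi):
    """Apply the filter to top-level features; a passing parent brings all its children."""
    test = FILTERS[kind]
    names = []
    for f in top_level:
        if test(f, lo, hi):
            names.append(f.name)
            names.extend(c.name for c in descendants(f))
    return names


if __name__ == "__main__":
    cds = Feature("CDS1", 200, 800)
    mrna = Feature("mRNA1", 100, 900, [cds])
    gene = Feature("gene1", 100, 900, [mrna])
    # CDS1 does not itself overlap 850:1000, but it is returned with its parent.
    print(select([gene], "overlaps", 850, 1000))
```

Testing only the parent and returning the whole hierarchy is what removes the "parent based or separate?" ambiguity: every server gives the same answer for the same query.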

  16. DAS/2 filter spec allows client query optimization • [Diagram: earlier queries QueryL and QueryR, current query QueryC over x:y, bounds L and R] • Keep track of overlap bounds of all previous queries • Instead of filter = “overlaps:S/x:y”, use filter = “overlaps:S/x:y; within:S/L:R” • If annotation A is not contained within L:R, then either: i) its bounds cross L, in which case it must overlap QueryL; ii) its bounds cross R, in which case it must overlap QueryR; iii) both • Therefore, if the client has used this approach for all previous queries (and restricts other filtering to a single “type” filter), then QueryC will return no annotations that were already returned by a previous query
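
The argument on this slide can be checked with a small simulation. This is a sketch under stated assumptions, not the reference client's code: intervals are half-open, new query ranges never overlap earlier ones, the server is replaced by an in-memory list of `(start, end)` tuples, and the `Client` class and its method names are hypothetical.

```python
def overlaps(a_lo, a_hi, lo, hi):
    return a_lo < hi and a_hi > lo


def within(a_lo, a_hi, lo, hi):
    return a_lo >= lo and a_hi <= hi


class Client:
    """Hypothetical client that remembers every range it has already queried."""

    def __init__(self, seq_length):
        self.seq_length = seq_length
        self.queried = []  # (lo, hi) ranges requested so far on one sequence

    def gap_bounds(self, x, y):
        """L = end of the nearest earlier query to the left, R = start of the nearest to the right."""
        L = max((hi for lo, hi in self.queried if hi <= x), default=0)
        R = min((lo for lo, hi in self.queried if lo >= y), default=self.seq_length)
        return L, R

    def filter_string(self, seq, x, y):
        """Render the combined filter in the form shown on the slide."""
        L, R = self.gap_bounds(x, y)
        return f"overlaps:{seq}/{x}:{y}; within:{seq}/{L}:{R}"

    def fetch(self, annotations, x, y):
        """Simulate the server applying both filters, then record the query."""
        L, R = self.gap_bounds(x, y)
        hits = [a for a in annotations if overlaps(*a, x, y) and within(*a, L, R)]
        self.queried.append((x, y))
        return hits


if __name__ == "__main__":
    annotations = [(10, 60), (50, 350), (120, 180), (200, 320), (310, 390)]
    client = Client(seq_length=1000)
    print(client.fetch(annotations, 0, 100))    # plays the role of QueryL
    print(client.fetch(annotations, 300, 400))  # plays the role of QueryR
    print(client.filter_string("S", 150, 250))  # filter that QueryC would send
    print(client.fetch(annotations, 150, 250))  # QueryC: only unseen annotations
```

In this run each annotation comes back exactly once across the three queries, which is the property the slide argues for: anything that crosses L or R was already delivered by the neighbouring query.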

  17. Solution #2: DAS/2 Validation Suite • Verify whether a DAS/2 server is compliant with the specification. • Critical for improving interoperability between clients and servers developed by different groups. • Standalone tool and web application, written in Python • Enter a URL for a DAS/2 server • Get an HTML report about DAS/2 compliance • Reference dataset • Sequences and annotations that can be loaded into a DAS/2 server for additional validation of server implementation/configuration • Source code available at: http://sourceforge.net/projects/dasypus/

  18. More DAS/2 Spec Enhancements • “Writeback” spec to allow DAS/2 clients to create and edit annotations on DAS/2 servers • Still undergoing development • IDs are URIs • Could be LSIDs or URLs • Allows for integration with many other web technologies • xml:base • Feature hierarchies • And more…

  19. DAS/2 UML Modeling

  20. DAS/2 Reference Server • Implemented as an Apache/mod_perl 2.0 content handler • Annotations are converted to BioPerl objects and subsequently text-transformed using Template Toolkit. • Datasources are accessible using an adaptor pattern • Current adaptor is for Chado (GMOD schema) • Soon any datasource accessible to the Generic Genome Browser (GBrowse) will be accessible from the DAS/2 server. • Flatfile formats: GenBank, GFF • Databases: Ensembl, GMOD/Chado, Bio::DB::GFF • DAS1 web service • Source code released under Artistic License • Available via anonymous CVS as part of GMOD • See http://www.gmod.org for access details.

  21. DAS/2 Reference Client • Implemented in Java in the Integrated Genome Browser • IGB (“ig-bee”) - A visualization app developed at Affymetrix • Supports data loading via a variety of formats and mechanisms • Full implementation of DAS/2 read client, partial implementation of DAS/2 writeback. • Handles large amounts of genome-scale data • Loads hundreds of thousands of sequence annotations at once • Loads dense quantitative graphs with millions of data points • Maintains real-time responsiveness to user interactions • Includes features to support exploratory data analysis • Plugin architecture for customized extensions • Source code released under Common Public License • http://genoviz.sourceforge.net

  22. Upcoming DAS/2 Developments • Writeback protocol • Ready for implementation • Registry and discovery protocol • Various alternatives have been discussed • A “playpen server” available at EBI

  23. DAS/2 & caBIG • Project 1: Add DAS/2 support to caCORE • Will enable caCORE to read genome annotations from DAS/2 servers and re-export as caCORE objects. • Uses a flexible plug-in architecture that will be generally useful. • Project 2: Export HapMap database as DAS/2 • Will make HapMap human variation data available to caBIG grid via caCORE. • Project 3: Export Vertebrate Promoter Database as DAS/2 • Will make curated information on vertebrate transcription factors and their binding sites available to caBIG grid via caCORE.

  24. Acknowledgements • DAS & DAS2 mailing list participants! • Lincoln Stein (CSHL) • Ed Erwin, Steve Chervitz, Eric Blossom, Hari Tammara (Affymetrix) • Tony Cox, Ed Griffiths (Sanger Institute) • Allen Day, Brian O’Connor (UCLA) • Andrew Dalke (Dalke Consulting) • Suzanna Lewis (LBL) • Ann Loraine (U. of Alabama)
