
BNS: An LDAP-based Biomolecule Naming Service

Robert Kincaid, Daniel Kluesing, Aditya Vailaya


Presentation Transcript


  1. BNS: An LDAP-based Biomolecule Naming Service • Robert Kincaid • Daniel Kluesing • Aditya Vailaya

  2. Outline • Problem statement and design goals • BNS architecture • BNS use cases • LDAP • Final thoughts

  3. Problem • There is an increasing need to connect related genomic and proteomic measurements • However, no universally accepted/used identifiers exist for biomolecules (GenBank, RefSeq, Unigene, PIR, Swiss-Prot … ) • High-throughput measurements make manual association of related measurements impractical • We need a practical solution that uses today’s data

  4. Initial Motivating Use Cases • Generate a “view” of data that is formed by the “join” of: • A microarray and a protein array • A microarray and mass spec proteomics data • An Agilent and a brand X microarray • A commercial oligo array and a home-brew cDNA array

  5. Solution • A high-speed biomolecule name/ID resolver • Converts between different identifier schemes based on gene locus or transcript • Converts between different states of transcription: gene -> transcript -> protein • Converts between gene symbols and aliases • Easy to deploy and to code applications against • Platform and language neutral • Explores the research questions of feasibility and usefulness of: (1) a name/ID resolver, (2) LDAP
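
The conversions listed on this slide can be illustrated with a toy in-memory resolver. This is only a sketch of the idea, not the BNS implementation: the class and method names (`ResolverSketch`, `locusOf`, `pairOf`) are hypothetical, and the sample identifiers are taken from the example entry on slide 9.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy in-memory name/ID resolver (hypothetical names; not the BNS API). */
class ResolverSketch {
    // Any known identifier (symbol, UniGene ID, accession) -> locus ID.
    private final Map<String, String> idToLocus = new HashMap<>();
    // Transcript accession <-> protein accession, stored in both directions.
    private final Map<String, String> transcriptPair = new HashMap<>();

    void addIdentifier(String id, String locus) {
        idToLocus.put(id, locus);
    }

    void addTranscript(String locus, String mrna, String protein) {
        idToLocus.put(mrna, locus);
        idToLocus.put(protein, locus);
        transcriptPair.put(mrna, protein);
        transcriptPair.put(protein, mrna);
    }

    /** Returns the locus for an identifier, or null if it is unknown. */
    String locusOf(String id) {
        return idToLocus.get(id);
    }

    /** Converts an mRNA accession to its protein accession or back; null if unknown. */
    String pairOf(String accession) {
        return transcriptPair.get(accession);
    }

    /** Builds a resolver holding the single locus from the slide 9 example entry. */
    static ResolverSketch demo() {
        ResolverSketch r = new ResolverSketch();
        r.addIdentifier("A1BG", "1");
        r.addIdentifier("Hs.373554", "1");
        r.addTranscript("1", "NM_130786", "NP_570602");
        return r;
    }

    public static void main(String[] args) {
        ResolverSketch r = demo();
        System.out.println(r.pairOf("NM_130786"));
        System.out.println(r.locusOf("Hs.373554"));
    }
}
```

Keying every identifier to a locus is what makes the looser "same locus" association possible even when two data sets share no identifier scheme.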

  6. System Is Not • A sequence database • Primarily an annotation system • Intended to be updated by users • An object/interface naming service • A complete, definitive system

  7. BNS: Biomolecule Naming Service • Research prototype • Based on LDAP for easy deployment and wide platform support • Derived from LocusLink data • [Architecture diagram; labels: NCBI; FTP (via HTTP proxy); LocusLink download and conversion scripts; LDAP-based name server; LDAP protocol; LDAP API; BNS name/ID resolver; BNS API; client application]

  8. Directory Structure • LDAP Schema

  9. Example Entry (LDIF)

     dn: locus=1,org=Homo sapiens,dc=BNS
     objectClass: bnsobject
     locus: 1
     sym: A1BG
     name: alpha-1-B glycoprotein
     ug: Hs.373554
     summary: The protein encoded by this gene is a plasma glycopro . . .
     org: Homo sapiens
     chr: 19q13.4
     altsym: A1B
     altsym: ABG
     altsym: GAB
     gbaccn: AC010642
     . . .
     gbaccn: W25099

     dn: transcript=NM_130786,locus=1,org=Homo sapiens,dc=BNS
     objectClass: bnstranscript
     locus: 1
     transcript: NM_130786
     nm: NM_130786
     np: NP_570602
     prod: alpha 1B-glycoprotein
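
To show how an entry like the one on this slide breaks down into attributes, here is a minimal sketch of an LDIF reader. It assumes simple `attr: value` lines only (no base64 values, no folded lines), and the class name `LdifSketch` is hypothetical. Repeated attributes such as `altsym` are collected into a list.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Minimal reader for simple "attr: value" LDIF lines (no base64, no line folding). */
class LdifSketch {
    /** Parses one entry into attribute -> values, keeping repeated attributes in order. */
    static Map<String, List<String>> parseEntry(String ldif) {
        Map<String, List<String>> entry = new LinkedHashMap<>();
        for (String line : ldif.split("\n")) {
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // skip blank lines and elisions such as ". . ."
            }
            String attr = line.substring(0, colon).trim();
            String value = line.substring(colon + 1).trim();
            entry.computeIfAbsent(attr, k -> new ArrayList<>()).add(value);
        }
        return entry;
    }

    // A shortened copy of the locus entry shown on the slide.
    static final String SAMPLE = String.join("\n",
        "dn: locus=1,org=Homo sapiens,dc=BNS",
        "objectClass: bnsobject",
        "locus: 1",
        "sym: A1BG",
        "altsym: A1B",
        "altsym: ABG",
        "altsym: GAB");

    public static void main(String[] args) {
        Map<String, List<String>> entry = parseEntry(SAMPLE);
        System.out.println(entry.get("sym"));
        System.out.println(entry.get("altsym"));
    }
}
```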

  10. Object Model
     • BNSConnection
        • Connect/disconnect to LDAP server (local or remote): connect(String url, String org)
        • Query and lookup functions: BNSObject lookupID(String id); String resolveTranscriptPair(String refseqID); List lookupSymbolList(String symbol)
     • BNSObject
        • Returned by query/lookup methods
        • Get/set methods for attributes
        • Various text output functions provided for convenience: toString(), toText(), toTabbedText(), toHTML()

  11. Example

     Java code:

        try {
            // STEP 1: Connect to the LDAP server
            conn.connect("ldap://localhost");
            // STEP 2: Do some BNS calls
            System.out.println(conn.lookupSymbol("ABL1").toText());
            // STEP 3: Disconnect - That's all there is to it!
            conn.disconnect();
        }
        catch (BNSException e) {
            e.printStackTrace();
        }

     Output:

        LOCUS          25
        SYMBOL         ABL1
        ALIAS          ABL, JTK7, p150, c-ABL
        DESCRIPTION    v-abl Abelson murine leukemia viral oncogene homolog 1
        UNIGENE ID     Hs.14635
        GENBANK        K00009, AAA51895, M13099, AAA51896, U07563, AAB60393, AAB60394, . . .
        TRANSCRIPTS    NM_005157, NP_005148, , v-abl Abelson murine leukemia viral oncogene homolog 1 isoform a
                       NM_007313, NP_009297, , v-abl Abelson murine leukemia viral oncogene homolog 1 isoform b
        GENE ONTOLOGY  cellular component : 0005634 : nucleus
                       biological process : 0007048 : oncogenesis
        . . .

  12. Example

     Java code:

        try {
            // STEP 1: Connect to the LDAP server
            conn.connect("ldap://localhost");
            // STEP 2: Do some BNS calls
            System.out.println(conn.resolveTranscriptionPair("NM_000018"));
            System.out.println(conn.resolveTranscriptionPair("NP_000009"));
            System.out.println(conn.resolveSymbol("PSCP"));
            System.out.println(conn.lookupSymbol("A1BG").get_description());
            // STEP 3: Disconnect - That's all there is to it!
            conn.disconnect();
        }
        catch (BNSException e) {
            e.printStackTrace();
        }

     Output:

        NP_000009
        NM_000018
        BRCA1
        alpha-1-B glycoprotein

  13. A real case: joining Microarray and MS data* • Microarray: 12626 genes (GenBank/UniGene IDs), of which BNS resolved 9419 (75%) to a locus • Mass spec proteomics: 741 protein IDs (RefSeq/GenBank IDs), of which BNS resolved 441 (60%) to a locus • 359 MS IDs matched to microarray features (48%) * Data provided by Joel Sevinsky and Natalie Ahn, Dept. of Chemistry and Biochemistry, University of Colorado, Boulder
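
The join on this slide amounts to mapping both ID lists to loci and intersecting them. The sketch below illustrates that step with a plain map standing in for the resolver; the class name `LocusJoinSketch` and the two unresolvable IDs (`Hs.0`, `NP_999999`) are hypothetical, while the other identifiers come from slides 9 and 11. The `percent` helper reproduces the rounded percentages quoted on the slide.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Joins two identifier lists on a shared locus (illustrative only). */
class LocusJoinSketch {
    /** Returns protein ID -> microarray ID for every protein ID whose locus is also on the array. */
    static Map<String, String> join(List<String> arrayIds, List<String> proteinIds,
                                    Map<String, String> idToLocus) {
        Map<String, String> arrayIdByLocus = new HashMap<>();
        for (String id : arrayIds) {
            String locus = idToLocus.get(id);
            if (locus != null) {
                arrayIdByLocus.putIfAbsent(locus, id); // keep the first feature per locus
            }
        }
        Map<String, String> matches = new LinkedHashMap<>();
        for (String id : proteinIds) {
            String locus = idToLocus.get(id);
            if (locus != null && arrayIdByLocus.containsKey(locus)) {
                matches.put(id, arrayIdByLocus.get(locus));
            }
        }
        return matches;
    }

    /** Percentage rounded to the nearest whole number. */
    static long percent(int part, int whole) {
        return Math.round(100.0 * part / whole);
    }

    /** Small stand-in for the resolver: identifier -> locus. */
    static Map<String, String> demoIndex() {
        Map<String, String> idToLocus = new HashMap<>();
        idToLocus.put("Hs.373554", "1");
        idToLocus.put("NP_570602", "1");
        idToLocus.put("Hs.14635", "25");
        idToLocus.put("NP_005148", "25");
        return idToLocus;
    }

    public static void main(String[] args) {
        List<String> arrayIds = Arrays.asList("Hs.373554", "Hs.0");        // Hs.0 is made up
        List<String> proteinIds = Arrays.asList("NP_570602", "NP_999999"); // NP_999999 is made up
        System.out.println(join(arrayIds, proteinIds, demoIndex()));
        System.out.println(percent(9419, 12626) + "% " + percent(441, 741) + "% " + percent(359, 741) + "%");
    }
}
```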

  14. High Throughput Use Cases
     • Annotation of biomolecule lists (example: microarray annotation, analysis)
          bnsConnection.lookupID("NM_00018").toTabbedText();
     • Ad-hoc creation of biomolecule lists via query (example: create a theme-based microarray)
          List bnsObjects = bnsConnection.query("godesc=*onco*");
     • Merging biomolecule data with varied identifiers (example: joining high throughput measurements)
          bnsConnection.resolveTranscript("NM_00018");
          bnsConnection.lookupID("NM_00018").getUnigene();
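
A query such as `godesc=*onco*` has the shape of an LDAP search filter. As a rough sketch of what a call like this could look like underneath, the code below uses the standard JNDI LDAP provider directly; it is not taken from the BNS source. The class name `LdapQuerySketch`, the choice to return the `sym` attribute, and any base DN passed in are assumptions (attribute names follow the slide 9 entry). `searchSymbols` needs a live LDAP server, so `main` only prints the filter.

```java
import java.util.ArrayList;
import java.util.Hashtable;
import java.util.List;
import javax.naming.Context;
import javax.naming.NamingEnumeration;
import javax.naming.NamingException;
import javax.naming.directory.Attribute;
import javax.naming.directory.DirContext;
import javax.naming.directory.InitialDirContext;
import javax.naming.directory.SearchControls;
import javax.naming.directory.SearchResult;

/** Sketch of an LDAP substring search via JNDI (hypothetical helper, not the BNS API). */
class LdapQuerySketch {
    /** Escapes a literal value for use inside an LDAP search filter (RFC 4515). */
    static String escape(String value) {
        StringBuilder sb = new StringBuilder();
        for (char c : value.toCharArray()) {
            switch (c) {
                case '\\': sb.append("\\5c"); break;
                case '*':  sb.append("\\2a"); break;
                case '(':  sb.append("\\28"); break;
                case ')':  sb.append("\\29"); break;
                case '\0': sb.append("\\00"); break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Builds a "contains" filter such as (godesc=*onco*). */
    static String containsFilter(String attr, String fragment) {
        return "(" + attr + "=*" + escape(fragment) + "*)";
    }

    /** Runs a subtree search and returns the "sym" attribute of each hit. Needs a live server. */
    static List<String> searchSymbols(String url, String baseDn, String filter)
            throws NamingException {
        Hashtable<String, String> env = new Hashtable<>();
        env.put(Context.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.ldap.LdapCtxFactory");
        env.put(Context.PROVIDER_URL, url);
        DirContext ctx = new InitialDirContext(env);
        try {
            SearchControls controls = new SearchControls();
            controls.setSearchScope(SearchControls.SUBTREE_SCOPE);
            NamingEnumeration<SearchResult> hits = ctx.search(baseDn, filter, controls);
            List<String> symbols = new ArrayList<>();
            while (hits.hasMore()) {
                Attribute sym = hits.next().getAttributes().get("sym");
                if (sym != null) {
                    symbols.add(String.valueOf(sym.get()));
                }
            }
            return symbols;
        } finally {
            ctx.close();
        }
    }

    public static void main(String[] args) {
        // Only the filter is built here; no server is contacted.
        System.out.println(containsFilter("godesc", "onco"));
    }
}
```

Escaping the user-supplied fragment keeps characters such as `*` or `(` from changing the meaning of the filter.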

  15. High Throughput Use Cases
     • Normalizing biomolecule IDs to a common scheme (example: microarray annotation)
          bnsConnection.lookupID("NM_00018").get_unigene();
     • Validating gene symbols (example: text mining)
          if (bnsConnection.lookupSymbol("PSCP") != null)
     • Normalizing symbols to the official/preferred symbol (example: text mining, microarray annotation)
          officialSym = bnsConnection.lookupSymbol("PSCP").get_sym();
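
The validation and normalization cases on this slide can be pictured as a lookup from any alias to the official symbol. The following is a toy in-memory sketch with hypothetical names (`SymbolNormalizerSketch`, `isKnown`, `normalize`); the symbols and aliases are those shown on slides 9 and 11. It assumes each alias maps to a single gene, which real alias data does not guarantee.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy alias -> official symbol table for text-mining style normalization. */
class SymbolNormalizerSketch {
    // Alias or official symbol -> official symbol. A later add() overwrites a clashing alias.
    private final Map<String, String> official = new HashMap<>();

    void add(String officialSymbol, String... aliases) {
        official.put(officialSymbol, officialSymbol);
        for (String alias : aliases) {
            official.put(alias, officialSymbol);
        }
    }

    /** Validation: is this token a known symbol or alias? */
    boolean isKnown(String token) {
        return official.containsKey(token);
    }

    /** Replaces each recognized token by its official symbol and drops the rest. */
    List<String> normalize(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            String symbol = official.get(token);
            if (symbol != null) {
                out.add(symbol);
            }
        }
        return out;
    }

    static SymbolNormalizerSketch demo() {
        SymbolNormalizerSketch n = new SymbolNormalizerSketch();
        n.add("A1BG", "A1B", "ABG", "GAB");
        n.add("ABL1", "ABL", "JTK7", "p150", "c-ABL");
        return n;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("ABL", "binds", "GAB", "ABL1");
        System.out.println(demo().normalize(tokens));
    }
}
```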

  16. Low Throughput Use Cases
     • Lookup single ID
          BNSObject bnsObj = bnsConnection.lookupID("NM_00018");
     • Display data for single ID (example: popup information dialog)
          bnsObj.toHTML();

  17. BNS Findings • A system like BNS is extremely useful and efficient: • Novel uses of genomic/proteomic data emerged beyond simple joins: text mining, annotation operations, chromosome mapping, etc. • Flexible range of associations possible: exact ID matches, transcript/product matches or looser locus matches • Simpler programming model than typical database access methods • Standardized object models and interfaces for performing “routine” name/ID operations would enable rapid development of applications

  18. Why LDAP • Sequence data easily conforms to a hierarchical directory structure • Sequence databases are often lookup only and are not updated by users (cf. SRS and flat file databases) • LDAP is scalable from very low end systems (slow laptops) to shared high-end servers • Cross-platform, variety of language support, flexible back-ends, open standard • Access control and security • Good performance for minimal cost
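
The first point, that this data fits a hierarchical directory, is visible in the distinguished names of slide 9: a transcript entry sits under its locus, which sits under its organism. The sketch below walks such a DN with the standard `javax.naming.ldap.LdapName` class; the helper class `DnHierarchySketch` and its methods are hypothetical.

```java
import javax.naming.InvalidNameException;
import javax.naming.ldap.LdapName;
import javax.naming.ldap.Rdn;

/** Reads the organism/locus/transcript hierarchy out of a DN. */
class DnHierarchySketch {
    private static LdapName parse(String dn) {
        try {
            return new LdapName(dn);
        } catch (InvalidNameException e) {
            throw new IllegalArgumentException("not a valid DN: " + dn, e);
        }
    }

    /** Returns the value of the first RDN of the given type, or null if absent. */
    static String rdnValue(String dn, String type) {
        for (Rdn rdn : parse(dn).getRdns()) {
            if (rdn.getType().equalsIgnoreCase(type)) {
                return String.valueOf(rdn.getValue());
            }
        }
        return null;
    }

    /** Returns the parent DN, one level up the directory tree. */
    static String parentOf(String dn) {
        LdapName name = parse(dn);
        // Index 0 is the right-most RDN, so dropping the last index drops the leaf.
        return name.getPrefix(name.size() - 1).toString();
    }

    public static void main(String[] args) {
        String dn = "transcript=NM_130786,locus=1,org=Homo sapiens,dc=BNS";
        System.out.println(rdnValue(dn, "org"));
        System.out.println(rdnValue(dn, "locus"));
        System.out.println(parentOf(dn));
    }
}
```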

  19. LDAP Issues
     • Approaches problem in a unique way
        • Can be confusing to newcomers
        • Easily overcome with modest experience
     • Potential rate and quantity of individual BNS queries is far beyond the expectations of email address book applications
        • Seems to work in practice
        • Assumed solvable by scalability
     • More difficult to proxy through firewalls than HTML-based solutions
        • Socksification possible (trivial with Java)

  20. LDAP Supports Distributed Architecture • Query referral enables transparent federated searching across widely distributed data servers • Data is replicated from central curation server • [Diagram; labels: client application; BNS API; BNS name/ID resolver; LDAP API; LDAP protocol; LDAP-based name server (three instances); LocusLink (two instances); OUTASIGHT private data]
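
On the client side, following referrals is a matter of configuration when the standard JNDI LDAP provider is used. The sketch below builds such an environment; it is an illustration under that assumption, not the BNS client code, and the class name `ReferralEnvSketch` and the example URL are hypothetical.

```java
import java.util.Hashtable;
import javax.naming.Context;

/** Builds a JNDI environment that asks the LDAP provider to follow referrals. */
class ReferralEnvSketch {
    static Hashtable<String, String> followReferralsEnv(String url) {
        Hashtable<String, String> env = new Hashtable<>();
        env.put(Context.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.ldap.LdapCtxFactory");
        env.put(Context.PROVIDER_URL, url);
        // "follow" chases referrals automatically; the other values are "ignore" and "throw".
        env.put(Context.REFERRAL, "follow");
        return env;
    }

    public static void main(String[] args) {
        // Passing this environment to new InitialDirContext(env) would open the connection.
        System.out.println(followReferralsEnv("ldap://localhost").get(Context.REFERRAL));
    }
}
```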

  21. LDAP Findings
     • LDAP appears quite suitable for deploying this kind of system:
     • Performance appears to be good
        • ~20-200+ lookups/sec (usually bandwidth limited)
        • queries can be roundtrip optimized
        • server-side in-memory caching possible
        • low footprint allows client-side instance for special high-throughput needs
        • substantially faster than web-services equivalent*
     • Minimal infrastructure is required
        • scalable from laptop to high-end multi-processor server
        • accessible from many environments (Java, Perl, C/C++, Matlab, etc.)
     • Replication/referral show promise for building distributed systems of biomolecule data
     * Based on data from Don Gilbert, Indiana Univ. (http://iubio.bio.indiana.edu/grid/directories)
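
One way queries might be "roundtrip optimized" is to ask for several identifiers in a single search by combining them in an LDAP OR filter. This is a sketch of that idea only, not a description of what BNS does; the class name `BatchFilterSketch` is hypothetical, the attribute name `nm` comes from the slide 9 entry, and the values are assumed to be plain accessions that need no filter escaping.

```java
import java.util.Arrays;
import java.util.List;

/** Combines several exact-match lookups into one LDAP OR filter. */
class BatchFilterSketch {
    /** Builds (|(attr=v1)(attr=v2)...) so that many IDs cost one round trip. */
    static String anyOf(String attr, List<String> values) {
        if (values.isEmpty()) {
            throw new IllegalArgumentException("at least one value is required");
        }
        if (values.size() == 1) {
            return "(" + attr + "=" + values.get(0) + ")";
        }
        StringBuilder sb = new StringBuilder("(|");
        for (String value : values) {
            sb.append('(').append(attr).append('=').append(value).append(')');
        }
        return sb.append(')').toString();
    }

    public static void main(String[] args) {
        System.out.println(anyOf("nm", Arrays.asList("NM_130786", "NM_005157", "NM_007313")));
    }
}
```

For scale: at the quoted 20 to 200 lookups/sec, resolving the 12626 microarray IDs from slide 13 one at a time would take roughly 63 to 631 seconds, which is why reducing round trips matters.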

  22. Conclusion • Some form of consistent ubiquitous interface for performing BNS-like operations is useful and desirable • Efforts to create unified identifier schemes should consider a LocusLink-like organizing principle as these transcript/product relationships are important to emerging analyses • Properly overloaded ID conventions could eliminate the need for ID conversions (e.g. Hs12345M6789 vs. Hs12345P6789, Hs12346M*, etc) • LDAP shows promise as a useful lightweight high-performance delivery mechanism for biomolecule information
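
The "overloaded ID" point can be made concrete: if the locus and the molecule type are encoded in the identifier itself, conversion becomes string manipulation instead of a lookup. The sketch below reads the slide's example `Hs12345M6789` as organism prefix, locus number, a type letter (M for mRNA, P for protein) and a serial number. That reading is an assumption, as are the class and method names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Parses hypothetical overloaded IDs of the form <organism><locus><M|P><serial>. */
class OverloadedIdSketch {
    private static final Pattern ID = Pattern.compile("([A-Za-z]+)(\\d+)([MP])(\\d+)");

    private static Matcher match(String id) {
        Matcher m = ID.matcher(id);
        if (!m.matches()) {
            throw new IllegalArgumentException("unrecognized ID: " + id);
        }
        return m;
    }

    /** Organism prefix plus locus number, e.g. "Hs12345". */
    static String locusOf(String id) {
        Matcher m = match(id);
        return m.group(1) + m.group(2);
    }

    /** Rewrites the type letter to P, turning a transcript ID into its product ID. */
    static String toProtein(String id) {
        Matcher m = match(id);
        return m.group(1) + m.group(2) + "P" + m.group(4);
    }

    /** Two IDs are related at the locus level if their locus parts agree. */
    static boolean sameLocus(String a, String b) {
        return locusOf(a).equals(locusOf(b));
    }

    public static void main(String[] args) {
        System.out.println(toProtein("Hs12345M6789"));
        System.out.println(sameLocus("Hs12345M6789", "Hs12345P6789"));
    }
}
```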

  23. Availability • http://openbns.sourceforge.net

  24. Acknowledgements • University of Colorado: Natalie Ahn, Joel Sevinsky • Agilent: Paul Wolber, Karen Shannon, Dean Thompson, Annette Adler, Amir Ben-Dor • Indiana University: Don Gilbert • Daniel Kluesing • Aditya Vailaya
