
Core 2: Bioinformatics




  1. Core 2: Bioinformatics NCBO-Berkeley

  2. Berkeley Drosophila Genome Project • Finish the sequence of the euchromatic genome of Drosophila melanogaster • Annotate biologically important features of this sequence • Produce gene disruptions using P element-mediated mutagenesis • Full-length sequencing and expression characterization of a cDNA for every gene • Develop informatics tools

  3. Who is here from NCBO-Berkeley • Chris • Shu • Mark • Sima

  4. Chris • GadFly database schema • GO database schema • Chado database schema • Perl libraries for all • OBD data architect

  5. Shu • OBD dev & Data flow • AmiGO,ImaGO & database • Compute Pipeline

  6. Mark • Apollo Genome Annotation Editor • Phenote and other OBD interfaces

  7. Sima • Adh region annotation • Annotation of entire Drosophila Genome • Project manager and coordinator nonpareil • Associate Director

  8. OBD Outline • Core 2 aims, refresher • Data models for OBD • phenotypes • clinical trials • others • Modeling frameworks • exchange formats • database system • SQL based vs ‘SemWeb’ dbs • Progress • Demo

  9. Core 2 Specific Aims • Apply ontologies • Software toolkit for describing and classifying data • Capture, manage, and view data annotations • Database (OBD) and interfaces to store and view annotations • Investigate and compare implications • Linking human diseases to model systems • Maintain • Ongoing reconciliation of ontologies with annotations

  10. Core 3 Driving Biological Projects • DBPs • phenotypes: Fly and Zebrafish to human • clinical trials • Core 2 Aims • Apply ontologies to describe data • Capture, manage, and view data annotations • Link disease genes to model systems • Reconcile annotation and ontology changes

  11. Apply ontologies to describe data • Requirements • Data capture tools • phenote • demo tomorrow • no tool requirements from UCSF • Data model • Database (OBD) • --aim 2

  12. data flow

  13. user’s view

  14. Data models • Common/shared domain specific models • Aim 3 • linking disease genes • model must support this • granularity • comparability

  15. Domain specific data models • FB, ZFIN • genotype to phenotype • ‘EAV’ • qualities inhere in entities • orthologs • phenotype to disease • core 2 will help define common model • UCSF • clinical trials • existing ontology-friendly schema - trialbank

  16. Phenotype data model • Qualities inhere in entities • Entity term; PATO term • brain FBbt:00005095; fused PATO:0000642 • gut MA:0000917; dysplastic PATO:0000640 • tail fin ZDB:020702-16; ventralized PATO:0000636 • kidney ZDB:020702-16; hypertrophied PATO:0000636 • midface ZDB:020702-16; hypoplastic PATO:0000636 • Pre-composed phenotype terms • Mammalian Phenotype Ontology • “increased activated B-cell number” MPO:0000319 • “pink fur hue” MPO:0000374
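A minimal sketch of the "qualities inhere in entities" model as a record type. The class name `Phenotype` and its field names are assumptions made for illustration; only the term identifiers and labels come from the slide above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phenotype:
    """One post-composed statement: a PATO quality inhering in an anatomical entity."""
    entity_id: str       # anatomy ontology term
    entity_label: str
    quality_id: str      # PATO term
    quality_label: str

    def describe(self) -> str:
        return (f"{self.quality_label} ({self.quality_id}) inheres in "
                f"{self.entity_label} ({self.entity_id})")


# Two rows copied from the slide; the record layout itself is hypothetical.
fused_brain = Phenotype("FBbt:00005095", "brain", "PATO:0000642", "fused")
dysplastic_gut = Phenotype("MA:0000917", "gut", "PATO:0000640", "dysplastic")

print(fused_brain.describe())
```

Keeping the entity and the quality as separate fields is what lets annotations from different anatomy ontologies be compared on the shared PATO term.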

  17. Extensions to simple model • What about • Relational attributes • Quantitative vs qualitative • Post-composing entity and attribute terms • Relative states/values • Variation in place, space and time • A better treatment of absence • See CSHL Pheno meeting talk • also, more detailed formal presentation (available) • Not to mention genotypes, environments, provenance, etc

  18. Modeling clinical trials • Model already described using frame-based schema • Further modeling required? • abstraction • to integrate more with other OBD datatypes • views • to only show parts relevant to OBD/BioPortal

  19. Future DBPs and use cases • OBD will contain a variety of general types of data • Modeling is expensive • use existing models where appropriate • but whole must be cohesive and integrated • Most of this talk focuses on the pheno DBPs for illustrative purposes

  20. Modeling frameworks • language • technology

  21. Modeling data: underlying formalism • Model is expressed with modeling language • Options • Relational/SQL • Semi-structured, XML • Object-centric (UML, frame-based?) • Logic based • description logic: e.g. OWL • first-order logic: e.g. CL • Natural language descriptions • Model should be independent of language it is expressed in

  22. Data exchange language: XML • Simple • XML is suited for data exchange • XML can drive software spec • constrains programmatic data model • XSD can generate UML • closed world assumption is useful • cf Ruttenberg et al • Mature technology • well understood by developers, MODs • standards

  23. How OBD uses XML • obd-geno-pheno-xml (aka pheno-xml) • actually multiple modular components • genotype schema • phenotype schema: ‘EAV’ • environment schema • provenance schema • used as • exchange format • cf: gene ontology association files • no need for ClinicalTrials-XML

  24. Example pheno-xml
      <genotype id="ZFIN:tm84">
        <name>ZFIN:tm84</name>
        <genotype_phenotype_association>
          <phenotype>
            <entity type="ZDB-ANAT-010921-528">
              <quality type="PATO:……">
                <state type="PATO:0000636">
                  <time_range type="ZDB-STAGE-010723-12"/>
                </state>
              </quality>
            </entity>
          </phenotype>
        </genotype_phenotype_association>
      </genotype>
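As a rough illustration of pheno-xml used as an exchange format, the sketch below reads the example with Python's standard `xml.etree.ElementTree`. The function name `extract_annotations` and the flat tuple it returns are assumptions, not part of any OBD tooling; the quality identifier elided on the slide is left elided, and the closing `</genotype>` tag is added so the document parses.

```python
import xml.etree.ElementTree as ET

# The slide's example, closed so that it is well-formed XML.
PHENO_XML = """
<genotype id="ZFIN:tm84">
  <name>ZFIN:tm84</name>
  <genotype_phenotype_association>
    <phenotype>
      <entity type="ZDB-ANAT-010921-528">
        <quality type="PATO:……">
          <state type="PATO:0000636">
            <time_range type="ZDB-STAGE-010723-12"/>
          </state>
        </quality>
      </entity>
    </phenotype>
  </genotype_phenotype_association>
</genotype>
"""


def extract_annotations(xml_text):
    """Return (genotype, entity, state, stage) tuples; this flat shape is hypothetical."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    for entity in root.iter("entity"):
        for state in entity.iter("state"):
            stage = state.find("time_range")
            rows.append((
                root.get("id"),
                entity.get("type"),
                state.get("type"),
                stage.get("type") if stage is not None else None,
            ))
    return rows


for row in extract_annotations(PHENO_XML):
    print(row)
```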

  25. SQL Databases • Data storage, management and querying • all MODs use SQL dbs • Lots of advantages • scalable, standard QL, mature, APIs, etc • pure relational model is reasonably formal • XML/SQL more or less compatible • low impedance mismatch

  26. Schemas for geno-pheno data • We already have schema: Chado • Used by many MODs (eg FB) • others are ‘chado compliant’ (eg ZFIN) • Modular • ontologies • genomic • genotype • phenotype • phylogenies • …etc • Phenotype module needs updating • will be driven by pheno-xml

  27. Problem solved? • We have two mature, complementary technologies, and can define schemas for our model in an appropriate formalism for each • Is this enough to work with?

  28. Issues • OBD will be much more than geno-pheno • clinical trials • future DBPs, other NCBCs • any data expressed in an ontology language • Software and schema development expensive • fragility in face of schema evolution • development gets bogged down in data exchange issues

  29. Major issue • SQL and XPath work great for ‘traditional’ data… • …but are too low level for ontology-centric data • lack of inference • no way to directly express ontology constraints

  30. Use cases from previous experience: AmiGO • GO • “find all TF genes” (is_a closure) • “find all gene products localised to endoplasmic reticulum” (part_of closure, over is_a) • Our solution (AmiGO & go-sqldb) • pre-compute transitive closure over all relations in db • (sort of) works for GO (for now) • refresh problem • explosive for tangled DAGs
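The pre-computed closure described above can be sketched roughly as follows. This is not the go-sqldb implementation: the toy terms, the gene product names, and the rule that any path containing a part_of edge counts as part_of are assumptions for illustration.

```python
def transitive_closure(edges):
    """edges: iterable of (child, relation, parent) with relation 'is_a' or 'part_of'.

    Returns a set of (descendant, relation, ancestor). A path made only of is_a
    edges yields is_a; a path containing any part_of edge yields part_of.
    """
    parents = {}
    for child, rel, parent in edges:
        parents.setdefault(child, []).append((rel, parent))

    closure = set()
    for start in parents:
        stack = [(start, "is_a")]
        while stack:
            node, path_rel = stack.pop()
            for rel, parent in parents.get(node, []):
                combined = "part_of" if "part_of" in (path_rel, rel) else "is_a"
                if (start, combined, parent) not in closure:
                    closure.add((start, combined, parent))
                    stack.append((parent, combined))
    return closure


def products_located_in(term, annotations, closure):
    """Gene products annotated to `term` or to anything below it in the closure."""
    return sorted(
        gp for gp, annotated in annotations.items()
        if annotated == term
        or any((annotated, rel, term) in closure for rel in ("is_a", "part_of"))
    )


# Hypothetical miniature ontology and annotations.
EDGES = [
    ("rough ER membrane", "is_a", "ER membrane"),
    ("ER membrane", "part_of", "endoplasmic reticulum"),
    ("endoplasmic reticulum", "is_a", "organelle"),
    ("nucleus", "is_a", "organelle"),
]
ANNOTATIONS = {"gpA": "rough ER membrane", "gpB": "endoplasmic reticulum", "gpC": "nucleus"}

CLOSURE = transitive_closure(EDGES)
print(products_located_in("endoplasmic reticulum", ANNOTATIONS, CLOSURE))
```

The closure holds one row per (descendant, ancestor) pair, which is why it has to be refreshed whenever the ontology changes and why it grows quickly on heavily tangled DAGs.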

  31. OBD requires more ontological awareness • Other relations • ontogenic (e.g. derives_from) • transitive_over • Other types of data • Pre- versus post-composed terms • E.g. MPO versus AO+PATO • E.g. Entity+Spatial qualifier • queries over either should be interchangeable
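One way to make queries over pre- and post-composed terms interchangeable is to expand each pre-composed term into its entity and quality parts before matching. The sketch below assumes a small lookup table of such definitions; the table, the genotype names, and the use of plain labels in place of ontology identifiers are all hypothetical.

```python
# Hypothetical decomposition table: pre-composed term -> (entity, quality).
# Labels stand in for AO and PATO identifiers, which are not given here.
PRECOMPOSED_DEFS = {
    "MPO:0000374": ("fur", "pink"),  # "pink fur hue"
}


def normalize(annotation):
    """Return (entity, quality) for a pre-composed ID or a post-composed pair."""
    if isinstance(annotation, str):
        return PRECOMPOSED_DEFS[annotation]
    return tuple(annotation)


def find(annotations, entity, quality):
    """Genotypes whose phenotype matches, whichever style was used to record it."""
    return [gid for gid, ann in annotations if normalize(ann) == (entity, quality)]


ANNOTATIONS = [
    ("genotype-1", "MPO:0000374"),      # pre-composed
    ("genotype-2", ("fur", "pink")),    # post-composed
    ("genotype-3", ("fur", "white")),
]

print(find(ANNOTATIONS, "fur", "pink"))
```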

  32. Solution: more expressive formalisms • QLs and APIs should provide and abstract away common ontology operations • ease of programming, optimisation • Choices • ‘SemWeb’ databases • RDF + RDFS + OWL [Lite + DL] + extra • lots to choose from, emerging standards • compatible with Obo v1.2 spec • Deductive databases • superset of relational databases • from Prolog to full CL

  33. Modeling phenotypes as RDF/OWL or Obo instances • [Figure: classes/terms and instances; entity and quality]

  34. Example query in SeRQL • find mutations affecting the shape of the wing vein:
      SELECT DISTINCT EI, ET, OrgI, QI, QT, QN
      FROM {EI} rdf:type {ET} rdfs:label {EN},
           {EI} OBO_REL_part_of {OrgI} rdf:type {Tax} rdfs:label {TaxN},
           {EI} OBO_REL_has_quality {QI} rdf:type {QT} rdfs:label {QN}
      WHERE label(EN) = "wing vein" AND label(TaxN) = "Arthropoda" AND label(QN) = "ShapeValue"
      • results of query on OBD-sesame: one annotation to “wing vein L2”, “branched”
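To show what such a graph-pattern query does, here is a toy in-memory matcher in plain Python. It is not Sesame or SeRQL: the `match` function, the instance names, and the triples are invented for illustration. Class nodes are written as their labels, and the types that an RDFS reasoner would infer (for example, that a "wing vein L2" instance is also a "wing vein") are added by hand.

```python
def match(patterns, triples, binding=None):
    """Yield variable bindings satisfying every (s, p, o) pattern; '?x' marks a variable."""
    binding = {} if binding is None else binding
    if not patterns:
        yield dict(binding)
        return
    head, rest = patterns[0], patterns[1:]
    for triple in triples:
        candidate = dict(binding)
        for term, value in zip(head, triple):
            if term.startswith("?"):
                # Bind the variable, or check it against an earlier binding.
                if candidate.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            yield from match(rest, triples, candidate)


# Hypothetical data; entailed triples are materialised by hand.
TRIPLES = [
    ("e1", "rdf:type", "wing vein L2"),
    ("e1", "rdf:type", "wing vein"),
    ("e1", "part_of", "org1"),
    ("e1", "has_quality", "q1"),
    ("q1", "rdf:type", "branched"),
    ("q1", "rdf:type", "ShapeValue"),
    ("e2", "rdf:type", "wing vein"),
    ("e2", "part_of", "org1"),
    ("e2", "has_quality", "q2"),
    ("q2", "rdf:type", "ColorValue"),
    ("org1", "rdf:type", "Arthropoda"),
]

QUERY = [
    ("?EI", "rdf:type", "wing vein"),
    ("?EI", "part_of", "?OrgI"),
    ("?OrgI", "rdf:type", "Arthropoda"),
    ("?EI", "has_quality", "?QI"),
    ("?QI", "rdf:type", "ShapeValue"),
]

results = list(match(QUERY, TRIPLES))
print(results)
```

Each pattern adds one join, so the query reads much like the SeRQL path expressions, while the subsumption step is left to whatever inference layer produces the entailed triples.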

  35. Advantages of ‘SemWeb’ dbs • Advantages over pure SQL • The ontology is the model • constraints encoded in ontology • e.g. certain quality types only applicable to certain entity types • agile development - fast database integration • Rich modeling constructs • transitivity, subsumption, intersection, etc • powerful QLs and APIs • More (technical) interoperation ‘for free’ • URIs • proven? • Open World Assumption (maybe a hindrance?)

  36. Disadvantages of ‘SemWeb’ dbs • Disadvantages • speed • may be slower than SQL • ..but in-memory execution is fast • lack of maturity • new technology.. but has a LOT of momentum • foundations • are RDF triples appropriate? • inherent difficulties modeling time • SQL allows n-ary relations/predicates

  37. Hybrid model • SemWeb dbs are commonly layered over SQL DBs • We can have the best of both worlds • Data View layers • mapping between Obo/OWL model and domain-specific relational schema • (optionally) materialized for speed • different applications use appropriate layer
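A data view layer of this kind can be sketched with an SQL view that exposes a domain-specific table as subject/predicate/object statements. The example uses Python's built-in `sqlite3`; the table, view, and predicate names are assumptions and are not taken from Chado or OBD.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical domain-specific relational table.
CREATE TABLE phenotype_annotation (genotype TEXT, entity TEXT, quality TEXT);
INSERT INTO phenotype_annotation
VALUES ('ZFIN:tm84', 'ZDB-ANAT-010921-528', 'PATO:0000636');

-- View layer: the same rows seen as generic statements.
CREATE VIEW statement AS
  SELECT genotype AS subject, 'has_phenotype_entity' AS predicate, entity AS object
  FROM phenotype_annotation
  UNION ALL
  SELECT entity, 'has_quality', quality
  FROM phenotype_annotation;

-- Optional materialisation for speed, at the cost of a refresh step.
CREATE TABLE statement_materialized AS SELECT * FROM statement;
""")

rows = conn.execute(
    "SELECT subject, predicate, object FROM statement ORDER BY predicate"
).fetchall()
materialized_count = conn.execute(
    "SELECT COUNT(*) FROM statement_materialized"
).fetchone()[0]

for row in rows:
    print(row)
```

Applications that want the relational shape query the base table, while ontology-aware tools read the statement view, which is the sense in which different applications use the appropriate layer.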

  38. Current progress: OBD-Sesame • Sesame • open source ‘triple store’ • based on Jena • also used in Protégé-OWL • storage layer options • mysql/postgresql generic schema • in-memory • disk-based

  39. OBD in Sesame: current datasets • Pheno • ZFIN & FB : EAV trial 2003 data • Test ortholog set • FB ‘simple phenotype’ alleles • ZFIN legacy phenotype data, automatically parsed to EAV • Ontologies: AOs, PATO, Cell, GO • Method • excel & flatfiles->pheno-xml->owl • OWL from http://www.fruitfly.org/~cjm/obo-download • Trialbank • Method: ocelot->obo-xml->owl • Soon • human orthologs and omim

  40. Technology Evaluation: Sesame • Use case query set • Benchmarks • preliminary conclusions • SQL layering is terrible • in-memory is fast • optimisations? • other triple stores? • up to date results on wiki • http://smi.stanford.edu/projects/cbio/mwiki-internal/index.php/RDF_Sesame_Demo_Benchmark • Need to test OWL-DL entailment • Bigger dataset required for full evaluations • Community effort: pub-semweb-lifesci list

  41. Parallel development: an OBD Prototype • Initiated prior to OBD-Sesame • Simple deductive database • prolog-based • chado-like schema • can be views on Obo/OWL predicates • amigo-clone user interface • Rapid prototyping • Current dataset • as obd-sesame, plus CT • trivial to drop in more

  42. Example logic query • find mutations affecting the shape of some part of the head capsule:
      inheres(QI,EI) & inst(QI,QT) & label(QT,shape) & inst(EI,ETP) & part_of*(ETP,ET) & label(ET,'head capsule')
      • results of query on OBD-prolog: one annotation to “arista lateral”, “irregular shape”
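The conjunctive query above can be read as a series of lookups over fact tables. The sketch below evaluates it in Python over invented facts; the identifiers and the reading of `part_of*` as the reflexive transitive closure of part_of are assumptions, and this is not the OBD-prolog code.

```python
# Hypothetical fact base mirroring the predicates used in the query.
INHERES = {("q1", "e1"), ("q2", "e2")}              # inheres(QI, EI)
INST = {                                            # inst(instance, type)
    ("q1", "QT:shape"), ("e1", "ET:arista"),
    ("q2", "QT:shape"), ("e2", "ET:wing"),
}
PART_OF = {("ET:arista", "ET:antenna"), ("ET:antenna", "ET:head")}
LABEL = {"QT:shape": "shape", "ET:head": "head capsule", "ET:wing": "wing"}


def part_of_star(term):
    """All terms reachable from `term` via part_of, including `term` itself."""
    seen, stack = {term}, [term]
    while stack:
        node = stack.pop()
        for child, parent in PART_OF:
            if child == node and parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def query(quality_label, entity_label):
    """Return (QI, EI) pairs satisfying the conjunctive query."""
    hits = []
    for qi, ei in sorted(INHERES):
        q_types = {t for i, t in INST if i == qi}
        e_types = {t for i, t in INST if i == ei}
        quality_ok = any(LABEL.get(t) == quality_label for t in q_types)
        entity_ok = any(
            LABEL.get(ancestor) == entity_label
            for t in e_types
            for ancestor in part_of_star(t)
        )
        if quality_ok and entity_ok:
            hits.append((qi, ei))
    return hits


print(query("shape", "head capsule"))
```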

  43. OBD TODO • Pheno-xml • finalise release version • finalise Obo/OWL mapping • logic specification • Data • orthologies • OBD - BioPortal integration • how will it work? • Versioning and reconciling changes • decide on ontology versioning first

  44. OBD dependencies • PATO development • UMLS into OBO-site • Ontologies • FMA accessibility? • species-centric AO alignments (XSPAN?) • Sept meeting on AO development • Nov meeting on disease ontologies • Data • MOD pheno annotation • OMIM annotation • Bioportal

  45. Misc • NLP for phenote • Obol • trial on evolutionary phenotype characters • cambridge NLP project • can be used to ‘prime’ phenote • Decomposing MPO • pink fur def = fur, has_quality: pink

  46. Discussion • Will SemWeb dbs work? • experiment • Ontology-based modeling • the ontology is the model • importance of • relations ontology • upper ontology

  47. Demos • http://yuri.lbl.gov/amigo/ct • http://yuri.lbl.gov/amigo/obd • http://spade.lbl.gov:8080/sesame/actionFrameset.jsp?repository=mem-rdfs-db
