BioMart

BioMart: Federated Database Architecture. Arek Kasprzyk, EBI, 9 June 2005. BioMart is a joint project of the European Bioinformatics Institute (EBI) and Cold Spring Harbor Laboratory (CSHL). Its aim is to develop a simple and scalable data management system capable of integrating distributed data sources.

Presentation Transcript


  1. BioMart Federated Database Architecture Arek Kasprzyk EBI 9 June 2005

  2. BioMart • A joint project • European Bioinformatics Institute (EBI) • Cold Spring Harbor Laboratory (CSHL) • Aim • To develop a simple and scalable data management system capable of integrating distributed data sources.

  3. Challenges • Data sources • Large • Distributed • Different data

  4. Requirements • User • All data accessible through a single set of interfaces • Suitable for power biologists and bioinformaticians • Deployer • ‘Out of the box’ installation • Built-in query optimization • Easy data federation • Architecture • Distributed • Domain agnostic • Platform independent

  5. Federated architecture Query Engine

  6. BioMart User interfaces Data mart Data sources

  7. Dataset Data mart and dataset

  8. Schema Data mart, dataset and schema

  9. XML XML XML Dataset Configuration

  10. BioMart abstractions • Dataset • A subset of data organized into 1 or more tables • Attribute • A single data point • e.g. gene name • Filter • An operation on an attribute • e.g. ‘Chromosome = 1’

  11. Datasets, Attributes and Filters • Diagram labels: Mart, Dataset, Attribute, Filter • Table GENE: gene_id (PK), gene_stable_id, gene_start, gene_chrom_end, chromosome, gene_display_id, description
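
The dataset, attribute and filter abstractions can be illustrated with a small sketch. It assumes that attributes map to columns to select and that filters map to equality conditions on a single table. The table is a cut-down, hypothetical version of the GENE table on the slide, the rows are invented, and `query_dataset` is an illustrative helper rather than part of any BioMart API.

```python
import sqlite3

# Hypothetical miniature of the GENE table from the slide; the rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, gene_stable_id TEXT, "
    "chromosome TEXT, gene_display_id TEXT, description TEXT)"
)
conn.executemany(
    "INSERT INTO gene VALUES (?, ?, ?, ?, ?)",
    [
        (1, "G0001", "1", "geneA", "example gene on chromosome 1"),
        (2, "G0002", "2", "geneB", "example gene on chromosome 2"),
        (3, "G0003", "1", "geneC", "another gene on chromosome 1"),
    ],
)


def query_dataset(conn, table, attributes, filters):
    """Select the attribute columns from one table, restricted by equality filters.

    Table and column names are interpolated directly, so they must come from
    trusted configuration; only the filter values are bound as parameters.
    """
    sql = f"SELECT {', '.join(attributes)} FROM {table}"
    if filters:
        sql += " WHERE " + " AND ".join(f"{name} = ?" for name in filters)
    sql += " ORDER BY 1"  # sort by the first selected column for stable output
    return conn.execute(sql, list(filters.values())).fetchall()


# Attributes: stable id and display id. Filter: chromosome = 1.
rows = query_dataset(
    conn, "gene", ["gene_stable_id", "gene_display_id"], {"chromosome": "1"}
)
print(rows)
```

Here the filter narrows the dataset to genes on chromosome 1 and the attributes decide which columns come back.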

  12. Examples • Upstream sequences for all kinases up-regulated in brain and associated with a QTL for a neurological disorder • Name, chromosome position, description of all genes located on chromosome 1, expressed in lung, associated with human homologues and non-synonymous SNP changes

  13. Data model [diagram: tables linked by PK and FK columns]

  14. Data model [diagram: tables linked by PK and FK columns]

  15. Data model [diagram: tables linked by PK and FK columns]

  16. Data model - ‘reversed star’ [diagram labels: main1, dm tables, keys PK1/FK1 and PK2/FK2]
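
One reading of the ‘reversed star’ diagram is that a central main table owns the primary key and each dimension (dm) table points back to it through a foreign key. The sketch below assumes that reading. The table names `main1` and `dm_xref`, the `name` and `accession` columns and all rows are hypothetical; only the labels main1, dm, PK1 and FK1 come from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Main table: one row per central entity, keyed by pk1.
    CREATE TABLE main1 (pk1 INTEGER PRIMARY KEY, name TEXT);
    -- Dimension table: many rows per main row, pointing back through fk1.
    CREATE TABLE dm_xref (fk1 INTEGER REFERENCES main1(pk1), accession TEXT);
    INSERT INTO main1 VALUES (1, 'geneA');
    INSERT INTO main1 VALUES (2, 'geneB');
    INSERT INTO dm_xref VALUES (1, 'ACC1');
    INSERT INTO dm_xref VALUES (1, 'ACC2');
    INSERT INTO dm_xref VALUES (2, 'ACC3');
    """
)


def main_rows_matching(conn, accession):
    """Return main-table rows whose dimension rows carry the given accession."""
    sql = (
        "SELECT m.pk1, m.name FROM main1 AS m "
        "JOIN dm_xref AS d ON d.fk1 = m.pk1 "
        "WHERE d.accession = ? ORDER BY m.pk1"
    )
    return conn.execute(sql, (accession,)).fetchall()


matches = main_rows_matching(conn, "ACC2")
print(matches)
```

A filter on a dimension column therefore needs only one join back to the main table.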

  17. Fixed schema transformation [diagram labels: A, B, C, TA, TB, Dataset]

  18. BioMart abstractions • Link • ‘common currency’ between two datasets • e.g. accession • Exportable • Potential links to export • Importable • Potential links to import

  19. Exportables, Importables and Links [diagram: Dataset 1 and Dataset 2 connected by Links]

  20. Exportables, Importables and Links • Link between Dataset 1 and Dataset 2 • Exportable: name = uniprot_id, attributes = uniprot_ac • Importable: name = uniprot_id, filters = uniprot_ac

  21. Exportables, Importables and Links • Link between Dataset 1 and Dataset 2 • Exportable: name = genomic_region, attributes = chr_name, chr_start, chr_end • Importable: name = genomic_region, filters = chr_name (=), chr_start (>=), chr_end (<=)
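
The genomic_region link can be sketched in a few lines, assuming that the values of the exportable's attributes in one dataset are fed, in order, into the importable's filters on the other dataset, and that a link is only valid when the two names match. The `link` function, the row dictionaries and the `region_id` and `feature` fields are hypothetical; the names, attributes, filters and operators are those on the slide.

```python
import operator

# Comparison operators named on the slide for the importable's filters.
OPS = {"=": operator.eq, ">=": operator.ge, "<=": operator.le}

# Hypothetical rows standing in for two datasets.
dataset1 = [{"region_id": "r1", "chr_name": "1", "chr_start": 100, "chr_end": 500}]
dataset2 = [
    {"feature": "f1", "chr_name": "1", "chr_start": 150, "chr_end": 300},
    {"feature": "f2", "chr_name": "1", "chr_start": 50, "chr_end": 300},
    {"feature": "f3", "chr_name": "2", "chr_start": 150, "chr_end": 300},
]

exportable = {
    "name": "genomic_region",
    "attributes": ["chr_name", "chr_start", "chr_end"],
}
importable = {
    "name": "genomic_region",
    "filters": [("chr_name", "="), ("chr_start", ">="), ("chr_end", "<=")],
}


def link(source_rows, exportable, target_rows, importable):
    """Return target rows that pass the importable filters for any source row."""
    if exportable["name"] != importable["name"]:
        raise ValueError("exportable and importable names must match")
    linked = []
    for src in source_rows:
        # Values exported by the source row, in the exportable's attribute order.
        exported = [src[attr] for attr in exportable["attributes"]]
        for tgt in target_rows:
            if all(
                OPS[op](tgt[column], value)
                for (column, op), value in zip(importable["filters"], exported)
            ):
                linked.append(tgt)
    return linked


linked = link(dataset1, exportable, dataset2, importable)
print([row["feature"] for row in linked])
```

Only features lying inside the exported region on the same chromosome survive the three filters.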

  22. Building BioMart databases • Source databases • Transformation (MartBuilder) • Mart • Configuration (MartEditor) • XML

  23. MartEditor

  24. Table naming convention: naïve configuration • Tables • Meta tables: meta_content • Data tables: dataset__content__type • Data tables • Main: __main • Dimension: __dm • Columns • Key: _key
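
The naming convention lends itself to a small parser. This is a sketch that assumes names follow the patterns on the slide exactly: `meta_` as a prefix, double underscores separating dataset, content and type, and `_key` as a column suffix. The functions `classify_table` and `is_key_column` and the example names are hypothetical.

```python
def classify_table(name):
    """Classify a table name under the dataset__content__type convention."""
    if name.startswith("meta_"):
        return {"kind": "meta", "content": name[len("meta_"):]}
    parts = name.split("__")
    if len(parts) == 3 and parts[2] in ("main", "dm"):
        dataset, content, table_type = parts
        kind = "main" if table_type == "main" else "dimension"
        return {"kind": kind, "dataset": dataset, "content": content}
    return {"kind": "unknown"}


def is_key_column(column):
    """Key columns are marked by the _key suffix."""
    return column.endswith("_key")


# Hypothetical table and column names used only to exercise the parser.
main_info = classify_table("hsapiens__gene__main")
dm_info = classify_table("hsapiens__xref__dm")
print(main_info, dm_info, is_key_column("gene_id_key"))
```

A tool can use this kind of classification to guess a first-pass configuration from the table names alone.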

  25. BioMart architecture • Retrieval: MartExplorer, MartShell, MartView • BioMart API: Java, Perl • Databases: public data (local or remote): Ensembl, Vega, SNP, MSD, UniProt; user data: myDatabase, myMart • Schema transformation: MartBuilder • Configuration XML: MartEditor

  26. MartView

  27. MartExplorer

  28. MartShell • using = dataset • get = attribute • where = filter

  29. Mart Query Language (MQL) • Syntax: using <dataset> get <attributes> where <filters> • Can join datasets together: • using Dataset1 get Attribute1 where Filter1=var1 as q; • using Dataset2 get Attribute2 where Filter2=var2 and filter3 in q • Can script and pipe: • martshell.sh -E MQLscript.mql > results.txt • martshell.sh -E MQLscript.mql | wc
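
A script file for `martshell.sh -E` could be generated programmatically. The sketch below only assembles query strings in the `using ... get ... where ...` shape shown on the slide; it does not run MartShell. The `build_mql` helper is hypothetical, and separating multiple attributes with commas is an assumption, since the slide shows a single attribute per query.

```python
def build_mql(dataset, attributes, filters=None, name=None):
    """Assemble one MQL statement as a string.

    filters is a list of already formatted conditions such as "Filter1=var1";
    name, if given, stores the query under that name with "as <name>".
    """
    query = f"using {dataset} get {','.join(attributes)}"
    if filters:
        query += " where " + " and ".join(filters)
    if name:
        query += f" as {name}"
    return query


# The two-step join from the slide: the first query is named q and reused.
first = build_mql("Dataset1", ["Attribute1"], ["Filter1=var1"], name="q")
second = build_mql("Dataset2", ["Attribute2"], ["Filter2=var2", "filter3 in q"])
script = first + ";\n" + second
print(script)
```

Writing `script` to a file such as MQLscript.mql would give the input used by the shell commands on the slide.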

  30. Third party software • Bioconductor (biomaRt) • BioMart schema • Taverna • BioMart java library • DAS ProServer • BioMart perl library

  31. biomaRt

  32. Taverna

  33. ProServer • No programming • DAS requests and responses defined by Exportables and Importables and configured by MartEditor • DAS1

  34. BioMart deployers • Large scale data federation (EBI) • Optimising access to a large database (Ensembl, WormBase) • Connecting proprietary datasets to public data (Pasteur, Unilever, Serono, Sanofi-Aventis, DevGen etc …)

  35. Hinxton example [diagram labels: SANGER, EBI, Ensembl, Uniprot, SNP, MSD, Vega, Sequence, WWW]

  36. BioMart deployers • Large scale data federation (Hinxton) • Optimising access to a large database (Ensembl, WormBase, ArrayExpress) • Connecting proprietary datasets to public data (Pasteur, Unilever, Serono, Sanofi-Aventis, DevGen etc …)

  37. WormBase

  38. Ensembl

  39. ArrayExpress

  40. BioMart deployers • Large scale data federation (Hinxton) • Optimising access to a large database (Ensembl, WormBase) • Federating user data with public data (Pasteur, INRA, Bayer, Unilever, Serono, Sanofi-Aventis, DevGen, Solexa etc …)

  41. GMIA_SNP_mart_database (Genetics of Infectious and Autoimmune Diseases, Pasteur Institute, INSERM U730, Paris, France) • Sources: dbsnp, HapMap, Ensembl, AceView, Vega, RefSeq • Sample rows: SNP1 T/A AL13929 963253 1; SNP2 C/T AL13929 963255 -1; SNP3 C/G AL13929 963258 1 • Example requests: Give me genotype and frequency data from HapMap; Give me SNP locations on gene/transcript; Give me frequency data from dbsnp; Give me frequency, genotype, location on gene/transcript from dbsnp, HapMap, Ensembl, RefSeq, AceView and Vega • Interfaces: Java graphical user interface; WWW web browser

  42. … what next ?

  43. BioMart model • Already applied • Ensembl • Vega • SNP • Uniprot • MSD • ArrayExpress • WormBase • Variety of ‘in house’ projects • In development • HapMap

  44. Summary • BioMart interface • Batch queries • ‘Data mining’ • Large annotation • BioMart software • Set up your own database • Make your database scalable and responsive • Federate with other data

  45. Where are we? • 0.2 released in February • 0.3 to be released in June • Platforms • MySQL • Oracle • Postgres

  46. Acknowledgments • BioMart • Damian Smedley (EBI) • Darin London (EBI) • Will Spooner (CSHL) • Contributors • Arne Stabenau (Ensembl) • Andreas Kahari (Ensembl) • Craig Melsopp (Ensembl) • Katerina Tzouvara (Uniprot) • Paul Donlon (Unilever)