
Data Mining with BioMart


Presentation Transcript


  1. Data Mining with BioMart

  2. Simple and … Complex Queries
     • Genes within a candidate region
     • Gene products with a particular protein domain
     • …
     • Genomic location and description of all mouse and rat homologues of all human genes that have transmembrane domains, are expressed in the cardiovascular system and are associated with non-synonymous SNPs

  3. Ensembl Core Database
     Relational database
     • Normalised
     • Each data point stored only once
     Therefore:
     • Quick updates
     • Minimal storage requirements
     But:
     • Many tables
     • Many joins for complicated queries
     • Slow for data mining applications

  4. Normalised Schema

  5. BioMart Database
     Data warehouse
     • De-normalised
     • Query-optimised
     • Tables with apparent “redundancy”
     Therefore:
     • Fast and flexible
     • Ideal for data mining
     Produced from normalised core databases at every new release

  6. De-Normalised Schema
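
The trade-off described on slides 3 to 6 can be illustrated with a minimal sketch using Python's built-in sqlite3 module. The table and column names below (gene, transcript, disease, gene_mart) and the sample rows are hypothetical toy data, not the real Ensembl core or BioMart schema. The point is only that the normalised layout needs joins at query time, while the de-normalised table answers the same question with a single-table lookup at the cost of repeated values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Normalised layout (toy example): each fact is stored once, in its own table.
cur.executescript("""
CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, stable_id TEXT, chromosome TEXT);
CREATE TABLE transcript (transcript_id INTEGER PRIMARY KEY, gene_id INTEGER, stable_id TEXT);
CREATE TABLE disease (gene_id INTEGER, omim_id TEXT, description TEXT);
INSERT INTO gene VALUES (1, 'GENE_A', '1'), (2, 'GENE_B', '2');
INSERT INTO transcript VALUES (10, 1, 'TRANS_A1'), (11, 1, 'TRANS_A2'), (12, 2, 'TRANS_B1');
INSERT INTO disease VALUES (1, 'OMIM_X', 'example disease');
""")

# Querying the normalised tables requires one join per extra table.
normalised_rows = cur.execute("""
SELECT g.stable_id, t.stable_id, d.omim_id
FROM gene g
JOIN transcript t ON t.gene_id = g.gene_id
JOIN disease d ON d.gene_id = g.gene_id
WHERE g.chromosome = '1'
ORDER BY t.stable_id
""").fetchall()

# De-normalised layout: materialise the joins once, accepting "redundant"
# copies of the gene and disease values on every transcript row.
cur.execute("""
CREATE TABLE gene_mart AS
SELECT g.stable_id AS gene_stable_id,
       g.chromosome AS chromosome,
       t.stable_id AS transcript_stable_id,
       d.omim_id AS omim_id,
       d.description AS description
FROM gene g
JOIN transcript t ON t.gene_id = g.gene_id
LEFT JOIN disease d ON d.gene_id = g.gene_id
""")

# The same question is now a single-table filter with no joins.
mart_rows = cur.execute("""
SELECT gene_stable_id, transcript_stable_id, omim_id
FROM gene_mart
WHERE chromosome = '1' AND omim_id IS NOT NULL
ORDER BY transcript_stable_id
""").fetchall()

print(normalised_rows)
print(mart_rows)
```

Both queries return the same rows. The de-normalised table has to be rebuilt whenever the underlying data change, which matches the slide's note that the mart is produced from the core databases at every release.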

  7. BioMart
     • Developed jointly by the European Bioinformatics Institute (EBI) and Cold Spring Harbor Laboratory (CSHL)
     • http://www.biomart.org/
     Publicly available implementations at:
     • Ensembl
     • Central Server
     • Dictybase
     • Wormbase (WormMart)
     • Gramene (GrameneMart)
     • euGenes
     • HapMap (HapMart)
     • ZF-Models

  8. BioMart

  9. Data Sets
     Primary
     • Ensembl Genes
     • Vega Genes
     • SNPs
     Secondary
     • Markers
     • “Diseases”
     • Gene ontology
     • Gene expression information
     • Homology predictions
     • Protein annotation

  10. Information Flow
      [Diagram: START → FILTER → OUTPUT]
      • START: database, species
      • FILTER: region, gene, expression, homology, protein, SNP
      • OUTPUT: region, gene, expression, homology, protein, SNP
      • Other labels in the diagram: Swiss-Prot, EMBL, RefSeq, GO, InterPro, Affymetrix; FASTA, GTF, HTML, text, Excel, file

  11. BioMart Example
      Find all Ensembl genes on the short arm of human chromosome 1 which are known to be associated with a disease
      Export the 100 bp upstream of the transcripts of the above genes

  12. Steps 1–3
      1. Select “Ensembl 38”
      2. Select “Homo sapiens genes (NCBI36)”
      3. Click “next”

  13. Steps 4–7
      4. Select “Chromosome 1”
      5. Select “Band Start p36.33 – End p11.1”
      6. Select “with Disease Association Only”
      7. Click “next”

  14. Steps 8–9
      8. Select Attribute Page “Features”
      9. Select “Ensembl Gene ID” and “Ensembl Transcript ID”
      [Screenshot callout: Summary of actions]

  15. Steps 10–12
      10. Select “Disease OMIM ID” and “Disease description”
      11. Select Output format “MS Excel”
      12. Click “export”

  16. Steps 13–17
      13. Select Attribute Page “Sequences”
      14. Select “Flank (Transcript)”
      15. Enter “Upstream flank 100”
      16. Select Header information
      17. Click “export”
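
The first export in the walkthrough above (slides 12 to 15) can also be written down as a query document rather than a series of clicks. Later BioMart releases accept queries as XML with Query, Dataset, Filter and Attribute elements. The sketch below only builds such a document using Python's standard library and does not contact any server. Every dataset, filter and attribute name in it is an assumption made for illustration. In particular, `with_disease_association` and the two `disease_*` attributes are hypothetical, and the real internal names depend on the mart release.

```python
import xml.etree.ElementTree as ET


def build_mart_query(dataset, filters, attributes):
    """Return a BioMart-style XML query as a string.

    `filters` maps a filter name to its value. A value of True marks a
    boolean ("only") filter, which is written as excluded="0".
    """
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",
        "header": "1",
        "uniqueRows": "1",
    })
    ds = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    for name, value in filters.items():
        if value is True:
            ET.SubElement(ds, "Filter", {"name": name, "excluded": "0"})
        else:
            ET.SubElement(ds, "Filter", {"name": name, "value": str(value)})
    for name in attributes:
        ET.SubElement(ds, "Attribute", {"name": name})
    return ET.tostring(query, encoding="unicode")


# All names below are assumed for illustration and are not taken from the slides.
query_xml = build_mart_query(
    dataset="hsapiens_gene_ensembl",
    filters={
        "chromosome_name": "1",            # step 4
        "band_start": "p36.33",            # step 5
        "band_end": "p11.1",               # step 5
        "with_disease_association": True,  # step 6 (hypothetical filter name)
    },
    attributes=[
        "ensembl_gene_id",        # step 9
        "ensembl_transcript_id",  # step 9
        "disease_omim_id",        # step 10 (hypothetical attribute name)
        "disease_description",    # step 10 (hypothetical attribute name)
    ],
)
print(query_xml)
```

The sequence export (steps 13 to 17) would need a second query with sequence attributes and an upstream-flank setting, which is left out here because those parameter names are not given in the slides.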

  17. There are other ways…
      • MartShell
      • Command line interface to Mart written in Java
      • Mart Query Language

  18. What about queries not possible to do in BioMart?
      • MySQL queries on ensembldb.ensembl.org
      • MySQL client
      • Perl API
      • BioPerl and Ensembl modules
      • Java API
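
As a rough sketch of the MySQL route listed above, the snippet below assembles a command line for the standard `mysql` client and prints it without running it. The host comes from the slide, and `anonymous` is the public read-only user of that server. The database name `homo_sapiens_core_38_36`, the table names `gene` and `seq_region`, and the example query are assumptions based on the usual core database layout, so they should be checked against the release actually in use.

```python
import shlex

HOST = "ensembldb.ensembl.org"        # from the slide
USER = "anonymous"                    # public read-only account
DATABASE = "homo_sapiens_core_38_36"  # assumed name for Ensembl 38 / NCBI36

# Assumed example query: count the genes located on chromosome 1.
SQL = (
    "SELECT COUNT(*) FROM gene g "
    "JOIN seq_region s ON s.seq_region_id = g.seq_region_id "
    "WHERE s.name = '1'"
)


def mysql_command(host, user, database, sql):
    """Build the argument list for a one-off query with the mysql client."""
    return ["mysql", "-h", host, "-u", user, "-D", database, "-e", sql]


argv = mysql_command(HOST, USER, DATABASE, SQL)

# Print a copy-pasteable shell command; nothing is executed here.
print(shlex.join(argv))
```

Passing `argv` to `subprocess.run` would execute the query, provided the `mysql` client is installed and the server is reachable.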

  19. Q & A: Questions and Answers

  20. Exercises
      «The range and complexity of the questions you can address through the Ensembl MartView resource is truly impressive. We really encourage you to spend some time playing with it …»
