biomart/

http://www.biomart.org/

“BioMart is a query-oriented data management system developed jointly by the Ontario Institute for Cancer Research (OICR) and the European Bioinformatics Institute (EBI).”

Open Source – LGPL
* Perl API → Web Interface, Web Services Interface, REST API
* Java API → Mart Explorer GUI, MartShell
* 3rd Party Software → Bioclipse, biomaRt-BioConductor, Cytoscape, Galaxy, Taverna, WebLab

Databases in BioMart format: Ensembl, HapMap, HTGT, HGNC, Dictybase, Wormbase, Gramene, Europhenome, UniProt, Rat Genome Database, DroSpeGe, ArrayExpress DW, Eurexpress, GermOnLine, PRIDE, PepSeeker, VectorBase, Pancreatic Expression Database, Reactome, EU Rat Mart, Paramecium DB

A Mart is a collection of datasets (roughly, a database). Marts are optimised for querying.

A Dataset has a main table, with an entry (and primary key) for each of the items of interest in that dataset (e.g. mouse transcripts). Related pieces of information about these items hang off the main table in dimension tables (e.g. the Affy IDs corresponding to this gene).

More Info: http://www.biomart.org/user-docs.pdf
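As a rough illustration of the main-table/dimension-table layout, here is a toy sketch in base R. The table names, column names and IDs (`main`, `affy_dim`, `transcript_key`, `T0001`, `probe_a`, ...) are invented for this example and are not BioMart's real schema.

```r
# Toy main table: one row (and one key) per item of interest, here transcripts.
main <- data.frame(
  transcript_key = c(1L, 2L, 3L),
  transcript_id  = c("T0001", "T0002", "T0003"),
  stringsAsFactors = FALSE
)

# Toy dimension table: zero or more rows per main-table key.
affy_dim <- data.frame(
  transcript_key = c(1L, 1L, 3L),
  affy_id        = c("probe_a", "probe_b", "probe_c"),
  stringsAsFactors = FALSE
)

# Retrieving an attribute held in a dimension table amounts to a join on the key.
# all.x = TRUE keeps transcripts with no probe; their affy_id is NA.
result <- merge(main, affy_dim, by = "transcript_key", all.x = TRUE)
print(result)
```

The join returns one row per transcript/probe pair, so a transcript with two probes appears twice and a transcript with none appears once with a missing value.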

Ensembl annotates everything at the transcript level:

AffyID → Ensembl transcripts → HUGO Symbol

AffyID    Ensembl transcript    HUGO Symbol
1939_at   ENST000003789         TP53
1939_at   ENST000003790         TP53
1939_at   ENST000003791         TP53

Affy IDs are mapped by Ensembl. If there is no clear match, the probe is not assigned to a gene.
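Because the mapping is per transcript, a probe usually comes back once for every transcript it hits. A minimal base R sketch of collapsing that to one row per probe/gene pair, using the 1939_at example above (the data frame and its column names are assumptions chosen for illustration):

```r
# Transcript-level mapping for the 1939_at example.
mapping <- data.frame(
  affy_id               = rep("1939_at", 3),
  ensembl_transcript_id = c("ENST000003789", "ENST000003790", "ENST000003791"),
  hgnc_symbol           = rep("TP53", 3),
  stringsAsFactors = FALSE
)

# Drop the transcript column and de-duplicate to get probe -> gene.
probe_to_gene <- unique(mapping[, c("affy_id", "hgnc_symbol")])

# Number of transcripts each probe maps to.
n_transcripts <- table(mapping$affy_id)

print(probe_to_gene)
print(n_transcripts)
```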

Web Interface: http://www.biomart.org/biomart/martview/
* Choose a database (mart) to query (e.g. Ensembl)
* Choose a dataset from that mart to query (e.g. Mus musculus genes)

Filters: use filters to select the members of the dataset you are interested in, e.g. limit to miRNA genes from Chr1.

Attributes: use attributes to define which pieces of information you want to retrieve about the members of the dataset, e.g. Gene ID, Transcript ID, Start, End and Status.

Results:

http://www.biomart.org/biomart/martview

www.bioconductor.org

“Bioconductor is an open source and open development software project for the analysis and comprehension of genomic data.”

# (Bioconductor 3.8 and later replace biocLite with BiocManager::install())
source("http://bioconductor.org/biocLite.R")
# Default package set
biocLite()
# OR a single package
biocLite("someBiocPkg")
# OR a package group
biocLite(groupName="pkgGroupName")

Core Packages: affy, affydata, affyPLM, annaffy, annotate, Biobase, Biostrings, DynDoc, gcrma, genefilter, geneplotter, hgu95av2.db, limma, marray, matchprobes, multtest, ROC, vsn, xtable, affyQCReport.

Alternative Package Groups: lite, affy, graph, all

Full Package Listing (software): http://www.bioconductor.org/packages/release/BiocViews.html
Full Package Listing (annotation): http://www.bioconductor.org/packages/release/data/annotation/

Querying biomart from R:

# Install library
source("http://www.bioconductor.org/biocLite.R")
biocLite("biomaRt")
# Load library
library(biomaRt)
listMarts()
# The result is just a data.frame, so you can subset it:
listMarts()[1:5,]
# or search it:
grep('ensembl', listMarts()[,1], value=TRUE)

# Select a mart
mart <- useMart('ensembl')
# List the available datasets (returns data.frame)
listDatasets(mart)
# Select a dataset
mart <- useDataset('mmusculus_gene_ensembl', mart=mart)
# Both in one:
mart <- useMart('ensembl', dataset='mmusculus_gene_ensembl')

# Available filters (returns data.frame)
listFilters(mart)
# Available attributes (returns data.frame)
listAttributes(mart)
# A simple query
getBM(filters=c('ensembl_gene_id'),
      values=c('ENSMUSG00000029249','ENSMUSG00000048482'),
      attributes=c('ensembl_gene_id', 'ensembl_transcript_id',
                   'transcript_start', 'transcript_end'),
      mart=mart)

      ensembl_gene_id ensembl_transcript_id transcript_start transcript_end
1  ENSMUSG00000029249    ENSMUST00000113448         77694516       77708955
2  ENSMUSG00000029249    ENSMUST00000113449         77695221       77715457
3  ENSMUSG00000029249    ENSMUST00000080359         77694516       77712009
4  ENSMUSG00000048482    ENSMUST00000053317        109514857      109567200
5  ENSMUSG00000048482    ENSMUST00000111052        109533720      109567200
6  ENSMUSG00000048482    ENSMUST00000111051        109516054      109567200
7  ENSMUSG00000048482    ENSMUST00000111050        109532593      109567200
8  ENSMUSG00000048482    ENSMUST00000111047        109516054      109567163
9  ENSMUSG00000048482    ENSMUST00000111049        109516054      109567163
10 ENSMUSG00000048482    ENSMUST00000111046        109517251      109567163
11 ENSMUSG00000048482    ENSMUST00000111045        109533720      109567163
12 ENSMUSG00000048482    ENSMUST00000111044        109534626      109567163
13 ENSMUSG00000048482    ENSMUST00000111043        109534626      109567163
14 ENSMUSG00000048482    ENSMUST00000111042        109534628      109567204

# If using multiple filters, values should be a list.
# If the chromosome_name, start and end filters are used together, they are
# automatically interpreted as 'search within this region'.
getBM(filters=c('chromosome_name', 'start', 'end'),
      values=list(10, 80000000, 80050000),
      attributes=c('ensembl_gene_id', 'start_position', 'end_position'),
      mart=mart)

     ensembl_gene_id start_position end_position
1 ENSMUSG00000003346       80046400     80053049
2 ENSMUSG00000035397       80029874     80040066
3 ENSMUSG00000047417       80005138     80024286
4 ENSMUSG00000003341       79982330     80001869
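Judging by the rows returned above, the region filters appear to select genes that overlap the interval, not only genes lying wholly inside it: two of the four genes extend past one end of 80,000,000-80,050,000. This is a reading of the example output, not a statement of how the server implements the filter. The base R sketch below reproduces the two interpretations on the same four rows (variable names are chosen for illustration):

```r
# The four genes returned by the region query above.
genes <- data.frame(
  ensembl_gene_id = c("ENSMUSG00000003346", "ENSMUSG00000035397",
                      "ENSMUSG00000047417", "ENSMUSG00000003341"),
  start_position  = c(80046400, 80029874, 80005138, 79982330),
  end_position    = c(80053049, 80040066, 80024286, 80001869),
  stringsAsFactors = FALSE
)

region_start <- 80000000
region_end   <- 80050000

# Overlap: the gene starts before the region ends and ends after it starts.
overlaps <- genes$start_position <= region_end &
            genes$end_position   >= region_start

# Containment: the gene lies entirely inside the region.
contained <- genes$start_position >= region_start &
             genes$end_position   <= region_end

print(genes$ensembl_gene_id[overlaps])
print(genes$ensembl_gene_id[contained])
```

If only fully contained genes are wanted, filtering the returned data.frame on start_position and end_position in this way is one option.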

# Attributes and filters are organised into categories.
# To get a list of the categories:
attributeSummary(mart)
filterSummary(mart)
# You can then list attributes and filters limited to a specified category:
listAttributes(mart, category='Variations')
# Filters can be numeric, string or boolean.
# Boolean filters need a TRUE or FALSE value.
# Determine the type of a filter with:
filterType('with_unigene', mart)

# Older versions of Ensembl are archived, which is useful if you have
# genome positions from a previous build.
old.mart <- useMart('ensembl_mart_46', dataset='mmusculus_gene_ensembl', archive=TRUE)

Retrieving Sequences:

# This can get complicated with getBM; use the getSequence wrapper.
# Genome sequences are always 5'-3', but...
#   Web-Services mode (default): strand is context dependent
#   MySQL mode: always top strand
# e.g. BRCA1 peptide sequence from gene symbol
getSequence(id="BRCA1", type="mgi_symbol", seqType="peptide", mart=mart)
# REST transcript, 20 bases upstream
getSequence(id='ENSMUST00000113448', type='ensembl_transcript_id',
            seqType='transcript_flank', upstream=20, mart=mart)
# Chromosome 4, 10,000,000-11,000,000
getSequence(chromosome=4, start=10000000, end=11000000, mart=mart,
            seqType="gene_exon", type="ensembl_gene_id")

seqTypes: Note that any of the _flank types need an 'upstream' or 'downstream' argument to determine the size of the flanking region. At the moment, you can't specify both.

Exporting Sequences:

# The exportFASTA function provides a quick way of saving
# sequences in FASTA format:
res <- getSequence(id="BRCA1", type="mgi_symbol", seqType="peptide", mart=mart)
exportFASTA(res, file='sequence.fa')
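For readers unfamiliar with the format, a FASTA file is just a ">" header line followed by the sequence for each record. The base R sketch below writes a small data.frame that way. It is not the implementation of exportFASTA; `write_fasta_sketch`, the column names and the placeholder sequences are all hypothetical.

```r
# Placeholder records; the IDs and peptide strings are made up.
seqs <- data.frame(
  id      = c("seq1", "seq2"),
  peptide = c("MDLSAVRV", "MKTAYIAK"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: write one ">id" header line and one sequence line per row.
write_fasta_sketch <- function(df, id_col, seq_col, file) {
  # rbind() stacks headers over sequences; as.vector() reads the matrix
  # column by column, which interleaves header, sequence, header, sequence...
  lines <- as.vector(rbind(paste0(">", df[[id_col]]), df[[seq_col]]))
  writeLines(lines, con = file)
  invisible(lines)
}

out_file <- tempfile(fileext = ".fa")
write_fasta_sketch(seqs, "id", "peptide", out_file)
fasta_lines <- readLines(out_file)
print(fasta_lines)
```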

Linking Datasets...

# Make mart connections for each of the datasets:
mouse.mart <- useMart('ensembl', dataset='mmusculus_gene_ensembl')
people.mart <- useMart('ensembl', dataset='hsapiens_gene_ensembl')
# In Ensembl, datasets are made of transcripts from a single species,
# so linking datasets amounts to homology.
# e.g. get the position of the mouse homolog of the human 'TP53' gene
getLDS(attributes=c("hgnc_symbol", "chromosome_name", "start_position"),
       filters="hgnc_symbol", values="TP53", mart=people.mart,
       attributesL=c("chromosome_name", "start_position"),
       martL=mouse.mart)

    V1 V2      V3 V4       V5
1 TP53 17 7512445 11 69393861

Pretty HTML Output:

library(annotate)
# Provides the htmlpage function. Salient args are:
#   genelist – a list or data.frame of IDs to be made into links
#   filename
#   title – for the table
#   othernames – a list of other things to add to the table as is
#   table.head – a character vector of column headers for the table
#   repository – a list of repositories to use for creating links
ids <- c('ENSMUSG00000029249','ENSMUSG00000048482')
genelist <- getBM(attributes=c('uniprot_swissprot_accession', 'entrezgene'),
                  filters='ensembl_gene_id', values=ids,
                  output='list', na.value=' ', mart=mart)
othernames <- getBM(attributes=c('ensembl_gene_id', 'mgi_symbol', 'description'),
                    filters='ensembl_gene_id', values=ids,
                    output='list', na.value='&nbsp;', mart=mart)
htmlpage(genelist=genelist, othernames=othernames, title='Some Genes',
         table.head=c('Uniprot', 'Entrezgene', 'Ensembl', 'Name', 'Description'),
         repository=list('sp', 'en'), filename='genes.html')
# Note that all the lists are expected to be in the right order

More Info...

Bioconductor Mailing List: http://www.bioconductor.org/docs/mailList.html
biomaRt Users' Guide: vignette('biomaRt')
Biomart Website: http://www.biomart.org
Slides & examples:
http://www.cassj.co.uk/biomart_slides.ppt
http://www.cassj.co.uk/worksheet.txt
http://www.cassj.co.uk/worksheet_code.R
