Metadata and Annotation with Bioconductor

Static vs. Dynamic Annotation
Static Annotation:
• Bioconductor packages containing annotation information that are installed locally on a computer
• well-defined structure
• reproducible analyses
• no need for network connection
Dynamic Annotation:
• stored in a remote database
• more frequent updates, so repeating an analysis may give different results
• more information
• one needs to know the structure of the database, the API of the web service, etc.

Available Metadata
• EntrezGene is a catalog of genetic loci that connects curated sequence information to official nomenclature. It replaced LocusLink.
• UniGene defines sequence clusters. UniGene focuses on protein-coding genes of the nuclear genome (excluding rRNA and mitochondrial sequences).
• RefSeq is a non-redundant set of transcripts and proteins of known genes for many species, including human, mouse and rat.
• Enzyme Commission (EC) numbers are assigned to different enzymes and linked to genes through EntrezGene.

Available Metadata
• Gene Ontology (GO) is a structured vocabulary of terms describing gene products according to molecular function, biological process, or cellular component.
• PubMed is a service of the U.S. National Library of Medicine. PubMed provides a rich resource of data and tools for papers in journals related to medicine and health. While large, the data source is not comprehensive, and not all papers have been abstracted.

Available Metadata
• OMIM: Online Mendelian Inheritance in Man is a catalog of human genes and genetic disorders.
• NetAffx: Affymetrix's NetAffx Analysis Center provides annotation resources for Affymetrix GeneChip technology.
• KEGG: Kyoto Encyclopedia of Genes and Genomes; a collection of data resources including a rich collection of pathway data.
• IntAct: protein interaction data, mainly derived from experiments.
• Pfam: a large collection of multiple sequence alignments and hidden Markov models covering many common protein domains and families.

Available Metadata
• Chromosomal Location: genes are identified with chromosomes, and where appropriate with strand.
• Data Archives: the NCBI coordinates the Gene Expression Omnibus (GEO); TIGR provides the Resourcerer database, and the EBI runs ArrayExpress.

Annotation Packages
• An early design decision was to provide metadata on a per chip-type basis (e.g. hgu133a, hgu95av2)
• Each annotation package contains objects that provide mappings between identifiers (genes, probes, …) and different types of annotation data
• One can list the content of a package:
> library("hgu133a")
> ls("package:hgu133a")
 [1] "hgu133a"             "hgu133aACCNUM"
 [3] "hgu133aCHR"          "hgu133aCHRLENGTHS"
 [5] "hgu133aCHRLOC"       "hgu133aENTREZID"
 [7] "hgu133aENZYME"       "hgu133aENZYME2PROBE"
 [9] "hgu133aGENENAME"     "hgu133aGO"
[11] "hgu133aGO2ALLPROBES" "hgu133aGO2PROBE"
[13] "hgu133aLOCUSID"      "hgu133aMAP"
[15] "hgu133aMAPCOUNTS"    "hgu133aOMIM"
[17] "hgu133aORGANISM"     "hgu133aPATH"
[19] "hgu133aPATH2PROBE"   "hgu133aPFAM"
[21] "hgu133aPMID"         "hgu133aPMID2PROBE"
[23] "hgu133aPROSITE"      "hgu133aQC"
[25] "hgu133aREFSEQ"       "hgu133aSUMFUNC_DEPRECATED"
[27] "hgu133aSYMBOL"       "hgu133aUNIGENE"

A little bit of history... (the pre-SQL era)
before: hgu95av2
now: hgu95av2.db

Annotation Packages
• Objects in annotation packages used to be environments (hash tables for mapping); now the data is stored in SQLite databases
• Environments map only from one identifier to another and are hard to reverse
• quite inflexible
• The user interface still supports many of the old environment-specific interactions: you can access the data directly using any of the standard subsetting or extraction tools for environments: get, mget, $ and [[.
> get("201473_at", hgu133aSYMBOL)
[1] "JUNB"
> mget(c("201473_at","201476_s_at"), hgu133aSYMBOL)
$`201473_at`
[1] "JUNB"
$`201476_s_at`
[1] "RRM1"
> hgu133aSYMBOL$"201473_at"
[1] "JUNB"
> hgu133aSYMBOL[["201473_at"]]
[1] "JUNB"
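The environment mechanics can be tried without any annotation package installed. The following is a minimal base-R sketch: `symbol_env` is a hypothetical stand-in for hgu133aSYMBOL, filled only with the probe/symbol pairs that appear on these slides. It shows the four forward-lookup styles and why the reverse direction needs a full scan of the environment.

```r
# Toy stand-in for an old-style annotation environment (hypothetical name);
# the probe/symbol pairs are the ones shown on the slides.
symbol_env <- new.env(hash = TRUE)
assign("201473_at",   "JUNB", envir = symbol_env)
assign("201476_s_at", "RRM1", envir = symbol_env)
assign("1861_at",     "BAD",  envir = symbol_env)
assign("209364_at",   "BAD",  envir = symbol_env)

# Forward lookup: four equivalent access styles.
get("201473_at", envir = symbol_env)
mget(c("201473_at", "201476_s_at"), envir = symbol_env)
symbol_env$"201473_at"
symbol_env[["201473_at"]]

# Reverse lookup: the environment has no index on the values,
# so every entry has to be pulled out and regrouped by symbol.
probe2symbol <- unlist(as.list(symbol_env))
symbol2probe <- split(names(probe2symbol), probe2symbol)
bad_probes   <- sort(symbol2probe[["BAD"]])
bad_probes
```

With a real chip environment the regrouping step touches every probe on the array, which is the cost that the SQLite-backed packages avoid.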

Working with Metadata
Suppose we are interested in the gene BAD.
> gsyms <- unlist(as.list(hgu133aSYMBOL))
> whBAD <- grep("^BAD$", gsyms)
> gsyms[whBAD]
  1861_at 209364_at
    "BAD"     "BAD"
> hgu133aGENENAME$"1861_at"
[1] "BCL2-antagonist of cell death"

Working with Metadata
Find the pathways that BAD is associated with.
> BADpath <- hgu133aPATH$"1861_at"
> kegg <- mget(BADpath, KEGGPATHID2NAME)
> unlist(kegg)
01510 "Neurodegenerative Disorders"
04012 "ErbB signaling pathway"
04210 "Apoptosis"
04370 …
      "Colorectal cancer"
05212 "Pancreatic cancer"
05213 "Endometrial cancer"
05215

Working with Metadata
We can get the GeneChip probes and the unique EntrezGene loci in each of these pathways. First, we obtain the Affymetrix IDs.
> allProbes <- mget(BADpath, hgu133aPATH2PROBE)
> length(allProbes)
[1] 15
> allProbes[[1]][1:10]
 [1] "206679_at"   "209462_at"   "203381_s_at" "203382_s_at"
 [5] "212874_at"   "212883_at"   "212884_x_at" "200602_at"
 [9] "211277_x_at" "214953_s_at"
> sapply(allProbes, length)
01510 04012 04210 04370 04510 04910 05030 05210 05212 05213
   85   169   162   137   413   243    39   167   156   111
05215 05218 05220 05221 05223
  194   137   160   117   110

Working with Metadata
And then we can map these to their Entrez Gene values.
> getEG = function(x) unique(unlist(mget(x, hgu133aENTREZID)))
> allEG = sapply(allProbes, getEG)
> sapply(allEG, length)
01510 04012 04210 04370 04510 04910 05030 05210 05212 05213
   37    84    81    67   187   130    18    82    72    51
05215 05218 05220 05221 05223
   85    68    74    53    53
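The gene counts are smaller than the probe counts because several probes can interrogate the same gene. A small base-R sketch of the same two steps, with toy data, makes this visible. All identifiers below (`pathA`, `p1_at`, `g1`, ...) are made up for illustration and do not come from any annotation package.

```r
# Hypothetical toy data: two pathways, five probes, three genes.
path2probe <- list(
  pathA = c("p1_at", "p2_at", "p3_at"),
  pathB = c("p3_at", "p4_at", "p5_at")
)
probe2gene <- c(p1_at = "g1", p2_at = "g1", p3_at = "g2",
                p4_at = "g3", p5_at = "g3")

# Same shape as getEG on the slide: look up each probe, then deduplicate.
get_genes <- function(probes) unique(unname(probe2gene[probes]))

probe_counts <- sapply(path2probe, length)
# lapply keeps a list even when all pathways have equally many genes.
gene_counts  <- sapply(lapply(path2probe, get_genes), length)
rbind(probes = probe_counts, genes = gene_counts)
```

The sketch uses lapply for the intermediate step because sapply simplifies its result to a matrix whenever every element has the same length, which would happen here and would make the following length count wrong.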

.db Packages
• Data in the new .db annotation packages is stored in SQLite databases, which is much more efficient and flexible
• old environment-style access provided by objects of class Bimap (package AnnotationDbi)
[Figure: a Bimap drawn as a bipartite graph. Edges connect left objects to right objects; an edge may carry a name and attributes (attr1 = value1, attr2 = value2).]

DBI
• collection of classes and methods for database interaction
• they abstract the particular implementations of common standard operations on different types of databases
• resultSet: operations are performed on the database, the user controls how much information is returned
  dbSendQuery   create result set
  dbGetQuery    get all results
  dbGetQuery(connection, sql query)
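The difference between the two calls is easiest to see on a throwaway database. The sketch below assumes the DBI and RSQLite packages are installed; the in-memory table `gene_info` is a toy with four symbols and is not the table shipped in any annotation package.

```r
library(DBI)

# In-memory SQLite database with a toy table (hypothetical content).
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "gene_info",
             data.frame(symbol = c("A2M", "NAT1", "NAT2", "SERPINA3"),
                        stringsAsFactors = FALSE))

# dbSendQuery creates a result set; dbFetch pulls rows from it in chunks.
res       <- dbSendQuery(con, "SELECT symbol FROM gene_info")
first_two <- dbFetch(res, n = 2)    # first two rows only
rest      <- dbFetch(res, n = -1)   # everything that is left
dbClearResult(res)

# dbGetQuery sends the query, fetches all rows and clears the result set.
all_rows <- dbGetQuery(con, "SELECT symbol FROM gene_info")

dbDisconnect(con)
```

Chunked fetching matters for large tables; for the small lookups typical of annotation work, dbGetQuery is usually the more convenient call.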

.db Packages
Notice that there are a few more entries here. They give you access to a connection to the database.
> library("hgu133a.db")
> ls("package:hgu133a.db")
 [1] "hgu133aACCNUM"       "hgu133aALIAS2PROBE"
 [3] "hgu133aCHR"          "hgu133aCHRLENGTHS"
 [5] "hgu133aCHRLOC"       "hgu133aENTREZID"
 [7] "hgu133aENZYME"       "hgu133aENZYME2PROBE"
 [9] "hgu133aGENENAME"     "hgu133aGO"
[11] "hgu133aGO2ALLPROBES" "hgu133aGO2PROBE"
[13] "hgu133aMAP"          "hgu133aMAPCOUNTS"
[15] "hgu133aOMIM"         "hgu133aORGANISM"
[17] "hgu133aPATH"         "hgu133aPATH2PROBE"
[19] "hgu133aPFAM"         "hgu133aPMID"
[21] "hgu133aPMID2PROBE"   "hgu133aPROSITE"
[23] "hgu133aREFSEQ"       "hgu133aSYMBOL"
[25] "hgu133aUNIGENE"      "hgu133a_dbInfo"
[27] "hgu133a_dbconn"      "hgu133a_dbfile"
[29] "hgu133a_dbschema"

> con <- hgu133a_dbconn()
> q1 <- "select symbol from gene_info"
> head(dbGetQuery(con, q1))
    symbol
1      A2M
2     NAT1
3     NAT2
4 SERPINA3
extract information from a database table as data.frame
> toTable(hgu133aSYMBOL)[1:3,]
   probe_id symbol
1 217757_at    A2M
2 214440_at   NAT1
3 206797_at   NAT2
reverse mapping
> revmap(hgu133aSYMBOL)$BAD
[1] "1861_at"   "209364_at"

Lkeys, Rkeys: get left and right keys of a Bimap object
> head(Lkeys(hgu133aSYMBOL))
[1] "1007_s_at" "1053_at"   "117_at"    "121_at"    "1255_g_at" "1294_at"
> head(Rkeys(hgu133aSYMBOL))
[1] "A2M"      "NAT1"     "NAT2"     "SERPINA3" "AADAC"    "AAMP"
nhit: number of hits for every left key in a Bimap object
> table(nhit(revmap(hgu133aSYMBOL)))
   1    2    3   4   5  6  7  8 9 10 11 12 13 18 19
8101 2814 1273 475 205 77 19 15 5  3  4  1  2  1  1
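One way to picture what these accessors compute is to model a Bimap as a plain table of edges between left and right keys. The base-R sketch below is only such a model, not the AnnotationDbi implementation; the data frame `edges` is hypothetical and holds the probe/symbol pairs quoted on the slides.

```r
# Toy model of a Bimap as an edge table (not the AnnotationDbi internals).
edges <- data.frame(
  probe_id = c("201473_at", "201476_s_at", "1861_at", "209364_at"),
  symbol   = c("JUNB", "RRM1", "BAD", "BAD"),
  stringsAsFactors = FALSE
)

left_keys  <- unique(edges$probe_id)   # analogue of Lkeys()
right_keys <- unique(edges$symbol)     # analogue of Rkeys()

# Reversing the map only swaps the roles of the two columns.
reversed <- split(edges$probe_id, edges$symbol)   # analogue of revmap()
reversed$BAD

# Hits per symbol, then the distribution of hit counts,
# in the spirit of table(nhit(revmap(...))).
hits <- lengths(reversed)
table(hits)
```

Because the edge table is symmetric, neither direction is privileged, which is what makes reverse mapping cheap compared with the environment-based packages.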

Metadata about Metadata
<package>_dbschema()   database schemata of the package, e.g. hgu133a_dbschema()
<package>()            summary of tables, number of mapped elements, etc., e.g. hgu133a()
<package>_dbInfo()     meta information about origin of the data, chip type, etc., e.g. hgu133a_dbInfo()

> hgu133a()
Quality control information for hgu133a:
This package has the following mappings:
hgu133aACCNUM has 22283 mapped keys (of 22283 keys)
hgu133aALIAS2PROBE has 51017 mapped keys (of 51017 keys)
…
hgu133aSYMBOL has 21382 mapped keys (of 22283 keys)
hgu133aUNIGENE has 21291 mapped keys (of 22283 keys)
Additional Information about this package:
DB schema: HUMANCHIP_DB
DB schema version: 1.0
Organism: Homo sapiens
Date for NCBI data: 2008-Apr2
Date for GO data: 200803
Date for KEGG data: 2008-Apr1
Date for Golden Path data: 2006-Apr14
Date for IPI data: 2008-Mar19
Date for Ensembl data: 2007-Oct24

Annotating a Genome
Bioconductor also provides some comprehensive annotations for whole genomes (e.g. S. cerevisiae). They follow a naming convention like: org.Hs.eg.db. Currently we are trying to support all widely used model organisms. These packages are like the chip annotation packages, except that a different set of primary keys is used (e.g. for yeast we use the systematic names such as YBL088C).
> library("YEAST.db")
> ls("package:YEAST.db")[1:12]
 [1] "YEAST"             "YEASTALIAS"
 [3] "YEASTCHR"          "YEASTCHRLENGTHS"
 [5] "YEASTCHRLOC"       "YEASTCOMMON2SYSTEMATIC"
 [7] "YEASTDESCRIPTION"  "YEASTENZYME"
 [9] "YEASTENZYME2PROBE" "YEASTGENENAME"
[11] "YEASTGO"           "YEASTGO2ALLPROBES"

"old-style" vs SQL
Example from GO: number of terms in the three different ontologies
   BP    CC    MF
14598  2065  8268
old style:
> system.time(goCats <- unlist(eapply(GOTERM, Ontology)))
   User  System Elapsed
  70.75    0.12   88.48
> gCnums <- table(goCats)[c("BP","CC", "MF")]
SQL:
> system.time(goCats <- dbGetQuery(GO_dbconn(), "select ontology from go_term"))
   User  System Elapsed
   0.07    0.00    0.07

KEGG
• KEGG provides mappings from genes to pathways
• We provide these in the package KEGG.db; you can also query the site directly using KEGGSOAP or other software.
• One problem with KEGG is that the data is not in a form that is amenable to computation.

KEGG
Data in KEGG.db package
• KEGGEXTID2PATHID provides a mapping from either EntrezGene ID (for human, mouse and rat) or Open Reading Frame (yeast) to KEGG pathway ID.
• KEGGPATHID2EXTID contains the mapping in the other direction.
• KEGGPATHID2NAME provides a mapping from KEGG pathway ID to a textual description of the pathway. Only the numeric part of the KEGG pathway identifiers is used (not the three-letter species codes).

Exploring KEGG
Consider pathway 00362.
> KEGGPATHID2NAME$"00362"
[1] "Benzoate degradation via hydroxylation"
The species-specific mapping from pathway to genes is indexed by gluing together the three-letter species code, e.g. hsa, and the numeric pathway code.
> KEGGPATHID2EXTID$hsa00362
[1] "10449" "30"    "3032"  "59344" "83875"
> KEGGPATHID2EXTID$sce00362
[1] "YIL160C" "YKR009C"

Exploring KEGG
PAK1 has EntrezGene ID 5058 in humans.
> KEGGEXTID2PATHID$"5058"
[1] "hsa04010" "hsa04012" "hsa04360" "hsa04510" "hsa04650"
[6] "hsa04660" "hsa04810" "hsa05120" "hsa05211"
> KEGGPATHID2NAME$"04010"
[1] "MAPK signaling pathway"
We find that it is involved in 9 pathways. For mice, the MAPK signaling pathway contains
> mm <- KEGGPATHID2EXTID$mmu04010
> length(mm)
[1] 253
> mm[1:10]
 [1] "102626" "109689" "109880" "109905" "110157" "110651"
 [7] "114713" "11479"  "11651"  "11652"
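Moving between the species-prefixed identifiers and the purely numeric ones is plain string handling. The base-R sketch below starts from the nine PAK1 pathway identifiers shown above; the variable names are illustrative only, and no KEGG package is needed to run it.

```r
# Pathway identifiers returned for PAK1 on the slide.
pak1_paths <- c("hsa04010", "hsa04012", "hsa04360", "hsa04510", "hsa04650",
                "hsa04660", "hsa04810", "hsa05120", "hsa05211")

# Split each identifier into species code and numeric pathway code.
species  <- substr(pak1_paths, 1, 3)
path_num <- substr(pak1_paths, 4, nchar(pak1_paths))

# Glue a different species code onto the same numeric code.
mouse_key <- paste0("mmu", path_num[1])

length(pak1_paths)
path_num[1]
mouse_key
```

The values in `path_num` are the form of key that KEGGPATHID2NAME expects, while a glued key such as `mouse_key` is the form used with KEGGPATHID2EXTID.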

Dynamic Annotation
• The annotate package provides
  • functions for harvesting of curated persistent data sources
  • functions for simple HTTP queries to web service providers
  • interface code that provides common calling sequences for the assay based metadata packages, such as getSEQ
• getSEQ performs web queries to NCBI to extract the nucleotide sequence corresponding to a GenBank accession number.
> gsq <- getSEQ("M22490")
> substring(gsq,1,40)
[1] "GGCAGAGGAGGAGGGAGGGAGGGAAGGAGCGCGGAGCCCG"
M22490: mapped to locus HUMBMP2B; Human bone morphogenetic protein-2B (BMP-2B) mRNA.

The annotate Package
• other interface functions include getGO, getSYMBOL, getPMID, and getLL
• functions whose names start with pm work with lists of PubMed identifiers for journal articles.
> hgu133aSYMBOL$"209905_at"
[1] "HOXA9"
> pm.getabst("209905_at", "hgu133a")
$`209905_at`
$`209905_at`[[1]]
An object of class 'pubMedAbst':
Title: Vertebrate homeobox gene nomenclature.
PMID: 1358459
Authors: MP Scott
Journal: Cell
Date: Nov 1992

BioMart
• Generic data management system, collaboration between EBI and CSHL
• Several query interfaces and administration tools
• Conduct fast and powerful queries using:
  • website
  • webservice
  • graphical or text-oriented applications
  • software libraries written in Perl and Java
• http://www.ebi.ac.uk/biomart/

Ensembl
• Joint project between EMBL-EBI and the Sanger Institute
• Produces and maintains automatic annotation on selected eukaryotic genomes.
• http://www.ensembl.org

Ensembl martview

VEGA
The Vertebrate Genome Annotation (VEGA) database is a central repository for high quality, frequently updated, manual annotation of vertebrate finished genome sequence.
Current release:
• Human
• Mouse
• Zebrafish
• Dog
http://vega.sanger.ac.uk

WormBase
WormBase is the repository of mapping, sequencing and phenotypic information for C. elegans (and some other nematodes).
http://www.wormbase.org

WormMart

GrameneMart
Gramene: A Comparative Mapping Resource for Grains
Gramene is a curated, open-source, Web-accessible data resource for comparative genome analysis in the grasses.
http://www.gramene.org

Other databases with BioMart interfaces
• dbSNP (via Ensembl)
• HapMap
• Sequence Mart: Ensembl genome sequences

BioMart user interfaces

MartShell
MartShell is a command line BioMart user interface based on a structured query language: Mart Query Language (MQL)

BioMart user interfaces
• Martview: Web based user interface for BioMart, provides functionality for remote users to query all databases hosted by the EBI's public BioMart server.
• MartExplorer
• Perl and Java libraries
• biomaRt: interface to R/Bioconductor

The biomaRt package
Developed by Steffen Durinck (started Feb 2005)
Two main sets of functions:
1. Tailored towards Ensembl, shortcuts for FAQs (frequently asked queries): getGene, getGO, getOMIM...
2. Generic queries, modeled after MQL (Mart Query Language), can be used with any BioMart dataset
Two communication protocols:
1. Direct MySQL queries to BioMart database servers
2. HTTP queries to BioMart webservices: more stable (across database releases); self-reflective; fewer firewall problems

Getting started
> library(biomaRt)
> listMarts()
$biomart
[1] "dicty"    "ensembl"  "snp"      "vega"     "uniprot"  "msd"      "wormbase"
$version
[1] "DICTYBASE (NORTHWESTERN)" "ENSEMBL 38 (SANGER)"
[3] "SNP 38 (SANGER)"          "VEGA 38 (SANGER)"
[5] "UNIPROT 4-5 (EBI)"        "MSD 4 (EBI)"
[7] "WORMBASE CURRENT (CSHL)"
$host
[1] "www.dictybase.org" "www.biomart.org"   "www.biomart.org"
[4] "www.biomart.org"   "www.biomart.org"   "www.biomart.org"
[7] "www.biomart.org"
$path
[1] ""                     "/biomart/martservice" "/biomart/martservice"
[4] "/biomart/martservice" "/biomart/martservice" "/biomart/martservice"
[7] "/biomart/martservice"

Gene annotation
The function getGene allows you to get gene annotation for many types of identifiers.
Supported identifiers are:
• Affymetrix Genechip Probeset ID
• RefSeq
• Entrez-Gene
• EMBL
• HUGO
• Ensembl
• soon Agilent identifiers will also be available

getGene
> mart <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
> myProbes <- c("210708_x_at", "202763_at", "211464_x_at")
> z <- getGene(id = myProbes, array = "affy_hg_u133_plus_2", mart = mart)
           ID symbol
1   202763_at  CASP3
2 210708_x_at CASP10
7 211464_x_at  CASP6
  description
1 Caspase-3 precursor (EC 3.4.22.-) (CASP-3) (Apopain) ...
2 Caspase-10 precursor (EC 3.4.22.-) (CASP-10) (ICE-like apoptotic pro..
7 Caspase-6 precursor (EC 3.4.22.-) (CASP-6) (Apoptotic protease Mch-2)...
  chromosome  band strand chromosome_start chromosome_end ensembl_gene_id
1          4 q35.1     -1        185785845      185807623 ENSG00000164305
2          2 q33.1      1        201756100      201802372 ENSG00000003400
7          4   q25     -1        110829234      110844078 ENSG00000138794
  ensembl_transcript_id
1       ENST00000308394
2       ENST00000272879
7       ENST00000265164

Gene annotation • Note: Ensembl does an independent mapping of affy probe sequences to genomes. If there is no clear match then that probe is not assigned to a gene.

Gene annotation
• getGene returns a data frame with the columns:
  • Gene symbol
  • Description
  • Chromosome name
  • Band
  • Start position
  • End position
  • BioMartID

getGene
> getGene(id = 100, type = "entrezgene", mart = mart)
   ID symbol
1 100    ADA
  description
1 Adenosine deaminase (EC 3.5.4.4) (Adenosine aminohydrolase). [Source:Uniprot/SWISSPROT;Acc:P00813]
  chromosome   band strand chromosome_start chromosome_end ensembl_gene_id
1         20 q13.12     -1         42681577       42713797 ENSG00000196839
  ensembl_transcript_id
1       ENST00000372874
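The getGene shortcut corresponds to a generic BioMart query. As a hedged sketch, assuming a current biomaRt installation in which the generic getBM function is the usual entry point, a probe lookup could be wrapped as below. The function name `annotate_probes` is hypothetical, and the attribute and filter names are assumptions about the Ensembl human dataset that can be checked with listAttributes(mart) and listFilters(mart). Defining the function needs neither the package nor a network connection; calling it needs both.

```r
# Sketch of a generic BioMart query for Affymetrix probe set IDs.
# annotate_probes is a hypothetical helper, not part of biomaRt.
annotate_probes <- function(probes, mart) {
  biomaRt::getBM(
    attributes = c("affy_hg_u133_plus_2", "hgnc_symbol", "chromosome_name",
                   "start_position", "end_position", "ensembl_gene_id"),
    filters    = "affy_hg_u133_plus_2",   # which identifier type to match
    values     = probes,                  # the identifiers themselves
    mart       = mart
  )
}

# Usage (not run here; requires biomaRt and network access):
# mart <- biomaRt::useMart("ensembl", dataset = "hsapiens_gene_ensembl")
# annotate_probes(c("210708_x_at", "202763_at", "211464_x_at"), mart)
```

Since this is dynamic annotation, the rows returned depend on the Ensembl release that the server is running at the time of the query.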
