Agenda • Biological databases related to microarray • Gene Ontology • KEGG • Biocarta • Reactome • MSigDB • Pathway enrichment analysis • GSEA • GSA • Ingenuity Pathway Analysis (IPA) • Motif finding
1. Databases Biological pathways and knowledge are very complex: • Is it possible to establish a database? • Can the knowledge be structured and managed systematically? • Can it be used to validate analysis results, or be incorporated into the analysis?
1.1 Gene Ontology • Ontologies: controlled vocabularies that describe the functions of genes. • The database is structured as directed acyclic graphs (DAGs), which differ from hierarchical trees in that a 'child' (more specialized term) can have many 'parents' (less specialized terms).
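A minimal sketch of the DAG idea, assuming a toy ontology stored as a child-to-parents dictionary. The term names and links are made up for illustration and are not the real GO structure. Because a child may list several parents, collecting all ancestors needs a graph traversal rather than a single walk up a tree:

```python
# Hypothetical toy ontology: each term maps to its direct parents.
# The links are illustrative only, not taken from the real GO graph.
PARENTS = {
    "cell cycle arrest": ["negative regulation of cell cycle", "cell cycle process"],
    "negative regulation of cell cycle": ["regulation of cell cycle"],
    "regulation of cell cycle": ["biological_process"],
    "cell cycle process": ["cell cycle"],
    "cell cycle": ["biological_process"],
    "biological_process": [],
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable by following parent links from `term`."""
    seen = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


# "cell cycle arrest" has two direct parents, so it sits on two branches at once.
print(sorted(ancestors("cell cycle arrest")))
```

A gene annotated to a term is implicitly annotated to all of that term's ancestors, which is one reason tests on related GO terms are not independent.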
1.1 Gene Ontology Three major categories in Gene Ontology. Current term counts as of April 2, 2005 at 18:00 Pacific time: 17708 terms, 93.8% with definitions.
• 9263 biological_process
• 1496 cellular_component
• 6949 molecular_function
1.1 Gene Ontology Evidence code: How is the information collected? • IC inferred by curator • IDA inferred from direct assay • IEA inferred from electronic annotation • IEP inferred from expression pattern • IGI inferred from genetic interaction • IMP inferred from mutant phenotype • IPI inferred from physical interaction • ISS inferred from sequence or structural similarity • NAS non-traceable author statement • ND no biological data available • RCA inferred from reviewed computational analysis • TAS traceable author statement • NR not recorded • There may be (a lot of) errors in the database!!
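Since annotations differ in how they were obtained, one option is to restrict an analysis to selected evidence codes. A minimal sketch, assuming annotations are available as (gene, term, evidence code) tuples. The gene names are hypothetical, and which codes to exclude is a judgment call rather than a fixed rule:

```python
# Hypothetical (gene, GO term, evidence code) records, for illustration only.
ANNOTATIONS = [
    ("geneA", "cell cycle", "IDA"),
    ("geneB", "cell cycle", "IEA"),
    ("geneC", "cell cycle", "TAS"),
    ("geneD", "cell cycle", "ND"),
]

# Assumed choice: drop electronic, no-data and not-recorded annotations.
EXCLUDED_CODES = {"IEA", "ND", "NR"}


def filter_by_evidence(annotations, excluded=EXCLUDED_CODES):
    """Keep only records whose evidence code is not in `excluded`."""
    return [record for record in annotations if record[2] not in excluded]


kept = filter_by_evidence(ANNOTATIONS)
print(kept)
```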
1.1 Gene Ontology • Demo: • Go to GO: http://www.geneontology.org • Go to "Tools" and click on "AmiGO". • Click "Browse". Click on the boxes with "+" to expand any category and look at its subcategories. Click on "-" to collapse it again. • Type the term "cell cycle" in the "Search GO" field and press "Submit". You will then see all GO categories containing this term. • Click on a GO term, say "cell cycle arrest". Genes belonging to this GO term can be shown. Further filter genes by "Data source" or "Species". • Type the name "cyclin" in AmiGO. Change to the "genes or proteins" selection button and press "Submit". You will then see a number of genes containing this name. Press some of the "Tree view" links. • Note that in some cases, the same term can appear in several places in the tree. The ontology is thus not strictly hierarchical, but shows complex "many-to-many" relationships between gene products, ontology terms and branches in the ontology tree.
1.2 KEGG http://www.genome.jp/kegg/pathway.html
1.2 KEGG Kyoto Encyclopedia of Genes and Genomes. KEGG is a suite of databases and associated software that integrates current knowledge on molecular interaction networks in biological processes (PATHWAY database), information about the universe of genes and proteins (GENES/SSDB/KO databases), and information about the universe of chemical compounds and reactions (COMPOUND/GLYCAN/REACTION databases). The current statistics of the KEGG databases are as follows:
Number of pathways: 23,574 (PATHWAY database)
Number of reference pathways: 265 (PATHWAY database)
Number of ortholog tables: 87 (PATHWAY database)
Number of organisms: 272 (GENOME database)
Number of genes: 911,584 (GENES database)
Number of ortholog clusters: 35,456 (SSDB database)
Number of KO assignments: 6,221 (KO database)
Number of chemical compounds: 12,737 (COMPOUND database)
Number of glycans: 11,017 (GLYCAN database)
Number of chemical reactions: 6,399 (REACTION database)
Number of reactant pairs: 5,953 (RPAIR database)
1.2 KEGG RNA polymerase:
1.2 KEGG Cell cycle:
1.2 KEGG Parkinson’s disease. Pathways are also available for Alzheimer’s disease, Huntington’s disease, prion disease, ...
1.4 Reactome • A manually curated and peer-reviewed (authors, reviewers and editors) pathway database. • Now annotates 5849 proteins, 4555 complexes, 4827 reactions and 1192 pathways in Homo sapiens (Version 39, 2/21/2012).
1.5 MSigDB A comprehensive pathway database (mainly gene sets without a graphical interaction model). Useful for conventional pathway (gene set) enrichment analysis.
C1: Positional gene sets (326)
C2: Curated gene sets (3272)
    Canonical pathways (880)
        Biocarta (217)
        KEGG (186)
        Reactome (430)
C3: Motif gene sets (836)
    miRNA targets gene sets (221)
    TF targets gene sets (615)
C4: Computational gene sets (881)
C5: GO gene sets (1454)
2. Enrichment analysis • After • Selecting DE genes, or • Classification, or • Clustering • We are usually given a gene list for further investigation. How do we validate information contained in the gene list by available biological knowledge?
2. Enrichment analysis Cell cycle data: Cells are synchronized and samples taken at various time points (covering 2 cell cycles). 6162 genes are included. From Fourier analysis, 800 genes with cyclic gene expression pattern are selected for further investigation. Are these 800 genes really involved in cell cycle?
2. Enrichment analysis http://db.yeastgenome.org/cgi-bin/GO/goTermMapper
2. Enrichment analysis Is the selected set of genes enriched in the GO term of “cell cycle”?
2. Enrichment analysis R code for chi-square test without continuity correction:
> chisq.test(matrix(c(285, 5012, 100, 691), 2, 2), correct=FALSE)
        Pearson's Chi-squared test
data:  matrix(c(285, 5012, 100, 691), 2, 2)
X-squared = 61.2644, df = 1, p-value = 4.99e-15
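To see what the test computes, the statistic can be reproduced by hand from the four cell counts. The sketch below is an illustrative re-implementation in plain Python, not the code used for the slide. It relies on the closed form for a 2x2 table and on the fact that, with one degree of freedom, the upper tail of the chi-square distribution equals erfc(sqrt(X/2)):

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With 1 degree of freedom, the upper tail probability is erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# R fills matrices column by column, so matrix(c(285, 5012, 100, 691), 2, 2)
# has rows (285, 100) and (5012, 691).
stat, p_value = chi_square_2x2(285, 100, 5012, 691)
print(round(stat, 4), p_value)
```

The printed statistic and p-value should agree with the R output above up to rounding.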
2. Enrichment analysis The chi-squared test is an approximate test and may not perform well when the sample size is small. Fisher’s exact test is a better alternative. Fisher’s exact test: G genes in the genome (G=1663) are analyzed, and m of them belong to a functional category “F”. If h genes in a cluster of size C are found to be in category “F”, the p-value (i.e. the probability of observing h or more annotated genes in the cluster) is calculated as (Tavazoie et al. 1999):
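With the symbols defined above (G genes in total, m of them in category F, a cluster of size C containing h category members), the tail probability is usually written in the hypergeometric form below. This is a reconstruction from those definitions; the notation in Tavazoie et al. (1999) may differ.

```latex
P \;=\; 1 \;-\; \sum_{i=0}^{h-1}
  \frac{\binom{m}{i}\,\binom{G-m}{C-i}}{\binom{G}{C}}
```

Subtracting the probabilities of 0 to h-1 category members from 1 leaves the probability of h or more.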
2. Enrichment analysis Fisher’s exact test: if genes are randomly assigned, the number of intersection genes follows a hypergeometric distribution. The p-value is the probability of observing h or more intersection genes by chance.
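The point probability and the tail sum can be computed directly from binomial coefficients. A minimal sketch using only the standard library (Python 3.8+ for `math.comb`); the function names are illustrative, and the symbols G, m, C and h follow the definitions used for Fisher’s exact test:

```python
from math import comb


def hypergeom_pmf(h, G, m, C):
    """P(exactly h of the C cluster genes fall in a category of m genes out of G)."""
    return comb(m, h) * comb(G - m, C - h) / comb(G, C)


def enrichment_p_value(h, G, m, C):
    """One-sided p-value: probability of h or more intersection genes by chance."""
    upper = min(m, C)
    # Sum the integer numerators first, then divide once, to limit rounding error.
    numerator = sum(comb(m, i) * comb(G - m, C - i) for i in range(h, upper + 1))
    return numerator / comb(G, C)


# Small made-up example: 10 genes, 4 in the category, cluster of 5, overlap of 3.
p_small = enrichment_p_value(3, G=10, m=4, C=5)
print(p_small)
```

Working with exact integers until the final division avoids the overflow that would occur if each binomial coefficient were first converted to a float.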
2. Enrichment analysis Fisher’s exact test Observation: • There are only two possible outcomes more extreme than the observed one:
2. Enrichment analysis Kolmogorov-Smirnov test (KS test) -- A major issue with Fisher’s exact test is that it requires an ad hoc threshold to generate the DE gene list. -- The KS test is a better way to associate any gene ordering with pathway information. Example: S1=(1,2,3,5), S2=(4,6,8,9,10), $D = \max_x |F_1(x) - F_2(x)|$, where F1 and F2 are the empirical distribution functions of S1 and S2.
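The statistic for the example can be worked out by evaluating both empirical distribution functions at every observed value. A minimal sketch (the helper names `ecdf` and `ks_statistic` are illustrative, and no p-value is computed):

```python
def ecdf(sample, x):
    """Empirical distribution function: fraction of values in `sample` that are <= x."""
    return sum(1 for value in sample if value <= x) / len(sample)


def ks_statistic(s1, s2):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    # Both step functions only change at observed values, so checking those suffices.
    points = sorted(set(s1) | set(s2))
    return max(abs(ecdf(s1, x) - ecdf(s2, x)) for x in points)


S1 = (1, 2, 3, 5)
S2 = (4, 6, 8, 9, 10)
D = ks_statistic(S1, S2)
print(D)  # 0.8, attained at x = 5 where F1 = 1.0 and F2 = 0.2
```

In a pathway setting, S1 would hold the ranks of genes in the pathway and S2 the ranks of all other genes.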
2. Enrichment analysis • In practice, we need to search through thousands of GO terms to determine which GO terms are enriched in the selected gene set. • Multiple comparison problem!! Difficulties: Tests are highly dependent. • Hierarchical structure of the GO, e.g. “Cell Proliferation” is a parent GO term of “Cell Cycle”. • Each gene can belong to multiple GO terms, e.g. the human HoxA7 gene belongs to four GO terms: “Development”, “Nucleus”, “DNA dependent regulation and transcription”, “Transcription factor activity”.
2. Enrichment analysis • Simple and naïve way: • Get p-values from Fisher’s exact test for all pathways. • Correct by the Benjamini-Hochberg procedure to control FDR. • Problem: • Fisher’s test reduces the DE statistics to a binary (0-1) biomarker list. • It does not consider the dependence structure among genes or the hierarchical dependence structure among pathways. • Improved methods: • Use averaged t-statistics or Kolmogorov-Smirnov (KS) statistics as the pathway-specific enrichment score. • Apply a permutation test (either gene permutation or sample permutation) to perform FDR control. • Read the following papers if interested. • Goeman, J.J. and Buhlmann, P. (2007) Analyzing gene expression data in terms of gene sets: methodological issues, Bioinformatics, 23, 980-987. • Tian, L., Greenberg, S.A., Kong, S.W., Altschuler, J., Kohane, I.S. and Park, P.J. (2005) Discovering statistically significant pathways in expression profiling studies, Proceedings of the National Academy of Sciences of the United States of America, 102, 13544-13549. • Efron, B. and Tibshirani, R. (2007) On testing the significance of sets of genes, Annals of Applied Statistics, 1, 107-129. • Subramanian, A., Tamayo, P., Mootha, V.K., Mukherjee, S., Ebert, B.L., Gillette, M.A., Paulovich, A., Pomeroy, S.L., Golub, T.R., Lander, E.S. and Mesirov, J.P. (2005) Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles, Proceedings of the National Academy of Sciences of the United States of America, 102, 15545-15550.
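The Benjamini-Hochberg step of the simple approach can be sketched as follows. The p-values are made up and stand in for per-pathway Fisher test results; the function is an illustrative implementation of the standard step-up adjustment, not code from any of the packages mentioned in this lecture:

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the original order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone adjusted values.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


# Made-up p-values, one per pathway.
pvals = [0.001, 0.008, 0.039, 0.041, 0.60]
qvals = benjamini_hochberg(pvals)
significant = [q <= 0.05 for q in qvals]
print(qvals, significant)
```

Note that the third and fourth pathways have raw p-values below 0.05 but are not declared significant at an FDR of 0.05 after adjustment.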
2. Enrichment analysis • Simple Fisher’s exact test: • Ingenuity Pathway • A commercial package with a good interface and human-curated annotation. Can generate network figures. • NIH DAVID • Free and web-based. Performs enrichment analysis (Fisher’s exact test), adjusts for multiple comparisons and generates a table of results. Uses multiple databases. • “GOstats” package in Bioconductor • Free R package. Performs enrichment analysis (Fisher’s exact test) and generates a table of results. Uses only the GO database. • More sophisticated and systematic methods: • Gene set enrichment analysis (GSEA; MIT Mesirov’s group) • http://www.broad.mit.edu/gsea/ (free) • Gene set analysis (GSA; Stanford Tibshirani’s group) • http://www-stat.stanford.edu/~tibs/GSA/ (free) • Ingenuity Pathway Analysis (IPA) • http://www.ingenuity.com/ (commercial; Pitt has purchased licenses)
2. Enrichment analysis • Things to note when using biological databases: • Biological pathways and gene functions are complex and difficult to quantify. • Data may not be accurate. The analysis should take the strength of evidence into account. • May need to go to a specific database for a particular organism (e.g. SGD for yeast; FlyBase and BDGP for fly). • Systematically collecting and managing massive biological knowledge from publications and experiments is an important and active research topic in bioinformatics.
3. Motif Finding http://web.indstate.edu/thcme/mwking/gene-regulation.html
3. Motif Finding • Genes in a cluster have similar expression patterns. • They might share common regulatory motifs so they are expressed simultaneously. • It is of interest to find motifs from the gene clusters.
3. Motif Finding The following materials are obtained from Shirley Liu at Harvard.