Large-scale mining of gene expression patterns

Large-scale mining of gene expression patterns Paul Pavlidis paul@bioinformatics.ubc.ca VanBUG September 2007

Students: Leon French, Meeta Mistry, Vaneet Lotay
Postdoc: Jesse Gillis
Undergraduates: Raymond Lim, Suzanne Lane
Programmers: Kelsey Hamer, Luke McCarthy

Genome Synapse Injury Stress Disease Aging Development Signal transduction Synaptic modulation

Topics
• Connectivity database and analysis
• Gene expression data re-use system
• Scaling up gene coexpression analysis
• Applications and ongoing work

Another ‘ome

Leon French, Suzanne Lane

Figure (labels: Age, Genes, Samples). With JJ Mann, V Arango, E Sibille et al.

Figure (labels: Age, Genes, Samples). Data from http://national_databank.mclean.harvard.edu/

GEO

Goals for a system
• Researchers should be able to put their new expression data in a wider context of previous studies without extraordinary effort.
• Move the analysis of multiple microarray data sets from a niche activity to the mainstream.
• Integrate other data types and domain-specific information.

Public data sources Coexpression Differential expression

Challenges to comparing data sets
• Need to match genes/transcripts across platforms
• Data from third parties not always easy to handle
• Varying scales, normalization, etc.
• Varying data quality
• Varying levels of “raw data” available
• Selecting appropriate data to compare
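The first challenge, matching genes across platforms, can be illustrated with a minimal sketch. It assumes each platform comes with a probe-to-gene annotation table and that probes for the same gene are simply averaged; both the averaging rule and all probe IDs, gene names, and values are hypothetical and are not taken from the talk.

```python
from collections import defaultdict


def collapse_probes_to_genes(probe_values, probe_to_gene):
    """Average all probes that map to the same gene, sample by sample."""
    profiles_by_gene = defaultdict(list)
    for probe, values in probe_values.items():
        gene = probe_to_gene.get(probe)
        if gene is None:
            continue  # skip probes with no gene annotation
        profiles_by_gene[gene].append(values)
    return {
        gene: [sum(column) / len(column) for column in zip(*profiles)]
        for gene, profiles in profiles_by_gene.items()
    }


# Hypothetical probe IDs, gene names, and expression values for two platforms.
platform_x = {
    "x_001": [1.0, 2.0],
    "x_002": [3.0, 4.0],
    "x_003": [5.0, 5.0],
    "x_004": [9.0, 9.0],  # no annotation, so it is dropped
}
annotation_x = {"x_001": "geneA", "x_002": "geneA", "x_003": "geneB"}

platform_y = {"y_17": [0.2, 0.4, 0.6], "y_42": [1.0, 1.0, 1.0]}
annotation_y = {"y_17": "geneA", "y_42": "geneC"}

genes_x = collapse_probes_to_genes(platform_x, annotation_x)
genes_y = collapse_probes_to_genes(platform_y, annotation_y)

# Only genes measured on both platforms can be compared across data sets.
comparable_genes = sorted(set(genes_x) & set(genes_y))
```

Averaging is only one possible rule; probes with poor specificity may be better excluded than averaged in.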

With Cincinnati Children’s Hospital (D.Glass, M. Barnes et al.)

Probe specificity (or lack thereof)

Which data sets are reasonable to compare?
Too general, but lots of power
All mouse data sets
Mouse brain data sets
Mouse neocortex data sets
Mouse neocortex data sets examining stress
Mouse neocortex data sets examining hypoxic stress
Mouse neocortex data sets examining hypoxic stress after 3 hours of hypoxia
Very specific, low power

Array designs: 178
Assays (i.e., chips): 20837
Coexpression links (probe-level): >100 million

Scaling up analysis of gene coexpression
• Genes that are coexpressed tend to have related function
• Needed at the same place at the same time
• “Guilt by association”
• Reasonable to compare across studies
Figure: Two ribosomal protein genes (axes: Expression, Samples). Eisen et al., 1998 PNAS
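Coexpression between two genes is commonly scored as the correlation of their expression profiles across samples. The sketch below uses the Pearson correlation as an assumed measure (the slides refer only to "correlation"); the three gene profiles are made-up values.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant profile has no defined correlation")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical expression values for three genes across six samples.
gene_a = [2.1, 3.4, 1.8, 4.0, 3.1, 2.5]
gene_b = [4.0, 6.9, 3.5, 8.1, 6.0, 5.2]  # rises and falls with gene_a
gene_c = [5.0, 1.0, 4.2, 3.9, 1.1, 4.8]  # no obvious relation to gene_a

r_ab = pearson(gene_a, gene_b)
r_ac = pearson(gene_a, gene_c)
```

Because the correlation ignores each gene's scale and offset, it can be computed within every data set separately and then compared across studies.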

Biological noise
• Induced gene expression effects are often small.
• Gene expression varies between “replicates” in biologically meaningful ways.
• This variation allows us to repurpose data.
Figure (axis label: Sample type)

Functional coexpression should be (somewhat) generalized
• If two genes are coexpressed under one condition, they will probably be coexpressed under at least some other conditions (or data sets).
• Coexpression seen “only once” needs special care in interpretation.
• We shouldn’t expect coexpression to be perfectly reproducible (for biological and technical reasons).
Figure (both axes labeled Correlation)

A simple approach: count recurring patterns
Genome Research, June 2004
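Counting recurring patterns can be sketched as follows, assuming each data set has already been reduced to a list of coexpression "links" (gene pairs that passed some per-data-set threshold). The function names, gene names, and toy links are hypothetical; this is an illustration of the idea, not the published pipeline.

```python
from collections import Counter


def count_link_support(links_per_dataset):
    """Count, for every gene pair, the number of data sets containing it."""
    support = Counter()
    for links in links_per_dataset:
        # Sort each pair so (A, B) and (B, A) are the same link, and use a
        # set so a link is counted at most once per data set.
        unique_links = {tuple(sorted(pair)) for pair in links}
        support.update(unique_links)
    return support


def recurring_links(support, min_datasets):
    """Keep only links seen in at least `min_datasets` data sets."""
    return {link: n for link, n in support.items() if n >= min_datasets}


# Hypothetical links from three data sets.
datasets = [
    [("geneA", "geneB"), ("geneA", "geneC")],
    [("geneB", "geneA"), ("geneC", "geneD")],
    [("geneA", "geneB"), ("geneC", "geneD"), ("geneA", "geneB")],
]

support = count_link_support(datasets)
replicated = recurring_links(support, min_datasets=2)
```

Here the link between geneA and geneB is supported by all three data sets, while the link between geneA and geneC appears only once and is filtered out.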

Pipeline for one dataset

Proof of concept analysis
• 60 human data sets, 15700 RefSeq genes.
• 70% cancer data
• 11 million “links”
• About 9.7 million different links

Many links are replicated across studies

Evaluation on biological grounds

Cluster involving NMDAR1 (GRIN1)

GRIN1 ATP6V0A1 PLD3 Allen Brain Institute

Application: analysis of imprinted genes
Laurent Journot, INSERM – Universités Montpellier

LYAR interacting proteins
Figure (labels: Correlation, p-value, LYAR-interactors)
Ewing et al., 2007, Molecular Systems Biology

Vote counting limitations
• Weak evidence distributed across data sets will not be picked up.
• This example meets strict “vote counting” criteria in only 2/23 data sets
Figure (axis label: Correlation)
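One standard way to pick up weak but consistent evidence is to pool the correlations themselves rather than counting threshold crossings. The sketch below uses a sample-size-weighted mean of Fisher z-transformed correlations; this is a generic meta-analysis technique chosen for illustration, not necessarily the method used in this work, and the correlations, sample sizes, and the 0.7 threshold are invented.

```python
import math


def combined_correlation(correlations, sample_sizes):
    """Pool per-data-set correlations with Fisher's z-transform.

    Returns the pooled correlation and an approximate z-score for the
    null hypothesis of zero correlation.
    """
    # Var(atanh(r)) is roughly 1 / (n - 3), so n - 3 is the usual weight.
    weights = [n - 3 for n in sample_sizes]
    if any(w <= 0 for w in weights):
        raise ValueError("each data set needs more than 3 samples")
    total_weight = sum(weights)
    z_mean = sum(
        w * math.atanh(r) for w, r in zip(weights, correlations)
    ) / total_weight
    z_score = z_mean * math.sqrt(total_weight)
    return math.tanh(z_mean), z_score


# Hypothetical gene pair: a modest correlation in each of 10 data sets.
correlations = [0.3] * 10
sample_sizes = [20] * 10

# Assumed strict per-data-set threshold for a "vote".
votes = sum(1 for r in correlations if r >= 0.7)

pooled_r, z_score = combined_correlation(correlations, sample_sizes)
```

In this toy case the pair receives no votes at all, yet the pooled evidence is strong (a z-score of about 4), which is the situation the slide warns about.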

Figure (axis labels: Correlation (Global), Support (# of datasets))

Figure (labels: Datasets, Gene pairs)
Related work: Zhou XJ et al., Nat. Biotech 2005

Summary
• Reuse of public data: ‘adding value’
• Meta-analysis of coexpression
• Some applications
  • Functional prediction
  • Candidate identification
  • Platform evaluation

Ongoing and future work
• Applications and analyses
  • Protein interactions and hubs
  • Prediction of gene function at the synapse
  • Differential expression analysis
  • Regionalization
  • Mouse models of brain injury
  • Mouse models of psychosis
• Expanding our public database and software
  http://www.bioinformatics.ubc.ca/Gemma
  Web-based tools for biologists; web services coming soon
• Integration with other information sources

Thanks
• And to:
  • NCBI GEO team
  • Groups who made data available
  • Collaborators who provided data prior to publication
  • Conrad Gilliam
  • Abraham Palmer
  • Andreas Kottmann
  • Etienne Sibille
Gemma: Xiang Wan, Kelsey Hamer, Luke McCarthy, Kiran Keshav, Suzanne Lane, Meeta Mistry, Jesse Gillis, Joseph Santos, Gozde Cozen, David Quigley, Anshu Sinha, Spiro Pantazatos, Wei-Keat Lim
Tmm: Homin Lee, Amy Hsu, Jon Sajdak, Jie Qin, Tzu-Lin Hsaio
Collaborators: Barclay Morrison, Joseph Gogos, Michael Hayden, Blair Leavitt, Tony Blau, Panos Papapanou

Answers to FAQs
• No, they don’t have to be time course experiments.
• Yes, we’re using cDNA as well as Affymetrix etc.
• Yes, we see reproducible negative correlations.
• Yes, we’re interested in finding differences as well as similarities between data sets.
• No, we aren’t necessarily inferring regulatory relationships.
• Yes, we know that RNA is just one way of measuring cell state.
• No, we don’t have {worm,fly,yeast…} data, but we’d like to.
