Machine Learning for Functional Genomics I


Matt Hibbs

http://cbfg.jax.org

Central Dogma

DNA → Gene Expression → Proteins → Phenotypes

Functional Genomics

Identify the roles played by genes/proteins

Sealfon et al., 2006.

Gene Expression Microarrays

Simultaneous measurements of mRNA abundance levels for every gene in a genome, in thousands of conditions

[Figure: expression data matrix; axes: Genes, Conditions]

These data contain rich functional information, but how can we use the entire compendium?

Biological Data Explosion

Huge repositories of biological data…

[Figure, two panels: "Publicly available microarrays in GEO" (# of measurements vs. year) and "Mouse genes with known process association" (# of genes vs. year)]

…are not directly translating into knowledge

Why is there a Data-Knowledge Gap?
  • Many datasets are analyzed only once
    • Initial publication looks for hypothesis
    • Need standards for naming, formats, collection
  • Data should be aggregated and integrated
    • Modestly significant clues seen repeatedly can become convincing
    • “a preponderance of circumstantial evidence”
  • Scale of this problem overwhelms traditional biology
Scalable Artificial Intelligence

Computer science is really a study in scalability

Use machine learning and data mining techniques to quickly identify important patterns

Amazon Recommendations

[Figure: Purchase History, Item Rankings, and observed browsing patterns and account activity → Machine Learning (Bayesian networks) → Recommendations]

  • Compare your purchase history to all other customers
  • Find commonalities between profiles
  • Predict potential purchases

Gene Function Prediction

[Figure: the Amazon pipeline next to its genomics counterpart.
 Amazon: Purchase History, Item Rankings, and observed browsing patterns and account activity → Machine Learning (Bayesian networks) → Recommendations.
 Genomics: Genome Scale Data, MGI Annotations, and Laboratory Experiments → Machine Learning (Bayesian networks) → Predictions]

Challenges for AI from Biology

Input data is noisy, heterogeneous, constantly evolving

Current knowledge is incomplete and biased

Can be difficult to determine accuracy

Promise of Computational Functional Genomics

[Figure: diagram relating Data & Existing Knowledge, Computational Approaches, Predictions, and Laboratory Experiments]

Reality of Computational Functional Genomics

[Figure: diagram relating Data & Existing Knowledge, Computational Approaches, Predictions, and Laboratory Experiments]

Computational Solutions
  • Machine learning & data mining
    • Use existing data to make new predictions
      • Similarity search algorithms
      • Bayesian networks
      • Support vector machines
      • etc.
    • Validate predictions with follow-up lab work
  • Visualization & exploratory analysis
    • Seeing and interacting with data is important
    • Show data so that questions can be answered
      • Scalability, incorporate statistics, etc.
Similarity Search Approach

[Figure: Query Genes and a Data Collection feed the Search Algorithm (SPELL), which returns Relevant Datasets and Related Genes]

  • Re-frame analysis as exploratory search
Key Insights

X = U Σ Vt

  • Context-Sensitive Search Process
  • Signal Balancing
  • Correlation Comparability
Dataset relevance weighting

Query genes: Q = {YQG1, YQG2, YQG3}

[Figure: expression of YQG1, YQG2, and YQG3 in each dataset, with example dataset weights 0.15, 0.82, 0.05, 0.55]

Calculate a correlation measure among the query genes within each dataset; this is each dataset's weight.
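
The weighting step can be sketched in a few lines. This is only an illustration: the slide says "a correlation measure" without naming it, so the mean pairwise Pearson correlation used here is an assumption, and `dataset_weight` and the toy expression values are invented for the example.

```python
from itertools import combinations

import numpy as np


def dataset_weight(dataset, query):
    """Mean pairwise Pearson correlation among the query genes in one dataset.

    `dataset` maps gene name -> list of expression values, one per condition.
    Averaging the pairwise correlations is an assumption, not the slide's formula.
    """
    corrs = [
        np.corrcoef(dataset[a], dataset[b])[0, 1]
        for a, b in combinations(query, 2)
    ]
    return float(np.mean(corrs))


query = ["YQG1", "YQG2", "YQG3"]

# Toy data (invented): the query genes rise together in the first dataset
# and do not move together in the second.
coherent = {
    "YQG1": [1.0, 2.0, 3.0, 4.0],
    "YQG2": [2.0, 4.0, 6.0, 8.5],
    "YQG3": [0.0, 1.0, 2.0, 3.0],
}
incoherent = {
    "YQG1": [1.0, 2.0, 3.0, 4.0],
    "YQG2": [4.0, 1.0, 3.0, 2.0],
    "YQG3": [2.0, 4.0, 1.0, 3.0],
}

weights = [dataset_weight(d, query) for d in (coherent, incoherent)]
print(weights)
```

A dataset in which the query genes are co-expressed receives a high weight, so it contributes more to the search; a dataset in which they are unrelated receives a low one.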

Identify Novel Partners

Query genes: Q = {YQG1, YQG2, YQG3}

Calculate a weighted distance score for all other genes to the query set

[Figure: candidate genes scored against the query across the weighted datasets (example weights 0.15, 0.82, 0.05, 0.55) and ranked from best score to worst score: geneB, geneC, geneA]

+ Takes advantage of functional diversity
+ Addresses statistical concerns
+ Fast running times, O(GDQ²) (milliseconds per query)
+ Top results are candidates for investigation
+ Search process is iterative to refine results
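
A minimal sketch of the weighted scoring and ranking, under stated assumptions: each candidate's score is taken to be the weight-averaged mean Pearson correlation to the query genes, which may differ from the exact SPELL score. The function name `rank_partners` and the toy expression values are hypothetical; the weights 0.82 and 0.05 are borrowed from the slide's example.

```python
import numpy as np


def rank_partners(datasets, weights, query):
    """Rank non-query genes by weighted mean correlation to the query set.

    `datasets` is a list of dicts mapping gene -> expression values.
    The scoring formula is an illustrative assumption.
    """
    total = sum(weights)
    candidates = set.intersection(*(set(d) for d in datasets)) - set(query)
    scores = {}
    for gene in candidates:
        score = 0.0
        for data, w in zip(datasets, weights):
            corrs = [np.corrcoef(data[gene], data[q])[0, 1] for q in query]
            score += w * float(np.mean(corrs))
        scores[gene] = score / total
    # Best (highest) score first.
    return sorted(scores, key=scores.get, reverse=True)


query = ["YQG1", "YQG2"]
high_weight = {  # query genes co-expressed here, so this dataset dominates
    "YQG1": [1, 2, 3, 4], "YQG2": [2, 3, 5, 6],
    "geneA": [4, 3, 2, 1], "geneB": [1, 2, 3, 5], "geneC": [1, 3, 2, 4],
}
low_weight = {   # query genes unrelated here, so this dataset barely counts
    "YQG1": [1, 3, 2, 4], "YQG2": [2, 1, 4, 3],
    "geneA": [1, 3, 2, 4], "geneB": [4, 2, 3, 1], "geneC": [2, 2, 3, 1],
}

ranking = rank_partners([high_weight, low_weight], [0.82, 0.05], query)
print(ranking)
```

Because the low-weight dataset contributes little, geneA's strong correlation there cannot offset its anti-correlation in the dataset that matters for this query.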

Signal Balancing Data - SVD
  • Singular Value Decomposition (SVD)
  • Projects data into another orthonormal basis
  • Correlations in U (rather than X) often contain better biological signals
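
As a rough illustration of computing correlations in U rather than X, the sketch below decomposes a genes × conditions matrix with NumPy and correlates the rows of U. The function name, the tolerance, and the random toy matrix are assumptions made for the example; handling of missing values is not shown.

```python
import numpy as np


def signal_balanced_correlations(X, tol=1e-10):
    """Gene-gene Pearson correlations computed on U instead of X.

    X is a genes x conditions matrix with no missing values.
    """
    # Thin SVD: X = U @ diag(s) @ Vt, with orthonormal columns in U.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # Drop directions whose singular value is numerically zero.
    keep = s > tol * s[0]
    # Rows of U are genes; every retained pattern has unit norm, so a
    # dominant pattern no longer outweighs a subtle one.
    return np.corrcoef(U[:, keep]), (U, s, Vt)


rng = np.random.default_rng(0)
X = rng.normal(size=(8, 6))  # toy data: 8 genes, 6 conditions
corr_u, (U, s, Vt) = signal_balanced_correlations(X)
print(corr_u.shape)
```

In X, each pattern is scaled by its singular value, so the largest few dominate any correlation; in U that scaling is removed.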
Signal Balancing

[Figure: SVD → Signal Balancing]

  • Use correlations among left singular vectors
    • Downweights dominant patterns, amplifies subtle patterns
  • Top eigengenes dominate data
    • Sometimes correspond to systematic bias
    • Often correspond to common biological processes
      • e.g., ribosome biogenesis
  • Accuracy of signal balancing improved over re-projection
Between-dataset normalization
  • The commonly used Pearson correlation yields very different correlation distributions from one dataset to the next
  • These differences complicate comparisons between datasets

Histograms of Pearson correlations between all pairs of genes (DeRisi et al., 97; Primig et al., 00)

  • Applying the Fisher Z-transform and then Z-scoring equalizes the distributions
  • Increases comparability between datasets

Histograms of Z-scores between all pairs of genes
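
The normalization can be sketched as follows: the Fisher transform z = arctanh(r) = ½ ln((1 + r)/(1 − r)) is applied to each correlation, and the results are standardized within the dataset. The function name, the clipping constant, and the toy correlation values are assumptions for illustration.

```python
import numpy as np


def normalize_correlations(r, eps=1e-6):
    """Fisher z-transform, then standardize to mean 0 and std 1."""
    # Clip so that r = +/-1 does not map to infinity.
    r = np.clip(np.asarray(r, dtype=float), -1 + eps, 1 - eps)
    z = np.arctanh(r)  # Fisher z-transform
    return (z - z.mean()) / z.std()


# Toy correlations (invented): one dataset skewed high, one centered on zero.
skewed = [0.2, 0.5, 0.7, 0.9, 0.95]
centered = [-0.3, -0.1, 0.0, 0.1, 0.3]

z_a = normalize_correlations(skewed)
z_b = normalize_correlations(centered)
print(z_a, z_b)
```

After this step both datasets' scores sit on a common scale, so a value from one dataset can be compared with, or averaged against, a value from another.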

SPELL Algorithm Overview

Hibbs MA, Hess DC, Myers CL, Huttenhower C, Li K, Troyanskaya OG. Exploring the functional landscape of gene expression: directed search of large microarray compendia. Bioinformatics, 2007.

Web Interface

http://spell.princeton.edu

Evaluation of Performance
  • Leave-k-in cross validation / bootstrapping
  • Results averaged across 125 diverse GO biological process terms (defined in the GRIFn system, Myers et al., 2006)
  • Many predictions also verified through experimental validations in other studies
    • Hibbs et al., Bioinf, 2007
    • Hess et al., PLoS Gen, 2009
    • Hibbs*, Myers*, Huttenhower*, et al., PLoS Comp Biol, 2009
Search Accuracy
  • Perform “leave-k-in” cross-validation

[Figure: for all pairs of genes with a common function, order the genome, then rank-average the orderings into a master list]
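
One reading of this procedure, sketched under assumptions: every k-gene subset of a known functional group is used as a query, each query orders the whole genome, and the orderings are rank-averaged into a master list. The `search` callable stands in for the real search algorithm, and `toy_search` with its affinity values is invented purely so the example runs.

```python
from itertools import combinations


def leave_k_in(term_genes, genome, search, k=2):
    """Rank-average the genome orderings from all k-gene queries.

    `search(query)` must return the full genome ordered best first.
    """
    rank_sums = {g: 0 for g in genome}
    n_queries = 0
    for query in combinations(sorted(term_genes), k):
        ordering = search(list(query))
        for rank, gene in enumerate(ordering, start=1):
            rank_sums[gene] += rank
        n_queries += 1
    # Master list: lowest average rank first.
    return sorted(genome, key=lambda g: rank_sums[g] / n_queries)


genome = ["a", "b", "c", "d", "e"]
affinity = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.2, "e": 0.1}  # invented


def toy_search(query):
    """Hypothetical stand-in: query genes first, the rest by fixed affinity."""
    rest = [g for g in genome if g not in query]
    return query + sorted(rest, key=affinity.get, reverse=True)


master = leave_k_in({"a", "b", "c"}, genome, toy_search, k=2)
print(master)
```

Genes that share the function but were left out of a given query should still land near the top of the master list if the search works.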

Search Accuracy
  • Precision-Recall Curve

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)

[Figure: precision (0 to 1) plotted against recall (0 to 1) along the master list]
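
A small sketch of how such a curve can be traced from a ranked list: walk down the list and record recall and precision at each cutoff. The function name and the toy gene labels are invented, and the code assumes every true positive appears somewhere in the ranked list.

```python
def precision_recall_curve(ranked_genes, positives):
    """Return (recall, precision) at each cutoff of a ranked gene list."""
    true_positives = 0
    curve = []
    for cutoff, gene in enumerate(ranked_genes, start=1):
        if gene in positives:
            true_positives += 1
        precision = true_positives / cutoff        # TP / (TP + FP)
        recall = true_positives / len(positives)   # TP / (TP + FN)
        curve.append((recall, precision))
    return curve


ranked = ["g1", "g2", "g3", "g4"]  # best score first
known = {"g1", "g3"}               # genes annotated to the function
curve = precision_recall_curve(ranked, known)
print(curve)
```

A good ranking keeps precision high until recall approaches 1, so the curve stays near the top of the plot.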

Sample & Query Size Effects

Even relatively small sample sizes produce similar results

(1000 samples used for all other tests)

Significant performance gain between 2 and 3 query genes, little change beyond

(5 query genes used for all other tests)

Effect of Signal Balancing

Improvement is robust to missing value imputation method

Signal balancing further improves context-specific search performance

Effects of Signal Balancing

[Figure legend: signal balanced; n% re-projection; n% balanced]

Function Prediction Evaluation

Huttenhower C*, Hibbs MA*, Myers CL* et al. The impact of incomplete knowledge on gene function prediction. Bioinformatics, 2009.

  • Cross-validation based on known biology
    • Most often used method in literature
    • Results are useful, but can be biased
  • Laboratory evaluation
    • More accurate, more difficult
    • Ultimate goal of functional genomics
    • Identify novel biology
    • Publish biological corpus
Biological Benefits of Computational Direction
  • Effective candidate prioritization
    • 6 months of work vs. 8 years for a whole-genome screen
  • “Unbiased” (actually, just less biased)
    • Both uncharacterized genes and genes with known function predicted and verified
      • 40 of 75 (53%) for genes with known function
      • 60 of 118 (51%) for uncharacterized genes
    • Testing only mitochondria-localized proteins would miss 43% of our discoveries
      • 59% accuracy among mitochondria-localized proteins
      • 44% accuracy among non-mitochondria-localized proteins
Computational Expectations

[Figure: Original Gold Standard vs. Experimental Results]

Computational Reality

[Figure: Original Gold Standard vs. Experimental Results]

Computational Lessons
  • Underlying data and choice of algorithm are both important
    • Data affects which biological areas can be studied
    • Algorithm affects biological context, nature of results
    • Possible for many combinations to be accurate
  • Utilizing an ensemble of methods broadens scope and reliability
    • Iteration in an ensemble can lead to converging predictions
  • Evaluating the results of computational prediction methods is not as simple as recapitulating GO
Conclusions

Microarray search system (& Bayesian data integration) produce good predictions of gene function

Experimental verification of predictions is important

109 novel gene functions discovered

Subtle phenotypes important to consider

Big challenge: Make this work in mammals

Acknowledgements
  • Hibbs Lab
    • Karen Dowell
    • Tongjun Gu
    • Al Simons
  • Olga Troyanskaya Lab
    • Patrick Bradley
    • Maria Chikina
    • Yuanfang Guan
  • Chad Myers
  • David Hess
  • Florian Markowetz
  • Edo Airoldi
  • Curtis Huttenhower
  • Kai Li Lab
    • Grant Wallace
  • Amy Caudy
  • Maitreya Dunham
  • Botstein, Kruglyak, Broach, Rose labs
  • Kyuson Yun
  • Carol Bult