Machine Learning for Functional Genomics I

Matt Hibbs

http://cbfg.jax.org


Central Dogma

[Figure: DNA, gene expression, proteins, phenotypes]


Functional Genomics

Identify the roles played by genes/proteins

Sealfon et al., 2006.


Gene Expression Microarrays

Simultaneous measurements of mRNA abundance levels for every gene in a genome

[Figure: expression data matrix; axes: genes, conditions]


Gene Expression Microarrays

Simultaneous measurements of mRNA abundance levels for every gene in a genome – in thousands of conditions

These data hold rich functional information, but how can we make use of the entire compendium?


Biological Data Explosion

Huge repositories of biological data…

[Plot: publicly available microarrays in GEO; # of measurements by year]

[Plot: mouse genes with known process association; # of genes by year]

…are not directly translating into knowledge


Why is there a Data-Knowledge Gap?

  • Many datasets are analyzed only once

    • Initial publication looks for hypothesis

    • Need standards for naming, formats, collection

  • Data should be aggregated and integrated

    • Modestly significant clues seen repeatedly can become convincing

    • “a preponderance of circumstantial evidence”

  • Scale of this problem overwhelms traditional biology


Scalable Artificial Intelligence

Computer science is really a study in scalability

Use machine learning and data mining techniques to quickly identify important patterns


Amazon Recommendations

  • Compare your purchase history to all other customers

  • Find commonalities between profiles

  • Predict potential purchases

[Diagram: purchase history and item rankings; machine learning (Bayesian networks); recommendations; observe browsing patterns and account activity]


Gene Function Prediction

[Diagram: the recommendation pipeline shown alongside gene function prediction. Purchase history and item rankings are paired with genome-scale data and MGI annotations; observing browsing patterns and account activity is paired with laboratory experiments; recommendations are paired with predictions. Both pipelines use machine learning (Bayesian networks).]


Challenges for AI from Biology

Input data is noisy, heterogeneous, constantly evolving

Current knowledge is incomplete and biased

Can be difficult to determine accuracy


Promise of Computational Functional Genomics

[Diagram: data & existing knowledge; computational approaches; predictions; laboratory experiments]


Reality of Computational Functional Genomics

[Diagram: data & existing knowledge; computational approaches; predictions; laboratory experiments]


Computational Solutions

  • Machine learning & data mining

    • Use existing data to make new predictions

      • Similarity search algorithms

      • Bayesian networks

      • Support vector machines

      • etc.

    • Validate predictions with follow-up lab work

  • Visualization & exploratory analysis

    • Seeing and interacting with data important

    • Show data so that questions can be answered

      • Scalability, incorporate statistics, etc.


Similarity Search Approach

[Diagram: data collection and query genes; search algorithm (SPELL); relevant datasets and related genes]

  • Re-frame analysis as exploratory search


Key Insights

X = U Σ V^T

  • Context-Sensitive Search Process

  • Signal Balancing

  • Correlation Comparability


Dataset Relevance Weighting

Query Genes:

Q = {YQG1, YQG2, YQG3}

Calculate a correlation measure among the query genes for each dataset

-- This is each dataset's weight

[Figure: expression of YQG1, YQG2, and YQG3 across datasets, with example dataset weights 0.15, 0.82, 0.05, 0.55]
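
The weighting step can be sketched in a few lines of Python. This is a minimal illustration, not the SPELL implementation: it assumes the "correlation measure" is the mean pairwise Pearson correlation among the query genes, clipped at zero, and the helper names (`pearson`, `dataset_weight`) and toy expression values are made up for the example.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)  # undefined for constant profiles


def dataset_weight(dataset, query):
    """Mean pairwise correlation among the query genes in one dataset.

    Clipping negative means to 0 is an assumption of this sketch.
    """
    pairs = list(combinations(query, 2))
    mean_r = sum(pearson(dataset[a], dataset[b]) for a, b in pairs) / len(pairs)
    return max(mean_r, 0.0)


query = ["YQG1", "YQG2", "YQG3"]
# Toy profiles (made-up values): gene -> expression across conditions.
coherent = {"YQG1": [1, 2, 3, 4], "YQG2": [2, 4, 6, 8], "YQG3": [1, 3, 5, 7]}
incoherent = {"YQG1": [1, 2, 3, 4], "YQG2": [4, 3, 2, 1], "YQG3": [1, 3, 5, 7]}

weights = [dataset_weight(d, query) for d in (coherent, incoherent)]
print(weights)
```

A dataset in which the query genes move together receives a weight near 1, while one in which they disagree receives a weight of 0 and contributes nothing to the search.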


Identify Novel Partners

Query Genes:

Q = {YQG1, YQG2, YQG3}

Calculate weighted distance score for all other genes to the query set

[Figure: expression of candidate genes geneA, geneB, and geneC alongside the query genes across datasets weighted 0.15, 0.82, 0.05, 0.55]


Identify Novel Partners

Query Genes:

Q = {YQG1, YQG2, YQG3}

Calculate weighted distance score for all other genes to the query set

[Figure: candidate genes ordered from best score to worst score: geneB, geneC, geneA; dataset weights 0.15, 0.82, 0.05, 0.55]

+ Takes advantage of functional diversity

+ Addresses statistical concerns

+ Fast running times [O(GDQ^2)] (ms per query)

+ Top results are candidates for investigation

+ Search process is iterative to refine results
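
The ranking step might look roughly like the sketch below. It is an illustration under stated assumptions, not the published algorithm: the score used here is a weighted mean Pearson correlation (a similarity, so higher is better) rather than whatever distance SPELL actually computes, every gene is assumed to be present in every dataset, and the function names, toy profiles, and the two weights (borrowed from the figure) are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def rank_partners(datasets, weights, query):
    """Rank non-query genes by weighted mean correlation to the query set."""
    candidates = set(datasets[0]) - set(query)
    total_weight = sum(weights)
    scores = {}
    for gene in candidates:
        weighted_sum = 0.0
        for data, weight in zip(datasets, weights):
            # Average correlation between this gene and each query gene.
            mean_r = sum(pearson(data[gene], data[q]) for q in query) / len(query)
            weighted_sum += weight * mean_r
        scores[gene] = weighted_sum / total_weight
    # Best score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


query = ["YQG1", "YQG2"]
# Toy data: the query genes agree in the first dataset and not in the second.
relevant = {"YQG1": [1, 2, 3, 4], "YQG2": [2, 4, 6, 8],
            "geneA": [4, 3, 2, 1], "geneB": [1, 2, 3, 5], "geneC": [1, 2, 2, 1]}
irrelevant = {"YQG1": [1, 2, 3], "YQG2": [1, 2, 1],
              "geneA": [1, 2, 3], "geneB": [3, 2, 1], "geneC": [2, 1, 2]}

ranking = rank_partners([relevant, irrelevant], [0.82, 0.05], query)
for gene, score in ranking:
    print(f"{gene}: {score:.3f}")
```

Because the second dataset carries almost no weight, geneA's agreement with YQG1 there does not rescue its strong anti-correlation in the relevant dataset, and the toy ranking comes out as geneB, geneC, geneA.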


Signal Balancing Data - SVD

  • Singular Value Decomposition (SVD)

  • Projects data into another orthonormal basis

  • Correlations in U (rather than X) often contain better biological signals


Signal Balancing

[Figure: SVD and signal balancing]


Signal Balancing

  • Use correlations among left singular vectors

    • Downweights dominant patterns, amplifies subtle patterns

  • Top eigengenes dominate data

    • Sometimes correspond to systematic bias

    • Often correspond to common biological processes

      • e.g., ribosome biogenesis

  • Accuracy of signal balancing improved over re-projection
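
A short derivation may help show why correlating rows of U de-emphasizes dominant patterns. It assumes a thin (or truncated) SVD with singular values σ_k, and it uses plain inner products rather than centered Pearson correlations, so it is a sketch of the intuition rather than the exact computation. The symbols x_i and u_i (row i of X and of U) are notation introduced here; k runs over the retained singular vectors.

```latex
\begin{aligned}
X &= U \Sigma V^{\mathsf{T}}, \qquad V^{\mathsf{T}} V = I \\
\langle x_i, x_j \rangle &= \left( U \Sigma^{2} U^{\mathsf{T}} \right)_{ij}
  = \sum_{k} \sigma_k^{2}\, u_{ik} u_{jk} \\
\langle u_i, u_j \rangle &= \sum_{k} u_{ik} u_{jk}
\end{aligned}
```

In X, each pattern k contributes in proportion to σ_k squared, so the top eigengenes dominate any gene-gene comparison. In U, every retained pattern contributes with equal weight, which is the sense in which dominant patterns are downweighted and subtle ones amplified.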


Between-Dataset Normalization

  • The commonly used Pearson correlation yields very different distributions of correlation values from one dataset to another

  • These differences complicate comparisons

[Figure: histograms of Pearson correlations between all pairs of genes, for DeRisi et al., 1997 and Primig et al., 2000]


Between-Dataset Normalization

  • Fisher Z-transform followed by Z-scoring equalizes the distributions

  • Increases comparability between datasets

Histograms of Z-scores between all pairs of genes
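
As a rough illustration of this normalization, the sketch below applies the Fisher transform, z = atanh(r), to each correlation in a dataset and then standardizes the transformed values to zero mean and unit standard deviation within that dataset. The function name, the clamping constant, and the toy correlation lists are assumptions made for the example.

```python
from math import atanh
from statistics import mean, pstdev


def normalize_correlations(correlations, limit=0.999999):
    """Fisher-transform correlations, then z-score them within one dataset."""
    # atanh is undefined at exactly +/-1, so clamp just inside the interval.
    fisher = [atanh(max(-limit, min(limit, r))) for r in correlations]
    mu, sigma = mean(fisher), pstdev(fisher)
    return [(z - mu) / sigma for z in fisher]


# Toy correlation lists: one dataset with a narrow spread, one with a wide spread.
narrow = [0.05, 0.10, 0.00, -0.05, 0.15]
wide = [0.9, 0.2, -0.6, 0.5, -0.1]

narrow_z = normalize_correlations(narrow)
wide_z = normalize_correlations(wide)
print([round(z, 2) for z in narrow_z])
print([round(z, 2) for z in wide_z])
```

After the transform both lists have mean 0 and standard deviation 1, so a score of 2 means "two standard deviations above this dataset's typical correlation" in either dataset. The transform is monotonic, so the ordering of gene pairs within a dataset is unchanged.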


SPELL Algorithm Overview

Hibbs MA, Hess DC, Myers CL, Huttenhower C, Li K, Troyanskaya OG. Exploring the functional landscape of gene expression: directed search of large microarray compendia. Bioinformatics, 2007.


Web Interface

http://spell.princeton.edu


Evaluation of Performance

  • Leave-k-in cross validation / bootstrapping

  • Results averaged across 125 diverse GO biological process terms (defined in the GRIFn system, Myers et al., 2006)

  • Many predictions also verified through experimental validations in other studies

    • Hibbs et al., Bioinf, 2007

    • Hess et al., PLoS Gen, 2009

    • Hibbs*, Myers*, Huttenhower*, et al., PLoS Comp Biol, 2009


Search Accuracy

  • Perform “leave-k-in” cross-validation

[Diagram: for all pairs of genes with common function, order the genome; rank-average the results into a master list]
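
The slide does not define "leave-k-in" in detail. One plausible reading, assumed here, is that k genes from a functionally related set are kept as the query and the remaining genes are held out, to be recovered by the search. The sketch below only enumerates those splits; the function name and gene labels are hypothetical.

```python
from itertools import combinations


def leave_k_in_splits(gene_set, k=2):
    """Yield (query, held_out) pairs: k genes stay in, the rest are hidden."""
    genes = sorted(gene_set)
    for query in combinations(genes, k):
        held_out = [g for g in genes if g not in query]
        yield list(query), held_out


# Toy set of genes sharing a function.
splits = list(leave_k_in_splits({"g1", "g2", "g3", "g4"}, k=2))
for query, held_out in splits:
    print(query, "->", held_out)
```

With k = 2 this enumerates all pairs, matching the "for all pairs" label in the diagram. A search would be run for each query, and the held-out genes' positions in the resulting rankings would then be combined.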


Search Accuracy

  • Precision-Recall Curve

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

[Plot: precision (0 to 1) versus recall (0 to 1), computed along the master list]
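
The two formulas can be evaluated at every cutoff of a ranked list to trace out the curve. The following is a minimal sketch with made-up gene names; it assumes the set of true positives is non-empty.

```python
def precision_recall_curve(ranked_genes, positives):
    """Return (recall, precision) at each cutoff of a ranked gene list."""
    positives = set(positives)
    true_pos = 0
    curve = []
    for cutoff, gene in enumerate(ranked_genes, start=1):
        if gene in positives:
            true_pos += 1
        false_pos = cutoff - true_pos          # predicted, but not positive
        false_neg = len(positives) - true_pos  # positive, but not yet predicted
        recall = true_pos / (true_pos + false_neg)
        precision = true_pos / (true_pos + false_pos)
        curve.append((recall, precision))
    return curve


# Toy master list: g1 and g3 are the genes known to share the function.
curve = precision_recall_curve(["g1", "g2", "g3", "g4"], {"g1", "g3"})
for recall, precision in curve:
    print(f"recall={recall:.2f} precision={precision:.2f}")
```

A ranking that places all true positives first keeps precision at 1 until recall reaches 1; each false positive that appears early pulls the curve down.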


Accuracy of Context-Sensitive Search


Sample & Query Size Effects

Even relatively small sample sizes produce similar results

(1000 samples used for all other tests)

Significant performance gain between 2 and 3 query genes, little change beyond

(5 query genes used for all other tests)


Effect of Signal Balancing

Improvement is robust to missing value imputation method

Signal balancing further improves context-specific search performance


Effects of Signal Balancing

[Figure legend: signal balanced; n% re-projection; n% balanced]


Specific Performance


Function Prediction Evaluation

Huttenhower C*, Hibbs MA*, Myers CL* et al. The impact of incomplete knowledge on gene function prediction. Bioinformatics, 2009.

  • Cross-validation based on known biology

    • Most often used method in literature

    • Results are useful, but can be biased

  • Laboratory evaluation

    • More accurate, more difficult

    • Ultimate goal of functional genomics

    • Identify novel biology

    • Publish biological corpus


Petite Frequency Assay


Petite Frequency Phenotypes for Predictions


Overall Result Summary


Double mutant petite freq.


Mitochondrial Motility


Respiratory Growth Rate


Biological Benefits of Computational Direction

  • Effective candidate prioritization

    • 6 months of work vs. 8 years for whole genome screen

  • “Unbiased” (actually, just less biased)

    • Both uncharacterized genes and genes with known function predicted and verified

      • 40 of 75 (53%) for genes with known function

      • 60 of 118 (51%) for uncharacterized genes

    • Testing only mitochondrial localized proteins would miss 43% of our discoveries

      • 59% accuracy among mitochondria localized

      • 44% accuracy among non-mitochondria localized
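
The two verification rates quoted above can be checked with a few lines of arithmetic. The combined rate is derived here from the two stated counts and does not appear on the slide; the variable names are illustrative.

```python
def percent(confirmed, tested):
    """Verification rate as a percentage."""
    return 100 * confirmed / tested


known = (40, 75)             # genes with known function: confirmed, tested
uncharacterized = (60, 118)  # uncharacterized genes: confirmed, tested

known_rate = percent(*known)
unchar_rate = percent(*uncharacterized)
# Pooled rate over both groups (derived, not stated in the slide).
combined_rate = percent(known[0] + uncharacterized[0],
                        known[1] + uncharacterized[1])

print(f"known: {known_rate:.1f}%")
print(f"uncharacterized: {unchar_rate:.1f}%")
print(f"combined: {combined_rate:.1f}%")
```

The rates for the two groups differ by only a few percentage points, which is the basis for the "less biased" claim: predictions held up about as well for uncharacterized genes as for genes with known function.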


Computational Expectations

[Figure: original gold standard; experimental results]


Complementary Computational Approaches


Computational Reality

[Figure: original gold standard; experimental results]


Method Comparison


Method Accuracy is Biologically Diverse


Underlying Data Changes Predictions


Methods Converge During Iteration


Computational Lessons

  • Underlying data and choice of algorithm are both important

    • Data affects which biological areas can be studied

    • Algorithm affects biological context, nature of results

    • Possible for many combinations to be accurate

  • Utilizing an ensemble of methods broadens scope and reliability

    • Iteration in an ensemble can lead to converging predictions

  • Evaluating the results of computational prediction methods is not as simple as recapitulating GO


Conclusions

Microarray search system (& Bayesian data integration) produce good predictions of gene function

Experimental verification of predictions is important

109 novel gene functions discovered

Subtle phenotypes important to consider

Big challenge: Make this work in mammals


Acknowledgements

  • Hibbs Lab

    • Karen Dowell

    • Tongjun Gu

    • Al Simons

  • Olga Troyanskaya Lab

    • Patrick Bradley

    • Maria Chikina

    • Yuanfang Guan

  • Chad Myers

  • David Hess

  • Florian Markowetz

  • Edo Airoldi

  • Curtis Huttenhower

  • Kai Li Lab

    • Grant Wallace

  • Amy Caudy

  • Maitreya Dunham

  • Botstein, Kruglyak, Broach, Rose labs

  • Kyuson Yun

  • Carol Bult

