Nan Song

Advisors: John Lafferty, Dannie Durand

Gene family classification using a semi-supervised learning method
Outline
  • Introduction
    • A motivating application: genome annotation
  • A graphical model of sequence relatedness
  • Gene classification using machine learning
  • Empirical evaluation
  • Conclusion
Key genomic component: genes
  • A gene is a DNA subsequence: ACCCTTAGCTAGACCTTTAGGAGG...
  • A protein is an amino acid sequence: VHLT P E...
  • Genes encode proteins, the building blocks of the cell
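
The step from gene to protein can be illustrated with a minimal sketch of codon-by-codon translation using the standard genetic code. The function name `translate` and the example DNA string are illustrative assumptions, not taken from the slides; the string was chosen so that, after the start codon, it yields the VHLTPE fragment shown on the slide.

```python
# Standard genetic code, with codons enumerated using the base order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(dna):
    """Translate a coding DNA string codon by codon, stopping at a stop codon."""
    dna = dna.upper()
    protein = []
    for pos in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[pos:pos + 3]]
        if amino_acid == "*":  # stop codon ends the protein
            break
        protein.append(amino_acid)
    return "".join(protein)


# Hypothetical example sequence (an assumption, not from the slides).
print(translate("ATGGTGCATCTGACTCCTGAGGAG"))  # MVHLTPEE
```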

Whole Genome Sequencing

413 whole genome sequences: 41 eukarya, 28 archaea, 344 bacteria

In progress: 1034 prokaryotic genomes, 629 eukaryotic genomes

www.genomesonline.org

Gene prediction and annotation
  • Known genes: 14,882
  • Predicted genes: 16,896
  • Total: 31,778

International Human Genome Consortium, Nature 2001

Gene annotation
  • We are given a new genome sequence with predicted genes.
  • A few genes are well studied.
  • Identify other genes in the same family to predict function.
  • Verify predictions experimentally

Two contexts:

    • Individual scientist
    • High throughput
Outline
  • Introduction
    • Molecular biology
    • A motivating application: genome annotation
  • A graphical model of sequence relatedness
  • Gene classification using machine learning
  • Empirical evaluation
  • Conclusion
Evolutionarily related genes have related functions

[Figure: an ancestral gene (atgccaggactcccagtga…) undergoes two duplications, yielding three descendant sequences (atgcgccgtctggcatgt…, atgcgaggtctcccatgt…, atgcaaggagtcccagagc…) and the genes β-globin (adult), γ-globin (fetal) and ε-globin (embryonic).]

Gene family classification is a powerful source of information for inferring evolutionary, functional and structural properties of genes

Outline
  • Introduction
  • A graphical model of sequence relatedness
  • Gene classification using machine learning
  • Empirical evaluation
  • Conclusion
A graphical model of sequence relatedness
  • G = (V,E)
  • V: vertices represent sequences
  • E: the weight of an edge is proportional to the similarity between the two sequences it joins.

[Figure: two vertices, xi and xj, for the sequences …atgcaaggagtcccagagcc… and …atgcgaggtctcccagtgtc…, joined by a weighted edge.]

Gene family classification
  • Biological scenario:
    • small number of known genes
    • large number of unknown genes

Goal: given known genes, identify genes in the same family.

Outline
  • Introduction
  • A graphical model of sequence relatedness
  • Gene classification using machine learning
  • Empirical evaluation
  • Conclusion
Framework: binary classification
  • Machine learning scenario:
  • small number of labeled data
    • genes known to be in family
    • genes clearly not in family
  • large number of unlabeled data

Determine which unlabeled genes belong to the family.

Several challenging problems of gene family classification

[Figure: an ancestral gene undergoes two duplications, yielding atgcgccgtctggcatgt…, atgcgaggtctcccatgt… and atgcaaggagtcccagagc…. Mutations (atgcgccccccggcatgt…) and DNA shuffling (atgcgccgtctggcatgt…ggctcgta) further change the descendant sequences.]

Traditionally, similarity is represented by sequence comparison

Several challenging problems of gene family classification

Families

  • do not form a clique
  • do not form a connected component
  • have edges to sequences outside the family.
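
The three graph properties listed above can be checked directly on a similarity graph. The following is a minimal sketch, assuming the graph is stored as a dictionary mapping each vertex to the set of its neighbours and that the family is non-empty; the function names and the toy graph are hypothetical and do not come from the slides.

```python
from collections import deque


def is_clique(adj, family):
    """True if every pair of family members is joined by an edge."""
    members = list(family)
    return all(
        b in adj.get(a, ())
        for i, a in enumerate(members)
        for b in members[i + 1:]
    )


def in_one_component(adj, family):
    """True if all family members lie in the same connected component."""
    members = set(family)
    start = next(iter(members))
    seen = {start}
    queue = deque([start])
    while queue:  # breadth-first search over the whole graph
        v = queue.popleft()
        for u in adj.get(v, ()):
            if u not in seen:
                seen.add(u)
                queue.append(u)
    return members <= seen


def outside_edges(adj, family):
    """Edges that join a family member to a sequence outside the family."""
    members = set(family)
    return [(a, b) for a in members for b in adj.get(a, ()) if b not in members]


# Toy graph (made up for illustration): a, b, c form a family; x is an outsider.
adj = {"a": {"b", "c"}, "b": {"a", "x"}, "c": {"a"}, "x": {"b"}, "z": set()}
family = {"a", "b", "c"}
print(is_clique(adj, family), in_one_component(adj, family), outside_edges(adj, family))
```

In this sketch connectivity is tested in the whole graph, so two family members count as connected even when the path between them passes through nonfamily sequences.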
Outline
  • Introduction
  • A graphical model of sequence relatedness
  • Gene classification using machine learning
    • Semi-supervised learning algorithm
    • Supervised learning algorithm
  • Empirical evaluation
  • Conclusion
Gene family classification
  • Machine learning scenario:
    • large number of unlabeled data
    • small number of labeled data

Goal: binary classification

  • Semi-supervised learning:
    • Exploits information from both labeled and unlabeled data
    • Has performed well in many applications
Graphical semi-supervised learning (Binary classification)
  • Notation:
    • G = (V,E)
    • V: the whole data set
    • L: labeled data set
    • U: unlabeled data set
    • Each vertex: (xi,yi) if labeled, or (xk, f(k)) if unlabeled
    • Sij: weight of the edge between xi and xj
  • Input:
    • family members: (xi,yi = 1)
    • nonfamily members: (xj, yj = 0)
  • Output:
    • Assign a real value to every vertex in the graph, chosen to minimize the energy E(f)
    • Find a cutoff to separate the two classes

Xiaojin Zhu, Zoubin Ghahramani, John Lafferty. The Twentieth International Conference on Machine Learning (ICML-2003)
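
The formula for E(f) did not survive in this transcript. In the cited Zhu, Ghahramani and Lafferty paper the energy is quadratic, E(f) = 1/2 * sum over i,j of w_ij (f(i) - f(j))^2, minimized with the labeled vertices held at their labels; at the minimum, each unlabeled value equals the weighted average of its neighbours' values. The sketch below assumes the slides follow that formulation and reaches the minimum by repeated averaging instead of a matrix solve. The function name, the edge-dictionary input format and the toy graph are hypothetical.

```python
def harmonic_scores(weights, labels, n_iter=1000, tol=1e-10):
    """Iteratively compute harmonic values on a weighted undirected graph.

    weights: {(i, j): w_ij} for each undirected edge (assumed input format).
    labels:  {vertex: 0 or 1} for the labeled vertices.
    """
    # Build a symmetric neighbour map from the edge list.
    nbrs = {}
    for (i, j), w in weights.items():
        nbrs.setdefault(i, {})[j] = w
        nbrs.setdefault(j, {})[i] = w
    # Labeled vertices are clamped to their labels; unlabeled ones start at 0.5.
    f = {v: float(labels.get(v, 0.5)) for v in nbrs}
    unlabeled = [v for v in nbrs if v not in labels]
    for _ in range(n_iter):
        biggest_change = 0.0
        for k in unlabeled:
            total = sum(nbrs[k].values())
            if total == 0:
                continue
            # Replace f(k) by the weighted average of its neighbours.
            new = sum(w * f[j] for j, w in nbrs[k].items()) / total
            biggest_change = max(biggest_change, abs(new - f[k]))
            f[k] = new
        if biggest_change < tol:
            break
    return f


# Toy chain a - b - c - d with unit weights (made up for illustration):
# a is a labeled family member, d a labeled nonfamily member.
weights = {("a", "b"): 1.0, ("b", "c"): 1.0, ("c", "d"): 1.0}
scores = harmonic_scores(weights, {"a": 1, "d": 0})
print(scores)
```

On the chain, the unlabeled values interpolate between the two labels, so a cutoff between them separates the vertices nearer the family member from those nearer the nonfamily member.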

Graph-based semi-supervised learning

[Figure: values f(xk) on example data, with the captions "Works well" and "Works well ?".]

http://www.cs.wisc.edu/~jerryzhu/research/ssl/animation.html

Outline
  • Introduction
  • A graphical model of sequence relatedness
  • Gene classification using machine learning
    • Semi-supervised learning
    • Supervised learning
  • Empirical evaluation
  • Conclusion
Semi-supervised vs kernel-based supervised learning
  • Semi-supervised learning:
  • Supervised learning:

where L is the labeled data set and U is the unlabeled data set
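
The two formulas on this slide are not preserved in the transcript. One plausible reading, stated here purely as an assumption, is that the semi-supervised score of an unlabeled vertex averages over all of its neighbours in L and U, while a kernel-based supervised score averages over its labeled neighbours only. The sketch below shows that assumed supervised variant; the function name and input format are hypothetical.

```python
def supervised_scores(weights, labels):
    """Score each unlabeled vertex from its labeled neighbours only.

    Assumed form (not confirmed by the slides):
        f(k) = sum_{i in L} w_ki * y_i / sum_{i in L} w_ki
    weights: {(i, j): w_ij} for each undirected edge.
    labels:  {vertex: 0 or 1} for the labeled vertices.
    """
    nbrs = {}
    for (i, j), w in weights.items():
        nbrs.setdefault(i, {})[j] = w
        nbrs.setdefault(j, {})[i] = w
    scores = {}
    for k, edges in nbrs.items():
        if k in labels:
            continue
        labeled = [(w, labels[j]) for j, w in edges.items() if j in labels]
        total = sum(w for w, _ in labeled)
        # Without any labeled neighbour there is no evidence; None marks that case.
        scores[k] = sum(w * y for w, y in labeled) / total if total > 0 else None
    return scores


# Toy graph (made up): e touches no labeled vertex, so it gets no score.
weights = {("a", "b"): 1.0, ("b", "d"): 1.0, ("c", "d"): 3.0, ("c", "e"): 1.0}
print(supervised_scores(weights, {"a": 1, "d": 0}))
```

Under this reading, the unlabeled data matter only in the semi-supervised case, where evidence can travel through chains of unlabeled vertices.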

Outline
  • Introduction
  • A graphical model of sequence relatedness
  • Gene classification using machine learning
  • Empirical evaluation
    • Methodology
    • Results
  • Conclusion
Graph construction
  • G = (V,E)
  • V: all mouse sequences from SwissProt (n = 7439)
  • E: based on a newly designed sequence similarity measure, S
  • 0 < S(i, j) < 1
Methodology
  • Graph construction
  • Test set construction
  • Experiments performed
  • Basis for evaluation
Test set construction

18 well studied protein families:

  • ACSL, ADAM, DVL, FGF, FOX, GATA, Kinase, Kinesin, Laminin, Myosin, Notch, PDE, SEMA, T-box, TNFR, TRAF, USP, WNT
  • Receptors, enzymes, transcription factors, motor proteins, structural proteins, and extracellular matrix proteins.
Test set construction
  • Retrieved all complete mouse sequences from the SwissProt database (7,439)
  • Identified sequences for each test family based on
    • Nomenclature committee reports
    • Structural properties
    • Literature surveys
Methodology
  • Graph construction
  • Test set construction
  • Experiments performed
  • Basis for evaluation
Experiments performed
  • Compare the semi-supervised with the supervised learning algorithm
  • Tested parameters:
    • Scaling parameter, σ, in the kernel function
    • Number of Labeled Family members (LF)
    • Number of Labeled Nonfamily members (LN)
Tested parameters

[Figure: the parameter space, with axes σ, number of labeled family members, and number of labeled nonfamily members.]

For each set of parameters, 20 tests were performed

Tested parameters (1)

[Figure: edge weight W plotted against raw similarity score S, with one curve for each of σ = 0.1, 0.2, 0.5, 1, 10, 100.]

Tested σ values: 0.05, 0.1, 0.5, 1, 2, 10, 100

Tested parameters (2)
  • Labeled Family members (LF): 10-70% of family size
  • Labeled Nonfamily members (LN): 100, 500, 1000 (about 1 - 10% of nonfamily size)

Database size: 7439
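
Drawing the labeled sets for one test can be sketched as follows. This is only an illustration under stated assumptions: the slides give the ranges for LF and LN and say that 20 tests were run per parameter set, but not how the labeled sequences were chosen, so uniform random sampling is assumed here. The function name and the placeholder sequence identifiers are hypothetical.

```python
import random


def sample_labeled(family, nonfamily, lf_fraction, ln, rng):
    """Pick labeled family and nonfamily members uniformly at random.

    lf_fraction: fraction of the family to label (0.1 to 0.7 on the slides).
    ln:          number of nonfamily members to label (100, 500 or 1000).
    """
    n_lf = max(1, round(lf_fraction * len(family)))
    labeled_family = rng.sample(sorted(family), n_lf)
    labeled_nonfamily = rng.sample(sorted(nonfamily), ln)
    labels = {v: 1 for v in labeled_family}
    labels.update({v: 0 for v in labeled_nonfamily})
    return labels


# Placeholder identifiers: a family of 30 in a database of 7439 sequences.
family = [f"fam{i}" for i in range(30)]
nonfamily = [f"non{i}" for i in range(7439 - 30)]

rng = random.Random(0)
# 20 independent draws for one parameter setting, as on the slides.
trials = [sample_labeled(family, nonfamily, 0.1, 100, rng) for _ in range(20)]
print(len(trials), sum(trials[0].values()), len(trials[0]))
```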

Methodology
  • Graph construction
  • Test set construction
  • Experiments performed
  • Basis for evaluation
Semi-supervised learning

Goal:

f(i) > f(j) when xi is a family member and xj is not.

Evaluation criteria:

  • Visualization
  • AUC score
  • False negatives
Visualization

Sort all unlabeled data by f(x)

[Figure: rank plot of f(x) against rank, with family members and nonfamily members marked.]

AUC (Area Under ROC Curve)

[Figure: the rank plot alongside the ROC curve, sensitivity against 1 - specificity.]

AUC scores do not reflect all information we need
  • False negatives after the first false positive
  • The number of missed data after the first false positive
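
Both evaluation measures can be computed from the scores of the unlabeled sequences. The sketch below is a minimal illustration: AUC is computed as the fraction of (family, nonfamily) pairs in which the family member scores higher, with ties counted as one half, and the second function counts family members ranked below the top-ranked nonfamily member. The function names and the example scores are made up for illustration.

```python
def auc_score(scores, is_family):
    """Fraction of (family, nonfamily) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, is_family) if y]
    neg = [s for s, y in zip(scores, is_family) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def missed_after_first_fp(scores, is_family):
    """Number of family members ranked below the first false positive."""
    ranked = sorted(zip(scores, is_family), key=lambda pair: -pair[0])
    seen_false_positive = False
    missed = 0
    for _, family_member in ranked:
        if not family_member:
            seen_false_positive = True
        elif seen_false_positive:
            missed += 1
    return missed


# Made-up scores: one family member (0.6) falls below a nonfamily member (0.7).
scores = [0.9, 0.8, 0.7, 0.6, 0.2]
is_family = [True, True, False, True, False]
print(auc_score(scores, is_family), missed_after_first_fp(scores, is_family))
```

In this toy case the AUC is 5/6 while one family member is missed, which illustrates why the two measures can tell different stories.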
Outline
  • Introduction
  • A graphical model of sequence relatedness
  • Gene classification using machine learning
  • Empirical evaluation
    • Methodology
    • Results
  • Conclusion
Several challenging problems of gene family classification

Families

  • do not form a clique
  • do not form a connected component
  • have edges to sequences outside the family.

Edges to sequences outside the family are mainly a problem if they have strong edge weights

Test families have different graph properties

W: Edges to sequences outside the family have weak edge weights

S: Edges to sequences outside the family have strong edge weights

Results
  • Compare the semi-supervised with the supervised learning algorithm
  • Tested parameters:
    • Scaling parameter, σ, in the kernel function
    • Number of Labeled Family members (LF)
    • Number of Labeled Nonfamily members (LN)
Tested parameters

[Figure: the parameter space (σ, number of labeled family members, number of labeled nonfamily members), shown with a plot of average AUC against σ = 0.1, 0.2, 0.5, 1, 10 for Notch, LF = 1, LN = 1000.]

The effect of σ

[Figure: edge weight W plotted against raw similarity score (s), with one curve for each of σ = 0.1, 0.2, 0.5, 1, 10, 100.]

Edges to sequences outside the family are mainly a problem if they have strong edge weights

[Figure: number of edges plotted against raw edge weight, in two panels: FOX and Notch.]

Case study: Rank plots for semi-supervised learning in FOX

LF = 3, LN = 100, family size: 30

[Figure: rank plots for σ = 0.1, 1, 10 and 100.]

Case study: rank plots for semi-supervised learning in Notch

  • labeled family seqs: 1 (out of 4)
  • labeled nonfamily seqs: 100 (out of 7435)

[Figure: four rank plots, with panel labels σ = 10, 0.1, 1 and 10.]

[Figure: average AUC against σ = 0.1, 0.2, 0.5, 1, 10, in two panels: Notch, LF = 1, LN = 1000; FOX, LF = 3, LN = 1000.]

Summary on σ
  • For most families, the performance is not very sensitive to σ
  • For almost all families that form a clique, there is at least one value of σ (usually many) such that both the semi-supervised and the supervised learning algorithms achieve perfect classification performance
Results
  • Compare the semi-supervised with the supervised learning algorithm
  • Tested parameters:
    • Scaling parameter, σ, in the kernel function
    • Number of Labeled Family members (LF)
    • Number of Labeled Nonfamily members (LN)

The connection among sequences in ADAM family

[Figure: number of connected ADAM sequences, with axis values 9, 24, 25, 26.]

Tested parameters

[Figure: the parameter space (σ, number of labeled family members, number of labeled nonfamily members) is reduced to the number of labeled family members and the number of labeled nonfamily members by taking the maximum over σ, that is, the σ that achieves the best average AUC score.]

The impact of number of labeled family and nonfamily members on the performance

[Figure: ADAM. AUC against number of labeled family seqs, LF = 3, 5, 7, 9, 15, with curves for Supervised, LN = 100; Semi-supervised, LN = 100; Supervised, LN = 1000; Semi-supervised, LN = 1000.]

Performed paired t-test to detect the difference between semi-supervised and supervised method for a set of parameters
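
The paired t-test used to compare the two methods for one set of parameters can be sketched with the standard library alone. The function name and the AUC values below are made up for illustration; the slides do not give the per-test scores. The statistic is the mean of the paired differences divided by its standard error, with n - 1 degrees of freedom.

```python
import math
import statistics


def paired_t(a, b):
    """Return the paired t statistic and degrees of freedom for samples a and b."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    t = mean_diff / (sd_diff / math.sqrt(n))
    return t, n - 1


# Hypothetical AUC scores from the same four tests under each method.
semi_supervised = [0.90, 0.80, 0.95, 0.85]
supervised = [0.80, 0.75, 0.90, 0.70]
t, dof = paired_t(semi_supervised, supervised)
print(t, dof)
```

The test is paired because both methods are run on the same labeled sets in each test, so the comparison is made on per-test differences. The resulting statistic would then be compared with a t distribution with `dof` degrees of freedom.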

Graph structure of ADAM
  • Troublemaker: ADAMTS10 matches only 8 out of 26 sequences in the ADAM family.
  • ADAMTS10 is often misclassified.
  • ADAMTS10 is implicated in a genetic disease that causes impaired vision and heart defects.
[Figure: two panels, Semi-supervised method and Supervised method.]

Several challenging problems of gene family classification
  • Sequences in the same family
    • do not form a clique
    • do not exist in the same connected component
  • Sequences in different families
    • have significant matches

The connection among sequences in TNFR family

[Figure: number of connected TNFR sequences, with axis values 2, 4, 6 and 10 to 20.]

20 TNFR in this connected component

The impact of number of labeled family and nonfamily members on the performance

[Figure: TNFR (family size 24). AUC against number of labeled family seqs, LF = 2, 4, 8, 12, 18, with curves for Supervised, LN = 100; Supervised, LN = 1000; Semi-supervised, LN = 100; Semi-supervised, LN = 1000.]

Summary for Number of labeled family members
  • The performance of both semi-supervised and supervised learning improves as LF increases for all families.
  • In non-clique families, semi-supervised learning performs better than supervised when LF is small.
Rank plots for semi-supervised learning in TNFR

LF = 2, LN = 100

[Figure: rank plot for σ = 0.1.]

AUC values do not reflect all information that we need

The impact of number of labeled family and nonfamily members on the performance

[Figure: TNFR (family size 24). Number of missed TNFR against number of labeled family seqs, LF = 2, 4, 8, 12, 18, with curves for Supervised, LN = 100; Supervised, LN = 1000; Semi-supervised, LN = 100; Semi-supervised, LN = 1000.]

Summary for Number of labeled non-family members (LN)
  • The performance of supervised learning improves as LN increases for all families.
  • For semi-supervised learning, increasing LN is sometimes helpful and sometimes not.
Summary of results

[Table: test families grouped by two graph properties, Clique and Connected.]

Insights - 1
  • SSL is most effective for families that are not cliques but are connected.
  • In the test set, 12/18 families are cliques and 3/18 are not connected.
  • What fraction of protein families are cliques? Is the large number of cliques in the test set due to sample bias?
Insights - 2
  • Performance evaluation measures should match the needs of the user.
  • AUC scores penalize all FNs and FPs.
  • For experimental biologists, top ranked predictions are of interest
  • The number of FNs after the first false positive can reveal some information
Insights - 3
  • The semi-supervised learning algorithm provides an appealing visualization tool for identifying family members, especially when the number of known family members is small
Acknowledgements

Durand Lab

  • Robbie Sedgewick
  • Rose Hoberman
  • Ben Vernot
  • Narayanan Raghupathy
  • Aiton Goldman
  • Jacob Joseph
  • Annette McLeod
  • Maureen Stolzer
  • John Lafferty
  • Dannie Durand
  • Jerry Zhu