Using the fisher kernel method to detect remote protein homologies
Sponsored Links
This presentation is the property of its rightful owner.
1 / 24

Using the Fisher kernel method to detect remote protein homologies PowerPoint PPT Presentation


  • 50 Views
  • Uploaded on
  • Presentation posted in: General

Using the Fisher kernel method to detect remote protein homologies. Tommi Jaakkola, Mark Diekhams, David Haussler ISMB’ 99 Talk by O, Jangmin (2001/01/16). Abstract. Detecting remote protein homologies Fisher kernel method Variant of Support Vector Machines using new kernel function

Download Presentation

Using the Fisher kernel method to detect remote protein homologies

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript


Using the Fisher kernel method to detect remote protein homologies

Tommi Jaakkola, Mark Diekhams, David Haussler

ISMB’ 99

Talk by O, Jangmin (2001/01/16)


Abstract

  • Detecting remote protein homologies

  • Fisher kernel method

  • Variant of Support Vector Machines using new kernel function

    • Derived from Hidden Markov Models


Introduction (1)

  • Detecting protein homologies (sequence-based algorithm)

    • BLAST, Fasta, PROBE, templates, profiles, position-specific weight matrices, HMM

  • Comparison by (Brenner 1996; Park et al. 1998)

    • SCOP classification of protein structures

    • Remote protein homologies existing between protein domain in the same structural superfamily.

    • Statistical models like PSI-BLAST and HMMs are better than simple pairwise comparison methods.


Introduction (2)

  • Generative statistical models (HMMs)

    • Extracting features from protein sequences

    • Mapping all protein sequences to points in a Euclidean feature space of fixed dimension.

  • General discriminative statistical method to classify the points.

  • Improvements acquired

    • Over HMMs alone.


Methods

  • How generative models work. (HMMs)

    • Training examples ( sequences known to be members of protein family ) : positive

    • Tuning parameters with a priori knowledge

    • Model assigns a probability to any given protein sequence.

    • The sequence from that family yield a higher probability than that of outside family.

  • Log-likelihood ratio as score


Discriminative approaches

  • Using both positive and negative examples

  • Parameter is tuned so that the model can optimally discriminate members of the family from nonmembers.

  • When training examples are few

    • Likelihood ratio is optimal if generative models perfectly fit to data but…

    • Discriminative methods often performs better.


Kernel methods

  • Discriminant function L(X)

    • Where { Xi, i = 1,…,n} and hypothesis class H1, H2

    • + : the sequence of the family, - : outside of the family

  • Contribution of Kernel

    • i : overall importance of the example Xi.

    • Measure of pairwise similarity : K(Xi, X)

  • User supplies the type of kernel for the application area!!


The Fischer kernel (1)

  • Deriving kernel function from generative models

    • Advantage 1 : handle variable length protein sequences!!

    • Advantage 2 : encoding of prior knowledge about protein sequences

  • HMMs (difference)

    • Kernel function specifies a similarity score for any pair of sequences.

    • Likelihood score from an HMM only measures the closeness of the sequence to the model itself.


The Fischer kernel (2)

  • Sufficient statistics

    • Each parameter in HMM : Posterior frequencies

      • Of particular transition.

      • Of generating one of the residues of the query sequence.

    • Reflects the process of generating the query sequence from HMM.

  • Alterative of sufficient statistics : Fischer score

    • Magnitude of the components : how each contributes to generating the query sequence.


The Fischer kernel (3)

  • Kernel function used in this paper.

    note that its fixed vector.

  • Summary

    • Train HMM with positive examples.

    • Map each new protein sequence X into a fixed vector, Fisher score.

    • Calculate the kernel function

    • Get resulting discriminant function (SVM-Fisher)


The Fischer kernel (4)

  • Combination of scores

    • There might be more than one HMM model for the family or superfamily of interest.

  • Average score

  • Maximum score


Experimental Methods

  • Methods

    • SVM-Fisher (this paper)

    • BLAST (Altshul et al. 1990; Gish & States 1993)

    • HMMs using SAM-T98 methodology (Park et al. 1998; Karplus, Barrett, & Hughey 1998; Hughey & Krogh 1995l 1996)

  • Measurement of recognition rate for members of superfamilies of the SCOP protein structure classification (Hubbard et al. 1997)

    • Withholding all members of SCOP family

    • Train with the remaining members of SCOP superfamily

    • Test with withheld data

    • Question: “Could the method discover a new family of a known superfamily?”


Overview of experiments

  • Database

    • SCOP version 1.37 PDB90 : consisting of protein domains, no two of which have 90% of more residue identity

    • PDB90 eliminates redundant sequences.

  • Generative models

    • SAM-T98 HMMs

  • Data selection

    • Get 33 test families from 16 superfamilies.

  • Evaluation strategy

    • Assessing to what extent it gave better scores to the positive test examples thant it gave to the negative test examples.


SCOP: a Structural Classification of Proteins database

  • Hierachical levels

    • Family: clustered proteins by common evolutionary origin: residue identities of above 30%, lower sequence identities but very similar functions and structures

    • Superfamily: low sequence identities but probably common evolutionary origin

    • Fold: same major secondary structure in the same arrangement and with the same topological connections


Figure 1: Separation of the SCOP PDB90 database into training and test sequences, shown for the G proteins test family


Multiple models used

  • Modeling superfamily

    • SAM-T98 : starts with a single sequence (the guide sequence for the domain) and build a model

    • Too many sequences!

    • Using a subset of PDB90.

    • Train SVM-Fisher method using each of models in turn


Details on the training and test sets

  • All PDB90 sequence outside the fold of the test family were used as either negative training or negative test examples.

    • Reverse test/training allocation of negative examples, and repeat experiments.

    • Fold-by-fold basis split of negative examples.

  • For positive examples

    • PDB90 sequences in the superfamily of the test family are used.

    • Homologs found by each individual SAM-T98 model are used.


BLAST methods

  • WU-BLAST version 2.0a16 (Althcshul & Gish 1996)

    • PDB90 database was queried with each positive training examples, and E-values were recorded.

    • BLAST:SCOP-only

    • BLAST:SCOP+SAM-T98-homologs

    • Scores were combined by the maximum method


Generative HMM models

  • SAM-T98 method

    • Null model: reverse sequence model

    • Same data and same set of models as in the SVM-Fisher

    • Combined with maximum methods


Results

  • Metric : the rate of false positives (RFP)

  • RFP for a positive test sequence : the fraction of negative test sequences that score as good of better than positive sequence.


G-proteins

  • The result of the family of the nucleotide triphosphate hydrolases SCOP superfamily

    • Test the ability to distinguish 8 PDB90 G proteins from 2439 sequences in other SCOP folds.

  • Table 1

    • In SVM-Fisher

      • 5 of the 8 G proteins are better than all 2439 negative test sequences.

      • Maximum RFP

      • Median RFP

  • Figure 2

    • RFP curve


Table 1. Rate of false positives for G proteins family. BLAST = BLAST:SCOP-only, B-Hom = BLAST:SCOP+SAMT-98-homologs, S-T98 = SAMT-98, and SVM-F = SVM-Fisher method


Figure 2: 4 methods on the 33 test families. Curve of median RFP


Discussion

  • New approach

    • to recognition of remote protein homologies make a discriminative method built on top of a generative model (HMMs)

    • Discriminative method on top of HMM methods

    • Significant improvement

  • Combining multiple score would be improved.

  • Allocation problem

    • Different training set for tuning HMM and different training set for discriminative model

  • Extend the method to identify multiple domains within large protein sequences


  • Login