- 44 Views
- Uploaded on
- Presentation posted in: General

Using the Fisher kernel method to detect remote protein homologies

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Using the Fisher kernel method to detect remote protein homologies

Tommi Jaakkola, Mark Diekhams, David Haussler

ISMB’ 99

Talk by O, Jangmin (2001/01/16)

- Detecting remote protein homologies
- Fisher kernel method
- Variant of Support Vector Machines using new kernel function
- Derived from Hidden Markov Models

- Detecting protein homologies (sequence-based algorithm)
- BLAST, Fasta, PROBE, templates, profiles, position-specific weight matrices, HMM

- Comparison by (Brenner 1996; Park et al. 1998)
- SCOP classification of protein structures
- Remote protein homologies existing between protein domain in the same structural superfamily.
- Statistical models like PSI-BLAST and HMMs are better than simple pairwise comparison methods.

- Generative statistical models (HMMs)
- Extracting features from protein sequences
- Mapping all protein sequences to points in a Euclidean feature space of fixed dimension.

- General discriminative statistical method to classify the points.
- Improvements acquired
- Over HMMs alone.

- How generative models work. (HMMs)
- Training examples ( sequences known to be members of protein family ) : positive
- Tuning parameters with a priori knowledge
- Model assigns a probability to any given protein sequence.
- The sequence from that family yield a higher probability than that of outside family.

- Log-likelihood ratio as score

- Using both positive and negative examples
- Parameter is tuned so that the model can optimally discriminate members of the family from nonmembers.
- When training examples are few
- Likelihood ratio is optimal if generative models perfectly fit to data but…
- Discriminative methods often performs better.

- Discriminant function L(X)
- Where { Xi, i = 1,…,n} and hypothesis class H1, H2
- + : the sequence of the family, - : outside of the family

- Contribution of Kernel
- i : overall importance of the example Xi.
- Measure of pairwise similarity : K(Xi, X)

- User supplies the type of kernel for the application area!!

- Deriving kernel function from generative models
- Advantage 1 : handle variable length protein sequences!!
- Advantage 2 : encoding of prior knowledge about protein sequences

- HMMs (difference)
- Kernel function specifies a similarity score for any pair of sequences.
- Likelihood score from an HMM only measures the closeness of the sequence to the model itself.

- Sufficient statistics
- Each parameter in HMM : Posterior frequencies
- Of particular transition.
- Of generating one of the residues of the query sequence.

- Reflects the process of generating the query sequence from HMM.

- Each parameter in HMM : Posterior frequencies
- Alterative of sufficient statistics : Fischer score
- Magnitude of the components : how each contributes to generating the query sequence.

- Kernel function used in this paper.
note that its fixed vector.

- Summary
- Train HMM with positive examples.
- Map each new protein sequence X into a fixed vector, Fisher score.
- Calculate the kernel function
- Get resulting discriminant function (SVM-Fisher)

- Combination of scores
- There might be more than one HMM model for the family or superfamily of interest.

- Average score
- Maximum score

- Methods
- SVM-Fisher (this paper)
- BLAST (Altshul et al. 1990; Gish & States 1993)
- HMMs using SAM-T98 methodology (Park et al. 1998; Karplus, Barrett, & Hughey 1998; Hughey & Krogh 1995l 1996)

- Measurement of recognition rate for members of superfamilies of the SCOP protein structure classification (Hubbard et al. 1997)
- Withholding all members of SCOP family
- Train with the remaining members of SCOP superfamily
- Test with withheld data
- Question: “Could the method discover a new family of a known superfamily?”

- Database
- SCOP version 1.37 PDB90 : consisting of protein domains, no two of which have 90% of more residue identity
- PDB90 eliminates redundant sequences.

- Generative models
- SAM-T98 HMMs

- Data selection
- Get 33 test families from 16 superfamilies.

- Evaluation strategy
- Assessing to what extent it gave better scores to the positive test examples thant it gave to the negative test examples.

- Hierachical levels
- Family: clustered proteins by common evolutionary origin: residue identities of above 30%, lower sequence identities but very similar functions and structures
- Superfamily: low sequence identities but probably common evolutionary origin
- Fold: same major secondary structure in the same arrangement and with the same topological connections

Figure 1: Separation of the SCOP PDB90 database into training and test sequences, shown for the G proteins test family

- Modeling superfamily
- SAM-T98 : starts with a single sequence (the guide sequence for the domain) and build a model
- Too many sequences!
- Using a subset of PDB90.
- Train SVM-Fisher method using each of models in turn

- All PDB90 sequence outside the fold of the test family were used as either negative training or negative test examples.
- Reverse test/training allocation of negative examples, and repeat experiments.
- Fold-by-fold basis split of negative examples.

- For positive examples
- PDB90 sequences in the superfamily of the test family are used.
- Homologs found by each individual SAM-T98 model are used.

- WU-BLAST version 2.0a16 (Althcshul & Gish 1996)
- PDB90 database was queried with each positive training examples, and E-values were recorded.
- BLAST:SCOP-only
- BLAST:SCOP+SAM-T98-homologs
- Scores were combined by the maximum method

- SAM-T98 method
- Null model: reverse sequence model
- Same data and same set of models as in the SVM-Fisher
- Combined with maximum methods

- Metric : the rate of false positives (RFP)
- RFP for a positive test sequence : the fraction of negative test sequences that score as good of better than positive sequence.

- The result of the family of the nucleotide triphosphate hydrolases SCOP superfamily
- Test the ability to distinguish 8 PDB90 G proteins from 2439 sequences in other SCOP folds.

- Table 1
- In SVM-Fisher
- 5 of the 8 G proteins are better than all 2439 negative test sequences.
- Maximum RFP
- Median RFP

- In SVM-Fisher
- Figure 2
- RFP curve

Table 1. Rate of false positives for G proteins family. BLAST = BLAST:SCOP-only, B-Hom = BLAST:SCOP+SAMT-98-homologs, S-T98 = SAMT-98, and SVM-F = SVM-Fisher method

Figure 2: 4 methods on the 33 test families. Curve of median RFP

- New approach
- to recognition of remote protein homologies make a discriminative method built on top of a generative model (HMMs)
- Discriminative method on top of HMM methods
- Significant improvement

- Combining multiple score would be improved.
- Allocation problem
- Different training set for tuning HMM and different training set for discriminative model

- Extend the method to identify multiple domains within large protein sequences