Protein Homology Detection Using String Alignment Kernels

Presentation Transcript

  1. Jean-Philippe Vert, Tatsuya Akutsu Protein Homology Detection Using String Alignment Kernels

  2. Learning Sequence-Based Protein Classification
• Problem: classification of protein sequence data into families and superfamilies
• Motivation: many proteins have been sequenced, but their structure/function often remains unknown
• Motivation: infer structure/function from sequence-based classification

  3. Sequence Data Versus Structure and Function
Sequences for the four chains of human hemoglobin (tertiary structure; function: oxygen transport):
>1A3N:A HEMOGLOBIN
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGK
KVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPA
VHASLDKFLASVSTVLTSKYR
>1A3N:B HEMOGLOBIN
VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV
KAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGK
EFTPPVQAAYQKVVAGVANALAHKYH
>1A3N:C HEMOGLOBIN
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGK
KVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPA
VHASLDKFLASVSTVLTSKYR
>1A3N:D HEMOGLOBIN
VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV
KAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGK
EFTPPVQAAYQKVVAGVANALAHKYH

  4. Structural Hierarchy
• SCOP: Structural Classification of Proteins
• Interested in superfamily-level homology, i.e. remote evolutionary relationships. Difficult!

  5. Learning Problem
• Reduce to a binary classification problem: an example is positive (+) if it belongs to a family (e.g. G proteins) or superfamily (e.g. nucleoside triphosphate hydrolases), negative (−) otherwise
• Focus on remote homology detection
• Use a supervised learning approach to train a classifier:
Labeled Training Sequences → Learning Algorithm → Classification Rule

  6. Two Supervised Learning Approaches to Classification
• Generative model approach: build a generative model for a single protein family; classify each candidate sequence based on its fit to the model. Uses only positive training sequences.
• Discriminative approach: the learning algorithm tries to learn a decision boundary between positive and negative examples. Uses both positive and negative training sequences.

  7. Targets of the Current Methods
• Class: secondary structure prediction
• Fold: threading
• Superfamily: HMM, PSI-BLAST, SVM
• Family: SW, BLAST, FASTA

  8. Discriminative Learning
• Discriminative approach: train on both positive and negative examples to learn a classifier
• Modern computational learning theory:
• Goal: learn a classifier that generalizes well to new examples
• Do not use the training data to estimate parameters of a probability distribution, avoiding the "curse of dimensionality"

  9. SVM for Protein Classification
• Want to define a feature map from the space of protein sequences to a vector space
• Goals:
• Computational efficiency
• Competitive performance with known methods
• No reliance on a generative model: a general method for sequence-based classification problems
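A minimal sketch of how such a kernel plugs into an SVM in practice, using scikit-learn's precomputed-kernel interface. The toy `string_kernel`, the example sequences, and the labels are illustrative placeholders, not the paper's setup:

```python
# Training an SVM with a precomputed string kernel (minimal sketch).
import numpy as np
from sklearn.svm import SVC

def string_kernel(x: str, y: str) -> float:
    """Placeholder for any positive definite kernel on sequences;
    here a toy histogram-intersection count of shared characters."""
    return float(sum(min(x.count(c), y.count(c)) for c in set(x)))

train_seqs = ["VLSPADK", "VHLTPEE", "MKTAYIA"]   # toy sequences
labels     = np.array([1, 1, -1])                # +1: superfamily member

# Precompute the Gram matrix K[i, j] = k(x_i, x_j).
K = np.array([[string_kernel(a, b) for b in train_seqs] for a in train_seqs])

clf = SVC(kernel="precomputed")
clf.fit(K, labels)

# To classify a new sequence, compute its kernel values against the training set.
test = "VLSPADH"
k_test = np.array([[string_kernel(test, a) for a in train_seqs]])
print(clf.predict(k_test))
```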

  10. Summary of the Current Kernel Methods
• Feature vector from an HMM:
• Fisher kernel (Jaakkola et al., 2000)
• Marginalized kernel (Tsuda et al., 2002)
• Feature vector from the sequence:
• Spectrum kernel (Leslie et al., 2002)
• Mismatch kernel (Leslie et al., 2003)
• Feature vector from another score:
• SVM-pairwise (Liao & Noble, 2002)
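As a concrete instance of a feature vector built from the sequence itself, here is a minimal sketch of the k-spectrum kernel (Leslie et al., 2002): the feature map counts every length-k substring, and the kernel is the inner product of the count vectors. The choice k = 3 and the toy sequences are arbitrary:

```python
# k-spectrum kernel (minimal sketch).
from collections import Counter

def spectrum_features(seq: str, k: int = 3) -> Counter:
    """Count all k-mers occurring in the sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def spectrum_kernel(x: str, y: str, k: int = 3) -> int:
    fx, fy = spectrum_features(x, k), spectrum_features(y, k)
    # Inner product over the k-mers occurring in both sequences.
    return sum(cnt * fy[kmer] for kmer, cnt in fx.items())

print(spectrum_kernel("VLSPADKTNV", "VLSEADKTNV"))
```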

  11. String Alignment Kernels
• Observation: the SW alignment score provides a measure of similarity that incorporates biological knowledge about protein evolution.
• It cannot be used as a kernel directly, because it lacks positive definiteness.
• A family of local alignment (LA) kernels that mimic the SW score is presented.

  12. LA Kernels
• Other kernels: choose a feature vector representation, then obtain the kernel as an inner product of feature vectors.
• LA kernel: measure similarity directly, then obtain a valid kernel.

  13. LA Kernels
• Pair score kernel, for $\beta \ge 0$ and a symmetric similarity score $s$:
$K_a^{(\beta)}(x,y) = \exp(\beta\, s(x,y))$ if $|x| = |y| = 1$, and $0$ otherwise
• Gap kernel for the affine gap penalty model:
$K_g^{(\beta)}(x,y) = \exp\big(\beta\,[\,g(|x|) + g(|y|)\,]\big)$, where $g(0) = 0$ and $g(n) = d + e\,(n-1)$ for $n \ge 1$, with $d$ the gap opening cost and $e$ the gap extension cost
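A minimal sketch of these two building blocks in Python. The parameter values, the toy substitution score standing in for BLOSUM62, and the convention that gap costs enter the exponent with a negative sign (so that the assembled kernel sums exp(β × alignment score)) are assumptions of this sketch:

```python
import math

BETA, D, E = 0.5, 11.0, 1.0   # illustrative values: beta, gap open d, gap extend e

def s(a: str, b: str) -> float:
    """Toy symmetric substitution score; a real use would look up BLOSUM62."""
    return 2.0 if a == b else -1.0

def g(n: int) -> float:
    """Affine gap penalty: g(0) = 0, g(n) = d + e*(n - 1) for n >= 1."""
    return 0.0 if n == 0 else D + E * (n - 1)

def k_pair(x: str, y: str, beta: float = BETA) -> float:
    """Pair score kernel: nonzero only when both arguments are single residues."""
    if len(x) != 1 or len(y) != 1:
        return 0.0
    return math.exp(beta * s(x, y))

def k_gap(x: str, y: str, beta: float = BETA) -> float:
    """Gap kernel: weights the two gapped segments by their affine penalties.
    The negative sign makes gaps lower the assembled alignment score."""
    return math.exp(-beta * (g(len(x)) + g(len(y))))
```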

  14. LA Kernels
• Kernel convolution: $(K_1 \star K_2)(x,y) = \sum_{x = x_1 x_2,\; y = y_1 y_2} K_1(x_1,y_1)\,K_2(x_2,y_2)$
• For $n \ge 1$, the string kernel can be expressed as
$K_n^{(\beta)} = K_0 \star \big(K_a^{(\beta)} \star K_g^{(\beta)}\big)^{\star (n-1)} \star K_a^{(\beta)} \star K_0$, with $K_0 \equiv 1$
• Read left to right: an initial part $K_0$, a succession of $n$ aligned residues $K_a^{(\beta)}$ with $n-1$ possible gaps $K_g^{(\beta)}$, and a terminal part $K_0$.
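A minimal sketch of the convolution operation, summing over every way of splitting each string into a prefix and a suffix. Building K_n by nesting this is exponential and only usable on toy strings; the O(|x|·|y|) dynamic program on slide 22 is the practical route. The toy pair kernel echoes the previous sketch:

```python
import math

def k_pair(x: str, y: str, beta: float = 0.5) -> float:
    """Toy pair score kernel (cf. slide 13): nonzero only on single residues."""
    if len(x) != 1 or len(y) != 1:
        return 0.0
    return math.exp(beta * (2.0 if x == y else -1.0))

def convolve(k1, k2):
    """(k1 * k2)(x, y): sum over all splits x = x1 x2 and y = y1 y2."""
    def k(x: str, y: str) -> float:
        return sum(k1(x[:i], y[:j]) * k2(x[i:], y[j:])
                   for i in range(len(x) + 1)
                   for j in range(len(y) + 1))
    return k

k0 = lambda x, y: 1.0                      # K_0 is identically 1
k1 = convolve(convolve(k0, k_pair), k0)    # K_1: exactly one aligned residue pair
print(k1("AB", "A"))                       # sums exp(beta*s) over all single-pair alignments
```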

  15. LA Kernels
$K_{LA}^{(\beta)} = \sum_{n=0}^{\infty} K_n^{(\beta)}$
• The sum is convergent for any $x$ and $y$ because only a finite number of terms are non-null.
• It is a point-wise limit of Mercer kernels, and therefore a Mercer kernel itself.

  16. LA Kernel and the SW Score
• $\pi$: a local alignment of $x$ and $y$
• $p(x,y,\pi)$: the score of local alignment $\pi$ of $x$ and $y$
• $\Pi(x,y)$: the set of all possible local alignments of $x$ and $y$
• $K_{LA}^{(\beta)}(x,y) = \sum_{\pi \in \Pi(x,y)} \exp\big(\beta\, p(x,y,\pi)\big)$, whereas $SW(x,y) = \max_{\pi \in \Pi(x,y)} p(x,y,\pi)$
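To make the max-over-alignments view concrete, here is a minimal Smith-Waterman sketch with affine gaps (the standard Gotoh-style recursion); the substitution scores and gap parameters are illustrative, not the paper's:

```python
# Smith-Waterman local alignment score with affine gaps (minimal sketch).
def smith_waterman(x: str, y: str, d: float = 11.0, e: float = 1.0) -> float:
    """Best local alignment score: SW(x, y) = max over alignments of p(x, y, pi)."""
    n, m = len(x), len(y)
    # M: best score ending in an aligned pair; X / Y: ending in a gap.
    M = [[0.0] * (m + 1) for _ in range(n + 1)]
    X = [[float("-inf")] * (m + 1) for _ in range(n + 1)]
    Y = [[float("-inf")] * (m + 1) for _ in range(n + 1)]
    best = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sc = 2.0 if x[i - 1] == y[j - 1] else -1.0   # toy substitution score
            M[i][j] = max(0.0, M[i-1][j-1], X[i-1][j-1], Y[i-1][j-1]) + sc
            X[i][j] = max(M[i][j-1] - d, X[i][j-1] - e)  # open or extend a gap
            Y[i][j] = max(M[i-1][j] - d, Y[i-1][j] - e)
            best = max(best, M[i][j])
    return best

print(smith_waterman("VLSPADKTNV", "VLSEADKTNV"))
```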

  17. Why SW Cannot Be a Kernel
• 1. SW keeps only the best-scoring alignment instead of summing over all alignments of $x$ and $y$.
• 2. The logarithm can destroy the property of being positive definite (taking the maximum amounts to the limit $\frac{1}{\beta}\ln K_{LA}^{(\beta)}$ as $\beta \to \infty$).

  18. Example
[Figure: LA kernel values and SW scores compared on example sequence pairs]

  19. SVM-pairwise vs. LA Kernel
• SVM-pairwise: map $x$ and $y$ to vectors of SW scores against the training sequences, e.g. $\Phi(x) = (0.2, 0.3, 0.1, 0.01)$ and $\Phi(y) = (0.9, 0.05, 0.3, 0.2)$, then take their inner product: 0.227.
• LA kernel: compare $x$ and $y$ directly by pair-HMM-style dynamic programming, giving e.g. 0.253.
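For contrast, a minimal sketch of the SVM-pairwise feature map, reusing the `smith_waterman` sketch from slide 16; the training list is a toy stand-in for the real training set:

```python
# SVM-pairwise feature map (Liao & Noble, 2002), minimal sketch.
import numpy as np

train_seqs = ["VLSPADKTNV", "VHLTPEEKSA", "MKTAYIAKQR", "GDVEKGKKIF"]

def pairwise_features(seq: str) -> np.ndarray:
    """Feature vector: SW scores of `seq` against every training sequence."""
    return np.array([smith_waterman(seq, t) for t in train_seqs])

def svm_pairwise_kernel(x: str, y: str) -> float:
    # The kernel is a plain inner product of the two score vectors.
    return float(pairwise_features(x) @ pairwise_features(y))
```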

  20. Diagonal Dominance Issue
• K(x,x) is easily orders of magnitude larger than K(x,y) even for similar sequences, which biases the performance of the SVM.
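A small illustration of the effect and of the usual cosine-style normalization, substituting exp(β · SW) for the full LA kernel (it exhibits the same exponential blow-up on the diagonal) and reusing the `smith_waterman` sketch from slide 16:

```python
import math

def normalize(k, x, y):
    """Cosine normalization: K(x, y) / sqrt(K(x, x) * K(y, y))."""
    return k(x, y) / math.sqrt(k(x, x) * k(y, y))

beta = 0.5
k = lambda a, b: math.exp(beta * smith_waterman(a, b))
x, y = "VLSPADKTNV", "VLSEADKTNV"   # closely related toy sequences
print(k(x, x) / k(x, y))   # = exp(beta * (SW(x,x) - SW(x,y))); grows fast with beta
print(normalize(k, x, y))  # normalized similarity in (0, 1]
```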

  21. Diagonal Dominance Issue
(1) The eigen kernel LA-eig:
a. If the training Gram matrix has negative eigenvalues, subtract the smallest negative eigenvalue from the diagonal.
b. LA-eig is equal to the LA kernel except possibly on the diagonal.
(2) The empirical kernel map LA-ekm.
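A minimal numpy sketch of the LA-eig construction on a toy symmetric matrix: subtracting the (negative) smallest eigenvalue from the diagonal shifts the whole spectrum into the nonnegative range while leaving off-diagonal entries untouched:

```python
import numpy as np

def la_eig(gram: np.ndarray) -> np.ndarray:
    """Make a symmetric Gram matrix positive semidefinite by a diagonal shift."""
    min_eig = np.linalg.eigvalsh(gram).min()
    if min_eig < 0:
        gram = gram - min_eig * np.eye(gram.shape[0])
    return gram

# Toy symmetric matrix with a negative eigenvalue:
G = np.array([[1.0, 2.0], [2.0, 1.0]])    # eigenvalues -1 and 3
print(np.linalg.eigvalsh(la_eig(G)))      # now [0, 4]
```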

  22. Methods
• Implementation: the kernel can be computed with a complexity of O(|x| · |y|), using dynamic programming, by a slight modification of the SW algorithm (see the sketch below).
• Normalization
• Dataset: 4352 sequences extracted from the ASTRAL database (www.cs.columbia.edu/compbio/svmpairwise), grouped into families and superfamilies.
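A minimal sketch of that dynamic program, written as the "sum" counterpart of the affine-gap SW recursion: maxima become sums and scores move into exponentials, so the result equals the sum over all local alignments π of exp(β · p(x, y, π)), including the empty alignment. The gap sign convention, the parameter values, and the toy substitution score are assumptions of this sketch (a real use would take BLOSUM62):

```python
import math

def la_kernel(x: str, y: str, beta: float = 0.5,
              d: float = 11.0, e: float = 1.0) -> float:
    def s(a, b):  # toy substitution score
        return 2.0 if a == b else -1.0
    n, m = len(x), len(y)
    od, oe = math.exp(-beta * d), math.exp(-beta * e)  # gap open / extend weights
    # M: alignments ending with an aligned pair at (i, j);
    # X, Y: alignments ending in a gap; X2, Y2: running sums of M so that
    # alignments ending before the string ends are also counted.
    M  = [[0.0] * (m + 1) for _ in range(n + 1)]
    X  = [[0.0] * (m + 1) for _ in range(n + 1)]
    Y  = [[0.0] * (m + 1) for _ in range(n + 1)]
    X2 = [[0.0] * (m + 1) for _ in range(n + 1)]
    Y2 = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            M[i][j] = math.exp(beta * s(x[i-1], y[j-1])) * (
                1.0 + M[i-1][j-1] + X[i-1][j-1] + Y[i-1][j-1])
            X[i][j] = od * M[i-1][j] + oe * X[i-1][j]
            Y[i][j] = od * (M[i][j-1] + X[i][j-1]) + oe * Y[i][j-1]
            X2[i][j] = M[i-1][j] + X2[i-1][j]
            Y2[i][j] = M[i][j-1] + X2[i][j-1] + Y2[i][j-1]
    # 1 accounts for the empty alignment; the rest sums M over all end points.
    return 1.0 + X2[n][m] + Y2[n][m] + M[n][m]

print(la_kernel("VLSPADKTNV", "VLSEADKTNV"))
```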

  23. ROC Curve

  24. ROC Curve

  25. Summary of the Kernels