
Machine Learning for Protein Classification: Kernel Methods



  1. Machine Learning for Protein Classification: Kernel Methods CS 374 Rajesh Ranganath 4/10/2008

  2. Outline • Biological Motivation and Background • Algorithmic Concepts • Mismatch Kernels • Semi-supervised methods

  3. Proteins

  4. The Protein Problem • Primary Structure can be easily determined • 3D structure determines function • Grouping proteins into structural and evolutionary families is difficult • Use machine learning to group proteins

  5. How to look at amino acid chains • Smith-Waterman Idea • Mismatch Idea

  6. Families • Proteins whose evolutionary relationship is readily recognizable from the sequence (>~25% sequence identity) • Families are further subdivided into Proteins • Proteins are divided into Species • The same protein may be found in several species [Diagram: Fold → Superfamily → Family → Proteins; Morten Nielsen, CBS, BioCentrum, DTU]

  7. Superfamilies • Proteins which are (remotely) evolutionarily related • Sequence similarity is low • Share function • Share special structural features • Relationships between members of a superfamily may not be readily recognizable from the sequence alone

  8. Folds • Proteins which have >~50% of their secondary structure elements arranged in the same order in the protein chain and in three dimensions are classified as having the same fold • No evolutionary relationship between the proteins is implied

  9. Protein Classification • Given a new protein, can we place it in its “correct” position within an existing protein hierarchy? • Methods: BLAST / PSI-BLAST, Profile HMMs, Supervised machine learning methods

  10. Machine Learning Concepts • Supervised Methods • Discriminative Vs. Generative Models • Transductive Learning • Support Vector Machines • Kernel Methods • Semi-supervised Methods

  11. Discriminative and Generative Models [Figure: side-by-side illustrations of a discriminative and a generative model]

  12. Transductive Learning • Most learning is inductive • Inductive: given (x1, y1), …, (xm, ym), predict the label y* for any test input x* • Transductive: given (x1, y1), …, (xm, ym) and all the test inputs {x1*, …, xp*}, predict the labels {y1*, …, yp*}

  13. Support Vector Machines • Popular discriminative learning algorithm • Finds the classifier with optimal geometric margin • Can be trained efficiently using the Sequential Minimal Optimization algorithm • If x1 … xn are the training examples, sign(Σi αi yi xiT x) “decides” where x falls • Train the αi to achieve the best margin
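
The decision rule on this slide can be sketched in plain Python. This is only the evaluation of sign(Σi αi yi xiT x) with already-trained multipliers; the bias term and the training procedure (SMO) are omitted, and the names alphas, ys, xs are illustrative.

```python
def svm_decision(alphas, ys, xs, x):
    """Evaluate sign(sum_i alpha_i * y_i * <x_i, x>) for a linear SVM.

    alphas: learned Lagrange multipliers, one per training example
    ys:     training labels in {+1, -1}
    xs:     training points (tuples of floats)
    x:      the point to classify
    Bias term omitted for simplicity.
    """
    s = sum(
        a * y * sum(xi_j * x_j for xi_j, x_j in zip(xi, x))
        for a, y, xi in zip(alphas, ys, xs)
    )
    return 1 if s >= 0 else -1
```

Only the support vectors end up with nonzero αi, so in practice the sum runs over a small subset of the training data.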

  14. Support Vector Machines (2) • Kernelizable: the SVM solution can be written entirely in terms of dot products of the inputs • sign(Σi αi yi K(xi, x)) determines the class of x

  15. Kernel Methods • K(x, z) = f(x)Tf(z) • f is the feature mapping • x and z are input vectors • High-dimensional features do not need to be explicitly calculated • Think of the kernel function as a similarity measure between x and z • Example:
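
A classic illustration (not necessarily the example from the original slide) is the quadratic kernel K(x, z) = (xTz)² on 2-d inputs, whose feature map f can be written out explicitly. The sketch below shows that computing the kernel directly agrees with mapping to the 3-d feature space first:

```python
import math

def poly_kernel(x, z):
    # K(x, z) = (x . z)^2 for 2-d inputs: one dot product, squared
    return (x[0] * z[0] + x[1] * z[1]) ** 2

def feature_map(x):
    # f(x) = (x1^2, sqrt(2)*x1*x2, x2^2), chosen so that
    # f(x) . f(z) = (x . z)^2 expands term by term
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)
```

For a degree-d polynomial kernel on n-dimensional inputs the explicit feature space has O(n^d) coordinates, which is exactly what the kernel trick lets us avoid computing.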

  16. Mismatch Kernel • Regions of similar amino acid sequences yield a similar tertiary structure of proteins • Used as a kernel for an SVM to identify protein homologies

  17. k-mer based SVMs • For a given word size k and mismatch tolerance l, define K(X, Y) = # distinct k-long word occurrences with ≤ l mismatches • Define the normalized mismatch kernel K’(X, Y) = K(X, Y) / sqrt(K(X, X) K(Y, Y)) • An SVM can be learned by supplying this kernel function • Example: let k = 3, l = 1; X = ABACARDI, Y = ABRADABI; then K(X, Y) = 4 and K’(X, Y) = 4/sqrt(7·7) = 4/7
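
A naive sketch of the mismatch kernel, following the feature-map formulation of Leslie et al.: each sequence is mapped to a vector indexed by every length-k word over the alphabet, counting the k-mers of the sequence within Hamming distance l of that word, and the kernel is the dot product of these vectors. This enumerates all |Σ|^k words, so it is for illustration only (the paper uses a mismatch-tree data structure), and its counts need not reproduce the slide's "distinct occurrences" numbers exactly.

```python
import math
from itertools import product

def kmers(seq, k):
    """All overlapping length-k substrings of seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def mismatch_features(seq, k, l, alphabet):
    """Map seq to counts over every length-k word in alphabet^k:
    how many k-mers of seq lie within Hamming distance <= l of the word."""
    counts = {}
    for word in map("".join, product(alphabet, repeat=k)):
        counts[word] = sum(
            sum(a != b for a, b in zip(word, km)) <= l
            for km in kmers(seq, k)
        )
    return counts

def mismatch_kernel(x, y, k, l, alphabet):
    px = mismatch_features(x, k, l, alphabet)
    py = mismatch_features(y, k, l, alphabet)
    return sum(px[w] * py[w] for w in px)

def normalized_mismatch_kernel(x, y, k, l, alphabet):
    """K'(X, Y) = K(X, Y) / sqrt(K(X, X) * K(Y, Y))."""
    return mismatch_kernel(x, y, k, l, alphabet) / math.sqrt(
        mismatch_kernel(x, x, k, l, alphabet)
        * mismatch_kernel(y, y, k, l, alphabet)
    )
```

With l = 0 this reduces to the spectrum kernel (exact k-mer matching), a useful sanity check before allowing mismatches.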

  18. Disadvantages • Determining the 3D structure of proteins experimentally is impractical for most sequences • Primary sequences are cheap to determine • How do we use all this unlabeled data? • Use semi-supervised learning based on the cluster assumption

  19. Semi-Supervised Methods • Some examples are labeled • Assume labels vary smoothly among all examples

  20. Semi-Supervised Methods • Some examples are labeled • Assume labels vary smoothly among all examples • SVMs and other discriminative methods may make significant mistakes due to lack of data


  24. Semi-Supervised Methods • Some examples are labeled • Assume labels vary smoothly among all examples • Attempt to “contract” the distances within each cluster while keeping inter-cluster distances large


  26. Cluster Kernels • Semi-supervised methods • Neighborhood kernel • For each X, run PSI-BLAST to get similar sequences → Nbd(X) • Define Φnbd(X) = (1/|Nbd(X)|) Σ_{X' ∈ Nbd(X)} Φoriginal(X') — “counts of all k-mers matching with at most 1 difference, summed over all sequences similar to X” • Knbd(X, Y) = (1/(|Nbd(X)| |Nbd(Y)|)) Σ_{X' ∈ Nbd(X)} Σ_{Y' ∈ Nbd(Y)} K(X', Y') • Next: bagged mismatch
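
The neighborhood kernel above can be sketched in a few lines, assuming the neighborhoods have already been computed: here a plain dict `nbd` stands in for the PSI-BLAST step, and `base_k` stands in for the mismatch kernel K — both names are illustrative.

```python
def neighborhood_kernel(x, y, nbd, base_k):
    """K_nbd(X, Y) = average of the base kernel over all pairs drawn
    from the neighborhoods Nbd(X) and Nbd(Y).

    nbd:    dict mapping each sequence to its list of neighbors
            (each neighborhood conventionally includes the sequence itself)
    base_k: the underlying kernel K(X', Y'), e.g. a mismatch kernel
    """
    nx, ny = nbd[x], nbd[y]
    total = sum(base_k(a, b) for a in nx for b in ny)
    return total / (len(nx) * len(ny))
```

Because it averages the base kernel over neighbor pairs, two sequences with low direct similarity can still score highly when their PSI-BLAST neighborhoods overlap, which is exactly the semi-supervised effect the slide describes.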

  27. Bagged Mismatch Kernel • Final method • Bagged mismatch • Run k-means clustering n times, giving assignments cp(X) for p = 1, …, n • For every X and Y, count the fraction of runs in which they are bagged together: Kbag(X, Y) = (1/n) Σp 1(cp(X) = cp(Y)) • Combine the “bag fraction” with the original comparison K(·,·): Knew(X, Y) = Kbag(X, Y) · K(X, Y)
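
A minimal sketch of the bagged kernel, assuming the n k-means runs have already produced their assignments; `assignments` and `base_k` are illustrative names, and the clustering itself is not shown.

```python
def bagged_kernel(x, y, assignments, base_k):
    """K_new(X, Y) = K_bag(X, Y) * K(X, Y), where K_bag is the fraction
    of clustering runs that put X and Y in the same cluster.

    assignments: one dict per k-means run, mapping item -> cluster id
    base_k:      the original kernel K(., .), e.g. a mismatch kernel
    """
    n = len(assignments)
    k_bag = sum(1 for cp in assignments if cp[x] == cp[y]) / n
    return k_bag * base_k(x, y)
```

Multiplying by the bag fraction down-weights pairs that the unlabeled data rarely places in the same cluster, implementing the "contract within clusters" idea from the earlier slides.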

  28. [Results figure; credit: O. Jangmin]

  29. What works best? Transductive Setting

  30. References • C. Leslie et al. Mismatch string kernels for discriminative protein classification. Bioinformatics, 2004. • J. Weston et al. Semi-supervised protein classification using cluster kernels. 2003. • Images from Wikimedia Commons
