
Ming-Yang Kao Department of Computer Science Northwestern University Evanston, Illinois U. S. A

A Combinatorial Toolbox for Protein Sequence Design and Landscape Analysis in the Grand Canonical Model.


Presentation Transcript


  1. A Combinatorial Toolbox for Protein Sequence Design and Landscape Analysis in the Grand Canonical Model Ming-Yang Kao Department of Computer Science Northwestern University Evanston, Illinois U. S. A

  2. Acknowledgments This talk is based on joint work with colleagues & students at Yale University: Computer Science: • Jim Aspnes • Gauri Shah Biology: • Julia Hartling • Junhyong Kim

  3. Dual Purposes of This Talk • Discuss protein folding problems. • Emphasize the point that as bioinformatics grows, advanced algorithmic techniques will become useful and crucial.

  4. Importance of Protein Folding The 3D structure significantly determines the function.

  5. Two Complementary Problems for Protein Folding • Protein Folding Prediction---Given a protein sequence, determine the 3D folding of the sequence. • Protein Sequence Design--- Given a 3D structure, determine the fittest protein sequence for the structure, i.e., one that has the smallest energy among all possible sequences when folded into the structure.

  6. Complexity for Protein Folding Problems • Protein Folding Prediction---Given a protein sequence, determine the 3D folding of the sequence. NP-hard under various models. • Protein Sequence Design--- Given a 3D structure, determine the fittest protein sequence for the structure, i.e., one that has the smallest energy among all possible sequences when folded into the structure. Solvable in polynomial time under the Grand Canonical model.

  7. History of Protein Sequence Design • Protein Sequence Design --- Given a 3D structure, determine the fittest protein sequence for the structure, i.e., one that has the smallest energy among all possible sequences when folded into the structure. • Sun et al, 1995: Heuristic search without optimality guarantee. • Hart, 1997: Open question on the computational tractability. • Kleinberg, 1999: Polynomial-time algorithms. • Aspnes, Hartling, Kao, Kim, Shah, 2001: Improved algorithms and generalized problems (this talk).

  8. Outline of Technical Discussions • The Grand Canonical Model • Two Basic Computational Problems • Experimental Results • Combinatorial Tools (1a) Linear Programming (1b) Network Flow (1c) Compact Representation of All Min Cuts (1d) others • Further Algorithmic & Computational Hardness Results • Conclusions

  9. Outline of Technical Discussions (1) • The Grand Canonical Model • Two Basic Computational Problems • Experimental Results • Combinatorial Tools (1a) Linear Programming (1b) Network Flow (1c) Compact Representation of All Min Cuts (1d) others • Further Algorithmic & Computational Hardness Results • Conclusions

  10. Grand Canonical Model (Sun et al, 1995) • Each amino acid is classified as either Hydrophobic (H) or Polar (P). • Each amino acid sequence is then considered as a binary sequence of H and P. (For mathematical convenience, set H = 1 and P = 0.) • Hydrophobic (H): A, C, F, I, L, M, V, W, Y. • Polar (P): the other amino acids. • Sun, Brem, Chan, Dill. Designing amino acid sequences to fold with good hydrophobic cores. Protein Engineering, 1995.
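The H/P classification above can be sketched in a few lines of Python. The function names `to_hp` and `to_binary` are hypothetical helpers chosen for this illustration; only the list of hydrophobic residues comes from the slide.

```python
# One-letter codes of the residues the slide lists as hydrophobic.
HYDROPHOBIC = set("ACFILMVWY")

def to_hp(sequence):
    """Map an amino acid sequence (one-letter codes) to a string of H and P."""
    return "".join("H" if aa in HYDROPHOBIC else "P" for aa in sequence.upper())

def to_binary(sequence):
    """Same classification with H = 1 and P = 0, as used for the mathematics."""
    return [1 if aa in HYDROPHOBIC else 0 for aa in sequence.upper()]

print(to_hp("MKVLA"), to_binary("MKVLA"))
```

Any residue not in the hydrophobic set is treated as polar, which matches "the other amino acids" on the slide.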

  11. Representation of a 3D structure: (Sun et al, 1995) A 3D folding structure S of a sequence of n amino acids is given by: • the coordinates of each atom in S. • the pairwise distances between the centers of amino acid residues in S. • the solvent-accessible areas of the amino acid residues in S.

  12. Goal of Protein Sequence Design: (Sun et al, 1995) Input: A 3D structure S and a sequence length n. Output: a sequence X of n amino acids that, when folded into S, has the following properties: • The H-residues in X are as close to each other as possible. • The solvent-accessible areas of the H-residues of X are as small as possible.

  13. Fitness of a Sequence (Sun et al, 1995)

  14. Fitness of a Sequence (Sun et al, 1995) Terms of the fitness function: • closeness among H-residues • small surface area
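For reference, a form of the fitness function that is consistent with the arc labels -alpha*g(dij) and beta*si used later in the talk is sketched below. It is written under assumptions: the notation x_i, d_ij, s_i and the index range i < j - 2 follow the usual statement of the Sun et al. model and are not taken from this transcript.

```latex
% x_i = 1 if residue i is H and 0 if it is P.
% d_{ij}: distance between the centers of residues i and j in the structure.
% s_i: solvent-accessible area of residue i.
% g: a nonnegative function that decreases with distance.
\[
  \Phi(X) \;=\;
  \underbrace{\alpha \sum_{i < j-2} g(d_{ij})\, x_i x_j}_{\text{closeness among H-residues}}
  \;+\;
  \underbrace{\beta \sum_{i=1}^{n} s_i\, x_i}_{\text{small surface area}},
  \qquad \alpha < 0,\ \beta > 0 .
\]
```

A fittest sequence is one that minimizes this quantity. With alpha negative, every pair of nearby H-residues lowers the energy; with beta positive, every exposed H-residue raises it.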

  15. Outline of Technical Discussions (2) • The Grand Canonical Model • Two Basic Computational Problems • Experimental Results • Combinatorial Tools (1a) Linear Programming (1b) Network Flow (1c) Compact Representation of All Min Cuts (1d) others • Further Algorithmic & Computational Hardness Results • Conclusions

  16. Problem #1 Input: • the parameters alpha and beta, • a protein sequence Y, • Y’s 3D structure, • the sequence length n of Y. Output: a fittest sequence X for the 3D structure with respect to the given alpha and beta. Applications of this problem: Design the best sequences for novel structures because we don’t really need Y.

  17. Problem #2 Input: • the parameters alpha and beta, • a protein sequence Y, • Y’s 3D structure, • the sequence length n of Y. Output: a fittest sequence X for the 3D structure that is the most similar to Y over all possible alpha and beta. Applications of this problem: Tune the alpha and beta of the Grand Canonical model.

  18. Basic Computational Scheme (1) 3D structure → network → a min cut → a fittest sequence (e.g., HPPPHHPHP)

  19. Problem #1 Input: • the parameters alpha and beta, • a protein sequence Y, • Y’s 3D structure, • the sequence length n of Y. Output: a fittest sequence X for the 3D structure with respect to the given alpha and beta. Applications of this problem: Design the best sequences for novel structures because we don’t really need Y. Computational Complexity: 1 network flow.

  20. Problem #2 Input: • the parameters alpha and beta, • a protein sequence Y, • Y’s 3D structure, • the sequence length n of Y. Output: a fittest sequence X for the 3D structure that is the most similar to Y over all possible alpha and beta. Applications of this problem: Tune the alpha and beta of the Grand Canonical model. Computational Complexity: O(n) network flows.

  21. Outline of Technical Discussions (3) • The Grand Canonical Model • Two Basic Computational Problems • Experimental Results • Combinatorial Tools (1a) Linear Programming (1b) Network Flow (1c) Compact Representation of All Min Cuts (1d) others • Further Algorithmic & Computational Hardness Results • Conclusions

  22. Empirical Study: Predictive Ability • Computed Fittest Sequence versus Native Sequences (% similarity) • Our % Similarity versus Kleinberg’s • % Similarity versus Protein Family Size.

  23. % similarity --- computed versus native • % similarity = the percentage of the H/P’s in the computed fittest sequence that are identical to those in the native sequence. • The average percentage of the hydrophobic residues is 42% in the native sequences that were studied. • The best sequence picked without “domain knowledge” would have a 58% similarity on average.
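The similarity measure and the 58% baseline can be illustrated with a short Python sketch. The function name `percent_similarity` and the 100-residue toy sequence are assumptions made for this example. The baseline follows because a sequence of all P's agrees with the native sequence at every polar position, i.e., at 100% - 42% = 58% of positions.

```python
def percent_similarity(computed, native):
    """Percentage of positions where two H/P strings carry the same letter."""
    if len(computed) != len(native):
        raise ValueError("sequences must have the same length")
    matches = sum(a == b for a, b in zip(computed, native))
    return 100.0 * matches / len(native)

# Toy native sequence with 42% hydrophobic residues (hypothetical data).
native = "H" * 42 + "P" * 58
all_polar = "P" * len(native)   # a guess made without any domain knowledge
print(percent_similarity(all_polar, native))
```

A computed fittest sequence is informative only to the extent that it beats this baseline.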

  24. % similarity --- computed versus native (1)

  25. % similarity --- computed versus native (2) Our results versus Kleinberg’s

  26. % similarity --- computed versus native (3)

  27. % similarity versus PFAM family size (1) • % similarity = the percentage of the H/P’s in the computed fittest sequence that are identical to those in the native sequence. • PFAM family size of a protein = # of proteins in the PFAM database that are related to the given protein. • The relatedness is computed via HMM models. • pfam.wustl.edu • measure of success of a protein in Nature.

  28. % similarity versus PFAM family size (2) • % similarity = the percentage of the H/P’s in the computed fittest sequence that are identical to those in the native sequence. • PFAM family size of a protein = # of proteins in the PFAM database that are related to the given protein. Intuition/Conjecture: (3A) the more diverse a protein family is, (3B) the more its 3D structures vary, (3C) the smaller the % similarity will be.

  29. % similarity versus PFAM family size (3)

  30. % similarity versus PFAM family size (4)

  31. Outline of Technical Discussions (4) • The Grand Canonical Model • Two Basic Computational Problems • Experimental Results • Combinatorial Tools (1a) Linear Programming (1b) Network Flow (1c) Compact Representation of All Min Cuts (1d) others • Further Algorithmic & Computational Hardness Results • Conclusions

  32. Tool #1: Linear Programming Goal: find a fittest sequence X of n amino acids. • Direct formulation: find a binary sequence x that minimizes the fitness function. The objective is quadratic in x (clueless!). • LP formulation: find x and y that solve a linear program. • Linear • Totally unimodular • Integer solution • Useful for proving theorems • Still too inefficient
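The formulas on this slide are not part of the transcript. One standard linearization that matches the bullets (linear, totally unimodular, integer solution) is sketched below as an assumption, not as the slide's exact program. Here x_i = 1 means residue i is H, s_i is the solvent-accessible area of residue i, g(d_ij) >= 0 is the contact term of the pair (i, j), alpha < 0 and beta > 0, and y_ij is a new variable standing in for the product x_i x_j.

```latex
% Quadratic form: minimize over binary x
\[
  \min_{x \in \{0,1\}^n} \;
  \alpha \sum_{i<j} g(d_{ij})\, x_i x_j \;+\; \beta \sum_{i} s_i x_i .
\]
% Linearized form: replace each product x_i x_j by y_{ij}
\[
  \min_{x,\,y} \;
  \alpha \sum_{i<j} g(d_{ij})\, y_{ij} \;+\; \beta \sum_{i} s_i x_i
  \quad \text{subject to} \quad
  y_{ij} \le x_i,\;\; y_{ij} \le x_j,\;\; 0 \le x_i \le 1,\;\; y_{ij} \ge 0 .
\]
```

Because alpha is negative, an optimal solution pushes each y_ij up to min(x_i, x_j), which equals x_i x_j when x is binary. Each constraint y_ij - x_i <= 0 has exactly one +1 and one -1 coefficient, so the constraint matrix is totally unimodular and the linear program has an integral optimal solution.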

  33. Tool #2: Network Flow (1) • analogy: a network of oil pipes • source s (origin of oil) • sink t (destination of oil) • other nodes (midway stations) • arcs (pipes) • arc capacity (pipe capacity) • flow (amount of oil through a pipe) • goal: deliver max amount of oil from source to sink • computational goal: a max flow • computational complexity: O(VE log(V^2/E)) [Figure: example network from s to t with arc capacities.]

  34. Tool #2: Network Flow (2) example of max flow • source (origin of oil) • sink (destination of oil) • other nodes (midway stations) • arcs (pipes) • arc capacity (pipe capacity) • flow (amount of oil through a pipe) • goal: deliver max amount of oil from source to sink • computational goal: a max flow • computational complexity: O(VE log(V^2/E)) [Figure: example network with a max flow; each arc is labeled "capacity (flow)".]

  35. Tool #2: Network Flow (3) max flow versus min cut • min cut ≈ bottleneck • a partition (S,T) of nodes with s in S and t in T. • total capacity of arcs from S to T = max flow. [Figure: example network with a max flow; each arc is labeled "capacity (flow)".]

  36. Tool #2: Network Flow (4) max flow versus min cut • min cut ≈ bottleneck • a partition (S,T) of nodes with s in S and t in T. • total capacity of arcs from S to T = max flow. • computational complexity: O(VE log(V^2/E)) [Figure: example network with a max flow; each arc is labeled "capacity (flow)".]
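The max flow versus min cut relationship can be checked on a toy network with the sketch below. It uses the Edmonds-Karp method (shortest augmenting paths), chosen here for brevity; it runs in O(VE^2) time and is not the algorithm behind the O(VE log(V^2/E)) bound quoted on the slides. The network and the function name `max_flow_min_cut` are hypothetical.

```python
from collections import defaultdict, deque

def max_flow_min_cut(arcs, s, t):
    """Return (max flow value, source side S of a min cut) for (u, v, capacity) arcs."""
    residual = defaultdict(lambda: defaultdict(int))
    for u, v, c in arcs:
        residual[u][v] += c
        residual[v][u] += 0              # make sure the reverse arc exists
    value = 0
    while True:
        # Breadth-first search for a shortest augmenting path in the residual network.
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v, c in residual[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            # No augmenting path: the nodes reached from s form the source side S.
            return value, set(parent)
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(residual[u][v] for u, v in path)
        for u, v in path:
            residual[u][v] -= push
            residual[v][u] += push
        value += push

# Toy network: the two pipes leaving node "a" are the bottleneck.
ARCS = [("s", "a", 10), ("a", "b", 4), ("a", "c", 3), ("b", "t", 10), ("c", "t", 10)]
value, side = max_flow_min_cut(ARCS, "s", "t")
cut_capacity = sum(c for u, v, c in ARCS if u in side and v not in side)
print(value, sorted(side), cut_capacity)
```

The total capacity of the arcs leaving the returned set S equals the flow value, which is the max-flow min-cut equality stated on the slide.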

  37. Basic Computational Scheme (1) 3D structure → network → a min cut → a fittest sequence (e.g., HPPPHHPHP)

  38. Tool #2: 3D → Network (1) [Figure: example 3D structure with residues 1 to 9.] S1= 3 S2= 18 S3= 6 S4= 9 S5= 3 S6= 9 S7= 6 S8= 24 S9= 9 g(d16) = 0.5 g(d25) = 0.75 g(d58) = 0.9 g(d49) = 0.75 alpha = -8 beta = 1/3

  39. Tool #2: 3D → Network (2) [Figure: network built from the example structure. The arcs at residue nodes 1 to 9 have capacity beta*si = 1, 6, 2, 3, 1, 3, 2, 8, 3; the arcs at pair nodes (1,6), (2,5), (5,8), (4,9) have capacity -alpha*g(dij) = 4, 6, 7.2, 6.] S1= 3 S2= 18 S3= 6 S4= 9 S5= 3 S6= 9 S7= 6 S8= 24 S9= 9 g(d16) = 0.5 g(d25) = 0.75 g(d58) = 0.9 g(d49) = 0.75 alpha = -8 beta = 1/3

  40. Tool #2: 3D → Network (3) [Figure: network built from the example structure. The arcs at residue nodes 1 to 9 have capacity beta*si = 1, 6, 2, 3, 1, 3, 2, 8, 3; the arcs at pair nodes (1,6), (2,5), (5,8), (4,9) have capacity -alpha*g(dij) = 4, 6, 7.2, 6.] S1= 3 S2= 18 S3= 6 S4= 9 S5= 3 S6= 9 S7= 6 S8= 24 S9= 9 g(d16) = 0.5 g(d25) = 0.75 g(d58) = 0.9 g(d49) = 0.75 alpha = -8 beta = 1/3

  41. Tool #2: 3D → Network (4) [Figure: network built from the example structure, with arc capacities beta*si at residue nodes 1 to 9 and -alpha*g(dij) at pair nodes (1,6), (2,5), (5,8), (4,9).] Theorem (Kleinberg, 1999) The amino acids that are with the source in a min cut are H’s.
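The link between cuts and sequences can be checked numerically on the nine-residue example from the slides. The sketch below rests on two assumptions that the transcript does not spell out: the energy is alpha * (sum of g(dij) over pairs of H-residues) + beta * (sum of si over H-residues), and the arcs run source → pair node (capacity -alpha*g(dij)), pair node → its two residues (uncuttable), residue → sink (capacity beta*si). Exact fractions are used so that ties are not broken by rounding.

```python
from fractions import Fraction as F
from itertools import product

ALPHA, BETA = F(-8), F(1, 3)
# Solvent-accessible areas s_i of residues 1..9 (values from the slides).
S = {1: 3, 2: 18, 3: 6, 4: 9, 5: 3, 6: 9, 7: 6, 8: 24, 9: 9}
# Contact terms g(d_ij) for the four residue pairs listed on the slides.
G = {(1, 6): F(1, 2), (2, 5): F(3, 4), (5, 8): F(9, 10), (4, 9): F(3, 4)}

def energy(h):
    """Assumed energy: alpha * (contacts among H) + beta * (area of H)."""
    contacts = sum(g for (i, j), g in G.items() if i in h and j in h)
    return ALPHA * contacts + BETA * sum(S[i] for i in h)

# Assumed arc orientation: source -> pair node -> residue node -> sink.
INF = F(10**9)  # stands in for an uncuttable arc
ARCS = [(i, "t", BETA * s) for i, s in S.items()]
for (i, j), g in G.items():
    ARCS += [("s", (i, j), -ALPHA * g), ((i, j), i, INF), ((i, j), j, INF)]

def cut_capacity(h):
    """Capacity of the cheapest cut whose source side holds exactly the residues in h."""
    side = {"s"} | set(h) | {p for p in G if p[0] in h and p[1] in h}
    return sum(c for u, v, c in ARCS if u in side and v not in side)

subsets = [frozenset(i for i, bit in zip(S, bits) if bit)
           for bits in product((0, 1), repeat=len(S))]
offset = sum(-ALPHA * g for g in G.values())  # total capacity of the pair arcs

# Every cut costs a constant plus the energy, so min cut <=> fittest sequence.
assert all(cut_capacity(h) == offset + energy(h) for h in subsets)

best = min(energy(h) for h in subsets)
fittest = sorted("".join("H" if i in h else "P" for i in S)
                 for h in subsets if energy(h) == best)
print(best, fittest)
```

Under these assumptions the minimum energy for the example is 0 and it is attained by four sequences, so the min cut is not unique here. The brute-force enumeration over all 2^9 sequences is only a check; the point of the reduction is that a single max-flow computation finds a minimizer without it.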

  42. Basic Computational Scheme (1) 3D structure → network → a min cut → a fittest sequence (e.g., HPPPHHPHP)

  43. Problem #1 Input: • the parameters alpha and beta, • a protein sequence Y, • Y’s 3D structure, • the sequence length n of Y. Output: a fittest sequence X for the 3D structure with respect to the given alpha and beta. Applications of this problem: Design the best sequences for novel structures because we don’t really need Y.

  44. Tool #3: Linear Size Representation of All Min Cuts (1) Step 1: Compute a max flow of G. Step 2: Compute the residual network G’. Step 3: Contract every strongly connected component into a super node. Call the new graph G”. Def: A node subset U of G” is a closed set if for every node x in U, every descendant of x is also in U. Theorem: (Picard and Queyranne, 1980) Every closed set that contains the source and does not include the sink forms a min cut, and vice versa. [Figure: example network on nodes s, v1 to v7, and t with a max flow; each arc is labeled "capacity (flow)".]
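Steps 2 and 3 and the closed-set characterization can be sketched in Python on a toy network. Everything below is illustrative: the four-arc network and its max flow are supplied by hand (Step 1 is assumed done by any max-flow routine), strongly connected components are found by mutual reachability rather than a linear-time algorithm, and closed sets are enumerated by brute force, which is exponential in general. Closed sets are required to contain the source as well as to exclude the sink.

```python
from itertools import combinations

# Hypothetical toy network: arc -> (capacity, flow); the flow is a max flow.
ARCS = {("s", "a"): (1, 1), ("s", "b"): (1, 1), ("a", "t"): (1, 1), ("b", "t"): (1, 1)}

def residual_graph(arcs):
    """Step 2: arcs with leftover capacity, plus reversed arcs that carry flow."""
    adj = {}
    for (u, v), (cap, flow) in arcs.items():
        adj.setdefault(u, set())
        adj.setdefault(v, set())
        if cap - flow > 0:
            adj[u].add(v)
        if flow > 0:
            adj[v].add(u)
    return adj

def reachable(adj, start):
    """All nodes reachable from start, i.e., start and its descendants."""
    seen, stack = {start}, [start]
    while stack:
        for v in adj[stack.pop()]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def all_min_cuts(arcs, s="s", t="t"):
    adj = residual_graph(arcs)
    reach = {u: reachable(adj, u) for u in adj}
    # Step 3: a strongly connected component is a class of mutually reachable nodes.
    comps = {frozenset(v for v in reach[u] if u in reach[v]) for u in adj}
    comp_s = next(c for c in comps if s in c)
    free = [c for c in comps if s not in c and t not in c]
    cuts = []
    for k in range(len(free) + 1):
        for chosen in combinations(free, k):
            side = set(comp_s).union(*chosen)
            # Closed set: contains every descendant of each of its nodes.
            if t not in side and all(reach[u] <= side for u in side):
                cuts.append(frozenset(side))
    return cuts

for side in sorted(all_min_cuts(ARCS), key=sorted):
    print(sorted(side))
```

In this toy network every arc is saturated, so each of the four node sets {s}, {s, a}, {s, b}, and {s, a, b} is the source side of a min cut of capacity 2. In the protein setting each such set corresponds to one fittest sequence.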

  45. Tool #3: Linear Size Representation of All Min Cuts (2) Residual Network [Figure: residual network of the example on nodes s, v1 to v7, and t, with residual capacities on the arcs.]

  46. Tool #3: Linear Size Representation of All Min Cuts (3) Picard-Queyranne Representation [Figure: Picard-Queyranne representation of the example network on nodes s, v1 to v7, and t.]

  47. Tool #3: Linear Size Representation of All Min Cuts (4) Picard-Queyranne Representation Applications: • Obtain all fittest sequences. • Study the landscape of the fittest sequences. • Compute fittest sequences with additional optimization objectives. [Figure: Picard-Queyranne representation of the example network on nodes s, v1 to v7, and t.]

  48. Basic Computational Scheme (2) 3D structure → network → a max flow/min cut → Picard-Queyranne Representation → the space of all fittest sequences (e.g., HPPPHHPHP)

  49. Outline of Technical Discussions (5) • The Grand Canonical Model • Two Basic Computational Problems • Experimental Results • Combinatorial Tools (1a) Linear Programming (1b) Network Flow (1c) Compact Representation of All Min Cuts (1d) others • Further Algorithmic & Computational Hardness Results • Conclusions

  50. Problem #3 • Input: a 3D structure. • Output: all its fittest protein sequences. • Computational Complexity: (A) A linear size representation can be computed with 1 network flow. (B) Each individual fittest protein sequence can be generated from this representation in O(n) time.
