
Approximation of Protein Structure for Fast Similarity Measures


Presentation Transcript


  1. Approximation of Protein Structure for Fast Similarity Measures Itay Lotan Fabian Schwarzer

  2. Comparing Protein Structures Same protein: [two conformations of the same chain shown side by side] • Analysis of MDS and MCS trajectories • Graph-based methods: Stochastic Roadmap Simulation (Apaydin et al, RECOMB ’02) • Structure prediction applications: evaluating decoy sets, clustering predictions (Shortle et al, Biophysics ’98) http://folding.stanford.edu

  3. k Nearest-Neighbors Problem Given a set S of conformations of a protein and a query conformation c, find the k conformations in S most similar to c. Can be done in O(NL) time, where N is the size of S and L is the time to compare two conformations.
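A minimal brute-force sketch of this query (assuming conformations are stored as NumPy coordinate arrays; the function names are illustrative and `dissimilarity` stands for any of the measures described on the next slides):

```python
import numpy as np

def k_nearest(query, conformations, k, dissimilarity):
    """Return the indices of the k conformations most similar to `query`.

    Brute force: one dissimilarity evaluation per stored conformation,
    i.e. O(N * L) time for N conformations.
    """
    scores = np.array([dissimilarity(query, c) for c in conformations])
    return np.argsort(scores)[:k]
```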

  4. k Nearest-Neighbors Problem What if the k nearest neighbors are needed for all c in S? Brute force then takes too much time. • Can be improved by: • Reducing L • A more efficient algorithm

  5. Our Solution • Reduce the structure description: approximate but fast similarity measures • Reduce the description further: efficient nearest-neighbor algorithms can be used

  6. Description of a Protein’s Structure 3n coordinates of Cα atoms (n – Number of residues)

  7. m-Averaged Approximation • Cut the chain into pieces of length m • Replace each sequence of m Cα atoms by its centroid • 3n coordinates reduce to 3n/m coordinates
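A sketch of the m-averaging step, assuming an (n, 3) NumPy array of Cα coordinates; how a leftover piece shorter than m is handled is a guess:

```python
import numpy as np

def m_average(coords, m):
    """Replace each run of m consecutive C-alpha positions by its centroid.

    coords: (n, 3) array of C-alpha coordinates.  The result has ~n/m rows,
    so the description shrinks from 3n to roughly 3n/m numbers.  A trailing
    piece shorter than m (when n is not divisible by m) is averaged as well.
    """
    pieces = [coords[i:i + m] for i in range(0, len(coords), m)]
    return np.array([piece.mean(axis=0) for piece in pieces])
```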

  8. Similarity Measures - cRMS The RMS of the distances between corresponding atoms after the two conformations are optimally aligned Computed in O(n) time
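A sketch of cRMS via the standard Kabsch superposition (the SVD-based rotation below is a textbook method, not necessarily the authors' exact implementation):

```python
import numpy as np

def crms(P, Q):
    """cRMS: RMS distance between corresponding atoms after optimal
    rigid-body superposition (Kabsch algorithm).

    P, Q: (n, 3) coordinate arrays with a fixed correspondence P[i] <-> Q[i].
    Linear in n apart from the constant-size 3x3 SVD.
    """
    P = P - P.mean(axis=0)                      # remove translation
    Q = Q - Q.mean(axis=0)
    U, _, Vt = np.linalg.svd(P.T @ Q)           # SVD of the 3x3 covariance
    d = np.sign(np.linalg.det(Vt.T @ U.T))      # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T     # optimal rotation of P onto Q
    diff = P @ R.T - Q
    return np.sqrt((diff ** 2).sum() / len(P))
```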

  9. Similarity Measures - dRMS The Euclidean distance between the intra-molecular distance matrices of the two conformations Computed in O(n²) time
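A sketch of dRMS; whether the sum over the n(n-1)/2 distance pairs is additionally normalized (to make it a true RMS) varies between definitions, so that detail is left out here:

```python
import numpy as np
from scipy.spatial.distance import pdist

def drms(P, Q):
    """dRMS: Euclidean distance between the intra-molecular distance
    matrices of two conformations.

    P, Q: (n, 3) coordinate arrays.  pdist returns the n(n-1)/2 upper-
    triangle entries of each distance matrix, so this is O(n^2) time.
    """
    return np.linalg.norm(pdist(P) - pdist(Q))
```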

  10. Evaluation: Test Sets • Decoy sets: conformations from the Park-Levitt set (Park et al, JMB ’97), N = 10,000 • Random sets: conformations generated by the program FOLDTRAJ (Feldman & Hogue, Proteins ’00), N = 5,000 • 8 structurally diverse proteins of 54 – 76 residues

  11. Correlation between the exact and m-averaged measures on the decoy sets (ranges over the test proteins):

  m  | cRMS        | dRMS
  3  | 0.99        | 0.96 – 0.98
  4  | 0.98 – 0.99 | 0.94 – 0.97
  6  | 0.92 – 0.99 | 0.78 – 0.93
  9  | 0.81 – 0.98 | 0.65 – 0.96
  12 | 0.54 – 0.92 | 0.52 – 0.69

  Higher correlation for the random sets!

  12. Speed-up for Decoy Sets • 9x for cRMS (m = 9) • 36x for dRMS (m = 6) with very small error For random sets the speed-up for dRMS goes up to 81x (m = 9)

  13. Efficient Nearest-Neighbor Algorithms There are efficient nearest-neighbor algorithms, but they are not directly compatible with these similarity measures: • cRMS is not a Euclidean metric • dRMS uses a space of dimensionality n(n-1)/2

  14. Further Dimensionality Reduction of dRMS • kd-trees require dimension ≤ ~20 • m-averaging with dRMS is not enough • Reduce further using SVD • SVD: a tool for principal component analysis; computes the directions of greatest variance

  15. Reduction Using SVD • Stack the m-averaged distance matrices as vectors • Compute the SVD of the entire set • Project onto the most important singular vectors • dRMS is thus reduced to ~20 dimensions • Without m-averaging, the SVD can be too costly
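A sketch of this reduction, combining m-averaging with an SVD of the stacked distance-matrix vectors; centering on the mean and the exact number of retained directions are assumptions here:

```python
import numpy as np
from scipy.spatial.distance import pdist

def svd_reduce(conformations, m=4, n_dims=16):
    """Map each conformation to a short vector whose Euclidean distances
    approximate the m-averaged dRMS.

    conformations: list of (n, 3) C-alpha coordinate arrays.
    Returns the projected (N, n_dims) matrix plus the mean vector and basis
    needed to project query conformations into the same space.
    """
    def m_avg(coords):
        pieces = [coords[i:i + m] for i in range(0, len(coords), m)]
        return np.array([p.mean(axis=0) for p in pieces])

    # One row per conformation: the upper triangle of its m-averaged
    # intra-molecular distance matrix, stacked as a vector.
    X = np.array([pdist(m_avg(c)) for c in conformations])
    mean = X.mean(axis=0)
    # Rows of Vt are the principal directions of the centered data.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    basis = Vt[:n_dims]
    return (X - mean) @ basis.T, mean, basis
```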

  16. Testing the Method • Use decoy sets (N = 10,000) and random sets (N = 5,000) • m-averaging with m = 4 • Project onto 16 PCs for decoys, 12 PCs for random sets • Find the k = 10, 25, 100 NNs for 250 conformations in each set

  17. Results • Decoy sets: • ~77% correct • Furthest approximate NN off by 10% - 15% • ~4k approximate NNs contain all true k NNs • Random sets: • 71%, 76%, 84% correct respectively • Furthest approximate NN off by 5% - 10% • ~3k approximate NNs contain all true k NNs

  18. More Results: N = 100,000 • 1CTF decoys: • ~70% correct • Furthest approximate NN off by ~20% • ~6k approximate NNs contain all true k NNs • 1CTF random: • 46%, 48%, 60% correct respectively • Furthest approximate NN off by ~16% • ~7k approximate NNs contain all true k NNs

  19. Running Time N = 100,000, m = 4, 16 PCs. Find the k = 100 NNs for each conformation: • Brute force: ~84 hours • Brute force + m-averaging: ~4.8 hours • Brute force + m-averaging + SVD: 41 minutes • kd-tree + m-averaging + SVD: 19 minutes kd-trees will have more impact for larger sets
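One way to realize the kd-tree step, assuming the SVD-reduced coordinates from the sketch after slide 15 (the random matrix below is only a stand-in so the snippet runs on its own; the timings on the slide come from the authors' implementation, not this snippet):

```python
import numpy as np
from scipy.spatial import cKDTree

# Stand-in for the (N, 16) matrix of SVD-reduced conformations.
projected = np.random.rand(100_000, 16)

tree = cKDTree(projected)                   # built once
dists, idx = tree.query(projected, k=100)   # 100 approximate NNs per conformation
```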

  20. Structural Classification Computing the similarity between structures of two different proteins is more involved (e.g. 2MM1 vs. 1IRD). The correspondence problem: which parts of the two structures should be compared?

  21. STRUCTAL (Subbiah et al, ’93) • Compute optimal correspondence using dynamic programming • Optimally align the corresponding parts in space to minimize cRMS • Repeat until convergence O(n1n2) time
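A sketch of the dynamic-programming correspondence step for two already-superimposed Cα traces; the per-pair score 20 / (1 + d²/5) and the gap penalty are placeholders, not necessarily STRUCTAL's exact parameters. The full method alternates this step with re-superposition (e.g. the Kabsch step from the cRMS sketch) until convergence:

```python
import numpy as np

def best_correspondence(P, Q, gap=10.0):
    """Find a residue correspondence between two superimposed C-alpha
    traces by global dynamic programming, in O(n1 * n2) time.

    P: (n1, 3) and Q: (n2, 3) coordinate arrays in a common frame.
    Returns the list of matched index pairs (i, j).
    """
    n1, n2 = len(P), len(Q)
    d2 = ((P[:, None, :] - Q[None, :, :]) ** 2).sum(axis=2)
    score = 20.0 / (1.0 + d2 / 5.0)          # placeholder similarity score

    F = np.zeros((n1 + 1, n2 + 1))
    F[1:, 0] = -gap * np.arange(1, n1 + 1)
    F[0, 1:] = -gap * np.arange(1, n2 + 1)
    for i in range(1, n1 + 1):
        for j in range(1, n2 + 1):
            F[i, j] = max(F[i - 1, j - 1] + score[i - 1, j - 1],
                          F[i - 1, j] - gap,
                          F[i, j - 1] - gap)

    # Trace back to recover the matched residue pairs.
    pairs, i, j = [], n1, n2
    while i > 0 and j > 0:
        if F[i, j] == F[i - 1, j - 1] + score[i - 1, j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif F[i, j] == F[i - 1, j] - gap:
            i -= 1
        else:
            j -= 1
    return pairs[::-1]
```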

  22. STRUCTAL + m-averaging • 256 protein domains (180 – 420 res) • 3691 good matches (Sandelin’s PROTOFARM) • 6375 random pairs • Compute SAS scores (cRMS / length × 100)

  m | correlation | speed-up
  3 | 0.81        | ~9x
  4 | 0.77        | ~16x
  5 | 0.70        | ~25x

  23. [Histogram of SAS scores: number of pairs vs. SAS score, separated into good matches (P < 0.005) and bad matches (P > 0.005)]

  24. Misclassification

  25. Random Chains [Figure: a random chain with joints c0, c1, ..., cn-1] • The dimensions are uncorrelated • Average behavior can be approximated by normal variables

  26. 1-D Haar Wavelet Transform Recursive averaging and differencing of the values:

  Level | Averages            | Detail coefficients
  3     | [ 9 7 2 6 5 1 4 6 ] |
  2     | [ 8 4 3 5 ]         | [ 1 -2 2 -1 ]
  1     | [ 6 4 ]             | [ 2 -1 ]
  0     | [ 5 ]               | [ 1 ]

  Full transform: [ 9 7 2 6 5 1 4 6 ] → [ 5 1 2 -1 1 -2 2 -1 ]
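A sketch of the transform; running it on the slide's example reproduces the values above:

```python
import numpy as np

def haar_transform(values):
    """1-D Haar wavelet transform by recursive averaging and differencing.

    Returns [overall average, detail coefficients from coarsest to finest].
    Assumes len(values) is a power of two.
    """
    values = np.asarray(values, dtype=float)
    details = []
    while len(values) > 1:
        avg = (values[0::2] + values[1::2]) / 2.0    # pairwise averages
        det = (values[0::2] - values[1::2]) / 2.0    # pairwise differences
        details.insert(0, det)                       # coarser levels go first
        values = avg
    return np.concatenate([values] + details)

# haar_transform([9, 7, 2, 6, 5, 1, 4, 6]) -> [5, 1, 2, -1, 1, -2, 2, -1]
```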

  27. Transform of Random Chains • The pdf of the detail coefficients: [formula not preserved in the transcript] • Coefficients are expected to be ordered! • Discard coefficients starting at the lowest level • Discarding the lowest levels of detail coefficients corresponds to m-averaging

  28. Random Chains and Proteins • Protein backbones behave on average like random chains • Chain topology • Limited compactness

  29. Conclusion • Fast computation of similarity measures • Trade-off between speed and precision • Exploits chain topology of proteins and limited compactness • Allows use of efficient nearest-neighbor algorithms • Can be used as filter when precision is important
