
Design of score functions for recognition of protein folds

Design of score functions for recognition of protein folds. T. Galor, joint work with Ron Elber and Thorsten Joachims. Structures are vital to understanding protein function. Researchers want to know: what are the active-site residues for an enzymatic reaction?

Presentation Transcript


  1. Design of score functions for recognition of protein folds. T. Galor. Joint work with Ron Elber and Thorsten Joachims.

  2. Structures are vital to understanding protein function. Researchers want to know: • What are the active-site residues for an enzymatic reaction? • Where are the active sites for signal transmission? The HIV protease plays a major role in cell infection. Knowing the binding sites is important for drug design. There is a need for a tool for finding protein structure.

  3. From homologues (evolutionarily related proteins) to structure. We present procedures for identifying homologous structures from a given library of structures that spans the Protein Data Bank. • An annotated homologous structure may give a clue to the function of the probe protein. • The homologous structure may be used as a scaffold for modeling the probe structure.

  4. Sequence alignment. S1: AWHFFAI S2: AHGI. Only sequence information is used, both in the query and in the target.

  5. Some homologues do not share high sequence similarity. Myoglobin (1mba) and leghemoglobin (1bin:A) share a similar structure, but their sequence identity is only 14%. When sequence similarity fails, we can use a different similarity measure: threading.

  6. Fold prediction by threading when sequence identity is low. Evaluate how an amino acid sequence fits into a known three-dimensional (3D) protein structure. Test the fitness of the probe sequence to structures from an existing library. Choose the most significant fit among the sequence/structure matches. MGFPIPDPYV … KGKI

  7. Definition of protein characteristics. A protein is characterized by its amino acid sequence S (S: AW….HI) or by its structure X (X: s1s0….s2s2). A structure site can be classified according to the number of contacts with its neighbors; the number of contacts determines whether a site is buried or exposed. Polar sites have a low number of contacts; hydrophobic sites have more. (Figure labels: W, H, I, A.)

  8. Focus on threading. S1: AWGHKI X2: s1s0s2s3s0s3. Sequence information is used for the probe protein and structural information for the target. (Figure labels: G, K, I, H.)

  9. Many ways to align two proteins. S1: AWHFFAI S2: AHGI. There are many different alignments of two sequences.

  10. The need to score the alignments. The score combines counts with costs: the number of amino acids of type ai placed at sites of type sj, and the number of gaps placed at sites of type sj, each weighted by its cost.
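
The counting described on this slide can be sketched in a few lines. This is a minimal illustration, not the authors' code: the representation of an alignment as a list of (amino acid, site type) pairs, the names `count_features` and `alignment_energy`, and the separate bucket for amino acids that are placed at no site are all assumptions made for the example.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

# One alignment column: (amino acid or None, site type or None).
# (None, "s2") is a gap placed at a site of type s2;
# ("W", None) is an amino acid not placed at any site (assumed convention).
Pair = Tuple[Optional[str], Optional[str]]


def count_features(alignment: List[Pair]):
    """Count amino-acid/site placements and gaps per site type."""
    contacts: Counter = Counter()   # (amino acid, site type) -> count
    gaps: Counter = Counter()       # site type -> number of gaps
    unplaced = 0                    # amino acids aligned to no site
    for aa, site in alignment:
        if aa is not None and site is not None:
            contacts[(aa, site)] += 1
        elif site is not None:
            gaps[site] += 1
        elif aa is not None:
            unplaced += 1
    return contacts, gaps, unplaced


def alignment_energy(alignment: List[Pair],
                     w: Dict[Tuple[str, str], float],
                     gap_cost: Dict[str, float],
                     unplaced_cost: float = 0.0) -> float:
    """Sum of counts times costs (hypothetical cost tables w and gap_cost)."""
    contacts, gaps, unplaced = count_features(alignment)
    energy = sum(n * w[key] for key, n in contacts.items())
    energy += sum(n * gap_cost[site] for site, n in gaps.items())
    return energy + unplaced * unplaced_cost


toy = [("A", "s1"), ("W", "s0"), (None, "s2"), ("G", "s3")]
contacts, gaps, unplaced = count_features(toy)
```

Because the energy is a sum of counts times costs, it is linear in the cost parameters, which is what makes the later parameter-fitting steps tractable.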

  11. The cost matrix W. A numerical value is assigned to every cell according to the fitness of placing an amino acid at a structural site. The values may be simple scores or more complicated ones, related to chemical similarity or to observed frequencies.

  12. How to find the optimal alignment. S1: PVRC S2: MPR X2: s1s1s3. Dynamic programming is an O(nm) algorithm that finds the optimal alignment, given a cost matrix and a gap cost, from an exponential number (2^(n+m)) of alignments.
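
A minimal sketch of the O(nm) dynamic program for threading a sequence onto a string of site types follows. It assumes a single constant gap cost, a global alignment, and a default cost for pairs missing from the table; the function name `thread` and the toy cost values are hypothetical and only serve the illustration.

```python
from typing import Dict, List, Tuple


def thread(seq: str,
           sites: List[str],
           w: Dict[Tuple[str, str], float],
           gap: float,
           default: float = 1.0) -> float:
    """Minimum alignment cost of seq against sites by dynamic programming."""
    n, m = len(seq), len(sites)
    # dp[i][j] = minimum cost of aligning seq[:i] with sites[:j]
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            place = dp[i - 1][j - 1] + w.get((seq[i - 1], sites[j - 1]), default)
            skip_residue = dp[i - 1][j] + gap   # amino acid left unplaced
            skip_site = dp[i][j - 1] + gap      # gap placed at the site
            dp[i][j] = min(place, skip_residue, skip_site)
    return dp[n][m]


# Toy costs (assumed): favourable placements get -1, everything else +1.
toy_w = {("M", "s1"): -1.0, ("P", "s1"): -1.0, ("R", "s3"): -1.0}
x2 = ["s1", "s1", "s3"]
cost_mpr = thread("MPR", x2, toy_w, gap=0.5)
cost_pvrc = thread("PVRC", x2, toy_w, gap=0.5)
```

The table has (n+1)(m+1) cells and each cell is filled in constant time, so the optimum over all alignments is found without enumerating them.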

  13. Present alignment scores are not accurate enough. There is a need for: • A better scoring cost. • A better gap cost. Or: • Accompanying the similarity measure (Seq2Seq/Seq2Struc) with a statistical measure that enhances the signal.

  14. The Z score measure. • Define a set of random sequences. Each random sequence is threaded to the same target structure. • We compare the alignment cost of the true probe sequence with the average cost of the random sequences. • This enables us to estimate the significance of the score of the probe compared to a typical score of a random sequence.
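
The Z score idea can be sketched as below. Several choices here are assumptions, not taken from the slides: random sequences are generated by shuffling the probe, the difference from the random mean is normalized by the standard deviation of the random costs, and the sign is chosen so that a lower (better) probe cost gives a positive Z. The `score` argument stands for any function that threads a sequence onto the fixed target and returns a cost.

```python
import random
import statistics
from typing import Callable


def z_score(probe: str,
            score: Callable[[str], float],
            n_random: int = 100,
            seed: int = 0) -> float:
    """Z score of the probe cost against costs of shuffled sequences."""
    rng = random.Random(seed)
    random_costs = []
    for _ in range(n_random):
        letters = list(probe)
        rng.shuffle(letters)          # same composition, random order
        random_costs.append(score("".join(letters)))
    mean = statistics.fmean(random_costs)
    std = statistics.pstdev(random_costs)
    if std == 0.0:
        raise ValueError("random costs have zero spread; Z score undefined")
    # Lower cost is better, so a good probe lies below the random mean.
    return (mean - score(probe)) / std
```

Shuffling keeps the amino acid composition of the probe, so the comparison isolates the contribution of residue order to the fit.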

  15. A novel method for designing energy cost function parameters. Current methods try to recognize native versus decoys. Difficulties: not exact; exponential number of alignments. But the goal is to identify homologues. The answer: recognize homologues versus decoys.

  16. The steps to design energy cost function parameters: • An energy function. • A training set. • An algorithm to train proteins to recognize their correct folds. • An algorithm to solve the “training” conditions: use mathematical programming to solve the set of equations. • Evaluation of the new score parameters: evaluate the performance on an independent set.

  17. Some notation: alignment. An alignment of protein SI into protein SII results in a path γ. Recall that there are many possible paths (2^(n+m)). Notation is also defined for the set of alignments.

  18. The cost matrix definition. The total energy of the alignment, denoted Etotal, is used as a measure to score the similarity between the two proteins.

  19. The cost function. Our cost function has a form whose labeled parts are the score, the alignment, and the alignment coefficient.
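
The equation itself did not survive in the transcript. One plausible form, consistent with the counts and costs described on slide 10, is sketched below. It is an assumption rather than the authors' formula: the symbols γ (alignment path), n_ij (number of amino acids of type a_i placed at sites of type s_j), n_j^gap (number of gaps placed at sites of type s_j), and the cost entries w_ij and w_j^gap are chosen here for illustration.

```latex
E_{\mathrm{total}}(\gamma)
  = \sum_{i}\sum_{j} n_{ij}(\gamma)\, w_{ij}
  + \sum_{j} n^{\mathrm{gap}}_{j}(\gamma)\, w^{\mathrm{gap}}_{j}
```

Under this form the energy is linear in the costs, so requiring one alignment to have lower energy than another becomes a linear inequality in the parameters.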

  20. Optimize the parameter W such that the native energy is the lowest. Instead of solving for the unknown optimal path, we solve for all paths. Unfortunately, the number of inequalities is exponentially large.

  21. Use a statistical machine learning algorithm: find the middle point w in the cone defined by the constraints. (Figure: cone in the w1, w2 plane; labels: constraint ΔX, w, s.)

  22. Algorithm (converges in polynomial time). Start with I = 0 and parameters wi. Compute the optimal alignment using dynamic programming. Solve ΔE > 0 to obtain wi+1. If ||wi − wi+1|| < 0.001, stop; otherwise set I = I + 1 and repeat.
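
The loop on this slide can be sketched as follows. This is a toy stand-in under stated assumptions, not the authors' implementation: alignments are represented directly by their count vectors, the dynamic programming step is replaced by a minimum over an explicit list of decoy vectors, and the SVM solve is replaced by a simple perceptron-style update that enforces ΔE ≥ margin. The names `solve_constraints` and `train` are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def solve_constraints(w: Vector, constraints: List[Vector],
                      margin: float = 1.0, lr: float = 0.1,
                      max_epochs: int = 1000) -> Vector:
    """Perceptron-style stand-in for the SVM step: push w until w.d >= margin."""
    w = list(w)
    for _ in range(max_epochs):
        violated = False
        for d in constraints:
            if dot(w, d) < margin:
                w = [wi + lr * di for wi, di in zip(w, d)]
                violated = True
        if not violated:
            break
    return w


def train(homolog: Vector, decoys: List[Vector], w0: Vector,
          tol: float = 1e-3, max_iter: int = 50) -> Vector:
    """Iteratively add the lowest-energy decoy as a constraint and re-solve."""
    w = list(w0)
    constraints: List[Vector] = []
    for _ in range(max_iter):
        # Stand-in for dynamic programming: best-scoring decoy under current w.
        best = min(decoys, key=lambda n: dot(w, n))
        d = [b - h for b, h in zip(best, homolog)]   # Delta E = w . d
        if d not in constraints:
            constraints.append(d)
        w_next = solve_constraints(w, constraints)
        if math.dist(w, w_next) < tol:               # ||w_i - w_{i+1}|| < tol
            return w_next
        w = w_next
    return w


homolog = [1.0, 0.0]
decoys = [[0.0, 1.0], [0.5, 0.5]]
w_fit = train(homolog, decoys, w0=[0.0, 0.0])
```

Only the most competitive decoy under the current parameters is added at each round, so the working set of inequalities stays small even though the full set is exponentially large.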

  23. Toy example, new method (sequence alignment). Number of sets in training: 1169. Number of errors: 225. Number of iterations: 6. Native: 1chg, length 245, chymotrypsinogen A. Homologue: 2alp. (Plot: −ΔE versus set number.)

  24. Toy example, old method. Number of sets in training: 1169. Number of errors: 725. Native: 1chg, length 245, chymotrypsinogen A. Homologue: 2alp. (Plot: −ΔE versus set label.)

  25. LOOPP (http://ser-loopp.tc.cornell.edu/loopp.html). Input: query sequence and cost model. Compute the cost of the alignment of the query to the proteins in the LOOPP database. Compute the Z score for the subset of proteins with the highest scores. Select the best prediction using a number of scores. Output: list of homologue targets plus alignments.

  26. Design of score parameters. (Diagram: a master node and nodes 1 to n; labels: LOOPP, LOOPP, SVM; analyze with a Perl script.)

  27. Summary. • We introduce a consistent method for designing cost function parameters using threading, which also makes it possible to design gap parameters. • We overcome the problem of solving an exponential number of inequalities by using an iterative SVM formulation.

  28. Tools to determine protein folds. • The number of new protein sequences is growing exponentially relative to the number of protein structures solved by experimental methods. • There is a need for a protein annotation tool that uses sequence and structural information. • The method needs to be quick • and to give reliable answers.
