1 / 41

Lecture 7. Computing Protein Structures

CS882, Fall 2006. Lecture 7. Computing Protein Structures. Current attempts: Threading: RAPTOR Consensus: ACE Fragment assembly Can we compute the protein structures eventually? Your projects. Homologous proteins have similar structure and functions.

blithe
Download Presentation

Lecture 7. Computing Protein Structures

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. CS882, Fall 2006 Lecture 7. Computing Protein Structures • Current attempts: • Threading: RAPTOR • Consensus: ACE • Fragment assembly • Can we compute the protein structures eventually? Your projects.

  2. Homologous proteins have similar structure and functions • Being homologous means that they have evolved from a common ancestral gene. Hence at least in the past they had the same structure and function. • Caution: old genes can be recruited for new functions. Example: a structural protein in eye lens is homologous to an ancient glycolytic enzyme. • Homology search is done by BLAST, or PatternHunter for more sensitivity. BLAST will work with over 30% sequence identity.

  3. Conserving core regions • Homologous proteins usually have conserved core regions. • When we model one protein after a similar protein with known structure, the main problem becomes modeling loop regions. • Modeling loops can also depend on database to some degree. • Side chains: on a few side-chain conformations frequently occur – they are called rotamers, there is a such a database.

  4. Primary, secondary, and tertiary • There are many secondary structure prediction programs. However, without considering tertiary structure, we will never be correct solely predicting secondary structures. • Most tertiary structure prediction programs today depend on good secondary predictions. This is also not good: you cannot get right tertiary structure with wrong starting information. • They must be done together.

  5. There are not too many candidates! • There are only about 1000 topologically different domain structures. There is no reason whatsoever that we cannot compute their structures accurately. • Ab initio method – we have heard about it. • Another promising method is threading (separate lecture). • After threading, an important step is “refinement”, perhaps by fragment assembly. This will be a separate topic (Xin Gao). • Folding membrane proteins is a quite different topic (Richard Jang). • Now we go to threading.

  6. target sequence • MTYKLILNGKTKGETTTEAVDAATAEKVFQYANDNGVDGEWTYTEtemplate library Protein Threading • Make a structure prediction through finding an optimal placement (threading) of a protein sequence onto each known structure (structural template) • “placement” quality is measured by some statistics-based energy function • best overall “placement” among all templates may give a structure prediction

  7. Threading Example

  8. Introduction to Linear Program • Optimize (Maximize or Minimize) a linear objective function • e.g. 2x+3y+4z • The variables satisfy some linear constraints. e.g. • x+y-z >=1 • 2x+y+3z=3 • integer program (IP) =linear program (LP) + integral variables • LP can be solved within polynomial time --- Interior point method. Simplex method also runs fast. We used IBM package. • Polynomial time for IP not likely, NP-hard • IP can be relaxed to LP, solve the non-integral version • Branch-and-bound or branch-and-cut (may cost exponential time)

  9. Why Integer Programming? • Treat pairwise potentials rigorously • critical for fold-level targets • Existing Exact algorithms for pairwise potentials • High memory requirement, or • Expensive computational time • Exploit correlations between various kinds of item scores in the energy function • 99% real data generate integral solutions directly, no branch-and-bound needed.

  10. Different approaches • Approximation Algorithm • Interaction-Frozen Algorithm (A. Godzik et al.) • Monte Carlo Sampling (T. Madej et al.) • Double dynamic programming (D. Jones et al.) • Recursive dynamic programming (R. Thiele et al.) • Exact Algorithm • Branch-and-bound (R.H. Lathrop et al.) • Exploit the relationship among various scoring parameters, fast self-threading • Divide-and-conquer (Y. Xu et al.) • Exploit the topological structure of template contact graphs

  11. Formulating Protein Threading by LP • Protein Threading Needs: • Construction of Template Library • Design of Energy Function • Sequence-Structure Alignment • Template Selection and Model Construction

  12. Threading Energy Function how preferable to put two particular residues nearby: Ep (Pairwise potential) how well a residue fits a structural environment: Es (Fitness score) sequence similarity between query and template proteins: Em (Mutation score) alignment gap penalty: Eg (gap score) Consistency with the secondary structures: Ess E= Ep + Es + Em + Eg + Ess Minimize E to find a sequence-structure alignment

  13. Contact Graph • Each residue as a vertex • One edge between two residues if their spatial distance is within a given cutoff. • Cores are the most conserved segments in the template: alpha-helix, beta-sheet template

  14. Simplified Contact Graph

  15. Contact Graph and Alignment Diagram

  16. Contact Graph and Alignment Diagram

  17. Variables • x(i,l) denotes core i is aligned to sequence position l • y(i,l,j,k) denotes that core i is aligned to position l and core j is aligned to position k at the same time.

  18. Formulation 1 Eg , Ep Es , Ess , Em Encodes scoring system Encodes interaction structures: the first makes sure no crosses; the second is quadratic, but can be converted to linear: a=bc is eqivalent to: a≤b, a≤c, a≥b+c-1

  19. Formulation used in RAPTOR Eg, Ep Es, Ess, En Encodes scoring system Encodes interaction structures

  20. Solving the Problem Practically • More than 99% threading instances can be solved directly by linear programming, the rest can be solved by branch-and-bound with only several branch nodes • Less memory consumption • Less computational time • Easy to extend to incorporate other constraints

  21. CPU Time for CAFASP3 targets

  22. Fold Recognition • Support Vector Machines (SVM) Approach • Features are extracted from the alignments • A threading pair is treated as a positive pattern only if they are in at least fold-level similarity • 60,000 threading pairs are employed to train SVM model. • 5% more targets are recognized by SVM approach than the traditional z-Score

  23. Part II. Experiments

  24. Target Category Hard Easy Prediction Difficulty CM: Comparative Modelling, HM: Homology Modelling FR: Fold Recogniton, NF: New Fold

  25. Lindahl Benchmark Test 976*975 threading pairs are tested, the results of other servers are taken from Shi et al.’s paper.

  26. LiveBench Test LiveBench 6 LiveBench 7 (http://bioinfo.pl/LiveBench)

  27. CASP5/CAFASP3 • 62 targets • Time allowed for each target: • Individual Servers: 48 hours • Meta Servers: 48 hours • Predictors: computer program, no manual intervention (CAFASP3) • Evaluated by computer program • RAPTOR was voted by CASP5 attendees as the most novel approach, at http://forcasp.org CAFASP3: The Third Critical Assessment of Fully Automated Structure Prediction

  28. CAFASP3 Evaluation Criteria • Model • Only the first submission considered for each target, • each server can submit 10 models for each target, • MaxSub (evaluation program) • Superimpose the predicted structure with the experimental structure • Calculate the length of maximum superimposable subsegment within 5Å RMSD • one prediction is regarded as correct only if the length is above a given value.

  29. CAFASP3 Evaluation Criteria • Sensitivity (N-1 Rule) • One miss allowed for each server, i.e., the first models of N-1 out of N targets ranked • Specificity • Rank the first models of all targets according to their zScores • S(M): # Correct before the first M false positives • Average of S(1),S(2),…,S(5)

  30. Specificity Example S(1)=2 S(2)=3 First false positive Second false positive

  31. Sensitivity on FR targets (1) 30 FR targets 54 servers (http://ww.cs.bgu.ac.il/~dfischer/CAFASP3, released on Dec., 2002.)

  32. Sensitivity on FR targets (2) • RAPTOR is weak at recognizing FR(A) targets (need improvement ) • RAPTOR cannot deal with NF targets at all (normal)

  33. Sensitivity on Hard HM targets

  34. Specificity of Servers Out of 33 Targets

  35. CAFASP3 Example • Target ID: T0136_1 • Target Size:144 • Superimposable size within 5Å: 118 • RMSD:1.9Å Red: Experimental Structure Blue/green: RAPTOR model

  36. CASP6, T0199-2, ACE buffalo rank: 9thFrom RAPTOR rank 1 model. TM=0.4183 MaxSub=0.2857. Good parts: 116-134, 286-332 Left: predicted structure. Right: experimental structure

  37. CASP6, T0203 ACE buffalo rank: 1stFrom RAPTOR 2nd model. TM=0.6041, MaxSub=0.3485. Good parts: 19-57, 89-94, 139-178, 224-239, 312-372 RAPTOR first Model ranks 5th Predicted Experimental

  38. CASP6, T0262-2, ACE buffalo rank: 4thFrom Fugue3 6th model. TM=0.4306, MaxSub=0.3459. Good parts: 162-203 Fugue’s top model ranks low Predicted Experimental

  39. CASP6, T0242, NF, ACE buffalo rank: 1From RAPTOR rank 5 model.TM score=0.2784, MaxSub score=0.1645 However, RAPTOR top model ranks 44th ! Trivial error? Predicted Experimental

  40. CASP6, T0238, NF ACE buffalo rank 1stFrom RAPTOR 8th model TM=0.2748, MaxSub=0.1633Good part: 188-237. High TM score, low MaxSub Raptor top model ranks 4th Predicted Experimental

  41. About RAPTOR • Jinbo Xu’s Ph.D. thesis work. • The RAPTOR system has benefited significantly from PROSPECT (Ying Xu, Dong Xu, et al). • Currently distributed by BSI. • References:J. Xu, M. Li, D. Kim, Y. Xu, Journal of Bioinformatics and Computational Biology, 1:1(2003), 95-118. J. Xu, M. Li, PROTEINS: Structure, Function, and Genetics, CASP5 special issue.

More Related