
Advancement to Candidacy Computer Science Department by Rachel Karchin Advisor: Kevin Karplus

Local Statistical Dependencies in Protein Structure: Discovery, Evaluation, Prediction and Applications. Advancement to Candidacy Computer Science Department by Rachel Karchin Advisor: Kevin Karplus. Outline. Protein structure - primary, secondary, tertiary





Presentation Transcript


  1. Local Statistical Dependencies in Protein Structure: Discovery, Evaluation, Prediction and Applications. Advancement to Candidacy, Computer Science Department, by Rachel Karchin. Advisor: Kevin Karplus.

  2. Outline • Protein structure - primary, secondary, tertiary • Fold recognition, local and secondary structure • Alphabets of local structure • Designing and evaluating local structure alphabets • Improving fold recognition

  3. Molecular structure of proteins • Proteins are large organic molecules composed of smaller molecules called amino acids. Figure: ball-and-stick atomic model of crambin, a plant seed protein with 44 amino acids (labeled residues: threonine, cysteine, arginine).

  4. The amino acids • There are 20 kinds of amino acids found in natural proteins. • All share a common structure. Figure labels: R side chain, amine group, carboxyl group, alpha carbon (with attached hydrogen). Source: Mathews, Biochemistry, 3rd ed., Addison-Wesley.

  5. Primary structure • Proteins consist of one or more polypeptide chains of amino acids connected by peptide bonds. • The sequence of linked amino acids along the chain is called the protein’s primary structure, e.g. Phe-Leu-Ser-Cys ... (FLSC ...). Figure sources: Access Excellence; NHGRI Graphics Gallery.

  6. Secondary structure • Secondary structure consists of symmetric patterns of hydrogen bonds between amino acids. • Helix: H-bonds between residues close in primary sequence. Figure: Anthony Day / Pace et al. 1996.

  7. Secondary structure • Strand: H-bonds between residues not close in primary sequence. Figure: Anthony Day / Pace et al. 1996.

  8. Protein Folding • In an aqueous environment (such as cell cytoplasm), polypeptide chains fold into 3D shapes (tertiary structure).

  9. From primary to tertiary structure • A protein’s 3D shape is determined by its primary amino acid sequence (Anfinsen et al. 1963). • Predicting tertiary structure from amino acid sequence is an unsolved problem. • It is difficult to model the energies that stabilize a protein molecule. • The conformational search space is enormous. Figure: Laboratory of Molecular Biophysics, University of Oxford.

  10. Fold recognition • In nature, proteins are observed to assume on the order of a thousand shapes or “folds”. Source: Mathews, Biochemistry, 3rd ed., Addison-Wesley.

  11. Fold recognition • Given a target amino acid sequence: • search a set of known folds by aligning the target with a template representative of each fold • predict the fold that gets the best-scoring alignment. Figure: target amino acid sequence YLAADTYK is scored against each template amino acid sequence in the template fold library (FISSETCN, MEPSSYV, TGLIRKN); target/template scores shown: 7, 21, 2.
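
The search described on this slide can be sketched as follows. This is a minimal illustration only: it assumes a simple global alignment score (match +1, mismatch -1, gap -1) rather than whatever scoring the fold recognition system actually used, and the fold names `fold_a`, `fold_b`, `fold_c` are hypothetical labels attached to the template sequences shown in the figure.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment score (score only, no traceback)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(diag, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    return prev[-1]


def recognize_fold(target, library):
    """Score the target against every template and return the best fold."""
    scores = {fold: global_alignment_score(target, seq)
              for fold, seq in library.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical fold labels; sequences taken from the slide's figure.
LIBRARY = {"fold_a": "FISSETCN", "fold_b": "MEPSSYV", "fold_c": "TGLIRKN"}
best_fold, all_scores = recognize_fold("YLAADTYK", LIBRARY)
print(best_fold, all_scores)
```

The scores printed here will not match the numbers in the figure, which come from a different scoring scheme.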

  12. Twilight zone sequence relationships • This method is very effective when target and template have > 30% sequence identity. • Approximately 1/3 of protein sequences can be assigned folds and modeled this way. • We would like to extend the method to sequences in the twilight zone (< 30% identity to any sequence of known structure).
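
As a rough illustration of the 30% threshold, percent identity can be computed from a pairwise alignment. This sketch assumes identity is measured over columns where both sequences have a residue; other definitions (for example, dividing by the shorter sequence length) are also in use, and the function names are hypothetical.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b)
             if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)


def in_twilight_zone(aligned_a, aligned_b, threshold=30.0):
    """True when identity falls below the threshold quoted on the slide."""
    return percent_identity(aligned_a, aligned_b) < threshold


print(percent_identity("YLAADTYK", "YLASDS-R"))
print(in_twilight_zone("YLAADTYK", "FISTE-HR"))
```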

  13. SAM-T98 • Build a target HMM of amino acid frequencies from a multiple alignment of target plus homologs (SAM-T98). Figure (courtesy of K. Karplus): the target amino acid sequence YLAADTYK is used to search a protein database for homologs; the resulting multiple alignment (YLAADTYK, FISTE-HR, HVATD-H-, -ITA--HR, YLASDS-R) is used to build the target amino acid HMM.
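
The "amino acid frequencies" part of this step can be sketched by counting residues in each column of the slide's example alignment. This is only an assumption-laden illustration: it ignores gaps, and it omits the sequence weighting, regularization, and insert/delete transitions that a real HMM builder would include. The function name and the `pseudocount` parameter are hypothetical.

```python
from collections import Counter

# Multiple alignment shown on the slide (target plus homologs).
ALIGNMENT = ["YLAADTYK", "FISTE-HR", "HVATD-H-", "-ITA--HR", "YLASDS-R"]
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_frequencies(alignment, pseudocount=0.0):
    """Per-column amino acid frequencies, ignoring gap characters."""
    ncols = len(alignment[0])
    if any(len(row) != ncols for row in alignment):
        raise ValueError("all rows must have the same length")
    profile = []
    for col in range(ncols):
        counts = Counter(row[col] for row in alignment if row[col] != "-")
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        freqs = {}
        for aa in AMINO_ACIDS:
            value = counts[aa] + pseudocount
            if total > 0 and value > 0:
                freqs[aa] = value / total
        profile.append(freqs)
    return profile


profile = column_frequencies(ALIGNMENT)
print(profile[0])  # frequencies for the first alignment column
```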

  14. SAM-T98 • Amino acid HMM for target; amino acid strings for templates. • Three-fold increase in recognizing twilight zone similarities (Park et al. 1998). Figure (courtesy of K. Karplus): the target amino acid HMM is scored against each template amino acid sequence in the template fold library (FISSETCN, MEPSSYV, TGLIRKN); target/template scores shown: 7, 21, 2.

  15. SAM-T98 enhancements • Two-way scoring • Augment the method with secondary structure information.

  16. Two-way SAM-T98 • Also build amino acid HMMs for templates. Do 2-way scoring to strengthen recognition of twilight zone relationships. Figure: the target amino acid sequence YLAADTYK is scored against the template amino acid HMMs in the template fold library; target/template scores shown: 19, 82, 31.

  17. Secondary structure • DSSP alphabet (Kabsch and Sander 1983). Classifies the secondary structure of a residue using known tertiary structure. • Basic patterns: turn (T), bend (S), bridge (B). • Repeating turns: 3-10 helix (G), alpha helix (H), pi helix (I). • Repeating bridges: beta strand (E). • Other: random coil (C). Source: Mathews, Biochemistry, 3rd ed., Addison-Wesley.

  18. Secondary structure • Alternatives to DSSP definitions. • Collapse 8 classes to 3: H, E, C. • Other programs to automate assignment: • Define (Richards and Kundrot 1988) • P-Curve (Sklenar 1989) • Adzhubei and Sternberg (1993) • STRIDE (Frishman and Argos 1995) • xlsstr (King and Johnson 1999)
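
Collapsing the 8 DSSP classes to 3 amounts to a lookup table. The slide does not say which reduction is intended, so the mapping below is an assumption: it is one commonly used choice (helices G, H, I to H; E and B to E; everything else to C), and other reductions assign G, I, or B to coil instead.

```python
# Assumed 8-to-3 reduction; the slide does not specify the mapping.
DSSP_TO_3 = {
    "H": "H", "G": "H", "I": "H",   # helices
    "E": "E", "B": "E",             # strand and isolated bridge
    "T": "C", "S": "C", "C": "C",   # turn, bend, coil
}


def reduce_dssp(dssp_string):
    """Map a DSSP 8-state string to H/E/C; unknown symbols become coil."""
    return "".join(DSSP_TO_3.get(ch, "C") for ch in dssp_string)


print(reduce_dssp("CCHHHHGGTTEEEEBSC"))
```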

  19. Predicting secondary structure • Extensive research on predicting secondary structure from primary sequence. • Neural nets are the most successful approach. • PHD (Rost and Sander 1996) • Predict_2nd (Karplus and Barrett 1998) • The best methods are around 75-80% accurate.

  20. Secondary structure and fold recognition • Predicted secondary structure has been shown to be useful for fold recognition (Russell et al. 1998). • Fold recognition accuracy is correlated with secondary structure prediction accuracy (Di Francesco 1995, 1997, 1999). • Why? • Structure is more conserved than sequence. • Proteins in the same fold family have similar topologies (secondary structure elements have similar lengths, spatial organization and connectivities).

  21. Two-track SAM-T2K • Predicted probability vectors of secondary structure are added to the target HMM. Figure: the target amino acid sequence YLAADTYK and its multiple alignment (YLAADTYK, FISTE-HR, HVATD-H-, -ITA--HR; courtesy of K. Karplus) give the target two-track HMM (courtesy of C. Barrett), with these predicted probabilities per target residue:
      Residue  P(H)  P(E)  P(C)
      Y        0.65  0.2   0.15
      L        0.15  0.7   0.25
      A        0.01  0.04  0.9
      A        0.47  0.45  0.08
      D        0.85  0.1   0.05
      T        0.32  0.18  0.5
      Y        0.81  0.09  0.1
      K        0.5   0.25  0.15

  22. Two-track SAM-T2K • Search a template library of sequence pairs with the two-track target HMM. Figure (courtesy of K. Karplus): each template in the fold library is a pair of sequences, amino acid and secondary structure (TGLIRKN / EEECEEE, MEPSSYV / HHHHCCE, FISSETCN / CCEECHHH); target/template scores shown: 22, 68, 15.

  23. Motivation for alternatives to secondary structure classes • What’s wrong with secondary structure classes? • The most widely used secondary structure alphabet (3-state DSSP) is crude (Helix, Strand, Coil). • Secondary structure classes are ambiguous. • Automated assignment methods disagree. • 63% agreement between DSSP, Define and P-Curve (Colloc’h et al. 1993).

  24. Local structure and fold recognition • What is local structure? • It describes the environment of a residue • and a residue’s relationship to its neighbors. • This information can be used to predict fold from primary structure. • Requires comparing the local structure of target and template: the template’s is known, the target’s must be predicted (easier than predicting 3D structure).

  25. Low level descriptions of local structure • Lowest level representation of protein structure: atomic position vectors. Source: Conformations of Biopolymers, IUPAC-IUB.
      Record  Atom no.  Atom type  Residue type  Residue no.  X       Y       Z
      ATOM    1         CA         THR           1            7.047   14.099  3.625
      ATOM    2         C          THR           1            16.967  12.784  4.338
      ATOM    3         O          THR           1            15.685  12.755  5.133
      ATOM    4         N          SER           2            15.115  11.555  5.265
      ATOM    5         CA         SER           2            13.856  11.469  6.066
      ATOM    6         C          SER           2            14.164  10.785  7.379
      ATOM    7         O          SER           2            14.993  9.862   7.443
      ATOM    8         CB         SER           2            12.732  10.711  5.261
      ATOM    9         N          CYS           3            13.488  11.241  8.417
      ATOM    10        CA         CYS           3            13.660  10.707  9.787

  26. Low level descriptions of local structure • “One level up”. From atomic position vectors we can derive a list of properties that describe a residue’s local environment. Source: Conformations of Biopolymers, IUPAC-IUB.

  27. Dihedral and bond angles • Dihedral angles are defined by 4 atoms. • Bond angles are defined by 3 atoms. Source: Conformations of Biopolymers, IUPAC-IUB.
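
Both angle types can be computed directly from atomic position vectors. The sketch below uses plain tuples and the standard `atan2` formulation for the dihedral; the function names are hypothetical, and the coordinates in the usage lines are made-up points chosen so the angles are easy to check by hand.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def bond_angle(p1, p2, p3):
    """Angle in degrees at p2 between the bonds p2-p1 and p2-p3."""
    v1, v2 = _sub(p1, p2), _sub(p3, p2)
    cos_theta = _dot(v1, v2) / math.sqrt(_dot(v1, v1) * _dot(v2, v2))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_theta))))


def dihedral_angle(p1, p2, p3, p4):
    """Dihedral angle in degrees about the p2-p3 bond, in (-180, 180]."""
    b1, b2, b3 = _sub(p2, p1), _sub(p3, p2), _sub(p4, p3)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


# Made-up coordinates: a right angle, then a 90-degree twist.
print(bond_angle((1, 0, 0), (0, 0, 0), (0, 1, 0)))
print(dihedral_angle((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))
```

With this formulation a cis arrangement gives 0 degrees and a trans arrangement gives 180 degrees.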

  28. Dihedral angles: Phi, Psi, Omega • The 6 atoms in each peptide unit lie in the same plane. • ω = 180° (trans) or 0° (cis). • φ and ψ are free to rotate. Source: Mathews, Biochemistry, 3rd ed., Addison-Wesley.

  29. Dihedral angles: Phi, Psi, Omega • Result: a good approximation of the polypeptide backbone is a list of (φ, ψ) pairs (cis ω is rare). • (φ, ψ) pairs are often represented on a plane called the Ramachandran plot. Source: http://www.biochem.artizona.edu, Biochemistry 462A Lecture Notes.

  30. A small gallery of properties: the geometry of local structure • Kappa: virtual bond angle between Cα of residues i-2, i, i+2. • Zeta: dihedral angle between carbonyl bonds of residues i and i-1. • Alpha: virtual dihedral angle between Cα of residues i-1, i, i+1, i+2. • Tau: virtual bond angle between Cα of residues i-1, i, i+1.

  31. Relationship of a residue to its neighbors • Density measures: how many residues are within a given distance? Example: 12 neighboring residues within a 6 Å radius. • Count of H-bond partners. Example: 2 H-bond partners.
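
A density measure of this kind can be sketched as a brute-force neighbor count. This assumes each residue is represented by a single point (for example its Cα position, which is an assumption; the slide does not say which atom is used), and the function name is hypothetical.

```python
import math


def neighbor_counts(coords, radius=6.0):
    """For each residue, count the other residues within `radius` angstroms.

    `coords` holds one (x, y, z) point per residue.
    """
    counts = []
    for i, p in enumerate(coords):
        n = sum(1 for j, q in enumerate(coords)
                if j != i and math.dist(p, q) <= radius)
        counts.append(n)
    return counts


# Made-up coordinates for three residues along a line.
print(neighbor_counts([(0, 0, 0), (3, 0, 0), (10, 0, 0)], radius=6.0))
```

The double loop is quadratic in the number of residues; a spatial grid or k-d tree would be the usual choice for whole-database extraction.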

  32. Existing local structure alphabets • Approximately 30 alphabets of local structure in the literature. • Can they be used to improve fold recognition?

  33. Phi/psi alphabets • Classes based on a partition of phi/psi space. • Bystroff et al. 2000: 10 classes (B E b d e G H L I x). • Sun et al. 1996: DSSP H, E plus 5 phi/psi classes (a b e l t). • Kang et al. 1993: 1296 classes from uniform partitioning into 10° bins. Figure: Bystroff et al. 2000.

  34. Backbone fragment alphabets • Classes based on clustering low-level properties of a contiguous series of residues. • Unger et al. 1987: ~100 6-residue fragments. • k-nearest neighbor clustering by RMSD of Cα atoms. • Centroid of each cluster selected as building block. Figure: Unger et al. 1987.

  35. Backbone fragment alphabets • De Brevern et al. 2000: Protein Building Blocks (PBBs). • 16 classes of 5-residue fragments. • SOM clustering of vectors of 8 dihedral angles (φ and ψ). Figure: De Brevern et al. 2000.
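
The 1296-class count follows from uniform 10-degree bins: 36 bins for phi times 36 bins for psi. A minimal sketch of such a partition is below; the class numbering (phi bin times 36 plus psi bin, with bins starting at -180 degrees) is an assumption made for illustration, not the published labeling.

```python
def phi_psi_class(phi, psi, bin_width=10):
    """Map a (phi, psi) pair in degrees to a class index on a uniform grid."""
    bins = 360 // bin_width                    # 36 bins per angle for 10 degrees
    i = int(((phi + 180.0) % 360.0) // bin_width)
    j = int(((psi + 180.0) % 360.0) // bin_width)
    return i * bins + j                        # 0 .. bins*bins - 1


# Roughly alpha-helical angles.
print(phi_psi_class(-57.0, -47.0))
```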

  36. Desired properties of local structural alphabets • For purposes of improving fold recognition: • Predictable from primary sequence • Conserved within a fold family

  37. Comparison of existing local structure alphabets • Only a few of the alphabets have been tested for predictability. • None of the alphabets have been tested for conservation within fold families.

  38. Designing a Local Structure Alphabet • Extract properties with respect to each residue in the dataset. Selected property: TCO (defined between residues i-1 and i). Property extraction from selected PDB structures:
      PDB no.  AA  TCO
      1        M   -0.3
      2        L   -0.34
      3        S   0.91
      4        P   0.935
      5        E   -0.1
      6        V   0.2
      ...

  39. Designing a Local Structure Alphabet • Partition the data into k populations with an unsupervised learning algorithm. Figure: the TCO values above, plotted on an axis from -1 to 1, are split into two classes.
      Class A: 1 M -0.3; 2 L -0.34; 5 E -0.1
      Class B: 3 S 0.91; 4 P 0.935; 6 V 0.2
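
One possible unsupervised partition is k-means on the one-dimensional property values. The slide does not name the algorithm, so k-means here is an assumption, as is the deterministic initialization (centers spread evenly between the minimum and maximum). The values are the six TCO numbers from the slide's example.

```python
def kmeans_1d(values, k=2, max_iter=100):
    """Plain 1-D k-means; returns (labels, centers)."""
    lo, hi = min(values), max(values)
    if k == 1:
        centers = [sum(values) / len(values)]
    else:
        centers = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: abs(v - centers[c]))
                      for v in values]
        if new_labels == labels:          # assignments stable: converged
            break
        labels = new_labels
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:                   # keep old center if cluster is empty
                centers[c] = sum(members) / len(members)
    return labels, centers


TCO = [-0.3, -0.34, 0.91, 0.935, -0.1, 0.2]   # residues 1..6 from the slide
labels, centers = kmeans_1d(TCO, k=2)
print(labels, centers)
```

On these six values k-means groups residue 6 (TCO 0.2) with the negative values, whereas the slide's illustration places it in Class B; the slide does not say which algorithm or cut point produced its split.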

  40. Designing a Local Structure Alphabet • Selected property: KJ descriptor vector* [ζ, τ, d1, d2, d3]. • ζ (zeta) and τ (tau): angles defined on neighboring residues. • d1 = dison3: H-bond length from O(i) to N(i+3). • d2 = dison4: H-bond length from O(i) to N(i+4). • d3 = discn3: length from C(i) to N(i+3). * Descriptor vector of key geometric properties identified by King and Johnson 1999.

  41. Designing a Local Structure Alphabet • Extract properties with respect to each residue in the dataset. Selected property: KJ descriptor vector [ζ, τ, d1, d2, d3]. Property extraction from selected PDB structures:
      PDB no.  AA  KJDV
      1        M   [13.6, 92.9, 3.7, 3.1, 4.1]
      2        L   [14.4, 95.7, 4.9, 7.1, 4.9]
      3        S   [19.8, 100.3, 7.2, 10.1, 6.9]
      4        P   [18.1, 116.2, 6.7, 9.2, 6.9]
      ...

  42. Designing a Local Structure Alphabet • Clustering multi-dimensional data points (the KJ descriptor vectors extracted for each residue, e.g. 1 M [13.6, 92.9, 3.7, 3.1, 4.1]). • Components are in different units. Scale to the same range? • Very high dimensional vectors require feature reduction.
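
One way to put components with different units on the same range is min-max scaling of each component to [0, 1]. The slide only raises scaling as a question, so this is one option rather than the chosen method. The three vectors are descriptor vectors read from the slide's example (the first with its digits rejoined, which is an assumption), and the function name is hypothetical.

```python
def min_max_scale(vectors):
    """Scale each component (column) to [0, 1] across all vectors.

    A component with no spread is mapped to 0.0 to avoid division by zero.
    """
    columns = list(zip(*vectors))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [(x - lo) / (hi - lo) if hi > lo else 0.0
         for x, lo, hi in zip(row, lows, highs)]
        for row in vectors
    ]


VECTORS = [
    [13.6, 92.9, 3.7, 3.1, 4.1],     # residue 1 (digits rejoined; assumed)
    [19.8, 100.3, 7.2, 10.1, 6.9],   # residue 3
    [18.1, 116.2, 6.7, 9.2, 6.9],    # residue 4
]
for row in min_max_scale(VECTORS):
    print([round(x, 3) for x in row])
```

After scaling, Euclidean distances between vectors are no longer dominated by the angle components, which have much larger numeric ranges than the distance components.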

  43. Evaluation protocol • Protocol is based on: • testing candidate alphabets for their conservation within fold families. • testing predictability of candidate alphabets • testing improvements in fold recognition when candidate alphabets are used.

  44. Evaluation Protocol: string translation • Stringbuilder translates selected PDB structures into position-equivalent strings in the selected alphabet. Example: >2abd MDAAVKTG becomes >2abd CAAABCAB; >4eca MELVIRSG becomes >4eca ACBBABCA; ...

  45. Evaluation Protocol: alignment translation • Alignmentbuilder uses the position-equivalent strings in the new alphabet to translate fold family alignments into position-equivalent alignments in the new alphabet. Example: the amino acid alignment (MD-AAVKTG, ME-LVIRSG, M-SAGCRDK, MEA-SC-E-) becomes (CA-AABCAB, AC-BBABCA, C-AACCBBC, CCA-BB-A-).

  46. Evaluation Protocol: alphabet conservation • Conserved? • Average entropy in columns of alignments. • Relative entropy of substitution matrix constructed from alignments (Altschul 1991). Example: position-equivalent alignment in the new alphabet (CA-AABCAB, AC-BBABCA, C-AACCBBC, CCA-BB-A-).

  47. Evaluation Protocol: alphabet predictability • Predictable? • Test predictability with the Predict_2nd neural net, which takes position-equivalent strings in the new alphabet and outputs P(A), P(B), P(C) per residue (courtesy of C. Barrett). • Improve on neural net performance with alternate methods.
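
The first conservation measure, average column entropy, can be sketched as below. The rows are the slide's example alignment in the new alphabet, with row boundaries assumed to fall every 9 characters. Ignoring gaps and scoring an all-gap column as 0 bits are assumptions of this sketch, and the function names are hypothetical.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy in bits of the letters in one column, ignoring gaps."""
    letters = [c for c in column if c != "-"]
    if not letters:
        return 0.0
    n = len(letters)
    return -sum((k / n) * math.log2(k / n) for k in Counter(letters).values())


def average_column_entropy(alignment):
    """Mean column entropy over all columns of an alignment."""
    columns = list(zip(*alignment))
    return sum(column_entropy(col) for col in columns) / len(columns)


# Example alignment from the slide (row boundaries assumed).
ALIGNMENT = ["CA-AABCAB", "AC-BBABCA", "C-AACCBBC", "CCA-BB-A-"]
print(average_column_entropy(ALIGNMENT))
```

Lower average entropy means the letters in each column agree more often, that is, the alphabet is better conserved across the aligned family members.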

  48. Evaluation Protocol: fold recognition • Build a fold library that incorporates the local structure alphabet and do fold recognition testing using this library.

  49. Incorporating local structure alphabets into a fold library • Simplest: use the predicted local structure string for the target and the known local structure string for templates. • PROBLEM! Wrong letter predicted. Figure: target local structure string ABBCACAB is scored against each template local structure string in the fold library (CCABBBAC, AACBCAA, CAACBBB); target/template scores shown: 7, 21, 2.

  50. Incorporating local structure information into a fold library • Use several strings (amino acid and local structure) for target and templates. • PROBLEM! Wrong letters predicted. Figure: target with string tuple (YLAADTYK, ABBCACAB, WYTZTTVU) is scored against templates with string tuples in the fold library: (TGLIRKN, CAACBBB, YUUUVZW), (MEPSSYV, AACBCAA, TTYUVWZ), (FISSETCN, CCABBBAC, YVUUTZVV); target/template scores shown: 6, 23, 5.
