
Characterization of Secondary Structure of Proteins using Different Vocabularies

Characterization of Secondary Structure of Proteins using Different Vocabularies. Madhavi K. Ganapathiraju, Language Technologies Institute. Advisors: Raj Reddy, Judith Klein-Seetharaman, Roni Rosenfeld. 2nd Biological Language Modeling Workshop, Carnegie Mellon University, May 13-14, 2003.


Presentation Transcript


  1. Characterization of Secondary Structure of Proteins using Different Vocabularies Madhavi K. Ganapathiraju Language Technologies Institute Advisors Raj Reddy, Judith Klein-Seetharaman, Roni Rosenfeld 2nd Biological Language Modeling Workshop Carnegie Mellon University May 13-14 2003

  2. Presentation overview • Classification of Protein Segments by their Secondary Structure types • Document Processing Techniques • Choice of Vocabulary in Protein Sequences • Application of Latent Semantic Analysis • Results • Discussion

  3. Secondary Structure of Protein • Sample Protein: MEPAPSAGAELQPPLFANASDAYPSACPSAGANASGPPGARSASSLALAIAITAL YSAVCAVGLLGNVLVMFGIVRYTKMKTATNIYIFNLALADALATSTLPFQSA…

  4. Application of Text Processing • Letters → Words → Sentences • Letter counts in languages • Word counts in documents • Residues → Secondary Structure → Proteins → Genomes • Can unigrams distinguish Secondary Structure Elements from one another?

  5. Unigrams for Document Classification • Word-Document matrix • represents documents in terms of their word unigrams • “Bag-of-words” model, since the position of words in the document is not taken into account
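
A word-document matrix of the kind described on this slide can be sketched in a few lines. This is only an illustration: the three toy documents and the helper name `word_document_matrix` are assumptions, not material from the presentation.

```python
from collections import Counter


def word_document_matrix(documents):
    """Return (vocabulary, matrix); matrix[i][j] counts word i in document j."""
    # One bag of words per document: word order is discarded.
    counts = [Counter(doc.split()) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[word] for c in counts] for word in vocabulary]
    return vocabulary, matrix


# Hypothetical toy corpus, used only to show the shape of the matrix.
docs = ["the cat sat", "the dog sat down", "cat and dog"]
vocab, matrix = word_document_matrix(docs)

for word, row in zip(vocab, matrix):
    print(f"{word:>5}: {row}")
```

Each column of the matrix is a document vector; each row is a word vector.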

  6. Word Document Matrix

  7. Document Vectors

  8. Document Vectors Doc-1

  9. Document Vectors Doc-2

  10. Document Vectors Doc-3

  11. Document Vectors Doc-N

  12-14. Document Comparison • Documents can be compared to one another in terms of the dot product of their document vectors • Formal modeling of documents is presented in the next few slides…
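
The dot-product comparison can be illustrated as follows. The three count vectors are made up for the example, and the cosine variant (dot product of length-normalized vectors) is shown as a common choice rather than as the measure used in the talk.

```python
import math


def dot(u, v):
    """Dot product of two equal-length document vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine_similarity(u, v):
    """Dot product after scaling both vectors to unit length."""
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / norm if norm else 0.0


# Hypothetical word-count vectors over a four-word vocabulary.
doc_1 = [1, 0, 2, 1]
doc_2 = [1, 1, 0, 1]
doc_3 = [0, 3, 0, 0]

print(dot(doc_1, doc_2), dot(doc_1, doc_3))
print(round(cosine_similarity(doc_1, doc_2), 3))
```

Documents that share no words, such as `doc_1` and `doc_3` here, have a dot product of zero.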

  15. Vector Space Model construction • Document vectors in word-document matrix are normalized • By word counts in entire document collection • By document lengths • This gives a Vector Space Model (VSM) of the set of documents • Equations for Normalization…

  16. Word count normalization • Each matrix entry is the word count in the document, divided by the document length, and scaled by a factor that depends on the word's count in the corpus • t_i is the total number of times word i occurs in the corpus
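
The slide names the ingredients of the normalization but the transcript does not carry the equation itself. The sketch below assumes the entropy-based weighting from the Bellegarda paper listed in the references, w_ij = (1 - e_i) * c_ij / n_j, where c_ij is the count of word i in document j, n_j is the length of document j, and e_i is the normalized entropy of word i across the corpus. Treat the exact form of the corpus-dependent factor as an assumption.

```python
import math


def normalize(matrix):
    """Weight a word-document count matrix (rows = words, columns = documents).

    Assumed form: w_ij = (1 - e_i) * c_ij / n_j, with
    e_i = -(1 / log N) * sum_j (c_ij / t_i) * log(c_ij / t_i).
    """
    n_docs = len(matrix[0])
    doc_lengths = [sum(col) for col in zip(*matrix)]  # n_j
    weighted = []
    for row in matrix:
        t_i = sum(row)  # total occurrences of word i in the corpus
        entropy = 0.0
        if t_i > 0 and n_docs > 1:
            for c in row:
                if c > 0:
                    p = c / t_i
                    entropy -= p * math.log(p)
            entropy /= math.log(n_docs)
        weighted.append(
            [(1 - entropy) * c / n if n else 0.0 for c, n in zip(row, doc_lengths)]
        )
    return weighted


# Hypothetical counts: word 0 occurs only in document 0,
# word 1 is spread evenly over both documents.
counts = [[2, 0], [1, 1]]
weights = normalize(counts)
print(weights)
```

Under this weighting a word spread evenly over all documents gets weight zero, while a word concentrated in one document keeps its length-normalized count.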

  17. Word-Document Matrix Normalized Word-Document Matrix

  18. Document vectors after normalisation ...

  19. Use of Vector Space Model • A query document is also represented as a vector • It is normalized by corpus word counts • Documents related to the query-doc are identified • by measuring similarity of document vectors to the query document vector

  20. Application to Protein Secondary Structure Prediction

  21. Protein Secondary Structure • DSSP (Dictionary of Protein Secondary Structure): annotation of each residue with its structure • based on hydrogen bonding patterns and geometrical constraints • 7 DSSP labels for PSS: • Helix types: H, G • Strand types: B, E • Coil types: S, I, T

  22. Example • Residues: PKPPVKFNRRIFLLNTQNVINGYVKWAINDVSLALPPTPYLGAMKYNLLH • DSSP: ____SS_SEEEEEEEEEEEETTEEEEEETTEEE___SS_HHHHHHTT_TT • Key to DSSP labels: T, S, I, _: Coil; E, B: Strand; H, G: Helix

  23. Reference Model • Proteins are segmented into structural Segments • Normalized word-document matrix • constructed from structural segments

  24-25. Example • Residues: PKPPVKFNRRIFLLNTQNVINGYVKWAINDVSLALPPTPYLGAMKYNLLH • DSSP: ____SS_SEEEEEEEEEEEETTEEEEEETTEEE___SS_HHHHHHTT_TT • Structural segments obtained from the given sequence: PKPPVKFN RRIFLLNTQNVI NG YVKWAI ND VSL ALPPTP YLGAMK YNLLH • Unigrams (amino-acid counts) are then computed within each structural segment
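
The segmentation step can be sketched as follows, using the residue and DSSP strings from the example and the three-class key given on the earlier slide. The function name `structural_segments` is hypothetical. Splitting strictly at the points where the structure class changes places the final helix/coil boundary after YLGAMK, since the DSSP string has six H labels.

```python
from collections import Counter
from itertools import groupby

# Three-class reduction from the "Key to DSSP labels" slide.
DSSP_TO_CLASS = {
    "H": "Helix", "G": "Helix",
    "E": "Strand", "B": "Strand",
    "T": "Coil", "S": "Coil", "I": "Coil", "_": "Coil",
}


def structural_segments(residues, dssp):
    """Split a sequence into (segment, class) runs of constant structure class."""
    if len(residues) != len(dssp):
        raise ValueError("residue and DSSP strings must have equal length")
    classes = [DSSP_TO_CLASS[label] for label in dssp]
    segments, start = [], 0
    for cls, run in groupby(classes):
        length = len(list(run))
        segments.append((residues[start:start + length], cls))
        start += length
    return segments


residues = "PKPPVKFNRRIFLLNTQNVINGYVKWAINDVSLALPPTPYLGAMKYNLLH"
dssp = "____SS_SEEEEEEEEEEEETTEEEEEETTEEE___SS_HHHHHHTT_TT"
segments = structural_segments(residues, dssp)

for segment, cls in segments:
    # Unigram (amino-acid) counts within each structural segment.
    print(f"{cls:<6} {segment:<12} {dict(Counter(segment))}")
```

Each segment's unigram counts form one column of the amino-acid structural-segment matrix.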

  26-27. Amino-acid Structural-Segment Matrix • Axes: Amino Acids and Structural Segments • Similar to the Word-Document Matrix

  28. Document Vectors • Word Vectors

  29. Document Vectors • Query Vector • Word Vectors …

  30. Data Set used for PSSP • JPred data • 513 protein sequences in all • <25% homology between sequences • Residues & corresponding DSSP annotations are given • We used • 50 sequences for model construction (training) • 30 sequences for testing

  31. Classification • Proteins from test set • segmented into structural elements • Called “query segments” • Segment vectors are constructed • For each query segment • ‘n’ most similar reference segment vectors are retrieved • Query segment is assigned same structure as that of the majority of the retrieved segments* *k-nearest neighbour classification

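  32. Compare Similarities • The query vector is compared with the vectors of the reference model • The 3 most similar reference vectors are retrieved • Majority voting among the 3 most similar reference vectors gives the structure type assigned to the query vector • In the example the majority type is Coil, hence the structure type assigned to the query vector is Coil • Key: Helix, Strand, Coil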
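
The retrieve-and-vote step can be sketched as below. The vectors and labels are invented for illustration, cosine similarity is assumed as the similarity measure, and `classify` is a hypothetical helper name.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0


def classify(query, reference, n=3):
    """Assign the majority label among the n most similar reference vectors."""
    ranked = sorted(reference, key=lambda item: cosine(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:n])
    # On a tie, Counter.most_common keeps the label seen first in the ranking.
    return votes.most_common(1)[0][0]


# Hypothetical reference model: (segment vector, known structure type).
reference = [
    ([0.0, 1.0, 0.9], "Coil"),
    ([0.1, 0.8, 1.0], "Coil"),
    ([1.0, 0.0, 0.0], "Helix"),
    ([0.2, 1.0, 0.1], "Strand"),
    ([1.0, 0.1, 0.0], "Helix"),
]
query = [0.0, 1.0, 1.0]

print(classify(query, reference, n=3))
```

Here the three nearest reference vectors are two Coil segments and one Strand segment, so the query is labeled Coil.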

  33. Choice of Vocabulary in Protein Sequences • Amino Acids • But amino acids are not all distinct • Similarity is primarily due to chemical composition • So, represent protein segments in terms of “types” of amino acids • or represent them in terms of “chemical composition”

  34. Representation in terms of “types” of AA • Classify based on Electronic Properties • e- donors: D, E, A, P • weak e- donors: I, L, V • Ambivalent: G, H, S, W • weak e- acceptors: T, M, F, Q, Y • e- acceptors: K, R, N • C (by itself, another group) • Use Chemical Groups
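
Switching the vocabulary from amino acids to electronic-property types amounts to a lookup before counting. The grouping below is copied from the slide; the type names used as dictionary keys and the helper `type_unigrams` are hypothetical labels chosen for this sketch.

```python
from collections import Counter

# Grouping of the 20 amino acids by electronic properties, as listed on the slide.
ELECTRONIC_TYPES = {
    "donor": "DEAP",
    "weak_donor": "ILV",
    "ambivalent": "GHSW",
    "weak_acceptor": "TMFQY",
    "acceptor": "KRN",
    "cysteine": "C",  # C forms a group by itself
}
AA_TO_TYPE = {aa: name for name, group in ELECTRONIC_TYPES.items() for aa in group}


def type_unigrams(segment):
    """Count amino-acid types (instead of amino acids) in a structural segment."""
    return Counter(AA_TO_TYPE[aa] for aa in segment)


print(type_unigrams("PKPPVKFN"))
```

With this vocabulary the matrix has six rows instead of twenty, one per type.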

  35. Representation using Chemical Groups

  36. Results of Classification with “AA” as words • Leave-one-out testing of reference vectors • Unseen query segments

  37. Results with “chemical groups” as words • Build VSM using both reference segments and test segments • Structure labels of reference segments are known • Structure labels of query segments are unknown

  38. Modification to Word-Document matrix • Latent Semantic Analysis • Word document matrix is transformed • by “Singular Value Decomposition”
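
A minimal sketch of the SVD step, assuming NumPy is available. The matrix `W` is a made-up count matrix (rows = words, columns = documents), the rank of 2 is arbitrary, and `lsa_project` and `fold_in` are hypothetical helper names. Projecting a new query by multiplying its word vector with the truncated left singular vectors is one standard way to place it in the same latent space as the reference documents.

```python
import numpy as np


def lsa_project(matrix, rank):
    """Truncated SVD of a word-document matrix.

    Returns the top-`rank` left singular vectors, singular values, and one
    latent-space row vector per document.
    """
    U, s, Vt = np.linalg.svd(np.asarray(matrix, dtype=float), full_matrices=False)
    U_k, s_k = U[:, :rank], s[:rank]
    doc_vectors = (np.diag(s_k) @ Vt[:rank, :]).T  # shape: (n_docs, rank)
    return U_k, s_k, doc_vectors


def fold_in(query_counts, U_k):
    """Project a query's word vector into the same latent space."""
    return np.asarray(query_counts, dtype=float) @ U_k


# Hypothetical 4-word x 3-document matrix.
W = np.array([[2, 0, 1],
              [0, 1, 1],
              [1, 1, 0],
              [0, 2, 1]], dtype=float)

U_k, s_k, doc_vectors = lsa_project(W, rank=2)
print(doc_vectors)
print(fold_in(W[:, 0], U_k))  # matches the first row of doc_vectors
```

Similarity and majority voting then proceed as before, but on the low-dimensional latent vectors instead of the raw columns.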

  39. Results with “AA” as words, using LSA

  40. Results with “types of AA” as words, using LSA

  41. Results with “chemical groups” as words, using LSA

  42. LSA results for Different Vocabularies • Panels: Amino acids (LSA), Types of amino acid (LSA), Chemical groups (LSA)

  43. Model construction using all data • Matrix models are constructed using both reference and query documents together • This gives better models both for normalization and for construction of the latent semantic model • Vocabularies: Amino acids, Chemical groups, Amino acid types

  44. Applications • Complement other methods for protein structure prediction • Segmentation approaches • Protein classifications as all-alpha, all-beta, alpha+beta or alpha/beta types • Automatically assigning new proteins into SCOP families

  45. References • Kabsch, W. and Sander, C., Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers, 1983. 22(12): p. 2577-637. • Dwyer, D.S., Electronic properties of the amino acid side chains contribute to the structural preferences in protein folding. J Biomol Struct Dyn, 2001. 18(6): p. 881-92. • Bellegarda, J., Exploiting latent semantic information in statistical language modeling. Proceedings of the IEEE, 2000. 88(8).

  46. Thank you!

  47. Use of SVD • Representation of training and test segments is very similar to that in the VSM • Structure type assignment goes through the same process, except that it is done with the LSA matrices

  48. Classification of Query Document • A query document is also represented as a vector • It is normalized by corpus word counts • Documents related to the query are identified by measuring similarity of document vectors to the query document vector • The query document is assigned the same structure as the majority of the documents retrieved by the similarity measure (majority voting*) *k-nearest neighbour classification

  49. Notes… • Results described are per-segment • Normalized Word document matrix does not preserve document lengths • Hence “per residue” accuracies of structure assignments cannot be computed
