Protein Secondary Structure Prediction: A New Improved Knowledge-Based Method

Protein Secondary Structure Prediction: A New Improved Knowledge-Based Method. Wen-Lian Hsu Institute of Information Science Academia Sinica, Taiwan. Outline. Introduction PSSP Motivation Knowledge-Based Method PROSP An Improved Hybrid Method PROSP II HYPROSP II+ Conclusion.

Presentation Transcript


  1. Protein Secondary Structure Prediction: A New Improved Knowledge-Based Method. Wen-Lian Hsu, Institute of Information Science, Academia Sinica, Taiwan

  2. Outline • Introduction • PSSP • Motivation • Knowledge-Based Method • PROSP • An Improved Hybrid Method • PROSP II • HYPROSP II+ • Conclusion

  3. Protein Structures • Primary sequence: MTYKLILNGKTKGETTTEAVDAATAEKVFQYANDNGVDGEWTYTE • Secondary structures: loops, helices, strands • Tertiary structures: three-dimensional packing of secondary structures

  4. Introduction to PSSP • Protein Secondary Structure Prediction (PSSP) is the task of predicting a protein's secondary structure based only on its sequence. • Each amino acid is assigned a secondary structure element (SSE): • Helix (H), Strand (E) or Coil (C or L).

  5. Motivation • PSSP plays an important role in tertiary structure prediction • Fischer (1996) improved tertiary structure prediction accuracy from 59.0 to 71.0 by using PHD to predict SSEs. • In Yang's 2003 work, tertiary structure prediction accuracy was improved from 71.9 to 79.0 by using PSIPRED to predict SSEs. • Predicted SSEs can also be used as features in other prediction algorithms to improve performance

  6. Outline • Introduction • PSSP • Motivation • Knowledge-Based Method • PROSP • An Improved Hybrid Method • PROSP II • HYPROSP II+ • Conclusion

  7. Treat PSSP as a Translation Problem • Secondary structure prediction translates between two languages • A language with an alphabet of 20 letters (the amino acids) • A language with an alphabet of 3 letters (H, E, C)

  8. Treating Genomic/Proteomic Sequences as a Language • For proteomic data: amino acid → motif → protein • In language: alphabet → word → sentence → paragraph • Protein structure or function ↔ sentence meaning • Finding the interrelationships of data • Data Mining, Knowledge Discovery

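  9. Speech Recognition: Example. Sense Disambiguation in English • Selection of homonyms (or senses) in speech recognition • Spoken sentence: 台北市一位小孩走失了 ("A child has gone missing in Taipei City") • Competing homophone candidates: 台北市 (Taipei City), 小孩 (child), 台北 (Taipei), 適宜 (suitable), 走失 (go missing), 事宜 (matters), 一位 (one person), 一味 (blindly), 移位 (shift position)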

  10. How do we represent the context in a protein sequence (or sentence)? • Using motifs as words? • Motifs could be too specific and may not provide enough coverage • What about using k-mers? • Can build (k-mer, structure) pairs • How many k-mers can we get? • How do we define similar k-mers (under the context)? • How do we combine the structural information from the k-mers?
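
The (k-mer, structure) pairs mentioned on this slide can be sketched with a sliding window. This is a minimal illustration, not the authors' implementation: the function name `kmer_structure_pairs` is hypothetical, and the structure string in the example is made up for demonstration. A chain of length n yields n − k + 1 overlapping k-mers.

```python
def kmer_structure_pairs(sequence, structure, k):
    """Pair every length-k peptide with the structure string at the same positions."""
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure must have the same length")
    # A chain of length n yields n - k + 1 overlapping windows.
    return [
        (sequence[i:i + k], structure[i:i + k])
        for i in range(len(sequence) - k + 1)
    ]


# Illustrative input only: the structure string is invented, not taken from DSSP.
pairs = kmer_structure_pairs("MTYKLILNGK", "CEEEEEECCC", 7)
for peptide, sse in pairs:
    print(peptide, sse)
```

Changing `k` (for example 5 versus 7) trades specificity for coverage, which is the tension the slide raises.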

  11. PROSP: Our knowledge-based method for PSSP • Construct a peptide Sequence-Structure Knowledge Base (SSKB) • Use PSI-BLAST to find all peptides similar to those of the target protein • Use the similar peptides found in the SSKB to vote for the dominant structure of each amino acid in the target protein.

  12. Using PSI-BLAST to Amplify the Effect of the DSSP Database (create more synonyms) • The number of peptide words is still small (~5 million) • Identify similar peptides • For each protein p in the NR database, apply PSI-BLAST to find its HSPs (high-scoring segment pairs). • HSP: an alignment of a subsequence of protein p with a subsequence of another protein q of unknown structure • Assign the structure of “selected” peptides of p to those of q • These peptides comprise our dictionary (~100 million)
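
The structure-transfer step on this slide can be sketched as follows. This is a simplified illustration under stated assumptions: the HSP is treated as ungapped (aligned column by column), every window is transferred because the slide does not say how peptides are "selected", and the function name and example strings are hypothetical.

```python
def transfer_structure(p_segment, p_structure, q_segment, k):
    """Label each k-mer of q with the known structure of the aligned k-mer of p.

    Assumes an ungapped HSP, so position i of p_segment aligns with
    position i of q_segment.
    """
    if not (len(p_segment) == len(p_structure) == len(q_segment)):
        raise ValueError("aligned segments and structure must have equal length")
    return [
        (q_segment[i:i + k], p_structure[i:i + k])
        for i in range(len(q_segment) - k + 1)
    ]


# Invented example: p has a known structure, q does not.
new_records = transfer_structure("KPYQCQYA", "HHHHHHCC", "KVYQCQYA", 7)
print(new_records)
```

Each returned record is a new "synonym" entry: a peptide from q carrying the structure observed for its aligned counterpart in p.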

  13. SSKB construction (synonyms) • [Figure: an example of a high-scoring segment pair (HSP) from a PSI-BLAST search result; the two aligned sequences are labeled "known" and "unknown"]

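  14. Prediction at a position x • [Figure: peptides found in the SSKB via PSI-BLAST cast votes such as H, H, H, C, E, C at position x; the voting scores H(x), E(x), and C(x) are compared, and here x is assigned as helix]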
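
The per-position vote can be sketched like this. It is a minimal sketch, assuming unweighted votes (one per matching peptide), ties broken in H, E, C order, and coil as the default when a position receives no votes; none of these choices is stated in the slides, and `vote_structure` is a hypothetical name.

```python
from collections import Counter


def vote_structure(target_length, hits, default="C"):
    """Tally H/E/C votes per target position and pick the top-scoring state.

    hits: (start, structure) pairs, where structure is the SSE string of a
    matched peptide aligned to the target beginning at offset start.
    """
    scores = [Counter() for _ in range(target_length)]
    for start, structure in hits:
        for offset, sse in enumerate(structure):
            position = start + offset
            if 0 <= position < target_length:
                scores[position][sse] += 1

    prediction = []
    for counts in scores:
        if counts:
            # max() returns the first maximum, so ties resolve in H, E, C order.
            prediction.append(max("HEC", key=lambda sse: counts[sse]))
        else:
            prediction.append(default)  # assumption: no votes means coil
    return "".join(prediction), scores


# Invented hits for a target of length 6.
hits = [(0, "HHHC"), (0, "HHHH"), (1, "HHCC"), (2, "EECC")]
prediction, scores = vote_structure(6, hits)
print(prediction)
```

Here `scores[x]` plays the role of H(x), E(x), and C(x) on the slide.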

  15. Outline • Introduction • PSSP • Motivation • Knowledge-Based Method • PROSP • An Improved Hybrid Method • PROSP II • HYPROSP II+ • Conclusion

  16. Two problems in searching for homologous peptides in protein sequence databases • Redundant information generated by duplicate peptides: the voting bias problem in PROSP • Poor prediction accuracy due to insufficient knowledge base matching: need to boost coverage

  17. The voting bias problem • [Figure: PSI-BLAST results for the query KTYQCQY… against the SSKB. The subject peptide KPYQCQY (HHHHHH) appears five times, alongside KVYQCQY (CCHHHC) and QPYRCKY (CCHHHC), so the duplicated peptide dominates the result]

  18. Clustering HSPs • [Figure: for the protein …MYKKILYPTDFSETAEIALK…, similar HSPs such as MYSKILL, MYKKIYL, and MYSSILY, including repeated copies, are grouped into clusters]
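
The effect of duplicates on voting can be shown with the peptides from the voting bias slide. This is a deliberately reduced sketch: it merges only exact duplicates, whereas the slides cluster similar HSPs, and the helper names are hypothetical. The structure strings are copied as they appear in the transcript.

```python
from collections import Counter

# Subject peptides and structures as listed on the voting bias slide.
hits = [("KPYQCQY", "HHHHHH")] * 5 + [
    ("KVYQCQY", "CCHHHC"),
    ("QPYRCKY", "CCHHHC"),
]


def first_position_votes(records):
    """Count the votes cast for the first residue of the aligned region."""
    return Counter(structure[0] for _, structure in records)


def deduplicate(records):
    """Keep one copy of each (peptide, structure) record, preserving order."""
    return list(dict.fromkeys(records))


raw_votes = first_position_votes(hits)
clean_votes = first_position_votes(deduplicate(hits))
print(raw_votes, clean_votes)
```

With duplicates, helix wins the first position five votes to two; after merging, coil wins two to one, which is the bias the slide describes.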

  19. Measuring the amount of structural information • HSPs with a low local match rate (LMR) • [Figure: regions of the protein marked "Found" or "Unfound"; there is no information from SSKB7 for the unfound region]
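
The slides do not define the local match rate precisely, so the following is only one plausible reading, labeled as an assumption: the fraction of positions in a window around x that have at least one SSKB match. The function name, the window size, and the example data are all hypothetical.

```python
def local_match_rate(found, x, half_window=3):
    """Fraction of positions near x that have at least one knowledge-base match.

    ASSUMPTION: this windowed definition is illustrative; the slides give
    no formula for the local match rate.
    found: list of booleans, True where a position is covered by some HSP.
    """
    lo = max(0, x - half_window)
    hi = min(len(found), x + half_window + 1)
    window = found[lo:hi]
    return sum(window) / len(window)


# Invented coverage pattern: first four positions found, the rest unfound.
found = [True] * 4 + [False] * 6
print(local_match_rate(found, 0), local_match_rate(found, 3), local_match_rate(found, 8))
```

Under this reading, a value near 0 marks a region where the knowledge base contributes little, and a value near 1 marks a well-covered region.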

  20. Construct SSKB with different lengths (to boost coverage) • Training protein → PSI-BLAST search → HSPs → SSKB construction (window length = 5) → SSKB (window length = 5) • Training protein → PSI-BLAST search → HSPs → SSKB construction (window length = 7) → SSKB (window length = 7)

  21. Boost match rate using peptide records of different lengths • Protein: MYKKILYPTDFSETAEIALK… • [Figure: per-position voting scores from HSPs found in SSKB7 (window length = 7) and SSKB5 (window length = 5), shown as two score tables] • Score table 1: H = 1 2 1 3 6 7 8…; E = 1 2 2 0 0 0 1…; C = 2 3 8 8 5 4 2… • Score table 2: H = 1 3 2 5 5 5 2…; E = 1 3 2 0 0 0 1…; C = 2 4 7 7 6 6 7…

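  22. NEW PROSP system • Protein: MYKKILYPTDFSETAEIALK… • [Figure: three per-position score tables, with labels "SSKB window length = 7" and "SSKB window length = 5"] • Score table 1: H = 1 2 1 3 6 7 8…; E = 1 2 2 0 0 0 1…; C = 2 3 8 8 5 4 2… • Score table 2: H = 1 3 2 5 5 5 2…; E = 1 3 2 0 0 0 1…; C = 2 4 7 7 6 6 7… • Score table 3: H = 1 3 2 5 7 6 7…; E = 1 3 2 0 0 0 1…; C = 2 4 8 8 4 5 6… • $H_{\mathrm{PROSP\,II}}(x) \leftarrow \mathrm{LMR}_{7\mathrm{mer}}(x) \times H_7(x) + (1 - \mathrm{LMR}_{7\mathrm{mer}}(x)) \times H_5(x)$ • $E_{\mathrm{PROSP\,II}}(x) \leftarrow \mathrm{LMR}_{7\mathrm{mer}}(x) \times E_7(x) + (1 - \mathrm{LMR}_{7\mathrm{mer}}(x)) \times E_5(x)$ • $C_{\mathrm{PROSP\,II}}(x) \leftarrow \mathrm{LMR}_{7\mathrm{mer}}(x) \times C_7(x) + (1 - \mathrm{LMR}_{7\mathrm{mer}}(x)) \times C_5(x)$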
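
The three update rules on this slide are the same weighted average applied to each structure class. A minimal sketch follows; the function name is hypothetical and the numbers are invented for illustration rather than taken from the slide's tables.

```python
def combine_scores(lmr7, score7, score5):
    """Blend 7-mer and 5-mer scores position by position.

    Implements score(x) = LMR7mer(x) * score7(x) + (1 - LMR7mer(x)) * score5(x),
    the form given on the slide for the H, E, and C scores alike.
    """
    if not (len(lmr7) == len(score7) == len(score5)):
        raise ValueError("all inputs must cover the same positions")
    return [w * s7 + (1 - w) * s5 for w, s7, s5 in zip(lmr7, score7, score5)]


# Invented values: full, partial, and no 7-mer coverage at three positions.
lmr7 = [1.0, 0.5, 0.0]
h_combined = combine_scores(lmr7, score7=[6, 4, 0], score5=[2, 2, 8])
print(h_combined)
```

Where the 7-mer knowledge base covers a region well, its scores dominate; where it finds nothing, the 5-mer scores take over entirely.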

  23. Hybrid by Neural Network • [Figure: Query protein → PSIPRED (3 features: H score, E score, C score), PROSP (3 features: H score, E score, C score), and PSI-BLAST PSSM (20 features) → Neural Network → Final Result]
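
The slide lists three feature groups feeding the neural network: 3 PSIPRED scores, 3 PROSP scores, and 20 PSSM values. A per-residue input vector could be assembled as below. This is a sketch under assumptions: the concatenation order, the absence of a residue window, and the function name are all hypothetical, and the network itself is not shown.

```python
def hybrid_features(psipred_scores, prosp_scores, pssm_row):
    """Concatenate the three feature groups for one residue (3 + 3 + 20 = 26)."""
    if len(psipred_scores) != 3 or len(prosp_scores) != 3:
        raise ValueError("expected H, E, C scores from each predictor")
    if len(pssm_row) != 20:
        raise ValueError("expected one PSSM value per amino acid type")
    # ASSUMPTION: this ordering is arbitrary; the slide does not specify one.
    return list(psipred_scores) + list(prosp_scores) + list(pssm_row)


# Invented values for a single residue.
features = hybrid_features([0.7, 0.1, 0.2], [5.0, 0.0, 2.0], [0] * 20)
print(len(features))
```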

  24. Data Sets • Two widely used test sets • CB513 • EVAc4 • Derivation of the training sets • Get 4,572 unique protein chains (with less than 25% mutual sequence identity) from the DSSP database • Further remove protein chains with over 25% sequence identity to the respective test datasets to obtain their respective training datasets. • The final training datasets consist of 4,395 and 4,055 protein chains for EVAc4 and CB513, respectively.

  25. The respective performance improvement using SSKB5 and SSKB7 • [Figure: Q3 (%) plotted against LMR7mer (%). Caption: prediction performance on CB513 of SSKB5, SSKB7, and PROSP II for LMR7mer lower than 50%]
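
Q3, the metric on this slide's axis, is the standard three-state accuracy: the percentage of residues whose predicted state (H, E, or C) matches the observed one. A minimal sketch, with a hypothetical function name and invented strings:

```python
def q3(predicted, observed):
    """Percentage of residues whose predicted H/E/C state matches the observed state."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


# Invented example: one mismatch in eight residues.
accuracy = q3("HHHCCEEC", "HHHHCEEC")
print(accuracy)
```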

  26. Performance of HYPROSP II+

  27. Conclusion • HYPROSP II+ uses a more robust knowledge-based algorithm, PROSP II • More structural information, better prediction. • Incremental Learning • The general strategy developed in this paper could be used to enhance the performance of similar approaches in other prediction problems.

  28. People • Ting-Yi Sung • Wen-Lian Hsu • Jia-Ming Chang • Ei-Wen Yang • Hsin-Nan Lin
