Conditional Random Fields for the Prediction of Signal Peptide Cleavage Sites

M.W. Mak, The Hong Kong Polytechnic University. S.Y. Kung, Princeton University.

Presentation Transcript


  1. Conditional Random Fields for the Prediction of Signal Peptide Cleavage Sites • M.W. Mak, The Hong Kong Polytechnic University • S.Y. Kung, Princeton University

  2. Contents • Introduction: Proteins and Their Subcellular Locations; Importance of Protein Cleavage-Site Prediction; Information in Amino Acid Sequences; Existing Approaches to Cleavage Site Prediction • Conditional Random Field (CRF): CRF for Cleavage Site Prediction • Experiments and Results: Effectiveness of Different Feature Functions; Effect of Varying Window Size; Fusion with SignalP

  3. Proteins and Their Destination • A protein consists of a sequence of amino acids. • Newly synthesized proteins must cross intracellular membranes to reach their destinations. Source: http://redpoll.pharmacy.ualberta.ca

  4. Signal Peptide • A signal peptide is a short segment of 20 to 100 amino acids that contains information about the destination (address) of the protein. • The signal peptide is cleaved off as the protein passes across the membrane, leaving the mature protein. • [Figure labels: signal peptide, cleavage site, mature protein] Sources: http://nobelprize.org; S. R. Goodman, Medical Cell Biology, Elsevier, 2008.

  5. Importance of Cleavage Site Prediction • Defects in the protein sorting process can cause serious diseases, e.g., kidney stones. Source: http://nobelprize.org/nobel_prizes/medicine/laureates/1999/illpres/diseases.html

  6. Importance of Cleavage Site Prediction • Many proteins (e.g., insulin) are produced in living cells. • For the cells to secrete these proteins, the proteins are given a signal peptide. • [Figure: bioreactor] Source: http://nobelprize.org/nobel_prizes/medicine/laureates/1999/illpres/diseases.html

  7. Information in Sequences • Signal peptides contain some regular patterns. • Although the patterns exhibit substantial variation, they can be detected by machine learning tools. • [Figure labels: region rich in hydrophobic amino acids (AA); cleavage site]

  8. Existing Methods • Weight matrices (PrediSi) • Neural Networks (SignalP 1.1) • HMMs (SignalP 3.0)

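  9. Weight Matrices • Weight matrix: 20 amino acids (AA) by 15 window positions. • The window is centered around positions t-1, t, t+1 of the sequence, e.g., M A R S S L F T F L C L A V F I N G C L S Q I E Q Q. • Score at position t = 16 + 0 + 8 + 6 + 78 + 7 + 7 + 13 + 10 + 6 + 8 + 6 + 0 + 6 + 7 = 178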
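
The score on this slide is a sum of one matrix entry per window position. A minimal Python sketch of that idea follows; the function name `window_score`, the matrix layout (one dictionary per window offset), and where the window starts relative to position t are all assumptions, since the slide gives only the fifteen summed terms and not the matrix itself.

```python
from typing import Dict, List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

# Hypothetical layout: weights[i][aa] is the score of amino acid `aa`
# at window offset i (the slide uses a 15-position window).
WeightMatrix = List[Dict[str, int]]


def window_score(seq: str, start: int, weights: WeightMatrix) -> int:
    """Sum one matrix entry per residue in seq[start:start + len(weights)]."""
    window = seq[start:start + len(weights)]
    if len(window) < len(weights):
        raise ValueError("window runs past the end of the sequence")
    return sum(weights[i][aa] for i, aa in enumerate(window))


# The fifteen per-position terms quoted on the slide.
slide_terms = [16, 0, 8, 6, 78, 7, 7, 13, 10, 6, 8, 6, 0, 6, 7]
slide_total = sum(slide_terms)
```

In a predictor of this kind, the position with the highest window score would be reported as the candidate cleavage site.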

  10. SignalP-HMM • [Figure labels: signal peptide, mature protein] Source: Nielsen and Krogh

  11. Contents • Introduction: Proteins and Their Subcellular Locations; Importance of Protein Cleavage-Site Prediction; Information in Amino Acid Sequences; Existing Approaches to Cleavage Site Prediction • Conditional Random Field (CRF): CRF for Cleavage Site Prediction • Experiments and Results: Effectiveness of Amino Acid Properties; Effectiveness of Different Feature Functions; Fusion with SignalP

  12. Conditional Random Fields • Conditional Random Fields (CRFs) were originally designed for sequence labeling tasks such as part-of-speech (POS) tagging. • Given a sequence of observations (e.g., words), a CRF attempts to find the most likely label sequence, i.e., it assigns a label to each observation.
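
Finding the most likely label sequence in a linear-chain model is commonly done with the Viterbi algorithm. The sketch below is a generic illustration, not the authors' code: the score tables `state_scores` and `trans_scores` are hypothetical inputs standing in for the weighted feature sums a trained CRF would produce, and the input is assumed to be non-empty.

```python
from typing import List, Sequence


def viterbi(state_scores: Sequence[Sequence[float]],
            trans_scores: Sequence[Sequence[float]]) -> List[int]:
    """Return the highest-scoring label sequence.

    state_scores[t][y] : score of label y at position t
    trans_scores[p][y] : score of moving from label p to label y
    """
    n_labels = len(trans_scores)
    best = list(state_scores[0])       # best path score ending in each label
    backptr: List[List[int]] = []      # backptr[t-1][y] = best previous label
    for t in range(1, len(state_scores)):
        new_best, ptrs = [], []
        for y in range(n_labels):
            prev = max(range(n_labels),
                       key=lambda p: best[p] + trans_scores[p][y])
            ptrs.append(prev)
            new_best.append(best[prev] + trans_scores[prev][y]
                            + state_scores[t][y])
        best = new_best
        backptr.append(ptrs)
    # Trace back from the best final label.
    y = max(range(n_labels), key=lambda k: best[k])
    path = [y]
    for ptrs in reversed(backptr):
        y = ptrs[y]
        path.append(y)
    return path[::-1]
```

With all transition scores set to zero, the result reduces to picking the best label at each position independently; non-zero transition scores let neighboring labels influence each other.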

  13. Advantages of CRF • No need to compute the likelihood p(observation|label); the posterior p(label|observation) is computed directly. • Able to model long-range dependencies without making the inference problem intractable. • Guarantees a globally optimal solution. • [Figure: a label depends on the observation sequence M A R S S L F T F L C L A V F I N G C L S Q I E Q Q]

  14. CRF for Cleavage Site Prediction • [Equation annotations: cleavage site; weights; length of sequence; transition features; state features built from n-grams of amino acids]
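
The equation on this slide did not survive extraction. For orientation, the standard textbook form of a linear-chain CRF matches the surviving annotations (weights, sequence length, transition features, state features); the symbols below are the conventional ones and are an assumption about the slide's notation, not a reconstruction of it.

```latex
p(\mathbf{y}\mid\mathbf{x})
  = \frac{1}{Z(\mathbf{x})}
    \exp\!\Bigg(
      \sum_{t=1}^{T}\sum_{k} \lambda_k\, f_k(y_{t-1}, y_t, \mathbf{x}, t)
      + \sum_{t=1}^{T}\sum_{k} \mu_k\, g_k(y_t, \mathbf{x}, t)
    \Bigg),
\qquad
Z(\mathbf{x}) = \sum_{\mathbf{y}'}
    \exp\!\Bigg(
      \sum_{t=1}^{T}\sum_{k} \lambda_k\, f_k(y'_{t-1}, y'_t, \mathbf{x}, t)
      + \sum_{t=1}^{T}\sum_{k} \mu_k\, g_k(y'_t, \mathbf{x}, t)
    \Bigg)
```

Here T is the length of the sequence, x is the amino acid sequence, y is the label sequence, f_k are transition features with weights λ_k, and g_k are state features with weights μ_k.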

  15. CRF for Cleavage Site Prediction • Example: bi-gram features for the query sequence T Q T W A G S H S . . .
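
To make the n-gram idea concrete, the following Python sketch lists the n-grams that fall inside a window around one position of the query sequence. The function name `ngram_features`, the feature-string format, and the symmetric `half_window` parameter are illustrative assumptions; the slides do not specify how features were encoded.

```python
from typing import List


def ngram_features(seq: str, t: int, n: int, half_window: int) -> List[str]:
    """List the n-grams lying fully inside [t - half_window, t + half_window].

    Each feature records the n-gram and its offset relative to position t.
    """
    feats = []
    lo, hi = t - half_window, t + half_window
    for start in range(lo, hi - n + 2):
        if start < 0 or start + n > len(seq):
            continue  # skip n-grams that fall off either end of the sequence
        offset = start - t
        feats.append(f"{n}gram[{offset:+d}]={seq[start:start + n]}")
    return feats


query = "TQTWAGSHS"  # the query sequence from the slide, spaces removed
bigrams = ngram_features(query, t=4, n=2, half_window=1)
# bigrams is ['2gram[-1]=WA', '2gram[+0]=AG']
```

Increasing `half_window` adds more context on both sides of position t, which corresponds to the window-size experiment reported later in the deck.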

  16. Contents • Introduction: Proteins and Their Subcellular Locations; Importance of Protein Cleavage-Site Prediction; Information in Amino Acid Sequences; Existing Approaches to Cleavage Site Prediction • Conditional Random Field (CRF): CRF for Cleavage Site Prediction • Experiments and Results: Effectiveness of Different Feature Functions; Effect of Varying Window Size; Fusion with SignalP

  17. Experiments • Data: 1937 protein sequences extracted from Swiss-Prot 56.5. The cleavage-site locations of these sequences were biologically determined. • Ten-fold cross-validation. • For 1st-order state features, up to 5-grams of amino acids. • For 2nd-order state features, up to bi-grams of amino acids. • Implemented with the CRF++ software.
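
Ten-fold cross-validation over the 1937 sequences can be sketched as follows. How the folds were actually formed is not stated in the slides, so the shuffling, the fixed seed, and the helper name `k_fold_indices` are assumptions made for illustration.

```python
import random
from typing import List, Tuple


def k_fold_indices(n_items: int, k: int = 10,
                   seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Split range(n_items) into k (train, test) index pairs."""
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)          # assumed: random fold assignment
    folds = [idx[i::k] for i in range(k)]     # k nearly equal-sized folds
    splits = []
    for i, test in enumerate(folds):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        splits.append((train, test))
    return splits


splits = k_fold_indices(1937, k=10)
```

Each sequence appears in exactly one test fold, so every sequence is predicted once by a model that never saw it during training.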

  18. Results: Effectiveness of Different Feature Functions • Observations: Transition features alone perform poorly, but performance improves once they are combined with state features. • [Table columns: transition only; transition + state]

  19. Results: Effect of Varying the Window Size • Example query sequence: T Q T W A G S H S . . .

  20. Results: Comparison with Other Predictors • Observations: (1) CRF is slightly better than SignalP. (2) CRF is complementary to SignalP.

  21. Web Server http://158.132.148.85:8080/CSitePred/faces/Page1.jsp

  22. Web Server http://158.132.148.85:8080/CSitePred/faces/Page1.jsp Available in May 2009

  23. Conditional Random Fields • Conditional Random Fields (CRFs) were originally designed for sequence labeling tasks such as part-of-speech (POS) tagging. • Given a sequence of observations, a CRF attempts to find the most likely label sequence, i.e., it assigns a label to each observation. • [Figure labels: observations x; labels y]
