Secondary Structure Prediction Using Decision Lists

Presentation Transcript

  1. Secondary Structure Prediction Using Decision Lists. Deniz YURET and Volkan KURT

  2. Outline • What is the problem? • What are the different approaches? • How do we use decision lists and why? • Why does evolution help?

  3. What is the problem? • The generic prediction algorithm • Some important pitfalls: definition, data set • Upper and lower bounds on performance • Evolution and homology enter the picture

  4. Tertiary / Quaternary Structure

  5. Tertiary / Quaternary Structure

  6. Secondary Structure MRRWFHPNITGVEAENLLLTRGVDGSFLARPSKSNPGD ----------HHHHHHHHHH------EEEEE-------

  7. A Generic Prediction Algorithm • Sequence to Structure • Structure to Structure

  8. A Generic Prediction Algorithm: Sequence to Structure MRRWFHPNITGVEAENLLLTRGVDGSFLARPSKSNPGD ??????????????????????????????????????

  9. A Generic Prediction Algorithm: Sequence to Structure MRRWFHPNITGVEAENLLLTRGVDGSFLARPSKSNPGD -?????????????????????????????????????

  10. A Generic Prediction Algorithm: Sequence to Structure MRRWFHPNITGVEAENLLLTRGVDGSFLARPSKSNPGD -?????????????????????????????????????

  11. A Generic Prediction Algorithm: Sequence to Structure MRRWFHPNITGVEAENLLLTRGVDGSFLARPSKSNPGD --????????????????????????????????????

  12. A Generic Prediction Algorithm: Sequence to Structure MRRWFHPNITGVEAENLLLTRGVDGSFLARPSKSNPGD --????????????????????????????????????

  13. A Generic Prediction Algorithm: Sequence to Structure MRRWFHPNITGVEAENLLLTRGVDGSFLARPSKSNPGD ---???????????????????????????????????

  14. A Generic Prediction Algorithm: Sequence to Structure MRRWFHPNITGVEAENLLLTRGVDGSFLARPSKSNPGD ----H-----????????????????????????????

  15. A Generic Prediction Algorithm: Sequence to Structure MRRWFHPNITGVEAENLLLTRGVDGSFLARPSKSNPGD ----H-----H???????????????????????????

  16. A Generic Prediction Algorithm: Sequence to Structure MRRWFHPNITGVEAENLLLTRGVDGSFLARPSKSNPGD ----H-----H???????????????????????????

  17. A Generic Prediction Algorithm: Sequence to Structure MRRWFHPNITGVEAENLLLTRGVDGSFLARPSKSNPGD ----H-----HH??????????????????????????

  18. A Generic Prediction Algorithm: Sequence to Structure MRRWFHPNITGVEAENLLLTRGVDGSFLARPSKSNPGD ----H-----HHHHHHHHHH------EEEEE------?

  19. A Generic Prediction Algorithm: Sequence to Structure MRRWFHPNITGVEAENLLLTRGVDGSFLARPSKSNPGD ----H-----HHHHHHHHHH------EEEEE-------

  20. A Generic Prediction Algorithm: Structure to Structure MRRWFHPNITGVEAENLLLTRGVDGSFLARPSKSNPGD ?---H-----HHHHHHHHHH------EEEEE-------

  21. A Generic Prediction Algorithm: Structure to Structure MRRWFHPNITGVEAENLLLTRGVDGSFLARPSKSNPGD ----H-----HHHHHHHHHH------EEEEE-------

  22. A Generic Prediction Algorithm: Structure to Structure MRRWFHPNITGVEAENLLLTRGVDGSFLARPSKSNPGD -?--H-----HHHHHHHHHH------EEEEE-------

  23. A Generic Prediction Algorithm: Structure to Structure MRRWFHPNITGVEAENLLLTRGVDGSFLARPSKSNPGD ----H-----HHHHHHHHHH------EEEEE-------

  24. A Generic Prediction Algorithm: Structure to Structure MRRWFHPNITGVEAENLLLTRGVDGSFLARPSKSNPGD --?-H-----HHHHHHHHHH------EEEEE-------

  25. A Generic Prediction Algorithm: Structure to Structure MRRWFHPNITGVEAENLLLTRGVDGSFLARPSKSNPGD ----H-----HHHHHHHHHH------EEEEE-------

  26. A Generic Prediction Algorithm: Structure to Structure MRRWFHPNITGVEAENLLLTRGVDGSFLARPSKSNPGD ----?-----HHHHHHHHHH------EEEEE-------

  27. A Generic Prediction Algorithm: Structure to Structure MRRWFHPNITGVEAENLLLTRGVDGSFLARPSKSNPGD ----------HHHHHHHHHH------EEEEE-------

  28. A Generic Prediction Algorithm: Structure to Structure MRRWFHPNITGVEAENLLLTRGVDGSFLARPSKSNPGD ----------HHHHHHHHHH------EEEEE------?

  29. A Generic Prediction Algorithm: Structure to Structure MRRWFHPNITGVEAENLLLTRGVDGSFLARPSKSNPGD ----------HHHHHHHHHH------EEEEE-------
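The two passes animated in slides 8 to 29 can be sketched as follows. This is a minimal illustration, not the authors' method: the function names, the window size, the padding symbol, and both toy rules are assumptions made for the example, and the second pass here reads only the first-pass labels.

```python
from typing import Callable, Tuple


def predict_two_pass(
    sequence: str,
    seq_to_struct: Callable[[str], str],
    struct_to_struct: Callable[[str], str],
    half_window: int = 1,
    pad: str = "_",
) -> Tuple[str, str]:
    """Return (first-pass, second-pass) structure strings for `sequence`."""
    n = len(sequence)
    width = 2 * half_window + 1
    # Pass 1 (sequence to structure): label each residue from the
    # amino acids in the window centered on it.
    padded_seq = pad * half_window + sequence + pad * half_window
    first = "".join(seq_to_struct(padded_seq[i:i + width]) for i in range(n))
    # Pass 2 (structure to structure): relabel each position from the
    # first-pass labels around it.
    padded_struct = pad * half_window + first + pad * half_window
    second = "".join(struct_to_struct(padded_struct[i:i + width]) for i in range(n))
    return first, second


def toy_seq_rule(window: str) -> str:
    """Hypothetical rule: guess the label from the center residue only."""
    center = window[len(window) // 2]
    if center in "AEL":
        return "H"
    if center in "VI":
        return "E"
    return "-"


def toy_smooth_rule(window: str) -> str:
    """Hypothetical rule: overwrite a label that disagrees with both neighbors."""
    mid = len(window) // 2
    left, center, right = window[mid - 1], window[mid], window[mid + 1]
    if left == right and left != center and left != "_":
        return left
    return center


first, second = predict_two_pass("GGAGGLLLGG", toy_seq_rule, toy_smooth_rule)
print(first)   # --H--HHH--
print(second)  # -----HHH--
```

In this toy run the isolated H produced by the first pass is removed by the second pass, which mirrors the correction shown between slides 26 and 27.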

  30. Pitfalls for newcomers • Definition of secondary structure • Choice of data set

  31. Pitfall 1: Definition of Secondary Structure • DSSP: H, B, E, G, I, T, S • STRIDE: H, G, I, E, B, b, T, C • DEFINE: ??? • Convert all to H, -, and E • The three methods agree only 71% of the time (95% for DSSP and STRIDE) • Solution: Use DSSP
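The reduction to three states and the agreement figure can be illustrated with a short sketch. The slide does not say which reduction was used; the mapping below (H, G, I to helix; E, B to strand; everything else to loop) is one common choice and is an assumption here, as are the function names.

```python
# Assumed reduction: helices (H, G, I) -> H, strands (E, B) -> E,
# every other code -> "-" (loop). Other reductions exist.
DSSP_TO_3 = {"H": "H", "G": "H", "I": "H", "E": "E", "B": "E"}


def reduce_to_three_states(labels: str) -> str:
    """Map a string of DSSP-style codes to the alphabet H, E, -."""
    return "".join(DSSP_TO_3.get(code, "-") for code in labels)


def agreement(a: str, b: str) -> float:
    """Fraction of positions at which two equal-length labelings agree."""
    if len(a) != len(b) or not a:
        raise ValueError("labelings must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


print(reduce_to_three_states("HGITSEB-"))  # HHH--EE-
print(agreement("HH-E", "HHEE"))           # 0.75
```

Because the choice of reduction changes the per-residue labels, two programs scored against differently reduced labels are not directly comparable.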

  32. Pitfall 2: Dataset • Trivial to get 80%+ when homologies are present between the training and the test set • Homology identification keeps evolving • RS126, CB513, etc. • Comparison of programs on different data sets is meaningless…

  33. Performance Bounds • Simple baselines for lower bound • A method for estimating an upper bound

  34. Performance Bounds • Baseline 1: 43% of all residues are tagged “loop” • Scale: 43%: assign loop

  35. Performance Bounds • Baseline 2: 49% of all residues are tagged with the most frequent structure for the given amino acid • Scale: 49%: assign most frequent; 43%: assign loop
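Both baselines are easy to reproduce on any labeled data set. The sketch below uses made-up toy sequences (not the data behind the 43% and 49% figures), and the function names are assumptions for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def per_residue_accuracy(predicted: str, true: str) -> float:
    """Fraction of residues whose predicted label matches the true label."""
    if len(predicted) != len(true) or not true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(p == t for p, t in zip(predicted, true)) / len(true)


def predict_loop(sequence: str) -> str:
    """Baseline 1: tag every residue as loop."""
    return "-" * len(sequence)


def train_most_frequent(pairs: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Baseline 2 training: most frequent label for each amino acid."""
    counts = defaultdict(Counter)
    for sequence, structure in pairs:
        for amino_acid, label in zip(sequence, structure):
            counts[amino_acid][label] += 1
    return {aa: c.most_common(1)[0][0] for aa, c in counts.items()}


def predict_most_frequent(table: Dict[str, str], sequence: str) -> str:
    """Baseline 2 prediction; unseen amino acids fall back to loop."""
    return "".join(table.get(aa, "-") for aa in sequence)


# Toy data, invented for the example.
train = [("AAGV", "HH-E"), ("AGVV", "H-EE")]
test_seq, test_struct = "AGVA", "H-E-"

table = train_most_frequent(train)
print(per_residue_accuracy(predict_loop(test_seq), test_struct))                  # 0.5
print(per_residue_accuracy(predict_most_frequent(table, test_seq), test_struct))  # 0.75
```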

  36. Performance Bounds • Upper bound: only consider exact matches for a given frame size. As the frame size increases, accuracy should increase but coverage should fall. • Scale: 100%; ???; 49%: assign most frequent; 43%: assign loop

  37. Upper Bound with Homologs

  38. Upper Bound without Homologs

  39. Performance Bounds • Upper bound: only consider exact matches for a given frame size. As the frame size increases, accuracy should increase but coverage should fall. • Scale: 100%; ???; 75%: estimated upper bound; 49%: assign most frequent; 43%: assign loop
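The exact-match estimate can be sketched as a lookup table from frames to labels. This is an illustrative reading of the slide, not the authors' procedure: it assumes odd frame sizes, skips positions without a complete frame, predicts the majority label seen for a frame in training, and reports accuracy only over covered test frames. The data and names are invented for the example.

```python
from collections import Counter, defaultdict
from typing import Iterator, List, Tuple

Pair = Tuple[str, str]  # (amino-acid sequence, structure labels)


def frames(sequence: str, structure: str, size: int) -> Iterator[Tuple[str, str]]:
    """Yield (frame, center label) for every complete window of odd `size`."""
    half = size // 2
    for i in range(half, len(sequence) - half):
        yield sequence[i - half:i + half + 1], structure[i]


def exact_match_bound(train: List[Pair], test: List[Pair], size: int) -> Tuple[float, float]:
    """Return (accuracy on covered frames, coverage) for one frame size."""
    table = defaultdict(Counter)
    for sequence, structure in train:
        for frame, label in frames(sequence, structure, size):
            table[frame][label] += 1
    total = covered = correct = 0
    for sequence, structure in test:
        for frame, label in frames(sequence, structure, size):
            total += 1
            if frame in table:  # exact match seen in training
                covered += 1
                correct += table[frame].most_common(1)[0][0] == label
    accuracy = correct / covered if covered else 0.0
    coverage = covered / total if total else 0.0
    return accuracy, coverage


# Toy data, invented for the example.
train = [("AAGVV", "HH-EE")]
test = [("AGVAA", "H-E-H")]
for size in (1, 3):
    print(size, exact_match_bound(train, test, size))
```

On this toy data, frame size 1 covers every test position with accuracy 0.8, while frame size 3 reaches accuracy 1.0 on only one third of the positions, which is the trade-off the slide describes.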

  40. The Miracle of Homology • People used to be stuck at around 60%. • Rost and Sander crossed the 70% barrier in 1993 using homology information. • All algorithms benefit 5-10% from homology. • The homologues are of unknown structure, training and test sets still unrelated! • Why?

  41. The Miracle of Homology 60%

  42. The Miracle of Homology 70%

  43. Outline • What is the problem? • What are the different approaches? • How do we use decision lists and why? • Why does evolution help?

  44. GORV (Garnier et al., 2002) • Diagram labels: Sequence; PSI-BLAST; Information Function / Bayesian Statistics; Majority Vote; Filter; Secondary Structure • Figures shown: +6.5%, 66.9%, 73.4%

  45. PHD (Rost & Sander, 1993) • Diagram labels: Frequency Profile; HSSP; Neural Network; Neural Network; Jury + Filter; Secondary Structure • Figures shown: +4.3%, 62.6% / 67.4%, +3.4%, 70.8%, 61.7% / 65.9%

  46. JNet (Cuff & Barton, 2000) • Diagram labels: Profile; PSIBLAST; HMMER2; CLUSTALW; Neural Network; Neural Network; Jury + Jury Network; Secondary Structure • Figure shown: 76.9%

  47. PSIPRED (Jones, 1999) • Diagram labels: Profiles; PSI-BLAST; Neural Network; Neural Network; Secondary Structure • Figure shown: 76.3%

  48. Outline • What is the problem? • What are the different approaches? • How do we use decision lists and why? • Why does evolution help?

  49. Introduction to Decision Lists • Prototypical machine learning problem: decide democrat or republican for 435 representatives based on 16 votes.
      Class Name: 2 (democrat, republican)
      1. handicapped-infants: 2 (y,n)
      2. water-project-cost-sharing: 2 (y,n)
      3. adoption-of-the-budget-resolution: 2 (y,n)
      4. physician-fee-freeze: 2 (y,n)
      5. el-salvador-aid: 2 (y,n)
      6. religious-groups-in-schools: 2 (y,n)
      …
      16. export-administration-act-south-africa: 2 (y,n)

  50. Introduction to Decision Lists • Prototypical machine learning problem: decide democrat or republican for 435 representatives based on 16 votes.
      1. If adoption-of-the-budget-resolution = y and anti-satellite-test-ban = n and water-project-cost-sharing = y then democrat
      2. If physician-fee-freeze = y then republican
      3. If TRUE then democrat
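A decision list is evaluated top to bottom, and the first rule whose condition holds decides the class. The sketch below encodes the three rules from slide 50; the data representation (a dict of vote names to "y" or "n") and the function name are assumptions for illustration.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, str]                      # vote name -> "y" or "n"
Rule = Tuple[Callable[[Record], bool], str]  # (condition, class label)

# The three rules from the slide, in order. The last rule is the default.
DECISION_LIST: List[Rule] = [
    (
        lambda r: r.get("adoption-of-the-budget-resolution") == "y"
        and r.get("anti-satellite-test-ban") == "n"
        and r.get("water-project-cost-sharing") == "y",
        "democrat",
    ),
    (lambda r: r.get("physician-fee-freeze") == "y", "republican"),
    (lambda r: True, "democrat"),
]


def classify(record: Record, rules: List[Rule] = DECISION_LIST) -> str:
    """Return the label of the first rule whose condition matches."""
    for condition, label in rules:
        if condition(record):
            return label
    raise ValueError("decision list has no default rule")


print(classify({"physician-fee-freeze": "y"}))  # republican
print(classify({"physician-fee-freeze": "n"}))  # democrat
```

Rule order matters: a record that satisfies both the first and the second condition is labeled democrat, because the first rule is reached before the second.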