
Tutorial 9



Presentation Transcript


  1. Tutorial 9: Secondary Structures

  2. Agenda
     • Evaluating secondary structure prediction tools (and bioinformatics tools in general)
     • Jpred - SS prediction server
     • Pfam - protein families DB
     • UniProt - a central repository of proteins
     • Cool story of the day: How can you help cure diseases by playing a game? (Foldit)

  3. Protein Secondary Structure Prediction
     [Figure: the sequence TDVEAAVNSLVNLYLQASYLS, with question marks where its secondary structure is still unknown]

  4. Protein secondary structure prediction
     • Input: a protein sequence.
     • Output: for each residue, its secondary structure (SS) class: alpha helix, beta strand, or loop (coil).

  5. Evaluating secondary structure prediction methods
     • Assume you have a new method for SS prediction.
     • Given the following sequence, you get the result:
       GLGGYMLGSAMSRPMIHFGNDWEDRYYRENMYRYPNQVYYRPVDQYSNQNNFVHDCVNIT
       ---EEEEEEE---EEEE-------HHHHHHHH-----EEEE---------EEEEEEEEEE
       Coil: - , Beta strand: E , Alpha helix: H
     How can you assess how good your result is?
     Compare it to the TRUTH, assuming this structure exists (what if it doesn't?).
     Calculate the percentage of amino acids whose secondary structure class (helix, coil, or sheet) is correctly predicted.

  6. Evaluating secondary structure prediction methods
     Original sequence:
       GLGGYMLGSAMSRPMIHFGNDWEDRYYRENMYRYPNQVYYRPVDQYSNQNNFVHDCVNIT
     Prediction:
       ---EEEEEEE---EEEE-------HHHHHHHH-----EEEE---------EEEEEEEEEE
     Truth (from a PDB file):
       -----EE-------------HHHHHHHHHH--------EE--------HHHHHHH-----

  7. Evaluating secondary structure prediction methods
     Sequence:    GLGGYMLGSAMSRPMIHFGNDWEDRYYRENMYRYPNQVYYRPVDQYSNQNNFVHDCVNIT
     Prediction:  ---EEEEEEE---EEEE-------HHHHHHHH-----EEEE---------EEEEEEEEEE
     Truth:       -----EE-------------HHHHHHHHHH--------EE--------HHHHHHH-----
     Correct?     YYYNNYYNNNYYYNNNNYYYNNNNYYYYYYNNYYYYYNYYNYYYYYYYNNNNNNNNNNNN
     • Overall, there are 60 AA.
     • The number of correctly predicted residues (Y) is 31.
     • So the score of this method would be 31/60 ≈ 51.67%.
     What can be the problem with such a calculation?
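
The per-residue score above can be reproduced with a few lines of Python. This is only a minimal sketch: the function name `per_residue_accuracy` is made up for illustration, and the two strings are copied from the slide as printed here (60 characters each).

```python
# Prediction and truth strings copied from the slide above.
PREDICTION = "---EEEEEEE---EEEE-------HHHHHHHH-----EEEE---------EEEEEEEEEE"
TRUTH = "-----EE-------------HHHHHHHHHH--------EE--------HHHHHHH-----"


def per_residue_accuracy(prediction, truth):
    """Return (number correct, number of residues, Y/N match string)."""
    if len(prediction) != len(truth):
        raise ValueError("prediction and truth must have the same length")
    # Mark each position Y when the predicted class equals the true class.
    marks = "".join("Y" if p == t else "N" for p, t in zip(prediction, truth))
    return marks.count("Y"), len(truth), marks


correct, total, marks = per_residue_accuracy(PREDICTION, TRUTH)
print(marks)
print(f"{correct}/{total} = {100 * correct / total:.2f}%")
```

The printed Y/N string should be the same as the one on the slide.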

  8. Evaluating secondary structure prediction methods
     • What can be the problem with such a calculation?
     • Assume that alpha helix is the SS of 60% of the residues.
     • Then a constant prediction of alpha helix for every residue would yield a score of 60%.
     • This scoring method rewards overprediction of the more common secondary structure classes in the database.
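
The same problem shows up in the tutorial's own example, where coil happens to be the most common class in the truth string. The sketch below (helper names are hypothetical) compares the example prediction with a "predictor" that outputs the majority class for every residue.

```python
from collections import Counter

# Strings copied from the earlier slides of this tutorial.
PREDICTION = "---EEEEEEE---EEEE-------HHHHHHHH-----EEEE---------EEEEEEEEEE"
TRUTH = "-----EE-------------HHHHHHHHHH--------EE--------HHHHHHH-----"


def accuracy(prediction, truth):
    """Fraction of residues whose predicted class matches the truth."""
    return sum(p == t for p, t in zip(prediction, truth)) / len(truth)


# Find the most common class in the truth and predict it everywhere.
majority_class, majority_count = Counter(TRUTH).most_common(1)[0]
constant_prediction = majority_class * len(TRUTH)

print("majority class:", repr(majority_class))
print(f"constant predictor: {accuracy(constant_prediction, TRUTH):.2%}")
print(f"example prediction: {accuracy(PREDICTION, TRUTH):.2%}")
```

On this sequence the constant predictor scores higher than the example prediction, even though it carries no information about the structure.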

  9. Evaluating secondary structure prediction methods
     There are other ways to measure correlation between the result and the ‘truth’. Most of them rely on the ratio between:
     • True positive (TP) = correctly identified
     • True negative (TN) = correctly rejected
     • False positive (FP) = incorrectly identified
     • False negative (FN) = incorrectly rejected

  10. Evaluating secondary structure prediction methods
      For instance, for the α-helix:
      • TP: number of α-helix residues that are correctly predicted.
      • TN: number of residues observed in β-strands and loops that are not predicted as α-helix.
      • FP: number of residues incorrectly predicted in α-helix conformation.
      • FN: number of residues observed in α-helices but predicted to be either in β-strands or loops.
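
These four counts can be obtained for any class by treating that class as "positive" and the other two as "negative". A minimal sketch, using the example strings from the earlier slides as printed here (60 residues); `confusion_counts` is a hypothetical helper name, not part of any tool discussed in this tutorial.

```python
# Strings copied from the earlier slides of this tutorial.
PREDICTION = "---EEEEEEE---EEEE-------HHHHHHHH-----EEEE---------EEEEEEEEEE"
TRUTH = "-----EE-------------HHHHHHHHHH--------EE--------HHHHHHH-----"


def confusion_counts(prediction, truth, cls):
    """Count TP, FP, FN and TN for one class; all other classes are negatives."""
    if len(prediction) != len(truth):
        raise ValueError("prediction and truth must have the same length")
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for p, t in zip(prediction, truth):
        if p == cls and t == cls:
            counts["TP"] += 1  # predicted cls, truly cls
        elif p == cls:
            counts["FP"] += 1  # predicted cls, truly something else
        elif t == cls:
            counts["FN"] += 1  # truly cls, predicted something else
        else:
            counts["TN"] += 1  # neither predicted nor truly cls
    return counts


for cls, name in (("H", "alpha helix"), ("E", "beta strand"), ("-", "coil")):
    print(name, confusion_counts(PREDICTION, TRUTH, cls))
```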

  11. Sensitivity and specificity
      • Sensitivity and specificity are statistical measures of the performance of a classification test.
      • Sensitivity measures the proportion of actual positives that are correctly identified as such (e.g. the percentage of sick people who are correctly identified as having the condition).
      • Specificity measures the proportion of actual negatives that are correctly identified as such (e.g. the percentage of healthy people who are correctly identified as not having the condition).
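
Written with the TP, TN, FP and FN counts defined earlier, the two measures are the standard ratios:

```latex
\[
\text{sensitivity} = \frac{TP}{TP + FN},
\qquad
\text{specificity} = \frac{TN}{TN + FP}
\]
```

The denominator of sensitivity is the number of actual positives; the denominator of specificity is the number of actual negatives.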

  12. Sensitivity and specificity
      • Question: If the predictor perfectly predicts the truth, what would be the sensitivity rate? The specificity rate?
      • Answer: A perfect predictor would be described as ______% sensitivity (i.e. predict all people from the sick group as sick) and ______% specificity (i.e. not predict anyone from the healthy group as sick).

  13. Sensitivity and specificity
      • For any classification test, there is usually a trade-off between the two measures.
      • Example: in an airport security setting in which one is testing for potential threats to safety, scanners may be set to trigger on low-risk items like belt buckles and keys (low specificity), in order to reduce the risk of missing objects that do pose a threat to the aircraft and those aboard (high sensitivity).
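
The trade-off can be illustrated with a toy scanner that flags an item when its score reaches a threshold. All scores below are made up for illustration only; they do not come from any real scanner or dataset, and `rates_at` is a hypothetical helper.

```python
# Made-up scanner scores (illustrative assumption, not real data).
THREAT_SCORES = [0.9, 0.8, 0.6, 0.4]          # items that really are threats
HARMLESS_SCORES = [0.7, 0.5, 0.3, 0.2, 0.1]   # belt buckles, keys, ...


def rates_at(threshold):
    """Return (sensitivity, specificity) when items scoring >= threshold are flagged."""
    tp = sum(s >= threshold for s in THREAT_SCORES)
    fn = len(THREAT_SCORES) - tp
    fp = sum(s >= threshold for s in HARMLESS_SCORES)
    tn = len(HARMLESS_SCORES) - fp
    return tp / (tp + fn), tn / (tn + fp)


# Lowering the threshold makes the scanner trigger more easily.
for threshold in (0.85, 0.55, 0.25):
    sensitivity, specificity = rates_at(threshold)
    print(f"threshold {threshold}: sensitivity {sensitivity:.2f}, specificity {specificity:.2f}")
```

With these toy numbers, each lower threshold catches more threats (higher sensitivity) at the cost of flagging more harmless items (lower specificity).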

  14. Bioinformatics examples of sensitivity vs. specificity
      For each change, consider what happens to sensitivity and to specificity:
      • Mapping sequenced reads to a reference genome: when increasing the allowed number of mismatches per read
      • BLAST: when lowering a P-value threshold
      • K-means clustering: when increasing the number of neighbors K

  15. Sensitivity and specificity

  16. Exercise
      Calculate the specificity and sensitivity of the alpha helix prediction in the following SS prediction:
      Original sequence:
        GLGGYMLGSAMSRPMIHFGNDWEDRYYRENMYRYPNQVYYRPVDQYSNQNNFVHDCVNIT
      Prediction:
        ---EEEEEEE---EEEE-------HHHHHHHH-----EEEE---------EEEEEEEEEE
      Truth (from a PDB file):
        -----EE-------------HHHHHHHHHH--------EE--------HHHHHHH-----

  17. Answer
      Pred.  ---EEEEEEE---EEEE-------HHHHHHHH-----EEEE---------EEEEEEEEEE
      Truth  -----EE-------------HHHHHHHHHH--------EE--------HHHHHHH-----
      Alpha helix:
      • TP = 6 (alpha-helix residues correctly identified)
      • FP = 2 (residues incorrectly identified as alpha helix)
      • FN = 4 + 7 = 11 (alpha-helix residues incorrectly rejected)
      • TN = 60 - (6 + 2 + 11) = 41
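
To finish the exercise, the counts are plugged into the definitions of sensitivity and specificity. The sketch below recomputes the alpha-helix counts from the two strings as printed here (60 residues each) and then derives both measures; `sensitivity_specificity` is a hypothetical helper name.

```python
# Strings copied from the exercise slide.
PREDICTION = "---EEEEEEE---EEEE-------HHHHHHHH-----EEEE---------EEEEEEEEEE"
TRUTH = "-----EE-------------HHHHHHHHHH--------EE--------HHHHHHH-----"


def sensitivity_specificity(tp, fp, fn, tn):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    return tp / (tp + fn), tn / (tn + fp)


# Alpha helix ("H") is the positive class; strand and coil are negatives.
pairs = list(zip(PREDICTION, TRUTH))
tp = sum(p == "H" and t == "H" for p, t in pairs)
fp = sum(p == "H" and t != "H" for p, t in pairs)
fn = sum(p != "H" and t == "H" for p, t in pairs)
tn = sum(p != "H" and t != "H" for p, t in pairs)

sensitivity, specificity = sensitivity_specificity(tp, fp, fn, tn)
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"sensitivity = {sensitivity:.1%}, specificity = {specificity:.1%}")
```

The helix prediction is therefore quite specific but not very sensitive: it rarely calls a helix where there is none, but it misses most of the true helix residues.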

  18. Jpred 3 – SS prediction server

  19. MSA Final SS prediction Buried/exposed prediction Reliability score

  20. Jpred 3 – SS prediction server
      Original sequence:
        GLGGYMLGSAMSRPMIHFGNDWEDRYYRENMYRYPNQVYYRPVDQYSNQNNFVHDCVNIT
      Jpred prediction + reliability:
        -----HHHH------------HHHHHHHHHHH-------------------EEE------
        997500000026777567776017899988721577400467777777773000000699
      Truth (from a PDB file):
        -----EE-------------HHHHHHHHHH--------EE--------HHHHHHH-----
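
One way to use the reliability line is to check whether the prediction is more accurate on the residues where the server is more confident. The sketch below assumes that each digit is a per-residue score from 0 (least reliable) to 9 (most reliable); the function name and the chosen thresholds are illustrative only.

```python
# Strings copied from the Jpred slide above.
JPRED = "-----HHHH------------HHHHHHHHHHH-------------------EEE------"
RELIABILITY = "997500000026777567776017899988721577400467777777773000000699"
TRUTH = "-----EE-------------HHHHHHHHHH--------EE--------HHHHHHH-----"


def accuracy_at_reliability(prediction, truth, reliability, min_score):
    """Return (correct, kept) over residues whose reliability digit is >= min_score."""
    kept = [
        (p, t)
        for p, t, r in zip(prediction, truth, reliability)
        if int(r) >= min_score  # assumption: higher digit means more reliable
    ]
    correct = sum(p == t for p, t in kept)
    return correct, len(kept)


for min_score in (0, 5, 7, 9):
    correct, kept = accuracy_at_reliability(JPRED, TRUTH, RELIABILITY, min_score)
    share = f"{100 * correct / kept:.1f}%" if kept else "n/a"
    print(f"reliability >= {min_score}: {correct}/{kept} correct ({share})")
```

A threshold of 0 keeps every residue, so the first line is the plain per-residue accuracy of the Jpred prediction on this sequence.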

  21. Pfam
      http://pfam.sanger.ac.uk/
      Proteins are generally composed of one or more functional regions, commonly termed domains. Different combinations of domains give rise to the diverse range of proteins found in nature.
      The Pfam database is a large collection of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs).

  22. Glossary
      • Domain: a structural unit which can be found in multiple protein contexts. Domains are long motifs (30-100 aa).
      • Family: a collection of related proteins.

  23. What kind of domains can we find in Pfam?
      • Trusted domains
      • Repeats
      • Fragment domains
      • Nested domains
      • Disulfide bonds
      • Important residues (e.g. active sites)
      • Transmembrane domains

  24. Pfam input

  25. Domains Domain range and score

  26. Description Structure info Gene Ontology Links

  27. Domain organization

  28. HMM logo

  29. Known structures for the domain

  30. UniProt
      http://www.uniprot.org/
      • The Universal Protein Resource (UniProt) is a central repository of protein sequences, functions, classifications and cross-references.
      • It was created by joining the information contained in Swiss-Prot and TrEMBL.

  31. Protein search Uniprot input Reviewed protein

  32. Sequence download Uniprot output Accession number Protein status organism length

  33. Information for one protein General information annotations

  34. General keywords GO annotation (MF, BP, CC)

  35. Alternative splicing isoforms Features in the sequence

  36. Sequences References

  37. Blast

  38. Cool Story of the day How can you help cure diseases by playing a game?

  39. Foldit is an online game in which humans try to solve one of the hardest computational problems in biology: protein folding. You don't need to know anything about biology to play the game, although a little background will help. http://fold.it/portal/

  40. Even small proteins have on the order of 1000 degrees of freedom. Human reasoning may optimize the complex algorithms.

  41. Not just a game…
