
Lesson 5

Lesson 5. Protein Prediction and Classification. Learning about a protein. What does a protein do?? Post-translational modifications – phosphorylation, glycosylation, etc. Identifying patterns, motifs Secondary structure Tertiary/quaternary structure Protein-protein interactions.

skyla




Presentation Transcript


  1. Lesson 5 Protein Prediction and Classification

  2. Learning about a protein What does a protein do?? • Post-translational modifications – phosphorylation, glycosylation, etc. • Identifying patterns, motifs • Secondary structure • Tertiary/quaternary structure • Protein-protein interactions

  3. Domains & Motifs

  4. Domains • An analysis of known 3-D protein structures reveals that, rather than being monolithic, many of them contain multiple folding units. • Each such folding unit is a domain (>50 aa, < 500 aa)

  5. calcium/calmodulin-dependent protein kinase • SH2 domains interact with phosphorylated tyrosines, and are thus part of intracellular signal-transducing proteins. • They are characterized by specific sequences and tertiary structure

  6. What is a motif? • A sequence motif = a certain sequence that is widespread and conjectured to have biological significance • Examples: • KDEL – ER-lumen retention signal • PKKKRKV – an NLS (nuclear localization signal)

  7. More loosely defined motifs • KDEL (usually) + HDEL (rarely) = [HK]-D-E-L: H or K at the first position • This is called a pattern (in biology), or a regular expression (in computer science)
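The [HK]-D-E-L pattern maps directly onto a regular expression. A minimal sketch in Python, assuming the signal has to sit at the C-terminal end of the protein; the sequences and the function name are made up for illustration:

```python
import re

# [HK]-D-E-L: H or K at the first position, then D, E, L.
# The trailing "$" anchors the match at the C-terminus (an assumption here).
ER_RETENTION = re.compile(r"[HK]DEL$")

def has_er_retention_signal(protein: str) -> bool:
    """Return True if the sequence ends with KDEL or HDEL."""
    return ER_RETENTION.search(protein) is not None

# Hypothetical sequences, for illustration only.
examples = {
    "MSTAVLKDEL": has_er_retention_signal("MSTAVLKDEL"),  # ends with KDEL
    "MSTAVLHDEL": has_er_retention_signal("MSTAVLHDEL"),  # ends with HDEL
    "MKDELSTAVL": has_er_retention_signal("MKDELSTAVL"),  # KDEL is internal
}
```

Only the first two sequences are reported as carrying the signal, because the third has KDEL in the middle of the chain rather than at its end.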

  8. Syntax of a pattern • Example:W-x(9,11)-[FYV]-[FYW]-x(6,7)-[GSTNE].

  9. Patterns • W-x(9,11)-[FYV]-[FYW]-x(6,7)-[GSTNE]. • x(9,11): any amino acid, between 9 and 11 times • [FYV]: F or Y or V • Example sequence: WOPLASDFGYVWPPPLAWSROPLASDFGYVWPPPLAWSWOPLASDFGYVWPPPLSQQQ

  10. Patterns - syntax • Residues are written with the standard IUPAC one-letter codes. • ‘x’ : any amino acid. • ‘[]’ : residues allowed at the position. • ‘{}’ : residues forbidden at the position. • ‘()’ : repetition of a pattern element is indicated in parentheses: x(n) or x(n,m) gives the number or range of repetitions. • ‘-’ : separates each pattern element. • ‘<’ : indicates an N-terminal restriction of the pattern. • ‘>’ : indicates a C-terminal restriction of the pattern. • ‘.’ : the period ends the pattern.
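The syntax rules above are close enough to regular-expression syntax that a pattern can be translated mechanically. The following is a sketch, not a full PROSITE parser: the function name `prosite_to_regex` is made up, and it assumes the N- and C-terminal restrictions are written as `<` and `>`.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    pattern = pattern.strip().rstrip(".")       # the period ends the pattern
    parts = []
    for element in pattern.split("-"):          # '-' separates pattern elements
        prefix = suffix = ""
        if element.startswith("<"):             # N-terminal restriction
            prefix, element = "^", element[1:]
        if element.endswith(">"):               # C-terminal restriction
            suffix, element = "$", element[:-1]
        # Split off an optional repetition such as (3) or (9,11).
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = m.group(1), m.group(2), m.group(3)
        if core in ("x", "X"):
            core = "."                          # any amino acid
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"      # forbidden residues
        # '[...]' and single letters are already valid regex as written.
        repeat = ""
        if low is not None:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)

# The pattern and the sequence from the "Patterns" slides.
regex = prosite_to_regex("W-x(9,11)-[FYV]-[FYW]-x(6,7)-[GSTNE].")
sequence = "WOPLASDFGYVWPPPLAWSROPLASDFGYVWPPPLAWSWOPLASDFGYVWPPPLSQQQ"
match = re.search(regex, sequence)
```

Here `regex` is `W.{9,11}[FYV][FYW].{6,7}[GSTNE]`, and the first match covers the first 19 residues of the sequence (`WOPLASDFGYVWPPPLAWS`). Note that `re.search` reports only the leftmost match; overlapping matches would need a lookahead-based scan.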

  11. Pattern ~ motif ~ signature • A pattern (similarly to a consensus and a profile) is a way to represent a conserved sequence • Whereas a profile and a consensus usually relate to the entire sequence, a pattern usually relates to a few tens of amino acids

  12. Profile-pattern-consensus • (Figure: a multiple alignment summarized as a consensus, as the pattern [AC]-A-[GC]-T-[TC]-[GC], and as a profile) • Information: consensus < pattern < profile
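To make the "consensus < pattern < profile" ordering concrete, here is a small sketch that derives all three from one alignment. The three aligned sequences are hypothetical, chosen only so that they reproduce the pattern on the slide; the real alignment from the figure is not available.

```python
from collections import Counter

# Hypothetical alignment, consistent with [AC]-A-[GC]-T-[TC]-[GC].
alignment = ["AAGTTG", "CACTCC", "AAGTTG"]

# One residue counter per alignment column.
columns = [Counter(column) for column in zip(*alignment)]

# Consensus: keep only the most common residue of each column.
consensus = "".join(c.most_common(1)[0][0] for c in columns)

def column_to_element(counts: Counter) -> str:
    """Pattern element: every residue seen in the column, most frequent first."""
    residues = [residue for residue, _ in counts.most_common()]
    return residues[0] if len(residues) == 1 else "[" + "".join(residues) + "]"

# Pattern: which residues are allowed, but not how often each occurs.
pattern = "-".join(column_to_element(c) for c in columns)

# Profile: the frequency of every residue in every column.
profile = [{r: n / len(alignment) for r, n in c.items()} for c in columns]
```

The consensus (`AAGTTG`) drops the minority residues, the pattern keeps them but loses their frequencies, and the profile keeps both, which is why it carries the most information.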

  13. Interpro • Interpro: a collection of many protein signature databases (Prosite, Pfam, Prints…) integrated into a hierarchical classifying system

  14. Interpro example

  15. PTM – Post-Translational Modification

  16. PTM – Post-Translational Modification • Phosphorylation: Tyr, Ser, Thr • Glycosylation (addition of sugars): Asn, Ser, Thr • Addition of fatty acids (e.g. N-myristoylation, S-palmitoylation)

  17. So how to predict • Take into account: • Context (motif): PKC (a kinase) recognizes X-[S/T]-X-[R/K]; N-myristoylation occurs at M-G-X-X-X-[S/T]. In many cases we don’t know the exact motif! • Conservation: is the motif found (for instance, in human) also conserved in related organisms (for instance, in chimp)?
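Both criteria on this slide, motif context and conservation, can be sketched with a few lines of Python. The two sequences below are invented for illustration and are assumed to be already aligned without gaps, so that equal positions correspond; the function names are likewise hypothetical.

```python
import re

# X-[ST]-X-[RK]: a lookahead is used so overlapping candidates are all found.
PKC_SITE = re.compile(r"(?=.[ST].[RK])")
# M-G-X-X-X-[ST] at the start of the protein.
N_MYRISTOYLATION = re.compile(r"^MG...[ST]")

def pkc_candidate_sites(protein: str) -> list[int]:
    """1-based positions of the S/T residue in each X-[ST]-X-[RK] match."""
    return [m.start() + 2 for m in PKC_SITE.finditer(protein)]

def has_n_myristoylation_motif(protein: str) -> bool:
    return N_MYRISTOYLATION.match(protein) is not None

def conserved_sites(sites_a: list[int], sites_b: list[int]) -> list[int]:
    """Sites predicted at the same position in both (ungapped) sequences."""
    return sorted(set(sites_a) & set(sites_b))

# Hypothetical orthologous sequences; they differ at position 12 (S vs. A).
seq_a = "MGAAKSLKRTPSVRAA"
seq_b = "MGAAKSLKRTPAVRAA"

sites_a = pkc_candidate_sites(seq_a)
sites_b = pkc_candidate_sites(seq_b)
shared = conserved_sites(sites_a, sites_b)
```

In this toy case the first sequence has candidate sites at positions 6 and 12, but only position 6 is also found in the second sequence, so it would be the better-supported prediction.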

  18. Prediction problems • Signal for detection is very short • Not enough biological knowledge for characterizing the signal • Tertiary structure

  19. Prediction will be more efficient if more information is available

  20. Secondary Structure

  21. Secondary Structure • Reminder: secondary structure is usually divided into three categories: • Alpha helix • Beta strand (sheet) • Anything else – turn/loop

  22. Secondary Structure • An easier question – what is the secondary structure when the 3D structure is known?

  23. DSSP • DSSP (Dictionary of Protein Secondary Structure) – assigns secondary structure to proteins which have a crystal structure • H = alpha helix • B = beta bridge (isolated residue) • E = extended beta strand • G = 3-turn helix • I = 5-turn helix • T = hydrogen-bonded turn • S = bend
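The DSSP codes are often collapsed into the three categories from the earlier reminder slide (helix, strand, anything else). A minimal sketch of one such reduction follows; the mapping (H, G, I to helix; E, B to strand; everything else to coil) is a common convention but an assumption here, and the DSSP string is made up.

```python
# Assumed reduction of DSSP codes to three states: H (helix), E (strand), C (other).
THREE_STATE = {
    "H": "H", "G": "H", "I": "H",   # alpha, 3-turn and 5-turn helices
    "E": "E", "B": "E",             # extended strand and isolated beta bridge
}

def reduce_to_three_states(dssp: str) -> str:
    """Map every DSSP code to H, E or C; unknown codes and blanks become C."""
    return "".join(THREE_STATE.get(code, "C") for code in dssp)

# Hypothetical per-residue DSSP assignment.
dssp_string = "-TTHHHHGGGSEEEEB-"
three_state = reduce_to_three_states(dssp_string)
```

Some benchmarks treat B as coil instead of strand, so the mapping should be stated whenever three-state accuracies are compared.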

  24. Predicting secondary structure from primary sequence

  25. Chou and Fasman (1974) • The propensity of an amino acid to be part of a certain secondary structure (e.g. proline has a low propensity of being in an alpha helix or beta sheet → breaker)

      Name            P(a)   P(b)   P(turn)
      Alanine          142     83      66
      Arginine          98     93      95
      Aspartic Acid    101     54     146
      Asparagine        67     89     156
      Cysteine          70    119     119
      Glutamic Acid    151     37      74
      Glutamine        111    110      98
      Glycine           57     75     156
      Histidine        100     87      95
      Isoleucine       108    160      47
      Leucine          121    130      59
      Lysine           114     74     101
      Methionine       145    105      60
      Phenylalanine    113    138      60
      Proline           57     55     152
      Serine            77     75     143
      Threonine         83    119      96
      Tryptophan       108    137      96
      Tyrosine          69    147     114
      Valine           106    170      50

  26. Chou-Fasman prediction

      Residue   Ala  Pro  Tyr  Phe  Phe  Lys  Lys  His  Val  Ala  Thr
      α         142   57   69  113  113  114  114  100  106  142   83
      β          83   55  147  138  138   74   74   87  170   83  119

      • Look for a series of >4 amino acids which all have (for instance) alpha helix values >100 • Extend (…) • Accept as alpha helix if average alpha score > average beta score
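The first and last steps of this procedure can be sketched in Python using the α and β values of the eleven-residue example. This is a simplified illustration, not the full Chou-Fasman algorithm: the extension step is omitted, and by default a value of exactly 100 is treated as passing the threshold, which is an assumption (with a strict ">100" the example peptide contains no run longer than four residues).

```python
# Propensities for the residues of the example peptide (values from the table).
P_ALPHA = {"A": 142, "P": 57, "Y": 69, "F": 113, "K": 114, "H": 100, "V": 106, "T": 83}
P_BETA = {"A": 83, "P": 55, "Y": 147, "F": 138, "K": 74, "H": 87, "V": 170, "T": 119}

def candidate_runs(seq, table, threshold=100, min_len=5, inclusive=True):
    """Return (start, end) half-open index pairs of runs of at least min_len
    residues whose propensity passes the threshold."""
    passes = [table[aa] >= threshold if inclusive else table[aa] > threshold
              for aa in seq]
    runs, start = [], None
    for i, ok in enumerate(passes + [False]):   # trailing False closes the last run
        if ok and start is None:
            start = i
        elif not ok and start is not None:
            if i - start >= min_len:
                runs.append((start, i))
            start = None
    return runs

def mean(values):
    return sum(values) / len(values)

def accept_as_helix(seq, run):
    """Accept the run if its average alpha score exceeds its average beta score."""
    segment = seq[run[0]:run[1]]
    return mean([P_ALPHA[aa] for aa in segment]) > mean([P_BETA[aa] for aa in segment])

peptide = "APYFFKKHVAT"                       # Ala Pro Tyr Phe Phe Lys Lys His Val Ala Thr
runs = candidate_runs(peptide, P_ALPHA)       # threshold treated as >= 100
strict_runs = candidate_runs(peptide, P_ALPHA, inclusive=False)
helix_calls = [accept_as_helix(peptide, run) for run in runs]
```

With the inclusive threshold the run Phe-Phe-Lys-Lys-His-Val-Ala is found; its average α score (802/7) is higher than its average β score (764/7), so this sketch would accept it as helix.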

  27. Chou and Fasman (1974) • Success rate of 50%

  28. Improvements in the 1980’s • Conservation in MSA • Smarter algorithms (e.g. HMM, neural networks).

  29. Accuracy • Prediction accuracy seems to hit a ceiling of 70-80%

  30. Gene Ontology

  31. GO • Gene Ontology – a project for consistent description of gene products in different databases. • Consistent description: common key definitions. Example: ‘protein synthesis’ or ‘translation’

  32. GO • GO describes proteins in terms of: • biological process • cellular component • molecular function • GO is not: • A sequence database • A portal for sequence information

  33. GO – structure • cellular component → cell → nucleus → nuclear chromosome (from the most general term to the most specific)
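The hierarchy on this slide can be represented as a child-to-parents map and walked upward to collect every more general term. A small sketch, with the edges taken only from the four terms on the slide; the real ontology is a graph in which a term may have several parents and several relation types, and this toy map does not attempt to reproduce it.

```python
# Child -> parents, using only the terms shown on the slide (simplified chain).
PARENTS = {
    "nuclear chromosome": ["nucleus"],
    "nucleus": ["cell"],
    "cell": ["cellular component"],
    "cellular component": [],
}

def ancestors(term: str) -> list[str]:
    """All more general terms reachable from `term`, nearest first."""
    found, stack = [], list(PARENTS[term])
    while stack:
        current = stack.pop()
        if current not in found:
            found.append(current)
            stack.extend(PARENTS[current])
    return found

lineage = ancestors("nuclear chromosome")
```

A protein annotated with "nuclear chromosome" is therefore implicitly annotated with "nucleus", "cell" and "cellular component" as well, which matters when counting genes per category.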

  34. GO example • Links from the Swiss-Prot entry of human protein kinase C alpha

  35. Examples for use of GO • Enrichment for a GO category: • Do all up regulated genes in a microarray you built belong to the same GO “molecular function” category? • You have predicted a new transcription factor binding site. Do all genes with this site belong to the same GO biological process?
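One common way to quantify such enrichment is a one-sided hypergeometric test, which asks how likely it is to see at least this many category members in the selected gene set by chance. The slides do not name a specific test, so this choice is an assumption, and the function name and numbers below are purely illustrative.

```python
from math import comb

def enrichment_p_value(total: int, in_category: int,
                       selected: int, selected_in_category: int) -> float:
    """P(X >= selected_in_category) when `selected` genes are drawn without
    replacement from `total` genes, `in_category` of which carry the GO term."""
    upper = min(in_category, selected)
    favourable = sum(
        comb(in_category, i) * comb(total - in_category, selected - i)
        for i in range(selected_in_category, upper + 1)
    )
    return favourable / comb(total, selected)

# Hypothetical numbers: 20 genes on the array, 5 annotated with the category,
# 4 genes up-regulated, 3 of which belong to the category.
p_value = enrichment_p_value(total=20, in_category=5,
                             selected=4, selected_in_category=3)
```

For these toy numbers the p-value is 155/4845, roughly 0.032. In practice many categories are tested at once, so a multiple-testing correction would also be needed.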

  36. Evaluation of prediction methods

  37. Evaluation of prediction methods • Comparing our results to experimentally verified sites • (Table: what our prediction gives vs. whether the prediction is correct)

  38. Method evaluation • A good method will be one with a high level of true positives and true negatives, and a low level of false positives and false negatives • (Table: what our prediction gives vs. whether the prediction is correct)

  39. Calibrating the method • All methods have a parameter (or a score) that can be calibrated to improve the accuracy of the method. • For example: the E-value cutoff in BLAST

  40. Calibrating E-value cutoff • Reminder: the lower the E-value, the more ‘significant’ the alignment between the query and the hit.

  41. Calibrating the E-value • What will happen if we raise the E-value cutoff (for instance, work with all hits with an E-value < 10)? • (Table: what our prediction gives vs. whether the prediction is correct)

  42. Calibrating the E-value • On the other hand, what if we lower the E-value cutoff (look only at hits with E-value < 10^-8)? • (Table: what our prediction gives vs. whether the prediction is correct)

  43. Improving prediction • Trade-off between specificity and sensitivity

  44. Sensitivity vs. specificity • Sensitivity = True positive / (True positive + False negative): how well we hit real phosphorylations. The denominator represents all the proteins which are really phosphorylated • Specificity = True negative / (True negative + False positive): how well we avoid real non-phosphorylations. The denominator represents all the proteins which are really NOT phosphorylated
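The two ratios follow directly from the four counts of the evaluation table. A short sketch, with made-up site numbers standing in for predicted and experimentally verified phosphorylation sites:

```python
def confusion_counts(predicted: set, verified: set, candidates: set):
    """Return (TP, FP, TN, FN) for a set of predictions over all candidates."""
    tp = len(predicted & verified)               # predicted and real
    fp = len(predicted - verified)               # predicted but not real
    fn = len(verified - predicted)               # real but missed
    tn = len(candidates - predicted - verified)  # correctly left out
    return tp, fp, tn, fn

def sensitivity(tp: int, fn: int) -> float:
    return tp / (tp + fn)

def specificity(tn: int, fp: int) -> float:
    return tn / (tn + fp)

# Hypothetical example: 10 candidate sites, 3 verified, 4 predicted.
candidates = set(range(1, 11))
verified = {2, 5, 7}
predicted = {2, 5, 8, 9}

tp, fp, tn, fn = confusion_counts(predicted, verified, candidates)
sens = sensitivity(tp, fn)   # 2 of the 3 real sites were found
spec = specificity(tn, fp)   # 5 of the 7 non-sites were correctly rejected
```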

  45. Raising the E-value cutoff to 10: sensitivity increases, specificity decreases • Lowering the E-value cutoff to 10^-8: sensitivity decreases, specificity increases
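The trade-off can be seen by scoring the same list of hits at both cutoffs. The hits below are invented (E-value plus a flag saying whether the hit is a true homolog); only the direction of the change matters, not the particular numbers.

```python
# Hypothetical BLAST-like hits: (E-value, is the hit a true homolog?)
hits = [
    (1e-30, True), (1e-12, True), (1e-9, False), (1e-5, True),
    (0.5, True), (2.0, False), (8.0, False), (50.0, False),
]

def sensitivity_specificity(hits, cutoff):
    """Accept every hit with E-value below the cutoff and score the result."""
    tp = sum(1 for e, real in hits if e < cutoff and real)
    fn = sum(1 for e, real in hits if e >= cutoff and real)
    fp = sum(1 for e, real in hits if e < cutoff and not real)
    tn = sum(1 for e, real in hits if e >= cutoff and not real)
    return tp / (tp + fn), tn / (tn + fp)

permissive = sensitivity_specificity(hits, cutoff=10)     # high cutoff
stringent = sensitivity_specificity(hits, cutoff=1e-8)    # low cutoff
```

With the permissive cutoff every true homolog is accepted (sensitivity 1.0) but most false hits come along (specificity 0.25); with the stringent cutoff the picture reverses (sensitivity 0.5, specificity 0.75).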

  46. Over-predictions: example • Many PTM-predictors tend to over-predict → high level of false positives → low specificity • WHY? • Tertiary structure! (buried/exposed, tertiary motifs) • The phosphorylation recognition mechanism is not completely clear!

  47. Next time on: Biological Sequences Analysis

  48. The Human Genome

  49. Horizontal (Lateral) Gene Transfer

  50. Alternative splicing
