Master Course Sequence Alignment Lecture 9 Database searching (3)

Presentation Transcript


  1. Center for Integrative Bioinformatics VU. Master Course Sequence Alignment, Lecture 9: Database searching (3)

  2. Dot-plots: a simple way to visualise sequence similarity. Filter: 6/10 residues have to match. Can be a bit messy, though...

  3. Dot-plots, what about... • Insertions/deletions (DNA and proteins) • Duplications, e.g. tandem repeats (DNA and proteins) • Inversions (DNA). Dot plots are calculated using a diagonal window of preset length that is slid through the search matrix; typically the central cell holds the window score (e.g. sum or average).
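
The sliding diagonal window described above can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own implementation: the function names `dot_plot` and `render` are hypothetical, the window score is assumed to be the number of identical residue pairs, and the defaults follow the "6/10 residues have to match" filter from the previous slide.

```python
def dot_plot(seq_a, seq_b, window=10, min_matches=6):
    """Return the set of (i, j) cells whose diagonal window passes the filter.

    (i, j) is the central cell of a window of `window` residue pairs taken
    along the diagonal; the window score is the number of identical pairs
    (an assumption, a substitution-matrix sum would work the same way).
    """
    half = window // 2
    dots = set()
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            matches = sum(seq_a[i + k] == seq_b[j + k] for k in range(window))
            if matches >= min_matches:
                dots.add((i + half, j + half))
    return dots


def render(seq_a, seq_b, dots):
    """Draw the plot as text: one row per residue of seq_a, '*' for a dot."""
    return "\n".join(
        "".join("*" if (i, j) in dots else "." for j in range(len(seq_b)))
        for i in range(len(seq_a))
    )


if __name__ == "__main__":
    s = "ACDEFGHIKLACDEFGHIKL"  # toy sequence containing a tandem repeat
    print(render(s, s, dot_plot(s, s)))
```

In a self-comparison the main diagonal is always present; the tandem repeat in the toy sequence shows up as additional dots on parallel off-diagonals.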

  4. Dot-plots, self-comparison (figure): direct repeat, tandem repeat, inverted repeat.

  5. charge

  6. (cysteine bridge)

  7. Protein structure hierarchical levels: PRIMARY STRUCTURE (amino acid sequence), e.g. VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH; SECONDARY STRUCTURE (helices, strands); TERTIARY STRUCTURE (fold); QUATERNARY STRUCTURE (oligomers).

  8. Globin fold: α protein, myoglobin (PDB: 1MBN). Helices are labelled ‘A’ (blue) to ‘H’ (red). The D helix can be missing in some globins: what happens with the alignment?

  9. β sandwich: β protein, immunoglobulin (PDB: 7FAB)

  10. TIM barrel: α/β protein, Triose phosphate IsoMerase (PDB: 1TIM)

  11. Pyruvate kinase (phosphotransferase): β barrel regulatory domain; α/β barrel catalytic substrate binding domain; α/β nucleotide binding domain

  12. What does this mean for alignment? • Alignments need to be able to skip anything from secondary structural elements to complete domains (i.e. by putting gaps opposite these elements in the shorter sequence). • Depending on the gap penalties chosen, the algorithm might have difficulty making such long gaps (for example when using high affine gap penalties), resulting in an incorrect alignment.
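
To make the point about long gaps concrete, here is a small sketch of affine gap costs. It assumes the common convention cost = open + extend × (length − 1); other programs charge the extension for every gapped residue, so treat the formula and the parameter values as illustrative assumptions. The helper `max_affordable_gap` is hypothetical and deliberately crude: it asks how long a gap can be before its cost exceeds the score gained by aligning the region beyond the gap.

```python
def affine_gap_cost(length, gap_open=10.0, gap_extend=0.5):
    """Cost of one gap of `length` residues under an affine penalty.

    Assumed convention: the first gapped residue costs `gap_open`,
    every further residue costs `gap_extend`.
    """
    if length <= 0:
        return 0.0
    return gap_open + gap_extend * (length - 1)


def max_affordable_gap(flank_score, gap_open, gap_extend):
    """Longest gap whose cost does not exceed `flank_score`.

    `flank_score` stands for the alignment score gained by matching the
    region on the far side of the gap (a simplification for illustration).
    """
    if flank_score < gap_open:
        return 0
    return int((flank_score - gap_open) // gap_extend) + 1


if __name__ == "__main__":
    # Skipping a 100-residue domain under mild and under high extension penalties.
    print(affine_gap_cost(100, gap_open=10.0, gap_extend=0.5))
    print(affine_gap_cost(100, gap_open=12.0, gap_extend=2.0))
    # How long a gap a flanking region worth 60 score units can "pay for".
    print(max_affordable_gap(60.0, 10.0, 0.5))
    print(max_affordable_gap(60.0, 12.0, 2.0))
```

With the higher penalties the affordable gap shrinks sharply, which is why a domain-sized insertion may be mis-aligned rather than gapped.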

  13. What does this mean for homology searching? • Database searching algorithms just need to decide whether the alignment score is good enough to infer homology • Sometimes an alignment can be incorrect while its score is still close enough for the database searching method to classify the DB sequence correctly as a homolog (or not) • However, for distant hits the alignment becomes crucial

  14. Sequence Analysis/Database Searching: finding relationships between genes and gene products of different species, including those at large evolutionary distances

  15. Compared to the preceding plot, RMSD is better able to pinpoint relationships between more divergent sequences (RMSD stays relatively small for longer than PAM distance does): structure is more conserved than sequence. Note that the spread around RMSD is larger.

  16. Structural superpositioning RMSD: how far are equivalenced Cα atoms separated on average?
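
As a worked illustration of the RMSD measure, the sketch below computes the root-mean-square deviation over pairs of equivalenced Cα atoms, i.e. the square root of the mean squared distance. It assumes the two structures have already been superposed and that the atom equivalences are given; finding the optimal superposition is a separate step not shown here. The function name `rmsd` and the toy coordinates are made up for the example.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equal-length lists of
    (x, y, z) coordinates of equivalenced C-alpha atoms.

    Assumes the structures are already superposed.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


if __name__ == "__main__":
    # Two toy "structures" of two atoms each; the second atom is 2 units apart.
    a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    b = [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)]
    print(rmsd(a, b))
```

Because the distances are squared before averaging, a few badly superposed atoms raise the RMSD more than they would raise a plain average distance.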

  17. Two superposed protein structures with two well-superposed helices. Red: well superposed. Blue: low match quality. The human (PDB code 1kjs) and pig (1c5a) C5 anaphylatoxin proteins are superposed.

  18. How to assess homology search methods • We need an annotated database, so that we know which sequences belong to which homologous (super)families • Examples of databases of homologous families are Pfam, HOMSTRAD and ASTRAL • The idea is to take a protein sequence from a given homologous family, run the search method, and then assess how well the method has carried out the search • This should be repeated for many query sequences, after which the overall performance can be measured
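
The assessment procedure above can be sketched as a small benchmarking loop. Everything here is an illustrative assumption: `family_of` is a hypothetical dictionary standing in for an annotated database, and `search` is any callable that takes a query identifier and returns the set of database identifiers it reports as hits.

```python
def evaluate_query(query, hits, family_of):
    """Count TP, FP, FN and TN for one query, given the reported hits."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for seq_id, family in family_of.items():
        if seq_id == query:
            continue  # the query itself is not scored
        is_homolog = family == family_of[query]
        is_hit = seq_id in hits
        if is_hit and is_homolog:
            counts["TP"] += 1
        elif is_hit:
            counts["FP"] += 1
        elif is_homolog:
            counts["FN"] += 1
        else:
            counts["TN"] += 1
    return counts


def benchmark(search, family_of):
    """Run `search` with every annotated sequence as query and sum the counts."""
    totals = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for query in family_of:
        counts = evaluate_query(query, search(query), family_of)
        for key in totals:
            totals[key] += counts[key]
    return totals


if __name__ == "__main__":
    # Toy annotation: two families, identifiers are made up.
    family_of = {"a1": "A", "a2": "A", "a3": "A", "b1": "B", "b2": "B"}
    # A toy "search method" that always reports the same two sequences.
    print(benchmark(lambda query: {"a2", "b1"}, family_of))
```

Summing the counts over many queries gives the overall figures from which sensitivity and related measures are then derived.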

  19. Example (annotated family alignment):
     C; family: zinc finger -- CCHH-type
     C; class: small
     C; reordered by kitschorder 1.0a
     C; reordered by kitschorder 1.0a
     C; last update 7/9/98
     >P1;1zaa1
     structureX:1zaa: 3 :C: 33 :C:zinc-finger (ZIF268, domain 1):Mus musculus:2.10:18.20
     ------RPYACPVESCDRRFSRSDELTRHI-RI-HTGQK*
     >P1;1zaa2
     structureX:1zaa: 34 :C: 61 :C:zinc-finger (ZIF268, domain 2):Mus musculus:2.10:18.20
     -------PFQCRI--CMRNFSRSDHLTTHI-RT-HTGEK*
     >P1;1zaa3
     structureX:1zaa: 62 :C: 87 :C:zinc-finger (ZIF268, domain 3):Mus musculus:2.10:18.20
     -------PFACDI--CGRKFARSDERKRHT-KI-HLR--*
     >P1;1ard
     structureN:1ard: 102 : : 130 : :zinc-finger (transcription factor ADR1):Saccharomyces cerevisiae:-1.00:-1.00
     ------RSFVCEV--CTRAFARQEHLKRHY-RS-HTNEK*
     >P1;1znf
     structureN:1znf: 1 : : 25 : :zinc-finger (XFIN, 31st domain):Xenopus laevis:-1.00:-1.00
     --------YKCGL--CERSFVEKSALSRHQ-RV-HKN--*
     >P1;2drp2
     structureX:2drp: 137 :A: 165:A:zinc-finger (tramtrack, domain 2):Drosophila melanogaster:2.80:19.30
     ----NVKVYPCPF--CFKEFTRKDNMTAHV-KIIHK---*
     >P1;3znf
     structureN:3znf: 1 : : 30 : :zinc-finger (enhancer binding protein):Homo sapiens:-1.00:-1.00
     ------RPYHCSY--CNFSFKTKGNLTKHMKSKAHSKK-*
     >P1;5znf
     structureN:5znf: 1 : : 30 : :zinc-finger (ZFY-6T):Homo sapiens:-1.00:-1.00
     ------KTYQCQY--CEYRSADSSNLKTHIKTK-HSKEK*
     You can also look at superposed structures.

  20. Sequence searching (figure): a QUERY is searched against a DATABASE. Sequences reported as POSITIVES are either True Positives or False Positives; sequences reported as NEGATIVES are either True Negatives or False Negatives.

  21. So what have we got:
                      Observed P    Observed N
     Predicted P      TP            FP
     Predicted N      FN            TN

  22. Sensitivity and Specificity – medical world

  23. Receiver Operating Characteristic (ROC) curve • Plot Sensitivity (TP/(TP+FN)) against 1 - Specificity (1 - TN/(FP+TN)); the latter is called the error. Sensitivity is also called Coverage. Axis labels: Sensitivity; Error = 1 - Specificity.
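
A ROC curve can be traced by sorting the database hits by score and lowering the acceptance threshold one hit at a time. The sketch below is a minimal version under stated assumptions: each hit is a hypothetical `(score, is_homolog)` pair with the true label known from an annotated database, higher scores are better, and tied scores are not treated specially.

```python
def roc_points(scored_hits):
    """Return the ROC curve as a list of (error, sensitivity) points.

    scored_hits: list of (score, is_homolog) pairs, higher score = better.
    error = FP / (FP + TN) = 1 - specificity; sensitivity = TP / (TP + FN).
    """
    positives = sum(1 for _, is_homolog in scored_hits if is_homolog)
    negatives = len(scored_hits) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one true homolog and one non-homolog")
    points = [(0.0, 0.0)]  # threshold above the best score: nothing accepted
    tp = fp = 0
    ranked = sorted(scored_hits, key=lambda hit: hit[0], reverse=True)
    for _, is_homolog in ranked:
        if is_homolog:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points


if __name__ == "__main__":
    hits = [(9.0, True), (7.5, True), (6.0, False), (4.0, True), (1.0, False)]
    for error, sens in roc_points(hits):
        print(f"error={error:.2f}  sensitivity={sens:.2f}")
```

A method that ranks all true homologs above all non-homologs climbs straight to sensitivity 1 before any error is made; the more the curve sags towards the diagonal, the worse the ranking.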

  24. Database Search Algorithms: Sensitivity, Selectivity • Sensitivity: the ability to detect weak similarities between sequences (often due to long evolutionary separation). Increasing sensitivity reduces false negatives, i.e. those database sequences that are similar to the query but are rejected. Sensitivity (or Coverage) = TP / (TP + FN) • Selectivity: the ability to screen out similarities due to chance. Increasing selectivity reduces false positives, i.e. those sequences recognized as similar when they are not. Selectivity (or Positive Predictive Value) = TP / (TP + FP) • Specificity also describes the ability of the method to select proper hits. Specificity = TN / (TN + FP). Courtesy of Gary Benson (ISSCB 2003)
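
The three formulas above translate directly into code. The counts in the usage example are invented purely for illustration; they mimic a database search in which true negatives vastly outnumber everything else.

```python
def sensitivity(tp, fn):
    """Sensitivity (coverage): fraction of true homologs that were found."""
    return tp / (tp + fn)


def selectivity(tp, fp):
    """Selectivity (positive predictive value): fraction of hits that are real."""
    return tp / (tp + fp)


def specificity(tn, fp):
    """Specificity: fraction of non-homologs that were correctly rejected."""
    return tn / (tn + fp)


if __name__ == "__main__":
    # Hypothetical counts for one search against a database of 1000 sequences.
    tp, fp, fn, tn = 40, 10, 20, 930
    print(f"sensitivity = {sensitivity(tp, fn):.3f}")
    print(f"selectivity = {selectivity(tp, fp):.3f}")
    print(f"specificity = {specificity(tn, fp):.3f}")
```

Because a typical database contains far more unrelated sequences than homologs, specificity tends to sit close to 1 even for a mediocre method, which is one reason selectivity is often reported alongside it.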

  25. COG: Clusters of Orthologous Groups • Orthologues found using bi-directional best hit searching with PSI-BLAST • All COG family members are supposed to have the same function • Searching with an unknown sequence only needs to hit a single member of a COG family; the annotation can then be transferred. Example: COG2813. http://www.ncbi.nlm.nih.gov/COG/
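
The bi-directional best hit idea can be sketched for a single pair of genomes as follows. This is only the pairwise step, under assumptions of my own: search results are given as hypothetical dictionaries mapping `(query, subject)` to a score, higher scores are better, and ties are broken by whichever entry is seen first. It is not the full COG construction procedure.

```python
def best_hits(scores):
    """Map each query to its best-scoring subject.

    scores: dict {(query, subject): score}, higher score = better hit.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs that are each other's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


if __name__ == "__main__":
    # Made-up scores for genes of genome A searched against genome B and back.
    a_vs_b = {("a1", "b1"): 200, ("a1", "b2"): 50,
              ("a2", "b1"): 120, ("a2", "b2"): 30}
    b_vs_a = {("b1", "a1"): 190, ("b1", "a2"): 110, ("b2", "a2"): 40}
    print(bidirectional_best_hits(a_vs_b, b_vs_a))
```

In the toy data, a2 also hits b1 best, but b1 prefers a1, so only the reciprocal pair is kept as a candidate orthologue pair.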

  26. Structure-based function prediction • SCOP (http://scop.berkeley.edu/) is a protein structure classification database where proteins are grouped into a hierarchy of families, superfamilies, folds and classes, based on their structural and functional similarities

  27. Structure-based function prediction • SCOP hierarchy – the top level: 11 classes

  28. Structure-based function prediction (figure): all-alpha protein, all-beta protein, alpha-beta protein, coiled-coil protein, membrane protein

  29. Structure-based function prediction • SCOP hierarchy – the second level: 800 folds

  30. Structure-based function prediction • SCOP hierarchy - third level: 1294 superfamilies

  31. Structure-based function prediction • SCOP hierarchy - fourth level: 2327 families

  32. Structure-based function prediction • Using a sequence-structure alignment method, one can predict that a protein belongs to a SCOP family, superfamily or fold • Proteins predicted to be in the same SCOP family are orthologous • Proteins predicted to be in the same SCOP superfamily are homologous • Proteins predicted to be in the same SCOP fold are structurally analogous (figure: folds, superfamilies, families)
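
The inference rules on this slide amount to finding the deepest SCOP level that two predicted classifications share. The sketch below encodes exactly the slide's mapping; the tuple representation, the function names and the identifier strings in the example are hypothetical and used for illustration only.

```python
SCOP_LEVELS = ("class", "fold", "superfamily", "family")

# Relationship inferred from the deepest shared level, as stated on the slide.
RELATIONSHIP = {
    "family": "orthologous",
    "superfamily": "homologous",
    "fold": "structurally analogous",
}


def deepest_shared_level(classification_a, classification_b):
    """Return the deepest SCOP level at which both classifications agree.

    Each classification is a tuple (class, fold, superfamily, family).
    Returns None if even the class differs.
    """
    shared = None
    for level, a, b in zip(SCOP_LEVELS, classification_a, classification_b):
        if a != b:
            break
        shared = level
    return shared


def inferred_relationship(classification_a, classification_b):
    """Translate the deepest shared level into the slide's relationship term."""
    level = deepest_shared_level(classification_a, classification_b)
    return RELATIONSHIP.get(level, "no relationship inferred")


if __name__ == "__main__":
    # Illustrative identifiers only, not assignments of real proteins.
    p = ("a", "a.1", "a.1.1", "a.1.1.1")
    q = ("a", "a.1", "a.1.1", "a.1.1.2")
    print(inferred_relationship(p, q))
```

Sharing only the class is not covered by the slide's rules, so the sketch reports no inferred relationship in that case.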

  33. Profile wander
