Lesson 3

mattox


Presentation Transcript


  1. Lesson 3 Aligning sequences and searching databases

  2. Some Terminology

  3. Matrix = Table

  4. Probability = סיכוי (Hebrew: "chance"); Likelihood = סבירות (Hebrew: "plausibility")

  5. Global and Local pairwise alignments

  6. Global vs. Local
     Global alignment – finds the best alignment across the entire two sequences.
         ADLGAVFALCDRYFQ
         ||||     |||| |
         ADLGRTQN-CDRYYQ
     Local alignment – finds regions of similarity in parts of the sequences.
         ADLG CDRYFQ
         |||| |||| |
         ADLG CDRYYQ

  7. The sequence similarity is restricted to a single domain. [Figure: domain diagrams of PTK2 and Leukocyte TK. Labels, in order: Domain A, Protein tyrosine kinase domain, Domain B, Domain X, Protein tyrosine kinase domain.]

  8. Which alignment is the correct one?
     Sequences:
         AAGTGAATTCGAA
         AGGCTCATTTCTGA
     Alignment 1:
         A-AG-TGAATTC--GAA
         AG-GCTCA-TTTCTGA-
     Alignment 2:
         AAG-TGAATT-C-GAA
         AGGCT-CATTTCTGA-

  9. Scoring system (naïve)
     Perfect match: +1; Mismatch: -2; Indel (gap): -1
         A-AG-TGAATTC--GAA
         AG-GCTCA-TTTCTGA-
         Score: (+1)x8 + (-2)x2 + (-1)x7 = -3
         AAG-TGAATT-C-GAA
         AGGCT-CATTTCTGA-
         Score: (+1)x9 + (-2)x2 + (-1)x5 = 0
     Higher score -> better alignment
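The naïve scoring scheme can be checked by hand, but a few lines of code make the column-by-column counting explicit. The following is a minimal sketch, not part of the original slides; the function name `score_alignment` is a hypothetical helper, and the defaults are the slide's values (+1, -2, -1).

```python
def score_alignment(top, bottom, match=1, mismatch=-2, gap=-1):
    """Score a gapped pairwise alignment one column at a time."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += gap        # indel column
        elif a == b:
            total += match      # identical letters
        else:
            total += mismatch   # different letters
    return total


# The two candidate alignments from the slide, as printed.
first = score_alignment("A-AG-TGAATTC--GAA", "AG-GCTCA-TTTCTGA-")
second = score_alignment("AAG-TGAATT-C-GAA", "AGGCT-CATTTCTGA-")
print(first, second)
```

Whichever alignment prints the higher number is the better one under this scheme.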

  10. DNA scoring matrices. Uniform substitutions between all nucleotides: every match gets one score and every mismatch gets another.

  11. Scoring gaps (I) Gap extension penalty < Gap opening penalty
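A gap extension penalty smaller than the gap opening penalty means that one long gap costs less than several short ones. The sketch below illustrates this with an affine gap cost; the numeric values (10 to open, 1 to extend) are assumptions chosen for illustration and do not come from the slides.

```python
def affine_gap_penalty(length, gap_open=10, gap_extend=1):
    """Penalty for one gap of `length` positions: open once, then extend.

    gap_open and gap_extend are illustrative values, not from the slides.
    """
    if length <= 0:
        return 0
    return gap_open + (length - 1) * gap_extend


one_long_gap = affine_gap_penalty(3)          # a single gap of three positions
three_short_gaps = 3 * affine_gap_penalty(1)  # three separate one-position gaps
print(one_long_gap, three_short_gaps)
```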

  12. Protein matrices – actual substitutions
     The idea: given an alignment of a large number of closely related sequences, we can score the relation between amino acids based on how frequently they substitute for each other.
         M G Y D E
         M G Y D E
         M G Y E E
         M G Y D E
         M G Y Q E
         M G Y D E
         M G Y E E
         M G Y E E
     In the fourth column, E and D are found in 7 / 8 of the sequences.
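Counting how often each amino acid appears in an alignment column is the raw observation behind substitution matrices. The sketch below only tallies the slide's fourth column; it is not the PAM or BLOSUM procedure itself, and `column_counts` is a hypothetical helper name.

```python
from collections import Counter

# The eight aligned sequences from the slide.
alignment = [
    "MGYDE", "MGYDE", "MGYEE", "MGYDE",
    "MGYQE", "MGYDE", "MGYEE", "MGYEE",
]


def column_counts(rows, col):
    """Count the residues found in one column (0-based index)."""
    return Counter(row[col] for row in rows)


fourth = column_counts(alignment, 3)
share_d_or_e = (fourth["D"] + fourth["E"]) / len(alignment)
print(dict(fourth), share_d_or_e)
```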

  13. PAM Matrices. A family of matrices: PAM 80, PAM 120, PAM 250. The number on the PAM matrix represents evolutionary distance. Larger numbers are for larger distances.

  14. Example: PAM 250. Similar amino acids have a greater score.

  15. PAM - limitations. Based only on a single, limited dataset. Examines proteins with few differences (85% identity). Based mainly on small globular proteins, so the matrix is biased.

  16. BLOSUM. Henikoff and Henikoff (1992) derived a set of matrices based on a much larger dataset. BLOSUM observes significantly more replacements than PAM, even for infrequent pairs.

  17. BLOSUM: Blocks Substitution Matrix
     Based on the BLOCKS database: ~2000 blocks from 500 families of related proteins (families of proteins with identical function). Blocks are short conserved patterns of 3-60 amino acids without gaps.
         AABCDA----BBCDA
         DABCDA----BBCBB
         BBBCDA-AA-BCCAA
         AAACDA-A--CBCDB
         CCBADA---DBBDCC
         AAACAA----BBCCC

  18. Example: BLOSUM62. Derived from blocks where the sequences share at least 62% identity.

  19. PAM vs. BLOSUM (further down the list = more distant sequences)
         PAM100 = BLOSUM90
         PAM120 = BLOSUM80
         PAM160 = BLOSUM60
         PAM200 = BLOSUM52
         PAM250 = BLOSUM45

  20. Intermediate summary • Scoring system = substitution matrix + gap penalty. • Used for both global and local alignment. • For amino acids, there are two types of substitution matrices: PAM and BLOSUM.

  21. Computational Aspects

  22. Many possible alignments
         AAGCTGAATTCGAA
         AGGCTCATTTCTGA
     • Which alignment has the best score?
     • Two sequences of length 10 have >> 1,000,000 possible alignments
     • Two sequences of length 20 have >> 1,000,000,000,000 possible alignments
     • Two sequences of length 30 have >> 1,000,000,000,000,000,000 possible alignments
         AAGCTGAATT-C-GAA
         AGGCT-CATTTCTGA-

         AAG-CTGAATT-C-GAA
         AGGCT-CATTT-CTGA-

         AAGCT-GAATT-C-GAA
         A-GGCT-CATTTCTGA-
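The growth in the number of alignments can be reproduced with a short recurrence. The sketch below assumes one particular way of counting: every alignment is a sequence of columns, each being a letter pair, a gap in the first sequence, or a gap in the second, and alignments that differ only in the order of adjacent gaps count as different. Under that assumption the counts are the Delannoy numbers; the slides do not say which counting convention they use.

```python
def count_alignments(n, m):
    """Number of gapped alignments of sequences of length n and m.

    table[i][j] counts alignments of the first i and first j letters.
    The last column is a letter pair, a gap in one sequence, or a gap
    in the other, which gives the three-term recurrence below.
    """
    table = [[1] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            table[i][j] = (table[i - 1][j - 1]
                           + table[i - 1][j]
                           + table[i][j - 1])
    return table[n][m]


for length in (10, 20, 30):
    print(length, count_alignments(length, length))
```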

  23. Optimal alignment algorithms • Needleman-Wunsch (global) [1970] • Smith-Waterman (local) [1981] • Two sequences of length 10: 100 computer operations (instead of 1,000,000). • Two sequences of length 20: 400 computer operations (instead of 1,000,000,000,000). • Two sequences of length 30: 900 computer operations (instead of 1,000,000,000,000,000,000).

  24. Matrix Representation (sequences S and T on the matrix axes). Match = 1, Mismatch = -1, Indel = -2
         AAAC
         A-GC
     score(AAAC,AGC) = -1

  25. Matrix Representation (sequences S and T on the matrix axes). Match = 1, Mismatch = -1, Indel = -2
         AAA
         A-G
     score(AAA,AG) = -2

  26. Matrix Representation (sequences S and T on the matrix axes). Match = 1, Mismatch = -1, Indel = -2
         --
         AG
     score(,AG) = -4 (two indels at -2 each)

  27. Matrix Representation (sequences S and T on the matrix axes). Match = 1, Mismatch = -1, Indel = -2. How do we fill in the alignment scores in the matrix? That’s where the algorithm comes into play.
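Filling the matrix can be sketched as follows. This is a minimal Needleman-Wunsch score computation written for this transcript, assuming the slide's scoring (match = 1, mismatch = -1, indel = -2) and a linear gap cost; it returns only the optimal score and does not trace back the alignment. Each cell holds the best score for aligning a prefix of S with a prefix of T.

```python
def needleman_wunsch_score(s, t, match=1, mismatch=-1, indel=-2):
    """Optimal global alignment score of s and t (linear gap cost)."""
    rows, cols = len(s) + 1, len(t) + 1
    score = [[0] * cols for _ in range(rows)]
    # First column and first row: a prefix aligned against nothing but gaps.
    for i in range(1, rows):
        score[i][0] = i * indel
    for j in range(1, cols):
        score[0][j] = j * indel
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if s[i - 1] == t[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align s[i-1] with t[j-1]
                score[i - 1][j] + indel,     # s[i-1] against a gap
                score[i][j - 1] + indel,     # t[j-1] against a gap
            )
    return score[-1][-1]


print(needleman_wunsch_score("AAAC", "AGC"))
```

The table has (len(S)+1) x (len(T)+1) cells and each cell takes a constant amount of work, which is where the operation counts of roughly length x length come from.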

  28. A Useful Link • http://alggen.lsi.upc.es/docencia/ember/frame-ember.html • Gives a step by step illustration of the algorithm for any given pair of sequences.

  29. Homology versus chance similarity

  30. A suggestion
     A. Take the two sequences -> compute the score.
     B. Take one sequence and randomly shuffle it -> find its score with the second sequence. Repeat 100,000 times.
     If the score in A is in the top 5% of the scores in B -> the similarity is significant.
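The shuffling procedure can be sketched in code. The version below makes several assumptions: it scores with a global alignment (match = 1, mismatch = -1, indel = -2), it defaults to 1,000 shuffles rather than 100,000 to keep the run short, and it uses a fixed random seed so the result is repeatable. The function names are hypothetical.

```python
import random


def global_score(s, t, match=1, mismatch=-1, indel=-2):
    """Global alignment score, keeping only one matrix row at a time."""
    prev = [j * indel for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        cur = [i * indel]
        for j in range(1, len(t) + 1):
            pair = match if s[i - 1] == t[j - 1] else mismatch
            cur.append(max(prev[j - 1] + pair,
                           prev[j] + indel,
                           cur[j - 1] + indel))
        prev = cur
    return prev[-1]


def shuffle_test(s, t, n_shuffles=1000, seed=0):
    """Return (observed score, fraction of shuffles scoring at least as high)."""
    rng = random.Random(seed)
    observed = global_score(s, t)             # step A
    letters = list(s)
    at_least = 0
    for _ in range(n_shuffles):               # step B
        rng.shuffle(letters)
        if global_score("".join(letters), t) >= observed:
            at_least += 1
    return observed, at_least / n_shuffles


observed, fraction = shuffle_test("AAGTGAATTCGAA", "AGGCTCATTTCTGA")
print(observed, fraction, "significant" if fraction <= 0.05 else "not significant")
```

A fraction of 0.05 or less corresponds to the observed score being in the top 5% of the shuffled scores.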

  31. Searching databases

  32. Craig Venter’s Cruise

  33. Craig Venter’s cruise. A sequence found in Craig Venter’s cruise: …AGGTAGACTAGAGCAGTTAGAACGTTAGTTTA… Which organism does it come from?

  34. Database
         GTGAGCAGAGAATAGTTTAAC…
         GAGCTATGTGAGCAGAGAATA…
         CTACGTGAGCAGAGAATAGTT…
         CATAGCTACTATGTGAGCAGA…
         GAGACCAGAGACTACGATAGC…
         CTAAACTGTGAGCAGACTCGT…
         GGGGACAGAGAATAGTTTAAC…
         TAGCTGAGCTATGTGAGCAGA…
         … …
     Query
         AGGTAGACTAGAGCAGTTAGAACGTTAGTTTA

  35. Searching a sequence database. The idea: use your sequence as a query to find homologous sequences in a sequence database. [Figure: the database, queried with a sequence taken from Venter’s trip.]

  36. Searching a sequence database Database query

  37. Searching a sequence database Database hit query

  38. Terminology. Query sequence – the sequence with which we are searching. Hit – a sequence found in the database, suspected to be homologous.

  39. Protein or DNA search

  40. Query sequence: DNA or protein? For coding sequences, we can use the DNA sequence or the protein sequence to search for similar sequences. Which is preferable if we want to learn about homology?

  41. Amino acids are better! Selection (and hence conservation) works (mostly) at the protein level: CTTTCA = Leu-Ser, TTGAGT = Leu-Ser
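The slide's example can be verified with a tiny translation routine. The sketch below is illustrative only: the codon table holds just the four codons that appear in the example (their assignments are from the standard genetic code), and the helper names are hypothetical.

```python
# Only the four codons used in the slide's example (standard genetic code).
CODON_TABLE = {"CTT": "Leu", "TTG": "Leu", "TCA": "Ser", "AGT": "Ser"}


def translate(dna):
    """Translate a DNA string codon by codon using the partial table above."""
    return [CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3)]


def identity(a, b):
    """Fraction of positions at which two equal-length sequences agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


dna1, dna2 = "CTTTCA", "TTGAGT"
print(translate(dna1), translate(dna2))                  # protein level
print(identity(dna1, dna2))                              # DNA-level identity
print(identity(translate(dna1), translate(dna2)))        # protein-level identity
```

The two DNA sequences agree at a single position, yet they encode the same two amino acids, which is why a protein query can reveal conservation that a DNA query misses.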

  42. Query type • Nucleotides: a four letter alphabet • Amino acids: a twenty letter alphabet • Two random DNA sequences will, on average, have 25% identity • Two random protein sequences will, on average, have 5% identity
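The 25% and 5% figures follow from a simple calculation if one assumes that all letters of the alphabet are equally frequent and that positions are compared without gaps. Both are simplifying assumptions; real nucleotide and amino acid frequencies are not uniform. The chance that two random letters agree is the sum of the squared letter frequencies.

```python
def expected_identity(frequencies):
    """Probability that two independently drawn letters are the same."""
    return sum(p * p for p in frequencies)


dna = expected_identity([1 / 4] * 4)        # four equally likely nucleotides
protein = expected_identity([1 / 20] * 20)  # twenty equally likely amino acids
print(round(dna, 4), round(protein, 4))
```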

  43. Computation time

  44. Searching a sequence database. The database holds 10^7 sequences. Assuming 10 comparisons every second, a full comparison of the query to the database requires 11.5 days.
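The 11.5-day estimate is plain arithmetic. The sketch below reads the database size as 10^7 sequences, which is the value that the 11.5-day figure implies at 10 comparisons per second.

```python
database_size = 10**7            # sequences in the database (assumed reading)
comparisons_per_second = 10      # rate given on the slide

seconds = database_size / comparisons_per_second
days = seconds / (60 * 60 * 24)  # seconds in a day
print(round(days, 2))
```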

  45. How do we search a database? 11.5 days is ok if we are doing it once. 150,000 searches (at least!!) are performed per day. >82,000,000 sequence records in GenBank.

  46. Heuristic. Definition: a heuristic is a method for solving a problem that does not guarantee an exact solution (but gives one that is not too bad) and takes far less time than the exact solution.

  47. BLAST. BLAST - Basic Local Alignment Search Tool. A heuristic for searching a database for similar sequences.

  48. BLAST - underlying hypothesis. The underlying hypothesis: when two sequences are similar, there are short ungapped regions of high similarity between them. The heuristic: discard irrelevant sequences, then perform exact local alignment only with the remaining sequences.
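The two-step idea (discard, then align exactly) can be sketched in a few lines. This is a strongly simplified illustration and not the actual BLAST algorithm: it treats a sequence as relevant only if it shares at least one exact word of length k with the query, and it then runs a plain Smith-Waterman local alignment on the survivors. The word length, scoring values, toy database, and function names are all assumptions.

```python
def kmers(seq, k):
    """All words of length k that occur in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def smith_waterman_score(s, t, match=1, mismatch=-1, indel=-2):
    """Best local alignment score (cells are never allowed below zero)."""
    best = 0
    prev = [0] * (len(t) + 1)
    for i in range(1, len(s) + 1):
        cur = [0]
        for j in range(1, len(t) + 1):
            pair = match if s[i - 1] == t[j - 1] else mismatch
            cell = max(0, prev[j - 1] + pair, prev[j] + indel, cur[j - 1] + indel)
            cur.append(cell)
            best = max(best, cell)
        prev = cur
    return best


def seeded_search(query, database, k=4):
    """Filter by shared words, then align only the remaining sequences."""
    query_words = kmers(query, k)
    hits = []
    for name, seq in database.items():
        if query_words.isdisjoint(kmers(seq, k)):
            continue  # discard: no short ungapped region in common
        hits.append((smith_waterman_score(query, seq), name))
    return sorted(hits, reverse=True)


# Hypothetical toy database, for illustration only.
toy_database = {
    "seq1": "GTGAGCAGAGAATAG",
    "seq2": "TTTTTTTTTTTT",
    "seq3": "CCGAGCTTAGAGCC",
}
hits = seeded_search("GAGCAGAG", toy_database)
print(hits)
```

The time saving comes from the filter: the expensive alignment step runs only on sequences that pass the cheap word lookup, at the price of possibly missing a true homolog that shares no exact word with the query.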
