
Applied Bioinformatics



Presentation Transcript


  1. Applied Bioinformatics Week 4

  2. Similarity Searching • Heuristic Algorithms • FASTA • BLAST

  3. FASTA Algorithm • Algorithm? • Pearson and Lipman 1988

  4. FASTA - Algorithm - High level algorithm
       Let q be a query
       max ← 0
       For each sequence s in DB
         compare q with s and compute a score, y
         if max < y
           max ← y; bestSequence ← s
       Return bestSequence
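
The high-level loop on this slide can be sketched in Python. This is only an illustration: the scoring function `shared_ktuples` is a hypothetical stand-in (it counts distinct shared k-tuples) and is not the score FASTA actually computes; the names `best_match` and `database` are likewise assumptions.

```python
def shared_ktuples(q, s, k=3):
    """Toy score (an assumption): number of distinct k-tuples shared by q and s."""
    q_words = {q[i:i + k] for i in range(len(q) - k + 1)}
    s_words = {s[i:i + k] for i in range(len(s) - k + 1)}
    return len(q_words & s_words)


def best_match(query, database, score=shared_ktuples):
    """Return the database sequence with the highest score against the query."""
    best_score = 0
    best_sequence = None
    for s in database:
        y = score(query, s)
        if best_score < y:
            best_score, best_sequence = y, s
    return best_sequence


database = ["TTTTTTTT", "ACCGCGATGACGAATA", "GGGGCCCC"]
print(best_match("ACCGCGACCCTGACGAATA", database))
```

As on the slide, a sequence is only kept if it scores strictly above the current maximum, so a database with no positive-scoring sequence yields `None`.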

  5. FASTA Hashing • Hashing based on k in k-tuple (sequence of size k) • k e.g. 1 character .. n characters
          0123456789012345678
       Q: ACCGCGACCCTGACGAATA
       D: ACCGCGATGACGAATA

  6. Second Step (diff)

  7. Third Step (Freq Dist) • Calculate frequency distribution • Histogram • Find most frequent offset • Shift query against sequence by that offset • Outcome: exact matches
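
The hashing, difference and frequency-distribution steps can be tried out with a short Python sketch on the sequences Q and D from the hashing slide. The choice k = 3, the function names, and the offset convention (position in Q minus position in D) are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def ktuple_index(seq, k):
    """Map every k-tuple in seq to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def offset_histogram(query, target, k):
    """Count matching k-tuples per offset (query position - target position)."""
    index = ktuple_index(query, k)
    hist = Counter()
    for j in range(len(target) - k + 1):
        for i in index.get(target[j:j + k], []):
            hist[i - j] += 1
    return hist


Q = "ACCGCGACCCTGACGAATA"
D = "ACCGCGATGACGAATA"
hist = offset_histogram(Q, D, k=3)
for offset, count in sorted(hist.items()):
    print(f"{offset:>3} {'*' * count}")
```

For these two sequences the most frequent offset is 3 (seven matching 3-tuples), followed by offset 0 (five), which corresponds to the two exactly matching stretches on either side of the extra CCC in Q.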

  8. FASTA - Heuristic - • Heuristic: a good local alignment should contain some exactly matching subsequences. FASTA focuses on these regions.

  9. FASTA - Algorithm - • Step 1: Find all hot-spots (remember hashing) // Hot spots are pairs of words of length k that match exactly (Figure labels: Sequence 1, Hot Spots, Sequence 2)

  10. FASTA - Algorithm - • Step 2: Score the hot-spots and locate the ten best diagonal runs.

  11. FASTA - Algorithm - • Step 3: Combine sub-alignments into one alignment with gaps (Figure labels: gap, one local alignment)

  12. FASTA - Algorithm - • Step 4: Consider a weighted directed graph. Let each node be a sub-alignment found in the earlier steps. Let u and v be nodes. Edge (u,v) exists if sub-alignment u comes before v in the sequence. Each edge carries a (negative) gap penalty. Find the maximum-weight path. (Figure labels: sub-sequence, edge, one sequence)

  13. FASTA - Algorithm - • Step 4 in detail (Figure labels: one sequence, sub-alignments, gaps with penalties -5, -3, -3, max weight path)
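
A minimal sketch of the maximum-weight path search from step 4, assuming the sub-alignments are already sorted by their position in the sequence so that edges only run from earlier to later nodes. The node scores (10, 4, 8) are invented for illustration; only the gap penalties -5, -3, -3 come from the slide. The function name and data layout are assumptions.

```python
def max_weight_path(scores, edges):
    """Find the heaviest path through sub-alignments joined by gap edges.

    scores: sub-alignment scores, ordered by position in the sequence.
    edges:  dict mapping (u, v) with u < v to a negative gap penalty.
    Returns (total weight, list of node indices on the path).
    """
    if not scores:
        return 0, []
    n = len(scores)
    best = list(scores)   # best[v]: heaviest path that ends at node v
    prev = [None] * n     # predecessor of v on that path
    for v in range(n):
        for u in range(v):
            if (u, v) in edges:
                candidate = best[u] + edges[(u, v)] + scores[v]
                if candidate > best[v]:
                    best[v], prev[v] = candidate, u
    end = max(range(n), key=best.__getitem__)
    path, node = [], end
    while node is not None:
        path.append(node)
        node = prev[node]
    return best[end], path[::-1]


scores = [10, 4, 8]                             # hypothetical scores
edges = {(0, 1): -5, (1, 2): -3, (0, 2): -3}    # gap penalties as on the slide
print(max_weight_path(scores, edges))
```

With these numbers the best path skips the middle sub-alignment: joining nodes 0 and 2 directly costs one gap of -3 and gives 10 - 3 + 8 = 15, while going through node 1 gives only 14.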

  14. FASTA - Algorithm - • Step 5: Use dynamic programming in a restricted area (a band) around the best-scoring alignment from the previous step to find the final best-scoring alignment. The width of this band is a parameter.
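
The banded dynamic programming of step 5 can be illustrated with a small Smith-Waterman variant that only fills cells close to a chosen diagonal. This is a sketch under stated assumptions: the scoring values (match 2, mismatch -1, linear gap -2), the function name, and the diagonal convention (position in q minus position in d) are illustrative and are not FASTA's defaults.

```python
def banded_local_score(q, d, diagonal, width, match=2, mismatch=-1, gap=-2):
    """Best local alignment score using only cells (i, j) with
    |(i - j) - diagonal| <= width, where i indexes q and j indexes d."""
    best = 0
    prev = {}  # scores of the previous row, keyed by column j
    for i in range(1, len(q) + 1):
        cur = {}
        lo = max(1, i - diagonal - width)
        hi = min(len(d), i - diagonal + width)
        for j in range(lo, hi + 1):
            sub = match if q[i - 1] == d[j - 1] else mismatch
            score = max(
                0,                          # start a new local alignment
                prev.get(j - 1, 0) + sub,   # align q[i-1] with d[j-1]
                prev.get(j, 0) + gap,       # q[i-1] against a gap in d
                cur.get(j - 1, 0) + gap,    # d[j-1] against a gap in q
            )
            cur[j] = score
            best = max(best, score)
        prev = cur
    return best


Q = "ACCGCGACCCTGACGAATA"
D = "ACCGCGATGACGAATA"
print(banded_local_score(Q, D, diagonal=3, width=3))  # band covers offsets 0..6
print(banded_local_score(Q, D, diagonal=3, width=0))  # single diagonal only
```

Cells outside the band are treated as score 0, which is safe for a local alignment because an alignment may start anywhere. With a band of width 3 around diagonal 3 the two exact stretches of Q and D are joined across the three-base gap (score 26); with width 0 only the longer stretch on diagonal 3 is found (score 18).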

  15. FASTA - Algorithm - • Summary of Algorithm
       1: Find all hot-spots // Hot spots are pairs of words of length k that match exactly
       2: Score the hot-spots and locate the ten best diagonal runs
       3: Combine sub-alignments into one alignment
       4: Score each alignment with gap penalties and pick the best-scoring alignment
       5: Use dynamic programming in a restricted area around the best-scoring alignment to find the final best-scoring alignment

  16. FASTA Rumors • Is said to be more sensitive for nucleotide sequences than BLAST • Is supposed to be slower than BLAST • Is mostly found in European institutes • Whose job is it to confirm or reject these assumptions?

  17. End Theory I • Mind mapping • 10 min break

  18. Practice I

  19. FASTA Hashing • Apply the FASTA hashing algorithm to the following two sequences • AGTATGTGATGTAGAT • TGATG • Show the histogram • Interpret the histogram in context of the two sequences

  20. FASTA Query • Select a nucleotide sequence of your interest • Copy the first 100 nucleotides into a text file and add a definition line to turn it into FASTA format • Copy and paste the sequence 10 times, removing a further 10 nt from each copy • Change the definition lines accordingly • Outcome: 10 sequences (10 .. 100 nt)

  21. FASTA Query • Copy the 10 sequences again and add them to the end of the file • Change the definition lines • Add mutations (substitutions) to the copied sequences
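
The test file described on the two FASTA Query slides can also be generated programmatically. The sketch below is one possible way to do it; the definition-line naming scheme, the function names, the number of substitutions per sequence and the fixed random seed are all assumptions, not part of the exercise.

```python
import random


def truncated_records(name, seq, step=10):
    """Yield (definition line, sequence) for prefixes of length step, 2*step, ..."""
    for length in range(step, len(seq) + 1, step):
        yield f">{name}_len{length}", seq[:length]


def mutate(seq, n_subs, rng):
    """Return seq with n_subs substitutions at distinct random positions."""
    chars = list(seq)
    for pos in rng.sample(range(len(chars)), n_subs):
        chars[pos] = rng.choice([b for b in "ACGT" if b != chars[pos]])
    return "".join(chars)


def build_fasta(name, seq, n_subs=1, seed=0):
    """Return FASTA text: the truncated sequences, then mutated copies of them."""
    rng = random.Random(seed)
    lines = []
    for header, s in truncated_records(name, seq):
        lines += [header, s]
    for header, s in truncated_records(name, seq):
        lines += [f"{header}_mut{n_subs}", mutate(s, n_subs, rng)]
    return "\n".join(lines) + "\n"


example = "ACGT" * 25  # placeholder for the first 100 nt of a real sequence
print(build_fasta("demo", example)[:60])
```

Replace `example` with the first 100 nucleotides of the sequence you selected; the output then holds 10 original and 10 mutated sequences of 10 to 100 nt.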

  22. End Practice I • 15 min break

  23. Similarity Searching • Heuristic Algorithms • FASTA • BLAST

  24. BLAST - Heuristic - • Another heuristic algorithm • Heuristic, but the result is evaluated statistically. Homologous sequences are likely to contain a short high-scoring word pair, a hit. BLAST tries to extend the hit on both sides to get larger matches. (Figure: sequence A T T A G ... with a short high-scoring word marked)
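
Extending a hit on both sides can be sketched as an ungapped extension that stops once the running score has dropped too far below the best score seen so far. The scores (match 5, mismatch -4), the drop-off value `x_drop` and the function name are assumptions for illustration, not BLAST's actual parameters.

```python
def extend_hit(q, s, qpos, spos, w, match=5, mismatch=-4, x_drop=10):
    """Extend a word hit of length w at q[qpos], s[spos] without gaps.

    Returns (score, q_start, q_end) of the extended segment, q_end exclusive.
    """
    def walk(qi, si, step):
        # Walk away from the hit; remember the best score and where it occurred.
        best = run = 0
        best_len = length = 0
        while 0 <= qi < len(q) and 0 <= si < len(s):
            run += match if q[qi] == s[si] else mismatch
            length += 1
            if run > best:
                best, best_len = run, length
            elif best - run > x_drop:
                break  # score fell too far below the best: give up
            qi += step
            si += step
        return best, best_len

    seed = sum(match if q[qpos + k] == s[spos + k] else mismatch for k in range(w))
    right, right_len = walk(qpos + w, spos + w, +1)
    left, left_len = walk(qpos - 1, spos - 1, -1)
    return seed + left + right, qpos - left_len, qpos + w + right_len


q = "GGATTAGCC"
s = "TTATTAGAA"
print(extend_hit(q, s, qpos=2, spos=2, w=3))
```

Here the hit ATT at position 2 of both sequences grows to the right into ATTAG, while the extension to the left gains nothing because the neighbouring bases differ.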

  25. BLAST - Algorithm - Neighbourhood Word • Step 1: Pre-processing the query. Compile the list of short high-scoring words from the query. The query word length, w, is 3; the scoring threshold, T, is 13.

  26. BLAST - Algorithm - • Step 1-2: Create neighbourhood words for each query word (Figure labels: query word, neighbourhood words)
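
Building the word list and its neighbourhood can be sketched as follows. To keep the example self-contained it uses a toy nucleotide scoring (match 5, mismatch -4) instead of a real substitution matrix, so the threshold used here (6) is not comparable to the T = 13 quoted for w = 3 on the previous slide. All function names are assumptions.

```python
from itertools import product


def word_score(a, b, match=5, mismatch=-4):
    """Toy score for two equal-length words (BLAST uses a scoring matrix instead)."""
    return sum(match if x == y else mismatch for x, y in zip(a, b))


def query_words(query, w):
    """All overlapping words of length w in the query."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]


def neighbourhood(word, threshold, alphabet="ACGT", score=word_score):
    """All words of the same length scoring at least threshold against word."""
    return [
        "".join(cand)
        for cand in product(alphabet, repeat=len(word))
        if score(word, "".join(cand)) >= threshold
    ]


for word in query_words("ATTAG", 3):
    print(word, neighbourhood(word, threshold=6))
```

With this toy scoring a threshold of 6 admits the word itself plus every word with a single substitution, i.e. 10 neighbourhood words per query word; raising the threshold to 15 leaves only the exact word.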

  27. BLAST - Algorithm - • Step 2: Scanning the DB. For each word list, identify all exact matches with the DB sequences. The purpose of Steps 1 and 2 is the same as in FASTA. (Figure labels: query word, neighbourhood word list, sequences in DB: Sequence 1, Sequence 2, Step 1, Step 2)
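
Scanning the database for exact occurrences of the words in a word list can be sketched like this. The function name, the return format and the example sequences are assumptions; the sketch scans each sequence directly, whereas a real implementation would use a precomputed lookup structure.

```python
def find_hits(word_list, database):
    """Return (sequence index, position, word) for every exact word occurrence."""
    words = set(word_list)
    if not words:
        return []
    w = len(next(iter(words)))  # all words are assumed to have the same length
    hits = []
    for idx, seq in enumerate(database):
        for pos in range(len(seq) - w + 1):
            if seq[pos:pos + w] in words:
                hits.append((idx, pos, seq[pos:pos + w]))
    return hits


database = ["GGATTAGCC", "CCCTTACCC"]  # hypothetical DB sequences
print(find_hits(["ATT", "TTA", "TAG"], database))
```

Each reported hit is a candidate starting point for the extension step.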

  28. Statistical Assessment • Combine matches • Calculate statistics for each alignment • Bit Score • E-value • Report results
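
The bit score and E-value can be computed from a raw alignment score S with the standard Karlin-Altschul relations: bit score S' = (λS - ln K) / ln 2 and E = K·m·n·e^(-λS) = m·n·2^(-S'), where m is the query length and n the database length. The parameter values used below (λ = 0.318, K = 0.134) are illustrative placeholders only; the real values depend on the scoring system. The function names are assumptions.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalised score in bits: (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(raw_score, query_len, db_len, lam, k):
    """Expected number of chance alignments scoring at least raw_score."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


lam, k = 0.318, 0.134  # illustrative placeholder parameters
for raw in (30, 50, 70):
    print(raw, round(bit_score(raw, lam, k), 1), e_value(raw, 100, 1_000_000, lam, k))
```

Because the E-value falls exponentially with the score, a modest gain in raw score changes the E-value by orders of magnitude, and a larger database raises the E-value of the same alignment.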

  29. FASTA vs. BLAST • BLAST compares the query against all sequences in the DB using the same threshold. • FASTA compares the query with each sequence one by one and then compares the results. • What does this mean? (Figure labels: DB, DB, Query)

  30. End of Theoretical Part 2 • 5 min mind mapping • 10 min break

  31. Practical Part 2

  32. EBI FASTA • Use the FASTA file you created before • Run your query on EBI using the fasta algorithm with the default settings • Change the settings and keep track of which settings you use and the number of queries that have the correct result as the top hit • Use Excel (settings, %correct)

  33. NCBI BLAST • Use the FASTA file you produced before and repeat with NCBI BLAST the analysis you did with EBI fasta • Use blastn • Select the proper database • Finish EBI FASTA if you couldn't before

  34. Homework • Find a query that you can find with FASTA but not with BLAST, and vice versa • Submit the queries to bioinformatics@allmer.de
