BLAST CD Session 2 | Wednesday 4 May 2005 Bram Raats Lee Provoost
Introduction to BLAST BLAAT Also explain the HSP from Bogdan’s first question. Based on the “Sequence similarity search” PDF document that Bram found.
Lee | Peter #1: Question At Page 79 it is mentioned that NCBI BLAST, with a minimum word length of 7, does not use the two-hit algorithm. Since it would be useful to use the two-hit algorithm for smaller word sizes, is it possible to use the algorithm anyway, but then with smaller word sizes, and translate this information afterwards to longer words?
Lee | Peter #1: Answer • The two-hit algorithm uses word size (W) and threshold (T), but BLASTN seeds are always identical words, so they depend only on word size (W). • The two-hit algorithm isn’t used in BLASTN searches because word hits are usually rare with large identical words (NCBI-BLASTN requires a minimum of W = 7) • BUT! Why BLASTN W >= 7, while BLASTP W = 3? Why not BLASTN W = 3? -> # sizes in available data -> hash table puts memory constraints
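The first bullet can be illustrated with a minimal sketch of exact-word seeding. It assumes a plain Python dictionary as the word index; the names `index_words` and `find_seeds` are illustrative and not taken from NCBI code. A seed here is simply an identical word of length W in both sequences, so no threshold T is involved.

```python
from collections import defaultdict


def index_words(query, w):
    """Map every word of length w in the query to its start positions."""
    table = defaultdict(list)
    for i in range(len(query) - w + 1):
        table[query[i:i + w]].append(i)
    return table


def find_seeds(query, subject, w):
    """Return (query_pos, subject_pos) pairs where identical words of length w occur."""
    table = index_words(query, w)
    seeds = []
    for j in range(len(subject) - w + 1):
        # .get() avoids creating empty entries in the defaultdict
        for i in table.get(subject[j:j + w], []):
            seeds.append((i, j))
    return seeds


# Toy sequences, made up for illustration
query = "ACGTACGTGA"
subject = "TTACGTACGTCC"
seeds_w7 = find_seeds(query, subject, 7)
seeds_w3 = find_seeds(query, subject, 3)
print(len(seeds_w7), len(seeds_w3))
```

With a four-letter alphabet there are only 4**W distinct words, so short words such as W = 3 hit far more often than W = 7, as the toy sequences already show.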
Lee | Peter #2: Question On page 83 a floating/fixed bandwidth is mentioned in combination with gapped extension. What does this bandwidth mean and how is it used for the extension procedure?
Lee | Peter #2: Answer • Bandwidth: Window width (or band width) within which gapped alignments are generated. • Default for protein comparisons: 32 • Default for nucleotide comparisons: 16
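A rough sketch of what restricting alignment to a band means, assuming a simple global alignment with a linear gap penalty. This is not BLAST's actual extension code; the function name, the scoring values, and the convention that `band` is the maximum allowed |i - j| are all assumptions made for illustration.

```python
def banded_alignment_score(a, b, band, match=1, mismatch=-1, gap=-2):
    """Global alignment score, filling only cells with |i - j| <= band."""
    neg_inf = float("-inf")
    n, m = len(a), len(b)
    score = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    score[0][0] = 0
    for i in range(n + 1):
        lo = max(0, i - band)
        hi = min(m, i + band)
        for j in range(lo, hi + 1):
            if i == 0 and j == 0:
                continue
            best = neg_inf
            if i > 0 and j > 0:
                s = match if a[i - 1] == b[j - 1] else mismatch
                best = max(best, score[i - 1][j - 1] + s)
            if i > 0:
                best = max(best, score[i - 1][j] + gap)  # gap in b
            if j > 0:
                best = max(best, score[i][j - 1] + gap)  # gap in a
            score[i][j] = best
    # Cells outside the band stay at -inf, so paths leaving the band are impossible
    return score[n][m]


print(banded_alignment_score("GGACGTACGT", "ACGTACGT", band=2))
```

A narrower band means fewer cells to compute, but an alignment that needs to drift further from the diagonal than the band allows cannot be found at all.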
Lee | Ingmar #1: Question I have the following question for you. On page 79 they are talking about a two-hit algorithm that looks for places where words are found within a given distance. In this way they want to reduce the search space. My question is why this is so useful, since when you are aligning a big string, it seems to me that even the two-hit strategy would be able to find a lot of useless seeds. Would it be useful to try a three-hit or even four-hit strategy in this case? It seems to me that you could cut off the search space even further then.
Lee | Ingmar #1: Answer • BLAST 1 used a one-hit algorithm, but BLAST 2, the version in use today, uses the two-hit algorithm. • The extension step typically accounts for 90% of BLAST’s execution time • Word hits tend to cluster along diagonals in the search space --> the two-hit algorithm takes advantage of that by requiring two word hits on the same diagonal within a given distance • But! To obtain the same sensitivity we have to lower T, which produces more neighborhood words --> more computation • So: 3- and 4-hit algorithms wouldn’t be beneficial, because they would need an even lower T and we would ignore a lot of possible candidates!
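The diagonal requirement can be sketched as follows. This is a simplified illustration, not the real BLAST implementation: the function name `two_hit_seeds`, the parameter `max_dist`, and the rule of skipping hits that overlap the previous hit on the same diagonal are assumptions. A word hit at (query_pos, subject_pos) lies on diagonal subject_pos - query_pos.

```python
def two_hit_seeds(hits, w, max_dist):
    """Return pairs of non-overlapping word hits on one diagonal within max_dist.

    hits: iterable of (query_pos, subject_pos) word hits of length w.
    """
    last = {}   # diagonal -> query position of the most recent accepted hit
    pairs = []
    for q, s in sorted(hits):
        d = s - q
        if d in last:
            dist = q - last[d]
            if dist < w:
                continue  # overlaps the previous hit on this diagonal; ignore it
            if dist <= max_dist:
                pairs.append(((last[d], last[d] + d), (q, s)))
        last[d] = q
    return pairs


# Toy hits, made up for illustration: three on diagonal 10, one on diagonal -17
hits = [(0, 10), (5, 15), (20, 3), (40, 50)]
print(two_hit_seeds(hits, w=3, max_dist=10))
```

Only hit pairs returned by such a filter would be passed on to the expensive extension step; isolated hits are dropped.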
Lee | Bogdan #1: Question The notion of HSP (high scoring pair) is used widely throughout the last part of the chapter about statistical models used in BLAST. However, it is never quite clearly explained. It’s supposed to be a pair with a high score. But what is this “high score”? Is it some sort of threshold? This is the only interpretation I can think of. However, if it is so, then what is the “expected HSP length” defined and used in the section called “Length Correction”? Is it supposed to be the length of a pair of matched sequences, in which all the corresponding pairs have the score above the possible threshold? Could you maybe explain this by an example of some sort, since I’ve read the section over and over and it still doesn’t fall into place. I seem to miss the intuition behind the significance of the HSP length.
Lee | Bogdan #1: Answer • Once seeds are extended in both directions to create alignments, the alignments are evaluated to determine if they are statistically significant. Those that are significant are called HSP’s. • Evaluation: just use some threshold, to sort alignments into low and high scoring
Lee | Bogdan #2: Question Have a look at the section called “Sum statistics and Sum scores”. We have a series of more and more accurately defined sums there. The part I don’t understand is why these sums are needed. I mean, we have the Needleman-Wunsch and the Smith-Waterman algorithms to get the best alignments for two sequences in order to get the pair with the best score? Why would we want to keep computing sum scores for pairs of alignments, when we could simply detect the alignment with the best score (assuming we have the score matrix) directly?
Lee | Bogdan #2: Answer Needleman-Wunsch & Smith-Waterman • dynamic programming algorithm • Explores entire search space between two sequences BLAST • heuristic algorithm • Minimizing search space is key to its speed but at the cost of a loss in sensitivity --> SW = 10 min | BLAST = 20 sec Remark: BLAST based on SW! (http://www.sbc.su.se/~arne/kurser/swell/blast.html)
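To make "explores entire search space" concrete, here is a minimal Smith-Waterman scoring sketch. The scoring values (match 2, mismatch -1, linear gap -2) are assumptions chosen for illustration, and only the best local score is returned, not the alignment itself.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score; visits every one of the len(a) * len(b) cells."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # A local alignment may restart anywhere, hence the floor at 0
            cur[j] = max(0, prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best


print(smith_waterman_score("TTTACGTTTT", "GGACGGG"))
```

The work grows with the product of the two sequence lengths, which is what BLAST avoids by only extending around seeds.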
Lee | Bogdan #3: Question At some point in the evaluation part of the BLAST algorithms it is mentioned that HSP’s have to be organized into consistent groups. I understand the upper left - bottom right argument about consistent groups, but it doesn’t seem to be enough, in my opinion, for a complete definition of such groups. What more is there which the algorithm has to look at in order to decide how exactly a consistent group is supposed to be formed. For example, I don’t think that a group should consist of HSP’s which are very far apart, even if they conform to the upper left - bottom right criteria. Or am I not understanding things in the right way?
Lee | Bogdan #3: Answer • The relationship between HSP’s should resemble the relationship between ungapped alignments. • The lines in the graph should start from the upper left and continue to the lower right, the lines shouldn't overlap, and there should be a penalty for unaligned sequence. • Groups of HSP’s that behave this way are considered consistent. • “Some” algorithm for defining groups of consistent HSP’s • Coordinates -> overlapping? • Band width?
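One possible reading of the coordinate test hinted at in the last bullets, as a sketch only. It is not the actual BLAST grouping algorithm: the tuple layout, the end-exclusive coordinates, and the name `is_consistent` are assumptions, and the penalty for unaligned sequence is not modelled.

```python
def is_consistent(hsps):
    """Check that HSPs run upper left to lower right without overlapping.

    hsps: list of (q_start, q_end, s_start, s_end), end-exclusive coordinates.
    """
    ordered = sorted(hsps)  # sort by query start
    for (q0, q1, s0, s1), (nq0, nq1, ns0, ns1) in zip(ordered, ordered[1:]):
        # The next HSP must start after the previous one ends in BOTH sequences
        if nq0 < q1 or ns0 < s1:
            return False
    return True


print(is_consistent([(0, 10, 0, 10), (15, 25, 20, 30)]))   # ordered, no overlap
print(is_consistent([(0, 10, 20, 30), (15, 25, 0, 10)]))   # lines cross
```

A distance limit between neighbouring HSPs, as the question suggests, could be added as a further condition inside the loop.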
Jacob #1: Question I sort of understand why it is nice to have a pair-wise ordered sum score like Equation 4-17, but why is the ln(r!) added to the sum? I don’t see how this adds the ordering to the sum.
Jacob #1: Answer • Error in book? • ln(r!) gives additional bonus points for collinear HSP’s • Exact meaning?
Bram | Jacob #2: Question On page 79 it is said that word hits tend to cluster along diagonals in the search space. Why, and in what way, is this an advantage for the two-hit algorithm?
Bram | Jacob #2: Answer • 2 word hits on same diagonal within distance • Search space reduced -> increase speed
Bram | Marjolijn #1: Question PAM matrices are based on data available in the 1970s, and BLOSUM matrices are based on data available in the 1990s. Can PAM and BLOSUM produce the same alignment score for two sequences when you choose the correct number i for PAMi and BLOSUMi?
Bram | Marjolijn #1: Answer • PAM • Theoretical • attempt to model the course of sequence evolution. • BLOSUM • empirical • Using the relative entropy of a scoring matrix (H) -> compare the two different matrices. • If the general behaviour of two scoring matrices is the same, the probability of an equal alignment score is high. • So when you choose the correct number i for PAM and for BLOSUM, they should produce approximately the same alignment score.
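The relative entropy comparison can be sketched with the standard definition H = sum over i,j of q_ij * s_ij, where s_ij = log2(q_ij / (p_i * p_j)) is the log-odds score in bits. The two-letter alphabet and all frequencies below are made up for illustration; real PAM and BLOSUM matrices use twenty amino acids and observed frequencies.

```python
import math


def relative_entropy(target, background):
    """H = sum_ij q_ij * log2(q_ij / (p_i * p_j)), in bits per aligned pair."""
    h = 0.0
    for (a, b), q in target.items():
        if q > 0:
            h += q * math.log2(q / (background[a] * background[b]))
    return h


# Hypothetical background and target frequencies for a toy alphabet {A, B}
background = {"A": 0.5, "B": 0.5}
conserved = {("A", "A"): 0.45, ("B", "B"): 0.45, ("A", "B"): 0.05, ("B", "A"): 0.05}
diverged = {("A", "A"): 0.30, ("B", "B"): 0.30, ("A", "B"): 0.20, ("B", "A"): 0.20}

h_conserved = relative_entropy(conserved, background)
h_diverged = relative_entropy(diverged, background)
print(round(h_conserved, 3), round(h_diverged, 3))
```

A matrix built for closely related sequences carries more information per aligned pair (higher H) than one built for distant sequences, so two matrices from different families can be matched up by looking for similar H.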
Bram | Marjolijn #2: Question BLAST searches for statistically significant similarities. How did BLAST learn which similarities between two sequences are statistically significant and which are not? Is the learn-set of sequences the whole DNA database or not?
Bram | Marjolijn #2: Answer BLAST uses the scores of the different alignments found to determine whether an alignment is statistically significant or not. Significant alignments are called HSP’s. At the simplest level, BLAST uses a score threshold S to sort alignments into high and low scoring. Because S and E are related through the Karlin-Altschul equation, a score threshold is synonymous with a statistical threshold. So BLAST uses scoring matrices to determine the score of a specific alignment and uses a threshold to determine whether the score of the alignment is high enough to be statistically significant.
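The link between S and E can be shown with the Karlin-Altschul equation E = K * m * n * exp(-lambda * S), where m and n are the query and database lengths. The sketch below is for illustration only: the function names are made up, and the values of lambda, K, m and n are assumed example values, not parameters taken from the slides.

```python
import math


def expect_value(score, m, n, lam, k):
    """Karlin-Altschul equation: E = K * m * n * exp(-lambda * S)."""
    return k * m * n * math.exp(-lam * score)


def score_threshold(e_value, m, n, lam, k):
    """Smallest integer raw score S whose expected count is at most e_value."""
    return math.ceil(math.log(k * m * n / e_value) / lam)


# Assumed example values: query of 250 residues, database of 1,000,000 residues
m, n, lam, k = 250, 1_000_000, 0.318, 0.13
s_min = score_threshold(10, m, n, lam, k)
print(s_min, expect_value(s_min, m, n, lam, k))
```

Choosing an E-value cutoff therefore fixes a score cutoff and vice versa, which is why no training set of sequences is needed to decide what counts as significant.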
Laurence #1: Question On page 65 it says that the Karlin-Altschul theory makes five central assumptions. One of these is that the sequences are infinitely long. I don’t really understand why it is necessary for the sequences to be infinitely long. Can you please explain it?