Pairwise Local Alignment and Database Search

bian

Presentation Transcript


  1. Pairwise Local Alignment and Database Search Csc 487/687 Computing for Bioinformatics

  2. Which Program should one use? • Most researchers use methods for determining local similarities: • Smith-Waterman (gold standard) • FASTA • BLAST • FASTA and BLAST are heuristics: they do not find every possible alignment of the query with a database sequence. They are used because they run faster than Smith-Waterman.

  3. Heuristic Database Search Methods • Smith-Waterman dynamic programming is too compute- and time-intensive for searching big databases • e.g., UniProt, July 2004: 1.5M sequences • Most popular: the BLAST family (Altschul et al. 1990, 1997) and the FASTA family (Lipman and Pearson 1985)
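
The cost of Smith-Waterman comes from filling a table with one cell per pair of positions, for every database sequence. The following is a minimal sketch of the scoring recurrence, assuming a simple match/mismatch score and a linear gap penalty; the function name and the numeric values are illustrative choices, not taken from the slides.

```python
def smith_waterman_score(q, d, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between q and d.

    Assumed scoring (illustrative only): +2 match, -1 mismatch,
    -2 per gap position (linear gap penalty).
    """
    prev = [0] * (len(d) + 1)          # previous row of the DP table
    best = 0
    for i in range(1, len(q) + 1):
        curr = [0] * (len(d) + 1)
        for j in range(1, len(d) + 1):
            s = match if q[i - 1] == d[j - 1] else mismatch
            curr[j] = max(
                0,                     # start a new local alignment here
                prev[j - 1] + s,       # align q[i-1] with d[j-1]
                prev[j] + gap,         # gap in d
                curr[j - 1] + gap,     # gap in q
            )
            best = max(best, curr[j])
        prev = curr
    return best
```

Every cell is visited once, so the work is proportional to len(q) * len(d) per database sequence, which is what makes exhaustive searches over millions of sequences expensive.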

  4. BLAST – Basic Local Alignment Search Tool • Basic idea: • Identify short, very similar segment pairs, then extend them into local alignments • Critical issues: • For every database sequence d significantly similar to the query q, one should find at least one segment pair • Fewer segment pairs means faster computation

  5. Definitions • Maximal Segment Pair (MSP_qd) – pair of equal-length segments having the highest score of all ungapped local alignments between q and d. • High-Scoring Segment Pair (HSP) – segment pair whose score cannot be increased by shortening or extending it • Word – segment of fixed length w • Word pair – pair of segments of length w
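
To make the MSP definition concrete, the sketch below computes the MSP score by brute force: every ungapped alignment lies on one diagonal, so it scans each diagonal and keeps the best-scoring run. The match/mismatch values and the name `msp_score` are assumptions for illustration; BLAST itself does not compute the MSP this way.

```python
def msp_score(q, d, match=2, mismatch=-1):
    """Score of the maximal segment pair: the best ungapped local alignment.

    Scoring values are illustrative assumptions.
    """
    best = 0
    # A diagonal is identified by diag = j - i.
    for diag in range(-(len(q) - 1), len(d)):
        i = max(0, -diag)
        j = i + diag
        running = 0
        while i < len(q) and j < len(d):
            s = match if q[i] == d[j] else mismatch
            running = max(0, running + s)   # restart when the run goes negative
            best = max(best, running)
            i += 1
            j += 1
    return best
```

For "ACGTACGT" against "ACGTTACGT" this gives 10 (the ungapped run TACGT), lower than the gapped optimum, because segment pairs by definition contain no gaps.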

  6. Reformulating the Problem • Identify those database sequences d for which the score of MSP_qd is over a threshold V. • A segment pair scoring at least V contains, with high probability, a word pair scoring at least T. • Identify word pairs with score at least T, extend them to high-scoring segment pairs, and check whether the score is over V

  7. Finding Hits and HSPs • Hit – word pair scoring at least T • Preprocess q • Find all words o_T (of length w) that can score at least T against a word in q • Save in easy-to-use data structure • Find the hits • Search in d for all occurrences (o_d) of the words o_T • Extend (heuristically) to high-scoring segment pairs • Perform dynamic programming around HSPs scoring over a certain threshold – allows introduction of gaps
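
The preprocessing and hit-finding steps can be sketched as follows. This is a toy version under stated assumptions: a four-letter alphabet and a +2/-1 identity score stand in for the 20 amino acids and a real substitution matrix, and the names `neighborhood` and `find_hits` are hypothetical.

```python
from itertools import product

ALPHABET = "ACGT"   # assumption: toy alphabet instead of 20 amino acids


def score(a, b):
    """Toy substitution score (assumption): +2 identity, -1 otherwise."""
    return 2 if a == b else -1


def neighborhood(q, w, T):
    """Map each word o_T scoring >= T against a word of q to its positions in q."""
    table = {}
    for i in range(len(q) - w + 1):
        q_word = q[i:i + w]
        for cand in product(ALPHABET, repeat=w):
            if sum(score(a, b) for a, b in zip(q_word, cand)) >= T:
                table.setdefault("".join(cand), []).append(i)
    return table


def find_hits(table, d, w):
    """Scan d once and return all hits as (i, j): position in q, position in d."""
    hits = []
    for j in range(len(d) - w + 1):
        for i in table.get(d[j:j + w], []):
            hits.append((i, j))
    return hits
```

Lowering T enlarges the table: with w = 3 and the toy scores, T = 6 admits only exact words, while T = 3 also admits every word with a single mismatch. That is the sensitivity versus speed trade-off in miniature.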

  8. Pre-processing q • Aim: • Allow rapid identification of all words o_T in d, and of the location of the corresponding words in q, to allow extension into HSPs • Possibility: table of 20^w entries (one per possible word of length w over the 20 amino acids)
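
One way to realize a table of 20^w entries is direct addressing: read each word as a base-20 number and use it as an array index. The sketch below assumes the standard 20-letter amino acid alphabet in an arbitrary fixed order and, for brevity, stores only the exact words of q rather than all words scoring at least T; `word_index` and `build_table` are hypothetical names.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"            # 20 standard residues, fixed order
CODE = {aa: k for k, aa in enumerate(AMINO_ACIDS)}


def word_index(word):
    """Read a word as a base-20 number, giving an index in [0, 20**w)."""
    idx = 0
    for aa in word:
        idx = idx * 20 + CODE[aa]
    return idx


def build_table(q, w):
    """Table with 20**w slots; slot k lists the positions in q of word k."""
    table = [[] for _ in range(20 ** w)]
    for i in range(len(q) - w + 1):
        table[word_index(q[i:i + w])].append(i)
    return table
```

For w = 3 the table has 8000 slots, so a lookup for any word of d is a single index computation.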

  9. Pre-processing q

  10. Finding HSPs • For each word in d (starting in position j) hitting a word in q (starting in position i), record the hit indexed by its diagonal (j - i). • Hits close together on the same diagonal are joined before extension to HSPs • Extending to HSP: • Ideally – move to the end of the sequences in both directions • Heuristic – if the score falls "far below" the best seen so far, stop the extension
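
The diagonal bookkeeping and the "far below" stopping rule can be sketched as below. The drop-off parameter `x_drop`, the +2/-1 scores, and both function names are assumptions for illustration; the joining of nearby hits on a diagonal is not shown.

```python
from collections import defaultdict


def group_by_diagonal(hits):
    """Group hits (i, j) by diagonal j - i."""
    diagonals = defaultdict(list)
    for i, j in hits:
        diagonals[j - i].append((i, j))
    return dict(diagonals)


def extend_hit(q, d, i, j, w, x_drop, match=2, mismatch=-1):
    """Ungapped extension of a hit of length w at q[i], d[j].

    Returns (score, q_start, q_end) with q_end exclusive. Extension in a
    direction stops once the running score is more than x_drop below the
    best score seen so far.
    """
    def s(a, b):
        return match if a == b else mismatch

    best = sum(s(q[i + k], d[j + k]) for k in range(w))
    start, end = i, i + w

    # Extend to the right.
    run, qi, dj = best, i + w, j + w
    while qi < len(q) and dj < len(d):
        run += s(q[qi], d[dj])
        qi += 1
        dj += 1
        if run > best:
            best, end = run, qi
        elif best - run > x_drop:
            break

    # Extend to the left, starting from the best score found so far.
    run, qi, dj = best, i - 1, j - 1
    while qi >= 0 and dj >= 0:
        run += s(q[qi], d[dj])
        if run > best:
            best, start = run, qi
        elif best - run > x_drop:
            break
        qi -= 1
        dj -= 1

    return best, start, end
```

A small `x_drop` stops early and is fast; a large one approaches the "move to the end of the sequences" ideal at higher cost.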

  11. Dynamic Programming Around HSPs • DP is time-consuming and needs to be constrained • Starting from an identified HSP, find a "seed pair" • Perform "forward" and "backward" DP from the seed pair (independently) • Stop DP if the score falls more than T below the best score S' seen so far (here T is a drop-off amount, not the word-score threshold)
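
A constrained forward DP might look like the sketch below. It is anchored at the seed (no restart at zero, unlike Smith-Waterman) and discards any cell whose score is more than a drop-off below the best score so far. The parameter name `x_drop`, the linear gap penalty, and the scoring values are assumptions; the backward pass can be obtained by calling the same function on the reversed prefixes that precede the seed.

```python
def gapped_extend_forward(q, d, x_drop, match=2, mismatch=-1, gap=-2):
    """Best score of a gapped alignment anchored at q[0], d[0].

    q and d are the parts of the sequences that follow the seed pair.
    Cells scoring more than x_drop below the best score so far are pruned.
    """
    NEG = float("-inf")
    best = 0

    def keep(v):
        # Prune cells that have fallen too far below the best score.
        return v if best - v <= x_drop else NEG

    prev = [0] + [NEG] * len(d)
    for j in range(1, len(d) + 1):
        prev[j] = keep(prev[j - 1] + gap)

    for i in range(1, len(q) + 1):
        curr = [NEG] * (len(d) + 1)
        curr[0] = keep(prev[0] + gap)
        for j in range(1, len(d) + 1):
            s = match if q[i - 1] == d[j - 1] else mismatch
            v = max(prev[j - 1] + s, prev[j] + gap, curr[j - 1] + gap)
            if v > best:
                best = v
            curr[j] = keep(v)
        if all(c == NEG for c in curr):
            break                       # every cell pruned: stop the DP
        prev = curr
    return best
```

Pruning keeps the explored region to a band around the promising alignment, which is what makes gapped extension affordable after the ungapped stage.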

  12. Significance of alignments • Suppose an alignment reveals an intriguing similarity between two sequences. • Is the similarity significant? • Or could it have arisen by chance?

  13. Significance of alignment • If the score of the alignment observed is no better than might be expected from a random permutation of the sequence, then it is likely to have arisen by chance.

  14. How to Generate the Random Sequences? • Global alignment • Randomize one of the sequences, many times, realign each result to the second sequence (fixed), and collect the distribution of resulting scores. • Local alignment • Uses the population of results returned from the entire database as the population with which to measure the statistics.

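  15. Statistical parameters • Z-score • A measure of how unusual our original match is, in standard deviations from the mean score of the control population • A Z-score of 0 means the observed similarity is no better than the average of the control population • The higher the Z-score, the greater the probability that the observed alignment did not arise by chance • Z-score ≥ 5 is usually taken as significant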
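
The randomization procedure and the Z-score can be sketched together. This is a minimal version assuming the caller supplies the alignment scoring function (`score_fn` is a placeholder for any scorer taking two sequences); the function names, the number of shuffles, and the use of the population standard deviation are illustrative choices.

```python
import random
import statistics


def shuffled_scores(seq_a, seq_b, score_fn, n=100, seed=0):
    """Scores of n random permutations of seq_a aligned against the fixed seq_b."""
    rng = random.Random(seed)           # seeded for reproducibility
    letters = list(seq_a)
    scores = []
    for _ in range(n):
        rng.shuffle(letters)
        scores.append(score_fn("".join(letters), seq_b))
    return scores


def z_score(observed, control_scores):
    """(observed - mean) / standard deviation of the control population.

    Raises ZeroDivisionError if all control scores are identical.
    """
    mean = statistics.mean(control_scores)
    sd = statistics.pstdev(control_scores)
    return (observed - mean) / sd
```

Shuffling preserves the length and composition of the sequence, so the control population reflects what scores composition alone can produce.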

  16. Statistical parameters • P = the probability that the alignment is no better than random • P ≤ 10^-100: exact match • P in range 10^-100 to 10^-50: sequences very nearly identical • P in range 10^-50 to 10^-10: closely related sequences, homology certain • P in range 10^-5 to 10^-1: usually distant relatives • P > 10^-1: match probably insignificant

  17. Statistical parameters • E-value • The expected number of sequences that give the same Z-score or better if the database is probed with a random sequence. • Found by multiplying the value of P by the size of the database probed. • Note that E, but not P, depends on the size of the database.

  18. Statistical parameters • Interpreting E values • E ≤ 0.02: sequences probably homologous • E between 0.02 and 1: homology cannot be ruled out • E > 1: a match this good would be expected just by chance
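
As a worked example of the relation E = P x (database size) and of the thresholds above, consider the sketch below. The function names and the returned labels are illustrative; the input values in the usage note are made up for the arithmetic, not measured results.

```python
def e_value(p, database_size):
    """E = P multiplied by the number of sequences in the database probed."""
    return p * database_size


def interpret_e(e):
    """Apply the rule-of-thumb thresholds for E values."""
    if e <= 0.02:
        return "probably homologous"
    if e <= 1:
        return "homology cannot be ruled out"
    return "expected by chance"
```

For instance, a hypothetical P of 10^-8 against a database of 1.5 million sequences gives E of about 0.015, which falls in the "probably homologous" band; the same P against a database a hundred times larger gives E of about 1.5, which does not.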

  19. Rules and thinking • Percent of identical residues in the optimal alignment • Over 45%: very similar structures, and a common or at least related function. • Over 25%: a similar general folding pattern. • A lower degree of sequence similarity cannot rule out homology

  20. Rules and thinking • 18%-25%: the twilight zone, where the suggestion of homology is tantalizing but dangerous • Absence of significant similarity does not imply that the sequences are not homologous; they could be distantly related (twilight zone or beyond)
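
The percent-identity rules of thumb can be sketched as follows. One assumption needs stating: the denominator here is the number of aligned columns without a gap, and other conventions (for example, dividing by the full alignment length) give different percentages. The function names and labels are illustrative.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identical residues over the columns where neither row has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if a != gap and b != gap]
    if not columns:
        return 0.0
    identical = sum(1 for a, b in columns if a == b)
    return 100.0 * identical / len(columns)


def rule_of_thumb(pid):
    """Map a percent identity to the bands listed on the slides."""
    if pid > 45:
        return "very similar structures"
    if pid > 25:
        return "similar general folding pattern"
    if pid >= 18:
        return "twilight zone"
    return "no significant similarity (homology not ruled out)"
```

The last label is deliberately cautious: a low percentage does not show that two sequences are unrelated.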
