1 / 26

BLAST and FASTA

MSCS 230 Bioinformatics - BLAST and FASTA. 2. Overview. BLASTMathematical basis for central ideaModeling sequence alignmentsSteps of BLASTFASTAStepsFinding k-tupsBioPerl. MSCS 230 Bioinformatics - BLAST and FASTA. 3. BLAST (Mount, Chapter 7). Basic Local Alignment Search ToolAltschul et al.

valentine
Download Presentation

BLAST and FASTA

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


    1. BLAST and FASTA Craig A. Struble, Ph.D. Department of Mathematics, Statistics, and Computer Science Marquette University

    2. MSCS 230 Bioinformatics - BLAST and FASTA 2 Overview BLAST Mathematical basis for central idea Modeling sequence alignments Steps of BLAST FASTA Steps Finding k-tups BioPerl

    3. MSCS 230 Bioinformatics - BLAST and FASTA 3 BLAST (Mount, Chapter 7) Basic Local Alignment Search Tool Altschul et al. 1990,1994,1997 Heuristic method for local alignment Designed specifically for database searches Idea: Good alignments contain short lengths of exact matches

    4. MSCS 230 Bioinformatics - BLAST and FASTA 4 Mathematical Basis of “The Idea” (Mount, Chapter 3) Model matches as a sequence of coin tosses Let p be the probability of a “head” For a “fair” coin, p = 0.5 (Erdös-Rényi) If there are n throws, then the expected length R of the longest run of heads is R = log1/p (n).

    5. MSCS 230 Bioinformatics - BLAST and FASTA 5 Mathematical Basis of “The Idea” (Mount, Chapter 3) Example: Suppose n = 20 for a “fair” coin R=log2(20)=4.32 Trick is how to model DNA (or amino acid) sequence alignments as coin tosses.

    6. MSCS 230 Bioinformatics - BLAST and FASTA 6 Modeling Sequence Alignments To model random sequence alignments, replace a match with a “head” and mismatch with a “tail”. For DNA, the probability of a “head” is 1/4 What is it for amino acid sequences?

    7. MSCS 230 Bioinformatics - BLAST and FASTA 7 Modeling Sequence Alignments So, for one particular alignment, the Erdös-Rényi property can be applied What about for all possible alignments? Consider that sequences are being shifted back and forth, dot matrix plot The expected length of the longest match is R=log1/p(mn) where m and n are the lengths of the two sequences.

    8. MSCS 230 Bioinformatics - BLAST and FASTA 8 Modeling Sequence Alignments So suppose m = n = 10 and we’re looking at DNA sequences R=log4(100)=3.32 This analysis makes assumptions about the base composition (uniform) and no gaps, but it’s a good estimate.

    9. MSCS 230 Bioinformatics - BLAST and FASTA 9 Back to BLAST Steps of BLAST Filter out low-complexity regions where L is length, N is alphabet size, ni is the number of letter i appearing in sequence. Example: AAAT K=1/4 log4(24/(3!*1!*0!*0!))=0.25

    10. MSCS 230 Bioinformatics - BLAST and FASTA 10 Back to BLAST Steps of BLAST Query words of length 3 (for proteins) or 11 (for DNA) are created from query sequence using a sliding window Expected run length in sequences of ~90 for proteins and 64 for DNA. The values 90 and 64 can easily be obtained from the expected run length formula shown earlier.The values 90 and 64 can easily be obtained from the expected run length formula shown earlier.

    11. MSCS 230 Bioinformatics - BLAST and FASTA 11 BLAST Steps of BLAST Using BLOSUM62 (for proteins) or scores of +5/-4 (DNA, PAM40), score all possible words of length 3 or 11 respectively against a query word. Select a neighborhood word score threshold (T) so that only most significant sequences are kept. Approximately 50 hits per query word. Repeat 3 and 4 for each query word in step 2. Total number of high scoring words is approximately 50 * sequence length.

    12. MSCS 230 Bioinformatics - BLAST and FASTA 12 BLAST Steps of BLAST Organize the high-scoring words into a search tree

    13. MSCS 230 Bioinformatics - BLAST and FASTA 13 BLAST Steps of BLAST Scan each database sequence for match to high-scoring words. Each match is a seed for an ungapped alignment.

    14. MSCS 230 Bioinformatics - BLAST and FASTA 14 BLAST Steps of BLAST (Original BLAST) extend matching words to the left and right using ungapped alignments. Extension continues as long as score increases or stays same. This is a HSP (high scoring pair). (BLAST2) Matches along the same diagonal (think dot plot) within a distance A of each other are joined and then the longer sequence extended as before. (Requires lower T)

    15. MSCS 230 Bioinformatics - BLAST and FASTA 15 BLAST Steps of BLAST Using a cutoff score S, keep only the extended matches that have a score at least S. Determine statistical significance of each remaining match (from last time). Try to extend the HSPs if possible. Show Smith-Waterman local alignments.

    16. MSCS 230 Bioinformatics - BLAST and FASTA 16 FASTA Find k-tups in the two sequences (k=1 for proteins, 4-6 for DNA sequences) Score and select top 10 scoring “local diagonals” For proteins, each k-tup found is scored using the PAM250 matrix For DNA, the number of k-tups found Penalize intervening gaps

    17. MSCS 230 Bioinformatics - BLAST and FASTA 17 Finding k-tups

    18. MSCS 230 Bioinformatics - BLAST and FASTA 18 FASTA

    19. MSCS 230 Bioinformatics - BLAST and FASTA 19 FASTA Rescan top 10 regions, score with PAM250 (proteins) or DNA scoring matrix. Trim off the ends of the regions to achieve highest scores. Try to join regions with gapped alignments. Join if similarity score is one standard deviation above average expected score Score of 28 for a query sequence of length 200 and k=2.

    20. MSCS 230 Bioinformatics - BLAST and FASTA 20 FASTA After finding the best initial region, FASTA performs a global alignment of a 32 residue wide region centered on the best initial region, and uses the score as the optimized score.

    21. MSCS 230 Bioinformatics - BLAST and FASTA 21 BioPerl Using the Bio::Tools::Run::RemoteBLAST module Execute a remote BLAST search Iterate over each HSP match Using the Bio::Tools::pSW module Perform a local sequence alignment of two sequences

    22. MSCS 230 Bioinformatics - BLAST and FASTA 22 Bio::Tools::Run::RemoteBLAST Results accessed by either Bio::Tools::BPlite or Bio::SearchIO, depends on readmethod setting In this example, Bio::SearchIO is used (which is the default)

    23. MSCS 230 Bioinformatics - BLAST and FASTA 23 Example Script

    24. MSCS 230 Bioinformatics - BLAST and FASTA 24 Example Script

    25. MSCS 230 Bioinformatics - BLAST and FASTA 25 Bio::Tools::pSW

    26. MSCS 230 Bioinformatics - BLAST and FASTA 26 Bio::Tools::pSW

More Related