
BLAT – The BLAST-Like Alignment Tool

BLAT – The BLAST-Like Alignment Tool. Kent, W.J. Genome Res. 2002 12: 656-664. Presenter: 巨彥霖 田知本. BLAT overview: use an index to find regions in the genome homologous to the query, then do a detailed alignment between the query and the homologous regions.


Presentation Transcript


  1. BLAT – The BLAST-Like Alignment Tool Kent, W.J. Genome Res. 2002 12: 656-664 Presenter: 巨彥霖 田知本

  2. BLAT overview • Use an index to find regions in the genome homologous to the query. • Do a detailed alignment between the query and the homologous regions. • Use dynamic programming to stitch the detailed alignments of the individual regions into a detailed alignment of the whole query.

  3. Index • Database: non-overlapping K-mers • Query: overlapping K-mers [Figure: a sequence divided into consecutive K-mers]

  4. Example • Database: cacaattatcacgaccgc; non-overlapping 3-mers: cac aat tat cac gac cgc; index: aat → 3, cac → 0, 9, cgc → 15, gac → 12, tat → 6 • Query: aattctcac; overlapping 3-mers with their query positions: aat (0), att (1), ttc (2), tct (3), ctc (4), tca (5), cac (6)
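
The example on this slide can be reproduced with a few lines of Python. This is only a minimal sketch of the indexing idea, not BLAT's actual data structure; the function names `build_index` and `find_hits` are made up for illustration.

```python
from collections import defaultdict

def build_index(database, k):
    """Map each non-overlapping k-mer of the database to its start positions."""
    index = defaultdict(list)
    for pos in range(0, len(database) - k + 1, k):  # step by k: no overlap
        index[database[pos:pos + k]].append(pos)
    return dict(index)

def find_hits(query, index, k):
    """Look up every overlapping query k-mer; return (query_pos, db_pos) pairs."""
    hits = []
    for qpos in range(len(query) - k + 1):  # step by 1: overlapping
        for dpos in index.get(query[qpos:qpos + k], []):
            hits.append((qpos, dpos))
    return hits

index = build_index("cacaattatcacgaccgc", 3)
hits = find_hits("aattctcac", index, 3)
print(index)  # cac -> [0, 9], aat -> [3], tat -> [6], gac -> [12], cgc -> [15]
print(hits)   # [(0, 3), (6, 0), (6, 9)]
```

Only the query 3-mers aat and cac occur in the index, so the query produces three hits against the database.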

  5. Search Criteria • Single Perfect Matches • Single Near Perfect Matches • Multiple Perfect Matches

  6. Notation • K : K-mer size • M : The match ratio between homologous areas • H : Homologous region size • G : Database (genome) size • Q : Query sequence size • A : The alphabet size

  7. Single Perfect Matches (1) [Figure: a K-mer aligned to the homologous region as a perfect match]

  8. Single Perfect Matches (2) [Figure: a homologous region of size H divided into non-overlapping K-mers] • Sensitivity: the probability that at least one K-mer in the homologous region is a perfect match
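
In the slide's notation, the sensitivity formula from the cited paper (Kent 2002) can be written as below. The symbols p_1 (probability that one specific K-mer matches perfectly) and T (number of non-overlapping K-mers in the homologous region) are helper names taken from the paper rather than from the notation slide.

```latex
p_1 = M^{K}, \qquad
T = \left\lfloor \frac{H}{K} \right\rfloor, \qquad
P = 1 - \left(1 - p_1\right)^{T}
```

Here P is the probability that at least one of the T K-mers in the homologous region matches the query perfectly.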

  9. Single Perfect Matches (3) • Number of K-mers in the database = G / K • Number of K-mers in the query = Q − K + 1 → Specificity: the number of K-mer matches expected to occur by chance
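
Combining the two counts above with the probability that two random K-mers over an alphabet of size A are identical gives the expected number of chance matches, as stated in the cited paper (Kent 2002). The symbol F is the paper's name for this quantity and does not appear on the notation slide.

```latex
F = (Q - K + 1) \cdot \frac{G}{K} \cdot \left(\frac{1}{A}\right)^{K}
```

This assumes that every letter of the alphabet is equally likely, which is a simplification for real genomes.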

  10. Single Perfect Nucleotide K-mer Matches as Search Criterion

  11. Case (perfect match) • Comparing mouse and human coding sequences at the nucleotide level: H = 100, M = 86%, sensitivity = 0.99 → max K = 7, chance matches = 13,078,962 (query = 500, database = 3 billion)
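
The case above can be checked numerically. The sketch below assumes the single-perfect-match formulas from the cited paper (Kent 2002): sensitivity P = 1 − (1 − M^K)^T with T = ⌊H/K⌋, and chance matches F = (Q − K + 1)(G/K)(1/A)^K. The function names are hypothetical.

```python
def sensitivity_perfect(K, M, H):
    """Probability that at least one non-overlapping K-mer matches perfectly."""
    T = H // K  # number of non-overlapping K-mers in the homologous region
    return 1 - (1 - M ** K) ** T

def chance_matches_perfect(K, Q, G, A):
    """Expected number of perfect K-mer matches that occur by chance."""
    return (Q - K + 1) * (G / K) * (1 / A) ** K

H, M = 100, 0.86               # homologous region size, match ratio
Q, G, A = 500, 3_000_000_000, 4  # query size, database size, alphabet size

# Largest K (searched over 1..20) that still reaches 0.99 sensitivity.
max_k = max(K for K in range(1, 21) if sensitivity_perfect(K, M, H) >= 0.99)
chance = chance_matches_perfect(max_k, Q, G, A)
print(max_k, round(chance))
```

The search returns K = 7, in agreement with the slide. With G taken as exactly 3 × 10^9 the formula gives about 1.29 × 10^7 chance matches, the same order of magnitude as the 13,078,962 quoted on the slide; the exact inputs behind the slide's figure are not stated.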

  12. Single Near Perfect Matches (1) • Near perfect (almost perfect): one letter of the K-mer may mismatch [Figure: a K-mer aligned to the homologous region as a near-perfect match]

  13. Single Near Perfect Matches (2) • Sensitivity • Specificity
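
For the near-perfect criterion, the cited paper (Kent 2002) allows either zero or exactly one mismatch within a K-mer. In the slide's notation this can be written as below; p_1, T, P, and F are the paper's helper symbols, not part of the notation slide.

```latex
p_1 = K \, M^{K-1} (1 - M) + M^{K}, \qquad
T = \left\lfloor \frac{H}{K} \right\rfloor, \qquad
P = 1 - \left(1 - p_1\right)^{T}

F = (Q - K + 1) \cdot \frac{G}{K} \cdot
    \left[ K \left(\frac{1}{A}\right)^{K-1} \left(1 - \frac{1}{A}\right)
           + \left(\frac{1}{A}\right)^{K} \right]
```

The first term of p_1 covers the K ways of placing a single mismatch, and the second term covers the perfect match.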

  14. Case (near perfect match) • Comparing mouse and human coding sequences at the nucleotide level: H = 100, M = 86%, sensitivity = 0.99 → max K = 12, chance matches = 275,671 (query = 500, database = 3 billion)
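
The near-perfect case can be checked the same way. The sketch below assumes the near-perfect formulas from the cited paper (Kent 2002), in which a K-mer counts as a match when it has at most one mismatch; the function names are hypothetical.

```python
def sensitivity_near_perfect(K, M, H):
    """Probability that at least one non-overlapping K-mer matches with <= 1 mismatch."""
    p1 = K * M ** (K - 1) * (1 - M) + M ** K  # one mismatch, or none
    return 1 - (1 - p1) ** (H // K)

def chance_matches_near_perfect(K, Q, G, A):
    """Expected number of near-perfect K-mer matches that occur by chance."""
    p = K * (1 / A) ** (K - 1) * (1 - 1 / A) + (1 / A) ** K
    return (Q - K + 1) * (G / K) * p

H, M = 100, 0.86
Q, G, A = 500, 3_000_000_000, 4

# Largest K (searched over 2..20) that still reaches 0.99 sensitivity.
max_k = max(K for K in range(2, 21) if sensitivity_near_perfect(K, M, H) >= 0.99)
chance = chance_matches_near_perfect(max_k, Q, G, A)
print(max_k, round(chance))
```

The search returns K = 12, in agreement with the slide, and roughly 2.7 × 10^5 chance matches when G is exactly 3 × 10^9, the same order of magnitude as the slide's 275,671. Allowing one mismatch lets K grow from 7 to 12, which cuts the number of chance matches by about a factor of fifty.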

  15. Single Near Perfect Nucleotide K-mer Matches as Search Criterion

  16. Multiple Perfect Matches • A hit is triggered when: • there are N perfect matches • each is no further than W letters from the next in the database coordinate • all have the same diagonal coordinate (the difference between the database and query coordinates)

  17. Example [Figure: hits a, b, c, and d plotted by query coordinate against target coordinate, with a window of width W] The hits a, b, c, and d are all K letters long. Hits b and d have the same diagonal coordinate and lie within W letters of each other. Therefore, they satisfy the two-perfect-K-mer search criterion.
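
The multiple-perfect-match criterion can be sketched in Python as follows. This is an illustration under stated assumptions, not BLAT's implementation: the diagonal is taken as target position minus query position, "within W" is checked between consecutive hits on a diagonal, and the coordinates of hits a to d are invented to mimic the figure.

```python
from collections import defaultdict

def triggers_hit(hits, n, w):
    """True if some diagonal holds n k-mer hits, each within w letters of the next.

    hits: iterable of (query_pos, target_pos) pairs for perfect k-mer matches.
    """
    by_diagonal = defaultdict(list)
    for qpos, tpos in hits:
        by_diagonal[tpos - qpos].append(tpos)  # same diagonal: same difference
    for positions in by_diagonal.values():
        run, prev = 0, None
        for cur in sorted(positions):
            # Extend the current run if this hit is close enough to the last one.
            run = run + 1 if prev is not None and cur - prev <= w else 1
            prev = cur
            if run >= n:
                return True
    return False

# Hypothetical coordinates: b and d share diagonal 2 and are 8 letters apart.
a, b, c, d = (2, 30), (10, 12), (25, 5), (18, 20)
print(triggers_hit([a, b, c, d], n=2, w=10))  # True
print(triggers_hit([a, b, c], n=2, w=10))     # False
```

Grouping by diagonal first keeps the check cheap, because only hits that could belong to the same ungapped alignment are compared with each other.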

  18. Multiple Perfect Nucleotide K-mer Matches as Search Criterion

  19. Default • Nucleotide: two perfect 11-mers • Protein: a single perfect 5-mer for the standalone version; three perfect 4-mers for the client/server version

  20. BLAST • Build the hash table for Sequence A. • Scan Sequence B for hits. • Extend hits.

  21. BLAST Step 1: Build the hash table for Sequence A (3-tuple example). For protein sequences: Seq. A = ELVIS • Add xyz to the hash table if Score(xyz, ELV) ≥ T • Add xyz to the hash table if Score(xyz, LVI) ≥ T • Add xyz to the hash table if Score(xyz, VIS) ≥ T. For DNA sequences: Seq. A = AGATCGAT (positions 1 to 8); hash table: AAA (empty), AAC (empty), …, AGA → 1, …, ATC → 3, …, CGA → 5, …, GAT → 2, 6, …, TCG → 4, …, TTT (empty)
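
The DNA half of this slide can be reproduced with a short Python sketch. It records only exact 3-mers with 1-based positions, as on the slide; the protein case, which also adds neighbouring words scoring at least T, is not covered. The function name `build_word_table` is made up for illustration.

```python
from collections import defaultdict

def build_word_table(seq, w=3):
    """Map every overlapping w-mer of seq to its 1-based start positions."""
    table = defaultdict(list)
    for i in range(len(seq) - w + 1):
        table[seq[i:i + w]].append(i + 1)  # +1: the slide counts from 1
    return dict(table)

table = build_word_table("AGATCGAT")
for word in sorted(table):
    print(word, table[word])
# AGA [1]
# ATC [3]
# CGA [5]
# GAT [2, 6]
# TCG [4]
```

Words that never occur in the sequence, such as AAA or TTT, are simply absent from the dictionary instead of being stored as empty entries.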

  22. BLAST Step 2: Scan sequence B for hits.

  23. BLAST Step 2: Scan sequence B for hits. Step 3: Extend hits. BLAST 2.0 saves the time spent in extension, and considers gapped alignments. Terminate the extension of a hit if its score fades away (that is, when we reach a segment pair whose score falls a certain distance below the best score found for shorter extensions).
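
The termination rule for extension can be illustrated with a small ungapped, right-hand extension. This is a sketch only: the scores (+1 match, −1 mismatch) and the drop-off threshold `x_drop` are illustrative values, not BLAST's defaults, and the function name is hypothetical.

```python
def extend_right(a, b, i, j, match=1, mismatch=-1, x_drop=3):
    """Extend an ungapped alignment rightwards from a[i] and b[j].

    Returns (best_score, best_length). The loop stops once the running score
    has fallen more than x_drop below the best score seen so far.
    """
    score = best = best_len = n = 0
    while i + n < len(a) and j + n < len(b):
        score += match if a[i + n] == b[j + n] else mismatch
        n += 1
        if score > best:
            best, best_len = score, n   # new best extension
        elif best - score > x_drop:
            break                       # the score has faded away
    return best, best_len

print(extend_right("ACGTACGTTTTT", "ACGTACGAAAAA", 0, 0))  # (7, 7)
```

In the usage line the first seven letters match, after which four consecutive mismatches push the score more than `x_drop` below the best, so the extension is cut back to length 7.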

  24. Algorithm • Search stage: use an index to find regions in the genome homologous to the query • Alignment stage: do a detailed alignment between the query and the homologous regions • Stitching and filling in: use dynamic programming to stitch the detailed alignments of the individual regions into a detailed alignment of the whole query

  25. Search Stage • Build an index which contains positions of each K-mer in database. • Step through each overlapping K-mer in query and look it up in index • Get list of ‘hits’ - positions in query and in database that match for K bases • Cluster hits to find homologous regions

  26. Search Stage • Clump hits

  27. Search Stage • Eliminate small clumps • Clump the remaining ‘clumps’ into homologous regions

  28. Alignment Stage (nucleotide) • Start from scratch with the regions defined by the K-mer hits • Index on smaller K-mers, but extend each K-mer until it becomes specific • Extend in both directions without mismatches or gaps, and merge overlapping or contiguous alignments • Recurse on the gaps with a smaller K until the gaps are closed or no more hits are found
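
The "extend in both directions without mismatches or gaps" step can be sketched as follows. This shows only the exact-match extension of a single seed; merging and the recursion on gaps are left out. The function name and the example sequences are hypothetical.

```python
def extend_exact(query, target, qpos, tpos, k):
    """Grow a k-mer match at (qpos, tpos) in both directions while bases agree.

    Returns half-open intervals (q_start, q_end, t_start, t_end).
    """
    q_start, t_start = qpos, tpos
    while q_start > 0 and t_start > 0 and query[q_start - 1] == target[t_start - 1]:
        q_start, t_start = q_start - 1, t_start - 1   # extend to the left
    q_end, t_end = qpos + k, tpos + k
    while q_end < len(query) and t_end < len(target) and query[q_end] == target[t_end]:
        q_end, t_end = q_end + 1, t_end + 1           # extend to the right
    return q_start, q_end, t_start, t_end

query, target = "ttacgtacgg", "ccacgtacgaa"
# Seed: the 3-mer "gta" at position 4 in both sequences.
print(extend_exact(query, target, 4, 4, 3))  # (2, 9, 2, 9)
```

The seed "gta" grows to the exact match "acgtacg", which stops on both sides at the first mismatching base.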

  29. Alignment Stage (nucleotide) recursive

  30. Alignment Stage (protein) • Extend hits into maximal-scoring ungapped alignments (HSPs) with a +2/−1 scoring scheme • Create a graph of all possible HSP merges • Use dynamic programming to traverse the graph
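
The graph traversal can be illustrated with a simple chaining routine. This is a simplified sketch, not BLAT's scoring: an edge from one HSP to another is assumed to exist whenever the first ends before the second starts in both query and target, and no gap penalty is charged. The HSP tuples and the name `best_chain` are invented for the example.

```python
def best_chain(hsps):
    """Find the highest-scoring colinear chain of HSPs by dynamic programming.

    hsps: list of (q_start, q_end, t_start, t_end, score), half-open intervals.
    Returns (total_score, chain) with the chain in query order.
    """
    if not hsps:
        return 0, []
    order = sorted(range(len(hsps)), key=lambda i: hsps[i][0])
    best, prev = {}, {}
    for pos, i in enumerate(order):
        q_start, _, t_start, _, score = hsps[i]
        best[i], prev[i] = score, None
        for j in order[:pos]:  # any predecessor starts earlier in the query
            _, q_end_j, _, t_end_j, _ = hsps[j]
            if q_end_j <= q_start and t_end_j <= t_start and best[j] + score > best[i]:
                best[i], prev[i] = best[j] + score, j
    end = max(best, key=best.get)
    total, chain = best[end], []
    while end is not None:     # walk the back-pointers
        chain.append(hsps[end])
        end = prev[end]
    return total, chain[::-1]

hsps = [(0, 10, 100, 110, 20), (12, 20, 115, 123, 16),
        (5, 15, 300, 310, 18), (22, 30, 125, 133, 14)]
total, chain = best_chain(hsps)
print(total, chain)
```

In the usage example the third HSP lies far away in the target and cannot be merged with the others, so the best chain consists of the remaining three with a total score of 50.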

  31. Alignment Stage (protein)

  32. Alignment Stage (protein) [Figure: HSPs between the query and the homologous region]

  33. Stitching and Filling In • The alignment of a gene is often scattered across multiple homologous regions found in the search stage [Figure: query aligned against the database]

  34. Stitching and Filling In [Figure: query, homologous regions, and database]

  35. Evaluation • Comparison with Other Tools: • mRNA/Genome Alignments • Remapped 713 mRNAs corresponding to annotated chromosome 22 • BLAT took 26 sec while Sim4 took 17,468 sec (almost 5h)

  36. Evaluation • Comparison with Other Tools: • Translated Mouse/Human Alignments • 13 million mouse genomic reads vs. human chromosome 22

  37. BLAT vs. BLAST • Index: query vs. database • Hits: perfect vs. near perfect • Alignment: separate vs. together

  38. Magic Time !

  39. Magic 4 Prediction ! No mind ! Great ! 3 3 .5 4

  40. Reference • http://amber.cs.umd.edu/class/838-s04/nada.ppt • http://bioportal.weizmann.ac.il/course/ATIB/ATIB03_lecture3.print.pdf
