
ALAE: Accelerating Local Alignment with Affine Gap Exactly in Biosequence Databases

Xiaochun Yang, Honglei Liu, Bin Wang. Northeastern University, China.


Presentation Transcript


  1. ALAE: Accelerating Local Alignment with Affine Gap Exactly in Biosequence Databases • Xiaochun Yang, Honglei Liu, Bin Wang • Northeastern University, China

  2. Local Alignment • Similar over short conserved regions • Dissimilar over remaining regions • Applications • Comparing long stretches of anonymous DNA • Searching for unknown domains or motifs within proteins from different families • …

  3. Related Work • Smith-Waterman algorithm (1981) • An exact approach but very slow • Not used for search • BLAST: an efficient but approximate approach • OASIS: an exact approach, efficient only for short query sequences (fewer than 60 characters) • BWT-SW: an exact approach but inefficient • Our target • An efficient and exact approach: ALAE (Accelerating Local Alignment with affine gap Exactly)

  4. Local Alignment • Input: two sequences P and T, a similarity function, a threshold H • Output: alignments of P and T with score >= H.

  5. Measure Similarity • Scoring scheme <sa, sb, sg, ss> • An identical mapping: positive score sa • A mismatch: negative score sb • A gap of length r: negative score sg + r×ss (sg: gap opening penalty, ss: gap extension penalty) • Example with scoring scheme <1, -3, -2, -1>: S1 = TGCGC-ATGGATTGACCGA, S2 = TGCGCCATTGAT--ACCGA • sim(S1,S2) = 15×1 + (-3) + (-2-1) + (-2 + 2×(-1)) = 5
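The arithmetic on slide 5 can be checked mechanically. The following is a minimal sketch, not code from the presentation: the function name `alignment_score` and its argument order are assumptions made for illustration. It scores two already-aligned rows column by column, charging sg once when a gap opens and ss for every gap character.

```python
def alignment_score(s1, s2, sa, sb, sg, ss):
    """Score two gapped rows of equal length under the scheme <sa, sb, sg, ss>."""
    assert len(s1) == len(s2)
    score = 0
    prev_gap = None  # row (1 or 2) that held a gap in the previous column, if any
    for a, b in zip(s1, s2):
        if a == "-" or b == "-":
            row = 1 if a == "-" else 2
            if prev_gap != row:
                score += sg  # a new gap starts: pay the opening penalty once
            score += ss      # every gap character pays the extension penalty
            prev_gap = row
        else:
            score += sa if a == b else sb  # match or mismatch
            prev_gap = None
    return score


S1 = "TGCGC-ATGGATTGACCGA"
S2 = "TGCGCCATTGAT--ACCGA"
print(alignment_score(S1, S2, 1, -3, -2, -1))  # 5, as on the slide
```

The 15 matches, one mismatch, one gap of length 1 and one gap of length 2 contribute 15, -3, -3 and -4 respectively.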

  6. A Basic Approach • Matrix entry (i, j): the best alignment score of X[1,i] and any substring of P ending at position j. (Figure: matrix of X against P, with P and T shown alongside.)
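One way to read the entry definition on slide 6 is as an affine-gap DP in which X must be aligned from its first character while the substring of P may start anywhere. The sketch below follows that reading; it is an assumption about the recurrence, not the authors' code, and the names `anchored_matrix`, `M`, `Ga` and `Gb` are chosen for illustration only.

```python
NEG = float("-inf")


def anchored_matrix(x, p, sa, sb, sg, ss):
    """M[i][j]: best score of x[:i] against any substring of p ending at j."""
    rows, cols = len(x) + 1, len(p) + 1
    M = [[NEG] * cols for _ in range(rows)]
    Ga = [[NEG] * cols for _ in range(rows)]  # alignment ends with a gap opposite x[i-1]
    Gb = [[NEG] * cols for _ in range(rows)]  # alignment ends with a gap opposite p[j-1]
    for j in range(cols):
        M[0][j] = 0  # empty prefix of x against the empty substring ending at j
    for i in range(1, rows):
        Ga[i][0] = sg + i * ss  # x[:i] aligned entirely against one gap
        M[i][0] = Ga[i][0]
        for j in range(1, cols):
            Ga[i][j] = max(Ga[i - 1][j] + ss, M[i - 1][j] + sg + ss)
            Gb[i][j] = max(Gb[i][j - 1] + ss, M[i][j - 1] + sg + ss)
            diag = M[i - 1][j - 1] + (sa if x[i - 1] == p[j - 1] else sb)
            M[i][j] = max(diag, Ga[i][j], Gb[i][j])
    return M


M = anchored_matrix("GCTA", "GCTAG", 1, -3, -5, -2)
print(M[4][4])  # X = GCTA matches P[1,4] exactly
```

Unlike Smith-Waterman, there is no reset to 0 inside the matrix, because every alignment is anchored at the start of X.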

  7. A DP Algorithm

  8. An Example of a DP Matrix • P = GCTAG, T = AAAGCTA • Scoring scheme = <1, -3, -5, -2> (Figure: DP matrix with gap matrices Ga and Gb; edge labels -2 and -5-2.)
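For reference, the sequences and scoring scheme of slide 8 can be run through a textbook Smith-Waterman recurrence with affine gaps (Gotoh's formulation). This is a baseline sketch, not ALAE itself; `best_local_score` is a hypothetical name, and which of the two gap matrices the slide calls Ga or Gb is an assumption.

```python
def best_local_score(t, p, sa, sb, sg, ss):
    """Best local alignment score of t and p under the scheme <sa, sb, sg, ss>."""
    neg = float("-inf")
    rows, cols = len(t) + 1, len(p) + 1
    H = [[0] * cols for _ in range(rows)]      # best score ending at (i, j), floored at 0
    Ga = [[neg] * cols for _ in range(rows)]   # ends with a gap opposite t[i-1]
    Gb = [[neg] * cols for _ in range(rows)]   # ends with a gap opposite p[j-1]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            # Extending a gap costs ss; opening one costs sg + ss.
            Ga[i][j] = max(Ga[i - 1][j] + ss, H[i - 1][j] + sg + ss)
            Gb[i][j] = max(Gb[i][j - 1] + ss, H[i][j - 1] + sg + ss)
            diag = H[i - 1][j - 1] + (sa if t[i - 1] == p[j - 1] else sb)
            H[i][j] = max(0, diag, Ga[i][j], Gb[i][j])
            best = max(best, H[i][j])
    return best


print(best_local_score("AAAGCTA", "GCTAG", 1, -3, -5, -2))  # 4 (the shared substring GCTA)
```

This fills all (m+1)×(n+1) entries of each matrix, which is the cost the filtering techniques in the later slides aim to avoid.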

  9. A Basic Approach (Figure: P and T with indices i and j; i = i1+t1 = i2+t2.)

  10. Challenges • Speed • Each matrix contains m ~ m×n entries • n matrices • How to avoid calculating most of the entries without impairing the accuracy of the alignment results? • In-memory algorithm • Long sequences: both T and P are long

  11. Contributions • Speed • Prune unnecessary calculations • Avoid duplicate calculations • In-memory algorithm • Use compressed suffix array • Mathematical analysis

  12. Outline • Local filterings • Global filtering • Reusing calculations • A hybrid algorithm

  13. Local filterings • Length Filtering (Figure: pruned region.)

  14. Local filterings • Score Filtering (Figure: pruned region.)

  15. Local filterings • q-Prefix Filtering • Simpler function (Figure: pruned regions.)

  16. Comparison of Calculating One Matrix • P = GCTAAGCTAAGCTGC (positions 1 to 15) • X = GCTAAGCTAGT (positions 1 to 11) • Scoring scheme <1, -3, -5, -2> • H = 3

  17. Comparison of Calculating One Matrix • P = GCTAAGCTAAGCTGC (positions 1 to 15) • X = GCTAAGCTAGT (positions 1 to 11) • Scoring scheme <1, -3, -5, -2> • H = 3

  18. Outline • Local filterings • Global filtering • Reusing calculations • A hybrid algorithm

  19. Global Filtering (Figure: pruned region; i = i1+t1 = i2+t2.)

  20. Global Filtering • It is unnecessary to calculate the fork area in the matrix of X and P • Using X’: alignment score >= Sa • Question: can calculations be safely avoided based on already calculated matrices? (Figure: pruned fork areas.)

  21. Global Filtering • Update and check unnecessary calculations on-the-fly using a Boolean matrix • (1) Space consuming: m×n space • (2) Calculation order (Figure: X and X’, scoring scheme <1, -3, -5, -2>.)

  22. Global Filtering • q-prefix domination: X’ dominates X

  23. Global Filtering • q-prefix domination: X’ dominates X • Text T • Constructing dominations offline in O(n) time • Query P • Check useless calculations on-the-fly • Calculation order is unnecessary.

  24. Outline • Local filterings • Global filtering • Reusing calculations • A hybrid algorithm

  25. Reusing score calculations for P • Reusable alignment entries • Entries with a common prefix Ps can share alignment scores.

  26. Reusing score calculations for P • Reusable alignment entries • If two forks have equivalent scores for their FGOEs, their entries with common substring Ps can share alignment scores.

  27. Outline • Local filterings • Global filtering • Reusing calculations • A hybrid algorithm

  28. A Hybrid Algorithm • Row by row • Column by column

  29. Mathematical Analysis • Upper bound on the number of calculated entries for representative scoring schemes specified by BLAST (http://blast.ncbi.nlm.nih.gov/Blast.cgi) • DNA: 4.50mn^0.520 ~ 9.05mn^0.896 • Proteins: 8.28mn^0.364 ~ 7.49mn^0.723

  30. Experiments • Data sets • Human genome data set: length of a text: 50 million ~ 1 billion • Mouse genome data set: length of each query: 1 thousand ~ 1 million • Protein data set: length of a text: 10 million ~ 50 million; length of each query: 200 ~ 100,000 • E-value: threshold • Scoring scheme: the same parameters as BLAST • Environment: GNU C++, Intel 2.93GHz quad-core i7 CPU, 8GB memory, 500GB disk, running an Ubuntu (Linux) operating system.

  31. Alignment Time and Number of Results • 76 times faster than BWT-SW • 16 times faster than BWT-SW

  32. Filtering Ratio

  33. Reusing Ratio

  34. Index Size

  35. Conclusions • High efficiency of ALAE • Improves BWT-SW significantly • Accelerates BLAST for most of the scoring schemes • In-memory approach using compressed suffix array • Mathematical analysis • Upper bound on calculated entries

  36. Thank you! Source code to be available at http://faculty.neu.edu.cn/yangxc/project

  37. Simulating Searches Using Compressed Suffix Array • Match a q-length substring in text • Identify forks • Find occurrences of a substring in text • Calculate end positions of alignments • Get all suffixes with the same prefix as Xq

  38. Review of Compressed Suffix Array • T = GCTAGC (positions 1 to 6), T’ = GCTAGC$ (positions 1 to 7) • Conceptual matrix (sorted rotations of T’): $GCTAGC, AGC$GCT, C$GCTAG, CTAGC$G, GC$GCTA, GCTAGC$, TAGC$GC • SA[0,6] = 7 4 6 2 5 1 3 • BWT = CTGGA$C • X = GC • Positions of GC in T • SA[4] = 5 • SA[5] = 1
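The suffix array and BWT on slide 38 can be reproduced with a few lines of code. This is a minimal sketch using a plain, uncompressed suffix array built by sorting; a compressed suffix array stores the same information in less space and is not implemented here. The helper names `suffix_array`, `bwt` and `rows_with_prefix` are assumptions for illustration.

```python
def suffix_array(text):
    """1-based start positions of the suffixes of text, in sorted order."""
    return sorted(range(1, len(text) + 1), key=lambda pos: text[pos - 1:])


def bwt(text, sa):
    """Last column of the conceptual matrix: the character before each suffix."""
    # For pos == 1 the index -1 wraps around to the final '$'.
    return "".join(text[pos - 2] for pos in sa)


def rows_with_prefix(text, sa, x):
    """Rows of the suffix array whose suffix starts with x (linear scan for clarity)."""
    return [row for row, pos in enumerate(sa) if text.startswith(x, pos - 1)]


T1 = "GCTAGC$"  # T' on the slide; '$' sorts before A, C, G, T in ASCII
SA = suffix_array(T1)
print(SA)                                        # [7, 4, 6, 2, 5, 1, 3]
print(bwt(T1, SA))                               # CTGGA$C
rows = rows_with_prefix(T1, SA, "GC")
print([(row, SA[row]) for row in rows])          # [(4, 5), (5, 1)]
```

A real index would locate the row range by backward search on the BWT rather than by scanning every row.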

  39. Compressed Suffix Array: reverse T to T-1 • T = GCTAGC (positions 1 to 6), T-1 = CGATCG$ (characters labeled with their positions in T: 6 5 4 3 2 1 0), T’ = $GCTAGC (positions 0 to 6) • Conceptual matrix (sorted rotations of T-1): $CGATCG, ATCG$CG, CG$CGAT, CGATCG$, G$CGATC, GATCG$C, TCG$CGA • SA[0,6] = 0 4 2 6 1 5 3 • BWT = GGT$CCA • X = GC, P-1 = CG • Positions of CG in T-1 • SA[2] = 2 • SA[3] = 6 • Therefore, positions of GC in T • SA[2]-|X|+1 = 1 • SA[3]-|X|+1 = 5
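The position arithmetic on slide 39 can be sketched as follows, again with a plain sorted suffix array standing in for the compressed one. The function names and the linear scan are assumptions for illustration; the labeling of each character of the reversed text with its position in T follows the slide.

```python
def reversed_suffix_array(t):
    """Suffix array of reverse(t) + '$', with entries labeled by positions in t."""
    r = t[::-1] + "$"
    order = sorted(range(len(r)), key=lambda k: r[k:])
    # The character at 0-based index k of r is t's character at 1-based
    # position len(t) - k; the trailing '$' receives label 0.
    return r, [len(t) - k for k in order]


def positions_via_reverse(t, x):
    """1-based start positions of x in t, found by matching reverse(x) in reverse(t)."""
    r, sa = reversed_suffix_array(t)
    x_rev = x[::-1]
    starts = []
    for label in sa:
        k = len(t) - label             # 0-based index of this suffix in r
        if r.startswith(x_rev, k):
            # label is where x ends in t, so x starts at label - |x| + 1
            starts.append(label - len(x) + 1)
    return sorted(starts)


r, sa_rev = reversed_suffix_array("GCTAGC")
print(r)                                      # CGATCG$
print(sa_rev)                                 # [0, 4, 2, 6, 1, 5, 3]
print(positions_via_reverse("GCTAGC", "GC"))  # [1, 5]
```

Each matching entry of the reversed index is the end position of X in T, which is why the slide subtracts |X| and adds 1 to recover the start.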

  40. Align Distinct Substring in T with P (Figure: substring X of T aligned with P; indices i, j and node v.)

  41. Alignment Time • T = 50 million characters • P = 10 thousand characters
