Memory-aware BWT by Segmenting Sequences

Memory-aware BWT by Segmenting Sequences. Presented by Jiaying Wang, April 12, 2012. The 14th Asia-Pacific Web Conference (APWeb). Northeastern University, China. Motivation. Most interesting massive data sets contain string data (web data, record data, genome data, etc.)


Presentation Transcript


  1. Memory-aware BWT by Segmenting Sequences. Presented by Jiaying Wang, April 12, 2012. The 14th Asia-Pacific Web Conference (APWeb). Northeastern University, China.

  2. Motivation. Most interesting massive data sets contain string data (web data, record data, genome data, etc.). BWT as a full-text index provides fast substring search over large text collections. However, building the BWT has an enormous memory cost: n log n + n log σ bits.

  3. Preliminaries. The text is T[0..n − 1], with T[i] ∈ Σ and |Σ| = σ. We append a terminator $ to the end of the text; $ does not belong to Σ. T[i..j] is the sequence starting at position i and ending at position j. It is the empty string iff i > j, a prefix of T iff i = 0, and a suffix of T iff j = n − 1.

  4. Problem definition. Let T[0..n − 1] be a text and P[0..m − 1] be a query. The substring matching problem is to find all start positions of occurrences of P in T, i.e. {i | 0 ≤ i ≤ n − m; T[i..i+m−1] = P[0..m−1]}. We also take memory cost into account: the process should keep query time and memory cost low at the same time.
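The set in the problem definition can be written down directly as a brute-force scan. This is only a reference sketch of what a correct answer looks like, not the indexed search the talk proposes; the function name `occurrences` is made up for illustration.

```python
def occurrences(text: str, pattern: str) -> list[int]:
    """Return every start position i with text[i:i+m] == pattern."""
    n, m = len(text), len(pattern)
    # Valid start positions run from 0 to n - m inclusive.
    return [i for i in range(n - m + 1) if text[i:i + m] == pattern]


print(occurrences("mississippi", "ssi"))  # [2, 5]
```

Any index-based method discussed on the later slides should return the same positions as this scan.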

  5. BWT transformation. text: mississippi$   bwt: ipssm$pissii
     Rotations in text order: mississippi$ ississippi$m ssissippi$mi sissippi$mis issippi$miss ssippi$missi sippi$missis ippi$mississ ppi$mississi pi$mississip i$mississipp $mississippi
     Sorted rotations (F is the first column, L is the last column and equals the BWT, SA is the suffix array):

     SA   F            L
     11   $ mississipp i
     10   i $mississip p
      7   i ppi$missis s
      4   i ssippi$mis s
      1   i ssissippi$ m
      0   m ississippi $
      9   p i$mississi p
      8   p pi$mississ i
      6   s ippi$missi s
      3   s issippi$mi s
      5   s sippi$miss i
      2   s sissippi$m i
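The mississippi$ example can be reproduced with a few lines of code. The sketch below uses a naive suffix sort, which is an assumption made for brevity and not the construction used in the talk; the names `suffix_array` and `bwt_from_sa` are hypothetical.

```python
def suffix_array(text: str) -> list[int]:
    # Naive construction: sort suffix start positions by the suffix itself.
    # '$' sorts before the lowercase letters in ASCII, as the slide assumes.
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt_from_sa(text: str, sa: list[int]) -> str:
    # L[j] is the character preceding the j-th smallest suffix;
    # for suffix 0 this wraps around to the terminator (text[-1]).
    return "".join(text[i - 1] for i in sa)


text = "mississippi$"
sa = suffix_array(text)
print(sa)                     # [11, 10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2]
print(bwt_from_sa(text, sa))  # ipssm$pissii
```

The whole suffix array has to sit in memory next to the text during this step, which is the cost the talk sets out to reduce.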

  6. Backward search on BWT. C[k] is the number of characters in the text that are smaller than k; occ(k, i) is the number of occurrences of k in bwt[0..i − 1].

     l ← 0, h ← bwt.length
     for i from pat.length − 1 down to 0
         k = pat[i]
         l = C[k] + occ(k, l)
         h = C[k] + occ(k, h)
     return h − l

     Figure: searching "ssi" on the F and L columns of the sorted rotations of mississippi$.
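A minimal runnable version of the backward search loop, assuming the usual definitions of C and occ. Here occ is computed by scanning the BWT string, which is slow but easy to check; a real index would use a rank structure instead. The name `backward_count` is hypothetical.

```python
def backward_count(bwt: str, pattern: str) -> int:
    """Count occurrences of pattern in the text whose BWT is given."""
    # C[c] = number of characters in the text strictly smaller than c.
    C, total = {}, 0
    for c in sorted(set(bwt)):
        C[c] = total
        total += bwt.count(c)

    def occ(c: str, i: int) -> int:
        # Occurrences of c in bwt[0:i] (naive scan, for illustration only).
        return bwt[:i].count(c)

    l, h = 0, len(bwt)
    for c in reversed(pattern):
        if c not in C:
            return 0
        l = C[c] + occ(c, l)
        h = C[c] + occ(c, h)
        if l >= h:  # empty range: the pattern does not occur
            return 0
    return h - l


print(backward_count("ipssm$pissii", "ssi"))  # 2
```

For "ssi" the final range is [10, 12), and the suffix array entries at those rows are 5 and 2, the two start positions in mississippi.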

  7. Memory cost analysis. Building the BWT has an enormous memory cost: n log n + n log σ bits, about 5n bytes (1 GB of text needs about 5 GB). For example, for mississippi$ (bwt: ipssm$pissii, SA: 11 10 7 4 1 0 9 8 6 3 5 2), the text takes 12 bytes and the suffix array 12×4 bytes: 12 + 12×4 = 12×5 = 60 bytes.
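The 5n figure follows from the slide's arithmetic if each character takes 1 byte and each suffix array entry takes 4 bytes. Both sizes are assumptions read off the 12 + 12×4 example, and the helper name `bwt_build_bytes` is hypothetical.

```python
def bwt_build_bytes(n: int, sa_entry_bytes: int = 4, char_bytes: int = 1) -> int:
    """Bytes needed to hold the text and its suffix array at the same time."""
    return n * (char_bytes + sa_entry_bytes)


print(bwt_build_bytes(12))               # 60
print(bwt_build_bytes(2**30) / 2**30)    # 5.0 (GiB for a 1 GiB text)
```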

  8. Our idea (1/2). Split mississippi into the segments missis and sippi. Loading one segment at a time saves memory. Searching for ssi then raises a question: how do we find an occurrence that is cut by a segment boundary?

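  9. Our idea (2/2). Split mississippi into the overlapping segments mississi and issippi, then search for ssi. Oops, we find another one: the occurrence inside the overlap is reported by both segments.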
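Both splits of mississippi can be tried with a plain per-segment scan. This sketch uses ordinary string matching in place of a per-segment BWT, and the helper names are hypothetical; it only shows which positions each split reports.

```python
def find_all(text: str, pattern: str) -> list[int]:
    # All start positions of pattern inside one segment.
    return [i for i in range(len(text) - len(pattern) + 1)
            if text.startswith(pattern, i)]


def search_segments(segments: list[tuple[int, str]], pattern: str) -> list[int]:
    # segments holds (global offset, segment text) pairs, searched one at a time.
    hits = []
    for offset, seg in segments:
        hits += [offset + i for i in find_all(seg, pattern)]
    return hits


disjoint = [(0, "missis"), (6, "sippi")]
overlapped = [(0, "mississi"), (4, "issippi")]
print(search_segments(disjoint, "ssi"))    # [2]  (position 5 is missed)
print(search_segments(overlapped, "ssi"))  # [2, 5, 5]  (position 5 twice)
```

The disjoint split misses the occurrence that straddles the boundary, while the overlapped split finds it twice.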

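  10. BWT on Overlapped Segments. Diagram: T is split into overlapping segments T1, T2, …, Tk (lengths labeled L and l in the figure), and each segment Ti is transformed separately into BWTi.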

  11. Searching cases. Prerequisite: query length ≤ l. For the second case, we have to remove duplicates from the results.

  12. Filtering method. Filter interval: f = l − m. All occurrences starting at positions in a filter interval should be filtered out.
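One way to realize this filtering is to drop, in every segment except the first, any hit that lies entirely inside the overlap with the previous segment, since that segment already reported it. The sketch below assumes l is the overlap length and uses plain string matching per segment in place of a BWT. Its boundary rule (`i + m <= overlap`) covers l − m + 1 start positions, so it may differ by one from the slide's f = l − m, whose exact convention is not given. All names are hypothetical.

```python
def split_overlapped(text: str, seg_len: int, overlap: int) -> list[tuple[int, str]]:
    # Consecutive segments share `overlap` characters; the last may be shorter.
    step = seg_len - overlap
    segments, start = [], 0
    while True:
        segments.append((start, text[start:start + seg_len]))
        if start + seg_len >= len(text):
            return segments
        start += step


def search_overlapped(text: str, pattern: str, seg_len: int, overlap: int) -> list[int]:
    m = len(pattern)
    assert 0 < m <= overlap < seg_len  # prerequisite: query length <= l
    hits = []
    for offset, seg in split_overlapped(text, seg_len, overlap):
        for i in range(len(seg) - m + 1):
            if not seg.startswith(pattern, i):
                continue
            if offset > 0 and i + m <= overlap:
                continue  # entirely inside the overlap: previous segment has it
            hits.append(offset + i)
    return hits


print(search_overlapped("mississippi", "ssi", seg_len=8, overlap=4))  # [2, 5]
```

With a segment length of 8 and an overlap of 4 the split is mississi / issippi, and the duplicate at position 5 is reported only once.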

  13. Searching algorithm

  14. BWT on Disjoint Segments. Diagram: T is split into disjoint segments T1, T2, …, Tk, and each segment Ti is transformed separately into BWTi.

  15. Searching cases. For the second case, we need to:
     1. Find a suffix of the query that is a prefix of a segment.
     2. Verify that the remaining prefix of the query matches the end of the segment to the left.

  16. Suffix checking. Time complexity: Θ(m)

  17. Prefix verification. To verify the prefix, we can either:
     1. keep the text (wastes space), or
     2. reconstruct the text from the BWT on the fly (wastes a little time).
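The two steps for boundary-crossing matches (suffix checking, then prefix verification) can be sketched on plain strings. This corresponds to the "keep the text" option and tries every split point naively, so it does not reach the Θ(m) suffix check quoted on the slide. It also assumes every segment except the last is at least as long as the query, so a match spans at most two segments. The name `search_disjoint` is hypothetical.

```python
def search_disjoint(segments: list[str], pattern: str) -> list[int]:
    m = len(pattern)
    # Assumption: no match can span more than two segments.
    assert all(len(s) >= m for s in segments[:-1])
    hits, offset = [], 0
    for j, seg in enumerate(segments):
        # Case 1: the match lies entirely inside this segment.
        hits += [offset + i for i in range(len(seg) - m + 1)
                 if seg.startswith(pattern, i)]
        # Case 2: the match crosses the boundary with the left segment.
        if j > 0:
            left = segments[j - 1]
            for cut in range(1, m):
                # Suffix check: pattern[cut:] must be a prefix of this segment.
                # Prefix verification: pattern[:cut] must end the left segment.
                if seg.startswith(pattern[cut:]) and left.endswith(pattern[:cut]):
                    hits.append(offset - cut)
        offset += len(seg)
    return sorted(hits)


print(search_disjoint(["missis", "sippi"], "ssi"))  # [2, 5]
```

The occurrence at position 5 is recovered because "si" is a prefix of sippi and the remaining "s" ends missis.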

  18. Searching algorithm

  19. Analysis.
     Overlap method: memory cost (n + l + k) × (log σ + log(n + l + k) − log k)/k; time complexity Θ(occ + δ + mk).
     Backwalk method: memory cost n(log σ + log n − log k)/k bits; time complexity Θ(occ + (η + k)m).

  20. Experiment. Environment: C++; PC with a 2.93 GHz Intel Core CPU and 4 GB main memory; Ubuntu (Linux). Data sets: English text from the Pizza&Chili Corpus; genome sequence from UCSC goldenPath.

  21. Performance on English. Charts: memory cost, build time, query time (two panels).

  22. Performance on genome. Charts: memory cost, build time, query time (two panels).

  23. More performance

  24. Conclusion. We propose a novel variation of BWT called S-BWT. Our index uses less memory than BWT. We give two query methods based on S-BWT. Our method is faster than the BWT method on large text.

  25. Thank you! Q&A
