Counting Suffix Arrays and Strings


Presentation Transcript


  1. Counting Suffix Arrays and Strings

  2. Suffix Array Data Structure

     Text to be indexed:

       position:   1  2  3  4  5  6  7  8  9 10 11 12 13
       character:  T  C  T  T  C  T  C  T  T  C  T  C  $

     Suffix array: the lexicographically sorted list of all suffixes, each given by its start position:

       13 - $
       12 - C$
       10 - CTC$
        5 - CTCTTCTC$
        7 - CTTCTC$
        2 - CTTCTCTTCTC$
       11 - TC$
        9 - TCTC$
        4 - TCTCTTCTC$
        6 - TCTTCTC$
        1 - TCTTCTCTTCTC$
        8 - TTCTC$
        3 - TTCTCTTCTC$

     Dagstuhl, May 2006 - Jens Stoye
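The sorted list on this slide can be reproduced with a minimal Python sketch. It sorts the suffixes naively, which is not necessarily how the talk constructs the array. The function name `suffix_array` is an illustrative choice, and the sketch relies on `$` sorting before the letters in ASCII order.

```python
def suffix_array(t):
    """Return the 1-based start positions of all suffixes of t, sorted lexicographically."""
    return sorted(range(1, len(t) + 1), key=lambda i: t[i - 1:])

text = "TCTTCTCTTCTC$"
for start in suffix_array(text):
    # Same "position - suffix" layout as on the slide.
    print(start, "-", text[start - 1:])
```

Naive sorting compares whole suffixes, so it is only meant for small examples like this one.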

  3. Overview
     • Classify strings sharing the same suffix array
     • Counting strings sharing the same suffix array
     • Counting suffix arrays (lower bound for suffix array compression)
     • Summation identities

  4. 1. Classify Strings for Suffix Array

     t: string of length n; P: permutation of {1,...,n}; R: inverse of P, with $R[n+1] = 0$ by convention.

     Theorem: P is the suffix array of t if and only if for all $i \in \{1,\dots,n-1\}$
     • $t[P[i]] \le t[P[i+1]]$ and
     • $t[P[i]] = t[P[i+1]] \Rightarrow R[P[i]+1] < R[P[i+1]+1]$,
     where the second condition is the same as
     • $R[P[i]+1] > R[P[i+1]+1] \Rightarrow t[P[i]] < t[P[i+1]]$.
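The characterization can be checked mechanically on small inputs. The following Python sketch tests the two conditions directly and compares the outcome with naive suffix sorting. It assumes the convention R[n+1] = 0 (the empty suffix ranks first) and checks i from 1 to n-1; the function names are illustrative, not from the talk.

```python
from itertools import permutations, product

def suffix_array(t):
    """1-based start positions of the suffixes of t in sorted order (naive)."""
    return sorted(range(1, len(t) + 1), key=lambda i: t[i - 1:])

def satisfies_theorem(t, P):
    """Check both conditions of the theorem for string t and permutation P."""
    n = len(t)
    R = [0] * (n + 2)                 # R[n + 1] stays 0 (assumed convention)
    for rank, pos in enumerate(P, start=1):
        R[pos] = rank
    for i in range(n - 1):            # 0-based index into P
        x, y = t[P[i] - 1], t[P[i + 1] - 1]
        if x > y:                     # first condition violated
            return False
        if x == y and not R[P[i] + 1] < R[P[i + 1] + 1]:
            return False              # second condition violated
    return True

def check_all(n, alphabet="ABC"):
    """Exhaustively compare the theorem with naive sorting for length n."""
    for chars in product(alphabet, repeat=n):
        t = "".join(chars)
        sa = suffix_array(t)
        for P in permutations(range(1, n + 1)):
            if satisfies_theorem(t, list(P)) != (list(P) == sa):
                return False
    return True
```

For example, `satisfies_theorem("AABCB", [1, 2, 5, 3, 4])` holds, while swapping the first two entries of the permutation breaks the second condition.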

  5. 1. Classify Strings for Suffix Array

     a) $t[P[i]] \le t[P[i+1]]$ and
     b) $R[P[i]+1] > R[P[i+1]+1] \Rightarrow t[P[i]] < t[P[i+1]]$

     Example: t = A A B C B (positions 1 2 3 4 5)

       i   P[i]   t[P[i]]   rest of suffix
       1   1      A         ABCB
       2   2      A         BCB
       3   5      B
       4   3      B         CB
       5   4      C         B

     $R^+$-descent: a position i at which the premise of b) holds.
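To make the notion concrete, here is a small Python sketch that lists the R+-descents of a permutation. It assumes that an R+-descent is a position i with R[P[i]+1] > R[P[i+1]+1], where R is the inverse of P extended by R[n+1] = 0. That reading is inferred from condition b) and is not spelled out in the transcript.

```python
def rplus_descents(P):
    """Return the 1-based positions i with R[P[i]+1] > R[P[i+1]+1]."""
    n = len(P)
    R = [0] * (n + 2)                 # R[n + 1] = 0 is an assumed convention
    for rank, pos in enumerate(P, start=1):
        R[pos] = rank
    # P[i - 1] and P[i] are the 0-based counterparts of P[i] and P[i+1].
    return [i for i in range(1, n) if R[P[i - 1] + 1] > R[P[i] + 1]]

# Suffix array of t = AABCB from the example.
print(rplus_descents([1, 2, 5, 3, 4]))
```

Under this reading the example has R+-descents at i = 2 and i = 4, exactly where the first characters of neighbouring suffixes are forced to differ (A to B, and B to C).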

  6. 1. Classify Strings for Suffix Array

     Equivalences between strings

       t  = A A B C B
       t2 = A A C D C   (order-equivalent to t)
       t3 = A B D E D   (order-distinct from t)

     All three strings share the same suffix array P:

       i   P[i]   t[P[i]]    t2[P[i]]   t3[P[i]]
       1   1      A ABCB     A ACDC     A BDED
       2   2      A BCB      A CDC      B DED
       3   5      B          C          D
       4   3      B CB       C DC       D ED
       5   4      C B        D C        E D
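The distinction can be illustrated with a short Python sketch. It assumes that two strings are order-equivalent when their characters have the same relative order pattern; this is an interpretation of the slide, and the helper name `order_pattern` is hypothetical.

```python
def suffix_array(t):
    """1-based start positions of the suffixes of t in sorted order (naive)."""
    return sorted(range(1, len(t) + 1), key=lambda i: t[i - 1:])

def order_pattern(t):
    """Replace each character by its rank among the distinct characters of t."""
    ranks = {c: r for r, c in enumerate(sorted(set(t)))}
    return [ranks[c] for c in t]

for s in ("AABCB", "AACDC", "ABDED"):
    print(s, suffix_array(s), order_pattern(s))
```

All three strings yield the same suffix array, but only AABCB and AACDC share an order pattern; ABDED distinguishes its first two characters and therefore has a different one.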

  7. 2. Counting Strings for Suffix Array

       t  = A A B C B   (base string)
       t2 = A A C D C
       t3 = A B D E D

     Each string is the base string plus a non-decreasing sequence, read in suffix array order:

       i   P[i]   t[P[i]]   t2[P[i]]    t3[P[i]]
       1   1      A         + 0 = A     + 0 = A
       2   2      A         + 0 = A     + 1 = B
       3   5      B         + 1 = C     + 2 = D
       4   3      B         + 1 = C     + 2 = D
       5   4      C         + 1 = D     + 2 = E
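The construction on this slide can be sketched in Python: build the base string from the R+-descents, then add every non-decreasing offset sequence. The sketch assumes that an R+-descent is a position i with R[P[i]+1] > R[P[i+1]+1] and R[n+1] = 0; the function names are illustrative.

```python
from itertools import combinations_with_replacement

def rplus_descents(P):
    """1-based positions i with R[P[i]+1] > R[P[i+1]+1], assuming R[n+1] = 0."""
    n = len(P)
    R = [0] * (n + 2)
    for rank, pos in enumerate(P, start=1):
        R[pos] = rank
    return [i for i in range(1, n) if R[P[i - 1] + 1] > R[P[i] + 1]]

def strings_for(P, alphabet):
    """All strings over the ordered alphabet whose suffix array is P."""
    n = len(P)
    descents = set(rplus_descents(P))
    # base[r]: character index of the base string at sorted rank r + 1;
    # it increases by one exactly at the R+-descents.
    base = [0] * n
    for r in range(1, n):
        base[r] = base[r - 1] + (1 if r in descents else 0)
    free = len(alphabet) - len(descents)   # number of usable offsets
    result = []
    # combinations_with_replacement yields exactly the non-decreasing tuples.
    for offsets in combinations_with_replacement(range(free), n):
        t = [None] * n
        for r in range(n):
            t[P[r] - 1] = alphabet[base[r] + offsets[r]]
        result.append("".join(t))
    return result

print(strings_for([1, 2, 5, 3, 4], "ABCD"))
```

With the five-letter alphabet ABCDE the result contains the three strings AABCB, AACDC and ABDED from the example.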

  8. 2. Counting Strings for Suffix Array

     Suffix array P of length n with d $R^+$-descents.

     Number of strings over an alphabet of size a for P = number of non-decreasing sequences of length n over a-d elements
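A non-decreasing sequence of length n over m elements is a multiset of size n, and there are $\binom{n+m-1}{n}$ of those. The following Python sketch compares that count, with m = a-d, against brute-force enumeration. The closed form is the standard multiset count, not text recovered from the slide, and the R+-descent definition (R[n+1] = 0) is an assumed reading.

```python
from itertools import product
from math import comb

def suffix_array(t):
    """1-based start positions of the suffixes of t in sorted order (naive)."""
    return sorted(range(1, len(t) + 1), key=lambda i: t[i - 1:])

def rplus_descent_count(P):
    """Number of positions i with R[P[i]+1] > R[P[i+1]+1], assuming R[n+1] = 0."""
    n = len(P)
    R = [0] * (n + 2)
    for rank, pos in enumerate(P, start=1):
        R[pos] = rank
    return sum(1 for i in range(1, n) if R[P[i - 1] + 1] > R[P[i] + 1])

def count_strings_brute_force(P, a):
    """Count strings over an alphabet of size a whose suffix array is P."""
    alphabet = [chr(ord("A") + c) for c in range(a)]
    return sum(1 for chars in product(alphabet, repeat=len(P))
               if suffix_array("".join(chars)) == P)

def count_strings_formula(P, a):
    """Multiset count: non-decreasing sequences of length n over a - d elements."""
    n, d = len(P), rplus_descent_count(P)
    return comb(n + a - d - 1, n) if a > d else 0

P = [1, 2, 5, 3, 4]
for a in range(1, 6):
    print(a, count_strings_brute_force(P, a), count_strings_formula(P, a))
```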

  9. 2. Counting Strings for Suffix Array

     Suffix array P of length n with d $R^+$-descents.

     Number of strings composed of exactly k distinct characters for P is
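The closed form for this count is missing from the transcript, so the following Python sketch counts by brute force instead. It reads "exactly k distinct characters" as: strings over a k-letter alphabet in which every letter occurs. The comparison value $\binom{n-d-1}{k-d-1}$ is a candidate closed form derived here (choose which of the n-d-1 unforced neighbouring ranks get a strict increase); treat it as an assumption, not as the formula from the slide.

```python
from itertools import product
from math import comb

def suffix_array(t):
    """1-based start positions of the suffixes of t in sorted order (naive)."""
    return sorted(range(1, len(t) + 1), key=lambda i: t[i - 1:])

def count_exactly_k(P, k):
    """Count strings over a k-letter alphabet that use all k letters and have suffix array P."""
    alphabet = [chr(ord("A") + c) for c in range(k)]
    return sum(1 for chars in product(alphabet, repeat=len(P))
               if len(set(chars)) == k and suffix_array("".join(chars)) == P)

P, n, d = [1, 2, 5, 3, 4], 5, 2      # d = 2 R+-descents for this example
for k in range(d + 1, n + 1):
    # Brute-force count next to the candidate closed form (an assumption).
    print(k, count_exactly_k(P, k), comb(n - d - 1, k - d - 1))
```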

  10. 2. Counting Strings for Suffix Array

      Number of strings over an alphabet of size 20 for suffix arrays of length n with 10 $R^+$-descents:

  11. 2. Counting Strings for Suffix Array

      Suffix array P of length n with d $R^+$-descents
      • Number of order-distinct strings over an alphabet of size a is
      • Number of order-distinct strings where all k distinct characters must appear is

  12. 3. Counting Suffix Arrays

      Definition: Let P be a permutation of {1,...,n}. Position $i \in \{1,\dots,n-1\}$ is a permutation descent if $P[i] > P[i+1]$.

      Definition: The Eulerian number $\left\langle {n \atop d} \right\rangle$ gives the number of permutations of {1,...,n} with exactly d permutation descents.
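Both definitions can be tried out with a brute-force Python sketch that counts permutation descents and tabulates all permutations of a small n. The function names are illustrative.

```python
from collections import Counter
from itertools import permutations

def permutation_descents(P):
    """Number of positions i with P[i] > P[i+1]."""
    return sum(1 for i in range(len(P) - 1) if P[i] > P[i + 1])

def descent_histogram(n):
    """Entry d is the number of permutations of {1,...,n} with exactly d descents."""
    counts = Counter(permutation_descents(P)
                     for P in permutations(range(1, n + 1)))
    return [counts[d] for d in range(n)]

print(descent_histogram(4))
```

For n = 4 this gives 1, 11, 11, 1, which sums to 4! = 24.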

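  13. 3. Counting Suffix Arrays

      Well-known fact: recursive enumeration of Eulerian numbers
      • $\left\langle {n \atop 0} \right\rangle = 1$,
      • $\left\langle {n \atop d} \right\rangle = 0$ for $n \le d$, and
      • $\left\langle {n \atop d} \right\rangle = (d+1) \left\langle {n-1 \atop d} \right\rangle + (n-d) \left\langle {n-1 \atop d-1} \right\rangle$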
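The recursion translates directly into a memoized Python function. This is a minimal sketch for n ≥ 1; it applies the rule "0 for n ≤ d" before the base case, so it follows the convention of this talk rather than the common convention that the value for n = d = 0 is 1.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def eulerian(n, d):
    """Eulerian number for n >= 1, computed with the recursion above."""
    if d < 0 or n <= d:
        return 0
    if d == 0:
        return 1
    return (d + 1) * eulerian(n - 1, d) + (n - d) * eulerian(n - 1, d - 1)

for n in range(1, 6):
    print([eulerian(n, d) for d in range(n)])
```

Each printed row sums to n!, since every permutation has some number of descents between 0 and n-1.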

  14. 3. Counting Suffix Arrays

      Definition: Let $A(n,d)$ be the number of permutations of length n with d $R^+$-descents.

      Observation:
      • $A(n,0) = 1$
      • $A(n,d) = 0$ for $n \le d$
      • recurrence for the remaining cases: see the next slides

  15. 3. Counting Suffix Arrays

      t = A A B C B (positions 1 to 5)

        i   Pt[i]   t[Pt[i]]   rest of suffix
        1   1       A          ABCB
        2   2       A          BCB
        3   5       B
        4   3       B          CB
        5   4       C          B

      At = A A A B C B (positions 1 to 6)

        i   PAt[i]   At[PAt[i]]   rest of suffix
        1   1        A            AABCB
        2   2        A            ABCB
        3   3        A            BCB
        4   6        B
        5   4        B            CB
        6   5        C            B

      (d+1) possible positions without additional $R^+$-descent

  16. 3. Counting Suffix Arrays

      t = A A B C B (positions 1 to 5)

        i   Pt[i]   t[Pt[i]]   rest of suffix
        1   1       A          ABCB
        2   2       A          BCB
        3   5       B
        4   3       B          CB
        5   4       C          B

      Bt = B A A B C B (positions 1 to 6)

        i   PBt[i]   Bt[PBt[i]]   rest of suffix
        1   2        A            ABCB
        2   3        A            BCB
        3   6        B
        4   1        B            AABCB
        5   4        B            CB
        6   5        C            B

      (d+1) possible positions without additional $R^+$-descent
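Prepending a character to t shifts every existing suffix by one position and inserts the new, longest suffix somewhere in the sorted order. The Python sketch below tries every insertion rank for the example P = (1, 2, 5, 3, 4) and counts R+-descents afterwards. It assumes that an R+-descent is a position i with R[P[i]+1] > R[P[i+1]+1] and R[n+1] = 0; the function names are illustrative.

```python
def rplus_descent_count(P):
    """Number of positions i with R[P[i]+1] > R[P[i+1]+1], assuming R[n+1] = 0."""
    n = len(P)
    R = [0] * (n + 2)
    for rank, pos in enumerate(P, start=1):
        R[pos] = rank
    return sum(1 for i in range(1, n) if R[P[i - 1] + 1] > R[P[i] + 1])

def prepend_suffix(P, pos):
    """Shift all start positions by one and insert the new suffix 1 at 0-based rank pos."""
    shifted = [p + 1 for p in P]
    return shifted[:pos] + [1] + shifted[pos:]

P = [1, 2, 5, 3, 4]                       # suffix array of t = AABCB, d = 2
for pos in range(len(P) + 1):
    Q = prepend_suffix(P, pos)
    print(pos, Q, rplus_descent_count(Q))
```

In this example three of the six insertion ranks keep d = 2 (they correspond to At, Bt and Ct), and the other three produce d = 3, in line with the (d+1) stated on the slides.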

  17. 3. Counting Suffix Arrays

      Together:
      • $A(n,0) = 1$,
      • $A(n,d) = 0$ for $n \le d$, and
      • $A(n,d) = (d+1)\,A(n-1,d) + (n-d)\,A(n-1,d-1)$

      Theorem: The number $A(n,d)$ of permutations of length n with d $R^+$-descents is the Eulerian number $\left\langle {n \atop d} \right\rangle$.
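The theorem can be checked for small n by counting R+-descents over all permutations and comparing with the recurrence. This Python sketch assumes the definition of an R+-descent as a position i with R[P[i]+1] > R[P[i+1]+1] and R[n+1] = 0; the function names are illustrative.

```python
from collections import Counter
from functools import lru_cache
from itertools import permutations

def rplus_descent_count(P):
    """Number of positions i with R[P[i]+1] > R[P[i+1]+1], assuming R[n+1] = 0."""
    n = len(P)
    R = [0] * (n + 2)
    for rank, pos in enumerate(P, start=1):
        R[pos] = rank
    return sum(1 for i in range(1, n) if R[P[i - 1] + 1] > R[P[i] + 1])

@lru_cache(maxsize=None)
def A(n, d):
    """The recurrence stated on the slide, for n >= 1."""
    if d < 0 or n <= d:
        return 0
    if d == 0:
        return 1
    return (d + 1) * A(n - 1, d) + (n - d) * A(n - 1, d - 1)

def count_by_rplus_descents(n):
    """Entry d is the number of permutations of {1,...,n} with d R+-descents."""
    counts = Counter(rplus_descent_count(list(P))
                     for P in permutations(range(1, n + 1)))
    return [counts[d] for d in range(n)]

for n in range(1, 7):
    print(n, count_by_rplus_descents(n), [A(n, d) for d in range(n)])
```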

  18. 3. Counting Suffix Arrays

      • The number of distinct suffix arrays of length n for strings over an alphabet of size a:
      • Lower bound for the compressibility of suffix arrays in the Kolmogorov sense:
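The formulas for this slide are missing from the transcript. As a hedged sketch, the Python code below counts the distinct suffix arrays by brute force and compares the result with the sum of A(n,d) over d = 0,...,a-1. That sum is an assumption: it follows from the earlier slides, because a suffix array with d R+-descents needs at least d+1 distinct characters. The base-2 logarithm of the count is shown as one plausible reading of the lower bound (bits needed to tell that many arrays apart), also an assumption.

```python
from functools import lru_cache
from itertools import product
from math import log2

def suffix_array(t):
    """1-based start positions of the suffixes of t in sorted order (works on tuples too)."""
    return sorted(range(1, len(t) + 1), key=lambda i: t[i - 1:])

@lru_cache(maxsize=None)
def A(n, d):
    """Recurrence A(n,d) = (d+1) A(n-1,d) + (n-d) A(n-1,d-1), for n >= 1."""
    if d < 0 or n <= d:
        return 0
    if d == 0:
        return 1
    return (d + 1) * A(n - 1, d) + (n - d) * A(n - 1, d - 1)

def distinct_suffix_arrays(n, a):
    """Brute force: distinct suffix arrays over all strings of length n, alphabet size a."""
    return len({tuple(suffix_array(chars))
                for chars in product(range(a), repeat=n)})

def distinct_suffix_arrays_sum(n, a):
    """Assumed closed form: permutations with at most a - 1 R+-descents."""
    return sum(A(n, d) for d in range(min(a, n)))

n, a = 5, 4
count = distinct_suffix_arrays(n, a)
print(count, distinct_suffix_arrays_sum(n, a), log2(count))
```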

  19. 3. Counting Suffix Arrays • Number of distinct suffix arrays of length n for strings over alphabet of size 20:

  20. 3. Counting Suffix Arrays • Number of distinct suffix arrays of length n for strings over alphabet of size 4:

  21. 4. Summation Identities

      • Worpitzky's identity, by summing up the number of strings of length n for each suffix array:
      • Summation rule for Eulerian numbers to generate the Stirling numbers of the second kind:
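The two identities themselves are missing from the transcript. The Python sketch below checks numerically the forms they would take under the counting argument of the earlier slides; both forms are assumptions, not text recovered from the slide. Summing the strings per suffix array over all permutations should give all a^n strings (Worpitzky's identity), and summing the strings that use exactly k characters should give k! times the Stirling number of the second kind S(n,k).

```python
from functools import lru_cache
from math import comb, factorial

@lru_cache(maxsize=None)
def eulerian(n, d):
    """Eulerian number for n >= 1 via the standard recursion."""
    if d < 0 or n <= d:
        return 0
    if d == 0:
        return 1
    return (d + 1) * eulerian(n - 1, d) + (n - d) * eulerian(n - 1, d - 1)

@lru_cache(maxsize=None)
def stirling2(n, k):
    """Stirling number of the second kind via S(n,k) = k S(n-1,k) + S(n-1,k-1)."""
    if n == k:
        return 1
    if k == 0 or k > n:
        return 0
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)

def worpitzky_sum(n, a):
    """Assumed form: sum over d of eulerian(n,d) * C(n + a - d - 1, n)."""
    return sum(eulerian(n, d) * comb(n + a - d - 1, n) for d in range(min(a, n)))

def surjection_sum(n, k):
    """Assumed form: sum over d of eulerian(n,d) * C(n - d - 1, k - d - 1), for 1 <= k <= n."""
    return sum(eulerian(n, d) * comb(n - d - 1, k - d - 1) for d in range(k))

print(worpitzky_sum(5, 4), 4 ** 5)
print(surjection_sum(5, 3), factorial(3) * stirling2(5, 3))
```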

  22. Summary
      • Constructive proofs to count strings sharing the same suffix array
      • Constructive proof to count distinct suffix arrays, yielding a lower bound for suffix array compression
      • Constructive proofs for Worpitzky's identity and the summation rule of Eulerian numbers to count Stirling numbers of the second kind

  23. Outlook
      • Efficient enumeration algorithm for suffix arrays
      • Compressed suffix arrays for fast querying in bioinformatics applications
      • Average case analysis under non-uniform model

  24. Thank you for your attention!
