
Suffix Trees


lovie

Presentation Transcript


  1. Suffix Trees • Suffix trees • Linearized suffix trees • Virtual suffix trees • Suffix arrays • Enhanced suffix arrays • Suffix cactus, suffix vectors, …

  2. Suffix Trees • String … any sequence of characters. • Substring of string S … string composed of characters i through j, i <= j, of S. • S = cater => ate is a substring. • car is not a substring. • The empty string is a substring of S.

  3. Subsequence • Subsequence of string S … string composed of characters i1 < i2 < … < ik of S. • S = cater => ate is a subsequence. • car is a subsequence. • The empty string is a subsequence.
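The difference between the two definitions can be checked with a short sketch. The function names `is_substring` and `is_subsequence` are illustrative and do not come from the slides.

```python
def is_substring(p, s):
    """True if p occurs in s as one contiguous block of characters."""
    return p in s


def is_subsequence(p, s):
    """True if the characters of p appear in s in order, gaps allowed."""
    remaining = iter(s)
    # "ch in remaining" advances the iterator past the match, so each
    # character of p must be found after the previous one.
    return all(ch in remaining for ch in p)


print(is_substring("ate", "cater"), is_substring("car", "cater"))
print(is_subsequence("ate", "cater"), is_subsequence("car", "cater"))
```

Every substring is also a subsequence, but not the other way around, as the car example shows.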

  4. String/Pattern Matching • You are given a source string S. • Answer queries of the form: is the string pi (the i-th query pattern) a substring of S? • Knuth-Morris-Pratt (KMP) string matching. • O(|S| + |pi|) time per query. • O(n|S| + Σi |pi|) time for n queries. • Suffix tree solution. • O(|S| + Σi |pi|) time for n queries.

  5. String/Pattern Matching • KMP preprocesses the query string pi, whereas the suffix tree method preprocesses the source string S. • An application of string matching. • Genome project. • Databank of strings (gene sequences). • Character set is ATGC. • Determine if a “new” sequence is a substring of a databank sequence.

  6. Definition Of Suffix Tree • Compressed trie with edge information. • Keys are the nonempty suffixes of a given string S. • Nonempty suffixes of S = sleeper are: • sleeper • leeper • eeper • eper • per, er, and r.

  7. String Matching & Suffixes • pi is a substring of S iff pi is a prefix of some suffix of S. • Nonempty suffixes of S = sleeper are: • sleeper • leeper • eeper • eper • per, er, and r. • Which of these are substrings of S? • leep, eepe, pe, leap, peel
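The prefix-of-a-suffix characterization can be tried directly on the example. This is a brute-force sketch, not the suffix tree method itself; `suffixes` and `is_substring_via_suffixes` are hypothetical helper names.

```python
def suffixes(s):
    """All nonempty suffixes of s, longest first."""
    return [s[i:] for i in range(len(s))]


def is_substring_via_suffixes(p, s):
    """p is a substring of s iff p is a prefix of some suffix of s."""
    return any(suffix.startswith(p) for suffix in suffixes(s))


for query in ["leep", "eepe", "pe", "leap", "peel"]:
    print(query, is_substring_via_suffixes(query, "sleeper"))
```

A suffix tree stores these suffixes in a compressed trie, so the same prefix test becomes a single walk from the root instead of a scan over all suffixes.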

  8. Last Character Of S Repeats • When the last character of S appears more than once in S, S has at least one suffix that is a proper prefix of another suffix. • S = creeper • creeper, reeper, eeper, eper, per, er, r • When the last character of S appears more than once in S, use an end of string character # to overcome this problem. • S = creeper# • creeper#, reeper#, eeper#, eper#, per#, er#, r#, #
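The problem and its fix can be observed with a small check. The helper name `suffix_prefix_pairs` is an assumption made for this sketch.

```python
def suffix_prefix_pairs(s):
    """Pairs (a, b) of distinct suffixes of s where a is a proper prefix of b."""
    sufs = [s[i:] for i in range(len(s))]
    return [(a, b) for a in sufs for b in sufs if a != b and b.startswith(a)]


print(suffix_prefix_pairs("creeper"))   # the suffix r is a prefix of reeper
print(suffix_prefix_pairs("creeper#"))  # no such pair once # is appended
```

Because # occurs only at the end, every suffix ends in a character that appears nowhere else, so no suffix can be a prefix of a longer one and each suffix gets its own element node.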

  9. Suffix Tree For S = abbbabbbb# • [Figure: suffix tree for S = abbbabbbb#; nodes are labeled 1 through 5 and edges carry labels such as abbb, b, #, b#, and abbbb#.]

  10. Suffix Tree For S = abbbabbbb# • [Figure: the same suffix tree, with element nodes labeled by suffix start positions; S = abbbabbbb# is shown with character positions 1 through 10.]

  11. Suffix Tree For S = abbbabbbb# • [Figure: the same suffix tree with further numeric fields on its nodes; S = abbbabbbb# is shown with character positions 1 through 10.]

  12. Suffix Tree Construction • See Web write up for algorithm. • Time complexity • |S| = n, alphabet size = r. • O(nr) using array nodes. • This is O(n) for r a constant (or r <= c). • O(n) expected time using a hash table. • O(n) time algorithm for large r in reference cited in Web write up.

  13. Suffix Array • Array that contains the start position of suffixes in lexicographic order. • abbbabbbb# • Assume # < a < b • # < abbbabbbb# < abbbb# < b# < babbbb# < bb# < bbabbbb# < bbb# < bbbabbbb# < bbbb# • SA = [10, 1, 5, 9, 4, 8, 3, 7, 2, 6] • LCP = length of longest common prefix between adjacent entries of SA. • LCP = [0, 4, 0, 1, 1, 2, 2, 3, 3, -]
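The SA and LCP values above can be reproduced with a naive sketch. It sorts whole suffixes, which is far from the linear-time constructions, and the names `suffix_array` and `lcp_array` are illustrative. Python compares characters by code point, which happens to give # < a < b.

```python
def suffix_array(s):
    """1-based start positions of the suffixes of s in lexicographic order."""
    return sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])


def lcp_array(s, sa):
    """Longest-common-prefix lengths of suffixes adjacent in sa."""
    def lcp(a, b):
        n = 0
        while n < len(a) and n < len(b) and a[n] == b[n]:
            n += 1
        return n
    return [lcp(s[sa[k] - 1:], s[sa[k + 1] - 1:]) for k in range(len(sa) - 1)]


s = "abbbabbbb#"
sa = suffix_array(s)
print(sa)                # [10, 1, 5, 9, 4, 8, 3, 7, 2, 6]
print(lcp_array(s, sa))  # [0, 4, 0, 1, 1, 2, 2, 3, 3]
```

The list returned by `lcp_array` has one entry per adjacent pair, so it is one shorter than SA; the slide marks the missing last entry with a dash.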

  14. Suffix Array • Less space than suffix tree • Linear time construction • Can be used to solve several of the problems solved by a suffix tree with same asymptotic complexity. • Substring matching: binary search for p using SA. • O(|p| log |S|).
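A minimal sketch of the binary search follows. It assumes a naively built suffix array with 1-based positions; `build_sa` and `contains` are hypothetical names. Each step compares p with the first |p| characters of one suffix, which gives the O(|p| log |S|) bound.

```python
def build_sa(s):
    """Naive suffix array: 1-based start positions in lexicographic order."""
    return sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])


def contains(s, sa, p):
    """Binary search for the first suffix whose |p|-character prefix is >= p."""
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        start = sa[mid] - 1
        if s[start:start + len(p)] < p:
            lo = mid + 1
        else:
            hi = mid
    # p occurs in s exactly when that suffix begins with p.
    return lo < len(sa) and s.startswith(p, sa[lo] - 1)


s = "abbbabbbb#"
sa = build_sa(s)
for p in ["babb", "abbba", "baba"]:
    print(p, contains(s, sa, p))
```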

  15. O(|pi|) Time Substring Matching • [Figure: suffix tree for S = abbbabbbb#, with element nodes labeled by suffix start positions.] • Example queries: babb, abbba, baba

  16. Find All Occurrences Of pi • Search suffix tree for pi. • Suppose the search for pi is successful. • When search terminates at an element node, pi appears exactly once in the source string S.

  17. Search Terminates At Element Node • [Figure: suffix tree for S = abbbabbbb#, with element nodes labeled by suffix start positions.] • Example query: abbbb#

  18. Search Terminates At Branch Node • When the search for pi terminates at a branch node, each element node in the subtree rooted at this branch node gives a different occurrence of pi.
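The idea can be sketched with an uncompressed suffix trie built in quadratic time, which is simpler than the compressed suffix tree of the slides. Nodes are plain dictionaries, the key `None` marks an element node holding a 1-based start position, and the function names are illustrative.

```python
def build_suffix_trie(s):
    """Naive suffix trie; s is assumed to end with a unique terminator."""
    root = {}
    for start in range(len(s)):
        node = root
        for ch in s[start:]:
            node = node.setdefault(ch, {})
        node[None] = start + 1  # element node: where this suffix begins
    return root


def occurrences(root, p):
    """Start positions of p: all element nodes below the node reached by p."""
    node = root
    for ch in p:
        if ch not in node:
            return []
        node = node[ch]
    found, stack = [], [node]
    while stack:
        current = stack.pop()
        for key, child in current.items():
            if key is None:
                found.append(child)
            else:
                stack.append(child)
    return sorted(found)


trie = build_suffix_trie("abbbabbbb#")
print(occurrences(trie, "ab"))      # [1, 5]
print(occurrences(trie, "abbbb#"))  # [5]
```

In this uncompressed trie the traversal below the matched node is not linear in the number of occurrences, since it also visits the single-child nodes that a compressed suffix tree would merge into edges.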

  19. Search Terminates At Branch Node • [Figure: suffix tree for S = abbbabbbb#, with element nodes labeled by suffix start positions.] • Example query: ab

  20. Find All Occurrences Of pi • To find all occurrences of pi in time linear in the length of pi and linear in the number of occurrences of pi, augment suffix tree: • Link all element nodes into a chain in inorder. • Each branch node keeps a pointer to the left most and right most element node in its subtree.
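When children are ordered lexicographically, the inorder chain of element nodes lists the suffixes in suffix array order, so the leftmost and rightmost pointers of a branch node delimit a contiguous range of that array. The sketch below illustrates this with a naively built suffix array and two binary searches in place of the tree walk, so locating the range costs O(|p| log |S|) rather than O(|p|); reporting is then linear in the number of occurrences. All names are assumptions for this example.

```python
def build_sa(s):
    """Naive suffix array: 1-based start positions in lexicographic order."""
    return sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])


def occurrence_range(s, sa, p):
    """Half-open range [left, right) of sa whose suffixes begin with p."""
    def prefix(k):
        start = sa[k] - 1
        return s[start:start + len(p)]

    lo, hi = 0, len(sa)
    while lo < hi:                      # first suffix with prefix >= p
        mid = (lo + hi) // 2
        if prefix(mid) < p:
            lo = mid + 1
        else:
            hi = mid
    left, hi = lo, len(sa)
    while lo < hi:                      # first suffix with prefix > p
        mid = (lo + hi) // 2
        if prefix(mid) <= p:
            lo = mid + 1
        else:
            hi = mid
    return left, lo


s = "abbbabbbb#"
sa = build_sa(s)
left, right = occurrence_range(s, sa, "b")
print(sa[left:right])  # [9, 4, 8, 3, 7, 2, 6]
```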

  21. Augmented Suffix Tree • [Figure: suffix tree for S = abbbabbbb#, with element nodes labeled by suffix start positions and linked into a chain.] • Example query: b

  22. Longest Repeating Substring • Find longest substring of S that occurs at least m times in S, m > 1. • Label branch nodes with number of element nodes in subtree. • Find branch node with label >= m and max char# field.
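A sketch of the labeling idea, using an uncompressed suffix trie instead of a compressed suffix tree: each node counts the suffixes that pass through it, which equals the number of element nodes below it, and node depth plays the role of the char# field. It takes "repeating" to mean at least m occurrences, matching the label >= m test. The function name and node layout are assumptions, and the construction is quadratic.

```python
def longest_repeating_substring(s, m):
    """Longest substring of s with at least m occurrences (naive sketch)."""
    root = {"count": 0, "kids": {}}
    best = ""
    for start in range(len(s)):
        node = root
        for depth, ch in enumerate(s[start:], 1):
            node = node["kids"].setdefault(ch, {"count": 0, "kids": {}})
            node["count"] += 1  # one more suffix passes through this node
            if node["count"] >= m and depth > len(best):
                best = s[start:start + depth]
    return best


print(longest_repeating_substring("abbbabbbb#", 2))  # abbb
print(longest_repeating_substring("abbbabbbb#", 5))  # bb
```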

  23. Longest Repeating Substring • [Figure: suffix tree for S = abbbabbbb# with branch nodes labeled by the number of element nodes in their subtrees (10, 5, 7, 2, 3); the cases m = 2 and m = 5 are indicated.]

  24. Longest Common Substring • Given two strings S and T. • Find the longest common substring. • S = carport, T = airports • Longest common substring = rport • Longest common subsequence = arport • Longest common subsequence may be found in O(|S|*|T|) time using dynamic programming. • Longest common substring may be found in O(|S|+|T|) time using a suffix tree.
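The O(|S|*|T|) dynamic program for the longest common subsequence can be sketched as follows. This is the textbook table-filling formulation; the function name is illustrative.

```python
def longest_common_subsequence(s, t):
    """One longest common subsequence of s and t, by dynamic programming."""
    # dp[i][j] = length of the longest common subsequence of s[i:] and t[j:]
    dp = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(len(s) - 1, -1, -1):
        for j in range(len(t) - 1, -1, -1):
            if s[i] == t[j]:
                dp[i][j] = 1 + dp[i + 1][j + 1]
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    # Walk the table from the top-left corner to recover one answer.
    out, i, j = [], 0, 0
    while i < len(s) and j < len(t):
        if s[i] == t[j]:
            out.append(s[i])
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return "".join(out)


print(longest_common_subsequence("carport", "airports"))  # arport
```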

  25. Longest Common Substring • Let $ be a new symbol. • Construct the suffix tree for the string U = S$T#. • U = carport$airports# • No repeating substring includes $. • Find longest repeating substring that is both to left and right of $. • Find branch node that has max char# and has at least one element node in its subtree that represents a suffix that begins in S as well as at least one that begins in T.
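The construction U = S$T# can be tried with a suffix array standing in for the suffix tree. This is a swapped-in technique, not the slides' method: among suffixes adjacent in sorted order, it keeps the longest common prefix of a pair where one suffix begins in S and the other in T. The sketch sorts whole suffixes, so it does not achieve the O(|S|+|T|) bound, and it assumes neither string contains $ or #.

```python
def longest_common_substring(s, t):
    """Longest common substring of s and t via the suffixes of s + "$" + t + "#"."""
    u = s + "$" + t + "#"
    sa = sorted(range(len(u)), key=lambda i: u[i:])  # 0-based start positions
    best = ""
    for a, b in zip(sa, sa[1:]):
        if (a < len(s)) == (b < len(s)):
            continue  # both suffixes begin on the same side of $
        n = 0
        while a + n < len(u) and b + n < len(u) and u[a + n] == u[b + n]:
            n += 1
        # $ occurs once in u, so a shared prefix can never contain it.
        if n > len(best):
            best = u[a:a + n]
    return best


print(longest_common_substring("carport", "airports"))  # rport
```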
