
Suffix Trees


tibor


  1. Suffix Trees

  2. Purpose
     • Given a (very long) text R, preprocess it so that once a query text P is given, we can efficiently decide whether P appears in R.
     • (Later: also where P appears in R.)
     • Example: R=“HelloWorldWhatANiceDay”
       • IsIn(“World”) = YES
       • IsIn(“Word”) = NO
       • IsIn(“l”) = YES (note: appears more than once)

  3. Definition: A suffix
     • For a word R, a suffix is what is left of R after deleting its first few characters (possibly none).
     • All the suffixes of R=“Hello”:
       • Hello
       • ello
       • llo
       • lo
       • o
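The definition can be illustrated with a minimal C sketch. The helper names `suffix_at` and `suffix_count` are hypothetical and do not come from the slides. The point is that a suffix of a C string needs no copying: it is the same buffer viewed from a later starting index.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* The suffix of r that starts at index k. In C no copy is needed:
   a suffix is the same buffer viewed from a later position. */
const char *suffix_at(const char *r, size_t k) {
    assert(k <= strlen(r));
    return r + k;
}

/* Number of non-empty suffixes of r: one per starting index. */
size_t suffix_count(const char *r) {
    return strlen(r);
}
```

Calling `suffix_at("Hello", k)` for k = 0..4 yields the five suffixes listed on the slide.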

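  4. Alg for answering IsIn
     • Preprocessing:
       • Create an empty trie T.
       • Given R=“HelloWorldWhatANiceDay”, insert into T all suffixes of R.
     • Answering IsIn(P):
       • Just check if P is in T.
       • That is, return find(P).
       • (Here, find is as studied in the lecture on tries.)
       • This works because P appears in R exactly when P is a prefix of some suffix of R, so it is enough that P can be followed from the root of T.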
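The preprocessing and query steps above can be sketched in C as follows. This is an illustration under stated assumptions, not the lecture's own trie code: the child array `ar` follows the name used in the later pseudocode, while `ALPHABET`, `new_node`, `insert`, `build_suffix_trie` and `is_in` are hypothetical names. The sketch never frees the trie and uses one child slot per byte value, which is wasteful but keeps the code short.

```c
#include <assert.h>
#include <stdlib.h>

#define ALPHABET 256            /* assumption: one child slot per byte value */

typedef struct NODE {
    struct NODE *ar[ALPHABET];  /* ar[c] is the child reached by character c */
} NODE;

static NODE *new_node(void) {
    NODE *n = calloc(1, sizeof *n);   /* zeroed: every child starts as NULL */
    assert(n != NULL);
    return n;
}

/* Insert the string s, creating nodes along its path as needed. */
static void insert(NODE *root, const char *s) {
    NODE *p = root;
    for (; *s != '\0'; s++) {
        unsigned char c = (unsigned char)*s;
        if (p->ar[c] == NULL)
            p->ar[c] = new_node();
        p = p->ar[c];
    }
}

/* Preprocessing: a trie holding every suffix of r. */
NODE *build_suffix_trie(const char *r) {
    NODE *root = new_node();
    for (size_t k = 0; r[k] != '\0'; k++)
        insert(root, r + k);
    return root;
}

/* IsIn(P): 1 if p can be followed from the root, 0 otherwise. */
int is_in(const NODE *root, const char *p) {
    const NODE *v = root;
    for (; *p != '\0'; p++) {
        v = v->ar[(unsigned char)*p];
        if (v == NULL)
            return 0;
    }
    return 1;
}
```

Note that the trie has one node per character of every suffix, so it can hold on the order of |R|² nodes; a query, in contrast, takes time proportional to the length of P.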

  5. Example
     • R=“hello”. Suffixes: “hello”, “ello”, “llo”, “lo”, “o”.
     • Example query: P=“ll”.
     • [Figure: the trie containing the five suffixes of “hello”.]

  6. Let's get greedy
     • Given a (very long) text R, preprocess it so that once a query text P is given, we can efficiently find the location of P in R (if it appears at all).
     • More specifically, report the index at which P starts in R.
     • (If there is more than one answer, report the last one.)
     • Example: R=“HelloWorldWhatANiceDay”
       • Where(“World”) = 5, since “World” appears starting at index 5 in R (indices start at 0).
       • Where(“Word”) = NoWhere
       • Where(“l”) = 8 (it also appears in other places)

  7. Alg for answering Where
     • Modify the trie so that each node also contains a field b_inx.
     • When inserting into the trie a word s whose first character is at index k of R, modify the nodes along the insertion path to contain the value k. Earlier values are overwritten, so inserting the suffixes in order of increasing k leaves the last occurrence in each node.
     • Preprocessing:
       • Create an empty trie T.
       • Given R=“HelloWorldWhatANiceDay”, insert into T all suffixes of R.
     • Answering Where(P):
       • Just check if P is in T.
       • That is, run find(P) and, if P is found, return the value of b_inx at the node where the search terminates.
       • (Here, find is as studied in the lecture on tries.)
     • The resulting data structure is called an Uncompressed Suffix Tree.
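A minimal C sketch of the Where algorithm, under the same assumptions as a plain byte-indexed trie: `b_inx` and `ar` are the field names from the slides, whereas `ALPHABET`, `NOWHERE`, `new_node`, `build_suffix_trie` and `where` are hypothetical names chosen for illustration. Suffixes are inserted in order of increasing start index, so each node ends up holding the start of the last suffix that passes through it.

```c
#include <assert.h>
#include <stdlib.h>

#define ALPHABET 256   /* assumption: one child slot per byte value */
#define NOWHERE  (-1)  /* assumption: the value reported for "NoWhere" */

typedef struct NODE {
    struct NODE *ar[ALPHABET];
    int b_inx;  /* start in R of the last suffix inserted through this node */
} NODE;

static NODE *new_node(void) {
    NODE *n = calloc(1, sizeof *n);   /* zeroed: every child starts as NULL */
    assert(n != NULL);
    n->b_inx = NOWHERE;
    return n;
}

/* Insert the suffix s = r + k and stamp k on every node of its path. */
static void insert(NODE *root, const char *s, int k) {
    NODE *p = root;
    for (; *s != '\0'; s++) {
        unsigned char c = (unsigned char)*s;
        if (p->ar[c] == NULL)
            p->ar[c] = new_node();
        p = p->ar[c];
        p->b_inx = k;   /* overwrite: suffixes inserted later win */
    }
}

/* Preprocessing: insert all suffixes in order of increasing start index. */
NODE *build_suffix_trie(const char *r) {
    NODE *root = new_node();
    for (int k = 0; r[k] != '\0'; k++)
        insert(root, r + k, k);
    return root;
}

/* Where(P): start index of the last occurrence of p, or NOWHERE.
   An empty p ends at the root, whose b_inx is NOWHERE in this sketch. */
int where(const NODE *root, const char *p) {
    const NODE *v = root;
    for (; *p != '\0'; p++) {
        v = v->ar[(unsigned char)*p];
        if (v == NULL)
            return NOWHERE;
    }
    return v->b_inx;
}
```

With R=“HelloWorldWhatANiceDay”, the character “l” occurs at indices 2, 3 and 8, and the overwrite rule leaves 8 in the node reached by “l”, matching the example on the previous slide.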

  8. Example
     • R=“hello”. Suffixes: “hello”, “ello”, “llo”, “lo”, “o”.
     • Example query: P=“ll”.
     • [Figure: the trie for “hello” with each node labelled by its b_inx value.]

  9. So much memory?
     • The problem with this data structure results from long paths: sequences of nodes in which every node but the last has a single child, and all nodes have the same value of b_inx.
     • [Figure: the b_inx-labelled trie for “hello”, illustrating such paths.]

  10. More examples of paths
      • [Figure: further examples of such paths.]

  11. Solution
      • Recall that all strings in the tree are suffixes of the same text R.
      • Add two new fields to each node, called c_idx and lng, such that if lng>0 then, when computing the string of a node, we need to concatenate lng chars from R starting at position c_idx.
      • [Figure: for R=“hello” (indices 0 to 4), the path h, e, l, l, o, whose nodes all have b_inx=0, is replaced by a single node with c_idx=1, lng=4.]

  12. Compressing the tree
      • Assume we are visiting a node v of the tree whose distance (number of edges) from the root in the uncompressed trie is k.
      • Also assume that v is the first node on a path.
      • Then c_idx = b_inx + k.
      • So the function compress_tree should “know” the distance from the root (in the uncompressed tree) of the visited node.

  13. Need a function compress_tree that accepts a node v of the tree and the depth of v in the uncompressed tree.
      • Also need the function check_path( NODE *p ), which returns the length (in number of edges) of the path starting at *p. So, for example, if *p has two children, it returns 0.

  14. Compressing the tree (cont.)
      compress_tree( NODE *p, int depth ){
        for each non-NULL cell ar[i] of *p
          if ( (h = check_path( p->ar[i] )) > 0 ){
            // Let q be a pointer to the node at the end of the path, let h be
            // the length of the path, and let d = depth+1+h be the depth of q
            // (in the uncompressed tree).
            // q, h and d should all be obtained from check_path (think how).
            Set p->ar[i] = q
            Free the unused nodes
            q->c_idx = q->b_inx + depth + 1
            q->lng = h
            compress_tree( q, d )
          } else
            compress_tree( p->ar[i], depth+1 )  // no path here: just descend
      }
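The compression step can be sketched in C as follows. This is one possible reading of the slides rather than the lecturer's code. `check_path` is given an extra out-parameter `end` (an assumption) so that it can hand back the last node of the path. `only_child`, `where`, `ALPHABET` and `NOWHERE` are hypothetical names. The case h == 0 is handled by the same code path, since then q is the child itself and nothing is freed.

```c
#include <assert.h>
#include <stdlib.h>

#define ALPHABET 256   /* assumption: one child slot per byte value */
#define NOWHERE  (-1)  /* assumption: the value reported for "NoWhere" */

typedef struct NODE {
    struct NODE *ar[ALPHABET];
    int b_inx;  /* start in R of the last suffix inserted through this node */
    int c_idx;  /* index in R of the first compressed character */
    int lng;    /* number of compressed characters (0 if none) */
} NODE;

static NODE *new_node(void) {
    NODE *n = calloc(1, sizeof *n);
    assert(n != NULL);
    n->b_inx = NOWHERE;
    return n;
}

/* Uncompressed suffix trie with b_inx stamped along each insertion path. */
NODE *build_suffix_trie(const char *r) {
    NODE *root = new_node();
    for (int k = 0; r[k] != '\0'; k++) {
        NODE *p = root;
        for (const char *s = r + k; *s != '\0'; s++) {
            unsigned char c = (unsigned char)*s;
            if (p->ar[c] == NULL)
                p->ar[c] = new_node();
            p = p->ar[c];
            p->b_inx = k;
        }
    }
    return root;
}

/* The single child of v, or NULL if v has zero or several children. */
static NODE *only_child(const NODE *v) {
    NODE *found = NULL;
    for (int i = 0; i < ALPHABET; i++) {
        if (v->ar[i] == NULL) continue;
        if (found != NULL) return NULL;
        found = v->ar[i];
    }
    return found;
}

/* Length (in edges) of the single-child path starting at p;
   *end receives the last node of that path. */
static int check_path(NODE *p, NODE **end) {
    int h = 0;
    for (NODE *next; (next = only_child(p)) != NULL; p = next)
        h++;
    *end = p;
    return h;
}

/* depth = distance of p from the root in the uncompressed trie. */
void compress_tree(NODE *p, int depth) {
    for (int i = 0; i < ALPHABET; i++) {
        NODE *child = p->ar[i], *q;
        if (child == NULL) continue;
        int h = check_path(child, &q);
        while (child != q) {               /* free the h skipped nodes */
            NODE *next = only_child(child);
            free(child);
            child = next;
        }
        p->ar[i] = q;
        q->c_idx = q->b_inx + depth + 1;
        q->lng = h;                        /* h == 0: nothing was compressed */
        compress_tree(q, depth + 1 + h);
    }
}

/* Where(P) on the compressed tree; r is the original text. */
int where(const NODE *root, const char *r, const char *p) {
    const NODE *v = root;
    while (*p != '\0') {
        v = v->ar[(unsigned char)*p++];    /* character stored on the edge */
        if (v == NULL) return NOWHERE;
        for (int j = 0; j < v->lng && *p != '\0'; j++, p++)
            if (r[v->c_idx + j] != *p) return NOWHERE;
    }
    return v->b_inx;
}
```

For R=“hello”, the child of the root reached by “h” ends up with c_idx=1 and lng=4, as in the figure on slide 11. One caveat of this sketch: a query that ends inside a compressed edge reports the b_inx of the node at the end of that edge. That is always a valid occurrence, but when one suffix of R is a prefix of another it is not guaranteed to be the last one.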

  15. How large is the tree now?
      • Lemma: If T is a tree in which no node has exactly one child, then the number of nodes is O(number-of-leaves).
      • In our scenario, number-of-leaves ≤ |R|, since every leaf is the end of a distinct suffix of R.
      • So the size of the compressed trie is O(|R|).
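The lemma follows from a short counting argument, sketched here on the assumption that “degree 1” means “exactly one child”. The symbols $L$ and $I$ are introduced only for this sketch.

```latex
Let $L$ be the number of leaves and $I$ the number of internal nodes of $T$.
A tree with $I + L$ nodes has $I + L - 1$ edges, and every edge leads from an
internal node to one of its children. Each internal node has at least $2$
children, so
\[
  I + L - 1 \;\ge\; 2I
  \quad\Longrightarrow\quad
  I \;\le\; L - 1 .
\]
Hence the total number of nodes is $I + L \le 2L - 1 = O(L)$.
```

If the root alone is allowed to have a single child, the same argument gives at most $2L$ nodes, so the bound stays $O(L)$.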
