Data Structures for Full-Text Search and the Efficiency of Their Construction

Kunihiko Sadakane, Department of Information Science, Graduate School of Science, The University of Tokyo. http://naomi.is.s.u-tokyo.ac.jp/~sada/papers/fulltext.ppt

Contents • Comparison of data structures for full-text search • search time • disk space • update time • search accuracy

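Background • Spread of digitized documents • WWW, e-mail • newspapers, dictionaries, books • genome databases • Want to search large amounts of text quickly • without missing any hits • returning only what is needed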

Full-text search algorithms • sequential search • signature file [Mooers 49] • records which keywords each document contains • inverted file [Bleier 67] • records where each keyword occurs • digital tree (trie) • supports arbitrary keywords

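Inverted file data structures: for each keyword, store the documents and positions where it occurs • sorted array • list of the occurrence positions of each keyword • prefix B-tree • easy to update • trie • represents prefixes compactly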
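As a rough illustration of the inverted file idea, the sketch below maps each keyword to the (document, word position) pairs where it occurs. The function name `build_inverted_file`, the whitespace tokenization, and the use of a sorted Python dict in place of a sorted array are assumptions made for this example, not part of the slides.

```python
from collections import defaultdict


def build_inverted_file(docs):
    """Map each keyword to a list of (document id, word position) pairs."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        # Assumption: keywords are whitespace-separated tokens.
        for pos, word in enumerate(text.split()):
            index[word].append((doc_id, pos))
    # Sorting the keywords mimics the "sorted array" variant.
    return dict(sorted(index.items()))


docs = ["to be or not to be", "not to worry"]
index = build_inverted_file(docs)
print(index["to"])
```

Only keywords that were tokenized at build time can be looked up, which is the restriction that full-text indexes remove.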

Word indexes vs. Full-text indexes • word indexes: fixed keywords only; small; fast to construct; data structures: sorted array, prefix B-tree, trie • full-text indexes: arbitrary keywords; large; slow to construct; data structures: suffix array, String B-tree, suffix tree

Full-text index data structures • suffix tree [Weiner 73] • suffix array [Manber, Myers 93] • String B-tree [Ferragina, Grossi 95]

Suffix tree [figure: suffix tree of abab$] • compacted trie representing all suffixes of a string • can be constructed in linear time in memory • large • unbalanced
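To make the "trie of all suffixes" idea concrete, here is a minimal sketch that inserts every suffix into a nested-dict trie and answers substring queries. It is deliberately an uncompacted trie built in quadratic time, not the compacted, linear-time suffix tree the slide refers to; `build_suffix_trie` and `contains` are names chosen for this example.

```python
def build_suffix_trie(text):
    """Insert every suffix of text into a nested-dict trie (O(N^2) nodes)."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
    return root


def contains(trie, pattern):
    """A pattern occurs in the text iff it labels a path from the root."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


trie = build_suffix_trie("abab$")
print(contains(trie, "bab"), contains(trie, "aa"))
```

The search cost depends only on the pattern length, not on the text length, which is the property that makes suffix trees attractive despite their size.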

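Suffix array [figure: suffix array of abab$, in lexicographic order: ab$ → 3, abab$ → 1, b$ → 4, bab$ → 2] • array of pointers to all suffixes of a string, sorted in lexicographic order • space-efficient (5N) • slow to update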
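A minimal sketch of the definition, assuming the text fits in memory: sort the suffix start positions by the suffixes they point to. This naive version compares whole suffixes and is only meant to show what the array contains; `naive_suffix_array` is a name chosen for this example. Positions are 0-based here, whereas the figure counts from 1 and omits the suffix consisting of `$` alone.

```python
def naive_suffix_array(text):
    """Start positions of all suffixes, sorted by the suffixes themselves."""
    return sorted(range(len(text)), key=lambda i: text[i:])


text = "abab$"
for pos in naive_suffix_array(text):
    # Print the 1-based position next to the suffix it points to.
    print(pos + 1, text[pos:])
```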

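String B-tree [figure: example for abab$] • suffix pointers stored in a B-tree • few disk accesses during search (blind tree) • good worst-case performance • insertion and deletion are easy • size: 13N • slow to build from scratch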

I/O complexity • I/O complexity of search • I/O complexity of updates • I/O complexity of construction

I/O complexity of search • Suffix tree • does not depend on N • String B-tree • Suffix array (p: keyword length, occ: number of answers, N: string length)

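I/O complexity of updates • Suffix tree • does not depend on N • String B-tree • Suffix array • when a large amount of text is added, there is little difference from the String B-tree (p: keyword length, N: string length, B: disk page size)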

I/O complexity of construction • suffix tree (optimal) • suffix array • String B-tree (N: string length, M: memory size, B: disk page size)

Construction algorithms • Suffix tree construction • in memory • on disk • Suffix array construction • in memory • on disk

Suffix tree construction • in memory (linear time) • Weiner 73 • McCreight 76 • Ukkonen 95 • Farach 97 • divide and conquer, batch processing • on disk • Farach, Ferragina, Muthukrishnan 98

Suffix tree construction on disk • express the algorithm in terms of sorting and scanning • same I/O complexity as sorting numbers (optimal)

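Sorting I/O complexity • The following problems have the same I/O complexity as sorting • K lca queries on the nodes of a tree (K range minima) • the Euler tour ET(T) of a tree T and the node depths • K characters at arbitrary positions in a string • the marked descendants of each node of a tree • merging uncompacted tries • all suffix links of a suffix tree • suffix tree construction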

Block vs. Random I/O • block I/O vs. random I/O • Lemma: a sorting algorithm that performs […] random I/Os requires […] I/Os • 2-way merge • M/B-way merge
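The contrast between 2-way and M/B-way merging can be illustrated by counting merge passes, since every pass reads and writes all of the data once. The sketch below is an in-memory stand-in, assuming the runs are Python lists; `merge_passes` and `fan_in` (which plays the role of M/B) are names chosen for this example.

```python
import heapq


def merge_passes(runs, fan_in):
    """Merge sorted runs fan_in at a time; return (sorted list, number of passes)."""
    passes = 0
    while len(runs) > 1:
        # One pass: every group of fan_in runs is merged into one longer run.
        runs = [list(heapq.merge(*runs[i:i + fan_in]))
                for i in range(0, len(runs), fan_in)]
        passes += 1
    return (runs[0] if runs else []), passes


initial_runs = [[value] for value in range(64, 0, -1)]  # 64 runs of length 1
print(merge_passes(initial_runs, 2)[1])  # 2-way merge
print(merge_passes(initial_runs, 8)[1])  # 8-way merge
```

With 64 initial runs, the 2-way merge needs 6 passes while the 8-way merge needs 2, which is why a larger fan-in reduces the total amount of I/O.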

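Problems with suffix trees on disk • the edges of the tree are represented as pointers into the string • traversing the tree causes random accesses • classical algorithms are unsuitable (divide and conquer is used instead)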

Algorithm outline [figure: odd tree and even tree for abab$] • build the odd tree • build the even tree • merge them

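Building the odd tree [figure: abab$ becomes AA$] • treat each pair of consecutive characters as a single character, giving a string of length N/2 • recursively build the suffix tree of the new string • convert the characters back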
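The pairing step can be sketched as follows: split the text into consecutive character pairs and replace each distinct pair by its rank, so that comparing the new symbols agrees with comparing the original pairs. This covers only the first bullet; the recursive suffix tree construction and the conversion back are not shown. `pair_encode` is a name chosen for this example, and ranks are integers rather than letters such as A.

```python
def pair_encode(text):
    """Replace each consecutive character pair by the rank of that pair."""
    pairs = [text[i:i + 2] for i in range(0, len(text), 2)]
    # Rank the distinct pairs in lexicographic order so that the order of the
    # new symbols matches the order of the pairs they stand for.
    rank = {pair: r for r, pair in enumerate(sorted(set(pairs)))}
    return [rank[pair] for pair in pairs]


print(pair_encode("abab$"))
```

For abab$ the pairs are ab, ab, $, so both ab pairs receive the same symbol, matching AA$ in the figure.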

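Building the even tree [figure: even tree for abab$; suffix 2: (b, ab$) = (b, 2); suffix 4: (b, $) = (b, 1)] • radix-sort the even-numbered suffixes into lexicographic order • key: (first character, lexicographic rank of the following odd-numbered suffix) • compute the lcp between adjacent suffixes • build the compacted trie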
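The sort key for the even suffixes can be sketched as below. In the actual algorithm the ranks of the odd suffixes come from the odd tree and the sort is a radix sort; here, as a simplifying assumption, the odd ranks are obtained by naive sorting and Python's built-in sort is used. `sort_even_suffixes` is a name chosen for this example, and positions are reported 1-based to match the figure.

```python
def sort_even_suffixes(text):
    """Sort even-numbered suffixes (1-based) by (first char, rank of next odd suffix)."""
    n = len(text)
    # Stand-in for the odd tree: rank the odd-numbered suffixes (0-based 0, 2, 4, ...).
    odd_sorted = sorted(range(0, n, 2), key=lambda i: text[i:])
    odd_rank = {pos: r for r, pos in enumerate(odd_sorted, start=1)}
    keyed = []
    for i in range(1, n, 2):  # 0-based 1, 3, ... are the 1-based even positions
        # Rank 0 stands for the empty suffix past the end of the text.
        key = (text[i], odd_rank.get(i + 1, 0))
        keyed.append((i + 1, key))
    return sorted(keyed, key=lambda item: item[1])


print(sort_even_suffixes("abab$"))
```

For abab$ this yields suffix 4 with key (b, 1) before suffix 2 with key (b, 2), in agreement with the figure.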

Merging the odd and even trees • find the anchor pairs • split into side tree pairs • find the pull nodes • find the merge nodes • merge Te and To

Suffix array construction in memory • quick sort • very slow because it compares strings • ternary partitioning [Bentley, Sedgewick 97] • few wasted string comparisons • can become extremely slow • doubling algorithm • Manber, Myers 93 • Sadakane, Imai 98 • fastest in many cases
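A minimal sketch of ternary partitioning applied to suffixes, in the spirit of Bentley and Sedgewick's multikey quicksort: partition on one character at a time, and only the "equal" part advances to the next character. It builds new lists instead of partitioning in place and recurses once per matched character, so it is an illustration rather than a tuned implementation; `multikey_quicksort` is a name chosen for this example.

```python
def multikey_quicksort(text, suffixes, depth=0):
    """Sort suffix start positions, comparing one character (at depth) per level."""
    if len(suffixes) <= 1:
        return list(suffixes)

    def char_at(i):
        # -1 marks "past the end", so a shorter suffix sorts first.
        return ord(text[i + depth]) if i + depth < len(text) else -1

    pivot = char_at(suffixes[len(suffixes) // 2])
    less = [i for i in suffixes if char_at(i) < pivot]
    equal = [i for i in suffixes if char_at(i) == pivot]
    greater = [i for i in suffixes if char_at(i) > pivot]
    if pivot != -1:
        # Suffixes that share this character are distinguished by the next one.
        equal = multikey_quicksort(text, equal, depth + 1)
    return (multikey_quicksort(text, less, depth)
            + equal
            + multikey_quicksort(text, greater, depth))


text = "tobeornottobe$"
print(multikey_quicksort(text, list(range(len(text)))))
```

On highly repetitive text the "equal" part stays large for many characters, which is one way to see why this approach can become very slow.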

Doubling algorithm • Karp, Miller, Rosenberg 72 • string sorting on disk [Arge et al. 97] • convert substrings of length 1, 2, 4, … into numbers • strings can be distinguished with log n comparisons

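Suffix sorting by doubling (1/5) • split the suffixes into groups by their first character • number the groups • split each group by the second character of its suffixes • update the numbers (different numbers ⇔ different first two characters) • split each group by the third and fourth characters of its suffixes • sort by group number • repeat until the order of all suffixes is determined • skip groups whose order is already determined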
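The steps above can be sketched as follows. Each round sorts the suffixes by the pair (own group number, group number of the suffix h positions later), which doubles the number of characters the order accounts for. As a simplification, this sketch re-sorts the whole array every round instead of skipping groups that are already sorted; the function name and the choice to number a group by the index of its first member are assumptions made for this example.

```python
def suffix_array_by_doubling(text):
    """Build a suffix array by repeatedly doubling the compared prefix length."""
    n = len(text)
    if n == 0:
        return []
    # Round 0: group the suffixes by their first character.
    order = sorted(range(n), key=lambda i: text[i])
    group = [0] * n
    for j, i in enumerate(order):
        same = j > 0 and text[i] == text[order[j - 1]]
        group[i] = group[order[j - 1]] if same else j
    h = 1
    while len(set(group)) < n:  # stop once every suffix has its own group
        def key(i):
            # The first h characters are summarized by group[i], the next h
            # by group[i + h]; -1 stands for "past the end of the text".
            return (group[i], group[i + h] if i + h < n else -1)

        order.sort(key=key)
        new_group = [0] * n
        for j, i in enumerate(order):
            same = j > 0 and key(i) == key(order[j - 1])
            new_group[i] = new_group[order[j - 1]] if same else j
        group = new_group
        h *= 2
    return order


print(suffix_array_by_doubling("tobeornottobe$"))
```

For the text tobeornottobe$ the loop finishes after the rounds with h = 1, 2 and 4.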

Suffix sorting by doubling (2/5): sorting on the first 2 characters (text: tobeornottobe$; "-" marks groups that are already sorted)
I[i]      : 13  2 11  3 12  6  1  4  7 10  5  0  8  9
V[I[i]]   :  0  1  1  3  3  5  6  6  6  6 10 11 11 11
V[I[i]+1] :  -  3  3  6  0  -  1 10 11  1  -  6 11  6

Suffix sorting by doubling (3/5): sorting on the first 4 characters (text: tobeornottobe$; "-" marks groups that are already sorted)
I[i]      : 13  2 11 12  3  6  1 10  4  7  5  0  9  8
V[I[i]]   :  0  1  1  3  4  5  6  6  8  9 10 11 11 13
V[I[i]+2] :  -  8  0  -  -  -  4  3  -  -  -  1  1  -

Suffix sorting by doubling (4/5): sorting on the first 8 characters (text: tobeornottobe$; "-" marks groups that are already sorted)
I[i]      : 13 11  2 12  3  6 10  1  4  7  5  0  9  8
V[I[i]]   :  0  1  2  3  4  5  6  7  8  9 10 11 11 13
V[I[i]+4] :  -  -  -  -  -  -  -  -  -  -  -  8  0  -

Suffix sorting by doubling (5/5): sorting finished (text: tobeornottobe$)
I[i]      : 13 11  2 12  3  6 10  1  4  7  5  9  0  8
V[I[i]]   :  0  1  2  3  4  5  6  7  8  9 10 11 12 13

Suffix array construction on disk • Gonnet, Baeza-Yates, Snider 92 • accesses the disk only sequentially • Crauser, Ferragina 98 • doubling algorithm + discarding

Doubling algorithm + discarding • run the doubling algorithm on disk • […] iterations • uses M/B-way merge sort (this differs from the in-memory algorithm) • parts that are already sorted are skipped

Word indexes vs. Full-text indexes

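• word indexes: index only the beginnings of words; data size is about 1/7 (for both Japanese and English); hits may be missed; morphological analysis is required; cannot be used for DNA sequences • full-text indexes: index all substrings (exhaustive coverage); good at finding long patterns; noise gets into the search results, e.g. a search for 京都 (Kyoto) matches 東京都 (Tokyo-to), ルパン (Lupin) matches ダブルパンチ (double punch), はらだ (Harada) matches はらだたしい (haradatashii, "irritating"); avoidable with AND search?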

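Advantages and disadvantages of full-text indexes • search results are pointers into the string • pointers must be converted to document numbers • can be used as an ultra-fast grep • large • a word index can be constructed from a full-text index • scan the text • mark the index entries that are not needed • scan the index and delete the marked entries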
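The points above can be sketched with a suffix array over several concatenated documents: binary search returns text positions, a second lookup converts each position to a document number, and filtering the array down to word starts gives a word index. The `\x00` separator, the naive index construction, the alphanumeric definition of a word, and all function names are assumptions made for this example.

```python
import bisect

SEPARATOR = "\x00"  # assumed not to occur in any document


def build_index(docs):
    """Concatenate documents; return (text, document start offsets, suffix array)."""
    starts, offset = [], 0
    for doc in docs:
        starts.append(offset)
        offset += len(doc) + len(SEPARATOR)
    text = SEPARATOR.join(docs)
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])  # naive build
    return text, starts, suffix_array


def find_occurrences(text, suffix_array, pattern):
    """All text positions where pattern occurs, via two binary searches."""
    m = len(pattern)
    lo, hi = 0, len(suffix_array)
    while lo < hi:  # first suffix whose m-character prefix is >= pattern
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first, hi = lo, len(suffix_array)
    while lo < hi:  # first suffix whose m-character prefix is > pattern
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[first:lo])


def document_of(starts, position):
    """Convert a text position (pointer) to a document number."""
    return bisect.bisect_right(starts, position) - 1


def word_index(text, suffix_array):
    """Keep only the index entries that point at the start of a word."""
    keep = {i for i, ch in enumerate(text)
            if ch.isalnum() and (i == 0 or not text[i - 1].isalnum())}
    return [pos for pos in suffix_array if pos in keep]


text, starts, sa = build_index(["to be or not", "to be", "not to"])
hits = find_occurrences(text, sa, "not")
print(hits, [document_of(starts, pos) for pos in hits])
```

Because the search works on raw positions, a query such as `ot` also matches inside `not`; the word index drops exactly those entries.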

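Open issues • eliminating noise from the search results • AND search? • using a thesaurus • OR search • searching structured documents • e.g. searching only the headings • speed of data collection • compress the original documents before sending them • send only the word index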

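Integrating compression and search • Block sorting compression [Burrows, Wheeler 94] • rearranges the string according to its suffix array and then compresses it • decompression recovers both the string and the suffix array • so text can be compressed with block sorting before it is transferred [Sadakane, Imai 98a]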
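A minimal sketch of the rearrangement step and its inverse, assuming the text ends with a unique sentinel `$` that is smaller than every other character. The transform takes the character preceding each suffix in suffix array order; the inverse walks through the transformed string and, as a by-product, recovers the suffix array along with the text. The actual compression stage that follows the rearrangement is omitted, and the function names are chosen for this example.

```python
def block_sort(text):
    """Burrows-Wheeler transform computed from a (naively built) suffix array."""
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    # text[i - 1] with i == 0 wraps around to the final sentinel character.
    return "".join(text[i - 1] for i in suffix_array)


def inverse_block_sort(last):
    """Recover (text, suffix array) from the transformed string."""
    n = len(last)
    # A stable sort pairs the k-th occurrence of a character in the sorted
    # first column with its k-th occurrence in the last column.
    order = sorted(range(n), key=lambda i: last[i])
    text, suffix_array = [], [0] * n
    row = order[0] if n else 0  # the row of the suffix that starts at position 0
    for position in range(n):
        text.append(last[order[row]])  # first character of this row's suffix
        suffix_array[row] = position
        row = order[row]               # move to the row of the next suffix
    return "".join(text), suffix_array


transformed = block_sort("abab$")
print(transformed, inverse_block_sort(transformed))
```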

Acknowledgements: The author thanks Masanori Harada and Takayuki Nakamura of NTT for their valuable comments.

References (1/3)
[1] L. Arge, P. Ferragina, R. Grossi, and J. S. Vitter. On sorting strings in external memory. In ACM Symposium on Theory of Computing, pp. 540--548, 1997.
[2] J. L. Bentley and R. Sedgewick. Fast algorithms for sorting and searching strings. In Proceedings of the 8th Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 360--369, 1997.
[3] M. Burrows and D. J. Wheeler. A Block-sorting Lossless Data Compression Algorithm. Technical Report 124, Digital SRC Research Report, 1994.
[4] A. Crauser and P. Ferragina. External memory construction of full-text indexes. In DIMACS Workshop on External Memory Algorithms and/or Visualization, 1998.
[5] M. Farach. Optimal Suffix Tree Construction with Large Alphabets. In 38th Symp. on Foundations of Computer Science, pp. 137--143, 1997.

References (2/3)
[6] P. Ferragina and R. Grossi. The String B-Tree: a new data structure for string search in external memory and its applications. Journal of the ACM, 1998.
[7] G. H. Gonnet, R. Baeza-Yates, and T. Snider. New Indices for Text: PAT trees and PAT arrays. In W. Frakes and R. Baeza-Yates, editors, Information Retrieval: Data Structures and Algorithms, chapter 5, pp. 66--82. Prentice-Hall, 1992.
[8] R. M. Karp, R. E. Miller, and A. L. Rosenberg. Rapid identification of repeated patterns in strings, arrays and trees. In 4th ACM Symposium on Theory of Computing, pp. 125--136, 1972.
[9] U. Manber and G. Myers. Suffix arrays: A New Method for On-Line String Searches. SIAM Journal on Computing, Vol. 22, No. 5, pp. 935--948, October 1993.

References (3/3)
[10] E. M. McCreight. A space-economical suffix tree construction algorithm. Journal of the ACM, Vol. 23, No. 2, pp. 262--272, 1976.
[11] K. Sadakane and H. Imai. A Cooperative Distributed Text Database Management Method Unifying Search and Compression Based on the Burrows-Wheeler Transformation. In Proceedings of NewDB'98, 1998.
[12] K. Sadakane and H. Imai. Constructing Suffix Arrays of Large Texts. In Proceedings of DEWS'98, 1998.
[13] E. Ukkonen. On-line construction of suffix trees. Algorithmica, Vol. 14, No. 3, pp. 249--260, September 1995.
[14] P. Weiner. Linear Pattern Matching Algorithms. In Proceedings of the 14th IEEE Symposium on Switching and Automata Theory, pp. 1--11, 1973.
