
Surveys of Some Critical Issues in Chinese Indexing

Presentation Transcript


  1. Surveys of Some Critical Issues in Chinese Indexing: Chinese Document Indexing and Word Segmentation • Speaker: Reuy-Lung Hsiao • Date: Wed, Dec 22

  2. Roadmap • An overview of Web Information Retrieval system architecture • Automatic indexing overview • Questions of Chinese document indexing • Typical approaches to indexing Chinese document sets • Chinese word segmentation mechanisms • Segmentation algorithms • Discussion and Conclusion • Reference

  3. Chinese Document Indexing System Overview [Architecture diagram: the Document Set is processed by Indexing to build the Index Database; a user Request goes through Query Formulation, the query is matched against the index by Similarity Measurement (Ranking), and the Result is returned in the Response, supporting Information Discovery.]

  4. Automatic Indexing Overview 1. An automatic indexing mechanism extracts the features (terms or keywords) of a given document. 2. The indexing process may contain the following steps: (1) Morphological & Lexical Analysis: stemming -> stop list -> weighting -> thesaurus construction (2) Syntactic & Semantic Analysis: part-of-speech tagging -> information extraction -> concept extraction 3. Weighting plays an important role in retrieval effectiveness. (1) Typical term weighting mechanism: TFxIDF. (2) Typical effectiveness measurements: recall, precision.
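
  As a concrete illustration of the weighting step, the following minimal Python sketch computes TFxIDF weights for a toy collection of already-segmented documents. The function name, the sample documents, and the use of a natural-log IDF are illustrative assumptions, not taken from the slides.

    import math
    from collections import Counter

    def tfidf(docs):
        """docs: a list of token lists. Returns one {term: weight} map per
        document, using w = tf * log(N / df), the TFxIDF scheme of slide 4."""
        n_docs = len(docs)
        df = Counter()                      # document frequency of each term
        for doc in docs:
            df.update(set(doc))
        weights = []
        for doc in docs:
            tf = Counter(doc)               # raw term frequency in this document
            weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
        return weights

    # Toy usage with already-segmented (word-level) Chinese documents.
    docs = [["中國", "文學", "研究"], ["中國", "歷史"], ["文學", "批評"]]
    print(tfidf(docs)[0])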

  5. Automatic Indexing Overview 4. TFxIDF 5. Recall/Precision [Figure: the relevance line and the retrieval line divide the document collection into regions A (relevant but not retrieved), B (relevant and retrieved), C (retrieved but not relevant), and D (neither retrieved nor relevant).] Recall = (# retrieved relevant documents) / (# relevant documents) = B / (A + B) Precision = (# retrieved relevant documents) / (# retrieved documents) = B / (B + C)
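
  The two measures in code, using document-ID sets in place of the A/B/C regions above (the IDs are hypothetical):

    def recall_precision(retrieved, relevant):
        """retrieved, relevant: sets of document IDs.
        B = retrieved & relevant, A + B = relevant, B + C = retrieved."""
        b = len(retrieved & relevant)
        recall = b / len(relevant) if relevant else 0.0
        precision = b / len(retrieved) if retrieved else 0.0
        return recall, precision

    # 3 of the 4 relevant documents were retrieved among 5 results.
    print(recall_precision({1, 2, 3, 4, 5}, {2, 3, 5, 8}))   # (0.75, 0.6)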

  6. Questions of Chinese Document Indexing • 1. Words, rather than characters, should be the smallest indexing unit: words are more specific to the concepts, and less index space is required. • 2. A comprehensive lexicon is needed. • 3. Chinese text has no delimiters to mark word boundaries. For example, English words are separated by spaces and punctuation, whereas a Chinese sentence such as 中文句子沒有明顯的分隔符號 ("Chinese sentences have no obvious separator symbols") has none.

  7. Approaches to indexing Chinese Text • 1. N-gram Indexing • Typically uses N = 1, 2, 3 • Produces a large index file • 2. Statistical Indexing • Typically uses mutual information to measure word correlation • 3. Word-based Indexing • Rule-based approach • Statistical approach • Hybrid approach

  8. Approaches to indexing Chinese Text (N-gram Indexing)
  • N-gram indexing terms produced from the same text string, for the sentence C1C2C3C4C5C6:
      unigram: C1, C2, C3, C4, C5, C6
      bigram: C1C2, C2C3, C3C4, C4C5, C5C6
      trigram: C1C2C3, C2C3C4, C3C4C5, C4C5C6
  • N-gram index size for the TREC-5 Chinese collection:
      n-gram     # distinct n-grams     # of n-grams
      unigram    6,236                  64,611,662
      bigram     1,393,488              54,362,319
      trigram    8,119,574              49,886,331
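
  The n-gram terms above can be produced with a few lines of Python; this sketch slides a window of n characters over the string (the sample sentence is an illustrative assumption):

    def char_ngrams(text, n):
        """Return the overlapping character n-grams of a Chinese string,
        as in the C1..C6 example of slide 8."""
        return [text[i:i + n] for i in range(len(text) - n + 1)]

    sentence = "中文句子沒有分隔符號"      # illustrative sentence, not from the slides
    for n in (1, 2, 3):
        print(n, char_ngrams(sentence, n))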

  9. Approaches to indexing Chinese Text (Statistical Indexing) • The mutual information I(x,y) between two events x and y is defined as I(x,y) = log2 [ P(x,y) / (P(x)P(y)) ] • If the two events occur independently, P(x,y) will be close to P(x)P(y), so I(x,y) will be close to zero. • If the two events are strongly related, P(x,y) will be much larger than P(x)P(y), so I(x,y) will be large. • Using frequency counts to estimate the probabilities: P(Ci) = f(Ci)/N and P(C1,C2) = P(C1)P(C2|C1) = (f(C1)/N)(f(C1C2)/f(C1)) = f(C1C2)/N, hence I(C1,C2) = log2 N + log2 [ f(C1C2) / (f(C1) f(C2)) ]
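
  Translating this estimate directly into Python (the counts and the collection size N below are placeholders, not values from the slides):

    import math

    def mutual_information(f_c1, f_c2, f_c1c2, n):
        """I(C1,C2) = log2 N + log2( f(C1C2) / (f(C1) * f(C2)) ), as on slide 9.
        f_c1, f_c2: character frequencies; f_c1c2: bigram frequency; n: total characters."""
        if f_c1c2 == 0:
            return float("-inf")            # the two characters never occur together
        return math.log2(n) + math.log2(f_c1c2 / (f_c1 * f_c2))

    # Placeholder counts for a strongly and a weakly associated character pair.
    print(mutual_information(500, 450, 60, 600000))   # high MI: frequent co-occurrence
    print(mutual_information(500, 450, 1, 600000))    # low MI: weak association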

  10. Approaches to indexing Chinese Text (Statistical Indexing) • Statistical Indexing Algorithm 1. Compute the mutual information values for all adjacent bigrams. 2. Treat the bigram with the largest mutual information value as a word and remove it from the text. 3. Repeat step 2 on each of the remaining short phrases until every phrase consists of one or two characters. • The following statistics are based on text collections from the China Times, 12/19/99 through 12/21/99: 621,079 characters in total, 3,827 distinct characters per day on average. • Comparison among the above indexing methods. (result)
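
  A minimal sketch of this greedy procedure, assuming the mutual information of each adjacent bigram has already been computed (the values below are the ones listed on slide 11): the function splits out the highest-MI bigram and recurses on the pieces to its left and right.

    def greedy_segment(phrase, mi_score):
        """mi_score: dict mapping a two-character string to its mutual information.
        Split out the adjacent bigram with the highest MI as a word, then recurse
        on the remaining left and right pieces until every piece has <= 2 characters."""
        if len(phrase) <= 2:
            return [phrase] if phrase else []
        scores = [mi_score.get(phrase[i:i + 2], float("-inf"))
                  for i in range(len(phrase) - 1)]
        i = max(range(len(scores)), key=scores.__getitem__)
        return (greedy_segment(phrase[:i], mi_score) + [phrase[i:i + 2]]
                + greedy_segment(phrase[i + 2:], mi_score))

    # Adjacent-bigram MI values for the example phrase, taken from slide 11.
    mi_score = {"連戰": 5.12, "戰新": -7.13, "新的": 0.72, "的競": 1.77,
                "競選": 5.11, "選宣": 1.54, "宣言": 4.14}
    print(greedy_segment("連戰新的競選宣言", mi_score))
    # ['連戰', '新的', '競選', '宣言'] -- the same words slide 11 extracts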

  11. Approaches to indexing Chinese Text (Statistical Indexing)
      bigram     連戰     戰新     新的     的競     競選     選宣     宣言
      f(C1)      543      517      1498     16187    223      1028     259
      f(C2)      517      1498     16187    223      1028     259      305
      f(C1C2)    76       0        80       34       61       2        8
      I(C1,C2)   5.12     -7.13    0.72     1.77     5.11     1.54     4.14

      Step   Phrase               Action
      1      連戰新的競選宣言      remove 連戰
      2      □□新的競選宣言       remove 競選
      3      □□新的□□宣言        remove 宣言
      4      □□新的□□□□         remove 新的
      (other example)

  12. Approaches to indexing Chinese Text (Word-based Indexing) • 1. Rule-based approach • Uses a dictionary (lexicon) to match words. • Concept: a correct segmentation result should consist of legitimate words. • For example, 中國文學 can be split as: 1. 中國 文學 2. 中國 文 學 3. 中 國文 學 4. 中 國 文學 5. 中 國 文 學; we choose (1) as the result. • Out-of-vocabulary words remain a problem.
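
  The slide does not name a specific matching rule; one common instance of dictionary-based segmentation is forward maximum matching, sketched below with a toy lexicon (the lexicon contents and the maximum word length are assumptions for illustration):

    def forward_maximum_matching(text, lexicon, max_word_len=4):
        """Greedily take the longest dictionary word starting at the current
        position; fall back to a single character when nothing matches."""
        words, i = [], 0
        while i < len(text):
            for length in range(min(max_word_len, len(text) - i), 0, -1):
                candidate = text[i:i + length]
                if length == 1 or candidate in lexicon:
                    words.append(candidate)
                    i += length
                    break
        return words

    lexicon = {"中國", "文學", "國文"}                      # toy lexicon
    print(forward_maximum_matching("中國文學", lexicon))    # ['中國', '文學']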

  13. Approaches to indexing Chinese Text (Word-based Indexing) • 2. Statistical approach • Relies on statistical information such as word and character (co-)occurrence frequencies in the training data. • Concept: given a sentence, the best segmentation is the sequence of potential words Si whose overall probability (e.g. the product of the word probabilities) is the highest. • Supervised/unsupervised learning. • Requires a large amount of training data to achieve good accuracy. • Sparse-data problem.

  14. Approaches to indexing Chinese Text (Segmentation Algo.) • Hybrid Segmentation Algorithm by Jian-Yun Nie, Martin Brisebois, SIGIR '96 • Uses a lexicon and statistical information to segment words, with morphological heuristic rules to augment lexicon coverage. (Note: supervised learning.) • Terminology: • background knowledge: words contained in the dictionary, with a default probability (p) • foreground knowledge: statistical information • heuristic rules: two kinds of rules are included • Nominal pre-determiner structures, such as 這一年 (this year), 一百本 (a hundred volumes), 每一天 (every day) • Affix structures, such as 小朋友 (child), 大眾化 (popularized)
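
  The paper's exact rule formalism is not shown on the slide. Purely as a hedged illustration of how an affix rule can extend lexicon coverage, the sketch below proposes prefix+word and word+suffix candidates; the affix lists, the lexicon, and the sample sentence are all hypothetical:

    # Hypothetical affix lists, used only to illustrate the idea of an affix rule.
    PREFIXES = {"小", "大", "老"}
    SUFFIXES = {"化", "性", "者"}

    def affix_candidates(text, lexicon):
        """Propose out-of-lexicon words of the form prefix+word or word+suffix
        whenever the inner part is a known word and the combination occurs in the text."""
        candidates = set()
        for word in lexicon:
            for p in PREFIXES:
                if p + word in text:
                    candidates.add(p + word)
            for s in SUFFIXES:
                if word + s in text:
                    candidates.add(word + s)
        return candidates - lexicon

    lexicon = {"朋友", "大眾"}                                  # toy lexicon
    print(affix_candidates("小朋友喜歡大眾化的音樂", lexicon))   # {'小朋友', '大眾化'}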

  15. Approaches to indexing Chinese Text (Segmentation Algo.) • Algorithm: • Combination of both kinds of knowledge: if statistical information is available, use it; otherwise the background knowledge is taken into account. • Each character in the input string is associated with all the candidate words starting at that character, together with their probabilities. • The candidate words are combined to cover the input string, and the word sequence with the highest probability is chosen as the result. • Example: 大會決議和議程項目 [Candidate lattice: single-character candidates 大 (0.016), 會 (0.029), 決 (0.00108), 議 (0.0005), 和 (0.945), 議 (0.0005), 程 (0.0005), 項 (0.0005), 目 (0.0024); multi-character candidates 大會 (1.0), 決議 (0.956), 議和 (0.001), 和議 (0.001), 議程 (1.0), 項目 (0.936)] (Result)
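
  A minimal sketch of the combination step as a dynamic program over candidate words. The probabilities are a best-effort reading of the garbled slide-15 lattice, so both the table and the expected output should be treated as illustrative rather than as the paper's exact numbers:

    import math

    # Candidate-word probabilities read off the slide-15 example.
    WORD_PROB = {"大會": 1.0, "決議": 0.956, "議和": 0.001, "和議": 0.001,
                 "議程": 1.0, "項目": 0.936,
                 "大": 0.016, "會": 0.029, "決": 0.00108, "議": 0.0005,
                 "和": 0.945, "程": 0.0005, "項": 0.0005, "目": 0.0024}

    def best_segmentation(text, word_prob, max_len=2):
        """best[k] holds the highest log-probability of any segmentation of the
        first k characters; back[k] holds the start index of its last word."""
        n = len(text)
        best = [0.0] + [float("-inf")] * n
        back = [0] * (n + 1)
        for k in range(1, n + 1):
            for i in range(max(0, k - max_len), k):
                w = text[i:k]
                if w in word_prob:
                    score = best[i] + math.log(word_prob[w])
                    if score > best[k]:
                        best[k], back[k] = score, i
        words, k = [], n
        while k > 0:                      # recover the words from the back pointers
            words.append(text[back[k]:k])
            k = back[k]
        return words[::-1]

    print(best_segmentation("大會決議和議程項目", WORD_PROB))
    # ['大會', '決議', '和', '議程', '項目']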

  16. Approaches to indexing Chinese Text (Segmentation Algo.) • Unsupervised Segmentation Algorithm by Xiaoqiang Luo, Salim Roukos, ACL '96 • A purely statistical learning model that uses no dictionary. It divides the training set into two parts, randomly segments part one, and segments part two using part one. • The previously constructed language model is then used iteratively. • A Viterbi-like algorithm is used to build the LM. • Concept: let a sentence S = C1C2...Cn-1Cn, where Ci (1 ≦ i ≦ n) is a Chinese character. To segment the sentence into words is to group these characters into words, i.e.

  17. Approaches to indexing Chinese Text (Segmentation Algo.) • S = C1C2...Cn-1Cn = (C_1...C_{x_1})(C_{x_1+1}...C_{x_2})...(C_{x_{m-1}+1}...C_{x_m}) = W1W2...Wm • where x_k is the index of the last character of the kth word W_k, i.e. W_k = C_{x_{k-1}+1}...C_{x_k} (k = 1..m), with x_0 = 0 and x_m = n • A segmentation of the sentence S is uniquely represented by the integer sequence x_1,...,x_m, so we denote the set of all possible segmentations by G(S) = { (x_1,...,x_m) | 1 ≦ x_1 < ... < x_m = n, m ≦ n } • and assign a score to a segmentation g(S) = (x_1,...,x_m) ∈ G(S) by

  18. Approaches to indexing Chinese Text (Segmentation Algo.) • L(g(S)) = log P_g(W1...Wm) = Σ_{i=1..m} log P_g(W_i | h_i) • where W_j = C_{x_{j-1}+1}...C_{x_j} (j = 1..m) and h_i is the history of words W1...W_{i-1}; here a trigram model is adopted, with h_i = W_{i-2}W_{i-1} • Among all possible segmentations we pick the one g* with the highest score as the result, that is g* = arg max_{g ∈ G(S)} L(g(S)) = arg max_{g ∈ G(S)} log P_g(W1...Wm) • Let L(k) be the maximum accumulated score for the first k characters; L(k) is defined for k = 1..n, with L(0) = 0 and L(g*) = L(n).

  19. Approaches to indexing Chinese Text (Segmentation Algo.) • Given { L(i) | 1 ≦ i ≦ k-1 }, L(k) can be computed recursively as follows: • L(k) = max_i [ L(i) + log P(C_{i+1}...C_k | h_i) ] • p(k) = arg max_i [ L(i) + log P(C_{i+1}...C_k | h_i) ] • so that C_{p(k)+1}...C_k is the last word of the optimal segmentation of the first k characters. • For example, for a six-character sentence:
      chars   C1   C2   C3   C4   C5   C6
      k       1    2    3    4    5    6
      p(k)    0    1    1    3    3    4
  So the optimal segmentation of the sentence is (C1)(C2C3)(C4)(C5C6).
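
  A hedged sketch of this recursion in Python. For brevity it scores each word with a unigram log-probability instead of the trigram history h_i used in the paper, and the probability table is a toy assumption; the array p plays the same role as p(k) above.

    import math

    def viterbi_segment(sentence, word_logprob, max_word_len=4, unk_logprob=-20.0):
        """L[k] = max over i < k of ( L[i] + log P(C_{i+1}..C_k) ); p[k] stores the
        maximizing i, so sentence[p[k]:k] is the last word ending at position k.
        Unknown single characters receive a fixed penalty, an assumption of this sketch."""
        n = len(sentence)
        L = [0.0] + [float("-inf")] * n
        p = [0] * (n + 1)
        for k in range(1, n + 1):
            for i in range(max(0, k - max_word_len), k):
                w = sentence[i:k]
                lp = word_logprob.get(w, unk_logprob if len(w) == 1 else float("-inf"))
                if L[i] + lp > L[k]:
                    L[k], p[k] = L[i] + lp, i
        words, k = [], n
        while k > 0:                       # backtrack through p(k), as on slide 19
            words.append(sentence[p[k]:k])
            k = p[k]
        return words[::-1]

    # Toy word log-probabilities (assumed values, not estimated from any corpus).
    word_logprob = {w: math.log(prob) for w, prob in
                    {"宋": 0.01, "楚瑜": 0.02, "興票": 0.01,
                     "案": 0.03, "愈演": 0.01, "愈烈": 0.01}.items()}
    print(viterbi_segment("宋楚瑜興票案愈演愈烈", word_logprob))
    # ['宋', '楚瑜', '興票', '案', '愈演', '愈烈']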

  20. Discussion and Conclusion • 1. Since most Chinese words consist of two characters, bigram/statistical indexing outperforms the other methods (even the dictionary-based method). (According to New Advances in Computers and Natural Language Processing in China, Liu, Information Science (for China), '87: 5% of words are unigrams, 75% are bigrams, 14% are trigrams, and 6% are words of four or more characters.) • 2. Character-based indexing is not suited to Chinese text retrieval, for the reasons below: • Character-based approaches lead to a great deal of incorrect matching between queries and documents, because characters combine quite freely.

  21. Discussion and Conclusion • With a character-based approach, a complex concept must always be expressed by a fixed character string in both the documents and the query. • In character-based approaches, every character is dealt with in the same way. • Character-based approaches do not allow us to easily incorporate linguistic knowledge into the searching process. • 3. Word-based indexing is the first step toward concept-based indexing/retrieval, to avoid another information explosion.

  22. Reference
  1. A Statistical Method for Finding Word Boundaries in Chinese Text - Richard Sproat and Chilin Shih, Computer Processing of Chinese and Oriental Languages, '90
  2. On Chinese Text Retrieval - Jian-Yun Nie, Martin Brisebois, SIGIR '96
  3. An Iterative Algorithm to Build Chinese Language Models - Xiaoqiang Luo, Salim Roukos, ACL '96
  4. Chinese Text Retrieval Without Using a Dictionary - Aitao Chen, Jianzhang He, SIGIR '97
  5. A Tagging-Based First-Order Markov Model Approach to Automatic Word Identification for Chinese Sentences - T.B.Y. Lai, M.S. Sun, COLING '98
  6. Chinese Indexing Using Mutual Information - Christopher C., Asian Digital Library Workshop '98
  7. A New Statistical Formula for Chinese Text Segmentation Incorporating Contextual Information - Yubin Dai, Teck Ee Loh, SIGIR '99
  8. Discovering Chinese Words from Unsegmented Text - Xianping Ge, Wanda Pratt, SIGIR '99

  23. Corpus • 1,270 Kbytes • Training set: 1,247 • Test set: 272 • 90 words and 160 characters per document on average • Segmentation accuracy is around 91% • A stop-list is used, e.g. 的, 並, 除非, 此外 ... <Back>

  24. Example: 宋楚瑜興票案愈演愈烈
  Collection of 849,967 characters (3,998 distinct):
      bigram     宋楚    楚瑜    瑜興    興票    票案    案愈    愈演    演愈    愈烈
      f(C1)      1103    800     673     498     687     1061    107     355     107
      f(C2)      800     673     498     687     1061    107     355     107     118
      f(C1C2)    665     665     1       191     66      1       2       2       2
      I(C1,C2)   6.15    6.64    0.62    5.85    4.03    1.70    3.49    3.49    4.58
      Result: 宋 楚瑜 興票 案 愈演 愈烈
  Collection of 1,865,718 characters (4,513 distinct):
      bigram     宋楚    楚瑜    瑜興    興票    票案    案愈    愈演    演愈    愈烈
      f(C1)      2820    2065    1703    1310    1945    2891    345     1085    345
      f(C2)      2065    1703    1310    1945    2891    345     1085    345     360
      f(C1C2)    1649    1678    4       383     90      2       4       4       4
      I(C1,C2)   6.27    6.79    1.21    5.64    3.40    1.32    2.99    2.99    4.10
      Result: 宋 楚瑜 興票 案 愈演 愈烈 <back>
