
Word Segmentation Models: Overview


Presentation Transcript


  1. Word Segmentation Models: Overview • Chinese Words, Morphemes and Compounds • Word Segmentation Problems • Heuristic Approaches • Unknown Word Problems • Probabilistic Models • supervised mode • unsupervised mode • Dynamic Programming • Unsupervised Model for Identifying New Words Jing-Shin Chang

  2. English Text With Well Delimited Word Boundary • (Computer Manual) • For information about installation, see Microsoft Word Getting Started. To choose a command from a menu, point to a menu name and click the left mouse button (滑鼠左鍵). For example, point to the File menu and click to display the File commands. If a command name is followed by an ellipsis, a dialog box (對話框) appears so you can set the options you want. You can also change the shortcut keys (快捷鍵) assigned to commands. (Microsoft Word User Guide) • (1996/10/29 CNN) • Microsoft Corp. announced a major restructuring Tuesday that creates two worldwide product groups and shuffles the top ranks of senior management. Under the fourth realignment ..., the company will separate its consumer products from its business applications, creating a Platforms and Applications group and an Interactive Media group. ... Nathan Myhrvold, who also co-managed the Applications and Content group, was named to the newly created position of chief technology officer. Jing-Shin Chang

  3. Chinese Text Without Well Delimited Word Boundaries • China Times 1997/7/26: • 台經院指出,隨著股市活絡與景氣回溫,第一季車輛及零件營業額成長十六.八一%,顯示民間需求回升。再加上為加入WTO,開放進口已是時勢所趨,也將帶動消費成長。台經院預測今年民間消費全年成長率可提昇至六.七四%。 • 在投資方面,第一季國內投資出現回升走勢,固定資本形成實質增加六.五六%,其中民間投資實質增加八.九五%。在持續有民間大型投資計畫進行、國內房市回溫、與政府開放投資、加速執行公共工程等多項因素下,預測今年全年民間投資將成長十一.八%。 • 台經院表示,口蹄疫連鎖效應在第二季顯現,使第二季出口貿易成長率比預期低,出口年增率二.一%,比去年低。而進口年增率為七.三八%,因此第二季貿易出超僅十七.一四億美元,比去年第二季減少四十三.六五%。不過,由於第三、四季為出口旺季,加上國際組織均預測今年世界貿易量擴大,台經院認為我國商品出口應可轉趨順暢。 Jing-Shin Chang

  4. Example: Word Segmentation [Chang 97] • Input: 移送台中少年法庭審理 • Seg1*: 移送 / 台中少年 / 法庭審理 • Seg2*: 移送 / 台中 / 少年 / 法庭審理 • Seg3 : 移送 / 台中 / 少年法庭 / 審理 • Successively better segmentation with an unsupervised approach ([Chang 97]) • Input: 土地公有政策 • Seg1 : 土地 / 公有 / 政策 • Seg2*: 土地公 / 有 / 政策 • Longest match problem + Unknown word problem Jing-Shin Chang

  5. Example: Word Segmentation • Input: 修憲亂成一團結果什麼也沒得到 • Output: 修 憲 亂 成 一 團結 果 什麼 也 沒 得 到 • mis-merge problem Jing-Shin Chang

  6. Why Word Segmentation • Word is the natural unit for natural language analyses & NLP applications • Tricky output may result if tokenization is not carefully conducted. • tokenization is the first step in most NLP applications • e.g., using character bi-grams as the indexing keys (i.e., representatives of documents) in search engine design and other similarity-based information retrieval tasks Jing-Shin Chang

  7. Word Segmentation Problems in Basic IR System • Information Sources & Acquisition • Web Pages • Web Robots: fetch all web pages of sites of interest or registered sites to local storage • News Groups • News server: accept postings to the news groups • BBS Articles • BBS server: administer posting of BBS articles • IntraNet documents • shared through local LANs • Document Conversion & Normalization • html to txt, etc. • Indexing System • identify features of documents & keep a representative signature for each document • Searching System • convert the query into a representative signature • compare the signature of the input query to the signatures of archived documents • rank the relevant documents by similarity Jing-Shin Chang

  8. Basic Indexing Techniques & WS Problems • Vector Space Approach • document (or query) as a vector of term frequencies (or variants of frequencies) • compare the query vector against document vectors for similarity & relevance • Problems (quick but dirty) • depends on word frequencies only (not even compound words) • independent of word order (no structural or syntactic information) • simple-minded query functions: (user requirements not satisfied) • keyword matching (exact or fuzzy) • logical operators (AND, OR, NOT) • near/quasi natural language query • Chinese-specific problems: weird output due to unsegmented input • Indexed with character 2-grams (not by words), as the sketch below illustrates • 資訊月 => 資訊月刊 • 島內頻寬升級為1GB • 黨內頻喊換閣揆 • 錄音帶內容…尹清峰頻頻說: “…” Jing-Shin Chang
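The false matches above follow directly from character-bigram indexing: a query matches any document containing its bigrams, word boundaries notwithstanding. A minimal Python illustration (the function name and example strings are ours, not from any real IR system):

```python
# Character-bigram indexing, sketched to show why the query 資訊月 also
# retrieves 資訊月刊: every bigram of the query occurs in the document.
def char_bigrams(text):
    """Return the set of overlapping character 2-grams of a string."""
    return {text[i:i + 2] for i in range(len(text) - 1)}

query, doc = "資訊月", "資訊月刊"
print(char_bigrams(query) <= char_bigrams(doc))  # True: all query bigrams match
print(char_bigrams(query) & char_bigrams(doc))   # {'資訊', '訊月'}
```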

  9. Heuristic Approaches • Matching Against Lexicon • scan left-to-right or right-to-left • Heuristic Matching Criteria • (1) Longest (Maximal) Match • select the longest sub-string on multiple matches • (2) Minimum Number of Matches • select the segmentation pattern with the smallest number of words • Greedy Method, Hard Rejection • skip over the matched lexicon entry, and repeat matching, regardless of whether there are embedded or overlapping word candidates in the current matched word (see the sketch below) Jing-Shin Chang
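As a concrete sketch of the longest-match heuristic and its hard-rejection behavior, here is a minimal Python implementation; the lexicon, the fall-back to single characters, and the maximum word length are illustrative assumptions:

```python
# Greedy left-to-right longest (maximal) match: at each position take the
# longest lexicon entry; fall back to one character (a likely unknown word).
def longest_match_segment(text, lexicon, max_len=7):
    segments = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, down to a single character.
        for length in range(min(max_len, len(text) - i), 0, -1):
            word = text[i:i + length]
            if length == 1 or word in lexicon:
                segments.append(word)
                i += length
                break
    return segments

lexicon = {"土地", "公有", "政策", "土地公"}
print(longest_match_segment("土地公有政策", lexicon))
# ['土地公', '有', '政策']: the greedy match commits to 土地公 and can
# never recover the intended 土地 / 公有 / 政策 reading (slide 4).
```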

  10. Heuristic Approaches • Problems • hard decision: skips over a possible match if it was covered by a previous match (impossible to recover based on more evidence) • i.e., p(w) = 1 or 0, for any word ‘w’, depending on whether it was covered by a previously matched word, unconditionally • fewer contextual constraints: depends on local matches • does not depend on all the context • not jointly optimized • cannot handle the unknown word problem: • words not registered in the dictionary will not be handled gracefully • e.g., new compound words, proper names, numbers • Advantages • simple and easy to implement • only needs a large dictionary • needs no training corpora for estimating probabilities Jing-Shin Chang

  11. Problems with Segmentation Using Known Words • Incomplete Error Recovery Capability • Two types of segmentation errors due to unknown word problems: • Over-segmentation: split unknown words into short segments • (e.g., single-character regions `修憲'=> `修 憲') • 分析家 對 馬來西亞 的 預測 • <=> 分析 家 對 馬 來 西亞 的 預測 • Under-segmentation: prefer a long segment when combining segments • (搶詞問題, the “word-robbing” problem) • e.g., `土地 公有 政策‘ =WS Error (`公有’ unknown)=> `土地公 有 政策' • =Merge=> `土地公有', `有政策' (NOT: `土地', `公有', `政策') • 團結: mis-merge=> 修 憲 亂 成 一 團結 果 什麼 也 沒 得 到 • The MERGE operation ONLY recovers over-split candidates but NOT over-merged (under-segmented) candidates Jing-Shin Chang

  12. Problems with Segmentation Using Known Words • Using known words for segmentation without considering potential unknown words (zero word probabilities for unknown words) • cannot take advantage of contextual constraints over unknown words to get the desired segmentation • millions of randomly merged unknown word candidates to filter • (-省都委會:) 獲省都委會同意 => 獲 省 都 委 會同 意 • =>省都|省都委|省都委會|都委|都委會同|委會同|委會同意 • (+省都委會:) 獲省都委會同意 => 獲 省都委會 同意 • an extra disambiguation step for resolving overlapping candidates • e.g., 省都 vs 省都委會 (etc.) • e.g., 彰化 縣 警 刑警隊 少年組 Jing-Shin Chang

  13. Probabilistic Models • Find all possible segmentation patterns, and select the best one according to a scoring function • Advantages of Probabilistic Models: • Soft decision: retain all possible segmentations, without pre-excluding any possibility, and select the best by a scoring function which maximizes the joint likelihood of the segmentation • Take contextual constraints into account to maximize the likelihood of the whole segmentation pattern • all words in a segmentation pattern impose constraints on neighboring words • the segmentation pattern which best fits such constraints (or criteria) is selected • Unsupervised training is possible even when there is no dictionary, or only a small seed dictionary or seed segmentation corpus • because many probabilistic optimization criteria can be maximized by iteratively trying possible segmentations and re-estimating in some known way • e.g., EM & Viterbi training Jing-Shin Chang

  14. Probabilistic Model • Basic Model: • Word Uni-gram Model [Chang 1991]: jointly optimize the likelihood of the segmentation by the product of the probabilities of the constituent words in the segmentation pattern • Dynamic Programming: fast search for the best possible segmentation even though there is a vast number of possible segmentation patterns • Other Models: [Chiang et al., 1992] • Taking parts-of-speech (詞類) and morphological (詞素) features into account • Taking simple, yet probably useful, features like length distribution into account • Taking unknown words into consideration Jing-Shin Chang

  15. Word Uni-gram Model for Identifying Words • Segmentation Stage: Find the best segmentation pattern S* which maximizes the following likelihood function of the input corpus: • $S^*(V) = \arg\max_{S_j} P(S_j \mid c_1^n) \approx \arg\max_{S_j} \prod_{i=1}^{m_j} P(w_{j,i}), \quad w_{j,i} \in V(t)$ • $c_1^n$: input characters $c_1, c_2, \ldots, c_n$ • $S_j$: j-th segmentation pattern, consisting of $\{w_{j,1}, w_{j,2}, \ldots, w_{j,m_j}\}$ • $V(t)$: vocabulary (n-grams in the augmented dictionary used for segmentation) • $S^*(V)$: the best segmentation (a function of V) Jing-Shin Chang

  16. Dynamic Programming (DP) • Dynamic Programming • A methodology for searching for the best solution without explicitly enumerating all candidates in the solution space • Solve the optimization problem of the whole task by first solving much simpler sub-problems, whose optimal solutions do not depend on the combinatorially many configurations of the remaining parts of the whole problem • this virtually reduces the large solution space to a very small one • Solve successively larger sub-problems after the simpler ones are solved, finally solving the optimization problem of the whole task • Requirement • The optimum solution of a sub-problem must not depend on the remaining parts of the whole problem Jing-Shin Chang

  17. Dynamic Programming (DP) Steps • Initialization • initialize known path scores • Recursion • find the best previous local path, assuming the current node is one of the nodes on the best path, by comparing sums of local and accumulated scores • keep a trace of the best previous path, and • the accumulated score for this best path • Termination • Path Backtracking • trace back the best path Jing-Shin Chang

  18. Dynamic Programming (DP) • Examples: • shortest path problem • speech recognition: DTW (Dynamic Time Warping) • minimum alignment cost between an input speech feature vector and the speech feature vector for the typical utterance of a word • speech-to-speech distance measure • speech-text alignment • align words in a speech waveform with the written transcription • an extension of isolated word recognition using DTW • speech-to-phonetic transcription • spelling correction (see the edit-distance sketch below): • minimum editing cost between an input string and a reference pattern (e.g., a dictionary word) • editing operations: insertion, deletion, substitution (including matching) • advanced operations: swapping • post-editing cost: • cost required to modify machine-translated text into a fluent translation Jing-Shin Chang
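For the spelling-correction example, the classic DP recurrence can be sketched in a few lines of Python; this is a generic Levenshtein distance with unit costs, and the swapping operation mentioned above is omitted:

```python
# Edit distance by dynamic programming: minimum number of insertions,
# deletions, and substitutions turning `source` into `target`.
def edit_distance(source, target):
    m, n = len(source), len(target)
    # dist[i][j] = cost of editing source[:i] into target[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i  # delete all remaining source characters
    for j in range(n + 1):
        dist[0][j] = j  # insert all remaining target characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if source[i - 1] == target[j - 1] else 1  # match or substitute
            dist[i][j] = min(dist[i - 1][j] + 1,      # deletion
                             dist[i][j - 1] + 1,      # insertion
                             dist[i - 1][j - 1] + sub)
    return dist[m][n]

print(edit_distance("kitten", "sitting"))  # 3
```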

  19. Dynamic Programming (DP) • Examples: • Bilingual Text Alignment • find corresponding sentences in a parallel bilingual corpus • sentence-length to sentence-length distribution • in words • in characters • Word Correspondence, Translation Equivalents, Bilingual Collocation (連語) • find corresponding words in aligned sentences of bilingual corpora • word association metrics as the distance measure for matching • word association metrics: anything that indicates the degree of (in-)dependency between word pairs can be used for this purpose • to be addressed in later chapters … • Machine Translation Jing-Shin Chang

  20. Application of DP: Feedback Control via a Parameterized System Jing-Shin Chang

  21. Application of DP: Feedback-Controlled Parameterized MT Architecture • Metrics for error distance • (i) Levenshtein Distance • (ii) $e(t-1, i, j) = \log P(T(i)^* \mid S(i)) - \log P(T(i,j) \mid S(i))$ Jing-Shin Chang

  22. Dynamic Programming (DP) for Finding the Best Word Segmentation • Ex. 國民大會代表人民行使職權 (c1, c2, …, cN) • Scan all character boundaries left-to-right • For each word boundary with index ‘idx’, assuming that idx is one of the best segmentation boundaries, the best previous segmentation boundary idx_best can be found by • idx_best = argmax {accumulative_score(0, j) x d(j, idx)} • over all j = idx-1 down to max(idx - k, 0) (k: maximum word length, in characters) • d(j, idx) = Prob(c[j+1…idx]) (the probability that c[j+1…idx] forms a word) • initialization: accumulative_score(0, 0) = 1.0 • update: accumulative_score(0, idx) = accumulative_score(0, idx_best) x d(idx_best, idx) • After scanning all word boundaries, and finding all (assumed) best previous word boundaries, trace back from the end (which is surely one of the best word boundaries) to get the real best segments (see the sketch below) • right-to-left scanning is virtually identical to the above steps Jing-Shin Chang
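A Python sketch of the recursion just described, using log probabilities in place of the products; the word_prob dictionary, its probability values, and the unknown-character floor are illustrative assumptions:

```python
import math

# Left-to-right DP search for the most probable segmentation under a
# word uni-gram model, with back-pointers for path backtracking.
def dp_segment(text, word_prob, max_len=4):
    n = len(text)
    best_score = [float("-inf")] * (n + 1)  # accumulative_score(0, idx), in logs
    best_prev = [0] * (n + 1)               # best previous boundary for each idx
    best_score[0] = 0.0                     # initialization: accumulative_score(0, 0) = 1.0
    for idx in range(1, n + 1):
        for j in range(max(idx - max_len, 0), idx):
            word = text[j:idx]
            # Unseen single characters get a tiny floor probability so the
            # search never dead-ends on unknown words (an illustrative choice).
            prob = word_prob.get(word, 1e-8 if len(word) == 1 else 0.0)
            if prob > 0.0 and best_score[j] + math.log(prob) > best_score[idx]:
                best_score[idx] = best_score[j] + math.log(prob)
                best_prev[idx] = j
    segments, idx = [], n                   # backtrack from the final boundary
    while idx > 0:
        segments.append(text[best_prev[idx]:idx])
        idx = best_prev[idx]
    return segments[::-1]

# Made-up probabilities, for illustration only:
word_prob = {"國民": 0.02, "大會": 0.02, "國民大會": 0.01,
             "代表": 0.03, "人民": 0.03, "行使": 0.01, "職權": 0.01}
print(dp_segment("國民大會代表人民行使職權", word_prob))
# ['國民大會', '代表', '人民', '行使', '職權']
```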

  23. Unsupervised Word Segmentation: Viterbi Training for Identifying New Words • Criteria: • 1. produce words that maximize the likelihood of the input corpus • 2. avoid producing over-segmented entries due to unknown words • Viterbi Training Approach: • Re-estimate the parameters of the segmentation model iteratively to improve the system performance, where the word candidates in the augmented dictionary contain known words and potential new words in the input corpus. • Potential unknown words will be assigned non-zero probabilities automatically in the above process. Jing-Shin Chang

  24. Viterbi Training for Identifying Words (cont.) • Segmentation Stage: Find the best segmentation pattern S* which maximizes the following likelihood function of the input corpus: • $S^*(V) = \arg\max_{S_j} P(S_j \mid c_1^n) \approx \arg\max_{S_j} \prod_{i=1}^{m_j} P(w_{j,i}), \quad w_{j,i} \in V(t)$ • $c_1^n$: input characters $c_1, c_2, \ldots, c_n$ • $S_j$: j-th segmentation pattern, consisting of $\{w_{j,1}, w_{j,2}, \ldots, w_{j,m_j}\}$ • $V(t)$: vocabulary (n-grams in the augmented dictionary used for segmentation) • $S^*(V)$: the best segmentation (a function of V) Jing-Shin Chang

  25. Viterbi Training for Identifying Words (cont.) • Reestimation Stage: Estimate the word probabilities which maximize the likelihood of the input text, as relative frequencies in the current best segmentation: • Initial Estimation: $P^{(0)}(w_i) = \frac{\#(w_i \text{ in the initial segmentation})}{\sum_k \#(w_k \text{ in the initial segmentation})}$ • Reestimation: $P^{(t+1)}(w_i) = \frac{\#(w_i \text{ in } S^*(V^{(t)}))}{\sum_k \#(w_k \text{ in } S^*(V^{(t)}))}$ Jing-Shin Chang
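Putting the two stages together, a minimal Viterbi-training loop might look like the following Python sketch. It reuses the dp_segment function sketched under slide 22; the corpus, iteration count, and unsmoothed relative-frequency estimates are illustrative assumptions, and a real system would also inject candidate n-grams from the corpus into word_prob so that new words can receive probability mass:

```python
from collections import Counter

# Viterbi training: alternate the segmentation stage and the re-estimation
# stage, so potential unknown words in the candidate set can acquire
# non-zero probabilities from the corpus itself.
def viterbi_train(corpus, word_prob, iterations=5, max_len=4):
    for _ in range(iterations):
        counts = Counter()
        for sentence in corpus:                       # segmentation stage
            counts.update(dp_segment(sentence, word_prob, max_len))
        total = sum(counts.values())                  # re-estimation stage:
        word_prob = {w: c / total for w, c in counts.items()}  # P(w) = #(w) / N
    return word_prob
```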
