
Pat-Tree-Based Adaptive keyphrase Extraction for Intelligent Chinese Information Retrieval

Pat-Tree-Based Adaptive keyphrase Extraction for Intelligent Chinese Information Retrieval. Source: Institute of Information Science, Academia Sinica, Taipei, Taiwan, R.O.C. Students: 陳道輝, 周鉦琪, 葉飛. Advisor: Prof. 黃三益. Abstract. PAT-tree-based adaptive approach


Presentation Transcript


  1. Pat-Tree-Based Adaptive Keyphrase Extraction for Intelligent Chinese Information Retrieval • Source: Institute of Information Science, Academia Sinica, Taipei, Taiwan, R.O.C. • Students: 陳道輝, 周鉦琪, 葉飛 • Advisor: Prof. 黃三益

  2. Abstract • PAT-tree-based adaptive approach • IR applications: automatic term suggestion, domain-specific lexicon construction, book indexing, and document classification

  3. Introduction • Keyphrase (keyword) extraction in Chinese is a critical problem because word segmentation and unknown-word identification are difficult (e.g., the unknown word 哈電族).

  4. Definition of the Problems • Lexical pattern (LP): a string of more than one successive character that occurs a certain number of times in a text collection from a specific domain. • For example: 關鍵詞抽取 (keyword extraction) • LPs: 關鍵、鍵詞、詞抽、抽取、關鍵詞、鍵詞抽、詞抽取、關鍵詞抽、鍵詞抽取、關鍵詞抽取
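
The candidate list on this slide can be reproduced by enumerating every contiguous substring of at least two characters. The sketch below covers only that enumeration step; the function name `lexical_pattern_candidates` is hypothetical, and the frequency condition in the LP definition ("certain occurrences in a text collection") is not applied here.

```python
def lexical_pattern_candidates(segment, min_len=2):
    """Return every contiguous substring of `segment` that has at least
    `min_len` characters, ordered by length and then by start position."""
    n = len(segment)
    return [
        segment[start:start + length]
        for length in range(min_len, n + 1)
        for start in range(n - length + 1)
    ]


candidates = lexical_pattern_candidates("關鍵詞抽取")
print(len(candidates))  # 10 candidates, as listed on the slide
print("、".join(candidates))
```

A string of n characters yields n(n-1)/2 such candidates, so a real system would need the frequency threshold to keep the set manageable.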

  5. Definition of the Problems (cont) • Complete lexical pattern (CLP): an LP that has a complete meaning and proper lexical boundaries. • For example: 關鍵詞抽取 • CLPs: 關鍵、抽取、關鍵詞、關鍵詞抽取

  6. Definition of the Problems (cont) • Significant lexical pattern (SLP): a CLP that is either “specific” or “significant” in the database • For example: 關鍵詞抽取 • SLPs: 關鍵詞、關鍵詞抽取

  7. Definition of the Problems (cont) • Definition 1:SLP Extraction Problem • Definition 2:CLP Estimation Problem • To solve problem 1, first we should solve problem 2

  8. Definition of the Problems (cont) • Proposed Approach: 3 modules • Text analysis and PAT-tree indexing module • CLP extraction module • SLP extraction module

  9. Definition of the Problems (cont)

  10. Estimation of CLP • Most CLPs show strong association between the overlapping substrings that compose them • This association is measured by the association norm estimation (AE) function • If AE is large for a pattern x, its two longest substrings y and z occur together in the text collection in many cases (e.g., 關鍵詞抽取 with 鍵詞抽取 and 關鍵詞抽)
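
The slide names the AE function but does not show its formula. The sketch below assumes one plausible form, AE(x) = f(x) / (f(y) + f(z) - f(x)), where y is x without its last character, z is x without its first character, and f counts occurrences in the collection. Both that formula and the helper names `count_occurrences` and `association_estimate` are assumptions made for illustration, not taken from the slides.

```python
def count_occurrences(pattern, segments):
    """Count occurrences of `pattern` (overlaps allowed) across all segments."""
    total = 0
    for seg in segments:
        start = seg.find(pattern)
        while start != -1:
            total += 1
            start = seg.find(pattern, start + 1)
    return total


def association_estimate(x, segments):
    """Assumed form: AE(x) = f(x) / (f(y) + f(z) - f(x)),
    with y = x minus its last character and z = x minus its first."""
    if len(x) < 2:
        raise ValueError("x must contain at least two characters")
    f_x = count_occurrences(x, segments)
    f_y = count_occurrences(x[:-1], segments)
    f_z = count_occurrences(x[1:], segments)
    denominator = f_y + f_z - f_x
    return f_x / denominator if denominator else 0.0


# Toy collection, invented for this example.
segments = ["關鍵詞抽取", "關鍵詞抽取方法", "抽取關鍵詞"]
print(association_estimate("關鍵詞抽取", segments))  # 1.0: y and z never occur without x
print(association_estimate("詞抽", segments))        # 0.5: 詞 and 抽 also occur apart
```

Under this assumed form the value is 1 when y and z only ever appear as parts of x, and it falls toward 0 as they appear independently.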

  11. Estimation of CLP (cont) • AE alone is not enough to check whether x has complete lexical boundaries (e.g., 關鍵詞) • To overcome this, two additional metrics are used: LCD (left context dependency) and RCD (right context dependency), e.g., 李登輝 • With these metrics: • x is a CLP iff it has neither LCD nor RCD and AE(x) > t3, where t3 is a threshold

  12. Estimation of CLP (cont) • x has LCD if |L| < t1 or max over z ∈ L of f(zx)/f(x) > t2, where t1 and t2 are threshold values and L is the set of unique left-adjacent characters of x • x has RCD if |R| < t1 or max over y ∈ R of f(xy)/f(x) > t2, where t1 and t2 are threshold values and R is the set of unique right-adjacent characters of x
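
The two dependency tests can be sketched by counting the characters that appear immediately before and after each occurrence of x. The helper names, the toy segments, and the threshold values t1 = 2 and t2 = 0.5 are all hypothetical; occurrences at a segment boundary are assumed to contribute no neighbour, which is a choice the slides do not specify.

```python
from collections import Counter


def context_counts(x, segments):
    """Return (left, right, f_x): counters of the characters directly before
    and after each occurrence of x, plus the total number of occurrences."""
    left, right, f_x = Counter(), Counter(), 0
    for seg in segments:
        start = seg.find(x)
        while start != -1:
            f_x += 1
            end = start + len(x)
            if start > 0:
                left[seg[start - 1]] += 1   # contributes to f(zx)
            if end < len(seg):
                right[seg[end]] += 1        # contributes to f(xy)
            start = seg.find(x, start + 1)
    return left, right, f_x


def is_dependent(neighbours, f_x, t1, t2):
    """True if |neighbours| < t1 or the most frequent neighbour's share exceeds t2."""
    if f_x == 0 or len(neighbours) < t1:
        return True
    return max(neighbours.values(), default=0) / f_x > t2


# Toy collection and thresholds, invented for this example.
segments = ["發展關鍵詞", "進行關鍵詞", "關鍵詞抽取", "關鍵詞索引"]
T1, T2 = 2, 0.5

left, right, f_x = context_counts("鍵詞", segments)
print(is_dependent(left, f_x, T1, T2))   # True: 鍵詞 is always preceded by 關

left, right, f_x = context_counts("關鍵詞", segments)
print(is_dependent(left, f_x, T1, T2), is_dependent(right, f_x, T1, T2))  # False False
```

In this toy data 鍵詞 shows left context dependency while 關鍵詞 shows neither kind, which is the distinction the boundary check is meant to capture.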

  13. Text Analysis and PAT-Tree Indexing • The PAT tree is the primary implementation structure, used for both text retrieval and keyphrase extraction • Delimiters (commas, quotation marks, periods) determine segment boundaries; semi-infinite strings are then built from each segment • For example: 個人電腦,人腦 • Semi-infinite strings: 個人電腦, 人電腦, 電腦, 腦, 人腦, 腦 • Node information: comparison bit, number of external nodes, frequency • The PAT tree makes prefix search easy. • The inverse PAT tree (IPAT) makes suffix search easy.
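
The segmentation step can be illustrated with a short sketch that splits text at delimiters and emits every suffix of each segment as a semi-infinite string. The delimiter set in `DELIMITERS` and the function name are assumptions; the slide lists only commas, quotation marks, and periods.

```python
import re

# Assumed delimiter set: ASCII and full-width punctuation plus whitespace.
DELIMITERS = r'[,,。.、;;!!??"“”「」\s]+'


def semi_infinite_strings(text):
    """Split `text` at delimiters and return every suffix of every segment,
    in order of appearance."""
    result = []
    for segment in re.split(DELIMITERS, text):
        for start in range(len(segment)):
            result.append(segment[start:])
    return result


print(semi_infinite_strings("個人電腦,人腦"))
# ['個人電腦', '人電腦', '電腦', '腦', '人腦', '腦']
```

The suffix 腦 appears twice, which is why each node also records a frequency.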

  14. Text Analysis and PAT-Tree Indexing (cont) • Convert the semi-infinite strings to bit sequences • Build the PAT tree from these bit sequences and the positions at which they differ • An inverse PAT tree is also built over the reversed data streams of the database, to check the occurrences of LSs and RSs • Examples of reversed strings: 詞鍵關、詞鍵、詞鍵關展發、詞鍵關行進
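
A minimal sketch of the bit conversion and of locating the first differing bit, which is the value a PAT-tree node stores as its comparison bit. The Big5 encoding is an assumption inferred from the bit patterns on slide 20 (for example, 人 as 10100100 01001000), and the 1-based bit numbering is inferred from the node labels on slide 21. The function names are hypothetical, and shorter strings are padded with zero bits.

```python
def to_bits(text, encoding="big5"):
    """Encode `text` and return its bytes as a string of '0'/'1' characters."""
    return "".join(format(byte, "08b") for byte in text.encode(encoding))


def first_diff_bit(bits_a, bits_b):
    """Return the 1-based position of the first differing bit, padding the
    shorter sequence with zeros; return None if the sequences are equal."""
    width = max(len(bits_a), len(bits_b))
    a = bits_a.ljust(width, "0")
    b = bits_b.ljust(width, "0")
    for position, (x, y) in enumerate(zip(a, b), start=1):
        if x != y:
            return position
    return None


person = to_bits("人")                 # 16 bits under Big5
print(person)
print(first_diff_bit(person + "10111001", person + "10111000"))
```

Two sequences that share the 16 bits of 人 and then continue with 10111001 versus 10111000 (the leading bytes shown for 電 and 腦 on slide 20) first differ at bit 24, the comparison bit of the (24,2,1) node on slide 21.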

  15. Text Analysis and PAT-Tree Indexing (cont) • Why use a PAT tree (Patricia tree)? • The number of key comparisons per search is low (logarithmic). • Computation time and space requirements are reduced. • Search is efficient. • The PAT tree is used to check RCD. • The inverse PAT tree is used to check LCD.
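
To show how prefix search supports the RCD and LCD checks, the sketch below uses a sorted list of semi-infinite strings with binary search as a simple stand-in for a PAT tree. It is not a Patricia-tree implementation; the class name `SuffixIndex`, its methods, and the toy segments are hypothetical.

```python
from bisect import bisect_left


class SuffixIndex:
    """Sorted list of all suffixes of the segments (reversed if inverse=True)."""

    def __init__(self, segments, inverse=False):
        if inverse:
            segments = [seg[::-1] for seg in segments]
        self.suffixes = sorted(
            seg[i:] for seg in segments for i in range(len(seg))
        )

    def _range(self, prefix):
        # All suffixes starting with `prefix` lie in one contiguous run.
        lo = bisect_left(self.suffixes, prefix)
        hi = bisect_left(self.suffixes, prefix + "\U0010FFFF")
        return lo, hi

    def frequency(self, prefix):
        lo, hi = self._range(prefix)
        return hi - lo

    def next_chars(self, prefix):
        lo, hi = self._range(prefix)
        k = len(prefix)
        return {s[k] for s in self.suffixes[lo:hi] if len(s) > k}


segments = ["發展關鍵詞", "進行關鍵詞", "關鍵詞抽取"]
forward = SuffixIndex(segments)
inverse = SuffixIndex(segments, inverse=True)

print(forward.frequency("關鍵詞"))    # 3
print(forward.next_chars("關鍵詞"))   # right-adjacent characters: {'抽'}
print(inverse.next_chars("詞鍵關"))   # left-adjacent characters of 關鍵詞: 展 and 行
```

Searching the inverse index for the reversed pattern 詞鍵關 returns the characters that precede 關鍵詞 in the original text, which is the information the LCD test needs.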

  16. Extraction of SLP • A CLP is not always an SLP: being a CLP does not by itself show significance in the text collection • Many CLPs are in common daily use • Every CLP is checked against a set of lexical rules and a general-domain corpus • Rules cover: numbers, adverbs, time-related terms • The general-domain PAT tree is compared with the specific-domain PAT tree.
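
The slides do not say how the general-domain and specific-domain PAT trees are compared. One plausible reading is a relative-frequency comparison, sketched below under that assumption. The function name, the `excluded` set standing in for the lexical rules, the add-one smoothing, the threshold `min_ratio`, and all counts are hypothetical.

```python
def is_significant(clp, domain_freq, domain_total, general_freq, general_total,
                   excluded=frozenset(), min_ratio=2.0):
    """Assumed test: keep a CLP if no rule excludes it and its relative
    frequency in the domain corpus is at least `min_ratio` times its
    (add-one smoothed) relative frequency in the general corpus."""
    if clp in excluded:
        return False
    domain_rate = domain_freq.get(clp, 0) / domain_total
    general_rate = (general_freq.get(clp, 0) + 1) / (general_total + 1)
    return domain_rate / general_rate >= min_ratio


# Invented counts for illustration only.
domain_freq = {"關鍵詞": 30, "今天": 5}
general_freq = {"關鍵詞": 2, "今天": 400}

print(is_significant("關鍵詞", domain_freq, 1000, general_freq, 100000))  # True
print(is_significant("今天", domain_freq, 1000, general_freq, 100000))    # False
```

In this toy data 今天 (today) is frequent in general text, so it is filtered out even though it is a complete word, while 關鍵詞 is kept as domain-specific.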

  17. Evaluation • Extraction of SLPs • Three people were asked to select CLPs and keyphrases from 50 “seed sentences” • These test data are used to measure the accuracy of SLP extraction

  18. Evaluation (cont) • Speed and Space Requirements

  19. Conclusion • This method reduced the difficulty of keyphrase extraction in Chinese, with better performance

  20. Semi-infinite strings and their bit sequences for the text 個人電腦,人腦 (string / node number: bits) • 個人電腦 / node 0: 10101101 11010011 10100100 … • 人電腦 / node 2: 10100100 01001000 10111001 … • 電腦 / node 4: 10111001 01110001 00000000 … • 腦 / node 6: 10111000 00000000 00000000 … • 人腦 / node 9: 10100100 01001000 00000000 … • 腦 / node 6: 10111000 00000000 00000000 …

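  21. PAT tree for the example; nodes are labelled (comparison bit, number of external nodes, string frequency) • Labelled nodes: (0,6,1), (4,6,1), (5,3,1), (8,3,2), (24,2,1) • Node numbers shown: 0, 2, 4, 6, 9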
