
Terminology Extraction System based on Vocabulary Space

German-Japan NL WS in Sapporo, 2003/7/4. Terminology Extraction System based on Vocabulary Space. Hiroshi Nakagawa, Information Technology Center, The University of Tokyo.


Presentation Transcript


  1. German-Japan NL WS in Sapporo, 2003/7/4 • Terminology Extraction System based on Vocabulary Space • Hiroshi Nakagawa, Information Technology Center, The University of Tokyo

  2. 歩留まり (bu-domari): success rate?? • 横持ち (literally "side take"): transportation between a main transportation hub (such as an airport or train station) and the destination or starting point • 玉掛け (literally "ball hinge"): to operate a power shovel • Really useful and interesting terminology

  3. Long Compound Nouns • German • German-Japan • German-Japan natural • German-Japan natural language • German-Japan natural language processing • German-Japan natural language processing workshop • German-Japan natural language processing workshop program • German-Japan natural language processing workshop program chair

  4. German-Japan natural language processing workshop program chair and • German-Japan natural language processing workshop program chair and ACL • German-Japan natural language processing workshop program chair and ACL2003 • German-Japan natural language processing workshop program chair and ACL2003 general

  5. German-Japan natural language processing workshop program chair and ACL2003 general chair Professor • German-Japan natural language processing workshop program chair and ACL2003 general chair Professor Tsujii • German-Japan natural language processing workshop program chair and ACL2003 general chair Professor Tsujii’s laboratory

  6. German-Japan natural language processing workshop program chair and ACL2003 general chair Professor Tsujii’s laboratory • German-Japan natural language processing workshop program chair and ACL2003 general chair Professor Tsujii’s laboratory chief scientist Dr. Tsuruoka • Long compound nouns (NPs) are a source of information about terminology

  7. Objective • An up-to-date domain terminology dictionary is the gateway to various technological and academic fields. • To build one, we first need high-quality terminology for the target domain. • Which corpus should we use: an ordinary corpus or Web pages?

  8. Concepts • Methodological classification: • Supervised-learning-based extraction • finding highly influential features • surrounding patterns of the target expression • technology developed for the named entity (NE) task • Statistics-based extraction (our target) • document-space-based statistics • linguistic structure: formalisms based on syntactic or semantic structure • vocabulary-space-based statistics (our target)

  9. Document space versus Vocabulary space • [Figure: document space (Web documents containing terms such as abc, ab, lmn, xy) versus vocabulary space (the terms themselves)]

  10. Document space based statistics • Old-fashioned • Weight term candidates by their occurrences in the document space (a corpus or the Web) and rank them in descending order. • Term frequency or tf*idf for basic nouns • To extract compound nouns: a contingency matrix and co-occurrence-based decisions with MI, χ², Dice, etc.
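The co-occurrence measures named on this slide can be sketched from a 2x2 contingency table. This is a minimal illustration, not the system's code: the function name `association_scores`, the cell layout, and the counts are assumptions made for the example, and the MI shown is the pointwise variant, which needs `n11 > 0`.

```python
import math

def association_scores(n11, n12, n21, n22):
    """Score a word pair (x, y) from its 2x2 contingency table.

    n11: x followed by y         n12: x followed by another word
    n21: another word before y   n22: pairs with neither x nor y
    """
    n = n11 + n12 + n21 + n22
    row1, col1 = n11 + n12, n11 + n21   # marginal counts of x and of y
    row2, col2 = n21 + n22, n12 + n22   # complementary marginals

    # Pointwise mutual information: log2 of observed over expected co-occurrence.
    mi = math.log2(n11 * n / (row1 * col1))
    # Pearson chi-square, closed form for a 2x2 table.
    chi2 = n * (n11 * n22 - n12 * n21) ** 2 / (row1 * row2 * col1 * col2)
    # Dice coefficient: co-occurrences relative to the two marginals.
    dice = 2 * n11 / (row1 + col1)
    return {"mi": mi, "chi2": chi2, "dice": dice}

# Made-up counts for one candidate noun pair.
scores = association_scores(n11=30, n12=20, n21=10, n22=940)
print(scores)
```

A pair would be accepted as a compound noun candidate when its score exceeds a threshold chosen for the corpus.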

  11. Linguistic structure based method • Syntactic structure • POS patterns such as {adj (noun)+} • phrasal verbs, etc. • Semantic structure of compound nouns • Predicate-argument structure (e.g. Pustejovsky) • Case frame of the predicate • Single and compound nouns are not treated equally.
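A POS pattern such as {adj (noun)+} can be matched by encoding the tag sequence as a string and running a regular expression over it. The sketch below assumes input that is already POS-tagged; the tag names (`ADJ`, `NOUN`, ...) and the function name are illustrative choices, not taken from the talk.

```python
import re

# One-character codes so that regex offsets map directly to token positions.
TAG_CODES = {"ADJ": "A", "NOUN": "N"}

def match_adj_noun_plus(tagged_tokens):
    """Return every token span matching one adjective followed by one or more nouns."""
    codes = "".join(TAG_CODES.get(tag, "x") for _, tag in tagged_tokens)
    spans = []
    for m in re.finditer(r"AN+", codes):
        words = [word for word, _ in tagged_tokens[m.start():m.end()]]
        spans.append(" ".join(words))
    return spans

tagged = [
    ("the", "DET"), ("natural", "ADJ"), ("language", "NOUN"),
    ("processing", "NOUN"), ("workshop", "NOUN"), ("is", "VERB"), ("open", "ADJ"),
]
matches = match_adj_noun_plus(tagged)
print(matches)  # ['natural language processing workshop']
```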

  12. Vocabulary space based method • Statistics of the vocabulary space, such as • Statistics of nesting (embedding) relations (C-value) • How many compound nouns the target noun forms (LR = our proposal) • Application of link structure analysis of Web pages (PageRank, HITS) • Single and compound nouns are treated equally

  13. Our objective • Experimental analysis and evaluation of various term extraction methods with • a test collection (TMREC) corpus • a Web page corpus • domain dictionaries on the Web or on CD-ROM as the gold standard • Term extraction system repository • Gensen Web (言選Web) • http://gensen.dl.itc.u-tokyo.ac.jp/gensenweb_eng.html • Finally, an automatic builder for up-to-date domain term dictionaries

  14. ATR by compound noun statistics

  15. 言選 Gensen Web • Automatic term extraction from Web pages • Step 1. Term candidate extraction • separate the text at stop words (or use a morphological analyzer) to generate candidates • Step 2. Score the candidates to rank them • our scoring mechanism is innovative and unique
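Step 1 can be illustrated for English text as follows. This is only a sketch of the stop-word idea under simplifying assumptions: the stop-word list is a tiny hypothetical one, punctuation is treated as a separator, and the function name is invented for the example. It is not the Gensen Web implementation.

```python
import re

# Hypothetical stop-word list; a real system would use a far larger one.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "for", "and",
              "is", "are", "to", "by", "with"}

def extract_candidates(text):
    """Split text at stop words and punctuation; each remaining word run is a candidate."""
    candidates = []
    # Punctuation also ends a candidate, so split into clauses first.
    for clause in re.split(r"[.,;:!?()]", text.lower()):
        run = []
        for word in clause.split():
            if word in STOP_WORDS:
                if run:
                    candidates.append(" ".join(run))
                run = []
            else:
                run.append(word)
        if run:
            candidates.append(" ".join(run))
    return candidates

candidates = extract_candidates(
    "The learner model is a part of the knowledge base for problem solving."
)
print(candidates)  # ['learner model', 'part', 'knowledge base', 'problem solving']
```

The candidates would then be passed to the scoring step (Step 2) to be ranked.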

  16. Domain-specific terms expressing domain concepts: about 85% compound nouns, about 15% simple nouns • Simple noun: a noun that cannot be divided into shorter nouns • Compound noun: an uninterrupted sequence of simple nouns • Our purpose is to automatically extract domain-specific terms, including both compound and simple nouns, from a domain corpus.

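  17. Scoring of Simple Nouns • A simple noun N has n distinct left-adjoining nouns L_i and m distinct right-adjoining nouns R_j, each with a frequency • Example for N = trigram • Left nouns (frequency): noun (3), character (1), class (1), so n=3 and LN(trigram)=5 • Right nouns (frequency): statistics (2), acquisition (1), so m=2 and RN(trigram)=3 • Principle: a simple noun that contributes to forming a large number of compound nouns gets a high score.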
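The LN and RN counts can be computed directly from a list of compound noun occurrences. In the sketch below, the occurrence list is hypothetical and was chosen only so that it reproduces the slide's figures for "trigram" (LN = 5, n = 3, RN = 3, m = 2); the function names are likewise assumptions.

```python
from collections import Counter, defaultdict

def adjacency_stats(compound_nouns):
    """Count, for each simple noun, the nouns adjoining it on the left and right."""
    left = defaultdict(Counter)    # left[N][L]  = how often L directly precedes N
    right = defaultdict(Counter)   # right[N][R] = how often R directly follows N
    for cn in compound_nouns:
        nouns = cn.split()
        for prev, nxt in zip(nouns, nouns[1:]):
            right[prev][nxt] += 1
            left[nxt][prev] += 1
    return left, right

def total_and_distinct(table, noun):
    """Return (summed frequency, number of distinct neighbours) for a noun."""
    neighbours = table.get(noun, Counter())
    return sum(neighbours.values()), len(neighbours)

# Hypothetical occurrences, one list entry per occurrence in the corpus.
occurrences = [
    "noun trigram", "noun trigram", "noun trigram statistics",
    "character trigram", "class trigram",
    "trigram statistics", "trigram acquisition",
]
left, right = adjacency_stats(occurrences)
ln, n = total_and_distinct(left, "trigram")
rn, m = total_and_distinct(right, "trigram")
print(ln, n, rn, m)  # 5 3 3 2
```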

  18. Scoring of compound nouns: GM(Compound Noun) • GM(CN) is a geometric mean, which makes the score independent of the length of CN.

  19. New scoring function: FGM(CN) • If CN occurs independently, then $FGM(CN) = GM(CN) \times f(CN)$, where f(CN) is the number of independent occurrences of CN (occurrences in which CN is not part of a longer compound noun) • Ex.: $GM(\mathrm{trigram}) = ((5+1)\times(3+1))^{1/2} \approx 4.9$; if f(trigram) = 5, then FGM(trigram) = 24.5
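The two scores can be sketched as below. The general GM formula is not present in this transcript, so the code assumes the form GM(CN) = (∏ (LN(N_i)+1)(RN(N_i)+1))^(1/(2L)) over the L simple nouns of CN, which reproduces the worked example for "trigram" (L = 1). Returning plain GM when a term never occurs independently is also an assumption; the slide only states the independent case.

```python
def gm(compound_noun, ln, rn):
    """Geometric mean of (LN+1)(RN+1) over the simple nouns of a compound noun.

    Assumed form: the 1/(2L) exponent is what makes the score length-independent.
    """
    nouns = compound_noun.split()
    product = 1.0
    for noun in nouns:
        product *= (ln.get(noun, 0) + 1) * (rn.get(noun, 0) + 1)
    return product ** (1.0 / (2 * len(nouns)))

def fgm(compound_noun, ln, rn, independent_freq):
    """GM weighted by f(CN), the number of independent occurrences of CN."""
    f = independent_freq.get(compound_noun, 0)
    score = gm(compound_noun, ln, rn)
    # Assumption: fall back to GM when CN has no independent occurrence.
    return score * f if f > 0 else score

ln = {"trigram": 5}   # LN(trigram) from the slide
rn = {"trigram": 3}   # RN(trigram) from the slide
gm_score = gm("trigram", ln, rn)
fgm_score = fgm("trigram", ln, rn, {"trigram": 5})
print(round(gm_score, 2), round(fgm_score, 2))
```

The unrounded FGM is about 24.49; the slide's 24.5 follows from multiplying the rounded GM of 4.9 by 5.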

  20. Modified C-value (MC-value) • C-value (Frantzi & Ananiadou, 1996) is modified so that it can also score a simple noun • length(a): number of simple nouns that make up a • freq(a): frequency of a • t(a): frequency of candidate compound nouns that include a • c(a): number of distinct candidate compound nouns that include a
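The MC-value formula itself is missing from this transcript. The sketch below assumes MC-value(a) = length(a) × (freq(a) − t(a)/c(a)), that is, the C-value form with length(a) in place of length(a) − 1 so that a simple noun (length 1) does not score zero. It also assumes that freq(a) counts all occurrences of a, including nested ones, and the candidate counts are made up.

```python
from collections import Counter

def contains(longer, shorter):
    """True if the noun sequence `shorter` occurs contiguously inside a longer one."""
    n, m = len(longer), len(shorter)
    return m < n and any(longer[i:i + m] == shorter for i in range(n - m + 1))

def mc_values(candidate_counts):
    """Score each candidate with the assumed MC-value formula."""
    terms = {term: tuple(term.split()) for term in candidate_counts}
    scores = {}
    for a, a_nouns in terms.items():
        # Longer candidates that include a.
        longer = [b for b, b_nouns in terms.items() if contains(b_nouns, a_nouns)]
        t = sum(candidate_counts[b] for b in longer)   # t(a)
        c = len(longer)                                # c(a)
        nested_penalty = t / c if c else 0.0
        scores[a] = len(a_nouns) * (candidate_counts[a] - nested_penalty)
    return scores

# Made-up candidate frequencies.
counts = Counter({"knowledge": 10, "knowledge base": 4, "knowledge base system": 2})
scores = mc_values(counts)
print(scores)
```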

  21. Experimental Evaluations • The data used in our experiment was developed by NII. • The corpus is a manually POS-tagged Japanese corpus, and the gold standard is a set of manually extracted terms developed for the NTCIR1 TMREC task (Artificial Intelligence field: 1,870 paper abstracts) • The gold standard consists of 8,843 manually extracted domain-specific terms

  22. Complete and partial match by GM (baseline) • [Figure: curves for partial match (contained) and complete match]

  23. Number of completely matched terms by FGM and MC-value • [Figure: differences from the GM baseline, MC-value − GM and FGM − GM]

  24. Number of partially matched terms by FGM and MC-value • [Figure: differences from the GM baseline, FGM − GM and MC-value − GM]

  25. Average length (per 100 terms) of extracted terms • [Figure: curves for MC-value, GM, and FGM]

  26. Top 20 terms scored by GM (rank, candidate term, frequency) • 1. 知識 (knowledge) 787 ○ • 2. 学習知識 (learning knowledge) 1 ○ • 3. 学習 (learning) 255 ○ • 4. 言語的知識 (linguistic knowledge) 2 ○ • 5. 知識システム (knowledge system) 14 ○ • 6. 学習システム (learning system) 16 ○ • 7. 問題知識 (problem knowledge) 3 × • 8. 学習問題 (learning problem) 5 ○ • 9. 言語的 (linguistic) 1 ○ • 10. システム (system) 861 ○

  27. Top 20 terms scored by GM (cont'd) • 11. 問題 (problem) 561 ○ • 12. 論理的知識 (logical knowledge) 1 ○ • 13. 学習支援システム (learning assistance system) 3 ○ • 14. 設計知識 (design knowledge) 29 ○ • 15. 学習問題解決システム (learning problem solver) 1 ○ • 16. 学習支援 (learning assistance) 9 ○ • 17. 言語的情報 (linguistic information) 3 ○ • 18. 知識モデル (knowledge model) 3 ○ • 19. 設計システム (design system) 6 ○ • 20. システム設計 (system design) 1 ○

  28. Top 20 terms scored by FGM (rank, candidate term, frequency) • 1. 知識 (knowledge) 787 ○ • 2. システム (system) 861 ○ • 3. 問題 (problem) 561 ○ • 4. 学習 (learning) 255 ○ • 5. 学習者 (learner) 383 ○ • 6. モデル (model) 356 ○ • 7. 情報 (information) 382 ○ • 8. 問題解決 (problem solving) 186 ○ • 9. 設計 (design) 183 ○ • 10. 知識ベース (knowledge base) 149 ○

  29. Top 20 terms scored by FGM (cont'd) • 11. 推論 (inference) 162 ○ • 12. 支援 (assistance) 87 × • 13. 知識表現 (knowledge representation) 74 ○ • 14. エージェント (agent) 256 ○ • 15. 学習者モデル (learner’s model) 57 ○ • 16. 機能 (function) 294 × • 17. 設計者 (designer) 69 ○ • 18. 対話 (dialogue) 205 ○ • 19. 言語 (language) 75 ○ • 20. 対象 (object) 293 ○

  30. Top 20 terms scored by MC-value (rank, candidate term, frequency) • 1. 学習者 (learner) 383 ○ • 2. 問題解決 (problem solving) 186 ○ • 3. システム (system) 861 ○ • 4. 知識 (knowledge) 787 ○ • 5. 研究 (research) 651 × • 6. 本稿 (this paper) 594 × • 7. 手法 (method) 562 × • 8. 問題 (problem) 561 ○ • 9. 知識ベース (knowledge base) 149 ○ • 10. 論文 (paper) 453 ×

  31. Top 20 terms scored by MC-value (cont'd) • 11. 方法 (method, way to do) 426 × • 12. 支援システム (assistance system) 18 × • 13. 計算機 (computer) 128 ○ • 14. 情報 (information) 382 ○ • 15. モデル (model) 356 ○ • 16. 自然言語 (natural language) 63 ○ • 17. 我々 (we) 332 × • 18. 有効性 (effectiveness) 160 × • 19. エキスパートシステム (expert system) 78 ○ • 20. ユーザ (user) 297 ○

  32. Precision (complete match) of each method • N1, N2: the top two systems of NTCIR1

  33. Precision (partial match) of each method
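Precision over the top-ranked candidates can be sketched for both match types. The transcript does not define partial match precisely; the code assumes a candidate counts as a partial match when it contains some gold-standard term as a substring, which suits unsegmented Japanese strings. The function name, the ranked list, and the small gold set are hypothetical.

```python
def precision_at(ranked_terms, gold_terms, n, partial=False):
    """Precision of the top-n candidates under complete or partial matching."""
    top = ranked_terms[:n]
    if not top:
        return 0.0
    if partial:
        # Assumed definition: the candidate contains at least one gold term.
        hits = sum(any(gold in term for gold in gold_terms) for term in top)
    else:
        # Complete match: the candidate is itself a gold term.
        hits = sum(term in gold_terms for term in top)
    return hits / len(top)

# Hypothetical ranking and gold standard, for illustration only.
ranked = ["知識", "学習知識", "問題知識", "本稿"]
gold = {"知識", "学習"}
complete = precision_at(ranked, gold, 4)
partial = precision_at(ranked, gold, 4, partial=True)
print(complete, partial)  # 0.25 0.75
```

Plotting these values for increasing n would give one precision curve per scoring method.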

  34. Precision of each method when a large number of terms is extracted • N1, N2: the top two systems of NTCIR1

  35. Conclusions-1 • New statistical methods for ATR, based essentially on how many nouns adjoin the simple noun in question to form compound nouns • FGM: best at extracting a small number (up to 1,400) of high-quality domain-specific terms • Longer terms that include correct terms are better extracted by FGM or GM • MC-value: strong at extracting a large number (up to 6,000) of domain-specific terms

  36. Conclusions-2 • The Web is perceived as a gigantic knowledge resource, but it has yet to be fully utilized. • Terminology is sure to be the gateway to each domain, for novices and even for experts. • More readily usable ATR is needed.
