Terminology Extraction System Based on Vocabulary Space

German-Japan NL WS in Sapporo, 2003/7/4 • Terminology Extraction System Based on Vocabulary Space • Hiroshi Nakagawa, Information Technology Center, The University of Tokyo

歩留まり (Bu-Domari): • Success rate ?? • 横持ち ("side take"): • Transportation between a main transport hub (such as an airport or train station) and the destination or starting point • 玉掛け ("ball hinge"): • To operate a power shovel • Really useful and interesting terminology

Long Compound Nouns • German • German-Japan • German-Japan natural • German-Japan natural language • German-Japan natural language processing • German-Japan natural language processing workshop • German-Japan natural language processing workshop program • German-Japan natural language processing workshop program chair

German-Japan natural language processing workshop program chair and • German-Japan natural language processing workshop program chair and ACL • German-Japan natural language processing workshop program chair and ACL2003 • German-Japan natural language processing workshop program chair and ACL2003 general

German-Japan natural language processing workshop program chair and ACL2003 general chair Professor • German-Japan natural language processing workshop program chair and ACL2003 general chair Professor Tsujii • German-Japan natural language processing workshop program chair and ACL2003 general chair Professor Tsujii’s laboratory

German-Japan natural language processing workshop program chair and ACL2003 general chair Professor Tsujii’s laboratory • German-Japan natural language processing workshop program chair and ACL2003 general chair Professor Tsujii’s laboratory chief scientist Dr. Tsuruoka • A long compound noun (NP) is a source of information about terminology

Objective • An up-to-date domain terminology dictionary is the gateway to various technological and academic fields. • To build one, we first need high-quality terminology for the target domain. • Which corpus: an ordinary corpus or Web pages?

Concepts • Methodological classification: • Supervised learning based extraction • finding highly influential features • surrounding patterns of the target expression • technology developed for the NE task • Statistics based extraction (our target) • document space based statistics • formalisms based on linguistic structure, such as syntactic or semantic structure • vocabulary space based statistics (our target)

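Document space versus vocabulary space • [Figure: Web documents containing the example terms "abc", "ab", "xy", "lmn xy" and "xy abc", contrasted with the vocabulary space formed by those terms]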

Document space based statistics • Old-fashioned • Weight term candidates by their occurrences in the document space (a corpus or the Web) and rank them in descending order. • Term frequency or tf*idf for basic nouns • To extract compound nouns: a contingency matrix and co-occurrence based decisions with MI, χ², Dice, etc.
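The co-occurrence measures named on this slide can be sketched for a two-word candidate "x y" as follows. This is a minimal illustration, not the implementation of any particular system; the function name, the parameter names, and the toy counts are assumptions made for the example.

```python
import math

def association_scores(n_xy, n_x, n_y, n_total):
    """Score a candidate two-word compound "x y" from adjacent-pair counts.

    n_xy:    number of adjacent pairs that are exactly (x, y)
    n_x:     number of adjacent pairs whose first word is x
    n_y:     number of adjacent pairs whose second word is y
    n_total: total number of adjacent pairs observed (hypothetical corpus)
    """
    # Mutual information (pointwise): observed vs. expected co-occurrence.
    mi = math.log2((n_xy * n_total) / (n_x * n_y))
    # Dice coefficient: co-occurrence relative to the two marginal counts.
    dice = 2 * n_xy / (n_x + n_y)
    # Chi-square statistic from the 2x2 contingency matrix.
    a = n_xy                  # x followed by y
    b = n_x - n_xy            # x followed by something else
    c = n_y - n_xy            # something else followed by y
    d = n_total - a - b - c   # neither
    chi2 = n_total * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return {"mi": mi, "dice": dice, "chi2": chi2}

# Toy counts, invented for illustration only.
scores = association_scores(n_xy=8, n_x=10, n_y=16, n_total=1000)
print(scores)
```

All three measures rely on counts over the document space, which is why they are grouped here under document space based statistics.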

Linguistic structure based method • Syntactic structure • POS patterns such as {adj (noun)+} • phrasal verbs, etc. • Semantic structure of compound nouns • Predicate-argument structure (e.g., Pustejovsky) • Case frames of predicates • Single and compound nouns are not treated equally.

Vocabulary space based method • Statistics of vocabulary space such as • Statistics of embedded relation (C-value) • How many compound nouns the target noun makes (LR = our proposal) • Application of link structure analysis of Web pages: (PageRank, HITS) • Single and compound nouns are treated equally

Our objective • Experimental analysis and evaluation of various term extraction methods with • Test collection (TMREC) corpus • Web page corpus • Domain dictionaries on the Web or on CD-ROM as the gold standard • Term extraction system repository • Gensen Web (言選Web) • http://gensen.dl.itc.u-tokyo.ac.jp/gensenweb_eng.html • Ultimately: an automatic builder for up-to-date domain term dictionaries

ATR by Compound Noun Statistics

言選 Gensen Web • Automatic term extraction from Web pages • Step 1. Term candidate extraction • separate the text at stop-words (or use a morphological analyzer) to generate candidates • Step 2. Scoring candidates to rank them • the scoring mechanism is based on compound noun statistics rather than document statistics
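Step 1 can be sketched for English text as below. This is only an illustration of splitting at stop-words; the stop-word list and the function name are hypothetical, and the actual Gensen Web implementation may differ (for Japanese it would typically rely on a morphological analyzer).

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "for", "and", "to", "by", "with", "is", "are"}

def extract_candidates(text):
    """Split text at stop-words and punctuation; each remaining word run is a candidate."""
    candidates = []
    # Punctuation also ends a candidate, so split into clauses first.
    for clause in re.split(r"[.,;:!?()]", text.lower()):
        run = []
        for word in clause.split():
            if word in STOP_WORDS:
                if run:
                    candidates.append(" ".join(run))
                run = []
            else:
                run.append(word)
        if run:
            candidates.append(" ".join(run))
    return candidates

print(extract_candidates("Trigram statistics for the acquisition of word classes."))
# ['trigram statistics', 'acquisition', 'word classes']
```

The output candidates are then passed to Step 2 for scoring.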

Domain-specific terms expressing domain concepts • About 85%: compound nouns • About 15%: simple nouns • Simple noun: a noun that cannot be divided into shorter nouns • Compound noun: an uninterrupted sequence of simple nouns • Our purpose: to automatically extract domain-specific terms, both compound and simple nouns, from a domain corpus.

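Scoring of Simple Nouns • For a simple noun $N$: $L_i$ ($i = 1, \dots, n$) are the nouns adjoining $N$ on the left and $R_j$ ($j = 1, \dots, m$) the nouns adjoining $N$ on the right, each with its frequency • Example, $N$ = trigram: • left: noun (3), character (1), class (1) • right: statistics (2), acquisition (1) • $LN(\text{trigram}) = 3 + 1 + 1 = 5$ ($n = 3$), $RN(\text{trigram}) = 2 + 1 = 3$ ($m = 2$) • Principle: a simple noun that contributes to forming a large number of compound nouns gets a high score.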
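The LN and RN counts for the trigram example can be reproduced with a short sketch. It assumes compound nouns are given as space-separated simple nouns with their frequencies; the function name and the data layout are illustrative, not taken from the original system.

```python
from collections import Counter

def adjacency_counts(compound_counts):
    """Return (LN, RN): for each simple noun, the total frequency of nouns
    adjoining it on the left (LN) and on the right (RN)."""
    ln, rn = Counter(), Counter()
    for compound, freq in compound_counts.items():
        nouns = compound.split()
        for left, right in zip(nouns, nouns[1:]):
            rn[left] += freq   # `right` adjoins `left` on its right side
            ln[right] += freq  # `left` adjoins `right` on its left side
    return ln, rn

# Counts from the trigram example on this slide.
compounds = {
    "noun trigram": 3,
    "character trigram": 1,
    "class trigram": 1,
    "trigram statistics": 2,
    "trigram acquisition": 1,
}
ln, rn = adjacency_counts(compounds)
print(ln["trigram"], rn["trigram"])  # 5 3
```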

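Scoring of compound nouns: GM(Compound Noun) • For a compound noun $CN = N_1 N_2 \cdots N_L$: $GM(CN) = \left( \prod_{i=1}^{L} (LN(N_i) + 1)(RN(N_i) + 1) \right)^{1/(2L)}$ • $GM(CN)$ is a geometric mean, so it does not depend on the length of $CN$.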

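New scoring function: FGM(CN) • If $CN$ occurs independently, then $FGM(CN) = GM(CN) \times f(CN)$, where $f(CN)$ is the number of independent occurrences of $CN$ (occurrences in which $CN$ does not appear as part of a longer compound noun) • Example: $GM(\text{trigram}) = ((5+1) \times (3+1))^{1/2} \approx 4.9$; if $f(\text{trigram}) = 5$, then $FGM(\text{trigram}) \approx 24.5$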
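The trigram example can be checked with a small sketch of GM and FGM. Two points are assumptions rather than statements from the slides: the general GM form takes the 2L-th root of the product over all L simple nouns in the compound, and a compound with no independent occurrence falls back to its GM score.

```python
def gm(compound, ln, rn):
    """Geometric mean of (LN + 1) and (RN + 1) over the simple nouns of a compound."""
    nouns = compound.split()
    product = 1.0
    for noun in nouns:
        product *= (ln.get(noun, 0) + 1) * (rn.get(noun, 0) + 1)
    # The 2L-th root keeps the score independent of the compound's length L.
    return product ** (1.0 / (2 * len(nouns)))

def fgm(compound, ln, rn, independent_freq):
    """GM weighted by the number of independent occurrences f(CN)."""
    score = gm(compound, ln, rn)
    if independent_freq > 0:
        return score * independent_freq
    return score  # assumption: no independent occurrence, keep the GM score

ln = {"trigram": 5}
rn = {"trigram": 3}
print(round(gm("trigram", ln, rn), 1))      # 4.9
print(round(fgm("trigram", ln, rn, 5), 1))  # 24.5
```

Adding 1 to each count keeps a noun with no left or right neighbours from zeroing the whole product.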

Modified C-value • Modifies C-value (Frantzi & Ananiadou, 1996) so that it can also score a simple noun • length(a): number of simple nouns that make up a • freq(a): frequency of a • t(a): frequency of candidate compound nouns that include a • c(a): number of distinct candidate compound nouns that include a
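The formula itself is not preserved on this slide, so the following sketch rests on an assumption: that the modification replaces the (length(a) - 1) factor of the 1996 C-value with length(a), which would let a simple noun (length 1) receive a non-zero score. The function name and the example numbers are illustrative only.

```python
def mc_value(length, freq, t, c):
    """Assumed modified C-value: length(a) * (freq(a) - t(a) / c(a)).

    length: number of simple nouns in candidate a
    freq:   frequency of a
    t:      frequency of longer candidate compound nouns that include a
    c:      number of distinct longer candidates that include a
    """
    if c == 0:
        # a is never nested in a longer candidate, so nothing is subtracted.
        return length * freq
    return length * (freq - t / c)

# Invented numbers for illustration.
print(mc_value(length=1, freq=10, t=6, c=3))  # 8.0
print(mc_value(length=2, freq=4, t=0, c=0))   # 8
```

Under this assumed form, a noun that mostly appears nested inside a few longer candidates is penalized, while its length acts as a simple multiplier.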

Experimental Evaluations • The data used in our experiments were developed by NII. • A manually POS-tagged Japanese corpus; the gold standard is a set of manually extracted terms from the NTCIR-1 TMREC task (artificial intelligence field: 1,870 paper abstracts) • The gold standard consists of 8,843 manually extracted domain-specific terms

Complete and partial match by GM (baseline) • [Figure: two curves, partial match (contained) and complete match]
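The two match criteria can be sketched as follows. The definition of a partial match used here is an assumption: a candidate counts as a partial match when it contains a gold-standard term as a substring, and complete matches are counted as partial matches too. The function name and the toy data are illustrative.

```python
def match_counts(ranked_candidates, gold_terms, top_k):
    """Count complete and partial matches among the top_k ranked candidates."""
    gold = set(gold_terms)
    complete = partial = 0
    for candidate in ranked_candidates[:top_k]:
        if candidate in gold:
            complete += 1
            partial += 1  # assumption: a complete match is also a partial match
        elif any(term in candidate for term in gold):
            partial += 1  # candidate contains a gold-standard term
    return complete, partial

# Toy data, invented for illustration.
ranked = ["knowledge base", "problem knowledge system", "paper"]
gold = ["knowledge base", "knowledge"]
complete, partial = match_counts(ranked, gold, top_k=3)
print(complete, partial)           # 1 2
print(complete / 3, partial / 3)   # precision at the top 3
```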

Number of completely matched terms by FGM and MC-value • [Figure: differences from the GM baseline; curves MC-value − GM and FGM − GM]

Number of partially matched terms by FGM and MC-value • [Figure: differences from the GM baseline; curves FGM − GM and MC-value − GM]

Average length of extracted terms (per 100 terms) • [Figure: curves for MC-value, GM and FGM]

Top 20 terms by GM • (candidate term, frequency, ○/×) • 1. 知識 (knowledge) 787 ○ • 2. 学習知識 (learning knowledge) 1 ○ • 3. 学習 (learning) 255 ○ • 4. 言語的知識 (linguistic knowledge) 2 ○ • 5. 知識システム (knowledge system) 14 ○ • 6. 学習システム (learning system) 16 ○ • 7. 問題知識 (problem knowledge) 3 × • 8. 学習問題 (learning problem) 5 ○ • 9. 言語的 (linguistic) 1 ○ • 10. システム (system) 861 ○

Top 20 terms by GM (cont'd) • 11. 問題 (problem) 561 ○ • 12. 論理的知識 (logical knowledge) 1 ○ • 13. 学習支援システム (learning assistance system) 3 ○ • 14. 設計知識 (design knowledge) 29 ○ • 15. 学習問題解決システム (learning problem solver) 1 ○ • 16. 学習支援 (learning assistance) 9 ○ • 17. 言語的情報 (linguistic information) 3 ○ • 18. 知識モデル (knowledge model) 3 ○ • 19. 設計システム (design system) 6 ○ • 20. システム設計 (system design) 1 ○

Top 20 terms by FGM • (candidate term, frequency, ○/×) • 1. 知識 (knowledge) 787 ○ • 2. システム (system) 861 ○ • 3. 問題 (problem) 561 ○ • 4. 学習 (learning) 255 ○ • 5. 学習者 (learner) 383 ○ • 6. モデル (model) 356 ○ • 7. 情報 (information) 382 ○ • 8. 問題解決 (problem solving) 186 ○ • 9. 設計 (design) 183 ○ • 10. 知識ベース (knowledge base) 149 ○

Top 20 terms by FGM (cont'd) • 11. 推論 (inference) 162 ○ • 12. 支援 (assistance) 87 × • 13. 知識表現 (knowledge representation) 74 ○ • 14. エージェント (agent) 256 ○ • 15. 学習者モデル (learner's model) 57 ○ • 16. 機能 (function) 294 × • 17. 設計者 (designer) 69 ○ • 18. 対話 (dialogue) 205 ○ • 19. 言語 (language) 75 ○ • 20. 対象 (object) 293 ○

Top 20 terms by MC-value • (candidate term, frequency, ○/×) • 1. 学習者 (learner) 383 ○ • 2. 問題解決 (problem solving) 186 ○ • 3. システム (system) 861 ○ • 4. 知識 (knowledge) 787 ○ • 5. 研究 (research) 651 × • 6. 本稿 (this paper) 594 × • 7. 手法 (method) 562 × • 8. 問題 (problem) 561 ○ • 9. 知識ベース (knowledge base) 149 ○ • 10. 論文 (paper) 453 ×

Top 20 terms by MC-value (cont'd) • 11. 方法 (method, way of doing) 426 × • 12. 支援システム (assistance system) 18 × • 13. 計算機 (computer) 128 ○ • 14. 情報 (information) 382 ○ • 15. モデル (model) 356 ○ • 16. 自然言語 (natural language) 63 ○ • 17. 我々 (we) 332 × • 18. 有効性 (effectiveness) 160 × • 19. エキスパートシステム (expert system) 78 ○ • 20. ユーザ (user) 297 ○

Precision (complete match) of each method • N1, N2: top two systems of NTCIR-1

Precision (partial match) of each method

Precision of each method when a large number of terms is extracted • N1, N2: top two systems of NTCIR-1

Conclusions (1) • New statistical methods for ATR, based on how many nouns adjoin the simple noun in question to form compound nouns • FGM • best at extracting a small number (up to 1,400) of high-quality domain-specific terms • longer terms that include correct terms are better extracted by FGM or GM • MC-value • strong at extracting a large number (up to 6,000) of domain-specific terms

Conclusions (2) • The Web is perceived as a gigantic knowledge resource, but it has yet to be fully utilized. • The terminology of a domain is the gateway to that domain, for novices and even for experts. • More readily usable ATR is needed.
