Terminology Extraction System based on Vocabulary Space

German-Japan NL WS in Sapporo, 2003/7/4. Terminology Extraction System based on Vocabulary Space. Hiroshi Nakagawa, Information Technology Center, The University of Tokyo.

Presentation Transcript


German-Japan NL WS in Sapporo, 2003/7/4

Terminology Extraction System based on Vocabulary Space

Hiroshi Nakagawa

Information Technology Center,

The University of Tokyo


  • 歩留まり (bu-domari): "success rate"??

  • 横持ち (yoko-mochi), literally "side take": transportation between a main transport hub (such as an airport or train station) and the destination or starting point.

  • 玉掛け (tama-kake), literally "ball hinge": to operate a power shovel.

  • Really useful and interesting terminology


Long Compound Nouns

  • German

  • German-Japan

  • German-Japan natural

  • German-Japan natural language

  • German-Japan natural language processing

  • German-Japan natural language processing workshop

  • German-Japan natural language processing workshop program

  • German-Japan natural language processing workshop program chair


  • German-Japan natural language processing workshop program chair and

  • German-Japan natural language processing workshop program chair and ACL

  • German-Japan natural language processing workshop program chair and ACL2003

  • German-Japan natural language processing workshop program chair and ACL2003 general


  • German-Japan natural language processing workshop program chair and ACL2003 general chair Professor

  • German-Japan natural language processing workshop program chair and ACL2003 general chair Professor Tsujii

  • German-Japan natural language processing workshop program chair and ACL2003 general chair Professor Tsujii’s laboratory


  • German-Japan natural language processing workshop program chair and ACL2003 general chair Professor Tsujii’s laboratory

  • German-Japan natural language processing workshop program chair and ACL2003 general chair Professor Tsujii’s laboratory chief scientist Dr. Tsuruoka

  • A long compound noun (NP) is a source of information about terminology


Objective

  • An up-to-date domain terminology dictionary is the gateway to various technological and academic fields.

  • For this, we first need high quality terminology for the target domain.

  • Which corpus? An ordinary corpus or Web pages?


Concepts

  • Methodological classification:

  • Supervised Learning based extraction

    • finding heavily influenced features

    • surrounding patterns of target expression

    • technology developed by NE task

  • Statistics based extraction (our target)

    • document space based statistics

    • linguistic structure, such as syntactic, semantic structure based formalism

    • vocabulary space based statistics (our target)


Document space versus Vocabulary space

[Figure: example term strings (abc, ab, lmn, xy) distributed over Web documents, illustrating document space versus vocabulary space]


Document space based statistics

  • Old-fashioned

  • Weight term candidates by their occurrences in the document space (a corpus or the Web) and rank them in descending order.

  • Term frequency or tf*idf for basic nouns

  • To extract compound nouns: contingency matrix and co-occurrence based decisions with MI, χ², Dice, etc.
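
The co-occurrence measures named above can be sketched from a 2x2 contingency table for a noun pair (w1, w2). This is only an illustration under assumed definitions (pointwise mutual information, Pearson's χ², Dice coefficient); the function name, the cell names and the example counts are hypothetical and do not come from the slides.

```python
import math

def association_scores(n11, n12, n21, n22):
    """Score a noun pair (w1, w2) from its 2x2 contingency table.

    n11: count of w1 immediately followed by w2
    n12: count of w1 followed by some other word
    n21: count of some other word followed by w2
    n22: count of pairs containing neither w1 first nor w2 second
    Assumes n11 > 0 and no empty row or column.
    """
    n = n11 + n12 + n21 + n22
    row1, row2 = n11 + n12, n21 + n22
    col1, col2 = n11 + n21, n12 + n22
    # Pointwise mutual information: log2 of P(w1, w2) / (P(w1) * P(w2))
    mi = math.log2(n11 * n / (row1 * col1))
    # Pearson chi-square statistic for a 2x2 table
    chi2 = n * (n11 * n22 - n12 * n21) ** 2 / (row1 * row2 * col1 * col2)
    # Dice coefficient: 2 * joint count / (count of w1 first + count of w2 second)
    dice = 2 * n11 / (row1 + col1)
    return {"mi": mi, "chi2": chi2, "dice": dice}

# Hypothetical counts, for illustration only
scores = association_scores(n11=8, n12=2, n21=2, n22=88)
```

A pair whose scores exceed a chosen threshold would be accepted as a compound noun candidate; the threshold itself is a tuning decision not discussed on the slide.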


Linguistic Structure based method

  • Syntactic structure

    • POS pattern like {adj (noun)+}

    • phrasal verbs, etc.

  • Semantic structure of compound nouns

    • Predicate argument structure (e.g. Pustejovsky)

    • Case frame of predicate

  • Single and compound nouns are not treated equally.
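
A POS pattern such as {adj (noun)+} can be matched with a regular expression over the tag sequence. The sketch below reads the pattern literally as one adjective followed by one or more nouns, which is an assumption; the tag names (ADJ, NOUN, ...), the function name and the example sentence are also hypothetical.

```python
import re

def match_pos_pattern(tagged):
    """Return word sequences whose tags match: one ADJ followed by one or more NOUNs."""
    # Encode each tag as one character so a regex can scan the tag sequence.
    codes = {"ADJ": "A", "NOUN": "N"}
    tag_string = "".join(codes.get(tag, "x") for _, tag in tagged)
    return [
        " ".join(word for word, _ in tagged[m.start():m.end()])
        for m in re.finditer(r"AN+", tag_string)
    ]

# Hypothetical, already POS-tagged input
tagged = [("we", "PRON"), ("use", "VERB"), ("natural", "ADJ"),
          ("language", "NOUN"), ("processing", "NOUN"), ("for", "ADP"),
          ("automatic", "ADJ"), ("term", "NOUN"), ("extraction", "NOUN")]
candidates = match_pos_pattern(tagged)
```

Because each tag maps to exactly one character, match offsets in the tag string are also token offsets, which keeps the slicing simple.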


Vocabulary space based method

  • Statistics of vocabulary space such as

    • Statistics of embedded relation (C-value)

    • How many compound nouns the target noun makes (LR = our proposal)

    • Application of link structure analysis of Web pages: (PageRank, HITS)

    • Single and compound nouns are treated equally


Our objective

  • Experimental analysis and evaluation of various term extraction methods with

    • Test collection (TMREC) corpus

    • Web page corpus

    • Domain dictionaries on the Web or on CD-ROM as gold standard

  • Term extraction system repository

    • Gensen Web (言選Web)

    • http://gensen.dl.itc.u-tokyo.ac.jp/gensenweb_eng.html

  • Finally, an automatic builder for an up-to-date domain term dictionary


ATR by Compound Noun Statistics


言選 Gensen Web

  • Automatic term extraction from Web pages

  • Step 1. Term candidate extraction

    • separating text at stop-words (or using a morphological analyzer) to generate candidates

  • Step 2. Scoring candidates to rank them

  • Our scoring mechanism is innovative and unique
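
Step 1 can be sketched as follows for English text. This is a simplified illustration, not the Gensen Web implementation: the stop-word list and function name are hypothetical, and every word between two stop-words is treated as part of a candidate, whereas a real system would use a much larger list or a morphological analyzer.

```python
import re

# Hypothetical, deliberately tiny stop-word list
STOP_WORDS = {"the", "of", "a", "an", "is", "are", "by", "for",
              "in", "to", "and", "from", "we", "with"}

def extract_candidates(text):
    """Split text at punctuation and stop-words; each remaining run of words is a candidate."""
    candidates = []
    for chunk in re.split(r"[.,;:!?()]", text.lower()):
        current = []
        for token in chunk.split():
            if token in STOP_WORDS:
                if current:
                    candidates.append(tuple(current))
                current = []
            else:
                current.append(token)
        if current:
            candidates.append(tuple(current))
    return candidates

text = ("The extraction of term candidates is the first step. "
        "Scoring of term candidates is the second step.")
candidates = extract_candidates(text)
```

Candidates are kept as tuples of words so that later scoring steps can look at the individual simple nouns inside each compound.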


Domain Specific Terms (expressing domain concepts)

  • About 85%: compound nouns

  • About 15%: simple nouns

  • Simple noun: cannot be divided further into shorter nouns

  • Compound noun: uninterrupted sequence of simple nouns

Our purpose is to extract domain specific terms, including compound and simple nouns, automatically from a domain corpus.


Scoring of Simple Nouns

Example for the simple noun N = "trigram":

  • Left nouns L_i (freq.): noun (3), character (1), class (1); n = 3

  • Right nouns R_j (freq.): statistics (2), acquisition (1); m = 2

  • LN(trigram) = 3 + 1 + 1 = 5, RN(trigram) = 2 + 1 = 3

Principle: A simple noun that contributes to forming a large number of compound nouns gets a high score.
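
The LN and RN values of the "trigram" example can be reproduced with a short sketch. It assumes that LN(N) sums the frequencies of the nouns directly to the left of N inside candidate compound nouns, and RN(N) does the same on the right; the function name and the input format (tuples of simple nouns mapped to frequencies) are hypothetical.

```python
from collections import Counter

def left_right_stats(compound_counts):
    """Return (LN, RN): summed frequencies of nouns adjoining each simple noun on its left / right."""
    ln, rn = Counter(), Counter()
    for nouns, freq in compound_counts.items():
        for left, right in zip(nouns, nouns[1:]):
            ln[right] += freq  # `left` adjoins `right` on its left side
            rn[left] += freq   # `right` adjoins `left` on its right side
    return ln, rn

# Compound nouns and frequencies taken from the slide's "trigram" example
compound_counts = {
    ("noun", "trigram"): 3,
    ("character", "trigram"): 1,
    ("class", "trigram"): 1,
    ("trigram", "statistics"): 2,
    ("trigram", "acquisition"): 1,
}
ln, rn = left_right_stats(compound_counts)
```

Counting only distinct neighbours instead of frequencies would give n = 3 and m = 2 for the same data.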


Scoring of compound nouns: GM(Compound Noun)

For a compound noun CN = N_1 N_2 ... N_L:

$GM(CN) = \left( \prod_{i=1}^{L} (LN(N_i)+1)(RN(N_i)+1) \right)^{1/(2L)}$

GM(CN) is a geometric mean, so it does not depend on the length of CN.


New scoring function: FGM(CN)

If CN occurs independently, then

FGM(CN) = GM(CN) × f(CN)

where f(CN) is the number of independent occurrences of the noun CN (occurrences in which CN does not appear as part of a longer CN).

Ex. GM(trigram) = ((5+1) × (3+1))^(1/2) ≈ 4.9

If f(trigram) = 5, then FGM(trigram) = 4.9 × 5 = 24.5
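
The worked example can be checked with a minimal sketch. It assumes GM is the geometric mean of the (LN+1) and (RN+1) factors of every simple noun in CN, and that FGM multiplies GM by f(CN), as the numbers in the example imply. Falling back to plain GM when CN never occurs independently is an assumption; the slide does not state that case. Function names are hypothetical.

```python
def gm(nouns, ln, rn):
    """Geometric mean of (LN+1)(RN+1) over the simple nouns of a compound noun."""
    product = 1.0
    for noun in nouns:
        product *= (ln.get(noun, 0) + 1) * (rn.get(noun, 0) + 1)
    return product ** (1.0 / (2 * len(nouns)))

def fgm(nouns, ln, rn, independent_freq):
    """GM weighted by the number of independent occurrences f(CN)."""
    score = gm(nouns, ln, rn)
    # Assumption: with no independent occurrence, keep the plain GM score.
    return score * independent_freq if independent_freq > 0 else score

ln = {"trigram": 5}
rn = {"trigram": 3}
gm_trigram = gm(("trigram",), ln, rn)
fgm_trigram = fgm(("trigram",), ln, rn, independent_freq=5)
```

The +1 terms keep a noun with no left or right neighbours from zeroing out the whole product.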


Modified C-value

Modify C-value (Frantzi & Ananiadou, 1996) so that it can also score a simple noun.

  • length(a): number of simple nouns that make up a

  • freq(a): frequency of a

  • t(a): frequency of candidate compound nouns that include a

  • c(a): number of distinct candidate compound nouns that include a
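
The slide's formula did not survive in this transcript, so the following is only a sketch under a stated assumption: that the modification replaces the factor (length(a) − 1) of the 1996 C-value with length(a), so a simple noun (length 1) no longer scores zero. The function name and example numbers are hypothetical.

```python
def mc_value(length, freq, t=0, c=0):
    """Modified C-value sketch: length(a) * (freq(a) - t(a) / c(a)).

    Assumption: the original factor (length - 1) is replaced by length.
    When a is not nested in any longer candidate (c == 0), nothing is subtracted.
    """
    if c == 0:
        return length * freq
    return length * (freq - t / c)

# Hypothetical numbers: a simple noun seen 10 times, nested in 3 distinct
# longer candidates that occur 6 times in total.
simple_score = mc_value(length=1, freq=10, t=6, c=3)
# A two-noun compound seen 5 times and never nested.
compound_score = mc_value(length=2, freq=5)
```

With the unmodified factor (length − 1), `simple_score` would be 0 regardless of frequency, which is the limitation the slide says the modification removes.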


Experimental Evaluations

The data used in our experiments was developed by NII.

  • A manually POS-tagged Japanese corpus; the gold standard is a set of manually extracted terms developed for the NTCIR1 TMREC task

     (Artificial Intelligence field: 1,870 paper abstracts)

  • The gold standard consists of 8,843 manually extracted domain specific terms


Complete and partial match by GM (baseline)

[Figure: curves for partial match (contained) and complete match]
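
The two match criteria can be sketched as precision over the top k ranked candidates. The slide does not define "partial match (contained)" in detail; the sketch assumes that a candidate matches partially when it contains some gold-standard term as a contiguous sub-sequence of simple nouns. Function names and example data are hypothetical.

```python
def contains(sequence, sub):
    """True if `sub` occurs as a contiguous sub-sequence of `sequence`."""
    n = len(sub)
    return any(sequence[i:i + n] == sub for i in range(len(sequence) - n + 1))

def precision_at_k(ranked, gold, k):
    """Return (complete, partial) match precision over the top k candidates.

    Assumes k >= 1 and a non-empty ranking.
    """
    top = ranked[:k]
    complete = sum(1 for term in top if term in gold)
    # Assumption: partial match = candidate contains some gold term.
    partial = sum(1 for term in top if any(contains(term, g) for g in gold))
    return complete / len(top), partial / len(top)

# Hypothetical gold standard and ranking
gold = {("knowledge", "base"), ("learner",)}
ranked = [("knowledge", "base"), ("knowledge", "base", "system"),
          ("paper",), ("learner", "model")]
complete_p, partial_p = precision_at_k(ranked, gold, k=4)
```

Under this definition every complete match is also a partial match, so the partial match curve can never fall below the complete match curve.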


Number of completely matched terms by FGM and MC-value

[Figure: difference from the GM baseline in the number of completely matched terms; curves MCval − GM and FGM − GM]


Number of partially matched terms by FGM and MC-value

[Figure: difference from the GM baseline in the number of partially matched terms; curves FGM − GM and MCval − GM]


Average length of extracted terms (every 100 terms)

[Figure: curves for MC-value, GM and FGM]


Top scored 20 terms by GM

Columns: rank, candidate term, frequency, ○/× mark

  • 1. 知識 (knowledge): 787 ○
  • 2. 学習知識 (learning knowledge): 1 ○
  • 3. 学習 (learning): 255 ○
  • 4. 言語的知識 (linguistic knowledge): 2 ○
  • 5. 知識システム (knowledge system): 14 ○
  • 6. 学習システム (learning system): 16 ○
  • 7. 問題知識 (problem knowledge): 3 ×
  • 8. 学習問題 (learning problem): 5 ○
  • 9. 言語的 (linguistic): 1 ○
  • 10. システム (system): 861 ○


Top scored 20 terms by GM (cont'd)

  • 11. 問題 (problem): 561 ○
  • 12. 論理的知識 (logical knowledge): 1 ○
  • 13. 学習支援システム (learning assistance system): 3 ○
  • 14. 設計知識 (design knowledge): 29 ○
  • 15. 学習問題解決システム (learning problem solver): 1 ○
  • 16. 学習支援 (learning assistance): 9 ○
  • 17. 言語的情報 (linguistic information): 3 ○
  • 18. 知識モデル (knowledge model): 3 ○
  • 19. 設計システム (design system): 6 ○
  • 20. システム設計 (system design): 1 ○


Top scored 20 terms by FGM

Columns: rank, candidate term, frequency, ○/× mark

  • 1. 知識 (knowledge): 787 ○
  • 2. システム (system): 861 ○
  • 3. 問題 (problem): 561 ○
  • 4. 学習 (learning): 255 ○
  • 5. 学習者 (learner): 383 ○
  • 6. モデル (model): 356 ○
  • 7. 情報 (information): 382 ○
  • 8. 問題解決 (problem solving): 186 ○
  • 9. 設計 (design): 183 ○
  • 10. 知識ベース (knowledge base): 149 ○


Top scored 20 terms by FGM (cont'd)

  • 11. 推論 (inference): 162 ○
  • 12. 支援 (assistance): 87 ×
  • 13. 知識表現 (knowledge representation): 74 ○
  • 14. エージェント (agent): 256 ○
  • 15. 学習者モデル (learner's model): 57 ○
  • 16. 機能 (function): 294 ×
  • 17. 設計者 (designer): 69 ○
  • 18. 対話 (dialogue): 205 ○
  • 19. 言語 (language): 75 ○
  • 20. 対象 (object): 293 ○


Top scored 20 terms by MC-value

Columns: rank, candidate term, frequency, ○/× mark

  • 1. 学習者 (learner): 383 ○
  • 2. 問題解決 (problem solving): 186 ○
  • 3. システム (system): 861 ○
  • 4. 知識 (knowledge): 787 ○
  • 5. 研究 (research): 651 ×
  • 6. 本稿 (this paper): 594 ×
  • 7. 手法 (method): 562 ×
  • 8. 問題 (problem): 561 ○
  • 9. 知識ベース (knowledge base): 149 ○
  • 10. 論文 (paper): 453 ×


Top scored 20 terms by MC-value (cont'd)

  • 11. 方法 (method, way to do): 426 ×
  • 12. 支援システム (assistance system): 18 ×
  • 13. 計算機 (computer): 128 ○
  • 14. 情報 (information): 382 ○
  • 15. モデル (model): 356 ○
  • 16. 自然言語 (natural language): 63 ○
  • 17. 我々 (we): 332 ×
  • 18. 有効性 (effectiveness): 160 ×
  • 19. エキスパートシステム (expert system): 78 ○
  • 20. ユーザ (user): 297 ○


Precision (complete match) of each method

N1, N2: top two systems of NTCIR1


Precision (partial match) of each method


Precision of each method when a large number of terms is extracted

N1, N2: top two systems of NTCIR1


Conclusions-1

New statistical methods for ATR, which basically count how many nouns adjoin the simple noun in question to form compound nouns.

FGM

・Best at extracting a small number (up to 1,400) of high quality domain specific terms

・Longer terms that include correct terms are better extracted by FGM or GM

MC-value

・Strong at extracting a large number (up to 6,000) of domain specific terms


Conclusions-2

  • The Web is perceived as a gigantic knowledge resource, but it has yet to be fully utilized.

  • Terminology in various domains is sure to be the gateway to each domain for novices and even for experts.

  • More readily useful ATR is needed.

