Terminology Extraction System based on Vocabulary Space


German-Japan NL WS in Sapporo, 2003/7/4. Terminology Extraction System based on Vocabulary Space. Hiroshi Nakagawa, Information Technology Center, The University of Tokyo.

Presentation Transcript
terminology extraction system based on vocabulary space
German-Japan NL WS in Sapporo, 2003/7/4

Terminology Extraction System based on Vocabulary Space

Hiroshi Nakagawa

Information Technology Center,

The University of Tokyo

slide2
歩留まり (Bu-Domari): success rate ??
  • 横持ち ("side take"): transportation between a main transportation hub (such as an airport or train station) and the destination or starting point
  • 玉掛け ("ball hinge"): to operate a power shovel
  • Really useful and interesting terminologies
slide3
Long Compound Nouns
  • German
  • German-Japan
  • German-Japan natural
  • German-Japan natural language
  • German-Japan natural language processing
  • German-Japan natural language processing workshop
  • German-Japan natural language processing workshop program
  • German-Japan natural language processing workshop program chair
slide4
German-Japan natural language processing workshop program chair and
  • German-Japan natural language processing workshop program chair and ACL
  • German-Japan natural language processing workshop program chair and ACL2003
  • German-Japan natural language processing workshop program chair and ACL2003 general
slide5
German-Japan natural language processing workshop program chair and ACL2003 general chair Professor
  • German-Japan natural language processing workshop program chair and ACL2003 general chair Professor Tsujii
  • German-Japan natural language processing workshop program chair and ACL2003 general chair Professor Tsujii’s laboratory
slide6
German-Japan natural language processing workshop program chair and ACL2003 general chair Professor Tsujii’s laboratory
  • German-Japan natural language processing workshop program chair and ACL2003 general chair Professor Tsujii’s laboratory chief scientist Dr. Tsuruoka
  • Long compound nouns (NPs) are a source of information about terminology
objective
Objective
  • An up-to-date domain terminology dictionary is the gateway to various technological and academic fields.
  • For this, we first of all need high-quality terminology for the target domain.
  • Which corpus? An ordinary corpus or Web pages?
concepts
Concepts
  • Methodological classification:
  • Supervised Learning based extraction
    • finding highly influential features
    • surrounding patterns of the target expression
    • technology developed for the NE (named entity) task
  • Statistics based extraction (our target)
    • document space based statistics
    • linguistic structure based formalism, such as syntactic or semantic structure
    • vocabulary space based statistics (our target)
document space versus vocabulary space
Document space versus Vocabulary space

[Figure: Web documents containing the strings abc, ab, lmn, and xy, illustrating document space versus vocabulary space]

document space based statistics
document space based statistics
  • Old-fashioned
  • Weight term candidates by their occurrences in the document space (a corpus or the Web), and rank them in descending order.
  • Term frequency or tf*idf for basic nouns
  • To extract compound nouns, use a contingency matrix and co-occurrence based decisions with MI, χ², Dice, etc.
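
As a rough illustration of the co-occurrence measures named above, the sketch below computes the Dice coefficient and pointwise mutual information (MI) for one adjacent noun pair. The function name, its parameters, and the counts are hypothetical and chosen only for this example; the slides do not say how the counts are collected.

```python
import math

def association_scores(pair_count, left_count, right_count, total_pairs):
    """Score one adjacent noun pair (x, y) from raw co-occurrence counts.

    pair_count  : how often x is immediately followed by y
    left_count  : how often x occurs as the first element of any pair
    right_count : how often y occurs as the second element of any pair
    total_pairs : total number of adjacent pairs observed
    """
    # Dice coefficient: overlap relative to the two marginal counts.
    dice = 2 * pair_count / (left_count + right_count)
    # Pointwise MI: log2 of observed probability over the probability
    # expected if x and y were independent.
    pmi = math.log2((pair_count * total_pairs) / (left_count * right_count))
    return dice, pmi

# Hypothetical counts, not taken from the presentation.
dice, pmi = association_scores(pair_count=8, left_count=10,
                               right_count=16, total_pairs=1000)
print(round(dice, 3), round(pmi, 3))
```

A pair would be accepted as a compound noun when its score exceeds some threshold; the threshold is a tuning choice that the slides leave open.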
linguistic structure based method
Linguistic Structure based method
  • Syntactic structure
    • POS pattern like {adj (noun)+}
    • phrasal verbs, etc.
  • Semantic structure of compound nouns
    • Predicate argument structure (e.g., Pustejovsky)
    • Case frame of predicate
  • Single and compound nouns are not treated equally.
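
A POS pattern such as {adj (noun)+} can be matched with a regular expression over the tag sequence. The following is a minimal sketch, assuming a hypothetical one-character tag set ("A" for adjective, "N" for noun, "O" for anything else) and pre-tagged input; a real tagger's tag set would need mapping to these codes first.

```python
import re

# Hypothetical single-character tags: A = adjective, N = noun, O = other.
# One character per token keeps string offsets aligned with token indices.
ADJ_NOUNS = re.compile(r"AN+")  # one adjective followed by one or more nouns

def match_pos_pattern(tagged_tokens):
    """Return the word sequences whose tags match adj (noun)+."""
    tags = "".join(tag for _, tag in tagged_tokens)
    return [
        " ".join(word for word, _ in tagged_tokens[m.start():m.end()])
        for m in ADJ_NOUNS.finditer(tags)
    ]

tagged = [("the", "O"), ("natural", "A"), ("language", "N"),
          ("processing", "N"), ("is", "O"), ("fun", "N")]
candidates = match_pos_pattern(tagged)
print(candidates)
```

Because this pattern requires an adjective, a bare simple noun such as "fun" is not extracted.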
vocabulary space based method
Vocabulary space based method
  • Statistics of vocabulary space such as
    • Statistics of embedded relation (C-value)
    • How many compound nouns the target noun makes (LR = our proposal)
    • Application of link structure analysis of Web pages: (PageRank, HITS)
  • Single and compound nouns are treated equally
our objective
Our objective
  • Experimental analysis and evaluation of various term extraction methods with
    • Test collection (TMREC) corpus
    • Web page corpus
    • Domain dictionaries on the Web or on CD-ROM as the gold standard
  • Term extraction system repository
    • Gensen Web (言選Web)
    • http://gensen.dl.itc.u-tokyo.ac.jp/gensenweb_eng.html
  • Finally, an automatic builder for up-to-date domain term dictionaries
slide15
言選 Gensen Web
  • Automatic term extraction from Web pages
  • Step 1. Term candidate extraction
    • separating text at stop words (or using a morphological analyzer) to generate candidates
  • Step 2. Scoring candidates to rank them
    • our scoring mechanism is innovative and unique
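
Step 1 can be sketched for English text as follows. This is only an illustration of splitting at stop words: the stop-word list and the function name are assumptions made for this example, not Gensen Web's actual resources, and Japanese input would need a morphological analyzer instead.

```python
import re

# Illustrative stop-word list; a real system would use a much larger one.
STOP_WORDS = {"the", "of", "is", "a", "an", "and", "for", "in", "to"}

def extract_candidates(text):
    """Split text at stop words and punctuation; each remaining run of
    tokens becomes one term candidate."""
    candidates, current = [], []
    for token in re.findall(r"[A-Za-z0-9-]+|[.,;:!?]", text.lower()):
        if token in STOP_WORDS or not token[0].isalnum():
            # A delimiter closes the candidate collected so far.
            if current:
                candidates.append(" ".join(current))
                current = []
        else:
            current.append(token)
    if current:
        candidates.append(" ".join(current))
    return candidates

candidates = extract_candidates(
    "The knowledge base of the expert system is a learning system."
)
print(candidates)
```

The resulting candidates are then passed to Step 2 for scoring.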
slide16
Domain Specific Terms

expressing domain concepts

  • Compound nouns: about 85%
  • Simple nouns: about 15%
  • Simple noun: a noun that cannot be divided into shorter nouns
  • Compound noun: an uninterrupted sequence of simple nouns

Our purpose is to automatically extract domain-specific terms, including compound and simple nouns, from a domain corpus.

slide17
Scoring of Simple Nouns
  • Left-adjoining nouns L_i (freq.): noun (3), character (1), class (1); n = 3
  • Target simple noun N: trigram
  • Right-adjoining nouns R_j (freq.): statistics (2), acquisition (1); m = 2

LN(trigram) = 3 + 1 + 1 = 5, RN(trigram) = 2 + 1 = 3

Principle: A simple noun that contributes to forming a large number of compound nouns has a high score.
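
The counts behind LN and RN can be gathered from a frequency table of compound nouns. The sketch below is one possible way to do this, assuming each compound is already segmented into simple nouns; the function name is hypothetical, and the five compounds are read off the slide's "trigram" example (left neighbors noun, character, class; right neighbors statistics, acquisition).

```python
from collections import defaultdict

def adjacency_stats(compound_freqs):
    """Collect, for every simple noun, the frequency of each noun seen
    directly to its left and directly to its right inside compounds."""
    left = defaultdict(lambda: defaultdict(int))   # left[N][L_i] = freq
    right = defaultdict(lambda: defaultdict(int))  # right[N][R_j] = freq
    for nouns, freq in compound_freqs.items():
        for first, second in zip(nouns, nouns[1:]):
            right[first][second] += freq
            left[second][first] += freq
    return left, right

# Compound nouns (as tuples of simple nouns) with their frequencies.
compound_freqs = {
    ("noun", "trigram"): 3,
    ("character", "trigram"): 1,
    ("class", "trigram"): 1,
    ("trigram", "statistics"): 2,
    ("trigram", "acquisition"): 1,
}
left, right = adjacency_stats(compound_freqs)
LN = sum(left["trigram"].values())   # total frequency of left neighbors
RN = sum(right["trigram"].values())  # total frequency of right neighbors
n = len(left["trigram"])             # number of distinct left neighbors
m = len(right["trigram"])            # number of distinct right neighbors
print(LN, n, m, RN)
```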

scoring of compound nouns gm compound noun
Scoring of compound nouns: GM(Compound Noun)

GM(CN) is a geometric mean, which does not depend on the length of CN.
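
The formula itself did not survive in this transcript. A form consistent with the worked example on the following slide, GM(trigram) = ((5+1)x(3+1))^(1/2), assuming the compound noun is written CN = N_1 N_2 ... N_L with L simple nouns, would be:

```latex
\mathrm{GM}(CN) = \left( \prod_{i=1}^{L} \bigl(LN(N_i) + 1\bigr)\bigl(RN(N_i) + 1\bigr) \right)^{\frac{1}{2L}}
```

The product has 2L factors, so the 2L-th root is their geometric mean, which is why the score does not grow simply because CN is longer. Adding 1 to each count keeps a noun with no neighbors on one side from zeroing the whole product.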

slide19
New scoring function: FGM(CN)

if CN occurs independently

then FGM(CN) = GM(CN) × f(CN)

where f(CN) means the number of independent occurrences of the noun CN

(= occurrences in which CN does not appear as part of a longer CN)

Ex. GM(trigram) = ((5+1) × (3+1))^(1/2) ≈ 4.9

if f(trigram) = 5

FGM(trigram) = 4.9 × 5 = 24.5
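
The worked example can be reproduced with a short sketch. It assumes that GM is the 2L-th root of the product of (LN+1)(RN+1) over the L simple nouns of CN, and that FGM multiplies GM by f(CN); both forms are inferred from the example numbers rather than quoted from the slides. Leaving the score at GM(CN) when CN never occurs independently is a further assumption.

```python
def gm(compound, ln, rn):
    """Geometric mean of (LN+1) and (RN+1) over the simple nouns of a compound."""
    product = 1.0
    for noun in compound:
        product *= (ln.get(noun, 0) + 1) * (rn.get(noun, 0) + 1)
    return product ** (1.0 / (2 * len(compound)))

def fgm(compound, ln, rn, independent_freq):
    """GM weighted by the number of independent occurrences f(CN)."""
    score = gm(compound, ln, rn)
    # Assumption: with no independent occurrence, the score stays at GM(CN).
    return score * independent_freq if independent_freq > 0 else score

ln = {"trigram": 5}  # LN(trigram) from the slide
rn = {"trigram": 3}  # RN(trigram) from the slide
gm_trigram = gm(("trigram",), ln, rn)
fgm_trigram = fgm(("trigram",), ln, rn, independent_freq=5)
print(round(gm_trigram, 1), round(fgm_trigram, 1))
```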

modified
Modified C-value

Modify C-value (Frantzi & Ananiadou, 1996) so that it can also score a simple noun

  • length(a): number of simple nouns that make up a
  • freq(a): frequency of a
  • t(a): frequency of candidate compound nouns including a
  • c(a): number of distinct candidate compound nouns including a
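
The modified formula is not preserved in this transcript. One plausible reading, offered only as an assumption, is that the (length(a) − 1) factor of the original C-value is replaced by length(a), so that a simple noun (length 1) no longer scores zero: MC-value(a) = length(a) × (freq(a) − t(a)/c(a)). The sketch below implements that reading with hypothetical counts.

```python
def mc_value(length, freq, t, c):
    """Assumed modified C-value: length(a) * (freq(a) - t(a) / c(a)).

    length : number of simple nouns in the candidate a
    freq   : frequency of a
    t      : frequency of longer candidates that include a
    c      : number of distinct longer candidates that include a
    """
    # When a is not nested in any longer candidate, there is nothing to subtract.
    nested_penalty = t / c if c > 0 else 0.0
    return length * (freq - nested_penalty)

# Hypothetical counts, not taken from the experiments.
simple_score = mc_value(length=1, freq=10, t=6, c=3)    # nested simple noun
compound_score = mc_value(length=2, freq=4, t=0, c=0)   # never nested
print(simple_score, compound_score)
```

Under the original C-value the simple noun in this example would score 0 regardless of its frequency, which is the limitation the modification is meant to remove.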

experimental evaluations
Experimental Evaluations

The data used in our experiments were developed by NII.

  • A manually POS-tagged Japanese corpus, with a gold standard of manually extracted terms, developed for the NTCIR1 TMREC task (Artificial Intelligence field: 1,870 paper abstracts)
  • The gold standard consists of 8,843 manually extracted domain-specific terms
slide22
Complete and partial match by GM (baseline)

[Figure: complete match and partial match (contained) for GM]

slide23
Number of completely matched terms by FGM and MC-value

[Figure: curves labeled MC-value − GM and FGM − GM]

slide24
Number of partially matched terms by FGM and MC-value

[Figure: curves labeled FGM − GM and MC-value − GM]

slide25
Average length of extracted terms (every 100 terms)

[Figure: curves for MC-value, GM, and FGM]

slide26
Top scored 20 terms by GM
  • candidate terms / frequency
  • 1. 知識(knowledge) 787 ○
  • 2. 学習知識(learning knowledge) 1 ○
  • 3. 学習(learning) 255 ○
  • 4. 言語的知識(linguistic knowledge) 2 ○
  • 5. 知識システム(knowledge system) 14 ○
  • 6. 学習システム(learning system) 16 ○
  • 7. 問題知識(problem knowledge) 3 ×
  • 8. 学習問題(learning problem) 5 ○
  • 9. 言語的(linguistic) 1 ○
  • 10. システム(system) 861 ○
slide27
Top scored 20 terms by GM (cont’d)
  • 11. 問題(problem) 561 ○
  • 12. 論理的知識(logical knowledge) 1 ○
  • 13. 学習支援システム(learning assistance system) 3 ○
  • 14. 設計知識(design knowledge) 29 ○
  • 15. 学習問題解決システム(learning problem solver) 1 ○
  • 16. 学習支援(learning assistance) 9 ○
  • 17. 言語的情報(linguistic information) 3 ○
  • 18. 知識モデル(knowledge model) 3 ○
  • 19. 設計システム(design system) 6 ○
  • 20. システム設計(system design) 1 ○
slide28
Top scored 20 terms by FGM
  • candidate terms / frequency
  • 1. 知識(knowledge) 787 ○
  • 2. システム(system) 861 ○
  • 3. 問題(problem) 561 ○
  • 4. 学習(learning) 255 ○
  • 5. 学習者(learner) 383 ○
  • 6. モデル(model) 356 ○
  • 7. 情報(information) 382 ○
  • 8. 問題解決(problem solving) 186 ○
  • 9. 設計(design) 183 ○
  • 10. 知識ベース(knowledge base) 149 ○
slide29
Top scored 20 terms by FGM (cont’d)
  • 11. 推論(inference) 162 ○
  • 12. 支援(assistance) 87 ×
  • 13. 知識表現(knowledge representation) 74 ○
  • 14. エージェント(agent) 256 ○
  • 15. 学習者モデル(learner’s model) 57 ○
  • 16. 機能(function) 294 ×
  • 17. 設計者(designer) 69 ○
  • 18. 対話(dialogue) 205 ○
  • 19. 言語(language) 75 ○
  • 20. 対象(object) 293 ○

slide30
Top scored 20 terms by MC-value
  • candidate terms / frequency
  • 1. 学習者(learner) 383 ○
  • 2. 問題解決(problem solving) 186 ○
  • 3. システム(system) 861 ○
  • 4. 知識(knowledge) 787 ○
  • 5. 研究(research) 651 ×
  • 6. 本稿(this paper) 594 ×
  • 7. 手法(method) 562 ×
  • 8. 問題(problem) 561 ○
  • 9. 知識ベース(knowledge base) 149 ○
  • 10. 論文(paper) 453 ×
slide31
Top scored 20 terms by MC-value (cont’d)
  • 11. 方法(method, way to do) 426 ×
  • 12. 支援システム(assistance system) 18 ×
  • 13. 計算機(computer) 128 ○
  • 14. 情報(information) 382 ○
  • 15. モデル(model) 356 ○
  • 16. 自然言語(natural language) 63 ○
  • 17. 我々(we) 332 ×
  • 18. 有効性(effectiveness) 160 ×
  • 19. エキスパートシステム(expert system) 78 ○
  • 20. ユーザ(user) 297 ○

precision complete matched of each method
Precision (complete match) of each method

N1, N2: top two systems of NTCIR1
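
Precision over the top-ranked terms can be computed as sketched below. The sketch assumes that a complete match means an extracted candidate is exactly equal to a gold-standard term; the function name and the small English term lists are hypothetical stand-ins for the TMREC data.

```python
def precision_at(ranked_terms, gold_terms, n):
    """Fraction of the top-n ranked candidates that exactly match a gold term."""
    top = ranked_terms[:n]
    if not top:
        return 0.0
    hits = sum(1 for term in top if term in gold_terms)
    return hits / len(top)

# Hypothetical ranking and gold standard, for illustration only.
ranked = ["knowledge", "learning knowledge", "problem knowledge", "system"]
gold = {"knowledge", "learning knowledge", "system", "knowledge base"}
p_at_2 = precision_at(ranked, gold, 2)
p_at_4 = precision_at(ranked, gold, 4)
print(p_at_2, p_at_4)
```

Evaluating the same function at increasing n gives a precision curve that can be compared across scoring methods.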

precision of each method when large number of terms extracted
Precision of each method when large number of terms extracted

N1, N2: top two systems of NTCIR1 

conclusions
Conclusions-1

New statistical methods for ATR (automatic term recognition), which are basically based on how many nouns adjoin the simple noun in question to form compound nouns.

FGM

・Best at extracting a small number (up to 1,400) of high-quality domain-specific terms

・Longer terms that include correct terms are better extracted by FGM or GM

MC-value

・Strong at extracting a large number (up to 6,000) of domain-specific terms

slide36
Conclusions-2
  • The Web is perceived as a gigantic knowledge resource, but it has yet to be fully utilized.
  • Terminology in various domains is sure to be the gateway to each domain for novices and even for experts.
  • More readily usable ATR is needed.