Chinese term extraction based on delimiters

Yuhang Yang, Qin Lu, Tiejun Zhao. School of Computer Science and Technology, Harbin Institute of Technology; Department of Computing, The Hong Kong Polytechnic University. May 2008.

Presentation Transcript

Chinese Term Extraction Based on Delimiters

Yuhang Yang, Qin Lu, Tiejun Zhao

School of Computer Science and Technology, Harbin Institute of Technology

Department of Computing,

The Hong Kong Polytechnic University

May 2008


Outline

  • Introduction

  • Related Works

  • Methodology

  • Experiment and Discussion

  • Conclusion


Basic Concepts

  • Terms (terminology): lexical units that carry the most fundamental knowledge of a domain

  • Term extraction

    • Term candidate extraction

      • Unithood

    • Terminology verification

      • Termhood


Major Problems

  • Term boundary identification based on term features

    • Too few features are insufficient

    • More features lead to more conflicts

  • Limitations in scope

    • Low-frequency terms

    • Long compound terms

    • Dependency on Chinese segmentation


Main Idea

  • Delimiter-based term candidate extraction: identify the relatively stable, domain-independent words immediately before and after terms

    • 扫描隧道显微镜是一种基于量子隧道效应的高分辨率显微镜

      Scanning tunneling microscope is a kind of quantum tunnelling effect-based high-resolution microscope

    • 社会主义制度是中华人民共和国的根本制度

      Socialist system is the basic system of the People's Republic of China

  • Potential Advantages of the proposed approach

    • No strict limits on frequency or word length

    • No need for full segmentation

    • Relatively domain independent


Related Works: Statistic-based Measures

  • Internal measure (Schone and Jurafsky, 2001)

    Associative measures between the constituent characters of a candidate, such as:

    • Frequency

    • Mutual information

  • Contextual measure

    Dependency of a candidate on its context:

    • The left/right entropy (Sornlertlamvanich et al., 2000)

    • The left/right context dependency (Chien, 1999)

    • Accessor variety criteria (Feng et al., 2004).
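
As an illustration of the contextual measures listed above, the following is a minimal sketch of left/right entropy computed over the characters adjacent to a candidate string. It is not the formulation from any of the cited papers; the function name `boundary_entropy` and the character-level counting are assumptions made for this example. A candidate whose neighbours vary widely (high entropy on both sides) is more likely to be a complete unit.

```python
import math
from collections import Counter


def boundary_entropy(text, candidate):
    """Return (left_entropy, right_entropy) of the characters adjacent to candidate.

    Hypothetical helper for illustration: contexts are single characters and
    occurrences are found by plain substring search.
    """
    left, right = Counter(), Counter()
    start = text.find(candidate)
    while start != -1:
        end = start + len(candidate)
        if start > 0:
            left[text[start - 1]] += 1   # character just before this occurrence
        if end < len(text):
            right[text[end]] += 1        # character just after this occurrence
        start = text.find(candidate, start + 1)

    def entropy(counter):
        total = sum(counter.values())
        if total == 0:
            return 0.0
        return -sum(c / total * math.log2(c / total) for c in counter.values())

    return entropy(left), entropy(right)
```

The number of distinct keys in `left` and `right` would give a rough accessor-variety style count under the same assumptions.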


Hybrid Approaches

  • The UnitRate algorithm (Chen et al., 2006)

    occurrence probability + marginal variety probability

  • The TCE_SEF&CV algorithm (Ji et al., 2007)

    significance estimation function + C-value measure

  • Limitations

    • Data sparseness for low-frequency terms and long terms

    • Cascading errors from full segmentation


Observations

  • Sentences are made up of substantives and function words

  • Domain-specific terms (terms for short) are more likely to be domain substantives

  • Predecessors and successors of terms are more likely to be function words or general substantives connecting terms

    • Predecessors and successors are markers of terms, referred to as term delimiters (or simply delimiters)


Delimiter Based Term Extraction

  • Characteristics of delimiters

    • Mainly functional words and general substantives

    • Relatively stable

    • Domain independent

    • Can be extracted more easily

  • Proposed model

    • Identify the features of delimiters

    • Identify terms by finding their predecessors and successors, which serve as their boundary words


Algorithm Design

TCE_DI (Term Candidate Extraction – Delimiter Identification)

  • Input: Corpusextract (domain corpus), DList (delimiter list)

  • (1) Partition Corpusextract into character strings at punctuation marks.

  • (2) Partition the character strings at delimiters to obtain term candidates.

    • If a string contains no delimiter, the whole string is regarded as a term candidate.
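
The two partitioning steps above can be sketched roughly as follows. This is an illustrative sketch, not the authors' implementation: the function name `extract_candidates`, the `PUNCTUATION` set, and the choice to match delimiters as plain substrings (longest first) are all assumptions, since the slides do not specify them.

```python
import re

# Hypothetical punctuation set; the slides do not list the marks actually used.
PUNCTUATION = "，。、；：？！（）“”\"',.;:?!()"


def extract_candidates(corpus, delimiters):
    """Split text at punctuation, then at delimiters, and return the pieces."""
    # Step (1): partition the corpus into character strings at punctuation.
    punct_pattern = "[" + re.escape(PUNCTUATION) + r"\s]+"
    segments = [s for s in re.split(punct_pattern, corpus) if s]
    if not delimiters:
        return segments

    # Step (2): partition each string at delimiters.
    # Longest delimiters come first so they win over shorter ones they contain.
    ordered = sorted(delimiters, key=len, reverse=True)
    delim_pattern = "|".join(re.escape(d) for d in ordered)
    candidates = []
    for segment in segments:
        # A segment with no delimiter is returned whole by re.split.
        candidates.extend(p for p in re.split(delim_pattern, segment) if p)
    return candidates
```

With the delimiters 是 and 的, the example sentence 社会主义制度是中华人民共和国的根本制度 splits into 社会主义制度, 中华人民共和国 and 根本制度. Note that no word segmentation of the input is needed for this step.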


Acquisition of DList

  • From a given stop word list

    • Produced by experts or from a general corpus

    • No training is needed

  • DList_Ext algorithm

    • Given a training corpus CorpusD_training, and

    • A domain lexicon LexiconDomain


The DList_Ext algorithm

  • S1: For each term Ti in LexiconDomain, mark Ti in CorpusD_training as a lexical unit

  • S2: Segment the remaining text

  • S3: Extract the predecessors and successors of all Ti as delimiter candidates

  • S4: Remove all Ti from the delimiter candidates

  • S5: Rank the delimiter candidates by frequency, using a simple threshold NDI
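
Steps S3 to S5 can be sketched as below. This is a hedged illustration only: it assumes S1 and S2 have already been done, so the input is a list of token lists in which every domain term is a single token. The function name `extract_delimiters` is hypothetical, and reading NDI as "the number of top-ranked candidates to keep" is an assumption; the slides only call it a simple threshold.

```python
from collections import Counter


def extract_delimiters(sentences, lexicon, n_di):
    """Return the n_di most frequent neighbours of domain terms.

    sentences: list of token lists, with domain terms already marked as
               single tokens (S1) and the rest segmented (S2).
    lexicon:   set of domain terms.
    n_di:      assumed here to be the number of candidates to keep.
    """
    counts = Counter()
    for tokens in sentences:
        for i, token in enumerate(tokens):
            if token not in lexicon:
                continue
            # S3: the immediate predecessor and successor are delimiter candidates.
            if i > 0:
                counts[tokens[i - 1]] += 1
            if i + 1 < len(tokens):
                counts[tokens[i + 1]] += 1
    # S4: a domain term adjacent to another term is not a delimiter.
    for term in lexicon:
        counts.pop(term, None)
    # S5: rank by frequency and keep the top n_di candidates.
    return [word for word, _ in counts.most_common(n_di)]
```

If NDI were instead a minimum frequency, the last line would filter on the count rather than truncate the ranking.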


Experiments: Data Preparation

Delimiter List

  • DListIT: extracted using CorpusIT_Small and LexiconIT

  • DListLegal: extracted using CorpusLegal_Small and LexiconLegal

  • DListSW: 494 general stop words


Performance Measurements

  • Evaluation: precision (by sampling) and rate of new term extraction (RNTE)

  • Reference algorithms

    • SEF&C-value (Ji et al., 2007) for term candidate extraction

    • TFIDF (Frank et al., 1999) for both term candidate extraction and terminology verification

  • LA_TV (Link Analysis based – Terminology Verification) for fair comparison


Evaluation: DList_Ext algorithm: NDI

Coverage of Delimiters on Different Corpora


Evaluation: DList_Ext algorithm: NDI

Frequency of Delimiters on Domain Corpora


Evaluation: DList_Ext algorithm: NDI

Performance of DListIT on CorpusIT_Large

Performance of DListLegal on CorpusIT_Large


NDI = 500

Performance of DListIT on CorpusLegal_Large

Performance of DListLegal on CorpusLegal_Large


Evaluation on Term Extraction

Performance of Different Algorithms on IT Domain and Legal Domain


Performance Analysis

  • Domain independent and stable delimiters

    • Easily extracted and useful

  • Larger granularity of domain specific terms

    • Keeping many noisy strings out

  • Less frequency sensitivity

    • Concentrating on delimiters without regard to the frequencies of the candidates


Evaluation on New Term Extraction: RNTE

Performance of Different Algorithms for New Term Extraction


Error Analysis

  • Figure-of-speech phrases

    • “不难看出” (it is not difficult to see that ...)

    • “新方法中” (in the new methods)

  • General words

    • “思维状态” (mental state)

    • “建筑” (architecture)

  • Long strings which contain short terms

    • “访问共享资源” (access shared resources)

    • “再次遍历” (traverse again)


Conclusion

  • A delimiter-based approach for term candidate extraction

  • Advantages

    • Less sensitivity to term frequency

    • Requiring little prior domain knowledge, relatively less adaptation for new domains

    • Quite significant improvements for term extraction

    • Much better performance for new term extraction

  • Future work

    • Improving overall term extraction algorithms

    • Applying to related NLP tasks such as NER

    • Applying to other languages


Q & A

Thank you!

