
How Large a Corpus do We Need: Statistical Method vs. Rule-based Method



Hai Zhao, Yan Song and Chunyu Kit. Department of Computer Science and Engineering, Shanghai Jiao Tong University, China. [email protected]. 2010.05.20.


Presentation Transcript



How Large a Corpus do We Need: Statistical Method vs. Rule-based Method

Hai Zhao, Yan Song and Chunyu Kit

Department of Computer Science and Engineering

Shanghai Jiao Tong University, China

[email protected]

2010.05.20



Motivation

  • If

    • corpus scale is the only factor that affects the learning performance,

  • then

    • how large an annotated corpus do we need for a specific performance metric?



Zipf’s Law

  • Data sparseness becomes serious
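
For reference, Zipf's law is commonly stated as below. This is the standard textbook formulation rather than something taken from the slide, and the symbols f, r, C and α are notation assumed here.

```latex
% f(r): frequency of the word with frequency rank r
% C: corpus-dependent constant, \alpha: exponent (close to 1 for natural language)
f(r) \approx \frac{C}{r^{\alpha}}, \qquad \alpha \approx 1
```

Under this distribution most word types are rare, so a corpus of any fixed size leaves many words unseen or seen only a handful of times, which is the sparseness the slide refers to.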



Choosing the Task: Chinese Word Segmentation

  • A special case of tokenization in natural language processing (NLP) for many languages that have no explicit word delimiters such as spaces.

  • Original:

    • 她来自苏格兰

    • She comes from SU GE LAN

      Meaningless!

  • Segmented:

    • 她/来/自/苏格兰

    • She comes from Scotland.

      Meaningful!



Why the Task (CWS)

  • A simple task

  • Both statistical and rule-based methods are available for this task

  • Multiple standard large-scale annotated corpora are also available

  • A word-oriented task, which is exactly the kind of data that Zipf’s law describes



Performance Metric

  • Evaluation metric, F-score:

    F = 2RP/(R+P)

  • R: recall, the proportion of correctly segmented words among all words in the gold-standard segmentation

  • P: precision, the proportion of correctly segmented words among all words in the segmenter’s output
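
The definitions above can be sketched in Python as follows. This is a minimal illustration, not the authors' evaluation script: it assumes a word counts as correctly segmented when its character span matches a span in the gold standard exactly, and the helper names `word_spans` and `prf` are made up for this example.

```python
def word_spans(words):
    """Convert a word sequence into a set of (start, end) character spans."""
    spans, start = set(), 0
    for w in words:
        spans.add((start, start + len(w)))
        start += len(w)
    return spans


def prf(gold_words, pred_words):
    """Return (precision, recall, F-score) for one segmented sentence."""
    gold, pred = word_spans(gold_words), word_spans(pred_words)
    correct = len(gold & pred)  # words with identical boundaries in both
    r = correct / len(gold) if gold else 0.0
    p = correct / len(pred) if pred else 0.0
    f = 2 * r * p / (r + p) if r + p else 0.0
    return p, r, f


gold = ["她", "来", "自", "苏格兰"]
pred = ["她", "来", "自", "苏", "格", "兰"]  # character-by-character output
p, r, f = prf(gold, pred)
print(f"P={p:.2f} R={r:.2f} F={f:.2f}")
```

Here three of the four gold words are recovered (R = 0.75) while only three of the six predicted words are correct (P = 0.5), so the F-score is 0.6.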



Data Sets and Approaches

[Table: data sets, with corpus sizes given in number of characters]

  • Approaches

    • CRFs as the statistical method: learning from an annotated corpus

    • Forward maximal matching algorithm (FMM) as the rule-based method: performs segmentation based on a predefined lexicon

  • Comparable:

    • The FMM lexicon is extracted from the same annotated corpus that the CRF model is trained on
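
A minimal Python sketch of forward maximal matching is given below. It follows the usual description of the algorithm (scan left to right, always take the longest lexicon entry that matches at the current position) and is not the authors' implementation. The function name `fmm_segment`, the `max_len` parameter, and the single-character fallback for unmatched text are assumptions made for this example.

```python
def fmm_segment(text, lexicon, max_len=None):
    """Segment text greedily, taking the longest lexicon word at each position."""
    if max_len is None:
        # Longest entry in the lexicon bounds how far ahead we need to look.
        max_len = max((len(w) for w in lexicon), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until something matches.
        for j in range(min(len(text), i + max_len), i, -1):
            # Assumed fallback: emit a single character when nothing matches.
            if text[i:j] in lexicon or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words


# Toy lexicon built from the segmented example sentence on the earlier slide.
lexicon = {"她", "来", "自", "苏格兰"}
print("/".join(fmm_segment("她来自苏格兰", lexicon)))
```

Because the method only looks words up, its quality depends entirely on lexicon coverage: any word missing from the lexicon is broken into shorter entries or single characters.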



Data Splitting

  • Overcome data sparseness by splitting the training corpus



Learning Curves: CRFs vs. FMM



CRFs Performance vs. Corpus Scale

  • Exponential enlargement of the corpus gives linear performance improvement
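
One way to write this relationship down is a log-linear learning curve. The form below is an illustrative reading of the statement, not a formula recovered from the slide; s stands for corpus size, and the constants a and b are hypothetical fitting parameters.

```latex
% Assumed log-linear learning curve for the statistical method
F(s) \approx a + b \log s
% Doubling the corpus then adds only a constant to the score
F(2s) - F(s) \approx b \log 2
```

Under this reading, each fixed gain in F-score requires multiplying the corpus size by a constant factor, so the annotation cost grows exponentially with the target score.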



FMM: about the Lexicon

  • Let L denote the size of the lexicon and s the size of the corpus from which the lexicon is extracted; then we have [equation not preserved in the transcript]

  • And the F-score given by FMM is [equation not preserved in the transcript]



FMM Performance vs. Corpus Scale



FMM Lexicon Size vs. Performance



OOV Issue

  • Of special interest in CWS: out-of-vocabulary (OOV) words are those that appear in the test corpus but are absent from the training corpus.

  • The OOV rate, the proportion of OOV words among all words in the test corpus, heavily affects segmentation performance.
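
The OOV rate defined above can be computed along these lines. This is a small sketch that assumes the rate is counted over word tokens (every occurrence in the test corpus counts), which the slide does not state explicitly; the function name `oov_rate` is made up for this example.

```python
def oov_rate(train_words, test_words):
    """Fraction of test-corpus word tokens never seen in the training corpus."""
    vocab = set(train_words)
    if not test_words:
        return 0.0
    # Assumption: count tokens, so a repeated unseen word counts each time.
    oov = sum(1 for w in test_words if w not in vocab)
    return oov / len(test_words)


train = ["她", "来", "自", "苏格兰"]
test = ["她", "来", "自", "英格兰"]  # the last word is unseen in training
print(oov_rate(train, test))
```

For a lexicon-based method such as FMM, every OOV word is a guaranteed error source, since the word cannot be matched as a whole.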



OOV Rate vs. Corpus Scale



Fitting OOV Rate



Conclusions

  • Bad news: the statistical method requires an exponential increase in annotated corpus scale to overcome the sparseness caused by Zipf’s law.

    • Enlarging the annotated corpus is therefore not a good way to improve the statistical method’s performance.

  • A small surprise: the rule-based method only requires a negative-inverse increase in corpus (lexicon) scale.

    • Is the rule-based method more effective than the statistical one?

    • A lexicon is much cheaper to build than an annotated corpus (text).



Thanks!

