How Large a Corpus do We Need: Statistical Method vs. Rule-based Method

Hai Zhao, Yan Song and Chunyu Kit

Department of Computer Science and Engineering

Shanghai Jiao Tong University, China

[email protected]

2010.05.20


Motivation

  • If

    • corpus scale is the only factor that affects the learning performance,

  • then

    • how large an annotated corpus do we need for a specific performance metric?


Zipf’s Law

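  • Word frequency falls off roughly in inverse proportion to frequency rank, so most words are rare and data sparseness becomes serious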
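To make the sparseness point concrete, here is a minimal sketch of how token coverage grows under an idealized Zipf distribution. The function name `zipf_coverage`, the exponent `alpha`, and the vocabulary sizes are illustrative assumptions, not values from the slides.

```python
def zipf_coverage(top_k: int, vocab_size: int, alpha: float = 1.0) -> float:
    """Share of all token occurrences covered by the top_k most frequent words,
    assuming the frequency at rank r is proportional to 1 / r**alpha."""
    if not 0 <= top_k <= vocab_size:
        raise ValueError("top_k must lie between 0 and vocab_size")
    # Unnormalized Zipf weights for ranks 1..vocab_size (idealized assumption).
    weights = [1.0 / rank ** alpha for rank in range(1, vocab_size + 1)]
    return sum(weights[:top_k]) / sum(weights)


if __name__ == "__main__":
    # Each tenfold increase in the number of known words adds a similar
    # amount of coverage, which is why sparseness is hard to eliminate.
    for k in (100, 1_000, 10_000, 100_000):
        print(k, round(zipf_coverage(k, 100_000), 3))
```

Under this assumption, coverage grows roughly with the logarithm of the number of words known, so the rare tail is never cheap to cover.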


Choosing the Task: Chinese Word Segmentation

  • A special case of tokenization in natural language processing (NLP) for many languages that have no explicit word delimiters such as spaces.

  • Original:

    • 她来自苏格兰

    • She comes from SU GE LAN

      Meaningless!

  • Segmented:

    • 她/来/自/苏格兰

    • She comes from Scotland.

      Meaningful!


Why the Task (CWS)

  • A simple task

  • Both statistical and rule-based methods are available for this task

  • Multiple standard large-scale annotated corpora are available, too

  • A word-oriented task, which is exactly the level at which Zipf’s law applies


Performance Metric

  • Evaluation Metric, F-score:

    F=2RP/(R+P)

  • R: recall, the proportion of correctly segmented words among all words in the gold-standard segmentation

  • P: precision, the proportion of correctly segmented words among all words in the segmenter’s output
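The metric above can be sketched in a few lines. This version counts a word as correct only when both of its boundaries match the gold standard, which is the usual convention but an assumption here; the helper names and the predicted segmentation in the usage note are hypothetical.

```python
def to_spans(words):
    """Convert a word sequence into a set of (start, end) character spans."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def segmentation_prf(gold_words, predicted_words):
    """Return (precision, recall, F-score) for one segmented text."""
    gold, pred = to_spans(gold_words), to_spans(predicted_words)
    correct = len(gold & pred)  # words whose two boundaries both match
    recall = correct / len(gold) if gold else 0.0
    precision = correct / len(pred) if pred else 0.0
    if correct == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * recall * precision / (recall + precision)
```

For example, with the gold segmentation 她/来/自/苏格兰 from the earlier slide and a hypothetical output 她/来自/苏格兰, two words match, so R = 2/4, P = 2/3 and F = 4/7.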


Data Sets and Approaches

  • Data sets: sizes are given in number of characters [table not preserved]

  • Approaches

    • CRFs as the statistical method: learning from an annotated corpus

    • Forward maximal matching algorithm (FMM) as the rule-based method: perform segmentation based on a predefined lexicon

  • Comparable:

    • The FMM lexicon is extracted from the same annotated corpus that the CRFs are trained on
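The rule-based side of this comparison can be sketched as follows: a lexicon is collected from segmented training sentences, and forward maximal matching greedily takes the longest lexicon word at each position. The function names are illustrative, and falling back to a single character when nothing matches is a common convention that is assumed here rather than taken from the slides.

```python
def build_lexicon(segmented_sentences):
    """Collect every word type from an annotated (already segmented) corpus."""
    return {word for sentence in segmented_sentences for word in sentence}


def fmm_segment(text, lexicon, max_word_len=None):
    """Forward maximal matching: scan left to right, always taking the
    longest lexicon entry that starts at the current position."""
    if max_word_len is None:
        max_word_len = max((len(word) for word in lexicon), default=1)
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first and shrink until something matches.
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            # Assumed fallback: an unmatched single character is its own word.
            if length == 1 or candidate in lexicon:
                words.append(candidate)
                i += length
                break
    return words
```

With a lexicon built from the single training sentence 她/来/自/苏格兰, `fmm_segment("她来自苏格兰", lexicon)` reproduces that segmentation; with an empty lexicon, every character becomes its own word.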


Data Splitting

  • Overcome data sparseness by splitting the training corpus


Learning Curves: CRFs vs. FMM


CRFs Performance vs. Corpus Scale

  • Exponential enlargement of the corpus gives only linear performance improvement
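One way to read this observation is as a log-linear learning curve. The form below and its constants a and b are an assumed illustration; the fitted values from the slide's plot are not preserved in this transcript.

```latex
% Assumed log-linear learning curve for corpus size s (constants a, b > 0 are hypothetical)
F(s) \approx a + b \ln s
% Solving for the corpus size needed to reach a target F-score
s(F) \approx \exp\!\left(\frac{F - a}{b}\right)
% Cost of a fixed gain \Delta: the corpus must grow by a constant factor
\frac{s(F + \Delta)}{s(F)} \approx \exp\!\left(\frac{\Delta}{b}\right)
```

Under this reading, every additional fixed gain in F-score multiplies the required corpus size by the same factor instead of adding a fixed amount of text.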


FMM: about the Lexicon

  • Let L denote the size of the lexicon and s the size of the corpus from which the lexicon is extracted; then we have [equation not preserved]

  • and the F-score given by FMM is [equation not preserved]


FMM Performance vs. Corpus Scale


FMM Lexicon Size vs. Performance


OOV issue

  • Of special interest in CWS: out-of-vocabulary (OOV) words are those that appear in the test corpus but are absent from the training corpus.

  • The OOV rate, the proportion of OOV words among all words in the test corpus, heavily affects segmentation performance.
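The OOV rate defined above can be computed as in this minimal sketch. It counts word tokens (every occurrence in the test corpus), which is the usual convention but an assumption here; the function name is illustrative.

```python
def oov_rate(train_words, test_words):
    """Fraction of test word tokens that never occur in the training corpus."""
    vocabulary = set(train_words)
    test_words = list(test_words)
    if not test_words:
        return 0.0
    oov_count = sum(1 for word in test_words if word not in vocabulary)
    return oov_count / len(test_words)
```

Because the FMM lexicon is extracted from the training corpus, every OOV word is by construction missing from that lexicon too.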


OOV rate vs. Corpus Scale


Fitting OOV Rate


Conclusions

  • Bad news: the statistical method needs an exponential increase in annotated corpus scale to overcome the sparseness caused by Zipf’s law.

    • Enlarging the annotated corpus is therefore not an efficient way to improve the statistical method’s performance.

  • A small surprise: the rule-based method only needs a negative inverse increase of corpus (lexicon) scale.

    • Is the rule-based method more effective than the statistical one?

    • A lexicon is much cheaper to build than an annotated corpus (text).


Thanks!

