Chinese Word Segmentation and Statistical Machine Translation

Presenter: Wu, Jia-Hao

Authors: RUIQIANG ZHANG, KEIJI YASUDA, EIICHIRO SUMITA

國立雲林科技大學 National Yunlin University of Science and Technology

TOSLP (2008)


Outline
  • Motivation
  • Objective
  • Methodology
    • Dictionary-based
    • CRF-based
  • Experiments
  • Conclusion
  • Personal Comments
Motivation
  • Chinese word segmentation (CWS) is a necessary step in Chinese-English statistical machine translation (SMT).
  • However, building a CWS system involves many choices, such as the segmentation specification and the CWS method.

Ex: 我們要發展中國家用電器

Segmentation 1: 我們 / 要 / 發展 / 中國 / 家用電器

We / want / to develop / China's / home electrical appliances.

Segmentation 2: 我們 / 要 / 發展中國家 / 用 / 電器

We / want / developing countries / to use / electrical appliances.

[Figure: Chinese word segmentation → Statistical machine translation. Slide note: Chinese names are given by Roman phonetic transcription.]

Objective
  • The authors created 16 CWS schemes under different settings to examine the relationship between CWS and SMT.
  • They also tested two CWS methods: the dictionary-based and the CRF-based approach.
  • They propose two approaches for combining the advantages of different specifications:
    • A simple concatenation of training data.
    • Linear interpolation of multiple translation models.
Methodology - Dictionary-based
  • A pure dictionary-based CWS does not recognize out-of-vocabulary (OOV) words.
  • The authors combined an N-gram language model with dictionary-based word segmentation.
    • For a given Chinese character sequence C = c_0 c_1 c_2 … c_N,
    • the segmenter outputs the word sequence W = w_t0 w_t1 w_t2 … w_tM
      • that scores best under the N-gram language model, subject to every word matching the characters it covers.
      • The match is expressed with δ(u,v), which equals 1 if both arguments are the same and 0 otherwise.
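The dictionary-plus-language-model idea on this slide can be sketched with dynamic programming. This is a minimal illustration, not the authors' implementation: it swaps in a unigram model as the simplest N-gram, and the lexicon and its probabilities are hypothetical values made up for the example. The dictionary lookup plays the role of δ(u,v): a span only counts if it equals a lexicon word.

```python
import math

# Hypothetical toy lexicon with made-up unigram probabilities (not from the paper).
LEXICON = {
    "我們": 0.05, "要": 0.05, "發展": 0.03, "中國": 0.04,
    "家用電器": 0.01, "發展中國家": 0.005, "用": 0.03, "電器": 0.01,
}
MAX_WORD_LEN = max(len(word) for word in LEXICON)


def segment(chars):
    """Return the in-lexicon segmentation with the highest unigram log-probability.

    Returns an empty list when no segmentation exists, which is what happens
    when the input contains an out-of-vocabulary character.
    """
    n = len(chars)
    # best[i] = (log-probability, words) for the best segmentation of chars[:i]
    best = [(-math.inf, [])] * (n + 1)
    best[0] = (0.0, [])
    for end in range(1, n + 1):
        for start in range(max(0, end - MAX_WORD_LEN), end):
            word = chars[start:end]
            # Dictionary match: the span must be exactly a lexicon word.
            if word in LEXICON and best[start][0] > -math.inf:
                score = best[start][0] + math.log(LEXICON[word])
                if score > best[end][0]:
                    best[end] = (score, best[start][1] + [word])
    return best[n][1]


print(segment("我們要發展中國家用電器"))
```

With these toy probabilities the sketch prefers the "China's home electrical appliances" reading of the example sentence; different probabilities would flip the choice, which is the ambiguity the Motivation slide points at.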

Methodology - CRF-based IOB Tagging
  • Each character of a word is labeled:
    • B if it is the first character of a multiple-character word;
    • O if the character functions as an independent word;
    • I otherwise.
  • Ex: 全北京市 ("all of Beijing City") is labeled 全/O 北/B 京/I 市/I
  • The CRF models the probability of an IOB tag sequence T = t_0 t_1 … t_M given the word sequence W = w_0 w_1 … w_M.

Bigram features: absolute counts of each feature in the training data were used, with a cutoff value defined for each feature type.

Unigram features: w_0, w_-1, w_1, w_-2, w_2, w_0 w_-1, w_0 w_1, w_-1 w_1, w_-2 w_-1, w_2 w_0 (subscripts are positions relative to the current token).
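The IOB labeling scheme above can be sketched as a pair of conversions between a segmented word list and per-character tags. The function names `words_to_iob` and `iob_to_words` are hypothetical helpers for illustration. This covers only the labeling convention, not the CRF that predicts the tags.

```python
def words_to_iob(words):
    """Label every character: O = single-character word, B = first character
    of a multi-character word, I = any later character of that word."""
    tagged = []
    for word in words:
        if len(word) == 1:
            tagged.append((word, "O"))
        else:
            tagged.append((word[0], "B"))
            tagged.extend((ch, "I") for ch in word[1:])
    return tagged


def iob_to_words(tagged):
    """Rebuild the word sequence from (character, tag) pairs."""
    words = []
    for ch, tag in tagged:
        if tag == "I" and words:
            words[-1] += ch   # continue the current multi-character word
        else:
            words.append(ch)  # B or O starts a new word
    return words


tags = words_to_iob(["全", "北京市"])
print(tags)
print(iob_to_words(tags))
```

Because the two functions are inverses on well-formed input, a tagger that predicts B/I/O per character yields a segmentation directly.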

Methodology - Achilles
  • An in-house CWS that includes both the dictionary-based and the CRF-based approach.
    • Dictionary-based
      • Zero OOV recognition rate.
      • Higher in-vocabulary rate.
    • CRF-based
      • Higher OOV recognition rate than dictionary-based.
      • Best F-scores.
Methodology - Phrase-Based SMT
  • The method uses a framework of log-linear models to integrate multiple features:
    P(E|F) = exp(Σ_i λ_i f_i(F,E)) / Σ_E' exp(Σ_i λ_i f_i(F,E'))
  • where f_i(F,E) is the logarithmic value of the i-th feature and λ_i is the weight of the i-th feature. The target sentence candidate E that maximizes P(E|F) is the solution.
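A minimal sketch of how a log-linear model picks a translation, assuming each candidate already carries its feature log-values. The feature names (`tm`, `lm`), the candidate strings, and all numbers are hypothetical. Since the normalizer is the same for every candidate, the argmax only needs the weighted sum Σ λ_i f_i(F,E).

```python
import math


def loglinear_score(features, weights):
    """Weighted sum of feature log-values: sum_i lambda_i * f_i(F, E)."""
    return sum(weights[name] * value for name, value in features.items())


def best_candidate(candidates, weights):
    """Pick the candidate E with the highest score, i.e. the highest P(E|F)."""
    return max(candidates, key=lambda c: loglinear_score(c["features"], weights))


def posterior(candidates, weights):
    """Normalized P(E|F) over the given candidate list (softmax of the scores)."""
    scores = [loglinear_score(c["features"], weights) for c in candidates]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical candidates with made-up translation-model and language-model log-values.
candidates = [
    {"text": "develop China's home appliances", "features": {"tm": -2.0, "lm": -3.0}},
    {"text": "developing countries use appliances", "features": {"tm": -1.5, "lm": -5.0}},
]
weights = {"tm": 1.0, "lm": 0.5}

print(best_candidate(candidates, weights)["text"])
print(posterior(candidates, weights))
```

In a real system the weights would be tuned on held-out data and the candidate list would come from the decoder; both are outside this sketch.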
Experiments
  • The data used in the experiments were provided by the LDC; the English sentences of that data are used together with the Xinhua news portion of the LDC Gigaword English corpus.
  • Implementation of CWS schemes
    • Tokens: the total number of words in the training data.
    • Unique words: the lexicon size of the segmented training data.
    • OOVs: the unknown words in the test data.
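The three statistics defined above can be computed from segmented text in a few lines. This is an illustrative sketch: the helper name `corpus_stats` and the toy sentences are hypothetical, and it counts OOVs as distinct word types, which is an assumption since the slide does not say whether types or occurrences are counted.

```python
def corpus_stats(train_sentences, test_sentences):
    """Compute token count, lexicon size, and OOV count for segmented data.

    Each sentence is a list of words as produced by a segmenter.
    """
    train_tokens = [word for sentence in train_sentences for word in sentence]
    lexicon = set(train_tokens)
    # Assumption: OOVs are counted as distinct test words absent from the lexicon.
    oovs = {word for sentence in test_sentences for word in sentence
            if word not in lexicon}
    return {
        "tokens": len(train_tokens),
        "unique_words": len(lexicon),
        "oovs": len(oovs),
    }


train = [["我們", "要", "發展"], ["我們", "用", "電器"]]
test = [["我們", "要", "家用電器"]]
print(corpus_stats(train, test))
```

A coarser segmentation specification tends to give fewer tokens and a larger lexicon, which is why these numbers differ across CWS schemes.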
Experiment
  • The effect of CWS specifications on SMT.
Experiment - Combining multiple CWS schemes
  • Effect of combining training data from multiple CWS specifications.
    • A new CWS scheme, dict-hybrid, is created by combining AS, CITYU, MSR, and PKU.
    • Training data: 49,546,231 tokens and 112,072 unique words. Test data: 693 OOVs.
Experiment
  • Effect of feature interpolation of translation models.
    • The authors generated multiple translation models by using different word segmenters.
    • The phrase translation model p(e|f) can be linearly interpolated as
      p(e|f) = Σ_{i=1}^{S} α_i p_i(e|f)
    • where p_i(e|f) is the phrase translation model corresponding to the i-th CWS, α_i is its weight, and S is the total number of models.
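The linear interpolation of phrase translation models can be sketched as a weighted sum over phrase tables. Everything concrete here is hypothetical: the tables are plain dictionaries keyed by (source phrase, target phrase), the probabilities are made up, and the check that the weights α_i sum to 1 is an assumption rather than something stated on the slide. A phrase pair missing from a table contributes probability 0 from that model.

```python
def interpolate(models, alphas):
    """Return p(e|f) = sum_i alpha_i * p_i(e|f) over all phrase pairs."""
    if len(models) != len(alphas):
        raise ValueError("need exactly one weight per model")
    if abs(sum(alphas) - 1.0) > 1e-9:
        # Assumption: weights form a convex combination.
        raise ValueError("weights are assumed to sum to 1")
    combined = {}
    for model, alpha in zip(models, alphas):
        for pair, prob in model.items():
            combined[pair] = combined.get(pair, 0.0) + alpha * prob
    return combined


# Hypothetical phrase tables from two differently segmented training sets.
model_a = {("發展中國家", "developing countries"): 0.8}
model_b = {("發展中國家", "developing countries"): 0.4,
           ("發展", "develop"): 0.6}

table = interpolate([model_a, model_b], [0.5, 0.5])
for pair, prob in table.items():
    print(pair, round(prob, 3))
```

The combined table covers phrase pairs found under either segmentation, which is one way the interpolation can draw on several CWS schemes at once.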
Conclusion
  • The authors analyzed multiple CWS specifications and built a CWS for each one to examine how they affected translation.
  • They proposed a new approach, linear interpolation of translation features, which improved translation and achieved the best BLEU score of all the CWS schemes.
Comments
  • Advantage
    • Many experiments are included to evaluate performance.
  • Drawback
    • Some interpretations of the experiments are complex.
  • Application
    • Chinese word segmentation.
    • Statistical machine translation.