Chinese Word Segmentation and Statistical Machine Translation

Presenter: Wu, Jia-Hao

Authors: Ruiqiang Zhang, Keiji Yasuda, Eiichiro Sumita

國立雲林科技大學 (National Yunlin University of Science and Technology)

TOSLP (2008)


Outline
  • Motivation
  • Objective
  • Methodology
    • Dictionary-based
    • CRF-based
  • Experiments
  • Conclusion
  • Personal Comments
Motivation
  • Chinese word segmentation is a necessary step in Chinese-English statistical machine translation.
  • However, building a CWS system involves many choices, such as the segmentation specification and the CWS method.

Ex: 我們要發展中國家用電器

Segmentation 1: 我們 要 發展 中國 家用電器

We / want / to develop / China's / home electrical appliances.

Segmentation 2: 我們 要 發展中國家 用 電器

We / want / developing countries / to use / electrical appliances.

Chinese word segmentation

Statistical machine translation

The Chinese name is rendered by romanized phonetic transcription.

Objective
  • The authors created 16 CWS schemes under different settings to examine the relationship between CWS and SMT.
  • They also tested two CWS methods: a dictionary-based approach and a CRF-based approach.
  • They propose two approaches for combining the advantages of different specifications:
    • A simple concatenation of training data.
    • Implementing linear interpolation of multiple translation models.
Methodology-Dictionary-based
  • Pure dictionary-based CWS does not recognize out-of-vocabulary (OOV) words.
  • The authors combined an N-gram language model with dictionary-based word segmentation.
    • For a given Chinese character sequence, C = c_0 c_1 c_2 … c_N
    • The word sequence, W = w_{t_0} w_{t_1} w_{t_2} … w_{t_M}
      • Which satisfies

δ(u,v) equals 1 if both arguments are the same, and 0 otherwise.
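
The equation on this slide did not survive extraction. One reading consistent with the surrounding definitions is that W is the word sequence with the highest N-gram language-model probability among those whose words, checked with δ, spell out C exactly. The sketch below illustrates that reading with a unigram model (N = 1) and dynamic programming. The function name `segment` and the toy probabilities in `toy_lm` are invented for illustration and are not taken from the paper.

```python
import math

def segment(chars, unigram_logprob):
    """Return the dictionary segmentation of `chars` with the highest
    unigram language-model score, or None if no segmentation exists."""
    n = len(chars)
    max_len = max(len(w) for w in unigram_logprob)
    # best[i] holds (score, words) for the best segmentation of chars[:i].
    best = [None] * (n + 1)
    best[0] = (0.0, [])
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = chars[start:end]
            # Delta-style constraint: a candidate word must match exactly the
            # characters it covers, so only dictionary entries qualify.
            if best[start] is None or word not in unigram_logprob:
                continue
            score = best[start][0] + unigram_logprob[word]
            if best[end] is None or score > best[end][0]:
                best[end] = (score, best[start][1] + [word])
    return None if best[n] is None else best[n][1]

# Toy probabilities, invented for illustration only.
toy_lm = {w: math.log(p) for w, p in {
    "我們": 0.05, "要": 0.05, "發展": 0.03, "中國": 0.04,
    "家用電器": 0.01, "發展中國家": 0.0001, "用": 0.03, "電器": 0.01,
    "家": 0.02, "國家": 0.02, "中": 0.02,
}.items()}

result = segment("我們要發展中國家用電器", toy_lm)
print(result)
```

With these toy numbers the first segmentation from the Motivation example wins. A string containing a character that is absent from the dictionary yields no segmentation at all, which mirrors the zero OOV recognition of the pure dictionary-based method.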

Methodology-CRF-based IOB Tagging
  • Each character of a word is labeled:
    • B if it is the first character of a multi-character word.
    • O if the character functions as an independent word.
    • I otherwise.
  • Ex: 全北京市 is labeled 全/O 北/B 京/I 市/I
  • The probability of an IOB tag sequence, T = t_0 t_1 … t_M, given the word sequence W = w_0 w_1 … w_M

Bigram features: the authors simply used the absolute count of each feature in the training data and defined a cutoff value for each feature type.

Unigram features: w_0, w_{-1}, w_1, w_{-2}, w_2, w_0 w_{-1}, w_0 w_1, w_{-1} w_1, w_{-2} w_{-1}, w_2 w_0
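
The labeling rule above can be sketched as a pair of conversion functions. This is a minimal illustration of the B/I/O scheme as the slide states it, not the authors' implementation; the names `words_to_iob` and `iob_to_words` are hypothetical.

```python
def words_to_iob(words):
    """Label every character of a segmented sentence with B, I or O."""
    tagged = []
    for word in words:
        if len(word) == 1:
            # A single character that stands alone as a word.
            tagged.append((word, "O"))
        else:
            # First character of a multi-character word, then the rest.
            tagged.append((word[0], "B"))
            tagged.extend((ch, "I") for ch in word[1:])
    return tagged

def iob_to_words(tagged):
    """Rebuild the word sequence from (character, tag) pairs."""
    words = []
    for ch, tag in tagged:
        if tag == "I" and words:
            words[-1] += ch      # continue the current word
        else:
            words.append(ch)     # B or O starts a new word
    return words

tagged = words_to_iob(["全", "北京市"])
print(" ".join(f"{ch}/{tag}" for ch, tag in tagged))
```

A tagger such as a CRF predicts the tag column; `iob_to_words` then turns the predicted tags back into a segmentation.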

Methodology-Achilles
  • Achilles is an in-house CWS tool that includes both dictionary-based and CRF-based approaches.
    • Dictionary-based
      • Zero OOV recognition rate.
      • Higher in-vocabulary recognition rate.
    • CRF-based
      • Higher OOV recognition rate than the dictionary-based approach.
      • Best F-scores.
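
The slide does not define the F-score or the OOV recognition rate. The sketch below assumes the usual word-level definitions: a predicted word counts as correct when its character span matches a gold word exactly, and OOV recall is the share of gold words missing from the training lexicon that are recovered. The function names and the toy data are hypothetical.

```python
def spans(words):
    """Map a word sequence to its set of (start, end) character spans."""
    out, pos = set(), 0
    for w in words:
        out.add((pos, pos + len(w)))
        pos += len(w)
    return out

def evaluate(gold, predicted, lexicon):
    """Return (precision, recall, f_score, oov_recall) for one sentence."""
    gold_spans, pred_spans = spans(gold), spans(predicted)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    f_score = 2 * precision * recall / (precision + recall) if correct else 0.0
    # OOV recall: gold words absent from the lexicon that were segmented correctly.
    pos, oov_total, oov_found = 0, 0, 0
    for w in gold:
        span = (pos, pos + len(w))
        pos += len(w)
        if w not in lexicon:
            oov_total += 1
            if span in pred_spans:
                oov_found += 1
    oov_recall = oov_found / oov_total if oov_total else None
    return precision, recall, f_score, oov_recall

# Toy case: the lexicon lacks 北京市, so a dictionary segmenter splits it.
p, r, f, oov = evaluate(["全", "北京市"], ["全", "北京", "市"], {"全", "北京", "市"})
print(p, r, f, oov)
```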
Methodology-Phrase-Based SMT
  • The method uses a framework of log-linear models to integrate multiple features.
  • Here f_i(F,E) is the logarithmic value of the i-th feature, and λ_i is the weight of the i-th feature. The target sentence candidate that maximizes P(E|F) is the solution.
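
The equation on this slide was lost. Assuming the standard log-linear form, P(E|F) is proportional to exp(Σ_i λ_i f_i(F,E)), so the best candidate is the one with the largest weighted feature sum and the normalizer can be ignored when searching. The sketch below shows that selection over a fixed candidate list. The feature names, weights and feature values are invented, and a real decoder searches a far larger space.

```python
import math

def loglinear_score(features, weights):
    """Weighted sum of log-valued features: sum_i lambda_i * f_i(F, E)."""
    return sum(weights[name] * value for name, value in features.items())

def best_candidate(candidates, weights):
    """Pick the candidate translation E that maximizes P(E|F)."""
    return max(candidates, key=lambda c: loglinear_score(c["features"], weights))

def posterior(candidates, weights):
    """Normalize exp(score) over the candidate list (softmax)."""
    scores = [loglinear_score(c["features"], weights) for c in candidates]
    top = max(scores)                      # subtract max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical feature names and values, for illustration only.
weights = {"translation_model": 1.0, "language_model": 0.5}
candidates = [
    {"text": "We want developing countries to use electrical appliances.",
     "features": {"translation_model": math.log(0.2), "language_model": math.log(0.1)}},
    {"text": "We want to develop China's home electrical appliances.",
     "features": {"translation_model": math.log(0.1), "language_model": math.log(0.5)}},
]
best = best_candidate(candidates, weights)
probs = posterior(candidates, weights)
print(best["text"], probs)
```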
Experiments
  • The data used in the experiments were provided by LDC; the authors also used the English sentences of that data plus the Xinhua news portion of the LDC Gigaword English corpus.
  • Implementation of CWS Schemes
    • Tokens: the total number of words in the training data.
    • Unique words: the lexicon size of the segmented training data.
    • OOVs: the unknown words in the test data.
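
The three statistics defined above can be computed from segmented data as follows. This is a minimal sketch: the slide does not say whether OOVs are counted as distinct words or as occurrences, so the hypothetical `corpus_stats` function reports both. The toy sentences are invented.

```python
def corpus_stats(train_sentences, test_sentences):
    """Compute tokens, unique words and OOV counts for segmented sentences.

    Each sentence is a list of words produced by one CWS scheme.
    """
    tokens = sum(len(sentence) for sentence in train_sentences)
    lexicon = {word for sentence in train_sentences for word in sentence}
    oov_occurrences = [word for sentence in test_sentences
                       for word in sentence if word not in lexicon]
    return {
        "tokens": tokens,
        "unique_words": len(lexicon),
        "oov_types": len(set(oov_occurrences)),   # distinct unknown words
        "oov_tokens": len(oov_occurrences),       # unknown word occurrences
    }

train = [["我們", "要", "發展"], ["我們", "用", "電器"]]
test = [["我們", "要", "家用電器"], ["家用電器"]]
stats = corpus_stats(train, test)
print(stats)
```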
Experiment
  • The effect of CWS specifications on SMT.
Experiment - Combining multiple CWS schemes
  • Effect of Combining Training Data from Multiple CWS Specifications.
    • Create a new CWS scheme called dict-hybrid by combining AS, CITYU, MSR, PKU.
    • Training data: 49,546,231 tokens and 112,072 unique words. Test data: 693 OOVs.
Experiment
  • Effect of Feature Interpolation of Translation Models.
    • The authors generated multiple translation models by using different word segmenters.
    • The phrase translation model p(e|f) can be linearly interpolated as p(e|f) = Σ_{i=1}^{S} α_i p_i(e|f),
    • where p_i(e|f) is the phrase translation model corresponding to the i-th CWS scheme, α_i is its weight, and S is the total number of models.
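
A minimal sketch of this interpolation over toy phrase tables follows. It assumes that the weights α_i sum to 1 and that a phrase pair missing from a model contributes probability 0; the slide states neither. The function name `interpolate` and the table entries are hypothetical.

```python
def interpolate(models, weights):
    """Linearly interpolate phrase tables: p(e|f) = sum_i alpha_i * p_i(e|f).

    Each model maps a (source_phrase, target_phrase) pair to a probability.
    """
    if len(models) != len(weights):
        raise ValueError("need exactly one weight per model")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights are assumed to sum to 1")
    combined = {}
    for model, alpha in zip(models, weights):
        for pair, prob in model.items():
            # A pair absent from a model simply adds nothing for that model.
            combined[pair] = combined.get(pair, 0.0) + alpha * prob
    return combined

# Toy phrase tables from two hypothetical segmenters.
model_a = {("電器", "electrical appliances"): 0.8, ("電器", "appliances"): 0.2}
model_b = {("電器", "electrical appliances"): 0.6, ("家用電器", "home appliances"): 1.0}
table = interpolate([model_a, model_b], [0.5, 0.5])
print(table)
```

Because each segmenter produces different source phrases, the interpolated table covers the union of the phrase pairs seen under any scheme.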
Conclusion
  • The authors analyzed multiple CWS specifications and built a CWS scheme for each one to examine how it affected translation.
  • They proposed a new approach based on linear interpolation of translation features, which improved translation and achieved the best BLEU score of all the CWS schemes.
Comments
  • Advantage
    • Many experiments are used to evaluate performance.
  • Drawback
    • Some interpretations of the experiments are complex.
  • Application
    • Chinese Word Segmentation.
    • Statistical Machine Translation.