
Constructing a Chinese-Japanese Parallel Corpus from Wikipedia

Chenhui Chu, Toshiaki Nakazawa, Sadao Kurohashi

(Graduate School of Informatics, Kyoto University)

Background

  • Lack of Chinese-Japanese parallel corpora for SMT
  • Chinese-Japanese Wikipedia

Features

  • Baseline features
      • General features: sentence length, word overlap
      • Word alignment features
  • Novel features
      • Chinese character features (+CC)
      • Non-CC word features (+Non-CC)
      • Content word features (+Con)

Parallel Sentence Extraction System

[Figure: system overview with steps marked (1), (2), (3). Only the labels survive; in apparent flow order they are:]

  • Extraction: Zh-Ja Wikipedia, inter-language link, article pairs, Cartesian product, parallel sentence candidates, filter, classifier, parallel sentences
  • Classifier training: seed parallel corpus, positive instances, non-parallel sentence pairs, filter, filtered non-parallel sentence pairs, negative instances
  • Filter resources: bilingual dictionary, common Chinese characters (common Chinese characters filtering)

# http://orchid.kuee.kyoto-u.ac.jp/ASPEC/

Article pair (excerpt):

Zh: ... 日本的一级行政区划单位为都道府县,全国划分为1都、1道、2府、43县。部份市因人口较多,在当地影响较大,而被指定为政令指定都市、中核市、特例市。都道府县下的行政区划为市町村,此外还有郡、支厅、区、特别区等行政单位。 ...

Ja: ... 都道府県(1都1道2府43県)という広域行政区画から構成される。但し、地域区分(地方区分)には、揺れが見られる。また、一部の市は、行政上、別途政令指定都市、中核市、特例市に定められている。他にも、市町村や、町村をまとめた郡がある(全国市町村一覧参照)。 ...

Sentence pairs from this article pair:

Zh: 日本的一级行政区划单位为都道府县,全国划分为1都、1道、2府、43县。
Ja: 都道府県(1都1道2府43県)という広域行政区画から構成される。

Zh: 而被指定为政令指定都市、中核市、特例市。
Ja: 別途政令指定都市、中核市、特例市に定められている。

Non-parallel sentence pair (word-segmented):

Zh: YY/的/尸体/,/和/活着/的/黑/猩猩/相比/,/皮肤/的/颜色/看起来/稍微/明朗/一些/。
Ja: つぎに/,/配線/に/使用/する/パターン/幅/や/クリアランス/の/設定/の/方法/を/説明/した/。
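The "Cartesian product" label in the system diagram suggests exhaustive pairing of sentences. A minimal sketch of that step, assuming it means every Zh sentence of an article is paired with every Ja sentence of its linked counterpart. The function names and the naive splitting on "。" are illustrative assumptions, not the authors' implementation. The input text is the article pair excerpt shown on the poster.

```python
from itertools import product


def split_sentences(text):
    """Naive splitter: cut after each full stop (an assumption for this sketch)."""
    return [s + "。" for s in text.split("。") if s.strip()]


def sentence_candidates(zh_article, ja_article):
    """Pair every Zh sentence with every Ja sentence of one linked article pair."""
    return list(product(split_sentences(zh_article), split_sentences(ja_article)))


zh_article = (
    "日本的一级行政区划单位为都道府县,全国划分为1都、1道、2府、43县。"
    "部份市因人口较多,在当地影响较大,而被指定为政令指定都市、中核市、特例市。"
    "都道府县下的行政区划为市町村,此外还有郡、支厅、区、特别区等行政单位。"
)
ja_article = (
    "都道府県(1都1道2府43県)という広域行政区画から構成される。"
    "但し、地域区分(地方区分)には、揺れが見られる。"
    "また、一部の市は、行政上、別途政令指定都市、中核市、特例市に定められている。"
    "他にも、市町村や、町村をまとめた郡がある(全国市町村一覧参照)。"
)

candidates = sentence_candidates(zh_article, ja_article)
print(len(candidates))  # 3 Zh sentences x 4 Ja sentences = 12 candidates
```

The candidate count grows with the product of the two article lengths, which is presumably why a filter sits between candidate generation and the classifier.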

  • Overview
  • Parallel sentence candidate filtering
      • WF: dictionary-based word overlap (Baseline)
      • CCF: common Chinese character (cognate) overlap
      • WF and CCF: logical conjunction of WF and CCF
      • WF or CCF: logical disjunction of WF and CCF
  • Parallel sentence classifier

Experiments

  • Classification results with WF
  • Extraction results (#extracted sentences [unit: k])
  • MT results (BLEU-4)
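The four candidate filters named on the poster (WF, CCF, and their conjunction and disjunction) can be sketched as follows. Everything concrete here is an assumption made for illustration: the tiny dictionary, the small Simplified-Chinese-to-Japanese character table, the overlap definitions (share of Zh words or Zh characters matched on the Ja side), and the 0.5 threshold. The poster does not give the authors' exact definitions or thresholds.

```python
# Toy resources (hypothetical): a bilingual dictionary and a table mapping
# Simplified Chinese characters to their Japanese kanji counterparts.
ZH_JA_DICT = {"行政": {"行政"}}
ZH_TO_JA_CHAR = {"县": "県", "为": "為", "单": "単", "厅": "庁"}


def is_han(ch):
    """True for characters in the basic CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def wf_ratio(zh_words, ja_words):
    """Share of Zh words with a dictionary translation present in the Ja sentence."""
    if not zh_words:
        return 0.0
    ja_set = set(ja_words)
    hits = sum(1 for w in zh_words if ZH_JA_DICT.get(w, set()) & ja_set)
    return hits / len(zh_words)


def ccf_ratio(zh_words, ja_words):
    """Share of Zh Chinese characters whose Japanese form occurs in the Ja sentence."""
    zh_chars = [ZH_TO_JA_CHAR.get(c, c) for c in "".join(zh_words) if is_han(c)]
    if not zh_chars:
        return 0.0
    ja_chars = {c for c in "".join(ja_words) if is_han(c)}
    return sum(1 for c in zh_chars if c in ja_chars) / len(zh_chars)


def filter_decisions(zh_words, ja_words, threshold=0.5):
    """Evaluate all four filter variants for one candidate sentence pair."""
    wf = wf_ratio(zh_words, ja_words) >= threshold
    ccf = ccf_ratio(zh_words, ja_words) >= threshold
    return {"WF": wf, "CCF": ccf, "WF and CCF": wf and ccf, "WF or CCF": wf or ccf}


zh_words = ["都道府县", "下", "的", "行政", "区划"]
ja_words = ["都道府県", "の", "行政", "区画"]
print(filter_decisions(zh_words, ja_words))
# {'WF': False, 'CCF': True, 'WF and CCF': False, 'WF or CCF': True}
```

In this toy pair the dictionary covers only one of five Zh words, so WF rejects the pair, while seven of the ten Zh characters have a counterpart on the Ja side, so CCF keeps it. The disjunction keeps more candidates than either filter alone and the conjunction keeps fewer.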

# The resource is freely available at: http://orchid.kuee.kyoto-u.ac.jp/~chu/resource/wiki_zh_ja.tgz
