
Alignment of Bilingual Named Entities in Parallel Corpora Using Statistical Model

Chun-Jen Lee, Jason S. Chang, Thomas C. Chuang. AMTA 2004.


Presentation Transcript


  1. Alignment of Bilingual Named Entities in Parallel Corpora Using Statistical Model. Chun-Jen Lee, Jason S. Chang, Thomas C. Chuang. AMTA 2004.

  2. Introduction
     • Focus: extracting entity names (PER, LOC, ORG) from a bilingual corpus.
     • The feasibility of extracting interlingual NEs has seldom been addressed.
       • Al-Onaizan and Knight 2002
       • Huang and Vogel 2002
       • Chen et al. 2003
       • Moore 2003
       • Kumano et al. 2004
       • Lee et al. 2003 (baseline model)
     • This work integrates approximate matching and personal name recognition into the baseline model.

  3. Framework
     • Preprocessing:
       • Perform sentence alignment.
       • Label English named entities.
     • Main process:
       • For each labeled English named entity NE_E, apply the Statistical Probability Translation Model (SPTM) and Approximate Matching to find Chinese named-entity candidates {NE_A} in the Chinese sentence S_C.
       • For any word W_E in NE_E that has no corresponding Chinese translation in S_C, apply the proposed Statistical Transliteration Model, enhanced with Chinese Personal Name Recognition, to extract the corresponding Chinese transliterations {NE_B} in S_C with scores above a predefined threshold.
       • Merge {NE_A} with {NE_B} into the set of possible candidates {NE_C}.
       • Rank {NE_C} by the cost scores. The candidate with the maximum score is chosen as the answer.
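The merge-and-rank steps at the end of the main process can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `merge_and_rank`, the dictionary representation of candidates, and the rule of keeping the higher score when a candidate appears in both sets are all assumptions, and the scores in the usage note are made up.

```python
from typing import Dict, List, Optional, Tuple


def merge_and_rank(
    ne_a: Dict[str, float],
    ne_b: Dict[str, float],
    threshold: float,
) -> Tuple[Optional[str], List[Tuple[str, float]]]:
    """Merge translation candidates (NE_A) with transliteration candidates (NE_B).

    Returns the top-scoring candidate and the full ranking (best first).
    """
    # Keep only transliteration candidates whose score is above the threshold.
    kept_b = {cand: s for cand, s in ne_b.items() if s > threshold}

    # Merge into NE_C. Assumption: if a candidate occurs in both sets,
    # keep the higher of its two scores.
    ne_c = dict(ne_a)
    for cand, s in kept_b.items():
        ne_c[cand] = max(s, ne_c.get(cand, float("-inf")))

    # Rank by score; the candidate with the maximum score is the answer.
    ranked = sorted(ne_c.items(), key=lambda kv: kv[1], reverse=True)
    best = ranked[0][0] if ranked else None
    return best, ranked
```

For example, with hypothetical log scores `ne_a = {"cand1": -2.0, "cand2": -3.5}`, `ne_b = {"cand3": -1.2, "cand4": -9.0}` and `threshold = -5.0`, the call drops `cand4` and ranks `cand3` first.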

  4. SPTM
     • A noisy-channel approach.
     • Translating an English phrase e with l words into a Mandarin Chinese phrase f with m words by decomposing the channel function into two independent probabilistic functions:
       • Lexical translation probability function P(f_{a_i} | e_i), where e_i is the i-th word in e and e_i is aligned with f_{a_i} in f under the alignment a.
       • Alignment probability function P(a | l, m) = P(a_1, a_2, …, a_l | l, m).
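Written out, the two functions named on this slide combine as a product for a given alignment. The second expression is an assumption: the transcript does not say whether the phrase probability takes the best alignment or sums over alignments, so the maximum is shown only as one plausible reading.

```latex
P(\mathbf{f}, a \mid \mathbf{e}) = P(a \mid l, m) \prod_{i=1}^{l} P(f_{a_i} \mid e_i),
\qquad
P(\mathbf{f} \mid \mathbf{e}) \approx \max_{a} \; P(a \mid l, m) \prod_{i=1}^{l} P(f_{a_i} \mid e_i)
```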

  5. SPTM
     • E = “Ichthyosis Concern Association”
     • F = “關懷 魚鱗癬 協會”
     • Correct alignment: (a_1 = 2, a_2 = 1, a_3 = 3).
     • The phrase translation probability is
     • Defining the scoring function as a log probability function:
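The example on this slide can be scored with a few lines of Python. This is a sketch under stated assumptions: the lexical probabilities in `LEX`, the `FLOOR` value for unlisted pairs, the uniform alignment prior, and the restriction to one-to-one alignments are all hypothetical choices made for illustration; none of them come from the slides.

```python
import math
from itertools import permutations

E = ["Ichthyosis", "Concern", "Association"]
F = ["關懷", "魚鱗癬", "協會"]

# Hypothetical lexical translation probabilities P(f | e).
LEX = {
    ("Ichthyosis", "魚鱗癬"): 0.9,
    ("Concern", "關懷"): 0.6,
    ("Association", "協會"): 0.8,
}
FLOOR = 1e-4  # assumed probability for word pairs not listed in LEX


def score(alignment):
    """log P(a | l, m) + sum_i log P(f_{a_i} | e_i).

    `alignment` holds 1-based positions in F, one per word of E.
    Assumption: P(a | l, m) is uniform over the m! one-to-one alignments.
    """
    log_p = -math.log(math.factorial(len(F)))
    for i, a_i in enumerate(alignment):
        log_p += math.log(LEX.get((E[i], F[a_i - 1]), FLOOR))
    return log_p


# Search all one-to-one alignments and keep the highest-scoring one.
best = max(permutations(range(1, len(F) + 1)), key=score)
```

With these toy numbers the search recovers the alignment given on the slide, `(2, 1, 3)`: "Ichthyosis" maps to the second Chinese word, "Concern" to the first, and "Association" to the third.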

  6. Estimating Lexical Translation Probability Based on Parallel Corpus
     • A word alignment module is adopted to automatically extract lexical translation probabilities (Wu and Chang 2003):
       • Develop a list of preferred part-of-speech (POS) patterns of collocation in both languages.
       • Match collocation candidates against the preferred POS patterns and apply N-gram statistics for both languages.
       • Employ the log-likelihood ratio statistic for pairs of consecutive words in both languages.
       • Finally, deploy content word alignment based on the Competitive Linking Algorithm (Melamed 1997).
     • To avoid introducing too much noise, only bilingual phrases with high probabilities are considered.
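The competitive linking step can be sketched as a greedy procedure: sort candidate word pairs by association score, then link the best remaining pair whose two words are both still unlinked. The function name, the `min_score` cutoff, and the scores in the usage note are assumptions for illustration; the slides do not give the association measure or any thresholds.

```python
def competitive_linking(scores, min_score=0.0):
    """Greedy one-to-one word linking.

    scores: dict mapping (english_word, chinese_word) -> association score.
    Returns a list of (english_word, chinese_word, score) links, best first.
    """
    links = []
    used_e, used_f = set(), set()
    # Visit candidate pairs from the strongest association to the weakest.
    for (e, f), s in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
        if s <= min_score:
            break  # remaining pairs are too weak to link
        if e in used_e or f in used_f:
            continue  # each word may take part in at most one link
        links.append((e, f, s))
        used_e.add(e)
        used_f.add(f)
    return links
```

Given hypothetical scores such as `{("Ichthyosis", "魚鱗癬"): 9.1, ("Ichthyosis", "協會"): 2.0, ("Association", "協會"): 7.5}`, the weaker "Ichthyosis"/"協會" pair is skipped because "Ichthyosis" is already linked by the time it is considered.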

  7. Estimating Lexical Translation Probability Based on Transliteration Model
     • A Romanization system is adopted to represent a Chinese word.
     • E and F are assumed to be an English word and a Romanized Chinese character sequence, respectively.
     • The transliteration probability P(F | E) can be approximated by decomposing E and F into transliteration units (TUs).
     • A word E with l characters and a Romanized word F with m characters are denoted by e_1 e_2 … e_l and f_1 f_2 … f_m, respectively.
     • The mapping of (E, F) can be represented as a sequence of n matched TUs: {(u_1, v_1), (u_2, v_2), …, (u_n, v_n)}.
     • The alignment a between E and F can be represented as a sequence of match types (m_1 m_2 … m_n), where m_i denotes the pair of lengths of u_i and v_i.
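One way to find a TU decomposition is a dynamic program over prefixes of E and F. The sketch below is an illustration, not the authors' model: scoring an alignment as the sum of TU log probabilities, the length limits `max_e` and `max_f`, and the toy probability table are all assumptions.

```python
import math
from functools import lru_cache


def best_tu_alignment(e, f, tu_logprob, max_e=2, max_f=3):
    """Find the highest-scoring segmentation of (e, f) into matched TUs.

    tu_logprob: dict mapping (u, v) -> log probability of TU pair (u, v).
    Returns (log score, tuple of (u, v) pairs); the score is -inf if no
    segmentation is possible with the given table.
    """

    @lru_cache(maxsize=None)
    def g(i, j):
        # Best score for aligning the prefixes e[:i] and f[:j].
        if i == 0 and j == 0:
            return 0.0, ()
        best = (-math.inf, ())
        for di in range(1, max_e + 1):       # length of the English unit u
            for dj in range(0, max_f + 1):   # length of the Romanized unit v
                if di > i or dj > j:
                    continue
                u, v = e[i - di:i], f[j - dj:j]
                lp = tu_logprob.get((u, v))
                if lp is None:
                    continue
                prev_score, prev_path = g(i - di, j - dj)
                if prev_score + lp > best[0]:
                    best = (prev_score + lp, prev_path + ((u, v),))
        return best

    return g(len(e), len(f))


def match_types(path):
    """Match type m_i is the pair of lengths (|u_i|, |v_i|)."""
    return [(len(u), len(v)) for u, v in path]
```

With a toy table containing `("t", "t")`, `("i", "i")`, and `("m", "mu")`, aligning `"tim"` with `"timu"` yields three TUs whose match types are `(1, 1)`, `(1, 1)`, and `(1, 2)`.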

  8. Estimating Lexical Translation Probability Based on Transliteration Model

  9. NE alignment
     • g(0, 0) = 0
     • Suppose that there is an entry (e_i, w_f) in the bilingual dictionary. Score_lex(f_{a_i} | e_i) is formulated as:

  10. Approximate Matching

  11. CPNR
     • Chinese surnames are used as anchor points.
     • The Chinese personal name recognizer is applied only when the given NE is a person name and Score_tm(R(f_{a_i}) | e_i) is less than Thr1.
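The two points above can be sketched in Python. Everything concrete here is hypothetical: the small `SURNAMES` set stands in for the surname list mentioned in the slides, the rule of taking one or two characters after a single-character surname is an assumed heuristic, and `should_apply_cpnr` simply restates the slide's condition with the score and threshold passed in as plain numbers.

```python
# Hypothetical stand-in for the surname list; single-character surnames only.
SURNAMES = {"李", "陳", "張", "王", "林"}


def person_name_candidates(sentence, max_given=2):
    """Use surnames as anchor points and propose name candidates.

    Each candidate starts at a surname character and extends over the next
    1..max_given characters (assumed given-name lengths).
    """
    candidates = []
    for i, ch in enumerate(sentence):
        if ch in SURNAMES:
            for k in range(1, max_given + 1):
                if i + 1 + k <= len(sentence):
                    candidates.append(sentence[i:i + 1 + k])
    return candidates


def should_apply_cpnr(ne_type, score_tm, thr1):
    """Apply the recognizer only for person NEs whose transliteration
    score falls below the threshold Thr1."""
    return ne_type == "PER" and score_tm < thr1
```

For the sentence fragment `"總統李登輝表示"`, the only anchor is "李", which gives the candidates "李登" and "李登輝"; a downstream scorer would still have to choose between them.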

  12. Training Data
     • Noun phrases from the BDC Electronic Chinese-English Dictionary were used to train the PTM.
     • To train the transliteration model, 2,430 pairs of English names together with their Chinese transliterations and Chinese Romanization were used.
     • The LDC Central News Agency Corpus was used to extract keywords of entity names for identifying NE types. We collected 117 bilingual keyword pairs from the corpora.
     • A list of Chinese surnames was also gathered to help identify and extract PER-type NEs.
     • The parallel corpus collected from Sinorama Magazine was used to construct the corpus-based lexicon and to estimate the LTP.

  13. Experiments
     • 275 aligned sentences were randomly selected from Sinorama.
     • Answer keys were prepared manually.
     • Each chosen aligned sentence contains at least one NE pair.
     • Currently, the lengths of English NEs are restricted to be less than 6.
     • In total, 830 pairs of NEs are labeled. The numbers of NE pairs for types PER, LOC, and ORG are 208, 362, and 260, respectively.
