Alignment of Bilingual Named Entities in Parallel Corpora Using Statistical Model

Chun-Jen Lee, Jason S. Chang, Thomas C. Chuang. AMTA 2004

Introduction
• Focus: extracting entity names (PER, LOC, ORG) from a bilingual corpus.
• The feasibility of extracting interlingual NEs has seldom been addressed.
  • Al-Onaizan and Knight 2002
  • Huang and Vogel 2002
  • Chen et al. 2003
  • Moore 2003
  • Kumano et al. 2004
  • Lee et al. 2003 (baseline model)
• This work integrates approximate matching and personal name recognition into the baseline model.

Framework
• Preprocess:
  • Perform sentence alignment.
  • Label English named entities.
• Main process:
  • For each labeled NE_E, apply the Statistical Probability Translation Model and Approximate Matching to find Chinese named-entity candidates {NE_A} in S_C.
  • For any word W_E in NE_E that has no corresponding Chinese translation in S_C, apply the proposed Statistical Transliteration Model, enhanced with Chinese Personal Name Recognition, to extract the corresponding Chinese transliterations {NE_B} in S_C with scores above a predefined threshold.
  • Merge {NE_A} with {NE_B} into the possible candidates {NE_C}.
  • Rank {NE_C} by the cost scores. The candidate with the maximum score is chosen as the answer.
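The last two steps of the main process (merge, then rank) can be sketched as follows. This is only an illustration: the function name `merge_and_rank`, the dictionary representation of candidates, the rule for resolving duplicates, and all scores are assumptions, not details taken from the slides.

```python
from typing import Dict, List, Tuple


def merge_and_rank(
    translation_cands: Dict[str, float],
    transliteration_cands: Dict[str, float],
    threshold: float,
) -> List[Tuple[str, float]]:
    """Merge two candidate sets and rank them by score, best first.

    Both inputs map a candidate Chinese string to a score (higher is better).
    Transliteration candidates are kept only if they score above `threshold`.
    """
    merged = dict(translation_cands)
    for cand, score in transliteration_cands.items():
        if score <= threshold:
            continue  # below the predefined threshold: discard
        # Assumption: if both models propose the same string, keep the better score.
        merged[cand] = max(score, merged.get(cand, float("-inf")))
    return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)


# Placeholder candidates with made-up log scores.
ne_a = {"cand_1": -4.2, "cand_2": -6.0}
ne_b = {"cand_3": -3.1, "cand_4": -9.5}
ranked = merge_and_rank(ne_a, ne_b, threshold=-8.0)
best_candidate = ranked[0][0]
```

The top-ranked entry plays the role of the chosen answer; `cand_4` is dropped because it does not clear the threshold.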

SPTM
• A noisy channel approach.
• Translating an English phrase e with l words into a Mandarin Chinese phrase f with m words by decomposing the channel function into two independent probabilistic functions:
  • Lexical translation probability function P(f_{a_i} | e_i), where e_i is the i-th word in e and e_i is aligned with f_{a_i} in f under the alignment a.
  • Alignment probability function P(a | l, m) = P(a_1, a_2, …, a_l | l, m).

SPTM
• E = “Ichthyosis Concern Association”
• F = “關懷魚鱗癬協會”
• Correct alignment: (a_1 = 2, a_2 = 1, a_3 = 3).
• The phrase translation probability is
• Defining the scoring function as a log probability function:
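The equations for this slide are not preserved in the transcript. One reading that is consistent with the decomposition into two independent functions is P(f | e) ≈ P(a | l, m) · ∏ P(f_{a_i} | e_i), with the score taken as its logarithm. The sketch below computes that quantity for the example. The segmentation of F into three words, the helper name `phrase_log_score`, and every probability value are assumptions made purely for illustration.

```python
import math


def phrase_log_score(e_words, f_words, alignment, lex_prob, align_prob):
    """Log of P(a | l, m) times the product of P(f_{a_i} | e_i).

    `alignment[i]` is the 1-based position in `f_words` aligned with `e_words[i]`.
    `lex_prob` maps (f_word, e_word) to a probability.
    `align_prob` is a callable (alignment, l, m) -> probability.
    """
    l, m = len(e_words), len(f_words)
    score = math.log(align_prob(tuple(alignment), l, m))
    for i, a_i in enumerate(alignment):
        f_word = f_words[a_i - 1]  # convert 1-based alignment to 0-based index
        score += math.log(lex_prob[(f_word, e_words[i])])
    return score


e = ["Ichthyosis", "Concern", "Association"]
f = ["關懷", "魚鱗癬", "協會"]  # assumed word segmentation of F
a = [2, 1, 3]                    # alignment from the slide

# Made-up probabilities, for illustration only.
toy_lex = {
    ("魚鱗癬", "Ichthyosis"): 0.5,
    ("關懷", "Concern"): 0.25,
    ("協會", "Association"): 0.5,
}


def toy_align(alignment, l, m):
    return 0.25


score = phrase_log_score(e, f, a, toy_lex, toy_align)
```

Working in log space turns the product into a sum, which avoids underflow when many small probabilities are multiplied.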

Estimating Lexical Translation Probability Based on Parallel Corpus
• Adopting a word alignment module to extract lexical translation probabilities automatically (Wu and Chang 2003).
• Developing a list of preferred part-of-speech (POS) patterns of collocations in both languages.
• Matching collocation candidates against the preferred POS patterns and applying N-gram statistics for both languages.
• The log-likelihood ratio statistic is employed for two consecutive words in both languages.
• Finally, content word alignment is performed with the Competitive Linking Algorithm (Melamed 1997).
• To avoid introducing too much noise, only bilingual phrases with high probabilities are considered.
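The core idea of competitive linking is greedy one-to-one matching: take word pairs in order of decreasing association score and link a pair only if neither word is already linked. The following is a simplified sketch of that idea, not the exact module used by the authors; the function name and the association scores are made up, and refinements such as score thresholds are omitted.

```python
def competitive_linking(scores):
    """Greedy one-to-one linking of word pairs.

    `scores` maps (e_word, f_word) to an association score (higher is better).
    Returns the linked pairs in the order they were selected.
    """
    links = []
    used_e, used_f = set(), set()
    # Visit candidate pairs from the strongest association to the weakest.
    for (e_word, f_word), _ in sorted(
        scores.items(), key=lambda kv: kv[1], reverse=True
    ):
        if e_word in used_e or f_word in used_f:
            continue  # one of the two words is already linked
        links.append((e_word, f_word))
        used_e.add(e_word)
        used_f.add(f_word)
    return links


# Made-up association scores for illustration.
toy_scores = {
    ("Concern", "關懷"): 9.0,
    ("Concern", "協會"): 2.0,
    ("Association", "協會"): 7.5,
    ("Association", "關懷"): 1.0,
}
links = competitive_linking(toy_scores)
```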

Estimating Lexical Translation Probability Based on Transliteration Model
• Adopting a Romanization system to represent a Chinese word.
• E and F are assumed to be an English word and a Romanized Chinese character sequence, respectively.
• The transliteration probability P(F | E) can be approximated by decomposing E and F into transliteration units (TUs).
• A word E with l characters and a Romanized word F with m characters are denoted by e_1 e_2 … e_l and f_1 f_2 … f_m, respectively.
• We can represent the mapping of (E, F) as a sequence of n matched TUs: {(u_1, v_1), (u_2, v_2), …, (u_n, v_n)}.
• The alignment a between E and F can be represented as a sequence of match types (m_1 m_2 … m_n), where m_i denotes the pair of lengths of u_i and v_i.
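To make the match-type notation concrete, the sketch below recovers the TU pairs from a word pair and a sequence of match types. The function name `split_into_tus`, the example pair ("smith", "shimisi"), and its segmentation are illustrative assumptions; the slides do not say which Romanization system or which segmentation was used.

```python
def split_into_tus(e, f, match_types):
    """Split E and F into TU pairs (u_i, v_i) according to the match types.

    Each match type is a pair (len(u_i), len(v_i)).
    Raises ValueError if the match types do not cover both strings exactly.
    """
    tus = []
    i = j = 0
    for len_u, len_v in match_types:
        tus.append((e[i:i + len_u], f[j:j + len_v]))
        i += len_u
        j += len_v
    if i != len(e) or j != len(f):
        raise ValueError("match types do not cover E and F exactly")
    return tus


# Hypothetical example: an English name and a Romanized Chinese rendering.
match_types = [(1, 3), (1, 1), (1, 1), (2, 2)]
tus = split_into_tus("smith", "shimisi", match_types)
```

Here the TU lengths on the English side sum to l = 5 and those on the Chinese side sum to m = 7, so the four match types fully describe the alignment.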


NE alignment
• g(0,0) = 0
• Suppose that there is an entry (e_i, w_f) in the bilingual dictionary. Score_lex(f_{a_i} | e_i) is formulated as:

Approximate Matching

CPNR
• Chinese surnames are used as anchor points.
• The Chinese personal name recognizer is applied only when the given NE is a person name and Score_tm(R(f_{a_i}) | e_i) is less than Thr_1.
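A minimal sketch of surname-anchored candidate generation is given below. The slides only state that surnames serve as anchor points and when the recognizer is triggered; the span-generation rule (a single-character surname followed by a one- or two-character given name), the function names, the sample sentence, and the threshold value are all assumptions for illustration.

```python
def should_apply_cpnr(ne_type, score_tm, thr1):
    """Trigger condition: a PER-type NE whose transliteration score is below Thr_1."""
    return ne_type == "PER" and score_tm < thr1


def person_name_candidates(sentence, surnames, max_given_len=2):
    """Propose name candidates anchored at each surname character.

    Assumption: single-character surnames, given names of 1..max_given_len characters.
    """
    candidates = []
    for pos, ch in enumerate(sentence):
        if ch not in surnames:
            continue
        for given_len in range(1, max_given_len + 1):
            end = pos + 1 + given_len
            if end <= len(sentence):
                candidates.append(sentence[pos:end])
    return candidates


toy_surnames = {"王", "李"}   # tiny stand-in for the surname list
toy_sentence = "昨天王小明來了"  # made-up sentence with a placeholder name
candidates = []
if should_apply_cpnr("PER", score_tm=-12.0, thr1=-10.0):
    candidates = person_name_candidates(toy_sentence, toy_surnames)
```

A real recognizer would also need to handle two-character surnames and to score the candidates; this sketch only enumerates spans.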

Training Data
• Noun phrases of the BDC Electronic Chinese-English Dictionary were used to train PTM.
• To train the transliteration model, 2,430 pairs of English names together with their Chinese transliterations and Chinese Romanization were used.
• The LDC Central News Agency Corpus was used to extract keywords of entity names for identifying NE types. We collected 117 bilingual keyword pairs from the corpus.
• A list of Chinese surnames was also gathered to help identify and extract PER-type NEs.
• The parallel corpus collected from the Sinorama Magazine was used to construct the corpus-based lexicon and to estimate the lexical translation probability (LTP).

Experiments
• 275 aligned sentences from Sinorama were randomly selected.
• Answer keys were manually prepared.
• Each chosen aligned sentence contains at least one NE pair.
• Currently, the lengths of English NEs are restricted to be less than 6.
• In total, 830 pairs of NEs are labeled. The numbers of NE pairs for types PER, LOC, and ORG are 208, 362, and 260, respectively.
