Named Entity Recognition and Transliteration for 50 Languages

Richard Sproat, Dan Roth, ChengXiang Zhai, Elabbas Benmamoun,

Andrew Fister, Nadia Karlinsky, Alex Klementiev, Chongwon Park,

Vasin Punyakanok, Tao Tao, Su-youn Yoon

University of Illinois at Urbana-Champaign

http://compling.ai.uiuc.edu/reflex

The Second Midwest Computational Linguistics Colloquium

(MCLC-2005)

May 14-15

The Ohio State University

General Goals
  • Develop multilingual named entity recognition technology: focus on persons, places, organizations
  • Produce seed rules and (small) corpora for several LCTLs (Less Commonly Taught Languages)
  • Develop methods for automatic named entity transliteration
  • Develop methods for tracking names in comparable corpora

Sproat et al.: NER and Transliteration for 50 Languages

Languages
  • Languages for seed rules: Chinese, English, Spanish, Arabic, Hindi, Portuguese, Russian, Japanese, German, Marathi, French, Korean, Urdu, Italian, Turkish, Thai, Polish, Farsi, Hausa, Burmese, Sindhi, Yoruba, Serbo-Croatian, Pashto, Amharic, Indonesian, Tagalog, Hungarian, Greek, Czech, Swahili, Somali, Zulu, Bulgarian, Quechua, Berber, Lingala, Catalan, Mongolian, Danish, Hebrew, Kashmiri, Norwegian, Wolof, Bamanankan, Twi, Basque.
  • Languages for (small) corpora: Chinese, Arabic, Hindi, Marathi, Thai, Farsi, Amharic, Indonesian, Swahili, Quechua.

Milestones
  • Resources for various languages:
    • NER seed rules for: Armenian, Persian, Swahili, Zulu, Hindi, Russian, Thai
    • Tagged corpora for: Chinese, Arabic, Korean
    • Small tagged corpora for: Armenian, Persian, Russian (10-20K words)
  • Named Entity recognition technology:
    • Ported NER technology from English to Chinese, Arabic, Russian and German
  • Name transliteration: Chinese-English, Arabic-English, Korean-English

Linguistic/Orthographic Issues
  • Capitalization
  • Word boundaries
  • Phonetic vs. orthographic issues in transliteration

Named Entity Recognition

Multi-lingual Text Annotator

Annotate any word in a sentence by selecting the word and an available category. It's also possible to create new categories.

http://l2r.cs.uiuc.edu/~cogcomp/ner_applet.php

Multi-lingual Text Annotator

View text in other encodings. New language encodings are easily added in a simple text file mapping.

http://l2r.cs.uiuc.edu/~cogcomp/ner_applet.php

Motivation for Seed Rules

“The only supervision is in the form of 7 seed rules (namely, that New York, California and U.S. are locations; that any name containing Mr. is a person; that any name containing Incorporated is an organization; and that I.B.M. and Microsoft are organizations).”

[Collins and Singer, 1999]

Seed Rules: Thai
  • Something including and to the right of นาย is likely to be a person
  • Something including and to the right of นาง is likely to be a person
  • Something including and to the right of นางสาว is likely to be a person
  • Something including and to the right of น.ส. is likely to be a person
  • Something including and to the right of คุณ is likely to be a person
  • Something including and to the right of เด็กหญิง is likely to be a person
  • Something including and to the right of ด.ญ. is likely to be a person
  • Something including and to the right of พ.ต.อ. is likely to be a person
  • Something including and to the right of พล.ต.ต. is likely to be a person
  • Something including and to the right of พล.ต.ท. is likely to be a person
  • Something including and to the right of พล.ต.อ. is likely to be a person
  • Something including and to the right of ส.ส. is likely to be a person
  • ทักษิณ ชินวัตร is a person
  • ทักษิณ is likely a person
  • ชวน หลีกภัย is a person
  • บรรหาร ศิลปอาชา is a person

Seed Rules: Thai
  • Something including and in between บริษัท and จำกัด is likely to be an organization
  • Something including and to the right of บจก. is likely to be an organization
  • Something including and in between บริษัท and จำกัด (มหาชน) is likely to be an organization
  • Something including and in between บจก. and (มหาชน) is likely to be an organization
  • Something including and to the right of ห้างหุ้นส่วนจำกัด is likely to be an organization
  • Something including and to the right of หจก. is likely to be an organization
  • สำนักนายกรัฐมนครี is an organization
  • วุฒิสภา is an organization
  • แพทยสภา is an organization
  • พรรคไทยรักไทย is an organization
  • พรรคประชาธิปัตย์ is an organization
  • พรรคชาติไทย is an organization
  • Something including and to the right of จังหวัด is likely to be a location
  • Something including and to the right of จ. is likely to be a location
  • Something including and to the right of อำเถอ is likely to be a location
  • Something including and to the right of ตำบล is likely to be a location
  • กรุงเทพมหานคร is a location
  • เชียงใหม่ is a location
  • เชียงราย is a location
  • ขอนแก่น is a location

Seed Rules: Armenian
  • CityName = CapWord [ քաղաք | մայրաքաղաք ]
  • StateName = CapWord նահանգ
  • CountryName1 = CapWord երկիր
  • PersonName1 = TITLE? FirstName? LastName
  • LastName = [Ա-Ֆ].*յան
  • FirstName = [FirstName1 | FirstName2]
  • FirstName1 = [Ա-Ֆ]\.
  • FirstName2 = [Ա-Ֆ].*
  • PersonNameForeign = TITLE FirstName? CapWord? CapWord
  • PersonAny = PersonName1 | PersonNameForeign
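Patterns like these are essentially regular expressions and can be tested directly. A minimal sketch (the function name and example words are illustrative assumptions, not part of the project's rule engine):

```python
import re

# The Armenian LastName pattern from the rules above: an uppercase
# Armenian letter (range Ա-Ֆ) followed by anything ending in -յան.
LAST_NAME = re.compile(r"[Ա-Ֆ]\w*յան")

# FirstName1: a single uppercase Armenian initial followed by a period.
FIRST_INITIAL = re.compile(r"[Ա-Ֆ]\.")

def looks_like_last_name(token: str) -> bool:
    """True if the whole token matches the LastName pattern."""
    return bool(LAST_NAME.fullmatch(token))

print(looks_like_last_name("Պետրոսյան"))  # a -յան surname: True
print(looks_like_last_name("քաղաք"))      # lowercase common noun: False
```

Python's `re` treats `\w` and character ranges as Unicode by default, so Armenian letter ranges work without extra flags.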

Armenian Lexicon

Lexicon GEODESC: արեւելյան, արեւմտյան, …

Lexicon PLACEDESC: պանդոկ, պալատ, …

Lexicon ORGDESC: միություն, ժողով, …

Lexicon COMPDESC: գործակալություն, ընկերություն, …

Lexicon TITLE: տիկին, Տկն, …

Seed Rules: Persian

Lexicon TITLE: آقاي, دکتر, خانم, جناب, بانو, مهندس

Lexicon OrgDesc: استانداري, وزارت, دولت, رژيم, شهرداري, انجمن

Lexicon POSITION: رئيس جمهور, رييس جمهوري, پرزيدنت, ديپلمات

Descriptors for named entities:

Lexicon PerDesc: سابق, آينده

Lexicon CityDesc: شهر, شهرک, پايتخت

Lexicon CountryDesc: کشور

Seed Rules: Swahili

People Rules

  • Something including and to the right of Bw. is likely to be a person.
  • Something including and to the right of Bi. is likely to be a person.
  • A capitalized word to the right of bwana, together with the word bwana, is likely to be a person.
  • A capitalized word to the right of bibi, together with the word bibi, is likely to designate a person.

Place Rules

  • A capitalized word to the right of a word ending in -jini is likely to be a place.
  • A capitalized word starting with the letter U is likely to be a place.
  • A word ending in -ni is likely to be a place.
  • A sequence of words including and following the capitalized word Uwanja is likely to be a place.
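Rules of this trigger-plus-capitalized-word form are easy to operationalize. A minimal sketch (the trigger list is abbreviated and the tokenization, function name, and example sentence are illustrative assumptions, not the project's actual rule engine):

```python
# A few of the Swahili person triggers from the rules above.
PERSON_TRIGGERS = ("Bw.", "Bi.", "bwana", "bibi")

def find_person_spans(tokens):
    """Return (start, end) token spans covering a trigger plus the
    capitalized words immediately to its right."""
    spans = []
    for i, tok in enumerate(tokens):
        if tok in PERSON_TRIGGERS:
            j = i + 1
            while j < len(tokens) and tokens[j][:1].isupper():
                j += 1
            if j > i + 1:          # at least one capitalized word follows
                spans.append((i, j))
    return spans

tokens = "Rais alimkaribisha Bw. Juma Hamisi mjini Dodoma".split()
print(find_person_spans(tokens))  # [(2, 5)] covers "Bw. Juma Hamisi"
```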

Named Entity Recognition
  • Identify entities of specific types in text (e.g., people, locations, dates, organizations)

After receiving his M.B.A. from [ORG Harvard Business School], [PER Richard F. America] accepted a faculty position at the [ORG McDonough School of Business] in [LOC Washington].

Named Entity Recognition
  • Not an easy problem, since entities:
    • Are inherently ambiguous (e.g. JFK can be both a location and a person, depending on context)
    • Can appear in various forms (e.g. abbreviations)
    • Can be nested
    • Are too numerous and constantly evolving

(cf. Baayen, H. 2000. Word Frequency Distributions. Kluwer. Dordrecht.)

Named Entity Recognition

Two tasks (sometimes done simultaneously):

  • Identify the named entity phrase boundaries (segmentation)
    • May need to respect constraints:
      • Phrases do not overlap
      • Phrase order
      • Phrase length
  • Classify the phrases (classification)
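The two tasks are often realized jointly as BIO tagging, where segmentation and classification both fall out of a single tag sequence. A small illustrative decoder (the tag inventory and example are assumptions for illustration, not the paper's):

```python
def bio_to_phrases(tags):
    """Convert BIO tags (e.g. B-PER, I-PER, O) into (type, start, end)
    phrase spans; spans cannot overlap by construction."""
    phrases, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):      # sentinel flushes the last phrase
        if tag == "O" or tag.startswith("B-"):
            if start is not None:
                phrases.append((etype, start, i))
            start, etype = (i, tag[2:]) if tag.startswith("B-") else (None, None)
        # I- tags simply extend the currently open phrase
    return phrases

# "Richard F. America visited Washington"
tags = ["B-PER", "I-PER", "I-PER", "O", "B-LOC"]
print(bio_to_phrases(tags))  # [('PER', 0, 3), ('LOC', 4, 5)]
```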


(Figure: two trellis diagrams of states s1 … s6 with corresponding observations o1 … o6.)

Identifying phrase properties with sequential constraints
  • View as an inference-with-classifiers problem. Three models [Punyakanok & Roth, NIPS’01] http://l2r.cs.uiuc.edu/~danr/Papers/iwclong.pdf
    • HMMs
      • HMM with classifiers
    • Conditional models
      • Projection-based Markov model
    • Constraint satisfaction models
      • Constraint satisfaction with classifiers
  • Other models proposed:
    • CRF
    • Structured Perceptron
  • A model comparison in the context of the SRL problem [Punyakanok et al., IJCAI’05]
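All of these models ultimately search for the best label sequence under per-token classifier scores plus sequential constraints. A minimal Viterbi sketch in that spirit (the labels, scores, and transition table are invented for illustration; this is not the SNoW-based system itself):

```python
import math

def viterbi(scores, trans, labels):
    """scores[t][y]: classifier log-score for label y at position t;
    trans[(p, y)]: log transition score. Returns the best label sequence."""
    best = [{y: (scores[0][y], None) for y in labels}]
    for t in range(1, len(scores)):
        col = {}
        for y in labels:
            p = max(labels, key=lambda q: best[t - 1][q][0] + trans[(q, y)])
            col[y] = (best[t - 1][p][0] + trans[(p, y)] + scores[t][y], p)
        best.append(col)
    y = max(labels, key=lambda l: best[-1][l][0])
    path = [y]
    for t in range(len(scores) - 1, 0, -1):     # backtrack
        y = best[t][y][1]
        path.append(y)
    return path[::-1]

labels = ["O", "B", "I"]
trans = {(p, y): 0.0 for p in labels for y in labels}
trans[("O", "I")] = -math.inf     # hard sequential constraint: I may not follow O
scores = [{"O": 0.1, "B": 1.0, "I": 0.2},
          {"O": 0.3, "B": 0.1, "I": 0.9},
          {"O": 1.0, "B": 0.2, "I": 0.3}]
print(viterbi(scores, trans, labels))  # ['B', 'I', 'O']
```

Hard constraints of the kind the constraint-satisfaction models enforce can be expressed here as minus-infinity transitions.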

Adaptation
  • Most NER approaches are targeted toward a specific setting: language, subject, set of tags, etc.
    • Labeled data may be hard to acquire for each particular setting
    • Trained classifiers tend to be brittle when moved even to a related subject
  • We consider the problem of exploiting a hypothesis learned in one setting to improve learning in another.
  • Kinds of adaptation that can be considered:
    • Across corpora within a domain
    • Across domains
    • Across annotation methodologies
    • Across languages

Adaptation Example

  • Train on:
    • Reuters + increasing amounts of NYT
    • No Reuters, just increasing amounts of NYT
  • Test on: NYT
  • Performance on NYT increases quickly as the classifier is trained on examples from NYT
  • Starting with an existing classifier trained on a related corpus is better than starting from scratch

(Figure: learning curves for "trained on Reuters + 13% NYT, tested on NYT" vs. "trained on Reuters, tested on NYT".)


Sentence

Splitter

Word Splitter

FEX

NER

SNoW-based

Network file

Current Architecture - Training

Annotated Corpus

  • Pre-process annotated corpus
  • Extract features
  • Train classifier

Honorifics

Features script

Gazetteers

Italics : setting specific

: optional


Sentence

Splitter

Word Splitter

FEX

NER

SNoW-based

Current Architecture - Tagging

Corpus

  • Pre-process corpus
  • Extract features
  • Run NER

Honorifics

Features script

Gazetteers

Network file

Annotated Corpus


Document Classifier

Sentence

Splitter

Honorifics

Knowledge

Engineering

Components

Word Splitter

Features script

Gazetteers

Network file

Annotated Corpus

Extending Current Architecture to Multiple Settings

Chinese newswire

Corpus

German biological

English news

  • Choose setting
  • Pre-process, extract features and run NER

FEX

NER

SNoW-based

Extending Current Architecture to Multiple Settings: Issues

For each setting, we need:

  • Honorifics and gazetteers
  • Tuned sentence and word splitters
  • Types of features
  • Tagged training corpus
    • Work is being done to move tags across parallel corpora (if available)

Extending Current Architecture to Multiple Settings: Issues

If parallel corpora are available and one is annotated, we may be able to use Stochastic Inversion Transduction Grammars (ITGs) to move tags across corpora [Wu, Computational Linguistics ’97]

  • Generate bilingual parses of the annotated and unannotated parallel corpora
  • Use ITGs as a filter to deem sentence/phrase pairs "parallel enough"
  • For those that are, move the label from the annotated phrase to the unannotated phrase in the same parse-tree node
  • Use the now-tagged examples as a training corpus

Extending Current Architecture to Multiple Settings
  • Baseline experiments with Arabic, German, and Russian:
    • E.g. For Russian with no honorifics, gazetteers, features tuned for English, and imperfect sentence splitter we still get about 77% precision and 36% recall.

NB: Used small hand-constructed corpus of approx. 15K wds, 1,300 NE (80/20 split)

Summary
  • Seed rules and corpora for a subset of 50 languages
  • Adapted an English NER system to other languages
  • Demonstrated adaptation of the NER system to other settings
  • Experimenting with ITGs as a basis for annotation transplantation

Methods of Transliteration

Comparable Corpora

三号种子龚睿那今晚以两个11:1轻取丹麦选手蒂・拉斯姆森,张宁在上午以11:2和11:9淘汰了荷兰的于・默伦迪克斯,周蜜在下午以11:4和11:1战胜了中国香港选手凌婉婷。

In the day's other matches, second seed Zhou Mi overwhelmed Ling Wan Ting of Hong Kong, China 11-4, 11-4, Zhang Ning defeated Judith Meulendijks of Netherlands 11-2, 11-9 and third seed Gong Ruina took 21 minutes to eliminate Tine Rasmussen of Denmark 11-1, 11-1, enabling China to claim five quarterfinal places in the women's singles.

Transliteration in Comparable Corpora
  • Take the newspapers for a day in any set of languages: many of them will have names in common.
  • Given a name in one language, find its transliteration in a similar text in another language.
  • How can we make use of:
    • Linguistic factors, such as similar pronunciations
    • Distributional factors
  • Right now we use partly supervised methods (e.g., we assume small training dictionaries):
    • We are aiming for largely unsupervised methods (in particular, no training dictionary)

Some Comparable Corpora
  • We have (from the LDC) comparable text corpora for:
    • English (19M words)
    • Chinese (22M characters)
    • Arabic (8M words)
  • Many more such corpora can, in principle, be collected from the web

How Chinese Transliteration Works
  • About 500 characters tend to be used for foreign words
  • Transliterations attempt to mimic the pronunciation
  • But there are many alternative ways of doing so

Transliteration Problem
  • Many applications of transliteration have been in machine translation [Knight&Graehl, 1998; Al-Onaizan&Knight, 2002; Gao, 2004]:
    • What’s the best translation of this Chinese name?
  • Our problem is slightly different:
    • Are these two names the same?
    • Want to be able to reject correspondences
    • Assign 0 probability to some unseen cases in training data

Approaches to Transliteration
  • Much work uses the source-channel approach:
    • Cast as a problem where you have a clean "source", e.g. a Chinese name, and a "noisy channel" that "corrupts" the source into the observed form, e.g. an English name:
    • P(E|C)P(C)
    • E.g.: P(f_i f_{i+1} f_{i+2} … f_{i+n} | s), where the f's are English phones and s is a Chinese syllable

Chinese characters represent syllables (s); we match these to sequences of English phonemes (f)

Resources
  • Small dictionary of 721 (mostly English) names and their Chinese transliterations
  • Large dictionary of about 1.6 million names from LDC

General Approach
  • Train a tight transliteration model from a dictionary of known transliterations
  • Identify names in English news text for a given day using an existing named entity recognizer
  • Process same day of Chinese text looking for sequences of characters used in foreign names
  • Do an all-pairs match using the transliteration model to find possible transliteration pairs

Model Estimation
  • Seek to estimate P(e|c), where e is a sequence of words in Roman script and c is a sequence of Chinese characters
  • We actually estimate P(e′|c′), where e′ is the pronunciation of e and c′ is the pronunciation of c
  • We decompose the estimate of P(e′|c′) as a product over aligned units: P(e′|c′) = ∏_i P(e′_i | c′_i)
  • Chinese transliteration matches syllables to similar-sounding spans of foreign phones, so the c′_i are syllables and the e′_i are subsequences of the English phone string
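Under these definitions, scoring a candidate pair reduces to summing log-probabilities over aligned (syllable, phone-span) units. A hedged sketch with an invented alignment and invented probability table for Edmonton ~ 埃德蒙顿; note how a zero-probability correspondence lets the model reject a pair outright:

```python
import math

# Hypothetical learned values of P(english_phone_span | chinese_syllable);
# not the paper's actual estimates.
P = {("ai", "EH"): 0.3, ("de", "D"): 0.4,
     ("meng", "M AH N"): 0.2, ("dun", "T AH N"): 0.25}

def score_pair(aligned):
    """log P(e'|c') = sum over aligned (syllable, phone-span) pairs of
    log P(e'_i | c'_i); an unseen correspondence rejects the pair."""
    total = 0.0
    for syllable, phones in aligned:
        p = P.get((syllable, phones), 0.0)
        if p == 0.0:
            return -math.inf        # reject: zero-probability correspondence
        total += math.log(p)
    return total

# "Edmonton" vs. 埃德蒙顿 (syllables ai de meng dun), alignment assumed given
alignment = [("ai", "EH"), ("de", "D"), ("meng", "M AH N"), ("dun", "T AH N")]
print(score_pair(alignment))
```

The ability to return minus infinity is exactly the "want to be able to reject correspondences" point from the problem statement.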

Model Estimation
  • Align phone strings using modified Sankoff/Kruskal algorithm
  • For each Chinese s, allow an English phone string f to correspond just in case the initial of s corresponds to the initial of f some minimum number of times in training
  • Smooth probabilities using Good-Turing
  • Distribute unseen probability mass over unseen cases non-uniformly according to a weighting scheme

Model Estimation
  • We estimate the probability for a given unseen case as: P(e|c) = P(n0) × P(len(e)=m | len(c)=n) / count(len(e)=m)
  • Where:
    • P(n0) is the probability of unseen cases according to the Good-Turing smoothing
    • P(len(e)=m|len(c)=n) is the probability of a Chinese syllable of length n corresponding to an English phone sequence of length m
    • count(len(e)=m) is the type count of phone sequences of length m (estimated from 194,000 pronunciations produced by the Festival TTS system on the XTag dictionary)
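A toy version combining the three bulleted quantities above; all numbers are hypothetical stand-ins, not values from the paper:

```python
# Hypothetical stand-in values for the three quantities.
p_n0 = 0.05                                        # Good-Turing unseen mass
p_len = {(3, 2): 0.5, (3, 3): 0.3, (3, 4): 0.2}    # P(len(e)=m | len(c)=n)
type_count = {2: 400, 3: 900, 4: 1500}             # count(len(e)=m)

def p_unseen(m, n):
    """Spread the unseen mass non-uniformly: weight it by the length
    correspondence, uniform within each length class."""
    return p_n0 * p_len[(n, m)] / type_count[m]

print(p_unseen(2, 3))   # 0.05 * 0.5 / 400
```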

Some Automatically Found Pairs

Pairs found in the same day of newswire text

Further Pairs

Time Correlations
  • When some major event happens (e.g., the tsunami disaster), it is very likely covered by news articles in multiple languages
  • Each event/topic tends to have its own "associated vocabulary" (e.g., names such as Sri Lanka or India may occur in recent news articles)
  • We will thus likely see the frequency of a name such as Sri Lanka peak relative to other time periods, and the pattern is likely the same across languages
  • cf. [Kay and Roscheisen, CL, 1993; Kupiec, ACL, 1993; Rapp, ACL, 1995; Fung, WVLC, 1995]
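The correlation computation itself is straightforward: normalize each term's daily frequency vector and compute Pearson's r. A self-contained sketch with invented daily counts:

```python
import math

def normalize(freqs):
    """Turn raw daily counts into a distribution over days."""
    total = sum(freqs)
    return [f / total for f in freqs]

def pearson(x, y):
    """Pearson correlation coefficient, in [-1, 1]."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical daily counts of one name in English and Chinese news:
# both spike on the same days, so the correlation should be high.
english = normalize([2, 0, 15, 30, 8, 1])
chinese = normalize([1, 0, 20, 35, 10, 2])
print(round(pearson(english, chinese), 3))
```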

Construct Term Distributions over Time

(Figure: documents are grouped by day along a time line, Day 1, Day 2, Day 3, …, Day n; for each term, the per-day term frequencies are normalized to obtain a distribution over days.)

Measure Correlations of English and Chinese Word Pairs

(Figure: Pearson correlation scores lie in [-1, 1]. Megawati-English vs. Megawati-Chinese shows good correlation, corr = 0.885; Megawati-English vs. Arafat-Chinese shows bad correlation, corr = 0.0324.)

Chinese Transliteration

(Figure: given the English term Edmonton, candidate Chinese names (埃德蒙顿, 阿勒泰, 埃丁顿, 阿马纳, 阿亚德, 埃蒂纳罗, …) are extracted from Chinese documents and ranked: 埃德蒙顿 0.96, 阿勒泰 0.91, 埃丁顿 0.88, 阿马纳 0.75, ….)

  • Methods:
    • Phonetic approach
    • Frequency correlation
    • Combination

Evaluation

English term: Edmonton; Chinese candidates: 埃德蒙顿, 阿勒泰, 埃丁顿, 阿马纳, 阿亚德, 埃蒂纳罗, …

  • Phonetic method
  • Method 1 (Freq+PhoneticFilter):
    • Compute the correlation scores
    • Rank the candidates by correlation score
  • Method 2 (Freq+PhoneticScore):
    • Linearly combine the correlation scores with the phonetic scores (half/half)
  • MRR: Mean Reciprocal Rank
    • AllMRR: evaluation over all English names
    • CoreMRR: evaluation over just the names with a found Chinese correspondence
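MRR rewards placing the correct transliteration near the top of each ranked candidate list. A small sketch (the names and rankings are invented); restricting the average to names whose correct Chinese form appears at all would give the CoreMRR variant:

```python
def mrr(ranked_lists, gold):
    """Mean Reciprocal Rank: for each English name, take 1/rank of the
    correct transliteration (0 if absent), then average over names."""
    total = 0.0
    for name, candidates in ranked_lists.items():
        if gold[name] in candidates:
            total += 1.0 / (candidates.index(gold[name]) + 1)
    return total / len(ranked_lists)

# Hypothetical ranked Chinese candidates for two English names.
ranked = {"Edmonton": ["埃德蒙顿", "阿勒泰"],
          "Rasmussen": ["阿马纳", "拉斯姆森"]}
gold = {"Edmonton": "埃德蒙顿", "Rasmussen": "拉斯姆森"}
print(mrr(ranked, gold))  # (1/1 + 1/2) / 2 = 0.75
```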

Summary and Future Work
  • So far:
    • Phonetic transliteration models
    • Time correlation between name distributions
  • Work in progress:
    • Linguistic models:
      • Develop graphical model approach to transliteration
      • Semantic aspects of transliteration in Chinese: female names ending in –ia transliterated with 娅 ya rather than 亚
      • Resource-poor transliteration for any pair of languages
    • Document alignment
    • Coordinated mixture models for document/word-level alignment


Graphical Models [Bilmes & Zweig 2002]

(Figure: a graphical model over a character counter, an end variable, characters and character transitions, Chinese phones and Chinese phone transitions, and English phones.)

Semantic Aspects of Transliteration
  • Phonological model doesn’t capture semantic/orthographic features of transliteration:
    • Saint, San, Sao, … use 圣 sheng `holy’
    • Female names ending in –ia transliterated with 娅 ya rather than 亚 ya
    • Such information boosts evidence that two strings are transliterations of each other
  • Consider gender. For each character c:
    • compute log-likelihood ratio abs(log(P(f|c)/P(m|c)))
    • build a decision list ranked by decreasing LLR
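The decision-list construction can be sketched directly from the LLR definition. The counts below are invented, and the add-one smoothing is my own addition to keep the ratios finite (the slides do not specify a smoothing scheme):

```python
import math

def gender_decision_list(counts):
    """counts[c] = (female_count, male_count) for character c.
    Return (LLR, char, label) triples sorted by decreasing
    abs(log(P(f|c)/P(m|c))), with add-one smoothing."""
    entries = []
    for c, (f, m) in counts.items():
        llr = abs(math.log((f + 1) / (m + 1)))
        label = "female" if f > m else "male"
        entries.append((llr, c, label))
    return sorted(entries, reverse=True)

# Hypothetical character counts in female vs. male name transliterations.
counts = {"娅": (120, 1), "顿": (2, 180), "亚": (40, 35)}
for llr, c, label in gender_decision_list(counts):
    print(f"{llr:.3f} {c} :{label}")
```

Characters strongly skewed toward one gender sort to the top, matching the shape of the decision list shown on the next slide.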

Decision List for Gender Classification

41.5833898566 顿 :male

40.8357601821 琳 :female

39.064753687 丝 :female

35.8207097407 修 :male

34.7980589928 名 :female

34.4008926875 彦 :female

33.9871287766 屋 :female

33.9225902555 儿 :male

26.945842105 慈 :male

Document Alignment

Basic idea: sum up the correlations of all e-c word pairs between an English document E (words e1 … e|E|) and a Chinese document C (words c1 … c|C|), and use these sums to find documents paired by relevance.

  • Method 1: Expected correlation (ExpCorr)
  • Method 2: IDF-weighted correlation (IDFCorr)
    • Matching two rare words is more surprising, so it should count more
    • IDF (Inverse Document Frequency) penalizes common words
    • Repeated occurrences of a word contribute less than the first few occurrences
  • Method 3: BM25 weighting (TF-IDF)
    • BM25 is a typical retrieval weighting function
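Methods 1 and 2 differ only in whether the pair correlations are IDF-weighted. A compact sketch (the correlation and IDF values are invented; the repeated-occurrence damping and the BM25 variant are omitted):

```python
def align_score(doc_e, doc_c, corr, idf=None):
    """Sum correlations over all English-Chinese word pairs (Method 1);
    if idf is supplied, weight each pair by the IDF of both words so
    rare-word matches count more (Method 2 flavor)."""
    score = 0.0
    for e in doc_e:
        for c in doc_c:
            r = corr.get((e, c), 0.0)
            if idf is not None:
                r *= idf.get(e, 1.0) * idf.get(c, 1.0)
            score += r
    return score

# Hypothetical pair correlations (echoing the Megawati example) and IDFs.
corr = {("Megawati", "梅加瓦蒂"): 0.885, ("Arafat", "梅加瓦蒂"): 0.032}
idf = {"Megawati": 3.2, "梅加瓦蒂": 3.0, "Arafat": 2.9}
doc_e, doc_c = ["Megawati", "visited"], ["梅加瓦蒂"]
print(round(align_score(doc_e, doc_c, corr, idf), 3))
```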

Document Alignment Evaluation
  • Randomly pick 6 English documents
  • “Retrieve” 50 Chinese documents (out of approx. 900) for each English document
  • Rank the 300 E-C pairs by each of 3 methods
  • Evaluate the relevance by standard “precision” metric

(Figure: precision curves for Method 1 (ExpCorr), Method 2 (IDF), and Method 3 (TF-IDF).)

About 80% of the top 100 pairs of documents are correct.

Summary and Some Ongoing Work
  • Some seed rules and corpora; more in progress
  • NER techniques being adapted to other languages
    • Investigating ITG for annotation transplantation
    • What features to use for various languages
  • Combined phonetic and temporal information in transliteration
    • Semantic/orthographic aspects of transliteration
    • Resource-poor transliteration
    • Document alignment
    • Coordinated mixture models

Acknowledgments
  • National Security Agency Contract NBCHC040176, REFLEX (Research on English and Foreign Language Exploitation)
  • Language experts thus far:
    • Karine Megerdoomian (Persian, Armenian)
    • Alla Rozovskaya (Russian, Hebrew)
    • Archna Bhatia (Hindi)
    • Brent Henderson (Swahili)
    • Tholani Hlongwa (Zulu)
  • Karen Livescu for much help with GMTK
