Name Ethnicity Classification and Ethnicity Sensitive Name Matching

Pucktada Treeratpituk and C. Lee Giles

College of Information Sciences and Technology

Penn State University

Outline
  • Name-Matching & Name-Ethnicity
    • Problem Definition
    • Motivation
  • Previous Work
  • Ethnicity-Sensitive Name-Matching Framework
    • Name-Ethnicity Classification
    • Ethnicity Sensitive Name-Matching
  • Evaluation
  • Conclusion
Name Matching
  • Name matching
    • Pairwise people disambiguation based only on personal names
    • Problem: Can name1 and name2 refer to the same person?
      • Bill Gates = William Henry Gates ?
      • Mao Zedong = Mao Tse-Tung ?
  • Lots of applications
    • NLP, Information Integration, Social Network Analysis, etc.
  • Name matching is a special case of string matching
    • In string matching, the objects to match can be
      • product names, institution names, street addresses
    • Name matching focuses on just personal names
      • The goal is to exploit what makes personal names differ from other types of names to improve the disambiguation result
Name and Ethnicity
  • What makes personal names different from other types of names?
  • Personal names are very cultural (ethnicity-dependent)
    • Ethnicities are often identifiable from names
    • More importantly, for name matching, valid variations in names are dependent on ethnicities
  • English names
    • Use of nicknames and middle names
    • William Henry Gates
      • = Bill Gates, William H. Gates, William Gates
Name and Ethnicity (Cont)
  • Middle Eastern names
    • Extensive use of ancestral names
    • Khalid Bin Hasan Bin Ahmad al-Fulan
      • Khalid, son of Hasan, son of Ahmad, of the Fulan family
    • Khalid Bin Hasan Bin Ahmad al-Fulan
      • = Khalid Bin Hasan al-Fulan (the grandfather's name can be dropped)
      • = Khalid al-Fulan (both ancestral names can be dropped)
      • != Khalid Bin Ahmad al-Fulan (the father's name cannot be dropped on its own)
  • Spanish names
    • Use composite given names and two surnames (paternal and maternal)
    • Pedro Juan López Rodríguez = Pedro López (the maternal surname can be dropped)
    • Juan Morales Garcia = Juan Morales
      • != Juan Garcia
    • William Henry Gates (Bill Gates)
      • != William Henry (early 19th-century chemist, Henry’s Law)
      • For English names, the last surname cannot similarly be dropped
Name and Ethnicity (Cont)
  • Chinese names
    • Multiple transliteration standards
      • Mao Zedong = Mao Tse-tung
    • Reverse ordering
      • Li Ming ~ Ming Li (this kind of error is more likely than for English names)
    • Western nicknames that are close to the original Chinese names are often used
      • Heung-Yeung Shum = Harry Shum
    • Segmentation
      • Heung-Yeung Shum = Heungyeung Shum = HY Shum = H Shum
      • Li Ka Shing != Li Shing ("Ka" is not a middle name, and thus cannot be dropped)
  • So a name-matching algorithm should be ethnicity-sensitive
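
The Middle Eastern example above can be read as a rule: ancestral names may be dropped from the end of the chain, but not from the middle. A minimal sketch of that reading follows. The function name `ancestral_variants` and the assumed input shape `<given> (Bin <ancestor>)* <family>` are illustrative and not taken from the slides.

```python
def ancestral_variants(name):
    """Valid shortened forms of '<given> (Bin <ancestor>)* <family>'.

    Assumption for illustration: ancestors may only be dropped from the
    end of the chain (most distant first), never from the middle.
    """
    tokens = name.split()
    given, family = tokens[0], tokens[-1]
    chain = tokens[1:-1]        # e.g. ['Bin', 'Hasan', 'Bin', 'Ahmad']
    ancestors = chain[1::2]     # every second token is an ancestor's name
    variants = []
    # Keep the first `keep` ancestors, from all of them down to none.
    for keep in range(len(ancestors), -1, -1):
        parts = [given]
        for ancestor in ancestors[:keep]:
            parts += ["Bin", ancestor]
        parts.append(family)
        variants.append(" ".join(parts))
    return variants


variants = ancestral_variants("Khalid Bin Hasan Bin Ahmad al-Fulan")
```

Under this rule, "Khalid Bin Hasan al-Fulan" and "Khalid al-Fulan" are generated, while "Khalid Bin Ahmad al-Fulan" is not, which agrees with the slide.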
Previous Work
  • Name-Matching
    • Phonetic-based – e.g. Soundex, Metaphone
      • Convert name-strings to phonetic codes then compare
    • Edit-distance (like) similarity
      • Winkler, Jaro-Winkler, Levenshtein, Smith-Waterman
  • Name-Ethnicity Classification
    • Frequency-based method (Dictionary-based)
      • Certain names are more common in some ethnic groups, e.g. Rodriguez is a common Spanish last name, etc.
    • LDA-based model using US Census [ICWSM10]
    • HMM + Decision Tree [KDD09]
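
For reference, the edit-distance family of measures listed above can be illustrated with a plain Levenshtein distance. This is a generic textbook sketch; the normalisation in `similarity` (dividing by the longer string's length) is one common choice and an assumption here, not something the slides specify.

```python
def levenshtein(s, t):
    """Minimum number of single-character insertions, deletions and substitutions."""
    prev = list(range(len(t) + 1))          # distances from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similarity(s, t):
    """Map the distance to [0, 1]; 1.0 means identical strings."""
    longest = max(len(s), len(t))
    return 1.0 if longest == 0 else 1.0 - levenshtein(s, t) / longest
```

A call such as `similarity("Bill Gates", "William Henry Gates")` treats the names as plain strings, with no knowledge of nicknames or middle names.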
Ethnicity-Sensitive Name Matching: Framework

Input: Name 1 and Name 2, e.g. Juan Gines Sanchez Moreno and G Lopez Moreno

1. Identify the name-ethnicities e1, e2 with the name-ethnicity classifier

2. Compute the optimal alignment between the names using an ethnicity-dependent distance function

3. Generate the feature vector of the alignment profile, f = <x1, x2, …, x7>

4. Use an ethnicity-dependent name-matching model Me1,e2 to compute the match probability based on the alignment profile

Output: match probability, e.g. p = 0.78

Name-Ethnicity Classification
  • Goal: To infer one’s ethnicity from one’s name
  • Input: a personal name, e.g. Juan Gines Sanchez Moreno
  • Feature vector F = <f1, f2, f3, …> with 4 types of features (sequences of characters, sequences of phonetic sounds, …)
  • Multiclass classifier: multinomial logistic regression
  • Output: an ethnicity, e.g. Chinese, British, German

Name-Ethnicity Classification: 4 Feature Types
  • nonASCII: diacritic characters
    • Mineichirō Adachi => ō
    • Adriana Muñoz => ñ
  • charNgram: character n-grams
    • Pad token boundaries with ‘$’, and the last name’s boundaries with ‘+’
    • 2-grams, 3-grams, and 4-grams
  • soundex: phonetic encoding
    • Steven, Stephen, Stevenson => S315
    • Steeve => S310
  • dmpNgram: Double Metaphone n-grams
    • Double Metaphone is designed to handle non-English words better and to deal with phonetic ambiguity
    • Schmidt => XMT and SMT
    • Steven, Stephen => STF; Stevenson => STFNSN
    • Uses a padding scheme similar to charNgram
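
A rough sketch of how three of the four feature types could be extracted, using only the Python standard library. The function names, the lowercasing in `char_ngrams`, and the exact padding layout (every token wrapped in ‘$’, the last token wrapped in ‘+’ instead) are assumptions made for illustration. Double Metaphone is omitted because it has no standard-library implementation. Inputs are assumed to be non-empty.

```python
import unicodedata

# Standard American Soundex digit for each consonant.
SOUNDEX_CODES = {c: d
                 for d, letters in {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
                                    "4": "L", "5": "MN", "6": "R"}.items()
                 for c in letters}


def non_ascii_chars(name):
    """nonASCII feature: characters outside the ASCII range (after NFC normalisation)."""
    return [c for c in unicodedata.normalize("NFC", name) if ord(c) > 127]


def char_ngrams(name, sizes=(2, 3, 4)):
    """charNgram feature: n-grams over padded tokens ('$' for tokens, '+' for the last name)."""
    tokens = name.lower().split()
    padded = ["$" + t + "$" for t in tokens[:-1]] + ["+" + tokens[-1] + "+"]
    return [p[i:i + n] for p in padded for n in sizes for i in range(len(p) - n + 1)]


def soundex(word):
    """soundex feature: first letter plus up to three digits, zero-padded."""
    word = word.upper()
    code, prev = word[0], SOUNDEX_CODES.get(word[0], "")
    for ch in word[1:]:
        if ch in "HW":              # H and W do not separate equal codes
            continue
        digit = SOUNDEX_CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        prev = digit                # vowels reset prev to ""
    return (code + "000")[:4]
```

With this implementation, `soundex` reproduces the codes on the slide: S315 for Steven, Stephen and Stevenson, and S310 for Steeve.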
Multinomial Logistic Regression
  • Logistic regression generalized to multiple classes
  • The set of coefficients {β_{k,0}, β_k}, k = 1…K-1, is estimated through an iterative process
  • {y_k}, k = 1…K, is the set of ethnicities
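
The model equation itself does not appear in this transcript. A standard way to write a multinomial logistic regression with K-1 coefficient sets is shown below. It assumes that class K serves as the reference class and that F is the feature vector; both are assumptions chosen to match the notation above.

```latex
P(y = y_k \mid F) = \frac{\exp\left(\beta_{k,0} + \beta_k^{\top} F\right)}
                         {1 + \sum_{j=1}^{K-1} \exp\left(\beta_{j,0} + \beta_j^{\top} F\right)},
\qquad k = 1, \dots, K-1

P(y = y_K \mid F) = \frac{1}{1 + \sum_{j=1}^{K-1} \exp\left(\beta_{j,0} + \beta_j^{\top} F\right)}
```

The K probabilities sum to 1, and the predicted ethnicity is the class with the highest probability.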
Ethnicity-Sensitive Name Matching
  • Step 1 (name-ethnicity classification, giving e1, e2) is done; the remaining steps for a name pair such as Juan Gines Sanchez Moreno and G Lopez Moreno are:
    • Optimal alignment
    • Alignment profile f = <x1, x2, …, x7>
    • Name-matching model M
    • Match probability, e.g. p = 0.78

Compute Optimal Alignment
  • Modify the Smith-Waterman algorithm to find the optimal alignment between two names
  • Smith-Waterman algorithm
    • DNA sequence matching, e.g. between ‘ACAT’ and ‘AGCA’
    • Use dynamic programming to calculate the scoring matrix H
      • Character alignment: A = a1a2…aM and B = b1b2…bN
      • H(i, j) = the maximum similarity score between a1…ai and b1…bj
    • Match/mismatch score: W(ai, bj) = 1 if ai = bj, and 0 otherwise
    • Gap score: W(ai, -) = W(-, bj) = 0
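
The recurrence for H is not preserved in this transcript. The textbook Smith-Waterman recurrence with a linear gap score, written with the W notation above, would be:

```latex
H(i, 0) = H(0, j) = 0

H(i, j) = \max
\begin{cases}
0 \\
H(i-1, j-1) + W(a_i, b_j) \\
H(i-1, j) + W(a_i, -) \\
H(i, j-1) + W(-, b_j)
\end{cases}
\qquad 1 \le i \le M,\ 1 \le j \le N
```

Whether the slide used exactly this form is an assumption. With the scores given above (match 1, mismatch 0, gap 0), every term is non-negative, so the leading 0 never changes the result.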

Smith-Waterman: Example

1. Fill the scoring matrix H using dynamic programming

2. Use the traceback procedure to find the optimal path

3. Extract the optimal alignment

[Figure: traceback path through H and the resulting alignment]
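
The three steps can be sketched for character sequences as follows. This is a generic Smith-Waterman sketch and not the authors' code. The default scores (match 1, mismatch 0, gap 0) follow the earlier slide, while the tie-breaking order in the traceback (diagonal, then up, then left) is an arbitrary choice made here.

```python
def smith_waterman(a, b, match=1, mismatch=0, gap=0):
    """Return (score, aligned_a, aligned_b) for the best local alignment."""
    m, n = len(a), len(b)
    # Step 1: fill the scoring matrix H; row 0 and column 0 stay at 0.
    H = [[0] * (n + 1) for _ in range(m + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            w = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + w,   # align a[i-1] with b[j-1]
                          H[i - 1][j] + gap,     # gap in b
                          H[i][j - 1] + gap)     # gap in a
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Step 2: trace back from the highest-scoring cell until the score reaches 0.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        w = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + w:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    # Step 3: the path was collected backwards, so reverse it.
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


score, top, bottom = smith_waterman("ACAT", "AGCA")
```

For the slide's example pair ‘ACAT’ and ‘AGCA’, the best score under these settings is 3.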

Extending Smith-Waterman
  • 1. Word match
    • Align word sequences P = (p1, p2, …, pM) and Q = (q1, q2, …, qN) instead of characters, scoring each pair with a word similarity
  • 2. Fuzzy match
    • Edward = E.
    • Kathy = Katharine
    • Can use an ethnicity-dependent nickname dictionary and transliteration rules
  • 3. Span match
    • Al Hashim = Alhashim
    • De Félice = DeFélice
    • Zhao Hui Wu = Zhaohui Wu
    • Addresses the word-segmentation problem
  • 4. Shift (None, Left, Right)
    • Find the optimal alignment for all 3 permutations
    • Min Seo Kim = Kim Min Seo
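
A simplified sketch of extensions 1, 2 and 4: word-level alignment, a fuzzy word similarity, and the three shifts. Everything specific here is hypothetical: the `NICKNAMES` table, the 0.9 score for initials and nicknames, and the function names are placeholders and not values from the slides. Span match is left out.

```python
# Hypothetical nickname dictionary; a real one would be ethnicity-dependent.
NICKNAMES = {"kathy": "katharine", "bill": "william"}


def word_sim(p, q):
    """Fuzzy word similarity in [0, 1]; the 0.9 values are illustrative only."""
    p, q = p.lower().rstrip("."), q.lower().rstrip(".")
    if p == q:
        return 1.0
    if (len(p) == 1 and q.startswith(p)) or (len(q) == 1 and p.startswith(q)):
        return 0.9                      # initial vs. full word, e.g. "E." vs "Edward"
    if NICKNAMES.get(p) == q or NICKNAMES.get(q) == p:
        return 0.9                      # nickname vs. formal name
    return 0.0


def align_score(P, Q):
    """Smith-Waterman-style score over word lists, with gap score 0."""
    H = [[0.0] * (len(Q) + 1) for _ in range(len(P) + 1)]
    for i in range(1, len(P) + 1):
        for j in range(1, len(Q) + 1):
            H[i][j] = max(0.0,
                          H[i - 1][j - 1] + word_sim(P[i - 1], Q[j - 1]),
                          H[i - 1][j],
                          H[i][j - 1])
    # All scores are non-negative, so the last cell holds the maximum.
    return H[-1][-1]


def shifts(words):
    """The three permutations: none, left rotation, right rotation."""
    return [words, words[1:] + words[:1], words[-1:] + words[:-1]]


def best_score(name1, name2):
    """Best alignment score of name2 against any shift of name1."""
    P, Q = name1.split(), name2.split()
    return max(align_score(S, Q) for S in shifts(P))
```

For "Min Seo Kim" and "Kim Min Seo", the right shift of the first name lines up all three words, so `best_score` returns 3.0.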

Example

[Figure: traceback and the resulting alignment]

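Alignment Profile
  • Define an alignment profile as a vector of 7 features
  • Example a: fa = (0, 0, 1, 0, 0, 0, 0.91); alignment annotated with <skip>; the last feature is 0.96 x 0.95
  • Example b: fb = (1, 0, 0, 0, 2, 0, 0.95); alignment annotated with <skip> and <con>; the last feature is 0.95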

Match Probability
  • So far, we have converted a <name1, name2> pair to an alignment profile f = <x1, …, x7>
  • Now we need a function ΘE: f => [0,1] that converts an alignment profile to a probability
  • P = probability that name1 and name2 match = ΘE(f)
    • Let D1, …, D7 be the discounting factors for different types of misalignment
    • Given a proportionality assumption on the odds ratio P/(1-P) in terms of D1, …, D7, the log odds ratio can be rewritten in the form of a simple logistic regression
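
The proportionality formula is not preserved in this transcript. One form that leads to a logistic regression, assumed here purely for illustration, lets each discounting factor D_i act multiplicatively with exponent x_i:

```latex
\frac{P}{1 - P} \;\propto\; \prod_{i=1}^{7} D_i^{\,x_i}
\quad\Longrightarrow\quad
\log \frac{P}{1 - P} = \beta_0 + \sum_{i=1}^{7} \beta_i x_i ,
\qquad \beta_i = \log D_i

P = \Theta_E(f) = \frac{1}{1 + \exp\left(-\beta_0 - \sum_{i=1}^{7} \beta_i x_i\right)}
```

Here β_0 absorbs the proportionality constant. Under this assumption, fitting one such model per ethnicity group amounts to learning ethnicity-dependent discounting factors.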

Outline
  • Name-Matching & Name-Ethnicity
    • Problem Definition
    • Motivation
  • Previous Work
  • Ethnicity-Sensitive Name-Matching Framework
    • Name-Ethnicity Classification
    • Ethnicity Sensitive Name-Matching
  • Evaluation
    • Name-Ethnicity Classification (via Wikipedia)
    • Ethnicity Sensitive Name-Matching (via DBLP data)
  • Conclusion
Evaluation: Name-Ethnicity Classification
  • Use Wikipedia as the data source
    • More fine-grained than the US Census, which only has 6 types of ethnic groups
      • White, African American, Hispanic, Asian+Pacific Islander, Multi-nationality, Others
  • Automatically crawl names of various nationalities from Wikipedia categories
    • Use breadth-first search starting from “<nationality> people” pages, up to a depth of 4
    • Manually curate the results with some heuristics
      • E.g. names of ‘British people of Indian descent’ are more likely to be names of Indian ethnicity than of British ethnicity
Wikipedia Data
  • 19 Nationalities
  • 12 Ethnic groups
  • 70/30 split for training and testing
Accuracy and Confusion Matrix
  • 85% overall accuracy, dropping slightly to 84% if nonASCII features are ignored
  • High confusion between MEA and IND, and between ENG, FRN, and GER (observation: countries with high immigration rates)
  • Asian names are fairly easy to identify, especially JAP
Top Identifiable Features
  • Top features (without diacritics) for each name-ethnicity class according to the coefficients in the logistic regression models, e.g.
    • The ‘bh’ sequence is mostly unique to Indian names, while names with ‘sch’ are likely to be German names
    • Names ending with ‘ng’ are mostly Chinese names
Top Identifiable Features (Full)
  • Top features (including diacritics features) for each name-ethnicity class
  • While many diacritics features are highly ranked (especially for European names), removing them only hurts the accuracy slightly
Evaluation: Ethnicity Sensitive Name Matching
  • Data: DBLP10K person data set (10,000 pairs)
    • Manually labeled data from DBLP’s correction requests and heuristically detected errors
    • Lange, D., and Naumann, F. Frequency-aware Similarity Measures: Why Arnold Schwarzenegger is Always a Duplicate. CIKM 2011
    • Select only the paper reference pairs from the same author with different name aliases (2,500 pairs)
  • Compare with 4 baselines (2 Basic and 2 Level2)
    • Basic
      • Levenshtein, Jaro-Winkler
    • Level2 [Monge and Elkan, KDD96]
      • Recursive matching scheme for multi-field strings (last names, forenames)
      • L2 Levenshtein, L2 Jaro-Winkler
  • Ethnicity-Sensitive Name-Matching (4 Models)
    • Middle Eastern (MEA), Spanish (SPA), East Asian (CHI, JAP, KOR, VIE), and Default (ALL)
Experiment Result
  • N x N comparison (N ~ 2,500)
  • Baselines
    • Levenshtein: F1=0.70 (R=0.6, P=0.81)
    • Jaro-Winkler: F1=0.75 (R=0.7, P=0.81)
    • L2 Levenshtein: F1=0.77 (R=0.8, P=0.74)
    • L2 Jaro-Winkler: F1=0.80 (R=0.7, P=0.93)
  • Our algorithm: F1=0.94 (+0.14), R=0.89 (+0.19), P=0.99 (+0.06)
  • Error cases
    • Maria-Florina Popa vs. Maria-Florina Balcan
    • Hedvig Sidenbladh vs. Hedvig Kjellstrom
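
As a quick consistency check, F1 is the harmonic mean of precision and recall, so the reported F1 values can be recomputed from the reported P and R. The helper name `f1_score` is illustrative. The baseline recalls are given to one decimal only, so recomputed baseline values can differ in the second decimal.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; defined as 0.0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


ours = round(f1_score(0.99, 0.89), 2)        # proposed method, P=0.99, R=0.89
best_baseline = round(f1_score(0.93, 0.7), 2)  # L2 Jaro-Winkler, P=0.93, R=0.7
```

Both rounded values agree with the slide: 0.94 for the proposed method and 0.80 for L2 Jaro-Winkler.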

EthnicSeer

http://singularity.ist.psu.edu/ethnicity

Conclusion & Future Work
  • Name-ethnicity classification
    • 85% accuracy on 12 ethnicities on Wikipedia
    • Shows that character/phonetic n-grams together with a logistic regression model can effectively identify name-ethnicity
  • Ethnicity-sensitive name-matching
    • Improves performance to F1=0.94 (+0.14) and P=0.99 (+0.06) on the hard DBLP data set, relative to the best baseline
  • Future Work
    • Expand to more ethnicities and to finer-grained classification (e.g. French in Quebec vs. in France)
    • Incorporate frequency knowledge and more syntactic knowledge
    • Ethnicity trends & prediction
    • Use a finer-grained name-ethnicity distance function
      • Naming conventions in Spain and Latin America differ somewhat