Name Ethnicity Classification and Ethnicity Sensitive Name Matching

Pucktada Treeratpituk and C. Lee Giles
College of Information Sciences and Technology, Penn State University

Outline
• Name-Matching & Name-Ethnicity
  • Problem Definition
  • Motivation
  • Previous Work
• Ethnicity-Sensitive Name-Matching Framework
  • Name-Ethnicity Classification
  • Ethnicity-Sensitive Name-Matching
• Evaluation
• Conclusion

Name Matching
• Name matching: pairwise people disambiguation based only on personal names
• Problem: can name1 and name2 refer to the same person?
  • Bill Gates = William Henry Gates?
  • Mao Zedong = Mao Tse-Tung?
• Many applications: NLP, information integration, social network analysis, etc.
• Name matching is a special case of string matching
  • In string matching, the objects to match can be product names, institution names, or street addresses
  • Name matching focuses on personal names only
• Goal: take advantage of what makes personal names differ from other types of names to improve the disambiguation result

Name and Ethnicity
• What makes personal names different from other types of names?
  • Personal names are very cultural (ethnicity-dependent)
  • Ethnicities are often identifiable from names
  • More importantly for name matching, the valid variations of a name depend on its ethnicity
• English names
  • Use of nicknames and middle names
  • William Henry Gates = Bill Gates, William H. Gates, William Gates

Name and Ethnicity (Cont)
• Middle Eastern names
  • Extensive use of ancestral names
  • Khalid Bin Hasan Bin Ahmad al-Fulan: Khalid, son of Hasan, son of Ahmad, of the Fulan family
  • Khalid Bin Hasan Bin Ahmad al-Fulan
    • = Khalid Bin Hasan al-Fulan (drop the grandfather's name)
    • = Khalid al-Fulan (drop both ancestral names)
    • != Khalid Bin Ahmad al-Fulan (cannot drop only the father's name)
• Spanish names
  • Use composite given names and two surnames (paternal and maternal)
  • Pedro Juan López Rodríguez = Pedro López (can drop the maternal surname)
  • Juan Morales Garcia = Juan Morales, but != Juan Garcia
• For English names, the last surname cannot similarly be dropped
  • William Henry Gates (Bill Gates) != William Henry (the chemist known for Henry's Law)

Name and Ethnicity (Cont)
• Chinese names
  • Multiple transliteration standards
    • Mao Zedong = Mao Tse-tung
  • Reverse ordering
    • Li Ming ~ Ming Li (this kind of error is more likely than for English names)
  • Western nicknames that are close to the original Chinese names are often used
    • Heung-Yeung Shum = Harry Shum
  • Segmentation
    • Heung-Yeung Shum = Heungyeung Shum = HY Shum = H Shum
    • Li Ka Shing != Li Shing ("Ka" is not a middle name, thus cannot be dropped)
• So a name-matching algorithm should be ethnicity-sensitive!

Previous Work
• Name-Matching
  • Phonetic-based, e.g. Soundex, Metaphone
    • Convert name strings to phonetic codes, then compare the codes
  • Edit-distance(-like) similarity
    • Winkler, Jaro-Winkler, Levenshtein, Smith-Waterman
• Name-Ethnicity Classification
  • Frequency-based (dictionary-based) methods
    • Certain names are more common in some ethnic groups, e.g. Rodriguez is a common Spanish last name
  • LDA-based model using US Census data [ICWSM10]
  • HMM + Decision Tree [KDD09]
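To make the edit-distance baseline concrete, here is a minimal sketch of Levenshtein distance and a normalized similarity derived from it. The function names and the normalization by the longer string's length are illustrative assumptions, not taken from the talk.

```python
def levenshtein(s, t):
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn s into t."""
    prev = list(range(len(t) + 1))  # distances from "" to each prefix of t
    for i, cs in enumerate(s, start=1):
        curr = [i]  # distance from s[:i] to ""
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def levenshtein_similarity(s, t):
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(s), len(t))
    return 1.0 if longest == 0 else 1.0 - levenshtein(s, t) / longest


if __name__ == "__main__":
    print(levenshtein("Stephen", "Steven"))  # 2: substitute p -> v, delete h
    print(levenshtein_similarity("Mao Zedong", "Mao Tse-tung"))
```

A purely character-level measure like this treats every edit the same way, which is why it cannot tell a valid transliteration variant from an unrelated name.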

Ethnicity-Sensitive Name Matching: Framework
1. Identify the name-ethnicities (e1, e2) of the two names
2. Compute the optimal alignment between the names using an ethnicity-dependent distance function
3. Generate the feature vector of the alignment profile, f = <x1, x2, …, x7>
4. Use an ethnicity-dependent model Me1,e2 to compute the match probability from the alignment profile
[Figure: Name 1 (Juan Gines Sanchez Moreno) and Name 2 (G Lopez Moreno) → Name-Ethnicity Classifier (e1, e2) → Optimal Alignment → Alignment Profile → Name Matching Model Me1,e2 → Match Probability p = 0.78]

Name-Ethnicity Classification
• Goal: to infer one's ethnicity from one's name
• Input: a personal name, e.g. Juan Gines Sanchez Moreno
• Feature vector F = <f1, f2, f3, …> with 4 types of features (sequences of characters, sequences of phonetic sounds, …)
• Multiclass classifier: multinomial logistic regression
• Output: an ethnicity, e.g. Chinese, British, German, etc.

Name-Ethnicity Classification: 4 Feature Types
• nonASCII: diacritic characters
  • Mineichirō Adachi => ō
  • Adriana Muñoz => ñ
• charNgram: character n-grams
  • Pad token boundaries with '$', and the last name's boundaries with '+'
  • 2-grams, 3-grams, and 4-grams
• soundex: phonetic encoding
  • Steven, Stephen, Stevenson => S315
  • Steeve => S310
• dmpNgram: Double Metaphone n-grams
  • Double Metaphone is designed to better handle non-English words and to deal with phonetic ambiguity
  • Schmidt => XMT and SMT
  • Steven, Stephen => STF; Stevenson => STFNSN
  • Uses a padding scheme similar to charNgram
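Three of the feature types above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' extractor: the last whitespace-separated token is assumed to be the last name, the function names are hypothetical, and the Soundex routine is the common American Soundex variant. Double Metaphone is left out because it needs a much larger rule set.

```python
# Soundex digit for each consonant; vowels, h, w, y have no digit.
SOUNDEX_CODES = {
    letter: digit
    for digit, letters in {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
                           "4": "l", "5": "mn", "6": "r"}.items()
    for letter in letters
}


def non_ascii_chars(name):
    """nonASCII features: the set of characters outside the ASCII range."""
    return {ch for ch in name if ord(ch) > 127}


def char_ngrams(name, sizes=(2, 3, 4)):
    """charNgram features. Assumption: the last token is the last name,
    padded with '+'; every other token is padded with '$'."""
    tokens = name.lower().split()
    grams = []
    for idx, token in enumerate(tokens):
        pad = "+" if idx == len(tokens) - 1 else "$"
        padded = pad + token + pad
        for n in sizes:
            grams.extend(padded[i:i + n] for i in range(len(padded) - n + 1))
    return grams


def soundex(name):
    """American Soundex: first letter plus three digits."""
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""
    digits = []
    prev = SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = SOUNDEX_CODES.get(ch, "")
        if code and code != prev:
            digits.append(code)
        prev = code  # a vowel resets prev, so a repeated code is kept
    return (letters[0].upper() + "".join(digits) + "000")[:4]


if __name__ == "__main__":
    print(non_ascii_chars("Adriana Muñoz"))
    print(char_ngrams("Li Ming", sizes=(2,)))
    print([soundex(n) for n in ("Steven", "Stephen", "Stevenson", "Steeve")])
```

The Soundex codes reproduce the examples on the slide (S315 for Steven, Stephen, and Stevenson; S310 for Steeve).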

Multinomial Logistic Regression
• Logistic regression generalized to multiple classes
• The set of coefficients {β_{k,0}, β_k}, k = 1…K-1, is estimated through an iterative process
• {y_k}, k = 1…K, is the set of ethnicities
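For reference, the standard multinomial logistic regression model with class K as the reference class, which matches the K-1 coefficient sets mentioned above, can be written as follows. This is the textbook form and is offered as an assumption about what the slide showed; F denotes the feature vector of a name.

```latex
P(y_k \mid F) = \frac{\exp\left(\beta_{k,0} + \beta_k^{\top} F\right)}
                     {1 + \sum_{j=1}^{K-1} \exp\left(\beta_{j,0} + \beta_j^{\top} F\right)},
\qquad k = 1, \dots, K-1

P(y_K \mid F) = \frac{1}{1 + \sum_{j=1}^{K-1} \exp\left(\beta_{j,0} + \beta_j^{\top} F\right)}
```

The predicted ethnicity is the class y_k with the largest probability.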

Ethnicity-Sensitive Name Matching
[Figure: the framework pipeline again, with the Name-Ethnicity Classifier step marked done (✔). Remaining steps: Optimal Alignment → Alignment Profile f = <x1, x2, …, x7> → Name Matching Model M → Match Probability p = 0.78]

Compute Optimal Alignment
• Modify the Smith–Waterman algorithm to find the optimal alignment between two names
• Smith–Waterman algorithm
  • Used for DNA sequence matching, e.g. between 'ACAT' and 'AGCA'
  • Uses dynamic programming to calculate the scoring matrix H
  • Character alignment: A = a1 a2 … aM and B = b1 b2 … bN
  • H(i, j) = the maximum similarity score between a1…ai and b1…bj
• Match/mismatch score: W(ai, bj) = 1 if ai = bj, and 0 otherwise
• Gap score: W(ai, -) = W(-, bj) = 0

Smith–Waterman: example
1. Fill the scoring matrix H using dynamic programming
2. Use the traceback procedure to find the optimal path
3. Extract the optimal alignment
[Figure: traceback path and resulting alignment]
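The three steps can be sketched in Python for the 'ACAT' / 'AGCA' example, using the scores given in the talk (match 1, mismatch 0, gap 0). With these non-negative scores the zero floor of the usual Smith–Waterman recurrence never activates, so the sketch aligns the full strings. The function name and the tie-breaking order in the traceback are assumptions.

```python
def align(a, b):
    """Fill the scoring matrix H, then trace back one optimal alignment.
    Scores follow the slide: match = 1, mismatch = 0, gap = 0."""
    m, n = len(a), len(b)
    # Step 1: H[i][j] = best similarity between a[:i] and b[:j].
    H = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            match = 1 if a[i - 1] == b[j - 1] else 0
            H[i][j] = max(H[i - 1][j - 1] + match,  # align a[i-1] with b[j-1]
                          H[i - 1][j],              # gap in b
                          H[i][j - 1])              # gap in a

    # Steps 2 and 3: walk back from H[m][n] and emit the aligned strings.
    top, bottom = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and a[i - 1] == b[j - 1] and H[i][j] == H[i - 1][j - 1] + 1:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and H[i][j] == H[i - 1][j]:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return H[m][n], "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    score, top, bottom = align("ACAT", "AGCA")
    print(score)  # 3: the characters A, C, A can be matched
    print(top)
    print(bottom)
```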

Extending Smith–Waterman
1. Word match
   • Align words, P = (p1, p2, …, pM) and Q = (q1, q2, …, qN), instead of characters, using a word similarity
2. Fuzzy match
   • Edward = E.
   • Kathy = Katharine
   • Can use an ethnicity-dependent nickname dictionary and transliteration rules
3. Span match
   • Al Hashim = Alhashim
   • De Félice = DeFélice
   • Zhao Hui Wu = Zhaohui Wu
   • Addresses the word-segmentation problem
4. Shift (None, Left, Right)
   • Find the optimal alignment for all 3 permutations
   • Min Seo Kim = Kim Min Seo
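A minimal sketch of how the four extensions could fit into one word-level dynamic program. All specifics are assumptions made for illustration: the tiny `NICKNAMES` dictionary, a word similarity that is either 0 or 1, scoring a span match as a single matched word, and the function names. The actual system uses ethnicity-dependent dictionaries and graded similarities.

```python
# Hypothetical nickname dictionary; a real one would be ethnicity-dependent.
NICKNAMES = {("kathy", "katharine")}


def word_sim(p, q):
    """Fuzzy word match: exact match, initial vs. full word, or nickname."""
    p, q = p.lower().rstrip("."), q.lower().rstrip(".")
    if p == q:
        return 1.0
    if (len(p) == 1 and q.startswith(p)) or (len(q) == 1 and p.startswith(q)):
        return 1.0  # "E." matches "Edward"
    if (p, q) in NICKNAMES or (q, p) in NICKNAMES:
        return 1.0
    return 0.0


def align_score(P, Q):
    """Word-level alignment score with span matches (two words vs. one)."""
    m, n = len(P), len(Q)
    H = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            best = max(H[i - 1][j], H[i][j - 1],
                       H[i - 1][j - 1] + word_sim(P[i - 1], Q[j - 1]))
            if i >= 2:  # span match: "Zhao" + "Hui" against "Zhaohui"
                best = max(best, H[i - 2][j - 1] + word_sim(P[i - 2] + P[i - 1], Q[j - 1]))
            if j >= 2:  # the same span match in the other direction
                best = max(best, H[i - 1][j - 2] + word_sim(P[i - 1], Q[j - 2] + Q[j - 1]))
            H[i][j] = best
    return H[m][n]


def best_shift_score(P, Q):
    """Try no shift, a left rotation, and a right rotation of P."""
    shifts = [P, P[1:] + P[:1], P[-1:] + P[:-1]]
    return max(align_score(shifted, Q) for shifted in shifts)


if __name__ == "__main__":
    print(align_score(["Edward", "Smith"], ["E.", "Smith"]))          # 2.0
    print(align_score(["Zhao", "Hui", "Wu"], ["Zhaohui", "Wu"]))      # 2.0
    print(best_shift_score(["Min", "Seo", "Kim"], ["Kim", "Min", "Seo"]))  # 3.0
```

Without the shift step, "Min Seo Kim" against "Kim Min Seo" scores only 2.0, because an order-preserving alignment can match at most "Min" and "Seo".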

Example
[Figure: traceback and resulting alignment]

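Alignment Profile
• Define an alignment profile as a vector of 7 features
• Example a: fa = (0, 0, 1, 0, 0, 0, 0.91); figure annotations: 0.96 x 0.95 (≈ 0.91, the last component), <skip>
• Example b: fb = (1, 0, 0, 0, 2, 0, 0.95); figure annotations: <skip>, <con>, 0.95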

Match Probability
• So far, we have converted a <name1, name2> pair into an alignment profile f = <x1, …, x7>
• Now we need a function ΘE: f => [0, 1] that converts an alignment profile into a probability
  • P = probability that name1 and name2 match = ΘE(f)
• Let D1, …, D7 be the discounting factors for the different types of misalignment
• If we assume that the odds ratio P/(1-P) is proportional to a function of these discounting factors, then the log odds ratio can be rewritten in the form of a simple logistic regression
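One way the step from discounting factors to logistic regression can work is sketched below. The multiplicative form of the odds is an assumption made for illustration, since the exact formula is not given in the text; C is a hypothetical proportionality constant.

```latex
\frac{P}{1-P} = C \prod_{i=1}^{7} D_i^{\,x_i}
\quad\Longrightarrow\quad
\log\frac{P}{1-P} = \underbrace{\log C}_{\beta_0} + \sum_{i=1}^{7} \underbrace{\log D_i}_{\beta_i}\, x_i

P = \Theta_E(f) = \frac{1}{1 + \exp\left(-\left(\beta_0 + \sum_{i=1}^{7} \beta_i x_i\right)\right)}
```

Under this assumption each coefficient β_i is the log of a discounting factor, so fitting one logistic regression per ethnicity group E learns how heavily each type of misalignment is penalized for that group.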

Outline
• Name-Matching & Name-Ethnicity
  • Problem Definition
  • Motivation
  • Previous Work
• Ethnicity-Sensitive Name-Matching Framework
  • Name-Ethnicity Classification
  • Ethnicity-Sensitive Name-Matching
• Evaluation
  • Name-Ethnicity Classification (via Wikipedia)
  • Ethnicity-Sensitive Name-Matching (via DBLP data)
• Conclusion

Evaluation: Name-Ethnicity Classification
• Use Wikipedia as the data source
  • Finer-grained than the US Census, which has only 6 types of ethnic groups
    • White, African American, Hispanic, Asian+Pacific Islander, Multi-nationality, Others
• Automatically crawl names of various nationalities from Wikipedia categories
  • Use breadth-first search starting from the "<nationality> people" pages, up to a depth of 4
• Manually curate the results with some heuristics
  • E.g. names of 'British people of Indian descent' are more likely to be of Indian ethnicity than of British ethnicity
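The depth-limited breadth-first crawl can be sketched as follows. To keep the example self-contained it walks an in-memory dictionary instead of querying Wikipedia; the `subcategories` mapping, the category names, and the function name are all hypothetical.

```python
from collections import deque


def crawl_categories(start, subcategories, max_depth=4):
    """Breadth-first traversal of a category graph.
    Returns (category, depth) pairs for every category within max_depth."""
    seen = {start}
    queue = deque([(start, 0)])
    visited = []
    while queue:
        category, depth = queue.popleft()
        visited.append((category, depth))
        if depth == max_depth:
            continue  # do not expand beyond the depth limit
        for child in subcategories.get(category, []):
            if child not in seen:  # category graphs can contain cycles
                seen.add(child)
                queue.append((child, depth + 1))
    return visited


if __name__ == "__main__":
    # Hypothetical toy graph standing in for Wikipedia's category tree.
    toy_graph = {
        "Japanese people": ["Japanese people by occupation"],
        "Japanese people by occupation": ["Japanese scientists", "Japanese writers"],
        "Japanese scientists": ["Japanese physicists"],
    }
    for category, depth in crawl_categories("Japanese people", toy_graph):
        print(depth, category)
```

In a real crawl the person pages listed under each visited category would supply the names, labeled with the nationality of the starting page.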

Wikipedia Data
• 19 nationalities
• 12 ethnic groups
• 70/30 split for training and testing

Accuracy and Confusion Matrix
• 85% overall accuracy, dropping slightly to 84% if the nonASCII features are ignored
• High confusion between MEA and IND, and between ENG, FRN, and GER (observation: countries with high immigration rates)
• Asian names are fairly easy to identify, especially JAP

Top Identifiable Features
• Top features (without diacritics) for each name-ethnicity class, ranked by the coefficients of the logistic regression model, e.g.
  • The 'bh' sequence is mostly unique to Indian names, while names with 'sch' are likely to be German
  • Names ending with 'ng' are mostly Chinese names

Top Identifiable Features (Full)
• Top features (including diacritic features) for each name-ethnicity class
• While many diacritic features are highly ranked (especially for European names), removing them hurts the accuracy only slightly

Evaluation: Ethnicity-Sensitive Name Matching
• Data: DBLP10K person data set (10,000 pairs)
  • Manually labeled data from DBLP's correction requests and heuristically detected errors
  • Lange, D., and Naumann, F. Frequency-aware Similarity Measures: Why Arnold Schwarzenegger is Always a Duplicate. CIKM 2011
  • Select only the paper reference pairs from the same author with different name aliases (2,500 pairs)
• Compare with 4 baselines (2 Basic and 2 Level2)
  • Basic: Levenshtein, Jaro-Winkler
  • Level2 [Monge and Elkan, KDD96]: L2 Levenshtein, L2 Jaro-Winkler
    • Recursive matching scheme for multi-field strings (last names, forenames)
• Ethnicity-Sensitive Name-Matching (4 models)
  • Middle Eastern (MEA), Spanish (SPA), East Asian (CHI, JAP, KOR, VIE), and Default (ALL)

Experiment Result
• N x N comparison (N ~ 2,500)

Method            F1             R              P
Levenshtein       0.70           0.6            0.81
Jaro-Winkler      0.75           0.7            0.81
L2 Levenshtein    0.77           0.8            0.74
L2 Jaro-Winkler   0.80           0.7            0.93
Our Algorithm     0.94 (+0.14)   0.89 (+0.19)   0.99 (+0.06)

• Error cases:
  • Maria-Florina Popa / Maria-Florina Balcan
  • Hedvig Sidenbladh / Hedvig Kjellstrom
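As a quick consistency check on the reported numbers, F1 is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The short sketch below recomputes it from the P and R values listed above; the function name is illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) as listed in the results above.
reported = {
    "Levenshtein": (0.81, 0.6, 0.70),
    "Jaro-Winkler": (0.81, 0.7, 0.75),
    "L2 Levenshtein": (0.74, 0.8, 0.77),
    "L2 Jaro-Winkler": (0.93, 0.7, 0.80),
    "Our Algorithm": (0.99, 0.89, 0.94),
}

if __name__ == "__main__":
    for method, (p, r, f1) in reported.items():
        print(f"{method}: recomputed F1 = {f1_score(p, r):.3f}, reported = {f1}")
```

The recomputed values agree with the reported F1 scores up to rounding; the baseline recalls are given to only one decimal place, so small differences in the second decimal are expected.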

EthnicSeer http://singularity.ist.psu.edu/ethnicity

Conclusion & Future Work
• Name-ethnicity classification
  • 85% accuracy on 12 ethnicities on Wikipedia data
  • Shows that character and phonetic n-grams, together with a logistic regression model, can effectively identify name-ethnicity
• Ethnicity-sensitive name-matching
  • Improves performance on the hard DBLP data set over the best baselines: F1 = 0.94 (+0.14), P = 0.99 (+0.06)
• Future work
  • Expand to more ethnicities and to finer-grained classification (e.g. French in Quebec vs. in France)
  • Incorporate frequency knowledge and more syntactic knowledge
  • Ethnicity trends and prediction
  • Use a finer-grained name-ethnicity distance function
    • Naming conventions in Spain and in Latin America differ somewhat

Big Picture

Q&A
