Enhancing transliteration accuracy for names via NLP and OCR integration. Analyzing various datasets, models, and correction methods to achieve high precision.
Automatic Name Transliteration via OCR and NLP
Yu Cao, Tao Wang
Optical Character Recognition (OCR)
• ICDAR 2011 dataset
• characters embedded in natural scenes
• histogram of oriented gradients (HOG) features
• 8x8 window slid across the image with a step of 2
• linear-kernel SVM
• 52 classes (uppercase and lowercase letters)
• overall character-level accuracy: 74%
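The feature extraction on this slide can be sketched roughly as follows. This is a simplified illustration, not the authors' implementation: the 9 unsigned orientation bins, the central-difference gradients, and the per-window L2 normalization are assumptions, and the function names are hypothetical. Only the 8x8 window and the step of 2 come from the slide. The resulting vector is what would be passed to the linear-kernel SVM, which is not shown here.

```python
import math

def window_histogram(img, top, left, size=8, bins=9):
    """Orientation histogram of gradients inside one size x size window.
    bins=9 unsigned orientations is an assumption, not from the slides."""
    hist = [0.0] * bins
    h, w = len(img), len(img[0])
    for y in range(top, top + size):
        for x in range(left, left + size):
            # Central differences, clamped at the image border.
            gx = img[y][min(x + 1, w - 1)] - img[y][max(x - 1, 0)]
            gy = img[min(y + 1, h - 1)][x] - img[max(y - 1, 0)][x]
            magnitude = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 180.0  # unsigned angle
            hist[int(angle / (180.0 / bins)) % bins] += magnitude
    # L2-normalize so the descriptor is less sensitive to contrast.
    norm = math.sqrt(sum(v * v for v in hist)) or 1.0
    return [v / norm for v in hist]

def hog_features(img, size=8, step=2, bins=9):
    """Slide a size x size window over a grayscale image (list of rows)
    at the given step and concatenate the per-window histograms."""
    h, w = len(img), len(img[0])
    features = []
    for top in range(0, h - size + 1, step):
        for left in range(0, w - size + 1, step):
            features.extend(window_histogram(img, top, left, size, bins))
    return features

# Toy 16x16 image with a vertical edge between columns 7 and 8.
toy_image = [[0 if x < 8 else 1 for x in range(16)] for _ in range(16)]
toy_features = hog_features(toy_image)
```

On a 16x16 input there are 5 window positions along each axis, so the descriptor has 5 x 5 x 9 = 225 values.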
Bayesian Correction
• character-level bigram language model
• character-level accuracy improved to 75.3%
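One common way to combine per-character classifier scores with a character bigram model is Viterbi decoding over the character sequence. The slides do not say exactly how the correction was carried out, so the sketch below is an assumption about the general technique; the function name, the `^` start symbol, the probability floor, and the toy probabilities are all hypothetical.

```python
import math

def viterbi_correct(emissions, bigram, alphabet, floor=1e-6):
    """Return the most probable character string.

    emissions: one dict per position, char -> P(observation | char)
    bigram:    dict (prev_char, char) -> P(char | prev_char); '^' = word start
    floor:     smoothing value for unseen events (an assumption)
    """
    def logp(p):
        return math.log(max(p, floor))

    # best[c] = (log score of the best path ending in c, that path)
    best = {
        c: (logp(bigram.get(("^", c), 0)) + logp(emissions[0].get(c, 0)), c)
        for c in alphabet
    }
    for emission in emissions[1:]:
        updated = {}
        for c in alphabet:
            score, path = max(
                (best[p][0] + logp(bigram.get((p, c), 0)), best[p][1])
                for p in alphabet
            )
            updated[c] = (score + logp(emission.get(c, 0)), path + c)
        best = updated
    return max(best.values())[1]

# Toy example: the classifier slightly prefers 'o' in the middle position,
# but the bigram model makes "ca" far more likely than "co".
toy_emissions = [{"c": 0.9, "o": 0.1}, {"o": 0.55, "a": 0.45}, {"t": 1.0}]
toy_bigram = {("^", "c"): 1.0, ("c", "a"): 0.8, ("c", "o"): 0.2,
              ("a", "t"): 1.0, ("o", "t"): 1.0}
corrected = viterbi_correct(toy_emissions, toy_bigram, "acot")
```

Taking the top classifier choice at each position would read "cot"; with the bigram prior the decoded string is "cat".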
Named Entity Recognition (NER)
• essentially two types of labels: “PERSON” and “NONPERSON”
• MUC-7 corpus
• maximum entropy Markov model
• feature set: “CUR_WORD”, “PREV_LABEL”, “MID_INITIAL”, “IN_DICT”, “IN_NAME_DATABASE”, “NEXT_WORD”
• F1 score of 77.5% (precision 76.9%, recall 78.1%)
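The feature set can be illustrated with a small extractor. The slides list only the feature names, so the meaning given to each feature here is a guess (for example, that MID_INITIAL fires on tokens such as "F."), and the function names, the `</s>` end marker, and the toy word lists are hypothetical. The second function simply recomputes F1 from the reported precision and recall.

```python
def extract_features(words, i, prev_label, dictionary, name_db):
    """Features for token i of a sentence, as a maximum entropy Markov
    model might see them. Feature semantics are assumptions."""
    word = words[i]
    return {
        "CUR_WORD": word.lower(),
        "PREV_LABEL": prev_label,
        # A single capital letter followed by a period, e.g. "F."
        "MID_INITIAL": len(word) == 2 and word[0].isupper() and word[1] == ".",
        "IN_DICT": word.lower() in dictionary,      # ordinary-word lexicon
        "IN_NAME_DATABASE": word in name_db,        # list of known names
        "NEXT_WORD": words[i + 1].lower() if i + 1 < len(words) else "</s>",
    }

def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

sentence = ["Mr.", "John", "F.", "Kennedy", "spoke"]
features = extract_features(sentence, 2, "PERSON", {"spoke"}, {"John", "Kennedy"})
reported_f1 = f1_score(0.769, 0.781)
```

With precision 0.769 and recall 0.781, the harmonic mean rounds to 0.775, which matches the F1 score reported on the slide.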
Transliteration
• example: Francisco → 弗朗西斯科
• character-level translation model
• training data: 4,256 English-Chinese name pairs obtained online
• trigram Chinese language model
• alignment models: IBM Models 1, 3, and 4
• human evaluation
• 120 English names obtained by NER for testing
• acceptance score: 100 ± 2 out of 120
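To make the alignment step concrete, here is a minimal sketch of EM training for IBM Model 1 only, without a NULL source unit. Models 3 and 4 and the trigram language model used for decoding are not covered. The function names are hypothetical, and the three training pairs are made-up toy data segmented into syllable-like chunks so that the example converges quickly; the slides describe a character-level model trained on 4,256 real name pairs.

```python
from collections import defaultdict

def train_ibm1(pairs, iterations=10):
    """EM for IBM Model 1 translation probabilities.
    pairs: list of (source_units, target_units).
    Returns t where t[(tgt, src)] approximates P(tgt | src)."""
    target_vocab = {tu for _, tgt in pairs for tu in tgt}
    t = defaultdict(lambda: 1.0 / len(target_vocab))  # uniform start
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        # E-step: distribute each target unit over the source units
        # of its pair in proportion to the current probabilities.
        for src, tgt in pairs:
            for tu in tgt:
                z = sum(t[(tu, su)] for su in src)
                for su in src:
                    fraction = t[(tu, su)] / z
                    count[(tu, su)] += fraction
                    total[su] += fraction
        # M-step: renormalize the expected counts per source unit.
        t = defaultdict(
            float, {(tu, su): c / total[su] for (tu, su), c in count.items()}
        )
    return t

def best_target(t, src_unit):
    """Most probable target unit for a given source unit."""
    candidates = {tu: p for (tu, su), p in t.items() if su == src_unit}
    return max(candidates, key=candidates.get)

# Made-up toy pairs (not from the authors' training set).
toy_pairs = [
    (["ma", "ri"], ["玛", "丽"]),
    (["ma", "ya"], ["玛", "雅"]),
    (["ri", "ta"], ["丽", "塔"]),
]
table = train_ibm1(toy_pairs)
print({su: best_target(table, su) for su in ["ma", "ri", "ya", "ta"]})
```

Because "ma" co-occurs with 玛 in two pairs and "ri" with 丽 in two pairs, EM shifts probability mass toward those pairings, which in turn disambiguates "ya" and "ta".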