
FA XIAN Finding, Ascertaining, eXtracting, And Identifying Names Yuan Ding and John Blitzer


Presentation Transcript


  1. FA XIAN: Finding, Ascertaining, eXtracting, And Identifying Names • Yuan Ding and John Blitzer

  2. Preliminaries • 250K of hand-annotated Chinese newswire • 2586 person names (1422 unique) • BBN IdentiFinder: trained on 650K of English text • NYU MENE: trained on 270K of English text

  3. Modeling Overview • SVM (Support Vector Machine) • Common Feature Set • Common Training/Evaluation Data

  4. Features • Character-based feature set • Chinese character (GB2312), English letter • C_{i-2} C_{i-1} C_i C_{i+1} C_{i+2} • Word boundary • Begin_of_Word (BOW), End_of_Word (EOW) • Previous tag (B, I, O) • [O] Outside PN, [I] Inside PN, [B] Beginning of PN (PN = person name)
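The feature set above can be sketched in a few lines of Python. This is only an illustration under assumptions: the names `PAD`, `word_boundaries`, and `char_features` are hypothetical, the `M` label for word-internal characters is not from the slides, and the slides do not say how positions outside the sentence are handled.

```python
from typing import Dict, List

PAD = "<PAD>"  # hypothetical symbol for window positions outside the sentence


def word_boundaries(words: List[str]) -> List[str]:
    """One label per character: B (begin of word), E (end of word),
    BE (single-character word). "M" for word-internal characters is an
    assumption; the slides only show words of one or two characters."""
    labels: List[str] = []
    for w in words:
        if len(w) == 1:
            labels.append("BE")
        else:
            labels.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])
    return labels


def char_features(words: List[str], i: int, prev_tag: str, window: int = 2) -> Dict[str, object]:
    """Features for the i-th character of a segmented sentence."""
    chars = [c for w in words for c in w]
    bounds = word_boundaries(words)
    feats: Dict[str, object] = {}
    # Character window C_{i-window} .. C_{i+window}
    for off in range(-window, window + 1):
        j = i + off
        feats[f"C{off:+d}"] = chars[j] if 0 <= j < len(chars) else PAD
    # Word-boundary flags
    feats["BOW"] = bounds[i] in ("B", "BE")
    feats["EOW"] = bounds[i] in ("E", "BE")
    # Tag assigned to the previous character (B, I, or O)
    feats["prev_tag"] = prev_tag
    return feats


words = ["据说", "约翰", "在", "宾大", "上学", "。"]
feats = char_features(words, 2, prev_tag="O")  # features for 约
```

Each entry would then be one-hot encoded into binary features before being passed to the classifier.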

  5. Data • Characters: 392795(training) + 34740(testing) • Unique Characters: 3252 • 行销 全球 的 <ENAMEX TYPE="ORGANIZATION:corporation">宜进 实业</ENAMEX> <ENAMEX TYPE="PER_DESC">董事长</ENAMEX> <ENAMEX TYPE="PERSON">詹正田</ENAMEX>忿忿 指陈 : 原先 <NUMEX TYPE="QUANTITY:weight">一 公斤</NUMEX> <NUMEX TYPE="MONEY">一块八 美金</NUMEX> 的 加工 丝 , 目前 滑落 至 <NUMEX TYPE="MONEY">一块三</NUMEX>

  6. Example • ju shuo yuehan zai bin da shangxue • 据说 约翰 在 宾大 上学。 ("It is said that John studies at Penn.")
     Character (C0 C1 C2 C3 C4 ...): 据 说 约 翰 在 宾 大 上 学 。
     Tag: O O B I O O O O O O
     Word boundary: B E B E BE B E B E BE
     Previous tag: O O O B I O O O O O
     Previous char: (none) 据 说 约 翰 在 宾 大 上 学
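The tag rows in this example can be derived mechanically from the sentence and the position of the name. The sketch below assumes names are given as character spans with an exclusive end index; `bio_tags` and `shifted` are hypothetical helper names, and using "O" as the previous tag of the first character follows the row shown on the slide.

```python
from typing import List, Tuple


def bio_tags(n_chars: int, name_spans: List[Tuple[int, int]]) -> List[str]:
    """Per-character tags: B at the start of a person name, I inside it, O elsewhere."""
    tags = ["O"] * n_chars
    for start, end in name_spans:  # end is exclusive
        tags[start] = "B"
        for j in range(start + 1, end):
            tags[j] = "I"
    return tags


def shifted(seq: List[str], first: str) -> List[str]:
    """Shift a row one position to the right, e.g. to get the 'previous tag' row."""
    return [first] + seq[:-1]


sentence = "据说约翰在宾大上学。"
tags = bio_tags(len(sentence), [(2, 4)])  # 约翰 occupies characters 2 and 3
prev_tags = shifted(tags, "O")
prev_chars = shifted(list(sentence), "")
```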

  7. Model 1: SVM (Support Vector Machine) • Search for the optimal separating hyperplane, i.e. the one that maximizes the margin [V. Vapnik 1995]

  8. SVM - Properties • Two strong properties • – High generalization performance independent of feature dimension • – Training with combinations of multiple features by using a kernel function • Maximal margin strategy • Separate positive and negative (binary) examples with a linear hyperplane: w · x + b = 0, where w, x ∈ R^n and b ∈ R • Find the optimal hyperplane (parameters w, b) with the maximal margin
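For reference, the maximal-margin strategy is usually written as the following optimization problem. This is the standard hard-margin form; the slides do not say whether a hard- or soft-margin variant was used, so treat it as background rather than the exact setup. Here the labels y_i ∈ {−1, +1} and the number of training examples N are notation introduced for this sketch.

```latex
\min_{w,\,b}\ \tfrac{1}{2}\,\lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i\,(w \cdot x_i + b) \ge 1, \qquad i = 1, \dots, N
```

Under these constraints the margin between the two classes is 2 / ‖w‖, so minimizing ‖w‖ maximizes the margin.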

  9. SVM – Kernel Function • Kernel function • K(x, y) = Φ(x) · Φ(y) • x, y are vectors in input space • Φ(x), Φ(y) are vectors in feature space • dim(feature space) >> dim(input space) • No need to compute Φ(x) explicitly • d-th degree polynomial kernel • K(x_i, x_j) = (x_i · x_j + 1)^d • Considers combinations of up to d features
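The claim that Φ(x) never has to be computed can be checked numerically for d = 2. The sketch below uses the textbook explicit feature map for the degree-2 polynomial kernel; the function names and the example vectors are illustrative and not taken from the slides.

```python
import itertools
import math
from typing import List, Sequence


def poly_kernel(x: Sequence[float], y: Sequence[float], d: int) -> float:
    """K(x, y) = (x · y + 1)^d, computed entirely in input space."""
    return (sum(a * b for a, b in zip(x, y)) + 1) ** d


def phi_degree2(x: Sequence[float]) -> List[float]:
    """Explicit feature map whose dot product equals (x · y + 1)^2."""
    n = len(x)
    feats = [1.0]                                  # constant term
    feats += [math.sqrt(2) * xi for xi in x]       # single features
    feats += [xi * xi for xi in x]                 # squared features
    feats += [math.sqrt(2) * x[i] * x[j]           # pairs of features
              for i, j in itertools.combinations(range(n), 2)]
    return feats


x = [1, 0, 1, 1]  # binary feature vectors, as in the one-hot setting
y = [1, 1, 0, 1]
implicit = poly_kernel(x, y, 2)
explicit = sum(a * b for a, b in zip(phi_degree2(x), phi_degree2(y)))
```

With four input dimensions the explicit map already has 15 dimensions; with roughly 16000 binary inputs it would have on the order of 10^8, which is why the kernel form matters.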

  10. SVM – Polynomial Kernel • So, the larger the d, the better? Not necessarily! • Larger d • Virtually considers all d-grams • Higher precision, lower recall • Risks overfitting • Smaller d • The trained model is more general

  11. Evaluation Setup • Features • [ES1]: { C_{i-1} C_i C_{i+1} } + { BOW EOW } • [ES2]: [ES1] + { Prev_Tag } • [ES3]: C_{i-2} C_{i-1} C_i C_{i+1} C_{i+2} (C_{i+2} only for SVM) • Estimated size of input space: 5 × 3200 = 16000 binary features (5 window positions × about 3200 unique characters)

  12. Target of Decision • SVM model • Binary classifier • I and O only • Any word that contains an “I” is treated as a person name.

  13. Experiment Setup
      Pinyin: ju-shuo | yue-han | zai | bin-da | shang-xue
      Chars:  据 说 约 翰 在 宾 大 上 学
      Gold:   O O B I O O O O O
      Model1: O B I I I O O O O
      Model2: O O O O O O O O O
      • Use perfect word boundary in evaluation • Model1: positive<+3> gold<+1> true positive<+1> • Model2: positive<+0> gold<+1> true positive<+0>
      • Use automatic segmenter • Model1: positive<+1> gold<+1> true positive<+0> • Model2: positive<+0> gold<+1> true positive<+0>
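The counts on this slide can be reproduced with a short sketch. Two assumptions are made here that the slides do not spell out: B and I are both treated as "inside a name" (the classifier is binary), and in the automatic-segmenter case a predicted name is taken to be a maximal run of non-O tags that must match the gold span exactly. All function names are hypothetical.

```python
from typing import List, Tuple


def word_level_counts(words: List[str], gold: List[str], pred: List[str]) -> Tuple[int, int, int]:
    """Perfect word boundaries: a word counts as a name if any of its characters is not O."""
    positive = gold_n = true_pos = 0
    pos = 0
    for w in words:
        span = slice(pos, pos + len(w))
        is_pred = any(t != "O" for t in pred[span])
        is_gold = any(t != "O" for t in gold[span])
        positive += is_pred
        gold_n += is_gold
        true_pos += is_pred and is_gold
        pos += len(w)
    return positive, gold_n, true_pos


def spans(tags: List[str]) -> List[Tuple[int, int]]:
    """Maximal runs of non-O tags as (start, end) pairs, end exclusive."""
    out, start = [], None
    for i, t in enumerate(tags + ["O"]):  # sentinel closes a run at the end
        if t != "O" and start is None:
            start = i
        elif t == "O" and start is not None:
            out.append((start, i))
            start = None
    return out


def span_level_counts(gold: List[str], pred: List[str]) -> Tuple[int, int, int]:
    """No gold word boundaries: a prediction is correct only on an exact span match."""
    g, p = set(spans(gold)), set(spans(pred))
    return len(p), len(g), len(p & g)


def f_score(positive: int, gold_n: int, true_pos: int) -> float:
    """Harmonic mean of precision and recall; 0.0 when undefined."""
    precision = true_pos / positive if positive else 0.0
    recall = true_pos / gold_n if gold_n else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0


words = ["据说", "约翰", "在", "宾大", "上学"]
gold = list("OOBIOOOOO")
model1 = list("OBIIIOOOO")
model2 = list("OOOOOOOOO")
```

For example, `word_level_counts(words, gold, model1)` gives the three counts for Model1 under perfect word boundaries, and `f_score(*word_level_counts(words, gold, model1))` turns them into an F-score.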

  14. Results • Current English Person Name Finder: F-score around 75%

  15. Future Work • Dynamic decoding • Integrate with word segmenter • Other named entities

  16. Thank you! • Special thanks to: BBN • M. Palmer • E. Loper • S. Kulick • T. Morton • T. Joachims (author of SVMLight)
