FA XIAN: Finding, Ascertaining, eXtracting, And Identifying Names
Yuan Ding and John Blitzer

Preliminaries
  • 250K of hand-annotated Chinese newswire
    • 2586 Person Names
    • 1422 Unique
  • BBN IdentiFinder
    • Trained on 650K of English Text
  • NYU MENE
    • Trained on 270K of English Text
Modeling Overview
  • SVM (Support Vector Machine)
  • Common Feature Set
  • Common Training/Evaluation Data
Features
  • Character based feature set
    • Chinese character (GB2312), English letter
    • Ci-2 Ci-1 Ci Ci+1 Ci+2
  • Word boundary
    • Begin_of_Word (BOW), End_of_Word (EOW)
  • Previous tag (B,I,O)
    • [O] outside a person name (PN), [I] inside a PN, [B] beginning of a PN
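
The feature set above could be assembled per character roughly as follows. This is only a sketch: the helper name `char_features`, the `<PAD>` symbol for positions outside the sentence, and the dictionary layout are assumptions made for illustration, not the authors' implementation.

```python
def char_features(chars, boundaries, prev_tag, i, window=2):
    """Build a feature dict for the character at position i (hypothetical helper)."""
    feats = {}
    for off in range(-window, window + 1):
        j = i + off
        # Positions outside the sentence get a padding symbol (an assumption).
        feats[f"C{off:+d}"] = chars[j] if 0 <= j < len(chars) else "<PAD>"
    # Word-boundary features: a one-character word is both BOW and EOW ("BE").
    feats["BOW"] = "B" in boundaries[i]
    feats["EOW"] = "E" in boundaries[i]
    # Tag assigned to the previous character: B, I, or O.
    feats["PREV_TAG"] = prev_tag
    return feats


chars = list("据说约翰在宾大上学。")
boundaries = ["B", "E", "B", "E", "BE", "B", "E", "B", "E", "BE"]
f = char_features(chars, boundaries, prev_tag="O", i=2)
```

Here `f` describes the character 约: its two neighbours on each side, the fact that it starts a word, and the tag of the character before it.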
Data
  • Characters: 392,795 (training) + 34,740 (testing)
  • Unique characters: 3,252
  • Annotated sample:
    • 行销 全球 的 <ENAMEX TYPE="ORGANIZATION:corporation">宜进 实业</ENAMEX> <ENAMEX TYPE="PER_DESC">董事长</ENAMEX> <ENAMEX TYPE="PERSON">詹正田</ENAMEX>忿忿 指陈 : 原先 <NUMEX TYPE="QUANTITY:weight">一 公斤</NUMEX> <NUMEX TYPE="MONEY">一块八 美金</NUMEX> 的 加工 丝 , 目前 滑落 至 <NUMEX TYPE="MONEY">一块三</NUMEX>
Example

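  • 据说 约翰 在 宾大 上学。
    • Pinyin: ju shuo yuehan zai bin da shangxue
    • Gloss: "It is said that John studies at Penn."
  • Per-character rows (C0 C1 C2 C3 C4 ……), one entry per character:
    • Characters: 据 说 约 翰 在 宾 大 上 学 。
    • Tag: O O B I O O O O O O
    • Word boundary: B E B E BE B E B E BE
    • Previous tag: O O O B I O O O O O
    • Previous char: (none) 据 说 约 翰 在 宾 大 上 学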
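
The word-boundary and previous-tag rows in this example can be derived mechanically from the segmented sentence and the tag sequence. The sketch below assumes a segmentation given as a list of words and marks middle characters of longer words with "-", a case the slide does not show. The function names are hypothetical.

```python
def boundary_labels(words):
    """Label each character B (word start), E (word end), or BE (one-character word)."""
    labels = []
    for w in words:
        if len(w) == 1:
            labels.append("BE")
        else:
            # "-" for word-internal characters is an assumption.
            labels.extend(["B"] + ["-"] * (len(w) - 2) + ["E"])
    return labels


def shift_right(seq, pad):
    """Row of 'previous' values: position i holds the value from position i-1."""
    return [pad] + list(seq[:-1])


words = ["据说", "约翰", "在", "宾大", "上学", "。"]
tags = ["O", "O", "B", "I", "O", "O", "O", "O", "O", "O"]

boundary_row = boundary_labels(words)
prev_tag_row = shift_right(tags, "O")
prev_char_row = shift_right("".join(words), None)
```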
Model 1: SVM (Support Vector Machine)
  • Search for the optimal separating hyperplane, i.e. the one that maximizes the margin [V. Vapnik 1995]
SVM - Properties
  • Two strong properties
    • High generalization performance independent of feature dimension
    • Training with combinations of multiple features by using a kernel function
  • Maximal Margin Strategy
    • Separate positive and negative (binary) examples with a linear hyperplane: w · x + b = 0, where w, x ∈ R^n and b ∈ R
    • Find the optimal hyperplane (parameters w, b) with the maximal margin
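
To make the maximal-margin idea concrete, the sketch below compares two hand-picked hyperplanes on a toy 2-D data set and measures the geometric margin of each. It does not train an SVM. The data points and both hyperplanes are invented for illustration.

```python
import math


def decision(w, b, x):
    """Signed score w . x + b; its sign gives the predicted class."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def geometric_margin(w, b, points, labels):
    """Smallest signed distance from any example to the hyperplane.

    Positive only if every example lies on its correct side.
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(y * decision(w, b, x) / norm for x, y in zip(points, labels))


# Toy data: two positive and two negative examples (made up).
points = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0)]
labels = [1, 1, -1, -1]

m1 = geometric_margin((1.0, 1.0), -2.0, points, labels)  # hyperplane x1 + x2 = 2
m2 = geometric_margin((1.0, 0.0), -1.0, points, labels)  # hyperplane x1 = 1
```

Both hyperplanes separate the data, but the first leaves a wider margin, so a maximal-margin learner would prefer it over the second.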
SVM – Kernel Function
  • Kernel Function
    • K(x,y) = (x) • (y)
      • x,y are vectors in input space
      • (x), (y) are vectors in feature space
      • d (feature space) >> d (input space)
    • No need to compute (x) explicitly
  • d-th polynomial kernel
    • K(xi, xj) = (xi * xj + 1)^d
    • considering combinations of up to d features
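
A small numeric check may help show why the feature map never has to be computed. For 2-D inputs and d = 2, the polynomial kernel equals an ordinary dot product in a 6-dimensional feature space whose last coordinate is the product (conjunction) of the two input features. The explicit map `phi2` below is the standard expansion for this special case, and the input vectors are made up.

```python
import math


def poly_kernel(x, y, d):
    """Polynomial kernel K(x, y) = (x . y + 1)^d, computed in input space."""
    return (sum(a * b for a, b in zip(x, y)) + 1) ** d


def phi2(x):
    """Explicit feature map matching poly_kernel with d = 2 for 2-D inputs."""
    x1, x2 = x
    r2 = math.sqrt(2.0)
    return (1.0, r2 * x1, r2 * x2, x1 * x1, x2 * x2, r2 * x1 * x2)


x, y = (1.0, 0.0), (1.0, 1.0)  # two binary feature vectors
k = poly_kernel(x, y, 2)
explicit = sum(a * b for a, b in zip(phi2(x), phi2(y)))
```

`k` and `explicit` agree up to floating-point rounding, yet the kernel only touched the 2-D inputs.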
SVM – Polynomial Kernel
  • So, the larger the d, the better? – Not necessarily!
  • Larger d
    • Effectively considers all d-grams
    • Higher precision, lower recall
    • Potentially equivalent to overfitting
  • Smaller d
    • Model trained is more general
Evaluation Setup
  • Features
    • [ES1]: { Ci-1 Ci Ci+1 } + { BOW EOW }
    • [ES2]: [ES1] + { Prev_Tag }
    • [ES3]: Ci-2 Ci-1 Ci Ci+1 Ci+2 (Ci+2 only for SVM)
    • Estimated input space: 5 window positions * 3200 characters = 16000 binary features
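
One way to picture the 16000-dimensional estimate is as five blocks of 3200 binary features, one block per window position, with at most one active feature per block. The indexing scheme below (`slot * vocab_size + char_id`) and the toy character ids are assumptions for illustration, not the encoding actually used.

```python
VOCAB_SIZE = 3200  # rounded number of unique characters, as on the slide
N_SLOTS = 5        # window positions Ci-2 .. Ci+2
TOTAL_FEATURES = N_SLOTS * VOCAB_SIZE


def active_indices(window_chars, char_to_id, vocab_size):
    """Indices of the binary features that are switched on for one window."""
    indices = []
    for slot, ch in enumerate(window_chars):
        char_id = char_to_id.get(ch)
        if char_id is not None:  # unknown or padding characters activate nothing
            indices.append(slot * vocab_size + char_id)
    return indices


# Toy character ids (hypothetical).
char_to_id = {"据": 0, "说": 1, "约": 2, "翰": 3, "在": 4}
idx = active_indices(["据", "说", "约", "翰", "在"], char_to_id, VOCAB_SIZE)
```

Each window therefore yields a very sparse vector: at most 5 of the 16000 entries are nonzero.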
Target of Decision
  • SVM Model
    • Binary classifier
    • I and O only
    • Any word that contains an “I” is treated as a person name.
Experiment Setup

ju-shuo | yue-han | zai | bin-da | shang-xue

  • Characters: 据 说 约 翰 在 宾 大 上 学
  • Gold: O O B I O O O O O
  • Model1: O B I I I O O O O
  • Model2: O O O O O O O O O
  • Use perfect word boundary in evaluation
    • Model1: positive +3, gold +1, true positive +1
    • Model2: positive +0, gold +1, true positive +0
  • Use automatic segmenter
    • Model1: positive +1, gold +1, true positive +0
    • Model2: positive +0, gold +1, true positive +0
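
The perfect-word-boundary counts can be reproduced with a short sketch. It assumes that any non-O tag (B or I) marks a character as part of a name, that a word is positive when it contains at least one such character, and that a true positive is a word flagged in both the prediction and the gold standard. The automatic-segmenter case is not modelled here, and the function names are hypothetical.

```python
def word_is_name(tags):
    """A word counts as a person name if any of its characters is tagged non-O."""
    return any(t != "O" for t in tags)


def count_words(word_lengths, pred_tags, gold_tags):
    """Return (positive, gold, true_positive) counted over words."""
    positive = gold = true_positive = 0
    start = 0
    for n in word_lengths:
        p = word_is_name(pred_tags[start:start + n])
        g = word_is_name(gold_tags[start:start + n])
        positive += p
        gold += g
        true_positive += p and g
        start += n
    return positive, gold, true_positive


# 据说 | 约翰 | 在 | 宾大 | 上学
word_lengths = [2, 2, 1, 2, 2]
gold_tags = "O O B I O O O O O".split()
model1 = "O B I I I O O O O".split()
model2 = "O O O O O O O O O".split()

counts1 = count_words(word_lengths, model1, gold_tags)
counts2 = count_words(word_lengths, model2, gold_tags)
```

Model1 flags three words (据说, 约翰, 在), of which only 约翰 is a gold name, which matches the +3 / +1 / +1 increments listed for the perfect-word-boundary setting.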

Results

Current English Person Name Finder: F-score around 75%

Future Work
  • Dynamic decoding
  • Integrate with word segmenter
  • Other named entities
Thank you!

Special thanks to:
  • BBN
  • M. Palmer
  • E. Loper
  • S. Kulick
  • T. Morton
  • T. Joachims (Author of SVMLight)