Information Extraction: What Has Worked, What Hasn't, and What Has Promise for the Future
Ralph Weischedel
BBN Technologies
7 November 2000

Outline
Information extraction tasks & past performance
Two approaches
Learning to extract relations
Our view of the future
Named Entity (NE)
Names only of persons, organizations, locations
Template Element (TE)
All names, a description (if any) and type
of organizations and persons;
name and type of a named location
Template Relations (TR)
Who works for what organization;
Where an organization is located;
What an organization produces
Scenario Template (ST)
Georgian leader Eduard Shevardnadze suffered nothing worse than cuts and bruises when a bomb exploded yesterday near the parliament building. Officials investigating the bombing said they are blaming a group of people with plans of the parliament building.
Best Performance in Scenario Template (chart, by year)
The delegation, which included the commander of the U.N. troops in Bosnia, Lt. Gen. Sir Michael Rose, went to the Serb stronghold of Pale, near Sarajevo, for talks with Bosnian Serb leader Radovan Karadzic.
Named Entity
all names of an organization/person/location,
one description of the organization/person, and
a classification for the organization/person/location.
“...according to the report by Edwin Dorn, under secretary of defense for personnel and readiness. … Dorn's conclusion that Washington…”
ENT_NAME: "Edwin Dorn"
ENT_DESCRIPTOR: "under secretary of defense for personnel and readiness"
Determine who works for what organization,
where an organization is located,
what an organization produces.
“Donald M. Goldstein, a historian at the University of Pittsburgh who helped write…”
ENT_NAME: "University of Pittsburgh"
ENT_NAME: "Donald M. Goldstein"
ENT_DESCRIPTOR: "a historian at the University of Pittsburgh"
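The template element and template relation examples above can be pictured as small records. The following is a minimal sketch, not the MUC output format: the class names `Entity` and `Relation`, the type strings, and the relation label `EMPLOYEE_OF` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Entity:
    """One template element: a type, its names, and an optional descriptor."""
    ent_type: str                                  # e.g. "PERSON", "ORGANIZATION"
    names: List[str] = field(default_factory=list)
    descriptor: Optional[str] = None


@dataclass
class Relation:
    """One template relation linking two entities (label is hypothetical)."""
    rel_type: str
    arg1: Entity
    arg2: Entity


goldstein = Entity("PERSON", ["Donald M. Goldstein"],
                   "a historian at the University of Pittsburgh")
pitt = Entity("ORGANIZATION", ["University of Pittsburgh"])
works_for = Relation("EMPLOYEE_OF", goldstein, pitt)

print(f"{works_for.arg1.names[0]} --{works_for.rel_type}--> {works_for.arg2.names[0]}")
```

Keeping the relation as a pair of references to entity records, rather than as strings, mirrors the way the TR task builds on the TE output.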
Performance in MUC/Broadcast News
The delegation, which included the commander of the U.N. troops in Bosnia, Lt. Gen. Sir Michael Rose, went to the Serb stronghold of Pale, near Sarajevo, for talks with Bosnian Serb leader Radovan Karadzic.
Identify every name mention of
locations, persons, and organizations.
The delegation, which included the commander of the U.N. troops in Bosnia, Lt. Gen. Sir Michael Rose, went to the Serb stronghold of Pale, near Sarajevo, for talks with Bosnian Serb leader Radovan Karadzic.
Lexical Pattern Matcher
Output
Traditional (Rule-Based) Architecture
Morphological analysis may determine part of speech
Lots of manually constructed patterns
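To make "manually constructed patterns" concrete, here is a rough sketch of a lexical pattern matcher applied to the example sentence. The two patterns and the name `find_names` are invented for illustration and are not taken from any actual rule-based system; they only show the flavor of the approach.

```python
import re

SENTENCE = (
    "The delegation, which included the commander of the U.N. troops in "
    "Bosnia, Lt. Gen. Sir Michael Rose, went to the Serb stronghold of Pale, "
    "near Sarajevo, for talks with Bosnian Serb leader Radovan Karadzic."
)

# Hypothetical hand-written rules: a trigger phrase followed by capitalized words.
PATTERNS = [
    ("PERSON", re.compile(r"\b(?:Lt\. Gen\. |Sir |leader )+((?:[A-Z][a-z]+ ?)+)")),
    ("LOCATION", re.compile(r"\b(?:stronghold of|near|in) ([A-Z][a-z]+)")),
]


def find_names(text):
    """Return (label, name) pairs for every pattern match, rule by rule."""
    found = []
    for label, pattern in PATTERNS:
        for match in pattern.finditer(text):
            found.append((label, match.group(1).strip()))
    return found


names = find_names(SENTENCE)
for label, name in names:
    print(label, name)
```

These two rules find the persons and locations in this sentence but miss the organization "U.N.", which hints at why such systems accumulate many patterns.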
Structure of IdentiFinder’s Model
HUB 4 98
SER(speech) ≈ SER(text) + WER
Part of Speech
Named Entity Extraction
Sentence-Level Pattern Matcher
Coref, Merging, & Inference
Output
Traditional (Rule-Based) Architecture (beyond names)
BBN Architecture in MUC-6 (1995)
Determining which person holds what office in what organization
Determining where an organization is located
A TREEBANK Skeletal Parse
Nance , who is also a paid consultant to ABC News , said ...
Semantic training data consists ONLY of
...Augmented Semantic Tree
Head category: P(c_h | c_p), e.g.
P(vp | s)
Left modifier categories: P_L(c_m | c_p, c_hp, c_{m-1}, w_p), e.g.
P_L(per/np | s, vp, null, said)
Right modifier categories: P_R(c_m | c_p, c_hp, c_{m-1}, w_p), e.g.
P_R(emp-of/pp-lnk | per-desc-r/np, per-desc/np, null, consultant)
Head part-of-speech: P(t_m | c_m, t_h, w_h), e.g.
P(per/nnp | per/np, vbd, said)
Head word: P(w_m | c_m, t_m, t_h, w_h), e.g.
P(nance | per/np, per/nnp, vbd, said)
Head word features: P(f_m | c_m, t_m, t_h, w_h, known(w_m)), e.g.
P(cap | per/np, per/nnp, vbd, said, true)
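The category labels in the examples above, such as per/np, pair a semantic label with a syntactic one. A small sketch of how such an augmented label might be taken apart follows; the function name `split_label` is hypothetical, and treating the part before the slash as the semantic label is an assumption read off the examples.

```python
from typing import Optional, Tuple


def split_label(label: str) -> Tuple[Optional[str], str]:
    """Split 'per/np' into ('per', 'np'); a bare 'vp' gives (None, 'vp')."""
    semantic, sep, syntactic = label.rpartition("/")
    return (semantic if sep else None, syntactic)


for label in ["per/np", "emp-of/pp-lnk", "per/nnp", "vp"]:
    print(label, "->", split_label(label))
```

Nodes without a slash are purely syntactic, so the same tree can carry both kinds of information without a separate semantic layer.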
(1) Max [ product over tree nodes [ P(node | history) ] ]
Lexicalized Probabilistic CFG Model
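The maximization in (1) can be illustrated with a toy scorer that sums log probabilities over the generation events of each candidate tree and keeps the best one. This is only a sketch: the probability values are invented, the names `tree_log_prob` and `best_tree` are hypothetical, and a real parser would presumably search the space of trees rather than enumerate candidates.

```python
import math
from typing import Dict, List, Tuple

# An event is (outcome, history); the table maps events to probabilities.
Event = Tuple[str, Tuple[str, ...]]


def tree_log_prob(events: List[Event], table: Dict[Event, float]) -> float:
    """Sum of log P(node | history) over all generation events of one tree."""
    return sum(math.log(table[event]) for event in events)


def best_tree(candidates: Dict[str, List[Event]], table: Dict[Event, float]) -> str:
    """Return the name of the candidate tree with the highest probability."""
    return max(candidates, key=lambda name: tree_log_prob(candidates[name], table))


# Toy probabilities, invented for illustration only.
table = {
    ("vp", ("s",)): 0.9,
    ("per/np", ("s", "vp", "null", "said")): 0.2,
    ("np", ("s", "vp", "null", "said")): 0.1,
}
candidates = {
    "with_person_label": [("vp", ("s",)), ("per/np", ("s", "vp", "null", "said"))],
    "plain_np": [("vp", ("s",)), ("np", ("s", "vp", "null", "said"))],
}
print(best_tree(candidates, table))
```

Working in log space avoids numerical underflow when a tree has many nodes.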
(1) the head node is generated
(2) premodifier nodes, if any, are generated
(3) postmodifier nodes, if any, are generated
Tree Generation Example
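The three generation steps above can be traced with a short recursive sketch. Two details are assumptions not stated on the slide: premodifiers are generated from the head outward, and a node's children are expanded only after all of them have been generated. The class `Node` and function `generation_order` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    label: str
    head: Optional["Node"] = None
    pre: List["Node"] = field(default_factory=list)   # left of head, surface order
    post: List["Node"] = field(default_factory=list)  # right of head, surface order


def generation_order(node: Node) -> List[str]:
    """List generation events for the subtree rooted at `node`, top down."""
    if node.head is None:  # leaf: nothing further to generate
        return []
    events = [f"head {node.head.label} of {node.label}"]
    for mod in reversed(node.pre):  # assumption: head outward
        events.append(f"premodifier {mod.label} of {node.label}")
    for mod in node.post:
        events.append(f"postmodifier {mod.label} of {node.label}")
    for child in [node.head, *node.pre, *node.post]:
        events.extend(generation_order(child))
    return events


s = Node("s",
         head=Node("vp", head=Node("vbd")),
         pre=[Node("per/np", head=Node("per/nnp"))])
for event in generation_order(s):
    print(event)
```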
How can the (limited) semantic interpretation required for MUC be integrated with parsing?
Integrate syntax and semantics by training on and then generating parse trees augmented with semantic labels
Since the source documents were NYT newswire, rather than WSJ, would we need to treebank NYT data?
First train the parser on WSJ.
Then constrain the parser on NYT to produce trees consistent with the semantic annotation.
Retrain the parser to produce augmented syntax/semantics trees on the NYT data.
Must computational linguists be the semantic annotators?
No, college students from various majors are sufficient.
Issues and Answers
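The second training step above, constraining the parser to produce trees consistent with the semantic annotation, can be sketched as a crossing-bracket check. Reading "consistent" as "no parse constituent crosses an annotated span" is an assumption, as are the function names and the token offsets used in the example.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # token offsets, end exclusive


def crosses(a: Span, b: Span) -> bool:
    """True if the spans overlap without one containing the other."""
    return a[0] < b[0] < a[1] < b[1] or b[0] < a[0] < b[1] < a[1]


def consistent(constituents: List[Span], annotations: List[Span]) -> bool:
    """Accept a parse only if no constituent crosses an annotated span."""
    return not any(crosses(c, a) for c in constituents for a in annotations)


# Tokens: Nance , who is also a paid consultant to ABC News , said ...
#         0     1 2   3  4    5 6    7          8  9   10   11 12   13
annotations = [(9, 11), (5, 11)]          # "ABC News"; "a paid consultant to ABC News"
good_parse = [(5, 11), (8, 11), (9, 11)]  # includes the PP "to ABC News"
bad_parse = [(5, 11), (7, 10)]            # "consultant to ABC" splits "ABC News"

print(consistent(good_parse, annotations))
print(consistent(bad_parse, annotations))
```

A filter like this lets the WSJ-trained parser propose syntax while the semantic annotation rules out trees that would break an annotated name or descriptor.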
Training