Information extraction transforms raw text into structured data by recognizing entities and relationships without needing to understand all of the content. By defining domain-specific templates, meaningful information can be extracted reliably with simple linguistic processing: identifying known types of entities, filling template slots with the extracted information, and analyzing event descriptions in context. With applications such as disaster reporting or monitoring corporate announcements, efficient pattern recognition is central to this kind of data processing and knowledge extraction.
Information Extraction
• Extract meaningful information from text
• Without fully understanding everything!
• Basic idea:
  • Define domain-specific templates
  • Simple and reliable linguistic processing
  • Recognize known types of entities and relations
  • Fill templates with recognized information
Example
4 Apr. Dallas – Early last evening, a tornado swept through northwest Dallas. The twister occurred without warning at about 7:15 pm and destroyed two mobile homes. The Texaco station at 102 Main St. was also severely damaged, but no injuries were reported.

Event: tornado
Date: 4/3/97
Time: 19:15
Location: "northwest Dallas" : Texas : USA
Damage: "mobile homes" (2), "Texaco station" (1)
Injuries: none
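In code, such a template is just a record whose slots get filled as entities are recognized. The following is a minimal sketch; the class name and slot types are illustrative and not taken from any particular MUC system.

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class TornadoTemplate:
    # Slots mirror the example above; names and types are illustrative.
    event: Optional[str] = None
    date: Optional[str] = None
    time: Optional[str] = None
    location: Optional[str] = None
    damage: list = field(default_factory=list)   # (object, count) pairs
    injuries: Optional[str] = None

filled = TornadoTemplate(event="tornado", date="4/3/97", time="19:15",
                         location="northwest Dallas",
                         damage=[("mobile homes", 2), ("Texaco station", 1)],
                         injuries="none")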
Processing Pipeline
Tokenization & Tagging → Sentence Analysis → Pattern Extraction → Merging → Template Generation

Input: 4 Apr. Dallas – Early last evening, a tornado swept through northwest Dallas. ...
• Tokenization & Tagging: Early/ADV last/ADJ evening/NN:time ,/, a/DT tornado/NN:weather swept/VBD ...
• Sentence Analysis: [Early last evening]:adv-phrase:time [a tornado]:noun-group:subject [swept]:verb-group ...
• Pattern Extraction: "tornado swept" → Event: tornado; "through northwest Dallas" → Loc: "northwest Dallas"; "causing extensive damage" → Damage
• Merging & Template Generation: Event: tornado, Date: 4/3/97, Time: 19:15, Location: "northwest Dallas" : Texas : USA, ...
MUC: Message Understanding Conference
• "Competitive" conference with predefined tasks for research groups to address
• Tasks (MUC-7):
  • Named Entities: Extract typed entities from text
  • Equivalence Classes: Solving coreference
  • Attributes: Fill in attributes of entities
  • Facts: Extract logical relations between entities
  • Events: Extract descriptions of events from text
Tokenization & Tagging
• Tokenization & POS tagging
• Also lexical semantic information, such as "time", "location", "weather", "person", etc.

Sentence Analysis
• Shallow parsing for phrase types
• Use tagging & semantics to tag phrases
• Note phrase heads
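A rough sketch of these two stages, using NLTK purely as an illustrative stand-in (MUC-era systems used their own taggers and chunkers); it assumes the 'punkt' and 'averaged_perceptron_tagger' data packages are installed.

import nltk  # assumes the 'punkt' and 'averaged_perceptron_tagger' data are installed

sentence = "Early last evening, a tornado swept through northwest Dallas."
tokens = nltk.word_tokenize(sentence)   # tokenization
tagged = nltk.pos_tag(tokens)           # POS tagging

# Shallow parsing: chunk noun groups and verb groups with a tiny grammar.
grammar = r"""
  N-GRP: {<DT>?<JJ>*<NN.*>+}
  V-GRP: {<VB.*>+}
"""
chunker = nltk.RegexpParser(grammar)
print(chunker.parse(tagged))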
Pattern Extraction
• Find domain-specific relations between text units
• Typically use lexical triggers and relation-specific patterns to recognize relations

Concept: Damaged-Object
Trigger: destroyed
Position: direct-object
Constraints: physical-thing

... and [destroyed] [two mobile homes]
→ Damaged-Object = "two mobile homes"
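A minimal sketch of applying such a pattern over shallow-parsed clauses; the clause representation and the toy semantic lexicon are assumptions made for illustration.

pattern = {"concept": "Damaged-Object", "trigger": "destroyed",
           "position": "direct-object", "constraint": "physical-thing"}

semantic_class = {"mobile homes": "physical-thing"}   # toy semantic lexicon

def apply_pattern(pattern, clauses):
    # clauses: output of shallow parsing, e.g.
    # {"verb": "destroyed", "direct-object": "two mobile homes", "do-head": "mobile homes"}
    for clause in clauses:
        filler = clause.get(pattern["position"])
        head = clause.get("do-head", filler)
        if (clause.get("verb") == pattern["trigger"] and filler
                and semantic_class.get(head) == pattern["constraint"]):
            yield pattern["concept"], filler

clauses = [{"verb": "destroyed", "direct-object": "two mobile homes",
            "do-head": "mobile homes"}]
print(list(apply_pattern(pattern, clauses)))   # [('Damaged-Object', 'two mobile homes')]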
Learning Extraction Patterns
• Very difficult to predefine extraction patterns
• Must be redone for each new domain
• Hence, corpus-based approaches are indicated
• Some methods:
  • AutoSlog (1992) – "syntactic" learning
  • PALKA (1995) – "conceptual" learning
  • CRYSTAL (1995) – covering algorithm
AutoSlog (Lehnert 1992)
• Patterns based on recognizing "concepts"
  • Concept: what concept to recognize
  • Trigger: a word indicating an occurrence
  • Position: what syntactic role the concept will take in the sentence
  • Constraints: what type of entity to allow
  • Enabling conditions: constraints on the linguistic context
Concept: Event-Time
Trigger: "at"
Position: prep-phrase-object
Constraints: time
Enabling conditions: post-verb

The twister occurred without warning at about 7:15 pm and destroyed two mobile homes.
→ Event-Time = 19:15
Learning Patterns
• Supervised: training data is text annotated with the target items to be extracted
• Knowledge: 13 general syntactic patterns
• Algorithm:
  • Find a sentence containing the target noun phrase ("two mobile homes")
  • Partially parse the sentence to find syntactic relations
  • Try all linguistic patterns to find a match
  • Generate a concept pattern from the match
Linguistic Patterns
• Identify domain-specific thematic roles based on syntactic structure
• Example pattern: active-voice-verb followed by target = direct object
  • Concept = target concept
  • Trigger = verb of active-voice-verb
  • Position = direct-object
  • Constraints = semantic-class of target
  • Enabling conditions = active-voice
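A rough sketch of how a matched syntactic pattern could be turned into a concept node, in the spirit of AutoSlog; the clause representation here is assumed, not AutoSlog's actual data structures.

def generate_concept_pattern(target_np, clause, concept, semantic_class):
    # Handles one of the general patterns: the target NP is the direct object
    # of an active-voice verb.
    if clause.get("voice") == "active" and clause.get("direct-object") == target_np:
        return {"concept": concept,
                "trigger": clause["verb"],
                "position": "direct-object",
                "constraints": semantic_class,
                "enabling-conditions": ["active-voice"]}
    return None

clause = {"verb": "destroyed", "voice": "active",
          "direct-object": "two mobile homes"}
print(generate_concept_pattern("two mobile homes", clause,
                               "Damaged-Object", "physical-thing"))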
More Examples
• <victim> was murdered
• <perpetrator> bombed
• <perpetrator> attempted to kill
• was aimed at <target>
• Some bad extraction patterns occur (e.g., "is" as a trigger)
• Human review process
CRYSTAL
• Complex syntactic patterns
• Uses a "covering" algorithm:
  • Generate the most specific possible pattern for each occurrence of a target in the corpus
  • Loop:
    • Find the most specific unifier of the two most similar patterns C and C', generating a new pattern P
    • If P has less than ε error on the corpus, replace C and C' with P
  • Continue until no new patterns can be added
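A highly simplified sketch of such a covering loop; the pattern representation, the unifier (here: keep only slot constraints the two patterns agree on), and the error threshold are assumptions, and the "most similar pair" ordering is omitted.

def unify(p1, p2):
    # Crude stand-in for the most specific common generalization of two patterns:
    # keep only the slot constraints on which they agree.
    return {k: v for k, v in p1.items() if p2.get(k) == v}

def covering(patterns, error_rate, eps=0.05):
    # patterns: list of dicts; error_rate(p): fraction of corpus matches of p
    # that are wrong (supplied by the surrounding system).
    patterns = list(patterns)
    while True:
        candidate = None
        for i in range(len(patterns)):
            for j in range(i + 1, len(patterns)):
                p = unify(patterns[i], patterns[j])
                if p and error_rate(p) < eps:
                    candidate = (i, j, p)
        if candidate is None:
            return patterns            # no new pattern can be added
        i, j, p = candidate
        # replace C and C' with the generalized pattern P
        patterns = [q for k, q in enumerate(patterns) if k not in (i, j)] + [p]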
Merging
Motor Vehicles International Corp. announced a major management shake-up ... MVI said the CEO has resigned ... The Big 10 auto maker is attempting to regain market share ... It will announce losses ... A company spokesman said they are moving their operations ... MVI, the first company to announce such a move since the passage of the new international trade agreement, is facing increasing demands from unionized workers ...
Coreference Resolution
• Many different kinds of linguistic phenomena:
  • Proper names
  • Aliases (MVI)
  • Definite NPs (the Big 10 auto maker)
  • Pronouns (it, they)
  • Appositives (, the first company to ...)
• Errors of previous phases may be amplified
Learning to Merge
• Treat coreference as a classification task: should this pair of entities be linked?
• Methodology:
  • Training corpus: manually link all coreferential expressions
  • Each possible pair of mentions is a training example – positive if the pair is linked, negative otherwise
  • Create a feature vector for each example
  • Use your favorite learning algorithm
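A minimal sketch of how pairwise training examples could be generated from manually linked mentions; the representation of mentions and chains is assumed.

from itertools import combinations

def make_training_pairs(mentions, chains):
    # mentions: mention ids in document order.
    # chains: list of sets of mention ids that were manually linked.
    chain_of = {m: i for i, chain in enumerate(chains) for m in chain}
    examples = []
    for a, b in combinations(mentions, 2):
        label = int(chain_of.get(a, -1) == chain_of.get(b, -2))   # 1 = coreferential
        examples.append(((a, b), label))   # feature extraction would replace (a, b)
    return examples

mentions = ["Motor Vehicles International Corp.", "MVI",
            "the Big 10 auto maker", "a company spokesman"]
chains = [{"Motor Vehicles International Corp.", "MVI", "the Big 10 auto maker"}]
print(make_training_pairs(mentions, chains))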
MLR (1995)
• 66 features were used, in 4 categories:
  • Lexical features of each phrase (e.g., do they overlap?)
  • Grammatical role of each phrase (e.g., subject, direct-object)
  • Semantic classes of each phrase (e.g., physical-thing, company)
  • Relative positions of the phrases (e.g., X one sentence after Y)
• Decision-tree learning (C4.5)
C4.5
• Incrementally build a decision tree from labeled training examples
• At each stage choose the "best" attribute to split the dataset
  • E.g., use information gain to compare features
• After building the complete tree, prune the leaves to prevent overfitting
  • Use statistical tests to determine whether enough examples fall in a leaf bin; if not – prune!
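A small sketch of the information-gain computation used to pick the splitting attribute; the example representation is assumed.

import math
from collections import Counter

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def info_gain(examples, attribute):
    # examples: list of (feature_dict, label) pairs.
    base = entropy([y for _, y in examples])
    by_value = {}
    for x, y in examples:
        by_value.setdefault(x[attribute], []).append(y)
    remainder = sum(len(ys) / len(examples) * entropy(ys)
                    for ys in by_value.values())
    return base - remainder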
[Diagram: example decision tree built by C4.5 from 40 training examples: the root splits on f1 (25 vs. 15 examples), the subtrees split on f2 (18 vs. 7) and f3 (2 vs. 13), and the leaves are labeled with classes C1 and C2]
RESOLVE (1995)
• C4.5 with 8 complex features:
  • NAME-{1,2}: does the reference include a name?
  • JV-CHILD-{1,2}: does the reference refer to part of a joint venture?
  • ALIAS: does one reference contain an alias for the other?
  • BOTH-JV-CHILD: do both refer to part of a joint venture?
  • COMMON-NP: do both contain a common NP?
  • SAME-SENTENCE: are both in the same sentence?
RESOLVE Results
• 50 texts, leave-1-out cross-validation
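A sketch of that evaluation setup, leaving one text's mention pairs out per fold; scikit-learn's CART tree stands in for C4.5 here, and X, y, and the per-text group ids are assumed to come from the feature extraction above.

from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier

def leave_one_text_out_accuracy(X, y, text_ids):
    # text_ids: one group label per example, identifying its source text.
    clf = DecisionTreeClassifier()
    return cross_val_score(clf, X, y, groups=text_ids,
                           cv=LeaveOneGroupOut()).mean()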
Full System: FASTUS (1996)
Input Text → Pattern Recognition → Partial Templates → Coreference Resolution → Template Merger → Output Template
Pattern Recognition
• Multiple passes of finite-state methods

John Smith, 47, was named president of ABC Corp.
[Diagram: the sentence analyzed in layers, from word-level tags (Pers-Name, Num, Aux, V, N, P, Org-Name) to phrase groups (V-Group, Poss-N-Group) to a Domain-Event]
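A toy sketch of cascaded passes, using regular expressions over a tag string as a stand-in for FASTUS's finite-state transducers; the tag names and patterns are illustrative.

import re

# Pass 1 output (assumed): each token replaced by a coarse tag.
tagged = "Pers-Name , Num , Aux V N P Org-Name ."

# Pass 2: group tags into larger units.
tagged = re.sub(r"Aux V", "V-Group", tagged)
tagged = re.sub(r"N P Org-Name", "Poss-N-Group", tagged)

# Pass 3: recognize a domain-level event over the grouped sequence.
if re.search(r"Pers-Name .* V-Group Poss-N-Group", tagged):
    print("Domain-Event: person named to position at organization")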
Partially-Instantiated Templates (Domain-Dependent!)
• Person: _______ | Pos: President | Org: ABC Corp.
• Person: John Smith | Pos: President | Org: ABC Corp. | Start: | End:
The Next Sentence...
He replaces Mike Jones.
Coreference analysis: He = John Smith

• Person: Mike Jones | Pos: ________ | Org: ________
• Person: John Smith | Pos: ________ | Org: ________ | Start: | End:
Unification
Unify the new templates with the preceding template(s), if possible...

• Person: Mike Jones | Pos: President | Org: ABC Corp.
• Person: John Smith | Pos: President | Org: ABC Corp. | Start: | End:
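A minimal sketch of slot-by-slot unification of two partial templates; the dict representation and the conflict rule are assumptions.

def unify_templates(t1, t2):
    # Merge two partial templates; return None on a slot conflict.
    # None or an empty string means the slot is unfilled.
    merged = {}
    for slot in set(t1) | set(t2):
        a, b = t1.get(slot), t2.get(slot)
        if a and b and a != b:
            return None                 # incompatible fillers: do not merge
        merged[slot] = a or b
    return merged

old = {"Person": "John Smith", "Pos": "President", "Org": "ABC Corp."}
new = {"Person": "John Smith", "Pos": None, "Org": None, "Start": None, "End": None}
print(unify_templates(old, new))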
Principle of Least Commitment
• Idea: Maintain options as long as possible
• E.g., parsing – maintain a lattice structure:

The committee heads announced that...
[Diagram: lattice over the sentence with tags DT NN1 NN2/VBZ VBD CSub; on one path "The committee heads" forms an N-GRP and the Event is Announce with Actor: Committee heads]
Principle of Least Commitment (cont.)
• The lattice keeps the alternative reading of "heads" available:

The committee heads ABC's recruitment effort.
[Diagram: same lattice; here "heads" is read as a verb (DT NN1 VBZ NNpos NN NN), giving two N-GRPs and an Event with Head: Committee, Effort: ABC's recruitment]
More Least Commitment
• Maintain multiple coreference hypotheses:
  • Disambiguate when creating domain-events
  • More information is available by then
• Too many possibilities?
  • Use a beam search algorithm: maintain the k "best" hypotheses at every stage
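A small sketch of that beam search; the extend and score functions (how a hypothesis is grown with a new mention and how it is ranked) are assumed to be supplied by the rest of the system.

import heapq

def beam_search(mentions, extend, score, k=3):
    # Keep the k best partial coreference hypotheses after each mention.
    beam = [()]                                   # start with one empty hypothesis
    for m in mentions:
        candidates = [h2 for h in beam for h2 in extend(h, m)]
        beam = heapq.nlargest(k, candidates, key=score)
    return max(beam, key=score)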