NYU: Description of the Proteus/PET System as Used for MUC-7 ST Roman Yangarber & Ralph Grishman Presented by Jinying Chen 10/04/2002
Outline • Introduction • Proteus IE System • PET User Interface • Performance on the Launch Scenario
Introduction • Problem: portability and customization of IE engines at the scenario level • To address this problem, NYU built a set of tools that allow the user to adapt the system to new scenarios rapidly through example-based learning • The present system operates on two tiers: Proteus & PET
Introduction (Cont.) • Proteus • Core extraction engine, an enhanced version of the one employed at MUC-6 • PET • GUI front end, through which the user interacts with Proteus • The user provides the system with examples of events in text and of the associated database entries to be created
Proteus IE System • Modular design • Control is encapsulated in immutable, domain-independent core components • Domain-specific information resides in the knowledge bases
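A minimal sketch of this engine/knowledge-base separation, assuming a simplified pipeline; the class and attribute names (ExtractionEngine, KnowledgeBases, etc.) are illustrative and not Proteus's actual API:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class KnowledgeBases:
    """Domain-specific data: swapped out when porting to a new scenario."""
    lexicon: Dict[str, List[str]]                  # token -> possible readings
    patterns: List[Callable[[list], list]]         # ordered pattern-matching passes
    inference_rules: List[Callable[[list], list]]  # discourse-level rules

class ExtractionEngine:
    """Domain-independent control flow; unchanged across scenarios."""
    def __init__(self, kbs: KnowledgeBases):
        self.kbs = kbs

    def run(self, tokens: List[str]) -> list:
        analysis = [(t, self.kbs.lexicon.get(t, [])) for t in tokens]  # lexical analysis
        for match_pass in self.kbs.patterns:   # name recognition, partial syntax, scenario patterns
            analysis = match_pass(analysis)
        for rule in self.kbs.inference_rules:  # discourse analysis
            analysis = rule(analysis)
        return analysis                        # LFs handed to output generation
```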
Proteus IE System (Cont.) • Lexical analysis module • Assign each token a reading or a list of alternative readings by consulting a set of on-line dictionaries • Name Recognition • Identify proper names in the text by using local contextual cues
Proteus IE System (Cont.) • Partial Syntax • Find small syntactic units, such as basic NPs and VPs • Mark each phrase with semantic information, e.g. the semantic class of the head of the phrase • Scenario Patterns • Find higher-level syntactic constructions using local semantic information: apposition, prepositional phrase attachment, limited conjunctions, and clausal constructions.
Proteus IE System (Cont.) • Note: • The three modules above are pattern-matching phases; they operate by deterministic, bottom-up, partial parsing or pattern matching. • The output is a sequence of logical forms (LFs) corresponding to the entities, relationships, and events encountered in the analysis.
Figure 2: LF for the NP: “a satellite built by Loral Corp. of New York for Intelsat”
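The figure itself is not reproduced in this copy of the slides. A rough, hypothetical rendering of such a logical form as a nested attribute-value structure might look like the following; the slot names and nesting are guesses for exposition, not Proteus's actual representation:

```python
# Hypothetical illustration of an LF for "a satellite built by Loral Corp. of New York for Intelsat".
lf = {
    "type": "entity",
    "class": "vehicle",          # the satellite (payload)
    "head": "satellite",
    "relations": {
        "built_by": {
            "type": "entity", "class": "company", "name": "Loral Corp.",
            "location": {"type": "entity", "class": "city", "name": "New York"},
        },
        "built_for": {"type": "entity", "class": "organization", "name": "Intelsat"},
    },
}
```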
Proteus IE System (Cont.) • Reference Resolution (RefRes) • Links anaphoric pronouns to their antecedents and merges other co-referring expressions • Discourse Analysis • Uses higher-level inference rules to build more complex event structures • E.g. a rule that merges a Mission entity with a corresponding Launch event. • Output Generation
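A schematic sketch of an inference rule of the kind described above, merging a Mission entity into a compatible Launch event; the event dictionaries and slot names (payload, date, site) are assumptions for illustration, not Proteus's actual schema:

```python
def merge_mission_into_launch(events):
    """Merge each Mission entity into a compatible Launch event (slot names are illustrative)."""
    launches = [e for e in events if e.get("type") == "launch"]
    missions = [e for e in events if e.get("type") == "mission"]
    for mission in missions:
        for launch in launches:
            # compatible: no slot where the two structures carry conflicting values
            if all(launch.get(s) is None or launch.get(s) == mission.get(s)
                   for s in ("payload", "date", "site")):
                for s in ("payload", "date", "site"):
                    if launch.get(s) is None and mission.get(s) is not None:
                        launch[s] = mission[s]   # copy information from the mission
                mission["merged_into"] = launch
                break
    return events
```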
PET User Interface • Provides a disciplined method for customizing the knowledge bases, and the pattern base in particular • Organization of Patterns • The pattern base is organized in layers • Proteus treats the patterns at the different levels differently • The most specific patterns are acquired directly from the user, on a per-scenario basis
[Figure: layers of the pattern base, from most specific (acquired from the user) to most general (core part of the system / pattern library):] • Scenario-specific, domain-dependent: clausal patterns that capture events • Domain-dependent: patterns that find relationships among entities, such as between persons and organizations • Domain-independent: patterns that perform partial syntactic analysis • Domain-independent: the most general patterns, capturing the most basic constructs such as proper names, temporal expressions, etc.
PET User Interface (Cont.) • Pattern Acquisition • Enter an example • Choose an event template • Apply existing patterns (step 3) • Tune pattern elements (step 4) • Fill event slots (step 5) • Build pattern • Syntactic generalization
[Screenshots of the PET interface illustrating steps 3-5: applying existing patterns, tuning pattern elements, and filling event slots]
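As a rough illustration of the final "syntactic generalization" step, a base clausal pattern learned from a user example might be expanded into syntactic variants as sketched below; the pattern notation (class-tagged slots) and the particular variants generated are simplified assumptions, not the actual PET implementation:

```python
base_pattern = ["np(company)", "vg(launch)", "np(payload)"]   # from "Loral launches a satellite"

def generalize(pattern):
    """Expand an active-voice clausal pattern into common syntactic variants."""
    subj, verb, obj = pattern
    return [
        pattern,                                           # active:   X launches Y
        [obj, f"{verb}:passive", "by", subj],              # passive:  Y is launched by X
        [obj, "rel-pron", f"{verb}:passive", "by", subj],  # relative: Y, which was launched by X
    ]

for variant in generalize(base_pattern):
    print(" ".join(variant))
```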
Performance on the Launch Scenario • Scenario Patterns • Basically two types: launch events and mission events • In cases where there is no direct connection between the two events, post-processing inference rules attempt to tie the mission to a launch event • Inference Rules • Involve many-to-many relations (e.g. multiple payloads correspond to a single event) • Extending the inference rule set with heuristics, e.g. to find the date and site
Conclusion • Example-based pattern acquisition is appropriate for the ST-level task, especially when training data is quite limited • Pattern editing tools are useful and effective
NYU: Description of the MENE Named Entity System as Used in MUC-7 Andrew Borthwick, John Sterling, et al. Presented by Jinying Chen 10/04/2002
Outline • Maximum Entropy • MENE’s Feature Classes • Feature Selection • Decoding • Results • Conclusion
Maximum Entropy • Problem Definition • The problem of named entity recognition can be reduced to the problem of assigning one of 4*n+1 tags to each token • n: the number of name categories; for MUC-7, n=7 (person, organization, location, date, time, money, percent) • 4 states per category: x_start, x_continue, x_end, x_unique • other: not part of a named entity
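To make the 4*n+1 tag scheme concrete, a small sketch enumerating the full tag set; the category names follow the MUC-7 NE task, and the variable names are illustrative:

```python
# Enumerate the 4*n+1 tags used to reduce NE recognition to per-token tagging.
categories = ["person", "organization", "location", "date", "time", "money", "percent"]  # n = 7
states = ["start", "continue", "end", "unique"]

tags = [f"{c}_{s}" for c in categories for s in states] + ["other"]
print(len(tags))  # 4*7 + 1 = 29 tags
```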
Maximum Entropy (cont.) • Maximum Entropy Solution • compute p(f | h), where f is the prediction among the 4*n+1 tags and h is the history • the computation of p(f | h) depends on a set of binary-valued feature functions g(h, f), e.g. the one sketched below
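The feature example from the original slide does not survive in this copy; a representative binary feature of the kind MENE uses (the exact example on the slide may differ) can be written as:

```latex
g(h, f) =
\begin{cases}
1 & \text{if the current token in } h \text{ is capitalized and } f = \texttt{person\_start} \\
0 & \text{otherwise}
\end{cases}
```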
Maximum Entropy (cont.) • Given a set of features and some training data, the maximum entropy estimation process produces a model:
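The model equation itself is missing from this copy of the slides. In the standard maximum entropy formulation, which MENE follows, the conditional model has the log-linear form below, where each weight α_i > 0 is estimated from the training data (typically by generalized iterative scaling):

```latex
P(f \mid h) \;=\; \frac{\prod_{i} \alpha_i^{\,g_i(h,f)}}{Z_\alpha(h)},
\qquad
Z_\alpha(h) \;=\; \sum_{f'} \prod_{i} \alpha_i^{\,g_i(h,f')}
```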
MENE’s Feature Classes • Binary Features • Lexical Features • Section Features • Dictionary Features • External Systems Features
Binary Features • Features whose “history” can be considered to be either on or off for a given token. • Example: • The token begins with a capitalized letter • The token is a four-digit number
Lexical Features • Consult the tokens in a window around the current word (from two tokens before to two tokens after) • Example: a feature that fires when the preceding token is "Mr." and the prediction is person_start
Section Features • Features that make predictions based on the current section of the article, such as "Date", "Preamble", and "Text" • Play a key role by establishing the background probability of the occurrence of the different futures (predictions)
Dictionary Features • Make use of a broad array of dictionaries of useful single or multi-word terms such as first names, company names, and corporate suffixes. • Require no manual editing
External Systems Features • MENE incorporates the outputs of three NE taggers • a significantly enhanced version of the traditional, hand-coded "Proteus" named-entity tagger • the University of Manitoba's system • IsoQuest's system
Feature Selection • Simple • Select all features which fire at least 3 times in the training corpus
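A minimal sketch of this count-based selection, assuming features are represented as binary predicates over (history, future) pairs; the function and parameter names are illustrative:

```python
from collections import Counter

def select_features(candidate_features, training_events, min_count=3):
    """Keep only the features that fire at least min_count times on the training corpus."""
    counts = Counter()
    for history, future in training_events:       # (h, f) pairs observed in training
        for feat in candidate_features:
            if feat(history, future):              # binary-valued feature fires on this pair
                counts[feat] += 1
    return [f for f in candidate_features if counts[f] >= min_count]
```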
Decoding • Simple • For each token, check all the active features for this token and compute p(f | h) • Run a Viterbi search to find the highest-probability coherent path through the lattice of conditional probabilities
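A compact sketch of such a Viterbi search over the per-token distributions; the transition-validity rule is a simplified assumption about what "coherent" means for the x_start/x_continue/x_end/x_unique tag scheme, not MENE's exact constraint set:

```python
import math

def valid_transition(prev_tag, tag):
    """Simplified coherence rule for the x_start / x_continue / x_end / x_unique scheme."""
    if prev_tag.endswith(("_start", "_continue")):
        cat = prev_tag.rsplit("_", 1)[0]
        return tag in (f"{cat}_continue", f"{cat}_end")          # must stay inside the same name
    return tag == "other" or tag.endswith(("_start", "_unique"))  # free to begin a new name

def viterbi(token_probs, tags):
    """token_probs: one dict per token mapping tag -> p(f | h); returns the best coherent path."""
    best = {t: (math.log(token_probs[0].get(t, 1e-12)), [t])
            for t in tags if valid_transition("other", t)}        # legal starting tags
    for probs in token_probs[1:]:
        new_best = {}
        for tag in tags:
            cands = [(score + math.log(probs.get(tag, 1e-12)), path + [tag])
                     for prev, (score, path) in best.items() if valid_transition(prev, tag)]
            if cands:
                new_best[tag] = max(cands)
        best = new_best
    return max(best.values())[1]   # highest-probability coherent tag sequence
```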
Results • Training set: 350 aviation disaster articles (about 270,000 words) • Test set: • Dry run: within-domain corpus • Formal run: out-of-domain corpus
Conclusion • A new, still immature system; performance could be improved by: • Adding long-range reference-resolution features • Exploring compound features • Using more sophisticated methods of feature selection • Highly portable • Provides an efficient way to combine NE systems