
Entity extraction: rule-based methods


finola

Presentation Transcript


  1. Entity extraction: rule-based methods
     • Example: “I’m booked on the train leaving from Paris at 6 hours 31”
     • Rule (Location): (Token string = “from”) ({DictionaryLookup=location}):loc -> Location=:loc
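The Location rule on this slide can be sketched in a few lines of Python. This is only an illustration, not the notation or engine the slides use: the `LOCATIONS` gazetteer and the whitespace tokenizer are assumptions made for the example.

```python
# Hypothetical gazetteer; a real system would load a much larger dictionary.
LOCATIONS = {"paris", "rome", "greece"}

def extract_locations(text):
    """Rule: the token "from" followed by a dictionary location -> Location."""
    tokens = text.replace(",", " ").split()  # naive whitespace tokenizer (assumption)
    found = []
    for prev, tok in zip(tokens, tokens[1:]):
        # Context pattern: previous token is "from"; entity pattern: dictionary hit.
        if prev.lower() == "from" and tok.lower() in LOCATIONS:
            found.append(tok)
    return found

locations = extract_locations("I'm booked on the train leaving from Paris at 6 hours 31")
```

Only a dictionary location that directly follows "from" is labeled, so a bare mention of Paris elsewhere in the sentence would not fire this rule.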

  2. Extraction through Rules
     • Rules are useful for:
       • Email address: “My email address is jane@byu.edu”
       • Phone number: “For information you should call 801-111-2222”
       • Extracting information from controlled and well-behaved data
       • Creating wrappers
     • Rule-based system:
       • Collection of rules
       • Policies to control the firing of multiple rules
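For well-behaved fields such as the email address and phone number above, a single regular expression per field is often enough. The patterns below are simplified assumptions for illustration (they do not cover every valid address or phone format).

```python
import re

# Simplified patterns, assumed for this example only.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")
PHONE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")  # e.g. 801-111-2222

def extract(text):
    """Return every email address and phone number found in the text."""
    return {"email": EMAIL.findall(text), "phone": PHONE.findall(text)}

result = extract(
    "My email address is jane@byu.edu. For information you should call 801-111-2222"
)
```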

  3. Rule Representation
     • Different rule-based systems use different formats for rule representation
     • CPSL: Common Pattern Specification Language
     • Rapier: Robust Automated Production of Information Extraction Rules
     • WHISK: supervised algorithm to learn regular expressions
     • Avatar: SQL expressions

  4. Form of a Basic Rule
     • Contextual Pattern -> Action
     • Contextual pattern: 1 or more labeled patterns capturing properties of 1 or more entities and the context in which they appear in the text
       • Each labeled pattern is a regular expression defined over features of tokens in the text, plus an optional label
       • A feature can be any property of the token, its context, or the document in which the token appears
     • Action: tagging actions. Examples:
       • Assigning an entity label
       • Inserting a start/end entity tag at a position
       • Assigning multiple entity tags

  5. Features of Tokens
     • A token is associated with a bag of features obtained through 1 or more criteria:
       • String representing the token
       • Orthography type of the token (example tokens: Kitchen, Jones&Jones)
       • Part of speech of the token (e.g. “Heat water in a large vessel”: V N P DET ADJ N)
       • List of dictionaries in which the token appears (e.g. Locations: Rome, Paris, Greece)
       • Annotations attached by previous processing steps (e.g. <location> ….. </location>)
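A bag of token features can be sketched as a small dictionary per token. The orthography categories and the `LOCATIONS` dictionary below are assumptions chosen for the example; part-of-speech tags and earlier annotations are left out because they would need an external tagger.

```python
# Dictionary taken from the slide's example list; a real one would be much larger.
LOCATIONS = {"Rome", "Paris", "Greece"}

def orthography(tok):
    """Coarse orthography type of a token (category names are assumptions)."""
    if tok.isdigit():
        return "number"
    if tok[:1].isupper() and tok[1:].islower():
        return "capitalized"
    if tok.isupper():
        return "allcaps"
    if tok.islower():
        return "lowercase"
    return "mixed"

def features(tok):
    """Bag of features for one token: string, orthography, dictionary memberships."""
    return {
        "string": tok,
        "orth": orthography(tok),
        "dicts": ["location"] if tok in LOCATIONS else [],
    }

paris = features("Paris")
```

Rules can then be written as predicates over these feature dictionaries rather than over raw strings.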

  6. Rules to Identify a Single Entity
     • Patterns that make up an entity-recognizing rule:
       • An optional pattern to capture the context before the start of an entity
       • A pattern matching the tokens in the entity
       • An optional pattern to capture the context after the end of the entity
     • Example: identify person names of the form “Dr. Yair Weiss”
       ({DictionaryLookup = Titles} {String = “.”} {Orthography type = capitalized word}{2}) -> Person names
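The person-name rule above can be approximated over a token list as follows. This is a sketch under assumptions: the `TITLES` dictionary and the regex tokenizer are invented for the example, and the whole matched span is returned as the person name.

```python
import re

TITLES = {"Dr", "Prof", "Mr", "Mrs", "Ms"}  # hypothetical title dictionary

def tokenize(text):
    """Split into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

def is_cap(tok):
    """Orthography type = capitalized word."""
    return tok[:1].isupper() and tok[1:].islower()

def find_person_names(text):
    """Title, then ".", then exactly two capitalized words -> person name."""
    toks = tokenize(text)
    names = []
    for i in range(len(toks) - 3):
        if (toks[i] in TITLES and toks[i + 1] == "."
                and is_cap(toks[i + 2]) and is_cap(toks[i + 3])):
            names.append(f"{toks[i]}. {toks[i + 2]} {toks[i + 3]}")
    return names

names = find_person_names("The talk by Dr. Yair Weiss starts at noon.")
```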

  7. Rules to Identify a Single Entity • Examples: • Rules for identifying company names in GATE, a popular entity recognition system

  8. Rules to Mark Entity Boundaries
     • Marking entity boundaries is useful for long entities
     • Separate rules are defined to mark the start and the end of an entity
     • Each rule leads to the insertion of an SGML tag in the text
     • Example: insert a <journal> tag to mark the start of a journal name in a citation record
       ({String=“to”} {String=“appear”} {String=“in”}):jstart ({Orthography type = Capitalized word}{2-5}) -> insert <journal> after jstart
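A boundary rule like the <journal> example can be imitated with a regular-expression substitution. This is a character-level approximation of the token-level rule, assumed for illustration: the lookahead requires 2 to 5 capitalized words after "to appear in" but does not consume them.

```python
import re

# "to appear in " followed (lookahead only) by 2-5 capitalized words.
JSTART = re.compile(r"(to appear in )(?=(?:[A-Z][a-z]+ ){1,4}[A-Z][a-z]+)")

def mark_journal_start(citation):
    """Insert <journal> right after the 'to appear in' context."""
    return JSTART.sub(r"\1<journal>", citation)

tagged = mark_journal_start(
    "J. Smith. Rule learning, to appear in Machine Learning Journal, 2008."
)
```

Only the start tag is inserted here; a separate rule would be responsible for the closing tag.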

  9. Rules for Multiple Entities
     • Rules for multiple-entity recognition use a regular expression with multiple slots, each representing a different entity, to identify more than one entity simultaneously
     • Useful to extract information from structured records: medical records, classified ads, etc.
     • Example: extract the number of bedrooms and the rent from an apartment rental ad
       ({Orthography type = Digit}):Bedroom ({String=“BR”}) ({})* ({String=“$”}) ({Orthography type = Number}):Price -> Number of Bedrooms = :Bedroom, Rent = :Price
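The two-slot rental rule maps naturally onto a regular expression with two named groups. The pattern below is a sketch that assumes ads write bedrooms as "<digits> BR" and rent as "$<digits>"; the sample ad text is made up.

```python
import re

# Slot 1: digits before "BR"; then any tokens; slot 2: digits after "$".
AD = re.compile(r"(?P<bedrooms>\d+)\s*BR\b.*?\$\s*(?P<rent>\d[\d,]*)")

def parse_ad(ad):
    """Fill both slots at once, or return None if the pattern does not match."""
    m = AD.search(ad)
    if m is None:
        return None
    return {
        "bedrooms": int(m.group("bedrooms")),
        "rent": int(m.group("rent").replace(",", "")),
    }

parsed = parse_ad("Sunny 2 BR apt near campus, $950 per month")
```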

  10. Organizing Collections of Rules
     • Rule-based systems consist of a very large collection of rules
     • Problem: spans demarcated by different rules overlap, leading to conflicting actions
     • Solution: a component that organizes the rules and controls the order in which they are applied, to eliminate or resolve conflicts
     • How: heuristics and special-case handling, since rule management is a nonstandardized, custom-tuned part of a rule-based system

  11. Resolving Rule Conflicts • Use of special/custom policies • Sample policies • Prefer rules that mark larger spans of text as an entity type • Merge spans of text that overlap, only if the action portion of the two applied rules is the same • Popular strategy since it allows flexibility in defining rules
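The two sample policies can be sketched over spans represented as `(start, end, label)` triples with an exclusive end. The representation, the order in which the policies are applied, and the example spans are all assumptions made for this illustration.

```python
def resolve(spans):
    """Apply two conflict policies to (start, end, label) spans, end exclusive."""
    # Policy: merge overlapping spans only when their rules produced the same label.
    merged = []
    for start, end, label in sorted(spans):
        for i, (s, e, l) in enumerate(merged):
            if l == label and start < e and s < end:
                merged[i] = (min(s, start), max(e, end), l)
                break
        else:
            merged.append((start, end, label))
    # Policy: among remaining overlaps, prefer the rule that marks the larger span.
    kept = []
    for span in sorted(merged, key=lambda sp: sp[1] - sp[0], reverse=True):
        if all(span[1] <= k[0] or k[1] <= span[0] for k in kept):
            kept.append(span)
    return sorted(kept)

resolved = resolve([(0, 2, "PER"), (1, 3, "PER"), (2, 6, "ORG"), (10, 12, "LOC")])
```

In this example the two PER spans merge into (0, 3), which then loses to the longer overlapping ORG span.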

  12. Resolving Rule Conflicts
     • Arrange rules as an ordered set
     • Define a priority order over all the rules and, on conflict, favor the rule with higher priority
     • The priority of a rule is fixed by some function of the precision and coverage of the rule on the training data
     • “It is an open question whether a good rule-based theory should consist of rules that cover many examples at the expense of a certain number of misclassifications or whether one should prefer rules that cover only few examples, but appear to be more precise” (Fürnkranz, 2003)
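A priority-based policy might look like the sketch below. The slide only says priority is "some function" of precision and coverage; the harmonic mean used here is an assumption picked for the example, and the `Rule` class and its numbers are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Rule:
    name: str
    precision: float  # measured on training data
    coverage: float   # measured on training data

def priority(rule):
    """Assumed scoring function: harmonic mean of precision and coverage."""
    p, c = rule.precision, rule.coverage
    return 0.0 if p + c == 0 else 2 * p * c / (p + c)

def pick_winner(conflicting_rules):
    """When several rules fire on overlapping text, keep the highest-priority one."""
    return max(conflicting_rules, key=priority)

narrow = Rule("A", precision=0.9, coverage=0.1)
broad = Rule("B", precision=0.7, coverage=0.5)
winner = pick_winner([narrow, broad])
```

A different scoring function (for example, precision alone) would favor the narrow rule instead, which is exactly the trade-off the Fürnkranz quote describes.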

  13. Resolving Rule Conflicts
     • Based on a complete order
     • A later rule can be defined on the actions of earlier rules
     • Example: insert an end tag based on the result of an earlier rule that inserted the start tag
     • Since R1 has precedence, R2 can assume that <journal> is already present and can use it as part of its pattern
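The dependency between an earlier and a later rule can be sketched with two substitutions applied in a fixed order. The concrete patterns for `r1` and `r2` are assumptions; the slide's own R1 and R2 are not reproduced in the transcript.

```python
import re

def r1(text):
    """Earlier rule: insert <journal> after 'to appear in' before a capitalized word."""
    return re.sub(r"(to appear in )(?=[A-Z])", r"\1<journal>", text)

def r2(text):
    """Later rule: relies on <journal> from r1, closes the tag after the name."""
    return re.sub(r"(<journal>[A-Z][a-z]+(?: [A-Z][a-z]+)*)", r"\1</journal>", text)

citation = "Rule learning, to appear in Machine Learning Journal, 2008."
tagged = r2(r1(citation))
```

Applied on its own, `r2` does nothing, because its pattern only matches text that `r1` has already tagged.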

  14. Resolving Rule Conflicts
     • Finite State Machines
     • A full automaton is explicitly defined to control the exact sequence in which rules are applied
     • Nodes (entities) are connected via directed edges
     • Each edge is associated with a rule on the input tokens that must be satisfied for the edge to be taken
     • Each rule corresponds to a path in the FST
     • There is no ambiguity about the order in which rules are applied, as long as there is a unique path from the start state to the sink state for each sequence of tokens
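A small automaton of this kind can be written as a transition table whose edges carry token predicates. The states, predicates, and the person-name pattern they encode are assumptions made for this sketch.

```python
def is_cap(tok):
    """Orthography test: capitalized word."""
    return tok[:1].isupper() and tok[1:].islower()

# state -> list of (predicate on the next token, next state); names are hypothetical.
TRANSITIONS = {
    "start": [(lambda t: t in {"Dr", "Prof"}, "title")],
    "title": [(lambda t: t == ".", "dot")],
    "dot": [(is_cap, "first")],
    "first": [(is_cap, "sink")],
}

def accepts(tokens):
    """Follow the unique matching edge per token; accept only if the sink is reached."""
    state = "start"
    for tok in tokens:
        for test, nxt in TRANSITIONS.get(state, []):
            if test(tok):
                state = nxt
                break
        else:
            return False  # no edge can be taken for this token
    return state == "sink"

ok = accepts(["Dr", ".", "Yair", "Weiss"])
```

Because each state has at most one edge that a given token can satisfy, the order in which rules apply is fully determined by the path.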

  15. How Rules are Created
     • A typical entity extraction system depends on a finely tuned set of rules
     • Rules can be manually coded by domain experts
     • Rules can be automatically learnt from labeled examples of entities in unstructured text

  16. Rule Learning Algorithm • Create a set of rules R1, R2, … Rk such that the action of each rule either • Identifies a single entity • Marks entity boundaries • Identifies multiple entities • From a training set consisting of • Unstructured set of documents where all the occurrences of entities are marked correctly

  17. Rule Learning Algorithm
     • Goal
       • Cover all the segments that contain an annotation by 1 or more rules
       • Ensure that the precision of each rule is high
     • Let S(R) be the set of data segments in the training documents matched by a rule R; the coverage of R is the fraction of data segments that fall in S(R)
     • Let S’(R) be the subset of S(R) for which the action specified by R is correct; the precision of R is the ratio |S’(R)| / |S(R)|
     • The overall set of rules must have good recall and precision on new documents
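Coverage and precision as defined on this slide can be computed directly from labeled segments. In this sketch a rule is assumed to be a function that returns a label or `None`, and the tiny training set is invented for the example.

```python
def coverage_and_precision(rule, segments):
    """segments: list of (text, gold_label); rule(text) returns a label or None."""
    matched = [(t, g) for t, g in segments if rule(t) is not None]  # S(R)
    correct = [(t, g) for t, g in matched if rule(t) == g]          # S'(R)
    coverage = len(matched) / len(segments) if segments else 0.0
    precision = len(correct) / len(matched) if matched else 0.0
    return coverage, precision

def from_rule(text):
    """Toy rule: anything after 'from ' is a Location."""
    return "Location" if text.startswith("from ") else None

training = [
    ("from Paris", "Location"),
    ("from Monday", None),      # the rule fires here, but wrongly
    ("at 6", None),
    ("in Rome", "Location"),    # the rule misses this one
]
cov, prec = coverage_and_precision(from_rule, training)
```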

  18. Rule Learning Algorithm
     • Learnt rules must generalize
     • Define the smallest set of rules that covers the maximum number of training cases with high precision
     • Finding the optimal size for a rule set is intractable
     • Rule-learning algorithms follow a greedy hill-climbing strategy, learning one rule at a time
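The greedy, one-rule-at-a-time strategy can be sketched as a covering loop: repeatedly add the sufficiently precise rule that covers the most still-uncovered positive examples. The candidate pool, the `min_precision` threshold, and the examples are assumptions for illustration; real learners generate candidates rather than choosing from a fixed pool.

```python
def greedy_cover(candidates, positives, negatives, min_precision=0.8):
    """candidates: dict of name -> predicate. Returns (chosen names, uncovered positives)."""
    uncovered = set(positives)
    chosen = []
    while uncovered:
        best, best_gain = None, 0
        for name, rule in candidates.items():
            if name in chosen:
                continue
            tp = sum(1 for x in positives if rule(x))
            fp = sum(1 for x in negatives if rule(x))
            if tp + fp == 0 or tp / (tp + fp) < min_precision:
                continue  # rule never fires, or is too imprecise
            gain = sum(1 for x in uncovered if rule(x))
            if gain > best_gain:
                best, best_gain = name, gain
        if best is None:
            break  # no remaining rule adds coverage
        chosen.append(best)
        uncovered -= {x for x in uncovered if candidates[best](x)}
    return chosen, uncovered

candidates = {
    "any_dot": lambda s: "." in s,
    "dr": lambda s: s.startswith("Dr."),
    "prof": lambda s: s.startswith("Prof."),
}
chosen, left = greedy_cover(
    candidates,
    positives=["Dr. Weiss", "Prof. Lee", "Dr. Kim"],
    negatives=["St. Louis", "Mr. Coffee"],
)
```

The broad `any_dot` rule covers every positive but is rejected for low precision, so two narrower rules are selected instead.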

  19. Rule Learning Algorithm
     • The main challenge is to create a new rule that achieves high overall coverage and has high precision
     • This can be achieved using heuristics or existing strategies, classified as:
       • Bottom-up approach: specific rule -> generalized rule
       • Top-down approach: general rule -> specialized rule

  20. Bottom-Up
     • Bottom-up rule formation
       • Start with a rule with minimal coverage but 100% precision
       • Gradually make the rule more general to increase coverage, even if some precision is lost
     • Example: rule learning using (LP)² for each tag type

  21. Bottom-Up
     • Creation of a seed rule
     • Example: seed rule to insert tag T=<PER> before a position pstart
       ({String=“According”} {String=“to”}):pstart {String=“Robert”} {String=“Callahan”} -> insert <PER> at :pstart

  22. Bottom-Up
     • Generalizing seed rules
     • Example: replace a token constraint by a more general feature, or drop it
       • Seed: ({String=“According”} {String=“to”}):pstart {String=“Robert”} {String=“Callahan”} -> insert <PER> at :pstart
       • Generalization: ({String=“According”} {String=“to”}):pstart {Orthography type = “Capitalized word”} {Orthography type = “Capitalized word”} -> insert <PER> after :pstart
       • Generalization: ({DictionaryLookup = Person}):pstart ({DictionaryLookup = Person}) -> insert <PER> before :pstart
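The generalization step can be sketched by representing a rule pattern as a list of per-token constraints and swapping string constraints for an orthography constraint. The `(kind, value)` constraint encoding is an assumption for this example, not the representation (LP)² uses.

```python
def is_cap(tok):
    return tok[:1].isupper() and tok[1:].islower()

def matches(constraint, tok):
    """A constraint is ('string', value) or ('orth', 'capitalized')."""
    kind, value = constraint
    if kind == "string":
        return tok == value
    if kind == "orth":
        return value == "capitalized" and is_cap(tok)
    return False

def rule_matches(pattern, tokens):
    return len(pattern) == len(tokens) and all(
        matches(c, t) for c, t in zip(pattern, tokens)
    )

def generalize(pattern, positions):
    """Replace the string constraints at the given positions by an orthography test."""
    return [("orth", "capitalized") if i in positions else c
            for i, c in enumerate(pattern)]

seed = [("string", "According"), ("string", "to"),
        ("string", "Robert"), ("string", "Callahan")]
general = generalize(seed, {2, 3})
```

The seed matches only the sentence it was built from, while the generalized pattern also matches other names after "According to", at the risk of lower precision.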

  23. Bottom-Up
     • Generalizations retained starting from a single seed rule
     • Top-K rules are selected sequentially in decreasing order of precision over uncovered instances
     • (LP)² also considers a number of measures of rule quality, such as
       • Precision
       • Overall coverage
       • Coverage of instances not covered by other rules

  24. Top-Down
     • Top-down rule formation
       • Start with a rule that covers all possible instances, i.e., 100% coverage and poor precision
       • Specialize the rule to increase precision
       • Select the k most precise rules
     • Example: rule learning using (LP)² for each tag type
     • A user-provided threshold sets the coverage required of each rule

  25. Rule Learning Algorithm
     • Problem: due to the limited availability of labeled data, purely automated data-driven methods for rule induction are not adequate
     • Solution: a hybrid of automated and manual methods to improve rule-based systems
     • [Figure labels: Rules, Labeled Data]

  26. Summary
     • Rule-based methods for entity extraction
       • Example: “I’m booked on the train leaving from Paris at 6 hours 31”
       • Rule (Location): (Token string = “from”) ({DictionaryLookup=location}):loc -> Location=:loc
     • Conflicts need to be resolved
     • How sets of rules are created
