
Relational Learning of Pattern-Match Rules for Information Extraction



Presentation Transcript


  1. Relational Learning of Pattern-Match Rules for Information Extraction • Mary Elaine Califf and Raymond J. Mooney

  2. Motivation • A growing number of electronic documents contain a large amount of information • IE systems are time-consuming to build • Their components are highly domain-specific

  3. RAPIER • Uses relational learning to construct unbounded pattern-match rules, given a database of texts and filled templates • Consists primarily of a bottom-up search • Employs limited syntactic and semantic information • Learns rules for the complete IE task

  4. Filled template of RAPIER

  5. Relational learning and Inductive Logic Programming (ILP) • Allow induction over structured examples that can include first-order logical representations and unbounded data structures • Work well in text categorization and in generating the past tense of English verbs

  6. Other ILP Systems • GOLEM • CHILLIN • PROGOL

  7. RAPIER’s rule representation • Rules are indexed by template name and slot name • Each rule consists of three parts: 1. A pre-filler pattern 2. A filler pattern (matches the actual slot filler) 3. A post-filler pattern

  8. Pattern • Pattern item: matches exactly one word • Pattern list: has a maximum length N and matches 0..N words • Each matched word must satisfy a set of constraints: 1. Specific word, POS tag, semantic class 2. Disjunctive lists
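The rule and pattern structure on slides 7 and 8 can be sketched in a few lines of Python. This is a minimal illustration, not RAPIER's implementation: the names `Element`, `Rule`, `match_ends` and `extract` are hypothetical, semantic-class constraints are left out, and the sample rule and POS tags are assumptions chosen so that the two phrases on slide 9 both match.

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple

Token = Tuple[str, str]  # (word, POS tag)


@dataclass
class Element:
    """One pattern element; a constraint set to None accepts anything."""
    words: Optional[Set[str]] = None  # disjunctive list of allowed words (lowercase)
    tags: Optional[Set[str]] = None   # disjunctive list of allowed POS tags
    max_len: Optional[int] = None     # None: pattern item; N: pattern list (0..N words)

    def accepts(self, token: Token) -> bool:
        word, tag = token
        if self.words is not None and word.lower() not in self.words:
            return False
        return self.tags is None or tag in self.tags


def match_ends(pattern: List[Element], tokens: List[Token], start: int) -> Set[int]:
    """Return every end index e such that pattern matches tokens[start:e]."""
    positions = {start}
    for el in pattern:
        reached: Set[int] = set()
        for pos in positions:
            if el.max_len is None:
                # Pattern item: consume exactly one accepted token.
                if pos < len(tokens) and el.accepts(tokens[pos]):
                    reached.add(pos + 1)
            else:
                # Pattern list: consume between 0 and max_len accepted tokens.
                end = pos
                reached.add(end)
                while (end < len(tokens) and end - pos < el.max_len
                       and el.accepts(tokens[end])):
                    end += 1
                    reached.add(end)
        positions = reached
    return positions


@dataclass
class Rule:
    pre: List[Element]     # must match the text just before the filler
    filler: List[Element]  # must match the filler itself
    post: List[Element]    # must match the text just after the filler


def extract(rule: Rule, tokens: List[Token]) -> List[str]:
    """Return the distinct fillers the rule extracts, in text order."""
    spans = set()
    for i in range(len(tokens) + 1):
        for fs in match_ends(rule.pre, tokens, i):
            for fe in match_ends(rule.filler, tokens, fs):
                if fe > fs and match_ends(rule.post, tokens, fe):
                    spans.add((fs, fe))
    return [" ".join(w for w, _ in tokens[fs:fe]) for fs, fe in sorted(spans)]


# Hypothetical rule: a noun, then up to two arbitrary words, then the filler
# "undisclosed", then "amount" or "price" (a word disjunction standing in for
# a semantic-class constraint).
rule = Rule(
    pre=[Element(tags={"nn", "nnp"}), Element(max_len=2)],
    filler=[Element(words={"undisclosed"}, tags={"jj"})],
    post=[Element(words={"amount", "price"})],
)

# POS tags below are assumed for illustration.
s1 = [("Sold", "vbd"), ("to", "to"), ("the", "dt"), ("bank", "nn"),
      ("for", "in"), ("an", "dt"), ("undisclosed", "jj"), ("amount", "nn")]
s2 = [("Paid", "vbd"), ("Honeywell", "nnp"), ("an", "dt"),
      ("undisclosed", "jj"), ("price", "nn")]

print(extract(rule, s1))
print(extract(rule, s2))
```

Tracking a set of reachable positions keeps the matcher simple even though a pattern list may consume a variable number of words.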

  9. An example rule • "Sold to the bank for an undisclosed amount" • "Paid Honeywell an undisclosed price"

  10. RAPIER’s learning algorithm • Begins with the most specific rule definitions and compresses them by replacing rules with more general ones • Attempts to compress the rules for each slot • Prefers more specific rules

  11. Implementation • Based on least general generalization (LGG) • Starts with rules containing only generalizations of the filler patterns • Employs top-down beam search for the pre-filler and post-filler patterns • Rules are ordered using an information gain metric weighted by the size of the rule (preferring smaller rules)
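The rule-ordering idea on slide 11 can be illustrated with a small scoring function. The slides do not give the formula, so the form below is an assumption: an information-based term using a Laplace-smoothed estimate of rule accuracy, plus a size penalty divided by the number of correct fillers. The names `rule_score` and `keep_best`, and the way rule size is measured, are hypothetical.

```python
import heapq
import math
from typing import List, Tuple


def rule_score(p: int, n: int, rule_size: float) -> float:
    """Score a rule; lower is better.

    p: correct fillers the rule extracts from the training texts
    n: spurious fillers it extracts
    rule_size: an assumed size measure for the rule (smaller is preferred)
    """
    if p <= 0:
        return math.inf  # a rule that covers nothing useful is never kept
    informativity = -math.log2((p + 1) / (p + n + 2))
    return informativity + rule_size / p


def keep_best(candidates: List[Tuple[str, int, int, float]],
              beam_width: int) -> List[Tuple[str, int, int, float]]:
    """Keep the beam_width best (name, p, n, rule_size) candidates."""
    return heapq.nsmallest(
        beam_width, candidates, key=lambda c: rule_score(c[1], c[2], c[3])
    )


candidates = [("a", 9, 0, 0.1), ("b", 9, 3, 0.1), ("c", 2, 0, 0.1)]
print([name for name, *_ in keep_best(candidates, beam_width=2)])
```

Dividing the size penalty by `p` means a larger rule is tolerated when it covers many correct fillers, which matches the stated preference for smaller rules only loosely; other weightings would fit the slide equally well.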

  12. Example • "Located in Atlanta, Georgia." • "Offices in Kansas City, Missouri."
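To make the generalization step concrete, here is a minimal sketch of how two constrained pattern elements from phrases like these might be generalized. The scheme is an assumption for illustration: identical constraints are kept, a superset subsumes a subset, and differing constraints yield either their disjunction or no constraint at all. The function names and the POS tags are hypothetical.

```python
from itertools import product
from typing import FrozenSet, List, Optional, Tuple

Constraint = Optional[FrozenSet[str]]      # None means "unconstrained"
Element = Tuple[Constraint, Constraint]    # (word constraint, POS-tag constraint)


def generalize_constraint(a: Constraint, b: Constraint) -> List[Constraint]:
    """Return candidate generalizations covering both constraints."""
    if a is None or b is None:
        return [None]       # anything generalized with "unconstrained" stays so
    if a <= b:
        return [b]          # b already covers a (includes the a == b case)
    if b <= a:
        return [a]
    return [a | b, None]    # either form a disjunction or drop the constraint


def generalize_element(e1: Element, e2: Element) -> List[Element]:
    """Combine every word generalization with every tag generalization."""
    words = generalize_constraint(e1[0], e2[0])
    tags = generalize_constraint(e1[1], e2[1])
    return list(product(words, tags))


# First words of the two phrases, with assumed POS tags.
located = (frozenset({"located"}), frozenset({"vbn"}))
offices = (frozenset({"offices"}), frozenset({"nns"}))
in_word = (frozenset({"in"}), frozenset({"in"}))

for candidate in generalize_element(located, offices):
    print(candidate)
print(generalize_element(in_word, in_word))
```

Two elements that differ in both word and tag produce four candidates, while identical elements produce one, which is why a search over whole patterns needs a beam to stay tractable.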

  13. Example (cont)

  14. Example (cont) Final best rule:

  15. Experimental Evaluation • A set of 300 computer-related job postings from austin.jobs • A set of 485 seminar announcements from CMU • Three different versions of RAPIER were tested: 1. Words, POS tags, semantic classes 2. Words, POS tags 3. Words

  16. Other learning IE systems • Naïve Bayes system: uses words in a fixed-length window to locate slot fillers • SRV: uses a top-down, set-covering rule learner and four predetermined predicates • WHISK: uses pattern matching and a restricted form of regular expressions

  17. Performance on job postings

  18. Results for seminar announcement task

  19. Conclusion • Pros 1. Has the potential to help automate the development of IE systems 2. Works well in locating specific data in newsgroup messages 3. Identifies potential slot fillers and their surrounding context with limited syntactic and semantic information 4. Learns rules from relatively small sets of examples in some specific domains • Cons 1. Single slot 2. Regular expression 3. Unknown performance in more complicated situations

  20. Questions?
