
Information Extraction with Unlabeled Data

This presentation explores the use of unlabeled data for information extraction (IE). It covers practical applications of IE, approaches to training IE models, and the challenges involved. It presents bootstrapping approaches for IE and evaluates them on a dataset of corporate web pages, and it proposes active learning and automatic collection of training data from the web as alternative ways to train IE systems.


Presentation Transcript


  1. Information Extraction with Unlabeled Data • Rayid Ghani • Joint work with: Rosie Jones (CMU), Tom Mitchell (CMU & WhizBang! Labs), Ellen Riloff (University of Utah)

  2. The Vision [Architecture diagram: training sentences and answers feed a training program that produces models; an extractor applies the models to perform information extraction, populating a database with entities, events, and relations that drive link analysis, geo display, and time-line views.]

  3. What is IE? • Analyze unrestricted text in order to extract information about pre-specified types of events, entities or relationships

  4. Practical / Commercial Applications • Database of Job Postings extracted from corporate web pages (flipdog.com) • Extracting specific fields from resumes to populate HR databases (mohomine.com) • Information Integration (fetch.com) • Shopping Portals

  5. Where is the world now? • MUC helped drive information extraction research, but most systems were fine-tuned for terrorist activities • Commercial systems can detect names of people, locations, companies (only for proper nouns) • Very costly to train and port to new domains • 3-6 months to port to a new domain (Cardie 98) • 20,000 words to learn named entity extraction (Seymore et al 99) • 7000 labeled examples to learn MUC extraction rules (Soderland 99)

  6. IE Approaches • Hand-Constructed Rules • Supervised Learning • Semi-Supervised Learning

  7. Goal • Can you start with 5-10 seeds and learn to extract other instances? • Example tasks • Locations • Products • Organizations • People

  8. Aren’t you missing the obvious? • Not really! • Acquire lists of proper nouns • Locations: countries, states, cities • Organizations: online database • People: Names • But not all instances are proper nouns • *by the river*, *customer*, *client*

  9. Use context to disambiguate • A lot of NPs are unambiguous • “The corporation” • A lot of contexts are also unambiguous • Subsidiary of <NP> • But as always, there are exceptions… and a LOT of them in this case • “customer”, John Hancock, Washington

  10. Bootstrapping Approaches • Utilize Redundancy in Text • Noun-Phrases • New York, China, place we met last time • Contexts • Located in <X>, Traveled to <X> • Learn two models • Use NPs to label Contexts • Use Contexts to label NPs

  11. Algorithms for Bootstrapping • Meta-Bootstrapping (Riloff & Jones, 1999) • Co-Training (Blum & Mitchell, 1999) • Co-EM (Nigam & Ghani, 2000)
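A minimal Python sketch of the two-view bootstrapping described on slides 10-11 follows. It assumes a simple averaging rule between the noun-phrase view and the context view; the actual Meta-Bootstrapping, Co-Training, and Co-EM algorithms use their own probabilistic models and smoothing, and `np_context_pairs` / `seed_nps` are illustrative names, not the authors' data structures.

```python
from collections import defaultdict

def two_view_bootstrap(np_context_pairs, seed_nps, iterations=10):
    """Co-EM-style sketch: alternately use NP scores to score contexts,
    then context scores to re-score NPs. Illustrative only."""
    np_score = defaultdict(float)
    for np in seed_nps:
        np_score[np] = 1.0                      # seed NPs are fully trusted
    ctx_score = {}

    for _ in range(iterations):
        # Score each context by the average score of the NPs it appears with.
        totals, counts = defaultdict(float), defaultdict(int)
        for np, ctx in np_context_pairs:
            totals[ctx] += np_score[np]
            counts[ctx] += 1
        ctx_score = {c: totals[c] / counts[c] for c in totals}

        # Score each non-seed NP by the average score of its contexts.
        totals, counts = defaultdict(float), defaultdict(int)
        for np, ctx in np_context_pairs:
            totals[np] += ctx_score[ctx]
            counts[np] += 1
        for np in totals:
            if np not in seed_nps:
                np_score[np] = totals[np] / counts[np]

    return np_score, ctx_score
```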

  12. Data Set • ~5000 corporate web pages (4000 for training) • Test data marked up manually by labeling every NP as one or more of the following semantic categories: • location, organization, person, product, none • Preprocessed (parsed) to generate extraction patterns using AutoSlog (Riloff, 1996)

  13. Evaluation Criteria • Every test NP is labeled with a confidence score by the learned model • Calculate Precision and Recall at different thresholds • Precision = Correct / Found • Recall = Correct / Max that can be found
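The precision and recall computation on slide 13 can be written down directly. This is a minimal sketch rather than the authors' evaluation script; `scored_nps` (a mapping from each test NP to the model's confidence for one category) and `gold_positive` (the set of NPs the annotators labeled with that category) are assumed inputs.

```python
def precision_recall_at_thresholds(scored_nps, gold_positive, thresholds):
    """Sweep confidence thresholds and report precision/recall at each one."""
    results = []
    for t in thresholds:
        found = {np for np, score in scored_nps.items() if score >= t}
        correct = found & gold_positive
        precision = len(correct) / len(found) if found else 0.0
        recall = len(correct) / len(gold_positive) if gold_positive else 0.0
        results.append((t, precision, recall))
    return results

# Example: precision_recall_at_thresholds(scores, gold, [0.5, 0.7, 0.9])
```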

  14. Seeds

  15. Results

  16. Active Learning • Can we do better by keeping the user in the loop? • If we can ask the user to label any examples, which examples should they be? • Selected randomly • Selected according to their density/frequency • Selected according to disagreement between NP and context (KL divergence to the mean weighted by density)

  17. NP – Context Disagreement • KL Divergence
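Slides 16-17 rank candidate NPs for the user by the disagreement between the NP model and the context model, measured as KL divergence to the mean and weighted by density. The sketch below assumes both views output a distribution over the same label set and that density is simply the NP's corpus frequency; the paper's exact weighting may differ.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) over a shared discrete label set (q > 0 wherever p > 0)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def disagreement(np_dist, ctx_dist, frequency):
    """Density-weighted KL-divergence-to-the-mean between the two views'
    class distributions for one candidate NP."""
    mean = [(p + q) / 2 for p, q in zip(np_dist, ctx_dist)]
    kl_to_mean = 0.5 * (kl_divergence(np_dist, mean) + kl_divergence(ctx_dist, mean))
    return frequency * kl_to_mean

# Candidates with the highest disagreement are shown to the user for labeling.
```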

  18. Results

  19. Results

  20. What if you’re really lazy? • Previous experiments assumed a training set was available • What if you don’t have a set of documents that can be used to train? • Can we start from only the seeds?

  21. Collecting Training Data from the Web • Use the seed words to generate web queries • Simple Approaches • For each seed word, fetch all documents returned • Only fetch documents where N or more seed words appear

  22. Collecting Training Data from the Web [Pipeline diagram: seed words → query generator → WWW → fetched documents → text filter]
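A minimal sketch of the query-generator and text-filter steps on slides 21-22, assuming single-token seed words and a hypothetical `search(query)` hook for whatever web search API is available. With `min_seeds=1` it keeps every page returned for each seed (the first simple approach); a larger value implements the "N or more seed words" filter.

```python
def collect_documents(seed_words, search, min_seeds=1):
    """Fetch pages for each seed word and keep those containing enough seeds."""
    kept = []
    for seed in seed_words:
        for url, text in search(seed):              # one web query per seed word
            tokens = set(text.lower().split())      # crude tokenization (assumption)
            hits = sum(1 for s in seed_words if s.lower() in tokens)
            if hits >= min_seeds:                   # text filter from the diagram
                kept.append((url, text))
    return kept
```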

  23. Interleaved Data Collection • Select a seed word with uniform probability • Get documents containing that seed word • Run bootstrapping on the new documents • Select new seed words that are learned with high confidence • Repeat
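The interleaved loop on slide 23, sketched with hypothetical `fetch_documents(word)` and `bootstrap(corpus, seeds)` hooks and an assumed confidence cutoff of 0.95 for promoting newly learned NPs to seed words; none of these names or values come from the paper.

```python
import random

def interleaved_collection(seed_words, fetch_documents, bootstrap, rounds=5):
    """Alternate between fetching documents for a random seed and bootstrapping."""
    seeds, corpus = set(seed_words), []
    for _ in range(rounds):
        word = random.choice(sorted(seeds))     # pick a seed word uniformly at random
        corpus.extend(fetch_documents(word))    # get documents containing that seed
        scored_nps = bootstrap(corpus, seeds)   # run bootstrapping on the data so far
        # Promote newly learned, high-confidence NPs to seed words.
        seeds.update(np for np, s in scored_nps.items() if s > 0.95)
    return seeds, corpus
```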

  24. Seed-Word Density

  25. Summary • Starting with 10 seed words, extract NPs matching specific semantic classes • Probabilistic Bootstrapping is an effective technique • Asking the user helps only if done intelligently • The Web is an excellent resource for training data that can be collected automatically => Personal Information Extraction Systems
