
Information Extraction with Unlabeled Data

This presentation explores the use of unlabeled data for information extraction (IE). It covers practical applications of IE, approaches to training IE models, and the challenges involved. It presents bootstrapping approaches for IE and evaluates them on a dataset of corporate web pages, and it proposes active learning and automatic collection of training data from the web as alternative ways to train IE systems.


Presentation Transcript


  1. Information Extraction with Unlabeled Data • Rayid Ghani • Joint work with: Rosie Jones (CMU), Tom Mitchell (CMU & WhizBang! Labs), Ellen Riloff (University of Utah)

  2. The Vision [Architecture diagram: training sentences and answers feed a training program that produces models; an extractor applies the models to perform information extraction, populating a database with entities, events, and relations that drive link analysis, geo display, and time-line views.]

  3. What is IE? • Analyze unrestricted text in order to extract information about pre-specified types of events, entities or relationships

  4. Practical / Commercial Applications • Database of Job Postings extracted from corporate web pages (flipdog.com) • Extracting specific fields from resumes to populate HR databases (mohomine.com) • Information Integration (fetch.com) • Shopping Portals

  5. Where is the world now? • MUC helped drive information extraction research, but most systems were fine-tuned for terrorist activities • Commercial systems can detect names of people, locations, companies (only for proper nouns) • Very costly to train and port to new domains • 3-6 months to port to a new domain (Cardie 98) • 20,000 words to learn named entity extraction (Seymore et al 99) • 7000 labeled examples to learn MUC extraction rules (Soderland 99)

  6. IE Approaches • Hand-Constructed Rules • Supervised Learning • Semi-Supervised Learning

  7. Goal • Can you start with 5-10 seeds and learn to extract other instances? • Example tasks • Locations • Products • Organizations • People

  8. Aren’t you missing the obvious? • Not really! • Acquire lists of proper nouns • Locations: countries, states, cities • Organizations: online database • People: Names • But not all instances are proper nouns • *by the river*, *customer*, *client*

  9. Use context to disambiguate • A lot of NPs are unambiguous • “The corporation” • A lot of contexts are also unambiguous • Subsidiary of <NP> • But as always, there are exceptions… and a LOT of them in this case • “customer”, John Hancock, Washington

  10. Bootstrapping Approaches • Utilize Redundancy in Text • Noun-Phrases • New York, China, place we met last time • Contexts • Located in <X>, Traveled to <X> • Learn two models • Use NPs to label Contexts • Use Contexts to label NPs

  11. Algorithms for Bootstrapping • Meta-Bootstrapping (Riloff & Jones, 1999) • Co-Training (Blum & Mitchell, 1999) • Co-EM (Nigam & Ghani, 2000)
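A minimal Python sketch of the two-view bootstrapping described on slides 10-11 follows. It assumes a simple averaging rule between the noun-phrase view and the context view; the actual Meta-Bootstrapping, Co-Training, and Co-EM algorithms use their own probabilistic models and smoothing, and `np_context_pairs` / `seed_nps` are illustrative names, not the authors' data structures.

```python
from collections import defaultdict

def two_view_bootstrap(np_context_pairs, seed_nps, iterations=10):
    """Co-EM-style sketch: alternately use NP scores to score contexts,
    then context scores to re-score NPs. Illustrative only."""
    np_score = defaultdict(float)
    for np in seed_nps:
        np_score[np] = 1.0                      # seed NPs are fully trusted
    ctx_score = {}

    for _ in range(iterations):
        # Score each context by the average score of the NPs it appears with.
        totals, counts = defaultdict(float), defaultdict(int)
        for np, ctx in np_context_pairs:
            totals[ctx] += np_score[np]
            counts[ctx] += 1
        ctx_score = {c: totals[c] / counts[c] for c in totals}

        # Score each non-seed NP by the average score of its contexts.
        totals, counts = defaultdict(float), defaultdict(int)
        for np, ctx in np_context_pairs:
            totals[np] += ctx_score[ctx]
            counts[np] += 1
        for np in totals:
            if np not in seed_nps:
                np_score[np] = totals[np] / counts[np]

    return np_score, ctx_score
```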

  12. Data Set • ~5000 corporate web pages (4000 for training) • Test data marked up manually by labeling every NP as one or more of the following semantic categories: • location, organization, person, product, none • Preprocessed (parsed) to generate extraction patterns using AutoSlog (Riloff, 1996)

  13. Evaluation Criteria • Every test NP is labeled with a confidence score by the learned model • Calculate Precision and Recall at different thresholds • Precision = Correct / Found • Recall = Correct / Max that can be found
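The precision and recall computation on slide 13 can be written down directly. This is a minimal sketch rather than the authors' evaluation script; `scored_nps` (a mapping from each test NP to the model's confidence for one category) and `gold_positive` (the set of NPs the annotators labeled with that category) are assumed inputs.

```python
def precision_recall_at_thresholds(scored_nps, gold_positive, thresholds):
    """Sweep confidence thresholds and report precision/recall at each one."""
    results = []
    for t in thresholds:
        found = {np for np, score in scored_nps.items() if score >= t}
        correct = found & gold_positive
        precision = len(correct) / len(found) if found else 0.0
        recall = len(correct) / len(gold_positive) if gold_positive else 0.0
        results.append((t, precision, recall))
    return results

# Example: precision_recall_at_thresholds(scores, gold, [0.5, 0.7, 0.9])
```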

  14. Seeds

  15. Results

  16. Active Learning • Can we do better by keeping the user in the loop? • If we can ask the user to label any examples, which examples should they be? • Selected randomly • Selected according to their density/frequency • Selected according to disagreement between NP and context (KL divergence to the mean weighted by density)

  17. NP – Context Disagreement • KL Divergence
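Slides 16-17 rank candidate NPs for the user by the disagreement between the NP model and the context model, measured as KL divergence to the mean and weighted by density. The sketch below assumes both views output a distribution over the same label set and that density is simply the NP's corpus frequency; the paper's exact weighting may differ.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) over a shared discrete label set (q > 0 wherever p > 0)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def disagreement(np_dist, ctx_dist, frequency):
    """Density-weighted KL-divergence-to-the-mean between the two views'
    class distributions for one candidate NP."""
    mean = [(p + q) / 2 for p, q in zip(np_dist, ctx_dist)]
    kl_to_mean = 0.5 * (kl_divergence(np_dist, mean) + kl_divergence(ctx_dist, mean))
    return frequency * kl_to_mean

# Candidates with the highest disagreement are shown to the user for labeling.
```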

  18. Results

  19. Results

  20. What if you’re really lazy? • Previous experiments assumed a training set was available • What if you don’t have a set of documents that can be used to train? • Can we start from only the seeds?

  21. Collecting Training Data from the Web • Use the seed words to generate web queries • Simple Approaches • For each seed word, fetch all documents returned • Only fetch documents where N or more seed words appear

  22. Collecting Training Data from the Web [Pipeline diagram: seed words → query generator → WWW → fetched documents → text filter]
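A minimal sketch of the query-generator and text-filter steps on slides 21-22, assuming single-token seed words and a hypothetical `search(query)` hook for whatever web search API is available. With `min_seeds=1` it keeps every page returned for each seed (the first simple approach); a larger value implements the "N or more seed words" filter.

```python
def collect_documents(seed_words, search, min_seeds=1):
    """Fetch pages for each seed word and keep those containing enough seeds."""
    kept = []
    for seed in seed_words:
        for url, text in search(seed):              # one web query per seed word
            tokens = set(text.lower().split())      # crude tokenization (assumption)
            hits = sum(1 for s in seed_words if s.lower() in tokens)
            if hits >= min_seeds:                   # text filter from the diagram
                kept.append((url, text))
    return kept
```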

  23. Interleaved Data Collection • Select a seed word with uniform probability • Get documents containing that seed word • Run bootstrapping on the new documents • Select new seed words that are learned with high confidence • Repeat
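The interleaved loop on slide 23, sketched with hypothetical `fetch_documents(word)` and `bootstrap(corpus, seeds)` hooks and an assumed confidence cutoff of 0.95 for promoting newly learned NPs to seed words; none of these names or values come from the paper.

```python
import random

def interleaved_collection(seed_words, fetch_documents, bootstrap, rounds=5):
    """Alternate between fetching documents for a random seed and bootstrapping."""
    seeds, corpus = set(seed_words), []
    for _ in range(rounds):
        word = random.choice(sorted(seeds))     # pick a seed word uniformly at random
        corpus.extend(fetch_documents(word))    # get documents containing that seed
        scored_nps = bootstrap(corpus, seeds)   # run bootstrapping on the data so far
        # Promote newly learned, high-confidence NPs to seed words.
        seeds.update(np for np, s in scored_nps.items() if s > 0.95)
    return seeds, corpus
```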

  24. Seed-Word Density

  25. Summary • Starting with 10 seed words, extract NPs matching specific semantic classes • Probabilistic Bootstrapping is an effective technique • Asking the user helps only if done intelligently • The Web is an excellent resource for training data that can be collected automatically => Personal Information Extraction Systems
