
Sparse Information Extraction: Unsupervised Language Models to the Rescue


Presentation Transcript


  1. Sparse Information Extraction: Unsupervised Language Models to the Rescue Doug Downey, Stef Schoenmackers, Oren Etzioni Turing Center University of Washington

  2. Answering Questions on the Web Q: Who has won a best actor Oscar for playing a villain? Q: Which nanotechnology companies are hiring? Q: What’s the general consensus on the IBM T40? Q: What kills bacteria? . . . No single Web page contains the answer.

  3. Open Information Extraction • Compile time: • Parse every sentence on the Web • Extract key information • Query time: • Synthesize extractions in response to queries. Challenges: topics of interest not known in advance; no hand-tagged examples.

  4. TextRunner [Banko et al 2007] At compile time… …and when Thomas Edison invented the light bulb around the early 1900s… …end of the 19th century when Thomas Edison and Joseph Swan invented a light bulb using carbon fiber… … => Invented(Thomas Edison, light bulb)

  5. TextRunner [Banko et al 2007] at query time: returns extractions (e.g., for the query “invented”) in real time. Live demo at: www.cs.washington.edu/research/textrunner

  6. Problem: Sparse Extractions. Extractions supported by many contexts, e.g., (Thomas Edison, light bulb), tend to be correct. Sparse extractions, e.g., (A. Church, lambda calculus) or (drug companies, diseases), are a mixture of correct and incorrect.

  7. Assessing Sparse Extractions Task: Identify which sparse extractions are correct. Challenge: No hand-tagged examples. Strategy: • Build a model of how common extractions occur in text • Rank sparse extractions by fit to the model • The distributional hypothesis: elements of the same relation tend to appear in similar contexts [Brin, 1998; Riloff & Jones 1999; Agichtein & Gravano, 2000; Etzioni et al. 2005; Pasca et al. 2006; Pantel et al. 2006]. Our contribution: Unsupervised language models. • Methods for mitigating sparsity • Precomputed – scalable to Open IE

  8. The REALM Architecture RElation Assessment using Language Models. Input: set of extractions for relation R, E_R = {(arg1_1, arg2_1), …, (arg1_M, arg2_M)} • Seeds: S_R = the s most frequent pairs in E_R (assume these are correct) • Output: ranking of (arg1, arg2) ∈ E_R – S_R by distributional similarity to each (seed1, seed2) in S_R
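The seed-selection and ranking loop on this slide can be sketched as follows. This is only an illustrative outline in Python: the `similarity` function is a placeholder for the typechecking and relation-assessment scores defined on later slides, and the max-over-seeds aggregation is an assumption, not necessarily the authors' choice.

```python
from collections import Counter

# Illustrative REALM outline: take the s most frequent pairs as seeds, then rank
# the remaining (sparse) pairs by distributional similarity to the seeds.
# `similarity(pair, seed)` is a placeholder scoring function.

def realm_rank(extractions, similarity, s=10):
    """extractions: list of (arg1, arg2) tuples, possibly with repeats."""
    freq = Counter(extractions)
    seeds = [pair for pair, _ in freq.most_common(s)]          # S_R, assumed correct
    candidates = [pair for pair in freq if pair not in seeds]  # E_R - S_R
    # Aggregate over seeds with max (an assumption); higher similarity ranks earlier.
    return sorted(candidates,
                  key=lambda pair: max(similarity(pair, seed) for seed in seeds),
                  reverse=True)
```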

  9. Distributional Similarity Naïve Approach – find sentences containing seed1 & seed2 or arg1 & arg2, and compare the context distributions P(w_b, …, w_e | seed1, seed2) and P(w_b, …, w_e | arg1, arg2). But e – b can be large: many parameters and sparse data => inaccuracy.

  10. N-gram Language Models Compute phrase probabilities over n words: P(w_i, …, w_{i+n-1}), obtained by counting over a corpus (e.g., a common phrase receives higher probability than a rare one).
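As a minimal sketch of such a counting-based model (toy corpus and plain maximum-likelihood estimates, not the Web-scale model used in the talk):

```python
from collections import Counter

def ngram_model(tokens, n):
    """Count n-grams in a tokenized corpus; returns an MLE phrase-probability function."""
    counts = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    total = sum(counts.values())
    def prob(phrase):
        # Maximum-likelihood estimate: relative frequency of the n-word phrase.
        return counts[tuple(phrase)] / total
    return prob

# Toy usage: the more frequent trigram gets the higher probability.
corpus = "cities such as Seattle and other cities such as Chicago".split()
p = ngram_model(corpus, 3)
print(p(["cities", "such", "as"]))   # occurs twice
print(p(["as", "Seattle", "and"]))   # occurs once
```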

  11. Distributional Similarity in REALM Two steps for assessing R(arg1, arg2) • Typechecking: ensure arg1 and arg2 are of the proper type for R (counterexample: MayorOf(Intel, Santa Clara)); leverages all occurrences of each arg • Relation Assessment: ensure R actually holds between arg1 and arg2 (counterexample: MayorOf(Giuliani, Seattle)) Both steps use pre-computed language models => scales to Open IE

  12. The REALM Architecture Two steps for assessing R(arg1, arg2) • Typechecking: ensure arg1 and arg2 are of the proper type for R (counterexample: MayorOf(Intel, Santa Clara)); leverages all occurrences of each arg • Relation Assessment: ensure R actually holds between arg1 and arg2 (counterexample: MayorOf(Giuliani, Seattle)) Both steps use pre-computed language models => scales to Open IE

  13. Typechecking and HMM-T Task: For each extraction (arg1, arg2) ∈ E_R, determine whether arg1 and arg2 are of the proper type for R. Solution: Assume the seeds seed_j ∈ S_R are of the proper type, and rank each arg_j by distributional similarity to each seed_j. Computing Distributional Similarity: • Offline, train a Hidden Markov Model (HMM) of the corpus • At query time, measure the distance between arg_j and seed_j in the HMM’s N-dimensional latent state space.

  14. HMM Language Model (k = 1 case) [Diagram: hidden states t_i, t_{i+1}, t_{i+2}, t_{i+3} emit the words w_i, …, w_{i+3}, illustrated on the phrase “cities such as Seattle”.] Offline Training: Learn P(w | t) and P(t_i | t_{i-1}, …, t_{i-k}) to maximize the probability of the corpus (using EM).
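For concreteness, here is a minimal sketch of how a k = 1 HMM of this kind assigns probability to a phrase (the forward algorithm); the tiny parameter values are invented for illustration and are unrelated to the trained model from the talk.

```python
import numpy as np

def sequence_prob(words, vocab, pi, trans, emit):
    """Forward algorithm: P(w_1, ..., w_L) under an order-1 HMM.

    pi    : (N,)   initial state distribution P(t_1)
    trans : (N, N) trans[a, b] = P(t_i = b | t_{i-1} = a)
    emit  : (N, V) emit[t, w]  = P(w | t)
    """
    ids = [vocab[w] for w in words]
    alpha = pi * emit[:, ids[0]]                 # joint P(t_1, w_1)
    for wid in ids[1:]:
        alpha = (alpha @ trans) * emit[:, wid]   # sum out the previous state
    return float(alpha.sum())

# Toy usage with N = 2 hidden states and a 4-word vocabulary.
vocab = {"cities": 0, "such": 1, "as": 2, "Seattle": 3}
pi = np.array([0.6, 0.4])
trans = np.array([[0.7, 0.3],
                  [0.4, 0.6]])
emit = np.array([[0.4, 0.3, 0.2, 0.1],
                 [0.1, 0.2, 0.3, 0.4]])
print(sequence_prob(["cities", "such", "as", "Seattle"], vocab, pi, trans, emit))
```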

  15. HMM-T The trained HMM gives a “distributional summary” of each word w: an N-dimensional latent state distribution P(t | w). Typecheck each arg by comparing its state distribution to the seeds’: rank extractions in ascending order of f(arg), summed over the extraction’s arguments, where f measures the distance to the seeds in state space.
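The transcript does not preserve the exact distance f shown on the slide; the sketch below assumes a symmetrized KL divergence between state distributions and a minimum over seeds, purely as one plausible instantiation.

```python
import numpy as np

def state_distribution(word, state_counts):
    """P(t | word): normalized counts of latent states assigned to the word's occurrences."""
    counts = np.asarray(state_counts[word], dtype=float)
    return counts / counts.sum()

def kl_sym(p, q, eps=1e-12):
    """Symmetrized KL divergence between two state distributions (assumed choice of f)."""
    p, q = p + eps, q + eps
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

def typecheck_score(arg_words, seed_words, state_counts):
    """f(arg) summed over the extraction's arguments; smaller = more seed-like, so rank ascending."""
    return sum(
        min(kl_sym(state_distribution(a, state_counts),
                   state_distribution(s, state_counts)) for s in seed_words)
        for a in arg_words
    )
```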

  16. Previous n-gram technique (1) 1) Form a context vector for each extracted argument, e.g., from the sentences “… cities such as Chicago , Boston , …”, “But Chicago isn’t the best …”, “… cities such as Chicago , Boston , Los Angeles and Chicago . …” we get the contexts “such as <x> , Boston”, “But <x> isn’t the”, “Angeles and <x> .”, … 2) Compute dot products between extractions and seeds in this space [cf. Ravichandran et al. 2005].
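A small sketch of this baseline, with an invented toy corpus and a fixed context window as illustrative assumptions:

```python
from collections import Counter

def context_vector(arg, sentences, window=2):
    """Bag of contexts around each occurrence of `arg`, with the argument replaced by <x>."""
    vec = Counter()
    for sent in sentences:
        toks = sent.split()
        for i, tok in enumerate(toks):
            if tok == arg:
                left, right = toks[max(0, i - window):i], toks[i + 1:i + 1 + window]
                vec[" ".join(left + ["<x>"] + right)] += 1
    return vec

def dot(u, v):
    """Dot product of two sparse context vectors."""
    return sum(count * v[ctx] for ctx, count in u.items())

sentences = ["cities such as Chicago , Illinois",
             "he visited Pickerington , Ohio"]
# No shared contexts, so the dot product is 0 even though both are cities.
print(dot(context_vector("Chicago", sentences), context_vector("Pickerington", sentences)))
```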

  17. Previous n-gram technique (2) [Diagram: context vectors for Miami and Twisp over contexts such as “visited X and other”, “X and other cities”, “when he visited X”, “he visited X and”.] Problems: • Vectors are large • Intersections are sparse

  18. Compressing Context Vectors Replace the context vector for Miami with P(t | Miami), the latent state distribution P(t | w) over states t = 1, 2, …, N. • Compact (efficient – 10-50x less data retrieved) • Dense (accurate – 23-46% error reduction)

  19. Example: N-Grams on Sparse Data Is Pickerington of the same type as Chicago? The corpus contains “Chicago , Illinois” and “Pickerington , Ohio”, so Chicago’s context vector contains “<x> , Illinois” while Pickerington’s contains “<x> , Ohio” => N-grams say no, the dot product is 0!

  20. Example: HMM-T on Sparse Data The HMM generalizes: “Chicago , Illinois” and “Pickerington , Ohio” are assigned similar latent states.

  21. HMM-T Limitations Learning iterations take time proportional to (corpus size × T^(k+1)), where T = number of latent states and k = HMM order. We use limited values T = 20, k = 3: • Sufficient for typechecking (Santa Clara is a city) • Too coarse for relation assessment (Santa Clara is where Intel is headquartered)

  22. The REALM Architecture Two steps for assessing R(arg1, arg2) • Typechecking: ensure arg1 and arg2 are of the proper type for R (counterexample: MayorOf(Intel, Santa Clara)); leverages all occurrences of each arg • Relation Assessment: ensure R actually holds between arg1 and arg2 (counterexample: MayorOf(Giuliani, Seattle)) Both steps use pre-computed language models => scales to Open IE

  23. Relation Assessment • Type checking isn’t enough: “NY Mayor Giuliani toured downtown Seattle.” • Want: how do the arguments behave in relation to each other?

  24. REL-GRAMS (1) N-gram language model: P(w_i, w_{i-1}, …, w_{i-k}). But arg1 and arg2 are often far apart => requires large k (inaccurate).

  25. REL-GRAMS (2) Relational Language Model (REL-GRAMS): for any two arguments e1, e2, model P(w_i, w_{i-1}, …, w_{i-k} | w_i = e1, e1 near e2). k can be small – REL-GRAMS still captures entity relationships. • Mitigate sparsity with the BM25 metric (from IR). Combine with HMM-T by multiplying ranks.
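A rough sketch of this idea: count short contexts ending at e1, but only where e2 occurs nearby, and combine the two assessors by multiplying ranks. The window size is an invented parameter, and the slide's BM25 sparsity weighting is not reproduced here.

```python
from collections import Counter

def relgram_contexts(sentences, e1, e2, k=2, window=10):
    """Contexts of length k+1 ending at an occurrence of e1, counted only when e2
    appears within `window` tokens; normalizing these counts approximates
    P(w_i, ..., w_{i-k} | w_i = e1, e1 near e2)."""
    contexts = Counter()
    for sent in sentences:
        toks = sent.split()
        e2_positions = [j for j, t in enumerate(toks) if t == e2]
        for i, t in enumerate(toks):
            if t == e1 and any(abs(i - j) <= window for j in e2_positions):
                contexts[tuple(toks[max(0, i - k):i + 1])] += 1
    return contexts

def combine_ranks(hmm_t_rank, relgrams_rank):
    """Combine the two assessors by multiplying ranks (lower product = better), as on the slide."""
    return sorted(hmm_t_rank, key=lambda pair: hmm_t_rank[pair] * relgrams_rank[pair])
```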

  26. Experiments Task: Re-rank sparse TextRunner extractions for Conquered, Founded, Headquartered, Merged REALM vs. • TextRunner (TR) – frequency ordering (equivalent to PMI [Etzioni et al, 2005] and Urns [Downey et al, 2005]) • Pattern Learning (PL) – based on Snowball [Agichtein 2000] • HMM-T and REL-GRAMS in isolation

  27. Results Metric: Area under precision-recall curve. REALM reduces missing area by 39% over nearest competitor.
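For reference, the metric can be computed roughly as below; the ranking shown is toy data, and average precision is used as the usual estimate of area under the precision-recall curve.

```python
def auc_pr(ranked_labels):
    """Average precision of a ranked list of correct/incorrect extractions
    (a standard estimate of area under the precision-recall curve)."""
    correct, precisions = 0, []
    for i, label in enumerate(ranked_labels, start=1):
        if label:
            correct += 1
            precisions.append(correct / i)
    return sum(precisions) / correct if correct else 0.0

# Toy ranking: True = correct extraction. "Missing area" = 1 - AUC-PR.
ranking = [True, True, False, True, False]
print("AUC-PR:", auc_pr(ranking), " missing area:", 1 - auc_pr(ranking))
```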

  28. Conclusions Sparse extractions are common, even on the Web Language models can assess sparse extractions • Accurate • Scalable Future Work • Other language modeling techniques

  29. Web Fact-Finding Who has won three or more Academy Awards?

  30. Web Fact-Finding Problems: the user has to pick the right words, often a tedious process: “world foosball champion in 1998” – 0 hits; “world foosball champion” 1998 – 2 hits, no answer. What if I could just ask for P(x) in “x was world foosball champion in 1998”? How far can language modeling and the distributional hypothesis take us?

  31. Thanks!

  32. KnowItAll Hypothesis vs. Distributional Hypothesis [Diagram: example contexts – “X and other cities”, “he visited X and”, “cities such as X”, “X soundtrack”, “X lodging” – linked to the arguments Miami, Twisp, and Star Wars.]

  33. KnowItAll Hypothesis vs. Distributional Hypothesis [Diagram repeated from the previous slide.]

  34. TextRunner in real time: extractions for “invent”, ranked by frequency. REALM improves precision of the top 20 extractions by an average of 90%.

  35. Improving TextRunner: Example (1) – “headquartered”, top 10 extractions. REALM top 10 (precision 100%): Tarantella, Santa Cruz; International Business Machines Corporation, Armonk; Mirapoint, Sunnyvale; ALD, Sunnyvale; PBS, Alexandria; General Dynamics, Falls Church; Jupitermedia Corporation, Darien; Allegro, Worcester; Trolltech, Oslo; Corbis, Seattle. TextRunner top 10 (precision 40%): company, Palo Alto; held company, Santa Cruz; storage hardware and software, Hopkinton; Northwestern Mutual, Tacoma; 1997, New York City; Google, Mountain View; PBS, Alexandria; Linux provider, Raleigh; Red Hat, Raleigh; TI, Dallas.

  36. Improving TextRunner: Example (2) – “conquered”, top 10 extractions. REALM top 10 (precision 90%): Arabs, Rhodes; Arabs, Istanbul; Assyrians, Mesopotamia; Great, Egypt; Assyrians, Kassites; Arabs, Samarkand; Manchus, Outer Mongolia; Vandals, North Africa; Arabs, Persia; Moors, Lagos. TextRunner top 10 (precision 60%): Great, Egypt; conquistador, Mexico; Normans, England; Arabs, North Africa; Great, Persia; Romans, part; Romans, Greeks; Rome, Greece; Napoleon, Egypt; Visigoths, Suevi Kingdom.
