
Self-supervised Probabilistic Methods for Extracting Facts from Text


Presentation Transcript


  1. Self-supervised Probabilistic Methods for Extracting Facts from Text Doug Downey

  2. Web Search: Answering Questions Q: Who did IBM acquire in 2002? A: “IBM acquired * in 2002” Q: Who has won a best actor Oscar for playing a villain? A: “won best actor for playing a villain” – 0 hits! The answer isn’t on just one Web page

  3. Solution: Synthesizing Across Pages Q: Who has won a best actor Oscar for playing a villain? A: Find all $X where the following appear: “$X won best actor for $Y”; “$X, who played $Z in $Y”; “the villain, $Z”. “Forest Whitaker won best actor for The Last King of Scotland” – 210 hits; “Forest Whitaker, who played Idi Amin in The Last King of Scotland” – 4 hits; “the villain, Idi Amin” – 1 hit. Answer: Forest Whitaker
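The synthesis step above can be sketched as a set intersection over pattern matches. The helper name and data structures below are illustrative, not part of the system described in the talk:

```python
def synthesize(best_actor_hits, played_in_hits, villain_hits):
    """best_actor_hits: set of (actor, film) from "$X won best actor for $Y"
       played_in_hits:  set of (actor, role, film) from "$X, who played $Z in $Y"
       villain_hits:    set of roles from "the villain, $Z"
    Returns actors supported by all three patterns jointly."""
    answers = set()
    for actor, role, film in played_in_hits:
        if (actor, film) in best_actor_hits and role in villain_hits:
            answers.add(actor)
    return answers

# Toy data mirroring the slide's hits:
best_actor = {("Forest Whitaker", "The Last King of Scotland")}
played_in = {("Forest Whitaker", "Idi Amin", "The Last King of Scotland")}
villains = {"Idi Amin"}
print(synthesize(best_actor, played_in, villains))  # {'Forest Whitaker'}
```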

  4. Self-supervised Information Extraction Given: one or more contexts indicating a semantic class C, e.g., “$X starred in $Y” => StarredIn($X, $Y) • User-specified (TextRunner [Banko et al., IJCAI 2007]) • Automatically generated (KnowItAll [Etzioni et al., AIJ 2005]) • Bootstrapped from resources [Snow et al., NIPS 2004]. Output: instances of C – but extraction from contexts is highly imperfect! => Output P(x ∈ C) for each term x. Self-supervised – no hand-tagged examples

  5. Self-supervised Information Extraction Given: one or more contexts suggestive of a semantic class C, and a corpus of text. Output: P(x ∈ C) for each term x. KnowItAll Hypothesis • Terms x which occur in the suggestive contexts more frequently are more likely to be instances of C. Distributional Hypothesis • Terms in the same class tend to appear in similar contexts. My task: formalizing these heuristics into statements about P(x ∈ C) given a corpus.

  6. Who cares about Probabilities? Why not use rankings (e.g., the precision/recall metric)? P( WonBestActorFor(Forest Whitaker, The Last King of Scotland) ) and P( PlayedVillainIn(Forest Whitaker, The Last King of Scotland) ) => Our goal: an estimate of the probability that Forest Whitaker won best actor for playing a villain. Not possible with rankings! In fact, combining even perfect rankings can yield poor accuracy.
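The point of the slide can be made concrete: calibrated probabilities for two facts combine into a joint estimate, while rank positions cannot be multiplied. The 0.9 and 0.8 values and the independence assumption are illustrative, not claims from the talk:

```python
# Two per-fact probability estimates (invented values for illustration):
p_won_best_actor = 0.9  # P(WonBestActorFor(Forest Whitaker, The Last King of Scotland))
p_played_villain = 0.8  # P(PlayedVillainIn(Forest Whitaker, The Last King of Scotland))

# Joint estimate, assuming (simplistically) the two facts are independent:
p_both = p_won_best_actor * p_played_villain
print(round(p_both, 2))  # 0.72 -- no analogous operation exists for ranks
```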

  7. Outline • Two Research Questions • URNS model • REALM • Proposal for DH • Chez KnowItAll

  8. Term-Context Matrix [figure: matrix with terms as rows and contexts as columns; terms, e.g., Miami, (Robert De Niro, Raging Bull), …, are potential elements of C]

  9. Term-Context Matrix [figure: the same matrix; contexts, e.g., “cities such as $X”, “$X said $Y offered to”; also possible: parse trees, bags of words, containing Web domain, etc.]

  10. KnowItAll Hypothesis / Distributional Hypothesis [figure: term-context matrix with rows Miami, Twisp, Star Wars and columns such as “he visited X and other cities”, “cities such as X”, “X soundtrack”, “X lodging”]

  11. Two Research Questions M – term-context matrix; M_C – columns of M for contexts suggesting C; P(x ∈ C) – prior estimate that x ∈ C. Formalizing the KnowItAll hypothesis: what is an expression for P(x ∈ C | M_C)? Formalizing the distributional hypothesis: what is an expression for P(x ∈ C | M)?

  12. Key Requirements for Models • Produce probabilities • Execute at “interactive” speed • No hand-tagged data

  13. Outline • Two Research Questions • URNS model • REALM • Proposal for DH • Chez KnowItAll

  14. [figure repeated: the term-context matrix illustrating the KnowItAll and Distributional Hypotheses]

  15. [figure repeated: the term-context matrix, highlighting the KnowItAll Hypothesis]

  16. 1. Modeling Redundancy – The Problem Consider a single context, e.g.: “cities such as x”. If an extraction x appears k times in a set of n sentences containing this pattern, what is the probability that x ∈ C?

  17. Modeling with k Country(x) extractions, n = 10: “…countries such as Saudi Arabia…” “…countries such as the United States…” “…countries such as Saudi Arabia…” “…countries such as Japan…” “…countries such as Africa…” “…countries such as Japan…” “…countries such as the United Kingdom…” “…countries such as Iraq…” “…countries such as Afghanistan…” “…countries such as Australia…”

  18. Modeling with k Country(x) extractions, n = 10. Noisy-Or Model, where p is the probability that a single sentence is true (here p = 0.9): Saudi Arabia (k = 2, 0.99), Japan (k = 2, 0.99), United States (k = 1, 0.9), Africa (k = 1, 0.9), United Kingdom (k = 1, 0.9), Iraq (k = 1, 0.9), Afghanistan (k = 1, 0.9), Australia (k = 1, 0.9). Important: sample size (n) and the distribution of C – noisy-or ignores these.
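The noisy-or scores on this slide follow from a one-line formula, a minimal sketch assuming each of the k extraction sentences is independently true with probability p:

```python
def noisy_or(k, p=0.9):
    """P(x in C) under noisy-or: at least one of k independent
    extraction sentences, each true with probability p, is genuine."""
    return 1 - (1 - p) ** k

# Reproduces the slide's numbers for p = 0.9:
print(round(noisy_or(2), 2))  # 0.99 (Saudi Arabia, Japan: k = 2)
print(round(noisy_or(1), 2))  # 0.9  (the singleton extractions)
```

Note that n never appears: the score for k = 1 is 0.9 whether the sample has 10 sentences or 50,000, which is exactly the weakness the next slides exploit.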

  19. Needed in Model: Sample Size Country(x) extractions, n = 10 (as above) vs. n ≈ 50,000: Japan (k = 1723, 0.9999…), Norway (k = 295, 0.9999…), and singletons Israil, OilWatch Africa, Religion, Paraguay Chicken Mole, Republics of Kenya, Atlantic Ocean, New Zeland (k = 1, 0.9 each). As sample size increases, noisy-or becomes inaccurate.

  20. Needed in Model: Distribution of C Country(x) extractions, n ≈ 50,000: Japan (k = 1723, 0.9999…), Norway (k = 295, 0.9999…), and singletons Israil, OilWatch Africa, Religion, Paraguay Chicken Mole, Republics of Kenya, Atlantic Ocean, New Zeland (k = 1, 0.9 each).

  21. Needed in Model: Distribution of C Country(x) extractions, n ≈ 50,000: Japan (k = 1723, 0.9999…), Norway (k = 295, 0.9999…), with the singleton extractions now assigned probability 0.05 each.

  22. Needed in Model: Distribution of C Country(x) extractions, n ≈ 50,000 (as above) vs. City(x) extractions, n ≈ 50,000: Toronto (k = 274, 0.9999…), Belgrade (k = 81, 0.98), and singletons Lacombe, Kent County, Nikki, Ragaz, Villegas Cres, Northeastwards (k = 1, 0.05 each). The probability that x ∈ C depends on the distribution of C.

  23. The URNS Model – Single Urn

  24. The URNS Model – Single Urn [figure: urn for City(x) containing balls labeled Tokyo, Tokyo, Sydney, U.K., U.K., Cairo, Utah, Atlanta, Atlanta, Yakima]

  25. The URNS Model – Single Urn [figure: drawing a ball labeled Tokyo from the City(x) urn corresponds to observing “…cities such as Tokyo…” in the corpus]

  26. Single Urn – Formal Definition C – set of unique target labels; E – set of unique error labels; num(b) – number of balls labeled by b ∈ C ∪ E; num(B) – distribution giving the number of balls for each label b ∈ B.

  27. Single Urn Example Urn for City(x): Tokyo, Tokyo, Sydney, U.K., U.K., Cairo, Utah, Atlanta, Atlanta, Yakima. num(“Atlanta”) = 2; num(C) = {2, 2, 1, 1, 1}; num(E) = {2, 1} (estimated from data).

  28. Single Urn: Computing Probabilities If an extraction x appears k times in a set of n sentences containing a pattern, what is the probability that x ∈ C?

  29. Single Urn: Computing Probabilities Given that an extraction x appears k times in n draws from the urn (with replacement), what is the probability that x ∈ C? P(x ∈ C | k, n) = Σ_{r ∈ num(C)} (r/s)^k (1 − r/s)^{n−k} / Σ_{r′ ∈ num(C ∪ E)} (r′/s)^k (1 − r′/s)^{n−k}, where s is the total number of balls in the urn.

  30. Uniform Special Case Consider the case where num(c_i) = R_C and num(e_j) = R_E for all c_i ∈ C, e_j ∈ E. Then, using a Poisson approximation, the posterior odds are P(x ∈ C | k, n) / P(x ∉ C | k, n) ≈ (|C| / |E|) (R_C / R_E)^k e^{−n(R_C − R_E)/s}. Odds increase exponentially with k, but decrease exponentially with n.
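The uniform special case can be sketched numerically. The urn parameters below (200 target labels with 20 balls each, 1,000 error labels with 1 ball each) are invented for illustration; the odds expression is the Poisson-approximated form from the slide:

```python
import math

def urns_prob(k, n, num_C=200, num_E=1000, R_C=20, R_E=1):
    """P(x in C | x drawn k times in n draws), uniform special case.
    Each target label has R_C balls, each error label R_E balls."""
    s = num_C * R_C + num_E * R_E  # total balls in the urn
    # Poisson approximation: prior odds |C|/|E| times the likelihood
    # ratio, rising exponentially in k and falling exponentially in n.
    odds = (num_C / num_E) * (R_C / R_E) ** k * math.exp(-n * (R_C - R_E) / s)
    return odds / (1 + odds)

# A k = 2 extraction is near-certain at n = 10, but almost certainly
# noise at n = 50,000 -- the sample-size effect noisy-or misses:
print(urns_prob(k=2, n=10))     # close to 1
print(urns_prob(k=2, n=50000))  # close to 0
```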

  31. The URNS Model – Multiple Urns Correlation across contexts is higher for elements of C than for elements of E.

  32. Unsupervised Performance

  33. Outline • Two Research Questions • URNS model • REALM • Proposal for DH • Chez KnowItAll

  34. Redundancy fails on “sparse” facts Frequent extractions, e.g., (Michael Bloomberg, New York City), tend to be correct; sparse extractions, e.g., (Dave Shaver, Pickerington) and (Ronald McDonald, McDonaldland), are a mixture of correct and incorrect.

  35. [figure repeated: the term-context matrix illustrating the KnowItAll and Distributional Hypotheses]

  36. [figure repeated: the term-context matrix, highlighting the Distributional Hypothesis]

  37. Assessing Sparse Extractions Task: Identify which sparse extractions are correct. Strategy: • Build a model of how common extractions occur in text • Rank sparse extractions by fit to model • The distributional hypothesis! Our contribution: Unsupervised language models. • Methods for mitigating sparsity • Precomputed, so greatly improved scalability

  38. The REALM Architecture RElation Assessment using Language Models Input: set of extractions for relation R, E_R = {(arg1_1, arg2_1), …, (arg1_M, arg2_M)} • Seeds S_R = the s most frequent pairs in E_R (assume these are correct) • Output: ranking of (arg1, arg2) ∈ E_R by distributional similarity to each (seed1, seed2) in S_R
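The architecture's outer loop can be sketched as follows. The function name and the `similarity` callback are ours, standing in for the HMM-based measure the talk develops next:

```python
from collections import Counter

def realm_rank(extractions, similarity, s=10):
    """Sketch of REALM's outer loop: take the s most frequent argument
    pairs as seeds (assumed correct), then rank each distinct extraction
    by its best distributional similarity to any seed."""
    seeds = [pair for pair, _ in Counter(extractions).most_common(s)]
    return sorted(set(extractions),
                  key=lambda pair: max(similarity(pair, seed) for seed in seeds),
                  reverse=True)

# Toy similarity: identical pairs score 1.0; pairs sharing arg2, 0.5.
def toy_sim(a, b):
    return 1.0 if a == b else (0.5 if a[1] == b[1] else 0.0)

extractions = [("Bloomberg", "New York")] * 3 + \
              [("Giuliani", "New York"), ("McDonald", "McDonaldland")]
print(realm_rank(extractions, toy_sim, s=1))
```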

  39. Distributional Similarity (1) N-gram Language Model: estimate P(w_i | w_{i−1}, …, w_{i−k}). Number of parameters scales with (vocabulary size)^{k+1}.

  40. Distributional Similarity (2) Naïve approach: compare context distributions P(w_g, …, w_j | seed1, seed2) and P(w_g, …, w_j | arg1, arg2). But j − g can be large: many parameters and sparse data => inaccuracy.

  41. The REALM Architecture Two steps for assessing R(arg1, arg2): • Typechecking – e.g., for AuthorOf(arg1, arg2), arg1 must be an author and arg2 a written work. Valuable, but allows errors like AuthorOf(Danielle Steele, Hamlet) • Relation Assessment – ensure R actually holds between arg1 and arg2. Both steps use small, pre-computed language models => scalable.

  42. Typechecking and HMM-T Task: for each extraction (arg1, arg2) ∈ E_R, determine if arg1 and arg2 are the proper type for R. Solution: assume the seeds seed_j ∈ S_R are of the proper type, and rank each arg_j by distributional similarity to each seed_j. Computing distributional similarity: • Offline, train a Hidden Markov Model (HMM) of the corpus • Measure distance between arg_j and seed_j in the HMM’s N-dimensional latent state space.

  43. HMM Language Model [figure, k = 1 case: hidden states t_i → t_{i+1} → t_{i+2} → t_{i+3} emitting the words w_i … w_{i+3}, e.g., “cities such as Seattle”] Offline training: learn P(w | t) and P(t_i | t_{i−1}, …, t_{i−k}) to maximize the probability of the corpus (using EM).

  44. HMM-T The trained HMM gives a “distributional summary” of each word w: the N-dimensional state distribution P(t | w). Typecheck each arg by comparing state distributions: rank extractions in ascending order of f(arg) summed over arguments.
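A minimal sketch of comparing state distributions. The slide does not pin down the distance f (its formula was on the slide image), so KL divergence is used here as an illustrative choice, and the three-state distributions are invented values, not trained ones:

```python
import math

def kl(p, q, eps=1e-12):
    """KL divergence between two latent-state distributions P(t | w)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def f(arg_dist, seed_dists):
    """Typechecking score: distance from an argument's state distribution
    to the nearest seed's (lower = better typed)."""
    return min(kl(arg_dist, sd) for sd in seed_dists)

# A city-like distribution (Twisp) sits far closer to a city seed (Miami)
# than a film title (Star Wars) does:
p_miami, p_twisp, p_star_wars = [0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.05, 0.1, 0.85]
print(f(p_twisp, [p_miami]) < f(p_star_wars, [p_miami]))  # True
```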

  45. Why not use context vectors? Miami and Twisp each get a sparse vector over contexts such as “he visited X and”, “X and other cities”, “when he visited X”. Problems: • Vectors are large • Intersections are sparse

  46. HMM-T Advantages (1) Instead of a sparse context vector, each word w is summarized by its latent state distribution P(t | w), t = 1, 2, …, N: • Compact (efficient – 10-50x less data retrieved) • Dense (accurate)

  47. HMM-T Advantages (2) Is Pickerington of the same type as Chicago? Chicago appears in “Chicago , Illinois”; Pickerington in “Pickerington , Ohio”. Their context vectors (“<x> , Illinois” vs. “<x> , Ohio”) share no entries, so the dot product is 0 => N-grams say no!

  48. HMM-T Advantages (3) The HMM generalizes across “Chicago , Illinois” and “Pickerington , Ohio”.
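The contrast drawn on slides 47-48 in two dot products. The state distributions in the second half are invented for illustration, standing in for what a trained HMM would produce:

```python
# Slide 47's failure case: exact-context vectors share no entries.
chicago      = {"<x> , Illinois": 1}
pickerington = {"<x> , Ohio": 1}
dot = sum(v * pickerington.get(ctx, 0) for ctx, v in chicago.items())
print(dot)  # 0 -- zero overlap, so N-grams say "not the same type"

# Slide 48's fix: if the HMM maps ", Illinois" and ", Ohio" contexts to
# the same hidden state, the two words' state distributions P(t | w)
# overlap even with no shared n-gram (illustrative values):
p_t_chicago      = [0.8, 0.1, 0.1]
p_t_pickerington = [0.7, 0.2, 0.1]
overlap = sum(a * b for a, b in zip(p_t_chicago, p_t_pickerington))
print(overlap > 0)  # True
```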

  49. HMM-T Limitations • Learning time is proportional to (corpus size × T^{k+1}) • T = number of latent states • k = HMM order • We use limited values T = 20, k = 3 • Sufficient for typechecking (Santa Clara is a city) • Too coarse for relation assessment (Santa Clara is where Intel is headquartered)

  50. Outline • Two Research Questions • URNS model • REALM • Proposal for Formalizing the DH • Chez KnowItAll
