
The Physics of Text: Ontological Realism in Information Extraction


Presentation Transcript


  1. The Physics of Text: Ontological Realism in Information Extraction Stuart Russell Joint work with Justin Uang, Ole Torp Lassen, Wei Wang

  2. Models of text
  • Standard generative models of text:
    • N-grams, PCFGs, LDAs, relational models, etc.
  • They describe distributions over text
    • what sentences or word sequences look like
    • analogous to the Ptolemaic epicycle model of the solar system
  • They do not explain why the text is on the page
  • What is Newton’s theory of text?

  3. A trivial causal theory
  • There is a world, composed of facts
  • Someone picks a fact to say
  • They choose a way to say it
  • They say it
  [cf. Melcuk; Charniak and Goldman]

  4. The theories are different
  • Standard generative models
    • predict that a large enough corpus will contain, e.g., every possible sentence of the form “<person A> wrote <book B>”
  • A fact-based model
    • predicts that the corpus contains “<person A> wrote <book B>” only if A wrote B
  • Sentences are coupled by a latent world

  5. Bootstrapping à la Brin (1999)
  • Bootstrapping is the core of CMU NELL, UW Machine Reading, etc.
  • Begin with facts in a specific relation:
    • Author(CharlesDickens, GreatExpectations), etc.
  • Look for sentences containing the argument pairs:
    • “CharlesDickens wrote GreatExpectations”
    • and add the pattern “x wrote y” to the set of patterns for Author
  • Look for sentences containing the same pattern:
    • “JKRowling wrote HarryPotter”
    • and add the fact Author(JKRowling, HarryPotter) to the set of facts for Author
  • Repeat until all Author facts have been extracted (a minimal sketch of this loop follows below)
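  A minimal Python sketch of this bootstrapping loop, assuming sentences have already been reduced to (arg1, pattern, arg2) triples; the function and variable names are illustrative, not from the talk:

    # Sketch of Brin-style (DIPRE) bootstrapping for a single relation.
    def bootstrap(triples, seed_facts, max_iters=10):
        """Alternately grow the pattern set and the fact set for one relation."""
        facts = set(seed_facts)   # e.g. {("CharlesDickens", "GreatExpectations")}
        patterns = set()          # e.g. will come to contain "wrote"
        for _ in range(max_iters):
            # 1. Sentences whose argument pair is a known fact donate their pattern.
            new_patterns = {p for (a, p, b) in triples if (a, b) in facts}
            # 2. Sentences whose pattern is known donate their argument pair as a fact.
            new_facts = {(a, b) for (a, p, b) in triples if p in patterns | new_patterns}
            if new_patterns <= patterns and new_facts <= facts:
                break             # fixed point: nothing new was added
            patterns |= new_patterns
            facts |= new_facts
        return facts, patterns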

  6. Bootstrapping contd.
  Errors made by the basic algorithm:
  • Type 1: same arguments, different patterns, different relations
    • “JKRowling has made a lot of money from HarryPotter”
    • incorrect patterns are added for the Author relation
  • Type 2: same pattern, different arguments, different relations
    • “JKRowling wrote NevilleLongbottom out of the movie”
    • incorrect tuples are added to the Author relation
  Why does bootstrapping work? Does a well-motivated version fix these problems?

  7. A trivial generative model
  • How the world of facts is generated:
    • there are N objects and K relations
    • for each R, x, y, R(x,y) holds with probability σR
  • How facts are selected for “reporting”:
    • uniformly from the set of facts
  • How a fact R(x,y) is expressed as a sentence:
    • text “x w y”, where
    • the global “relation string” dictionary contains k words
    • word w ~ Categorical[p1,…,pk] specific to R
    • parameters p1,…,pk ~ Dirichlet[α1,…,αk]
  (a sampler sketch follows below)
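  A small sketch of sampling from this generative model with numpy; the concrete parameter values (N, K, vocabulary size, σ, α) are illustrative assumptions, not values from the talk:

    import numpy as np

    rng = np.random.default_rng(0)
    N, K, VOCAB = 50, 5, 20          # objects, relations, relation-string dictionary size
    sigma = 0.01                      # sparsity: probability that R(x,y) holds
    alpha = np.ones(VOCAB) * 0.1      # Dirichlet prior over each relation's word choices

    # 1. Generate the latent world: which facts R(x,y) hold.
    holds = rng.random((K, N, N)) < sigma
    facts = list(zip(*np.nonzero(holds)))          # list of (r, x, y) triples

    # 2. Each relation gets its own distribution over relation words.
    dictionaries = rng.dirichlet(alpha, size=K)

    # 3. Generate sentences: pick a fact uniformly, then a word for its relation.
    def generate_sentence():
        r, x, y = facts[rng.integers(len(facts))]  # uniform choice of fact to report
        w = rng.choice(VOCAB, p=dictionaries[r])   # relation-specific word choice
        return (x, w, y)                           # the "sentence" x w y

    corpus = [generate_sentence() for _ in range(100)]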

  8. BLOG model
  #Object ~ OM(3,1); #Relation ~ OM(2,1)
  Dictionary(r) ~ Dirichlet(α, WordList);
  Sparsity(r) ~ Beta(10, 1000);
  Holds(r,x,y) ~ Boolean(Sparsity(r));
  ChosenFact(s) ~ Uniform({f : Holds(f)})
  Subject(s) = Arg1(ChosenFact(s))
  Object(s) = Arg2(ChosenFact(s))
  Verb(s) ~ Categorical(Dictionary(Rel(ChosenFact(s))))
  <sentence data>
  Query: posterior over worlds, or MAP world, or P(Rel(ChosenFact(s1)) = Rel(ChosenFact(s2)))

  9. When does bootstrapping work?
  Under what parameter scenarios does the posterior recover the true world accurately?
  • Sparsity: σR << 1 (R(x,y) is false for most x,y pairs)
  • Independence between relations
  • No polysemy (each word w expresses one relation)
  • Overlap in sentence content (cf. redundancy):
    • many sentences share their arguments
    • many sentences share their relation words

  10. Sparsity
  • Most relations are very sparse:
    • Married(x,y) holds for about 2B/(7B)² of pairs, or 1 in 25B (in real data, population size is fame-adjusted; worked arithmetic below)
  • If relations are sparse and independent, worlds with two different relations for the same A,B argument pair are much less likely than worlds with one; i.e., “pure coincidence” is unlikely
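  A worked version of the arithmetic behind the “1 in 25B” figure, under the reading that roughly 2 billion ordered (x,y) pairs satisfy Married (about 2 billion married people) out of (7 billion)² ordered person pairs:

    \sigma_{\text{Married}} \;\approx\; \frac{2 \times 10^{9}}{(7 \times 10^{9})^{2}}
      \;=\; \frac{2 \times 10^{9}}{4.9 \times 10^{19}}
      \;\approx\; 4.1 \times 10^{-11}
      \;\approx\; \frac{1}{2.5 \times 10^{10}}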

  11. Example
  • Given:
    • “CharlesDickens wrote GreatExpectations”
    • “CharlesDickens писал GreatExpectations”
  • What is the probability that “писал” is expressing the same fact as “wrote”?

  12. General formula
  • Given N objects, sparseness σ, independence, 2 relations
  • X = “писал” means wrote; Y = “писал” is distinct
  • Odds ratio P(X,e)/P(Y,e) = [formula shown on the slide; a rough sketch follows below]
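  The slide’s exact expression did not survive in this transcript. The following back-of-the-envelope comparison, using only the ingredients defined on slide 7, is a hedged sketch of why the ratio scales as 1/σ; q_X and q_Y are placeholders for the σ-independent fact-selection and dictionary terms:

    % Evidence e: two sentences with the same argument pair (A,B),
    % one containing "wrote", the other containing "писал".
    % Under X (same relation R), e needs only the single fact R(A,B) to hold;
    % under Y (distinct relation R'), both R(A,B) and R'(A,B) must hold independently.
    P(X, e) \propto \sigma \, q_X, \qquad
    P(Y, e) \propto \sigma^{2} \, q_Y,
    \qquad\Longrightarrow\qquad
    \frac{P(X,e)}{P(Y,e)} \;=\; \frac{q_X}{\sigma \, q_Y} \;=\; O\!\left(\tfrac{1}{\sigma}\right)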

  13. Bootstrap is based on world sparsity
  • For small σ, the odds ratio is O(1/σ)
  • I.e., bootstrap inference is reliable

  14. Caveat: non-independence
  • Type 1 errors: real relations are not independent
    • e.g., Divorced(x,y) => Married(x,y)
    • e.g., Married(HenryVIII, AnneBoleyn), Beheaded(HenryVIII, AnneBoleyn)
  • Fixes:
    • allow relations to be subrelations or “de novo” relations
    • subrelations also support fact -> fact inference
    • some generic allowance for undirected correlation

  15. Odds ratio for subrelation case

  16. Comparison: Subrelations
  • The probability that “married” = “divorced” increases as we add pairs that are both married and divorced
  • The independent prior leads to overconfidence
  • The subrelation prior gives a more reasonable confidence level

  17. Robustness to type 2 errors (polysemy)

  18. Experiment
  • ~8500 sentences of pre-parsed NY Times text (Yao et al., EMNLP 2011); 4300 distinct dependency paths
    • <named entity> <dependency path> <named entity>
    • e.g., J. Edgar Hoover  appos->director->prep->of->pobj  F.B.I.
  • Unsupervised Bayesian inference on the “text”, using the model above:
    • automatic relation discovery plus text pattern learning
    • simultaneous extraction of facts
  • Each dependency path or “trigger” is treated as atomic; no features at all (representation sketch below)
  • Inference: smart-dumb/dumb-smart MCMC (UAI 2015)
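  A sketch of the data representation assumed above; the type and helper names are illustrative, not the authors’ code. Each pre-parsed item becomes an (arg1, trigger, arg2) triple, with the whole dependency path kept as one atomic trigger string:

    from collections import defaultdict, namedtuple

    Mention = namedtuple("Mention", ["arg1", "trigger", "arg2"])

    mentions = [
        Mention("J. Edgar Hoover", "appos->director->prep->of->pobj", "F.B.I."),
        # ... ~8500 such triples
    ]

    # The latent-world model couples mentions in two ways, which inference exploits:
    by_args = defaultdict(list)     # mentions sharing an argument pair: likely the same fact
    by_trigger = defaultdict(list)  # mentions sharing a trigger: likely the same relation
    for m in mentions:
        by_args[(m.arg1, m.arg2)].append(m)
        by_trigger[m.trigger].append(m)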

  19. Relation [rel_46]: text patterns
  appos->unit->prep->of->pobj
  appos->part->prep->of->pobj
  nn<-unit->prep->of->pobj
  partmod->own->prep->by->pobj
  rcmod->own->prep->by->pobj
  appos->subsidiary->prep->of->pobj
  rcmod->part->prep->of->pobj
  rcmod->unit->prep->of->pobj
  poss<-parent->appos
  appos->division->prep->of->pobj
  pobj<-of<-prep<-office->appos->part->prep->of->pobj
  pobj<-of<-prep<-unit->appos->part->prep->of->pobj
  nn<-division->prep->of->pobj
  appos->unit->nn
  nsubjpass<-own->prep->by->pobj
  nn<-office->prep->of->pobj

  20. Relation [rel_46]: extracted facts
  rel_46(ABC, Walt Disney Company)
  rel_46(American Airlines, AMR Corporation)
  rel_46(American, AMR Corporation)
  rel_46(Arnold Worldwide, Arnold Worldwide Partners division)
  rel_46(BBDO Worldwide, Omnicom Group)
  rel_46(Bozell Worldwide, Bozell)
  rel_46(Chicago, DDB Worldwide)
  rel_46(Conde Nast Publications, Advance Publications)
  rel_46(DDB Needham Worldwide, Omnicom Group)
  rel_46(DDB Worldwide, Omnicom Group)
  rel_46(Eastern, Texas Air Corporation)
  rel_46(Electronic Data Systems Corporation, General Motors Corporation)
  rel_46(Euro RSCG Worldwide, Havas Advertising)
  rel_46(Euro RSCG Worldwide, Havas)
  rel_46(Fallon Worldwide, PublicisGroupe)
  rel_46(Foote, True North Communications)
  rel_46(Fox, News Corporation)
  rel_46(Goodby, Omnicom Group)
  rel_46(Grey Worldwide, Grey Global Group)
  rel_46(Hughes, General Motors Corporation)

  21. Relation [rel_46]: extracted facts (contd.)
  rel_46(J. Walter Thompson, WPP Group)
  rel_46(Kellogg Brown & Root, Halliburton)
  rel_46(Kellogg, Halliburton)
  rel_46(Kraft General Foods, Philip Morris Cos.)
  rel_46(Lorillard Tobacco, Loews Corporation)
  rel_46(Lowe Group, Interpublic Group of Companies)
  rel_46(McCann-Erickson World Group, Interpublic Group of Companies)
  rel_46(NBC, General Electric Company)
  rel_46(New York, BBDO Worldwide)
  rel_46(New York, Hill)
  rel_46(Ogilvy & Mather Worldwide, WPP Group)
  rel_46(Saatchi & Saatchi, PublicisGroupe)
  rel_46(Salomon Smith Barney, Citigroup)
  rel_46(San Francisco, Foote)
  rel_46(Sears Receivables Financing Group Inc., Sears)
  rel_46(TBWA Worldwide, Omnicom Group)
  rel_46(United, UAL Corporation)
  rel_46(United, UAL)
  rel_46(Young & Rubicam, WPP Group)

  22. Evaluation
  • We cannot (in general) inspect and check the “KB”
    • cf. inspecting each other’s brains
  • A relation symbol may “mean” something different across possible worlds sampled by MCMC
  • Only solution: ask questions in natural language

  23. Next steps
  • Entity resolution: generative models of entity mentions
  • Ontology: types, time, events, vector-space meaning
  • Pragmatics: choice of facts, effect of context
  • Grammar: learning the missing link [diagram relating text, syntax, and meaning]
