
The Physics of Text: Ontological Realism in Information Extraction


Presentation Transcript


  1. The Physics of Text: Ontological Realism in Information Extraction Stuart Russell Joint work with Justin Uang, Ole Torp Lassen, Wei Wang

  2. Models of text
  • Standard generative models of text:
    • N-grams, PCFGs, LDAs, relational models, etc.
  • They describe distributions over text
    • what sentences or word sequences look like
    • analogous to the Ptolemaic epicycle model of the solar system
  • They do not explain why the text is on the page
  • What is Newton’s theory of text?

  3. A trivial causal theory
  • There is a world, composed of facts
  • Someone picks a fact to say
  • They choose a way to say it
  • They say it
  [cf. Melcuk; Charniak and Goldman]

  4. The theories are different
  • Standard generative models
    • predict that a large enough corpus will contain, e.g., every possible sentence of the form “<person A> wrote <book B>”
  • A fact-based model
    • predicts that the corpus contains “<person A> wrote <book B>” only if A wrote B
  • Sentences are coupled by a latent world

  5. Bootstrapping à la Brin (1999)
  • Bootstrapping is the core of CMU NELL, UW Machine Reading, etc.
  • Begin with facts in a specific relation:
    • Author(CharlesDickens, GreatExpectations), etc.
  • Look for sentences containing the argument pairs:
    • “CharlesDickens wrote GreatExpectations”
    • and add the pattern “x wrote y” to the set of patterns for Author
  • Look for sentences containing the same pattern:
    • “JKRowling wrote HarryPotter”
    • and add the fact Author(JKRowling, HarryPotter) to the set of facts for Author
  • Repeat until all Author facts have been extracted (a minimal sketch of this loop follows below)
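  A minimal Python sketch of this bootstrapping loop, assuming sentences have already been reduced to (arg1, pattern, arg2) triples; the function and variable names are illustrative, not from the talk:

    # Sketch of Brin-style (DIPRE) bootstrapping for a single relation.
    def bootstrap(triples, seed_facts, max_iters=10):
        """Alternately grow the pattern set and the fact set for one relation."""
        facts = set(seed_facts)   # e.g. {("CharlesDickens", "GreatExpectations")}
        patterns = set()          # e.g. will come to contain "wrote"
        for _ in range(max_iters):
            # 1. Sentences whose argument pair is a known fact donate their pattern.
            new_patterns = {p for (a, p, b) in triples if (a, b) in facts}
            # 2. Sentences whose pattern is known donate their argument pair as a fact.
            new_facts = {(a, b) for (a, p, b) in triples if p in patterns | new_patterns}
            if new_patterns <= patterns and new_facts <= facts:
                break             # fixed point: nothing new was added
            patterns |= new_patterns
            facts |= new_facts
        return facts, patterns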

  6. Bootstrapping contd.
  Errors made by the basic algorithm:
  • Type 1: same arguments, different patterns, different relations
    • “JKRowling has made a lot of money from HarryPotter”
    • incorrect patterns are added for the Author relation
  • Type 2: same pattern, different arguments, different relations
    • “JKRowling wrote NevilleLongbottom out of the movie”
    • incorrect tuples are added to the Author relation
  Why does bootstrapping work? Does a well-motivated version fix these problems?

  7. A trivial generative model
  • How the world of facts is generated:
    • there are N objects and K relations
    • for each R, x, y, R(x,y) holds with probability σR
  • How facts are selected for “reporting”:
    • uniformly from the set of facts
  • How a fact R(x,y) is expressed as a sentence:
    • text “x w y”, where
    • the global “relation string” dictionary contains k words
    • word w ~ Categorical[p1,…,pk] specific to R
    • parameters p1,…,pk ~ Dirichlet[α1,…,αk]
  (a sampler sketch follows below)
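  A small sketch of sampling from this generative model with numpy; the concrete parameter values (N, K, vocabulary size, σ, α) are illustrative assumptions, not values from the talk:

    import numpy as np

    rng = np.random.default_rng(0)
    N, K, VOCAB = 50, 5, 20          # objects, relations, relation-string dictionary size
    sigma = 0.01                      # sparsity: probability that R(x,y) holds
    alpha = np.ones(VOCAB) * 0.1      # Dirichlet prior over each relation's word choices

    # 1. Generate the latent world: which facts R(x,y) hold.
    holds = rng.random((K, N, N)) < sigma
    facts = list(zip(*np.nonzero(holds)))          # list of (r, x, y) triples

    # 2. Each relation gets its own distribution over relation words.
    dictionaries = rng.dirichlet(alpha, size=K)

    # 3. Generate sentences: pick a fact uniformly, then a word for its relation.
    def generate_sentence():
        r, x, y = facts[rng.integers(len(facts))]  # uniform choice of fact to report
        w = rng.choice(VOCAB, p=dictionaries[r])   # relation-specific word choice
        return (x, w, y)                           # the "sentence" x w y

    corpus = [generate_sentence() for _ in range(100)]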

  8. BLOG model
  #Object ~ OM(3,1); #Relation ~ OM(2,1)
  Dictionary(r) ~ Dirichlet(α, WordList);
  Sparsity(r) ~ Beta(10, 1000);
  Holds(r,x,y) ~ Boolean(Sparsity(r));
  ChosenFact(s) ~ Uniform({f : Holds(f)})
  Subject(s) = Arg1(ChosenFact(s))
  Object(s) = Arg2(ChosenFact(s))
  Verb(s) ~ Categorical(Dictionary(Rel(ChosenFact(s))))
  <sentence data>
  Query: posterior over worlds, or MAP world, or P(Rel(ChosenFact(s1)) = Rel(ChosenFact(s2)))

  9. When does bootstrapping work?
  Under what parameter scenarios does the posterior recover the true world accurately?
  • Sparsity: σR << 1 (R(x,y) is false for most x,y pairs)
  • Independence between relations
  • No polysemy (each word w expresses one relation)
  • Overlap in sentence content (cf. redundancy):
    • many sentences share their arguments
    • many sentences share their relation words

  10. Sparsity
  • Most relations are very sparse:
    • Married(x,y) holds for about 2B/(7B)² of pairs, or 1 in 25B (in real data, population size is fame-adjusted; worked arithmetic below)
  • If relations are sparse and independent, worlds with two different relations for the same A,B argument pair are much less likely than worlds with one; i.e., “pure coincidence” is unlikely
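  A worked version of the arithmetic behind the “1 in 25B” figure, under the reading that roughly 2 billion ordered (x,y) pairs satisfy Married (about 2 billion married people) out of (7 billion)² ordered person pairs:

    \sigma_{\text{Married}} \;\approx\; \frac{2 \times 10^{9}}{(7 \times 10^{9})^{2}}
      \;=\; \frac{2 \times 10^{9}}{4.9 \times 10^{19}}
      \;\approx\; 4.1 \times 10^{-11}
      \;\approx\; \frac{1}{2.5 \times 10^{10}}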

  11. Example
  • Given:
    • “CharlesDickens wrote GreatExpectations”
    • “CharlesDickens писал GreatExpectations”
  • What is the probability that “писал” is expressing the same fact as “wrote”?

  12. General formula
  • Given N objects, sparseness σ, independence, 2 relations
  • X = “писал” means wrote; Y = “писал” is distinct
  • Odds ratio P(X,e)/P(Y,e) = [formula shown on the slide; a rough sketch follows below]
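  The slide’s exact expression did not survive in this transcript. The following back-of-the-envelope comparison, using only the ingredients defined on slide 7, is a hedged sketch of why the ratio scales as 1/σ; q_X and q_Y are placeholders for the σ-independent fact-selection and dictionary terms:

    % Evidence e: two sentences with the same argument pair (A,B),
    % one containing "wrote", the other containing "писал".
    % Under X (same relation R), e needs only the single fact R(A,B) to hold;
    % under Y (distinct relation R'), both R(A,B) and R'(A,B) must hold independently.
    P(X, e) \propto \sigma \, q_X, \qquad
    P(Y, e) \propto \sigma^{2} \, q_Y,
    \qquad\Longrightarrow\qquad
    \frac{P(X,e)}{P(Y,e)} \;=\; \frac{q_X}{\sigma \, q_Y} \;=\; O\!\left(\tfrac{1}{\sigma}\right)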

  13. Bootstrap is based on world sparsity
  • For small σ, the odds ratio is O(1/σ)
  • I.e., bootstrap inference is reliable

  14. Caveat: non-independence
  • Type 1 errors: real relations are not independent
    • e.g., Divorced(x,y) => Married(x,y)
    • e.g., Married(HenryVIII, AnneBoleyn), Beheaded(HenryVIII, AnneBoleyn)
  • Fixes:
    • allow relations to be subrelations or “de novo” relations
    • subrelations also support fact -> fact inference
    • some generic allowance for undirected correlation

  15. Odds ratio for subrelation case

  16. Comparison: Subrelations
  • The probability that “married” = “divorced” increases as we add pairs that are both married and divorced
  • The independent prior leads to overconfidence
  • The subrelation prior gives a more reasonable confidence level

  17. Robustness to type 2 errors (polysemy)

  18. Experiment
  • ~8500 sentences of pre-parsed NY Times text (Yao et al., EMNLP 2011); 4300 distinct dependency paths
    • <named entity> <dependency path> <named entity>
    • e.g., J. Edgar Hoover  appos->director->prep->of->pobj  F.B.I.
  • Unsupervised Bayesian inference on the “text”, using the model above:
    • automatic relation discovery plus text pattern learning
    • simultaneous extraction of facts
  • Each dependency path or “trigger” is treated as atomic; no features at all (representation sketch below)
  • Inference: smart-dumb/dumb-smart MCMC (UAI 2015)
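  A sketch of the data representation assumed above; the type and helper names are illustrative, not the authors’ code. Each pre-parsed item becomes an (arg1, trigger, arg2) triple, with the whole dependency path kept as one atomic trigger string:

    from collections import defaultdict, namedtuple

    Mention = namedtuple("Mention", ["arg1", "trigger", "arg2"])

    mentions = [
        Mention("J. Edgar Hoover", "appos->director->prep->of->pobj", "F.B.I."),
        # ... ~8500 such triples
    ]

    # The latent-world model couples mentions in two ways, which inference exploits:
    by_args = defaultdict(list)     # mentions sharing an argument pair: likely the same fact
    by_trigger = defaultdict(list)  # mentions sharing a trigger: likely the same relation
    for m in mentions:
        by_args[(m.arg1, m.arg2)].append(m)
        by_trigger[m.trigger].append(m)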

  19. Relation [rel_46]: text patterns
  appos->unit->prep->of->pobj
  appos->part->prep->of->pobj
  nn<-unit->prep->of->pobj
  partmod->own->prep->by->pobj
  rcmod->own->prep->by->pobj
  appos->subsidiary->prep->of->pobj
  rcmod->part->prep->of->pobj
  rcmod->unit->prep->of->pobj
  poss<-parent->appos
  appos->division->prep->of->pobj
  pobj<-of<-prep<-office->appos->part->prep->of->pobj
  pobj<-of<-prep<-unit->appos->part->prep->of->pobj
  nn<-division->prep->of->pobj
  appos->unit->nn
  nsubjpass<-own->prep->by->pobj
  nn<-office->prep->of->pobj

  20. Relation [rel_46]: extracted facts
  rel_46(ABC, Walt Disney Company)
  rel_46(American Airlines, AMR Corporation)
  rel_46(American, AMR Corporation)
  rel_46(Arnold Worldwide, Arnold Worldwide Partners division)
  rel_46(BBDO Worldwide, Omnicom Group)
  rel_46(Bozell Worldwide, Bozell)
  rel_46(Chicago, DDB Worldwide)
  rel_46(Conde Nast Publications, Advance Publications)
  rel_46(DDB Needham Worldwide, Omnicom Group)
  rel_46(DDB Worldwide, Omnicom Group)
  rel_46(Eastern, Texas Air Corporation)
  rel_46(Electronic Data Systems Corporation, General Motors Corporation)
  rel_46(Euro RSCG Worldwide, Havas Advertising)
  rel_46(Euro RSCG Worldwide, Havas)
  rel_46(Fallon Worldwide, PublicisGroupe)
  rel_46(Foote, True North Communications)
  rel_46(Fox, News Corporation)
  rel_46(Goodby, Omnicom Group)
  rel_46(Grey Worldwide, Grey Global Group)
  rel_46(Hughes, General Motors Corporation)

  21. Relation [rel_46]: extracted facts (contd.)
  rel_46(J. Walter Thompson, WPP Group)
  rel_46(Kellogg Brown & Root, Halliburton)
  rel_46(Kellogg, Halliburton)
  rel_46(Kraft General Foods, Philip Morris Cos.)
  rel_46(Lorillard Tobacco, Loews Corporation)
  rel_46(Lowe Group, Interpublic Group of Companies)
  rel_46(McCann-Erickson World Group, Interpublic Group of Companies)
  rel_46(NBC, General Electric Company)
  rel_46(New York, BBDO Worldwide)
  rel_46(New York, Hill)
  rel_46(Ogilvy & Mather Worldwide, WPP Group)
  rel_46(Saatchi & Saatchi, PublicisGroupe)
  rel_46(Salomon Smith Barney, Citigroup)
  rel_46(San Francisco, Foote)
  rel_46(Sears Receivables Financing Group Inc., Sears)
  rel_46(TBWA Worldwide, Omnicom Group)
  rel_46(United, UAL Corporation)
  rel_46(United, UAL)
  rel_46(Young & Rubicam, WPP Group)

  22. Evaluation
  • We cannot (in general) inspect and check the “KB”
    • cf. inspecting each other’s brains
  • A relation symbol may “mean” something different across possible worlds sampled by MCMC
  • Only solution: ask questions in natural language

  23. Next steps
  • Entity resolution: generative models of entity mentions
  • Ontology: types, time, events, vector-space meaning
  • Pragmatics: choice of facts, effect of context
  • Grammar: learning the missing link [diagram relating text, syntax, and meaning]
