
Self-supervised Probabilistic Methods for Extracting Facts from Text


Presentation Transcript


  1. Self-supervised Probabilistic Methods for Extracting Facts from Text Doug Downey

  2. Web Search: Answering Questions Q: Who did IBM acquire in 2002? A: “IBM acquired * in 2002” Q: Who has won a best actor Oscar for playing a villain? A: “won best actor for playing a villain” – 0 hits! The answer isn’t on just one Web page

  3. Solution: Synthesizing Across Pages Q: Who has won a best actor Oscar for playing a villain? A: Find all $X where the following appear: “$X won best actor for $Y”; “$X, who played $Z in $Y”; “the villain, $Z”. “Forest Whitaker won best actor for The Last King of Scotland” – 210 hits; “Forest Whitaker, who played Idi Amin in The Last King of Scotland” – 4 hits; “the villain, Idi Amin” – 1 hit. Answer: Forest Whitaker
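The synthesis step above can be sketched as a set intersection over pattern matches. The helper name and data structures below are illustrative, not part of the system described in the talk:

```python
def synthesize(best_actor_hits, played_in_hits, villain_hits):
    """best_actor_hits: set of (actor, film) from "$X won best actor for $Y"
       played_in_hits:  set of (actor, role, film) from "$X, who played $Z in $Y"
       villain_hits:    set of roles from "the villain, $Z"
    Returns actors supported by all three patterns jointly."""
    answers = set()
    for actor, role, film in played_in_hits:
        if (actor, film) in best_actor_hits and role in villain_hits:
            answers.add(actor)
    return answers

# Toy data mirroring the slide's hits:
best_actor = {("Forest Whitaker", "The Last King of Scotland")}
played_in = {("Forest Whitaker", "Idi Amin", "The Last King of Scotland")}
villains = {"Idi Amin"}
print(synthesize(best_actor, played_in, villains))  # {'Forest Whitaker'}
```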

  4. Self-supervised Information Extraction Given: one or more contexts indicating a semantic class C, e.g., “$X starred in $Y” => StarredIn($X, $Y) • User-specified (TextRunner [Banko et al., IJCAI 2007]) • Automatically generated (KnowItAll [Etzioni et al., AIJ 2005]) • Bootstrapped from resources [Snow et al., NIPS 2004]. Output: instances of C – but extraction from contexts is highly imperfect! => Output P(x ∈ C) for each term x. Self-supervised – no hand-tagged examples

  5. Self-supervised Information Extraction Given: one or more contexts suggestive of a semantic class C, and a corpus of text. Output: P(x ∈ C) for each term x. KnowItAll Hypothesis • Terms x which occur in the suggestive contexts more frequently are more likely to be instances of C. Distributional Hypothesis • Terms in the same class tend to appear in similar contexts. My task: formalizing these heuristics into statements about P(x ∈ C) given a corpus.

  6. Who cares about Probabilities? Why not use rankings (e.g., the precision/recall metric)? P( WonBestActorFor(Forest Whitaker, The Last King of Scotland) ) and P( PlayedVillainIn(Forest Whitaker, The Last King of Scotland) ) => Our goal: an estimate of the probability that Forest Whitaker won best actor for playing a villain. Not possible with rankings! In fact, combining even perfect rankings can yield poor accuracy.
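The point of the slide can be made concrete: calibrated probabilities for two facts combine into a joint estimate, while rank positions cannot be multiplied. The 0.9 and 0.8 values and the independence assumption are illustrative, not claims from the talk:

```python
# Two per-fact probability estimates (invented values for illustration):
p_won_best_actor = 0.9  # P(WonBestActorFor(Forest Whitaker, The Last King of Scotland))
p_played_villain = 0.8  # P(PlayedVillainIn(Forest Whitaker, The Last King of Scotland))

# Joint estimate, assuming (simplistically) the two facts are independent:
p_both = p_won_best_actor * p_played_villain
print(round(p_both, 2))  # 0.72 -- no analogous operation exists for ranks
```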

  7. Outline • Two Research Questions • URNS model • REALM • Proposal for DH • Chez KnowItAll

  8. Term-Context Matrix [figure: matrix with terms as rows and contexts as columns; terms, e.g., Miami, (Robert De Niro, Raging Bull), …, are potential elements of C]

  9. Term-Context Matrix [figure: the same matrix; contexts, e.g., “cities such as $X”, “$X said $Y offered to”; also possible: parse trees, bags of words, containing Web domain, etc.]

  10. KnowItAll Hypothesis / Distributional Hypothesis [figure: term-context matrix with rows Miami, Twisp, Star Wars and columns such as “he visited X and other cities”, “cities such as X”, “X soundtrack”, “X lodging”]

  11. Two Research Questions M – term-context matrix; M_C – columns of M for contexts suggesting C; P(x ∈ C) – prior estimate that x ∈ C. Formalizing the KnowItAll hypothesis: what is an expression for P(x ∈ C | M_C)? Formalizing the distributional hypothesis: what is an expression for P(x ∈ C | M)?

  12. Key Requirements for Models • Produce probabilities • Execute at “interactive” speed • No hand-tagged data

  13. Outline • Two Research Questions • URNS model • REALM • Proposal for DH • Chez KnowItAll

  14. [figure repeated: the term-context matrix illustrating the KnowItAll and Distributional Hypotheses]

  15. [figure repeated: the term-context matrix, highlighting the KnowItAll Hypothesis]

  16. 1. Modeling Redundancy – The Problem Consider a single context, e.g.: “cities such as x”. If an extraction x appears k times in a set of n sentences containing this pattern, what is the probability that x ∈ C?

  17. Modeling with k Country(x) extractions, n = 10: “…countries such as Saudi Arabia…” “…countries such as the United States…” “…countries such as Saudi Arabia…” “…countries such as Japan…” “…countries such as Africa…” “…countries such as Japan…” “…countries such as the United Kingdom…” “…countries such as Iraq…” “…countries such as Afghanistan…” “…countries such as Australia…”

  18. Modeling with k Country(x) extractions, n = 10. Noisy-Or Model, where p is the probability that a single sentence is true (here p = 0.9): Saudi Arabia (k = 2, 0.99), Japan (k = 2, 0.99), United States (k = 1, 0.9), Africa (k = 1, 0.9), United Kingdom (k = 1, 0.9), Iraq (k = 1, 0.9), Afghanistan (k = 1, 0.9), Australia (k = 1, 0.9). Important: sample size (n) and the distribution of C – noisy-or ignores these.
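The noisy-or scores on this slide follow from a one-line formula, a minimal sketch assuming each of the k extraction sentences is independently true with probability p:

```python
def noisy_or(k, p=0.9):
    """P(x in C) under noisy-or: at least one of k independent
    extraction sentences, each true with probability p, is genuine."""
    return 1 - (1 - p) ** k

# Reproduces the slide's numbers for p = 0.9:
print(round(noisy_or(2), 2))  # 0.99 (Saudi Arabia, Japan: k = 2)
print(round(noisy_or(1), 2))  # 0.9  (the singleton extractions)
```

Note that n never appears: the score for k = 1 is 0.9 whether the sample has 10 sentences or 50,000, which is exactly the weakness the next slides exploit.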

  19. Needed in Model: Sample Size Country(x) extractions, n = 10 (as above) vs. n ≈ 50,000: Japan (k = 1723, 0.9999…), Norway (k = 295, 0.9999…), and singletons Israil, OilWatch Africa, Religion, Paraguay Chicken Mole, Republics of Kenya, Atlantic Ocean, New Zeland (k = 1, 0.9 each). As sample size increases, noisy-or becomes inaccurate.

  20. Needed in Model: Distribution of C Country(x) extractions, n ≈ 50,000: Japan (k = 1723, 0.9999…), Norway (k = 295, 0.9999…), and singletons Israil, OilWatch Africa, Religion, Paraguay Chicken Mole, Republics of Kenya, Atlantic Ocean, New Zeland (k = 1, 0.9 each).

  21. Needed in Model: Distribution of C Country(x) extractions, n ≈ 50,000: Japan (k = 1723, 0.9999…), Norway (k = 295, 0.9999…), with the singleton extractions now assigned probability 0.05 each.

  22. Needed in Model: Distribution of C Country(x) extractions, n ≈ 50,000 (as above) vs. City(x) extractions, n ≈ 50,000: Toronto (k = 274, 0.9999…), Belgrade (k = 81, 0.98), and singletons Lacombe, Kent County, Nikki, Ragaz, Villegas Cres, Northeastwards (k = 1, 0.05 each). The probability that x ∈ C depends on the distribution of C.

  23. The URNS Model – Single Urn

  24. The URNS Model – Single Urn [figure: urn for City(x) containing balls labeled Tokyo, Tokyo, Sydney, U.K., U.K., Cairo, Utah, Atlanta, Atlanta, Yakima]

  25. The URNS Model – Single Urn [figure: drawing a ball labeled Tokyo from the City(x) urn corresponds to observing “…cities such as Tokyo…” in the corpus]

  26. Single Urn – Formal Definition C – set of unique target labels; E – set of unique error labels; num(b) – number of balls labeled by b ∈ C ∪ E; num(B) – distribution giving the number of balls for each label b ∈ B.

  27. Single Urn Example Urn for City(x): Tokyo, Tokyo, Sydney, U.K., U.K., Cairo, Utah, Atlanta, Atlanta, Yakima. num(“Atlanta”) = 2; num(C) = {2, 2, 1, 1, 1}; num(E) = {2, 1} (estimated from data).

  28. Single Urn: Computing Probabilities If an extraction x appears k times in a set of n sentences containing a pattern, what is the probability that x ∈ C?

  29. Single Urn: Computing Probabilities Given that an extraction x appears k times in n draws from the urn (with replacement), what is the probability that x ∈ C? P(x ∈ C | k, n) = Σ_{r ∈ num(C)} (r/s)^k (1 − r/s)^{n−k} / Σ_{r′ ∈ num(C ∪ E)} (r′/s)^k (1 − r′/s)^{n−k}, where s is the total number of balls in the urn.

  30. Uniform Special Case Consider the case where num(c_i) = R_C and num(e_j) = R_E for all c_i ∈ C, e_j ∈ E. Then, using a Poisson approximation, the posterior odds are P(x ∈ C | k, n) / P(x ∉ C | k, n) ≈ (|C| / |E|) (R_C / R_E)^k e^{−n(R_C − R_E)/s}. Odds increase exponentially with k, but decrease exponentially with n.
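The uniform special case can be sketched numerically. The urn parameters below (200 target labels with 20 balls each, 1,000 error labels with 1 ball each) are invented for illustration; the odds expression is the Poisson-approximated form from the slide:

```python
import math

def urns_prob(k, n, num_C=200, num_E=1000, R_C=20, R_E=1):
    """P(x in C | x drawn k times in n draws), uniform special case.
    Each target label has R_C balls, each error label R_E balls."""
    s = num_C * R_C + num_E * R_E  # total balls in the urn
    # Poisson approximation: prior odds |C|/|E| times the likelihood
    # ratio, rising exponentially in k and falling exponentially in n.
    odds = (num_C / num_E) * (R_C / R_E) ** k * math.exp(-n * (R_C - R_E) / s)
    return odds / (1 + odds)

# A k = 2 extraction is near-certain at n = 10, but almost certainly
# noise at n = 50,000 -- the sample-size effect noisy-or misses:
print(urns_prob(k=2, n=10))     # close to 1
print(urns_prob(k=2, n=50000))  # close to 0
```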

  31. The URNS Model – Multiple Urns Correlation across contexts is higher for elements of C than for elements of E.

  32. Unsupervised Performance

  33. Outline • Two Research Questions • URNS model • REALM • Proposal for DH • Chez KnowItAll

  34. Redundancy fails on “sparse” facts Frequent extractions, e.g., (Michael Bloomberg, New York City), tend to be correct; sparse extractions, e.g., (Dave Shaver, Pickerington) and (Ronald McDonald, McDonaldland), are a mixture of correct and incorrect.

  35. [figure repeated: the term-context matrix illustrating the KnowItAll and Distributional Hypotheses]

  36. [figure repeated: the term-context matrix, highlighting the Distributional Hypothesis]

  37. Assessing Sparse Extractions Task: Identify which sparse extractions are correct. Strategy: • Build a model of how common extractions occur in text • Rank sparse extractions by fit to model • The distributional hypothesis! Our contribution: Unsupervised language models. • Methods for mitigating sparsity • Precomputed, so greatly improved scalability

  38. The REALM Architecture RElation Assessment using Language Models Input: set of extractions for relation R, E_R = {(arg1_1, arg2_1), …, (arg1_M, arg2_M)} • Seeds S_R = the s most frequent pairs in E_R (assume these are correct) • Output: ranking of (arg1, arg2) ∈ E_R by distributional similarity to each (seed1, seed2) in S_R
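The architecture's outer loop can be sketched as follows. The function name and the `similarity` callback are ours, standing in for the HMM-based measure the talk develops next:

```python
from collections import Counter

def realm_rank(extractions, similarity, s=10):
    """Sketch of REALM's outer loop: take the s most frequent argument
    pairs as seeds (assumed correct), then rank each distinct extraction
    by its best distributional similarity to any seed."""
    seeds = [pair for pair, _ in Counter(extractions).most_common(s)]
    return sorted(set(extractions),
                  key=lambda pair: max(similarity(pair, seed) for seed in seeds),
                  reverse=True)

# Toy similarity: identical pairs score 1.0; pairs sharing arg2, 0.5.
def toy_sim(a, b):
    return 1.0 if a == b else (0.5 if a[1] == b[1] else 0.0)

extractions = [("Bloomberg", "New York")] * 3 + \
              [("Giuliani", "New York"), ("McDonald", "McDonaldland")]
print(realm_rank(extractions, toy_sim, s=1))
```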

  39. Distributional Similarity (1) N-gram Language Model: estimate P(w_i | w_{i−1}, …, w_{i−k}). Number of parameters scales with (vocabulary size)^{k+1}.

  40. Distributional Similarity (2) Naïve approach: compare context distributions P(w_g, …, w_j | seed1, seed2) and P(w_g, …, w_j | arg1, arg2). But j − g can be large: many parameters and sparse data => inaccuracy.

  41. The REALM Architecture Two steps for assessing R(arg1, arg2): • Typechecking – e.g., for AuthorOf(arg1, arg2), arg1 must be an author and arg2 a written work. Valuable, but allows errors like AuthorOf(Danielle Steele, Hamlet) • Relation Assessment – ensure R actually holds between arg1 and arg2. Both steps use small, pre-computed language models => scalable.

  42. Typechecking and HMM-T Task: for each extraction (arg1, arg2) ∈ E_R, determine if arg1 and arg2 are the proper type for R. Solution: assume the seeds seed_j ∈ S_R are of the proper type, and rank each arg_j by distributional similarity to each seed_j. Computing distributional similarity: • Offline, train a Hidden Markov Model (HMM) of the corpus • Measure distance between arg_j and seed_j in the HMM’s N-dimensional latent state space.

  43. HMM Language Model [figure, k = 1 case: hidden states t_i → t_{i+1} → t_{i+2} → t_{i+3} emitting the words w_i … w_{i+3}, e.g., “cities such as Seattle”] Offline training: learn P(w | t) and P(t_i | t_{i−1}, …, t_{i−k}) to maximize the probability of the corpus (using EM).

  44. HMM-T The trained HMM gives a “distributional summary” of each word w: the N-dimensional state distribution P(t | w). Typecheck each arg by comparing state distributions: rank extractions in ascending order of f(arg) summed over arguments.
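A minimal sketch of comparing state distributions. The slide does not pin down the distance f (its formula was on the slide image), so KL divergence is used here as an illustrative choice, and the three-state distributions are invented values, not trained ones:

```python
import math

def kl(p, q, eps=1e-12):
    """KL divergence between two latent-state distributions P(t | w)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def f(arg_dist, seed_dists):
    """Typechecking score: distance from an argument's state distribution
    to the nearest seed's (lower = better typed)."""
    return min(kl(arg_dist, sd) for sd in seed_dists)

# A city-like distribution (Twisp) sits far closer to a city seed (Miami)
# than a film title (Star Wars) does:
p_miami, p_twisp, p_star_wars = [0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.05, 0.1, 0.85]
print(f(p_twisp, [p_miami]) < f(p_star_wars, [p_miami]))  # True
```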

  45. Why not use context vectors? Miami and Twisp each get a sparse vector over contexts such as “he visited X and”, “X and other cities”, “when he visited X”. Problems: • Vectors are large • Intersections are sparse

  46. HMM-T Advantages (1) Instead of a sparse context vector, each word w is summarized by its latent state distribution P(t | w), t = 1, 2, …, N: • Compact (efficient – 10-50x less data retrieved) • Dense (accurate)

  47. HMM-T Advantages (2) Is Pickerington of the same type as Chicago? Chicago appears in “Chicago , Illinois”; Pickerington in “Pickerington , Ohio”. Their context vectors (“<x> , Illinois” vs. “<x> , Ohio”) share no entries, so the dot product is 0 => N-grams say no!

  48. HMM-T Advantages (3) The HMM generalizes across “Chicago , Illinois” and “Pickerington , Ohio”.
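The contrast drawn on slides 47-48 in two dot products. The state distributions in the second half are invented for illustration, standing in for what a trained HMM would produce:

```python
# Slide 47's failure case: exact-context vectors share no entries.
chicago      = {"<x> , Illinois": 1}
pickerington = {"<x> , Ohio": 1}
dot = sum(v * pickerington.get(ctx, 0) for ctx, v in chicago.items())
print(dot)  # 0 -- zero overlap, so N-grams say "not the same type"

# Slide 48's fix: if the HMM maps ", Illinois" and ", Ohio" contexts to
# the same hidden state, the two words' state distributions P(t | w)
# overlap even with no shared n-gram (illustrative values):
p_t_chicago      = [0.8, 0.1, 0.1]
p_t_pickerington = [0.7, 0.2, 0.1]
overlap = sum(a * b for a, b in zip(p_t_chicago, p_t_pickerington))
print(overlap > 0)  # True
```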

  49. HMM-T Limitations • Learning time is proportional to (corpus size × T^{k+1}) • T = number of latent states • k = HMM order • We use limited values T = 20, k = 3 • Sufficient for typechecking (Santa Clara is a city) • Too coarse for relation assessment (Santa Clara is where Intel is headquartered)

  50. Outline • Two Research Questions • URNS model • REALM • Proposal for Formalizing the DH • Chez KnowItAll
