
Representation, Inference and Learning in Relational Probabilistic Languages



  1. Representation, Inference and Learning in Relational Probabilistic Languages Lise Getoor University of Maryland College Park Avi Pfeffer Harvard University IJCAI 2005 Tutorial

  2. Introduction • Probability is good • First-order representations are good • Variety of approaches that combine them • We won’t cover all of them in detail • apologies if we leave out your favorite • We will cover three broad classes of approaches, and present exemplars of each approach • We will highlight common issues, themes, and techniques that recur in different approaches

  3. Running Example • There are papers, researchers, citations, reviews… • Papers have a quality, and may or may not be accepted • Authors may be smart and good writers • Papers have topics, and cite other papers which may or may not be on the same topic • Papers are reviewed by reviewers, who have moods that are influenced by the quality of the writing

  4. Some Queries • What is the probability that a researcher is famous, given that one of her papers was accepted despite the fact that a reviewer was in a bad mood? • What is the probability that a paper is accepted, given that another paper by the same author is accepted? • What is the probability that a paper is an AI paper, given that it is cited by an AI paper? • What is the probability that a student of a famous advisor has seven high quality papers?

  5. Sample Domains • Web Pages and Link Analysis • Battlespace Awareness • Epidemiological Studies • Citation Networks • Communication Networks (Cellphone Fraud Detection) • Intelligence Analysis (Terrorist Networks) • Financial Transactions (Money Laundering) • Computational Biology • Object Recognition and Scene Analysis • Natural Language Processing (e.g. Information Extraction and Semantic Parsing)

  6. Roadmap • Motivation • Background: Bayesian network inference and learning • Rule-based Approaches • Frame-based Approaches • Undirected Relational Approaches • Programming Language Approaches

  7. Bayesian Networks [Pearl 87] [Figure: network over Smart, Good Writer, Quality, Accepted, Reviewer Mood, Review Length, with a conditional probability table (CPT) for P(Quality | Good Writer, Smart)] • nodes = domain variables • edges = direct causal influence • Network structure encodes conditional independencies: I(Review-Length, Good-Writer | Reviewer-Mood)

  8. BN Semantics [Figure: the network from the previous slide, with nodes abbreviated S, W, M, Q, L, A] • Compact & natural representation: • nodes have at most k parents ⇒ O(2^k · n) vs. O(2^n) params • natural parameters • conditional independencies in BN structure + local CPTs = full joint distribution over domain
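For concreteness, a sketch of the factored joint for this network. The edge structure here is partly an assumption reconstructed from the previous slide (Smart and Good Writer into Quality, Good Writer into Reviewer Mood, Reviewer Mood into Review Length, Quality and Reviewer Mood into Accepted):

    P(S, W, M, Q, L, A) = P(S)\,P(W)\,P(M \mid W)\,P(Q \mid S, W)\,P(L \mid M)\,P(A \mid Q, M)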

  9. Variable Elimination [Zhang & Poole 96, Dechter 98] • A factor is a function from values of variables to non-negative real numbers • Example factor f over (reviewer mood, good writer): f(pissy, false) = 1, f(pissy, true) = 0, f(good, false) = 0.7, f(good, true) = 0.3

  10. Variable Elimination • To compute the query probability [equation shown on the original slide]

  11. Variable Elimination • To compute the query: sum out l [equation shown on the original slide]

  12. Variable Elimination • To compute the query: this produces a new factor [equation shown on the original slide]

  13. Variable Elimination • To compute the query: multiply factors together, then sum out w [equation shown on the original slide]

  14. Variable Elimination • To compute the query: this produces a new factor [equation shown on the original slide]

  15. Variable Elimination • To compute the query: continue until only the query variable remains [final equation shown on the original slide]
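Since the equations on the preceding slides did not survive the transcript, here is a minimal sketch of the two factor operations that variable elimination repeats (multiply factors, then sum out a variable). It is an illustration only, assuming Boolean variables and made-up numbers, not the tutorial's own code:

    from itertools import product

    # A factor is (variables, table): table maps a tuple of Boolean values,
    # one per variable in order, to a non-negative number.

    def multiply(f1, f2):
        vars1, t1 = f1
        vars2, t2 = f2
        out_vars = list(vars1) + [v for v in vars2 if v not in vars1]
        table = {}
        for values in product([True, False], repeat=len(out_vars)):
            assign = dict(zip(out_vars, values))
            table[values] = (t1[tuple(assign[v] for v in vars1)] *
                             t2[tuple(assign[v] for v in vars2)])
        return out_vars, table

    def sum_out(f, var):
        vars_, t = f
        out_vars = [v for v in vars_ if v != var]
        table = {}
        for values, val in t.items():
            assign = dict(zip(vars_, values))
            key = tuple(assign[v] for v in out_vars)
            table[key] = table.get(key, 0.0) + val
        return out_vars, table

    # Illustrative numbers only: P(M) = sum_W P(W) * P(M | W)
    p_w = (["W"], {(True,): 0.5, (False,): 0.5})
    p_m_given_w = (["W", "M"], {(True, True): 0.7, (True, False): 0.3,
                                (False, True): 0.2, (False, False): 0.8})
    print(sum_out(multiply(p_w, p_m_given_w), "W"))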

  16. Some Other Inference Algorithms • Exact • Junction Tree [Lauritzen & Spiegelhalter 88] • Cutset Conditioning [Pearl 87] • Approximate • Loopy Belief Propagation [McEliece et al 98] • Likelihood Weighting [Shwe & Cooper 91] • Markov Chain Monte Carlo [e.g. MacKay 98] • Gibbs Sampling [Geman & Geman 84] • Metropolis-Hastings [Metropolis et al 53, Hastings 70] • Variational Methods [Jordan et al 98] • etc.

  17. Parameter Estimation in BNs • Assume known dependency structure G • Goal: estimate BN parameters θ • entries in local probability models • θ is good if it is likely to generate the observed data D • MLE Principle: Choose θ* so as to maximize the log-likelihood ℓ(θ : D) = log P(D | G, θ) • Alternative: incorporate a prior
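As a reminder of the standard form implicit here (assuming i.i.d. instances ξ[1], …, ξ[M]; the complete-data expansion below anticipates the next slide):

    \ell(\theta : D) = \log P(D \mid G, \theta) = \sum_{m=1}^{M} \log P(\xi[m] \mid G, \theta)
                     = \sum_{i} \sum_{m=1}^{M} \log P(x_i[m] \mid \mathrm{pa}_i[m], \theta)   \text{(for complete data)}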

  18. Learning With Complete Data • Fully observed data: data consists of a set of instances, each with a value for all BN variables • With fully observed data, we can compute counts N[x, u] = number of instances with X = x and parents Pa(X) = u, and N[u] = number of instances with Pa(X) = u • and similarly for the other counts • We then estimate θ_{x|u} = N[x, u] / N[u]
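A minimal sketch of this counting scheme; the data and parents representations are my own, not from the tutorial:

    from collections import Counter

    # data: list of dicts mapping each BN variable to its observed value (complete data).
    # parents: dict mapping each variable to the tuple of its parents in the known structure G.
    def mle_cpts(data, parents):
        counts = {v: Counter() for v in parents}
        for instance in data:
            for v, pa in parents.items():
                u = tuple(instance[p] for p in pa)
                counts[v][(instance[v], u)] += 1          # N[x, u]
        cpts = {}
        for v, c in counts.items():
            totals = Counter()
            for (x, u), n in c.items():
                totals[u] += n                            # N[u]
            cpts[v] = {(x, u): n / totals[u] for (x, u), n in c.items()}
        return cpts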

  19. Learning with Missing Data: Expectation-Maximization (EM) • With missing data we can't compute the counts N[x, u] directly • But: given parameter values, we can compute expected counts E[N[x, u]] (this requires BN inference) • Given expected counts, estimate parameters as before • Begin with arbitrary parameter values • Iterate these two steps • Converges to a local maximum of the likelihood
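A sketch of just the EM control loop described above; compute_expected_counts and estimate_from_counts are hypothetical callables standing in for the inference and estimation routines, and a fixed iteration count replaces a convergence test:

    def em(initial_params, compute_expected_counts, estimate_from_counts, num_iters=50):
        # E-step: expected counts under the current parameters (needs BN inference);
        # M-step: re-estimate parameters from those counts. Repeat.
        params = initial_params
        for _ in range(num_iters):
            expected_counts = compute_expected_counts(params)   # E-step
            params = estimate_from_counts(expected_counts)      # M-step
        return params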

  20. Structure search • Begin with an empty network • Consider all neighbors reached by a search operator that keep the network acyclic • add an edge • remove an edge • reverse an edge • For each neighbor s • compute ML parameter values • compute score(s) for the candidate structure s (scoring function shown as an equation on the original slide) • Choose the neighbor with the highest score • Continue until we reach a local maximum
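A sketch of this greedy search loop; neighbors and score are stand-in callables (neighbors(G) is assumed to yield the acyclic structures reachable by adding, removing, or reversing one edge, and score(G) to fit ML parameters and return the structure score):

    def greedy_structure_search(initial_structure, neighbors, score):
        current, current_score = initial_structure, score(initial_structure)
        while True:
            best, best_score = None, current_score
            for candidate in neighbors(current):          # add / remove / reverse one edge
                s = score(candidate)                      # fit ML parameters, then score
                if s > best_score:
                    best, best_score = candidate, s
            if best is None:
                return current                            # local maximum reached
            current, current_score = best, best_score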

  21. Limitations of BNs • Inability to generalize across collection of individuals within a domain • if you want to talk about multiple individuals in a domain, you have to talk about each one explicitly, with its own local probability model • Domains have fixed structure: e.g. one author, one paper and one reviewer • if you want to talk about domains with multiple inter-related individuals, you have to create a special purpose network for the domain • For learning, all instances have to have the same set of entities

  22. First Order Approaches • Advantages of first order probabilistic models • represent world in terms of individuals and relationships between them • ability to generalize about many instances in same domain • allow compact parameterization • support reasoning about general classes of individuals rather than the individuals themselves • allow representation of high level structure, in which objects interact weakly with each other

  23. Three Different Approaches • Rule-based approaches focus on facts • what is true in the world? • what facts do other facts depend on? • Frame-based approaches focus on objects and relationships • what types of objects are there, and how are they related to each other? • how does a property of an object depend on other properties (of the same or other objects)? • Programming language approaches focus on processes • how is the world generated? • how does one event influence another event?

  24. Roadmap • Motivation • Background • Rule-based Approaches • Basic Approach • Knowledge-Based Model Construction • Issues • First-Order Variable Elimination • Learning • Frame-based Approaches • Undirected Relational Approaches • Programming Language Approaches

  25. Flavors • Goldman & Charniak [93] • Breese [92] • Probabilistic Horn Abduction [Poole 93] • Probabilistic Logic Programming [Ngo & Haddawy 96] • Relational Bayesian Networks [Jaeger 97] • Bayesian Logic Programs [Kersting & de Raedt 00] • Stochastic Logic Programs [Muggleton 96] • PRISM [Sato & Kameya 97] • CLP(BN) [Costa et al. 03] • etc.

  26. Intuitive Approach In logic programming, accepted(P) :- author(P,A), famous(A). means: for all P, A, if A is the author of P and A is famous, then P is accepted. This is a categorical inference. But this will not be true in many cases.

  27. Fudge Factors Use accepted(P) :- author(P,A), famous(A). (0.6) This means: for all P, A, if A is the author of P and A is famous, then P is accepted with probability 0.6. But what does this mean when there are other possible causes of a paper being accepted? e.g. accepted(P) :- high_quality(P). (0.8)

  28. Intuitive Meaning accepted(P) :- author(P,A), famous(A). (0.6) means: for all P, A, if A is the author of P and A is famous, then P is accepted with probability 0.6, provided no other possible cause of the paper being accepted holds. If more than one possible cause holds, a combining rule is needed to combine the probabilities.

  29. Meaning of Disjunction In logic programming accepted(P) :- author(P,A), famous(A). accepted(P) :- high_quality(P). means: for all P, A, if A is the author of P and A is famous, or if P is high quality, then P is accepted.

  30. Intuitive Meaning of Probabilistic Disjunction For us accepted(P) :- author(P,A), famous(A). (0.6) accepted(P) :- high_quality(P). (0.8) means: for all P, A, if (A is the author of P and A is famous successfully cause P to be accepted) or (P is high quality and this successfully causes P to be accepted), then P is accepted. If A is the author of P and A is famous, they successfully cause P to be accepted with probability 0.6. If P is high quality, it successfully causes P to be accepted with probability 0.8.

  31. Noisy-Or • Multiple possible causes of an effect • Each cause, if it is true, successfully causes the effect with a given probability • Effect is true if any of the possible causes is true and successfully causes it • All causes act independently to produce the effect (causal independence) • Note: accepted(P) :- author(P,A), famous(A). (0.6) may produce multiple possible causes for different values of A • Leak probability: effect may happen with no cause • e.g. accepted(P). (0.1)
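In symbols, the standard noisy-or form these bullets describe, with leak probability p_0 and success probabilities p_1, …, p_n for the causes c_1, …, c_n:

    P(\mathit{effect} \mid c_1, \ldots, c_n) = 1 - (1 - p_0) \prod_{i \,:\, c_i = \mathrm{true}} (1 - p_i)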

  32. Noisy-Or [Figure: ground noisy-or network for accepted(p1), with causes author(p1,alice) ∧ famous(alice) (0.6), author(p1,bob) ∧ famous(bob) (0.6), and high_quality(p1) (0.8)]

  33. Computing Noisy-Or Probabilities • What is P(accepted(p1)) given that Alice is an author and Alice is famous, and that the paper is high quality, but no other possible cause is true? • P(accepted(p1)) = 1 − (1 − 0.6)(1 − 0.8)(1 − 0.1) = 1 − 0.072 = 0.928, where the 0.1 is the leak probability
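A one-function sketch of this calculation (an illustration, not the tutorial's code):

    def noisy_or(cause_probs, leak=0.0):
        # Probability the effect holds: 1 minus the probability that the leak
        # and every active cause all independently fail to produce it.
        p_all_fail = 1.0 - leak
        for p in cause_probs:
            p_all_fail *= 1.0 - p
        return 1.0 - p_all_fail

    print(noisy_or([0.6, 0.8], leak=0.1))   # 0.928, as above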

  34. Combination Rules • Other combination rules are possible • E.g. max • In our case, P(accepted(p1)) = max {0.6,0.8,0.1} = 0.8 • Harder to interpret in terms of logic program

  35. Roadmap • Motivation • Background • Rule-based Approaches • Basic Approach • Knowledge-Based Model Construction • Issues • First-Order Variable Elimination • Learning • Frame-based Approaches • Undirected Relational Approaches • Programming Language Approaches

  36. Knowledge-Based Model Construction (KBMC) • Construct a Bayesian network, given a query Q and evidence E • query and evidence are sets of ground atoms, i.e., predicates with no variable symbols • e.g. author(p1,alice) • Construct the network by searching for possible proofs of the query and the evidence • Use standard BN inference techniques on the constructed network

  37. KBMC Example smart(alice). (0.8) smart(bob). (0.9) author(p1,alice). (0.7) author(p1,bob). (0.3) high_quality(P) :- author(P,A), smart(A). (0.5) high_quality(P). (0.1) accepted(P) :- high_quality(P). (0.9) Query is accepted(p1). Evidence is smart(bob).

  38. Backward Chaining Start with evidence variable smart(bob) [network: smart(bob)]

  39. Backward Chaining Rule for smart(bob) has no antecedents – stop backward chaining [network: smart(bob)]

  40. Backward Chaining Begin with query variable accepted(p1) [network: smart(bob), accepted(p1)]

  41. Backward Chaining Rule for accepted(p1) has antecedent high_quality(p1) – add high_quality(p1) to the network, and make it a parent of accepted(p1) [network: smart(bob), high_quality(p1), accepted(p1)]

  42. Backward Chaining All of accepted(p1)’s parents have been found – create its conditional probability table (CPT) [network: smart(bob), high_quality(p1), accepted(p1); figure shows the CPT for accepted(p1) given high_quality(p1)]

  43. Backward Chaining high_quality(p1) :- author(p1,A), smart(A) has two groundings: A=alice and A=bob [network: smart(bob), high_quality(p1), accepted(p1)]

  44. Backward Chaining For grounding A=alice, add author(p1,alice) and smart(alice) to the network, and make them parents of high_quality(p1) [network: smart(alice), smart(bob), author(p1,alice), high_quality(p1), accepted(p1)]

  45. Backward Chaining For grounding A=bob, add author(p1,bob) to the network. smart(bob) is already in the network. Make both parents of high_quality(p1) [network: smart(alice), smart(bob), author(p1,alice), author(p1,bob), high_quality(p1), accepted(p1)]

  46. Backward Chaining Create the CPT for high_quality(p1) – make it a noisy-or, and don’t forget the leak probability [network: smart(alice), smart(bob), author(p1,alice), author(p1,bob), high_quality(p1), accepted(p1)]

  47. Backward Chaining author(p1,alice), smart(alice) and author(p1,bob) have no antecedents – stop backward chaining [network: smart(alice), smart(bob), author(p1,alice), author(p1,bob), high_quality(p1), accepted(p1)]

  48. Backward Chaining • Assert evidence smart(bob) = true, and compute P(accepted(p1) | smart(bob) = true) [network: smart(alice), smart(bob) = true, author(p1,alice), author(p1,bob), high_quality(p1), accepted(p1)]
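To make the constructed network concrete, here is a minimal enumeration sketch that computes P(accepted(p1) | smart(bob) = true) for the program on slide 37. The representation and helper names are my own; it simply sums over the unobserved root facts and applies the noisy-or CPTs described above:

    from itertools import product

    # Priors on the unobserved root facts; smart(bob) is evidence, fixed to true.
    priors = {"smart(alice)": 0.8, "author(p1,alice)": 0.7, "author(p1,bob)": 0.3}

    def noisy_or(cause_probs, leak=0.0):
        p_all_fail = 1.0 - leak
        for p in cause_probs:
            p_all_fail *= 1.0 - p
        return 1.0 - p_all_fail

    answer = 0.0
    for values in product([True, False], repeat=len(priors)):
        world = dict(zip(priors, values))
        weight = 1.0
        for fact, is_true in world.items():
            weight *= priors[fact] if is_true else 1.0 - priors[fact]
        causes = []                                 # groundings of the high_quality rule
        if world["author(p1,alice)"] and world["smart(alice)"]:
            causes.append(0.5)
        if world["author(p1,bob)"]:                 # smart(bob) is observed true
            causes.append(0.5)
        p_hq = noisy_or(causes, leak=0.1)           # high_quality(P). (0.1) is the leak
        weight *= 0.9 * p_hq                        # accepted only via high_quality, prob 0.9
        answer += weight
    print(answer)                                   # P(accepted(p1) | smart(bob) = true)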

  49. Roadmap • Motivation • Background • Rule-based Approaches • Basic Approach • Knowledge-Based Model Construction • Issues • First-Order Variable Elimination • Learning • Frame-based Approaches • Undirected Relational Approaches • Programming Language Approaches

  50. The Role of Context • Context is deterministic knowledge known prior to the network being constructed • May be defined by its own logic program • Is not a random variable in the BN • Used to determine the structure of the constructed BN • If a context predicate P appears in the body of a rule R, only backward chain on R if P is true
