
Exact and approximate inference in probabilistic graphical models

Kevin Murphy, MIT CSAIL / UBC CS & Stats. www.ai.mit.edu/~murphyk/AAAI04. AAAI 2004 tutorial.


Presentation Transcript


  1. Exact and approximate inference in probabilistic graphical models. Kevin Murphy, MIT CSAIL / UBC CS & Stats. www.ai.mit.edu/~murphyk/AAAI04. AAAI 2004 tutorial.

  2. Outline • Introduction • Exact inference • Approximate inference

  3. Outline • Introduction • What are graphical models? • What is inference? • Exact inference • Approximate inference

  4. Probabilistic graphical models. [Figure: taxonomy — graphical models are a family of probabilistic models, divided into undirected models (Markov random fields, MRFs) and directed models (Bayesian networks)]

  5. Bayesian networks • Directed acyclic graph (DAG) • Nodes – random variables • Edges – direct influence (“causation”) • Xi ⊥ Xancestors | Xparents • e.g., C ⊥ {R, B, E} | A • Simplifies the chain rule by using conditional independencies. [Figure: Earthquake, Burglary → Alarm; Earthquake → Radio; Alarm → Call] Pearl, 1988

  6. Conditional probability distributions (CPDs) • Each node specifies a distribution over its values given its parents' values, P(Xi | Xpa(i)) • The full joint table needs 2^5 − 1 = 31 parameters; the BN needs only 10. [Figure: Earthquake, Burglary → Alarm; Earthquake → Radio; Alarm → Call] Pearl, 1988
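The factorization and parameter count on this slide can be sketched in a few lines. This is an illustrative sketch only: the network structure follows the slides, but every CPD number below is made up, not taken from the tutorial.

```python
# Burglary network factorization (structure from the slides):
# P(B, E, R, A, C) = P(B) P(E) P(R|E) P(A|B,E) P(C|A).
# All probabilities below are invented for illustration.

def p_b(b): return 0.01 if b else 0.99
def p_e(e): return 0.02 if e else 0.98
def p_r(r, e):               # radio report depends only on earthquake
    return (0.9 if r else 0.1) if e else (0.001 if r else 0.999)
def p_a(a, b, e):            # alarm depends on burglary and earthquake
    on = {(True, True): 0.95, (True, False): 0.94,
          (False, True): 0.29, (False, False): 0.001}[(b, e)]
    return on if a else 1 - on
def p_c(c, a):               # call depends only on alarm
    return (0.9 if c else 0.1) if a else (0.05 if c else 0.95)

def joint(b, e, r, a, c):
    return p_b(b) * p_e(e) * p_r(r, e) * p_a(a, b, e) * p_c(c, a)

# 10 free parameters (1 + 1 + 2 + 4 + 2) instead of 2^5 - 1 = 31 for the
# full joint table; the 32 joint entries still sum to 1.
bools = (False, True)
total = sum(joint(b, e, r, a, c) for b in bools for e in bools
            for r in bools for a in bools for c in bools)
```

Because each CPD is locally normalized, the product over all 32 assignments automatically sums to one — that is the point of representing the joint through the DAG.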

  7. Example BN: hidden Markov model (HMM). [Figure: chain X1 → X2 → X3 with observations Y1, Y2, Y3] Hidden states, e.g. words; observations, e.g. sounds.

  8. CPDs for HMMs. [Figure: the HMM chain annotated with A on the X → X edges and B on the X → Y edges] Parameter tying: A = state transition matrix, B = observation matrix, plus an initial state distribution, shared across all time slices.
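The tied parameters on this slide can be written down directly. The sketch below uses invented numbers for A, B, and the initial distribution, and computes the observation likelihood with the standard forward recursion (sum-product on the chain), which the tutorial covers later:

```python
# Tied HMM parameters (all numbers illustrative): one transition matrix A
# and one observation matrix B are shared across every time slice.
pi = [0.6, 0.4]                    # pi[i]   = P(X_1 = i)  (initial dist.)
A  = [[0.7, 0.3], [0.2, 0.8]]      # A[i][j] = P(X_{t+1} = j | X_t = i)
B  = [[0.9, 0.1], [0.3, 0.7]]      # B[i][k] = P(Y_t = k | X_t = i)

def forward_likelihood(obs):
    """P(y_1..y_T) via the forward recursion on the chain."""
    alpha = [pi[i] * B[i][obs[0]] for i in range(len(pi))]
    for y in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(len(alpha))) * B[j][y]
                 for j in range(len(pi))]
    return sum(alpha)

lik = forward_likelihood([0, 1, 1])
```

The recursion costs O(T K^2) instead of the O(K^T) a naive sum over all state sequences would take — the same push-sums-inside-products idea as variable elimination.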

  9. Markov random fields (MRFs) • Undirected graph • Xi ⊥ Xrest | Xnbrs • Each clique c has a potential function ψc • Hammersley–Clifford theorem: P(X) = (1/Z) ∏c ψc(Xc) • The normalization constant (partition function) is Z = Σx ∏c ψc(xc). [Figure: a five-node undirected graph X1–X5]
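The Hammersley–Clifford factorization can be made concrete on a toy model. The sketch below (a made-up three-node chain MRF with one potential per edge) computes Z by brute-force enumeration, which is exactly the expensive sum the rest of the tutorial tries to avoid:

```python
from itertools import product

# Tiny pairwise MRF on 3 binary nodes with edges (0,1) and (1,2):
# P(x) = (1/Z) * prod_edges psi(x_i, x_j).  Potentials are arbitrary
# positive numbers; these values are invented for illustration.
def psi(xi, xj):
    return 2.0 if xi == xj else 0.5   # favours agreeing neighbours

edges = [(0, 1), (1, 2)]

def unnormalized(x):
    p = 1.0
    for (i, j) in edges:
        p *= psi(x[i], x[j])
    return p

# Partition function: sum over all 2^3 joint configurations.
Z = sum(unnormalized(x) for x in product([0, 1], repeat=3))

def prob(x):
    return unnormalized(x) / Z
```

For N binary nodes this enumeration costs 2^N terms — computing Z is what makes undirected models harder to work with than directed ones.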

  10. Potentials for MRFs. [Figure: the same five-node graph, parameterized two ways] One potential per maximal clique (e.g. ψ123) vs. one potential per edge (e.g. ψ12, ψ23, ψ13).

  11. Example MRF: Ising/Potts model. [Figure: a grid of hidden nodes x, each attached to an observed pixel y] Parameter tying: one shared potential encodes compatibility with neighbours, another encodes local evidence (compatibility with the image).

  12. Conditional random field (CRF) Lafferty01, Kumar03, etc. [Figure: the same grid as the Ising/Potts slide, with potentials conditioned on the image] Tied parameters again encode compatibility with neighbours and local evidence (compatibility with the image).

  13. Directed vs. undirected models. [Figure: a directed model over X1–X5 and, via moralization, its undirected counterpart] • Directed: d-separation ⇒ conditional independence; parameter learning easy • Undirected: separation ⇒ conditional independence; parameter learning hard • Inference is the same!

  14. Factor graphs Kschischang01. A factor graph is a bipartite graph connecting variable nodes to factor nodes. [Figure: a Bayes net, a Markov net, and a pairwise Markov net over X1–X5, each converted to its factor graph]

  15. Outline • Introduction • What are graphical models? • What is inference? • Exact inference • Approximate inference

  16. Inference (state estimation). Observe C = t. [Figure: burglary network with Call observed]

  17. Inference. P(B=t | C=t) = 0.7, P(E=t | C=t) = 0.1. [Same figure]

  18. Inference. Now also observe R = t. [Same figure, with Radio observed]

  19. Inference. P(B=t | C=t, R=t) = 0.1, P(E=t | C=t, R=t) = 0.97. [Same figure]

  20. Inference. Observing the radio report lowers the burglary posterior: the explaining-away effect. [Same figure]

  21. Inference. “Probability theory is nothing but common sense reduced to calculation” – Pierre Simon Laplace. [Same figure]

  22. Inference tasks • Posterior probabilities of query variables given evidence, marginalizing out nuisance variables: sum-product • Most probable explanation (MPE) / Viterbi: max-product • Marginal maximum a posteriori (MAP): max-sum-product

  23. Causal vs. diagnostic reasoning • Sometimes easier to specify P(effect | cause) than P(cause | effect) [stable mechanism] • Use Bayes’ rule to invert the causal model. [Figure: hidden diseases H as causes of visible symptoms v]
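A minimal numeric sketch of this inversion, with all numbers invented: we specify the easy causal direction P(symptom | disease) plus a prior, and recover the diagnostic direction P(disease | symptom) with Bayes' rule.

```python
# Illustrative numbers only — not from the tutorial.
p_disease = 0.01               # prior P(disease)
p_sym_given_d = 0.9            # causal direction: easy to elicit
p_sym_given_not_d = 0.05

# Bayes' rule: P(d | s) = P(s | d) P(d) / P(s)
p_sym = (p_sym_given_d * p_disease
         + p_sym_given_not_d * (1 - p_disease))
p_d_given_sym = p_sym_given_d * p_disease / p_sym
```

Even with a 90% true-positive rate, the posterior stays modest here (about 0.15) because the prior is small — the standard base-rate effect that makes the diagnostic direction hard to specify directly.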

  24. Applications of Bayesian inference

  25. Decision theory • Decision theory = probability theory + utility theory • Bayesian networks + actions/utilities = influence/decision diagrams • Maximize expected utility

  26. Outline • Introduction • Exact inference • Brute force enumeration • Variable elimination algorithm • Complexity of exact inference • Belief propagation algorithm • Junction tree algorithm • Linear Gaussian models • Approximate inference

  27. Brute force enumeration • We can compute any posterior in O(K^N) time, where K = |Xi| • By using the BN, we can represent the joint in O(N) space. [Figure: the alarm network B, E → A → J, M]

  28. Brute force enumeration (Russell & Norvig)

  29. Enumeration tree (Russell & Norvig)

  30. The enumeration tree contains repeated sub-expressions.

  31. Variable/bucket elimination Kschischang01, Dechter96 • Push sums inside products (generalized distributive law) • Carry out summations right to left, storing intermediate results (factors) to avoid recomputation (dynamic programming)

  32. Variable elimination

  33. VarElim: basic operations • Pointwise product of factors • Summing out a variable from a factor • Only multiply factors which contain the summand (lazy evaluation)
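The two primitives on this slide can be sketched on table factors. This is a generic illustration (factor = a list of variable names plus a dict from value tuples to numbers, binary variables only), not the tutorial's own code:

```python
from itertools import product

def factor_product(f, g):
    """Pointwise product: result scope is the union of the two scopes."""
    fv, ft = f
    gv, gt = g
    vars_ = list(fv) + [v for v in gv if v not in fv]
    table = {}
    for vals in product([0, 1], repeat=len(vars_)):
        assign = dict(zip(vars_, vals))
        table[vals] = (ft[tuple(assign[v] for v in fv)] *
                       gt[tuple(assign[v] for v in gv)])
    return (vars_, table)

def sum_out(f, var):
    """Marginalize var out of factor f."""
    fv, ft = f
    idx = fv.index(var)
    keep = [v for i, v in enumerate(fv) if i != idx]
    table = {}
    for vals, p in ft.items():
        key = tuple(v for i, v in enumerate(vals) if i != idx)
        table[key] = table.get(key, 0.0) + p
    return (keep, table)

# e.g. a prior on A times P(B|A), then sum out A -> the marginal on B
prior_A = (['A'], {(0,): 0.4, (1,): 0.6})
b_given_a = (['A', 'B'], {(0, 0): 0.9, (0, 1): 0.1,
                          (1, 0): 0.2, (1, 1): 0.8})
marg_B = sum_out(factor_product(prior_A, b_given_a), 'A')
```

Variable elimination is just these two operations applied repeatedly, multiplying in only the factors that mention the variable being summed out.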

  34. Variable elimination (Russell & Norvig)

  35. Outline • Introduction • Exact inference • Brute force enumeration • Variable elimination algorithm • Complexity of exact inference • Belief propagation algorithm • Junction tree algorithm • Linear Gaussian models • Approximate inference

  36. VarElim on loopy graphs. Let us work right-to-left, eliminating variables and adding arcs to ensure that any two terms that co-occur in a factor are connected in the graph. [Figure: a six-node loopy graph shown after eliminating nodes 6, 5, and 4, with fill-in edges added]
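The procedure on this slide can be sketched generically: walk the elimination order, connect each eliminated node's surviving neighbours (the fill-in edges), and track the largest neighbour set seen — that is the induced width w(G, π) used on the next slide. A sketch under the assumption that the graph is given as an adjacency dict:

```python
def induced_width(adj, order):
    """Eliminate nodes in the given order, adding fill-in edges;
    return the induced width (largest neighbour set at elimination time)."""
    adj = {v: set(nbrs) for v, nbrs in adj.items()}   # don't mutate input
    width = 0
    for v in order:
        nbrs = adj.pop(v)
        width = max(width, len(nbrs))
        for a in nbrs:
            adj[a].discard(v)          # remove the eliminated node
            for b in nbrs:
                if a != b:
                    adj[a].add(b)      # fill-in edge among neighbours
    return width

# A chain 1-2-3 has induced width 1; a 4-cycle needs one fill-in edge
# and has induced width 2 (for these natural orders).
w_chain = induced_width({1: {2}, 2: {1, 3}, 3: {2}}, [1, 2, 3])
w_cycle = induced_width({1: {2, 4}, 2: {1, 3}, 3: {2, 4}, 4: {1, 3}},
                        [1, 2, 3, 4])
```

Different orders can give very different widths on the same graph, which is why the choice of elimination order matters so much.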

  37. Complexity of VarElim • Time/space for a single query = O(N K^(w+1)) for N nodes of K states, where w = w(G, π) = width of the graph induced by elimination order π • w* = min_π w(G, π) = treewidth of G • Thm: finding an order to minimize the induced width is NP-complete Yannakakis81 • Does there exist a more efficient exact inference algorithm?

  38. Exact inference is #P-complete Dagum93 • Can reduce 3SAT to exact inference ⇒ NP-hard • Equivalent to counting the number of satisfying assignments ⇒ #P-complete. [Figure: literals A, B, C, D with P(A)=P(B)=P(C)=P(D)=0.5; clauses C1 = A ∨ B ∨ C, C2 = C ∨ D ∨ ¬A, C3 = B ∨ C ∨ ¬D; sentence S = C1 ∧ C2 ∧ C3]

  39. Summary so far • Brute force enumeration: O(K^N) time, O(N K^C) space (where C = max clique size) • VarElim: O(N K^(w+1)) time/space, w = w(G, π) = induced width • Exact inference is #P-complete • Motivates the need for approximate inference

  40. Treewidth. [Figure: the ALARM medical-monitoring network as an example of a high-treewidth loopy graph] Low treewidth: chains (w* = 1) and trees with no loops (w* = max number of parents). High treewidth: an n×n grid with N = n² nodes has w* = O(n) = O(√N) Arnborg85; for general loopy graphs, w* is NP-hard to find.

  41. Graph triangulation Golumbic80 • A graph is triangulated (chordal) if it has no chordless cycles of length > 3 • To triangulate a graph, for each node Xi in order π, ensure all neighbours of Xi form a clique by adding fill-in edges; then remove Xi. [Figure: the six-node example again, eliminating nodes 6, 5, and 4]

  42. Graph triangulation • A graph is triangulated (chordal) if it has no chordless cycles of length > 3. [Figure: a 3×3 grid of nodes 1–9 — not triangulated! It contains a chordless 6-cycle]

  43. Graph triangulation • Triangulation is not just adding triangles. [Figure: the grid with triangles added — still not triangulated: a chordless 4-cycle remains]

  44. Graph triangulation • Triangulation creates large cliques. [Figure: the grid with further fill-in edges — triangulated at last!]

  45. Finding an elimination order • The size of the induced clique depends on the elimination order • Since this is NP-hard to optimize, it is common to apply greedy search techniques Kjaerulff90: at each iteration, eliminate the node that would result in the smallest • number of fill-in edges [min-fill] • resulting clique weight [min-weight] (weight of a clique = product of the number of states per node in the clique) • There are also approximation algorithms Amir01
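The min-fill heuristic described above is short enough to sketch in full. This is a generic illustration on an adjacency-dict graph (min-weight would be analogous, scoring by clique weight instead of fill-in count):

```python
def fill_ins_needed(adj, v):
    """Number of fill-in edges required to eliminate v right now."""
    nbrs = list(adj[v])
    return sum(1 for i in range(len(nbrs)) for j in range(i + 1, len(nbrs))
               if nbrs[j] not in adj[nbrs[i]])

def min_fill_order(adj):
    """Greedy elimination order: repeatedly eliminate the node that
    would add the fewest fill-in edges."""
    adj = {v: set(n) for v, n in adj.items()}
    order = []
    while adj:
        v = min(adj, key=lambda u: fill_ins_needed(adj, u))
        order.append(v)
        nbrs = adj.pop(v)
        for a in nbrs:
            adj[a].discard(v)
            for b in nbrs:
                if a != b:
                    adj[a].add(b)      # add the fill-in edges
    return order
```

Greedy min-fill gives no optimality guarantee, but in practice it often produces orders close to the true treewidth, which is why it is a common default.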

  46. Speedup tricks for VarElim • Remove nodes that are irrelevant to the query • Exploit special forms of P(Xi | Xpa(i)) to sum out variables efficiently

  47. Irrelevant variables. [Figure: network B, E → A → J, M] • M is irrelevant to computing P(j | b): summing it out yields a factor identically equal to 1 • Thm: Xi is irrelevant unless Xi ∈ Ancestors({XQ} ∪ XE) • Here, Ancestors({J} ∪ {B}) = {A, E} • ⇒ hidden leaves (barren nodes) can always be removed

  48. Irrelevant variables. [Same network] • M, B and E are irrelevant to computing P(j | a) • All variables relevant to a query can be found in O(N) time • Variable elimination supports query-specific optimizations

  49. Structured CPDs • Sometimes P(Xi | Xpa(i)) has special structure, which we can exploit computationally: • context-specific independence (e.g. CPD = decision tree) Boutilier96b, Zhang99 • causal independence (e.g. CPD = noisy-OR) Rish98, Zhang96b • determinism Zweig98, Bartels04 • Such “non-graphical” structure complicates the search for the optimal triangulation Bartels04

  50. Outline • Introduction • Exact inference • Brute force enumeration • Variable elimination algorithm • Complexity of exact inference • Belief propagation algorithm • Junction tree algorithm • Linear Gaussian models • Approximate inference
