
Sampling Bayesian Networks






Presentation Transcript


  1. Sampling Bayesian Networks ICS 295 2008

  2. Algorithm Tree

  3. Sampling Fundamentals Given a set of variables X = {X1, X2, …, Xn}, a joint probability distribution $\pi(X)$ and some function g(X), we can compute the expected value of g(X): $E_\pi[g(X)] = \sum_x g(x)\,\pi(x)$

  4. Sampling From (X) A sample St is an instantiation: Given independent, identically distributed samples (iid) S1, S2, …ST from (X), it follows from Strong Law of Large Numbers:

  5. Sampling Challenge • It is hard to generate samples from $\pi(X)$ • Trade-Offs: • Generate samples from Q(X) • Forward Sampling, Likelihood Weighting, IS • Try to find Q(X) close to $\pi(X)$ • Generate dependent samples forming a Markov Chain from $P'(X) \approx \pi(X)$ • Metropolis, Metropolis-Hastings, Gibbs • Try to reduce dependence between samples

  6. Markov Chain • A sequence of random values x0, x1, …, defined on a finite state space D(X), is called a Markov Chain if it satisfies the Markov Property: $P(x^{t+1} = y \mid x^t, x^{t-1}, …, x^0) = P(x^{t+1} = y \mid x^t)$ • If P(xt+1 = y | xt) does not change with t (time homogeneous), then it is often expressed as a transition function, A(x,y). Liu, Ch. 12, p. 245

  7. Markov Chain Monte Carlo • First, define a transition probability P(xt+1 = y | xt) • Pick an initial state x0; the choice is usually not important because it becomes “forgotten” • Generate samples x1, x2, …, sampling each next value from P(X | xt) • If we choose a proper P(xt+1 | xt), we can guarantee that the distribution represented by the samples x0, x1, … converges to $\pi(X)$
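A small illustrative sketch of this recipe (the two-state chain and its numbers are assumptions, not from the slides): pick x0, repeatedly draw the next state from P(· | xt), and watch the empirical state frequencies approach the chain's stationary distribution.

```python
import random

# Transition probabilities P(x_{t+1} | x_t) for a toy 2-state chain.
P = {0: [0.9, 0.1],
     1: [0.4, 0.6]}

x = 0                       # initial state x0 (soon "forgotten")
counts = [0, 0]
for t in range(100_000):
    x = 0 if random.random() < P[x][0] else 1   # sample next state from P(. | x)
    counts[x] += 1
print([c / sum(counts) for c in counts])         # approaches pi = [0.8, 0.2]
```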

  8. Markov Chain Properties • Irreducibility • Periodicity • Recurrence • Reversibility • Ergodicity • Stationary Distribution

  9. Irreducible • A state x is said to be irreducible if under the transition rule one has nonzero probability of moving from x to any other state and then coming back in a finite number of steps. • If one state is irreducible, then all the states must be irreducible. Liu, Ch. 12, p. 249, Def. 12.1.1

  10. Aperiodic • A state x is aperiodic if the greatest common divisor of $\{n : A^n(x,x) > 0\}$ is 1. • If state x is aperiodic and the chain is irreducible, then every state must be aperiodic. Liu, Ch. 12, pp. 249-250, Def. 12.1.1

  11. Recurrence • A state x is recurrent if the chain returns to x with probability 1 • State x is recurrent if and only if $\sum_n A^n(x,x) = \infty$ • Let M(x) be the expected number of steps to return to state x • State x is positive recurrent if M(x) is finite • The recurrent states in a finite state chain are positive recurrent.

  12. Ergodicity • A state x is ergodic if it is aperiodic and positive recurrent. • If all states in a Markov chain are ergodic then the chain is ergodic.

  13. Reversibility • Detailed balance condition: $\pi(x)\,A(x,y) = \pi(y)\,A(y,x)$ • A Markov chain is reversible if there is a $\pi$ satisfying this condition • For a reversible Markov chain, $\pi$ is always a stationary distribution.

  14. Stationary Distribution • If the Markov chain is time-homogeneous, then the vector $\pi(X)$ is a stationary distribution (aka invariant or equilibrium distribution, aka “fixed point”) if its entries sum up to 1 and satisfy $\pi(y) = \sum_x \pi(x)\,A(x,y)$ • An irreducible chain has a stationary distribution if and only if all of its states are positive recurrent. The distribution is unique.

  15. Stationary Distribution In Finite State Space • A stationary distribution always exists but may not be unique • If a finite-state Markov chain is irreducible and aperiodic, the stationary distribution is guaranteed to be unique, and $A^n(x,y) = P(x^n = y \mid x^0 = x)$ converges to a rank-one matrix in which each row is the stationary distribution $\pi$ • Thus, the initial state x0 is not important for convergence: it gets forgotten and we start sampling from the target distribution • However, it is important how long it takes to forget it!
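The rank-one claim is easy to check numerically. A minimal sketch, reusing the assumed 2-state chain from the earlier sketch:

```python
import numpy as np

A = np.array([[0.9, 0.1],
              [0.4, 0.6]])            # irreducible, aperiodic 2-state chain
An = np.linalg.matrix_power(A, 50)    # A^n for large n
print(An)                             # both rows ~ pi = [0.8, 0.2]

pi = An[0]
print(np.allclose(pi @ A, pi))        # pi satisfies pi = pi A (stationarity)
```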

  16. Convergence Theorem • Given a finite-state Markov Chain whose transition function is irreducible and aperiodic, $A^n(x^0, y)$ converges to its invariant distribution $\pi(y)$ geometrically in variation distance: there exist $0 < r < 1$ and $c > 0$ s.t. $\|A^n(x^0, \cdot) - \pi\|_{TV} \le c\,r^n$

  17. Eigenvalue Condition • Convergence to the stationary distribution is driven by the eigenvalues of the matrix A(x,y). • “The chain will converge to its unique invariant distribution if and only if matrix A’s second largest eigenvalue in modulus is strictly less than 1.” • Many proofs of convergence are centered around analyzing the second eigenvalue. Liu, Ch. 12, p. 249

  18. Convergence In Finite State Space • Assume a finite-state Markov chain is irreducible and aperiodic • The initial state x0 is not important for convergence: it gets forgotten and we start sampling from the target distribution • However, it is important how long it takes to forget it! – known as the burn-in time • Since the first k states are not drawn exactly from $\pi$, they are often thrown away. Open question: how big a k?

  19. Sampling in BN • Same idea: generate a set of T samples • Estimate P(Xi|E) from the samples • Challenge: X is a vector and P(X) is a huge distribution represented by the BN • Need to know: • How to generate a new sample? • How many samples T do we need? • How to estimate P(E=e) and P(Xi|e)?

  20. Sampling Algorithms • Forward Sampling • Gibbs Sampling (MCMC) • Blocking • Rao-Blackwellised • Likelihood Weighting • Importance Sampling • Sequential Monte-Carlo (Particle Filtering) in Dynamic Bayesian Networks

  21. Gibbs Sampling • Markov Chain Monte Carlo method (Gelfand and Smith, 1990; Smith and Roberts, 1993; Tierney, 1994) • The transition probability equals the conditional distribution • Example: $\pi(X,Y)$ with $A(x^{t+1} \mid y^t) = P(x \mid y)$ and $A(y^{t+1} \mid x^t) = P(y \mid x)$

  22. Gibbs Sampling for BN • Samples are dependent and form a Markov Chain • Sample from P’(X|e), which converges to P(X|e) • Guaranteed to converge when all P > 0 • Methods to improve convergence: • Blocking • Rao-Blackwellisation • Error Bounds • Lag-t autocovariance • Multiple Chains, Chebyshev’s Inequality

  23. Gibbs Sampling (Pearl, 1988) • A sample $x^t$, $t \in \{1, 2, …\}$, is an instantiation of all variables in the network: $x^t = (x_1^t, x_2^t, …, x_N^t)$ • Sampling process: • Fix values of observed variables e • Instantiate node values in sample x0 at random • Generate samples x1, x2, …, xT from P(x|e) • Compute posteriors from samples

  24. Ordered Gibbs Sampler Generate sample $x^{t+1}$ from $x^t$ by processing all variables in some order. In short, for i = 1 to N: $x_i^{t+1} \leftarrow P(x_i \mid x_1^{t+1}, …, x_{i-1}^{t+1}, x_{i+1}^{t}, …, x_N^{t}, e)$

  25. Gibbs Sampling (cont’d) (Pearl, 1988) Markov blanket: $P(x_i \mid x \setminus x_i) = P(x_i \mid markov_i)$, where $markov_i$ consists of $X_i$’s parents, children, and children’s parents, so that $P(x_i \mid markov_i) \propto P(x_i \mid pa_i) \prod_{X_j \in ch_i} P(x_j \mid pa_j)$

  26. Ordered Gibbs Sampling Algorithm Input: X, E. Output: T samples {x^t} • Fix evidence E • Generate samples from P(X | E): • For t = 1 to T (compute samples) • For i = 1 to N (loop through variables) • Sample $x_i^{t+1}$ from $P(X_i \mid markov_i^t)$
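Below is a hedged, minimal Python sketch of this algorithm on a toy chain network X1 → X2 → X3 with binary variables and evidence on X3; the CPT numbers are invented for illustration. Each unobserved variable is resampled from its full conditional, computed here by renormalizing the joint with all other variables held fixed.

```python
import random

# Toy CPTs for the chain X1 -> X2 -> X3 (all numbers are assumptions):
# p1[x1] = P(X1=x1), p2[x1][x2] = P(X2=x2 | X1=x1), p3[x2][x3] = P(X3=x3 | X2=x2)
p1 = [0.6, 0.4]
p2 = [[0.7, 0.3], [0.2, 0.8]]
p3 = [[0.9, 0.1], [0.5, 0.5]]

def joint(x1, x2, x3):
    """Joint probability of a full instantiation of the network."""
    return p1[x1] * p2[x1][x2] * p3[x2][x3]

def ordered_gibbs(T, e3=1):
    """Generate T Gibbs samples of (X1, X2, X3) given evidence X3 = e3."""
    x = [random.randint(0, 1), random.randint(0, 1), e3]   # x^0 at random
    samples = []
    for _ in range(T):
        for i in (0, 1):                    # loop through unobserved variables
            w = []
            for v in (0, 1):                # score each value with others fixed
                x[i] = v
                w.append(joint(*x))
            x[i] = 0 if random.random() < w[0] / (w[0] + w[1]) else 1
        samples.append(tuple(x))
    return samples

samples = ordered_gibbs(10_000)
```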

  27. Answering Queries • Query: P(xi | e) = ? • Method 1: count the fraction of samples where Xi = xi (histogram estimator): $\hat{P}(x_i \mid e) = \frac{1}{T}\sum_{t=1}^{T} \mathbf{1}\{x_i^t = x_i\}$ • Method 2: average the conditional probability (mixture estimator): $\hat{P}(x_i \mid e) = \frac{1}{T}\sum_{t=1}^{T} P(x_i \mid markov_i^t)$
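Continuing the toy sketch above, the two estimators look like this; for X1 in the chain, the Markov blanket is just its child X2, so the mixture estimator averages the exact conditional $P(x_1 \mid x_2^t)$.

```python
def p_x1_given_blanket(x2):
    """Exact P(X1 = 1 | X2 = x2); X1's Markov blanket in the chain is {X2}."""
    w = [p1[v] * p2[v][x2] for v in (0, 1)]
    return w[1] / sum(w)

# Method 1 (histogram): fraction of samples with X1 = 1.
histogram = sum(s[0] for s in samples) / len(samples)
# Method 2 (mixture): average the conditional at each sample.
mixture = sum(p_x1_given_blanket(s[1]) for s in samples) / len(samples)
print(histogram, mixture)   # both estimate P(X1 = 1 | X3 = 1)
```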

  28. Gibbs Sampling Example - BN X = {X1, X2, …, X9}, E = {X9} [figure: the nine-node Bayesian network over X1…X9]

  29. Gibbs Sampling Example - BN Initialize: $X_1 = x_1^0$, $X_2 = x_2^0$, $X_3 = x_3^0$, $X_4 = x_4^0$, $X_5 = x_5^0$, $X_6 = x_6^0$, $X_7 = x_7^0$, $X_8 = x_8^0$

  30. Gibbs Sampling Example - BN Sample $x_1^1$ from $P(X_1 \mid x_2^0, …, x_8^0, x_9)$, E = {X9}

  31. Gibbs Sampling Example - BN Sample $x_2^1$ from $P(X_2 \mid x_1^1, x_3^0, …, x_8^0, x_9)$, E = {X9}

  32. Gibbs Sampling: Illustration

  33. Gibbs Sampling Example – Init Initialize nodes with random values: $X_1 = x_1^0$, $X_2 = x_2^0$, $X_3 = x_3^0$, $X_4 = x_4^0$, $X_5 = x_5^0$, $X_6 = x_6^0$, $X_7 = x_7^0$, $X_8 = x_8^0$ • Initialize running sums: SUM1 = 0, SUM2 = 0, SUM3 = 0, SUM4 = 0, SUM5 = 0, SUM6 = 0, SUM7 = 0, SUM8 = 0

  34. Gibbs Sampling Example – Step 1 • Generate Sample 1 • compute SUM1 += $P(x_1 \mid x_2^0, x_3^0, x_4^0, x_5^0, x_6^0, x_7^0, x_8^0, x_9)$ • select and assign new value $X_1 = x_1^1$ • compute SUM2 += $P(x_2 \mid x_1^1, x_3^0, x_4^0, x_5^0, x_6^0, x_7^0, x_8^0, x_9)$ • select and assign new value $X_2 = x_2^1$ • compute SUM3 += $P(x_3 \mid x_1^1, x_2^1, x_4^0, x_5^0, x_6^0, x_7^0, x_8^0, x_9)$ • select and assign new value $X_3 = x_3^1$ • … • At the end, we have the new sample: $S^1 = \{x_1^1, x_2^1, x_3^1, x_4^1, x_5^1, x_6^1, x_7^1, x_8^1, x_9\}$

  35. Gibbs Sampling Example – Step 2 • Generate Sample 2 • Compute $P(x_1 \mid x_2^1, x_3^1, x_4^1, x_5^1, x_6^1, x_7^1, x_8^1, x_9)$ • select and assign new value $X_1 = x_1^2$ • update SUM1 += $P(x_1 \mid x_2^1, x_3^1, x_4^1, x_5^1, x_6^1, x_7^1, x_8^1, x_9)$ • Compute $P(x_2 \mid x_1^2, x_3^1, x_4^1, x_5^1, x_6^1, x_7^1, x_8^1, x_9)$ • select and assign new value $X_2 = x_2^2$ • update SUM2 += $P(x_2 \mid x_1^2, x_3^1, x_4^1, x_5^1, x_6^1, x_7^1, x_8^1, x_9)$ • Compute $P(x_3 \mid x_1^2, x_2^2, x_4^1, x_5^1, x_6^1, x_7^1, x_8^1, x_9)$ • select and assign new value $X_3 = x_3^2$ • update SUM3 += $P(x_3 \mid x_1^2, x_2^2, x_4^1, x_5^1, x_6^1, x_7^1, x_8^1, x_9)$ • … • New sample: $S^2 = \{x_1^2, x_2^2, x_3^2, x_4^2, x_5^2, x_6^2, x_7^2, x_8^2, x_9\}$

  36. Gibbs Sampling Example – Answering Queries With T = 2 samples, the mixture estimates are: P(x1|x9) = SUM1/2, P(x2|x9) = SUM2/2, P(x3|x9) = SUM3/2, P(x4|x9) = SUM4/2, P(x5|x9) = SUM5/2, P(x6|x9) = SUM6/2, P(x7|x9) = SUM7/2, P(x8|x9) = SUM8/2

  37. Gibbs Convergence • Stationary distribution = target sampling distribution • MCMC converges to the stationary distribution if the network is ergodic • The chain is ergodic if all transition probabilities are positive ($p_{ij} > 0$ between states $S_i$ and $S_j$) • If $\exists\, i, j$ such that $p_{ij} = 0$, then we may not be able to explore the full sampling space!

  38. Gibbs Sampling: Burn-In • We want to sample from P(X | E) • But… the starting point is random • Solution: throw away the first K samples • Known as “burn-in” • What is K? Hard to tell. Use intuition. • Alternative: draw the first sample’s values from an approximation of P(x|e) (for example, run IBP first)

  39. Gibbs Sampling: Performance +Advantage: guaranteed to converge to P(X|E) -Disadvantage: convergence may be slow Problems: • Samples are dependent ! • Statistical variance is too big in high-dimensional problems

  40. Gibbs: Speeding Convergence Objectives: • Reduce dependence between samples (autocorrelation) • Skip samples • Randomize Variable Sampling Order • Reduce variance • Blocking Gibbs Sampling • Rao-Blackwellisation

  41. Skipping Samples • Pick only every k-th sample (Geyer, 1992) • Can reduce dependence between samples! • Increases variance! • Wastes samples!
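A hedged sketch combining burn-in (slide 38) with sample skipping; `draw_next`, standing in for one full Gibbs pass over the variables, is a hypothetical helper, not part of the slides.

```python
def thinned_chain(draw_next, x0, K=1000, k=10, T=500):
    """Run the chain, discard the first K samples (burn-in), keep every k-th."""
    x, kept = x0, []
    for t in range(K + k * T):
        x = draw_next(x)                    # one full Gibbs pass (hypothetical)
        if t >= K and (t - K) % k == 0:     # past burn-in, every k-th sample
            kept.append(x)
    return kept
```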

  42. Randomized Variable Order Random Scan Gibbs Sampler: pick each next variable Xi for update at random with probability $p_i$, $\sum_i p_i = 1$. (In the simplest case, the $p_i$ are distributed uniformly.) In some instances, this reduces variance (MacEachern, Peruggia, 1999, “Subsampling the Gibbs Sampler: Variance Reduction”)

  43. Blocking • Sample several variables together, as a block • Example: Given three variables X, Y, Z, with domains of size 2, group Y and Z together to form a variable W = {Y, Z} with domain size 4. Then, given sample $(x^t, y^t, z^t)$, compute the next sample: $x^{t+1} \leftarrow P(x \mid y^t, z^t) = P(x \mid w^t)$, then $(y^{t+1}, z^{t+1}) = w^{t+1} \leftarrow P(w \mid x^{t+1})$ • + Can improve convergence greatly when two variables are strongly correlated! • - The domain of the block variable grows exponentially with the number of variables in the block!
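A minimal sketch of this blocked scheme; the strictly positive joint below is an assumption chosen for illustration. X is sampled from P(x | w), then the block W = (Y, Z) is sampled jointly from P(y, z | x).

```python
import random

def sample_discrete(weights):
    """Draw an index with probability proportional to the given weights."""
    r = random.random() * sum(weights)
    for i, w in enumerate(weights):
        r -= w
        if r <= 0:
            return i
    return len(weights) - 1

# joint[x][y][z]: an arbitrary strictly positive joint over three binary
# variables (numbers are assumptions; they sum to 1).
joint = [[[0.10, 0.05], [0.15, 0.10]],
         [[0.05, 0.20], [0.05, 0.30]]]

def blocked_gibbs_step(x, y, z):
    """One blocked step: X from P(x | w), then W = (Y, Z) jointly from P(w | x)."""
    x = sample_discrete([joint[xx][y][z] for xx in (0, 1)])
    w = sample_discrete([joint[x][yy][zz] for yy in (0, 1) for zz in (0, 1)])
    y, z = divmod(w, 2)                     # decode the block value back to (y, z)
    return x, y, z
```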

  44. Blocking Gibbs Sampling (Jensen, Kong, Kjaerulff, 1993, “Blocking Gibbs Sampling in Very Large Probabilistic Expert Systems”) • Select a set of subsets E1, E2, E3, …, Ek s.t. $E_i \subseteq X$, $\bigcup_i E_i = X$, and $A_i = X \setminus E_i$ • Sample $P(E_i \mid A_i)$

  45. Rao-Blackwellisation • Do not sample all variables! • Sample a subset! • Example: Given three variables X, Y, Z, sample only X and Y, and sum out Z. Given sample $(x^t, y^t)$, compute the next sample: $x^{t+1} \leftarrow P(x \mid y^t)$, $y^{t+1} \leftarrow P(y \mid x^{t+1})$
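A sketch of the same idea, reusing `sample_discrete` and the toy `joint` from the blocking sketch above: Z is never sampled; X and Y are drawn from conditionals with Z summed out.

```python
def rao_blackwellised_step(x, y):
    """One step over (X, Y) only; Z is marginalized out of every conditional."""
    x = sample_discrete([sum(joint[xx][y]) for xx in (0, 1)])   # weights ∝ P(x, y) = sum_z P(x, y, z)
    y = sample_discrete([sum(joint[x][yy]) for yy in (0, 1)])   # P(y | x), Z summed out
    return x, y
```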

  46. Rao-Blackwell Theorem • $Var[E[g(X) \mid Y]] \le Var[g(X)]$ • Bottom line: reducing the number of variables in a sample reduces variance!

  47. Blocking vs. Rao-Blackwellisation • Standard Gibbs: P(x|y,z), P(y|x,z), P(z|x,y) (1) • Blocking: P(x|y,z), P(y,z|x) (2) • Rao-Blackwellised: P(x|y), P(y|x) (3) • Var3 < Var2 < Var1 [Liu, Wong, Kong, 1994, “Covariance structure of the Gibbs sampler…”]

  48. Rao-Blackwellised Gibbs: Cutset Sampling • Select $C \subseteq X$ (possibly a cycle-cutset), |C| = m • Fix evidence E • Initialize nodes with random values: for i = 1 to m, set $C_i = c_i^0$ • For t = 1 to T, generate samples: for i = 1 to m, $c_i^{t+1} \leftarrow P(c_i \mid c_1^{t+1}, …, c_{i-1}^{t+1}, c_{i+1}^{t}, …, c_m^{t}, e)$
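A hedged skeleton of this loop; `exact_conditional(i, v, c, e)` is a hypothetical helper standing in for exact inference of the unnormalized $P(c_i = v \mid c \setminus c_i, e)$ over the network that remains (cycle-free when C is a cycle-cutset).

```python
import random

def cutset_sampling(exact_conditional, domains, e, T):
    """Gibbs over the cutset C only; all other variables are handled exactly."""
    c = [random.choice(d) for d in domains]          # random initial c^0
    samples = []
    for _ in range(T):
        for i, d in enumerate(domains):              # for i = 1 to m
            weights = [exact_conditional(i, v, c, e) for v in d]
            r, acc = random.random() * sum(weights), 0.0
            for v, w in zip(d, weights):
                acc += w
                if r <= acc:                         # sample c_i ∝ weights
                    c[i] = v
                    break
        samples.append(list(c))
    return samples
```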

  49. Cutset Sampling • Select a subset $C = \{C_1, …, C_K\} \subseteq X$ • A sample $c^t$, $t \in \{1, 2, …\}$, is an instantiation of C: $c^t = (c_1^t, …, c_K^t)$ • Sampling process: • Fix values of observed variables e • Generate sample c0 at random • Generate samples c1, c2, …, cT from P(c|e) • Compute posteriors from samples

  50. Cutset Sampling: Generating Samples Generate sample $c^{t+1}$ from $c^t$. In short, for i = 1 to K: $c_i^{t+1} \leftarrow P(c_i \mid c_1^{t+1}, …, c_{i-1}^{t+1}, c_{i+1}^{t}, …, c_K^{t}, e)$
