
Sampling Bayesian Networks






Presentation Transcript


  1. Sampling Bayesian Networks ICS 295 2008

  2. Algorithm Tree

  3. Sampling Fundamentals Given a set of variables X = {X1, X2, …, Xn}, a joint probability distribution $\pi(X)$ and some function g(X), we can compute the expected value of g(X): $E_\pi[g(X)] = \sum_x g(x)\,\pi(x)$

  4. Sampling From (X) A sample St is an instantiation: Given independent, identically distributed samples (iid) S1, S2, …ST from (X), it follows from Strong Law of Large Numbers:

  5. Sampling Challenge • It is hard to generate samples from $\pi(X)$ • Trade-Offs: • Generate samples from Q(X) • Forward Sampling, Likelihood Weighting, IS • Try to find Q(X) close to $\pi(X)$ • Generate dependent samples forming a Markov Chain from $P'(X) \approx \pi(X)$ • Metropolis, Metropolis-Hastings, Gibbs • Try to reduce dependence between samples

  6. Markov Chain • A sequence of random values x0, x1, …, defined on a finite state space D(X), is called a Markov Chain if it satisfies the Markov Property: $P(x^{t+1} = y \mid x^t, x^{t-1}, …, x^0) = P(x^{t+1} = y \mid x^t)$ • If P(xt+1 = y | xt) does not change with t (time homogeneous), then it is often expressed as a transition function, A(x,y). Liu, Ch. 12, p. 245

  7. Markov Chain Monte Carlo • First, define a transition probability P(xt+1 = y | xt) • Pick an initial state x0; the choice is usually not important because it becomes “forgotten” • Generate samples x1, x2, …, sampling each next value from P(X | xt) • If we choose a proper P(xt+1 | xt), we can guarantee that the distribution represented by the samples x0, x1, … converges to $\pi(X)$
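A small illustrative sketch of this recipe (the two-state chain and its numbers are assumptions, not from the slides): pick x0, repeatedly draw the next state from P(· | xt), and watch the empirical state frequencies approach the chain's stationary distribution.

```python
import random

# Transition probabilities P(x_{t+1} | x_t) for a toy 2-state chain.
P = {0: [0.9, 0.1],
     1: [0.4, 0.6]}

x = 0                       # initial state x0 (soon "forgotten")
counts = [0, 0]
for t in range(100_000):
    x = 0 if random.random() < P[x][0] else 1   # sample next state from P(. | x)
    counts[x] += 1
print([c / sum(counts) for c in counts])         # approaches pi = [0.8, 0.2]
```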

  8. Markov Chain Properties • Irreducibility • Periodicity • Recurrence • Reversibility • Ergodicity • Stationary Distribution

  9. Irreducible • A state x is said to be irreducible if under the transition rule one has nonzero probability of moving from x to any other state and then coming back in a finite number of steps. • If one state is irreducible, then all the states must be irreducible. Liu, Ch. 12, p. 249, Def. 12.1.1

  10. Aperiodic • A state x is aperiodic if the greatest common divisor of $\{n : A^n(x,x) > 0\}$ is 1. • If state x is aperiodic and the chain is irreducible, then every state must be aperiodic. Liu, Ch. 12, pp. 249-250, Def. 12.1.1

  11. Recurrence • A state x is recurrent if the chain returns to x with probability 1 • State x is recurrent if and only if $\sum_n A^n(x,x) = \infty$ • Let M(x) be the expected number of steps to return to state x • State x is positive recurrent if M(x) is finite • The recurrent states in a finite state chain are positive recurrent.

  12. Ergodicity • A state x is ergodic if it is aperiodic and positive recurrent. • If all states in a Markov chain are ergodic then the chain is ergodic.

  13. Reversibility • Detailed balance condition: $\pi(x)\,A(x,y) = \pi(y)\,A(y,x)$ • A Markov chain is reversible if there is a $\pi$ satisfying this condition • For a reversible Markov chain, $\pi$ is always a stationary distribution.

  14. Stationary Distribution • If the Markov chain is time-homogeneous, then the vector $\pi(X)$ is a stationary distribution (aka invariant or equilibrium distribution, aka “fixed point”) if its entries sum up to 1 and satisfy $\pi(y) = \sum_x \pi(x)\,A(x,y)$ • An irreducible chain has a stationary distribution if and only if all of its states are positive recurrent. The distribution is unique.

  15. Stationary Distribution In Finite State Space • A stationary distribution always exists but may not be unique • If a finite-state Markov chain is irreducible and aperiodic, the stationary distribution is guaranteed to be unique, and $A^n(x,y) = P(x^n = y \mid x^0 = x)$ converges to a rank-one matrix in which each row is the stationary distribution $\pi$ • Thus, the initial state x0 is not important for convergence: it gets forgotten and we start sampling from the target distribution • However, it is important how long it takes to forget it!
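The rank-one claim is easy to check numerically. A minimal sketch, reusing the assumed 2-state chain from the earlier sketch:

```python
import numpy as np

A = np.array([[0.9, 0.1],
              [0.4, 0.6]])            # irreducible, aperiodic 2-state chain
An = np.linalg.matrix_power(A, 50)    # A^n for large n
print(An)                             # both rows ~ pi = [0.8, 0.2]

pi = An[0]
print(np.allclose(pi @ A, pi))        # pi satisfies pi = pi A (stationarity)
```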

  16. Convergence Theorem • Given a finite-state Markov Chain whose transition function is irreducible and aperiodic, $A^n(x^0, y)$ converges to its invariant distribution $\pi(y)$ geometrically in variation distance: there exist $0 < r < 1$ and $c > 0$ s.t. $\|A^n(x^0, \cdot) - \pi\|_{TV} \le c\,r^n$

  17. Eigenvalue Condition • Convergence to the stationary distribution is driven by the eigenvalues of the matrix A(x,y). • “The chain will converge to its unique invariant distribution if and only if matrix A’s second largest eigenvalue in modulus is strictly less than 1.” • Many proofs of convergence are centered around analyzing the second eigenvalue. Liu, Ch. 12, p. 249

  18. Convergence In Finite State Space • Assume a finite-state Markov chain is irreducible and aperiodic • The initial state x0 is not important for convergence: it gets forgotten and we start sampling from the target distribution • However, it is important how long it takes to forget it! – known as the burn-in time • Since the first k states are not drawn exactly from $\pi$, they are often thrown away. Open question: how big a k?

  19. Sampling in BN • Same idea: generate a set of T samples • Estimate P(Xi|E) from the samples • Challenge: X is a vector and P(X) is a huge distribution represented by the BN • Need to know: • How to generate a new sample? • How many samples T do we need? • How to estimate P(E=e) and P(Xi|e)?

  20. Sampling Algorithms • Forward Sampling • Gibbs Sampling (MCMC) • Blocking • Rao-Blackwellised • Likelihood Weighting • Importance Sampling • Sequential Monte-Carlo (Particle Filtering) in Dynamic Bayesian Networks

  21. Gibbs Sampling • Markov Chain Monte Carlo method (Gelfand and Smith, 1990; Smith and Roberts, 1993; Tierney, 1994) • The transition probability equals the conditional distribution • Example: $\pi(X,Y)$ with $A(x^{t+1} \mid y^t) = P(x \mid y)$ and $A(y^{t+1} \mid x^t) = P(y \mid x)$

  22. Gibbs Sampling for BN • Samples are dependent and form a Markov Chain • Sample from P’(X|e), which converges to P(X|e) • Guaranteed to converge when all P > 0 • Methods to improve convergence: • Blocking • Rao-Blackwellisation • Error Bounds • Lag-t autocovariance • Multiple Chains, Chebyshev’s Inequality

  23. Gibbs Sampling (Pearl, 1988) • A sample $x^t$, $t \in \{1, 2, …\}$, is an instantiation of all variables in the network: $x^t = (x_1^t, x_2^t, …, x_N^t)$ • Sampling process: • Fix values of observed variables e • Instantiate node values in sample x0 at random • Generate samples x1, x2, …, xT from P(x|e) • Compute posteriors from samples

  24. Ordered Gibbs Sampler Generate sample $x^{t+1}$ from $x^t$ by processing all variables in some order. In short, for i = 1 to N: $x_i^{t+1} \leftarrow P(x_i \mid x_1^{t+1}, …, x_{i-1}^{t+1}, x_{i+1}^{t}, …, x_N^{t}, e)$

  25. Gibbs Sampling (cont’d) (Pearl, 1988) Markov blanket: $P(x_i \mid x \setminus x_i) = P(x_i \mid markov_i)$, where $markov_i$ consists of $X_i$’s parents, children, and children’s parents, so that $P(x_i \mid markov_i) \propto P(x_i \mid pa_i) \prod_{X_j \in ch_i} P(x_j \mid pa_j)$

  26. Ordered Gibbs Sampling Algorithm Input: X, E. Output: T samples {x^t} • Fix evidence E • Generate samples from P(X | E): • For t = 1 to T (compute samples) • For i = 1 to N (loop through variables) • Sample $x_i^{t+1}$ from $P(X_i \mid markov_i^t)$
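Below is a hedged, minimal Python sketch of this algorithm on a toy chain network X1 → X2 → X3 with binary variables and evidence on X3; the CPT numbers are invented for illustration. Each unobserved variable is resampled from its full conditional, computed here by renormalizing the joint with all other variables held fixed.

```python
import random

# Toy CPTs for the chain X1 -> X2 -> X3 (all numbers are assumptions):
# p1[x1] = P(X1=x1), p2[x1][x2] = P(X2=x2 | X1=x1), p3[x2][x3] = P(X3=x3 | X2=x2)
p1 = [0.6, 0.4]
p2 = [[0.7, 0.3], [0.2, 0.8]]
p3 = [[0.9, 0.1], [0.5, 0.5]]

def joint(x1, x2, x3):
    """Joint probability of a full instantiation of the network."""
    return p1[x1] * p2[x1][x2] * p3[x2][x3]

def ordered_gibbs(T, e3=1):
    """Generate T Gibbs samples of (X1, X2, X3) given evidence X3 = e3."""
    x = [random.randint(0, 1), random.randint(0, 1), e3]   # x^0 at random
    samples = []
    for _ in range(T):
        for i in (0, 1):                    # loop through unobserved variables
            w = []
            for v in (0, 1):                # score each value with others fixed
                x[i] = v
                w.append(joint(*x))
            x[i] = 0 if random.random() < w[0] / (w[0] + w[1]) else 1
        samples.append(tuple(x))
    return samples

samples = ordered_gibbs(10_000)
```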

  27. Answering Queries • Query: P(xi | e) = ? • Method 1: count the fraction of samples where Xi = xi (histogram estimator): $\hat{P}(x_i \mid e) = \frac{1}{T}\sum_{t=1}^{T} \mathbf{1}\{x_i^t = x_i\}$ • Method 2: average the conditional probability (mixture estimator): $\hat{P}(x_i \mid e) = \frac{1}{T}\sum_{t=1}^{T} P(x_i \mid markov_i^t)$
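Continuing the toy sketch above, the two estimators look like this; for X1 in the chain, the Markov blanket is just its child X2, so the mixture estimator averages the exact conditional $P(x_1 \mid x_2^t)$.

```python
def p_x1_given_blanket(x2):
    """Exact P(X1 = 1 | X2 = x2); X1's Markov blanket in the chain is {X2}."""
    w = [p1[v] * p2[v][x2] for v in (0, 1)]
    return w[1] / sum(w)

# Method 1 (histogram): fraction of samples with X1 = 1.
histogram = sum(s[0] for s in samples) / len(samples)
# Method 2 (mixture): average the conditional at each sample.
mixture = sum(p_x1_given_blanket(s[1]) for s in samples) / len(samples)
print(histogram, mixture)   # both estimate P(X1 = 1 | X3 = 1)
```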

  28. Gibbs Sampling Example - BN X = {X1, X2, …, X9}, E = {X9} [figure: the nine-node Bayesian network over X1…X9]

  29. Gibbs Sampling Example - BN Initialize: $X_1 = x_1^0$, $X_2 = x_2^0$, $X_3 = x_3^0$, $X_4 = x_4^0$, $X_5 = x_5^0$, $X_6 = x_6^0$, $X_7 = x_7^0$, $X_8 = x_8^0$

  30. Gibbs Sampling Example - BN Sample $x_1^1$ from $P(X_1 \mid x_2^0, …, x_8^0, x_9)$, E = {X9}

  31. Gibbs Sampling Example - BN Sample $x_2^1$ from $P(X_2 \mid x_1^1, x_3^0, …, x_8^0, x_9)$, E = {X9}

  32. Gibbs Sampling: Illustration

  33. Gibbs Sampling Example – Init Initialize nodes with random values: $X_1 = x_1^0$, $X_2 = x_2^0$, $X_3 = x_3^0$, $X_4 = x_4^0$, $X_5 = x_5^0$, $X_6 = x_6^0$, $X_7 = x_7^0$, $X_8 = x_8^0$ • Initialize running sums: SUM1 = 0, SUM2 = 0, SUM3 = 0, SUM4 = 0, SUM5 = 0, SUM6 = 0, SUM7 = 0, SUM8 = 0

  34. Gibbs Sampling Example – Step 1 • Generate Sample 1 • compute SUM1 += $P(x_1 \mid x_2^0, x_3^0, x_4^0, x_5^0, x_6^0, x_7^0, x_8^0, x_9)$ • select and assign new value $X_1 = x_1^1$ • compute SUM2 += $P(x_2 \mid x_1^1, x_3^0, x_4^0, x_5^0, x_6^0, x_7^0, x_8^0, x_9)$ • select and assign new value $X_2 = x_2^1$ • compute SUM3 += $P(x_3 \mid x_1^1, x_2^1, x_4^0, x_5^0, x_6^0, x_7^0, x_8^0, x_9)$ • select and assign new value $X_3 = x_3^1$ • … • At the end, we have the new sample: $S^1 = \{x_1^1, x_2^1, x_3^1, x_4^1, x_5^1, x_6^1, x_7^1, x_8^1, x_9\}$

  35. Gibbs Sampling Example – Step 2 • Generate Sample 2 • Compute $P(x_1 \mid x_2^1, x_3^1, x_4^1, x_5^1, x_6^1, x_7^1, x_8^1, x_9)$ • select and assign new value $X_1 = x_1^2$ • update SUM1 += $P(x_1 \mid x_2^1, x_3^1, x_4^1, x_5^1, x_6^1, x_7^1, x_8^1, x_9)$ • Compute $P(x_2 \mid x_1^2, x_3^1, x_4^1, x_5^1, x_6^1, x_7^1, x_8^1, x_9)$ • select and assign new value $X_2 = x_2^2$ • update SUM2 += $P(x_2 \mid x_1^2, x_3^1, x_4^1, x_5^1, x_6^1, x_7^1, x_8^1, x_9)$ • Compute $P(x_3 \mid x_1^2, x_2^2, x_4^1, x_5^1, x_6^1, x_7^1, x_8^1, x_9)$ • select and assign new value $X_3 = x_3^2$ • update SUM3 += $P(x_3 \mid x_1^2, x_2^2, x_4^1, x_5^1, x_6^1, x_7^1, x_8^1, x_9)$ • … • New sample: $S^2 = \{x_1^2, x_2^2, x_3^2, x_4^2, x_5^2, x_6^2, x_7^2, x_8^2, x_9\}$

  36. Gibbs Sampling Example – Answering Queries With T = 2 samples, the mixture estimates are: P(x1|x9) = SUM1/2, P(x2|x9) = SUM2/2, P(x3|x9) = SUM3/2, P(x4|x9) = SUM4/2, P(x5|x9) = SUM5/2, P(x6|x9) = SUM6/2, P(x7|x9) = SUM7/2, P(x8|x9) = SUM8/2

  37. Gibbs Convergence • Stationary distribution = target sampling distribution • MCMC converges to the stationary distribution if the network is ergodic • The chain is ergodic if all transition probabilities are positive ($p_{ij} > 0$ between states $S_i$ and $S_j$) • If $\exists\, i, j$ such that $p_{ij} = 0$, then we may not be able to explore the full sampling space!

  38. Gibbs Sampling: Burn-In • We want to sample from P(X | E) • But… the starting point is random • Solution: throw away the first K samples • Known as “burn-in” • What is K? Hard to tell. Use intuition. • Alternative: draw the first sample’s values from an approximation of P(x|e) (for example, run IBP first)

  39. Gibbs Sampling: Performance +Advantage: guaranteed to converge to P(X|E) -Disadvantage: convergence may be slow Problems: • Samples are dependent ! • Statistical variance is too big in high-dimensional problems

  40. Gibbs: Speeding Convergence Objectives: • Reduce dependence between samples (autocorrelation) • Skip samples • Randomize Variable Sampling Order • Reduce variance • Blocking Gibbs Sampling • Rao-Blackwellisation

  41. Skipping Samples • Pick only every k-th sample (Geyer, 1992) • Can reduce dependence between samples! • Increases variance! • Wastes samples!
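A hedged sketch combining burn-in (slide 38) with sample skipping; `draw_next`, standing in for one full Gibbs pass over the variables, is a hypothetical helper, not part of the slides.

```python
def thinned_chain(draw_next, x0, K=1000, k=10, T=500):
    """Run the chain, discard the first K samples (burn-in), keep every k-th."""
    x, kept = x0, []
    for t in range(K + k * T):
        x = draw_next(x)                    # one full Gibbs pass (hypothetical)
        if t >= K and (t - K) % k == 0:     # past burn-in, every k-th sample
            kept.append(x)
    return kept
```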

  42. Randomized Variable Order Random Scan Gibbs Sampler: pick each next variable Xi for update at random with probability $p_i$, $\sum_i p_i = 1$. (In the simplest case, the $p_i$ are distributed uniformly.) In some instances, this reduces variance (MacEachern, Peruggia, 1999, “Subsampling the Gibbs Sampler: Variance Reduction”)

  43. Blocking • Sample several variables together, as a block • Example: Given three variables X, Y, Z, with domains of size 2, group Y and Z together to form a variable W = {Y, Z} with domain size 4. Then, given sample $(x^t, y^t, z^t)$, compute the next sample: $x^{t+1} \leftarrow P(x \mid y^t, z^t) = P(x \mid w^t)$, then $(y^{t+1}, z^{t+1}) = w^{t+1} \leftarrow P(w \mid x^{t+1})$ • + Can improve convergence greatly when two variables are strongly correlated! • - The domain of the block variable grows exponentially with the number of variables in the block!
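A minimal sketch of this blocked scheme; the strictly positive joint below is an assumption chosen for illustration. X is sampled from P(x | w), then the block W = (Y, Z) is sampled jointly from P(y, z | x).

```python
import random

def sample_discrete(weights):
    """Draw an index with probability proportional to the given weights."""
    r = random.random() * sum(weights)
    for i, w in enumerate(weights):
        r -= w
        if r <= 0:
            return i
    return len(weights) - 1

# joint[x][y][z]: an arbitrary strictly positive joint over three binary
# variables (numbers are assumptions; they sum to 1).
joint = [[[0.10, 0.05], [0.15, 0.10]],
         [[0.05, 0.20], [0.05, 0.30]]]

def blocked_gibbs_step(x, y, z):
    """One blocked step: X from P(x | w), then W = (Y, Z) jointly from P(w | x)."""
    x = sample_discrete([joint[xx][y][z] for xx in (0, 1)])
    w = sample_discrete([joint[x][yy][zz] for yy in (0, 1) for zz in (0, 1)])
    y, z = divmod(w, 2)                     # decode the block value back to (y, z)
    return x, y, z
```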

  44. Blocking Gibbs Sampling (Jensen, Kong, Kjaerulff, 1993, “Blocking Gibbs Sampling in Very Large Probabilistic Expert Systems”) • Select a set of subsets E1, E2, E3, …, Ek s.t. $E_i \subseteq X$, $\bigcup_i E_i = X$, and $A_i = X \setminus E_i$ • Sample $P(E_i \mid A_i)$

  45. Rao-Blackwellisation • Do not sample all variables! • Sample a subset! • Example: Given three variables X, Y, Z, sample only X and Y, and sum out Z. Given sample $(x^t, y^t)$, compute the next sample: $x^{t+1} \leftarrow P(x \mid y^t)$, $y^{t+1} \leftarrow P(y \mid x^{t+1})$
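A sketch of the same idea, reusing `sample_discrete` and the toy `joint` from the blocking sketch above: Z is never sampled; X and Y are drawn from conditionals with Z summed out.

```python
def rao_blackwellised_step(x, y):
    """One step over (X, Y) only; Z is marginalized out of every conditional."""
    x = sample_discrete([sum(joint[xx][y]) for xx in (0, 1)])   # weights ∝ P(x, y) = sum_z P(x, y, z)
    y = sample_discrete([sum(joint[x][yy]) for yy in (0, 1)])   # P(y | x), Z summed out
    return x, y
```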

  46. Rao-Blackwell Theorem • $Var[E[g(X) \mid Y]] \le Var[g(X)]$ • Bottom line: reducing the number of variables in a sample reduces variance!

  47. Blocking vs. Rao-Blackwellisation • Standard Gibbs: P(x|y,z), P(y|x,z), P(z|x,y) (1) • Blocking: P(x|y,z), P(y,z|x) (2) • Rao-Blackwellised: P(x|y), P(y|x) (3) • Var3 < Var2 < Var1 [Liu, Wong, Kong, 1994, “Covariance structure of the Gibbs sampler…”]

  48. Rao-Blackwellised Gibbs: Cutset Sampling • Select $C \subseteq X$ (possibly a cycle-cutset), |C| = m • Fix evidence E • Initialize nodes with random values: for i = 1 to m, set $C_i = c_i^0$ • For t = 1 to T, generate samples: for i = 1 to m, $c_i^{t+1} \leftarrow P(c_i \mid c_1^{t+1}, …, c_{i-1}^{t+1}, c_{i+1}^{t}, …, c_m^{t}, e)$
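A hedged skeleton of this loop; `exact_conditional(i, v, c, e)` is a hypothetical helper standing in for exact inference of the unnormalized $P(c_i = v \mid c \setminus c_i, e)$ over the network that remains (cycle-free when C is a cycle-cutset).

```python
import random

def cutset_sampling(exact_conditional, domains, e, T):
    """Gibbs over the cutset C only; all other variables are handled exactly."""
    c = [random.choice(d) for d in domains]          # random initial c^0
    samples = []
    for _ in range(T):
        for i, d in enumerate(domains):              # for i = 1 to m
            weights = [exact_conditional(i, v, c, e) for v in d]
            r, acc = random.random() * sum(weights), 0.0
            for v, w in zip(d, weights):
                acc += w
                if r <= acc:                         # sample c_i ∝ weights
                    c[i] = v
                    break
        samples.append(list(c))
    return samples
```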

  49. Cutset Sampling • Select a subset $C = \{C_1, …, C_K\} \subseteq X$ • A sample $c^t$, $t \in \{1, 2, …\}$, is an instantiation of C: $c^t = (c_1^t, …, c_K^t)$ • Sampling process: • Fix values of observed variables e • Generate sample c0 at random • Generate samples c1, c2, …, cT from P(c|e) • Compute posteriors from samples

  50. Cutset Sampling: Generating Samples Generate sample $c^{t+1}$ from $c^t$. In short, for i = 1 to K: $c_i^{t+1} \leftarrow P(c_i \mid c_1^{t+1}, …, c_{i-1}^{t+1}, c_{i+1}^{t}, …, c_K^{t}, e)$
