1 / 94

# Sampling Bayesian Networks - PowerPoint PPT Presentation

Sampling Bayesian Networks. ICS 276 2007. Answering BN Queries. Probability of Evidence P(e) ? NP-hard Conditional Prob. P(x i |e) ? NP-hard MPE x = arg max P(x|e) ? NP-hard MAP y = arg max P(y|e), y  x ? NP PP -hard Approximating P(e) or P(x i |e) within : NP-hard.

I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.

## PowerPoint Slideshow about ' Sampling Bayesian Networks' - turi

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript

### Sampling Bayesian Networks

ICS 276

2007

• Probability of Evidence P(e) ? NP-hard

• Conditional Prob. P(xi|e) ? NP-hard

• MPE x = arg max P(x|e) ? NP-hard

• MAP y = arg max P(y|e), y  x ?

NPPP-hard

• Approximating P(e) or P(xi|e) within : NP-hard

Structural Approximations

• Eliminate some dependencies

• Remove edges

• Mini-Bucket Approach

Search

Approach for optimization tasks: MPE, MAP

Sampling

Generate random samples and compute values of interest from samples,

not original network

• Input: Bayesian network with set of nodes X

• Sample = a tuple with assigned values

s=(X1=x1,X2=x2,… ,Xk=xk)

• Tuple may include all variables (except evidence) or a subset

• Sampling schemas dictate how to generate samples (tuples)

• Ideally, samples are distributed according to P(X|E)

Given a set of variables X = {X1, X2, … Xn} that represent joint probability distribution (X) and some function g(X), we can compute expected value of g(X) :

Sampling From (X)

A sample St is an instantiation:

Given independent, identically distributed samples (iid) S1, S2, …ST from (X), it follows from Strong Law of Large Numbers:

• Given random variable X, D(X)={0, 1}

• Given P(X) = {0.3, 0.7}

• Generate k samples: 0,1,1,1,0,1,1,0,1

• Approximate P’(X):

• Given random variable X, D(X)={0, 1}

• Given P(X) = {0.3, 0.7}

• Sample X  P (X)

• draw random number r  [0, 1]

• If (r < 0.3) then set X=0

• Else set X=1

• Can generalize for any domain size

• Same Idea: generate a set of samples T

• Estimate P(Xi|E) from samples

• Challenge: X is a vector and P(X) is a huge distribution represented by BN

• Need to know:

• How to generate a new sample ?

• How many samples T do we need ?

• How to estimate P(E=e) and P(Xi|e) ?

• Forward Sampling

• Gibbs Sampling (MCMC)

• Blocking

• Rao-Blackwellised

• Likelihood Weighting

• Importance Sampling

• Sequential Monte-Carlo (Particle Filtering) in Dynamic Bayesian Networks

• Forward Sampling

• Case with No evidence E={}

• Case with Evidence E=e

• # samples N and Error Bounds

Forward Sampling No Evidence(Henrion 1988)

Input: Bayesian network

X= {X1,…,XN}, N- #nodes, T - # samples

Output: T samples

Process nodes in topological order – first process the ancestors of a node, then the node itself:

• For t = 0 to T

• For i = 0 to N

• Xi sample xit from P(xi | pai)

0

0.3

1

Sampling A Value

What does it mean to sample xit from P(Xi | pai) ?

• Assume D(Xi)={0,1}

• Assume P(Xi | pai) = (0.3, 0.7)

• Draw a random number r from [0,1]

If r falls in [0,0.3], set Xi = 0

If r falls in [0.3,1], set Xi=1

X1

X3

X2

X4

estimate P(Xi = xi) :

Basically, count the proportion of samples where Xi = xi

Input: Bayesian network

X= {X1,…,XN}, N- #nodes

E – evidence, T - # samples

Output: T samples consistent with E

• For t=1 to T

• For i=1 to N

• Xi sample xit from P(xi | pai)

• If Xi in E and Xi xi, reject sample:

• i = 1 and go to step 2

X1

X3

X2

X4

Let Y be a subset of evidence nodes s.t. Y=u

Theorem: Let s(y) be the estimate of P(y) resulting from a randomly chosen sample set S with T samples. Then, to guarantee relative error at most with probability at least 1- it is enough to have:

Derived from Chebychev’s Bound.

Theorem: Let s(y) be the estimate of P(y) resulting from a randomly chosen sample set S with T samples. Then, to guarantee relative error at most with probability at least 1- it is enough to have:

Derived from Hoeffding’s Bound (full proof is given in Koller).

• P(xi | pa(xi)) is readily available

• Samples are independent !

Drawbacks:

• If evidence E is rare (P(e) is low), then we will reject most of the samples!

• Since P(y) in estimate of T is unknown, must estimate P(y) from samples themselves!

• If P(e) is small, T will become very big!

• Forward Sampling

• High Rejection Rate

• Fix evidence values

• Gibbs sampling (MCMC)

• Likelihood Weighting

• Importance Sampling

• {henrion88} M. Henrion, "Propagating uncertainty in Bayesian networks by probabilistic logic sampling”, Uncertainty in AI, pp. = 149-163,1988

• Markov Chain Monte Carlo method

(Gelfand and Smith, 1990, Smith and Roberts, 1993, Tierney, 1994)

• Samples are dependent, form Markov Chain

• Sample from P’(X|e)which converges toP(X|e)

• Guaranteed to converge when all P > 0

• Methods to improve convergence:

• Blocking

• Rao-Blackwellised

• Error Bounds

• Lag-t autocovariance

• Multiple Chains, Chebyshev’s Inequality

• A sample t[1,2,…],is an instantiation of all variables in the network:

• Sampling process

• Fix values of observed variables e

• Instantiate node values in sample x0 at random

• Generate samples x1,x2,…xT from P(x|e)

• Compute posteriors from samples

Generate sample xt+1 from xt :

In short, for i=1 to N:

Process

All

Variables

In Some

Order

Gibbs Sampling (cont’d)(Pearl, 1988)

Markov blanket:

Input: X, E

Output: T samples {xt }

• Fix evidence E

• Generate samples from P(X | E)

• For t = 1 to T (compute samples)

• For i = 1 to N (loop through variables)

• Xi sample xit from P(Xi | markovt \ Xi)

• Query: P(xi |e) = ?

• Method 1: count #of samples where Xi=xi:

Method 2: average probability (mixture estimator):

X = {X1,X2,…,X9}

E = {X9}

X1

X3

X6

X2

X5

X8

X9

X4

X7

X1 = x10X6 = x60

X2 = x20X7 = x70

X3 = x30X8 = x80

X4 = x40

X5 = x50

X1

X3

X6

X2

X5

X8

X9

X4

X7

X1 P (X1 |X02,…,X08 ,X9}

E = {X9}

X1

X3

X6

X2

X5

X8

X9

X4

X7

X2 P(X2 |X11,…,X08 ,X9}

E = {X9}

X1

X3

X6

X2

X5

X8

X9

X4

X7

Gibbs Sampling: Burn-In

• We want to sample from P(X | E)

• But…starting point is random

• Solution: throw away first K samples

• Known As “Burn-In”

• What is K ? Hard to tell. Use intuition.

• Alternatives: sample first sample valkues from approximate P(x|e) (for example, run IBP first)

• Converge to stationary distribution * :

* = * P

where P is a transition kernel

pij = P(Xi Xj)

• Guaranteed to converge iff chain is :

• irreducible

• aperiodic

• ergodic ( i,j pij > 0)

• A Markov chain (or its probability transition matrix) is said to be irreducible if it is possible to reach every state from every other state (not necessarily in one step).

• In other words, i,j k : P(k)ij > 0 where k is the number of steps taken to get to state j from state i.

• Define d(i) = g.c.d.{n > 0 | it is possible to go from i to i in n steps}. Here, g.c.d. means the greatest common divisor of the integers in the set. If d(i)=1 for i, then chain is aperiodic.

• A recurrent state is a state to which the chain returns with probability 1:

nP(n)ij = 

• Recurrent, aperiodic states are ergodic.

Note: an extra condition for ergodicity is that expected recurrence time is finite. This holds for recurrent states in a finite state chain.

• Gibbs convergence is generally guaranteed as long as all probabilities are positive!

• Intuition for ergodicity requirement: if nodes X and Y are correlated s.t. X=0 Y=0, then:

• once we sample and assign X=0, then we are forced to assign Y=0;

• once we sample and assign Y=0, then we are forced to assign X=0;

 we will never be able to change their values again!

• Another problem: it can take a very long time to converge!

+Advantage: guaranteed to converge to P(X|E)

Problems:

• Samples are dependent !

• Statistical variance is too big in high-dimensional problems

Objectives:

• Reduce dependence between samples (autocorrelation)

• Skip samples

• Randomize Variable Sampling Order

• Reduce variance

• Blocking Gibbs Sampling

• Rao-Blackwellisation

• Pick only every k-th sample (Gayer, 1992)

Can reduce dependence between samples !

Increases variance ! Waists samples !

Random Scan Gibbs Sampler

Pick each next variable Xi for update at random with probability pi , i pi = 1.

(In the simplest case, pi are distributed uniformly.)

In some instances, reduces variance (MacEachern, Peruggia, 1999

“Subsampling the Gibbs Sampler: Variance Reduction”)

• Sample several variables together, as a block

• Example: Given three variables X,Y,Z, with domains of size 2, group Y and Z together to form a variable W={Y,Z} with domain size 4. Then, given sample (xt,yt,zt), compute next sample:

Xt+1 P(yt,zt)=P(wt)

(yt+1,zt+1)=Wt+1 P(xt+1)

+ Can improve convergence greatly when two variables are strongly correlated!

- Domain of the block variable grows exponentially with the #variables in a block!

Jensen, Kong, Kjaerulff, 1993

“Blocking Gibbs Sampling Very Large Probabilistic Expert Systems”

• Select a set of subsets:

E1, E2, E3, …, Ek s.t. Ei X

Ui Ei = X

Ai = X \ Ei

• Sample P(Ei | Ai)

• Do not sample all variables!

• Sample a subset!

• Example: Given three variables X,Y,Z, sample only X and Y, sum out Z. Given sample (xt,yt), compute next sample:

Xt+1 P(yt)

yt+1  P(xt+1)

Bottom line: reducing number of variables in a sample reduce variance!

• Standard Gibbs:

P(x|y,z),P(y|x,z),P(z|x,y) (1)

• Blocking:

P(x|y,z), P(y,z|x) (2)

• Rao-Blackwellised:

P(x|y), P(y|x) (3)

Var3 < Var2 < Var1

[Liu, Wong, Kong, 1994

Covariance structure of the Gibbs sampler…]

X

Y

Z

Rao-Blackwellised Gibbs: Cutset Sampling

• Select C  X(possibly cycle-cutset), |C| = m

• Fix evidence E

• Initialize nodes with random values:

For i=1 to m: ci to Ci = c0i

• For t=1 to n , generate samples:

For i=1 to m:

Ci=cit+1 P(ci|c1 t+1,…,ci-1 t+1,ci+1t,…,cmt ,e)

• Select a subset C={C1,…,CK} X

• A sample t[1,2,…],is an instantiation of C:

• Sampling process

• Fix values of observed variables e

• Generate sample c0 at random

• Generate samples c1,c2,…cT from P(c|e)

• Compute posteriors from samples

Cutset SamplingGenerating Samples

Generate sample ct+1 from ct :

In short, for i=1 to K:

Rao-Blackwellised Gibbs: Cutset Sampling

How to compute P(ci|ct\ci, e) ?

• Compute joint P(ci, ct\ci, e) for each ci D(Ci)

• Then normalize:

P(ci| ct\ci , e) =  P(ci, ct\ci , e)

• Computation efficiency depends

on choice of C

Rao-Blackwellised Gibbs: Cutset Sampling

How to choose C ?

• Special case: C is cycle-cutset, O(N)

• General case: apply Bucket Tree Elimination (BTE), O(exp(w)) where w is the induced width of the network when nodes in C are observed.

• Pick C wisely so as to minimize w  notion of w-cutset

• C=w-cutset of the network, a set of nodes such that when C and E are instantiated, the adjusted induced width of the network is w

• Complexity of exact inference:

bounded by w !

• cycle-cutset is a special case

• Query: ci C, P(ci |e)=? same as Gibbs:

• Special case of w-cutset

computed while generating sample t

• Query: P(xi |e) = ?

compute after generating sample t

X1

X2

X3

X4

X5

X6

X9

X7

X8

E=x9

Sample a new value for X2 :

X1

X2

X3

X4

X5

X6

X9

X7

X8

Sample a new value for X5 :

X1

X2

X3

X4

X5

X6

X9

X7

X8

Query P(x2|e) for sampling node X2 :

Sample 1

X1

X2

X3

Sample 2

X4

X5

X6

Sample 3

X9

X7

X8

Query P(x3 |e) for non-sampled node X3 :

X1

X2

X3

X4

X5

X6

X9

X7

X8

Objectives:

• Estimate needed number of samples T

• Estimate error

Methodology:

• 1 chain  use lag-k autocovariance

• Estimate T

• M chains  standard sampling variance

• Estimate Error

Lag-k autocovariance

Estimate Monte Carlo variance:

Here,  is the smallest positive integer satisfying:

Effective chain size:

In absense of autocovariance:

• Generate M chains of size K

• Each chain produces independent estimate Pm:

Estimate P(xi|e) as average of Pm (xi|e) :

Treat Pm as independent random variables.

{ Pm } are independent random variables

Therefore:

• Geman, S. & Geman D., 1984. Stocahstic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans.Pat.Anal.Mach.Intel. 6, 721-41.

• Introduce Gibbs sampling;

• Place the idea of Gibbs sampling in a general setting in which the collection of variables is structured in a graphical model and each variable has a neighborhood corresponding to a local region of the graphical structure. Geman and Geman use the Gibbs distribution to define the joint distribution on this structured set of variables.

• Tanner and Wong (1987)

• Data-augmentation

• Convergence Results

• Pearl,1988. Probabilistic Reasoning in Intelligent Systems, Morgan-Kaufmann.

• In the case of Bayesian networks, the neighborhoods correspond to the Markov blanket of a variable and the joint distribution is defined by the factorization of the network.

• Gelfand, A.E. and Smith, A.F.M., 1990. Sampling-based approaches to calculating marginal densities. J. Am.Statist. Assoc. 85, 398-409.

• Show variance reduction in using mixture estimator for posterior marginals.

• R. M. Neal, 1992. Connectionist learning of belief networks, Artifical Intelligence, v. 56, pp. 71-118.

• Stochastic simulation in Noisy-Or networks.

MSE vs. #samples (left) and time (right)

Ergodic, |X| = 54, D(Xi) = 2, |C| = 15, |E| = 4

Exact Time = 30 sec using Cutset Conditioning

MSE vs. #samples (left) and time (right)

Non-Ergodic (1 deterministic CPT entry)

|X| = 179, |C| = 8, 2<= D(Xi)<=4, |E| = 35

Exact Time = 122 sec using Loop-Cutset Conditioning

MSE vs. #samples (left) and time (right)

Ergodic, |X| = 360, D(Xi)=2, |C| = 21, |E| = 36

Exact Time > 60 min using Cutset Conditioning

Exact Values obtained via Bucket Elimination

MSE vs. #samples (left) and time (right)

|X| = 100, D(Xi) =2,|C| = 13, |E| = 15-20

Exact Time = 30 sec using Cutset Conditioning

x1

x2

x3

x4

u1

u2

u3

u4

p1

p2

p3

p4

y1

y2

y3

y4

MSE vs. time (right)

Non-Ergodic, |X| = 100, D(Xi)=2, |C| = 13-16, |E| = 50

Sample Ergodic Subspace U={U1, U2,…Uk}

Exact Time = 50 sec using Cutset Conditioning

MSE vs. #samples (left) and time (right)

Non-Ergodic, |X| = 56, |C| = 5, 2 <=D(Xi) <=11, |E| = 0

Exact Time = 2 sec using Loop-Cutset Conditioning

MSE vs. Time

Non-Ergodic, |X| = 360, |C| = 26, D(Xi)=2

Exact Time = 50 min using BTE

Likelihood Weighting(Fung and Chang, 1990; Shachter and Peot, 1990)

“Clamping” evidence+

forward sampling+

weighing samples by evidence likelihood

Works well for likelyevidence!

Sample in topological order over X !

e

e

e

e

e

e

e

e

e

xiP(Xi|pai)

P(Xi|pai) is a look-up in CPT!

Estimate Posterior Marginals:

• Converges to exact posterior marginals

• Generates Samples Fast

• Sampling distribution is close to prior (especially if E  Leaf Nodes)

• Increasing sampling variance

Convergence may be slow

Many samples with P(x(t))=0 rejected

Likelihood Convergence(Chebychev’s Inequality)

• Assume P(X=x|e) has mean  and variance 2

• Chebychev:

=P(x|e) is unknown => obtain it from samples!

K is a Bernoulli random variable

• Assume P(X=x|e) has mean  and variance 2

• Zero-One Estimation Theory (Karp et al.,1989):

=P(x|e) is unknown => obtain it from samples!

Local Variance Bound (LVB)(Dagum&Luby, 1994)

• Let  be LVB of a binary valued network:

• Using the LVB, the Zero-One Estimator can be re-written:

• In general, it is hard to sample from target distribution P(X|E)

• Generate samples from sampling (proposal) distribution Q(X)

• Weigh each sample against P(X|E)

• Nodes sampled in topological order

• Sampling distribution (for non-instantiated nodes) equal to the prior conditionals