Sampling Bayesian Networks



### Sampling Bayesian Networks

ICS 276

2007

Answering BN Queries
• Probability of evidence P(e)? NP-hard
• Conditional probability P(xi|e)? NP-hard
• MPE: x = arg max P(x|e)? NP-hard
• MAP: y = arg max P(y|e), y ⊆ x? NP^PP-hard
• Approximating P(e) or P(xi|e) within ε: NP-hard
Approximation Algorithms

Structural Approximations

• Eliminate some dependencies
• Remove edges
• Mini-Bucket Approach

Search

Approach for optimization tasks: MPE, MAP

Sampling

Generate random samples and compute values of interest from the samples, not from the original network

Sampling
• Input: Bayesian network with set of nodes X
• Sample = a tuple with assigned values

s=(X1=x1,X2=x2,… ,Xk=xk)

• Tuple may include all variables (except evidence) or a subset
• Sampling schemas dictate how to generate samples (tuples)
• Ideally, samples are distributed according to P(X|E)
Sampling Fundamentals

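Given a set of variables X = {X1, X2, …, Xn} with joint probability distribution π(X) and some function g(X), we can compute the expected value of g(X):

$$E_\pi[g] = \sum_{x} g(x)\,\pi(x)$$

Sampling From π(X)

A sample S^t is an instantiation:

$$S^t = \{x_1^t, x_2^t, \dots, x_n^t\}$$

Given independent, identically distributed (iid) samples S^1, S^2, …, S^T from π(X), it follows from the Strong Law of Large Numbers:

$$\bar{g} = \frac{1}{T}\sum_{t=1}^{T} g(S^t) \;\longrightarrow\; E_\pi[g] \quad \text{as } T \to \infty$$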

Sampling Basics
• Given random variable X, D(X)={0, 1}
• Given P(X) = {0.3, 0.7}
• Generate k samples: 0,1,1,1,0,1,1,0,1
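• Approximate P’(X) by the observed frequencies: P’(X) = {3/9, 6/9} ≈ {0.33, 0.67} (3 zeros and 6 ones among the 9 samples listed)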
How to draw a sample ?
• Given random variable X, D(X)={0, 1}
• Given P(X) = {0.3, 0.7}
• Sample X ← P(X)
• draw random number r ∈ [0, 1]
• If (r < 0.3) then set X=0
• Else set X=1
• Can generalize for any domain size
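The interval test above extends to any finite domain by walking the cumulative distribution. A minimal Python sketch, assuming the distribution is given as a list of probabilities indexed by value (the name `sample_discrete` is illustrative and not from the slides):

```python
import random

def sample_discrete(probs, rng=random):
    """Return a value i drawn with probability probs[i]."""
    r = rng.random()              # uniform random number in [0, 1)
    cumulative = 0.0
    for value, p in enumerate(probs):
        cumulative += p
        if r < cumulative:        # r fell into the interval owned by `value`
            return value
    return len(probs) - 1         # guard against floating-point round-off

rng = random.Random(0)
samples = [sample_discrete([0.3, 0.7], rng) for _ in range(10000)]
freq_one = sum(samples) / len(samples)   # empirical estimate of P(X = 1)
```

With P(X) = {0.3, 0.7}, `freq_one` should land close to 0.7, which is the counting approximation P’(X) from the previous slide.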
Sampling in BN
• Same idea: generate a set of T samples
• Estimate P(Xi|E) from samples
• Challenge: X is a vector and P(X) is a huge distribution represented by BN
• Need to know:
• How to generate a new sample ?
• How many samples T do we need ?
• How to estimate P(E=e) and P(Xi|e) ?
Sampling Algorithms
• Forward Sampling
• Gibbs Sampling (MCMC)
• Blocking
• Rao-Blackwellised
• Likelihood Weighting
• Importance Sampling
• Sequential Monte-Carlo (Particle Filtering) in Dynamic Bayesian Networks
Forward Sampling
• Forward Sampling
• Case with No evidence E={}
• Case with Evidence E=e
• # samples N and Error Bounds
Forward Sampling No Evidence(Henrion 1988)

Input: Bayesian network

X= {X1,…,XN}, N- #nodes, T - # samples

Output: T samples

Process nodes in topological order (first process the ancestors of a node, then the node itself):

• For t = 1 to T
• For i = 1 to N
• Xi ← sample xit from P(xi | pai)


Sampling A Value

What does it mean to sample xit from P(Xi | pai) ?

• Assume D(Xi)={0,1}
• Assume P(Xi | pai) = (0.3, 0.7)
• Draw a random number r from [0,1]

If r falls in [0, 0.3), set Xi = 0

If r falls in [0.3, 1], set Xi = 1

Estimate P(Xi = xi):

$$\hat{P}(X_i = x_i) = \frac{\#\text{samples}(X_i = x_i)}{T}$$

Basically, count the proportion of samples where Xi = xi
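Forward sampling without evidence, followed by the counting estimate, can be sketched in a few lines of Python. The three-node chain A → B → C and all of its CPT numbers are hypothetical and only serve the illustration; they do not come from the slides.

```python
import random

# Hypothetical binary chain A -> B -> C (not from the slides).
ORDER = ["A", "B", "C"]                       # a topological order
PARENTS = {"A": [], "B": ["A"], "C": ["B"]}
# CPT[node][parent values] = [P(node=0 | pa), P(node=1 | pa)]
CPT = {
    "A": {(): [0.3, 0.7]},
    "B": {(0,): [0.9, 0.1], (1,): [0.2, 0.8]},
    "C": {(0,): [0.6, 0.4], (1,): [0.1, 0.9]},
}

def sample_value(probs, rng):
    """Draw a value from a discrete distribution via its cumulative sums."""
    r, cumulative = rng.random(), 0.0
    for value, p in enumerate(probs):
        cumulative += p
        if r < cumulative:
            return value
    return len(probs) - 1

def forward_sample(rng):
    """Sample every node in topological order from P(node | parents)."""
    sample = {}
    for node in ORDER:
        pa = tuple(sample[p] for p in PARENTS[node])   # parents already sampled
        sample[node] = sample_value(CPT[node][pa], rng)
    return sample

rng = random.Random(1)
T = 20000
samples = [forward_sample(rng) for _ in range(T)]
p_b1 = sum(s["B"] for s in samples) / T   # proportion of samples with B = 1
```

For this toy network the exact value is P(B=1) = 0.3·0.1 + 0.7·0.8 = 0.59, so `p_b1` should be close to 0.59.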

Forward Sampling w/ Evidence

Input: Bayesian network

X= {X1,…,XN}, N- #nodes

E – evidence, T - # samples

Output: T samples consistent with E

1. For t = 1 to T
2. For i = 1 to N
3. Xi ← sample xit from P(xi | pai)
4. If Xi ∈ E and xit differs from the observed value of Xi, reject the sample:
5. set i = 1 and go to step 2
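The reject-and-restart loop can be sketched as follows. This is a minimal Python illustration on a hypothetical two-node network A → B with evidence on B; the CPT values and the helper names `draw` and `rejection_sample` are assumptions made for the example.

```python
import random

# Hypothetical network A -> B, both binary (not from the slides).
P_A = [0.3, 0.7]
P_B_GIVEN_A = {0: [0.9, 0.1], 1: [0.2, 0.8]}

def draw(probs, rng):
    """Sample a binary value from [P(0), P(1)]."""
    return 0 if rng.random() < probs[0] else 1

def rejection_sample(evidence_b, T, rng):
    """Collect T samples of A that are consistent with B = evidence_b."""
    accepted, attempts = [], 0
    while len(accepted) < T:
        attempts += 1
        a = draw(P_A, rng)
        b = draw(P_B_GIVEN_A[a], rng)
        if b != evidence_b:
            continue              # sampled value contradicts evidence: reject
        accepted.append(a)
    return accepted, attempts

rng = random.Random(2)
accepted, attempts = rejection_sample(1, 5000, rng)
p_a1_given_b1 = sum(accepted) / len(accepted)
acceptance_rate = len(accepted) / attempts   # approximates P(B = 1)
```

Here P(B=1) = 0.59, so roughly 41% of the generated samples are thrown away; the waste grows as the evidence becomes less likely.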
Forward Sampling: Illustration

Let Y be a subset of evidence nodes s.t. Y=u

Forward Sampling - How many samples?

Theorem: Let s(y) be the estimate of P(y) resulting from a randomly chosen sample set S with T samples. Then, to guarantee relative error at most ε with probability at least 1-δ, it is enough to have a number of samples T on the order of 1 / (P(y) ε² δ).

Derived from Chebychev’s Bound.
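One way to obtain a bound of this kind is sketched below. It assumes the estimate is the fraction of T independent samples in which Y = u, writes ε for the relative error and δ for the failure probability, and applies Chebyshev's inequality directly; the constant on the original slide may differ.

```latex
\begin{aligned}
s(y) &= \frac{1}{T}\sum_{t=1}^{T} \mathbb{1}[Y^{t} = u],
\qquad
\mathrm{Var}[s(y)] = \frac{P(y)\,\bigl(1 - P(y)\bigr)}{T} \\[4pt]
\Pr\bigl(|s(y) - P(y)| \ge \epsilon\,P(y)\bigr)
&\le \frac{\mathrm{Var}[s(y)]}{\epsilon^{2} P(y)^{2}}
 = \frac{1 - P(y)}{T\,\epsilon^{2}\,P(y)} \\[4pt]
\frac{1 - P(y)}{T\,\epsilon^{2}\,P(y)} \le \delta
\quad &\text{whenever} \quad
T \ge \frac{1 - P(y)}{P(y)\,\epsilon^{2}\,\delta}
\end{aligned}
```

The required T scales with 1/P(y), which is why rare events need many samples.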

Forward Sampling - How many samples?

Theorem: Let s(y) be the estimate of P(y) resulting from a randomly chosen sample set S with T samples. Then, to guarantee relative error at most with probability at least 1- it is enough to have:

Derived from Hoeffding’s Bound (full proof is given in Koller).
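To get a feel for the two kinds of bound, the sketch below evaluates one standard form of each. Both formulas are assumptions for illustration: the Chebyshev form T ≥ (1 − P(y)) / (P(y) ε² δ) and the multiplicative Chernoff form T ≥ 3 ln(2/δ) / (P(y) ε²) as found in textbook treatments; the constants on the original slides may differ.

```python
import math

def samples_chebyshev(p_y, eps, delta):
    """Sample count from the assumed form T >= (1 - P(y)) / (P(y) eps^2 delta)."""
    return math.ceil((1.0 - p_y) / (p_y * eps ** 2 * delta))

def samples_chernoff(p_y, eps, delta):
    """Sample count from the assumed form T >= 3 ln(2/delta) / (P(y) eps^2)."""
    return math.ceil(3.0 * math.log(2.0 / delta) / (p_y * eps ** 2))

# Relative error 10%, failure probability 5%, P(y) = 0.1.
t_cheb = samples_chebyshev(0.1, 0.1, 0.05)
t_chern = samples_chernoff(0.1, 0.1, 0.05)
# The same accuracy for a much rarer event, P(y) = 0.001.
t_cheb_rare = samples_chebyshev(0.001, 0.1, 0.05)
```

The exponential-type bound depends on δ only through ln(2/δ), so it asks for fewer samples than the Chebyshev form when δ is small. Both still grow as 1/P(y).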

Forward Sampling: Performance

Advantages:

• P(xi | pa(xi)) is readily available
• Samples are independent!

Drawbacks:

• If evidence E is rare (P(e) is low), then we will reject most of the samples!
• Since P(y) appears in the bound on T but is unknown, we must estimate P(y) from the samples themselves!
• If P(e) is small, T will become very big!
Problem: Evidence
• Forward Sampling
  • High rejection rate
• Fix evidence values
  • Gibbs sampling (MCMC)
  • Likelihood Weighting
  • Importance Sampling
Forward Sampling Bibliography
• M. Henrion, "Propagating uncertainty in Bayesian networks by probabilistic logic sampling", Uncertainty in AI, pp. 149-163, 1988
Gibbs Sampling
• Markov Chain Monte Carlo method

(Gelfand and Smith, 1990, Smith and Roberts, 1993, Tierney, 1994)

• Samples are dependent, form Markov Chain
• Sample from P’(X|e), which converges to P(X|e)
• Guaranteed to converge when all P > 0
• Methods to improve convergence:
• Blocking
• Rao-Blackwellised
• Error Bounds
• Lag-t autocovariance
• Multiple Chains, Chebyshev’s Inequality
Gibbs Sampling (Pearl, 1988)
• A sample x^t, t ∈ [1,2,…], is an instantiation of all variables in the network: x^t = {X1 = x1^t, X2 = x2^t, …, XN = xN^t}
• Sampling process
• Fix values of observed variables e
• Instantiate node values in sample x0 at random
• Generate samples x1,x2,…xT from P(x|e)
• Compute posteriors from samples
Ordered Gibbs Sampler

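Generate sample xt+1 from xt:

$$
\begin{aligned}
X_1 &= x_1^{t+1} \leftarrow P(x_1 \mid x_2^{t}, x_3^{t}, \dots, x_N^{t}, e) \\
X_2 &= x_2^{t+1} \leftarrow P(x_2 \mid x_1^{t+1}, x_3^{t}, \dots, x_N^{t}, e) \\
&\;\;\vdots \\
X_N &= x_N^{t+1} \leftarrow P(x_N \mid x_1^{t+1}, x_2^{t+1}, \dots, x_{N-1}^{t+1}, e)
\end{aligned}
$$

In short, for i=1 to N:

$$X_i = x_i^{t+1} \leftarrow \text{sampled from } P(x_i \mid x^{t} \setminus x_i, e)$$

Process all variables in some order.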

Ordered Gibbs Sampling Algorithm

Input: X, E

Output: T samples {xt }

• Fix evidence E
• Generate samples from P(X | E)
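• For t = 1 to T (compute samples)
• For i = 1 to N (loop through variables)
• Xi ← sample xit from P(Xi | markovt \ Xi), where markovt is the current assignment to the Markov blanket of Xi
• Query: P(xi |e) = ?
• Method 1: count the number of samples where Xi = xi:

$$\hat{P}(X_i = x_i \mid e) = \frac{\#\text{samples}(X_i = x_i)}{T}$$

Method 2: average probability (mixture estimator):

$$\hat{P}(X_i = x_i \mid e) = \frac{1}{T}\sum_{t=1}^{T} P(X_i = x_i \mid \text{markov}^{t} \setminus X_i)$$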
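The ordered Gibbs loop and both estimators can be sketched on a tiny example. The chain A → B → C with evidence C = 1, the CPT numbers, and the burn-in length are all assumptions made for illustration. The Markov blanket of A is {B}, and that of B is {A, C}.

```python
import random

# Hypothetical binary chain A -> B -> C (not from the slides).
P_A = [0.3, 0.7]
P_B_GIVEN_A = {0: [0.9, 0.1], 1: [0.2, 0.8]}
P_C_GIVEN_B = {0: [0.6, 0.4], 1: [0.1, 0.9]}

def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]

def cond_a(b):
    """P(A | markov blanket of A) is proportional to P(A) * P(b | A)."""
    return normalize([P_A[a] * P_B_GIVEN_A[a][b] for a in (0, 1)])

def cond_b(a, c):
    """P(B | markov blanket of B) is proportional to P(B | a) * P(c | B)."""
    return normalize([P_B_GIVEN_A[a][b] * P_C_GIVEN_B[b][c] for b in (0, 1)])

def gibbs(c_obs, T, burn_in, rng):
    """Return (count estimate, mixture estimate) of P(A=1 | C=c_obs)."""
    a, b = rng.randint(0, 1), rng.randint(0, 1)    # random starting point
    count_a1, mixture_a1 = 0, 0.0
    for t in range(burn_in + T):
        a = 0 if rng.random() < cond_a(b)[0] else 1
        b = 0 if rng.random() < cond_b(a, c_obs)[0] else 1
        if t >= burn_in:                           # discard burn-in samples
            count_a1 += a                          # Method 1: counting
            mixture_a1 += cond_a(b)[1]             # Method 2: mixture estimator
    return count_a1 / T, mixture_a1 / T

hist_est, mix_est = gibbs(c_obs=1, T=20000, burn_in=500, rng=random.Random(3))
exact = 0.56 / 0.695    # P(A=1 | C=1) computed by hand for these CPTs
```

Both estimates should agree with `exact` (about 0.806) to within sampling noise. The mixture estimate averages probabilities rather than 0/1 indicators, so it typically fluctuates less.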

Gibbs Sampling Example - BN

X = {X1,X2,…,X9}

E = {X9}

[Figure: Bayesian network over nodes X1 to X9, with evidence node X9]

Initial assignment to the non-evidence nodes:

$$X_1 = x_1^0,\; X_2 = x_2^0,\; X_3 = x_3^0,\; X_4 = x_4^0,\; X_5 = x_5^0,\; X_6 = x_6^0,\; X_7 = x_7^0,\; X_8 = x_8^0$$

First Gibbs steps:

$$x_1^1 \leftarrow P(X_1 \mid x_2^0, \dots, x_8^0, x_9)$$

$$x_2^1 \leftarrow P(X_2 \mid x_1^1, x_3^0, \dots, x_8^0, x_9)$$

Gibbs Sampling: Burn-In
• We want to sample from P(X | E)
• But…starting point is random
• Solution: throw away first K samples
• Known As “Burn-In”
• What is K ? Hard to tell. Use intuition.
• Alternative: sample the initial values from an approximation of P(x|e) (for example, run IBP first)
Gibbs Sampling: Convergence
• Converge to stationary distribution π*:

π* = π* P

where P is a transition kernel

pij = P(Xi → Xj)

• Guaranteed to converge iff chain is:
• irreducible
• aperiodic
• ergodic (∀ i,j pij > 0)
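The fixed-point condition π* = π* P can be checked numerically by applying the kernel repeatedly. A small Python sketch with a hypothetical two-state transition matrix whose entries are all positive:

```python
def step(dist, P):
    """One application of the kernel: returns the row vector dist * P."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]

# Hypothetical transition matrix with all p_ij > 0 (rows sum to 1).
P = [[0.9, 0.1],
     [0.5, 0.5]]

dist = [1.0, 0.0]            # start with all mass on state 0
for _ in range(200):
    dist = step(dist, P)

other = [0.0, 1.0]           # a different starting point
for _ in range(200):
    other = step(other, P)
```

Both runs end at the same distribution, (5/6, 1/6), which satisfies π* = π* P. The starting point is forgotten, which is the property burn-in relies on.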
Irreducible
• A Markov chain (or its probability transition matrix) is said to be irreducible if it is possible to reach every state from every other state (not necessarily in one step).
• In other words, ∀ i,j ∃ k : P(k)ij > 0, where k is the number of steps taken to get to state j from state i.
Aperiodic
• Define d(i) = g.c.d.{n > 0 | it is possible to go from i to i in n steps}. Here, g.c.d. means the greatest common divisor of the integers in the set. If d(i) = 1 for all i, then the chain is aperiodic.
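The definition of d(i) can be evaluated directly for a small chain. The sketch below is an illustration only: it inspects return times up to a fixed horizon `max_steps` (an assumption, adequate for small chains) and takes their greatest common divisor.

```python
from math import gcd

def period(P, i, max_steps=50):
    """g.c.d. of the step counts n <= max_steps at which i can return to i."""
    n_states = len(P)
    reach = [k == i for k in range(n_states)]   # states reachable in 0 steps
    d = 0
    for steps in range(1, max_steps + 1):
        # States reachable in exactly `steps` steps from i.
        reach = [any(reach[k] and P[k][j] > 0 for k in range(n_states))
                 for j in range(n_states)]
        if reach[i]:
            d = gcd(d, steps)                   # gcd(0, n) == n
    return d

flip = [[0.0, 1.0],
        [1.0, 0.0]]          # deterministic alternation between two states
lazy = [[0.5, 0.5],
        [1.0, 0.0]]          # state 0 can stay put

period_flip = period(flip, 0)
period_lazy = period(lazy, 0)
```

The alternating chain can only return to a state in an even number of steps, so its period is 2 and it is not aperiodic. A single self-loop is enough to make the second chain aperiodic.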
Ergodicity
• A recurrent state i is a state to which the chain returns with probability 1:

$$\sum_{n} P^{(n)}_{ii} = \infty$$

• Recurrent, aperiodic states are ergodic.

Note: an extra condition for ergodicity is that expected recurrence time is finite. This holds for recurrent states in a finite state chain.

Gibbs Convergence
• Gibbs convergence is generally guaranteed as long as all probabilities are positive!
• Intuition for ergodicity requirement: if nodes X and Y are correlated s.t. X=0 ⇔ Y=0, then:
• once we sample and assign X=0, then we are forced to assign Y=0;
• once we sample and assign Y=0, then we are forced to assign X=0;

⇒ we will never be able to change their values again!

• Another problem: it can take a very long time to converge!
Gibbs Sampling: Performance

+Advantage: guaranteed to converge to P(X|E)

Problems:

• Samples are dependent !
• Statistical variance is too big in high-dimensional problems
Gibbs: Speeding Convergence

Objectives:

• Reduce dependence between samples (autocorrelation)
  • Skip samples
  • Randomize variable sampling order
• Reduce variance
  • Blocking Gibbs sampling
  • Rao-Blackwellisation
Skipping Samples
• Pick only every k-th sample (Geyer, 1992)

+ Can reduce dependence between samples!

- Increases variance! Wastes samples!

Randomized Variable Order

Random Scan Gibbs Sampler

Pick each next variable Xi for update at random with probability pi, Σi pi = 1.

(In the simplest case, pi are distributed uniformly.)

In some instances, reduces variance (MacEachern, Peruggia, 1999, “Subsampling the Gibbs Sampler: Variance Reduction”)

Blocking
• Sample several variables together, as a block
• Example: Given three variables X,Y,Z, with domains of size 2, group Y and Z together to form a variable W={Y,Z} with domain size 4. Then, given sample (xt,yt,zt), compute next sample:

xt+1 ← P(x | yt, zt) = P(x | wt)

(yt+1, zt+1) = wt+1 ← P(w | xt+1)

+ Can improve convergence greatly when two variables are strongly correlated!

- Domain of the block variable grows exponentially with the #variables in a block!

Blocking Gibbs Sampling

Jensen, Kong, Kjaerulff, 1993

“Blocking Gibbs Sampling in Very Large Probabilistic Expert Systems”

• Select a set of subsets:

E1, E2, E3, …, Ek s.t. Ei ⊆ X

∪i Ei = X

Ai = X \ Ei

• Sample P(Ei | Ai)
Rao-Blackwellisation
• Do not sample all variables!
• Sample a subset!
• Example: Given three variables X,Y,Z, sample only X and Y, sum out Z. Given sample (xt,yt), compute next sample:

xt+1 ← P(x | yt)

yt+1 ← P(y | xt+1)

Rao-Blackwell Theorem

Bottom line: reducing the number of variables in a sample reduces variance!
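The variance claim can be motivated with the law of total variance. The sketch below treats independent samples, with (X, Y) the sampled variables and Z the variable that is summed out; for dependent Gibbs samples the corresponding statement needs the separate argument cited on the following slide.

```latex
\begin{aligned}
\mathrm{Var}\bigl[g(X,Y,Z)\bigr]
 &= \mathrm{E}\bigl[\mathrm{Var}[\,g(X,Y,Z) \mid X,Y\,]\bigr]
  + \mathrm{Var}\bigl[\mathrm{E}[\,g(X,Y,Z) \mid X,Y\,]\bigr] \\
 &\ge \mathrm{Var}\bigl[\mathrm{E}[\,g(X,Y,Z) \mid X,Y\,]\bigr]
\end{aligned}
```

Averaging the conditional expectation E[g | X, Y], computed exactly by summing over Z, therefore has variance no larger than averaging g over full samples, while both have the same mean.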

Blocking vs. Rao-Blackwellisation
• Standard Gibbs:

P(x|y,z),P(y|x,z),P(z|x,y) (1)

• Blocking:

P(x|y,z), P(y,z|x) (2)

• Rao-Blackwellised:

P(x|y), P(y|x) (3)

Var3 < Var2 < Var1

[Liu, Wong, Kong, 1994, “Covariance structure of the Gibbs sampler…”]

[Figure: three-node network over X, Y, Z]

Rao-Blackwellised Gibbs: Cutset Sampling
• Select C ⊆ X (possibly cycle-cutset), |C| = m
• Fix evidence E
• Initialize nodes with random values:

For i=1 to m: assign Ci = ci^0

• For t=1 to n, generate samples:

For i=1 to m:

$$C_i = c_i^{t+1} \leftarrow P(c_i \mid c_1^{t+1}, \dots, c_{i-1}^{t+1}, c_{i+1}^{t}, \dots, c_m^{t}, e)$$

Cutset Sampling
• Select a subset C={C1,…,CK} ⊆ X
• A sample c^t, t ∈ [1,2,…], is an instantiation of C:
• Sampling process
• Fix values of observed variables e
• Generate sample c0 at random
• Generate samples c1,c2,…cT from P(c|e)
• Compute posteriors from samples
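Cutset Sampling: Generating Samples

Generate sample ct+1 from ct:

$$
\begin{aligned}
C_1 &= c_1^{t+1} \leftarrow P(c_1 \mid c_2^{t}, c_3^{t}, \dots, c_K^{t}, e) \\
C_2 &= c_2^{t+1} \leftarrow P(c_2 \mid c_1^{t+1}, c_3^{t}, \dots, c_K^{t}, e) \\
&\;\;\vdots \\
C_K &= c_K^{t+1} \leftarrow P(c_K \mid c_1^{t+1}, c_2^{t+1}, \dots, c_{K-1}^{t+1}, e)
\end{aligned}
$$

In short, for i=1 to K:

$$C_i = c_i^{t+1} \leftarrow \text{sampled from } P(c_i \mid c^{t} \setminus c_i, e)$$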

Rao-Blackwellised Gibbs: Cutset Sampling

How to compute P(ci|ct\ci, e) ?

• Compute joint P(ci, ct\ci, e) for each ci ∈ D(Ci)
• Then normalize:

P(ci | ct\ci, e) = α P(ci, ct\ci, e), where α is the normalizing constant

• Computation efficiency depends on the choice of C
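The normalization step is simple once the joint values are available. A minimal Python sketch, assuming the joint P(ci, ct\ci, e) has already been computed by exact inference for every value of Ci; the numbers below are made up for illustration:

```python
def normalize(joint_values):
    """Turn joint values P(ci, rest, e) into the conditional P(ci | rest, e)."""
    total = sum(joint_values.values())
    if total == 0:
        raise ValueError("all joint values are zero; the conditional is undefined")
    alpha = 1.0 / total                      # the normalizing constant
    return {value: alpha * p for value, p in joint_values.items()}

# Hypothetical joint values for a binary cutset node.
joint = {0: 0.012, 1: 0.036}
conditional = normalize(joint)
```

The result is the distribution the sampler draws the next value of Ci from, here 0.25 for value 0 and 0.75 for value 1. The expensive part is producing the joint values, which is where the choice of C matters.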

Rao-Blackwellised Gibbs: Cutset Sampling

How to choose C ?

• Special case: C is cycle-cutset, O(N)
• General case: apply Bucket Tree Elimination (BTE), O(exp(w)) where w is the induced width of the network when nodes in C are observed.
• Pick C wisely so as to minimize w ⇒ notion of w-cutset
w-cutset Sampling
• C = w-cutset of the network: a set of nodes such that when C and E are instantiated, the adjusted induced width of the network is w
• Complexity of exact inference: bounded by w!
• Cycle-cutset is a special case of w-cutset
• Query: ∀ ci ∈ C, P(ci|e) = ? Same as Gibbs, computed while generating sample t:

$$\hat{P}(c_i \mid e) = \frac{1}{T}\sum_{t=1}^{T} P(c_i \mid c^{t} \setminus c_i, e)$$

• Query: P(xi|e) = ? for a non-sampled node Xi, computed after generating sample t:

$$\hat{P}(x_i \mid e) = \frac{1}{T}\sum_{t=1}^{T} P(x_i \mid c^{t}, e)$$

Cutset Sampling Example

[Figure: Bayesian network over nodes X1 to X9, with evidence E = x9]

• Sample a new value for X2
• Sample a new value for X5
• Query P(x2|e) for sampling node X2 (estimated from Sample 1, Sample 2, Sample 3)
• Query P(x3|e) for non-sampled node X3

Gibbs: Error Bounds

Objectives:

• Estimate needed number of samples T
• Estimate error

Methodology:

• 1 chain ⇒ use lag-k autocovariance
  • Estimate T
• M chains ⇒ standard sampling variance
  • Estimate error
Gibbs: lag-k autocovariance

Lag-k autocovariance

Gibbs: lag-k autocovariance

Estimate Monte Carlo variance:

Here, δ is the smallest positive integer satisfying:

Effective chain size:

In absence of autocovariance:
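The single-chain methodology can be sketched with the standard sample autocovariance and a truncated-sum variance estimate. Everything below is an assumption made for illustration rather than a transcription of the slide formulas: the estimator γ(k) = (1/T) Σ (f_t − mean)(f_{t+k} − mean), the rule of truncating the sum at the first adjacent pair of lags whose sum is not positive (a heuristic in the spirit of Geyer, 1992), and the AR(1) test sequence.

```python
import random

def autocovariance(values, k):
    """Biased sample autocovariance of the sequence at lag k."""
    T = len(values)
    mean = sum(values) / T
    return sum((values[t] - mean) * (values[t + k] - mean)
               for t in range(T - k)) / T

def effective_size(values):
    """Effective chain size gamma(0) / Var(mean), with a truncated lag sum."""
    T = len(values)
    gamma0 = autocovariance(values, 0)
    if gamma0 == 0.0:
        return float(T)                       # constant sequence
    total = -gamma0                           # becomes gamma0 + 2 * sum of lags
    m = 0
    while 2 * m + 1 < T:
        pair = autocovariance(values, 2 * m) + autocovariance(values, 2 * m + 1)
        if pair <= 0.0:                       # assumed truncation rule
            break
        total += 2.0 * pair
        m += 1
    if total <= 0.0:
        return float(T)
    mc_variance = total / T                   # estimated variance of the mean
    return gamma0 / mc_variance

rng = random.Random(4)
T = 3000
iid = [rng.gauss(0.0, 1.0) for _ in range(T)]
chain, x = [], 0.0
for _ in range(T):                            # strongly autocorrelated AR(1)
    x = 0.9 * x + rng.gauss(0.0, 1.0)
    chain.append(x)

ess_iid = effective_size(iid)
ess_chain = effective_size(chain)
```

For the independent sequence the effective size stays near T, matching the no-autocovariance case. For the correlated chain it drops to a small fraction of T, which is the information needed to decide how long a Gibbs chain should run.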

Gibbs: Multiple Chains
• Generate M chains of size K
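• Each chain produces an independent estimate Pm:

$$P_m(x_i \mid e) = \frac{1}{K}\sum_{t=1}^{K} P(x_i \mid x^{t} \setminus x_i)$$

Estimate P(xi|e) as the average of Pm(xi|e):

$$\hat{P}(x_i \mid e) = \frac{1}{M}\sum_{m=1}^{M} P_m(x_i \mid e)$$

Treat Pm as independent random variables.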

Gibbs: Multiple Chains

{ Pm } are independent random variables

Therefore:
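Because the chain estimates are independent, ordinary sampling statistics apply to them. A minimal Python sketch that averages M chain estimates and reports the standard error of that average; the five estimates are made-up numbers, and the function name `combine_chains` is illustrative:

```python
import math

def combine_chains(chain_estimates):
    """Average independent chain estimates and return (mean, standard error)."""
    M = len(chain_estimates)
    mean = sum(chain_estimates) / M
    # Unbiased sample variance of the M chain estimates.
    variance = sum((p - mean) ** 2 for p in chain_estimates) / (M - 1)
    std_error = math.sqrt(variance / M)       # standard error of the average
    return mean, std_error

# Hypothetical estimates of P(xi | e) from M = 5 independent chains.
estimates = [0.81, 0.79, 0.80, 0.82, 0.78]
mean, std_error = combine_chains(estimates)
```

A confidence interval around `mean` can then be formed from `std_error`, for example with a Student t quantile on M − 1 degrees of freedom when M is small.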

Geman & Geman, 1984
• Geman, S. & Geman, D., 1984. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. Pat. Anal. Mach. Intel. 6, 721-41.
• Introduce Gibbs sampling;
• Place the idea of Gibbs sampling in a general setting in which the collection of variables is structured in a graphical model and each variable has a neighborhood corresponding to a local region of the graphical structure. Geman and Geman use the Gibbs distribution to define the joint distribution on this structured set of variables.
Tanner & Wong, 1987
• Tanner and Wong (1987)
• Data-augmentation
• Convergence Results
Pearl, 1988
• Pearl, 1988. Probabilistic Reasoning in Intelligent Systems, Morgan-Kaufmann.
• In the case of Bayesian networks, the neighborhoods correspond to the Markov blanket of a variable and the joint distribution is defined by the factorization of the network.
Gelfand & Smith, 1990
• Gelfand, A.E. and Smith, A.F.M., 1990. Sampling-based approaches to calculating marginal densities. J. Am. Statist. Assoc. 85, 398-409.
• Show variance reduction in using mixture estimator for posterior marginals.
Neal, 1992
• R. M. Neal, 1992. Connectionist learning of belief networks, Artificial Intelligence, v. 56, pp. 71-118.
• Stochastic simulation in Noisy-Or networks.
CPCS54 Test Results

MSE vs. #samples (left) and time (right)

Ergodic, |X| = 54, D(Xi) = 2, |C| = 15, |E| = 4

Exact Time = 30 sec using Cutset Conditioning

CPCS179 Test Results

MSE vs. #samples (left) and time (right)

Non-Ergodic (1 deterministic CPT entry)

|X| = 179, |C| = 8, 2<= D(Xi)<=4, |E| = 35

Exact Time = 122 sec using Loop-Cutset Conditioning

CPCS360b Test Results

MSE vs. #samples (left) and time (right)

Ergodic, |X| = 360, D(Xi)=2, |C| = 21, |E| = 36

Exact Time > 60 min using Cutset Conditioning

Exact Values obtained via Bucket Elimination

Random Networks

MSE vs. #samples (left) and time (right)

|X| = 100, D(Xi) =2,|C| = 13, |E| = 15-20

Exact Time = 30 sec using Cutset Conditioning

Coding Networks

[Figure: coding network with nodes x1 to x4, u1 to u4, p1 to p4 and y1 to y4]

MSE vs. time (right)

Non-Ergodic, |X| = 100, D(Xi)=2, |C| = 13-16, |E| = 50

Sample Ergodic Subspace U={U1, U2,…Uk}

Exact Time = 50 sec using Cutset Conditioning

Non-Ergodic Hailfinder

MSE vs. #samples (left) and time (right)

Non-Ergodic, |X| = 56, |C| = 5, 2 <=D(Xi) <=11, |E| = 0

Exact Time = 2 sec using Loop-Cutset Conditioning

Non-Ergodic CPCS360b - MSE

MSE vs. Time

Non-Ergodic, |X| = 360, |C| = 26, D(Xi)=2

Exact Time = 50 min using BTE

Likelihood Weighting

“Clamping” evidence + forward sampling + weighing samples by evidence likelihood

Works well for likely evidence!

Likelihood Weighting

Sample in topological order over X!

[Figure: network with the evidence nodes clamped to their observed values e]

xi ← P(Xi | pai)

P(Xi | pai) is a look-up in CPT!

Likelihood Weighting

Estimate Posterior Marginals:
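In the usual formulation, each sample t gets the weight w^t = Π P(ei | pai) over the evidence nodes, and the posterior marginal is estimated as the weighted fraction Σ w^t [Xi^t = xi] / Σ w^t. The Python sketch below applies this to a hypothetical chain A → B → C with evidence on the leaf C; the CPT numbers and function names are assumptions for illustration.

```python
import random

# Hypothetical binary chain A -> B -> C (not from the slides).
P_A = [0.3, 0.7]
P_B_GIVEN_A = {0: [0.9, 0.1], 1: [0.2, 0.8]}
P_C_GIVEN_B = {0: [0.6, 0.4], 1: [0.1, 0.9]}

def draw(probs, rng):
    """Sample a binary value from [P(0), P(1)]."""
    return 0 if rng.random() < probs[0] else 1

def likelihood_weighting(c_obs, T, rng):
    """Return (estimate of P(A=1 | C=c_obs), estimate of P(C=c_obs))."""
    weighted_a1, total_weight = 0.0, 0.0
    for _ in range(T):
        a = draw(P_A, rng)                    # non-evidence nodes: sample
        b = draw(P_B_GIVEN_A[a], rng)
        w = P_C_GIVEN_B[b][c_obs]             # evidence node: clamp and weigh
        total_weight += w
        weighted_a1 += w * a
    return weighted_a1 / total_weight, total_weight / T

posterior_a1, prob_evidence = likelihood_weighting(1, 20000, random.Random(6))
```

No sample is rejected. For these CPTs the exact values are P(A=1 | C=1) = 0.56 / 0.695 ≈ 0.806 and P(C=1) = 0.695, and the average weight estimates the latter as a by-product.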

Likelihood Weighting
• Converges to exact posterior marginals
• Generates samples fast
• Sampling distribution is close to the prior (especially if E ⊂ leaf nodes)
• Increasing sampling variance:
  • Convergence may be slow
  • Many samples with P(x(t)) = 0 are rejected

Likelihood Convergence (Chebychev’s Inequality)
• Assume P(X=x|e) has mean μ and variance σ²
• Chebychev:

μ = P(x|e) is unknown => obtain it from samples!

Error Bound Derivation

K is a Bernoulli random variable

Likelihood Convergence 2
• Assume P(X=x|e) has mean μ and variance σ²
• Zero-One Estimation Theory (Karp et al., 1989):

μ = P(x|e) is unknown => obtain it from samples!

Local Variance Bound (LVB) (Dagum & Luby, 1994)
• Let Δ be the LVB of a binary-valued network:
• Using the LVB, the Zero-One Estimator can be re-written:
Importance Sampling Idea
• In general, it is hard to sample from target distribution P(X|E)
• Generate samples from sampling (proposal) distribution Q(X)
• Weigh each sample against P(X|E)
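The three steps can be sketched as a self-normalized importance sampling estimator. In the Python sketch below the target is only known up to a constant (as P(x, e) would be), the proposal Q is uniform, and the weights are target / Q; the three-valued distribution and the function name are assumptions for illustration.

```python
import random

def importance_estimate(g, target_unnorm, proposal, T, rng):
    """Estimate E_P[g(X)] from T samples of Q, with weights target / Q."""
    values = list(range(len(proposal)))
    numerator = denominator = 0.0
    for _ in range(T):
        x = rng.choices(values, weights=proposal)[0]   # sample x from Q
        w = target_unnorm[x] / proposal[x]             # importance weight
        numerator += w * g(x)
        denominator += w
    return numerator / denominator                     # self-normalized

# Hypothetical target proportional to (1, 2, 7), i.e. P = (0.1, 0.2, 0.7).
target_unnorm = [1.0, 2.0, 7.0]
proposal = [1 / 3, 1 / 3, 1 / 3]                       # uniform Q
estimate = importance_estimate(lambda x: x, target_unnorm, proposal,
                               20000, random.Random(7))
```

The exact expectation is 0·0.1 + 1·0.2 + 2·0.7 = 1.6. Dividing by the sum of weights removes the unknown normalizing constant of the target, which is why the sampler never needs P(e) itself. Q must be positive wherever the target is.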
Importance Sampling Variants