- By
**turi** - Follow User

- 98 Views
- Uploaded on

Download Presentation
## PowerPoint Slideshow about ' Sampling Bayesian Networks' - turi

**An Image/Link below is provided (as is) to download presentation**

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript

Answering BN Queries

- Probability of Evidence P(e) ? NP-hard
- Conditional Prob. P(xi|e) ? NP-hard
- MPE x = arg max P(x|e) ? NP-hard
- MAP y = arg max P(y|e), y x ?

NPPP-hard

- Approximating P(e) or P(xi|e) within : NP-hard

Approximation Algorithms

Structural Approximations

- Eliminate some dependencies
- Remove edges
- Mini-Bucket Approach

Search

Approach for optimization tasks: MPE, MAP

Sampling

Generate random samples and compute values of interest from samples,

not original network

Sampling

- Input: Bayesian network with set of nodes X
- Sample = a tuple with assigned values

s=(X1=x1,X2=x2,… ,Xk=xk)

- Tuple may include all variables (except evidence) or a subset
- Sampling schemas dictate how to generate samples (tuples)
- Ideally, samples are distributed according to P(X|E)

Sampling Fundamentals

Given a set of variables X = {X1, X2, … Xn} that represent joint probability distribution (X) and some function g(X), we can compute expected value of g(X) :

Sampling From (X)

A sample St is an instantiation:

Given independent, identically distributed samples (iid) S1, S2, …ST from (X), it follows from Strong Law of Large Numbers:

Sampling Basics

- Given random variable X, D(X)={0, 1}
- Given P(X) = {0.3, 0.7}
- Generate k samples: 0,1,1,1,0,1,1,0,1
- Approximate P’(X):

How to draw a sample ?

- Given random variable X, D(X)={0, 1}
- Given P(X) = {0.3, 0.7}
- Sample X P (X)
- draw random number r [0, 1]
- If (r < 0.3) then set X=0
- Else set X=1
- Can generalize for any domain size

Sampling in BN

- Same Idea: generate a set of samples T
- Estimate P(Xi|E) from samples
- Challenge: X is a vector and P(X) is a huge distribution represented by BN
- Need to know:
- How to generate a new sample ?
- How many samples T do we need ?
- How to estimate P(E=e) and P(Xi|e) ?

Sampling Algorithms

- Forward Sampling
- Gibbs Sampling (MCMC)
- Blocking
- Rao-Blackwellised
- Likelihood Weighting
- Importance Sampling
- Sequential Monte-Carlo (Particle Filtering) in Dynamic Bayesian Networks

Forward Sampling

- Forward Sampling
- Case with No evidence E={}
- Case with Evidence E=e
- # samples N and Error Bounds

Forward Sampling No Evidence(Henrion 1988)

Input: Bayesian network

X= {X1,…,XN}, N- #nodes, T - # samples

Output: T samples

Process nodes in topological order – first process the ancestors of a node, then the node itself:

- For t = 0 to T
- For i = 0 to N
- Xi sample xit from P(xi | pai)

0

0.3

1

Sampling A ValueWhat does it mean to sample xit from P(Xi | pai) ?

- Assume D(Xi)={0,1}
- Assume P(Xi | pai) = (0.3, 0.7)
- Draw a random number r from [0,1]

If r falls in [0,0.3], set Xi = 0

If r falls in [0.3,1], set Xi=1

Forward Sampling-Answering Queries

Task: given T samples {S1,S2,…,Sn}

estimate P(Xi = xi) :

Basically, count the proportion of samples where Xi = xi

Forward Sampling w/ Evidence

Input: Bayesian network

X= {X1,…,XN}, N- #nodes

E – evidence, T - # samples

Output: T samples consistent with E

- For t=1 to T
- For i=1 to N
- Xi sample xit from P(xi | pai)
- If Xi in E and Xi xi, reject sample:
- i = 1 and go to step 2

Forward Sampling: Illustration

Let Y be a subset of evidence nodes s.t. Y=u

Forward Sampling –How many samples?

Theorem: Let s(y) be the estimate of P(y) resulting from a randomly chosen sample set S with T samples. Then, to guarantee relative error at most with probability at least 1- it is enough to have:

Derived from Chebychev’s Bound.

Forward Sampling - How many samples?

Theorem: Let s(y) be the estimate of P(y) resulting from a randomly chosen sample set S with T samples. Then, to guarantee relative error at most with probability at least 1- it is enough to have:

Derived from Hoeffding’s Bound (full proof is given in Koller).

Forward Sampling:Performance

Advantages:

- P(xi | pa(xi)) is readily available
- Samples are independent !

Drawbacks:

- If evidence E is rare (P(e) is low), then we will reject most of the samples!
- Since P(y) in estimate of T is unknown, must estimate P(y) from samples themselves!
- If P(e) is small, T will become very big!

Problem: Evidence

- Forward Sampling
- High Rejection Rate
- Fix evidence values
- Gibbs sampling (MCMC)
- Likelihood Weighting
- Importance Sampling

Forward Sampling Bibliography

- {henrion88} M. Henrion, "Propagating uncertainty in Bayesian networks by probabilistic logic sampling”, Uncertainty in AI, pp. = 149-163,1988

Gibbs Sampling

- Markov Chain Monte Carlo method

(Gelfand and Smith, 1990, Smith and Roberts, 1993, Tierney, 1994)

- Samples are dependent, form Markov Chain
- Sample from P’(X|e)which converges toP(X|e)
- Guaranteed to converge when all P > 0
- Methods to improve convergence:
- Blocking
- Rao-Blackwellised
- Error Bounds
- Lag-t autocovariance
- Multiple Chains, Chebyshev’s Inequality

Gibbs Sampling (Pearl, 1988)

- A sample t[1,2,…],is an instantiation of all variables in the network:
- Sampling process
- Fix values of observed variables e
- Instantiate node values in sample x0 at random
- Generate samples x1,x2,…xT from P(x|e)
- Compute posteriors from samples

Ordered Gibbs Sampler

Generate sample xt+1 from xt :

In short, for i=1 to N:

Process

All

Variables

In Some

Order

Gibbs Sampling (cont’d)(Pearl, 1988)

Markov blanket:

Ordered Gibbs Sampling Algorithm

Input: X, E

Output: T samples {xt }

- Fix evidence E
- Generate samples from P(X | E)
- For t = 1 to T (compute samples)
- For i = 1 to N (loop through variables)
- Xi sample xit from P(Xi | markovt \ Xi)

Answering Queries

- Query: P(xi |e) = ?
- Method 1: count #of samples where Xi=xi:

Method 2: average probability (mixture estimator):

Gibbs Sampling Example - BN

X1 = x10X6 = x60

X2 = x20X7 = x70

X3 = x30X8 = x80

X4 = x40

X5 = x50

X1

X3

X6

X2

X5

X8

X9

X4

X7

Gibbs Sampling: Burn-In

- We want to sample from P(X | E)
- But…starting point is random
- Solution: throw away first K samples
- Known As “Burn-In”
- What is K ? Hard to tell. Use intuition.
- Alternatives: sample first sample valkues from approximate P(x|e) (for example, run IBP first)

Gibbs Sampling: Convergence

- Converge to stationary distribution * :

* = * P

where P is a transition kernel

pij = P(Xi Xj)

- Guaranteed to converge iff chain is :
- irreducible
- aperiodic
- ergodic ( i,j pij > 0)

Irreducible

- A Markov chain (or its probability transition matrix) is said to be irreducible if it is possible to reach every state from every other state (not necessarily in one step).
- In other words, i,j k : P(k)ij > 0 where k is the number of steps taken to get to state j from state i.

Aperiodic

- Define d(i) = g.c.d.{n > 0 | it is possible to go from i to i in n steps}. Here, g.c.d. means the greatest common divisor of the integers in the set. If d(i)=1 for i, then chain is aperiodic.

Ergodicity

- A recurrent state is a state to which the chain returns with probability 1:

nP(n)ij =

- Recurrent, aperiodic states are ergodic.

Note: an extra condition for ergodicity is that expected recurrence time is finite. This holds for recurrent states in a finite state chain.

Gibbs Convergence

- Gibbs convergence is generally guaranteed as long as all probabilities are positive!
- Intuition for ergodicity requirement: if nodes X and Y are correlated s.t. X=0 Y=0, then:
- once we sample and assign X=0, then we are forced to assign Y=0;
- once we sample and assign Y=0, then we are forced to assign X=0;

we will never be able to change their values again!

- Another problem: it can take a very long time to converge!

Gibbs Sampling: Performance

+Advantage: guaranteed to converge to P(X|E)

-Disadvantage: convergence may be slow

Problems:

- Samples are dependent !
- Statistical variance is too big in high-dimensional problems

Gibbs: Speeding Convergence

Objectives:

- Reduce dependence between samples (autocorrelation)
- Skip samples
- Randomize Variable Sampling Order
- Reduce variance
- Blocking Gibbs Sampling
- Rao-Blackwellisation

Skipping Samples

- Pick only every k-th sample (Gayer, 1992)

Can reduce dependence between samples !

Increases variance ! Waists samples !

Randomized Variable Order

Random Scan Gibbs Sampler

Pick each next variable Xi for update at random with probability pi , i pi = 1.

(In the simplest case, pi are distributed uniformly.)

In some instances, reduces variance (MacEachern, Peruggia, 1999

“Subsampling the Gibbs Sampler: Variance Reduction”)

Blocking

- Sample several variables together, as a block
- Example: Given three variables X,Y,Z, with domains of size 2, group Y and Z together to form a variable W={Y,Z} with domain size 4. Then, given sample (xt,yt,zt), compute next sample:

Xt+1 P(yt,zt)=P(wt)

(yt+1,zt+1)=Wt+1 P(xt+1)

+ Can improve convergence greatly when two variables are strongly correlated!

- Domain of the block variable grows exponentially with the #variables in a block!

Blocking Gibbs Sampling

Jensen, Kong, Kjaerulff, 1993

“Blocking Gibbs Sampling Very Large Probabilistic Expert Systems”

- Select a set of subsets:

E1, E2, E3, …, Ek s.t. Ei X

Ui Ei = X

Ai = X \ Ei

- Sample P(Ei | Ai)

Rao-Blackwellisation

- Do not sample all variables!
- Sample a subset!
- Example: Given three variables X,Y,Z, sample only X and Y, sum out Z. Given sample (xt,yt), compute next sample:

Xt+1 P(yt)

yt+1 P(xt+1)

Rao-Blackwell Theorem

Bottom line: reducing number of variables in a sample reduce variance!

Blocking vs. Rao-Blackwellisation

- Standard Gibbs:

P(x|y,z),P(y|x,z),P(z|x,y) (1)

- Blocking:

P(x|y,z), P(y,z|x) (2)

- Rao-Blackwellised:

P(x|y), P(y|x) (3)

Var3 < Var2 < Var1

[Liu, Wong, Kong, 1994

Covariance structure of the Gibbs sampler…]

X

Y

Z

Rao-Blackwellised Gibbs: Cutset Sampling

- Select C X(possibly cycle-cutset), |C| = m
- Fix evidence E
- Initialize nodes with random values:

For i=1 to m: ci to Ci = c0i

- For t=1 to n , generate samples:

For i=1 to m:

Ci=cit+1 P(ci|c1 t+1,…,ci-1 t+1,ci+1t,…,cmt ,e)

Cutset Sampling

- Select a subset C={C1,…,CK} X
- A sample t[1,2,…],is an instantiation of C:
- Sampling process
- Fix values of observed variables e
- Generate sample c0 at random
- Generate samples c1,c2,…cT from P(c|e)
- Compute posteriors from samples

Rao-Blackwellised Gibbs: Cutset Sampling

How to compute P(ci|ct\ci, e) ?

- Compute joint P(ci, ct\ci, e) for each ci D(Ci)
- Then normalize:

P(ci| ct\ci , e) = P(ci, ct\ci , e)

- Computation efficiency depends

on choice of C

Rao-Blackwellised Gibbs: Cutset Sampling

How to choose C ?

- Special case: C is cycle-cutset, O(N)
- General case: apply Bucket Tree Elimination (BTE), O(exp(w)) where w is the induced width of the network when nodes in C are observed.
- Pick C wisely so as to minimize w notion of w-cutset

w-cutset Sampling

- C=w-cutset of the network, a set of nodes such that when C and E are instantiated, the adjusted induced width of the network is w
- Complexity of exact inference:

bounded by w !

- cycle-cutset is a special case

Cutset Sampling-Answering Queries

- Query: ci C, P(ci |e)=? same as Gibbs:
- Special case of w-cutset

computed while generating sample t

- Query: P(xi |e) = ?

compute after generating sample t

Cutset Sampling Example

Query P(x2|e) for sampling node X2 :

Sample 1

X1

X2

X3

Sample 2

X4

X5

X6

Sample 3

X9

X7

X8

Gibbs: Error Bounds

Objectives:

- Estimate needed number of samples T
- Estimate error

Methodology:

- 1 chain use lag-k autocovariance
- Estimate T
- M chains standard sampling variance
- Estimate Error

Gibbs: lag-k autocovariance

Lag-k autocovariance

Gibbs: lag-k autocovariance

Estimate Monte Carlo variance:

Here, is the smallest positive integer satisfying:

Effective chain size:

In absense of autocovariance:

Gibbs: Multiple Chains

- Generate M chains of size K
- Each chain produces independent estimate Pm:

Estimate P(xi|e) as average of Pm (xi|e) :

Treat Pm as independent random variables.

Geman&Geman1984

- Geman, S. & Geman D., 1984. Stocahstic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans.Pat.Anal.Mach.Intel. 6, 721-41.
- Introduce Gibbs sampling;
- Place the idea of Gibbs sampling in a general setting in which the collection of variables is structured in a graphical model and each variable has a neighborhood corresponding to a local region of the graphical structure. Geman and Geman use the Gibbs distribution to define the joint distribution on this structured set of variables.

Tanner&Wong 1987

- Tanner and Wong (1987)
- Data-augmentation
- Convergence Results

Pearl1988

- Pearl,1988. Probabilistic Reasoning in Intelligent Systems, Morgan-Kaufmann.
- In the case of Bayesian networks, the neighborhoods correspond to the Markov blanket of a variable and the joint distribution is defined by the factorization of the network.

Gelfand&Smith,1990

- Gelfand, A.E. and Smith, A.F.M., 1990. Sampling-based approaches to calculating marginal densities. J. Am.Statist. Assoc. 85, 398-409.
- Show variance reduction in using mixture estimator for posterior marginals.

Neal, 1992

- R. M. Neal, 1992. Connectionist learning of belief networks, Artifical Intelligence, v. 56, pp. 71-118.
- Stochastic simulation in Noisy-Or networks.

CPCS54 Test Results

MSE vs. #samples (left) and time (right)

Ergodic, |X| = 54, D(Xi) = 2, |C| = 15, |E| = 4

Exact Time = 30 sec using Cutset Conditioning

CPCS179 Test Results

MSE vs. #samples (left) and time (right)

Non-Ergodic (1 deterministic CPT entry)

|X| = 179, |C| = 8, 2<= D(Xi)<=4, |E| = 35

Exact Time = 122 sec using Loop-Cutset Conditioning

CPCS360b Test Results

MSE vs. #samples (left) and time (right)

Ergodic, |X| = 360, D(Xi)=2, |C| = 21, |E| = 36

Exact Time > 60 min using Cutset Conditioning

Exact Values obtained via Bucket Elimination

Random Networks

MSE vs. #samples (left) and time (right)

|X| = 100, D(Xi) =2,|C| = 13, |E| = 15-20

Exact Time = 30 sec using Cutset Conditioning

Coding Networks

x1

x2

x3

x4

u1

u2

u3

u4

p1

p2

p3

p4

y1

y2

y3

y4

MSE vs. time (right)

Non-Ergodic, |X| = 100, D(Xi)=2, |C| = 13-16, |E| = 50

Sample Ergodic Subspace U={U1, U2,…Uk}

Exact Time = 50 sec using Cutset Conditioning

Non-Ergodic Hailfinder

MSE vs. #samples (left) and time (right)

Non-Ergodic, |X| = 56, |C| = 5, 2 <=D(Xi) <=11, |E| = 0

Exact Time = 2 sec using Loop-Cutset Conditioning

Non-Ergodic CPCS360b - MSE

MSE vs. Time

Non-Ergodic, |X| = 360, |C| = 26, D(Xi)=2

Exact Time = 50 min using BTE

Likelihood Weighting(Fung and Chang, 1990; Shachter and Peot, 1990)

“Clamping” evidence+

forward sampling+

weighing samples by evidence likelihood

Works well for likelyevidence!

Likelihood Weighting

Sample in topological order over X !

e

e

e

e

e

e

e

e

e

xiP(Xi|pai)

P(Xi|pai) is a look-up in CPT!

Likelihood Weighting

Estimate Posterior Marginals:

Likelihood Weighting

- Converges to exact posterior marginals
- Generates Samples Fast
- Sampling distribution is close to prior (especially if E Leaf Nodes)
- Increasing sampling variance

Convergence may be slow

Many samples with P(x(t))=0 rejected

Likelihood Convergence(Chebychev’s Inequality)

- Assume P(X=x|e) has mean and variance 2
- Chebychev:

=P(x|e) is unknown => obtain it from samples!

Error Bound Derivation

K is a Bernoulli random variable

Likelihood Convergence 2

- Assume P(X=x|e) has mean and variance 2
- Zero-One Estimation Theory (Karp et al.,1989):

=P(x|e) is unknown => obtain it from samples!

Local Variance Bound (LVB)(Dagum&Luby, 1994)

- Let be LVB of a binary valued network:

LVB Estimate(Pradhan,Dagum,1996)

- Using the LVB, the Zero-One Estimator can be re-written:

Importance Sampling Idea

- In general, it is hard to sample from target distribution P(X|E)
- Generate samples from sampling (proposal) distribution Q(X)
- Weigh each sample against P(X|E)

Importance Sampling Variants

Importance sampling: forward, non-adaptive

- Nodes sampled in topological order
- Sampling distribution (for non-instantiated nodes) equal to the prior conditionals

Importance sampling: forward, adaptive

- Nodes sampled in topological order
- Sampling distribution adapted according to average importance weights obtained in previous samples [Cheng,Druzdzel2000]

AIS-BN

- The most efficient variant of importance sampling to-date is AIS-BN – Adaptive Importance Sampling for Bayesian networks.
- Jian Cheng and Marek J. Druzdzel. AIS-BN: An adaptive importance sampling algorithm for evidential reasoning in large Bayesian networks.Journal of Artificial Intelligence Research (JAIR), 13:155-188, 2000.

Download Presentation

Connecting to Server..