Sampling Bayesian Networks

ICS 276, 2007
Answering BN Queries
  • Probability of evidence P(e)? NP-hard
  • Conditional probability P(xi|e)? NP-hard
  • MPE: x = arg max P(x|e)? NP-hard
  • MAP: y = arg max P(y|e), Y ⊂ X? NP^PP-hard
  • Approximating P(e) or P(xi|e) within relative error ε: NP-hard
Approximation Algorithms

Structural Approximations

  • Eliminate some dependencies
    • Remove edges
    • Mini-Bucket Approach

Search

Approach for optimization tasks: MPE, MAP

Sampling

Generate random samples and compute values of interest from the samples, not from the original network

Sampling
  • Input: Bayesian network with set of nodes X
  • Sample = a tuple with assigned values

s=(X1=x1,X2=x2,… ,Xk=xk)

  • Tuple may include all variables (except evidence) or a subset
  • Sampling schemas dictate how to generate samples (tuples)
  • Ideally, samples are distributed according to P(X|E)
Sampling Fundamentals

Given a set of variables X = {X1, X2, …, Xn} that represent a joint probability distribution π(X) and some function g(X), we can compute the expected value of g(X):

E_π[g(X)] = Σ_x g(x) π(x)

Sampling From π(X)

A sample S^t is an instantiation:

S^t = (x1^t, x2^t, …, xn^t)

Given independent, identically distributed (i.i.d.) samples S^1, S^2, …, S^T from π(X), it follows from the Strong Law of Large Numbers that:

g_T = (1/T) Σ_t g(S^t) → E_π[g(X)] as T → ∞

Sampling Basics
  • Given random variable X, D(X) = {0, 1}
  • Given P(X) = {0.3, 0.7}
  • Generate k samples: 0,1,1,1,0,1,1,0,1
  • Approximate P'(X) from the counts (see the sketch below):

P'(X=0) = 3/9 ≈ 0.33, P'(X=1) = 6/9 ≈ 0.67
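As a concrete illustration of this counting step, here is a minimal Python sketch (not from the slides) that estimates P'(X) from the listed samples:

```python
from collections import Counter

# The nine samples listed on the slide
samples = [0, 1, 1, 1, 0, 1, 1, 0, 1]

# Approximate P'(X) by the relative frequency of each value
counts = Counter(samples)
p_hat = {value: counts[value] / len(samples) for value in (0, 1)}
print(p_hat)  # {0: 0.333..., 1: 0.666...}
```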
How to Draw a Sample?
  • Given random variable X, D(X) = {0, 1}
  • Given P(X) = {0.3, 0.7}
  • Sample X ∼ P(X):
    • draw random number r ∈ [0, 1]
    • If (r < 0.3) then set X = 0
    • Else set X = 1
  • Can generalize to any domain size (see the sketch below)
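A small Python sketch of this inverse-transform step, generalized to any domain size as the slide suggests; the function name `draw_sample` is our own, not from the slides:

```python
import random

def draw_sample(domain, probs):
    """Draw one value from `domain` with the given probabilities
    by splitting [0, 1] into consecutive segments."""
    r = random.random()          # r is uniform on [0, 1)
    cumulative = 0.0
    for value, p in zip(domain, probs):
        cumulative += p
        if r < cumulative:
            return value
    return domain[-1]            # guard against floating-point round-off

# Example from the slide: D(X) = {0, 1}, P(X) = {0.3, 0.7}
print(draw_sample([0, 1], [0.3, 0.7]))
```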
Sampling in BN
  • Same idea: generate a set of T samples
  • Estimate P(Xi|E) from the samples
  • Challenge: X is a vector and P(X) is a huge distribution represented by the BN
  • Need to know:
    • How to generate a new sample ?
    • How many samples T do we need ?
    • How to estimate P(E=e) and P(Xi|e) ?
Sampling Algorithms
  • Forward Sampling
  • Gibbs Sampling (MCMC)
    • Blocking
    • Rao-Blackwellised
  • Likelihood Weighting
  • Importance Sampling
  • Sequential Monte-Carlo (Particle Filtering) in Dynamic Bayesian Networks
Forward Sampling
  • Forward Sampling
    • Case with No evidence E={}
    • Case with Evidence E=e
    • # samples N and Error Bounds
Forward Sampling, No Evidence (Henrion, 1988)

Input: Bayesian network X = {X1,…,XN}, N = #nodes, T = #samples

Output: T samples

Process nodes in topological order – first process the ancestors of a node, then the node itself:

  • For t = 1 to T
  •   For i = 1 to N
  •     Xi ← sample xi^t from P(Xi | pai)
Sampling a Value

[Figure: the unit interval [0, 1] split at 0.3; a random number r falls into one of the two segments.]

What does it mean to sample xi^t from P(Xi | pai)?

  • Assume D(Xi) = {0, 1}
  • Assume P(Xi | pai) = (0.3, 0.7)
  • Draw a random number r from [0, 1]:

If r falls in [0, 0.3), set Xi = 0

If r falls in [0.3, 1], set Xi = 1

Forward Sampling – Answering Queries

Task: given T samples {S^1, S^2, …, S^T}, estimate P(Xi = xi):

P̂(Xi = xi) = (# of samples where Xi = xi) / T

Basically, count the proportion of samples where Xi = xi.

Forward Sampling with Evidence

Input: Bayesian network X = {X1,…,XN}, N = #nodes, E = evidence, T = #samples

Output: T samples consistent with E

  • For t = 1 to T
  •   For i = 1 to N
  •     Xi ← sample xi^t from P(Xi | pai)
  •     If Xi ∈ E and xi^t ≠ ei, reject the sample:
  •       set i = 1 and go to step 2
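A hedged sketch of this rejection loop, reusing the hypothetical `forward_sample` helper from the previous block; samples inconsistent with E are simply thrown away and regenerated:

```python
def rejection_sample(order, parents, domains, cpts, evidence, T):
    """Return T full samples consistent with `evidence` (a dict X -> value).
    Equivalent to restarting at i = 1 whenever an evidence node disagrees."""
    accepted = []
    while len(accepted) < T:
        s = forward_sample(order, parents, domains, cpts)
        if all(s[x] == v for x, v in evidence.items()):
            accepted.append(s)               # keep only consistent samples
    return accepted
```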
Forward Sampling: Illustration

Let Y be a subset of evidence nodes s.t. Y=u

Forward Sampling – How Many Samples?

Theorem: Let ŝ(y) be the estimate of P(y) resulting from a randomly chosen sample set S with T samples. Then, to guarantee a relative error of at most ε with probability at least 1-δ, it is enough to have:

Derived from Chebyshev's Bound.

Forward Sampling – How Many Samples?

Theorem: Let ŝ(y) be the estimate of P(y) resulting from a randomly chosen sample set S with T samples. Then, to guarantee a relative error of at most ε with probability at least 1-δ, it is enough to have:

Derived from Hoeffding's Bound (the full proof is given in Koller).

Forward Sampling: Performance

Advantages:

  • P(xi | pa(xi)) is readily available
  • Samples are independent!

Drawbacks:

  • If evidence E is rare (P(e) is low), then we will reject most of the samples!
  • Since the P(y) appearing in the bound on T is unknown, it must be estimated from the samples themselves!
  • If P(e) is small, T will become very big!
Problem: Evidence
  • Forward Sampling
    • High Rejection Rate
  • Fix evidence values
    • Gibbs sampling (MCMC)
    • Likelihood Weighting
    • Importance Sampling
Forward Sampling Bibliography
  • [Henrion 1988] M. Henrion, "Propagating uncertainty in Bayesian networks by probabilistic logic sampling", Uncertainty in AI, pp. 149-163, 1988.
Gibbs Sampling
  • Markov Chain Monte Carlo method

(Gelfand and Smith, 1990; Smith and Roberts, 1993; Tierney, 1994)

  • Samples are dependent and form a Markov chain
  • Sample from P'(X|e), which converges to P(X|e)
  • Guaranteed to converge when all P > 0
  • Methods to improve convergence:
    • Blocking
    • Rao-Blackwellised
  • Error Bounds
    • Lag-t autocovariance
    • Multiple Chains, Chebyshev’s Inequality
Gibbs Sampling (Pearl, 1988)
  • A sample t ∈ [1,2,…] is an instantiation of all variables in the network:

x^t = (X1 = x1^t, …, XN = xN^t)

  • Sampling process:
    • Fix values of observed variables e
    • Instantiate node values in sample x^0 at random
    • Generate samples x^1, x^2, …, x^T from P(x|e)
    • Compute posteriors from samples
Ordered Gibbs Sampler

Generate sample x^{t+1} from x^t by processing all variables in some order. In short, for i = 1 to N:

Xi^{t+1} ← sampled from P(Xi | x1^{t+1}, …, x_{i-1}^{t+1}, x_{i+1}^t, …, xN^t, e)

Ordered Gibbs Sampling Algorithm

Input: X, E

Output: T samples {x^t}

  • Fix evidence E
  • Generate samples from P(X | E):
  • For t = 1 to T (compute samples)
  •   For i = 1 to N (loop through variables)
  •     Xi ← sample xi^t from P(Xi | markov_t \ Xi)
Answering Queries
  • Query: P(xi | e) = ?
  • Method 1: count the # of samples where Xi = xi (histogram estimator):

P̂(xi | e) = (# of samples with Xi = xi) / T

  • Method 2: average probability (mixture estimator):

P̂(xi | e) = (1/T) Σ_t P(xi | markov_t \ Xi)
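A small sketch of the two estimators, again using the hypothetical `markov_conditional` helper from the previous sketch:

```python
def histogram_estimate(samples, i, xi):
    """Method 1: fraction of samples in which Xi = xi."""
    return sum(s[i] == xi for s in samples) / len(samples)

def mixture_estimate(net, samples, i, xi):
    """Method 2: average of P(Xi = xi | Markov blanket) over the samples."""
    total = 0.0
    for s in samples:
        domain, probs = markov_conditional(net, i, s)
        total += probs[domain.index(xi)]
    return total / len(samples)
```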

Gibbs Sampling Example – BN

X = {X1, X2, …, X9}, E = {X9}

[Figure: a Bayesian network over nodes X1–X9; X9 is observed.]

Gibbs Sampling Example – BN

Initialize the unobserved variables at random:

X1 = x1^0, X2 = x2^0, X3 = x3^0, X4 = x4^0, X5 = x5^0, X6 = x6^0, X7 = x7^0, X8 = x8^0

[Figure: the same network over X1–X9.]

Gibbs Sampling Example – BN

Sample X1:  x1^1 ← P(X1 | x2^0, …, x8^0, x9)

E = {X9}

[Figure: the same network over X1–X9.]

Gibbs Sampling Example – BN

Sample X2:  x2^1 ← P(X2 | x1^1, x3^0, …, x8^0, x9)

E = {X9}

[Figure: the same network over X1–X9.]

Gibbs Sampling: Burn-In
  • We want to sample from P(X | E)
  • But… the starting point is random
  • Solution: throw away the first K samples
  • Known as "burn-in"
  • What is K? Hard to tell. Use intuition.
  • Alternative: initialize the first sample with values drawn from an approximate P(x|e) (for example, run IBP first)
Gibbs Sampling: Convergence
  • Converges to the stationary distribution π*:

π* = π* P

where P is the transition kernel

Pij = P(Xi → Xj)

  • Guaranteed to converge iff the chain is:
    • irreducible
    • aperiodic
    • ergodic (∀ i,j: Pij > 0)
Irreducible
  • A Markov chain (or its probability transition matrix) is said to be irreducible if it is possible to reach every state from every other state (not necessarily in one step).
  • In other words, ∀ i,j ∃ k: P^(k)_ij > 0, where k is the number of steps taken to get from state i to state j.
Aperiodic
  • Define d(i) = g.c.d.{n > 0 | it is possible to go from i to i in n steps}. Here, g.c.d. means the greatest common divisor of the integers in the set. If d(i) = 1 for all i, then the chain is aperiodic.
Ergodicity
  • A recurrent state is a state to which the chain returns with probability 1:

Σ_n P^(n)_ii = ∞

  • Recurrent, aperiodic states are ergodic.

Note: an extra condition for ergodicity is that the expected recurrence time is finite. This holds for recurrent states in a finite-state chain.

Gibbs Convergence
  • Gibbs convergence is generally guaranteed as long as all probabilities are positive!
  • Intuition for the ergodicity requirement: if nodes X and Y are correlated s.t. X=0 ⇔ Y=0, then:
    • once we sample and assign X=0, we are forced to assign Y=0;
    • once we sample and assign Y=0, we are forced to assign X=0;

⇒ we will never be able to change their values again!

  • Another problem: it can take a very long time to converge!
Gibbs Sampling: Performance

+Advantage: guaranteed to converge to P(X|E)

-Disadvantage: convergence may be slow

Problems:

  • Samples are dependent !
  • Statistical variance is too big in high-dimensional problems
Gibbs: Speeding Convergence

Objectives:

  • Reduce dependence between samples (autocorrelation)
    • Skip samples
    • Randomize Variable Sampling Order
  • Reduce variance
    • Blocking Gibbs Sampling
    • Rao-Blackwellisation
Skipping Samples
  • Pick only every k-th sample (Geyer, 1992)

Can reduce dependence between samples!

Increases variance! Wastes samples!

Randomized Variable Order

Random Scan Gibbs Sampler

Pick the next variable Xi for update at random with probability pi, Σ_i pi = 1.

(In the simplest case, the pi are distributed uniformly.)

In some instances, this reduces variance (MacEachern, Peruggia, 1999, "Subsampling the Gibbs Sampler: Variance Reduction").

Blocking
  • Sample several variables together, as a block
  • Example: Given three variables X, Y, Z, with domains of size 2, group Y and Z together to form a variable W = {Y, Z} with domain size 4. Then, given sample (x^t, y^t, z^t), compute the next sample:

x^{t+1} ← P(X | y^t, z^t) = P(X | w^t)

(y^{t+1}, z^{t+1}) = w^{t+1} ← P(W | x^{t+1})

+ Can improve convergence greatly when two variables are strongly correlated!

- Domain of the block variable grows exponentially with the # of variables in a block!

Blocking Gibbs Sampling

(Jensen, Kong, Kjaerulff, 1993, "Blocking Gibbs Sampling in Very Large Probabilistic Expert Systems")

  • Select a set of subsets:

E1, E2, E3, …, Ek s.t. Ei ⊆ X

∪i Ei = X

Ai = X \ Ei

  • Sample P(Ei | Ai)
Rao-Blackwellisation
  • Do not sample all variables!
  • Sample a subset!
  • Example: Given three variables X, Y, Z, sample only X and Y, and sum out Z. Given sample (x^t, y^t), compute the next sample:

x^{t+1} ← P(X | y^t)

y^{t+1} ← P(Y | x^{t+1})

Rao-Blackwell Theorem

Bottom line: reducing the number of variables in a sample reduces the variance!

Blocking vs. Rao-Blackwellisation
  • Standard Gibbs:

P(x|y,z), P(y|x,z), P(z|x,y) (1)

  • Blocking:

P(x|y,z), P(y,z|x) (2)

  • Rao-Blackwellised:

P(x|y), P(y|x) (3)

Var3 < Var2 < Var1

[Liu, Wong, Kong, 1994, "Covariance structure of the Gibbs sampler…"]

Rao-Blackwellised Gibbs: Cutset Sampling
  • Select C ⊆ X (possibly a cycle-cutset), |C| = m
  • Fix evidence E
  • Initialize the cutset nodes with random values:

For i = 1 to m: set Ci = ci^0

  • For t = 1 to T, generate samples:

For i = 1 to m:

Ci = ci^{t+1} ← P(ci | c1^{t+1}, …, c_{i-1}^{t+1}, c_{i+1}^t, …, cm^t, e)

Cutset Sampling
  • Select a subset C = {C1,…,CK} ⊆ X
  • A sample t ∈ [1,2,…] is an instantiation of C: c^t = (c1^t, …, cK^t)
  • Sampling process:
    • Fix values of observed variables e
    • Generate sample c^0 at random
    • Generate samples c^1, c^2, …, c^T from P(c|e)
    • Compute posteriors from samples
Cutset Sampling: Generating Samples

Generate sample c^{t+1} from c^t. In short, for i = 1 to K:

Ci^{t+1} ← P(Ci | c1^{t+1}, …, c_{i-1}^{t+1}, c_{i+1}^t, …, cK^t, e)

Rao-Blackwellised Gibbs: Cutset Sampling

How to compute P(ci | c^t \ ci, e)?

  • Compute the joint P(ci, c^t \ ci, e) for each ci ∈ D(Ci)
  • Then normalize:

P(ci | c^t \ ci, e) = α · P(ci, c^t \ ci, e)

  • Computation efficiency depends on the choice of C
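A hedged sketch of the normalization step described above; `joint_prob(ci, rest, evidence)` stands in for the exact-inference call (cycle-cutset conditioning or BTE) that the slides describe and is not a real library function:

```python
def cutset_conditional(domain_ci, rest, evidence, joint_prob):
    """Return P(Ci | rest of cutset, e) by normalizing the joints P(ci, rest, e)."""
    joints = [joint_prob(ci, rest, evidence) for ci in domain_ci]
    z = sum(joints)                       # normalization constant (1 / alpha)
    return [j / z for j in joints]
```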

Rao-Blackwellised Gibbs: Cutset Sampling

How to choose C?

  • Special case: C is a cycle-cutset, O(N)
  • General case: apply Bucket Tree Elimination (BTE), O(exp(w)), where w is the induced width of the network when the nodes in C are observed.
  • Pick C wisely so as to minimize w ⇒ notion of w-cutset
w-cutset Sampling
  • C=w-cutset of the network, a set of nodes such that when C and E are instantiated, the adjusted induced width of the network is w
  • Complexity of exact inference:

bounded by w !

  • cycle-cutset is a special case
Cutset Sampling – Answering Queries
  • Query: ci ∈ C, P(ci | e) = ? Same as Gibbs:

P̂(ci | e) = (1/T) Σ_t P(ci | c^t \ ci, e)

  • Special case of w-cutset

computed while generating sample t

  • Query: P(xi | e) = ?

P̂(xi | e) = (1/T) Σ_t P(xi | c^t, e)

computed after generating sample t

Cutset Sampling Example

[Figure: Bayesian network over nodes X1–X9, with evidence E = {X9 = x9}.]

Cutset Sampling Example

Sample a new value for X2:

[Figure: the same network over X1–X9, evidence X9.]

Cutset Sampling Example

Sample a new value for X5:

[Figure: the same network over X1–X9, evidence X9.]

Cutset Sampling Example

Query P(x2 | e) for sampling node X2:

[Figure: three cutset samples (Sample 1, Sample 2, Sample 3) over the network X1–X9, evidence X9.]

Cutset Sampling Example

Query P(x3 | e) for non-sampled node X3:

[Figure: the same network over X1–X9, evidence X9.]

Gibbs: Error Bounds

Objectives:

  • Estimate the needed number of samples T
  • Estimate the error

Methodology:

  • 1 chain → use lag-k autocovariance
    • Estimate T
  • M chains → standard sampling variance
    • Estimate error
Gibbs: lag-k autocovariance

Lag-k autocovariance

Gibbs: lag-k autocovariance

Estimate the Monte Carlo variance:

Here, the cutoff lag is the smallest positive integer satisfying:

Effective chain size:

In the absence of autocovariance, the effective chain size equals T.
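The exact expressions on the slide did not survive the transcript, but the usual autocovariance-based recipe can be sketched as follows; this is the standard effective-sample-size formula T / (1 + 2 Σ_k ρ(k)), offered as an assumption rather than the slide's exact expression:

```python
def autocovariance(xs, k):
    """Lag-k autocovariance of a scalar chain xs."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((xs[t] - mean) * (xs[t + k] - mean) for t in range(n - k)) / n

def effective_chain_size(xs, max_lag=100):
    """T_eff = T / (1 + 2 * sum of autocorrelations up to the first non-positive lag)."""
    var = autocovariance(xs, 0)
    if var == 0:
        return len(xs)
    rho_sum = 0.0
    for k in range(1, min(max_lag, len(xs) - 1)):
        rho = autocovariance(xs, k) / var
        if rho <= 0:                      # truncate once autocorrelation dies out
            break
        rho_sum += rho
    return len(xs) / (1 + 2 * rho_sum)
```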

Gibbs: Multiple Chains
  • Generate M chains of size K
  • Each chain produces an independent estimate Pm:

Pm(xi | e) = (1/K) Σ_t P(xi | markov_t \ Xi)

  • Estimate P(xi | e) as the average of the Pm(xi | e):

P̂(xi | e) = (1/M) Σ_m Pm(xi | e)

Treat the Pm as independent random variables.

Gibbs: Multiple Chains

{Pm} are independent random variables. Therefore:
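A small sketch of the multiple-chains estimate and its sampling variance, treating the per-chain estimates P_m as i.i.d.; the variance formula is the usual sample variance, assumed here rather than copied from the slide:

```python
def combine_chains(chain_estimates):
    """chain_estimates: list of M >= 2 independent per-chain estimates P_m(xi | e)."""
    m = len(chain_estimates)
    p_hat = sum(chain_estimates) / m                           # average of the P_m
    s2 = sum((p - p_hat) ** 2 for p in chain_estimates) / (m - 1)
    return p_hat, s2 / m                                       # estimate and Var(p_hat)
```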

Geman & Geman, 1984
  • Geman, S. & Geman, D., 1984. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. Pat. Anal. Mach. Intel. 6, 721-741.
    • Introduced Gibbs sampling;
    • Placed the idea of Gibbs sampling in a general setting in which the collection of variables is structured in a graphical model and each variable has a neighborhood corresponding to a local region of the graphical structure. Geman and Geman use the Gibbs distribution to define the joint distribution on this structured set of variables.
Tanner&Wong 1987
  • Tanner and Wong (1987)
    • Data-augmentation
    • Convergence Results
Pearl, 1988
  • Pearl, 1988. Probabilistic Reasoning in Intelligent Systems, Morgan Kaufmann.
    • In the case of Bayesian networks, the neighborhoods correspond to the Markov blanket of a variable and the joint distribution is defined by the factorization of the network.
Gelfand & Smith, 1990
  • Gelfand, A.E. and Smith, A.F.M., 1990. Sampling-based approaches to calculating marginal densities. J. Am. Statist. Assoc. 85, 398-409.
    • Showed variance reduction when using the mixture estimator for posterior marginals.
Neal, 1992
  • R. M. Neal, 1992. Connectionist learning of belief networks, Artificial Intelligence, v. 56, pp. 71-118.
    • Stochastic simulation in Noisy-OR networks.
CPCS54 Test Results

MSE vs. #samples (left) and time (right)

Ergodic, |X| = 54, D(Xi) = 2, |C| = 15, |E| = 4

Exact Time = 30 sec using Cutset Conditioning

CPCS179 Test Results

MSE vs. #samples (left) and time (right)

Non-Ergodic (1 deterministic CPT entry)

|X| = 179, |C| = 8, 2 ≤ D(Xi) ≤ 4, |E| = 35

Exact Time = 122 sec using Loop-Cutset Conditioning

CPCS360b Test Results

MSE vs. #samples (left) and time (right)

Ergodic, |X| = 360, D(Xi)=2, |C| = 21, |E| = 36

Exact Time > 60 min using Cutset Conditioning

Exact Values obtained via Bucket Elimination

Random Networks

MSE vs. #samples (left) and time (right)

|X| = 100, D(Xi) =2,|C| = 13, |E| = 15-20

Exact Time = 30 sec using Cutset Conditioning

Coding Networks

[Figure: coding network with nodes x1–x4, u1–u4, p1–p4, and y1–y4.]

MSE vs. time (right)

Non-Ergodic, |X| = 100, D(Xi) = 2, |C| = 13-16, |E| = 50

Sample Ergodic Subspace U = {U1, U2, …, Uk}

Exact Time = 50 sec using Cutset Conditioning

Non-Ergodic Hailfinder

MSE vs. #samples (left) and time (right)

Non-Ergodic, |X| = 56, |C| = 5, 2 ≤ D(Xi) ≤ 11, |E| = 0

Exact Time = 2 sec using Loop-Cutset Conditioning

Non-Ergodic CPCS360b - MSE

MSE vs. Time

Non-Ergodic, |X| = 360, |C| = 26, D(Xi)=2

Exact Time = 50 min using BTE

Likelihood Weighting (Fung and Chang, 1990; Shachter and Peot, 1990)

"Clamping" evidence + forward sampling + weighting samples by the evidence likelihood

Works well for likely evidence!

Likelihood Weighting

Sample in topological order over X!

[Figure: network in which the evidence nodes (marked "e") are clamped to their observed values.]

For non-evidence nodes: xi ← P(Xi | pai)

P(Xi | pai) is a look-up in the CPT!

Likelihood Weighting

Estimate posterior marginals from the weighted samples, where each sample's weight is w^t = Π_{Xj ∈ E} P(ej | paj^t):

P̂(xi | e) = Σ_t w^t · δ(Xi = xi in x^t) / Σ_t w^t
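A minimal Python sketch of likelihood weighting under the same hypothetical network layout used in the earlier forward-sampling sketch (CPT dictionaries and a `draw_sample` helper); evidence nodes are clamped and contribute P(ei | pai) to the weight instead of being sampled:

```python
import random

def draw_sample(domain, probs):
    r, cum = random.random(), 0.0
    for value, p in zip(domain, probs):
        cum += p
        if r < cum:
            return value
    return domain[-1]

def likelihood_weighting(order, parents, domains, cpts, evidence, T):
    """Return weighted samples [(sample, weight)] for estimating P(Xi | e)."""
    weighted = []
    for _ in range(T):
        sample, weight = {}, 1.0
        for x in order:                                   # topological order
            pa = tuple(sample[p] for p in parents[x])
            probs = cpts[x][pa]                           # CPT look-up
            if x in evidence:
                sample[x] = evidence[x]                   # clamp evidence
                weight *= probs[domains[x].index(evidence[x])]
            else:
                sample[x] = draw_sample(domains[x], probs)
        weighted.append((sample, weight))
    return weighted

def lw_marginal(weighted, i, xi):
    """P_hat(Xi = xi | e) = (sum of weights where Xi = xi) / (total weight)."""
    num = sum(w for s, w in weighted if s[i] == xi)
    den = sum(w for _, w in weighted)
    return num / den
```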

Likelihood Weighting
  • Converges to the exact posterior marginals
  • Generates samples fast
  • Sampling distribution is close to the prior (especially if E ⊆ leaf nodes)
  • Increasing sampling variance ⇒

Convergence may be slow

Many samples with P(x^t) = 0 receive zero weight and are effectively rejected

Likelihood Convergence (Chebyshev's Inequality)
  • Assume P(X=x|e) has mean μ and variance σ²
  • Chebyshev:

μ = P(x|e) is unknown ⇒ obtain it from the samples!

Error Bound Derivation

K is a Bernoulli random variable

Likelihood Convergence 2
  • Assume P(X=x|e) has mean μ and variance σ²
  • Zero-One Estimation Theory (Karp et al., 1989):

μ = P(x|e) is unknown ⇒ obtain it from the samples!

Local Variance Bound (LVB) (Dagum & Luby, 1994)
  • Let the LVB of a binary-valued network be:
LVB Estimate (Pradhan, Dagum, 1996)
  • Using the LVB, the Zero-One Estimator can be re-written:
Importance Sampling Idea
  • In general, it is hard to sample from the target distribution P(X|E)
  • Generate samples from a sampling (proposal) distribution Q(X)
  • Weight each sample against P(X|E) (see the sketch below)
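A brief sketch of the weighting step, using the self-normalized importance-sampling estimator with weights w(x) = P(x, e) / Q(x); the callables `p_joint` (for P(x, e)) and `q` (for Q(x)) are hypothetical placeholders, not names from the slides:

```python
def importance_estimate(samples, p_joint, q, indicator):
    """Self-normalized importance sampling estimate of E_P[indicator(X) | e].
    `samples` are drawn from Q; each is weighted by w(x) = P(x, e) / Q(x)."""
    weights = [p_joint(x) / q(x) for x in samples]
    num = sum(w * indicator(x) for x, w in zip(samples, weights))
    return num / sum(weights)
```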
Importance Sampling Variants

Importance sampling: forward, non-adaptive

  • Nodes sampled in topological order
  • Sampling distribution (for non-instantiated nodes) equal to the prior conditionals

Importance sampling: forward, adaptive

  • Nodes sampled in topological order
  • Sampling distribution adapted according to the average importance weights obtained in previous samples [Cheng, Druzdzel 2000]
AIS-BN
  • The most efficient variant of importance sampling to date is AIS-BN – Adaptive Importance Sampling for Bayesian Networks.
  • Jian Cheng and Marek J. Druzdzel. AIS-BN: An adaptive importance sampling algorithm for evidential reasoning in large Bayesian networks. Journal of Artificial Intelligence Research (JAIR), 13:155-188, 2000.