Sampling Bayesian Networks

ICS 276, 2007
Answering BN Queries

  • Probability of Evidence P(e) ? NP-hard

  • Conditional Prob. P(xi|e) ? NP-hard

  • MPE x = arg max P(x|e) ? NP-hard

  • MAP y = arg max P(y|e), Y ⊂ X ? NP^PP-hard

  • Approximating P(e) or P(xi|e) within relative error ε: NP-hard


Approximation Algorithms

Structural Approximations

  • Eliminate some dependencies
    • Remove edges
    • Mini-Bucket Approach

Search

  • Approach for optimization tasks: MPE, MAP

Sampling

  • Generate random samples and compute values of interest from the samples, not the original network



Sampling

  • Input: Bayesian network with set of nodes X

  • Sample = a tuple with assigned values

    s=(X1=x1,X2=x2,… ,Xk=xk)

  • Tuple may include all variables (except evidence) or a subset

  • Sampling schemas dictate how to generate samples (tuples)

  • Ideally, samples are distributed according to P(X|E)


Sampling Fundamentals

Given a set of variables X = {X1, X2, …, Xn} that represent a joint probability distribution Π(X) and some function g(X), we can compute the expected value of g(X):

E_Π[g(X)] = Σ_x g(x) Π(x)


Sampling From Π(X)

A sample S^t is an instantiation of all the variables:

S^t = (x1^t, x2^t, …, xn^t)

Given independent, identically distributed (iid) samples S^1, S^2, …, S^T from Π(X), it follows from the Strong Law of Large Numbers that the sample average of g converges to its expectation:

(1/T) Σ_{t=1..T} g(S^t) → E_Π[g(X)]  as T → ∞


Sampling Basics

  • Given random variable X, D(X)={0, 1}

  • Given P(X) = {0.3, 0.7}

  • Generate k samples: 0,1,1,1,0,1,1,0,1

  • Approximate P’(X): P’(X=0) = 3/9 ≈ 0.33, P’(X=1) = 6/9 ≈ 0.67


How to draw a sample ?

  • Given random variable X, D(X)={0, 1}

  • Given P(X) = {0.3, 0.7}

  • Sample X ∼ P(X)

    • draw random number r ∈ [0, 1]

    • If (r < 0.3) then set X=0

    • Else set X=1

  • Can generalize for any domain size
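
A minimal Python sketch of the inverse-CDF draw described above; the helper name sample_discrete is illustrative, not from the slides.

```python
import random

def sample_discrete(probs, values=None):
    """Draw one value from a finite distribution: draw r ~ Uniform[0,1) and
    return the first value whose cumulative probability exceeds r."""
    if values is None:
        values = list(range(len(probs)))
    r = random.random()
    cumulative = 0.0
    for value, p in zip(values, probs):
        cumulative += p
        if r < cumulative:
            return value
    return values[-1]  # guard against floating-point round-off

# P(X=0) = 0.3, P(X=1) = 0.7, as on the slide
print([sample_discrete([0.3, 0.7]) for _ in range(10)])
```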


Sampling in BN

  • Same Idea: generate a set of samples T

  • Estimate P(Xi|E) from samples

  • Challenge: X is a vector and P(X) is a huge distribution represented by BN

  • Need to know:

    • How to generate a new sample ?

    • How many samples T do we need ?

    • How to estimate P(E=e) and P(Xi|e) ?


Sampling Algorithms

  • Forward Sampling

  • Gibbs Sampling (MCMC)

    • Blocking

    • Rao-Blackwellised

  • Likelihood Weighting

  • Importance Sampling

  • Sequential Monte-Carlo (Particle Filtering) in Dynamic Bayesian Networks


Forward Sampling

  • Forward Sampling

    • Case with No evidence E={}

    • Case with Evidence E=e

    • # samples N and Error Bounds


Forward Sampling, No Evidence (Henrion, 1988)

Input: Bayesian network

X = {X1,…,XN}, N - #nodes, T - #samples

Output: T samples

Process nodes in topological order – first process the ancestors of a node, then the node itself:

  • For t = 1 to T

  • For i = 1 to N

  • Xi ← sample xi^t from P(Xi | pai)


Sampling A Value

[Figure: the interval [0, 1] split at 0.3; a random draw r lands in one of the two segments]

What does it mean to sample xi^t from P(Xi | pai) ?

  • Assume D(Xi)={0,1}

  • Assume P(Xi | pai) = (0.3, 0.7)

  • Draw a random number r from [0,1]

    If r falls in [0,0.3], set Xi = 0

    If r falls in [0.3,1], set Xi=1



Forward Sampling - Answering Queries

Task: given T samples {S^1, S^2, …, S^T},

estimate P(Xi = xi):

P̂(Xi = xi) = (# samples in which Xi = xi) / T

Basically, count the proportion of samples where Xi = xi
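
A one-function sketch of this counting estimator, assuming samples are stored as dictionaries as in the earlier sketch; the helper name is illustrative.

```python
def estimate_marginal(samples, var, value):
    """Forward-sampling estimate of P(var = value): the fraction of samples
    in which `var` takes `value`."""
    return sum(1 for s in samples if s[var] == value) / len(samples)

# e.g., with three samples over X1, X2:
samples = [{"X1": 0, "X2": 1}, {"X1": 1, "X2": 1}, {"X1": 0, "X2": 0}]
print(estimate_marginal(samples, "X2", 1))   # -> 0.666...
```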


Forward Sampling w/ Evidence

Input: Bayesian network

X= {X1,…,XN}, N- #nodes

E – evidence, T - # samples

Output: T samples consistent with E

  • For t=1 to T

  • For i=1 to N

  • Xi sample xit from P(xi | pai)

  • If Xi in E and Xi xi, reject sample:

  • i = 1 and go to step 2
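
A sketch of this rejection loop, reusing the dictionary-based network representation assumed in the earlier forward-sampling sketch; names are illustrative.

```python
import random

def rejection_sample(nodes, parents, cpt, evidence, T):
    """Forward sampling with evidence: sample the network in topological order
    and reject the sample as soon as a sampled evidence variable disagrees
    with its observed value; repeat until T samples are accepted."""
    accepted = []
    while len(accepted) < T:
        sample, consistent = {}, True
        for x in nodes:
            pa = tuple(sample[p] for p in parents[x])
            probs = cpt[x][pa]
            r, cum, value = random.random(), 0.0, len(probs) - 1
            for v, p in enumerate(probs):
                cum += p
                if r < cum:
                    value = v
                    break
            sample[x] = value
            if x in evidence and value != evidence[x]:
                consistent = False           # reject: restart from X1
                break
        if consistent:
            accepted.append(sample)
    return accepted

# e.g. rejection_sample(nodes, parents, cpt, {"X2": 1}, 100)
# with the toy network from the earlier sketch
```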



Forward Sampling: Illustration

Let Y be a subset of evidence nodes s.t. Y=u


Forward Sampling – How Many Samples?

Theorem: Let s(y) be the estimate of P(y) resulting from a randomly chosen sample set S with T samples. Then, to guarantee relative error at most ε with probability at least 1−δ, it is enough to take T large enough, with the required bound on T derived from Chebyshev’s inequality.


Forward Sampling – How Many Samples?

Theorem: Let s(y) be the estimate of P(y) resulting from a randomly chosen sample set S with T samples. Then, to guarantee relative error at most ε with probability at least 1−δ, it is enough to take T large enough, with the required bound on T derived from Hoeffding’s inequality (full proof is given in Koller).


Forward Sampling: Performance

Advantages:

  • P(xi | pa(xi)) is readily available

  • Samples are independent !

    Drawbacks:

  • If evidence E is rare (P(e) is low), then we will reject most of the samples!

  • Since P(y), which appears in the bound on T, is unknown, it must be estimated from the samples themselves!

  • If P(e) is small, T will become very big!


Problem: Evidence

  • Forward Sampling

    • High Rejection Rate

  • Fix evidence values

    • Gibbs sampling (MCMC)

    • Likelihood Weighting

    • Importance Sampling


Forward Sampling Bibliography

  • [henrion88] M. Henrion, "Propagating uncertainty in Bayesian networks by probabilistic logic sampling”, Uncertainty in AI, pp. 149-163, 1988.


Gibbs Sampling

  • Markov Chain Monte Carlo method

    (Gelfand and Smith, 1990, Smith and Roberts, 1993, Tierney, 1994)

  • Samples are dependent, form Markov Chain

  • Sample from P’(X|e), which converges to P(X|e)

  • Guaranteed to converge when all P > 0

  • Methods to improve convergence:

    • Blocking

    • Rao-Blackwellised

  • Error Bounds

    • Lag-t autocovariance

    • Multiple Chains, Chebyshev’s Inequality


Gibbs Sampling (Pearl, 1988)

  • A sample x^t, t ∈ {1, 2, …}, is an instantiation of all variables in the network: x^t = (x1^t, x2^t, …, xN^t)

  • Sampling process

    • Fix values of observed variables e

    • Instantiate node values in sample x0 at random

    • Generate samples x1,x2,…xT from P(x|e)

    • Compute posteriors from samples


Ordered Gibbs Sampler

Generate sample x^{t+1} from x^t by resampling the variables one at a time, each conditioned on the current values of all the others and the evidence:

X1 ← x1^{t+1} sampled from P(x1 | x2^t, …, xN^t, e)
X2 ← x2^{t+1} sampled from P(x2 | x1^{t+1}, x3^t, …, xN^t, e)
…
XN ← xN^{t+1} sampled from P(xN | x1^{t+1}, …, xN-1^{t+1}, e)

In short, for i = 1 to N: xi^{t+1} ← P(xi | x1^{t+1}, …, xi-1^{t+1}, xi+1^t, …, xN^t, e)

Process all variables in some order.


Gibbs Sampling (cont’d) (Pearl, 1988)

The conditional of Xi given all other variables reduces to its Markov blanket (its parents, its children, and its children’s other parents):

P(xi | x^t \ xi, e) = P(xi | markov_i^t) ∝ P(xi | pa_i) · Π_{Xj ∈ ch(Xi)} P(xj^t | pa_j)


Ordered Gibbs Sampling Algorithm

Input: X, E

Output: T samples {xt }

  • Fix evidence E

  • Generate samples from P(X | E)

  • For t = 1 to T (compute samples)

  • For i = 1 to N (loop through variables)

  • Xi sample xit from P(Xi | markovt \ Xi)


Answering Queries

  • Query: P(xi | e) = ?

  • Method 1 (histogram estimator): count the fraction of samples where Xi = xi:

    P̂(xi | e) = (1/T) Σ_t δ(xi^t = xi)

  • Method 2 (mixture estimator): average the conditional probability over the samples:

    P̂(xi | e) = (1/T) Σ_t P(xi | markov_i^t)
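
Both estimators in a short sketch; conditional(var, state) stands for any routine returning P(var | Markov blanket of var) as a list indexed by value, for instance the Gibbs conditional sketched earlier. Names are illustrative.

```python
def histogram_estimate(samples, var, value):
    """Method 1: the fraction of samples in which var == value."""
    return sum(1 for s in samples if s[var] == value) / len(samples)

def mixture_estimate(samples, var, value, conditional):
    """Method 2 (mixture estimator): average the exact conditional
    P(var = value | Markov blanket of var) over the samples."""
    return sum(conditional(var, s)[value] for s in samples) / len(samples)
```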


Gibbs Sampling Example - BN

X = {X1, X2, …, X9},  E = {X9}

[Figure: a Bayesian network over X1, …, X9 with evidence node X9]

Gibbs Sampling Example - BN

Initialize the non-evidence variables at random:

X1 = x1^0, X2 = x2^0, X3 = x3^0, X4 = x4^0, X5 = x5^0, X6 = x6^0, X7 = x7^0, X8 = x8^0

[Figure: the network with these initial values assigned; X9 is evidence]


Gibbs Sampling Example - BN

X1 ← P(X1 | x2^0, …, x8^0, x9),  E = {X9}

[Figure: the network, highlighting the resampling of X1]

Gibbs Sampling Example - BN

X2 ← P(X2 | x1^1, x3^0, …, x8^0, x9),  E = {X9}

[Figure: the network, highlighting the resampling of X2]


Gibbs Sampling: Burn-In

  • We want to sample from P(X | E)

  • But…starting point is random

  • Solution: throw away first K samples

  • Known As “Burn-In”

  • What is K ? Hard to tell. Use intuition.

  • Alternatives: sample the initial values from an approximate P(x|e) (for example, run IBP first)


Gibbs Sampling: Convergence

  • Converges to the stationary distribution π*:

    π* = π* P

    where P is the transition kernel with entries

    p_ij = P(x_i → x_j), the probability of moving from state x_i to state x_j

  • Guaranteed to converge iff the chain is:

    • irreducible

    • aperiodic

    • ergodic (∀ i,j: p_ij > 0)


Irreducible

  • A Markov chain (or its probability transition matrix) is said to be irreducible if it is possible to reach every state from every other state (not necessarily in one step).

  • In other words, ∀ i,j ∃ k : P^(k)_ij > 0, where k is the number of steps taken to get from state i to state j.


Aperiodic

  • Define d(i) = g.c.d.{n > 0 | it is possible to go from i to i in n steps}. Here, g.c.d. means the greatest common divisor of the integers in the set. If d(i) = 1 for all i, then the chain is aperiodic.


Ergodicity

  • A recurrent state is a state to which the chain returns with probability 1; equivalently,

    Σ_n P^(n)_ii = ∞

  • Recurrent, aperiodic states are ergodic.

    Note: an extra condition for ergodicity is that the expected recurrence time is finite. This holds for recurrent states in a finite state chain.


Gibbs Convergence

  • Gibbs convergence is generally guaranteed as long as all probabilities are positive!

  • Intuition for the ergodicity requirement: if nodes X and Y are correlated s.t. X=0 ⟺ Y=0, then:

    • once we sample and assign X=0, we are forced to assign Y=0;

    • once we sample and assign Y=0, we are forced to assign X=0;

    ⇒ we will never be able to change their values again!

  • Another problem: it can take a very long time to converge!


Gibbs Sampling: Performance

+Advantage: guaranteed to converge to P(X|E)

-Disadvantage: convergence may be slow

Problems:

  • Samples are dependent !

  • Statistical variance is too big in high-dimensional problems


Gibbs: Speeding Convergence

Objectives:

  • Reduce dependence between samples (autocorrelation)

    • Skip samples

    • Randomize Variable Sampling Order

  • Reduce variance

    • Blocking Gibbs Sampling

    • Rao-Blackwellisation


Skipping Samples

  • Pick only every k-th sample (Geyer, 1992)

    Can reduce dependence between samples!

    Increases variance! Wastes samples!


Randomized Variable Order

Random Scan Gibbs Sampler

Pick the next variable Xi for update at random with probability p_i, where Σ_i p_i = 1.

(In the simplest case, the p_i are distributed uniformly.)

In some instances, this reduces variance (MacEachern & Peruggia, 1999, “Subsampling the Gibbs Sampler: Variance Reduction”).


Blocking

  • Sample several variables together, as a block

  • Example: Given three variables X, Y, Z with domains of size 2, group Y and Z together to form a variable W = {Y, Z} with domain size 4. Then, given sample (x^t, y^t, z^t), compute the next sample:

    x^{t+1} ← P(X | y^t, z^t) = P(X | w^t)

    w^{t+1} = (y^{t+1}, z^{t+1}) ← P(W | x^{t+1})

    + Can improve convergence greatly when two variables are strongly correlated!

    - Domain of the block variable grows exponentially with the #variables in a block!
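
A sketch of one blocking-Gibbs sweep for this three-variable example, where Y and Z are updated jointly as the block W; joint(x, y, z) is an assumed callback returning the (unnormalized) joint probability.

```python
import itertools
import random

def blocked_gibbs_step(state, joint, domains):
    """One blocking-Gibbs sweep for X, Y, Z in which Y and Z form a block W.
    `joint(x, y, z)` returns the (unnormalized) joint probability."""
    # Sample X from P(X | y, z)
    weights = [joint(x, state["Y"], state["Z"]) for x in domains["X"]]
    state["X"] = random.choices(domains["X"], weights=weights)[0]
    # Sample the block W = (Y, Z) from P(Y, Z | x): a domain of size |Y| * |Z|
    block = list(itertools.product(domains["Y"], domains["Z"]))
    weights = [joint(state["X"], y, z) for (y, z) in block]
    state["Y"], state["Z"] = random.choices(block, weights=weights)[0]
    return state
```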


Blocking Gibbs Sampling

Jensen, Kong, Kjaerulff, 1993

“Blocking Gibbs Sampling in Very Large Probabilistic Expert Systems”

  • Select a set of subsets:

    E1, E2, E3, …, Ek s.t. Ei ⊆ X

    ∪_i Ei = X

    Ai = X \ Ei

  • Sample P(Ei | Ai)


Rao-Blackwellisation

  • Do not sample all variables!

  • Sample a subset!

  • Example: Given three variables X, Y, Z, sample only X and Y, and sum out Z. Given sample (x^t, y^t), compute the next sample:

    x^{t+1} ← P(X | y^t)

    y^{t+1} ← P(Y | x^{t+1})
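
The corresponding Rao-Blackwellised sweep for the same three-variable example, summing Z out analytically inside each conditional; again, joint(x, y, z) is an assumed callback.

```python
import random

def rao_blackwellised_step(state, joint, domains):
    """One Rao-Blackwellised Gibbs sweep: only X and Y are sampled; Z is summed out."""
    # P(X | y) is proportional to sum_z P(x, y, z)
    weights = [sum(joint(x, state["Y"], z) for z in domains["Z"]) for x in domains["X"]]
    state["X"] = random.choices(domains["X"], weights=weights)[0]
    # P(Y | x) is proportional to sum_z P(x, y, z)
    weights = [sum(joint(state["X"], y, z) for z in domains["Z"]) for y in domains["Y"]]
    state["Y"] = random.choices(domains["Y"], weights=weights)[0]
    return state
```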


Rao-Blackwell Theorem

Var( E[ g(X) | Y ] ) ≤ Var( g(X) )

Bottom line: reducing the number of variables in a sample (and integrating the rest out analytically) reduces variance!


Blocking vs. Rao-Blackwellisation

  • Standard Gibbs:

    P(x|y,z),P(y|x,z),P(z|x,y) (1)

  • Blocking:

    P(x|y,z), P(y,z|x) (2)

  • Rao-Blackwellised:

    P(x|y), P(y|x) (3)

    Var3 < Var2 < Var1

    [Liu, Wong, Kong, 1994, “Covariance structure of the Gibbs sampler…”]

[Figure: three-variable network over X, Y, Z]


Rao-Blackwellised Gibbs: Cutset Sampling

  • Select C ⊆ X (possibly a cycle-cutset), |C| = m

  • Fix evidence E

  • Initialize the cutset nodes with random values:

    For i = 1 to m: set Ci = ci^0

  • For t = 1 to T, generate samples:

    For i = 1 to m:

    Ci = ci^{t+1} ← P(ci | c1^{t+1}, …, c_{i-1}^{t+1}, c_{i+1}^t, …, c_m^t, e)


Cutset Sampling

  • Select a subset C = {C1, …, CK} ⊆ X

  • A sample c^t, t ∈ {1, 2, …}, is an instantiation of C: c^t = (c1^t, …, cK^t)

  • Sampling process

    • Fix values of observed variables e

    • Generate sample c0 at random

    • Generate samples c1,c2,…cT from P(c|e)

    • Compute posteriors from samples


Cutset Sampling: Generating Samples

Generate sample c^{t+1} from c^t :

In short, for i = 1 to K: ci^{t+1} ← P(Ci | c1^{t+1}, …, c_{i-1}^{t+1}, c_{i+1}^t, …, c_K^t, e)


Rao-Blackwellised Gibbs: Cutset Sampling

How to compute P(ci | c^t \ ci, e) ?

  • Compute the joint P(ci, c^t \ ci, e) for each ci ∈ D(Ci)

  • Then normalize:

    P(ci | c^t \ ci, e) = α · P(ci, c^t \ ci, e)

  • Computational efficiency depends on the choice of C
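
A brute-force sketch of this step: for each value of Ci, sum the joint over every non-cutset, non-evidence variable and then normalize. The slides compute this with exact inference (e.g. BTE); plain enumeration, shown here only for illustration, is exponential in the number of summed-out variables. The dictionary-based representation is the same assumption as in the earlier sketches.

```python
import itertools

def cutset_conditional(c_i, cutset_state, evidence, nodes, parents, cpt, domains):
    """P(C_i | c \\ C_i, e) by enumeration: for each value of C_i, sum the joint
    over all remaining variables, then normalize."""
    fixed = {**cutset_state, **evidence}
    free = [x for x in nodes if x not in fixed and x != c_i]
    scores = []
    for v in domains[c_i]:
        total = 0.0
        for combo in itertools.product(*(domains[x] for x in free)):
            assignment = dict(fixed)
            assignment[c_i] = v
            assignment.update(zip(free, combo))
            prob = 1.0
            for x in nodes:
                pa = tuple(assignment[p] for p in parents[x])
                prob *= cpt[x][pa][assignment[x]]
            total += prob
        scores.append(total)
    z = sum(scores)
    return [s / z for s in scores]
```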


Rao-Blackwellised Gibbs: Cutset Sampling

How to choose C ?

  • Special case: C is cycle-cutset, O(N)

  • General case: apply Bucket Tree Elimination (BTE), O(exp(w)) where w is the induced width of the network when nodes in C are observed.

  • Pick C wisely so as to minimize w ⇒ the notion of a w-cutset


w-cutset Sampling

  • C = w-cutset of the network: a set of nodes such that, when C and E are instantiated, the adjusted induced width of the network is at most w

  • Complexity of exact inference over the rest of the network:

    exponential in w!

  • cycle-cutset is a special case


Cutset Sampling - Answering Queries

  • Query: for ci ∈ C, P(ci | e) = ? Same as Gibbs (mixture estimator),

    computed while generating sample t

  • Special case of w-cutset

  • Query: P(xi | e) = ? for a node Xi not in C,

    computed after generating sample t


Cutset Sampling Example

E = {X9 = x9}

[Figure: a Bayesian network over X1, …, X9 with evidence node X9]


Cutset Sampling Example

Sample a new value for X2:

[Figure: the same network, highlighting the resampling of X2]


Cutset Sampling Example

Sample a new value for X5:

[Figure: the same network, highlighting the resampling of X5]


Cutset Sampling Example

Query P(x2 | e) for the sampled node X2, averaging over samples 1, 2, 3:

[Figure: the cutset values from samples 1, 2, and 3 used in the estimate]


Cutset Sampling Example

Query P(x3 | e) for the non-sampled node X3:

[Figure: the same network, highlighting the non-sampled node X3]


Gibbs: Error Bounds

Objectives:

  • Estimate needed number of samples T

  • Estimate error

    Methodology:

  • 1 chain ⇒ use lag-k autocovariance

    • Estimate T

  • M chains ⇒ standard sampling variance

    • Estimate Error


Gibbs: lag-k autocovariance

Lag-k autocovariance


Gibbs: lag-k autocovariance

Estimate the Monte Carlo variance from the lag-k autocovariances:

Here, the truncation lag is the smallest positive integer satisfying the cutoff condition:

Effective chain size:

In the absence of autocovariance, the effective chain size is simply T.
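
A sketch of estimating the lag-k autocovariance and an effective chain size from a single chain of per-sample estimates. The truncation rule below (stop at the first non-positive autocovariance) is one common choice and may differ from the exact cutoff used on the slides.

```python
def autocovariance(series, k):
    """Sample lag-k autocovariance of a sequence of per-sample estimates."""
    n = len(series)
    mean = sum(series) / n
    return sum((series[t] - mean) * (series[t + k] - mean) for t in range(n - k)) / n

def effective_chain_size(series, max_lag=100):
    """T_eff = T * gamma(0) / (gamma(0) + 2 * sum_k gamma(k)), truncating the sum
    at the first lag whose autocovariance estimate is non-positive."""
    n = len(series)
    gamma0 = autocovariance(series, 0)
    if gamma0 == 0:
        return n                       # constant series: no autocovariance
    tail = 0.0
    for k in range(1, min(max_lag, n - 1)):
        g = autocovariance(series, k)
        if g <= 0:
            break
        tail += g
    return n * gamma0 / (gamma0 + 2 * tail)
```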


Gibbs: Multiple Chains

  • Generate M chains of size K

  • Each chain produces an independent estimate Pm of the posterior marginal:

Estimate P(xi | e) as the average of the Pm(xi | e):

P̂(xi | e) = (1/M) Σ_{m=1..M} Pm(xi | e)

Treat the Pm as independent random variables.


Gibbs: Multiple Chains

{ Pm } are independent random variables. Therefore, the sampling variance of their average can be estimated directly:

Var(P̂) ≈ S² / M,  where S² = (1/(M−1)) Σ_m (Pm − P̂)²


Geman & Geman, 1984

  • Geman, S. & Geman, D., 1984. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Mach. Intell., 6, 721-741.

    • Introduce Gibbs sampling;

    • Place the idea of Gibbs sampling in a general setting in which the collection of variables is structured in a graphical model and each variable has a neighborhood corresponding to a local region of the graphical structure. Geman and Geman use the Gibbs distribution to define the joint distribution on this structured set of variables.


Tanner & Wong, 1987

  • Tanner and Wong (1987)

    • Data-augmentation

    • Convergence Results


Pearl, 1988

  • Pearl, 1988. Probabilistic Reasoning in Intelligent Systems, Morgan Kaufmann.

    • In the case of Bayesian networks, the neighborhoods correspond to the Markov blanket of a variable and the joint distribution is defined by the factorization of the network.


Gelfand & Smith, 1990

  • Gelfand, A.E. and Smith, A.F.M., 1990. Sampling-based approaches to calculating marginal densities. J. Am. Statist. Assoc., 85, 398-409.

    • Show variance reduction in using mixture estimator for posterior marginals.


Neal, 1992

  • R. M. Neal, 1992. Connectionist learning of belief networks, Artificial Intelligence, vol. 56, pp. 71-118.

    • Stochastic simulation in Noisy-Or networks.


CPCS54 Test Results

MSE vs. #samples (left) and time (right)

Ergodic, |X| = 54, D(Xi) = 2, |C| = 15, |E| = 4

Exact Time = 30 sec using Cutset Conditioning


CPCS179 Test Results

MSE vs. #samples (left) and time (right)

Non-Ergodic (1 deterministic CPT entry)

|X| = 179, |C| = 8, 2 ≤ D(Xi) ≤ 4, |E| = 35

Exact Time = 122 sec using Loop-Cutset Conditioning


CPCS360b Test Results

MSE vs. #samples (left) and time (right)

Ergodic, |X| = 360, D(Xi)=2, |C| = 21, |E| = 36

Exact Time > 60 min using Cutset Conditioning

Exact Values obtained via Bucket Elimination


Random Networks

MSE vs. #samples (left) and time (right)

|X| = 100, D(Xi) =2,|C| = 13, |E| = 15-20

Exact Time = 30 sec using Cutset Conditioning


Coding Networks

[Figure: coding network over nodes x1–x4, u1–u4, p1–p4, y1–y4]

MSE vs. time (right)

Non-Ergodic, |X| = 100, D(Xi)=2, |C| = 13-16, |E| = 50

Sample Ergodic Subspace U={U1, U2,…Uk}

Exact Time = 50 sec using Cutset Conditioning


Non-Ergodic Hailfinder

MSE vs. #samples (left) and time (right)

Non-Ergodic, |X| = 56, |C| = 5, 2 ≤ D(Xi) ≤ 11, |E| = 0

Exact Time = 2 sec using Loop-Cutset Conditioning


Non-Ergodic CPCS360b - MSE

MSE vs. Time

Non-Ergodic, |X| = 360, |C| = 26, D(Xi)=2

Exact Time = 50 min using BTE



Likelihood Weighting (Fung and Chang, 1990; Shachter and Peot, 1990)

“Clamping” evidence + forward sampling + weighing samples by the likelihood of the evidence.

Works well for likely evidence!


Likelihood Weighting

Sample in topological order over X !

[Figure: a network with the evidence nodes e clamped; the remaining nodes are sampled]

xi ← P(Xi | pai)

P(Xi|pai) is a look-up in CPT!



Likelihood Weighting

Estimate posterior marginals from the weighted samples, where each sample’s weight is the product of the CPT entries of the clamped evidence nodes:

w^t = Π_{Ej ∈ E} P(ej | pa_j),    P̂(xi | e) = Σ_t w^t · δ(xi^t = xi) / Σ_t w^t
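
A sketch of likelihood weighting and the weighted estimate, again assuming the dictionary-based network representation used in the earlier sketches.

```python
import random

def likelihood_weighting(nodes, parents, cpt, domains, evidence, T):
    """Clamp evidence, forward-sample the remaining nodes in topological order,
    and weight each sample by the product of P(e_j | pa_j) over evidence nodes."""
    samples, weights = [], []
    for _ in range(T):
        sample, w = {}, 1.0
        for x in nodes:
            pa = tuple(sample[p] for p in parents[x])
            probs = cpt[x][pa]
            if x in evidence:
                sample[x] = evidence[x]
                w *= probs[evidence[x]]        # likelihood of the clamped value
            else:
                sample[x] = random.choices(domains[x], weights=probs)[0]
        samples.append(sample)
        weights.append(w)
    return samples, weights

def weighted_marginal(samples, weights, var, value):
    """Normalized weighted count: the likelihood-weighting estimate of P(var = value | e)."""
    hit = sum(w for s, w in zip(samples, weights) if s[var] == value)
    return hit / sum(weights)
```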


Likelihood Weighting

  • Converges to exact posterior marginals

  • Generates samples fast

  • Sampling distribution is close to the prior (especially if E ⊆ leaf nodes)

  • Drawback: increasing sampling variance, so convergence may be slow and many samples with P(x^t) = 0 are rejected (zero weight)


Likelihood Convergence (Chebyshev’s Inequality)

  • Assume P(X=x|e) has mean μ and variance σ²

  • Chebyshev:

μ = P(x|e) is unknown => obtain it from samples!


Error Bound Derivation

K is a Bernoulli random variable


Likelihood Convergence 2

  • Assume P(X=x|e) has mean μ and variance σ²

  • Zero-One Estimation Theory (Karp et al., 1989):

μ = P(x|e) is unknown => obtain it from samples!


Local Variance Bound (LVB) (Dagum & Luby, 1994)

  • Let the local variance bound of a binary-valued network be defined as:


LVB Estimate (Pradhan & Dagum, 1996)

  • Using the LVB, the Zero-One Estimator can be re-written:


Importance Sampling Idea

  • In general, it is hard to sample from the target distribution P(X|E)

  • Generate samples from a sampling (proposal) distribution Q(X)

  • Weight each sample to account for the difference between Q(X) and the target P(X|E)
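
A sketch of the generic importance weight w(x) = P(x, e) / Q(x); proposal_prob is an assumed callback giving the probability the proposal Q assigned to the sample, and the network representation is the same illustrative one as before. The posterior marginal is then estimated by the weight-normalized count, exactly as in the likelihood-weighting estimator above.

```python
def importance_weight(sample, evidence, nodes, parents, cpt, proposal_prob):
    """w(x) = P(x, e) / Q(x): the joint P(x, e) is the product of CPT entries under
    the completed assignment; Q(x) is the probability of the sample under the proposal."""
    full = {**sample, **evidence}
    p_joint = 1.0
    for x in nodes:
        pa = tuple(full[p] for p in parents[x])
        p_joint *= cpt[x][pa][full[x]]
    return p_joint / proposal_prob(sample)
```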


Importance Sampling Variants

Importance sampling: forward, non-adaptive

  • Nodes sampled in topological order

  • Sampling distribution (for non-instantiated nodes) equal to the prior conditionals

Importance sampling: forward, adaptive

  • Nodes sampled in topological order

  • Sampling distribution adapted according to the average importance weights obtained in previous samples [Cheng & Druzdzel, 2000]


AIS-BN

  • The most efficient variant of importance sampling to date is AIS-BN – Adaptive Importance Sampling for Bayesian Networks.

  • Jian Cheng and Marek J. Druzdzel. AIS-BN: An adaptive importance sampling algorithm for evidential reasoning in large Bayesian networks. Journal of Artificial Intelligence Research (JAIR), 13:155-188, 2000.