Approximate Inference
Slides by Nir Friedman
When can we hope to approximate?
Two situations:
• Highly stochastic distributions: the effect of "far" evidence can be discarded
• "Peaked" distributions: improbable values can be ignored
Stochasticity & Approximations
• Consider a chain X1 → X2 → … → Xn+1 with
  P(Xi+1 = t | Xi = t) = 1 - ε
  P(Xi+1 = f | Xi = f) = 1 - ε
• Computing the probability of Xn+1 given X1, we sum over the number of state flips along the n transitions:
  Even # of flips: P(Xn+1 = t | X1 = t) = Σ_{k even} C(n, k) ε^k (1 - ε)^(n-k)
  Odd # of flips: P(Xn+1 = f | X1 = t) = Σ_{k odd} C(n, k) ε^k (1 - ε)^(n-k)
[Plot of P(Xn+1 = t | X1 = t) as a function of ε ∈ [0, 0.5], for n = 5, 10, 20.]
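A minimal sketch (not from the slides) that reproduces these numbers: it computes the binomial sum over even numbers of flips and checks it against the closed form (1 + (1 - 2ε)^n) / 2, showing the probability approaching 0.5 as n grows, which is the mixing behavior discussed next.

import math

def chain_prob_sum(eps, n):
    # P(Xn+1 = t | X1 = t): probability of an even number of flips over n transitions
    return sum(math.comb(n, k) * eps**k * (1 - eps)**(n - k)
               for k in range(0, n + 1, 2))

def chain_prob_closed(eps, n):
    # Equivalent closed form: (1 + (1 - 2*eps)^n) / 2
    return (1 + (1 - 2 * eps)**n) / 2

for n in (5, 10, 20):
    for eps in (0.05, 0.2, 0.4):
        p = chain_prob_sum(eps, n)
        assert abs(p - chain_prob_closed(eps, n)) < 1e-12
        print(f"n={n:2d}  eps={eps:.2f}  P = {p:.4f}")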
Stochastic Processes
• This behavior of a chain (a Markov process) is called mixing.
• General Bayesian networks exhibit similar behavior.
• If probabilities are far from 0 and 1, the effect of "far" evidence vanishes (and so can be discarded in approximations).
“Peaked” distributions • If the distribution is “peaked”, then most of the mass is on few instances • If we can focus on these instances, we can ignore the rest Instances
Global conditioning
[Figure: the network over A, B, C, D, E, I, J, K, L, M shown before and after fixing the values of A and B.]
• Fixing values at the beginning of the summation can decrease the size of the tables formed by variable elimination; space is traded for time.
• Special case: choose to fix a set of nodes that "breaks all loops". This method is called cutset conditioning.
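A hedged sketch of the underlying identity (notation mine, not from the slides): for a query Y and evidence e, the joint can be decomposed over the assignments to the conditioning set {A, B},

  P(Y, e) = Σ_{a,b} P(Y, e, a, b),

where each term P(Y, e, a, b) is computed by variable elimination on a network in which A and B are clamped to a and b (hence smaller tables). The conditional query is then recovered as P(Y | e) = P(Y, e) / Σ_y P(y, e).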
Bounded conditioning
Fixing the values of A & B: by examining only the probable assignments of A & B, we perform several simple computations instead of one complex one.
Bounded conditioning
• Choose A and B so that P(Y, e | a, b) can be computed easily, e.g., a cycle cutset.
• Search for highly probable assignments to A, B.
  Option 1: select a, b with high P(a, b).
  Option 2: select a, b with high P(a, b | e).
• We need to search for such high-mass assignments, and that can be hard.
Bounded Conditioning
Advantages:
• Combines exact inference within the approximation
• Continuous (anytime): more time can be used to examine more cases
• Bounds: the unexamined mass is used to compute error bars
Possible problems:
• P(a,b) is prior mass, not the posterior.
• If the posterior P(a,b | e) is significantly different, computation can be wasted on irrelevant assignments.
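One way such error bars can arise (a standard bound consistent with the slide, not necessarily the exact one in the original algorithm): if S is the set of examined assignments to A, B, then

  Σ_{(a,b) ∈ S} P(y, e, a, b)  ≤  P(y, e)  ≤  Σ_{(a,b) ∈ S} P(y, e, a, b) + (1 - Σ_{(a,b) ∈ S} P(a, b)),

since every unexamined term P(y, e, a, b) is at most P(a, b). Dividing such bounds on P(y, e) and on P(e) yields an interval around P(y | e) that shrinks as more assignments are examined.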
Network Simplifications • In these approaches, we try to replace the original network with a simpler one • the resulting network allows fast exact methods
Network Simplifications
Typical simplifications:
• Remove parts of the network
• Remove edges
• Reduce the number of values (value abstraction)
• Replace a sub-network with a simpler one (model abstraction)
These simplifications are often made w.r.t. the particular evidence and query.
Stochastic Simulation
• Suppose our goal is to compute the likelihood of evidence P(e), where e is an assignment to some variables in {X1,…,Xn}.
• Assume that we can sample instances <x1,…,xn> according to the distribution P(x1,…,xn).
• What, then, is the probability that a random sample <x1,…,xn> satisfies e? Answer: simply P(e), which is what we wish to compute.
• Each sample simulates the tossing of a biased coin with probability P(e) of "heads".
Stochastic Sampling
• Intuition: given a sufficient number of samples x[1],…,x[N], we can estimate
  P(e) ≈ (1/N) Σ_i 1{x[i] satisfies e}
  (each term is a zero or a one, so this is just the fraction of samples that satisfy e).
• The law of large numbers implies that as N grows, this estimate converges to P(e) with high probability.
• How many samples do we need to get a reliable estimate? We will not discuss this issue here.
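A minimal sketch of this frequency estimator, assuming a hypothetical sample() function that draws a full instance from P(x1,…,xn) (e.g., the logic-sampling procedure shown later) and a predicate satisfies_e() that checks the evidence:

def estimate_evidence(sample, satisfies_e, N=10_000):
    # Fraction of N independent samples in which the evidence e holds
    hits = sum(1 for _ in range(N) if satisfies_e(sample()))
    return hits / N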
Sampling a Bayesian Network • If P(X1,…,Xn) is represented by a Bayesian network, can we efficiently sample from it? • Idea: sample according to structure of the network • Write distribution using the chain rule, and then sample each variable given its parents
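Spelled out here (not on the slide itself) for the alarm network used in the following slides, whose structure is B → A ← E, E → R, A → C: the chain-rule factorization being sampled is

  P(B, E, A, R, C) = P(B) · P(E) · P(A | B, E) · P(R | E) · P(C | A),

so drawing B, then E, then A given B and E, and so on, each from its CPT, produces one draw from the joint.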
Logic sampling: worked example
[Six animation slides walk through drawing one sample from the Burglary/Earthquake/Alarm/Radio/Call network, with CPTs P(b) = 0.03, P(e) = 0.001, P(a | B, E) taking the values 0.98, 0.7, 0.4, 0.01 across the four parent configurations, P(r | E) taking 0.3 and 0.001, and P(c | A) taking 0.8 and 0.05. Each variable is sampled in turn from its CPT given the already-sampled values of its parents, building up one complete sample row.]
Logic Sampling
• Let X1, …, Xn be an ordering of the variables consistent with arc direction
• for i = 1, …, n do
    sample xi from P(Xi | pai)
  (Note: since Pai ⊆ {X1,…,Xi-1}, we have already assigned values to them)
• return x1, …, xn
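A minimal runnable sketch of this procedure (the network representation and the exact CPT row assignments below are my own illustrative choices, loosely based on the numbers in the slides):

import random

# Each variable maps to (parents, cpt), where cpt maps a tuple of parent values
# to P(variable = True). Variables are listed in topological order.
ALARM_NET = {
    "B": ([], {(): 0.03}),
    "E": ([], {(): 0.001}),
    "A": (["B", "E"], {(True, True): 0.98, (True, False): 0.7,
                       (False, True): 0.4, (False, False): 0.01}),
    "R": (["E"], {(True,): 0.3, (False,): 0.001}),
    "C": (["A"], {(True,): 0.8, (False,): 0.05}),
}

def logic_sample(net):
    # Sample each variable given its already-sampled parents, in topological order
    x = {}
    for var, (parents, cpt) in net.items():
        p_true = cpt[tuple(x[p] for p in parents)]
        x[var] = random.random() < p_true
    return x

Combined with the frequency estimator above, estimate_evidence(lambda: logic_sample(ALARM_NET), lambda x: x["R"] and x["C"]) would approximate P(r, c).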
Logic Sampling
• Sampling a complete instance is linear in the number of variables, regardless of the structure of the network.
• However, if P(e) is small, we need many samples to get a decent estimate.
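A rough way to see this (a standard Monte Carlo argument, not from the slides): each sample is a Bernoulli(P(e)) trial, so the frequency estimate has standard deviation sqrt(P(e)(1 - P(e)) / N) and relative error of about sqrt((1 - P(e)) / (N · P(e))). Achieving relative error δ therefore requires roughly N ≈ (1 - P(e)) / (δ² · P(e)) samples, which blows up as P(e) → 0.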
Can we sample from P(Xi | e)?
• If the evidence e is at the roots of the Bayesian network, easily.
• If the evidence is at the leaves of the network, we have a problem: our sampling method proceeds according to the order of nodes in the network.
[Figure: a network over B, Z, R, X with evidence A = a at a leaf.]
Likelihood Weighting
• Can we ensure that all of our samples satisfy e?
• One simple (but wrong) solution: when we need to sample a variable Y that is assigned a value by e, use its specified value.
• For example: we know Y = 1. Sample X from P(X), then take Y = 1.
• Is this a sample from P(X, Y | Y = 1)? NO.
Likelihood Weighting
• Problem: these samples of X are from P(X), not from P(X | Y = 1).
• Solution: penalize samples in which P(Y = 1 | X) is small.
• We now sample as follows:
  Let xi be a sample from P(X)
  Let wi = P(Y = 1 | X = xi)
Likelihood Weighting
• Let X1, …, Xn be an ordering of the variables consistent with arc direction
• w = 1
• for i = 1, …, n do
    if Xi = xi has been observed:
      w ← w · P(Xi = xi | pai)
    else:
      sample xi from P(Xi | pai)
• return x1, …, xn, and w
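A minimal sketch of this procedure, reusing the hypothetical ALARM_NET representation (and the random import) from the logic-sampling sketch above; evidence is a dict of observed truth values:

def weighted_sample(net, evidence):
    # One sample consistent with the evidence, together with its weight
    x, w = {}, 1.0
    for var, (parents, cpt) in net.items():
        p_true = cpt[tuple(x[p] for p in parents)]
        if var in evidence:
            x[var] = evidence[var]
            w *= p_true if evidence[var] else (1.0 - p_true)
        else:
            x[var] = random.random() < p_true
    return x, w

def likelihood_weighting(net, evidence, query_var, N=10_000):
    # Estimate P(query_var = True | evidence) as a ratio of weighted counts
    num = den = 0.0
    for _ in range(N):
        x, w = weighted_sample(net, evidence)
        den += w
        if x[query_var]:
            num += w
    return num / den

For instance, likelihood_weighting(ALARM_NET, {"R": True, "C": True}, "B") would estimate P(b | r, c) under the illustrative CPT numbers.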
Likelihood weighting: worked example
[Five animation slides repeat the Burglary/Earthquake/Alarm/Radio/Call network, now with evidence fixed on A and R. Unobserved variables are sampled from their CPTs as before, while each observed variable multiplies the sample's weight by the probability of its observed value given its sampled parents; in the example the accumulated weight is 0.6 · 0.3.]
Likelihood Weighting
• Why does this make sense?
• When N is large, we expect to sample about N · P(X = x) samples with x[i] = x, each receiving weight P(Y = 1 | x).
• Thus,
  Σ_{i: x[i] = x} w[i] / Σ_i w[i]  ≈  N · P(x) · P(Y = 1 | x) / (N · P(Y = 1))  =  P(x | Y = 1).
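A quick numeric illustration (numbers invented here, not from the slides): suppose P(X = 1) = 0.2, P(Y = 1 | X = 1) = 0.9, P(X = 0) = 0.8, P(Y = 1 | X = 0) = 0.1. Out of N samples, roughly 0.2N have x = 1 with weight 0.9 and 0.8N have x = 0 with weight 0.1, so the weighted estimate is 0.2·0.9 / (0.2·0.9 + 0.8·0.1) = 0.18 / 0.26 ≈ 0.69, which matches P(X = 1 | Y = 1) computed by Bayes' rule.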
Summary
Approximate inference is needed for large pedigrees. We have seen a few methods today; some fit genetic linkage analysis and some do not. There are many other approximation algorithms: variational methods, MCMC, and others. In next semester's Project in Bioinformatics course (236524), we will offer projects that implement some of these approximation methods and embed them in the superlink software.