## CS b553: Algorithms for Optimization and Learning


**Monte Carlo Methods for Probabilistic Inference**

**Agenda**
• Monte Carlo methods
• $O(1/\sqrt{N})$ standard deviation
• Monte Carlo methods for Bayesian inference
• Likelihood weighting
• Gibbs sampling

**Monte Carlo Integration**
• Estimate large integrals/sums:
• $I = \int f(x)\,p(x)\,dx$ or $I = \sum_x f(x)\,p(x)$
• using N i.i.d. samples $x^{(1)},\dots,x^{(N)}$ drawn from $p(x)$:
• $I \approx I_N = \frac{1}{N}\sum_{i=1}^{N} f(x^{(i)})$
• Examples:
• $\int_a^b f(x)\,dx \approx \frac{b-a}{N}\sum_{i=1}^{N} f(x^{(i)})$, with the $x^{(i)}$ uniform on $[a,b]$
• $E[X] = \int x\,p(x)\,dx \approx \frac{1}{N}\sum_{i=1}^{N} x^{(i)}$
• Volume of a set in $\mathbb{R}^n$

**Mean & Variance of the Estimate**
• Let $I_N$ be the random variable denoting the estimate of the integral with N samples
• What is the bias (mean error) $E[I - I_N]$?
• $E[I - I_N] = I - E[I_N]$ (linearity of expectation)
• $= E[f(x)] - \frac{1}{N}\sum_i E[f(x^{(i)})]$ (definitions of $I$ and $I_N$)
• $= \frac{1}{N}\sum_i \left(E[f(x)] - E[f(x^{(i)})]\right) = \frac{1}{N}\sum_i 0 = 0$ (each $x^{(i)}$ is distributed according to $p(x)$)
• So $I_N$ is an unbiased estimator
• What is the variance $\mathrm{Var}[I_N]$?
• $\mathrm{Var}[I_N] = \mathrm{Var}\left[\frac{1}{N}\sum_i f(x^{(i)})\right]$ (definition)
• $= \frac{1}{N^2}\,\mathrm{Var}\left[\sum_i f(x^{(i)})\right]$ (scaling of variance)
• $= \frac{1}{N^2}\sum_i \mathrm{Var}[f(x^{(i)})]$ (variance of a sum of independent variables)
• $= \frac{1}{N}\,\mathrm{Var}[f(x)]$ (i.i.d. sample)
• Standard deviation: $O(1/\sqrt{N})$
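To make the $O(1/\sqrt{N})$ claim concrete, here is a minimal Python sketch (not from the slides; the test function and sample counts are arbitrary illustrative choices) that reruns the estimator many times and measures how its spread shrinks with N:

```python
import random
import statistics

def mc_estimate(f, n):
    """One Monte Carlo estimate I_N = (1/N) * sum_i f(x_i), x_i ~ Uniform(0,1)."""
    return sum(f(random.random()) for _ in range(n)) / n

def f(x):
    return x * x  # E[f(X)] = integral_0^1 x^2 dx = 1/3

for n in (100, 10_000):
    # Rerun the estimator 200 times to observe the spread of I_N.
    estimates = [mc_estimate(f, n) for _ in range(200)]
    print(f"N={n:6d}  mean={statistics.mean(estimates):.4f}  "
          f"stdev={statistics.stdev(estimates):.4f}")
# The mean stays near 1/3 (unbiased); growing N by 100x shrinks the
# stdev by about 10x, i.e., O(1/sqrt(N)).
```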
**Approximate Inference Through Sampling**
• Unconditional simulation: to estimate the probability of a coin flipping heads, I can flip it a huge number of times and count the fraction of heads observed
• Conditional simulation: to estimate the probability P(H) that a coin picked out of a bucket B flips heads:
• Repeat for i = 1,…,N:
• Pick a coin C out of a random bucket $b^{(i)}$ chosen with probability P(B)
• $h^{(i)}$ = result of flipping C according to probability $P(H \mid b^{(i)})$
• Each sample $(h^{(i)}, b^{(i)})$ comes from the distribution P(H,B), so the result approximates P(H,B)

**Monte Carlo Inference in Bayes Nets**
• Given a BN over variables X, repeat for i = 1,…,N:
• In top-down (topological) order, generate $x^{(i)}$ by sampling $x_j^{(i)} \sim P(X_j \mid pa_{X_j}^{(i)})$
• (the right-hand side is obtained by plugging the already-sampled parent values into the CPT for $X_j$)
• The samples $x^{(1)},\dots,x^{(N)}$ approximate the distribution over X

**Approximate Inference: Monte Carlo Simulation**
• (BN diagram: Burglary → Alarm ← Earthquake, with Alarm → JohnCalls and Alarm → MaryCalls)
• Sample from the joint distribution, e.g.: B=0 E=0 A=0 J=1 M=0
• As more samples are generated, the distribution of the samples approaches the joint distribution:
• B=0 E=0 A=0 J=1 M=0
• B=0 E=0 A=0 J=0 M=0
• B=0 E=0 A=0 J=0 M=0
• B=1 E=0 A=1 J=1 M=0

**Basic Method for Handling Evidence**
• Inference: given evidence E=e (e.g., J=1), approximate $P(X \setminus E \mid E{=}e)$
• Remove the samples that conflict with the evidence; the distribution of the remaining samples approximates the conditional distribution
• E.g., with evidence J=1, only the first and last of the four samples above are kept

**Rare-Event Problem**
• What if some events are really rare (e.g., burglary ∧ earthquake)?
• The number of samples must be huge to get a reasonable estimate
• Solution: likelihood weighting
• Enforce that each sample agrees with the evidence
• While generating a sample, keep track of the ratio of (how likely the sampled value is to occur in the real world) to (how likely you were to generate the sampled value)

**Likelihood Weighting: Example**
• Suppose the evidence is Alarm = 1 and MaryCalls = 1, and B and E are each sampled with P = 0.5
• Sample 1: start with w = 1; drawing B=0, E=1 updates w to 0.008 (the prior probability of B=0, E=1 divided by the 0.25 chance of generating it); A=1 is enforced, and w is updated to 0.0023 to reflect the likelihood that this occurs; enforcing M=1 gives w = 0.0016 (J=1 is sampled freely)
• Sample 2: B=0, E=0 gives w = 3.988; enforcing A=1 gives w = 0.004; enforcing M=1 gives w = 0.0028 (J=1)
• Sample 3: B=1, E=0 with A=1 enforced gives w = 0.00375; enforcing M=1 gives w = 0.0026 (J=1)
• Sample 4: B=1, E=1 with A=1, M=1 enforced ends with w = 5e-7 ≈ 0
• These N = 4 samples give the weighted fraction of samples with B=1: $P(B \mid A, M) \approx \frac{0.0026 + 0}{0.0016 + 0.0028 + 0.0026 + 0} \approx 0.371$
• Exact inference gives $P(B \mid A, M) = 0.375$
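A minimal Python sketch of this procedure follows. The slides don't list the network's CPTs, but the standard textbook values for the burglary network reproduce the weights in the walkthrough above, so the sketch assumes them; as in the lecture's variant, B and E are drawn uniformly and the weight tracks the real-world/proposal ratio:

```python
import random

# Alarm-network CPTs: not given on the slides, but these standard textbook
# values reproduce the walkthrough's weights, so they are assumed here.
P_B = 0.001                                   # P(Burglary=1)
P_E = 0.002                                   # P(Earthquake=1)
P_A = {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001}  # P(A=1|B,E)
P_J = {1: 0.90, 0: 0.05}                      # P(JohnCalls=1 | A)
P_M = {1: 0.70, 0: 0.01}                      # P(MaryCalls=1 | A)

def weighted_sample():
    """One likelihood-weighted sample given evidence A=1, M=1.

    B and E are drawn uniformly (P=0.5 each), and the weight carries the
    ratio (real-world probability) / (proposal probability); evidence
    variables are enforced and contribute their likelihood to the weight.
    """
    w = 1.0
    b = random.random() < 0.5
    w *= (P_B if b else 1 - P_B) / 0.5         # real-world prob / proposal prob
    e = random.random() < 0.5
    w *= (P_E if e else 1 - P_E) / 0.5
    w *= P_A[(int(b), int(e))]                 # enforce evidence A=1
    w *= P_M[1]                                # enforce evidence M=1
    j = random.random() < P_J[1]               # J is unobserved: sample freely
    return int(b), w

def estimate_p_burglary(n):
    """Estimate P(B=1 | A=1, M=1) as a weighted fraction of samples."""
    num = den = 0.0
    for _ in range(n):
        b, w = weighted_sample()
        den += w
        num += w * b
    return num / den

print(estimate_p_burglary(100_000))            # should approach ~0.37
```

By contrast, rejection sampling (the basic method above) would discard every sample with A=0 or M=0, wasting nearly all draws because the evidence is rare.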
**Another Rare-Event Problem**
• (BN diagram: arcs $A_i \to B_i$ for $A_1,\dots,A_{10}$ and $B_1,\dots,B_{10}$)
• B = b is given as evidence
• Each $b_i$ is improbable given every setting of $A_i$ but one (say, $A_i = 1$)
• The chance of sampling all 1's is very low, so most likelihood weights will be far too low
• Problem: the evidence is not being used to sample the A's effectively (i.e., near $P(A_i \mid b)$)

**Gibbs Sampling**
• Idea: reduce the computational burden of sampling from a multidimensional distribution $P(x) = P(x_1,\dots,x_n)$ by doing repeated draws of individual attributes
• Cycle through j = 1,…,n, sampling $x_j \sim P(x_j \mid x_1,\dots,x_{j-1},x_{j+1},\dots,x_n)$
• Over the long run, the random walk taken by x approaches the true distribution P(x)

**Gibbs Sampling in BNs**
• Each Gibbs sampling step: 1) pick a variable $X_i$, 2) sample $x_i \sim P(X_i \mid X \setminus X_i)$
• Only the values in the "Markov blanket" of $X_i$ matter:
• its parents $Pa_{X_i}$
• its children $Y_1,\dots,Y_k$
• the parents of its children, excluding $X_i$: $Pa_{Y_1} \setminus X_i,\dots,Pa_{Y_k} \setminus X_i$
• $X_i$ is independent of the rest of the network given its Markov blanket, so sample
• $x_i \sim P(X_i \mid Pa_{X_i}, Y_1, Pa_{Y_1} \setminus X_i, \dots, Y_k, Pa_{Y_k} \setminus X_i) = \frac{1}{Z}\, P(X_i \mid Pa_{X_i}) \prod_{j=1}^{k} P(Y_j \mid Pa_{Y_j})$
• i.e., the product of $X_i$'s own factor and the factors of its children

**Handling Evidence**
• Simply set each evidence variable to its appropriate value, and don't sample it
• The resulting walk approximates the distribution $P(X \setminus E \mid E{=}e)$
• Uses evidence more efficiently than likelihood weighting
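Here is a minimal Gibbs sampler for the same query on the burglary network, again under the assumed textbook CPT values from the previous sketch; each non-evidence variable is redrawn from the normalized product of its own factor and its children's factors:

```python
import random

# Same assumed textbook CPTs as in the likelihood-weighting sketch.
P_B = 0.001                                    # P(Burglary=1)
P_E = 0.002                                    # P(Earthquake=1)
P_A = {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001}  # P(A=1|B,E)
P_J = {1: 0.90, 0: 0.05}                       # P(JohnCalls=1 | A)

def gibbs_p_burglary(n_steps, burn_in=1000):
    """Estimate P(B=1 | A=1, M=1) by Gibbs sampling over B, E, J.

    Evidence A=1, M=1 is clamped and never resampled. Each non-evidence
    variable is redrawn from P(Xi | Markov blanket), proportional to Xi's
    own CPT entry times the CPT entries of Xi's children.
    """
    a = 1                                      # clamped evidence
    b, e, j = 0, 0, 0                          # arbitrary initialization
    count_b = 0
    for step in range(n_steps):
        # P(B | E=e, A=1) is proportional to P(B) * P(A=1 | B, e):
        # A is B's only child.
        p1 = P_B * P_A[(1, e)]
        p0 = (1 - P_B) * P_A[(0, e)]
        b = int(random.random() < p1 / (p1 + p0))
        # P(E | B=b, A=1) is proportional to P(E) * P(A=1 | b, E).
        p1 = P_E * P_A[(b, 1)]
        p0 = (1 - P_E) * P_A[(b, 0)]
        e = int(random.random() < p1 / (p1 + p0))
        # J's Markov blanket is just its parent A; J has no children.
        j = int(random.random() < P_J[a])
        if step >= burn_in:                    # discard early, poorly mixed steps
            count_b += b
    return count_b / (n_steps - burn_in)

print(gibbs_p_burglary(200_000))               # should approach ~0.37
```

Note that the estimate only counts samples collected after a burn-in period, which connects directly to the mixing-time issue discussed next.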
**Gibbs Sampling Issues**
• Demonstrating correctness and convergence requires examining the underlying Markov chain random walk (more on this later)
• Many steps may be needed before the effects of a poor initialization wear off (the mixing time), and it is difficult to tell a priori how many are required
• Numerous variants exist; these methods are collectively known as Markov chain Monte Carlo (MCMC) techniques

**Next Time**
• Continuous and hybrid distributions