# CS B553: Algorithms for Optimization and Learning

##### Presentation Transcript

1. CS B553: Algorithms for Optimization and Learning • Monte Carlo Methods for Probabilistic Inference

2. Agenda • Monte Carlo methods • O(1/sqrt(N)) standard deviation • For Bayesian inference • Likelihood weighting • Gibbs sampling

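3. Monte Carlo Integration • Estimate large integrals/sums: • $I = \int f(x)\,p(x)\,dx$ • $I = \sum_x f(x)\,p(x)$ • Using N i.i.d. samples $x^{(1)},\dots,x^{(N)}$ from $p(x)$ • $I \approx \frac{1}{N}\sum_{i=1}^{N} f(x^{(i)})$ • Examples: • $\int_a^b f(x)\,dx \approx \frac{b-a}{N}\sum_{i=1}^{N} f(x^{(i)})$ • $E[X] = \int x\,p(x)\,dx \approx \frac{1}{N}\sum_{i=1}^{N} x^{(i)}$ • Volume of a set in $\mathbb{R}^n$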
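
A minimal Python sketch of the estimators on this slide. The function names, the test integrand $x^2$ on $[0, 3]$, and the Gaussian example are illustrative assumptions, not part of the lecture; the interval example assumes the samples are drawn uniformly from $[a, b]$.

```python
import random

def mc_expectation(f, sampler, n, rng):
    """Estimate E[f(X)] as the average of f over n i.i.d. draws from sampler."""
    return sum(f(sampler(rng)) for _ in range(n)) / n

def mc_integral(f, a, b, n, rng):
    """Estimate the integral of f over [a, b] from uniform samples on [a, b]."""
    return (b - a) * mc_expectation(f, lambda r: r.uniform(a, b), n, rng)

rng = random.Random(0)  # fixed seed so the run is repeatable

# Integral of x^2 over [0, 3]; the exact value is 9.
integral_est = mc_integral(lambda x: x * x, 0.0, 3.0, 100_000, rng)

# E[X] for a Gaussian with mean 2 and standard deviation 1.
mean_est = mc_expectation(lambda x: x, lambda r: r.gauss(2.0, 1.0), 100_000, rng)

print(integral_est, mean_est)
```

Both estimates should land close to 9 and 2 respectively, with an error that shrinks as N grows.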

4. Mean & Variance of estimate • Let $I_N$ be the random variable denoting the estimate of the integral with N samples • What is the bias (mean error) $E[I - I_N]$?

5. Mean & Variance of estimate • Let $I_N$ be the random variable denoting the estimate of the integral with N samples • What is the bias (mean error) $E[I - I_N]$? • $E[I - I_N] = I - E[I_N]$ (linearity of expectation)

6. Mean & Variance of estimate • Let $I_N$ be the random variable denoting the estimate of the integral with N samples • What is the bias (mean error) $E[I - I_N]$? • $E[I - I_N] = I - E[I_N]$ (linearity of expectation) $= E[f(x)] - \frac{1}{N}\sum_{i=1}^{N} E[f(x^{(i)})]$ (definition of $I$ and $I_N$)

7. Mean & Variance of estimate • Let $I_N$ be the random variable denoting the estimate of the integral with N samples • What is the bias (mean error) $E[I - I_N]$? • $E[I - I_N] = I - E[I_N]$ (linearity of expectation) $= E[f(x)] - \frac{1}{N}\sum_{i=1}^{N} E[f(x^{(i)})]$ (definition of $I$ and $I_N$) $= \frac{1}{N}\sum_{i=1}^{N} \left(E[f(x)] - E[f(x^{(i)})]\right) = \frac{1}{N}\sum_{i=1}^{N} 0$ ($x$ and $x^{(i)}$ are both distributed according to $p(x)$) $= 0$

8. Mean & Variance of estimate • Let $I_N$ be the random variable denoting the estimate of the integral with N samples • What is the bias (mean error) $E[I - I_N]$? • Unbiased estimator • What is the variance $\mathrm{Var}[I_N]$?

9. Mean & Variance of estimate • Let $I_N$ be the random variable denoting the estimate of the integral with N samples • What is the bias (mean error) $E[I - I_N]$? • Unbiased estimator • What is the variance $\mathrm{Var}[I_N]$? • $\mathrm{Var}[I_N] = \mathrm{Var}\left[\frac{1}{N}\sum_{i=1}^{N} f(x^{(i)})\right]$ (definition)

10. Mean & Variance of estimate • Let $I_N$ be the random variable denoting the estimate of the integral with N samples • What is the bias (mean error) $E[I - I_N]$? • Unbiased estimator • What is the variance $\mathrm{Var}[I_N]$? • $\mathrm{Var}[I_N] = \mathrm{Var}\left[\frac{1}{N}\sum_{i=1}^{N} f(x^{(i)})\right]$ (definition) $= \frac{1}{N^2}\mathrm{Var}\left[\sum_{i=1}^{N} f(x^{(i)})\right]$ (scaling of variance)

11. Mean & Variance of estimate • Let $I_N$ be the random variable denoting the estimate of the integral with N samples • What is the bias (mean error) $E[I - I_N]$? • Unbiased estimator • What is the variance $\mathrm{Var}[I_N]$? • $\mathrm{Var}[I_N] = \mathrm{Var}\left[\frac{1}{N}\sum_{i=1}^{N} f(x^{(i)})\right]$ (definition) $= \frac{1}{N^2}\mathrm{Var}\left[\sum_{i=1}^{N} f(x^{(i)})\right]$ (scaling of variance) $= \frac{1}{N^2}\sum_{i=1}^{N} \mathrm{Var}[f(x^{(i)})]$ (variance of a sum of independent variables)

12. Mean & Variance of estimate • Let $I_N$ be the random variable denoting the estimate of the integral with N samples • What is the bias (mean error) $E[I - I_N]$? • Unbiased estimator • What is the variance $\mathrm{Var}[I_N]$? • $\mathrm{Var}[I_N] = \mathrm{Var}\left[\frac{1}{N}\sum_{i=1}^{N} f(x^{(i)})\right]$ (definition) $= \frac{1}{N^2}\mathrm{Var}\left[\sum_{i=1}^{N} f(x^{(i)})\right]$ (scaling of variance) $= \frac{1}{N^2}\sum_{i=1}^{N} \mathrm{Var}[f(x^{(i)})] = \frac{1}{N}\mathrm{Var}[f(x)]$ (i.i.d. sample)

13. Mean & Variance of estimate • Let $I_N$ be the random variable denoting the estimate of the integral with N samples • What is the bias (mean error) $E[I - I_N]$? • Unbiased estimator • What is the variance $\mathrm{Var}[I_N]$? • $\frac{1}{N}\mathrm{Var}[f(x)]$ • Standard deviation: $O(1/\sqrt{N})$
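
The $O(1/\sqrt{N})$ rate can be checked empirically. The sketch below is an illustration under assumed choices (estimating the mean of a Uniform(0, 1) variable, 2000 repeated trials, a hypothetical helper `estimate_std`): it measures the spread of the N-sample estimate for N=100 and N=400, where quadrupling N should roughly halve the standard deviation.

```python
import random
import statistics

def estimate_std(n, trials, rng):
    """Empirical std of the n-sample estimate of E[U], with U ~ Uniform(0, 1)."""
    estimates = [sum(rng.random() for _ in range(n)) / n for _ in range(trials)]
    return statistics.pstdev(estimates)

rng = random.Random(1)  # fixed seed so the run is repeatable
std_100 = estimate_std(100, 2000, rng)
std_400 = estimate_std(400, 2000, rng)

# Theory: sqrt(Var[U] / N) with Var[U] = 1/12, so the ratio should be near 2.
print(std_100, std_400, std_100 / std_400)
```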

14. Approximate Inference Through Sampling • Unconditional simulation: • To estimate the probability of a coin flipping heads, I can flip it a huge number of times and count the fraction of heads observed

15. Approximate Inference Through Sampling • Unconditional simulation: • To estimate the probability of a coin flipping heads, I can flip it a huge number of times and count the fraction of heads observed • Conditional simulation: • To estimate the probability P(H) that a coin picked out of bucket B flips heads: • Repeat for i=1,…,N: • Pick a coin C out of a random bucket $b^{(i)}$ chosen with probability P(B) • $h^{(i)}$ = flip C according to probability $P(H \mid b^{(i)})$ • Sample $(h^{(i)}, b^{(i)})$ comes from distribution P(H,B) • Result approximates P(H,B)
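
A small sketch of the conditional simulation described on this slide. The two bucket names and every probability in `P_B` and `P_H_GIVEN_B` are made-up values for illustration; the lecture does not specify them.

```python
import random
from collections import Counter

P_B = {"fair": 0.7, "biased": 0.3}          # hypothetical P(B)
P_H_GIVEN_B = {"fair": 0.5, "biased": 0.9}  # hypothetical P(H=1 | B)

def sample_pair(rng):
    """Draw a bucket from P(B), then flip a coin from it with P(H | b)."""
    b = "fair" if rng.random() < P_B["fair"] else "biased"
    h = 1 if rng.random() < P_H_GIVEN_B[b] else 0
    return h, b

rng = random.Random(2)  # fixed seed so the run is repeatable
n = 50_000
counts = Counter(sample_pair(rng) for _ in range(n))

# Relative frequencies of (h, b) pairs approximate the joint P(H, B).
joint = {pair: c / n for pair, c in counts.items()}

# Summing out the bucket gives an estimate of the marginal P(H=1).
p_heads = sum(p for (h, _), p in joint.items() if h == 1)
print(joint, p_heads)
```

With these assumed numbers the true marginal is P(H=1) = 0.7·0.5 + 0.3·0.9 = 0.62.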

16. Monte Carlo Inference In Bayes Nets • BN over variables X • Repeat for i=1,…,N • In top-down order, generate $x^{(i)}$ as follows: • Sample $x_j^{(i)} \sim P(X_j \mid \mathrm{pa}_{X_j}^{(i)})$ • (RHS is obtained by plugging the parent values already sampled into the CPT for $X_j$) • The samples $x^{(1)},\dots,x^{(N)}$ approximate the distribution over X
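
The top-down sampling loop can be sketched on the burglary network used later in the lecture. The CPT numbers below are an assumption: they are the values commonly quoted for this textbook alarm network, and the slides do not list them. The function names are hypothetical.

```python
import random

# Assumed CPTs (common textbook values; not given on the slides).
P_B = 0.001                      # P(Burglary=1)
P_E = 0.002                      # P(Earthquake=1)
P_A = {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001}  # P(A=1 | B, E)
P_J = {1: 0.90, 0: 0.05}         # P(JohnCalls=1 | A)
P_M = {1: 0.70, 0: 0.01}         # P(MaryCalls=1 | A)

def bernoulli(p, rng):
    return 1 if rng.random() < p else 0

def forward_sample(rng):
    """Sample every variable in top-down order, parents before children."""
    b = bernoulli(P_B, rng)
    e = bernoulli(P_E, rng)
    a = bernoulli(P_A[(b, e)], rng)  # look up the CPT row for the sampled parents
    j = bernoulli(P_J[a], rng)
    m = bernoulli(P_M[a], rng)
    return {"B": b, "E": e, "A": a, "J": j, "M": m}

rng = random.Random(3)  # fixed seed so the run is repeatable
samples = [forward_sample(rng) for _ in range(200_000)]

# The fraction of samples with J=1 approximates the marginal P(J=1).
p_john = sum(s["J"] for s in samples) / len(samples)
print(p_john)
```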

17. Approximate Inference: Monte-Carlo Simulation • [Figure: Bayes net with nodes Burglary, Earthquake, Alarm, JohnCalls, MaryCalls] • Sample from the joint distribution • Sample: (B=0 E=0 A=0 J=1 M=0)

18. Approximate Inference: Monte-Carlo Simulation • As more samples are generated, the distribution of the samples approaches the joint distribution • Samples: (B=0 E=0 A=0 J=1 M=0), (B=0 E=0 A=0 J=0 M=0), (B=0 E=0 A=0 J=0 M=0), (B=1 E=0 A=1 J=1 M=0)

19. Basic method for Handling Evidence • Inference: given evidence E=e (e.g., J=1), approximate $P(X \setminus E \mid E=e)$ • Remove the samples that conflict with the evidence • Samples: (B=0 E=0 A=0 J=1 M=0), (B=0 E=0 A=0 J=0 M=0), (B=0 E=0 A=0 J=0 M=0), (B=1 E=0 A=1 J=1 M=0) • Distribution of remaining samples approximates the conditional distribution
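
A sketch of this discard-the-conflicts approach (often called rejection sampling) for the evidence J=1. The CPT values are assumed textbook numbers that the slides do not state, and `rejection_estimate` is a hypothetical helper. It also reports the fraction of samples kept, which shows how wasteful the method becomes when the evidence is unlikely.

```python
import random

# Assumed CPTs (common textbook values; not given on the slides).
P_B, P_E = 0.001, 0.002
P_A = {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001}  # P(A=1 | B, E)
P_J = {1: 0.90, 0: 0.05}  # P(J=1 | A)
P_M = {1: 0.70, 0: 0.01}  # P(M=1 | A)

def forward_sample(rng):
    """Draw one full sample from the joint, parents before children."""
    b = int(rng.random() < P_B)
    e = int(rng.random() < P_E)
    a = int(rng.random() < P_A[(b, e)])
    j = int(rng.random() < P_J[a])
    m = int(rng.random() < P_M[a])
    return {"B": b, "E": e, "A": a, "J": j, "M": m}

def rejection_estimate(query, evidence, n, rng):
    """Estimate P(query=1 | evidence); also return the fraction of samples kept."""
    kept = hits = 0
    for _ in range(n):
        s = forward_sample(rng)
        # Keep the sample only if it agrees with every evidence variable.
        if all(s[var] == val for var, val in evidence.items()):
            kept += 1
            hits += s[query]
    if kept == 0:
        raise ValueError("no sample matched the evidence")
    return hits / kept, kept / n

p_alarm_given_john, kept_fraction = rejection_estimate(
    "A", {"J": 1}, 200_000, random.Random(6)
)
print(p_alarm_given_john, kept_fraction)
```

Under these assumed CPTs only about 5% of the samples survive, so most of the sampling effort is thrown away.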

20. Rare Event Problem: • What if some events are really rare (e.g., burglary & earthquake)? • # of samples must be huge to get a reasonable estimate • Solution: likelihood weighting • Enforce that each sample agrees with evidence • While generating a sample, keep track of the ratio (how likely the sampled value is to occur in the real world) / (how likely you were to generate the sampled value)

21. Likelihood weighting • [Figure: Bayes net with nodes Burglary, Earthquake, Alarm, JohnCalls, MaryCalls] • Suppose evidence Alarm & MaryCalls • Sample B,E with P=0.5 • w=1

22. Likelihood weighting • Suppose evidence Alarm & MaryCalls • Sample B,E with P=0.5 • B=0 E=1 • w=0.008

23. Likelihood weighting • Suppose evidence Alarm & MaryCalls • Sample B,E with P=0.5 • B=0 E=1 A=1 • w=0.0023 • A=1 is enforced, and the weight updated to reflect the likelihood that this occurs

24. Likelihood weighting • Suppose evidence Alarm & MaryCalls • Sample B,E with P=0.5 • B=0 E=1 A=1 M=1 J=1 • w=0.0016

25. Likelihood weighting • Suppose evidence Alarm & MaryCalls • Sample B,E with P=0.5 • B=0 E=0 • w=3.988

26. Likelihood weighting • Suppose evidence Alarm & MaryCalls • Sample B,E with P=0.5 • B=0 E=0 A=1 • w=0.004

27. Likelihood weighting • Suppose evidence Alarm & MaryCalls • Sample B,E with P=0.5 • B=0 E=0 A=1 M=1 J=1 • w=0.0028

28. Likelihood weighting • Suppose evidence Alarm & MaryCalls • Sample B,E with P=0.5 • B=1 E=0 A=1 • w=0.00375

29. Likelihood weighting • Suppose evidence Alarm & MaryCalls • Sample B,E with P=0.5 • B=1 E=0 A=1 M=1 J=1 • w=0.0026

30. Likelihood weighting • Suppose evidence Alarm & MaryCalls • Sample B,E with P=0.5 • B=1 E=1 A=1 M=1 J=1 • w=5e-7

31. Likelihood weighting • Suppose evidence Alarm & MaryCalls • Sample B,E with P=0.5 • Samples and weights: (B=0 E=1 A=1 M=1 J=1) w=0.0016; (B=0 E=0 A=1 M=1 J=1) w=0.0028; (B=1 E=0 A=1 M=1 J=1) w=0.0026; (B=1 E=1 A=1 M=1 J=1) w≈0 • N=4 gives P(B|A,M) ≈ 0.371 • Exact inference gives P(B|A,M) = 0.375
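
The weight bookkeeping in this walkthrough can be sketched as follows. The CPT values are an assumption (common textbook numbers, which the slides do not list); with them, the weights for the first three samples round to the 0.0016, 0.0028 and 0.0026 shown above. `weight_for`, `weighted_sample` and `estimate_burglary` are hypothetical names.

```python
import random

# Assumed CPTs (common textbook values; not given on the slides).
P_B, P_E = 0.001, 0.002
P_A = {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001}  # P(A=1 | B, E)
P_J = {1: 0.90, 0: 0.05}  # P(J=1 | A)
P_M = {1: 0.70, 0: 0.01}  # P(M=1 | A)
PROPOSAL = 0.5            # B and E are drawn with probability 0.5, as on the slides

def weight_for(b, e):
    """Weight of a sample with the given B, E under evidence A=1, M=1."""
    w = 1.0
    w *= (P_B if b else 1 - P_B) / PROPOSAL  # true probability / sampling probability
    w *= (P_E if e else 1 - P_E) / PROPOSAL
    w *= P_A[(b, e)]  # A=1 is enforced; multiply by its likelihood
    w *= P_M[1]       # M=1 is enforced; multiply by P(M=1 | A=1)
    return w

def weighted_sample(rng):
    b = int(rng.random() < PROPOSAL)
    e = int(rng.random() < PROPOSAL)
    j = int(rng.random() < P_J[1])  # J is not evidence: sampled, weight unchanged
    return {"B": b, "E": e, "A": 1, "J": j, "M": 1}, weight_for(b, e)

def estimate_burglary(n, rng):
    """Weighted estimate of P(B=1 | A=1, M=1)."""
    num = den = 0.0
    for _ in range(n):
        s, w = weighted_sample(rng)
        num += w * s["B"]
        den += w
    return num / den

print(weight_for(0, 1), weight_for(0, 0), weight_for(1, 0))
print(estimate_burglary(20_000, random.Random(4)))
```

Every generated sample agrees with the evidence, so none are discarded; the weights correct for having sampled B and E from the wrong distribution.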

32. Another Rare-Event Problem • [Figure: network with nodes A1, A2, …, A10 and B1, B2, …, B10] • B=b given as evidence • Each $b_i$ is improbable under all but one setting of $A_i$ (say, $A_i=1$) • Chance of sampling all 1’s is very low => most likelihood weights will be too low • Problem: evidence is not being used to sample the A’s effectively (i.e., near $P(A_i \mid b)$)

33. Gibbs Sampling • Idea: reduce the computational burden of sampling from a multidimensional distribution $P(x)=P(x_1,\dots,x_n)$ by doing repeated draws of individual attributes • Cycle through j=1,…,n • Sample $x_j \sim P(x_j \mid x_1,\dots,x_{j-1},x_{j+1},\dots,x_n)$ • Over the long run, the random walk taken by x approaches the true distribution P(x)
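
A minimal sketch of the cycle on this slide, for a joint over two binary variables. The table `JOINT`, the burn-in length and the function names are illustrative assumptions. Each step redraws one coordinate from its conditional given the other, and the visit frequencies are compared against the table.

```python
import random
from collections import Counter

# Hypothetical joint distribution P(x1, x2) over two binary variables.
JOINT = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

def resample(state, j, rng):
    """Draw x_j from P(x_j | all other coordinates) and return the new state."""
    candidates = []
    for v in (0, 1):
        s = list(state)
        s[j] = v
        candidates.append(tuple(s))
    # Conditional probability that x_j = 1 given the other coordinate.
    p1 = JOINT[candidates[1]] / (JOINT[candidates[0]] + JOINT[candidates[1]])
    return candidates[1] if rng.random() < p1 else candidates[0]

def gibbs(n_sweeps, burn_in, rng):
    """Run the sampler and return the visit frequency of each state."""
    state = (0, 0)
    counts = Counter()
    for sweep in range(n_sweeps):
        for j in range(2):  # cycle through the coordinates
            state = resample(state, j, rng)
        if sweep >= burn_in:  # discard early sweeps affected by initialization
            counts[state] += 1
    total = sum(counts.values())
    return {s: c / total for s, c in counts.items()}

freqs = gibbs(60_000, 1_000, random.Random(5))
print(freqs)
```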

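34. Gibbs Sampling in BNs • Each Gibbs sampling step: 1) pick a variable $X_i$, 2) sample $x_i \sim P(X_i \mid X \setminus X_i)$ • Look at values of the “Markov blanket” of $X_i$: • Parents $\mathrm{Pa}_{X_i}$ • Children $Y_1,\dots,Y_k$ • Parents of children (excluding $X_i$) $\mathrm{Pa}_{Y_1} \setminus X_i, \dots, \mathrm{Pa}_{Y_k} \setminus X_i$ • $X_i$ is independent of the rest of the network given its Markov blanket • Sample $x_i \sim P(X_i \mid \mathrm{Pa}_{X_i}, Y_1, \mathrm{Pa}_{Y_1} \setminus X_i, \dots, Y_k, \mathrm{Pa}_{Y_k} \setminus X_i) = \frac{1}{Z}\, P(X_i \mid \mathrm{Pa}_{X_i})\, P(Y_1 \mid \mathrm{Pa}_{Y_1}) \cdots P(Y_k \mid \mathrm{Pa}_{Y_k})$ • Product of $X_i$’s factor and the factors of its children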

35. Handling evidence • Simply set each evidence variable to its observed value and never resample it • Resulting walk approximates the distribution $P(X \setminus E \mid E=e)$ • Uses evidence more efficiently than likelihood weighting
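
A sketch of Gibbs sampling with clamped evidence on the burglary network, using the evidence Alarm=1 and MaryCalls=1 from the likelihood-weighting example. The CPT values are assumed textbook numbers that the slides do not state, and `gibbs_burglary` is a hypothetical name. Each non-evidence variable is redrawn from the normalized product of its own factor and its children's factors.

```python
import random

# Assumed CPTs (common textbook values; not given on the slides).
P_B, P_E = 0.001, 0.002
P_A = {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001}  # P(A=1 | B, E)
P_J = {1: 0.90, 0: 0.05}  # P(J=1 | A)

def prior(p, value):
    return p if value else 1 - p

def gibbs_burglary(n_steps, burn_in, rng):
    """Estimate P(B=1 | A=1, M=1) and P(J=1 | A=1, M=1) with A and M clamped."""
    b, e = 0, 0  # arbitrary initialization of the non-evidence variables
    b_hits = j_hits = kept = 0
    for step in range(n_steps):
        # B: its factor P(B) times the factor of its child A (clamped to 1).
        w1 = prior(P_B, 1) * P_A[(1, e)]
        w0 = prior(P_B, 0) * P_A[(0, e)]
        b = int(rng.random() < w1 / (w0 + w1))
        # E: its factor P(E) times the factor of its child A (clamped to 1).
        w1 = prior(P_E, 1) * P_A[(b, 1)]
        w0 = prior(P_E, 0) * P_A[(b, 0)]
        e = int(rng.random() < w1 / (w0 + w1))
        # J has no children, so its conditional is just P(J | A=1).
        j = int(rng.random() < P_J[1])
        if step >= burn_in:
            kept += 1
            b_hits += b
            j_hits += j
    return {"B": b_hits / kept, "J": j_hits / kept}

estimates = gibbs_burglary(60_000, 1_000, random.Random(7))
print(estimates)
```

M never appears in the updates because its only parent, A, is itself clamped, so M's factor is a constant that cancels in the normalization.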

36. Gibbs sampling issues • Demonstrating correctness & convergence requires examining Markov Chain random walk (more later) • Need to take many steps before the effects of poor initialization wear off (mixing time) • Difficult to tell how much is needed a priori • Numerous variants • Known as Markov Chain Monte Carlo techniques

37. Next time • Continuous and hybrid distributions