  1. CS 5331: Applied Machine Learning, Spring 2011. Probability Distributions. 1/26/2011. Mohan Sridharan

  2. Overview • Probability density estimation given a set of i.i.d. data observations: ill-posed problem! • Parametric methods: • Specific functional form with parameters to estimate. • Binomial and Multinomial distributions for discrete RVs. • Gaussian distribution for continuous RV. • Conjugate priors: exponential family of distributions. • Non-parametric methods: • Parameters control model complexity instead of functional form. • Histograms, Nearest-neighbors, Kernels.

  3. Parametric Distributions • Basic building blocks: p(x|θ). • Need to determine p(x|θ) given a data set D = {x_1, …, x_N}. • Representation: a point estimate θ* or a full distribution p(θ)?

  4. Binary Variables (1) • Coin flipping: heads = 1, tails = 0, with p(x = 1|μ) = μ. • Bernoulli Distribution: Bern(x|μ) = μ^x (1 − μ)^(1−x), with E[x] = μ and var[x] = μ(1 − μ).

  5. Parameter Estimation (1) • ML for Bernoulli. Given D = {x_1, …, x_N} with m heads: p(D|μ) = ∏_n μ^(x_n) (1 − μ)^(1−x_n). • Compute the log likelihood, set its derivative to zero: μ_ML = (1/N) Σ_n x_n = m/N.
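
A minimal sketch of this ML estimate in NumPy; the coin-flip data below are made up for illustration:

```python
import numpy as np

# Illustrative coin flips: 1 = heads, 0 = tails.
D = np.array([1, 0, 1, 1, 0, 1, 1, 0])

# Maximum-likelihood estimate mu_ML = (1/N) * sum_n x_n = m / N,
# i.e. the fraction of heads in the data set.
mu_ml = D.mean()
print(f"mu_ML = {mu_ml:.3f}")   # 5 heads out of 8 flips -> 0.625
```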

  6. Parameter Estimation (2) • Example: D = {1, 1, 1}, three heads in three tosses, gives μ_ML = 1. • Prediction: all future tosses will land heads up! • Overfitting to D.

  7. Binary Variables (2) • N coin flips: m heads and N − m tails. • Binomial Distribution: Bin(m|N, μ) = (N choose m) μ^m (1 − μ)^(N−m).
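
A quick way to evaluate this pmf, here for the illustrative values N = 10 and μ = 0.25 (a sketch using SciPy):

```python
from scipy.stats import binom

N, mu = 10, 0.25          # illustrative number of flips and heads probability
for m in range(N + 1):
    # Bin(m | N, mu) = C(N, m) * mu^m * (1 - mu)^(N - m)
    print(m, binom.pmf(m, N, mu))
```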

  8. Binomial Distribution

  9. Beta Distribution • Distribution over the mean μ ∈ [0, 1]: Beta(μ|a, b) = (Γ(a + b) / (Γ(a) Γ(b))) μ^(a−1) (1 − μ)^(b−1). • Hyper-parameters: a, b; E[μ] = a / (a + b).

  10. Bayesian Bernoulli The Beta distribution provides the conjugate prior for the Bernoulli distribution.
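
A sketch of the conjugate update: with a Beta(a, b) prior on μ and m heads and l tails observed, the posterior is Beta(a + m, b + l). The prior hyper-parameters and data below are illustrative assumptions:

```python
import numpy as np
from scipy.stats import beta

a, b = 2.0, 2.0                        # illustrative prior hyper-parameters
D = np.array([1, 1, 0, 1, 0, 1, 1])    # illustrative coin flips
m = int(D.sum())                       # number of heads
l = len(D) - m                         # number of tails

# Conjugacy: Beta prior x Bernoulli likelihood -> Beta posterior.
a_post, b_post = a + m, b + l
print(f"posterior: Beta({a_post}, {b_post})")
print("posterior mean of mu:", beta.mean(a_post, b_post))   # (a+m)/(a+m+b+l)
```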

  11. Beta Distribution

  12. Prior ∙ Likelihood = Posterior

  13. Properties of the Posterior • As the size of the data set (N) increases, the posterior mean approaches the ML estimate and the posterior variance shrinks toward zero. • Is this true for all of Bayesian learning?

  14. Prediction under the Posterior • What is the probability that the next coin toss will land heads up? • p(x = 1|D) = (m + a) / (m + a + l + b), where m and l are the observed numbers of heads and tails.
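
A tiny sketch of this predictive probability with illustrative counts and hyper-parameters:

```python
# Posterior predictive probability of heads under a Beta(a, b) prior
# after observing m heads and l tails (illustrative numbers).
a, b = 2.0, 2.0
m, l = 5, 2
p_heads = (m + a) / (m + a + l + b)
print(f"p(x = 1 | D) = {p_heads:.3f}")   # 7/11 ~ 0.636
```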

  15. Multinomial Variables • 1-of-K coding scheme: x = (0, 0, 1, 0, 0, 0)ᵀ with Σ_k x_k = 1. • p(x|μ) = ∏_k μ_k^(x_k), with μ_k ≥ 0 and Σ_k μ_k = 1.

  16. ML Parameter estimation • Given a data set D of N independent observations with counts m_k for each outcome, the log likelihood is Σ_k m_k ln μ_k. • Ensure Σ_k μ_k = 1 using a Lagrange multiplier λ: the solution is μ_k^ML = m_k / N.
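
A minimal sketch of the constrained ML solution μ_k = m_k / N; the 1-of-K observations below are made up:

```python
import numpy as np

# Illustrative 1-of-K observations (one row per data point, K = 3).
X = np.array([[0, 0, 1],
              [0, 1, 0],
              [0, 0, 1],
              [1, 0, 0],
              [0, 0, 1]])

m_k = X.sum(axis=0)          # counts m_k for each outcome
mu_ml = m_k / X.shape[0]     # mu_k = m_k / N, sums to 1 by construction
print("counts:", m_k, " mu_ML:", mu_ml)
```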

  17. The Multinomial Distribution: Mult(m_1, …, m_K | μ, N) = (N! / (m_1! ⋯ m_K!)) ∏_k μ_k^(m_k), with Σ_k m_k = N.

  18. The Dirichlet Distribution • Conjugate prior for the multinomial distribution: Dir(μ|α) = (Γ(α_0) / (Γ(α_1) ⋯ Γ(α_K))) ∏_k μ_k^(α_k − 1), where α_0 = Σ_k α_k.
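
A sketch of Dirichlet-multinomial conjugacy: with a Dir(α) prior and observed counts m_k, the posterior is Dir(α + m). The hyper-parameters and counts below are illustrative:

```python
import numpy as np

alpha = np.array([1.0, 1.0, 1.0])    # illustrative prior hyper-parameters
m_k = np.array([3, 1, 1])            # illustrative observed 1-of-K counts

# Conjugacy: Dirichlet prior x multinomial likelihood -> Dirichlet posterior.
alpha_post = alpha + m_k
print("posterior Dirichlet parameters:", alpha_post)
print("posterior mean E[mu_k | D]:", alpha_post / alpha_post.sum())

# A few samples of mu drawn from the posterior.
rng = np.random.default_rng(0)
print(rng.dirichlet(alpha_post, size=3))
```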

  19. Bayesian Multinomial (1)

  20. The Gaussian Distribution: N(x|μ, σ²) = (1 / √(2πσ²)) exp(−(x − μ)² / (2σ²)); in D dimensions, N(x|μ, Σ) = (2π)^(−D/2) |Σ|^(−1/2) exp(−½ (x − μ)ᵀ Σ⁻¹ (x − μ)).

  21. Central Limit Theorem • The distribution of the sum of N i.i.d. random variables becomes increasingly Gaussian as N grows. • Example: N uniform [0,1] random variables.
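
A quick numerical check of the slide's example: the mean of N uniform [0, 1] variables has mean 0.5 and variance 1/(12N), and its distribution looks increasingly Gaussian as N grows (sample sizes below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
for N in (1, 2, 10):
    # Means of N i.i.d. uniform [0, 1] variables, repeated 100000 times.
    means = rng.uniform(0.0, 1.0, size=(100_000, N)).mean(axis=1)
    print(f"N={N:2d}  sample var={means.var():.4f}  predicted var={1/(12*N):.4f}")
```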

  22. Bayesian Multinomial (2)

  23. Geometry of the Multivariate Gaussian (Section 2.3, PRML)

  24. Moments of the Multivariate Gaussian (1) • First moment: E[x] = μ; the term linear in z = x − μ integrates to zero thanks to the anti-symmetry of z.

  25. Moments of the Multivariate Gaussian (2) • Second-order moment: E[x xᵀ] = μ μᵀ + Σ, so cov[x] = Σ.

  26. Maximum Likelihood for the Gaussian (1) • Sections 2.3.1–2.3.3: conditional and marginal Gaussians. • Given i.i.d. data X = {x_1, …, x_N}, the log likelihood function is: ln p(X|μ, Σ) = −(ND/2) ln(2π) − (N/2) ln|Σ| − ½ Σ_n (x_n − μ)ᵀ Σ⁻¹ (x_n − μ). • Sufficient statistics: Σ_n x_n and Σ_n x_n x_nᵀ.

  27. Maximum Likelihood for the Gaussian (2) • Set the derivative of the log likelihood function with respect to μ to zero: • To obtain: μ_ML = (1/N) Σ_n x_n. • Similarly: Σ_ML = (1/N) Σ_n (x_n − μ_ML)(x_n − μ_ML)ᵀ.

  28. Maximum Likelihood for the Gaussian (3) • Under the true distribution, E[μ_ML] = μ but E[Σ_ML] = ((N − 1)/N) Σ, so the ML covariance is biased. • Hence define the unbiased estimate Σ̃ = (1/(N − 1)) Σ_n (x_n − μ_ML)(x_n − μ_ML)ᵀ.
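
A sketch of the ML estimates μ_ML and Σ_ML from slide 27, with the bias-corrected covariance above included as a comment; the 2-D data are synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 2-D Gaussian data (illustrative mean and covariance).
X = rng.normal(size=(500, 2)) @ np.array([[1.0, 0.3], [0.0, 0.5]]) + np.array([2.0, -1.0])

N = X.shape[0]
mu_ml = X.mean(axis=0)                   # (1/N) * sum_n x_n
Z = X - mu_ml
sigma_ml = (Z.T @ Z) / N                 # ML covariance (biased by (N-1)/N)
sigma_unbiased = (Z.T @ Z) / (N - 1)     # bias-corrected estimate

print("mu_ML:", mu_ml)
print("Sigma_ML:\n", sigma_ml)
```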

  29. Sequential Estimation • Contribution of the Nth data point, x_N: μ_ML^(N) = μ_ML^(N−1) + (1/N)(x_N − μ_ML^(N−1)), i.e. the old estimate plus a correction given x_N, weighted by 1/N.
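
A sketch of this sequential update, which reaches the batch mean without storing the data (the stream below is synthetic):

```python
import numpy as np

rng = np.random.default_rng(1)
stream = rng.normal(loc=3.0, scale=1.0, size=1000)   # illustrative data stream

mu = 0.0
for n, x_n in enumerate(stream, start=1):
    mu += (x_n - mu) / n          # old estimate + correction given x_N, weight 1/N
print("sequential estimate:", mu)
print("batch mean:        ", stream.mean())
```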

  30. Bayesian Inference for the Gaussian (1) • Assume the variance σ² is known. Given data D = {x_1, …, x_N}, the likelihood function is given by: p(D|μ) = ∏_n N(x_n|μ, σ²). • This has the form of the exponential of a quadratic in μ, but it is not a distribution over μ.

  31. Bayesian Inference for the Gaussian (2) • Combined with a Gaussian prior: p(μ) = N(μ|μ_0, σ_0²). • Gives the posterior: p(μ|D) ∝ p(D|μ) p(μ). • Completing the square in the exponent: p(μ|D) = N(μ|μ_N, σ_N²).

  32. Bayesian Inference for the Gaussian (3) • Where: μ_N = (σ² μ_0 + N σ_0² μ_ML) / (N σ_0² + σ²) and 1/σ_N² = 1/σ_0² + N/σ². • Note: as N → ∞, μ_N → μ_ML and σ_N² → 0; for N = 0 we recover the prior.

  33. Bayesian Inference for the Gaussian (4) • Example: for N = 0, 1, 2 and 10.
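
A sketch reproducing this kind of example numerically with the updates from the previous slide, μ_N = (σ² μ_0 + N σ_0² μ_ML) / (N σ_0² + σ²) and 1/σ_N² = 1/σ_0² + N/σ²; the true mean, known variance, and prior are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2 = 0.1 ** 2                 # known data variance (illustrative)
mu0, sigma0_2 = 0.0, 0.1 ** 2     # illustrative prior N(mu | mu0, sigma0^2)
data = rng.normal(loc=0.8, scale=np.sqrt(sigma2), size=10)

for N in (0, 1, 2, 10):
    x = data[:N]
    mu_ml = x.mean() if N > 0 else 0.0
    mu_N = (sigma2 * mu0 + N * sigma0_2 * mu_ml) / (N * sigma0_2 + sigma2)
    var_N = 1.0 / (1.0 / sigma0_2 + N / sigma2)   # posterior variance shrinks with N
    print(f"N={N:2d}  mu_N={mu_N:.3f}  sigma_N^2={var_N:.5f}")
```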

  34. Bayesian Inference for the Gaussian (5) • Sequential estimation: • The posterior obtained after observing N-1 data points becomes the prior when we observe the Nth data point.

  35. Bayesian Inference for the Gaussian (6) • Now assume the mean μ is known and we need to estimate the precision λ = 1/σ². • The likelihood p(D|λ) ∝ λ^(N/2) exp(−(λ/2) Σ_n (x_n − μ)²) has a Gamma shape as a function of λ.

  36. Bayesian Inference for the Gaussian (7) • The Gamma distribution: Gam(λ|a, b) = (1/Γ(a)) b^a λ^(a−1) exp(−bλ), with E[λ] = a/b and var[λ] = a/b².

  37. Bayesian Inference for the Gaussian (8) • Now we combine a Gamma prior Gam(λ|a_0, b_0) with the likelihood function to obtain: p(λ|D) ∝ λ^(a_0 − 1 + N/2) exp(−λ (b_0 + ½ Σ_n (x_n − μ)²)). • Which is the same as Gam(λ|a_N, b_N) with a_N = a_0 + N/2 and b_N = b_0 + ½ Σ_n (x_n − μ)².
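
A sketch of this conjugate update for the precision, with an illustrative Gamma prior and synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = 0.0                                          # known mean
data = rng.normal(loc=mu, scale=2.0, size=200)    # true precision lambda = 0.25

a0, b0 = 1.0, 1.0                                 # illustrative Gamma prior on lambda
a_N = a0 + len(data) / 2.0                        # a_N = a0 + N/2
b_N = b0 + 0.5 * np.sum((data - mu) ** 2)         # b_N = b0 + (1/2) sum (x_n - mu)^2

print(f"posterior: Gam(lambda | {a_N:.1f}, {b_N:.1f})")
print("posterior mean E[lambda | D]:", a_N / b_N)   # should be near 0.25
```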

  38. Bayesian Inference for the Gaussian (9) • If both mean and precision are unknown, the joint likelihood function is given by: p(D|μ, λ) = ∏_n (λ/(2π))^(1/2) exp(−(λ/2)(x_n − μ)²). • We need a prior with the same functional dependence on μ and λ.

  39. Bayesian Inference for the Gaussian (10) • The Gaussian-gamma distribution: p(μ, λ) = N(μ|μ_0, (βλ)⁻¹) Gam(λ|a, b). • Its exponent is quadratic in the mean and linear in the precision; the remaining factor is a Gamma distribution over the precision, independent of the mean.

  40. Bayesian Inference for the Gaussian (11) • The Gaussian-gamma distribution:

  41. Bayesian Inference for the Gaussian (12) • Multivariate conjugate priors. • Mean unknown, precision known: Gaussian prior. • Precision unknown, mean known: Wishart prior. • Mean and precision unknown: Gaussian-Wishart prior.

  42. Student’s t-Distribution • St(x|μ, λ, ν) = ∫₀^∞ N(x|μ, (ηλ)⁻¹) Gam(η|ν/2, ν/2) dη, where ν is the number of degrees of freedom. • An infinite mixture of Gaussians with the same mean but different precisions.

  43. Student’s t-Distribution

  44. Student’s t-Distribution • Robustness to outliers: Gaussian vs. t-distribution.
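
A small sketch of this robustness comparison: fit a Gaussian and a Student's t to data containing a few gross outliers and compare the fitted locations (SciPy's generic maximum-likelihood fit is used; the data are made up):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
clean = rng.normal(loc=0.0, scale=1.0, size=300)
data = np.concatenate([clean, [25.0, 30.0, 35.0]])   # add a few gross outliers

# Gaussian ML fit: the location is the sample mean, dragged towards the outliers.
mu_gauss, sigma_gauss = stats.norm.fit(data)

# Student's t ML fit: heavy tails largely discount the outliers.
df_t, mu_t, scale_t = stats.t.fit(data)

print("Gaussian location: ", mu_gauss)
print("Student-t location:", mu_t)
```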

  45. Periodic variables • Examples: calendar time, direction, etc. • We require: p(θ) ≥ 0, ∫₀^{2π} p(θ) dθ = 1, and p(θ + 2π) = p(θ).

  46. von Mises Distribution (1) • This requirement is satisfied by: p(θ|µ_0, m) = (1/(2π I_0(m))) exp(m cos(θ − µ_0)). • Where I_0(m) is the 0th-order modified Bessel function of the 1st kind.

  47. von Mises Distribution (4)

  48. Maximum Likelihood for von Mises • Given a data set D = {θ_1, …, θ_N}, the log likelihood function is given by: ln p(D|µ_0, m) = −N ln(2π) − N ln I_0(m) + m Σ_n cos(θ_n − µ_0). • Maximizing with respect to µ_0 we directly obtain: µ_0^ML = tan⁻¹(Σ_n sin θ_n / Σ_n cos θ_n). • Similarly, maximizing with respect to m we get: A(m_ML) = (1/N) Σ_n cos(θ_n − µ_0^ML), where A(m) = I_1(m)/I_0(m), which can be solved numerically for m_ML.
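
A sketch of these estimates: the mean direction comes from the sine and cosine sums, and m_ML is found by solving A(m) = I_1(m)/I_0(m) = r̄ numerically; the angles below are synthetic, and the bracketing interval for the root finder is an arbitrary choice:

```python
import numpy as np
from scipy.special import i0e, i1e      # exponentially scaled Bessel functions; i1e/i0e = I1/I0
from scipy.optimize import brentq

rng = np.random.default_rng(0)
theta = rng.vonmises(mu=1.0, kappa=4.0, size=500)   # illustrative angles

# ML mean direction: mu0 = atan2(sum sin(theta_n), sum cos(theta_n)).
mu0_ml = np.arctan2(np.sin(theta).sum(), np.cos(theta).sum())

# ML concentration: solve A(m) = I1(m)/I0(m) = r_bar numerically.
r_bar = np.mean(np.cos(theta - mu0_ml))
m_ml = brentq(lambda m: i1e(m) / i0e(m) - r_bar, 1e-6, 1e6)

print("mu0_ML:", mu0_ml, " m_ML:", m_ml)
```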

  49. Mixtures of Gaussians (1) • Old Faithful data set: a single Gaussian fit vs. a mixture of two Gaussians.

  50. Mixtures of Gaussians (2) • Combine simple models into a complex model: p(x) = Σ_{k=1}^{K} π_k N(x|μ_k, Σ_k), with components N(x|μ_k, Σ_k) and mixing coefficients π_k (K = 3 in the figure).
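
A minimal sketch of evaluating such a mixture density for K = 3 illustrative one-dimensional components:

```python
import numpy as np
from scipy.stats import norm

# Illustrative K = 3 mixture: mixing coefficients and component parameters.
pi = np.array([0.5, 0.3, 0.2])      # mixing coefficients, sum to 1
mu = np.array([-2.0, 0.0, 3.0])
sigma = np.array([0.5, 1.0, 0.8])

def mixture_pdf(x):
    """p(x) = sum_k pi_k * N(x | mu_k, sigma_k^2), evaluated pointwise."""
    x = np.asarray(x, dtype=float)
    return np.sum(pi * norm.pdf(x[..., None], loc=mu, scale=sigma), axis=-1)

xs = np.linspace(-5.0, 6.0, 5)
print(mixture_pdf(xs))
```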
