
LECTURE 06: MAXIMUM LIKELIHOOD AND BAYESIAN ESTIMATION


Presentation Transcript


  1. LECTURE 06: MAXIMUM LIKELIHOOD AND BAYESIAN ESTIMATION • Objectives: Bias in ML Estimates; Bayesian Estimation; Example • Resources: D.H.S.: Chapter 3 (Part 2); Wiki: Maximum Likelihood; M.Y.: Maximum Likelihood Tutorial; J.O.S.: Bayesian Parameter Estimation; J.H.: Euro Coin • Audio: • URL:

  2. Gaussian Case: Unknown Mean (Review) • Consider the case where only the mean, θ = μ, is unknown. The log likelihood of a single point is ln p(x_k|μ) = -(1/2) ln(2πσ²) - (x_k - μ)²/(2σ²), which implies: ∂/∂μ ln p(x_k|μ) = (x_k - μ)/σ².

  3. Gaussian Case: Unknown Mean (Review) • Substituting into the expression for the total likelihood and setting the gradient to zero: Σ_k (x_k - μ̂)/σ² = 0. • Rearranging terms gives the ML estimate: μ̂ = (1/n) Σ_k x_k. • Significance: the maximum likelihood estimate of the mean is simply the sample mean of the training data.
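
A minimal numerical sketch of this result (synthetic data, seed, and parameter values are my own choices, not from the slides): for samples drawn from a Gaussian with known variance, the sample mean maximizes the log likelihood.

```python
import numpy as np

rng = np.random.default_rng(0)
mu_true, sigma = 2.0, 1.5                  # assumed "true" parameters for the simulation
x = rng.normal(mu_true, sigma, size=1000)

# ML estimate of the mean: the sample mean (the closed-form result on the slide)
mu_ml = x.mean()

# Sanity check: the log likelihood evaluated on a grid peaks at (about) the sample mean
grid = np.linspace(mu_true - 1.0, mu_true + 1.0, 2001)
loglik = np.array([np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                          - (x - m) ** 2 / (2 * sigma**2)) for m in grid])
print(mu_ml, grid[loglik.argmax()])        # the two values should agree closely
```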

  4. Gaussian Case: Unknown Mean and Variance (Review) • Let θ = [μ, σ²]. The log likelihood of a SINGLE point is: ln p(x_k|θ) = -(1/2) ln(2πσ²) - (x_k - μ)²/(2σ²). • The full likelihood leads to the conditions: Σ_k (x_k - μ̂)/σ̂² = 0 and -Σ_k 1/(2σ̂²) + Σ_k (x_k - μ̂)²/(2σ̂⁴) = 0.

  5. Gaussian Case: Unknown Mean and Variance (Review) • This leads to these equations: μ̂ = (1/n) Σ_k x_k and σ̂² = (1/n) Σ_k (x_k - μ̂)². • In the multivariate case: μ̂ = (1/n) Σ_k x_k and Σ̂ = (1/n) Σ_k (x_k - μ̂)(x_k - μ̂)ᵀ. • The true covariance is the expected value of the matrix (x - μ)(x - μ)ᵀ, which is a familiar result.
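
A small sketch (assumed synthetic data, not from the slides) computing these ML estimates. Note the 1/n normalization of the covariance, which is what makes the estimate biased, as discussed on the following slides.

```python
import numpy as np

rng = np.random.default_rng(1)
mu_true = np.array([1.0, -2.0])                     # assumed true parameters
cov_true = np.array([[2.0, 0.5], [0.5, 1.0]])
X = rng.multivariate_normal(mu_true, cov_true, size=500)   # shape (n, d)

n = X.shape[0]
mu_ml = X.mean(axis=0)                              # ML estimate of the mean
diff = X - mu_ml
cov_ml = diff.T @ diff / n                          # ML estimate of the covariance (1/n, biased)
print(mu_ml, cov_ml, sep="\n")
```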

  6. Convergence of the Mean (Review) • Does the maximum likelihood estimate of the variance converge to the true value of the variance? Let’s start with a few simple results we will need later. • Expected value of the ML estimate of the mean: E[μ̂] = (1/n) Σ_k E[x_k] = μ, so the estimate of the mean is unbiased.

  7. Variance of the ML Estimate of the Mean (Review) • The expected value of x_i x_j will be μ² for i ≠ j, since the two random variables are independent. • The expected value of x_i² will be μ² + σ². • Hence, in the summation above, we have n² - n terms with expected value μ² and n terms with expected value μ² + σ². • Thus E[μ̂²] = (1/n²)[(n² - n)μ² + n(μ² + σ²)] = μ² + σ²/n, which implies: Var[μ̂] = E[μ̂²] - (E[μ̂])² = σ²/n. • We see that the variance of the estimate goes to zero as n goes to infinity, and our estimate converges to the true mean (the error goes to zero).
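
A quick Monte Carlo check of this result (sample sizes, seed, and parameters are my own choices, not from the lecture): the empirical variance of the sample mean should track σ²/n.

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma = 0.0, 2.0
for n in (10, 100, 1000):
    # 5000 repeated experiments, each producing one ML estimate of the mean
    means = rng.normal(mu, sigma, size=(5000, n)).mean(axis=1)
    print(n, means.var(), sigma**2 / n)   # empirical variance vs. theoretical sigma^2 / n
```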

  8. Variance Relationships • We will need one more result: var[x_i] = E[x_i²] - (E[x_i])² = σ². Note that this implies: E[x_i²] = σ² + μ². • Now we can combine these results. Recall our expression for the ML estimate of the variance: σ̂² = (1/n) Σ_k (x_k - μ̂)².

  9. Covariance Expansion • Expand the covariance and simplify: • One more intermediate term to derive:

  10. Biased Variance Estimate • Substitute our previously derived expression for the second term:

  11. Expectation Simplification • Therefore, the ML estimate is biased: E[σ̂²] = ((n - 1)/n) σ² ≠ σ². However, the ML estimate converges (its mean-squared error goes to zero). • An unbiased estimator is: s² = (1/(n - 1)) Σ_k (x_k - μ̂)². • These are related by: σ̂² = ((n - 1)/n) s², which is asymptotically unbiased. See Burl, AJWills and AWM for excellent examples and explanations of the details of this derivation.
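
A short simulation sketch (synthetic data of my own, not from the slides) illustrating the bias: averaged over many trials, the 1/n estimate comes out near ((n-1)/n)σ², while the 1/(n-1) estimate comes out near σ².

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma, n, trials = 0.0, 3.0, 5, 200_000
X = rng.normal(mu, sigma, size=(trials, n))

mu_hat = X.mean(axis=1, keepdims=True)
var_ml = np.mean((X - mu_hat) ** 2, axis=1)                   # 1/n  (biased ML estimate)
var_unbiased = np.sum((X - mu_hat) ** 2, axis=1) / (n - 1)    # 1/(n-1)  (unbiased)

print(var_ml.mean(), (n - 1) / n * sigma**2)    # ~ (n-1)/n * sigma^2
print(var_unbiased.mean(), sigma**2)            # ~ sigma^2
```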

  12. Introduction to Bayesian Parameter Estimation • In Chapter 2, we learned how to design an optimal classifier if we knew the prior probabilities, P(ωi), and class-conditional densities, p(x|ωi). • Bayes: treat the parameters as random variables having some known prior distribution. Observations of samples convert this to a posterior. • Bayesian learning: sharpen the a posteriori density, causing it to peak near the true value. • Supervised vs. unsupervised: do we know the class assignments of the training data? • Bayesian estimation and ML estimation produce very similar results in many cases. • Reduces statistical inference (prior knowledge or beliefs about the world) to probabilities.

  13. Class-Conditional Densities • Posterior probabilities, P(ωi|x), are central to Bayesian classification. • Bayes formula allows us to compute P(ωi|x) from the priors, P(ωi), and the likelihood, p(x|ωi). • But what if the priors and class-conditional densities are unknown? • The answer is that we can compute the posterior, P(ωi|x), using all of the information at our disposal (e.g., training data). • For a training set, D, Bayes formula becomes: P(ωi|x, D) = p(x|ωi, D)P(ωi|D) / Σ_j p(x|ωj, D)P(ωj|D). • We assume priors are known: P(ωi|D) = P(ωi). • Also, assume functional independence: the samples in Di have no influence on p(x|ωj, Dj) for i ≠ j. • This gives: P(ωi|x, D) = p(x|ωi, Di)P(ωi) / Σ_j p(x|ωj, Dj)P(ωj).
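
To make the formula concrete, here is a minimal sketch in which the class densities, priors, and test point are all assumed values for illustration: the class posteriors are evaluated from per-class Gaussian densities estimated on each Di.

```python
import numpy as np
from scipy.stats import norm

# Assumed per-class training sets D_i and priors (illustrative values only)
D = {0: np.array([1.1, 0.8, 1.4, 0.9]), 1: np.array([3.2, 2.9, 3.6, 3.1])}
prior = {0: 0.5, 1: 0.5}

x = 2.0                                         # test point
likelihood = {}
for c, Dc in D.items():
    mu_c, sigma_c = Dc.mean(), Dc.std(ddof=0)   # plug-in density p(x | w_c, D_c)
    likelihood[c] = norm.pdf(x, mu_c, sigma_c)

evidence = sum(likelihood[c] * prior[c] for c in D)
posterior = {c: likelihood[c] * prior[c] / evidence for c in D}
print(posterior)                                # P(w_c | x, D) for each class
```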

  14. The Parameter Distribution • Assume the parametric form of the evidence, p(x), is known: p(x|θ). • Any information we have about θ prior to collecting samples is contained in a known prior density p(θ). • Observation of samples converts this to a posterior, p(θ|D), which we hope is peaked around the true value of θ. • Our goal is to estimate a parameter vector, θ. • We can write the joint distribution of the data as a product: p(D|θ) = ∏_k p(x_k|θ), because the samples are drawn independently. • This links the class-conditional density to the posterior, p(θ|D). But numerical solutions are typically required!
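
A brief sketch of the product form (synthetic data and parameter values are assumptions of mine): in practice the product is evaluated as a sum of log densities to avoid numerical underflow.

```python
import numpy as np

rng = np.random.default_rng(4)
D = rng.normal(1.0, 1.0, size=50)        # assumed training set

def log_likelihood(D, mu, sigma):
    """log p(D | theta) = sum_k log p(x_k | theta) for a univariate Gaussian."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - (D - mu) ** 2 / (2 * sigma**2))

print(log_likelihood(D, mu=1.0, sigma=1.0))
print(log_likelihood(D, mu=3.0, sigma=1.0))   # a poor parameter choice scores lower
```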

  15. Univariate Gaussian Case • Case: only the mean is unknown, p(x|μ) ~ N(μ, σ²). • Known prior density: p(μ) ~ N(μ0, σ0²). • Using Bayes formula: p(μ|D) = p(D|μ)p(μ) / ∫ p(D|μ)p(μ) dμ = α ∏_k p(x_k|μ) p(μ). • Rationale: Once a value of μ is known, the density for x is completely known. α is a normalization factor that depends on the data, D.

  16. Univariate Gaussian Case • Applying our Gaussian assumptions:

  17. Univariate Gaussian Case (Cont.) • Now we need to work this into a simpler form:

  18. Univariate Gaussian Case (Cont.) • p(μ|D) is an exponential of a quadratic function, which makes it a normal distribution. Because this is true for any n, it is referred to as a reproducing density. • p(μ) is referred to as a conjugate prior. • Write p(μ|D) ~ N(μn, σn²): • Expand the quadratic term: • Equate coefficients of our two functions:

  19. Univariate Gaussian Case (Cont.) • Rearrange terms so that the dependencies on μ are clear: • Associate terms related to μ² and μ: • There is actually a third equation involving terms not related to μ, • but we can ignore this since it is not a function of μ and is a complicated equation to solve.

  20. Univariate Gaussian Case (Cont.) • Two equations and two unknowns. Solve for μn and σn². First, solve for σn²: σn² = σ0²σ² / (nσ0² + σ²). • Next, solve for μn: μn = (nσ0² / (nσ0² + σ²)) μ̂n + (σ² / (nσ0² + σ²)) μ0, where μ̂n is the sample mean of the n observations. • Summarizing: the posterior mean μn is a weighted combination of the sample mean and the prior mean μ0, and the posterior variance σn² shrinks as n grows.
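
A compact sketch of this closed-form update (prior hyperparameters, seed, and data are assumed for illustration):

```python
import numpy as np

def gaussian_mean_posterior(x, mu0, sigma0_sq, sigma_sq):
    """Posterior N(mu_n, sigma_n^2) over the unknown mean, with known data variance sigma_sq."""
    n = len(x)
    xbar = np.mean(x)
    sigma_n_sq = sigma0_sq * sigma_sq / (n * sigma0_sq + sigma_sq)
    mu_n = (n * sigma0_sq * xbar + sigma_sq * mu0) / (n * sigma0_sq + sigma_sq)
    return mu_n, sigma_n_sq

rng = np.random.default_rng(5)
x = rng.normal(2.0, 1.0, size=20)          # assumed observations with true mean 2.0
print(gaussian_mean_posterior(x, mu0=0.0, sigma0_sq=4.0, sigma_sq=1.0))
```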

  21. Bayesian Learning • μn represents our best guess after n samples. • σn² represents our uncertainty about this guess. • σn² approaches σ²/n for large n – each additional observation decreases our uncertainty. • The posterior, p(μ|D), becomes more sharply peaked as n grows large. This is known as Bayesian learning.
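
Continuing the sketch above (same assumed hyperparameters), the posterior can be tracked as samples accumulate to see the guess converge and the uncertainty approach σ²/n:

```python
import numpy as np

rng = np.random.default_rng(6)
mu0, sigma0_sq, sigma_sq = 0.0, 4.0, 1.0        # assumed prior and known data variance
x = rng.normal(2.0, np.sqrt(sigma_sq), size=1000)

for n in (1, 10, 100, 1000):
    xbar = x[:n].mean()
    mu_n = (n * sigma0_sq * xbar + sigma_sq * mu0) / (n * sigma0_sq + sigma_sq)
    sigma_n_sq = sigma0_sq * sigma_sq / (n * sigma0_sq + sigma_sq)
    print(n, mu_n, sigma_n_sq, sigma_sq / n)    # the guess converges; uncertainty ~ sigma^2 / n
```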

  22. “The Euro Coin” • Getting ahead a bit, let’s see how we can put these ideas to work on a simple example due to David MacKay, and explained by Jon Hamaker.
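
As a preview of that example, here is a hedged sketch of a Bayesian treatment of coin spins with a uniform Beta(1,1) prior on the heads probability. The counts (140 heads in 250 spins) are the figures commonly quoted for MacKay's Euro-coin example, not values taken from these slides.

```python
from scipy.stats import beta

# Beta(1,1) = uniform prior on the heads probability; counts assumed from MacKay's example
heads, tails = 140, 110
posterior = beta(1 + heads, 1 + tails)

print(posterior.mean())            # posterior mean of the heads probability (~0.56)
print(posterior.interval(0.95))    # 95% credible interval
print(1 - posterior.cdf(0.5))      # posterior probability that the coin favours heads
```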

  23. Summary • Review of maximum likelihood parameter estimation in the Gaussian case, with an emphasis on convergence and bias of the estimates. • Introduction of Bayesian parameter estimation. • The role of the class-conditional distribution in a Bayesian estimate. • Estimation of the posterior and probability density function assuming the only unknown parameter is the mean, and the conditional density of the “features” given the mean, p(x|μ), can be modeled as a Gaussian distribution.
