
Pattern Recognition

Bayesian Decision Theory & ML Estimation. A fundamental statistical approach to the classification problem.

Presentation Transcript


  1. Bayesian Decision Theory & ML Estimation Pattern Recognition

  2. Bayesian Decision Theory

  3. Bayesian Decision Theory • Fundamental statistical approach to the classification problem. • Quantifies the tradeoffs between various classification decisions using probabilities and the costs associated with such decisions. • Each action is associated with a cost or risk. • The simplest risk is the classification error. • Design classifiers to recommend actions that minimize some total expected risk.

  4. Terminology (using the sea bass – salmon classification example) • State of nature ω (random variable): • ω1 for sea bass, ω2 for salmon. • Probabilities P(ω1) and P(ω2) (priors): • prior knowledge of how likely it is to get a sea bass or a salmon. • Probability density function p(x) (evidence): • how frequently we will measure a pattern with feature value x (e.g., x is a lightness measurement). Note: if x and y are different measurements, p(x) and p(y) correspond to different pdfs: pX(x) and pY(y).

  5. Terminology (cont’d) (using the sea bass – salmon classification example) • Conditional probability density p(x/ωj) (likelihood): • how frequently we will measure a pattern with feature value x given that the pattern belongs to class ωj, e.g., the lightness distributions of the salmon and sea bass populations.

  6. Terminology (cont’d) (using the sea bass – salmon classification example) • Conditional probability P(ωj/x) (posterior): • the probability that the fish belongs to class ωj given measurement x. Note: we will be using an uppercase P(.) to denote a probability mass function (pmf) and a lowercase p(.) to denote a probability density function (pdf).

  7. Decision Rule Using Priors Only Decide ω1 if P(ω1) > P(ω2); otherwise decide ω2. P(error) = min[P(ω1), P(ω2)] • Favours the most likely class (optimum if no other info is available). • This rule would be making the same decision every time! • Makes sense to use for judging just one fish.

  8. Decision Rule Using Conditional pdf • Using Bayes’ rule, the posterior probability of category ωj given measurement x is given by: P(ωj/x) = p(x/ωj)P(ωj) / p(x), where p(x) = Σj p(x/ωj)P(ωj) (the evidence acts as a scale factor so that the posteriors sum to 1). Decide ω1 if P(ω1/x) > P(ω2/x); otherwise decide ω2, or equivalently: Decide ω1 if p(x/ω1)P(ω1) > p(x/ω2)P(ω2); otherwise decide ω2.
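
A small numerical sketch of this rule (not from the original slides): the Gaussian lightness models and the priors below are illustrative assumptions for the sea bass / salmon example.

```python
# Sketch of the Bayes decision rule P(wj/x) = p(x/wj)P(wj) / p(x).
import numpy as np
from scipy.stats import norm

priors = {"sea_bass": 0.6, "salmon": 0.4}        # P(w1), P(w2) (assumed)
likelihoods = {                                  # p(x/wj): assumed Gaussian lightness models
    "sea_bass": norm(loc=7.0, scale=1.5),
    "salmon":   norm(loc=4.0, scale=1.0),
}

def posteriors(x):
    """Return P(wj/x) for each class via Bayes' rule."""
    joint = {c: likelihoods[c].pdf(x) * priors[c] for c in priors}
    evidence = sum(joint.values())               # p(x) = sum_j p(x/wj) P(wj)
    return {c: joint[c] / evidence for c in joint}

def decide(x):
    """Decide w1 if p(x/w1)P(w1) > p(x/w2)P(w2); otherwise decide w2."""
    post = posteriors(x)
    return max(post, key=post.get)

print(decide(4.5))   # lightness near the salmon mean   -> 'salmon'
print(decide(7.5))   # lightness near the sea bass mean -> 'sea_bass'
```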

  9. Decision Rule Using Conditional pdf (cont’d)

  10. Probability of Error • The probability of error for a particular x is defined as: P(error/x) = P(ω1/x) if we decide ω2, and P(error/x) = P(ω2/x) if we decide ω1. • The average probability of error is given by: P(error) = ∫ P(error/x) p(x) dx • The Bayes rule is optimum, that is, it minimizes the average probability of error, since for every x: P(error/x) = min[P(ω1/x), P(ω2/x)]
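
As a rough numerical check of the last formula (not from the slides), the Bayes error can be integrated for two 1-D Gaussian class-conditionals; all parameter values below are illustrative.

```python
# Numerical estimate of P(error) = ∫ min[P(w1/x), P(w2/x)] p(x) dx
#                                = ∫ min[p(x/w1)P(w1), p(x/w2)P(w2)] dx
import numpy as np
from scipy.stats import norm

P1, P2 = 0.6, 0.4                           # assumed priors
c1, c2 = norm(7.0, 1.5), norm(4.0, 1.0)     # assumed class-conditional densities

x = np.linspace(-5.0, 20.0, 20001)
joint1, joint2 = c1.pdf(x) * P1, c2.pdf(x) * P2
bayes_error = np.trapz(np.minimum(joint1, joint2), x)   # integrate the pointwise minimum
print(f"Bayes error ≈ {bayes_error:.4f}")
```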

  11. Discriminant Functions • Functional structure of a general statistical classifier: Assign x to ωi if gi(x) > gj(x) for all j ≠ i, where the gi(x) are the discriminant functions (i.e., pick the class whose discriminant is maximum).

  12. Discriminants for Bayes Classifier • Using risks: gi(x) = −R(αi/x), where R(αi/x) = Σj λ(αi/ωj) P(ωj/x) is the conditional risk of taking action αi. • Using the zero-one loss function (i.e., minimum error rate): gi(x) = P(ωi/x) • Is the choice of gi unique? • Replacing gi(x) with f(gi(x)), where f(·) is monotonically increasing, does not change the classification results.
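
A minimal sketch of the risk-based discriminant (not from the slides); the loss matrix λ and the posteriors used here are illustrative assumptions.

```python
# g_i(x) = -R(a_i/x), with R(a_i/x) = sum_j lambda(a_i/w_j) P(w_j/x).
import numpy as np

lam = np.array([[0.0, 2.0],    # row i = action a_i (decide w_i); lam[i, j] = cost when truth is w_j
                [1.0, 0.0]])
posterior = np.array([0.3, 0.7])            # assumed P(w1/x), P(w2/x) for some x

risk = lam @ posterior                      # conditional risk R(a_i/x) of each action
g = -risk                                   # discriminants
print("decide w%d" % (np.argmax(g) + 1))    # -> w2

# With the zero-one loss, g_i(x) = P(w_i/x) gives the same decision:
zero_one = np.ones((2, 2)) - np.eye(2)
assert np.argmax(-(zero_one @ posterior)) == np.argmax(posterior)
```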

  13. Decision Regions and Boundaries • Decision rules divide the feature space into decision regions R1, R2, …, Rc. • The boundaries of the decision regions are the decision boundaries: g1(x) = g2(x) at the decision boundaries.

  14. Case of two categories • More common to use a single discriminant function (dichotomizer) instead of two: g(x) = g1(x) − g2(x); decide ω1 if g(x) > 0, otherwise decide ω2. • Examples of dichotomizers: g(x) = P(ω1/x) − P(ω2/x) and g(x) = ln[p(x/ω1)/p(x/ω2)] + ln[P(ω1)/P(ω2)]
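
A small sketch of the log-ratio dichotomizer above (not from the slides); the densities and priors are the same illustrative assumptions used earlier.

```python
# g(x) = ln[p(x/w1)/p(x/w2)] + ln[P(w1)/P(w2)]; decide w1 if g(x) > 0.
import numpy as np
from scipy.stats import norm

P1, P2 = 0.6, 0.4                           # assumed priors
c1, c2 = norm(7.0, 1.5), norm(4.0, 1.0)     # assumed class-conditional densities

def g(x):
    return (c1.logpdf(x) - c2.logpdf(x)) + (np.log(P1) - np.log(P2))

for x in (4.0, 5.2, 7.5):
    print(x, "-> w1" if g(x) > 0 else "-> w2")
```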

  15. Discriminant Function for Multivariate Gaussian • Assume the following discriminant function: gi(x) = ln p(x/ωi) + ln P(ωi), where p(x/ωi) ~ N(μi, Σi). Then: gi(x) = −(1/2)(x − μi)^t Σi^(-1) (x − μi) − (d/2) ln 2π − (1/2) ln|Σi| + ln P(ωi)
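
A sketch of this discriminant using scipy's multivariate normal log-density (not from the slides); the means, covariances, and priors are illustrative assumptions.

```python
# g_i(x) = ln p(x/w_i) + ln P(w_i), with p(x/w_i) ~ N(mu_i, Sigma_i).
import numpy as np
from scipy.stats import multivariate_normal

classes = [
    # (mean, covariance, prior) -- assumed values
    (np.array([0.0, 0.0]), np.array([[1.0, 0.2], [0.2, 1.0]]), 0.5),
    (np.array([3.0, 3.0]), np.array([[1.5, 0.0], [0.0, 0.5]]), 0.5),
]

def classify(x):
    g = [multivariate_normal(mu, cov).logpdf(x) + np.log(prior)
         for mu, cov, prior in classes]
    return int(np.argmax(g))        # pick the class with the largest discriminant

print(classify([0.5, 0.2]))   # -> 0
print(classify([2.8, 3.1]))   # -> 1
```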

  16. Multivariate Gaussian Density: Case I • Assumption: Σi = σ²I • Features are statistically independent • Each feature has the same variance σ² • The discriminant reduces to: gi(x) = −‖x − μi‖²/(2σ²) + ln P(ωi) • The term ln P(ωi) favors the a-priori more likely category.

  17. Multivariate Gaussian Density: Case I (cont’d) • Expanding ‖x − μi‖² and dropping the term x^t x (the same for every class) gives a linear discriminant: gi(x) = wi^t x + wi0, where wi = μi/σ² and wi0 = −μi^t μi/(2σ²) + ln P(ωi) (wi0 = threshold or bias). • The decision boundary gi(x) = gj(x) is the hyperplane w^t (x − x0) = 0, with w = μi − μj and x0 = (1/2)(μi + μj) − [σ²/‖μi − μj‖²] ln[P(ωi)/P(ωj)] (μi − μj).

  18. Multivariate Gaussian Density: Case I (cont’d) • Comments about this hyperplane: • It passes through x0. • It is orthogonal to the line linking the means. • What happens when P(ωi) = P(ωj)? Then x0 is halfway between the means. • If P(ωi) ≠ P(ωj), x0 shifts away from the more likely mean. • If σ is very small, the position of the boundary is relatively insensitive to P(ωi) and P(ωj).
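
A quick numerical check of the Case I boundary (not from the slides); the means, variance, and priors are illustrative assumptions.

```python
# x0 = 0.5 (mu_i + mu_j) - [sigma^2 / ||mu_i - mu_j||^2] ln[P(w_i)/P(w_j)] (mu_i - mu_j),
# and the boundary is w^t (x - x0) = 0 with w = mu_i - mu_j.
import numpy as np

mu_i, mu_j = np.array([0.0, 0.0]), np.array([4.0, 0.0])   # assumed means
sigma2 = 1.0                                              # assumed common variance
P_i, P_j = 0.7, 0.3                                       # assumed priors

w = mu_i - mu_j
x0 = 0.5 * (mu_i + mu_j) - (sigma2 / np.dot(w, w)) * np.log(P_i / P_j) * w

def g(x, mu, prior):                        # Case I discriminant
    return -np.sum((x - mu) ** 2) / (2 * sigma2) + np.log(prior)

print("x0 =", x0)                                        # shifted from the midpoint [2, 0] away from mu_i
print(np.isclose(g(x0, mu_i, P_i), g(x0, mu_j, P_j)))    # True: x0 lies on the decision boundary
```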

  19. Multivariate Gaussian Density:Case I (cont’d)

  20. Multivariate Gaussian Density: Case I (cont’d) • When P(ωi) is the same for each of the c classes, the ln P(ωi) term can be dropped and the rule reduces to a minimum distance classifier: assign x to the class with the nearest mean, i.e., gi(x) = −‖x − μi‖².
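
A sketch of the minimum distance classifier (not from the slides); the class means are illustrative.

```python
# Assign x to the class with the nearest mean (equal priors, Sigma_i = sigma^2 I).
import numpy as np

means = np.array([[0.0, 0.0],
                  [4.0, 0.0],
                  [0.0, 4.0]])              # assumed class means

def classify(x):
    d2 = np.sum((means - np.asarray(x)) ** 2, axis=1)    # squared Euclidean distances
    return int(np.argmin(d2))                            # nearest mean wins

print(classify([0.5, 1.0]))   # -> 0
print(classify([3.0, 0.5]))   # -> 1
```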

  21. Maximum Likelihood Estimation

  22. Practical Issues • We could design an optimal classifier if we knew: • P(ωi) (priors) • p(x/ωi) (class-conditional densities) • In practice, we rarely have this complete information! • Design the classifier from a set of training examples. • Estimating P(ωi) is usually easy. • Estimating p(x/ωi) is more difficult: • Number of samples is often too small • Dimensionality of the feature space is large

  23. Parameter Estimation • Assumptions • We are given a sample set D = {x1, x2, ..., xn}, where the samples were drawn according to p(x/ωj). • p(x/ωj) has a known parametric form, that is, it is determined by a parameter vector θ, e.g., p(x/ωi) ~ N(μi, Σi). • Parameter estimation problem • Given D, find the best possible θ. • This is a classical problem in statistics!

  24. Main Methods in Parameter Estimation • Maximum Likelihood (ML) • It assumes that the values of the parameters are fixed but unknown. • The best estimate is obtained by maximizing the probability of obtaining the samples actually observed (i.e., the training data). • Bayesian Estimation • It assumes that the parameters are random variables having some known a-priori distribution. • Observing the samples converts this prior into a posterior density, which is then used to estimate the true values of the parameters.

  25. Maximum Likelihood (ML) Estimation - Assumptions • Suppose the training data is divided into c sets (i.e., one for each class): D1, D2, ..., Dc • Assume that the samples in Dj have been drawn independently according to p(x/ωj). • Assume that p(x/ωj) has a known parametric form with parameters θj: e.g., θj = (μj, Σj) for Gaussian distributions or, in general, θj = (θ1, θ2, …, θp)t

  26. ML Estimation - Problem Definition and Solution • Problem: given D1, D2, ..., Dc and a model for each class, estimate θ1, θ2, …, θc • If the samples in Dj give no information about θi (i ≠ j), we need to solve c independent problems (i.e., one for each class). • The ML estimate for D = {x1, x2, .., xn} is the value θ̂ that maximizes p(D/θ) (i.e., best supports the training data).

  27. ML Parameter Estimation (cont’d) • Since the samples in D = {x1, x2, ..., xn} are drawn independently, the likelihood of θ with respect to D is: p(D/θ) = p(x1/θ) p(x2/θ) ··· p(xn/θ) = Πk p(xk/θ) (e.g., θ = μ for a Gaussian with known covariance).

  28. ML Parameter Estimation (cont’d) • How to find the maximum? • Easier to consider the log-likelihood: l(θ) = ln p(D/θ) = Σk ln p(xk/θ) • The solution θ̂ maximizes p(D/θ) or, equivalently, ln p(D/θ) (the logarithm is monotonically increasing); it satisfies ∇θ l(θ) = Σk ∇θ ln p(xk/θ) = 0.

  29. ML for Gaussian Density: Case of Unknown θ = μ • Consider ln p(xk/μ), where p(xk/μ) ~ N(μ, Σ) with Σ known: ln p(xk/μ) = −(1/2) ln[(2π)^d |Σ|] − (1/2)(xk − μ)^t Σ^(-1) (xk − μ) • Computing the gradient with respect to μ, we have: ∇μ ln p(xk/μ) = Σ^(-1)(xk − μ)

  30. ML for Gaussian Density: Case of Unknown θ = μ (cont’d) • Setting ∇μ ln p(D/μ) = Σk Σ^(-1)(xk − μ̂) = 0, we have: • The solution is given by μ̂ = (1/n) Σk xk • The ML estimate is simply the “sample mean”.
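
A small sketch (not from the slides) checking that the sample mean maximizes the Gaussian log-likelihood when Σ is known; the data here are randomly generated for illustration.

```python
# mu_hat = (1/n) sum_k x_k maximizes l(mu) = sum_k ln p(x_k/mu) for a Gaussian with known Sigma.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
true_mu, cov = np.array([1.0, -2.0]), np.eye(2)          # assumed ground truth
X = rng.multivariate_normal(true_mu, cov, size=500)      # illustrative training set D

mu_hat = X.mean(axis=0)                                  # closed-form ML estimate (sample mean)

def log_likelihood(mu):
    return multivariate_normal(mu, cov).logpdf(X).sum()  # l(mu) = sum_k ln p(x_k/mu)

# The sample mean scores at least as well as nearby perturbations:
assert all(log_likelihood(mu_hat) >= log_likelihood(mu_hat + np.array(d))
           for d in ([0.1, 0.0], [0.0, -0.1], [-0.05, 0.05]))
print("ML estimate:", mu_hat)
```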
