
Visual Recognition: Bayesian Decision Theory

Learn about Bayesian decision theory, a statistical approach to pattern classification, where decisions are made optimally based on probabilistic information. Explore the use of Bayes' formula, error minimization, loss functions, and decision rules in the classification process.

Presentation Transcript


1. Outline
• Bayesian Decision Theory
• Bayes' formula
• Error
• Bayes' Decision Rule
• Loss function and Risk
• Two-Category Classification
• Classifiers, Discriminant Functions, and Decision Surfaces
• Discriminant Functions for the Normal Density

2. Bayesian Decision Theory
• Bayesian decision theory is a fundamental statistical approach to the problem of pattern classification.
• Decision making when all the probabilistic information is known.
• For given probabilities the decision is optimal.
• When new information is added, it is assimilated in an optimal fashion to improve the decision.

3. Bayesian Decision Theory cont.
• Fish example:
• Each fish is in one of 2 states: sea bass or salmon.
• Let w denote the state of nature:
• w = w1 for sea bass
• w = w2 for salmon

4. Bayesian Decision Theory cont.
• The state of nature is unpredictable: w is a variable that must be described probabilistically.
• If the catch produced as much salmon as sea bass, the next fish is equally likely to be sea bass or salmon.
• Define:
• P(w1): a priori probability that the next fish is sea bass
• P(w2): a priori probability that the next fish is salmon

5. Bayesian Decision Theory cont.
• If other types of fish are irrelevant: P(w1) + P(w2) = 1.
• Prior probabilities reflect our prior knowledge (e.g. time of year, fishing area, …).
• Simple decision rule:
• Make a decision without seeing the fish.
• Decide w1 if P(w1) > P(w2); w2 otherwise.
• This is fine if we only have to decide for one fish.
• If there are several fish, all of them are assigned to the same class.

6. Bayesian Decision Theory cont.
• In general, we will have some features and more information.
• Feature: lightness measurement = x
• Different fish yield different lightness readings (x is a random variable).

7. Bayesian Decision Theory cont.
• Define p(x|w1) = class-conditional probability density: the probability density function for x given that the state of nature is w1.
• The difference between p(x|w1) and p(x|w2) describes the difference in lightness between sea bass and salmon.

8. Bayesian Decision Theory cont.
• Hypothetical class-conditional probability density functions p(x|w1) and p(x|w2) (figure).
• Density functions are normalized (the area under each curve is 1.0).

9. Bayesian Decision Theory cont.
• Suppose that we know the prior probabilities P(w1) and P(w2) and the conditional densities p(x|w1) and p(x|w2), and we measure the lightness of a fish: x.
• What is the category of the fish?

10. Bayes' formula
$$P(w_j|x) = \frac{p(x|w_j)\,P(w_j)}{p(x)},$$
where
$$p(x) = \sum_{j} p(x|w_j)\,P(w_j).$$
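
This computation is easy to sanity-check numerically. Below is a minimal sketch (not from the original slides) that evaluates the posteriors for the two-category fish example, assuming hypothetical Gaussian class-conditional lightness densities and the priors used later on slide 12 (2/3 and 1/3); all parameter values are made up for illustration.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Univariate normal density, used here as a stand-in for p(x|w_j)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

def likelihoods(x):
    # p(x|w1), p(x|w2) for hypothetical sea bass / salmon lightness models
    return np.array([gaussian_pdf(x, 11.0, 1.5), gaussian_pdf(x, 8.0, 1.0)])

priors = np.array([2/3, 1/3])          # P(w1), P(w2)
x = 9.5                                # measured lightness

num = likelihoods(x) * priors          # p(x|w_j) P(w_j)
evidence = num.sum()                   # p(x) = sum_j p(x|w_j) P(w_j)
posteriors = num / evidence            # P(w_j|x), sums to 1
print(posteriors)
```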

11. Bayes' formula cont.
• p(x|wj) is called the likelihood of wj with respect to x (the category wj for which p(x|wj) is large is more "likely" to be the true category).
• p(x) is the evidence: how frequently we will measure a pattern with feature value x. It is a scale factor that guarantees that the posterior probabilities sum to 1.

12. Bayes' formula cont.
• Posterior probabilities for the particular priors P(w1) = 2/3 and P(w2) = 1/3 (figure).
• At every x the posteriors sum to 1.

13. Error
• For a given x, we can minimize the probability of error by deciding w1 if P(w1|x) > P(w2|x) and w2 otherwise.

14. Bayes' Decision Rule (minimizes the probability of error)
• Decide w1 if P(w1|x) > P(w2|x); otherwise decide w2,
or equivalently
• Decide w1 if p(x|w1) P(w1) > p(x|w2) P(w2); otherwise decide w2,
and
• P(error|x) = min [P(w1|x), P(w2|x)].
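
Continuing the illustration above, the minimum-error decision and its conditional probability of error follow directly from the posteriors; this small sketch assumes the posteriors have already been computed and uses example values.

```python
import numpy as np

posteriors = np.array([0.55, 0.45])   # P(w1|x), P(w2|x) from Bayes' formula (example values)
decision = np.argmax(posteriors)      # decide w1 if P(w1|x) > P(w2|x), else w2
p_error = posteriors.min()            # P(error|x) = min[P(w1|x), P(w2|x)]
print(f"decide w{decision + 1}, P(error|x) = {p_error:.2f}")
```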

15. Bayesian Decision Theory: Continuous Features, General Case
Formalize the ideas just considered in 4 ways:
• Allow more than one feature: replace the scalar x by the feature vector x; the d-dimensional Euclidean space R^d is called the feature space.
• Allow more than 2 states of nature: generalize to several classes.
• Allow actions other than merely deciding the state of nature: the possibility of rejection, i.e., of refusing to make a decision in close cases.
• Introduce a general loss function.

16. Loss function
• A loss (or cost) function states exactly how costly each action is, and is used to convert a probability determination into a decision.
• Loss functions let us treat situations in which some kinds of classification mistakes are more costly than others.

17. Formulation
• Let {w1, ..., wc} be the finite set of c states of nature ("categories").
• Let {α1, ..., αa} be the finite set of a possible actions.
• The loss function λ(αi|wj) = loss incurred for taking action αi when the state of nature is wj.
• x = d-dimensional feature vector (random variable).
• p(x|wj) = the state-conditional probability density function for x (the probability density function for x conditioned on wj being the true state of nature).
• P(wj) = prior probability that nature is in state wj.

18. Expected Loss
• Suppose that we observe a particular x and that we contemplate taking action αi.
• If the true state of nature is wj, then the loss is λ(αi|wj).
• Before we have made an observation, the expected loss is
$$E[\lambda(\alpha_i)] = \sum_{j=1}^{c} \lambda(\alpha_i|w_j)\,P(w_j).$$

19. Conditional Risk
• After the observation, the expected loss, now called the "conditional risk", is given by
$$R(\alpha_i|x) = \sum_{j=1}^{c} \lambda(\alpha_i|w_j)\,P(w_j|x).$$

20. Total Risk
• Objective: select the action that minimizes the conditional risk.
• A general decision rule is a function α(x) that tells us which action to take for every possible observation.
• For every x, the decision function α(x) assumes one of the a values α1, ..., αa.
• The "total risk" is
$$R = \int R(\alpha(x)|x)\,p(x)\,dx.$$

21. Bayes Decision Rule
• Compute the conditional risk R(αi|x) for i = 1, ..., a.
• Select the action αi for which R(αi|x) is minimum.
• The resulting minimum total risk is called the Bayes Risk, denoted R*, and is the best performance that can be achieved.
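
A sketch of this rule, assuming a hypothetical 2-action / 2-state loss matrix and example posteriors: the conditional risk R(αi|x) is the posterior-weighted loss, and the Bayes rule picks the action that minimizes it.

```python
import numpy as np

# lam[i, j] = loss for taking action alpha_i when the true state is w_j (illustrative values)
lam = np.array([[0.0, 2.0],
                [1.0, 0.0]])
posteriors = np.array([0.6, 0.4])      # P(w_j|x)

cond_risk = lam @ posteriors           # R(alpha_i|x) = sum_j lam[i, j] P(w_j|x)
best_action = np.argmin(cond_risk)     # Bayes decision rule: minimize conditional risk
print(cond_risk, "-> take action", best_action + 1)
```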

22. Two-Category Classification
• Action α1 = deciding that the true state is w1.
• Action α2 = deciding that the true state is w2.
• Let λij = λ(αi|wj) be the loss incurred for deciding wi when the true state is wj.
• Decide w1 if
$$R(\alpha_1|x) < R(\alpha_2|x),$$
or if
$$(\lambda_{21} - \lambda_{11})\,P(w_1|x) > (\lambda_{12} - \lambda_{22})\,P(w_2|x),$$
or if
$$(\lambda_{21} - \lambda_{11})\,p(x|w_1)\,P(w_1) > (\lambda_{12} - \lambda_{22})\,p(x|w_2)\,P(w_2),$$
and w2 otherwise.

23. Two-Category Likelihood Ratio Test
• Under the reasonable assumption that λ21 > λ11 (why?), decide w1 if
$$\frac{p(x|w_1)}{p(x|w_2)} > \frac{\lambda_{12} - \lambda_{22}}{\lambda_{21} - \lambda_{11}}\cdot\frac{P(w_2)}{P(w_1)},$$
and w2 otherwise.
• The ratio p(x|w1)/p(x|w2) is called the likelihood ratio.
• We decide w1 if the likelihood ratio exceeds a threshold value T that is independent of the observation x.
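
The threshold form of the rule can be written compactly. The sketch below assumes the losses and priors are known and that λ21 > λ11 and λ12 > λ22, so the threshold T is well defined; the zero-one defaults and the numeric values are purely illustrative.

```python
def likelihood_ratio_decision(px_w1, px_w2, prior1, prior2,
                              l11=0.0, l12=1.0, l21=1.0, l22=0.0):
    """Decide w1 iff p(x|w1)/p(x|w2) exceeds the threshold T (independent of x)."""
    T = (l12 - l22) / (l21 - l11) * (prior2 / prior1)
    return "w1" if px_w1 / px_w2 > T else "w2"

# With zero-one losses, T reduces to P(w2)/P(w1) = 0.5 here
print(likelihood_ratio_decision(px_w1=0.20, px_w2=0.30, prior1=2/3, prior2=1/3))
```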

24. Minimum-Error-Rate Classification
• In classification problems, each state of nature is usually associated with a different one of the c classes.
• Action αi = decision that the true state is wi.
• If action αi is taken, and the true state is wj, then the decision is correct if i = j, and in error otherwise.
• The zero-one loss function is defined as
$$\lambda(\alpha_i|w_j) = \begin{cases} 0 & i = j \\ 1 & i \neq j \end{cases} \qquad i, j = 1, \ldots, c,$$
i.e. all errors are equally costly.

25. Minimum-Error-Rate Classification cont.
• The conditional risk is
$$R(\alpha_i|x) = \sum_{j \neq i} P(w_j|x) = 1 - P(w_i|x).$$
• To minimize the average probability of error, we should select the i that maximizes the posterior probability P(wi|x): decide wi if P(wi|x) > P(wj|x) for all j ≠ i (same as Bayes' decision rule).

26. Decision Regions
• The likelihood ratio p(x|w1)/p(x|w2) vs. x (figure).
• The threshold θa corresponds to the zero-one loss function.
• If we set λ12 > λ21, we get a threshold θb > θa.

27. Classifiers, Discriminant Functions, and Decision Surfaces: The Multi-Category Case
• A pattern classifier can be represented by a set of discriminant functions gi(x), i = 1, ..., C.
• The classifier assigns a feature vector x to class wi if gi(x) > gj(x) for all j ≠ i.

28. Statistical Pattern Classifier
• Block diagram of a statistical pattern classifier (figure).

29. The Bayes Classifier
• A Bayes classifier can be represented in this way:
• For the general case with risks: gi(x) = -R(αi|x).
• For the minimum error-rate case: gi(x) = P(wi|x).
• If we replace every gi(x) by f(gi(x)), where f(·) is a monotonically increasing function, the resulting classification is unchanged; e.g. any of the following choices gives identical classification results:
$$g_i(x) = P(w_i|x) = \frac{p(x|w_i)\,P(w_i)}{\sum_{j} p(x|w_j)\,P(w_j)}, \qquad g_i(x) = p(x|w_i)\,P(w_i), \qquad g_i(x) = \ln p(x|w_i) + \ln P(w_i).$$
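
The invariance under a monotonically increasing f(·) is easy to demonstrate numerically; the sketch below (illustrative values only) shows that p(x|wi)P(wi) and ln p(x|wi) + ln P(wi) pick the same class.

```python
import numpy as np

likelihoods = np.array([0.05, 0.20, 0.10])    # p(x|w_i) for three classes (example values)
priors = np.array([0.5, 0.2, 0.3])            # P(w_i)

g_prod = likelihoods * priors                 # g_i(x) = p(x|w_i) P(w_i)
g_log = np.log(likelihoods) + np.log(priors)  # g_i(x) = ln p(x|w_i) + ln P(w_i)

assert np.argmax(g_prod) == np.argmax(g_log)  # same decision under the monotone map ln(.)
print("chosen class:", np.argmax(g_prod) + 1)
```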

30. The Bayes Classifier cont.
• The effect of any decision rule is to divide the feature space into C decision regions, R1, ..., RC.
• If gi(x) > gj(x) for all j ≠ i, then x is in Ri, and x is assigned to wi.
• Decision regions are separated by decision boundaries.
• Decision boundaries are surfaces in the feature space.

31. The Decision Regions
• Two-dimensional, two-category classifier (figure).

32. The Two-Category Case
• Use 2 discriminant functions g1 and g2, and assign x to w1 if g1 > g2.
• Alternative: define a single discriminant function g(x) = g1(x) - g2(x); decide w1 if g(x) > 0, otherwise decide w2.
• In the two-category case, two forms are frequently used:
$$g(x) = P(w_1|x) - P(w_2|x), \qquad g(x) = \ln\frac{p(x|w_1)}{p(x|w_2)} + \ln\frac{P(w_1)}{P(w_2)}.$$

33. Normal Density - Univariate Case
• Gaussian density with mean μ and variance σ²:
$$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left[-\frac{(x-\mu)^2}{2\sigma^2}\right].$$
• It can be shown that:
$$E[x] = \mu, \qquad E[(x-\mu)^2] = \sigma^2.$$

34. Entropy
• The entropy is given by
$$H(p(x)) = -\int p(x)\,\ln p(x)\,dx$$
and is measured in nats; if log2 is used instead, the unit is the bit.
• The entropy measures the fundamental uncertainty in the values of points selected randomly from a distribution.
• The normal distribution has the maximum entropy of all distributions having a given mean and variance.
• As stated by the Central Limit Theorem, the aggregate effect of the sum of a large number of small, i.i.d. random disturbances will lead to a Gaussian distribution.
• Because many patterns can be viewed as some ideal or prototype pattern corrupted by a large number of random processes, the Gaussian is often a good model for the actual probability distribution.

35. Normal Density - Multivariate Case
• The general multivariate normal density (MND) in d dimensions is written as
$$p(x) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}}\exp\!\left[-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right].$$
• It can be shown that
$$E[x] = \mu, \qquad E[(x-\mu)(x-\mu)^T] = \Sigma,$$
which means for the components
$$\mu_i = E[x_i], \qquad \sigma_{ij} = E[(x_i-\mu_i)(x_j-\mu_j)].$$
• The covariance matrix Σ is always symmetric and positive semidefinite.
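
A direct NumPy evaluation of the density formula above, written out explicitly rather than via a library routine; the mean and covariance here are arbitrary illustrative values.

```python
import numpy as np

def mnd_pdf(x, mu, sigma):
    """Multivariate normal density in d dimensions, evaluated from the formula above."""
    d = len(mu)
    diff = x - mu
    norm = 1.0 / ((2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(sigma)))
    return norm * np.exp(-0.5 * diff @ np.linalg.inv(sigma) @ diff)

mu = np.array([0.0, 1.0])
sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])
print(mnd_pdf(np.array([0.5, 0.5]), mu, sigma))
```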

36. Normal Density - Multivariate Case cont.
• The diagonal elements σii are the variances of the xi, and the off-diagonal elements σij are the covariances of xi and xj.
• If xi and xj are statistically independent, σij = 0. If σij = 0 for all i ≠ j, then p(x) is a product of univariate normal densities.
• Linear combinations of jointly normally distributed random variables are normally distributed: if p(x) ~ N(μ, Σ) and y = A^T x, where A is a d-by-k matrix, then p(y) ~ N(A^T μ, A^T Σ A).
• If A is a vector a, then y = a^T x is a scalar, and a^T Σ a is the variance of the projection of x onto a.

37. Whitening transform
• Define Φ to be the matrix whose columns are the orthonormal eigenvectors of Σ, and Λ the diagonal matrix of the corresponding eigenvalues. The transformation A_w = Φ Λ^{-1/2} converts an arbitrary MND into a spherical one, with covariance matrix I.
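
A sketch of the whitening transform described above: Φ holds the eigenvectors of the covariance, Λ the eigenvalues, and A_w = Φ Λ^{-1/2}; after the transform the sample covariance should be approximately the identity. The data below are randomly generated for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = np.array([[4.0, 1.5],
                  [1.5, 1.0]])
x = rng.multivariate_normal(mean=[0, 0], cov=sigma, size=10_000)

eigvals, eigvecs = np.linalg.eigh(np.cov(x.T))   # Lambda (diagonal entries), Phi (columns)
A_w = eigvecs @ np.diag(eigvals ** -0.5)         # A_w = Phi Lambda^{-1/2}
y = x @ A_w                                      # whitened data: cov(y) ~ I
print(np.round(np.cov(y.T), 2))
```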

38. Normal Density - Multivariate Case cont.
• The multivariate normal density is completely specified by d + d(d+1)/2 parameters: the elements of μ and the independent elements of Σ.
• Samples drawn from an MND fall in a cluster whose center is determined by μ and whose shape is determined by Σ.
• The loci of points of constant density are hyperellipsoids on which the quadratic form
$$r^2 = (x-\mu)^T\Sigma^{-1}(x-\mu)$$
is constant; r is called the Mahalanobis distance from x to μ.
• The principal axes of the hyperellipsoid are given by the eigenvectors of Σ.
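
The squared Mahalanobis distance r² = (x - μ)ᵀ Σ⁻¹ (x - μ) takes only a couple of lines to compute; the values below are illustrative.

```python
import numpy as np

def mahalanobis_sq(x, mu, sigma):
    """Squared Mahalanobis distance from x to mu under covariance sigma."""
    diff = x - mu
    return diff @ np.linalg.inv(sigma) @ diff

mu = np.array([1.0, 2.0])
sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
print(mahalanobis_sq(np.array([2.0, 3.5]), mu, sigma))
```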

39. Normal Density - Multivariate Case cont.
• Minimum-error-rate classification can be achieved using the discriminant functions
$$g_i(x) = P(w_i|x) \quad\text{or}\quad g_i(x) = \ln p(x|w_i) + \ln P(w_i).$$
• If p(x|wi) ~ N(μi, Σi), then
$$g_i(x) = -\frac{1}{2}(x-\mu_i)^T\Sigma_i^{-1}(x-\mu_i) - \frac{d}{2}\ln 2\pi - \frac{1}{2}\ln|\Sigma_i| + \ln P(w_i).$$

40. Discriminant Functions for the Normal Density
• Case 1: Σi = σ²I.
• The features are statistically independent, and each feature has the same variance.
• The determinant is |Σi| = σ^{2d} and the inverse of Σi is Σi^{-1} = (1/σ²)I.
• Both |Σi| and the (d/2) ln 2π term are independent of i and can be ignored.

41. Case 1 cont.
• The discriminant function becomes
$$g_i(x) = -\frac{\|x-\mu_i\|^2}{2\sigma^2} + \ln P(w_i),$$
where ||·|| denotes the Euclidean norm.
• Expanding the quadratic form, the term x^T x is independent of i, so gi can be written as a linear discriminant function
$$g_i(x) = w_i^T x + w_{i0}, \quad\text{where}\quad w_i = \frac{1}{\sigma^2}\mu_i, \qquad w_{i0} = -\frac{1}{2\sigma^2}\mu_i^T\mu_i + \ln P(w_i).$$

42. Case 1 cont.
• w_{i0} is called the threshold or bias in the ith direction.
• A classifier that uses linear discriminant functions is called a linear machine.
• The decision surfaces of a linear machine are pieces of hyperplanes defined by the linear equations gi(x) = gj(x) for the 2 categories with the highest posterior probabilities.
• For this particular example, setting gi(x) = gj(x) reduces to
$$w^T(x - x_0) = 0,$$

43. Case 1 cont.
• where
$$w = \mu_i - \mu_j, \qquad x_0 = \frac{1}{2}(\mu_i + \mu_j) - \frac{\sigma^2}{\|\mu_i-\mu_j\|^2}\,\ln\frac{P(w_i)}{P(w_j)}\,(\mu_i - \mu_j).$$
• The above equation defines a hyperplane through x_0 and orthogonal to w (the line linking the means).
• If P(wi) = P(wj), then x_0 is halfway between the means.

44. Case 1 cont.
• Decision boundaries for the 1D and 2D cases (figure).

45. Case 1 cont.
• If the covariances of the 2 distributions are equal and proportional to the identity matrix, then the distributions are spherical in d dimensions, and the boundary is a generalized hyperplane of d-1 dimensions, perpendicular to the line separating the means.
• If P(wi) is not equal to P(wj), the point x_0 shifts away from the more likely mean.

46. Case 1 cont.
• 1D example (figure).

47. Minimum Distance Classifier
• As the priors are changed, the decision boundary shifts.
• If all prior probabilities are the same, the optimum decision rule becomes:
• Measure the Euclidean distance from x to each of the C mean vectors.
• Assign x to the class of the nearest mean.
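
With equal priors and Σi = σ²I, the rule reduces to nearest-mean classification. A minimal sketch with made-up class means:

```python
import numpy as np

means = np.array([[0.0, 0.0],     # class means mu_1 .. mu_C (illustrative values)
                  [3.0, 1.0],
                  [1.0, 4.0]])

def nearest_mean(x):
    """Assign x to the class whose mean is closest in Euclidean distance."""
    d2 = ((means - x) ** 2).sum(axis=1)
    return int(np.argmin(d2)) + 1          # classes numbered 1..C

print(nearest_mean(np.array([2.5, 0.5])))  # -> 2
```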

48. Discriminant Functions for the Normal Density
• Case 2: Σi = Σ (common covariance matrices).
• The covariance matrices for all of the classes are identical but otherwise arbitrary.
• Both |Σi| and the (d/2) ln 2π term are independent of i and can be ignored.

49. Case 2 cont.
• The discriminant function reduces to
$$g_i(x) = -\frac{1}{2}(x-\mu_i)^T\Sigma^{-1}(x-\mu_i) + \ln P(w_i).$$
• If all prior probabilities are the same, the optimum decision rule becomes:
• Measure the squared Mahalanobis distance (x - μi)^T Σ^{-1} (x - μi) from x to each of the C mean vectors.
• Assign x to the class of the nearest mean.

50. Case 2 cont.
• Expanding (x - μi)^T Σ^{-1} (x - μi) and dropping the x^T Σ^{-1} x term (which is independent of i), we get a linear classifier
$$g_i(x) = w_i^T x + w_{i0}, \quad\text{where}\quad w_i = \Sigma^{-1}\mu_i, \qquad w_{i0} = -\frac{1}{2}\mu_i^T\Sigma^{-1}\mu_i + \ln P(w_i).$$
• Decision boundaries are given by
$$w^T(x - x_0) = 0, \quad\text{with}\quad w = \Sigma^{-1}(\mu_i - \mu_j), \qquad x_0 = \frac{1}{2}(\mu_i + \mu_j) - \frac{\ln\!\left[P(w_i)/P(w_j)\right]}{(\mu_i-\mu_j)^T\Sigma^{-1}(\mu_i-\mu_j)}\,(\mu_i - \mu_j).$$
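
A sketch of the Case 2 linear discriminants, assuming known class means, a shared covariance matrix, and priors (all values illustrative): w_i = Σ^{-1} μ_i and w_{i0} = -½ μ_i^T Σ^{-1} μ_i + ln P(w_i).

```python
import numpy as np

means = np.array([[0.0, 0.0],
                  [2.0, 2.0]])             # mu_1, mu_2 (illustrative values)
sigma = np.array([[1.5, 0.4],
                  [0.4, 1.0]])             # common covariance Sigma
priors = np.array([0.6, 0.4])              # P(w_1), P(w_2)

sigma_inv = np.linalg.inv(sigma)
W = means @ sigma_inv                      # row i is w_i^T = mu_i^T Sigma^{-1} (Sigma symmetric)
w0 = -0.5 * np.einsum('ij,ij->i', W, means) + np.log(priors)   # w_{i0}

def classify(x):
    g = W @ x + w0                         # linear discriminants g_i(x)
    return int(np.argmax(g)) + 1

print(classify(np.array([1.5, 0.5])))
```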
