
Document Analysis: Non-Parametric Methods for Pattern Recognition






Presentation Transcript


1. Prof. Rolf Ingold, University of Fribourg, Master course, spring semester 2008. Document Analysis: Non-Parametric Methods for Pattern Recognition

2. Outline
- Introduction
- Density estimation from training samples
- Two different approaches
- Parzen windows
- k-nearest-neighbor approach
- k-nearest-neighbor rule
- Nearest-neighbor rule
- Error bounds
- Distances

3. Introduction
It is often not obvious to characterize the densities by parametric functions, typically when the distributions have multiple and irregular peaks.
The principle of non-parametric methods consists in estimating the density functions directly from the training sets.

4. Density Estimation (1)
For a given class, let P be the probability that a randomly selected sample belongs to a region R, i.e. $P = \int_R p(x')\,dx'$.
The probability that exactly k samples out of n fall into that region is given by the binomial law $P_k = \binom{n}{k} P^k (1-P)^{n-k}$, from which we get the expectation of k: $E[k] = nP$.
If p(x) is continuous and R is a very small region around x, we get $P \approx p(x)\,V$, where V is the volume of region R, which leads to the following estimator: $p(x) \approx \frac{k/n}{V}$.
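
As a quick illustration of this estimator (a minimal sketch, not from the original slides), the following Python snippet estimates p(x) at x = 0 for samples drawn from a standard normal distribution, whose true density at 0 is about 0.399:

```python
import numpy as np

# Minimal sketch of the estimator p(x) ~ (k/n) / V in one dimension.
# Assumption (illustrative only): the samples come from a standard normal.
rng = np.random.default_rng(0)
samples = rng.standard_normal(10_000)    # n training samples
x, h = 0.0, 0.1                          # small region R = [x - h, x + h]
V = 2 * h                                # "volume" (length) of R
k = np.sum(np.abs(samples - x) <= h)     # number of samples falling in R
n = len(samples)
print("estimated p(x):", (k / n) / V)    # should be close to 0.399
```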

5. Density Estimation (2)
When using respectively 1, 2, ..., n samples, let us consider a sequence of regions around x denoted R_1, R_2, ..., R_n:
- let V_n be the volume of R_n
- let k_n be the number of samples falling into R_n
Then it can be shown that the sequence of estimates $p_1(x), p_2(x), \ldots, p_n(x)$ converges to p(x) if the following conditions are all satisfied: $\lim_{n\to\infty} V_n = 0$, $\lim_{n\to\infty} k_n = \infty$ and $\lim_{n\to\infty} k_n/n = 0$.

6. Two different approaches
Two approaches satisfy these conditions:
- Parzen windows, which define the regions by their volumes
- the k-nearest-neighbor rule (kNN), which defines the regions by the number of samples falling into them

7. Principle of Parzen Windows
Each sample of the training set contributes to the estimated density through a window function centered on it. The width of the window must be chosen carefully (see the sketch below):
- if the window width is too large, the decision boundaries have too little resolution
- if the window width is too small, there is a risk of overfitting
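
A minimal Parzen-window sketch in Python, assuming a Gaussian window function of width h (the window choice and the data are illustrative, not taken from the slides):

```python
import numpy as np

def parzen_density(x, samples, h):
    """Parzen-window density estimate at x, using a Gaussian window of width h.

    Each training sample contributes one window function; the estimate is the
    average contribution: p(x) ~ (1/n) * sum_i phi((x - x_i) / h) / h.
    """
    u = (x - samples) / h
    windows = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)  # Gaussian window values
    return np.mean(windows) / h

# Usage: density of a standard normal estimated at x = 0 (true value ~ 0.399)
rng = np.random.default_rng(1)
samples = rng.standard_normal(5_000)
print(parzen_density(0.0, samples, h=0.2))
```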

8. Decision boundaries for different Parzen window widths (figure). In fact, the window width should be adapted locally.

9. k-nearest-neighbor approach
The k-nearest-neighbor approach avoids this problem of Parzen windows: the "window width" is automatically adapted to the local density, i.e. the region around x is grown until it contains the k closest samples.
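
A minimal sketch of the corresponding density estimate in one dimension, where the region around x is grown until it reaches the k closest samples (illustrative code, not from the slides):

```python
import numpy as np

def knn_density(x, samples, k):
    """k-nearest-neighbor density estimate for 1-D data: p(x) ~ (k/n) / V,
    where V is the length of the interval around x that just reaches the k-th neighbor."""
    dists = np.sort(np.abs(samples - x))
    V = 2 * dists[k - 1]            # interval spans +/- distance to the k-th neighbor
    return (k / len(samples)) / V

# Usage: estimate at x = 0 for standard normal samples (true value ~ 0.399)
rng = np.random.default_rng(2)
samples = rng.standard_normal(5_000)
print(knn_density(0.0, samples, k=50))
```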

10. Density functions for k-nearest-neighbors
The estimated density functions are continuous, but their derivatives are not.
Illustration of the density functions for k = 3 and k = 5 (figure).

11. Estimation of a posteriori probabilities
Let us consider a region centered at x, having a volume V and containing exactly k samples from the training set, k_i of which belong to class ω_i.
The joint probability of x and ω_i is estimated by $p_n(x, \omega_i) = \frac{k_i/n}{V}$.
The estimated a posteriori probabilities are $P_n(\omega_i \mid x) = \frac{p_n(x, \omega_i)}{\sum_j p_n(x, \omega_j)} = \frac{k_i}{k}$.
This justifies the rule of choosing the class ω_i corresponding to the highest value of k_i.
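
A minimal Python sketch of this classification rule (the Euclidean distance and the two-class Gaussian data are illustrative assumptions, not from the slides):

```python
import numpy as np

def knn_posteriors(x, X_train, y_train, k=5):
    """Estimate a posteriori probabilities as k_i / k among the k nearest neighbors."""
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distance to every sample
    nearest = y_train[np.argsort(dists)[:k]]      # labels of the k closest samples
    return {c: float(np.mean(nearest == c)) for c in np.unique(y_train)}

# Usage on two illustrative Gaussian classes
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(3, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)
post = knn_posteriors(np.array([2.0, 2.0]), X, y, k=7)
print(post, "-> decide class", max(post, key=post.get))
```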

12. Choice of k for the k-nearest-neighbor rule
The parameter k is chosen as a function of n: by choosing $k_n = \sqrt{n}$ we get $V_n = \frac{k_n/n}{p_n(x)} = \frac{1}{\sqrt{n}\,p_n(x)} = \frac{V_0}{\sqrt{n}}$, showing that V_0 depends on p_n(x).

13. Nearest Neighbor rule
The nearest-neighbor rule is a suboptimal rule that classifies a sample x into the class of its nearest neighbor in the training set.
It can be shown that the probability of error P of the nearest-neighbor rule is bounded by $P^* \le P \le P^* \left( 2 - \frac{c}{c-1} P^* \right)$, where P* represents the Bayes error and c the number of classes.
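
As a worked special case (not shown in the transcript), for two categories (c = 2) the upper bound becomes:

```latex
P^* \le P \le P^*(2 - 2P^*) = 2P^*(1 - P^*)
```

For example, a Bayes error of P* = 0.05 gives a nearest-neighbor error of at most 0.095, i.e. less than twice the Bayes error.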

14. Generalization to the kNN rule
Figure: error rate of the kNN rule for the two-category case.
It shows that asymptotically (when k → ∞) the error rate converges to the Bayes error.

15. Distances
The k-nearest-neighbor rule relies on a distance (or metric). Algebraically, a distance must satisfy four properties:
- non-negativity: d(a,b) ≥ 0
- reflexivity: d(a,b) = 0 if and only if a = b
- symmetry: d(a,b) = d(b,a)
- triangle inequality: d(a,b) + d(b,c) ≥ d(a,c)

16. Problem with distances
Scaling the coordinates of a feature space can change the neighborhood relationships induced by the distance.
To avoid such arbitrary scaling effects, it is recommended to perform feature normalization, i.e. to determine the scale of each feature according to either (see the sketch below):
- the min-max interval of each feature, or
- the standard deviation of each individual feature distribution
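
A minimal normalization sketch in Python covering both options (the helper name and the data are illustrative, not from the slides):

```python
import numpy as np

def normalize_features(X, method="zscore"):
    """Rescale each feature column so that no single feature dominates the distance.

    method="minmax": map each feature to [0, 1] using its min-max interval.
    method="zscore": center each feature and divide by its standard deviation.
    """
    X = np.asarray(X, dtype=float)
    if method == "minmax":
        mins, maxs = X.min(axis=0), X.max(axis=0)
        return (X - mins) / (maxs - mins)
    return (X - X.mean(axis=0)) / X.std(axis=0)

# Usage: two features on very different scales
X = np.array([[1200.0, 3.0], [1500.0, 1.0], [900.0, 4.0]])
print(normalize_features(X, "minmax"))
print(normalize_features(X, "zscore"))
```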

17. Generalized distances
The Minkowski distance, generalizing the Euclidean distance, is defined by $L_k(a,b) = \left( \sum_{i=1}^{d} |a_i - b_i|^k \right)^{1/k}$.
It leads to the following special cases:
- the Euclidean distance (for k = 2)
- the Manhattan distance, or city-block distance (for k = 1)
- the maximum distance (for k = ∞)
Many other distances exist.
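
A minimal Python sketch of the Minkowski distance and its special cases (illustrative helper, not from the slides):

```python
import numpy as np

def minkowski(a, b, k):
    """Minkowski distance L_k(a, b): k=1 Manhattan, k=2 Euclidean, k=inf maximum."""
    diff = np.abs(np.asarray(a, dtype=float) - np.asarray(b, dtype=float))
    if np.isinf(k):
        return float(diff.max())
    return float((diff ** k).sum() ** (1.0 / k))

# Usage: the three special cases on the same pair of points
a, b = [0.0, 0.0], [3.0, 4.0]
print(minkowski(a, b, 1))        # Manhattan:  7.0
print(minkowski(a, b, 2))        # Euclidean:  5.0
print(minkowski(a, b, np.inf))   # maximum:    4.0
```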
