
Nearest Neighbor Classifier


Presentation Transcript


  1. Artificial Intelligence Nearest Neighbor Classifier Dae-Won Kim School of Computer Science & Engineering Chung-Ang University

  2. In the last class, we learned Bayesian classification approaches.

  3. However, they have some limitations.

  4. Limit 1. We rarely know the class-conditional probability P(x|class), so it must be estimated.

  5. The samples are often too few for reliable class-conditional estimation.

  6. Limit 2. Samples and features are assumed to be independent of each other.

  7. They are often not independent.

  8. Limit 3. All parametric densities are uni-modal (e.g., the normal distribution).

  9. Many practical problems involve multi-modal densities.

  10. We want to start without the assumption that the forms of the underlying densities are known.

  11. These methods are called Non-parametric approaches.

  12. There are two types of non-parametric methods.

  13. Parzen window vs. Nearest neighbor

  14. Parzen window estimates multi-modal P(x|class) from sample patterns.

  15. Nearest neighbor methods try to solve the problem of the unknown “best” window function.

  16. NN algorithms bypass density estimation and go directly to the posterior probability P(class|x).

  17. Place a cell of volume V around x and capture k samples, of which ki turn out to be labeled ωi.

  18. Let x’ be the closest training pattern to a test pattern x; then the NN rule assigns x the label of x’.

  19. If the number of training samples is large, the error rate of the NN rule is never worse than twice the Bayes rate.

  20. The k-NN rule classifies x by assigning it the most frequent label among its k nearest samples, i.e., by majority voting.

  21. Therefore, NN-type methods are called instance-based classifiers; they require no explicit learning phase.

  22. Issues: • the choice of k • the distance measure • feature weighting • scalability: a naive implementation requires a linear scan of all training samples

  23. Nonparametric Approach • All parametric densities are unimodal (have a single local maximum), whereas many practical problems involve multi-modal densities • Nonparametric procedures can be used with arbitrary distributions and without the assumption that the forms of the underlying densities are known • There are two types of nonparametric methods: • estimating the density function P(x | ωj): Parzen window • bypassing density estimation and going directly to the a-posteriori probability P(ωj | x): k-NN
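As a concrete illustration of the first branch (density estimation), here is a minimal Parzen-window sketch in Python with a Gaussian kernel; the function name, the bandwidth h, and the toy bimodal data are illustrative choices, not from the slides.

```python
import numpy as np

def parzen_density(x, samples, h=1.0):
    """Parzen-window estimate of p(x) from 1-D samples with a Gaussian kernel:
    p(x) ~= (1/n) * sum_i (1/h) * phi((x - x_i) / h), phi = standard normal pdf."""
    samples = np.asarray(samples, dtype=float)
    n = samples.size
    u = (x - samples) / h
    phi = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
    return phi.sum() / (n * h)

# Example: a bimodal sample that a single (unimodal) Gaussian would fit poorly.
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-3, 1, 200), rng.normal(3, 1, 200)])
print(parzen_density(0.0, data, h=0.5))   # low density between the two modes
print(parzen_density(3.0, data, h=0.5))   # high density at one of the modes
```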

  24. K-Nearest-Neighbor Estimation Motivation: solve the problem of the unknown “best” window function • Let the cell volume be a function of the training data • Center a cell about x and let it grow until it captures kn samples • These kn samples are called the kn nearest neighbors of x • Two possibilities can occur: • if the density is high near x, the cell will be small, which provides good resolution • if the density is low, the cell will grow large and stop only when it reaches higher-density regions
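A minimal sketch of this k-nearest-neighbor density estimate in one dimension: the interval around x is grown until it contains k samples, and the density is taken as k / (nV). The function name and the toy data are illustrative.

```python
import numpy as np

def knn_density_1d(x, samples, k=10):
    """k-NN density estimate in 1-D: grow an interval around x until it
    contains k samples, then p(x) ~= k / (n * V), where V = 2 * r_k and
    r_k is the distance from x to its k-th nearest sample."""
    samples = np.asarray(samples, dtype=float)
    n = samples.size
    r_k = np.sort(np.abs(samples - x))[k - 1]   # distance to the k-th neighbor
    V = 2.0 * r_k
    return k / (n * V)

rng = np.random.default_rng(0)
data = rng.normal(0, 1, 1000)
print(knn_density_1d(0.0, data, k=25))   # small cell in a dense region
print(knn_density_1d(3.0, data, k=25))   # the cell grows in a sparse region
```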

  25. K-Nearest-Neighbor Estimation Estimate the a-posteriori probabilities P(ωi | x) from a set of n labeled samples • Place a cell of volume V around x and capture k samples; if ki samples among the k turn out to be labeled ωi, then pn(x, ωi) = (ki / n) / V • The resulting estimate is Pn(ωi | x) = ki / k • ki/k is the fraction of the samples within the cell that are labeled ωi • If k is large and the cell sufficiently small, the performance will approach the best possible

  26. Nearest-Neighbor Rule Let Dn = {x1, x2, …, xn} be a set of n labeled prototypes • Let x’ ∈ Dn be the closest prototype to a test point x; then the nearest-neighbor rule for classifying x is to assign it the label associated with x’ • If the number of prototypes is large (unlimited), the error rate of the nearest-neighbor classifier is never worse than twice the Bayes rate • If n → ∞, it is always possible to find x’ sufficiently close so that P(ωi | x’) ≈ P(ωi | x) • The k-NN rule classifies x by assigning it the label most frequently represented among its k nearest samples (majority voting)
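A minimal sketch of the (k-)nearest-neighbor rule itself: a plain linear scan with Euclidean distance and majority voting, where k = 1 gives the NN rule. Function names and the toy data are illustrative, not from the slides.

```python
import numpy as np
from collections import Counter

def knn_classify(x, X_train, y_train, k=3):
    """Classify x by majority vote among its k nearest training samples
    (Euclidean distance, plain linear scan over all prototypes)."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]          # indices of the k closest prototypes
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]        # most frequent label; k = 1 is the NN rule

# Toy 2-D example with two classes.
X_train = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.2]])
y_train = np.array(["w1", "w1", "w2", "w2"])
print(knn_classify(np.array([0.1, 0.2]), X_train, y_train, k=3))   # -> "w1"
```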

  27. Other simple classifiers.

  28. Linear Discriminant Classifier

  29. Linear Classifier 1. In previous classifiers • the underlying probability densities were known (or given) • the training samples were used to estimate the parameters of the probabilities 2. Linear classifier • Instead, we assume the proper forms of the discriminant functions • Use the samples to estimate the values of the parameters of the classifier • They may not be optimal, but they are very simple to use • Attractive candidates for initial, trial classifiers

  30. Linear Discriminant Functions “Finding a linear discriminant function is formulated as a problem of minimizing a criterion function, i.e., the training error.” 1. Definition • It is a function that is a linear combination of the components of x: g(x) = wᵗx + w0, where w is the weight vector and w0 the bias • A two-category classifier with a discriminant function uses the following rule: decide ω1 if g(x) > 0 and ω2 if g(x) < 0 • When g(x) is linear, the decision surface is a hyperplane • Decision regions for a linear machine are convex • This restriction limits the flexibility and accuracy of the classifier • Applicable to unimodal problems
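A minimal sketch of the two-category decision rule g(x) = wᵗx + w0; the weight vector and bias values here are arbitrary illustrations, not from the slides.

```python
import numpy as np

def linear_decide(x, w, w0):
    """Two-category linear discriminant: g(x) = w^T x + w0.
    Decide omega_1 if g(x) > 0, omega_2 if g(x) < 0."""
    g = np.dot(w, x) + w0
    return "omega_1" if g > 0 else "omega_2"

w = np.array([1.0, -2.0])   # illustrative weight vector
w0 = 0.5                    # illustrative bias
print(linear_decide(np.array([2.0, 0.5]), w, w0))   # g = 2 - 1 + 0.5 = 1.5 > 0 -> omega_1
```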

  31. Linear Discriminant Functions 2. Training • Samples x1, …, xn with labels ω1, ω2 • g(x) = wᵗx → find a weight vector w that classifies all of the samples correctly • Decide ω1 if g(x) > 0 and ω2 if g(x) < 0 • After normalization, wᵗx > 0: a solution (separating) vector, which is not unique 3. Gradient descent procedure • Unconstrained optimization problem: find the w that minimizes J(w) • Start with some w(1) and compute the gradient vector ∇J(w(1)) • w(2) is obtained by moving a distance from w(1) in the direction of steepest descent: w(k+1) = w(k) - η(k)∇J(w(k)) • η is the learning rate; if it is too large, the process will overshoot and can diverge • An alternative method: Newton’s method (second-order)
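A minimal sketch of the gradient-descent procedure w(k+1) = w(k) - η(k)∇J(w(k)), here with a constant learning rate and a simple quadratic criterion chosen only for illustration.

```python
import numpy as np

def gradient_descent(grad_J, w_init, eta=0.1, n_iter=100):
    """Generic gradient descent: w(k+1) = w(k) - eta * grad J(w(k)).
    eta is kept constant here; too large a value can overshoot and diverge."""
    w = np.asarray(w_init, dtype=float)
    for _ in range(n_iter):
        w = w - eta * grad_J(w)
    return w

# Illustrative criterion J(w) = ||w - [1, 2]||^2, whose gradient is 2(w - [1, 2]).
grad = lambda w: 2.0 * (w - np.array([1.0, 2.0]))
print(gradient_descent(grad, w_init=[0.0, 0.0], eta=0.1))   # converges toward [1, 2]
```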

  32. Neural Network Classifier, SVMs

  33. Perceptron 1. Perceptron criterion function • Simplest choice: J(w) is the number of samples misclassified by w • An example: J(w) = Σy∈Y (-wᵗy), where Y is the set of samples misclassified by w • J(w) is never negative, and is zero only if w is a solution vector 2. Minimizing the perceptron criterion • the gradient of J(w): ∇J(w) = Σy∈Y (-y) • updating rule: w(k+1) = w(k) + η(k) Σy∈Y y 3. Relaxation procedure • a generalized perceptron training procedure • a broader class of criterion functions and minimization methods • examples: 1) J(w) = Σy∈Y (wᵗy)² → J is continuous and gives a smoother search 2) relaxation with a margin → avoids the useless solution w = 0
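A minimal sketch of the batch perceptron update on "normalized" augmented samples (the ω2 samples already multiplied by -1), assuming a fixed learning rate; the toy data are illustrative.

```python
import numpy as np

def perceptron_train(Y, eta=1.0, n_epochs=100):
    """Batch perceptron on augmented, 'normalized' samples Y (omega_2 samples
    already negated), so a solution vector w satisfies w^T y > 0 for every y.
    Criterion: J(w) = sum over misclassified y of (-w^T y);
    update:    w(k+1) = w(k) + eta * sum over misclassified y of y."""
    w = np.zeros(Y.shape[1])
    for _ in range(n_epochs):
        misclassified = Y[Y @ w <= 0]     # samples currently on the wrong side
        if len(misclassified) == 0:
            break                         # J(w) = 0, so w is a solution vector
        w = w + eta * misclassified.sum(axis=0)
    return w

# Augmented samples [x1, x2, 1]; the two omega_2 samples have been negated.
Y = np.array([[1.0, 1.0, 1.0], [2.0, 0.5, 1.0],       # omega_1
              [1.0, 1.0, -1.0], [2.0, 0.5, -1.0]])    # omega_2, negated
print(perceptron_train(Y))   # a separating vector (here [6., 3., 0.])
```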

  34. Multilayer Neural Networks 1. Classify objects by learning a nonlinearity • There are many problems for which linear discriminants are insufficient • The central difficulty was the choice of the appropriate nonlinear functions • A “brute-force” approach: polynomial functions → too many parameters • No automatic method for determining the nonlinearities 2. Multilayer neural networks or multilayer perceptrons • A layered topology of linear discriminants provides a nonlinear mapping • The form of the nonlinearity is learned from the training data • ‘Backpropagation’ is the most popular learning method • The optimal topology depends on the problem at hand

  35. A Three-Layer Neural Network 1. Net activation • the inner product of the inputs with the weights at the hidden layer • each hidden unit emits an output that is a nonlinear function of its activation • each output unit similarly computes its net activation based on the hidden units • an output unit computes the nonlinear function of its net: zk = f(netk)

  36. A Three-Layer Neural Network 2. An example • net_y1 = x1 + x2 + 0.5 • net_y2 = x1 + x2 - 1.5 • net_z = 0.7*y1 - 0.4*y2 - 1
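A sketch of the forward pass with the example weights above. The slide does not state the nonlinearity; a sign/threshold unit f(net) = ±1 is assumed here, under which these weights reproduce the classic XOR network.

```python
def f(net):
    """Threshold nonlinearity (assumed): +1 if net >= 0, else -1."""
    return 1.0 if net >= 0 else -1.0

def forward(x1, x2):
    """Forward pass of the three-layer network with the slide's weights."""
    y1 = f(x1 + x2 + 0.5)            # hidden unit 1: net_y1 = x1 + x2 + 0.5
    y2 = f(x1 + x2 - 1.5)            # hidden unit 2: net_y2 = x1 + x2 - 1.5
    z = f(0.7 * y1 - 0.4 * y2 - 1)   # output unit:   net_z = 0.7*y1 - 0.4*y2 - 1
    return z

for x in [(-1, -1), (-1, 1), (1, -1), (1, 1)]:
    print(x, forward(*x))   # +1 exactly when the two inputs differ (XOR)
```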

  37. A Three-Layer Neural Network 3. Expressive power • Q: Can every decision be implemented by a three-layer network? • A: Any continuous function can be implemented, given a sufficient number of hidden units • A two-layer network classifier can only implement a linear decision boundary

  38. Backpropagation Algorithm Network learning • Learn the interconnection weights from the training patterns and desired outputs • Compute an error for each hidden unit and derive a learning rule • “Feedforward” and “Learning”
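A minimal backpropagation sketch for a single hidden layer with sigmoid units and squared error, kept to one training pattern; the network sizes, learning rate, and initialization are illustrative assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def backprop_step(x, t, W1, b1, W2, b2, eta=0.5):
    """One feedforward pass plus one backpropagation update for a network with
    a single hidden layer, sigmoid units, and squared error E = 0.5*(t - z)^2."""
    # Feedforward
    net_h = W1 @ x + b1
    y = sigmoid(net_h)                 # hidden activations
    net_o = W2 @ y + b2
    z = sigmoid(net_o)                 # output activation
    # Backward pass: output error, then hidden errors via the chain rule
    delta_o = (z - t) * z * (1 - z)            # dE/dnet_o
    delta_h = (W2.T @ delta_o) * y * (1 - y)   # dE/dnet_h
    # Gradient-descent weight updates (in place)
    W2 -= eta * np.outer(delta_o, y)
    b2 -= eta * delta_o
    W1 -= eta * np.outer(delta_h, x)
    b1 -= eta * delta_h
    return 0.5 * np.sum((t - z) ** 2)

# Toy usage: random weights, a single training pattern, repeated updates.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)
x, t = np.array([1.0, -1.0]), np.array([1.0])
for _ in range(100):
    err = backprop_step(x, t, W1, b1, W2, b2)
print(err)   # the squared error shrinks as the weights are adjusted
```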

  39. Decision Trees

  40. Decision Trees In the previous classifiers, • feature vectors were real-valued numbers and distance measures were available • How can we use nominal data for classification? • How can we efficiently learn categories from such nonmetric data? • Classification of ‘fruits’ based on their color and shape: Apple = (red, small_sphere), Watermelon = (green, big_sphere)
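A sketch of such a sequence of questions written as nested conditionals; the apple and watermelon branches follow the slide, while the melon and grape branches are added only to make the tree complete and are purely illustrative.

```python
def classify_fruit(color, shape):
    """A hand-built decision tree for the fruit example: each node asks for
    the value of one nominal property and follows the matching branch."""
    if shape == "big_sphere":
        if color == "green":
            return "watermelon"
        return "melon"            # illustrative branch, not from the slides
    else:  # small_sphere
        if color == "red":
            return "apple"
        return "grape"            # illustrative branch, not from the slides

print(classify_fruit("red", "small_sphere"))    # -> apple
print(classify_fruit("green", "big_sphere"))    # -> watermelon
```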

  41. Decision Trees 1. Decision tree • it is natural and intuitive to classify a pattern through a sequence of questions • the sequence of questions is displayed in a directed decision tree • classification begins at the root node, which asks for the value of a property of the pattern • the links from a node must be mutually distinct and exhaustive • each leaf node bears a category label • easy interpretation, rapid classification, easy to incorporate prior knowledge • rule-based classification 2. Issues for tree-growing algorithms • how many decision outcomes (splits) will there be at a node? • which property should be tested at a node? • when should a node be declared a leaf?
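For the question of which property to test at a node, here is a minimal information-gain sketch over nominal attributes (a common choice, e.g. in ID3-style tree growing); the toy fruit data are illustrative, not from the slides.

```python
import math
from collections import Counter

def entropy(labels):
    """Impurity of a set of class labels (base-2 entropy)."""
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def information_gain(samples, labels, attribute):
    """Drop in entropy when the node is split on one nominal attribute;
    a tree-growing algorithm would test the attribute with the largest gain."""
    n = len(samples)
    remainder = 0.0
    for value in set(s[attribute] for s in samples):
        subset = [l for s, l in zip(samples, labels) if s[attribute] == value]
        remainder += (len(subset) / n) * entropy(subset)
    return entropy(labels) - remainder

# Toy fruit data as dictionaries of nominal features.
samples = [{"color": "red", "shape": "small_sphere"},
           {"color": "green", "shape": "big_sphere"},
           {"color": "green", "shape": "small_sphere"},
           {"color": "red", "shape": "small_sphere"}]
labels = ["apple", "watermelon", "grape", "apple"]
for attr in ("color", "shape"):
    print(attr, information_gain(samples, labels, attr))   # color has the larger gain
```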

  42. The desire to find reliable answers demands more powerful classification algorithms and a better understanding of the data: Pattern Recognition and Oracle Mining in the spring class.

  43. Imbalance and Sampling • Ensemble, Bootstrapping, Bagging • Cross Validation
