Part III The Nearest-Neighbor Classifier

NASA Space Program

- Training and test data accuracy
- The nearest-neighbor classifier

- Say we have N feature vectors
- Say we know the true class label for each feature vector
- We can measure how accurate a classifier is by how many feature vectors it classifies correctly
- Accuracy = percentage of feature vectors correctly classified
- training accuracy = accuracy on training data
- test accuracy = accuracy on new data not used in training

- Training data
- labeled data used to build a classifier

- Test data
- new data, not used in the training process, to evaluate how well a classifier does on new data

- Memorization versus Generalization
- better training accuracy
- “memorizing” the training data

- better test accuracy
- “generalizing” to new data

- in general, we would like our classifier to perform well on new test data, not just on training data,
- i.e., we would like it to generalize to new data

- Speech Recognition
- Training data: words recorded and labeled in a laboratory
- Test data: words recorded from new speakers, new locations

- Zipcode Recognition
- Training data: zipcodes manually selected, scanned, labeled
- Test data: actual letters being scanned in a post office

- Credit Scoring
- Training data: historical database of loan applications with payment history or decision at that time
- Test data: you

- Training Data
- Dtrain = { [x(1), c(1)], [x(2), c(2)], …, [x(N), c(N)] }
- N pairs of feature vectors and class labels

- Feature Vectors and Class Labels:
- x(i) is the ith training data feature vector
- in MATLAB this could be the ith row of an N x d matrix
- c(i) is the class label of the ith feature vector
- in general, c(i) can take m different class values, e.g., c = 1, c = 2, ...
- Let y be a new feature vector whose class label we do not know, i.e., we wish to classify it.
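As a minimal sketch of this notation in MATLAB, the training set can be held as an N x d matrix plus an N x 1 label vector. The variable names `X` and `c` and the toy values are assumptions made for illustration; they are not from the slides.

```matlab
% Hypothetical toy training set: N = 4 feature vectors in d = 2 dimensions.
X = [1.0 2.0;    % x(1)
     1.5 1.8;    % x(2)
     5.0 8.0;    % x(3)
     6.0 9.0];   % x(4)
c = [1; 1; 2; 2];        % c(i) is the class label of row i of X

[N, d] = size(X);        % N feature vectors, each with d features
x3 = X(3, :);            % the 3rd feature vector, a 1 x d row
c3 = c(3);               % its class label
```

Storing one feature vector per row keeps `X(i, :)` and `c(i)` aligned, so the pair [x(i), c(i)] is always recovered with the same index i.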

- y is a new feature vector whose class label is unknown
- Search Dtrain for the closest feature vector to y
- let this “closest feature vector” be x(j)

- Classify y with the same label as x(j), i.e.
- y is assigned label c(j)

- How are “closest x” vectors determined?
- typically use minimum Euclidean distance
- $d_E(x, y) = \sqrt{\sum_{i=1}^{d} (x_i - y_i)^2}$

- Side note: this produces a “Voronoi tessellation” of the d-space
- each point “claims” a cell surrounding it
- cell boundaries are piecewise linear (in two dimensions, each cell is a convex polygon)
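The single nearest-neighbor rule described above can be sketched in a few lines of MATLAB. The data values and the names `X`, `c`, `y`, `dists`, and `label` are hypothetical; `bsxfun` is used so the subtraction also works in older MATLAB versions, while recent versions accept `X - y` directly.

```matlab
% Hypothetical toy data (not from the slides).
X = [1.0 2.0; 1.5 1.8; 5.0 8.0; 6.0 9.0];   % N x d training feature vectors
c = [1; 1; 2; 2];                           % N x 1 class labels
y = [5.2 8.1];                              % 1 x d vector to classify

diffs = bsxfun(@minus, X, y);               % row i holds x(i) - y
dists = sqrt(sum(diffs .^ 2, 2));           % N x 1 Euclidean distances to y
[dmin, j] = min(dists);                     % j indexes the closest vector x(j)
label = c(j);                               % y is assigned the label c(j)
```

Here the closest training vector is the third row, so `y` receives class 2. If two training vectors are exactly equidistant, `min` returns the first one, which is one arbitrary way of breaking the tie.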

[Figure: training points labeled class 1 or class 2, plotted in the Feature 1 / Feature 2 plane]

Each data point defines a “cell” of space that is closest to it. All points within that cell are assigned that class

[Figure: class 1 and class 2 training points, Feature 1 versus Feature 2]

Overall decision boundary = union of cell boundaries where the class decision is different on each side

[Figure: class 1 and class 2 training points with a query point “?”, Feature 1 versus Feature 2]

Boundary? Points that are equidistant between points of class 1 and 2

Note: locally the boundary is
(1) linear (because of Euclidean distance)
(2) halfway between the 2 class points
(3) at right angles to the line connecting them
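These three properties follow from a short calculation. As a sketch, let a be a class 1 training point and b a class 2 training point (the symbols a and b are introduced here for illustration and do not appear in the slides). The points y that are equidistant from both satisfy:

```latex
\begin{aligned}
\|y - a\|^2 &= \|y - b\|^2 \\
\|y\|^2 - 2\,a \cdot y + \|a\|^2 &= \|y\|^2 - 2\,b \cdot y + \|b\|^2 \\
2\,(b - a) \cdot y &= \|b\|^2 - \|a\|^2 \\
(b - a) \cdot \left( y - \tfrac{a + b}{2} \right) &= 0
\end{aligned}
```

The last line is linear in y (property 1), is satisfied by the midpoint (a + b)/2 (property 2), and has normal vector b - a, so the boundary is perpendicular to the segment joining a and b (property 3).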

[Figures: the same plot with a query point “?”, with the Decision Region for Class 1 and the Decision Region for Class 2 labeled]

[Figure: a larger set of class 1 and class 2 training points, Feature 1 versus Feature 2]

In general: the nearest-neighbor classifier produces piecewise linear decision boundaries

- Find the k-nearest neighbors to y in Dtrain
- i.e., rank the feature vectors according to Euclidean distance
- select the k vectors which have smallest distance to y

- Classification
- ranking yields k feature vectors and a set of k class labels
- pick the class label which is most common in this set (“vote”)
- classify y as belonging to this class
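The ranking and voting steps above might look as follows in MATLAB. This is a sketch under assumed toy data, not the course solution; the names `kclasses` and `kvote` are borrowed from the pseudocode later in the deck, and everything else is hypothetical.

```matlab
% Hypothetical toy data (not from the slides).
X = [1.0 2.0; 1.5 1.8; 2.0 2.2; 5.0 8.0; 6.0 9.0];  % N x d training vectors
c = [1; 1; 1; 2; 2];                                % N x 1 class labels
y = [3.5 5.0];                                      % 1 x d vector to classify
k = 3;                                              % number of neighbors

diffs = bsxfun(@minus, X, y);            % row i holds x(i) - y
dists = sqrt(sum(diffs .^ 2, 2));        % Euclidean distance to each x(i)
[sorted_dists, order] = sort(dists);     % rank by increasing distance
kclasses = c(order(1:k));                % labels of the k nearest neighbors
kvote = mode(kclasses);                  % most common label among them
```

With these values the three nearest neighbors carry labels 1, 2, 1, so the vote assigns class 1. Note that `mode` returns the smallest value when several labels are equally common, which matters only when ties are possible.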

- Theoretical Considerations
- as k increases
- we are averaging over more neighbors
- the effective decision boundary is more “smooth”

- as N increases, the optimal k value tends to increase in proportion to log N

- Notes:
- In effect, the classifier uses the nearest k feature vectors from Dtrain to “vote” on the class label for y
- the single-nearest neighbor classifier is the special case of k=1
- for two-class problems, if we choose k to be odd (i.e., k=1, 3, 5,…) then there will never be any “ties”
- “training” is trivial for the kNN classifier, i.e., we just use Dtrain as a “lookup table” when we want to classify a new feature vector

- Extensions of the Nearest Neighbor classifier
- weighted distances
- e.g., if some of the features are more important
- e.g., if some features are irrelevant

- fast search techniques (indexing) to find k-nearest neighbors in d-space
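One common way to weight a distance, shown here only as an illustrative assumption since the slides do not give a formula, is to multiply each squared feature difference by a non-negative weight. In the sketch below the weight vector `w` and the data are hypothetical; a weight of 0 removes an irrelevant feature entirely.

```matlab
% Hypothetical toy data: feature 2 is assumed to be irrelevant noise.
X = [1.0 100; 2.0 0];     % two training vectors, one per row
c = [1; 2];               % their class labels
y = [1.1 0];              % vector to classify
w = [1 0];                % assumed weights: keep feature 1, ignore feature 2

diffs = bsxfun(@minus, X, y);                             % x(i) - y per row
d_plain = sqrt(sum(diffs .^ 2, 2));                       % unweighted distance
d_weighted = sqrt(sum(bsxfun(@times, diffs .^ 2, w), 2)); % weighted distance
[~, j_plain] = min(d_plain);        % nearest neighbor without weights
[~, j_weighted] = min(d_weighted);  % nearest neighbor with weights
```

Without weights the large value of feature 2 dominates and the second training vector is nearest; with feature 2 weighted to zero the first training vector becomes nearest, so the predicted class changes from `c(2)` to `c(1)`.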

Training Accuracy = $\frac{1}{N} \sum_{i \in D_{train}} I(o(i), c(i))$

where N is the number of training feature vectors,

I( o(i), c(i) ) = 1 if o(i) = c(i), and 0 otherwise,

o(i) is the output of the classifier for training feature vector x(i),

and c(i) is the true class for training feature vector x(i).

Let Dtest be a set of new data, unseen in the training process, but assume that Dtest is generated by the same “mechanism” that generated Dtrain:

Test Accuracy = $\frac{1}{N_{test}} \sum_{j \in D_{test}} I(o(j), c(j))$

Test accuracy is what we are really interested in. Unfortunately, test accuracy is usually lower on average than training accuracy.
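The accuracy formula is an average of indicator values, which is short to express in MATLAB. The label vectors below are made-up values for illustration; the same two lines apply to training or test data, depending on which set `o` and `c` come from.

```matlab
c = [1; 2; 2; 1; 2];        % true class labels c(i) (hypothetical values)
o = [1; 2; 1; 1; 2];        % classifier outputs o(i) (hypothetical values)

correct = (o == c);         % indicator I(o(i), c(i)): 1 where they match
accuracy = mean(correct);   % (1/N) * sum of the indicators
accuracy_pct = 100 * accuracy;   % the same quantity as a percentage
```

Four of the five outputs match the true labels here, so `accuracy` is 0.8.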

- The accuracy of our classifier on new unseen data is a fair/honest assessment of the performance of our classifier
- Why is training accuracy not good enough?
- Training accuracy is optimistic
- a classifier like nearest-neighbor can construct boundaries which always separate all training data points, but which do not separate new points
- e.g., what is the training accuracy of kNN, k = 1?
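To explore the question about k = 1, the sketch below classifies each training point using the training set itself. The data are hypothetical, and the result relies on the assumption that no two training vectors are identical with different labels: each point is then its own nearest neighbor at distance 0.

```matlab
% Hypothetical training data; the last point is a class 2 point
% sitting among class 1 points (for example, a noisy label).
X = [1.0 2.0; 1.5 1.8; 5.0 8.0; 6.0 9.0; 1.2 2.1];
c = [1; 1; 2; 2; 2];

N = size(X, 1);
o = zeros(N, 1);                          % classifier outputs o(i)
for i = 1:N
    diffs = bsxfun(@minus, X, X(i, :));   % differences to training point i
    dists = sqrt(sum(diffs .^ 2, 2));     % distance 0 occurs at row i itself
    [~, j] = min(dists);                  % so the nearest neighbor is j = i
    o(i) = c(j);
end
train_accuracy = mean(o == c);            % fraction classified correctly
```

The training accuracy comes out as 1 (that is, 100%) even though the last point looks like noise, which is why training accuracy alone says little about performance on new data.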

- A flexible classifier can “overfit” the training data
- in effect it just memorizes the training data, but does not learn the general relationship between x and c

- Generalization
- We are really interested in how our classifier generalizes to new data
- test data accuracy is a good estimate of generalization performance

- 3 parts
- classplot: Plot classification data in two dimensions
- knn: Implement a nearest-neighbor classifier
- knn_test: Test the effect of the value k on the accuracy of the classifier

- Test data
- knnclassifier.mat

function classplot(data, x, y);
% function classplot(data, x, y);
%
% brief description of what the function does
% ......
% Your Name, ICS 175A, date
%
% Inputs
%   data: (a structure with the same fields as described above:
%   your comment header should describe the structure explicitly)
%   Note that if you are only using certain fields in the structure
%   in the function below, you need only define these fields in the input comments

% -------- Your code goes here -------

function [class_predictions] = knn(traindata,trainlabels,k, testdata)
% function [class_predictions] = knn(traindata,trainlabels,k, testdata)
%
% a brief description of what the function does
% ......
% Your Name, ICS 175A, date
%
% Inputs
%   traindata: an N1 x d matrix of feature data (the "memory" for kNN)
%   trainlabels: an N1 x 1 vector of class labels for traindata
%   k: an odd positive integer indicating the number of neighbors to use
%   testdata: an N2 x d matrix of feature data for testing the knn classifier
%
% Outputs
%   class_predictions: N2 x 1 vector of predicted class values

-------- Pseudocode -------

Read in the training data set Dtrain
y = feature vector to be classified
kneighbors = k-nearest neighbors to y in Dtrain
kclasses = class values of the kneighbors
kvote = the most common target value in kclasses
predicted_class(y) = kvote

function [accuracies] = knn_test(traindata,trainlabels, testdata, testlabels,kmax,plotflag)
% function [accuracies] = knn_test(traindata,trainlabels, testdata, testlabels,kmax,plotflag)
%
% a brief description of what the function does
% ......
% Your Name, ICS 175A, date
%
% Inputs
%   traindata: an N1 x d matrix of feature data (the "memory" for kNN)
%   trainlabels: an N1 x 1 vector of class labels for traindata
%   testdata: an N2 x d matrix of feature data for testing the knn classifier
%   testlabels: an N2 x 1 vector of class labels for testdata
%   kmax: an odd positive integer indicating the maximum number of neighbors
%   plotflag: (optional argument) if 1, the accuracy versus k is plotted,
%             otherwise no plot.
%
% Outputs
%   accuracies: r x 1 vector of accuracies on testdata, where r is the
%               number of values of k that are tested.

-------- Pseudocode -------

Read in the training data set Dtrain, and Dtest

For k = 1, 3, 5, ... kmax (odd numbers)
    classify each point in Dtest using the k nearest neighbors in Dtrain
    error_k = 100*(number of points incorrectly classified)/(number of points in Dtest)
    accuracy_k = 100 - error_k
end
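The loop over odd values of k can be sketched as a self-contained script. This is an illustration under assumed toy data, not the assignment solution: the variable names `Xtrain`, `ctrain`, `Xtest`, and `ctest` are hypothetical, and the neighbors are always drawn from the training set.

```matlab
% Hypothetical toy data; the last training point is a deliberately
% "noisy" class 2 point placed near the class 1 region.
Xtrain = [1.0 2.0; 1.5 1.8; 2.0 2.2; 5.0 8.0; 6.0 9.0; 5.5 8.5; 3.1 4.1];
ctrain = [1; 1; 1; 2; 2; 2; 2];
Xtest  = [1.2 1.9; 5.8 8.8; 3.0 4.0];
ctest  = [1; 2; 1];

kmax = 5;
kvalues = 1:2:kmax;                        % odd values of k: 1, 3, 5
accuracies = zeros(numel(kvalues), 1);     % one test accuracy per k
for r = 1:numel(kvalues)
    k = kvalues(r);
    predictions = zeros(size(Xtest, 1), 1);
    for t = 1:size(Xtest, 1)
        diffs = bsxfun(@minus, Xtrain, Xtest(t, :));
        dists = sqrt(sum(diffs .^ 2, 2));  % distances to all training points
        [~, order] = sort(dists);          % rank training points by distance
        predictions(t) = mode(ctrain(order(1:k)));   % majority vote
    end
    accuracies(r) = 100 * mean(predictions == ctest); % percent correct
end
```

On this toy data the third test point is misclassified at k = 1 because its single nearest neighbor is the noisy point, while k = 3 and k = 5 outvote it; a plot of `accuracies` against `kvalues` would show that effect.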

- Important Concepts
- classification is an important component in intelligent systems
- a classifier = a mapping from feature space to a class label
- decision boundaries = boundaries between classes

- classification learning
- using training data to define a classifier

- the nearest-neighbor classifier
- training accuracy versus test accuracy