
Part III: The Nearest-Neighbor Classifier

NASA Space Program



Outline

  • Training and test data accuracy

  • The nearest-neighbor classifier



Classification Accuracy

  • Say we have N feature vectors

  • Say we know the true class label for each feature vector

  • We can measure how accurate a classifier is by how many feature vectors it classifies correctly

  • Accuracy = percentage of feature vectors correctly classified

    • training accuracy = accuracy on training data

    • test accuracy = accuracy on new data not used in training
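
As a concrete illustration (a sketch, not part of the original slides), accuracy can be computed in MATLAB by comparing a vector of predicted labels against the true labels; the variable names here are illustrative only:

  % predicted:  N x 1 vector of class labels output by some classifier
  % truelabels: N x 1 vector of the true class labels
  accuracy = 100 * sum(predicted == truelabels) / length(truelabels);   % percent correct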



Training Data and Test Data

  • Training data

    • labeled data used to build a classifier

  • Test data

    • new data, not used in the training process, to evaluate how well a classifier does on new data

  • Memorization versus Generalization

    • better training accuracy

      • “memorizing” the training data

    • better test accuracy

      • “generalizing” to new data

    • in general, we would like our classifier to perform well on new test data, not just on the training data

      • i.e., we would like it to generalize to new data



Examples of Training and Test Data

  • Speech Recognition

    • Training data: words recorded and labeled in a laboratory

    • Test data: words recorded from new speakers, new locations

  • Zipcode Recognition

    • Training data: zipcodes manually selected, scanned, labeled

    • Test data: actual letters being scanned in a post office

  • Credit Scoring

    • Training data: historical database of loan applications with payment history or decision at that time

    • Test data: you



Some Notation

  • Training Data

    • Dtrain = { [x(1), c(1)], [x(2), c(2)], … , [x(N), c(N)] }

    • N pairs of feature vectors and class labels

  • Feature Vectors and Class Labels:

    • x(i) is the ith training data feature vector

    • in MATLAB this could be the ith row of an N x d matrix

    • c(i) is the class label of the ith feature vector

    • in general, c(i) can take m different class values, e.g., c = 1, c = 2, ...

    • Let y be a new feature vector whose class label we do not know, i.e., we wish to classify it.
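
For concreteness, a minimal MATLAB sketch of how Dtrain might be stored using the N x d convention above (the numbers are invented purely for illustration):

  % N = 4 training feature vectors in d = 2 dimensions; row i of X is x(i)
  X = [1.0 2.0;
       1.5 1.8;
       3.0 3.2;
       2.8 3.5];
  % c(i) is the class label of the ith feature vector (here m = 2 classes)
  c = [1; 1; 2; 2];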



Nearest Neighbor Classifier

  • y is a new feature vector whose class label is unknown

  • Search Dtrain for the closest feature vector to y

    • let this “closest feature vector” be x(j)

  • Classify y with the same label as x(j), i.e.

    • y is assigned label c(j)

  • How are “closest x” vectors determined?

    • typically use minimum Euclidean distance

      • dE(x, y) = sqrt( Σ_i (x_i - y_i)^2 )

  • Side note: this produces a “Voronoi tessellation” of the d-dimensional feature space

    • each point “claims” a cell surrounding it

    • cell boundaries are polygons

  • Analogous to “memory-based” reasoning in humans
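
A minimal MATLAB sketch of this rule for a single query vector y (a sketch only; X and c are the training feature matrix and label vector in the notation above):

  % Euclidean distance from y (1 x d) to every row of X (N x d)
  dists = sqrt(sum((X - repmat(y, size(X, 1), 1)).^2, 2));
  % index j of the closest training feature vector x(j)
  [~, j] = min(dists);
  % y is assigned the same class label as x(j)
  predicted_label = c(j);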


    Geometric Interpretation of Nearest Neighbor

    [Figure: scatter plot of class 1 and class 2 training points in the Feature 1 / Feature 2 plane]


    Regions for Nearest Neighbors

    Each data point defines a “cell” of space that is closest to it. All points within that cell are assigned that class.

    [Figure: the class 1 and class 2 points with the cells they induce in the Feature 1 / Feature 2 plane]


    Nearest Neighbor Decision Boundary

    Overall decision boundary = union of cell boundaries where the class decision is different on each side.

    [Figure: the nearest-neighbor decision boundary drawn along the cell boundaries between class 1 and class 2 points; axes are Feature 1 and Feature 2]


    How should the new point be classified?

    [Figure: the class 1 / class 2 scatter plot with a new, unlabeled point marked “?”; axes are Feature 1 and Feature 2]


    Local Decision Boundaries

    Boundary? Points that are equidistant between points of class 1 and 2.

    Note: locally the boundary is
    (1) linear (because of the Euclidean distance)
    (2) halfway between the 2 class points
    (3) at right angles to the connector

    [Figure: the local boundary segment between a class 1 point and a class 2 point near the query point “?”]
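
    To see why the local boundary has these three properties, take one training point a of class 1 and one training point b of class 2; the boundary consists of the points x that are equidistant from both (a short algebra sketch using the Euclidean distance defined earlier):

    ||x - a||^2 = ||x - b||^2
    ||x||^2 - 2 a·x + ||a||^2 = ||x||^2 - 2 b·x + ||b||^2
    2 (b - a)·x = ||b||^2 - ||a||^2

    This is a linear equation in x, so locally the boundary is a straight line (a hyperplane in d dimensions); it passes through the midpoint (a + b)/2 and its normal direction is b - a, i.e., it lies halfway between the two points and at right angles to the line connecting them.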


    Finding the Decision Boundaries

    [Figure: the local boundary segments are constructed one at a time between neighboring class 1 and class 2 points around the query point “?”, built up over three successive slides]


    Overall Boundary = Piecewise Linear

    [Figure: the union of the local segments gives a piecewise linear overall boundary, separating a decision region for class 1 from a decision region for class 2; axes are Feature 1 and Feature 2]


    Geometric Interpretation of kNN (k=1)

    [Figure: the same class 1 / class 2 scatter plot with the query point “?”, illustrating the k = 1 case]


    More Data Points

    [Figure: a larger scatter plot with many interleaved class 1 and class 2 points in the Feature 1 / Feature 2 plane]


    More Complex Decision Boundary

    In general: the nearest-neighbor classifier produces piecewise linear decision boundaries.

    [Figure: with more data points, the piecewise linear nearest-neighbor decision boundary becomes considerably more complex; axes are Feature 1 and Feature 2]



    K-Nearest Neighbor (kNN) Classifier

    • Find the k-nearest neighbors to y in Dtrain

      • i.e., rank the feature vectors according to Euclidean distance

      • select the k vectors which have smallest distance to y

    • Classification

      • ranking yields k feature vectors and a set of k class labels

      • pick the class label which is most common in this set (“vote”)

      • classify y as belonging to this class
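
    A minimal MATLAB sketch of this ranking-and-vote step for a single query vector y (a sketch only; X and c are the training feature matrix and label vector from the notation slide, and k is assumed odd):

    % Euclidean distances from y to all training vectors, ranked ascending
    dists = sqrt(sum((X - repmat(y, size(X, 1), 1)).^2, 2));
    [~, order] = sort(dists, 'ascend');
    % class labels of the k nearest neighbors
    kclasses = c(order(1:k));
    % majority vote: the most common class label among the k neighbors
    predicted_label = mode(kclasses);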



    K-Nearest Neighbor (kNN) Classifier

    • Theoretical Considerations

      • as k increases

        • we are averaging over more neighbors

        • the effective decision boundary is more “smooth”

      • as N increases, the optimal k value tends to increase in proportion to log N



    K-Nearest Neighbor (kNN) Classifier

    • Notes:

      • In effect, the classifier uses the nearest k feature vectors from Dtrain to “vote” on the class label for y

      • the single-nearest neighbor classifier is the special case of k=1

      • for two-class problems, if we choose k to be odd (i.e., k=1, 3, 5,…) then there will never be any “ties”

      • “training” is trivial for the kNN classifier, i.e., we just use Dtrain as a “lookup table” when we want to classify a new feature vector



    K-Nearest Neighbor (kNN) Classifier

    • Extensions of the Nearest Neighbor classifier

      • weighted distances (a small sketch follows this list)

        • e.g., if some features are more important than others

        • e.g., if some features are irrelevant

      • fast search techniques (indexing) to find k-nearest neighbors in d-space
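
    A small sketch of the weighted-distance idea mentioned above (an illustration under assumed variable names, not part of the assignment): each feature gets a non-negative weight, large for important features and zero for irrelevant ones.

    % x, y: 1 x d feature vectors;  w: 1 x d vector of non-negative feature weights
    % w(i) large => feature i counts more;  w(i) = 0 => feature i is ignored
    dW = sqrt(sum(w .* (x - y).^2));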



    Accuracy on Training Data versus Test Data

    Training Accuracy = (1/n) Σ_{i in Dtrain} I( o(i), c(i) )

    where I( o(i), c(i) ) = 1 if o(i) = c(i), and 0 otherwise,

    where o(i) is the output of the classifier for training feature vector x(i)

    and c(i) is the true class for training data vector x(i).

    Let Dtest be a set of new data, unseen in the training process, but

    assume that Dtest is generated by the same “mechanism” that generated Dtrain:

    Test Accuracy = (1/ntest) Σ_{j in Dtest} I( o(j), c(j) )

    Test Accuracy is what we are really interested in; unfortunately,

    test accuracy is usually lower on average than training accuracy.



    Test Accuracy and Generalization

    • The accuracy of our classifier on new unseen data is a fair/honest assessment of the performance of our classifier

    • Why is training accuracy not good enough?

      • Training accuracy is optimistic

      • a classifier like nearest-neighbor can construct boundaries which always separate all training data points, but which do not separate new points

        • e.g., what is the training accuracy of kNN, k = 1?

      • A flexible classifier can “overfit” the training data

        • in effect it just memorizes the training data, but does not learn the general relationship between x and C



    Test Accuracy and Generalization

    • Generalization

      • We are really interested in how our classifier generalizes to new data

      • test data accuracy is a good estimate of generalization performance



    Assignment

    • 3 parts

      • classplot: Plot classification data in two dimensions

      • knn: Implement a nearest-neighbor classifier

      • knn_test: Test the effect of the value k on the accuracy of the classifier

    • Test data

      • knnclassifier.mat



    Plotting Function

    function classplot(data, x, y);

    % function classplot(data, x, y);

    %

    % brief description of what the function does

    % ......

    % Your Name, ICS 175A, date

    %

    % Inputs

    % data: (a structure with the same fields as described above:

    % your comment header should describe the structure explicitly)

    % Note that if you are only using certain fields in the structure

    % in the function below, you need only define these fields in the input comments

    -------- Your code goes here -------
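
    One possible sketch of classplot (hypothetical: the real field names of the data structure are given in the assignment handout, not in these slides; here data.features is assumed to be an N x d feature matrix and data.labels an N x 1 label vector):

    function classplot(data, x, y)
    % classplot(data, x, y): plot feature x versus feature y, one marker per class
    % NOTE: field names data.features and data.labels are assumed, not given in the slides
    labels = unique(data.labels);
    markers = {'o', 'x', 's', 'd', '^', 'v'};
    hold on;
    for i = 1:length(labels)
        idx = (data.labels == labels(i));                 % points belonging to class labels(i)
        m = markers{mod(i - 1, length(markers)) + 1};     % cycle through marker styles
        plot(data.features(idx, x), data.features(idx, y), m);
    end
    hold off;
    xlabel(sprintf('Feature %d', x));
    ylabel(sprintf('Feature %d', y));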



    Nearest Neighbor Classifier

    function [class_predictions] = knn(traindata,trainlabels,k, testdata)

    % function [class_predictions] = knn(traindata,trainlabels,k, testdata)

    %

    % a brief description of what the function does

    % ......

    % Your Name, ICS 175A, date

    %

    % Inputs

    % traindata: an N1 x d matrix of feature data (the "memory" for kNN)

    % trainlabels: an N1 x 1 vector of class labels for traindata

    % k: an odd positive integer indicating the number of neighbors to use

    % testdata: an N2 x d matrix of feature data for testing the knn classifier

    %

    % Outputs

    % class_predictions: N2 x 1 vector of predicted class values

    -------- Pseudocode -------

    Read in the training data set Dtrain
    y = feature vector to be classified
    kneighbors = k-nearest neighbors to y in Dtrain
    kclasses = class values of the kneighbors
    kvote = the most common target value in kclasses
    predicted_class(y) = kvote



    Accuracy of kNN Classifier as k is varied

    function [accuracies] = knn_test(traindata,trainlabels, testdata, testlabels,kmax,plotflag)

    % function [accuracies] = knn_test(traindata,trainlabels, testdata, testlabels,kmax,plotflag)

    %

    % a brief description of what the function does

    % ......

    % Your Name, ICS 175A, date

    %

    % Inputs

    % traindata: an N1 x d matrix of feature data (the "memory" for kNN)

    % trainlabels: an N1 x 1 vector of class labels for traindata

    % testdata: an N2 x d matrix of feature data for testing the knn classifier

    % testlabels: an N2 x 1 vector of class labels for testdata

    % kmax: an odd positive integer indicating the maximum number of neighbors

    % plotflag: (optional argument) if 1, the accuracy versus k is plotted,

    % otherwise no plot.

    %

    % Outputs

    % accuracies: r x 1 vector of accuracies on testdata, where r is the

    % number of values of k that are tested.

    -------- Pseudocode -------

    Read in the training data set Dtrain and the test data set Dtest

    For k = 1, 3, 5, …, kmax (odd values only)
        classify each point in Dtest using the k nearest neighbors in Dtrain

    accuracy_k = 100 * (number of points in Dtest correctly classified) / (number of points in Dtest)

    end
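
    One possible MATLAB sketch matching this pseudocode (a sketch only; it calls the knn function specified on the previous slide, and it returns accuracies, as the comment header requires, rather than error rates):

    function [accuracies] = knn_test(traindata, trainlabels, testdata, testlabels, kmax, plotflag)
    kvalues = 1:2:kmax;                          % odd values of k: 1, 3, 5, ..., kmax
    accuracies = zeros(length(kvalues), 1);
    for r = 1:length(kvalues)
        predictions = knn(traindata, trainlabels, kvalues(r), testdata);
        accuracies(r) = 100 * sum(predictions == testlabels) / length(testlabels);
    end
    if nargin > 5 && plotflag == 1               % plotflag is optional
        plot(kvalues, accuracies, 'o-');
        xlabel('k');
        ylabel('test accuracy (%)');
    end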



    Summary

    • Important Concepts

      • classification is an important component in intelligent systems

      • a classifier = a mapping from feature space to a class label

        • decision boundaries = boundaries between classes

      • classification learning

        • using training data to define a classifier

      • the nearest-neighbor classifier

      • training accuracy versus test accuracy

