
Nearest neighbor (NN), kNN and Bayes classifier



  1. Nearest neighbor (NN), kNN and Bayes classifier Guoqing Tang June 10, 2019

  2. Outline of the Presentation • Feature spaces and measures of similarity • Nearest neighbor classifier: k-nearest neighbors • Bayes’ theorem & Bayesian prediction • Bayes classifier: naïve Bayes classifier

  3. Similarity-Based Learning • Similarity-based approaches to machine learning come from the idea that: • Similar examples have similar labels, and • New examples should be classified like similar training examples. • The fundamental concepts required to build a model based on this idea are: • Feature spaces, and • Measures of similarity • Algorithm: • Given some new example x for which we need to predict its class y • Find the most similar training examples • Classify x “like” these most similar examples • Questions: • How do we determine similarity? • How many similar training examples should we consider? • How do we resolve inconsistencies among the training examples?

  4. An Example • One day in 1798, after an expedition up the Hawkesbury River in New South Wales, a sailor named Jim Smith from the expedition told his boss, Lt. Col. David Collins, that he had seen a strange animal near the river. • Lt. Col. Collins asked Jim to describe the animal, and Jim explained that he hadn’t seen it very well because, as he approached it, the animal growled at him, so he didn’t get too close. But he did notice that the animal had webbed feet and a duck-billed snout. • Based on Jim’s description, Lt. Col. Collins decided that he needed to classify the animal so that he could determine whether it was dangerous to approach. • He did this by thinking about the animals he could remember coming across before and comparing the features of those animals with the features Jim described to him. • The figure below illustrates this process by listing some of the animals he had encountered before and how they compared with the growling, web-footed, duck-billed animal that Jim described.

  5. An Example • For each animal, Collins counted how many features it had in common with the unknown animal. • At the end of this process, Collins decided that the unknown animal was most likely a duck. • A duck, no matter how strange, is not a dangerous animal, so Lt. Col. Collins told his men to get ready for another expedition up the river the next day. • Takeaway: if you are trying to make a prediction for a current situation, you should search your memory to find situations similar to the current one, and make a prediction based on what was true for the most similar situation in your memory.

  6. Feature Space • As the above example illustrates, a key component of the similarity-based approach to prediction is defining a computational measure of similarity between instances. • Often this measure of similarity is some form of distance measure. • If we want to compute distances between instances, we need the concept of a feature space for the representation of the domain used by our similarity-based model. • We formally define a feature space as an abstract m-dimensional space obtained by: • making each descriptive feature in a dataset an axis of an m-dimensional coordinate system; and • mapping each instance in the dataset to a point in this coordinate system based on the values of its descriptive features.

  7. An Example of Feature Space • The following table lists an example dataset containing two descriptive features, Speed and Agility ratings for college athletes (both rated out of 10), and a target feature Draft indicating whether the athletes were drafted to a professional team. • We can represent this dataset in a feature space by taking each of the descriptive features to be an axis of a coordinate system, as illustrated in the figure on the next slide.

  8. An Example of Feature Space • Can place each instance within the feature space based on the values of its descriptive features. • Figure on the right is a scatter plot to illustrate the feature space where the x-axis is the Speed and the y-axis is the Agility • The value of the Draft feature is indicated by the shape representing each instance as a point in the feature space: a triangle ∆ for no and a plus sign + for yes. Source: FMLPDA by JD Kelleher
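As a rough illustration of this kind of feature-space plot, the following MATLAB sketch builds a Speed/Agility scatter plot. Since the athlete table itself is not reproduced in this transcript, the rating values and draft labels below are hypothetical placeholders.

    % Sketch: 2-D feature space for a Speed/Agility dataset.
    % NOTE: the numeric values and labels are hypothetical placeholders,
    % not the dataset from the slide.
    speed   = [2.5 3.75 2.25 3.25 6.25 6.75 5.5 6.5];
    agility = [6.0 8.0  5.5  8.25 9.5  4.0  4.5 8.5];
    draft   = {'no' 'no' 'no' 'no' 'yes' 'no' 'no' 'yes'};   % target feature

    gscatter(speed, agility, draft, 'br', '^+');   % triangle = no, plus = yes
    xlabel('Speed'); ylabel('Agility');
    title('Feature space for the athlete dataset');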

  9. Measure of Similarity • The simplest way to measure the similarity between two instances, a and b, in a dataset is to measure the distance between the instances in a feature space. • We can use a distance metric to do this: a metric d(a,b) is a function that returns the distance between two instances a and b. • Mathematically, a metric must satisfy the following four conditions: • Non-negativity: d(a,b) ≥ 0 • Identity: d(a,b) = 0 iff a = b • Symmetry: d(a,b) = d(b,a) • Triangle inequality: d(a,b) ≤ d(a,c) + d(c,b). • One of the best-known distance metrics is the Euclidean distance, which computes the length of the straight line between two points.

  10. Distance Metrics • Euclidean distance: d(a,b) = [(a1 − b1)² + (a2 − b2)² + … + (am − bm)²]^(1/2) • Manhattan distance: d(a,b) = |a1 − b1| + |a2 − b2| + … + |am − bm| • Minkowski distance: d(a,b) = [|a1 − b1|^p + |a2 − b2|^p + … + |am − bm|^p]^(1/p) • When p = 1, the Minkowski distance is the Manhattan distance, and when p = 2 it is the Euclidean distance. • Although there are infinitely many Minkowski-based distance metrics to choose from, the Euclidean distance and the Manhattan distance are the most commonly used.
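A minimal MATLAB sketch of the three metrics above, for two hypothetical two-dimensional instances a and b:

    % Sketch: Euclidean, Manhattan and Minkowski distances between two
    % hypothetical feature vectors.
    a = [5 3];
    b = [2 7];

    euclidean = sqrt(sum((a - b).^2));       % straight-line distance (= 5 here)
    manhattan = sum(abs(a - b));             % city-block distance    (= 7 here)
    p = 4;                                   % any p >= 1 gives a Minkowski metric
    minkowski = sum(abs(a - b).^p)^(1/p);    % p = 1 -> Manhattan, p = 2 -> Euclidean

    fprintf('Euclidean %.3f, Manhattan %.3f, Minkowski(p=%d) %.3f\n', ...
            euclidean, manhattan, p, minkowski);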

  11. Nearest neighbor classifiers • Basic idea: if it walks like a duck and quacks like a duck, then it’s probably a duck. • Compute the distance from the test sample to the training samples and choose k of the “nearest” samples.

  12. Nearest neighbor search • Let P be a set of n points in R^d, d ≥ 2. Given a query point q, find the nearest neighbor p of q in P. • Naïve approach: compute the distance from the query point to every other point in the database, keeping track of the “best so far”. Running time is O(n). • Data structure approach: construct a search structure which, given a query point q, finds the nearest neighbor p of q in P.

  13. Nearest neighbor algorithm • Pseudocode description of the nearest neighbor algorithm: • Require: a set of training instances • Require: a query instance • Iterate across the instances in memory to find the nearest neighbor – the instance with the shortest distance across the feature space to the query instance. • Make a prediction for the query instance that is equal to the value of the target feature of the nearest neighbor. • The default distance metric used in nearest neighbor models is the Euclidean distance.
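A minimal MATLAB sketch of this pseudocode, using assumed variable names (X for the training instances, y for their target values, q for the query) and hypothetical values:

    % Sketch of the nearest neighbor algorithm above.
    X = [2 6; 5 8; 7 3];                 % hypothetical training instances (one per row)
    y = {'no'; 'yes'; 'no'};             % their target feature values
    q = [6 7];                           % query instance

    dists = sqrt(sum((X - q).^2, 2));    % Euclidean distance to every training instance
    [~, idx] = min(dists);               % index of the nearest neighbor
    prediction = y{idx}                  % its target value is the prediction ('yes' here)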

  14. Nearest neighbor classifiers • Require three inputs: • The set of stored (training) samples • A distance metric to compute the distance between samples • The value of k, the number of nearest neighbors to retrieve

  15. Nearest neighbor classifiers • To classify a test sample: • Compute its distances to the samples in the training set • Identify the k nearest neighbors • Use the class labels of the nearest neighbors to determine the class label of the test sample (e.g., by taking a majority vote)

  16. k-nearest neighbors (KNN) • The k-nearest neighbors of a test sample x are the training samples that have the k smallest distances to x. [Figure: panels showing the 1-, 2- and 3-nearest neighbors of a test sample]

  17. KNN • KNN is a non-parametric supervised learning technique in which we classify a data point into a category with the help of the training set. • In simple words, it captures the information of all training cases and classifies new cases based on similarity. • Predictions are made for a new instance (x) by searching through the entire training set for the k most similar cases (neighbors) and summarizing the output variable for those k cases. • In classification this summary is the mode (most common) class value, as sketched below.
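A minimal MATLAB sketch of this prediction rule; the variable names and values are hypothetical, and later slides use MATLAB's built-in fitcknn() instead of hand-written code:

    % Sketch: kNN classification by majority vote over the k nearest neighbors.
    X = [2 6; 5 8; 7 3; 6 6; 4 7];       % hypothetical training cases
    y = {'no'; 'yes'; 'no'; 'yes'; 'yes'};
    q = [6 7];                           % new instance to classify
    k = 3;

    dists = sqrt(sum((X - q).^2, 2));    % distances to all training cases
    [~, order] = sort(dists);            % nearest first
    nearestLabels = y(order(1:k));       % labels of the k nearest neighbors
    prediction = mode(categorical(nearestLabels))   % mode = majority vote ('yes' here)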

  18. 1-nearest neighbor • One of the simplest of all machine learning classifiers • Simple idea: label a new point the same as the closest known point. • In the figure, the new point is therefore labeled red.

  19. 1-nearest neighbor • A type of instance-based learning • Also known as “memory-based” learning • Partition the feature space into a Voronoi tessellation# and decide which Voronoi region the query belongs to. # A Voronoi tessellation is a way of decomposing a space into regions where each region belongs to an instance and contains all the points in the space whose distance to that instance is less than the distance to any other instance

  20. Distance metrics • Different metrics can change the decision surface: • Dist(a,b) = (a1 − b1)² + (a2 − b2)² • Dist(a,b) = (a1 − b1)² + 9(a2 − b2)²
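A small MATLAB sketch of why the weighting matters: with the second metric above (the factor of 9 on the second coordinate), which candidate point counts as "nearest" can change. The points used are hypothetical.

    % Sketch: re-weighting a feature can change which neighbor is nearest.
    a  = [0 0];                                      % query point (hypothetical)
    b1 = [3 1];  b2 = [1 2];                         % two candidate neighbors

    plainDist    = @(u,v) (u(1)-v(1))^2 + (u(2)-v(2))^2;
    weightedDist = @(u,v) (u(1)-v(1))^2 + 9*(u(2)-v(2))^2;

    [plainDist(a,b1),    plainDist(a,b2)]     % 10 vs 5  -> b2 is nearer
    [weightedDist(a,b1), weightedDist(a,b2)]  % 18 vs 37 -> b1 is nearer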

  21. 1-NN as an instance-based learner • A distance metric • Euclidean • When different units are used for each dimension, normalize each dimension by its standard deviation • For discrete data, we can use the Hamming distance: d(x1,x2) = number of features on which x1 and x2 differ • Others (e.g., normal, cosine) • How many nearby neighbors to look at? • One • How to fit with the local points? • Just predict the same output as the nearest neighbor.
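For the discrete case mentioned above, a one-line MATLAB sketch of the Hamming distance (the feature values are hypothetical):

    % Sketch: Hamming distance = number of features on which two instances differ.
    x1 = {'red', 'round',  'small'};
    x2 = {'red', 'square', 'large'};
    hammingDist = sum(~strcmp(x1, x2))   % = 2: they differ on two of the three features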

  22. k-nearest neighbors • Generalizes 1-NN to smooth away noise in the labels • A new point is now assigned the most frequent label of its k nearest neighbors • In the figure, the new point is labeled red when k = 3 and blue when k = 7.

  23. Selecting k value • Increase k: • Makes KNN less sensitive to noise • Decrease k: • Allows capturing finer structure of space • Pick k not too large, but not too small (depends on data)

  24. Curse of Dimensionality • Prediction accuracy can quickly degrade when the number of attributes grows. • Irrelevant attributes easily “swamp” information from relevant attributes • When there are many irrelevant attributes, the similarity/distance measure becomes less reliable • Remedies: • Try to remove irrelevant attributes in a pre-processing step • Weight attributes differently • Increase k (but not too much)

  25. Pros and cons of KNN • Pros • Easy to understand • No assumptions about the data • Can be applied to both classification and regression • Works easily on multi-class problems • Cons • Memory intensive / computationally expensive • Sensitive to the scale of the data • Does not work well on rare-event (skewed) target variables • Struggles with a high number of independent variables • A small value of k will lead to a large variance in predictions. • Setting k to a large value may lead to a large model bias.

  26. How to find the best k value • Cross-validation is a smart way to find the optimal k value. It estimates the validation error rate by holding out a subset of the training set from the model-building process. • Cross-validation (say, 10-fold validation) involves randomly dividing the training set into 10 groups, or folds, of approximately equal size. 90% of the data is used to train the model and the remaining 10% to validate it. • The misclassification rate is then computed on the 10% validation data. This procedure is repeated 10 times. • A different group of observations is treated as the validation set in each of the 10 iterations. • This yields 10 estimates of the validation error, which are then averaged.
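A MATLAB sketch of this procedure, looping over candidate k values with the toolbox functions that appear later in the deck (fitcknn, crossval, kfoldLoss). The synthetic two-class data is purely illustrative, and the exact error values depend on the random partition.

    % Sketch: choosing k by 10-fold cross-validation.
    rng(1);                                      % reproducible synthetic example
    X = [randn(50,2); randn(50,2) + 2];          % two hypothetical classes
    Y = [repmat({'A'},50,1); repmat({'B'},50,1)];

    kValues = 1:15;
    cvErr = zeros(size(kValues));
    for i = 1:numel(kValues)
        mdl   = fitcknn(X, Y, 'NumNeighbors', kValues(i));
        cvMdl = crossval(mdl, 'KFold', 10);      % 10-fold partitioned model
        cvErr(i) = kfoldLoss(cvMdl);             % averaged misclassification rate
    end
    [bestErr, bestIdx] = min(cvErr);
    fprintf('Best k = %d (CV error %.4f)\n', kValues(bestIdx), bestErr);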

  27. How KNN algorithm works • Suppose we have the height, weight and T-shirt size of some customers, and we need to predict the T-shirt size of a new customer given only his or her height and weight. • The data, including height, weight and T-shirt size information, is shown below.

  28. Step 1: Calculate similarity based on distance function • There are many distance functions, but Euclidean is the most commonly used measure. It is mainly used when the data is continuous. • The Euclidean distance between two n-dimensional points x and u is d(x, u) = [(x1 − u1)² + (x2 − u2)² + … + (xn − un)²]^(1/2) • The Manhattan distance is also very common for continuous variables. • The Manhattan distance between two n-dimensional points x and u is d(x, u) = |x1 − u1| + |x2 − u2| + … + |xn − un| • The idea of using a distance measure is to find the distance (similarity) between the new sample and the training cases, and then find the k closest customers to the new customer in terms of height and weight.

  29. Step 1: Calculate similarity based on distance function • A new customer named ‘Monica’ has height 161 cm and weight 61 kg. • The Euclidean distance between the first observation and the new observation (Monica) is d = [(161 − 158)² + (61 − 58)²]^(1/2) = (9 + 9)^(1/2) = √18 = 3√2 ≈ 4.24 • Similarly, we calculate the distance from every training case to the new case and rank the cases by distance. The smallest distance is ranked 1 and is considered the nearest neighbor.

  30. Step 2: Find K-Nearest Neighbors • Let k be 5. The algorithm then searches for the 5 customers closest to Monica, i.e. most similar to Monica in terms of attributes, and looks at which categories those 5 customers were in. If 4 of them had ‘Medium’ T-shirt sizes and 1 had a ‘Large’ T-shirt size, then the best guess for Monica is ‘Medium’. See the calculation in the snapshot below.
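A MATLAB sketch of Steps 1 and 2 for this example. Only Monica (161 cm, 61 kg) and the first training case (158 cm, 58 kg) appear in this transcript; the remaining rows and all T-shirt sizes are hypothetical placeholders, chosen so that 4 of the 5 nearest neighbors are 'Medium'.

    % Sketch: Step 1 (distances) and Step 2 (k nearest neighbors + majority vote).
    % Only the first row (158, 58) comes from the slide; the rest is hypothetical.
    train  = [158 58; 160 59; 163 61; 159 62; 165 63; 170 64];   % height, weight
    sizes  = {'M'; 'M'; 'M'; 'M'; 'L'; 'L'};                     % T-shirt sizes
    monica = [161 61];
    k = 5;

    dists = sqrt(sum((train - monica).^2, 2));   % Step 1: Euclidean distances
    [~, order] = sort(dists);                    % rank 1 = smallest distance
    neighbors  = sizes(order(1:k));              % Step 2: the k nearest neighbors
    prediction = mode(categorical(neighbors))    % 4 'M' vs 1 'L' -> 'M' (Medium)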

  31. Step 2: Find K-Nearest Neighbors • In the graph on the right, the binary dependent variable (T-shirt size) is displayed in blue and orange. • ‘Medium T-shirt size’ is in blue and ‘Large T-shirt size’ in orange. • The new customer is shown as a yellow circle. • Four blue highlighted data points and one orange highlighted data point are close to the yellow circle, so the prediction for the new case is the blue class, i.e. Medium T-shirt size.

  32. Assumptions of KNN • Standardization • When the independent variables in the training data are measured in different units, it is important to standardize the variables before calculating distances. For example, if one variable is based on height in cm and the other is based on weight in kg, then height will have more influence on the distance calculation. • To make them comparable, we need to standardize them, which can be done by any of the methods listed on the right.
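The slide's list of standardization methods is not reproduced in this transcript; as one common choice, here is a MATLAB sketch of z-score standardization applied before the distance calculation, reusing the placeholder heights and weights from the earlier sketch:

    % Sketch: z-score standardization of the predictors (one common method).
    train  = [158 58; 160 59; 163 61; 159 62; 165 63; 170 64];   % placeholder data
    monica = [161 61];

    mu    = mean(train);                       % column means (height, weight)
    sigma = std(train);                        % column standard deviations
    trainStd = (train - mu) ./ sigma;          % standardized training predictors
    queryStd = (monica - mu) ./ sigma;         % standardize the query the same way

    distsStd = sqrt(sum((trainStd - queryStd).^2, 2));  % distances on a common scale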

  33. Standardization • After standardization, the 5th-closest value changed, because height dominated the distance calculation before standardization. Hence, it is important to standardize predictors before running the k-nearest neighbor algorithm. • Outliers • A low k value is sensitive to outliers; a higher k value is more resilient to outliers, as it considers more voters when deciding the prediction.

  34. Why KNN is non-parametric • Non-parametric means not making any assumptions on the underlying data distribution. • Non-parametric methods do not have fixed numbers of parameters in the model. • Similarly in KNN, model parameters actually grow with the training data set – you can imagine each training case as a “parameter” in the model.

  35. College Athlete Draft Example Revisited • When k = 1, we classify the 21st athlete as being drafted. • For most k values, say k = 3, 5, 6 or 7, the 21st athlete is classified as not being drafted.

  36. A MATLAB Example • In MATLAB, a KNN classifier is constructed via the fitcknn() function, which returns a KNN classification model based on the predictors and the response provided. • Use the UCI ML dataset fisheriris (see figure on the next slide) to carry out KNN classification: • >> load fisheriris • This command creates two variables: meas and species. The 1st contains the data for the length and width of the sepal and petal (150x4 double). The 2nd contains the classification (150x1 cell array of class names). • Set the number of nearest neighbors to k = 3 for this data: • >> knnModel = fitcknn(meas,species,'NumNeighbors',3) • The fitcknn() function returns a nearest neighbor classification object named knnModel

  37. A MATLAB Example • UCI Machine Learning’s fisheriris dataset; the figure plots the petal length and petal width predictors, and the classes are setosa, versicolor, and virginica.

  38. Graphical Display

  39. A MATLAB Example • We can examine the properties of classification objects by double-clicking knnModel in the Workspace window. This opens the Variables editor, in which a long list of properties is shown. • To access the properties of the model just created, use dot notation: • knnModel.ClassNames ans = 3x1 cell array 'setosa' 'versicolor' 'virginica'

  40. A MATLAB Example • To test the performance of the model, compute the resubstitution error: • knn3ResubErr = resubLoss(knnModel) knn3ResubErr = 0.0400 • This means 4% of the observations are misclassified by the KNN algorithm. • To understand how these errors are distributed, we first collect the model predictions for the available data and then calculate the confusion matrix • predictedValue = predict(knnModel,meas); • confMat = confusionmat(species,predictedValue)

  41. Cross-Validation • In cross-validation, all the available data is used, in fixed-size groups, both as test data and as training data. • Each pattern is classified at least once and is also used for training. • In practice, the sample is subdivided into groups of equal size; one group at a time is excluded, and the model built on the non-excluded groups tries to predict it. • This process is repeated k times such that each subset is used exactly once for validation, to verify the quality of the prediction model used. • In MATLAB, cross-validation is performed by the crossval() function, which creates a partitioned model from a fitted KNN classification model.

  42. Cross-Validation • By default, the crossval() function uses 10-fold cross-validation on the training data to create the model. • >> CVModel = crossval(knnModel) CVModel = classreg.learning.partition.ClassificationPartitionedModel CrossValidatedModel: 'KNN' PredictorNames: {'x1' 'x2' 'x3' 'x4'} ResponseName: 'Y' NumObservations: 150 KFold: 10 Partition: [1x1 cvpartition] ClassNames: {'setosa' 'versicolor' 'virginica'} ScoreTransform: 'none'

  43. Cross Validation • A ClassificationPartitionedModel object is created with several properties and methods, some of which were shown on the previous slide. • We can examine the properties of the ClassificationPartitionedModel object by double-clicking CVModel in the Workspace window. • We can review the cross-validation loss, which is the average loss of each cross-validation model when prediction is executed on data that is not used for training. • KLossModel = kfoldLoss(CVModel) KLossModel = 0.0333 • The cross-validation classification accuracy of 96.67% is very close to the resubstitution accuracy of 96%. • The classification model is therefore expected to misclassify approximately 4% of new data.

  44. Cross Validation • Choosing the optimal k value is very important for creating the best KNN classification model. • We will try different k values to see what happens: • knnModel.NumNeighbors = 5 knnModel = ClassificationKNN ResponseName: 'Y' CategoricalPredictors: [ ] ClassNames: {'setosa' 'versicolor' 'virginica'} ScoreTransform: 'none' NumObservations: 150 Distance: 'euclidean' NumNeighbors: 5

  45. Cross Validation • The model is refitted and a brief summary is listed in the command window. • To see whether there was an improvement in performance, compare the resubstitution error and cross-validation loss with the new number of nearest neighbors: • >> knn5ResubErr = resubLoss(knnModel) knn5ResubErr = 0.0333 • Run cross-validation again and display the cross-validation loss: • >> CVModel = crossval(knnModel) • >> K5LossModel = kfoldLoss(CVModel) K5LossModel = 0.0267 • The choice of k = 5 improved the cross-validated accuracy of the model from 96.67% to 97.33%.

  46. Cross Validation • Recalculate the confusion matrix: • >> PredictedValue = predict(knnModel,meas) • >> confMat = confusionmat(species,PredictedValue) confMat = [50 0 0; 0 47 3; 0 2 48] • The number of misclassification errors is reduced from 6 to 5.

  47. Bayes Theorem • We will start off with a visual intuition, before looking at the math…

  48. Bayes Theorem • With a lot of data, we can build a histogram. Let us just build one for “Antenna Length” for now…

  49. Bayes Theorem • We can leave the histograms as they are, or we can summarize them with two normal distributions. • Let us use two normal distributions for ease of visualization in the following slides…

  50. Bayes Theorem • We want to classify an insect we have found. Its antennae are 3 units long. How can we classify it? • We can simply ask ourselves: given the distributions of antenna lengths we have seen, is it more probable that our insect is a Grasshopper or a Katydid? • There is a formal way to discuss the most probable classification…
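A MATLAB sketch of that formal approach, previewing the Bayes rule developed in the remaining slides: model each class's antenna length with a normal distribution and compare p(x | class) × p(class). The means, standard deviations, and priors below are hypothetical placeholders, not values from the slides.

    % Sketch: classifying an insect with antenna length 3 via class-conditional
    % normal distributions and (hypothetical) class priors.
    x = 3;                                    % observed antenna length
    muG = 4.9; sigmaG = 1.2; priorG = 0.5;    % grasshopper model (assumed)
    muK = 8.2; sigmaK = 1.5; priorK = 0.5;    % katydid model (assumed)

    likeG = normpdf(x, muG, sigmaG);          % p(x | Grasshopper)
    likeK = normpdf(x, muK, sigmaK);          % p(x | Katydid)

    % Unnormalized posteriors: p(class | x) is proportional to p(x | class) * p(class)
    postG = likeG * priorG;
    postK = likeK * priorK;
    if postG > postK
        disp('More probably a Grasshopper')
    else
        disp('More probably a Katydid')
    end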
