
KNN Assignment

Presentation Transcript


  1. Classification Using K-Nearest Neighbor Assignment Prepared By Rekha G

  2. OUTLINE • KNN • KNN: Classification Approach • KNN: Euclidean distance • Nearest Neighbor and Exemplar • Nearest Neighbor Search • The kNN Algorithm • Closest Neighbors • Choosing appropriate k • Finding error with changed k

  3. KNN • K-Nearest Neighbors (KNN) • Simple, but a very powerful classification algorithm • Classifies based on a similarity measure • Non-parametric • Lazy learning 🞑 Does not “learn” until the test example is given 🞑 Whenever we have new data to classify, we find its K nearest neighbors from the training data

  4. Pros and Cons of KNN Pros • It is a very simple algorithm to understand and interpret. • It is very useful for nonlinear data because the algorithm makes no assumption about the data. • It is a versatile algorithm, as it can be used for classification as well as regression. • It has relatively high accuracy, although there are much better supervised learning models than KNN. Cons • It is computationally somewhat expensive because it stores all the training data. • It requires high memory storage compared to other supervised learning algorithms. • Prediction is slow when N is large. • It is very sensitive to the scale of the data as well as to irrelevant features.

  5. THE KNN Algorithm

  6. KNN: Classification Approach • Classified by “MAJORITY VOTE” of its neighbors’ classes 🞑 Assigned to the most common class amongst its K nearest neighbors (by measuring the “distance” between data points)
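The majority-vote rule above can be illustrated with a minimal Python sketch (not part of the original slides; the function name and example labels are hypothetical):

```python
# A minimal sketch of the majority-vote step: given the labels of the
# K nearest neighbors, pick the most common class.
from collections import Counter

def majority_vote(neighbor_labels):
    """Return the most common label among the K nearest neighbors."""
    counts = Counter(neighbor_labels)
    label, _ = counts.most_common(1)[0]
    return label

# Example: with K = 5, three neighbors vote "A" and two vote "B" -> "A"
print(majority_vote(["A", "B", "A", "A", "B"]))  # A
```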

  7. KNN: Example

  8. KNN: Pseudocode

  9. KNN: Example

  10. KNN: Euclidean distance

  11. KNN: Euclidean distance matrix

  12. Supervised vs. Unsupervised • Supervised: Labeled Data • Unsupervised: Unlabeled Data

  13. Distance

  14. Distance

  15. Distances • Distances are used to measure similarity • There are many ways to measure the distance between two instances

  16. Distances • Manhattan Distance: |X1-X2| + |Y1-Y2| • Euclidean Distance: sqrt((X1-X2)^2 + (Y1-Y2)^2)
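A minimal sketch of the two distance measures above for 2-D points (the point values are made up for illustration):

```python
# Manhattan and Euclidean distance on 2-D points (X1, Y1) and (X2, Y2).
import math

def manhattan(p, q):
    """Manhattan distance: |X1 - X2| + |Y1 - Y2|."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def euclidean(p, q):
    """Euclidean distance: sqrt((X1 - X2)^2 + (Y1 - Y2)^2)."""
    return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

p, q = (1, 2), (4, 6)
print(manhattan(p, q))  # 7
print(euclidean(p, q))  # 5.0
```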

  17. Distance • Manhattan Distance • Euclidean Distance

  18. Properties of Distance • Dist(x, y) >= 0 (non-negativity) • Dist(x, y) = Dist(y, x) (symmetry) • Detours cannot shorten distance (triangle inequality): Dist(x, z) <= Dist(x, y) + Dist(y, z)
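These three properties can be checked directly for Euclidean distance; the following sketch (with made-up points) is illustrative only:

```python
# Checking the three metric properties for Euclidean distance on sample points.
import math

def dist(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

x, y, z = (0, 0), (3, 4), (6, 0)
assert dist(x, y) >= 0                        # non-negativity
assert dist(x, y) == dist(y, x)               # symmetry
assert dist(x, z) <= dist(x, y) + dist(y, z)  # triangle inequality
print("all three metric properties hold for this example")
```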

  19. Distance • Hamming Distance
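Hamming distance counts the positions at which two equal-length sequences differ. A small illustrative sketch (the example strings and vectors are made up):

```python
# Hamming distance: number of positions where two equal-length inputs differ.
def hamming(a, b):
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length inputs")
    return sum(ai != bi for ai, bi in zip(a, b))

print(hamming("karolin", "kathrin"))        # 3
print(hamming([1, 0, 1, 1], [1, 1, 0, 1]))  # 2
```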

  20. Distance Measures • Distance Measure – What does it mean to be “similar”? • Minkowski Distance • Norm (Lp) • Chebyshev Distance • Mahalanobis Distance: d(x, y) = (x – y)^T Sxy^-1 (x – y), where Sxy is the covariance matrix
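A sketch of these measures using SciPy's distance module (SciPy and NumPy are assumptions; the slides do not name a library, and the sample vectors are made up):

```python
# Minkowski, Chebyshev, Manhattan, and Mahalanobis distances via SciPy.
import numpy as np
from scipy.spatial import distance

u = np.array([1.0, 2.0, 3.0])
v = np.array([2.0, 4.0, 1.0])

print(distance.minkowski(u, v, p=3))  # Minkowski distance with norm p = 3
print(distance.chebyshev(u, v))       # Chebyshev distance: max |u_i - v_i|
print(distance.cityblock(u, v))       # Manhattan (p = 1) for comparison

# Mahalanobis distance d(x, y) = sqrt((x - y)^T S^-1 (x - y)),
# where S is the covariance matrix of the data (made-up sample below).
data = np.array([[1.0, 2.0, 3.0], [2.0, 1.0, 0.0],
                 [3.0, 3.0, 1.0], [0.0, 2.0, 2.0]])
S_inv = np.linalg.inv(np.cov(data, rowvar=False))
print(distance.mahalanobis(u, v, S_inv))
```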

  21. Nearest Neighbor and Exemplar

  22. Exemplar • Arithmetic Mean • Geometric Mean • Medoid • Centroid

  23. Arithmetic Mean

  24. Geometric Mean
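The four exemplars listed on slide 22 can be computed for a small made-up 2-D dataset; this sketch uses NumPy, which the slides do not mention:

```python
# Arithmetic mean, geometric mean, centroid, and medoid of a toy 2-D dataset.
import numpy as np

points = np.array([[1.0, 2.0], [2.0, 8.0], [4.0, 4.0], [8.0, 2.0]])

arithmetic_mean = points.mean(axis=0)
geometric_mean = np.exp(np.log(points).mean(axis=0))  # requires positive values
centroid = arithmetic_mean                            # for points, centroid = arithmetic mean

# Medoid: the actual data point whose total distance to all others is smallest.
dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
medoid = points[dists.sum(axis=1).argmin()]

print(arithmetic_mean, geometric_mean, centroid, medoid)
```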

  25. Nearest Neighbor Search • Given: a set P of n points in Rd • Goal: a data structure which, given a query point q, finds the nearest neighbor p of q in P
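A brute-force sketch of this search (a real data structure such as a k-d tree would avoid the linear scan; the points below are made up):

```python
# Brute-force nearest-neighbor search: scan P and return the point closest to q.
import numpy as np

def nearest_neighbor(P, q):
    """Return the point p in P that is closest to the query q."""
    dists = np.linalg.norm(P - q, axis=1)  # Euclidean distance to every point
    return P[np.argmin(dists)]

P = np.array([[0.0, 0.0], [5.0, 5.0], [2.0, 1.0]])
q = np.array([1.5, 1.5])
print(nearest_neighbor(P, q))  # [2. 1.]
```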

  26. Definition of Nearest Neighbor • The k-nearest neighbors of a record x are the data points that have the k smallest distances to x

  27. K-NN • (K-l)-NN: Reduce complexity by having a threshold on the majority. We could restrict the associations through (K-l)-NN.

  28. K-NN • (K-l)-NN: Reduce complexity by having a threshold on the majority. We could restrict the associations through (K-l)-NN. K = 5

  29. K-NN • Select the 5 nearest neighbors (K = 5) by taking their Euclidean distances

  30. K-NN • Decide by the majority of instances over the given value of K. Here, K = 5.
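Slides 29-30 can be combined into one short sketch: rank the training points by Euclidean distance, keep the K = 5 closest, and vote. The toy data below are made up:

```python
# Pick the K = 5 nearest neighbors by Euclidean distance and vote on the class.
import numpy as np
from collections import Counter

X = np.array([[1, 1], [2, 1], [1, 2], [8, 8], [9, 8], [8, 9], [2, 2]])
y = np.array(["red", "red", "red", "blue", "blue", "blue", "red"])
query = np.array([2.5, 2.0])
K = 5

order = np.argsort(np.linalg.norm(X - query, axis=1))[:K]  # indices of 5 closest
print(Counter(y[order]).most_common(1)[0][0])              # -> "red"
```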

  31. Example

  32. KNN Example

  33. Scatter Plot

  34. Euclidean Distance From Each Point

  35. 3 Nearest Neighbors

  36. KNN Classification

  37. Variation In KNN

  38. Different Values of K

  39. The kNN Algorithm • The kNN algorithm begins with a training dataset made up of examples that are classified into several categories, as labeled by a nominal variable. • Assume that we have a test dataset containing unlabeled examples that otherwise have the same features as the training data. • For each record in the test dataset, kNN identifies the k records in the training data that are the "nearest" in similarity, where k is an integer specified in advance. • The unlabeled test instance is assigned the class of the majority of the k nearest neighbors.
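The four steps above translate into a short illustrative implementation (not the one used in the slides; the helper name and toy data are assumptions):

```python
# For each test record, find the k nearest training records and assign the majority class.
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, X_test, k=3):
    predictions = []
    for x in X_test:
        dists = np.linalg.norm(X_train - x, axis=1)  # distance to every training record
        nearest = np.argsort(dists)[:k]              # indices of the k closest records
        majority = Counter(y_train[nearest]).most_common(1)[0][0]
        predictions.append(majority)
    return predictions

X_train = np.array([[1, 1], [1, 2], [2, 2], [8, 8], [9, 9], [8, 9]])
y_train = np.array(["A", "A", "A", "B", "B", "B"])
print(knn_predict(X_train, y_train, np.array([[2, 1], [9, 8]]), k=3))  # ['A', 'B']
```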

  40. Classify me now!

  41. Example

  42. Distance • Euclidean distance is specified by the following formula, where p and q are the examples to be compared, each having n features. The term p1 refers to the value of the first feature of example p, while q1 refers to the value of the first feature of example q: dist(p, q) = sqrt((p1 - q1)^2 + (p2 - q2)^2 + ... + (pn - qn)^2) • The distance formula involves comparing the values of each feature. For example, to calculate the distance between the tomato (sweetness = 6, crunchiness = 4) and the green bean (sweetness = 3, crunchiness = 7), we can use the formula as follows: dist(tomato, green bean) = sqrt((6 - 3)^2 + (4 - 7)^2) = sqrt(18) ≈ 4.24
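The tomato / green bean calculation from this slide, reproduced as a small runnable check:

```python
# dist = sqrt((6 - 3)^2 + (4 - 7)^2) = sqrt(18) ~= 4.24
import math

tomato = (6, 4)      # (sweetness, crunchiness)
green_bean = (3, 7)

dist = math.sqrt(sum((p_i - q_i) ** 2 for p_i, q_i in zip(tomato, green_bean)))
print(round(dist, 2))  # 4.24
```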

  43. Closest Neighbors

  44. Choosing appropriate k • Deciding how many neighbors to use for kNN determines how well the model will generalize to future data. • The balance between overfitting and underfitting the training data is a problem known as the bias-variance tradeoff. • Choosing a large k reduces the impact of variance caused by noisy data, but can bias the learner such that it runs the risk of ignoring small but important patterns.

  45. Choosing appropriate k

  46. Choosing appropriate k • In practice, choosing k depends on the difficulty of the concept to be learned and the number of records in the training data. • Typically, k is set somewhere between 3 and 10. One common practice is to set k equal to the square root of the number of training examples. • In the classifier, we might set k = 4, because there were 15 example ingredients in the training data and the square root of 15 is 3.87.
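The square-root heuristic can be tried directly; this sketch uses scikit-learn and a synthetic dataset, neither of which appears in the slides:

```python
# Start from k ~ sqrt(n) and compare a few candidate k values on held-out data.
import math
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

k0 = round(math.sqrt(len(X_tr)))  # square-root-of-n starting point
for k in (3, k0, 15):
    acc = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr).score(X_te, y_te)
    print(f"k={k}: accuracy={acc:.2f}")
```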

  47. Sample Application

  48. Dataset

  49. KNN – Classification : Dataset

  50. Pre-processing
