Classification Using K-Nearest Neighbor • Assignment prepared by Rekha G
OUTLINE • KNN • KNN: Classification Approach • KNN: Euclidean distance • Nearest Neighbor and Exemplar • Nearest Neighbor Search • The kNN Algorithm • Closest Neighbors • Choosing appropriate k • Finding error with changed k
KNN • K-Nearest Neighbors (KNN) • Simple, but a very powerful classification algorithm • Classifies based on a similarity measure • Non-parametric • Lazy learning 🞑 Does not "learn" until a test example is given 🞑 Whenever we have new data to classify, we find its K nearest neighbors from the training data
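As a quick, hedged illustration of the lazy-learning idea, the sketch below uses scikit-learn's KNeighborsClassifier; the toy feature matrix, labels, and query point are illustrative assumptions, not data from these slides.

```python
# Minimal sketch (assumes scikit-learn is installed); the toy data is illustrative only.
from sklearn.neighbors import KNeighborsClassifier

X_train = [[6, 4], [3, 7], [8, 5], [2, 3], [7, 6]]   # e.g. (sweetness, crunchiness)
y_train = ["fruit", "vegetable", "fruit", "vegetable", "fruit"]

knn = KNeighborsClassifier(n_neighbors=3)   # "training" just stores the data
knn.fit(X_train, y_train)

print(knn.predict([[5, 5]]))                # neighbors are found only at query time
```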
Pros and Cons of KNN Pros • It is a very simple algorithm to understand and interpret. • It is useful for nonlinear data because it makes no assumptions about the data. • It is versatile: it can be used for classification as well as regression. • It often achieves relatively high accuracy, although better supervised learning models exist. Cons • It is computationally somewhat expensive because it stores all the training data. • It requires more memory than many other supervised learning algorithms. • Prediction is slow when the training set is large. • It is very sensitive to the scale of the data and to irrelevant features.
KNN: Classification Approach • Classified by "MAJORITY VOTE" of its neighbors' classes 🞑 A new example is assigned to the most common class amongst its K nearest neighbors (by measuring the distance between data points)
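A minimal sketch of the majority-vote step, assuming the labels of the K nearest neighbors have already been collected; collections.Counter picks the most common class.

```python
from collections import Counter

def majority_vote(neighbor_labels):
    """Return the most common class label among the K nearest neighbors."""
    counts = Counter(neighbor_labels)
    return counts.most_common(1)[0][0]

print(majority_vote(["fruit", "vegetable", "fruit", "fruit", "vegetable"]))  # fruit
```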
Supervised vs. Unsupervised • Supervised: labeled data • Unsupervised: unlabeled data
Distances • Distances are used to measure similarity • There are many ways to measure the distance between two instances
Distances • Manhattan Distance: |X1 - X2| + |Y1 - Y2| • Euclidean Distance: sqrt((X1 - X2)^2 + (Y1 - Y2)^2)
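Both distances can be written directly from the formulas above; a minimal sketch for points given as coordinate tuples:

```python
import math

def manhattan(p, q):
    # |X1 - X2| + |Y1 - Y2|, summed over all coordinates
    return sum(abs(a - b) for a, b in zip(p, q))

def euclidean(p, q):
    # sqrt((X1 - X2)^2 + (Y1 - Y2)^2), generalized to n dimensions
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

print(manhattan((1, 2), (4, 6)))   # 7
print(euclidean((1, 2), (4, 6)))   # 5.0
```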
[Figure: Manhattan distance vs. Euclidean distance between two points]
Properties of Distance • Dist(x, y) >= 0 (non-negativity) • Dist(x, y) = Dist(y, x) (symmetry) • Detours cannot shorten distance (triangle inequality): Dist(x, z) <= Dist(x, y) + Dist(y, z)
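These three axioms can be spot-checked numerically; the sketch below verifies them for the Euclidean distance on a few arbitrary sample points (the points themselves are assumptions for demonstration).

```python
import itertools
import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

points = [(0, 0), (3, 4), (-1, 2), (5, -2)]

for x, y, z in itertools.permutations(points, 3):
    assert euclidean(x, y) >= 0                                    # non-negativity
    assert math.isclose(euclidean(x, y), euclidean(y, x))          # symmetry
    assert euclidean(x, z) <= euclidean(x, y) + euclidean(y, z) + 1e-12  # triangle inequality

print("all distance axioms hold on the sample points")
```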
Distance • Hamming Distance: the number of positions at which two equal-length strings (or feature vectors) differ
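A minimal sketch of the Hamming distance; the example strings are the usual textbook illustration, not data from these slides.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires sequences of equal length")
    return sum(x != y for x, y in zip(a, b))

print(hamming("karolin", "kathrin"))         # 3
print(hamming([1, 0, 1, 1], [1, 1, 0, 1]))   # 2
```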
Distance Measures • Distance measure: what does "similar" mean? • Minkowski Distance (Lp norm): d(x, y) = (sum_i |x_i - y_i|^p)^(1/p) • Chebyshev Distance: d(x, y) = max_i |x_i - y_i| • Mahalanobis Distance: d(x, y) = sqrt((x - y)^T S^-1 (x - y)), where S is the covariance matrix of the data
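A hedged sketch of the three measures using NumPy; the sample vectors and the covariance matrix S (which would normally be estimated from the training data) are illustrative assumptions.

```python
import numpy as np

def minkowski(x, y, p=2):
    # (sum_i |x_i - y_i|^p)^(1/p); p=1 gives Manhattan, p=2 gives Euclidean
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

def chebyshev(x, y):
    # max_i |x_i - y_i|, the limit of the Minkowski distance as p -> infinity
    return np.max(np.abs(x - y))

def mahalanobis(x, y, S):
    # sqrt((x - y)^T S^-1 (x - y)); S is the covariance matrix of the data
    d = x - y
    return float(np.sqrt(d @ np.linalg.inv(S) @ d))

x, y = np.array([2.0, 3.0]), np.array([5.0, 7.0])
S = np.array([[2.0, 0.3], [0.3, 1.0]])        # assumed covariance, illustrative only
print(minkowski(x, y, p=1), minkowski(x, y, p=2), chebyshev(x, y), mahalanobis(x, y, S))
```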
Exemplar • Arithmetic Mean • Geometric Mean • Medoid • Centroid
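The exemplars listed above can each be computed from a set of points; a minimal sketch of the centroid (coordinate-wise mean, not necessarily a real data point) and the medoid (the actual data point with the smallest total distance to the others), assuming Euclidean distance and made-up sample points:

```python
import numpy as np

points = np.array([[1.0, 2.0], [2.0, 3.0], [8.0, 9.0], [2.0, 2.5]])

# Centroid: coordinate-wise arithmetic mean (need not be an actual data point)
centroid = points.mean(axis=0)

# Medoid: the real data point minimizing the sum of distances to all other points
dist_matrix = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
medoid = points[dist_matrix.sum(axis=1).argmin()]

print("centroid:", centroid)
print("medoid:  ", medoid)
```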
Nearest Neighbor Search • Given: a set P of n points in R^d • Goal: a data structure which, given a query point q, finds the nearest neighbor p of q in P
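A brute-force (linear scan) sketch of the search problem; real data structures such as k-d trees or ball trees speed this up, but the linear scan defines the goal. The sample points and query are assumptions for demonstration.

```python
import math

def nearest_neighbor(P, q):
    """Return the point in P closest to the query point q (linear scan, O(n))."""
    return min(P, key=lambda p: math.dist(p, q))

P = [(1, 1), (4, 5), (9, 2), (3, 3)]
print(nearest_neighbor(P, (3.6, 4)))   # (4, 5)
```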
Definition of Nearest Neighbor • The K-nearest neighbors of a record x are the data points that have the k smallest distances to x
K-NN • (K-l)-NN: Reduce complexity by having a threshold on the majority. We could restrict the associations through (K-l)-NN.
K-NN • With K = 5, select the 5 nearest neighbors of the query point by computing their Euclidean distances
K-NN • Assign the class held by the majority of the instances among the K nearest neighbors; here, K = 5
The kNN Algorithm • The kNN algorithm begins with a training dataset made up of examples that are classified into several categories, as labeled by a nominal variable. • Assume that we have a test dataset containing unlabeled examples that otherwise have the same features as the training data. • For each record in the test dataset, kNN identifies k records in the training data that are the "nearest" in similarity, where k is an integer specified in advance. • The unlabeled test instance is assigned the class of the majority of the k nearest neighbors.
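A from-scratch sketch of the algorithm just described: for each unlabeled test record, find the k nearest labeled training records and take the majority class. Function names and the toy (sweetness, crunchiness) data are illustrative assumptions, not material from the original slides.

```python
import math
from collections import Counter

def knn_classify(train_X, train_y, test_X, k=3):
    """Label each test record with the majority class of its k nearest training records."""
    predictions = []
    for q in test_X:
        # Sort training records by Euclidean distance to the query record
        neighbors = sorted(zip(train_X, train_y), key=lambda pair: math.dist(pair[0], q))
        k_labels = [label for _, label in neighbors[:k]]
        predictions.append(Counter(k_labels).most_common(1)[0][0])   # majority vote
    return predictions

# Toy (sweetness, crunchiness) data, illustrative only
train_X = [(6, 4), (3, 7), (8, 5), (2, 6), (7, 3)]
train_y = ["fruit", "vegetable", "fruit", "vegetable", "fruit"]
print(knn_classify(train_X, train_y, [(4, 6), (7, 4)], k=3))   # ['vegetable', 'fruit']
```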
Distance • Euclidean distance is specified by the following formula, where p and q are the examples to be compared, each having n features. The term p1 refers to the value of the first feature of example p, while q1 refers to the value of the first feature of example q: dist(p, q) = sqrt((p1 - q1)^2 + (p2 - q2)^2 + ... + (pn - qn)^2) • The distance formula involves comparing the values of each feature. For example, to calculate the distance between the tomato (sweetness = 6, crunchiness = 4) and the green bean (sweetness = 3, crunchiness = 7), we can use the formula as follows: dist(tomato, green bean) = sqrt((6 - 3)^2 + (4 - 7)^2) = sqrt(9 + 9) = sqrt(18) ≈ 4.2
Choosing appropriate k • Deciding how many neighbors to use for kNN determines how well the model will generalize to future data. • The balance between overfitting and underfitting the training data is a problem known as the bias-variance tradeoff. • Choosing a large k reduces the impact of variance caused by noisy data, but can bias the learner so that it runs the risk of ignoring small but important patterns.
Choosing appropriate k • In practice, choosing k depends on the difficulty of the concept to be learned and the number of records in the training data. • Typically, k is set somewhere between 3 and 10. One common practice is to set k equal to the square root of the number of training examples. • In the classifier, we might set k = 4, because there were 15 example ingredients in the training data and the square root of 15 is 3.87.
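The square-root heuristic mentioned above is easy to encode, though in practice k is usually tuned on held-out data; a hedged sketch of the rule of thumb:

```python
import math

def suggested_k(n_training_examples):
    """Rule-of-thumb starting point: k near the square root of the training set size."""
    return max(1, round(math.sqrt(n_training_examples)))

print(suggested_k(15))   # sqrt(15) ≈ 3.87, so k = 4 as in the example above
```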