
K nearest neighbor and Rocchio algorithm



  1. K nearest neighbor and Rocchio algorithm. LING 572, Fei Xia, 1/11/2007

  2. Announcement
  • Hw2 is online now. It is due on Jan 20.
  • Hw1 is due at 11pm on Jan 13 (Sat).
  • Lab session after the class.
  • Read the DT tutorial before next Tuesday's class.

  3. K-Nearest Neighbor (kNN)

  4. Instance-based (IB) learning
  • No training: store all training instances. → "Lazy learning"
  • Examples:
    • kNN
    • Locally weighted regression
    • Radial basis functions
    • Case-based reasoning
    • …
  • The most well-known IB method: kNN

  5. kNN

  6. kNN
  • For a new document d,
    • find the k training documents that are closest to d;
    • perform majority voting or weighted voting.
  • Properties:
    • A "lazy" classifier: no training.
    • Feature selection and the distance measure are crucial.

  7. The algorithm
  • Determine the parameter K.
  • Calculate the distance between the query instance and all the training instances.
  • Sort the distances and determine the K nearest neighbors.
  • Gather the labels of the K nearest neighbors.
  • Use simple majority voting or weighted voting.
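A minimal Python sketch of these steps (the function name, the numpy representation, and the choice of Euclidean distance are illustrative assumptions, not from the slides):

    import numpy as np
    from collections import Counter

    def knn_classify(query, train_X, train_y, k=3):
        """Classify `query` by a majority vote among its k nearest neighbors."""
        # Step 2: distance from the query to every training instance (Euclidean).
        dists = np.linalg.norm(train_X - query, axis=1)
        # Step 3: indices of the k smallest distances.
        nearest = np.argsort(dists)[:k]
        # Steps 4-5: gather the neighbors' labels and take a simple majority vote.
        votes = Counter(train_y[i] for i in nearest)
        return votes.most_common(1)[0][0]

    # Tiny usage example:
    X = np.array([[1.0, 1.0], [1.2, 0.9], [8.0, 8.0]])
    y = np.array(["a", "a", "b"])
    print(knn_classify(np.array([1.1, 1.0]), X, y, k=2))  # -> "a"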

  8. Picking K
  • Use N-fold cross-validation: pick the K that minimizes the cross-validation error.
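A sketch of K selection by N-fold cross-validation, reusing the hypothetical knn_classify above (labels assumed to be a numpy array):

    import numpy as np

    def pick_k(train_X, train_y, candidate_ks, n_folds=5):
        """Return the K with the fewest cross-validation errors."""
        idx = np.random.default_rng(0).permutation(len(train_X))
        folds = np.array_split(idx, n_folds)
        best_k, best_err = None, float("inf")
        for k in candidate_ks:
            errors = 0
            for fold in folds:
                train_idx = np.setdiff1d(idx, fold)  # hold out one fold
                X_tr, y_tr = train_X[train_idx], train_y[train_idx]
                errors += sum(knn_classify(train_X[i], X_tr, y_tr, k) != train_y[i]
                              for i in fold)
            if errors < best_err:
                best_k, best_err = k, errors
        return best_k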

  9. Normalizing attribute values
  • The distance can be dominated by attributes with large values:
    • Ex: features: age, income
    • Original data: x1=(35, 76K), x2=(36, 80K), x3=(70, 79K)
    • Assume age ∈ [0, 100] and income ∈ [0, 200K].
    • After normalization: x1=(0.35, 0.38), x2=(0.36, 0.40), x3=(0.70, 0.395).
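A sketch of the min-max rescaling used in this example (the attribute ranges are the ones assumed on the slide):

    import numpy as np

    # Assumed ranges from the slide: age in [0, 100], income in [0, 200K].
    lo = np.array([0.0, 0.0])
    hi = np.array([100.0, 200_000.0])

    def minmax_normalize(x, lo, hi):
        """Rescale each attribute to [0, 1] so large-valued attributes
        (like income) no longer dominate the distance."""
        return (x - lo) / (hi - lo)

    x1 = np.array([35.0, 76_000.0])
    print(minmax_normalize(x1, lo, hi))  # [0.35 0.38], as on the slide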

  10. The Choice of Features
  • Imagine there are 100 features and only 2 of them are relevant to the target label.
  • kNN is easily misled in high-dimensional space. → Use feature weighting or feature selection.

  11. Feature weighting
  • Stretch the j-th axis by weight wj (i.e., use a weighted distance; see the next slide).
  • Use cross-validation to automatically choose the weights w1, …, wn.
  • Setting wj to zero eliminates that dimension altogether.

  12. Similarity measure
  • Euclidean distance: d(x, y) = sqrt(Σ_j (x_j − y_j)²)
  • Weighted Euclidean distance: d(x, y) = sqrt(Σ_j w_j (x_j − y_j)²)
  • Similarity measure: cosine, sim(x, y) = (x · y) / (|x| |y|)
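The three measures written out as Python functions (a sketch; the names are illustrative):

    import numpy as np

    def euclidean(x, y):
        return np.sqrt(np.sum((x - y) ** 2))

    def weighted_euclidean(x, y, w):
        # w[j] stretches the j-th axis; w[j] = 0 drops that dimension.
        return np.sqrt(np.sum(w * (x - y) ** 2))

    def cosine_sim(x, y):
        return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))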

  13. Voting
  • Majority voting: c* = arg max_c Σ_i δ(c, f_i(x))
  • Weighted voting: the weighting is on each neighbor:
    c* = arg max_c Σ_i w_i δ(c, f_i(x)), with w_i = 1/dist(x, x_i)
  • (Here δ(a, b) = 1 if a = b and 0 otherwise, and f_i(x) is the label of the i-th nearest neighbor of x.)
  → With distance weighting, we can use all the training examples.
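A sketch of distance-weighted voting (assumes no neighbor sits at distance exactly 0):

    from collections import defaultdict

    def weighted_vote(dists, labels):
        """Each neighbor votes for its own label with weight w_i = 1/dist."""
        scores = defaultdict(float)
        for d, c in zip(dists, labels):
            scores[c] += 1.0 / d  # assumes d > 0
        return max(scores, key=scores.get)

With this weighting, k can even be set to the full training set size, since distant neighbors contribute almost nothing to the vote.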

  14. Summary of kNN
  • Strengths:
    • Simplicity (conceptual)
    • Efficiency at training time: no training at all
    • Handles multi-class problems
    • Stability and robustness: averaging over k neighbors
    • Prediction accuracy: good when the training data is large
  • Weaknesses:
    • Efficiency at testing time: need to calculate the distance to every training instance
    • Theoretical validity
    • It is not clear which distance measure and which features to use.

  15. Rocchio Algorithm

  16. Relevance Feedback for IR
  • The issue: "plane" vs. "aircraft" (a query term may not match the vocabulary of the relevant documents).
  • Take advantage of user feedback on the relevance of documents to improve IR results:
    • The user issues a short, simple query.
    • The user marks returned documents as relevant or non-relevant.
    • The system computes a better representation of the information need based on this feedback.
    • Relevance feedback can go through one or more iterations.
  • Idea: it may be difficult to formulate a good query when you don't know the collection well, so iterate.

  17. Rocchio Algorithm
  • The Rocchio algorithm incorporates relevance feedback information into the vector space model.
  • Goal: maximize sim(Q, Cr) − sim(Q, Cnr).
  • The optimal query vector for separating relevant and non-relevant documents (with cosine similarity):
    Qopt = (1/|Cr|) Σ_{dj ∈ Cr} dj − (1/(N − |Cr|)) Σ_{dj ∉ Cr} dj
  • Qopt = optimal query; Cr = set of relevant document vectors; N = collection size

  18. Rocchio 1971 Algorithm (SMART)
  • qm = α·q0 + β·(1/|Dr|) Σ_{dj ∈ Dr} dj − γ·(1/|Dnr|) Σ_{dj ∈ Dnr} dj
  • qm = modified query vector; q0 = original query vector; α, β, γ: weights
  • Dr = set of known relevant document vectors; Dnr = set of known non-relevant document vectors
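A sketch of the SMART update as code. The default weight values (α=1.0, β=0.75, γ=0.15) are common IR choices, not from the slides, and clipping negative term weights to zero is likewise a common convention rather than part of the formula above:

    import numpy as np

    def rocchio_update(q0, D_r, D_nr, alpha=1.0, beta=0.75, gamma=0.15):
        """Move the query toward the centroid of known-relevant documents
        and away from the centroid of known non-relevant ones."""
        qm = alpha * q0
        if len(D_r) > 0:
            qm = qm + beta * np.mean(D_r, axis=0)
        if len(D_nr) > 0:
            qm = qm - gamma * np.mean(D_nr, axis=0)
        return np.maximum(qm, 0.0)  # clip negative term weights (common convention)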

  19. Relevance feedback assumptions
  • Relevance prototypes are "well-behaved":
    • Term distributions in relevant documents will be similar to one another.
    • Term distributions in non-relevant documents will be different from those in relevant documents.
  • Either all relevant documents are tightly clustered around a single prototype,
  • or there are different prototypes, but they have significant vocabulary overlap.
  • Similarities between relevant and non-relevant documents are small.

  20. Rocchio Algorithm for text classification
  • Training time: construct a set of prototype vectors, one vector per class.
  • Testing time: for a new document, find the most similar prototype vector.

  21. Training time
  • Prototype vector for class j:
    pj = α·(1/|Cj|) Σ_{d ∈ Cj} d − β·(1/|D − Cj|) Σ_{d ∈ D − Cj} d
  • Cj: the set of positive examples for class j; D: the set of positive and negative examples; α, β: weights
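A sketch of this training step under the definitions above (the dictionary representation and function name are illustrative):

    import numpy as np

    def train_prototypes(X, y, alpha=1.0, beta=1.0):
        """One prototype per class: alpha times the centroid of the positive
        examples minus beta times the centroid of the negative examples."""
        protos = {}
        for c in set(y):
            mask = np.array([label == c for label in y])
            protos[c] = alpha * X[mask].mean(axis=0) - beta * X[~mask].mean(axis=0)
        return protos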

  22. Why this formula?
  • Rocchio showed that when α = β = 1, each prototype vector maximizes the mean similarity to the positive examples minus the mean similarity to the negative examples:
    (1/|Cj|) Σ_{d ∈ Cj} sim(d, pj) − (1/|D − Cj|) Σ_{d ∈ D − Cj} sim(d, pj)
  • But how does maximizing this quantity connect to classification accuracy?

  23. Testing time
  • Given a new document d, assign it to the class whose prototype vector is most similar to d: c* = arg max_j sim(pj, d)
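A sketch of the testing step, assuming cosine similarity and the train_prototypes output from the sketch above:

    import numpy as np

    def rocchio_classify(d, protos):
        """Assign d to the class whose prototype vector is most similar."""
        def cos(a, b):
            return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
        return max(protos, key=lambda c: cos(protos[c], d))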

  24. kNN vs. Rocchio
  • kNN:
    • Lazy learning: no training.
    • Uses all the training instances at testing time.
  • Rocchio algorithm:
    • At training time, calculates prototype vectors.
    • At testing time, uses only the prototype vectors instead of all the training instances.
    • A linear classifier: not as expressive as kNN.

  25. Summary of Rocchio
  • Strengths:
    • Simplicity (conceptual)
    • Efficiency at training time
    • Efficiency at testing time
    • Handles multi-class problems
  • Weaknesses:
    • Theoretical validity
    • Stability and robustness
    • Prediction accuracy: it does not work well when the categories are not linearly separable.

  26. Additional slides

  27. Three major design choices
  • The weight of feature fk in document di: e.g., tf-idf
  • Document length normalization
  • The similarity measure: e.g., cosine

  28. Extending Rocchio?
  • Generalized instance set (GIS) algorithm (Lam and Ho, 1998).
  • …
