
Nearest Neighbor


Presentation Transcript


  1. Nearest Neighbor Ling 572 Advanced Statistical Methods in NLP January 12, 2012

  2. Roadmap • Instance-based learning • Examples and Motivation • Basic algorithm: k-Nearest Neighbors • Issues: • Selecting k • Feature weighting and selection • Voting • Summary • HW #2

  3. Instance-based Learning • aka “Lazy Learning” • No explicit training procedure • Just store / tabulate training instances • Many examples: • k-Nearest Neighbor (kNN) • Locally weighted regression • Radial basis functions • Case-based reasoning • kNN: the most widely used variant

  4. Nearest Neighbor Example I • Problem: Robot arm motion • Difficult to model analytically • Kinematic equations • Relate joint angles and manipulator positions • Dynamics equations • Relate motor torques to joint angles • Difficult to achieve good results modeling robotic arms or the human arm analytically • Many factors & measurements

  5. Nearest Neighbor Example • Solution: • Move the robot arm around • Record parameters for each trajectory segment • Table: torques, positions, velocities, squared velocities, velocity products, accelerations • To follow a new path: • Break it into segments • Find the closest segments in the table • Get those torques (interpolate as necessary)

  6. Nearest Neighbor Example • Issue: a big table is needed • The first time on a new trajectory, the “closest” stored segment isn’t actually close • The table is sparse, with few entries • Solution: Practice • Each attempt at the trajectory fills in more of the table • After a few attempts, the stored segments are very close

  7. Nearest Neighbor Example II • Topic Tracking • Task: Given a small number of exemplar documents, find other documents on the same topic • Approach: • Features: ‘terms’: words or phrases, stemmed • Weights: modified tf-idf (term frequency × inverse document frequency) • New document: assign the label of its nearest neighbors
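
A rough sketch of this kind of tracker (not the exact system above): weight terms by tf-idf and assign each new document the label of its most similar exemplar under cosine similarity. The documents, labels, and the plain (unmodified) tf-idf weighting below are illustrative assumptions, using scikit-learn.

    # Sketch: tf-idf nearest-neighbor topic tracking (illustrative data, plain tf-idf).
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # Labelled exemplar documents (hypothetical).
    train_docs = ["oil prices rise on supply fears",
                  "stocks fall as oil prices surge",
                  "team wins championship in overtime",
                  "star player injured before final game"]
    train_labels = ["economy", "economy", "sports", "sports"]

    vectorizer = TfidfVectorizer(stop_words="english")
    train_vecs = vectorizer.fit_transform(train_docs)

    def nearest_topic(new_doc):
        """Assign the label of the most similar exemplar (1-NN, cosine similarity)."""
        new_vec = vectorizer.transform([new_doc])
        sims = cosine_similarity(new_vec, train_vecs)[0]
        return train_labels[sims.argmax()]

    print(nearest_topic("crude oil futures climb again"))   # -> "economy"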

  8. Instance-based Learning III • Example-based Machine Translation (EBMT) • Task: Translation in the sales domain • Approach: • Collect translation pairs: source-language sentence paired with its target-language translation • Decompose into subphrases via minimal pairs: • How much is that red umbrella? Ano akai kasa wa ikura desu ka. • How much is that small camera? Ano chiisai kamera wa ikura desu ka. • To translate a new sentence: • Find the most similar source-language subunits • Compose the corresponding subunit translations

  9. Nearest Neighbor • Memory- or case-based learning • Consistency heuristic: Assume that a property is the same as that of the nearest reference case. • Supervised method: Training • Record labelled instances and their feature-value vectors • For each new, unlabelled instance • Identify the “nearest” labelled instance(s) • Assign the same label

  10. Nearest Neighbor Example • Credit Rating: • Classifier: Good / Poor • Features: • L = # late payments/yr • R = Income/Expenses

      Name   L    R     G/P
      A       0   1.2   G
      B      25   0.4   P
      C       5   0.7   G
      D      20   0.8   P
      E      30   0.85  P
      F      11   1.2   G
      G       7   1.15  G
      H      15   0.8   P

  11.–19. Nearest Neighbor Example • [Scatter plot: the training instances A–H plotted in the (L, R) plane, with L (late payments, 0–30) on the horizontal axis and R (income/expenses) on the vertical axis; the Good instances (A, C, F, G) sit at low L (0–11), the Poor instances (B, D, E, H) at higher L (15–30).] • New, unlabelled instances to classify:

      Name   L    R     G/P
      I       6   1.15  ?
      J      22   0.45  ?
      K      15   1.2   ?

Each new point is plotted and given the label of its nearest labelled neighbor: I lands next to G and F and is labelled G; J lands near B and D and is labelled P; K is left marked “??”: under raw Euclidean distance its nearest neighbor is H (Poor), but after rescaling each feature to a comparable range (e.g. [0, 1]) its nearest neighbor is F (Good). • Distance Measure: Scaled distance

  20. k Nearest Neighbor • Determine a value of k • Calculate the distance between the new instance and all training instances • Sort the training instances by distance; select the k nearest • Identify the labels of the k nearest neighbors • Use a voting mechanism over the k nearest neighbors to determine the classification
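
A minimal sketch of this algorithm in Python/numpy, using the credit-rating data from slide 10 (the function name and k=3 are illustrative choices, not part of the lecture):

    import numpy as np
    from collections import Counter

    # Training data from the credit-rating example: features (L, R), labels Good/Poor.
    X_train = np.array([[ 0, 1.2 ], [25, 0.4 ], [ 5, 0.7 ], [20, 0.8 ],
                        [30, 0.85], [11, 1.2 ], [ 7, 1.15], [15, 0.8 ]])
    y_train = np.array(["G", "P", "G", "P", "P", "G", "G", "P"])

    def knn_classify(x, X, y, k=3):
        """Label x by majority vote among its k nearest training instances (Euclidean distance)."""
        dists = np.linalg.norm(X - x, axis=1)   # distance from x to every training instance
        nearest = np.argsort(dists)[:k]         # indices of the k closest instances
        return Counter(y[nearest]).most_common(1)[0][0]

    print(knn_classify(np.array([6, 1.15]), X_train, y_train, k=3))   # instance I -> 'G'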

  21. Issues: • What value of k? • How do we weight, scale, or select features? • How do we combine the neighbors’ votes?

  22. Selecting k • Just pick a number? • A percentage of the number of samples? • Cross-validation: • Split the data into a training set (with a held-out validation set), a development set, and a test set • Pick the k with the lowest cross-validation error • under n-fold cross-validation
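
A minimal sketch of picking k by n-fold cross-validation, assuming the knn_classify function and the X_train, y_train arrays from the earlier sketch are in scope; the fold count and candidate k values are illustrative.

    import numpy as np

    def cv_error(X, y, k, n_folds=4, seed=0):
        """Average classification error of k-NN over n cross-validation folds."""
        rng = np.random.default_rng(seed)
        folds = np.array_split(rng.permutation(len(y)), n_folds)
        errors = []
        for held_out in folds:
            train = np.setdiff1d(np.arange(len(y)), held_out)
            preds = np.array([knn_classify(X[i], X[train], y[train], k) for i in held_out])
            errors.append(np.mean(preds != y[held_out]))
        return float(np.mean(errors))

    # Pick the k with the lowest cross-validation error (odd values avoid ties).
    best_k = min([1, 3, 5], key=lambda k: cv_error(X_train, y_train, k))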

  23. Feature Values • Scale: • Remember credit rating • Features: income/outgo ratio [0.5,2] • Late payments: [0,30]; … • AGI: [8K,500K] • What happens with Euclidean distance? • Features with large magnitude dominate • Approach: • Rescale: i.e. normalize to [0,1]
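
A minimal sketch of that rescaling (min-max normalization to [0, 1], with the ranges taken from the training data; the function names are illustrative):

    import numpy as np

    def fit_minmax(X):
        """Record each feature's min and max on the training data."""
        return X.min(axis=0), X.max(axis=0)

    def rescale(X, lo, hi):
        """Map every feature into [0, 1] using the training-data ranges."""
        return (X - lo) / (hi - lo)

    # E.g. for the credit data, L in [0, 30] and R in [0.4, 1.2] end up on the
    # same [0, 1] scale, so neither feature dominates the Euclidean distance.
    # lo, hi = fit_minmax(X_train); X_scaled = rescale(X_train, lo, hi)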

  24. Feature Selection • Consider: • Large feature set (100) • Few relevant features (2) • What happens? • Trouble! • Differences in irrelevant features likely to dominate • k-NN sensitive to irrelevant attributes in high dimensional feature space • Can be useful, but feature selection is key
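
A small numeric illustration of that trouble (the numbers are made up): with 2 relevant features and 98 irrelevant random ones, an instance can look farther from a neighbor that is identical on what matters than from one that is genuinely different.

    import numpy as np
    rng = np.random.default_rng(0)

    # Two relevant features plus 98 irrelevant, randomly varying ones.
    a = np.concatenate([[0.1, 0.9], rng.random(98)])
    b = np.concatenate([[0.1, 0.9], rng.random(98)])   # identical to a on the relevant features
    c = np.concatenate([[0.8, 0.2], a[2:]])            # differs from a only on the relevant features

    print(np.linalg.norm(a - b))   # roughly 4: driven entirely by the irrelevant features
    print(np.linalg.norm(a - c))   # roughly 1: only the two relevant features differ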

  25. Feature Weighting • What if you want some features to be more important? • Or less important? • Reweight dimension i by a weight wi • This can increase or decrease the influence of the feature on that dimension • Set a weight to 0? • The corresponding feature is ignored • How can we set the weights? • Cross-validation ;-)

  26. Distance Measures • Euclidean distance: d(x, y) = sqrt(Σi (xi − yi)²) • Weighted Euclidean distance: d(x, y) = sqrt(Σi wi (xi − yi)²) • Cosine similarity: cos(x, y) = Σi xi yi / (sqrt(Σi xi²) · sqrt(Σi yi²))
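
The same three measures as straightforward numpy functions (a sketch; the weight vector mentioned in the final comment is just one plausible choice):

    import numpy as np

    def euclidean(x, y):
        """d(x, y) = sqrt(sum_i (x_i - y_i)^2)"""
        return np.sqrt(np.sum((x - y) ** 2))

    def weighted_euclidean(x, y, w):
        """d(x, y) = sqrt(sum_i w_i (x_i - y_i)^2)"""
        return np.sqrt(np.sum(w * (x - y) ** 2))

    def cosine_similarity(x, y):
        """cos(x, y) = x . y / (||x|| ||y||)"""
        return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

    # E.g. w = np.array([1/30**2, 1/0.8**2]) makes the weighted distance behave like
    # plain Euclidean distance over features divided by their (assumed) ranges.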

  27. Voting in kNN • Suppose we have found the k nearest neighbors • Let fi(x) be the class label of the i-th neighbor of x • δ(c, fi(x)) is 1 if fi(x) = c and 0 otherwise • Then g(c) = Σi δ(c, fi(x)): the number of neighbors with label c

  28. Voting in kNN • Alternatives: • Majority vote: c* = argmaxc g(c) • Weighted voting: neighbors have different weights • c* = argmaxc Σi wi δ(c, fi(x)) • Weighted voting allows using many training examples • Could use all samples, weighted by inverse distance
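
A minimal sketch of inverse-distance weighted voting, assuming the X_train, y_train arrays from the earlier sketch; the small epsilon is only there to avoid division by zero for exact matches.

    import numpy as np

    def knn_weighted(x, X, y, k=3, eps=1e-9):
        """Each of the k nearest neighbors votes for its label with weight 1 / distance."""
        dists = np.linalg.norm(X - x, axis=1)
        nearest = np.argsort(dists)[:k]
        votes = {}
        for i in nearest:
            votes[y[i]] = votes.get(y[i], 0.0) + 1.0 / (dists[i] + eps)
        return max(votes, key=votes.get)   # c* = argmax_c sum_i w_i * delta(c, f_i(x))

    # With k = len(X), every training sample votes, weighted by inverse distance.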

  29. kNN: Strengths • Advantages: • Simplicity (conceptual) • Efficiency (in training) • No work at all • Handles multi-class classification • Stability/robustness: average of many neighbor votes • Accuracy: with large data set

  30. kNN: Weaknesses • Disadvantages: • (In)efficiency at testing time • Distance computation for all N at test time • Lack of theoretical basis • Sensitivity to irrelevant features • Distance metrics unclear on non-numerical/binary values

  31. HW #2 • Decision trees • Text classification: • Same data, categories as Q5 (Ling570: HW #8) • Features: Words • Build and evaluate decision trees with Mallet • Build and evaluate decision trees with your own code

  32. Building your own DT learner • Nodes test a single feature • Features are binary: • Present / not present • The DT is binary branching • Selection criterion: • Information gain • Early stopping heuristics: split only while • the information gain is above a threshold • and the tree depth is below a threshold
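
A minimal sketch of the information-gain computation for one present/absent feature, with documents represented as sets of word features; the helper names are illustrative, not something the assignment requires.

    import math
    from collections import Counter

    def entropy(labels):
        """H(Y) = -sum_c p(c) log2 p(c) over the class distribution."""
        n = len(labels)
        return -sum((cnt / n) * math.log2(cnt / n) for cnt in Counter(labels).values())

    def info_gain(docs, labels, feature):
        """Gain from splitting on whether `feature` is present in the document."""
        with_f  = [lab for doc, lab in zip(docs, labels) if feature in doc]
        without = [lab for doc, lab in zip(docs, labels) if feature not in doc]
        n = len(labels)
        remainder = (len(with_f) / n) * entropy(with_f) + (len(without) / n) * entropy(without)
        return entropy(labels) - remainder

    # At each node: pick the feature with the highest gain; stop splitting when the
    # best gain drops below the threshold or the depth limit is reached.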

  33. Efficiency • Caution: • There are LOTS of features • You’ll be testing whether or not a feature is present for each class repeatedly • (c,f) or (c,~f) • Think about ways to do this efficiently
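
One possibility (a sketch, not the required approach) is to precompute a class-by-feature count table in a single pass, so the counts for (c, f) and (c, ~f) become simple lookups afterwards:

    from collections import defaultdict

    def build_counts(docs, labels):
        """count[c][f] = number of class-c documents containing feature f."""
        count = defaultdict(lambda: defaultdict(int))
        class_total = defaultdict(int)
        for doc, c in zip(docs, labels):
            class_total[c] += 1
            for f in set(doc):                 # features are binary: present or not
                count[c][f] += 1
        return count, class_total

    # n(c, f)  = count[c][f]
    # n(c, ~f) = class_total[c] - count[c][f]   # absent counts come for free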

  34. Efficient Implementations • Classification cost: • Finding the nearest neighbor is O(n): • Compute the distance between the unknown instance and all stored instances • Compare the distances • Problematic for large data sets • Alternative: • Use a binary search tree over the instances (a k-d tree, next slide) to reduce lookup to O(log n)

  35. Efficient Implementation: K-D Trees • Divide instances into sets based on features • Binary branching: e.g., feature > value • A tree of depth d has 2^d leaves; with 2^d ≈ n instances, the path length is d = O(log n) • To split cases into sets: • If there is one element in the set, stop • Otherwise pick a feature to split on • Find the average position of the two middle objects on that dimension • Split the remaining objects based on that average position • Recursively split the subsets
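
A minimal sketch that uses scipy's k-d tree rather than building the tree by hand, assuming the X_train, y_train arrays from the earlier sketch are in scope:

    import numpy as np
    from scipy.spatial import KDTree

    tree = KDTree(X_train)          # build once over the (ideally rescaled) training instances

    def knn_classify_kdtree(x, k=3):
        """k-NN lookup via the k-d tree instead of scanning all n instances."""
        dists, idx = tree.query(x, k=k)                  # k nearest distances and indices
        labels, counts = np.unique(y_train[np.atleast_1d(idx)], return_counts=True)
        return labels[np.argmax(counts)]

    # print(knn_classify_kdtree(np.array([6, 1.15])))    # instance I -> 'G'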

  36. K-D Trees: Classification • [Decision diagram: a k-d tree over the credit-rating data; internal nodes test thresholds such as R > 0.825?, L > 17.5?, L > 9?, R > 0.6?, R > 0.75?, R > 1.175?, R > 1.025?, and the Yes/No branches lead to leaves labelled Good or Poor.]

  37. Efficient Implementation: Parallel Hardware • Classification cost: • # of distance computations: constant time with O(n) processors • Cost of finding the closest: • Compute pairwise minima, successively (a reduction tree) • O(log n) time
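
On a single machine the same idea shows up as vectorization: all n distances are computed in one data-parallel step and then reduced with a minimum (a sketch, again assuming X_train from the earlier examples):

    import numpy as np

    def nearest_index(x, X):
        """All n distances in one vectorized (data-parallel) step, then a single reduction."""
        dists = np.linalg.norm(X - x, axis=1)   # one distance per training instance, computed together
        return int(np.argmin(dists))            # the min-reduction step

    # nearest_index(np.array([6, 1.15]), X_train) -> 6, the index of instance G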
