
Nearest Neighbors in High-Dimensional Data: The Emergence and Influence of Hubs



1. Nearest Neighbors in High-Dimensional Data: The Emergence and Influence of Hubs
Miloš Radovanović¹, Alexandros Nanopoulos², Mirjana Ivanović¹
¹Department of Mathematics and Informatics, Faculty of Science, University of Novi Sad, Serbia
²Institute of Computer Science, University of Hildesheim, Germany

2. Introduction
• The curse of dimensionality: distance concentration
• The tendency of distances between all pairs of points in high-dimensional data to become almost equal
• Affects the meaningfulness of nearest neighbors, indexing, classification, regression [Beyer 1999, Aggarwal 2001, François 2007]
• We study a related phenomenon which concerns k-NN directed graphs

3. k-occurrences
• Nk(x), the number of k-occurrences of point x, is the number of times x occurs among the k nearest neighbors of all other points in a data set
• Nk(x) is the in-degree of node x in the k-NN digraph (computed in the sketch below)
• It was observed that the distribution of Nk can become skewed, resulting in the emergence of hubs – points with high Nk
• Music retrieval [Aucouturier 2007]
• Speech recognition [Doddington 1998]
• Fingerprint identification [Hicklin 2005]
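The definition maps directly to a small computation. The slides contain no code, so this is a minimal Python sketch; the function name k_occurrences and the brute-force approach are illustrative assumptions, not from the talk.

```python
import numpy as np

def k_occurrences(X, k):
    """N_k(x): how many times each point appears among the k nearest
    neighbors of the other points, i.e., the in-degree in the k-NN digraph."""
    n = X.shape[0]
    # All pairwise Euclidean distances; the diagonal is set to infinity
    # so a point is never counted as its own neighbor.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)
    Nk = np.zeros(n, dtype=int)
    for i in range(n):
        for j in np.argsort(D[i])[:k]:  # the k nearest neighbors of point i
            Nk[j] += 1                  # point j gains one k-occurrence
    return Nk

X = np.random.rand(500, 100)  # 500 points in 100 dimensions
N10 = k_occurrences(X, k=10)
print(N10.mean(), N10.max())  # the mean is exactly k; hubs have N_k far above it
```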

4. k-occurrences
• What causes the skewness of Nk?
• An artefact of the data?
• Are some songs more similar to others?
• Do some people have fingerprints or voices that are harder to distinguish from other people's?
• Specifics of the modeling algorithms?
• An inadequate choice of features?
• Or something more general?

5.–10. [Figure-only slides; no text preserved in the transcript]

11. The Causes of Skewness
• Distance concentration: the ratio between a measure of spread (e.g., the standard deviation) and a measure of magnitude (e.g., the expectation) of distances converges to 0 as dimensionality d increases (simulated below)
• High-dimensional data points approximately lie on a sphere centered at the data set mean [Beyer 1999, Aggarwal 2001]
• The distribution of distances to the data set mean always has non-negligible variance [Demartines 1994, François 2007]
• Hence the existence of points closer to the data set mean is expected, even in high dimensions
• Points closer to the data set mean tend to be closer to all other points (regardless of dimensionality), and this tendency is amplified by high dimensionality
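A minimal simulation of the concentration effect, assuming i.i.d. standard normal data; the setup is illustrative, not the one used in the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
for d in (3, 20, 100, 1000):
    X = rng.standard_normal((200, d))
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    dists = D[np.triu_indices(200, k=1)]   # all pairwise distances
    print(d, dists.std() / dists.mean())   # spread/magnitude ratio shrinks as d grows
```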

12. Skewness in Real Data
• Important factors for real data: dependent attributes and grouping (clustering)
• 50 data sets from well-known repositories (UCI, Kent Ridge)
• Euclidean and cosine distances, as appropriate
• Measurements (reproduced in the sketch below):
• SN10 – standardized 3rd moment (skewness) of N10
• The Spearman correlation between N10 and distance from the data set mean
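Both measurements are easy to reproduce with SciPy; this sketch reuses k_occurrences from the earlier snippet and runs on synthetic data, not one of the 50 data sets.

```python
import numpy as np
from scipy.stats import skew, spearmanr

X = np.random.rand(1000, 50)
N10 = k_occurrences(X, k=10)        # from the earlier sketch

S_N10 = skew(N10)                   # standardized 3rd moment of N10
dist_to_mean = np.linalg.norm(X - X.mean(axis=0), axis=1)
rho, _ = spearmanr(N10, dist_to_mean)
print(S_N10, rho)  # positive skew; the correlation is typically strongly negative
```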

13. 1. Dependent Attributes
• Skewness of Nk depends on intrinsic dimensionality
• dmle – maximum-likelihood estimate of intrinsic dimensionality
• Over 50 data sets: Corr(d, SN10) = 0.62, Corr(dmle, SN10) = 0.80
• Shuffling the elements of each attribute raises the intrinsic dimensionality to the embedding dimensionality while keeping the attribute distributions [François 2007], as sketched below
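The shuffling step itself is simple: independently permuting each column preserves the marginal distributions while destroying dependencies between attributes. The function name is mine.

```python
import numpy as np

def shuffle_attributes(X, seed=0):
    """Independently permute each attribute (column): marginals are kept,
    dependencies are destroyed, raising intrinsic dimensionality toward
    the embedding dimensionality."""
    rng = np.random.default_rng(seed)
    Xs = X.copy()
    for j in range(Xs.shape[1]):
        rng.shuffle(Xs[:, j])  # in-place permutation of column j
    return Xs
```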

14. [Figure-only slide]

15. 1. Dependent Attributes
• The effect of dimensionality reduction [figure]

16. 2. Grouping (Clustering)
• Hubs are in the proximity of cluster centers
• Measurement: the Spearman correlation between N10 and distance from the closest cluster mean (see the sketch below)
• K-means clustering, with the number of clusters chosen to maximize this correlation
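A sketch of the measurement using scikit-learn's K-means; the fixed number of clusters is an illustrative assumption (the slide tunes it), and k_occurrences is from the earlier snippet.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.cluster import KMeans

X = np.random.rand(1000, 50)
N10 = k_occurrences(X, k=10)

km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X)
# distance from each point to its nearest cluster centroid
d_closest = np.linalg.norm(
    X[:, None, :] - km.cluster_centers_[None, :, :], axis=-1).min(axis=1)
rho, _ = spearmanr(N10, d_closest)
print(rho)  # expected to be negative: hubs lie near cluster centers
```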

17. [Figure-only slide]

18.–19. Hubs and Outliers
• In high dimensions, points with low Nk can be considered distance-based outliers
• They are far away from the other points in the data set / their cluster
• Their existence is caused by high dimensionality (k = 20)

20. Hubs and Outliers
• Hubs can even be considered probabilistic outliers [figure: hubs vs. outliers]

21. Classification
• Based on labels, k-occurrences can be distinguished into:
• “Bad” k-occurrences, BNk(x), where the neighbor's label differs from that of x
• “Good” k-occurrences, GNk(x), where the labels agree (see the sketch below)
• Nk(x) = BNk(x) + GNk(x)
• “Bad” hubs can appear
• How do “bad” hubs originate?
• What is the influence of (“bad”) hubs on classification algorithms?
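Splitting Nk by label agreement is straightforward; a brute-force sketch with an assumed function name:

```python
import numpy as np

def good_bad_k_occurrences(X, y, k):
    """GN_k counts k-occurrences where the neighbor's label matches,
    BN_k those where it differs; N_k = GN_k + BN_k."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)
    GN = np.zeros(len(X), dtype=int)
    BN = np.zeros(len(X), dtype=int)
    for i in range(len(X)):
        for j in np.argsort(D[i])[:k]:
            if y[i] == y[j]:
                GN[j] += 1  # "good" occurrence: labels agree
            else:
                BN[j] += 1  # "bad" occurrence: labels disagree
    return GN, BN
```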

22. How do “bad” hubs originate?
• Measurements:
• The normalized sum of all BN10 in the data set
• The correlation between BN10 and N10
• CAV – Cluster Assumption Violation coefficient (computed below)
• Cluster Assumption (CA): most pairs of points in a cluster should be of the same class [Chapelle 2006]
• CAV = a / (a + b), where a = number of pairs of points with different classes in the same cluster, and b = number of pairs of points with the same class and cluster
• K-means clustering
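CAV follows directly from its definition; an O(n²) sketch over cluster assignments:

```python
def cav(y, clusters):
    """Cluster Assumption Violation: a / (a + b), counted over pairs of
    points in the same cluster (a: different class, b: same class)."""
    a = b = 0
    n = len(y)
    for i in range(n):
        for j in range(i + 1, n):
            if clusters[i] == clusters[j]:
                if y[i] == y[j]:
                    b += 1
                else:
                    a += 1
    return a / (a + b) if (a + b) else 0.0
```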

23. How do “bad” hubs originate?
• Observations and answers:
• High dimensionality and skewness of Nk do not automatically induce “badness”: no correlation between the normalized BN10 sum and d, dmle, SN10
• “Bad” hubs originate from a combination of high dimensionality and violation of the CA
• Corr(normalized BN10 sum, CAV) = 0.85
• Corr(dmle, ·) = 0.39

24. Influence on the k-NN Classifier
• “Bad” hubs provide erroneous class information to many other points
• We introduce standardized “bad” hubness: hB(x) = (BNk(x) – μBNk) / σBNk
• During majority voting, the vote of each neighbor x is weighted by exp(–hB(x)) (see the sketch below)
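A sketch of the resulting weighted voting rule, reusing good_bad_k_occurrences from the earlier snippet; the epsilon guard and the function name are mine.

```python
import numpy as np

def hubness_weighted_knn(X_train, y_train, X_test, k):
    """k-NN where neighbor x votes with weight exp(-h_B(x)),
    h_B(x) = (BN_k(x) - mean) / std being standardized bad hubness."""
    _, BN = good_bad_k_occurrences(X_train, y_train, k)
    hB = (BN - BN.mean()) / (BN.std() + 1e-12)
    w = np.exp(-hB)  # "bad" hubs get exponentially small votes
    preds = []
    for x in X_test:
        d = np.linalg.norm(X_train - x, axis=1)
        nn = np.argsort(d)[:k]
        classes = np.unique(y_train[nn])
        votes = [w[nn[y_train[nn] == c]].sum() for c in classes]
        preds.append(classes[int(np.argmax(votes))])
    return np.array(preds)
```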

25. [Figure-only slide]

26. Influence on SVMs
• RBF (Gaussian) kernel: K(x, y) = exp(–γ ||x – y||²)
• Since the kernel is a monotonically decreasing function of Euclidean distance, Nk, BNk, GNk in kernel space are exactly the same as in the original space
• We progressively remove points from the training sets (10-fold CV) in order of decreasing BNk, and at random (sketched below)
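The removal experiment can be sketched as follows; the fractions, k, and CV setup are illustrative, and good_bad_k_occurrences is from the earlier snippet.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def removal_curve(X, y, k=10, fractions=(0.0, 0.05, 0.1, 0.2)):
    """Accuracy after dropping the top-f fraction of training points
    ranked by decreasing BN_k (compare against dropping at random)."""
    _, BN = good_bad_k_occurrences(X, y, k)
    order = np.argsort(-BN)  # worst "bad" hubs first
    scores = []
    for f in fractions:
        keep = order[int(f * len(X)):]
        clf = SVC(kernel="rbf", gamma="scale")
        scores.append(cross_val_score(clf, X[keep], y[keep], cv=10).mean())
    return scores
```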

27. “Bad” hubs can be good support vectors [figure]

28. Influence on AdaBoost + CART
• AdaBoost assigns weights to training points, to be considered by the weak learners
• Weights are initially equal (1/n)
• Both hubs and outliers can harm AdaBoost
• Standardized hubness: h(x) = (Nk(x) – μNk) / σNk
• Set the initial weight of each training point x to 1/(1 + |h(x)|), normalized by the sum over all x (see the sketch below)
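A sketch with scikit-learn, where sample_weight passed to fit stands in for AdaBoost's initial weights; the parameter name estimator assumes scikit-learn ≥ 1.2 (older versions call it base_estimator), and k_occurrences is from the earlier snippet.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

def hubness_aware_adaboost(X, y, k=20):
    Nk = k_occurrences(X, k)                    # from the earlier sketch
    h = (Nk - Nk.mean()) / (Nk.std() + 1e-12)   # standardized hubness
    w = 1.0 / (1.0 + np.abs(h))                 # down-weight hubs and outliers
    w /= w.sum()                                # normalize over training points
    clf = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=3))
    return clf.fit(X, y, sample_weight=w)       # weights seed the boosting rounds
```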

29. [Figure-only slide: results for k = 20 and k = 40]

30. [Figure-only slide: results for k = 20 and k = 40]

31. Clustering
• Distance-based clustering objectives: minimize within-cluster distance, maximize between-cluster distance
• Skewness of Nk affects both objectives
• Outliers do not cluster well because of high within-cluster distance
• Hubs also do not cluster well, but because of low between-cluster distance

32. Clustering
• Silhouette coefficient (SC) for the i-th point (see the sketch below):
• ai = average distance to points from its cluster (within-cluster distance)
• bi = minimum average distance to points from other clusters (between-cluster distance)
• SCi = (bi – ai) / max(ai, bi)
• In the range [–1, 1]; higher is better
• The SC of a set of points is the average of SCi over every point i in the set
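The definition translates directly into code; a sketch assuming every cluster has at least two points (equivalent in spirit to scikit-learn's silhouette_samples).

```python
import numpy as np

def silhouette(X, labels):
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    sc = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False                       # exclude the point itself
        a = D[i, same].mean()                 # within-cluster distance
        b = min(D[i, labels == c].mean()      # nearest other cluster
                for c in np.unique(labels) if c != labels[i])
        sc[i] = (b - a) / max(a, b)
    return sc                                 # average over a set for its SC
```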

33. [Figure-only slide]

34. Information Retrieval
• Retrieving the documents most similar to a query document
• Hubs harm precision
• For a document x from data set D and query q, with distance d(x, q) in [0, 1], we increase d(x, q) for every x such that h(x) > 2 as follows: [adjustment formula shown on the slide]

35. Information Retrieval
• Bag-of-words representation
• Cosine distance
• Leave-one-out cross-validation [figures: k = 10, k = 1]

36. Conclusion
• Skewness of Nk – an under-studied phenomenon that can have a strong impact
• Future work:
• Theoretical study of the impact on distance-based ML
• Examining the possible impact on non-distance-based ML
• Seeding iterative clustering algorithms
• Outlier detection
• Reverse k-NN queries
• Time series
• Using the skewness of Nk to estimate intrinsic dimensionality

37. Thank You
