Nearest Neighbors in High-Dimensional Data: The Emergence and Influence of Hubs
Miloš Radovanović1, Alexandros Nanopoulos2, Mirjana Ivanović1
1Department of Mathematics and Informatics, Faculty of Science, University of Novi Sad, Serbia
2Institute of Computer Science, University of Hildesheim, Germany
Introduction
• The curse of dimensionality
• Distance concentration
  • The tendency of distances between all pairs of points in high-dimensional data to become almost equal
  • Affects meaningfulness of nearest neighbors, indexing, classification, regression
  • [Beyer 1999, Aggarwal 2001, François 2007]
• We study a related phenomenon which concerns k-NN directed graphs
k-occurrences
• Nk(x), the number of k-occurrences of point x, is the number of times x occurs among the k nearest neighbors of all other points in a data set
• Nk(x) is the in-degree of node x in the k-NN digraph
• It was observed that the distribution of Nk can become skewed, resulting in the emergence of hubs – points with high Nk
  • Music retrieval [Aucouturier 2007]
  • Speech recognition [Doddington 1998]
  • Fingerprint identification [Hicklin 2005]
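As a concrete illustration, here is a minimal Python sketch (not the authors' code; the helper name k_occurrences is my own) of counting Nk with scikit-learn:

```python
# Minimal sketch (not the authors' code): N_k(x) counted as the in-degree
# of x in the k-NN digraph, using scikit-learn's NearestNeighbors.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def k_occurrences(X, k):
    # Ask for k + 1 neighbors because each point is its own nearest neighbor.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)
    neighbors = idx[:, 1:]                  # drop the self-neighbor in column 0
    # N_k(x) = number of times x appears in other points' k-NN lists
    return np.bincount(neighbors.ravel(), minlength=len(X))

X = np.random.randn(2000, 100)              # illustrative data: i.i.d. Gaussian
N_k = k_occurrences(X, k=10)
print(N_k.mean(), N_k.max())                # mean is exactly k; hubs have N_k >> k
```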
k-occurrences
• What causes the skewness of Nk?
  • An artefact of the data?
    • Are some songs more similar to others?
    • Do some people have fingerprints or voices that are harder to distinguish from other people's?
  • Specifics of the modeling algorithms?
  • An inadequate choice of features?
  • Something more general?
The Causes of Skewness
• Distance concentration
  • The ratio between a measure of spread (e.g., Std) and a measure of magnitude (e.g., E) of distances converges to 0 as d increases
• High-dimensional data points approximately lie on a sphere centered at the data set mean [Beyer 1999, Aggarwal 2001]
• The distribution of distances to the data set mean always has non-negligible variance [Demartines 1994, François 2007]
  • The existence of points closer to the data set mean is expected, even in high dimensions
• Points closer to the data set mean tend to be closer to all other points (regardless of dimensionality)
  • This tendency is amplified by high dimensionality
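Both effects can be reproduced with a small simulation. This sketch (my own, with illustrative sample sizes) prints the Std/E ratio of pairwise distances and the rank correlation between distance-to-mean and average distance to all other points:

```python
# Illustrative sketch: as d grows, Std/E of pairwise distances shrinks
# (concentration), yet points closer to the data-set mean remain closer,
# on average, to all other points.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
for d in (3, 20, 100):
    X = rng.standard_normal((1000, d))
    D = pdist(X)                                      # condensed pairwise distances
    dist_to_mean = np.linalg.norm(X - X.mean(axis=0), axis=1)
    avg_dist = squareform(D).mean(axis=1)             # avg distance to all others
    rho, _ = spearmanr(dist_to_mean, avg_dist)
    print(f"d={d:4d}  Std/E = {D.std() / D.mean():.3f}  "
          f"corr(dist to mean, avg dist) = {rho:.2f}")
```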
Skewness in Real Data
• Important factors for real data:
  • Dependent attributes
  • Grouping (clustering)
• 50 data sets
  • From well-known repositories (UCI, Kent Ridge)
  • Euclidean and cosine distances, as appropriate
• Measurements:
  • SN10 – standardized 3rd moment of N10
  • The Spearman correlation between N10 and distance from the data set mean
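A sketch of the two measurements, assuming N10 is supplied by the k_occurrences helper from the earlier sketch:

```python
# Sketch of the two measurements; N10 comes from the k_occurrences sketch above.
import numpy as np
from scipy.stats import skew, spearmanr

def skewness_measurements(X, N10):
    S_N10 = skew(N10)                        # standardized 3rd moment of N_10
    dist_to_mean = np.linalg.norm(X - X.mean(axis=0), axis=1)
    rho, _ = spearmanr(N10, dist_to_mean)    # typically negative: hubs lie
    return S_N10, rho                        # closer to the data-set mean
```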
1. Dependent Attributes
• Skewness of Nk depends on intrinsic dimensionality
  • dmle – MLE estimate of intrinsic dimensionality
  • Over 50 data sets: Corr(d, SN10) = 0.62, Corr(dmle, SN10) = 0.80
• Shuffle the elements of each attribute, raising intrinsic dimensionality to the embedding dimensionality while keeping attribute distributions [François 2007]
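The shuffling step can be sketched as an independent permutation of each column (my own sketch):

```python
# Sketch: permute each attribute (column) independently. Marginal attribute
# distributions are preserved; dependencies between attributes are destroyed,
# pushing intrinsic dimensionality toward the embedding dimensionality d.
import numpy as np

def shuffle_attributes(X, seed=0):
    rng = np.random.default_rng(seed)
    X_shuffled = X.copy()
    for j in range(X_shuffled.shape[1]):
        rng.shuffle(X_shuffled[:, j])        # in-place permutation of column j
    return X_shuffled
```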
1. Dependent Attributes
• The effect of dimensionality reduction
2. Grouping (Clustering)
• Hubs are in the proximity of cluster centers
• Measurement:
  • The Spearman correlation between N10 and distance from the closest cluster mean
  • K-means clustering
  • The number of clusters is chosen to maximize the correlation
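A hedged sketch of this measurement using scikit-learn's KMeans; the search over the number of clusters that maximizes the correlation is assumed to be an outer loop over this function:

```python
# Sketch of the grouping measurement with scikit-learn's KMeans.
import numpy as np
from sklearn.cluster import KMeans
from scipy.stats import spearmanr

def corr_nk_cluster_mean(X, N10, n_clusters):
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
    dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
    rho, _ = spearmanr(N10, dist)            # negative rho: hubs sit near
    return rho                               # their cluster centers
```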
Hubs and Outliers
• In high dimensions, points with low Nk can be considered distance-based outliers
  • They are far away from other points in the data set / their cluster
  • Their existence is caused by high dimensionality
(k = 20)
Hubs and Outliers
• Hubs can even be considered probabilistic outliers
[Figure: hubs vs. outliers]
Classification
• Based on labels, k-occurrences can be distinguished into:
  • “Bad” k-occurrences, BNk(x)
  • “Good” k-occurrences, GNk(x)
  • Nk(x) = BNk(x) + GNk(x)
• “Bad” hubs can appear
• How do “bad” hubs originate?
• What is the influence of (“bad”) hubs on classification algorithms?
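A minimal sketch (my own) of splitting k-occurrences by label agreement: an occurrence of x in the neighbor list of y counts as “bad” if their labels differ, “good” otherwise:

```python
# Sketch: split k-occurrences into "good" (same label) and "bad" (different
# label) counts, so that N_k = GN_k + BN_k.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def good_bad_k_occurrences(X, y, k):
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)
    neighbors = idx[:, 1:]                   # drop the self-neighbor
    GN = np.zeros(len(X), dtype=int)
    BN = np.zeros(len(X), dtype=int)
    for i in range(len(X)):                  # point i queries its k neighbors
        for x in neighbors[i]:
            if y[x] == y[i]:
                GN[x] += 1                   # "good" k-occurrence of x
            else:
                BN[x] += 1                   # "bad" k-occurrence of x
    return GN, BN
```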
How do “bad” hubs originate?
• Measurements:
  • The normalized sum of all BN10 in the data set
  • The correlation between BN10 and N10
  • CAV – Cluster Assumption Violation coefficient
• Cluster Assumption (CA): most pairs of points in a cluster should be of the same class [Chapelle 2006]
• CAV = a / (a + b)
  • a = number of pairs of points with different classes, same cluster
  • b = number of pairs of points with the same class and cluster
• K-means clustering
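A direct, O(n²) sketch of CAV straight from the definition above (my own code):

```python
# Sketch of the Cluster Assumption Violation coefficient:
# a = same-cluster pairs with different classes,
# b = same-cluster pairs with the same class, CAV = a / (a + b).
from itertools import combinations

def cav(class_labels, cluster_labels):
    a = b = 0
    for i, j in combinations(range(len(class_labels)), 2):
        if cluster_labels[i] != cluster_labels[j]:
            continue                          # only same-cluster pairs count
        if class_labels[i] != class_labels[j]:
            a += 1
        else:
            b += 1
    return a / (a + b)
```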
How do “bad” hubs originate?
• Observations and answers:
  • High dimensionality and skewness of Nk do not automatically induce “badness”
    • No correlation between the normalized sum of BN10 and d, dmle, or SN10
  • “Bad” hubs originate from a combination of high dimensionality and violation of the CA
    • Corr(normalized sum of BN10, CAV) = 0.85
    • Corr(dmle, Corr(BN10, N10)) = 0.39
Influence on the k-NN Classifier
• “Bad” hubs provide erroneous class information to many other points
• We introduce standardized “bad” hubness: hB(x) = (BNk(x) – μBNk) / σBNk
• During majority voting, the vote of each neighbor x is weighted by exp(–hB(x))
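A sketch of the weighted vote, assuming BNk has already been computed on the training set (function names are my own):

```python
# Sketch: k-NN majority vote where each neighbor's vote is weighted by
# exp(-h_B(x)), h_B(x) = (BN_k(x) - mean) / std, so "bad" hubs count less.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def predict_weighted_knn(X_train, y_train, BNk, X_test, k):
    hB = (BNk - BNk.mean()) / BNk.std()      # standardized "bad" hubness
    w = np.exp(-hB)                          # down-weight bad hubs
    nn = NearestNeighbors(n_neighbors=k).fit(X_train)
    _, idx = nn.kneighbors(X_test)
    preds = []
    for neigh in idx:
        votes = {}
        for x in neigh:
            votes[y_train[x]] = votes.get(y_train[x], 0.0) + w[x]
        preds.append(max(votes, key=votes.get))  # weighted majority vote
    return np.array(preds)
```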
Influence on SVMs
• RBF (Gaussian) kernel: K(x, y) = exp(–γ||x–y||2)
• Nk, BNk, GNk in kernel space are exactly the same as in the original space
  • The kernel-space distance, ||φ(x) – φ(y)||2 = 2 – 2K(x, y), is a monotonically increasing function of ||x – y||, so nearest-neighbor rankings are preserved
• We progressively remove points from training sets (10-fold CV) in order of decreasing BNk, and at random
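The removal experiment can be sketched as follows (my own simplification: a single train/test split rather than the 10-fold CV used on the slide):

```python
# Sketch: drop training points in order of decreasing BN_k versus at random,
# retrain an RBF-kernel SVM, and compare test accuracy.
import numpy as np
from sklearn.svm import SVC

def removal_curve(X_tr, y_tr, X_te, y_te, BNk, fractions, seed=0):
    rng = np.random.default_rng(seed)
    orders = {"by BN_k": np.argsort(-BNk),            # worst "bad" hubs first
              "random": rng.permutation(len(X_tr))}
    for name, order in orders.items():
        for f in fractions:
            keep = order[int(f * len(X_tr)):]         # remove the first f*n points
            clf = SVC(kernel="rbf", gamma="scale").fit(X_tr[keep], y_tr[keep])
            print(name, f, clf.score(X_te, y_te))
```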
“Bad” hubs can be good support vectors:
Influence on AdaBoost + CART
• AdaBoost assigns weights to training points, to be considered by weak learners
  • Weights are initially equal (1/n)
• Both hubs and outliers can harm AdaBoost
• Standardized hubness: h(x) = (Nk(x) – μNk) / σNk
• Set the initial weight of each training point x to 1/(1 + |h(x)|), normalized by the sum over all x
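A sketch of this initialization, assuming scikit-learn's AdaBoostClassifier, which accepts initial weights through the sample_weight argument of fit:

```python
# Sketch of hubness-aware initialization for AdaBoost + CART.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

def fit_hubness_aware_adaboost(X_train, y_train, Nk):
    h = (Nk - Nk.mean()) / Nk.std()          # standardized hubness h(x)
    w = 1.0 / (1.0 + np.abs(h))              # down-weight hubs and outliers
    w /= w.sum()                             # normalize over training points
    # CART-style weak learner ("base_estimator" in older scikit-learn versions)
    clf = AdaBoostClassifier(estimator=DecisionTreeClassifier())
    return clf.fit(X_train, y_train, sample_weight=w)
```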
[Figures: results for k = 20 and k = 40]
Clustering
• Distance-based clustering objectives:
  • Minimize within-cluster distance
  • Maximize between-cluster distance
• Skewness of Nk affects both objectives
  • Outliers do not cluster well because of high within-cluster distance
  • Hubs also do not cluster well, but because of low between-cluster distance
Clustering
• Silhouette coefficient (SC): for the i-th point
  • ai = average distance to points from its cluster (within-cluster distance)
  • bi = minimum average distance to points from other clusters (between-cluster distance)
  • SCi = (bi – ai) / max(ai, bi)
  • In the range [–1, 1]; higher is better
• The SC for a set of points is the average of SCi over every point i in the set
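scikit-learn's silhouette_samples implements exactly the per-point SCi above; this sketch (my own, with illustrative percentile cutoffs) compares the average SC of hubs and of low-Nk points:

```python
# Sketch: average silhouette of hubs (high N_k) vs. low-N_k points;
# percentile cutoffs are an illustrative choice, not from the slides.
import numpy as np
from sklearn.metrics import silhouette_samples

def sc_of_hubs_and_antihubs(X, cluster_labels, Nk, pct=10):
    sc = silhouette_samples(X, cluster_labels)
    hubs = Nk >= np.percentile(Nk, 100 - pct)
    low = Nk <= np.percentile(Nk, pct)
    return sc[hubs].mean(), sc[low].mean()   # both groups tend to score low
```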
Information Retrieval
• Retrieving documents most similar to a query document
• Hubs harm precision
• For a document x from data set D and a query q, with distance d(x, q) in [0, 1], we increase d(x, q) for every x such that h(x) > 2
Information Retrieval
• Bag-of-words representation
• Cosine distance
• Leave-one-out cross-validation
[Figures: results for k = 10 and k = 1]
Conclusion
• Skewness of Nk – an under-studied phenomenon that can have a strong impact
• Future work:
  • Theoretical study of the impact on distance-based ML
  • Examine the possible impact on non-distance-based ML
  • Seeding iterative clustering algorithms
  • Outlier detection
  • Reverse k-NN queries
  • Time series
  • Using the skewness of Nk to estimate intrinsic dimensionality
Thank You