Nearest Neighbors in High-Dimensional Data: The Emergence and Influence of Hubs

Miloš Radovanović (1), Alexandros Nanopoulos (2), Mirjana Ivanović (1)

(1) Department of Mathematics and Informatics, Faculty of Science, University of Novi Sad, Serbia

(2) Institute of Computer Science, University of Hildesheim, Germany

- The curse of dimensionality
- Distance concentration
- The tendency of distances between all pairs of points in high-dimensional data to become almost equal
- Affects meaningfulness of nearest neighbors, indexing, classification, regression
- [Beyer 1999, Aggarwal 2001, François 2007]

- We study a related phenomenon which concerns k-NN directed graphs

ICML'09 Miloš Radovanović

- Nk(x), the number of k-occurrences of point x, is the number of times x occurs among k nearest neighbors of all other points in a data set
- Nk(x) is the in-degree of node x in the k-NN digraph
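
The definition above can be sketched in a few lines of plain Python. This is only an illustration, assuming Euclidean distance and a brute-force neighbor search; the function name `k_occurrences` and the toy points are hypothetical and not taken from the slides.

```python
import math

def k_occurrences(points, k):
    """Return N_k for every point: how often it appears in other points' k-NN lists."""
    n = len(points)
    counts = [0] * n
    for i in range(n):
        # Rank all other points by Euclidean distance from point i.
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(points[i], points[j]),
        )
        # Each of the k nearest neighbors gains one incoming edge i -> j
        # in the k-NN digraph.
        for j in others[:k]:
            counts[j] += 1
    return counts

# Toy data (hypothetical): four points on a line, no distance ties.
points = [(0.0, 0.0), (1.0, 0.0), (2.5, 0.0), (10.0, 0.0)]
print(k_occurrences(points, k=1))  # [1, 2, 1, 0]
```

Every point contributes exactly k outgoing edges, so the counts always sum to n·k; skewness of Nk means that this fixed total is distributed very unevenly.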

- It was observed that the distribution of Nk can become skewed, resulting in the emergence of hubs – points with high Nk
- Music retrieval [Aucouturier 2007]
- Speech recognition [Doddington 1998]
- Fingerprint identification [Hicklin 2005]

- What caused the skewness of Nk?
- Artefact of data?
- Are some songs more similar to others?
- Do some people have fingerprints or voices that are harder to distinguish from other people’s?

- Specifics of modeling algorithms?
- Inadequate choice of features?

- Something more general?

- Distance concentration
- Ratio between a measure of spread (e.g., standard deviation) and a measure of magnitude (e.g., expected value) of distances converges to 0 as dimensionality d increases
- High-dimensional data points lie approximately on a sphere centered at the data set mean [Beyer 1999, Aggarwal 2001]
- However, the distribution of distances to the data set mean always has non-negligible variance [Demartines 1994, François 2007]
- Points closer to the data set mean are therefore expected to exist, even in high dimensions

- Points closer to the data set mean tend to be closer to all other points (regardless of dimensionality)
- This tendency is amplified by high dimensionality
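
A small simulation can make both statements about distances to the data set mean concrete. The sketch below assumes i.i.d. standard normal data, which is a choice made here for illustration and not a setting stated on the slides; `distance_stats` is a hypothetical helper name.

```python
import math
import random
import statistics

def distance_stats(d, n=200, seed=0):
    """Mean and standard deviation of distances to the data set mean
    for n i.i.d. standard normal points in d dimensions (assumed model)."""
    rng = random.Random(seed)
    points = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    centroid = [sum(col) / n for col in zip(*points)]
    dists = [math.dist(p, centroid) for p in points]
    return statistics.mean(dists), statistics.stdev(dists)

for d in (3, 30, 300):
    mean, std = distance_stats(d)
    # std / mean shrinks with d (concentration), while std itself does not vanish.
    print(f"d={d:3d}  mean={mean:6.2f}  std={std:4.2f}  std/mean={std / mean:5.3f}")
```

Under this model the ratio std/mean should fall as d grows while the standard deviation stays of comparable size, which is why some points remain noticeably closer to the mean than others.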

- Important factors for real data
- Dependent attributes
- Grouping (clustering)

- 50 data sets
- From well-known repositories (UCI, Kent Ridge)
- Euclidean and cosine distances, as appropriate

- Measurements:
- SN10 – skewness of N10 (its standardized 3rd moment)
- Spearman correlation between N10 and distance from data set mean
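
Both measurements are standard statistics and can be sketched without external libraries. The version below assumes the population form of the third moment and average ranks for ties; those details, and the function names, are assumptions made here and are not specified on the slides.

```python
import statistics

def skewness(values):
    """Standardized third moment: E[(X - mu)^3] / sigma^3 (population form)."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return sum((v - mu) ** 3 for v in values) / (len(values) * sigma ** 3)

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for pos in range(start, end + 1):
            result[order[pos]] = (start + end) / 2 + 1
        start = end + 1
    return result

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    rx, ry = ranks(xs), ranks(ys)
    mx, my = statistics.fmean(rx), statistics.fmean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical N10 values and distances to the mean for five points.
n10 = [0, 1, 3, 6, 40]
dist_to_mean = [9.0, 7.5, 6.0, 4.0, 1.0]
print(skewness(n10) > 0)            # right-skewed: one hub dominates
print(spearman(n10, dist_to_mean))  # negative: hubs sit near the mean
```

A strongly positive skewness together with a strongly negative Spearman correlation is the pattern the slides associate with hubs lying close to the data set mean.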

- Skewness of Nk depends on intrinsic dimensionality
- dmle – MLE estimate of intrinsic dimensionality
- Over 50 data sets: Corr(d, SN10) = 0.62, Corr(dmle, SN10) = 0.80

- Shuffle the elements of each attribute: this raises intrinsic dimensionality to embedding dimensionality while keeping each attribute's distribution [François 2007]
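
The shuffling step can be sketched as an independent permutation of every column. This is a minimal illustration assuming the data set is a list of equal-length rows; `shuffle_attributes` is a hypothetical name.

```python
import random

def shuffle_attributes(rows, seed=0):
    """Independently permute the values within each attribute (column).

    Per-attribute distributions are unchanged, but dependencies between
    attributes are destroyed.
    """
    rng = random.Random(seed)
    columns = [list(col) for col in zip(*rows)]
    for col in columns:
        rng.shuffle(col)
    return [list(row) for row in zip(*columns)]

# Toy data (hypothetical): the second attribute is 10x the first.
rows = [[1, 10], [2, 20], [3, 30], [4, 40]]
shuffled = shuffle_attributes(rows)
print(shuffled)
```

In the toy data the two attributes are perfectly dependent, so the points lie on a line; after shuffling, each column still holds the same values but the rows no longer need to follow that line.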

- The effect of dimensionality reduction

- Hubs are in proximity of cluster centers
- Measurement:
- Spearman correlation between N10 and distance from closest cluster mean
- K-means clustering
- No. of clusters chosen to maximize

- In high dimensions, points with low Nk can be considered distance-based outliers
- They are far away from other points in the data set / their cluster
- Their existence is caused by high dimensionality

[Figure: k = 20]

[Figure labels: hubs, outliers]

- Hubs can even be considered probabilistic outliers

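- Based on labels, k-occurrences can be divided into:
- “Bad” k-occurrences, BNk(x): x is among the k nearest neighbors of a point with a different label
- “Good” k-occurrences, GNk(x): x is among the k nearest neighbors of a point with the same label
- Nk(x) = BNk(x) + GNk(x)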
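
The split into good and bad k-occurrences can be sketched by comparing labels along each edge of the k-NN digraph. The sketch assumes Euclidean distance and a brute-force search; `good_bad_occurrences` and the toy data are hypothetical.

```python
import math

def good_bad_occurrences(points, labels, k):
    """Return (GN_k, BN_k): label-matching and label-mismatching k-occurrences."""
    n = len(points)
    good = [0] * n
    bad = [0] * n
    for i in range(n):
        neighbors = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(points[i], points[j]),
        )[:k]
        for j in neighbors:
            # Point j occurs in the k-NN list of point i; compare their labels.
            if labels[j] == labels[i]:
                good[j] += 1
            else:
                bad[j] += 1
    return good, bad

# Toy data (hypothetical): one-dimensional points with two classes.
points = [(0.0,), (1.0,), (2.5,), (10.0,)]
labels = ["a", "a", "b", "b"]
good, bad = good_bad_occurrences(points, labels, k=1)
print(good, bad)  # [1, 1, 1, 0] [0, 1, 0, 0]
```

Adding the two lists element-wise recovers Nk for each point, matching the identity Nk(x) = BNk(x) + GNk(x).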

- “Bad” hubs can appear
- How do “bad” hubs originate?
- What is the influence of (“bad”) hubs on classification algorithms?

- Measurements:
- Normalized sum of all BN10 in the data set
- Correlation between BN10 and N10
- CAV – Cluster Assumption Violation coefficient
- Cluster Assumption (CA): Most pairs of points in a cluster should be of the same class [Chapelle 2006]
- CAV = a / (a + b)
- a = no. of pairs of points with different classes in the same cluster
- b = no. of pairs of points with the same class in the same cluster

- K-means clustering
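
The CAV coefficient can be computed directly from cluster assignments and class labels by counting same-cluster pairs. The sketch below takes the cluster assignments as given (for example, from K-means) and returns 0.0 when no two points share a cluster; that fallback and the name `cav` are assumptions made here.

```python
from itertools import combinations

def cav(clusters, labels):
    """Cluster Assumption Violation: a / (a + b) over same-cluster pairs."""
    a = b = 0
    for i, j in combinations(range(len(labels)), 2):
        if clusters[i] != clusters[j]:
            continue  # pairs from different clusters are not counted
        if labels[i] != labels[j]:
            a += 1  # same cluster, different classes
        else:
            b += 1  # same cluster, same class
    # Assumption: define CAV as 0.0 when there are no same-cluster pairs.
    return a / (a + b) if a + b else 0.0

# Hypothetical assignments: cluster 0 mixes two classes, cluster 1 is pure.
clusters = [0, 0, 0, 1, 1]
labels = ["x", "x", "y", "y", "y"]
print(cav(clusters, labels))  # 0.5
```

A value near 0 means clusters are nearly pure with respect to class, while larger values indicate that clusters mix classes.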

- Observations and answers:
- High dimensionality and skewness of Nk do not automatically induce “badness”
- No correlation between the normalized sum of BN10 and d, dmle, SN10

- “Bad” hubs originate from a combination of high dimensionality and violation of the CA
- Corr(normalized sum of BN10, CAV) = 0.85
- Corr(dmle, correlation between BN10 and N10) = 0.39

- “Bad” hubs provide erroneous class information to many other points
- We introduce standardized “bad” hubness, where μBNk and σBNk are the mean and standard deviation of BNk:
  hB(x) = (BNk(x) – μBNk) / σBNk

- During k-NN majority voting, the vote of each neighbor x is weighted by exp(–hB(x))
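
The weighting scheme can be sketched as follows. The sketch assumes BNk has already been computed on the training set, uses the population standard deviation, and assumes the counts are not all equal (otherwise the standardization divides by zero); the function names are hypothetical.

```python
import math
import statistics
from collections import defaultdict

def bad_hubness(bn_counts):
    """Standardize BN_k over the training set: h_B(x) = (BN_k(x) - mean) / std."""
    mu = statistics.fmean(bn_counts)
    sigma = statistics.pstdev(bn_counts)  # assumption: population std, nonzero
    return [(c - mu) / sigma for c in bn_counts]

def weighted_vote(neighbor_ids, labels, h_b):
    """Each neighbor votes for its own label with weight exp(-h_B)."""
    totals = defaultdict(float)
    for j in neighbor_ids:
        totals[labels[j]] += math.exp(-h_b[j])
    return max(totals, key=totals.get)

# Hypothetical training set: points 1 and 2 are "bad" hubs of class "b".
labels = ["a", "b", "b", "a"]
h_b = bad_hubness([0, 8, 8, 0])
# Unweighted majority among neighbors 0, 1, 2 would be "b" (two votes to one).
print(weighted_vote([0, 1, 2], labels, h_b))  # a
```

Because the weight decays exponentially in hB, two neighbors with high bad hubness are outvoted here by a single neighbor with low bad hubness.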

- RBF (Gaussian) kernel:
  K(x, y) = exp(–γ||x–y||²)

- Nk, BNk, GNk in kernel space are exactly the same as in the original space
- We progressively remove points from the training sets (10-fold CV) in order of decreasing BNk, and at random for comparison
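
The statement that the k-occurrence counts coincide in kernel space and in the original space follows from a short calculation. It is sketched here with Φ denoting the implicit feature map of the RBF kernel, a symbol introduced for this illustration and not used on the slides.

```latex
\begin{align*}
\|\Phi(x) - \Phi(y)\|^2
  &= K(x, x) - 2K(x, y) + K(y, y) \\
  &= 2 - 2\exp\!\left(-\gamma \|x - y\|^2\right)
\end{align*}
```

For γ > 0 the right-hand side is a strictly increasing function of ||x–y||, so every point has the same nearest-neighbor ordering in both spaces, and Nk, BNk and GNk are therefore unchanged.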

- “Bad” hubs can be good support vectors:

- AdaBoost assigns weights to training points, to be considered by weak learners
- Weights initially equal (1/n)

- Both hubs and outliers can harm AdaBoost
- Standardized hubness:
  h(x) = (Nk(x) – μNk) / σNk

- Set the initial weight of each training point x to 1/(1+|h(x)|), normalized by the sum over all x
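
The initial weighting can be sketched as a replacement for the uniform 1/n weights. The sketch assumes Nk is precomputed, uses the population standard deviation, and assumes the counts are not all equal; `initial_weights` is a hypothetical name and the code does not implement AdaBoost itself.

```python
import statistics

def initial_weights(nk_counts):
    """Initial AdaBoost weights 1 / (1 + |h(x)|), normalized to sum to 1."""
    mu = statistics.fmean(nk_counts)
    sigma = statistics.pstdev(nk_counts)  # assumption: population std, nonzero
    raw = [1.0 / (1.0 + abs((c - mu) / sigma)) for c in nk_counts]
    total = sum(raw)
    return [w / total for w in raw]

# Hypothetical N_k values: one outlier (0), three regular points, one hub (25).
weights = initial_weights([0, 5, 5, 5, 25])
print([round(w, 3) for w in weights])
```

Points whose Nk is close to the mean receive the largest weights, while both the hub and the outlier are down-weighted, in line with the observation that both can harm AdaBoost.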

[Two figures, each with panels for k = 20 and k = 40]

- Distance-based clustering objectives:
- Minimize within-cluster distance
- Maximize between-cluster distance

- Skewness of Nk affects both objectives
- Outliers do not cluster well because of high within-cluster distance
- Hubs also do not cluster well, but because of low between-cluster distance

- Silhouette coefficient (SC): For the i-th point
- ai = avg. distance to points from its cluster (within-cluster distance)
- bi = min. avg. distance to points from other clusters (between-cluster distance)
- SCi = (bi – ai) / max(ai, bi)
- In range [–1, 1], higher is better

- SC for a set of points is the average of SCi for every point i in the set
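
The per-point silhouette coefficient can be sketched directly from these definitions. The sketch assumes Euclidean distance, at least two clusters, and at least two points in every cluster; the function name and toy data are hypothetical.

```python
import math
import statistics

def silhouette(points, clusters):
    """Return SC_i = (b_i - a_i) / max(a_i, b_i) for every point i."""
    n = len(points)
    scores = []
    for i in range(n):
        # Distances from point i to every other point, grouped by cluster.
        by_cluster = {}
        for j in range(n):
            if j != i:
                by_cluster.setdefault(clusters[j], []).append(
                    math.dist(points[i], points[j])
                )
        # a_i: average distance within the point's own cluster.
        a = statistics.fmean(by_cluster[clusters[i]])
        # b_i: smallest average distance to any other cluster.
        b = min(
            statistics.fmean(d) for c, d in by_cluster.items() if c != clusters[i]
        )
        scores.append((b - a) / max(a, b))
    return scores

# Toy data (hypothetical): two well-separated clusters on a line.
points = [(0.0,), (1.0,), (10.0,), (11.0,)]
clusters = [0, 0, 1, 1]
scores = silhouette(points, clusters)
print(statistics.fmean(scores))  # SC for the whole set
```

To compare hubs, outliers and regular points, the same per-point scores can be averaged over each subset separately.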

- Retrieving documents most similar to query document
- Hubs harm precision
- For document x from data set D, and query q, distance d(x,q) in [0,1], we increase d(x,q) for every x such that h(x) > 2 as follows:

- Bag-of-words representation
- Cosine distance
- Leave-one-out cross-validation

[Figure panels: k = 10 and k = 1]

- Skewness of Nk – an under-studied phenomenon that can have a strong impact
- Future work:
- Theoretical study of impact on distance-based ML
- Examine possible impact on non-distance-based ML
- Seeding iterative clustering algorithms
- Outlier detection
- Reverse k-NN queries
- Time series
- Using skewness of Nk to estimate intrinsic dimensionality
