Nearest Neighbors in High-Dimensional Data: The Emergence and Influence of Hubs

Miloš Radovanović (1), Alexandros Nanopoulos (2), Mirjana Ivanović (1)

(1) Department of Mathematics and Informatics, Faculty of Science, University of Novi Sad, Serbia

(2) Institute of Computer Science, University of Hildesheim, Germany


Introduction

  • The curse of dimensionality

  • Distance concentration

    • The tendency of distances between all pairs of points in high-dimensional data to become almost equal

    • Affects meaningfulness of nearest neighbors, indexing, classification, regression

    • [Beyer 1999, Aggarwal 2001, François 2007]

  • We study a related phenomenon which concerns k-NN directed graphs

ICML'09 Miloš Radovanović


k-occurrences

  • Nk(x), the number of k-occurrences of point x, is the number of times x occurs among the k nearest neighbors of all other points in a data set

    • Nk(x) is the in-degree of node x in the k-NN digraph

  • It was observed that the distribution of Nk can become skewed, resulting in the emergence of hubs – points with high Nk

    • Music retrieval [Aucouturier 2007]

    • Speech recognition [Doddington 1998]

    • Fingerprint identification [Hicklin 2005]
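The definition of Nk above can be sketched in a few lines of Python. This is a minimal brute-force illustration, not the authors' implementation: the function names, the Euclidean metric, the tie-breaking by index, and the synthetic Gaussian data are all assumptions made for the example.

```python
import math
import random

def k_nearest(points, k):
    """Return, for each point index i, the indices of its k nearest neighbors."""
    neighbors = []
    for i, p in enumerate(points):
        # Sort all other points by Euclidean distance to p (ties broken by index).
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: (math.dist(p, points[j]), j),
        )
        neighbors.append(others[:k])
    return neighbors

def k_occurrences(points, k):
    """N_k(x): in-degree of each point in the k-NN digraph."""
    counts = [0] * len(points)
    for nn in k_nearest(points, k):
        for j in nn:
            counts[j] += 1
    return counts

random.seed(0)
# Assumed toy data: 200 points drawn from a 50-dimensional standard Gaussian.
data = [[random.gauss(0, 1) for _ in range(50)] for _ in range(200)]
n_k = k_occurrences(data, k=5)
print(max(n_k), sum(n_k) / len(n_k))  # the mean of N_k is always k
```

Every point contributes exactly k outgoing edges, so the mean of Nk is fixed at k; the skewness only changes how that total is distributed across points.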

k-occurrences

  • What caused the skewness of Nk?

    • Artefact of data?

      • Are some songs more similar to others?

      • Do some people have fingerprints or voices that are harder to distinguish from other people’s?

    • Specifics of modeling algorithms?

      • Inadequate choice of features?

    • Something more general?

The Causes of Skewness

  • Distance concentration

    • Ratio between a measure of spread (e.g., Std) and a measure of magnitude (e.g., E) of distances converges to 0 as d increases

    • High-dimensional data points approximately lie on a sphere centered at data set mean [Beyer 1999, Aggarwal 2001]

    • The distribution of distances to data set mean always has non-negligible variance [Demartines 1994, François 2007]

    • Existence of points closer to the data set mean is expected, even in high dimensions

  • Points closer to the data set mean tend to be closer to all other points (regardless of dimensionality)

    • This tendency is amplified by high dimensionality
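Distance concentration can be observed in a small simulation. The sketch below assumes points drawn uniformly from the unit hypercube and uses the ratio of the standard deviation to the mean of all pairwise Euclidean distances as the spread-to-magnitude measure; the function name and sample sizes are illustrative choices, not taken from the slides.

```python
import math
import random
import statistics

def relative_spread(n, d, rng):
    """Std / mean of all pairwise Euclidean distances for n uniform points in [0,1]^d."""
    pts = [[rng.random() for _ in range(d)] for _ in range(n)]
    dists = [math.dist(pts[i], pts[j]) for i in range(n) for j in range(i + 1, n)]
    return statistics.pstdev(dists) / statistics.fmean(dists)

rng = random.Random(0)
for d in (3, 30, 300):
    # The printed ratio should shrink as the dimensionality d grows.
    print(d, round(relative_spread(100, d, rng), 3))
```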

Skewness in Real Data

  • Important factors for real data

    • Dependent attributes

    • Grouping (clustering)

  • 50 data sets

    • From well known repositories (UCI, Kent Ridge)

    • Euclidean and cosine distances, as appropriate

  • Measurements:

    • SN10 – standardized 3rd moment of N10

    • Spearman correlation between N10 and distance from data set mean
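Both measurements can be computed directly from a vector of N10 counts. The following is a minimal sketch assuming the population form of the standardized third moment and a rank correlation with averaged ranks for ties; the helper names and the toy values are hypothetical.

```python
import statistics

def skewness(values):
    """Standardized third moment: E[(v - mean)^3] / std^3 (population form)."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return sum((v - mu) ** 3 for v in values) / (len(values) * sigma ** 3)

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = (start + end) / 2 + 1
        start = end + 1
    return ranks

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = statistics.fmean(rx), statistics.fmean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical toy values: N_10 counts and distances from the data set mean.
n10 = [31, 12, 9, 10, 4, 2, 0, 12]
dist_to_mean = [0.8, 1.1, 1.3, 1.2, 1.6, 1.9, 2.4, 1.0]
print(skewness(n10), spearman(n10, dist_to_mean))
```

A positive skewness together with a strongly negative correlation is the pattern the slides describe: a few points with very high N10, located close to the data set mean.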

1. Dependent Attributes

  • Skewness of Nk depends on intrinsic dimensionality

    • dmle – MLE estimate of intrinsic dimensionality

      • Over 50 data sets: Corr(d, SN10) = 0.62, Corr(dmle, SN10) = 0.80

    • Shuffle elements of each attribute, raising intrinsic dimensionality to embedding dimensionality while keeping attribute distributions [François 2007]
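The shuffling step can be illustrated as an independent permutation of each column. This is a sketch under that reading of the slide; the function name and toy data are hypothetical.

```python
import random

def shuffle_attributes(rows, rng):
    """Independently permute the values within each attribute (column)."""
    columns = [list(col) for col in zip(*rows)]
    for col in columns:
        rng.shuffle(col)
    # Transpose back to a list of rows.
    return [list(row) for row in zip(*columns)]

# Hypothetical toy data: the second attribute is fully determined by the first.
rows = [[x, 2 * x] for x in range(6)]
shuffled = shuffle_attributes(rows, random.Random(0))
print(shuffled)
```

Each column keeps exactly the same multiset of values, so the per-attribute distributions are unchanged, while the dependency between attributes is broken.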

1. Dependent Attributes

  • The effect of dimensionality reduction

2. Grouping (Clustering)

  • Hubs are in proximity of cluster centers

  • Measurement:

    • Spearman correlation between N10 and distance from closest cluster mean

    • K-means clustering

    • No. of clusters chosen to maximize this correlation

Hubs and Outliers

  • In high dimensions, points with low Nk can be considered distance-based outliers

    • They are far away from other points in the data set / their cluster

    • Their existence is caused by high dimensionality

  • Hubs can even be considered probabilistic outliers

Classification

  • Based on labels, k-occurrences can be distinguished into:

    • “Bad” k-occurrences, BNk(x)

    • “Good” k-occurrences, GNk(x)

    • Nk(x) = BNk(x) + GNk(x)

  • “Bad” hubs can appear

    • How do “bad” hubs originate?

    • What is the influence of (“bad”) hubs on classification algorithms?
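The split of Nk into "bad" and "good" parts can be sketched as follows. The slides only say the distinction is based on labels; this example assumes that an occurrence of x in the neighbor list of a point counts as "bad" when the two labels differ and as "good" when they match. The function name and the toy neighbor lists are hypothetical.

```python
def bad_good_occurrences(neighbors, labels):
    """Split N_k into label-mismatching (bad) and label-matching (good) counts.

    neighbors[i] lists the indices of the k nearest neighbors of point i.
    """
    bad = [0] * len(labels)
    good = [0] * len(labels)
    for i, nn in enumerate(neighbors):
        for j in nn:
            if labels[j] == labels[i]:
                good[j] += 1
            else:
                bad[j] += 1
    return bad, good

# Hypothetical 2-NN lists for five labeled points.
neighbors = [[1, 2], [0, 2], [1, 3], [2, 4], [3, 2]]
labels = ["a", "a", "b", "b", "b"]
bn, gn = bad_good_occurrences(neighbors, labels)
print(bn, gn)
```

Summing the two lists element by element recovers Nk(x) for every point, matching the identity Nk(x) = BNk(x) + GNk(x).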

How do “bad” hubs originate?

  • Measurements:

    • – normalized sum of all BN10 in data set

    • – correlation between BN10 and N10

    • CAV – Cluster Assumption Violation coefficient

      • Cluster Assumption (CA): Most pairs of points in a cluster should be of the same class [Chapelle 2006]

      • CAV = a / (a + b)

        • a = no. of pairs of points w. different classes, same cluster

        • b = no. of pairs of points w. same class and cluster

      • K-means clustering
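The CAV coefficient defined above can be computed from cluster assignments and class labels. This sketch assumes the cluster assignments are already available (for instance from K-means) and that at least one pair of points shares a cluster; the function name and toy inputs are hypothetical.

```python
from itertools import combinations

def cav(clusters, labels):
    """Cluster Assumption Violation coefficient: a / (a + b)."""
    a = b = 0
    for i, j in combinations(range(len(labels)), 2):
        if clusters[i] != clusters[j]:
            continue  # only pairs within the same cluster are counted
        if labels[i] == labels[j]:
            b += 1
        else:
            a += 1
    return a / (a + b)

clusters = [0, 0, 0, 1, 1]
labels = ["x", "x", "y", "z", "z"]
print(cav(clusters, labels))  # a = 2, b = 2, so 0.5
```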

How do “bad” hubs originate?

  • Observations and answers:

    • High dimensionality and skewness of Nk do not automatically induce “badness”

      • No correlation between and d, dmle, SN10

    • “Bad” hubs originate from a combination of high dimensionality and violation of the CA

      • Corr( , CAV) = 0.85

      • Corr(dmle, ) = 0.39

Influence on the k-NN Classifier

  • “Bad” hubs provide erroneous class information to many other points

  • We introduce standardized “bad” hubness:

    hB(x) = (BNk(x) – μBNk) / σBNk

  • During majority voting, the vote of each neighbor x is weighted by

    exp(–hB(x))
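The weighting scheme above can be sketched as a weighted majority vote. The example assumes the BNk counts have already been computed on the training set and uses the population standard deviation for standardization; the function names and toy values are hypothetical.

```python
import math
import statistics
from collections import defaultdict

def standardized(values):
    """(v - mean) / std for each value (population std)."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]

def weighted_vote(neighbor_ids, labels, h_bad):
    """Majority vote where neighbor x contributes exp(-hB(x)) to its own class."""
    scores = defaultdict(float)
    for x in neighbor_ids:
        scores[labels[x]] += math.exp(-h_bad[x])
    return max(scores, key=scores.get)

# Hypothetical training set: BN_k counts and labels for five points.
bn_k = [9, 8, 0, 1, 2]
labels = ["a", "a", "b", "b", "b"]
h_bad = standardized(bn_k)

# Points 0 and 1 are "bad" hubs; an unweighted vote over [0, 1, 2] would pick "a".
print(weighted_vote([0, 1, 2], labels, h_bad))
```

In this toy case the two "bad" hubs are outvoted by the single neighbor with low BNk, because their votes are scaled down exponentially.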

Influence on SVMs

  • RBF (Gaussian) kernel:

    K(x, y) = exp(–γ||x–y||^2)

  • Nk, BNk, GNk in kernel space exactly the same as in original space

  • We progressively remove points from training sets (10-fold CV) in the order of decreasing BNk, and at random
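One way to see why Nk is unchanged in kernel space: for the RBF kernel, the squared distance in the induced feature space is K(x,x) + K(y,y) – 2K(x,y) = 2 – 2K(x,y), which increases monotonically with ||x–y||, so neighbor rankings are preserved. The sketch below checks this numerically on assumed random data with an arbitrarily chosen γ = 0.1; the function names are hypothetical.

```python
import math
import random

def rbf(x, y, gamma):
    """K(x, y) = exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * math.dist(x, y) ** 2)

def kernel_distance_sq(x, y, gamma):
    """Squared distance in kernel space: K(x,x) + K(y,y) - 2 K(x,y) = 2 - 2 K(x,y)."""
    return 2.0 - 2.0 * rbf(x, y, gamma)

rng = random.Random(0)
points = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(30)]
query, rest = points[0], points[1:]

# Rank the remaining points by distance to the query in both spaces.
by_original = sorted(range(len(rest)), key=lambda j: math.dist(query, rest[j]))
by_kernel = sorted(range(len(rest)), key=lambda j: kernel_distance_sq(query, rest[j], 0.1))
print(by_original == by_kernel)
```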

  • “Bad” hubs can be good support vectors:

Influence on AdaBoost + CART

  • AdaBoost assigns weights to training points, to be considered by weak learners

    • Weights initially equal (1/n)

  • Both hubs and outliers can harm AdaBoost

  • Standardized hubness:

    h(x) = (Nk(x) – μNk) / σNk

  • Set initial weight of each training point x to

    1/(1+|h(x)|)

    (normalized by the sum over x)
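The initial weighting can be sketched as below. It assumes the Nk counts are given and uses the population standard deviation; the function name and the toy counts are hypothetical, and the boosting loop itself is not shown.

```python
import statistics

def initial_weights(n_k):
    """Initial AdaBoost weights 1 / (1 + |h(x)|), normalized to sum to 1."""
    mu = statistics.fmean(n_k)
    sigma = statistics.pstdev(n_k)
    raw = [1.0 / (1.0 + abs((v - mu) / sigma)) for v in n_k]
    total = sum(raw)
    return [w / total for w in raw]

# Hypothetical N_k counts: one hub (30), one outlier-like point (0), the rest typical.
n_k = [30, 0, 9, 10, 11]
weights = initial_weights(n_k)
print([round(w, 3) for w in weights])
```

Because the weight depends on |h(x)|, both the hub and the outlier-like point start with less weight than points whose Nk is close to the mean.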

Clustering

  • Distance-based clustering objectives:

    • Minimize within-cluster distance

    • Maximize between-cluster distance

  • Skewness of Nk affects both objectives

    • Outliers do not cluster well because of high within-cluster distance

    • Hubs also do not cluster well, but because of low between-cluster distance

Clustering

  • Silhouette coefficient (SC): For i-th point

    • ai = avg. distance to points from its cluster (within-cluster distance)

    • bi = min. avg. distance to points from other clusters (between-cluster distance)

    • SCi = (bi – ai) / max(ai, bi)

      • In range [–1, 1], higher is better

    • SC for a set of points is the average of SCi for every point i in the set
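The silhouette definition above translates into a short brute-force routine. This sketch assumes Euclidean distance, at least two clusters, and at least two points per cluster (a singleton cluster has no within-cluster distance and would raise an error here); the function name and toy points are hypothetical.

```python
import math
import statistics

def silhouette(points, clusters):
    """Per-point silhouette coefficients SC_i = (b_i - a_i) / max(a_i, b_i)."""
    scores = []
    for i, p in enumerate(points):
        # Collect distances from point i to every other point, grouped by cluster.
        by_cluster = {}
        for j, q in enumerate(points):
            if j != i:
                by_cluster.setdefault(clusters[j], []).append(math.dist(p, q))
        a = statistics.fmean(by_cluster[clusters[i]])  # within-cluster distance
        b = min(statistics.fmean(d) for c, d in by_cluster.items() if c != clusters[i])
        scores.append((b - a) / max(a, b))
    return scores

points = [[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]]
clusters = [0, 0, 1, 1]
sc = silhouette(points, clusters)
print(statistics.fmean(sc))
```

Averaging the per-point values over a chosen subset, such as only the hubs or only the outliers, gives the SC for that subset.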

Information Retrieval

  • Retrieving documents most similar to query document

  • Hubs harm precision

  • For document x from data set D, and query q, distance d(x,q) in [0,1], we increase d(x,q) for every x such that h(x) > 2 as follows:

Information Retrieval

  • Bag-of-words representation

  • Cosine distance

  • Leave-one-out cross-validation

Conclusion

  • Skewness of Nk – an under-studied phenomenon that can have a strong impact

  • Future work:

    • Theoretical study of impact on distance-based ML

    • Examine possible impact on non distance-based ML

    • Seeding iterative clustering algorithms

    • Outlier detection

    • Reverse k-NN queries

    • Time series

    • Using skewness of Nk to estimate intrinsic dimensionality

Thank You
