
When Is “Nearest Neighbor” Meaningful?

This talk explores the relationship between instability and indexability in high-dimensional nearest neighbor problems. It analyzes workloads, discusses why nearest neighbor processing techniques break down as dimensionality grows, and examines scenarios in which these techniques may still perform well. The talk concludes with suggestions for future work, including examining the contrast produced by mappings of similarity problems into high-dimensional spaces and finding indexing structures with good performance in high-contrast situations.


Presentation Transcript


  1. When Is “Nearest Neighbor” Meaningful? By Kevin Beyer, Jonathan Goldstein, Raghu Ramakrishnan, and Uri Shaft

  2. Talk overview • Motivation • Instability definition • Relationship between instability and indexability • Analysis of workloads • Conclusions • Future work

  3. Definition of “Nearest Neighbor” • Given a relation, the nearest neighbor problem determines which tuple in the relation is closest to some given tuple (not necessarily from the original relation) assuming some distance function. • Usually the fields of the relation are reals, and the distance function is a metric. L2 is the most frequently used metric. • High dimensional nearest neighbor problems usually stem from similarity and approximate matching problems.
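
For concreteness, here is a minimal brute-force sketch of that definition (mine, not from the talk): it scans a relation of real-valued tuples and returns the tuple closest to a query point under the L2 metric. The array shapes and the use of NumPy are illustrative assumptions.

```python
import numpy as np

def nearest_neighbor(data, query):
    """Index of the tuple in `data` closest to `query` under the L2 metric."""
    # data: (n, d) array, one row per tuple of d real-valued fields
    # query: (d,) array; the query tuple need not come from `data`
    dists = np.linalg.norm(data - query, axis=1)  # L2 distance to every tuple
    return int(np.argmin(dists))

# Usage: 1000 tuples in 20 dimensions, query drawn from the same distribution
rng = np.random.default_rng(0)
data = rng.random((1000, 20))
query = rng.random(20)
print(nearest_neighbor(data, query))
```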

  4. Motivation • Nearest neighbor processing techniques perform badly in high dimensionality. Why? • Is there a fundamental reason for this breakdown? • Is more than performance affected by this breakdown? • Are there high dimensional scenarios in which these techniques may perform well?

  5. Instability [Figures: a typical query in 2D vs. an unstable query in 2D]

  6. Formal definition of instability (i.e., as dimensionality increases, all points become equidistant w.r.t. the query point)
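
The slide's formula did not survive the transcript. As a sketch of what it presumably showed, the standard formulation from the Beyer et al. paper is stated in terms of DMIN_m and DMAX_m, the distances from the query to its nearest and farthest data point in dimensionality m: a workload is unstable when, for every ε > 0,

\[
\lim_{m \to \infty} \Pr\bigl[\mathrm{DMAX}_m \le (1+\varepsilon)\,\mathrm{DMIN}_m\bigr] = 1 ,
\]

which is exactly the "all points become equidistant" statement in the parenthetical above.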

  7. Instability and indexability If a workload has the following properties: 1) the workload is unstable, 2) the query distribution follows the data distribution, 3) distance is calculated using the L2 metric, and 4) the number of data points is constant across dimensionalities, then as dimensionality increases, the probability that every (non-trivial) convex decomposition of the space results in examining all data points approaches 1.

  8. Instability tool
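
The body of this slide is also missing from the transcript. As a hedged reconstruction, the "tool" in the paper is a sufficient condition on the distance distribution: if the normalized distance from the query to a random data point concentrates, i.e.

\[
\lim_{m \to \infty} \operatorname{Var}\!\left(\frac{d_m(Q_m, X_m)}{\mathbb{E}\bigl[d_m(Q_m, X_m)\bigr]}\right) = 0 ,
\]

then the workload is unstable in the sense defined above, i.e. \(\Pr[\mathrm{DMAX}_m \le (1+\varepsilon)\,\mathrm{DMIN}_m] \to 1\) for every ε > 0. Here \(d_m\) is the distance function in dimensionality m, \(Q_m\) the query, and \(X_m\) a random data point; the notation is mine, not the slide's.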

  9. IID result application • Assume the following: • The data distribution and query distribution are IID in all dimensions. • All the appropriate moments are finite (i.e., up to the ⌈2p⌉-th moment). • The query point is chosen independently of the data points.
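
Under these IID and finite-moment assumptions, the per-dimension contributions to the distance average out, so the variance condition of the instability tool holds and therefore (my paraphrase of the paper's IID corollary, not the slide's wording)

\[
\frac{\mathrm{DMAX}_m}{\mathrm{DMIN}_m} \;\xrightarrow{\;p\;}\; 1 \qquad \text{as } m \to \infty .
\]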

  10. Variance goes to 0 result application

  11. Examples that meet our condition: • All dimensions are IID; Q ~ P (Query distribution follows data distribution) • Variance converges to 0 at a bounded rate; Q ~ P • Variance converges to infinity at a bounded rate; Q ~ P • All dimensions have some correlation; Q ~ P • Variance converges to 0 at a bounded rate, all dimensions have some correlation; Q ~ P • The data contains perfect clusters; Q ~ IID uniform

  12. Examples that don’t meet our condition: • All dimensions are completely correlated; Q ~ P • All dimensions are linear combinations of a fixed number of IID random variables; Q ~ P • The data contains perfect clusters; Q ~ P; a special case of this is the approximate matching problem

  13. IID contrast as dimensionality increases
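
The slide's plot is not reproduced in the transcript. The following small simulation (my own sketch, not the paper's exact experiment) shows the same qualitative effect: for IID uniform data with the query drawn from the same distribution, the relative contrast (DMAX − DMIN)/DMIN shrinks toward 0 as dimensionality grows.

```python
import numpy as np

# Sketch only: relative contrast (DMAX - DMIN) / DMIN for IID uniform data
# as dimensionality grows; n_points and the dimensionalities are arbitrary choices.
rng = np.random.default_rng(0)
n_points = 1000

for dim in (2, 10, 100, 1000):
    data = rng.random((n_points, dim))   # IID uniform data points
    query = rng.random(dim)              # query follows the data distribution
    dists = np.linalg.norm(data - query, axis=1)
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"dim={dim:5d}  contrast={contrast:.3f}")
```

The printed contrast shrinks markedly as dim grows, which is the behavior the slide's figure presumably plots.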

  14. Contrast as dimensionality increases

  15. Contrast in ideally clustered data [Figure panels: top right, typical distance distribution; bottom left, ideal clusters; bottom right, distance distribution for ideally clustered data/queries]

  16. Contrast for a real image database

  17. Distance distribution for fixed query NN

  18. Distance distribution for fixed query NN

  19. Distance distribution for fixed query NN

  20. Conclusions • Serious questions are raised about techniques that map approximate similarity into high dimensional nearest neighbor problems. • The ease with which linear scan beats more complex access methods for high-D nearest neighbor is explained by our theorem. • These results should not be taken to mean that all high dimensional nearest neighbor problems are badly framed or that more complex access methods will always fail on individual high-D data sets.

  21. Future Work • Examine the contrast produced by various mappings of similarity problems into high dimensional spaces • Does contrast fully capture the difficulty associated with the high dimensional nearest neighbor problem? • If so, find an indexing structure for nearest neighbor which has guaranteed good performance in high contrast situations • Determine the performance of various indexing structures compared to linear scan as dimensionality increases
