
Asymmetric Distance Estimation with Sketches for Similarity Search in High-Dimensional Spaces

Wei Dong. Joint work with Moses Charikar and Kai Li. Computer Science Department, Princeton University.


Presentation Transcript


  1. Asymmetric Distance Estimation with Sketches for Similarity Search in High-Dimensional Spaces • Wei Dong • Joint work with Moses Charikar and Kai Li • Computer Science Department, Princeton University

  2. Motivations • Feature-rich data grow explosively • Images, audio, video, scientific sensor data • Content-based retrieval needed • Features extracted with domain specific algorithms • K-NN search in feature space • Feature size becomes the bottleneck • Features are of high dimensions and trees fail • New methods like LSH are mostly smart ways of scanning

  3. Feature size is growing • Domain experts are shooting for high precision, and new methods are being developed • Example: image features • ~1995: 166D color histogram • 1999: SIFT, 500 × 128D vectors/image = 64KB/image • That’s almost the size of an image!

  4. K-NN Search with Sketches • Sketch: compact approximation of a large object [Lv04] • Feature vector → bit vector • L1 distance → Hamming distance • [Diagram: input data objects → feature extraction → sketch construction → sketch database; input query object → feature extraction → sketch construction → filtering → ranking with features → results]

  5. Our contribution • A new sketch for L2 distance • Asymmetric distance estimation • Using sketch of a data point + the raw query point • Applies to our proposed sketch and others • Evaluation with real life image and audio data • Sketches of < 10% the feature size for > 90% recall • Further 20% ~ 40% size reduction with asymmetric estimators

  6. L2 Sketch: the Idea • Randomly partition the space into stripes • Orange = 1; white = 0 • More random partitions to make a bit vector • Hamming distance reflects point proximity
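A minimal sketch of the stripe idea on slide 6, in Python. The transcript does not preserve the exact formula, so the construction here is an assumption: each bit projects the point onto a random Gaussian direction, adds a random offset, cuts the line into stripes of a fixed width, and alternates 0/1 from stripe to stripe. The names `make_l2_sketcher`, `width`, and `hamming` are hypothetical.

```python
import math
import random

def make_l2_sketcher(dim, n_bits, width, seed=0):
    """Return a function that maps a vector to a list of n_bits 0/1 values.

    Assumed construction (not taken from the slides): one random
    direction and one random offset per bit, stripes of equal width.
    """
    rng = random.Random(seed)
    directions = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                  for _ in range(n_bits)]
    offsets = [rng.uniform(0.0, 2.0 * width) for _ in range(n_bits)]

    def sketch(x):
        bits = []
        for a, b in zip(directions, offsets):
            t = sum(ai * xi for ai, xi in zip(a, x)) + b
            # Stripe index along the projection; its parity is the bit.
            bits.append(int(math.floor(t / width)) % 2)
        return bits

    return sketch

def hamming(s, t):
    """Number of positions where two bit lists differ."""
    return sum(1 for u, v in zip(s, t) if u != v)
```

Under this assumption, nearby points fall into the same stripe for most partitions, so their sketches differ in few bits, while the stripe width controls the range of distances the sketch can resolve.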

  7. L2 Sketch: the Proposed Scheme • [Figure label: W]

  8. Asymmetric Estimator: the Idea • Information loss at sketch construction, for both the data objects and the query object • Query points are available at query time! • [Diagram: the K-NN search pipeline with both sketch construction steps marked "Information Loss!"]

  9. Asymmetric Estimator: the Idea • Exploit the query features for high precision • How? • [Diagram: sketch construction on the data side only; the query features feed the filtering step directly]

  10. Asymmetric Estimator: L2 Sketch • Partitions are not equally good for a query point • Weight each partition by its quality • Weight: distance between q and the stripe boundary • [Figure labels: p3, "Bad!"]
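One way to read the weighting rule on slide 10, as a hedged Python sketch. It assumes the stripe construction from slide 6 is a random projection cut into stripes of width `width`; the slides do not give the formula, so `make_partitions`, `boundary_distance`, and `asymmetric_score` are hypothetical names, and the score is meant for ranking candidates rather than as a calibrated distance. A bit that disagrees with the query counts for more when q sits far from the stripe boundary, because then the data point must be at least that far away along the projection.

```python
import math
import random

def make_partitions(dim, n_bits, width, seed=0):
    """One (direction, offset) pair per bit. Assumed construction."""
    rng = random.Random(seed)
    return [([rng.gauss(0.0, 1.0) for _ in range(dim)],
             rng.uniform(0.0, 2.0 * width)) for _ in range(n_bits)]

def project(x, partition):
    a, b = partition
    return sum(ai * xi for ai, xi in zip(a, x)) + b

def sketch(x, partitions, width):
    return [int(math.floor(project(x, p) / width)) % 2 for p in partitions]

def boundary_distance(t, width):
    """Distance from projected position t to the nearest stripe boundary."""
    r = t % width  # position inside the current stripe
    return min(r, width - r)

def asymmetric_score(q, data_sketch, partitions, width):
    """Weighted Hamming distance between the raw query q and a data sketch."""
    score = 0.0
    for p, bit in zip(partitions, data_sketch):
        t = project(q, p)
        q_bit = int(math.floor(t / width)) % 2
        if q_bit != bit:
            # Disagreeing bit: weight it by how far q is from the boundary.
            score += boundary_distance(t, width)
    return score
```

Only the data points are stored as sketches; the query keeps its raw features, which is the asymmetry the slides describe.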

  11. Asymmetric Estimator: L2 Sketch

  12. Generalize the Asymmetric Idea • 0/1 valued function ↔ bipartition of the space • Asymmetric estimator: weighted Hamming distance

  13. Example: Random Hyper-plane Sketch • Cosine Similarity • Partition with random hyper-plane [Charikar’02]
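A small Python illustration of the random hyper-plane sketch on slide 13. Each bit records which side of a random hyper-plane through the origin the vector falls on; two sketches differ in a given bit with probability θ/π, where θ is the angle between the vectors, so the Hamming fraction yields an angle (and hence cosine) estimate. The weighted variant follows the generalization on slide 12 by using the distance from q to the hyper-plane as the weight, which is an assumption about how the idea carries over; all function names are hypothetical.

```python
import math
import random

def make_hyperplanes(dim, n_bits, seed=0):
    """Random Gaussian normal vectors, one hyper-plane per bit."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

def dot(a, x):
    return sum(ai * xi for ai, xi in zip(a, x))

def sketch(x, planes):
    """Bit i is 1 when x lies on the non-negative side of plane i."""
    return [1 if dot(a, x) >= 0.0 else 0 for a in planes]

def estimate_angle(s, t):
    """Symmetric estimator: angle ~ pi * (fraction of differing bits)."""
    differing = sum(1 for u, v in zip(s, t) if u != v)
    return math.pi * differing / len(s)

def weighted_hamming(q, data_sketch, planes):
    """Asymmetric score: differing bits weighted by q's distance to the plane."""
    score = 0.0
    for a, bit in zip(planes, data_sketch):
        margin = dot(a, q) / math.sqrt(dot(a, a))
        q_bit = 1 if margin >= 0.0 else 0
        if q_bit != bit:
            score += abs(margin)
    return score
```

The cosine similarity estimate is then `math.cos(estimate_angle(s, t))`.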

  14. Evaluation Datasets • Image: Caltech 101 • 101 categories, 9144 images in total • Feature extraction with SIFT • 4.5 million 128D features • Audio: LDC-SWITCHBOARD-1 collection • 2,400 phone conversations among 543 US speakers • Segmentation and feature extraction with Marsyas • 2.5 million 192D features • We use floating point numbers for all feature vectors.

  15. L2 Distance Estimation • We want to see the relationship between • Estimation error and real distance • Estimation error and sketch size • Methods compared • Sketches with symmetric and asymmetric estimators • Our proposed L2 sketch • Random hyper-plane sketch (converted to L2 distance) • PCA and Random projection as baselines • Image data only for this task

  16. Error vs. L2 distance • Sample random point pairs • Estimate the distance and measure the error • All methods use 32 bytes for one sketch • Bin according to real distance

  17. Error vs. L2 distance • Our sketch scheme has a tunable sensitive range • [Plot annotation: average 100-NN distance + one sigma]

  18. Error vs. Sketch Size • Use points with real distances within [0, 300) • Asymmetric estimator reduces MSE by half

  19. K-NN Search • Search for 100-NN • Filter with sketches to obtain 2000 candidates • Rank with raw features and return the top 100 • Size the sketch to meet a specific average recall • Recall = % of true K-NNs retrieved • We always return 100 points, so precision = recall
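The filter-then-rank procedure on slide 19 can be sketched in Python as follows. This is an illustration under assumptions, not the authors' implementation: it filters with plain Hamming distance between bit-list sketches (any sketch-based score could be substituted), and `filter_and_rank`, `exact_knn`, and `recall` are hypothetical names. In the slides' setting, `k` would be 100 and `n_candidates` 2000.

```python
import heapq
import math

def l2(x, y):
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def hamming(s, t):
    return sum(1 for u, v in zip(s, t) if u != v)

def filter_and_rank(query, query_sketch, data, sketches, k, n_candidates):
    """Return indices of the k best points found by two-stage search."""
    # Stage 1: scan the compact sketches and keep the closest candidates.
    candidates = heapq.nsmallest(
        n_candidates, range(len(data)),
        key=lambda i: hamming(query_sketch, sketches[i]))
    # Stage 2: re-rank only those candidates with the raw feature vectors.
    return heapq.nsmallest(k, candidates, key=lambda i: l2(query, data[i]))

def exact_knn(query, data, k):
    """Ground truth: brute-force k nearest neighbors on raw features."""
    return heapq.nsmallest(k, range(len(data)),
                           key=lambda i: l2(query, data[i]))

def recall(returned, truth):
    """Fraction of the true k-NNs that appear in the returned list."""
    return len(set(returned) & set(truth)) / len(truth)
```

Because the search always returns exactly `k` points and the ground truth also has `k` points, precision and recall coincide, as the slide notes.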

  20. Results with Image Dataset • Sketch size (bytes) needed to achieve a given recall • Raw feature size: 496 bytes

  21. Results with Audio Dataset • Sketch size (bytes) needed to achieve a given recall • Raw feature size: 768 bytes

  22. Conclusion • A new sketch for L2 distance • Tunable sensitive distance range • New idea of asymmetric distance estimator • Exploit the query/data storage asymmetry • Applies to different sketch schemes • 20% to 40% space reduction with our datasets
