Xianglong Liu 1 , Junfeng He 2,3 , and Bo Lang 1 1 Beihang University, Beijing, China 2 Columbia University, New York,

Reciprocal Hash Tables for Nearest Neighbor Search Xianglong Liu1, Junfeng He2,3, and Bo Lang1 1Beihang University, Beijing, China 2Columbia University, New York, NY, USA 3Facebook, Menlo Park, CA, USA

Outline • Introduction • Nearest Neighbor Search • Motivation • Reciprocal Hash Tables • Formulation • Solutions • Experiments • Conclusion

Introduction: Nearest Neighbor Search (1) • Definition: given a database of points and a query point, the nearest neighbor of the query is the database point closest to it under the chosen distance • Solutions • linear scan • time and memory consuming • tree-based: KD-tree, VP-tree, etc. • divide and conquer • degenerates to linear scan for high-dimensional data

Introduction: Nearest Neighbor Search (2) • Hash-based nearest neighbor search • Locality sensitive hashing [Indyk and Motwani, 1998]: close points in the original space have similar hash codes [Figure: points x1 to x5 separated by hash functions h1 and h2 and assigned binary codes 010…, 100…, 111…, 001…, 110…]
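As a rough illustration of the locality-sensitive idea on this slide, the sketch below uses sign random projections, one common LSH family. The slide does not say which family is meant, so that choice, the helper names (`make_hash_functions`, `hash_code`), and the seed are assumptions made for illustration only.

```python
import random

def make_hash_functions(dim, num_bits, seed=0):
    """Draw one random Gaussian direction per hash bit (hypothetical helper)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]

def hash_code(point, directions):
    """Return one bit per direction: 1 if the projection is non-negative, else 0."""
    code = []
    for w in directions:
        projection = sum(wi * xi for wi, xi in zip(w, point))
        code.append(1 if projection >= 0 else 0)
    return tuple(code)

directions = make_hash_functions(dim=3, num_bits=8)
x = [1.0, 2.0, 3.0]
code_x = hash_code(x, directions)
# A point in the same direction as x lands on the same side of every hyperplane.
code_scaled = hash_code([2.0 * v for v in x], directions)
# The opposite point lands on the other side of the hyperplanes.
code_opposite = hash_code([-v for v in x], directions)
```

Points separated by a small angle agree on most bits, which is what lets nearby points fall into the same or neighboring buckets.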

Introduction: Nearest Neighbor Search (3) • Hash-based nearest neighbor search • Compressed storage: binary codes • Efficient computations: hash table lookup or Hamming distance ranking based on binary operations [Figure: hashing with projections wk turns each image into a binary code (0010…, 0110…, 1111…) that indexes a bucket of the hash table]
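The two search modes named on this slide can be sketched as follows. Codes are stored as Python integers so that the Hamming distance reduces to an XOR and a bit count; the function names and the toy 4-bit codes are illustrative assumptions, not taken from the slides.

```python
from collections import defaultdict

def hamming_distance(code_a, code_b):
    """Count differing bits with XOR followed by a population count."""
    return bin(code_a ^ code_b).count("1")

def build_hash_table(codes):
    """Map each binary code to the list of item ids stored in that bucket."""
    table = defaultdict(list)
    for item_id, code in enumerate(codes):
        table[code].append(item_id)
    return table

def hamming_ranking(query_code, codes):
    """Return item ids sorted by Hamming distance to the query code."""
    return sorted(range(len(codes)),
                  key=lambda i: hamming_distance(query_code, codes[i]))

codes = [0b0010, 0b0110, 0b1111, 0b0010]   # toy database codes
table = build_hash_table(codes)
bucket = table.get(0b0010, [])             # hash lookup: items sharing the query's bucket
ranking = hamming_ranking(0b0010, codes)   # Hamming ranking over the whole database
```

Hash lookup touches a single bucket, while Hamming ranking scans every code but only with cheap bit operations.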

Introduction: Motivation • Problems • multiple hash tables are built and multiple buckets are probed to improve the search performance [Gionis, Indyk, and Motwani, 1999; Lv et al. 2007] • little research studies a general strategy for constructing multiple hash tables • random selection: the widely used general strategy, which usually needs a large number of hash tables • Motivation • as in the well-studied feature selection problem, select the most informative and independent hash functions • support various types of hashing algorithms, different data sets and scenarios, etc.

Reciprocal Hash Tables: Formulation (1) • Problem Definition • Suppose we have a pool of hash functions with an index set • Given the training data set (described by its feature dimension and its training data size), applying each pooled hash function to the training data yields samples of a random binary vector • The goal: build multiple hash tables, each of which consists of a given number of hash functions from the pool

Reciprocal Hash Tables: Formulation (2) • Graph Representation • represent the pooled hash functions as a vertex-weighted and undirected edge-weighted graph • the vertex set corresponds to the hash functions in the pool • each vertex carries a vertex weight • the edge set connects pairs of vertices • each edge carries a non-negative edge weight for the pair of vertices it joins

Reciprocal Hash Tables: Formulation (3) • Selection Criteria • Vertex weight: the quality of each hash function • Hash functions should preserve similarities between data points • Measured by the empirical accuracy [Wang, Kumar, and Chang 2012] • Based on a similarity matrix considering both neighbors and non-neighbors • Edge weight: the pairwise relationships between hash functions • Hash functions should be independent to reduce bit redundancy • Measured by the mutual information among their bit variables • Based on the bit distribution of the i-th function and the joint bit distribution of pairs of functions
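The two criteria can be made concrete with a small sketch. It assumes hash bits in {-1, +1} and a sparse similarity dictionary with +1 for neighbor pairs and -1 for non-neighbor pairs. How the slides turn these raw scores into the final vertex and edge weights is not recoverable from the transcript, so only the raw quantities are computed here, and the function names are hypothetical.

```python
import math
from collections import Counter

def empirical_accuracy(bits, similarity):
    """Sum s_ab * y_a * y_b over labeled pairs: high when neighbors share a bit
    and non-neighbors receive different bits."""
    return sum(s * bits[a] * bits[b] for (a, b), s in similarity.items())

def mutual_information(bits_i, bits_j):
    """Empirical mutual information (in nats) between two bit sequences."""
    n = len(bits_i)
    joint = Counter(zip(bits_i, bits_j))
    marginal_i = Counter(bits_i)
    marginal_j = Counter(bits_j)
    mi = 0.0
    for (u, v), count in joint.items():
        p_uv = count / n
        p_u = marginal_i[u] / n
        p_v = marginal_j[v] / n
        mi += p_uv * math.log(p_uv / (p_u * p_v))
    return mi

# Toy data: three samples; (0, 1) are neighbors, the other pairs are not.
similarity = {(0, 1): 1, (0, 2): -1, (1, 2): -1}
accuracy = empirical_accuracy([1, 1, -1], similarity)

redundant = mutual_information([1, 1, -1, -1], [1, 1, -1, -1])    # identical bits
independent = mutual_information([1, 1, -1, -1], [1, -1, 1, -1])  # unrelated bits
```

A function with high empirical accuracy is a good vertex, and a pair with low mutual information is a good edge, since its two bits carry little redundant information.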

Reciprocal Hash Tables: Solutions (1) • Informative Hash Tables • informative hash table: its hash functions preserve neighbor relationships and are mutually independent • the most desired subset of hash functions, with high vertex and edge weights inside, is the dominant set on the graph [Pavan and Pelillo 2007; Liu et al. 2013]
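Dominant sets are commonly extracted with replicator dynamics, the iteration popularized by Pavan and Pelillo. The sketch below runs that standard iteration on a plain symmetric, non-negative affinity matrix with a zero diagonal. Whether the authors use exactly this solver, and how they fold the vertex weights into the matrix, is not stated in the transcript, so the function names, the toy matrix, and the top-k cut are assumptions.

```python
def replicator_dynamics(affinity, iters=200, tol=1e-9):
    """Iterate x_i <- x_i * (A x)_i / (x^T A x) from the uniform distribution."""
    n = len(affinity)
    x = [1.0 / n] * n
    for _ in range(iters):
        ax = [sum(affinity[i][j] * x[j] for j in range(n)) for i in range(n)]
        total = sum(x[i] * ax[i] for i in range(n))
        if total <= 0:
            break  # degenerate graph with no positive affinity
        new_x = [x[i] * ax[i] / total for i in range(n)]
        change = sum(abs(a - b) for a, b in zip(new_x, x))
        x = new_x
        if change < tol:
            break
    return x

def dominant_subset(affinity, k):
    """Keep the k vertices with the largest converged scores (hypothetical cut)."""
    x = replicator_dynamics(affinity)
    best = sorted(range(len(x)), key=lambda i: -x[i])[:k]
    return sorted(best)

# Toy graph: vertices 0, 1, 2 are strongly connected; vertex 3 is weakly attached.
affinity = [
    [0.0, 1.0, 1.0, 0.1],
    [1.0, 0.0, 1.0, 0.1],
    [1.0, 1.0, 0.0, 0.1],
    [0.1, 0.1, 0.1, 0.0],
]
scores = replicator_dynamics(affinity)
selected = dominant_subset(affinity, k=3)
```

The scores stay on the probability simplex, and the weakly connected vertex is driven toward zero, leaving the cohesive group as the selected subset.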

Reciprocal Hash Tables: Solutions (2) Straightforward table construction strategy: iteratively build hash tables by solving the above problems with respect to the remaining unselected hash functions in the pool
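The iterative strategy on this slide amounts to a simple loop: pick a subset from the functions still available, turn it into a table, and remove it from the pool. In the sketch below, `select_subset` stands in for whatever subset selector is used (for example, a dominant-set solver); the toy `top_two` selector and its scores are purely hypothetical.

```python
def build_tables_greedily(pool, num_tables, select_subset):
    """Build tables one at a time from the hash functions not yet selected."""
    remaining = list(pool)
    tables = []
    for _ in range(num_tables):
        if not remaining:
            break  # the pool is exhausted
        chosen = select_subset(remaining)
        tables.append(chosen)
        chosen_set = set(chosen)
        remaining = [h for h in remaining if h not in chosen_set]
    return tables

# Hypothetical per-function quality scores, indexed by hash function id.
scores = {0: 0.9, 1: 0.2, 2: 0.8, 3: 0.5, 4: 0.4, 5: 0.1}

def top_two(candidates):
    """Toy selector: the two best-scoring functions among the candidates."""
    return sorted(candidates, key=lambda h: -scores[h])[:2]

tables = build_tables_greedily(range(6), num_tables=3, select_subset=top_two)
```

Because each table only sees what earlier tables left behind, later tables are built from weaker candidates, and nothing in the loop makes the tables complementary to one another.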

Reciprocal Hash Tables: Solutions (3) • Reciprocal Hash Tables • redundancy among tables: tables should be complementary to each other, so that the nearest neighbors can be found in at least one of them • Improved table construction strategy: for each table, sequentially select the dominant hash functions that well separate the previously misclassified neighbors, in a boosting manner • Predict neighbor relations: use the current hash tables to predict the relation of each pair of points • Update the similarities: the weights on the misclassified neighbor pairs are amplified to incur greater penalty, while those on the correctly classified ones are shrunk

Sequential Strategy: Boosting • Boosting style: try to correct the previous mistakes by updating weights on neighbor pairs in each round [Figure: three pairwise matrices over samples x1, x2, … (similarities, prediction error, updated similarities), with entries marked > 0, < 0, or = 0]
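The boosting round described here can be sketched with an AdaBoost-style exponential reweighting. The exact prediction rule and update formula are not recoverable from the transcript, so this sketch assumes that a pair is predicted as neighbors when the two points collide in at least one table, and that weights are multiplied by `exp(±alpha)`. The step size `alpha` and the function names are hypothetical.

```python
import math

def predict_neighbor(codes_a, codes_b):
    """Return +1 if the two points share a bucket in any table, else -1.
    codes_a and codes_b hold one bucket code per table."""
    return 1 if any(a == b for a, b in zip(codes_a, codes_b)) else -1

def update_weights(weights, labels, predictions, alpha=0.5):
    """Amplify weights of misclassified pairs and shrink those of correct ones."""
    updated = {}
    for pair, weight in weights.items():
        agreement = labels[pair] * predictions[pair]  # +1 if correct, -1 if wrong
        updated[pair] = weight * math.exp(-alpha * agreement)
    return updated

# Per-point bucket codes in two existing tables (toy values).
point_codes = {0: (5, 9), 1: (5, 2), 2: (7, 3)}
labels = {(0, 1): 1, (0, 2): 1}   # both pairs are true neighbors
weights = {(0, 1): 1.0, (0, 2): 1.0}

predictions = {
    (a, b): predict_neighbor(point_codes[a], point_codes[b]) for (a, b) in labels
}
new_weights = update_weights(weights, labels, predictions)
```

In this toy round the pair (0, 2) is missed by both tables, so its weight grows, and the next table is pushed toward hash functions that bring those two points together.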

Reciprocal Hash Tables: Solutions (4)

Experiments • Datasets • SIFT-1M: 1 million 128-D SIFT features • GIST-1M: 1 million 960-D GIST features • Baselines: • Random selection • Setting: • 10,000 training samples and 1,000 queries on each set • 100 neighbors and 200 non-neighbors for each training sample • The ground truth for each query is defined as the top 5‰ nearest neighbors based on Euclidean distances • Average performance of 10 independent runs

Experiments: Over Basic Hashing Algorithms (1) • Hash Lookup Evaluation • the precision of RAND decreases dramatically with more hash tables, while DHF and RDHF first increase their performance and attain significant performance gains over RAND • both methods faithfully improve the performance over RAND in terms of hash lookup

Experiments: Over Basic Hashing Algorithms (2) • Hamming Ranking Evaluation • DHF and RDHF consistently achieve the best performance over LSH, KLSH and RMMH in most cases • RDHF gains significant performance improvements over DHF

Experiments: Over Multiple Hashing Algorithms • build multiple hash tables using different hashing algorithms with different settings, because many hashing algorithms cannot be used directly to construct multiple tables due to the upper limit on the number of hash functions they produce • double bit (DB) quantization [Liu et al. 2011] on PCA-based Random Rotation Hashing (PCARDB) and Iterative Quantization (ITQDB) [Gong and Lazebnik 2011]

Conclusion • Summary and contributions • a unified strategy for hash table construction supporting different hashing algorithms and various scenarios. • two important selection criteria for hashing performance • formalize it as the dominant set problem in a vertex- and edge-weighted graph representing all pooled hash functions • a reciprocal strategy based on boosting to reduce the redundancy between hash tables

Thank you!
