Embedding and Similarity Search for Point Sets under Translation

Minkyoung Cho and David M. Mount • University of Maryland • SoCG 2008

Point Pattern Matching • Given two point sets P and Q, find Q' ⊆ Q that minimizes Dist(P, Q') = min_t dist(tP, Q'), where t is a geometric transformation (e.g., translation, rotation, …).

Point Pattern Similarity Search • A collection of point sets S = {P1, P2, …, PN} has been preprocessed. Given a query set Q, find the (approximate) nearest Pi with respect to a distance function and a transformation group.

Results • EMD: Earth Mover's Distance • SD: Symmetric Difference Distance

Problem Definition • Point Pattern Similarity Searching: • Distance Measure: Symmetric Difference Distance • Error Model: Outliers (but No Noise) • Transformation: Translation • Restriction: Coordinates are integers • Symmetric difference example: P = {p1,p2,p3,p4}, Q = {p1,p2,p5,p6} • Translation example: P = {0,12,14,23,35,54,59,64}, Q = {15,17,20,26,38,57,65,67}, t = 3; Q − 3 = {12,14,17,23,35,54,62,64} and P share the points {12,14,23,35,54,64}
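
The distance in the problem definition can be checked by brute force on the slide's own example. The following is a minimal sketch, not the authors' algorithm; the function name `sym_diff_under_translation` is introduced here only for illustration. It tries every translation that aligns some point of P with some point of Q.

```python
def sym_diff_under_translation(P, Q):
    """Return (distance, t) minimizing |(P + t) Δ Q| over integer translations t."""
    P, Q = set(P), set(Q)
    # Only translations aligning some p with some q can beat the no-overlap cost.
    candidates = {q - p for p in P for q in Q}
    best = (len(P) + len(Q), 0)
    for t in sorted(candidates):
        shifted = {p + t for p in P}
        d = len(shifted ^ Q)  # size of the symmetric difference
        if d < best[0]:
            best = (d, t)
    return best

P = {0, 12, 14, 23, 35, 54, 59, 64}
Q = {15, 17, 20, 26, 38, 57, 65, 67}
print(sym_diff_under_translation(P, Q))  # → (4, 3)
```

With t = 3, six points coincide and two points on each side are outliers, giving a distance of 4.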

Motivation: Sources of Complexity • Combination of Translation + Outliers • Translation Only • - translate the point set by aligning the leftmost point to the origin • - trivial matching • Outliers Only • - reduce to nearest neighbor search in the Hamming cube (by hashing or random sampling)
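
The "translation only" case can be illustrated in a few lines. This is a sketch under the assumption of 1-dimensional integer points; `canonical` is a hypothetical helper name. It also shows why the trick stops working once outliers are present.

```python
def canonical(P):
    """Translate P so that its leftmost point sits at the origin."""
    lo = min(P)
    return tuple(sorted(p - lo for p in P))

A = {3, 6, 10}
B = {p + 5 for p in A}          # the same pattern, translated by 5
print(canonical(A) == canonical(B))   # translation only: canonical forms agree

C = B | {0}                     # one outlier to the left of the pattern
print(canonical(A) == canonical(C))   # the anchor point changed, so they differ
```

A single outlier at the left end changes the anchor, which is why the combination of translation and outliers needs a different approach.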

Intuition • [Figure: the query Q and each point set P1, P2, …, PN are mapped by f into a metric space]

Embedding: Basic Definitions • Given metric spaces (X, d) and (X', d'), a map f: X → X' is called an embedding. • The contraction of f is the maximum factor by which distances are shrunk, i.e., $\max_{x \ne y} d(x, y) / d'(f(x), f(y))$. • The expansion or stretch of f is the maximum factor by which distances are stretched: $\max_{x \ne y} d'(f(x), f(y)) / d(x, y)$. • The distortion of f is the product of the contraction and expansion.

Main Result: Preliminaries • Main result: There exists a randomized embedding that maps point sets under symmetric difference with respect to translation into the metric space L1 with distortion O(log^2 n). • Assumption: • Each point set has at most n elements and is in dimension d. • Coordinates are integers of magnitude polynomial in n. • Distance Function: Symmetric difference with respect to translation, <PΔQ> = min_t |(P + t)ΔQ| • Target Metric: L1

Outline of Algorithm • 1. Transform d-dimensional points into 1-dimensional points. (Distortion: 1) • 2. Reduce the domain size using a linear hash function. (Distortion: O(1)) • 3. Make invariant under translation. (Distortion: O(log^2 n)) • 4. Reduce the target domain size using a universal hash function. (Distortion: O(1)) • [Figure: running example {3,6,10,14,22} → bit vector of length O(n log n) → bit patterns {101010, ..., 010100, …, 11101} → vector of counts]

Translation Invariant • [Figure: a probe of size ρ = 4 read from the length-s bit vector of P, giving the bit patterns { 1101, 0000, 0010, 1100, 0001, 1010}]

Intuition • [Figure: bit vectors hP and hQ, each of length s] • If one of the probes hits a mismatched position, then the bit patterns generated may differ. • Φ2P = {10,01,00,10,01,00,10,00,00,01,00} • Φ2Q = {10,00,01,00,11,00,10,01,00,11,00} • The probability that one of the probes hits a mismatched position increases when the probe size increases. • Φ4P = {1101,0000,0010,1100,0000,0001,1000,0010,0101,0000,0010} • Φ4Q = {1011,0100,0010,0101,1000,0011,1100,0010,0100,1001,0000}

Relationship between ρ (probe size) and δ* • δ: estimated distance • δ*: original distance • [Figure: distance of invariants as the probe size s/2^i increases; labels: Expectation, Unknown, Upper bound, >2s-2]

Embedding • δ: estimated distance • δ*: original distance • [Figure: distance of invariants (vertical ticks .5 and 1) against 2^0, 2^1, 2^2, …, 2^L, …, 2^H, …, 2^(log 2n) = 2n]

Build Time • The expensive operations are building the invariant and hashing over a large domain. • Building the invariant: (# of Probes) * (# of Translations) • Trivial: O(s) * s = O(n log n) * O(n log n) = O(n^2 log^2 n) • Universal hash function: (# of Elements) * (Matrix operation) = (# of Elements) * (Input Size) * (Output Size) • Trivial: O(s) * O(s) * O(log s) = O(s^2 log s) = O(n^2 log^3 n) • We can improve it to O(n log^3 n) if we merge the two operations. • Surprise!!!

Merge Two Operations • [Figure: the length-s bit vector of P is combined with the random rows r0, …, r_(log s) of H to produce y0, y1, y2, …, y_(s-1)] • A convolution can be computed in O(n log n) time, where n is the size of the array.

Main Result: Formal Statement • Given failure probability β, there exists a randomized embedding from a point set P into a vector ΨP of dimension O(n log^2 n log(1/β)) such that for any P, Q: [Equation: distortion bound] • This embedding can be computed in time O(n log^4 n log(1/β)).

Open Problems • Q1. Can we improve the distortion bound (currently O(log^2 n))? • Cormode & Muthukrishnan show how to embed a string under edit distance with moves into L1 with O(log n log* n) distortion. • Q2. Can we derandomize the algorithm? • Cormode & Muthukrishnan's algorithm is deterministic. • Q3. Can we improve the space/time complexities?

Other Extensions • Q1. Can we support a distance measure that is robust to noisy data (e.g., Hausdorff distance)? • Q2. Can we handle other transformation groups? • - integer scaling? • - integer scaling + translation? • - affine transformations over finite vector spaces? • Point Pattern Similarity Searching: • Distance Measure: Symmetric Difference Distance • Error Model: Outliers (but No Noise) • Transformation: Translation • Restriction: Coordinates are integral

Thank You!

Translation Invariant • P = {3,6,10,14,22} • h(x) = x mod s (e.g., s = 11) • hP = 1 0 0 1 0 0 1 0 0 0 1 • ρ = 4: … { 1101, 0000, 0010, 1100, 0001, 1010} • ΦρP = {13,0,2,12,1,…,10} • h'(x): (for simplicity, x mod 10) • [Figure: ΦρP stored as an array of counts indexed 0 to 9]
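
The construction on this slide can be traced in code. The sketch below is an illustration under stated assumptions: the slides do not give the probe offsets, so `PROBE = (0, 3, 4, 6)` is a hypothetical choice (it happens to reproduce the first five listed patterns), and all function names are introduced here. The key property is that translating P only shifts the bit vector cyclically, so the multiset of probe patterns does not change.

```python
from collections import Counter

def bit_vector(P, s):
    """Characteristic vector of {x mod s : x in P}."""
    bits = [0] * s
    for x in P:
        bits[x % s] = 1
    return bits

def probe_patterns(bits, probe):
    """Read the bits at the probe offsets for every cyclic shift t."""
    s = len(bits)
    return ["".join(str(bits[(t + j) % s]) for j in probe) for t in range(s)]

def invariant(P, s, probe, m):
    """Histogram of hashed probe patterns, with h'(x) = x mod m for simplicity."""
    patterns = probe_patterns(bit_vector(P, s), probe)
    return Counter(int(pat, 2) % m for pat in patterns)

P = {3, 6, 10, 14, 22}
PROBE = (0, 3, 4, 6)  # assumed offsets, not given in the slides

print(bit_vector(P, 11))                          # 1 0 0 1 0 0 1 0 0 0 1
print(probe_patterns(bit_vector(P, 11), PROBE)[:5])
# → ['1101', '0000', '0010', '1100', '0001']

shifted = {p + 7 for p in P}
print(invariant(P, 11, PROBE, 10) == invariant(shifted, 11, PROBE, 10))
```

The last line compares the histograms of P and P + 7, which agree because every pattern of the shifted set appears at a different shift t of the original.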

Trial 1: Geometric Hashing for Translation • Naïve Version: • - Space complexity is O(N n^2) since the frame size is 1. • - With outliers in a query, the # of queries will increase. • Adaptive Version: • - To reduce space complexity, store only c transformed sets; the # of queries will then increase. • - Outliers may lead to false matches, and thus increase the probability of false positives.

Geometric Hashing with Outliers • Depending on the number of outliers $r$ and the frame size $k$, the number of queries needed to get a correct result will increase. • method 1. Pr[choose a valid frame set] = (1 − r/n)^k • method 2. (r + 1) different trials (deterministic) • method 3. pigeonhole principle: Pr[choose a valid frame set] = 1 − r/(n/k) • [Grimson & Huttenlocher 90]: Outliers lead to false matches and increase the probability of false positives.

d-Dimension → 1-Dimension • Let u be the maximum coordinate value of any point. Then we can map a d-dimensional point set to a 1-dimensional point set with coordinates of size at most (3u)^d, without changing the symmetric difference distance under translation. • [Figure: example with the points (1,1) and (5,3) laid out as rows of a bit array; labels: [6,15], [21,30], 1, 35]
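
One way to realize such a mapping is a positional encoding in base 3u. This is an assumption made for illustration, since the slide does not spell out the map; `flatten` and `min_sym_diff` are hypothetical names. The encoding is linear, so every d-dimensional translation becomes a 1-dimensional one, and because coordinate differences lie in [−u, u] and 3u ≥ 2u + 1 for u ≥ 1, distinct difference vectors get distinct codes.

```python
from collections import Counter

def flatten(points, u):
    """Encode integer points with coordinates in [0, u] as base-3u numbers."""
    base = 3 * u
    return {sum(c * base**i for i, c in enumerate(p)) for p in points}

def min_sym_diff(P, Q, sub):
    """min over t of |(P + t) Δ Q|, found via the most frequent difference q - p."""
    overlap = max(Counter(sub(q, p) for p in P for q in Q).values())
    return len(P) + len(Q) - 2 * overlap

P2 = {(1, 1), (5, 3), (2, 4)}
Q2 = {(2, 3), (6, 5), (0, 0)}
u = 6  # assumed bound on the coordinates of this toy example

d2 = min_sym_diff(P2, Q2, lambda q, p: (q[0] - p[0], q[1] - p[1]))
d1 = min_sym_diff(flatten(P2, u), flatten(Q2, u), lambda q, p: q - p)
print(d2, d1)  # → 2 2
```

In this toy example the translation (1, 2) aligns two of the three points, and the flattened sets report the same distance.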

# of Primes & Collision Prob. • Collision Probability • h(x) = x mod s, where s is a prime number in Θ(n log n) chosen uniformly at random • For x != y: Pr[h(x) = h(y)] = Pr[(x mod s) = (y mod s)] = Pr[(x − y) mod s = 0] • Since x, y ∈ Z_(n^c), |x − y| < n^c, so x − y has fewer than c prime factors in Θ(n log n) (each such prime exceeds n). • Pr[h(x) = h(y)] < c/(# of primes) = O(1/n) • Prime Number Theorem • There exist Θ(m/log m) prime numbers in the range between 1 and m.
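
The counting argument can be checked numerically on a small instance. The sketch below uses illustrative values (n = 64, c = 3, and the prime range [n log n, 2 n log n]) that are assumptions, not figures from the slides; it computes the exact collision probability of h(x) = x mod s for a uniformly random prime s.

```python
import math

def primes_in(lo, hi):
    """All primes p with lo <= p <= hi, using a simple sieve."""
    sieve = bytearray([1]) * (hi + 1)
    sieve[0:2] = b"\x00\x00"
    for i in range(2, int(hi**0.5) + 1):
        if sieve[i]:
            sieve[i * i::i] = bytearray(len(range(i * i, hi + 1, i)))
    return [p for p in range(lo, hi + 1) if sieve[p]]

def collision_probability(x, y, primes):
    """Exact Pr[x mod s == y mod s] for s drawn uniformly from `primes`."""
    return sum((x - y) % p == 0 for p in primes) / len(primes)

n, c = 64, 3                        # illustrative values
lo = int(n * math.log2(n))          # n log n = 384
primes = primes_in(lo, 2 * lo)      # candidate moduli s
x, y = 389 * 397, 0                 # |x - y| < n^c with two prime factors in range
print(collision_probability(x, y, primes), (c - 1) / len(primes))
```

Here the difference x − y was chosen to have as many prime factors in the range as possible, namely c − 1 = 2, so the two printed values coincide; any other pair below n^c collides with at most that probability.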

Distance Distortion by Hashing • We can achieve O(1) distortion with a hash function whose collision probability is O(1/n). • Note that collisions can only contract the distance.

Linear Hash Function • h(x) = x mod s, where s is a prime number in Θ(n log n) • Linearity: h(x + t) = h(x) + h(t) (mod s) • Translation: ΦρP = Φρ(P+t) • [Figure: P = {3,6,10,14,22} hashed to the bit vector 1 0 0 1 0 0 1 0 0 0 1 of length s]

Universal Hash Function for large domain • Since the maximum probe size is O(n log n), the input domain of the hash function has size 2^O(n log n). However, it contains only Θ(n log n) elements. • H: 2^s → 2^k • H(x) = R x + b (mod (2,2,…,2)) • R: a random k x s bit matrix • b: a random k-bit vector • Time Complexity: • To compute one value: O(k s) = O((log n) n log n) = O(n log^2 n) • For all s (= O(n log n)) values, the time is O(n^2 log^3 n).
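
The hash H(x) = R x + b over bits can be sketched directly. This is a minimal illustration, assuming x is given as a list of s bits; `make_hash` is a hypothetical helper, and the fixed seed is used only to make the demo repeatable.

```python
import random

def make_hash(s, k, rng):
    """Sample H(x) = R x + b (mod 2): R is a random k x s bit matrix, b a random k-bit vector."""
    R = [[rng.randrange(2) for _ in range(s)] for _ in range(k)]
    b = [rng.randrange(2) for _ in range(k)]

    def H(x):
        # Each output bit is a parity over s input bits, so one value costs O(k s).
        return tuple((sum(r_j & x_j for r_j, x_j in zip(row, x)) + b_i) % 2
                     for row, b_i in zip(R, b))
    return H

rng = random.Random(0)
H = make_hash(s=11, k=4, rng=rng)
print(H([1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1]))
```

Evaluating H separately for each of the s probe patterns is what gives the O(n^2 log^3 n) total on the slide.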

Relationship between ρ and δ* • δ is a guessed distance • δ* is an optimal distance • [Figure: expected distance against the probe size s/2^i; labels: Expectation, Unknown, Upper bound, >2s-2]

Effect of Hash Functions • [Figure: effect of the hash functions h and h']

Merge Two Operations using FFT & Convolution • П = random_probe(ρ, s) • For t = 1, …, s: x(t) = (hP + t)[П] // make an invariant • For t = 1, …, s: • x'(t) = H x(t) + b (mod (2,2,2,…,2)) // H: O(log s) x ρ matrix • ΦρP[x'(t)]++ • Time Complexity: O(s) * O(matrix mult.) = O(s) * O(s log s) • ------ • H = [r1, r2, …, rO(log s)]' // ri: a binary row vector • H x(t) = [r1 x(t), r2 x(t), r3 x(t), …, rO(log s) x(t)]' • ri x(t) = ri (hP + t)[П] = Σ (hP + t)[П ∧ ri] • [ri x(0), ri x(1), …, ri x(s)] = fliplr(hP) ∗ [П ∧ ri] • Time Complexity: O(log s) * O(convolution) = O(log s) * O(s log s)
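
The merge can be demonstrated for a single row ri of H. The sketch below is a standard-library illustration, not the authors' code: it computes the parity ri x(t) for all shifts t once directly and once through an FFT-based circular correlation (a convolution with the reversed indicator vector). The probe offsets and the row `r` are assumed example values, and all function names are introduced here.

```python
import cmath

def fft(a, sign=-1):
    """Recursive radix-2 FFT; len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even, odd = fft(a[0::2], sign), fft(a[1::2], sign)
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k], out[k + n // 2] = even[k] + w, even[k] - w
    return out

def circular_correlation(h, g):
    """c[t] = sum_m g[m] * h[(t + m) mod s], computed in O(s log s)."""
    s = len(h)
    size = 1
    while size < 2 * s:
        size *= 2
    # Linear convolution of h with the reversed g, then fold the result modulo s.
    A = fft(list(h) + [0] * (size - s))
    B = fft(list(reversed(g)) + [0] * (size - s))
    lin = [round((v / size).real) for v in fft([x * y for x, y in zip(A, B)], sign=1)]
    c = [0] * s
    for i, v in enumerate(lin):
        c[(i - (s - 1)) % s] += v
    return c

def parity_direct(h, probe, r):
    """O(s * rho): parity of the probed bits selected by r, for every shift t."""
    s = len(h)
    return [sum(bit * h[(t + p) % s] for p, bit in zip(probe, r)) % 2 for t in range(s)]

def parity_fft(h, probe, r):
    """The same values from one circular correlation."""
    g = [0] * len(h)
    for p, bit in zip(probe, r):
        g[p % len(h)] += bit    # indicator of the probe offsets kept by r
    return [v % 2 for v in circular_correlation(h, g)]

h = [1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1]
probe, r = (0, 3, 4, 6), (1, 0, 1, 1)   # assumed example values
print(parity_direct(h, probe, r) == parity_fft(h, probe, r))
```

Repeating this for each of the O(log s) rows of H gives all hashed values in O(log s) convolutions instead of s separate matrix products.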
