
Efficient Similarity Search with Cache-Conscious Data Traversal


Presentation Transcript


  1. Efficient Similarity Search with Cache-Conscious Data Traversal. Xun Tang. Committee: Tao Yang (Chair), Divy Agrawal, Xifeng Yan. March 16, 2015

  2. Roadmap • Similarity search background • Partition-based method background • Three main components in my thesis • Partition-based symmetric comparison and load balancing [SIGIR’14a]. • Fast runtime execution considering memory hierarchy [SIGIR’13 + Journal]. • Optimized search result ranking with cache-conscious traversal [SIGIR’14b]. • Conclusion

  3. Similarity Search for Big Data • Finding pairs of data objects with similarity score above a threshold. • Example Applications: • Document clustering • Near duplicates • Spam detection • Query suggestion • Advertisement fraud detection • Collaborative filtering & recommendation • Very slow data processing for large datasets • How to make it fast and scalable?

  4. Applications: Duplicate Detection & Clustering • Example of two near-duplicate vectors: d1 = (1, 3, 5, 0, 0, 0, 4, 3, 2, 7), d2 = (1, 2, 2, 0, 0, 0, 4, 3, 2, 7)

  5. All-Pairs Similarity Search (APSS) • Dataset: n normalized vectors d1, …, dn • Cosine-based similarity: sim(di, dj) = di · dj for normalized vectors • Given n normalized vectors, compute all pairs (di, dj) such that sim(di, dj) ≥ τ for a given threshold τ • Quadratic complexity: O(n²) pairs
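
As a concrete baseline, here is a minimal brute-force APSS sketch (my illustration, not code from the slides), assuming rows are already unit-normalized so cosine similarity reduces to a dot product:

    import numpy as np

    def apss_bruteforce(vectors, tau):
        """All pairs (i, j), i < j, with cosine similarity >= tau.
        Assumes each row of `vectors` is unit-normalized, so the
        cosine similarity is just the dot product."""
        n = vectors.shape[0]
        pairs = []
        for i in range(n):
            # One row against all later rows: n^2/2 dot products overall.
            sims = vectors[i + 1:] @ vectors[i]
            for j in np.nonzero(sims >= tau)[0]:
                pairs.append((i, i + 1 + j, float(sims[j])))
        return pairs

    # Example: 4 random unit vectors, threshold 0.8
    v = np.random.rand(4, 10)
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    print(apss_bruteforce(v, 0.8))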

  6. Big Data Challenges for Similarity Search • [Table: sequential time in hours for several datasets; values marked * are estimated by sampling] • 4M tweets fit in memory, but take days to process • Approximate processing: Df-limit [Lin SIGIR’09] removes features whose document frequency exceeds an upper limit

  7. Inverted Indexing and Parallel Score Accumulation for APSS [Lin SIGIR’09; Baraglia et al. ICDM’10] • [Figure: a feature-wise inverted index over vectors, e.g. d2 with weights w2,1, w2,3, w2,5 under features f1, f3, f5; map tasks emit partial results per feature, which a reduce phase accumulates into sim(d2, d4), at the cost of communication overhead between phases]
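
The idea can be sketched sequentially (my illustration, not the cited papers' MapReduce code): partial products wi,t · wj,t are generated per feature and then summed per pair, which is exactly the work the map and reduce phases split.

    from collections import defaultdict

    def apss_inverted_index(docs, tau):
        """docs: list of dicts mapping feature -> weight (unit-normalized).
        Builds a feature-wise inverted index, then accumulates partial
        products per candidate pair -- the sequential analogue of the
        map (emit wi,t * wj,t) / reduce (sum into sim) pipeline."""
        index = defaultdict(list)           # feature -> [(doc_id, weight)]
        for i, d in enumerate(docs):
            for t, w in d.items():
                index[t].append((i, w))

        scores = defaultdict(float)         # (i, j) -> accumulated score
        for t, postings in index.items():
            for a in range(len(postings)):
                i, wi = postings[a]
                for b in range(a + 1, len(postings)):
                    j, wj = postings[b]
                    scores[(i, j)] += wi * wj   # partial result for feature t

        return {p: s for p, s in scores.items() if s >= tau}

    docs = [{"f1": 0.6, "f3": 0.8}, {"f1": 0.8, "f3": 0.6}, {"f5": 1.0}]
    print(apss_inverted_index(docs, 0.9))   # {(0, 1): 0.96}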

  8. Parallel Solutions for Exact APSS • Parallel score accumulation [Lin SIGIR’09; Baraglia et al. ICDM’10] • Partition-based Similarity Search (PSS) [Alabduljalil et al. WSDM’13]

  9. Parallel time comparison: PSS vs. Parallel Score Accumulation • [Chart, Twitter dataset: PSS is 25x faster than inverted indexing with partial result parallelism]

  10. PSS: Partition-based similarity search • Key techniques : • Partition-based symmetric comparison and load balancing [SIGIR’14a]. • Challenge comes from the skewed distribution in data partition sizes and irregular dissimilarity relationship in large datasets. • Analysis on competitiveness to the optimum. • Scalable for large datasets on hundreds of cores. • Fast runtime execution considering memory hierarchy [SIGIR’13 + Journal].

  11. Symmetry of Comparison • Partition-level comparison is symmetric • Example: should Pi compare with Pj, or Pj compare with Pi? • The direction impacts the communication cost and load of the corresponding tasks • Hence the choice of comparison direction affects load balance

  12. Similarity Graph → Comparison Graph • Load assignment process: transition from the similarity graph to a comparison graph

  13. Load Balance Measurement & Examples • Load balance metric: graph cost = max (task cost) • Task cost is the sum of • Self comparison, including computation and I/O cost • Comparisons with the partitions whose edges point to it

  14. Challenges of Optimal Load Balance • Skewed distribution of node connectivity & node sizes • Empirical data

  15. Two-Stage Load Balance Stage 1: Initial assignment of edge directions • Key Idea: tasks with small partitions or low connectivity should absorb more load • Optimize a sequence of steps that balances the load

  16. Stage 2: Assignment refinement • Key idea: gradually shift load from heavy tasks to their lightest neighbors • Only reverse an edge direction if it is beneficial • [Figures: result of Stage 1; one refinement step] • A sketch of both stages follows below
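
A minimal sketch of both stages, under my own simplifying assumptions (the actual SIGIR'14a algorithm orders Stage 1 assignments more carefully and models I/O in task cost):

    def two_stage_balance(nodes, edges, cost):
        """nodes: task ids; edges: undirected pairs (u, v) that must be
        compared; cost[(u, v)]: cost of comparing u's partition with v's.
        Returns load[t] = total cost assigned to task t.
        Hypothetical simplification of the two-stage heuristic."""
        load = {t: 0.0 for t in nodes}
        owner = {}
        # Stage 1: give each comparison to the currently lighter endpoint
        # (lightly loaded / low-connectivity tasks absorb more work).
        for u, v in edges:
            t = u if load[u] <= load[v] else v
            owner[(u, v)] = t
            load[t] += cost[(u, v)]
        # Stage 2: reverse an edge direction only if it reduces the
        # heavier task's load (graph cost = max task cost).
        improved = True
        while improved:
            improved = False
            for (u, v), t in owner.items():
                other = v if t == u else u
                if load[other] + cost[(u, v)] < load[t]:
                    load[t] -= cost[(u, v)]
                    load[other] += cost[(u, v)]
                    owner[(u, v)] = other
                    improved = True
        return load

    edges = [("P1", "P2"), ("P1", "P3"), ("P2", "P3")]
    cost = {e: c for e, c in zip(edges, [4.0, 2.0, 1.0])}
    print(two_stage_balance(["P1", "P2", "P3"], edges, cost))

Each Stage 2 reversal strictly lowers the larger of the two affected loads, so the refinement loop terminates.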

  17. Competitive to Optimal Task Load Balancing • Is this two-stage algorithm competitive with the optimum? • Optimum = minimum (maximum task cost) over all assignments • Result: two-stage solution ≤ (2 + δ) × optimum, where δ is the ratio of I/O and communication cost over computation cost • In our tested cases, δ ≈ 10%

  18. Competitive to Optimum Runtime Scheduler • Can the solution of the task assignment algorithm be competitive with the one produced by optimum runtime scheduling? • PTopt = minimum parallel time on q cores • A greedy scheduler (e.g. Hadoop MapReduce) executes the tasks produced by the two-stage algorithm; the yielded schedule length is PTq • Result: [bound relating PTq to PTopt, shown as a formula on the slide]

  19. Scalability: Parallel Time and Speedup • Efficiency declines as I/O overhead among machines grows in larger clusters • The YMusic dataset is not large enough to use more cores for amortizing overhead

  20. Comparison with Circular Load Assignment [Alabduljalil et al. WSDM’13] • Parallel time reduction: up to 39% from Stage 1, up to a further 11% from Stage 2 • [Charts: task cost and improvement percentage]

  21. PSS: Partition-based similarity search • Key techniques : • Partition-based symmetric comparison and load balancing [SIGIR’14a]. • Fast runtime execution considering memory hierarchy [SIGIR’13 + Journal]. • Splitting hosted partitions to fit into cache reduces slow memory data access (PSS1). • Coalescing vectors with size-controlled inverted indexing can improve the temporal locality of visited data (PSS2). • Cost modeling for memory hierarchy access as a guidance to optimize parameter setting.

  22. Memory-hierarchy aware execution in PSS • Task areas: S = vectors of the partition this task owns; B = vectors of other partitions to compare; C = temporary storage • Task steps: read the assigned partition into area S, then repeat: read some vectors vi from other partitions; compare vi with S; output similar vector pairs; until all potentially similar vectors are compared
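
In code, the task loop might look like the following sketch (the structure comes from the slide; names such as other_partition_batches are hypothetical placeholders, not the thesis API):

    def pss_task(S, other_partition_batches, tau):
        """S: list of (doc_id, vector) this task owns, held in area S.
        other_partition_batches: yields batches (area B) of (doc_id,
        vector) pairs from other partitions; vectors are dicts of
        {feature: weight}. Illustrative only."""
        results = []                         # area C: temporary scores
        for B in other_partition_batches:    # read some vectors into B
            for j, vj in B:
                for i, vi in S:              # compare vj with all of S
                    sim = sum(w * vj.get(t, 0.0) for t, w in vi.items())
                    if sim >= tau:
                        results.append((i, j, sim))
        return results

    S = [(0, {"f1": 0.6, "f3": 0.8})]
    batches = [[(5, {"f1": 0.8, "f3": 0.6})]]
    print(pss_task(S, batches, 0.9))         # [(0, 5, 0.96)]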

  23. Problem: PSS area S is too big to fit in cache • [Figure: S holds the inverted index of the owned vectors and is too long to fit in cache; C is the accumulator for S; B holds other vectors]

  24. PSS1: Cache-conscious data splitting • [Figure: after splitting, S becomes splits S1, S2, …, Sq, each compared against B with its own accumulator C; what split size fits in cache?]

  25. PSS1 Task • Read S and divide it into many splits • Read other vectors into B • For each split Sx: Compare(Sx, B), then output similarity scores • Compare(Sx, B): for di in Sx, for dj in B, accumulate sim(di, dj) += wi,t · wj,t over shared features t, and skip dj early if sim(di, dj) + maxwdi · sumdj < τ (the pair can no longer reach the threshold)
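
A runnable rendering of this pseudocode, with my own fleshing-out of the elided parts: maxwdi is the largest feature weight in di and sumdj the sum of dj's weights, so maxwdi · sumdj upper-bounds the score a pair can still gain.

    def compare_split(Sx, B, tau):
        """Compare one cache-resident split Sx against block B.
        Sx: list of (i, di, maxw_di); B: list of (j, dj, sum_dj);
        di, dj are dicts {feature: weight}. The real PSS1 applies the
        bound test during accumulation; here it is a pre-check."""
        out = []
        for i, di, maxw_di in Sx:
            for j, dj, sum_dj in B:
                # Even if every weight in dj met di's largest weight,
                # could this pair still reach tau? If not, skip it.
                if maxw_di * sum_dj < tau:
                    continue
                sim = 0.0
                for t, wj in dj.items():
                    sim += di.get(t, 0.0) * wj
                if sim >= tau:
                    out.append((i, j, sim))
        return out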

  26. Modeling Memory/Cache Access of PSS1 • [Figure: the inner-loop statement sim(di, dj) += wi,t · wj,t and the test sim(di, dj) + maxwdi · sumdj < τ touch areas Si, B, and C] • Total number of data accesses: D0 = D0(Si) + D0(B) + D0(C)

  27. Cache misses and data access time • Memory and cache access counts: D0 = total memory data accesses; D1, D2, D3 = accesses that miss L1, L2, L3 respectively • Access times: δi = access time at cache level i; δmem = access time in memory • Total data access time = (D0 − D1)δ1 + (D1 − D2)δ2 + (D2 − D3)δ3 + D3δmem

  28–31. Total data access time = (D0 − D1)δ1 + (D1 − D2)δ2 + (D2 − D3)δ3 + D3δmem • Per-access cost depends on where the data is found: in L1, ~2 cycles; in L2, 6–10 cycles; in L3, 30–40 cycles; in memory, 100–300 cycles
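
Given these latencies, the model is straightforward to evaluate; a small sketch (the default cycle counts below are mid-range picks from the slides' figures, not measurements):

    def data_access_time(D0, D1, D2, D3,
                         d1=2, d2=8, d3=35, dmem=200):
        """Total data access time in cycles:
        (D0-D1)*d1 + (D1-D2)*d2 + (D2-D3)*d3 + D3*dmem.
        Defaults taken from the ~2 / 6-10 / 30-40 / 100-300
        cycle figures above."""
        return ((D0 - D1) * d1 + (D1 - D2) * d2 +
                (D2 - D3) * d3 + D3 * dmem)

    # e.g. 10% of accesses miss L1, 1% miss L2, 0.1% miss L3:
    print(data_access_time(D0=1_000_000, D1=100_000, D2=10_000, D3=1_000))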

  32. Time comparison: PSS vs. PSS1 • Consider the case where a PSS1 split fits in the L2 cache • The L1 cache miss ratio of PSS exceeds 10% in practice, and memory access is two orders of magnitude slower than L1 access • The resulting ideal speedup of PSS1 over PSS is ~10x

  33. Actual vs. Predicted • Avg. task time ≈ #features × (lookup + multiply + add) + memory/cache access time

  34. PSS2: Vector coalescing • Issues: • PSS1 focused on splitting S to fit into cache. • PSS1 does not consider cache reuse to improve temporal locality in memory areas B and C. • Solution: coalescing multiple vectors in B

  35. PSS2: Example for improved locality • [Figure: striped areas of Si, B, and C stay resident in cache] • Improves temporal locality in memory areas B and C • Amortizes the inverted index lookup cost over the coalesced vectors
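
One way to render the coalescing idea in code (my illustration, assuming a per-split inverted index): vectors from B are processed in groups of b, so each posting list fetched from the split's index is reused b times while still warm in cache.

    from collections import defaultdict

    def compare_coalesced(index_Sx, B, b, tau):
        """index_Sx: inverted index of one split Sx, feature -> [(i, w)].
        B: list of (j, dj) with dj a dict {feature: weight}.
        Processes B in groups of b coalesced vectors for temporal
        locality in the accumulator and the index."""
        out = []
        for start in range(0, len(B), b):
            group = B[start:start + b]            # coalesced vectors
            acc = defaultdict(float)              # (i, j) -> partial sim
            for t in {t for _, dj in group for t in dj}:
                for i, wi in index_Sx.get(t, ()): # one lookup per feature,
                    for j, dj in group:           # reused across the group
                        if t in dj:
                            acc[(i, j)] += wi * dj[t]
            out.extend((i, j, s) for (i, j), s in acc.items() if s >= tau)
        return out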

  36. Effect of s and b on PSS2 performance (Twitter) • [Chart: performance across split sizes s and coalescing factors b; the fastest configuration is marked]

  37. Improvement Ratio of PSS1, PSS2 over PSS • [Chart: improvement ratios across datasets; up to 2.7x]

  38. Incorporate LSH with PSS • LSH functions for signature generation: • MinHash for Jaccard similarity • Random projection for cosine similarity
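
For the cosine case, random-projection signature generation follows a standard recipe (this sketch is the textbook construction, not the thesis code): each of the k bits per round is the sign of a dot product with a random hyperplane.

    import numpy as np

    def rp_signatures(vectors, k, l, seed=0):
        """Random-projection LSH for cosine similarity.
        Returns an (n, l) array of k-bit bucket ids: l rounds,
        each hashing every vector to the k sign bits of k random
        hyperplane projections."""
        rng = np.random.default_rng(seed)
        n, dim = vectors.shape
        buckets = np.zeros((n, l), dtype=np.int64)
        for r in range(l):
            planes = rng.standard_normal((dim, k))      # k random hyperplanes
            bits = (vectors @ planes) >= 0              # sign bit per plane
            buckets[:, r] = bits @ (1 << np.arange(k))  # pack bits into an id
        return buckets

    # Vectors landing in the same bucket in any round become candidates.
    v = np.random.rand(5, 20)
    print(rp_signatures(v, k=8, l=4))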

  39. LSH Pipeline • LSH sub-steps: • Projection generation • Signature generation • Bucket generation • Benefits: • Great for parallelization • Scales to larger datasets

  40. Effectiveness of Our Method • 100% precision (for comparison: 67% in [Ture et al. SIGIR’11]) • A guaranteed recall ratio for a given similarity threshold, using k bits per round and l rounds
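
The recall guarantee follows the standard LSH amplification analysis (the textbook bound, not a formula copied from the slides): if one signature bit agrees on a pair with probability p, then with k bits per round and l independent rounds,

    \Pr[\text{pair becomes a candidate}] = 1 - \bigl(1 - p^{k}\bigr)^{l},
    \qquad p = 1 - \frac{\theta(d_i, d_j)}{\pi} \ \text{for random projection.}

Raising k shrinks p^k (smaller buckets, fewer candidates), while raising l restores recall, which is exactly the tradeoff of k discussed on the next slide.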

  41. Efficiency – 20M Tweets • >95% recall for 0.95 cosine similarity • 50 cores • Tradeoff of k • Too high: partition too small • Too low: not enough speedup via hashing

  42. Method Comparison – 20M Tweets • LSH: improves efficiency (speed) with recall bound • PSS: guarantees precision

  43. Efficiency – 40M ClueWeb • 95% recall for 0.95 cosine similarity • 300 cores • LSH+PSS better than Pure LSH • Precision is increased to 100% with faster speed • LSH+PSS better than Pure PSS • 71x speedup

  44. PSS with Incremental Updates • New documents are appended to the end of a new partition • Compare the new partition with all the original partitions • Update the static partitions with the new documents

  45. Result Ranking After Similarity-based Retrieval or Other Metrics

  46. Motivation • Machine-learnt ranking models are popular • Ranking ensembles such as gradient boosted regression trees (GBRT) • A large number of trees are used to improve accuracy • Winning teams at the Yahoo! Learning-to-rank challenge used ensembles with 2k to 20k trees, or even 300k trees with bagging methods • Computing large ensembles is time-consuming • Irregular access to document attributes impairs CPU cache reuse • Unorchestrated slow memory access incurs significant cost: memory access latency is ~200x that of the L1 cache • Dynamic tree branching impairs instruction branch prediction

  47. Data Traversal in Existing Solutions: Document-ordered Traversal (DOT) and Scorer-ordered Traversal (SOT)

  48. Our Proposal: 2D Block Traversal

  49. Why Better? • [Analysis: total slow memory accesses in score calculation for each traversal order] • 2D blocking can be up to s times faster, but s is capped by cache size • 2D blocking fully exploits cache capacity for better temporal locality • Block-VPred: a combined solution that applies 2D blocking on top of VPred [Asadi et al. TKDE’13], which converts control dependence to data dependence to reduce instruction branch misprediction
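
A minimal sketch of 2D blocking (my own illustration, with trees abstracted as callables; VPred's branch-free tree evaluation is not shown): the (documents × trees) grid is walked in s-by-t blocks, so a block of trees is reused across s documents while both stay cache-resident, instead of streaming all trees per document (DOT) or all documents per tree (SOT).

    def score_2d_block(docs, trees, s, t):
        """docs: list of feature vectors; trees: list of callables
        tree(doc) -> partial score. Walks the (documents x trees)
        grid in s-by-t blocks; each tree block is reused across s
        documents before moving on."""
        scores = [0.0] * len(docs)
        for d0 in range(0, len(docs), s):          # document block
            for t0 in range(0, len(trees), t):     # tree block
                for tree in trees[t0:t0 + t]:      # t trees stay hot
                    for d in range(d0, min(d0 + s, len(docs))):
                        scores[d] += tree(docs[d])
        return scores

    # Toy usage with hypothetical stump "trees" on feature 0:
    trees = [lambda x, th=i * 0.1: 1.0 if x[0] > th else -1.0
             for i in range(10)]
    docs = [[0.05 * i] for i in range(8)]
    print(score_2d_block(docs, trees, s=4, t=5))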

  50. Scoring Time per Document per Tree in Nanoseconds • Query latency = scoring time × n × m, for n documents ranked with an m-tree model
