
Efficient Similarity Search with Cache-Conscious Data Traversal


Presentation Transcript


  1. Efficient Similarity Search with Cache-Conscious Data Traversal. Xun Tang. Committee: Tao Yang (Chair), Divy Agrawal, Xifeng Yan. March 16, 2015

  2. Roadmap • Similarity search background • Partition-based method background • Three main components in my thesis • Partition-based symmetric comparison and load balancing [SIGIR’14a]. • Fast runtime execution considering memory hierarchy [SIGIR’13 + Journal]. • Optimized search result ranking with cache-conscious traversal [SIGIR’14b]. • Conclusion

  3. Similarity Search for Big Data • Finding pairs of data objects with similarity score above a threshold. • Example Applications: • Document clustering • Near duplicates • Spam detection • Query suggestion • Advertisement fraud detection • Collaborative filtering & recommendation • Very slow data processing for large datasets • How to make it fast and scalable?

  4. Applications: Duplicate Detection & Clustering • Example of two near-duplicate vectors: d1 = (1, 3, 5, 0, 0, 0, 4, 3, 2, 7), d2 = (1, 2, 2, 0, 0, 0, 4, 3, 2, 7)

  5. All-Pairs Similarity Search (APSS) • Dataset: n normalized vectors d1, …, dn • Cosine-based similarity: sim(di, dj) = di · dj for normalized vectors • Given n normalized vectors, compute all pairs (di, dj) such that sim(di, dj) ≥ τ for a given threshold τ • Quadratic complexity: O(n²) pairs
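
As a concrete baseline, here is a minimal brute-force APSS sketch (my illustration, not code from the slides), assuming rows are already unit-normalized so cosine similarity reduces to a dot product:

    import numpy as np

    def apss_bruteforce(vectors, tau):
        """All pairs (i, j), i < j, with cosine similarity >= tau.
        Assumes each row of `vectors` is unit-normalized, so the
        cosine similarity is just the dot product."""
        n = vectors.shape[0]
        pairs = []
        for i in range(n):
            # One row against all later rows: n^2/2 dot products overall.
            sims = vectors[i + 1:] @ vectors[i]
            for j in np.nonzero(sims >= tau)[0]:
                pairs.append((i, i + 1 + j, float(sims[j])))
        return pairs

    # Example: 4 random unit vectors, threshold 0.8
    v = np.random.rand(4, 10)
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    print(apss_bruteforce(v, 0.8))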

  6. Big Data Challenges for Similarity Search • [Table: sequential time in hours for several datasets; values marked * are estimated by sampling] • 4M tweets fit in memory, but take days to process • Approximate processing: Df-limit [Lin SIGIR’09] removes features whose document frequency exceeds an upper limit

  7. Inverted Indexing and Parallel Score Accumulation for APSS [Lin SIGIR’09; Baraglia et al. ICDM’10] • [Figure: a feature-wise inverted index over vectors, e.g. d2 with weights w2,1, w2,3, w2,5 under features f1, f3, f5; map tasks emit partial results per feature, which a reduce phase accumulates into sim(d2, d4), at the cost of communication overhead between phases]
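
The idea can be sketched sequentially (my illustration, not the cited papers' MapReduce code): partial products wi,t · wj,t are generated per feature and then summed per pair, which is exactly the work the map and reduce phases split.

    from collections import defaultdict

    def apss_inverted_index(docs, tau):
        """docs: list of dicts mapping feature -> weight (unit-normalized).
        Builds a feature-wise inverted index, then accumulates partial
        products per candidate pair -- the sequential analogue of the
        map (emit wi,t * wj,t) / reduce (sum into sim) pipeline."""
        index = defaultdict(list)           # feature -> [(doc_id, weight)]
        for i, d in enumerate(docs):
            for t, w in d.items():
                index[t].append((i, w))

        scores = defaultdict(float)         # (i, j) -> accumulated score
        for t, postings in index.items():
            for a in range(len(postings)):
                i, wi = postings[a]
                for b in range(a + 1, len(postings)):
                    j, wj = postings[b]
                    scores[(i, j)] += wi * wj   # partial result for feature t

        return {p: s for p, s in scores.items() if s >= tau}

    docs = [{"f1": 0.6, "f3": 0.8}, {"f1": 0.8, "f3": 0.6}, {"f5": 1.0}]
    print(apss_inverted_index(docs, 0.9))   # {(0, 1): 0.96}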

  8. Parallel Solutions for Exact APSS • Parallel score accumulation [Lin SIGIR’09; Baraglia et al. ICDM’10] • Partition-based Similarity Search (PSS) [Alabduljalil et al. WSDM’13]

  9. Parallel time comparison: PSS vs. Parallel Score Accumulation • [Chart, Twitter dataset: PSS is 25x faster than inverted indexing with partial result parallelism]

  10. PSS: Partition-based similarity search • Key techniques : • Partition-based symmetric comparison and load balancing [SIGIR’14a]. • Challenge comes from the skewed distribution in data partition sizes and irregular dissimilarity relationship in large datasets. • Analysis on competitiveness to the optimum. • Scalable for large datasets on hundreds of cores. • Fast runtime execution considering memory hierarchy [SIGIR’13 + Journal].

  11. Symmetry of Comparison • Partition-level comparison is symmetric • Example: should Pi compare with Pj, or Pj compare with Pi? • The direction impacts the communication cost and load of the corresponding tasks • Hence the choice of comparison direction affects load balance

  12. Similarity Graph → Comparison Graph • Load assignment process: transition from the similarity graph to a comparison graph

  13. Load Balance Measurement & Examples • Load balance metric: graph cost = max (task cost) • Task cost is the sum of • Self comparison, including computation and I/O cost • Comparisons with the partitions whose edges point to it

  14. Challenges of Optimal Load Balance • Skewed distribution of node connectivity & node sizes • Empirical data

  15. Two-Stage Load Balance Stage 1: Initial assignment of edge directions • Key Idea: tasks with small partitions or low connectivity should absorb more load • Optimize a sequence of steps that balances the load

  16. Stage 2: Assignment refinement • Key idea: gradually shift load from heavy tasks to their lightest neighbors • Only reverse an edge direction if it is beneficial • [Figures: result of Stage 1; one refinement step] • A sketch of both stages follows below
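
A minimal sketch of both stages, under my own simplifying assumptions (the actual SIGIR'14a algorithm orders Stage 1 assignments more carefully and models I/O in task cost):

    def two_stage_balance(nodes, edges, cost):
        """nodes: task ids; edges: undirected pairs (u, v) that must be
        compared; cost[(u, v)]: cost of comparing u's partition with v's.
        Returns load[t] = total cost assigned to task t.
        Hypothetical simplification of the two-stage heuristic."""
        load = {t: 0.0 for t in nodes}
        owner = {}
        # Stage 1: give each comparison to the currently lighter endpoint
        # (lightly loaded / low-connectivity tasks absorb more work).
        for u, v in edges:
            t = u if load[u] <= load[v] else v
            owner[(u, v)] = t
            load[t] += cost[(u, v)]
        # Stage 2: reverse an edge direction only if it reduces the
        # heavier task's load (graph cost = max task cost).
        improved = True
        while improved:
            improved = False
            for (u, v), t in owner.items():
                other = v if t == u else u
                if load[other] + cost[(u, v)] < load[t]:
                    load[t] -= cost[(u, v)]
                    load[other] += cost[(u, v)]
                    owner[(u, v)] = other
                    improved = True
        return load

    edges = [("P1", "P2"), ("P1", "P3"), ("P2", "P3")]
    cost = {e: c for e, c in zip(edges, [4.0, 2.0, 1.0])}
    print(two_stage_balance(["P1", "P2", "P3"], edges, cost))

Each Stage 2 reversal strictly lowers the larger of the two affected loads, so the refinement loop terminates.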

  17. Competitive to Optimal Task Load Balancing • Is this two-stage algorithm competitive with the optimum? • Optimum = minimum (maximum task cost) over all assignments • Result: two-stage solution ≤ (2 + δ) × optimum, where δ is the ratio of I/O and communication cost over computation cost • In our tested cases, δ ≈ 10%

  18. Competitive to Optimum Runtime Scheduler • Can the solution of the task assignment algorithm be competitive with the one produced by optimum runtime scheduling? • PTopt = minimum parallel time on q cores • A greedy scheduler (e.g. Hadoop MapReduce) executes the tasks produced by the two-stage algorithm; the yielded schedule length is PTq • Result: [bound relating PTq to PTopt, shown as a formula on the slide]

  19. Scalability: Parallel Time and Speedup • Efficiency declines as I/O overhead among machines grows in larger clusters • The YMusic dataset is not large enough to use more cores for amortizing overhead

  20. Comparison with Circular Load Assignment [Alabduljalil et al. WSDM’13] • Parallel time reduction: up to 39% from Stage 1, up to a further 11% from Stage 2 • [Charts: task cost and improvement percentage]

  21. PSS: Partition-based similarity search • Key techniques : • Partition-based symmetric comparison and load balancing [SIGIR’14a]. • Fast runtime execution considering memory hierarchy [SIGIR’13 + Journal]. • Splitting hosted partitions to fit into cache reduces slow memory data access (PSS1). • Coalescing vectors with size-controlled inverted indexing can improve the temporal locality of visited data (PSS2). • Cost modeling for memory hierarchy access as a guidance to optimize parameter setting.

  22. Memory-hierarchy aware execution in PSS • Task areas: S = vectors of the partition this task owns; B = vectors of other partitions to compare; C = temporary storage • Task steps: read the assigned partition into area S, then repeat: read some vectors vi from other partitions; compare vi with S; output similar vector pairs; until all potentially similar vectors are compared
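
In code, the task loop might look like the following sketch (the structure comes from the slide; names such as other_partition_batches are hypothetical placeholders, not the thesis API):

    def pss_task(S, other_partition_batches, tau):
        """S: list of (doc_id, vector) this task owns, held in area S.
        other_partition_batches: yields batches (area B) of (doc_id,
        vector) pairs from other partitions; vectors are dicts of
        {feature: weight}. Illustrative only."""
        results = []                         # area C: temporary scores
        for B in other_partition_batches:    # read some vectors into B
            for j, vj in B:
                for i, vi in S:              # compare vj with all of S
                    sim = sum(w * vj.get(t, 0.0) for t, w in vi.items())
                    if sim >= tau:
                        results.append((i, j, sim))
        return results

    S = [(0, {"f1": 0.6, "f3": 0.8})]
    batches = [[(5, {"f1": 0.8, "f3": 0.6})]]
    print(pss_task(S, batches, 0.9))         # [(0, 5, 0.96)]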

  23. Problem: PSS area S is too big to fit in cache • [Figure: S holds the inverted index of the owned vectors and is too long to fit in cache; C is the accumulator for S; B holds other vectors]

  24. PSS1: Cache-conscious data splitting • [Figure: after splitting, S becomes splits S1, S2, …, Sq, each compared against B with its own accumulator C; what split size fits in cache?]

  25. PSS1 Task • Read S and divide it into many splits • Read other vectors into B • For each split Sx: Compare(Sx, B), then output similarity scores • Compare(Sx, B): for di in Sx, for dj in B, accumulate sim(di, dj) += wi,t · wj,t over shared features t, and skip dj early if sim(di, dj) + maxwdi · sumdj < τ (the pair can no longer reach the threshold)
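
A runnable rendering of this pseudocode, with my own fleshing-out of the elided parts: maxwdi is the largest feature weight in di and sumdj the sum of dj's weights, so maxwdi · sumdj upper-bounds the score a pair can still gain.

    def compare_split(Sx, B, tau):
        """Compare one cache-resident split Sx against block B.
        Sx: list of (i, di, maxw_di); B: list of (j, dj, sum_dj);
        di, dj are dicts {feature: weight}. The real PSS1 applies the
        bound test during accumulation; here it is a pre-check."""
        out = []
        for i, di, maxw_di in Sx:
            for j, dj, sum_dj in B:
                # Even if every weight in dj met di's largest weight,
                # could this pair still reach tau? If not, skip it.
                if maxw_di * sum_dj < tau:
                    continue
                sim = 0.0
                for t, wj in dj.items():
                    sim += di.get(t, 0.0) * wj
                if sim >= tau:
                    out.append((i, j, sim))
        return out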

  26. Modeling Memory/Cache Access of PSS1 • [Figure: the inner-loop statement sim(di, dj) += wi,t · wj,t and the test sim(di, dj) + maxwdi · sumdj < τ touch areas Si, B, and C] • Total number of data accesses: D0 = D0(Si) + D0(B) + D0(C)

  27. Cache misses and data access time • Memory and cache access counts: D0 = total memory data accesses; D1, D2, D3 = accesses that miss L1, L2, L3 respectively • Access times: δi = access time at cache level i; δmem = access time in memory • Total data access time = (D0 − D1)δ1 + (D1 − D2)δ2 + (D2 − D3)δ3 + D3δmem

  28–31. Total data access time = (D0 − D1)δ1 + (D1 − D2)δ2 + (D2 − D3)δ3 + D3δmem • Per-access cost depends on where the data is found: in L1, ~2 cycles; in L2, 6–10 cycles; in L3, 30–40 cycles; in memory, 100–300 cycles
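
Given these latencies, the model is straightforward to evaluate; a small sketch (the default cycle counts below are mid-range picks from the slides' figures, not measurements):

    def data_access_time(D0, D1, D2, D3,
                         d1=2, d2=8, d3=35, dmem=200):
        """Total data access time in cycles:
        (D0-D1)*d1 + (D1-D2)*d2 + (D2-D3)*d3 + D3*dmem.
        Defaults taken from the ~2 / 6-10 / 30-40 / 100-300
        cycle figures above."""
        return ((D0 - D1) * d1 + (D1 - D2) * d2 +
                (D2 - D3) * d3 + D3 * dmem)

    # e.g. 10% of accesses miss L1, 1% miss L2, 0.1% miss L3:
    print(data_access_time(D0=1_000_000, D1=100_000, D2=10_000, D3=1_000))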

  32. Time comparison: PSS vs. PSS1 • Consider the case where a PSS1 split fits in the L2 cache • The L1 cache miss ratio of PSS exceeds 10% in practice, and memory access is two orders of magnitude slower than L1 access • The resulting ideal speedup of PSS1 over PSS is ~10x

  33. Actual vs. Predicted • Avg. task time ≈ #features × (lookup + multiply + add) + memory/cache access time

  34. PSS2: Vector coalescing • Issues: • PSS1 focused on splitting S to fit into cache. • PSS1 does not consider cache reuse to improve temporal locality in memory areas B and C. • Solution: coalescing multiple vectors in B

  35. PSS2: Example for improved locality • [Figure: striped areas of Si, B, and C stay resident in cache] • Improves temporal locality in memory areas B and C • Amortizes the inverted index lookup cost over the coalesced vectors
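
One way to render the coalescing idea in code (my illustration, assuming a per-split inverted index): vectors from B are processed in groups of b, so each posting list fetched from the split's index is reused b times while still warm in cache.

    from collections import defaultdict

    def compare_coalesced(index_Sx, B, b, tau):
        """index_Sx: inverted index of one split Sx, feature -> [(i, w)].
        B: list of (j, dj) with dj a dict {feature: weight}.
        Processes B in groups of b coalesced vectors for temporal
        locality in the accumulator and the index."""
        out = []
        for start in range(0, len(B), b):
            group = B[start:start + b]            # coalesced vectors
            acc = defaultdict(float)              # (i, j) -> partial sim
            for t in {t for _, dj in group for t in dj}:
                for i, wi in index_Sx.get(t, ()): # one lookup per feature,
                    for j, dj in group:           # reused across the group
                        if t in dj:
                            acc[(i, j)] += wi * dj[t]
            out.extend((i, j, s) for (i, j), s in acc.items() if s >= tau)
        return out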

  36. Effect of s and b on PSS2 performance (Twitter) • [Chart: performance across split sizes s and coalescing factors b; the fastest configuration is marked]

  37. Improvement Ratio of PSS1, PSS2 over PSS • [Chart: improvement ratios across datasets; up to 2.7x]

  38. Incorporate LSH with PSS • LSH functions for signature generation: • MinHash for Jaccard similarity • Random projection for cosine similarity
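
For the cosine case, random-projection signature generation follows a standard recipe (this sketch is the textbook construction, not the thesis code): each of the k bits per round is the sign of a dot product with a random hyperplane.

    import numpy as np

    def rp_signatures(vectors, k, l, seed=0):
        """Random-projection LSH for cosine similarity.
        Returns an (n, l) array of k-bit bucket ids: l rounds,
        each hashing every vector to the k sign bits of k random
        hyperplane projections."""
        rng = np.random.default_rng(seed)
        n, dim = vectors.shape
        buckets = np.zeros((n, l), dtype=np.int64)
        for r in range(l):
            planes = rng.standard_normal((dim, k))      # k random hyperplanes
            bits = (vectors @ planes) >= 0              # sign bit per plane
            buckets[:, r] = bits @ (1 << np.arange(k))  # pack bits into an id
        return buckets

    # Vectors landing in the same bucket in any round become candidates.
    v = np.random.rand(5, 20)
    print(rp_signatures(v, k=8, l=4))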

  39. LSH Pipeline • LSH sub-steps: • Projection generation • Signature generation • Bucket generation • Benefits: • Great for parallelization • Scales to larger datasets

  40. Effectiveness of Our Method • 100% precision (for comparison: 67% in [Ture et al. SIGIR’11]) • A guaranteed recall ratio for a given similarity threshold, using k bits per round and l rounds
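
The recall guarantee follows the standard LSH amplification analysis (the textbook bound, not a formula copied from the slides): if one signature bit agrees on a pair with probability p, then with k bits per round and l independent rounds,

    \Pr[\text{pair becomes a candidate}] = 1 - \bigl(1 - p^{k}\bigr)^{l},
    \qquad p = 1 - \frac{\theta(d_i, d_j)}{\pi} \ \text{for random projection.}

Raising k shrinks p^k (smaller buckets, fewer candidates), while raising l restores recall, which is exactly the tradeoff of k discussed on the next slide.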

  41. Efficiency – 20M Tweets • >95% recall for 0.95 cosine similarity • 50 cores • Tradeoff of k • Too high: partition too small • Too low: not enough speedup via hashing

  42. Method Comparison – 20M Tweets • LSH: improves efficiency (speed) with recall bound • PSS: guarantees precision

  43. Efficiency – 40M ClueWeb • 95% recall for 0.95 cosine similarity • 300 cores • LSH+PSS better than Pure LSH • Precision is increased to 100% with faster speed • LSH+PSS better than Pure PSS • 71x speedup

  44. PSS with Incremental Updates • New documents are appended to the end of a new partition • Compare the new partition with all the original partitions • Update the static partitions with the new documents

  45. Result Ranking After Similarity-based Retrieval or Other Metrics

  46. Motivation • Machine-learnt ranking models are popular • Ranking ensembles such as gradient boosted regression trees (GBRT) • A large number of trees are used to improve accuracy • Winning teams at the Yahoo! Learning-to-rank challenge used ensembles with 2k to 20k trees, or even 300k trees with bagging methods • Computing large ensembles is time-consuming • Irregular access to document attributes impairs CPU cache reuse • Unorchestrated slow memory access incurs significant cost: memory access latency is ~200x that of the L1 cache • Dynamic tree branching impairs instruction branch prediction

  47. Data Traversal in Existing Solutions: Document-ordered Traversal (DOT) and Scorer-ordered Traversal (SOT)

  48. Our Proposal: 2D Block Traversal

  49. Why Better? • [Analysis: total slow memory accesses in score calculation for each traversal order] • 2D blocking can be up to s times faster, but s is capped by cache size • 2D blocking fully exploits cache capacity for better temporal locality • Block-VPred: a combined solution that applies 2D blocking on top of VPred [Asadi et al. TKDE’13], which converts control dependence to data dependence to reduce instruction branch misprediction
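
A minimal sketch of 2D blocking (my own illustration, with trees abstracted as callables; VPred's branch-free tree evaluation is not shown): the (documents × trees) grid is walked in s-by-t blocks, so a block of trees is reused across s documents while both stay cache-resident, instead of streaming all trees per document (DOT) or all documents per tree (SOT).

    def score_2d_block(docs, trees, s, t):
        """docs: list of feature vectors; trees: list of callables
        tree(doc) -> partial score. Walks the (documents x trees)
        grid in s-by-t blocks; each tree block is reused across s
        documents before moving on."""
        scores = [0.0] * len(docs)
        for d0 in range(0, len(docs), s):          # document block
            for t0 in range(0, len(trees), t):     # tree block
                for tree in trees[t0:t0 + t]:      # t trees stay hot
                    for d in range(d0, min(d0 + s, len(docs))):
                        scores[d] += tree(docs[d])
        return scores

    # Toy usage with hypothetical stump "trees" on feature 0:
    trees = [lambda x, th=i * 0.1: 1.0 if x[0] > th else -1.0
             for i in range(10)]
    docs = [[0.05 * i] for i in range(8)]
    print(score_2d_block(docs, trees, s=4, t=5))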

  50. Scoring Time per Document per Tree in Nanoseconds • Query latency = scoring time × n × m, for n documents ranked with an m-tree model
