
RankReduce – Processing K-Nearest Neighbors Queries on Top of MapReduce



Presentation Transcript


  1. RankReduce – Processing K-Nearest Neighbors Queries on Top of MapReduce Aleksandar Stupar, Sebastian Michel, and Ralf Schenkel LSDS-IR, Genève 2010

  2. Talk Outline • Motivation • Background • RankReduce Framework • Experimental Evaluation • Conclusion and Outlook

  3. Talk Outline • Motivation • Background • RankReduce Framework • Experimental Evaluation • Conclusion and Outlook

  4. Similarity Search • Given a document, find similar documents • e.g., given a photo, find similar ones • Used extensively in content-based retrieval

  5. The Problem • Huge datasets • Growing number of digital cameras • Web 2.0 success: Facebook (FB), Flickr, … • 60+ million photos uploaded to FB weekly (2007) • Approximately 5,000 GB of data • Similarity search • How to do it at this scale? • Solution must be • Distributed • Reliable • Efficient

  6. The RankReduce Approach • Framework for similarity search • Large-scale data • Vector-based data (pictures, music, video, …) • Built on top of • Locality Sensitive Hashing • MapReduce framework

  7. Talk Outline • Motivation • Background • RankReduce Framework • Experimental Evaluation • Conclusion and Outlook

  8. K-Nearest Neighbors • Feature vector representation • Similarity defined by a distance measure • L1 (Manhattan): $d(x, y) = \sum_i |x_i - y_i|$ • L2 (Euclidean): $d(x, y) = \sqrt{\sum_i (x_i - y_i)^2}$ • Exact solutions • Linear scan • Tree structures
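For concreteness, here is a minimal, self-contained sketch of the two distance measures and the exact linear-scan solution mentioned on this slide. The class and method names (ExactKnn, l1, l2, linearScanKnn) are illustrative assumptions, not taken from the paper.

```java
import java.util.Arrays;
import java.util.Comparator;

// Minimal sketch of exact k-NN over feature vectors via a linear scan.
public class ExactKnn {

    // L1 (Manhattan) distance: sum of absolute coordinate differences.
    static double l1(double[] x, double[] y) {
        double d = 0;
        for (int i = 0; i < x.length; i++) d += Math.abs(x[i] - y[i]);
        return d;
    }

    // L2 (Euclidean) distance: square root of the sum of squared differences.
    static double l2(double[] x, double[] y) {
        double d = 0;
        for (int i = 0; i < x.length; i++) d += (x[i] - y[i]) * (x[i] - y[i]);
        return Math.sqrt(d);
    }

    // Linear scan: compute the distance from the query to every data point
    // and keep the k closest ones.
    static double[][] linearScanKnn(double[][] data, double[] query, int k) {
        double[][] sorted = data.clone();
        Arrays.sort(sorted, Comparator.<double[]>comparingDouble(v -> l2(v, query)));
        return Arrays.copyOfRange(sorted, 0, Math.min(k, sorted.length));
    }
}
```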

  9. Locality Sensitive Hashing (LSH) (1) • Efficient for high-dimensional data • Approximate K-Nearest Neighbors • What does approximate mean? • Trade-off between precision, space requirements, and processing time • Is approximate good enough? • Sketches of the documents • E.g., color structure, Chroma features

  10. Locality Sensitive Hashing (LSH) (2) • Feature vectors are hashed to buckets • Neighbors collide into the same bucket with high probability • Query processing • Multi-probe LSH [2]

  11. Locality Preserving Property • A family of hash functions $\mathcal{H}$ is $(r_1, r_2, p_1, p_2)$-sensitive if, for any two points $p, q$: • $d(p, q) \le r_1 \Rightarrow \Pr_{\mathcal{H}}[h(p) = h(q)] \ge p_1$ • $d(p, q) \ge r_2 \Rightarrow \Pr_{\mathcal{H}}[h(p) = h(q)] \le p_2$ • (useful when $r_1 < r_2$ and $p_1 > p_2$)

  12. LSH based on p-stable distributions • Vector projection LSH: $h_{a,b}(v) = \left\lfloor \frac{a \cdot v + b}{W} \right\rfloor$ • LSH parameters • Select $a$ component-wise from the Normal distribution $N(0,1)$ • Select $b$ from the Uniform$(0, W)$ distribution • $W$ controls the bucket size
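A small sketch of one such p-stable hash function, h(v) = floor((a·v + b) / W), under the assumptions stated above (a drawn component-wise from N(0,1), b from Uniform(0, W)). The class and field names are illustrative, not taken from RankReduce.

```java
import java.util.Random;

// One p-stable LSH hash function: project onto a random vector, shift, quantize.
public class PStableHash {
    private final double[] a; // random projection vector, components ~ N(0,1)
    private final double b;   // random offset ~ Uniform(0, W)
    private final double w;   // bucket width W: larger W means larger buckets

    PStableHash(int dimensions, double w, Random rnd) {
        this.w = w;
        this.a = new double[dimensions];
        for (int i = 0; i < dimensions; i++) a[i] = rnd.nextGaussian();
        this.b = rnd.nextDouble() * w;
    }

    // h(v) = floor((a . v + b) / W)
    int hash(double[] v) {
        double dot = 0;
        for (int i = 0; i < v.length; i++) dot += a[i] * v[i];
        return (int) Math.floor((dot + b) / w);
    }
}
```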

  13. MapReduce • Large scale data processing • By Google [4] • Distributing • Data • Processing

  14. MapReduce Properties • Cluster of commodity machines • Fault tolerant • Scalable • Implementation • Distributed File System (DFS) • MapReduce jobs

  15. MapReduce Jobs • Data is pre-distributed • Calculations are done where the data resides • Programming model • Map function • Reduce function • Job • Multiple map tasks • Sort and merge • One or multiple reduce tasks • [Diagram: map tasks reading input from the DFS and feeding a reduce task]
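To illustrate the programming model itself (this is the generic textbook word-count example, not part of RankReduce), the sketch below shows how a map function and a reduce function fit together using the org.apache.hadoop.mapreduce API.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Classic word count: mappers emit (word, 1), the reducer sums the counts.
public class WordCount {

    public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        // Map runs where the input split is stored and emits one pair per token.
        @Override
        public void map(Object key, Text value, Context ctx)
                throws IOException, InterruptedException {
            StringTokenizer it = new StringTokenizer(value.toString());
            while (it.hasMoreTokens()) {
                word.set(it.nextToken());
                ctx.write(word, ONE);
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        // Reduce receives all counts for one word after the sort/merge phase.
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));
        }
    }
}
```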

  16. Perfect Marriage MapReduce + LSH = RankReduce

  17. Talk Outline • Motivation • Background • RankReduce Framework • Experimental Evaluation • Conclusion and Outlook

  18. RankReduce (1) • LSH index is stored in Distributed File System • Hash tables mapped to folders • Buckets mapped to files • Example DFS layout: /HashTable1: bucket_0_0_1_5, bucket_2_-3_3_1; /HashTable2: bucket_12_0_1_-1, bucket_8_-1_9_10, …; /HashTableN: bucket_0_0_0_-1, bucket_12_1_13_-9, …
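A hedged sketch of how a vector's hash values could be turned into the folder/file path shown in the layout above. The naming scheme bucket_&lt;h1&gt;_&lt;h2&gt;_..._&lt;hk&gt; is inferred from the example paths, and the helper name bucketFile is an assumption.

```java
import java.util.StringJoiner;

// Maps one hash table index plus the k hash values of a vector to a DFS path:
// one hash table = one folder, one bucket = one file named from the hash values.
public class BucketPath {
    static String bucketFile(int tableIndex, int[] hashValues) {
        StringJoiner name = new StringJoiner("_", "bucket_", "");
        for (int h : hashValues) name.add(Integer.toString(h));
        return "/HashTable" + tableIndex + "/" + name;
    }
}
```

For example, bucketFile(1, new int[]{0, 0, 1, 5}) yields "/HashTable1/bucket_0_0_1_5", matching the first file in the layout above.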

  19. RankReduce (2) • Benefits • Fast lookup at query time • Only the probed data is read • Block-based sequential access • Downside • Possibly high number of files

  20. Query Processing (1) • Executed as a MapReduce job • List of buckets to probe as input • Single probe • [Diagram: query issued as an MR job reading the probed bucket files from the DFS]

  21. Query Processing (2) • Map function calculates similarity • Reduce function sorts and emits the K-nearest neighbors • Possible secondary sort • [Diagram: mappers over the probed buckets, a reducer emitting the KNN result]
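A hedged Hadoop-style sketch of such a query job: mappers read the probed bucket files and emit (distance, id) pairs, and a single reducer relies on the framework's sort phase to emit the k smallest distances. The bucket-file line format, passing the query through the job configuration, and the value of k are simplifying assumptions, not the paper's exact implementation.

```java
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class KnnQueryJob {

    public static class DistanceMapper extends Mapper<Object, Text, DoubleWritable, Text> {
        private double[] query;

        @Override
        protected void setup(Context ctx) {
            // Assumption: the query vector is passed via the job configuration.
            String[] parts = ctx.getConfiguration().get("knn.query").split(",");
            query = new double[parts.length];
            for (int i = 0; i < parts.length; i++) query[i] = Double.parseDouble(parts[i]);
        }

        @Override
        protected void map(Object key, Text line, Context ctx)
                throws IOException, InterruptedException {
            // Assumption: each bucket-file line is "id<TAB>v1,v2,...,vd".
            String[] cols = line.toString().split("\t");
            String[] coords = cols[1].split(",");
            double d = 0;
            for (int i = 0; i < coords.length; i++) {
                double diff = Double.parseDouble(coords[i]) - query[i];
                d += diff * diff;
            }
            // Emit (distance, id); the shuffle sorts by distance ascending.
            ctx.write(new DoubleWritable(Math.sqrt(d)), new Text(cols[0]));
        }
    }

    public static class TopKReducer extends Reducer<DoubleWritable, Text, Text, DoubleWritable> {
        private static final int K = 10; // illustrative value of k
        private int emitted = 0;

        @Override
        protected void reduce(DoubleWritable distance, Iterable<Text> ids, Context ctx)
                throws IOException, InterruptedException {
            // Keys arrive sorted by distance, so the first K emitted values are the k-NN.
            for (Text id : ids) {
                if (emitted++ >= K) return;
                ctx.write(new Text(id.toString()), new DoubleWritable(distance.get()));
            }
        }
    }
}
```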

  22. Talk Outline • Motivation • Background • RankReduce Framework • Experimental Evaluation • Conclusion and Outlook

  23. Datasets • Flickr dataset (~54 million photos) • 64 dimensions • Color structure features • CoPhIR data collection • Synthetic dataset • 32 dimensions • Base vectors with i.i.d. N(0,1) components, scaled by 10 • Neighbors: slightly perturbed base vectors
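A small sketch of how such a synthetic dataset could be generated, following the description above (32-dimensional base vectors with i.i.d. N(0,1) components scaled by 10, and neighbors as slightly perturbed copies of a base vector). The noise magnitude is an assumed parameter, not a value from the paper.

```java
import java.util.Random;

// Synthetic data generator in the spirit of the slide above.
public class SyntheticData {
    static final int DIM = 32;

    // Base vector: i.i.d. N(0,1) components, scaled by 10.
    static double[] baseVector(Random rnd) {
        double[] v = new double[DIM];
        for (int i = 0; i < DIM; i++) v[i] = rnd.nextGaussian() * 10.0;
        return v;
    }

    // A neighbor is the base vector plus small Gaussian noise (assumed magnitude).
    static double[] neighborOf(double[] base, Random rnd, double noise) {
        double[] v = base.clone();
        for (int i = 0; i < DIM; i++) v[i] += rnd.nextGaussian() * noise;
        return v;
    }
}
```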

  24. Parameter Tuning • Too many files can degrade performance (file size < 64 KB) • LSH parameter tuning (precision, index size, …)

  25. Experimental Evaluation • RankReduce approach vs. linear scan • Hadoop • Open-source implementation of MapReduce • Single-machine installation • One mapper per machine allowed • Measured • Map task execution time (approximately constant) • Number of map tasks per job

  26. Results

  27. Results Interpretation • Precision • Real image data: >85% • Synthetic data: >70% • Runtime (number of map tasks) • Real image data: 4-5 times better than linear scan • Synthetic data: 3-4 times better than linear scan

  28. Talk Outline • Motivation • Background • RankReduce Framework • Experimental Evaluation • Conclusion and Outlook

  29. Conclusion and Outlook • Similarity search on large datasets • Robust and scalable framework • Locality Sensitive Hashing & MapReduce • Outlook: experimental evaluation on a real compute cluster (several TBs) • Music retrieval • Chroma features • Time dimension

  30. Thank you! Questions?
