
On Large-Scale Retrieval Tasks with Ivory and MapReduce



Presentation Transcript


  1. On Large-Scale Retrieval Tasks with Ivory and MapReduce Nov 7th, 2012

  2. My Field … Information Retrieval (IR) is … finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections • Quite effective (at some things) • Highly visible (mostly) • Commercially successful (some of them)

  3. IR is not just “Document Retrieval” • Clustering and Classification • Question answering • Filtering, tracking, routing • Recommender systems • Leveraging XML and other Metadata • Text mining • Novelty identification • Meta-search (multi-collection searching) • Summarization • Cross-language mechanisms • Evaluation techniques • Multimedia retrieval • Social media analysis • …

  4. My Research … [Diagram: large-scale processing of text, from emails (Enron, ~500,000 documents, identity resolution) to web pages (ClueWeb, ~1,000,000,000 documents, web search), feeding user applications]

  5. Back in 2009 … • Before 2009, only small text collections were available • Largest: ~1M documents • ClueWeb09 • Crawled by CMU in 2009 • ~1B documents! • Need to move to cluster environments • MapReduce/Hadoop seems like a promising framework

  6. Ivory • E2E search toolkit using MapReduce • Designed completely for the Hadoop environment • Experimental platform for research • Supports common text collections, plus ClueWeb09 • Open-source release • Implements state-of-the-art retrieval models http://ivory.cc

  7. MapReduce Framework [Diagram: (a) Map: each input record (k1, v1) is mapped to intermediate pairs [(k2, v2)]; (b) Shuffle: the framework groups values by key; (c) Reduce: each group (k2, [v2]) is reduced to output pairs [(k3, v3)]. The framework handles “everything else”!]
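As an illustration of the three phases above (a sketch, not the actual Hadoop/Java API), the framework's contract can be simulated in a few lines of Python, with word count as the canonical example:

```python
from itertools import groupby
from operator import itemgetter

def map_reduce(inputs, mapper, reducer):
    # (a) Map: each input record yields (k2, v2) pairs
    mapped = [kv for rec in inputs for kv in mapper(rec)]
    # (b) Shuffle: the framework groups values by key
    mapped.sort(key=itemgetter(0))
    grouped = ((k, [v for _, v in kvs])
               for k, kvs in groupby(mapped, key=itemgetter(0)))
    # (c) Reduce: each (k2, [v2]) group yields (k3, v3) outputs
    return [out for k, vs in grouped for out in reducer(k, vs)]

def mapper(line):
    return [(w, 1) for w in line.split()]

def reducer(word, counts):
    return [(word, sum(counts))]

result = map_reduce(["clinton obama clinton"], mapper, reducer)
# → [('clinton', 2), ('obama', 1)]
```

The user supplies only `mapper` and `reducer`; sorting, grouping, distribution, and fault tolerance are the "everything else" the real framework handles.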

  8. The IR Black Box [Diagram: Documents and a Query go into the black box; Hits come out]

  9. Inside the IR Black Box [Diagram: offline, a representation function turns documents into document representations stored in an index; online, a representation function turns the query into a query representation; a comparison function matches the query representation against the index to produce hits]

  10. Indexing [Diagram: a collection of documents (A: “Clinton Obama Clinton”, B: “Clinton Cheney”, C: “Clinton Barack Obama”) becomes an inverted index mapping terms to posting lists of (document ID, term frequency), e.g. Clinton → (A, 2)(B, 1)(C, 1)]

  11. Indexing [Diagram: the same collection with B: “Clinton Romney”; the inverted index maps, e.g., Clinton → (A, 2)(B, 1)(C, 1) and Obama → (A, 1)(C, 1)]

  12. Indexing with MapReduce [Diagram: (a) Map over documents A (“Clinton Obama Clinton”), B (“Clinton Romney”), C (“Clinton Barack Obama”), emitting (term, (docid, tf)) pairs; (b) Shuffle groups postings by term; (c) Reduce writes out each term’s posting list]
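Using the slide's three toy documents, the same map/shuffle/reduce pattern builds the inverted index; this is a minimal Python sketch of the idea, not Ivory's actual (Java/Hadoop) indexer:

```python
from collections import Counter
from itertools import groupby
from operator import itemgetter

docs = {
    "A": "clinton obama clinton",
    "B": "clinton romney",
    "C": "clinton barack obama",
}

# Map: for each document, emit (term, (doc_id, term_frequency))
mapped = [(t, (doc_id, tf))
          for doc_id, text in docs.items()
          for t, tf in Counter(text.split()).items()]

# Shuffle + reduce: group postings by term into sorted posting lists
mapped.sort(key=itemgetter(0))
index = {t: sorted(p for _, p in ps)
         for t, ps in groupby(mapped, key=itemgetter(0))}
# index["clinton"] → [('A', 2), ('B', 1), ('C', 1)]
```

Each reducer sees all postings for one term, so posting lists come out fully assembled with no further coordination.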

  13. Retrieval Directly from HDFS! • Cute hack: use Hadoop to launch partition servers • Embed an HTTP server inside each mapper • Mappers start up, initialize their servers, and enter an infinite service loop! • Why do this? • Unified Hadoop ecosystem • Simplifies data management issues [Diagram: a search client talks to a retrieval broker, which fans out to partition servers that read TREC’09/TREC’10 index partitions directly from HDFS datanodes]

  14. Roadmap [Diagram: Ivory at the center, linked to ACL 2008, SIGIR 2011 (×2), CIKM 2011, TREC 2009, TREC 2010, and CloudCom 2011]

  15. Roadmap [Diagram: Ivory linked to ACL 2008 and SIGIR 2011]

  16. Abstract Problem [Figure: a matrix of similarity scores computed between all document pairs] • Applications: • Clustering • Coreference resolution • “more-like-that” queries

  17. Decomposition • The inner product decomposes over terms: a term contributes to a pair’s score only if it appears in both documents • map over posting lists; reduce sums the partial contributions

  18. Pairwise Similarity [Diagram: (a) Generate pairs: each posting list (e.g. Clinton, Romney, Barack, Obama) emits a partial score for every pair of documents it contains; (b) Group pairs by document pair; (c) Sum the partial scores]
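The decomposition can be sketched directly on the running inverted-index example; this is an illustrative Python version of the posting-list trick (the index contents mirror the slides, and the real implementation is a MapReduce job, not a single loop):

```python
from collections import defaultdict
from itertools import combinations

# Inverted index from the indexing slides: term → [(doc_id, weight)]
index = {
    "clinton": [("A", 2), ("B", 1), ("C", 1)],
    "obama":   [("A", 1), ("C", 1)],
    "romney":  [("B", 1)],
    "barack":  [("C", 1)],
}

# Map over posting lists: each term emits one partial product per
# pair of documents it appears in (terms in only one doc emit nothing)
pairs = defaultdict(int)
for term, postings in index.items():
    for (d1, w1), (d2, w2) in combinations(sorted(postings), 2):
        pairs[(d1, d2)] += w1 * w2   # reduce: sum the partial scores

# pairs[("A", "C")] = 2*1 (clinton) + 1*1 (obama) = 3
```

No document ever needs to be compared against the whole collection; only documents that share a term generate work.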

  19. Terms: Zipfian Distribution • Each term t contributes O(df_t²) partial results • Very few terms dominate the computation: • most frequent term (“said”) → 3% • 10 most frequent terms → 15% • 100 most frequent terms → 57% • 1,000 most frequent terms → 95% • Remedy: drop ~0.1% of terms (a 99.9% df-cut) [Plot: document frequency (df) vs. term rank]
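The df-cut itself is simple; a sketch of the pruning step (the exact cut fraction and tie-breaking here are illustrative, not Ivory's code):

```python
def df_cut(df, cut=0.001):
    """Drop the `cut` fraction of terms with the highest document
    frequency; those few terms generate most of the O(df^2) pairs."""
    n_drop = round(len(df) * cut)          # e.g. 0.1% of the vocabulary
    by_df = sorted(df.items(), key=lambda kv: kv[1], reverse=True)
    return dict(by_df[n_drop:])            # keep everything else
```

Because the contribution is quadratic in df, removing the tiny high-df tail eliminates the bulk of the intermediate pairs while touching almost none of the vocabulary.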

  20. Efficiency (disk space) • AQUAINT-2 collection, ~906K documents • 8 trillion intermediate pairs without the df-cut, 0.5 trillion with it • Hadoop, 19 PCs, each with 2 single-core processors, 4GB memory, 100GB disk

  21. ACL’08 Effectiveness • Dropping 0.1% of terms: “near-linear” growth, fits on disk, costs 2% in effectiveness • Hadoop, 19 PCs, each with 2 single-core processors, 4GB memory, 100GB disk

  22. Cross-Lingual Pairwise Similarity • Find similar document pairs in different languages • Useful for multilingual text mining and machine translation • Application: automatic generation of potential “interwiki” language links • More difficult than the monolingual case!

  23. Vocabulary Space Matching [Diagram: two ways to compare German Doc A with English Doc B. MT: translate Doc A into English, then build an English doc vector. CLIR: build a German doc vector for Doc A and project it into the English vocabulary space]
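The CLIR route can be sketched as a weighted projection through a translation table; the table entries below are made-up toy probabilities, and a real system would use a statistical translation lexicon:

```python
from collections import defaultdict

# Hypothetical translation table: p(english_term | german_term)
ttable = {
    "buch":  {"book": 0.9, "volume": 0.1},
    "preis": {"prize": 0.6, "price": 0.4},
}

def clir_project(de_vector, ttable):
    """Project a German doc vector into the English vocabulary space:
    each German term's weight is spread over its English translations."""
    en_vector = defaultdict(float)
    for de_term, weight in de_vector.items():
        for en_term, p in ttable.get(de_term, {}).items():
            en_vector[en_term] += weight * p
    return dict(en_vector)

vec = clir_project({"buch": 1.0, "preis": 0.5}, ttable)
```

After projection, both documents live in the same (English) vector space, so any monolingual similarity machinery applies unchanged.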

  24. Locality-Sensitive Hashing (LSH) • The cosine score is a good similarity measure, but expensive! • LSH is a method for effectively reducing the search space when looking for similar pairs • Each vector is converted into a compact representation, called a signature • A sliding-window algorithm uses these signatures to search for similar articles in the collection • Vectors close to each other are likely to have similar signatures
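One standard way to generate such signatures is random projection (simhash-style), mentioned on the next slide; a sketch, with hyperplane count and seed chosen arbitrarily here:

```python
import random

def rp_signature(vec, n_bits=16, seed=42):
    """Simhash-style signature: bit i is the sign of the dot product
    of the vector with random hyperplane i. Vectors with high cosine
    similarity agree on most bits."""
    rng = random.Random(seed)
    planes = [[rng.gauss(0.0, 1.0) for _ in vec] for _ in range(n_bits)]
    return tuple(int(sum(p * x for p, x in zip(plane, vec)) >= 0)
                 for plane in planes)

sig = rp_signature([0.324, 0.227, 0.01, 0.8])
```

Note the signature depends only on the vector's direction, not its length, which matches the cosine measure it approximates.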

  25. Solution Overview [Diagram: Nf German articles and Ne English articles are preprocessed into Ne+Nf English document vectors via CLIR projection, e.g. <nobel=0.324, prize=0.227, book=0.01, …>; signature generation (random projection / minhash / simhash) yields Ne+Nf signatures, e.g. 11100001010; a sliding window algorithm outputs similar article pairs]

  26. MapReduce 1: Table Generation Phase [Diagram: the signatures are permuted Q times (p1 … pQ) and each permuted table S1′ … SQ′ is sorted]

  27. MapReduce 2: Detection Phase [Diagram: each sorted table is split into chunks of consecutive signatures; nearby signatures within a chunk are compared]

  28. Evaluation • Ground truth: • Sample 1064 German articles • cosine score ≥ 0.3 • Compare the sliding window algorithm against a brute-force approach • required for the exact solution • a good upper-bound reference for recall and running time

  29. Evaluation • 95% recall at 39% of the cost • 99% recall at 62% of the cost • No free lunch!

  30. SIGIR’11 Contribution to Wikipedia • Identify links between German and English Wikipedia articles • “Metadaten” → “Metadata”, “Semantic Web”, “File Format” • “Pierre Curie” → “Marie Curie”, “Pierre Curie”, “Helene Langevin-Joliot” • “Kirgisistan” → “Kyrgyzstan”, “Tulip Revolution”, “2010 Kyrgyzstani uprising”, “2010 South Kyrgyzstan riots”, “Uzbekistan” • Results degrade when the two articles differ significantly in length

  31. Roadmap [Diagram: Ivory linked to CIKM 2011]

  32. Approximate Positional Indexes • “Learning to rank” models learn effective ranking functions from proximity features, which require term positions, but a positional index is large and query evaluation is slow • Approximate positions instead: a smaller index and faster query evaluation • Is close enough good enough?

  33. Variable-Width Buckets • 5 buckets per document [Diagram: documents d1 and d2 are each split into buckets 1–5, so bucket width varies with document length]

  34. Fixed-Width Buckets • Buckets of length W [Diagram: the shorter d1 spans buckets 1–3, the longer d2 spans buckets 1–5]
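Both bucketing schemes boil down to a one-line mapping from an exact term position to a bucket ID (which is what gets stored in place of the position); a sketch, with the bucket count and width below chosen to match the slides' examples:

```python
def variable_buckets(position, doc_length, num_buckets=5):
    """Variable-width: every document gets the same number of buckets,
    so bucket width scales with document length (IDs 0..num_buckets-1)."""
    return position * num_buckets // doc_length

def fixed_buckets(position, width=64):
    """Fixed-width: buckets of constant length W, so longer documents
    span more buckets."""
    return position // width
```

Storing the small bucket ID instead of the exact position is what shrinks the index, at the cost of proximity features that are only approximately correct.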

  35. CIKM’11 Effectiveness

  36. Roadmap [Diagram: Ivory linked to iHadoop and SIGIR ’11]

  37. Test Collections • Documents, queries, and relevance judgments • Important driving force behind IR innovation • Without test collections, it’s impossible to: • Evaluate search systems • Tune ranking functions / train models • Traditional • Exhaustive • Pooling • Recent Methodologies • Behavioral logging (query logs, click logs, etc.) • Minimal test collections • Crowdsourcing

  38. Web Graph [Diagram: pages P1–P7 link to the SIGIR 2012 page and to Google, several with the anchor text “web search”]

  39. Queries and Judgments? [Diagram: in the web graph, anchor text lines ≈ pseudo queries and their target pages (e.g. SIGIR 2012, Google, Bing) ≈ relevant candidates; noise reduction is needed]
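The pseudo-test-collection idea can be sketched in a few lines; the link tuples and the "at least two sources agree" noise-reduction rule below are hypothetical illustrations, not the paper's actual method or data:

```python
from collections import defaultdict

# Hypothetical web-graph edges: (source_page, anchor_text, target_page)
links = [
    ("P1", "web search", "SIGIR 2012"),
    ("P4", "web search", "SIGIR 2012"),
    ("P5", "web search", "SIGIR 2012"),
    ("P7", "web search", "Google"),
    ("P2", "sigir", "SIGIR 2012"),
]

# Anchor text lines become pseudo queries; their link targets become
# relevance candidates, weighted by how many source pages agree
counts = defaultdict(lambda: defaultdict(int))
for src, anchor, target in links:
    counts[anchor.lower()][target] += 1

# Toy noise reduction: keep targets linked by at least 2 sources
pseudo_judgments = {q: {t for t, n in ts.items() if n >= 2}
                    for q, ts in counts.items()}
# pseudo_judgments["web search"] → {"SIGIR 2012"}
```

The resulting (pseudo query, relevant candidates) pairs can then stand in for human judgments when training learning-to-rank models.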

  40. SIGIR’11

  41. Roadmap [Diagram: Ivory linked to CloudCom 2011]

  42. Iterative MapReduce Applications • Many machine learning and data mining applications • PageRank, k-means, HITS, … • Every iteration has to wait until the previous iteration has completely written its output to the DFS (unnecessary waiting time) • Every iteration starts by reading from the DFS what the previous iteration just wrote (wasted CPU time, I/O, and bandwidth) • MapReduce is not designed to run iterative applications efficiently

  43. Goal

  44. CloudCom’11 Asynchronous Pipeline

  45. Conclusion • MapReduce allows large-scale processing over web data • Ivory • E2E open-source retrieval engine for research • Runs completely on Hadoop • even retrieval, directly from HDFS • Efficiency-effectiveness tradeoff • Cross-lingual pairwise similarity • Efficient implementation using MapReduce • Efficiency-effectiveness tradeoff • Approximate positional indexes • Efficient, and nearly as effective as exact positions • Pseudo test collections • Possible! • Effective for training L2R models • MapReduce is not good for iterative algorithms http://ivory.cc

  46. Collaborators • Jimmy Lin • Don Metzler • Doug Oard • Ferhan Ture • Nima Asadi • Lidan Wang • Eslam Elnikety • Hany Ramadan

  47. Thank You! Questions?
