
Ivory: Pairwise Document Similarity in Large Collections with MapReduce

Presentation Transcript


  1. Ivory: Pairwise Document Similarity in Large Collections with MapReduce. Tamer Elsayed, Jimmy Lin, and Doug Oard. Laboratory for Computational Linguistics and Information Processing (CLIP Lab), UM Institute for Advanced Computer Studies (UMIACS)

  2. Problem [figure: a set of document vectors and the matrix of pairwise similarity scores computed from them] • Applications: • “more-like-that” queries • Clustering • e.g., co-reference resolution

  3. Solutions • Trivial: for each pair of document vectors, compute the inner product directly • Loads each vector O(N) times • Better: iterate over terms rather than pairs, since a term contributes to a pair’s score only if it appears in both documents (see the Java sketch below)
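
To make the contrast concrete, here is a minimal, self-contained Java sketch (not from the original slides) of the sparse inner product: a term contributes to a document pair’s score only when it appears in both vectors. Raw term frequencies stand in for term weights, and the toy documents d1 and d3 match the postings example used later in the deck.

```java
import java.util.HashMap;
import java.util.Map;

public class SparseDotProduct {
    /**
     * Inner product of two sparse term-weight vectors.
     * Only terms present in BOTH documents contribute to the score,
     * which is what the "better" term-at-a-time formulation exploits.
     */
    static double dot(Map<String, Double> docA, Map<String, Double> docB) {
        double score = 0.0;
        for (Map.Entry<String, Double> e : docA.entrySet()) {
            Double wB = docB.get(e.getKey());
            if (wB != null) {                 // term appears in both documents
                score += e.getValue() * wB;   // add its contribution
            }
        }
        return score;
    }

    public static void main(String[] args) {
        Map<String, Double> d1 = new HashMap<>();
        d1.put("A", 2.0); d1.put("B", 1.0); d1.put("C", 1.0);
        Map<String, Double> d3 = new HashMap<>();
        d3.put("A", 1.0); d3.put("B", 2.0); d3.put("E", 1.0);
        System.out.println(dot(d1, d3));      // 2*1 + 1*2 = 4.0
    }
}
```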

  4. Algorithm • Loads each postings list only once • But the full similarity matrix must fit in memory • Works for small collections • Otherwise: disk-access optimization (an in-memory sketch follows)
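
The term-at-a-time variant can be sketched in memory as follows; the Posting record and the "docA|docB" key encoding are illustrative choices, not Ivory’s data structures. Each postings list is read exactly once, but the accumulator holding all pairwise scores is the part that must fit in memory.

```java
import java.util.*;

public class TermAtATime {
    /** Postings entry: document id and term weight (here, raw term frequency). */
    record Posting(String docId, double weight) {}

    /**
     * Accumulate the pairwise similarity matrix by streaming over each
     * postings list exactly once; every co-occurring pair of documents
     * receives that term's partial contribution.
     */
    static Map<String, Double> similarities(Map<String, List<Posting>> index) {
        Map<String, Double> matrix = new HashMap<>();   // key: "docA|docB"
        for (List<Posting> postings : index.values()) {
            for (int i = 0; i < postings.size(); i++) {
                for (int j = i + 1; j < postings.size(); j++) {
                    Posting a = postings.get(i), b = postings.get(j);
                    String pair = a.docId() + "|" + b.docId();
                    matrix.merge(pair, a.weight() * b.weight(), Double::sum);
                }
            }
        }
        return matrix;
    }

    public static void main(String[] args) {
        Map<String, List<Posting>> index = new HashMap<>();
        index.put("A", List.of(new Posting("d1", 2), new Posting("d3", 1)));
        index.put("B", List.of(new Posting("d1", 1), new Posting("d2", 1), new Posting("d3", 2)));
        index.put("C", List.of(new Posting("d1", 1)));
        index.put("D", List.of(new Posting("d2", 2)));
        index.put("E", List.of(new Posting("d3", 1)));
        // d1|d2=1.0, d1|d3=4.0, d2|d3=2.0 (map order may vary)
        System.out.println(similarities(index));
    }
}
```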

  5. Hadoopify: 2-Step Solution • Indexing • one MapReduce job • maps each term to its postings list • Pairwise Similarity • a second MapReduce job • emits each term’s contribution to every possible document pair • generates ½ df·(df−1) intermediate contributions per term (a worked example follows)
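
To see why the last point matters: a single term with df = 1,000 already generates ½ · 1,000 · 999 = 499,500 intermediate contributions, and a term with df = 100,000 would generate roughly 5 × 10^9 on its own. This quadratic growth in df is what motivates the df-cut discussed on slide 8.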

  6. Indexing [dataflow diagram: each mapper reads one document (d1: A A B C; d2: B D D; d3: A B B E) and emits (term, (docId, tf)) tuples such as (A,(d1,2)), (B,(d1,1)), (C,(d1,1)); the shuffle groups tuples by term; each reducer emits a complete postings list: (A,[(d1,2),(d3,1)]), (B,[(d1,1),(d2,1),(d3,2)]), (C,[(d1,1)]), (D,[(d2,2)]), (E,[(d3,1)])] (a code sketch of this job follows)
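
A minimal sketch of this indexing job, written against the current Hadoop Java API (the original Ivory code targeted Hadoop 0.16.0, so its classes differ). The input format ("docId<TAB>text" lines) and the "docId:tf" string encoding of postings are illustrative assumptions, not Ivory’s actual formats.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

/** Indexing step: map each document to (term, "docId:tf") pairs. */
class IndexingMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Assume each input line is "docId<TAB>document text".
        String[] parts = value.toString().split("\t", 2);
        String docId = parts[0];
        Map<String, Integer> tf = new HashMap<>();
        StringTokenizer tok = new StringTokenizer(parts[1]);
        while (tok.hasMoreTokens()) {
            tf.merge(tok.nextToken().toLowerCase(), 1, Integer::sum);
        }
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            context.write(new Text(e.getKey()), new Text(docId + ":" + e.getValue()));
        }
    }
}

/** Reducer concatenates the per-document entries into one postings list per term. */
class IndexingReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text term, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        StringBuilder postings = new StringBuilder();
        for (Text v : values) {
            if (postings.length() > 0) postings.append(',');
            postings.append(v.toString());            // "docId:tf"
        }
        context.write(term, new Text(postings.toString()));
    }
}
```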

  7. Pairwise Similarity [dataflow diagram: each mapper reads one postings list and emits a partial score for every pair of documents in it, e.g. (A,[(d1,2),(d3,1)]) → ((d1,d3),2) and (B,[(d1,1),(d2,1),(d3,2)]) → ((d1,d2),1), ((d1,d3),2), ((d2,d3),2); single-document postings such as (C,[(d1,1)]), (D,[(d2,2)]), (E,[(d3,1)]) emit nothing; the shuffle groups contributions by document pair and each reducer sums them: ((d1,d2),1), ((d1,d3),4), ((d2,d3),2)] (a code sketch of this job follows)
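
A matching sketch of the pairwise-similarity job, again hedged: it assumes the postings encoding produced by the indexing sketch above and uses plain string keys for document pairs, which is a simplification of Ivory’s actual data structures.

```java
import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

/** Pairwise-similarity step: one postings list in, one partial score out per document pair. */
class PairwiseMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Assume each input line is "term<TAB>docId:weight,docId:weight,..."
        // as produced by the indexing sketch above.
        String[] postings = value.toString().split("\t", 2)[1].split(",");
        for (int i = 0; i < postings.length; i++) {
            String[] a = postings[i].split(":");
            for (int j = i + 1; j < postings.length; j++) {
                String[] b = postings[j].split(":");
                // Keep the pair key in a canonical order so both halves meet in one reducer.
                String pair = a[0].compareTo(b[0]) < 0 ? a[0] + "," + b[0] : b[0] + "," + a[0];
                double contribution = Double.parseDouble(a[1]) * Double.parseDouble(b[1]);
                context.write(new Text(pair), new DoubleWritable(contribution));
            }
        }
    }
}

/** Reducer sums the per-term contributions for each document pair. */
class PairwiseReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    @Override
    protected void reduce(Text pair, Iterable<DoubleWritable> contributions, Context context)
            throws IOException, InterruptedException {
        double sum = 0.0;
        for (DoubleWritable c : contributions) {
            sum += c.get();
        }
        context.write(pair, new DoubleWritable(sum));   // e.g. (d1,d3) -> 4.0
    }
}
```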

  8. Implementation Issues • df-cut • Drop the most common terms, since the intermediate tuples are dominated by very-high-df terms • Trades efficiency vs. effectiveness (a df-cut sketch follows) • Space-saving tricks • Common doc + stripes • Blocking • Compression
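
One plausible reading of the df-cut, sketched below; the exact threshold rule is an assumption, not something stated on the slide. Terms are sorted by document frequency and only the lowest-df fraction is kept, so a "99% df-cut" drops the most frequent 1% of the vocabulary.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

/**
 * Hypothetical df-cut: sort terms by document frequency and drop the
 * most frequent fraction (the top 1% for a "99% df-cut").
 */
public class DfCut {
    static List<String> retainedTerms(Map<String, Integer> df, double cut) {
        List<String> terms = new ArrayList<>(df.keySet());
        terms.sort(Comparator.comparingInt(df::get));       // ascending df
        int keep = (int) Math.floor(terms.size() * cut);     // e.g. cut = 0.99
        return terms.subList(0, keep);                        // drop the high-df tail
    }

    public static void main(String[] args) {
        Map<String, Integer> df = Map.of("the", 906000, "mapreduce", 1200, "ivory", 40);
        System.out.println(retainedTerms(df, 0.99));          // tiny example: keeps 2 of 3 terms
    }
}
```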

  9. Experimental Setup • Hadoop 0.16.0 • Cluster of 19 nodes (each with two processors) • AQUAINT-2 collection (906K documents) • Okapi BM25 term weighting (a BM25 sketch follows) • Experiments run on subsets of the collection
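
For reference, one common formulation of the Okapi BM25 term weight. The exact variant and parameter values used in these experiments are not given on the slide, so k1 = 1.2 and b = 0.75 below are conventional defaults, not reported settings.

```java
/** Standard Okapi BM25 term weight (one common formulation). */
public class Bm25 {
    static final double K1 = 1.2;   // assumed default, not from the slides
    static final double B = 0.75;   // assumed default, not from the slides

    static double weight(int tf, int df, long numDocs, double docLen, double avgDocLen) {
        double idf = Math.log((numDocs - df + 0.5) / (df + 0.5));
        double norm = tf + K1 * (1 - B + B * docLen / avgDocLen);
        return idf * tf * (K1 + 1) / norm;
    }

    public static void main(String[] args) {
        // A term occurring twice in an average-length document,
        // appearing in 1,200 of the 906,000 AQUAINT-2 documents.
        System.out.println(weight(2, 1200, 906_000, 250, 250));
    }
}
```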

  10. Efficiency (running time) 99% df-cut

  11. Efficiency (disk usage)

  12. Effectiveness (recent)

  13. Conclusion • A simple and efficient MapReduce solution • ~2 hours for an ~million-document collection (on the 19-node / 38-processor cluster with a 99% df-cut) • Tricks like these pay off for I/O-bound jobs • An effective linear-time-scaling approximation • a 99.9% df-cut achieves 98% relative accuracy • the df-cut controls the efficiency vs. effectiveness tradeoff

  14. Future work • Bigger collections! • Further investigation of the df-cut and other techniques • Analytical model • Compression techniques (e.g., bitwise) • More effectiveness experiments • Joint resolution of personal names in email • Co-reference resolution of names and organizations • MapReduce IR research platform • Batch query processing

  15. Thank You!

  16. MapReduce Framework [diagram: input splits feed parallel map tasks; shuffling groups intermediate values by key; parallel reduce tasks write the outputs] (a toy code illustration follows)
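
A toy, single-process Java illustration of that dataflow (not Hadoop code): map emits key/value pairs, the shuffle groups values by key, and reduce aggregates each group. The word-count example uses the three toy documents from slide 6 for concreteness.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy, single-process illustration of the map -> shuffle -> reduce dataflow. */
public class MiniMapReduce {
    public static void main(String[] args) {
        List<String> inputs = List.of("A A B C", "B D D", "A B B E");

        // Map: emit (term, 1) for every token in every input record.
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String record : inputs) {
            for (String token : record.split(" ")) {
                emitted.add(Map.entry(token, 1));
            }
        }

        // Shuffle: group the emitted values by key.
        Map<String, List<Integer>> grouped = new HashMap<>();
        for (Map.Entry<String, Integer> e : emitted) {
            grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
        }

        // Reduce: sum the grouped values for each key.
        Map<String, Integer> output = new HashMap<>();
        grouped.forEach((key, values) ->
                output.put(key, values.stream().mapToInt(Integer::intValue).sum()));

        System.out.println(output);   // {A=3, B=4, C=1, D=2, E=1}
    }
}
```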
