


Pairwise Document Similarity in Large Collections with MapReduce

Tamer Elsayed, Jimmy Lin, and Douglas W. Oard

University of Maryland, College Park

Human Language Technology Center of Excellence and UMIACS CLIP Lab

ACL, June 2008

Abstract Problem

[Figure: a collection of documents and the matrix of their pairwise similarity scores]
  • Applications:
    • Clustering
    • Coreference resolution
    • “more-like-that” queries

Trivial Solution
  • Load each vector O(N) times
  • Load each term O(df_t^2) times
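
A minimal sketch of the trivial approach, assuming each document is a sparse dict of term weights. The names `docs` and `naive_pairwise`, the document ids, and the toy weights are illustrative assumptions, not taken from the talk. Every vector is read again for each partner document, which is where the O(N) loads per vector come from.

```python
from itertools import combinations

def naive_pairwise(docs):
    """Score every document pair by a dot product of sparse weight vectors."""
    scores = {}
    for (i, di), (j, dj) in combinations(sorted(docs.items()), 2):
        # Each vector is touched once per partner document: O(N) loads each.
        score = sum(w * dj[t] for t, w in di.items() if t in dj)
        if score:
            scores[(i, j)] = score
    return scores

# Hypothetical toy collection: doc id -> {term: weight}
docs = {
    1: {"clinton": 2, "obama": 1},
    2: {"clinton": 1, "cheney": 1},
    3: {"clinton": 1, "barack": 1, "obama": 1},
}
print(naive_pairwise(docs))
```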

Goal

A scalable and efficient solution for large collections

Better Solution

Each term contributes to a pair's score only if it appears in both documents.

  • Load weights for each term once
  • Each term contributes O(df_t^2) partial scores
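
The idea above can be sketched with an inverted index, assuming each term maps to a list of `(doc_id, weight)` postings. The function name `index_pairwise` and the toy postings are hypothetical. Each term's weights are read once, and a term with document frequency df yields df(df-1)/2 partial scores.

```python
from collections import defaultdict
from itertools import combinations

def index_pairwise(postings):
    """Accumulate pair scores term by term from an inverted index."""
    scores = defaultdict(float)
    for term, plist in postings.items():
        # The term's weights are read once; every two documents that
        # share the term receive one partial score.
        for (i, wi), (j, wj) in combinations(sorted(plist), 2):
            scores[(i, j)] += wi * wj
    return dict(scores)

# Hypothetical postings: term -> [(doc_id, weight), ...]
postings = {
    "clinton": [(1, 2.0), (2, 1.0), (3, 1.0)],
    "cheney": [(2, 1.0)],
    "barack": [(3, 1.0)],
    "obama": [(1, 1.0), (3, 1.0)],
}
print(index_pairwise(postings))
```

Terms that occur in a single document, such as "cheney" here, generate no pairs at all.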

MapReduce Framework

[Figure: MapReduce data flow. (a) Map: each input (k1, v1) produces a list of [k2, v2] pairs. (b) Shuffle: values are grouped by key into (k2, [v2]). (c) Reduce: each group produces output [(k3, v3)].]

The framework handles low-level details transparently.
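
To make the three phases concrete, here is a single-process sketch of one MapReduce round. It is a teaching aid rather than how Hadoop is implemented; `run_mapreduce` and the word-count mapper and reducer are hypothetical names chosen for illustration.

```python
from collections import defaultdict

def run_mapreduce(inputs, mapper, reducer):
    """Run one map/shuffle/reduce round in a single process."""
    # (a) Map: each (k1, v1) yields zero or more (k2, v2) pairs.
    intermediate = []
    for k1, v1 in inputs:
        intermediate.extend(mapper(k1, v1))
    # (b) Shuffle: group values by key, giving (k2, [v2]).
    groups = defaultdict(list)
    for k2, v2 in intermediate:
        groups[k2].append(v2)
    # (c) Reduce: each group yields zero or more (k3, v3) pairs.
    output = []
    for k2 in sorted(groups):
        output.extend(reducer(k2, groups[k2]))
    return output

def count_mapper(doc_id, text):
    return [(word, 1) for word in text.split()]

def count_reducer(word, counts):
    return [(word, sum(counts))]

result = run_mapreduce([(1, "a b a"), (2, "b c")], count_mapper, count_reducer)
print(result)
```

A real framework distributes the map and reduce calls across machines and performs the shuffle over the network; only the mapper and reducer are written by the user.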

Decomposition

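Each term contributes to a pair's score only if it appears in both documents.

  • Load weights for each term once
  • Each term contributes O(df_t^2) partial scores

In MapReduce terms, map generates each term's partial scores and reduce sums them per document pair.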
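
The formula shown on this slide did not survive in the transcript. A reconstruction, assuming the usual inner-product similarity over term weights (the symbols $w_{t,d}$ for the weight of term $t$ in document $d$ and $V$ for the vocabulary are notational assumptions):

```latex
\mathrm{sim}(d_i, d_j)
  = \sum_{t \in V} w_{t,d_i} \, w_{t,d_j}
  = \sum_{t \in d_i \cap d_j} w_{t,d_i} \, w_{t,d_j}
```

The second equality holds because a term absent from either document has weight zero there. The product inside the sum can be emitted per term by a map task, and the sum over terms can be computed per document pair by a reduce task.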

Standard Indexing

[Figure: standard indexing as a MapReduce job. (a) Map: tokenize each doc. (b) Shuffle: group values by terms. (c) Reduce: combine each group into a posting list.]
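
A minimal sketch of the indexing job, run in a single process. The function names, the lowercasing whitespace tokenizer, the use of raw term counts as weights, and the document ids are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

def index_map(doc_id, text):
    """Map: tokenize one document and emit (term, (doc_id, tf))."""
    for term, tf in Counter(text.lower().split()).items():
        yield term, (doc_id, tf)

def index_reduce(term, postings):
    """Reduce: combine one term's postings into a sorted posting list."""
    return term, sorted(postings)

def build_index(docs):
    groups = defaultdict(list)  # shuffle: group values by term
    for doc_id, text in docs.items():
        for term, posting in index_map(doc_id, text):
            groups[term].append(posting)
    return dict(index_reduce(t, p) for t, p in groups.items())

# Hypothetical three-document collection; the ids are assumed.
docs = {1: "Clinton Obama Clinton", 2: "Clinton Cheney", 3: "Clinton Barack Obama"}
index = build_index(docs)
print(index)
```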

Indexing (3-doc toy collection)

[Figure: indexing a toy collection of three documents ("Clinton Obama Clinton", "Clinton Cheney", "Clinton Barack Obama") into posting lists of term counts for Clinton, Cheney, Barack, and Obama.]


Pairwise Similarity

[Figure: pairwise similarity on the toy collection, computed from the posting lists for Clinton, Cheney, Barack, and Obama in three steps: (a) generate pairs, (b) group pairs, (c) sum pairs.]

Pairwise Similarity (abstract)

[Figure: pairwise similarity as a MapReduce job. (a) Generate pairs: multiply within each term's postings. (b) Group pairs: shuffle, grouping values by pairs. (c) Sum pairs: sum each group to obtain the similarity.]
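
The three steps can be sketched as a mapper over posting lists and a reducer over document pairs. This is a single-process illustration under assumed names (`pairs_map`, `pairs_reduce`, `pairwise_similarity`) and an assumed toy index, not the authors' code.

```python
from collections import defaultdict
from itertools import combinations

def pairs_map(term, postings):
    """(a) Generate pairs: multiply weights of every two docs sharing the term."""
    for (i, wi), (j, wj) in combinations(sorted(postings), 2):
        yield (i, j), wi * wj

def pairs_reduce(pair, partials):
    """(c) Sum pairs: add up the partial scores of one document pair."""
    return pair, sum(partials)

def pairwise_similarity(index):
    groups = defaultdict(list)  # (b) Group pairs: shuffle by pair key
    for term, postings in index.items():
        for pair, partial in pairs_map(term, postings):
            groups[pair].append(partial)
    return dict(pairs_reduce(p, v) for p, v in groups.items())

# Hypothetical index: term -> [(doc_id, weight), ...]
index = {
    "clinton": [(1, 2), (2, 1), (3, 1)],
    "cheney": [(2, 1)],
    "barack": [(3, 1)],
    "obama": [(1, 1), (3, 1)],
}
sims = pairwise_similarity(index)
print(sims)
```

Sorting each posting list before pairing keeps the pair key in a canonical order, so (1, 3) and (3, 1) never end up in different groups.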

Experimental Setup
  • Hadoop 0.16.0
    • Open source MapReduce implementation
  • Cluster of 19 machines
    • Each with two single-core processors
  • Aquaint-2 collection
    • 906K documents
  • Okapi BM25
  • Subsets of collection

Efficiency (disk space)

Aquaint-2 collection, ~906K docs

8 trillion intermediate pairs

Hadoop, 19 PCs, each with 2 single-core processors, 4GB memory, 100GB disk

Terms: Zipfian Distribution

Each term t contributes O(df_t^2) partial results, so very few terms dominate the computation:

  • most frequent term (“said”) → 3%
  • most frequent 10 terms → 15%
  • most frequent 100 terms → 57%
  • most frequent 1000 terms → 95%

~0.1% of total terms (99.9% df-cut)

[Figure: doc freq (df) plotted against term rank]
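
A small sketch of why a df-cut helps, assuming a term with document frequency df generates df(df-1)/2 document pairs. The helper names and the Zipf-like document frequencies are hypothetical and do not come from the Aquaint-2 experiments.

```python
def pair_count(df):
    """Number of document pairs generated by a term with document frequency df."""
    return df * (df - 1) // 2

def df_cut(doc_freqs, drop_fraction):
    """Drop the given fraction of terms, most frequent first."""
    ranked = sorted(doc_freqs, key=doc_freqs.get, reverse=True)
    n_drop = round(len(ranked) * drop_fraction)
    return {t: doc_freqs[t] for t in ranked[n_drop:]}

# Hypothetical document frequencies, skewed in a Zipf-like way.
doc_freqs = {f"t{r}": 1000 // r for r in range(1, 11)}

total = sum(pair_count(df) for df in doc_freqs.values())
kept = df_cut(doc_freqs, 0.1)
after = sum(pair_count(df) for df in kept.values())
print(f"pairs before cut: {total}, after dropping the top 10% of terms: {after}")
```

Because the pair count grows quadratically in df, removing only the single most frequent term in this toy vocabulary removes well over half of the pairs.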

Efficiency (disk space)

Aquaint-2 collection, ~906K docs

8 trillion intermediate pairs

0.5 trillion intermediate pairs

Hadoop, 19 PCs, each with 2 single-core processors, 4GB memory, 100GB disk

Effectiveness (recent work)

  • Drop 0.1% of terms
  • “Near-linear” growth
  • Fit on disk
  • Cost 2% in effectiveness

Hadoop, 19 PCs, each with 2 single-core processors, 4GB memory, 100GB disk


Ivory

  • Open source implementation
  • Java 1.5, Hadoop 0.16.0
  • Available soon …

Conclusion
  • Simple and efficient MapReduce solution
    • Many HLT problems can also be “hadoopified”
      • E.g., Statistical MT (see paper in StatMT workshop)
  • Shuffling is critical
    • df-cut controls efficiency vs. effectiveness tradeoff
    • 99.9% df-cut achieves 98% relative accuracy

Future Work
  • Apply to larger collections!
  • Develop analytical model
  • Measure effectiveness for different applications

Thank You!

Algorithm
  • Matrix must fit in memory
    • Works for small collections
  • Otherwise: disk access optimization
