Map-Reduce for large scale similarity computation
Lecture 2

…from last lecture

How to convert entities into high-dimensional numerical vectors

How to compute similarity between two vectors.

For example, if X and Y are two vectors, then sim(X, Y) = (X · Y) / (||X|| ||Y||).

…from last lecture

Example:

X = (1, 2, 3); Y = (3, 2, 1)

||X|| = (1 + 4 + 9)^0.5 = 14^0.5 = 3.74

||Y|| = ||X||

Sim(X, Y) = (1·3 + 2·2 + 3·1) / (3.74^2) = 10/14 = 5/7

We also learnt that for large data sets computing pair-wise similarity can be very time consuming.
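The cosine similarity used above can be checked in a few lines of Python; this is an illustrative sketch (the helper name is mine, not from the slides):

```python
import math

def cosine_similarity(x, y):
    """Cosine similarity: dot(x, y) / (||x|| * ||y||)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

# The worked example from the slide: 10/14 = 5/7 ≈ 0.714
print(cosine_similarity((1, 2, 3), (3, 2, 1)))
```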

Map-Reduce

Map-Reduce has become a popular framework for speeding up computations like pair-wise similarity.

Map-Reduce was popularized by Google and then Yahoo! (through the Hadoop open-source implementation).

Map-Reduce is a programming model built on top of “cluster computing”.

Cluster Computing

Put simple (commodity) machines together, each with its own CPU, RAM and disk, for parallel computing.

[Figure: racks of commodity machines connected by a switch]

Map-Reduce
  • Map-Reduce consists of two distinct entities
    • Distributed File System (DFS)
    • Library to implement Mapper and Reducer functions
  • A DFS seamlessly manages files on the “cluster computer.”
    • A file is broken into “chunks” and these chunks are replicated across the nodes of a cluster.
    • If a node which contains chunk A fails, the system will re-start the computation on a node which contains a copy of the chunk.
Distributed File System

A DFS will “chunk” files, replicate the chunks across several nodes, and keep track of them.

Only practical when data is mostly read-only (e.g., historical data; not live data such as an airline reservation system).

[Figure: a file split into chunks; one chunk is replicated on nodes 2, 6 and 7, another on nodes 3, 2 and 18]

Node failure
  • When several nodes are in play, the chance that a single node goes down at any time goes up significantly.
  • Suppose there are n nodes and let p be the probability that a single node will fail. Then:
    • (1-p) is the probability that a single node will not fail
    • (1-p)^n is the probability that none of the nodes will fail
    • 1 - (1-p)^n is the probability that at least one will fail
Node failure

The probability that at least one node fails is:

f = 1 - (1-p)^n

When n = 1, f = p.

Suppose p = 0.0001 but n = 10000; then:

f = 1 - (1 - 0.0001)^10000 ≈ 0.63 [why/how? For small p, (1-p)^n ≈ e^(-np), and here np = 1, so f ≈ 1 - e^(-1)]

This is one of the most important formulas to know (in general).
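The formula is easy to check numerically; a short Python sketch (the helper name is illustrative, not from the slides):

```python
import math

def prob_any_failure(p, n):
    # 1 - (1-p)^n: probability that at least one of n nodes fails,
    # assuming independent failures with per-node probability p
    return 1 - (1 - p) ** n

f = prob_any_failure(0.0001, 10000)
print(round(f, 3))                  # 0.632

# For small p, (1-p)^n ≈ e^(-np); here np = 1, so f ≈ 1 - 1/e
print(round(1 - math.exp(-1), 3))   # 0.632
```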

Example: “Hello World” of MR

Task: Produce an output which, for each word in the file, counts the number of times it appears in the file.

Answer: (Java, 3); (Silent, 2); (mind, 3); …

Example
  • For example, suppose the documents are chunked across three machines:
    • {doc1, doc2} → machine 1
    • {doc3, doc4} → machine 2
    • {doc5, doc6} → machine 3
  • Each chunk is also duplicated to other machines.
Example
  • Now apply the MAP operation at each node: for every word, emit the pair (word, 1).
  • Thus doc1 emits:
    • (silent,1); (mind,1); (holy,1); (mind,1)
  • Similarly doc6 emits:
    • (silent,1); (road,1); (to,1); (Cairns,1)
Example

Note that in the first chunk, which contains (doc1, doc2), each doc emits (key, value) pairs.

We can think of each compute node as emitting a list of (key, value) pairs.

Now this list is “grouped” so that the REDUCE function can be applied.

Example
  • Note now that the (key, value) pairs have no connection with the docs:
    • (silent,1), (mind,1), (holy,1), (mind,1), (road,1), (to,1), (Cairns,1); (Java,1), (programming,1), (is,1), (fun,1), …
  • Now we have a hash function h: {a..z} → {0, 1}
    • Basically two REDUCE nodes
    • And (key, value) pairs effectively become (key, list) pairs
Example
  • For example, suppose the hash function maps {to, Java, road} to one node. Then
    • (to,1) remains (to,1)
    • (Java,1); (Java,1); (Java,1) → (Java, [1,1,1])
    • (road,1); (road,1) → (road, [1,1])
  • Now the REDUCE function converts
    • (Java, [1,1,1]) → (Java, 3) etc.
  • Remember this is a very simple example…the challenge is to take complex tasks and express them as Map and Reduce!
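The whole word-count pipeline above can be simulated on a single machine; a minimal Python sketch (function names are illustrative, not from the slides):

```python
from collections import defaultdict

def map_phase(doc):
    # MAP: emit (word, 1) for every word in the document
    return [(word, 1) for word in doc.split()]

def group_by_key(pairs):
    # shuffle/group: (key, value) pairs -> (key, [values])
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    # REDUCE: (word, [1,1,1]) -> (word, 3)
    return key, sum(values)

docs = ["silent mind holy mind", "silent road to Cairns"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in group_by_key(pairs).items())
print(counts["mind"])    # 2
print(counts["silent"])  # 2
```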
Schema of Map-Reduce Tasks [MMDS]

[Figure: input chunks feed Map Tasks, which emit (key, value) pairs; a Group-By-Keys stage collects them into [k, (v, u, w, x, z)] lists; Reduce Tasks then emit (k, v) pairs to the output]

The similarity join problem

Last time we discussed computing the pair-wise similarity of all articles/documents in Wikipedia.

As we discussed, this is time consuming because if N is the number of documents and d is the length of each vector, then the running time is proportional to O(N^2 d).
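To make the quadratic cost concrete, here is a brute-force sketch in plain Python (an illustrative helper, not from the slides); the nested loop does N(N-1)/2 dot products of d steps each, hence O(N^2 d):

```python
def all_pairs_similarity(vectors):
    # Brute force: compare every pair of vectors -> O(N^2 * d)
    sims = {}
    n = len(vectors)
    for i in range(n):
        for j in range(i + 1, n):  # N*(N-1)/2 pairs
            # one dot product costs d multiply-adds
            dot = sum(a * b for a, b in zip(vectors[i], vectors[j]))
            sims[(i, j)] = dot
    return sims

sims = all_pairs_similarity([[1, 2, 3], [3, 2, 1], [1, 0, 0]])
print(sims)  # {(0, 1): 10, (0, 2): 1, (1, 2): 3}
```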

How can this problem be attacked using the Map-Reduce framework?

Similarity Join
  • Assume we are given two documents (vectors) d1 and d2. Then (ignoring the denominator) sim(d1, d2) is just the dot product of their term-count vectors.
  • Example:
    • d1 = {silent mind to holy mind}; d2 = {silent road to cairns}
    • sim(d1, d2) = 1(silent,d1)·1(silent,d2) + 1(to,d1)·1(to,d2) = 2
  • Exploit the fact that a term (word) only contributes if it belongs to at least two documents.
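The term-based trick can be sketched as a single-machine simulation of the Map-Reduce similarity join (illustrative Python, not from the slides): MAP emits (term, doc) pairs, grouping by term builds postings lists, and REDUCE lets each shared term contribute to the similarity of every pair of documents containing it.

```python
from collections import defaultdict
from itertools import combinations

docs = {
    "d1": "silent mind to holy mind".split(),
    "d2": "silent road to cairns".split(),
}

# MAP: for each document, emit (term, doc_id) once per term occurrence
pairs = [(term, doc_id) for doc_id, terms in docs.items() for term in terms]

# group by term: build a postings list per term (with multiplicity)
postings = defaultdict(list)
for term, doc_id in pairs:
    postings[term].append(doc_id)

# REDUCE: a term contributes to sim(a, b) only if it appears in both
# documents; its contribution is the product of its counts in each
sim = defaultdict(int)
for term, doc_ids in postings.items():
    for a, b in combinations(sorted(set(doc_ids)), 2):
        sim[(a, b)] += doc_ids.count(a) * doc_ids.count(b)

print(sim[("d1", "d2")])  # "silent" and "to" each contribute 1 -> 2
```

Terms that occur in only one document (like "holy" or "cairns") generate no pair at all, which is exactly the savings the slide points out.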
Similarity Example [2]

Notice, it requires some ingenuity to come up with key-value pairs. This is key to using map-reduce effectively.

Amazon Map Reduce
  • For this class we have received an educational grant from Amazon to run exercises on their Map Reduce servers.
  • Terminology
    • EC2 – the name of Amazon’s compute cluster
    • S3 – the name of their storage machines
    • Elastic Map Reduce – the name of Amazon’s hosted Hadoop implementation of Map-Reduce
  • Let’s watch this video.
References

Mining of Massive Datasets (Rajaraman, Leskovec, Ullman)

Pairwise Document Similarity in Large Collections with MapReduce (Elsayed, Lin, Oard)