Map-Reduce for Large-Scale Similarity Computation. Lecture 2.

…from last lecture:
How to convert entities into high-dimensional numerical vectors
How to compute similarity between two vectors.
For example, if X and Y are two vectors, then:
X = (1, 2, 3); Y = (3, 2, 1)
||X|| = sqrt(1 + 4 + 9) = 14^0.5 ≈ 3.74
||Y|| = ||X||
Sim(X, Y) = (1·3 + 2·2 + 3·1) / (3.74 × 3.74) = 10/14 = 5/7 ≈ 0.714
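A minimal Python sketch of this cosine-similarity computation (the function name cosine_sim is my own, not from the lecture):

import math

def cosine_sim(x, y):
    # Cosine similarity: dot(x, y) / (||x|| * ||y||)
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    return dot / (norm_x * norm_y)

print(cosine_sim((1, 2, 3), (3, 2, 1)))  # 10/14 = 5/7 ≈ 0.714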
We also learned that, for large data sets, computing pair-wise similarity can be very time-consuming.
Map-Reduce has become a popular framework for speeding up computations like pair-wise similarity
Map-Reduce was popularized by Google and then Yahoo! (through the Hadoop open-source implementation)
Map-Reduce is a programming model built on top of “cluster computing”
Put simple (commodity) machines together, each with their own CPU, RAM and DISK, for parallel computing
A distributed file system (DFS) will “chunk” files, replicate the chunks across several nodes, and keep track of where the chunks are.
This is only practical when data is mostly read-only (e.g., historical data; not for live transactional data, like an airline reservation system).
Why replicate? Because nodes fail. If each node fails independently with probability p, the probability that at least one of n nodes fails is:
f = 1 − (1 − p)^n
When n = 1, f = p.
Suppose p = 0.0001 but n = 10,000; then:
f = 1 − (1 − 0.0001)^10000 ≈ 0.63 [why/how? For small p, (1 − p)^n ≈ e^(−pn) = e^(−1) ≈ 0.37, so f ≈ 1 − 0.37 = 0.63]
This is one of the most important formulas to know (in general).
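As a quick sanity check, here is a short Python snippet that evaluates both the exact formula and the e^(−pn) approximation (the values p = 0.0001 and n = 10,000 come from the example above):

import math

p = 0.0001   # probability that a single node fails
n = 10_000   # number of nodes in the cluster

f_exact = 1 - (1 - p) ** n          # exact: at least one node fails
f_approx = 1 - math.exp(-p * n)     # approximation for small p

print(f_exact, f_approx)  # both ≈ 0.632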
Task: Produce an output which, for each word in the file, counts the number of times it appears in the file.
Answer: (Java, 3); (Silent, 2); (mind, 3); …
Note that in the first chunk, which contains (doc1, doc2), each doc emits (key, value) pairs.
We can think of each compute node as emitting a list of (key, value) pairs.
Now this list is “grouped” so that the REDUCE function can be applied.
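To make the MAP / GROUP / REDUCE phases concrete, here is a minimal single-machine Python simulation of word count (the document contents below are invented so that the totals match the answer above; a real Hadoop job would express the same MAP and REDUCE logic in Java):

from collections import defaultdict

def map_phase(doc_id, text):
    # MAP: emit a (key, value) pair (word, 1) for every word in the document.
    for word in text.split():
        yield (word, 1)

def group_phase(pairs):
    # GROUP (shuffle): collect all values emitted for the same key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(word, counts):
    # REDUCE: sum the counts emitted for one word.
    return (word, sum(counts))

docs = {"doc1": "Java Silent mind Java", "doc2": "mind Java Silent mind"}

emitted = [pair for doc_id, text in docs.items() for pair in map_phase(doc_id, text)]
grouped = group_phase(emitted)
print([reduce_phase(word, counts) for word, counts in grouped.items()])
# [('Java', 3), ('Silent', 2), ('mind', 3)]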
Last time we discussed computing the pair-wise similarity of all articles/documents in Wikipedia.
As we discussed, this is a time-consuming problem: if N is the number of documents and d is the length of each vector, then the running time is proportional to O(N²d). (For example, with N = 10^6 documents there are already about N²/2 = 5 × 10^11 pairs to score.)
How can this problem be attacked using the Map-Reduce framework?
Notice that it requires some ingenuity to come up with the right key-value pairs; this is key to using map-reduce effectively. One possible design is sketched below.
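Here is a minimal single-machine Python sketch of one such design, following the inverted-index idea from the Elsayed, Lin, and Oard paper cited below (the corpus and weights are invented for illustration): the first job maps each document to (term, (doc_id, weight)) pairs and groups them into postings lists; the second job emits a partial product for every pair of documents sharing a term, and the reducer sums these into dot products.

from collections import defaultdict
from itertools import combinations

# Toy corpus: doc_id -> {term: weight} (weights are illustrative).
docs = {
    "d1": {"java": 2, "mind": 1},
    "d2": {"java": 1, "silent": 1},
    "d3": {"mind": 2, "silent": 1},
}

# Job 1 (indexing): MAP emits (term, (doc_id, weight));
# GROUP/REDUCE builds a postings list per term.
postings = defaultdict(list)
for doc_id, vector in docs.items():
    for term, weight in vector.items():
        postings[term].append((doc_id, weight))

# Job 2 (pairwise similarity): MAP emits a partial product for every
# pair of documents sharing a term; REDUCE sums them into dot products.
dot_products = defaultdict(float)
for term, plist in postings.items():
    for (di, wi), (dj, wj) in combinations(plist, 2):
        dot_products[tuple(sorted((di, dj)))] += wi * wj

print(dict(dot_products))
# {('d1', 'd2'): 2.0, ('d1', 'd3'): 2.0, ('d2', 'd3'): 1.0}

The key observation is that only document pairs sharing at least one term generate any work, which for sparse vectors is far cheaper than scoring all O(N²) pairs directly.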
Mining of Massive Datasets (Rajaraman, Leskovec, Ullman)
Pairwise Document Similarity in Large Collections with MapReduce (Elsayed, Lin, Oard)