Efficient Parallel kNN Joins for Large Data in MapReduce

Presentation Transcript


  1. Efficient Parallel kNN Joins for Large Data in MapReduce

  2. Research problem • Run keyword-search queries over a graph • Return the top-k minimal subgraphs that contain all the keywords in the query • Building on the partition-duplication idea, try to generate a better partitioning so that MapReduce can produce good-enough results in a single pass

  3. Efficient Parallel kNN Joins for Large Data in MapReduce • Summary: • Baseline method: block nested loop kNN join with Hadoop MapReduce • In the map phase, partition R and S each into n equal-sized blocks (|R|/n and |S|/n records per block) and pair every R block with every S block, giving n^2 buckets • In the reduce phase, run a block nested loop kNN join between the local R and local S blocks of each bucket • A second MapReduce job then computes each record's global kNN from the n local kNN lists produced in the first phase, a total of nk candidates (a sketch of the two rounds follows)
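
As a rough illustration of the two rounds (not the paper's actual Hadoop code), here is a self-contained Python sketch that simulates the baseline in memory; the function name baseline_knn_join and the round-robin block split are illustrative assumptions:

```python
import heapq
import itertools
import math

def dist(p, q):
    """Euclidean distance between two points (tuples of numbers)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def baseline_knn_join(R, S, n, k):
    """Two-round block nested loop kNN join, simulated in memory.

    Round 1 pairs every R block with every S block (n * n buckets) and
    computes local kNNs; round 2 merges the n local candidate lists of
    each record (n * k candidates) into its global kNN.
    """
    # "Map" phase: split R and S into n roughly equal-sized blocks.
    R_blocks = [list(enumerate(R))[i::n] for i in range(n)]
    S_blocks = [S[i::n] for i in range(n)]

    # Round-1 "reduce": block nested loop kNN join inside each bucket.
    candidates = {rid: [] for rid in range(len(R))}
    for R_blk, S_blk in itertools.product(R_blocks, S_blocks):
        for rid, r in R_blk:
            candidates[rid].extend(heapq.nsmallest(k, S_blk, key=lambda s: dist(r, s)))

    # Round 2: global kNN of each record from its n*k local candidates.
    return {rid: heapq.nsmallest(k, cands, key=lambda s: dist(R[rid], s))
            for rid, cands in candidates.items()}

# Tiny usage example with 2-D points.
R = [(0.0, 0.0), (5.0, 5.0)]
S = [(1.0, 0.0), (0.0, 2.0), (4.0, 5.0), (9.0, 9.0)]
print(baseline_knn_join(R, S, n=2, k=2))
```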

  4. Z-Value Based Partition Join • Motivations: • The baseline method creates excessive communication (n^2 buckets) • The new method seeks an alternative with linear communication and computation costs • It uses a space-filling curve (the z-order curve)

  5. Z-order curve • In mathematical analysis, a space-filling curve is a curve whose range contains the entire 2-dimensional unit square (or, more generally, an n-dimensional hypercube) [from Wikipedia] • In mathematical analysis and computer science, Z-order, Morton order, or Morton code is a function that maps multidimensional data to one dimension while preserving the locality of the data points
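
To make the z-order concrete, here is a minimal bit-interleaving sketch; the helper name z_value and the 16-bit coordinate width are illustrative choices, not fixed by the paper:

```python
def z_value(coords, bits=16):
    """Morton code of a point with non-negative integer coordinates:
    interleave the bits of the coordinates, so that points close in
    space tend to be close in the one-dimensional z-order."""
    dims = len(coords)
    z = 0
    for bit in range(bits):                 # least significant bit first
        for d, c in enumerate(coords):
            z |= ((c >> bit) & 1) << (bit * dims + d)
    return z

# Sorting 2-D grid points by z-value traces the familiar "Z" pattern:
# (0,0), (1,0), (0,1), (1,1), (2,0), ...
points = [(x, y) for x in range(4) for y in range(4)]
print(sorted(points, key=z_value))
```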

  6. zkNN Algorithm • The zkNN algorithm runs on two datasets, here called R and S • Pick a small integer α and loop α times, each time shifting both datasets by a random vector • For each entry in R, use the shifted z-order to find a candidate subset of S from which the k nearest neighbors are drawn • The final candidate set is the union of all the candidate subsets (see the sketch below)
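
A minimal in-memory sketch of this idea, reusing the z_value helper above; the zero first shift and the shift range max_coord are illustrative assumptions (the code assumes integer coordinates below 2^15 so that shifted values still fit in 16 bits):

```python
import bisect
import heapq
import math
import random

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def zknn_join(R, S, k, alpha=2, max_coord=1 << 15):
    """Approximate kNN join with alpha randomly shifted z-orders: for
    each r, take the k points on either side of r in every shifted
    z-order of S, union these candidate subsets, and return the exact
    kNN of the union."""
    dims = len(S[0])
    shifts = [(0,) * dims] + [tuple(random.randrange(max_coord) for _ in range(dims))
                              for _ in range(alpha - 1)]
    orders = []                             # one sorted z-order of S per shift
    for v in shifts:
        keyed = sorted((z_value(tuple(a + b for a, b in zip(s, v))), s) for s in S)
        orders.append(([z for z, _ in keyed], [s for _, s in keyed]))

    results = []
    for r in R:
        cand = set()
        for v, (zvals, ordered) in zip(shifts, orders):
            pos = bisect.bisect_left(zvals, z_value(tuple(a + b for a, b in zip(r, v))))
            cand.update(ordered[max(0, pos - k): pos + k])   # k on each side
        results.append(heapq.nsmallest(k, cand, key=lambda s: dist(r, s)))
    return results
```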

  7. zkNN Algorithm based on MapReduce • Partition • All partitioning is based on z-values: datasets R and S are each mapped onto a linear order containing all their entries • In each iteration, under the same z-value function, two points with similar z-values are considered near each other • To find the nearest neighbors of an entry in R, look up the corresponding block of S • To guarantee enough entries (at least k) from S in each block, duplication is needed
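
In the map phase this amounts to routing every point, from R or from S, to the block whose z-value range contains it, using one shared list of boundary z-values so that matching R and S blocks meet in the same reducer. A minimal sketch with an illustrative function name:

```python
import bisect

def assign_partition(point, shift, boundaries):
    """Map-side routing: a point's partition id is where its (shifted)
    z-value falls among the shared, sorted boundary z-values, so R and S
    points with similar z-values land in the same bucket."""
    z = z_value(tuple(a + b for a, b in zip(point, shift)))
    return bisect.bisect_right(boundaries, z)
```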

  8. Partition Duplicate • To cover all possible nearest neighbors, duplicate the nearest k points from the preceding and succeeding blocks where necessary • First challenge: balanced partitioning • Partition the outer dataset R into balanced parts • Draw a sample of R, keeping each record with probability p = 1/(ε^2 N) for some ε in (0, 1); for a sampled record x with rank s(x) in the sample, estimate its rank in R as r(x) = s(x)/p • Bound the deviation of the estimated rank r(x), computed from the sample, against the true rank of x in R
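
A sketch of this sampling step under the slide's parameters; the guard for an empty sample is my own addition, not from the paper:

```python
import random

def estimate_split_points(z_values_of_R, n, eps):
    """Estimate n - 1 balanced split points of R's z-values from a sample.

    Each value is kept with probability p = 1/(eps^2 * N), so the expected
    sample size is 1/eps^2.  A sampled value with rank s(x) in the sample
    has estimated rank r(x) = s(x)/p in R; for each target rank i*N/n,
    pick the sample value whose estimated rank is closest."""
    N = len(z_values_of_R)
    p = min(1.0, 1.0 / (eps * eps * N))
    sample = sorted(z for z in z_values_of_R if random.random() < p)
    if not sample:                           # guard for very small inputs
        sample = sorted(random.sample(z_values_of_R, min(N, n)))
    splits = []
    for i in range(1, n):
        target = i * N / n                   # ideal rank of the i-th split
        j = min(range(len(sample)), key=lambda j: abs((j + 1) / p - target))
        splits.append(sample[j])
    return splits
```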

  9. Proof for partitioning R • Bounds the deviation of the estimated rank of x from the true rank of x • See Theorem 2 and Lemma 2 in the paper

  10. Partition Duplicate Continued • For dataset S, the partition boundaries are originally the same as those of R, but as discussed before, each block of S must additionally contain the nearest k points from the preceding block and from the succeeding block • As with R, draw a sample of S with probability p; the node at rank kp (an upper bound) in the sample of S is taken as an estimate of the k-th node of the real S
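
A sketch of building the S-side blocks with the duplicated boundary points; the input is S already sorted by (shifted) z-value together with its z-values, and the function name is illustrative:

```python
import bisect

def build_s_blocks(S_sorted, s_zvals, boundaries, k):
    """Cut S (sorted by z-value) at the shared boundary z-values, then
    widen each block by k entries on both sides; those copies are the
    nearest k points in z-order from the preceding and succeeding blocks,
    so every point in the matching R block finds at least k candidates on
    either side without leaving its bucket."""
    cuts = [0] + [bisect.bisect_right(s_zvals, b) for b in boundaries] + [len(S_sorted)]
    return [S_sorted[max(0, lo - k): hi + k] for lo, hi in zip(cuts, cuts[1:])]
```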

  11. Proof for partitioning S • See Theorem 3 in the paper

  12. Approximation quality • For each of the selected records, calculate its distance to the approximate k-th NN and its distance to the exact k-th NN; the ratio of the two distances is one measure of approximation quality • Other measures are recall and precision • Confidence intervals are given for the randomly selected records • One more detail: all experiments were conducted on a cluster with 16 slave nodes
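
The per-record measures fit in a few lines; a sketch assuming both answer lists are sorted by distance, so the k-th NN is the last element:

```python
import math

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def kth_nn_ratio(r, approx_knn, exact_knn):
    """Approximation quality of one record: distance to the approximate
    k-th NN divided by distance to the exact k-th NN (1.0 is perfect)."""
    return dist(r, approx_knn[-1]) / dist(r, exact_knn[-1])

def recall_precision(approx_knn, exact_knn):
    """Fraction of exact neighbors recovered, and fraction of reported
    neighbors that are exact; the two coincide when both lists have
    length k."""
    hits = len(set(approx_knn) & set(exact_knn))
    return hits / len(exact_knn), hits / len(approx_knn)
```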

  13. Effect of ε and α • Run with different ε values and compare running time and standard deviation (Figure 9 in the paper) • Compare the approximation ratio and recall (precision) for different α values (Figure 15 in the paper)

  14. Thank you!
