Efficient Parallel kNN Joins for Large Data in MapReduce

Presentation Transcript


  1. Efficient Parallel kNN Joins for Large Data in MapReduce

  2. Research problem • Run keyword-search queries over a graph • Return the top-k minimal subgraphs that contain all the keywords in the query • Building on the partition-duplication idea, try to generate a better partitioning so that MapReduce can produce good-enough results in a single pass

  3. Efficient Parallel kNN Joins for Large Data in MapReduce • Summary: • Baseline method: block nested loop kNN join with Hadoop MapReduce • In the map phase, partition R and S each into n equal-sized blocks (|R|/n and |S|/n records per block) and pair every R block with every S block, giving n^2 buckets • In the reduce phase, run a block nested loop kNN join between the local R and local S blocks of each bucket • A second MapReduce job then computes each record's global kNN from the n local kNN lists produced in the first phase, a total of nk candidates (a sketch of the two rounds follows)
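
As a rough illustration of the two rounds (not the paper's actual Hadoop code), here is a self-contained Python sketch that simulates the baseline in memory; the function name baseline_knn_join and the round-robin block split are illustrative assumptions:

```python
import heapq
import itertools
import math

def dist(p, q):
    """Euclidean distance between two points (tuples of numbers)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def baseline_knn_join(R, S, n, k):
    """Two-round block nested loop kNN join, simulated in memory.

    Round 1 pairs every R block with every S block (n * n buckets) and
    computes local kNNs; round 2 merges the n local candidate lists of
    each record (n * k candidates) into its global kNN.
    """
    # "Map" phase: split R and S into n roughly equal-sized blocks.
    R_blocks = [list(enumerate(R))[i::n] for i in range(n)]
    S_blocks = [S[i::n] for i in range(n)]

    # Round-1 "reduce": block nested loop kNN join inside each bucket.
    candidates = {rid: [] for rid in range(len(R))}
    for R_blk, S_blk in itertools.product(R_blocks, S_blocks):
        for rid, r in R_blk:
            candidates[rid].extend(heapq.nsmallest(k, S_blk, key=lambda s: dist(r, s)))

    # Round 2: global kNN of each record from its n*k local candidates.
    return {rid: heapq.nsmallest(k, cands, key=lambda s: dist(R[rid], s))
            for rid, cands in candidates.items()}

# Tiny usage example with 2-D points.
R = [(0.0, 0.0), (5.0, 5.0)]
S = [(1.0, 0.0), (0.0, 2.0), (4.0, 5.0), (9.0, 9.0)]
print(baseline_knn_join(R, S, n=2, k=2))
```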

  4. Z-Value Based Partition Join • Motivations: • The baseline method creates excessive communication (n^2 buckets) • The new method seeks an alternative with linear communication and computation costs • It uses a space-filling curve (the z-order curve)

  5. Z-order curve • In mathematical analysis, a space-filling curve is a curve whose range contains the entire 2-dimensional unit square (or, more generally, an n-dimensional hypercube) [from Wikipedia] • In mathematical analysis and computer science, Z-order, Morton order, or Morton code is a function that maps multidimensional data to one dimension while preserving the locality of the data points
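
To make the z-order concrete, here is a minimal bit-interleaving sketch; the helper name z_value and the 16-bit coordinate width are illustrative choices, not fixed by the paper:

```python
def z_value(coords, bits=16):
    """Morton code of a point with non-negative integer coordinates:
    interleave the bits of the coordinates, so that points close in
    space tend to be close in the one-dimensional z-order."""
    dims = len(coords)
    z = 0
    for bit in range(bits):                 # least significant bit first
        for d, c in enumerate(coords):
            z |= ((c >> bit) & 1) << (bit * dims + d)
    return z

# Sorting 2-D grid points by z-value traces the familiar "Z" pattern:
# (0,0), (1,0), (0,1), (1,1), (2,0), ...
points = [(x, y) for x in range(4) for y in range(4)]
print(sorted(points, key=z_value))
```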

  6. zkNN Algorithm • The zkNN algorithm runs on two datasets, here called R and S • Pick a small integer α and loop α times, each time shifting both datasets by a random vector • For each entry in R, use the shifted z-order to find a candidate subset of S from which the k nearest neighbors are drawn • The final candidate set is the union of all the candidate subsets (see the sketch below)
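
A minimal in-memory sketch of this idea, reusing the z_value helper above; the zero first shift and the shift range max_coord are illustrative assumptions (the code assumes integer coordinates below 2^15 so that shifted values still fit in 16 bits):

```python
import bisect
import heapq
import math
import random

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def zknn_join(R, S, k, alpha=2, max_coord=1 << 15):
    """Approximate kNN join with alpha randomly shifted z-orders: for
    each r, take the k points on either side of r in every shifted
    z-order of S, union these candidate subsets, and return the exact
    kNN of the union."""
    dims = len(S[0])
    shifts = [(0,) * dims] + [tuple(random.randrange(max_coord) for _ in range(dims))
                              for _ in range(alpha - 1)]
    orders = []                             # one sorted z-order of S per shift
    for v in shifts:
        keyed = sorted((z_value(tuple(a + b for a, b in zip(s, v))), s) for s in S)
        orders.append(([z for z, _ in keyed], [s for _, s in keyed]))

    results = []
    for r in R:
        cand = set()
        for v, (zvals, ordered) in zip(shifts, orders):
            pos = bisect.bisect_left(zvals, z_value(tuple(a + b for a, b in zip(r, v))))
            cand.update(ordered[max(0, pos - k): pos + k])   # k on each side
        results.append(heapq.nsmallest(k, cand, key=lambda s: dist(r, s)))
    return results
```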

  7. zkNN Algorithm based on MapReduce • Partition • All partitioning is based on z-values: datasets R and S are each mapped onto a linear order containing all their entries • In each iteration, under the same z-value function, two points with similar z-values are considered near each other • To find the nearest neighbors of an entry in R, look up the corresponding block of S • To guarantee enough entries (at least k) from S in each block, duplication is needed
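
In the map phase this amounts to routing every point, from R or from S, to the block whose z-value range contains it, using one shared list of boundary z-values so that matching R and S blocks meet in the same reducer. A minimal sketch with an illustrative function name:

```python
import bisect

def assign_partition(point, shift, boundaries):
    """Map-side routing: a point's partition id is where its (shifted)
    z-value falls among the shared, sorted boundary z-values, so R and S
    points with similar z-values land in the same bucket."""
    z = z_value(tuple(a + b for a, b in zip(point, shift)))
    return bisect.bisect_right(boundaries, z)
```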

  8. Partition Duplicate • To cover all possible nearest neighbors, duplicate the nearest k points from the preceding and succeeding blocks where necessary • First challenge: balanced partitioning • Partition the outer dataset R into balanced parts • Draw a sample of R, keeping each record with probability p = 1/(ε^2 N) for some ε in (0, 1); for a sampled record x with rank s(x) in the sample, estimate its rank in R as r(x) = s(x)/p • Bound the deviation of the estimated rank r(x), computed from the sample, against the true rank of x in R
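
A sketch of this sampling step under the slide's parameters; the guard for an empty sample is my own addition, not from the paper:

```python
import random

def estimate_split_points(z_values_of_R, n, eps):
    """Estimate n - 1 balanced split points of R's z-values from a sample.

    Each value is kept with probability p = 1/(eps^2 * N), so the expected
    sample size is 1/eps^2.  A sampled value with rank s(x) in the sample
    has estimated rank r(x) = s(x)/p in R; for each target rank i*N/n,
    pick the sample value whose estimated rank is closest."""
    N = len(z_values_of_R)
    p = min(1.0, 1.0 / (eps * eps * N))
    sample = sorted(z for z in z_values_of_R if random.random() < p)
    if not sample:                           # guard for very small inputs
        sample = sorted(random.sample(z_values_of_R, min(N, n)))
    splits = []
    for i in range(1, n):
        target = i * N / n                   # ideal rank of the i-th split
        j = min(range(len(sample)), key=lambda j: abs((j + 1) / p - target))
        splits.append(sample[j])
    return splits
```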

  9. Proof for partitioning R • Bounds the deviation of the estimated rank of x from the true rank of x • See Theorem 2 and Lemma 2 in the paper

  10. Partition Duplicate Continued • For dataset S, the partition boundaries are originally the same as those of R, but as discussed before, each block of S must additionally contain the nearest k points from the preceding block and from the succeeding block • As with R, draw a sample of S with probability p; the node at rank kp (an upper bound) in the sample of S is taken as an estimate of the k-th node of the real S
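
A sketch of building the S-side blocks with the duplicated boundary points; the input is S already sorted by (shifted) z-value together with its z-values, and the function name is illustrative:

```python
import bisect

def build_s_blocks(S_sorted, s_zvals, boundaries, k):
    """Cut S (sorted by z-value) at the shared boundary z-values, then
    widen each block by k entries on both sides; those copies are the
    nearest k points in z-order from the preceding and succeeding blocks,
    so every point in the matching R block finds at least k candidates on
    either side without leaving its bucket."""
    cuts = [0] + [bisect.bisect_right(s_zvals, b) for b in boundaries] + [len(S_sorted)]
    return [S_sorted[max(0, lo - k): hi + k] for lo, hi in zip(cuts, cuts[1:])]
```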

  11. Proof for partitioning S • See Theorem 3 in the paper

  12. Approximation quality • For each of the selected records, calculate its distance to the approximate k-th NN and its distance to the exact k-th NN; the ratio of the two distances is one measure of approximation quality • Other measures are recall and precision • Confidence intervals are given for the randomly selected records • One more detail: all experiments were conducted on a cluster with 16 slave nodes
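
The per-record measures fit in a few lines; a sketch assuming both answer lists are sorted by distance, so the k-th NN is the last element:

```python
import math

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def kth_nn_ratio(r, approx_knn, exact_knn):
    """Approximation quality of one record: distance to the approximate
    k-th NN divided by distance to the exact k-th NN (1.0 is perfect)."""
    return dist(r, approx_knn[-1]) / dist(r, exact_knn[-1])

def recall_precision(approx_knn, exact_knn):
    """Fraction of exact neighbors recovered, and fraction of reported
    neighbors that are exact; the two coincide when both lists have
    length k."""
    hits = len(set(approx_knn) & set(exact_knn))
    return hits / len(exact_knn), hits / len(approx_knn)
```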

  13. Effect of ε and α • Run with different ε values and compare running time and standard deviation (Figure 9 in the paper) • Compare the approximation ratio and recall (precision) for different α values (Figure 15 in the paper)

  14. Thank you!
