Fast Nearest-neighbor Search in Disk-resident Graphs



  1. Fast Nearest-neighbor Search in Disk-resident Graphs (Presenter: 鲁轶奇, Lu Yiqi)

  2. Outline • Introduction • Background & related works • Proposed Work • Experiments

  3. Introduction-Motivation • Graphs are becoming enormous • Streaming algorithms must take multiple passes over the entire dataset • Others perform clever preprocessing that is tied to one specific similarity measure • This paper introduces analysis and algorithms that address the scalability problem in a generalizable way: not specific to one kind of graph partitioning nor to one specific proximity measure.

  4. Introduction-Motivation(cont.) • Real-world graphs contain high-degree nodes • These algorithms compute a node's value by combining the values of its neighbors • Whenever a high-degree node is encountered, they have to examine a much larger neighborhood, leading to severely degraded performance.

  5. Introduction-Motivation(cont.) • Algorithms can no longer assume that the entire graph can be stored in memory • Compression techniques still have at least three settings where they might not work: • social networks are far less compressible than Web graphs • decompression might lead to an unacceptable increase in query response time • even if a graph could be compressed down to a gigabyte, it might be undesirable to keep it in memory on a machine that is running other applications

  6. Contributions • A simple transform of the graph (turning high-degree nodes into sinks) • A deterministic local algorithm guaranteed to return the nearest neighbors under personalized PageRank from the disk-resident clustered graph • A fully external-memory clustering algorithm (RWDISK) that uses only sequential sweeps over data files

  7. Background-Personalized Pagerank • A random walk starting at node a; at any step the walk is reset to the start node with probability α • PPV(a, j): the PPV entry from a to j • A large value indicates high similarity
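For reference, the restart process above can be written as the standard fixed-point equation below; this is a minimal statement with assumed notation (e_a is the indicator vector of node a, P the row-stochastic transition matrix), not copied from the slides.

```latex
% Personalized PageRank vector of start node a with restart probability \alpha:
% at each step the walk restarts at a with probability \alpha,
% otherwise it follows the transition matrix P.
\mathrm{PPV}_a = \alpha\, e_a + (1-\alpha)\, P^{\top} \mathrm{PPV}_a
\quad\Longleftrightarrow\quad
\mathrm{PPV}(a, j) = \alpha \sum_{t \ge 0} (1-\alpha)^{t}\, \Pr[X_t = j \mid X_0 = a]
```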

  8. Background-Clustering • Use random-walk-based approaches for computing good-quality local graph partitions near a given anchor node • Main intuition: a random walk started inside a low-conductance cluster will mostly stay inside the cluster • Conductance: Φ_V(A) denotes the conductance of a node set A, where μ(A) = Σ_{i∈A} degree(i)
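The conductance used here is the standard one; for completeness, a minimal statement consistent with the μ(A) on the slide (notation assumed):

```latex
% Conductance of a node set A in an undirected graph G = (V, E):
% the numerator counts edges leaving A, and \mu(A) = \sum_{i \in A} \mathrm{degree}(i).
\Phi_V(A) = \frac{\bigl|\{(i, j) \in E : i \in A,\; j \notin A\}\bigr|}
                 {\min\bigl(\mu(A),\, \mu(V \setminus A)\bigr)}
```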

  9. Proposed Work • First problem: most local algorithms for computing nearest neighbors suffer from the presence of high degree nodes. • Second issue: computing proximity measures on large disk-resident graphs. • Third issue: Finding a good clustering

  10. Effect of high degree nodes • High-degree nodes are a performance bottleneck • Effect on personalized PageRank • Main intuition: a very high-degree node passes on only a small fraction of its value to each out-neighbor, which might not be significant enough to invest computing resources on • Claim: stopping a random walk at a high-degree node does not change the personalized PageRank values at other nodes that have relatively smaller degree.
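The transform itself is simple. Below is a minimal sketch assuming an adjacency-list representation and a caller-chosen degree threshold (both the function name and the representation are illustrative, not from the paper):

```python
def make_high_degree_sinks(adj, degree_threshold):
    """Turn every node whose out-degree exceeds degree_threshold into a sink.

    adj: dict mapping node -> list of out-neighbors.
    A sink keeps its incoming edges but loses its outgoing ones, so a random
    walk that reaches it stops there and its value is never propagated onward.
    """
    sinks = {u for u, nbrs in adj.items() if len(nbrs) > degree_threshold}
    return {u: ([] if u in sinks else list(nbrs)) for u, nbrs in adj.items()}, sinks
```

With this sketch, the thresholds reported later in the experiments (degree 1000 for DBLP, 100 for LiveJournal) would simply be passed as degree_threshold.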

  11. Effect of high degree nodes • The error incurred in personalized PageRank is inversely proportional to the degree of the sink node.

  12. Effect of high degree nodes • f_α(i, j) is simply the probability of hitting node j for the first time from node i in this α-discounted walk.
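This first-hitting probability is related to personalized PageRank by a standard identity from the literature (stated here for context, not taken from the slides), which is why f_α appears naturally in the error analysis:

```latex
% f_\alpha(i, j): probability that the \alpha-discounted walk started at i
% reaches j before being reset; conditioning on the first visit to j gives
\mathrm{PPV}(i, j) = f_\alpha(i, j) \cdot \mathrm{PPV}(j, j)
```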

  13. Effect of high degree nodes

  14. Effect of high degree nodes • The error incurred when introducing a set of sink nodes

  15. Nearest-neighbors on clustered graphs • How to use the clusters for deterministic computation of nodes "close" to an arbitrary query • Use degree-normalized personalized PageRank • For a given node i, the PPV from j to it, i.e. PPV(j, i), can be written via the standard recurrence PPV_t(j, i) = α·1{j = i} + (1 − α) Σ_k P(j, k) PPV_{t−1}(k, i)

  16. Assume that j and i are in the same cluster S • We don't have access to PPV_{t−1}(k) for nodes k outside S, so we replace it with upper and lower bounds • Lower bound: 0, i.e. we pretend that S is completely disconnected from the rest of the graph • Upper bound: a random walk from outside S has to cross the boundary of S to hit node i.

  17. Since S is small in size, the power method suffices • At each iteration, maintain the upper and lower bounds for nodes within S • To expand S: bring in the clusters of some of the external neighbors of S • Stop when the global upper bound falls below a pre-specified small threshold γ • In practice, an additive slack ε is used, i.e. the comparison is against (ub_{k+1} − ε)

  18. Ranking Step • Return all nodes which have a lower bound greater than the (k+1)-th largest upper bound (sketched below) • Why this works: all nodes outside the cluster are guaranteed to have personalized PageRank smaller than the global upper bound, which is smaller than γ
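A hedged sketch of this ranking rule, assuming per-node lower and upper bounds are already available from the iteration on the previous slide; the function and variable names are illustrative:

```python
def rank_top_k(lb, ub, k, eps):
    """Return nodes whose lower bound beats the (k+1)-th largest upper bound.

    lb, ub: dicts node -> lower / upper bound on (degree-normalized) PPV for
            the nodes inside the expanded cluster.
    eps:    additive slack, as in the slide's (ub_{k+1} - eps) criterion.
    The guarantee for nodes outside the cluster holds once the global upper
    bound has fallen below the expansion threshold gamma.
    """
    ubs = sorted(ub.values(), reverse=True)
    threshold = ubs[k] if len(ubs) > k else 0.0  # (k+1)-th largest upper bound
    return [v for v, low in lb.items() if low > threshold - eps]
```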

  19. Clustered Representation on Disk • Intuition: use a set of anchor nodes and assign each remaining node to its "closest" anchor • Use personalized PageRank as the measure of "closeness" • Algorithm (sketched below): • Start with a random set of anchors • Iteratively add new anchors from the set of unreachable nodes, and then recompute the cluster assignments • Two properties: • new anchors are far away from the existing anchors • when the algorithm terminates, each node i is guaranteed to be assigned to its closest anchor.
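A minimal sketch of this anchor-based clustering loop. The PPV computation is treated as a black box (in the paper it comes from RWDISK), and the function names, the coverage-based stopping rule, and the way new anchors are sampled are assumptions for illustration:

```python
import random

def cluster_by_anchors(nodes, initial_anchors, compute_ppv, new_per_round=10):
    """Assign every node to the anchor with the largest PPV value to it.

    compute_ppv(anchors) is assumed to return a dict
    node -> {anchor: PPV(anchor, node)} covering the nodes each anchor reaches.
    Nodes reached by no anchor are 'unreachable'; new anchors are drawn from
    them and the assignment is recomputed until every node is covered.
    """
    anchors = list(initial_anchors)
    while True:
        ppv = compute_ppv(anchors)
        assignment, unreachable = {}, []
        for v in nodes:
            reached = ppv.get(v, {})
            if reached:
                assignment[v] = max(reached, key=reached.get)  # closest anchor
            else:
                unreachable.append(v)
        if not unreachable:
            return assignment, anchors
        anchors += random.sample(unreachable, min(new_per_round, len(unreachable)))
```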

  20. RWDISK • 4 kinds of files • Edge file: each line represents an edge by a triplet {src, dst, p}, where p = P(X_t = dst | X_{t−1} = src) • Last file: each line in Last is {src, anchor, value}, where value = P(X_{t−1} = src | X_0 = anchor) • Newt file: Newt contains x_t; each line is {src, anchor, value}, where value = P(X_t = src | X_0 = anchor) • Ans file: represents the values of v_t; each line in Ans is {src, anchor, value}, where value = Σ_{s ≤ t} α(1−α)^{s−1} P(X_s = src | X_0 = anchor), the accumulated PPV estimate • Algorithm to compute v_t by power iterations

  21. RWDISK(cont.) • Newt is simply a matrix-vector product between the transition matrix stored in Edges and the vector stored in Last • Files are stored in lexicographic order, so the product can be obtained by a file-join-like algorithm • First step: join the two files and accumulate the probability value at each node from its in-neighbors • Next step: sort and compress the Newt file in order to add up the values coming from different in-neighbors • Multiply the probabilities by α(1−α)^{t−1} when accumulating them into Ans • Fix the number of iterations at maxiter (an in-memory sketch of one iteration follows below)
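To make the file-join concrete, here is an in-memory sketch of one RWDISK iteration. The real algorithm streams over the lexicographically sorted files instead of holding dictionaries in RAM, so this is only an illustration of the arithmetic, with assumed data structures:

```python
from collections import defaultdict

def rwdisk_iteration(edges, last, ans, alpha, t):
    """One power-iteration step of RWDISK (in-memory illustration only).

    edges: list of (src, dst, p) triplets, p = P(X_t = dst | X_{t-1} = src)
    last:  dict src -> {anchor: P(X_{t-1} = src | X_0 = anchor)}
    ans:   dict (node, anchor) -> accumulated PPV estimate, updated in place
    Returns the Newt table, which becomes Last for the next iteration.
    """
    newt = defaultdict(lambda: defaultdict(float))
    # "Join" Edges with Last: every edge forwards probability mass from its
    # source to its destination; summing over edges adds up the in-neighbors.
    for src, dst, p in edges:
        for anchor, value in last.get(src, {}).items():
            newt[dst][anchor] += p * value
    # Fold this step into Ans with the restart weight alpha * (1 - alpha)^(t-1).
    weight = alpha * (1.0 - alpha) ** (t - 1)
    for node, per_anchor in newt.items():
        for anchor, value in per_anchor.items():
            ans[(node, anchor)] = ans.get((node, anchor), 0.0) + weight * value
    return {node: dict(per_anchor) for node, per_anchor in newt.items()}
```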

  22. One major problem is that the intermediate files can become much larger than the number of edges • In most real-world networks a huge fraction of the whole graph can be reached within 4-5 steps, so the intermediate files get too large • Rounding is used to reduce the file sizes (one simple rule is sketched below)
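One plausible rounding rule consistent with the slide (the paper's exact scheme and its error analysis may differ): drop entries whose value falls below a small threshold before writing the next intermediate file.

```python
def round_entries(table, eps):
    """Drop probability entries below eps so intermediate files stay small.

    table: dict node -> {anchor: value}, as produced by one RWDISK iteration.
    The cost is a controlled amount of lost probability mass per step.
    """
    return {node: {a: v for a, v in per_anchor.items() if v >= eps}
            for node, per_anchor in table.items()}
```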

  23. Experiments • Dataset

  24. Experiments(cont.) • System details • On an off-the-shelf PC • Least-recently-used (LRU) page replacement scheme • Page size 4 KB

  25. Experiments(cont.)-Effect of high degree nodes • Three-fold advantages: • speeds up external-memory clustering • reduces the number of page faults in random-walk simulation • Effect on RWDISK

  26. Experiments(cont.)-Deterministic vs. Simulations • Computing the top-10 neighbors with approximation slack 0.005 for 500 randomly picked nodes • Citeseer: original graph • DBLP: nodes with degree above 1000 turned into sinks • LiveJournal: nodes with degree above 100 turned into sinks

  27. Experiments(cont.)-RWDISK vs. METIS • maxiter = 30, α = 0.1 and ε = 0.001 for PPV • METIS as the baseline algorithm • Breaking DBLP into 50,000 parts used 20 GB of RAM • Breaking LiveJournal into 75,000 parts used 50 GB of RAM • In comparison, RWDISK can be executed on a standard PC with 2-4 GB of RAM

  28. Experiments(cont.)-RWDISK vs. METIS • Measure of cluster quality • A good disk-based clustering must satisfy: • Low conductance • Fit in disk-sized pages

  29. Experiments(cont.)-RWDISK vs. METIS

  30. Experiments(cont.)-RWDISK vs. METIS
