
Thanks to Jimmy Lin slides


kurt

Presentation Transcript


  1. Graph Algorithms with MapReduce Chapter 5 Thanks to Jimmy Lin slides

  2. Topics • Introduction to graph algorithms and graph representations • Single Source Shortest Path (SSSP) problem • Refresher: Dijkstra’s algorithm • Breadth-First Search with MapReduce • PageRank

  3. What’s a graph? • G = (V,E), where • V represents the set of vertices (nodes) • E represents the set of edges (links) • Both vertices and edges may contain additional information • Different types of graphs: • Directed vs. undirected edges • Presence or absence of cycles • Graphs are everywhere: • Hyperlink structure of the Web • Physical structure of computers on the Internet • Interstate highway system • Social networks

  4. Some Graph Problems • Finding shortest paths • Routing Internet traffic and UPS trucks • Finding minimum spanning trees • Telco laying down fiber • Finding Max Flow • Airline scheduling • Identify “special” nodes and communities • Breaking up terrorist cells, spread of avian flu • Bipartite matching • Monster.com, Match.com • And of course... PageRank

  5. Graphs and MapReduce • Graph algorithms typically involve: • Performing computation at each node • Processing node-specific data, edge-specific data, and link structure • Traversing the graph in some manner • Key questions: • How do you represent graph data in MapReduce? • How do you traverse a graph in MapReduce?

  6. Representing Graphs • G = (V, E) • A poor representation for computational purposes • Two common representations • Adjacency matrix • Adjacency list

  7. Adjacency Matrices • Represent a graph as an n × n square matrix M • n = |V| • M_ij = 1 means a link from node i to node j • [Figure: example directed graph with nodes 1, 2, 3, 4]

  8. Adjacency Matrices: Critique • Advantages: • Naturally encapsulates iteration over nodes • Rows and columns correspond to outlinks and inlinks • Disadvantages: • Lots of zeros for sparse matrices • Lots of wasted space

  9. Adjacency Lists • Take adjacency matrices… and throw away all the zeros • 1: 2, 4 • 2: 1, 3, 4 • 3: 1 • 4: 1, 3

  10. Adjacency Lists: Critique • Advantages: • Much more compact representation • Easy to compute over outlinks • Graph structure can be broken up and distributed • Disadvantages: • Much more difficult to compute over inlinks
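To see why inlinks are harder to get at, note that an adjacency list only stores outlinks, so finding the inlinks of one node means scanning every list. A minimal sketch, using the four-node graph from the previous slide (the function name `invert` is illustrative):

```python
from collections import defaultdict

# Outlinks of the four-node example graph: node -> nodes it links to.
outlinks = {1: [2, 4], 2: [1, 3, 4], 3: [1], 4: [1, 3]}


def invert(outlinks):
    """Scan every outlink list once to collect the inlinks of each node."""
    inlinks = defaultdict(list)
    for src in sorted(outlinks):
        for dst in outlinks[src]:
            inlinks[dst].append(src)
    return dict(inlinks)


inlinks = invert(outlinks)
print(inlinks)
```

In MapReduce terms this inversion is itself a full pass over the graph: the mapper emits (target, source) for every edge and the reducer collects the sources per target.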

  11. Single Source Shortest Path • Problem: find shortest path from a source node to one or more target nodes • Dijkstra’s algorithm: “Graph search algorithm that solves the single-source shortest path problem for a graph with nonnegative edge path costs, producing a shortest path tree” (Wikipedia) • First, a refresher: Dijkstra’s algorithm on a single machine

  12. Dijkstra’s Algorithm Example • [Figure: weighted directed graph with five nodes; distance labels: 0 at the source, ∞ at the other four nodes; edge weights: 1, 10, 9, 2, 3, 4, 6, 5, 7, 2] • Example from CLR

  13. Dijkstra’s Algorithm Example • [Figure: same graph with nodes named n0 (source) to n4; distance labels: 0 at n0, ∞ at the other four nodes] • Example from CLR

  14. Dijkstra’s Algorithm Example • [Figure: same graph; distance labels: 0 at n0, then 10, ∞, 5, ∞] • Example from CLR

  15. Dijkstra’s Algorithm Example • [Figure: same graph; distance labels: 0 at n0, then 8, 14, 5, 7] • Example from CLR

  16. Dijkstra’s Algorithm Example • [Figure: same graph; distance labels: 0 at n0, then 8, 13, 5, 7] • Example from CLR

  17. Dijkstra’s Algorithm Example • [Figure: same graph; distance labels: 0 at n0, then 8, 9, 5, 7] • Example from CLR

  18. Dijkstra’s Algorithm Example • [Figure: same graph, final state; distance labels: 0 at n0, then 8, 9, 5, 7] • Example from CLR
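For reference, here is a minimal single-machine sketch of Dijkstra's algorithm using Python's `heapq`. The edge list is an assumption: it is the five-node example from the CLR textbook as best it can be reconstructed, using the textbook's node names (s, t, x, y, z) rather than the slides' n0 to n4, since the transcript does not say which label belongs to which node. Its edge weights and final distances agree with the numbers shown on the example slides.

```python
import heapq

# Assumed edge list (CLR textbook example): node -> {neighbor: edge weight}.
graph = {
    "s": {"t": 10, "y": 5},
    "t": {"x": 1, "y": 2},
    "x": {"z": 4},
    "y": {"t": 3, "x": 9, "z": 2},
    "z": {"s": 7, "x": 6},
}


def dijkstra(graph, source):
    """Return the shortest distance from source to every node."""
    dist = {node: float("inf") for node in graph}
    dist[source] = 0
    heap = [(0, source)]          # priority queue of (distance, node)
    done = set()                  # nodes whose distance is final
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue              # stale queue entry, already finalized
        done.add(u)
        for v, w in graph[u].items():
            if d + w < dist[v]:   # relax edge u -> v
                dist[v] = d + w
                heapq.heappush(heap, (dist[v], v))
    return dist


distances = dijkstra(graph, "s")
print(distances)
```

The key property is that each step expands only the closest unfinished node, which is what a global priority queue provides and what MapReduce lacks.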

  19. Single Source Shortest Path • Problem: find shortest path from a source node to one or more target nodes • Single processor machine: Dijkstra’s Algorithm • MapReduce: parallel Breadth-First Search (BFS) • How to do it? First simplify the problem!!

  20. Finding the Shortest Path • First, consider equal edge weights • Solution to the problem can be defined inductively • Here’s the intuition: • DistanceTo(startNode) = 0 • For all nodes n directly reachable from startNode, DistanceTo(n) = 1 • For all nodes n reachable from some other set of nodes S, DistanceTo(n) = 1 + min(DistanceTo(m), m ∈ S)

  21. Finding the Shortest Path • This strategy advances the “known frontier” by one hop • Subsequent iterations include more reachable nodes as frontier advances • Multiple iterations are needed to explore entire graph
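The frontier idea can be sketched on a single machine before moving to MapReduce. In this illustrative version (the function name `bfs_distances` is an assumption, and the graph is the four-node adjacency list from the earlier slide), each pass of the `while` loop plays the role of one iteration and advances the frontier by one hop.

```python
def bfs_distances(adj, start):
    """Unit-weight shortest distances, computed one frontier at a time."""
    distance = {start: 0}
    frontier = {start}
    hops = 0
    while frontier:                      # stop when no new node is discovered
        hops += 1
        next_frontier = set()
        for m in frontier:
            for n in adj.get(m, []):
                if n not in distance:    # first discovery is the shortest
                    distance[n] = hops
                    next_frontier.add(n)
        frontier = next_frontier
    return distance


adj = {1: [2, 4], 2: [1, 3, 4], 3: [1], 4: [1, 3]}
result = bfs_distances(adj, 1)
print(result)
```

With equal edge weights a node's distance is final the first time it is reached, which is why the loop never has to revisit a discovered node.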

  22. Visualizing Parallel BFS • [Figure: example graph with nodes labeled 3, 1, 2, 2, 2, 3, 3, 3, 4, 4]

  23. Termination • Does the algorithm ever terminate? • Eventually, all nodes will be discovered, all edges will be considered (in a connected graph) • When do we stop? • When distances at every node no longer change at next frontier

  24. Next Step to Solving • Next: no longer assume the distance along each edge is 1

  25. Weighted Edges • Now add positive weights to the edges • Simple change: the points-to list in the map task includes a weight w_p for each pointed-to node p • Emit (p, D + w_p) instead of (p, D + 1) for each node p

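  26. Dijkstra’s Algorithm Example • [Figure: weighted directed graph with nodes n0 (source) to n4; distance labels: 0 at n0, ∞ at the other four nodes; edge weights: 1, 10, 9, 2, 3, 4, 6, 5, 7, 2] • Example from CLR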

  27. Multiple Iterations Needed • This MapReduce task advances the “known frontier” by one hop • Subsequent iterations include more reachable nodes as frontier advances • Multiple iterations are needed to explore entire graph • Each iteration is one MapReduce task • The final output of one iteration is the input to the next • Feed output back into the same MapReduce task

  28. Assume d = 1

  29. From Intuition to Algorithm • What info does the map task require? • A map task receives (k,v) • Key: node n • Value: D (distance from start) and points-to (adjacency list of nodes reachable from n) • What does the map task do? • Computes distances: emits (p, D + w_p) for every p ∈ points-to • Emits the graph structure of node n as (n, struct), which contains the current shortest distance to n; this makes sure the current distance is carried into the reducer

  30. From Intuition to Algorithm • What info does the reduce task require? • The reduce task gathers possible distances to a given p • What does the reduce task do? • selects the minimum one

  31. Algorithm • Assume adjacency list has information about edges and distances!!

  32. class Mapper
        method MAP(nid n, node N)
          D ← N.Distance
          Emit(nid n, N) // Pass along graph structure
          for all nodeid m ∈ N.AdjacencyList do
            Emit(nid m, D + w) // Emit distances to reachable nodes (w is the weight of edge n → m)
      class Reducer
        method REDUCE(nid m, [d1, d2, ...])
          dmin ← ∞
          M ← ∅
          for all d ∈ [d1, d2, ...] do
            if IsNode(d) then
              M ← d // Recover graph structure
            else if d < dmin then // Look for shorter distance
              dmin ← d
          if M.Distance > dmin then // Update shortest distance
            M.Distance ← dmin
            Increment counter for driver
          Emit(nid m, node M)

  33. Map Algorithm • Line 2. N holds the adjacency list and the current (shortest known) distance of node n • Line 4. Emits a (k,v) pair whose key is n and whose value is the current node info; there is only one of these per node, assuming each node is assigned to one mapper • Line 6. Emits a different type of (k,v) pair, which carries only a distance to a neighbor, not an adjacency list • The shuffle sends all (k,v) pairs with the same k to the same reducer

  34. Reduce Algorithm • Line 2. Receives both types of (k,v) values as input • Line 5. Determines the type of each value: is it a Node structure (adjacency list)? • Line 6. If v is not a Node structure then it is a distance; track the shortest • Each key should receive only one value for which IsNode is true, since each node is emitted once by its mapper • Line 9. Determine whether a new shortest distance was found • Line 10. Update the current shortest distance and increment a counter used to decide whether to stop
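The mapper and reducer can be tried out without Hadoop by simulating the shuffle with a dictionary. The sketch below is an in-memory illustration, not Hadoop code: `Node`, `map_fn`, `reduce_fn` and `run_iteration` are hypothetical names, the example graph is an assumed edge list (the CLR example with the textbook's node names), and every node is assumed to have a Node structure in the input.

```python
from collections import defaultdict

INF = float("inf")


class Node:
    """Node structure: current shortest distance plus weighted adjacency list."""

    def __init__(self, distance, adjacency):
        self.distance = distance
        self.adjacency = adjacency       # {neighbor id: edge weight}


def map_fn(nid, node):
    yield nid, node                      # pass along graph structure
    for m, w in node.adjacency.items():
        yield m, node.distance + w       # candidate distance to neighbor m


def reduce_fn(nid, values):
    node, dmin = None, INF
    for v in values:
        if isinstance(v, Node):          # the IsNode check: recover structure
            node = v
        elif v < dmin:                   # otherwise v is a candidate distance
            dmin = v
    changed = dmin < node.distance
    if changed:
        node.distance = dmin             # update shortest distance
    return nid, node, changed


def run_iteration(graph):
    groups = defaultdict(list)           # stands in for shuffle and sort
    for nid, node in graph.items():
        for key, value in map_fn(nid, node):
            groups[key].append(value)
    result, updates = {}, 0
    for nid, values in groups.items():
        key, node, changed = reduce_fn(nid, values)
        result[key] = node
        updates += int(changed)          # plays the role of the counter
    return result, updates


graph = {
    "s": Node(0, {"t": 10, "y": 5}),
    "t": Node(INF, {"x": 1, "y": 2}),
    "x": Node(INF, {"z": 4}),
    "y": Node(INF, {"t": 3, "x": 9, "z": 2}),
    "z": Node(INF, {"s": 7, "x": 6}),
}
graph, updates = run_iteration(graph)
print({nid: node.distance for nid, node in graph.items()}, updates)
```

One call advances the frontier by one hop: only the two direct neighbors of the source get finite distances, and `updates` reports how many distances changed.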

  35. Shortest path – one more thing • As written, the algorithm only finds shortest distances, not the shortest paths themselves • Do we have to use backpointers and retrace them to recover the shortest path? • No: emit paths along with distances, so each node has its shortest path accessible at all times • Most paths are relatively short, so this uses little space
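Carrying the path along with the distance might look like the following single-machine sketch. Each node's state is a (distance, path) pair, and a candidate sent to a neighbor extends the sender's path by one node. The names `iterate` and `shortest_paths` and the example edge list (the CLR example, with the textbook's node names) are assumptions for illustration.

```python
INF = float("inf")


def iterate(state, edges):
    """One synchronous round; state maps node -> (distance, path)."""
    # Start each candidate list with the node's current best, so ties keep it.
    candidates = {n: [entry] for n, entry in state.items()}
    for n, (dist, path) in state.items():
        if dist < INF:
            for m, w in edges[n].items():
                candidates[m].append((dist + w, path + [m]))
    return {n: min(options, key=lambda option: option[0])
            for n, options in candidates.items()}


def shortest_paths(edges, source):
    state = {n: (INF, []) for n in edges}
    state[source] = (0, [source])
    while True:
        new_state = iterate(state, edges)
        if new_state == state:           # nothing changed: done
            return state
        state = new_state


edges = {
    "s": {"t": 10, "y": 5},
    "t": {"x": 1, "y": 2},
    "x": {"z": 4},
    "y": {"t": 3, "x": 9, "z": 2},
    "z": {"s": 7, "x": 6},
}
paths = shortest_paths(edges, "s")
print(paths["x"])
```

The cost is that every emitted value grows with the path length, which is acceptable when paths are short.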

  36. Weighted edges: Finds Minimum? • Suppose we discover node r • We have discovered the shortest D to p, and the shortest D to r found so far goes through p • There may be a path through q to r that is shorter, but it lies outside the current search frontier • This cannot happen if every edge has D = 1, since a path that leaves the search frontier would be longer • We have found the shortest path within the frontier • We will discover the shortest path as the frontier expands • With sufficient iterations, we eventually discover the shortest distance

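  37. Dijkstra’s Algorithm Example • [Figure: weighted directed graph with nodes n0 (source) to n4; distance labels: 0 at n0, ∞ at the other four nodes; edge weights: 1, 10, 9, 2, 3, 4, 6, 5, 7, 2] • Example from CLR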

  38. Termination • Does this ever terminate? • Yes! Eventually, no better distances will be found. When no distance changes, we stop • Checking of termination must occur outside of MapReduce • Driver program submits an MR job for each iteration and checks whether the termination condition is met • Hadoop provides Counters, which the driver can read outside MapReduce • The driver determines after the reducers finish whether we are done • In shortest path, reducers count each change to a min distance and pass the count to the driver
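The driver logic can be sketched as an ordinary loop. In this illustration `run_job` is a hypothetical stand-in for submitting one MapReduce job and reading back its counter, not a Hadoop API; the example edge list is again the assumed CLR graph with the textbook's node names.

```python
from collections import defaultdict

INF = float("inf")


def run_job(distances, edges):
    """Stand-in for one MapReduce job; returns new distances and a counter."""
    candidates = defaultdict(list)
    for n, d in distances.items():       # map phase: emit candidate distances
        for m, w in edges[n].items():
            candidates[m].append(d + w)
    updated, counter = {}, 0
    for n, d in distances.items():       # reduce phase: keep the minimum
        best = min(candidates[n] + [d])
        if best < d:
            counter += 1                 # reducers count each change
        updated[n] = best
    return updated, counter


def driver(edges, source, max_iterations=100):
    """Resubmit the job until the counter reports that nothing changed."""
    distances = {n: INF for n in edges}
    distances[source] = 0
    for iteration in range(1, max_iterations + 1):
        distances, counter = run_job(distances, edges)
        if counter == 0:
            return distances, iteration
    raise RuntimeError("no convergence within max_iterations")


edges = {
    "s": {"t": 10, "y": 5},
    "t": {"x": 1, "y": 2},
    "x": {"z": 4},
    "y": {"t": 3, "x": 9, "z": 2},
    "z": {"s": 7, "x": 6},
}
distances, iterations = driver(edges, "s")
print(distances, iterations)
```

Note that the last iteration does no useful work: it is needed only to observe that the counter is zero.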

  39. Iterations • How many iterations are needed to compute the shortest distance to all nodes? • With equal edge weights: the diameter of the graph, i.e. the greatest distance between any pair of nodes • Small for many real-world problems – 6 degrees of separation • For a global social network – 6 MapReduce iterations

  40. Fig. 5.6 needs how many iterations for n1-n6? • Worst case? • need (#nodes – 1)
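The worst case can be reproduced with a small hypothetical graph (not necessarily the one in Fig. 5.6): a chain n1 → n2 → … → n6 with weight 1 on each edge, plus a direct but expensive edge n1 → n6 of weight 10. The function name `iterations_to_converge` is illustrative; it counts only the rounds in which some distance changes.

```python
def iterations_to_converge(edges, source):
    """Synchronous relaxation rounds; returns distances and rounds that changed something."""
    inf = float("inf")
    dist = {n: inf for n in edges}
    dist[source] = 0
    rounds = 0
    while True:
        new = dict(dist)
        for n, d in dist.items():        # read old distances, write new ones
            for m, w in edges[n].items():
                new[m] = min(new[m], d + w)
        if new == dist:
            return dist, rounds
        dist, rounds = new, rounds + 1


edges = {
    "n1": {"n2": 1, "n6": 10},
    "n2": {"n3": 1},
    "n3": {"n4": 1},
    "n4": {"n5": 1},
    "n5": {"n6": 1},
    "n6": {},
}
dist, rounds = iterations_to_converge(edges, "n1")
print(dist["n6"], rounds)
```

Node n6 is discovered in the first round with distance 10, but its true distance of 5 arrives only after the frontier has walked the whole chain, which takes #nodes − 1 = 5 rounds.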

  41. Comparison to Dijkstra • Dijkstra’s algorithm is more efficient • At any step it only pursues edges from the minimum-cost path inside the frontier • MapReduce explores all paths in parallel • Brute force – wastes time • Divide and conquer • Useful work happens only at the search frontier; within the frontier the same computations are repeated • Throw more hardware at the problem

  42. General Approach • MapReduce is adept at manipulating graphs • Store graphs as adjacency lists • Graph algorithms with MapReduce: • Each map task receives a node and its outlinks • Map task computes some function of the link structure, emits value with target as the key • Reduce task collects values by key (target node) and aggregates • Iterate multiple MapReduce cycles until some termination condition • Remember to “pass” graph structure from one iteration to next

  43. Another example – Random Walks Over the Web • Model: • User starts at a random Web page • User randomly clicks on links, surfing from page to page (may also teleport to a completely different page) • How frequently will a page be encountered during this surfing? • This is PageRank • Probability distribution over nodes in a graph, representing the likelihood that a random walk over the graph will arrive at a particular node

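  44. PageRank: Defined • Given page n with in-bound links L(n): $P(n) = \alpha \frac{1}{|G|} + (1 - \alpha) \sum_{m \in L(n)} \frac{P(m)}{C(m)}$, where • C(m) is the out-degree of m • P(m) is the page rank of m • α is probability of random jump • |G| is the total number of nodes in the graph • [Figure: pages m1 … mn linking to page n]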

  45. Computing PageRank • Properties of PageRank • Can be computed iteratively • Effects at each iteration are local • Sketch of algorithm: • Start with seed P_i values • Each page distributes its P_i “credit” to all pages it links to • Each target page adds up “credit” from multiple in-bound links to compute P_{i+1} • Iterate until values converge
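The sketch of the algorithm translates into a short single-machine loop. This is an illustrative version under stated assumptions: the random-jump probability `alpha=0.15` and the convergence `tolerance` are arbitrary choices, every node is assumed to have at least one outlink (no dangling nodes), and the graph is the four-node adjacency list from the earlier slide.

```python
def pagerank(outlinks, alpha=0.15, tolerance=1e-10, max_iterations=1000):
    """Iterate P(n) = alpha/|G| + (1 - alpha) * sum of P(m)/C(m) over inlinks m."""
    nodes = list(outlinks)
    size = len(nodes)
    rank = {n: 1.0 / size for n in nodes}           # seed values
    for _ in range(max_iterations):
        new_rank = {n: alpha / size for n in nodes}  # random-jump share
        for m in nodes:
            share = (1 - alpha) * rank[m] / len(outlinks[m])
            for n in outlinks[m]:                    # distribute credit
                new_rank[n] += share
        if max(abs(new_rank[n] - rank[n]) for n in nodes) < tolerance:
            return new_rank
        rank = new_rank
    return rank


outlinks = {1: [2, 4], 2: [1, 3, 4], 3: [1], 4: [1, 3]}
ranks = pagerank(outlinks)
print(ranks)
```

Because each page only needs the ranks of pages that link to it, every update is local, which is what makes the computation fit the map and reduce steps.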

  46. Computing PageRank • What does map do? • What does reduce do?

  47. PageRank MapReduce • Fig. 5.7 • Begins with 5 nodes splitting 1.0 -> 0.2 each • Each node splits its 0.2 among its outgoing links (map) • Each node then adds up all incoming values (reduce) • Each iteration is one MapReduce job

  48. PageRank in MapReduce • Map: distribute PageRank “credit” to link targets • Reduce: gather up PageRank “credit” from multiple sources to compute new PageRank value • Iterate until convergence ...
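One iteration of this map and reduce pair can be simulated in memory. The sketch below is a simplified illustration: it omits the random jump, the five-node graph is hypothetical (not necessarily the one in Fig. 5.7), and `map_fn`, `reduce_fn` and `run_iteration` are illustrative names rather than Hadoop APIs. Each node's value is a (rank, outlinks) pair.

```python
from collections import defaultdict


def map_fn(nid, node):
    rank, outlinks = node
    yield nid, outlinks                      # pass along graph structure
    for target in outlinks:
        yield target, rank / len(outlinks)   # distribute credit evenly


def reduce_fn(nid, values):
    outlinks, total = [], 0.0
    for v in values:
        if isinstance(v, list):              # the node's adjacency list
            outlinks = v
        else:                                # a share of incoming credit
            total += v
    return nid, (total, outlinks)


def run_iteration(graph):
    groups = defaultdict(list)               # stands in for shuffle and sort
    for nid, node in graph.items():
        for key, value in map_fn(nid, node):
            groups[key].append(value)
    return dict(reduce_fn(nid, values) for nid, values in groups.items())


# Hypothetical five-node graph; each node starts with 1.0 / 5 = 0.2.
links = {
    "n1": ["n2", "n4"],
    "n2": ["n3", "n5"],
    "n3": ["n4"],
    "n4": ["n5"],
    "n5": ["n1", "n2", "n3"],
}
graph = {nid: (0.2, outlinks) for nid, outlinks in links.items()}
graph = run_iteration(graph)
print({nid: round(rank, 3) for nid, (rank, _) in graph.items()})
```

The total rank stays at 1.0 after the iteration because every node here has at least one outlink; handling dangling nodes and the random jump would need extra bookkeeping in the reducer or a second pass.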
