Cluster Computing and Datalog

Cluster Computing and Datalog • Recursion Via Map-Reduce • Seminaïve Evaluation • Re-engineering Map-Reduce for Recursion

Acknowledgements • Joint work with Foto Afrati • Alkis Polyzotis and Vinayak Borkar contributed to the architecture discussions.

Implementing Datalog via Map-Reduce • Joins are straightforward to implement as a round of map-reduce. • Likewise, union/duplicate-elimination is a round of map-reduce. • But implementation of a recursion can thus take many rounds of map-reduce.
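As a minimal sketch of the first bullet, the following simulates one map-reduce round that joins r(W,X) with s(X,Y) on X. The function name `map_reduce_join` and the in-memory grouping are illustrative assumptions, not part of the talk; a real system would shuffle the keyed tuples across nodes.

```python
from collections import defaultdict

def map_reduce_join(r, s):
    """Join r(W,X) with s(X,Y) on X in one simulated map-reduce round."""
    # Map phase: key every tuple by its X-value, keeping the two relations apart.
    groups = defaultdict(lambda: ([], []))
    for w, x in r:
        groups[x][0].append(w)
    for x, y in s:
        groups[x][1].append(y)
    # Reduce phase: within each X group, pair every W with every Y.
    # Collecting into a set also performs duplicate elimination.
    return {(w, x, y)
            for x, (ws, ys) in groups.items()
            for w in ws
            for y in ys}

r = {(1, 2), (3, 2)}
s = {(2, 5), (4, 6)}
joined = map_reduce_join(r, s)
```

Only X = 2 appears in both relations, so only that group contributes output tuples.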

Seminaïve Evaluation • Specific combination of joins and unions. • Example: chain rule q(W,Z) :- r(W,X) & s(X,Y) & t(Y,Z) • Let r, s, t = “old” relations; r’, s’, t’ = incremental relations. • Simplification: assume |r’| = a|r|, etc.

A 3-Way Join Using Map-Reduce q(W,Z) :- r(W,X) & s(X,Y) & t(Y,Z) • Use k compute nodes. • Give X and Y “shares” (numbers of hash buckets whose product is k) that determine the reduce task receiving each tuple. • Optimum strategy replicates r and t, not s, using communication |s| + 2√(k|r||t|).
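The communication cost can be checked with a short calculation. The sketch below assumes X gets b buckets and Y gets c buckets with b·c = k, and that fractional shares are allowed (rounding to integers is ignored). The function names are hypothetical.

```python
import math

def optimal_shares(r_size, t_size, k):
    """Real-valued shares (b for X, c for Y) with b * c = k that minimize
    the replication cost c*|r| + b*|t|."""
    b = math.sqrt(k * r_size / t_size)
    c = math.sqrt(k * t_size / r_size)
    return b, c

def communication(r_size, s_size, t_size, b, c):
    # s(X,Y) carries both hashed attributes, so each s tuple goes to one task.
    # r(W,X) lacks Y, so each r tuple is copied to all c Y-buckets.
    # t(Y,Z) lacks X, so each t tuple is copied to all b X-buckets.
    return s_size + c * r_size + b * t_size

b, c = optimal_shares(100, 100, 16)
cost = communication(100, 1000, 100, b, c)
closed_form = 1000 + 2 * math.sqrt(16 * 100 * 100)
```

With |r| = |t| = 100, |s| = 1000 and k = 16 the shares come out equal (b = c = 4), and any other split of k, such as b = 2, c = 8, costs more.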

Seminaïve Evaluation – (2) • Need to compute sum (union) of seven terms (joins): rst’+rs’t+r’st+rs’t’+r’st’+r’s’t+r’s’t’ • Obvious method for computing a round of seminaïve evaluation: • Replicate r and r’; replicate t and t’; do not replicate s or s’. • Communication = (1+a)(|s| + 2√(k|r||t|))
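The seven-term sum can be illustrated on a single machine. This is a sketch under the assumption that each incremental relation is disjoint from its old relation; `chain_join` and `seminaive_delta` are names invented for the example.

```python
from itertools import product

def chain_join(r, s, t):
    """q(W,Z) :- r(W,X) & s(X,Y) & t(Y,Z), evaluated by nested loops."""
    return {(w, z)
            for (w, x) in r
            for (x2, y) in s if x2 == x
            for (y2, z) in t if y2 == y}

def seminaive_delta(r, s, t, dr, ds, dt):
    """Union of the seven joins that use at least one incremental relation."""
    result = set()
    for use_dr, use_ds, use_dt in product([False, True], repeat=3):
        if not (use_dr or use_ds or use_dt):
            continue  # skip rst: the all-old term was computed in earlier rounds
        result |= chain_join(dr if use_dr else r,
                             ds if use_ds else s,
                             dt if use_dt else t)
    return result

r, dr = {(1, 2)}, {(5, 2)}
s, ds = {(2, 3)}, {(2, 4)}
t, dt = {(3, 9)}, {(4, 8)}
old = chain_join(r, s, t)
delta = seminaive_delta(r, s, t, dr, ds, dt)
full = chain_join(r | dr, s | ds, t | dt)
```

The seven terms together with the old result rst reproduce the join of the full relations, which is what one round of seminaïve evaluation relies on.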

Seminaïve Evaluation – (3) • There are many other ways we might use k nodes to do the same task. • Example: one group of nodes does (r+r’)s’(t+t’); a second group does r’s(t+t’); the third group does rst’. • Theorem: no grouping does better than the obvious method for this example.

Networks of Processes for Recursions • Is it possible to do a recursion without multiple rounds of map-reduce and their associated communication cost? • Note: tasks do not have to be Map or Reduce tasks; they can have other behaviors.

Example: Very Simple Recursion p(X,Y) :- e(X,Z) & p(Z,Y); p(X,Y) :- p0(X,Y); • Use k compute nodes. • Hash Y-values to one of k buckets h(Y). • Each node gets a complete copy of e. • p0 is distributed among the k nodes, with p0(x,y) going to node h(y).

Example – Continued p(X,Y) :- e(X,Z) & p(Z,Y) • Each node applies the recursive rule and generates new tuples p(x,y). • Key point: since new tuples have a Y-value that hashes to the same node, no communication is necessary. • Duplicates are eliminated locally.
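A small simulation of this example, assuming integer Y-values and a stand-in hash h(y) = y mod k. Each simulated node holds a full copy of e and only its own share of p0, and no tuples are exchanged between nodes.

```python
def local_fixpoint(e, p_local):
    """Apply p(X,Y) :- e(X,Z) & p(Z,Y) to a fixpoint on one node."""
    p = set(p_local)
    frontier = set(p_local)
    while frontier:
        derived = {(x, y) for (x, z) in e for (z2, y) in frontier if z2 == z}
        frontier = derived - p   # local duplicate elimination
        p |= frontier
    return p

def partitioned_eval(e, p0, k):
    h = lambda y: y % k          # hypothetical hash on integer Y-values
    parts = [set() for _ in range(k)]
    for (x, y) in p0:
        parts[h(y)].add((x, y))  # p0(x,y) goes to node h(y)
    # Every node evaluates independently; nothing is sent between nodes.
    return [local_fixpoint(e, part) for part in parts]

e = {(1, 2), (2, 3), (3, 1)}
p0 = {(1, 10), (2, 11)}
per_node = partitioned_eval(e, p0, k=3)
```

Because the recursive rule copies Y unchanged from the body to the head, every tuple a node derives hashes back to that same node, and the union of the per-node results equals the single-machine fixpoint.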

Harder Case of Recursion • Consider a recursive rule p(X,Y) :- p(X,Z) & p(Z,Y) • Responsibility divided among compute nodes by hashing Z-values. • Node n gets tuple p(a,b) if either h(a) = n or h(b) = n.

[Figure: the compute node for h(Z) = n. It receives p(a,b) if h(a) = n or h(b) = n, remembers all received tuples (eliminating duplicates), and searches for matches; each produced tuple p(c,d) is sent to the nodes for h(c) and h(d).]
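The behavior of these compute nodes can be sketched as a single-process simulation in which a queue stands in for the network. The hash h(v) = v mod k and the function name `distributed_closure` are assumptions made for illustration.

```python
from collections import deque

def distributed_closure(p0, k):
    """Simulate k nodes evaluating p(X,Y) :- p(X,Z) & p(Z,Y)."""
    h = lambda v: v % k                  # hypothetical hash on integer values
    seen = [set() for _ in range(k)]     # per node: all tuples received so far
    queue = deque()                      # (destination node, tuple) messages

    def send(tup):
        for n in {h(tup[0]), h(tup[1])}:
            queue.append((n, tup))

    for tup in p0:
        send(tup)
    while queue:
        n, (a, b) = queue.popleft()
        if (a, b) in seen[n]:
            continue                     # duplicate: already remembered here
        seen[n].add((a, b))
        produced = set()
        if h(b) == n:                    # (a,b) plays p(X,Z) with Z = b
            produced |= {(a, y) for (z, y) in seen[n] if z == b}
        if h(a) == n:                    # (a,b) plays p(Z,Y) with Z = a
            produced |= {(x, b) for (x, z) in seen[n] if z == a}
        for tup in produced:
            send(tup)
    return set().union(*seen)

closure = distributed_closure({(1, 2), (2, 3), (3, 4)}, k=2)
```

Both tuples of any joinable pair p(x,z), p(z,y) reach node h(z), so whichever arrives second finds the other already stored and produces p(x,y).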

Comparison with Iteration • Advantage: Lets us avoid some communication of data that would be needed in iterated map-reduce rounds. • Disadvantage: Tasks run longer, more likely to fail.

Node Failures • To cope with failures, map-reduce implementations rely on each task getting its input at the beginning, and on output not being consumed elsewhere until the task completes. • But recursions can’t work that way. • What happens if a node fails after some of its output has been consumed?

Node Failures – (2) • Actually, there is no problem! • We restart the tasks of the failed node at another node. • The replacement task will send some data that the failed task also sent. • But each node remembers tuples to eliminate duplicates anyway.

Node Failures – (3) • But the “no problem” conclusion is highly dependent on the Datalog assumption that we are computing sets. • Argument would fail if we were computing bags or aggregations of the tuples produced. • Similar problems for other recursions, e.g., PDEs.
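The contrast between sets and bags can be made concrete. In this sketch a failed task has already sent a prefix of its output, and a replacement task then re-sends everything from the start. The helper `replay` and the sample tuples are hypothetical.

```python
from collections import Counter

def replay(outputs, sent_before_failure):
    """Messages a receiver sees when a task fails after sending a prefix of
    its output and a replacement task re-sends the whole output."""
    return outputs[:sent_before_failure] + outputs

outputs = [(1, 2), (1, 3), (2, 3)]
received = replay(outputs, sent_before_failure=2)

as_set = set(received)      # set semantics: re-sent tuples are absorbed
as_bag = Counter(received)  # bag semantics: re-sent tuples are counted again
```

Under set semantics the receiver ends in the same state as in a failure-free run, while the bag counts for the re-sent tuples are doubled, so any aggregate computed over them would be wrong.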

Extension of Map-Reduce Architecture for Recursion • Necessarily, all tasks need to operate in rounds. • The master controller learns of all input files that are part of the round-i input to task T and records that T has received these files.

Extension – (2) • Suppose some task S fails, and it never supplies the round-(i+1) input to T. • A replacement S’ for S is restarted at some other node. • The master knows that T has received up to round i from S, so it ignores the first i output files from S’.

Extension – (3) • Master knows where all the inputs ever received by S are from, so it can provide those to S’.
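The master's bookkeeping might look roughly like the following. This is a sketch, not the proposed architecture itself: the `Master` class, its method, and the file names are invented, and it assumes the replacement S’ regenerates its per-round output files in the same order under the same logical sender name.

```python
class Master:
    """Tracks, per (sender, receiver) pair, the last round already delivered."""

    def __init__(self):
        self.delivered = {}

    def deliver(self, sender, receiver, round_no, output_file, inbox):
        last = self.delivered.get((sender, receiver), 0)
        if round_no <= last:
            return False          # receiver already has this round: ignore
        inbox.append(output_file)
        self.delivered[(sender, receiver)] = round_no
        return True

master = Master()
inbox_T = []
# Task S supplies rounds 1 and 2 to T, then fails.
for i in (1, 2):
    master.deliver("S", "T", i, f"S-round-{i}", inbox_T)
# Replacement S' replays from round 1; only round 3 is new to T.
replayed = [master.deliver("S", "T", i, f"S-round-{i}", inbox_T)
            for i in (1, 2, 3)]
```

The same per-pair record is what lets the master find every input that S ever received, so those inputs can be handed to S’.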

Checkpointing and State • Another approach is to design tasks so that they can periodically write a state file, which is replicated elsewhere. • Tasks take input + state. • Initially, state is empty. • Master can restart a task from some state and feed it only inputs received after that state was written.
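A minimal sketch of the input + state idea, assuming the state is the set of tuples seen so far and that the master keeps an ordered log of the inputs sent to each task. The class and function names are hypothetical.

```python
class CheckpointedTask:
    """A task whose state is a set of tuples plus a count of consumed inputs."""

    def __init__(self, state=(), consumed=0):
        self.state = set(state)   # initially empty
        self.consumed = consumed

    def consume(self, tup):
        self.state.add(tup)
        self.consumed += 1

    def checkpoint(self):
        # An immutable snapshot that could be written out and replicated.
        return frozenset(self.state), self.consumed

def restart(checkpoint, input_log):
    """Rebuild a task from a checkpoint, feeding only inputs that arrived
    after the checkpoint was written."""
    state, consumed = checkpoint
    task = CheckpointedTask(state, consumed)
    for tup in input_log[consumed:]:
        task.consume(tup)
    return task

input_log = [(1, 2), (2, 3), (1, 3), (3, 4)]
original = CheckpointedTask()
for tup in input_log[:2]:
    original.consume(tup)
saved = original.checkpoint()
# The original task fails here; the replacement starts from the saved state.
recovered = restart(saved, input_log)
```

The replacement ends with the same state it would have reached by consuming the whole log, but it only replays the two inputs logged after the checkpoint.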

Example: Checkpointing p(X,Y) :- p(X,Z) & p(Z,Y) • Two groups of tasks: • Join tasks: hash on Z, using h(Z). • Like tasks from previous example. • Eliminate-duplicates tasks: hash on X and Y, using h’(X,Y). • Receives tuples from join tasks. • Distributes truly new tuples to join tasks.

Example – (2) [Figure: two ranks of tasks. Join tasks: state has p(x,y) if h(x) or h(y) is right; each produced p(a,b) is sent to dup-elim task h’(a,b). Dup-elim tasks: state has p(x,y) if h’(x,y) is right; p(a,b) is sent to join tasks h(a) and h(b) if new.]
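The two ranks can be simulated in one process, with two queues standing in for the buffer files passed between ranks. The hash functions h and h’ below are arbitrary stand-ins on integers, and feeding the initial tuples through the dup-elim rank first is an assumption of this sketch.

```python
from collections import deque

def two_rank_closure(p0, k_join, k_dedup):
    """Simulate join tasks and dup-elim tasks for p(X,Y) :- p(X,Z) & p(Z,Y)."""
    h = lambda v: v % k_join                    # join tasks hash on Z
    h2 = lambda x, y: (31 * x + y) % k_dedup    # dup-elim tasks hash on (X,Y)
    join_state = [set() for _ in range(k_join)]
    dedup_state = [set() for _ in range(k_dedup)]
    to_dedup = deque(p0)    # tuples travelling from join tasks to dup-elim tasks
    to_join = deque()       # (join task, tuple) travelling the other way
    while to_dedup or to_join:
        while to_dedup:
            a, b = to_dedup.popleft()
            d = h2(a, b)
            if (a, b) not in dedup_state[d]:
                dedup_state[d].add((a, b))
                for n in {h(a), h(b)}:          # forward only truly new tuples
                    to_join.append((n, (a, b)))
        while to_join:
            n, (a, b) = to_join.popleft()
            join_state[n].add((a, b))
            if h(b) == n:                       # (a,b) plays p(X,Z) with Z = b
                to_dedup.extend((a, y) for (z, y) in join_state[n] if z == b)
            if h(a) == n:                       # (a,b) plays p(Z,Y) with Z = a
                to_dedup.extend((x, b) for (x, z) in join_state[n] if z == a)
    return set().union(*dedup_state)

closure = two_rank_closure({(1, 2), (2, 3), (3, 4)}, k_join=2, k_dedup=3)
```

Since each tuple is owned by exactly one dup-elim task, it is forwarded to the join rank at most once, and the join tasks no longer need to filter duplicates themselves.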

Example – Details • Each task writes “buffer” files locally, one for each of the tasks in the other rank. • The two ranks of tasks are run on different racks of nodes, to minimize the probability that tasks in both ranks will fail at the same time.

Example – Details – (2) • Periodically, each task writes its state (tuples received so far) incrementally and lets the master controller replicate it. • Problem: the controller can’t be too eager to pass output files on to the tasks that consume them, or files become tiny.

Future Research • There is work to be done on optimization, using map-reduce or similar facilities, for restricted SQL such as Datalog, Datalog–, Datalog + aggregation. • Check out Hive, PIG, as well as work on multiway join optimization.

Future Research – (2) • Almost everything is open about recursive Datalog implementation under map-reduce or similar systems. • Seminaïve evaluation in general case. • Architectures for managing failures. • Clustera and Hyrax are interesting examples of (nonrecursive) extension of map-reduce. • When can we avoid communication as with p(X,Y) :- e(X,Z) & p(Z,Y)?
