
Cluster Computing, Recursion and Datalog



  1. Cluster Computing, Recursion and Datalog. Foto N. Afrati, National Technical University of Athens, Greece

  2. Map-Reduce Pattern (diagram): input read from DFS feeds the Map tasks; the Map tasks emit “key”-value pairs, which are routed to the Reduce tasks; the Reduce tasks write their output to DFS.
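
  A minimal single-process sketch of the pattern in Python (illustrative names, not any real framework's API; on a real cluster the map, shuffle, and reduce phases run on many compute nodes, with the DFS holding input and output):

      from collections import defaultdict

      def map_reduce(records, map_fn, reduce_fn):
          # Map phase: each input record yields zero or more "key"-value pairs.
          pairs = [kv for rec in records for kv in map_fn(rec)]
          # Shuffle: group values by key (the framework does this over the network).
          groups = defaultdict(list)
          for key, value in pairs:
              groups[key].append(value)
          # Reduce phase: one call per key produces the final output.
          return [out for key, vals in groups.items() for out in reduce_fn(key, vals)]

      # Example: word count.
      counts = map_reduce(
          ["map tasks emit pairs", "reduce tasks consume pairs"],
          map_fn=lambda doc: [(w, 1) for w in doc.split()],
          reduce_fn=lambda word, ones: [(word, sum(ones))],
      )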

  3. Tasks and Compute-Node Failures • Low-end hardware • Expect failures of the compute-nodes: disk crashes, software that is not updated, etc. • We don’t want to restart the computation. In map-reduce this can be handled easily. Why?

  4. Blocking property • Map-reduce deals with node failures by restricting the units of computation in an important way. • Both Map tasks and Reduce tasks have the blocking property: a task does not deliver output to any other task until it has completely finished its work.

  5. Extensions: Dataflow systems • Uses function prototypes for each kind of task the job needs, just as Map-Reduce or Hadoop use prototypes for the Map and Reduce tasks. • Replicates the prototype into as many tasks as are needed or requested by the user. • DryadLINQ (Microsoft), Clustera (U. Wisconsin), Hyracks (UC Irvine), Nephele/PACT (T. U. Berlin), BOOM (UC Berkeley), epiC (NUS). • Other extensions: high-level languages (PIG, Hive, SCOPE)

  6. Simple extension: several ranks of Map-Reduce computations: Map1 → Reduce1 → Map2 → Reduce2 → Map3 → Reduce3

  7. A more advanced extension: an acyclic network. The blocking property still holds. The tasks need no longer be Map and Reduce tasks; they could compute anything.

  8. We need Recursion • PageRank, the original problem for which Map-Reduce was developed • Studies of the structure of the web • Discovering communities • Centrality of persons • These need the full transitive closure • They are really operations on relational data

  9. Outline • A data-volume cost model, in which we can evaluate different algorithms, recursive or not, for executing queries on a cluster • Multiway join in map-reduce • Algorithms for implementing recursive queries, starting with transitive closure • Generalizing to all Datalog queries • File transfer between compute nodes involves substantial overhead, hence the need to cope with many small files

  10. Data-volume cost • The total running time of all the tasks that collaborate in the execution of the algorithm • Equivalently: the cost of renting time on processors from a public cloud • Surrogate: the sum of the sizes of the inputs to all the tasks that collaborate on a job • There is an upper limit on the amount of data that can be input to any one task

  11. Details of the cost model (other costs) • The execution time of a task could, in principle, be much larger than its input data volume • Many algorithms (e.g., SQL operations, hash join) perform operations in time proportional to the input size plus the output size • The data-volume model counts only input size, not output size (the output of a task is the input to another task, and the final output is typically succinct)

  12. Joining by Map-Reduce • Map tasks send R(a,b) to Reduce task i if h(b) = i, and send S(b,c) to Reduce task i if h(b) = i • Reduce task i produces all (a,b,c) such that h(b) = i, (a,b) is in R, and (b,c) is in S
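
  A sketch of this join in Python (illustrative; a real system runs each Reduce task on a different compute node, and the Map tasks do the routing):

      def join_by_mapreduce(R, S, num_tasks):
          h = lambda b: hash(b) % num_tasks
          # Map tasks: route each tuple to the Reduce task chosen by
          # hashing the join value b.
          r_parts = [[] for _ in range(num_tasks)]
          s_parts = [[] for _ in range(num_tasks)]
          for a, b in R:
              r_parts[h(b)].append((a, b))      # R(a,b) goes to task h(b)
          for b, c in S:
              s_parts[h(b)].append((b, c))      # S(b,c) goes to task h(b)
          # Reduce task i: emit all (a,b,c) with (a,b) in R, (b,c) in S, h(b) = i.
          return [(a, b, c)
                  for i in range(num_tasks)
                  for a, b in r_parts[i]
                  for b2, c in s_parts[i] if b2 == b]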

  13. Natural Join of Three Relations. The three relations:

      A  B        A  C        B  C
      0  1        1  3        1  2
      1  2        2  1        2  3

      The join:

      A  B  C
      1  2  3

  14. Multiway join on Map-Reduce. Answer(W,X,Y,Z) :- r(W,X) & s(X,Y) & t(Y,Z) • We use 100 Reduce tasks • Hash X five ways and Y 20 ways • h and g are the hash functions for X and Y • Reduce task ID = [h(X), g(Y)] • Send r(a,b) to all tasks [h(b), v]; similarly for t(c,d): all tasks [u, g(c)] • Each r fact is sent to 20 tasks • Each t fact is sent to 5 tasks • Each s fact is sent to only one task
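
  A sketch of the routing in Python (illustrative scaffolding): r-facts are replicated across all g-buckets because r says nothing about Y, symmetrically for t, while each s-fact goes to exactly one task, which then joins its three local fragments:

      def route_multiway(r, s, t, k1, k2):
          # Answer(W,X,Y,Z) :- r(W,X) & s(X,Y) & t(Y,Z)
          # Reduce task IDs are the pairs [h(X), g(Y)]; k1*k2 tasks in all.
          h = lambda x: hash(x) % k1
          g = lambda y: hash(y) % k2
          tasks = {(i, j): {"r": [], "s": [], "t": []}
                   for i in range(k1) for j in range(k2)}
          for w, x in r:                    # r(W,X) knows X but not Y:
              for j in range(k2):           # send to every task [h(x), v]
                  tasks[(h(x), j)]["r"].append((w, x))
          for x, y in s:                    # s(X,Y) determines both buckets
              tasks[(h(x), g(y))]["s"].append((x, y))
          for y, z in t:                    # t(Y,Z) knows Y but not X:
              for i in range(k1):           # send to every task [u, g(y)]
                  tasks[(i, g(y))]["t"].append((y, z))
          return tasks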

  15. Multiway join on Map-Reduce (minimizing the data-volume cost). Answer(W,X,Y,Z) :- r(W,X) & s(X,Y) & t(Y,Z) • If we have k Reduce tasks, we can use two hash functions h(X) and g(Y) that map X-values and Y-values to k1 and k2 buckets, respectively, with k1·k2 = k • A Reduce task then corresponds to a pair of buckets, one for h and the other for g • To minimize the number of tuples transmitted, pick k1 = sqrt(k·|r|/|t|) and k2 = sqrt(k·|t|/|r|) • Data-volume cost = k1·|t| + k2·|r| = 2·sqrt(k·|r|·|t|)
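
  A quick check of the optimization in Python: minimizing k1·|t| + k2·|r| subject to k1·k2 = k yields the square-root formulas, and the earlier choice k1 = 5, k2 = 20 for k = 100 is optimal exactly when |t| = 4·|r|:

      from math import sqrt

      def optimal_buckets(k, r_size, t_size):
          # Minimize replication cost k1*|t| + k2*|r| subject to k1*k2 = k.
          k1 = sqrt(k * r_size / t_size)            # buckets for h(X)
          k2 = sqrt(k * t_size / r_size)            # buckets for g(Y)
          return k1, k2, k1 * t_size + k2 * r_size  # = 2*sqrt(k*|r|*|t|)

      # optimal_buckets(100, 1000, 4000) returns (5.0, 20.0, 40000.0)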

  16. Transitive closure (TC). Nonlinear: p(x,y) <- e(x,y) and p(x,y) <- p(x,z) & p(z,y). Right-linear: p(x,y) <- e(x,y) and p(x,y) <- e(x,z) & p(z,y).
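
  A small Python experiment (illustrative) contrasting the two forms: nonlinear TC can double path lengths every round, so it reaches a fixed point in a logarithmic number of rounds, while right-linear TC extends paths by one edge per round:

      def tc(edges, nonlinear=True):
          # Nonlinear:    p(x,y) <- p(x,z) & p(z,y)
          # Right-linear: p(x,y) <- e(x,z) & p(z,y)
          p, rounds = set(edges), 0
          while True:
              left = p if nonlinear else set(edges)
              new = p | {(x, y) for x, z in left for z2, y in p if z2 == z}
              if new == p:
                  return p, rounds
              p, rounds = new, rounds + 1

      chain = [(i, i + 1) for i in range(16)]            # a path of 16 arcs
      assert tc(chain, True)[1] < tc(chain, False)[1]    # 4 rounds vs 15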

  17. Lower Bound on Query Execution Cost • Number of derivations: the sum, over all the rules in the program, of the number of ways we can assign values to the variables in order to make the entire body (the right side of the rule) true • Key point: an implementation of a Datalog program that executes the rules as written must take time, on a single processor, at least as great as the number of derivations • Seminaive evaluation takes time proportional to the number of derivations
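
  A sketch of seminaive evaluation for nonlinear TC in Python (illustrative): every round joins only the tuples that were new in the previous round against everything known so far, so no derivation is ever repeated:

      from collections import defaultdict

      def seminaive_tc(edges):
          # p(x,y) <- e(x,y);  p(x,y) <- p(x,z) & p(z,y)
          p, delta = set(edges), set(edges)
          succ = defaultdict(set)    # z -> {y : p(z,y)}
          pred = defaultdict(set)    # z -> {x : p(x,z)}
          for x, y in edges:
              succ[x].add(y); pred[y].add(x)
          while delta:
              new = set()
              for x, z in delta:
                  new |= {(x, y) for y in succ[z]}   # new tuple on the left
                  new |= {(w, z) for w in pred[x]}   # new tuple on the right
              delta = new - p
              p |= delta
              for x, y in delta:
                  succ[x].add(y); pred[y].add(x)
          return p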

  18. Number of Derivations for Transitive Closure • Nonlinear TC: the sum, over all nodes c, of the number of nodes that can reach c times the number of nodes reachable from c • Left-linear TC: the sum, over all nodes c, of the in-degree of c times the number of nodes reachable from c • Hence nonlinear TC has more derivations than linear TC
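
  The nonlinear count, written out in Python over a computed TC relation p (illustrative; there is one derivation of the recursive rule for every true instantiation of p(x,c) & p(c,y)):

      def nonlinear_derivations(p):
          # Sum over nodes c of (# nodes reaching c) * (# nodes reachable from c).
          nodes = {n for fact in p for n in fact}
          reaching = {c: sum(1 for x, y in p if y == c) for c in nodes}
          reachable = {c: sum(1 for x, y in p if x == c) for c in nodes}
          return sum(reaching[c] * reachable[c] for c in nodes)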

  19. Nonlinear TC: Algorithm Dup-Elim • Join tasks, which perform the join of tuples as on the earlier slide • Dup-elim tasks, whose job is to catch duplicate p-tuples before they can propagate

  20. Nonlinear TC: Algorithm Dup-Elim (diagram)

  21. A join task • When p(a,b) is received by a task for the first time: • If this task is h(b), it searches for previously received tuples p(b,x), for any x; for each such tuple, it sends p(a,x) to two tasks: h(a) and h(x) • If this task is h(a), it searches for previously received tuples p(y,a), for any y; for each such tuple, it sends p(y,b) to the tasks h(y) and h(b) • The tuple p(a,b) is stored locally as an already-seen tuple
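
  The same handler as a Python sketch (the task id, hash function h, and send() callback are illustrative scaffolding around the rules above):

      class JoinTask:
          def __init__(self, task_id, h, send):
              self.id, self.h, self.send = task_id, h, send
              self.seen = set()              # p-tuples already received here

          def receive(self, a, b):
              if (a, b) in self.seen:
                  return                     # not the first time: ignore
              if self.h(b) == self.id:       # this task is h(b): join on b
                  for b2, x in self.seen:
                      if b2 == b:            # found p(b,x): derive p(a,x)
                          self.send((a, x), to={self.h(a), self.h(x)})
              if self.h(a) == self.id:       # this task is h(a): join on a
                  for y, a2 in self.seen:
                      if a2 == a:            # found p(y,a): derive p(y,b)
                          self.send((y, b), to={self.h(y), self.h(b)})
              self.seen.add((a, b))          # store as already seen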

  22. A join task (diagram)

  23. Nonlinear TC: Algorithm Dup-Elim • The method described above for combining join and duplicate-elimination tasks communicates a number of tuples that is at most the sum of: • the number of derivations, plus • twice the number of path facts, plus • three times the number of arcs

  24. Nonlinear TC: Algorithm Smart (improving on the number of derivations) • There is a path from node X to node Y if there are two paths: one from X to some node Z whose length is a power of 2, and another from Z to Y whose length is no greater • Key point: each shortest path between two points is discovered only once
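
  A sequential Python sketch of the doubling decomposition (illustrative; it shows the power-of-2 prefix / shorter suffix idea, not the duplicate-free bookkeeping of the full algorithm):

      def smart_tc(edges):
          q = set(edges)   # pairs joined by a path of length exactly 2^i
          p = set(edges)   # pairs joined by a path of length at most 2^i
          while True:
              # A power-of-2 prefix from q, then a suffix from p no longer than it.
              new_p = p | {(x, y) for x, z in q for z2, y in p if z2 == z}
              # Double the power of 2: q composed with itself.
              q = {(x, y) for x, z in q for z2, y in q if z2 == z}
              if new_p == p:
                  return p
              p = new_p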

  25. Incomparability of TC Algorithms • There is an infinite variety of algorithms that discover each shortest path only once • Smart and the linear algorithms are examples • None dominates the others on all graphs

  26. Implementing any Datalog program • For each rule we create a collection of tasks • Tasks correspond to buckets identified by vectors of values, where each component of the vector is obtained by hashing a certain variable • A task receives all facts P(a1, a2, ..., ak) discovered by any task, provided that each component ai meets a constraint • Alternatively, rewrite the rules to have bodies of at most two subgoals

  27. Implementations handling recursion • HaLoop: iteration implemented as a series of Map-Reduce ranks (VLDB 2010) • Pregel: checkpointing as the failure-recovery mechanism (SIGMOD 2010)

  28. HaLoop • Implements recursion as an iteration of map-reduce jobs • Tries hard to ensure that the tasks for one round are located at the nodes where their input was created by the previous round • The jobs are not really recursive, so no problem of dealing with failure arises

  29. Pregel • Views all computation as a recursion on some graph • Nodes send messages to one another, bunched into supersteps • Checkpoints all tasks at intervals; if any task fails, all tasks are rolled back to the previous checkpoint • Does not scale completely (the more compute nodes are involved, the more frequently we must checkpoint)

  30. Example: Shortest Paths via Pregel (diagram). Node N receives the message “I found a path from node M to you of length L” and asks: is this the shortest path from M I know about? If it is, N relays the discovery along its out-edges of lengths 5, 6, and 3: “I found a path from node M to you of length L+5”, “... of length L+6”, “... of length L+3”.
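
  A single-process sketch of this computation in Python (illustrative; real Pregel runs every node's logic in parallel and synchronizes at the superstep boundaries):

      import math
      from collections import defaultdict

      def pregel_sssp(graph, source):
          # graph maps each node to a list of (neighbor, edge_length) pairs.
          best = defaultdict(lambda: math.inf)
          inbox = {source: [0]}              # superstep 0: the source hears 0
          while inbox:                       # one loop iteration = one superstep
              outbox = defaultdict(list)
              for node, lengths in inbox.items():
                  shortest = min(lengths)
                  if shortest < best[node]:  # the shortest path I know about?
                      best[node] = shortest
                      for nbr, w in graph.get(node, []):
                          # "I found a path from the source to you of length L+w"
                          outbox[nbr].append(shortest + w)
              inbox = outbox
          return dict(best)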

  31. The endgame (dealing with small files) • In later rounds of a recursion, it is possible that the number of new facts derived at a round drops considerably • Recall that there is significant overhead involved in transmitting files • Whenever we distribute data among the tasks, we would like each of the recursive tasks to have many tuples for each of the other tasks • One approach: use a small number of rounds

  32. The polynomial fringe property • It is highly unlikely that all Datalog programs can be implemented in parallel in polylog time • The division between those that can and those that (almost surely) cannot was addressed in the 80’s, when the polynomial-fringe property (hereafter PFP) was defined • Programs with the PFP can be implemented in parallel in a polylog number of rounds • The problem reduces to TC

  33. Reachability Query R(X) <- R(Y), e(X,Y) R(X) <- Q(X) • Can be computed in a logarithmic number of rounds by computing the full transitive closure • Theorem: it cannot be computed in a logarithmic number of rounds unless the arity is allowed to increase to 2

  34. Right-linear chain programs P(X,Y) :- Blue(X,Y) P(X,Y) :- Blue(X,Z) & Q(Z,Y) Q(X,Y) :- Red(X,Z) & P(Z,Y) • Theorem: such programs can be executed in a logarithmic number of rounds while keeping arity 2

  35. Thank you
