
# Cluster Computing and Datalog


### Cluster Computing and Datalog

Recursion Via Map-Reduce

Seminaïve Evaluation

Re-engineering Map-Reduce for Recursion

Acknowledgements
• Joint work with Foto Afrati
• Alkis Polyzotis and Vinayak Borkar contributed to the architecture discussions.
Implementing Datalog via Map-Reduce
• Joins are straightforward to implement as a round of map-reduce.
• Likewise, union/duplicate-elimination is a round of map-reduce.
• But implementation of a recursion can thus take many rounds of map-reduce.
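As a concrete illustration of the first two bullets, here is a minimal single-process Python simulation (our own sketch, not code from the talk) of one map-reduce round that joins r(X,Y) with s(Y,Z) on Y; the reduce phase also eliminates duplicates by building a set.

```python
from collections import defaultdict

# Toy simulation of one map-reduce round computing
# q(X,Z) :- r(X,Y) & s(Y,Z).  All names are illustrative.

def map_phase(r, s):
    """Map: key each tuple by its join attribute Y, tagged with its source."""
    pairs = []
    for x, y in r:
        pairs.append((y, ('r', x)))
    for y, z in s:
        pairs.append((y, ('s', z)))
    return pairs

def reduce_phase(pairs):
    """Shuffle by key, then join the r-side and s-side values per key."""
    groups = defaultdict(list)
    for key, val in pairs:
        groups[key].append(val)
    out = set()  # set semantics: duplicates are eliminated for free
    for vals in groups.values():
        xs = [v for tag, v in vals if tag == 'r']
        zs = [v for tag, v in vals if tag == 's']
        out.update((x, z) for x in xs for z in zs)
    return out

r = {(1, 'a'), (2, 'a'), (3, 'b')}
s = {('a', 10), ('b', 20), ('c', 30)}
print(sorted(reduce_phase(map_phase(r, s))))  # → [(1, 10), (2, 10), (3, 20)]
```

A recursion would repeat such a round until no new tuples appear, which is why it can take many rounds.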
Seminaïve Evaluation
• Specific combination of joins and unions.
• Example: chain rule

q(W,Z) :- r(W,X) & s(X,Y) & t(Y,Z)

• Let r, s, t = “old” relations; r’, s’, t’ = incremental relations.
• Simplification: assume |r’| = a|r|, |s’| = a|s|, and |t’| = a|t|.

q(W,Z) :- r(W,X) & s(X,Y) & t(Y,Z)

• Use k compute nodes.
• Give X and Y shares to determine the reduce-task that gets each tuple.
• Optimum strategy replicates r and t, but not s, using communication |s| + 2√(k|r||t|).
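The shares computation behind that cost can be sketched as follows (our own illustration, assuming the shares technique: the k reduce tasks form an x-by-y grid with xy = k, r is replicated to the y tasks sharing its X bucket, t to the x tasks sharing its Y bucket, and s goes to exactly one task).

```python
from math import sqrt, isclose

# Minimize the communication |s| + y*|r| + x*|t| subject to x*y = k.
# Real shares would be rounded to integers; we keep them real here.

def optimal_shares(k, size_r, size_t):
    """Calculus/AM-GM optimum of y*|r| + x*|t| with x*y = k."""
    y = sqrt(k * size_t / size_r)  # share (number of buckets) for Y
    x = k / y                      # share for X
    return x, y

def communication(k, size_r, size_s, size_t):
    x, y = optimal_shares(k, size_r, size_t)
    return size_s + y * size_r + x * size_t

k, size_r, size_s, size_t = 100, 1000, 5000, 1000
cost = communication(k, size_r, size_s, size_t)
# At the optimum the cost equals |s| + 2*sqrt(k*|r|*|t|).
assert isclose(cost, size_s + 2 * sqrt(k * size_r * size_t))
print(cost)  # → 25000.0
```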
Seminaïve Evaluation – (2)
• Need to compute the sum (union) of seven terms (joins): rst’ + rs’t + r’st + rs’t’ + r’st’ + r’s’t + r’s’t’.
• Obvious method for computing a round of seminaïve evaluation:
• Replicate r and r’; replicate t and t’; do not replicate s or s’.
• Communication = (1+a)(|s| + 2√(k|r||t|)).
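The seven-term union can be sketched in Python (our own in-memory illustration, not code from the talk): take every choice of old/incremental version per relation except the all-old combination rst, which earlier rounds already produced.

```python
from itertools import product

# One seminaive round for q(W,Z) :- r(W,X) & s(X,Y) & t(Y,Z).

def chain_join(r, s, t):
    """Plain three-way chain join on the shared variables X and Y."""
    return {(w, z) for (w, x) in r
                   for (x2, y) in s if x == x2
                   for (y2, z) in t if y == y2}

def seminaive_round(r, s, t, dr, ds, dt):
    """Union of the seven terms that use at least one incremental relation."""
    out = set()
    for pr, ps, pt in product((r, dr), (s, ds), (t, dt)):
        if pr is r and ps is s and pt is t:
            continue  # skip rst: the all-old term was computed earlier
        out |= chain_join(pr, ps, pt)
    return out
```

For example, with r = {(0,1)}, s = {(1,2)}, t = {(2,3)} and only dr = {(9,1)} incremental, the round produces exactly {(9,3)} from the r’st term.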
Seminaïve Evaluation – (3)
• There are many other ways we might use k nodes to do the same task.
• Example: one group of nodes does (r+r’)s’(t+t’); a second group does r’s(t+t’); the third group does rst’.
• Theorem: no grouping does better than the obvious method for this example.
Networks of Processes for Recursions
• Is it possible to do a recursion without multiple rounds of map-reduce and their associated communication cost?
• Note: tasks do not have to be Map or Reduce tasks; they can have other behaviors.
Example: Very Simple Recursion

p(X,Y) :- e(X,Z) & p(Z,Y);

p(X,Y) :- p0(X,Y);

• Use k compute nodes.
• Hash Y-values to one of k buckets h(Y).
• Each node gets a complete copy of e.
• p0 is distributed among the k nodes, with p0(x,y) going to node h(y).
Example – Continued

p(X,Y) :- e(X,Z) & p(Z,Y)

• Each node applies the recursive rule and generates new tuples p(x,y).
• Key point: since new tuples have a Y-value that hashes to the same node, no communication is necessary.
• Duplicates are eliminated locally.
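The whole scheme can be simulated in a few lines of Python (our own sketch, with an illustrative hash h(y) = y mod k): every node holds all of e plus its slice of p0, runs to a local fixpoint, and never sends a message.

```python
# p(X,Y) :- e(X,Z) & p(Z,Y);  p(X,Y) :- p0(X,Y).
# Derived tuples keep their Y-value, so they stay on the same node.

def node_fixpoint(e, p_local):
    """Run one node's recursion to a local fixpoint; no communication."""
    p = set(p_local)
    changed = True
    while changed:
        changed = False
        new = {(x, y) for (x, z) in e for (z2, y) in p if z == z2}
        if not new <= p:
            p |= new          # local duplicate elimination via set union
            changed = True
    return p

k = 2
e = {(1, 2), (2, 3), (3, 4)}          # every node gets a full copy of e
p0 = {(2, 10), (3, 11)}
h = lambda y: y % k                   # hash on the Y-value

answer = set()
for n in range(k):                    # each node runs independently
    answer |= node_fixpoint(e, {(x, y) for (x, y) in p0 if h(y) == n})
print(sorted(answer))  # → [(1, 10), (1, 11), (2, 10), (2, 11), (3, 11)]
```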
Harder Case of Recursion
• Consider a recursive rule

p(X,Y) :- p(X,Z) & p(Z,Y)

• Responsibility divided among compute nodes by hashing Z-values.
• Node n gets tuple p(a,b) if either h(a) = n or h(b) = n.

[Slide diagram: the compute node for h(Z) = n stores p(a,b) if h(a) = n or h(b) = n, searches for matches among its tuples, and remembers all tuples it has seen in order to eliminate duplicates; each produced tuple p(c,d) is sent to the nodes for h(c) and h(d).]
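This harder scheme can be sketched as a toy Python simulation (our own code, with three nodes and the illustrative hash h(v) = v mod 3): node n joins p(a,b) with p(b,c) only when h(b) = n, and routes each new tuple to the two nodes responsible for it.

```python
from collections import defaultdict

# One round of p(X,Y) :- p(X,Z) & p(Z,Y), distributed by hashing Z.

k = 3
h = lambda v: v % k

def route(tup):
    """A tuple p(a,b) belongs to the nodes for h(a) and for h(b)."""
    a, b = tup
    return {h(a), h(b)}

def one_round(stores):
    """Each node n joins p(a,b) with p(b,c) where h(b) = n."""
    outboxes = defaultdict(set)
    for n, store in enumerate(stores):
        for (a, b) in store:
            if h(b) != n:
                continue  # node n only handles Z-values b with h(b) = n
            for (b2, c) in store:
                if b2 == b:
                    for dest in route((a, c)):
                        outboxes[dest].add((a, c))
    # Deliver; each node remembers all tuples, eliminating duplicates.
    for n, new in outboxes.items():
        stores[n] |= new
    return stores

p0 = {(1, 2), (2, 3)}
stores = [set() for _ in range(k)]
for t in p0:
    for dest in route(t):
        stores[dest].add(t)
stores = one_round(stores)
print(sorted(set().union(*stores)))  # → [(1, 2), (1, 3), (2, 3)]
```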

Comparison with Iteration
• Advantage: Lets us avoid some communication of data that would be needed in iterated map-reduce rounds.
Node Failures
• To cope with failures, map-reduce implementations rely on each task getting its input at the beginning, and on output not being consumed elsewhere until the task completes.
• But recursions can’t work that way.
• What happens if a node fails after some of its output has been consumed?
Node Failures – (2)
• Actually, there is no problem!
• We restart the tasks of the failed node at another node.
• The replacement task will send some data that the failed task also sent.
• But each node remembers tuples to eliminate duplicates anyway.
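A toy illustration of why the resend is harmless under set semantics but not under bag or aggregate semantics (foreshadowing the next slide; the example is ours):

```python
# A replacement task resends a tuple the failed task already delivered.

received_set = set()   # duplicate-eliminating receiver (Datalog sets)
received_bag = []      # counting receiver (bags / aggregation)

def deliver(tup):
    received_set.add(tup)     # idempotent: second delivery is a no-op
    received_bag.append(tup)  # NOT idempotent: the tuple is now counted twice

deliver((1, 2))  # original send from the failed task
deliver((1, 2))  # resend from its replacement
assert len(received_set) == 1  # set answer unchanged by the resend
assert len(received_bag) == 2  # a COUNT or SUM over the bag is now wrong
```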
Node Failures – (3)
• But the “no problem” conclusion depends heavily on the Datalog assumption that we are computing sets.
• The argument would fail if we were computing bags or aggregations of the tuples produced.
• Similar problems arise for other recursions, e.g., PDEs.
Extension of Map-Reduce Architecture for Recursion
• Necessarily, all tasks need to operate in rounds.
• The master controller learns of all input files that are part of the round-i input to task T and records that T has received these files.
Extension – (2)
• Suppose some task S fails, and it never supplies the round-(i+1) input to T.
• A replacement S’ for S is restarted at some other node.
• The master knows that T has received up to round i from S, so it ignores the first i output files from S’.
Extension – (3)
• The master knows where all the inputs ever received by S came from, so it can provide those inputs to S’.
Checkpointing and State
• Another approach is to design tasks so that they can periodically write a state file, which is replicated elsewhere.
• Tasks take input + state.
• Initially, state is empty.
• Master can restart a task from some state and feed it only inputs received after that state was written.
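The checkpoint-and-replay idea can be sketched as follows (our own illustration; the class and field names are assumptions, not from the talk): a task folds numbered input batches into its state, periodically serializes the state for the master to replicate, and after a failure a replacement restores the last checkpoint and replays only later inputs.

```python
import pickle

class Task:
    """A task whose state is the set of tuples received so far."""

    def __init__(self, state=None, next_input=0):
        self.state = set() if state is None else state
        self.next_input = next_input  # index of the first unseen input batch

    def consume(self, tuples):
        self.state |= set(tuples)
        self.next_input += 1

    def checkpoint(self):
        """State file that the master replicates elsewhere."""
        return pickle.dumps((self.state, self.next_input))

    @staticmethod
    def restore(blob):
        state, next_input = pickle.loads(blob)
        return Task(state, next_input)

inputs = [{(1, 2)}, {(2, 3)}, {(3, 4)}]
t = Task()
t.consume(inputs[0])
blob = t.checkpoint()           # replicated before the "failure"
t.consume(inputs[1])            # this work is lost when the node fails

t2 = Task.restore(blob)         # replacement task, started elsewhere
for batch in inputs[t2.next_input:]:  # master replays inputs 1 and 2 only
    t2.consume(batch)
assert t2.state == {(1, 2), (2, 3), (3, 4)}
```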
Example: Checkpointing

p(X,Y) :- p(X,Z) & p(Z,Y)

• Join tasks: hash on Z, using h(Z).
• Like tasks from previous example.
• Eliminate-duplicates tasks: hash on X and Y, using h’(X,Y).
• Distributes truly new tuples to join tasks.

to h(a)

and h(b)

if new

p(a,b)

p(a,b)

to h’(a,b)

Example – (2)

[Slide diagram: the state of an eliminate-duplicates task has p(x,y) if h’(x,y) is its bucket; the state of a join task has p(x,y) if h(x) or h(y) is its bucket.]
Example – Details
• Each task writes “buffer” files locally, one for each of the tasks in the other rank.
• The two ranks of tasks are run on different racks of nodes, to minimize the probability that tasks in both ranks will fail at the same time.
Example – Details – (2)
• Periodically, each task writes its state (tuples received so far) incrementally and lets the master controller replicate it.
• Problem: the master controller can’t be too eager to pass output files along as input, or the files become tiny.
Future Research
• There is work to be done on optimization, using map-reduce or similar facilities, for restricted SQL such as Datalog, Datalog–, Datalog + aggregation.
• Check out Hive and Pig, as well as work on multiway join optimization.
Future Research – (2)
• Almost everything is open about recursive Datalog implementation under map-reduce or similar systems.
• Seminaïve evaluation in general case.
• Architectures for managing failures.
• Clustera and Hyrax are interesting examples of (nonrecursive) extensions of map-reduce.
• When can we avoid communication as with p(X,Y) :- e(X,Z) & p(Z,Y)?