Cluster Computing and Datalog


Recursion Via Map-Reduce

Seminaïve Evaluation

Re-engineering Map-Reduce for Recursion

Acknowledgements
  • Joint work with Foto Afrati.
  • Alkis Polyzotis and Vinayak Borkar contributed to the architecture discussions.
Implementing Datalog via Map-Reduce
  • Joins are straightforward to implement as a round of map-reduce.
  • Likewise, union/duplicate-elimination is a round of map-reduce.
  • But implementation of a recursion can thus take many rounds of map-reduce.
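As a rough illustration of the first two bullets, a two-way join can be sketched as one map-reduce round in plain Python. The function name `map_reduce_join` and the sample relations are hypothetical; a real system would spread the groups over many reduce tasks rather than one dictionary.

```python
from collections import defaultdict

def map_reduce_join(r, s):
    """Join r(W,X) with s(X,Y) on X in a single simulated map-reduce round."""
    # Map phase: key every tuple by its join attribute X, keeping its origin.
    groups = defaultdict(lambda: ([], []))
    for w, x in r:
        groups[x][0].append(w)
    for x, y in s:
        groups[x][1].append(y)
    # Reduce phase: each key pairs its r-values with its s-values.
    # Collecting into a set also performs duplicate elimination.
    return {(w, x, y) for x, (ws, ys) in groups.items() for w in ws for y in ys}

r = {(1, 2), (3, 2), (4, 5)}
s = {(2, 7), (2, 8), (6, 9)}
joined = map_reduce_join(r, s)
```

A recursion repeats such rounds until no new tuples appear, which is where the many rounds come from.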
Seminaïve Evaluation
  • Specific combination of joins and unions.
  • Example: chain rule

q(W,Z) :- r(W,X) & s(X,Y) & t(Y,Z)

  • Let r, s, t = “old” relations; r’, s’, t’ = incremental relations.
  • Simplification: assume |r’| = a|r|, etc.
A 3-Way Join Using Map-Reduce

q(W,Z) :- r(W,X) & s(X,Y) & t(Y,Z)

  • Use k compute nodes.
  • Give X and Y shares to determine the reduce-task that gets each tuple.
  • Optimum strategy replicates r and t, not s, using communication |s| + 2√(k|r||t|).
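A sketch of where the cost comes from, assuming shares x and y for X and Y with xy = k: each r-tuple knows only its X bucket, so it is copied to y reducers; each t-tuple is copied to x reducers; each s-tuple goes to exactly one. Communication is |s| + y|r| + x|t|, which is minimized at |s| + 2√(k|r||t|). The simulation below uses hypothetical names (`three_way_join`, `x_share`, `y_share`) and small integer data.

```python
import math
from itertools import product

def three_way_join(r, s, t, x_share, y_share):
    """Route tuples to an x_share-by-y_share grid of reducers, then join locally."""
    hx = lambda v: hash(v) % x_share          # bucket for X-values
    hy = lambda v: hash(v) % y_share          # bucket for Y-values
    cells = product(range(x_share), range(y_share))
    reducers = {cell: ([], [], []) for cell in cells}
    sent = 0
    for w, x in r:                            # r knows X only: copy to every Y bucket
        for j in range(y_share):
            reducers[(hx(x), j)][0].append((w, x))
            sent += 1
    for x, y in s:                            # s knows X and Y: one reducer
        reducers[(hx(x), hy(y))][1].append((x, y))
        sent += 1
    for y, z in t:                            # t knows Y only: copy to every X bucket
        for i in range(x_share):
            reducers[(i, hy(y))][2].append((y, z))
            sent += 1
    out = set()
    for rs, ss, ts in reducers.values():
        for (w, x), (x2, y), (y2, z) in product(rs, ss, ts):
            if x == x2 and y == y2:
                out.add((w, z))
    return out, sent

r = {(w, w % 4) for w in range(8)}            # r(W,X), 8 tuples
s = {(x, (x + 1) % 4) for x in range(4)}      # s(X,Y), 4 tuples
t = {(z % 4, 100 + z) for z in range(8)}      # t(Y,Z), 8 tuples
out, sent = three_way_join(r, s, t, x_share=2, y_share=2)   # k = 4
predicted = len(s) + 2 * math.sqrt(4 * len(r) * len(t))
```

With |r| = |t| the optimal shares are equal, so the 2-by-2 grid here meets the predicted cost exactly.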
Seminaïve Evaluation – (2)
  • Need to compute sum (union) of seven terms (joins): rst’+rs’t+r’st+rs’t’+r’st’+r’s’t+r’s’t’
  • Obvious method for computing a round of seminaïve evaluation:
    • Replicate r and r’; replicate t and t’; do not replicate s or s’.
    • Communication = (1+a)(|s| + 2√(k|r||t|))
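The seven terms are every choice of old or incremental relation in each position except the all-old term rst. A minimal brute-force sketch (the names `chain_join` and `seminaive_round` and the sample data are illustrative only):

```python
from itertools import product

def chain_join(r, s, t):
    """q(W,Z) :- r(W,X) & s(X,Y) & t(Y,Z), computed by brute force."""
    return {(w, z) for (w, x) in r for (x2, y) in s for (y2, z) in t
            if x == x2 and y == y2}

def seminaive_round(old, new):
    """Union of the joins that use at least one incremental relation."""
    result, terms = set(), 0
    for choice in product((0, 1), repeat=3):   # 0 = old relation, 1 = incremental
        if not any(choice):
            continue                            # skip rst: computed in earlier rounds
        rels = [(old, new)[c][i] for i, c in enumerate(choice)]
        result |= chain_join(*rels)
        terms += 1
    return result, terms

old = ({(1, 2)}, {(2, 3)}, {(3, 4)})            # r, s, t
new = ({(5, 2)}, {(2, 6)}, {(6, 7), (3, 8)})    # r', s', t'
delta, terms = seminaive_round(old, new)
full = chain_join(*(o | n for o, n in zip(old, new)))
```

Adding the old result rst to `delta` reproduces the join of the combined relations, which is the point of seminaïve evaluation.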
Seminaïve Evaluation – (3)
  • There are many other ways we might use k nodes to do the same task.
  • Example: one group of nodes does (r+r’)s’(t+t’); a second group does r’s(t+t’); the third group does rst’.
  • Theorem: no grouping does better than the obvious method for this example.
Networks of Processes for Recursions
  • Is it possible to do a recursion without multiple rounds of map-reduce and their associated communication cost?
  • Note: tasks do not have to be Map or Reduce tasks; they can have other behaviors.
Example: Very Simple Recursion

p(X,Y) :- e(X,Z) & p(Z,Y);

p(X,Y) :- p0(X,Y);

  • Use k compute nodes.
  • Hash Y-values to one of k buckets h(Y).
  • Each node gets a complete copy of e.
  • p0 is distributed among the k nodes, with p0(x,y) going to node h(y).
Example – Continued

p(X,Y) :- e(X,Z) & p(Z,Y)

  • Each node applies the recursive rule and generates new tuples p(x,y).
  • Key point: since new tuples have a Y-value that hashes to the same node, no communication is necessary.
  • Duplicates are eliminated locally.
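The scheme can be simulated in a few lines, assuming integer values and a hash function h(y) = hash(y) mod k (both assumptions for illustration). Each simulated node holds all of e and only its own shard of p, and never exchanges tuples with another node.

```python
def local_fixpoint(e, p_local):
    """Apply p(X,Y) :- e(X,Z) & p(Z,Y) on one node until nothing new appears."""
    p = set(p_local)
    frontier = set(p_local)
    while frontier:
        derived = {(x, y) for (x, z) in e for (z2, y) in frontier if z == z2}
        frontier = derived - p            # local duplicate elimination
        p |= frontier
    return p

def distributed_closure(e, p0, k):
    h = lambda y: hash(y) % k
    shards = [set() for _ in range(k)]
    for x, y in p0:
        shards[h(y)].add((x, y))          # p0(x,y) goes to node h(y)
    # Derived tuples keep the Y-value of the p-tuple they came from,
    # so they always belong to the node that produced them.
    return [local_fixpoint(e, shard) for shard in shards]

e = {(1, 2), (2, 3), (3, 4)}
results = distributed_closure(e, p0=e, k=3)
combined = set().union(*results)
```

Here p0 = e, so the combined result is the transitive closure of e.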
Harder Case of Recursion
  • Consider a recursive rule

p(X,Y) :- p(X,Z) & p(Z,Y)

  • Responsibility divided among compute nodes by hashing Z-values.
  • Node n gets tuple p(a,b) if either h(a) = n or h(b) = n.
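One way to picture the behavior of these nodes is the single-process simulation below. It is only a sketch: the message queue stands in for the network, and `nonlinear_closure` and the hash function are hypothetical names, not part of the original design.

```python
from collections import deque

def nonlinear_closure(p0, k):
    """Simulate k nodes evaluating p(X,Y) :- p(X,Z) & p(Z,Y), hashing on Z."""
    h = lambda v: hash(v) % k
    stored = [set() for _ in range(k)]          # tuples each node remembers
    queue = deque((n, tup) for tup in p0 for n in {h(tup[0]), h(tup[1])})
    messages = 0
    while queue:
        n, (a, b) = queue.popleft()
        messages += 1
        if (a, b) in stored[n]:
            continue                            # duplicate: already handled here
        stored[n].add((a, b))
        new = set()
        if h(b) == n:                           # (a,b) plays p(X,Z) with Z = b
            new |= {(a, d) for (c, d) in stored[n] if c == b}
        if h(a) == n:                           # (a,b) plays p(Z,Y) with Z = a
            new |= {(c, b) for (c, d) in stored[n] if d == a}
        for c, d in new:
            for target in {h(c), h(d)}:         # send to nodes for h(c) and h(d)
                queue.append((target, (c, d)))
    return set().union(*stored), messages

closure, messages = nonlinear_closure({(1, 2), (2, 3), (3, 4)}, k=3)
```

Both p(x,z) and p(z,y) reach the node for h(z), and whichever arrives second finds the other among the remembered tuples.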
Compute Node for h(Z) = n
[Figure: a compute node. Labels: "p(a,b) if h(a) = n or h(b) = n"; "Node for h(Z) = n"; "Remember all received tuples"; "Search for"; "To nodes for h(c) and h(d)".]
Comparison with Iteration
  • Advantage: Lets us avoid some communication of data that would be needed in iterated map-reduce rounds.
  • Disadvantage: Tasks run longer, so they are more likely to fail.
Node Failures
  • To cope with failures, map-reduce implementations rely on each task getting its input at the beginning, and on output not being consumed elsewhere until the task completes.
  • But recursions can’t work that way.
  • What happens if a node fails after some of its output has been consumed?
Node Failures – (2)
  • Actually, there is no problem!
  • We restart the tasks of the failed node at another node.
  • The replacement task will send some data that the failed task also sent.
  • But each node remembers tuples to eliminate duplicates anyway.
Node Failures – (3)
  • But the “no problem” conclusion is highly dependent on the Datalog assumption that it is computing sets.
  • Argument would fail if we were computing bags or aggregations of the tuples produced.
  • Similar problems arise for other recursions, e.g., PDEs.
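A small illustration of why sets make replay harmless while bags do not. The receiver below keeps both a set and a multiset (`collections.Counter`) of what it has been sent; the tuples and the `deliver` helper are made up for the example.

```python
from collections import Counter

def deliver(messages, receiver_set, receiver_bag):
    for tup in messages:
        receiver_set.add(tup)          # set semantics: re-adding is a no-op
        receiver_bag[tup] += 1         # bag semantics: every copy is counted

seen, counts = set(), Counter()
output = [("a", "b"), ("b", "c")]
deliver(output[:1], seen, counts)      # failed task sent one tuple before dying
deliver(output, seen, counts)          # replacement task re-sends everything
```

After the replay, `seen` holds each tuple once, but `counts` records ("a", "b") twice, so any aggregate computed from the bag would be wrong.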
Extension of Map-Reduce Architecture for Recursion
  • Necessarily, all tasks need to operate in rounds.
  • The master controller learns of all input files that are part of the round-i input to task T and records that T has received these files.
Extension – (2)
  • Suppose some task S fails, and it never supplies the round-(i +1) input to T.
  • A replacement S’ for S is restarted at some other node.
  • The master knows that T has received up to round i from S, so it ignores the first i output files from S’.
Extension – (3)
  • The master knows the source of every input that S ever received, so it can provide those inputs to S’.
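The bookkeeping described on these slides might look roughly like the following. The `Master` class, its method names, and the file payloads are hypothetical, and the sketch assumes a replacement task reproduces its rounds in the same order under the same logical name.

```python
class Master:
    """Hypothetical controller tracking, per (sender, receiver), rounds delivered."""

    def __init__(self):
        self.delivered = {}            # (sender, receiver) -> last round delivered
        self.inbox = {}                # receiver -> list of (round, sender, payload)

    def output(self, sender, receiver, round_no, payload):
        """Called when `sender` (or its replacement) produces a round's file."""
        last = self.delivered.get((sender, receiver), 0)
        if round_no <= last:
            return False               # receiver already has this round: ignore it
        self.delivered[(sender, receiver)] = round_no
        self.inbox.setdefault(receiver, []).append((round_no, sender, payload))
        return True

    def inputs_for_restart(self, task):
        """Everything `task` ever received, so a replacement can be fed it again."""
        return list(self.inbox.get(task, []))

master = Master()
first_run = [master.output("S", "T", i, f"file{i}") for i in (1, 2)]
# S fails after round 2; its replacement S' starts over from round 1.
replay = [master.output("S", "T", i, f"file{i}") for i in (1, 2, 3)]
history = master.inputs_for_restart("T")
```

The replayed rounds 1 and 2 are dropped, and only round 3 reaches T.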
Checkpointing and State
  • Another approach is to design tasks so that they can periodically write a state file, which is replicated elsewhere.
  • Tasks take input + state.
    • Initially, state is empty.
  • Master can restart a task from some state and feed it only inputs received after that state was written.
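A minimal sketch of the input-plus-state idea, assuming the state is simply the set of tuples received so far. The `Task` class and the input batches are illustrative, and replication of the state file is left out.

```python
class Task:
    """Hypothetical task whose state is the set of tuples received so far."""

    def __init__(self, state=None):
        self.state = set(state or ())  # initially, state is empty

    def consume(self, tuples):
        self.state |= set(tuples)

    def checkpoint(self):
        return frozenset(self.state)   # snapshot that the master could replicate

inputs = [[(1, 2)], [(2, 3)], [(3, 4)]]      # input files, in arrival order
original = Task()
original.consume(inputs[0])
original.consume(inputs[1])
saved, saved_at = original.checkpoint(), 2   # state written after two inputs
original.consume(inputs[2])                  # more input arrives, then the task fails

replacement = Task(saved)                    # restart from the state file
for batch in inputs[saved_at:]:              # feed only inputs after the checkpoint
    replacement.consume(batch)
```

The replacement ends in the same state as the failed task without re-reading the first two inputs.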
Example: Checkpointing

p(X,Y) :- p(X,Z) & p(Z,Y)

  • Two groups of tasks:
    • Join tasks: hash on Z, using h(Z).
      • Like tasks from previous example.
    • Eliminate-duplicates tasks: hash on X and Y, using h’(X,Y).
      • Receives tuples from join tasks.
      • Distributes truly new tuples to join tasks.
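The interplay of the two groups can be simulated in one process. This is a sketch under assumptions: two queues stand in for the buffer files, h and h’ are simple modular hashes, and the initial tuples are assumed to enter through the eliminate-duplicates tasks.

```python
from collections import deque

def two_rank_closure(p0, k_join, k_dedup):
    h = lambda v: hash(v) % k_join                 # join tasks hash on Z
    h2 = lambda x, y: hash((x, y)) % k_dedup       # dup-elim tasks hash on (X,Y)
    join_state = [set() for _ in range(k_join)]
    dedup_state = [set() for _ in range(k_dedup)]
    to_dedup = deque(p0)                           # assumed entry point for p0
    to_join = deque()
    while to_dedup or to_join:
        while to_dedup:
            a, b = to_dedup.popleft()
            seen = dedup_state[h2(a, b)]
            if (a, b) not in seen:                 # truly new: forward to join tasks
                seen.add((a, b))
                for n in {h(a), h(b)}:
                    to_join.append((n, (a, b)))
        while to_join:
            n, (a, b) = to_join.popleft()
            state = join_state[n]
            state.add((a, b))
            if h(b) == n:                          # (a,b) as p(X,Z) with Z = b
                to_dedup.extend((a, d) for (c, d) in state if c == b)
            if h(a) == n:                          # (a,b) as p(Z,Y) with Z = a
                to_dedup.extend((c, b) for (c, d) in state if d == a)
    return set().union(*dedup_state)

closure = two_rank_closure({(1, 2), (2, 3), (3, 4)}, k_join=2, k_dedup=3)
```

Because only truly new tuples go back to the join tasks, each join task sees a given tuple at most once.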
Example – (2)
[Figure: two ranks of tasks. Join tasks: state has p(x,y) if h(x) or h(y) is right; output goes to h’(a,b). Dup-elim tasks: state has p(x,y) if h’(x,y) is right; output goes to h(a) and h(b) if new.]
Example – Details
  • Each task writes “buffer” files locally, one for each of the tasks in the other rank.
  • The two ranks of tasks are run on different racks of nodes, to minimize the probability that tasks in both ranks will fail at the same time.
Example – Details – (2)
  • Periodically, each task writes its state (tuples received so far) incrementally and lets the master controller replicate it.
  • Problem: the controller can’t be too eager to pass output files on to the tasks that consume them, or the files become tiny.
Future Research
  • There is work to be done on optimization, using map-reduce or similar facilities, for restricted SQL such as Datalog, Datalog¬ (Datalog with negation), Datalog + aggregation.
    • Check out Hive and Pig, as well as work on multiway join optimization.
Future Research – (2)
  • Almost everything is open about recursive Datalog implementation under map-reduce or similar systems.
    • Seminaïve evaluation in general case.
    • Architectures for managing failures.
      • Clustera and Hyrax are interesting examples of (nonrecursive) extension of map-reduce.
    • When can we avoid communication as with p(X,Y) :- e(X,Z) & p(Z,Y)?