Join algorithms using MapReduce

Join algorithms using MapReduce. Haiping Wang. Outline: MapReduce framework; MapReduce implementation on Hadoop; join algorithms using MapReduce. Background reference: MapReduce: Simplified Data Processing on Large Clusters. In OSDI, 2004.

Presentation Transcript

Outline

  • MapReduce Framework

  • MapReduce implementation on Hadoop

  • Join algorithms using MapReduce



MapReduce WordCount Diagram

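Input files: file1, file2, file3, file4, file5, file6, file7

Input splits: "ah ah er" | "ah" | "if or" | "or uh" | "or" | "ah if"

map(String inputkey, String inputvalue):

  • "ah ah er" -> ah:1 ah:1 er:1

  • "ah" -> ah:1

  • "if or" -> if:1 or:1

  • "or uh" -> or:1 uh:1

  • "or" -> or:1

  • "ah if" -> ah:1 if:1

Grouped by key:

  • ah:1,1,1,1

  • er:1

  • if:1,1

  • or:1,1,1

  • uh:1

reduce(String outputkey, Iterator intermediate_values):

  • (ah) 4

  • (er) 1

  • (if) 2

  • (or) 3

  • (uh) 1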
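The diagram above can be reproduced with a minimal single-process sketch in Python. This is only an illustration of the map, group, and reduce steps, not Hadoop code; the names `map_fn`, `reduce_fn`, `word_count`, and the split names are assumptions made for this example.

```python
from collections import defaultdict

def map_fn(input_key, input_value):
    """Emit (word, 1) for every word in one input split."""
    for word in input_value.split():
        yield word, 1

def reduce_fn(output_key, intermediate_values):
    """Sum the counts collected for one word."""
    return output_key, sum(intermediate_values)

def word_count(splits):
    # Group step: collect all values emitted for the same key.
    groups = defaultdict(list)
    for name, text in splits.items():
        for word, one in map_fn(name, text):
            groups[word].append(one)
    # Reduce step: one call per distinct key.
    return dict(reduce_fn(word, values) for word, values in sorted(groups.items()))

# The six input splits shown in the diagram (split names are hypothetical).
splits = {
    "split1": "ah ah er",
    "split2": "ah",
    "split3": "if or",
    "split4": "or uh",
    "split5": "or",
    "split6": "ah if",
}
counts = word_count(splits)
print(counts)  # {'ah': 4, 'er': 1, 'if': 2, 'or': 3, 'uh': 1}
```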


MapReduce implementation on Hadoop

Components shown in the diagram:

  • Job control: JobTracker, TaskTracker

  • Map side: InputFormat, RecordReader, Mapper, Partitioner

  • Reduce side: Copy, Sorter, Reducer, OutputFormat, RecordWriter





Join algorithms using MapReduce

  • Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters (SIGMOD 2007)

  • Semi-join Computation on Distributed File Systems Using Map-Reduce-Merge Model (SAC 2010)

  • Optimizing Joins in a Map-Reduce Environment (VLDB 2009, EDBT 2010)

  • A Comparison of Join Algorithms for Log Processing in MapReduce (SIGMOD 2010)



Map-Reduce-Merge Implementations of Relational Join Algorithms


Example: Hash Join

  • Mappers: use a hash partitioner

  • Reducers: read from every mapper for one designated partition

  • Mergers: read from two sets of reducer outputs that share the same hashing buckets

  • One set is used as the build set and the other as the probe set

Diagram: 8 splits -> 6 mappers -> 6 reducers -> 3 mergers
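The build/probe idea in the merger can be sketched in plain Python as follows. This is a single-process illustration under stated assumptions, not the Map-Reduce-Merge API: the function names, the bucket count, and the sample records are all hypothetical, and integer keys are used so that bucket assignment is deterministic.

```python
from collections import defaultdict

def hash_partition(records, key_fn, num_buckets):
    """Stand-in for the hash partitioner: same key -> same bucket."""
    buckets = [[] for _ in range(num_buckets)]
    for rec in records:
        buckets[hash(key_fn(rec)) % num_buckets].append(rec)
    return buckets

def merge_bucket(build_set, probe_set, build_key, probe_key):
    """Merger for one bucket: build a hash table, then probe it."""
    table = defaultdict(list)
    for rec in build_set:
        table[build_key(rec)].append(rec)
    return [(b, p) for p in probe_set for b in table.get(probe_key(p), [])]

def hash_join(build, probe, build_key, probe_key, num_buckets=3):
    # Both inputs use the same partitioner, so matching keys share a bucket.
    build_buckets = hash_partition(build, build_key, num_buckets)
    probe_buckets = hash_partition(probe, probe_key, num_buckets)
    out = []
    for i in range(num_buckets):
        out.extend(merge_bucket(build_buckets[i], probe_buckets[i], build_key, probe_key))
    return out

# Hypothetical sample data: (dept_id, dept_name) and (dept_id, employee).
departments = [(1, "eng"), (2, "ops"), (3, "hr")]
employees = [(1, "ann"), (2, "bob"), (1, "cy"), (4, "dee")]
joined = sorted(hash_join(departments, employees, lambda d: d[0], lambda e: e[0]))
print(joined)
```

The smaller input is usually the better choice for the build set, since only the build side has to fit in the hash table.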


Analysis and Conclusion

  • Connections

    • Dataset A uses ma mappers and ra reducers, dataset B uses mb mappers and rb reducers, and there are r mergers; suppose ra = rb = r

    • Map->Reduce connections = ra*ma + rb*mb = r*(ma + mb)

    • Reduce->Merge connections in the one-to-one case = 2r

    • Matcher: compares tuples to see if they should be merged or not

  • Conclusion

    • Uses multiple map-reduce jobs

    • The partitioner may cause a data skew problem

    • The choice of ma, ra, mb, rb, and r (for example, whether ra = rb) determines the number of connections
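The connection counts above are simple arithmetic, worked here as a small Python sketch. The function names and the sample values (ma = 6, mb = 4, r = 3) are assumptions chosen only for illustration.

```python
def map_reduce_connections(ma, ra, mb, rb):
    """Every reducer reads from every mapper of its own dataset."""
    return ra * ma + rb * mb

def reduce_merge_connections_one_to_one(r):
    """Each of the r mergers reads one reducer output from A and one from B."""
    return 2 * r

# Hypothetical configuration with ra = rb = r.
ma, mb, r = 6, 4, 3
map_to_reduce = map_reduce_connections(ma, r, mb, r)
reduce_to_merge = reduce_merge_connections_one_to_one(r)
print(map_to_reduce, reduce_to_merge)  # 30 6
```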


Semi-join Computation Steps and Workflow

  • Targets equi-joins; aims to reduce communication costs and disk I/O costs

  • Insensitive to data skew?


A Comparison of Join Algorithms for Log Processing in MapReduce (SIGMOD 2010)

  • Equi-join between a log table L and a reference table R on a single column.

  • L, R, and the join result are stored in DFS.

  • Scans are used to access L and R.

  • Each map or reduce task can optionally implement two additional functions: init() and close() .

  • init() is called before, and close() after, each map or reduce task.

L ⋈ R on L.k = R.k, with |L| ≫ |R|


Repartition Join (Hive)

Data flow: input -> map -> shuffle -> reduce -> output

  • Input L (Ratings.dat):

    • 1::1193::5::978300760

    • 1::661::3::978302109

    • 1::661::3::978301968

    • 1::661::4::978300275

    • 1::1193::5::97882429

  • Input R (movies.dat):

    • 661::James and the Giant…

    • 914::My Fair Lady..

    • 1193::One Flew Over the…

    • 2355::Bug’s Life, A…

    • 3408::Erin Brockovich…

  • Map: emits pairs (key, tagged record), for example 661, L:1::661::3::978302109 and 661, R:661::James and the Gia…

  • Shuffle: groups by join key, for example (661, [L:1::661::3::97…], [R:661::James…], [L:1::661::3::978…], [L:1::661::4::97…]), (2355, [R:2355::B’…]), (3408, [R:3408::Eri…])

  • Reduce: buffers records into two sets according to the table tag, then takes the cross-product: {(661::James…)} X {(1::661::3::97…), (1::661::3::97…), (1::661::4::97…)}

  • Output: (1,Ja..,3, …), (1,Ja..,3, …), (1,Ja..,4, …)

  • Drawback: all records may have to be buffered, which can run out of memory when

    • The key cardinality is small

    • The data is highly skewed
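The repartition join above can be sketched in a few lines of Python. This is a single-process illustration, not Hive or Hadoop code: the function names are assumptions, and the movie titles are abbreviated stand-ins because the slide only shows truncated titles.

```python
from collections import defaultdict

def map_ratings(line):
    """L side: key on the movie id (second field), tag the record with 'L'."""
    _user, movie, _rating, _ts = line.split("::")
    return movie, ("L", line)

def map_movies(line):
    """R side: key on the movie id (first field), tag the record with 'R'."""
    return line.split("::")[0], ("R", line)

def reduce_join(key, tagged_records):
    # Buffer records into two sets by table tag, then take the cross-product.
    l_buf = [rec for tag, rec in tagged_records if tag == "L"]
    r_buf = [rec for tag, rec in tagged_records if tag == "R"]
    return [(r, l) for r in r_buf for l in l_buf]

def repartition_join(ratings, movies):
    groups = defaultdict(list)  # stands in for the shuffle
    for line in ratings:
        key, value = map_ratings(line)
        groups[key].append(value)
    for line in movies:
        key, value = map_movies(line)
        groups[key].append(value)
    out = []
    for key in sorted(groups, key=int):
        out.extend(reduce_join(key, groups[key]))
    return out

ratings = [
    "1::1193::5::978300760",
    "1::661::3::978302109",
    "1::661::3::978301968",
    "1::661::4::978300275",
]
movies = ["661::James", "914::My Fair Lady", "1193::One Flew Over"]  # abbreviated titles
joined = repartition_join(ratings, movies)
for movie, rating in joined:
    print(movie, "|", rating)
```

Both buffers in `reduce_join` hold every record for one key, which is exactly where the out-of-memory risk comes from when a few keys carry most of the data.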


The Cost Measure for MR Algorithms

  • The communication cost of a process is the size of the input to the process

    • This paper does not count the output size for a process

      • The output must be input to at least one other process

      • The final output is much smaller than its input

  • The total communication cost is the sum of the communication costs of all processes that constitute an algorithm

  • The elapsed communication cost is defined on the acyclic graph of processes

    • Consider a path through this graph, and sum the communication costs of the processes along that path

    • The maximum sum over all paths is the elapsed communication cost
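Both cost measures can be computed directly from a process graph. The sketch below assumes each process is described by its input size and that the graph is given as a list of edges; the process names and sizes are hypothetical.

```python
from functools import lru_cache

def total_cost(input_size):
    """Total communication cost: sum of the input sizes of all processes."""
    return sum(input_size.values())

def elapsed_cost(input_size, edges):
    """Elapsed communication cost: heaviest path through the acyclic graph."""
    successors = {p: [] for p in input_size}
    for src, dst in edges:
        successors[src].append(dst)

    @lru_cache(maxsize=None)
    def best_from(process):
        # Cost of this process plus the heaviest path that continues from it.
        tail = max((best_from(nxt) for nxt in successors[process]), default=0)
        return input_size[process] + tail

    return max(best_from(p) for p in input_size)

# Hypothetical job: two Map processes feeding two Reduce processes.
input_size = {"map1": 100, "map2": 50, "reduce1": 80, "reduce2": 30}
edges = [("map1", "reduce1"), ("map1", "reduce2"),
         ("map2", "reduce1"), ("map2", "reduce2")]
total = total_cost(input_size)
elapsed = elapsed_cost(input_size, edges)
print(total, elapsed)  # 260 180
```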


2-Way Join in MapReduce

  • R(A,B) ⋈ S(B,C)

Data flow: input (R, S) -> Map -> Reduce input -> Reduce -> final output

Reduce input is grouped by the join attribute: b->(a, c)


Joining Several Relations at Once

  • R(A,B) ⋈ S(B,C) ⋈ T(C,D)

Data flow: input (R, S, T) -> Map -> Reduce input -> Reduce -> final output


Joining Several Relations at Once

  • R(A,B) ⋈ S(B,C) ⋈ T(C,D)

  • Let h be a hash function with range 1, 2, …, m

    • S(b, c) -> (h(b), h(c))

    • R(a, b) -> (h(b), all)

    • T(c, d) -> (all, h(c))

  • Each Reduce process computes the join of the tuples it receives

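Example with m = 4 and k = 16 Reduce processes (4^2 = 16), arranged as a grid indexed by h(b) = 0, 1, 2, 3 and h(c) = 0, 1, 2, 3:

  • An S tuple with h(S.b) = 2 and h(S.c) = 1 goes to the single Reduce process (2, 1)

  • An R tuple with h(R.b) = 2 goes to every Reduce process in the row h(b) = 2

  • A T tuple with h(T.c) = 1 goes to every Reduce process in the column h(c) = 1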
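The routing rule above can be sketched in Python as follows. This is a single-process illustration under stated assumptions: `h` is Python's built-in `hash` modulo m (so buckets run from 0 to m-1), and the function names and sample tuples are hypothetical.

```python
from collections import defaultdict

def h(value, m):
    """Hash function with m buckets, numbered 0..m-1 in this sketch."""
    return hash(value) % m

def route(R, S, T, m):
    """Send each tuple to the Reduce processes (grid cells) that need it."""
    cells = defaultdict(lambda: {"R": [], "S": [], "T": []})
    for a, b in R:                       # R(a, b) -> (h(b), all)
        for j in range(m):
            cells[(h(b, m), j)]["R"].append((a, b))
    for b, c in S:                       # S(b, c) -> (h(b), h(c))
        cells[(h(b, m), h(c, m))]["S"].append((b, c))
    for c, d in T:                       # T(c, d) -> (all, h(c))
        for i in range(m):
            cells[(i, h(c, m))]["T"].append((c, d))
    return cells

def reduce_cell(cell):
    """Each Reduce process joins the tuples it received."""
    return [(a, b, c, d)
            for (a, b) in cell["R"]
            for (b2, c) in cell["S"] if b2 == b
            for (c2, d) in cell["T"] if c2 == c]

def three_way_join(R, S, T, m=4):
    out = []
    for cell in route(R, S, T, m).values():
        out.extend(reduce_cell(cell))
    return sorted(out)

R = [(1, 10), (2, 11), (3, 12)]
S = [(10, 20), (11, 21), (12, 99)]
T = [(20, 30), (21, 31), (21, 32)]
result = three_way_join(R, S, T, m=4)
print(result)  # [(1, 10, 20, 30), (2, 11, 21, 31), (2, 11, 21, 32)]
```

Because every S tuple lands in exactly one cell, each result tuple is produced by exactly one Reduce process, so no deduplication step is needed when the inputs contain no duplicate tuples.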


Problem Solving

  • Problem solving using the method of Lagrange Multipliers

    • Take derivatives with respect to the three variables a, b, c

    • Multiply the three equations
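The transcript does not preserve the slide's equations, so the following is only a hedged illustration of the same two steps (take derivatives, then multiply the resulting equations). It uses the running example R(A,B) ⋈ S(B,C) ⋈ T(C,D) rather than the slide's three-variable setup. The symbols are assumptions: r, s, t are the relation sizes, and b and c here denote the number of hash buckets for attributes B and C, with b·c = k Reduce processes. Each R tuple is copied to c processes, each T tuple to b processes, and each S tuple to one.

```latex
\begin{align*}
\text{minimize}\quad & rc + s + tb \quad \text{subject to}\quad bc = k \\
\mathcal{L} &= rc + s + tb - \lambda\,(bc - k) \\
\frac{\partial \mathcal{L}}{\partial b} &= t - \lambda c = 0, \qquad
\frac{\partial \mathcal{L}}{\partial c} = r - \lambda b = 0 \\
rt &= \lambda^{2}\, bc = \lambda^{2} k
  \;\Longrightarrow\; \lambda = \sqrt{rt/k} \\
b &= \frac{r}{\lambda} = \sqrt{kr/t}, \qquad
c = \frac{t}{\lambda} = \sqrt{kt/r} \\
\text{cost} &= rc + s + tb = s + 2\sqrt{krt}
\end{align*}
```

When r = t this gives b = c = √k, which matches the m = 4, k = 16 grid of the previous slide.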


Special Cases

  • Star Joins

  • Chain Joins

    • A chain join is a join of the form


Conclusion

  • Only suitable for equi-joins

  • Uses a single map-reduce job

  • Does not consider I/O (for the intermediate <K,V> pairs) or CPU time

  • Main contribution: use of the “Lagrangean multipliers” method

