Joins in Hadoop

Gang and Ronnie

Agenda

  • Introduction of new types of joins

  • Experiment results

  • Join plan generator

  • Summary and future work

Problem at hand

  • Map join (fragment-duplicate join)

Fragment (large table)

Map tasks:

Duplicate (small table)

Slide taken from project proposal

  • Too many copies of the small table are shuffled across the network

  • Partially Solved

    • Distributed Cache

  • Doesn’t work with too many nodes involved

    • There are 64 nodes in our cluster, and distributed cache copies the data at most that many times

Slide taken from project proposal II

  • Memory Limitation

    • Hash table is not memory-efficient.

    • The table size is usually larger than the heap memory assigned to a task

Out Of Memory Exception!

Solving Not-Enough-Memory problem

New Map Joins:

  • Multi-phase map join (MMJ)

  • Reversed map join (RMJ)

  • JDBM-based map join (JMJ)

    small table as: duplicate

    large table as: fragment

Multi-phase map join

  • n-phase map join


[Diagram: Part 1, Part 2, ..., Part n feed the map tasks, one part per phase]


Problem? - Reading large table multiple times!
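A minimal sketch of the n-phase idea, assuming the table that does not fit in memory is cut into parts of at most `max_rows_in_memory` rows and the large table is scanned once per part. The function name, the parameter names, and the in-memory lists standing in for HDFS files are all hypothetical; this is not the presenters' implementation.

```python
from collections import defaultdict


def multi_phase_map_join(fragment, duplicate, key, max_rows_in_memory):
    """Join `fragment` (large) with `duplicate` (small) in several phases.

    Each phase loads at most `max_rows_in_memory` rows of `duplicate` into a
    hash table and then scans the whole `fragment` once.
    Returns the joined pairs and the number of phases that were needed.
    """
    results = []
    phases = 0
    for start in range(0, len(duplicate), max_rows_in_memory):
        part = duplicate[start:start + max_rows_in_memory]

        # Build phase: hash table over this part only.
        table = defaultdict(list)
        for row in part:
            table[row[key]].append(row)

        # Probe phase: the large table is re-read in every phase.
        for row in fragment:
            for match in table.get(row[key], []):
                results.append((row, match))
        phases += 1
    return results, phases
```

With a 5-row small table and room for 2 rows in memory, the sketch needs 3 phases, so the large table is scanned 3 times.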

Reversed map join

  • Default map join (in each Map task):

    1. Read the duplicate into memory and build a hash table

    2. For each tuple in the fragment, probe the hash table

  • Reversed map join (in each Map task):

    1. Read the fragment into memory and build a hash table

    2. For each tuple in the duplicate, probe the hash table

Problem? – not really a Map job…
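The two variants differ only in which side is built and which side is probed. The sketch below illustrates that with plain Python lists of dicts; `fragment_split` stands for the slice of the large table seen by one map task, and all function names are hypothetical.

```python
from collections import defaultdict


def build_hash_table(rows, key):
    """Group rows by their join key."""
    table = defaultdict(list)
    for row in rows:
        table[row[key]].append(row)
    return table


def default_map_join(fragment_split, duplicate, key):
    """Build on the duplicate (small table), probe with the fragment split."""
    table = build_hash_table(duplicate, key)
    return [(f, d) for f in fragment_split for d in table.get(f[key], [])]


def reversed_map_join(fragment_split, duplicate, key):
    """Build on the fragment split, probe by streaming the duplicate."""
    table = build_hash_table(fragment_split, key)
    return [(f, d) for d in duplicate for f in table.get(d[key], [])]
```

Both functions return the same set of pairs; only the memory footprint changes, since the reversed variant holds one fragment split in memory instead of the whole duplicate.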

JDBM-based map join

  • JDBM is a transactional persistence engine for Java.

  • Using JDBM, we can eliminate OutOfMemoryException. The size of the hash table is no longer bound by the heap size.

Problem? – Probing a hashtable on disk might take much time!
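To illustrate the idea of a hash table that lives on disk rather than on the heap, here is a small sketch that uses Python's standard `shelve` module as a stand-in. It is not JDBM and does not use JDBM's API; the function name and the temporary-directory layout are assumptions made for the example.

```python
import os
import shelve
import tempfile


def disk_backed_map_join(fragment_split, duplicate, key):
    """Map join whose hash table is stored in a disk-backed shelf."""
    results = []
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "duplicate_table")
        with shelve.open(path) as table:
            # Build phase: shelf keys must be strings; values are pickled.
            for row in duplicate:
                k = str(row[key])
                bucket = table.get(k, [])
                bucket.append(row)
                table[k] = bucket

            # Probe phase: each lookup may go to disk, which is the cost
            # the slide warns about.
            for row in fragment_split:
                for match in table.get(str(row[key]), []):
                    results.append((row, match))
    return results
```

The table size is then limited by local disk rather than by the task's heap, at the price of slower probes.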

Advanced Joins

  • Step 1: Semi-join on the join key only;

  • Step 2: Use the result to filter the tables;

  • Step 3: Join the new (filtered) tables.

  • Can be applied to both map and reduce-side joins

Problem? – Step 1 and 2 have overhead!
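The three steps can be sketched on in-memory tables as follows. This is a single-process illustration only: the helper names are hypothetical, and Step 3 uses a plain hash join where the real system could use either a map-side or a reduce-side join.

```python
def semi_join_keys(small, large, key):
    """Step 1: semi-join on the join key only (keys present in both tables)."""
    small_keys = {row[key] for row in small}
    return {row[key] for row in large if row[key] in small_keys}


def filter_table(table, key, keep):
    """Step 2: keep only the rows whose key survived the semi-join."""
    return [row for row in table if row[key] in keep]


def advanced_join(small, large, key):
    keep = semi_join_keys(small, large, key)
    small_f = filter_table(small, key, keep)
    large_f = filter_table(large, key, keep)

    # Step 3: join the reduced tables.
    table = {}
    for row in small_f:
        table.setdefault(row[key], []).append(row)
    return [(l, s) for l in large_f for s in table.get(l[key], [])]
```

Steps 1 and 2 each read the tables once more, which is the overhead the slide refers to; they pay off only when they remove enough rows before Step 3.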

The Nine Candidates

  • AMJ/no dist: advanced map join without distributed cache (DC)

  • AMJ/dist: advanced map join with DC

  • DMJ/no dist: default map join without DC

  • DMJ/dist: default map join with DC

  • MMJ: multi-phase map join

  • RMJ/dist: reversed map join with DC

  • JMJ/dist: JDBM-based map join with DC

  • ARJ/dist: advanced reduce join with DC

  • DRJ: default reduce join

Experiment Setup

  • TPC-DS benchmark

  • Evaluated query: JOIN customer, web_sales ON cid

  • Performed on different scales of generated data, e.g. 10GB, 170GB (not actual table size)

  • Each combination is performed five (5) times

  • Results are analyzed with error bars

Hadoop Cluster

  • 128 Hewlett Packard DL160 Compute Building Blocks

  • Each equipped with:

    • 2 quad-core CPUs

    • 16 GB RAM

    • 2 TB storage

    • High-speed network connection

  • Used in the experiment:

    • Hadoop Cluster (Altocumulus): 64 nodes

Result analysis

Some results ignored

One small note

  • What does 50*200 mean?

  • TABLE customer: from the 50GB version of TPC-DS; actual table size: about 100MB

  • TABLE web_sales: from the 200GB version of TPC-DS; actual table size: about 30GB

Distributed Cache II

  • Distributed cache introduces an overhead when copying the file from HDFS to local disks.

  • The following situations favor distributed cache (compared to non-DC):

    1. The number of nodes is low

    2. The number of map tasks is high
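A rough way to see why these two situations favor distributed cache is to count copies of the small table. The sketch below is an assumption made for illustration, not the cost model from the slides: it supposes one copy per map task without DC, at most one copy per node with DC, and a fixed per-copy overhead for DC. All names and numbers are hypothetical, and it compares total work rather than wall-clock time.

```python
def copies_with_dc(num_nodes, num_map_tasks):
    """Assumed: distributed cache copies the file at most once per node."""
    return min(num_nodes, num_map_tasks)


def copies_without_dc(num_map_tasks):
    """Assumed: every map task fetches its own copy."""
    return num_map_tasks


def prefer_distributed_cache(table_mb, num_nodes, num_map_tasks,
                             mb_per_sec, dc_overhead_sec):
    """Compare the total transfer cost (in seconds) of both options."""
    transfer_sec = table_mb / mb_per_sec
    with_dc = copies_with_dc(num_nodes, num_map_tasks) * (transfer_sec + dc_overhead_sec)
    without_dc = copies_without_dc(num_map_tasks) * transfer_sec
    return with_dc < without_dc
```

Under these assumptions, a job with many more map tasks than nodes favors DC, while a job with about one task per node does not, because the per-copy overhead is paid without saving any copies.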

Advanced vs. Default III

  • The overhead of semi-join and filtering is heavy.

  • The following situations favor advanced joins (compared to reduce joins):

    1. Join selectivity gets lower

    2. The network becomes slower (true!)

    3. We need to handle skewed data

Map Join vs Reduce Join

  • In most situations, Default Map Join performs better than Default Reduce Join

    • Eliminates the data transfer and sorting at the shuffle stage

  • The gap is not significant due to the fast network

  • Potential problems of Map Joins

    • A job involving too many map tasks causes a large amount of data to be transferred over the network

    • Distributed cache may harm performance

Beyond Default Map Join

  • Multi-Phase Map Join

    • Succeeds in all experiment groups.

    • Performance is comparable with DMJ when only one phase is involved.

    • Performance degrades sharply when the number of phases is greater than 2, because many more tasks are launched.

    • Currently no support for distributed cache; not scalable

Beyond Default Map Join

  • Reversed Map Join

    • Succeeds in all experiment groups.

    • Does not perform as well as DRJ due to the overhead of distributed cache

    • Performs best when

Beyond Default Map Join

  • JDBM Map Join

    • Fails for the last two experiment groups, mainly due to improper configuration settings.

Join Plan Generator

  • Cost-based + rule-based

  • Focus on three aspects

    • Whether or not to use distributed cache

    • Whether to use Default Map Join

    • Map join or reduce-side join

  • Parameters

Join Plan Generator

  • Whether to use distributed cache

    • Only works for map join approaches

    • Cost model

      • With distributed cache:

      • where is the average overhead to distribute one file

      • Without distributed cache:

Join Plan Generator

  • Whether to use Default Map Join

    • We give Default Map Join the highest priority since it usually works best

    • The choice on distributed cache can ensure Default Map Join works efficiently

    • Rule: if the small table fits entirely in memory, use Default Map Join.

Join Plan Generator

  • Map Joins or Default Reduce-side Join

    • In those situations where DMJ fails, Reversed Map Join is most promising in terms of usability and scalability.

    • Cost model:

      • RMJ:

        • (without distributed cache)

        • (with distributed cache)

      • where is the average overhead to distribute one file

      • DRJ:

Join Plan Generator

[Flowchart: Distributed cache? then Default Map Join? If yes, do it. Otherwise choose between Reversed Map Join and Default Reduce-side Join, then do it.]
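The decision order described on the Join Plan Generator slides can be sketched as a small function. The cost formulas themselves are not reproduced in this transcript, so the sketch takes the RMJ and DRJ cost estimates as inputs instead of computing them; the function name, parameter names, and returned labels are hypothetical.

```python
def choose_join_plan(small_table_fits_in_memory, use_distributed_cache,
                     rmj_cost, drj_cost):
    """Pick a join plan following the rule-then-cost order of the slides.

    `use_distributed_cache` is the outcome of the first decision;
    `rmj_cost` and `drj_cost` are externally supplied cost estimates.
    """
    suffix = "/dist" if use_distributed_cache else "/no dist"

    # Rule: Default Map Join has the highest priority when the small
    # table fits entirely in memory.
    if small_table_fits_in_memory:
        return "DMJ" + suffix

    # Otherwise compare Reversed Map Join with Default Reduce-side Join.
    if rmj_cost < drj_cost:
        return "RMJ" + suffix
    return "DRJ"
```

For example, `choose_join_plan(False, True, rmj_cost=2.0, drj_cost=3.0)` selects the reversed map join with distributed cache.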


Summary

  • Distributed cache is a double-edged sword

  • When distributed cache is used properly, Default Map Join performs best

  • The three new map join approaches extend the usability of default map join

Future Work

  • SPJA workflow (selection, projection, join, aggregation)

  • Better optimizer

  • Multi-way join

  • Build into a hybrid system

  • Need a dedicated (slower) cluster…