MapReduce vs Parallel DBMS: A Comparative Analysis for Data Processing Paradigms

MapReduce VS Parallel DBMSs Presenter: Ran Ding

Outline • 1. Introduction • 2. Where the MR wins • 3. DBMS “sweet spot” tests • 4. Why the Parallel DBMS wins • 5. Conclusion

Introduction: MR • The MapReduce (MR) paradigm has been hailed as a revolutionary new platform for large-scale, massively parallel data access. • Hadoop is a well-known open-source implementation.

Introduction: Parallel DBMS • Parallel DBMSs appeared in the mid-1980s. • The Teradata and Gamma projects pioneered a new architectural paradigm based on a cluster of commodity computers.

Introduction: Horizontal partitioning • The rows of a relational table are distributed across the nodes of the cluster so that they can be processed in parallel.

Introduction: DBMS • One benefit is that the system automatically manages the alternative partitioning strategies for the tables involved in a query. • Examples include hash, range, and round-robin partitioning.
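The three partitioning strategies named above can be sketched in a few lines. This is only an illustration, not how any particular DBMS implements them: the function names, the `rows` sample data, and the use of CRC32 as the hash are all assumptions made for the example.

```python
import zlib
from bisect import bisect_right


def hash_partition(rows, key, n_nodes):
    """Place each row on the node chosen by hashing its partitioning key."""
    nodes = [[] for _ in range(n_nodes)]
    for row in rows:
        # crc32 (rather than hash()) keeps the placement stable across runs.
        h = zlib.crc32(str(row[key]).encode("utf-8"))
        nodes[h % n_nodes].append(row)
    return nodes


def range_partition(rows, key, boundaries):
    """Place each row by key range; boundaries is a sorted list of cut points.

    A key equal to a cut point goes to the higher-numbered node.
    """
    nodes = [[] for _ in range(len(boundaries) + 1)]
    for row in rows:
        nodes[bisect_right(boundaries, row[key])].append(row)
    return nodes


def round_robin_partition(rows, n_nodes):
    """Deal rows out to the nodes in turn, ignoring their contents."""
    nodes = [[] for _ in range(n_nodes)]
    for i, row in enumerate(rows):
        nodes[i % n_nodes].append(row)
    return nodes


# Hypothetical sample table: 8 rows with an id and an amount.
rows = [{"id": i, "amount": 10 * i} for i in range(8)]

by_hash = hash_partition(rows, "id", 3)
by_range = range_partition(rows, "amount", [20, 50])
by_turn = round_robin_partition(rows, 3)
```

Hash and range partitioning let the system send a lookup on the partitioning key to a single node, while round-robin only balances row counts. In a parallel DBMS the choice is made by the system, not written by hand as it is here.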

Introduction: Mapping parallel DBMS onto MapReduce • It is not easy. • UDFs (user-defined functions) help: the Map step can be expressed as a UDF. • The Reduce step is like GROUP BY in SQL.
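To make the correspondence concrete, here is a minimal single-process sketch of an MR job that computes what `SELECT dept, SUM(salary) FROM employees GROUP BY dept` would compute. The `employees` data, the function names, and the in-memory shuffle are assumptions for illustration; a real MR framework distributes these steps across nodes.

```python
from collections import defaultdict


def map_fn(record):
    """Map step: a UDF-like function that emits (key, value) pairs per record."""
    yield record["dept"], record["salary"]


def reduce_fn(key, values):
    """Reduce step: aggregate all values that share a key, like SUM ... GROUP BY."""
    return key, sum(values)


def run_mapreduce(records, map_fn, reduce_fn):
    """Run map, group the emitted pairs by key (the shuffle), then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return dict(reduce_fn(k, v) for k, v in sorted(groups.items()))


# Hypothetical input table.
employees = [
    {"name": "ann", "dept": "eng", "salary": 100},
    {"name": "bob", "dept": "eng", "salary": 120},
    {"name": "cat", "dept": "ops", "salary": 90},
]

totals = run_mapreduce(employees, map_fn, reduce_fn)
```

The grouping that SQL expresses in one clause is spread over the map key choice, the shuffle, and the reduce function, which is part of why translating queries by hand takes effort.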

Where the MR wins • 1. ETL and “read once” data sets • 2. Complex analytics • 3. Semi-structured data • 4. Quick-and-dirty analyses • 5. Limited-budget operations

ETL and “read once” data sets • ETL stands for extract-transform-load. • An MR system can be considered a general-purpose parallel ETL system. • On the DBMS side, ETL is typically performed by separate ETL products rather than by the DBMS itself.

Complex analytics • Some analyses cannot be structured as a single SQL aggregate query. • MR is a good candidate for such tasks.

Semi-structured data • MR systems are a good fit when the data is being prepared for loading into a back-end system. • A DBMS needs wide tables with many attributes to hold such data. • In addition, MR-style systems can easily store and process it, since no schema is required.
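The "wide table" point can be illustrated with a small sketch. The click-log style `records` below are invented for the example; the code flattens key-value records whose attributes vary into one relational-style table and counts how many cells end up empty.

```python
# Hypothetical semi-structured records: each one carries a different set of keys.
records = [
    {"user": "a", "url": "/home"},
    {"user": "b", "url": "/cart", "item": "book", "qty": 2},
    {"user": "c", "referrer": "search"},
]


def to_wide_rows(records):
    """Flatten key-value records into one wide table; missing attributes become None."""
    columns = sorted({key for record in records for key in record})
    rows = [tuple(record.get(col) for col in columns) for record in records]
    return columns, rows


columns, rows = to_wide_rows(records)

# Count the NULL-like cells the wide table needs to accommodate every record type.
null_count = sum(value is None for row in rows for value in row)
```

With these three records the table needs five columns and nearly half of its cells are empty, whereas an MR-style system can keep each record as it arrived.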

Quick-and-dirty analyses • A DBMS requires the programmer to define a schema and then load the data. • With MR, the data is simply copied in.

Limited-budget operations • MR systems are mostly open source and free. • Parallel DBMSs carry a high cost.

DBMS “Sweet Spot” Test

Why the Parallel DBMS wins • 1. Repetitive record parsing • 2. Compression • 3. Pipelining • 4. Scheduling • 5. Column-oriented storage

Repetitive record parsing • In an MR system, each Map and Reduce task must repeatedly parse string fields and convert them into the appropriate types. • A DBMS parses records once, when the data is initially loaded.
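The cost difference can be seen by counting parse operations. In this sketch the record layout, the `stats` counter, and the function names are assumptions for illustration; the MR-style function re-parses the raw text on every run, while the DBMS-style path parses once at load time and then queries typed rows.

```python
from datetime import date

# Hypothetical raw records: id, date, amount as comma-separated text.
RAW_LINES = ["1,2009-01-15,19.99", "2,2009-02-01,5.00", "3,2009-02-01,12.50"]

stats = {"parse_calls": 0}


def parse(line):
    """Convert the string fields of one record into typed values."""
    stats["parse_calls"] += 1
    id_, day, amount = line.split(",")
    return int(id_), date.fromisoformat(day), float(amount)


def mr_style_total(lines):
    """Every job starts from raw text, so every job parses every record."""
    return sum(parse(line)[2] for line in lines)


def load(lines):
    """DBMS-style load: parse each record once and keep the typed rows."""
    return [parse(line) for line in lines]


def dbms_style_total(table):
    """Queries run over already-typed rows; no parsing happens here."""
    return sum(row[2] for row in table)


# Two MR-style jobs over the same input.
mr_totals = [mr_style_total(RAW_LINES) for _ in range(2)]
mr_parse_calls = stats["parse_calls"]

# One load followed by two DBMS-style queries.
stats["parse_calls"] = 0
table = load(RAW_LINES)
db_totals = [dbms_style_total(table) for _ in range(2)]
db_parse_calls = stats["parse_calls"]
```

The MR-style parse count grows with the number of jobs, while the load-once count stays fixed however many queries follow.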

Compression • Its effect is hard to pin down. • Commercial DBMSs may use carefully tuned compression algorithms.

Pipelining • In a parallel DBMS, data is streamed from producer to consumer. • The intermediate data is never written to disk. • In an MR system, the producer writes its results to local data structures, and consumers then read from them.
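A rough way to picture the two styles, assuming a toy producer and a summing consumer (both invented here): the first function streams rows straight to the consumer, the second materializes the intermediate result in a local file before the consumer reads it. Python generators are pull-driven, so this only illustrates the absence of materialization, not the push-based transport a parallel DBMS uses.

```python
import json
import os
import tempfile


def produce(n):
    """Toy producer operator: yields one row at a time."""
    for i in range(n):
        yield {"id": i, "value": i * i}


def pipelined_sum(n):
    """Rows flow directly from producer to consumer; nothing touches disk."""
    return sum(row["value"] for row in produce(n))


def materialized_sum(n):
    """The producer first writes all rows to a local file; the consumer reads it after."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "intermediate.jsonl")
        with open(path, "w", encoding="utf-8") as out:
            for row in produce(n):
                out.write(json.dumps(row) + "\n")
        with open(path, encoding="utf-8") as src:
            return sum(json.loads(line)["value"] for line in src)


streamed = pipelined_sum(5)
staged = materialized_sum(5)
```

Both paths give the same answer; the staged one pays for an extra write and read of the intermediate data, though writing it out also gives a point to restart from if a consumer fails.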

Scheduling • In a parallel DBMS, every node knows what it should do. • In an MR system, work is scheduled on processing nodes one storage block at a time.
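The contrast can be sketched by counting scheduling decisions. Everything here is a simplification made for illustration: the function names are invented, and the "next idle node" is approximated by taking nodes in turn.

```python
from collections import deque


def compile_time_plan(n_blocks, n_nodes):
    """Fix the block-to-node assignment up front: one planning step in total."""
    plan = {node: list(range(node, n_blocks, n_nodes)) for node in range(n_nodes)}
    return plan, 1


def runtime_schedule(n_blocks, n_nodes):
    """Hand out one block at a time as nodes ask for work: one decision per block."""
    pending = deque(range(n_blocks))
    assignment = {node: [] for node in range(n_nodes)}
    decisions = 0
    node = 0
    while pending:
        assignment[node].append(pending.popleft())
        decisions += 1
        node = (node + 1) % n_nodes  # stand-in for "whichever node is idle next"
    return assignment, decisions


plan, plan_decisions = compile_time_plan(10, 3)
assignment, runtime_decisions = runtime_schedule(10, 3)
```

With 10 blocks the runtime scheduler makes 10 decisions against a single planning step, and the gap widens with input size. Per-block scheduling does let the system adapt when nodes run at different speeds, which this sketch does not model.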

Column-oriented storage • Vertica is a column store. • It reads only the attributes needed to answer the user query. • DBMS-X and Hadoop are both row stores.
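A small sketch of why this matters for a query that needs a single attribute. The `sales` table and the "values touched" counter are assumptions for the example, and real column stores such as Vertica work on disk pages and compressed columns rather than Python lists.

```python
# Hypothetical table stored row by row.
sales = [
    {"id": 1, "name": "a", "region": "east", "amount": 10.0},
    {"id": 2, "name": "b", "region": "west", "amount": 20.0},
    {"id": 3, "name": "c", "region": "east", "amount": 5.5},
]


def to_columns(rows):
    """Re-lay the same table column by column: one list per attribute."""
    return {name: [row[name] for row in rows] for name in rows[0]}


def row_store_sum(rows, attr):
    """A row store reads whole rows, so every attribute of every row is touched."""
    total, touched = 0.0, 0
    for row in rows:
        touched += len(row)
        total += row[attr]
    return total, touched


def column_store_sum(columns, attr):
    """A column store reads only the one column the query asks for."""
    column = columns[attr]
    return sum(column), len(column)


row_total, row_touched = row_store_sum(sales, "amount")
col_total, col_touched = column_store_sum(to_columns(sales), "amount")
```

Both layouts return the same sum, but the row layout touches every attribute of every row while the column layout touches only the values in `amount`.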

What should MR learn from Parallel DBMS • MR advocates should learn from parallel DBMSs the technologies and techniques for efficient parallel query execution.

Conclusion • MR systems are powerful tools for ETL-style applications and for complex analytics. • If the application is query-intensive, whether semi-structured or rigidly structured, then a DBMS is probably the better choice.

Thank you~~Questions?
