MapReduce VS Parallel DBMSs

Presenter: Ran Ding


Guideline

  • 1. Introduction

  • 2. Where the MR wins

  • 3. DBMS “sweet spot” tests

  • 4. Why the Parallel DBMS wins

  • 5. Conclusion


Introduction: MR

  • The MapReduce (MR) paradigm has been hailed as a revolutionary new platform for large-scale, massively parallel data access.

  • Example: Hadoop, an open source implementation of MR.


Introduction: Parallel DBMS

  • Parallel DBMSs first appeared in the mid-1980s, when the Teradata and Gamma projects pioneered a new architectural paradigm based on a cluster of commodity computers.


Introduction: Horizontal partitioning

  • Horizontal partitioning distributes the rows of a relational table across the nodes of the cluster so they can be processed in parallel.


Introduction: DBMS

  • One benefit is that the system automatically manages the various alternative partitioning strategies for the tables involved in the query.

  • Examples: hash, range, and round-robin partitioning.
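As a rough illustration of the three strategies named above, the sketch below splits a small in-memory table across nodes. The function names, the row layout, and the use of crc32 as the hash are assumptions made for the example; a real parallel DBMS chooses and applies the strategy internally.

```python
from bisect import bisect_right
from zlib import crc32


def hash_partition(rows, key, n_nodes):
    """Place each row on the node chosen by hashing its partitioning key."""
    nodes = [[] for _ in range(n_nodes)]
    for row in rows:
        # crc32 keeps the assignment stable from run to run.
        node = crc32(str(row[key]).encode()) % n_nodes
        nodes[node].append(row)
    return nodes


def range_partition(rows, key, boundaries):
    """Place each row on the node whose key range contains it.

    `boundaries` must be sorted; node i holds keys below boundaries[i]
    and at or above boundaries[i - 1].
    """
    nodes = [[] for _ in range(len(boundaries) + 1)]
    for row in rows:
        nodes[bisect_right(boundaries, row[key])].append(row)
    return nodes


def round_robin_partition(rows, n_nodes):
    """Deal rows out to the nodes in turn, ignoring their contents."""
    nodes = [[] for _ in range(n_nodes)]
    for i, row in enumerate(rows):
        nodes[i % n_nodes].append(row)
    return nodes


# Hypothetical table: six rows with an id and an amount.
rows = [{"id": i, "amount": 10 * i} for i in range(6)]

by_hash = hash_partition(rows, "id", 3)
by_range = range_partition(rows, "amount", [20, 40])
by_turn = round_robin_partition(rows, 3)
```

Hash and range partitioning let the system send a lookup on the key to a single node, while round-robin only balances the load.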


Introduction: Mapping parallel DBMS onto MapReduce

  • It is not easy.

  • UDFs (user-defined functions) help.

  • The reshuffle between Map and Reduce is comparable to GROUP BY in SQL.
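To make the GROUP BY comparison concrete, here is a minimal single-process sketch of a map, shuffle, and reduce pipeline. It is not Hadoop code; the function names and the employee records are assumptions chosen for the example.

```python
from collections import defaultdict

# Rough SQL equivalent of the job below:
#   SELECT dept, SUM(salary) FROM employees GROUP BY dept;


def map_fn(record):
    """Map: emit one (key, value) pair per input record."""
    yield record["dept"], record["salary"]


def shuffle(pairs):
    """Shuffle: collect all values that share a key (the GROUP BY step)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_fn(key, values):
    """Reduce: aggregate the values of one group."""
    return key, sum(values)


def run_job(records):
    pairs = (pair for record in records for pair in map_fn(record))
    return dict(reduce_fn(key, values) for key, values in shuffle(pairs).items())


# Hypothetical input table.
employees = [
    {"dept": "eng", "salary": 100},
    {"dept": "ops", "salary": 90},
    {"dept": "eng", "salary": 150},
]

totals = run_job(employees)
```

In a real MR system the shuffle moves data between machines; here it is just a dictionary, which is enough to show where the grouping happens.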


Where the MR wins

  • 1. ETL and “read once” data sets

  • 2. Complex analytics

  • 3. Semi-structured data

  • 4. Quick-and-dirty analyses

  • 5. Limited-budget operations


ETL and “read once” data sets

  • ETL stands for extract-transform-load.

  • An MR system can be considered a general-purpose parallel ETL system.

  • On the DBMS side, ETL is usually done by separate products that load the data into the database.


Complex analytics

  • Some analytics need multiple passes over the data and cannot be structured as single SQL aggregate queries.

  • MR is a good candidate for such applications.


Semi-structured data

  • MR systems are a good fit when the data is being prepared for loading into a back-end system.

  • A DBMS requires wide tables with many attributes to hold such data.

  • Plus, MR-style systems easily store and process semi-structured data, because they do not require a schema.
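The wide-table point can be seen in a small sketch. The log records and the helper name below are assumptions made for illustration: records with varying attributes are kept as-is on the MR side, while a relational layout needs one column per attribute that appears anywhere and fills the gaps with NULLs (None in Python).

```python
# Hypothetical semi-structured log records: attributes vary per record.
records = [
    {"url": "/home", "user": "a"},
    {"url": "/buy", "user": "b", "item": "book", "price": 12},
]


def to_wide_rows(records):
    """Flatten key-value records into one wide table.

    The table gets one column for every attribute seen in any record;
    attributes missing from a record become None (NULL).
    """
    columns = sorted({key for record in records for key in record})
    rows = [tuple(record.get(column) for column in columns) for record in records]
    return columns, rows


columns, rows = to_wide_rows(records)
null_cells = sum(value is None for row in rows for value in row)
```

With more record types the number of mostly empty columns grows, which is the cost the slide attributes to the DBMS approach.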


Quick-and-dirty analyses

  • A DBMS needs the programmer to write the schema and then load the data.

  • With MR, the programmer just copies the data in.


Limited-budget operations

  • MR systems are mostly open source and free.

  • Parallel DBMSs come at a huge cost.


DBMS “Sweet Spot” Test


Why the Parallel DBMS wins

  • 1. Repetitive record parsing

  • 2. Compression

  • 3. Pipelining

  • 4. Scheduling

  • 5. Column-oriented storage


Repetitive record parsing

  • In an MR system, each Map and Reduce task must repeatedly parse string fields and convert them into the appropriate type.

  • A DBMS parses records once, when the data is initially loaded.
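The difference can be sketched by counting parse calls. Everything here (the CSV layout, the two queries, the counter) is an assumption for illustration, not a measurement of any real system.

```python
# Hypothetical raw text records: id, date, amount.
raw_lines = [
    "1,2009-01-05,10.0",
    "2,2009-01-06,20.5",
    "3,2009-01-07,5.25",
]

stats = {"parses": 0}


def parse(line):
    """Convert one text record into typed fields, counting each call."""
    stats["parses"] += 1
    record_id, date, amount = line.split(",")
    return int(record_id), date, float(amount)


# MR style: the data stays as text, so every job parses every record again.
mr_total = sum(parse(line)[2] for line in raw_lines)
mr_max = max(parse(line)[2] for line in raw_lines)
mr_parses = stats["parses"]

# DBMS style: parse once at load time, then query the typed rows.
stats["parses"] = 0
table = [parse(line) for line in raw_lines]
db_total = sum(row[2] for row in table)
db_max = max(row[2] for row in table)
dbms_parses = stats["parses"]
```

Both styles return the same answers; the MR-style path pays the parsing cost once per query, the load-once path pays it once in total.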


Compression

  • It is hard to say exactly why compression benefits the DBMSs more.

  • One possible reason: commercial DBMSs may use carefully tuned compression algorithms.


Pipelining

  • In a parallel DBMS, data is streamed from producer to consumer.

  • The intermediate data is never written to disk.

  • In an MR system, the producer writes its result to a local data structure, and consumers read from it.
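A small sketch of the two styles, under stated assumptions: Python generators stand in for the DBMS operators, and a temporary JSON-lines file stands in for the local data structure an MR producer writes to. Neither reflects the internals of a specific system.

```python
import json
import os
import tempfile


def scan(rows):
    """Producer: hand rows to the next operator one at a time."""
    for row in rows:
        yield row


def keep_even(rows):
    """Filter operator: forward only the rows with an even value."""
    for row in rows:
        if row["v"] % 2 == 0:
            yield row


def pipelined_sum(rows):
    """DBMS style: each row streams through the operators, nothing is stored."""
    return sum(row["v"] for row in keep_even(scan(rows)))


def materialized_sum(rows):
    """MR style: the producer writes its whole output, then the consumer reads it."""
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "intermediate.jsonl")
        with open(path, "w") as out:
            for row in keep_even(scan(rows)):
                out.write(json.dumps(row) + "\n")
        with open(path) as src:
            return sum(json.loads(line)["v"] for line in src)


# Hypothetical input: ten rows with values 0 through 9.
rows = [{"v": i} for i in range(10)]
```

The results match; the materialized version adds a full write and a full read of the intermediate data, which is the extra cost the slide points at.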


Scheduling

  • In a parallel DBMS, every node knows what it should do.

  • In an MR system, each job is scheduled on processing nodes one storage block at a time.


Column-oriented storage

  • Vertica is a column store.

  • It reads only the attributes necessary for solving the user query.

  • DBMS-X and Hadoop are both row stores.
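The effect on I/O can be sketched with two in-memory layouts of the same table. The table, the column names, and the "values read" counter are assumptions for the example; real systems read pages from disk rather than Python lists.

```python
# The same hypothetical table in two layouts.
row_store = [
    (1, "a", 10.0),
    (2, "b", 20.0),
    (3, "c", 30.0),
]
column_store = {
    "id": [1, 2, 3],
    "name": ["a", "b", "c"],
    "amount": [10.0, 20.0, 30.0],
}


def row_store_sum(rows, index):
    """Sum one attribute; every row is read in full to reach it."""
    values_read = 0
    total = 0.0
    for row in rows:
        values_read += len(row)
        total += row[index]
    return total, values_read


def column_store_sum(columns, name):
    """Sum one attribute; only that attribute's column is read."""
    column = columns[name]
    return sum(column), len(column)


row_result = row_store_sum(row_store, 2)
column_result = column_store_sum(column_store, "amount")
```

Both layouts give the same sum, but the row layout touches every attribute of every row, while the column layout touches only the one the query needs.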


What should MR learn from Parallel DBMS

  • MR advocates should learn from parallel DBMSs the technologies and techniques for efficient parallel query execution.


Conclusion

  • MR systems are powerful tools for ETL-style applications and for complex analytics. If the application is query-intensive, whether semi-structured or rigidly structured, then a DBMS is probably the better choice.


Thank you. Questions?

