MapReduce VS Parallel DBMSs

Presenter: Ran Ding


Guideline

  • 1. Introduction

  • 2. Where the MR wins

  • 3. DBMS “sweet spot” tests

  • 4. Why the Parallel DBMS wins

  • 5. Conclusion


Introduction: MR

  • The MapReduce (MR) paradigm has been hailed as a revolutionary new platform for large-scale, massively parallel data access.

  • Example: Hadoop, an open-source implementation of MR.


Introduction: Parallel DBMS

  • Parallel DBMSs appeared in the mid-1980s, when the Teradata and Gamma projects pioneered a new architectural paradigm based on a cluster of commodity computers.


Introduction: Horizontal partitioning

  • Distributing the rows of a relational table across the nodes of the cluster so they can be processed in parallel.


Introduction: DBMS

  • One benefit is that the system automatically manages the various alternative partitioning strategies for the tables involved in a query.

  • Examples: hash, range, and round-robin partitioning.
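The three strategies named above can be sketched in a few lines. This is only an illustration, not how any particular DBMS implements partitioning: the function names, the use of `zlib.crc32` as the hash, and the sample `rows` are all assumptions made for the example.

```python
import zlib
from bisect import bisect_right


def hash_partition(rows, key, n_nodes):
    """Place each row on the node chosen by a hash of its partitioning key."""
    nodes = [[] for _ in range(n_nodes)]
    for row in rows:
        # crc32 is used here because it is stable across runs (assumed hash choice).
        h = zlib.crc32(str(row[key]).encode("utf-8"))
        nodes[h % n_nodes].append(row)
    return nodes


def range_partition(rows, key, boundaries):
    """Place each row on the node whose key range contains it.

    `boundaries` is a sorted list of split points; n split points give n + 1 nodes.
    """
    nodes = [[] for _ in range(len(boundaries) + 1)]
    for row in rows:
        nodes[bisect_right(boundaries, row[key])].append(row)
    return nodes


def round_robin_partition(rows, n_nodes):
    """Deal rows out to the nodes in turn, ignoring their contents."""
    nodes = [[] for _ in range(n_nodes)]
    for i, row in enumerate(rows):
        nodes[i % n_nodes].append(row)
    return nodes


# Hypothetical table: eight rows with an id and an amount.
rows = [{"id": i, "amount": 10 * i} for i in range(8)]
by_hash = hash_partition(rows, "id", 3)
by_range = range_partition(rows, "amount", [20, 50])
by_turn = round_robin_partition(rows, 3)
```

Hash and range partitioning let the system find the node for a given key without scanning every node, while round-robin only balances the load.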


Introduction: Mapping parallel DBMS onto MapReduce

  • It is not easy!

  • UDFs (user-defined functions) help.

  • The grouping step between Map and Reduce is like GROUP BY in SQL.
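A toy, single-process sketch of the correspondence: the Map function plays the role of a UDF, the shuffle groups values by key, and the Reduce function acts as the aggregate. The names `map_fn`, `reduce_fn`, and `run_mapreduce`, and the sample records, are hypothetical. The job computes roughly what `SELECT dept, SUM(salary) FROM emp GROUP BY dept` would.

```python
from collections import defaultdict


def map_fn(record):
    """Map UDF: emit (key, value) pairs for one input record."""
    yield record["dept"], record["salary"]


def reduce_fn(key, values):
    """Reduce UDF: aggregate all values that share a key."""
    return key, sum(values)


def run_mapreduce(records, map_fn, reduce_fn):
    """Run map, shuffle (group by key), and reduce in one process."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)  # shuffle: same key, same group
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))


# Hypothetical input table.
emp = [
    {"dept": "eng", "salary": 100},
    {"dept": "ops", "salary": 50},
    {"dept": "eng", "salary": 200},
]
totals = run_mapreduce(emp, map_fn, reduce_fn)
```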


Where the MR wins

  • 1. ETL and “read once” data sets

  • 2. Complex analytics

  • 3. Semi-structured data

  • 4. Quick-and-dirty analyses

  • 5. Limited-budget operations


ETL and “read once” data sets

  • ETL: extract-transform-load

  • An MR system can be considered a general-purpose parallel ETL system.

  • DBMSs may also perform ETL.
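A minimal sketch of the extract-transform-load shape, assuming a made-up log format of `user url milliseconds` per line. The three function names and the in-memory `table` that stands in for the back-end system are illustrative only.

```python
def extract(lines):
    """Extract: read raw lines, skipping empty ones."""
    for line in lines:
        line = line.strip()
        if line:
            yield line


def transform(raw_lines):
    """Transform: parse, clean, and drop malformed records."""
    for line in raw_lines:
        parts = line.split()
        if len(parts) != 3 or not parts[2].isdigit():
            continue  # malformed record, not kept
        user, url, ms = parts
        yield {"user": user.lower(), "url": url, "ms": int(ms)}


def load(records, table):
    """Load: append cleaned records to the target (a list standing in for a DBMS)."""
    for record in records:
        table.append(record)
    return len(table)


# Hypothetical raw log with one blank and two malformed lines.
raw_log = ["Alice /home 120", "", "bad line", "BOB /cart 95\n", "carol /x abc"]
table = []
loaded = load(transform(extract(raw_log)), table)
```

The data is read once, cooked, and handed on, which is the "read once" pattern this slide refers to.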


Complex analytics

  • Some analyses cannot be structured as a single SQL aggregate query.

  • MR is a good candidate for such workloads.


Semi-structured data

  • MR systems are good at processing data that is being prepared for loading into a back-end system.

  • A DBMS requires wide tables with many attributes to hold such data.

  • Plus, MR-style systems can easily store and process semi-structured data.
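To make the contrast concrete, the sketch below assumes semi-structured records written as `key=value` pairs whose attributes vary from line to line. `parse_record` handles each record as it comes, the way an MR job could, while `to_wide_rows` shows the wide, mostly empty table a relational schema would need. Both helper names and the sample lines are hypothetical.

```python
def parse_record(line):
    """Parse one 'k=v k=v ...' line into a dict; no schema is needed."""
    return dict(pair.split("=", 1) for pair in line.split())


def to_wide_rows(records):
    """Fit the records into one wide table: a column for every attribute seen."""
    columns = sorted({key for record in records for key in record})
    rows = [[record.get(col) for col in columns] for record in records]
    return columns, rows


# Hypothetical log lines with different attributes per record.
lines = ["user=alice url=/home", "user=bob referrer=ads ms=95"]
records = [parse_record(line) for line in lines]
columns, wide_rows = to_wide_rows(records)
```

Every attribute that appears in any record becomes a column, so most cells in the wide table end up empty (`None` here).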


Quick-and-dirty analyses

  • A DBMS needs the programmer to define the schema and then load the data.

  • With MR, just copy the data in!


Limited-budget operations

  • MR systems are mostly open source and free

  • Parallel DBMS: huge cost



Why the Parallel DBMS wins

  • 1. Repetitive record parsing

  • 2. Compression

  • 3. Pipelining

  • 4. Scheduling

  • 5. Column-oriented storage


Repetitive record parsing

  • In MR, each Map and Reduce task must repeatedly parse and convert string fields into the appropriate type.

  • A DBMS parses records once, when the data is initially loaded.
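The difference can be counted directly. In this sketch, which uses hypothetical names and a made-up `name,amount` text format, the MR-style job parses every line on every run, while the DBMS-style path parses once at load time and then queries typed tuples.

```python
calls = {"parse": 0}  # counts how often a text record is parsed


def parse(line):
    """Convert one 'name,amount' text record into typed fields."""
    calls["parse"] += 1
    name, amount = line.split(",")
    return name, int(amount)


def mr_style_total(lines):
    """MR style: the job re-parses the raw text every time it runs."""
    return sum(parse(line)[1] for line in lines)


def load_once(lines):
    """DBMS style: parse at load time and store typed tuples."""
    return [parse(line) for line in lines]


def dbms_style_total(table):
    """Queries read the already-typed values; no parsing happens here."""
    return sum(amount for _, amount in table)


raw = ["a,10", "b,20", "c,30"]

for _ in range(3):
    mr_style_total(raw)
mr_parses = calls["parse"]  # 3 runs x 3 records

calls["parse"] = 0
table = load_once(raw)
for _ in range(3):
    dbms_style_total(table)
dbms_parses = calls["parse"]  # parsed once, at load
```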


Compression

  • The effect is hard to pin down.

  • Commercial DBMSs may use carefully tuned compression algorithms


Pipelining

  • In a parallel DBMS, data is streamed from producer to consumer.

  • The intermediate data is never written to disk.

  • In an MR system, the producer writes its results to local data structures, and consumers then read from them.
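A rough single-machine analogy, not a model of any real engine: Python generators stand in for DBMS-style streaming between operators, and a temporary JSON-lines file stands in for the MR-style intermediate result. All function names and the sample rows are assumptions.

```python
import json
import os
import tempfile


def scan(rows):
    """Producer: emit rows one at a time."""
    for row in rows:
        yield row


def select_large(rows, threshold):
    """Filter operator: pass on rows whose amount reaches the threshold."""
    for row in rows:
        if row["amount"] >= threshold:
            yield row


def pipelined_total(rows, threshold):
    """DBMS style: rows stream from producer to consumer, nothing is stored."""
    return sum(r["amount"] for r in select_large(scan(rows), threshold))


def materialized_total(rows, threshold):
    """MR style: the producer writes its output to a file, then the consumer reads it."""
    fd, path = tempfile.mkstemp(suffix=".jsonl")
    try:
        with os.fdopen(fd, "w") as out:
            for row in select_large(scan(rows), threshold):
                out.write(json.dumps(row) + "\n")
        with open(path) as inp:
            return sum(json.loads(line)["amount"] for line in inp)
    finally:
        os.remove(path)


rows = [{"amount": 5}, {"amount": 15}, {"amount": 25}]
```

Both functions return the same total; the second pays for an extra write and read of the intermediate data, which is the cost this slide points at.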


Scheduling

  • In a parallel DBMS, every node knows what it should do, according to the query plan.

  • In an MR system, work is scheduled on processing nodes one storage block at a time.


Column-oriented storage

  • Vertica is a column store.

  • It reads only the attributes necessary for solving the user query.

  • DBMS-X and Hadoop are both row stores.
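The benefit can be illustrated by counting how many values each layout has to touch to sum one attribute. This is a toy in-memory sketch with hypothetical names and data; it says nothing about how Vertica, DBMS-X, or Hadoop actually store data.

```python
def to_columns(rows):
    """Re-lay a list of row dicts as one list per attribute."""
    columns = {}
    for row in rows:
        for key, value in row.items():
            columns.setdefault(key, []).append(value)
    return columns


def sum_row_store(rows, attr):
    """Row store: every whole row is read to get at one attribute."""
    total, values_read = 0, 0
    for row in rows:
        values_read += len(row)  # the full row comes off disk
        total += row[attr]
    return total, values_read


def sum_column_store(columns, attr):
    """Column store: only the requested attribute's column is read."""
    column = columns[attr]
    return sum(column), len(column)


# Hypothetical table with three attributes.
rows = [
    {"id": 1, "name": "a", "amount": 10},
    {"id": 2, "name": "b", "amount": 20},
    {"id": 3, "name": "c", "amount": 30},
]
columns = to_columns(rows)
```

With three attributes per row, the row layout reads three times as many values as the column layout for the same answer.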


What should MR learn from Parallel DBMS

  • MR advocates should learn from parallel DBMSs the technologies and techniques for efficient parallel query execution.


Conclusion

  • MR systems are powerful tools for ETL-style applications and for complex analytics. If the application is query-intensive, whether semi-structured or rigidly structured, then a DBMS is probably the better choice.


Thank you! Questions?

