MapReduce VS Parallel DBMSs

Presenter: Ran Ding. Guideline: 1. Introduction 2. Where the MR wins 3. DBMS “sweet spot” tests 4. Why the Parallel DBMS wins 5. Conclusion.


Presentation Transcript


  1. MapReduce VS Parallel DBMSs Presenter: Ran Ding

  2. Guideline • 1. Introduction • 2. Where the MR wins • 3. DBMS “sweet spot” tests • 4. Why the Parallel DBMS wins • 5. Conclusion

  3. Introduction: MR • The MapReduce (MR) paradigm has been hailed as a revolutionary new platform for large-scale, massively parallel data access. • Example: Hadoop

  4. Introduction: Parallel DBMS • Parallel DBMSs appeared in the mid-1980s, when the Teradata and Gamma projects pioneered a new architectural paradigm based on a cluster of commodity computers.

  5. Introduction: Horizontal partitioning • Distributing the rows of a relational table across the nodes of the cluster so they can be processed in parallel.

  6. Introduction: DBMS • One benefit is that the system automatically manages the various alternative partitioning strategies for the tables involved in the query. • Examples: hash, range, and round-robin partitioning.
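
The three partitioning strategies named on this slide can be sketched in a few lines of Python. This is only an illustration: the function names, the sample rows, and the use of `crc32` as the hash function are assumptions, not how Teradata, Gamma, or any specific parallel DBMS does it.

```python
from bisect import bisect_right
from zlib import crc32


def hash_partition(rows, key, n_nodes):
    """Send each row to node crc32(key value) mod n_nodes."""
    nodes = [[] for _ in range(n_nodes)]
    for row in rows:
        nodes[crc32(str(row[key]).encode()) % n_nodes].append(row)
    return nodes


def range_partition(rows, key, boundaries):
    """Send each row to the node whose key range contains it.

    boundaries is a sorted list of split points; node i holds keys in
    [boundaries[i-1], boundaries[i]), with open ends at both extremes.
    """
    nodes = [[] for _ in range(len(boundaries) + 1)]
    for row in rows:
        nodes[bisect_right(boundaries, row[key])].append(row)
    return nodes


def round_robin_partition(rows, n_nodes):
    """Deal rows out to the nodes in turn, ignoring their contents."""
    nodes = [[] for _ in range(n_nodes)]
    for i, row in enumerate(rows):
        nodes[i % n_nodes].append(row)
    return nodes


# Hypothetical sample table with eight rows.
rows = [{"id": i, "amount": 10 * i} for i in range(8)]
by_hash = hash_partition(rows, "id", 3)
by_range = range_partition(rows, "id", [3, 6])
by_turn = round_robin_partition(rows, 3)
```

Hash and range partitioning keep rows with equal or nearby keys on the same node, which can help joins and range scans; round-robin only balances row counts.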

  7. Introduction: Mapping parallel DBMS onto MapReduce • It is not easy! • UDFs (user-defined functions) help. • The grouping of Map output by key before Reduce resembles GROUP BY in SQL.
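
To make the GROUP BY analogy concrete, here is a minimal single-process sketch that computes the same per-key sum twice: once with hand-written map, shuffle, and reduce steps, and once with a SQL `GROUP BY` in SQLite. The `visits` data and all function names are made up for illustration; a real MR system distributes these steps across nodes.

```python
import sqlite3
from collections import defaultdict

# Hypothetical input: (url, seconds spent) pairs.
visits = [("a.com", 3), ("b.com", 5), ("a.com", 4), ("c.com", 1), ("b.com", 2)]


def map_fn(record):
    """Map: emit one (key, value) pair per input record."""
    url, seconds = record
    yield url, seconds


def reduce_fn(key, values):
    """Reduce: aggregate all values that share a key."""
    return key, sum(values)


def run_mapreduce(records):
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)  # shuffle: collect values by key
    return sorted(reduce_fn(k, v) for k, v in groups.items())


def run_sql(records):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE visits (url TEXT, seconds INTEGER)")
    conn.executemany("INSERT INTO visits VALUES (?, ?)", records)
    rows = conn.execute(
        "SELECT url, SUM(seconds) FROM visits GROUP BY url ORDER BY url"
    ).fetchall()
    conn.close()
    return rows


mr_result = run_mapreduce(visits)
sql_result = run_sql(visits)
```

Both paths produce the same list of `(url, total)` pairs; the shuffle loop plays the role that `GROUP BY url` plays in the query.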

  8. Where the MR wins • 1. ETL and “read once” data sets • 2. Complex analytics • 3. Semi-structured data • 4. Quick-and-dirty analyses • 5. Limited-budget operations

  9. ETL and “read once” data sets • ETL: extract-transform-load • An MR system can be considered a general-purpose parallel ETL system. • DBMSs may perform the ETL

  10. Complex analytics • Some analyses cannot be structured as single SQL aggregate queries • MR is a good candidate for them

  11. Semi-structured data • MR systems are good at processing data as it is prepared for loading into a back-end system • A DBMS requires wide tables with many attributes • Plus, MR-style systems can easily store and process semi-structured data

  12. Quick-and-dirty analyses • A DBMS needs the programmer to define the schema and then load the data • With MR, just copy the data in!

  13. Limited-budget operations • MR is basically open source and free • Parallel DBMS: huge cost

  14. DBMS “Sweet Spot” Test

  15. Why the Parallel DBMS wins • 1. Repetitive record parsing • 2. Compression • 3. Pipelining • 4. Scheduling • 5. Column-oriented storage

  16. Repetitive record parsing • In MR, each Map and Reduce task must repeatedly parse and convert string fields into the appropriate type • A DBMS parses records once, when the data is initially loaded.
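
The cost difference can be illustrated by counting parse calls. In this rough sketch (the record layout, the `CountingParser` class, and the two sample "jobs" are all hypothetical), the MR-style path re-parses the raw text for every job, while the DBMS-style path parses once at load time and then works on typed rows.

```python
class CountingParser:
    """Parses 'id,name,amount' text lines and counts how often it is called."""

    def __init__(self):
        self.calls = 0

    def __call__(self, line):
        self.calls += 1
        user_id, name, amount = line.split(",")
        return int(user_id), name, float(amount)


raw_lines = ["1,alice,30.5", "2,bob,12.0", "3,carol,7.25"]

# MR style: every job starts from raw text and parses each record again.
mr_parse = CountingParser()
mr_total = sum(mr_parse(line)[2] for line in raw_lines)    # job 1
mr_largest = max(mr_parse(line)[2] for line in raw_lines)  # job 2

# DBMS style: parse once when loading, then query the typed rows.
db_parse = CountingParser()
table = [db_parse(line) for line in raw_lines]  # load step
db_total = sum(row[2] for row in table)         # query 1
db_largest = max(row[2] for row in table)       # query 2
```

With three records and two jobs, the MR-style path makes six parse calls and the load-once path makes three; the gap grows with every additional query over the same data.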

  17. Compression • It is hard to say… • Commercial DBMSs may use carefully tuned compression algorithms

  18. Pipelining • In a parallel DBMS, data is streamed from producer to consumer • The intermediate data is never written to disk • In an MR system, the producer writes its results to local data structures, and consumers read from them
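
The contrast between streaming and materializing intermediate results can be sketched on a single machine. In the example below (all names are illustrative), a Python generator stands in for the producer: `pipelined_sum` consumes values as they are produced, while `materialized_sum` first writes every value to a temporary file and then reads it back. The generator only models the absence of materialization, not the actual push-based mechanics of a parallel DBMS.

```python
import os
import tempfile


def produce(n):
    """Producer: yields intermediate values one at a time."""
    for i in range(n):
        yield i * i


def pipelined_sum(n):
    """Streaming: the consumer sees each value without it touching disk."""
    return sum(produce(n))


def materialized_sum(n):
    """Materializing: write all intermediate values to a file, then read them."""
    with tempfile.NamedTemporaryFile("w", delete=False, suffix=".txt") as f:
        for value in produce(n):
            f.write(f"{value}\n")
        path = f.name
    try:
        with open(path) as f:
            return sum(int(line) for line in f)
    finally:
        os.remove(path)  # clean up the intermediate file


streamed = pipelined_sum(5)
from_disk = materialized_sum(5)
```

Both functions return the same answer; the materialized version pays for an extra write and read of the intermediate data, though keeping that data on disk is also what lets an MR system restart a failed consumer without rerunning the producer.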

  19. Scheduling • In a parallel DBMS, every node knows what it should do • In an MR system, work is scheduled onto processing nodes one storage block at a time.

  20. Column-oriented storage • Vertica is a column store • It reads only the attributes necessary for solving the user query • DBMS-X and Hadoop are both row stores
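
A toy comparison shows why reading only the needed attributes matters. The table contents and function names below are invented for illustration, and "values touched" is a crude stand-in for I/O; it does not describe how Vertica, DBMS-X, or Hadoop lay out data on disk.

```python
# Hypothetical table with four attributes per row.
rows = [
    {"id": 1, "url": "a.com", "rank": 10, "duration": 3},
    {"id": 2, "url": "b.com", "rank": 20, "duration": 5},
    {"id": 3, "url": "c.com", "rank": 30, "duration": 4},
]

# The same data stored column by column.
columns = {name: [row[name] for row in rows] for name in rows[0]}


def row_store_avg(rows, attr):
    """Row store: each whole row is read even though one attribute is needed."""
    touched = 0
    total = 0
    for row in rows:
        touched += len(row)
        total += row[attr]
    return total / len(rows), touched


def column_store_avg(columns, attr):
    """Column store: only the requested attribute's values are read."""
    values = columns[attr]
    return sum(values) / len(values), len(values)


row_avg, row_touched = row_store_avg(rows, "rank")
col_avg, col_touched = column_store_avg(columns, "rank")
```

For this three-row, four-attribute table the row layout touches twelve values and the column layout touches three, while both return the same average.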

  21. What should MR learn from Parallel DBMS • MR advocates should learn from parallel DBMSs the technologies and techniques for efficient parallel query execution.

  22. Conclusion • MR systems are powerful tools for ETL-style applications and for complex analytics. If the application is query-intensive, whether semi-structured or rigidly structured, then a DBMS is probably the better choice.

  23. Thank you~~Questions?
