
### Comparing MapReduce and Parallel DBMS for Large-Scale Data Processing ###

This presentation compares MapReduce and parallel database management systems (DBMSs) for large-scale data analytics. It introduces the map and reduce functions and illustrates them by counting word occurrences in a large collection of documents. It draws on the work of Jeffrey Dean and Sanjay Ghemawat and on the comparison by Pavlo et al. presented at ACM SIGMOD 2009. It also covers the design properties, performance considerations, and fault tolerance of MapReduce, and lists companies that use Hadoop for data processing.


Presentation Transcript


  1. MapReduce vs. Parallel DBMS. Hamid Safizadeh, Otelia Buffington. CSci 5707, Fall 2013, University of Minnesota

  2. MapReduce Idea
     • Mapping: map (k1, v1) → list (k2, v2)
     • Reducing: reduce (k2, list(v2)) → list (v2)
     Figure: pseudo-code for counting the number of occurrences of each word in a large collection of documents
     Source: Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, OSDI'04
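The map and reduce signatures on slide 2 can be sketched in Python for the word-count task. This is a minimal single-process illustration, not the code from the slide or the paper: the names `map_fn`, `reduce_fn`, and `run_word_count` are hypothetical, and the in-memory grouping step stands in for the shuffle that a real MapReduce framework performs across machines.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_fn(doc_name: str, contents: str) -> Iterator[Tuple[str, int]]:
    """map(k1, v1) -> list(k2, v2): emit (word, 1) for every word in a document."""
    for word in contents.lower().split():
        yield word, 1


def reduce_fn(word: str, counts: Iterable[int]) -> List[int]:
    """reduce(k2, list(v2)) -> list(v2): sum all counts emitted for one word."""
    return [sum(counts)]


def run_word_count(documents: Dict[str, str]) -> Dict[str, int]:
    """Hypothetical single-process driver: map, group by key, then reduce."""
    # Grouping intermediate pairs by key stands in for the shuffle phase.
    grouped: Dict[str, List[int]] = defaultdict(list)
    for name, text in documents.items():
        for word, count in map_fn(name, text):
            grouped[word].append(count)
    # Each key is reduced independently of every other key.
    return {word: reduce_fn(word, counts)[0] for word, counts in grouped.items()}


docs = {"doc1": "the quick brown fox", "doc2": "the lazy dog and the fox"}
word_counts = run_word_count(docs)
```

Because each call to `map_fn` sees only one document and each call to `reduce_fn` sees only one key, both phases can in principle be spread over many machines without coordination between calls.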

  3. MapReduce Example: calculating the number of occurrences of each word (source: http://aimotion.blogspot.com/2010/08/mapreduce-with-mongodb-and-python.html)

  4. MapReduce Architecture: execution overview. Source: Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, OSDI'04

  5. MapReduce or Parallel DBMS
     • Pavlo, A., Paulson, E., Rasin, A., Abadi, D.J., DeWitt, D.J., Madden, S., and Stonebraker, M., “A comparison of approaches to large-scale data analysis”, ACM SIGMOD International Conference, 2009 (http://database.cs.brown.edu/projects/mapreduce-vs-dbms)
     • Dean, J., and Ghemawat, S., “MapReduce: A flexible data processing tool”, Communications of the ACM, Vol. 53, 2010 (DOI: 10.1145/1629175.1629198)

  6. MapReduce Design Properties
     • Heterogeneous Systems
        • Processing and combining data from a wide variety of storage systems (such as relational databases, file systems, etc.)
     • Fault Tolerance
        • Providing fine-grain fault tolerance for large jobs (a failure in the middle of a multi-hour execution does not require restarting the job from scratch)
     • Complex Functions
        • Simple Map and Reduce functions with straightforward SQL equivalents
        • Offering a better framework for some complicated tasks
     Source: Jeffrey Dean and Sanjay Ghemawat, MapReduce: A Flexible Data Processing Tool, Communications of the ACM, Vol. 53, 2010
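The point on slide 6 that simple Map and Reduce functions have straightforward SQL equivalents can be illustrated with a small side-by-side sketch. It assumes a hypothetical one-column table `words` and uses Python's built-in `sqlite3` module as a stand-in for a parallel DBMS; the data and all names are made up for illustration.

```python
import sqlite3
from collections import defaultdict

# Hypothetical input: one row per word occurrence.
rows = [("the",), ("fox",), ("the",), ("dog",), ("the",)]

# SQL version: a GROUP BY with COUNT(*) expresses the whole word count.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (word TEXT)")
conn.executemany("INSERT INTO words (word) VALUES (?)", rows)
sql_counts = dict(conn.execute("SELECT word, COUNT(*) FROM words GROUP BY word"))
conn.close()

# MapReduce-style version: map emits (word, 1), grouping by key plays the
# role of GROUP BY, and reduce sums the values like COUNT(*).
grouped = defaultdict(list)
for (word,) in rows:
    grouped[word].append(1)
mr_counts = {word: sum(ones) for word, ones in grouped.items()}
```

Both versions produce the same mapping from word to count. The second bullet on the slide concerns tasks where the per-record or per-group logic is harder to express as a SQL aggregate, in which case arbitrary code inside the map and reduce functions may be the more convenient fit.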

  7. MapReduce Design Properties
     • Performance
        • Loading data: startup overhead for MapReduce
        • Reading data: full scan over large data files
        • Merging results: a MapReduce job as the next consumer
     • Cost
        • Hardware: network workstations
        • Software: open source (Hadoop)
        • Communication: network system
     Source: Jeffrey Dean and Sanjay Ghemawat, MapReduce: A Flexible Data Processing Tool, Communications of the ACM, Vol. 53, 2010

  8. Companies Using Hadoop • Facebook • Yahoo! • Google • Amazon • Twitter
