CSci 5707, Fall 2013

MapReduce vs. Parallel DBMS
Hamid Safizadeh, Otelia Buffington
CSci 5707, Fall 2013
University of Minnesota

MapReduce Idea
• Mapping: map (k1, v1) → list (k2, v2)
• Reducing: reduce (k2, list(v2)) → list (v2)
[Figure: pseudo-code for counting the number of occurrences of each word in a large collection of documents]
Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, OSDI '04
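The two signatures above can be illustrated with a minimal single-process sketch of the word-count example. It is not the pseudo-code from the cited paper; the names map_fn, reduce_fn, and run_mapreduce and the sample documents are assumptions made for illustration, and the grouping step stands in for the work a real MapReduce library does between the two phases.

```python
from collections import defaultdict


def map_fn(doc_name, text):
    """map(k1, v1) -> list(k2, v2): emit (word, 1) for every word in the document."""
    return [(word, 1) for word in text.lower().split()]


def reduce_fn(word, counts):
    """reduce(k2, list(v2)) -> list(v2): sum all counts emitted for one word."""
    return [sum(counts)]


def run_mapreduce(documents):
    """Hypothetical driver: group intermediate values by key, then reduce each group."""
    grouped = defaultdict(list)
    for name, text in documents.items():
        for key, value in map_fn(name, text):
            grouped[key].append(value)
    return {key: reduce_fn(key, values)[0] for key, values in grouped.items()}


# Made-up sample input: document name -> document contents.
word_counts = run_mapreduce({"d1": "the quick fox", "d2": "the lazy dog the end"})
```

Here word_counts maps each distinct word to its total number of occurrences across both documents.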

MapReduce Example
Calculation of the number of occurrences of each word
http://aimotion.blogspot.com/2010/08/mapreduce-with-mongodb-and-python.html

MapReduce Architecture
[Figure: execution overview]
Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, OSDI '04
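The execution flow can be sketched in one process, assuming the general scheme of the cited paper: the input is divided into splits, each map task buckets its intermediate pairs into R regions by hashing the key, and each reduce task sorts and groups one region. The function names, the CRC32-based partitioner, and the sample splits are assumptions for illustration; a real system runs these tasks on separate workers coordinated by a master.

```python
import zlib
from itertools import groupby


def partition(key, num_reducers):
    """Assign a key to one of R regions (assumed scheme: hash(key) mod R)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def execute(splits, map_fn, reduce_fn, num_reducers):
    # Map phase: one map task per input split; output is bucketed into R regions.
    regions = [[] for _ in range(num_reducers)]
    for split in splits:
        for key, value in map_fn(split):
            regions[partition(key, num_reducers)].append((key, value))

    # Reduce phase: each reduce task sorts its region by key, groups equal
    # keys together, and calls reduce_fn once per distinct key.
    outputs = []
    for region in regions:
        region.sort(key=lambda kv: kv[0])
        out = {}
        for key, group in groupby(region, key=lambda kv: kv[0]):
            out[key] = reduce_fn(key, [value for _, value in group])
        outputs.append(out)
    return outputs  # one output "file" per reduce task


# Made-up input splits for a word count with two reduce tasks.
splits = ["a b a", "b c", "a c c"]
outputs = execute(
    splits,
    map_fn=lambda text: [(word, 1) for word in text.split()],
    reduce_fn=lambda key, values: sum(values),
    num_reducers=2,
)
merged = {key: count for part in outputs for key, count in part.items()}
```

Because the partitioner is a function of the key alone, every occurrence of a word reaches the same reduce task, so each key appears in exactly one of the output partitions.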

MapReduce or Parallel DBMS
• Pavlo, A., Paulson, E., Rasin, A., Abadi, D.J., DeWitt, D.J., Madden, S., and Stonebraker, M., “A comparison of approaches to large-scale data analysis”, ACM SIGMOD International Conference, 2009 (http://database.cs.brown.edu/projects/mapreduce-vs-dbms)
• Dean, J., and Ghemawat, S., “MapReduce: A flexible data processing tool”, Communications of the ACM, Vol. 53, 2010 (DOI: 10.1145/1629175.1629198)

MapReduce Design Properties
• Heterogeneous Systems
  • Processing and combining data from a wide variety of storage systems (such as relational databases, file systems, etc.)
• Fault Tolerance
  • Providing fine-grain fault tolerance for large jobs (a failure in the middle of a multi-hour execution does not require restarting the job from scratch)
• Complex Functions
  • Simple Map and Reduce functions with straightforward SQL equivalents
  • Offering a better framework for some complicated tasks
Jeffrey Dean and Sanjay Ghemawat, MapReduce: A Flexible Data Processing Tool, Communications of the ACM, Vol. 53, 2010
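The fine-grain fault tolerance point can be illustrated with a small sketch in which only the task that failed is re-executed, while the results of completed tasks are kept. This is an assumption-laden toy, not the mechanism of any particular MapReduce implementation: run_with_retries, make_flaky_task, the task names, and the retry limit are all hypothetical.

```python
def run_with_retries(tasks, max_attempts=3):
    """Run each task independently; retry only the task that failed."""
    results = {}
    attempts = {}
    for task_id, task in tasks.items():
        for attempt in range(1, max_attempts + 1):
            attempts[task_id] = attempt
            try:
                results[task_id] = task()
                break  # this task is done; earlier results are untouched
            except RuntimeError:
                continue  # simulated worker failure: re-execute this task only
        else:
            raise RuntimeError(f"task {task_id} failed {max_attempts} times")
    return results, attempts


def make_flaky_task(value, failures):
    """Hypothetical task that fails `failures` times before returning value squared."""
    state = {"remaining": failures}

    def task():
        if state["remaining"] > 0:
            state["remaining"] -= 1
            raise RuntimeError("simulated worker failure")
        return value * value

    return task


tasks = {
    "map-0": make_flaky_task(2, failures=0),
    "map-1": make_flaky_task(3, failures=1),  # fails once, then succeeds
    "map-2": make_flaky_task(4, failures=0),
}
results, attempts = run_with_retries(tasks)
```

In this run only map-1 is executed twice; map-0 and map-2 run once each, which is the sense in which a failure in the middle of a job does not force a restart from scratch.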

MapReduce Design Properties
• Performance
  • Loading data: Startup overhead for MapReduce
  • Reading data: Full scan over large data files
  • Merging results: A MapReduce job as the next consumer
• Cost
  • Hardware: Networked workstations
  • Software: Open source (Hadoop)
  • Communication: Network system
Jeffrey Dean and Sanjay Ghemawat, MapReduce: A Flexible Data Processing Tool, Communications of the ACM, Vol. 53, 2010

Companies Using Hadoop
• Facebook
• Yahoo!
• Google
• Amazon
• Twitter
