1 / 17

Introduction to MapReduce and Hadoop

Introduction to MapReduce and Hadoop. IT 332 Distributed Systems. What is MapReduce ?. Data-parallel programming model for clusters of commodity machines Pioneered by Google Processes 20 PB of data per day Popularized by open-source Hadoop project Used by Yahoo!, Facebook, Amazon, ….

lucine
Download Presentation

Introduction to MapReduce and Hadoop

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Introduction to MapReduce and Hadoop IT 332 Distributed Systems

  2. What is MapReduce? • Data-parallel programming model for clusters of commodity machines • Pioneered by Google • Processes 20 PB of data per day • Popularized by open-source Hadoop project • Used by Yahoo!, Facebook, Amazon, …

  3. What is MapReduce used for? • At Google: • Index building for Google Search • Article clustering for Google News • Statistical machine translation • At Yahoo!: • Index building for Yahoo! Search • Spam detection for Yahoo! Mail • At Facebook: • Data mining • Ad optimization • Spam detection

  4. What is MapReduce used for? • In research: • Analyzing Wikipedia conflicts (PARC) • Natural language processing (CMU) • Bioinformatics (Maryland) • Astronomical image analysis (Washington) • Ocean climate simulation (Washington) • <Your application here>

  5. Outline • MapReduce architecture • Fault tolerance in MapReduce • Sample applications • Getting started with Hadoop • Higher-level languages on top of Hadoop: Pig and Hive

  6. MapReduce Design Goals • Scalability to large data volumes: • Scan 100 TB on 1 node @ 50 MB/s = 23 days • Scan on 1000-node cluster = 33 minutes • Cost-efficiency: • Commodity nodes (cheap, but unreliable) • Commodity network • Automatic fault-tolerance (fewer admins) • Easy to use (fewer programmers)

  7. Typical Hadoop Cluster Aggregation switch Rack switch • 40 nodes/rack, 1000-4000 nodes in cluster • 1 GBps bandwidth in rack, 8 GBps out of rack • Node specs (Yahoo terasort):8 x 2.0 GHz cores, 8 GB RAM, 4 disks (= 4 TB?)

  8. Typical Hadoop Cluster Image from http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/aw-apachecon-eu-2009.pdf

  9. Challenges • Cheap nodes fail, especially if you have many • Mean time between failures for 1 node = 3 years • MTBF for 1000 nodes = 1 day • Solution: Build fault-tolerance into system • Commodity network = low bandwidth • Solution: Push computation to the data • Programming distributed systems is hard • Solution: Data-parallel programming model: users write “map” and “reduce” functions, system handles work distribution and fault tolerance

  10. Hadoop Components • Distributed file system (HDFS) • Single namespace for entire cluster • Replicates data 3x for fault-tolerance • MapReduce implementation • Executes user jobs specified as “map” and “reduce” functions • Manages work distribution & fault-tolerance

  11. Hadoop Distributed File System • Files split into 128MB blocks • Blocks replicated across several datanodes (usually 3) • Single namenode stores metadata (file names, block locations, etc) • Optimized for large files, sequential reads • Files are append-only Namenode File1 1 2 3 4 1 2 1 3 2 1 4 2 4 3 3 4 Datanodes

  12. MapReduce Programming Model • Data type: key-value records • Map function: (Kin, Vin)  list(Kinter, Vinter) • Reduce function: (Kinter, list(Vinter))  list(Kout, Vout)

  13. Example: Word Count defmapper(line): foreach word in line.split(): output(word, 1) defreducer(key, values): output(key, sum(values))

  14. Word Count Execution Input Map Shuffle & Sort Reduce Output the, 1 brown, 1 fox, 1 the quick brown fox brown, 2 fox, 2 how, 1 now, 1 the, 3 Map Reduce the, 1 fox, 1 the, 1 the fox ate the mouse Map quick, 1 how, 1 now, 1 brown, 1 ate, 1 cow, 1 mouse, 1 quick, 1 ate, 1 mouse, 1 Reduce how now brown cow Map cow, 1

  15. MapReduce Execution Details • Single master controls job execution on multiple slaves as well as user scheduling • Mappers preferentially placed on same node or same rack as their input block • Push computation to data, minimize network use • Mappers save outputs to local disk rather than pushing directly to reducers • Allows having more reducers than nodes • Allows recovery if a reducer crashes

  16. An Optimization: The Combiner • A combiner is a local aggregation function for repeated keys produced by same map • For associative ops. like sum, count, max • Decreases size of intermediate data • Example: local counting for Word Count: defcombiner(key, values): output(key, sum(values))

  17. Word Count with Combiner Input Map & Combine Shuffle & Sort Reduce Output the, 1 brown, 1 fox, 1 the quick brown fox brown, 2 fox, 2 how, 1 now, 1 the, 3 Map Reduce the, 2 fox, 1 the fox ate the mouse Map quick, 1 how, 1 now, 1 brown, 1 ate, 1 cow, 1 mouse, 1 quick, 1 ate, 1 mouse, 1 Reduce how now brown cow Map cow, 1

More Related