MapReduce Online

Tyson Condie and Neil Conway

UC Berkeley

Joint work with Peter Alvaro, Rusty Sears, Khaled Elmeleegy (Yahoo! Research), and Joe Hellerstein

MapReduce Programming Model
  • Programmers think in a data-centric fashion
    • Apply transformations to data sets
  • The MR framework handles the Hard Stuff:
    • Fault tolerance
    • Distributed execution, scheduling, concurrency
    • Coordination
    • Network communication
MapReduce System Model
  • Designed for batch-oriented computations over large data sets
    • Each operator runs to completion before producing any output
    • Operator output is written to stable storage
      • Map output to local disk, reduce output to HDFS
  • Simple, elegant fault tolerance model: operator restart
    • Critical for large clusters
Life Beyond Batch Processing
  • Can we apply the MR programming model outside batch processing?
  • Two domains of interest:
    • Interactive data analysis
      • Enabled by high-level MR query languages, e.g. Hive, Pig, Jaql
      • Batch processing is a poor fit
    • Continuous analysis of data streams
      • Batch processing adds massive latency
      • Requires saving and reloading analysis state
MapReduce Online
  • Pipeline data between operators as it is produced
    • Decouple computation schedule (logical) from data transfer schedule (physical)
  • Hadoop Online Prototype (HOP): Hadoop with pipelining support
    • Preserving the Hadoop interfaces and APIs
    • Challenge: retain elegant fault tolerance model
  • Enables approximate answers and stream processing
    • Can also reduce the response times of jobs
Outline
  • Hadoop Background
  • HOP Architecture
  • Online Aggregation
  • Stream Processing with MapReduce
  • Future Work and Conclusion
Hadoop Architecture
  • Hadoop MapReduce
    • Single master node, many worker nodes
    • Client submits a job to master node
    • Master splits each job into tasks (map/reduce), and assigns tasks to worker nodes
  • Hadoop Distributed File System (HDFS)
    • Single name node, many data nodes
    • Files stored as large, fixed-size (e.g. 64MB) blocks
    • HDFS typically holds map input and reduce output
Job Scheduling
  • One map task for each block of the input file
    • Applies user-defined map function to each record in the block
    • Record = <key, value>
  • User-defined number of reduce tasks
    • Each reduce task is assigned a set of record groups
      • Record group = all records with same key
    • For each group, apply user-defined reduce function to the record values in that group
  • Reduce tasks read from every map task
    • Each read returns the record groups for that reduce task
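
The scheduling rules above can be sketched in a few lines of single-process Python. This is only an illustration under stated assumptions, not Hadoop code: `map_fn`, `reduce_fn`, `partition`, and `run_job` are hypothetical names, word counting stands in for the user-defined functions, and a CRC32 hash stands in for the partitioner.

```python
from collections import defaultdict
from zlib import crc32


def map_fn(key, value):
    """Hypothetical user map function: emit (word, 1) for each word in a line."""
    for word in value.split():
        yield word, 1


def reduce_fn(key, values):
    """Hypothetical user reduce function: sum the counts for one record group."""
    return key, sum(values)


def partition(key, num_reduces):
    # Deterministic hash so the same key always lands on the same reduce task.
    return crc32(key.encode("utf-8")) % num_reduces


def run_job(blocks, num_reduces):
    # One map task per input block; each task partitions its own output.
    map_outputs = []
    for block in blocks:
        buckets = defaultdict(list)
        for offset, line in enumerate(block):
            for k, v in map_fn(offset, line):
                buckets[partition(k, num_reduces)].append((k, v))
        map_outputs.append(buckets)

    # Each reduce task reads its bucket from every map task, then groups by key.
    results = {}
    for r in range(num_reduces):
        groups = defaultdict(list)
        for buckets in map_outputs:
            for k, v in buckets.get(r, []):
                groups[k].append(v)
        for k in sorted(groups):
            key, total = reduce_fn(k, groups[k])
            results[key] = total
    return results


blocks = [["a b a", "c"], ["b a"]]   # two blocks, so two map tasks
counts = run_job(blocks, num_reduces=3)
```

Because every record with a given key hashes to the same bucket, each record group is seen by exactly one reduce task, regardless of which map task produced the records.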
Dataflow in Hadoop
  • Map tasks write their output to local disk
    • Output available after map task has completed
  • Reduce tasks write their output to HDFS
    • Once job is finished, next job’s map tasks can be scheduled, and will read input from HDFS
  • Therefore, fault tolerance is simple: re-run failed tasks
    • No consumers see partial operator output
Dataflow in Hadoop (diagram sequence; two map tasks and two reduce tasks shown)
  • Step 1: "Submit job", then "schedule" map and reduce tasks
  • Step 2: map tasks "Read" the "Input File" (Block 1, Block 2) from HDFS
  • Step 3: map tasks write to "Local FS"; labels "Finished" and "Finished + Location"
  • Step 4: reduce tasks fetch map output from "Local FS" via "HTTP GET"
  • Step 5: reduce tasks "Write Final Answer" to HDFS
Hadoop Online Prototype
  • HOP supports pipelining within and between MapReduce jobs: push rather than pull
    • Preserve simple fault tolerance scheme
    • Improved job completion time (better cluster utilization)
    • Improved detection and handling of stragglers
  • MapReduce programming model unchanged
    • Clients supply same job parameters
  • Hadoop client interface backward compatible
    • No changes required to existing clients
      • E.g., Pig, Hive, Sawzall, Jaql
    • Extended to take a series of jobs
Pipelining Batch Size
  • Initial design: pipeline eagerly (for each row)
    • Prevents use of combiner
    • Moves more sorting work to mapper
    • Map function can block on network I/O
  • Revised design: map writes into buffer
    • Spill thread: sort & combine buffer, spill to disk
    • Send thread: pipeline spill files => reducers
  • Simple adaptive algorithm
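
A rough sketch of the revised design, assuming a record-count threshold and a sum combiner. The class name `MapOutputBuffer` and its parameters are hypothetical, the spill and send threads are collapsed into synchronous calls, and a Python list stands in for spill files on disk.

```python
from collections import defaultdict


class MapOutputBuffer:
    """Toy map-side buffer: records accumulate, then are sorted, combined, and spilled."""

    def __init__(self, capacity, combiner):
        self.capacity = capacity      # hypothetical spill threshold, in records
        self.combiner = combiner      # hypothetical combiner: (key, values) -> value
        self.records = []
        self.spills = []              # stands in for spill files on local disk

    def emit(self, key, value):
        # The map function only appends to memory; it never blocks on the network.
        self.records.append((key, value))
        if len(self.records) >= self.capacity:
            self.spill()

    def spill(self):
        if not self.records:
            return
        grouped = defaultdict(list)
        for key, value in self.records:
            grouped[key].append(value)
        # Sort by key and apply the combiner once per key within this spill.
        run = [(key, self.combiner(key, grouped[key])) for key in sorted(grouped)]
        self.spills.append(run)
        self.records = []


buf = MapOutputBuffer(capacity=4, combiner=lambda key, values: sum(values))
for word in "b a b a c a".split():
    buf.emit(word, 1)
buf.spill()  # flush the remainder at the end of the map task
```

Each entry in `buf.spills` is a sorted, combined run that a send thread could ship to reducers as a unit, which is what restores the combiner and moves sorting work back to the mapper.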
Fault Tolerance
  • Fault tolerance in MR is simple and elegant
    • Simply recompute on failure, no state recovery
  • Initial design for pipelining FT:
    • Reduce treats in-progress map output as tentative
  • Revised design:
    • Pipelining maps periodically checkpoint output
    • Reducers can consume output <= checkpoint
    • Bonus: improved speculative execution
Dataflow in HOP
  • [Diagram: two map tasks and two reduce tasks; labels: "Schedule", "Schedule + Location", "Pipeline request"]
Online Aggregation
  • Traditional MR: poor UI for data analysis
  • Pipelining means that data is available at consumers “early”
    • Can be used to compute and refine an approximate answer
    • Often sufficient for interactive data analysis, developing new MapReduce jobs, ...
  • Within a single job: periodically invoke reduce function at each reduce task on available data
  • Between jobs: periodically send a “snapshot” to consumer jobs
Intra-Job Online Aggregation
  • Approximate answers published to HDFS by each reduce task
  • Based on job progress: e.g. 10%, 20%, …
  • Challenge: providing statistically meaningful approximations
    • How close is an approximation to the final answer?
    • How do you avoid biased samples?
  • Challenge: reduce functions are opaque
    • Ideally, computing 20% approximation should reuse results of 10% approximation
    • Either use combiners, or HOP does redundant work
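
One way to picture the combiner option is a reduce task that keeps running state and publishes a copy of it whenever progress crosses a threshold. The sketch below assumes a word-count job and measures progress as the fraction of map outputs consumed; `snapshots` and its parameters are hypothetical names, and it says nothing about how statistically meaningful each snapshot is.

```python
from collections import Counter


def snapshots(map_outputs, thresholds):
    """Yield (progress, approximate counts) each time progress crosses a threshold.

    map_outputs: list of per-map-task lists of (word, count) records.
    thresholds: increasing fractions of map output consumed, e.g. [0.5, 1.0].
    """
    running = Counter()          # combiner-style state reused across snapshots
    pending = list(thresholds)
    total = len(map_outputs)
    for done, records in enumerate(map_outputs, start=1):
        for word, count in records:
            running[word] += count
        progress = done / total
        while pending and progress >= pending[0]:
            yield pending.pop(0), dict(running)


outputs = [[("a", 2)], [("b", 1)], [("a", 1)], [("c", 4)]]
result = list(snapshots(outputs, [0.5, 1.0]))
```

Here the 100% snapshot extends the 50% state instead of recomputing it. With an opaque reduce function that cannot be expressed this way, each snapshot would have to re-run the reduce over all data seen so far.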
Online Aggregation in HOP
  • [Diagram: map tasks "Read" the "Input File" (Block 1, Block 2) from HDFS; reduce tasks "Write Snapshot Answer" to HDFS]
Inter-Job Online Aggregation
  • [Diagram: Job 1 reducers feed Job 2 mappers; label "Write Answer" to HDFS]
  • Like intra-job OA, but approximate answers are pipelined to map tasks of next job
    • Requires co-scheduling a sequence of jobs
  • Consumer job computes an approximation
    • Can be used to feed an arbitrary chain of consumer jobs with approximate answers
  • Challenge: how to avoid redundant work
    • Output of reduce for 10% progress vs. for 20%
Example Scenario
  • Top K most-frequent-words in 5.5GB Wikipedia corpus (implemented as 2 MR jobs)
  • 60 node EC2 cluster
Stream Processing
  • MapReduce is often applied to streams of data that arrive continuously
    • Click streams, network traffic, web crawl data, …
  • Traditional approach: buffer, batch process
    • Poor latency
    • Analysis state must be reloaded for each batch
  • Instead, run MR jobs continuously, and analyze data as it arrives
Why?
  • Why use MapReduce for stream processing?
    • Many existing MR use cases are a good fit
    • Ability to run user-defined code
      • Machine learning, graph analysis, unstructured data
    • Massive scale + low-latency analysis
    • Use existing MapReduce tools and libraries
Stream Processing with HOP
  • Map and reduce tasks run continuously
  • Reduce function divides stream into windows
    • “Every 30 seconds, compute the 1, 5, and 15 minute average network utilization; trigger an alert if …”
    • Window management done by user (reduce)
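
As an illustration of user-managed windows, the following sketch keeps recent samples inside the reduce logic and reports trailing 1, 5, and 15 minute averages. The class `UtilizationWindows`, the 30-second sample spacing, and the synthetic data are all assumptions made for the example; alerting is left out.

```python
from collections import deque


class UtilizationWindows:
    """Keeps recent (timestamp, value) samples and reports trailing averages."""

    def __init__(self, window_seconds=(60, 300, 900)):
        self.window_seconds = window_seconds
        self.samples = deque()

    def add(self, timestamp, value):
        self.samples.append((timestamp, value))
        # Drop samples older than the largest window.
        horizon = timestamp - max(self.window_seconds)
        while self.samples and self.samples[0][0] <= horizon:
            self.samples.popleft()

    def averages(self, now):
        result = {}
        for width in self.window_seconds:
            values = [v for t, v in self.samples if now - width < t <= now]
            result[width] = sum(values) / len(values) if values else None
        return result


windows = UtilizationWindows()
for t in range(0, 901, 30):           # one sample every 30 seconds
    windows.add(t, 1.0 if t > 840 else 0.0)
report = windows.averages(now=900)
```

The state held in `samples` is exactly the analysis state that a batch-oriented job would have to save and reload between runs.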
Stream Processing Challenges
  • How to store stream input?
    • HDFS is not ideal
  • Fault tolerance for long-running tasks
    • Operator restart increasingly expensive
  • Elastic scale-up / scale-down during MR job
#1: Storing Stream Input
  • Current approach: colocate map task and data producer
    • Apply map function, partition => reduce task
    • Fault tolerance: fate share
    • “Pushdown” predicates and scalar transforms
    • Total order = single reduce task
  • User-defined code at data producer = bad?
    • Fault-tolerant “buffer” (map task), coordination
#2: Fault Tolerance for Streams
  • Operator restart for long-running reduces: too expensive
  • Hence, window-oriented fault tolerance
    • Reducers label windows with IDs
    • Mappers use window IDs to garbage collect spills
  • Probably need fault-tolerant Job Tracker and HDFS Name Node
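
A minimal sketch of window-oriented garbage collection on the mapper side. The slides do not say how a mapper learns that a window is safe to discard, so the `on_window_committed` callback, like the class name `MapperSpillStore`, is an assumption introduced for illustration.

```python
class MapperSpillStore:
    """Toy mapper-side store: spills are retained until their window is committed."""

    def __init__(self):
        self.spills = {}   # window_id -> list of spill payloads

    def add_spill(self, window_id, payload):
        self.spills.setdefault(window_id, []).append(payload)

    def on_window_committed(self, window_id):
        # Hypothetical callback: a reducer reports that every window up to and
        # including window_id is durable, so those spills are no longer needed
        # for replay after a reducer failure.
        for wid in [w for w in self.spills if w <= window_id]:
            del self.spills[wid]

    def replay_from(self, window_id):
        # Spills a restarted reducer would need to re-read.
        return [p for w in sorted(self.spills) if w >= window_id
                for p in self.spills[w]]


store = MapperSpillStore()
for wid, payload in [(1, "s1"), (1, "s2"), (2, "s3"), (3, "s4")]:
    store.add_spill(wid, payload)
store.on_window_committed(2)
```

A restarted reducer then only replays uncommitted windows instead of the whole stream, which is what keeps recovery cost bounded for long-running jobs.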
#3: Intra-Job Elasticity
  • Peak load != average load
    • Increasingly important as job duration grows
  • Solution: consistent hashing over reduce key space
    • Job Tracker manages reduce key => task mapping
  • Useful for regular Hadoop as well
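
Consistent hashing over the reduce key space could look roughly like the ring below. This is a generic textbook sketch, not HOP's implementation: the class `ReduceRing`, the MD5 hash, and the 64 virtual nodes per task are assumptions.

```python
import bisect
import hashlib


def _hash(value):
    # Stable hash, independent of Python's per-process string hashing.
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class ReduceRing:
    """Consistent-hash ring mapping reduce keys to reduce tasks."""

    def __init__(self, tasks, replicas=64):
        self.replicas = replicas          # virtual nodes per task (assumed value)
        self.ring = []                    # sorted list of (hash, task)
        for task in tasks:
            self.add_task(task)

    def add_task(self, task):
        for i in range(self.replicas):
            bisect.insort(self.ring, (_hash(f"{task}#{i}"), task))

    def remove_task(self, task):
        self.ring = [(h, t) for h, t in self.ring if t != task]

    def task_for(self, key):
        # First ring position at or after the key's hash, wrapping around.
        idx = bisect.bisect(self.ring, (_hash(key),))
        return self.ring[idx % len(self.ring)][1]


ring = ReduceRing(["reduce-0", "reduce-1", "reduce-2"])
keys = [f"key-{i}" for i in range(1000)]
before = {k: ring.task_for(k) for k in keys}
ring.add_task("reduce-3")             # scale up during the job
after = {k: ring.task_for(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
```

When a task is added, the only keys that change owner are those taken over by the new task, so scaling up mid-job disturbs a fraction of the key space instead of reshuffling all of it.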
Other HOP Benefits
  • Shorter job completion time via improved cluster utilization: reduce work starts early
    • Important for high-priority jobs, interactive jobs
  • Adaptive load management
    • Better detection and handling of “straggler” tasks
    • Elastic scale-up/scale-down: better pre-emption
    • Decouple unit of data transfer from unit of scheduling
      • E.g. Yahoo! PetaSort: 15GB/map task
Sort Performance: Blocking
  • 60 node EC2 cluster, 5.5GB input file
  • 40 map tasks, 59 reduce tasks
Sort Performance: Pipelining
  • 927 seconds (blocking) vs. 610 seconds (pipelining)
Future Work
  • Basic pipelining
    • Performance analysis at scale (e.g. PetaSort)
    • Job scheduling is much harder
  • Online Aggregation
    • Statistically robust estimation
    • Better UI for approximate results
  • Stream Processing
    • Develop into full-fledged stream processing engine
    • Stream support for high-level query languages
    • Online machine learning
Thanks!

Questions?

Source code and technical report: http://code.google.com/p/hop/

Contact: nrc@cs.berkeley.edu

Map Task Execution
  • Map phase
    • Read the assigned input split from HDFS
      • Split = file block by default
    • Parses input into records (key/value pairs)
    • Applies map function to each record
      • Returns zero or more new records
  • Commit phase
    • Registers the final output with the slave node
      • Stored in the local filesystem as a file
      • Sorted first by bucket number then by key
    • Informs master node of its completion
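
The commit-phase ordering can be shown with a small sketch. The partitioner `bucket_of` (CRC32 modulo the number of reduce tasks) and the function `commit_order` are assumptions for illustration; the point is only the sort key, bucket number first and record key second.

```python
from zlib import crc32


def bucket_of(key, num_reduces):
    # Assumed partitioner: stable hash of the key modulo the number of reduce tasks.
    return crc32(key.encode("utf-8")) % num_reduces


def commit_order(records, num_reduces):
    """Return map output sorted by (bucket number, key), as one contiguous file."""
    return sorted(records, key=lambda kv: (bucket_of(kv[0], num_reduces), kv[0]))


records = [("pear", 1), ("apple", 1), ("fig", 1), ("apple", 1)]
ordered = commit_order(records, num_reduces=2)
buckets = [bucket_of(k, 2) for k, _ in ordered]
```

With this layout each reduce task's portion is one contiguous, key-sorted region of the file, so it can be served with a single sequential read.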
Reduce Task Execution
  • Shuffle phase
    • Fetches input data from all map tasks
      • The portion corresponding to the reduce task’s bucket
  • Sort phase
    • Merge-sort *all* map outputs into a single run
  • Reduce phase
    • Applies user reduce function to the merged run
      • Arguments: key and corresponding list of values
    • Write output to a temp file in HDFS
      • Atomic rename when finished
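
The sort, reduce, and commit steps can be approximated as follows. This sketch assumes the map outputs are already fetched and sorted by key, uses the local filesystem in place of HDFS, and relies on `os.replace` for the atomic rename; `run_reduce` and the `part-00000` file name are illustrative choices.

```python
import heapq
import os
import tempfile
from itertools import groupby
from operator import itemgetter


def run_reduce(sorted_runs, reduce_fn, final_path):
    """Merge sorted map outputs, reduce each key group, and publish atomically."""
    merged = heapq.merge(*sorted_runs, key=itemgetter(0))
    directory = os.path.dirname(os.path.abspath(final_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as out:
        for key, group in groupby(merged, key=itemgetter(0)):
            values = [value for _, value in group]
            out.write(f"{key}\t{reduce_fn(key, values)}\n")
    # Readers see either no file or the complete file, never a partial one.
    os.replace(tmp_path, final_path)


runs = [[("a", 1), ("c", 2)], [("a", 3), ("b", 1)]]
out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, "part-00000")
run_reduce(runs, lambda key, values: sum(values), out_path)
with open(out_path) as f:
    lines = f.read().splitlines()
```

Writing to a temporary file first means a reduce task that fails halfway leaves nothing visible at the final path, so it can simply be restarted.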
Design Implications
  • Fault Tolerance
    • Tasks that fail are simply restarted
    • No further steps required since nothing left the task
  • “Straggler” handling
    • Job response time affected by slow task
    • Slow tasks get executed redundantly
      • Take result from the first to finish
      • Assumes slowdown is due to physical components (e.g., network, host machine)
  • Pipelining can support both!
Fault Tolerance in HOP
  • Traditional fault tolerance algorithms for pipelined dataflow systems are complex
  • HOP approach: write to disk and pipeline
    • Producers write data into in-memory buffer
    • In-memory buffer periodically spilled to disk
    • Spills sent to consumers
    • Consumers treat pipelined data as “tentative” until producer is known to complete
    • Fault tolerance via task restart, tentative output discarded
Refinement: Checkpoints
  • Problem: Treating output as tentative inhibits parallelism
  • Solution: Producers periodically “checkpoint” with Hadoop master node
    • “Output split x corresponds to input offset y”
    • Pipelined data <= split x is now non-tentative
    • Also improves speculation for straggler tasks, reduces redundant work on task failure
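
The checkpoint rule can be sketched from the reducer's point of view. The class `PipelinedInput` and its methods are hypothetical, and the way checkpoint notifications reach the reducer is assumed; only the rule "data <= split x is non-tentative" comes from the slide.

```python
class PipelinedInput:
    """Reducer-side view of one map task's pipelined output."""

    def __init__(self):
        self.spills = {}        # split number -> records received from the map
        self.checkpoint = 0     # highest split number the master has confirmed

    def receive(self, split_no, records):
        self.spills[split_no] = records

    def advance_checkpoint(self, split_no):
        # Hypothetical notification relayed from the master node.
        self.checkpoint = max(self.checkpoint, split_no)

    def committed(self):
        # Safe to merge with output from other map tasks.
        return [r for s in sorted(self.spills) if s <= self.checkpoint
                for r in self.spills[s]]

    def on_map_failure(self):
        # Discard only tentative spills; a restarted map resumes after the checkpoint.
        self.spills = {s: r for s, r in self.spills.items() if s <= self.checkpoint}


inp = PipelinedInput()
inp.receive(1, ["r1", "r2"])
inp.receive(2, ["r3"])
inp.receive(3, ["r4"])
inp.advance_checkpoint(2)
inp.on_map_failure()
```

After the failure, splits 1 and 2 survive and only split 3 is thrown away, which is the reduction in redundant work that the checkpoint refinement is aiming for.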