LIBRA: Lightweight Data Skew Mitigation in MapReduce

Qi Chen, Jinyu Yao, and Zhen Xiao • Nov 2014 • To appear in IEEE Transactions on Parallel and Distributed Systems

Outline • 1. Introduction • 2. Background • 3. Previous work • 4. System Design • 5. Evaluation • 6. Conclusion

Introduction • The new era of Big Data is coming! • 20 PB per day (2008) • 30 TB per day (2009) • 60 TB per day (2010) • Petabytes per day • What does big data mean? • Important user information • Significant business value

MapReduce • What is MapReduce? • The most popular parallel computing model, proposed by Google • Applications: • Database operations: Select, Join, Group • Search engines: Page rank, Inverted index, Log analysis • Machine learning: Clustering, machine translation, Recommendation • Scientific computation • Cryptanalysis • …

Data skew in MapReduce • Data skew: the imbalance in the amount of data assigned to each task • Mantri has observed that the coefficients of variation in data size across tasks are 0.34 and 3.1 at the 50th and 90th percentiles in the Microsoft production cluster • Fundamental reason: • The datasets in the real world are often skewed • Physical properties, hot spots • We do not know the data distribution beforehand • It cannot be solved by speculative execution

Architecture (figure) • The master assigns map tasks and reduce tasks to workers • Input files are divided into splits (Split 1, Split 2, …, Split M), each processed by a map task • Intermediate data are divided into partitions (Part 1, Part 2) according to some user-defined partitioner • Reduce tasks write the output files (Output1, Output2, …) • Map stage: map → combine • Reduce stage: copy → sort → reduce

Challenges to solve data skew • Many real world applications exhibit data skew • Sort, Grep, Join, Group, Aggregation, Page Rank, Inverted Index, etc. • The data distribution cannot be determined ahead of time • The computing environment can be heterogeneous • Diversity of hardware • Resource competition in cloud environment

Previous work • In the parallel database area • Limited to join, group, and aggregate operations • Pre-run sampling jobs • Adding two pre-run sampling and counting jobs for theta join (SIGMOD’11) • Running pre-processing extraction and sampling procedures for spatial feature extraction (SOCC’11) • Collect data information during the job execution • Collecting key frequency in each node and aggregating them on the master after all maps are done (Cloudcom’10) • Partitioning intermediate data into more partitions and using greedy bin-packing to pack them after all maps finish (CLOSER’11, ICDE’12) • SkewTune (SIGMOD’12) • Split skewed tasks when detected • Reconstruct the output by concatenating the results • Drawbacks of these approaches: • Significant overhead • Applicable only to certain applications • Bring a barrier between map and reduce phases • Bin-packing cannot support total order • Need more task slots • Cannot detect large keys • Cannot split in copy and sort phases

LIBRA – Solving data skew (figure) • 1: The master issues sample tasks first • 2: The sample map tasks sample the data • 3: The master calculates the partitions • 4: The master asks the workers to partition the map output • Sample maps and normal maps read from HDFS; reduce tasks write to HDFS

Sampling and partitioning • Sampling strategy • Random, TopCluster (ICDE’12) • LIBRA: p largest keys and q random keys • Estimate intermediate data distribution • Large keys -> each represents only one large key • Random keys -> each represents a small range of keys • Partitioning strategy • Hash, bin packing, range • LIBRA: range
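The sampling strategy above can be illustrated with a minimal sketch. It assumes that "largest" means the keys with the most (k,v) pairs in a split, and that the q random keys are drawn uniformly from the remaining distinct keys. The function name `sample_split` and its parameters are hypothetical and are not taken from the LIBRA implementation.

```python
import random
from collections import Counter

def sample_split(keys, p, q, seed=0):
    """Return (large, rand) for one input split.

    large: the p most frequent keys with their counts (assumed meaning of "largest").
    rand:  q other keys chosen uniformly at random, with their counts.
    """
    counts = Counter(keys)
    # most_common(p) returns the p keys with the highest counts.
    large = dict(counts.most_common(p))
    # Candidates for random sampling: every distinct key not already kept as large.
    rest = sorted(k for k in counts if k not in large)
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    chosen = rng.sample(rest, min(q, len(rest)))
    rand = {k: counts[k] for k in chosen}
    return large, rand

# Hypothetical split: two heavy keys and six light ones.
keys = ["a"] * 100 + ["b"] * 50 + list("cdefgh")
large, rand = sample_split(keys, p=2, q=3)
```

Keeping the heavy keys exactly while sampling the light keys sparsely is what lets a small sample still expose the clusters that cause skew.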

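Heterogeneity Consideration (figure) • Intermediate data: Cnt=300 • Three reducers (Reducer1, Reducer2, Reducer3) run on three nodes (Node1, Node2, Node3) • Partition sizes Cnt=150, Cnt=100, and Cnt=50 go to the nodes with Performance=1.5, Performance=1, and Performance=0.5 • Timeline: Start → Processing → Finish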
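The numbers on this slide are consistent with splitting the data in proportion to each node's performance factor, so that every reducer needs the same processing time. The sketch below works through that arithmetic; the function name `proportional_shares` is hypothetical, and processing time is assumed to be simply data size divided by performance.

```python
def proportional_shares(total, performance):
    """Split `total` tuples so that share / performance is equal for every reducer."""
    perf_sum = sum(performance)
    return [total * f / perf_sum for f in performance]

performance = [0.5, 1.5, 1.0]            # performance factors from the slide
shares = proportional_shares(300, performance)
# Assumed time model: time = amount of data / performance factor.
times = [s / f for s, f in zip(shares, performance)]
```

With these inputs the shares are 50, 150, and 100 tuples, and each reducer needs the same 100 time units under the assumed model.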

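Problem Statement • The intermediate data can be represented as: $(K_1, C_1), (K_2, C_2), \dots, (K_n, C_n)$ with $K_i < K_{i+1}$ • $K_i$: a distinct key; $C_i$: number of (k,v) pairs of $K_i$ • Range partition: $0 = pt_0 < pt_1 < \dots < pt_r = n$ • Reducer $i$ processes the keys in the range $(K_{pt_{i-1}}, K_{pt_i}]$ • Our goal: Minimize $\max_{i=1,\dots,r} \left\{ \frac{1}{e_i} \sum_{j=pt_{i-1}+1}^{pt_i} Cost(C_j) \right\}$ • $Cost(C_j)$: computational complexity of processing $K_j$ • sort: $Cost(C_j) = C_j$, self-join: $Cost(C_j) = C_j^2$ • $e_i$: performance factor of the worker node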
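One way to attack this min-max objective is a binary search on the allowed finish time combined with a greedy left-to-right assignment of consecutive keys. This is a minimal sketch of that general technique, not necessarily the algorithm LIBRA uses. The names `range_partition` and `feasible` are hypothetical, the default cost function assumes cost equal to the tuple count, and reducers are assumed to receive ranges in the order their performance factors are listed.

```python
def feasible(costs, perf, limit):
    """Greedily give consecutive keys to reducers in order.

    Returns (ok, cuts): ok is True if every key fits within `limit` time units,
    cuts[i] is the index one past the last key given to reducer i.
    """
    j, cuts = 0, []
    for e in perf:
        load = 0.0
        while j < len(costs) and (load + costs[j]) / e <= limit:
            load += costs[j]
            j += 1
        cuts.append(j)
    return j == len(costs), cuts

def range_partition(counts, perf, cost=lambda c: c, iters=60):
    """Return (boundaries, finish_time) minimising the slowest reducer's time."""
    costs = [cost(c) for c in counts]
    lo, hi = 0.0, sum(costs) / min(perf)   # hi is always feasible
    _, best = feasible(costs, perf, hi)
    for _ in range(iters):
        mid = (lo + hi) / 2
        ok, cuts = feasible(costs, perf, mid)
        if ok:
            hi, best = mid, cuts
        else:
            lo = mid
    bounds = [0] + best
    finish = max(sum(costs[a:b]) / e
                 for a, b, e in zip(bounds, bounds[1:], perf))
    return bounds, finish

# Homogeneous reducers, one large key: the large key ends up alone.
b1, t1 = range_partition([100, 10, 10], [1.0, 1.0])
# Heterogeneous reducers: faster nodes receive more keys.
b2, t2 = range_partition([50] * 6, [0.5, 1.5, 1.0])
```

Passing `cost=lambda c: c * c` would model a quadratic per-key cost instead of a linear one.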

Distribution estimation (figure) • (a) Sum up samples • (b) Pick up “marked keys” • (c) Estimate distribution • The marked keys K1, K2, …, K|L| of the sample list L divide the key space into intervals; the interval ending at Ki holds Pi keys and Qi tuples, and Pi = 1 when Ki is a single large key

Sparse Index to Speed Up Partitioning • Decreases the partition time by an order of magnitude • Intermediate data are divided into index chunks: chunk i starts at Offseti with record (Kbi, Vbi), followed by (Kbi+1, Vbi+1), …, and has length Li • Sparse index: one entry per chunk • (Kb1, Offset1, L1, Checksum1) • (Kb2, Offset2, L2, Checksum2) • … • (Kbn, Offsetn, Ln, Checksumn)
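The idea of a sparse index can be sketched as follows: keep one entry per chunk of sorted intermediate data, binary-search the entries to find the chunk containing a partition boundary, and scan only that chunk. This is a simplified illustration under stated assumptions. Offsets are list positions rather than byte offsets, the checksum field is omitted, and the names `build_sparse_index` and `find_split` are hypothetical.

```python
import bisect

def build_sparse_index(records, chunk_size):
    """records: list of (key, value) sorted by key.

    Returns one (first_key, offset, length) entry per chunk.
    """
    index = []
    for off in range(0, len(records), chunk_size):
        chunk = records[off:off + chunk_size]
        index.append((chunk[0][0], off, len(chunk)))
    return index

def find_split(records, index, boundary):
    """Return the position of the first record whose key is > boundary."""
    first_keys = [entry[0] for entry in index]
    # Last chunk whose first key is <= boundary; the split point lies inside it
    # or exactly at its end.
    i = bisect.bisect_right(first_keys, boundary) - 1
    if i < 0:
        return 0  # every key is larger than the boundary
    _, off, length = index[i]
    pos = off
    while pos < off + length and records[pos][0] <= boundary:
        pos += 1
    return pos

records = [(k, None) for k in [1, 2, 2, 3, 5, 5, 5, 8, 9]]
index = build_sparse_index(records, chunk_size=3)
split_at_5 = find_split(records, index, 5)
```

Only one chunk is scanned per boundary, so the work per boundary is a binary search over the index plus at most `chunk_size` comparisons instead of a pass over all records.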

Large Cluster Splitting • Applies to applications that treat each intermediate (k,v) pair independently in the reduce phase • e.g. sort, grep, join • Example data: A, cnt = 100; B, cnt = 10; C, cnt = 10 • Cluster split is not allowed (data skewed): • Reducer 1: A, cnt = 100 • Reducer 2: B, cnt = 10; C, cnt = 10 • Cluster split is allowed: • Reducer 1: A, cnt = 60 • Reducer 2: A, cnt = 40; B, cnt = 10; C, cnt = 10
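The effect of allowing a large cluster to be split can be reproduced with a small greedy sketch. It assumes equal-speed reducers and a per-reducer target of the total count divided by the number of reducers, rounded up. The function name `assign_clusters` is hypothetical, and this is an illustration of the slide's example rather than LIBRA's actual partitioning code.

```python
def assign_clusters(clusters, reducers, allow_split):
    """clusters: list of (key, count) in key order.

    Returns one list of (key, count) pieces per reducer.
    """
    total = sum(cnt for _, cnt in clusters)
    target = -(-total // reducers)        # ceiling of the ideal per-reducer load
    parts = [[] for _ in range(reducers)]
    i, load = 0, 0
    for key, cnt in clusters:
        while cnt > 0:
            last = i == reducers - 1
            if allow_split and not last:
                # Take only as much of this cluster as still fits the target.
                take = min(cnt, target - load)
            else:
                # Clusters stay whole (or the last reducer absorbs the rest).
                take = cnt
            parts[i].append((key, take))
            load += take
            cnt -= take
            if load >= target and not last:
                i, load = i + 1, 0
    return parts

data = [("A", 100), ("B", 10), ("C", 10)]
whole = assign_clusters(data, 2, allow_split=False)
split = assign_clusters(data, 2, allow_split=True)
```

Without splitting, the loads are 100 and 20; with splitting, both reducers receive 60 tuples.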

Experiment Environment • Cluster: • 30 virtual machines on 15 physical machines • Each physical machine: • Dual processors (2.4GHz Xeon E5620) • 24GB of RAM • Two 150GB disks • Connected by 1Gbps Ethernet • Each virtual machine: • 2 virtual cores, 4GB of RAM, and 40GB of disk space • Benchmark: • Sort, Grep, Inverted Index, Join

Evaluation – Accuracy of the Sampling Method • Zipf distribution (parameter = 1.0) • #keys = 65535 • Sample 20% of splits and 1000 keys from each split

Evaluation – LIBRA Execution (sort) • 80% faster than Hadoop Hash • 167% faster than Hadoop Range

Evaluation – Degree of the skew (sort) The overhead of LIBRA is minimal

Evaluation – different applications • Grep application -- grep different words from the full English Wikipedia archive with total data size of 31GB

Evaluation – different applications • Inverted Index application • Dataset: full English Wikipedia archive

Evaluation – different applications • Join application

Evaluation – Heterogeneous Environments (sort) • 30% faster than without heterogeneity consideration

Conclusion • We present LIBRA, a system that implements a set of innovative skew mitigation strategies in MapReduce: • A new sampling method for general user-defined programs • p largest keys and q random keys • An approach to balance the load among the reduce tasks • Large key split support • An innovative consideration of heterogeneous environments • Balance the processing time instead of just the amount of data • Performance evaluation demonstrates that: • The improvement is significant (up to 4 times faster) • The overhead is minimal and negligible even in the absence of skew

Thank You!
