Improving MapReduce Performance through Data Placement in Heterogeneous Hadoop Clusters

Improving MapReduce Performance through Data Placement in Heterogeneous Hadoop Clusters. Jiong Xie Ph.D. Student April 2010. Presentation Outline. Background Motivation Related Work Design and Implementation Experimental Result Conclusion/Future Work. Background.


Presentation Transcript


  1. Improving MapReduce Performance through Data Placement in Heterogeneous Hadoop Clusters. Jiong Xie, Ph.D. Student, April 2010

  2. Presentation Outline • Background • Motivation • Related Work • Design and Implementation • Experimental Result • Conclusion/Future Work

  3. Background • The MapReduce programming model is growing in popularity • Hadoop is used by Yahoo, Facebook, and Amazon

  4. Data Intensive Applications • Bioinformatics • Weather forecasting • Medical science • Astronautics

  5. Hadoop Overview (J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. OSDI ’04, pages 137–150, 2004)

  6. Hadoop Distributed File System (http://lucene.apache.org/hadoop)

  7. Motivational Example • Node A (fast): 1 task/min • Node B (slow): 2x slower • Node C (slowest): 3x slower [Figure: per-node timeline; x-axis: Time (min)]

  8. The Native Strategy • Node A: 6 tasks • Node B: 3 tasks • Node C: 2 tasks [Figure: per-node timeline; x-axis: Time (min); legend: Loading, Transferring, Processing]

  9. Our Solution: Reducing Data Transfer Time Node A 6 tasks Node A’ 3 tasks Node B’ Node C’ 2 tasks [Figure: per-node timeline; x-axis: Time (min); legend: Loading, Transferring, Processing]
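A quick check of the arithmetic behind slides 7 to 9 can be sketched as follows. It assumes the counts on the slides (6, 3, and 2) are the numbers of tasks each node processes, and it deliberately ignores loading and transfer time, which the original figures show separately. The function name `processing_minutes` is made up for this illustration.

```python
# Minutes needed per task, derived from slide 7:
# Node A runs 1 task/min, Node B is 2x slower, Node C is 3x slower.
minutes_per_task = {"A": 1, "B": 2, "C": 3}

# Task counts shown on slides 8 and 9 (assumed to be tasks processed per node).
tasks = {"A": 6, "B": 3, "C": 2}


def processing_minutes(tasks, minutes_per_task):
    """Return the pure processing time of each node, in minutes."""
    return {node: tasks[node] * minutes_per_task[node] for node in tasks}


times = processing_minutes(tasks, minutes_per_task)
print(times)
```

Under these assumptions every node needs the same processing time, so any difference in completion time between the two strategies comes from where the data sits, not from the amount of work assigned.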

  10. Preliminary Results Impact of data placement on performance of grep

  11. Challenges • Does the computing ratio depend on the application? • Initial data distribution • Data skew problem • New data arrival • Data deletion • New joining node • Data updating

  12. Measure Computing Ratios • Computing ratio • Fast machines process large data sets [Figure: Node A: 1 task/min; Node B: 2x slower; Node C: 3x slower; x-axis: Time]

  13. Steps to Measure Computing Ratios 1. Run the application on each node with the same amount of data and record each node's response time. 2. Set the ratio of the node with the shortest response time to 1 and set the ratios of the other nodes relative to it. 3. Calculate the least common multiple of these ratios. 4. Compute each node's portion of the data.
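The four steps on slide 13 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the author's implementation: the slide does not say how the portion is derived from the least common multiple, so the sketch assumes a node's portion is the LCM divided by its ratio, which reproduces the 6:3:2 split of the motivational example. The function name `computing_portions` and the sample response times are hypothetical.

```python
from fractions import Fraction
from functools import reduce
from math import gcd


def lcm(a, b):
    """Least common multiple of two positive integers."""
    return a * b // gcd(a, b)


def computing_portions(response_times):
    """Map node -> integer data portion from integer response times.

    response_times holds the time each node needed for the same input (step 1).
    """
    fastest = min(response_times.values())
    # Step 2: the fastest node gets ratio 1, the others are relative to it.
    ratios = {node: Fraction(t, fastest) for node, t in response_times.items()}
    # Assumption: a node's share is inversely proportional to its ratio.
    shares = {node: 1 / r for node, r in ratios.items()}
    # Step 3: for integer ratios this equals the LCM of the ratios themselves.
    scale = reduce(lcm, (s.denominator for s in shares.values()))
    # Step 4: scale the shares to integers and reduce them to lowest terms.
    portions = {node: int(s * scale) for node, s in shares.items()}
    common = reduce(gcd, portions.values())
    return {node: p // common for node, p in portions.items()}


# Hypothetical response times in seconds for nodes A, B, and C.
print(computing_portions({"A": 10, "B": 20, "C": 30}))
```

Using `Fraction` keeps the ratios exact, so response times that are not whole multiples of the fastest one (for example 10 and 15) still yield small integer portions.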

  14. Initial Data Distribution • Input files split into 64MB blocks • Round-robin data distribution algorithm [Figure: the Namenode distributes the blocks of File1 (labeled 1 to 9 and a to c) to Datanodes A, B, and C; portion 3:2:1]
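One way to picture a round-robin distribution that honors a 3:2:1 portion is the weighted round-robin sketch below. The slide does not spell out the exact order in which blocks are handed to nodes, so the slot order used here is an assumption, and `place_blocks` is a hypothetical helper rather than an HDFS API.

```python
from itertools import cycle


def place_blocks(blocks, portions):
    """Assign blocks to nodes in proportion to their integer portions.

    Each round hands out `portions[node]` consecutive blocks to every node,
    then starts over, until all blocks are placed.
    """
    # One slot per unit of portion, e.g. A, A, A, B, B, C for 3:2:1.
    slots = [node for node, share in portions.items() for _ in range(share)]
    placement = {node: [] for node in portions}
    for block, node in zip(blocks, cycle(slots)):
        placement[node].append(block)
    return placement


# Twelve blocks, matching the number of block labels in the slide's figure.
blocks = list(range(1, 13))
placement = place_blocks(blocks, {"A": 3, "B": 2, "C": 1})
print(placement)
```

With twelve blocks and a 3:2:1 portion, the nodes end up holding six, four, and two blocks respectively.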

  15. Data Redistribution 1. Get the network topology, the computing ratios, and the utilization. 2. Build and sort two lists: the under-utilized node list L1 and the over-utilized node list L2. 3. Select the source and destination nodes from the lists. 4. Transfer data. 5. Repeat steps 3 and 4 until the list is empty. [Figure: Namenode with lists L1 and L2; Datanodes A, B, and C holding blocks 1 to 9 and a to c; portion 3:2:1]
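The redistribution loop on slide 15 can be sketched as below. This is a simplified illustration: it ignores network topology, measures utilization only as a block count against a target derived from the portions, and moves one block per iteration. The function name `redistribute` and the sample layout are hypothetical.

```python
def redistribute(blocks_on, portions):
    """Move blocks from over-utilized to under-utilized nodes.

    blocks_on maps node -> list of blocks and is modified in place.
    Returns the list of moves as (block, source, destination) tuples.
    """
    total = sum(len(blocks) for blocks in blocks_on.values())
    weight = sum(portions.values())
    # Target block count per node, rounded down.
    target = {n: total * portions[n] // weight for n in blocks_on}
    moves = []
    while True:
        surplus = {n: len(blocks_on[n]) - target[n] for n in blocks_on}
        # Step 2: build and sort the over- and under-utilized lists.
        over = sorted((n for n in surplus if surplus[n] > 0),
                      key=lambda n: surplus[n], reverse=True)
        under = sorted((n for n in surplus if surplus[n] < 0),
                       key=lambda n: surplus[n])
        # Step 5: stop once either list is empty.
        if not over or not under:
            break
        # Step 3: most over-utilized node sends to most under-utilized node.
        src, dst = over[0], under[0]
        # Step 4: transfer one block.
        block = blocks_on[src].pop()
        blocks_on[dst].append(block)
        moves.append((block, src, dst))
    return moves


# Hypothetical starting point: twelve blocks spread evenly, portion 3:2:1.
layout = {"A": [1, 2, 3, 4], "B": [5, 6, 7, 8], "C": [9, 10, 11, 12]}
moves = redistribute(layout, {"A": 3, "B": 2, "C": 1})
print(moves)
print(layout)
```

Each move lowers the total surplus of the over-utilized nodes by one block, so the loop always terminates.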

  16. Sharing Files among Multiple Applications • The computing ratio depends on the data-intensive application • Redistribution • Redundancy

  17. Experimental Environment Five nodes in a heterogeneous Hadoop cluster

  18. Grep and WordCount Grep is a tool that searches for a regular expression in a text file. WordCount is a program that counts the words in a text file.
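For readers unfamiliar with the two benchmarks, the following single-machine Python stand-ins show what each one computes. They are not the Hadoop example jobs used in the experiments; the function names and sample lines are made up for illustration.

```python
import re
from collections import Counter


def grep(pattern, lines):
    """Return the lines that contain a match for the regular expression."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]


def word_count(lines):
    """Count whitespace-separated words across all lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


sample = ["hadoop stores data in blocks", "map and reduce", "data placement matters"]
print(grep(r"da+ta", sample))
print(word_count(sample)["data"])
```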

  19. Computing Ratio for Two Applications Computing ratios of the five nodes with respect to the Grep and WordCount applications

  20. Response Time of Grep and WordCount on Each Node • Application dependence • Data size independence

  21. Six Data Placement Decisions

  22. Impact of data placement on performance of Grep

  23. Impact of data placement on performance of WordCount

  24. Conclusion • Identified the performance degradation caused by heterogeneity. • Designed and implemented a data placement mechanism in HDFS.

  25. Future Work • Data redundancy issue • Dynamic data distribution mechanism • Prefetching

  26. Thanks

  27. Questions?
