
Tarazu: Optimizing MapReduce on Heterogeneous Clusters

Tarazu: Optimizing MapReduce on Heterogeneous Clusters. 72130310 임규찬. Contents: Abstract of Paper, Reference of Paper (LATE), Introduction, Issue with Heterogeneity, Tarazu, Experimental Result. Abstract of Paper: a study of how to optimize MapReduce in a heterogeneous cluster environment.

Presentation Transcript


  1. Tarazu: Optimizing MapReduce on Heterogeneous Clusters 72130310 임규찬

  2. Contents • Abstract of Paper • Reference of Paper (LATE) • Introduction • Issue with Heterogeneity • Tarazu • Experimental Result

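  3. Abstract of Paper • Studies how to optimize MapReduce in a heterogeneous cluster environment. • Data-center-scale clusters are adopting heterogeneity for economic reasons. • MapReduce has made big-data processing possible. • With existing techniques, performance on such clusters actually got worse. • Prior work based on straggler task management is not effective. • As an example, the paper compares against "Improving MapReduce Performance in Heterogeneous Environments".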

  4. Reference of Paper: Improving MapReduce Performance in Heterogeneous Environments • Optimizes for heterogeneity by controlling straggler tasks • Straggler condition: a node is available but is performing poorly • Can arise for many reasons, such as faulty hardware and misconfiguration • Proposes the LATE scheduler • Longest Approximate Time to End • Uses a per-task progress rate • Progress rate = ProgressScore / amount of time the task has been running • Unfortunately, LATE alone is not sufficient to address hardware heterogeneity.
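
The progress-rate idea on this slide can be illustrated with a minimal Python sketch. The rate formula follows the slide; the time-left estimate `(1 - ProgressScore) / rate` is how the LATE paper defines it, while the function names, the task dictionary, and the sample numbers are hypothetical and chosen only for illustration.

```python
import math


def progress_rate(progress_score: float, elapsed_s: float) -> float:
    """Progress rate = ProgressScore / time the task has been running."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return progress_score / elapsed_s


def estimated_time_to_end(progress_score: float, elapsed_s: float) -> float:
    """Approximate time left, assuming the task keeps its current rate."""
    rate = progress_rate(progress_score, elapsed_s)
    if rate == 0:
        return math.inf  # no progress yet, so no finite estimate
    return (1.0 - progress_score) / rate


def pick_speculation_candidate(tasks: dict) -> str:
    """Return the task with the longest approximate time to end.

    `tasks` maps a (hypothetical) task name to (progress_score, elapsed_s).
    """
    return max(tasks, key=lambda name: estimated_time_to_end(*tasks[name]))


# Hypothetical running tasks: both started 10 s ago, with different progress.
running = {"map_a": (0.5, 10.0), "map_b": (0.2, 10.0)}
candidate = pick_speculation_candidate(running)
```

Here `map_b` is selected because, at its current rate, it is expected to finish last. The real scheduler applies further limits (for example, on how many speculative copies may run) that this sketch leaves out.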

  5. Introduction - Tarazu (तराजू) • A Hindi word meaning 'balance' • Designed to pursue balance in MapReduce computation

  6. Introduction - MapReduce • A framework built to process large-scale data in parallel in a distributed computing environment • Optimized for homogeneous clusters.

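  7. Introduction - Heterogeneous Computing • Computing on a system made up of different kinds of cores • GPGPU using CPU/GPU • Seeks better performance by maximizing the respective strengths of the CPU and the GPU. • OpenCL, CUDA, DirectCompute, etc. exist. • Not covered in this paper • Clustering with high-end/low-end nodes • Optimizes monetary factors such as power and price • This paper uses 10 Xeon nodes and 80 Atom nodes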

  8. Issue with Heterogeneity - Background: MapReduce • Four-phase execution model • Map computation • Produces <key, value> tuples • Shuffle • All-Map-to-all-Reduce personalized communication • Sort • Groups all the tuples for the same key • Reduce computation • Processes all the tuples for a key and produces the final output
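
The four phases on this slide can be walked through with a small single-process word-count sketch in Python. It only illustrates the data flow (Map, Shuffle, Sort, Reduce); it is not Hadoop code, and every function name, the byte-sum partitioner, and the sample input are assumptions made for the example.

```python
from itertools import groupby
from operator import itemgetter


def map_phase(split: str) -> list:
    """Map: produce <key, value> tuples, here (word, 1)."""
    return [(word.lower(), 1) for word in split.split()]


def partition(key: str, num_reducers: int) -> int:
    """Deterministic toy partitioner (an assumption, not Hadoop's)."""
    return sum(key.encode("utf-8")) % num_reducers


def shuffle_phase(map_outputs: list, num_reducers: int) -> list:
    """Shuffle: every Map output sends each tuple to the Reduce that owns its key."""
    partitions = [[] for _ in range(num_reducers)]
    for output in map_outputs:
        for key, value in output:
            partitions[partition(key, num_reducers)].append((key, value))
    return partitions


def sort_phase(tuples: list) -> list:
    """Sort: group all the tuples for the same key."""
    ordered = sorted(tuples, key=itemgetter(0))
    return [(key, [v for _, v in group])
            for key, group in groupby(ordered, key=itemgetter(0))]


def reduce_phase(grouped: list) -> dict:
    """Reduce: process all the values for a key and produce the final output."""
    return {key: sum(values) for key, values in grouped}


def run(splits: list, num_reducers: int = 2) -> dict:
    map_outputs = [map_phase(s) for s in splits]
    result = {}
    for part in shuffle_phase(map_outputs, num_reducers):
        result.update(reduce_phase(sort_phase(part)))
    return result


counts = run(["a b a", "b c"])
```

In a real cluster each Map and Reduce call runs on a different node, which is why the Shuffle step turns into all-to-all network traffic.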

  9. Issue with Heterogeneity - Reasons for poor performance on heterogeneous clusters • Dynamic load balancing in MapReduce • Slower nodes get fewer tasks; faster nodes get more tasks • MapReduce runs slower on heterogeneous clusters than on homogeneous ones • 20-75% slower for six out of eleven benchmarks • Heterogeneity can degrade performance • Poor performance is due to two key factors • One non-intuitive • The other intuitive

  10. Issue with Heterogeneity - Reasons for poor performance on heterogeneous clusters • Factor 1: Non-intuitive • Interaction between load balancing and network traffic • Heterogeneity causes remote tasks • Xeon nodes are fast and Atom nodes are slow, so Xeon nodes steal the Atom nodes' tasks • Remote tasks generate network traffic • The network traffic problem is exacerbated when the Shuffle is heavy

  11. Issue with Heterogeneity - Reasons for poor performance on heterogeneous clusters • Factor 2: Intuitive • Reduce-phase imbalance amplified by heterogeneity • Reduce-phase load imbalance • Different processing speeds make the Reduce phase take a long time

  12. Issue with Heterogeneity - A Simple(?) analytical model • Map Finish Time (MFT): the finish time of whichever of the high-end/low-end systems completes its Map computation later • Number of input data in bisection: data moved by remote tasks plus shuffle data • Shuffle Finish Time (SFT): the time due to remote tasks, or MFT • Reduce Finish Time (RFT): the time due to remote tasks, or MFT

  13. Tarazu • Two problems in MapReduce • Map-side built-in load balancing results in remote Map tasks • Reduce-side load imbalance across the nodes • Tarazu consists of three components • Communication-Aware Load Balancing of Map computation (CALB) • Communication-Aware Scheduling of Map computation (CAS) • Predictive Load Balancing of Reduce computation (PLB)

  14. Tarazu - Communication-Aware Load Balancing of Map computation • Based on a key observation • It stems from the overlap between Map computation and Shuffle • When the Shuffle is critical: 'no-steal mode' • Nodes pick up remote tasks only when the Shuffle ends • No remote Map tasks compete with the Shuffle • Reduces the I/O processing overhead • Slower nodes perform more work

  15. Tarazu - Communication-Aware Load Balancing of Map computation • When Map computation is critical: 'task-steal mode' • This mode is the concern of CAS. • CALB changes mode based on shuffleLag • Measured with the monitoring MapReduce already does for fault tolerance • shuffleLag is the difference between the number of • Map tasks that have completed their computation • Map tasks that have completed their communication, over all nodes • Deciding the source of criticality once is enough; repeated, dynamic checks are not needed.
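
The shuffleLag bookkeeping can be sketched in Python as follows. The definition of shuffleLag (Map tasks that finished computing minus Map tasks that finished communicating) comes from the slide; the fixed `lag_threshold` comparison is an assumption of this sketch, since the slide does not state the exact decision rule, and the function names are hypothetical.

```python
def shuffle_lag(maps_computed: int, maps_shuffled: int) -> int:
    """Map tasks done computing minus Map tasks whose output is fully shuffled."""
    if maps_shuffled > maps_computed:
        raise ValueError("a Map task cannot finish its shuffle before computing")
    return maps_computed - maps_shuffled


def choose_mode(maps_computed: int, maps_shuffled: int, lag_threshold: int) -> str:
    """Pick the CALB mode once, from cluster-wide counters.

    ASSUMPTION: a large lag means the Shuffle is falling behind Map
    computation, so the Shuffle is treated as critical ('no-steal').
    The threshold is a hypothetical tuning knob, not the paper's rule.
    """
    lag = shuffle_lag(maps_computed, maps_shuffled)
    return "no-steal" if lag > lag_threshold else "task-steal"


# Hypothetical snapshots of the two counters.
heavy_shuffle = choose_mode(maps_computed=100, maps_shuffled=40, lag_threshold=20)
light_shuffle = choose_mode(maps_computed=100, maps_shuffled=95, lag_threshold=20)
```

Because the slide notes that one decision is enough, the sketch computes the mode from a single snapshot rather than polling.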

  16. Tarazu - Communication-Aware Scheduling of Map computation • Determines how many remote tasks are needed • Used in CALB's 'task-steal' mode • Used to avoid increasing SFT • To avoid traffic bursts, CAS spreads out the remote tasks by interleaving them with local tasks

  17. Tarazu - Communication-Aware Scheduling of Map computation • CAS has other benefits • By interleaving remote tasks with local tasks, CAS achieves better overlap between remote task communication and local task computation on both the sender and receiver sides • Remote tasks read input data faster by avoiding bursts
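
The interleaving idea behind CAS can be sketched in Python. This only shows how remote tasks might be spread evenly through a node's local task queue instead of being run back to back; the even-stride policy, the function name, and the task labels are assumptions of the sketch, not the paper's scheduling algorithm.

```python
def interleave(local_tasks: list, remote_tasks: list) -> list:
    """Spread remote tasks among local tasks to avoid a burst of remote reads."""
    if not remote_tasks:
        return list(local_tasks)

    # Run one remote task after every `stride` local tasks (hypothetical policy).
    stride = max(1, len(local_tasks) // len(remote_tasks))
    remaining_remote = iter(remote_tasks)
    done = object()  # sentinel marking an exhausted remote queue

    schedule = []
    for position, task in enumerate(local_tasks, start=1):
        schedule.append(task)
        if position % stride == 0:
            remote = next(remaining_remote, done)
            if remote is not done:
                schedule.append(remote)

    # Any remote tasks left over run after the local ones.
    schedule.extend(remaining_remote)
    return schedule


plan = interleave(["L1", "L2", "L3", "L4", "L5", "L6"], ["R1", "R2"])
```

With six local and two remote tasks, `plan` places one remote task after each run of three local tasks, so the remote input transfers are separated in time and can overlap with local computation.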

  18. Tarazu - Predictive Load Balancing of Reduce computation • Achieves better load balance in the Reduce phase • Skews the intermediate key distribution • Reduces the max term in RFT • Each Reduce task keeps track of the number of fast/slow nodes.
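
Skewing the intermediate key distribution can be illustrated with a speed-weighted partitioner in Python. The slide says only that the distribution is skewed; giving each Reduce task a share of keys proportional to a relative node speed is an assumption of this sketch, as are the function names and the example speeds.

```python
from bisect import bisect_right
from itertools import accumulate


def cumulative_shares(speeds: list) -> list:
    """Upper bound of each reducer's slice of the key space, in (0, 1]."""
    total = sum(speeds)
    return [running / total for running in accumulate(speeds)]


def assign_keys(keys: list, speeds: list) -> list:
    """Give each reducer a contiguous run of keys proportional to its speed.

    ASSUMPTION: `speeds` are relative throughputs (e.g. a fast node = 3,
    a slow node = 1) known ahead of time.
    """
    bounds = cumulative_shares(speeds)
    buckets = [[] for _ in speeds]
    for index, key in enumerate(keys):
        position = (index + 0.5) / len(keys)  # midpoint of this key's slot
        reducer = min(bisect_right(bounds, position), len(speeds) - 1)
        buckets[reducer].append(key)
    return buckets


# One fast reducer (speed 3) and two slow reducers (speed 1 each).
buckets = assign_keys(list(range(10)), speeds=[3, 1, 1])
sizes = [len(b) for b in buckets]
```

With these speeds the fast reducer receives six of the ten keys and each slow reducer two, so all three would be expected to finish at about the same time if the work per key is uniform.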

  19. Experimental Methodology • Uses a heterogeneous cluster environment • 10 Xeon-based and 80 Atom-based server nodes • Uses Hadoop 0.20.2 • Compares against another solution, LATE

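  20. Experimental Result - Performance • The heterogeneity-aware techniques maximize the system's strengths • In the Shuffle-critical case, the result reflects the sheer number of Atom nodes • In the Map-critical case, the result reflects the performance of the Xeon nodes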

  21. Experimental Result - Effect of CALB, CAS and PLB

  22. Experimental Result - Sensitivity to extent of heterogeneity

  23. Experimental Result - Effect of skewed input data distribution

  24. Reference • Improving MapReduce Performance in Heterogeneous Environments, University of California, Berkeley • https://developers.google.com/appengine/docs/python/dataprocessing/ • http://www.cpubenchmark.net/
