
Starfish: A Self-tuning System for Big Data Analytics


Presentation Transcript


  1. Starfish: A Self-tuning System for Big Data Analytics Presented by Carl Erhard & Zahid Mian Authors: Herodotos Herodotou, Harold Lim, Fei Dong, Shivnath Babu Duke University

  2. Analysis in the Big Data Era [Diagram: Massive Data → Data Analysis → Insight] Key to Success = Timely and Cost-Effective Analysis Starfish

  3. We want a MAD System • Magnetism: “attracts” or welcomes all sources of data, regardless of structure, values, etc. • Agility: adaptive; remains in sync with rapid data evolution and modification • Depth: more than just your typical analytics; we need to support complex operations like statistical analysis and machine learning

  4. No wait…I mean MADDER • Data-lifecycle Awareness: do more than just queries; optimize the movement, storage, and processing of big data • Elasticity: dynamically adjust resource usage and operational costs based on workload and user requirements • Robustness: provide storage and querying services even in the event of some failures

  5. Practitioners of Big Data Analytics • Who are the users? • Data analysts, statisticians, computational scientists… • Researchers, developers, testers… • Business Analysts… • You! • Who performs setup and tuning? • The users! • Usually lack expertise to tune the system

  6. Motivation

  7. Tuning Challenges • Heavy use of programming languages for MapReduce programs (e.g., Java/Python) • Data loaded/accessed as opaque files • Large space of tuning choices (over 190 parameters!) • Elasticity is wonderful, but hard to achieve (Hadoop has many useful mechanisms, but policies are lacking) • Terabyte-scale data cycles

  8. Starfish: Self-tuning System • Our goal: Provide good performance automatically [Stack diagram: Java / C++ / R / Python and Elastic MapReduce on top; Pig, Jaql, Oozie, Hive; the Starfish analytics system; the Hadoop MapReduce execution engine; HBase; the Distributed File System]

  9. What are the Tuning Problems? • Cluster sizing • Job-level MapReduce configuration • Data layout tuning • Workflow optimization • Workload management

  10. Starfish’s Core Approach to Tuning • Profiler: collects concise summaries of execution • What-if Engine: estimates the impact of hypothetical changes on execution (if Δ(conf. parameters) then what…? if Δ(data properties) then what…? if Δ(cluster properties) then what…?) • Optimizers (cluster, job, data layout, workflow, workload): search through the space of tuning choices

  11. Starfish Architecture

  12. Job Level Tuning • Just-in-Time Optimizer: Automatically selects efficient execution techniques for MapReduce jobs. • Profiler: A Starfish component which is able to collect detailed summaries of jobs on a task-by-task basis. • Sampler: Collects statistics about input, intermediate, and output data of a MapReduce job.

  13. MapReduce Job Execution • job j = < program p, data d, resources r, configuration c > [Diagram: four map tasks reading input splits 0–3, feeding two reduce tasks that write out 0 and out 1]

  14. What Controls MR Job Execution? • Space of configuration choices: • Number of map tasks • Number of reduce tasks • Partitioning of map outputs to reduce tasks • Memory allocation to task-level buffers • Whether output data from tasks should be compressed • Whether the combine function should be used • job j = < program p, data d, resources r, configuration c >
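The job abstraction on this slide can be made concrete. A minimal sketch in Python, assuming illustrative class, path, and resource names (not Starfish's actual API); the configuration keys are real Hadoop 1.x parameter names:

```python
from dataclasses import dataclass, field

@dataclass
class MRJob:
    """Illustrative stand-in for j = <program p, data d, resources r, configuration c>."""
    program: str            # p: the MapReduce program (e.g., a jar)
    data: str               # d: input path in the distributed file system
    resources: dict         # r: cluster resources (nodes, slots per node)
    configuration: dict = field(default_factory=dict)  # c: tuning knobs

# A handful of the ~190 configuration parameters the slides mention:
default_conf = {
    "mapred.reduce.tasks": 8,                 # number of reduce tasks
    "io.sort.mb": 100,                        # map-side sort buffer size (MB)
    "mapred.compress.map.output": False,      # compress map output?
    "mapred.job.shuffle.input.buffer.percent": 0.70,
}

job = MRJob(
    program="word_cooccurrence.jar",          # hypothetical program name
    data="hdfs:///data/wikipedia-10gb",       # hypothetical input path
    resources={"nodes": 16, "map_slots_per_node": 2, "reduce_slots_per_node": 2},
    configuration=dict(default_conf),
)
```

Changing the job means changing one of the four components; tuning, in Starfish's framing, is a search over the `configuration` component alone.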

  15. Effect of Configuration Settings • Use defaults or set manually (rules-of-thumb) • Rules-of-thumb may not suffice [Figure: response surface with rules-of-thumb settings marked; a two-dimensional projection of a multi-dimensional surface for the Word Co-occurrence MapReduce program]

  16. MapReduce Job Tuning in a Nutshell • Goal: find the configuration setting that minimizes job execution time • Challenges: p is an arbitrary MapReduce program; c is high-dimensional; … • Profiler: runs p to collect a job profile (concise execution summary) of <p,d1,r1,c1> • What-if Engine: given the profile of <p,d1,r1,c1>, estimates a virtual profile for <p,d2,r2,c2> • Optimizer: enumerates and searches through the optimization space S efficiently
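The Profiler / What-if Engine / Optimizer loop above can be sketched end to end. The cost model below is a toy stand-in (Starfish's What-if Engine estimates a full virtual job profile analytically, not a single number), and plain grid search stands in for its more efficient search over the high-dimensional space S:

```python
import itertools

def estimate_runtime(profile, conf):
    """Toy what-if model, NOT Starfish's: more reducers spread shuffle work
    until per-task startup overhead dominates, and a larger sort buffer
    reduces spill cost up to the size of the map output."""
    reducers = conf["mapred.reduce.tasks"]
    sort_mb = conf["io.sort.mb"]
    shuffle = profile["map_output_mb"] / reducers + 0.5 * reducers
    spill = max(0.0, profile["map_output_mb"] - sort_mb) * 0.01
    return profile["base_sec"] + shuffle + spill

def optimize(profile, space):
    """Enumerate the (small) space S and keep the configuration with the
    lowest estimated runtime."""
    best_conf, best_time = None, float("inf")
    for values in itertools.product(*space.values()):
        conf = dict(zip(space.keys(), values))
        t = estimate_runtime(profile, conf)
        if t < best_time:
            best_conf, best_time = conf, t
    return best_conf, best_time

# A measured profile of <p, d1, r1, c1> (numbers invented for illustration):
profile = {"base_sec": 120.0, "map_output_mb": 800.0}
space = {"mapred.reduce.tasks": [2, 8, 32, 64],
         "io.sort.mb": [100, 200, 400]}
conf, secs = optimize(profile, space)
```

Under this toy model, 32 reducers with a 400 MB sort buffer wins: enough parallelism to spread the 800 MB of map output without paying excessive task-startup overhead.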

  17. Job Profile • Concise representation of program execution as a job • Records information at the level of “task phases” • Generated by the Profiler through measurement, or by the What-if Engine through estimation [Diagram: map-task phases: DFS Read; Map; Collect (serialize, partition into a memory buffer); Spill (sort, [combine], [compress]); Merge]

  18. Job Profile Fields

  19. Generating Profiles by Measurement • Goals: • Have zero overhead when profiling is turned off • Require no modifications to Hadoop • Support unmodified MapReduce programs written in Java or Hadoop Streaming/Pipes (Python/Ruby/C++) • Approach: Dynamic (on-demand) instrumentation • Event-condition-action rules are specified (in Java) • Leads to run-time instrumentation of Hadoop internals • Monitors task phases of MapReduce job execution • We currently use BTrace (Hadoop internals are in Java)
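The event-condition-action idea can be illustrated outside of BTrace. Below is a conceptual analogue in Python, timing "task phases" only while profiling is enabled; the real Profiler instruments Hadoop's Java internals at run time, and the decorator and phase functions here are illustrative, not Starfish code:

```python
import time
from collections import defaultdict

PROFILING_ENABLED = True
phase_timings = defaultdict(float)

def on_phase(phase):
    """ECA rule: event = entering/leaving a task phase, condition = profiling
    is enabled, action = record elapsed wall-clock time for that phase."""
    def wrap(fn):
        if not PROFILING_ENABLED:
            return fn  # zero overhead when profiling is off: no wrapper at all
        def timed(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                phase_timings[phase] += time.perf_counter() - start
        return timed
    return wrap

@on_phase("MAP")
def run_map(records):
    # Toy map phase: tokenize records into (word, 1) pairs
    return [(w, 1) for r in records for w in r.split()]

@on_phase("SPILL")
def spill(pairs):
    # Toy spill phase: sort buffered map output before writing it out
    return sorted(pairs)

spilled = spill(run_map(["a b a", "b c"]))
```

The key property mirrored here is the first goal on the slide: when profiling is disabled, the original function is returned untouched, so the instrumented path costs nothing.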

  20. Generating Profiles by Measurement • Use of Sampling • Profile fewer tasks • Execute fewer tasks [Diagram: ECA rules enable profiling inside each task JVM; raw data from map and reduce tasks becomes map profiles and reduce profiles, which combine into a job profile] JVM = Java Virtual Machine, ECA = Event-Condition-Action

  21. Results of Job Profiling

  22. Results using Job Profiling

  23. Workflow-Aware Scheduling • Unbalanced Data Layout • Skewed Data • Data Layout Not Considered when Scheduling Tasks • Addition/Dropping of Partitions: No Rebalance • Can Lead to Failures Due to Space Issues • Locality-Aware Schedulers Can Make the Problem Worse • Possible Solutions: • Change # of Replicas • Collocating Data (Block Placement Policy)
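The block placement policies named under "Possible Solutions" can be contrasted with a small sketch, assuming one replica per block for simplicity (function and node names are illustrative). With HDFS's default Local Write, the first replica lands on the writing node, so a producer job piles its output onto the nodes that ran it; Round Robin spreads blocks across the cluster:

```python
def local_write(blocks, writer_node, nodes):
    # Every block's (first) replica stays on the node that wrote it
    return {b: writer_node for b in blocks}

def round_robin(blocks, writer_node, nodes):
    # Blocks are assigned to cluster nodes in rotation
    return {b: nodes[i % len(nodes)] for i, b in enumerate(blocks)}

def load(layout, node):
    # Number of blocks stored on a given node
    return sum(1 for n in layout.values() if n == node)

nodes = ["n1", "n2", "n3", "n4"]
blocks = [f"blk_{i}" for i in range(8)]
local = local_write(blocks, "n1", nodes)
rr = round_robin(blocks, "n1", nodes)
```

With Local Write all eight blocks sit on `n1`, which is exactly the unbalanced layout the slide warns about; Round Robin leaves two blocks per node.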

  24. Impact of Unbalanced Data Layout

  25. Impact of Unbalanced Data Layout

  26. Impact of Unbalanced Data Layout

  27. Workflow-Aware Scheduling • Makes Decisions by Considering Producer-Consumer Relationships Among Jobs and Nodes

  28. Starfish’s Workflow-Aware Scheduler • Space of Choices: • Block Placement Policy: Round Robin (Local Write is the default) • Replication Factor • Size of Blocks: generally large for large files • Compression: impacts I/O; not always beneficial

  29. Starfish’s Workflow-Aware Scheduler • What-If Questions: • A) What is the expected runtime of Job P if the RR block placement policy is used for P’s output files? • B) What is the new data layout in the cluster if the RR block placement policy is used for P’s output files? • C) What is the expected runtime of Job C1 (C2) if its input data layout is the one in the answer to Question B? • D) What are the expected runtimes of Jobs C1 and C2 if they are scheduled concurrently when Job P completes? • E) Given the Local Write block policy and RF = 1 for Job P’s output, what is the expected increase in the runtime of Job C1 if one node in the cluster fails during C1’s execution?

  30. Estimates from the What-if Engine • Hadoop cluster: 16 nodes, c1.medium • MapReduce program: Word Co-occurrence • Data set: 10 GB Wikipedia [Figure: true response surface vs. estimated surface]

  31. Workflow Scheduler Picks Layout

  32. Optimizations: Workload Optimizer

  33. Provisioning: Elastisizer • Motivation: Amazon Elastic MapReduce • Data in S3, processed in-cluster, results stored to S3 • User pays for resources used • The Elastisizer determines … • The best cluster • Hadoop configurations • … based on user-specified goals (execution time and cost)
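The Elastisizer's decision can be sketched as a constrained choice: among candidate cluster configurations, pick the cheapest one whose estimated completion time meets the user's deadline. The instance prices and runtimes below are invented for illustration; the real Elastisizer obtains runtime estimates from the What-if Engine rather than taking them as given:

```python
# (instance type, node count, $/node/hour, estimated runtime in hours)
# All numbers are hypothetical, for illustration only.
candidates = [
    ("m1.small",  20, 0.08, 6.0),
    ("m1.large",  10, 0.32, 2.5),
    ("c1.medium", 16, 0.17, 2.0),
    ("c1.xlarge",  8, 0.66, 1.2),
]

def provision(candidates, deadline_hours):
    """Return (type, nodes, total cost, hours) of the cheapest configuration
    meeting the deadline, or None if no candidate is fast enough."""
    feasible = [(itype, n, price * n * hours, hours)
                for itype, n, price, hours in candidates
                if hours <= deadline_hours]
    return min(feasible, key=lambda c: c[2]) if feasible else None

choice = provision(candidates, deadline_hours=3.0)
```

With a 3-hour deadline the slow-but-cheap `m1.small` cluster is infeasible, and among the remaining options the 16-node `c1.medium` cluster has the lowest total cost, illustrating the execution-time/cost trade-off the slide describes.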

  34. Elastisizer Configuration Evaluation

  35. Elastisizer Configuration Evaluation

  36. Elastisizer: Cluster Configurations

  37. Multi-objective Cluster Provisioning • Instance Type for Source Cluster: m1.large

  38. Critique of Paper • Good • Source available for implementation • Able to see the impact of various settings • Good visualization tools • Tutorials/source available at duke.edu/starfish • Bad • The paper (and subsequent materials) talks a lot about what Starfish does, but not necessarily how it does it • There is no documentation on LastWord, and this seems important • Only works after the job/workflow has been executed at least once

  39. Starfish’s Visualizer • Timeline Views • Show the progress of a job execution at the task level • See execution of the same job with different settings • Data-flow Views • View the flow of data among nodes, along with MR jobs • “Video Mode” allows playing back past executions • Profile Views • Timings, data flow, resource-level information

  40. Timeline Views

  41. Timeline View

  42. Data Skew View

  43. Data Skew View

  44. Data Skew View

  45. Data-flow Views

  46. References • Herodotou, Herodotos, et al. “Starfish: A Self-tuning System for Big Data Analytics.” Proc. of the Fifth CIDR Conf., 2011. • Dong, Fei. Extending Starfish to Support the Growing Hadoop Ecosystem. Diss. Duke University, 2012. • Herodotou, Herodotos, Fei Dong, and Shivnath Babu. “MapReduce Programming and Cost-based Optimization? Crossing This Chasm with Starfish.” Proceedings of the VLDB Endowment 4.12 (2011). • http://www.cs.duke.edu/starfish/ • http://www.youtube.com/watch?v=Upxe2dzE1uk

  47. Backup

  48. Hadoop MapReduce Ecosystem • Popular solution to Big Data Analytics [Stack diagram: Java / C++ / R / Python and Elastic MapReduce on top; Pig, Jaql, Oozie, Hive; the Hadoop MapReduce execution engine; HBase; the Distributed File System]

  49. Workflow-level Tuning • Starfish has a Workflow-aware Scheduler which addresses several concerns: • How do we evenly distribute computation across nodes? • How do we adapt to imbalances in load or energy cost? • The Workflow-aware Scheduler works with the What-if Engine and the Data Manager to answer these questions

  50. Workload-level Tuning • Starfish’s Workload Optimizer is aware of each workflow that will be executed; it reorders workflows to make them more efficient • This includes reusing data across workflows that share the same MapReduce jobs
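The data-reuse idea on this slide can be sketched as memoizing job outputs across workflows: if two workflows run the same job over the same input, execute it once and share the result. The cache key, path scheme, and workflow model here are illustrative, not Starfish's actual mechanism:

```python
executed = []  # records which jobs actually ran

def run_job(name, input_path):
    """Stand-in for launching a MapReduce job; returns its output path."""
    executed.append(name)
    return f"{input_path}/out_{name}"

def run_workload(workflows):
    """Run a set of linear workflows, reusing any (job, input) result
    already produced by an earlier workflow in the workload."""
    cache = {}  # (job name, input path) -> output path
    outputs = []
    for wf in workflows:
        path = wf["input"]
        for job in wf["jobs"]:
            key = (job, path)
            if key not in cache:
                cache[key] = run_job(job, path)
            path = cache[key]  # each job consumes the previous job's output
        outputs.append(path)
    return outputs

# Two hypothetical workflows sharing a common first job over the same input:
w1 = {"input": "hdfs:///logs", "jobs": ["sessionize", "top_k"]}
w2 = {"input": "hdfs:///logs", "jobs": ["sessionize", "histogram"]}
outs = run_workload([w1, w2])
```

Only three jobs run instead of four: the shared `sessionize` job executes once and its output feeds both `top_k` and `histogram`, which is the kind of cross-workflow saving the optimizer's reordering enables.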
