
Towards Pre-Deployment Detection of Performance Failures in Cloud Systems

Riza Suminto, Agung Laksono*, Anang Satria*, Thanh Do†, Haryadi Gunawi


Presentation Transcript


  1. Towards Pre-Deployment Detection of Performance Failures in Cloud Systems • Riza Suminto, Agung Laksono*, Anang Satria*, Thanh Do†, Haryadi Gunawi

  2. SPV @ HotCloud ’15 Cloud Systems

  3. Demands • Users demand high dependability, reliability, and performance stability • Amazon found that every 100 ms of latency cost them 1% in sales • Google found that an extra 0.5 seconds in search page generation time dropped traffic by 20% • Speed matters!

  4. Performance failures happen • 22% (What Bugs Live in the Cloud? A Study of 3000+ Issues in Cloud Systems, SoCC ’14)

  5. Outline • Performance Bug • System Performance Verifier

  6. Performance Bug • Jobs take several times longer than usual to finish • Improper speculative execution: JCH1 & TPL1 & FPL2 & FTY1 • Unnecessary repeated recovery: TPL1 & TPL4 & FTY4 & TOP1

  7. Untriggered SpecExec • Conditions: map reads locally (DLCA), mappers and reducers in different nodes (TPLA), all-to-all (JCHA), fault at map node (FPLA), slow NIC (FTYA) • Diagram: mappers M1, M2, M3 and reducers; M2 is slow, so all reducers are slow • No straggler = no SpecExec • DLCA & TPLA & JCHA & FPLA & FTYA
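
The "no straggler = no SpecExec" point can be illustrated with a minimal sketch. It assumes a relative rule in which a task counts as a straggler only when its progress is well below the average of its peers; the class name, the rule, and the 0.2 gap are illustrative assumptions, not Hadoop's actual policy.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of relative straggler detection; not Hadoop code.
final class StragglerSketch {
    // Assumed gap: a task must trail the average progress by more than this.
    static final double GAP = 0.2;

    // Returns the indices of tasks whose progress is below (average - GAP).
    static List<Integer> stragglers(double[] progress) {
        double sum = 0.0;
        for (double p : progress) {
            sum += p;
        }
        double average = sum / progress.length;
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < progress.length; i++) {
            if (progress[i] < average - GAP) {
                result.add(i);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // One slow task among fast peers is detected.
        System.out.println(stragglers(new double[] {0.9, 0.3, 0.9}));
        // All tasks equally slow: nobody trails the average, so nothing is detected.
        System.out.println(stragglers(new double[] {0.3, 0.3, 0.3}));
    }
}
```

Under this assumed rule, a fault that slows every reducer equally leaves no task behind its peers, so speculation never starts.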

  8. Untriggered SpecExec, cont. • DLCA & TPLA & JCHA & FPLA & FTYA • DLCB = read remote • Diagram: mappers M1, M2, M3 read from a DN; M2: Straggler!

  9. Untriggered SpecExec, cont. • DLCA & TPLA & JCHA & FPLA & FTYA • FPLB = slow reducer • Diagram: mappers M1, M2, M3 and reducers, with one Straggler!

  10. O(n) Recovery • Conditions: mappers and reducers in different nodes (TPLA), mappers and reducers in different racks (TPLB), large number of nodes per rack (TOPA), slow inter-rack switch (FTYB) • Diagram: mappers (M) and a reducer (R) spread across Rack 1 and Rack 2; slow! • TPLA & TPLB & TOPA & FTYB

  11. Conditions that lead to performance bugs • Untriggered Speculative Execution: MR-70001 = JCH1 & TPL1 & FPL2 & FTY1; MR-70002 = DSR1 & DLC1 & FPL1 & FTY1; MR-5533 = FTY2 & FPL3 & TPL3; … • O(n) Recovery: MR-5251 = FTY3 & FPL3 & FTM1; MR-5060 = TPL1 & TPL3 & FTY1 & FPL2; MR-1800 = TPL1 & TPL4 & FTY4 & TOP1; … • Long lock contention: MR-9191 = FTY3 & FPL3 & FTM1; MR-9292 = TPL1 & TPL3 & FTY1 & FPL2; MR-9393 = TPL1 & TPL4 & FTY4 & TOP1; …
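
Each entry above is a conjunction of condition tags, so a deployment scenario matches a bug when it contains all of that bug's tags. A minimal sketch of that check follows; the class and method names are hypothetical, and only three of the listed entries are included.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: bugs as conjunctions of condition tags taken from the slide.
final class BugConditions {
    // Bug ID -> conditions that must all hold (the "&" on the slide).
    static final Map<String, Set<String>> BUGS = new LinkedHashMap<>();

    static {
        BUGS.put("MR-70001", Set.of("JCH1", "TPL1", "FPL2", "FTY1"));
        BUGS.put("MR-5533", Set.of("FTY2", "FPL3", "TPL3"));
        BUGS.put("MR-1800", Set.of("TPL1", "TPL4", "FTY4", "TOP1"));
    }

    // Returns the IDs of bugs whose full condition set is contained in the scenario.
    static List<String> triggeredBy(Set<String> scenario) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, Set<String>> entry : BUGS.entrySet()) {
            if (scenario.containsAll(entry.getValue())) {
                hits.add(entry.getKey());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // A scenario may carry extra tags; only full containment counts as a match.
        System.out.println(triggeredBy(Set.of("JCH1", "TPL1", "FPL2", "FTY1", "TOP1")));
    }
}
```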

  12. Outline • Performance Bug • System Performance Verifier

  13. Current Approach • Benchmarking • Hundreds of benchmarks for every scenario • Injecting slowdowns and failures • Takes days to weeks!

  14. What we want… • Four goals in performance verification • Fast • Covers many deployment scenarios • Runs in pre-deployment • Directly checks implementation code • Formal modeling tools!

  15. System Performance Verifier (SPV) • Target system (e.g., Hadoop code), annotated: @Data public class JobInProgress { JobID jobId; TaskInProgress maps[]; ... } @IO public HeartbeatResponse heartbeat(HeartbeatData hd) { ... } • SPV Compiler • Auto-generated model (in Colored Petri Net), 20X larger than hand model • Performance Verification

  16. Colored Petri Nets (CPN) • Place Tasks holds (“T1”,map); place Node holds A @0 • Transition Schedule Task: input arcs task and node, output arc assignment @+10 • Place Task to Run receives (A,“T1”,map) @10 • Code segment: input(node,task); output(assignment); action let val (id,type) = task in (node,id,type) end;
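
The Schedule Task transition can be read as a function that consumes one token from each input place and emits a time-stamped token. The following is a rough Java rendering of that reading; all class and field names are hypothetical, and this is neither SPV output nor CPN Tools code.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical Java rendering of the "Schedule Task" transition from the slide.
final class ScheduleTaskSketch {
    // The "@+10" inscription on the output arc.
    static final int DELAY = 10;

    static final class Task {
        final String id;
        final String type;
        Task(String id, String type) { this.id = id; this.type = type; }
    }

    // A node token with the model time at which it becomes available (e.g. A @0).
    static final class Node {
        final String name;
        final int readyAt;
        Node(String name, int readyAt) { this.name = name; this.readyAt = readyAt; }
    }

    static final class Assignment {
        final String node;
        final String id;
        final String type;
        final int time;
        Assignment(String node, String id, String type, int time) {
            this.node = node; this.id = id; this.type = type; this.time = time;
        }
        @Override public String toString() {
            return "(" + node + ",\"" + id + "\"," + type + ") @" + time;
        }
    }

    // Fires once: removes one token from each input place and returns the output token.
    // Throws NoSuchElementException if a place is empty (the transition is not enabled).
    static Assignment fire(Deque<Node> nodePlace, Deque<Task> tasksPlace) {
        Node node = nodePlace.remove();
        Task task = tasksPlace.remove();
        return new Assignment(node.name, task.id, task.type, node.readyAt + DELAY);
    }

    public static void main(String[] args) {
        Deque<Node> nodePlace = new ArrayDeque<>();
        nodePlace.add(new Node("A", 0));
        Deque<Task> tasksPlace = new ArrayDeque<>();
        tasksPlace.add(new Task("T1", "map"));
        System.out.println(fire(nodePlace, tasksPlace));
    }
}
```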

  17. Challenges: Two Different Worlds • CPN • Java

  18. Our Approach • Java → SysJava: data flattening, code modularization, annotation tagging • SysJava → Model: compiler

  19. Data Flattening • Java system states = ArrayList, Map, Tree, … • CPN states = multisets • Java: List<JobInProgress> runningJobs; public class JobInProgress { JobID jobId; TaskInProgress maps[]; ... } class TaskInProgress { TaskID id; double progress; ... } • CPN places: Job In Progress = [(1)]; Job Task Mapping = [(1,a),(1,b)]; Task In Progress = [(a,10%),(b,15%)]
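
A minimal sketch of the flattening idea: the nested job and task objects are turned into three flat collections of tuples that mirror the three places on the slide. The simplified field types (int and String instead of JobID and TaskID) and every name below are assumptions made for illustration, not the SPV compiler's actual representation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of data flattening; tuples are rendered as strings for brevity.
final class FlattenSketch {
    static final class TaskInProgress {
        final String id;
        final double progress; // fraction in [0, 1]
        TaskInProgress(String id, double progress) { this.id = id; this.progress = progress; }
    }

    static final class JobInProgress {
        final int jobId;
        final List<TaskInProgress> maps;
        JobInProgress(int jobId, List<TaskInProgress> maps) { this.jobId = jobId; this.maps = maps; }
    }

    // One flat list per place: no tuple contains a nested collection.
    static final class Flat {
        final List<String> jobInProgress = new ArrayList<>();
        final List<String> jobTaskMapping = new ArrayList<>();
        final List<String> taskInProgress = new ArrayList<>();
    }

    static Flat flatten(List<JobInProgress> runningJobs) {
        Flat flat = new Flat();
        for (JobInProgress job : runningJobs) {
            flat.jobInProgress.add("(" + job.jobId + ")");
            for (TaskInProgress tip : job.maps) {
                // The nesting job -> tasks becomes an explicit (job, task) relation.
                flat.jobTaskMapping.add("(" + job.jobId + "," + tip.id + ")");
                flat.taskInProgress.add("(" + tip.id + "," + Math.round(tip.progress * 100) + "%)");
            }
        }
        return flat;
    }

    public static void main(String[] args) {
        JobInProgress job = new JobInProgress(1,
                List.of(new TaskInProgress("a", 0.10), new TaskInProgress("b", 0.15)));
        Flat flat = flatten(List.of(job));
        System.out.println(flat.jobInProgress);
        System.out.println(flat.jobTaskMapping);
        System.out.println(flat.taskInProgress);
    }
}
```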

  20. Code Modularization • Original code: private boolean processHeartbeat(TaskTrackerStatus trackerStats) { synchronized (taskTrackers) { ... } for (TaskStatus ts : trackerStats) { tasks.get(ts.id).updateStatus(ts); } ... } • Modular function: @ProcessState private void initCheck() { synchronized (taskTrackers) { ... } } • Control flow logic: @ForEach private void updateStatuses(TaskTrackerStatus trackerStats) { for (TaskStatus ts : trackerStats) { ... } } • CRUD logic: @GetState private TaskInProgress getTask(TaskID id) { return tasks.get(id); } @UpdateState private void tipUpdate(TaskInProgress tip, TaskStatus ts) { tip.updateStatus(ts); }

  21. Annotation Tagging • Assists the compiler • Annotation categories: Data Structure; I/O; CRUD & Process; Miscellaneous • Example: @Data public class JobInProgress { JobID jobId; TaskInProgress maps[]; ... } @IO public HeartbeatResponse heartbeat(HeartbeatData hd) { ... }
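
To make the tagging concrete, here is a sketch of how marker annotations such as @Data and @IO could be declared and discovered in plain Java. The declarations, the stand-in classes, and the use of runtime reflection are assumptions for illustration only; the slides say SPV is built on WALA, so its compiler presumably reads the tags through static analysis instead.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;

// Hypothetical declarations of two marker annotations and a reflective lookup.
final class AnnotationSketch {
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.TYPE)
    @interface Data {}

    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.METHOD)
    @interface IO {}

    // Stand-in for a tagged data structure.
    @Data
    static final class JobInProgress {
        int jobId;
    }

    // Stand-in for a class with one tagged I/O entry point and one untagged helper.
    static final class JobTracker {
        @IO
        public String heartbeat(String heartbeatData) {
            return "ack:" + heartbeatData;
        }

        public void helper() {}
    }

    // Collects the names of methods tagged with @IO in the given class.
    static List<String> ioMethods(Class<?> cls) {
        List<String> names = new ArrayList<>();
        for (Method method : cls.getDeclaredMethods()) {
            if (method.isAnnotationPresent(IO.class)) {
                names.add(method.getName());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(JobInProgress.class.isAnnotationPresent(Data.class));
        System.out.println(ioMethods(JobTracker.class));
    }
}
```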

  22. Model Checking • SPV Compiler → Executable XML • Define configurations, assertions, and specifications • Explore every non-deterministic choice, e.g., task-to-node mapping • Diagram: with task (“T1”,map) in Tasks and nodes A and B in Node, Schedule Task can put either (A,“T1”,map) or (B,“T1”,map) in Task to Run: T1 on A, or T1 on B
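
Exploring every non-deterministic choice amounts to enumerating all possible outcomes of each choice point. A minimal sketch for the task-to-node mapping follows, assuming each task may be placed on any node; the names are hypothetical, and a real model checker would explore full system states rather than string labels.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: exhaustive enumeration of task-to-node mappings.
final class ExploreSketch {
    // Returns every complete mapping; each mapping lists "task on node" labels.
    static List<List<String>> explore(List<String> tasks, List<String> nodes) {
        List<List<String>> results = new ArrayList<>();
        explore(tasks, nodes, 0, new ArrayList<>(), results);
        return results;
    }

    // Depth-first search: branch on the node chosen for task number taskIndex.
    private static void explore(List<String> tasks, List<String> nodes, int taskIndex,
                                List<String> current, List<List<String>> results) {
        if (taskIndex == tasks.size()) {
            results.add(new ArrayList<>(current)); // copy, since current is reused
            return;
        }
        for (String node : nodes) {
            current.add(tasks.get(taskIndex) + " on " + node);
            explore(tasks, nodes, taskIndex + 1, current, results);
            current.remove(current.size() - 1); // backtrack before trying the next node
        }
    }

    public static void main(String[] args) {
        // The slide's example: one task, two nodes, hence two branches.
        System.out.println(explore(List.of("T1"), List.of("A", "B")));
    }
}
```

The number of mappings grows as nodes to the power of tasks, which is one reason exhaustive exploration is done on a model rather than on a live cluster.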

  23. Preliminary Result • 5305 lines of code on top of WALA & Access/CPN • Hadoop MapReduce 1.2.1, with 1067 lines of code changes • 20x larger than hand-made model • 34 scenarios, 30 assertion violations, 4 performance bugs • 1.5 hours of model checking

  24. Thank you! Questions? http://ucare.cs.uchicago.edu

  25. Discussion • Is it time for pre-deployment detection of performance bugs? • Bridging system code and formal methods • Future of data-centric languages • Beyond Hadoop • Root cause anatomy of performance bugs • Beyond performance bugs
