
Modeling and Optimization for Customized Computing: Performance, Energy and Cost Perspective


Presentation Transcript


  1. Modeling and Optimization for Customized Computing: Performance, Energy and Cost Perspective Peipei Zhou Ph.D. Final Defense June 10th, 2019 Committee: Jason Cong (Chair) Glenn Reinman Jae Hoon Sul Tony Nowatzki

  2. Publication Conference Publication [ICCAD ‘18] Y. Chi, J. Cong, P. Wei, and P. Zhou, “SODA: Stencil with Optimized Dataflow Architecture”, IEEE/ACM International Conference on Computer-Aided Design (ICCAD), 2018. Best Paper Candidate [FCCM ‘18] J. Cong, P. Wei, C.H. Yu, and P. Zhou, “Latte: Locality Aware Transformation for High-Level Synthesis”, IEEE International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2018. [FCCM ‘18] Z. Ruan, T. He, B. Li, P. Zhou, and J. Cong. “ST-Accel: A High-Level Programming Platform for Streaming Applications on FPGA”, IEEE International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2018. [ISPASS ‘18] P. Zhou, Z. Ruan, Z. Fang, M. Shand, D. Roazen, and J. Cong, “Doppio: I/O-Aware Performance Analysis, Modeling and Optimization for In-Memory Computing Framework”, IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2018. Best Paper Candidate (4 out of 21 accepted)

  3. Publication Conference Publication [DAC ‘17] J. Cong, P. Wei, C.H. Yu, and P. Zhou, “Bandwidth optimization through on-chip memory restructuring for HLS”, Design Automation Conference (DAC), 2017. [ICCAD ‘16] C. Zhang, Z. Fang, P. Zhou, P. Pan, J. Cong, “Caffeine: Towards Uniformed Representation and Acceleration for Deep Convolutional Neural Networks”, IEEE/ACM International Conference on Computer-Aided Design (ICCAD), 2016. [FCCM ‘16] P. Zhou, H. Park, Z. Fang, J. Cong, A. DeHon, “Energy Efficiency of Full Pipelining: A Case Study for Matrix Multiplication”, IEEE International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2016. [FCCM ‘14] J. Cong, H. Huang, C. Ma, B. Xiao, P. Zhou, “A Fully Pipelined and Dynamically Composable Architecture of CGRA”, IEEE International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2014.

  4. Publication Journal and Poster Publication [TCAD ‘18] C. Zhang, G. Sun, Z. Fang, P. Zhou, P. Pan, J. Cong, “Caffeine: Towards Uniformed Representation and Acceleration for Deep Convolutional Neural Networks”, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), 2018. TCAD Donald O. Pederson Best Paper Award 2019. [DAC ‘18] Y. Chi, P. Zhou, J. Cong, “An Optimal Microarchitecture for Stencil Computation with Data Reuse and Fine-Grained Parallelism (Work-in-Progress Accept)”, Design Automation Conference (DAC), 2018. [FPGA ‘18] Y. Chi, P. Zhou, J. Cong, “An Optimal Microarchitecture for Stencil Computation with Data Reuse and Fine-Grained Parallelism (Abstract Only)”, International Symposium on Field-Programmable Gate Arrays (FPGA), 2018. [FPGA ‘16] Y. Chen, J. Cong, Z. Fang, P. Zhou, “ARAPrototyper: Enabling Rapid Prototyping and Evaluation for Accelerator-Rich Architecture (Abstract Only)”, ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), 2016. [HITSEQ ‘15] Y. Chen, J. Cong, J. Lei, S. Li, M. Peto, P. Spellman, P. Wei, P. Zhou, “CS-BWAMEM: A fast and scalable read aligner at the cloud scale for whole genome sequencing”, High Throughput Sequencing, Algorithms and Applications (HiTSeq) (Poster), 2015. Best Poster Award

  5. Accelerators Parallelization Customization • CPU core scaling is coming to an end • CPUs are power hungry • Accelerators: FPGAs (Field-Programmable Gate Arrays) • Energy efficiency 10-100X higher than CPUs

  6. Accelerator at Different Levels Battery Life & Performance Matter! • Single Chip Level • System-on-Chip (SoC) FPGAs • IoT devices, self-driving cars

  7. Accelerator at Different Levels Performance & Dollars Matter! • Node Level • CPU <-> FPGA accelerators through PCIe • Host offloads kernels onto FPGAs

  8. Accelerator at Different Levels Performance & Dollars Matter! • Cluster Level • 2014: Bing web search & image search engine • FPGA accelerators in various cloud services • IBM Cloud and Microsoft: FPGA and GPU in 2014 • Amazon AWS, Baidu, Huawei, Alibaba starting in 2016 • By 2019, also Nimbix and Tencent

  9. Accelerator at Different Levels Performance & Dollars Matter! • Cluster Level • Fully pipelining CPUs and FPGAs gives more cost benefit

  10. Outline • Introduction • Chip Level Performance and Energy Modeling • Energy Efficiency of Full Pipelining (Chapter 2) • Latte: Frequency and performance optimization (Chapter 3) • Node Level Performance and Cost Modeling: • Computation Resources (Chapter 4) • Doppio, storage resources (Chapter 5) • Cluster Level Performance and Cost Modeling • Public Cloud Cost Optimization: • Composable Compute Instance: Mocha (Chapter 6) • Compute + Storage Co-optimization (Chapter 7) • Private Cloud Throughput Optimization (Chapter 8) • Conclusion

  11. Applications: GATK Best Practice by Broad Institute • Genome Pipeline Stages: • Alignment: BWA-MEM (Burrows-Wheeler Aligner) • Base-Quality Score Recalibration: GATK (Genome Analysis Toolkit) • Calling Variants: GATK, HaplotypeCaller (HTC) to find germline variants, Mutect2 for tumor sequences • Applies to single-sample whole genome sequence (WGS) or whole exome sequence (WES) analysis

  12. Applications: Genome Pipeline • Genome Pipeline Stages: • Alignment: BWA-MEM, Smith-Waterman (FPGA accelerator) • Base-Quality Score Recalibration: GATK • Calling Variants: HaplotypeCaller+Mutect2, PairHMM (FPGA accelerator)

  13. Outline • Introduction • Chip Level Performance and Energy Modeling • Energy Efficiency of Full Pipelining (Chapter 2) • Latte: Frequency and performance optimization (Chapter 3) • Node Level Performance and Cost Modeling: • Computation Resources (Chapter 4) • Doppio, storage resources (Chapter 5) • Cluster Level Performance and Cost Modeling • Public Cloud Cost Optimization: • Composable Compute Instance: Mocha (Chapter 6) • Compute + Storage Co-optimization (Chapter 7) • Private Cloud Throughput Optimization (Chapter 8) • Conclusion

  14. Motivation: How Does Using FPGA Accelerators Impact an Application’s Out-of-Pocket Cost in Public Cloud Services? • From CPU solution -> CPU + FPGA for HTC on AWS Cloud • Latency improves 1.60x • Cost increases 2.58x

  15. Motivation: How Does Using FPGA Accelerators Impact an Application’s Out-of-Pocket Cost in Public Cloud Services? • From CPU solution -> CPU + FPGA for Mutect2 on Huawei Cloud • Latency improves 1.49x • Cost increases 1.16x

  16. Why? • FPGA is not cheap in the public cloud • AWS: 1 FPGA = 25 vCPUs in price • Huawei: 1 FPGA = 23 vCPUs in price • Amdahl’s law: the theoretical speedup is always limited by the part of the task that cannot benefit from the improvement • For HTC, the PairHMM kernel takes 39% of the runtime • Application speedup is bounded by 1/(1-0.39) = 1.63
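The bound quoted above is the limiting case of Amdahl's law; for reference (with r the accelerable fraction and S the kernel speedup on the FPGA):

```latex
% Amdahl's law: overall speedup when a fraction r of the work is accelerated by S
\[
\mathrm{Speedup}(r, S) = \frac{1}{(1 - r) + r/S},
\qquad
\lim_{S \to \infty} \mathrm{Speedup}(r, S) = \frac{1}{1 - r}.
\]
% For HTC, r = 0.39: even an infinitely fast PairHMM accelerator caps the
% application speedup at 1/(1 - 0.39), the 1.63x bound quoted on the slide.
```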

  17. Challenges • Applications are different • BWA, Samtools, and GATK 3.6 are all single-node, multi-threaded programs • BWA and Samtools are written in C, GATK 3.6 in Java

  18. Performance Model on CPU-FPGA Platform • “Well Matched” FPGA vs CPUs, e.g. r = 0.5, S = 8, P = 8

  19. Performance Model on CPU-FPGA Platform • Fast FPGA, e.g. r = 0.39, (1-r)/r = 1.56, S = 40, P' = 64, P = 8 • Leaves 1 - 8/64 = 87% of the FPGA resource idle. Waste!

  20. Performance Model on CPU-FPGA Platform • Slow FPGA, e.g. r = 0.83, (1-r)/r = 0.2, S = 40, P' = 8, P = 32 • Leaves 32 - 8 = 24 CPU cores idle. Waste!
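A minimal sketch of the matching calculation behind slides 18-20, assuming the relation P' = S·(1-r)/r (the core count whose non-accelerated work rate matches one FPGA's kernel rate); the function names and rounding are illustrative, not taken from the thesis:

```python
def matching_cores(r: float, S: float) -> float:
    """Cores whose throughput on the non-accelerated part (1-r) matches one FPGA
    running the kernel (fraction r) S times faster than a single core."""
    return S * (1.0 - r) / r

def fpga_idle_fraction(P: int, r: float, S: float) -> float:
    """Fraction of FPGA capacity left unused when only P cores feed it."""
    return max(0.0, 1.0 - P / matching_cores(r, S))

# Fast FPGA (slide 19): r = 0.39, S = 40 -> P' ~ 63 (64 on the slide);
# with P = 8 cores, roughly 1 - 8/64 = 87% of the FPGA sits idle.
print(round(matching_cores(0.39, 40)), fpga_idle_fraction(8, 0.39, 40))

# Slow FPGA (slide 20): r = 0.83, S = 40 -> P' ~ 8; with P = 32, 24 cores idle.
print(round(matching_cores(0.83, 40)))
```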

  21. Proposed Optimization for Fast FPGA: Sharing the FPGA among Instances over the Network • Matching throughput of CPUs and FPGAs

  22. Slow FPGA: FPGA is the Bottleneck • FPGA offloads the CPU operation; P = executor-cores, starting at P = 1 • 1. As P increases, runtime decreases • 2. FPGA becomes the bottleneck: effective CPU cores remain the same (8), and runtime remains the same for P at or above that point (P > 8)

  23. Proposed Optimization for Slow FPGA: Co-execution of FPGA + CPU • M1 tasks: offload the acceleratable kernels to FPGAs • M2 tasks: instead of offloading, schedule them onto CPUs • Partial offload: co-execution of FPGA + CPU (a sketch of one way to pick the split follows)
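The slides do not give the splitting rule, so the following is only an illustrative sketch: under the simplifying assumptions that each task costs T on one CPU core, the FPGA runs the kernel (fraction r) S times faster than a core, and CPU work divides evenly over P cores, the M1/M2 split can be chosen so the FPGA and the CPUs finish together.

```python
def makespan(m1: int, m_total: int, r: float, S: float, P: int, T: float = 1.0) -> float:
    """Makespan when m1 tasks take the FPGA path and the rest run entirely on CPUs.
    Simplifying assumptions (not from the slides): per-task cost T on one core,
    kernel fraction r accelerated S-fold on the FPGA, CPU work divisible over P cores."""
    m2 = m_total - m1
    fpga_time = m1 * r * T / S                 # FPGA busy time
    cpu_time = (m1 * (1 - r) + m2) * T / P     # aggregate CPU work spread over P cores
    return max(fpga_time, cpu_time)

def best_split(m_total: int, r: float, S: float, P: int) -> int:
    """Pick the M1 that minimizes the makespan (brute force, for illustration)."""
    return min(range(m_total + 1), key=lambda m1: makespan(m1, m_total, r, S, P))

# Slow-FPGA setting from slide 20 (r = 0.83, S = 40, P = 32 cores): offloading
# everything makes the FPGA the bottleneck; co-execution instead keeps the
# otherwise-idle cores busy running whole tasks.
print(best_split(1000, r=0.83, S=40, P=32))
```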

  24. Outline • Introduction • Chip Level Performance and Energy Modeling • Energy Efficiency of Full Pipelining (Chapter 2) • Latte: Frequency and performance optimization (Chapter 3) • Node Level Performance and Cost Modeling: • Computation Resources (Chapter 4) • Doppio, storage resources (Chapter 5) • Cluster Level Performance and Cost Modeling • Public Cloud Cost Optimization: • Composable Compute Instance: Mocha (Chapter 6) • Compute + Storage Co-optimization (Chapter 7) • Private Cloud Throughput Optimization (Chapter 8) • Conclusion

  25. Mocha: Multinode Optimization of Cost in Heterogeneous Cloud with Accelerators • 1. Profiling: r + speedup S -> matching core number P' • 2. Setting up the cluster: HTC on AWS, P' = 64; instances: f1.2x: 8, m4.10x: 40, m4.4x: 16 • 3. Mocha runtime: CPU clients <-> NAM <-> Accelerator

  26. Optimization Results • HTC on AWS EC2 • Latency improvement 1.60x -> 1.58x, almost the same performance • Cost 2.58x -> 0.93x, cost efficiency improvement: 2.77x • f1.16x (64 CPUs + 8 FPGAs) -> f1.2x + m4.16x (64 CPUs + 1 FPGA): make more use of a single FPGA

  27. Optimization Results • Mutect2 on Huawei Cloud • Latency improvement 1.49x -> 2.27x, performance improvement: 1.5x • Cost 1.16x -> 0.76x, cost efficiency improvement: 1.5x • Effective CPUs: 8 -> 32, make more use of the CPUs

  28. Outline • Introduction • Chip Level Performance and Energy Modeling • Energy Efficiency of Full Pipelining (Chapter 2) • Latte: Frequency and performance optimization (Chapter 3) • Node Level Performance and Cost Modeling: • Computation Resources (Chapter 4) • Doppio, storage resources (Chapter 5) • Cluster Level Performance and Cost Modeling • Public Cloud Cost Optimization: • Composable Compute Instance: Mocha (Chapter 6) • Compute + Storage Co-optimization (Chapter 7) • Private Cloud Throughput Optimization (Chapter 8) • Conclusion

  29. Problem Formulation: Public Cloud, Cost Optimization • Samples: one genome sequence? A batch of genome sequences? • Deadline: 6 hours? 12 hours? 24 hours? No deadline constraint? • Objective: minimum cost in a public cloud like Amazon EC2 • 1 read = $? $Billions in savings!

  30. Prior Work • ILP formulations are common • Cost of running scientific workloads (an astronomy application) on public clouds with different numbers of scheduled processors [DSL08] • Cost-optimal scheduling in the public cloud, formulated as a linear program over different memory, CPU, and data communication overheads [VVB10] • Limitations • Storage not considered • Composable instances not considered • [DSL08] Ewa Deelman, Gurmeet Singh, Miron Livny, Bruce Berriman, and John Good, “The cost of doing science on the cloud: the Montage example,” in SC ’08: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, 2008. • [VVB10] R. Van den Bossche, K. Vanmechelen, and J. Broeckhove, “Cost-Optimal Scheduling in Hybrid IaaS Clouds for Deadline Constrained Workloads,” in 2010 IEEE 3rd International Conference on Cloud Computing, 2010.

  31. Public Cloud: Storage + Compute

  32. Multiple Computation Instances • Different instances have different CPU types • Within each CPU type series, instances differ in the number of CPU cores, memory size, and price

  33. Multiple Storage Choices -> I/O-Aware Modeling • Storage choices: • SSD io1 price: $0.125 per GB-month, 0.125*250/24/30 = $0.0434/hr • SSD gp2 price: $0.1 per GB-month, 0.1*250/24/30 = $0.0347/hr • Doppio [ISPASS ’18] • I/O-aware analytic model • Quantify the I/O impact • Customized storage for optimal cost

  34. Multiple Storage Choices: Size- or IOPS-Driven? • For the genome pipeline, size matters • vcf: 20G • fastq: 3G • Input fastq: 104GB • Output BAM: 110G • Local folder: 250GB (https://aws.amazon.com/ebs/pricing/) • Choose the SSD gp2 price and update the table price • SSD io1 price: $0.125 per GB-month, 0.125*250/24/30 = $0.0434/hr • SSD gp2 price: $0.1 per GB-month, 0.1*250/24/30 = $0.0347/hr
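The per-hour figures above come from converting the EBS $/GB-month quote to an hourly cost for the 250 GB per-genome volume; a small sketch of that arithmetic (30-day month assumed, as on the slide):

```python
# Convert an EBS price quoted in $ per GB-month to $/hr for a fixed volume size.
def volume_cost_per_hour(price_per_gb_month: float, size_gb: float) -> float:
    return price_per_gb_month * size_gb / (24 * 30)

print(volume_cost_per_hour(0.125, 250))  # io1: ~$0.0434/hr
print(volume_cost_per_hour(0.10, 250))   # gp2: ~$0.0347/hr
```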

  35. Different Runtime of Stages in Instances • Amazon EC2 instances and runtime (seconds) for different stages.

  36. Scheduling by Solving a MILP (Mixed Integer Linear Programming) Problem • Inputs • Objective: minimize total cost • Constraints (details later) • Output: allocation variables indicating that genome s, stage t is allocated to the m-th instance of type i

  37. MILP Constraints • Scheduling constraints: • Hardware resource constraints: if two tasks are scheduled onto the same instance, then the two tasks do not overlap • Deadline constraints: finish time < deadline • Task dependency constraints: BWA -> BQSR -> HTC • Solver: IBM CPLEX Studio
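The thesis solves this formulation with IBM CPLEX Studio; purely as an illustration of this kind of model, here is a minimal sketch in PuLP for a single genome, with made-up instance types, runtimes, and prices (the variable names x/start/finish are assumptions), and with the per-instance non-overlap constraints omitted for brevity:

```python
# Minimal illustrative sketch (not the thesis model): one genome, three stages,
# choose an instance type per stage to minimize cost under a deadline.
from pulp import LpProblem, LpMinimize, LpVariable, lpSum, LpBinary, PULP_CBC_CMD

stages = ["bwa", "bqsr", "htc"]
types = ["c4.4x", "c4.8x"]                                        # hypothetical types
runtime = {("bwa", "c4.4x"): 30000, ("bwa", "c4.8x"): 16000,      # seconds (made up)
           ("bqsr", "c4.4x"): 20000, ("bqsr", "c4.8x"): 11000,
           ("htc", "c4.4x"): 25000, ("htc", "c4.8x"): 14000}
price = {"c4.4x": 0.796 / 3600, "c4.8x": 1.591 / 3600}            # $/second (made up)
deadline = 60000                                                   # seconds

prob = LpProblem("genome_scheduling", LpMinimize)
x = {(t, i): LpVariable(f"x_{t}_{i}", cat=LpBinary) for t in stages for i in types}
start = {t: LpVariable(f"start_{t}", lowBound=0) for t in stages}
finish = {t: LpVariable(f"finish_{t}", lowBound=0) for t in stages}

# Objective: total cost = sum over stages of (runtime on the chosen type) * (its price)
prob += lpSum(x[t, i] * runtime[t, i] * price[i] for t in stages for i in types)

for t in stages:
    prob += lpSum(x[t, i] for i in types) == 1                    # exactly one type per stage
    prob += finish[t] == start[t] + lpSum(x[t, i] * runtime[t, i] for i in types)
    prob += finish[t] <= deadline                                  # deadline constraint

# Task dependency constraints: BWA -> BQSR -> HTC
prob += start["bqsr"] >= finish["bwa"]
prob += start["htc"] >= finish["bqsr"]

prob.solve(PULP_CBC_CMD(msg=False))
print({t: i for t in stages for i in types if x[t, i].value() == 1})
```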

  38. Scheduling Results for One Read • Cost under different time constraints • When the deadline is as small as 19800 seconds (5.5 hrs): $70 to finish a WGS • When the deadline is as large as 87800 seconds (24.3 hrs): $10 to finish a WGS

  39. Scheduling Results When Mocha is Considered • Composable instances • “New” instance type for HTC: f1.2x + m4.16x • The cost to meet a deadline as small as 19800 seconds is reduced from $70 to $56

  40. Scheduling Results for Batches of Reads • The cost saving from 1 -> 2 sequences comes from reduced node-setup overhead, and the benefit is small • numSeq = 1: $10.7, ddl > 87800 secs • numSeq = 2: $10.6, ddl > 139000 secs • [Schedule figure: S0/S1 bwa, bqsr, and htc stages interleaved across instances]

  41. Batches of Genome Sequences (Reads) • Number of genome sequences = 2, 3, 4, 5 • numSeq = 1, minCost = 10.734 when ddl > 87800 • numSeq = 2, minCost = 10.616 when ddl > 139000, -1.096% • numSeq = 3, minCost = 10.557 when ddl > 187331, -0.554% • numSeq = 4, minCost = 10.528 when ddl > 237748, -0.279% • numSeq = 5, minCost = 10.513 when ddl > 288165, -0.140%

  42. Insight • Duplicating the numSeq = 2 configuration is good enough for any given number of sequences • [Schedule figure: the two-genome bwa/bqsr/htc schedule repeated across instances]

  43. Outline • Introduction • Chip Level Performance and Energy Modeling • Energy Efficiency of Full Pipelining (Chapter 2) • Latte: Frequency and performance optimization (Chapter 3) • Node Level Performance and Cost Modeling: • Computation Resources (Chapter 4) • Doppio, storage resources (Chapter 5) • Cluster Level Performance and Cost Modeling • Public Cloud Cost Optimization: • Composable Compute Instance: Mocha (Chapter 6) • Compute + Storage Co-optimization (Chapter 7) • Private Cloud Throughput Optimization (Chapter 8) • Conclusion

  44. Problem Formulation: Private Cloud, Throughput Optimization • Given hardware: 56-core CPU, 2TB disk • Given number of sequences (reads) • Objective: minimum runtime

  45. Private Cloud, Runtime Matters • Objective is to minimize runtime • Assumptions: • CPU cores on a server are limited: 56 cores • Stage runtime with P cores is modeled as c0 + t0/P (constant part c0, parallelizable part t0) • Machine storage is limited -> parallel genomes are limited; for example, on a 2TB server at most 2000GB/250GB = 8 whole genomes can run in parallel
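A small sketch of that runtime model; the two-point calibration helper is an illustrative assumption, not part of the thesis:

```python
def stage_runtime(P: int, c0: float, t0: float) -> float:
    """Slide-45 model: constant part c0 plus parallelizable part t0 spread over P cores."""
    return c0 + t0 / P

def fit_c0_t0(p1: int, r1: float, p2: int, r2: float) -> tuple[float, float]:
    """Hypothetical helper: recover c0 and t0 from two measured (cores, runtime)
    points by solving r = c0 + t0/P as a 2x2 linear system."""
    t0 = (r1 - r2) / (1.0 / p1 - 1.0 / p2)
    c0 = r1 - t0 / p1
    return c0, t0

MAX_PARALLEL_GENOMES = 2000 // 250   # 2TB disk, ~250GB working set per genome -> 8
```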

  46. Scheduling by Solving a MILP Problem • Inputs • Objective: minimize total runtime • Constraints (details later) • Output: p_i,j, the allocated number of cores for genome i

  47. MILP Constraints • Scheduling constraints: • Hardware resource constraints: if two tasks are scheduled onto the same CPU cores / storage space, then the two tasks do not overlap • Dependency constraints: if two tasks have a dependency, then the start time of the latter task >= the end time of the previous one • Solver: IBM CPLEX Studio
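For reference, a textbook way to linearize the "two tasks on the same resource must not overlap" rule is a big-M disjunction over a binary ordering variable; this is shown only as an illustration and is not necessarily the exact encoding used in the thesis:

```latex
% y_{ab} = 1 if task a finishes before task b starts on the shared resource, 0 otherwise;
% M is a constant larger than any feasible schedule length.
\begin{align*}
\mathrm{start}_b &\ge \mathrm{end}_a - M\,(1 - y_{ab})\\
\mathrm{start}_a &\ge \mathrm{end}_b - M\,y_{ab}\\
y_{ab} &\in \{0, 1\}
\end{align*}
% A dependency such as BWA -> BQSR needs no binary: start_{BQSR} >= end_{BWA}.
```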

  48. Scheduling Results • Solver: IBM CPLEX • Cores: 56, storage space: 8, number of genomes: 1, 2, 3, … • The first 4 cases return results in under 5 minutes; beyond 4 genomes the solver runs for more than 24 hours • When #genomes = 5, still unfinished after more than 27 hours

  49. Configuration of Optimal Results • #genomes = 1: both stages use 56 cores • #genomes = 2: both stages in the two genomes use 28 cores each • #genomes = 4: both stages in the four genomes use 14 cores each • #genomes = 3: split the 56 cores as evenly as possible for each stage, 19/19/18 for stage 1 and 18/18/20 for stage 2 (explained later) • [Schedule figure, “Evenly Splitting”, core count : runtime (s) pairs: 19: 35607, 18: 24662, 18: 24662, 19: 35607, 20: 22596, 18: 37304]
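A trivial sketch of the "as even as possible" split referenced above (the function name is an assumption):

```python
def even_split(total_cores: int, n_genomes: int) -> list[int]:
    """Split cores as evenly as possible, e.g. 56 cores over 3 genomes -> [19, 19, 18]."""
    base, extra = divmod(total_cores, n_genomes)
    return [base + 1 if i < extra else base for i in range(n_genomes)]

print(even_split(56, 3))  # [19, 19, 18]
print(even_split(56, 2))  # [28, 28]
```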

  50. ILP • For 4 < #genomes <= 8, solving takes an extremely long time even with CPLEX’s multithreading mode. • We put forward 3 heuristics, check their error rates against the optimal results on 3 problem sizes, and pick the best heuristic to obtain a good runtime when 4 < #genomes <= 8. • The 3 problem sizes (the first two are manually constructed): • Case A: #cores = 14, #storage space = 2 • Case B: #cores = 28, #storage space = 4 • Case C: #cores = 56, #storage space = 8
