System-on-Chip (SoC) Proliferation

A Polyhedral-based SystemC Modeling and Generation Framework for Effective Low-power Design Space Exploration. Wei Zuo¹, Warren Kemmerer¹, Jong Bin Lim¹, Louis-Noël Pouchet², Andrey Ayupov³, Taemin Kim³, Kyungtae Han³, Deming Chen¹


Presentation Transcript


  1. A Polyhedral-based SystemC Modeling and Generation Framework for Effective Low-power Design Space Exploration • Wei Zuo¹, Warren Kemmerer¹, Jong Bin Lim¹, Louis-Noël Pouchet², Andrey Ayupov³, Taemin Kim³, Kyungtae Han³, Deming Chen¹ • ¹University of Illinois at Urbana-Champaign • ²Ohio State University • ³Strategic CAD Labs of Intel Corporation

  2. System-on-Chip (SoC) Proliferation • SoC has a great impact on design metrics • Deal with the fast-growing design complexity • Virtual-platform-based hardware/software co-design • SystemC for high-level modeling • Accuracy of component modeling is the precondition • [Figure: ITRS 2007 SoC Consumer Portable Design Complexity Trends]

  3. Accelerator Design in SoC • Accelerator design is critical in SoC design • [Figure: Out-of-Core Accelerators] • Source: Shao and Brooks, Harvard [http://vlsiarch.eecs.harvard.edu/accelerators/die-photo-analysis]

  4. Challenges of the Accelerator Design/Modeling in SoC • How to accurately model performance and power early on? • Essential to enable rapid prototyping of SoC devices • Difficult with no physical information available • How to dramatically improve design productivity? • The traditional design flow is too slow: it relies mainly on manual processes • How to explore the large design space? • Micro-architecture choices for performance and cost trade-offs • Identifying the optimal design is crucial but difficult • System-level automation is the key

  5. Our Approaches and Contribution

  6. Overall Framework (1) • Automated C-to-SystemC transformation engine • Tile based characterization flow for latency and power estimation • Two versions of SystemC code generation • For system-level modeling and high-level synthesis

  7. SystemC Generation Framework

  8. SystemC Generation Framework • Tile the loops using polyhedral transformations • What is polyhedral transformation and why is this critical?

  9. Polyhedral Transformation • Fine-granularity optimization of affine programs for HLS • Exposes parallelism and locality for parallelization and data reuse • Atomic tiles, with their data reuse buffers and data transfer operations • Regularity is key for accurate power and latency characterization
     (a) Original code:
             for (i = 0; i < N; i++)
               for (j = 0; j < N; j++)
         s1:     A[i][j] += u1[i]*v1[j] + u2[i]*v2[j];
             for (k = 0; k < N; k++)
               for (l = 0; l < N; l++)
         s2:     x[k] += A[l][k]*y[l];
     (b) Domain of s1 [figure]
     (c) Dependence for edge s1 → s2 [figure]
     (d) Scheduling functions for s1 and s2 [figure]
     (e) Transformed code based on the scheduling functions in (d):
             for (c1 = 0; c1 < N; c1++)
               for (c2 = 0; c2 < N; c2++) {
                 A[c2][c1] += u1[c2]*v1[c1] + u2[c2]*v2[c1];
                 x[c1] += A[c2][c1]*y[c2];
               }
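
The fused loop nest in (e) is the kind of regular code that can then be cut into atomic tiles. The following is a minimal C++ sketch of that idea, not the framework's generated code: the function names, the integer element type, the test inputs, and the assumption that the tile size `T` divides `N` are all illustrative choices that do not come from the slides. It runs the original two loop nests and a fused, tiled version on the same inputs and checks that they agree.

```cpp
#include <cassert>
#include <vector>

using Vec = std::vector<long>;
using Mat = std::vector<Vec>;

// Original code from (a): two separate loop nests for s1 and s2.
void gemver_original(Mat& A, Vec& x, const Vec& u1, const Vec& v1,
                     const Vec& u2, const Vec& v2, const Vec& y) {
    const int N = static_cast<int>(x.size());
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            A[i][j] += u1[i] * v1[j] + u2[i] * v2[j];  // s1
    for (int k = 0; k < N; k++)
        for (int l = 0; l < N; l++)
            x[k] += A[l][k] * y[l];                    // s2
}

// Fused code from (e), additionally split into T x T tiles.
// Assumption (not from the slides): T divides N.
void gemver_fused_tiled(Mat& A, Vec& x, const Vec& u1, const Vec& v1,
                        const Vec& u2, const Vec& v2, const Vec& y, int T) {
    const int N = static_cast<int>(x.size());
    assert(T > 0 && N % T == 0);
    for (int c1t = 0; c1t < N; c1t += T)
        for (int c2t = 0; c2t < N; c2t += T)
            // One tile: every tile executes the same T*T loop body.
            for (int c1 = c1t; c1 < c1t + T; c1++)
                for (int c2 = c2t; c2 < c2t + T; c2++) {
                    A[c2][c1] += u1[c2] * v1[c1] + u2[c2] * v2[c1];
                    x[c1] += A[c2][c1] * y[c2];
                }
}

// Runs both versions on identical (arbitrary) inputs and compares the results.
bool tiling_preserves_result(int N, int T) {
    Vec u1(N), v1(N), u2(N), v2(N), y(N);
    for (int i = 0; i < N; i++) {
        u1[i] = i + 1;
        v1[i] = 2 * i - 3;
        u2[i] = 7 - i;
        v2[i] = i % 4;
        y[i] = 3 * i + 1;
    }
    Mat A0(N, Vec(N, 1)), A1 = A0;
    Vec x0(N, 0), x1 = x0;
    gemver_original(A0, x0, u1, v1, u2, v2, y);
    gemver_fused_tiled(A1, x1, u1, v1, u2, v2, y, T);
    return A0 == A1 && x0 == x1;
}
```

Calling `tiling_preserves_result(8, 4)` compares the two versions for an 8 x 8 problem with 4 x 4 tiles. Because every tile has the same shape, a cost measured for one tile can plausibly stand in for all of them, which is the regularity the slide refers to.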

  10. SystemC Generation Framework • Dilemma: • Latency and power accuracy relies on physical information • Generating the RTL for the entire design of different instances is not scalable • Tile-based characterization: • Extract tiles and separate them into components: • Computation blocks, communication channels, and memory blocks • Separately characterize the power and latency of these parts • Information extracted from gate-level simulation • Build a power model for each part considering different input switching activities

  11. Characterization Flow • General architecture of generated accelerator • [Figure labels: Main Mem; Hardware; local mem with SA; acc_tile1, acc_tile2, …, acc_tileN, each with SA] • Computation modules (acc_tile) • Read data from memories and compute • Local memories (local mem) • Store data for the accelerator • Switching activity (SA) calculation function • Computes the input switching activity to guide the selection of power and latency

  12. Characterization Flow • General architecture of generated accelerator • [Figure labels: Main Mem; Hardware; local mem with SA; acc_tile1, acc_tile2, …, acc_tileN, each with SA] • [Flow labels: SystemC (generated from stage 3); Memory Library; High Level Synthesis; RTL Code; Logic Synthesis; Netlist; Testbench; Gate Level Simulator; Power Analysis; outputs: Switching Activities (SW), Latency, Power] • Computation modules (acc_tile) • Read data from memories and compute • Local memories (local mem) • Store data for the accelerator • Switching activity (SA) calculation function • Computes the input switching activity to guide the selection of power and latency

  13. Characterization Flow • General architecture of generated accelerator • [Figure labels: Main Mem; Hardware; local mem with SA; acc_tile1, acc_tile2, …, acc_tileN, each with SA] • [Flow labels: Memory Library; RTL memory wrapper; Logic Synthesis; Netlist; Testbench; Gate Level Simulator; Power Analysis; outputs: Switching Activities (SW), Latency, Power] • Computation modules (acc_tile) • Read data from memories and compute • Local memories (local mem) • Store data for the accelerator • Switching activity (SA) calculation function • Computes the input switching activity to guide the selection of power and latency

  14. The Look-up Table for Power Modeling • The look-up table is indexed by input switching activity • The power consumption is NOT directly proportional to the input switching activity • The irregularity of switching activity propagation within the accelerator is captured by the characterization data
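
A look-up table of this kind can be sketched in a few lines of C++. Everything concrete here is an assumption made for illustration: the `LutEntry` layout, the milliwatt unit, the sample values in `example_lut`, and in particular the linear interpolation between neighbouring entries (the slides do not say how activities between characterized points are handled).

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// One characterized point: input switching activity and the measured power.
struct LutEntry {
    double activity;
    double power_mw;
};

// Returns the power for switching activity `sa`.
// The table must be sorted by activity. Values outside the table are clamped;
// values between two entries are linearly interpolated (an assumption).
double lookup_power(const std::vector<LutEntry>& lut, double sa) {
    assert(!lut.empty());
    if (sa <= lut.front().activity) return lut.front().power_mw;
    if (sa >= lut.back().activity) return lut.back().power_mw;
    // First entry whose activity is strictly greater than sa.
    auto hi = std::upper_bound(
        lut.begin(), lut.end(), sa,
        [](double value, const LutEntry& e) { return value < e.activity; });
    auto lo = hi - 1;
    const double t = (sa - lo->activity) / (hi->activity - lo->activity);
    return lo->power_mw + t * (hi->power_mw - lo->power_mw);
}

// Hypothetical characterization data; note that power is deliberately
// not proportional to activity.
std::vector<LutEntry> example_lut() {
    return {{0.1, 2.0}, {0.3, 3.5}, {0.5, 4.0}, {0.7, 4.2}, {0.95, 5.0}};
}
```

With the hypothetical table above, `lookup_power(example_lut(), 0.4)` falls halfway between the 0.3 and 0.5 entries. A table is used instead of a single scaling factor precisely because the characterized power does not grow in proportion to the activity.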

  15. SystemC Generation Framework • SystemC generation for modeling • Use polyhedral analysis to generate a SystemC model for the tiled kernels • Back-annotate the power and latency information to the SystemC model • SystemC simulation to compute the values for the entire design • SystemC generation for HLS • Cycle-accurate interface • Insert “wait()” statements for scheduling

  16. Code Generation: An Example • Slide note: unroll at this level, assume N/T = 4
      //Top level module
      class FM_module: public sc_module{
        void FM(){
          while(1){
            update_power(PW_MODE_ON, PW_PHASE_IDLE);
            for(it=0; it < N/T; it++){
              …
              //start the tile threads
              scgen_tile_start[it] = true;
            }
            wait();
            update_power(PW_MODE_ON, PW_PHASE_COMPUTE);
            //wait until all threads are finished
            while(!(scgen_tile_done[0] && … && scgen_tile_done[N/T-1]))
              wait();
            sc_stop();
          }
        }
      };
      //One tile
      class FM_module_tile0: public sc_module{
        for(jt=0; jt<N/T; jt++){
          //communication blocks
          /*read “size” elements from mem1 with start address “sa”, to local_mem, with read delay “delay1”*/
          copy_to_local(int sa, int size, int *local_A, sc_time &delay1);
          …
          /*computation kernel of accelerator, with power and latency counter*/
          acc();
          /*write “size” elements to x from local_X, with write delay “delay2”*/
          copy_to_mem(int sa, int size, int *local_X, sc_time &delay2);
        }
      };
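
The accounting that the back-annotated model performs can be approximated without the SystemC library. The sketch below is a plain C++ stand-in, not the generated SystemC: `PhaseCost`, `TileCost`, `Totals`, and the idea that tiles run concurrently so the design latency is that of the slowest tile are assumptions made for illustration. Each tile iteration adds the annotated latency of its copy-in, compute, and copy-out phases and the corresponding energy.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Back-annotated cost of one phase (values would come from characterization).
struct PhaseCost {
    double latency_ns;
    double power_mw;
};

// Costs of the three phases of one tile iteration.
struct TileCost {
    PhaseCost copy_in;   // stands in for copy_to_local
    PhaseCost compute;   // stands in for acc
    PhaseCost copy_out;  // stands in for copy_to_mem
};

struct Totals {
    double latency_ns = 0.0;
    double energy_pj = 0.0;
};

// Adds one phase to the running totals; mW * ns gives pJ.
static void run_phase(const PhaseCost& c, Totals& t) {
    t.latency_ns += c.latency_ns;
    t.energy_pj += c.power_mw * c.latency_ns;
}

// One tile thread: `iterations` rounds of copy-in, compute, copy-out.
Totals run_tile(const TileCost& c, int iterations) {
    assert(iterations >= 0);
    Totals t;
    for (int jt = 0; jt < iterations; jt++) {
        run_phase(c.copy_in, t);
        run_phase(c.compute, t);
        run_phase(c.copy_out, t);
    }
    return t;
}

// Top level: all tiles are assumed to start together, so the design
// finishes with its slowest tile while the energies add up.
Totals run_design(const std::vector<TileCost>& tiles, int iterations) {
    Totals total;
    for (const TileCost& c : tiles) {
        const Totals t = run_tile(c, iterations);
        total.latency_ns = std::max(total.latency_ns, t.latency_ns);
        total.energy_pj += t.energy_pj;
    }
    return total;
}
```

For four identical tiles with N/T = 4, `run_design(std::vector<TileCost>(4, cost), 4)` returns the latency of one tile and four times its energy. In the real flow the SystemC kernel advances simulated time instead of this manual bookkeeping.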

  17. Overall Framework (2) • Automated C-to-SystemC transformation engine • Tile based characterization flow for latency and power estimation • Two versions of SystemC code generation • For system-level modeling and high-level synthesis • Analytical power and latency models • Design space is big • impossible to traverse the entire space even with SystemC simulation • Use hyper-surface based sampling method • Evaluate the power and latency of all the points in the design space

  18. Analytical Modeling for Power and Latency • Step: complete design space • [Figure labels: loop structures of application; latency constraints]

  19. Analytical Modeling for Power and Latency • Step: sampling on the design space • [Figure labels: loop structures of application; latency constraints; sampling on tile size and unrolling factors]

  20. Analytical Modeling for Power and Latency • Step: generate SystemC for the sampled points and run SystemC simulation • [Figure labels: loop structures of application; latency constraints; sampling on tile size and unrolling factors; SystemC model generation for sampled points; SystemC simulation]

  21. Analytical Modeling for Power and Latency • Step: surface fitting • [Figure labels: loop structures of application; latency constraints; sampling on tile size and unrolling factors; SystemC model generation for sampled points; SystemC simulation; surface fitting]

  22. Analytical Modeling for Power and Latency • Step: generating the modeled design space • [Figure labels: loop structures of application; latency constraints; sampling on tile size and unrolling factors; SystemC model generation for sampled points; SystemC simulation; surface fitting; modeled design space]
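
The surface-fitting step can be illustrated with an ordinary least-squares fit. The slides do not state the form of the fitted surface, so this sketch assumes the simplest one, a plane `value ≈ c0 + c1*tile + c2*unroll`; the `Sample` type and the function names are likewise hypothetical. Each sample would be one simulated design point, with `value` being its latency or power.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <utility>
#include <vector>

// One simulated design point: tile size, unrolling factor, measured value.
struct Sample {
    double tile;
    double unroll;
    double value;
};

// Fits value ≈ c[0] + c[1]*tile + c[2]*unroll by least squares
// (normal equations solved with Gaussian elimination).
std::array<double, 3> fit_plane(const std::vector<Sample>& samples) {
    assert(samples.size() >= 3);
    double m[3][4] = {};  // augmented normal-equation matrix
    for (const Sample& p : samples) {
        const double f[3] = {1.0, p.tile, p.unroll};
        for (int i = 0; i < 3; i++) {
            for (int j = 0; j < 3; j++) m[i][j] += f[i] * f[j];
            m[i][3] += f[i] * p.value;
        }
    }
    // Forward elimination with partial pivoting.
    for (int col = 0; col < 3; col++) {
        int piv = col;
        for (int r = col + 1; r < 3; r++)
            if (std::fabs(m[r][col]) > std::fabs(m[piv][col])) piv = r;
        assert(std::fabs(m[piv][col]) > 1e-12);  // samples must not be degenerate
        for (int j = 0; j < 4; j++) std::swap(m[col][j], m[piv][j]);
        for (int r = col + 1; r < 3; r++) {
            const double k = m[r][col] / m[col][col];
            for (int j = col; j < 4; j++) m[r][j] -= k * m[col][j];
        }
    }
    // Back substitution.
    std::array<double, 3> c{};
    for (int i = 2; i >= 0; i--) {
        double acc = m[i][3];
        for (int j = i + 1; j < 3; j++) acc -= m[i][j] * c[j];
        c[i] = acc / m[i][i];
    }
    return c;
}

// Evaluates the fitted model at an unsampled design point.
double predict(const std::array<double, 3>& c, double tile, double unroll) {
    return c[0] + c[1] * tile + c[2] * unroll;
}
```

Once fitted from a handful of simulated samples, `predict` can be evaluated for every remaining point of the design space without further simulation, which is what makes the modeled design space cheap to generate. A real power or latency surface is unlikely to be planar, so a richer basis would replace the three terms used here.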

  23. Overall Framework (3) • Automated C-to-SystemC transformation engine • Tile based characterization flow for latency and power estimation • Two versions of SystemC code generation • For system-level modeling and high-level synthesis • Analytical power and latency models • Use hyper-surface based sampling method • Evaluate the power and latency of all the points in the design space • Design space is big • impossible to traverse the entire space even with SystemC simulation • Fast design space exploration • Design space pruning • Generate power and latency Pareto curve

  24. Design Space Exploration • [Figure labels: user-defined power & latency constraints; generating modeled design space; power / latency models]

  25. Design Space Exploration • [Figure labels: user-defined power & latency constraints; generating modeled design space; power / latency models; design space pruning; Pareto-optimal candidates]

  26. Design Space Exploration • [Figure labels: user-defined power & latency constraints; C/C++; generating modeled design space; power / latency models; power & latency annotated SystemC; design space pruning; SystemC simulation; Pareto-optimal candidates]

  27. Design Space Exploration • [Figure labels: user-defined power & latency constraints; C/C++; generating modeled design space; power / latency models; power & latency annotated SystemC; design space pruning; SystemC simulation; Pareto-optimal candidates; outputs: power and latency info, Pareto-optimal points, SystemC model]
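
The pruning step can be sketched as a constraint filter followed by a dominance check. This is an illustrative C++ version, assuming a hypothetical `DesignPoint` record and a simple quadratic-time comparison; the slides do not describe how the framework implements pruning.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// One modeled design point (field names are illustrative).
struct DesignPoint {
    int tile;
    int unroll;
    double power_mw;
    double latency_ns;
};

// p dominates q if it is no worse in both metrics and better in at least one.
bool dominates(const DesignPoint& p, const DesignPoint& q) {
    return p.power_mw <= q.power_mw && p.latency_ns <= q.latency_ns &&
           (p.power_mw < q.power_mw || p.latency_ns < q.latency_ns);
}

// Keeps the points that satisfy the user constraints and are not dominated
// by any other feasible point. Input order is preserved.
std::vector<DesignPoint> pareto_candidates(const std::vector<DesignPoint>& space,
                                           double max_power_mw,
                                           double max_latency_ns) {
    assert(max_power_mw > 0.0 && max_latency_ns > 0.0);
    std::vector<DesignPoint> feasible;
    for (const DesignPoint& p : space)
        if (p.power_mw <= max_power_mw && p.latency_ns <= max_latency_ns)
            feasible.push_back(p);

    std::vector<DesignPoint> front;
    for (const DesignPoint& p : feasible) {
        const bool dominated = std::any_of(
            feasible.begin(), feasible.end(),
            [&p](const DesignPoint& q) { return dominates(q, p); });
        if (!dominated) front.push_back(p);
    }
    return front;
}
```

Only the returned candidates would then need a full SystemC simulation, instead of every point in the modeled space.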

  28. An Example • Blue dots form the design space • Red dots form the frontier • Error rate: power 4.1%, latency 3.28%

  29. Experiment (1): Accuracy of the SystemC Model against Gate-level Simulation • Setup: • 8 benchmarks • 45-nm standard cell library for computation blocks • 45-nm memory compiler for the memory blocks • All experiments target a frequency of 1 GHz • Golden model: • The design of the accelerator generated by HLS • Experiment: • Verify results in two settings

  30. Accuracy of the Model for Different Switching Activities • Generate twenty input vector sets • Different switching activities from 0.1 to 0.95 • Each set includes 10000 vectors • As input to the SystemC as well as the golden model • Compare with the golden model
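
Input vector sets with a chosen switching activity can be produced along the following lines. This C++ sketch makes several assumptions that the slides do not confirm: vectors are 32 bits wide, switching activity is defined as the average fraction of bits that toggle between consecutive vectors, and each bit is flipped independently with the target probability. The function names and the seed parameter are hypothetical.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Average fraction of bits that toggle between consecutive 32-bit vectors.
double switching_activity(const std::vector<std::uint32_t>& vectors) {
    if (vectors.size() < 2) return 0.0;
    std::uint64_t toggles = 0;
    for (std::size_t i = 1; i < vectors.size(); i++)
        toggles += std::bitset<32>(vectors[i] ^ vectors[i - 1]).count();
    return static_cast<double>(toggles) / (32.0 * (vectors.size() - 1));
}

// Generates `count` vectors whose expected switching activity is `target`:
// every bit of the previous vector is flipped with probability `target`.
std::vector<std::uint32_t> make_vector_set(double target, std::size_t count,
                                           unsigned seed) {
    assert(target >= 0.0 && target <= 1.0);
    std::mt19937 rng(seed);
    std::bernoulli_distribution flip(target);
    std::vector<std::uint32_t> out;
    out.reserve(count);
    std::uint32_t current = static_cast<std::uint32_t>(rng());
    for (std::size_t n = 0; n < count; n++) {
        out.push_back(current);
        std::uint32_t mask = 0;
        for (int bit = 0; bit < 32; bit++)
            if (flip(rng)) mask |= (std::uint32_t{1} << bit);
        current ^= mask;  // toggle the selected bits for the next vector
    }
    return out;
}
```

A call such as `make_vector_set(0.3, 10000, 1)` yields one set of the size used in the experiment; repeating it for each target activity between 0.1 and 0.95 would give the twenty sets. The measured activity of a generated set is close to, but not exactly, the target because the flips are random.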

  31. Experiment (2): Design Space Exploration • Generate Pareto point set

  32. Experiment (2): Design Space Exploration • Generate Pareto point set • Evaluate accuracy • Analytical model vs. SystemC simulation

  33. Experiment (2): Design Space Exploration • Generate Pareto point set • Simulation speed-up • Number of points in design space / Simulation Points

  34. Experiment (2): Design Space Exploration • Generate Pareto point set • Benefit of the thickness addition • Number of true Pareto points that would be pruned away without thickness
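
One way to read the thickness idea is as a tolerance on dominance: because the analytical model carries some error, a point that looks only slightly dominated may be a true Pareto point, so it is kept. The C++ sketch below implements that reading. It is an interpretation, not the definition used in the work; the `Point` type, the relative `thickness` margin, and the function names are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Point {
    double power_mw;
    double latency_ns;
};

// q prunes p only if q is still no worse than p after q's metrics are
// inflated by the relative margin `thickness`, and strictly better in one.
// With thickness == 0 this is ordinary Pareto dominance.
bool dominates_with_margin(const Point& q, const Point& p, double thickness) {
    const double qp = q.power_mw * (1.0 + thickness);
    const double ql = q.latency_ns * (1.0 + thickness);
    return qp <= p.power_mw && ql <= p.latency_ns &&
           (qp < p.power_mw || ql < p.latency_ns);
}

// Returns the indices of the points that survive pruning.
std::vector<std::size_t> thick_front(const std::vector<Point>& pts,
                                     double thickness) {
    assert(thickness >= 0.0);
    std::vector<std::size_t> kept;
    for (std::size_t i = 0; i < pts.size(); i++) {
        bool pruned = false;
        for (std::size_t j = 0; j < pts.size() && !pruned; j++)
            pruned = dominates_with_margin(pts[j], pts[i], thickness);
        if (!pruned) kept.push_back(i);
    }
    return kept;
}
```

With a thickness of zero the result is the plain Pareto front. Raising it keeps near-frontier points as additional candidates, at the cost of simulating more points.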

  35. Analysis of the DSE Results • Trade-off between power and latency • Different graph shapes • Communication-bound applications (Gemver) • Computation-bound applications (Correlation) • [Figure panels: Gemver, Correlation]

  36. Analysis of the DSE Results • Communication-dominated design (Gemver) • Increasing parallel computation yields only a trivial latency decrease • Optimization opportunity: • P1 vs. P2: 1.7× less power, 4% longer latency • Computation-dominated design (Correlation) • Effective power-latency trade-off • [Figure panels: Gemver, Correlation]

  37. Conclusions

  38. Thanks!

  39. Back-up

  40. Accuracy of the Model for Different Switching Activities • Generate twenty input vector sets • Different switching activities from 0.1 to 0.95 • Each set includes 10000 vectors • As input to the SystemC as well as the golden model • Compare with the golden model
