
Layout Driven Data Communication Optimization for High Level Synthesis

Layout Driven Data Communication Optimization for High Level Synthesis. Adam Kaplan, Philip Brisk and Majid Sarrafzadeh Computer Science Department University of California, Los Angeles. Ryan Kastner, Wenrui Gong, Xin Hao, Forrest Brewer Dept. of Electrical and Computer Engineering


Presentation Transcript


  1. Layout Driven Data Communication Optimization for High Level Synthesis. Adam Kaplan, Philip Brisk and Majid Sarrafzadeh, Computer Science Department, University of California, Los Angeles. Ryan Kastner, Wenrui Gong, Xin Hao and Forrest Brewer, Dept. of Electrical and Computer Engineering, University of California, Santa Barbara.

  2. High Level Synthesis
     • Input: application description written in *C (C, SystemC, HandelC, SpecC)
     • Output: "hardware" (RTL specification)
     • Maximize "performance" (area, latency, power, …) subject to input constraints
     • Figure labels: SSA, CDFG
     • Example input (excerpt): internal filter of an image convolver

         for (y_pos=ygrid_start-y_fmid-1,res_pos=0; y_pos<0; y_pos+=ygrid_step)
         {
           for (x_pos=xgrid_start-x_fmid-1; x_pos<0; x_pos+=xgrid_step,res_pos++)
           {
             (*reflect)(filt,x_fdim,y_fdim,x_pos,y_pos,temp,FILTER);
             sum=0.0;
             for (y_filt_lin=x_fdim,x_filt=y_im_lin=0; y_filt_lin<=filt_size; y_im_lin+=x_dim,y_filt_lin+=x_fdim)
               for (im_pos=y_im_lin; x_filt<y_filt_lin; x_filt++,im_pos++)
                 sum+=image[im_pos]*temp[x_filt];
             result[res_pos] = sum;
           }
           first_col = x_pos+1;
           (*reflect)(filt,x_fdim,y_fdim,0,y_pos,temp,FILTER);

  3. Target Architectures
     • "Spatial" architectures
         • Local control between data path, global data flow between control nodes
         • Lots of distributed computational units, memory
         • Coarse/fine grained reconfigurable architectures
     • Techniques could be used for other architectures
         • May not make sense
         • Our design flow has little resource sharing
     • Figure labels: coarse grain programmable platform; fine grain configurable platform

  4. Obligatory Design Flow Slide
     • Stages: SUIF (Syntactic & Semantic Analysis); Machine SUIF (Compiler Backend); Behavioral Synthesis; Logical & Physical Synthesis
     • Representations: Application Specification, AST, CFG, SSA, CDFG
     • Steps:
         (1) Create interface
         (2) Transform instruction list to dataflow graph
         (3) Transform dataflow graph to behavioral HDL code
         (4) Synthesize behavioral HDL code to RTL code
         (5) Create CFG interface
         (6) Determine structural control and data communication between basic block entities
         (7) Generate synthesizable RTL code
         (8) Synthesize RTL code
     • HDL skeletons shown: "entity basic_block is … architecture behavioral of basic_block …" and "entity cfg is … architecture behavioral of cfg …"
     • Figure labels: Basic Block Entity; Entity 1 to Entity 4; a dataflow graph of * and + operators

  5. Design Example
     • "FAST" function from MediaBench
     • Some nodes missing: simple computation, merged into others
     • Lines below show data communication
     • Figure labels: Node 1 to Node 10, annotated alongside the code

         int FAST(real *b, int n)
         {
           real fn;
           int i, in, nn, n2pow, n4pow, nthpo;
           n2pow = fastlog2(n);
           if(n2pow <= 0) return 0;
           nthpo = n;
           fn = nthpo;
           n4pow = n2pow / 2;
           /* radix 2 iteration required; do it now */
           if(n2pow % 2) {
             nn = 2;
             in = n / nn;
             FR2TR(in, b, b + in);
           }
           else nn = 1;
           /* perform radix 4 iterations */
           for(i = 1; i <= n4pow; i++) {
             nn *= 4;
             in = n / nn;
             FR4TR(in, nn, b, b + in, b + 2 * in, b + 3 * in);
           }
           /* perform inplace reordering */
           FORD1(n2pow, b);
           FORD2(n2pow, b);
           /* take conjugates */
           for(i = 3; i < n; i += 2) b[i] = -b[i];
           return 1;
         }

  6. Characterizing Data Communication
     • Examples of data communication schemes
     • Distributed: data communication = wire
     • Centralized: data communication = memory access
     • Figure labels: Control Node 2, Control Node 3, Control Node 4 (in both schemes); Memory (Register Bank, RAM); Bus

  7. Identifying Data Communication
     • Determine relationship between place(s) where data is defined and where data is used
     • Naïve method: all use-points of a variable depend on all definitions of that variable
     • Not all use points "use" a variable
     • Need analysis to minimize the amount of data communication
     • Figure (example CFG): definitions a ← …, b ← …; a ← …, b ← …; a ← …, c ← …; uses ← b, ← c, ← a. Global data communication = 5 variables
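The naïve method on this slide can be sketched in a few lines. This is only an illustration: the node names `N1` to `N3`, the variables `x` and `y`, and the helper `naive_links` are hypothetical and do not come from the slides.

```python
def naive_links(defs, uses):
    """Pair every use of a variable with every definition of that variable.

    defs: variable -> set of control nodes that define it (hypothetical input)
    uses: variable -> set of control nodes that use it (hypothetical input)
    Returns a set of (defining_node, using_node, variable) links.
    """
    links = set()
    for var, use_nodes in uses.items():
        for use_node in use_nodes:
            for def_node in defs.get(var, ()):
                if def_node != use_node:  # values inside one node need no link
                    links.add((def_node, use_node, var))
    return links


# Hypothetical example: x is defined in two nodes, y in one.
defs = {"x": {"N1", "N2"}, "y": {"N1"}}
uses = {"x": {"N3"}, "y": {"N2", "N3"}}
links = naive_links(defs, uses)
```

Every link produced this way is a potential wire or memory access, whether or not the definition can actually reach the use, which is why the slide calls for a more precise analysis.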

  8. a1 … a  … b1 … b  … a2 … a  … b2 … b  … a3 … a  … c1 … c  …  b  b1  c1  c a4 (a2,a3)  a4  a Use of SSA in Compilation • Must determine relationship between where data is generated and where data is used • Problem formulations • [DAC03]: Minimize the total number of bits communicated between all pairs of control nodes • Today: Minimize overall wirelength • SSA (Static Single Assignment) • Changes each variable to have a unique definition point • Must add -nodes to merge definitions

  9. SSA Fundamentals
     • SSA algorithms
         • Find location of Φ-nodes
         • Rename variables
     • Three main SSA algorithms
         • Minimal, Pruned: Cytron et al.
         • Semi-pruned: Briggs et al.
     • Differ in number and location of Φ-nodes
         • Minimal: insert Φ-nodes at iterated dominance frontier (IDF)
         • Semi-pruned: insert Φ-node at IDF if variable live outside some basic block
         • Pruned: insert Φ-node at IDF if variable live at that time
     • Figure: the same example (a1, b1; a2, b2; a3, c1; uses ← b1, ← c1, ← a4) in three columns labelled Semi-Pruned, Minimal and Pruned. a4 ← Φ(a2,a3) appears in all three columns, b3 ← Φ(b1,b2) in two of them, and c2 ← Φ(c1) in one
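The three insertion criteria can be compared with a small sketch. It assumes the iterated dominance frontier and the liveness facts are already computed and passed in; the function `phi_sites`, the join node `J`, and the liveness data are hypothetical, chosen only so the Φ-node counts mirror the 3, 2 and 1 seen in the figure.

```python
def phi_sites(variant, var, idf_nodes, non_local_vars, live_in):
    """Return the nodes that receive a phi-node for `var` under one SSA variant.

    idf_nodes: iterated dominance frontier of var's definitions (assumed given)
    non_local_vars: variables live outside some basic block (assumed given)
    live_in: node -> set of variables live on entry to that node (assumed given)
    """
    if variant == "minimal":
        # Every IDF node gets a phi-node.
        return set(idf_nodes)
    if variant == "semi-pruned":
        # Only variables that cross a basic-block boundary get phi-nodes.
        return set(idf_nodes) if var in non_local_vars else set()
    if variant == "pruned":
        # Only IDF nodes where the variable is still live get a phi-node.
        return {n for n in idf_nodes if var in live_in.get(n, set())}
    raise ValueError(f"unknown SSA variant: {variant}")


# Hypothetical facts: all three variables have join node J in their IDF.
idf = {"a": {"J"}, "b": {"J"}, "c": {"J"}}
non_local = {"a", "b"}   # c never leaves its basic block
live_in = {"J": {"a"}}   # only a is live on entry to J

counts = {
    variant: sum(len(phi_sites(variant, v, idf[v], non_local, live_in)) for v in idf)
    for variant in ("minimal", "semi-pruned", "pruned")
}
```

Under these made-up facts, each stricter criterion removes one more Φ-node, which is the ordering the slide's figure suggests.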

  10. Results: SSA for Data Comm. Minimization
     • Edge weight w(i,j): number of bits communicated from node i to node j
     • Total Edge Weight (TEW): corresponds to amount of data communication
     • Figure: results chart for the "MediaBench"marks
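A minimal sketch of the two metrics, assuming TEW is the plain sum of all edge weights (the slide does not spell out the formula). The node names, variables, and bit widths below are hypothetical.

```python
def edge_weights(communicated, bit_width):
    """Compute w(i, j): bits communicated from node i to node j.

    communicated: iterable of (src_node, dst_node, variable) triples
    bit_width: variable -> width in bits
    """
    weights = {}
    for src, dst, var in communicated:
        weights[(src, dst)] = weights.get((src, dst), 0) + bit_width[var]
    return weights


def total_edge_weight(weights):
    """TEW, taken here as the sum of all edge weights (an assumption)."""
    return sum(weights.values())


# Hypothetical example: a is 32 bits wide, b is 16 bits wide.
communicated = [("N1", "N3", "a"), ("N2", "N3", "a"), ("N1", "N3", "b")]
w = edge_weights(communicated, {"a": 32, "b": 16})
tew = total_edge_weight(w)
```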

  11. a1 … a1 … b1 … b1 … a2 … a2 … b2 … b2 … a3 … a3 … c1 … c1 …  b1  b1  c1  c1 a4 (a2,a3) a4 (a2,a3) TEW = 4  a4  a4 Further Minimizing Data Communication • Current SSA algorithms place -nodes temporally • In software compilation, live ranges should be short • Appropriate in hardware? Spatial -node distribution Temporal -node distribution a1 … b1 … a2 … b2 … a3 … c1 …  b1  c1 TEW = 3 a4 (a2,a3)  a4

  12. Spatial Φ-node Distribution Algorithm
         1. Given a CDFG G(N_cfg, E_cfg)
         2. perform_SSA(G)
         3. calculate_def_use_chains(G)
         4. remove_back_edges(G)
         5. topological_sort(G)
         6. for each node n ∈ N_cfg
         7.   for each Φ-node Φ ∈ n
         8.     s ← |Φ.sources|
         9.     d ← |def_use_chain(Φ.dest)|
         10.    if s × d < s + d
         11.      move_to_spatial_locations(Φ)
         12. restore_back_edges(G)
     • d: number of uses of Φ-node destination
     • s: number of Φ-node source values
     • Number of temporal links: s + d
     • Number of spatial links: s × d
     • Example: a3 ← Φ(a0,a1,a2) with two uses ← a3, so s = 3 and d = 2
     • Optimal assuming "ideal" n-dimensional floorplan
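The test on steps 8 to 11 can be sketched as follows. This reads the slide as comparing s + d links for a temporally placed Φ-node against s × d links for a spatially distributed one; the function names are hypothetical.

```python
def link_counts(s, d):
    """Links needed for a phi-node with s source values and d uses."""
    return {
        "temporal": s + d,  # s wires into the phi-node, d wires out of it
        "spatial": s * d,   # every source wired directly to every use
    }


def should_distribute_spatially(s, d):
    """Step 10 of the algorithm: move the phi-node only if it saves links."""
    return s * d < s + d


# The slide's example: a3 <- phi(a0, a1, a2) with two uses of a3.
example = link_counts(3, 2)
keep_temporal = not should_distribute_spatially(3, 2)
```

For the slide's example, spatial distribution would need 6 links against 5, so the Φ-node stays where it is. A Φ-node with two sources and a single use (s = 2, d = 1) would be moved, since 2 < 3.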

  13. Let's Get Physical! Physically Aware Compiler Transforms
     • Consider layout information during compilation
     • Modify transforms to consider physical info
     • Ideal: full physical synthesis. Extremely accurate, but way too time consuming
     • Approximate using floorplanning
         • Much faster
         • Gives "good enough" high level physical picture
     • Our previous data comm. work
         • No physical information
         • Can lead to negative results
     • Figure labels: application, Hardware Compilation, Floor-planner, Physical Synthesis

  14. Physically Aware Data Communication
     • Modify placement of Φ-functions to consider wirelength
     Φ-Placement Algorithm
     • Given a CFG Gcfg(Vcfg, Ecfg)
     • perform_ssa(Gcfg)
     • calculate_def_use_chains(Gcfg)
     • remove_back_edges(Gcfg)
     • topological_sort(Gcfg)
     • foreach vertex v ∈ Vcfg
         • foreach Φ-node Φ ∈ v
             • s ← Φ.sources
             • d ← |def_use_chain(Φ.dest)|
             • IDF ← iterated_dominance_frontier(s)
             • PossiblePlacements ← findPlacementOptions(IDF)
             • place(Φ) ← selectBest(PossiblePlacements)
             • distribute/duplicate Φ to place(Φ)
     FindPlacementOptions Algorithm
     • Given a set of CFG Nodes R
     • Φ-options ← ∅
     • insert(R) into Φ-options
     • foreach instruction i ∈ R
         • if (i is a destination of Φ-function f)
             • return Φ-options
     • temp_Φ-options ← ∅
     • foreach non-dominated child c of R
         • temp_Φ-options ← crossProductJoin(temp_Φ-options, findPlacementOptions(c))
     • return Φ-options ∪ temp_Φ-options
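The slide does not say how selectBest scores a placement. One plausible reading, sketched below, is to pick the candidate with the smallest estimated wirelength on the current floorplan. Everything here is an assumption made for illustration: the Manhattan-distance cost model, the restriction to a single Φ copy per candidate, and the node names and coordinates.

```python
def manhattan(p, q):
    """Manhattan distance between two (x, y) floorplan positions."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def wirelength(phi_node, sources, uses, pos):
    """Estimated wirelength with one phi copy placed in `phi_node`.

    Wires run from every source node to the phi-node and from the
    phi-node to every use node (hypothetical cost model).
    """
    into_phi = sum(manhattan(pos[s], pos[phi_node]) for s in sources)
    out_of_phi = sum(manhattan(pos[phi_node], pos[u]) for u in uses)
    return into_phi + out_of_phi


def select_best(candidates, sources, uses, pos):
    """Return the candidate node with the smallest estimated wirelength."""
    return min(candidates, key=lambda n: wirelength(n, sources, uses, pos))


# Hypothetical floorplan: module centers of four control nodes.
pos = {"N2": (0, 0), "N3": (4, 0), "N4": (2, 3), "N5": (2, 1)}
best = select_best(["N4", "N5"], sources=["N2", "N3"], uses=["N4"], pos=pos)
```

In this made-up floorplan, placing the Φ-node in `N5` costs 8 units against 10 for `N4`. A different floorplan could reverse the choice, which is the dependence on layout that this part of the talk is about.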

  15. Algorithm in Action
     • Evaluate all options for Φ-nodes
     • Replicate Φ when necessary
     • Limit amount of replication: most often leads to more wirelength
     • Can play tricks to limit redundant placements
     • Any of these options could yield the best wirelength; highly dependent on the floorplan
     • Figure: the example CFG (a1 ← …, b1 ← …; a2 ← …, b2 ← …; a3 ← …, c1 ← …; uses ← b1, ← c1, ← a4) with candidate placements of a4 ← Φ(a2,a3), from traditional (temporal) to spatial [DAC03]

  16. Algorithm in Action
     • FAST function from MediaBench testsuite
     • Figure: CFG fragment with nodes N3 and N9, T/F branch edges, and values nn_4, i_2, nn_5, i_3

  17. Algorithm in Action
     • Figure: two versions of a CFG fragment with nodes N3 and N9, T/F branch edges, and values nn_4, i_2, nn_5, i_3

  18. Full Floorplanning Results
     • Spectacularly negative results
     • Simple iterative approach
         • Initial optimization minimizes data communication
         • Full simulated annealing (SA) based floorplanning
         • Reoptimization based on the floorplan
         • Full SA based floorplanning
     • Figure labels: Hardware Compilation, Full Floor-planner, Physical Synthesis

  19. Incremental Floorplanning
     • Incremental placement [Coudert et al]: given an optimized placement and a set of changes to the netlist (e.g., due to technology remapping), modify the placement to improve it
     • Equally applicable to floorplanning: given an optimized floorplan and a set of changes to the modules (e.g., due to Φ-function movement), modify the floorplan
     • Figure: initial floorplan (modules 1, 2, 3, 4), perturbations (module 6), modified floorplan

  20. Our Incremental Floorplanner
     • Figure: initial floorplan, perturbations (module 6), modified floorplan, and incremental floorplan for modules 1, 2, 3, 4 and 6
     • Slicing-tree node labels in the figure: 32/36, 27/30.4, 5/5.6, 16/18, 11/12.4, 2/2.3, 9/10.1

  21. Our Incremental Floorplanner
     • Calculate area & room of each node: bottom up slicing tree traversal
     • Area redistribution
         • Top down traversal
         • Increase area if necessary
             • Not enough space at root
             • Aspect ratios become too distorted
     • Simple, yet effective
     • Other more complicated algorithms might work better
     • Figure: modified floorplan and incremental floorplan (modules 1, 2, 3, 4, 6); slicing-tree node labels 32/36, 27/30.4, 5/5.6, 16/18, 11/12.4, 2/2.3, 9/10.1
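The two traversals can be sketched on a small slicing tree. Two things below are assumptions rather than the authors' stated method: the figure's labels are read as area/room pairs, and room is split among children in proportion to their area (a rule that happens to reproduce the figure's numbers after rounding). The class `SliceNode` and the leaf names A to D are hypothetical.

```python
class SliceNode:
    """Node of a slicing tree; leaves are modules, inner nodes are cuts."""

    def __init__(self, name, area=0.0, children=()):
        self.name = name
        self.area = area          # module area (leaves) or subtree sum
        self.children = list(children)
        self.room = 0.0           # space granted to this subtree


def compute_area(node):
    """Bottom-up traversal: an inner node's area is the sum of its children."""
    if node.children:
        node.area = sum(compute_area(child) for child in node.children)
    return node.area


def distribute_room(node, room):
    """Top-down traversal: split room among children in proportion to area."""
    node.room = room
    for child in node.children:
        distribute_room(child, room * child.area / node.area)


# Leaf areas chosen to match the figure's sums: 32 = 27 + 5, 27 = 16 + 11, 11 = 2 + 9.
leaf_a, leaf_b = SliceNode("A", 16.0), SliceNode("B", 2.0)
leaf_c, leaf_d = SliceNode("C", 9.0), SliceNode("D", 5.0)
inner_bc = SliceNode("BC", children=[leaf_b, leaf_c])
inner_abc = SliceNode("ABC", children=[leaf_a, inner_bc])
root = SliceNode("root", children=[inner_abc, leaf_d])

compute_area(root)
distribute_room(root, 36.0)  # the root has 36 units of room for 32 units of area
```

With 36 units of room at the root, the area-27 subtree receives 30.375 and the area-5 leaf 5.625, which round to the 30.4 and 5.6 shown in the figure.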

  22. MediaBench Functions

  23. Incremental Floorplanning Results
     • "Optimal" approach: 12% overall wirelength reduction, 25% Phi-node wirelength reduction
     • Our approach: 6% overall wirelength reduction, 8% Phi-node wirelength reduction
     • Figure: chart of normalized wirelength per benchmark, with an average (avg) entry

  24. Related Work
     • Hardware compilation projects using SSA
         • PDG+SSA form [UCSB]
         • CASH [CMU]
         • SA-C [UCR]
         • Sea Cucumber [BYU]
     • Physically aware behavioral synthesis techniques
         • SA for scheduling, binding and floorplanning [Prabhakaran97]
         • SA for binding and floorplanning [Yung-Ming94]
         • Scheduling, allocation and binding [Dougherty00]
         • Fasolt: bus topology [Knapp92]
         • High level synthesis [Tarafdar00]
     • Incremental CAD
         • Problem overview/challenges [Coudert00]
         • Floorplanning [Crenshaw99]

  25. Conclusions
     • It's been a long strange trip…
     • SSA is a nice IR for hardware compilation
         • Explicitly shows data flow
         • Useful for exploiting parallelism
     • Compiler techniques applied to hardware design can reduce wirelength
         • They must be aware of physical information
         • They must use incremental floorplanning

  26. Questions? (and cue for applause)
