Layout Driven Data Communication Optimization for High Level Synthesis

Adam Kaplan, Philip Brisk and Majid Sarrafzadeh
Computer Science Department, University of California, Los Angeles
Ryan Kastner, Wenrui Gong, Xin Hao, Forrest Brewer
Dept. of Electrical and Computer Engineering, University of California, Santa Barbara

High Level Synthesis
• Input: Application description written in *C (C, SystemC, HandelC, SpecC)
• Output: “Hardware” (RTL Specification)
• Maximize “performance” (area, latency, power, …) subject to input constraints
• Intermediate forms shown on the slide: SSA, CDFG

Example input (excerpt): internal filter of an image convolver

    for (y_pos=ygrid_start-y_fmid-1,res_pos=0; y_pos<0; y_pos+=ygrid_step)
    {
      for (x_pos=xgrid_start-x_fmid-1; x_pos<0; x_pos+=xgrid_step,res_pos++)
      {
        (*reflect)(filt,x_fdim,y_fdim,x_pos, y_pos,temp,FILTER);
        sum=0.0;
        for (y_filt_lin=x_fdim,x_filt=y_im_lin=0;
             y_filt_lin<=filt_size;
             y_im_lin+=x_dim,y_filt_lin+=x_fdim)
          for (im_pos=y_im_lin; x_filt<y_filt_lin; x_filt++,im_pos++)
            sum+=image[im_pos]*temp[x_filt];
        result[res_pos] = sum;
      }
      first_col = x_pos+1;
      (*reflect)(filt,x_fdim,y_fdim,0,y_pos,temp,FILTER);

Target Architectures
• “Spatial” architectures
  • Local control between data path, global data flow between control nodes
  • Lots of distributed computational units, memory
  • Coarse/fine grained reconfigurable architectures
• Techniques could be used for other architectures
  • May not make sense
  • Our design flow has little resource sharing
[Figure labels: Coarse grain programmable platform; Fine grain configurable platform]

Obligatory Design Flow Slide
[Figure labels: Application Specification; SUIF: Syntactic & Semantic Analysis; AST; Machine SUIF: Compiler Backend; CFG; SSA; CDFG; Basic Block Entity; Entity 1 to Entity 4]
Basic block entity:
1. Create interface
2. Transform instruction list to dataflow graph
3. Transform dataflow graph to behavioral HDL code
4. Synthesize behavioral HDL code to RTL code (Behavioral Synthesis)
   entity basic_block is … architecture behavioral of basic_block …
CFG entity:
5. Create CFG interface
6. Determine structural control and data communication between basic block entities
7. Generate synthesizable RTL code
   entity cfg is … architecture behavioral of cfg …
8. Synthesize RTL code (Logical & Physical Synthesis)

Design Example
• “FAST” function from MediaBench
• Some nodes missing - simple computation, merged into others
• Lines below show data communication
[Figure: control nodes Node 1 to Node 10 drawn next to the code, connected by data communication lines]

    int FAST(real *b, int n) {
      real fn;
      int i, in, nn, n2pow, n4pow, nthpo;
      n2pow = fastlog2(n);
      if(n2pow <= 0) return 0;
      nthpo = n;
      fn = nthpo;
      n4pow = n2pow / 2;
      /* radix 2 iteration required; do it now */
      if(n2pow % 2) {
        nn = 2;
        in = n / nn;
        FR2TR(in, b, b + in);
      }
      else nn = 1;
      /* perform radix 4 iterations */
      for(i = 1; i <= n4pow; i++) {
        nn *= 4;
        in = n / nn;
        FR4TR(in, nn, b, b + in, b + 2 * in, b + 3 * in);
      }
      /* perform inplace reordering */
      FORD1(n2pow, b);
      FORD2(n2pow, b);
      /* take conjugates */
      for(i = 3; i < n; i += 2)
        b[i] = -b[i];
      return 1;
    }

Characterizing Data Communication
• Examples of data communication schemes
  • Distributed: data communication = wire (Control Node 2, Control Node 3 and Control Node 4 connected directly)
  • Centralized: data communication = memory access (Control Node 2, Control Node 3 and Control Node 4 connected over a bus to a memory: register bank, RAM)

Identifying Data Communication
• Determine relationship between place(s) where data is defined and where data is used
• Naïve method: all use-points of a variable depend on all definitions of that variable
  • Not all use points “use” a variable
• Need analysis to minimize the amount of data communication
[Figure: example CFG, global data communication = 5 variables]
    a ← … ; b ← …
    a ← … ; b ← …
    a ← … ; c ← …
    … ← b ; … ← c
    … ← a

Use of SSA in Compilation
• Must determine relationship between where data is generated and where data is used
• Problem formulations
  • [DAC03]: Minimize the total number of bits communicated between all pairs of control nodes
  • Today: Minimize overall wirelength
• SSA (Static Single Assignment)
  • Changes each variable to have a unique definition point
  • Must add Φ-nodes to merge definitions
[Figure: the example CFG before and after SSA renaming]
    Original             SSA form
    a ← … ; b ← …        a1 ← … ; b1 ← …
    a ← … ; b ← …        a2 ← … ; b2 ← …
    a ← … ; c ← …        a3 ← … ; c1 ← …
    … ← b ; … ← c        … ← b1 ; … ← c1
                         a4 ← Φ(a2,a3)
    … ← a                … ← a4

SSA Fundamentals
• SSA algorithms
  • Find location of Φ-nodes
  • Rename variables
• Three main SSA algorithms
  • Minimal, Pruned – Cytron et al.
  • Semi-pruned – Briggs et al.
• Differ in number and location of Φ-nodes
  • Minimal – insert Φ-nodes at iterated dominance frontier (IDF)
  • Semi-pruned – insert Φ-node at IDF if variable live outside some basic block
  • Pruned – insert Φ-node at IDF if variable live at that time
[Figure: the example CFG (a1, b1 / a2, b2 / a3, c1 / uses of b1 and c1 / use of a4) under the three algorithms]
    Minimal:      a4 ← Φ(a2,a3), b3 ← Φ(b1,b2), c2 ← Φ(c1)
    Semi-pruned:  a4 ← Φ(a2,a3), b3 ← Φ(b1,b2)
    Pruned:       a4 ← Φ(a2,a3)
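The three Φ-placement rules can be compared on a small example. The following is only a sketch under stated assumptions: the diamond-shaped CFG, its block names, and the function names are hypothetical and not taken from the slides, and the dominance frontiers are supplied by hand rather than computed. `ue_uses` holds the upward-exposed uses of each block (uses that occur before any definition in the same block).

```python
def iterated_dominance_frontier(df, def_blocks):
    """Blocks reached by repeatedly following dominance frontiers from def_blocks."""
    work, idf = list(def_blocks), set()
    while work:
        for y in df.get(work.pop(), ()):
            if y not in idf:
                idf.add(y)
                work.append(y)  # a Φ-node is itself a new definition
    return idf


def live_in_sets(succ, ue_uses, defs):
    """Backward liveness: live_in(b) = ue_uses(b) | (live_out(b) - defs(b))."""
    live_in = {b: set(ue_uses[b]) for b in succ}
    changed = True
    while changed:
        changed = False
        for b in succ:
            live_out = set().union(*(live_in[s] for s in succ[b]))
            new = ue_uses[b] | (live_out - defs[b])
            if new != live_in[b]:
                live_in[b], changed = new, True
    return live_in


def place_phis(succ, df, defs, ue_uses, flavour):
    """Return {variable: blocks needing a Φ-node} for the chosen flavour."""
    live_in = live_in_sets(succ, ue_uses, defs)
    non_local = set().union(*ue_uses.values())  # names live across some block boundary
    placement = {}
    for var in sorted(set().union(*defs.values())):
        blocks = iterated_dominance_frontier(
            df, [b for b in defs if var in defs[b]])
        if flavour == "semi-pruned" and var not in non_local:
            blocks = set()  # purely block-local names never get a Φ-node
        elif flavour == "pruned":
            blocks = {b for b in blocks if var in live_in[b]}
        if blocks:
            placement[var] = blocks
    return placement


# Hypothetical diamond CFG: entry -> {left, right} -> join
succ = {"entry": ["left", "right"], "left": ["join"], "right": ["join"], "join": []}
df = {"entry": [], "left": ["join"], "right": ["join"], "join": []}
defs = {"entry": {"a", "b"}, "left": {"a", "b"}, "right": {"a", "c"}, "join": set()}
ue_uses = {"entry": set(), "left": {"b"}, "right": set(), "join": {"a"}}

minimal = place_phis(succ, df, defs, ue_uses, "minimal")
semi_pruned = place_phis(succ, df, defs, ue_uses, "semi-pruned")
pruned = place_phis(succ, df, defs, ue_uses, "pruned")
```

On this hypothetical CFG the minimal form places Φ-nodes for a, b and c at the join, the semi-pruned form drops c (it is never live across a block boundary), and the pruned form keeps only a (the only variable live at the join).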

Results: SSA for Data Comm. Minimization
• Edge Weight w(i,j) – number of bits communicated from node i to j
• Total Edge Weight (TEW) – corresponds to amount of data communication
[Chart: results on the “MediaBench”marks]
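As an illustration of the two metrics, the sketch below accumulates w(i,j) from a list of value transfers and sums the result into the TEW. The transfer list, node names and bit widths are made up for the example, and ignoring transfers that stay inside one node is an assumption.

```python
from collections import defaultdict


def edge_weights(transfers):
    """w(i, j): total number of bits sent from control node i to control node j."""
    w = defaultdict(int)
    for src, dst, bits in transfers:
        if src != dst:  # assumption: values consumed in the defining node cost nothing
            w[(src, dst)] += bits
    return dict(w)


def total_edge_weight(transfers):
    """TEW: sum of w(i, j) over all pairs of control nodes."""
    return sum(edge_weights(transfers).values())


# Hypothetical transfers: (defining node, using node, bit width of the value)
transfers = [
    ("N1", "N2", 32),
    ("N1", "N3", 32),
    ("N2", "N4", 16),
    ("N3", "N4", 32),
    ("N2", "N2", 8),  # local use, ignored
]
weights = edge_weights(transfers)
tew = total_edge_weight(transfers)
```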

Further Minimizing Data Communication
• Current SSA algorithms place Φ-nodes temporally
  • In software compilation, live ranges should be short
  • Appropriate in hardware?
[Figure: the example CFG (a1, b1 / a2, b2 / a3, c1 / uses of b1 and c1 / a4 ← Φ(a2,a3) / use of a4) with two Φ-node distributions]
    Temporal Φ-node distribution: TEW = 4
    Spatial Φ-node distribution:  TEW = 3

Spatial Φ-nodes Distribution Algorithm

    Given a CDFG G(N_cfg, E_cfg)
    perform_SSA(G)
    calculate_def_use_chains(G)
    remove_back_edges(G)
    topological_sort(G)
    for each node n ∈ N_cfg
        for each Φ-node Φ ∈ n
            s ← |Φ.sources|
            d ← |def_use_chain(Φ.dest)|
            if s × d < s + d
                move_to_spatial_locations(Φ)
    restore_back_edges(G)

• d – number of uses of Φ-node destination
• s – number of Φ-node source values
• Number of temporal links: s + d
• Number of spatial links: s × d
• Optimal assuming “ideal” n-dimensional floorplan
[Figure: a3 ← Φ(a0,a1,a2) with two uses of a3, so s = 3 and d = 2]
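The test inside the loop compares the number of links in each arrangement: a temporal Φ-node needs s links in and d links out, while moving the Φ-node to every use needs s links per use. A minimal sketch of that decision follows; the function names and the dictionary layout are illustrative assumptions, and the example Φ-nodes reuse the ones drawn on the slides.

```python
def link_counts(num_sources, num_uses):
    """Return (temporal, spatial) link counts for one Φ-node."""
    temporal = num_sources + num_uses   # sources -> Φ, then Φ -> each use
    spatial = num_sources * num_uses    # every source wired to every use
    return temporal, spatial


def choose_distribution(phi_nodes):
    """phi_nodes maps a Φ destination to (list of sources, list of use sites)."""
    plan = {}
    for dest, (sources, uses) in phi_nodes.items():
        temporal, spatial = link_counts(len(sources), len(uses))
        plan[dest] = "spatial" if spatial < temporal else "temporal"
    return plan


phi_nodes = {
    "a3": (["a0", "a1", "a2"], ["use1", "use2"]),  # s = 3, d = 2
    "a4": (["a2", "a3"], ["use1"]),                # s = 2, d = 1
}
plan = choose_distribution(phi_nodes)
```

With s = 3 and d = 2 the spatial form needs 6 links against 5, so the Φ-node stays temporal; with s = 2 and d = 1 it needs 2 against 3, so it is moved.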

Let’s Get Physical! Physically Aware Compiler Transforms
• Consider layout information during compilation
• Modify transforms to consider physical info
• Ideal: full physical synthesis – extremely accurate, but way too time consuming
• Approximate using floorplanning
  • Much faster
  • Gives “good enough” high level physical picture
• Our previous data comm. work
  • No physical information
  • Can lead to negative results
[Figure labels: application; Hardware Compilation; Floor-planner; Physical Synthesis]

Physically Aware Data Communication
• Modify placement of Φ-functions to consider wirelength

Φ-Placement Algorithm
• Given a CFG Gcfg(Vcfg, Ecfg)
• perform_ssa(Gcfg)
• calculate_def_use_chains(Gcfg)
• remove_back_edges(Gcfg)
• topological_sort(Gcfg)
• foreach vertex v ∈ Vcfg
  • foreach Φ-node Φ ∈ v
    • s ← Φ.sources
    • d ← |def_use_chain(Φ.dest)|
    • IDF ← iterated_dominance_frontier(s)
    • PossiblePlacements ← findPlacementOptions(IDF)
    • place(Φ) ← selectBest(PossiblePlacements)
    • distribute/duplicate Φ to place(Φ)

FindPlacementOptions Algorithm
• Given a set of CFG Nodes R
• Φ-options ← ∅
• insert(R) into Φ-options
• foreach instruction i ∈ R
  • if (i is a destination of Φ-function f)
    • return Φ-options
• temp_Φ-options ← ∅
• foreach non-dominated child c of R
  • temp_Φ-options ← crossProductJoin(temp_Φ-options, findPlacementOptions(c))
• return Φ-options ∪ temp_Φ-options
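The slides do not spell out selectBest, so the following is only a sketch of one plausible reading: score every placement option by estimated wirelength on the current floorplan and keep the cheapest. The cost model (Manhattan distance between module centers, every Φ copy wired to all sources, every use wired to its nearest copy), the function names, and both floorplans are assumptions made for illustration.

```python
def manhattan(p, q):
    """Manhattan distance between two (x, y) module centers."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def wirelength(option, sources, uses, centers):
    """Estimated wirelength of one option (a tuple of nodes hosting Φ copies)."""
    to_copies = sum(manhattan(centers[s], centers[c])
                    for c in option for s in sources)
    to_uses = sum(min(manhattan(centers[c], centers[u]) for c in option)
                  for u in uses)
    return to_copies + to_uses


def select_best(options, sources, uses, centers):
    """Pick the placement option with the smallest estimated wirelength."""
    return min(options, key=lambda opt: wirelength(opt, sources, uses, centers))


sources, uses = ["N2", "N3"], ["N7", "N8"]
options = [("N5",), ("N7", "N8")]  # one temporal Φ in N5, or one copy per use

# Two hypothetical floorplans for the same control nodes
floorplan_a = {"N2": (0, 0), "N3": (2, 0), "N5": (10, 10), "N7": (1, 1), "N8": (1, 2)}
floorplan_b = {"N2": (0, 0), "N3": (2, 0), "N5": (1, 0), "N7": (1, 8), "N8": (1, 9)}

best_a = select_best(options, sources, uses, floorplan_a)
best_b = select_best(options, sources, uses, floorplan_b)
```

Under these assumptions the duplicated placement wins on the first floorplan and the single temporal placement wins on the second, which is the floorplan dependence the approach is meant to capture.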

Algorithm in Action
• Evaluate all options for Φ-nodes
• Replicate Φ when necessary
  • Limit amount of replication - most often leads to more wirelength
• Can play tricks to limit redundant placements
• Any of these options could yield the best wirelength
  • Highly dependent on the floorplan
[Figure: the example CFG (a1, b1 / a2, b2 / a3, c1 / uses of b1 and c1 / use of a4) with candidate locations for a4 ← Φ(a2,a3), ranging from Traditional (temporal) to Spatial [DAC03]]

Algorithm in Action
• FAST function from MediaBench testsuite
[Figure: CFG fragment between nodes N3 and N9 with T/F branch edges, carrying nn_4, i_2 and nn_5, i_3]

Algorithm in Action
[Figure: two versions of the CFG fragment between nodes N3 and N9 with T/F branch edges, carrying nn_4, i_2 and nn_5, i_3]

Full Floorplanning Results
• Simple iterative approach
  • Initial optimization minimizes data communication
  • Full SA based floorplanning
  • Reoptimization based on the floorplan to minimize wirelength
  • Full SA based floorplanning
• Spectacularly negative results
[Figure labels: Hardware Compilation; Full Floor-planner; Physical Synthesis]

Incremental Floorplanning
• Incremental Placement [Coudert et al]:
  • Given an optimized placement and a set of changes to the netlist (e.g., due to technology remapping), modify the placement to improve it.
• Equally applicable to floorplanning
[Figure: Initial Floorplan (modules 1 to 4), Perturbations (new module 6, e.g. due to Φ-function movement), Modified Floorplan (modules 1, 2, 3, 4 and 6)]

Our Incremental Floorplanner
[Figure: Initial Floorplan (modules 1 to 4), Perturbations (module 6), Modified Floorplan and Incremental Floorplan (modules 1, 2, 3, 4 and 6)]
[Figure: slicing tree annotated with two values per node: root 32/36; internal nodes 27/30.4 and 11/12.4; leaves 5/5.6, 16/18, 2/2.3 and 9/10.1]

Our Incremental Floorplanner
• Calculate area & room of each node: bottom up slicing tree traversal
• Area redistribution
  • Top down traversal
• Increase area if necessary
  • Not enough space at root
  • Aspect ratios become too distorted
• Simple, yet effective
• Other more complicated algorithms might work better
[Figure: Modified Floorplan and Incremental Floorplan (modules 1, 2, 3, 4 and 6)]
[Figure: slicing tree annotated with two values per node: root 32/36; internal nodes 27/30.4 and 11/12.4; leaves 5/5.6, 16/18, 2/2.3 and 9/10.1]
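The two traversals can be sketched on a slicing tree as below. This is an illustration under assumptions, not the authors' implementation: the annotations in the figure are read as area/room, room is handed down in proportion to area, the tree shape and leaf names are guessed from the figure, and the area of the added module is made up.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    name: str
    area: float = 0.0                 # required area (given for leaves)
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    room: float = 0.0                 # space available to this subtree


def compute_area(node):
    """Bottom-up traversal: an internal node needs the sum of its children."""
    if node.left is not None:
        node.area = compute_area(node.left) + compute_area(node.right)
    return node.area


def distribute_room(node, room):
    """Top-down traversal: share the available room in proportion to area."""
    node.room = room
    if node.left is not None:
        for child in (node.left, node.right):
            distribute_room(child, room * child.area / node.area)


def needs_growth(root):
    """True when the floorplan must be enlarged: not enough space at the root."""
    return root.area > root.room


# Hypothetical slicing tree using the leaf areas annotated in the figure
m1, m2, m3, m4 = Node("m1", 5), Node("m2", 2), Node("m3", 9), Node("m4", 16)
inner = Node("inner", left=m2, right=m3)
upper = Node("upper", left=m4, right=inner)
root = Node("root", left=upper, right=m1)

compute_area(root)
distribute_room(root, 36.0)
fits_before = not needs_growth(root)

# Perturbation: add a module with a made-up area of 6 and re-run the bottom-up pass
root_with_new = Node("root2", left=root, right=Node("m6", 6))
compute_area(root_with_new)
root_with_new.room = 36.0
overflow = needs_growth(root_with_new)
```

With a root room of 36 the proportional split gives 30.375, 18, 12.375, 2.25, 10.125 and 5.625, which round to the values annotated in the figure; adding the extra module pushes the required area to 38, so the root has to grow.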

MediaBench Functions

Incremental Floorplanning Results
• “Optimal” Approach:
  • 12% Overall Wirelength Reduction
  • 25% Phi-node Wirelength Reduction
• Our Approach:
  • 6% Overall Wirelength Reduction
  • 8% Phi-node Wirelength Reduction
[Chart: Normalized Wirelength per benchmark, with average (avg)]

Related Work
• Hardware compilation projects using SSA
  • PDG+SSA form [UCSB]
  • CASH [CMU]
  • SA-C [UCR]
  • Sea Cucumber [BYU]
• Physically aware behavioral synthesis techniques
  • SA for scheduling, binding and floorplanning [Prabhakaran97]
  • SA for binding and floorplanning [Yung-Ming94]
  • Scheduling, allocation and binding [Dougherty00]
  • Fasolt: bus topology [Knapp92]
  • High level synthesis [Tarafdar00]
• Incremental CAD
  • Problem overview/challenges [Coudert00]
  • Floorplanning [Crenshaw99]

Conclusions
• It’s been a long strange trip…
• SSA is a nice IR for hardware compilation
  • Explicitly shows data flow
  • Useful for exploiting parallelism
• Compiler techniques applied to hardware design can reduce wirelength
  • They must be aware of physical information
  • They must use incremental floorplanning

Questions? (and cue for applause)
