Séminaire COSI ’01

Power Driven Processor Array Partitioning for FPGA SoC
S. Derrien, S. Rajopadhye
Séminaire COSI-Roscoff’01

Content
• Context and motivations
• Silicon compilation tools
• Target architectures
• Power consumption
• Related work
• Partitioning
• Modeling power
• Experimental results
• Conclusion

Silicon compilation tools
• Parallel processor array architectures
  • Regular and scalable (well suited to FPGAs)
  • Specialized high-performance datapath
• Restricted class of loops
  • SUREs (uniform dependencies)
  • Static polyhedral loop domain
• Compute-intensive nested loops
  • Image processing (motion estimation, stereo vision)
  • Signal processing (QR factorization, DLMS)

Power consumption
• General model and motivations
  • P = Pstat + Vdd.Cd.Df (gate-level model)
  • Estimate at RTL level (entropy-based models)
• Mainly dictated by:
  • On-chip area cost and activity
  • Off-chip I/O volume
• System-level power model?
  • Estimate from specs and target architecture

Target architecture
[Figure: block diagram with system memory, CPU, FPGA, and external world]
• Embedded CPU
  • PowerPC
  • Nios
• SoC bus
  • AMBA, CoreConnect
  • Plug 'n' play IP cores
• Shared memory
  • Low latency
  • High bandwidth

Related work
• Compiler transformations to reduce memory accesses [Kandemir]
  • Loop fusion
  • Loop tiling
  • Loop reordering
• Design space exploration for custom memory systems [Imec]
  • Systematic exploration
  • Multi-level memory hierarchy
  • The approach is brute force

Content
• Context and motivations
• Target architectures
• Partitioning
  • Clustering (LSGP)
  • Tiling (LPGS)
  • Co-partitioning
• Modeling power
• Experimental results
• Conclusion

Tiling (LPGS)
• Partition the PE array into tiles
  • Tiles are executed sequentially
  • Intermediate results are stored in off-chip memory
  • Requires unidirectional communications
• Tile shape is rectangular
  • Bounds parallel to the PE space basis vectors
  • Perfect "tiling" of the processor space

Tiling (LPGS)
[Figure: tile with w1=2, w2=3, drawn against the domain height]
• Matrix W
  • Diagonal
  • det|W| = Npe
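Because W is diagonal, its determinant is simply the product of the tile sizes. The minimal sketch below checks this for the figure's values (w1=2, w2=3). The function name `tile_pe_count` is hypothetical and only serves the illustration.

```python
from math import prod


def tile_pe_count(tile_sizes):
    """Return det(W) for the diagonal tiling matrix W = diag(tile_sizes).

    For a diagonal matrix the determinant is the product of the diagonal
    entries, which the slide equates with the number of PEs, Npe.
    """
    if any(w <= 0 for w in tile_sizes):
        raise ValueError("tile sizes must be positive integers")
    return prod(tile_sizes)


# Values taken from the slide's figure: w1 = 2, w2 = 3.
print(tile_pe_count([2, 3]))  # 6
```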

Clustering (LSGP)
• Groups PEs into clusters
  • Operations are executed sequentially
  • I/O accesses are reduced
• Cluster shape is rectangular
  • Bounds parallel to the PE space basis vectors
  • Perfect "tiling" of the processor space
• Scheduling is axes-major
  • Several possible schedulings
  • Sequence of clusterings along each axis
  • Simplifies control logic

Clustering (LSGP)
[Equation annotations: original space-time mapping, PE index vector, iteration index vector]
• Matrix G
  • Diagonal
  • det|G| = Npe
  • Size sy x … x sx
[Figure labels: sy=2, sy=3]

Clustering (LSGP)
[Figure: three panels labeled "PE original", "sx=2", and "sx=2, sy=3"]
Resource usage estimate:

Hybrid partitioning
• Step 1: the array is tiled
  • Tune the I/O volume
• Step 2: each tile is clustered
  • Tune the resource usage
• Trade-off
  • Off-chip I/O volume
  • Local memory sizes

Content
• Context and motivations
• Target architectures
• Partitioning
• Modeling power
  • IO power model
  • Core power model
  • Putting it all together
• Experimental results
• Conclusion

Dynamic IO energy model
• IO energy depends on
  • IO volume (RAM clock speed)
  • Operation (Rd, Wr)
  • Port toggle rate
Eio = Krd.Vrd + Kwr.Vwr
  (Krd, Kwr: technological constants; Vrd, Vwr: number of read and write I/O operations)
• Determine IO volume
  • For all loop variables
  • Given tiling parameters
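The IO energy formula is a weighted sum of the read and write volumes. A minimal sketch follows; the function name `io_energy` and the numeric constants in the usage line are placeholders chosen for illustration, not measured values from the talk.

```python
def io_energy(v_rd, v_wr, k_rd, k_wr):
    """Eio = Krd.Vrd + Kwr.Vwr.

    v_rd, v_wr: number of read / write I/O operations.
    k_rd, k_wr: technological constants (energy per read / write);
                the values passed below are hypothetical.
    """
    return k_rd * v_rd + k_wr * v_wr


# Hypothetical constants: 2.0 and 3.0 energy units per read and per write.
print(io_energy(v_rd=100, v_wr=50, k_rd=2.0, k_wr=3.0))  # 350.0
```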

IO volume estimate (1/2)
• Tile IO volume is called the "footprint"
• Estimate for this footprint [Arg95]
• Spread vector of dependencies:
  • Substituting the ith row with the spread vector

IO volume estimate (2/2)
• Total tile IO volume:
  (formula annotations: number of variables, kth variable byte width, spread vector, tile size parameter)
• Example:
  dA=[1 0 0]  aA=[1 0 0]  lA=2  VA = 2.H.w1
  dB=[0 1 0]  aB=[1 0 0]  lB=2  VB = 2.H.w2
  dC=[0 0 1]  aC=[1 0 0]  lC=4  VC = 4.w1.w2
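The example volumes can be evaluated numerically. The sketch below is only an illustration under stated assumptions: H is taken to be the domain height, the total tile IO volume is taken to be the sum of the per-variable volumes (the total formula itself did not survive in the transcript), and the function name `example_footprint` and the sample values are hypothetical.

```python
def example_footprint(h, w1, w2):
    """Per-variable IO volumes from the slide's example.

    VA = lA.H.w1, VB = lB.H.w2, VC = lC.w1.w2 with lA = lB = 2 and lC = 4.
    Assumption: the total is the plain sum over the three variables.
    """
    l_a, l_b, l_c = 2, 2, 4  # factors given on the slide (lA, lB, lC)
    volumes = {
        "A": l_a * h * w1,
        "B": l_b * h * w2,
        "C": l_c * w1 * w2,
    }
    total = sum(volumes.values())  # assumed form of the total
    return volumes, total


# Hypothetical domain height H = 100 with the tile sizes w1 = 2, w2 = 3.
volumes, total = example_footprint(h=100, w1=2, w2=3)
print(volumes, total)
```

With these sample values the H-dependent terms (VA and VB) dominate, which is consistent with using the tile sizes to tune the I/O volume.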

Core power model (1/4)
• FPGA power dissipation model
  Pcore = Pstat + Kc.Dlc.nlc.f
• Not suited to our target FPGA architecture
• Distinction between LCs (memory and logic)
  Pcore = Pstat + Kc.Dlc.nlc.f + Km.Dm.nm.f
  (f: design operating frequency; Kc, Km: technology constants; nlc, nm: numbers of logic cells; Dlc, Dm: average toggle rates)
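The refined model simply adds a memory term to the logic term. A minimal sketch is given below; the function name `core_power` and all numeric values are hypothetical, chosen only to exercise the formula.

```python
import math


def core_power(p_stat, f, k_c, d_lc, n_lc, k_m=0.0, d_m=0.0, n_m=0):
    """Pcore = Pstat + Kc.Dlc.nlc.f + Km.Dm.nm.f.

    Leaving the memory arguments at zero gives the first model,
    Pcore = Pstat + Kc.Dlc.nlc.f.
    """
    logic_term = k_c * d_lc * n_lc * f
    memory_term = k_m * d_m * n_m * f
    return p_stat + logic_term + memory_term


# Hypothetical values: 50 MHz clock, 1000 logic LCs, 500 memory LCs.
p = core_power(p_stat=0.1, f=50e6,
               k_c=1e-11, d_lc=0.2, n_lc=1000,
               k_m=2e-11, d_m=0.1, n_m=500)
print(math.isclose(p, 0.25))  # logic 0.1 + memory 0.05 + static 0.1
```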

Core power model (2/4)
• Control logic is not modeled
  • Too complex to estimate
  • No significant contribution to power
• Core power depends on
  • Number of PEs: depends on G and W
  • Area usage for each PE: depends on W
  • Average toggle rate for PE datapath and local memory (application constant)

Core power model (3/4)
• Memory resource usage
  • LCs used as distributed memory (16x1 bits)
  • Datapath is a design constant (library based)
• Area cost for a PE array
  (formula annotations: datapath functional cost; number of PEs; register width along processor space k; clustering parameter along processor space j)

Core power model (4/4)
• Energy cost for the whole loop nest
  • We have Ec = Pc.ncycle.Tcycle
  • We will consider ncycle = Vcalc/np
• Total core energy cost
  (formula annotations: average toggle rate; total loop computation volume)
• Energy is not dependent on np!
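The independence from np can be checked numerically. The sketch below is an illustration under loud assumptions, not the talk's exact formula (which did not survive in the transcript): static power is ignored, the per-PE cell count is taken as constant, and dynamic power is assumed to grow linearly with the number of PEs. The function name `core_energy` and all values are hypothetical.

```python
import math


def core_energy(n_p, v_calc, f, k, d, cells_per_pe):
    """Ec = Pc.ncycle.Tcycle with ncycle = Vcalc/np and Tcycle = 1/f.

    Assumptions (not from the slides): no static power, and
    Pc = K.D.(cells_per_pe.np).f, i.e. linear in the number of PEs.
    """
    p_c = k * d * (cells_per_pe * n_p) * f  # dynamic core power
    n_cycle = v_calc / n_p                  # cycles for the whole loop nest
    t_cycle = 1.0 / f                       # clock period
    return p_c * n_cycle * t_cycle


# Same loop nest on 4 PEs and on 16 PEs (hypothetical values).
e4 = core_energy(n_p=4, v_calc=1e6, f=50e6, k=1e-11, d=0.2, cells_per_pe=100)
e16 = core_energy(n_p=16, v_calc=1e6, f=50e6, k=1e-11, d=0.2, cells_per_pe=100)
print(math.isclose(e4, e16))  # True: np and f cancel out
```

Under these assumptions np cancels between the power term and the cycle count, so the energy reduces to K.D.cells_per_pe.Vcalc.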

Content
• Context and motivations
• Target architectures
• Partitioning
• Modeling power
• Experimental results
  • Model validation
  • Extrapolations
• Conclusion

IO power model results

Core power model results

System power model

Content
• Context and motivations
• Target architectures
• Partitioning
• Modeling power
• Experimental results
• Conclusion
  • Solving the optimization problem (Lagrange multipliers)
  • Custom cache for embedded CPUs
  • Extension to SAREs (affine dependencies)

Conclusion
• Model matches experiments
  • Cheap measurement setup
  • Many components contribute to current dissipation (LEDs, PCI, etc.)
• Observations
  • Trade-off evolves with technology
  • More sensitive for ASICs?

Future work (1/2)
• Formulation of the optimization problem
  • Minimize energy per iteration
  • Constraints on performance and area
• Analytical solution?
  • Lagrange multipliers
  • No closed form for n>3
  • But fast numerical methods exist

Future work (2/2)
• Model for embedded CPUs
  • Trade-off between cache size and memory accesses
  • Determine optimal cache size and associated tiling parameters
• Extension to SAREs?
  • Affine dependencies
  • More general loops
