
10th Reconfigurable Architectures Workshop (RAW 2003), Nice, France, April 22, 2003








  1. Loop Dissevering: A Technique for Temporally Partitioning Loops in Dynamically Reconfigurable Computing Platforms
     João M. P. Cardoso
     University of Algarve, Faro / INESC-ID, Lisboa, Portugal
     10th Reconfigurable Architectures Workshop (RAW 2003), Nice, France, April 22, 2003
     17th Annual Int'l Parallel & Distributed Processing Symposium (IPDPS 2003)

  2. Motivation
     • How to map sets of computational structures requiring more resources than available?

       for(int i=0; i<8; i++)
         for(int j=0; j<8; j++)
           CosTrans[j+8*i] = CosBlock[i+8*j];
       for(int i=0; i<8; i++)
         for(int j=0; j<8; j++) {
           TempBlock[i+j*8] = 0;
           for(int k=0; k<8; k++)
             TempBlock[i+j*8] += InIm[i+k*8] * CosTrans[k+j*8];
         }

  3. Motivation
     • How to map sets of computational structures requiring more resources than available?
     • Temporal Partitioning

       for(int i=0; i<8; i++)
         for(int j=0; j<8; j++)
           CosTrans[j+8*i] = CosBlock[i+8*j];
       for(int i=0; i<8; i++)
         for(int j=0; j<8; j++) {
           TempBlock[i+j*8] = 0;
           for(int k=0; k<8; k++)
             TempBlock[i+j*8] += InIm[i+k*8] * CosTrans[k+j*8];
         }

  4. Motivation
     • How to map sets of computational structures requiring more resources than available?
     • Temporal Partitioning
     • Other motivations for partitioning computations in time:
       • each design is simpler
       • may lead to better performance!
       • amortizes some of the configuration time by overlapping execution stages
       • allows smaller reconfigurable arrays to implement complex applications
     For more info: see Cardoso and Weinhardt, DATE 2003

  5. Motivation
     • How to map sets of computational structures requiring more resources than available?
     • Temporal Partitioning
     • Computational structures for each loop or set of nested loops are implemented in a single partition
     • But what to do with a loop requiring more resources than available?

  6. Outline
     • Motivation
     • Configure-Execute Paradigm (execution stages)
     • Target Architecture
       • PACT XPP Architecture
       • XPP Configuration Flow
       • XPP-VC Compilation Flow
     • Temporal Partitioning of Loops
     • Experimental Results
     • Conclusions & Future Work

  7. Configure-Execute Paradigm (Execution Stages)
     • the program in a single configuration
     • two configurations:
       • without on-chip context planes and without partial reconfiguration
       • with partial reconfiguration
       • with on-chip context planes
     [Timing diagram: fetch (f), configure (c), and compute (comp) stages per configuration; with partial reconfiguration, f2/c2 overlap comp1; with on-chip context planes, only the context switch is exposed]

  8. PACT XPP Architecture (briefly)
     [Figure: array of PEs with two memory (M) columns]
     • X × Y coarse-grained array:
       • Processing elements (PEs): compute typical ALU operations
       • Two columns of SRAMs (Ms)
       • I/O ports for data streaming

  9. PACT XPP Architecture (briefly)
     [Figure: array of PEs with two memory (M) columns]
     • Ready/ack. protocol for each programmable interconnection
     • Flow of data (pre-foundry parameterized bit-widths)
     • Flow of events (1-bit lines)

  10. PACT XPP Architecture (briefly)
      [Figure: Configuration Manager (CM) fetches from ports CMPort0/CMPort1 into the Configuration Cache (CC) and configures the array]
      • Dynamically reconfigurable:
        • On-chip configuration cache and configuration manager
        • Partial reconfiguration (only the used resources are configured)

  11. XPP Configuration Flow
      • Uses 3 stages to execute each configuration: Fetch (f), Configure (c), Compute (comp)
      • The array may request the next configuration
      • The configuration manager accepts requests and proceeds without intervention from an external host
      [Figure: CM program "c0; if(CMPort0) then c1; if(CMPort1) then c2;" fetching configurations into the Configuration Cache (CC) via CMPort0/CMPort1]

  12. XPP-VC Compilation Flow
      C program → Preprocessing + Dependence Analysis → TempPart (Temporal Partitioning) → MODGen ("Module Generation", with pipelining) + Control Code Generation (Reconfiguration) → NML file → xmap → XPP Binary Code
      • TempPart: partitions and generates reconfiguration statements, which are executed by the Configuration Manager
      • MODGen: maps a C subset to NML (PACT's proprietary structural language with reconfiguration primitives)
      For more info: see Cardoso and Weinhardt, FPL 2002

  13. Temporal Partitioning
      [Figure: example HTG — start → Loop 1 → x → Loop 2 → coef, tmp → Loop 3 → y → Loop 4 → end]
      • One partition for each node in the Hierarchical Task Graph (HTG) TOP level
      • Merge adjacent nodes if the combination of both can be mapped to the XPP device and if the merge does not degrade overall performance
      • If an HTG node is too large, create a separate partition for each node of the inner HTG and call the algorithm recursively

  14. Temporal Partitioning of Loops
      • What to do when loops in the program cannot be mapped due to the lack of enough resources?
      • Software/reconfigware approach:
        • control of the loop stays in software,
        • inner code sections migrate to reconfigware, each one mapped to a single configuration
      • Loop Distribution:
        • transforms a loop into two or more loops
        • each loop with the same iteration-space traversal as the original loop
        • the inner statements of the original loop are split among the loops
      • Loop Dissevering:
        • transforms a loop into a set of configurations
        • the cyclic behavior is implemented by the configuration flow
      

  15. Temporal Partitioning of Loops

      ...
      for(nx=0; nx<X_DIM_BLK; nx++)
        for(ny=0; ny<Y_DIM_BLK; ny++) {
          for(i=0; i<N; i++)              // Inner Loop 1
            for(j=0; j<N; j++) {
              tmp = 0;
              for(k=0; k<N; k++)
                tmp += X[i+ny*N][k+nx*N] * CosBlock[j][k];
              TempBlock[i][j] = tmp;
            }
          // to be partitioned here
          for(i=0; i<N; i++)              // Inner Loop 2
            for(j=0; j<N; j++) {
              tmp = 0;
              for(k=0; k<N; k++)
                tmp += TempBlock[k][j] * CosBlock[i][k];
              Y[i+ny*N][j+nx*N] = tmp;
            }
        }
      ...

      • Loop Distribution
      • Loop Dissevering

  16. Loop Distribution

      ...
      for(nx=0; nx<X_DIM_BLK; nx++)        // Conf. 1
        for(ny=0; ny<Y_DIM_BLK; ny++)
          for(i=0; i<N; i++)
            for(j=0; j<N; j++) {           // Inner Loop 1
              tmp = 0;
              for(k=0; k<N; k++)
                tmp += X[i+ny*N][k+nx*N] * CosBlock[j][k];
              TempBlock[i+ny*N][j+nx*N] = tmp;
            }
      for(nx=0; nx<X_DIM_BLK; nx++)        // Conf. 2
        for(ny=0; ny<Y_DIM_BLK; ny++)
          for(i=0; i<N; i++)
            for(j=0; j<N; j++) {
              tmp = 0;
              for(k=0; k<N; k++)
                tmp += TempBlock[k+ny*N][j+nx*N] * CosBlock[i][k];
              Y[i+ny*N][j+nx*N] = tmp;
            }
      ...
      [Configuration flow: begin → Conf. 1 → Conf. 2 → end]

  17. Loop Distribution
      • Cannot be applied to all loops:
        • must not break cycles in the dependence graph of the original loop
      • Requires auxiliary array variables:
        • for each loop-independent flow dependence of a scalar variable (known as scalar expansion), and
        • for each control dependence at the place where we want to partition the loop
      • Expansion of some arrays
      • But it preserves the software-pipelining potential, and
      • may improve parallelization, cache hit/miss ratio, etc.

  18. Loop Dissevering
      [Figure: the loop split into configurations Conf. 1 – Conf. 5; the configuration flow (begin → ... → end) re-requests configurations to implement the loop's cyclic behavior]

  19. Loop Dissevering
      • Applicable to every loop
      • Relies only on a configuration manager to execute complex loops
      • May relieve the host microprocessor to execute other tasks
      • No array or scalar expansion (only scalar communication)
      • But:
        • besides furnishing feasible mappings, is it worth applying? Does it lead to efficient solutions (in terms of performance)?
        • what are the improvements if the architecture can switch between configurations in a few clock cycles?

  20. Experimental Results
      • Compared architectures (both with runtime support for partial reconfiguration):
        • ARCH-A: word-grained partial reconfiguration
        • ARCH-B: context planes, with switching between contexts in a few clock cycles
      [Timing diagram: ARCH-A overlaps fetch/configure with compute; ARCH-B exposes only the context switch]

  21. Experimental Results
      • Benchmarks

  22. Experimental Results (resource savings)
      • Using loop dissevering
      • Compared to implementations without loop dissevering, only 44% (DCT), 66% (BPIC), and 85% (Life) of the resources are used

  23. Experimental Results (speedups)
      • Architecture A (ARCH-A): word-grained partial reconfiguration
      • Architecture B (ARCH-B): context planes
      • DCT

  24. Experimental Results (speedups)
      • Life
      • Applying Loop Dissevering
      • The benefits of ARCH-B are negligible when partitions "in the loop" compute for long times

  25. Conclusions
      • Temporal Partitioning + Loop Dissevering:
        • guarantees the mapping of theoretically unlimited computational structures
      • Loop Dissevering and Loop Distribution:
        • may lead to performance enhancements
        • savings of resources
      • Loop Dissevering is applicable to every loop:
        • performance-efficient implementations may require fast reconfiguration
        • the resultant performance may decrease:
          • when innermost loops are partitioned (no more potential for loop pipelining)
          • when each active partition computes for short times (does not amortize the reconfiguration time)

  26. Future Work
      • More study on the impact of Loop Dissevering and Loop Distribution:
        • understand the impact of the number of context planes, configuration cache size, etc.
        • evaluate loop partitioning when mapping to FPGAs
      • Automatic implementation of Loop Distribution
      • Methods to decide between Loop Dissevering and Loop Distribution

  27. Acknowledgments (in the paper)
      • Part of this work was done while the author was with PACT XPP Technologies, Inc., Munich, Germany.
      • We gratefully acknowledge the support of all the members of PACT XPP Technologies, Inc., especially the help of Daniel Bretz, Armin Strobl, and Frank May regarding the XDS tools. A special thanks to Markus Weinhardt for the fruitful discussions about loop dissevering and the XPP-VC compiler.
