
João M. P. Cardoso

A Data-Driven Approach for Pipelining Sequences of Data-Dependent Loops. João M. P. Cardoso, Portugal. ITIV, University of Karlsruhe, July 2, 2007.



Presentation Transcript


  1. A Data-Driven Approach for Pipelining Sequences of Data-Dependent Loops João M. P. Cardoso, Portugal ITIV, University of Karlsruhe, July 2, 2007

  2. Motivation • Many applications have sequences of tasks • E.g., in image and video processing algorithms • Contemporary FPGAs • Plenty of room to accommodate highly specialized, complex architectures • It is time to creatively “use available resources” rather than to simply “save resources”

  3. Motivation • Computing Stages • Sequentially (timeline: Task A, then Task B, then Task C, one after another)

  4. Motivation • Computing Stages • Concurrently (timeline: Tasks A, B and C overlap in time)

  5. Outline • Objective • Loop Pipelining • Producer/Consumer Computing Stages • Pipelining Sequences of Loops • Inter-Stage Communication • Experimental Setup and Results • Related Work • Conclusions • Future Work

  6. Objectives • To speed up applications with multiple and data-dependent stages • each stage seen as a set of nested loops • How? • Pipelining those sequences of data-dependent stages using fine-grain synchronization schemes • Taking advantage of field-customizable computing structures (FPGAs)

  7. Loop Pipelining • Attempts to overlap loop iterations • Significant speedups are achieved • But how to pipeline sequences of loops? (diagram: iterations I1, I2, I3, I4 executed back to back vs. overlapped in time)
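The overlap sketched on this slide can be captured with a toy timing model. This is only an illustration, not the paper's cost model: it assumes each loop body splits into `depth` pipeline stages of one cycle each, with a new iteration started every cycle once the pipeline is full.

```c
#include <assert.h>

/* Hypothetical timing model for a single loop.
 * Sequential: iterations run back to back, no overlap. */
static int cycles_sequential(int iterations, int depth) {
    return iterations * depth;
}

/* Pipelined: fill the pipeline once (depth cycles), then
 * one iteration completes per cycle. */
static int cycles_pipelined(int iterations, int depth) {
    return depth + (iterations - 1);
}
```

For 4 iterations of a 3-stage body, sequential execution takes 12 cycles while the pipelined version takes 6, which is where the "significant speedups" on this slide come from.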

  8. Computing Stages • Sequentially Producer: ...A[2]A[1]A[0] Consumer: A[0]A[1]A[2]...

  9. Computing Stages • Concurrently • Ordered producer/consumer pairs • Send/receive through a FIFO with N stages • Producer: ...A[2]A[1]A[0] → Consumer: A[0]A[1]A[2]...

  10. Computing Stages • Concurrently • Unordered producer/consumer pairs • Empty/full table (one empty/full bit per data cell) • Producer: ...A[3]A[5]A[1] → Consumer: A[3]A[1]A[5]...
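The unordered case can be sketched as a data array paired with a 1-bit empty/full table: the producer writes a cell and then sets its flag; the consumer checks the flag before reading, at any address, in any order. The function names here are illustrative.

```c
#include <assert.h>
#include <stdbool.h>

#define SIZE 8

static int  data[SIZE];
static bool full[SIZE];   /* 1-bit empty/full flag per data cell */

/* Producer: out-of-order writes; flag set only after the data is stored. */
static void produce(int addr, int value) {
    data[addr] = value;
    full[addr] = true;
}

/* Consumer: readiness check on the flag. In hardware the consumer
 * busy-waits on this bit; here it simply reports not-ready. */
static bool try_consume(int addr, int *value) {
    if (!full[addr]) return false;
    *value = data[addr];
    return true;
}
```

Unlike the FIFO, the empty/full table synchronizes each cell independently, so producer and consumer orderings need not match.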

  11. Main Idea • FDCT (diagram: data input → Loops 1, 2 → intermediate data array, indexed 0..56 in steps of 8 → Loop 3 → data output; with a global FSM, Loops 1 and 2 execute to completion before Loop 3 starts)

  12. Main Idea • FDCT • Out-of-order producer/consumer pairs • How to overlap computing stages? (diagram: Loops 1 and 2 write the intermediate array in a different order than Loop 3 reads it)

  13. Main Idea • Pipelined FDCT (diagram: data input → Loops 1, 2 under FSM 1 → intermediate data in a dual-port RAM, guarded by a dual-port 1-bit empty/full table → Loop 3 under FSM 2 → data output; execution of Loops 1, 2 overlaps with execution of Loop 3)

  14. Main Idea (diagram: Task A and Task B communicating through memories)

  15. Possible Scenarios • Single write, single read • Accepted without code changes • Single write, multiple reads • Accepted without code changes (by using an N-bit table) • Multiple writes, single read • Need code transformations • Multiple writes, multiple reads • Need code transformations

  16. Inter-Stage Communication • Responsible for: • Communicating data between pipelined stages • Flagging data availability • Solutions • Perfect associative memory • Cost too high • Memory for data plus a 1-bit table (each cell holds empty/full information) • Sized to the data set to communicate • Decrease size using a hash-based solution

  17. Inter-Stage Communication • Memory plus 1-bit table:

    boolean tab[SIZE] = {0, 0, ..., 0};
    ...
    for (i = 0; i < num_fdcts; i++) {        // Loop 1
      for (j = 0; j < N; j++) {              // Loop 2
        // loads
        // computations
        // stores
        tmp[48+i_1] = F6 >> 13;  tab[48+i_1] = true;
        tmp[56+i_1] = F7 >> 13;  tab[56+i_1] = true;
        i_1++;
      }
      i_1 += 56;
    }

    i_1 = 0;
    for (i = 0; i < N*num_fdcts; i++) {      // Loop 3
      L1: f0 = tmp[i_1];    if (!tab[i_1])    goto L1;
      L2: f1 = tmp[1+i_1];  if (!tab[1+i_1])  goto L2;
      // remaining loads
      // computations
      // stores
      i_1 += 8;
    }

  18. Inter-Stage Communication • Hash-based solution:

    boolean tab[SIZE] = {0, 0, ..., 0};
    ...
    for (i = 0; i < num_fdcts; i++) {        // Loop 1
      for (j = 0; j < N; j++) {              // Loop 2
        // loads
        // computations
        // stores
        tmp[H(48+i_1)] = F6 >> 13;  tab[H(48+i_1)] = true;
        tmp[H(56+i_1)] = F7 >> 13;  tab[H(56+i_1)] = true;
        i_1++;
      }
      i_1 += 56;
    }

    i_1 = 0;
    for (i = 0; i < N*num_fdcts; i++) {      // Loop 3
      L1: f0 = tmp[H(i_1)];    if (!tab[H(i_1)])    goto L1;
      L2: f1 = tmp[H(1+i_1)];  if (!tab[H(1+i_1)])  goto L2;
      // remaining loads
      // computations
      // stores
      i_1 += 8;
    }

  19. Inter-Stage Communication • Hash-based solution • We did not want to include additional delays in the load/store operations • Use H(k) = k MOD m • When m is a power of two, H(k) can be implemented by just using the least log2(m) significant bits of k to address the cache (translates to simple interconnections) (diagram: hashed table cells holding values such as A[1] and A[5] with their empty/full bits)
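Since the modulus is a power of two, the hash reduces to a bit mask, i.e. to wiring in hardware. A minimal sketch, with m = 16 as an assumed buffer size:

```c
#include <assert.h>

#define M 16   /* buffer size m, assumed a power of two */

/* H(k) = k MOD m. For m a power of two this is just the
 * log2(m) least-significant bits of k: zero extra delay
 * on the load/store path, only interconnections. */
static unsigned hash_index(unsigned k) {
    return k & (M - 1);
}
```

With m = 16, the FDCT store addresses 48 and 56 from slide 17 map to buffer cells 0 and 8, so the 64-entry intermediate array is served by a 16-entry buffer.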

  20. Inter-Stage Communication • Hash-based solution: H(k) = k MOD m • Single read (L = 1) • R = 1 (the full flag is cleared after the read) • (a) write • (b) read • (c) empty/full update

  21. Inter-Stage Communication • Hash-based solution: H(k) = k MOD m • Multiple reads (L > 1) • R = 11...1 (L ones) • the full vector is shifted right on each read • (a) write • (b) read • (c) empty/full update
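One way to read the multiple-reads scheme: the 1-bit flag becomes an L-bit full vector per cell. A write sets it to R = 11...1 (L ones, one per pending read); each read shifts the vector right, and the cell becomes empty when it reaches zero. This is a software sketch of my reading of the slide; the names and the single-cell simplification are assumptions.

```c
#include <assert.h>

#define L_READS 3   /* consumer reads each value L = 3 times (assumed) */

static int      cell_data;
static unsigned cell_flags;   /* L-bit full vector for one cell */

/* Write: store the value and mark L reads pending. */
static void write_cell(int v) {
    cell_data  = v;
    cell_flags = (1u << L_READS) - 1;   /* R = 11...1 (L ones) */
}

/* Read: fails when the cell is empty; otherwise returns the value
 * and retires one pending read by shifting the full vector. */
static int read_cell(int *v) {
    if (cell_flags == 0) return 0;
    *v = cell_data;
    cell_flags >>= 1;
    return 1;
}
```

With L = 1 this degenerates to the single-read case of slide 20: R = 1, and the first read clears the flag.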

  22. Buffer size calculation • By monitoring the behavior of the communication component • For each read and write, determine the size of the buffer needed to avoid collisions • Done during RTL simulation
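The monitoring idea can be sketched in software: replay a trace of producer writes (+1) and consumer reads (-1) and record the peak number of outstanding elements, which is the smallest buffer that avoids collisions for that trace. The slides do this during RTL simulation; this trace-replay formulation is an illustrative simplification.

```c
#include <assert.h>

/* Buffer-monitor sketch: trace[i] is +1 for a producer write
 * and -1 for a consumer read. Returns the peak occupancy,
 * i.e. the minimum buffer size with no collisions. */
static int min_buffer_size(const int *trace, int n) {
    int live = 0, peak = 0;
    for (int i = 0; i < n; i++) {
        live += trace[i];
        if (live > peak) peak = live;
    }
    return peak;
}
```

This is also why the slowdown experiments later in the deck shrink the buffers: throttling the producer lowers the peak occupancy the monitor observes.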

  23. Experimental Setup • Compilation flow • Uses our previous work on compiling algorithms in a Java subset to FPGAs

  24. Experimental Setup • Simulation back-end (tool-flow diagram: datapath.xml, fsm.xml and rtg.xml are translated via XSLTs to dotty, .hds, VHDL and Java; datapath.hds, fsm.java and rtg.java feed the HADES simulator as fsm.class and rtg.class, driven by an ANT build file and I/O data (RAMs and stimulus), using a library of operators in Java)

  25. Experimental Results • Benchmarks

  26. Experimental Results • FDCT (speed-up achieved by pipelining sequences of loops)

  27. Experimental Results

  28. Experimental Results • What happens to buffer sizes?

  29. Experimental Results • Adjust the latency of tasks in order to balance pipeline stages: • Slow down tasks with lower latency • Optimize slower tasks in order to reduce their latency • Slowing down producer tasks usually reduces the size of the inter-stage buffers

  30. Experimental Results • Buffer sizes (charts: original vs. +1 and +2 cycles per iteration of the producer; original vs. optimizations in the producer and in the consumer)

  31. Experimental Results • Buffer sizes

  32. Experimental Results • Resources and Frequency (Spartan-3 400)

  33. Related Work • Previous approach (Ziegler et al.) • Coarse-grained communication and synchronization scheme • FIFOs are used to communicate data between pipelining stages • Width of FIFO stages depends on the producer/consumer ordering • Less applicable (diagram: producer/consumer orderings over time and the FIFO widths they require)

  34. Conclusions • We presented a scheme to accelerate applications by pipelining sequences of loops • I.e., before a stage (a set of nested loops) finishes, a subsequent stage (a set of nested loops) can start executing on the data already produced • A data-driven scheme is used, based on empty/full tables • A scheme to reduce the size of the memory buffers for inter-stage pipelining (using a simple hash function) • Depending on the consumer/producer ordering, speedups close to the theoretical ones are achieved • as if the stages executed concurrently and independently

  35. Future Work • Research other hash functions • Study slowdown effects • Apply the technique in the context of multi-core systems (diagrams: communication component with address/data in/out ports, hit/miss signal and internal H, T, R, M blocks; Processor Core A and Processor Core B communicating through memories)

  36. Acknowledgments • Work partially funded by • CHIADO - Compilation of High-Level Computationally Intensive Algorithms to Dynamically Reconfigurable COmputing Systems • Portuguese Foundation for Science and Technology (FCT), POSI and FEDER, POSI/CHS/48018/2002 • Based on the work done by Rui Rodrigues • In collaboration with Pedro C. Diniz

  37. A Data-Driven Approach for Pipelining Sequences of Data-Dependent Loops

  38. Buffer Monitor
