
João M. P. Cardoso

A Data-Driven Approach for Pipelining Sequences of Data-Dependent Loops. João M. P. Cardoso, Portugal. ITIV, University of Karlsruhe, July 2, 2007.



Presentation Transcript


  1. A Data-Driven Approach for Pipelining Sequences of Data-Dependent Loops João M. P. Cardoso, Portugal ITIV, University of Karlsruhe, July 2, 2007

  2. Motivation • Many applications have sequences of tasks • E.g., in image and video processing algorithms • Contemporary FPGAs • Plenty of room to accommodate highly specialized, complex architectures • It is time to creatively “use available resources” rather than to simply “save resources”

  3. Motivation • Computing Stages • Sequentially (timeline: Task A, then Task B, then Task C, one after another)

  4. Motivation • Computing Stages • Concurrently (timeline: Tasks A, B and C overlap in time)

  5. Outline • Objective • Loop Pipelining • Producer/Consumer Computing Stages • Pipelining Sequences of Loops • Inter-Stage Communication • Experimental Setup and Results • Related Work • Conclusions • Future Work

  6. Objectives • To speed up applications with multiple and data-dependent stages • each stage seen as a set of nested loops • How? • Pipelining those sequences of data-dependent stages using fine-grain synchronization schemes • Taking advantage of field-customizable computing structures (FPGAs)

  7. Loop Pipelining • Attempts to overlap loop iterations • Significant speedups are achieved • But how to pipeline sequences of loops? (diagram: iterations I1, I2, I3, I4 executed back to back vs. overlapped in time)
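The overlap sketched on this slide can be captured with a toy timing model. This is only an illustration, not the paper's cost model: it assumes each loop body splits into `depth` pipeline stages of one cycle each, with a new iteration started every cycle once the pipeline is full.

```c
#include <assert.h>

/* Hypothetical timing model for a single loop.
 * Sequential: iterations run back to back, no overlap. */
static int cycles_sequential(int iterations, int depth) {
    return iterations * depth;
}

/* Pipelined: fill the pipeline once (depth cycles), then
 * one iteration completes per cycle. */
static int cycles_pipelined(int iterations, int depth) {
    return depth + (iterations - 1);
}
```

For 4 iterations of a 3-stage body, sequential execution takes 12 cycles while the pipelined version takes 6, which is where the "significant speedups" on this slide come from.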

  8. Computing Stages • Sequentially Producer: ...A[2]A[1]A[0] Consumer: A[0]A[1]A[2]...

  9. Computing Stages • Concurrently • Ordered producer/consumer pairs • Send/receive through a FIFO with N stages • Producer: ...A[2]A[1]A[0] → Consumer: A[0]A[1]A[2]...

  10. Computing Stages • Concurrently • Unordered producer/consumer pairs • Empty/full table (one empty/full bit per data cell) • Producer: ...A[3]A[5]A[1] → Consumer: A[3]A[1]A[5]...
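The unordered case can be sketched as a data array paired with a 1-bit empty/full table: the producer writes a cell and then sets its flag; the consumer checks the flag before reading, at any address, in any order. The function names here are illustrative.

```c
#include <assert.h>
#include <stdbool.h>

#define SIZE 8

static int  data[SIZE];
static bool full[SIZE];   /* 1-bit empty/full flag per data cell */

/* Producer: out-of-order writes; flag set only after the data is stored. */
static void produce(int addr, int value) {
    data[addr] = value;
    full[addr] = true;
}

/* Consumer: readiness check on the flag. In hardware the consumer
 * busy-waits on this bit; here it simply reports not-ready. */
static bool try_consume(int addr, int *value) {
    if (!full[addr]) return false;
    *value = data[addr];
    return true;
}
```

Unlike the FIFO, the empty/full table synchronizes each cell independently, so producer and consumer orderings need not match.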

  11. Main Idea • FDCT (diagram: data input → Loops 1, 2 → intermediate data array, indexed 0..56 in steps of 8 → Loop 3 → data output; with a global FSM, Loops 1 and 2 execute to completion before Loop 3 starts)

  12. Main Idea • FDCT • Out-of-order producer/consumer pairs • How to overlap computing stages? (diagram: Loops 1 and 2 write the intermediate array in a different order than Loop 3 reads it)

  13. Main Idea • Pipelined FDCT (diagram: data input → Loops 1, 2 under FSM 1 → intermediate data in a dual-port RAM, guarded by a dual-port 1-bit empty/full table → Loop 3 under FSM 2 → data output; execution of Loops 1, 2 overlaps with execution of Loop 3)

  14. Main Idea (diagram: Task A and Task B communicating through memories)

  15. Possible Scenarios • Single write, single read • Accepted without code changes • Single write, multiple reads • Accepted without code changes (by using an N-bit table) • Multiple writes, single read • Need code transformations • Multiple writes, multiple reads • Need code transformations

  16. Inter-Stage Communication • Responsible for: • Communicating data between pipelined stages • Flagging data availability • Solutions • Perfect associative memory • Cost too high • Memory for data plus a 1-bit table (each cell holds empty/full information) • Sized to the data set to communicate • Decrease size using a hash-based solution

  17. Inter-Stage Communication • Memory plus 1-bit table:

    boolean tab[SIZE] = {0, 0, ..., 0};
    ...
    for (i = 0; i < num_fdcts; i++) {        // Loop 1
      for (j = 0; j < N; j++) {              // Loop 2
        // loads
        // computations
        // stores
        tmp[48+i_1] = F6 >> 13;  tab[48+i_1] = true;
        tmp[56+i_1] = F7 >> 13;  tab[56+i_1] = true;
        i_1++;
      }
      i_1 += 56;
    }

    i_1 = 0;
    for (i = 0; i < N*num_fdcts; i++) {      // Loop 3
      L1: f0 = tmp[i_1];    if (!tab[i_1])    goto L1;
      L2: f1 = tmp[1+i_1];  if (!tab[1+i_1])  goto L2;
      // remaining loads
      // computations
      // stores
      i_1 += 8;
    }

  18. Inter-Stage Communication • Hash-based solution:

    boolean tab[SIZE] = {0, 0, ..., 0};
    ...
    for (i = 0; i < num_fdcts; i++) {        // Loop 1
      for (j = 0; j < N; j++) {              // Loop 2
        // loads
        // computations
        // stores
        tmp[H(48+i_1)] = F6 >> 13;  tab[H(48+i_1)] = true;
        tmp[H(56+i_1)] = F7 >> 13;  tab[H(56+i_1)] = true;
        i_1++;
      }
      i_1 += 56;
    }

    i_1 = 0;
    for (i = 0; i < N*num_fdcts; i++) {      // Loop 3
      L1: f0 = tmp[H(i_1)];    if (!tab[H(i_1)])    goto L1;
      L2: f1 = tmp[H(1+i_1)];  if (!tab[H(1+i_1)])  goto L2;
      // remaining loads
      // computations
      // stores
      i_1 += 8;
    }

  19. Inter-Stage Communication • Hash-based solution • We did not want to include additional delays in the load/store operations • Use H(k) = k MOD m • When m is a power of two, H(k) can be implemented by just using the least log2(m) significant bits of k to address the cache (translates to simple interconnections) (diagram: hashed table cells holding values such as A[1] and A[5] with their empty/full bits)
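Since the modulus is a power of two, the hash reduces to a bit mask, i.e. to wiring in hardware. A minimal sketch, with m = 16 as an assumed buffer size:

```c
#include <assert.h>

#define M 16   /* buffer size m, assumed a power of two */

/* H(k) = k MOD m. For m a power of two this is just the
 * log2(m) least-significant bits of k: zero extra delay
 * on the load/store path, only interconnections. */
static unsigned hash_index(unsigned k) {
    return k & (M - 1);
}
```

With m = 16, the FDCT store addresses 48 and 56 from slide 17 map to buffer cells 0 and 8, so the 64-entry intermediate array is served by a 16-entry buffer.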

  20. Inter-Stage Communication • Hash-based solution: H(k) = k MOD m • Single read (L = 1) • R = 1 (the full flag is cleared after the read) • (a) write • (b) read • (c) empty/full update

  21. Inter-Stage Communication • Hash-based solution: H(k) = k MOD m • Multiple reads (L > 1) • R = 11...1 (L ones) • the full vector is shifted right on each read • (a) write • (b) read • (c) empty/full update
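One way to read the multiple-reads scheme: the 1-bit flag becomes an L-bit full vector per cell. A write sets it to R = 11...1 (L ones, one per pending read); each read shifts the vector right, and the cell becomes empty when it reaches zero. This is a software sketch of my reading of the slide; the names and the single-cell simplification are assumptions.

```c
#include <assert.h>

#define L_READS 3   /* consumer reads each value L = 3 times (assumed) */

static int      cell_data;
static unsigned cell_flags;   /* L-bit full vector for one cell */

/* Write: store the value and mark L reads pending. */
static void write_cell(int v) {
    cell_data  = v;
    cell_flags = (1u << L_READS) - 1;   /* R = 11...1 (L ones) */
}

/* Read: fails when the cell is empty; otherwise returns the value
 * and retires one pending read by shifting the full vector. */
static int read_cell(int *v) {
    if (cell_flags == 0) return 0;
    *v = cell_data;
    cell_flags >>= 1;
    return 1;
}
```

With L = 1 this degenerates to the single-read case of slide 20: R = 1, and the first read clears the flag.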

  22. Buffer size calculation • By monitoring the behavior of the communication component • For each read and write, determine the size of the buffer needed to avoid collisions • Done during RTL simulation
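The monitoring idea can be sketched in software: replay a trace of producer writes (+1) and consumer reads (-1) and record the peak number of outstanding elements, which is the smallest buffer that avoids collisions for that trace. The slides do this during RTL simulation; this trace-replay formulation is an illustrative simplification.

```c
#include <assert.h>

/* Buffer-monitor sketch: trace[i] is +1 for a producer write
 * and -1 for a consumer read. Returns the peak occupancy,
 * i.e. the minimum buffer size with no collisions. */
static int min_buffer_size(const int *trace, int n) {
    int live = 0, peak = 0;
    for (int i = 0; i < n; i++) {
        live += trace[i];
        if (live > peak) peak = live;
    }
    return peak;
}
```

This is also why the slowdown experiments later in the deck shrink the buffers: throttling the producer lowers the peak occupancy the monitor observes.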

  23. Experimental Setup • Compilation flow • Uses our previous work on compiling algorithms in a Java subset to FPGAs

  24. Experimental Setup • Simulation back-end (tool-flow diagram: datapath.xml, fsm.xml and rtg.xml are translated via XSLTs to dotty, .hds, VHDL and Java; datapath.hds, fsm.java and rtg.java feed the HADES simulator as fsm.class and rtg.class, driven by an ANT build file and I/O data (RAMs and stimulus), using a library of operators in Java)

  25. Experimental Results • Benchmarks

  26. Experimental Results • FDCT (speed-up achieved by pipelining sequences of loops)

  27. Experimental Results

  28. Experimental Results • What happens to buffer sizes?

  29. Experimental Results • Adjust the latency of tasks in order to balance pipeline stages: • Slow down tasks with lower latency • Optimize slower tasks in order to reduce their latency • Slowing down producer tasks usually reduces the size of the inter-stage buffers

  30. Experimental Results • Buffer sizes (charts: original vs. +1 and +2 cycles per iteration of the producer; original vs. optimizations in the producer and in the consumer)

  31. Experimental Results • Buffer sizes

  32. Experimental Results • Resources and Frequency (Spartan-3 400)

  33. Related Work • Previous approach (Ziegler et al.) • Coarse-grained communication and synchronization scheme • FIFOs are used to communicate data between pipelining stages • Width of FIFO stages depends on the producer/consumer ordering • Less applicable (diagram: producer/consumer orderings over time and the FIFO widths they require)

  34. Conclusions • We presented a scheme to accelerate applications by pipelining sequences of loops • I.e., before a stage (a set of nested loops) finishes, a subsequent stage (a set of nested loops) can start executing on the data already produced • A data-driven scheme is used, based on empty/full tables • A scheme to reduce the size of the memory buffers for inter-stage pipelining (using a simple hash function) • Depending on the consumer/producer ordering, speedups close to the theoretical ones are achieved • as if the stages executed concurrently and independently

  35. Future Work • Research other hash functions • Study slowdown effects • Apply the technique in the context of multi-core systems (diagrams: communication component with address/data in/out ports, hit/miss signal and internal H, T, R, M blocks; Processor Core A and Processor Core B communicating through memories)

  36. Acknowledgments • Work partially funded by • CHIADO - Compilation of High-Level Computationally Intensive Algorithms to Dynamically Reconfigurable COmputing Systems • Portuguese Foundation for Science and Technology (FCT), POSI and FEDER, POSI/CHS/48018/2002 • Based on the work done by Rui Rodrigues • In collaboration with Pedro C. Diniz

  37. A Data-Driven Approach for Pipelining Sequences of Data-Dependent Loops

  38. Buffer Monitor
