
A Design Space Exploration of Grid Processor Architectures

This research explores the design space of grid processor architectures (GPAs), focusing on scalability with process technology, a fast clock rate, and high instruction-level parallelism (ILP). The proposed architecture uses ALU chaining and partitions the instruction cache and register file into banks around the ALUs, eliminating the need for most register file reads, associative issue windows, rename tables, and global bypass networks. The evaluation demonstrates improved performance on SPEC CPU2000 and Mediabench benchmarks.


Presentation Transcript


  1. A Design Space Exploration of Grid Processor Architectures Karu Sankaralingam, Ramadass Nagarajan, Doug Burger, and Stephen W. Keckler Computer Architecture and Technology Laboratory Department of Computer Sciences The University of Texas at Austin

  2. Technology and Architecture Trends • Good news • Lots of transistors, faster transistors • Bad news • Pipeline depth near optimal • Pipelining limits will slow clock rate improvements by half • Performance must come from more ILP • IPC has only doubled in one decade, despite considerable effort • Global wire delays are growing • At 35nm, less than 1% of die is 1-cycle-reachable • Goals for future architectures • Scalability with process technology improvements • Fast clock and high ILP

  3. A “New” Approach • ALU chaining • Execution model eliminates: majority of register reads, associative issue windows, rename tables, global bypass • Partitions I-cache and register file banks around ALUs • Statically map and dynamically issue

  4. Outline • Grid Processor Architecture (GPA) • Block Compilation • Program Execution • Evaluation • Conclusions and Future Work

  5. Grid Processor [diagram: a banked register file (banks 0–3) injects moves into the grid; partitioned instruction caches (banks 0–3) on one side and load/store queues with data cache banks 0–3 on the other; each grid node holds an instruction slot (Inst, OP1, OP2), an ALU, and a router; block termination logic sits below the grid]

  6. Block Compilation (1 of 3) Intermediate code: I1) add r1, r2, r3; I2) sub r7, r2, r1; I3) ld r4, (r1); I4) add r5, r4, r4; I5) beqz r5, 0xdeac [diagram: the corresponding dataflow graph — move r2 feeds I1 and I2, move r3 feeds I1; I1 feeds I2 and I3; I3 feeds I4; I4 feeds I5; r2 and r3 are block inputs, r7 is an output, the rest are temporaries]
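The def-use analysis behind this slide can be sketched in a few lines. This is a minimal illustration of building the dataflow graph, not the authors' toolchain; the function name and data layout are my own:

```python
# Build a dataflow graph from the intermediate code on this slide.
# An edge runs from the most recent writer of a register to each
# instruction that reads it; registers read before any write in the
# block are block inputs (supplied by move instructions).
def build_dataflow_graph(instructions):
    last_writer = {}   # register -> id of its most recent definition
    edges = set()      # (producer, consumer) pairs
    inputs = set()     # block input registers
    for iid, dest, srcs in instructions:
        for reg in srcs:
            if reg in last_writer:
                edges.add((last_writer[reg], iid))
            else:
                inputs.add(reg)
        if dest is not None:
            last_writer[dest] = iid
    return edges, inputs

# The five instructions from the slide: (id, destination, sources).
block = [
    ("I1", "r1", ["r2", "r3"]),   # add  r1, r2, r3
    ("I2", "r7", ["r2", "r1"]),   # sub  r7, r2, r1
    ("I3", "r4", ["r1"]),         # ld   r4, (r1)
    ("I4", "r5", ["r4", "r4"]),   # add  r5, r4, r4
    ("I5", None, ["r5"]),         # beqz r5, 0xdeac
]

edges, inputs = build_dataflow_graph(block)
```

Running this recovers the graph on the slide: r2 and r3 are block inputs (hence the moves), I1 feeds I2 and I3, I3 feeds I4, and I4 feeds the branch I5; r7 is a block output.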

  7. Block Compilation (2 of 3) [diagram: the scheduler maps the dataflow graph onto grid coordinates — move r2 is placed at (1,1) and targets (1,3) and (2,2); move r3 targets (1,3); I1 through I5 are assigned grid locations; r7 remains a register output]

  8. Block Compilation (3 of 3) Code generation turns the mapping into GPA code with explicit targets, e.g. I1) : (1,3) add (1,-1), (1,0) — the instruction's grid location (1,3), its opcode, and the locations of its consumer instructions as targets [diagram: the mapping from slide 7]
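The code-generation step on this slide can be illustrated with a toy generator. This is my own sketch, not the authors' tool: the function name, the placement dictionary, and the reading of targets as (row, column) offsets relative to the producer are assumptions inferred from the slide's example line:

```python
# Emit GPA-style code: each instruction is annotated with its grid
# location, its opcode, and the positions of its consumers as targets,
# encoded here as (row, column) offsets relative to the producer.
def generate_gpa_code(placement, opcodes, edges):
    consumers = {}
    for producer, consumer in edges:
        consumers.setdefault(producer, []).append(consumer)
    code = {}
    for inst, (row, col) in placement.items():
        targets = [(placement[c][0] - row, placement[c][1] - col)
                   for c in consumers.get(inst, [])]
        code[inst] = (opcodes[inst], (row, col), targets)
    return code

# Hypothetical placement consistent with the slide's example line
# "I1) : (1,3) add (1,-1), (1,0)":
placement = {"I1": (1, 3), "I2": (2, 2), "I3": (2, 3)}
opcodes = {"I1": "add", "I2": "sub", "I3": "ld"}
edges = [("I1", "I2"), ("I1", "I3")]   # I1's result feeds I2 and I3
code = generate_gpa_code(placement, opcodes, edges)
```

With this placement, `code["I1"]` comes out as `("add", (1, 3), [(1, -1), (1, 0)])`, matching the slide's example line.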

  9. Block Atomic Execution Model • A block of instructions is an atomic unit of fetch/schedule/execute/commit • Blocks expose critical path • Operand chains hidden from large structures • Instructions specify consumers as explicit targets • Blocks allow simple internal control flow • Single point of entry • If-conversion using predication • Predicated hyperblocks
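The data-driven issue described above can be mimicked with a small simulator: an instruction fires as soon as all its operands arrive, and its result is routed to its explicit consumer targets. This is an idealized sketch with no grid timing, and operands fill slots in arrival order rather than the explicit left/right operand slots a real node would have; the memory contents are hypothetical:

```python
from collections import deque

memory = {5: 10}  # hypothetical data memory for the load

def execute_block(instrs, initial_operands):
    """Fire instructions in dataflow order.
    instrs: id -> (function, operand count, consumer ids).
    Operands travel as (consumer, value) pairs; an instruction fires
    once all of its operands have arrived, and its result is then
    forwarded to each of its consumers."""
    pending = {i: [] for i in instrs}
    in_flight = deque(initial_operands)
    results = {}
    while in_flight:
        consumer, value = in_flight.popleft()
        pending[consumer].append(value)
        fn, arity, targets = instrs[consumer]
        if len(pending[consumer]) == arity:
            results[consumer] = fn(*pending[consumer])
            for t in targets:
                in_flight.append((t, results[consumer]))
    return results

# The block from slide 6, with r2 = 2 and r3 = 3 injected by moves.
instrs = {
    "I1": (lambda a, b: a + b, 2, ["I2", "I3"]),   # add  r1, r2, r3
    "I2": (lambda a, b: a - b, 2, []),             # sub  r7, r2, r1
    "I3": (lambda a: memory[a], 1, ["I4", "I4"]),  # ld   r4, (r1)
    "I4": (lambda a, b: a + b, 2, ["I5"]),         # add  r5, r4, r4
    "I5": (lambda a: a == 0, 1, []),               # beqz r5, 0xdeac
}
results = execute_block(instrs, [("I1", 2), ("I1", 3), ("I2", 2)])
```

Note that nothing outside a node ever sees the operand chain: values flow point to point between producers and their named consumers, which is exactly why the large centralized structures can be dropped.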

  10. Block Execution [diagram: the block mapped onto the grid — moves from the I-cache banks (0–3) inject register values; the add, sub, load, add, and beqz instructions fire at their grid nodes as operands arrive; load/store queues and D-cache banks 0–3 service the load; block termination logic detects completion]

  11. Block Execution (continued) [diagram: the same block execution figure as slide 10, a later step of the animation]

  12. Instruction Buffers – Frames • What if blocks exceed the grid size, or fetch and map should overlap? Use frames! • Virtualize the grid: 4 instruction slots per node == 4 frames • Logical partitioning of node storage space • Local out-of-order (OOO) issue [diagram: each node holds Inst/OP1/OP2 instruction buffers plus control, an ALU, and a router]

  13. Execution Opportunities • Serialized block fetch/map and execute • Overlapped instruction distribution and execution • Overlapped fetch/map • Next-block predictor • Block level squash on mis-prediction • Overlapped execution of blocks • Next-block predictor • Block level squash on mis-prediction • Block stitching using input/output register masks
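The input/output register masks used for block stitching can be sketched as a bitmask intersection. The helper names and the 64-register width are assumptions for illustration, not the paper's encoding:

```python
# Each block advertises the architectural registers it reads (input
# mask) and writes (output mask) as bitmasks. A register written by
# block N and read by block N+1 is forwarded directly from producer
# to consumer instead of waiting on a register file write.
NUM_REGS = 64  # assumed architectural register count

def mask(regs):
    """Pack a list of register numbers into a bitmask."""
    m = 0
    for r in regs:
        m |= 1 << r
    return m

def stitch(prev_output_mask, next_input_mask):
    """Registers to forward between two overlapped blocks."""
    forward = prev_output_mask & next_input_mask
    return {r for r in range(NUM_REGS) if (forward >> r) & 1}

# Hypothetical: block N writes r1, r5, r7; block N+1 reads r2, r5, r7.
forwarded = stitch(mask([1, 5, 7]), mask([2, 5, 7]))
```

Here `forwarded` is `{5, 7}`: those values are stitched straight into the next block, r2 still comes from the register file, and r1 is simply written back for later consumers.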

  14. Evaluation • 3 SPECint2000, 3 SPECfp2000, 3 Mediabench benchmarks • adpcm, dct, mpeg2encode • gcc, mcf, parser • ammp, art, equake • Compiled using the Trimaran toolset • Hyperblocks parsed and scheduled using custom tools • Event-driven configurable timing simulator used for performance estimates

  15. GPA Evaluation Parameters • GPA: 8x8 grid; ¼-cycle router delay + ¼-cycle wire delay; 32 instruction slots at every node • Superscalar: 5-stage pipeline, 8-wide; 0-cycle router and wire delay; 512-entry instruction window • Both: Alpha 21264 functional unit latencies • L1: 3 cycles, L2: 13 cycles, main memory: 62 cycles

  16. GPA Performance Comparison [chart: results for the SPECint, SPECfp, and Mediabench benchmarks, plus the mean]

  17. Sensitivity to Communication Delay [chart]

  18. Conclusions • Technology trends • Enforce partitioning • Wire delays become a first-order constraint • GPA • Distributed execution engine with few central structures • Technology scalable, with a fast clock rate and high ILP • Challenges • Block control mechanisms • Distributed memory interface design • Optimizing predication mechanisms

  19. Future Work • Alternate execution models • SMT support • Use frames to run different threads • Stream based execution • Loop re-use and data partitioning in caches • Scientific vector-based execution • Use rows as vector execution units • Vector loads read from caches • Hardware prototype

  20. Related Work • Dataflow • Static dataflow architecture – Dennis and Misunas [1975] • Tagged-Token Dataflow – Arvind [1990] • Hybrid dataflow execution – Culler et al. [1991] • RAW architecture – Waingold et al. [1997] • Multiscalar Processors – Sohi et al. [1995] • Trace Processors – Vajapeyam [1997] • Clustered Speculative Multithreaded Processors – Marcuello and González [1999] • Levo – Uht et al. [2001]

  21. Questions
