
Mapping Dataflow Blocks to Distributed Hardware

Presentation Transcript


  1. Mapping Dataflow Blocks to Distributed Hardware Behnam Robatmili, Katherine E. Coons, Doug Burger, Kathryn S. McKinley October 22, 2008

  2. Motivation • Improve single-threaded code • Current designs do not exhaust ILP • Single-thread performance matters • Efficiency is critical moving forward • Most energy in high-ILP processors is not consumed by the ALUs • EDGE architectures reduce energy overheads, but not communication

  3. EDGE Architectures • Block atomic execution (Melvin & Patt 1988) • Instruction groups fetch, execute, and commit atomically • Direct instruction communication (dataflow) • Explicitly encode the dataflow graph by specifying targets • [Figure: the same two-block loop compiled for RISC and EDGE; RISC instructions communicate through the register file, while EDGE instructions within an atomic unit send results directly to their target instructions]
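
To make direct instruction communication concrete, here is a minimal Python sketch of block-atomic, target-form execution. The `Inst` format, opcode set, and scheduler are illustrative assumptions, not the actual EDGE ISA encoding; the point is only that each instruction names its consumers rather than a destination register.

```python
# Minimal sketch (not the real EDGE encoding): instructions name the
# block-local slots of their consumers instead of writing a register,
# and the block's results commit together once execution completes.
from dataclasses import dataclass, field

@dataclass
class Inst:
    op: str                # e.g. "add", "mul"
    targets: list          # slots of consumer instructions in this block
    needed: int = 2        # operands required before the instruction fires
    operands: list = field(default_factory=list)

def execute_block(insts, inputs):
    """Fire instructions dataflow-style as their operands arrive."""
    pending = list(inputs)          # (slot, value) pairs, e.g. register reads
    results = {}
    while pending:
        slot, value = pending.pop()
        inst = insts[slot]
        inst.operands.append(value)
        if len(inst.operands) < inst.needed:
            continue                # still waiting for another operand
        out = {"add": sum, "mul": lambda v: v[0] * v[1]}[inst.op](inst.operands)
        results[slot] = out
        for t in inst.targets:      # send the result directly to consumers
            pending.append((t, out))
    return results                  # committed atomically on completion

# Slot 0 adds its two inputs and feeds slot 1, which multiplies.
block = [Inst("add", targets=[1]), Inst("mul", targets=[])]
print(execute_block(block, [(0, 3), (0, 4), (1, 5)]))  # {0: 7, 1: 35}
```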

  4. Outline • Motivation • Background • Block mapping strategies • Core Selection • Results • Conclusions and future work

  5. TFlex Processors • Composable, Lightweight Processor (CLP) • Operating system assigns resources to threads • Core Fusion = x86-compatible approach with similar goals • [Figure: arrays of cores and L2 banks composed into logical processors of different sizes, partitioned among threads by the operating system]

  6. TFlex Cores • [Figure: each TFlex core contains an instruction queue, a register bank, and an L1 cache; neighboring cores communicate with 1-cycle latency]

  7. System Components • Compile time: the compiler translates the application into atomic blocks and produces mapping hints • Run time: the operating system allocates cores and reports which cores are available; the hardware block mapper makes the mapping decisions

  8. Outline • Motivation • Background • Block mapping strategies • Core selection • Results • Conclusions and future work

  9. Hardware Block Mapper • Map blocks to cores at runtime • Fixed strategies • Adaptive strategy • Map instructions to cores • Compiler-generated IDs encode criticality/locality • Preserve locality information • Balance concurrency and communication

  10. Block Mapper • Flat mapping: spread each block's instructions across all participating cores • Deep mapping: place each block entirely on one core, with different blocks on different cores • [Figure: flat vs. deep mapping of blocks onto an array of cores]
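
The distinction between the two fixed strategies fits in a few lines of Python. The integer block/core handles below are illustrative assumptions; the real hardware also tracks issue-queue occupancy and block commit order.

```python
# Hypothetical sketch of the two fixed mapping strategies.

def flat_map(block_insts, cores):
    """Flat: one block's instructions spread over all of the cores."""
    return {inst: cores[i % len(cores)] for i, inst in enumerate(block_insts)}

def deep_map(blocks, cores):
    """Deep: each block lands wholly on one core; consecutive blocks
    rotate across cores so several blocks are in flight at once."""
    return {block: cores[i % len(cores)] for i, block in enumerate(blocks)}
```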

  11. Available Concurrency Differs • Critical path length: 7 cycles • Total instructions: 65 • Max possible IPC: 65 / 7 ≈ 9.3 instructions/cycle

  12. Available Concurrency Differs (cont.) • Critical path length: 54 cycles • Total instructions: 104 • Max possible IPC: 104 / 54 ≈ 1.9 instructions/cycle
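
The bound quoted on these two slides is simply total instructions divided by the dataflow critical path. A minimal sketch, assuming unit-latency instructions and a block given as a dependence DAG (the dictionary representation is an assumption for illustration):

```python
# Upper bound on a block's IPC: total instructions divided by the
# length of the longest dependence chain (the critical path), assuming
# unit-latency instructions. deps maps each instruction to the
# instructions it depends on.
from functools import lru_cache

def max_ipc(deps):
    @lru_cache(maxsize=None)
    def depth(inst):
        return 1 + max((depth(d) for d in deps[inst]), default=0)
    return len(deps) / max(depth(i) for i in deps)

# A 3-instruction chain: 3 insts / 3-cycle path -> max IPC of 1.0.
print(max_ipc({"a": [], "b": ["a"], "c": ["b"]}))
# Slide 11's block: 65 / 7 = 9.3; slide 12's block: 104 / 54 = 1.9.
```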

  13. Block Mapper • [Figure: example blocks annotated with compiler IPC estimates (1, 1.3, and 2 IPC) and the core counts the hardware assigns (1 or 2 cores) under the deep, flat, and adaptive mapping strategies]

  14. Adaptive Block Mapping • Evaluate block concurrency at compile time • Calculate the number of cores, C, at runtime • Select C available cores (see the sketch below)
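
A minimal sketch of that runtime decision. The compiler-supplied concurrency estimate is from the talk; the specific formula below (cover the estimated IPC at the cores' issue width, then clamp to what the OS allocated) is an illustrative assumption, not the paper's exact heuristic:

```python
import math

def cores_for_block(estimated_ipc, issue_width, available_cores):
    """Pick C: enough cores to cover the block's concurrency,
    never more than the OS made available."""
    wanted = math.ceil(estimated_ipc / issue_width)
    return max(1, min(wanted, available_cores))

# A narrow 1.9-IPC block stays on one dual-issue core, while a wide
# 9.3-IPC block spreads across five of them (capped by availability).
print(cores_for_block(1.9, 2, 16))  # -> 1
print(cores_for_block(9.3, 2, 16))  # -> 5
```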

  15. Outline • Motivation • Background • Block mapping strategies • Core selection • Results • Conclusions and future work

  16. Block Mapper Improvements • Instruction mapping (flat and adaptive strategies): to which cores will a block's instructions be mapped? • Core selection (deep and adaptive strategies): to which cores will blocks be mapped?

  17. Block Mapper: Core Selection • Blocks may use a subset of the cores • The block mapper must select among available cores for the deep and adaptive strategies • Minimize inter-block communication: register locality and memory locality

  18. Core Selection Algorithms • Round-robin (RR) • Inside-out (IO): blocks in the center have higher priority • Preferred Location (PL): compiler-generated list of preferred cores • [Figure: priority maps over the core array, from high to low priority, for RR, IO, and PL; a sketch of the three policies follows]
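
A minimal sketch of the three policies over a 4x4 core array. The exact priority orderings are illustrative assumptions: round-robin cycles through cores regardless of position, inside-out favors cores nearest the array's center, and PL walks a compiler-supplied preference list before falling back.

```python
# Hypothetical sketch of the three core-selection policies.
import itertools

SIZE = 4
CORES = [(r, c) for r in range(SIZE) for c in range(SIZE)]
_rr = itertools.cycle(CORES)

def round_robin(free):
    """RR: cycle through the array regardless of position."""
    for _ in range(len(CORES)):
        core = next(_rr)
        if core in free:
            return core

def inside_out(free):
    """IO: cores nearest the array's center have the highest priority."""
    mid = (SIZE - 1) / 2
    return min(free, key=lambda rc: abs(rc[0] - mid) + abs(rc[1] - mid))

def preferred_location(free, preferred):
    """PL: honor the compiler's preference list, then fall back to IO."""
    for core in preferred:
        if core in free:
            return core
    return inside_out(free)

free = {(0, 0), (1, 2), (3, 3)}
print(inside_out(free))                            # -> (1, 2), nearest center
print(preferred_location(free, [(2, 2), (3, 3)]))  # -> (3, 3)
```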

  19. Hardware Complexity • Flat and adaptive: additional distributed protocols • Deep or adaptive with core selection: a priority encoder and storage for core-allocation status • Adaptive instruction mapping: per-block reconfigurable mapping

  20. Outline • Motivation • Background • Block mapping strategies • Instruction mapping and core selection • Results • Conclusions and future work

  21. Methodology • TFlex simulator • Added support for different block mapping strategies • TFlex instruction scheduler • Added concurrency hints to block headers • Evaluated benchmarks • EEMBC, SPEC2000

  22. Results • [Figure: speedup over one dual-issue core on SPEC integer and floating-point benchmarks, comparing single-issue (SI) and dual-issue (DI) cores under the Flat, Deep, and Adaptive strategies]

  23. Communication Overhead • [Figure: hop counts on 16 single-issue (SI) and dual-issue (DI) cores, as a percentage of the total hops used by flat, for the flat, deep, and adaptive strategies with and without PL core selection]

  24. Future Work • New concurrency metrics • Vary the optimization strategy • Group instructions differently • Other granularities of parallelism

  25. Conclusions • Adaptive mapping incurs less communication than flat • It requires more hardware complexity than deep

  26. Questions?

  27. Backup Slides • Adaptive Block Mapping • Cross-core Communication Effect • Instruction Mapping • Communication Overhead • Encoding Locality • Preserving Locality • Reducing Inter-block Communication • Dual-issue Results • Single-issue Results

  28. Adaptive Block Mapping • Balance concurrency and communication: exploit concurrency when available, limit communication costs • Combine hardware and software approaches: software statically summarizes the code, and hardware uses that static information to map dataflow graphs efficiently

  29. Cross-core Communication Effect • [Figure: geomean of speedup over flat (0.9–1.6) for SPEC benchmarks on 16 dual-issue cores, comparing the baseline against idealized Perfect Reg, Perfect Mem, Perfect Operand, and Perfect All communication]

  30. Block Mapper: Core Selection • Blocks may use a subset of the cores • The block mapper must select among available cores for the deep and adaptive strategies • Minimize inter-block communication: register locality and memory locality

  31. Instruction Mapping • Problem: the compiler determines placement, but the number of cores is unknown at compile time • Solution: a hardware/software contract that preserves locality information across configurations

  32. Encoding Locality • The compiler encodes instruction IDs whose bits carry criticality and locality information • The hardware interprets those bits based on the number of cores: the high-order bits select the core (register-file row/column), and the low-order bits select the slot in that core's issue queue • [Figure: instructions a–h with 7-bit IDs (a = 0000000, b = 0010000, c = 0100000, d = 0110000, e = 1010000, f = 1000000, g = 1110000, h = 1100000); instruction d's ID splits as CR FFFFF (01 | 10000) on four cores and CRC FFFF (011 | 0000) on eight]
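
A minimal sketch of the contract in that figure, under the assumption that the high-order bits of the fixed 7-bit ID select the core (more of them as the core count grows) and the low-order bits select the issue-queue slot; the bit widths and power-of-two core counts are taken from the slide's example:

```python
# Hypothetical sketch: the compiler fixes a 7-bit instruction ID once;
# the hardware re-reads the same bits for whatever core count it has.
# The top log2(num_cores) bits pick the core and the rest pick the
# issue-queue slot, so locality in the high bits survives reconfiguration.

def place(inst_id, num_cores, id_bits=7):
    core_bits = num_cores.bit_length() - 1    # log2; num_cores a power of 2
    slot_bits = id_bits - core_bits
    core = inst_id >> slot_bits               # high bits: which core
    slot = inst_id & ((1 << slot_bits) - 1)   # low bits: issue-queue slot
    return core, slot

d = 0b0110000  # instruction "d" from the slide
print(place(d, 4))  # -> (1, 16): the CR|FFFFF split (01 | 10000)
print(place(d, 8))  # -> (3, 0):  the CRC|FFFF split (011 | 0000)
print(place(d, 1))  # -> (0, 48): one core holds the whole block
```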

  33. Preserving Locality • [Figure: the same instructions a–h mapped onto configurations of different sizes; the ID split moves from C FFFFFF to CR FFFFF to CRC FFFF as the core count grows, so instructions that share their high-order locality bits stay on the same core as the configuration shrinks]

  34. Reducing Inter-block Communication • [Figure: SPEC benchmarks on 16 dual-issue cores]

  35. Dual-Issue Results • [Figure: speedup for individual benchmarks, plus the concurrency distribution showing what percentage (0–100%) of blocks run on 1, 2, 4, or more cores]

  36. Single-Issue Results • [Figure: speedup for individual benchmarks, plus the concurrency distribution showing what percentage (0–100%) of blocks run on 1, 2, 4, or more cores]

  37. Motivation • How should blocks be mapped? • Limit communication among instructions • Exploit concurrency • Allocate resources to extract ILP
