Register Bank Assignment For Spatially Partitioned Processors

Behnam Robatmili, Katherine E. Coons, Kathryn S. McKinley, and Doug Burger

Presentation Transcript


  1. Register Bank Assignment For Spatially Partitioned Processors Behnam Robatmili, Katherine E. Coons, Kathryn S. McKinley, and Doug Burger

  2. Motivation
     • Spatially partitioned processors
     • Technology scalable substrate
     • Challenging compilation target
     • Partitioned register files
     • Spill code
     • Operand routing latency
     • Bank and network link contention
     • Conflicting goals
       • Reduce communication distances
       • Avoid contention
       • Avoid spills
     Traditionally, spill costs take priority. Now, spatial locality and contention are important.

  3. Bank Allocation Example
     [Figure: variables v0 to v3 mapped to register banks B0 to B2, and instructions i0 and i1 mapped to execution tiles E0 to E2. Legend: variables, register banks, network links, flow of data, instructions, execution tiles.]

  4. Outline
     • Motivation
     • Background
       • TRIPS
       • Compiling for TRIPS
       • Baseline Register Allocator
     • Bank Allocation Algorithm
     • Customizing for TRIPS
     • Results
     • Conclusions

  5. Register Allocation for EDGE ISAs
     • Block atomic execution
       • Instruction groups fetch, execute, and commit atomically
     • Direct instruction communication
       • Explicitly encode dataflow graph by specifying targets
     [Figure: RISC versus EDGE, with a centralized register file and labels B0, B1, B2.]

  6. TRIPS Microarchitecture
     • TRIPS ISA
       • Up to 128 instructions/block
       • Instructions can be placed anywhere
     • TRIPS microarchitecture
       • Up to 8 blocks in flight
       • 1 cycle latency per hop
     • TRIPS block constraints
       • Max 128 instructions
       • 32 load and store instructions
       • 32 register reads or writes
       • 8 register reads/writes per bank
     [Figure: tile grid with tile G and register file banks R0 to R3 in the top row, data cache banks D0 to D3 in the left column, and execution tiles E0 to E15 in a 4x4 array. Single cycle communication latency.]

  7. Compiling for TRIPS
     [Figure: source code is turned into a control flow graph (blocks B1 to B4); a block's dataflow graph (read R1, read R2, mul, add, write R1) is mapped onto the execution substrate by static instruction placement.]

  8. TRIPS Compiler Back End
     • TRIPS block formation
       • If-conversion
       • Loop peeling
       • While loop unrolling
       • Instruction merging
       • Predicate optimizations
     • Constraints
       • 128 instructions
       • 32 load/store IDs
       • 32 reg. read/writes (4 banks, 8 per bank)
     • Resource allocation
       • Register allocation
       • Reverse if-conversion & split
       • Load/store ID assignment
       • SSA for constant outputs
     • Scheduling
       • Fanout insertion
       • Instruction placement
       • Target form generation
     • Output: TRIPS Assembly Language

  9. Baseline Register Allocator
     • Linear scan register allocator
     • Traverse variables using standard priority function (Chow & Hennessy '90):
       • For each variable, find all available architectural registers
       • For each candidate architectural register
         • Check for live range conflicts
         • Check max reads/writes per block constraint
       • Spill variable if no candidate meets criteria
     • If spill code invalidates blocks, split invalidated blocks and re-allocate
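
The allocation loop on this slide can be sketched roughly as follows. This is a minimal illustration, not the TRIPS compiler's code: the tuple layout, the function name `allocate_registers`, the interleaved register-to-bank mapping, and the rule that a variable counts as one access in every block of its live range are all assumptions made for the example. Block splitting and re-allocation after a spill are left out.

```python
from collections import defaultdict

def allocate_registers(variables, num_regs=128, num_banks=4, max_per_bank=8):
    """variables: list of (name, priority, live_blocks) with live_blocks a set.

    Returns (assignment, spilled). All names here are hypothetical.
    """
    reg_users = defaultdict(list)   # register -> live-block sets already placed there
    bank_load = defaultdict(int)    # (bank, block) -> accesses counted so far
    assignment, spilled = {}, []

    # Visit variables from highest to lowest priority.
    for name, _priority, blocks in sorted(variables, key=lambda v: -v[1]):
        for reg in range(num_regs):
            bank = reg % num_banks  # assumption: registers interleaved across banks
            # Live range conflict: another variable in this register shares a block.
            conflict = any(blocks & other for other in reg_users[reg])
            # Per-block constraint: the bank is already full in some block.
            over = any(bank_load[(bank, b)] >= max_per_bank for b in blocks)
            if not conflict and not over:
                assignment[name] = reg
                reg_users[reg].append(blocks)
                for b in blocks:
                    bank_load[(bank, b)] += 1
                break
        else:
            # No candidate register met both criteria.
            spilled.append(name)
    return assignment, spilled
```

Under these assumptions, a variable is spilled either because every register holds a conflicting live range or because every remaining register sits in a bank that is already at its per-block limit.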

  11. Register Dependence Graph
      • First introduced by Hiser et al. (HCSB '00)
        • Nodes represent variables
        • Edge weights indicate affinity between variables
      • Use RDG to optimize the critical path
        • Use ideal schedule to estimate execution time
        • Estimate arrival time of instruction inputs
        • Set edge weights based on differences between arrival times to instructions in critical path

  12. Register Dependence Graph
      [Figure: the intermediate representation below, its dataflow dependence graph annotated with an ideal schedule, and the resulting register dependence graph over vr0, vr1, and vr2.]
      Intermediate representation:
        mul t0,vr0,vr1
        not t1,t0
        not t2,vr2
        add t3,vr1,t2
        sub t4,t1,t3
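
To make the idea of an ideal schedule concrete, the sketch below computes when each input of the five instructions above would arrive if there were no routing delay or contention. The latencies in `LATENCY` are assumed values chosen for illustration (the slide's own cycle annotations did not survive extraction), and the function name and data layout are hypothetical. Register inputs are assumed to be available at cycle 0.

```python
# Assumed operation latencies in cycles; not taken from the slides.
LATENCY = {"mul": 3, "not": 1, "add": 1, "sub": 1}

# (opcode, destination, sources), listed in dependence order as on the slide.
IR = [
    ("mul", "t0", ["vr0", "vr1"]),
    ("not", "t1", ["t0"]),
    ("not", "t2", ["vr2"]),
    ("add", "t3", ["vr1", "t2"]),
    ("sub", "t4", ["t1", "t3"]),
]

def ideal_schedule(ir, latency):
    """Return (ready, arrivals) for an ideal, contention-free schedule.

    ready[v] is the cycle at which value v becomes available;
    arrivals[dest][src] is when input src reaches the instruction defining dest.
    """
    ready, arrivals = {}, {}
    for op, dest, srcs in ir:
        # Values not yet defined are register reads, available at cycle 0.
        arrivals[dest] = {s: ready.get(s, 0) for s in srcs}
        start = max(arrivals[dest].values())  # wait for the last input
        ready[dest] = start + latency[op]
    return ready, arrivals

ready, arrivals = ideal_schedule(IR, LATENCY)
```

With these assumed latencies, the inputs of `sub` arrive at different cycles, so the later one is on the critical path and the earlier one has slack. Differences of this kind are what the previous slide describes as the basis for edge weights.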

  13. Bank Assignment Algorithm
      • Traverse variables in priority order:
        • For every variable
          • Find cost for placing it in each bank
          • Choose bank with minimum cost
          • Allocate variable to a register in that bank
      • Bank cost
        • Number of variables already allocated to that bank
        • Weights of edges in the RDG

  14. Bank Score Evaluation
      • Evaluation function
        • Bank utilization
        • Dependencies among variables
      CalculateBankCost (vr, bank)
        return CalculateDependenceCost(vr, bank) + bank.numAssignedVR
      CalculateDependenceCost (vr, bank)
        cost = 0
        for each nvr RDG neighbor of vr assigned to NeighborBankSet(bank)
          cost = cost + RDG Weight(vr, nvr)
        return cost
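
The pseudocode above translates fairly directly into Python. The sketch below follows it literally and adds the greedy loop from the Bank Assignment Algorithm slide. The slides do not say how NeighborBankSet is defined or what sign convention the RDG weights use, so both are passed in as plain data here; the function names and dictionary layouts are assumptions for the example.

```python
def dependence_cost(vr, bank, rdg, assigned, neighbor_banks):
    """Sum RDG weights to neighbors of vr already placed in NeighborBankSet(bank)."""
    cost = 0
    for nvr, weight in rdg.get(vr, {}).items():
        # Unassigned neighbors map to None, which is never in a bank set.
        if assigned.get(nvr) in neighbor_banks[bank]:
            cost += weight
    return cost

def bank_cost(vr, bank, rdg, assigned, neighbor_banks):
    """Dependence cost plus the number of variables already in this bank."""
    num_assigned = sum(1 for b in assigned.values() if b == bank)
    return dependence_cost(vr, bank, rdg, assigned, neighbor_banks) + num_assigned

def assign_banks(order, banks, rdg, neighbor_banks):
    """Visit variables in priority order and pick the minimum-cost bank for each."""
    assigned = {}
    for vr in order:
        # min() keeps the first bank on a tie; the slides use a separate tie-breaker.
        assigned[vr] = min(
            banks, key=lambda b: bank_cost(vr, b, rdg, assigned, neighbor_banks)
        )
    return assigned
```

Because the bank with the minimum cost wins, the utilization term spreads variables across banks, while the effect of the dependence term depends entirely on the weights and neighbor sets supplied.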

  16. Customizing for TRIPS
      • Fewer register/data cache banks than execution tiles
      • Heavy traffic between registers and execution tiles
      • Heavy traffic between data cache and execution tiles
      • Cost function should separate data cache traffic
      TieBreaker (vr, bank1, bank2)
        if (vr.affectedCriticalLoads + vr.affectedCriticalStores > 0)
          return min(bank1, bank2)
        else
          return max(bank1, bank2)
      [Figure: register file banks B0 to B3 and the data cache.]
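
A small sketch of how the tie-breaker might plug into bank selection is shown below. The `VirtualRegister` class and `pick_bank` helper are hypothetical; only the `tie_breaker` logic mirrors the slide. The slide does not state what the bank numbering means physically; reading lower-numbered banks as the ones nearer the data cache is an inference from the figure, not something the source confirms.

```python
from dataclasses import dataclass

@dataclass
class VirtualRegister:
    # Hypothetical container; the field names follow the slide's pseudocode.
    name: str
    affected_critical_loads: int = 0
    affected_critical_stores: int = 0

def tie_breaker(vr, bank1, bank2):
    """Prefer the lower-numbered bank for variables feeding critical memory ops."""
    if vr.affected_critical_loads + vr.affected_critical_stores > 0:
        return min(bank1, bank2)
    return max(bank1, bank2)

def pick_bank(vr, costs):
    """costs: dict mapping bank index -> cost. Minimum cost wins, ties go to tie_breaker."""
    best = None
    for bank, cost in costs.items():
        if best is None or cost < costs[best]:
            best = bank
        elif cost == costs[best]:
            best = tie_breaker(vr, best, bank)
    return best
```

For example, with costs `{0: 5, 1: 3, 2: 3, 3: 4}`, a variable that affects a critical load would land in bank 1, while one that affects no critical memory operation would land in bank 2.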

  18. Implemented Allocator
      • Bank Oblivious
        • Always assign the next available register
        • Fills each bank before switching to the next bank
      • Round Robin
        • Selects banks in a round robin fashion
      • HCSB
        • Places dependent variables close together
        • No ideal schedule
      • Spatial
        • Uses ideal schedule to reason about critical path
        • Customized bank assignment algorithm for TRIPS
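
The difference between the two simplest policies can be illustrated with the sketch below. It ignores live ranges and register reuse entirely and only shows the order in which (bank, slot) pairs are handed out; the function names and the default of 4 banks with 32 registers each (128 registers in total) are assumptions for the example.

```python
def bank_oblivious(num_vars, num_banks=4, regs_per_bank=32):
    """Hand out registers in order, filling one bank before moving to the next."""
    if num_vars > num_banks * regs_per_bank:
        raise ValueError("not enough registers")
    return [(i // regs_per_bank, i % regs_per_bank) for i in range(num_vars)]

def round_robin(num_vars, num_banks=4, regs_per_bank=32):
    """Rotate through the banks, taking the next free slot in each."""
    if num_vars > num_banks * regs_per_bank:
        raise ValueError("not enough registers")
    return [(i % num_banks, i // num_banks) for i in range(num_vars)]
```

Each result is a list of (bank, slot) pairs in allocation order. Neither policy looks at dependences between variables, which is what the HCSB and Spatial allocators add.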

  19. Spill Code Size
      • Remaining benchmarks never spill
        • TRIPS has 128 registers
        • Register communication converted to intra-block temporaries

  20. EEMBC Results
      [Chart; value labels: 1.33, 1.39]
      Average 5% improvement

  23. Sample Spatial Allocations
      [Figure: bank allocations of v0, v1, and v2 for fbital under the Spatial and HCSB allocators, with store (st) and add (+) instructions.]
      Separate memory traffic

  24. SPEC Results
      [Chart; value labels: 1.22, 1.22, 1.23]
      Average 5% improvement

  26. Conclusions
      • Spatial locality among registers matters
      • Register dependence graph can help
        • Avoids spilling critical registers
        • Flexible tool to incorporate locality information
      • Modeling the topology is important
        • Non-uniform distribution of registers/L1 cache banks
        • Separate different types of traffic
      • EDGE ISA eases burden on register allocator
        • Spills are rare
        • Spatial locality and contention become first-order constraints

  27. Questions?
