Register Bank Assignment For Spatially Partitioned Processors
Behnam Robatmili, Katherine E. Coons, Kathryn S. McKinley, and Doug Burger
Motivation
• Spatially partitioned processors
  • Technology scalable substrate
  • Challenging compilation target
• Partitioned register files
  • Spill code
  • Operand routing latency
  • Bank and network link contention
• Conflicting goals
  • Reduce communication distances
  • Avoid contention
  • Avoid spills
Traditionally, spill costs take priority
Now, spatial locality and contention are important
Bank Allocation Example
[Figure: variables v0 to v3 assigned to register banks B0 to B2, with instructions i0 and i1 placed on execution tiles E0 to E2. Legend: variables, register banks, network links, flow of data, instructions, execution tiles.]
Outline
• Motivation
• Background
• TRIPS
• Compiling for TRIPS
• Baseline Register Allocator
• Bank Allocation Algorithm
• Customizing for TRIPS
• Results
• Conclusions
Register Allocation for EDGE ISAs
• Block atomic execution
  • Instruction groups fetch, execute, and commit atomically
• Direct instruction communication
  • Explicitly encode dataflow graph by specifying targets
[Figure: RISC and EDGE panels, each showing blocks B0, B1, B2; label: Centralized Register File.]
TRIPS Microarchitecture
• TRIPS ISA
  • Up to 128 instructions/block
  • Instructions can be placed anywhere
• TRIPS microarchitecture
  • Up to 8 blocks in flight
  • 1 cycle latency per hop
• TRIPS block constraints
  • Max 128 instructions
  • 32 load and store instructions
  • 32 register reads or writes
  • 8 register reads/writes per bank
[Figure: tile grid with G, R0 to R3 (register file), D0 to D3 (data cache), and E0 to E15 (execution tiles). Caption: Single cycle communication latency.]
Compiling for TRIPS
[Figure: panels labeled Source Code, Control Flow Graph (blocks B1 to B4), Dataflow Graph (read R1, read R2, mul, add, write R1), and Execution Substrate; label: Static instruction placement.]
TRIPS Compiler Back End
• TRIPS Block Formation
  • If-conversion
  • Loop peeling
  • While loop unrolling
  • Instruction merging
  • Predicate optimizations
• Resource Allocation
  • Register allocation
  • Reverse if-conversion & split
  • Load/store ID assignment
  • SSA for constant outputs
• Scheduling
  • Fanout insertion
  • Instruction placement
  • Target form generation
• TRIPS Assembly Language
Constraints
• 128 instructions
• 32 load/store IDs
• 32 reg. read/writes (4 banks, 8 per bank)
Baseline Register Allocator
• Linear scan register allocator
• Traverse variables using standard priority function (Chow & Hennessy '90):
  • For each variable, find all available architectural registers
  • For each candidate architectural register
    • Check for live range conflicts
    • Check max reads/writes per block constraint
  • Spill variable if no candidate meets criteria
• If spill code invalidates blocks, split invalidated blocks and re-allocate
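The baseline loop can be sketched roughly as follows. This is a minimal illustration, not the TRIPS compiler's code: the register numbering in `bank_of`, the interval representation of live ranges, and the `variables` record layout are all assumptions, and the final step (splitting blocks invalidated by spill code and re-allocating) is left out.

```python
from collections import defaultdict

NUM_BANKS = 4               # from the block constraints in the slides
REGS_PER_BANK = 32          # 128 registers / 4 banks
MAX_ACCESSES_PER_BANK = 8   # register reads/writes per bank, per block


def bank_of(reg):
    # Hypothetical numbering: registers 0-31 in bank 0, 32-63 in bank 1, ...
    return reg // REGS_PER_BANK


def overlaps(a, b):
    # Live ranges are modeled as half-open (start, end) intervals (assumption).
    return a[0] < b[1] and b[0] < a[1]


def allocate(variables):
    """variables: list of dicts with 'name', 'priority', 'range', 'blocks'.

    'blocks' maps a block id to the number of register reads/writes the
    variable needs in that block. Returns (assignment, spilled).
    """
    assignment = {}                  # name -> register
    reg_ranges = defaultdict(list)   # register -> live ranges placed there
    accesses = defaultdict(int)      # (block, bank) -> reads/writes so far
    spilled = []

    # Visit variables from highest to lowest priority.
    for var in sorted(variables, key=lambda v: -v["priority"]):
        for reg in range(NUM_BANKS * REGS_PER_BANK):
            bank = bank_of(reg)
            if any(overlaps(var["range"], r) for r in reg_ranges[reg]):
                continue  # live range conflict
            if any(accesses[(blk, bank)] + n > MAX_ACCESSES_PER_BANK
                   for blk, n in var["blocks"].items()):
                continue  # would exceed the per-block, per-bank limit
            assignment[var["name"]] = reg
            reg_ranges[reg].append(var["range"])
            for blk, n in var["blocks"].items():
                accesses[(blk, bank)] += n
            break
        else:
            spilled.append(var["name"])  # no candidate met the criteria
    return assignment, spilled
```

With nine simultaneously live variables that each touch the same block once, the first eight land in bank 0 and the ninth is pushed to the first register of bank 1, because bank 0 has used up its eight accesses for that block.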
Register Dependence Graph
• First introduced by Hiser et al. (HCSB '00)
• Nodes represent variables
• Edge weights indicate affinity between variables
• Use RDG to optimize the critical path
  • Use ideal schedule to estimate execution time
  • Estimate arrival time of instruction inputs
  • Set edge weights based on differences between arrival times to instructions in critical path
Register Dependence Graph
[Figure: panels labeled Dataflow Dependence Graph, Ideal Schedule, and Register Dependence Graph for variables vr0, vr1, vr2 and temporaries t0 to t4; node and edge labels not recoverable.]
Intermediate Representation
  mul t0,vr0,vr1
  not t1,t0
  not t2,vr2
  add t3,vr1,t2
  sub t4,t1,t3
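To make the "ideal schedule" and "arrival time" ideas concrete, the sketch below schedules the five IR instructions as early as possible and reports how long each operand waits for its sibling. The latencies are hypothetical (the slide does not state them), register values are assumed available at cycle 0, and routing delay and resource limits are ignored, so the numbers are not meant to match the figure.

```python
# Hypothetical latencies in cycles; the slides do not give them.
LATENCY = {"mul": 3, "not": 1, "add": 1, "sub": 1}

# (opcode, destination, sources), in program order, from the IR on the slide.
IR = [
    ("mul", "t0", ["vr0", "vr1"]),
    ("not", "t1", ["t0"]),
    ("not", "t2", ["vr2"]),
    ("add", "t3", ["vr1", "t2"]),
    ("sub", "t4", ["t1", "t3"]),
]


def ideal_schedule(ir, latency):
    """As-soon-as-possible schedule with unlimited resources.

    Returns (ready, slack): ready[d] is the cycle at which result d is
    available; slack[d][s] is how many cycles operand s waits at the
    instruction producing d before the last operand arrives.
    """
    ready = {}
    slack = {}
    for op, dest, srcs in ir:
        # Register inputs (never produced in this block) arrive at cycle 0.
        arrivals = {s: ready.get(s, 0) for s in srcs}
        start = max(arrivals.values())
        slack[dest] = {s: start - t for s, t in arrivals.items()}
        ready[dest] = start + latency[op]
    return ready, slack


ready, slack = ideal_schedule(IR, LATENCY)
print(ready)        # completion cycle of each temporary
print(slack["t4"])  # operand wait times at the final sub
```

Under these assumed latencies the chain through `mul` and the first `not` has zero slack at `sub`, which marks it as the critical path; an operand with zero slack is the one whose register placement matters most.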
Bank Assignment Algorithm
• Traverse variables in priority order:
  • For every variable
    • Find cost for placing it in each bank
    • Choose bank with minimum cost
    • Allocate variable to a register in that bank
• Bank cost
  • Number of variables already allocated to that bank
  • Weights of edges in the RDG
Bank Score Evaluation
• Evaluation function
  • Bank utilization
  • Dependencies among variables
• CalculateBankCost (vr, bank)
  • return CalculateDependenceCost(vr, bank) + bank.numAssignedVR
• CalculateDependenceCost (vr, bank)
  • cost = 0
  • for each nvr RDG neighbor of vr assigned to NeighborBankSet(bank)
    • cost = cost + RDG Weight(vr, nvr)
  • return cost
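A direct Python transcription of the two cost functions, plus the greedy traversal from the previous slide, might look like the sketch below. The slides do not define NeighborBankSet or the sign convention of the RDG weights, so `neighbor_bank_set` is a caller-supplied function and the weights are used exactly as given; the dictionary layout of `rdg` and `assigned` is an assumption made for illustration.

```python
def calculate_dependence_cost(vr, bank, rdg, assigned, neighbor_bank_set):
    """Sum RDG weights to neighbors already assigned within neighbor_bank_set(bank).

    rdg: dict variable -> {neighbor: weight} (assumed layout)
    assigned: dict variable -> bank for variables placed so far
    """
    near = neighbor_bank_set(bank)
    cost = 0
    for nvr, weight in rdg.get(vr, {}).items():
        # Unassigned neighbors map to None, which is never in the bank set.
        if assigned.get(nvr) in near:
            cost += weight
    return cost


def calculate_bank_cost(vr, bank, rdg, assigned, neighbor_bank_set):
    # bank.numAssignedVR: how many variables already live in this bank.
    num_assigned = sum(1 for b in assigned.values() if b == bank)
    return (calculate_dependence_cost(vr, bank, rdg, assigned, neighbor_bank_set)
            + num_assigned)


def assign_banks(variables, banks, rdg, neighbor_bank_set):
    """variables must already be in priority order; ties go to the first bank."""
    assigned = {}
    for vr in variables:
        assigned[vr] = min(
            banks,
            key=lambda b: calculate_bank_cost(vr, b, rdg, assigned, neighbor_bank_set),
        )
    return assigned


# Usage with a hypothetical NeighborBankSet that contains only the bank itself.
rdg = {"a": {"b": 5}, "b": {"a": 5}}
print(assign_banks(["a", "b"], [0, 1, 2, 3], rdg, lambda bank: {bank}))
```

Keeping `neighbor_bank_set` as a parameter leaves the topology model outside the cost function, so the same code can be tried with different notions of which banks count as neighbors.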
Customizing for TRIPS
Fewer register/data cache banks than execution tiles
Heavy traffic between registers and execution tiles
Heavy traffic between data cache and execution tiles
Cost function should separate data cache traffic
• TieBreaker (vr, bank1, bank2)
  • if (vr.affectedCriticalLoads + vr.affectedCriticalStores > 0)
    • return min(bank1, bank2)
  • else
    • return max(bank1, bank2)
[Figure: register file banks B0 to B3 and the data cache.]
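The tie-breaking rule is small enough to write out in full. In the sketch below, `tie_breaker` follows the slide; the `Variable` record and the `choose_bank` helper that applies the rule to equal-cost banks are hypothetical additions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Variable:
    # Hypothetical record; field names mirror the slide's pseudocode.
    name: str
    affected_critical_loads: int = 0
    affected_critical_stores: int = 0


def tie_breaker(vr, bank1, bank2):
    # Variables feeding critical loads/stores go to the lower-numbered bank,
    # all others to the higher-numbered one.
    if vr.affected_critical_loads + vr.affected_critical_stores > 0:
        return min(bank1, bank2)
    return max(bank1, bank2)


def choose_bank(vr, costs):
    """costs: dict bank -> cost. Pick the cheapest bank, resolving ties
    with tie_breaker (hypothetical helper, not from the slides)."""
    best = None
    for bank in sorted(costs):
        if best is None or costs[bank] < costs[best]:
            best = bank
        elif costs[bank] == costs[best]:
            best = tie_breaker(vr, best, bank)
    return best


costs = {0: 2, 1: 2, 2: 5, 3: 2}
print(choose_bank(Variable("x", affected_critical_loads=1), costs))
print(choose_bank(Variable("y"), costs))
```

The effect is that memory-critical variables and the remaining variables drift toward opposite ends of the bank row whenever costs are otherwise equal, which is one way to keep the two kinds of traffic apart.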
Implemented Allocators
• Bank Oblivious
  • Always assign the next available register
  • Fills each bank before switching to the next bank
• Round Robin
  • Selects banks in a round robin fashion
• HCSB
  • Places dependent variables close together
  • No ideal schedule
• Spatial
  • Uses ideal schedule to reason about critical path
  • Customized bank assignment algorithm for TRIPS
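The two simplest policies differ only in the order in which they hand out registers. The sketch below shows that order, assuming 4 banks of 32 registers (128 registers over 4 banks, per the slides) and naming each register by a hypothetical (bank, index) pair; liveness and register reuse are ignored.

```python
from itertools import islice

NUM_BANKS = 4
REGS_PER_BANK = 32  # 128 registers / 4 banks


def bank_oblivious_order():
    # Fill bank 0 completely, then bank 1, and so on.
    for bank in range(NUM_BANKS):
        for index in range(REGS_PER_BANK):
            yield (bank, index)


def round_robin_order():
    # Take one register from each bank in turn.
    for index in range(REGS_PER_BANK):
        for bank in range(NUM_BANKS):
            yield (bank, index)


print(list(islice(bank_oblivious_order(), 5)))
print(list(islice(round_robin_order(), 5)))
```

Bank Oblivious concentrates the first 32 variables in one bank, while Round Robin spreads them evenly but pays no attention to which variables communicate; HCSB and Spatial replace this fixed order with a cost-driven choice.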
Spill Code Size
• Remaining benchmarks never spill
  • TRIPS has 128 registers
  • Register communication converted to intra-block temporaries
EEMBC Results
[Figure: results chart; only the bar labels 1.33 and 1.39 survive.]
Average 5% improvement
Sample Spatial Allocations
[Figure: allocations of variables v0 to v2 for fbital under the Spatial and HCSB allocators, with the store (st) and add (+) instructions that use them.]
Separate memory traffic
SPEC Results
[Figure: results chart; only the bar labels 1.22, 1.22, and 1.23 survive.]
Average 5% improvement
Conclusions
• Spatial locality among registers matters
• Register dependence graph can help
  • Avoids spilling critical registers
  • Flexible tool to incorporate locality information
• Modeling the topology is important
  • Non-uniform distribution of registers/L1 cache banks
  • Separate different types of traffic
• EDGE ISA eases burden on register allocator
  • Spills are rare
  • Spatial locality and contention become first-order constraints