Register Bank Assignment For Spatially Partitioned Processors

Behnam Robatmili, Katherine E. Coons, Kathryn S. McKinley, and Doug Burger

Motivation
• Spatially partitioned processors
  • Technology scalable substrate
  • Challenging compilation target
• Partitioned register files
  • Spill code
  • Operand routing latency
  • Bank and network link contention
• Conflicting goals
  • Reduce communication distances
  • Avoid contention
  • Avoid spills
Traditionally, spill costs take priority.
Now, spatial locality and contention are important.

Bank Allocation Example
[Figure: variables v0-v3 assigned to register banks B0-B2, with instructions i0 and i1 placed on execution tiles E0-E2. Network links carry the flow of data between register banks and execution tiles.]

Outline
• Motivation
• Background
• TRIPS
• Compiling for TRIPS
• Baseline Register Allocator
• Bank Allocation Algorithm
• Customizing for TRIPS
• Results
• Conclusions

Register Allocation for EDGE ISAs
• Block atomic execution
  • Instruction groups fetch, execute, and commit atomically
• Direct instruction communication
  • Explicitly encode dataflow graph by specifying targets
[Figure: RISC and EDGE panels, each showing blocks B0, B1, and B2; labeled "Centralized Register File".]

TRIPS Microarchitecture
• TRIPS ISA
  • Up to 128 instructions/block
  • Instructions can be placed anywhere
• TRIPS microarchitecture
  • Up to 8 blocks in flight
  • 1 cycle latency per hop
• TRIPS block constraints
  • Max 128 instructions
  • 32 load and store instructions
  • 32 register reads or writes
  • 8 register reads/writes per bank
[Figure: TRIPS tile grid. Top row: G and register file tiles R0-R3. Each following row: one data cache tile (D0-D3) followed by four execution tiles (E0-E15). Single cycle communication latency.]

Compiling for TRIPS
[Figure: source code is compiled to a control flow graph (blocks B1-B4); each block is a dataflow graph (read R1, read R2, mul, add, write R1) whose instructions are statically placed on the execution substrate.]

TRIPS Compiler Back End
• TRIPS block formation
  • If-conversion
  • Loop peeling
  • While loop unrolling
  • Instruction merging
  • Predicate optimizations
• Resource allocation
  • Register allocation
  • Reverse if-conversion & split
  • Load/store ID assignment
  • SSA for constant outputs
• Scheduling
  • Fanout insertion
  • Instruction placement
  • Target form generation
• Output: TRIPS assembly language
• Constraints
  • 128 instructions
  • 32 load/store IDs
  • 32 reg. read/writes (4 banks, 8 per bank)

Baseline Register Allocator
• Linear scan register allocator
• Traverse variables using standard priority function (Chow & Hennessy '90):
  • For each variable, find all available architectural registers
  • For each candidate architectural register
    • Check for live range conflicts
    • Check max reads/writes per block constraint
  • Spill variable if no candidate meets criteria
• If spill code invalidates blocks, split invalidated blocks and re-allocate
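The baseline loop above can be sketched in a few lines of Python. This is a minimal illustration and not the TRIPS compiler's code: the `Variable` record, the representation of live ranges as sets of block names, and the register-to-bank mapping (`reg // REGS_PER_BANK`) are all assumptions made for the example. The block-splitting and re-allocation step after spilling is left out.

```python
from dataclasses import dataclass

NUM_REGS = 128        # from the slides: TRIPS has 128 registers
NUM_BANKS = 4         # from the slides: 4 register banks
REGS_PER_BANK = NUM_REGS // NUM_BANKS
MAX_ACCESSES = 8      # from the slides: 8 register reads/writes per bank per block


def bank_of(reg):
    # Assumption: registers are numbered bank by bank (0-31 in bank 0, ...).
    return reg // REGS_PER_BANK


@dataclass
class Variable:             # hypothetical record, for illustration only
    name: str
    priority: float
    live_blocks: frozenset      # blocks in which the variable is live
    access_blocks: frozenset    # blocks that read or write the variable


def allocate(variables, num_regs=NUM_REGS):
    occupied = {r: set() for r in range(num_regs)}  # reg -> blocks where it is taken
    accesses = {}                                   # (block, bank) -> access count
    assignment, spilled = {}, []
    # Visit variables from highest to lowest priority.
    for v in sorted(variables, key=lambda v: v.priority, reverse=True):
        for reg in range(num_regs):
            if occupied[reg] & v.live_blocks:
                continue  # live range conflict
            bank = bank_of(reg)
            if any(accesses.get((b, bank), 0) >= MAX_ACCESSES
                   for b in v.access_blocks):
                continue  # would exceed the per-block limit for this bank
            assignment[v.name] = reg
            occupied[reg] |= v.live_blocks
            for b in v.access_blocks:
                accesses[(b, bank)] = accesses.get((b, bank), 0) + 1
            break
        else:
            spilled.append(v.name)  # no candidate register met the criteria
    return assignment, spilled
```

With a single register available, two variables live in the same block cannot both be allocated, so the lower-priority one is spilled, while a variable live only in another block can reuse the register.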

Register Dependence Graph
• First introduced by Hiser et al. (HCSB '00)
• Nodes represent variables
• Edge weights indicate affinity between variables
• Use RDG to optimize the critical path
  • Use ideal schedule to estimate execution time
  • Estimate arrival time of instruction inputs
  • Set edge weights based on differences between arrival times to instructions in critical path

Register Dependence Graph
[Figure: an intermediate representation, its dataflow dependence graph annotated with an ideal schedule, and the resulting register dependence graph over vr0, vr1, and vr2.]
Intermediate representation:
    mul t0,vr0,vr1
    not t1,t0
    not t2,vr2
    add t3,vr1,t2
    sub t4,t1,t3
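To make the "ideal schedule" and "arrival time" ideas concrete, the sketch below computes them for the five-instruction intermediate representation from the example slide. It assumes every instruction takes one cycle unless a latency table says otherwise, that register values (vr0, vr1, vr2) are available at cycle 0, and that routing is free; these are illustrative assumptions, not figures from the slides. The slides do not give the formula that turns arrival-time differences into RDG edge weights, so the sketch stops at per-operand slack.

```python
IR = """
mul t0,vr0,vr1
not t1,t0
not t2,vr2
add t3,vr1,t2
sub t4,t1,t3
"""


def parse(ir):
    """Turn 'op dest,src1,src2' lines into (op, dest, [srcs]) tuples."""
    instrs = []
    for line in ir.strip().splitlines():
        op, operands = line.split(None, 1)
        dest, *srcs = [s.strip() for s in operands.split(",")]
        instrs.append((op, dest, srcs))
    return instrs


def ideal_schedule(instrs, latency=None):
    """Earliest-start schedule with unlimited resources (assumed model)."""
    latency = latency or {}
    arrival = {}  # value name -> cycle at which it becomes available
    start = {}    # dest -> cycle in which its producing instruction issues
    slack = {}    # (dest, src) -> cycles the operand waits before use
    for op, dest, srcs in instrs:          # IR is listed in dependence order
        times = {s: arrival.get(s, 0) for s in srcs}  # registers arrive at 0
        start[dest] = max(times.values())
        arrival[dest] = start[dest] + latency.get(op, 1)
        for s, t in times.items():
            slack[(dest, s)] = start[dest] - t
    return start, arrival, slack


start, arrival, slack = ideal_schedule(parse(IR))
```

Under these assumptions the operand vr1 of the `add` arrives one cycle before it is needed (its slack is 1), while every other operand has zero slack. An operand with zero slack lies on a critical path, which is the kind of information the RDG edge weights are meant to capture.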

Bank Assignment Algorithm
• Traverse variables in priority order:
  • For every variable
    • Find cost for placing it in each bank
    • Choose bank with minimum cost
    • Allocate variable to a register in that bank
• Bank cost
  • Number of variables already allocated to that bank
  • Weights of edges in the RDG

Bank Score Evaluation
• Evaluation function
  • Bank utilization
  • Dependencies among variables
• CalculateBankCost (vr, bank)
  • return CalculateDependenceCost(vr, bank) + bank.numAssignedVR
• CalculateDependenceCost (vr, bank)
  • cost = 0
  • for each nvr RDG neighbor of vr assigned to NeighborBankSet(bank)
    • cost = cost + RDG Weight(vr, nvr)
  • return cost
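The two cost routines and the greedy loop that uses them could look roughly like this in Python. The slides do not define NeighborBankSet or the sign convention of the RDG weights, so the sketch takes the neighbor function as a caller-supplied parameter and uses weights exactly as given. The dictionary-based RDG, the `adjacent` helper, and the first-minimum tie handling are assumptions made for illustration.

```python
def dependence_cost(vr, bank, rdg, bank_of, neighbor_bank_set):
    """Sum RDG weights to neighbors already placed in NeighborBankSet(bank)."""
    near = neighbor_bank_set(bank)
    return sum(w for nvr, w in rdg.get(vr, {}).items()
               if bank_of.get(nvr) in near)   # unassigned neighbors are skipped


def bank_cost(vr, bank, rdg, bank_of, num_assigned, neighbor_bank_set):
    """Dependence cost plus the number of variables already in the bank."""
    return (dependence_cost(vr, bank, rdg, bank_of, neighbor_bank_set)
            + num_assigned[bank])


def assign_banks(ordered_vrs, rdg, num_banks, neighbor_bank_set):
    """Greedy pass: each variable, in priority order, takes the cheapest bank."""
    bank_of = {}
    num_assigned = [0] * num_banks
    for vr in ordered_vrs:
        best = min(range(num_banks),          # ties go to the lowest bank here
                   key=lambda b: bank_cost(vr, b, rdg, bank_of,
                                           num_assigned, neighbor_bank_set))
        bank_of[vr] = best
        num_assigned[best] += 1
    return bank_of


def adjacent(bank):
    # Hypothetical NeighborBankSet: the bank itself and its two row neighbors.
    return {bank - 1, bank, bank + 1}


# Usage: rdg maps each variable to {neighbor: weight}.
rdg = {"a": {"b": 5}, "b": {"a": 5}}
placement = assign_banks(["a", "b"], rdg, num_banks=4, neighbor_bank_set=adjacent)
```

Whether a large weight pulls two variables together or pushes them apart depends on how NeighborBankSet and the weights are defined, which is why both are left to the caller in this sketch.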

Customizing for TRIPS
• Fewer register/data cache banks than execution tiles
• Heavy traffic between registers and execution tiles
• Heavy traffic between data cache and execution tiles
• Cost function should separate data cache traffic
• TieBreaker (vr, bank1, bank2)
  • if (vr.affectedCriticalLoads + vr.affectedCriticalStores > 0)
    • return min(bank1, bank2)
  • else
    • return max(bank1, bank2)
[Figure: register file banks B0-B3 and the data cache.]
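A small Python sketch of the tie-breaking rule follows. The `VR` record and the `choose_bank` helper that applies the rule across all equally cheap banks are hypothetical additions for illustration; only the `tie_breaker` logic comes from the slide. The rule is consistent with lower-numbered banks sitting nearer the data cache, but that reading of the figure is an assumption.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass
class VR:                           # hypothetical variable record
    name: str
    affected_critical_loads: int = 0
    affected_critical_stores: int = 0


def tie_breaker(vr, bank1, bank2):
    """Slide rule: memory-critical variables take the lower-numbered bank."""
    if vr.affected_critical_loads + vr.affected_critical_stores > 0:
        return min(bank1, bank2)
    return max(bank1, bank2)


def choose_bank(vr, costs):
    """Pick the cheapest bank, resolving ties with tie_breaker (assumed usage)."""
    best = min(costs)
    tied = [b for b, c in enumerate(costs) if c == best]
    return reduce(lambda b1, b2: tie_breaker(vr, b1, b2), tied)


# Usage: banks 1 and 2 tie at cost 1.
costs = [3, 1, 1, 2]
memory_bound = choose_bank(VR("v0", affected_critical_loads=2), costs)
compute_bound = choose_bank(VR("v1"), costs)
```

In this example the variable feeding critical loads lands in bank 1 and the other in bank 2, which keeps the two kinds of traffic apart whenever the cost function alone cannot decide.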

Implemented Allocators
• Bank Oblivious
  • Always assign the next available register
  • Fills each bank before switching to the next bank
• Round Robin
  • Selects banks in a round robin fashion
• HCSB
  • Places dependent variables close together
  • No ideal schedule
• Spatial
  • Uses ideal schedule to reason about critical path
  • Customized bank assignment algorithm for TRIPS
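The difference between the two simplest policies can be shown with a short Python sketch. It represents a register as a (bank, index) pair and ignores live ranges, register reuse, and spilling, so it only illustrates the order in which banks fill up. The function names and the 4 x 32 layout (128 registers over 4 banks) are assumptions for the example.

```python
def bank_oblivious(variables, num_banks=4, regs_per_bank=32):
    """Take the next free register; a bank fills completely before the next."""
    if len(variables) > num_banks * regs_per_bank:
        raise ValueError("more variables than registers in this simplified model")
    # divmod(i, regs_per_bank) yields (bank, index within bank).
    return {v: divmod(i, regs_per_bank) for i, v in enumerate(variables)}


def round_robin(variables, num_banks=4, regs_per_bank=32):
    """Rotate through the banks, one variable per bank per round."""
    if len(variables) > num_banks * regs_per_bank:
        raise ValueError("more variables than registers in this simplified model")
    return {v: (i % num_banks, i // num_banks) for i, v in enumerate(variables)}


# Usage: five variables in priority order.
names = ["v0", "v1", "v2", "v3", "v4"]
oblivious_banks = [bank for bank, _ in bank_oblivious(names).values()]
rotating_banks = [bank for bank, _ in round_robin(names).values()]
```

With five variables, the bank-oblivious policy places all of them in bank 0, while round robin spreads them over banks 0, 1, 2, 3 and then back to 0. Neither policy looks at dependences, which is the gap the HCSB and Spatial allocators address.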

Spill Code Size
• Remaining benchmarks never spill
  • TRIPS has 128 registers
  • Register communication converted to intra-block temporaries

EEMBC Results
• Average 5% improvement
[Chart not preserved in the transcript; labeled values: 1.33, 1.39.]

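Sample Spatial Allocations
[Figure: bank placements of v0, v1, and v2 for the fbital benchmark under the Spatial and HCSB allocators, shown with the st and + instructions that use them. Annotation: separate memory traffic.]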

SPEC Results
• Average 5% improvement
[Chart not preserved in the transcript; labeled values: 1.22, 1.22, 1.23.]

Conclusions
• Spatial locality among registers matters
• Register dependence graph can help
  • Avoids spilling critical registers
  • Flexible tool to incorporate locality information
• Modeling the topology is important
  • Non-uniform distribution of registers/L1 cache banks
  • Separate different types of traffic
• EDGE ISA eases burden on register allocator
  • Spills are rare
  • Spatial locality and contention become first-order constraints

Questions?
