Designing Memory Systems for Tiled Architectures Anshuman Gupta September 18, 2009
Multi-core Processors are abundant • Multi-cores increase the compute resources on the chip without increasing hardware complexity • Keeps power consumption within budget • Sun Niagara 2 (8-core) • Intel Polaris (80-core) • AMD Phenom (4-core) • Tile64 (64-core)
Multi-Core Processors are underutilized … b = a + 4 … (0) c = b * 8 … (1) d = c – 2 … (2) e = b * b … (3) f = e * 3 … (4) g = f + d … (5) … [Figure: serial vs. parallel execution schedules of the single-threaded code] Software gets the responsibility of utilizing the cores with parallel instruction streams • Hard to parallelize applications
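The slide's dependence graph can be scheduled mechanically. The sketch below is an illustration only (assuming unit-latency instructions and unlimited cores, not the talk's scheduler); it computes the earliest cycle each of the six instructions can issue:

```python
# Sketch: ASAP scheduling of the slide's dependence graph, assuming
# unit-latency instructions and unlimited cores. Instruction numbers
# match the slide; edges encode true data dependencies.
deps = {
    0: [],        # b = a + 4
    1: [0],       # c = b * 8
    2: [1],       # d = c - 2
    3: [0],       # e = b * b
    4: [3],       # f = e * 3
    5: [2, 4],    # g = f + d
}

def asap_levels(deps):
    """Earliest cycle each instruction can issue (0-based)."""
    level = {}
    for i in sorted(deps):              # instructions are topologically numbered
        level[i] = max((level[d] + 1 for d in deps[i]), default=0)
    return level

levels = asap_levels(deps)
serial_cycles = len(deps)               # one instruction per cycle
parallel_cycles = max(levels.values()) + 1
print(levels)                           # {0: 0, 1: 1, 2: 2, 3: 1, 4: 2, 5: 3}
print(serial_cycles, parallel_cycles)   # 6 vs 4
```

With enough cores, the six serial cycles compress to four, which is the speedup the slide's parallel schedule depicts.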
Tiled Architectures increase Utilization by enabling Parallelization • Tiled architectures are a class of multi-core architectures • Provide mechanisms to facilitate automatic parallelization of single-threaded programs • Fast On-Chip Networks (OCNs) to connect cores • OCN communication latencies are on the order of 2 + (distance between tiles) cycles* *Latency for the RAW inter-ALU OCN
Automatic Parallelization on Tiled Architectures … b = a + 4 … (0) c = b * 8 … (1) d = c – 2 … (2) e = b * b … (3) f = e * 3 … (4) g = f + d … (5) … [Figure: execution schedules of the single-threaded code on generic multi-cores vs. a tiled architecture] Dependent instructions can be placed on multiple cores with low penalty in tiled architectures due to cheap inter-ALU communication
Why aren’t tiled architectures used everywhere? What if we add some memory instructions? … (*b) = a + 4 … (0) c = (*b) * 8 … (1) (*d) = c – 2 … (2) e = (*h) * 4 … (3) f = e * 3 … (4) g = f + (*i) … (5) … [Figure: execution schedules on multi-cores vs. a tiled architecture, now stretched by memory stalls] Automatic parallelization is still very difficult due to slow resolution of remote memory dependencies • Tiled architecture memory systems have a special requirement – Fast Memory Dependence Resolution
Outline • Motivation • Preserving Memory Ordering • Memory Ordering in Existing Work • Analysis of Existing Work • Future Work and Conclusion
Memory Dependence foo (int * a, int * b) { *a = … … = *b } [Figure: the *a store and *b load shown with the possible orderings between them]
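A Python analogue of foo() (illustrative only; lists stand in for the C pointers) shows why the compiler cannot order the two accesses without knowing whether a and b alias:

```python
# The slide's foo(): whether the *a store and the *b load are
# dependent hinges on aliasing the compiler often cannot see.
def foo(a, b):
    a[0] = 42        # *a = ...
    return b[0]      # ... = *b

x = [0]
print(foo(x, [7]))   # 7  - no alias, the load is independent of the store
print(foo(x, x))     # 42 - alias: the load must observe the store
```

The same call site yields both behaviors, so static analysis must conservatively assume a dependence unless it can prove the pointers never alias.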
Memory Coherence • Coherent space provides an abstraction of a single data buffer with a single read/write port • Hierarchical implementations of shared memory require coherence protocols to provide the same abstraction [Figure: Core 0’s Write A = 1 reaching Core 1’s Read A through a shared buffer vs. per-core caches coordinated by a dependence signal]
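A toy illustration of the single-buffer abstraction (dictionaries standing in for memories; this is not a real protocol): without coherence, a per-core cache can keep serving a stale value after another core's write.

```python
# Toy illustration, not a real coherence protocol: without
# invalidation, a per-core cache can serve stale data, while the
# single shared buffer cannot.
shared = {"A": 0}
cache1 = {}

cache1["A"] = shared["A"]     # core 1 caches A = 0
shared["A"] = 1               # core 0 writes A = 1, nobody invalidates
stale = cache1["A"]           # core 1 hits its cache: still 0
coherent = shared["A"]        # the single-buffer abstraction returns 1
print(stale, coherent)        # 0 1
```

A coherence protocol's job is to make the cached path return the same value as the shared-buffer path, here by invalidating or updating cache1 on the write.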
Improving Memory Dependence Resolution • Memory Dependence Resolution Performance depends on – • True Dependence Performance • False Dependence Performance • Coherence System Performance
True Dependence Resolution [Figure: source and destination pipelines, with delays 1–3 marked between the signal stage and the stall stage] • Delay 1 – determined by the signaling stage • Earlier is better • Delay 2 – determined by the signaling delay inside the ordering mechanism • Faster is better • Delay 3 – determined by the stalling stage • Later is better • Delays 1 and 3 are determined by the resolution model
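A back-of-the-envelope model of the three delays (the formula and cycle numbers are our assumptions, not from the talk) shows why earlier signaling and later stalling both help:

```python
# Assumed model: the consumer stalls only if the dependence signal
# arrives after it reaches its stalling stage.
def stall_cycles(signal_stage, network_delay, stall_stage):
    arrival = signal_stage + network_delay    # delays 1 + 2
    return max(0, arrival - stall_stage)      # delay 3 hides the rest

# Earlier signaling and later stalling both shrink the stall:
print(stall_cycles(signal_stage=4, network_delay=3, stall_stage=5))  # 2
print(stall_cycles(signal_stage=2, network_delay=3, stall_stage=6))  # 0
```

In the second case the signal is fully hidden behind the consumer's own pipeline, which is the best case the resolution model can achieve.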
False Dependence Resolution • False dependencies occur when • Static analysis cannot disambiguate • Memory dependence encoding is not partial • For false dependencies, the dependent instruction should ideally not wait for any signal • Runtime Disambiguation • Address comparison is done in hardware to declare the dependent instruction free • Speculation • The dependent instruction is issued speculatively, assuming the dependence is false
Fast Data Access • Local L1 caches can help decrease average latencies • No network delays • Cache Coherence (CC) • Dynamic access – data location not known statically • Expensive dynamic access in the absence of CC
Outline • Motivation • Preserving Memory Ordering • Memory Ordering in Existing Work • RAW • WaveScalar • EDGE • Analysis of Existing Work • Future Work and Conclusion
RAW: A highly static tiled architecture • Array of simple in-order MIPS cores • Scalar Operand Network (SON) for fast inter-ALU communication • Shared address space, local caches and shared DRAMs • No cache coherence mechanism • Software cache management through flush and invalidation *Taylor et al, IEEE Micro 2002
Artifacts of Software Cache Management • Difficult to keep track of the most up-to-date version of a memory address • All memory accesses can be categorized as • Static Access • The location of the cache line is known statically • Dynamic Access • A runtime lookup is required to determine the location of the cache line • These are really expensive (36 vs. 7)
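Quick arithmetic on the slide's 36-vs-7 figures shows how even a small dynamic fraction dominates the average access cost (the mix fractions below are illustrative):

```python
# The slide's numbers: static accesses cost 7, dynamic accesses 36.
STATIC, DYNAMIC = 7, 36

def avg_latency(dynamic_fraction):
    """Average access cost for a given dynamic/static mix."""
    return (1 - dynamic_fraction) * STATIC + dynamic_fraction * DYNAMIC

print(avg_latency(0.0))   # 7.0
print(avg_latency(0.2))   # 12.8 - 20% dynamic nearly doubles the average
```

This is why RAW's compiler works hard to turn dynamic accesses into static ones wherever dependence information allows.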
Static-Dynamic Access Ordering • Two static accesses • Synchronization over the SON • Dependence between a static and a dynamic access • Synchronization over the SON between • The static access • The static requestor or receiver for the dynamic access • Execute-side resolution • No speculative runahead • False dependencies are as expensive as true dependencies
Dynamic Access Ordering • Execute side resolution very expensive • Resolution done late in the memory system • Static ordering point • Turnstile tile • One per equivalence class • Equivalence class - set of all memory operations that can access the same memory address • Requests sent on static SON to turnstile • Receives in memory order • In-order dynamic network channels
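A minimal sketch of the turnstile idea, assuming requests for one equivalence class arrive over the static SON in memory order and are forwarded on an in-order channel (class and method names are ours):

```python
# Minimal sketch: one turnstile per equivalence class receives that
# class's requests in memory order over the static SON and forwards
# them on an in-order dynamic channel, so memory sees program order.
from collections import deque

class Turnstile:
    def __init__(self):
        self.channel = deque()        # in-order channel to the memory system
    def receive(self, op):            # static SON delivers in memory order
        self.channel.append(op)
    def drain(self):
        return list(self.channel)

t = Turnstile()
for op in ["st A", "ld A", "st B"]:   # compiler-determined memory order
    t.receive(op)
print(t.drain())                      # ['st A', 'ld A', 'st B']
```

The serialization is what makes this correct and also what makes it a bottleneck: every operation in the equivalence class funnels through one tile.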
Outline • Motivation • Preserving Memory Ordering • Memory Ordering in Existing Work • RAW • WaveScalar • EDGE • Analysis of Existing Work • Future Work and Conclusion
WaveScalar: A fully dynamic tiled architecture with memory ordering • Clusters arranged in a 2D array connected by a dynamic mesh network • Each tile has a store buffer and a banked data cache • Secondary memory system made up of L2 caches around the tiles • Cache coherence *Swanson et al, Micro 2003
Memory Ordering • WaveScalar preserves memory ordering by assigning a sequence number to each memory operation in a wave • Unique • Indicates age • Each memory operation also stores its predecessor’s and successor’s sequence numbers • Use “?” if not known at compile time • There cannot be a memory operation whose possible predecessor has its successor marked as “?”, and vice versa • MEM-NOPs • A request is allowed to go ahead if its predecessor has issued • In hardware this ordering is managed in the store buffers • A single store buffer is responsible for handling all memory requests for a dynamic wave [Figure: a store buffer resolving the <predecessor, sequence, successor> annotations on Load A, Store B, and Load C]
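A minimal sketch of this issue rule, assuming each operation carries (predecessor, sequence) numbers and may issue once its predecessor has issued (None meaning no predecessor; the "?" cases and MEM-NOPs are left out):

```python
# Sketch of WaveScalar-style wave ordering: operations arrive at the
# store buffer in any order; one issues only after the operation its
# pred field names has issued. None = no predecessor.
def issue_order(ops):
    """ops: list of (pred, seq) tuples, arriving in any order."""
    issued, pending, order = set(), list(ops), []
    while pending:
        for op in pending:
            pred, seq = op
            if pred is None or pred in issued:
                issued.add(seq)
                order.append(seq)
                pending.remove(op)
                break
        else:
            break                 # predecessor unknown: stall (needs a MEM-NOP)
    return order

# Ops arrive out of order: store <1> before load <0>.
print(issue_order([(0, 1), (None, 0), (1, 2)]))   # [0, 1, 2]
```

Even though the store buffer sees store <1> first, the chain of predecessor links reconstructs program order before anything reaches memory.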
Removing False Load Dependencies • Sequence-number-based ordering is highly restrictive • Loads are stalled on previous loads • Each memory operation carries a ripple number – the sequence number of the last prior store • A memory operation can issue once the op named by its ripple number has issued • Loads can issue out of order • Stores still have a total ordering
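A sketch of the ripple-number check (field names are ours): an operation may issue once the store named by its ripple number has issued, so independent loads go out of order instead of chaining on each other:

```python
# Sketch of the ripple-number check; field names are our invention.
# An op may issue once the store named by its ripple number has
# issued, so loads between two stores need not wait on each other.
def can_issue(op, issued_seqs):
    return op["ripple"] is None or op["ripple"] in issued_seqs

load_a = {"seq": 1, "ripple": 0}   # depends only on store <0>
load_b = {"seq": 2, "ripple": 0}   # likewise; independent of load_a

print(can_issue(load_a, set()))    # False - store <0> has not issued yet
print(can_issue(load_b, {0}))      # True  - no need to wait for load <1>
```

Under pure sequence-number ordering, load <2> would have had to wait for load <1>; the ripple number removes exactly that false dependence while stores keep their total order.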
Outline • Motivation • Preserving Memory Ordering • Memory Ordering in Existing Work • RAW • WaveScalar • EDGE • Analysis of Existing Work • Future Work and Conclusion
EDGE: A partially dynamic tiled architecture with block execution • Array of tiles connected over fast OCNs • Primary memory system is distributed over the tiles • Each such tile has an address-interleaved • Data cache • Load Store Queue • Distributed secondary memory system • Cache coherence *S. Sethumadhavan et al, ICCD ’06
Memory Ordering • Each memory operation carries a unique 5-bit tag called the LSID, used for • Detecting completion of block execution • Ordering of memory operations • DTs (data tiles) get a list of all LSIDs in a block during the fetch stage • When a memory operation reaches a DT, its LSID is sent to all the DTs • A request is issued once all requests with earlier LSIDs have completed • Memory-side dependence resolution • When all memory operations have completed, the block is committed [Figure: control tile, execution tiles, and interleaved data tiles tracking LSID list <0,1,2,3> for Ld A <0>, Ld B <1>, St C <2>, Ld C <3>]
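A simplified model of a data tile's issue logic (class and method names are ours, and completion is assumed instantaneous): a request issues only when every earlier LSID in the block has completed.

```python
# Simplified EDGE-style data tile: it knows the block's full LSID
# list from fetch, hears completion broadcasts from all tiles, and
# issues a waiting request once every earlier LSID has completed.
class DataTile:
    def __init__(self, lsids):
        self.lsids = sorted(lsids)     # all LSIDs in the block
        self.done = set()              # completions seen so far
        self.waiting = []              # (lsid, op) not yet issuable
    def arrive(self, lsid, op):
        self.waiting.append((lsid, op))
        return self._try_issue()
    def complete(self, lsid):          # completion broadcast from any tile
        self.done.add(lsid)
        return self._try_issue()
    def _try_issue(self):
        issued = []
        for lsid, op in sorted(self.waiting):
            if all(e in self.done for e in self.lsids if e < lsid):
                issued.append(op)
                self.done.add(lsid)    # assume the op completes immediately
        self.waiting = [(l, o) for l, o in self.waiting if o not in issued]
        return issued

dt = DataTile([0, 1, 2, 3])
print(dt.arrive(2, "st C"))    # [] - LSIDs 0 and 1 have not completed
dt.complete(0)
print(dt.complete(1))          # ['st C'] - all earlier LSIDs are now done
```

Because the ordering check happens at the data tiles rather than at the execution tiles, this is the memory-side resolution the slide names.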
Dependence Speculation • EDGE memory ordering is very restrictive • Total memory order • Loads execute speculatively • Earlier store to the same address causes squash • Predictor used to reduce squashes
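The squash check can be sketched as a simple address comparison when a store resolves (a hypothetical helper; the predictor and replay machinery are omitted):

```python
# Sketch of dependence speculation: loads issue speculatively past
# an unresolved store; when the store's address resolves, any
# speculative load to the same address must be squashed and replayed.
def resolve_store(store_addr, speculative_loads):
    """Return the speculative loads that must be squashed."""
    return [ld for ld in speculative_loads if ld["addr"] == store_addr]

spec = [{"id": 3, "addr": 0x40}, {"id": 5, "addr": 0x80}]
print(resolve_store(0x40, spec))   # load 3 is squashed; load 5 survives
```

The predictor's role is to keep the squash rate low by holding back only the loads that have conflicted before.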
Outline • Motivation • Preserving Memory Ordering • Memory Ordering in Existing Work • Analysis of Existing Work • Future Work and Conclusion
Memory Side Resolution allows more Overlap [Figure: request/response timelines for RAWsd (through the turnstile), RAWdd (through the home node), and EDGE/WaveScalar (home node with tag buffer), showing requests A and B, responses A and B, and the coherence delays for requestors A and B under each model] *The lengths of the bars do not indicate delays
Network Stalls should be avoided • Execute-side resolution (e) • RAWsd • Memory-side resolution (m) • EDGE, WaveScalar • RAW dynamic ordering (mt) • Network delay to the memory system is overlapped [Figure: pipeline stages annotated with where each resolution model (e, m, mt) signals and stalls]
False Dependence Optimization • Partial ordering reduces false dependencies • Disambiguation should be done early • Speculation on false dependencies reduces stalls
Outline • Motivation • Preserving Memory Ordering • Memory Ordering in Existing Work • Analysis of Existing Work • Future Work and Conclusion
What’s a Good Tiled Architecture Memory System? • Local caches for fast L1 hit • Cache Coherence support for ease in programmability and no dynamic access delays • Fast True Dependence Resolution • Performance comparable to same core placement of operations • Late stalls • Early signaling • Reduction of false dependencies through partial memory operation ordering • Fast False Dependence resolution • Performance comparable to same core placement of operations • Early runtime memory disambiguation • Speculative memory requests
Conclusion • Auto-parallelization on tiled architectures can benefit from fast memory dependence resolution • Multi-core memory systems were not designed with this goal • The performance of both true and false dependence resolution should be comparable to that of dependent memory instructions placed on the same core • The ISA should support partial memory operation ordering to avoid artificial false dependencies • The memory system should have local caches and cache coherence for performance and programmability Thank You! Questions?
Dynamic Accesses are expensive • X looks up a global address list and sends a dynamic request to owner Y • Y is interrupted, the data is fetched and a dynamic request is sent to Z • Z is interrupted, and the data is stored in its local cache • One table lookup, two interrupt handlers and two dynamic requests make dynamic loads expensive [Figure: timeline in which lifted portions represent processor occupancy and unlifted portions represent network latency]
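Adding up the slide's sequence with illustrative cycle costs (all numbers below are assumptions, not measurements from the talk):

```python
# Rough cost model for the dynamic-load sequence above; every cycle
# count here is an illustrative assumption, not a measured number.
COSTS = {
    "table_lookup": 1,        # X finds the owner in the global address list
    "dyn_request": 2,         # two network hops: X -> Y and Y -> Z
    "interrupt": 2,           # Y and Z each take an interrupt handler
}
total = (COSTS["table_lookup"]
         + 2 * COSTS["dyn_request"]
         + 2 * COSTS["interrupt"])
print(total)   # 9 cycle-equivalents before the data is even usable
```

Whatever the exact constants, the structure of the sum (one lookup plus two requests plus two interrupts) explains why dynamic loads cost a multiple of a static access.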