
AMRM: Project Technical Approach

Presentation Transcript


  1. AMRM: Project Technical Approach. A Technology and Architectural View of Adaptation. Rajesh Gupta. Project Kickoff Meeting, November 5, 1998, Washington DC.

  2. Outline • Technology trends driving this project • Changing ground-rules in high-performance system design • Rethinking circuits and microelectronic system design • Rethinking architectures • The opportunity of application-adaptive architectures • Adaptation Challenges • Adaptation for memory hierarchy • Why memory hierarchy? • Adaptation space and possible gains • Summary

  3. Technology Evolution • Industry continues to outpace NTRS projections on technology scaling and IC density. • Evolutionary growth, but its effects are subtle and powerful! [Figure: wire delay (ns/cm) across technology generations]

  4. Consider Interconnect • Average interconnect delay is greater than the gate delay! • Reduced marginal cost of logic and signal regeneration needs make it possible to include logic in inter-block interconnect. [Figure: average interconnect length (scales with pitch) and critical length vs. feature size (1000-100 nm), showing the cross-over region between static and dynamic interconnect]

  5. Rethinking Circuits When Interconnect Dominates • DEVICE: Choose better interconnect • Copper, low-temperature interconnect • CAD: Choose better interconnect topology, sizes • Minimize path from driver gate to each receiver gate • e.g., the A-tree algorithm yields about a 12% reduction in delay • Select wire sizes to minimize net delays • e.g., up to 35% reduction in delay by optimal sizing algorithms • CKT: Use more signal repeaters in block-level designs • longest interconnect = 2000 µm for a 350 nm process • u-ARCH: A storage element no longer defines a clock boundary • Multiple storage elements in a single clock • Multiple state transitions in a clock period • Storage-controlled routing • Reduced marginal cost of logic

  6. Implications: Circuit Blocks • Frequent use of signal repeaters in block-level designs • longest interconnect = 2000 µm for a 0.3 µm process • A storage element no longer (always) defines a clock boundary • storage delay (= 1.5x switching delay) • multiple storage elements in a single clock • multiple state transitions in a clock period • storage-controlled routing • Circuit block designs that work independently of data latencies • asynchronous blocks • Heterogeneous clocking interfaces • pausible clocking [Yun, ICCD96] • mixed synchronous and asynchronous circuit blocks.

  7. Implications: Architectures • Architectures to exploit interconnect delays • pipeline interconnect delays [recall Cray-2] • cycle time = max delay - min delay • use interconnect delay as the minimum delay • need P&R estimates early in the design • Algorithms that use interconnect latencies • interconnect as functional units • functional unit schedules are based on a measure of spatial distances • Increase local decision making • multiple state transitions in a clock period • storage-controlled routing • re-programmable blocks in “custom layouts”

  8. Opportunity: Application-Adaptive Architectures • Exploit architectural “low-hanging fruit” • performance variation across applications (10-100X) • performance variation across data sets (10X) • Use interconnect and data-path reconfiguration to • increase performance • combat performance fragility and • improve fault tolerance • Configurable hardware is used to improve utilization of performance-critical resources • instead of using configurable hardware to build additional resources • design goal is to achieve peak performance across applications • configurable hardware leveraged in efficient utilization of performance-critical resources

  9. Architectural Adaptation • Each of the following elements can benefit from increased adaptability (above and beyond CPU programming) • CPU • Memory hierarchy: eliminate false sharing • Memory system: virtual memory layout based on cache miss data • IO: disk layout based on access pattern • Network interface: scheduling to reduce end-to-end latency • Adaptability used to build • programmable engines in IO, memory controllers, cache controllers, network devices • configurable data-paths and logic in any part of the system • configurable queueing in scheduling for interconnect, devices, memory • smart interfaces for information flow from applications to hardware • performance monitoring and coordinated resource management... Intelligent interfaces, information formats, mechanisms and policies.

  10. Adaptation Challenges • Is application-driven adaptation viable from technology and cost point of view? • How to structure adaptability • to maximize the performance benefits • provide protection, multitasking and a reasonable programming environment • enable easy exploitation of adaptability through automatic or semi-automatic means. • We focus on memory hierarchy as the first candidate to explore the extent and utility of adaptation.

  11. Why Cache Memory?

  12. 4-year technological scaling • CPU performance increases by 47% per year • DRAM performance increases by 7% per year • Assume the Alpha is scaled at these rates and • Organization remains 8KB/96KB/4MB/mem • Benchmark requirements are the same • Expect something similar if both L2/L3 cache size and benchmark size increase

  13. Impact of Memory Stalls • A statically scheduled processor with a blocking cache stalls, on average, for • 15% of the time in integer benchmarks • 43% of the time in f.p. benchmarks • 70% of the time in transaction benchmark • Possible performance improvements due to improved memory hierarchy without technology scaling: • 1.17x, • 1.89x, and • 3.33x • Possible improvements with technology scaling • 2.4x, 7.5x, and 20x
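A hedged note on where these bounds come from: if a fraction f_stall of execution time is spent stalled on memory and those stalls are removed entirely, the upper-bound speedup is

    \[
      \text{Speedup}_{\max} \;=\; \frac{1}{1 - f_{\text{stall}}},
      \qquad\text{e.g.}\quad \frac{1}{1-0.15}\approx 1.17,
      \qquad \frac{1}{1-0.70}\approx 3.33 .
    \]

The integer and transaction figures match this bound with the rounded stall fractions quoted above; the f.p. figure (1.89x) corresponds to a slightly higher underlying stall fraction (about 47%) than the rounded 43%.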

  14. Opportunities for Adaptivity in Caches • Cache organization • Cache performance “assist” mechanisms • Hierarchy organization • Memory organization (DRAM, etc) • Data layout and address mapping • Virtual Memory • Compiler assist

  15. Opportunities - Cont’d • Cache organization: adapt what? • Size: NO • Associativity: NO • Line size: MAYBE • Write policy: YES (fetch, allocate, write-back/through) • Mapping function: MAYBE • Organization and clock rate are optimized together

  16. Opportunities - Cont’d • Cache “assist”: prefetch, write buffer, victim cache, etc. between different levels • due to delay/size constraints, all of the above cannot be implemented together • improvement as a function of size may not peak at the maximum size • Adapt what? • which mechanism(s) to use, algorithms • mechanism “parameters”: size, lookahead, etc.

  17. Opportunities - Cont’d • Hierarchy Organization: • Where are cache assist mechanisms applied? • Between L1 and L2 • Between L1 and Memory • Between L2 and Memory... • What are the datapaths like? • Is prefetched, victim-cache, or write-buffer data written into a next-level cache? • How much parallelism is possible in the hierarchy?

  18. Opportunities - Cont’d • Memory Organization • Cached DRAM? • yes, but very limited configurations • Interleave change? • Hard to accomplish dynamically • Tagged memory • Keep state for adaptivity

  19. Opportunities - Cont’d • Data layout and address mapping • In theory, something can be done but • would require time-consuming data re-arrangement • MP case is even worse • Adaptive address mapping or hashing • based on what?

  20. Opportunities - Cont’d • Compiler assist can • Select initial hardware configuration • Pass hints on to hardware • Generate code to collect run-time info and adapt during execution • Adapt configuration after being “called” at certain intervals during execution • Re-optimize code at run-time

  21. Opportunities - Cont’d • Virtual Memory can adapt • Page size? • Mapping? • Page prefetching/read ahead • Write buffer (file cache) • The above under multiprogramming?

  22. Applying Adaptivity • What Drives Adaptivity? • Performance impact, overall and/or relative • “Effectiveness”, e.g. miss rate • Processor stall introduced • Program characteristics • When to perform adaptive action? • Run time: use feedback from hardware • Compile time: insert code, set up hardware

  23. Where to Implement Adaptivity? • In Software: compiler and/or OS • (Static) Knowledge of program behavior • Factored into optimization and scheduling • Extra code, overhead • Lack of dynamic run-time information • Rate of adaptivity • Requires recompilation, OS changes

  24. Where to Implement?- Cont’d • Hardware • dynamic information available • fast decision mechanism possible • transparent to software (thus safe) • delay, clock rate limit algorithm complexity • difficult to maintain long-term trends • little knowledge of program behavior

  25. Where to Implement - Cont’d • Hardware/software • Software can set coarse hardware parameters • Hardware can supply software dynamic info • Perhaps more complex algorithms can be used • Software modification required • Communication mechanism required

  26. Current Investigation • L1 cache assist • See wide variability in assist mechanisms’ effectiveness • across individual programs • within a program as a function of time • Propose a hardware mechanism to select between assist types and allocate buffer space • Give the compiler an opportunity to set parameters

  27. Mechanisms Used (L1 to L2) • Prefetching • Stream Buffers • Stride-directed, based on address alone • Miss Stride: prefetch the same addr using the number of intervening misses as lookahead • Pointer Stride • Victim Cache • Write Buffer

  28. Mechanisms Used - Cont’d • A mechanism can be used by itself • Which is most effective? • All can be used at once • Buffer space size and organization fixed • No adaptivity involved in current results • Observe time-domain behavior

  29. Configurations • 32KB L1 data cache, 32B lines, direct-map • 0.5MB L2 cache, 64B line, direct-map • 8-line write buffer • Latencies: • 1-cycle L1, 8-cycle L2, 60-cycle memory • 1-cycle prefetch, Write Buffer, Victim Cache • All 3 mechanisms at once
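For concreteness, a minimal sketch of how the simulated configuration above might be encoded in a cache simulator; the struct and field names are hypothetical, not taken from the AMRM tools.

    /* Hypothetical encoding of the simulated hierarchy on slide 29. */
    struct cache_level {
        unsigned size_bytes;
        unsigned line_bytes;
        unsigned assoc;           /* 1 = direct-mapped */
        unsigned hit_latency;     /* cycles */
    };

    struct hierarchy_cfg {
        struct cache_level l1, l2;
        unsigned mem_latency;     /* cycles to main memory */
        unsigned wb_entries;      /* write-buffer lines */
        unsigned assist_latency;  /* prefetch / write buffer / victim cache, cycles */
    };

    static const struct hierarchy_cfg cfg = {
        .l1 = { 32 * 1024, 32, 1, 1 },
        .l2 = { 512 * 1024, 64, 1, 8 },
        .mem_latency    = 60,
        .wb_entries     = 8,
        .assist_latency = 1,
    };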

  30. Observed Behavior • Programs benefit differently from each mechanism • none is a consistent winner • Within a program, the same holds between mechanisms in the time domain • Both of the above facts indicate a likely improvement from adaptivity • Select the better mechanism among those available • Even more can be expected from adaptively re-allocating the combined buffer pool • To reduce stall time • To reduce the number of misses

  31. Possible Adaptive Mechanisms • Hardware: • a common pool of (small) n-word buffers • a set of possible policies, a subset of: • Stride-directed prefetch • PC-based prefetch • History-based prefetch • Victim cache • Write buffer

  32. Adaptive Hardware - Cont’d • Performance monitors for each type/buffer • misses, stall time on hit, thresholds • Dynamic buffer allocator among mechanisms • Allocation and monitoring policy: • Predict future behavior from observed past • Observe in time interval T, set for the next T • Save performance trends in next-level tags (< 8 bits)

  33. Adaptive Hardware - Cont’d • Adapt the following • Number of buffers per mechanism • May also include control, e.g. prediction tables • Prefetch lookahead (buffer depth) • Increase when buffers fill up and are still stalling • Adaptivity interval • Increase when every …
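A minimal sketch of the interval-based allocation policy outlined on slides 32-33, assuming a simple proportional scoring rule; the mechanism set, the counters, and the rule itself are illustrative assumptions rather than the AMRM design.

    /* Interval-based reallocation of a common buffer pool among assist
     * mechanisms, driven by per-mechanism monitors gathered over the last
     * interval T. Names and the scoring rule are illustrative only. */
    #include <stdio.h>

    enum { N_MECH = 3 };            /* e.g. prefetch, victim cache, write buffer */
    enum { POOL_BUFFERS = 16 };     /* total small n-word buffers in the pool */

    struct perf_mon {               /* counters gathered over interval T */
        unsigned misses_covered;    /* misses the mechanism removed */
        unsigned stall_cycles_saved;
    };

    /* Hand out buffers roughly in proportion to each mechanism's benefit. */
    static void reallocate(const struct perf_mon mon[N_MECH], unsigned alloc[N_MECH])
    {
        unsigned total = 0, given = 0;
        for (int i = 0; i < N_MECH; i++)
            total += mon[i].stall_cycles_saved + 1;   /* +1 keeps every mechanism alive */
        for (int i = 0; i < N_MECH; i++) {
            alloc[i] = POOL_BUFFERS * (mon[i].stall_cycles_saved + 1) / total;
            given += alloc[i];
        }
        alloc[0] += POOL_BUFFERS - given;             /* give rounding leftovers to mechanism 0 */
    }

    int main(void)
    {
        struct perf_mon mon[N_MECH] = { {40, 900}, {10, 150}, {5, 60} };
        unsigned alloc[N_MECH];
        reallocate(mon, alloc);
        for (int i = 0; i < N_MECH; i++)
            printf("mechanism %d: %u buffers\n", i, alloc[i]);
        return 0;
    }

At the end of each interval the monitors would be reset and the new allocation applied for the next interval, with the interval length itself being one of the adapted parameters.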

  34. Adaptivity via compiler • Give software control over configuration setting • Provide feedback via the same parameters as used by hardware: stall time, miss rate, etc. • Have the compiler • select program points to change the configuration • set parameters based on hardware feedback • use compile-time knowledge as well
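Purely illustrative: one way compiler-emitted code could read hardware feedback and reset the assist configuration at selected program points. The amrm_* interface below is hypothetical and stubbed out; it is not an existing API.

    #include <stdio.h>

    struct assist_feedback { unsigned long stall_cycles, l1_misses; };

    /* Stubs standing in for the hypothetical hardware/runtime interface. */
    static void amrm_read_feedback(struct assist_feedback *fb)
    { fb->stall_cycles = 5200; fb->l1_misses = 80; }
    static void amrm_set_config(unsigned pf, unsigned vc, unsigned wb)
    { printf("config: prefetch=%u victim=%u wbuf=%u\n", pf, vc, wb); }

    /* Code the compiler might emit at a selected program point. */
    void before_loop_nest(void)
    {
        struct assist_feedback fb;
        amrm_read_feedback(&fb);
        /* Illustrative threshold: many stall cycles per miss suggests prefetching. */
        if (fb.l1_misses && fb.stall_cycles / fb.l1_misses > 50)
            amrm_set_config(2, 4, 8);   /* shift buffers toward prefetching */
        else
            amrm_set_config(1, 8, 8);   /* favour the victim cache instead */
    }

    int main(void) { before_loop_nest(); return 0; }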

  35. Further opportunities to adapt • L2 cache organization • variable-size line • L2 non-sequential prefetch • L3 organization and use (for deep submicron) • In-memory adaptivity assist (DRAM tags) • Multiple-processor scenarios • Even longer latency • Coherence, hardware or software • Synchronization • Prefetch under and beyond the above • Avoid coherence if possible • Prefetch past synchronization • Assist adaptive scheduling

  36. The AMRM Project = Compiler, Architecture and VLSI Research for AA Architectures [Project-structure diagram with the following elements:] • Compiler control • Machine definition • Application analysis • Identification of AA mechanisms • Semantic retention strategies • Compiler instrumentation for runtime memory hierarchy analysis and reference structure identification • Fault detection and containment • Interface to mapping and synthesis hardware • Continuous validation strategies • Protection tests • Partitioning, synthesis, mapping algorithms for efficient runtime adaptation • Efficient reprogrammable circuit structures for rapid reconfiguration • Prototype hardware platform

  37. Summary • Semiconductor advances are bringing powerful changes to how systems are architected and built: • they challenge underlying assumptions of synchronous digital hardware design • interconnect (local and global) dominates architectural choices; local decision making is free • in particular, it can be made adaptable using CAD tools • The AMRM Project: • achieve peak performance by adapting machine capabilities to application and data characteristics • Initial focus on the memory hierarchy promises high performance gains due to the worsening gap between memory and CPU speeds and growing data sets.

  38. Appendix: Assists Being Explored

  39. Victim Caching • Small (1-5 lines) fully-associative cache configured as a victim/stream cache or stream buffer. • VC is useful in case of conflict misses and long sequential reference streams; it prevents a sharp fall-off in performance when the WSS is slightly larger than L1. • Estimate WSS from the structure of the RM, such as the size of the strongly connected components (SCCs). • MORPH data-path structure supports addition of a parameterized victim/stream cache; the control logic is synthesized using CAD tools. • Victim caches provide 50X the marginal improvement in hit rate over the primary cache. [Figure: fully-associative victim/stream cache beside the direct-mapped L1/L2, with victim-line and new-line tag paths, MRU update, and a path to the stream buffer]

  40. Victim Cache • Mainly used to eliminate conflict misses • Prediction: the memory address of a cache line that is replaced is likely to be accessed again in the near future • Scenario for the prediction to be effective: false sharing, ugly address mapping • Architecture implementation: use an on-chip buffer to store the contents of recently replaced cache lines • Drawbacks • Ugly mapping can be rectified by a cache-aware compiler • Because of the small size of the victim cache, the probability of memory-address reuse within a short period is very low • Experiments show the victim cache is not effective across the board for DI apps [Figure: CPU with L1, victim cache (VC), and write buffer (WB) tag comparisons in front of the lower memory hierarchy]
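A minimal sketch of the victim-cache behaviour described above, assuming a tiny fully-associative FIFO of recently evicted L1 lines that is probed on L1 misses; sizes and names are illustrative, not the MORPH/AMRM implementation.

    /* Victim cache sketch: on an L1 miss, check a small fully-associative
     * buffer of recently evicted L1 lines; on a hit, the line moves back
     * into L1. Addresses are line-aligned. */
    #include <stdbool.h>
    #include <string.h>

    #define VC_LINES   4
    #define LINE_BYTES 32

    struct vc_entry {
        bool valid;
        unsigned long tag;                 /* line address */
        unsigned char data[LINE_BYTES];
    };

    static struct vc_entry vc[VC_LINES];
    static unsigned vc_next;               /* simple FIFO replacement */

    /* Called when L1 evicts a line: stash it in the victim cache. */
    void vc_insert(unsigned long line_addr, const unsigned char *line)
    {
        struct vc_entry *e = &vc[vc_next];
        vc_next = (vc_next + 1) % VC_LINES;
        e->valid = true;
        e->tag = line_addr;
        memcpy(e->data, line, LINE_BYTES);
    }

    /* Called on an L1 miss: true if the line is found in the victim cache. */
    bool vc_lookup(unsigned long line_addr, unsigned char *line_out)
    {
        for (int i = 0; i < VC_LINES; i++) {
            if (vc[i].valid && vc[i].tag == line_addr) {
                memcpy(line_out, vc[i].data, LINE_BYTES);
                vc[i].valid = false;       /* line returns to L1 */
                return true;
            }
        }
        return false;
    }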

  41. Stream Buffer • Mainly used to eliminate compulsory/capacity misses • Prediction: if a memory address misses, the consecutive address is likely to miss in the near future • Scenario for the prediction to be useful: stream access • Architecture implementation: when an address misses, prefetch the consecutive address into an on-chip buffer; when there is a hit in the stream buffer, prefetch the address consecutive to the hit address [Figure: CPU and L1 cache with stream buffer and write buffer in front of the lower memory hierarchy]
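A minimal sketch of the stream-buffer policy above, simplified to a single stream with a fixed lookahead and addresses counted in units of cache lines; names and the depth are assumptions.

    /* Stream-buffer sketch: an L1 miss that hits the stream consumes the
     * head of the buffer and extends the stream by one line; a miss that
     * does not hit the stream starts a new one. */
    #define SB_LOOKAHEAD 4

    struct stream_buffer {
        unsigned long next_line;   /* line address at the head of the stream */
        int valid;
    };

    /* Model of issuing a prefetch for one cache line (stub). */
    static void prefetch_line(unsigned long line_addr) { (void)line_addr; }

    /* Called on every L1 miss with the missing line address. */
    void sb_on_l1_miss(struct stream_buffer *sb, unsigned long miss_line)
    {
        if (sb->valid && miss_line == sb->next_line) {
            /* Hit in the stream buffer: supply the line, keep the stream full. */
            sb->next_line = miss_line + 1;
            prefetch_line(miss_line + SB_LOOKAHEAD);
        } else {
            /* Stream-buffer miss: start a new stream at the consecutive address. */
            sb->valid = 1;
            sb->next_line = miss_line + 1;
            for (int i = 1; i <= SB_LOOKAHEAD; i++)
                prefetch_line(miss_line + i);
        }
    }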

  42. Stream Cache • Modification of the stream buffer • Use a separate cache to store stream data to prevent cache pollution • When there is a hit in the stream buffer, the hit address is sent to the stream cache instead of the L1 cache [Figure: CPU and L1 cache with stream cache, stream buffer, and write buffer in front of the L2 cache]

  43. Stride Prefetch • Mainly used to eliminate compulsory/capacity misses • Prediction: if a memory address misses, an address offset by a distance from the missed address is likely to miss in the near future • Scenario for the prediction to be useful: stride access • Architecture implementation: when an address misses, prefetch the address offset by a distance from the missed address; when there is a hit in the buffer, also prefetch the address offset by that distance from the hit address
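A minimal sketch of stride-directed prefetching as described above, assuming the stride is inferred from consecutive miss addresses alone; the single-state detector and confirmation rule are illustrative choices.

    /* Stride-prefetch sketch: infer a stride from consecutive miss addresses
     * and, once the same stride is seen twice in a row, prefetch one stride
     * ahead of each subsequent miss. */
    static void prefetch_addr(unsigned long addr) { (void)addr; /* stub */ }

    struct stride_state {
        unsigned long last_addr;   /* previous miss address */
        long stride;               /* last observed address delta */
        int confirmed;             /* same nonzero stride seen twice in a row */
    };

    void stride_on_miss(struct stride_state *s, unsigned long miss_addr)
    {
        long d = (long)(miss_addr - s->last_addr);
        s->confirmed = (d != 0 && d == s->stride);
        s->stride = d;
        s->last_addr = miss_addr;
        if (s->confirmed)
            prefetch_addr(miss_addr + (unsigned long)s->stride);
    }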

  44. Miss Stride Buffer • Mainly used to eliminate conflict misses • Prediction: if a memory address misses again after N other misses, it is likely to miss again after another N misses • Scenario for the prediction to be useful • multiple loop nests • some variables or array elements are reused across iterations
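A minimal sketch of the miss-stride idea above, assuming a small direct-mapped table that records the global miss count at each line's last miss; table size, indexing, and the prefetch-scheduling stub are illustrative assumptions.

    /* Miss-stride sketch: if a line misses again after N other misses,
     * predict it will miss after another N and schedule a prefetch with
     * that lookahead (cf. "number of intervening misses as lookahead"). */
    #define MS_ENTRIES 256

    struct ms_entry {
        unsigned long line;           /* line address last seen here */
        unsigned long last_miss_no;   /* global miss count at that miss */
        unsigned long interval;       /* N: misses between its last two misses */
    };

    static struct ms_entry ms_tab[MS_ENTRIES];
    static unsigned long miss_no;     /* global L1 miss counter */

    /* Model of scheduling a prefetch 'lookahead' misses ahead (stub). */
    static void schedule_prefetch(unsigned long line, unsigned long lookahead)
    { (void)line; (void)lookahead; }

    void ms_on_l1_miss(unsigned long line_addr)
    {
        struct ms_entry *e = &ms_tab[line_addr % MS_ENTRIES];
        miss_no++;
        if (e->line == line_addr) {
            unsigned long n = miss_no - e->last_miss_no;
            if (e->interval != 0 && n == e->interval)
                schedule_prefetch(line_addr, n);   /* repeat expected after N misses */
            e->interval = n;
        } else {
            e->line = line_addr;                   /* new line displaces old entry */
            e->interval = 0;
        }
        e->last_miss_no = miss_no;
    }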
