Adaptive Memory Reconfiguration Management: The AMRM Project



  1. Adaptive Memory Reconfiguration Management: The AMRM Project. Rajesh Gupta, Alex Nicolau (University of California, Irvine); Andrew Chien (University of California, San Diego). DARPA DIS PI Meeting, Santa Fe, October 1998

  2. Outline • Project Drivers • application needs for diverse (cache) memory configurations • technology trends favoring reconfigurability in high-performance designs • Project Goals and Deliverables • Project Implementation Plan • Project Team • Summary of New Ideas Proposed by AMRM

  3. Introduction [Figure: memory hierarchy; CPU → L1/TLB (3 cycles, 2 GB/s) → L2 (33 cycles) → memory (57-72 MB/s, ~10^6 cycles from disk)] • Many defense applications are data-starved • large data-sets, irregular locality characteristics • FMM Radar Cross-section Modeling, OODB, CG • Memory access times falling behind CPU speeds • increased memory penalty and data starvation • No single architecture works well: data-intensive applications need a variety of strategies to deliver high performance according to application memory reference needs • multilevel caches/policies • intelligent prefetching schemes • dynamic “cache-like” structures: prediction tables, stream caches, victim caches • even simple optimizations like block size selection improve performance significantly.
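The latency gap on this slide can be made concrete with the standard average-memory-access-time formula; a minimal C sketch follows. The cycle counts come from the slide's hierarchy diagram, but the miss rates in the usage note are hypothetical illustrations, not figures from the deck:

```c
/* Average memory access time for a two-level cache hierarchy:
 *   AMAT = t_L1 + m_L1 * (t_L2 + m_L2 * t_mem)
 * where t_* are access latencies in cycles and m_* are miss rates. */
double amat(double t_l1, double t_l2, double t_mem,
            double miss_l1, double miss_l2)
{
    return t_l1 + miss_l1 * (t_l2 + miss_l2 * t_mem);
}
```

With the slide's 3-cycle L1 and 33-cycle L2 and an assumed 100-cycle memory, hypothetical 10% miss rates at each level already more than double the hit time: amat(3, 33, 100, 0.1, 0.1) = 7.3 cycles, which is why data-intensive codes are "data-starved" long before disk-level latencies enter the picture.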

  4. Technology Evolution [Figure: wire delay (ns/cm) vs. process generation] Industry continues to outpace NTRS projections on technology scaling and IC density. Evolutionary growth, but its effects are subtle and powerful!

  5. Consider Interconnect [Figure: interconnect length (um, 1000-3000) vs. feature size (nm, 1000 down to 100); average interconnect length crosses the critical length, with static interconnect giving way to dynamic interconnect across the cross-over region] • Average interconnect delay is greater than the gate delays! • Reduced marginal cost of logic, coupled with signal regeneration, makes it possible to include logic in inter-block interconnect.

  6. The Opportunity of Application-Adaptive Architectures • Use interconnect and data-path reconfiguration to • adapt architectures for increased performance, combat performance fragility and improve fault tolerance • AMRM's technological basis is reconfigurable hardware: • configurable hardware is used to improve utilization of performance-critical resources (instead of using configurable hardware to build additional resources) • design goal is to achieve peak performance across applications • configurable hardware leveraged in efficient utilization of performance-critical resources • First quantitative answers on the utility of architectural adaptation were provided by the MORPH Point Design Study (PDS)

  7. MORPH Point Design Study: Custom Mechanisms Explored • Combat latency deterioration • optimal prefetching: • “memory side pointer chasing” • blocking mechanisms • fast barrier, broadcast support • synchronization support • Bandwidth management • memory (re)organization to suit application characteristics • translate and gather hardware • “prefetching with compaction” • Memory controller design

  8. Adaptation for Latency Tolerance [Figure: CPU/L1 and L2 cache with a prefetcher that observes virtual address/data and physical address and issues an additional address] Operation: 1. Application sets prefetch parameters (compiler controlled) • set lower/upper bounds on memory regions (for memory protection etc.) • download pointer extraction function • element size 2. Prefetching event generation (runtime controlled) • when a new cache block is filled: if (start <= vAddr && vAddr <= end) { if (pAddr & 0x20) addr = pAddr - 0x20; else addr = pAddr + 0x20; <initiate fetch of cache line at addr to L1> }
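The slide's trigger can be expressed as a small, runnable C function. This is only a software sketch of the hardware check: the region bounds and the 0x20 (32-byte line) sibling toggle come from the slide, while the function name and the 0-means-no-prefetch return convention are illustrative assumptions:

```c
#include <stdint.h>

#define LINE 0x20u   /* 32-byte cache line, matching the slide's 0x20 offset */

/* On a cache-block fill: if the virtual address falls inside the
 * compiler-set region [start, end], return the physical address of the
 * sibling line in its 64-byte pair (the line to prefetch into L1);
 * otherwise return 0, meaning no prefetch is issued.                  */
uint64_t prefetch_target(uint64_t vAddr, uint64_t pAddr,
                         uint64_t start, uint64_t end)
{
    if (vAddr < start || vAddr > end)
        return 0;                             /* outside region: no action */
    return (pAddr & LINE) ? pAddr - LINE : pAddr + LINE;
}
```

Note the slide checks the *virtual* address against the region bounds but toggles the *physical* address, so the sketch keeps both parameters.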

  9. Adaptation for Bandwidth Reduction [Figure: program view vs. physical layout of a sparse matrix (val, row, col, rowPtr, colPtr); address translation and gather logic between the L1 cache and memory synthesize packed (val, rowPtr, colPtr) tuples on access] • No change in program logical data structures • Partition cache • Translate data • Synthesize pointers • Prefetch entire row/column • Pack cache with used data only
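The translate-and-gather idea can be illustrated in software with the compressed-sparse-row layout that the slide's val/rowPtr/colPtr names suggest: a row traversal touches only the used nonzeros, which is the packing the adaptive hardware performs transparently. The function below is an illustrative sketch, not the AMRM hardware interface:

```c
/* Dot product of sparse row r with dense vector x over a CSR matrix
 * (rowPtr/colPtr/val, as named on the slide). Every loaded matrix word
 * is a used value: the cache is packed with useful data only.         */
double csr_row_dot(const int *rowPtr, const int *colPtr,
                   const double *val, const double *x, int r)
{
    double sum = 0.0;
    for (int i = rowPtr[r]; i < rowPtr[r + 1]; i++)
        sum += val[i] * x[colPtr[i]];   /* gather through the column index */
    return sum;
}
```

In the hardware version, the gather happens below the cache, so even the colPtr indirection costs no extra cache lines; the software loop only shows which data is actually needed.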

  10. Adaptation Results • 100x reduction in BW • 10x reduction in miss rate

  11. Going Beyond PDS • Memory hierarchy utilization • estimate working set size • memory grain size • miss types: conflict, capacity, coherence, cold-start • memory access patterns: sequential, stride prediction • assess marginal miss rates and “what-if” scenarios • Dynamic cache structures • victim caches, stream caches, stride prediction, buffers. • Memory bank conflicts • detect array references that cause bank conflicts • PE load profiling • Continuous validation hardware
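The "memory access patterns: sequential, stride prediction" bullet above can be sketched as a small stride detector. The two-repeat confidence scheme, the structure layout, and the -1 "no prediction" return are assumed illustrations, since the slides do not specify the monitoring hardware:

```c
/* Stride predictor sketch: a stride must repeat twice in a row before a
 * prefetch prediction is issued (a sequential pattern is just stride 1). */
typedef struct {
    long last;     /* last address observed          */
    long stride;   /* current candidate stride       */
    int  conf;     /* consecutive repeats of stride  */
} stride_pred;

/* Feed one address; returns the predicted next address once the stride
 * has repeated twice, or -1 while confidence is still being built.     */
long stride_observe(stride_pred *p, long addr)
{
    long s = addr - p->last;
    if (s == p->stride) {
        if (p->conf < 2) p->conf++;   /* same stride again: gain confidence */
    } else {
        p->stride = s;                /* new stride: restart confidence */
        p->conf = 0;
    }
    p->last = addr;
    return (p->conf >= 2) ? addr + p->stride : -1;
}
```

A table of such entries, indexed by instruction address, is the usual way "what-if" marginal miss-rate questions are fed with per-reference pattern information.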

  12. Challenges in Building AA Architectures • Without automatic application analysis, application adaptation remains largely a hand-crafted process • Compiler support for identification and use of appropriate architectural assists is crucial • Significant semantic loss occurs when going from application to compiler-level optimizations • The runtime system must actively support architectural customization safely.

  13. Project Goals • Design an Adaptive Memory Reconfiguration Management (AMRM) system that provides • 100X improvement in hierarchical memory system performance over a conventional static memory hierarchy, in terms of latency and available bandwidth • Develop compiler algorithms that statically select adaptation of the memory hierarchy on a per-application basis • Develop operating system and architecture features which ensure process isolation, error detection and containment for a robust multi-process computing environment.

  14. Project Deliverables • An architecture for adaptive memory hierarchy • Architectural mechanisms and policies for efficient memory system adaptation • Compiler support (identification and selection) of the machine adaptation • OS and HW architecture features which enable process isolation, error detection, and containment in dynamic adaptive systems.

  15. Impact • Optimized data placement and movement through the memory hierarchy • per application sustained performance close to peak machine performance • particularly for applications with non-contiguous large data-sets such as • sparse-matrix and conjugate gradient computations, circuit simulation • data-base (relational and object-oriented) systems • imaging data • security-sensitive applications

  16. Impact (continued) • Integration with core system mechanisms enables multi-process, robust and safe computing • enables basic software modularity through processes on adaptive hardware • ensures static and dynamic adaptation will not compromise system robustness -- errors generally confined to a single process • provides mechanisms for online validation of dynamic adaptation (catching compiler and hardware synthesis errors), enabling fallback to earlier versions for correctness • High system performance using standard CPU components • adaptive cache management achieved using reconfigurable logic, compiler and OS smarts • 15-20X improvement in sparse matrix/conjugate gradient computations • 20X improvement in radar cross section modeling code • high system performance without changing computation resources preserves the DOD investment in existing software

  17. The AMRM Project: Enabling Co-ordinated Adaptation [Figure: base machine (CPU, L1, TLB, L2, memory) augmented with adaptive cache structures: victim cache, stride predictor, software prefetcher, stream cache, stream buffer, miss stride buffer, write buffer] 1. Flexible Memory System Architecture • adaptive machine definition • adaptive cache structures 2. Compiler Control of Cache Adaptation • compilation for adaptive memory • application analysis • application instrumentation for runtime adaptation • synthesis & mapping 3. Safe and Protected Execution • operating system strategies • continuous validation • fault detection and containment

  18. Project Organization • Three coordinated thrusts T1 design of a flexible memory system architecture T2 compiler control of the adaptation process T3 safe and protected execution environment • System architecture enables machine adaptation • by implementing architectural assists, mechanisms and policies • Compiler enables application-specific machine adaptation • by providing powerful memory behavior analysis techniques • Protection and validation enables a robust multi-process software environment • by ensuring process isolation and online validation

  19. Project Personnel • Project Co-PIs • Professor Rajesh Gupta, UC Irvine • Professor Alex Nicolau, UC Irvine • Professor Andrew Chien, UC San Diego • Collaborators • Dr. Phil Kuekes, HP Laboratories, Palo Alto • Research Specialist • Dr. Alexander Veidenbaum, UC Irvine • Graduate Research Assistants • Contract Technical Monitor • Dr. Larry Carter, AIC, Fort Huachuca, AZ

  20. Summary of New Ideas in AMRM 1. Application-adaptive architectural mechanisms and policies for memory latency and bandwidth management: • combat latency deterioration using hardware-assisted blocking, prefetching • manage bandwidth through adaptive translation, movement and placement of application-data for the most efficient access • cache organization, coherence, dynamic cache structures are modified as needed by an application 2. Cache memory adaptation is driven by compiler techniques • semantic retention applied at language and architectural levels • control memory adaptation and maintain machine usability through application software 3. OS and Architecture features enable process isolation and online validation of adaptations • OS and architecture features enable error detection, isolation and containment; online validation extends to dynamic adaptations • modular, robust static and dynamic reconfiguration with precise characterization of isolation properties
