
HiDISC: A Decoupled Architecture for Applications in Data Intensive Computing


Presentation Transcript


  1. USC UNIVERSITY OF SOUTHERN CALIFORNIA HiDISC: A Decoupled Architecture for Applications in Data Intensive Computing PIs: Alvin M. Despain and Jean-Luc Gaudiot University of Southern California http://www-pdpc.usc.edu May 2001

  2. Outline • HiDISC Project Description • Experiments and Accomplishments • Work in Progress • Summary

  3. [Figure: HiDISC processor overview. Sensor inputs from applications (FLIR, SAR, video, ATR/SLD, scientific) feed a decoupling compiler that targets three processors sharing registers, a cache, and memory; a dynamic database supports situational awareness.] HiDISC: Hierarchical Decoupled Instruction Set Computer • Technological trend: memory latency is getting longer relative to microprocessor speed (40% per year) • Problem: some SPEC benchmarks spend more than half of their time stalling • Domain: benchmarks with large data sets, including symbolic, signal processing, and scientific programs • Present solutions: multithreading (homogeneous), larger caches, prefetching, software multithreading

  4. Present Solutions • Larger caches: slow; work well only if the working set fits in the cache and there is temporal locality • Hardware prefetching: cannot be tailored for each application; behavior is based on past and present execution-time behavior • Software prefetching: the overheads of prefetching must not outweigh the benefits, which forces conservative prefetching; adaptive software prefetching is required to change the prefetch distance at run-time; hard to insert prefetches for irregular access patterns • Multithreading: solves the throughput problem, not the memory latency problem
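The software-prefetching limitation above can be made concrete with a small sketch. It uses the GCC/Clang `__builtin_prefetch` builtin with a fixed compile-time prefetch distance; the `PF_DIST` value and function name are illustrative, and the fixed distance is exactly what the slide's "adaptive software prefetching is required" point criticizes.

```c
#include <stddef.h>

/* Fixed prefetch distance chosen at compile time: too small and data
 * arrives late, too large and prefetched lines are evicted before use.
 * This is the conservative-prefetching trade-off named in the slide. */
#define PF_DIST 16

double sum_with_prefetch(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PF_DIST < n)
            /* rw = 0 (read), locality = 1 (low temporal reuse) */
            __builtin_prefetch(&a[i + PF_DIST], 0, 1);
        s += a[i];
    }
    return s;
}
```

The prefetch is a hint only; correctness never depends on it, which is why a mistuned `PF_DIST` costs performance rather than breaking the program.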

  5. The HiDISC Approach • Observation: • Software prefetching impacts compute performance • PIMs and RAMBUS offer a high-bandwidth memory system, useful for speculative prefetching • Approach: • Add a processor to manage prefetching, hiding its overhead • The compiler explicitly manages the memory hierarchy • The prefetch distance adapts to the program's runtime behavior

  6. What is HiDISC? • A dedicated processor for each level of the memory hierarchy • Explicitly manage each level of the memory hierarchy using instructions generated by the compiler • Hide memory latency by converting data access predictability to data access locality (just-in-time fetch) • Exploit instruction-level parallelism without extensive scheduling hardware • Zero-overhead prefetches for maximal computation throughput [Figure: the compiler splits the program into computation instructions for the Computation Processor (CP), access instructions for the Access Processor (AP), and cache management instructions for the Cache Management Processor (CMP); the CP owns the registers, the AP the cache, and the CMP the 2nd-level cache and main memory.]

  7. Decoupled Architectures [Figure: four pipeline organizations, each backed by registers, a cache, and a 2nd-level cache plus main memory: MIPS (conventional), CAPP and DEAP (decoupled), and HiDISC (new decoupled); issue widths shown are 8, 5, 3, and 2. The decoupled designs link a Computation Processor (CP) to an Access Processor (AP) through store address (SAQ), store data (SDQ), and load (LQ) queues; HiDISC adds a 3-issue Cache Management Processor (CMP) coupled through a Slip Control Queue (SCQ).] DEAP: [Kurian, Hulina, & Coraor '94] PIPE: [Goodman '85] Other decoupled processors: ACRI, ZS-1, WM

  8. Slip Control Queue • The Slip Control Queue (SCQ) adapts dynamically • Late prefetches: prefetched data arrived after the load had been issued • Useful prefetches: prefetched data arrived before the load had been issued

  if (prefetch_buffer_full())
      don't change size of SCQ;
  else if ((2 * late_prefetches) > useful_prefetches)
      increase size of SCQ;
  else
      decrease size of SCQ;
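The adaptation rule on this slide can be written out as a minimal C sketch. The struct fields, bounds, and function name are illustrative assumptions; only the grow/shrink/hold decision logic comes from the slide.

```c
/* Minimal sketch of the SCQ sizing rule: grow the slip distance when
 * late prefetches dominate (prefetching is not far enough ahead),
 * shrink it otherwise, and hold when the prefetch buffer is full. */
typedef struct {
    int size;      /* current slip (prefetch) distance            */
    int max_size;  /* capacity of the prefetch buffer (assumed)   */
} scq_t;

void scq_adapt(scq_t *q, int late_prefetches, int useful_prefetches) {
    if (q->size >= q->max_size)   /* prefetch buffer full */
        return;                   /* don't change the size */
    if (2 * late_prefetches > useful_prefetches)
        q->size++;                /* increase size of SCQ */
    else if (q->size > 1)
        q->size--;                /* decrease size of SCQ */
}
```

The factor of 2 biases the heuristic toward growing: a late prefetch stalls the CP, whereas an overly long slip distance merely risks eviction.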

  9. Decoupling Programs for HiDISC (Discrete Convolution - Inner Loop)

  Inner Loop Convolution:
  for (j = 0; j < i; ++j)
      y[i] = y[i] + (x[j] * h[i-j-1]);

  Computation Processor Code:
  while (not EOD)
      y = y + (x * h);
  send y to SDQ

  Access Processor Code:
  for (j = 0; j < i; ++j) {
      load (x[j]);
      load (h[i-j-1]);
      GET_SCQ;
  }
  send (EOD token)
  send address of y[i] to SAQ

  Cache Management Code:
  for (j = 0; j < i; ++j) {
      prefetch (x[j]);
      prefetch (h[i-j-1]);
      PUT_SCQ;
  }

  SAQ: Store Address Queue, SDQ: Store Data Queue, SCQ: Slip Control Queue, EOD: End of Data
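For reference, the undecoupled inner loop that the three streams above are derived from is plain C. This is a direct transcription of the slide's "Inner Loop Convolution"; the function wrapper is added for illustration.

```c
/* Sequential form of the discrete-convolution inner loop that HiDISC
 * splits into CP, AP, and CMP streams: accumulates x[j]*h[i-j-1]
 * into y[i] for j = 0 .. i-1. */
void convolve_inner(double *y, const double *x, const double *h, int i) {
    for (int j = 0; j < i; ++j)
        y[i] = y[i] + (x[j] * h[i - j - 1]);
}
```

In the decoupled version, every `x[j]` and `h[i-j-1]` reference becomes a load in the AP stream and a prefetch in the CMP stream, while the CP stream consumes operands from the load queue.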

  10. Where We Were • HiDISC compiler • Frontend selection (Gcc) • Single thread running without conditionals • Hand compiling of benchmarks • Livermore loops, Tomcatv, MXM, Cholsky, Vpenta and Qsort

  11. Benchmarks

  Benchmark   Source of Benchmark    Lines of     Description                     Data Set
                                     Source Code                                  Size
  LLL1        Livermore Loops [45]   20           1024-element arrays,            24 KB
                                                  100 iterations
  LLL2        Livermore Loops        24           1024-element arrays,            16 KB
                                                  100 iterations
  LLL3        Livermore Loops        18           1024-element arrays,            16 KB
                                                  100 iterations
  LLL4        Livermore Loops        25           1024-element arrays,            16 KB
                                                  100 iterations
  LLL5        Livermore Loops        17           1024-element arrays,            24 KB
                                                  100 iterations
  Tomcatv     SPECfp95 [68]          190          33x33-element matrices,         <64 KB
                                                  5 iterations
  MXM         NAS kernels [5]        113          Unrolled matrix multiply,       448 KB
                                                  2 iterations
  CHOLSKY     NAS kernels            156          Cholsky matrix decomposition    724 KB
  VPENTA      NAS kernels            199          Invert three pentadiagonals     128 KB
                                                  simultaneously
  Qsort       Quicksort [14]         58           Quicksort sorting algorithm     128 KB

  12. Simulation Parameters

  13. Simulation Results [Figure: four plots of speedup versus main memory latency (0 to 200 cycles) for LLL3, Tomcatv, Vpenta, and Cholsky, each comparing the MIPS, DEAP, CAPP, and HiDISC configurations.]

  14. Accomplishments • 2x speedup for scientific benchmarks with large data sets over an in-order superscalar processor • 7.4x speedup for matrix multiply (MXM) over an in-order issue superscalar processor (similar operations are used in ATR/SLD) • 2.6x speedup for matrix decomposition/substitution (Cholsky) over an in-order issue superscalar processor • Reduced memory latency for systems that have high memory bandwidths (e.g. PIMs, RAMBUS) • Allows the compiler to solve indexing functions for irregular applications • Reduced system cost for high-throughput scientific codes

  15. Work in Progress • Silicon Space for VLSI Layout • Compiler Progress • Simulator Integration • Hand-compiling of Benchmarks • Architectural Enhancement for Data Intensive Applications

  16. VLSI Layout Overhead • Goal: evaluate the layout effectiveness of the HiDISC architecture • Cache has become a major portion of the chip area • Methodology: extrapolate the HiDISC VLSI layout from the MIPS R10000 processor (0.35 μm, 1996) • The space overhead is 11.3% over a comparable MIPS processor

  17. VLSI Layout Overhead

  18. Compiler Progress • Preprocessor: support for library calls • Gnu pre-processor: support for special compiler directives (#include, #define) • Nested loops • Nested loops without data dependencies (for and while) • Support for conditional statements where the index variable of an inner loop is not a function of some outer loop computation • Conditional statements • The CP performs all computations • Need to move the condition evaluation to the AP

  19. Nested Loops (Assembly)

  High Level Code - C:
  for (i=0;i<10;++i)
      for(k=0;k<10;++k)
          ++j;

  Assembly Code:
  # 6   for(i=0;i<10;++i)
        sw   $0, 12($sp)
  $32:
  # 7   for(k=0;k<10;++k)
        sw   $0, 4($sp)
  $33:
  # 8   ++j;
        lw   $15, 8($sp)
        addu $24, $15, 1
        sw   $24, 8($sp)
        lw   $25, 4($sp)
        addu $8, $25, 1
        sw   $8, 4($sp)
        blt  $8, 10, $33
        lw   $9, 12($sp)
        addu $10, $9, 1
        sw   $10, 12($sp)
        blt  $10, 10, $32

  20. Nested Loops

  CP Stream:
  # 6   for(i=0;i<10;++i)
  $32:  b_eod loop_i
  # 7   for(k=0;k<10;++k)
  $33:  b_eod loop_k
  # 8   ++j;
        addu $15, LQ, 1
        sw   $15, SDQ
        b    $33
  loop_k:
        b    $32
  loop_i:

  AP Stream:
  # 6   for(i=0;i<10;++i)
        sw   $0, 12($sp)
  $32:
  # 7   for(k=0;k<10;++k)
        sw   $0, 4($sp)
  $33:
  # 8   ++j;
        lw   LQ, 8($sp)
        get  SCQ
        sw   8($sp), SAQ
        lw   $25, 4($sp)
        addu $8, $25, 1
        sw   $8, 4($sp)
        blt  $8, 10, $33
        s_eod
        lw   $9, 12($sp)
        addu $10, $9, 1
        sw   $10, 12($sp)
        blt  $10, 10, $32
        s_eod

  CMP Stream:
  # 6   for(i=0;i<10;++i)
        sw   $0, 12($sp)
  $32:
  # 7   for(k=0;k<10;++k)
        sw   $0, 4($sp)
  $33:
  # 8   ++j;
        pref 8($sp)
        put  SCQ
        lw   $25, 4($sp)
        addu $8, $25, 1
        sw   $8, 4($sp)
        blt  $8, 10, $33
        lw   $9, 12($sp)
        addu $10, $9, 1
        sw   $10, 12($sp)
        blt  $10, 10, $32

  21. HiDISC Compiler • Gcc backend • Use input from the parsing phase to perform loop optimizations • Extend the compiler to target the MIPS4 ISA • Handle loops with dependencies • Nested loops where the computation depends on the indices of more than one loop, e.g. X(i,j) = i*Y(j,i), where i and j are index variables and j is a function of i

  22. HiDISC Stream Separator [Figure: compilation flow. Previous work: the sequential source program is turned into a flow graph, address registers are classified, and instructions are allocated to a computation stream and an access stream. Current work: in the computation stream, conditional statements are fixed, queue accesses are moved into instructions, and loop invariants are moved out of the loop; in the access stream, slip control queue instructions are added, prefetches are substituted for loads, global stores are removed, the SCQ direction is reversed, and global data communication and synchronization are added. Assembly code is then produced for the computation, access, and cache management streams.]
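The stream-allocation step in the flow above can be caricatured in a few lines of C, under the simplifying assumption that an instruction belongs to the access stream exactly when it loads or stores memory (the real separator classifies address registers over a flow graph, which is far more involved). The opcode names and function are illustrative, not the compiler's actual IR.

```c
#include <string.h>

/* Toy classifier: memory-touching opcodes go to the Access Processor
 * stream; everything else stays in the Computation Processor stream.
 * The CMP stream would then be derived from the access stream by
 * turning loads into prefetches, as slide 20 shows. */
const char *classify(const char *opcode) {
    if (strcmp(opcode, "lw") == 0 || strcmp(opcode, "sw") == 0)
        return "access";       /* AP stream */
    return "computation";      /* CP stream */
}
```

Even this toy version makes the key point visible: decoupling is a static, per-instruction partition decided by the compiler, not a run-time dispatch.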

  23. Simulator Integration • Based on the MIPS RISC pipeline architecture (dsim) • Supports the MIPS1 and MIPS2 ISAs • Supports dynamic linking for shared libraries • Loads shared libraries into the simulator • Hand compiling: • Use SGI cc or gcc as the front-end: cc -mips2 -S on the .c source • Split the resulting .s code into three streams • Use SGI cc as the compiler back-end • Modify the three .s files into HiDISC assembly (.hs) • Convert the three .hs files back to .s with hs2s, giving .cp.s, .ap.s, and .cmp.s • Link with cc -mips2 -o against the shared library to produce the .cp, .ap, and .cmp inputs for dsim

  24. DIS Benchmarks • Atlantic Aerospace DIS benchmark suite: • Application-oriented benchmarks • Many defense applications employ large data sets with non-contiguous memory access and no temporal locality • Too large for hand-compiling; wait until the compiler is ready • Requires a linker that can handle multiple object files • Atlantic Aerospace Stressmark suite: • Smaller and more specific procedures • Seven individual data-intensive benchmarks • Directly illustrate particular elements of the DIS problem

  25. Stressmark Suite * DIS Stressmark Suite Version 1.0, Atlantic Aerospace Division

  26. Example of Stressmarks • Pointer Stressmark • Basic idea: repeatedly follow pointers to randomized locations in memory • The memory access pattern is unpredictable • Randomized memory access pattern: • Insufficient temporal and spatial locality for conventional cache architectures • The HiDISC architecture provides lower memory access latency
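A minimal pointer chase in the spirit of the Pointer stressmark makes the unpredictability concrete: each element stores the index of the next hop, so the address of load k+1 is unknown until load k completes. The array layout and function are illustrative, not the stressmark's actual field format.

```c
#include <stddef.h>

/* Follow a chain of data-dependent indices. No spatial locality and,
 * with a randomized chain, no temporal locality either: exactly the
 * pattern that defeats conventional caches and hardware prefetchers. */
size_t chase(const size_t *next, size_t start, int hops) {
    size_t p = start;
    for (int k = 0; k < hops; k++)
        p = next[p];   /* address depends on the previous load */
    return p;
}
```

A decoupled AP can still run ahead here because it executes the same chase earlier in program order, keeping the load queue full while the CP consumes values.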

  27. Decoupling of Pointer Stressmarks

  Inner loop for the next indexing:
  for (i=j+1; i<w; i++) {
      if (field[index+i] > partition) balance++;
  }
  if (balance+high == w/2) break;
  else if (balance+high > w/2) {
      min = partition;
  } else {
      max = partition; high++;
  }

  Computation Processor Code:
  while (not EOD)
      if (field > partition) balance++;
  if (balance+high == w/2) break;
  else if (balance+high > w/2) {
      min = partition;
  } else {
      max = partition; high++;
  }

  Access Processor Code:
  for (i=j+1; i<w; i++) {
      load (field[index+i]);
      GET_SCQ;
  }
  send (EOD token)

  Cache Management Code:
  for (i=j+1; i<w; i++) {
      prefetch (field[index+i]);
      PUT_SCQ;
  }

  28. Stressmarks • Hand-compile the 7 individual benchmarks • Use gcc as the front-end • Manually partition each benchmark into the three instruction streams and insert synchronizing instructions • Evaluate architectural trade-offs • Update simulator characteristics, such as out-of-order issue • Large L2 cache and enhanced main memory systems such as Rambus and DDR

  29. Architectural Enhancement for Data Intensive Applications • Enhanced memory systems such as RAMBUS DRAM and DDR DRAM • Provide high memory bandwidth • Latency does not improve significantly • The decoupled access processor can fully utilize the enhanced memory bandwidth • The access processor keeps more requests in flight • The prefetching mechanism hides memory access latency

  30. Flexi-DISC • Fundamental characteristics: • Inherently highly dynamic at execution time • Dynamically reconfigurable central computational kernel (CK) • Multiple levels of caching and processing around the CK • Adjustable prefetching • Multiple processors on a chip, providing flexible adaptation from multiple to single processors and horizontal sharing of the existing resources

  31. Flexi-DISC • Partitioning of the computation kernel • It can be allocated to different portions of one application or to different applications • The CK requires separation of the next ring to feed it with data • The variety of target applications makes memory accesses unpredictable • Identical processing units for the outer rings • Highly efficient dynamic partitioning of the resources and their run-time allocation can be achieved

  32. Summary • Gcc backend • Use the parse tree to extract the load instructions • Handle loops with dependencies where the index variable of an inner loop is not a function of some outer loop computation • A robust compiler is being designed for experimenting with and analyzing additional benchmarks • Eventually extend it to the DIS benchmarks • Additional architectural enhancements have been introduced to make HiDISC amenable to the DIS benchmarks

  33. [Figure: HiDISC processor overview, as on slide 3.] HiDISC: Hierarchical Decoupled Instruction Set Computer • New Ideas • A dedicated processor for each level of the memory hierarchy • Explicitly manage each level of the memory hierarchy using instructions generated by the compiler • Hide memory latency by converting data access predictability to data access locality • Exploit instruction-level parallelism without extensive scheduling hardware • Zero-overhead prefetches for maximal computation throughput • Impact • 2x speedup for scientific benchmarks with large data sets over an in-order superscalar processor • 7.4x speedup for matrix multiply over an in-order issue superscalar processor • 2.6x speedup for matrix decomposition/substitution over an in-order issue superscalar processor • Reduced memory latency for systems that have high memory bandwidths (e.g. PIMs, RAMBUS) • Allows the compiler to solve indexing functions for irregular applications • Reduced system cost for high-throughput scientific codes • Schedule (start: April 2001) • Defined benchmarks • Completed simulator • Performed instruction-level simulations on hand-compiled benchmarks • Continue simulations of more benchmarks (SAR) • Define HiDISC architecture • Benchmark results • Update simulator • Develop and test a full decoupling compiler • Generate performance statistics and evaluate design
