
HiDISC: A Decoupled Architecture for Applications in Data Intensive Computing


Presentation Transcript


  1. HiDISC: A Decoupled Architecture for Applications in Data Intensive Computing. PIs: Alvin M. Despain and Jean-Luc Gaudiot. DARPA DIS PI Meeting, Santa Fe, NM, March 26-27, 2002

  2. HiDISC: Hierarchical Decoupled Instruction Set Computer
[Diagram: sensor inputs from applications (FLIR, SAR, video, ATR/SLD, scientific) feed a decoupling compiler that targets one HiDISC processor per level of the memory hierarchy (registers, cache, memory), supporting a dynamic database and situational awareness.]
New Ideas
• A dedicated processor for each level of the memory hierarchy
• Explicit management of each level of the memory hierarchy using instructions generated by the compiler
• Hide memory latency by converting data access predictability to data access locality
• Exploit instruction-level parallelism without extensive scheduling hardware
• Zero-overhead prefetches for maximal computation throughput
Impact
• 2x speedup for scientific benchmarks with large data sets over an in-order superscalar processor
• 7.4x speedup for matrix multiply over an in-order issue superscalar processor
• 2.6x speedup for matrix decomposition/substitution over an in-order issue superscalar processor
• Reduced memory latency for systems that have high memory bandwidths (e.g., PIMs, RAMBUS)
• Allows the compiler to solve indexing functions for irregular applications
• Reduced system cost for high-throughput scientific codes
Schedule (April 2001 - March 2002)
• Defined benchmarks
• Completed simulator
• Performed instruction-level simulations on hand-compiled benchmarks
• Continue simulations of more benchmarks (SAR)
• Define HiDISC architecture
• Benchmark result
• Update simulator
• Developed and tested a complete decoupling compiler
• Generated performance statistics and evaluated design

  3. Accomplishments
• Design of the HiDISC model
• Compiler development (operational)
• Simulator design (operational)
• Performance evaluation
  • Three DIS benchmarks (Multidimensional Fourier Transform, Method of Moments, Data Management)
  • Five stressmarks (Pointer, Transitive Closure, Neighborhood, Matrix, Field)
• Results
  • Mostly higher performance (some lower)
  • Across a range of applications
• HiDISC of the future

  4. Outline
• Original Technical Objective
• HiDISC Architecture Review
• Benchmark Results
• Conclusion and Future Work

  5. HiDISC: Hierarchical Decoupled Instruction Set Computer
[Diagram: same system overview as slide 2.]
• Technological trend: memory latency is growing relative to microprocessor speed (roughly 40% per year)
• Problem: some SPEC benchmarks spend more than half of their time stalling [Lebeck and Wood 1994]
• Domain: benchmarks with large data sets: symbolic, signal processing, and scientific programs
• Present solutions: larger caches, prefetching (software and hardware), simultaneous multithreading

  6. Present Solutions (and their limitations)
• Larger caches
  • Slow
  • Work well only if the working set fits the cache and there is temporal locality
  • Cannot be tailored for each application
• Hardware prefetching
  • Behavior based on past and present execution-time behavior
• Software prefetching (see the sketch after this list)
  • Overheads of prefetching must not outweigh the benefits, which forces conservative prefetching
  • Adaptive software prefetching is required to change the prefetch distance at run-time
  • Hard to insert prefetches for irregular access patterns
• Multithreading
  • Solves the throughput problem, not the memory latency problem
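To make the software-prefetching limitation concrete, here is a minimal sketch using GCC's __builtin_prefetch. The fixed PREFETCH_DIST constant is invented for illustration; choosing it well is exactly the parameter an adaptive scheme would have to tune at run-time.

    /* Minimal software-prefetching sketch (GCC's __builtin_prefetch).
     * The fixed prefetch distance is illustrative only. */
    #define PREFETCH_DIST 16

    double sum(const double *a, long n) {
        double s = 0.0;
        for (long i = 0; i < n; i++) {
            if (i + PREFETCH_DIST < n)
                __builtin_prefetch(&a[i + PREFETCH_DIST], 0, 0);  /* read, low temporal locality */
            s += a[i];  /* the prefetch instructions themselves consume issue slots */
        }
        return s;
    }

Every prefetch here occupies an issue slot in the compute loop, which is the overhead HiDISC moves onto a separate processor.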

  7. The HiDISC Approach
• Observation:
  • Software prefetching impacts compute performance
  • PIMs and RAMBUS offer a high-bandwidth memory system, useful for speculative prefetching
• Approach:
  • Add a processor to manage prefetching and hide its overhead
  • Compiler explicitly manages the memory hierarchy
  • Prefetch distance adapts to the program's runtime behavior

  8. What is HiDISC?
[Diagram: Computation Processor (CP, 2-issue) with registers; Access Processor (AP, 3-issue) feeding the CP through the Load Data Queue and the Store Address/Store Data Queues; Cache Management Processor (CMP, 3-issue) managing the L1 cache, with the L2 cache and higher levels beyond; a Slip Control Queue linking AP and CMP. A software analogy of the queues is sketched below.]
• A dedicated processor for each level of the memory hierarchy
• Explicitly manage each level of the memory hierarchy using instructions generated by the compiler
• Hide memory latency by converting data access predictability to data access locality (Just-in-Time Fetch)
• Exploit instruction-level parallelism without extensive scheduling hardware
• Zero-overhead prefetches for maximal computation throughput
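The decoupling rests on these hardware queues. As a rough software analogy (the names and the 32-entry depth are invented for illustration, not taken from the design), each queue behaves like a bounded FIFO whose full and empty states stall the producer or consumer:

    /* Minimal sketch of HiDISC's decoupling queues as bounded FIFOs.
     * Sizes and helper names are illustrative, not from the design. */
    #include <stdint.h>
    #include <stdbool.h>

    #define QUEUE_DEPTH 32

    typedef struct {
        uint64_t slots[QUEUE_DEPTH];
        int head, tail, count;
    } fifo_t;

    static bool fifo_push(fifo_t *q, uint64_t v) {
        if (q->count == QUEUE_DEPTH) return false;   /* producer stalls */
        q->slots[q->tail] = v;
        q->tail = (q->tail + 1) % QUEUE_DEPTH;
        q->count++;
        return true;
    }

    static bool fifo_pop(fifo_t *q, uint64_t *v) {
        if (q->count == 0) return false;             /* consumer stalls */
        *v = q->slots[q->head];
        q->head = (q->head + 1) % QUEUE_DEPTH;
        q->count--;
        return true;
    }

    /* AP pushes loaded values into the LDQ; CP pops them in program order.
     * SAQ carries store addresses (from the AP), SDQ store data (from the
     * CP), and SCQ tokens bound how far the access stream slips ahead. */
    fifo_t ldq, saq, sdq, scq;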

  9. Slip Control Queue
• The Slip Control Queue (SCQ) adapts dynamically
• Late prefetch: prefetched data arrived after the load had been issued
• Useful prefetch: prefetched data arrived before the load had been issued

    if (prefetch_buffer_full())
        ;                                      /* leave the SCQ size unchanged */
    else if (2 * late_prefetches > useful_prefetches)
        scq_size++;                            /* slip further ahead of the CP */
    else
        scq_size--;                            /* rein the access stream back in */

  10. Decoupling Programs for HiDISC (Discrete Convolution, Inner Loop)

Original inner loop:

    for (j = 0; j < i; ++j)
        y[i] = y[i] + (x[j] * h[i-j-1]);

Computation Processor code:

    while (not EOD) {
        y = y + (x * h);          /* operands arrive through the LDQ */
        send y to SDQ;
    }

Access Processor code:

    for (j = 0; j < i; ++j) {
        load(x[j]);
        load(h[i-j-1]);
        GET_SCQ;
    }
    send(EOD token);
    send address of y[i] to SAQ;

Cache Management Processor code:

    for (j = 0; j < i; ++j) {
        prefetch(x[j]);
        prefetch(h[i-j-1]);
        PUT_SCQ;
    }

SAQ: Store Address Queue; SDQ: Store Data Queue; SCQ: Slip Control Queue; EOD: End of Data

  11. General View of the Compiler (HiDISC Compilation Overview)
Source Program → Gcc → Binary Code → Disassembler → Stream Separator → Computation Code / Access Code / Cache Mgmt. Code
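As a hedged sketch of that flow in shell form (gcc and objdump are the real front end and disassembler; stream_separator and its flags are hypothetical stand-ins for the project's tool):

    # Compile, then disassemble the resulting binary
    gcc -O2 -o conv conv.c
    objdump -d conv > conv.asm
    # Hypothetical tool: split the disassembly into the three streams
    stream_separator conv.asm --cp cp.s --ap ap.s --cmp cmp.s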

  12. HiDISC Stream Separator
Sequential Source → Data Dependency Graph → Defining Load/Store Instructions → Instruction Chasing for the Backward Slice → Access Stream / Computation Stream → Insert Prefetch Instructions → Insert Communication Instructions → Access Code / Computation Code / Cache Management Code

  13. Stream Separation: Backward Load/Store Chasing
• Load/store instructions and their "backward slice" are included in the Access stream (see the sketch after this list)
• All remaining instructions are sent to the Computation stream
• Communication between the two streams is also defined at this point
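As a rough worklist sketch of backward load/store chasing (the instruction representation below is invented for illustration; it is not the project's compiler IR):

    /* Mark every load/store and, transitively, every instruction in its
     * backward slice as Access stream; the rest is Computation stream. */
    #include <stdbool.h>

    #define MAX_INSNS 1024

    typedef struct {
        bool is_mem;     /* load or store */
        int  deps[4];    /* indices of insns defining our operands */
        int  ndeps;
    } insn_t;

    void separate_streams(insn_t insns[], int n, bool in_access[]) {
        int work[MAX_INSNS], top = 0;
        bool queued[MAX_INSNS] = { false };
        for (int i = 0; i < n; i++) {
            in_access[i] = false;
            if (insns[i].is_mem) { work[top++] = i; queued[i] = true; }  /* seeds */
        }
        while (top > 0) {
            int i = work[--top];
            in_access[i] = true;              /* part of the Access stream */
            for (int d = 0; d < insns[i].ndeps; d++) {
                int j = insns[i].deps[d];     /* chase definitions backward */
                if (!queued[j]) { work[top++] = j; queued[j] = true; }
            }
        }
    }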

  14. Stream Separation: Creating an Access Stream
• Communication takes place via the various hardware queues
• Insert communication instructions
• CMP instructions are copied from the Access stream, except that load instructions are replaced by prefetch instructions (a minimal sketch follows)
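A minimal sketch of that copy-and-replace step, continuing the invented IR from the previous sketch (the opcode names are illustrative; the GET_SCQ→PUT_SCQ swap mirrors the convolution example on slide 10):

    typedef enum { OP_LOAD, OP_STORE, OP_PREFETCH, OP_ALU,
                   OP_GET_SCQ, OP_PUT_SCQ } op_t;

    /* Derive the CMP stream from the Access stream: same instructions,
     * but loads become prefetches and slip tokens are produced, not consumed. */
    void derive_cmp_stream(const op_t access[], op_t cmp[], int n) {
        for (int i = 0; i < n; i++) {
            if (access[i] == OP_LOAD)         cmp[i] = OP_PREFETCH;  /* fetch into cache only */
            else if (access[i] == OP_GET_SCQ) cmp[i] = OP_PUT_SCQ;   /* produce slip tokens */
            else                              cmp[i] = access[i];
        }
    }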

  15. DIS Benchmarks
• Application-oriented benchmarks
• Main characteristic: large data sets with non-contiguous memory access and little temporal locality

  16. DIS Benchmarks Description

  17. Atlantic Aerospace Stressmark Suite
• Smaller and more specific procedures
• Seven individual data-intensive benchmarks
• Directly illustrate particular elements of the DIS problem
• Fewer computation operations than the DIS benchmarks
• Focus on irregular memory accesses

  18. Stressmark Suite * DIS Stressmark Suite Version 1.0, Atlantic Aerospace Division

  19. Pointer Stressmarks
• Basic idea: repeatedly follow pointers to randomized locations in memory (see the kernel sketched below)
• Memory access pattern is unpredictable (input-data dependent)
• Randomized memory access pattern: insufficient temporal and spatial locality for conventional cache architectures
• The HiDISC architecture should provide lower average memory access latency
• The pointer chasing in the AP can run ahead without being blocked by the CP
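In the spirit of the Pointer stressmark (an illustrative kernel, not the DIS reference code), a chase over a randomized cycle looks like this; the array size and hop count are arbitrary:

    #include <stdlib.h>

    #define N (1 << 20)

    int main(void) {
        size_t *next = malloc(N * sizeof *next);
        if (!next) return 1;
        for (size_t i = 0; i < N; i++) next[i] = i;
        /* Sattolo's algorithm: a random single-cycle permutation, so every
         * access is data-dependent and defeats caches and predictors. */
        for (size_t i = N - 1; i > 0; i--) {
            size_t j = (size_t)rand() % i;
            size_t t = next[i]; next[i] = next[j]; next[j] = t;
        }
        size_t p = 0;
        for (long hops = 0; hops < 16L * N; hops++)
            p = next[p];            /* each load's address depends on the previous load */
        free(next);
        return (int)p;              /* use p so the chase is not optimized away */
    }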

  20. Simulation Parameters (SimpleScalar 3.0, 2000)

  21. Superscalar and HiDISC
• Ability to vary the memory latency
• The superscalar baseline supports out-of-order (OOO) issue with a 16-entry RUU (Register Update Unit) and an 8-entry LSQ (Load/Store Queue); see the configuration sketch below
• The AP and CP each issue instructions in order
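As a hedged sketch, the baseline could be configured with SimpleScalar's sim-outorder roughly as follows; only the RUU and LSQ sizes come from the slide, while the memory latency values and benchmark name are placeholders:

    # Out-of-order superscalar baseline: 16-entry RUU, 8-entry LSQ
    sim-outorder -ruu:size 16 -lsq:size 8 -mem:lat 100 2 benchmark.bin
    # In-order issue variant (as in each HiDISC processor)
    sim-outorder -issue:inorder true -ruu:size 16 -lsq:size 8 benchmark.bin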

  22. DIS Benchmark/Stressmark Results ↑: better with HiDISC ↓: better with Superscalar

  23. DIS Benchmark Results

  24. DIS Benchmark Performance
• The DIS benchmarks perform very well in general with our decoupled processing, for the following reasons:
  • Many long-latency floating-point operations
  • Robust under longer memory latency (e.g., Method of Moments is a more stream-like process)
• In FFT, the HiDISC architecture also suffers from longer memory latencies (due to the data dependencies between the two streams)
• DM is not affected by the longer memory latency in either case

  25. Stressmark Results

  26. Stressmark Results – Good Performance
• Pointer chasing can run far ahead by using the decoupled access stream
  • It does not require computation results from the CP
• Transitive Closure also produces good results
  • Little in the AP depends on the results of the CP, so the AP can run ahead

  27. Some Weak Performance

  28. Performance Bottleneck
• Too much synchronization causes loss of decoupling
• Unbalanced code size between the two streams
• The stressmark suite is highly "access-heavy"
• Application domain for decoupled architectures: a balanced ratio between computation and memory access

  29. Synthesis
• Synchronization degrades performance
  • When the AP's dependency on the CP increases, the slip distance of the AP cannot be increased
• More stream-like applications will benefit from using HiDISC
  • Multithreading support is needed
  • Applications should contain enough computation (about a 1:1 ratio) to hide the memory access latency
• The CMP should be simple
  • It executes redundant operations if the data is already in the cache
  • Cache pollution can occur

  30. Flexi-DISC
• Fundamental characteristic: inherently highly dynamic at execution time
• Dynamically reconfigurable central computational kernel (CK)
• Multiple levels of caching and processing around the CK
  • Adjustable prefetching
• Multiple processors on a chip, providing flexible adaptation from multiple to single processors and horizontal sharing of the existing resources

  31. Flexi-DISC
• Partitioning of the computation kernel
  • It can be allocated to different portions of the application, or to different applications
• The CK requires the next ring out to feed it with data
  • The variety of target applications makes the memory accesses unpredictable
• Identical processing units for the outer rings
  • Highly efficient dynamic partitioning of the resources and their run-time allocation can be achieved
