Abbas Rahimi, Andrea Marongiu, Rajesh K. Gupta, Luca Benini

A Variability-Aware OpenMP Environment for Efficient Execution of Accuracy-Configurable Computation on Shared-FPU Processor Clusters
Abbas Rahimi, Andrea Marongiu, Rajesh K. Gupta, Luca Benini
UC San Diego and University of Bologna
Micrel.deis.unibo.it/MultiTherman
variability.org

Outline
• Introduction and motivation
• Contribution
• Architecture
• OpenMP extensions
  • Programming interface
  • Runtime environment
• Profiling-based approximation control
• Experimental results

Introduction and Motivation
[Figure labels: clock, actual circuit delay, guardband; variation sources: Process, Temperature, VCC droop, Aging]
• Variability in transistor characteristics is a major challenge in nanoscale CMOS:
  • Static variation (process); dynamic variations (temperature fluctuations, supply voltage droops, and device aging)
• To handle variations:
  • Designers use conservative guardbands → loss of operational efficiency
  • Resilient designs impose costly error recovery

Introduction and Motivation
[Figure labels: Error Detection Sequential (EDS); Multiple-Issue Instruction Replay]
• Resilient designs impose costly error recovery
[1] K. A. Bowman, et al., “A 45 nm Resilient Microprocessor Core for Dynamic Variation Tolerance,” IEEE Journal of Solid-State Circuits, 46(1): 194-208, Jan. 2011.

Introduction and Motivation
• Resilient designs impose costly error recovery
• This is especially true for floating-point (FP) pipelined architectures:
  • High latency (up to 32 cycles)
  • Deep pipelines also induce a higher cost of recovery (replay)
• Even more troublesome for FPUs shared among multiple cores

Contribution
[Figure labels: ACCURATE, APPROXIMATE]
Our goal is to reduce the cost of a resilient FP environment, which is dominated by error correction.
• An integrated approach that vertically exposes FPU vulnerability at the programming model level, based on:
  • EDS sensing
  • Runtime components that schedule less vulnerable FPUs first
• Leveraging the inherent tolerance of certain applications to approximation:
  • Programming model extensions to specify approximate blocks
  • Reconfigurable EDS in resilient FPUs
  • Profiling-based technique to achieve controlled approximation

Architecture
Tightly-coupled shared-memory multi-core cluster
• 16x 32-bit RISC cores
• L1 SW-managed Tightly Coupled Data Memory (TCDM)
  • Multi-banked/multi-ported
  • Fast concurrent read access
• Fast logarithmic interconnect
• Shared FPU
  • 32-bit single precision
  • IEEE 754 compliant
[Figure: multi-core architecture block diagram. Labels: CORE 0, I$, MASTER PORT, SLAVE PORT, FPU with EDS and ECU, LOW-LATENCY LOGARITHMIC INTERCONNECT, L2/L3 BRIDGE, SHARED L1 TCDM (BANK 0, BANK 1, BANK N), test-and-set semaphores]

Architecture
Every pipeline block has two dynamically reconfigurable operating modes: (i) accurate and (ii) approximate.
Accurate mode: every pipeline uses
• EDS circuit sensors to detect any timing errors [1]
• The ECU to correct errors using a multiple-issue operation replay mechanism (without changing frequency) [2]
[Figure labels: SLAVE PORT, EDS, ECU, FPU]
[1] K. A. Bowman, et al., “Energy-Efficient and Metastability-Immune Resilient Circuits for Dynamic Variation Tolerance,” IEEE Journal of Solid-State Circuits, 44(1): 49-63, 2009.
[2] K. A. Bowman, et al., “A 45 nm Resilient Microprocessor Core for Dynamic Variation Tolerance,” IEEE Journal of Solid-State Circuits, 46(1): 194-208, Jan. 2011.

Controlled Approximation
• Approximate computation leverages the inherent tolerance of some types of applications to errors, within bounds that are acceptable to the end application
• It is safe not to correct a timing error, and to approximate the associated computation, only if:
  • The error significance is controllable: it stays ≤ a given significance threshold;
  • The error rate is controllable: it stays ≤ a given error rate threshold;
  • There is a region of the program that still produces an acceptable fidelity metric while tolerating the uncorrected, and thus propagated, errors with the above properties.

Accuracy-Configurable Architecture
In the approximate mode:
• The pipeline disables the EDS sensors on the N least significant bits of the fraction, where N is reprogrammable through a memory-mapped register.
• The sign and the exponent bits are always protected by EDS.
• The pipeline therefore ignores any timing error within the N least significant bits of the fraction and saves the recovery cost.
• Switching between modes only disables or enables the error detection circuits on N bits of the fraction → the FP pipeline can efficiently execute interleaved accurate and approximate software blocks.
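The bit-level effect of the approximate mode can be illustrated with a small software model. This is a minimal sketch, not the hardware implementation: the names eds_monitored_mask and error_is_ignored are hypothetical, and it assumes the IEEE 754 single-precision layout (1 sign bit, 8 exponent bits, 23 fraction bits) with N counted from fraction bit 0, so N = 20 leaves bits 0 to 19 unmonitored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FRACTION_BITS 23u /* IEEE 754 single precision: 1 sign, 8 exponent, 23 fraction */

/* Hypothetical helper: bits of a 32-bit FP word that EDS still monitors when
 * the n least significant fraction bits are left unprotected.
 * The sign and exponent bits are always part of the mask. */
static uint32_t eds_monitored_mask(unsigned n)
{
    assert(n <= FRACTION_BITS);
    uint32_t unmonitored = (UINT32_C(1) << n) - 1u; /* bits 0..n-1; empty for n = 0 */
    return (uint32_t)~unmonitored;
}

/* Hypothetical helper: true when a timing error (modeled as the bit difference
 * between the expected and the observed result) touches only unmonitored bits,
 * so that no replay would be triggered. Returns false when there is no error. */
static bool error_is_ignored(uint32_t expected, uint32_t observed, unsigned n)
{
    uint32_t flipped = expected ^ observed;
    return flipped != 0u && (flipped & eds_monitored_mask(n)) == 0u;
}
```

With n = 20, a flip of fraction bit 0 is ignored, while a flip of fraction bit 20 or of the sign bit is still detected and would be corrected by replay.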

Floating-point Pipeline Vulnerability (FPV)
• The FPV metadata is defined as the percentage of cycles in which the EDS sensors report a timing error on the pipeline.
• The ECU dynamically characterizes this per-pipeline metric over a programmable sampling period.
• The characterized FPV of each pipeline is visible to the software through memory-mapped registers.
• This enables the runtime scheduler to perform on-line selection of the best FP pipeline candidates.
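The FPV definition above amounts to a simple ratio over one sampling period. The sketch below only illustrates that arithmetic; the function name fpv_percent and the idea of two raw cycle counters are assumptions, since the slides state only that the ECU characterizes the metric and exposes it through memory-mapped registers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: FPV as a percentage, computed from two counters that
 * are assumed to be collected over one sampling period:
 *   error_cycles   - cycles in which the EDS sensors flagged a timing error
 *   sampled_cycles - total cycles in the sampling period */
static double fpv_percent(uint32_t error_cycles, uint32_t sampled_cycles)
{
    assert(error_cycles <= sampled_cycles);
    if (sampled_cycles == 0u) {
        return 0.0; /* nothing sampled yet: report no observed vulnerability */
    }
    return 100.0 * (double)error_cycles / (double)sampled_cycles;
}
```

For example, 25 erroneous cycles in a 1000-cycle sampling period give an FPV of 2.5%.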

OpenMP Compiler Extension
Directive syntax:

    #pragma omp accurate
        structured-block
    #pragma omp approximate [clause]
        structured-block

where the clause is error_significance_threshold (<value N>).

Code snippet for Gaussian filter utilizing OpenMP variability-aware directives:

    #pragma omp parallel
    {
      #pragma omp accurate
      #pragma omp for
      for (i = K/2; i < (IMG_M - K/2); ++i) {        // iterate over image
        for (j = K/2; j < (IMG_N - K/2); ++j) {
          float sum = 0;
          int ii, jj;
          for (ii = -K/2; ii <= K/2; ++ii) {         // iterate over kernel
            for (jj = -K/2; jj <= K/2; ++jj) {
              float data = in[i+ii][j+jj];
              float coef = coeffs[ii+K/2][jj+K/2];
              float result;
              #pragma omp approximate error_significance_threshold(20)
              {
                result = data * coef;
                sum += result;
              }
            }
          }
          out[i][j] = sum / scale;
        }
      }
    }

Each statement in the approximate block is translated into runtime calls.

For result = data * coef:

    int ID = GOMP_resolve_FP (GOMP_APPROX, GOMP_MUL, 20);  // invokes the runtime FPU scheduler
    GOMP_FP (ID, data, coef, &result);                     // programs the FPU

For sum += result:

    int ID = GOMP_resolve_FP (GOMP_APPROX, GOMP_ADD, 20);  // invokes the runtime FPU scheduler
    GOMP_FP (ID, sum, result, &sum);                       // programs the FPU

Runtime Support and FPV Utilization
The variation-aware scheduler reduces:
• The number of recovery cycles for accurate blocks
  • by favoring utilization of FPUs with a lower FPV → lower error rate and less recovery
• The cost of error correction
  • by deliberately propagating the error toward the application → excluding the recovery (correction) cost

Runtime Support and FPV Utilization
[Figure: scheduler flowchart. Labels: Start point; Accurate / Approximate (Acc. / Appr.); Busy (PR1)?, Busy (PR2)?, …, Busy (PRK)?, …, Busy (PRN)? with Yes/No branches; Allocate PR1, …, Allocate PRN; Configure opmode; End point]
• For every operation type P, a sorted list of the pipelines of type P: FPV (PR1) ≤ … ≤ FPV (PRK) ≤ … ≤ FPV (PRN)
• For approximate computation: FPV (PRK) < error rate threshold
• The scheduler ranks all the individual pipelines based on their FPV. The sorted list is maintained in the shared TCDM.
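One possible reading of the ranking and allocation flow is sketched below. It is an illustration under stated assumptions, not the actual runtime: the types fp_pipeline_t and fp_mode_t and the functions rank_pipelines and allocate_pipeline are hypothetical, the list is a plain array rather than a structure in the shared TCDM, and returning -1 when no pipeline qualifies is a choice made for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

typedef enum { MODE_ACCURATE, MODE_APPROXIMATE } fp_mode_t;

/* Hypothetical software view of one FP pipeline of a given operation type. */
typedef struct {
    int       id;
    double    fpv;   /* vulnerability in percent, as characterized by the ECU */
    bool      busy;
    fp_mode_t mode;
} fp_pipeline_t;

static int compare_fpv(const void *a, const void *b)
{
    const fp_pipeline_t *pa = a, *pb = b;
    return (pa->fpv > pb->fpv) - (pa->fpv < pb->fpv);
}

/* Rank pipelines by ascending FPV, so the least vulnerable one comes first. */
static void rank_pipelines(fp_pipeline_t *list, size_t n)
{
    qsort(list, n, sizeof list[0], compare_fpv);
}

/* Walk the sorted list and take the first free pipeline. For approximate
 * requests the pipeline must also satisfy FPV < error_rate_threshold.
 * Returns the index in the sorted list, or -1 if nothing can be allocated. */
static int allocate_pipeline(fp_pipeline_t *sorted, size_t n,
                             fp_mode_t mode, double error_rate_threshold)
{
    assert(sorted != NULL);
    for (size_t k = 0; k < n; ++k) {
        if (sorted[k].busy) {
            continue; /* Busy(PRk)? yes: try the next candidate */
        }
        if (mode == MODE_APPROXIMATE && sorted[k].fpv >= error_rate_threshold) {
            break;    /* list is sorted, so later pipelines fail the bound too */
        }
        sorted[k].busy = true; /* allocate */
        sorted[k].mode = mode; /* configure opmode */
        return (int)k;
    }
    return -1;
}
```

A caller that receives -1 could retry once a pipeline becomes free; the slides do not specify that behavior, so it is left open here.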

Profiling-based Controlled Approximation
• We analyze how a range of error significances and error rates manifests in the PSNR of two image processing kernels (gauss and sobel).
• In a series of profiling runs, we monotonically increase the error significance by injecting timing errors as random multiple-bit toggling up to a certain bit position.
• We also vary the error rate: {25%, 50%, 100%}.
• For our experiments, we use PSNR ≥ 30dB as the fidelity metric [3].
[3] M. A. Breuer et al., “Intelligible Test Techniques to Support Error Tolerance,” Proc. Asian Test Symp., 2004.
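The injection and fidelity check described above can be modeled in software roughly as follows. This is a sketch under several assumptions: all function names are hypothetical, the random source is a small xorshift generator chosen only for reproducibility, images are taken to be 8-bit (peak value 255), and the 30 dB check uses the equivalent MSE bound, since PSNR = 10·log10(255² / MSE) ≥ 30 dB exactly when MSE ≤ 255² / 1000 = 65.025.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Small deterministic PRNG (xorshift32); the state must be nonzero. */
static uint32_t xorshift32(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return *state = x;
}

/* Hypothetical injector: with probability rate_percent, toggle a random
 * subset of fraction bits 0..max_bit of one single-precision result.
 * Bits above max_bit (including sign and exponent) are never touched. */
static float inject_timing_error(float value, unsigned max_bit,
                                 unsigned rate_percent, uint32_t *rng)
{
    assert(max_bit < 23u && rate_percent <= 100u && *rng != 0u);
    if (xorshift32(rng) % 100u >= rate_percent) {
        return value; /* this operation is error-free */
    }
    uint32_t bits;
    memcpy(&bits, &value, sizeof bits);
    uint32_t window = (UINT32_C(1) << (max_bit + 1u)) - 1u; /* bits 0..max_bit */
    bits ^= xorshift32(rng) & window;
    memcpy(&value, &bits, sizeof bits);
    return value;
}

/* Mean squared error between a reference image and a test image (8-bit). */
static double mse_u8(const uint8_t *ref, const uint8_t *test, size_t n)
{
    assert(n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i) {
        double d = (double)ref[i] - (double)test[i];
        sum += d * d;
    }
    return sum / (double)n;
}

/* PSNR >= 30 dB for 8-bit images is equivalent to MSE <= 255^2 / 10^3. */
static bool meets_psnr_30db(double mse)
{
    return mse <= (255.0 * 255.0) / 1000.0;
}
```

A profiling run would apply inject_timing_error to every FP result inside the annotated region, write the output image, and then test it against the error-free output with mse_u8 and meets_psnr_30db.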

[Figure: profiling results for error rate = 100%]

Error-tolerant Applications
Profiling with annotated approximate region
• For error rates of {100%, 50%, 25%}, if the error lies within bit positions 0 to {20, 21, 22} of the fraction part, these two applications can tolerate the error while delivering a PSNR ≥ 30dB.
• We set:
  • the error rate threshold to 100%
  • the error significance threshold to 20
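Turning a profiling sweep into a significance threshold can be sketched as a scan over the measured PSNR values. The function max_tolerable_bit is hypothetical, and so is the layout of its input: one PSNR value per bit position, where entry b was measured with errors injected up to fraction bit b at a fixed error rate.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: return the highest bit position b such that every
 * profiling run with errors injected up to bits 0..b still met the fidelity
 * target, or -1 if even bit 0 violates it.
 * psnr_db[b] is the PSNR measured when injecting up to fraction bit b. */
static int max_tolerable_bit(const double *psnr_db, size_t n_bits,
                             double min_psnr_db)
{
    assert(psnr_db != NULL);
    int best = -1;
    for (size_t b = 0; b < n_bits; ++b) {
        if (psnr_db[b] < min_psnr_db) {
            break; /* stop at the first violation; higher bits are not trusted */
        }
        best = (int)b;
    }
    return best;
}
```

Running this once per error rate and keeping the smallest result gives a threshold that holds for every profiled rate, which matches the conservative choice of pairing the 100% error rate with the lowest of the three bit positions.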

Experimental Setup
• OpenMP-enabled SystemC-based virtual platform
• Shared FPUs are generated and optimized by FloPoCo
• TSMC 45nm ASIC flow (SS/0.81V/125°C)
  • Synopsys Design Compiler (front-end)
  • Synopsys IC Compiler (back-end)
  • Synopsys PrimeTime VX (static and dynamic variations)
• Variation-induced delays are back-annotated to the SystemC models

Error-tolerant Applications
Execution without approximation directives
• Energy and execution time of RANK scheduling (normalized to round-robin) for the accurate Gaussian and Sobel filters:
  • up to 12% lower energy
  • the maximum timing penalty is less than 1%

Error-tolerant Applications
Execution with approximation directives
[Figure labels: 25%, 23%]
By ignoring the errors within bit positions 0 to 20 of the fraction:
• The shared FPUs consume 4.6μJ for the accurate Sobel program (60x60), while execution of the approximate version of the program reduces the energy to 3.5μJ, achieving 25% energy saving.

Error-intolerant Applications
• Compared to the worst-case design, on average 22% (and up to 28%) energy saving is achieved at a temperature of 125°C, thanks to allocating the FP operations to the appropriate pipelines.
• This saving is consistent (20%-22% on average) across a wide temperature range (∆T=125°C), thanks to the online FPV metadata characterization, which reflects the latest variations.

Conclusion
A vertically integrated approach to reducing the cost of a resilient FP environment, which is dominated by error correction. This is achieved by:
• An integrated approach that vertically exposes FPU vulnerability at the programming model level, based on:
  • EDS sensing
  • Runtime components that schedule less vulnerable FPUs first
• Leveraging the inherent tolerance of certain applications to approximation:
  • Programming model extensions to specify approximate blocks
  • Reconfigurable EDS in resilient FPUs
  • Profiling-based technique to achieve controlled approximation
Experimental results show that our approach achieves significant energy reduction for both accurate and approximate programs, with negligible performance impact.

Comparison with Truffle
Iso-area comparison with Truffle, which uses dual-voltage FPUs and changes the voltage depending on the instruction being executed.
• On average, 20% more energy saving by reducing the conservative voltage for the accurate parts
• 36% more energy saving, as Truffle faces the overhead of switching between modes, which is imposed by the interference of accurate and approximate operations under concurrent execution
