Variable Latency Speculative Addition: A New Paradigm for Arithmetic Circuit Design

Ajay K. Verma, Philip Brisk and Paolo Ienne, Processor Architecture Laboratory (LAP) & Centre for Advanced Digital Systems (CSDA), Ecole Polytechnique Fédérale de Lausanne (EPFL)

Do We Always Need 100% Accuracy? • Yes (√): Ariane 5 explosion, 1996 • Yes (√): Patriot missile failure, 1991 • No (X): Cryptography attacks

Ciphertext-Only Attacks (1 of 2) [Flowchart: Ciphertext → Guess a key → Decryption → Frequency analysis, with No and Yes branches]

Ciphertext-Only Attacks (2 of 2) • Speeding up the decryption process allows • A larger amount of ciphertext to be deciphered • More key guesses • Errors in the decryption of a few blocks will NOT • Affect the character frequencies significantly • Reduce the efficacy of the attack • Extremely fast, almost correct arithmetic components are therefore desirable

Our Contribution • Almost Correct Adder (ACA) • Exponentially faster than the fastest reliable adder • Produces the correct result in 99.99% of cases • Trade-off between delay and accuracy • Variable Latency Speculative Adder (VLSA) • For a processor that allows variable latency instructions • Uses the ACA as a component • Always produces the correct result • Extremely fast in more than 99.99% of cases

Outline • Related work • Main Idea • Limited carry propagation occurs in most cases • Design of the ACA • Delay optimal design with minimal area • Design of the VLSA • Error detection and recovery of ACA • Results • Extension to other arithmetic components • Parallel counters, multipliers etc. • Conclusions

Related Work • Design of optimal adders with respect to different metrics • Delay and area: Ripple carry adder, Carry lookahead adder, Prefix adder etc. • Maximum fanout, wiretrack: Kogge-Stone adder, Brent-Kung adder, Knowles adders • Generation of all Pareto-optimal prefix adders [Liu07] • Probabilistic arithmetic component • Probabilistic arithmetic component to save energy [George06] • Razor: circuit level correction for low power operations [Ernst05] • Error detection and correction due to reduction in power supply voltage [Hegde01] • Asynchronous speculative adder [Nowick96, Nowick97]

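Recurrence for A Typical Adder
[Figure: 16-bit adder with inputs a15 … a0 and b15 … b0 and outputs s15 … s0; each bit position generates, propagates, or kills the incoming carry $c_{i-1}$]
$g_i = a_i b_i$ (generate), $p_i = a_i \oplus b_i$ (propagate), $k_i = \overline{a_i + b_i}$ (kill)
$c_i = \begin{cases} 0 & \text{if } k_i = 1 \\ 1 & \text{if } g_i = 1 \\ c_{i-1} & \text{if } p_i = 1 \end{cases}$
$s_i = a_i \oplus b_i \oplus c_{i-1}$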
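The recurrence can be traced bit by bit in software. The following is a minimal Python sketch, not a hardware description; the function name `ripple_add` and its parameters are chosen here for illustration and do not come from the slides.

```python
def ripple_add(a, b, n=16):
    """Add two n-bit integers with the generate/propagate/kill recurrence.

    Hypothetical helper for illustration. Returns (sum mod 2**n, carry out).
    """
    carry = 0          # c_{-1} = 0: no carry enters bit 0
    total = 0
    for i in range(n):
        ai = (a >> i) & 1
        bi = (b >> i) & 1
        g = ai & bi            # generate
        p = ai ^ bi            # propagate
        k = 1 - (ai | bi)      # kill (NOR of the two input bits)
        total |= (p ^ carry) << i   # s_i = a_i xor b_i xor c_{i-1}
        if k:
            carry = 0
        elif g:
            carry = 1
        # otherwise p == 1 and the incoming carry passes through unchanged
    return total, carry
```

Exactly one of g, p and k is 1 at every bit position, so the three cases of the recurrence are exhaustive.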

Main Idea: Limited Carry Propagation [Figure: example bit positions labeled gen, gen, prop, prop, prop, kill]

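Longest Sequence of Propagates • Longest sequence of propagates • Longest run of 1's in the XOR of the input integers ($A \oplus B$) • Longest run of heads in tossing a coin n times
$T_k$ = expected number of tosses needed to obtain $k$ consecutive heads
$T_k = T_{k-1} + {}$ (average number of steps to advance from $k-1$ to $k$) $= T_{k-1} + \tfrac{1}{2} \cdot 1 + \tfrac{1}{2}(1 + T_k)$
$\Rightarrow T_k = 2T_{k-1} + 2 \Rightarrow T_k = 2^{k+1} - 2$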
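As a quick sanity check of the coin-tossing argument, the sketch below iterates the recurrence $T_k = 2T_{k-1} + 2$ from $T_0 = 0$ and measures the longest propagate run of a concrete input pair. Both function names are hypothetical and chosen for illustration only.

```python
def expected_tosses(k):
    """Expected tosses to see k heads in a row, via T_k = 2*T_{k-1} + 2, T_0 = 0."""
    t = 0
    for _ in range(k):
        t = 2 * t + 2
    return t


def longest_propagate_run(a, b, n):
    """Longest run of 1s in the n-bit XOR of a and b (the propagate signals)."""
    p = (a ^ b) & ((1 << n) - 1)
    best = run = 0
    for i in range(n):
        run = run + 1 if (p >> i) & 1 else 0
        best = max(best, run)
    return best
```

Iterating the recurrence reproduces the closed form $2^{k+1} - 2$, which grows exponentially in $k$; read the other way, the longest run expected within n random bits grows only logarithmically in n.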

Probabilistic Bounds on The Longest Sequence of Propagates
$A_n(x)$ = number of instances of n-bit addition in which the longest sequence of propagates is bounded by $x$
$A_n(x) = \begin{cases} 2^{2n} & \text{if } n \le x \\ \sum_{i=1}^{x+1} 2^{i} A_{n-i}(x) & \text{otherwise} \end{cases}$
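The count can be checked numerically. The sketch below assumes one particular way of writing the recurrence: condition on the position of the first non-propagate bit, so that a prefix of i - 1 propagates followed by one non-propagate contributes $2^{i} A_{n-i}(x)$ (each bit position has two propagate and two non-propagate input combinations). The function names are hypothetical.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_bounded(n, x):
    """Number of n-bit input pairs whose longest propagate run is at most x."""
    if n <= x:
        return 4 ** n                      # every one of the 2^(2n) pairs qualifies
    # i - 1 leading propagates, then one non-propagate, then n - i free bits
    return sum(2 ** i * count_bounded(n - i, x) for i in range(1, x + 2))


def count_brute_force(n, x):
    """Same count by enumerating all input pairs (only feasible for small n)."""
    total = 0
    for a in range(1 << n):
        for b in range(1 << n):
            p, run, best = a ^ b, 0, 0
            for i in range(n):
                run = run + 1 if (p >> i) & 1 else 0
                best = max(best, run)
            total += best <= x
    return total


def probability_bounded(n, x):
    """Fraction of n-bit additions whose longest propagate run is at most x."""
    return count_bounded(n, x) / 4 ** n
```

Calling `probability_bounded(64, x)` for increasing x shows how quickly the fraction of additions with only short propagate chains approaches 1.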

A Primitive Design of ACA (1 of 2)

A Primitive Design of ACA (2 of 2) [Figure: a bank of small adders over sliding 6-bit windows: ADD(A[5, 0], B[5, 0]) → S[0] … S[5], ADD(A[6, 1], B[6, 1]) → S[6], ADD(A[7, 2], B[7, 2]) → S[7], …, ADD(A[19, 14], B[19, 14]) → S[19]] Large area overhead due to the multitude of small adders
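The sliding-window structure can be modeled directly in software. This is a behavioral sketch under stated assumptions: each sum bit i is taken from a small adder over bits i down to i - w + 1 with an assumed carry-in of 0, and the result is returned modulo 2^n. The name `windowed_add` and the defaults n = 20, w = 6 (matching the widths shown on the slide) are illustrative choices.

```python
def windowed_add(a, b, n=20, w=6):
    """Speculative n-bit sum in which bit i only sees the w bits [i, i-w+1].

    Hypothetical behavioral model; one small adder is evaluated per output bit.
    """
    window_mask = (1 << w) - 1
    result = 0
    for i in range(n):
        lo = max(0, i - w + 1)                 # lowest bit of this window
        # Small adder over the window, assuming no carry enters at bit lo.
        part = ((a >> lo) & window_mask) + ((b >> lo) & window_mask)
        result |= ((part >> (i - lo)) & 1) << i
    return result
```

The result matches the exact sum whenever no carry has to travel through w - 1 or more consecutive propagate positions. For instance, with the defaults, `windowed_add(127, 1)` differs from 128 because the carry generated at bit 0 must ripple through six propagate bits.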

Area Overhead in ACA (1 of 2) [Figure: network over inputs a15 … a0 and b15 … b0 producing the group signals (p, g)(15, 0), (p, g)(14, 0), …; axis: bit position]

Area Overhead in ACA (2 of 2) • Step 1: compute the (p, g) for any group of two consecutive bit positions • Step 2: compute the (p, g) for any group of four consecutive bit positions • Final step: combine the computed (p, g)'s to compute the (p, g) for any group of k consecutive bit positions • A slightly more complicated design can be used to further reduce the hardware area
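The doubling steps above can be sketched with bit-parallel integers, where bit i of a word holds the signal for the group that ends at position i. This is an illustrative model, not the authors' circuit: it uses the standard prefix operator (P, G) = (P_hi P_lo, G_hi + P_hi G_lo), builds groups of 1, 2, 4, … positions, and combines those selected by the binary representation of k. The names `window_pg` and `aca_sum` are hypothetical.

```python
def window_pg(a, b, n, k):
    """Group (P, G) over the k positions [i, i-k+1] for every bit i at once.

    Positions below bit 0 are treated as kill, so short windows near the
    least significant end see a carry-in of 0.
    """
    mask = (1 << n) - 1
    span_p, span_g, span = (a ^ b) & mask, (a & b) & mask, 1   # groups of 1
    acc_p, acc_g, acc = mask, 0, 0                             # empty group
    while acc < k:
        if k & span:
            # Append the power-of-two group lying just below the accumulated one.
            lo_p, lo_g = (span_p << acc) & mask, (span_g << acc) & mask
            acc_g |= acc_p & lo_g
            acc_p &= lo_p
            acc += span
        # Doubling step: groups of `span` become groups of 2 * span.
        span_g |= span_p & ((span_g << span) & mask)
        span_p &= (span_p << span) & mask
        span <<= 1
    return acc_p, acc_g


def aca_sum(a, b, n, w):
    """Speculative sum: the carry into bit i comes from the w - 1 bits below it."""
    mask = (1 << n) - 1
    _, carry = window_pg(a, b, n, w - 1)
    return ((a ^ b) ^ (carry << 1)) & mask
```

Every window shares the same power-of-two groups, so the number of combining levels grows with log k rather than with the operand width n.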


Error Detection • Error occurs if there is a long chain of propagates: $ER = \sum_i p_i p_{i+1} \cdots p_{i+k}$ (sum = logical OR, product = logical AND) • Delay of error detection • Higher than the delay of an ACA • Smaller than the delay of a traditional adder • Experimentally 2/3 of the delay of a traditional adder
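The error signal is an OR over all positions of the AND of k + 1 consecutive propagate bits. A minimal bit-parallel sketch follows; the function name `error_flag` and its parameters are chosen for illustration.

```python
def error_flag(a, b, n, k):
    """True if some k + 1 consecutive bit positions all propagate.

    Hypothetical helper: bit i of `chain` ends up as p_i & p_{i+1} & ... & p_{i+k}.
    """
    p = (a ^ b) & ((1 << n) - 1)
    chain = p
    for j in range(1, k + 1):
        chain &= p >> j
    return chain != 0
```

The flag depends only on the propagate signals, not on whether a carry actually enters the chain, so it can be raised for some additions whose speculative result is in fact correct.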

Error Recovery • A significant amount of the ACA computation can be reused to compute the correct sum during error recovery

Variable Latency Speculative Adder
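The overall behavior can be sketched as speculation, detection, and recovery in one function. Several details here are assumptions made for illustration and are not taken from the slides: the name `vlsa_add`, the window width w, the cycle counts (1 for the fast path, 2 when recovery is needed), and the use of an ordinary addition as a stand-in for the recovery logic, which in the actual design reuses part of the ACA computation.

```python
def vlsa_add(a, b, n=64, w=8):
    """Return (sum mod 2**n, cycles); the sum is always exact.

    Hypothetical behavioral model of a variable latency speculative adder.
    """
    mask = (1 << n) - 1
    window_mask = (1 << w) - 1

    # Speculation: bit i is computed from the w-bit window [i, i-w+1] only.
    spec = 0
    for i in range(n):
        lo = max(0, i - w + 1)
        part = ((a >> lo) & window_mask) + ((b >> lo) & window_mask)
        spec |= ((part >> (i - lo)) & 1) << i

    # Detection: flag any chain of w - 1 consecutive propagate positions.
    p = (a ^ b) & mask
    chain = p
    for j in range(1, w - 1):
        chain &= p >> j

    if chain == 0:
        return spec, 1              # fast path: speculative result is exact
    return (a + b) & mask, 2        # recovery (stand-in): one extra cycle assumed
```

For example, `vlsa_add(3, 5, n=8, w=6)` takes the fast path, while `vlsa_add(127, 1, n=8, w=6)` is flagged and falls back to recovery.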

Example of VLSA Computation

Experimental Setup • Input: N (bitwidth) • Designs: traditional fast adder (prefix adder), almost correct adder (ACA), error detection, ACA + error recovery (VLSA) • Logic synthesis: Synopsys Design Compiler (compile_ultra, minimize delay) • Library: Artisan standard cells, UMC 0.18 µm

Results • Average delay of VLSA = 0.70 × delay of traditional adder • Delay of ACA = 0.52 × delay of traditional adder

Conclusions • We have presented an exponentially fast adder that works correctly in more than 99.99% of cases • We have also presented a reliable version of this adder that works correctly in all cases, and • Is extremely fast in more than 99.99% of cases • Has almost the same delay as a traditional adder in the other cases • Extending this approach to other arithmetic components is desirable

Future Work: Can We Have A Fast Almost Correct Counter/Multiplier? [Figure: example bit patterns; Output = path number, Ex[path number] = sum of bits, Var[path number] = high] Since each output bit depends on each input bit equally, one cannot discard some input bits in the computation of an output bit

Future Work: Few Most Significant Bits in Multiplier
1001 0110 × 1101 1001 = 0111 1111 0010 0110
1001 × 1101 = 0111 0101
Even if we ignore the lower half of the bits of the two inputs, the most significant (log n) bits of the output will remain the same with high probability
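The multiplication example can be reproduced in a few lines. In the sketch below, the function names and the choice of comparing the top m = 3 bits (log2 of n = 8) are assumptions made for illustration; the slide does not specify how the comparison is performed.

```python
def top_bits_agree(a, b, n, m):
    """Compare the top m bits of the full product with those obtained from
    the upper halves of the two n-bit inputs only (n assumed even).
    """
    half = n // 2
    full = a * b                               # 2n-bit product
    approx = (a >> half) * (b >> half)         # n-bit product of the upper halves
    return (full >> (2 * n - m)) == (approx >> (n - m))


def agreement_rate(n, m):
    """Fraction of all n-bit input pairs for which the top m bits agree."""
    hits = sum(top_bits_agree(a, b, n, m)
               for a in range(1 << n) for b in range(1 << n))
    return hits / 4 ** n
```

With the slide's operands, `top_bits_agree(0b10010110, 0b11011001, 8, 3)` holds: both products begin with 011. `agreement_rate(8, 3)` enumerates all 8-bit pairs to estimate how often this happens.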
