
A Novel Memory Architecture for Elliptic Curve Cryptography with Parallel Modular Multipliers



  1. A Novel Memory Architecture for Elliptic Curve Cryptography with Parallel Modular Multipliers Ralf Laue, Sorin A. Huss Integrated Circuits and Systems Lab, Computer Science Dept. Technische Universität Darmstadt, Germany {laue|huss}@iss.tu-darmstadt.de December 14th, 2006 FPT 2006, Bangkok

  2. Introduction • Speed-up of today's hardware stems increasingly from parallelization. • Cryptographic implementations should take advantage of this by using parallel algorithm versions. • We begin with a survey of parallelization on different abstraction levels of public key cryptography. • Then, we present a novel parallel memory architecture for elliptic curve cryptography in GF(P). • It allows the execution time to scale with the number of parallel modular multipliers. • A direct memory connection leads to low resource usage.

  3. Overview • Parallelization on Different Abstraction Levels • Novel Memory Architecture • Design Considerations • Proposed Memory Architecture • Experimental Results • Number of Parallel Multipliers • Prototype Implementation • Application to Another EC Arithmetic Algorithm

  4. Parallelization on Different Abstraction Levels • In general, parallelization yields greater benefit on lower levels (as less control logic needs to be duplicated). • Parallelization on higher levels allows further speed-up and offers advantages not available on lower levels. • Parallelization methods on different levels do not exclude each other. [Diagram: stack of abstraction levels: System; Cryptographic Scheme; Discrete Logarithm/Integer Factorization (Point Multiplication/Exponentiation); Elliptic Curve Group (Point Addition and Doubling); Finite Field (Modular Arithmetic); the labels RSA and ECC/HECC indicate which levels apply to which primitive.]

  5. Parallelization on Finite Field Level • Modular multi-word multiplication is the most critical operation. Thus, parallelization on this level is a popular strategy. • The approaches on this level do not exclude each other. • Data-paths of full bit-width: • Allow for linear time complexity at the cost of a proportional increase in resources (e.g., a systolic array). • Usual bit-widths: ECC: >100 bit, RSA: >1000 bit. • Problem: the design is fixed for a maximum bit-width. For smaller word counts resources stay unused; higher word counts may be infeasible. [Diagram: the abstraction-level stack from slide 4; this slide addresses the Finite Field / Modular Arithmetic level.]

  6. Parallelization on Finite Field Level (cont.) • Pipelining • Allows for linear time complexity, too. • More flexible than buses of full bit-width, because the number of pipeline stages may be chosen freely. • Problem: the computed bit-width always corresponds to a multiple of the number of stages (in words). • Resources may still stay unused. • A combined ECC/RSA design allows only pipeline lengths dimensioned for ECC, as those designed for RSA would waste resources and execution time when used with ECC.

  7. Parallelization on Finite Field Level (cont.) • Karatsuba multiplication: • Multiplying two numbers with two words each can be done with three word multiplications (see the sketch below). • Applied recursively, this leads to approximately O(n^1.585). • As recursion is difficult in hardware, Karatsuba is usually applied to multiplications in full bit-width (requiring fewer resources). • Residue Number Systems: • Long numbers are represented relative to a base consisting of multiple smaller moduli that are pairwise relatively prime. The Chinese Remainder Theorem ensures a unique mapping. • Multiplication, addition, and subtraction may be executed in parallel. • Can be interpreted as a special case of buses of full bit-width.
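The following minimal Python sketch (not from the slides; the function name and the word size are illustrative assumptions) shows the two-word Karatsuba step: the product of two 2-word operands is assembled from only three word-level multiplications.

```python
def karatsuba_2word(a, b, word_bits=32):
    """Two-word Karatsuba step: three word multiplications instead of four.
    Illustrative sketch only; 'word_bits' is an assumed word size."""
    mask = (1 << word_bits) - 1
    a0, a1 = a & mask, a >> word_bits          # low/high word of a
    b0, b1 = b & mask, b >> word_bits          # low/high word of b

    low  = a0 * b0                             # word multiplication 1
    high = a1 * b1                             # word multiplication 2
    mid  = (a0 + a1) * (b0 + b1) - low - high  # word multiplication 3

    return low + (mid << word_bits) + (high << (2 * word_bits))


# Quick self-check against the built-in multiplication.
import random
x, y = random.getrandbits(64), random.getrandbits(64)
assert karatsuba_2word(x, y) == x * y
```

Applying the same split recursively to the word multiplications gives the O(n^1.585) bound mentioned above; as stated on the slide, in hardware the scheme is typically used without deep recursion, on operands of full bit-width.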

  8. Parallelization on Elliptic Curve Group Level • EC doubling and addition may be sped up by using multiple modular units in parallel. • Literature suggests a maximum of two or three modular multipliers (data dependencies limit further improvements). • One instance of the remaining modular arithmetic is sufficient, because it is very fast in comparison. • This abstraction level is well-suited for parallelization in SIMD implementations. • Note that this level does not exist for RSA. [Diagram: the abstraction-level stack from slide 4; this slide addresses the Elliptic Curve Group / Point Addition and Doubling level.]

  9. Parallelization on Discrete Logarithm/Integer Factorization Level • Both point multiplication and exponentiation allow the parallel use of two instances of the group operations. • E.g., with the Montgomery ladder (parallel point doubling/addition for ECC; parallel square/multiply for RSA); see the sketch below. • Parallelization on this abstraction level is (in addition to further speed-ups) often used as a countermeasure against side-channel attacks. [Diagram: the abstraction-level stack from slide 4; this slide addresses the Discrete Logarithm / Integer Factorization level.]
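As a hedged illustration of the parallelism claimed here, the following Python sketch shows the Montgomery ladder for modular exponentiation (the RSA-side analogue; function and variable names are assumptions). In every iteration, one squaring and one multiplication are performed whose inputs are both available at the start of the iteration, so the two group operations can be executed on parallel units.

```python
def montgomery_ladder_pow(base, exponent, modulus):
    """Montgomery ladder for modular exponentiation (illustrative sketch).
    Per exponent bit, one multiplication and one squaring are independent
    of each other, so they map naturally onto two parallel units."""
    r0, r1 = 1, base % modulus
    for bit in bin(exponent)[2:]:              # scan exponent bits, MSB first
        if bit == '0':
            r0, r1 = (r0 * r0) % modulus, (r0 * r1) % modulus
        else:
            r0, r1 = (r0 * r1) % modulus, (r1 * r1) % modulus
    return r0


# Self-check against Python's built-in modular exponentiation.
assert montgomery_ladder_pow(7, 123, 1009) == pow(7, 123, 1009)
```

For ECC, the same ladder structure applies with point doubling and addition in place of squaring and multiplication.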

  10. Parallelization on Cryptographic Primitive/System Level • Cryptographic schemes usually use only one point multiplication/exponentiation. • We know of no proposal for parallelization on this level. • Possible scenario: a flexible coprocessor for RSA/ECC. • Parallelization on lower abstraction levels is only possible to a certain degree if unused resources are to be avoided. • Further parallelization may be done on the level of the cryptographic primitive to increase throughput. [Diagram: the abstraction-level stack from slide 4; this slide addresses the Cryptographic Scheme and System levels.]

  11. Overview • Parallelization on Different Abstraction Levels • Novel Memory Architecture • Design Considerations • Proposed Memory Architecture • Experimental Results • Number of Parallel Multipliers • Prototype Implementation • Application to Another EC Arithmetic Algorithm

  12. Design Goals • ECC implementation for GF(P) on FPGAs. • Ability to support different key lengths. • Resource requirements should be relatively low, thus allowing the integration of further functions on the FPGA. • E.g., other cryptographic modules or something unrelated to cryptography. • Thus, minimum execution time was less important than a high utilization of the allocated resources.

  13. Design Decisions • No parallelization on the finite field level • It would lead to unused resources, at least for some key lengths. • Instead, parallelization on the elliptic curve group level • It depends only on data dependencies and is independent of the key length. • Modular multiplication is more complex and time consuming than the remaining modular operations. • The chosen architecture consists of multiple modular multipliers operating in parallel to each other, with the module for the remaining modular arithmetic running in parallel to the multipliers.

  14. Conventional Memory Architecture • The memory architecture must allow all operations to be continuously supplied with data. • The conventional memory architecture consists of one memory and modules with input and output registers. • The registers take up FPGA resources, but contain only redundant data copied from memory. [Diagram: one central RAM connected to the modules Mult 1 … Mult n, ALU, and Square through their input/output registers.]

  15. Novel Memory Architecture • Each modular multiplier is assigned its own memory block via a direct connection. • This supports a continuous data supply. • Low general resource usage, slightly increased memory usage. • The remaining modular arithmetic may access the memory blocks via the second port. • The execution time scales with the number of modular multipliers. • The modular arithmetic copies data between the local memory blocks, as the multipliers can only access “their” memory block. • This does not hinder scalability, as the remaining modular arithmetic can access all memory blocks in parallel. (A behavioural sketch of this scheme follows below.)
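To make the data-flow restrictions concrete, here is a small behavioural model in Python (a hypothetical sketch, not the authors' HDL design; all class and method names are invented for illustration): each multiplier owns one memory block and may only touch that block, while the shared modular-arithmetic unit reaches every block through the second BRAM port and performs the copies.

```python
class ParallelMultiplierMemory:
    """Behavioural sketch of the proposed memory scheme (hypothetical model,
    not the authors' implementation): one local memory block per modular
    multiplier, plus a shared arithmetic unit that can reach every block."""

    def __init__(self, num_multipliers, modulus):
        self.p = modulus
        # One local block (variable name -> value) per modular multiplier.
        self.blocks = [dict() for _ in range(num_multipliers)]

    def mod_mult(self, unit, dst, src_a, src_b):
        """A multiplier reads and writes only its own memory block."""
        blk = self.blocks[unit]
        blk[dst] = (blk[src_a] * blk[src_b]) % self.p

    def mod_add(self, unit, dst, src_a, src_b):
        """The shared modular-arithmetic unit may target any block
        (second BRAM port), in parallel to the running multiplications."""
        blk = self.blocks[unit]
        blk[dst] = (blk[src_a] + blk[src_b]) % self.p

    def copy(self, src_unit, dst_unit, name):
        """Copy a value between local blocks; done by the shared unit,
        since the multipliers cannot see each other's memory."""
        self.blocks[dst_unit][name] = self.blocks[src_unit][name]


# Tiny usage example: compute x*y on multiplier 0, then hand the product
# over to multiplier 1 for a follow-up multiplication.
mem = ParallelMultiplierMemory(num_multipliers=2, modulus=23)
mem.blocks[0].update({"x": 5, "y": 7})
mem.blocks[1].update({"z": 11})
mem.mod_mult(0, "t", "x", "y")     # t = 5*7 mod 23 = 12, stays in block 0
mem.copy(0, 1, "t")                # shared unit moves t into block 1
mem.mod_mult(1, "u", "t", "z")     # u = 12*11 mod 23 = 17
assert mem.blocks[1]["u"] == 17
```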

  16. Novel Memory Architecture (cont.) • Usual memory blocks lack a third port. • The cryptographic primitive and the modular arithmetic share the second memory port. • The cryptographic primitive accesses it only while no computation is executed. • Else: access from the modular arithmetic. • The elliptic curve arithmetic does not directly access the data, but only indirectly via the modular arithmetic. [Diagram: the Cryptographic Primitive issues commands to the Elliptic Curve Arithmetic, which commands the Modular Arithmetic (status/busy signals); a MUX shares the data path to the second port of the BRAM blocks, while each BRAM is directly connected to its own ModMult.]

  17. Overview • Parallelization on Different Abstraction Levels • Novel Memory Architecture • Design Considerations • Proposed Memory Architecture • Experimental Results • Number of Parallel Multipliers • Prototype Implementation • Application to Another EC Arithmetic Algorithm

  18. Number of Parallel Multipliers • Determine the number of multipliers to be used (IEEE 1363): • ECDbl can utilize only two parallel modular multipliers because of data dependencies. • Utilization of the modular multipliers for ECAdd (16 multiplications). • The table highlights the scalability. • (#multipliers * #consecutive multiplications) is the smallest multiple of the number of multipliers that is larger than or equal to the overall number of multiplications, i.e., each multiplier executes ceil(16 / #multipliers) consecutive multiplications.

  19. Data Flow Graph ECAdd, IEEE • Consecutive multiplications are always executed on the same multiplier. • No copying between memory blocks is needed. • Dark and light grey multiplications are executed on different modular multipliers. • The longest path contains 5 modular multiplications. • Thus, no speed-up is possible by using more than 4 multipliers (see the small calculation below).
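A small sketch of the counting argument from the last two slides (a hypothetical helper, not code from the paper; the totals of 16 multiplications and a longest path of 5 are those given above): the number of multiplication time slots is the ceiling of 16 divided by the number of multipliers, but it can never drop below the longest dependency chain.

```python
from math import ceil

def ecadd_mult_slots(num_multipliers, total_mults=16, longest_path=5):
    """Multiplication time slots for IEEE 1363 ECAdd (illustrative helper):
    bounded below both by the work per multiplier and by the longest chain
    of dependent multiplications in the data flow graph."""
    return max(ceil(total_mults / num_multipliers), longest_path)

for m in range(1, 7):
    print(m, ecadd_mult_slots(m))
# Prints: 1 16, 2 8, 3 6, 4 5, 5 5, 6 5 -> no gain beyond four multipliers.
```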

  20. Schedule ECAdd, IEEE • Schedule for two modular multipliers. • Mapping to the multipliers as shown in the data flow graph on the last slide. [Schedule chart: the modular operations of ECAdd assigned to the three rows ModArith, ModMultB, and ModMultA.]

  21. Prototype Implementation - Results • Taking its smaller resource usage into account, the execution time of our solution is comparable to that of previous work. • However, because of their high resource usage, none of the previous designs fulfills the given requirements. • Reference [5] uses GF(2^m) as the finite field, so its execution time is not comparable. Its memory architecture is similar, but it is not easily applicable to GF(P) and does not scale as well.

  22. Application to Alternative EC Arithmetic • Application of our memory architecture to an algorithm for atomic point doubling and addition. • The algorithm consists of more modular multiplications, thus allowing a better utilization of more modular multipliers. • Our architecture allows the parallel execution of modular additions. • With three multipliers, the atomic algorithm is faster than IEEE point addition with only two parallel multipliers.

  23. Schedule for Atomic ECAdd&Dbl • Schedule for three modular multipliers. [Schedule chart: the modular additions and subtractions assigned to ModArith, and the modular multiplications distributed over the three rows ModMultA, ModMultB, and ModMultC.]

  24. Conclusions • The novel memory architecture for ECC implementations over GF(P) on FPGAs features the following advantages: • Low register usage, because of the direct memory connections. • The execution time scales with the number of modular multipliers, as long as the data dependencies allow this. • The remaining modular arithmetic is executed in parallel to all the modular multiplications.

  25. Thank you for your attention. • Any questions?

  26. References [5] N. A. Saqib, F. Rodríguez-Henríquez, A. Díaz-Pérez, "A Parallel Architecture for Computing Scalar Multiplication on Hessian Elliptic Curves," in ITCC, vol. 2, 2004, pp. 493-497. [16] A. B. Örs, L. Batina, B. Preneel, J. Vandewalle, "Hardware Implementation of an Elliptic Curve Processor over GF(p)," in ASAP, IEEE Computer Society, 2003, pp. 433-443. [21] W. Fischer, C. Giraud, E. W. Knudsen, "Parallel Scalar Multiplication on General Elliptic Curves over Fp Hedged against Non-Differential Side-Channel Attacks," Jan. 2002. [30] G. Orlando, C. Paar, "A Scalable GF(p) Elliptic Curve Processor Architecture for Programmable Hardware," in CHES, ser. LNCS, vol. 2162, 2001, pp. 348-363.
