Efficient Software Implementation of AES on 32-bit Platforms

Guido Bertoni, Luca Breveglieri

Politecnico di Milano, Milano - Italy

Pasqualina “Lilli” Fragneto

AST-LAB of ST Microelectronics, Agrate B. - Italy

Marco Macchetti, Stefano Marchesin

ALARI - Università della Svizzera Italiana, Lugano - Switzerland


Table of Contents

  • Introduction

  • Short description of AES

  • Optimisation of the algorithm

  • Simulation results

  • Conclusions

CHES 2002 Workshop – Redwood City (SF Bay), CA, USA


Introduction

  • This work presents an efficient software implementation of AES.

  • The optimised software implementation (in C) is oriented to 32-bit platforms with little memory (e.g. embedded systems).

  • Evaluation of the time performances on various platforms: ARM, ST and Pentium.

  • Comparison with the time performances of Gladman’s C code.

The use of look-up tables is limited: only the S-BOX and the inverse S-BOX transformations are tabulated (2 × 256 bytes).
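As an illustration of how small this table footprint is, the following minimal sketch builds the two 256-byte tables at start-up. It assumes the standard S-BOX definition (multiplicative inverse in GF(2^8) followed by the affine transformation of the AES standard); the names gf_mul, rotl8 and build_sboxes are hypothetical and do not come from the presentation, which may well store the tables as constants instead.

```c
#include <stdint.h>

/* Multiply two elements of GF(2^8) modulo x^8 + x^4 + x^3 + x + 1. */
static uint8_t gf_mul(uint8_t a, uint8_t b)
{
    uint8_t p = 0;
    while (b) {
        if (b & 1)
            p ^= a;
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1b : 0x00));
        b >>= 1;
    }
    return p;
}

/* Rotate an 8-bit value left by n bits (1 <= n <= 7). */
static uint8_t rotl8(uint8_t x, unsigned n)
{
    return (uint8_t)((x << n) | (x >> (8 - n)));
}

/* Fill the S-BOX and the inverse S-BOX: 2 x 256 bytes in total. */
void build_sboxes(uint8_t sbox[256], uint8_t inv_sbox[256])
{
    for (unsigned x = 0; x < 256; x++) {
        /* Multiplicative inverse by exhaustive search; 0 maps to 0. */
        uint8_t inv = 0;
        for (unsigned c = 1; c < 256 && x != 0; c++) {
            if (gf_mul((uint8_t)x, (uint8_t)c) == 1) {
                inv = (uint8_t)c;
                break;
            }
        }
        /* Affine transformation of the AES standard. */
        uint8_t s = (uint8_t)(inv ^ rotl8(inv, 1) ^ rotl8(inv, 2) ^
                              rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63);
        sbox[x] = s;
        inv_sbox[s] = (uint8_t)x;
    }
}
```

The exhaustive search is chosen for brevity only; it runs once, so its cost does not affect the per-block timings discussed later.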

Algorithm Description - General

  • Rijndael is the algorithm selected in the NIST competition for AES (Advanced Encryption Standard).

  • It is a block cipher algorithm, operating on blocks of data.

  • It needs a secret key, which is another block of data.

  • It performs encryption and the inverse operation, decryption (using the same secret key).

  • It reads an entire block of data, processes it in rounds and then outputs the encrypted (or decrypted) data.

  • Each round is a sequence of four inner transformations.

  • The AES standard specifies 128-bit data blocks and 128-bit, 192-bit or 256-bit secret keys.

Algorithm Description – Encrypt.

[Figure: encryption algorithm. The plaintext (input data) is processed by ROUND 0 to ROUND 10; the key schedule derives ROUND KEY 0 to ROUND KEY 10 from the secret key, and the output data of ROUND 10 is the encrypted data.]

[Figure: structure of a generic round. SUBBYTES, SHIFTROWS, MIXCOLUMNS, ADDROUNDKEY (with the round key).]


Algorithm Description – Encrypt.

SubBytes

[Figure: each byte of the state array is replaced, one byte at a time, through the S-BOX; for example s5 becomes s'5.]

ShiftRows

[Figure: rotation of the rows of the state array. Row 0 (s0 s4 s8 s12) is unchanged; row 1 is rotated by 1 byte and becomes s5 s9 s13 s1; row 2 is rotated by 2 bytes and becomes s10 s14 s2 s6; row 3 is rotated by 3 bytes and becomes s15 s3 s7 s11.]

Algorithm Description – Encrypt.

MixColumns

[Figure: the new state array s' is the product of the coefficient matrix and the state array s; the products are polynomial multiplications in the field GF(2^8).]

AddRoundKey

[Figure: the new state array s' is the bit-wise XOR of the state array s and the round key k, i.e. s'i = si XOR ki for 0 ≤ i ≤ 15.]

Optimisation – The Idea

  • To improve the time performances of AES, a transposed state array has been used.

state:

    s0   s4   s8   s12
    s1   s5   s9   s13
    s2   s6   s10  s14
    s3   s7   s11  s15

transposed state:

    s0   s1   s2   s3
    s4   s5   s6   s7
    s8   s9   s10  s11
    s12  s13  s14  s15

  • Very simple idea, but yields interesting consequences!
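The transposition can be sketched in C as follows. This is a minimal illustration, not the authors' code: the function names (pack4, load_columns, load_rows, store_rows) are hypothetical, and the byte order inside a 32-bit word (first byte in the least significant position) is an assumption, since the slides do not fix one.

```c
#include <stdint.h>

/* Pack four bytes into a word, first byte in the least significant
   position (assumed convention). */
static uint32_t pack4(uint8_t b0, uint8_t b1, uint8_t b2, uint8_t b3)
{
    return (uint32_t)b0 | ((uint32_t)b1 << 8) |
           ((uint32_t)b2 << 16) | ((uint32_t)b3 << 24);
}

/* Standard layout: word c holds state column c (s[4c] .. s[4c+3]). */
void load_columns(const uint8_t s[16], uint32_t col[4])
{
    for (int c = 0; c < 4; c++)
        col[c] = pack4(s[4 * c], s[4 * c + 1], s[4 * c + 2], s[4 * c + 3]);
}

/* Transposed layout: word r holds state row r (s[r], s[r+4], s[r+8], s[r+12]). */
void load_rows(const uint8_t s[16], uint32_t row[4])
{
    for (int r = 0; r < 4; r++)
        row[r] = pack4(s[r], s[r + 4], s[r + 8], s[r + 12]);
}

/* Inverse of load_rows: write the transposed state back in standard byte order. */
void store_rows(const uint32_t row[4], uint8_t s[16])
{
    for (int r = 0; r < 4; r++)
        for (int c = 0; c < 4; c++)
            s[r + 4 * c] = (uint8_t)(row[r] >> (8 * c));
}
```

In this sketch load_rows and store_rows would be called once per block, on input and on output, so that the external data format stays the standard one.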

Optimisation - Consequences

  • The following round transformations are essentially invariant with respect to transposition (and their speed is unchanged):

    • SubBytes

    • ShiftRows

    • AddRoundKey (but the round keys must be transposed)

  • Instead, the MixColumns transformation must be completely restructured.

  • The new MixColumns is considerably sped up by the transposition of the state.
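A minimal sketch of two of the invariant transformations on the transposed state is given below. It assumes that each 32-bit word holds one state row, with the byte of column c at bit position 8c; under that assumption ShiftRows becomes a rotation of each row word, and AddRoundKey stays a plain XOR with a round key that has been transposed in the same way. The function names are hypothetical.

```c
#include <stdint.h>

/* Rotate a 32-bit word right by n bits (n in {0, 8, 16, 24}). */
static uint32_t rotr32(uint32_t x, unsigned n)
{
    return (x >> n) | (x << ((32u - n) & 31u));
}

/* ShiftRows on the transposed state: row r is rotated by r byte positions. */
void shift_rows_t(uint32_t row[4])
{
    for (unsigned r = 1; r < 4; r++)
        row[r] = rotr32(row[r], 8u * r);
}

/* AddRoundKey on the transposed state; rk must hold the transposed round key. */
void add_round_key_t(uint32_t row[4], const uint32_t rk[4])
{
    for (unsigned r = 0; r < 4; r++)
        row[r] ^= rk[r];
}
```

SubBytes is not shown because it acts on each byte independently and therefore does not depend on how the bytes are grouped into words.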

Old MixColumns

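  • It is a matrix product in GF(2^8), applied to state column number c (0 ≤ c ≤ 3), where s(r,c) denotes the byte in row r of column c:

    s'(0,c) = ({02} • s(0,c)) + ({03} • s(1,c)) + s(2,c) + s(3,c)

    s'(1,c) = s(0,c) + ({02} • s(1,c)) + ({03} • s(2,c)) + s(3,c)

    s'(2,c) = s(0,c) + s(1,c) + ({02} • s(2,c)) + ({03} • s(3,c))

    s'(3,c) = ({03} • s(0,c)) + s(1,c) + s(2,c) + ({02} • s(3,c))

  • In C language a macro is used, where x is a 32-bit word holding one state column:

    #define fwd_mcol(x) (f2 = FFmulX(x), f2 ^ upr(x ^ f2, 3) ^ upr(x, 2) ^ upr(x, 1))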

Old MixColumns - Cost

  • The cost per column is: a single “doubling”, 4 additions (XOR) and 3 rotations (all operations work on 32 bits).

  • For a complete MixColumns transformation, 4 “doublings”, 16 additions (XOR) and 12 rotations are required.

  • A “doubling” consists of 4 multiplications by {02} in GF(2^8), one for each byte of the 32-bit word.
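The cost figures above can be checked against a small sketch of the column-wise approach. The helpers below are assumptions written for illustration: xtime4 plays the role of FFmulX and rotl_bytes the role of upr, with the first byte of the column stored in the least significant position of the word. They are not copied from Gladman's source.

```c
#include <stdint.h>

/* "Doubling": multiply each of the 4 packed bytes by {02} in GF(2^8). */
uint32_t xtime4(uint32_t x)
{
    return ((x & 0x7f7f7f7fu) << 1) ^ (((x & 0x80808080u) >> 7) * 0x1bu);
}

/* Rotate a 32-bit word left by n bytes (1 <= n <= 3). */
static uint32_t rotl_bytes(uint32_t x, unsigned n)
{
    return (x << (8u * n)) | (x >> (32u - 8u * n));
}

/* MixColumns on one column word: 1 doubling, 4 XORs, 3 rotations. */
uint32_t mix_column(uint32_t x)
{
    uint32_t f2 = xtime4(x);
    return f2 ^ rotl_bytes(x ^ f2, 3) ^ rotl_bytes(x, 2) ^ rotl_bytes(x, 1);
}
```

Calling mix_column on each of the four column words gives the totals quoted for a complete transformation; the variable f2 is the intermediate variable of the column-wise version.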

New MixColumns

[Figure: the MixColumns matrix product written for the whole state array; the first row of the result (s'0 s'4 s'8 s'12) is obtained by combining the four state rows.]

Transposition is equivalent to processing the state array by rows, instead of processing it by columns!

New MixColumns

  • The New MixColumns transformation is:

y0 = ({02} • x0) + ({03} • x1) + x2 + x3

y1 = x0 + ({02} • x1) + ({03} • x2) + x3

y2 = x0 + x1 + ({02} • x2) + ({03} • x3)

y3 = ({03} • x0) + x1 + x2 + ({02} • x3)

  • The symbols xi and yi (0 ≤ i ≤ 3) indicate the 32-bit rows of the state array before and after New MixColumns, respectively.

  • The 32-bit word xi accommodates 4 bytes coming from 4 different columns (and similarly for yi).

  • The operation {02} • xi, or “doubling”, consists of 4 multiplications in GF(2^8), one for each byte of the 32-bit word.

New MixColumns

  • The transformation can be executed in three steps.

  • It can be conceived as a sort of “double and add” algorithm.

Step 1:

y0 = x1 + x2 + x3

y1 = x0 + x2 + x3

y2 = x0 + x1 + x3

y3 = x0 + x1 + x2

Step 2:

x0 = {02} • x0

x1 = {02} • x1

x2 = {02} • x2

x3 = {02} • x3

Step 3:

y0 += x0 + x1

y1 += x1 + x2

y2 += x2 + x3

y3 += x3 + x0
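The three steps translate almost line by line into C. The sketch below is an illustration under stated assumptions rather than the authors' implementation: xtime4 is a hypothetical helper that doubles the four packed bytes of a word, and x[i] / y[i] are the 32-bit state rows before and after the transformation (x is overwritten by the doubling step).

```c
#include <stdint.h>

/* "Doubling": multiply each of the 4 packed bytes by {02} in GF(2^8). */
static uint32_t xtime4(uint32_t x)
{
    return ((x & 0x7f7f7f7fu) << 1) ^ (((x & 0x80808080u) >> 7) * 0x1bu);
}

/* MixColumns on the transposed state: 4 doublings, 16 XORs, no rotations. */
void mix_columns_t(uint32_t x[4], uint32_t y[4])
{
    /* Step 1: sum of the three other rows (8 XORs). */
    y[0] = x[1] ^ x[2] ^ x[3];
    y[1] = x[0] ^ x[2] ^ x[3];
    y[2] = x[0] ^ x[1] ^ x[3];
    y[3] = x[0] ^ x[1] ^ x[2];

    /* Step 2: double every row in place (4 doublings). */
    for (int i = 0; i < 4; i++)
        x[i] = xtime4(x[i]);

    /* Step 3: add the doubled rows (8 XORs). */
    y[0] ^= x[0] ^ x[1];
    y[1] ^= x[1] ^ x[2];
    y[2] ^= x[2] ^ x[3];
    y[3] ^= x[3] ^ x[0];
}
```

Each row word carries one byte from each of the four columns, so a single call processes the whole state without moving bytes between positions.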

MixColumns – Cost Comparison

  • The standard implementation of MixColumns requires:

    • 4 “doublings”,

    • 16 XOR’s and 12 rotations,

    • and one intermediate variable

  • The “transposed” version of MixColumns requires:

    • 4 “doublings”,

    • 16 XOR’s and NO rotation,

    • and NO intermediate variable.

  • Software time performances should improve!

Decryption

  • Decryption uses the InvMixColumns transformation – inverse of MixColumns.

  • Also InvMixColumns can be sped-up by the transposition of the state array.

  • Transposition yields a higher speed-up for InvMixColumns than for MixColumns.

  • This is due to the complex structure of the coefficient matrix of InvMixColumns.

  • MixColumns coefficients: 01, 02 and 03 (hex).

  • InvMixColumns coefficients: 09, 0b, 0d and 0e (hex).

Old InvMixColumns

  • The entries of the coefficient matrix of InvMixColumns contain a larger number of 1’s in their binary representation than those of MixColumns.

  • Transposition exposes more parallelism and hence yields a significant speed-up.
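The link between the number of 1’s and the cost can be made concrete with a byte-level sketch of multiplication by a constant coefficient: every 1 bit of the coefficient contributes one XOR term, and higher bit positions need more doublings. The function names below are hypothetical and serve only as an illustration.

```c
#include <stdint.h>

/* Multiply one byte by {02} in GF(2^8) (reduction polynomial 0x11b). */
uint8_t xtime(uint8_t a)
{
    return (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1b : 0x00));
}

/* Multiply a by a coefficient: one XOR term per 1 bit of coeff;
   a is doubled once per bit position scanned. */
uint8_t gf_mul(uint8_t a, uint8_t coeff)
{
    uint8_t p = 0;
    while (coeff) {
        if (coeff & 1)
            p ^= a;
        a = xtime(a);
        coeff >>= 1;
    }
    return p;
}

/* Number of 1 bits in a coefficient. */
unsigned ones(uint8_t coeff)
{
    unsigned n = 0;
    for (; coeff; coeff >>= 1)
        n += coeff & 1u;
    return n;
}
```

For example, 0e (hex) has three 1 bits, so {0e} • a is the XOR of {02} • a, {04} • a and {08} • a, whereas the MixColumns coefficients never need more than one doubling.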

New InvMixColumns

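[Figure: the InvMixColumns matrix product written for the whole state array; the first row of the result (s'0 s'4 s'8 s'12) is obtained by combining the four state rows.]

Reminder:

0e hex = 1110 b

0b hex = 1011 b

0d hex = 1101 b

09 hex = 1001 b

The steps, shown for the first output row y0 (all operations act on the 32-bit rows xi):

y0 = x1 + x2 + x3

x0 = {02} • x0

x1 = {02} • x1

x2 = {02} • x2

x3 = {02} • x3

y0 += x0 + x1

x0 = {02} • (x0 + x2)

x1 = {02} • (x1 + x3)

y0 += x0

x0 = {02} • (x0 + x1)

y0 += x0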
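The steps listed for y0 can be extended to all four rows. The sketch below is one such extension, written so that its operation count matches the 7 doublings and 27 XORs quoted for the transposed algorithm; the updates for y1, y2 and y3 are derived here from the coefficients 0e, 0b, 0d and 09 and are not taken from the slides. xtime4 is a hypothetical helper that doubles the four packed bytes of a word.

```c
#include <stdint.h>

/* "Doubling": multiply each of the 4 packed bytes by {02} in GF(2^8). */
static uint32_t xtime4(uint32_t x)
{
    return ((x & 0x7f7f7f7fu) << 1) ^ (((x & 0x80808080u) >> 7) * 0x1bu);
}

/* InvMixColumns on the transposed state; x is overwritten. */
void inv_mix_columns_t(uint32_t x[4], uint32_t y[4])
{
    /* Terms with weight 1 (8 XORs). */
    y[0] = x[1] ^ x[2] ^ x[3];
    y[1] = x[0] ^ x[2] ^ x[3];
    y[2] = x[0] ^ x[1] ^ x[3];
    y[3] = x[0] ^ x[1] ^ x[2];

    /* Terms with weight 2 (4 doublings, 8 XORs). */
    for (int i = 0; i < 4; i++)
        x[i] = xtime4(x[i]);
    y[0] ^= x[0] ^ x[1];
    y[1] ^= x[1] ^ x[2];
    y[2] ^= x[2] ^ x[3];
    y[3] ^= x[3] ^ x[0];

    /* Terms with weight 4 (2 doublings, 6 XORs). */
    x[0] = xtime4(x[0] ^ x[2]);
    x[1] = xtime4(x[1] ^ x[3]);
    y[0] ^= x[0];
    y[1] ^= x[1];
    y[2] ^= x[0];
    y[3] ^= x[1];

    /* Term with weight 8, common to all rows (1 doubling, 5 XORs). */
    x[0] = xtime4(x[0] ^ x[1]);
    y[0] ^= x[0];
    y[1] ^= x[0];
    y[2] ^= x[0];
    y[3] ^= x[0];
}
```

The saving comes from sharing partial sums: the weight-4 terms are the same for rows 0 and 2 and for rows 1 and 3, and the weight-8 term is the same for all four rows.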

InvMixColumns – Cost Comparison

  • The standard algorithm requires:

    • 12 “doublings”,

    • 32 XOR’s and 12 rotations,

    • and 4 intermediate variables.

  • The “transposed” algorithm requires only:

    • 7 “doublings”,

    • 27 XOR’s and NO rotation,

    • and NO intermediate variable.

  • Software time performances should improve!

  • But time performances should improve in hardware as well!

Time Performances

  • The time performances of the proposed algorithm have been tested on some 32-bit CPU’s:

    • ARM7TDMI and ARM9TDMI, typical microcontrollers

    • ST 22, a CPU designed for smart cards (by ST Microelectronics)

    • and PENTIUM III, a general purpose CPU

  • The time performances are computed in CPU cycles, and are compared with those of Gladman’s C code.

  • Where Gladman is better, it is due to the time overhead required to transpose input and output data, to remain compliant with the standard.

Results (ARM)

Simulations have been executed by means of the ARM Development Suite ADS 1.1.

Results (ST 22 and P III)

ST 22 figures are normalized with respect to Gladman decryption.

Comparisons with Gladman

The comparison is performed by setting to 100 % the time performance of Gladman’s implementation for the corresponding function.

The cases where the transposed version performs better are shown in red.

Conclusions and Further Developments

Conclusions:

  • Study and optimization of AES.

  • Some interesting time performance improvements in software.

  • Part of this work is in the process of being patented.

Further Developments:

  • Hardware implementations.
