FPGA Design Challenges for Realtime A5/1 Attack

Application of FPGA Design: Design Challenges for Implementing Realtime A5/1 Attack with Precomputation Tables
Martin Novotný, Andy Rupp
Ruhr University Bochum

Outline
• A5/1 cipher
• Time-Memory Trade-off Tables
• Original Hellman Approach
• Distinguished points
• TMTO with multiple data
• Rainbow tables
• Thin-rainbow tables
• Architecture of the A5/1 TMTO engine – table generation
• Implementation results

A5/1 Cipher
• Encrypts GSM communication
• GSM communication is organized in frames
  • 1 frame = 114 bits in each direction
• Stream cipher
  • produces the keystream KS, which is XORed with the plaintext P to form the ciphertext C: C = P ⊕ KS
• Example:
  P  = 010010011101010010101
  KS = 101111110100010010101
  C  = 111101101001000000000
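The XOR relation can be checked directly on the example bit strings from this slide. The snippet below is a minimal sketch; the helper name `xor_bits` is hypothetical and not part of the presentation.

```python
def xor_bits(a: str, b: str) -> str:
    """XOR two equal-length bit strings written as '0'/'1' characters."""
    if len(a) != len(b):
        raise ValueError("bit strings must have equal length")
    return "".join("1" if x != y else "0" for x, y in zip(a, b))

P = "010010011101010010101"   # plaintext bits from the slide
KS = "101111110100010010101"  # keystream bits from the slide
C = xor_bits(P, KS)           # encryption: C = P xor KS

print(C)
print(xor_bits(C, KS) == P)   # decryption applies the same keystream
print(xor_bits(P, C) == KS)   # known plaintext reveals the keystream
```

The last line is the property the attack relies on: anyone who knows P and C also knows KS.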

Architecture of A5/1 Cipher
• 3 linear feedback shift registers (LFSRs)
• LFSRs are irregularly clocked
  • a register is clocked iff its clocking bit (yellow in the figure) is equal to the majority of all 3 clocking bits
  • ⇒ at least 2 registers are clocked in each cycle
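The majority rule can be enumerated over all 8 combinations of clocking bits to see why at least two registers always move. This is a small sketch; the function names `majority` and `clocked` are illustrative only.

```python
from itertools import product

def majority(a: int, b: int, c: int) -> int:
    """Majority value of three bits."""
    return 1 if a + b + c >= 2 else 0

def clocked(c1: int, c2: int, c3: int) -> tuple:
    """Tell which of the three registers move, given their clocking bits."""
    m = majority(c1, c2, c3)
    return (c1 == m, c2 == m, c3 == m)

# Enumerate all 8 combinations of clocking bits.
for bits in product((0, 1), repeat=3):
    moves = clocked(*bits)
    print(bits, moves, "registers clocked:", sum(moves))
```

When all three clocking bits agree, all three registers move; otherwise exactly the two that agree move.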

Algorithm of A5/1
• Reset all 3 registers
• (Initialization) Load 64 bits of key K + 22 bits of frame number FN into the 3 registers
  • K and FN are XORed bit-by-bit into the least significant bits
  • registers are clocked regularly
• (Warm-up) Clock for 100 cycles and discard the output
  • registers are clocked irregularly
• (Execution) Clock for 228 cycles, generate 114+114 bits (114 for each direction)
  • registers are clocked irregularly
• Repeat for the next frame
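The phases above can be sketched as a software model. The register lengths (19, 22, 23), feedback taps and clocking-bit positions are not given on the slides; they are taken here from the publicly documented A5/1 design. The order in which key and frame bits are fed in is left to the caller and is an assumption of this sketch.

```python
# (length, feedback taps, clocking bit) per register; bit 0 is the least significant.
# These constants come from the public A5/1 description, not from the slides.
REGS = ((19, (13, 16, 17, 18), 8),
        (22, (20, 21), 10),
        (23, (7, 20, 21, 22), 10))

def shift(reg, length, taps):
    """Clock one LFSR: shift left and insert the XOR of the tap bits at bit 0."""
    fb = 0
    for t in taps:
        fb ^= (reg >> t) & 1
    return ((reg << 1) & ((1 << length) - 1)) | fb

def clock_all(state):
    """Regular clocking: every register moves."""
    return [shift(r, n, taps) for r, (n, taps, _) in zip(state, REGS)]

def clock_majority(state):
    """Irregular clocking: a register moves iff its clocking bit equals the majority."""
    cbits = [(r >> c) & 1 for r, (_, _, c) in zip(state, REGS)]
    maj = 1 if sum(cbits) >= 2 else 0
    return [shift(r, n, taps) if b == maj else r
            for r, b, (n, taps, _) in zip(state, cbits, REGS)]

def output_bit(state):
    """Keystream bit: XOR of the most significant bits of the three registers."""
    out = 0
    for r, (n, _, _) in zip(state, REGS):
        out ^= (r >> (n - 1)) & 1
    return out

def a51_frame(key_bits, frame_bits):
    """Return the two 114-bit keystream halves for one frame."""
    state = [0, 0, 0]                                  # reset
    for bit in list(key_bits) + list(frame_bits):      # initialization
        state = [r ^ bit for r in clock_all(state)]    # XOR into the LSBs
    for _ in range(100):                               # warm-up, output discarded
        state = clock_majority(state)
    ks = []
    for _ in range(228):                               # execution
        state = clock_majority(state)
        ks.append(output_bit(state))
    return ks[:114], ks[114:]

key = [(i * 5 + 1) % 2 for i in range(64)]    # arbitrary demo key bits
frame = [(i >> 1) & 1 for i in range(22)]     # arbitrary demo frame number bits
down, up = a51_frame(key, frame)
print(len(down), len(up))
```

Only the last 228 cycles produce usable output, which is why the attack described later targets the internal state at the start of the execution phase rather than the key itself.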

Cryptanalysis of stream ciphers with known plaintext
• From the ciphertext C and known plaintext P, compute the keystream KS:
  • KS = P ⊕ C
• The keystream KS is a function of:
  • the key K: KS = f(K)
  • the internal state L: KS = g(L)
  • (internal state = content of all registers)
• A5/1 phases:
  • Reset all 3 registers
  • (Initialization) Load 64 bits of key K + 22 bits of frame number FN into the 3 registers
  • (Warm-up) Clock for 100 cycles and discard the output
  • (Execution) Clock for 228 cycles, generate 114+114 bits (114 for each direction)
• Because KS = g(L), we can skip Initialization and Warm-up and attack the internal state directly!

Cryptanalysis of A5/1
• For a (known) keystream KS, find the internal state L
• Once L is found, run the A5/1 cipher backwards through the Warm-up phase and Initialization to get the key K
• A5/1 phases:
  • Reset all 3 registers
  • (Initialization) Load 64 bits of key K + 22 bits of frame number FN into the 3 registers
  • (Warm-up) Clock for 100 cycles and discard the output
  • (Execution) Clock for 228 cycles, generate 114+114 bits (114 for each direction)

Cryptanalysis of A5/1
• The internal state L has 64 bits ⇒ we need (at least) 64 bits of keystream KS
• One A5/1 frame has 114 bits ⇒ we can take several samples KS_i (KS_0, KS_1, KS_2, KS_3, …), each produced from the corresponding internal state L_i (L_0, L_1, L_2, L_3, …)
• Example keystream: 0100111101101010110100101010010100010010100010011110110001
• It is sufficient to find any L_i

Two extreme approaches
Brute force attack
• Check all combinations of a key K online
• time T = N = 2^k
• memory M = 1
Table lookup
• (For a given plaintext P) All key-ciphertext pairs {K_i, C_i} are precomputed and stored (sorted by C)
• Online phase: look up C in the table (and find K)
• time T = 1
• memory M = N = 2^k

Time-Memory Trade-Off (Hellman, 1980)
• A compromise between the two extreme approaches above
• Precomputation phase: for a given plaintext P:
  • precompute (ideally all) key-ciphertext pairs {K_i, C_i};
  • store only some of them in the table.
• Online phase:
  • perform some computations;
  • look up the table and find the key K.
• time T = N^{2/3}
• memory M = N^{2/3}

Precomputation (offline) phase
• Idea: the encryption function E is a pseudo-random function, C = E_K(P)
• Pairs {K_i, C_i} are organized in chains: C_i is used to create the key K_{i+1} for the next step
• E is pseudo-random ⇒ we perform a pseudo-random walk in the keyspace
• R: reduction function (DES: C has 64 bits, K has 56 bits)
• f: step function, f(x) = R(E_x(P))
• The plaintext P is the same in every step
• Chain: SP = K_1 → C_1 → K_2 → C_2 → K_3 → … → K_t → C_t → EP, where each K_i → C_i is an encryption E and each C_i → K_{i+1} is a reduction R (SP = Start Point, EP = End Point)
• Example values shown in the figure: 1234 (start point), 7A3D, 28DF, B05B, 8EC0 (end point)

Precomputation (offline) phase
• m chains with a fixed length t are generated (each arrow is one application of f, i.e. E followed by R):
  SP_1 = k_{1,0} → k_{1,1} → k_{1,2} → … → k_{1,t-1} → k_{1,t} = EP_1    (SP_1 = 1234, EP_1 = 8EC0)
  SP_2 = k_{2,0} → k_{2,1} → k_{2,2} → … → k_{2,t-1} → k_{2,t} = EP_2    (SP_2 = 1235, EP_2 = 2A1B)
  SP_3 = k_{3,0} → k_{3,1} → k_{3,2} → … → k_{3,t-1} → k_{3,t} = EP_3    (SP_3 = 1236, EP_3 = 4D3C)
  …
  SP_m = k_{m,0} → k_{m,1} → k_{m,2} → … → k_{m,t-1} → k_{m,t} = EP_m    (SP_m = 9999, EP_m = 02E3)
• Only the pairs {SP_i, EP_i} are stored (sorted by EP) ⇒ reduced memory requirements
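Chain generation can be sketched with a toy keyspace. Everything concrete below is an assumption made for illustration: the 16-bit keyspace, the SHA-256 stand-in for the cipher E, the truncating reduction R, and the parameter values. A dictionary keyed by end point plays the role of the table sorted by EP.

```python
import hashlib

K_BITS = 16                      # toy keyspace N = 2^16 (assumption)
PLAINTEXT = b"fixed plaintext"   # the same P in every step

def E(key: int) -> bytes:
    """Stand-in for C = E_K(P); a real attack would use the target cipher."""
    return hashlib.sha256(key.to_bytes(4, "big") + PLAINTEXT).digest()

def R(c: bytes) -> int:
    """Reduction function: map a ciphertext back into the keyspace."""
    return int.from_bytes(c[:4], "big") % (1 << K_BITS)

def f(key: int) -> int:
    """Step function f(x) = R(E_x(P))."""
    return R(E(key))

def build_table(m: int, t: int) -> dict:
    """Generate m chains of length t; keep only {EP: [SP, ...]}."""
    table = {}
    for sp in range(m):
        k = sp
        for _ in range(t):
            k = f(k)
        table.setdefault(k, []).append(sp)   # merged chains share an end point
    return table

table = build_table(m=64, t=32)
print(len(table), "distinct end points for 64 chains")
```

Storing a list of start points per end point keeps merged chains usable, at the cost of a little extra memory.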

Online phase
• Given C (and P), try to find K such that C = E_K(P)
• Compute y_1 = R(C) and look it up among the end points: y_1 = EP_i?
• If there is no match, compute y_2 = f(y_1) and look up again: y_2 = EP_i?
• If there is still no match, continue with y_3 = f(y_2), y_4 = f(y_3), …, looking up after each step
• When y_j = EP_i for some i, take the stored start point SP_i and apply f repeatedly to regenerate the chain up to the key K that precedes y_1, i.e. the K with E_K(P) = C
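The lookup loop can be sketched on the same kind of toy table. The cipher stand-in, the 16-bit keyspace and the parameters are assumptions for illustration, and the function name `lookup` is hypothetical. A match on an end point is only a candidate: the regenerated key is checked against C, and a mismatch is treated as a false alarm.

```python
import hashlib

K_BITS = 16          # toy keyspace N = 2^16 (assumption)
T = 32               # chain length t (assumption)
M_CHAINS = 64        # number of chains m (assumption)

def E(key: int) -> bytes:
    """Stand-in for C = E_K(P) with a fixed plaintext."""
    return hashlib.sha256(key.to_bytes(4, "big") + b"fixed plaintext").digest()

def R(c: bytes) -> int:
    """Reduction function into the keyspace."""
    return int.from_bytes(c[:4], "big") % (1 << K_BITS)

def f(key: int) -> int:
    """Step function f(x) = R(E_x(P))."""
    return R(E(key))

def build_table(m: int, t: int) -> dict:
    """Precomputation: {EP: [SP, ...]} for m chains of length t."""
    table = {}
    for sp in range(m):
        k = sp
        for _ in range(t):
            k = f(k)
        table.setdefault(k, []).append(sp)
    return table

def lookup(c: bytes, table: dict, t: int):
    """Online phase: return K with E_K(P) = c, or None if c is not covered."""
    y = R(c)                                   # y_1 = R(C)
    for i in range(t):
        for sp in table.get(y, []):            # y_{i+1} = EP?
            k = sp
            for _ in range(t - 1 - i):         # walk from SP to the candidate key
                k = f(k)
            if E(k) == c:
                return k                       # otherwise: false alarm
        y = f(y)                               # next y
    return None

table = build_table(M_CHAINS, T)

secret = 5                                     # pick a key that lies on chain 5
for _ in range(10):
    secret = f(secret)
found = lookup(E(secret), table, T)
print(found == secret)
```

Each iteration of the outer loop costs one table access, which is the TA term that the distinguished-point method later reduces.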

Birthday paradox problem
• m chains of fixed length t are generated
• R is not bijective ⇒ some k_{i,j} collide. Collisions result in chain merges or in cycles within chains
• Matrix stopping rule: Hellman proved that it is not worthwhile to increase
  • the number of chains m or
  • the chain length t
  beyond the point at which m × t^2 = N
  (beyond that point the coverage of the keyspace grows only marginally)

Birthday paradox problem
[Figure: several tables, each holding chains SP_1 … EP_1 through SP_200 … EP_200]
• Matrix stopping rule: m × t^2 = N
• Recommendation: use r tables, each with a different reduction (re-randomization) function R
• Since also N = m·t·r, it follows that r = t
• Hellman recommends m = t = r = N^{1/3}

Hellman TMTO – Complexity
• Precomputation phase
  • Precomputation time PT = m·t·r = N (e.g. 2^60)
  • Memory M = m·r = N^{2/3} (e.g. 2^40)
• Online phase
  • Memory M = N^{2/3}
  • Online time T = t·r = t^2 = N^{2/3} (e.g. 2^40)
  • Table accesses TA = T = N^{2/3} (e.g. 2^40) ⇒ about 34 years at 1 disk access ≈ 1 ms
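The example figures on this slide follow from m = t = r = N^{1/3} with N = 2^60. The short calculation below reproduces them; the 1 ms per disk access is the slide's own estimate, and the variable names are illustrative.

```python
N_BITS = 60
N = 2 ** N_BITS
m = t = r = 2 ** (N_BITS // 3)     # m = t = r = N^(1/3) = 2^20

PT = m * t * r                     # precomputation time
M = m * r                          # stored (SP, EP) pairs
T = t * r                          # online step-function evaluations
TA = T                             # one table access per step

ACCESS_SECONDS = 1e-3              # 1 disk access ~ 1 ms (from the slide)
years = TA * ACCESS_SECONDS / (365.25 * 24 * 3600)

print("PT = 2^%d" % (PT.bit_length() - 1))
print("M  = 2^%d" % (M.bit_length() - 1))
print("T  = 2^%d" % (T.bit_length() - 1))
print("table accesses take about %.1f years" % years)
```

The disk accesses, not the computation, dominate the online phase, which motivates the distinguished-point method.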

Distinguished points (DP) (Rivest, ????)
• Slight modification of the original Hellman method
• Goal: reduce the number of table accesses TA (in Hellman, TA = N^{2/3})
• A distinguished point is a point with a certain property (e.g. the 20 most significant bits are equal to 0)
• Example: 000000000000000000000010101001101100101010010111110010110101

Distinguished Points (DP): Precomputation phase
• Chains are generated until a distinguished point (DP) is reached
  • if the chain exceeds the maximum length t_max, it is discarded and the next chain is generated
  • the chain is also discarded if the DP is reached but the chain is shorter than t_min (to get better coverage)
• Triples {SP_j, EP_j, l_j} are stored, sorted by EP (l_j is the length of the chain):
  SP_1 = k_{1,0} → k_{1,1} → … … … … … … … → k_{1,u} = EP_1    (SP_1 = 1234, EP_1 = 0EC0)
  SP_2 = k_{2,0} → k_{2,1} → … … → k_{2,v} = EP_2    (SP_2 = 1235, EP_2 = 0A1B)
  SP_3 = k_{3,0} → k_{3,1} → … … … … … → k_{3,w} = EP_3    (SP_3 = 1236, EP_3 = 043C)
  …
  SP_m = k_{m,0} → k_{m,1} → … … … → k_{m,z} = EP_m    (SP_m = 9999, EP_m = 02E3)
• Chains have different lengths
• End points are DPs
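The chain-generation rule with t_min and t_max can be sketched as follows. The toy cipher, the 16-bit point space, the 4-bit DP criterion and the length limits are all assumptions chosen so the example runs quickly; the function names are hypothetical.

```python
import hashlib

K_BITS = 16      # toy point space N = 2^16 (assumption)
DP_BITS = 4      # a point is distinguished if its top 4 bits are 0 (assumption)
T_MIN, T_MAX = 4, 64

def f(x: int) -> int:
    """Step function: toy cipher (SHA-256) followed by a truncating reduction."""
    c = hashlib.sha256(x.to_bytes(4, "big") + b"fixed plaintext").digest()
    return int.from_bytes(c[:4], "big") % (1 << K_BITS)

def is_dp(x: int) -> bool:
    """Distinguished-point criterion: the DP_BITS most significant bits are 0."""
    return x >> (K_BITS - DP_BITS) == 0

def dp_chain(sp: int):
    """Return (EP, length) or None if the chain must be discarded."""
    k = sp
    for length in range(1, T_MAX + 1):
        k = f(k)
        if is_dp(k):
            return (k, length) if length >= T_MIN else None   # too short
    return None                                               # exceeded t_max

table = {}                                   # EP -> list of (SP, length)
for sp in range(200):
    result = dp_chain(sp)
    if result is not None:
        ep, length = result
        table.setdefault(ep, []).append((sp, length))

stored = sum(len(v) for v in table.values())
print(stored, "of 200 chains kept")
```

A chain that runs into a cycle never reaches a DP, so it hits t_max and is discarded; this is why stored chains cannot contain loops.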

Distinguished Points (DP): Online phase
• There is 1 distinguished point per chain: the end point
  • End Point = Distinguished Point
• Algorithm:
  • compute y_{i+1} = f(y_i) iteratively until a DP is reached (or the maximum length t_max is exceeded)
  • then look up (just once per table); if t_max is exceeded, do not look up at all
  • on a match (in the figure, y_4 = EP_i = 043C), regenerate the chain from SP_i to find K
• Advantages
  • Table accesses TA = r = N^{1/3} (cf. TA = t·r = N^{2/3} in original Hellman)
  • Chain loops are not possible

TMTO with multiple data (Biryukov & Shamir, 2000)
• Important for stream ciphers: to reveal an internal state L_i of k bits, we need only k bits of the keystream KS_i
• Example keystream: 0100111101101010110100101010010100010010100010011110110001
• Having D data samples of the ciphertext C (or the keystream KS), we have D times more chances to find the key K (or the internal state L)
• ⇒ We calculate only r/D tables ⇒ we reduce the precomputation time PT and the memory M; the online time T and the number of table accesses TA remain unchanged

TMTO with multiple data: A5/1
• 1 frame: 114 bits
• Internal state: 64 bits
• ⇒ 114 - 64 + 1 = 51 data samples from 1 frame (each sample has 64 bits)
• D = 51
• We calculate D times fewer tables (⇒ save memory, save time)
• Example keystream: 0100111101101010110100101010010100010010100010011110110001
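The count of 51 samples comes from sliding a 64-bit window over the 114-bit frame one bit at a time. A minimal sketch, with a made-up frame and a hypothetical helper name:

```python
def keystream_samples(frame_bits: str, state_bits: int = 64) -> list:
    """All overlapping windows of state_bits bits, shifted by one bit each."""
    count = len(frame_bits) - state_bits + 1
    return [frame_bits[i:i + state_bits] for i in range(count)]

frame = "01" * 57                  # placeholder 114-bit keystream (not real data)
samples = keystream_samples(frame)

print(len(frame), "bits ->", len(samples), "samples of", len(samples[0]), "bits")
```

Sample KS_i corresponds to the internal state L_i reached after i clock cycles of the execution phase, so recovering any one of them is enough.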

Rainbow tables (Oechslin, 2003)
• Idea: use a different reduction/re-randomization function R_i in each step of chain generation, hence the step functions are f_1, f_2, f_3, …, f_{t-1}, f_t, where f_i applies E followed by R_i
• Chain: SP_j = k_{j,0} → k_{j,1} → k_{j,2} → … → k_{j,t-1} → k_{j,t} = EP_j, with f_1 as the first step, f_2 as the second, …, f_t as the last
• Online phase:
  • Compute y_1 = R_t(C), compare with the EPs; if no match, then
  • Compute y_2 = f_t(R_{t-1}(C)), compare with the EPs; if no match, then
  • Compute y_3 = f_t(f_{t-1}(R_{t-2}(C))), compare with the EPs; if no match, then
  • …
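The rainbow chain and its online search can be sketched on a toy keyspace. The SHA-256 stand-in for E, the "add the column index" re-randomization R_i, and all parameters are assumptions for illustration only; `lookup` is a hypothetical name.

```python
import hashlib

K_BITS = 16
N = 1 << K_BITS      # toy keyspace (assumption)
T = 32               # chain length t (assumption)

def E(key: int) -> bytes:
    """Stand-in for C = E_K(P) with a fixed plaintext."""
    return hashlib.sha256(key.to_bytes(4, "big") + b"fixed plaintext").digest()

def R(c: bytes, i: int) -> int:
    """Reduction R_i: a different re-randomization for each column i."""
    return (int.from_bytes(c[:4], "big") + i) % N

def f(x: int, i: int) -> int:
    """Step function f_i(x) = R_i(E_x(P))."""
    return R(E(x), i)

def build_rainbow(m: int, t: int) -> dict:
    """One rainbow table: {EP: [SP, ...]} with steps f_1 ... f_t."""
    table = {}
    for sp in range(m):
        k = sp
        for i in range(1, t + 1):
            k = f(k, i)
        table.setdefault(k, []).append(sp)
    return table

def lookup(c: bytes, table: dict, t: int):
    """Try every column s, starting from the last one."""
    for s in range(t, 0, -1):
        y = R(c, s)                        # assume C was reduced by R_s
        for i in range(s + 1, t + 1):      # finish the chain: f_{s+1} ... f_t
            y = f(y, i)
        for sp in table.get(y, []):
            k = sp
            for i in range(1, s):          # regenerate up to column s-1
                k = f(k, i)
            if E(k) == c:
                return k                   # otherwise: false alarm
    return None

table = build_rainbow(m=64, t=T)

secret = 7                                 # a key in column 12 of chain 7
for i in range(1, 13):
    secret = f(secret, i)
print(lookup(E(secret), table, T) == secret)
```

Guessing column s costs t - s step-function evaluations, so the total online work is roughly t^2/2, consistent with the "about ½ of Hellman" figure for single data.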

Rainbow tables
• Just one table (or only a few tables) is generated
  • m = N^{2/3} (t reduction functions are used ⇒ the table can be t times longer)
  • t = N^{1/3}
• Advantages
  • chain loops are impossible
  • point collisions lead to chain merges only if the equal points appear at the same position in the chain
  • online time T is about ½ of the online time of original Hellman (for single data)
  • the number of table accesses is the same as for the Hellman+DP method (for single data)
• Disadvantages
  • Inferior to the Hellman+DP method in the case of multiple data (D > 1): the online time T and the number of table accesses TA are D times greater

Thin-rainbow tables
• A way to cope with rainbow tables when having multiple data
• The sequence of S different reduction functions (step functions f_i) is applied k times periodically in order to create a chain:
  f_1 f_2 f_3 … f_{S-1} f_S (1st period), f_1 f_2 f_3 … f_{S-1} f_S (2nd period), …, f_1 f_2 f_3 … f_{S-1} f_S (k-th period)
• Chain length t = S × k

Thin-rainbow tables + DP (to reduce the number of table accesses TA)
• The DP criterion is checked after each f_S:
  f_1 f_2 f_3 … f_{S-1} f_S [DP?] f_1 f_2 f_3 … f_{S-1} f_S [DP?] … f_1 f_2 f_3 … f_{S-1} f_S [DP?]
  (1st period, 2nd period, …, k-th period)
• We store only chains for which k_min < k < k_max
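A software sketch of this chain rule: apply one full period f_1 … f_S, test the DP criterion only at the end of the period, and keep the chain when the number of periods k satisfies k_min < k < k_max. The toy cipher, the re-randomization, and all parameter values are assumptions for illustration.

```python
import hashlib

K_BITS = 16          # toy state space N = 2^16 (assumption)
DP_BITS = 3          # DP: the top 3 bits are zero (assumption)
S = 4                # step functions per rainbow period (assumption)
K_MIN, K_MAX = 1, 40

def f(x: int, i: int) -> int:
    """Step function f_i: toy cipher followed by re-randomization number i."""
    c = hashlib.sha256(x.to_bytes(4, "big")).digest()
    return (int.from_bytes(c[:4], "big") + i) % (1 << K_BITS)

def is_dp(x: int) -> bool:
    """Distinguished-point criterion on the most significant bits."""
    return x >> (K_BITS - DP_BITS) == 0

def period(x: int) -> int:
    """One rainbow period f_1 f_2 ... f_S."""
    for i in range(1, S + 1):
        x = f(x, i)
    return x

def make_chain(sp: int):
    """Return (EP, k) for a stored chain, or None if it is discarded."""
    x = sp
    for k in range(1, K_MAX):          # k_max is never reached by a stored chain
        x = period(x)
        if is_dp(x):                   # checked after f_S only
            return (x, k) if k > K_MIN else None
    return None

table = {}                             # EP -> list of (SP, k)
for sp in range(100):
    result = make_chain(sp)
    if result is not None:
        ep, k = result
        table.setdefault(ep, []).append((sp, k))

stored = sum(len(v) for v in table.values())
print(stored, "of 100 chains kept")
```

Because the DP test runs once per period rather than once per step, a hardware implementation needs the comparison logic only at the period boundary.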

Candidates for implementation (in case of multiple data, D > 1)
• Hellman + DP
  • DP criterion checked after each step function f
• Thin-rainbow + DP
  • DP criterion checked after f_S only ⇒ simpler HW, better time/area product
• Both have the same precomputation complexity
• Both have comparable online time T and number of table accesses TA

Implementation choices
• Pipeline?
• Array of small computing elements?

Slice – the FPGA's Basic Building Block
• A slice contains two look-up tables (LUTs), each followed by a flip-flop
• A look-up table
  • implements combinational logic (any logic function of 4 variables)
  • is a 16x1 RAM (it holds the truth table)

LUT – configuration choices
• Slice: two look-up tables, each followed by a flip-flop
• A look-up table can be configured as:
  • LUT (function generator)
  • RAM 16x1
  • SRL16 (up to 16-bit shift register)

Implementation choices
• Pipeline?
  • all A5/1 bits would have to be accessible in parallel
  • max. 240 A5/1 units (64x FF/unit) (and no control unit, …)
• Array of small computing elements?
  • LFSRs can be implemented using SRL16 (1 LUT configured as an up to 16-bit shift register)
  • max. 480 A5/1 units (8x SRL16 + 5x FF/unit) (enough FFs for the control unit, …)

A5/1 TMTO basic element
• Calculates one chain
• Two-stroke mode:
  • core #1 generates keystream while core #2 is loaded
  • core #2 generates keystream while core #1 is loaded

A5/1 TMTO basic element
• First, the start point SP_j is loaded into core #1
• Then one rainbow period f_1 f_2 f_3 … f_{S-1} f_S is performed
• In odd steps:
  • core #1 generates keystream
  • the keystream is re-randomized
  • and loaded into core #2 as a new internal state
