Hyperelliptic Curve Coprocessors On a FPGA. 2004. 8. 24. HoWon Kim, ETRI, Korea.


  1. Hyperelliptic Curve Coprocessors On a FPGA 2004. 8. 24. HoWon Kim ETRI, Korea

  2. Contents • Introduction • Design Philosophy for Fast HEC coprocessors • Parallelism • Pipelining • Loop unfolding on inversion operation • Design Methodology • Arithmetic Unit • HECC coprocessor Architecture • Various HECC types : from high performance to low area • Performance Results • Conclusions

  3. Introduction (1/4)

  4. Introduction (2/4) • Group Cardinality • HEC of genus g over Fq • The cardinality of JC(Fq) is bounded by the Hasse-Weil theorem: (√q - 1)^(2g) ≤ #JC(Fq) ≤ (√q + 1)^(2g) • Major implication: group size ≈ (field size)^g • Don't choose genus ≥ 4 (5) because of possible attacks [Frey/Rück, Gaudry, Theriault, …] • Group size vs. field size • Group size of 2^160 (commercial security level) • ECC (g=1): field size = 160 bit • HECC (g=2): field size = 80 bit • HECC (g=3): field size = 56 bit • HECC (g=4): field size = 52 bit
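As a rough numerical check of the group-size rule on this slide, the following Python sketch evaluates the Hasse-Weil interval in bits for a binary field GF(2^n). The function name `hasse_weil_bits` and the use of floating-point logarithms are choices made for this illustration only; they do not come from the presentation.

```python
import math

def hasse_weil_bits(field_bits, genus):
    """Return (low, high) bounds on log2 of #J_C(F_q) for q = 2**field_bits.

    Uses (sqrt(q) - 1)^(2g) <= #J_C(F_q) <= (sqrt(q) + 1)^(2g).
    Floating point is accurate enough here to see the order of magnitude.
    """
    sqrt_q = 2.0 ** (field_bits / 2)
    low = 2 * genus * math.log2(sqrt_q - 1)
    high = 2 * genus * math.log2(sqrt_q + 1)
    return low, high

# Genus 1 (ECC) and genus 2 (HECC) rows from the slide.
for genus, field_bits in [(1, 160), (2, 80)]:
    low, high = hasse_weil_bits(field_bits, genus)
    mid = (low + high) / 2
    print(f"g={genus}, field={field_bits} bit: group size is about 2^{mid:.1f}")
```

The interval is extremely narrow for fields of this size, which is why the slide can treat the group size as (field size)^g.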

  5. Introduction (3/4) • Explicit Formulae of HECC
     Input: D1 = div(a1,b1), D2 = div(a2,b2)
     Output: D3 = D1 + D2 = div(a3,b3)
     Polynomial arithmetic:
       Composition:
         d = gcd(a1, a2, b1+b2+h) = s1*a1 + s2*a2 + s3*(b1+b2+h)
         a'3 = a1*a2 / d^2
         b'3 = [s1*a1*b2 + s2*a2*b1 + s3*(b1*b2+f)] / d mod a'3
       Reduction:
         WHILE deg(a'k) > g, DO
           a'k = (f - b'(k-1)*h - b'(k-1)^2) / a'(k-1)
           b'k = (-h - b'(k-1)) mod a'k
         END WHILE
         a3 = a'k
         b3 = b'k
     Explicit formulae (field arithmetic only), Harley's explicit method:
       t1 = a*e; t2 = b*d; t3 = b*f; t4 = c*e; t5 = a*f; t6 = c*d;
       t7 = sqr(c+f); t8 = sqr(b+e);
       t9 = (a+d)*(t3+t4); t10 = (a+d)*(t5+t6);
       r = (f+c+t1+t2)*(t7+t9) + t10*(t5+t6) + t8*(t3+t4);
       t11 = (b+e)*(c+f);
       inv2 = (t1+t2+c+f)*(a+d) + t8;
       inv1 = inv2*d + t10 + t11;
       inv0 = inv2*e + d*(t10+t11) + t9 + t7;
       t12 = (inv1+inv2)*(k+n+l+o); t13 = (l+o)*inv1;
       t14 = (inv0+inv2)*(k+n+m+p); t15 = (m+p)*inv0;
       t16 = (inv0+inv1)*(l+o+m+p); t17 = (k+n)*inv2;
       rs0 = t15; rs1 = t13+t15+t16; rs2 = t13+t14+t15+t17;
       rs3 = t12+t13+t17; rs4 = t17;
       t18 = rs3 + rs4*d;
       s0s = rs0 + f*t18; s1s = rs1 + rs4*f + e*t18; s2s = rs2 + rs4*e + d*t18;
       w1 = inv(r*s2s); w2 = r*w1; w3 = w1*sqr(s2s); w4 = r*w2; w5 = sqr(w4);
       s0 = w2*s0s; s1 = w2*s1s; s2 = w2*s2s;
       z0 = s0*c; z1 = s1*c + s0*b; z2 = s0*a + s1*b + c;
       z3 = s1*a + s0 + b; z4 = a + s1; z5 = to_GF2E(1L);
       t1 = w4*h2; t2 = w4*h3;
       u3s = d + z4 + s1;
       u2s = d*u3s + e + z3 + s0 + t2 + s1*z4;
       u1s = d*u2s + e*u3s + f + z2 + t1 + s1*(z3+t2) + s0*z4 + w5;
       u0s = d*u1s + e*u2s + f*u3s + z1 + w4*h1 + s1*(z2+t1) + s0*(z3+t2) + w5*(a+f6);
       t1 = u3s + z4;
       v0s = w3*(u0s*t1 + z0) + h0 + m;
       v1s = w3*(u1s*t1 + u0s + z1) + h1 + l;
       v2s = w3*(u2s*t1 + u1s + z2) + h2 + k;
       v3s = w3*(u3s*t1 + u2s + z3) + h3;
       a3 = f6 + u3s + v3s*(v3s+h3);
       b3 = u2s + a3*u3s + f5 + v3s*h2 + v2s*h3;
       c3 = u1s + a3*u2s + b3*u3s + f4 + v2s*(v2s+h2) + v3s*h1 + v1s*h3;
       k3 = v2s + (v3s+h3)*a3 + h2;
       l3 = v1s + (v3s+h3)*b3 + h1;
       m3 = v0s + (v3s+h3)*c3 + h0;
     Explicit formulae: ITCC04 [PWP04] • Group doubling: 1 inv, 9 mults • Group addition: 1 inv, 21 mults

  6. Introduction (4/4) • Pros & Cons of HECC • Pros • Short field size: for genus-2 HECC, the underlying field is half the size of that of ECC • So there is room to adopt high-speed implementation techniques such as parallelism and loop unfolding • Cons • There are many multiplication stages in the explicit formulae • So, when HECC is implemented in hardware, its interconnect network and buffer allocation become complicated • Purpose of this work • To check its applicability as a high-performance public-key cryptosystem • To check its applicability, from a practical point of view, in resource-constrained environments such as PDAs and smart cards

  7. Design Philosophy (1/2) • To make the HECC coprocessor faster, we used the following techniques: • Parallelism • Multiple field operation units execute the explicit formulae as fast as possible • The number of multipliers is decided by drawing the data dependency graph (DDG) of the explicit formulae • For the genus-2 HECC explicit formulae, two multipliers are a good choice for implementation • The usage rate of the two multipliers is about 90 % for the group addition operation in affine coordinates
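The multiplier-count decision can be illustrated with a small list-scheduling sketch in Python. The dependency graph below covers only the eight multiplications t1..t6, t9 and t10 from the explicit formulae on the Introduction (3/4) slide. The greedy scheduler, the one-cycle-per-multiplication cost and all function names are simplifying assumptions made here, so the figures it prints are not the 90 % usage rate reported for the full group addition.

```python
# Each multiplication maps to the earlier multiplications whose results it needs.
DEPS = {
    "t1": set(), "t2": set(), "t3": set(),
    "t4": set(), "t5": set(), "t6": set(),
    "t9": {"t3", "t4"},    # t9  = (a+d)*(t3+t4)
    "t10": {"t5", "t6"},   # t10 = (a+d)*(t5+t6)
}

def schedule(deps, units):
    """Greedy list scheduling: each cycle issues up to `units` ready operations."""
    done = set()
    pending = list(deps)
    cycles = []
    while pending:
        ready = [op for op in pending if deps[op] <= done][:units]
        if not ready:
            raise ValueError("dependency graph contains a cycle")
        cycles.append(ready)
        done.update(ready)
        pending = [op for op in pending if op not in done]
    return cycles

def utilization(cycles, units):
    """Fraction of multiplier slots that perform useful work."""
    busy = sum(len(cycle) for cycle in cycles)
    return busy / (len(cycles) * units)

for units in (1, 2, 3):
    cycles = schedule(DEPS, units)
    print(f"{units} multiplier(s): {len(cycles)} cycles, "
          f"utilization {utilization(cycles, units):.0%}")
```

Running the same procedure over the complete DDG of a group operation is one way to compare candidate multiplier counts before committing to hardware.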

  8. Design Philosophy (2/2) • Pipelining • Field operations (field addition, field squaring) and data copy operations between buffers are performed in the same clock cycle • They can be overlapped with multiplication and inversion • Loop Unfolding • "Loop unfolding" is the process of unrolling a loop so that several iterations (clock cycles) are executed in a single iteration (one clock cycle) • It is applied to the MAIA inversion algorithm to boost performance with a reasonable hardware increase

  9. Fast Inversion Block (1/2) • The MAIA algorithm is implemented with 4 loop iterations unfolded • At most 4 loop iterations are executed in one clock cycle • Can be realized by simple XOR and rewiring

  10. Fast Inversion Block (2/2) • Features of the Inversion Block of the HECC coprocessors • Four loops are unfolded → we get two times better performance
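The effect of loop unfolding on inversion can be sketched in software. The Python code below does not reproduce the MAIA algorithm used in the coprocessor; it substitutes the textbook binary extended Euclidean inversion over GF(2^m), which is also built from shifts and XORs. The toy field GF(2^8) with polynomial 0x11B, the cost model (one elementary step per clock without unfolding, up to `unfold` steps per clock with it) and all names are assumptions for illustration.

```python
def gf_mul(a, b, f, m):
    """Shift-and-add multiplication in GF(2^m) modulo f (used only for checking)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if (a >> m) & 1:
            a ^= f
    return r

def binary_inverse(a, f):
    """Return (a^-1 mod f, number of elementary steps); a must be nonzero."""
    u, v, g1, g2 = a, f, 1, 0
    steps = 0
    # Invariant: g1*a = u and g2*a = v (mod f).
    while u != 1 and v != 1:
        if u & 1 == 0:                       # divide u by x (rewiring)
            u >>= 1
            g1 = g1 >> 1 if g1 & 1 == 0 else (g1 ^ f) >> 1
        elif v & 1 == 0:                     # divide v by x
            v >>= 1
            g2 = g2 >> 1 if g2 & 1 == 0 else (g2 ^ f) >> 1
        elif u.bit_length() > v.bit_length():
            u ^= v                           # field addition is XOR
            g1 ^= g2
        else:
            v ^= u
            g2 ^= g1
        steps += 1
    return (g1 if u == 1 else g2), steps

def clocks(steps, unfold):
    """Clock cycles if up to `unfold` elementary steps fit in one cycle."""
    return -(-steps // unfold)

F, M = 0x11B, 8   # toy field; not the field size used in the presentation
inv, steps = binary_inverse(0x53, F)
assert gf_mul(0x53, inv, F, M) == 1
print(f"steps={steps}, clocks(1)={clocks(steps, 1)}, clocks(4)={clocks(steps, 4)}")
```

In this idealized model four-fold unfolding divides the cycle count by up to four. The slide reports a factor of about two for the real design, which is consistent with unfolded logic lengthening the critical path, although the transcript does not state the cause.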

  11. Design Methodology • Design Methodology • Architecture design → VHDL coding → synthesis & implementation on FPGA • Main points toward a high-performance HECC coprocessor design • Make the H/W complexity of the interconnect network as small as possible • Done by carefully designing the arithmetic units, the data path, etc. • Make the number of registers as small as possible • Done by careful buffer allocation • Make the AUs efficient • By using parallelism, pipelining, loop unfolding techniques, etc.

  12. Arithmetic Unit • AU (Arithmetic Unit) • Field addition : simple XOR (done on the data-path) • Field squaring : XOR and rewiring (done on the data-path) • Field multiplication : scalable, high performance multiplication logic is implemented (digit serial multiplier) • Field inversion : high performance inversion logic is implemented (modified almost inverse algorithm with a loop unfolding technique) • AU Block Diagram
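The four arithmetic-unit operations map onto simple bit manipulations, which the following Python sketch models for addition, squaring and digit-serial multiplication. It is a behavioural model under assumptions, not the VHDL design: the toy field GF(2^8) with polynomial 0x11B, the most-significant-digit-first ordering, and the function names are all chosen for this example. The digit size `D` stands for the number of multiplier bits consumed per clock.

```python
def reduce_poly(x, f, m):
    """Reduce polynomial x modulo f, where f has degree m."""
    for i in range(x.bit_length() - 1, m - 1, -1):
        if (x >> i) & 1:
            x ^= f << (i - m)
    return x

def clmul(a, b):
    """Carry-less (polynomial) product of a and b over GF(2)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def gf_add(a, b):
    """Field addition: bitwise XOR."""
    return a ^ b

def gf_sqr(a, f, m):
    """Field squaring: spread the bits (rewiring), then reduce (XOR)."""
    spread = 0
    for i in range(m):
        if (a >> i) & 1:
            spread |= 1 << (2 * i)
    return reduce_poly(spread, f, m)

def digit_serial_mul(a, b, f, m, D):
    """Multiply a*b mod f, consuming D bits of b per clock. Returns (product, clocks)."""
    n_digits = -(-m // D)
    acc = 0
    for i in reversed(range(n_digits)):
        digit = (b >> (i * D)) & ((1 << D) - 1)
        acc = reduce_poly((acc << D) ^ clmul(a, digit), f, m)
    return acc, n_digits

F, M = 0x11B, 8   # toy field; not the field size used in the presentation
for D in (1, 2, 4, 8):
    product, n_clocks = digit_serial_mul(0x57, 0x83, F, M, D)
    print(f"D={D}: product=0x{product:02X}, clocks={n_clocks}")
```

A larger `D` lowers the cycle count at the cost of more logic per cycle, which is the scalability trade-off behind the digit-serial multiplier.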

  13. HEC Architecture (1/3) • Various HECC Coprocessor Types from High Performance to Moderate Size • Type 1: for high performance • Parallel execution of the group addition & doubling • 2 multipliers & 1 inversion logic for group addition • 1 multiplier & 1 inversion logic for group doubling (affine case) • Fast execution of the addition & doubling is possible, but it causes high hardware complexity
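One way a group addition and a group doubling can run side by side is the right-to-left binary scalar multiplication, where the two operations within one iteration do not depend on each other. The transcript does not say which scalar multiplication algorithm the Type 1 coprocessor uses, so the Python sketch below is an assumption. It also replaces the Jacobian of the curve with a toy group (integers modulo a small number under addition) so that the result is easy to check.

```python
def scalar_mul(k, P, add, double, identity):
    """Right-to-left binary method. Returns (k*P, number of loop iterations).

    Within one iteration the addition reads the old R and the doubling
    produces the new R, so dedicated add and double units could work
    in the same time slot.
    """
    Q, R = identity, P
    iterations = 0
    while k:
        if k & 1:
            Q = add(Q, R)    # group addition unit
        R = double(R)        # group doubling unit, independent of the line above
        k >>= 1
        iterations += 1
    return Q, iterations

# Toy stand-in group: integers modulo N under addition (hypothetical parameters).
N = 1009
add = lambda x, y: (x + y) % N
double = lambda x: (2 * x) % N

result, iterations = scalar_mul(77, 5, add, double, 0)
print(result, iterations)
```

With separate units the latency per scalar bit is set by the slower of the two group operations rather than by their sum, at the cost of the extra hardware the slide mentions.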

  14. HEC Architecture (2/3) • Type 2 • Uses only registers for the RF and multiplexers as the interconnect network • Parallel execution of data reads & writes is possible, but it causes high complexity in the interconnect network • Multipliers and inversion logic are shared between the group operations • Technology-independent design, like Type 1 (portable to any FPGA or ASIC) • Type 3: low hardware complexity • Uses memory to reduce hardware complexity • Uses buses to reduce the complexity of the interconnect network • Incurs more latency when executing the explicit formulae, but reduces hardware complexity

  15. HEC Architecture (3/3) • Architectural characteristics of HECC coprocessors

  16. Performance Results (1/2) • Performance of the HECC coprocessors (scalar mult.) Target platform : Xilinx FPGA XC2V4000 -6

  17. Performance Results (2/2) • Performance of the HECC coprocessors (scalar mult.) Xilinx Virtex II FPGA (XC2V4000ff1517-6) Normalized to the best AT product • Performance (TTC) • Area-Time Product

  18. Conclusions • The high performance of the HECC coprocessor is due to • A fast inversion algorithm • The high operating frequency of the multiplier in spite of its large digit size (D=32) • Reduced interconnect network latency through carefully designed buffer allocation and arithmetic units • Parallel execution of field operations • Pipelined execution of the field operations and data movement between register files • The HECC coprocessor can be used in both high-performance and resource-constrained security environments • Since the performance is about 0.436 ms with moderate H/W size (Type 1, GF(2^89)) • However, more research is still necessary to surpass ECC
