Accelerations of Scalar Multiplication: Advanced Techniques
Debdeep Mukhopadhyay, Chester Rebeiro
Department of Computer Science and Engineering, Indian Institute of Technology Kharagpur
23-27 May 2011, Anurag Labs, DRDO

Non-Adjacent Form (NAF)
• NAF(29) = (1, 0, 0, -1, 0, 1), since 29 = 32 - 4 + 1.
• Binary(29) = (1, 1, 1, 0, 1), since 29 = 16 + 8 + 4 + 1.
• Pros:
◦ NAF never has two consecutive non-zero digits (hence "non-adjacent").
◦ The average density of non-zero digits in NAF is 1/3.
◦ This reduces the number of point additions in ECC scalar multiplication.
• Cons:
◦ The NAF can be one digit longer than the binary representation.

Algorithm for NAF generation
• While $k > 0$: if $k$ is odd, set $k_i = 2 - (k \bmod 4)$ and $k = k - k_i$; otherwise set $k_i = 0$. Then halve $k$.
• Example, $k = 29$:
◦ $k_0 = 2 - (29 \bmod 4) = 1$, $k = 29 - 1 = 28$, then $k = 14$
◦ $k_1 = 0$ (the digit after a non-zero digit can never be non-zero), $k = 7$
◦ $k_2 = 2 - (7 \bmod 4) = -1$, $k = 7 + 1 = 8$, then $k = 4$
◦ $k_3 = 0$, $k = 2$
◦ $k_4 = 0$, $k = 1$
◦ $k_5 = 2 - (1 \bmod 4) = 1$, $k = 0$ (algorithm terminates)

Why Non-adjacent?
• When $k$ is odd, it is either $4p + 1$ or $4p + 3$ ($p$ an integer).
◦ Case 1: $k = 4p + 1$. Then $k_i = 1$, $k - k_i = 4p$, and after halving $k = 2p$ (even), so the next NAF digit is 0.
◦ Case 2: $k = 4p + 3$. Then $k_i = -1$, $k - k_i = 4p + 4$, and after halving $k = 2p + 2$ (even), so the next NAF digit is 0.

Scalar Multiplication with NAF
• Expected run time with NAF = $(m/3)A + mD$
• Run time of the binary method = $(m/2)A + mD$
• Here $A$ is a point addition, $D$ a point doubling, and $m$ the length of the scalar.
• Note that the number of doublings is unchanged. Later we see a method to remove doublings altogether.

Width-w NAF
• $k = 29$, $w = 3$: NAF digits = (1, 0, 0, 0, 0, -3), since $29 = 1 \cdot 32 - 3$.
• Pros: the density of non-zero digits is $1/(w+1)$.
• Cons: pre-computation is required, which means storage in hardware.
• The length is the same as for the ordinary NAF.

Algorithm for width-w NAF generation
• For odd $k$, the digit is $u \equiv k \pmod{2^w}$ with $-2^{w-1} \le u < 2^{w-1}$, and $k = k - u$; for even $k$ the digit is 0. Then halve $k$.
• Example, $k = 29$, $w = 3$:
◦ $k_0 = -3$, $k = 16$
◦ $k_1 = 0$, $k = 8$
◦ $k_2 = 0$, $k = 4$
◦ $k_3 = 0$, $k = 2$
◦ $k_4 = 0$, $k = 1$
◦ $k_5 = 1$, $k = 0$ (algorithm terminates)

Scalar Multiplication with width-w NAF
• Pre-computation: $1D + (2^{w-2} - 1)A$
• Expected run time = $\frac{m}{w+1}A + mD$
• Run time of the binary method = $(m/2)A + mD$
• Hence an architecture for this method must also implement the initial pre-computation phase.
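The digit rules on the NAF and width-w NAF slides can be sketched in Python as follows. This is an illustrative sketch, not code from the slides: the function names `naf`, `wnaf` and `value` are hypothetical, and digits are returned least-significant first (the slides list them most-significant first).

```python
def naf(k):
    """NAF digits of a non-negative integer k, least significant first."""
    digits = []
    while k > 0:
        if k & 1:
            d = 2 - (k % 4)      # +1 if k = 1 (mod 4), -1 if k = 3 (mod 4)
            k -= d               # k is now divisible by 4
        else:
            d = 0
        digits.append(d)
        k //= 2
    return digits


def wnaf(k, w):
    """Width-w NAF digits of a non-negative integer k, least significant first."""
    digits = []
    while k > 0:
        if k & 1:
            d = k % (1 << w)             # residue in [0, 2^w)
            if d >= (1 << (w - 1)):      # map to the signed range [-2^(w-1), 2^(w-1))
                d -= (1 << w)
            k -= d                       # k is now divisible by 2^w
        else:
            d = 0
        digits.append(d)
        k //= 2
    return digits


def value(digits):
    """Recombine digits (least significant first) into the integer they encode."""
    return sum(d * 2**i for i, d in enumerate(digits))


if __name__ == "__main__":
    print(naf(29)[::-1])      # most significant digit first, as on the slides
    print(wnaf(29, 3)[::-1])
```

For k = 29 the two functions give (1, 0, 0, -1, 0, 1) and (1, 0, 0, 0, 0, -3) when read most-significant first, matching the worked examples on the slides.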
Koblitz Curves
• The previous methods did not reduce the number of doubling operations.
• Koblitz introduced a family of curves that do not require any doubling. The curves are named after him.
• Koblitz curves are a special class of elliptic curves, defined over GF($2^m$) by $E_a: y^2 + xy = x^3 + ax^2 + 1$, where the elliptic curve parameter $a \in \{0, 1\}$.
• Koblitz curves are computationally efficient compared to random curves, as the Frobenius map can be used to accelerate scalar multiplication.

Choice of the curve
• The choice of the curve depends on $a$, which can be either 0 or 1.
• As we have seen, the points of an elliptic curve form a group.
◦ The group should be chosen such that the ECDLP is difficult.
◦ The number of elements in the elliptic curve group is called the order of the group.
◦ For security, the order of the group should be very nearly prime (the product of a large prime and a small integer). Otherwise the group has subgroups whose orders divide the group order, and these make the curve cryptographically weak.
◦ The field elements belong to GF($2^m$). The subgroups are the curves over the subfields GF($2^d$), where $d \mid m$. If $m$ is prime, $d = 1$. Thus the only such subgroups are $E_0(\mathrm{GF}(2))$ and $E_1(\mathrm{GF}(2))$.
• It can easily be checked that $E_0(\mathrm{GF}(2)) = \{O, (0,1), (1,0), (1,1)\}$ and $E_1(\mathrm{GF}(2)) = \{O, (0,1)\}$.

An Interesting Property
• The curve satisfies $(x^4, y^4) + 2(x, y) = \mu(x^2, y^2)$, where $\mu = (-1)^{1-a}$.
• Define the Frobenius map as $\tau(x, y) = (x^2, y^2)$, with $\tau(O) = O$.
• The Frobenius map satisfies the relation $\tau^2(P) + 2P = \mu\tau(P)$, i.e. $\tau^2 + 2 = \mu\tau$.
• For a point $P$ on the Koblitz curve, we can use this property of the Frobenius map to compute $2P = \mu\tau(P) - \tau^2(P)$.

τ-adic NAF
• The scalar $k$ can be represented as a polynomial in τ, where τ is the indeterminate: $k = \sum_{i=0}^{l-1} u_i \tau^i$ with $u_i \in \{0, \pm 1\}$.
◦ This sum is analogous to the binary expansion.
◦ The scalar is said to belong to the ring Z[τ].
◦ It can be proved that the τ-adic NAF representation is unique.
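To make the τ-adic expansion concrete, the following Python sketch generates a τ-adic NAF and evaluates it back in Z[τ]. It is not taken from the slides: it assumes the standard digit rule attributed to Solinas, u = 2 - ((r0 - 2·r1) mod 4) for odd r0, works on unreduced integer scalars, and uses the hypothetical names `tnaf` and `evaluate`. An element r0 + r1·τ is stored as the pair (r0, r1), and τ² = µτ - 2 with µ = ±1.

```python
def tnaf(k, mu):
    """tau-adic NAF digits of the integer k, least significant first.

    mu is +1 or -1 and fixes the ring via tau^2 = mu*tau - 2.
    """
    r0, r1 = k, 0                # current element r0 + r1*tau
    digits = []
    while r0 != 0 or r1 != 0:
        if r0 & 1:               # not divisible by tau: pick u = +1 or -1
            u = 2 - ((r0 - 2 * r1) % 4)
            r0 -= u              # chosen so that the next digit is 0
        else:
            u = 0
        digits.append(u)
        # Exact division by tau (r0 is even here):
        # (r0 + r1*tau) / tau = (r1 + mu*r0/2) - (r0/2)*tau
        half = r0 // 2
        r0, r1 = r1 + mu * half, -half
    return digits


def evaluate(digits, mu):
    """Return (a, b) such that sum(u_i * tau^i) = a + b*tau."""
    a, b = 0, 0
    for u in reversed(digits):   # Horner's rule: acc = acc*tau + u
        # (a + b*tau)*tau = -2b + (a + mu*b)*tau
        a, b = -2 * b + u, a + mu * b
    return a, b


if __name__ == "__main__":
    d = tnaf(9, 1)
    print(d, evaluate(d, 1))
```

Evaluating the digits with `evaluate` should return the pair (k, 0), which is a convenient sanity check on any hand-computed expansion.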
Divisibility by τ
• In order to generate this NAF, we divide the element by τ, just as we divided by 2 in the binary NAF.
• An element $c_0 + c_1\tau$ of Z[τ] is divisible by τ exactly when $c_0$ is even.
• As it is a NAF, the remainder is chosen such that the next NAF digit is zero.

Algorithm for τ-adic NAF generation
• Example: $k = 29$ on the curve with $a = 1$ ($\mu = 1$).
• The τ-adic NAF is (1, 0, 1, 0, 0, -1, 0, 0, 0, -1, 0, 1), i.e. $29 = 1 - \tau^2 - \tau^6 + \tau^9 + \tau^{11}$.
• $29P = P - \tau^2(P) - \tau^6(P) + \tau^9(P) + \tau^{11}(P)$
• $29P = (x, y) - (x^4, y^4) - (x^{64}, y^{64}) + (x^{512}, y^{512}) + (x^{2048}, y^{2048})$
• Thus the scalar multiplication avoids any doubling operation; instead it performs cheap squaring operations.
• Note that the length is about twice that of the binary expansion, hence a reduction is necessary.

Reduction of the scalar
• $\tau^m(P) = P$, since $x^{2^m} = x$ for every $x$ in GF($2^m$) [from Fermat's Little Theorem].
• Hence $(\tau^m - 1)(P) = O$.
◦ Hence, if $\gamma \equiv k \pmod{\tau^m - 1}$, then $\gamma(P) = k(P)$.

Reduction of Scalar
• Solinas presented an efficient algorithm for reduction of a scalar. The algorithm involves integer multiplication and is thus costly for hardware implementations.
• An alternative approach, known as Lazy Reduction, was proposed by Brumley and Jarvinen. It uses the observation that division by τ is cheap.
• The algorithm uses multiplication and division by 2 and integer additions. The implementation is simple and the area requirement is small.
• The algorithm takes m clock cycles, hence the name "lazy".

Scalar Multiplication with τ-adic NAF
• Expected run time with τ-adic NAF = $(m/3)A$
• Run time of the binary method = $(m/2)A + mD$

Summary
• Basic steps of scalar multiplication on Koblitz curves:
◦ Reduction of the scalar.
◦ NAF generation from the reduced scalar.
◦ Point addition for non-zero NAF digits. Point addition is performed in the Lopez-Dahab projective coordinate system.
◦ Point squaring for every NAF digit.
◦ Final field inversion to transform the scalar multiplication result from the projective to the affine coordinate system.
• Our Koblitz curve scalar multiplier uses the simple τ-adic NAF method for scalar multiplication.

Top Level Architecture (figure)

Architecture for Reduction of Scalar
• An arrangement of adders and shift circuits performs the reduction of the scalar.
• Here u is the LSB of c0.
• Registers store the intermediate values.
• The control unit generates the control signals for the multiplexers and the write-enable signals for the storage registers.
• Initially the storage register for c0 contains the value of the scalar.

T-NAF Generation from Reduced Scalar
• Reduced scalar: r0 = b0 + c0, r1 = b1 + c1.
• Each T-NAF digit can be found by observing the last two bits of c0 and c1.
• T-NAF digits are generated after the reduction of the scalar. As the algorithm performs integer additions and subtractions, the adders of the reduction circuit can be reused to generate the T-NAF digits.

Architecture for Reduction & T-NAF Generation
• The left portion of the circuit is used to generate the digits.
• The NAF generation and reduction hardware share the adders and registers.
• During reduction, control signal M is set to 0. After the reduction is over, NAF generation starts and M is changed to 1.

Choice of Scalar Multiplication Algorithm
• There are two scalar multiplication algorithms in the literature:
◦ Process the scalar starting from the MSB (Left-to-Right).
◦ Process the scalar starting from the LSB (Right-to-Left).
• The Left-to-Right algorithm first computes the entire NAF of the reduced scalar and then processes the NAF from the MSB.
• So it has to wait for the entire NAF generation, which takes nearly m clock cycles in GF($2^m$).
• Additionally, Q is squared at every iteration. So, while a point addition is in progress, the squaring cannot be performed in parallel.
• But squaring is cheap in hardware, and this algorithm does not exploit that opportunity for parallel processing.

Fast Scalar Multiplication Algorithm
• The Right-to-Left algorithm processes the NAF digits from the LSB: for a non-zero digit, Q ← Q ± P; for every digit, P ← τ(P).
• The scalar multiplication does not wait for the entire NAF of the scalar. As soon as the LSB, i.e. the first NAF digit, is generated, the scalar multiplication starts.
• Additionally, point addition updates only Q, and point squaring is independent of Q.
• So we can use the fact that point squaring is cheap in hardware and perform it in parallel with point addition.
• So we select this Right-to-Left algorithm for scalar multiplication.

Point Addition Unit
• The point addition unit performs point addition in the Lopez-Dahab coordinate system and takes 8 clock cycles.
• Initially the three registers are initialized with the base point (Px, Py, 1). After every point addition, the result Q ← Q + P is stored in registers (RA1, RB1, RC1).
• In the figure, P = (RA2, RB2).
• A field multiplication is performed in every clock cycle. The multiplier is of the Hybrid Karatsuba type.
• Control signals drive the multiplexers and the write-enable signals for the storage registers.

Point Addition Unit (figure)

Point Squaring Unit
• During scalar multiplication, point squaring is performed in every clock cycle. The base point is updated as P(x, y) ← P($x^2$, $y^2$).
• Point squarings are performed using dedicated squarer circuits, as squarers are cheap.
• From the scalar multiplication algorithm, it can be seen that point squaring is independent of point additions.
• A non-zero NAF digit is followed by one or more zero digits (NAF property). So, during a point addition, we can continue point squaring in parallel until another non-zero NAF digit appears.

Point Squaring Unit
• The NAF digits are generated from the LSB side. Consider a portion of the NAF: <. . . 1 0 0 0 0 0 1 . . .>.
• For the first 1, a point addition is required, and this point addition takes 8 clock cycles.
• From the algorithm it can be seen that, for a non-zero NAF digit u, a point addition takes place and uses the present value of P.
• With purely sequential processing, after performing the point addition for the first 1, the algorithm performs 6 point squarings for the sequence <0 0 0 0 0 1>. This requires 6 clock cycles.
• As P is independent of Q, we can perform the 6 point squarings in parallel with the point addition (which takes 8 clock cycles), thus saving 6 clock cycles.
• When the next non-zero digit appears in <. . 1 0 0 0 0 0 1 . .>, this parallel processing of zeros must stop, as the last updated value of P will be required during the next point addition.

Architecture for Point Squaring Unit
• The point P(x, y) is in affine coordinates, and two dedicated squarers are used for squaring the x and y coordinates.
• Initially the registers hold the base point. When the scalar multiplication starts, point squaring is performed for every digit and the registers are updated.
• A write-enable signal en protects the contents of the registers from unwanted squarings, in particular for the case (another non-zero digit) mentioned on the previous slide.
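The clock-cycle saving described on the point squaring slides can be illustrated with a small scheduling model in Python. This is a simplified sketch under stated assumptions, not the authors' design: a point addition takes 8 cycles and a point squaring 1 cycle (both from the slides), the adder is assumed to capture P when an addition starts, and no squaring is counted after the last digit. The function names are hypothetical.

```python
ADD_CYCLES = 8   # cycles per point addition (from the slides)
SQ_CYCLES = 1    # cycles per point squaring (from the slides)


def sequential_cycles(digits):
    """Cycle count when additions and squarings never overlap.

    digits: NAF digits, least significant first.
    """
    additions = sum(1 for u in digits if u != 0)
    squarings = max(len(digits) - 1, 0)      # assumption: no squaring after the last digit
    return additions * ADD_CYCLES + squarings * SQ_CYCLES


def overlapped_cycles(digits):
    """Cycle count when squarings of P run in parallel with an ongoing addition."""
    t = 0            # time at which P is ready for the current digit
    adder_free = 0   # time at which the point addition unit becomes free
    for i, u in enumerate(digits):
        if u != 0:
            # The addition needs the current P and a free adder, so P is
            # held (write enable off) until the previous addition finishes.
            start = max(t, adder_free)
            adder_free = start + ADD_CYCLES
            t = start    # assumption: the adder latches P at 'start'
        if i < len(digits) - 1:
            t += SQ_CYCLES                   # P <- tau(P) for the next digit
    return max(t, adder_free)


if __name__ == "__main__":
    window = [1, 0, 0, 0, 0, 0, 1]           # the <1 0 0 0 0 0 1> example
    print(sequential_cycles(window), overlapped_cycles(window))
```

For the window <1 0 0 0 0 0 1> the model gives 22 cycles sequentially and 16 with overlap, i.e. the 6 squaring cycles are hidden behind the first point addition, as on the slide.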
Architecture for Inversion
• Scalar multiplication, when done in the Lopez-Dahab coordinate system, requires a final inversion after processing the entire scalar.
• For ECC, Itoh-Tsujii inversion is efficient.
• In a field GF($2^m$), the inverse of an element $a$ is $a^{-1} = a^{2^m - 2}$.
• Using the quad operation ($a \mapsto a^4$) we can compute the inverse. Here is an example for the field GF($2^{233}$) (example table not reproduced).
• This requires multiplications and repeated quad operations. We can implement this using a multiplier and quad circuits.

Architecture for Inversion
• This is the basic block diagram of the inversion unit (figure). The multiplier is actually a part of the point addition unit; it is shared between the point addition unit and the inversion unit.
• It can be seen from the previous slide that there are repeated quad operations, for example in step 7, the computation of […]. With a single quad circuit, this exponentiation takes 14 clock cycles.
• To reduce the number of clock cycles, we use a cascade of several quad circuits. This cascade of quad circuits is called a Quadblock.

Architecture for Quadblock
• Here is an example of a Quadblock that contains 11 cascaded quad circuits. So an element $a$ can be raised to at most $a^{4^{11}}$ in one pass.
• A multiplexer is used to select intermediate results, for example […].
• To raise an element to a power that needs more quad operations than there are cascaded quad circuits, repeated application of the Quadblock is required. So the number of clock cycles depends on the number of quad circuits. For example, […] can be computed in two clock cycles.
• The number of clock cycles reduces if we increase the number of quad circuits, but the delay increases. So the design must balance the delay against the number of quad circuits.

Experimental Performance
• Experimentation was performed on a Xilinx Virtex V FPGA for GF($2^{283}$).
• A scalar multiplier on a random curve over GF($2^{283}$) has an area of around 40,000 LUTs, a frequency of 37 MHz and a computation time of 63 microseconds.
• The Koblitz curve scalar multiplier (in the first stage of implementation), which uses […] in GF($2^{283}$), has an area of 41,300 LUTs, a frequency of 31 MHz and an average computation time of 35 microseconds.
• Thus the Koblitz curve crypto processor takes a little over half the computation time of the random curve crypto processor (35 versus 63 microseconds).

Further Acceleration
• We have found a novel technique to reduce the number of point additions during scalar multiplication, using the […] representation of a scalar.
• For any scalar, we have found that the length of […] is close to half of the length of […].
• However, there is an overhead of a small amount of pre-computation and an increased area.
• On a Virtex IV FPGA, scalar multiplication using […] for the field GF($2^{283}$) saves 35% of the computation time compared to the […] method.

Thank You