Informatica 30 (2006) 111-129

A Review of Modular Multiplication Methods and Respective Hardware Implementations

Nadia Nedjah
Department of Electronics Engineering and Telecommunications, Engineering Faculty, State University of Rio de Janeiro, Rio de Janeiro, Brazil
nadia@eng.uerj.br, http://www.eng.uerj.br/~nadia

Luiza de Macedo Mourelle
Department of System Engineering and Computation, Engineering Faculty, State University of Rio de Janeiro, Rio de Janeiro, Brazil
ldmm@eng.uerj.br, http://www.eng.uerj.br/~ldmm

Keywords: cryptography, encryption, modular multiplication, modular reduction.

Received: April 18, 2005

Generally speaking, public-key cryptographic systems consist of raising elements of some group, such as GF(2^n), Z/NZ or an elliptic curve, to large powers and reducing the result modulo some given element. This operation, usually called modular exponentiation, is performed using repeated modular multiplications. The practicality of a given cryptographic system depends heavily on how fast modular exponentiations are performed, and consequently on how efficiently modular multiplications are done, as these are at the base of the computation. The problem has received much attention over the years, and efficient software as well as hardware implementations have been proposed; however, the results are scattered through the literature. In this paper we survey the best-known and most recent methods for efficient modular multiplication, examining their strengths and weaknesses. For each method presented, we provide an adequate hardware implementation.

Povzetek: A review of modern cryptographic methods is given.

1 Introduction

Electronic communication is growing exponentially, and so should the care for information security issues [10]. Data exchanged over public computer networks must be authenticated, kept confidential and protected against alteration. In order to run successfully, electronic businesses require secure payment channels and valid digital signatures. Cryptography provides a solution to all these problems and many others [17].

One of the main objectives of cryptography is to provide confidentiality, i.e. a service that keeps the content of information secret from all but those authorized to access it. There are many ways to provide secrecy, ranging from physical protection to mathematical solutions that render the data unintelligible. The latter use encryption/decryption methods [10], [17], [30], [31]. Modular exponentiation is a common scrambling operation and is used as the encryption/decryption primitive by several public-key cryptosystems, such as the Diffie-Hellman key exchange [8], [9] and the Rivest-Shamir-Adleman (RSA) encryption scheme [34].

The RSA cryptosystem consists of a set of three items: a modulus M of around 1024 bits and two integers D and E, called the private and public keys, that satisfy T^(D×E) ≡ T mod M for any plaintext T with 0 ≤ T < M. Messages are encrypted using the public key as C = T^E mod M and uniquely decrypted as T = C^D mod M, so the same operation is used to perform both encryption and decryption. The modulus M is chosen to be the product of two large prime numbers, say P and Q. The public key E is generally small and contains only a few bits set (i.e. bits equal to 1), so that the encryption step is relatively fast. The private key D has as many bits as the modulus M and is chosen so that D×E ≡ 1 mod (P-1)(Q-1). The system is secure as it is computationally hard to discover P and Q from M.
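To make the preceding description concrete, the following minimal Python sketch (ours, not part of the original paper) encrypts and decrypts a toy message with artificially small primes. The helper modexp shows how a modular exponentiation reduces to repeated modular multiplications, which is precisely the operation surveyed in this paper; all names and the tiny parameters are illustrative assumptions.

# Toy RSA sketch (illustrative only): tiny primes, no padding, not secure.
def modexp(base, exp, mod):
    """Square-and-multiply exponentiation: every step is a modular multiplication."""
    result = 1
    base %= mod
    while exp > 0:
        if exp & 1:                  # multiply step
            result = (result * base) % mod
        base = (base * base) % mod   # square step
        exp >>= 1
    return result

# Assumed toy parameters; real moduli have 1024 bits or more.
P, Q = 61, 53
M = P * Q                    # modulus
phi = (P - 1) * (Q - 1)
E = 17                       # small public exponent with few bits set
D = pow(E, -1, phi)          # private key: D*E = 1 mod (P-1)(Q-1)  (Python 3.8+)

T = 42                       # plaintext, 0 <= T < M
C = modexp(T, E, M)          # encryption: C = T^E mod M
assert modexp(C, D, M) == T  # decryption: T = C^D mod M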
It is widely believed that breaking an RSA cryptosystem with a modulus of 1024 bits or more is computationally infeasible.

Modular exponentiation applies modular multiplication repeatedly, so the performance of public-key cryptosystems is primarily determined by the efficiency of the modular multiplication and exponentiation implementations. As the operands (the plaintext, the ciphertext or a partially ciphered text) are usually large (i.e. 1024 bits or more), and in order to improve the time requirements of the encryption/decryption operations, it is essential both to minimize the number of modular multiplications performed and to reduce the time required by a single modular multiplication. A modular multiplication A×B mod M can be performed in two different ways: multiplying, i.e. computing P = A×B, and then reducing, i.e. R = P mod M; or interleaving the multiplication and the reduction steps. There are various algorithms that implement modular multiplication. The most prominent are Karatsuba-Ofman's [12] and Booth's [3] methods for multiplying, Barrett's method [2], [6], [7] for reducing, and Montgomery's algorithms [18] and Brickell's method [4], [37] for interleaving multiplication and reduction.

Throughout this paper, we consider each of the methods cited in the previous paragraph. The review is organised as follows: first, in Section 2, we describe Karatsuba-Ofman's and Booth's methods for multiplying; then, in Section 3, we present Barrett's method for reducing an operand modulo a given modulus; finally, we detail Montgomery's algorithms for interleaving multiplication and reduction, in Section 4.

2 Efficient Multiplication Methods

The multiply-then-reduce methods consist of first computing the product and then reducing it with respect to the given modulus. This approach is generally preferred because very fast off-the-shelf multiplication algorithms are available, multiplication having been studied extensively [3], [12], [33]. Nowadays, the most popular multiplication methods that are suitable for hardware implementation are Karatsuba-Ofman's method and Booth's method.

2.1 Karatsuba-Ofman Method

Karatsuba-Ofman's algorithm is considered one of the fastest ways to multiply long integers. Generalizations of this algorithm were shown to be even faster than Schönhage-Strassen's FFT method for operand sizes of practical interest [35], [36]. Karatsuba-Ofman's algorithm is based on a divide-and-conquer strategy: a multiplication of 2n-digit integers is reduced to two n-digit multiplications, one (n+1)-digit multiplication, two n-digit subtractions, two left-shift operations, two n-digit additions and two 2n-digit additions.

Even though this algorithm was proposed long ago, as far as we know there is no published hardware implementation of it for integer multiplication. In contrast with the work presented in this paper, and after an extensive literature search, we only found publications on hardware implementations of Karatsuba-Ofman's algorithm adapted to multiplication in Galois fields [13], [32]. In these implementations, the addition (mod 2) of two bits delivers a single bit, using a XOR gate. In contrast, our implementation keeps track of the carry-out bit, as it is needed to obtain the product. Needless to say, this makes the designer face a completely different problem, as explained later on. The hardware specification is expressed using the most popular hardware description language, VHDL [20].
Note that VHDL does not provide a recursive feature with which to implement recursive computations [1], [27], [28]. The proposed model exploits the generate feature to yield the recursive hardware model.

This subsection is organized as follows: first, we describe Karatsuba-Ofman's algorithm and sketch its complexity; then we adapt the algorithm so that it can be implemented efficiently; subsequently, we propose a recursive and efficient architecture of the hardware multiplier for Karatsuba-Ofman's algorithm; after that, we implement the proposed hardware using the Xilinx™ project manager and present some figures concerning the time and space requirements of the obtained multiplier; we then compare our hardware with a Synopsys™ library multiplier and two other multipliers that implement Booth's multiplication algorithm.

2.1.1 Karatsuba-Ofman's Algorithm

We now describe the details of Karatsuba-Ofman's multiplication algorithm [12], [27], [36]. Let X and Y be the binary representations of two long integers:

X = Σ_{i=0}^{k-1} x_i·2^i   and   Y = Σ_{i=0}^{k-1} y_i·2^i

We wish to compute the product XY. The operands X and Y can each be decomposed into two equal-size parts, X_H and X_L, Y_H and Y_L respectively, which represent the n higher-order and n lower-order bits of X and Y. Let k = 2n; if k is odd, the operands can be right-padded with a zero.

X = 2^n·Σ_{i=0}^{n-1} x_{i+n}·2^i + Σ_{i=0}^{n-1} x_i·2^i = X_H·2^n + X_L
Y = 2^n·Σ_{i=0}^{n-1} y_{i+n}·2^i + Σ_{i=0}^{n-1} y_i·2^i = Y_H·2^n + Y_L

So the product P = XY can be computed as follows:

P = XY = (X_H·2^n + X_L)(Y_H·2^n + Y_L) = 2^{2n}·X_H·Y_H + 2^n·(X_H·Y_L + X_L·Y_H) + X_L·Y_L

Using the equation above, computing the product P needs four n-bit multiplications; the standard multiplication algorithm is based on this equation. Assuming that a multiplication of k-bit operands is performed using T(k) one-bit operations, we can formulate that T(k) = 4T(n) + S_k, wherein S_k is the number of one-bit operations needed to compute all the additions and shift operations. Considering that T(1) = 1, we find that the standard multiplication algorithm requires:

T(k) = Θ(k^{log₂4}) = Θ(k²)

The computation of P can be improved by noticing the following:

X_H·Y_L + X_L·Y_H = (X_H + X_L)(Y_H + Y_L) − X_H·Y_H − X_L·Y_L

Karatsuba-Ofman's algorithm is based on the above observation, so the 2n-bit multiplication can be reduced to three n-bit multiplications, namely X_H·Y_H, X_L·Y_L and (X_H + X_L)(Y_H + Y_L). Karatsuba-Ofman's multiplication method can then be expressed as in the algorithm of Figure 1, wherein the function Size(X) returns the number of bits of X, High(X) returns the higher half of X, Low(X) returns the lower half of X, RightShift(X, n) returns X·2^n and OneBitMultiplier(X, Y) returns XY when both X and Y are formed by a single bit.

Algorithm KaratsubaOfman(X, Y)
  If Size(X) = 1 Then
    KaratsubaOfman := OneBitMultiplier(X, Y)
  Else
    Product1 := KaratsubaOfman(High(X), High(Y));
    Product2 := KaratsubaOfman(Low(X), Low(Y));
    Product3 := KaratsubaOfman(High(X)+Low(X), High(Y)+Low(Y));
    KaratsubaOfman := RightShift(Product1, Size(X)) +
                      RightShift(Product3-Product1-Product2, Size(X)/2) +
                      Product2;
End KaratsubaOfman.

Figure 1: Karatsuba-Ofman's recursive multiplication algorithm

If Size(X) is odd, then High(X) and Low(X) right-pad X with a zero before extracting the high and the low half respectively. The algorithm above requires three n-bit multiplications to compute the product P.
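Before turning to the algorithm's complexity, here is a minimal Python sketch (ours, not the paper's VHDL) that mirrors the algorithm of Figure 1; the dynamic bit-size handling is an assumption made for illustration.

def karatsuba(x: int, y: int) -> int:
    """Recursive Karatsuba-Ofman multiplication (software model of Figure 1):
    three half-size multiplications replace the four of the schoolbook method."""
    if x < 2 or y < 2:                       # one-bit operand: plain product
        return x * y
    k = max(x.bit_length(), y.bit_length())  # Size(X) / Size(Y)
    n = k // 2                               # split point: low half has n bits
    xh, xl = x >> n, x & ((1 << n) - 1)      # High(X), Low(X)
    yh, yl = y >> n, y & ((1 << n) - 1)      # High(Y), Low(Y)
    p1 = karatsuba(xh, yh)                   # Product1 = XH*YH
    p2 = karatsuba(xl, yl)                   # Product2 = XL*YL
    p3 = karatsuba(xh + xl, yh + yl)         # Product3 = (XH+XL)*(YH+YL)
    return (p1 << (2 * n)) + ((p3 - p1 - p2) << n) + p2

# quick check against Python's built-in multiplication
assert karatsuba(0b110110101, 0b101100111) == 0b110110101 * 0b101100111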
So we can stipulate that:

T(k) = 2T(n) + T(n+1) + S'_k ≈ 3T(n) + S'_k

wherein S'_k is the number of one-bit operations needed to compute all the additions, subtractions and shift operations. Considering that T(1) = 1, we find that Karatsuba-Ofman's algorithm requires

T(k) = Θ(k^{log₂3}) ≈ Θ(k^{1.58})

one-bit operations, and so it is asymptotically faster than the standard multiplication algorithm.

2.1.2 Adapted Karatsuba's Algorithm

We now modify Karatsuba-Ofman's algorithm of Figure 1 so that the third multiplication is performed efficiently. For this, consider the arguments of the third recursive call, which computes Product3. They have Size(X)/2 + 1 bits. Let Z and U be these arguments left-padded with Size(X)/2 − 1 zero bits, so that Z and U have Size(X) bits. We can then write the product Product3 as follows, wherein Size(X) = 2n, Z_H and U_H are the high parts of Z and U respectively, Z_L and U_L are the low parts of Z and U respectively, and Z_H and U_H may only be equal to 0 or 1:

Product3 = ZU = (Z_H·2^n + Z_L)(U_H·2^n + U_L) = 2^{2n}·Z_H·U_H + 2^n·(Z_H·U_L + U_H·Z_L) + Z_L·U_L

Depending on the values of Z_H and U_H, the above expression can be obtained using one of the alternatives of Table 1. As is clear from Table 1, computing the third product requires one multiplication of size n and some extra adding, shifting and multiplexing operations. So we adapt Karatsuba-Ofman's algorithm of Figure 1 to this modification, as shown in the algorithm of Figure 2.

Z_H  U_H  Product3
0    0    Z_L·U_L
0    1    2^n·Z_L + Z_L·U_L
1    0    2^n·U_L + Z_L·U_L
1    1    2^{2n} + 2^n·(U_L + Z_L) + Z_L·U_L

Table 1: Computing the third product

Algorithm AdaptedKaratsubaOfman(X, Y)
  If Size(X) = 1 Then
    KaratsubaOfman := OneBitMultiplier(X, Y)
  Else
    Product1 := KaratsubaOfman(High(X), High(Y));
    Product2 := KaratsubaOfman(Low(X), Low(Y));
    P := KaratsubaOfman(Low(High(X)+Low(X)), Low(High(Y)+Low(Y)));
    If Msb(High(X)+Low(X)) = 1 Then A := Low(High(Y)+Low(Y)) Else A := 0;
    If Msb(High(Y)+Low(Y)) = 1 Then B := Low(High(X)+Low(X)) Else B := 0;
    Product3 := LeftShift(Msb(High(X)+Low(X))·Msb(High(Y)+Low(Y)), Size(X)) +
                LeftShift(A + B, Size(X)/2) + P;
    KaratsubaOfman := LeftShift(Product1, Size(X)) +
                      LeftShift(Product3-Product1-Product2, Size(X)/2) +
                      Product2;
End AdaptedKaratsubaOfman.

Figure 2: Adapted Karatsuba-Ofman's algorithm
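A small Python sketch (ours, for illustration) of the case analysis of Table 1: given the n-bit low parts and the one-bit carries of the two operand sums, Product3 is assembled from a single n-bit multiplication plus shifts and additions.

def product3(zh: int, zl: int, uh: int, ul: int, n: int) -> int:
    """Compute Product3 = (zh*2^n + zl)*(uh*2^n + ul) with zh, uh in {0, 1},
    using only one n-bit multiplication (zl*ul), as in Table 1."""
    assert zh in (0, 1) and uh in (0, 1)
    p = zl * ul                       # the single n-bit multiplication
    if zh and uh:                     # case (1, 1): 2^(2n) + 2^n*(ul + zl) + zl*ul
        return (1 << (2 * n)) + ((ul + zl) << n) + p
    if zh:                            # case (1, 0): 2^n*ul + zl*ul
        return (ul << n) + p
    if uh:                            # case (0, 1): 2^n*zl + zl*ul
        return (zl << n) + p
    return p                          # case (0, 0): zl*ul

# sanity check against direct multiplication, with illustrative 4-bit low parts
n = 4
for zh, zl, uh, ul in [(0, 9, 0, 13), (0, 9, 1, 13), (1, 9, 0, 13), (1, 9, 1, 13)]:
    assert product3(zh, zl, uh, ul, n) == ((zh << n) + zl) * ((uh << n) + ul)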
2.1.3 Recursive Hardware Architecture

In this section, we concentrate on explaining the proposed architecture of the hardware. The component KaratsubaOfman implements the algorithm of Figure 2. Its interface is given in Figure 3: the input ports are the multiplier X and the multiplicand Y, and the single output port is the product XY. Clearly, the multiplication of two n-bit operands yields a 2n-bit product. The VHDL recursive specification of the component architecture is given in the concise code of Figure 4, and the architecture details of the component KaratsubaOfman are given in Figure 5.

Entity KaratsubaOfman Is
  Generic(n: Positive);
  Port(X:  In  Bit_Vector(n-1 Downto 0);
       Y:  In  Bit_Vector(n-1 Downto 0);
       XY: Out Bit_Vector(2*n-1 Downto 0));
End KaratsubaOfman;

Figure 3: Interface of component KaratsubaOfman

Architecture RecursiveArchitecture Of KaratsubaOfman Is
-- declaration part, including components and temporary signals
Begin
  Termination: If k = 1 Generate
    TCell: OneBitMultiplier Generic Map(n)
           Port Map(X(0), Y(0), XY(0));
  End Generate Termination;
  Recursion: If k /= 1 Generate
    ADD1: Adder Generic Map(k/2)
          Port Map(X(k/2-1 Downto 0), X(k-1 Downto k/2), SumX);
    ADD2: Adder Generic Map(k/2)
          Port Map(Y(k/2-1 Downto 0), Y(k-1 Downto k/2), SumY);
    KO1: KaratsubaOfman Generic Map(k/2)
         Port Map(X(k-1 Downto k/2), Y(k-1 Downto k/2), Product1);
    KO2: KaratsubaOfman Generic Map(k/2)
         Port Map(X(k/2-1 Downto 0), Y(k/2-1 Downto 0), Product2);
    KO3: KaratsubaOfman Generic Map(k/2)
         Port Map(SumX(k/2-1 Downto 0), SumY(k/2-1 Downto 0), P);
    SA: ShiftnAdder Generic Map(k)
        Port Map(SumX(k/2), SumY(k/2), SumX(k/2-1 Downto 0), SumY(k/2-1 Downto 0), P, Product3);
    SSA: ShiftSubnAdder Generic Map(k)
         Port Map(Product1, Product2, Product3, XY);
  End Generate Recursion;
End RecursiveArchitecture;

Figure 4: Recursive architecture of the component KaratsubaOfman of size n

Figure 5: Macro view of KaratsubaOfman of size 2n in terms of components KaratsubaOfman of size n

The signals SXL and SYL are the two n-bit results of the additions XH + XL and YH + YL respectively. The two one-bit carry-outs of these additions are represented in Figure 5 by CX and CY respectively. The component ShiftnAdder (in Figure 5) first computes the sum S as SXL + SYL, SXL, SYL or 0, depending on the values of CX and CY (see also Table 1). It then computes Product3 as depicted in Figure 6, wherein T represents CX×CY.

Figure 6: Operation performed by the component ShiftnAdder

The computation implemented by the component ShiftSubnAdder (in Figure 5), i.e. the computation specified in the last line of Karatsuba-Ofman's algorithm in Figure 1 and Figure 2, can be performed efficiently if the execution order of its constituent operations is chosen carefully. This is shown in the architecture of Figure 7. The component ShiftSubnAdder proceeds as follows: it first computes R = Product1 + Product2; then it obtains 2CR, the two's complement of R; subsequently, it computes U = Product3 + 2CR; finally, as the bits of Product1 and U must be shifted to the left 2n times and n times respectively, the component reduces the first and last additions, as well as the shift operations of the last line of Karatsuba-Ofman's algorithm (see Figure 1 and Figure 2), to a unique addition, which is depicted in Figure 8.

Figure 7: Architecture of the component ShiftSubnAdder

Figure 8: Last addition performed by the component ShiftSubnAdder

2.2 Booth's Multiplication Method

Algorithms that formalize the operation of multiplication generally consist of two steps: one generates a partial product and the other accumulates it with the previous partial products. The most basic algorithm for multiplication is based on the add-and-shift method: the shift operation generates the partial products while the add step sums them up [3].

The straightforward way to implement a multiplication is based on an iterative adder-accumulator for the generated partial products, as depicted in Figure 9. However, this solution is quite slow, as the final result is only available after n clock cycles, where n is the size of the operands.

Figure 9: Iterative multiplier

A faster version of the iterative multiplier adds several partial products at once. This can be achieved by unfolding the iterative multiplier, yielding a combinatorial circuit that consists of several partial product generators together with several adders that operate in parallel. In this paper, we use such a parallel multiplier, as described in Figure 10.
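As a point of reference for the iterative multiplier of Figure 9, the following short Python sketch (ours, not part of the paper) models the basic add-and-shift scheme: one partial product x_i×Y is generated and accumulated per step, which is why a hardware realization needs n clock cycles.

def add_and_shift_multiply(x: int, y: int, n: int) -> int:
    """Software model of the basic add-and-shift multiplier:
    one partial product per iteration, accumulated into the result."""
    acc = 0
    for i in range(n):          # one iteration per multiplier bit (one clock cycle in hardware)
        if (x >> i) & 1:        # partial product x_i * Y
            acc += y << i       # the shift generates the partial product, the add accumulates it
    return acc

assert add_and_shift_multiply(0b1011, 0b1101, 4) == 0b1011 * 0b1101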
Now, we detail the algorithms used to compute the partial products and to sum them up.

Figure 10: Parallel multiplier

2.2.1 Booth's Algorithm

We now concentrate on the algorithm used to compute the partial products, as well as on reducing their number without deteriorating the space and time requirements of the multiplier. Let X and Y be the multiplicand and the multiplier respectively, and let n and m be their respective sizes. We denote X, Y and the product X×Y as follows:

X = Σ_{i=0}^{n-1} x_i·2^i,   Y = Σ_{i=0}^{m-1} y_i·2^i   ⟹   X×Y = Σ_{i=0}^{n-1} x_i·Y·2^i

Inspired by the above notation of X, Y and X×Y, the add-and-shift method [2], [3] generates n partial products x_i×Y, 0 ≤ i < n. Each partial product obtained is shifted left or right, depending on whether the starting bit was the least or the most significant, and the shifted partial products are added up. The number of partial products generated is bounded above by the size (i.e. the number of bits) of the multiplier operand. In cryptosystems, operands are quite large, as they represent blocks of text (i.e. ≥ 1024 bits).

Another notation of X allows us to halve the number of partial products without much increase in space requirements. Consider the following recoding of X and the corresponding expression of X×Y, with the convention x_{-1} = 0 (and X zero-extended at the most significant end when needed):

X = Σ_{i=0}^{⌈(n+1)/2⌉-1} x̃_i·2^{2i},  where  x̃_i = x_{2i-1} + x_{2i} − 2·x_{2i+1}
X×Y = Σ_{i=0}^{⌈(n+1)/2⌉-1} x̃_i·Y·2^{2i}

The possible values of x̃_i, with the respective values of the triple (x_{2i+1}, x_{2i}, x_{2i-1}), are −2 (100), −1 (101, 110), 0 (000, 111), 1 (001, 010) and 2 (011). Using this recoding generates only k = ⌈(n+1)/2⌉ partial products. Based on this notation, the modified Booth algorithm [3], [12] generates the partial products x̃_i×Y. These partial products can be computed very efficiently thanks to the digits of the new representation x̃_i. The hardware implementation is detailed below.

In the algorithm of Figure 11, the terms 4×2^{n+1} and 3×2^{n+1} are supplied to avoid working with negative numbers. The sum of these additional terms is exactly 2^{n+2k+1}, so it is congruent to zero modulo 2^{n+2k+1}. Once the sum of the partial products is obtained, its remainder in the division by 2^{n+2k+1} is the result of the multiplication X×Y. The partial product generator is composed of k Booth recoders [3], [6]. They communicate directly with k partial product generators, as shown in Figure 12.

Algorithm Booth(X, Y)
  Int product := 0;
  Int pp[⌈(n+1)/2⌉];
  pp[0] := x̃_0×Y + 4×2^{n+1};
  product := pp[0];
  For i := 1 To ⌈(n+1)/2⌉-1 Do
    pp[i] := (x̃_i×Y + 3×2^{n+1})×2^{2i};
    product := product + pp[i];
  Return product mod 2^{n+2⌈(n+1)/2⌉+1};
End Booth

Figure 11: Multiplication algorithm

The required partial products, i.e. x̃_i×Y, are easy multiples of Y: they can be obtained by a simple shift. The negative multiples, in two's complement form, can be obtained from the corresponding positive multiple by a bit-by-bit complement with a 1 added at the least significant bit of the partial product. The additional terms introduced above are included in the generated partial products as their three/two/one most significant bits. In the original formulation, each partial product pp_i is assembled using ++ as the bit-concatenation operation, (A) as the binary notation of the integer A, 0^i as a run of i zeros and B[n:0] as the selection of the n least significant bits of the binary representation B: the selected multiple of Y is XORed bit-by-bit with the sign s_i of the Booth digit x̃_i and s_i is added at the least significant position (producing the two's complement when the digit is negative), the constant terms appear as the prepended most significant bits, and a run of 2i zeros is appended to account for the 2^{2i} alignment.
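A small Python sketch (ours) of the radix-4 recoding described above: booth_digits extracts the Booth digits x̃_i ∈ {−2, −1, 0, 1, 2} from overlapping bit triples, and booth_multiply sums the corresponding easy multiples of Y. The sign handling used in hardware (bit-wise complement plus an added 1, and the extra constant terms) is abstracted away here by ordinary signed arithmetic.

def booth_digits(x: int, n: int):
    """Radix-4 Booth recoding: digit i is x_{2i-1} + x_{2i} - 2*x_{2i+1},
    with x_{-1} = 0 and zero-extension beyond bit n-1."""
    def bit(j: int) -> int:
        return (x >> j) & 1 if 0 <= j < n else 0
    return [bit(2 * i - 1) + bit(2 * i) - 2 * bit(2 * i + 1)
            for i in range((n + 2) // 2)]        # ceil((n+1)/2) digits

def booth_multiply(x: int, y: int, n: int) -> int:
    """Sum the partial products d_i * Y * 4^i; only the easy multiples
    0, +-Y and +-2Y of the multiplicand are ever needed."""
    return sum((d * y) << (2 * i) for i, d in enumerate(booth_digits(x, n)))

assert booth_multiply(0b101101, 0b110011, 6) == 0b101101 * 0b110011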
Figure 12: The partial product generator architecture

The Booth selection logic circuitry used, denoted by BR_i for 0 ≤ i < k in Figure 12, is very simple. The cell is described in Figure 13. The inputs are the three bits forming the Booth digit, and the outputs are three bits: the first one, SY, is set when the partial product to be generated is Y or −Y; the second one, S2Y, is set when the partial product to be generated is 2×Y or −2×Y; the last bit is simply the last bit of the Booth digit given as input, and it allows us to complement the bits of the partial product when a negative multiple is needed.

Figure 13: Booth recoder selection logic

The circuitry of the partial product generator, denoted by PPi Generator, is given in Figure 14. In order to implement the adder of the generated partial products, we use a new hybrid kind of adder: it consists of a cascade of interleaved stages of carry-save adders and delayed-carry adders.

Figure 14: The partial product generator

2.3 Multipliers Area/Time Requirements

The entire design was done using the Xilinx™ Project Manager (version Build 6.00.09) [40], following the steps of the Xilinx design cycle shown in Figure 15.

Figure 15: Design cycle

The design was elaborated using VHDL [20]. The synthesis step generates an optimized netlist, which is the mapping of the gate-level design into the Xilinx format XNF. The simulation step then consists of verifying the functionality of the elaborated design. The implementation step consists of partitioning the design into logic blocks, finding a near-optimal placement of each block and finally selecting the interconnect routing for a specific device family. This step generates a logic cell array file from which a bit stream can be obtained; it also provides the number of configurable logic blocks (CLBs) required. The verification step allows us to verify once again the functionality of the design and to determine its response time, including all the delays of the physical nets and pads. The programming step consists of loading the generated bit stream into the physical device.

The design was implemented into logic blocks using a specific device family, namely SPARTAN S05PC84-4. As explained before, the Karatsuba multiplier reduces to an ensemble of adders. These adders are implemented as ripple-carry adders, which can be implemented very efficiently in FPGAs, as the carry-out signal uses dedicated interconnects in the CLB, so there are no routing delays in the data path. An n-bit ripple-carry adder is implemented using n/2 + 2 CLBs and has a total fixed delay of 4.5 + 0.35n nanoseconds.
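As a worked example of these figures (direct application of the formula above, not additional measurements), a 32-bit ripple-carry adder occupies 32/2 + 2 = 18 CLBs and has a worst-case delay of about 4.5 + 0.35×32 ≈ 15.7 ns; doubling the operand width roughly doubles the CLB count and adds a further 0.35 ns of carry propagation per extra bit.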
Table 2 shows the delay introduced and the area required by the Karatsuba-Ofman multiplier (KO), together with those of a hardware implementation of the Booth multiplier that uses a Wallace tree for adding up the partial products (BW), another hardware implementation of Booth's algorithm that uses a redundant binary Booth encoding (PRB) and the Synopsys™ library multiplier (DW02) [11]. The figures are given for three different operand sizes; the delays are expressed in ns. These delays are represented graphically in Figure 16.

Figure 16: Representing the time requirement

Operand size | KO delay | KO area | BW delay | BW area | PRB delay | PRB area | DW02 delay | DW02 area
8            | 12.6     | 1297    | 44.6     | 1092    | 31.8      | 862      | 56.2       | 633
16           | 22.8     | 6300    | 93.9     | 5093    | 46.6      | 3955     | 114.9      | 2760
32           | 29.1     | 31740   | 121.5    | 20097   | 64.9      | 17151    | 164.5      | 11647

Table 2: Delays (ns) and areas (gates) for the different multipliers

Table 2 also shows the area required by our multiplier compared with that needed for the implementations of BW, PRB and DW02. The areas are given in terms of the total number of gates necessary for the implementation. These results are represented graphically in Figure 17.

It is clear from Figure 16 and Figure 17 that the engineered Karatsuba-Ofman multiplier works much faster than the other three multipliers, although it consumes more hardware area. Nevertheless, the histogram of Figure 18, which represents the area×time factor for the four compared multiplier implementations, shows that the proposed multiplier improves this product.

Figure 18: Representing the area×time factor

So, our multiplier improves the area×time factor as well as the time requirement, while the other three improve area at the expense of both the time requirement and the area×time factor. Moreover, we strongly believe that for larger operands the Karatsuba-Ofman multiplier will exhibit even better characteristics, i.e. time and area requirements, as suggested by Figure 16, Figure 17 and Figure 18.

3 Barrett's Reduction Method

A modular reduction is simply the computation of the remainder of an integer division. It can be denoted by:

X mod M = X − ⌊X/M⌋ × M

However, a division is very expensive, even compared with a multiplication. The naive sequential reduction algorithm successively subtracts the modulus until a remainder that is non-negative and smaller than the modulus is found. Note that after a subtraction a negative remainder may be obtained; in that case, the last non-negative remainder needs to be restored, and it is the expected result. This computation is described in the algorithm of Figure 19.

Algorithm NaiveReduction(P, M)
  Int R := P;
  Do
    R := R - M;
  While R > 0;
  If R ≠ 0 Then R := R + M;
  Return R;
End NaiveReduction

Figure 19: Naive reduction algorithm

In the context of this paper, P is the result of a product, so it has at most 2n bits, assuming that the operands both have n bits. The computation performed in the naive algorithm above is very inefficient, as it may require up to 2^n − 1 subtractions, 2^n comparisons and an extra addition. Instead of subtracting a single M, one can subtract a multiple of it at once. However, in order to yield multiples of M, further computations, namely multiplications, need to be performed, except for power-of-two multiples, i.e. 2^k·M: these are simply M left-shifted k times, which can be implemented very cheaply in hardware. This idea is described in the restoring reduction algorithm given in Figure 20: it attempts to subtract the biggest possible power-of-two multiple of M from the current remainder; whenever the result of that operation is negative, it restores the previous remainder and repeats the computation for the next of the possible power-of-two multiples of M, i.e. 2^n·M, 2^{n-1}·M, ..., 2M, M.
Algorithm RestoringReduction(P, M)
  Int R_0 := P;
  Int N := LeftShift(M, n);
  For i := 1 To n Do
    R_i := R_{i-1} - N;
    If R_i < 0 Then R_i := R_{i-1};
    N := RightShift(N);
  Return R_n;
End RestoringReduction

Figure 20: Restoring reduction algorithm

The computation performed in the restoring reduction algorithm requires n subtractions, n comparisons, some 2n shifts and some restoring operations. This is very much more efficient than the computation of the algorithm in Figure 19. An alternative to the restoring reduction algorithm is the non-restoring one, which is given in Figure 21.

Algorithm NonRestoringReduction(P, M)
  Int R_0 := P;
  Int N := LeftShift(M, n);
  For i := 1 To n Do
    If R_{i-1} ≥ 0 Then R_i := R_{i-1} - N;
    Else R_i := R_{i-1} + N;
    N := RightShift(N);
  If R_n < 0 Then R_n := R_n + M;
  Return R_n;
End NonRestoringReduction

Figure 21: Non-restoring reduction algorithm

It allows negative remainders: when the current remainder is non-negative, it subtracts the current power-of-two multiple of M from it; otherwise, it adds that multiple of M to it. It keeps doing so repeatedly for all possible power-of-two multiples of M, i.e. 2^n·M, 2^{n-1}·M, ..., 2M, M. The non-restoring reduction computation requires a final restoration that adds M to the obtained remainder when the latter is negative.

Using Barrett's method [2], [6], we can estimate the remainder using two simple multiplications. The approximation of the quotient is calculated as follows:

⌊X/M⌋ ≈ Q = ⌊ ⌊X/2^{n-1}⌋ × ⌊2^{2n}/M⌋ / 2^{n+1} ⌋

The expression above can be evaluated very efficiently, as divisions by a power of two 2^x amount to truncating the x least significant digits of the operand. The term ⌊2^{2n}/M⌋ depends only on M, so it is constant for a given modulus and can be pre-computed and saved in an extra register. The remainder estimated with Barrett's method [2], [6] is a non-negative integer that may exceed the exact remainder by at most 2M, so one or two subtractions of M might be required to yield the exact remainder (see Figure 22).

4 Booth-Barrett's Method

In this section, we outline the architecture of the modular multiplier, which is depicted in Figure 22. Later on in this section, and for each of the main parts of this architecture, we give the detailed circuitry, i.e. that of the partial product generator, the adder and the reducer.

Figure 22: The modular multiplier architecture

The multiplier of Figure 22 performs the modular multiplication X×Y mod M in three main steps:

1. Computing the product P = X×Y;
2. Computing the estimated quotient Q ≈ ⌊P/M⌋ as Q = ⌊ ⌊P/2^{n-1}⌋ × ⌊2^{2n}/M⌋ / 2^{n+1} ⌋;
3. Computing the final result P − Q×M.

During the first step, the modular multiplier first loads register1 and register2 with X and Y respectively; it then waits for the PPG to yield the partial products and finally waits for the ADDER to sum all of them. During the second step, the modular multiplier loads register1, register2 and register3 with the obtained product P, the pre-computed constant ⌊2^{2n}/M⌋ and P respectively; it then waits for the PPG to yield the partial products and finally waits for the ADDER to sum all of them. During the third step, the modular multiplier first loads register1 and register2 with the obtained quotient Q and the modulus M respectively; it then waits for the PPG to generate the partial products, waits for the ADDER to provide the sum of these partial products and finally waits for the REDUCER to calculate the final result P − Q×M, which is subsequently loaded into the accumulator acc.
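A compact Python model (ours) of the three steps above, using the Barrett estimate of Section 3: the constant ⌊2^{2n}/M⌋ is precomputed once per modulus, the two power-of-two divisions are plain right shifts, and at most two final subtractions of M play the role of the reducer. The toy values are illustrative.

def booth_barrett_modmul(x: int, y: int, m: int, n: int, mu: int) -> int:
    """Model of the three-step modular multiplication X*Y mod M,
    with mu = floor(2^(2n)/M) precomputed for the n-bit modulus M."""
    p = x * y                               # step 1: the product (Booth multiplier in hardware)
    q = ((p >> (n - 1)) * mu) >> (n + 1)    # step 2: Barrett estimate of floor(P/M)
    r = p - q * m                           # step 3: P - Q*M, may exceed the remainder by < 3M
    while r >= m:                           # final correction (the REDUCER's job)
        r -= m
    return r

m, n = 0b1011_0111, 8                       # 183, an 8-bit modulus (illustrative)
mu = (1 << (2 * n)) // m                    # precomputed constant floor(2^(2n)/M)
assert booth_barrett_modmul(175, 94, m, n, mu) == (175 * 94) % m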
4.1 The Montgomery Algorithm

Algorithms that formalize the operation of modular multiplication generally consist of two steps: one generates the product P = A×B and the other reduces this product P modulo M. The straightforward way to implement the multiplication is based on an iterative adder-accumulator for the generated partial products. However, this solution is quite slow, as the final result is only available after n clock cycles, where n is the size of the operands [19]. A faster version of the iterative multiplier adds several partial products at once. This can be achieved by unfolding the iterative multiplier, yielding a combinatorial circuit that consists of several partial product generators together with several adders that operate in parallel [15], [16].

One of the most widely used algorithms for efficient modular multiplication is Montgomery's algorithm [18]. This algorithm computes the product of two integers modulo a third one without performing division by M; it yields the reduced product using a series of additions. Let A, B and M be the multiplicand, the multiplier and the modulus respectively, and let n be the number of digits in their binary representation, i.e. the radix is 2. We denote A, B and M as follows:

A = Σ_{i=0}^{n-1} a_i·2^i,   B = Σ_{i=0}^{n-1} b_i·2^i   and   M = Σ_{i=0}^{n-1} m_i·2^i

The pre-conditions of the Montgomery algorithm are as follows:

• the modulus M needs to be relatively prime to the radix, i.e. there exists no common divisor of M and the radix;
• the multiplicand and the multiplier need to be smaller than M.

As we use the binary representation of the operands, the modulus M needs to be odd to satisfy the first pre-condition. The Montgomery algorithm uses the least significant digit of the accumulating modular partial product to determine the multiple of M to add. The usual multiplication order is reversed by choosing the multiplier digits from least to most significant and shifting down. If R is the current modular partial product, then q is chosen so that R + q×M is a multiple of the radix r, and this sum is then divided by r for use in the next iteration. So, after n iterations, the result obtained is R = A×B×r^{-n} mod M [14]. A modified version of Montgomery's algorithm is given in Figure 23, wherein r_0 denotes the least significant bit of R.

Algorithm Montgomery(A, B, M)
  Int R := 0;
  For i := 0 To n-1 Do
    R := R + a_i×B;
    If r_0 = 0 Then R := R div 2
    Else R := (R + M) div 2;
  Return R;
End Montgomery

Figure 23: Montgomery modular multiplication algorithm

In order to yield the right result, we need an extra Montgomery modular multiplication by the constant 2^{2n} mod M. However, as the main objective of using the Montgomery modular multiplication algorithm is to compute exponentiations, it is preferable to Montgomery pre-multiply the operands by 2^{2n} and Montgomery post-multiply the result by 1 to get rid of the 2^{-n} factor. Here, we concentrate on the implementation of the Montgomery multiplication algorithm of Figure 23 and perform the extra Montgomery modular multiplication by the constant C = 2^{2n} mod M. As we use the binary representation of numbers, we compute the final result using the algorithm of Figure 24.

Algorithm ModularMult(A, B, M, n)
  Const C := 2^{2n} mod M;
  Int R := 0;
  R := Montgomery(A, B, M);
  Return Montgomery(R, C, M);
End ModularMult

Figure 24: Modular multiplication algorithm
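The following Python sketch (ours) mirrors the radix-2 algorithm of Figure 23 and the wrapper of Figure 24: montgomery returns A×B×2^{-n} mod M, and modular_mult removes the 2^{-n} factor with one extra Montgomery multiplication by the constant C = 2^{2n} mod M. The final conditional subtraction is not shown in Figure 23 and is added here so that the software result stays below M; the toy modulus is illustrative.

def montgomery(a: int, b: int, m: int, n: int) -> int:
    """Radix-2 Montgomery multiplication (Figure 23): returns a*b*2^(-n) mod m.
    Requires m odd and a, b < m, with n the bit length of m."""
    r = 0
    for i in range(n):
        r = r + ((a >> i) & 1) * b      # R := R + a_i * B
        if r & 1:                       # least significant bit r_0
            r = (r + m) >> 1            # make R even by adding M, then halve
        else:
            r = r >> 1
    return r if r < m else r - m        # conditional subtraction (not shown in Figure 23)

def modular_mult(a: int, b: int, m: int, n: int) -> int:
    """Figure 24: two Montgomery multiplications yield the plain a*b mod m."""
    c = (1 << (2 * n)) % m              # constant C = 2^(2n) mod M
    r = montgomery(a, b, m, n)          # r = a*b*2^(-n) mod m
    return montgomery(r, c, m, n)       # r*c*2^(-n) = a*b mod m

m, n = 239, 8                           # odd 8-bit modulus (illustrative)
a, b = 175, 94
assert modular_mult(a, b, m, n) == (a * b) % m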
4.2 Iterative Montgomery Architecture

In this section, we outline the architecture of the Montgomery modular multiplier. Its interface is given in Figure 25: it expects the operands A, B and M and computes R = (A×B×2^{-n}) mod M.

Figure 25: Montgomery multiplier interface

The detailed architecture of the Montgomery modular multiplier is given in Figure 26. It uses two multiplexers, two adders, two shift registers, three registers and a controller; the latter is described below. The first multiplexer of the proposed architecture, mux21, passes 0 or the content of register B, depending on whether bit a_0 is 0 or 1 respectively. The second multiplexer, mux22, passes 0 or the content of register M, depending on whether bit r_0 is 0 or 1 respectively. The first adder, adder1, delivers the sum R + a_i×B (line 2 of the algorithm of Figure 23), and the second adder, adder2, yields the sum R + M (line 6 of the same algorithm). The shift register shift_register1 provides the bit a_i: in each iteration i of the multiplier, this shift register is right-shifted once, so that a_0 contains a_i. The role of the controller consists of synchronizing the shifting and loading operations of shift_register1 and shift_register2. It also controls the number of iterations that have to be performed by the multiplier; to this end, the controller uses a simple counter, which is internal to the controller. The interface of the controller is given in Figure 27.

Figure 26: Montgomery multiplier architecture

Figure 27: Interface of the Montgomery controller

In order to synchronize the work of the components of the architecture, the controller consists of a state machine with 6 states, defined as follows:

• S0: Initialize the state machine; go to S1;
• S1: Load the multiplicand and the modulus into the corresponding registers; load the multiplier into shift_register1; go to S2;
• S2: Wait for adder1; wait for adder2; load the partial result into shift_register2; increment the counter; go to S3;
• S3: Enable shift_register2; enable shift_register1; go to S4;
• S4: Check the counter; if 0 then go to S5, else go to S2;
• S5: Halt.

4.3 Modular Multiplier Architecture

The modular multiplier yields the actual value of A×B mod M. It first computes R1 = A×B×2^{-n} mod M using the Montgomery modular multiplier; then it Montgomery-multiplies R1 by the constant C = 2^{2n} mod M, which yields A×B mod M. The modular multiplier interface is shown in Figure 28.

Figure 28: The modular multiplier interface

The modular multiplier uses a 4-to-1 multiplexer mux4 and a register denoted register, and it operates in two steps:

• Step 0: Multiplexer mux4 passes 0 or B; mux2 passes A. This step yields R1 = A×B×2^{-n} mod M. The register denoted register contains 0.
• Step 1: Multiplexer mux4 passes 0 or R1; mux2 passes C. This step yields R = R1×C×2^{-n} mod M = A×B mod M. The register denoted register contains the result of the first step, i.e. R1 = A×B×2^{-n} mod M.

The modular multiplier architecture is given in Figure 29. In order to synchronize the actions of the components of the modular multiplier, the architecture uses a controller, which consists of a state machine with 10 states; the interface of this controller is that of Figure 30. The modular multiplier controller performs all the control that the Montgomery modular multiplier needs, as described in the previous section. Furthermore, it controls the change from step 0 to step 1 and the loading of the register denoted register.
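Returning to the Montgomery datapath of Figure 26, the sketch below (ours) models one iteration of the multiplier in software: mux21 and adder1 realize R + a_i×B, mux22 and adder2 realize the optional addition of M, and the final right shift is the division by 2 performed before the next iteration. Component names are taken from the figure; everything else is illustrative.

def montgomery_iteration(r: int, a_i: int, b: int, m: int) -> int:
    """One iteration of the datapath of Figure 26 (software model)."""
    mux21 = b if a_i else 0        # mux21: passes 0 or register B, selected by bit a_0 (= a_i)
    s = r + mux21                  # adder1: R + a_i*B
    mux22 = m if (s & 1) else 0    # mux22: passes 0 or register M, selected by bit r_0 of the sum
    s = s + mux22                  # adder2: adds M when the intermediate sum is odd
    return s >> 1                  # right shift: division by 2 before the next iteration

# Driving the iteration n times reproduces Montgomery(A, B, M) = A*B*2^(-n) mod M.
a, b, m, n = 175, 94, 239, 8
r = 0
for i in range(n):                 # shift_register1 supplies a_i; the controller counts iterations
    r = montgomery_iteration(r, (a >> i) & 1, b, m)
assert r % m == (a * b * pow(2, -n, m)) % m   # pow(2, -n, m) needs Python 3.8+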
Figure 29: The modular multiplier architecture

Figure 30: The interface of the multiplier controller

The state machine is depicted in Figure 31; its states are defined as follows:

• S0: Initialize the state machine; set step to 0; go to S1;
• S1: Load the multiplicand and the modulus; load the multiplier into shift_register1; go to S2;
• S2: Wait for adder1; wait for adder2; load the partial result into shift_register2; increment the counter; go to S3;
• S3: Enable shift_register2; enable shift_register1; go to S4;
• S4: Load the partial result of step 0 into register; check the counter; if 0 then go to S5, else go to S2;
• S5: Load the constant into shift_register1; reset register; set step to 1; go to S6;
• S6: Wait for adder1; wait for adder2; load the partial result into shift_register2; increment the counter; go to S7;
• S7: Enable shift_register2; enable shift_register1; go to S8;
• S8: Check the counter; if 0 then go to S9, else go to S6;
• S9: Halt.

4.4 Simulation Results

The modular multiplier described throughout this section was specified in the Very High Speed Integrated Circuit Hardware Description Language (VHDL) [20] and simulated using the Xilinx™ Project Manager [40], which allows the user to design and simulate the functionality of the elaborated design.

[Simulation waveform of the modular multiplier (0-220 ns) omitted.]