2014-02-27 (updated 2014-03-04)

Montgomery multiplication

% Remco Bloemen % 2014-02-27, last updated 2014-03-04

I looked at the wonderful concept of Montgomery reduction. At the start I didn’t think such a complicated algorithm could be faster than a single divq instruction. But it was! By more than six times!

To compare the algorithms I contrived a small program (mul_monty.cpp) to calculate a lot of modular exponents, which in turn use even more multiplications. This was then run with no calculation, calculation using divq and calculation with Montgomery multiplication. All three loops compute the Montogomery conversion too take that out of the comparison.

Method Time

Empty loop 0.37 s divq 38.02 s Montgomery 6.19 s

So Montgomery multiplication is about six times faster. Lesson learned:

One instruction can be slower than many lines of code.

Here is how to convert to Monty form:

$x' = x\; 2^{64} \mod m$

/// Converts from modular to Montgomery reduced form
/// @param a The number in a mod m form
/// @param R Fixed to R = 2⁶⁴ mod m
/// @param m The modulus
/// @returns a 2⁶⁴ mod m
uint64 mod_to_monty(uint64 a, uint64 R, uint64 m)
{
        return mul_asm(a, R, m);
}

The mul_asm function is the original divq based multiplication. It might seem silly to depend on the original method to improve it, but remember that this conversion only happens sporadically in comparison to multiplication.

And here is how to convert back to modular form:

$x = x'\, 2^{-64} \mod m$

/// Converts from Montgomery reduced form to modular
/// @param a The number in x = aR mod m  form
/// @param r Fixed to r = 2⁻⁶⁴ mod m
/// @param m The modulus, must be m > 2⁶³
/// @returns a 2⁻⁶⁴ mod m
uint64 monty_to_mod(uint64 a, uint64 r, uint64 m)
{
	if(a >= m)
		a -= m;
	return mul_asm(a, r, m);
}

Now, what you have been waiting for: the multiplication! The task is to calculate $p = a b \mod m$ or rather, its Monty form:

$p' = a b 2^{64} \mod m$

The parameters given are:

$\begin{aligned} a' &= a\, 2^{64} \mod m \\ b' &= b\, 2^{64} \mod m \end{aligned}$

The first step is to calculate (with 128 bits)

$t = a' b' = a b\, 2^{128} \mod m^2$

Now, if this number is divisible by $2^{64}$ the we can simply bit shift $t$ and return this result as $p'$ .

If g is not a divisible by $2^{64}$ we can make it divisible by adding a (128 bit) number $v$ to it. The great Montgomery trick is to choose this number such that $v \mod m = 0$ . It turns out that with:

$\begin{aligned} k &= \left(-m\right)^{-1} \mod 2^{64} \\ u &= t k \mod 2^{64} \\ v &= u m \end{aligned}$

We obtain the number $v$ we want!

Now all that remains is to compute t + v and bit shift this result. And since we know the least significant bits will be zero with a carry this can be done using 64 bit arithmetic.

All in all the algorithm can be implemented as follows:

/// Multiplication with Montgomery reduction
/// @param a The first factor in Monty form
/// @param b The second factor in Monty form
/// @param R Fixed to R = 2⁶⁴ mod m
/// @param r Fixed to r = 2⁻⁶⁴ mod m
/// @param k Fixed to k = (-m)⁻¹ mod 2⁶⁴
/// @param m The modulus
/// @returns The product in Monty form: a b 2⁻⁶⁴ mod 2⁶⁴
uint64 mul_monty(uint64 a, uint64 b, uint64 R, uint64 k, uint64 m)
{
	// th 2⁶⁴ + tl = a b
	uint64 th, tl;
	asm("mulq %3" : "=a"(tl), "=d"(th) : "a"(a), "rm"(b));

	// If t is a multiple of 2⁶⁴
	// then return the quotient
	if(tl == 0)
		return th;

	// u = t n  mod  2⁶⁴
	uint64 u = tl * k;

	// vh 2⁶⁴ + vl = u m
	uint64 vh, vl;
	asm("mulq %3;" : "=a"(vl), "=d"(vh) : "a"(u), "rm"(m));

	// a = vh + th + 1 (take care of overflow)
	uint64 p = vh + th + 1;
	if (p < th)
		p += R;

	return p;
}