Multiplicative Inverses
In this post, I will show a trick to divide a large number using only a small multiplication by an inverse. I will show two ways of computing these inverses and how to build a small fast lookup table. In the process, we will see that a single instruction can sometimes be more expensive than many cheaper instructions!
In this post, I build a function to compute the modular multiplicative inverse for modulus 2^{256}. Given a number a, I want a number r such that
[a ⋅ r]_{2^{256}} = 1
Note on notation: As before, I use square brackets with a subscript to denote the modulus operation.
We write the solution formally as
r = [a^{-1}]_{2^{256}}
This number r has an interesting property: multiplication by r is the same as division by a. Let me illustrate this with an example. Take 1000 as the modulus and let a be 73. The inverse of 73 is 137:
[73^{-1}]_{1000} = 137
If I want to divide 876 by 73, I can multiply 876 by 137 which is 120\,012 and take the last three digits, 12. Indeed 876 = 12 · 73. This is an interesting party trick.
The real magic happens with larger numbers. Let’s say we want to divide 46\,209 by 73. For this, we multiply 209 by 137 and take the last three digits. This is 633 and indeed 46\,209 = 633 · 73. But wait! We did not even use the digits 46! How can we divide a number without even looking at the full number? This trick works when 1) the division is exact, 2) the multiplicative inverse exists, and 3) the result fits in three digits.
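To make the trick concrete, here is a quick sketch in Python (using the built-in three-argument `pow` with exponent −1, available since Python 3.8, to compute modular inverses) that checks both examples:

```python
m = 1000
a = 73
r = pow(a, -1, m)   # modular inverse of 73 modulo 1000
assert r == 137

# Divide 876 by 73 using only a multiply and the last three digits:
assert (876 * r) % m == 876 // 73          # 12

# The trick ignores the high digits: divide 46209 by 73 using only "209".
assert (209 * r) % m == 46209 // 73        # 633
```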
Those who have read my post on 512-bit multiplication can guess where I’m going with this. We can use this to do division on 512-bit numbers with a 256-bit multiply! But before we get to that, we first need a way to compute these inverses.
The Fermat-Euler-Carmichael theorem
This funny theorem is an elementary result in number theory. It started out as Pierre de Fermat’s little theorem (1640), was later proven and generalized by Leonhard Euler (1736), and was finally perfected by Robert Carmichael (1914). The theorem is:
[a^n]_m = [a^{[n]_{λ(m)}}]_m
where λ(m) is the Carmichael function. This implies that for numbers modulo m, the exponents behave as numbers modulo λ(m). For our modulus λ(2^{256}) = 2^{254}. Thus, for 256-bit integers, exponents behave as 254-bit integers. This also extends to negative numbers and we can use that to get modular inverses:
[a^{-1}]_{2^{256}} = [a^{[-1]_{2^{254}}}]_{2^{256}} = [a^{2^{254}-1}]_{2^{256}}
Solidity supports exponentiation, so we can directly implement this:
contract MulInv {
function inv256(uint256 a)
public pure
returns (uint256 r)
{
r = a ** (2**254 - 1);
return r;
}
}
Inversion using the Fermat-Euler-Carmichael theorem
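Outside the EVM, the same exponentiation can be sketched in Python and checked against the built-in modular inverse (the 0xDEADBEEF value is just an arbitrary odd example, not from the contract):

```python
M = 2**256
a = 0xDEADBEEF                   # any odd number; even numbers have no inverse mod 2^256
r = pow(a, 2**254 - 1, M)        # the Fermat-Euler-Carmichael exponent
assert (a * r) % M == 1
assert r == pow(a, -1, M)        # matches Python's built-in modular inverse
```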
This compiles to a single EXP instruction with a constant exponent, which is about as short as it can get. But this function takes no less than 1614 gas! This is because the EXP operation costs 10 gas plus 50 gas for every byte in the exponent. In our case the exponent is almost all ones, so we pay the maximum price of 1600.
Why is this operation so expensive? In the virtual machine, the operation is likely implemented using exponentiation by squaring. In that case, a single EXP instruction does the same amount of work as about 512 MUL instructions. At five gas each, those MULs would cost about 2560 gas. In light of this, the EXP operation is not so oddly priced. In fact, it used to be a bit cheaper, but that led to problems and the price had to be raised in EIP-160.
Can we do better? Yes. We could use the extended Euclidean algorithm to solve r · a + s · m = \gcd(a, m) for r and s. This is generally faster than exponentiation. But there is an even better way.
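As a sketch of that alternative, here is a textbook extended Euclidean algorithm in Python (the `egcd` helper is my own naming, not from any library):

```python
def egcd(a, m):
    """Return (g, x, y) such that x*a + y*m == g == gcd(a, m)."""
    if m == 0:
        return a, 1, 0
    g, x, y = egcd(m, a % m)
    return g, y, x - (a // m) * y

M = 2**256
a = 12345                        # odd, so gcd(a, 2^256) == 1 and the inverse exists
g, x, _ = egcd(a, M)
assert g == 1
r = x % M                        # normalize into [0, 2^256)
assert (a * r) % M == 1
```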
Newton-Raphson-Hensel inversion
The Newton-Raphson method can be used to approximate division in real numbers. Thanks to Hensel’s lemma, this works even better in modular arithmetic. For powers of two it is the recurrence:
\begin{aligned} r_0 &= 1 \\\\ r_{i+1} &= [r_i ⋅ (2 - a ⋅ r_i)]_{2^{2^{i+1}}} \end{aligned}
For odd numbers a, the values of this recurrence satisfy the formula
r_i = [a^{-1}]_{2^{2^i}}
Each iteration doubles the number of correct bits. This means that we want to compute r_8 to get the modular inverse for a 256-bit number.
We can skip one iteration by observing that r_1 = a for all odd values of a. This leaves only seven iterations to compute.
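The recurrence is easy to check numerically; here is a Python sketch starting from r_1 = a (the example value is arbitrary):

```python
a = 0x123457                     # arbitrary odd example
r = a                            # r_1 = a is correct modulo 2^2
for i in range(1, 8):            # seven iterations lift r_1 up to r_8
    # r_{i+1} is correct modulo 2^(2^(i+1)); each step doubles the bits
    r = (r * (2 - a * r)) % 2 ** (2 ** (i + 1))
assert (a * r) % 2**256 == 1     # r_8 is the inverse modulo 2^256
```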
Small lookup table
We can skip one more iteration by creating a small lookup table for r_2. The values of a for which an inverse exists are a ∈ {1, 3, 5, …, 15}, with corresponding inverses r_2 ∈ [1, 11, 13, 7, 9, 3, 5, 15].
For other values of a we use zero instead of the real values of r_2. Using zeros has the advantage that it makes the final result zero when an inverse does not exist. Zero is never a valid inverse, so this is a nice flag value.
bytes16 table = 0x0001000b000d0007000900030005000f;
r = uint256(table[a % 16]);
A sixteen entry lookup table for r_2
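The table contents can be derived programmatically; a Python sketch (zero marks the even entries, where no inverse exists):

```python
# Build the 16-entry table of inverses modulo 16, one byte per entry.
table = bytes(
    pow(a, -1, 16) if a % 2 == 1 else 0   # 0 flags "no inverse"
    for a in range(16)
)
assert table.hex() == "0001000b000d0007000900030005000f"
```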
The instructions the Solidity compiler (v. 0.4.18) produces from this snippet are a bit disappointing. An unnecessary bounds check is inserted. A multiplication is done to typecast from uint8 to bytes1, which is immediately undone by a division to convert from bytes1 back to uint256. Finally, the % 16 is not replaced by the cheaper & 15. Let’s fix all this:
bytes16 table = 0x0001000b000d0007000900030005000f;
assembly {
r := byte(and(15, a), table);
}
A much faster lookup table for r_2
With these optimizations, the table lookup is a net gain of a few gas. Going further is currently not worth it. For r_3 we would need a lookup table of 256 entries. This can theoretically be done using CODECOPY and MLOAD, but this is currently unsupported. For r_4 the table would have to be 65 kilobytes, which is not worth the code size.
Conclusion
Combining the lookup table with six iterations results in the final implementation:
contract MulInv {
function inv256(uint256 a)
public pure
returns (uint256 r)
{
// 4 bit lookup table
bytes16 table = 0x0001000b000d0007000900030005000f;
assembly {
r := byte(and(15, a), table);
}
// 6 iterations of Newton-Raphson for 4 ⋅ 2^6 = 256 bit.
r *= 2 - a * r;
r *= 2 - a * r;
r *= 2 - a * r;
r *= 2 - a * r;
r *= 2 - a * r;
r *= 2 - a * r;
return r;
}
}
Multiplicative inverse modulo 2^{256}
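As a sanity check, the same algorithm can be ported to Python and tested against the built-in modular inverse (the name `inv256` mirrors the contract; the test values are arbitrary):

```python
TABLE = bytes.fromhex("0001000b000d0007000900030005000f")
M = 2**256

def inv256(a):
    r = TABLE[a % 16]                 # 4-bit lookup table gives r_2
    for _ in range(6):                # 6 Newton-Raphson iterations: 4 * 2^6 = 256 bits
        r = (r * (2 - a * r)) % M     # wrapping arithmetic, like the EVM
    return r

for a in (1, 3, 73, 2**255 - 19):
    assert (a * inv256(a)) % M == 1   # odd inputs invert correctly
assert inv256(4) == 0                 # even inputs flag "no inverse" with zero
```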
Despite having 20 times more operations, this function takes only about 154 gas, less than 10% of the EXP-based implementation!
With this inversion function, I finally have all the tools I need to solve the problem I wanted to solve. In the next post, I will work around the constraints and create a proportion function that never fails.
Update. I have since learned that r_2 = 3 * a ^ 2 (where ^ is Solidity’s XOR operator, which binds less tightly than *) works instead of the table (Montgomery, cited in Mayer 2016). It is slightly cheaper and requires no assembly, but it does not return zero when the inverse does not exist.
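A quick Python check of this shortcut over all odd residues modulo 16 (`^` is XOR in Python too):

```python
# Montgomery's shortcut: (3 * a) XOR 2 is the inverse of a modulo 16.
for a in range(1, 16, 2):
    r2 = ((3 * a) ^ 2) % 16
    assert (a * r2) % 16 == 1
```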