Multiplicative Inverses

In this post, I will show a trick to divide a large number using only a small multiplication by an inverse. I will show two ways of computing these inverses and how to build a small fast lookup table. In the process, we will see that a single instruction can sometimes be more expensive than many cheaper instructions!

In this post, I build a function to compute the modular multiplicative inverse for modulus $2^{256}$. Given a number $a$, I want a number $r$ such that

$$ \mod{a ⋅ r}_{2^{256}} = 1 $$

Note on notation: As before, I use square brackets with a subscript to denote the modulus operation.

We write the solution formally as

$$ r = \mod{a^{-1}}_{2^{256}} $$

This number $r$ has an interesting property: multiplication by $r$ is the same as division by $a$. Let me illustrate this with an example. Take $1000$ as the modulus and let $a$ be $73$. The inverse of $73$ is $137$:

$$ \mod{73^{-1}}_{1000} = 137 $$

If I want to divide $876$ by $73$, I can multiply $876$ by $137$ which is $120\,012$ and take the last three digits, $12$. Indeed $876 = 12 · 73$. This is an interesting party trick.

The real magic happens with larger numbers. Let’s say we want to divide $46\,209$ by $73$. For this, we multiply $209$ by $137$ and take the last three digits. This is $633$ and indeed $46\,209 = 663 · 73$. But wait! We did not even use the digits $46$! How can we divide a number without even looking at the full number? This trick works when 1) the division is exact, 2) the multiplicative inverse exists, and 3) the result fits in three digits.

Those who have read my post on 512-bit multiplication can guess where I’m going with this. We can use this to do division on 512-bit numbers with a 256-bit multiply! But before we get to that, we first need a way to compute these inverses.

The Fermat-Euler-Carmichael theorem

This funny theorem is an elementary result in number theory. It started out as Pierre de Fermat’s little theorem (1640). Later proven and improved by Leonhard Euler (1736). Finally perfected by Robert Carmichael (1914). The theorem is:

$$ \mod{a^n}_m = \mod{a^{\mod{n}_{λ(m)}}}_m $$

where $λ(m)$ is the Carmichael function. This implies that for numbers modulo $m$, the exponents behave as numbers modulo $λ(m)$. For our modulus $λ(2^{256}) = 2^{254}$. Thus, for 256-bit integers, exponents behave as 254-bit integers. This also extends to negative numbers and we can us that to get modular inverses:

$$ \mod{a^{-1}}_{2^{256}} = \mod{a^{\mod{-1}_{2^{254}}}}_{2^{256}} = \mod{a^{2^{254}-1}}_{2^{256}} $$

Solidity supports exponentiation, so we can directly implement this:

contract MulInv {
    function inv256(uint256 a)
        public pure
        returns (uint256 r)
    {
        r = a ** (2**254 - 1);
        return r;
    }
}

Inversion using the Fermat-Euler-Carmichael theorem

This compiles to a single EXP instruction with a constant exponent, which is about as short as it can get. But this function takes no less than 1614 gas! This is because the EXP operation takes 10 gas plus 50 gas for every byte in the exponent. In our case the exponent is almost all ones, so we pay the maximum price of 1600.

Why is this operation so expensive? In the virtual machines, the operation is likely implemented using exponentiation by squaring. In this case, a single EXP instruction does the same amount of work as 512 MUL instructions. At five gas each, those MULs cost about 2560 gas. In light of this, the EXP operation is not so oddly priced. In fact, it used to be a bit cheaper but that led to problems and the price had to be raised in EIP160.

Can we do better? Yes. We could use the extended Euclidean algorithm to solve $r · a + 1 · m = \gcd(a, m)$. This is generally faster than exponentiation. But there is an even better way.

Newton-Raphson-Hensel inversion

The Newton-Raphson method can be used to approximate division in real numbers. Thanks to Hensel’s lemma, this works even better in modular arithmetic. For powers of two it is the recurrence:

$$ \begin{aligned} r_0 &= 1 \\\\ r_{i+1} &= \mod{r_i ⋅ (2 - a ⋅ r_i)}_{2^{2^i}} \end{aligned} $$

For odd numbers $a$, the values of this recurrence satisfy the formula

$$ r_i = \mod{a^{-1}}_{2^{2^i}} $$

Each iteration doubles the number of correct bits. This means that we want to compute $r_8$ to get the modular inverse for a 256-bit number.

We can skip one iteration by observing that $r_1 = a$ for all odd values of $a$. This leaves only seven iterations to compute.

Small lookup table

We can skip one more iteration by creating a small lookup table for $r_2$. The values values of $r_2$ where an inverse exists are $a ∈ {1, 3, 5, …, 15}$ with corresponding inverses $[1, 11, 13, 7, 9, 3, 5, 15]$.

For other values of $a$ we use zero instead of the real values of $r_2$. Using zeros has the advantage that it makes the final result zero when an inverse does not exist. Zero is never a valid inverse, so this is a nice flag value.

bytes16 table = 0x001000b000d0007000900030005000f;
r= uint256(table[a % 16]);

A sixteen entry lookup table for $r_2$

The instructions the Solidity compiler (v. 0.4.18) produces from this snippet are a bit disappointing. An unnecessary bounds check is inserted. A multiplication is done to typecast from uint8 to bytes1 which is immediately undone by a division to convert from bytes1 to uint256. Finally, the % 16 is not replaced by the cheaper & 15. Let’s fix all this:

bytes16 table = 0x001000b000d0007000900030005000f;
assembly {
    r := bytes(and(15, a), table);
}

A much faster lookup table for $r_2$

With these optimizations, the table lookup is a net gain of a few gas. Going further is currently not worth it. For $r_3$ we would need a lookup table of 256 entries. This can theoretically be done using CODECOPY and MLOAD, but current unsupported. For $r_4$ the table would have to be 65 kilobyte which is not worth the code size.

Conclusion

Combining the lookup table with six iterations results in the final implementation:

contract MulInv {
    function inv256(uint256 a)
        public pure
        returns (uint256 r)
    {
        // 4 bit lookup table
        bytes16 table = 0x001000b000d0007000900030005000f;
        assembly {
            r := bytes(and(15, a), table);
        }

        // 6 iterations of Newton-Raphson for 4 ⋅ 2^6 = 256 bit.
        r *= 2 - a * r;
        r *= 2 - a * r;
        r *= 2 - a * r;
        r *= 2 - a * r;
        r *= 2 - a * r;
        r *= 2 - a * r;
        return r;
    }
}

Multiplicative inverse modulo $2^{256}$

Despite having 20 times more operations, this function takes only about 154 gas. Less than 10% of the EXP based implementation!

With this inversion function, I finally have all the tools I need to solve the problem I wanted to solve. In the next post, I will work around the constraints and create a proportion function that never fails.

Update. I have since learned that $r_2 =$ 3 * a ^ 2 works instead of the table (Montgomery, cited in Mayer 2016). It is slightly cheaper and requires no assembly, but it does not return zero when the inverse does not exist.

Remco Bloemen
Math & Engineering
https://2π.com