The Goldilocks Prime

In Ham15 Mike Hamburg introduced a class of primes of the form $p = φ^2 - φ - 1$ and named them Goldilocks primes. In those prime fields $φ$ satisfies the Golden ratio relation $φ² = φ + 1$. When $φ$ is a power of two this allows for very efficient implementation. Unfortunately, the $-1$ at the end means it will have very few power-of-two roots of unity. We'd like to have many so we can do big number theoretic transforms.

We will instead consider primes of the form $p = φ^2 - φ + 1$. Following the plonky2 project we keep the name Goldilocks prime. Now $φ$ satisfies a modified Golden ration relation $φ² = φ - 1$. More importantly, the $+1$ at the end guarantees we will have a $φ$-th root of unity.

The values of $φ$ and $p$ where $p$ is prime are the sequences A055494 and A002383 respectively:

$$ \begin{aligned} φ &= 2, 3, 4, 6, 7, 9, 13, 15, 16, 18, 21, 22, 25, 28, …, \\ p &= 3, 7, 13, 31, 43, 73, 157, 211, 241, 307, 421, … \end{aligned} $$

As mentioned, in the prime field $\F_p$ we have $φ^2 = φ - 1$. From this further useful relations on the powers of $φ$ follow:

$$ \begin{aligned} φ^3 &= - 1 &\hspace{2em} φ^4 &= - φ &\hspace{2em} φ^5 &= 1 - φ &\hspace{2em} φ^6 &= 1 \end{aligned} $$

And the pattern repeats. This shows hat $φ$ is a primitive 6th root of unity. These relations leads to a particularly efficient modular reduction algorithm. First we write the number $n$ in base $φ$:

$$ n = n_0 + n_1 ⋅ φ + n_2 ⋅ φ^2 + n_3 ⋅ φ^3 + n_4 ⋅ φ^4 + ⋯ $$

We get rid of all the terms $n_{>3}$ by folding them into $n_0$, $n_1$, $n_2$ using the relations on $φ^3$ and $φ^6$.

$$ \begin{aligned} n_0' &= n_0 - n_3 + n_6 - n_{9\phantom{0}} + ⋯ \\ n_1' &= n_1 - n_4 + n_7 - n_{10} + ⋯ \\ n_2' &= n_2 - n_5 + n_8 - n_{11} + ⋯ \\ \end{aligned} $$

To reduce this from three to two terms we use the relation on $φ^2$.

$$ \begin{aligned} n_0' = n_0 - n_2 \\ n_1' = n_1 + n_2 \end{aligned} $$

The reduced number is $n_0 + n_1 ⋅ φ$. All that remains is carry propagation and some conditional subtractions of the modulus. Note that the reduction requires no multiplication operations.

To multiply two numbers $a,b ∈ \F_p$ we can write them in base $φ$ and make use of Karatsuba multiplication:

$$ \begin{aligned} a⋅b &= (a_0 + a_1 ⋅ φ) ⋅ (b_0 + b_1 ⋅ φ) \\ &= a_0 ⋅ b_0 + (a_0⋅b_1 + a_1⋅b_0) ⋅ φ + a_1⋅b_1 ⋅ φ^2 \\ &= a_0 ⋅ b_0 - a_1⋅b_1 + (a_0⋅b_1 + a_1⋅b_0 + a_1⋅b_1) ⋅ φ \\ &= a_0 ⋅ b_0 - a_1⋅b_1 + ((a_0 + a_1) ⋅ (b_0 + b_1) - a_0⋅b_0) ⋅ φ \\ \end{aligned} $$

This requires three full multiplications in base $φ$ and results in three terms. We reduce this to two terms using the final step of the modular reduction procedure.

Question. What do Barrett and Montgomery reduction look like?

A lucky binary base

To implement base $φ$ operations efficiently we'd like it to be a power of two, i.e. $φ = 2^k$. There are not many such primes, the only solutions with $k < 4000$ are $k = 1$, $4$, and $32$. The first to are too small to be practically useful which leaves us with $32$. $φ = 2^{32}$ makes an especially useful prime $p = 2^{64} - 2^{32} + 1$ or written in full

$$ p = 18446744069414584321 = \mathtt{0x\,ffff\,ffff\,0000\,0001} $$

This is a great size. It is small enough to fit in a 64-bit number, but large enough to do a lot of useful math in. For example, if we have three 32-bit numbers $a,b,c ∈ [0, 2^{32}-1]$ we can compute $a ⋅ b + c$ without overflowing:

$$ a⋅b + c \enspace ≤ \enspace (2^{32} - 1)^2 + (2^{32} - 1) \enspace=\enspace 2^{64} - 2^{32} \enspace<\enspace p $$

When computing on 32/64-bit computers we have some useful interactions with the binary rings:

$$ \begin{aligned} \mod{2^{n⋅32}}_p &= \mod{φ^n}_p &\quad \mod{2^{64}}_p &= \mod{-1}_{2^{32}} &\quad \mod{-p}_{2^{64}} &= \mod{-1}_{2^{32}} &\quad \mod{φ}_{2^{32}} &= 0 & \end{aligned} $$

The first identity gives us a very efficient way to implement bit-shifts. This in turn leads to an efficient inversion algorithm from the half-extended binary GCD:

def inverse(a):
    (u, a) = (MODULUS, a)
    (t0, t1) = (0, 1)
    twos = trailing_zeros(v)
    v >>= twos
    twos += 96 # 2^96 = -1
    while u != v:
        u -= v
        t0 += t1
        count = trailing_zeros(u)
        u >>= count
        t1 <<= count
        twos += count
        if u < v:
            (u, v) = (v, u)
            (t0, t1) = (t1, t0)
            twos += 96 # 2^96 = -1
    return t0 << ((191 * twos) % 192) # 2^-1 = 2^191, 2^192 = 1

Multiplicative group

The multiplicative group has order $p - 1$ and factors into a power of two and, interestingly, all five known Fermat primes.

$$ p - 1 = 2^{32} ⋅ 3 ⋅ 5 ⋅ 17 ⋅ 257 ⋅ 65537 $$

The factor $2^{32}$ means we can do efficient NTTs of power-of-two sizes. But what if we want a different size? The usual method is to pad the data until it is a power of two size. This is wasteful as can mean almost doubling the size. Instead we could go beyond powers of two and include the other small primes $3$, $5$ things improve quite a bit

For example if we do a $2^{32} ⋅ 3$ mixed radix, then in half of the cases it is 25% smaller than if we only use $2^{32}$. If we also add $5$ then in a quarter of the cases it is an extra 16% smaller. These two seem worthwhile. Adding $17$ still gives a visible improvement, but is probably not worth the implementation complexity. And $257$ and $65537$ do not contribute anything noticeable.

NTTs require lot's of multiplications by roots of unity. Goldilocks again helps us make this efficient. We already found that $φ=2^{32}$ is a $6$th root of unity, and multiplying by it is simply a left shift of 32 bits. But if $2^{32}$ is a $6$th root, then $2$ itself must be a $6 ⋅ 32 = 192$th root. Great! This means multiplying by $192$th roots is just cheap bit shift. And since $192 = 2^6 ⋅ 3$ has many divisors we can now create a whole set of $n$-th roots $ω_n$ that can be implemented using bit shifts:

$$ \begin{aligned} ω_2 &= 2^{96} & ω_{6\phantom{0}} &= 2^{32} & ω_{16} &= 2^{12} & ω_{48} &= 2^{4} & ω_{192} &= 2\\ ω_3 &= 2^{64} & ω_{8\phantom{0}} &= 2^{24} & ω_{24} &= 2^{8} & ω_{64} &= 2^{3} \\ ω_4 &= 2^{48} & ω_{12} &= 2^{16} & ω_{32} &= 2^{6} & ω_{96} &= 2^{2} \\ \end{aligned} $$

We can go even further and construct a $384$th root using two bit shifts, but for that we need a trick. Schönhage observed that given an $8$th root $ω_8$, the value $ω_8 + ω_8^7$ is a square root of $2$. Let's quickly check this

$$ (ω_8 + ω_8^7)^2 = ω_8^2 + ω_8^{7⋅2} + 2 ⋅ ω_8^{1+7} = ω_8^2 - ω_8^2 + 2 = 2 $$

In Goldilocks $2^{24}$ is an $8$th root and $2^{96} = -1$ so $n = 2^{24} - 2^{72}$ is a square root of two. Furthermore since $n^2 = 2$ and $2$ is a $192$th root, it follows that $n$ is a $384$th root $ω_{384}$. If we don't want that factor of $3$ we can compute

$$ ω_{128} = ω_{384}^3 = ω_{384}^2 ⋅ ω_{384} = 2⋅(2^{24} - 2^{72}) = 2^{25} - 2^{73} $$

This also shows that all even powers of $ω_{384}$ have one bit set and all odd powers two.

$$ ω_{384}^i = \begin{cases} 2^{k} & i = 2⋅k \\ 2^{k + 24} - 2^{k + 73} & i = 2⋅k + 1 \\ \end{cases} $$

This $\sqrt{2}$ trick is useful, let's understand it better and see if we can take it further.

Schönhage's trick

To intuitively understand why this works, forget for the moment that we are working in a prime field and imagine the 8th roots of unity in the complex plane.

We can construct the sum $ω_8 + ω_8^7$ by adding vectors. Pythagoras theorem shows it is $\sqrt{2}$. This procedure generalizes this to $n$-roots with $ω_n + ω_n^{n-1}$, where the angles between roots are $\frac{2⋅π}{n}$. A bit of trigonometry shows the solutions

$$ ω_n + ω_n^{n-1} = 2 ⋅ \cos \frac{2⋅π}{n} $$

For certain values of $n$ there are simple exact values for $\cos$, see a table of exact trigonometric values. We indeed get $\sqrt{2}$ and a few more. Most look quite uninteresting, but these two I like

$$ \begin{aligned} % ω_{8\phantom{0}} + ω_{8\phantom{0}}^7 &= \sqrt{2} \\ ω_{12} + ω_{12}^{11} &= \sqrt{3} &\hspace{2em} ω_{10} + ω_{10}^9 &= \frac{1 + \sqrt{5}}{2} \end{aligned} $$

The last one is also known as the Golden ratio. It turns out these work in the Goldilocks field as well. Since $ω_{12} = 2^{16}$ we have $\sqrt{3} = 2^{16} - 2^{80}$. For $ω_{10}$ I don't know a multiplication free construction.

Question. It is possible to geometrically construct $\sqrt[4]{2}$ using the geometric mean theorem. Is there a construction that leads to an efficient (multiplication free) implementation?

Adam P. Goucher explains that it is not possible to construct $\sqrt[4]{2}$ as a rational function of roots of unity.

Generators

So far we've cleverly constructed roots up to $ω_{384}$, but for our NTTs we need all the roots up to the $2^{32}$th. The standard way to do this is to find a generator of the multiplicative group. A generator is a field element $g$ such that all non-zero numbers can be written as powers $g^i$ for some $i$. There are many such numbers and the smallest generator in Goldilocks in $7$. By Fermat's little theorem we have $g^{p - 1} = 1$ and thus $g$ is a primitive $(p-1)$-th root of unity. Since this is the highest possible root of unity, we can derive all others from it using

$$ ω_n = g^{\frac{p - 1}{n}} $$

It is convention to use the smallest generator, but if we compute $ω_{192}$ with $g=7$ we get $35184372080640$ instead of $2$. This happens because $n$-th primitive roots are not unique and there are many choices that work. While this is still a power of two, $ω_{192} = 2^{77}$, it makes things more complicated than they need to be and we get a more complicated formula for $ω_{384}^i$.

A priory there is no reason to pick one choice of roots over the other. But we do need our choices to be consistent, meaning we want

$$ ω_n = ω_{k⋅n}^k $$

to hold where applicable. Using a generator full fills that requirement. We'd also like a choice where we have an easy formula for multiplication by $ω_{384}^i$ in terms of bit shifts. The smallest generator that satisfies this is $554$. But this one has $ω_{384} = -(ω_8 + ω_8^7)$, it takes the negative square root of two. The smallest generator that picks the positive root is $2717$.

Question. Are there larger values of $k$ for which $2^{2⋅k} - 2^k + 1$ is prime or is $32$ the largest?

Further references

Remco Bloemen
Math & Engineering
https://2π.com