The Goldilocks Prime
In Ham15 Mike Hamburg introduced a class of primes of the form and named them Goldilocks primes. In those prime fields satisfies the Golden ratio relation . When is a power of two this allows for very efficient implementation. Unfortunately, the at the end means it will have very few power-of-two roots of unity. We'd like to have many so we can do big number theoretic transforms.
We will instead consider primes of the form . Following the plonky2 project we keep the name Goldilocks prime. Now satisfies a modified Golden ration relation . More importantly, the at the end guarantees we will have a -th root of unity.
The values of and where is prime are the sequences A055494 and A002383 respectively:
As mentioned, in the prime field we have . From this further useful relations on the powers of follow:
And the pattern repeats. This shows hat is a primitive 6th root of unity. These relations leads to a particularly efficient modular reduction algorithm. First we write the number in base :
We get rid of all the terms by folding them into , , using the relations on and .
To reduce this from three to two terms we use the relation on .
The reduced number is . All that remains is carry propagation and some conditional subtractions of the modulus. Note that the reduction requires no multiplication operations.
To multiply two numbers we can write them in base and make use of Karatsuba multiplication:
This requires three full multiplications in base and results in three terms. We reduce this to two terms using the final step of the modular reduction procedure.
Question. What do Barrett and Montgomery reduction look like?
A lucky binary base
To implement base operations efficiently we'd like it to be a power of two, i.e. . There are not many such primes, the only solutions with are , , and . The first to are too small to be practically useful which leaves us with . makes an especially useful prime or written in full
This is a great size. It is small enough to fit in a 64-bit number, but large enough to do a lot of useful math in. For example, if we have three 32-bit numbers we can compute without overflowing:
When computing on 32/64-bit computers we have some useful interactions with the binary rings:
The first identity gives us a very efficient way to implement bit-shifts. This in turn leads to an efficient inversion algorithm from the half-extended binary GCD:
def inverse(a):
(u, a) = (MODULUS, a)
(t0, t1) = (0, 1)
twos = trailing_zeros(v)
v >>= twos
twos += 96 # 2^96 = -1
while u != v:
u -= v
t0 += t1
count = trailing_zeros(u)
u >>= count
t1 <<= count
twos += count
if u < v:
(u, v) = (v, u)
(t0, t1) = (t1, t0)
twos += 96 # 2^96 = -1
return t0 << ((191 * twos) % 192) # 2^-1 = 2^191, 2^192 = 1
Multiplicative group
The multiplicative group has order and factors into a power of two and, interestingly, all five known Fermat primes.
The factor means we can do efficient NTTs of power-of-two sizes using the Cooley-Tukey algorithm. The Fermat primes are not just curious, they allow us to use Rader's FFT algorithm to turn the Fermat prime sized NTTs into power-of-two NTTs. With efficient NTTs for all prime factors we can use the Good-Thomans algorithm to efficiently combine these without requirng twiddle factors. Overall this gives decent NTT performance for any divisor of .
But what if we want a different size? The usual method is to pad the data until it is a power of two size. This is wasteful as can mean almost doubling the size. Instead we could go beyond powers of two and include the other small primes , things improve quite a bit

For example if we do a mixed radix, then in half of the cases it is 25% smaller than if we only use . If we also add then in a quarter of the cases it is an extra 16% smaller. These two seem worthwhile. Adding still gives a visible improvement, but is probably not worth the implementation complexity. And and do not contribute anything noticeable.
NTTs require lot's of multiplications by roots of unity. Goldilocks again helps us make this efficient. We already found that is a th root of unity, and multiplying by it is simply a left shift of 32 bits. But if is a th root, then itself must be a th root. Great! This means multiplying by th roots is just cheap bit shift. And since has many divisors we can now create a whole set of -th roots that can be implemented using bit shifts:
We can go even further and construct a th root using two bit shifts, but for that we need a trick. Schönhage observed that given an th root , the value is a square root of . Let's quickly check this
In Goldilocks is an th root and so is a square root of two. Furthermore since and is a th root, it follows that is a th root . If we don't want that factor of we can compute
This also shows that all even powers of have one bit set and all odd powers two.
This trick is useful, let's understand it better and see if we can take it further.
Schönhage's trick
To intuitively understand why this works, forget for the moment that we are working in a prime field and imagine the 8th roots of unity in the complex plane.
We can construct the sum by adding vectors. Pythagoras theorem shows it is . This procedure generalizes this to -roots with , where the angles between roots are . A bit of trigonometry shows the solutions
For certain values of there are simple exact values for , see a table of exact trigonometric values. We indeed get and a few more. Most look quite uninteresting, but these two I like
The last one is also known as the Golden ratio. It turns out these work in the Goldilocks field as well. Since we have . For I don't know a multiplication free construction.
Question. It is possible to geometrically construct using the geometric mean theorem. Is there a construction that leads to an efficient (multiplication free) implementation?
Adam P. Goucher explains that it is not possible to construct as a rational function of roots of unity.
Generators
So far we've cleverly constructed roots up to , but for our NTTs we need all the roots up to the th. The standard way to do this is to find a generator of the multiplicative group. A generator is a field element such that all non-zero numbers can be written as powers for some . There are many such numbers and the smallest generator in Goldilocks in . By Fermat's little theorem we have and thus is a primitive -th root of unity. Since this is the highest possible root of unity, we can derive all others from it using
It is convention to use the smallest generator, but if we compute with we get instead of . This happens because -th primitive roots are not unique and there are many choices that work. While this is still a power of two, , it makes things more complicated than they need to be and we get a more complicated formula for .
A priory there is no reason to pick one choice of roots over the other. But we do need our choices to be consistent, meaning we want
to hold where applicable. Using a generator full fills that requirement. We'd also like a choice where we have an easy formula for multiplication by in terms of bit shifts. The smallest generator that satisfies this is . But this one has , it takes the negative square root of two. The smallest generator that picks the positive root is .
Question. Are there larger values of for which is prime or is the largest?
Further references
- Gou21. Adam P. Goucher (2021). "An efficient prime for number-theoretic transforms".
- Po22. Thomas Pornin (2022).
https://github.com/mir-protocol/plonky2/issues/1#issuecomment-809323437