Number Theoretic Transforms
$$ \gdef\delim#1#2#3{\mathopen{}\mathclose{\left#1 #2 \right#3}} \gdef\p#1{\delim({#1})} \gdef\F{\mathbb F} \gdef\i{\mathrm i} \gdef\mod#1{\delim[{#1}]} \gdef\M{\mathsf{\small M}} \gdef\A{\mathsf{\small A}} \gdef\Mc{\mathsf{\small M^c}} \gdef\Ac{\mathsf{\small A^c}} \gdef\NTT{\mathsf{\small NTT}} \gdef\CCV{\mathsf{\small CCV}} $$
NTTs, FFTs, convolutions, multiplication, interpolation, Chinese remainder theorem, Lagrange basis, Representation theory, FRI. It's all the same thing.
To do. Prime length FRI using Rader?
A lot of research on them has been done in DSP. But there the more sophisticated methods are not as popular because
 Exchanging muls for adds is not very worthwhile.
 Numerical accuracy is challenging.
Neither of these holds in finite fields.
Introduction
$$ \gdef\ω{\text{ω}} \gdef\MF{\mathrm{F}} $$
The number theoretic transform (NTT) of a sequence $x_i$ of length $n$ over a field $\F$ is defined as a sequence $f_k$ of length $n$ by
$$ \begin{align} f_k = \sum_{i∈[0,n)} x_i ⋅ \ω_n^{i⋅k} \end{align} $$
where $\ω_n$ is a primitive $n$th root of unity. When $\F$ is the complex numbers the roots are $\ω_n = \mathrm{e}^{\frac{2⋅\mathrm{π}⋅\mathrm{i}}{n}}$ and the transform is called the discrete Fourier transform (DFT) and its implementations fast Fourier transform (FFT). In a prime order finite field $\F_p$ with generator $g$ the roots are $\ω_n = g^{\frac{p1}{n}}$. Note that the roots are not unique and $\ω_n^k$ is also a valid $n$th root of unity whenever $k$ and $n$ a coprime.
The roots are just numbers and satisfy all the algebraic properties such that $\ω_n^0 = 1$ and $\ω_n^{a + b} = \ω_n^a⋅\ω_n^b$. In addition they have some unique properties such as $\ω_n^n = 1$, $\ω_{k⋅n}^k = \ω_n$ and $\sum_{i∈[0,n)} \ω_n^i = 0$.
$$ \begin{align} \sum_{i∈[0,n)} \ω_n^{a⋅i} = \begin{cases} n & a = 0 \\ 0 & a ≠ 0 \end{cases} \end{align} $$
From these it follows that $\ω_2 = 1$, which is why the $n=2$ case is particularly efficient:
$$ \begin{aligned} f_0 &= x_0 + x_1 & f_1 &= x_0  x_1 \end{aligned} $$
NTT is Almost it's own Inverse
The inverse of the NTT is
$$ x_i = \frac 1n \sum_{k∈[0,n)} f_k ⋅ \ω_n^{i⋅k} $$
Proof
$$ \begin{aligned} \frac 1n \sum_{k∈[0,n)} f_k ⋅ \ω_n^{i⋅k} &= \frac 1n \sum_{k∈[0,n)} \p{\sum_{j∈[0,n)} x_j ⋅ \ω_n^{j⋅k}} ⋅ \ω_n^{i⋅k} \\ &= \frac 1n \sum_{k∈[0,n)} \sum_{j∈[0,n)} x_j ⋅ \ω_n^{j⋅ki⋅k} \\ &= \frac 1n \sum_{j∈[0,n)} x_j \sum_{k∈[0,n)} \ω_n^{\p{ji}⋅k} \\ &= \frac 1n \sum_{j∈[0,n)} x_j \p{\begin{cases} n & i = j \\ 0 & i ≠ j \end{cases}} \\ &= \frac 1n x_i ⋅ n \\ &= x_i \\ \end{aligned} $$
The factor of $\frac 1n$ is constant can be applied to the output (as in the sum above) or brought inside the sum and applied to the input. It can even be applied together with the $\ω$ factors to save operations, but this requires a bit more care to correctly propagate through the various algorithms presented below.
The other difference with a regular $\NTT$ is the factor of $1$ in the exponent of $\ω$. Since for each $\ω_n' = \ω_n^{1}$ is also a primitive $n$th root of unity we can use the regular algorithm with a different root. Instead of changing the root, we can also substitute $i' = \mod{i}_n$. This has the effect of reversing the order of all but the $0$th index element in the output sequence. By symmetry with $k$ this can also be done on the input elements.
NTT is Matrix Multiplication
This equation can also be seen as a multiplying the vector $\vec x$ by the DFT matrix $\MF_n$ where $(\MF_n)_{ij} = \ω_n^{i⋅j}$, i.e.
$$ \begin{bmatrix} f_0 \\ f_1 \\ f_2 \\ ⋮ \\ f_{n1} \end{bmatrix} = \begin{bmatrix} ω^0 & ω^0 & ω^0 & ⋯ & w^0 \\ ω^0 & ω^1 & ω^2 & ⋯ & w^{n 1}\\ ω^0 & ω^2 & ω^4 & ⋯ & w^{2⋅(n1)}\\ ⋮ & ⋮ & ⋮ & ⋱ & ⋮ \\ ω^0 & ω^{n1} & ω^{2⋅(n1)} & ⋯ & ω^{(n1)^2} \\ \end{bmatrix}⋅ \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ ⋮ \\ x_{n1} \end{bmatrix} $$
The matrix is symmetric, $\MF_n^\mathsf{T}=\MF_n$, and has inverse $(\MF_n^{1})_{ij} = \frac 1n ⋅ \ω_n^{i⋅j}$. It is also a Vandermonde matrix on the basis $(\ω_n^0,…, \ω_n^{n1})$.
NTT is Polynomial Evaluation
We can also interpret the NTT equation as evaluating a polynomial $P∈\F[X]$ with coefficients $x_i$ in points $\ω_n^k$ as
$$ \begin{aligned} P(X) &= \sum_{i∈[0,n)} x_{i} ⋅ X^i &\text{and}&& f_k = P\p{\ω_n^k} \end{aligned} $$
NTT is Polynomial Interpolation
If we interpret $x_i$ as a set of values of some polynomial $P∈\F[X]$ on the basis $(\ω_n^0,…, \ω_n^{n1})$. Then we can construct the polynomial using Lagrange interpolation.
$$ P(X) = \sum_{i∈[0,n)} x_i ⋅ L_{n,i}(X) $$
where the Lagrange polynomials $L_{n,i}(X)$ are
$$ L_{n,i}(X) = \prod_{j∈[0,n)\setminus i} \frac{X  \ω_n^j}{\ω_n^i  \ω_n^j} $$
NTT is Chinese Remainder Theorem
Another way of evaluating a polynomial at $x$ is by reducing it modulo the monomial $(X  x)$, which leads to
$$ \begin{aligned} f_k = \mod{P}_{\p{X  \ω_n^k}} \end{aligned} $$
The product of all these moduli is a Cyclotomic polynomial
$$ \prod_{k∈[0,n)} \p{X  \ω_n^k} = X^n  1 $$
If we know two factors $n = a⋅b$ then
$$ X^n  1 = \p{X^a  1}⋅\p{X^b  1} $$
$$ X^{a + b}  X^a  X^b + 1 $$
NTT is a ring isomorphism
The vector space $\F^n$ forms a ring under vector addition and Hadamard product. The $\NTT$ is a ring isomorphism between $\F^n$ and the ring $\F[X]/\p{X^n  1}$. The fastest known algorithms for multiplying in $\F[X]$ work by finding a sufficiently large $n$ to lower the problem to $\F[X]/\p{X^n  1}$ and then use $\NTT$ to transform to $\F^n$, apply the Hadamard product, and transform back.
NTT Diagonalized shiftinvariant operators
Given $S_n$ the $n×n$ shift matrix. For a (not necessarily primitive) $n$ throot of unity $ω$ the vector $\begin{bmatrix} ω^0, ω^1, ω^1, …, ω^{n1}\end{bmatrix}$ is an eigenvector of $S_n$ with eigenvalue $ω$.
$$ ω ⋅ \begin{bmatrix} ω^0 \\ ω^1 \\ ω^1 \\ ⋮ \\ ω^{n1}\end{bmatrix} = \begin{bmatrix} ω^1 \\ ω^2 \\ ⋮ \\ ω^{n1} \\ ω^0\end{bmatrix} = S_n ⋅ \begin{bmatrix} ω^0 \\ ω^1 \\ ω^1 \\ ⋮ \\ ω^{n1}\end{bmatrix} $$
Matrices that commute have the same eigenbasis (up to eigenspaces). So the NTT matrix is an eigenbasis for the shift matrix and any matrix that commutes with it. All matrices diagonalized by the FFT are the ones that commute with the shift operator.
When talking about time series, shiftinvariance means time invariance.
See https://news.ycombinator.com/item?id=37915044
The companion matrix of $X^n 1$ is $S_n$.
Question: What do companion matrices of cyclic polynomials look like?
https://en.wikipedia.org/wiki/Triangular_matrix#Simultaneous_triangularisability
https://en.wikipedia.org/wiki/Commuting_matrices
Circulant matrices form a commutative ring, as they are closed under summation. So to the diagonal matrices.
https://en.wikipedia.org/wiki/Hilbert%27s_Nullstellensatz
If we know two factors $n = 2 ⋅ m$ then
$$ X^n  1 = \p{X^m  1} ⋅ \p{X^m + 1} $$
$$ \p{X^m  1} ⋅ \p{X^m + 1} = X^{2⋅m}  1 $$
Using the Chinese remainder theorem we can reconstruct $P$ from these remainders
$$ \begin{aligned} M &= \prod_{i∈[0,n)} m_i & M_i &= \frac{M}{m_i} & \mod{x}_M = \mod{ \sum_{i∈[0,n)} \frac{M}{m_i} ⋅ \mod{x⋅\frac{m_i}{M} }_{m_i} }_M \end{aligned} $$
$$ \begin{aligned} \mod{P}_{\p{X^n 1}} = \mod{ \sum_{i∈[0,n)} \frac{\p{X^n 1}}{\p{X  \ω_n^i}} ⋅ \mod{P⋅\frac{\p{X  \ω_n^i}}{\p{X^n 1}} }_{\p{X  \ω_n^i}} }_{\p{X^n 1}} \end{aligned} $$
$$ \begin{aligned} P = \sum_{k∈[0,n)} f_k⋅\prod_{k∈[0,n)} \p{X  \ω_n^k} \,\operatorname{mod}\, \prod_{k∈[0,n)} \p{X  \ω_n^k} \end{aligned} $$
The CooleyTukey algorithm
If $n = a ⋅ b$ we can rewrite $i = i_a + i_b ⋅ a$ with $i_a∈[0,a)$ and $i_b∈[0,b)$. This creates a correspondence $[0,n) ≃ [0,a)×[0,b)$. An alternative correspondence is $k = k_b + k_a ⋅ b$. If we substitute these in the definition of NTT we get:
$$ f_{k_b + k_a ⋅ b} = \sum_{i_a∈[0,a)}\sum_{i_b∈[0,b)} x_{i_a + i_b ⋅ a} ⋅ \ω_n^{\p{i_a + i_b ⋅ a}⋅\p{k_b + k_a ⋅ b}} $$
Expanding the exponent of $\ω_n$ we get
$$ \ω_n^{\p{i_a + i_b ⋅ a}⋅\p{k_b + k_a ⋅ b}} = \ω_n^{i_a ⋅ k_b} ⋅ \ω_n^{i_b ⋅ k_b ⋅ a} ⋅ \ω_n^{i_a ⋅ k_a ⋅ b} ⋅ \ω_n^{i_b ⋅ k_a ⋅ a ⋅ b} $$
Using $\ω_{l⋅m}^l = \ω_m$ we get $\ω_n^a = \ω_b$ and $\ω_n^b = \ω_a$, combined with $\ω_n^{a⋅b} = \ω_n^n = 1$ this simplifies the exponent to
$$ \ω_n^{\p{i_a + i_b ⋅ a}⋅\p{k_b + k_a ⋅ b}} = \ω_b^{i_b ⋅ k_b} ⋅ \ω_n^{i_a ⋅ k_b} ⋅ \ω_a^{i_a ⋅ k_a} $$
Substituting this back in
$$ f_{k_b + k_a ⋅ b} = \sum_{i_a∈[0,a)}\sum_{i_b∈[0,b)} x_{i_a + i_b ⋅ a} ⋅ \ω_b^{i_b ⋅ k_b} ⋅ \ω_n^{i_a ⋅ k_b} ⋅ \ω_a^{i_a ⋅ k_a} $$
This allows us to introduce some brackets
$$ f_{k_b + k_a ⋅ b} = \sum_{i_a∈[0,a)} \delim[{ \p{\sum_{i_b∈[0,b)} x_{i_a + i_b ⋅ a} ⋅ \ω_b^{i_b ⋅ k_b}} ⋅ \ω_n^{i_a ⋅ k_b} }]⋅\ω_a^{i_a ⋅ k_a} $$
We can now see that the sum in parenthesis is an NTT of size $b$ operating over subsequences of $x$. This then gets multiplied by an extra factor $\ω_n^{i_a ⋅ k_b}$, known as a twiddle factor. The outer sum over the square brackets is again an NTT, this time of size $a$. This is the core of the CooleyTukey algorithm.
It's instructive to visualize $x$ as a $b×a$sized matrix in rowmajor order:
$$ x = \begin{bmatrix} x_0 & x_1 & x_2 & ⋯ & x_{a1} \\ x_a & x_{1+a} & x_{2+a} & ⋯ & x_{(a1) + a} \\ x_{2⋅a} & x_{1 + 2⋅a} & x_{2 + 2⋅a} & ⋯ & x_{(a1) + 2⋅ a} \\ ⋮ &⋮ &⋮ & ⋱ & ⋮ \\ x_{(b1)⋅a} & x_{1 + (b1)⋅a} & x_{3 + (b1)⋅a} & ⋯ & x_{(a1) + (b1)⋅a} \end{bmatrix} $$
We then transform all the columns in this matrix using a $b$sized NTT. To apply the twiddle factors, these can also be written as a $b × a$ matrix with $W_{ij} = \ω_n^{i⋅j}$ which can then be applied by a pointwise product
$$ W = \begin{bmatrix} 1 & 1 & 1 & ⋯ & 1 \\ 1 & \ω & \ω^2 & ⋯ & \ω^{a1} \\ 1 & \ω^2 & \ω^4 & ⋯ & \ω^{\p{a1}⋅2} \\ ⋮ &⋮ &⋮ & ⋱ & ⋮ \\ 1 & \ω^{b1} & \ω^{2⋅\p{b1}} & ⋯ & \ω^{\p{a1}⋅\p{b 1}} \\ \end{bmatrix} $$
The next step is to transform all the rows using a $a$sized NTT. All the values are now computed, but they are not yet in the correct order. The output order, $f_{k_b + k_a ⋅ b}$, corresponds to a $a×b$ sized matrix in rowmajor order. Fortunately it is just the transpose of the matrix we have computed.
In summary the steps are
 Reinterpreted as a $b×a$ matrix in rowmajor order.
 Apply $b$sized NTTs to all the columns in $x$
 Multiply each element by its twiddle factor $\ω_n^{i⋅j}$.
 Apply $a$sized NTTs to all the rows.
 Transpose the matrix to have the output in rowmajor order.
The first and fifth step are commonly omitted and the algorithm is known as the fourstep NTT or Bailey's FFT. It has the implementation drawback that the NTTs in the second step are over values discontinuous in memory. We can solve this by doing two more transpositions after steps 1 and 2 to get the sixstep NTT
If the factors $a$ and $b$ of $n$ are again composite the CooleyTukey algorithm can be applied recursively. A common strategy is to pick a highly composite $n$ such as $n = 2^k$ and recursively split it. There are several strategies to do this. If we make $a$ as small as possible, it is called decimationintime (DIT). If instead we make $b$ as small as possible it is decimationinfrequency (DIF). In both cases the small base case is known as the radix. It is also common to refer to NTT algorithms by their radix as in 'radix4 NTT'. Highly optimized small NTTs to be used as base case are known as codelets. These are the subject of ongoing research but this is mostly focussed on the complex numbers. If we make $a$ and $b$ about the same size it is called a radix$\sqrt{n}$. The sixstep radix$\sqrt{n}$ algorithm is a cacheoblivious algorithm when combined with an cacheoblivious matrix transpose. This makes it particularly suitable for very large $n$. Finally for complex numbers an variant called splitradix recurses into a size $\frac n2$ and two $\frac n4$ NTTs.
To do. Bower's FFT: https://github.com/Plonky3/Plonky3/blob/c58f6aa11f6f578a7b9a9a596f51958f9cb8332f/dft/src/radix_2_bowers.rs#L18
The GoodThomas algorithm
Starting again with a factorization $n = a⋅b$, let's look at all the bijections between $[0,n)$ and $[0,a)×[0,b)$ of the form
$$ \begin{aligned} i &= \mod{i_a ⋅ e_a + i_b ⋅ e_b}_n \\ k &= \mod{k_a ⋅ e_a' + k_b ⋅ e_b'}_n \\ \end{aligned} $$
where $e$'s are parameters and $\mod{}_n$ denotes reduction modulo $n$. We can substitute this into the $\ω_n$ exponent again. Note that modular reduction is already implied by the root of unity.
$$ \ω_n^{\p{i_a ⋅ e_a + i_b ⋅ e_b}⋅\p{k_a ⋅ e_a' + k_b ⋅ e_b'}} = \ω_n^{i_a ⋅ k_a ⋅ e_a ⋅ e_a'} ⋅ \ω_n^{i_b ⋅ k_b ⋅ e_b ⋅ e_b'} \ω_n^{i_a ⋅ k_b ⋅ e_a ⋅ e_b'} ⋅ \ω_n^{i_b ⋅ k_a ⋅ e_b ⋅ e_a'} ⋅ $$
In CooleyTukey we used the solutions $e_a = 1$, $e_b = a$, $e_a' = b$, $e_b' = 1$. This turned the first two roots into $\ω_a$ and $\ω_b$ and got rid of the $i_b⋅k_a$ term. We where left with a $i_a⋅k_b$ term that turned into twiddle factors. Can we get rid of this term? The necessary conditions for this are
$$ \begin{aligned} \mod{e_b ⋅ e_a'}_n &= 0 &\hspace{2em} \mod{e_a ⋅ e_b'}_n &= 0 \\ \mod{e_a ⋅ e_a'}_n &= b &\hspace{2em} \mod{e_b ⋅ e_b'}_n &= a \end{aligned} $$
We also need some conditions to make sure the map is a bijection.
In Bur77 it is shown that the map is a bijection iff
 $\gcd(a, b) = 1$ and
 $\gcd(e_a, a) = \gcd(e_b, b) = 1$ and at least one of
 $e_a = α ⋅ a$
 $e_b = β ⋅ b$
 $\gcd(e_a, a) = \gcd(e_b, b) = 1$ and at least one of
 $\gcd(a, b) > 1$ and either
 $e_a = α ⋅ a$ and $e_a ≠ β ⋅ b$ and $\gcd(α, a) = \gcd(e_b, b) = 1$
 $e_a ≠ α ⋅ a$ and $e_a = β ⋅ b$ and $\gcd(e_a, a) = \gcd(β, b) = 1$
The simplest solution to this set of constraints is
$$ \begin{aligned} e_a &= b &\hspace{2em} a_b &= a \\ e_a' &= \mod{b^{1}}_a⋅b &\hspace{2em} e_b' &= \mod{a^{1}}_b⋅a \end{aligned} $$
The inverses only exist if $\gcd(a,b)=1$, i.e. we have a factorization of $n$ into coprime factors. It is shown in Bur77 and others that all solutions require a coprime factorization. Proceeding as before we get
$$ f_{\mod{k_a ⋅ e_a' + k_b ⋅ e_b'}_n} = \sum_{i_a∈[0,a)} \p{\sum_{i_b∈[0,b)} x_{\mod{i_a ⋅ e_a + i_b ⋅ e_b}_n} ⋅ \ω_b^{i_b ⋅ k_b}} ⋅ \ω_a^{i_a ⋅ k_a} $$
We have a clean separation into two NTTs of size $a$ and $b$ without twiddle factors. The order of the two NTTs does not matter. This is the GoodThomas or Prime Factor NTT algorithm. It can also be visualized as a matrix, but this time the input and output mapping is no longer a simple reinterpretation in rowmajor order but a more complex permutation.
Our optimal strategy is now to first use GoodThomas recursively to break down $n$ into coprime factors. These factors of the form $p^n$ for some prime $p$ can be further reduced using CooleyTukey which leaves us with primesized NTTs as base cases.
Bluestein algorithm
Start again with the NTT formula
$$ \begin{aligned} f_k = \sum_{i∈[0,n)} x_i ⋅ \ω_n^{i⋅k} \end{aligned} $$
and substitute the identity
$$ i⋅k = \frac{1}{2}⋅\p{i^2 + k^2  (i  k)^2} $$
in the exponent to get
$$ \begin{aligned} \ω_n^{i⋅k} &= \ω_n^{\frac{1}{2}⋅\p{i^2 + k^2  (i  k)^2}} \\ &= \ω_{2⋅n}^{i^2 + k^2  (i  k)^2} \\ &= \ω_{2⋅n}^{i^2}⋅\ω_{2⋅n}^{k^2}⋅\ω_{2⋅n}^{(i  k)^2} \\ \end{aligned} $$
We use the fact that $\ω_n^{\frac 12} = \ω_{2⋅n}$ if $\ω_{2⋅n}$ exists. Substituting this back in the sum and moving factors around we get
$$ \begin{aligned} f_k &= \ω_{2⋅n}^{k^2}⋅\sum_{i∈[0,n)} \p{x_i ⋅ \ω_{2⋅n}^{i^2}} ⋅ \ω_{2⋅n}^{(i  k)^2} \end{aligned} $$
This formula has the shape
$$ \begin{aligned} f_k &= a_k ⋅ \sum_{i∈[0,n)} b_i ⋅ c_{ik} \end{aligned} $$
Which is a convolution between $b$ and $c$ followed by pointwise multiplication by $a$.
$$ \begin{aligned} a_i &= \ω_{2⋅n}^{i^2} &\hspace{2em} b_i &= x_i ⋅ \ω_{2⋅n}^{i^2} &\hspace{2em} c_i &= \ω_{2⋅n}^{i^2} & \end{aligned} $$
We have
$$ c_{i+n} = \ω_{2⋅n}^{\p{i+n}^2} = \ω_{2⋅n}^{i^2n^2  2⋅i⋅n} = \ω_{2⋅n}^{i^2}⋅\ω_{2⋅n}^{n^2}⋅\ω_{2⋅n}^{2⋅i⋅n} = c_i ⋅ \ω_{2}^{n} = \p{1}^n ⋅ a_i $$
If $n$ is even, the sequence $c$ is cyclic, if $n$ is odd it is negacyclic.
TODO. The sign can be absorbed in $\ω_{2⋅n}$. See HH19 p 13.
This is Bluestein's algorithm
 Pointwise multiply by $\ω_{2⋅n}^{i^2}$.
 Convolve with $\ω_{2⋅n}^{i^2}$.
 Pointwise multiply by $\ω_{2⋅n}^{i^2}$.
Q. Can we use a field extension to add $\ω_{2⋅n}$ if it does not exist? i.e. work in $\F[X]/(X^2\ω_n)$. Then substitute $X$ for $\ω_{2⋅n}$.
$$ \begin{aligned} a_i &= X^{i^2} \end{aligned} $$
$X^0, X^1, X^4, X^9, X^{16}$.
Rader's NTT
If $n$ is prime then the nonzero numbers $[1,n)$ form a group under multiplication modulo $n$. Pick a generator $g$ for this group, then $g^{1}$ is also a generator. We can now rewrite the nonzero indices as
$$ \begin{aligned} i &= \mod{g^{i'}}_n &\hspace{2em} k = \mod{g^{k'}}_n \end{aligned} $$
To use this we first split the zero indices out of the NTT equation
$$ \begin{aligned} f_0 &= \sum_{i∈[0,n)} x_i &\hspace{2em} f_i &= x_0 + \sum_{i∈[1,n)} x_i ⋅ \ω_n^{i⋅k} \end{aligned} $$
and substitute our index map to get
$$ \begin{aligned} f_{\mod{g^{k'}}_n} = x_0 + \sum_{i'∈[1,n)} x_{\mod{g^{i'}}_n} ⋅ \ω_n^{g^{i'k'}} \end{aligned} $$
The sum has the form $\sum_{i∈[0,n1)} a_{i} ⋅ b_{ik}$ which is a convolution of size $n1$. The $b$ term is cyclic, so this is a cyclic convolution. This method for computing NTTs is called Rader's algorithm. Much like GoodThomas, it uses primality and clever permutation to avoid twiddle factors.
RaderWinograd NTT
If $n = p^m$ with odd prime $p$ then the elements not a multiple of $p$, i.e. $\mod{i}_p>0$, form a group of size $p^m p^{m1}$ elements for which we can find a generator $g$.
TODO
https://encyclopediaofmath.org/wiki/Winograd_Fourier_transform_algorithm
Binary RaderWinograd NTT
If $n = 2^m$ then the elements not a multiple of $p$, i.e. $\mod{i}_p>0$, form a group of size $p^m p^{m1}$ elements for which we can find a generator $g$.
Cyclotomic Fast Fourier transforms
https://en.wikipedia.org/wiki/Cyclotomic_fast_Fourier_transform
Bruun's NTT
https://en.wikipedia.org/wiki/Bruun%27s_FFT_algorithm
Johnson–Burrus fast Fourier transform
See [Bla10] 12.5. Generalization over GoodThomas and Winograd nesting.
Summary of $\NTT_n$ algorithms
All presented methods require the $\ω_n$ to exist and turn the $\NTT_n$ into a simpler subproblem. Here $\NTT$ is a number theoretic transform, $\CCV$ is a cyclic convolution, $\CCV^{}$ is a negacyclic convolution, $\A$, $\M$ are additions and multiplications in $\F$, and $\Ac$, $\Mc$ the same but by a constant.
Algorithm  Requirements  Sub problems 

Direct  $n^2⋅\Mc + n⋅\p{n1}⋅\A$  
CooleyTukey  $n= a⋅b$  $b⋅\NTT_a + a⋅\NTT_b + n⋅\Mc$ 
GoodThomas  $n=a⋅b$, $\gcd(a,b)=1$  $b⋅\NTT_a + a⋅\NTT_b$ 
Bluestein  $\ω_{2⋅n}$ exists  $\CCV_n^{\p{1}^n} + 2⋅n⋅\Mc$ 
Rader  $n$ prime  $\CCV_{n1} + \p{2⋅n1}⋅\A$ 
RaderWinograd  $n=p^m$, odd prime $p$  $\CCV_{p^m  p^{m−1}} + \p{p^{m1}+1}⋅\CCV_{p1}$ 
Binary RaderWinograd  $n=2^m$  $2⋅\CCV_{2^{m−2}}$ 
Note: The $\Mc$ values are deceptive. Often these can be efficiently folded into sub problems, or they are special values in $\F$ that are easier to compute. The $\A$'s in Radercan also be folded into some of the
Synthetic roots of unity
For a given finite field $\F_q$ a primitive $n$th root of unity exists iff $n$ divides $q1$. This also holds true for extension fields.
Suppose we want to use Bluesteins algorithm and we have $\ω_n$ but not the required $\ω_{2⋅n}$. We can work around this by taking the extension field $\F_{q^2} = \F_q[X]/(X^2  \ω_n)$ and using $\ω_{2⋅n} = X$.
In prime fields $\F_p$ with $\mod{p}_4 = 3$ the polynomial $X^2+1$ is irreducible and we can consider the field extension $\F_{p^2} = \F_p[X]/(X^2 + 1)$. We have $q^2  1 = \p{q+1}⋅\p{q1}$. So the new field must have a primitive root of order $q+1$. [Hab23]
What if we want to do an NTT of size $n$, but don't have $n$th roots of unity in the field? If out field is $\R$ we have at most a 2nd root of unity, $1$. If we want more we need an extension $\R/(X^2 + 1)$. More commonly known as the complex numbers $\mathbb{C}$ with $\i^2 = 1$.
$$ \ω_n = \mathrm{e}^{\frac{2⋅\mathrm{π}⋅\mathrm{i}}{n}} = 1 + \frac{2⋅\mathrm{π}}{n}⋅\i \frac{1}{2}⋅\p{\frac{2⋅\mathrm{π}}{n}}^2 \frac{1}{6}⋅\p{\frac{2⋅\mathrm{π}}{n}}^3⋅\i +\cdots $$
By definition $\ω_n^n = 1$, so to synthetically construct one we consider $\F[X]/\p{X^n  1}$. But $X^n1$ is not necessarily ireducable.
NTT by Complex Field Extension
If the $X^2 + 1$ is an irreducable polynomial in $𝔽_p$ then we can define the complex extension field of $𝔽_p$ as
$$ ℂ(𝔽_p) = 𝔽_p[X]/(X^2 + 1) $$
Note that the polynomial is irreducable exactly when $\mod{p}_4 = 3$.
The polynomials $a + b⋅X$ in this extension field have the property that $X$ acts as $\sqrt{1}$. So we will adopt the complex number notation $\i$ for $X$ and write $a + b ⋅ i$ instead.
For a finite field $\F_q$ Roots of unity exist for every divisor of $q1$. In a quadratic field extension $q1=p^21 = \p{p1}⋅\p{p+1}$. So the we have all the original divisors, plus what
NTT by Elliptic Curves
https://solvable.group/posts/ecfft/
Finding curves SS14.
Find an elliptic curve over $\F$ with
Take a subgroup of the elliptic curve of desired size $n$. Take as $x$ coordinates the $x$ coordinates of those points. We will interpolate a polynomial on those coordinates.
Next we'll need an isogeny to half the elliptic curve size. An isogeny is a surjective morphism between two elliptic curves. If the curves are of the form $y^2 = f(x)$ they can always be written in the standard affine form:
$$ ψ(x,y) = \p{\frac{u(x)}{v(x)}, \frac{s(x)}{t(x)}⋅y} $$
where $u,v$ and $s,t$ are coprime.
The groups $\Z_n^\times$.
The multiplicative group of numbers modulo $n$, $\Z_n^\times$, is cyclic iff $n$ is $1,2,4,p^k,2⋅p^k$ for odd prime $p$. In this case $\Z_n^\times ≅ \mathrm{C}_{\mathrm{φ}(n)}$ where $\mathrm{φ}$ is Euleur's totient function. For powers of two $\Z_{2^k}^\times ≅ \mathrm{C}_{\mathrm{φ}(n)}$
The Chinese Remainder Theorem
Let I and J be ideals of a ring R. Remainder arithmetic modulo I and J means mapping R/IJ to (R/I) × (R/J) by z 7 → (z mod I, z mod J).
The Chinese remainder theorem states that R/IJ is isomorphic to (R/I)×(R/J) if I and J are coprime, i.e., if there exist u ∈ I and v ∈ J with u + v = 1. The inverse map is (x, y) 7 → vx + uy.
Applied to a modular basis of polynomials $X  x_i$ this is called evaluation and interpolation.
Applied to a modular basis of polynomials $X  \ω_n^i$ this is the $\NTT$.
Karatsuba's trick is $\F[X]/(X^2  X) ≅ \F[X]/(X  1)×\F[X]/X$
References

R. Tolimieri, M. An, C. Lu, "Algorithms for discrete Fourier transform and convolution", Springer (1989) pp. 1997.

D.P. Kolba, T.W. Parks, "Prime factor FFT algorithm using high speed convolution" IEEE Trans. Acoustics, Speech and Signal Processing, ASSP–25 (1977) pp. 281–294

Blahut (2010). "Fast algorithms for signal processing".

C. S. Burrus (2012). “Fast Fourier Transforms”. https://eng.libretexts.org/Bookshelves/Electrical_Engineering/Signal_Processing_and_Modeling/Fast_Fourier_Transforms_(Burrus)

[46,47] C. S. Burrus (1977). “Index mappings for multidimensional formulation of the DFT and convolution.” https://doi.org/10.1109/TASSP.1977.1162938
https://cr.yp.to/lineartime/multapps20080515.pdf Mostly about fast multiplication and related algorithms.
https://cr.yp.to/arith/tangentfft20070919.pdf The Tangent FFT. Goes further than splitRadix. Not sure how it generalizes to fields.
https://cr.yp.to/papers/m320010811retypeset20220327.pdf
https://en.wikipedia.org/wiki/Multiplicative_group_of_integers_modulo_n
TODO. Nussbaumer–Quandalle permutation algorithm. See [Bla10] p. 411.
TODO. Additive NTT https://arxiv.org/abs/1404.3458