Winograd Fast Inner Product
$$ \gdef\delim#1#2#3{\mathopen{}\mathclose{\left#1 #2 \right#3}} \gdef\p#1{\delim({#1})} \gdef\ps#1{\delim\{{#1}\}} \gdef\floor#1{\delim\lfloor{#1}\rfloor} \gdef\ceil#1{\delim\lceil{#1}\rceil} \gdef\vec#1{\mathbf{#1}} \gdef\mat#1{\mathrm{#1}} \gdef\A{\mathcal A} \gdef\B{\mathcal B} \gdef\T{\mathsf T} $$
PN23 mentions a clever trick due to Winograd W68 to compute inner products with half the number of multiplications after pre-processing. Observe that for $(a_0, a_1)⋅(b_0, b_1)$ we have
$$ a_0⋅b_0 + a_1 ⋅b_1 = (a_0 + b_1) ⋅ (a_1 + b_0) - a_0 ⋅ a_1 - b_0 ⋅ b_1 $$
where the left-hand side, the conventional method, takes two multiplications. The right-hand side needs three multiplications, but two of them depend on only one of the inputs and can be precomputed.
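For example, with $\vec a = (2, 3)$ and $\vec b = (5, 7)$:
$$ 2⋅5 + 3⋅7 = \p{2 + 7} ⋅ \p{3 + 5} - 2⋅3 - 5⋅7 = 72 - 6 - 35 = 31 $$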
For longer vectors this can be repeated over all adjacent pairs (assuming for now that $n$ is even)
$$ \begin{aligned} \sum_{i \in [0,n)} a_i ⋅ b_i &= \sum_{i \in 2⋅[0,n/2)} \p{a_{i} ⋅ b_{i} + a_{i + 1} ⋅ b_{i + 1} } \\ &= \sum_{i \in 2⋅[0,n/2)} \p{ \p{a_{i} + b_{i + 1}} ⋅ \p{a_{i + 1} + b_{i}} - a_{i} ⋅ a_{i + 1} - b_{i} ⋅ b_{i + 1} } \\ &= A + B + \sum_{i \in 2⋅[0,n/2)} \p{a_{i} + b_{i + 1}} ⋅ \p{a_{i + 1} + b_{i}} \\ \end{aligned} $$
where we only need to store a single precomputed value per input vector
$$ \begin{aligned} A &= \!\!\!\sum_{i \in 2⋅[0,n/2)}\!\!\! -a_{i} ⋅ a_{i + 1} &\quad B &= \!\!\!\sum_{i \in 2⋅[0,n/2)}\!\!\! -b_{i} ⋅ b_{i + 1} \end{aligned} $$
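A minimal sketch of this in Rust (illustrative names and types, assuming even-length `f64` slices):

```rust
/// Correction term for one input vector: the negated sum of
/// products of adjacent pairs. Computed once per vector.
fn precompute(v: &[f64]) -> f64 {
    assert!(v.len() % 2 == 0);
    v.chunks_exact(2).map(|p| -p[0] * p[1]).sum()
}

/// Inner product using Winograd's trick: one multiplication per
/// pair of elements, plus the two precomputed corrections.
fn dot_precomputed(a: &[f64], b: &[f64], a_pre: f64, b_pre: f64) -> f64 {
    assert_eq!(a.len(), b.len());
    let sum: f64 = a
        .chunks_exact(2)
        .zip(b.chunks_exact(2))
        .map(|(pa, pb)| (pa[0] + pb[1]) * (pa[1] + pb[0]))
        .sum();
    a_pre + b_pre + sum
}

fn main() {
    let a = [1.0, 2.0, 3.0, 4.0];
    let b = [5.0, 6.0, 7.0, 8.0];
    let (a_pre, b_pre) = (precompute(&a), precompute(&b));
    // 1*5 + 2*6 + 3*7 + 4*8 = 70
    assert_eq!(dot_precomputed(&a, &b, a_pre, b_pre), 70.0);
}
```

This only pays off when the same vectors are reused across many inner products, for example as the rows and columns in a matrix product, so that the precomputation is amortized.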
Note. Generalizing this to more than two terms does not seem beneficial, as it generates $n$ useful terms but $n^2-n$ noise terms that need to be cancelled. For $n>2$ these noise terms also include cross terms between $\vec a$ and $\vec b$ that cannot be pre-computed in the same way.
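For illustration, one possible grouping of three terms is
$$ \p{a_0 + a_1 + b_2} ⋅ \p{b_0 + b_1 + a_2} = a_0⋅b_0 + a_1⋅b_1 + a_2⋅b_2 + a_0⋅b_1 + a_1⋅b_0 + a_0⋅a_2 + a_1⋅a_2 + b_2⋅b_0 + b_2⋅b_1 $$
where only the first three terms are useful, and the cross terms $a_0⋅b_1$ and $a_1⋅b_0$ depend on both inputs, so they cannot be removed by per-vector precomputation.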
The choice of odd and even indices is arbitrary. Instead we can pad the vectors $\vec a$, $\vec b$ with zeros to an even length, split the indices into two length-$\ceil{\frac{n}{2}}$ subsequences $\mathcal A$, $\mathcal B$, and compute:
$$ \vec a ⋅ \vec b = \p{\vec a_{\mathcal A} + \vec b_{\mathcal B}}⋅\p{\vec a_{\mathcal B} + \vec b_{\mathcal A}} - \vec a_{\mathcal A} ⋅ \vec a_{\mathcal B} - \vec b_{\mathcal A} ⋅ \vec b_{\mathcal B} $$
In particular, we can pick the lower and higher half of the indices and get a block-matrix formulation
$$ \begin{aligned} \begin{bmatrix} \mat A & \mat A' \end{bmatrix}⋅ \begin{bmatrix} \mat B \\ \mat B' \end{bmatrix} &=\mat A ⋅ \mat B + \mat A' ⋅ \mat B' \\ &=\p{\mat A + \mat B'}⋅ \p{\mat A' + \mat B}-\mat A⋅\mat A'-\mat B'⋅\mat B \end{aligned} $$
Note that the second equality implicitly uses $\mat B' ⋅ \mat A' = \mat A' ⋅ \mat B'$, so it holds when the block products commute; for general matrices the trick is instead applied entry-wise to the scalar dot products. The naive algorithm for matrix multiplication involves a large number of such dot products, and W68 shows that for large matrices the pre-processing cost amortizes away, so we can indeed halve the number of multiplications. Winograd further shows that this leads to improvements for matrix inversion and solving linear systems.
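As a quick count: for the product of two $n \times n$ matrices (with $n$ even), each of the $n^2$ output entries needs $\frac{n}{2}$ multiplications, and precomputing the corrections for the $n$ rows and $n$ columns needs another $2 ⋅ n ⋅ \frac{n}{2}$, giving
$$ n^2 ⋅ \frac{n}{2} + 2 ⋅ n ⋅ \frac{n}{2} = \frac{n^3}{2} + n^2 $$
multiplications in total, compared to $n^3$ for the naive algorithm.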
Implementation
To vectorize this we can swap each adjacent pair of elements in the vector $\vec a$ once up front, so that the additions become element-wise (writing $\vec a$ for the swapped vector from here on):
$$ A + B + \sum_{i \in 2⋅[0,n/2)} \p{a_{i} + b_{i}} ⋅ \p{a_{i + 1} + b_{i + 1}} $$
If we split $\vec a$ and $\vec b$ into even- and odd-indexed vectors $\vec a, \vec a'$ and $\vec b, \vec b'$ we get:
$$ \sum_{i \in [0,n/2)} \p{a_{i} + b_{i}} ⋅ \p{a_{i}' + b_{i}'} $$
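A sketch of this layout in Rust (illustrative names, assuming even length): swap adjacent pairs of $\vec a$ once, split both vectors into even- and odd-indexed halves, and the inner loop becomes two element-wise additions followed by a multiply-accumulate, a shape the compiler can auto-vectorize. The corrections $A$ and $B$ are added on top exactly as before.

```rust
/// Split a slice into its even- and odd-indexed elements.
fn split_even_odd(v: &[f64]) -> (Vec<f64>, Vec<f64>) {
    let even: Vec<f64> = v.iter().step_by(2).copied().collect();
    let odd: Vec<f64> = v.iter().skip(1).step_by(2).copied().collect();
    (even, odd)
}

/// The sum-of-products part of Winograd's trick, with the pair swap
/// folded in so that all additions line up element-wise.
fn winograd_sum(a: &[f64], b: &[f64]) -> f64 {
    // Swap adjacent pairs of `a` once, up front.
    let mut a_swapped = a.to_vec();
    for pair in a_swapped.chunks_exact_mut(2) {
        pair.swap(0, 1);
    }
    let (ae, ao) = split_even_odd(&a_swapped);
    let (be, bo) = split_even_odd(b);
    // Element-wise adds, then multiply and accumulate.
    ae.iter()
        .zip(&be)
        .zip(ao.iter().zip(&bo))
        .map(|((x, y), (z, w))| (x + y) * (z + w))
        .sum()
}
```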
Or batched as a matrix product, with one precomputed correction $A_i$ per row of the left operand and one correction $B_j$ per column of the right operand:
$$ \mat C_{ij} = A_i + B_j + \sum_{k \in [0,n/2)} \p{\mat A_{ik} + \mat B_{kj}} ⋅ \p{\mat A'_{ik} + \mat B'_{kj}} $$
where $\mat A$, $\mat A'$ are the even- and odd-indexed columns of the (pair-swapped) left operand and $\mat B$, $\mat B'$ the even- and odd-indexed rows of the right operand.
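A sketch of this as a full matrix product in Rust (row-major slices, illustrative names, no blocking or explicit SIMD; written with the original pairing instead of the swap):

```rust
/// C = A * B using Winograd's inner product trick per entry.
/// `a` is m x n row-major, `b` is n x p row-major, `n` must be even.
fn winograd_matmul(a: &[f64], b: &[f64], m: usize, n: usize, p: usize) -> Vec<f64> {
    assert!(n % 2 == 0);
    // One correction per row of `a`: negated products of adjacent pairs.
    let row_pre: Vec<f64> = (0..m)
        .map(|i| (0..n / 2).map(|k| -a[i * n + 2 * k] * a[i * n + 2 * k + 1]).sum())
        .collect();
    // One correction per column of `b`: negated products of adjacent row pairs.
    let col_pre: Vec<f64> = (0..p)
        .map(|j| (0..n / 2).map(|k| -b[2 * k * p + j] * b[(2 * k + 1) * p + j]).sum())
        .collect();
    let mut c = vec![0.0; m * p];
    for i in 0..m {
        for j in 0..p {
            // n/2 multiplications per entry instead of n.
            let s: f64 = (0..n / 2)
                .map(|k| {
                    (a[i * n + 2 * k] + b[(2 * k + 1) * p + j])
                        * (a[i * n + 2 * k + 1] + b[2 * k * p + j])
                })
                .sum();
            c[i * p + j] = row_pre[i] + col_pre[j] + s;
        }
    }
    c
}
```

The trick trades multiplications for additions, so it mainly pays off when multiplications are significantly more expensive than additions, for example over big integers or large field elements.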