Fast ζ(2n) calculation

Remco Bloemen

2009-11-06, last updated 2014-02-27

In this post I will show a few very fast algorithms for computing consecutive approximations of ζ(2n)\zeta (2n), derive my own superior algorithm for high n and produce a hybrid strategy for generating the zeta series required to compute Khinchin’s constant.

In the previous post we posed the following question:

How does one cal­cu­late ζ(2)ζ(2N)\zeta (2) \ldots \zeta (2N) fast and ac­cu­rate up to B(n)B(n) bits?

Where:

Parameter Value
BB 3 321 929
NN B2+1\left\lceil\frac{B}{2}\right\rceil + 1
B(n)B(n) B+log2Nlog2n+log2ln2B + \left\lceil\log_2 N \right\rceil - \left\lfloor\log_2 n +\log_2 \ln 2 \right\rfloor

There are several interesting algorithms to calculate ζ(n)\zeta (n) or the Bernoulli numbers, BnB_n which are related by:

ζ(n)=|Bn|(2π)n2n! \zeta(n) = \frac{\vert B_n \vert (2 \pi)^n}{2 n!}

mpfr_zeta_ui
This is the standard implementation of MPFR, it is accurate up to rounding in the last bit.
Series summation
This algorithm is trivially summing the [1,)kks\sum_{[1,\infty)}^k k^{-s} terms. However, when s is large the number of terms required to reach the accuracy becomes very low.
Euler product
Similar to the summation, but with a product of primes p11ps\prod_{\mathbb{P}}^p \frac{1}{1-p^s}
von Staudt–Clausen algorithm
This is a more involved trick which can produce an exact value of Zeta from an inaccurate one.
David Harvey’s BernMM
First calculate BnmodpB_n \mathrm{mod}{p} for a set of small primes pp and then determine an exact value using the Chinese remainder theorem.

The von Staudt-Clausen algorithm

This was our first improvement over mpfr_zeta_ui. In order to get an exact expression for ζ\zeta only the first

2+(n+1)(log2(n+1)1ln2)nlog2(2π) 2 + (n+1)(\log_2( n+1 ) - \frac{1}{\ln{2}}) - n \log_2( 2 \pi)

bits are required. Unfortunately this number quickly becomes even larger than the number of bits we require in the final ζ\zeta as n grows. However, for small n it still provides an incredible boost over mpfr_zeta_ui.

Below you can find the source code, the 2n!2 n! and (2π)n(2\pi)^n have been taken out so they can be recycled.

// Table of small primes
extern const uint primes[100000];
// Log2 ( 2 pi )
const double log22pi = 2.6514961294723187980;
// 1 / ln(2)
const double ln2inverse = 1.4426950408889634074;

/// @param zeta is set to ζ(n) - 1
/// @param factor must be set to (2π)^n/(2n!)
/// @returns false when von Staudt-Clausen gives no precision reduction
bool von_Staudt_Clausen_zeta(mpfr_t zeta, uint n, mpfr_t factor)
{
        mp_prec_t low_precision, precision;
        precision = mpfr_get_prec(zeta);
        mpfr_t zeta_low;
        mpfr_t B, term, divisor_sum;
        mpfr_init2(B, precision);
        mpfr_init2(term, precision);
        mpfr_init2(divisor_sum, precision);
        mp_exp_t bits;

        // If n<30 ζ=1 is a good enough apporximation
        if(n < 30)
        {
                low_precision = 2;
                mpfr_init2(zeta_low, low_precision);
                mpfr_set_ui(zeta_low, 1, GMP_RNDN);
        }
        else
        {
                // low_precision = 2 + (n+1)( log2( n+1 ) - 1/ln(2) ) - n log2( 2 pi )
                double lp = 2.0 + (n+1) * ( log2(n+1) - ln2inverse)  - n*log22pi;
                low_precision = floor(lp);
                if(low_precision > precision)
                {
                        return false;
                }
                mpfr_init2(zeta_low, low_precision);
                mpfr_zeta_ui(zeta_low, n, GMP_RNDN);
        }

        // TODO: What is the precision required for B here?
        // B =  zeta / factor   (low precision)
        mpfr_div(B, zeta_low, factor, GMP_RNDN);

        // divisor_sum = Σ 1/p  s.t.  p-1 divides 2n
        mpfr_set_d(divisor_sum, 0.0, GMP_RNDN);
        for(int i=0; true; i++)
        {
                uint prime = primes[i];
                if(prime > n+1)
                {
                        break;
                }
                if(n % (prime-1) == 0)
                {
                        mpfr_set_ui(term, prime, GMP_RNDN);
                        mpfr_ui_div(term, 1, term, GMP_RNDN);
                        mpfr_add(divisor_sum, divisor_sum, term, GMP_RNDN);
                }
        }

        // divisor_sum *= (-1)^(n+1)
        if((n % 4) == 0)
        {
                mpfr_neg(divisor_sum, divisor_sum, GMP_RNDN);
        }

        // von Staudt-Clausen theorem:
        // B = Round( B + divisor_sum ) - divisor_sum
        mpfr_add(B, B, divisor_sum, GMP_RNDN);
        mpfr_round(B, B);
        mpfr_sub(B, B, divisor_sum, GMP_RNDN);

        // zeta = factor * zeta
        mpfr_mul(zeta, factor, B, GMP_RNDN);

        // Remove the term 1
        mpfr_sub_ui(zeta, zeta, 1, GMP_RNDN);

        mpfr_clear(zeta_low);
        mpfr_clear(B);
        mpfr_clear(divisor_sum);
        mpfr_clear(term);
        return true;
}

David Harvey’s multimodular Bernoulli algorithm

Here is the source code and here is the paper paper describing how it works. We do not understand most of it, but the paper is clear about it being the fastest known algorithm for single exact Bernoulli numbers. A zeta function was constructed as follows (again recycling the factors):

Series summation

As said, if we want to beat Harvey’s algorithm we must abuse the fact that we want consecutive inexact values of ζ\zeta , the series summation lends itself beautifully for that:

ζ(n)=[1,)k1kn \zeta(n) = \sum_{[1,\infty)}^k \frac{1}{k^n}

The error introduced when stopping the summation before K is

[K,)k1kn=ζ(n,K)<2Kn \sum_{[K,\infty)}^k \frac{1}{k^n} = \zeta(n, K) <\frac{2}{K^n}

or expressed in bits of accuracy

B=nlog2(K)1 B = n\log_2(K) -1

so for B bits of accuracy in the sum we need to add up to but excluding

K=2B+1n K = \left\lceil 2^{\frac{B+1}{n}} \right\rceil

and these terms on their turn need to be accurate up to

Bterm=B+log2(K) B_{term} = B + \left\lceil \log_2(K) \right\rceil

But these terms tend to have a lot of zeros in front of them, in fact, nlog2(k)\left\lfloor n \log_2(k) \right\rfloor of them (well, when k=2mk = 2^{m} it is off by one, but this doesn’t matter since the remainder will be zero anyway). And since mpfr is a floating point library, these leading zeros do not requite additional precision and can be deducted:

Bterm=B+log2(K)nlog2(k) B_{term} = B + \left\lceil \log_2(K) \right\rceil - \left\lfloor n \log_2(k) \right\rfloor

Powtable series summation

Analysis of the series summation program showed that almost all of the time was spent in mpfr_ui_pow_ui. We can eliminate this call by recycling the 1kn\frac{1}{k^n} values!

/// Fast zeta calculation when few terms are required
/// @param zeta is set to ζ(n)-1
/// @param powtable is initialized to powtable[k-3] = k^(-2n+2)
/// it will be updated for ζ(n+2)-1 in the process
void powtable_summation_zeta(mpfr_t zeta, uint n, mpfr_t* powtable)
{
        mp_prec_t B = mpfr_get_prec(zeta) + 1;

        // K = ⌈ 2^((B+1)/n) ⌉
        uint K = ceil( pow(2.0, (B + 1.0) / n) );

        // B_S = B + ⌈ log2( K ) ⌉
        mp_prec_t B_S = B + ceil(log2(K));
        mpfr_set_prec(zeta, B_S);

        // zeta = 1/2^n
        mpfr_set_d(zeta, 1.0, GMP_RNDN);
        mpfr_div_2ui(zeta, zeta, n, GMP_RNDN);

        for(uint k=3; k < K; k++)
        {
                // B_term = B + ⌈ log2( K ) ⌉ - ⌊ n log2( k ) - 1 ⌋
                mp_prec_t B_term = B_S - floor( n*log2(k) );
                mpfr_prec_round(powtable[k-3], B_term, GMP_RNDN);

                // powtable[k-3] /= k*k
                mpfr_div_ui(powtable[k-3], powtable[k-3], k*k, GMP_RNDN);

                // zeta += powtable[k-3]
                mpfr_add(zeta, zeta, powtable[k-3], GMP_RNDN);
        }

        mpfr_prec_round(zeta, B, GMP_RNDN);
}

/// Produce a table for powtable_summation_zeta
mpfr_t* init_powtable(mp_prec_t B, uint n)
{
        cout << "Calculating series power table…   0%" << flush;
        timespec start, finish;
        clockid_t cpuclock;
        clock_getcpuclockid(0, &cpuclock);
        clock_gettime(cpuclock, &start);

        B++;

        // K = ⌈ 2^((B+1)/n) ⌉
        uint K = ceil( pow(2.0, (B + 1.0) / n) );
        mpfr_t* powtable = new mpfr_t[K-2];

        // B_S = B + ⌈ log2( K ) ⌉
        mp_prec_t B_S = B + ceil(log2(K));

        for(uint k=3; k < K; k++)
        {
                // B_term = B + ⌈ log2( K ) ⌉ - ⌊ n log2( k ) - 1 ⌋
                mp_prec_t B_term = B_S - floor( n*log2(k) );
                mpfr_init2(powtable[k-3], B_term);

                // powtable[k-3] = 1/k^(n-2)
                mpfr_ui_pow_ui(powtable[k-3], k, n-2, GMP_RNDN);
                mpfr_ui_div(powtable[k-3], 1, powtable[k-3], GMP_RNDN);

                // Display a nice progress indicator
                if(k % 10 == 0)
                {
                        double complete = 100.0 * sqrt(k) / sqrt(K);
                        cout << "\b\b\b" << setw(2) << setprecision(0) << floor(complete) << "%" << flush;
                }
        }
        clock_gettime(cpuclock, &finish);
        cout << "\b\b\b\b100% (in " << timems(start, finish) << " ms)" <<  endl;
        cout << fixed << setprecision(4);

        return powtable;
}

Benchmarking

To decide which algorithms we will use in our attempt to compute K0K_0 we measured (for BB = 34, 340 and 3400 kilobit) the run time of:

Line Method
Red line mpfr_zeta_ui
Blue line mpfr_zeta_ui with our von Staudt-Clausen implementation
Purple line Darvid Harvey’s bernoulli
Green line Series summation without recycling
Black line Series summation with recycling
Benchmark for ζ(2n) with 34 kb accuracy
Benchmark for ζ(2n) with 34 kb accuracy

The sudden drop at n=11000n =11 000 corresponds to the moment when ζ(2n)1\zeta (2n)-1 can be approximated by 22n2^{-2n}, apparently that optimisation is included in mpfr_zeta_ui. At this point the powtable is slightly faster since it doesn’t allocate memory for an (at this point unnecessary) temporary variable. The noise corresponds to me using my desktop for other things, like a screensaver which I forgot to disable, while running the benchmark.

Benchmark for ζ(2n) with 340kb accuracy
Benchmark for ζ(2n) with 340kb accuracy

Plain mpfr_zeta_ui was excluded because it would have taken minutes to compute a single zeta for low nn. It would have been identical to the blue line after the peak in the blue line.

Benchmark for ζ(2n) with 3.4 Mb accuracy
Benchmark for ζ(2n) with 3.4 Mb accuracy

This time the von Staudt-Clausen and Harvey’s method where stopped early since they took up too much time and were not that interesting.

Conclusion

The the mpfr_zeta_ui was not very impressive but with von Staudt-Clausen it was very good until nn grows large. But Harvey’s method is faster when many bits are required.

Harvey’s algorithm performs very good but is beaten by the series summation when the number of terms included required for the series drops to around a couple hundred.

But the powtable series blows everything away! We will use the powtable series as soon as the number of terms drops below ten thousand, by then its more then three times faster than Harvey’s algorithm and the algorithm kept getting faster as less terms were needed (unlike the Harvey’s, which gets slower for higher n*n*).

There is a reason we don’t start using the powtable method earlier. When nn is small exponentially more terms in the table are required, and generating more table items will soon cost more than the time gained by switching to the powtable method a little earlier. Taking 10 000 items seems to be near the optimum, here is how much time it takes to compute the table:

Accuracy n<n < time
34 kb 1 113 1 665 ms
340 kb 11 121 59 298 ms
3 400 kb 111 201 1 667 349 ms

Integrating the minimum of the running time algorithms reveals that the total time to calculate the ζ(2n)\zeta (2n)’s required for a million decimals of K0K_0 is 9.7 days. Quite long, but not impossible. Calculating a hundred thousand decimal should take about an hour, let’s try that first.