2014-02-27 (updated 2014-03-04)

Integer factorization in 64 bit

% Remco Bloemen % 2014-02-27, last updated 2014-03-04

For a residue number system I am trying to implement I need to factor 64 bit integers n. They have the special for that n+1 is a prime number, so they are all divisible by two and they tend to have other small factors. Here are a few representative examples:

$\begin{aligned} 18446744073709551556 &= 2^2 \cdot 11 \cdot 137 \cdot 547 \cdot 5594472617641 \\ 18446744073709551532 &= 2^2 \cdot 43 \cdot 67 \cdot 193 \cdot 809383 \cdot 10247197 \\ 18446744073709551520 &= 2^5 \cdot 5 \cdot 2663 \cdot 43294085790719 \\ 18446744073709551436 &= 2^2 \cdot 41 \cdot 101 \cdot 4051 \cdot 6199 \cdot 44347651 \\ 18446744073709551426 &= 2 \cdot 3 \cdot 23 \cdot 133672058505141677 \end{aligned}$

Trial division

Trial division is the simplest method, simply start dividing out numbers from one to $\sqrt{n}$ . Since you will factor out the primes before any of their composites the resulting list only contains prime factors. I also tested a version which used a list of small primes first, before trying all numbers, but the extra memory bandwidth was worse than the speedup from having to try less numbers. Here is the source:

vector<uint64> prime_factors(uint64 n)
{
	vector<uint64> f;

	// Remove the twos
	if((n & 1) == 0) {
		f.push_back(2);
		do
			n >>= 1;
		while((n & 1) == 0);
	}

	// Remove other factors
	uint64 i;
	for(i=3; i*i < n; i += 2) {
		if(fast_false(n % i == 0)) {
			f.push_back(i);
			do
				n /= i;
			while(n % i == 0);
		}
	}
	f.push_back(n);
	return f;
}

Wheel factorisation

In the previous example I took the two out as a special case and skipped all the odd integers. A generalisation of this trick is wheel factorisation, where a few small primes are taken out and a skip list is produced of numbers that must be multiples of those small primes. The end result is a list of numbers that contains more primes (without missing any!) than plainly listing all odd integers.

There is a trade off between memory usage by the skip list and small primes list and the percentage of composites that are skipped. As often with these memory trade offs, its best to choose the size such that it fits snugly in the cpu cache. Now that I think about it, I could probably take 32 bit numbers for the primes and skip list, and fit twice the amount in the cpu cache.

The small primes are produced using the Sieve of Erastosthenes I implemented earlier. This implementation creates the largest possible wheel for a given number of small primes, which can then be used to factor integers.

struct prime_wheel
{
	prime_wheel();
	int factors;
	uint64 size;
	vector<uint64> deltas;
};

const uint64s small_primes = prime_sieve(1<<20);
const prime_wheel wheel;

prime_wheel::prime_wheel()
{
	uint64 max_size = small_primes[small_primes.size() - 1];

	max_size = 211;

	size = 1;
	for(factors = 0; size < max_size; factors++) {
		size *= small_primes[factors];
	}
	factors--;
	size /= small_primes[factors];

	uint64 previous = 1;
	for(int i = 0; i < size; i++) {
		bool include = true;
		for(int j = 0; j < factors; j++) {
			if((i % small_primes[j] == 0)) {
				include = false;
			}
		}
		if(!include)
			continue;
		deltas.push_back(i - previous);
		previous = i;
	}
	deltas.push_back(size - previous + 1);
}

vector<uint64> prime_factors_wheel(uint64 n)
{
	vector<uint64> f;

	// Remove the twos
	if((n & 1) == 0) {
		f.push_back(2);
		do
			n >>= 1;
		while((n & 1) == 0);
	}

	// Remove prime wheel factors
	uint64 p = 2;
	for(int i = 1; p < wheel.size; i++) {
		p = small_primes[i];
		if(fast_false(n % p == 0)) {
			f.push_back(p);
			do
				n /= p;
			while(n % p == 0);
		}
	}

	// Turn the prime wheel
	for(p = wheel.size + 1; p * p < n;) {
		for(int j = 0; j < wheel.deltas.size(); j++) {
			if(fast_false(n % p == 0)) {
				f.push_back(p);
				do
					n /= p;
				while(n % p == 0);
			}
			p += wheel.deltas[j];
		}
	}

	// What's left must be prime or one
	if(n != 1)
		f.push_back(n);

	return f;
}

Brent-Pollard rho factorisation

This is John Pollard’s rho factorisation method (see wikipedia and planet math. It operates by iterating a random function f modulo n. This iteration must become cyclic sometime since there are only a finite amount numbers modulo n. The congruence relations resulting from this cycle give us clues about factors of n.

Richard Brent has made quite a few improvements to this algorithm. The original algorithm compares $x_i$ with $x_{2i}$ to detect a cycle, but I implemented Brent’s improvement, which compares $x_i$ with $x_j$ , where $j$ is the first power of two below $i$ . Brent also suggested that one can multiply a few differences $\left\vert x_i - x_j \right\vert$ modulo $n$ and compute the gcd of the product and $n$ once in a while. I did not implement this last suggestion, but I did measure that >90% of the time is spent computing gcd’s, so this would be an important improvement. I also did not implement Montogomery reduced multiplication, which may speed up the multiplication.

This algorithm will run indefinitely for prime numbers, so these are checked for on start. There are also some special cases where the cycle may be very long or where it will not result in a proper factor, so after a while the algorithm terminates and tries again with a different random function. I have not thought hard about the best time to terminate and try again, so this might be sub-optimal.

// Source: http://en.wikipedia.org/wiki/Binary_GCD_algorithm
uint64 gcd(uint64 u, uint64 v)
{
	if (u == 0 || v == 0)
		return u | v;

	// Remove common twos in u and v
	int shift;
	for (shift = 0; ((u | v) & 1) == 0; ++shift) {
		u >>= 1;
		v >>= 1;
	}

	// Remove twos from u
	while ((u & 1) == 0)
		u >>= 1;

	do {
		// Remove twos in v
		while ((v & 1) == 0)
			v >>= 1;

		// Now u and v are both odd, so diff(u, v) is even.
		// Let u = min(u, v), v = diff(u, v)/2.
		if (u < v) {
			v -= u;
		} else {
			uint64 diff = u - v;
			u = v;
			v = diff;
		}
		v >>= 1;
	} while (v != 0);

	return u << shift;
}

void brents_factor(uint64 n, vector<uint64>& f)
{
	if(is_prime(n)) {
		f.push_back(n);
		return;
	}

	uint64 a = random() % n;
	uint64 x = random() % n;
	uint64 y = 1;

	for(int i = 0; i*i / 2 < n; i++)
	{
		// x = x² + a mod n
		x = mul(x, x, n);
		x += a;
		if(x < a)
			x += (max_uint64 - n) + 1;
		x %= n;

		uint64 g = gcd(n, y - x);
		if(g != 1 && g != n) {
			n /= g;
			brents_factor(g, f);
			if(n != g)
				brents_factor(n, f);
			return;
		}

		if((i & (i-1)) == 0)
				y = x;
	}

	// Found no factors, yet n is not a prime, retry
	brents_factor(n, f);
}

Benchmark and conclusion

The algorithms are compared by factoring p-1 for the thousand largest 64 bit primes:

Algorithm Running time

Trial division 959 s Wheel factorisation 544 s Brent-Pollard 3 s

Clearly the Brent-Pollard method is the way to go, I should implement the optimisations I suggested and think more thoroughly about the loop termination.