2014-02-27

Variable Sized Integers

% Remco Bloemen % 2014-02-27

In this article I’ll show different methods to compactly encode integers which are likely to be small. In the process I’ll show some neat bit-fiddeling techniques and I’ll try to develop a coding that is both faster and denser.

Existing work

Variable sized integer codings are used in several existing projects, usually when very efficient use of storage space is required and large values can occur but small values are much more frequent. Typical cases would be content sizes, or anything that follows a power-law distribution, which according to Zipf’s law happens very often. I’ll start by showing three well-known variable integer coders.

EBML (Matroska)

The Matroska file format, with file extenstion .mkv, is a popular choice for high definition video content. Matroska is a modern and flexible container format based on the EBML specification. An EBML formatted file consists of a stream of key, size, value triples known as ‘tags’. Here the key is an one to four byte identifier and the size is the varint encoded length of the value. These tags can be nested, making the format heirarchical and easily extended.

The varint is between 1 and 9 bytes, the most significant bits of the first byte determine the total number of bytes used in the encoding.

EBML varint format:

value encoded as

< 2⁷ 1xxx xxxx₂ < 2¹⁴ 01xx xxxx xxxx xxxx₂ < 2²¹ 001x xxxx xxxx xxxx xxxx xxxx₂ < 2²⁸ 0001 xxxx xxxx xxxx xxxx xxxx xxxx xxxx₂ ⋮ ⋮ < 2⁶⁴ 0000 0001 xxxx xxxx …(48 bits)… xxxx xxxx₂

Note

EBML uses this format to encode tags and field sizes, not to encode integer valued fields. For example, encoding an integer valued field with value 123₁₆ would result in a size of two bytes, which is varint encoded as 02₁₆ and then prefixed to the integer’s bytes giving the output 020123₁₆. Since the size is 0–8 bytes, which is always encoded as a single byte, the net effect is a one-byte prefix.

The varint coder is defined in EbmlElement.cpp, but the original code is too complicated to show it here. It contains special cases for the end of the read/write buffer, some minor variants for signed integers and special values, but is is also rather entangled with other functions. I will therefore show a simplified version, with my own variable names and comments to clarify. I did not try to optimize the functions. Please rever to the original sourcecode at the links above before making any judgements.

EBML write function (simplified).

void write(char*& buffer, uint64 value)
{
	// Determine the number of bytes to write
	int num_bytes;
	if (value <= 127) // 2^7 - 1
		num_bytes = 1;
	else if (value <= 16383) // 2^14 - 1
		num_bytes = 2;
	else if (value <= 2097151L) // 2^21 - 1
		num_bytes = 3;
	else if (value <= 268435455L) // 2^28 - 1
		num_bytes = 4;
	else num_bytes = 5; // 

	// Write the bytes
	int mask = 0xFF;
	OutBuffer[0] = 1 << (8 - num_bytes);
	for (int i = 1; i < num_bytes; i++) {
		buffer[num_bytes - i] = value & 0xFF;
		value >>= 8;
		mask >>= 1;
	}
	buffer[0] |= value & 0xFF & mask;
	buffer += num_bytes;
}

Higher cases where marked marked as "TODO" in original code

EBML read function (simplified).

uint64 read(const char*& buffer)
{
	char mask = 1 << 7;
	char size_buffer[8];
	int num_bytes = 0;
	int index = 0;
	uint64 value = 0;
	for (index = 0; index < 8; ++index) {
		if (buffer[0] & (mask >> index)) {
			mask >>= index;
			num_bytes = index + 1;
			for (index = 0; index < num_bytes; ++index) {
				size_buffer[index] = buffer[index];
			}
			for (index = 0; index < num_bytes - 1; ++index) {
				value <<= 7;
				value |= 0xFF;
			}
			value = 0;
			value |= size_buffer[0] & ~mask;
			for (unsigned int i = 1; i < num_bytes; ++i) {
				value <<= 8;
				value |= PossibleSize[i];
			}
			buffer += size_buffer;
			return value;
		}
	}
	// Handle overflow
}

Google Protocol Buffers

Google Protocol Buffers is designed as a highly efficient alternative to xml for data exchange between processes and data storage. It is apparently the default remote procedure call protocol for Google’s internal software, so one can assume that the code is very well designed and povides a significant improvement over existing alternatives. A fundamental part of the specification are "base 128 varints". These varints are the default encoding format for integer field types and they are used to encode data lengths.

The varint format consist of groups of eight bits, bytes. The most significant bit in each byte is a flag that is one when more groups are comming and zero if this is was the last group. The other seven bits are part of the encoded number, but counterintuitively the format starts with the least significant group of seven bits first.

Google Protocol Buffers varint format:

value encoded as

< 2⁷ 0xxx xxxx₂ < 2¹⁴ 1xxx xxxx 0xxx xxxx₂ < 2²¹ 1xxx xxxx 1xxx xxxx 0xxx xxxx₂ < 2²⁸ 1xxx xxxx 1xxx xxxx 1xxx xxxx 0xxx xxxx₂ ⋮ ⋮ < 2⁶⁴ 1xxx xxxx 1xxx xxxx …(48 bits)… 0xxx xxxx₂

The code has some special cases at the end of the buffer and for 32-bit and signed integer which I will not go into. When the pointer is not to close to the end of the buffer then the encoding is done using CodedInputStream::ReadVarint64Fallback and the decoding is done using CodedOutputStream::WriteVarint64ToArrayInline. The functions are defined as follows:

Google Protocol Buffers writer.

inline uint8* CodedOutputStream::WriteVarint64ToArrayInline(
	uint64 value, uint8* target)
{
	// Splitting into 32-bit pieces gives better performance on 32-bit
	// processors.
	uint32 part0 = static_cast<uint32>(value      );
	uint32 part1 = static_cast<uint32>(value >> 28);
	uint32 part2 = static_cast<uint32>(value >> 56);

	int size;

	// Here we can't really optimize for small numbers, since the value is
	// split into three parts.  Cheking for numbers < 128, for instance,
	// would require three comparisons, since you'd have to make sure part1
	// and part2 are zero.  However, if the caller is using 64-bit integers,
	// it is likely that they expect the numbers to often be very large, so
	// we probably don't want to optimize for small numbers anyway.  Thus,
	// we end up with a hardcoded binary search tree...
	if (part2 == 0) {
		if (part1 == 0) {
			if (part0 < (1 << 14)) {
			if (part0 < (1 << 7)) {
				size = 1; goto size1;
			} else {
				size = 2; goto size2;
			}
			} else {
			if (part0 < (1 << 21)) {
				size = 3; goto size3;
			} else {
				size = 4; goto size4;
			}
			}
		} else {
			if (part1 < (1 << 14)) {
			if (part1 < (1 << 7)) {
				size = 5; goto size5;
			} else {
				size = 6; goto size6;
			}
			} else {
			if (part1 < (1 << 21)) {
				size = 7; goto size7;
			} else {
				size = 8; goto size8;
			}
			}
		}
	} else {
		if (part2 < (1 << 7)) {
			size = 9; goto size9;
		} else {
			size = 10; goto size10;
		}
	}
	
	GOOGLE_LOG(FATAL) << "Can't get here.";
	
	size10: target[9] = static_cast<uint8>((part2 >>  7) | 0x80);
	size9 : target[8] = static_cast<uint8>((part2      ) | 0x80);
	size8 : target[7] = static_cast<uint8>((part1 >> 21) | 0x80);
	size7 : target[6] = static_cast<uint8>((part1 >> 14) | 0x80);
	size6 : target[5] = static_cast<uint8>((part1 >>  7) | 0x80);
	size5 : target[4] = static_cast<uint8>((part1      ) | 0x80);
	size4 : target[3] = static_cast<uint8>((part0 >> 21) | 0x80);
	size3 : target[2] = static_cast<uint8>((part0 >> 14) | 0x80);
	size2 : target[1] = static_cast<uint8>((part0 >>  7) | 0x80);
	size1 : target[0] = static_cast<uint8>((part0      ) | 0x80);
	
	target[size-1] &= 0x7F;
	return target + size;
}

Google Procol Buffers reader.

bool CodedInputStream::ReadVarint64Fallback(uint64* value) {
	if (BufferSize() >= kMaxVarintBytes ||
		// Optimization:  If the varint ends at exactly the end of the buffer,
		// we can detect that and still use the fast path.
		(buffer_end_ > buffer_ && !(buffer_end_[-1] & 0x80))) {
		// Fast path:  We have enough bytes left in the buffer to guarantee that
		// this read won't cross the end, so we can skip the checks.
		
		const uint8* ptr = buffer_;
		uint32 b;
		
		// Splitting into 32-bit pieces gives better performance on 32-bit
		// processors.
		uint32 part0 = 0, part1 = 0, part2 = 0;
		
		b = *(ptr++); part0  = (b & 0x7F)      ; if (!(b & 0x80)) goto done;
		b = *(ptr++); part0 |= (b & 0x7F) <<  7; if (!(b & 0x80)) goto done;
		b = *(ptr++); part0 |= (b & 0x7F) << 14; if (!(b & 0x80)) goto done;
		b = *(ptr++); part0 |= (b & 0x7F) << 21; if (!(b & 0x80)) goto done;
		b = *(ptr++); part1  = (b & 0x7F)      ; if (!(b & 0x80)) goto done;
		b = *(ptr++); part1 |= (b & 0x7F) <<  7; if (!(b & 0x80)) goto done;
		b = *(ptr++); part1 |= (b & 0x7F) << 14; if (!(b & 0x80)) goto done;
		b = *(ptr++); part1 |= (b & 0x7F) << 21; if (!(b & 0x80)) goto done;
		b = *(ptr++); part2  = (b & 0x7F)      ; if (!(b & 0x80)) goto done;
		b = *(ptr++); part2 |= (b & 0x7F) <<  7; if (!(b & 0x80)) goto done;
		
		// We have overrun the maximum size of a varint (10 bytes).  The data
		// must be corrupt.
		return NULL;
		
	done:
		Advance(ptr - buffer_);
		*value = (static_cast<uint64>(part0)) |
			(static_cast<uint64>(part1) << 28) |
			(static_cast<uint64>(part2) << 56);
		return true;
	} else {
		return ReadVarint64Slow(value);
	}
}

Looking at the comments, a lot of thought went into the performance. Loops are unrolled and the code is filled with goto's.

Unicode

This list would not be complete without to of the most popular variable sized integer coders, UTF-8 and UTF-16. Yes, that’s right, these unicode encodings can also be seen as varint coders. They encode the valid unicode codepoints in the range 0–10FF₁₆ to sequences of bytes or words.

UTF-8 format:

value encoded as

< 2⁷ 0xxx xxxx₂ < 2¹¹ 110x xxxx 10xx xxxx₂ < 2¹⁶ 1110 xxxx 10xx xxxx 10xx xxxx₂ < 2²¹ 1111 0xxx 10xx xxxx 11xx xxxx 10xx xxxx₂

This format is a lot less size efficient than the previous ones, but it satisfies a few additional constraints due to ascii compatibility. Values less than 127, the ascii values, are encoded as themselves and non ascii values are encoded using only non-ascii bytes. Furthermore the start of a sequence can always be identified by the bitstring 11xx xxxx, this allows a decoder to resynchronize on a corrupt bitstream.

The UTF-16 format has a preprocessing step, if the value is higher than 10000₁₆ then 10000₁₆ is subtracted from the value befor encoding using the pattern in the second row otherwise the value is written out plainly:

UTF-16 format:

value encoded as

< 2¹⁶ xxxx xxxx xxxx xxxx₂ < 2²⁰ 1110 10xx xxxx xxxx 1110 11xx xxxx xxxx₂

To let the decoder identify between the first and second row, the first row must not start with 1110 10₂! With a nice disregard for sepparation of concerns the Unicode standard was modified to exclude the whole range 1110 1xxx xxxx xxxx solely for UTF-16. In fact, UTF-8 originally went up to 31 bits, but was limited to 21 bits to conform with UTF-16.

Improving

Operations on 64-bit unsigned integers are allowed
The buffer is overallocated eight bytes
Very fast encoding/decoding
Encode whole number of bytes
Encode minimal number of bytes
Bijective encoding (every value has a unique encoding)

One can learn a few things from the the three formats above. The EBML and UTF-8 format have the advantage that the the total number of bytes to read is known after the first byte is read, the Google format lacks this advantage. However, Google’s format shares with UTF-8 that values less than 127 are encoded as themselfs, this is fundamental to the ascii-compatibility of UTF-8. There are both pro’s and con’s to this approach, the pro is that users with non-unicode tools might be able to read the file, the con is that user might not feel inclined to update their tools untill something breaks horribly.

I personally shun backwards compatibility, it often does more harm than good. Besides, a non compatible solution saves me one instruction on my processor. Therefore my format will be the same as EBML's. (Should you want the backwards compatibility, you can reverse the ones and zeros in the prefix and modify the source accordingly).

custom varint format:

value encoded as

Loop free, almost branch free

Now that the format is settled, how are we going to encode/decode it efficiently? The EBML implementation is filled with loops within loops, that can never be efficient. The Google one solves this by unrolling the loop and jumping into it using a decission tree. But both solution still require conditional branching wich can cause branch misprediction and the associated performance penalty of ten to twenty cycles.

So I would like to avoid fors and' ifs where possible. And in fact this can be done to a large degree! Using GCC’s builtin function __builtin_clzll we can quickly count the number of zero bits left of the most significant one bit. Another usefull function is builtin_bswap64, which converts between little-endian and big-endian. I’ve wrapped the builtin functions in leading_zeros and byte_swap so one can provide compiler independent implementations for them. Additionally GCC’s __builtin_clzll is undefined for the value zero, which I would like to be set to 64. Testing shows that in practice it returns zero on zero input, the following corrects that:

inline int leading_zeros(uint64 a){return (a)?__builtin_clzll(a):64;}
inline uint64 byte_swap(uint64 a){return __builtin_bswap64(a);}
#define expect_false(cond) __builtin_expect(cond, false)
#define expect_true(cond) __builtin_expect(cond, true)

I have also defined two macro’s, expect_false and expect_true which give the compiler a clue about wich branch is more likely, these are purely optional. With these functions one can construct the following variabel integer encoder for the format given above:

void write(char*& buffer, uint64 value) {
	int num_bytes = 9 - leading_zeros(value) / 7;
	if(expect_true(num_bytes < 9)) {
		value <<= 63 - num_bytes * 7;
		value |= 1ul << (64 - num_bytes);
		value = byte_swap(value);
		*reinterpret_cast<uint64*>(buffer) = value;
		buffer += num_bytes;
		return;
	}
	*buffer = 0x00;
	value = byte_swap(value);
	*reinterpret_cast<uint64*>(buffer + 1) = value;
	buffer += 9;
}

Line 2 calculates the number of bytes required using that we encode the value in groups of seven bits. The number of bits in to write down a value $x$ in binary is $⌊\log₂ x⌋$ . Note that leading_zeros returns the value $64 - ⌊\log₂ x⌋$ . Which gives the following derivation for the number of seven bit blocks required:

$⌈ \frac{⌊\log₂ x⌋}7 ⌉ ≟ 9 - ⌊ \frac{64 - ⌊\log₂ x⌋}7 ⌋$

Lines 4–9 handle the case where the encoded value fits in a uint64. Line 4 shifts the value to the most significant bytes. Line 5 adds the one bit that marks the number of bytes. Line 6 converts the in memory representation from little-endian to big-endian. Line 7 writes the eight bytes of value in one go by reinterpreting the buffer as an uint64 array. This will always write eight bytes to the buffer, but we have the assumption that this is can always be done. Line 8 increments the buffer pointer by the correct amount.

Lines 11–15 handle the very large value case, which is unlikely to happen. An all-zero prefix byte is written and the big endian value is written in one go.

uint64 read(char*& buffer) {
	uint64 value = *reinterpret_cast<uint64*>(buffer);
	value = byte_swap(value);
	int num_bytes = leading_zeros(value);
	buffer += leading_zeros + 1;
	if(expect_true(leading_zeros < 8)) {
		value <<= leading_zeros;
		value += offset << (56 - leading_zeros * 7);
		value >>= 56 - leading_zeros * 7;
		return value;
	}
	if(expect_false(num_bytes > 8))
		throw "Overflow: value ≧ 2⁶⁴";
	value = *reinterpret_cast<uint64*>(buffer - 8);
	value = byte_swap(value);
	return value;
}

Removing ambiguity

// This constant will add the proper offsets for all the different
// cases and move the leading one one position to the right
const uint64 offset = 0x0102040810204080;

void write(char*& buffer, uint64 value) {
	value -= offset;
	int num_bytes = 9 - leading_zeros(~value ^ offset) / 7;
	if(expect_true(num_bytes)) {
		value <<= 63 - num_bytes * 7;
		value >>= num_bytes - 1;
		value = byte_swap(value);
		*reinterpret_cast<uint64*>(buffer) = value;
		buffer += num_bytes;
		return;
	}
	*buffer = 0x00;
	value = byte_swap(value);
	*reinterpret_cast<uint64*>(buffer + 1) = value;
	buffer += 9;
}

uint64 read(char*& buffer) {
	uint64 value = *reinterpret_cast<uint64*>(buffer);
	value = byte_swap(value);
	int leading_zeros = leading_zeros(value);
	buffer += leading_zeros + 1;
	if(expect_true(leading_zeros < 8)) {
		value <<= leading_zeros;
		value += offset << (56 - leading_zeros * 7);
		value >>= 56 - leading_zeros * 7;
		return value;
	}
	value = *reinterpret_cast<uint64*>(buffer - 8);
	value = byte_swap(value);
	value += offset;
	if(expect_false(num_bytes > 8 || value < offset))
		throw "Overflow: value ≧ 2⁶⁴";
	return value;
}

Optimizing the Google format.

int num_bytes = leading_zeros(~value & 0x8080808080808080);

Variable Sized Integers

Existing work

EBML (Matroska)

Google Protocol Buffers

Unicode

Improving

Loop free, almost branch free

Removing ambiguity

Benchmarking

Written size

Writing time

Reading time