Variable Sized Integers
% Remco Bloemen
% 2014-02-27
In this article I’ll show different methods to compactly encode integers which are likely to be small. In the process I’ll show some neat bit-fiddling techniques and I’ll try to develop a coding that is both faster and denser than the existing ones.
Existing work
Variable sized integer codings are used in several existing projects, usually when very efficient use of storage space is required and large values can occur but small values are much more frequent. Typical cases would be content sizes, or anything that follows a power-law distribution, which according to Zipf’s law happens very often. I’ll start by showing three well-known variable integer coders.
EBML (Matroska)
The Matroska file format, with file extension .mkv, is a popular choice for high definition video content. Matroska is a modern and flexible container format based on the EBML specification. An EBML formatted file consists of a stream of key, size, value triples known as ‘tags’. Here the key is a one to four byte identifier and the size is the varint encoded length of the value. These tags can be nested, making the format hierarchical and easily extended.
The varint is between one and eight bytes long; the number of leading zero bits in the first byte determines the total number of bytes used in the encoding.
EBML varint format:
value encoded as
< 2⁷ 1xxx xxxx₂
< 2¹⁴ 01xx xxxx xxxx xxxx₂
< 2²¹ 001x xxxx xxxx xxxx xxxx xxxx₂
< 2²⁸ 0001 xxxx xxxx xxxx xxxx xxxx xxxx xxxx₂
⋮ ⋮
< 2⁵⁶ 0000 0001 xxxx xxxx …(40 bits)… xxxx xxxx₂
Note that EBML uses this format to encode tags and field sizes, not to encode integer valued fields. For example, encoding an integer valued field with value 123₁₆ results in a size of two bytes, which is varint encoded as 82₁₆ and prefixed to the integer’s bytes, giving the output 820123₁₆. Since the size is 0–8 bytes it is always encoded as a single byte, so the net effect is a one-byte prefix.
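To make the layout concrete, here is a small sketch of how such an integer field could be serialized. The helper write_uint_field is my own illustration, not the libebml API; it reproduces the 820123₁₆ example above:
#include <cstdint>
#include <vector>

std::vector<uint8_t> write_uint_field(uint64_t value)
{
    // Strip leading zero bytes; a value of zero takes zero bytes.
    int size = 8;
    while (size > 0 && (value >> (8 * (size - 1))) == 0)
        --size;
    std::vector<uint8_t> out;
    out.push_back(0x80 | size);                    // size 0-8 fits in a one byte varint
    for (int i = size - 1; i >= 0; --i)
        out.push_back((value >> (8 * i)) & 0xFF);  // big-endian value bytes
    return out;
}
Calling write_uint_field(0x123) yields the bytes 82 01 23.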
The varint coder is defined in EbmlElement.cpp, but the original code is too complicated to show here. It contains special cases for the end of the read/write buffer, some minor variants for signed integers and special values, and it is also rather entangled with other functions. I will therefore show a simplified version, with my own variable names and comments for clarity. I did not try to optimize the functions. Please refer to the original source code at the links above before making any judgements.
EBML write function (simplified).
void write(char*& buffer, uint64 value)
{
// Determine the number of bytes to write
int num_bytes;
if (value <= 127) // 2^7 - 1
num_bytes = 1;
else if (value <= 16383) // 2^14 - 1
num_bytes = 2;
else if (value <= 2097151L) // 2^21 - 1
num_bytes = 3;
else if (value <= 268435455L) // 2^28 - 1
num_bytes = 4;
else num_bytes = 5; // see note below
// Write the bytes
int mask = 0xFF;
buffer[0] = 1 << (8 - num_bytes);
for (int i = 1; i < num_bytes; i++) {
buffer[num_bytes - i] = value & 0xFF;
value >>= 8;
mask >>= 1;
}
buffer[0] |= value & 0xFF & mask;
buffer += num_bytes;
}
Higher cases were marked as "TODO" in the original code.
EBML read function (simplified).
uint64 read(const char*& buffer)
{
char mask = 1 << 7;
char size_buffer[8];
int num_bytes = 0;
int index = 0;
uint64 value = 0;
for (index = 0; index < 8; ++index) {
if (buffer[0] & (mask >> index)) {
mask >>= index;
num_bytes = index + 1;
for (index = 0; index < num_bytes; ++index) {
size_buffer[index] = buffer[index];
}
for (index = 0; index < num_bytes - 1; ++index) {
value <<= 7;
value |= 0xFF;
}
value = 0;
value |= size_buffer[0] & ~mask;
for (unsigned int i = 1; i < num_bytes; ++i) {
value <<= 8;
value |= size_buffer[i];
}
buffer += num_bytes;
return value;
}
}
// Handle overflow
}
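To illustrate how the char*& interface is meant to be used, here is a small round-trip sketch with the simplified functions above. This is my own example; the byte values in the comments follow the table:
#include <cassert>

void roundtrip_example()
{
    char storage[64];
    char* wp = storage;
    write(wp, 5);          // 1 byte:  85
    write(wp, 300);        // 2 bytes: 41 2C
    write(wp, 1000000);    // 3 bytes: 2F 42 40

    const char* rp = storage;
    assert(read(rp) == 5);
    assert(read(rp) == 300);
    assert(read(rp) == 1000000);
    assert(rp == wp);      // both pointers end up just past the last varint
}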
Google Protocol Buffers
Google Protocol Buffers is designed as a highly efficient alternative to XML for data exchange between processes and for data storage. It is apparently the default remote procedure call protocol for Google’s internal software, so one can assume that the code is well designed and provides a significant improvement over existing alternatives. A fundamental part of the specification is "base 128 varints". These varints are the default encoding format for integer field types and they are also used to encode data lengths.
The varint format consists of groups of eight bits, i.e. bytes. The most significant bit in each byte is a flag that is one when more groups are coming and zero when this was the last group. The other seven bits are part of the encoded number, but counterintuitively the format starts with the least significant group of seven bits.
Google Protocol Buffers varint format:
value encoded as
< 2⁷ 0xxx xxxx₂
< 2¹⁴ 1xxx xxxx 0xxx xxxx₂
< 2²¹ 1xxx xxxx 1xxx xxxx 0xxx xxxx₂
< 2²⁸ 1xxx xxxx 1xxx xxxx 1xxx xxxx 0xxx xxxx₂
⋮ ⋮
< 2⁶⁴ 1xxx xxxx 1xxx xxxx …(48 bits)… 0xxx xxxx₂
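Before diving into Google’s heavily optimized code, it may help to see the format written out as a naive loop. This is just a sketch of the byte layout described above; write_varint and read_varint are my own helper names, not the protobuf API:
#include <cstdint>

char* write_varint(char* out, uint64_t value)
{
    while (value >= 0x80) {
        *out++ = static_cast<char>((value & 0x7F) | 0x80);  // lower 7 bits, more to come
        value >>= 7;
    }
    *out++ = static_cast<char>(value & 0x7F);                // last group, high bit clear
    return out;
}

const char* read_varint(const char* in, uint64_t& value)
{
    value = 0;
    for (int shift = 0; shift < 64; shift += 7) {
        uint8_t byte = static_cast<uint8_t>(*in++);
        value |= static_cast<uint64_t>(byte & 0x7F) << shift;
        if (!(byte & 0x80))
            break;                                           // high bit clear: last group
    }
    return in;
}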
The code has some special cases at the end of the buffer and for 32-bit and signed integers which I will not go into. When the pointer is not too close to the end of the buffer, decoding is done using CodedInputStream::ReadVarint64Fallback and encoding is done using CodedOutputStream::WriteVarint64ToArrayInline. The functions are defined as follows:
Google Protocol Buffers writer.
inline uint8* CodedOutputStream::WriteVarint64ToArrayInline(
uint64 value, uint8* target)
{
// Splitting into 32-bit pieces gives better performance on 32-bit
// processors.
uint32 part0 = static_cast<uint32>(value );
uint32 part1 = static_cast<uint32>(value >> 28);
uint32 part2 = static_cast<uint32>(value >> 56);
int size;
// Here we can't really optimize for small numbers, since the value is
// split into three parts. Checking for numbers < 128, for instance,
// would require three comparisons, since you'd have to make sure part1
// and part2 are zero. However, if the caller is using 64-bit integers,
// it is likely that they expect the numbers to often be very large, so
// we probably don't want to optimize for small numbers anyway. Thus,
// we end up with a hardcoded binary search tree...
if (part2 == 0) {
if (part1 == 0) {
if (part0 < (1 << 14)) {
if (part0 < (1 << 7)) {
size = 1; goto size1;
} else {
size = 2; goto size2;
}
} else {
if (part0 < (1 << 21)) {
size = 3; goto size3;
} else {
size = 4; goto size4;
}
}
} else {
if (part1 < (1 << 14)) {
if (part1 < (1 << 7)) {
size = 5; goto size5;
} else {
size = 6; goto size6;
}
} else {
if (part1 < (1 << 21)) {
size = 7; goto size7;
} else {
size = 8; goto size8;
}
}
}
} else {
if (part2 < (1 << 7)) {
size = 9; goto size9;
} else {
size = 10; goto size10;
}
}
GOOGLE_LOG(FATAL) << "Can't get here.";
size10: target[9] = static_cast<uint8>((part2 >> 7) | 0x80);
size9 : target[8] = static_cast<uint8>((part2 ) | 0x80);
size8 : target[7] = static_cast<uint8>((part1 >> 21) | 0x80);
size7 : target[6] = static_cast<uint8>((part1 >> 14) | 0x80);
size6 : target[5] = static_cast<uint8>((part1 >> 7) | 0x80);
size5 : target[4] = static_cast<uint8>((part1 ) | 0x80);
size4 : target[3] = static_cast<uint8>((part0 >> 21) | 0x80);
size3 : target[2] = static_cast<uint8>((part0 >> 14) | 0x80);
size2 : target[1] = static_cast<uint8>((part0 >> 7) | 0x80);
size1 : target[0] = static_cast<uint8>((part0 ) | 0x80);
target[size-1] &= 0x7F;
return target + size;
}
Google Protocol Buffers reader.
bool CodedInputStream::ReadVarint64Fallback(uint64* value) {
if (BufferSize() >= kMaxVarintBytes ||
// Optimization: If the varint ends at exactly the end of the buffer,
// we can detect that and still use the fast path.
(buffer_end_ > buffer_ && !(buffer_end_[-1] & 0x80))) {
// Fast path: We have enough bytes left in the buffer to guarantee that
// this read won't cross the end, so we can skip the checks.
const uint8* ptr = buffer_;
uint32 b;
// Splitting into 32-bit pieces gives better performance on 32-bit
// processors.
uint32 part0 = 0, part1 = 0, part2 = 0;
b = *(ptr++); part0 = (b & 0x7F) ; if (!(b & 0x80)) goto done;
b = *(ptr++); part0 |= (b & 0x7F) << 7; if (!(b & 0x80)) goto done;
b = *(ptr++); part0 |= (b & 0x7F) << 14; if (!(b & 0x80)) goto done;
b = *(ptr++); part0 |= (b & 0x7F) << 21; if (!(b & 0x80)) goto done;
b = *(ptr++); part1 = (b & 0x7F) ; if (!(b & 0x80)) goto done;
b = *(ptr++); part1 |= (b & 0x7F) << 7; if (!(b & 0x80)) goto done;
b = *(ptr++); part1 |= (b & 0x7F) << 14; if (!(b & 0x80)) goto done;
b = *(ptr++); part1 |= (b & 0x7F) << 21; if (!(b & 0x80)) goto done;
b = *(ptr++); part2 = (b & 0x7F) ; if (!(b & 0x80)) goto done;
b = *(ptr++); part2 |= (b & 0x7F) << 7; if (!(b & 0x80)) goto done;
// We have overrun the maximum size of a varint (10 bytes). The data
// must be corrupt.
return NULL;
done:
Advance(ptr - buffer_);
*value = (static_cast<uint64>(part0)) |
(static_cast<uint64>(part1) << 28) |
(static_cast<uint64>(part2) << 56);
return true;
} else {
return ReadVarint64Slow(value);
}
}
Looking at the comments, a lot of thought went into the performance. Loops are unrolled and the code is filled with gotos.
Unicode
This list would not be complete without two of the most popular variable sized integer coders, UTF-8 and UTF-16. Yes, that’s right, these Unicode encodings can also be seen as varint coders. They encode the valid Unicode codepoints in the range 0–10FFFF₁₆ as sequences of bytes or words.
UTF-8 format:
value encoded as
< 2⁷ 0xxx xxxx₂
< 2¹¹ 110x xxxx 10xx xxxx₂
< 2¹⁶ 1110 xxxx 10xx xxxx 10xx xxxx₂
< 2²¹ 1111 0xxx 10xx xxxx 10xx xxxx 10xx xxxx₂
This format is a lot less size efficient than the previous ones, but it satisfies a few additional constraints due to ASCII compatibility. Values up to 127, the ASCII values, are encoded as themselves and non-ASCII values are encoded using only non-ASCII bytes. Furthermore, the start of a multi-byte sequence can always be identified by the bit pattern 11xx xxxx₂, which allows a decoder to resynchronize on a corrupted bitstream.
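As a sketch, the table above translates into the following encoder. The helper utf8_encode is my own illustration and omits validation of surrogates and the 10FFFF₁₆ upper bound:
#include <cstdint>

int utf8_encode(char* out, uint32_t cp)
{
    if (cp < 0x80) {                    // 0xxx xxxx
        out[0] = static_cast<char>(cp);
        return 1;
    } else if (cp < 0x800) {            // 110x xxxx 10xx xxxx
        out[0] = static_cast<char>(0xC0 | (cp >> 6));
        out[1] = static_cast<char>(0x80 | (cp & 0x3F));
        return 2;
    } else if (cp < 0x10000) {          // 1110 xxxx 10xx xxxx 10xx xxxx
        out[0] = static_cast<char>(0xE0 | (cp >> 12));
        out[1] = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out[2] = static_cast<char>(0x80 | (cp & 0x3F));
        return 3;
    } else {                            // 1111 0xxx 10xx xxxx 10xx xxxx 10xx xxxx
        out[0] = static_cast<char>(0xF0 | (cp >> 18));
        out[1] = static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out[2] = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out[3] = static_cast<char>(0x80 | (cp & 0x3F));
        return 4;
    }
}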
The UTF-16 format has a preprocessing step: if the value is 10000₁₆ or higher, then 10000₁₆ is subtracted from it before encoding using the pattern in the second row; otherwise the value is written out plainly:
UTF-16 format:
value encoded as
< 2¹⁶ xxxx xxxx xxxx xxxx₂
< 2²⁰ 1101 10xx xxxx xxxx 1101 11xx xxxx xxxx₂
To let the decoder distinguish between the first and second row, the first row must not start with 1101 1₂! With a nice disregard for separation of concerns, the Unicode standard was modified to exclude the whole range 1101 1xxx xxxx xxxx₂ solely for UTF-16. In fact, UTF-8 originally went up to 31 bits, but was limited to 21 bits to conform with UTF-16.
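For completeness, a sketch of the UTF-16 rule just described. The helper utf16_encode is my own illustration:
#include <cstdint>

int utf16_encode(uint16_t* out, uint32_t cp)
{
    if (cp < 0x10000) {
        out[0] = static_cast<uint16_t>(cp);                  // plain 16-bit value
        return 1;
    }
    cp -= 0x10000;                                           // now a 20-bit value
    out[0] = static_cast<uint16_t>(0xD800 | (cp >> 10));     // 1101 10 + high 10 bits
    out[1] = static_cast<uint16_t>(0xDC00 | (cp & 0x3FF));   // 1101 11 + low 10 bits
    return 2;
}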
Improving
For the new coder I will assume the following requirements and constraints:
- Operations on 64-bit unsigned integers are allowed
- The buffer is overallocated by eight bytes
- Very fast encoding/decoding
- Encode a whole number of bytes
- Encode the minimal number of bytes
- Bijective encoding (every value has a unique encoding)
One can learn a few things from the three formats above. The EBML and UTF-8 formats have the advantage that the total number of bytes to read is known after the first byte is read; the Google format lacks this advantage. However, Google’s format shares with UTF-8 that values less than 128 are encoded as themselves, which is fundamental to the ASCII compatibility of UTF-8. There are both pros and cons to this approach: the pro is that users with non-Unicode tools might be able to read the file, the con is that users might not feel inclined to update their tools until something breaks horribly.
I personally shun backwards compatibility; it often does more harm than good. Besides, a non-compatible solution saves me one instruction on my processor. Therefore my format will be the same as EBML’s, extended with a ninth length that uses an all-zero prefix byte to cover the full 64-bit range. (Should you want the backwards compatibility, you can reverse the ones and zeros in the prefix and modify the source accordingly.)
custom varint format:
value encoded as
< 2⁷ 1xxx xxxx₂
< 2¹⁴ 01xx xxxx xxxx xxxx₂
< 2²¹ 001x xxxx xxxx xxxx xxxx xxxx₂
< 2²⁸ 0001 xxxx xxxx xxxx xxxx xxxx xxxx xxxx₂
⋮ ⋮
< 2⁶⁴ 0000 0000 xxxx xxxx …(48 bits)… xxxx xxxx₂
Loop free, almost branch free
Now that the format is settled, how are we going to encode and decode it efficiently? The EBML implementation is filled with loops within loops, which can never be efficient. The Google one solves this by unrolling the loop and jumping into it using a decision tree. But both solutions still require conditional branching, which can cause branch misprediction and the associated performance penalty of ten to twenty cycles.
So I would like to avoid fors and ifs where possible. And in fact this can be done to a large degree! Using GCC’s builtin function __builtin_clzll we can quickly count the number of zero bits left of the most significant one bit. Another useful function is __builtin_bswap64, which converts between little-endian and big-endian. I’ve wrapped the builtin functions in leading_zeros and byte_swap so one can provide compiler independent implementations for them. Additionally, GCC’s __builtin_clzll is undefined for the value zero, for which I would like it to return 64. Testing shows that in practice it returns zero on zero input; the following wrapper corrects that:
inline int leading_zeros(uint64 a){return (a)?__builtin_clzll(a):64;}
inline uint64 byte_swap(uint64 a){return __builtin_bswap64(a);}
#define expect_false(cond) __builtin_expect(cond, false)
#define expect_true(cond) __builtin_expect(cond, true)
I have also defined two macros, expect_false and expect_true, which give the compiler a clue about which branch is more likely; these are purely optional. With these functions one can construct the following variable integer encoder for the format given above:
void write(char*& buffer, uint64 value) {
    int num_bytes = 1 + (63 - leading_zeros(value)) / 7;
    if(expect_true(num_bytes < 9)) {
        value <<= 64 - num_bytes * 8;
        value |= 1ul << (64 - num_bytes);
        value = byte_swap(value);
        *reinterpret_cast<uint64*>(buffer) = value;
        buffer += num_bytes;
        return;
    }
    *buffer = 0x00;
    value = byte_swap(value);
    *reinterpret_cast<uint64*>(buffer + 1) = value;
    buffer += 9;
}
Line 2 calculates the number of bytes required, using that we encode the value in groups of seven bits. The number of bits needed to write down a value x in binary is ⌊\log₂ x⌋ + 1, and leading_zeros returns the value 63 - ⌊\log₂ x⌋. Using ⌈m/7⌉ = ⌊(m + 6)/7⌋, the number of seven bit groups required is
⌈ \frac{⌊\log₂ x⌋ + 1}7 ⌉ = 1 + ⌊ \frac{⌊\log₂ x⌋}7 ⌋ = 1 + ⌊ \frac{63 - leading\_zeros(x)}7 ⌋
(For x = 0 the wrapper returns 64 and the integer division truncates −1/7 to zero, so a single byte is used.)
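As a sanity check of this identity, one can brute force compare it against simply counting seven bit groups, over the smallest and largest value of every bit length. This is my own test sketch, reusing the leading_zeros wrapper and uint64 type from above:
#include <cassert>

void check_num_bytes()
{
    for (int bits = 0; bits <= 64; ++bits) {
        // Smallest and largest value with this many significant bits.
        uint64 lo = bits ? 1ul << (bits - 1) : 0;
        uint64 hi = bits ? lo + (lo - 1) : 0;
        uint64 samples[2] = { lo, hi };
        for (uint64 value : samples) {
            int naive = 1;                               // at least one seven bit group
            while (naive < 9 && (value >> (7 * naive)) != 0)
                ++naive;                                 // grow until the value fits
            int closed = 1 + (63 - leading_zeros(value)) / 7;
            if (closed > 9) closed = 9;                  // nine bytes hold all 64 bits
            assert(closed == naive);
        }
    }
}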
Lines 4–9 handle the case where the encoded value fits in a uint64. Line 4 shifts the value to the most significant bytes. Line 5 adds the one bit that marks the number of bytes. Line 6 converts the in-memory representation from little-endian to big-endian. Line 7 writes the eight bytes of value in one go by reinterpreting the buffer as a uint64 array. This will always write eight bytes to the buffer, but we assumed above that this can always be done because the buffer is overallocated. Line 8 increments the buffer pointer by the correct amount.
Lines 11–15 handle the very large value case, which is unlikely to happen. An all-zero prefix byte is written and the big-endian value is written in one go. The matching read function reverses these steps:
uint64 read(char*& buffer) {
    uint64 value = *reinterpret_cast<uint64*>(buffer);
    value = byte_swap(value);
    int num_bytes = leading_zeros(value) + 1;
    if(expect_true(num_bytes <= 8)) {
        buffer += num_bytes;
        value <<= num_bytes;           // drop the length prefix
        value >>= 64 - num_bytes * 7;  // drop the trailing bytes
        return value;
    }
    // Nine byte case: skip the all-zero prefix byte and read the raw value
    value = *reinterpret_cast<uint64*>(buffer + 1);
    value = byte_swap(value);
    buffer += 9;
    return value;
}
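A small round-trip sketch exercising both functions. This is my own test; it relies on the eight byte overallocation of the buffer assumed in the requirements above:
#include <cassert>

void roundtrip()
{
    char storage[128];   // generous: at most nine bytes per value plus eight bytes slack
    uint64 values[] = { 0, 127, 128, 300, 1ul << 20, 1ul << 55, ~0ul };
    char* wp = storage;
    for (uint64 v : values)
        write(wp, v);
    char* rp = storage;
    for (uint64 v : values)
        assert(read(rp) == v);
    assert(rp == wp);    // both pointers end up just past the last varint
}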
Removing ambiguity
In the format above a value does not have a unique encoding: zero, for example, can be stored in one byte as 80₁₆ but also in two bytes as 4000₁₆. This wastes code space and violates the bijective encoding requirement listed earlier. The remedy is to offset each length’s range by the number of values that fit in shorter encodings, so that the two byte encodings start at 2⁷, the three byte encodings at 2⁷ + 2¹⁴, and so on. The constant below packs these cumulative offsets into a single word.
// This constant will add the proper offsets for all the different
// cases and move the leading one one position to the right
const uint64 offset = 0x0102040810204080;
void write(char*& buffer, uint64 value) {
value -= offset;
int num_bytes = 9 - leading_zeros(~value ^ offset) / 7;
if(expect_true(num_bytes < 9)) {
value <<= 63 - num_bytes * 7;
value >>= num_bytes - 1;
value = byte_swap(value);
*reinterpret_cast<uint64*>(buffer) = value;
buffer += num_bytes;
return;
}
*buffer = 0x00;
value = byte_swap(value);
*reinterpret_cast<uint64*>(buffer + 1) = value;
buffer += 9;
}
uint64 read(char*& buffer) {
    uint64 value = *reinterpret_cast<uint64*>(buffer);
    value = byte_swap(value);
    int num_zeros = leading_zeros(value);
    buffer += num_zeros + 1;
    if(expect_true(num_zeros < 8)) {
        value <<= num_zeros;
        value += offset << (56 - num_zeros * 7);
        value >>= 56 - num_zeros * 7;
        return value;
    }
    value = *reinterpret_cast<uint64*>(buffer - 8);
    value = byte_swap(value);
    value += offset;
    if(expect_false(num_zeros > 8 || value < offset))
        throw "Overflow: value ≧ 2⁶⁴";
    return value;
}
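To see what the offsets achieve, the following sketch (my own illustration) lists the value range that each encoded length covers once the cumulative offsets are applied. After the eighth iteration the lower bound is exactly the constant 0102040810204080₁₆ defined above:
#include <cstdint>
#include <cstdio>

void print_ranges()
{
    uint64_t lo = 0;
    for (int n = 1; n <= 8; ++n) {
        uint64_t count = 1ull << (7 * n);        // payload combinations of n bytes
        std::printf("%d bytes encode [%llu, %llu]\n", n,
                    (unsigned long long)lo, (unsigned long long)(lo + count - 1));
        lo += count;                             // this becomes the offset for n + 1 bytes
    }
    std::printf("9 bytes encode [%llu, 2^64 - 1]\n", (unsigned long long)lo);
}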
Optimizing the Google format.
int num_bytes = leading_zeros(~value & 0x8080808080808080);