2010-01-30

Working around bugs in the C++ standard

% Remco Bloemen % 2010-01-30

Anyone serious about high-performance computing will sooner or later start programming in C++. The language is powerful and flexible, and it has a large collection of libraries. But it is also old and complicated, and thus full of annoying little sub-optimalities. In this post I will explore some of those awkward spots and describe ways to work around them.

Cross platform integer types

The first oddity in the standard is that C++ does not fix how many bits there are in an int; it only guarantees a minimum of 16. I’m quite fond of mathematical computation and bit-twiddling algorithms, which depend on exact widths, so here is the workaround:

#include<stdint.h>

// Remove those ugly _t's…
typedef uint32_t uint32;
typedef int32_t sint32;
typedef uint64_t uint64;
typedef int64_t sint64;

In the future you can use cstdint instead of stdint.h. The cstdint header puts everything in the std namespace; other than that I don’t think it will differ.
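Here is a small sketch of what the fixed-width types buy you, assuming a C++11 compiler where cstdint and static_assert are available. The helper rotateLeft is a hypothetical example of bit twiddling, not something from the standard library; it is only correct because the width of uint32 is known exactly.

```cpp
#include <climits>   // CHAR_BIT
#include <cstdint>   // std::uint32_t, std::uint64_t (C++11)

// Same short aliases as before, but taken from the std namespace.
typedef std::uint32_t uint32;
typedef std::uint64_t uint64;

// Fail at compile time if the widths are not what the names promise.
static_assert(sizeof(uint32) * CHAR_BIT == 32, "uint32 must be 32 bits");
static_assert(sizeof(uint64) * CHAR_BIT == 64, "uint64 must be 64 bits");

// Hypothetical helper: rotate a 32-bit word left by n bits.
// The masks keep both shift counts in the range 0..31, so n == 0
// and n == 32 are handled without an out-of-range shift.
inline uint32 rotateLeft(uint32 x, unsigned n)
{
	n &= 31;
	return (x << n) | (x >> ((32 - n) & 31));
}
```

With a plain unsigned int the same function would silently give different results on a platform where int is not 32 bits wide.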

It’s also interesting to note how the standard library combines underscores, which I find too verbose for variable names, with mnemonics like mem_fn and codecvt, which I find too concise. The rest of the modern world uses CamelCase with ReadableAndDescriptiveNames.

Unicode support

When programming, only two text encodings are relevant: UTF-8 and whatever the local environment considers the default encoding. When working with western scripts or computer languages (source files, HTML, etc.), UTF-8 gives the smallest encoding. Since most network and disk operations are bandwidth-limited, it is also the fastest encoding in a lot of cases.

In the case of Unicode in C++ a third one comes into play: UTF-32. This encoding stores the code points as 32-bit integers in the platform’s native byte order. This has the advantage that there is hardly any encoding/decoding overhead. The idea is that every character has a fixed-length encoding, but this is an illusion, since combining characters mean that one visible character can span several code points.
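To make the combining-character problem concrete, here is a minimal sketch. It assumes a C++11 compiler, because it uses char32_t and std::u32string rather than the wchar_t strings discussed in this post; the names precomposed and decomposed are just illustrative.

```cpp
#include <cstddef>
#include <string>

// Counts UTF-32 code units, which equals the number of code points.
// This is NOT the number of characters a reader would see.
inline std::size_t codePointCount(const std::u32string& text)
{
	return text.length();
}

// "é" as one precomposed code point: U+00E9.
const std::u32string precomposed = U"\u00E9";

// "é" as "e" followed by U+0301 COMBINING ACUTE ACCENT.
const std::u32string decomposed = U"e\u0301";
```

Both strings render as the same single character, yet codePointCount gives 1 for the first and 2 for the second, and a plain comparison treats them as different strings.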

I’m still not entirely convinced that UTF-32 is the best encoding for strings in memory. A western/computer script encoded in UTF-32 is roughly 4× the size of the UTF-8 encoding. Since memory bandwidth is quite limited, UTF-32 might be slower to process than UTF-8. I’ll have to benchmark this sometime.

For now it seems that C++'s native (read: least unsupported) Unicode encoding is UTF-32, at least on platforms where wchar_t is 32 bits wide. Such strings are constructed as wchar_t* or wstring. However, you will often need to convert to and from char* and string, for example to decode the arguments to main. (Also note how C++ has full exception handling support, yet std::codecvt::in still returns error codes.) The following function does such a decoding:

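#include <cstring>
#include <cwchar>
#include <locale>
#include <stdexcept>
#include <string>
#include <vector>

std::wstring decodeLocale(const std::string& encoded)
{
	typedef std::codecvt<wchar_t, char, std::mbstate_t> converter;
	const std::size_t length = encoded.length();
	// A default-constructed locale is a copy of the global locale,
	// which is the "C" locale unless std::locale::global was called.
	std::locale locale;
	const converter& facet = std::use_facet<converter>(locale);
	std::mbstate_t state;

	// Bug workaround
	// http://gcc.gnu.org/bugzilla/show_bug.cgi?id=28059
	std::memset(&state, 0, sizeof(std::mbstate_t));

	// The buffer is sized on the assumption that decoding never yields
	// more wide characters than there are input bytes. A std::vector
	// replaces the non-standard variable-length array; the extra
	// element keeps &out[0] valid for empty input.
	const char* in = encoded.c_str();
	const char* in_next = 0;
	std::vector<wchar_t> out(length + 1);
	wchar_t* out_next = 0;
	converter::result result = facet.in(state,
		in, in + length, in_next,
		&out[0], &out[0] + length, out_next);

	if(result != converter::ok)
		throw std::invalid_argument("The argument could not be decoded.");

	return std::wstring(&out[0], out_next - &out[0]);
}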

Yes, this is the easiest correct way! The C++ Unicode interface is designed to be impossible to use. This seems to be a requirement for inclusion in the standard. To make my point, compare the above with an equivalent implementation in C#:

using System.Text;

// Encoding is a class in System.Text, not a namespace,
// so it needs no using directive of its own.
string decodeLocale(byte[] encoded)
{
	return new string(Encoding.Default.GetChars(encoded));
}

But you wouldn’t even need such a function since all text input/output functions operate on String objects and these abstract away encoding details.

Useful output on STL containers

Programming is so much easier if you can just dump any variable to the output to inspect its value. For this reason most modern languages define ToString() functions on every type, which are used automatically in such situations. As you could have guessed by now, C++ does not offer such convenience functions by default, but it allows you to define them.

Here is how you could implement a pretty-printer for the std::vector type:

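#include <cstddef>
#include <ostream>
#include <vector>

// Prints a vector as [a, b, c]. The vector is taken by const
// reference so that printing does not copy it.
template<class T>
std::wostream& operator<<(std::wostream& out, const std::vector<T>& v)
{
	out << L"[";
	for(std::size_t i = 0; i < v.size(); i++) {
		out << v[i];
		// Separator after every element except the last.
		if(i + 1 != v.size())
			out << L", ";
	}
	out << L"]";
	return out;
}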
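As a usage sketch, the printer can be combined with a string stream to get something close to the ToString() of other languages. The operator is repeated here (with std:: qualification and pass by const reference) so that the block compiles on its own; toString is a hypothetical helper of mine, not a standard function.

```cpp
#include <cstddef>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// The vector printer, repeated so this sketch is self-contained.
template<class T>
std::wostream& operator<<(std::wostream& out, const std::vector<T>& v)
{
	out << L"[";
	for(std::size_t i = 0; i < v.size(); i++) {
		out << v[i];
		// Separator after every element except the last.
		if(i + 1 != v.size())
			out << L", ";
	}
	out << L"]";
	return out;
}

// Hypothetical helper: format any streamable value as a wide string.
template<class T>
std::wstring toString(const T& value)
{
	std::wostringstream buffer;
	buffer << value;
	return buffer.str();
}
```

Calling toString on a vector holding 1, 2 and 3 gives [1, 2, 3], and an empty vector gives []. Because the operator calls itself for the element type, a vector of vectors prints as a nested list.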