Working around bugs in the C++ standard
% Remco Bloemen % 2010-01-30
Anyone serious about high-performance computing should sooner or later start programming in C++. This language is powerful, flexible and has a large collection of libraries. But the language is also old and complicated and thus full of annoying little sub-optimalities. In this post I will explore some of those awkwardnesses and describe ways to work around them.
Cross platform integer types
The first oddity in the standard is that C++ does not guarantee exactly how many bits there are in an int; it only sets a minimum of 16. I’m quite fond of mathematical computation and bit twiddling algorithms, so here is the workaround:
#include <stdint.h>

// Remove those ugly _t's…
typedef uint32_t uint32;
typedef int32_t sint32;
typedef uint64_t uint64;
typedef int64_t sint64;
In the future you can use cstdint instead of stdint.h. The cstdint header puts everything in the std namespace; other than that I don’t think it will differ.
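To illustrate why an exact width matters for bit twiddling, here is a minimal sketch of a 32-bit rotate. The function name rotateLeft is made up for this example; the code only assumes that stdint.h provides uint32_t, which is exactly 32 bits wide wherever it is defined.

```cpp
#include <stdint.h>

typedef uint32_t uint32;

// Hypothetical example function: rotate a 32-bit word left by n bits.
// The masks keep both shift counts in the range 0..31, so n == 0 and
// n >= 32 are well defined.
inline uint32 rotateLeft(uint32 x, unsigned n)
{
    n &= 31;
    return (x << n) | (x >> ((32 - n) & 31));
}
```

With a plain unsigned int the constants 31 and 32 would be wrong on any platform where int is not 32 bits wide, which is exactly the portability problem the typedefs avoid.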
It’s also interesting to note how the standard library seems to combine underscores, which I find too verbose for variable names, with mnemonics like mem_fn and codecvt, which I find too concise. The rest of the modern world uses CamelCase with ReadableAndDescriptiveNames.
Unicode support
When programming, only two text encodings are relevant: UTF-8 and whatever the local environment considers the default encoding. When working with western scripts or computer languages (source files, HTML, etc.) UTF-8 gives the smallest encoding. Since most network and disk operations are bandwidth limited, it is also the fastest encoding in a lot of cases.
In the case of Unicode in C++ a third one comes into play: UTF-32. This encoding stores the code points as 32-bit integers in the platform’s native byte order. This has the advantage that there is hardly any encoding/decoding overhead. The idea is that characters have a fixed-length encoding, but this is an illusion since there are combining characters.
I’m still not entirely convinced that UTF-32 is the best encoding for strings in memory. A western/computer script encoded in UTF-32 is roughly 4× the size of the UTF-8 encoding. Since memory bandwidth is quite limited, UTF-32 might be slower to process than UTF-8. I’ll have to benchmark this sometime.
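The size difference is easy to make concrete. The sketch below is not a benchmark; it only counts bytes. It assumes the input is already a sequence of valid code points, and the names utf8Length, utf8Size and utf32Size are made up for this example.

```cpp
#include <cstddef>
#include <stdint.h>
#include <vector>

// Number of bytes UTF-8 needs to encode one Unicode code point.
inline std::size_t utf8Length(uint32_t codePoint)
{
    if(codePoint < 0x80) return 1;     // ASCII
    if(codePoint < 0x800) return 2;    // e.g. Latin-1 supplement, Greek, Cyrillic
    if(codePoint < 0x10000) return 3;  // rest of the Basic Multilingual Plane
    return 4;                          // supplementary planes
}

// Total UTF-8 size in bytes of a sequence of code points.
inline std::size_t utf8Size(const std::vector<uint32_t>& codePoints)
{
    std::size_t bytes = 0;
    for(std::size_t i = 0; i < codePoints.size(); i++)
        bytes += utf8Length(codePoints[i]);
    return bytes;
}

// UTF-32 always uses four bytes per code point.
inline std::size_t utf32Size(const std::vector<uint32_t>& codePoints)
{
    return 4 * codePoints.size();
}
```

For pure ASCII text the ratio is exactly 4. The combining-character caveat shows up here as well: an e followed by U+0301 (combining acute accent) is one visible character but two code points, so it takes 8 bytes in UTF-32 and 3 bytes in UTF-8.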
For now it seems that C++'s native (read: least unsupported) Unicode encoding is UTF-32, at least on platforms where wchar_t is 32 bits wide. Such strings are stored as wchar_t* or std::wstring. However, you will often need to convert to and from char* and std::string, for example to decode the arguments to main. Also note how C++ has full exception handling support, but std::codecvt::in still returns error codes. The following function does such a decoding:
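#include <cstring>
#include <locale>
#include <stdexcept>
#include <string>
#include <vector>

std::wstring decodeLocale(const std::string& encoded)
{
    typedef std::codecvt<wchar_t, char, std::mbstate_t> converter;
    if(encoded.empty())
        return std::wstring();
    const std::size_t length = encoded.length();

    // A default-constructed locale is a copy of the global locale. Call
    // std::locale::global(std::locale("")) at startup to pick up the
    // encoding of the environment instead of the "C" locale.
    std::locale locale;
    const converter& facet = std::use_facet<converter>(locale);

    std::mbstate_t state;
    // Bug workaround
    // http://gcc.gnu.org/bugzilla/show_bug.cgi?id=28059
    std::memset(&state, 0, sizeof(std::mbstate_t));

    const char* in = encoded.c_str();
    const char* in_next = 0;
    // Every wide character consumes at least one input byte, so length
    // elements are always enough. Variable-length arrays are not
    // standard C++, hence the vector.
    std::vector<wchar_t> out(length);
    wchar_t* out_next = 0;
    converter::result result = facet.in(state,
        in, in + length, in_next,
        &out[0], &out[0] + length, out_next);
    if(result != converter::ok)
        throw std::invalid_argument("The argument could not be decoded.");
    return std::wstring(&out[0], out_next - &out[0]);
}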
Yes, this is the easiest correct way! The C++ Unicode interface is designed to be impossible to use. This seems to be a requirement for inclusion in the standard. To show my point, compare the above with an equivalent implementation in C#:
using System.Text;

string decodeLocale(byte[] encoded)
{
    // Encoding.Default is the system's default encoding.
    return Encoding.Default.GetString(encoded);
}
But you wouldn’t even need such a function since all text input/output functions operate on String objects and these abstract away encoding details.
Useful output on STL containers
Programming is so much easier if you can just dump any variable to the output to inspect its value. For this reason most modern languages define ToString() functions on every type, which are automatically used in such situations. As you could have guessed by now, C++ does not offer such convenience functions by default, but it allows you to define them.
Here is how you could implement a pretty-printer for the std::vector type:
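#include <cstddef>
#include <ostream>
#include <vector>

template<class T>
std::wostream& operator<<(std::wostream& out, const std::vector<T>& v)
{
    out << L"[";
    for(std::size_t i = 0; i < v.size(); i++) {
        out << v[i];
        // Separate elements with a comma, but not after the last one.
        if(i != v.size() - 1)
            out << L", ";
    }
    out << L"]";
    return out;
}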
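The vector printer indexes its elements, which does not carry over to containers without operator[]. A minimal sketch of the same idea based on iterators could look like the following; printRange and toString are hypothetical helper names chosen for this example, not standard library functions.

```cpp
#include <list>
#include <ostream>
#include <set>
#include <sstream>
#include <string>

// Hypothetical helper: writes the elements of [first, last) between
// square brackets, separated by commas.
template<class Iterator>
std::wostream& printRange(std::wostream& out, Iterator first, Iterator last)
{
    out << L"[";
    for(Iterator it = first; it != last; ++it) {
        if(it != first)
            out << L", ";
        out << *it;
    }
    out << L"]";
    return out;
}

template<class T>
std::wostream& operator<<(std::wostream& out, const std::list<T>& l)
{
    return printRange(out, l.begin(), l.end());
}

template<class T>
std::wostream& operator<<(std::wostream& out, const std::set<T>& s)
{
    return printRange(out, s.begin(), s.end());
}

// Hypothetical ToString()-style wrapper built on the operators above.
template<class Container>
std::wstring toString(const Container& c)
{
    std::wostringstream stream;
    stream << c;
    return stream.str();
}
```

With these definitions, std::wcout << someList works the same way as it does for a vector, and toString gives the string form when no stream is at hand.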