Remco Bloemen

Software errors

The authors define a critical failure as something that can take down a whole cluster or cause data corruption, and then look at a couple hundred bugs in Cassandra, HBase, HDFS, MapReduce, and Redis, to find 48 critical failures. They then look at the causes of those failures and find that most bugs were due to bad error handling. 92% of those failures are actually from errors that are handled incorrectly.

Drilling down further, 25% of bugs are from simply ignoring an error, 8% are from catching the wrong exception, 2% are from incomplete TODOs, and another 23% are “easily detectable”, which are defined as cases where “the error handling logic of a non-fatal error was so wrong that any statement coverage testing or more careful code reviews by the developers would have caught the bugs”.

Hardware errors


For example, we observe DRAM error rates that are orders of magnitude higher than previously reported, with 25,000 to 70,000 errors per billion device hours per Mbit and more than 8% of DIMMs affected by errors per year.


Package drop

Package re-order

Package corruption

TCP Checksum limitations:

Stone and Partridge estimated that between 1 in 16 million and 1 in 10 billion TCP segments will have corrupt data and a correct TCP checksum. This estimate is based on their analysis of TCP segments with invalid checksums taken from several very different types of networks. The wide range of the estimate reflects the wide range of traffic patterns and hardware in those networks. One in 10 billion sounds like a lot until you realize that 10 billion maximum length Ethernet frames (1526 bytes including Ethernet Preamble) can be sent in a little over 33.91 hours on a gigabit network (10 * 10^9 * 1526 * 8 / 10^9 / 60 / 60 = 33.91 hours), or about 26 days over a T3.


The SATA advertised bit error rate of one error in 10 terabytes is frightening. We moved 2 PB through low-cost hardware and saw five disk read error events, several controller failures, and many system reboots caused by security patches. We conclude that SATA uncorrectable read errors are not yet a dominant system-fault source – they happen, but are rare compared to other problems. We also conclude that UER (uncorrectable error rate) is not the relevant metric for our needs. When an uncorrectable read error happens, there are typically several damaged storage blocks (and many uncorrectable read errors.) Also, some uncorrectable read errors may be masked by the operating system. The more meaningful metric for data architects is Mean Time To Data Loss (MTTDL.)