- John: “What was wrong with the server that crashed last week?”
- Chris: “I don’t know. I rebooted it and it’s just fine. Perhaps the software crashed!”
I’m sure anyone who has been in operations has had the above dialog, sometimes quite frequently! In computer science such a failure would be called “transient” because the failure affects a piece of the system only for a fixed amount of time. People who have been running large-scale systems for a long time will attest that transient failures are extremely common and can lead to system unavailability if not handled right.
In this post I want to explore why transient failures are an important threat to availability and how a distributed database can handle them.
To see why transient failures are frequent and unavoidable, let’s consider what can cause them. Here’s an easy (albeit non-intuitive) reason: software bugs. All production-quality software still has bugs; most of the bugs that escape testing are difficult to track down and resolve, and they take the form of Heisenbugs, race conditions, resource leaks, and environment-dependent bugs, both in the OS and the applications. Some of these bugs will cause a server to crash unexpectedly. A simple reboot will fix the issue, but in the meantime the server will not be available. Configuration errors are another common cause. Somebody inserts the wrong parameters into a network switch console and as a result a few servers suddenly go offline. And, sometimes, the cause of the failure just remains unidentified because it can be hard to reproduce and thus examine more thoroughly.
I submit to you that it is much harder to prevent transient failures than permanent ones. Permanent failures are predictable, and are often caused by hardware failures. We can build software or hardware to work around permanent failures. For example, one can build a RAID scheme to prevent a server from going down if a disk fails, but no RAID level can prevent a memory leak in the OS kernel from causing a crash!
What does this mean? Since transient failures are unpredictable and harder to prevent, MTTF (mean time to failure) for transient failures is hard to increase.
Clearly, a smaller MTTF means more frequent outages and larger downtimes. But if MTTF is so hard to increase for transient failures, what can we do to always keep the system running?
The answer is that instead of increasing MTTF we can reduce MTTR (mean time to recover). Mathematically this concept is expressed by the formula:
Availability = MTTF/(MTTF+MTTR)
It is obvious that as MTTR approaches zero, Availability approaches 1, (i.e. 100%). In other words, if failure recovery is very fast, (instantaneous in an extreme example) then even if failures happen frequently, overall system availability will continue to be very high. This interesting approach to availability, called Recovery Oriented Computing was developed jointly by Berkeley and Stanford researchers, including my co-founder George Candea.
Applying this concept to a massively parallel distributed database yields interesting design implications. As an example, let’s consider the case where a server fails temporarily due to an OS crash in a 100-server distributed database. Such an event means that the system has fewer resources to work with: in our example after the failure we have a 1% reduction of available resources. A reliable system will need to:
(a) Be available while the failure lasts and
(b) Recover to the initial state as soon as possible after the failed server is restored.
Thus, recovering from this failure needs to be a two-step process:
(a) Keep the system available with a small performance/capacity hit while the failure is ongoing (availability recovery)
(b) Upgrade the system to its initial levels of performance and capacity as soon as the transient failure is resolved (resource recovery)
Minimizing MTTR means minimizing the sum of the time it takes to do (a) and (b), ta + tb. Keeping ta very low requires having replicas of data spread across the cluster; this, coupled with fast failure detection and fast activation of the appropriate replicas, will ensure that ta remains as low as possible.
Minimizing tb requires seamless re-incorporation of the transiently failed nodes into the system. Since in a distributed database each node has a lot of state, and the network is the biggest bottleneck, the system must be able to reuse as much of the state that pre-existed on the failed nodes as possible to reduce the recovery time. In other words, if most of the data that was on the node before the failure is still valid (a very likely case) then it needs to be identified, validated and reused during re-incorporation.
Any system that lacks the capacity to keep either ta or tb low does not provide good tolerance to transient failures.
And because there will always be more transient failures the bigger a system gets, any architecture that cannot handle failures correctly is – simply – not scalable. Any attempt to scale it up will likely result in outages and performance problems. Having a system designed with a Recovery-Oriented architecture, such as the Aster nCluster database, can ensure that transient failures are tolerated with minimal disruption, and thus true scalability is possible.