Archive for the ‘Manageability’ Category

By Tasso Argyros in Administration, Availability, Blogroll, Manageability, Scalability on August 12, 2008

- John: “What was wrong with the server that crashed last week?”

- Chris: “I don’t know. I rebooted it and it’s just fine. Perhaps the software crashed!”

I’m sure anyone who has been in operations has had the above dialog, sometimes quite frequently! In computer science such a failure is called “transient” because it affects a piece of the system only for a limited amount of time. People who have been running large-scale systems for a long time will attest that transient failures are extremely common and can lead to system unavailability if not handled right.

In this post I want to explore why transient failures are an important threat to availability and how a distributed database can handle them.

To see why transient failures are frequent and unavoidable, let’s consider what can cause them. Here’s an easy (albeit non-intuitive) reason: software bugs. All production-quality software still has bugs; most of the bugs that escape testing are difficult to track down and resolve, and they take the form of Heisenbugs, race conditions, resource leaks, and environment-dependent bugs, in both the OS and the applications. Some of these bugs will cause a server to crash unexpectedly. A simple reboot will fix the issue, but in the meantime the server will not be available. Configuration errors are another common cause: somebody inserts the wrong parameters into a network switch console and a few servers suddenly go offline. And sometimes the cause of a failure simply remains unidentified, because it is hard to reproduce and thus examine more thoroughly.

I submit to you that it is much harder to prevent transient failures than permanent ones. Permanent failures are more predictable and are often caused by hardware faults, so we can build software or hardware to work around them. For example, one can use a RAID scheme to keep a server running when a disk fails, but no RAID level can prevent a memory leak in the OS kernel from causing a crash!

What does this mean? Since transient failures are unpredictable and harder to prevent, MTTF (mean time to failure) for transient failures is hard to increase.

Clearly, a smaller MTTF means more frequent outages and more downtime. But if MTTF is so hard to increase for transient failures, what can we do to keep the system running at all times?

The answer is that instead of increasing MTTF we can reduce MTTR (mean time to recover). Mathematically this concept is expressed by the formula:

Availability = MTTF/(MTTF+MTTR)

It is obvious that as MTTR approaches zero, Availability approaches 1 (i.e., 100%). In other words, if failure recovery is very fast (instantaneous, in an extreme example), then even if failures happen frequently, overall system availability will remain very high. This interesting approach to availability, called Recovery-Oriented Computing, was developed jointly by Berkeley and Stanford researchers, including my co-founder George Candea.
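To make the arithmetic concrete, here is a small sketch of the formula above; the MTTF and MTTR numbers are invented purely for illustration:

    # Illustrative only: how availability responds to MTTR (all numbers are made up).
    def availability(mttf_hours, mttr_hours):
        """Availability = MTTF / (MTTF + MTTR)."""
        return mttf_hours / (mttf_hours + mttr_hours)

    # A node that suffers a transient failure roughly once a month (~720h MTTF):
    print(availability(720, 4))     # 4-hour manual recovery      -> ~0.9945  (about 48h of downtime per year)
    print(availability(720, 0.05))  # 3-minute automatic recovery -> ~0.99993 (about 36 minutes per year)

The failure rate is identical in both cases; only the recovery time changes, yet yearly downtime drops by almost two orders of magnitude.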

Applying this concept to a massively parallel distributed database yields interesting design implications. As an example, consider a 100-server distributed database in which one server fails temporarily due to an OS crash. Such an event means the system has fewer resources to work with: in our example, the failure leaves 1% of the resources unavailable. A reliable system will need to:

(a) Be available while the failure lasts and

(b) Recover to the initial state as soon as possible after the failed server is restored.

Thus, recovering from this failure needs to be a two-step process:

(a) Keep the system available with a small performance/capacity hit while the failure is ongoing (availability recovery)

(b) Restore the system to its initial levels of performance and capacity as soon as the transient failure is resolved (resource recovery)

Minimizing MTTR means minimizing the sum of the time it takes to do (a) and (b), t_a + t_b. Keeping t_a very low requires having replicas of data spread across the cluster; this, coupled with fast failure detection and fast activation of the appropriate replicas, will ensure that t_a remains as low as possible.
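As a rough sketch of the availability-recovery step, imagine a hypothetical cluster manager that tracks which nodes hold a copy of each data partition; the names and structures below are invented for illustration and are not Aster nCluster’s actual mechanism:

    # Hypothetical sketch: keeping t_a small by promoting replicas as soon as a failure is detected.
    REPLICAS = {                      # partition -> ordered list of nodes holding a copy (primary first)
        "p1": ["node1", "node7"],
        "p2": ["node1", "node9"],
        "p3": ["node4", "node1"],
    }

    def on_node_failure(failed_node, live_nodes):
        """Availability recovery: re-point every partition whose primary was lost to a live replica."""
        for partition, nodes in REPLICAS.items():
            if nodes[0] == failed_node:
                replacement = next(n for n in nodes[1:] if n in live_nodes)
                nodes.remove(replacement)
                nodes.insert(0, replacement)   # queries on this partition now go to the replica

    on_node_failure("node1", live_nodes={"node4", "node7", "node9"})
    print(REPLICAS)   # p1 and p2 are now served by node7 and node9; p3 was never affected

The faster the failure is detected and the replicas are activated, the smaller t_a becomes.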

Minimizing t_b requires seamless re-incorporation of the transiently failed nodes into the system. Since in a distributed database each node has a lot of state, and the network is the biggest bottleneck, the system must be able to reuse as much of the state that pre-existed on the failed nodes as possible to reduce the recovery time. In other words, if most of the data that was on the node before the failure is still valid (a very likely case), then it needs to be identified, validated, and reused during re-incorporation.
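Here is a sketch of that resource-recovery idea, assuming hypothetical per-partition version numbers (again invented for illustration rather than taken from any real system): only data that changed while the node was down needs to cross the network.

    # Hypothetical sketch: re-incorporating a recovered node while reusing its surviving state.
    def reincorporate(local_partitions, cluster_versions):
        """local_partitions: {partition: version} found on disk of the recovered node.
        cluster_versions:  {partition: version} currently authoritative in the cluster.
        Returns which partitions can be reused as-is and which must be re-replicated."""
        reuse, refetch = [], []
        for partition, version in local_partitions.items():
            if cluster_versions.get(partition) == version:
                reuse.append(partition)      # unchanged while the node was down: validate and reuse
            else:
                refetch.append(partition)    # stale: only this data is copied over the network
        return reuse, refetch

    reuse, refetch = reincorporate(
        local_partitions={"p1": 41, "p2": 17, "p3": 8},
        cluster_versions={"p1": 41, "p2": 19, "p3": 8},
    )
    print(reuse)     # ['p1', 'p3'] -> reused in place
    print(refetch)   # ['p2']       -> re-copied from a live replica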

Any system that cannot keep both t_a and t_b low does not provide good tolerance to transient failures.

And because transient failures become more frequent as a system grows, any architecture that cannot handle them correctly is simply not scalable: any attempt to scale it up will likely result in outages and performance problems. A system designed around a recovery-oriented architecture, such as the Aster nCluster database, can tolerate transient failures with minimal disruption, and thus makes true scalability possible.



By George Candea in Administration, Blogroll, Manageability on May 27, 2008

When developing a system that is expected to take care of itself (self-managing, autonomic, etc.), the discussion of how much control to give users over the details of the system inevitably comes up. There is, however, a clear line between visibility and control.

Users want control primarily because they don’t have visibility into the reasons for a system’s behavior. Take, for instance, a database whose performance has suddenly dropped 3x… This can be due to someone running a crazy query, or some other process on the same machine updating a filesystem index, or the battery of a RAID controller’s cache having run out and forcing all updates to be write-through, and so on. To figure out what is going on, the DBA would normally start poking around with ps, vmstat, mdadm, etc., and for this (s)he needs control. However, what the DBA really wants is visibility into the cause of the slowdown; the control needed to remedy the situation is minimal: kill a query, reboot, replace a battery, etc.

To provide good visibility, one ought to expose why the system is doing something, not how it is doing it. Any system that self-manages must be able to explain itself when asked to do so. If a DB is slow, it should be able to provide a profile of the in-flight queries. If a clustered system reboots nodes frequently, it should be able to tell whether it is rebooting for the same reason or a different one every time. If a node is taken offline, the system should be able to say that it is because of a suspected failure of disk device /dev/sdc1 on that node. And so on… this is visibility.
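As a toy sketch of what “explaining itself” could look like, imagine a diagnostic layer that turns low-level signals into stated causes; the function and the signals below are hypothetical, not an interface of any real database:

    # Hypothetical sketch: expose the "why" (causes) instead of the "how" (raw knobs and counters).
    def explain_slowdown(in_flight_queries, raid_battery_ok, background_jobs):
        """Return human-readable causes for a performance drop."""
        causes = []
        for q in in_flight_queries:
            if q["runtime_s"] > 600:
                causes.append(f"long-running query {q['id']} ({q['runtime_s']} s): {q['text'][:40]}...")
        if not raid_battery_ok:
            causes.append("RAID controller cache battery is dead: writes fell back to write-through")
        for job in background_jobs:
            if job["io_heavy"]:
                causes.append(f"background job '{job['name']}' is competing for disk I/O")
        return causes or ["no known cause; more telemetry needed"]

    print(explain_slowdown(
        in_flight_queries=[{"id": 7, "runtime_s": 4200, "text": "SELECT * FROM clicks JOIN sessions ON ..."}],
        raid_battery_ok=False,
        background_jobs=[{"name": "updatedb", "io_heavy": True}],
    ))

Given such an explanation, the administrator needs only the minimal controls mentioned above: kill query 7, replace the battery, reschedule the background job.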

We see, however, many systems and products that substitute control for visibility, for example by providing root access on the machines running the system. I believe this is mainly because the engineers themselves do not understand very well how the “how” turns into the “why”, i.e., they do not understand all the different paths that lead to poor system behavior.

Choosing to expose the why instead of the how influences the control knobs provided to users and administrators. Retrofitting complex systems to provide visibility instead of control is hard, so this really needs to be done from day one. What’s more, once customers get used to control, it becomes difficult for them to give it up in exchange for visibility, so the product must keep the user-accessible controls for backward compatibility. This lets administrators introduce unpredictable causes of system behavior (e.g., by triggering RAID recovery at arbitrary times), which makes self-management that much harder and less accurate. Hence the need to build visibility in from day one and to minimize unnecessary control.



By Tasso Argyros in Blogroll, Database, Manageability, Scalability on May 19, 2008

One of the most interesting, complex, and perhaps overused terms in data analytics today is scalability. People constantly talk about “scaling problems” and “scalable solutions.” But what really makes a data analytics system “scalable”? Unfortunately, despite its importance, this question is rarely discussed, so I wanted to post my thoughts here.

Any good definition of scalability needs to be multi-dimensional. In other words, no single system property is enough to make a data analytics system scalable. But what are the dimensions that separate scalable from non-scalable systems? In my opinion, the three most important are (a) data volume; (b) analytical power; and (c) manageability. Let me offer a couple of thoughts on each.

(a) Data Volume. This is definitely an important scale dimension because enterprises today generate huge amounts of data. For a shared-nothing MPP system, it means supporting enough nodes to accommodate the available data. Evolution in disk and server technology has made it possible to store tens of terabytes of data per node, so this scale dimension alone can be achieved even with a relatively small number of nodes.

(b) Analytical Power. This dimension is just as important as data volume, because storing large amounts of data has little benefit on its own; one needs to extract deep insights from it to provide real business value. For non-trivial queries in a shared-nothing environment this presents two requirements. First, the system needs to accommodate a large number of nodes so that it has adequate processing power to execute complex analytics. Second, the system needs to scale its performance linearly as more nodes are added. The latter is particularly hard for queries that involve processing of distributed state, such as distributed joins: really intelligent algorithms have to be in place or interconnect bottlenecks simply kill performance and the system is not truly scalable (see the back-of-envelope sketch after these three points).

(c) Manageability. Scalability along the manageability dimension means that a system can grow and keep operating at large scale without armies of administrators or downtime. For an MPP architecture this translates to seamless incremental scaling, scalable replication and failover, and little if any need for human intervention during management operations. Contrary to popular belief, we think manageability can be measured, and such metrics need to be taken into account when characterizing a system as scalable or non-scalable.
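To illustrate the interconnect point in (b), here is a rough back-of-envelope sketch, with entirely made-up numbers, of the pure network cost of a naive repartition join:

    # Back-of-envelope sketch (made-up numbers): shuffle cost of a naive repartition join.
    nodes = 100
    table_tb = 10          # size of the table that must be repartitioned on the join key
    nic_gbps = 1           # per-node interconnect bandwidth, in gigabits per second

    per_node_gb = table_tb * 1000 / nodes              # ~100 GB stored on each node
    shuffled_gb = per_node_gb * (nodes - 1) / nodes    # ~99% of it must leave the node
    seconds = shuffled_gb / (nic_gbps / 8)             # convert Gb/s to GB/s
    print(f"{shuffled_gb:.0f} GB shuffled per node, ~{seconds / 60:.0f} minutes of pure network time")

That is roughly a quarter of an hour of network transfer before any joining even starts; smarter data placement and join algorithms exist precisely to avoid paying this cost on every query.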

At Aster, we focus on building systems that scale across all of these dimensions. We believe that if even one dimension is missing, our products would not deserve to be called scalable. And since this is such an important issue, I look forward to more discussion around it!



By George Candea in Administration, Blogroll, Database, Manageability on May 17, 2008

I want databases that are as easy to manage as Web servers.

IT operations account for 50%-80% of today’s IT budgets and amount to tens of billions of dollars yearly(1). Poor manageability hurts the bottom line and reduces reliability, availability, and security.

Stateless applications, like Web servers, require little configuration, can be scaled through mere replication, and are reboot-friendly. I want to do that with databases too. But the way they’re built today, the number of knobs is overwhelming: the most popular DB has 220 initialization parameters and 1,477 tables of system parameters, while its “Administrator’s Guide” is 875 pages long(2).

What worries me is an impending manageability crisis, as large data repositories are proliferating at an astonishing pace; in 2003, large Internet services were collecting >1 TB of clickstream data per day(3). Five years later we’re encountering businesses that want SQL databases to store >1 PB of data. PB-scale databases are by necessity distributed, since no DB can scale vertically to 1 PB; now imagine taking notoriously hard-to-manage single-node databases and distributing them…

How does one build a DB as easy to manage as a Web server? All real engineering disciplines use metrics to quantitatively measure progress toward a design goal and to evaluate how different design decisions impact the desired system property.

We ought to have a manageability benchmark, and the place to start is a concrete metric for manageability, one that is simple, intuitive, and applies to a wide range of systems. We wouldn’t just use the metric to measure, but also to guide developers in making day-to-day choices. It should tell engineers how close their system is to the manageability target. It should enable IT managers to evaluate and compare systems to each other. It should lay down a new criterion for competing in the market.

Here’s a first thought…

I think of system management as a collection of tasks the administrators have to perform to keep a system running in good condition (e.g., deployment, configuration, upgrades, tuning, backup, failure recovery). The complexity of a task is roughly proportional to the number of atomic steps Steps_i required to complete task i; the larger Steps_i, the more inter-step intervals, so the greater the opportunity for the admin to mess up. Installing an operating system, for example, has Steps_install in the tens or hundreds.

Efficiency of management operations can be approximated by the time T_i, in seconds, it takes the system to complete task i; the larger T_i, the greater the opportunity for unrelated failures to impact the atomicity of the management operation. For a trouble-free OS install, T_install is probably around 1-3 hours.

If N_i represents the number of times task i is performed during a time interval T_evaluation (e.g., 1 year) and N_total = N_1 + … + N_n, then task i’s relative frequency of occurrence is Frequency_i = N_i / N_total. Typical values for Frequency_i can be derived empirically or extracted from surveys(4),(5),(6). The less frequently one needs to manage a system, the better.

Manageability can now be expressed with a formula, with larger values of manageability being better:

Manageability = T_evaluation / ( N_total × Σ_i ( Frequency_i × Steps_i^α × T_i ) )

This says that the more frequently a system needs to be “managed,” the poorer its manageability. The longer each step takes, the poorer the manageability. The more steps involved in each management action, the poorer the manageability. The longer the evaluation interval, the better the manageability, because observing a system for longer increases the confidence in the “measurement.”

While complexity and efficiency are system-specific, their relative importance is actually specific to a customer: an improvement in complexity may be preferred over an improvement in efficiency or vice versa; this differentiated weighting is captured by the exponent α. I would expect α > 2 in general, because having fewer, atomic steps is valued more from a manageability perspective than reducing task duration, since the former reduces the risk of expensive human mistakes and training costs, while the latter relates almost exclusively to service-level agreements.
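To see how the pieces combine, here is a small sketch that evaluates the metric above for two hypothetical systems over a one-year window; the task mix, step counts, and durations are invented for illustration only:

    # Illustrative only: computing the manageability metric for made-up task profiles.
    ALPHA = 2.5                       # exponent weighting complexity (steps) over efficiency (duration)
    T_EVALUATION = 365 * 24 * 3600    # one-year evaluation interval, in seconds

    def manageability(tasks):
        """tasks: list of (N_i, Steps_i, T_i) tuples, where N_i = Frequency_i * N_total."""
        cost = sum(n * (steps ** ALPHA) * duration for n, steps, duration in tasks)
        return T_EVALUATION / cost

    # A hand-managed system: monthly tuning (20 steps, 2 h) and weekly backups (8 steps, 1 h).
    manual = [(12, 20, 7200), (52, 8, 3600)]
    # A self-managing system: the same work collapsed into single-step, mostly automated tasks.
    autonomic = [(12, 1, 7200), (52, 1, 3600)]

    print(manageability(manual))      # ~0.17 -> poorer manageability
    print(manageability(autonomic))   # ~115  -> far better manageability

Cutting the number of steps helps far more than shaving task duration, which is exactly what the exponent α is meant to capture.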

So would this metric work? Is there a simpler one that’s usable?