
By George Candea in Administration, Blogroll, Database, Manageability on May 17, 2008

I want databases that are as easy to manage as Web servers.

IT operations account for 50%-80% of today’s IT budgets and amount to tens of billions of dollars yearly(1). Poor manageability hurts the bottom line and reduces reliability, availability, and security.

Stateless applications, like Web servers, require little configuration, can be scaled through mere replication, and are reboot-friendly. I want the same properties in databases. But the way they’re built today, the number of knobs is overwhelming: the most popular DB has 220 initialization parameters and 1,477 tables of system parameters, while its “Administrator’s Guide” is 875 pages long(2).

What worries me is an impending manageability crisis, as large data repositories proliferate at an astonishing pace. In 2003, large Internet services were collecting >1 TB of clickstream data per day(3). Five years later, we’re encountering businesses that want SQL databases to store >1 PB of data. PB-scale databases are by necessity distributed, since no DB can scale vertically to 1 PB. Now imagine taking notoriously hard-to-manage single-node databases and distributing them...

How does one build a DB as easy to manage as a Web server? All real engineering disciplines use metrics to quantitatively measure progress toward a design goal, to evaluate how different design decisions impact the desired system property.

We ought to have a manageability benchmark, and the place to start is a concrete metric for manageability, one that is simple, intuitive, and applies to a wide range of systems. We don’t just use the metric to measure, but also to guide developers in making day-to-day choices. It should tell engineers how close their system is to the manageability target. It should enable IT managers to evaluate and compare systems to each other. It should lay down a new criterion for competing in the market.

Here’s a first thought:

I think of system management as a collection of tasks the administrators have to perform to keep a system running in good condition (e.g., deployment, configuration, upgrades, tuning, backup, failure recovery). The complexity of a task is roughly proportional to the number of atomic steps Steps_i required to complete task i; the larger Steps_i, the more inter-step intervals there are, and so the greater the opportunity for the admin to mess up. Installing an operating system, for example, has Steps_install in the tens or hundreds.

Efficiency of management operations can be approximated by the time T_i, in seconds, that it takes the system to complete task i; the larger T_i, the greater the opportunity for unrelated failures to impact the atomicity of the management operation. For a trouble-free OS install, T_install is probably around 1-3 hours.

If N_i represents the number of times task i is performed during a time interval T_evaluation (e.g., 1 year) and N_total = N_1 + ... + N_n, then task i’s relative frequency of occurrence is Frequency_i = N_i / N_total. Typical values for Frequency_i can be derived empirically or extracted from surveys(4),(5),(6). The less frequently one needs to manage a system, the better.
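As a small illustration of the relative-frequency definition above, here is a sketch in Python. The task names and counts are made up for the example, and the helper name `relative_frequencies` is hypothetical; real values would come from operations logs or the surveys cited above.

```python
from collections import Counter


def relative_frequencies(task_log):
    """Map each task name to its share of all management actions in the log."""
    counts = Counter(task_log)          # per-task counts (N_i)
    total = sum(counts.values())        # total number of actions (N_total)
    if total == 0:
        return {}
    return {task: n / total for task, n in counts.items()}


# Hypothetical one-year log: 6 backups, 3 tuning sessions, 1 upgrade.
log = ["backup"] * 6 + ["tuning"] * 3 + ["upgrade"] * 1
freqs = relative_frequencies(log)
print(freqs)
```

By construction the frequencies sum to 1, so they describe the mix of management work rather than its absolute volume.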

Manageability can now be expressed with a formula, with larger values of manageability being better:

[Image: manageability formula]

This says that the more frequently a system needs to be “managed,” the poorer its manageability. The longer each step takes, the poorer the manageability. The more steps involved in each management action, the poorer the manageability. The longer the evaluation interval, the better the manageability, because observing a system longer increases the confidence in the “measurement.”
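One way to write a formula with the four properties just listed is sketched below. This is an assumed form built from the quantities defined earlier (evaluation interval, relative frequency, step count, task duration), not necessarily the exact expression; in particular, the exponent \alpha on the step count is an assumption standing in for a customer-specific weighting of complexity against efficiency.

```latex
\[
\text{Manageability} \;=\;
\frac{T_{\text{evaluation}}}
     {\sum_{i=1}^{n} \text{Frequency}_i \cdot \text{Steps}_i^{\alpha} \cdot T_i}
\]
```

Under this form, anything that grows the denominator (more steps or longer tasks among the frequently performed ones) lowers the score, while a longer evaluation interval raises it.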

While complexity and efficiency are system-specific, their relative importance is specific to the customer: an improvement in complexity may be preferred over an improvement in efficiency, or vice versa. This differentiated weighting is captured by α. I would expect α > 2 in general, because having fewer, atomic steps is valued more from a manageability perspective than reducing task duration: the former reduces the risk of expensive human mistakes and training costs, while the latter relates almost exclusively to service-level agreements.
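To see how such a weighting plays out, here is a minimal Python sketch. It assumes the metric has the form T_evaluation / sum_i(Frequency_i * Steps_i**alpha * T_i), where `alpha` is the weighting parameter treated as an exponent on the step count; that form, the `Task` class, the `manageability` function, and all the numbers are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    count: int      # times the task ran during the evaluation interval
    steps: int      # atomic steps the admin performs
    seconds: float  # time the system needs to complete the task


def manageability(tasks, t_evaluation, alpha):
    """Assumed form: T_evaluation / sum_i(Frequency_i * Steps_i**alpha * T_i)."""
    n_total = sum(t.count for t in tasks)
    if n_total == 0:
        raise ValueError("no management tasks were observed")
    burden = sum(
        (t.count / n_total) * (t.steps ** alpha) * t.seconds for t in tasks
    )
    return t_evaluation / burden


YEAR = 365 * 24 * 3600  # evaluation interval in seconds

# Made-up numbers, for illustration only.
baseline = [Task("backup", 52, 8, 1800.0), Task("upgrade", 4, 40, 7200.0)]
fewer_steps = [Task("backup", 52, 4, 1800.0), Task("upgrade", 4, 20, 7200.0)]
faster = [Task("backup", 52, 8, 900.0), Task("upgrade", 4, 40, 3600.0)]

alpha = 2.5
m_base = manageability(baseline, YEAR, alpha)
m_steps = manageability(fewer_steps, YEAR, alpha)
m_time = manageability(faster, YEAR, alpha)
print(m_base, m_steps, m_time)
```

With this assumed form, halving every task's duration doubles the score, while halving every task's step count multiplies it by 2**alpha (about 5.66 for alpha = 2.5), which is the sense in which a larger weighting favors reducing steps over reducing duration.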

So would this metric work? Is there a simpler one that’s usable?