By George Candea in Administration, Blogroll, Manageability on May 27, 2008

When developing a system that is expected to take care of itself (self-managing, autonomic, etc.), the question of how much control to give users over the details of the system inevitably comes up. There is, however, a clear line between visibility and control.

Users want control primarily because they don’t have visibility into the reasons for a system’s behavior. Take for instance a database whose performance has suddenly dropped by a factor of 3. This can be due to someone running a crazy query, or some other process on the same machine updating a filesystem index, or the battery of a RAID controller’s cache having run out and forcing all updates to be write-through, etc. In order to figure out what is going on, the DBA would normally start poking around with ps, vmstat, mdadm, etc., and for this (s)he needs control. However, what the DBA really wants is visibility into the cause of the slowdown. The control needed to remedy the situation is minimal: kill a query, reboot, replace a battery, etc.

To provide good visibility, one ought to expose why the system is doing something, not how it is doing it. Any system that self-manages must be able to explain itself when requested to do so. If a DB is slow, it should be able to provide a profile of the in-flight queries. If a cluster system reboots nodes frequently, it should be able to tell whether it’s rebooting due to the same cause or a different one every time. If a node is taken offline, the system should be able to tell that it’s because of a suspected failure of disk device /dev/sdc1 on that node. And so on. This is visibility.
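One way to picture a system that can explain itself is a log in which every self-management action is recorded together with its cause. The following is only a minimal sketch in Python, not how any particular product does it: the names `Explanation`, `ExplanationLog`, `why`, and `same_cause_every_time` are made up for illustration, as are the node names and the "heartbeat timeout" cause.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class Explanation:
    """One self-management action together with the reason for it (hypothetical type)."""
    action: str   # what the system did, e.g. "take_node_offline"
    target: str   # what it acted on, e.g. "node-7"
    cause: str    # why it acted, phrased so an administrator can act on it
    at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


class ExplanationLog:
    """Append-only record of actions and their causes (hypothetical class)."""

    def __init__(self):
        self._entries = []

    def record(self, action, target, cause):
        entry = Explanation(action, target, cause)
        self._entries.append(entry)
        return entry

    def why(self, action, target):
        """Return the recorded causes for an action on a target, oldest first."""
        return [e.cause for e in self._entries
                if e.action == action and e.target == target]

    def same_cause_every_time(self, action, target):
        """True if the action was recorded at least once and always for one cause."""
        causes = self.why(action, target)
        return len(causes) > 0 and len(set(causes)) == 1


log = ExplanationLog()
log.record("take_node_offline", "node-7",
           "suspected failure of disk device /dev/sdc1")
log.record("reboot", "node-3", "heartbeat timeout")  # illustrative cause
log.record("reboot", "node-3", "heartbeat timeout")

print(log.why("take_node_offline", "node-7"))
print(log.same_cause_every_time("reboot", "node-3"))
```

The point of the sketch is that the administrator queries causes, not mechanisms: nothing here requires a shell on the node.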

We do see, however, many systems and products that substitute control for visibility, for example by providing root access on the machines running the system. I believe this is mainly because the engineers themselves do not understand very well how the how turns into the why, i.e., they do not understand all the different paths that lead to poor system behavior.

Choosing to expose the why instead of the how influences the control knobs provided to users and administrators. Retrofitting complex systems to provide visibility instead of control is hard, so this really needs to be done from day one. What’s more, once customers get used to control, it becomes difficult for them to give it up in exchange for visibility, so the product must maintain the user-accessible controls for backward compatibility. These retained controls allow administrators to introduce unpredictable causes of system behavior (e.g., by allowing RAID recovery to be triggered at arbitrary times), which makes self-management that much harder and less accurate. Hence the need to build visibility in from day one and to minimize unnecessary control.

FS on May 27th, 2008 at 11:11 am

The underlying question is how much automation administrators will accept. While I agree that the “why” is important, human nature often dictates that the “how” garners equal importance. Oftentimes, this manifests itself as a refusal to allow true autonomic computing. Instead, administrators want the ability to select which functions are automated, which are manual, and which are only suggested actions, with the added safety and security of a “kill” switch.

George Candea on May 28th, 2008 at 12:28 pm

FS, indeed administrators are reluctant to give up control, and rightly so: when taking away control, they should get visibility in exchange and a small set of simple-but-effective controls (like the “kill” switch you mention). It is natural to distrust a self-* system, especially if it cannot explain what happened whenever it messes up.

Said differently, there are two types of knobs: those that allow admins to correct system behavior, and those that allow admins to dig around the bowels of the system to understand why it’s doing whatever it’s doing. The latter type of knobs should be replaced with properly thought-out visibility into system activity (counters, visualization of component health, performance, and activity, etc.).
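As a rough sketch of what a small set of corrective knobs plus read-only visibility might look like, the Python below combines the per-function modes and the “kill” switch from FS’s comment with simple counters. Everything in it is an assumption made for illustration: the names `Mode`, `ControlSurface`, `decide`, and `counters`, the function names `raid_recovery` and `node_reboot`, and the choice to treat unlisted functions as manual.

```python
from enum import Enum


class Mode(Enum):
    AUTOMATED = "automated"   # the system acts on its own
    SUGGESTED = "suggested"   # the system proposes, an administrator approves
    MANUAL = "manual"         # the system never acts; administrators do


class ControlSurface:
    """Corrective knobs (modes, kill switch) plus read-only counters (hypothetical)."""

    _OUTCOMES = {
        Mode.AUTOMATED: "execute",
        Mode.SUGGESTED: "suggest",
        Mode.MANUAL: "defer",
    }

    def __init__(self, modes):
        self._modes = dict(modes)
        self._killed = False
        self._counters = {}

    def kill(self):
        """The kill switch: stop all automated and suggested actions."""
        self._killed = True

    def decide(self, function, reason):
        """Return what the system does about `function`, and count the outcome."""
        if self._killed:
            outcome = "blocked"
        else:
            # Assumption: functions without an explicit mode stay manual.
            mode = self._modes.get(function, Mode.MANUAL)
            outcome = self._OUTCOMES[mode]
        key = (function, outcome)
        self._counters[key] = self._counters.get(key, 0) + 1
        return f"{outcome}: {function} ({reason})"

    def counters(self):
        """Visibility only: a copy of the counters, so callers cannot alter them."""
        return dict(self._counters)


surface = ControlSurface({
    "raid_recovery": Mode.AUTOMATED,
    "node_reboot": Mode.SUGGESTED,
})
print(surface.decide("raid_recovery", "degraded array"))
print(surface.decide("node_reboot", "suspected disk failure"))
surface.kill()
print(surface.decide("raid_recovery", "degraded array"))
print(surface.counters())
```

Note that every decision carries its reason, so even when the switch blocks an action the administrator can still see why the system wanted to take it.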

FS on May 31st, 2008 at 6:41 pm

George, this is an important discussion that has the potential to impact many layers within information technology. The key to success may lie not within the core technology but within the “knobs” themselves. These “knobs” must present a single pane of glass into the system with comprehensive “drill-down” capabilities. However, instead of today’s “how”-focused views, the glass will be focused on the “why,” with guard-rail choices that the administrators can make. In this way, the administrators are still connected to the systems they manage and maintain a high degree of value while experiencing an amazing paradigm shift.
