By George Candea in Administration, Blogroll, Manageability on May 27, 2008
   

When developing a system that is expected to take care of itself (self-managing, autonomic, etc.) the discussion of how much control to give users over the details of the system inevitably comes up. There is, however, a clear line between visibility and control.

Users want control primarily because they don’t have visibility into the reasons for a system’s behavior. Take for instance a database whose performance has suddenly dropped 3x. This can be due to someone running a crazy query, or some other process on the same machine updating a filesystem index, or the battery of a RAID controller’s cache having run out and forcing all updates to be write-through, etc. In order to figure out what is going on, the DBA would normally start poking around with ps, vmstat, mdadm, etc. and for this (s)he needs control. However, what the DBA really wants is visibility into the cause of the slowdown. The control needed to remedy the situation is minimal: kill a query, reboot, replace a battery, etc.

To provide good visibility, one ought to expose why the system is doing something, not how it is doing it. Any system that self-manages must be able to explain itself when requested to do so. If a DB is slow, it should be able to provide a profile of the in-flight queries. If a cluster system reboots nodes frequently, it should be able to tell whether it’s rebooting due to the same cause or a different one every time. If a node is taken offline, the system should be able to tell it’s because of suspected failure of disk device /dev/sdc1 on that node. And so on… this is visibility.
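As a rough illustration of what "explaining itself" could look like, here is a minimal Python sketch in which every self-management action is recorded together with its cause, so the system can later answer "why did you do that?" and "is it the same cause every time?". The names `Action`, `ExplanationLog`, `why`, and `same_cause_every_time` are hypothetical and made up for this example, as are the node names and cause strings; this is not taken from any particular product.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """One self-management action plus the reason it was taken (hypothetical type)."""
    node: str
    kind: str   # what was done, e.g. "reboot" or "offline"
    cause: str  # why it was done, in terms an administrator understands


class ExplanationLog:
    """Records the 'why' of every action so it can be reported on request."""

    def __init__(self):
        self._actions = []

    def record(self, node, kind, cause):
        self._actions.append(Action(node, kind, cause))

    def why(self, node, kind):
        """Return the cause of the most recent `kind` action on `node`, or None."""
        for action in reversed(self._actions):
            if action.node == node and action.kind == kind:
                return action.cause
        return None

    def causes(self, node, kind):
        """Count how often each cause led to a `kind` action on `node`."""
        return Counter(
            a.cause for a in self._actions if a.node == node and a.kind == kind
        )

    def same_cause_every_time(self, node, kind):
        """True if every recorded `kind` action on `node` shares a single cause."""
        return len(self.causes(node, kind)) == 1


log = ExplanationLog()
log.record("node7", "offline", "suspected failure of disk device /dev/sdc1")
log.record("node3", "reboot", "kernel watchdog timeout")
log.record("node3", "reboot", "kernel watchdog timeout")
log.record("node4", "reboot", "kernel watchdog timeout")
log.record("node4", "reboot", "memory exhaustion")

print(log.why("node7", "offline"))
print(log.same_cause_every_time("node3", "reboot"))
print(log.same_cause_every_time("node4", "reboot"))
```

The point of the sketch is that the cause is captured at the moment the decision is made, rather than being reconstructed afterwards by an administrator with root access.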

We do see, however, many systems and products that substitute control for visibility, for example by providing root access on the machines running the system. I believe this is mainly because the engineers themselves do not understand very well how the how turns into the why, i.e., they do not understand all the different paths that lead to poor system behavior.

Choosing to expose the why instead of the how influences the control knobs provided to users and administrators. Retrofitting complex systems to provide visibility instead of control is hard, so this really needs to be done from day one. What’s more, once customers get used to control, it becomes difficult for them to give it up in exchange for visibility, so the product must maintain the user-accessible controls for backward compatibility. These retained controls let administrators introduce unpredictable causes of system behavior (e.g., by triggering RAID recovery at arbitrary times), which makes self-management that much harder and less accurate. Hence the need to build visibility in from day one and to minimize unnecessary control.