06
Nov
By Tasso Argyros in Blogroll on November 6, 2008
   

Forget about total cost of ownership (TCO). In the Internet age, scaling up can be the biggest cost factor in your infrastructure. And overlooking the scaling challenge can be disastrous.

Take Bob, for instance. Bob is the fictional database manager at FastGrowth, an Internet company with a fast-growing user base.  Bob is 36 and has done more than a dozen data warehouse deployments in his career. He’s confident in his position. Granted, this is his first Internet gig But this shouldn’t matter a lot, right? Bob’s been there, done that, got the tee-shirt.

Bob’s newest gig is to implement a new data warehouse system to accommodate FastGrowth’s explosive data growth. He estimates that there will be 10TB of data in the next 6 months, 20TB in the next 12, and 40TB 18 months from now.

Bob needs to be very careful about cost (TCO); getting way overboard on his budget could cost him his reputation or even (gasp) his job. He thus asks vendors and friends how much hardware and software he needs to buy at each capacity level. He also makes conservative estimates about the number of people required to manage the system and its data at 10, 20, and 40 terabytes.

Fast-forward 18 months. Bob’s DW is in complete chaos; it can hardly manage half of the 40 TB target and it required twice the number of people and dollars so far. Luckily for Bob, his boss (Suzy), has been doing Internet infrastructure projects for her whole career and knew exactly what mistake Bob made (and why he deserves a second chance.)

What went wrong? Bob did everything almost perfectly. His TCO estimates at each scale level were, in fact, correct. But what he did not account for was the effort of going from one scale level to the other in such a short time! Doubling the size of data every 6 months is 3x faster than Moore’s law. That’s like trying to build a new car that is 3x faster than a Ferrari. As a result, growing from 10TB to 20TB in six months may cost many times more than (in terms of people, time and dollars) running a 20TB system for 6 months.

In some way, this is no news. The Internet space is full of stories where scaling was either too expensive or too disruptive to be carried out properly. Twitter, with its massive success, has had to put huge effort to scale up its systems. And Friendster lost the opportunity to be a top social network partly because it was taking too long to scale up its infrastructure. Moreover, as new data sources become available, companies outside Internet are facing similar kind of challenges -scaling needs that are too hard to manage!

So how can we reason about this new dimension of infrastructure cost? What happens when data is growing constantly, and scaling up ends up being the most expensive part of most projects?

The answer, we believe, is that the well-known concept of TCO not good enough to capture scaling costs in this new era of fast data growth. Instead, we need to also start thinking about the Total Cost of Scaling -or TCS.

Why is TCS useful? TCS captures all costs -in terms of hardware, software and people -that are required to increase the capacity of the infrastructure. Depending on the application, capacity can mean anything such as amount of data (e.g. for data warehousing projects) or queries per second (for OLTP  systems.)  TCO together with TCS gives a true estimate of project costs for environments that have been blessed with a growing business.

Let’s see how TCS works in an example. Say that you need 100 servers to run your Web business at a particular point in time, and you have calculated the TCO for that. You can also calculate the TCO of having 250 servers running 12 months down the road, when your business has grown. But going from 100 severs to 250 -that’s where TCS comes in. The careful planner (e.g. Bob in his next project) will need to add all three numbers together -TCO at 100 servers, TCO at 250 servers and TCS for scaling from 100 to 250 -to get an accurate picture of the full cost.

At Aster, we have been thinking about TCS from day one exactly because we design our systems for environments of fast data growth. We have seen TCS dominating the cost of data projects. As a result, we have built a product that is designed from the ground-up to make scalability seamless and reduce the TCS of our deployments to a minimum. For example, one of our customers scaled up their Aster deployment from 45 to 90 servers with a click of a button. In contrast, traditional scaling approaches -manual, tedious and risky -bloat TCS and can jeopardize whole projects.

As fast data growth becomes the rule rather than the exception, we expect more people to start measuring TCS and seek ways to reduce it. As Francis Ford Coppola put it, “anything you build on a large scale or with intense passion invites chaos.” And while passion is hard to manage, there is something we can do about scale.

Bookmark and Share

Comments:

[...] Thinking about more than query performance and initial system cost is something we believe firmly in. People often overlook the cost of maintaining and scaling systems in a time of need. [...]

Post a comment
Name: 
Email: 
URL: 
Comments: