By Tasso Argyros in Analytics, Blogroll on July 12, 2010

I have always enjoyed the subtle irony of someone trying to be impressive by saying “my data warehouse is X Terabytes” [muted: “and it’s bigger than yours”]! Why is this ironic? Because it describes a data warehouse, which is supposed to be all about data processing and analysis, using a storage metric. Having an obese 800-Terabyte system that may take hours or days just to do a single pass over the data is not impressive, and it definitely calls for a diet.

Surprisingly though, several vendors went down the path of making their data warehousing offerings fatter and fatter. Greenplum is a good example. Prior to Sun’s acquisition by Oracle, they were heavily pushing systems based on the Sun Thumper, a 48-disk, 4U box that can store up to 100 TB. I was quite familiar with that box, as it partly came out of a startup called Kealia that my Stanford advisor, David Cheriton, and Sun co-founder Andy Bechtolsheim had founded and then sold to Sun in 2004. I kept wondering, though, what a 50 TB/CPU configuration has to do with data analytics.

After long deliberation I came to the conclusion that it has nothing to do with it. There were two reasons why people were interested in this configuration. First, there were some use cases that required “near-line storage”, a term used to describe a data repository whose major purpose is to store data but that also allows for basic & infrequent data access. In that respect, Greenplum’s software on top of the Sun Thumpers represented a cheap storage solution that offered basic data access and was very useful for applications where processing or analytics was not the main focus.

The second reason for the interest, though, was a tendency to drive DW projects toward the lowest possible per-TB price in order to reduce costs. Experienced folks will recognize that such an approach leads to disaster because, as mentioned above, analytics is about more than just Terabytes. A rock-bottom per-TB price based on fat storage looks great on glossy paper, but in reality it’s no good, because nobody’s analytical problems are that simple.

The point here is that analytics has more to do with processing than with storage. It requires a fair number of balanced servers (and thus good scalability & fault tolerance), CPU cycles, networking bandwidth, smart & efficient algorithms, fair amounts of memory to avoid thrashing, etc. It’s also about how much processing can be done in SQL, and how much of your analytics needs to use next-generation interfaces like MapReduce or pre-packaged in-database analytical engines. In the new decade on which we’re embarking, solving business problems like fraud, market segmentation & targeting, financial optimization, etc., requires much more than just cheap, overweight storage.

So, coming to the EMC/Greenplum news: I think such an acquisition makes sense, but in a specific way. It will lead to systems that live between storage and data warehousing, systems able to store data and also retrieve it on an occasional basis, or when the analysis required is trivial. But the problems Aster is excited about are those of advanced in-database analytics for rich, ad hoc querying, delivered through a full application environment inside an MPP database. It’s these problems that we see as opportunities to not only cut IT costs but also provide tremendous competitive advantages to our customers. And on that front, we promise to continue innovating and pushing the limits of technology as much as possible.


Bill McColl on July 12th, 2010 at 3:43 pm #

Great post.

I too have been puzzled by the use of the phrase “we have an nTB data warehouse” by DW vendors, as though storage volume alone were an important metric. You are spot on: that is only relevant if what you are buying is a storage system with minimal added processing power, not an analytics system.

I’ve seen an even more bizarre statement from certain cloud database vendors: that pricing is based on the size of the database. So, if I have a 10TB database in the cloud and I have 50 users bombarding it with complex queries 24×7, I pay the same as a single user who runs a simple report query once every three days? The cloudonomics of this just doesn’t make sense, unless the light user is paying way too much.

At Cloudscale we’ve been building the first scalable REALTIME data warehouse, to complement the kind of next-gen scalable analytics products that Aster Data offers. In the cloud, we price based on TBs PROCESSED, not TBs stored. In-house we price based on number of sockets and number of users, not on the total volume of SSD and spinning media!
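[Editor's note: the pricing asymmetry Bill describes can be sketched with a toy calculation. All rates and workload figures below are invented for illustration and do not reflect any vendor's actual pricing.]

```python
# Toy comparison of two cloud pricing models for the same 10 TB database:
# flat per-TB-stored pricing vs. per-TB-processed pricing.
# All numbers are hypothetical.

TB_STORED = 10

PRICE_PER_TB_STORED_MONTH = 1000   # $/TB stored per month (assumed)
PRICE_PER_TB_PROCESSED = 5         # $/TB scanned by queries (assumed)

def monthly_cost_stored(tb_stored):
    """Flat pricing: the bill depends only on database size."""
    return tb_stored * PRICE_PER_TB_STORED_MONTH

def monthly_cost_processed(tb_processed):
    """Usage pricing: the bill tracks the data actually scanned."""
    return tb_processed * PRICE_PER_TB_PROCESSED

# Heavy workload: 50 users, one complex query per hour around the clock,
# each scanning an assumed 0.5 TB.
heavy_tb = 50 * 24 * 30 * 0.5      # = 18,000 TB processed per month

# Light workload: one report every three days, scanning an assumed 0.1 TB.
light_tb = (30 / 3) * 0.1          # = 1 TB processed per month

print(monthly_cost_stored(TB_STORED))        # $10,000 -- same bill for both users
print(monthly_cost_processed(heavy_tb))      # $90,000
print(monthly_cost_processed(light_tb))      # $5
```

Under flat per-TB-stored pricing, both users pay identically despite a four-orders-of-magnitude gap in work done, which is the subsidy Bill is pointing at.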

Daniel Abadi on July 12th, 2010 at 8:30 pm #

I totally agree.

I wrote something similar a little more than a year ago:

Tasso Argyros on July 13th, 2010 at 11:51 am #


Thanks for the link - relevant post, indeed.

