I have always enjoyed the subtle irony of someone trying to be impressive by saying “my data warehouse is X Terabytes”? [muted: "and it's bigger than yours"?]! Why is this ironic? Because it describes a data warehouse, which is supposed to be all about data processing and analysis, using a storage metric. Having an obese 800 Terabytes system that may take hours or days to just do a single pass over the data is not impressive and definitely calls for some diet.
Surprisingly though, several vendors went down the path of making their data warehousing offerings fatter and fatter. Greenplum is a good example. Prior to Sun’s acquisition by Oracle, they were heavily pushing systems based on the Sun Thumper, a 48-disk-heavy 4U box that can store up to 100TBs/box. I was quite familiar with that box as it partly came out of a startup called Kealia that my Stanford advisor, David Cheriton, and Sun co-founder Andy Bechtolsheim had founded and then sold to Sun in 2004. I kept wondering, though, what a 50TB/CPU configuration has to do with data analytics.
After long deliberation I came to the conclusion that it has nothing to do with it. There were two reasons why people were interested in this configuration. First, there were some use cases that required “near-line storage”?, a term that’s used to describe a data repository whose major purpose is to store data but also allows for basic & infrequent data access. In that respect, Greenplum’s software on top of the Sun Thumpers represented a cheap storage solution that offered basic data access and was very useful for applications where processing or analytics was not the main focus.
The second reason for the interest, though, is a tendency to drive DW projects towards an absolute low per-TB price to reduce costs. Experienced folks will recognize that such an approach leads to disaster, because (as mentioned above) analytics is more than just Terabytes. Perfectly low per-TB price using fat storage looks great on glossy paper but in reality it’s no good because nobody’s analytical problems are that simple.
The point here is that analytics have more to do with processing rather than storage. It requires a fair number of balanced servers (thus good scalability & fault tolerance), CPU cycles, networking bandwidth, smart & efficient algorithms, fair amounts of memory to avoid thrashing etc. It’s also about how much processing can it be done by SQL, and how much of your analytics need to use next-generation interfaces like MapReduce or pre-packaged in-database analytical engines. In the new decade in which we’re embarking, solving business problems like fraud, market segmentation & targeting, financial optimization, etc., require much more than just cheap, overweight storage.
So going to the EMC/Greenplum news, I think such an acquisition makes sense, but in a specific way. It will lead to systems that live between storage and data warehousing, systems able to store data and also give the ability to retrieve it on an occasional basis or if the analysis required is trivial. But the problems Aster is excited about are those of advanced in-database analytics for rich, ad hoc querying, delivered through a full application environment inside a MPP database. It’s these problems that we see as opportunities to not only cut IT costs but also provide tremendous competitive advantages to our customers. And on that front, we promise to continue innovating and pushing the limits of technology as much as possible.