
How can I analyze all of this data?
By Tasso Argyros in Analytics, Analytics tech on May 8, 2008
   

Over the last couple of years I’ve talked to scores of companies that face data analytics problems and ask this question. From these discussions it became pretty clear that, for most enterprises, no existing infrastructure can really solve the problem of deriving deep insights from massive amounts of data. But why? And how do companies today try to cope with this issue?

I’ve seen three classes of “solutions” that companies implement in a desperate attempt to overcome their data analytics challenges. Let me describe them here.

“Solution” One. Vertical scale-up. If you are like most companies, database performance problems make your favorite hardware vendor’s sales rep lots of money every year! There is nothing new here. Ever since the 1960s, when the first data management systems came around, performance issues have been solved by buying much more expensive hardware. So here’s the obvious problem with this approach: cost. And here’s the non-obvious one: there’s a limit to how far you can scale this way, and it is actually pretty low. (Question: what is the maximum number of CPUs that you can buy in a high-end server? How does it compare to the average Google cluster?)

“Solution” Two. “Massively” parallel database clusters. Sometimes I’ve heard an argument that goes like this: “Why shouldn’t it be simple to build a farm of databases just like we have farms of app servers or web servers?” Driven by this seemingly innocent question, you may try (or have tried) to put together clusters of databases to do analytics, either on your own or using one of the MPP products that are in the marketplace. This will work fine for small datasets *or* very simple queries (e.g. computing a sum of values). But, as any student of distributed systems knows, there is a reason why web servers scale so nicely: they are stateless! That’s why they’re so easy to deploy and scale. Databases, on the other hand, do have state. In fact, they have lots of it, perhaps several gigabytes per box. And, guess what, in analytics each query potentially needs access to all of it at once! So what works fine for very small numbers of nodes or small amounts of data does nothing for slightly more complex queries and larger systems, which is probably the issue you were trying to solve in the first place.
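To make the stateless-versus-stateful point concrete, here is a minimal sketch in Python. Everything in it is hypothetical and for illustration only: the two tiny tables, the two "nodes" (plain lists), and the function names are my own assumptions, not any product's API. It contrasts a sum, where each node ships a single number, with a join, where rows first have to be moved so that matching keys land on the same node.

```python
import zlib
from collections import defaultdict

# Hypothetical data: each inner list is the slice of a table held by one node.
orders_by_node = [
    [("alice", 30), ("bob", 10)],      # node 0
    [("carol", 25), ("alice", 5)],     # node 1
]
regions_by_node = [
    [("carol", "EU")],                 # node 0
    [("alice", "US"), ("bob", "US")],  # node 1
]

def distributed_sum(shards):
    """Each node sums its own rows; only one number per node is shipped."""
    partials = [sum(amount for _, amount in shard) for shard in shards]
    return sum(partials), len(partials)  # (result, values shipped)

def repartition(shards, num_nodes):
    """Route every row to a node chosen by hashing its key.

    Returns the new layout and how many rows had to change nodes.
    """
    out = [[] for _ in range(num_nodes)]
    moved = 0
    for src, shard in enumerate(shards):
        for row in shard:
            # crc32 is used only because it is deterministic across runs.
            dst = zlib.crc32(row[0].encode()) % num_nodes
            out[dst].append(row)
            if dst != src:
                moved += 1
    return out, moved

def revenue_by_region(order_shards, region_shards):
    """Join orders to regions on customer, then total the amounts per region."""
    n = len(order_shards)
    order_parts, moved_orders = repartition(order_shards, n)
    region_parts, moved_regions = repartition(region_shards, n)
    totals = defaultdict(int)
    for o_part, r_part in zip(order_parts, region_parts):
        region_of = dict(r_part)  # local lookup: matching keys are co-located now
        for customer, amount in o_part:
            totals[region_of[customer]] += amount
    return dict(totals), moved_orders + moved_regions

if __name__ == "__main__":
    print(distributed_sum(orders_by_node))
    print(revenue_by_region(orders_by_node, regions_by_node))
```

The sum never looks at another node's rows, so adding nodes is cheap. The join cannot start until the rows have been reshuffled, and in a real cluster that reshuffle is network traffic that grows with the data, not with the size of the answer.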

By the way, all the solutions that are in the marketplace today solve the wrong problems. For instance, some (e.g., “columnar” systems) optimize the disk I/O of individual nodes rather than overall system performance for complex queries, which is the real issue. Others (e.g., “MPP” databases) allow for fast execution of really simple queries but do nothing to make more complex ones run quickly. None of these products provides a solution that is even relevant to the hardest problems these systems face.

“Solution” Three. Write custom code. Why not? Google and Yahoo have done it pretty successfully! The only problem is, this approach is even more expensive than approach #1! Google has built a great infrastructure, but what is the cost to retain and compensate the best minds in the world who can develop and maintain your analytics? (Hint: it’s more than free snacks and soda.) I’ve frequently seen what starts as a simple, cheap solution for a single point problem evolve into a productivity nightmare, where each new data insight requires development time and specialized (thus expensive) skills. If you can afford that, that’s fine. But I’ll bet you do not want to spend your most precious resources reinventing the wheel every time you need to run a new query instead of doing what makes your company most successful.

The end result is that all of these approaches are pretty far from solving the real problem. Instead, the cost of becoming more competitive through data is currently huge - and it shouldn’t be! I believe that as soon as the right tools are built and made available, companies will immediately take advantage of them to be more competitive and successful. This is the upcoming data revolution that I see, and, frankly, it is long overdue.