Archive for June, 2010

By Tasso Argyros in Analytics, Blogroll, Data-Analytics Server on June 23, 2010

Recently, a journalist called to ask about in-memory data processing, a very interesting subject. I have always thought that in-memory processing would become more and more important as memory prices keep falling drastically. In fact, these days you can put 128GB of memory into a single system for less than $5K plus the cost of the server, not to mention that DDR3 and multiple memory controllers give a huge performance boost. And if you run software that can handle shared-nothing parallelism (MPP), memory capacity scales linearly with cost across nodes, so systems with TBs of memory become possible.

So what do you do with all that memory? Two classes of use cases are emerging today. The first is where you need to increase concurrent access to data while reducing latency. Tools like memcached offer in-memory caching that, used properly, can vastly improve latency and concurrency for large-scale OLTP applications like websites. Another nice thing about object caching is that it scales well in a distributed setting; people have built TB-scale caches. Memory-only OLTP databases, such as VoltDB, have started to emerge, and memory already serves implicitly as a very important caching layer in open-source key-value products like Voldemort. We should expect memory to play an increasingly important role here.
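As a concrete (if simplified) illustration of that caching use case, here is a minimal cache-aside sketch in Python. It assumes the pymemcache client and a made-up load_user_from_db() lookup; neither is prescribed by anything above.

```python
from pymemcache.client.base import Client

cache = Client(("localhost", 11211))     # memcached keeps this data in RAM

def load_user_from_db(user_id):
    # Stand-in for the real (slower, disk-backed) OLTP lookup.
    return "user-record-for-%s" % user_id

def get_user(user_id):
    key = "user:%s" % user_id
    cached = cache.get(key)              # hit: served straight from memory
    if cached is not None:
        return cached.decode("utf-8")
    value = load_user_from_db(user_id)   # miss: fall back to the database
    cache.set(key, value, expire=300)    # keep the hot copy cached for 5 minutes
    return value
```

Nothing about the pattern is memcached-specific; the point is simply that repeated reads are absorbed by RAM rather than by the database.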

The second way to use memory is to gain “processing flexibility” when doing analytics. The idea is to throw your data into memory (however much of it fits, of course) without spending much time thinking about how to lay it out or what queries you will need to run. Because memory is so fast, most simple queries execute at interactive speeds and concurrency is handled well. European upstart QlikView exploits this to offer a memory-only BI solution that provides simple, fast reporting. The downside, as Curt Monash notes, is that it applies only to tens of GBs of data.

By exploiting an MPP shared-nothing architecture, Aster Data has production clusters with TBs of total memory. Our software takes advantage of memory in two ways. First, it caches aggressively to keep the most relevant data in memory, and when data is in memory, processing is much faster and more flexible. Second, MapReduce is a great way to utilize memory because it gives the programmer full flexibility to use memory-focused data structures for data processing. In addition, Aster Data’s SQL-MapReduce gives users tools that encourage the development of memory-only MapReduce applications.
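As a rough illustration of that second point, the sketch below (plain Python, not Aster’s SQL-MapReduce API) shows the kind of memory-focused data structure a map step can use: each partition is aggregated into an in-memory hash table, so the reduce step only has to merge small partial results.

```python
from collections import Counter

def map_partition(rows):
    # rows: (user_id, amount) records for one data partition
    totals = Counter()                 # memory-resident hash aggregation
    for user_id, amount in rows:
        totals[user_id] += amount
    return totals

def reduce_partials(partials):
    merged = Counter()
    for partial in partials:
        merged.update(partial)         # Counter.update adds the counts together
    return merged

partitions = [[("a", 10), ("b", 5)], [("a", 7), ("c", 3)]]
print(reduce_partials(map_partition(p) for p in partitions))
# totals: a=17, b=5, c=3
```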

However, one shouldn’t fall into the trap of thinking that all analytics will be in-memory anytime soon. While memory is down to about $30/GB, disk manufacturers have been busy increasing platter density and have dropped their price to less than $0.06/GB. Given that the amount of data in the world grows faster than Moore’s law and faster than memory capacities, there will always be more data to store and analyze than fits into any amount of memory an enterprise can use. In fact, most big data applications will have data sets that do not fit into memory: tools like memcached only worry about the present (e.g. current Facebook users), whereas analytics has to worry about the past as well, and that means much more data. So a multi-layer architecture will remain the only cost-effective way to analyze large amounts of data for some time.

One shouldn’t discuss memory without mentioning solid-state disk products (such as those from Aster Data partner Fusion-io). SSDs are likely to be the surprise here, given that their per-GB price is falling faster than that of disks (being a solid-state product that follows Moore’s law does help). In the next few years we’ll see SSDs in read-intensive applications provide advantages similar to memory while accommodating much larger data sizes.



By Tasso Argyros in Blogroll, Data-Analytics Server on June 22, 2010

Rumors abound that Intel is “baking” the successor to the very successful Nehalem CPU architecture, codenamed Westmere. It comes with an impressive spec: 10 CPU cores (supporting 20 concurrent threads) packed into a single chip. You can soon expect to see 40 cores in mid-range 4-socket servers, a number that was hard to imagine just five years ago.

We’re definitely talking about a different era. In the old days you could barely fit a single core on a chip. (I still remember, 15 years ago, having to buy and install a separate math co-processor on my Mac LC to run Microsoft Excel and Mathematica.) And with the hardware, the software has to change, too. In fact, modern software means software that can handle parallelism, and that is what makes MapReduce such an essential and timely tool for big data applications. MapReduce’s purpose in life is to simplify data and processing parallelism for big data applications. It gives the programmer ample freedom in how to do things locally, and it takes over when data needs to be communicated across processes, cores, or servers, removing much of the complexity of parallelism.

Once someone designs their software and data to operate in a parallelized environment using MapReduce, the gains come on multiple levels. Not only will MapReduce help your analytical applications scale across a cluster of servers holding terabytes of data, it will also exploit the billions of transistors and the tens of CPU cores inside each server. The best part: the programmer doesn’t need to think about the difference.
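Here is a toy sketch of that “no difference to the programmer” point, under the assumption that a process pool is an acceptable stand-in for a cluster scheduler: the map and reduce functions below are ordinary Python functions, and multiprocessing.Pool spreads the map work across the CPU cores of one server without changing them.

```python
from collections import Counter
from functools import reduce
from multiprocessing import Pool

def count_words(chunk):
    # "map": purely local work on one chunk of text
    return Counter(chunk.split())

def merge(left, right):
    # "reduce": combine two partial results
    left.update(right)
    return left

if __name__ == "__main__":
    chunks = ["big data big memory", "memory and cores", "big cores"]
    with Pool() as pool:                       # one worker per CPU core
        partials = pool.map(count_words, chunks)
    print(reduce(merge, partials, Counter()))  # Counter({'big': 3, 'memory': 2, ...})
```

The same two functions could just as well be handed to a distributed MapReduce framework; only the plumbing around them changes.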

As an example, a great paper out of Stanford discusses MapReduce implementations of popular machine learning algorithms. The Stanford researchers used MapReduce as a way of “porting” these algorithms (traditionally implemented to run on a single CPU) to a multi-core architecture. But, of course, the same MapReduce implementations can be used to scale these algorithms across a distributed cluster as well.
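Many such algorithms port naturally because their statistics can be written as sums over the data, which mappers compute per partition and a reducer adds up. The least-squares sketch below is my own example of that idea, not code from the paper.

```python
import numpy as np

def map_chunk(X, y):
    # Local work over one data partition: two small partial sums.
    return X.T @ X, X.T @ y

def reduce_and_solve(partials):
    XtX = sum(p[0] for p in partials)
    Xty = sum(p[1] for p in partials)
    return np.linalg.solve(XtX, Xty)   # same answer as fitting all the data at once

# Toy usage on two "partitions" of one dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.01, size=100)
chunks = [(X[:50], y[:50]), (X[50:], y[50:])]
print(reduce_and_solve([map_chunk(Xc, yc) for Xc, yc in chunks]))  # ~ [1, -2, 0.5]
```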

Hardware has changed: MPP, shared-nothing, commodity servers and, of course, multi-core. In this new world, MapReduce is software’s answer for big data processing. Intel and Westmere have just found an unexpected friend.



By Steve Wooledge in Blogroll on June 14, 2010

As the market around big data heats up, it’s great to see the ecosystem for Hadoop, MapReduce, and massively parallel databases expanding. That expansion includes events for education and networking.

As such, Aster Data is co-sponsoring our first official “unconference” the night before the 2010 Hadoop Summit. It’s called BigDataCamp and will be held June 28th from 5:00 to 9:30 PM at the TechMart (adjacent to the Hyatt where the Hadoop Summit is taking place). As at our ScaleCamp event last year, where we heard from companies like LinkedIn and ShareThis and from industry practitioners like Chris Wensel (author of Cascading), there will be a lineup of great talks, including hands-on workshops led by Amazon Web Services, Karmasphere, and more. In addition, we’re lucky to have Dave Nielsen as the moderator and organizer of the event; he has chaired similar unconferences such as CloudCamp and is an expert at shaping content and discussions to fit attendee interests.

The open, dynamic agenda style of an unconference is very fitting given that the audience will largely be “analytic scientists”, a title I’ve seen LinkedIn use to describe the rise of job roles dedicated to tackling big data within companies in order to tease out insights and develop data-driven products and applications. The analytic scientists I speak with who use Aster Data together with Hadoop challenge norms and move quickly, not unlike an unconference agenda. I expect a night of free thinking (and free drinks and food), big ideas, and a practical look at emerging technologies and techniques for tackling big data. Best of all, the networking portion is a great chance to meet people, hear what they’re up to, and exchange ideas.

Check out the agenda at www.bigdatacamp.org and note that seats are limited and we expect to sell out, so please REGISTER NOW. Hope to see you there!