Recently, a journalist called to ask about in-memory data processing, a very interesting subject. I have always thought that in-memory processing would become more and more important as memory prices keep falling drastically. In fact, these days you can put 128GB of memory into a single system for less than $5K plus the cost of the server, not to mention that DDR3 and multiple memory controllers are giving a huge performance boost. And if you run software that can handle shared-nothing parallelism (MPP), your memory cost increases only linearly as you add nodes, and systems with TBs of memory are possible.
So what do you do with all that memory? Two classes of use cases are emerging today. The first is the case where you need to increase concurrent access to data while reducing latency. Tools like memcached offer in-memory caching that, used properly, can vastly improve latency and concurrency for large-scale OLTP applications like websites. A nice thing about object caching is that it scales well in a distributed way, and people have built TB-level caches. Memory-only OLTP databases, such as VoltDB, have started to emerge. And memory is used implicitly as a very important caching layer in open-source key-value products like Voldemort. We should expect memory to play an increasingly important role here.
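To make the caching use case concrete, here is a minimal sketch of the cache-aside pattern that tools like memcached are typically used for. It is an illustration under stated assumptions, not memcached's API: a plain Python dict stands in for the cache, and `CacheAside`, `load_from_store`, and `ttl_seconds` are hypothetical names chosen for this example.

```python
import time


class CacheAside:
    """Check an in-memory cache first; fall back to the slow store on a miss."""

    def __init__(self, load_from_store, ttl_seconds=60.0, clock=time.monotonic):
        self._load = load_from_store  # hypothetical slow lookup, e.g. a database query
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache = {}              # key -> (expires_at, value); stand-in for memcached
        self.hits = 0
        self.misses = 0

    def get(self, key):
        now = self._clock()
        entry = self._cache.get(key)
        if entry is not None and entry[0] > now:
            # Fresh entry: served from memory without touching the store.
            self.hits += 1
            return entry[1]
        # Missing or expired: read from the store and remember the result.
        self.misses += 1
        value = self._load(key)
        self._cache[key] = (now + self._ttl, value)
        return value

    def invalidate(self, key):
        """Drop a key after the underlying record changes."""
        self._cache.pop(key, None)
```

Repeated reads of a hot key hit memory only, which is where the latency and concurrency gains come from. A real deployment would replace the dict with a client for a distributed cache and would need an eviction policy to bound memory use.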
The second way to use memory is to gain “processing flexibility” when doing analytics. The idea is to throw your data into memory (as much of it as fits, of course) without spending much time thinking about how to lay it out or which queries you’ll need to run. Because memory is so fast, most simple queries will run at interactive speeds, and concurrency is handled well. European upstart QlikView exploits this fact to offer a memory-only BI solution that provides simple and fast BI reporting. The downside is that it applies to only tens of GBs of data, as Curt Monash notes.
By exploiting an MPP shared-nothing architecture, Aster Data has production clusters with TBs of total memory. Our software takes advantage of memory in two ways. First, it uses caching aggressively to ensure the most relevant data stays in memory, and when data is in memory, processing is much faster and more flexible. Second, MapReduce is a great way to utilize memory, as it gives the programmer full flexibility to use memory-focused data structures for data processing. In addition, Aster Data’s SQL-MapReduce provides tools that encourage users to develop memory-only MapReduce applications.
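As a rough illustration of what "memory-focused data structures" in a MapReduce step can look like, the sketch below hash-partitions records across workers and then groups and reduces each partition entirely in an in-memory dictionary. It is a generic Python example and not Aster Data's SQL-MapReduce interface; `partition`, `map_reduce_in_memory`, and their parameters are hypothetical names. It also assumes the partitioning key is the same as the key emitted by the map function, so that no cross-worker shuffle is needed.

```python
from collections import defaultdict


def partition(records, num_workers, key_of):
    """Hash-partition records so each worker owns a disjoint slice (shared-nothing)."""
    shards = [[] for _ in range(num_workers)]
    for record in records:
        shards[hash(key_of(record)) % num_workers].append(record)
    return shards


def map_reduce_in_memory(shard, map_fn, reduce_fn):
    """Run map and reduce over one shard using only an in-memory hash table."""
    groups = defaultdict(list)
    for record in shard:
        for key, value in map_fn(record):
            groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}


# Usage: total purchase amount per user, computed shard by shard.
purchases = [("alice", 3), ("bob", 5), ("alice", 4)]
shards = partition(purchases, num_workers=4, key_of=lambda r: r[0])

totals = {}
for shard in shards:
    totals.update(
        map_reduce_in_memory(
            shard,
            map_fn=lambda r: [(r[0], r[1])],
            reduce_fn=lambda user, amounts: sum(amounts),
        )
    )
```

Because every record for a given user lands in the same shard, the per-shard results have disjoint keys and can simply be merged. In a real cluster each shard would live in a different node's memory and the loop over shards would run in parallel.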
However, one shouldn’t fall into the trap of thinking that all analytics will be in-memory anytime soon. While memory is down to $30/GB, disk manufacturers have been busy increasing platter density and dropping their price to less than $0.06/GB. Given that the amount of data in the world grows faster than Moore’s law and memory capacity, there will always be more data to store and analyze than fits into any amount of memory an enterprise can afford. In fact, most big data applications will have data sets that do not fit into memory because, while tools like memcached worry only about the present (e.g. current Facebook users), analytics needs to worry about the past as well, and that means much more data. So a multi-layer architecture will be the only cost-effective way of analyzing large amounts of data for some time.
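The cost argument can be checked with a few lines of arithmetic. The sketch below uses only the two per-GB prices quoted above; the function name `storage_cost_usd`, the 10 TB data set, and the 5% hot fraction are assumptions made for illustration, and server, power, and replication costs are ignored.

```python
MEMORY_USD_PER_GB = 30.0  # memory price quoted in the post
DISK_USD_PER_GB = 0.06    # disk price quoted in the post


def storage_cost_usd(total_gb, hot_fraction):
    """Cost of keeping all data on disk plus `hot_fraction` of it in memory."""
    if not 0.0 <= hot_fraction <= 1.0:
        raise ValueError("hot_fraction must be between 0 and 1")
    memory_cost = total_gb * hot_fraction * MEMORY_USD_PER_GB
    disk_cost = total_gb * DISK_USD_PER_GB
    return memory_cost + disk_cost


total_gb = 10_000  # hypothetical 10 TB data set (decimal GB)

disk_only = storage_cost_usd(total_gb, hot_fraction=0.0)
tiered = storage_cost_usd(total_gb, hot_fraction=0.05)
all_in_memory = storage_cost_usd(total_gb, hot_fraction=1.0)

print(f"disk only:     ${disk_only:,.0f}")
print(f"5% in memory:  ${tiered:,.0f}")
print(f"all in memory: ${all_in_memory:,.0f}")
```

At these prices memory costs about 500 times as much per GB as disk, so holding the whole 10 TB in memory comes to roughly $300,000 against roughly $600 for disk alone, while keeping a 5% hot set in memory lands near $15,600. That gap is what makes a layered design attractive.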
One shouldn’t discuss memory without mentioning solid-state disk products (like those of Aster Data partner company Fusion-io). SSDs are likely to be the surprise here, given that their per-GB price is falling faster than that of disks (being a solid-state product that follows Moore’s law does help). In the next few years we’ll see SSDs in read-intensive applications providing advantages similar to memory while accommodating much larger data sizes.