In the recently announced nCluster 4.6, we continue to innovate and improve nCluster on many fronts to make it the high-performance platform of choice for deep, high-value analytics. One of the new features is a hybrid data store, which gives nCluster users the option of storing their data in either a row or a column orientation. With this addition, nCluster is the first data warehouse and analytics platform to combine tightly integrated hybrid row- and column-based storage with SQL-MapReduce processing capabilities. In this post we’ll discuss the technical details of the new hybrid store as well as the nCluster customer workloads that prompted the design.
Row- and Column-store Hybrid
Let’s start with the basics of row and column stores. In a row store, all of the attribute values for a particular record are stored together on the same on-disk page; put another way, each page contains one or more entire records. This layout is the canonical database design found in most database textbooks, as well as in both open-source and commercial databases. A column store flips this model around and stores values for only one attribute on each on-disk page. Constructing, say, an entire two-attribute record therefore requires data from two different pages in a column store, whereas in a row store the entire record is found on a single page. If a query needs only one attribute of that same two-attribute table, the column store delivers more needed values per page read, while the row store must read pages containing both attributes, wasting some I/O bandwidth on the unused attribute. Research has shown that for workloads that touch only a small percentage of a table’s attributes, column-oriented storage can yield much more efficient I/O because only the required data is read from disk. As more attributes are used, a column store becomes less competitive with a row store because of the overhead of combining the separate attribute values into complete records. In fact, for queries that access many (or all!) attributes of a table, a column store performs worse and is the wrong choice. A hybrid store provides the ability to choose the optimal storage for a given query workload.
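The page-count arithmetic above can be made concrete with a toy sketch. This is purely illustrative (it is not nCluster’s storage code, and the page size is deliberately tiny): it lays the same two-attribute table out both ways and counts how many pages a scan of a single attribute must read.

```python
PAGE_VALUES = 4                                   # attribute values that fit on one page
NUM_ROWS = 12
table = [(i, i * 10) for i in range(NUM_ROWS)]    # two-attribute records

# Row store: whole records packed per page -> PAGE_VALUES // 2 records fit.
recs_per_page = PAGE_VALUES // 2
row_pages = [table[i:i + recs_per_page]
             for i in range(0, NUM_ROWS, recs_per_page)]

# Column store: each attribute gets its own run of pages.
col_a = [r[0] for r in table]
col_a_pages = [col_a[i:i + PAGE_VALUES]
               for i in range(0, NUM_ROWS, PAGE_VALUES)]

# A scan that needs only attribute a:
print("row-store pages read:", len(row_pages))       # 6 (every page holds both attributes)
print("column-store pages read:", len(col_a_pages))  # 3 (only attribute-a pages touched)
```

With half the attributes needed, the column layout reads half the pages; as the query touches more attributes, that advantage shrinks and the record-reassembly overhead takes over, exactly as described above.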
Aster Data customers have a wide range of analytics use cases, from simple reporting to advanced analytics such as fraud detection, data mining, and time series analysis. Reports typically ask relatively simple questions of the data, such as total sales per region or per month. Such queries tend to require only a few attributes and therefore benefit from columnar storage. In contrast, deeper analytics, such as applying a fraud-detection model to a large table of customer behaviors, rely on applying that model to many attributes across many rows of data. In that case, a row store makes a lot more sense.
Clearly there are cases where having both a column and row store benefits an analytics workload, which is why we have added the hybrid data store feature to nCluster 4.6.
What does the addition of a hybrid store mean for typical nCluster workloads? The performance improvements from reduced I/O can be considerable: a 5x to 15x speedup was typical in our in-house tests on simple reporting queries with a few joins and aggregations. Performance improvement on more complex analytics workloads, however, was highly variable, so we took a closer look at why. As one would expect (and as a number of columnar publications demonstrate), queries that use all or almost all attributes in a table benefit little from columnar storage, or are even slowed down by it. Deep analytical queries in nCluster, such as scoring, fraud detection, and time series analysis, tend to use a higher percentage of columns, so as a class they did not benefit as much from columnar storage; but when these queries do use a smaller percentage of columns, choosing the columnar option in the hybrid store provided good speedups.
A further reason that these more complex queries benefit less from a columnar approach is Amdahl’s law. As we push more complex applications into the database via SQL-MapReduce, we see a higher percentage of query time spent running application code rather than reading from or writing to disk. This highlights an important trend in data analytics: user CPU cycles per byte are increasing, which is one reason that deployed nCluster nodes tend to have a higher CPU-per-byte ratio than one might expect in a data warehouse. The takeaway is that the hybrid store provides an important performance benefit for simple reporting queries, and for analytical workloads that mix ad hoc and simple reporting queries, performance is maximized by choosing the data orientation best suited to each workload.
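The Amdahl’s-law effect is easy to quantify. The sketch below (with made-up but representative I/O fractions) computes the overall speedup when only the I/O portion of query time is accelerated by a columnar layout:

```python
def amdahl_speedup(io_fraction, io_speedup):
    """Overall speedup when only the I/O fraction of query time is accelerated."""
    return 1.0 / ((1.0 - io_fraction) + io_fraction / io_speedup)

# Simple reporting query: 90% of time is I/O, columnar cuts I/O cost 10x.
print(round(amdahl_speedup(0.9, 10.0), 2))   # 5.26 -- a big win
# SQL-MapReduce-heavy query: only 20% of time is I/O.
print(round(amdahl_speedup(0.2, 10.0), 2))   # 1.22 -- a modest win
```

However fast the storage layer gets, the CPU-bound application code bounds the end-to-end gain, which is exactly what we observed in the complex-analytics tests.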
The hybrid store is made possible by integrating a column store within the nCluster data storage and query-processing engine, which already used row-storage. The new column storage is tightly integrated with existing query processing and system services. This means that any query answerable by the existing Aster storage engine can now also be answered in our hybrid store, whether the data is stored in row or column orientation. Moreover, all SQL-MapReduce features, workload management, replication, fail-over, and cluster backup features are available to any data stored in the hybrid store.
Providing flexibility and high performance on a wide range of workloads makes Aster Data the best platform for high-value analytics. To that end, we look forward to continuing development of the nCluster hybrid storage engine to further optimize row and column data access. Coupled with workload management and SQL-MapReduce, the new hybrid nCluster storage highlights Aster Data’s commitment to giving nCluster users the greatest flexibility to make the most of their data.
Recently, a journalist called to ask about in-memory data processing, a very interesting subject. I have always thought that in-memory processing would become more and more important as memory prices keep falling drastically. In fact, these days you can put 128GB of memory into a single system for less than $5K plus the server cost, and DDR3 and multiple memory controllers are delivering a huge performance boost. And if you run software that can handle shared-nothing parallelism (MPP), your memory cost increases linearly, and systems with TBs of memory become possible.
So what do you do with all that memory? Two classes of use cases are emerging today. The first is where you need to increase concurrent access to data while reducing latency. Tools like memcached offer in-memory caching that, used properly, can vastly improve latency and concurrency for large-scale OLTP applications like websites. A nice property of object caching is that it scales well in a distributed setting, and people have built TB-level caches. Memory-only OLTP databases have also started to emerge, such as VoltDB, and memory serves as a very important implicit caching layer in open-source key-value products like Voldemort. We should expect memory to play an ever more important role here.
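The caching idea is simple enough to sketch. Below is a toy in-memory object cache in the spirit of memcached (this is not memcached’s API; `MiniCache` and its methods are hypothetical names for illustration): on a miss the application falls through to the database, and the least recently used object is evicted when the cache is full.

```python
from collections import OrderedDict

class MiniCache:
    """Toy LRU object cache in the spirit of memcached (illustrative only)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def set(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)            # mark as most recently used
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)     # evict the least recently used entry

    def get(self, key):
        if key not in self.data:
            return None                       # miss: the caller hits the database
        self.data.move_to_end(key)
        return self.data[key]

cache = MiniCache(capacity=2)
cache.set("user:1", {"name": "alice"})
cache.set("user:2", {"name": "bob"})
cache.get("user:1")                           # touch user:1 so it stays warm
cache.set("user:3", {"name": "carol"})        # evicts the cold user:2
print(cache.get("user:2"))                    # None -> falls through to the database
```

Because each key can live on any node, caches like this partition naturally across a cluster, which is why TB-scale distributed deployments are practical.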
The second way to use memory is to gain “processing flexibility” when doing analytics. The idea is to throw your data into memory (however much of it fits, of course) without spending much time thinking about how to lay it out or what queries you’ll need to run. Because memory is so fast, most simple queries execute at interactive speeds and concurrency is handled well. European upstart QlikView exploits this fact to offer a memory-only BI solution that provides simple and fast BI reporting. The downside, as Curt Monash notes, is its applicability to only 10s of GBs of data.
By exploiting an MPP shared-nothing architecture, Aster Data has production clusters with TBs of total memory. Our software takes advantage of memory in two ways. First, it uses caching aggressively to ensure the most relevant data stays in memory; when data is in memory, processing is much faster and more flexible. Second, MapReduce is a great way to utilize memory because it gives the programmer full flexibility to use memory-focused data structures for data processing. In addition, Aster Data’s SQL-MapReduce provides tools that encourage the development of memory-only MapReduce applications.
However, one shouldn’t fall into the trap of thinking that all analytics will be in-memory anytime soon. While memory is down to $30/GB, disk manufacturers have been busy increasing platter density and dropping their price to less than $0.06/GB. Given that the amount of data in the world grows faster than Moore’s law and memory capacity, there will always be more data to be stored and analyzed than fits into any amount of memory an enterprise can deploy. In fact, most big data applications will have data sets that do not fit into memory: tools like memcached worry only about the present (e.g. current Facebook users), but analytics must worry about the past as well, and that means much more data. So a multi-layer architecture will be the only cost-effective way of analyzing large amounts of data for some time.
One shouldn’t discuss memory without mentioning solid-state disk products (like those of Aster Data partner Fusion-io). SSDs may well be the surprise here, given that their per-GB price is falling faster than that of disks (being a solid-state product that follows Moore’s law does help). In the next few years we’ll see SSDs in read-intensive applications providing advantages similar to memory’s while accommodating much larger data sizes.
Rumors abound that Intel is “baking” the successor of the very successful Nehalem CPU architecture, codenamed Westmere. It comes with an impressive spec: 10 CPU cores (supporting 20 concurrent threads) packed into a single chip. You can soon expect to see 40 cores in mid-range 4-socket servers – a number hard to imagine just five years ago.
We’re definitely talking about a different era. In the old days, you could barely fit a single core on a chip. (I still remember, 15 years ago, having to buy and install a separate math co-processor in my Mac LC to run Microsoft Excel and Mathematica.) And with the hardware, software has to change, too. In fact, modern software means software that can handle parallelism, which is what makes MapReduce such an essential and timely tool for big data applications. MapReduce’s purpose in life is to simplify data and processing parallelism for big data applications. It gives the programmer ample freedom in how to do things locally, and takes over when data needs to be communicated across processes, cores, or servers, thus eliminating much of the complexity of parallelism.
Once someone designs their software and data to operate in a parallelized environment using MapReduce, gains will come on multiple levels. Not only will MapReduce help your analytical applications scale across a cluster of servers with terabytes of data, it will also exploit the billions of transistors and the 10s of CPU cores inside each server. The best part: the programmer doesn’t need to think about the difference.
As an example, a great paper out of Stanford discusses MapReduce implementations of popular machine learning algorithms. The Stanford researchers treated MapReduce as a way of “porting” these algorithms (traditionally implemented to run on a single CPU) to a multi-core architecture. But, of course, the same MapReduce implementations can be used to scale these algorithms across a distributed cluster as well.
Hardware has changed – MPP, shared-nothing, commodity servers, and, of course, multi-core. In this new world, MapReduce is software’s response to big data processing. Intel and Westmere have just found an unexpected friend.
Our architecture enables SAS software procs to run natively inside the database, thereby preserving the statistical integrity of SAS software computations while delivering unprecedented performance increases in the analysis of large data sets. SAS Institute partners with other databases in this initiative too – but the difference is that each of those databases requires the re-implementation of SAS software procs as proprietary UDFs or stored procedures.
We also provide dynamic workload management capabilities to enable graceful resource sharing among SAS software computations, SQL queries, loads, backups, and scale-outs – all of which may be going on concurrently. Workload management enables administrators to dial resources for the data mining operations up or down based on the criticality of the mining and the other tasks being performed.
Our fast loading and trickle feed capabilities ensure that SAS software procs have access to fresh data for modeling and scoring, ensuring a timely and accurate analysis. This avoids the need to export snapshots (or samples) of data to an external SAS server for analysis, saving analysts valuable time in their iterations and discovery cycles.
We’ve been working with SAS Institute for a while now, and it is very evident why SAS has been the market leader in analytic applications for three decades. The technology team is very sharp, driven to innovate and execute. And as a result we’ve achieved a lot working together in a short time.
We look forward to working with SAS Institute to dramatically advance analytics for big data!
I had commented that a new set of applications is being written that leverages data to act smarter, enabling companies to deliver more powerful analytics. Operating a business today without serious insight into business data is not an option. Data volumes are growing like wildfire, applications are getting more data-heavy and more analytics-intensive, and companies are putting more demands on their data.
The traditional 20-year-old data pipeline of Operational Data Stores (to pool data), Data Warehouses (to store data), Data Marts (to farm out data), Application Servers (to process data), and UI (to present data) is under severe strain – because we expect a lot of data to move from one tier to another. Application Servers pull data from databases for computations and push the results to the UI servers. But data is like a boulder – the larger the data, the greater the inertia, and therefore the more time and effort needed to move it from one system to another.
The resulting performance problems of moving ‘big data’ are so severe that application writers unconsciously compromise the quality of their analysis by avoiding “big data computations”: they first reduce the “big data” to “small data” (via SQL-based aggregation, windowing, or sampling) and then perform computations on the “small data” or data samples.
The problem of ‘big data’ analysis will grow more severe over the next 10 years as data volumes grow and applications demand finer data granularity to model behavior and identify patterns, the better to understand and serve their customers. To do this, you have to analyze all of your available data. For the last 5 years, companies have routinely upgraded their data infrastructure every 12-18 months as data sizes double and the traditional data pipeline buckles under the weight of larger data movement – and they will be forced to keep doing so for the next 10 years if nothing fundamental changes.
Clearly, we need a new, sustainable solution to address this state of affairs.
The ‘aha!’ for big data management is to realize that the traditional data pipeline suffers from an architecture problem – it moves data to applications – and that this must change to allow applications to move to the data.
I am very pleased to announce a new version of Aster Data nCluster that addresses this challenge head-on.
Moving applications to the data requires a fundamental change to the traditional database architecture: applications are co-located inside the database engine so that they can iteratively read, write, and update all data. The new infrastructure acts as a ‘Data-Application Server’, managing both data and applications as first-class citizens. Like a traditional database, it provides a very strong data management layer. Like a traditional application server, it provides a very strong application-processing framework. It co-locates applications with data, eliminating data movement from the database to the application server. At the same time, it keeps the two layers separate to ensure the right fault-tolerance and resource-management models: bad data will not crash the application, and, vice versa, a bad application will not crash the database.
Our architecture and implementation ensure that applications do not have to be rewritten to make this transition. The application is pushed down into the Aster 4.0 system and transparently parallelized across the servers that store the relevant data. As a result, Aster Data nCluster 4.0 also delivers a 10x-100x boost in performance and scalability.
Those using Aster Data’s solution – including comScore, Full Tilt Poker, Telefonica I+D, and Enquisite – are testament to the benefits of this fundamental change. In each case, it was the embedding of the application with the data that enabled them to scale seamlessly and perform ultra-fast analysis.
The new release brings to fruition a major product-roadmap milestone that we’ve been working toward for the last 4 years. There is a lot more innovation coming – and this milestone is significant enough that we issue a clarion call to everyone working on “big data applications”: we need to move applications to the data, because the other way around is unsustainable in this new era.