In the recently announced nCluster 4.6 we continue to innovate and improve nCluster on many fronts to make it the high performance platform of choice for deep, high value analytics. One of the new features is a hybrid data store, which now gives nCluster users the option of storing their data in either a row or column orientation. With the addition of this feature, nCluster is the first data warehouse and analytics platform to combine a tightly integrated hybrid row- and column-based storage with SQL-MapReduce processing capabilities. In this post we’ll discuss the technical details of the new hybrid store as well as the nCluster customer workloads that prompted the design.
Row- and Column-store Hybrid
Let’s start with the basics of row and column stores. In a row store, all of the attribute values for a particular record are stored together in the same on-disk page. Put another way, each page contains one or more entire records. Such a layout is the canonical database design found in most database textbooks, as well as both open source and commercial databases. A column store flips this model around and stores values for only one attribute on each on-disk page. This means that to construct, say, an entire two-attribute record will require data from two different pages in a column store, whereas in a row-store the entire record would be found on only one page. If a query needs only one attribute in that same two-attribute table, then the column store will deliver more needed values per page read. The row store must read pages containing both attributes even though only one attribute is needed, wasting some I/O bandwidth on the unused attribute. Research has shown that for workloads where a small percentage of attributes in a table are required, a column oriented storage model can result in much more efficient I/O because only the required data is read from disk. As more attributes are used, a column store becomes less competitive with a row store because there is an overhead associated with combining the separate attribute values into complete records. In fact, for queries that access many (or all!) attributes of a table, a column store performs worse and is the wrong choice. Having a hybrid store provides the ability to choose the optimal storage for a given query workload.
Aster Data customers have a wide range of analytics use cases from simple reporting to advanced analytics such as fraud detection, data mining, and time series analysis. Reports typically ask relatively simple questions of data such as total sales per region or per month. Such queries tend to require only a few attributes and therefore benefit from columnar storage. In contrast, deeper analytics such as applying a fraud detection model to a large table of customer behaviors relies on applying that model to many attributes across many rows of data. In that case, a row store makes a lot more sense.
Clearly there are cases where having both a column and row store benefits an analytics workload, which is why we have added the hybrid data store feature to nCluster 4.6.
What does the addition of a hybrid store mean for typical nCluster workloads? The performance improvements from reduced I/O can be considerable: a 5x to 15x speedup was typical in some in-house tests on reporting queries. These queries were generally simple reporting queries with a few joins and aggregation. Performance improvement on more complex analytics workloads, however, was highly variable, so we took a closer look at why. As one would expect (and a number of columnar publications demonstrate), we also find that queries that use all or almost all attributes in a table benefit little or are slowed down by columnar storage. Deep analytical queries in nCluster like scoring, fraud detection, and time series analysis tend to use a higher percentage of columns. Therefore, as a class, they did not benefit as much from columnar, but when these queries do use a smaller percentage of columns, choosing the columnar option in the hybrid store provided good speedup.
A further reason that these more complex queries benefit less from a columnar approach is Amdahl’s law. As we push more complex applications into the database via SQL-MapReduce, we see a higher percentage of query time spent running application code rather than reading or writing from disk. This highlights an important trend in data analytics: user CPU cycles per byte is increasing, which is one reason that deployed nCluster nodes tend to have a higher CPU per byte ratio than one might expect in a data warehouse. The takeaway message is that the hybrid store provides an important performance benefit for simple reporting queries and for analytical workloads that include a mix of ad hoc and simple reporting queries, performance is maximized by choosing the data orientation that is best suited for each workload.
The hybrid store is made possible by integrating a column store within the nCluster data storage and query-processing engine, which already used row-storage. The new column storage is tightly integrated with existing query processing and system services. This means that any query answerable by the existing Aster storage engine can now also be answered in our hybrid store, whether the data is stored in row or column orientation. Moreover, all SQL-MapReduce features, workload management, replication, fail-over, and cluster backup features are available to any data stored in the hybrid store.
Providing flexibility and high performance on a wide range of workloads, makes Aster Data the best platform for high value analytics. To that end, we look forward to continuing development of the nCluster hybrid storage engine to further optimize row and column data access. Coupled with workload management and SQL-MapReduce, the new hybrid nCluster storage highlights Aster Data’s commitment to provide nCluster users with the most flexibility to make the most of their data.