Archive for September, 2010

By Tasso Argyros in Analytics, Blogroll, Data-Analytics Server on September 15, 2010

In the recently announced nCluster 4.6, we continue to innovate and improve nCluster on many fronts to make it the high-performance platform of choice for deep, high-value analytics. One of the new features is a hybrid data store, which gives nCluster users the option of storing their data in either row or column orientation. With this addition, nCluster is the first data warehouse and analytics platform to combine tightly integrated row- and column-based storage with SQL-MapReduce processing capabilities. In this post we’ll discuss the technical details of the new hybrid store as well as the nCluster customer workloads that prompted its design.

Row- and Column-store Hybrid

Let’s start with the basics of row and column stores. In a row store, all of the attribute values for a particular record are stored together in the same on-disk page; put another way, each page contains one or more entire records. This layout is the canonical design found in most database textbooks, as well as in both open source and commercial databases. A column store flips this model around and stores the values of only one attribute on each on-disk page.

This means that constructing, say, an entire two-attribute record requires data from two different pages in a column store, whereas in a row store the entire record would be found on a single page. If a query needs only one attribute of that same two-attribute table, the column store delivers more needed values per page read; the row store must read pages containing both attributes even though only one is needed, wasting some I/O bandwidth on the unused attribute. Research has shown that for workloads where a small percentage of a table’s attributes are required, a column-oriented storage model can result in much more efficient I/O because only the required data is read from disk.

As more attributes are used, a column store becomes less competitive with a row store because of the overhead of combining the separate attribute values into complete records. In fact, for queries that access many (or all!) attributes of a table, a column store performs worse and is the wrong choice. Having a hybrid store provides the ability to choose the optimal storage for a given query workload.
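To make the I/O difference concrete, here is a toy Python model of the pages each layout must read. The page size, record count, and fixed-width values are simplifying assumptions for illustration only, not details of nCluster’s on-disk format.

```python
# Toy model of row vs. column page layouts (illustration only).
# Assumes fixed-width values and a fixed number of values per page.

PAGE_SIZE = 1000  # values per page (a simplifying assumption)

def pages_read_row_store(num_records, num_attrs, attrs_needed):
    # A row store interleaves all attributes on each page, so every
    # page of the table is read no matter how few attributes the
    # query actually uses (attrs_needed does not change the cost).
    total_values = num_records * num_attrs
    return -(-total_values // PAGE_SIZE)  # ceiling division

def pages_read_column_store(num_records, num_attrs, attrs_needed):
    # A column store keeps each attribute on its own pages, so only
    # the pages of the needed columns are read.
    pages_per_column = -(-num_records // PAGE_SIZE)
    return attrs_needed * pages_per_column

# A query reading 2 of 20 attributes over 1M records:
row = pages_read_row_store(1_000_000, 20, 2)     # 20,000 pages
col = pages_read_column_store(1_000_000, 20, 2)  # 2,000 pages
print(f"row store: {row} pages, column store: {col} pages")
```

In this toy model the column store reads a tenth of the pages when the query touches a tenth of the attributes; as the query touches more attributes, the gap closes and tuple-reconstruction overhead (not modeled here) starts to dominate.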

Aster Data customers have a wide range of analytics use cases, from simple reporting to advanced analytics such as fraud detection, data mining, and time series analysis. Reports typically ask relatively simple questions of the data, such as total sales per region or per month; such queries tend to require only a few attributes and therefore benefit from columnar storage. In contrast, deeper analytics, such as applying a fraud detection model to a large table of customer behaviors, rely on reading many attributes across many rows of data. In that case, a row store makes a lot more sense.

Clearly there are cases where having both a column and row store benefits an analytics workload, which is why we have added the hybrid data store feature to nCluster 4.6.

Performance Observations

What does the addition of a hybrid store mean for typical nCluster workloads? The performance improvements from reduced I/O can be considerable: a 5x to 15x speedup was typical in some in-house tests on reporting queries, which were generally simple queries with a few joins and an aggregation. Performance improvement on more complex analytics workloads, however, was highly variable, so we took a closer look at why. As one would expect (and as a number of columnar publications demonstrate), queries that use all or almost all attributes in a table benefit little from columnar storage, or are even slowed down by it. Deep analytical queries in nCluster, such as scoring, fraud detection, and time series analysis, tend to use a higher percentage of columns; as a class, therefore, they did not benefit as much from columnar storage. But when such a query does use a smaller percentage of columns, choosing the columnar option in the hybrid store provided good speedup.

A further reason that these more complex queries benefit less from a columnar approach is Amdahl’s law. As we push more complex applications into the database via SQL-MapReduce, we see a higher percentage of query time spent running application code rather than reading from or writing to disk. This highlights an important trend in data analytics: user CPU cycles per byte are increasing, which is one reason that deployed nCluster nodes tend to have a higher CPU-per-byte ratio than one might expect in a data warehouse. The takeaway is that the hybrid store provides an important performance benefit for simple reporting queries; for analytical workloads that include a mix of ad hoc and simple reporting queries, performance is maximized by choosing the data orientation best suited to each workload.
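The Amdahl’s law effect can be sketched with a quick calculation. The time fractions below are hypothetical, chosen only to illustrate why CPU-heavy SQL-MapReduce queries see less end-to-end benefit from faster I/O.

```python
# Amdahl's law: if only the I/O fraction p of query time is sped up
# by a factor s (e.g. via columnar storage), the overall speedup is
# bounded by 1 / ((1 - p) + p / s).

def amdahl_speedup(p, s):
    return 1.0 / ((1.0 - p) + p / s)

# A simple reporting query spending 90% of its time on I/O,
# with a 10x I/O reduction from columnar storage:
print(amdahl_speedup(0.9, 10))  # ~5.3x overall

# A CPU-heavy SQL-MapReduce query spending only 20% of its time on I/O
# sees little end-to-end benefit from the same 10x I/O reduction:
print(amdahl_speedup(0.2, 10))  # ~1.2x overall
```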

Implementation

The hybrid store is made possible by integrating a column store within the nCluster data storage and query-processing engine, which already used row storage. The new column store is tightly integrated with existing query processing and system services. This means that any query answerable by the existing Aster storage engine can now also be answered from the hybrid store, whether the data is stored in row or column orientation. Moreover, all SQL-MapReduce, workload management, replication, failover, and cluster backup features are available to any data stored in the hybrid store.

Providing flexibility and high performance on a wide range of workloads makes Aster Data the best platform for high-value analytics. To that end, we look forward to continuing development of the nCluster hybrid storage engine to further optimize row and column data access. Coupled with workload management and SQL-MapReduce, the new hybrid storage highlights Aster Data’s commitment to giving nCluster users the flexibility to make the most of their data.



By Mayank Bawa in Statements on September 9, 2010

I’m delighted to announce that we’ve appointed a new CEO, Quentin Gallivan, to lead our company through the next level of growth.

We’ve had tremendous growth at our company in the past 4 years: we’ve grown Aster Data from 3 people to a strong, well-rounded team with a stellar management team, shipped products with market-defining features, worked with customers on fascinating projects across many industries, including retail, Internet, media and publishing, and financial services, and established key partnerships that we’re really excited about. Tasso and I will be working closely with Quentin as he accelerates our trajectory, taking our company to the next level of market leadership, sales and partnership execution, and international expansion.

Quentin brings more than 20 years of senior executive experience to Aster Data, having held a variety of CEO and senior executive positions at leading technology companies. He joins us from PivotLink, the leading provider of BI solutions, where, as CEO, he rapidly grew the company to over 15,000 business users, from mid-sized companies to F1000 companies, across key industries including retail, financial services, CPG, manufacturing, and high technology. Prior to PivotLink, Quentin served as CEO of Postini, where he scaled the company to 35,000 customers and over 10 million users until its acquisition by Google in 2007. He also served as executive vice president of worldwide sales and services at VeriSign, where he grew sales from $20M to $1.2B and was responsible for the global distribution strategy for the company’s security and services business. Quentin has also held a number of key executive and leadership positions at Netscape Communications and GE Information Services.

I’ll transition to a role that I’m really passionate about: working closely with our customers. As our Chief Customer Officer, I’ll lead the organization devoted to ensuring customer success and innovation across our fast-growing customer base. When the company was smaller, I was very actively involved in our customer deployments; as the company scaled, I had to pull back into operations. In my new role, I’ll be back doing the work I relish most – solving problems at the intersection of technology and usage – and providing a feedback loop from customers to Tasso, our CTO, to chart our product development.

Together, Quentin, Tasso and I are excited to accelerate our momentum and success in the market.