The “big data” world is a product of exploding applications. The number of applications that are generating data has just gone through the roof. The number of applications that are being written to consume the generated data is also growing rapidly. Each application wants to produce and consume data in a structure that is most efficient for its own use. As Gartner points out in a recent report on big data[1], “Too much information is a storage issue, certainly, but too much information is also a massive analysis issue.”
In our data-driven economy, business models are being created (and destroyed) and are shifting based on the ability to compete on data and analytics. The winners realize the advantage of having platforms that allow data to be stored in multiple structures and (more importantly) allow data to be processed in multiple structures. This allows companies to more easily 1) harness and 2) quickly process ALL of the data about their business to better understand customers, behaviors, and opportunities/threats in the market. We call this “multi-structured” data, which has been a topic of discussion lately with IDC Research (where we first saw the term referenced) and other industry analysts. It is also the topic of an upcoming webcast we’re doing with IDC on June 15th.
To us, multi-structured data means “a variety of data formats and types.” This could include any data, “structured” or “unstructured”, “relational” or “non-relational”. Curt Monash has blogged about naming such data Poly-structured or Multi-structured. At the core is the ability for an analytic platform to both 1) store and 2) process a diversity of formats in the most efficient way possible.
Handling Multi-structured Data
We in the industry use the term “structured” data to mean “relational” data. And data that is not “relational” is called “unstructured” or “semi-structured.”
Unfortunately, this definition lumps text, csv, pdf, doc, mpeg, jpeg, html, log files as unstructured data. Clearly, all of these forms of data have an implicit “structure” to them!
My first observation is that relational is one way of manifesting the data. Text is another way of expressing the data, and JPEG, GIF, BMP and other formats are structured forms of expressing images. For example, (Mayank, Aster Data, San Carlos, 6/1/2011) is a relational row stored in a table (Name, Company Visited, City Visited, Date Visited) – the same data can be expressed in text as “Mayank visited Aster Data, based in San Carlos, on June 1, 2011.” A geo-tagged photograph of Mayank entering the Aster Data office in San Carlos on June 1, 2011 will also capture the same information.
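To make the row-versus-text point concrete, here is a minimal Python sketch. The `Visit` type, the helper names, and the regular expression are all hypothetical and written only for this one sentence shape. It renders the relational row as the English sentence and parses that exact sentence back into a row.

```python
import re
from datetime import date
from typing import NamedTuple

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]


class Visit(NamedTuple):
    """Hypothetical relational row: (Name, Company Visited, City Visited, Date Visited)."""
    name: str
    company: str
    city: str
    visited: date


def to_sentence(v: Visit) -> str:
    """Express the row as free text."""
    when = f"{MONTHS[v.visited.month - 1]} {v.visited.day}, {v.visited.year}"
    return f"{v.name} visited {v.company}, based in {v.city}, on {when}."


# This pattern only understands the single sentence template produced above.
SENTENCE = re.compile(
    r"^(?P<name>.+?) visited (?P<company>.+?), based in (?P<city>.+?), "
    r"on (?P<month>\w+) (?P<day>\d{1,2}), (?P<year>\d{4})\.$"
)


def from_sentence(text: str) -> Visit:
    """Recover the row from text; fails on any other phrasing."""
    m = SENTENCE.match(text)
    if m is None:
        raise ValueError("sentence does not match the expected template")
    visited = date(int(m["year"]), MONTHS.index(m["month"]) + 1, int(m["day"]))
    return Visit(m["name"], m["company"], m["city"], visited)
```

The round trip works only because the parser knows the sentence template in advance. Any other phrasing of the same fact would need another pattern, which hints at why going from text to relational form is hard in general.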
My second observation is that the “structure” of data is what lets applications understand the data and know what to do with it. For example, a SQL-based application can issue the right SQL queries to process its logic; an image viewer can decode JPG/GIF/BMP files to render the image; a text engine can parse subjects, verbs, and objects to interpret the data; etc.
Each application leverages the structure of data to do its processing in the most efficient manner. Thus, search engines recognize the white-space structure in English and can build inverted indexes on words to do fast searches. Relational engines recognize row headers and tuple boundaries to build indexes that can be used to retrieve selected rows very quickly. And so on.
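As a rough illustration of how a search engine exploits white-space structure, the following Python sketch builds a toy inverted index. The function names and the punctuation handling are assumptions made for this example; real engines do far more (stemming, ranking, compression).

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each word to the set of document ids that contain it.

    `docs` is a dict of {doc_id: text}. Words are split on white space,
    lower-cased, and stripped of surrounding punctuation.
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            word = token.strip(".,;:!?\"'()")
            if word:
                index[word].add(doc_id)
    return index


def search(index, *words):
    """Return ids of documents containing every query word."""
    if not words:
        return set()
    postings = [index.get(w.lower(), set()) for w in words]
    return set.intersection(*postings)
```

For example, indexing three short documents and calling `search(index, "aster", "data")` touches only the two posting sets for those words rather than rescanning every document.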
My third observation is that each application produces data in a structure that is most efficient for its use. Thus, applications produce logs; cameras produce images; business applications produce relational rows; Web content engines produce HTML pages; etc. It is very hard to “transform” data from one structure to another. ETL tools have their hands full just doing transformations from one relational schema to another relational schema. And semantic engines have a hard time “transforming” text to relational forms. All such “across structure” transforms lose information.
Relational databases handle relational structure and relational processing very efficiently, but they are severely limiting in their capabilities to store and process other structures (e.g., text, xml, jpg, pdf, doc). In these engines, relations are a first-class citizen; every other structure is a distant second-class citizen.
Hadoop is exciting in the “Big Data” world because it doesn’t pre-suppose any structure. Data in any structure can be stored in plain files. Applications can read the files and build their own structures on the fly. It is liberating. However, it is not efficient – precisely because it reduces all data to its base form of files and robs the data of its structure – the structure that would allow for efficient processing or storage by applications! Each application has to redo its work from scratch.
What would it take for a platform to treat multiple structures of data as first class citizens? How could it natively support each format, yet provide a unified way to express queries or analytic logic at the end-user level so as to abstract away the complexity/diversity of the data and provide insights more quickly? It’d be liberating as well as efficient!
[1] “’Big Data’ Is Only the Beginning of Extreme Information Management”. Gartner Research, April 7, 2011
In case you missed the news, Aster Data just took another step to make SQL-MapReduce the best programming framework for big data analytics. The Aster Data SQL-MapReduce® Developer Portal is the first collaborative online developer community for SQL-MapReduce analytics, our framework for processing non-relational data and ultra-fast analytics. It builds on other efforts to enable MapReduce analytics including: Developer Center, a resource center for MapReduce and SQL-MapReduce developers; Aster Data Developer Express, the first integrated development environment for SQL-MapReduce; and Aster Data Analytic Foundation, a suite of ready-to-use SQL-MapReduce functions.
The Developer Portal gives our customers and partners a community for collaborating with peers to leverage the flexibility and power of SQL-MapReduce for analytics that were previously impossible or impractical. Data scientists, quantitative analysts, and developers from customers, partners, and Aster Data are using the portal to highlight insights and best practices, share analytic functions, and leverage the experience and knowledge of the community to easily harness the power of SQL-MapReduce for big data analytics.
The portal enables collaboration that is key in making it easy for our customers to become SQL-MapReduce experts so they can solve core business challenges. As Navdeep Alam, director of data architecture at Mzinga, said, the portal “will allow us the ability to share and leverage insights with others in using big data analytics to attain a deeper understanding of customers’ behavior and create competitive advantage for our business.”
We’re seeing strong interest in the Developer Portal from our current customers. Early activity and content on the portal includes discussions about using the GSL libraries, programming in .NET, and writing sessionization and sampling functions. We plan to expand on this with tutorials for additional functions over the next few months.
If you aren’t already a customer, we encourage you to get started at the Aster Data Developer Center, where you can get your hands on SQL-MapReduce by downloading Aster Data Developer Express for free and find links to other resources like www.mapreduce.org. If you are an Aster Data customer, we encourage you to also register for access to the new SQL-MapReduce Developer Portal for additional content and learning.
We’re always interested in your feedback as to how we can better help developers learn about and use MapReduce and Aster Data’s SQL-MapReduce. If you have any suggestions, please feel free to add them below in the comments.
We are very excited to share with you that today we announced our company, Aster Data, is being acquired by Teradata, who as you all know commands the #1 position in data warehousing. Together, we will tackle the massive opportunity in the big data and big data analytics market. Upon close, Aster Data will become part of the Teradata organization and our products will become part of the Teradata family of products, sold stand-alone, and integrated into their product line.
The combined goal is big, as said on Teradata’s web site home page:

Today marks a major milestone in our continuing journey, and we are thrilled to join forces with the market leader in data management. Our company has achieved a lot since our inception just 5 years ago, and we look forward to accelerating our innovation and market reach even further – with the market strength of Teradata and the speed of our combined cultures. In 5 years, we’ve played a big role in shaping the Big Data Analytics Platform market and innovated on new technologies that enable customers to store diverse, granular data and process it in diverse ways. The big data opportunity as we see it is more about extracting insights from your diverse data than just finding cost-effective ways to store it. Processing and extracting deep insights from diverse and big data is where we’ve innovated and broken new ground, and with this merger we will accelerate it further.
Our journey started when we realized that (a) it was hard and expensive to manage big data, and (b) it was nearly impossible to process and analyze diverse (non-relational) data types like Web clicks, social connections, and text files at scale. The two worlds of data management and data processing were separate – RDBMSs would store and manage data in their world; however, applications and tools would do analytics outside of the database. This division severely restricted the types of analytics possible on large amounts of data. We discussed this in more detail on an earlier blog post from January 26.
The real impact of the above two restrictions was that organizations were drowning in a flood of data and couldn’t make any sense out of it. For instance, organizations couldn’t analyze enough data to understand their customers at an individual level, and thus they couldn’t improve their products and customer experience. Or, they couldn’t detect advanced fraud schemes because the offenders were hiding in terabytes of data (the outliers) and complicated money network schemes, resulting in huge losses.
Foreseeing this opportunity, we decided to change the enterprise data infrastructure and build a platform that (a) uses commodity hardware to scale at unprecedented levels while keeping costs low, (b) combines data management and data processing in one platform to allow much deeper analysis of data at much larger scale, and (c) accommodates the processing of diverse data types (e.g. machine generated data, social network data, text data, etc.) in a single platform.
Over the past 5 years we have been aggressively building our technology and developing this big new market. We’ve had continuous and increasing success – one recognition of this was Gartner’s recent Magic Quadrant. And looking forward, we were seeing a 2011 where the new market we were creating would become mainstream reality across organizations. As a Gartner press release recently stated: “2011 will be the year when data warehousing reaches what could well be its most-significant inflection point since its inception… The biggest, and possibly most-elaborate data management system in the IT house is changing. The new data warehouse will introduce new scope for flexibility in adding new information types and change detection.”
And this execution now sets the stage for our joining forces with Teradata. We love this merger for 3 reasons:
- First, we love that Teradata is by far the most successful data warehousing and data-driven applications company in the world. As founders, we understood that Teradata will accelerate our vision and will back us in realizing the full potential of the Big Data Analytics Platform.
- Second, we have always had a big and ambitious technology vision. A bold vision needs time and resources to execute to its full potential. As part of Teradata, we will have the resources and support needed to accelerate our technology. We will also have access to a global sales organization and channel to accelerate the adoption of the Big Data Analytics Platform, and ultimately bring more benefits to our customers, more quickly.
- Third, Aster Data nCluster is very complementary to Teradata’s existing product portfolio. By combining products from both companies, we can come to market with solutions that solve a very wide range of diverse data management and data analysis problems using “best of breed” components. We expect both Aster Data and Teradata customers to find our joint offerings very unique and valuable for their business, thus increasing their opportunities and decreasing their costs.
In closing, we want to reiterate that we have never been more excited about our market, our company and our opportunity! Our vision has proven to be right early on and we’ve watched other players in our market try to follow suit – that’s just one external validation of our direction, and there have been many more as our customers use our products to break new ground in analytics insights on diverse and big data. As we innovated and as we delivered on the vision for big and diverse data management, our team’s execution has truly defined and helped shape the market. And in this evolution, we are more confident and tremendously excited as we write the next chapter of this market.
Upon close of this transaction, the merger with Teradata is about taking our products, our innovations, our IP, and the Aster Data team, and accelerating our lead in the big data and big data analytics market. Or, simply put, it’s about ‘going big.’
We really want to thank our customers, who believed in us, drove key input into our product roadmap, and saw the big data opportunity. We promise that our commitment and support to you all will only increase in the future. We also thank our team, who joined a small company and have worked hard to make it so successful. And finally, we thank our investors, who understood the opportunity and believed they were going to be part of something new, valuable and exciting.
For more information on today’s announcement, read the full press release at www.asterdata.com and also visit Teradata’s web site, www.teradata.com.
- Mayank Bawa & Tasso Argyros
In my previous post, I spoke about how strongly I feel that this is the year that the analytic platform will become its own distinct and unique category. As the market as a whole realizes the value of integrated data and process management, in-database applications and in-database analytics, the “analytic platform”, or “analytic computing system”, or “data analytics server” (pick your name) will gain even more momentum, reaching critical mass this year.
In this process, you will see significant movement from vendors, first in their marketing collateral (as is always the case for followers in a technology space) and then in scrambling to cover their product gaps in the 5 categories that define a true analytic platform, which I mentioned in Part I of “2011: The Year of the Analytics Platform”.
What took Aster Data 6+ years to build cannot be done overnight, or over a few releases, especially if the fundamental architecture is not there from day one. (Side note: if you are interested in software product development and haven’t read The Mythical Man-Month, now is a good time. It’s an all-time classic and explains this point very clearly.)
But the momentum for the analytic platform category is there and, at this point, is irreversible. Part of this powerful trend is derived from the central place that analytics is taking in the enterprise and government. Analytics today is not a luxury, but a necessity for competitiveness. Every industry today is thinking about how to employ analytics to better understand its customers, cut costs, and increase revenues. For example, companies in the financial services sector, a fiercely competitive space, want to use the wealth of data they have to become more relevant to their customers and to increase customer satisfaction and retention rates. Governments’ use of data and analytics is one of the few last resorts against terrorism and cyber threats. In retail, the advent of the Internet, social networks, and globalization has increased competition and reduced margins. Using analytics to understand cross-channel behavior and preferences of consumers improves the returns of marketing campaigns and optimizes product pricing and placement, and can make the difference between red and black ink at the bottom of the balance sheet.
When we kicked off Aster Data back in 2005, we envisioned building a product that would advance the state of the art in data management in two areas: (1) size and diversity of data and (2) depth of insight/analytics. My co-founders and I quickly realized that building just another database wouldn’t cut it. With yet-another-database, even if we enabled companies to more cost-effectively manage large data sizes, it was not going to be enough given the explosion in diverse data types and the massive need to process all of it. So we set out to build a new platform that would solve these challenges – what’s now commonly known as the ‘Big Data’ challenge.
Fast forward to 2008, when Aster Data led the way in putting massively parallel processing inside an MPP database, using MapReduce, to advance how you process massive amounts of diverse data. While this was fully aligned with our vision for managing hordes of diverse data and allowing deep data processing in a single platform, most thought it was intriguing but couldn’t quite see the light in terms of where the future was going. At one point, we thought of naming our product XAP – “extreme analytic platform” or “extreme analytic processing” – as that’s what it was designed to do from day one. However, we thought better of it since we thought we would have to educate people too much on what an “analytic platform” was and how it was different from a traditional DBMS for data warehousing. Since we also were serving the data architects in organizations as well as the front-line business that demands better, faster analytics, we needed to use terminology that resonated with both.
Then, in the fall of 2009, with our flagship product Aster Data nCluster 4.0, we made further strides in running advanced analytics inside the database by including all the built-in application services (e.g., dynamic WLM, backup, monitoring, etc.) to go with it. At that time, we referred to it as a Data-Application Server – which our customers quickly started calling a Data-Analytics Server. I remember when analyst Jim Kobielus at Forrester said,
“It’s really innovative and I don’t use those terms lightly. Moving application logic into the data warehousing environment is ‘a logical next step’.”
And others said,
“The platform takes a different approach from traditional data warehouses, DBMS and data analytics solutions by housing data and applications together in one system, fully parallelizing both. This eradicates the need for movements of massive amounts of data and the problems with latency and restricted access that creates.”
What they started to fully appreciate and realize is that big data is not just about storing hordes of data, but rather about cracking the code on how to process all of it in deep ways, at blazing fast speeds.
By Tasso Argyros in MapReduce on December 8, 2010
In the past couple of years, MapReduce – once an unknown, funky word – became a prominent, mainstream trend in data management and analytics. However, even today I meet people who are not clear on what exactly MapReduce is and how it relates to some other terms and trends. In this post I attempt to clarify some of the MapReduce-related terminology. So here it goes.
MapReduce (the framework). MapReduce is a framework that allows programmers to develop analytical applications that run on (usually large) clusters of commodity hardware and process (usually large) amounts of data. It was first introduced by Google and it is language independent. It is abstract in the sense that an application that uses MapReduce doesn’t have to care about things like the number of servers/processes, fault tolerance, etc. MapReduce is supported by multiple implementations including the open source project Hadoop and Aster Data. Google also has its own proprietary implementation which, unfortunately, is also called MapReduce and sometimes creates confusion.
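To give a feel for the programming model, here is a single-process Python sketch of a word count expressed as map, shuffle, and reduce steps. It is only an illustration: the function names are made up for this example, and it is not the API of Google's MapReduce, Hadoop, or Aster Data. A real framework would run the map and reduce calls in parallel across many servers and handle failures, which is exactly the part the application does not have to care about.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def run_job(records):
    """Drive the three phases sequentially over an iterable of records."""
    mapped = chain.from_iterable(map_phase(r) for r in records)
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, vs) for k, vs in grouped.items())
```

The application author writes only `map_phase` and `reduce_phase`; the grouping and the driver stand in for what the framework provides.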
MapReduce (the Google implementation of the MapReduce framework). As mentioned above, Google has its own implementation of MapReduce. This was described in the 2004 OSDI paper and it was the theoretical basis upon which Hadoop was developed. Google’s MapReduce was a processing framework and it used Google’s GFS (Google File System) for data storage.
Aster Data’s SQL-MapReduce. Aster Data has a patent-pending implementation of MapReduce that (a) uses a database for data persistence, (b) is tightly integrated with SQL, i.e. an analyst or BI tool can invoke MapReduce via SQL queries, thus making MapReduce accessible to the enterprise. It supports multiple programming languages such as Java and C and it is accessible through standard interfaces such as ODBC and JDBC.
Hadoop. Hadoop is an Apache “umbrella” project that hosts many sub-projects, including Hadoop MapReduce and HDFS, Hadoop’s version of the Google File System which Hadoop MapReduce uses for data storage. Hadoop is the core open source project – however, there are many distributions for Hadoop, just as there are many distributions for Linux. These distributions contain Hadoop binaries together with other utilities and tools. The most popular distributions are the Cloudera distribution, the Yahoo distribution and the baseline Apache distribution.
HDFS. HDFS is Hadoop’s version of GFS and it is a distributed file system. HDFS can exist without Hadoop MapReduce, but usually Hadoop MapReduce requires HDFS. Aster Data’s MapReduce does not require HDFS as it uses an extensible MPP database for data storage and persistence.
Cloudera. Cloudera usually means either (a) the company or (b) Cloudera’s Distribution for Hadoop.
Sqoop. Sqoop, which is short for “SQL to Hadoop”, is an open source project that provides a framework for connecting to SQL data stores for data exchange.
NoSQL. NoSQL started as a term to describe a collection of products that did not support or rely on SQL. This included Hadoop and other products like Cassandra. However, as more people realized that SQL is a necessary interface for many data management systems, the term evolved to mean (N)ot (o)nly SQL. These days there are attempts to port SQL on top of Hadoop and other NoSQL products.
Are there any MapReduce-related terms I omitted? Please add them in the comments below and include a definition and links to good resources if you’d like.
Barton George is Cloud Computing and Scale-Out Evangelist for Dell.
Today at a press conference in San Francisco we announced the general availability of our Dell cloud solutions. One of the solutions we debuted was the Dell Cloud Solution for Data Analytics, a combination of our PowerEdge C servers with Aster Data’s nCluster, a massively parallel processing database with an integrated analytics engine.
Earlier this week I stopped by Aster Data‘s headquarters in San Carlos, CA and met up with their EVP of marketing, Sharmila Mulligan. I recorded this video where Sharmila discusses the Dell and Aster solution and the fantastic results a customer is seeing with it.
Some of the ground Sharmila covers:
- What customer pain points and problems does this solution address (hint: organizations trying to manage huge amounts of both structured and unstructured data)
- How Aster’s nCluster software is optimized for Dell PowerEdge C2100 and how it provides very high performance analytics as well as a cost effective way to store very large data.
- (2:21) InsightExpress, a leading provider of digital marketing research solutions, has deployed the Dell and Aster analytics solution and has seen great results:
  - Up and running w/in 6 weeks
  - Queries that took 7-9 minutes now run in 3 seconds
Pau for now…
It’s ironic how all of a sudden Vertica is changing its focus from being a column-only database to claiming to be an Analytic Platform.
If you’ve used an Analytic Platform you know it’s more than just bolting a layer of analytic functions on top of a database. But that’s the basis on which Vertica claims it’s now a full-blown analytic platform, when in fact their analytics capability is rather thin. For instance, their first layer is a pair of window functions, CONDITIONAL_TRUE_EVENT and CONDITIONAL_CHANGE_EVENT (CTE and CCE). The CCE window function is used, for example, to do sessionization. Vertica has a blog post that posits sessionization as a major advanced analytic operation. In truth, Vertica’s sessionization is not analytics. It is a basic data preparation step that adds a session attribute to each clickstream event so that very simple session-level analytics can be performed.
What’s interesting is that the CCE window function is simply a pre-built function – some might say just syntactic sugar – that combines the functionality of finite-width SQL window functions (LEAD/LAG) with CASE expressions (WHEN condition THEN result). Nothing groundbreaking, to say the least!
For example, the CTE query referred to in a Vertica blog post can be rewritten very simply using SQL-99:
SELECT
    symbol, bid, timestamp,
    SUM(CASE WHEN bid > 10.6 THEN 1 ELSE 0 END)
        OVER (PARTITION BY symbol ORDER BY timestamp) AS window_id
FROM tickstore;
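In the same spirit, a timeout-based sessionization can be sketched with nothing more than LAG, CASE, and a running SUM. The Python snippet below runs such a query against an in-memory SQLite database (window functions need SQLite 3.25 or later), not against Vertica or nCluster. The `clicks` table, its columns, the `sessionize` helper, and the 30-minute timeout are all assumptions made for illustration.

```python
import sqlite3

# A click starts a new session when the gap to the user's previous click
# exceeds the timeout; the running SUM of those flags is the session id.
SESSIONIZE_SQL = """
SELECT user_id, ts,
       SUM(new_session) OVER (PARTITION BY user_id ORDER BY ts) AS session_id
FROM (
    SELECT user_id, ts,
           CASE WHEN ts - LAG(ts) OVER (PARTITION BY user_id ORDER BY ts) > :timeout
                THEN 1 ELSE 0 END AS new_session
    FROM clicks
)
ORDER BY user_id, ts
"""


def sessionize(clicks, timeout_seconds=1800):
    """Return (user_id, ts, session_id) rows for (user_id, ts) click tuples.

    `ts` is a timestamp in seconds; session ids start at 0 for each user.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE clicks (user_id TEXT, ts INTEGER)")
        conn.executemany("INSERT INTO clicks VALUES (?, ?)", clicks)
        return conn.execute(SESSIONIZE_SQL, {"timeout": timeout_seconds}).fetchall()
    finally:
        conn.close()
```

Changing the session semantic here (a different timeout, or a break on a change of referrer) means editing the CASE condition, which is plain SQL rather than a vendor-specific function.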
The layering of custom pre-built functions has for a long time been the traditional way of adding functions to a database. The SQL-99 and SQL-2003 analytic functions themselves follow this tradition.
The problem with this is not limited to Vertica; it also applies to the giants of the market, Oracle and Microsoft for instance. With their approach, the customer is at the mercy of the database vendor: pre-built analytic functions are hard-coded into each major release of the DBMS. There is no independence between the analytics layer and the DBMS, which real, well-architected analytic platforms need to have. Simply put, if you want a different sessionization semantic, you’ll have to wait for Vertica to build a whole new function.
By Steve Wooledge in Analytics on November 8, 2010
One of the coolest parts of my job is seeing how companies use analytics to drive their business. The term “big data” has become somewhat of a superstar in the world of analytics recently, but it’s also the complexity and richness of the insights from that data which make it a “big data” challenge for companies to tackle with traditional data management infrastructures. It’s not just size that matters – it’s analytical power. That is to say, what you DO with data.
And it’s not just in Silicon Valley or on Wall Street. October marked one year of the “Big Data Summit” road show, which we hosted across the US to offer high-level executives, data analytic practitioners, and analysts the opportunity to share best practices and exchange ideas about solving big data problems within their industries and organizations. It was a huge success, with an average of 80-100 people attending each summit in major cities including New York, Chicago, San Francisco, Dallas, and Washington DC. We are starting the tour again later this month in New York City on November 18 and are rebranding it “Data Analytics Summit,” again because of the feedback that it’s more about the application of data in analytics and applications within a specific business area or industry.
Here are some examples. The attendees at the summits have been providing us with interesting data through surveys. The attendees are from a variety of industries, from traditional retailers to bleeding-edge digital media companies. We asked respondents questions like, “What are the biggest opportunities for benefiting from big data within the market?” Let me know if you think they missed any big opportunities. Here are a few of our findings:
- Data exploration to discover new market opportunities: Nearly 30% of respondents thought that analyzing big data to find “the next big thing” was a huge opportunity. This supports the notion that data scientist will be one of the sexiest jobs in the future.
- Behavioral targeting: 16% of those surveyed called out the importance of establishing links between purchasing behavior and areas like advertising spend to better tailor budgets and promotional campaigns.
- Social Network Analysis: 15% of those surveyed responded that using social network analysis to build a more complete profile of their customer base is a key business opportunity.
- Monetizing Data: 15% of respondents say monetizing data is key for organizations seeking to unlock the hidden value within previously untapped assets.
- Fraud Reduction and Risk Profiling: Distinguishing good customers from bad ones, for fraud reduction (10%) and risk profiling (10%), was identified as critical for financial institutions
Another general observation from attendees is that using sampled or aggregated data is no longer a viable business option for rich analytics and there is an urgent need to analyze all available data including structured and unstructured data.

What other areas do you see? Let us know what you think or if you have any questions on the statistics. I don’t claim to be an industry analyst, but it was fun to look at the breakdown of how various cities responded to the survey.
In the recently announced nCluster 4.6 we continue to innovate and improve nCluster on many fronts to make it the high performance platform of choice for deep, high value analytics. One of the new features is a hybrid data store, which now gives nCluster users the option of storing their data in either a row or column orientation. With the addition of this feature, nCluster is the first data warehouse and analytics platform to combine a tightly integrated hybrid row- and column-based storage with SQL-MapReduce processing capabilities. In this post we’ll discuss the technical details of the new hybrid store as well as the nCluster customer workloads that prompted the design.
Row- and Column-store Hybrid
Let’s start with the basics of row and column stores. In a row store, all of the attribute values for a particular record are stored together in the same on-disk page. Put another way, each page contains one or more entire records. Such a layout is the canonical database design found in most database textbooks, as well as both open source and commercial databases. A column store flips this model around and stores values for only one attribute on each on-disk page. This means that to construct, say, an entire two-attribute record will require data from two different pages in a column store, whereas in a row-store the entire record would be found on only one page. If a query needs only one attribute in that same two-attribute table, then the column store will deliver more needed values per page read. The row store must read pages containing both attributes even though only one attribute is needed, wasting some I/O bandwidth on the unused attribute. Research has shown that for workloads where a small percentage of attributes in a table are required, a column oriented storage model can result in much more efficient I/O because only the required data is read from disk. As more attributes are used, a column store becomes less competitive with a row store because there is an overhead associated with combining the separate attribute values into complete records. In fact, for queries that access many (or all!) attributes of a table, a column store performs worse and is the wrong choice. Having a hybrid store provides the ability to choose the optimal storage for a given query workload.
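The I/O trade-off described above can be approximated with a back-of-the-envelope page count. The Python sketch below is a simplified model, not a description of how nCluster lays out data: it assumes fixed-size values, no compression, and a hypothetical `values_per_page` capacity, and it ignores the CPU cost of stitching column values back into records.

```python
import math


def pages_read_row_store(n_rows, n_attrs, values_per_page):
    """Pages scanned by a row store: each page holds whole records,
    so every attribute is read no matter how many the query needs."""
    rows_per_page = values_per_page // n_attrs
    if rows_per_page == 0:
        raise ValueError("a record does not fit on one page in this model")
    return math.ceil(n_rows / rows_per_page)


def pages_read_column_store(n_rows, attrs_needed, values_per_page):
    """Pages scanned by a column store: each page holds values of one
    attribute, so only the needed attributes are read."""
    return attrs_needed * math.ceil(n_rows / values_per_page)
```

With 1,000,000 rows, 20 attributes, and 1,000 values per page, the row store scans 20,000 pages for any query, while the column store scans 2,000 pages for a two-attribute query and 20,000 pages when all 20 attributes are needed. In that last case the I/O is equal, and the record reconstruction overhead, which this model leaves out, is what tips the balance toward the row store.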
Aster Data customers have a wide range of analytics use cases, from simple reporting to advanced analytics such as fraud detection, data mining, and time series analysis. Reports typically ask relatively simple questions of the data, such as total sales per region or per month. Such queries tend to require only a few attributes and therefore benefit from columnar storage. In contrast, deeper analytics, such as applying a fraud detection model to a large table of customer behaviors, rely on many attributes across many rows of data. In that case, a row store makes a lot more sense.
Clearly there are cases where having both a column and row store benefits an analytics workload, which is why we have added the hybrid data store feature to nCluster 4.6.
Performance Observations
What does the addition of a hybrid store mean for typical nCluster workloads? The performance improvements from reduced I/O can be considerable: a 5x to 15x speedup was typical in some in-house tests on reporting queries. These were generally simple queries with a few joins and aggregations. Performance improvement on more complex analytics workloads, however, was highly variable, so we took a closer look at why. As one would expect (and as a number of columnar publications demonstrate), we found that queries that use all or almost all attributes in a table benefit little from columnar storage or are even slowed down by it. Deep analytical queries in nCluster, like scoring, fraud detection, and time series analysis, tend to use a higher percentage of columns. Therefore, as a class, they did not benefit as much from columnar storage, but when these queries do use a smaller percentage of columns, choosing the columnar option in the hybrid store provided a good speedup.
A further reason that these more complex queries benefit less from a columnar approach is Amdahl’s law: speeding up I/O only helps the portion of query time that is actually spent on I/O. As we push more complex applications into the database via SQL-MapReduce, we see a higher percentage of query time spent running application code rather than reading from or writing to disk. This highlights an important trend in data analytics: the number of user CPU cycles spent per byte is increasing, which is one reason that deployed nCluster nodes tend to have a higher CPU-per-byte ratio than one might expect in a data warehouse. The takeaway is that the hybrid store provides an important performance benefit for simple reporting queries. For analytical workloads that include a mix of ad hoc and simple reporting queries, performance is maximized by choosing the data orientation that is best suited to each workload.
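To make the Amdahl’s law argument concrete, here is a minimal sketch of the standard formula applied to query time. The function name and the example fractions are illustrative assumptions, not measurements from nCluster.

```python
def overall_speedup(io_fraction: float, io_speedup: float) -> float:
    """Amdahl's law applied to a query.

    io_fraction: share of the original query time spent on disk I/O (0..1).
    io_speedup:  factor by which columnar storage speeds up that I/O.
    The remaining (1 - io_fraction) of the time, e.g. application code
    running in SQL-MapReduce, is assumed to be unaffected.
    """
    if not 0.0 <= io_fraction <= 1.0:
        raise ValueError("io_fraction must be between 0 and 1")
    if io_speedup <= 0.0:
        raise ValueError("io_speedup must be positive")
    return 1.0 / ((1.0 - io_fraction) + io_fraction / io_speedup)
```

If we assume a reporting query spends 90% of its time on I/O, a 10x I/O speedup gives roughly a 5.3x overall speedup. If a compute-heavy query spends only 20% of its time on I/O, the same 10x I/O speedup gives only about 1.2x, and no I/O improvement can ever push it past 1.25x.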
Implementation
The hybrid store is made possible by integrating a column store within the nCluster data storage and query-processing engine, which already used row storage. The new column storage is tightly integrated with existing query processing and system services. This means that any query answerable by the existing Aster storage engine can now also be answered in our hybrid store, whether the data is stored in row or column orientation. Moreover, all SQL-MapReduce features, workload management, replication, fail-over, and cluster backup features are available to any data stored in the hybrid store.
Providing flexibility and high performance on a wide range of workloads makes Aster Data the best platform for high-value analytics. To that end, we look forward to continuing development of the nCluster hybrid storage engine to further optimize row and column data access. Coupled with workload management and SQL-MapReduce, the new hybrid nCluster storage highlights Aster Data’s commitment to giving nCluster users the flexibility to make the most of their data.