Every year or so Google comes out with an interesting piece of infrastructure, always backed by claims that it’s being used by thousands of people on thousands of servers and processes petabytes or exabytes of web data. That alone makes Google papers interesting reading.
This latest piece of research just came out on Google’s Research Buzz page. It’s about a system called Dremel (note: Dremel is a company building hardware tools which I happened to use a lot when I was building model R/C airplanes as a kid). Dremel is an interesting move by Google which provides a system for interactive analysis of data. It was created because it was thought that native MapReduce has too much latency for for fast interactive querying/analysis. It uses data that sits on different storage systems like GFS or BigTable. Data is modeled in a columnar, semi-structured format and the query language is SQL-like with extensions to handle the non-relational data model. I find this interesting - below is my analysis of what Dremel is and the big conclusion.
Main characteristics of the system:
Data & Storage Model
• Data is stored in a semi-structured format. This is not XML, rather it uses Google’s Protocol Buffers. Protocol Buffers (PB) allow developers to define schemas that are nested.
• Every field is stored in its own file, i.e. every element of the Protocol Buffers schema is columnar-ized. Columnar modeling is especially important for Dremel for two specific reasons:
- Protocol Buffer data structures can be huge (> 1000 fields).
- Dremel does not offer any data modeling tools to help break these data structures down. E.g. there’s nothing in the paper that explains how you can take a Protocol Buffers data structure and break it down to 5 different tables.
• Data is stored in a way that makes it possible to recreate the orignial “flat” schema from the columnar representation. This however requires a full pass over the data - the paper doesn’t explain how point or indexed queries would be executed.
• There’s almost no information about how data gets in the right format, how is it stored, deleted, replicated, etc. My best guess is that when someone defines a Dremel table, data is copied from the underlying storage to the local storage of Dremel nodes (“leaf nodes”) and at the same time is replicated across the leaf nodes. Since data in Dremel cannot be updated (it seems to be a write-once-read-many model), design & implementation of the replication subsystem should be significantly simplified.
Interface
• Query interface is SQL-like but with extensions to handle the semi-structured, nested nature of data. Input of queries is semi-structured, and output is semi-structured as well. One needs to get used to this since it’s significantly different from the relational model.
• Tables can be defined from files, e.g. stored in GFS by means of a “DEFINE TABLE” command.
• The data model and query language makes Dremel appropriate for developers; for Dremel to be used by analysts or database folks, a different/simpler data model and a good number of tools (for loading, changing the data model etc) would be needed.
Query Execution
• Queries do NOT use MapReduce, unlike Hadoop query tools like Pig & Hive.
• Dremel provides optimizations for sequential data access, such as async I/O & prefetching.
• Dremel supports approximate results (e.g. return partial results after reading X% of data - this speeds up processing in systems with 100s of servers or more since you don’t have to wait for laggards).
• Dremel can use replicas to speed up execution if a server becomes too slow. This is similar to the “backup copies” idea from the original Google MapReduce paper.
• There seems to be a tree-like model of executing queries, meaning that there are intermediate layers of servers between the leaf nodes and the top node (which receives the user query). This is useful for very large deployments (e.g. thousands of servers) since it provides some intermediate aggregation points that reduce the amount of data that needs to flow to any single node.
Performance & Scale
• Compared to Google’s native MapReduce implementation, Dremel is two orders of magnitude faster in terms of query latency. As mentioned above, part of the reason is that the Protocol Buffers are usually very large and Dremel doesn’t have a way to break those down except for its columnar modeling. Another reason is the high startup cost of Google’s MapReduce implementation.
• Following Google’s tradition, Dremel was shown to scale reasonably well to thousands of servers although this was demonstrated only over a single query that parallelizes nicely and from what I understand doesn’t reshuffle much data. To really understand scalability, it’d be interesting to see benchmarks with a more complex workload collection.
• The paper mentions little to nothing about how data is partitioned across the cluster. Scalability of the system will probably be sensitive to partitioning strategies, so that seems like a significant omission IMO.
So the big question – can MapReduce itself handle fast, interactive querying?
• There’s a difference between the MapReduce paradigm, as an interface for writing parallel applications, and a MapReduce implementation (two examples are Google’s own MapReduce implementation, which is mentioned in the Dremel paper, and open-source Hadoop). MapReduce implementations have unique performance characteristics.
• It is well known that Google’s MapReduce implementation & Hadoop’s MapReduce implementation are optimized for batch processing and not fast, interactive analysis. Besides the Dremel paper, look at this Berkeley paper for some Hadoop numbers and an effort to improve the situation.
• Native MapReduce execution is not fundamentally slow; however Google’s MapReduce and Hadoop happen to be oriented more towards batch processing. Dremel tries to overcome that by building a completely different system that speeds interactive querying. Interestingly, Aster Data’s SQL-MapReduce came about to address this in the first place and offers very fast interactive queries even though it uses MapReduce. So the idea that one needs to get rid of MapReduce to achieve fast interactivity is something I disagree with - we’ve shown this is not the case with SQL-MapReduce.
There is a lot of talk these days about relational vs. non-relational data. But what about analytics? Does it make sense to talk about relational and non-relational analytics?
I think it does. Historically, a lot of data analysis in the enterprise has been done with pure SQL. SQL-based analysis is a type of “relational analysis,” which I define as analysis done via a set-based declarative language like SQL. Note how SQL treats every table as a set of values; SQL statements are relational set operations; and any intermediate SQL results, even within the same query, need to follow the relational model. All these are characteristics of a relational analysis language. Although recent SQL standards define the language to be Turing Complete, meaning you can implement any algorithm in SQL, in practice implementing any computation that departs from the simple model of sets, joins, groupings, and orderings is severely sub-optimal, in terms of performance or complexity.
On the other hand, an interface like MapReduce is clearly non-relational in terms of its algorithmic and computational capabilities. You have the full flexibility of a procedural programming language, like C or Java; MapReduce intermediate results can follow any form; and the logic of a MapReduce analytical application can implement almost arbitrary formations of code flow and data structures. In addition, any MapReduce computation can be automatically extended to a shared-nothing parallel system which implies ability to crunch big amounts of data. So MapReduce is one version of “non-relational” analysis.
So Aster Data’s SQL-MapReduce becomes really interesting if you see it as a way of doing non-relational analytics on top of relational data. In Aster Data’s platform, you can store your data in a purely relational form. By doing that, you can use popular RDBMS mechanisms to achieve things like adherence to a data model, security, compliance, integration with ETL or BI tools etc. The similarities, however, stop there. Because you can then use SQL-MapReduce to do analytics that were never possible before in a relational RDBMS, because they are MapReduce-based and non-relational and they extend to TBs or PBs. And that includes a large number of analytical applications like fraud detection, network analysis, graph algorithms, data mining, etc.
Recently, a journalist called to ask about in-memory data processing, a very interesting subject. I always thought that in-memory processing will be more and more important as memory prices keep falling drastically. In fact, these days you can get 128GB of memory into a single system for less than $5K plus the server cost, not to mention that DDR3 and multiple memory controllers are giving a huge performance boost. And if you run software that can handle shared-nothing parallelism (MPP), your memory cost increases linearly, and systems with TBs of memory are possible.
So what do you do with all that memory? There are two classes of use cases that are emerging today. First is the case where you need to increase concurrent access to data with reduced latency. Tools like memcached offer in-memory caching that, used properly, can vastly improve latency and concurrency for large-scale OLTP applications like websites. Also the nice thing with object caching is that it scales well in a distributed way and people have build TB-level caches. Memory-only OLTP databases have started to emerge, such as VoltDB. And memory is used implicitly as a very important caching layer in open-source key-value products like Voldemort. We should only expect memory to play a more and more important role here.
The second way to use memory is to gain “processing flexibility” when doing analytics. The idea is to throw your data into memory (however much it fits, of course) without spending much time thinking how to do that or what queries you’ll need to run. Because memory is so fast, most simple queries will be executed at interactive times and also concurrency is handled well. European upstart QlikView exploits this fact to offer a memory-only BI solution which provides simple and fast BI reporting. The downside is its applicability to only 10s of GBs of data as Curt Monash notes.
By exploiting an MPP shared-nothing architecture, Aster Data has production clusters with TBs of total memory. Our software takes advantage of memory in two ways: first, it uses caching aggressively to ensure the most relevant data stays in memory; and when data is in memory, processing is much faster and more flexible. Secondly, MapReduce is a great way to utilize memory as it provides full flexibility to the programmer to use memory-focused data structures for data processing. In addition, Aster Data’s SQL-MapReduce provides tools to the user to encourage the development of memory-only MapReduce applications.
However, one shouldn’t fall into the trap of thinking that all analytics will be in-memory anytime soon. While memory is down to $30/GB, disk manufacturers have been busy increasing platter density and dropping their price to less than $0.06/GB. Given that the amount of data in the world grows faster than Moore’s law and memory, there will always be more data to be stored and analyzed than what fits into any amount of memory that an enterprise can use. In fact, most big data applications will have data sets that do not fit into memory because, while tools like memcached worry only about the present (e.g. current Facebook users), analytics need to worry about the past, as well – and that means much more data. So a multi-layer architecture will be the only cost-effective way of analyzing large amounts of data for some time.
One shouldn’t be discussing memory without mentioning solid-state disk products (like Aster Data partner company Fusion-io). SSDs are likely to make the surprise here given that their per-GB price is falling faster than disks (being a solid-state product that follows Moore’s law does help). In the next few years we’ll witness SSDs in read-intensive applications providing similar advantages to memory while accommodating much larger data sizes.
Rumors abound that Intel is “baking” the successor of the very successful Nehalem CPU architecture, codenamed Westmere. It comes with an impressive spec: 10 CPU cores (supporting 20 concurrent threads) packed in a single chip. You can soon expect to see 40 cores in middle range 4-socket servers – a number hard to imagine just five years ago.
We’re definitely talking about a different era. In the old days, you could barely fit a single core in a chip. (I still remember 15 years ago when I had to buy and install a separate math co-processor on my Mac LC to run Microsoft Excel and Mathematica.) And with the hardware, software has to change, too. In fact, modern software means software that can handle parallelism. This is what makes MapReduce such an essential and timely tool for big data applications. MapReduce’s purpose in life is to simplify data and processing parallelism for big data applications. It gives ample freedom to the programmer on how to do things locally; and takes over when data needs to be communicated across processes/cores/servers, thus evaporating a lot of the parallelism complexity.
Once someone designs their software and data to operate in a parallelized environment using MapReduce, gains will come on multiple levels. Not only will MapReduce help your analytical applications scale across a cluster of servers with terabytes of data, it will also exploit the billions of transistors and the 10s of CPU cores inside each server. The best part: the programmer doesn’t need to think about the difference.
As an example, consider this great paper out of Stanford discusses MapReduce implementations of popular Machine Learning algorithms. The Stanford researchers considered MapReduce as a way of “porting” these algorithms (traditionally implemented to run in a single CPU) to a multi-core architecture. But, of course, the same MapReduce implementations can be used to scale these algorithms across a distributed cluster as well.
Hardware has changed – MPP, shared-nothing, commodity servers, and, of course, multi-core. In this new world MapReduce is Software’s response for big data processing. Intel and Westmere have just found an unexpected friend.
This Monday we announced a new web destination for MapReduce, MapReduce.org. At a high level, this site is the first consolidated source of information & education around MapReduce, the groundbreaking programming model which is rapidly revolutionizing the way people deal with big data. Our vision is to make this site the one-stop-shop for anyone looking to learn how MapReduce can help analyze large amounts of data.
There were a couple reasons why we thought the world of big data analytics needed a resource like this. First, MapReduce is a relatively new technology and we are constantly getting questions from people in the industry wanting to learn more about it, from basic facts to using MapReduce for complex data analytics at Petabyte scale. By placing our knowledge and references in one public destination, we hope to build a valuable self-serve resource to educate many more people than what we could ever reach directly. In addition, we were motivated by the fact that most MapReduce resources out there focus more on specific implementations of MapReduce, which fragments the available knowledge and reduces its value. In this new effort we hope to create a multi-vendor & multi-tool resource which will benefit anyone interested in MapReduce.
We’re already working with analysts such as Curt Monash, Merv Adrian, Colin White and James Kobielus to syndicate their MapReduce-related posts. Going forward, we expect even more analysts, bloggers, practitioners, vendors, and academics to contribute. If traffic grows like we expect, we may eventually add a community forum to aid in interaction and sharing of knowledge and best practices.
I hope you enjoy surfing this new site! Free to email me for any suggestions as we work to make MapReduce.org more useful for you.
Today Aster took a significant step and made it easier for developers building fraud detection, financial risk management, telco network optimization, customer targeting and personalization, and other advanced, interactive analytic applications.
Along with the release of Aster Data nCluster 4.5, we added a new Solution Partner level for systems integrators and developers.
Why is this relevant?
Recession or no-recession, IT executives are constantly challenged. They are asked to execute strategies based on better analytics and information to improve effectiveness of business processes (customer loyalty, inventory management, revenue optimization, ..), while staying on top of technology-based disruptions and managing (shrinking or flat) IT budgets.
IT organizations have taken on the challenge by building analytics-based offeringsleveraging existing data management skills and increasingly taking advantage of MapReduce, a disruptive technology introduced by Google and now being rapidly adopted by mainstream enterprise IT shops in Finance, Telco, LifeSciences, Govt. and other verticals.
As MapReduce and big data analytics goes mainstream, our customers and ecosystem partners have asked us to make it easier for their teams to leverage MapReduce across enterprise application lifecycles, while harvesting existing IT skills in SQL, Java and other programming languages. The Aster development team that brought us the SQL/MapReduce innovation, has now delivered the market’s first integrated visual development environment for developing, deploying and managing MapReduce and SQL-based analytic applications.
Enterprise MapReduce developers and system integrators can now leverage the integrated Aster platform and deliver compelling business results in record time (read how ComScore delivers 360 degree view of digital world to enterprise customers, Full Tilt Poker gains the upper hand tackling online fraud using Aster).
We are also teaming up with leaders in our ecosystem like MicroStrategy to deliver an end-to-end analytics solution to our customers that includes SQL/MapReduce enabled reporting and rich visualization. Aster is proud to be driving innovation in the Analytics and BI market and was recently honored at MicroStrategy’s annual customer conference.
I am delighted with the rapid adoption of Aster Data’s platform by our partners and the strong continued interest from enterprise developers and system integrators in building big data applications using Aster. New partners are endorsing our vision and technical innovation as the future of advanced analytics for large data volumes.
Sign up today to be an Aster solution partner and join the revolution to deliver compelling information and analytics-driven solutions.
We just wrapped up our first of a two-part series on Mastering MapReduce together with Curt Monash. We’ve spent a lot of time discussing MapReduce with Curt and wanted to help educate the community on exactly what it is and how it applies to data management and analysis. We’ve published the recorded webcast and below are the slides we presented from an Aster Data perspective which outline:
- What is Aster Data’s SQL-MapReduce?
- Example industry applications of SQL-MapReduce
- Walking through the SQL-MapReduce syntax
Curt has also posted his slides on DBMS2 with a great overview on dispelling the myths around MapReduce, and how MapReduce and SQL play nicely with each other.
We had great turn-out and questions from the sessions. If you have any questions after reviewing the material, please drop a comment.
On Friday we had series of events and announcements around our new Aster-Hadoop Data Connector which utilizes key new SQL-MapReduce functions to provide ultra-fast, two-way data loading between HDFS (Hadoop Distributed File System) and Aster Data’s MPP data warehouse.
In addition to the Big Data Summit we held in New York City (which we’ll detail in a separate post), Colin White presented on a Webcast with Aster on the various use-cases for Hadoop within data warehouse environments. Colin does a great job summarizing what Hadoop is, how it’s different from an RDBMS, the different types of users for each, and how they co-exist nicely in customer environments.
Below are the slides to view if you weren’t able to attend the event, which will also be available for on-demand viewing soon in our resource library.
This post was co-authored by John Cieslewicz, Eric Friedman, and Peter Pawlowski of Aster Data Systems
One year ago we introduced SQL/MapReduce for the Aster nCluster database, which integrates MapReduce and SQL to enable deep analytics within the database. Pushing computation inside the database and close to the data is increasingly important as data sizes grow exponentially. As SQL/MR turns one year old, we are happy to announce that we will be presenting our SQL/MR innovations next week at the 35th International Conference on Very Large Data Bases (VLDB), the premier international forum for database research.
We developed SQL/MR because we saw a growing gap between the deep analytics and application needs of very large data and the capabilities provided by SQL and traditional relational-only data processing. We call this gap, the “SQL Gap.”
SQL and the relational query processing model are well suited for many, but not all data processing tasks. Some queries are cumbersome, non-intuitive or impossible to express in SQL (note: now that SQL is turing complete, nothing is strictly impossible, but it can be very painful and perform very badly) - check out our paper for some examples. Moreover, query optimizers have a limited number of algorithms at their disposal to process data, which leads to convoluted data processing in situations where applying a little domain knowledge can yield a much more straightforward algorithm.
We found traditional user-defined functions (UDFs) to fall short in bridging this gap between SQL and the answers to challenging analytic problems that need to be solved. UDFs are often user-unfriendly, inflexible, and not easily parallelized. SQL/MR functions, in contrast, are designed to be easy to develop, easy to install, and easy to use - providing developers and analysts with a powerful tool to tackle the challenges posed by very large data.
To do this, we integrated the MapReduce programming model with SQL. MapReduce is a well known paradigm for parallel, fault-tolerant data processing that allows developers to write procedural code that is then applied to data in parallel. Pure MapReduce, however, misses out on aspects of SQL and relational data processing that are great - such as query optimizations, managed data, and transactions. By integrating SQL and MapReduce on top of Aster nCluster’s hardware management and fault tolerance, we leverage the strengths of each, resulting in a system that is much more powerful than either in isolation.
SQL/MapReduce At VLDB
SQL/MR is much more than a user-defined function. As our paper title states, a SQL/MR function is self-describing, polymorphic, and parallelizable. Let’s explore each of these characteristics and see why we hope researchers at VLDB will be as excited as we are by SQL/MR.
Self-Describing and Polymorphic
The behavior and output characteristics of a SQL/MR function are determined dynamically at query-time instead of statically at install-time as is the case with traditional user-defined functions. These characteristics allow SQL/MR functions to behave much more like general purpose library functions than single-use, specific application user-defined functions. When a SQL/MR function is used in a query, the nCluster query planner negotiates a contract with the SQL/MR function, providing the function’s input schema and optional user-supplied parameters. In return, the SQL/MR function agrees to a contract that specifies its output schema for the duration of the query. This contract negotiation is what makes SQL/MR functions self-describing, and their ability to be invoked on different input with different optional parameters makes them polymorphic as well. To summarize, by invoking a SQL/MR function on different input, with different optional parameters, the SQL/MR function may export a different output schema and perform different computation - the possibilities are entirely up to the developer!
Parallelizable
The SQL/MR programming model, like that of MapReduce, is inherently parallelizable. Developers write procedural code in the language of their choice, but at runtime that code will run in parallel across hundreds of nodes within an nCluster database. Advanced analytics and application code can now be executed in parallel, directly on data stored within nCluster making nCluster an application-friendly, high performance data warehouse and data application system. Advanced analytics capabilities that we have already pushed inside nCluster using SQL/MR include click-stream sessionization, general purpose time series path matching, dynamic and massively parallel data loading from heterogeneous sources, and genetic sequence analysis.
Bridging the Gap
SQL/MapReduce has matured greatly over the past year and is used by our customers in creative ways we never imagined - proof positive that SQL/MapReduce is an effective way to bridge the “SQL Gap” to deeper analytics on very large data. Check out the other application examples and sample code videos on www.asterdata.com, as well as the MapReduce category of this blog
We’d love to hear from you about any type of analysis you’re doing where stand-alone SQL is becoming overly-complex … and please look us up at the VLDB 2009 show if you’re making the trip to Lyon, France!
(Apart from the authors, Brent Chun, Mohit Aron, Abhishek Marwah, Raghu Venkat, Vinay Bondhugula, and Prasan Roy of Aster Data Systems are notable contributors to the overall SQL/MR effort.)
We are announcing the availability of an Enterprise-Ready MapReduce Data Warehouse Appliance.
The appliance is powered by Dell hardware and Aster’s nCluster SQL/ MR database, with optional software for BI platform from Microstrategy and data modeling software from Aqua Data Studio.
Our product portfolio now allows our customers to get the benefits of our flagship Aster nCluster SQL/MR database in the packaging that they are most comfortable with - on-premise software, in-cloud service, or pre-packaged appliance.
The appliance offering packs a lot of punch compared to other data warehousing appliances in the market - it has the highest ratio of compute & memory to data sizes, allowing you to run rich queries on the appliance without breaking a sweat.
We are especially proud of the open nature of our appliance - the hardware is from Dell built from industry-standard components, the BI server is from Microstrategy, and the data modeling tool is from AquaFold (Aqua Data Studio). The appliance brings together industry-leading components of a full data warehouse stack together - all pre-tested and configured for optimal performance.
Even the programming of our appliance is open - our SQL/MR framework allows applications to push computation into the appliance using industry standard SQL augmented with MapReduce in the language of your choice (Java, C#, Perl, Python, etc.).
We have been approached by a number of customers seeking a get-started-quickly system, especially those groups of users and departments seeking a Hadoop framework to build their solutions upon.
In response to the requests, we are proud to announce an Express Edition of the appliance that is designed to work for upto 1TB of user data. And it comes in an even more attractive price - that of $50K only - complete with hardware and software!
Give us a call - we’ll get your warehouse setup on our appliance to ensure that the time-to-first-query is measured in hours, not months!