Archive for July 19th, 2010

19
Jul
By Tasso Argyros in Blogroll on July 19, 2010
   

Every year or so Google comes out with an interesting piece of infrastructure, always backed by claims that it’s being used by thousands of people on thousands of servers and processes petabytes or exabytes of web data. That alone makes Google papers interesting reading. :)

This latest piece of research just came out on Google’s Research Buzz page. It’s about a system called Dremel (note: Dremel is a company building hardware tools which I happened to use a lot when I was building model R/C airplanes as a kid). Dremel is an interesting move by Google which provides a system for interactive analysis of data. It was created because it was thought that native MapReduce has too much latency for for fast interactive querying/analysis. It uses data that sits on different storage systems like GFS or BigTable. Data is modeled in a columnar, semi-structured format and the query language is SQL-like with extensions to handle the non-relational data model. I find this interesting – below is my analysis of what Dremel is and the big conclusion.

Main characteristics of the system:

Data & Storage Model

– Data is stored in a semi-structured format. This is not XML, rather it uses Google’s Protocol Buffers. Protocol Buffers (PB) allow developers to define schemas that are nested.
– Every field is stored in its own file, i.e. every element of the Protocol Buffers schema is columnar-ized.

Columnar modeling is especially important for Dremel for two specific reasons:

– Protocol Buffer data structures can be huge (> 1000 fields).
– Dremel does not offer any data modeling tools to help break these data structures down. E.g. there’s nothing in the paper that explains how you can take a Protocol Buffers data structure and break it down to 5 different tables.
– Data is stored in a way that makes it possible to recreate the orignial flat? schema from the columnar representation. This however requires a full pass over the data – the paper doesn’t explain how point or indexed queries would be executed.
– There’s almost no information about how data gets in the right format, how is it stored, deleted, replicated, etc. My best guess is that when someone defines a Dremel table, data is copied from the underlying storage to the local storage of Dremel nodes (leaf nodes?) and at the same time is replicated across the leaf nodes. Since data in Dremel cannot be updated (it seems to be a write-once-read-many model), design & implementation of the replication subsystem should be significantly simplified.

Interface

Query interface is SQL-like but with extensions to handle the semi-structured, nested nature of data. Input of queries is semi-structured, and output is semi-structured as well. One needs to get used to this since it’s significantly different from the relational model.
– Tables can be defined from files, e.g. stored in GFS by means of a DEFINE TABLE? command.
The data model and query language makes Dremel appropriate for developers; for Dremel to be used by analysts or database folks, a different/simpler data model and a good number of tools (for loading, changing the data model etc) would be needed.

Query Execution

Queries do NOT use MapReduce, unlike Hadoop query tools like Pig & Hive.
– Dremel provides optimizations for sequential data access, such as async I/O & prefetching.
– Dremel supports approximate results (e.g. return partial results after reading X% of data – this speeds up processing in systems with 100s of servers or more since you don’t have to wait for laggards).
– Dremel can use replicas to speed up execution if a server becomes too slow. This is similar to the “backup copies”? idea from the original Google MapReduce paper.
There seems to be a tree-like model of executing queries, meaning that there are intermediate layers of servers between the leaf nodes and the top node (which receives the user query). This is useful for very large deployments (e.g. thousands of servers) since it provides some intermediate aggregation points that reduce the amount of data that needs to flow to any single node.

Performance & Scale

Compared to Google’s native MapReduce implementation, Dremel is two orders of magnitude faster in terms of query latency. As mentioned above, part of the reason is that the Protocol Buffers are usually very large and Dremel doesn’t have a way to break those down except for its columnar modeling. Another reason is the high startup cost of Google’s MapReduce implementation.
– Following Google’s tradition, Dremel was shown to scale reasonably well to thousands of servers although this was demonstrated only over a single query that parallelizes nicely and from what I understand doesn’t reshuffle much data. To really understand scalability, it’d be interesting to see benchmarks with a more complex workload collection.
– The paper mentions little to nothing about how data is partitioned across the cluster. Scalability of the system will probably be sensitive to partitioning strategies, so that seems like a significant omission IMO.

So the big question: Can MapReduce itself handle fast, interactive querying?

– There’s a difference between the MapReduce paradigm, as an interface for writing parallel applications, and a MapReduce implementation (two examples are Google’s own MapReduce implementation, which is mentioned in the Dremel paper, and open-source Hadoop). MapReduce implementations have unique performance characteristics.
– It is well known that Google’s MapReduce implementation & Hadoop’s MapReduce implementation are optimized for batch processing and not fast, interactive analysis. Besides the Dremel paper, look at this Berkeley paper for some Hadoop numbers and an effort to improve the situation.
Native MapReduce execution is not fundamentally slow; however Google’s MapReduce and Hadoop happen to be oriented more towards batch processing. Dremel tries to overcome that by building a completely different system that speeds interactive querying. Interestingly, Aster Data’s SQL-MapReduce came about to address this in the first place and offers very fast interactive queries even though it uses MapReduce. So the idea that one needs to get rid of MapReduce to achieve fast interactivity is something I disagree with – we’ve shown this is not the case with SQL-MapReduce.