Archive for 2010

By Tasso Argyros in MapReduce on December 8, 2010

In the past couple of years, MapReduce – once an unknown, funky word – has become a prominent, mainstream trend in data management and analytics. However, even today I meet people who are not clear on what exactly MapReduce is and how it relates to other terms and trends. In this post I attempt to clarify some of the MapReduce-related terminology. So here goes.

MapReduce (the framework). MapReduce is a framework that lets programmers develop analytical applications that run on (usually large) clusters of commodity hardware and process (usually large) amounts of data. It was first introduced by Google and is language independent. It is abstract in the sense that an application using MapReduce doesn’t have to care about details like the number of servers and processes, fault tolerance, etc. MapReduce has multiple implementations, including the open source Hadoop project and Aster Data’s SQL-MapReduce. Google also has its own proprietary implementation which, unfortunately, is also called MapReduce and sometimes creates confusion.

MapReduce (the Google implementation of the MapReduce framework). As mentioned above, Google has its own implementation of MapReduce. It was described in the 2004 OSDI paper and was the basis upon which Hadoop was developed. Google’s MapReduce is a processing framework that uses Google’s GFS (Google File System) for data storage.

Aster Data’s SQL-MapReduce. Aster Data has a patent-pending implementation of MapReduce that (a) uses a database for data persistence and (b) is tightly integrated with SQL, i.e. an analyst or BI tool can invoke MapReduce via SQL queries, making MapReduce accessible to the enterprise. It supports multiple programming languages, such as Java and C, and is accessible through standard interfaces such as ODBC and JDBC.
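
For a flavor of what this looks like, here is a rough sketch of a SQL query invoking a MapReduce function from the FROM clause; the function name (sessionize), its argument clauses, and the clickstream table are illustrative placeholders rather than exact SQL-MapReduce syntax:

SELECT user_id, session_id, COUNT(*) AS clicks_in_session
FROM sessionize(
  ON clickstream
  PARTITION BY user_id
  ORDER BY click_time
  TIMEOUT(1800)
)
GROUP BY user_id, session_id;

The point is that the MapReduce function appears in the FROM clause like any other table, so it composes with ordinary SQL and with any ODBC/JDBC-based BI tool.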

Hadoop. Hadoop is an Apache “umbrella” project that hosts many sub-projects, including Hadoop MapReduce and HDFS, Hadoop’s version of the Google File System, which Hadoop MapReduce uses for data storage. Hadoop is the core open source project; however, there are many distributions for Hadoop, just as there are many distributions for Linux. These distributions bundle the Hadoop binaries together with other utilities and tools. The most popular are the Cloudera distribution, the Yahoo distribution and the baseline Apache distribution.

HDFS. HDFS is Hadoop’s version of GFS: a distributed file system. HDFS can exist without Hadoop MapReduce, but Hadoop MapReduce usually requires HDFS. Aster Data’s MapReduce does not require HDFS, as it uses an extensible MPP database for data storage and persistence.

Cloudera. Cloudera usually refers to either (a) the company or (b) Cloudera’s Distribution for Hadoop (CDH).

Sqoop. Sqoop, short for “SQL to Hadoop,” is an open source project that provides a framework for connecting Hadoop to SQL data stores for data exchange.

NoSQL. NoSQL started as a term describing a collection of products that did not support or rely on SQL; this included Hadoop and other products like Cassandra. However, as more people realized that SQL is a necessary interface for many data management systems, the term evolved to mean (N)ot (o)nly SQL. These days there are attempts to port SQL on top of Hadoop and other NoSQL products.
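
Hive is one example of this trend: it exposes a SQL-like language (HiveQL) that compiles into Hadoop MapReduce jobs. A minimal sketch, with made-up table and column names:

CREATE TABLE page_views (user_id STRING, url STRING, view_time STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

SELECT url, COUNT(*) AS views
FROM page_views
GROUP BY url
ORDER BY views DESC
LIMIT 10;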

Are there any MapReduce-related terms I omitted? Please add them in the comments below and include a definition and links to good resources if you’d like.



By Barton George in Analytics, Cloud Computing on November 19, 2010

Barton George is Cloud Computing and Scale-Out Evangelist for Dell.

Today at a press conference in San Francisco we announced the general availability of our Dell cloud solutions. One of the solutions we debuted was the Dell Cloud Solution for Data Analytics, a combination of our PowerEdge C servers with Aster Data’s nCluster, a massively parallel processing database with an integrated analytics engine.

Earlier this week I stopped by Aster Data’s headquarters in San Carlos, CA and met up with their EVP of marketing, Sharmila Mulligan. I recorded this video where Sharmila discusses the Dell and Aster solution and the fantastic results a customer is seeing with it.

Some of the ground Sharmila covers:

  • What customer pain points and problems does this solution address? (Hint: organizations trying to manage huge amounts of both structured and unstructured data.)
  • How Aster’s nCluster software is optimized for the Dell PowerEdge C2100, providing very high-performance analytics as well as a cost-effective way to store very large data volumes.
  • (2:21) InsightExpress, a leading provider of digital marketing research solutions, has deployed the Dell and Aster analytics solution and has seen great results:
    • Up and running within 6 weeks
    • Queries that took 7-9 minutes now run in 3 seconds

Pau for now…

Extra-credit reading



By Mayank Bawa in Analytics, MapReduce on November 9, 2010

It’s ironic how all of a sudden Vertica is changing its focus from being a column-only database to claiming to be an Analytic Platform.

If you’ve used an Analytic Platform, you know it’s more than just bolting a layer of analytic functions on top of a database. Yet that is how Vertica now claims to be a full-blown analytic platform, when in fact its analytics capability is rather thin. For instance, its first layer is a pair of window functions (CTE and CCE). The CCE window function is used, for example, to do sessionization. Vertica has a blog post that posits sessionization as a major advanced analytic operation. In truth, Vertica’s sessionization is not analytics. It is a basic data preparation step that adds a session attribute to each clickstream event so that very simple session-level analytics can be performed.

What’s interesting is that the CCE window function is simply a pre-built function – some might say just syntactic sugar – that combines the functionality of finite-width SQL window functions (LEAD/LAG) with CASE expressions (WHEN condition THEN result). Nothing groundbreaking, to say the least!

For example, the CTE query referred to in a Vertica blog post can be rewritten very simply using SQL-99:

SELECT
  symbol, bid, timestamp,
  SUM(CASE WHEN bid > 10.6 THEN 1 ELSE 0 END)
    OVER (PARTITION BY symbol ORDER BY timestamp) AS window_id
FROM tickstore;

The layering of custom pre-built functions has for a long time been the traditional way of adding functions to a database. The SQL-99 and SQL-2003 analytic functions themselves follow this tradition.

The problem with this is not limited to Vertica; it applies to the giants of the market too, Oracle and Microsoft for instance. Their approach leaves the customer at the mercy of the database vendor: pre-built analytic functions are hard-coded into each major release of the DBMS. There is no independence between the analytics layer and the DBMS – which real, well-architected analytic platforms need to have. Simply put, if you want a different sessionization semantic, you’ll have to wait for Vertica to build a whole new function.
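
To make the point concrete, here is a minimal sketch of timeout-based sessionization written only with SQL-99/2003 window functions. It assumes a clicks table with user_id, ts (a timestamp) and url columns and a 30-minute inactivity timeout; all names are illustrative:

SELECT user_id, ts, url,
  SUM(new_session) OVER (PARTITION BY user_id ORDER BY ts) AS session_id
FROM (
  SELECT user_id, ts, url,
    CASE WHEN ts - LAG(ts) OVER (PARTITION BY user_id ORDER BY ts)
              > INTERVAL '30' MINUTE
         THEN 1 ELSE 0 END AS new_session
  FROM clicks
) s;

Changing the sessionization semantic – a different timeout, or starting a new session on a particular event – is just a change to the CASE condition; no new vendor-supplied function is required.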



By Steve Wooledge in Analytics on November 8, 2010

One of the coolest parts of my job is seeing how companies use analytics to drive their business. The term “big data” has become something of a superstar in the world of analytics recently, but it’s not only the volume – it’s the complexity and richness of the data, and of the insights drawn from it, that make “big data” a challenge for companies to tackle with traditional data management infrastructures. It’s not just size that matters – it’s analytical power; that is to say, what you DO with the data.

And it’s not just in Silicon Valley or on Wall Street. October marked one year of the “Big Data Summit” road show, which we hosted across the US to give high-level executives, data analytics practitioners, and analysts the opportunity to share best practices and exchange ideas about solving big data problems within their industries and organizations. It was a huge success, with an average of 80-100 people attending each summit in major cities including New York, Chicago, San Francisco, Dallas, and Washington DC. We are starting the tour again later this month, in New York City on November 18, and are rebranding it the “Data Analytics Summit,” based on feedback that it’s more about the application of data and analytics within a specific business area or industry.

Here are some examples. Summit attendees have been providing us with interesting data through surveys; they come from a variety of industries, from traditional retailers to bleeding-edge digital media companies. We asked respondents questions like, “What are the biggest opportunities for benefiting from big data within the market?” Let me know if you think they missed any big opportunities. Here are a few of our findings:

- Data exploration to discover new market opportunities: Nearly 30% of respondents thought that analyzing big data to find “the next big thing” was a huge opportunity. This supports the notion that data scientist will be one of the sexiest jobs of the future.

- Behavioral targeting: 16% of those surveyed called out the importance of establishing links between purchasing behavior and areas like advertising spend to better tailor budgets and promotional campaigns.

- Social Network Analysis: 15% of those surveyed responded that using social network analysis to build a more complete profile of their customer base is a key business opportunity.

- Monetizing Data: 15% of respondents say monetizing data is key for organizations seeking to unlock the hidden value within previously untapped assets.

- Fraud Reduction and Risk Profiling: Distinguishing good customers from bad ones, for fraud reduction (10%) and risk profiling (10%), was identified as critical for financial institutions.

Another general observation from attendees is that using sampled or aggregated data is no longer a viable option for rich analytics, and there is an urgent need to analyze all available data, both structured and unstructured.

What other areas do you see? Let us know what you think or if you have any questions on the statistics. I don’t claim to be an industry analyst, but it was fun to look at the breakdown of how various cities responded to the survey.



By Tasso Argyros in Analytics, Blogroll, Data-Analytics Server on September 15, 2010

In the recently announced nCluster 4.6 we continue to innovate and improve nCluster on many fronts to make it the high-performance platform of choice for deep, high-value analytics. One of the new features is a hybrid data store, which gives nCluster users the option of storing their data in either row or column orientation. With the addition of this feature, nCluster is the first data warehouse and analytics platform to combine tightly integrated hybrid row- and column-based storage with SQL-MapReduce processing capabilities. In this post we’ll discuss the technical details of the new hybrid store as well as the nCluster customer workloads that prompted its design.

Row- and Column-store Hybrid

Let’s start with the basics of row and column stores. In a row store, all of the attribute values for a particular record are stored together in the same on-disk page. Put another way, each page contains one or more entire records. Such a layout is the canonical database design found in most database textbooks, as well as both open source and commercial databases. A column store flips this model around and stores values for only one attribute on each on-disk page. This means that to construct, say, an entire two-attribute record will require data from two different pages in a column store, whereas in a row-store the entire record would be found on only one page. If a query needs only one attribute in that same two-attribute table, then the column store will deliver more needed values per page read. The row store must read pages containing both attributes even though only one attribute is needed, wasting some I/O bandwidth on the unused attribute. Research has shown that for workloads where a small percentage of attributes in a table are required, a column oriented storage model can result in much more efficient I/O because only the required data is read from disk. As more attributes are used, a column store becomes less competitive with a row store because there is an overhead associated with combining the separate attribute values into complete records. In fact, for queries that access many (or all!) attributes of a table, a column store performs worse and is the wrong choice. Having a hybrid store provides the ability to choose the optimal storage for a given query workload.

Aster Data customers have a wide range of analytics use cases from simple reporting to advanced analytics such as fraud detection, data mining, and time series analysis. Reports typically ask relatively simple questions of data such as total sales per region or per month. Such queries tend to require only a few attributes and therefore benefit from columnar storage. In contrast, deeper analytics such as applying a fraud detection model to a large table of customer behaviors relies on applying that model to many attributes across many rows of data. In that case, a row store makes a lot more sense.
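
A rough sketch of the two access patterns (table and column names are made up):

-- Reporting query: touches only 2 of the table's many attributes,
-- so a columnar layout reads just those two columns from disk.
SELECT region, SUM(sale_amount) AS total_sales
FROM sales_facts
GROUP BY region;

-- Scoring-style query: needs nearly every attribute of each record,
-- so a row layout avoids reassembling records from many column pages.
SELECT customer_id, age, tenure, balance, num_purchases, avg_basket, last_visit
FROM customer_features;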

Clearly there are cases where having both a column and row store benefits an analytics workload, which is why we have added the hybrid data store feature to nCluster 4.6.

Performance Observations

What does the addition of a hybrid store mean for typical nCluster workloads? The performance improvements from reduced I/O can be considerable: a 5x to 15x speedup was typical in some in-house tests on reporting queries. These queries were generally simple reporting queries with a few joins and aggregation. Performance improvement on more complex analytics workloads, however, was highly variable, so we took a closer look at why. As one would expect (and a number of columnar publications demonstrate), we also find that queries that use all or almost all attributes in a table benefit little or are slowed down by columnar storage. Deep analytical queries in nCluster like scoring, fraud detection, and time series analysis tend to use a higher percentage of columns. Therefore, as a class, they did not benefit as much from columnar, but when these queries do use a smaller percentage of columns, choosing the columnar option in the hybrid store provided good speedup.

A further reason that these more complex queries benefit less from a columnar approach is Amdahl’s law. As we push more complex applications into the database via SQL-MapReduce, we see a higher percentage of query time spent running application code rather than reading from or writing to disk; if, say, only a fifth of a query’s time goes to I/O, even a 10x reduction in I/O yields less than a 1.3x overall speedup. This highlights an important trend in data analytics: user CPU cycles per byte are increasing, which is one reason that deployed nCluster nodes tend to have a higher CPU-per-byte ratio than one might expect in a data warehouse. The takeaway message is that the hybrid store provides an important performance benefit for simple reporting queries; for analytical workloads that include a mix of ad hoc and simple reporting queries, performance is maximized by choosing the data orientation best suited to each workload.

Implementation

The hybrid store is made possible by integrating a column store within the nCluster data storage and query-processing engine, which already used row storage. The new column store is tightly integrated with existing query processing and system services. This means that any query answerable by the existing Aster storage engine can now also be answered in our hybrid store, whether the data is stored in row or column orientation. Moreover, all SQL-MapReduce, workload management, replication, fail-over, and cluster backup features are available to any data stored in the hybrid store.

Providing flexibility and high performance on a wide range of workloads makes Aster Data the best platform for high-value analytics. To that end, we look forward to continuing development of the nCluster hybrid storage engine to further optimize row and column data access. Coupled with workload management and SQL-MapReduce, the new hybrid nCluster storage highlights Aster Data’s commitment to giving nCluster users the flexibility to make the most of their data.



By Mayank Bawa in Statements on September 9, 2010

I’m delighted to announce that we’ve appointed a new CEO, Quentin Gallivan, to lead our company through the next level of growth.

We’ve had tremendous growth at our company in the past 4 years – growing Aster Data from 3 people to a strong, well-rounded team with a stellar management team, shipping products with market-defining features, working with customers on fascinating projects across many industries including retail, Internet, media and publishing, and financial services, and establishing key partnerships that we’re really excited about. Tasso and I will be working closely with Quentin as he accelerates our trajectory, taking our company to the next level of market leadership, sales and partnership execution, and international expansion.

Quentin brings more than 20 years of senior executive experience to Aster Data. He has held a variety of CEO and senior executive positions at leading technology companies. Quentin joins us from PivotLink, the leading provider of BI solutions, where, as CEO, he rapidly grew the company to over 15,000 business users, from mid-sized companies to F1000 companies, across key industries including retail, financial services, CPG, manufacturing and high technology. Prior to PivotLink, Quentin served as CEO of Postini, where he scaled the company to 35,000 customers and over 10 million users until its acquisition by Google in 2007. Quentin also served as executive vice president of worldwide sales and services at VeriSign, where he grew sales from $20M to $1.2B and was responsible for the global distribution strategy for the company’s security and services business. Quentin has also held a number of key executive and leadership positions at Netscape Communications and GE Information Services.

I’ll transition to a role that I’m really passionate about. I’ll be working closely with our customers and, as our Chief Customer Officer, I’ll lead our organization devoted to ensuring customer success and innovation in our fast-growing customer base. When the company was smaller, I was very actively involved in our customer deployments. As the company scaled, I had to withdraw into operations. In my new role, I’ll be back doing tasks that I relish – solving problems at the intersection of technology and usage – and providing a feedback loop from customers to Tasso, our CTO, to chart our product development.

Together, Quentin, Tasso and I are excited to accelerate our momentum and success in the market.



By Tasso Argyros in Analytics, Blogroll on August 10, 2010

Coming out of Stanford to start Aster Data five years back, my co-founders and I had to answer a lot of questions. What kind of an engineering team do we want to build? Do we want people experienced in systems or databases? Do we want to hire people from Oracle or another established organization? When you’re just starting a company, embarking on a journey that you know will have many turns, answers are not obvious.

What we ended up doing very early on was betting on intelligent and adaptable engineers rather than on experience or a long resume. It turned out to be the right call because, as a startup, we had to react to market needs and change our focus in the blink of an eye. Having a team of people used to tackling never-seen-before problems made us super-agile as a product organization. As the company grew, we ended up with a mix of people combining expertise in certain areas with core engineering talent. But the culture of the company was set in stone even though we didn’t realize it: even today our interview process expects talent, intelligence and flexibility to be there and to strongly complement whatever experience our candidates may have.

There are three things that are great about being an engineer at Aster Data:

Our Technology Stack is Really Tall.

We have people working right above the kernel on filesystems, workload management, I/O performance, etc. We have many challenging problems that involve very large scale distributed systems - and I’m talking about the whole nine yards, including performance, reliability, manageability, and data management at scale. We have people working on database algorithms, from the I/O stack to the SQL planner to non-SQL planners. And we have a team of people working on data mining and statistical algorithms on distributed systems (this is our “quant” group, since people there come with a background in physics as much as computer science). It’s really hard to get bored or stop learning here.

We Build Real Enterprise Software.

There’s a difference between the software one writes at a company like Aster Data and at a company like Facebook. Both write software for big data analysis. However, a company like Facebook solves the problem (a very big problem, indeed) for itself, and each engineer gets to work on a small piece of the pie. At Aster Data we write software for enterprises, and because of our relatively small size each engineer makes a world of difference. We also ship software to third parties, who expect it to be resilient, reliable and easy to manage and debug out of the box. This makes the problem more challenging but also gives us great leverage: once we get something right, not one, not two, but potentially hundreds or thousands of companies can benefit from our products. The impact of the work of each engineer at Aster Data is truly significant.

We’re Working on (Perhaps) the Biggest IT Revolution of the 21st Century.

Big Data. Analytics. Insights. Data Intelligence. Commodity hardware. Cloud/elastic data management. You name it. We have it. When we started Aster Data in 2005 we just wanted to help corporations analyze the mountains of data that they generate. We thought it was a critical problem for corporations if they wanted to remain competitive and profitable. But the size and importance of data grew beyond anyone’s expectations over the past few years. We can probably thank Google, Facebook and the other internet companies for demonstrating to the world what data analytics can do. Given the importance and impact of our work, there’s no ceiling on how successful we can become.

You’ve probably guessed it by now, but the reason I’m telling you all this is to also tell you that we’re hiring. If you think you have what it takes to join such an environment, I’d encourage you to apply. We get many applications daily, so the best way to get an interview here is through a recommendation or referral. With tools like LinkedIn (which happens to be a customer) it’s really easy to explore your network. My LinkedIn profile is here, so see if we have a professional or academic connection. You can also look at our management team, board of directors, investors and advisors to see if there are any connections there. If there’s no common connection, feel free to email your resume to jobs@asterdata.com. However, to stand out I’d encourage you to say a few words about what excites you about Aster Data, large-scale distributed systems, databases, analytics and/or startups that work to revolutionize an industry, and why you think you’ll be successful here. Finally, take a look at the events we either organize or participate in - it’s a great way to meet someone from our team and explain why you’re excited to join our quest to revolutionize data management and analytics.



By Tasso Argyros in Analytics on August 9, 2010

Watching our customers use Aster Data to discover new insights and build new big data products is one of the most satisfying parts of my job. Having seen this process a few times, I found that it always has the same steps:

An Idea or Concept – Someone comes up with an idea about a treasure that could be hidden in the data, e.g. a new customer segment that could be very profitable, a new pattern that reveals novel cases of fraud, or some other event-triggered analysis.

Dataset – An idea based on data that doesn’t exist is like a great recipe without the ingredients. Hopefully the company has already deployed one or more big data repositories that hold the necessary data in full detail (no summaries, sampling, etc.). If that’s not the case, the data has to be generated, captured and moved to a big data analytics server – an MPP database with a fully integrated analytics engine, like Aster Data’s solution – which addresses both parts of the big data need: scalable data storage and data processing.

Iterative Experimentation – This is the fun part. In contrast to traditional reporting, where the idea translates almost automatically into a query or report (e.g., I want to know average sales per store for the past 2 years), a big data product idea (e.g., I want to know my most profitable customer segment) requires building intuition about the data before arriving at the right answer. This can only be achieved through a large number of analytical queries, using either SQL or MapReduce (see the sketch after these steps), and it’s where the analyst or data scientist builds an understanding of the dataset and of the hidden gems buried in it.

Data Productization – Once iterative experimentation gives the data scientist evidence of gold, the next step is to make the process repeatable so that its output can be systematically used by humans (e.g. the marketing department) or systems (e.g. a credit card transaction clearing system that needs to identify fraudulent transactions). This requires not only a repeatable process but also data certified to be of high quality and processing that can meet specific SLAs, typically using a hybrid of SQL and MapReduce for deep big data analysis.
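
As an illustration of the kind of ad hoc query an analyst might run during the experimentation step (all table and column names are made up):

-- Which customer segments generated the most margin last quarter?
SELECT c.segment,
  COUNT(DISTINCT c.customer_id) AS customers,
  SUM(o.revenue - o.cost) AS margin
FROM customers c
JOIN orders o ON o.customer_id = c.customer_id
WHERE o.order_date >= DATE '2010-07-01'
GROUP BY c.segment
ORDER BY margin DESC;

In practice, dozens of variations of a query like this – different groupings, time windows, and joins – get run before a pattern worth productizing emerges.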

If you think about it, this process is similar to the process of coming up with a new product (software or otherwise). You start with an idea, then gather the raw materials and build a lot of prototypes. I’ve found that people who discover an important and valuable data insight after a process of iterative experimentation feel the same satisfaction as an inventor who has just made a huge discovery. And the next natural step is to take that prototype, turn it into a repeatable manufacturing process and start using it in the real world.

In the “old” world of simple reporting, the process of creating insights was straightforward. Correspondingly, the value of the outcome (reports) was much lower and easily replicated by everyone. Big Data Analytics, on the other hand, requires a touch of innovation and creativity, which is exactly why it is hard to replicate and why its results produce such important and sustainable advantages for businesses. I believe that Big Data Products are the next wave of corporate value creation and competitive differentiation.



By Tasso Argyros in Blogroll on July 19, 2010

Every year or so Google comes out with an interesting piece of infrastructure, always backed by claims that it’s being used by thousands of people on thousands of servers and processes petabytes or exabytes of web data. That alone makes Google papers interesting reading. :)

This latest piece of research just came out on Google’s Research Buzz page. It’s about a system called Dremel (note: Dremel is also the name of a maker of hardware tools that I used a lot when building model R/C airplanes as a kid). Dremel is an interesting move by Google: a system for interactive analysis of data. It was created because native MapReduce was thought to have too much latency for fast interactive querying and analysis. It uses data that sits on different storage systems like GFS or BigTable. Data is modeled in a columnar, semi-structured format, and the query language is SQL-like, with extensions to handle the non-relational data model. I find this interesting - below is my analysis of what Dremel is, followed by the big conclusion.

Main characteristics of the system:

Data & Storage Model

– Data is stored in a semi-structured format. This is not XML; rather, it uses Google’s Protocol Buffers. Protocol Buffers (PB) allow developers to define schemas that are nested.
– Every field is stored in its own file, i.e. every element of the Protocol Buffers schema is stored in columnar form.

Columnar modeling is especially important for Dremel for two specific reasons:

– Protocol Buffer data structures can be huge (> 1000 fields).
– Dremel does not offer any data modeling tools to help break these data structures down. E.g. there’s nothing in the paper that explains how you can take a Protocol Buffers data structure and break it down to 5 different tables.
– Data is stored in a way that makes it possible to recreate the original “flat” schema from the columnar representation. This, however, requires a full pass over the data - the paper doesn’t explain how point or indexed queries would be executed.
– There’s almost no information about how data gets into the right format, or how it is stored, deleted, replicated, etc. My best guess is that when someone defines a Dremel table, data is copied from the underlying storage to the local storage of Dremel nodes (leaf nodes?) and at the same time is replicated across the leaf nodes. Since data in Dremel cannot be updated (it seems to be a write-once-read-many model), the design and implementation of the replication subsystem should be significantly simplified.

Interface

– The query interface is SQL-like, with extensions to handle the semi-structured, nested nature of the data. Both the input and the output of queries are semi-structured. One needs to get used to this since it’s significantly different from the relational model.
– Tables can be defined from files, e.g. stored in GFS, by means of a DEFINE TABLE command.
– The data model and query language make Dremel appropriate for developers; for Dremel to be used by analysts or database folks, a different/simpler data model and a good number of tools (for loading, changing the data model, etc.) would be needed.
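
For a flavor of the language, here is a query roughly in the style of the examples in the paper, using field paths from its nested Document schema (DocId, Name.Url, Name.Language.Code). I’m reproducing the style from memory, so treat the exact syntax as approximate:

SELECT DocId,
  COUNT(Name.Language.Code) WITHIN Name AS language_count
FROM t
WHERE REGEXP(Name.Url, '^http');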

Query Execution

– Queries do NOT use MapReduce, unlike Hadoop query tools like Pig & Hive.
– Dremel provides optimizations for sequential data access, such as async I/O & prefetching.
– Dremel supports approximate results (e.g. return partial results after reading X% of data - this speeds up processing in systems with 100s of servers or more since you don’t have to wait for laggards).
– Dremel can use replicas to speed up execution if a server becomes too slow. This is similar to the “backup copies” idea from the original Google MapReduce paper.
– There seems to be a tree-like model of executing queries, meaning that there are intermediate layers of servers between the leaf nodes and the top node (which receives the user query). This is useful for very large deployments (e.g. thousands of servers) since it provides some intermediate aggregation points that reduce the amount of data that needs to flow to any single node.

Performance & Scale

– Compared to Google’s native MapReduce implementation, Dremel is two orders of magnitude faster in terms of query latency. As mentioned above, part of the reason is that the Protocol Buffers are usually very large and Dremel doesn’t have a way to break those down except for its columnar modeling. Another reason is the high startup cost of Google’s MapReduce implementation.
– Following Google’s tradition, Dremel was shown to scale reasonably well to thousands of servers although this was demonstrated only over a single query that parallelizes nicely and from what I understand doesn’t reshuffle much data. To really understand scalability, it’d be interesting to see benchmarks with a more complex workload collection.
– The paper mentions little to nothing about how data is partitioned across the cluster. Scalability of the system will probably be sensitive to partitioning strategies, so that seems like a significant omission IMO.

So the big question: Can MapReduce itself handle fast, interactive querying?

– There’s a difference between the MapReduce paradigm, as an interface for writing parallel applications, and a MapReduce implementation (two examples are Google’s own MapReduce implementation, which is mentioned in the Dremel paper, and open-source Hadoop). MapReduce implementations have unique performance characteristics.
– It is well known that Google’s MapReduce implementation & Hadoop’s MapReduce implementation are optimized for batch processing and not fast, interactive analysis. Besides the Dremel paper, look at this Berkeley paper for some Hadoop numbers and an effort to improve the situation.
– Native MapReduce execution is not fundamentally slow; however, Google’s MapReduce and Hadoop happen to be oriented more towards batch processing. Dremel tries to overcome that by building a completely different system that speeds up interactive querying. Interestingly, Aster Data’s SQL-MapReduce came about to address exactly this and offers very fast interactive queries even though it uses MapReduce. So the idea that one needs to get rid of MapReduce to achieve fast interactivity is something I disagree with - we’ve shown this is not the case with SQL-MapReduce.



By Tasso Argyros in Cloud Computing on July 13, 2010

Amazon announced today the availability of special EC2 cloud clusters that are optimized for low-latency network operations. This is useful for applications in the so-called High-Performance Computing area, where servers need to request and exchange data very fast. Examples of HPC applications range from nuclear simulations in government labs to playing chess.

I find this development interesting, not only because it makes scientific applications in the cloud a possibility, but also because it’s an indication of where cloud infrastructure is heading.

In the early days, Amazon EC2 was very simple: if you wanted 5 “instances” (that is, 5 virtual machines), that’s what you got. However, the instances were limited in both memory and disk capacity. Over time, more and more configurations were added, and now one can choose an instance type from a variety of disk and memory configurations, with up to 15GB of memory and 2TB of disk per instance. However, the network was always a problem, regardless of instance size. (According to rumors, EC2 would make things worse by distributing instances as far away from each other as possible in the datacenter to increase reliability - as a result, network latency would suffer.) Now the network problem is being solved by these special “Cluster Compute Instances,” which provide guaranteed, non-blocking access to a 10GbE network infrastructure.

Overall, this represents a departure from the super-simple black-box model that EC2 started with. Amazon - wisely - realizes that accommodating more applications requires transparency, and guarantees, about the underlying infrastructure. Guaranteeing network latency is just the beginning: Amazon has the opportunity to add many more options and guarantees around I/O performance, quality of service, SSDs versus hard drives, fail-over behavior, etc. The more options and guarantees Amazon offers, the closer we’ll get to the promise of the cloud - at least for resource-intensive IT applications.