By Tasso Argyros in MapReduce on December 8, 2010

In the past couple of years, MapReduce – once an unknown, funky word – has become a prominent, mainstream trend in data management and analytics. However, even today I meet people who are not clear on exactly what MapReduce is and how it relates to other terms and trends. In this post I attempt to clarify some of the MapReduce-related terminology. So here goes.

MapReduce (the framework). MapReduce is a framework that allows programmers to develop analytical applications that run on (usually large) clusters of commodity hardware and process (usually large) amounts of data. It was first introduced by Google and it is language independent. It is abstract in the sense that an application that uses MapReduce doesn’t have to care about things like the number of servers/processes, fault tolerance, etc. MapReduce is supported by multiple implementations including the open source project Hadoop and Aster Data. Google also has its own proprietary implementation which, unfortunately, is also called MapReduce and sometimes creates confusion.
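To make the programming model concrete, here is a minimal single-process sketch in Python of the classic word-count example. The names `map_words`, `reduce_counts` and `run_mapreduce` are hypothetical and chosen for illustration only; they are not the API of Hadoop, Aster Data or Google's implementation. A real framework would run the same two user functions across many servers and handle partitioning and fault tolerance on the application's behalf.

```python
from collections import defaultdict


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in one input record."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(word, counts):
    """Reduce step: combine all values emitted for one key."""
    return word, sum(counts)


def run_mapreduce(records, mapper, reducer):
    """Toy, in-memory stand-in for the framework (an assumption for illustration).

    It applies the mapper to every record, groups the emitted values by key
    (the "shuffle"), and then applies the reducer once per key.
    """
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in groups.items())


lines = ["the quick brown fox", "the lazy dog", "the fox"]
word_counts = run_mapreduce(lines, map_words, reduce_counts)
```

Note that the application code consists only of the two small functions; nothing in them refers to the number of servers or processes, which is the sense in which the framework is abstract.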

MapReduce (the Google implementation of the MapReduce framework). As mentioned above, Google has its own implementation of MapReduce. It was described in the 2004 OSDI paper and was the theoretical basis upon which Hadoop was developed. Google’s MapReduce was a processing framework, and it used Google’s GFS (Google File System) for data storage.

Aster Data’s SQL-MapReduce. Aster Data has a patent-pending implementation of MapReduce that (a) uses a database for data persistence and (b) is tightly integrated with SQL, i.e. an analyst or BI tool can invoke MapReduce via SQL queries, which makes MapReduce accessible to the enterprise. It supports multiple programming languages, such as Java and C, and it is accessible through standard interfaces such as ODBC and JDBC.

Hadoop. Hadoop is an Apache “umbrella” project that hosts many sub-projects, including Hadoop MapReduce and HDFS, Hadoop’s version of the Google File System, which Hadoop MapReduce uses for data storage. Hadoop is the core open source project; however, there are many distributions of Hadoop, just as there are many distributions of Linux. These distributions contain Hadoop binaries together with other utilities and tools. The most popular distributions are the Cloudera distribution, the Yahoo distribution and the baseline Apache distribution.

HDFS. HDFS is Hadoop’s version of GFS and it is a distributed file system. HDFS can exist without Hadoop MapReduce, but usually Hadoop MapReduce requires HDFS. Aster Data’s MapReduce does not require HDFS as it uses an extensible MPP database for data storage and persistence.

Cloudera. Cloudera usually means either (a) the company or (b) Cloudera’s Distribution for Hadoop.

Sqoop. Sqoop, which is short for “SQL to Hadoop”, is an open source project that provides a framework for connecting Hadoop to SQL data stores for data exchange.

NoSQL. NoSQL started as a term to describe a collection of products that did not support or rely on SQL. This included Hadoop and other products like Cassandra. However, as more people realized that SQL is a necessary interface for many data management systems, the term evolved to mean (N)ot (o)nly SQL. These days there are attempts to layer SQL on top of Hadoop and other NoSQL products.

Are there any MapReduce-related terms I omitted? Please add them in the comments below and include a definition and links to good resources if you’d like.

someone who actually uses hadoop on December 8th, 2010 at 11:21 pm #

Hadoop’s MapReduce layer doesn’t require HDFS at all.

Tasso Argyros on December 9th, 2010 at 1:38 pm #

You’re right that MapReduce and HDFS are now separate projects under the Apache Hadoop umbrella. However, I believe that using MapReduce on top of HDFS is still (and by far) the most popular configuration. If you are aware of popular use cases where Hadoop MapReduce is used without HDFS, I’d love to hear about them.

Ajay Dawar on December 27th, 2010 at 1:29 pm #

Great post Tasso. Can you also please include an explanation of the Cassandra project?

Bill on February 28th, 2011 at 4:52 pm #

It would be cool to know how you guys will someday connect your outstanding system to an advanced Q&A program like IBM’s Watson. A system like this would be so valuable, especially in uber-high-end marketing and financial analytics.

