Archive for December, 2010

By Tasso Argyros in MapReduce on December 8, 2010

In the past couple of years, MapReduce – once an unknown, funky word – became a prominent, mainstream trend in data management and analytics. However even today I meet people that are not clear on what MapReduce exactly is and how it relates to some other terms and trends. In this post I attempt to clarify some of the MapReduce-related terminology. So here it goes.

MapReduce (the framework). MapReduce is a framework that allows programmers to develop analytical applications that run on (usually large) clusters of commodity hardware and process (usually large) amounts of data. It was first introduced by Google and it is language independent. It is abstract in the sense that an application that uses MapReduce doesn’t have to care about things like the number of servers/processes, fault tolerance, etc. MapReduce is supported by multiple implementations including the open source project Hadoop and Aster Data. Google also has its own proprietary implementation which, unfortunately, is also called MapReduce and sometimes creates confusion.

MapReduce (the Google implementation of MapReduce framework). As mentioned above, Google has its own implementation of MapReduce. This was described in the 2004 OSDI paper and it was the theoretical basis upon which Hadoop was developed. Google’s MapReduce was a processing framework and it was using Google’s GFS (Google File System) for data storage.

Aster Data’s SQL-MapReduce. Aster Data has a patent-pending implementation of MapReduce that (a) uses a database for data persistence, (b) is tightly integrated with SQL, i.e. an analyst or BI tool can invoke MapReduce via SQL queries, thus making MapReduce accessible to the enterprise. It supports multiple programming languages such as Java and C and it is accessible through standard interfaces such as ODBC and JDBC.

Hadoop. Hadoop is an Apache “umbrella” project that hosts many sub-projects, including Hadoop MapReduce and HDFS, Hadoop’s version of the Google File System which Hadoop MapReduce uses for data storage. Hadoop is the core open source project - however, there are many distributions for Hadoop, just as there are many distributions for Linux. These distributions contain Hadoop binaries together with other utilities and tools. The most popular distributions are the Cloudera distribution, the Yahoo distribution and the baseline Apache distribution.

HDFS. HDFS is Hadoop’s version of GFS and it is a distributed file system. HDFS can exist without Hadoop MapReduce, but usually Hadoop MapReduce requires HDFS. Aster Data’s MapReduce does not require HDFS as it uses an extensible MPP database for data storage and persistence.

Cloudera. Cloudera usually means either (a) the company, (b) Cloudera’s Distribution for Hadoop.

Sqoop. Sqoop which is short for “SQL to Hadoop” is an open source project that provides a framework for connecting to SQL data stores for data exchange.

NoSQL. NoSQL started as a term to describe a collection of products that did not support or rely on SQL. This included Hadoop and other products like Cassandra. However, as more people realized that SQL is a necessary interface  for many data management systems, the term evolved to mean (N)ot (o)nly SQL. These days there are attempts to port SQL on top of Hadoop and other NoSQL products.

Are there any MapReduce-related terms I omitted? Please add them in the comments below and include a definition and links to good resources if you’d like.