By Tasso Argyros in Blogroll, Data-Analytics Server on June 22, 2010

Rumors abound that Intel is “baking” the successor to its very successful Nehalem CPU architecture, codenamed Westmere. It comes with an impressive spec: 10 CPU cores (supporting 20 concurrent threads) packed into a single chip. You can soon expect to see 40 cores in mid-range 4-socket servers - a number hard to imagine just five years ago.

We’re definitely talking about a different era. In the old days, you could barely fit a single core on a chip. (I still remember, 15 years ago, having to buy and install a separate math co-processor on my Mac LC to run Microsoft Excel and Mathematica.) And as the hardware changes, software has to change, too. In fact, modern software means software that can handle parallelism. This is what makes MapReduce such an essential and timely tool for big data applications. MapReduce’s purpose in life is to simplify data and processing parallelism for these applications. It gives the programmer ample freedom in how to do things locally, and it takes over when data needs to be communicated across processes, cores, or servers, thus hiding much of the complexity of parallelism.

Once you design your software and data to operate in a parallelized environment using MapReduce, the gains come on multiple levels. Not only will MapReduce help your analytical applications scale across a cluster of servers with terabytes of data, it will also exploit the billions of transistors and the tens of CPU cores inside each server. The best part: the programmer doesn’t need to think about the difference.
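To make that concrete, here is a minimal sketch in Python of the split described above. It is a toy, not how any particular MapReduce product works: the names map_chunk, merge_counts, map_reduce and count_words_multicore are made up for illustration, and word counting stands in for a real analytic job. The local work lives in map_chunk, and the only place parallelism shows up is the map_fn argument.

```python
from collections import defaultdict
from functools import reduce
from multiprocessing import Pool


def map_chunk(lines):
    """Local work: count words in one chunk, with no knowledge of other chunks."""
    counts = defaultdict(int)
    for line in lines:
        for word in line.lower().split():
            counts[word] += 1
    return dict(counts)


def merge_counts(left, right):
    """Reduce step: combine two partial results into one."""
    merged = dict(left)
    for word, n in right.items():
        merged[word] = merged.get(word, 0) + n
    return merged


def map_reduce(chunks, map_fn=map):
    """Apply map_chunk to every chunk via map_fn, then fold the partial results.

    map_fn is the only place where parallelism appears: the built-in map
    runs serially, while Pool.map spreads the chunks over worker processes.
    """
    partials = map_fn(map_chunk, chunks)
    return reduce(merge_counts, partials, {})


def count_words_multicore(chunks, workers=4):
    """Same job, with the map step spread over several CPU cores."""
    with Pool(processes=workers) as pool:
        return map_reduce(chunks, map_fn=pool.map)
```

Calling map_reduce(chunks) runs everything on one core; count_words_multicore(chunks) runs the same map_chunk and merge_counts over several. A distributed framework would swap in yet another map_fn, and the two functions the programmer wrote would not need to change.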

As an example, consider this great paper out of Stanford, which discusses MapReduce implementations of popular Machine Learning algorithms. The Stanford researchers considered MapReduce as a way of “porting” these algorithms (traditionally implemented to run on a single CPU) to a multi-core architecture. But, of course, the same MapReduce implementations can be used to scale these algorithms across a distributed cluster as well.
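Why do learning algorithms fit this mold? Many of them boil down to computing sums over the data, and sums can be computed per chunk and then added together. The sketch below is not taken from the paper; it is a hypothetical illustration using the simplest case I can think of, fitting a line y = intercept + slope * x by least squares. The function names (partial_sums, add_sums, fit_line) are made up for this example.

```python
from functools import reduce


def partial_sums(chunk):
    """Map step: the five sums least squares needs, for one chunk of (x, y) pairs."""
    n = len(chunk)
    sx = sum(x for x, _ in chunk)
    sy = sum(y for _, y in chunk)
    sxx = sum(x * x for x, _ in chunk)
    sxy = sum(x * y for x, y in chunk)
    return (n, sx, sy, sxx, sxy)


def add_sums(a, b):
    """Reduce step: add two tuples of partial sums element by element."""
    return tuple(p + q for p, q in zip(a, b))


def fit_line(chunks, map_fn=map):
    """Fit y = intercept + slope * x from per-chunk sums.

    As before, map_fn decides where the map step runs (one core, many
    cores, or many servers). Assumes at least two distinct x values.
    """
    zero = (0, 0.0, 0.0, 0.0, 0.0)
    n, sx, sy, sxx, sxy = reduce(add_sums, map_fn(partial_sums, chunks), zero)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept
```

Each chunk is reduced to five numbers no matter how many rows it holds, so very little data has to move between cores or servers. That is the property that lets the same implementation run on one multi-core box or on a cluster.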

Hardware has changed - MPP, shared-nothing, commodity servers, and, of course, multi-core. In this new world, MapReduce is software’s answer for big data processing. Intel and Westmere have just found an unexpected friend.
