Archive for the ‘MapReduce’ Category

 
15
Apr
Posted by Mayank in Analytic applications, Analytics, Cloud Computing, MapReduce on April 15, 2010

In the last few years there has been a significant amount of market pickup, from users and vendors, on data clouds and advanced analytics – specifically a new class of data-driven applications run in a data cloud or on-premise. What’s different about this from past approaches is the frequency and speed at which these applications are accessed, the depth of the analysis, the number of data sources involved and the volume of data mined by these applications – terabytes to petabytes. In the midst of this cacophony of dialogue, recent announcements from vendors in this space are helping to clarify different visions and approaches to the big data challenge.

Both Aster Data and Greenplum made announcements this week that illustrated different approaches. At the same time that Aster Data announced the Aster Analytics Center, Greenplum announced an upcoming product named Chorus. I wanted to take a moment to compare and contrast what these announcements say about the direction of the two companies.

Greenplum’s approach speaks to two traditional problem areas i) access to data, from provisioning of data marts to connectivity to data across marts, and ii) some level of collaboration among certain developers and analysts. Their approach is to create a tool for provisioning, unified data access, and sharing of annotations and data among different developers and analysts. Interestingly, this is not an entirely new concept; these are well-known problems for which a number of companies and tools have already developed best-of-breed solutions over the last 15 years. For example, the capabilities for data access are another version of Export/Copy primitives that already exist in all databases and that have been built upon by common ETL and EII tools for cases in which richer support than Export & Copy are needed – for instance, when data has to be transformed, correlated or cleaned while being moved from one context (mart) to another (mart).

This approach is indicative of a product direction in which the primary focus is on adding another option to the list of tools available to customers to address these problems. It’s really not a ground-breaking innovation that evolves the world of analytics. New types of analytics, or ‘data-driven applications,’ is where the enormous opportunity lies. The Greenplum approach of data collaboration is interesting in a test environment or sandbox. When it comes to real production value however, it effectively increases the functions available to the end user, but at a big cost due to significant increases in complexity, security issues and extra administrative overhead. What does this mean exactly?

  • The spin-up of marts and moving data around can result in “data sprawl” which ultimately increases administrative overhead and is dangerous in these days of compliance and sensitivity to privacy and data leaks.
  • Adding a new toolset into the data processing stack creates difficult and painful work to either manage and administer multiple tool sets for similar purposes or to eliminate and transition away from investments in existing toolsets.
  • To enable effective communication and sharing, users need strong processes and features for source identification of data, data collection, data transformation, rule administration, error detection & correction, data governance and security. The quality and security policies around meta-data are especially important as free-form annotations can lead to propagation of errors or leaks in the absence of strong oversight.

In contrast, Aster Data’s recent announcements support our long-standing investments in our unique advanced in-database architecture where applications run fully inside Aster Data’s platform with complete application services essential for complex analytic applications. The announcements highlight that our vision is not to create a new set of tools and layers in the data stack that recreate capabilities currently available from a number of leading vendors, but rather to deliver a new Analytics Platform, a Data-Application Server, to uniquely enable analytics professionals to create data-rich applications that were impossible or impractical before – namely, to create and use advanced analytics for rich, rapid, and scalable insights into their data. This focus is complemented by our partners, who offer proven best-of-breed solutions for collaboration and data transformation.

Read the rest of this entry »



 
29
Jun
Posted by Mayank in MapReduce on June 29, 2009

We are announcing the availability of an Enterprise-Ready MapReduce Data Warehouse Appliance.

The appliance is powered by Dell hardware and Aster’s nCluster SQL/ MR database, with optional software for BI platform from Microstrategy and data modeling software from Aqua Data Studio.

Our product portfolio now allows our customers to get the benefits of our flagship Aster nCluster SQL/MR database in the packaging that they are most comfortable with – on-premise software, in-cloud service, or pre-packaged appliance.

The appliance offering packs a lot of punch compared to other data warehousing appliances in the market – it has the highest ratio of compute & memory to data sizes, allowing you to run rich queries on the appliance without breaking a sweat.

Read the rest of this entry »



 
26
Aug
Posted by Mayank in MapReduce on August 26, 2008

Pardon the tongue-in-cheek analogy to Oldsmobile when describing user-defined functions (UDFs), but I want to draw out some distinctions between this new class of functions that In-Database MapReduce enables.

Not Your Granddaddy's Oldsmobile

While similar on the surface, in practice there are stark differences between Aster In-Database MapReduce and traditional UDF’s.

MapReduce is a framework that parallelizes procedural programs to offload traditional cluster programming. UDF’s are simple database functions and while there are some syntactic similarities, that’s where the similarity ends. Several major differences between In-Database MapReduce and traditional UDF’s include:

Performance: UDF’s have limited or no parallelization capabilities in traditional databases (even MPP ones). Even where UDF’s are executed in parallel in an MPP database, they’re limited to accessing local node data, have byzantine memory management requirements, require multiple passes and costly materialization. In constrast, In-Database MapReduce automatically executes SQL/MR functions in parallel across potentially hundreds or even thousands of server nodes in a cluster, all in a single-pass (pipelined) fashion.

Flexibility: UDF’s are not polymorphic. Some variation in input/output schema may be allowed by capabilities like function overloading or permissive data-type handling, but that tends to greatly increase the burden on the programmer to write compliant code. In contrast, In-Database MapReduce MR/SQL functions are evaluated at run-time to offer dynamic type inference, an attribute of polymorphism that offers tremendous adaptive flexibility previously only found in mid-tier object oriented programming.

Manageability: UDF’s are generally not sandboxed in production deployments. Most UDF’s are executed in-process by the core database engine, which means bad UDF code can crash a database. SQL/MR functions execute in their own process for full fault isolation (bad SQL/MR code results in an aborted query, leaving other jobs uncompromised). A strong process management framework also ensures proper resource management for consistent performance and progress visibility.



 
25
Aug
Posted by Mayank in MapReduce on August 25, 2008

I’m unbelievably excited about our new In-Database MapReduce feature!

Google has used MapReduce and GFS on page rank analysis, but the sky is really the limit for anyone to build powerful analytic apps. Curt Monash has posted an excellent compendium of applications that are successfully leveraging the MapReduce paradigm today.

A few examples of SQL/MapReduce functions that we’ve collaborated with our customers on so far:

1. Path Sequencing: SQL/MR functions can be used for developing regular expression matching of complex path sequences (eg. time series financial analysis or clickstream behavioral recommendations). It can also be extended to discover Golden Paths to reveal interesting behavioural patterns useful for segmentation, issue resolution, and risk optimization.

2. Graph Analysis: many interesting graph problems like BFS (breadth first search), SSSP (single source shortest path), APSP (all-pairs shortest path), and page rank that depend on graph traversal.

3. Machine Learning: several statistical algorithms like linear regression, clustering, collaborative filtering, naive bayes, support vector machine, and neural networks can be used to solve hard problems like pattern recognition, recommendations/market basket analysis, and classification/segmentation.

4. Data Transformations and Preparation: Large-scale transformations can be parameterized as SQL/MR functions for data cleansing and standardization, unleashing the true potential for Extract-Load-Transform pipelines and making large-scale data model normalization feasible. Push down also enables rapid discovery and data pre-processing to create analytical data sets used for advanced analytics such as SAS and SPSS.

Read the rest of this entry »



 
25
Aug
Posted by Mayank in MapReduce on August 25, 2008

I am very pleased to announce today that Aster nCluster now brings together the expressive power of a MapReduce framework with the strengths of a Relational Database!

Jeff Dean and Sanjay Ghemawat at Google had invented the MapReduce framework in 2004 for processing large volumes of unstructured data on clusters of commodity nodes. Jeff and Sanjay’s goal was to provide a trivially parallelizable framework so that even novice developers (a.k.a interns) could write programs in a variety of languages (Java/C/C++/Perl/Python) to analyze data independent of scale. And, they have certainly succeeded.

Once implemented, the same MapReduce framework has been used successfully within Google (and outside, via Yahoo! sponsored Apache’s Hadoop) to analyze structured data as well.

Read the rest of this entry »