Archive for April, 2009

By Mayank Bawa in Uncategorized on April 26, 2009

A big congratulations to our CTO and Co-Founder, Tasso Argyros, who has been recognized as one of BusinessWeek’s Best Young Tech Entrepreneurs for 2009. I’d have given him a run for his spot, but I am over the hill and probably too old to run the distance; I wish they’d start a list for Best Entrepreneurs under the age of 40 :-)

Tasso’s hard work, dedication, confidence and vision have been a huge part of our success to date, and we know they will be a big part of great things ahead for Aster. Congratulations to you, and to all the other great companies that made the list as well; it’s an honor for them to be recognized alongside you.

By Peter Pawlowski in Blogroll, nPath on April 22, 2009

Aster’s SQL/MR framework (In-Database MapReduce) enables our users to write custom analytic functions (SQL/MR functions) in a programming language like Java or Python, install them in the cluster, and then invoke them from SQL to analyze data stored in nCluster database tables. These SQL/MR functions transform one table into another, but do so in a massively parallel way. As increasingly valuable analytic functions are pushed into the database, the value of constructing a data structure once, and reusing it across a large number of rows, increases substantially. Our API was designed with this in mind.

What does the SQL/MR API look like? The SQL/MR function is given an iterator over a set of input rows, as well as an emitter for outputting rows. We decided on this interface for a number of reasons, with one of the most important being the ability to maintain state between rows. We’ve found that many useful analytic functions need to construct some state before processing a row of input, and this state construction should be amortized over as many rows as possible.

Here’s a wireframe of one type of SQL/MR function (a RowFunction):

class RealAsterFunction implements RowFunction {
  public void operateOnSomeRows(RowIterator iterator, RowEmitter outputEmitter) {
    // Construct some data structure to enable fast processing.
    // Then, for each row from the iterator, process it and emit a result.
  }
}

When this SQL/MR function is invoked in nCluster, the system starts several copies of this function on each node (think: one per CPU core). Each function is given an iterator to the rows that live in its local slice of the data. An alternative design, which is akin to the standard scalar UDF, would have been:

class NotRealAsterFunction implements PossibleRowFunction {
  static void operateOnRow(Row currentRow, RowEmitter outputEmitter) {
    // Process the given row and emit a result.
  }
}

In this design, the static operateOnRow method would be called for each row in the function’s input. State can no longer be easily stored between rows. For simple functions, like computing the absolute value or a substring of a particular column, there’s no need for such inter-row state. But, as we’ve implemented more interesting analytic functions, we’ve found that enabling the storage of such state, or more specifically paying only once for the construction of something complex and then reusing it, has real value. Without the ability to save state between rows, the construction of this state would dominate the function’s execution.
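To make the trade-off concrete, here is a minimal Python sketch (the function names and the dictionary-building setup are illustrative stand-ins, not Aster’s actual API). The per-row design pays the setup cost on every call, while the iterator design pays it once per partition of rows:

```python
def expensive_setup():
    # Stand-in for building a large model or lookup structure.
    return {i: i * i for i in range(100_000)}

# Per-row style (like a scalar UDF): setup cost is paid for every row.
def operate_on_row(row):
    table = expensive_setup()          # rebuilt on each call
    return table[row]

# Per-partition style (like SQL/MR): setup cost is amortized over all rows.
def operate_on_some_rows(rows):
    table = expensive_setup()          # built once, reused for every row
    return [table[row] for row in rows]

rows = list(range(1000))
assert operate_on_some_rows(rows) == [operate_on_row(r) for r in rows]
```

Both styles compute the same answers; the difference is that the per-partition style calls `expensive_setup` once instead of once per row.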

Examples abound. Consider a SQL/MR function which applies a complex model to score the data in the database, whether it’s scoring a customer for insurance risk, scoring an internet user for an ad’s effectiveness, or scoring a snippet of text for its sentiment. These functions often construct a data structure in memory to accelerate scoring, which works very well with the SQL/MR API: build the data structure once and reuse it across a large number of rows.

A sentiment analysis SQL/MR function, designed to classify a set of notes written up by customer service reps or a set of comments posted on a blog, would likely first build a hash table mapping words to sentiment scores, based on some dictionary file. The function would then iterate through each snippet of text, converting each word to its stem and doing a fast lookup in the hash table. Such a persistent data structure accelerates the sentiment scoring of each text snippet.
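As a rough illustration (not Aster’s implementation: the dictionary, the naive stemmer, and the function names below are all made up for this sketch), the build-once, score-many shape looks like this:

```python
# Toy word-to-score hash table; a real function would load this once
# from a dictionary file before touching any rows.
SENTIMENT_DICT = {"great": 2, "good": 1, "bad": -1, "terribl": -2}

def stem(word):
    # Naive suffix-stripping stemmer, for illustration only.
    for suffix in ("ing", "ed", "es", "e", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def score_snippets(snippets):
    # The hash table is constructed once per partition; each row
    # (snippet) only pays for tokenizing, stemming, and O(1) lookups.
    for snippet in snippets:
        words = snippet.lower().split()
        score = sum(SENTIMENT_DICT.get(stem(w), 0) for w in words)
        yield (snippet, score)

results = dict(score_snippets(["great service", "terrible wait times"]))
assert results["great service"] == 2
assert results["terrible wait times"] == -2
```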

Another example is Aster’s nPath SQL/MR function. At a high level, this function looks for patterns in ordered data, with the pattern specified as a regular expression. When nPath runs, it converts the pattern into a data structure optimized for fast, constant-memory pattern matching. If state couldn’t be maintained between rows, there would be a large price to pay for reconstructing this data structure on each new row.
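nPath’s internal matching structure isn’t shown here, but Python’s re module illustrates the same principle: compile the pattern into a matching structure once, then reuse the compiled object across every row. The clickstream pattern and session strings below are hypothetical examples:

```python
import re

# Compiling up front is the analogue of nPath converting its pattern
# into an optimized structure before scanning any rows.
PATTERN = re.compile(r"home(\.product)+\.checkout")

def find_conversions(clickstream_rows):
    # PATTERN was compiled once; each row pays only for the match itself.
    return [row for row in clickstream_rows if PATTERN.fullmatch(row)]

sessions = [
    "home.product.checkout",
    "home.product.product.checkout",
    "home.checkout",
]
assert find_conversions(sessions) == [
    "home.product.checkout",
    "home.product.product.checkout",
]
```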

Repeating the high-order bit: as increasingly valuable analytic functions are pushed into the database, the value of constructing a data structure once, and reusing it across a large number of rows, increases substantially. The SQL/MR API was designed with this in mind.

By Steve Wooledge in Blogroll on April 7, 2009

I’m delighted to welcome Specific Media to the quickly growing family of Aster customers! I had the pleasure of briefly meeting the folks from Specific Media in our offices last week. Like Aster, Specific Media is incredibly focused on doing more with data to increase the value they provide to their customers: advertisers that represent 300 of the top Fortune 500 brands.

They’re also really smart and humble about what they do, which makes it a pleasure to work with them. And what you wouldn’t know from a brief introduction is how cutting-edge their analytic methodologies and capabilities are.  We’re just starting our partnership together and hope to have some success metrics to share later about how they are using the Aster nCluster database for their data warehouse. They have some interesting ideas for using the Aster In-Database MapReduce framework to perform rich analysis of data efficiently for improved ad targeting and relevancy.

By Tasso Argyros in Blogroll on April 6, 2009

When Mayank, George and I were at Stanford, one of the things that brought us together was a shared vision of how the world could benefit from a more scalable database to address exploding volumes of data. This led to the birth of Aster Data Systems and our flagship product, Aster nCluster, a highly scalable relational database system for what we call “frontline” data warehousing: the intersection of large data volumes, rich analytics, and mission-critical availability.

One way we found to solve the problem of managing and analyzing so much data was by implementing In-Database MapReduce. MapReduce is a programming model popularized by Google in 2004 to process large unstructured data sets distributed across thousands of nodes, and at Stanford we worked with some of the professors who had worked with the Google founders. In-Database MapReduce enables enterprises to harness the power of MapReduce while managing their data in Aster nCluster. In addition to its massively parallel execution environment for standard SQL queries, Aster nCluster adds the ability to implement flexible MapReduce functions for parallel data analysis and transformation inside the database.
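For readers new to the model, here is the canonical word-count example in MapReduce style, sketched in Python on a single machine (a real deployment would shard the map and reduce phases across many nodes; the function names are mine, not part of any product API):

```python
from itertools import groupby
from operator import itemgetter

# Map phase: emit a (word, 1) pair for every word in every input line.
def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield (word, 1)

# Reduce phase: group pairs by key and sum the counts per word.
def reduce_phase(pairs):
    pairs = sorted(pairs, key=itemgetter(0))   # stand-in for shuffle/sort
    for word, group in groupby(pairs, key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

counts = dict(reduce_phase(map_phase(["to be or not to be"])))
assert counts == {"to": 2, "be": 2, "or": 1, "not": 1}
```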

Much of the work of the Aster Data team is “fusing” best practices from the relational database world with innovations that Google pioneered for distributed computing. This takes strong engineering, so it’s no wonder that we are an engineering-driven company with some of the best minds available on our team. Of the 26 engineers on staff, seven hold PhDs and six more are PhD candidates on leave from their programs. Over time in this blog I plan to highlight the members of the Aster team who help make nCluster a reality.

One key member is Dr. Mohit Aron. Mohit is an architect, and his focus is on the distributed aspects of the nCluster architecture. His achievements include the delivery of several key projects at Aster, notably in areas related to quality of service, SQL/MR, compression, performance, and fault-tolerance.

Before joining Aster Data Systems, Mohit was a Staff Engineer at Google, where he was one of the lead designers of the award-winning, highly scalable Google File System. He has held senior technical positions in industry where his work focused on scalable cluster-based storage and database technologies. He received his B.Tech. degree from the Indian Institute of Technology, New Delhi, and his M.S. and Ph.D. from Rice University in Houston. His graduate research focused on high-performance networking and cluster-based web server systems. He was one of the primary contributors to the ScalaServer project and won numerous best-paper awards at prestigious conferences.

I am also very glad today to announce that another key member of our organization, Dheeraj Pandey, has been promoted to VP of Engineering. He has been with Aster since September 2007 and has played an instrumental role in building this strong team together with me. He has been my alter ego as we shipped two major releases and four patchsets in the last 19 months. Beyond the tangibles, he has an acute focus on nurturing emotional intelligence within the engineering organization. Too many organizations with strong technical mindsets falter because people begin to underemphasize the value of honest communication, trust, and self-awareness. I am proud that we are building, from very early on, a culture that will endure the test of time as the company grows.

Dheeraj came to Aster from Oracle Corporation, where he managed the storage engine of the database. Under his leadership, Oracle built its unstructured data management stack, Oracle SecureFiles, from the ground up. He also led the development of the Oracle 11g Advanced Compression Option for both structured and unstructured data. Dheeraj has co-invented several patent-pending algorithms in database transaction management, Oracle Real Application Clusters, and data compression. Previously, he built commodity-clustered fileservers at Zambeel. Over the past 10 years of his industry career, he has developed software ranging from midtier Java/COM applications to fileservers, databases, and firmware in storage switches. Dheeraj received an M.S. in Computer Science from The University of Texas at Austin, where he was a doctoral fellow, and a B.Tech. in Computer Science from IIT Kanpur, where he was judged the “Best All-Rounder Student Among All Graduating Students in All Disciplines.”

I am confident that, as an innovation-driven company, we are entrusting one of our most critical functions, Engineering, to very safe hands.

I hope you continue to watch this space for updates on Aster, our products, and our people.

By Shawn Kung in Blogroll, Cloud Computing on April 2, 2009

When Aster announced In-Database MapReduce last summer, we saw tremendous interest and intrigue. Today, Amazon announced that it is helping promote the use of parallel processing frameworks such as Hadoop (an open-source implementation of MapReduce) by making it available on EC2. (Note: Aster announced production customers and availability of MapReduce on both Amazon’s EC2 and AppNexus in February.)

Our vision was, and continues to be, to bring the power of MapReduce to a whole new class of developers and mission-critical enterprise systems. When would you use Aster’s In-Database MapReduce vs. a system like Hadoop? You need to ask a few questions as you think about this:

[1] Can I use my MapReduce system only for batch processing, or can I also do real-time reporting and analysis? Can I have a single system for number-crunching AND needle-in-a-haystack summary or aggregation lookups? Can I get a response to my short queries in seconds, or do I need to wait several minutes?

[2] How do I maximize developer productivity, using SQL for regular data processing and MapReduce for richer analysis?

[3] Do I only want to manage raw data files using file-name conventions, or do I also want to use database primitives like partitions, tables, and views?

[4] How do I easily integrate the MapReduce system with my standard ETL and reporting tools, so I don’t have to reinvent the wheel on dashboards, scorecards, and reports?

[5] When I have such large data in an enterprise system, how do I control access to data and provide appropriate security privileges?

[6] Workload management: When I have invested in a system with hundreds or thousands of processors, how do I efficiently share it among multiple users and guarantee response-time SLAs?

[7] For mission-critical data-intensive applications, how do I do full and incremental backup and disaster recovery?

We conducted an educational webcast on MapReduce recently, together with a Stanford data mining professor, which details some of these differences further.

It’s great to see MapReduce going mainstream and companies such as Amazon supporting the proliferation of innovative approaches to the data explosion problem. Together, we hope to help build mind-share around MapReduce and help companies do more with their data. In fact, we welcome users to put Amazon Elastic MapReduce output into Aster nCluster Cloud Edition for persistence, sharing, reporting, and easy, fast, concurrent access. Lots of Aster customers are using both, and it’s easy to move data since Aster runs on the same Amazon Web Services cloud.

Please contact us if you’d like help getting started with your MapReduce explorations. We conducted a web seminar to introduce you to the concept.