Archive for the ‘Blogroll’ Category

By Mayank Bawa in Blogroll, TCO on August 3, 2009

Netezza pre-announced last week that it will be moving to a new architecture - one based around IBM blades (Linux + Intel + RAM) with commodity SAS disks, RAID controllers, and NICs. The product will continue to rely on an FPGA, but it will sit much farther from the disks and RAID controller, beyond the RAM and adjacent to the Intel CPU, in contrast to their previous product line.

In assembling a new hardware stack, Netezza frames this re-architecture as a change that is not really a change - the FPGA will continue to offload data compression/decompression, selection, and projection from the Intel CPU; the Intel CPU will take on pushed-down joins and group-bys; and the RAM will enable caching (thus helping improve mixed-workload performance).

I think this is a pretty significant change for Netezza.

Clearly, Netezza would not have invested in this change - assembling and shipping a new hardware stack, and sharing revenue with IBM rather than a third-party hardware assembler - if Netezza's old FPGA-dominant hardware were not being out-priced and out-performed by our Intel-based commodity hardware.

It was a matter of time before the market realized that FPGAs had reached end-of-life status in the data warehousing market. Seeing the writing on the wall, and responding to it early, Netezza has made a bold decision to change - and yet clung to the warm familiarity of an FPGA as a "side car".

Netezza, and the rest of the market, will soon become aware that a change in hardware stack is not a free lunch. The richness of CPU and RAM resources in an IBM commodity blade comes at a cost that a resource-starved FPGA-based architecture never had to account for.

In 2009, after nine years of engineering its software for an FPGA, Netezza will need to come to terms with commodity hardware in production systems and demonstrate that it can:

- Manage processes and memory spawned by a single query across 100s of blade servers

- Maintain consistent caches across 100s of blade servers - after all, it is Oracle’s Cache Fusion technology that is the bane of scaling Oracle RAC beyond 8 blade servers

- Tolerate the higher frequency of failures that a commodity Linux + RAID controller/driver + network driver stack incurs when put under rigorous data movement (e.g., allocation/de-allocation of memory contributing to memory leaks)

- Add a new IBM blade and ensure incremental scaling of their appliance

- Upgrade the software stack in place - unlike an FPGA-based hardware stack, which customers are willing to floor-sweep during an upgrade

- Contain run-away queries from allocating the abundant CPU and RAM resources and starving other concurrent queries in the workload

- Reduce network traffic for a blade with 2 NICs that is managing 8 disks vs. a Power-PC/FPGA that had 1 NIC for 1 disk

- …

If you take a quick pulse of the market, apart from our known installations of 100+ servers, no other vendor - mature or new-age - has demonstrated that 100s of commodity servers can be made to work together to run a single database.

And I believe that there is a fundamental reason for this lack of proof-points even a decade after Linux matured and commodity servers came into widespread use for computing - software not built from the ground up to leverage the richness, and contain the limitations, of commodity hardware is incapable of scaling. Aster nCluster has been built from the ground up to have these capabilities on a commodity stack. Netezza's software, written for proprietary hardware, cannot simply be retrofitted to work on commodity hardware (else Netezza would have taken the FPGAs out completely, now that they have powerful CPUs!). Netezza has its work cut out - it has made a dramatic shift that has the potential to bring the company and its production customers to their knees. And therein lies Netezza's challenge - it must succeed in supporting its current customers on an FPGA-based platform while moving resources to build out a commodity-based platform.

And we have not even touched upon the extension of SQL with MapReduce to power big data manipulation using arbitrary user-written procedures.

If a system is not fundamentally designed to leverage commodity servers, it's only going to be a band-aid on seams that are bursting. Overall, we will watch with curiosity how long it takes Netezza to eliminate its FPGAs completely and move to a true commodity stack, so that customers have the freedom to choose their own hardware rather than being locked into Netezza-supplied custom hardware.

By Shawn Kung in Blogroll, Frontline data warehouse on July 14, 2009

When you hear the word "warehouse," you normally think of an oversized building with high ceilings and a ton of storage space. In the data warehousing world, it's all too easy to fill that space faster than expected. Even companies with predictable data growth trajectories don't want to pay for storage space they won't need for months or even years out. For either type of company, the ability to scale on-demand, and to the appropriate degree, is critical.

That’s why I’m so excited about a webinar we are hosting next week with James Kobielus, Senior Analyst for Forrester Research. In case you haven’t read it, James recently released his report, "Massive But Agile: Best Practices for Scaling the Next-Generation Data Warehouse." In the report, James thoroughly addresses several issues around scalability for which Aster is well-suited (parallelism, optimized storage, in-database analytics, etc.).

We’ll get into much more detail on these and other issues over the course of the webinar. If you haven’t had a chance yet, please register for the webinar to hear what James, a leader and visionary in the industry, has to say. And make sure to leave a comment below if there are any facets of data warehouse scalability that you would like us to cover.

By Mayank Bawa in Blogroll on June 29, 2009

We are announcing the availability of an Enterprise-Ready MapReduce Data Warehouse Appliance.

The appliance is powered by Dell hardware and Aster’s nCluster SQL/MR database, with optional BI platform software from MicroStrategy and data modeling software from AquaFold (Aqua Data Studio).

Our product portfolio now allows our customers to get the benefits of our flagship Aster nCluster SQL/MR database in the packaging that they are most comfortable with - on-premise software, in-cloud service, or pre-packaged appliance.

The appliance offering packs a lot of punch compared to other data warehousing appliances in the market - it has the highest ratio of compute & memory to data sizes, allowing you to run rich queries on the appliance without breaking a sweat.

We are especially proud of the open nature of our appliance - the hardware is from Dell, built from industry-standard components; the BI server is from MicroStrategy; and the data modeling tool is from AquaFold (Aqua Data Studio). The appliance brings together industry-leading components of a full data warehouse stack - all pre-tested and configured for optimal performance.

Even the programming of our appliance is open - our SQL/MR framework allows applications to push computation into the appliance using industry standard SQL augmented with MapReduce in the language of your choice (Java, C#, Perl, Python, etc.).

We have been approached by a number of customers seeking a get-started-quickly system, especially those groups of users and departments seeking a Hadoop framework to build their solutions upon.

In response to these requests, we are proud to announce an Express Edition of the appliance, designed to work for up to 1TB of user data. And it comes at an even more attractive price - just $50K - complete with hardware and software!

Give us a call - we’ll get your warehouse set up on our appliance to ensure that time-to-first-query is measured in hours, not months!

By Peter Pawlowski in Blogroll on June 9, 2009

The Aster SQL/MapReduce framework allows developers to push analytics code for applications closer to the data in the database, without dealing with the headaches of extracting and analyzing data outside of the database. We’ve supported a variety of languages from day one, including Java, Python, and Perl. Today we’re pleased to announce official support for the .NET family of languages via Mono, an excellent open source .NET implementation. This will allow developers who use .NET languages like C# and VB (and, of course, F#) to more easily leverage nCluster for massively parallel analytics.

Our .NET support is enabled through our Stream SQL/MR function, which allows users to process data via a simple streaming interface: provide a program that reads rows from standard input (stdin) and writes rows back to standard output (stdout). Let’s consider a simple C# program called Tokenize, which splits incoming rows into tokens and then outputs each token (one per line):
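A minimal sketch of what Tokenize.cs might look like, assuming the one-row-per-line stdin/stdout contract described above (the exact row delimiting used by the Stream function is an assumption here):

```csharp
using System;

class Tokenize
{
    static void Main()
    {
        string line;
        // Read each input row from stdin until the stream ends.
        while ((line = Console.ReadLine()) != null)
        {
            // Split the row on whitespace and emit one token per line.
            foreach (string token in line.Split(
                new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries))
            {
                Console.WriteLine(token);
            }
        }
    }
}
```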

To run this program over data stored in nCluster, a developer just needs to compile Tokenize.cs into Tokenize.exe with a C# compiler (in our case, the Mono C# compiler gmcs). With the compiled executable in hand, one command in our terminal client installs it in nCluster. The program can then be invoked from SQL. The example below runs the program over all the rows in the documents table, outputting a table with a single column (token). Each row in the result of the query corresponds to a single token in the input documents.
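An invocation would look something like the following (a sketch only - the Stream function's exact argument names and syntax here are assumptions, not verbatim nCluster SQL):

```sql
-- Hypothetical Stream invocation: run the installed Tokenize.exe
-- over every row of the documents table, producing one token per row.
SELECT token
FROM Stream(ON documents
            SCRIPT('mono Tokenize.exe')
            OUTPUTS('token varchar'));
```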

It’s as simple as that. We hope our new .NET support will enable an ever-broader group of developers to take advantage of SQL/MR, our in-database analytics technology! If you’re interested in learning more, please check out a host of new resources about our implementation of MapReduce within Aster nCluster, including example applications and code.

Goodbye, Rajeev
By Mayank Bawa in Blogroll on June 5, 2009

 Rajeev was a close friend and a cherished mentor. We were saddened to hear the news today and we will miss him dearly. Our thoughts are with his family.

By Peter Pawlowski in Blogroll, nPath on April 22, 2009

Aster’s SQL/MR framework (In-Database MapReduce) enables our users to write custom analytic functions (SQL/MR functions) in a programming language like Java or Python, install them in the cluster, and then invoke them from SQL to analyze data stored in nCluster database tables. These SQL/MR functions transform one table into another, but do so in a massively parallel way. As increasingly valuable analytic functions are pushed into the database, the value of constructing a data structure once, and reusing it across a large number of rows, increases substantially. Our API was designed with this in mind.

What does the SQL/MR API look like? The SQL/MR function is given an iterator over a set of input rows, as well as an emitter for outputting rows. We decided on this interface for a number of reasons, with one of the most important being the ability to maintain state between rows. We’ve found that many useful analytic functions need to construct some state before processing a row of input, and this state construction should be amortized over as many rows as possible.

Here’s a wireframe of one type of SQL/MR function (a RowFunction):

class RealAsterFunction implements RowFunction {
  public void operateOnSomeRows(RowIterator iterator, RowEmitter outputEmitter) {
    // Construct some data structure once to enable fast processing.
    // Then, for each row from the iterator, process it and emit a result.
  }
}
When this SQL/MR function is invoked in nCluster, the system starts several copies of this function on each node (think: one per CPU core). Each function is given an iterator to the rows that live in its local slice of the data. An alternative design, which is akin to the standard scalar UDF, would have been:

class NotRealAsterFunction implements PossibleRowFunction {
  static void operateOnRow(Row currentRow, RowEmitter outputEmitter) {
    // Process the given row and emit a result.
  }
}

In this design, the static operateOnRow method would be called for each row in the function’s input. State can no longer be easily stored between rows. For simple functions, like computing the absolute value or a substring of a particular column, there’s no need for such inter-row state. But, as we’ve implemented more interesting analytic functions, we’ve found that enabling the storage of such state, or more specifically paying only once for the construction of something complex and then reusing it, has real value. Without the ability to save state between rows, the construction of this state would dominate the function’s execution.

Examples abound. Consider a SQL/MR function which applies a complex model to score the data in the database, whether it’s scoring a customer for insurance risk, scoring an internet user for an ad’s effectiveness, or scoring a snippet of text for its sentiment. These functions often construct a data structure in memory to accelerate scoring, which works very well with the SQL/MR API: build the data structure once and reuse it across a large number of rows.

A sentiment analysis SQL/MR function, designed to classify a set of notes written up by customer service reps or a set of comments posted on a blog, would likely first build a hash table mapping words to sentiment scores, based on some dictionary file. The function would then iterate through each snippet of text, converting each word to its stem and doing a fast lookup in the hash table. Such a persistent data structure accelerates the sentiment scoring of each text snippet.
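The core of that pattern can be sketched as follows - a toy illustration in Python, not Aster's actual API or dictionary (the `SENTIMENT` table and the trivial stemmer are made up for the example); the point is that the hash table is built once and reused for every row:

```python
# Toy sentiment dictionary; a real function would load this from a file.
SENTIMENT = {"great": 2, "good": 1, "bad": -1, "awful": -2}

def stem(word):
    # Trivial stand-in for a real stemmer: lowercase, strip punctuation,
    # and drop a trailing "s".
    w = word.lower().strip(".,!?")
    return w[:-1] if w.endswith("s") else w

def score(snippet, dictionary):
    # Fast per-row work: stem each token and look it up in the hash table.
    return sum(dictionary.get(stem(tok), 0) for tok in snippet.split())

def score_all(snippets):
    # The dictionary is constructed once, then amortized across all rows.
    return [score(s, SENTIMENT) for s in snippets]
```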

Another example is Aster’s nPath SQL/MR function. At a high level, this function looks for patterns in ordered data, with the pattern specified as a regular expression. When nPath runs, it converts the pattern into a data structure optimized for fast, constant-memory pattern matching. If state couldn’t be maintained between rows, there’d be a large price to pay in reconstructing this data structure for each new row.
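The same economics show up in any regex engine - this is an analogy in Python, not nPath's implementation: compile the pattern into its matching structure once, then apply it cheaply to many ordered sequences (here, sessions flattened into space-joined event strings):

```python
import re

# Compile once: a pattern over page-view events, e.g. "a visit that goes
# home -> one or more product pages -> checkout".
PATTERN = re.compile(r"home( product)+ checkout")

def matches(session):
    # Cheap per-row work: run the precompiled matcher over one session.
    return PATTERN.search(session) is not None

sessions = [
    "home product product checkout",
    "home checkout",
]
results = [matches(s) for s in sessions]
```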

Repeating the high bit: as increasingly valuable analytic functions are pushed into the database, the value of constructing a data structure once, and reusing it across a large number of rows, increases substantially. The SQL/MR API was designed with this in mind.

By Steve Wooledge in Blogroll on April 7, 2009

I’m delighted to welcome Specific Media to the quickly-growing family of Aster customers! I had the pleasure of briefly meeting the folks from Specific Media in our offices last week. Similar to Aster, Specific Media is incredibly focused on doing more with data to increase the value they provide to their customers: advertisers which represent 300 of the top Fortune 500 brands.

They’re also really smart and humble about what they do, which makes it a pleasure to work with them. And what you wouldn’t know from a brief introduction is how cutting-edge their analytic methodologies and capabilities are.  We’re just starting our partnership together and hope to have some success metrics to share later about how they are using the Aster nCluster database for their data warehouse. They have some interesting ideas for using the Aster In-Database MapReduce framework to perform rich analysis of data efficiently for improved ad targeting and relevancy.

By Tasso Argyros in Blogroll on April 6, 2009

When Mayank, George and I were at Stanford, one of the things that brought us together was a shared vision of how the world could benefit from a more scalable database to address exploding volumes of data. This led to the birth of Aster Data Systems and our flagship product, Aster nCluster, a highly scalable relational database system for what we call “frontline” data warehousing – the intersection of large data volumes, rich analytics, and mission-critical availability.

One way we found to solve the problem of managing and analyzing so much data was by implementing In-Database MapReduce. MapReduce is a programming model that was popularized at Google in 2003 to process large unstructured data sets distributed across thousands of nodes, and at Stanford we worked with some of the professors who had worked with the Google founders. In-Database MapReduce enables enterprises to harness the power of MapReduce while managing their data in Aster nCluster. Alongside its massively parallel execution environment for standard SQL queries, Aster nCluster adds the ability to run flexible MapReduce functions for parallel data analysis and transformation inside the database.

Much of the work of the Aster Data team is “fusing” best practices from the relational database world with innovations that Google pioneered for distributed computing. This takes strong engineering, so it’s no wonder that we are an engineering-driven company with some of the best minds available on our team. Of our 26 engineers on staff, seven hold PhDs and six more are on leave from PhD programs. Over time on this blog I plan to highlight the members of the Aster team who help make nCluster a reality.

One key member is Dr. Mohit Aron. Mohit is an architect, and his focus is on the distributed aspects of the nCluster architecture. His achievements include the delivery of several key projects at Aster, notably in areas related to quality of service, SQL/MR, compression, performance, and fault-tolerance.

Before joining Aster Data Systems, Mohit was a Staff Engineer at Google, where he was one of the lead designers of the super-scalable, award-winning Google File System. Dr. Aron has held senior technical positions in industry, where his work has focused on scalable cluster-based storage and database technologies. He received his B.Tech. degree from the Indian Institute of Technology, New Delhi, and his M.S. and Ph.D. from Rice University, Houston. His graduate research focused on high-performance networking and cluster-based web server systems. He was one of the primary contributors to the ScalaServer project and won numerous best-paper awards at prestigious conferences.

I am also very glad today to announce that another key member of our organization, Dheeraj Pandey, has been promoted to VP of Engineering. He has been with Aster since September 2007. Dheeraj has played an instrumental role in building this strong team together with me; he has been my alter ego as we shipped two major releases and four patchsets in the last 19 months. Beyond the tangibles, he has an acute focus on nurturing emotional intelligence within the engineering organization. Too many organizations with strong technical mindsets falter because people begin to underemphasize the value of honest communication, trust, and self-awareness. I am proud that we are building a culture, from very early on, that will endure the test of time as the company grows.

Dheeraj came to Aster from Oracle Corporation, where he managed the storage engine of the database. Under his leadership, Oracle built the unstructured data management stack, called Oracle SecureFiles, from the ground up. He also led the development of the Oracle 11g Advanced Compression Option for both structured and unstructured data. Dheeraj has co-invented several patent-pending algorithms in database transaction management, Oracle Real Application Clusters, and data compression. Previously, he built commodity-clustered fileservers at Zambeel. In the past 10 years of his industry career, he has developed diverse software, ranging from midtier Java/COM applications to fileservers, databases, and firmware in storage switches. Dheeraj received an M.S. in Computer Science from The University of Texas at Austin, where he was a doctoral fellow. He received a B.Tech. in Computer Science from IIT Kanpur, where he was judged the “Best All-Rounder Student Among All Graduating Students in All Disciplines.”

I am confident that, as an innovation-driven company, we are placing one of our most critical functions, Engineering, in very safe hands.

I hope you continue to watch this space for updates on Aster, our products, and our people.

Enterprise-Class MapReduce
By Shawn Kung in Blogroll, Cloud Computing on April 2, 2009

When Aster announced In-Database MapReduce last summer, we saw tremendous interest and intrigue. Today, Amazon announced that it is helping promote the use of parallel processing frameworks such as Hadoop (an open-source implementation of MapReduce) by making it available on EC2. (Note: Aster announced production customers and availability of MapReduce on both Amazon’s EC2 and AppNexus in February.)

Our vision was, and continues to be, to bring the power of MapReduce to a whole new class of developers and mission-critical enterprise systems. When would you use Aster’s In-Database MapReduce vs. a system like Hadoop? You need to ask a few questions as you think about this:

[1] Can I use my MapReduce system only for batch processing, or can I also do real-time reporting and analysis? Can I have a single system for number-crunching AND needle-in-a-haystack summary or aggregation lookups? Can I get responses to my short queries in seconds, or do I need to wait several minutes?

[2] How do I maximize developer productivity, using SQL for regular data processing and MapReduce for richer analysis?

[3] Do you only want to manage raw data files using file name conventions, or do you also want to use database primitives like partitions, tables, and views?

[4] How do I easily integrate the MapReduce system with my standard ETL and reporting tool, so I don’t have to reinvent the wheel on dashboards, scorecards, and reports?

[5] When I have such large data in an enterprise system, how do I control access to data and provide appropriate security privileges?

[6] Workload management: When I have invested in a system with hundreds or thousands of processors, how do I efficiently share it among multiple users and guarantee response-time SLAs?

[7] For mission-critical data-intensive applications, how do I do full and incremental backup and disaster recovery?

We conducted an educational webcast on MapReduce recently, together with a Stanford data mining professor, which details some of these differences further.

It’s great to see MapReduce going mainstream and companies such as Amazon supporting the proliferation of innovative approaches to the data explosion problem. Together, we hope to help build mind-share around MapReduce and help companies do more with their data. In fact, we welcome users to put Amazon Elastic MapReduce output into Aster nCluster Cloud Edition for persistence, sharing, reporting, and easy, fast, concurrent access. Many Aster customers are using both, and it’s easy to move data since Aster runs on the same Amazon Web Services cloud.

Please contact us if you’d like help getting started with your MapReduce explorations. We conducted a web seminar to introduce you to the concept.

More In-Database MapReduce Applications
By Steve Wooledge in Analytics, Blogroll, nPath on March 27, 2009

When we announced Aster nCluster’s In-Database MapReduce feature last year, many people were intrigued by the new analytics they would be able to do in their database. However, In-Database MapReduce is new, and discussion of it is often loaded with technical detail on how it differs from PL/SQL or UDFs, whether it’s suitable for business analysts or developers, and more. What people really want to know is how businesses can take advantage of MapReduce.

I’ve referred before to how our customers use In-Database MapReduce (and nPath) for click-stream analytics. In our “MapReduce for Data Warehousing and Analytics” webinar last week, Anand Rajaraman covered several other example applications. Rajaraman is CEO and founder of Kosmix and Consulting Assistant Professor in the Computer Science Department at Stanford University (full disclosure: Anand is also on the Aster board of directors). After spending some time discussing graph analysis, i.e., finding the shortest path between items, Rajaraman discussed applications in finance, behavioral analytics, text, and statistical analysis that can be easily completed with In-Database MapReduce but are difficult or impossible with SQL alone.

As Rajaraman says, “We need to think beyond conventional relational databases. We need to move on to MapReduce. And the best way of doing that is to combine MapReduce with SQL.”