Archive for March, 2009

More In-Database MapReduce Applications
By Steve Wooledge in Analytics, Blogroll, nPath on March 27, 2009

When we announced Aster nCluster’s In-Database MapReduce feature last year, many people were intrigued by the new analytics they would be able to do in their database. However, In-Database MapReduce is new, and discussion of it is often loaded with technical detail on how it’s different from PL/SQL or UDFs, whether it’s suitable for business analysts or developers, and more. What people really want to know is how businesses can take advantage of MapReduce.

I’ve referred to how our customers use In-Database MapReduce (and nPath) for click-stream analytics. In our “MapReduce for Data Warehousing and Analytics” webinar last week, Anand Rajaraman covered several other example applications. Rajaraman is CEO and Founder of Kosmix and Consulting Assistant Professor in the Computer Science Department at Stanford University (full disclosure: Anand is also on the Aster board of directors). After spending some time discussing graph analysis, i.e. finding the shortest path between items, Rajaraman discusses applications in finance, behavioral analytics, text, and statistical analysis that can be easily completed with In-Database MapReduce but are difficult or impossible with SQL alone.

As Rajaraman says, “We need to think beyond conventional relational databases. We need to move on to MapReduce. And the best way of doing that is to combine MapReduce with SQL.”

Aster Data Systems Named “Cool Vendor” by Leading Analyst Firm
By Steve Wooledge in Blogroll on March 24, 2009

A few months ago I wrote that we were referenced by Gartner in the Magic Quadrant for Data Warehouse Database Management Systems report just 6 months after coming out of stealth mode, and I predicted then that 2009 would be a big year for Aster.

Today I am pleased to announce that Aster has been included in the “Cool Vendors” report from Gartner. The report observes that nCluster has “innovative design features.” Our MPP database design, which uses commodity servers in three tiers, and our In-Database MapReduce capabilities, are bringing true innovations to large-scale data warehousing and analytics.

Gartner is a leading analyst firm with its fingers on the pulse of the data warehousing industry. We are truly excited at this recognition.

There is more coming from Aster in these next few months. An early decision that we took in our company is to undertake a “we do it, then we say it” approach to marketing. We’ve been consistently following this principle:

[1] Our public launch in May 2008 was predicated on us reaching a 100TB production deployment at a customer

[2] Our Aster nCluster Cloud Edition launch was predicated on us having customers on two different cloud services to demonstrate our ability to work with multiple service providers as a true cloud offering - not just a porting of software to one service vendor

[3] Our In-Database MapReduce (SQL/MR) launch was predicated on us providing it with tight SQL integration for in-database analytics

… and many more instances.

The advantage of this principle is credibility. We are able to provide real-life customer advantages of our product and speak with the force of experience. We like to think that this credibility wins factual recognition from the leading analysts such as Gartner.

SQL/MapReduce: Faster Answers to Your Toughest Queries
By Peter Pawlowski in Analytics, Blogroll, nPath on March 13, 2009

Our goal at Aster is to build a product that will answer your analytical questions sooner. Sooner doesn’t just mean faster database performance - it means faster answers from the moment you conceive of the question to the moment you get the answer. This means allowing analysts and end-users to easily ask the questions on their mind.

Aster nCluster, our massively-parallel database, has supported SQL from birth. SQL is great in many respects: it allows people of various levels of technical proficiency to ask lots of interesting questions in a relatively straightforward way. SQL’s easy to learn but powerful enough to ask the right questions.

But, we’ve realized that in many situations SQL just doesn’t cut it. If you want to sessionize your web clicks or find interesting user paths, run a custom fraud classifier, or tokenize and stem words across documents, you’re out of luck. Enter SQL/MR, one part of our vision of what a 21st-century database system should look like.

Let’s say your data is in nCluster. If your analytic question can be answered using SQL, you don’t have to worry about writing Java or Python. But, as soon as something more complicated comes up, you can write a SQL/MR function against our simple API, upload it into the cluster, and have it start operating on your data by invoking it from SQL. How is this related to MapReduce? It turns out that these functions are sufficient to express a full MapReduce dataflow. How are SQL/MR functions different than the UDFs of yore? It’s all about scale, usability, reusability; all three contributing to you getting your answer sooner.

SQL/MR functions play in a massively-parallel sandbox, one with terabytes and terabytes of data, so they’re designed to be readily parallelized. Yes, they just accept a table as input and produce a table as output, but they do so in a distributed way at huge scale. They can take as input either rows (think “map”) or well-defined partitions (think “reduce”), which allows nCluster to move data and/or computation around to make sure that the right data is on the right node at the right time. SQL/MR functions are table functions breaking out of the single-node straitjacket. This means you can analyze lots of data fast.
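To make the row-versus-partition distinction concrete, here is a minimal sketch in plain Python. It is not the SQL/MR API: the helper names (`apply_on_rows`, `apply_on_partitions`), the column names, and the sample documents are all assumptions made for illustration, and everything runs in a single process. It only shows how a row function followed by a partition function can express a MapReduce-style word count.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, Iterable, Iterator, List

Row = Dict[str, Any]


def tokenize(row: Row) -> Iterator[Row]:
    """Row function ("map"): sees one input row at a time."""
    for word in str(row["body"]).lower().split():
        yield {"doc_id": row["doc_id"], "word": word}


def count_words(word: str, rows: List[Row]) -> Iterator[Row]:
    """Partition function ("reduce"): sees every row that shares one key."""
    yield {"word": word, "occurrences": len(rows)}


def apply_on_rows(rows: Iterable[Row], fn: Callable[[Row], Iterator[Row]]) -> Iterator[Row]:
    # Hypothetical helper: each row is independent, so a cluster could
    # run this step on whichever node already holds the row.
    for row in rows:
        yield from fn(row)


def apply_on_partitions(
    rows: Iterable[Row], key: str, fn: Callable[[Any, List[Row]], Iterator[Row]]
) -> Iterator[Row]:
    # Hypothetical helper: group rows by key first; a cluster would move
    # each group to a single node before calling the function.
    partitions: Dict[Any, List[Row]] = defaultdict(list)
    for row in rows:
        partitions[row[key]].append(row)
    for value in sorted(partitions):
        yield from fn(value, partitions[value])


documents = [
    {"doc_id": 1, "body": "the quick brown fox"},
    {"doc_id": 2, "body": "the lazy dog"},
]
tokens = apply_on_rows(documents, tokenize)
word_counts = list(apply_on_partitions(tokens, "word", count_words))
```

The only step that needs data movement is the grouping inside `apply_on_partitions`, which is the part the paragraph above describes nCluster handling for you.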

We want to make sure that developers using our SQL/MR framework spend their time thinking about the analytics, not dealing with infrastructure issues. We have a straightforward API (think: you get a stream of rows and give us back a stream of rows) and a debugging interface that lets you monitor execution of your function across our cluster. Want to write and run a function? One command installs the function, and a single SQL statement invokes it. The data you provide the function is defined in SQL, and the output can be sliced and diced with more SQL - no digging into Java if you want to change a projection, provide the function a different slice of data, or add a sort onto the output. All this allows a developer to get a working function - sooner - and an analyst to tweak the question more readily.
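The “stream of rows in, stream of rows out” idea can be sketched with a Python generator. This is an illustration only, not Aster’s API: the function name `sessionize`, the column names, and the 30-minute timeout are assumptions. The function receives one user’s clicks in time order and emits each click with a session number attached.

```python
from typing import Any, Dict, Iterable, Iterator

Row = Dict[str, Any]


def sessionize(clicks: Iterable[Row], timeout_seconds: int = 1800) -> Iterator[Row]:
    """Consume one user's clicks in time order; emit each with a session id."""
    session_id = 0
    previous_ts = None
    for click in clicks:
        ts = click["ts"]
        # A gap longer than the timeout starts a new session.
        if previous_ts is not None and ts - previous_ts > timeout_seconds:
            session_id += 1
        previous_ts = ts
        yield {**click, "session_id": session_id}


# Assumed sample data: timestamps are seconds since an arbitrary start.
clicks = [
    {"user": "u1", "ts": 0, "page": "/home"},
    {"user": "u1", "ts": 60, "page": "/search"},
    {"user": "u1", "ts": 4000, "page": "/home"},
]
sessions = list(sessionize(clicks))
```

In the workflow described above, choosing which clicks reach the function, and filtering or sorting what comes back, would be done in the surrounding SQL rather than by editing the function itself.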

We’ve gone to great lengths to make sure that a SQL/MR function, once written, can be leveraged far and wide. As mentioned before, SQL/MR functions are invoked from SQL, which means that they can be used by users who don’t know anything about Java. They also accept “argument clauses” - custom parameters which integrate nicely with SQL. Our functions are polymorphic, which means their output is dynamically determined by their input. This means that they can be used in a variety of contexts. And, it means that any number of people can write a function which you can easily reuse over your data. A function, once written, can be reused all over the place, allowing users to ask their questions faster (since someone’s probably asked a similar question in the past).
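As a rough illustration of what “polymorphic” means here, the sketch below shows a function whose output columns depend on whatever input it is given, plus one custom parameter standing in for an argument clause. The names (`add_row_number`, `output_columns`, `new_column`) and the sample tables are hypothetical, and this is plain Python rather than the SQL/MR interface.

```python
from typing import Any, Dict, Iterable, Iterator, List

Row = Dict[str, Any]


def output_columns(input_columns: List[str], new_column: str) -> List[str]:
    """Derive the output schema from whatever input schema is supplied."""
    return input_columns + [new_column]


def add_row_number(rows: Iterable[Row], new_column: str = "row_number") -> Iterator[Row]:
    """Works on any input table; `new_column` plays the role of an argument clause."""
    for number, row in enumerate(rows, start=1):
        yield {**row, new_column: number}


# Two unrelated (assumed) tables reuse the same function unchanged.
clicks = [{"user": "u1", "page": "/home"}, {"user": "u2", "page": "/search"}]
orders = [{"order_id": 7, "total": 19.99}]

numbered_clicks = list(add_row_number(clicks))
numbered_orders = list(add_row_number(orders, new_column="seq"))
```

Because the function never names its input columns, it can be written once and applied to tables its author never saw, which is the reuse argument made above.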

In fact, we’ve leveraged the SQL/MR framework to build a function that ships with nCluster: nPath. But this is just the first step, and the sky’s the limit. SQL/MR could enable functions for market basket analysis, k-means clustering, support vector machines, and natural language processing, among others.

How soon will your questions be answered? I’d love to hear of any ideas you have for analytic functions you’re struggling to write in SQL which you think could be a good fit for SQL/MapReduce.

By Chris Neumann in Blogroll on March 4, 2009

MySpace decided to support one of its most important product launches of 2008 with an expansion of its Aster data warehouse. The data that would be collected would be used to provide information on trends in media and current interests on MySpace. The go-live date was October 2008.

MySpace Discusses Their Use of Aster nCluster

MySpace planned for the data warehouse right from the inception of the project to ensure that reporting was considered a first-class citizen in the overall launch process, rather than a post-launch activity. The result was that the data warehouse was up and running to receive the usage streams, even during a private beta release period, giving the warehousing team the necessary time to prepare for the onslaught of data that would result after the public release.

The launch was precisely on-time, and this video talks about the experience of MySpace in rolling out Aster nCluster on a broader scale after their initial deployment of Aster earlier in 2008. The combined Aster deployment now has 200+ commodity hardware servers working together to manage 200+ TB of data that is growing at 2-3TB per day by collecting 7-10B events that happen on one of the world’s largest social networks every day!

In fact, there is a very interesting incident that happened on the day of the new MySpace product launch. At about 7am, one of the servers in the Aster nCluster data warehouse failed. The failure was detected by our support team - and no scrambling ensued. Aster nCluster detected and isolated the failure, continuing to run the service with n-1 nodes without a blip and with minimal performance change! Later, after the initial tense moments were behind us, the MySpace operations team walked over and replaced the failed hardware. The Aster database administrator then pressed a single button to bring the node back into the nCluster data warehouse - the database continued to hum away with zero downtime.

The power of “Always-On”!

We will be co-hosting a case study by MySpace on their use of Aster at the Gartner BI Summit next week in National Harbor, MD on March 11.  If you’ll be at the event, please come by to hear what Hala has to say about their use of Aster to support their mission-critical operations at MySpace across multiple functions and departments.

Their Aster enterprise data warehouse supports frontline applications (e.g., MySpace TV, MySpace Video, etc.), as well as online marketing, sales, IT, finance, international, and legal.  MySpace is also planning  to incorporate data from Aster into a balanced scorecard for strategic alignment of the business around key performance indicators, as well as other future projects.

Some highlights from the video for folks who would rather read:

MySpace got up and running with Aster quickly

We were able to bring that up online and actually start processing the data into it within a matter of weeks, and I think very few technologies give you the ability to do something like that.
-Hala Al-Adwan, VP of Data Services at MySpace

Aster is mission-critical to MySpace
With Aster, what we’ve been able to produce with commodity hardware has been a supercomputer-like infrastructure …the data that we collect and process is absolutely critical to the success of MySpace.
-Bita Mathews, Data Warehouse Manager, MySpace

Right now our key business performance metrics are all powered out of the Aster system.  If somebody went and shut it down, none of that would be available.  I think in a lot of ways, we were lacking that data before, and now that we’re used to having it, people are just hungry for more and more information.  So if all that went away, I think it’s kinda like going back to an age where there was no light.
-Hala Al-Adwan

MySpace’s data warehouse with Aster is extremely reliable
Aster is always on and available.  And this is very amazing thing about Aster, because it’s massive.  There’s a lot of hardware underneath the system.  When hardware fails, we can continue working, and although we know some engineers are fixing hardware, but that doesn’t stop us from continuing to run queries and producing our reports.
-Anna Dorofiyenko, Data Architect, MySpace

Aster is the blueprint for successful data warehouse deployments going forward
Integrating Aster and including them from the very beginning in the MySpace Music project from beginning to end is what allowed that to be the most successful data warehouse implementation we’ve had to date, and I think we should definitely use it as a blueprint for any future implementations we do.
-Christa Stelzmuller, Chief Data Architect, MySpace

The Value of Flexibility
By Chris Neumann in Blogroll, Cloud Computing on March 3, 2009

As the Director of Technology Delivery for Aster Data Systems, I oversee the teams responsible for delivering and deploying our nCluster analytic database to customers and enabling prospective customers to evaluate our solutions effectively and efficiently. Recently, Shawn posted on the release of Aster nCluster Cloud Edition and discussed how cloud computing enables businesses to scale their infrastructure without huge hardware investments. As a follow-on, I’d like to let you know about how the flexibility provided by nCluster’s support of multiple platforms can reduce the time and costs associated with evaluating nCluster.

Evaluating enterprise software can be a costly effort in both time and money.  The process typically requires weeks of prep work by the evaluation team, possibly including purchasing different hardware for each vendor being evaluated.  Spending significant amounts of money and losing weeks of resource productivity to an evaluation is something few companies can afford to do, particularly in these uncertain times.

With our recent public release of Aster nCluster Cloud Edition, we now provide the most platform options of any major data warehouse vendor.  While it’s natural to focus on the flexibility this affords for production systems, it also allows us to be very flexible for enabling customers to try our solution:

Commodity Hardware Evaluation
Several warehouse vendors claim to support commodity hardware, but most are very closely tied to one “preferred” vendor.  Aster nCluster supports any x86-based hardware, meaning that you can evaluate us on either new hardware (if performance is a key aspect of the evaluation) or older hardware that is being repurposed (if you want to test the functionality of nCluster without buying new hardware).

Aster-Hosted Evaluation
Our data center in San Carlos, CA has racks of servers dedicated to customer evaluations.  With an Aster-hosted system, functional evaluations of nCluster can be performed with minimum infrastructure requirements.

Cloud Evaluation
With Aster nCluster Cloud Edition, custom-configured nClusters can be brought up in minutes on either Amazon EC2 or AppNexus.  POCs can be performed on one or multiple systems in parallel, with zero infrastructure requirements.  Your teams can evaluate all of nCluster’s functionality in the cloud, with complete control over sizing and scaling. (While other vendors have announced cloud offerings, we’re the only data warehouse vendor to have production customers on two separate cloud services).

Whether you’re building a new frontline data warehouse or looking to replace an existing system that doesn’t scale or costs too much, you should check us out.  We have a great product that’s turning heads as an alternative to overpriced hardware appliances for multi-TB data warehouses.  With all the flexibility our offerings provide, you can evaluate all the power of Aster nCluster without the costs of traditional POCs.

Give us a try and see everything you can do with Aster nCluster!