Archive for the ‘Blogroll’ Category

By Steve Wooledge in Blogroll on March 24, 2009

A few months ago I wrote that Gartner had referenced us in its Magic Quadrant for Data Warehouse Database Management Systems report just six months after we came out of stealth mode, and I predicted then that 2009 would be a big year for Aster.

Today I am pleased to announce that Aster has been included in the “Cool Vendors” report from Gartner. The report observes that nCluster has “innovative design features.” Our MPP database design, which uses commodity servers in three tiers, and our In-Database MapReduce capabilities bring true innovation to large-scale data warehousing and analytics.

Gartner is a leading analyst firm with its finger on the pulse of the data warehousing industry. We are truly excited about this recognition.

There is more coming from Aster in the next few months. An early decision we made as a company was to take a “we do it, then we say it” approach to marketing. We’ve been consistently following this principle:

[1] Our public launch in May 2008 was predicated on reaching a 100TB production deployment at a customer

[2] Our Aster nCluster Cloud Edition launch was predicated on having customers on two different cloud services, demonstrating our ability to work with multiple service providers as a true cloud offering - not just a port of our software to one service vendor

[3] Our In-Database MapReduce (SQL/MR) launch was predicated on providing it with tight SQL integration for in-database analytics

… and many more instances.

The advantage of this principle is credibility. We can speak to the real-life customer benefits of our product with the force of experience. We like to think that this credibility is what wins recognition from leading analysts such as Gartner.



SQL/MapReduce: Faster Answers to Your Toughest Queries
By Peter Pawlowski in Analytics, Blogroll, nPath on March 13, 2009

Our goal at Aster is to build a product that answers your analytical questions sooner. Sooner doesn’t just mean faster database performance - it means faster answers from the moment you conceive of the question to the moment you get the answer. This means allowing analysts and end-users to easily ask the questions on their minds.

Aster nCluster, our massively-parallel database, has supported SQL from birth. SQL is great in many respects: it allows people of various levels of technical proficiency to ask lots of interesting questions in a relatively straightforward way. SQL’s easy to learn but powerful enough to ask the right questions.

But we’ve realized that in many situations SQL just doesn’t cut it. If you want to sessionize your web clicks, find interesting user paths, run a custom fraud classifier, or tokenize and stem words across documents, you’re out of luck. Enter SQL/MR, one part of our vision of what a 21st-century database system should look like.

Let’s say your data is in nCluster. If your analytic question can be answered using SQL, you don’t have to worry about writing Java or Python. But as soon as something more complicated comes up, you can write a SQL/MR function against our simple API, upload it into the cluster, and have it start operating on your data by invoking it from SQL. How is this related to MapReduce? It turns out that these functions are sufficient to express a full MapReduce dataflow. How are SQL/MR functions different from the UDFs of yore? It’s all about scalability, usability, and reusability - all three contributing to you getting your answer sooner.
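
To make that workflow concrete, here is a minimal sketch. The tokenize function, its jar file, and the documents table are hypothetical, and the upload step is shown schematically as a client command:

-- Upload a hypothetical SQL/MR function, packaged as a jar, into the cluster
\install tokenize.jar

-- Invoke it from plain SQL: the function streams over the documents table
-- and emits one row per extracted token
SELECT token, count(*) AS frequency
FROM tokenize( ON documents )
GROUP BY token
ORDER BY frequency DESC;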

Scalability
SQL/MR functions play in a massively-parallel sandbox, one with terabytes and terabytes of data, so they’re designed to be readily parallelized. Yes, they just accept a table as input and produce a table as output, but they do so in a distributed way at huge scale. They can take as input either rows (think “map”) or well-defined partitions (think “reduce”), which allows nCluster to move data and/or computation around to make sure that the right data is on the right node at the right time. SQL/MR functions are table functions breaking out of the single-node straitjacket. This means you can analyze lots of data fast.
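
The row-versus-partition distinction shows up directly in the invocation. In this sketch, the function and table names are hypothetical:

-- Row-wise ("map"-style): each input row can be processed independently,
-- so nCluster is free to run the function wherever the rows live
SELECT * FROM extract_features( ON clicks );

-- Partition-wise ("reduce"-style): all rows for a given user are grouped
-- and ordered, then presented to the function together on one worker
SELECT * FROM score_sessions(
  ON clicks
  PARTITION BY userid
  ORDER BY click_time
);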

Usability
We want to make sure that developers using our SQL/MR framework spend their time thinking about the analytics, not dealing with infrastructure issues. We have a straightforward API (think: you get a stream of rows and give us back a stream of rows) and a debugging interface that lets you monitor execution of your function across our cluster. Want to write and run a function? One command installs the function, and a single SQL statement invokes it. The data you provide the function is defined in SQL, and the output can be sliced and diced with more SQL - no digging into Java if you want to change a projection, provide the function a different slice of data, or add a sort onto the output. All this allows a developer to get a working function sooner - and an analyst to tweak the question more readily.
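
As an illustration (the function, table, and column names are hypothetical), all of the reshaping below happens in SQL while the Java stays untouched:

SELECT userid, session_id, count(*) AS clicks_in_session  -- change the projection in SQL
FROM my_function(
  ON ( SELECT * FROM clicks
       WHERE click_date >= date '2009-01-01' )  -- give the function a different slice of data
  PARTITION BY userid
)
GROUP BY userid, session_id
ORDER BY clicks_in_session DESC;  -- add a sort onto the output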

Reusability
We’ve gone to great lengths to make sure that a SQL/MR function, once written, can be leveraged far and wide. As mentioned before, SQL/MR functions are invoked from SQL, which means they can be used by people who don’t know anything about Java. They also accept “argument clauses” - custom parameters that integrate nicely with SQL. Our functions are polymorphic, meaning their output schema is determined dynamically by their input, so the same function can be used in a variety of contexts. And any number of people can write functions that you can easily reuse over your data. A function, once written, can be reused all over the place, letting users get answers faster (since someone has probably asked a similar question in the past).
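
Here is a sketch of both ideas at once; the tokenize function and its DELIMITER and STEMMING argument clauses are hypothetical:

-- Argument clauses pass custom parameters to the function in a SQL-native way
SELECT token FROM tokenize(
  ON documents
  DELIMITER(' ,.;')
  STEMMING('true')
);

-- Polymorphism: the same function runs unchanged over a different input schema
SELECT token FROM tokenize(
  ON ( SELECT subject FROM emails )
  DELIMITER(' ')
);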

In fact, we’ve leveraged the SQL/MR framework to build a function that ships with nCluster: nPath. But this is just the first step, and the sky’s the limit. SQL/MR could enable functions for market basket analysis, k-means clustering, support vector machines, and natural language processing, among others.

How soon will your questions be answered? I’d love to hear any ideas you have for analytic functions that you’re struggling to write in SQL and that you think could be a good fit for SQL/MapReduce.



By Chris Neumann in Blogroll on March 4, 2009

MySpace decided to support one of its most important product launches of 2008 with an expansion of its Aster data warehouse. The data collected would be used to report on trends in media and current interests on MySpace. The go-live date was October 2008.

MySpace Discusses Their Use of Aster nCluster

MySpace planned for the data warehouse right from the inception of the project to ensure that reporting was treated as a first-class citizen in the overall launch process rather than as a post-launch activity. As a result, the data warehouse was up and running to receive the usage streams even during the private beta period, giving the warehousing team the time it needed to prepare for the onslaught of data that would follow the public release.

The launch was precisely on time, and this video discusses MySpace’s experience rolling out Aster nCluster on a broader scale after their initial deployment earlier in 2008. The combined Aster deployment now has 200+ commodity servers working together to manage 200+ TB of data, growing at 2-3TB per day as it collects the 7-10B events that occur every day on one of the world’s largest social networks!

In fact, a very interesting incident happened on the day of the new MySpace product launch. At about 7am, one of the servers in the Aster nCluster data warehouse failed. Our support team detected the failure - and no scrambling ensued. Aster nCluster detected and isolated the failure, continuing to run the service on n-1 nodes without a blip and with minimal performance change! Later, after the initial tense moments were behind us, the MySpace operations team walked over and replaced the failed hardware. The Aster database administrator then pressed a single button to re-include the node in the nCluster data warehouse - the database continued to hum away with zero downtime.

The power of “Always-On”!

We will be co-hosting a case study by MySpace on their use of Aster at the Gartner BI Summit next week in National Harbor, MD on March 11.  If you’ll be at the event, please come by to hear what Hala Al-Adwan has to say about their use of Aster to support mission-critical operations across multiple functions and departments at MySpace.

Their Aster enterprise data warehouse supports frontline applications (e.g., MySpace TV, MySpace Video, etc.), as well as online marketing, sales, IT, finance, international, and legal.  MySpace is also planning to incorporate data from Aster into a balanced scorecard for strategic alignment of the business around key performance indicators, as well as other future projects.

Some highlights from the video for folks who would rather read:

MySpace got up and running with Aster quickly

“We were able to bring that up online and actually start processing the data into it within a matter of weeks, and I think very few technologies give you the ability to do something like that.”
- Hala Al-Adwan, VP of Data Services at MySpace

Aster is mission-critical to MySpace
With Aster, what we’ve been able to produce with commodity hardware has been a supercomputer-like infrastructure …the data that we collect and process is absolutely critical to the success of MySpace.
– Bita Mathews, Data Warehouse Manager, MySpace

Right now our key business performance metrics are all powered out of the Aster system.  If somebody went and shut it down, none of that would be available.  I think in a lot of ways, we were lacking that data before, and now that we’re used to having it, people are just hungry for more and more information.  So if all that went away, I think it’s kinda like going back to an age where there was no light.
– Hala Al-Adwan

MySpace’s data warehouse with Aster is extremely reliable
Aster is always on and available.  And this is a very amazing thing about Aster, because it’s massive.  There’s a lot of hardware underneath the system.  When hardware fails, we can continue working; although we know some engineers are fixing hardware, that doesn’t stop us from continuing to run queries and producing our reports.
– Anna Dorofiyenko, Data Architect, MySpace

Aster is the blueprint for successful data warehouse deployments going forward
Integrating Aster and including them from the very beginning in the MySpace Music project … from beginning to end is what allowed that to be the most successful data warehouse implementation we’ve had to date, and I think we should definitely use it as a blueprint for any future implementations we do.
– Christa Stelzmuller, Chief Data Architect, MySpace



The Value of Flexibility
By Chris Neumann in Blogroll, Cloud Computing on March 3, 2009

As the Director of Technology Delivery for Aster Data Systems, I oversee the teams responsible for delivering and deploying our nCluster analytic database to customers and for enabling prospective customers to evaluate our solutions effectively and efficiently.  Recently, Shawn posted on the release of Aster nCluster Cloud Edition and discussed how cloud computing enables businesses to scale their infrastructure without huge hardware investments.  As a follow-on, I’d like to describe how the flexibility provided by nCluster’s support for multiple platforms can reduce the time and costs associated with evaluating nCluster.

Evaluating enterprise software can be a costly effort in both time and money.  The process typically requires weeks of prep work by the evaluation team, possibly including purchasing different hardware for each vendor being evaluated.  Spending significant amounts of money and losing weeks of resource productivity to an evaluation is something few companies can afford to do, particularly in these uncertain times.

With our recent public release of Aster nCluster Cloud Edition, we now provide the most platform options of any major data warehouse vendor.  While it’s natural to focus on the flexibility this affords for production systems, it also allows us to be very flexible for enabling customers to try our solution:

Commodity Hardware Evaluation
Several warehouse vendors claim to support commodity hardware, but most are very closely tied to one “preferred” vendor.  Aster nCluster supports any x86-based hardware, meaning that you can evaluate us on either new hardware (if performance is a key aspect of the evaluation) or older hardware that is being repurposed (if you want to test the functionality of nCluster without buying new hardware).

Aster-Hosted Evaluation
Our data center in San Carlos, CA has racks of servers dedicated to customer evaluations.  With an Aster-hosted system, functional evaluations of nCluster can be performed with minimum infrastructure requirements.

Cloud Evaluation
With Aster nCluster Cloud Edition, custom-configured nClusters can be brought up in minutes on either Amazon EC2 or AppNexus.  POCs can be performed on one or multiple systems in parallel, with zero infrastructure requirements.  Your teams can evaluate all of nCluster’s functionality in the cloud, with complete control over sizing and scaling. (While other vendors have announced cloud offerings, we’re the only data warehouse vendor to have production customers on two separate cloud services).

Whether you’re building a new frontline data warehouse or looking to replace an existing system that doesn’t scale or costs too much, you should check us out.  We have a great product that’s turning heads as an alternative to overpriced hardware appliances for multi-TB data warehouses.  With all the flexibility our offerings provide, you can evaluate all the power of Aster nCluster without the costs of traditional POCs.

Give us a try and see everything you can do with Aster nCluster!



By Mayank Bawa in Blogroll on February 25, 2009

We announced today that we’ve raised another $5.1M as an addendum to our Series B.


Institutional Venture Partners (IVP) led this addendum to bring our total Series B raise to $17.1M. IVP is very clearly the best later-stage VC firm in Silicon Valley, and it is great to have them on our side as we continue to grow our company. IVP has backed such enterprises as Netflix, WebEx, MySQL, Data Domain, Juniper Networks, and Akamai.

Steve Harrick, Partner at IVP, led this investment. Steve has a great understanding of technology infrastructure companies poised to lead their markets. He led IVP’s investments in WebEx and MySQL and was recently recognized on Forbes Magazine’s 2009 Midas List as one of the leading venture investors in the US.

We had met Steve during our rounds of Series B meetings in Q4 2008, and we really liked and respected each other. At that point, having decided to raise $12M, we didn’t have room for two venture firms, so we parted ways, promising to keep in touch for the future.

In Q1 2009, as the economy grew more uncertain, we revisited that decision and realized it would be more prudent to proactively build a bigger cash reserve, to ensure we did not stumble in our growth even as the market deteriorated and a recovery inched further away.

Steve was happy to step in and invest in Aster on the same terms as our Series B. We were delighted to have a person of Steve’s caliber participate wholeheartedly in our growth.

Finally, there has to be some weird coincidence in the fact that all of our Series A and Series B venture capital firms (Sequoia Capital, JAFCO Ventures, and now IVP) have been investors in database companies that delivered successful returns (Netezza, DATAllegro, MySQL).



By Steve Wooledge in Blogroll, nPath on February 11, 2009

As a follow-on to the introductory nPath post, I wanted to share a little more depth on the nPath SQL syntax, along with a more sophisticated example that can be applied in click-stream or Web analytics. I’ll try to keep it concise for my colleagues who don’t want the pretty marketing bow. ;-)

SEO and SEM are critical traffic drivers for just about any consumer-facing website. Third-party analytics offerings such as Google Analytics or Omniture provide a great turn-key package of canned reports. However, certain deep analytics on sequential events are simply out of reach - not only for these outsourced analytics services, but also for in-house Web analytics data warehouses implemented on traditional solutions such as Oracle or SQL Server.

For example, suppose we want to optimize our website flow to retain and engage visitors driven to us by SEO/SEM. We want to answer the question: for SEO/SEM-driven traffic that stays on our site for 5 or fewer pageviews and then leaves, never returning in the same session, what are the top referring search queries and the top paths of navigated pages on our site? In traditional data warehouse solutions, this problem would require a five-way self-join of granular weblog data, which is simply infeasible for large sites such as MySpace.

With the Aster nPath SQL/MR function, this problem can be expressed in a straightforward query that executes very efficiently in just a single pass over the granular data. The query below returns the top combinations of referral query string (of the entry page of the visit to our site) and on-site navigation path of up to 5 pages before leaving the site:

SELECT entry_refquerystring,
       entry_page || ',' || onsite_pagepath AS onsite_pagepath,
       count(*) AS session_count
FROM nPath(
  ON ( select * from clicks where year = 2009 )
  PARTITION BY customerid, sessionid
  ORDER BY timestamp
  PATTERN ( 'Entry.OnSite+.OffSite+$' )
  SYMBOLS (
    domain ilike 'mysite.com'
      and refdomain ~* 'yahoo.com|google.com|msn.com|live.com' as Entry,
    domain ilike 'mysite.com' as OnSite,
    domain not ilike 'mysite.com' as OffSite
  )
  MODE( NONOVERLAPPING )
  RESULT(
    first(page of Entry) as entry_page,
    first(refquerystring of Entry) as entry_refquerystring,
    accumulate(page of OnSite) as onsite_pagepath,
    count(* of OnSite) as onsitecount_minus1
  )
)
WHERE onsitecount_minus1 < 4
GROUP BY 1,2
ORDER BY 3 DESC
LIMIT 1000;



By Shawn Kung in Analytics, Blogroll, Cloud Computing on February 10, 2009


Cloud computing is a fascinating concept.  It offers greenfield opportunities (or more appropriately, blue sky frontiers) for businesses to affordably scale their infrastructure needs without plunking down a huge hardware investment (and the space/power/cooling costs associated with managing your own hosted environment).  This removes the risks of mis-provisioning by enabling on-demand scaling according to your data growth needs.  Especially in these economic times, the benefits of Cloud computing are very attractive.

But let’s face it - there’s also a lot of hype, and it’s hard to separate truth from fiction.  For example, what qualities would you say are key to data warehousing in the cloud?

Here’s a checklist of things I think are important:

[1] Time-To-Scalability.  The whole point of clouds is to offer easy access to virtualized resources.  A cloud warehouse needs to quickly scale-out and scale-in to adapt to changing needs.  It can’t take days to scale…it has to happen on-demand in minutes (<1 hour).

[2] Manageability.  You go with clouds because you not only want to save on hardware, but also on the operational people costs of maintaining that infrastructure.  A cloud warehouse needs to offer one-click scaling, easy install/upgrade, and self-managed resiliency.

[3] Ecosystem.  While clouds offer *you* huge TCO savings, you can’t compromise service levels for your customers - especially if you run your business on the cloud.  BI/ETL/monitoring tools, Backup & Recovery, and ultra-fast data loading can’t be overlooked for “frontline” mission-critical warehousing on the cloud.

[4] Analytics.  Lots of valuable data is generated via the cloud and there are opportunities to subscribe to new data feed services.  It’s insufficient for a cloud warehouse to just do basic SQL reporting.  Rather, it must offer the ability to do deep analytics very quickly.

[5] Choice.  A truly best-in-class cloud warehouse won’t lock you in to a single cloud vendor.  Rather, it will offer portability by enabling you to choose the best cloud for you to run your business on.

Finally, here are a couple of ideas on the future of cloud warehousing.  What if you could link multiple cloud warehouses together and run interesting queries across clouds?  And with so many emerging data subscription services, wouldn’t this offer ripe opportunities for game-changing mash-up analytics (e.g., using Aster SQL/MapReduce)?

What do you think are the standards for “best-in-class” cloud warehousing?



By Steve Wooledge in Analytics, Blogroll, nPath on February 10, 2009

It may not sound sexy, but analyzing a sequence of events over time is non-trivial in standard SQL. When Aster started providing customers ways to express time-series analysis more elegantly and have it return result sets in 90 seconds as opposed to 90 minutes, we started getting comments like, “That’s beautiful … I would make out with my screen if I could!”

Crazy, I know.

Time-series analysis, or sequential pattern analysis, is useful in a number of industries and applications, such as:

- Price changes over time in financial market data

- Path analysis within click-stream data to define anything from top referring sites to the “golden” paths customers navigate before purchasing

- Patterns which detect deviant activity such as spamming or insurance claims fraud

- Sessionization (mapping each event in a clickstream to a human user session; see the sketch just after this list)
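
To make that last item concrete, here is roughly what invoking a sessionizing SQL/MR function could look like. The sessionize function and its TIMECOLUMN and TIMEOUT argument clauses are hypothetical, with a 30-minute gap ending a session:

-- Assign a session_id to each click: clicks from the same user more than
-- 30 minutes (1800 seconds) apart fall into different sessions
SELECT userid, ts, pageid, session_id
FROM sessionize(
  ON clicks
  PARTITION BY userid
  ORDER BY ts
  TIMECOLUMN('ts')
  TIMEOUT(1800)
);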

More specifically, one customer wanted to improve the effectiveness of their advertising and drive more visitors to their site. They asked us to help determine the top paths people take to get to their site and the top paths people take after leaving it. Knowing the “before” path gave them insight into which sites to place advertisements on to drive traffic to their site. Additionally, pathing helps the customer understand the behavior and preferences of users who visit their site (e.g., if espn.com is a top site in the path of the 5 pages leading up to the customer site, they know that many visitors like sports).

However, discovering relationships between rows of data is difficult to express in SQL, which must invoke multiple self-joins of the data. These joins dramatically expand the amount of data involved in the query and slow down query performance - not to mention the complexity of developing and parsing these expressions.

What’s the solution? There are a few, but Aster’s approach has been to develop extensions to SQL that execute in-database, in a single pass over the data, in a massively-parallel fashion. The centerpiece is nPath, a SQL-MapReduce (SQL/MR) function that performs regular expression pattern matching over a sequence of rows. It allows users to:

- Specify any pattern in an ordered collection - a sequence - of rows with symbols;

- Specify additional conditions on the rows matching these symbols; and

- Extract useful information from these row sequences.

I’ll share the syntax of nPath here to give you more context of how the query operates:

SELECT…
FROM nPath (
ON {table|query}
…various parameters…
)
WHERE…
GROUP BY…
HAVING…
ORDER BY…
etc.

nPath performs pattern matching and computes aggregates. The results of this computation are the output rows of nPath. The rows from nPath can subsequently be used like any other table in a SQL query: rows from nPath may be filtered with the WHERE clause, joined to rows from other tables, grouped with the GROUP BY clause, and so on.
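
For example (the tables, columns, and pattern below are purely illustrative), nPath output can be joined to another table and aggregated like any other relation:

SELECT u.region, p.product_path, count(*) AS sessions
FROM nPath(
  ON clicks
  PARTITION BY userid, sessionid
  ORDER BY ts
  PATTERN ( 'Home.Product+.Checkout' )
  SYMBOLS (
    page = 'home' AS Home,
    page_type = 'product' AS Product,
    page = 'checkout' AS Checkout
  )
  MODE( NONOVERLAPPING )
  RESULT(
    first(userid of Home) AS userid,
    accumulate(page of Product) AS product_path
  )
) p
JOIN users u ON u.userid = p.userid   -- join nPath rows to another table
GROUP BY u.region, p.product_path     -- then group and sort like any query
ORDER BY sessions DESC;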

The result? Incredibly powerful insight into a series of events that indicate a pattern or segment - expressed in SQL, run in parallel on massive clusters of compute power via a single, extremely efficient pass over the data, and made accessible to business analysts through traditional BI tools.

What do you think? What other methods are people using to tackle these types of problems?



By Shawn Kung in Blogroll, Frontline data warehouse, TCO on January 27, 2009

Back in March 2005, I attended the AFCOM Data Center World Conference while working at NetApp.  It was a great opportunity to learn about enterprise data center challenges and network with some very experienced folks.  One thing that caught my attention was a recurring theme on growing power & cooling challenges in the data center.

Vendors, consultants, and end-user case study sessions trumpeted dire warnings that the proliferation of powerful 1U blade servers would result in power demands outstripping supply (for example, a typical 42U rack consumed 7-10kW, while new-generation blade servers were said to exhibit peak rack heat loads of 15-25kW).  In fact, estimates were that HVAC cooling (for heat emissions) was an equally significant power consumer (i.e., for every watt you burn to power the hardware, you burn another watt to cool it down).

Not coincidentally, 2005 marked the year when many server, storage, and networking vendors came out with “green” messaging.  The idea was to convey technologies that reduce power consumption and heat emissions, saving both money and the environment.  While some had credible stories (e.g., VMware), more often than not the result was me-too bland positioning or sheer hype (also known as “green washing”).

Luckily, Aster doesn’t suffer from this, as the architecture was designed for cost-efficiency (both people costs and facilities costs).  Among many examples:

[1] Heterogeneous scaling: we use commodity hardware, but the real innovation is making new servers work with pre-existing older ones.  This saves power & cooling costs because rather than having to create a new cluster from scratch (which requires new Queen nodes, new Loader nodes, more networking equipment, etc.), you can just plug in new-generation Worker nodes and scale out on the existing infrastructure…

[2] Multi-layer scaling: a related concept is that nCluster doesn’t require the same hardware for each “role” in the data warehousing lifecycle.  This division-of-labor approach ensures cost-effective scaling and power efficiency.  For example, Loader nodes are focused on ultra-fast partitioning and loading of data - since data doesn’t persist to disk on them, these servers contain minimal spinning disk drives to save power.  On the opposite end, Backup nodes are focused on storing full/incremental backups for data protection - typically these nodes are “bottom-heavy,” containing lots of high-capacity SATA disks for power efficiency (fewer servers, fewer disk drives, slower-spinning 7.2K RPM drives).

[3] Optimized partitioning: one of our secret-sauce algorithms maximizes the locality of joins via intelligent data placement.  As a result, less data transfers over the network, which means IT orgs can stretch their existing network assets (without having to buy more networking gear and burn more power).

[4] Compression: we love to compress things.  Tables, cross-node transfers, backup & recovery, etc. all leverage compression algorithms to achieve 4x-12x compression ratios - this means fewer spinning disk drives to store data and lower power consumption.

…and others (too many to list in a short blog post like this).

I’d love to continue the conversation with IT folks passionate about power consumption…what are your top challenges today and what trends do you see in power consumption for different applications in the data center?



By Steve Wooledge in Blogroll on January 18, 2009

Interesting unrest is brewing among Seagate customers today over Barracuda drives failing at an alarming rate. So how can you protect against this? Even though Seagate allows you to RMA the product, it’s really the data loss and downtime that are the headache, right?

You must figure out a strategy to deal with both planned and unplanned downtime before it happens. Ask your vendors about their fault-tolerance and self-healing capabilities to help you deal with hardware and software failures. Make sure the solutions address every component of failure, not just disks or processors. Ask whether replication can occur on commodity hardware - or will you have to wait 5 days for a new custom box to be ordered? And make sure online backup is an option, even if it’s a last resort.

If you’re looking for a low-cost way to protect your data warehouse against hardware failures, read more about Aster Data’s “always on” database.