Blog: Winning with Data

Archive for the ‘Availability’ Category



Posted on October 12th, 2008 by George Candea

At long last, I get to smash some hardware! Ever since we started the recovery-oriented computing (ROC) project at Stanford and Berkeley in 2001, I’ve been dreaming of demo-ing ROC by taking a sledgehammer to a running computer and have the software continue running despite the damage. I never quite had the funds for it! :-)

The Aster nCluster analytic database embodies so much of ROC (microrebooting, undo, fault-injection-based testing, and so on) that fulfilling this “childhood dream” within the context of nCluster is a perfect match. It’s hilarious to watch, and for me it was a great experience; check out Recovery-Oriented Computing for Databases (the actual demonstration) and DBA’s Gone Wild (just for fun). The only thing I’d do differently is get a bigger sledgehammer, because some of that hardware was really built to last (hats off to HP)!

Beyond all the fun involved, there is also a broader message. A lot of data warehouses are way too fragile, and too many people believe that investing in more solid hardware is the way to go. Frontline business applications have to be able to withstand a wide range of failures, and what I do in these videos is really just scratching the surface.

At Aster, “always on” availability is much more than a key marketing message - it’s a core database innovation founded in recovery-oriented computing, which minimizes both planned and unplanned downtime for our customers. Whether it’s an analytic application, or an analyst requiring 24/7 queries for their modeling, data mining, or business intelligence (BI) report - having a database they can depend on is critical.

Posted on October 6th, 2008 by Steve Wooledge

Aster announced the general availability of our nCluster 3.0 database, complete with new feature sets. We’re thrilled with the adoption we saw before GA of the product, and it’s always a pleasure to speak directly with someone who is using nCluster to enable their frontline decision-making.

Lenin Gali, director of business intelligence for the online sharing platform ShareThis, is one such friend. He recently sat down with us to discuss how Internet and social networking companies can successfully grow their business by rapidly analyzing and acting on their massive data.

You can read the full details of our conversation on the Aster Website.

Posted on October 6th, 2008 by Mayank Bawa

It is really remarkable how many companies today view data analytics as the cornerstone of their businesses.

aCerno is an advertising network that uses powerful analytics to predict which advertisements to deliver to which person at what time. Their analytics are performed on completely anonymous consumer shopping data of 140M users obtained from an association of 450+ product manufacturers and multi-channel retailers. There is a strong appetite at aCerno to perform analytics that they have not done before because each 1% uplift in the click-through rates is a significant revenue stream for them and their customers.

Aggregate Knowledge powers a discovery network (The Pique Discovery™ Network) that delivers recommendations of products and content. The recommendations are based on what an individual previously purchased and viewed, combined with the collective behavior of crowds that behaved similarly in the past. Again, each 1% increase of engagement is a significant revenue stream for them and their customers.

ShareThis provides a sharing network via a widget that makes it simple for people to share things they find online with their friends. In the short period of time since their launch, ShareThis has reached over 150M unique monthly users. The amazing insight is that ShareThis knows which content users actually engage with, and want to tell their friends about! And in its sheer genius, ShareThis gives away its service to publishers and consumers for free, relying on targeted advertising for its revenue: it delivers relevant ad messages because it knows the characteristics of the thing being shared. Again, the better their analytics, the better their revenue.

Which brings me to my point: data analytics is a direct contributor to revenue gains in these companies.

Traditionally, we think of data warehousing as a back-office task. The data warehouse can be loaded in separate load windows; loads can run late (the net effect is that business users will get their reports late); loads, backups, and scale-up can take data warehouses offline, which is OK since these tasks can be done during non-business hours (nights/weekends).

But these companies rely on data analytics for their revenue.

·    A separate exclusive load window implies that their service is not leveraging analytics during that window;
·    A late-running load implies that the service is getting stale data;
·    An offline warehouse implies that the service is missing fresh trends.

Any such planned or unplanned outage results in lower revenues.

On the flip side, a faster load/query provides the service a competitive edge – a chance to do more with their data than anyone else in the market. A nimbler data model, a faster scale-out, or a more agile ETL process helps them implement their “Aha!” insights faster and gain revenue from a shorter time-to-market.

These companies have moved data warehousing from the back-office to the frontlines of business: a competitive weapon to increase their revenues or to reduce their risks.

In response, the requirements of a data warehouse that supports these frontline applications go up a few notches: the warehouse has to be available for querying and loading 24x7x365; the warehouse has to be fast and nimble; the warehouse has to allow “Aha!” queries to be phrased.

We call these use cases “frontline data warehousing”. And today we released a new version of Aster nCluster that rises up those few notches to meet the demands of the frontline applications.

Posted on October 6th, 2008 by Tasso Argyros

Back in the days when Mayank, George and I were still students at Stanford, working hard to create Aster, we had a pretty clear vision of what we wanted to achieve: allow the world to do more analytics on more data. Aster has grown tremendously since those days, but that vision hasn’t changed. And one can see this very clearly in the new release of our software, Aster nCluster 3.0, which is all about doing more analytics with more data. Because 3.0 introduces so many important features, we tried to categorize them into three big buckets: Always Parallel, Always On, and In-Database MapReduce.

Always Parallel has to do with the “Big Data” part of our vision. We want to build systems that can handle 10x – 100x more data than any other system today. But this is too much data for any single “commodity server” (that is, a server with reasonable cost) that one can buy. So we put a lot of R&D effort into parallelizing every single function of the system – not only querying, but also loading, data export, backup, and upgrades. Plus, we allow our users to choose how much they want to parallelize all these functions, without having to scale up the whole system.

Always On also stems from the need to handle “Big Data”, but in a different way. In order for someone to store and analyze anything from a terabyte to a petabyte, she needs to use a system with more than a single server. But then availability and management can become a huge problem. What if a server fails? How do I keep going, but also how do I recover from the failure (either by reintroducing the same server or a new replacement server) with no downtime? And how can I seamlessly expand the system, in order for me to realize the great promise of horizontal scaling, without taking the system down? And, finally, how do I back up all these oceans of data without disrupting my system’s operation? All these issues are handled in our new 3.0 release.

We introduced In-Database MapReduce in a previous post so I won’t spend too much time here. But I want to point out how this fits our overall vision. Having a database which is always parallel and always on allows you to handle Big Data with high performance, low cost, and high availability. But once you have all this data, you want to do more analytics - to extract more value and insights. In-Database MapReduce is meant to do exactly that – push the limits of what insights you can extract by providing the first-ever system that tightly integrates MapReduce (a powerful analytical paradigm) with a widespread standard like SQL.

These are the big features in nCluster 3.0, and in the majority of our marketing materials we stop here. But I also want to talk about the other great things we have in there; things too subtle or technical to mention in the headlines, but still very important. We’ve added table compression features that offer online, multi-level compression for cost savings. With table compression, you can choose your compression ratio and algorithm and have different tables compressed differently. This paves the way for data life-cycle management that can compress data differently depending on its age.

We’ve also implemented richer workload management to offer quality of service for fine-grained mixed workload prioritization via priority- and fair-share-based resource queues. You can even allocate resource weights based on transaction number or time (useful when both big and small jobs occur).

3.0 also has Network Aggregation (NIC “bonding”) for performance and fault tolerance. This is a one-click configuration that automates network setup – usually a tedious, error-prone sysadmin task. And that’s not the end of it – we are also introducing an Upgrade Manager that automates upgrades from one version of nCluster to another, including what most frequently breaks upgrades: the operating system components. This is another building block of the low cost of ongoing administration that we’re so proud of achieving with nCluster. I could go on and on (new SQL enhancements, new data validation tools, heterogeneous hardware support, LDAP authentication, …), but since blog space is supposed to be limited, I’ll stop here. (Check out our new resource library if you want to dig deeper.)

Overall, I am delighted to see how our product has evolved towards the vision we laid out years back. I’m also thrilled that we’re building a solid ecosystem around Aster nCluster – we now support all the major BI platforms – and are establishing quite a network of systems integrators to help customers with implementation of their frontline data warehouses. In a knowledge-based economy full of uncertainty, opportunities, and threats, doing more analytics on more data will drive the competitiveness of successful corporations – and Aster nCluster 3.0 will help you deliver just that for your own company.

Posted on August 12th, 2008 by Tasso Argyros

- John: “What was wrong with the server that crashed last week?”

- Chris: “I don’t know. I rebooted it and it’s just fine. Perhaps the software crashed!”

I’m sure anyone who has been in operations has had the above dialog, sometimes quite frequently! In computer science such a failure would be called “transient” because the failure affects a piece of the system only for a limited amount of time. People who have been running large-scale systems for a long time will attest that transient failures are extremely common and can lead to system unavailability if not handled right.

In this post I want to explore why transient failures are an important threat to availability and how a distributed database can handle them.

To see why transient failures are frequent and unavoidable, let’s consider what can cause them. Here’s an easy (albeit non-intuitive) reason:  software bugs.  All production-quality software still has bugs; most of the bugs that escape testing are difficult to track down and resolve, and they take the form of Heisenbugs, race conditions, resource leaks, and environment-dependent bugs, both in the OS and the applications. Some of these bugs will cause a server to crash unexpectedly.  A simple reboot will fix the issue, but in the meantime the server will not be available.  Configuration errors are another common cause.  Somebody inserts the wrong parameters into a network switch console and as a result a few servers suddenly go offline. And, sometimes, the cause of the failure just remains unidentified because it can be hard to reproduce and thus examine more thoroughly.

I submit to you that it is much harder to prevent transient failures than permanent ones. Permanent failures are predictable, and are often caused by hardware failures. We can build software or hardware to work around permanent failures. For example, one can build a RAID scheme to prevent a server from going down if a disk fails, but no RAID level can prevent a memory leak in the OS kernel from causing a crash!

What does this mean? Since transient failures are unpredictable and harder to prevent, MTTF (mean time to failure) for transient failures is hard to increase.

Clearly, a smaller MTTF means more frequent outages and more downtime. But if MTTF is so hard to increase for transient failures, what can we do to always keep the system running?

The answer is that instead of increasing MTTF we can reduce MTTR (mean time to recover). Mathematically this concept is expressed by the formula:

Availability = MTTF/(MTTF+MTTR)

It is obvious that as MTTR approaches zero, Availability approaches 1 (i.e., 100%). In other words, if failure recovery is very fast (instantaneous in an extreme example), then even if failures happen frequently, overall system availability will continue to be very high. This interesting approach to availability, called Recovery-Oriented Computing, was developed jointly by Berkeley and Stanford researchers, including my co-founder George Candea.
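To make the formula concrete, here is a small Python sketch. The MTTF and MTTR figures are made-up illustration values, not measurements from any real system, and the function names are hypothetical.

```python
HOURS_PER_YEAR = 24 * 365


def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Availability = MTTF / (MTTF + MTTR), as in the formula above."""
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTTF must be positive and MTTR non-negative")
    return mttf_hours / (mttf_hours + mttr_hours)


def downtime_hours_per_year(mttf_hours: float, mttr_hours: float) -> float:
    """Expected unavailable hours in a year for the given MTTF and MTTR."""
    return (1.0 - availability(mttf_hours, mttr_hours)) * HOURS_PER_YEAR


if __name__ == "__main__":
    # Hypothetical scenarios: (MTTF in hours, MTTR in hours).
    scenarios = [(100, 1), (200, 1), (100, 0.5), (100, 0.01)]
    for mttf, mttr in scenarios:
        print(
            f"MTTF={mttf}h MTTR={mttr}h "
            f"availability={availability(mttf, mttr):.6f} "
            f"downtime/year={downtime_hours_per_year(mttf, mttr):.2f}h"
        )
```

Because the formula depends only on the ratio of MTTR to MTTF, halving MTTR buys exactly the same availability as doubling MTTF. That is why faster recovery is an attractive lever when MTTF is hard to move.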

Applying this concept to a massively parallel distributed database yields interesting design implications. As an example, let’s consider the case where a server fails temporarily due to an OS crash in a 100-server distributed database. Such an event means that the system has fewer resources to work with: in our example after the failure we have a 1% reduction of available resources. A reliable system will need to:

(a) Be available while the failure lasts and

(b) Recover to the initial state as soon as possible after the failed server is restored.

Thus, recovering from this failure needs to be a two-step process:

(a) Keep the system available with a small performance/capacity hit while the failure is ongoing (availability recovery)

(b) Restore the system to its initial levels of performance and capacity as soon as the transient failure is resolved (resource recovery)

Minimizing MTTR means minimizing the sum of the time it takes to do (a) and (b), ta + tb. Keeping ta very low requires having replicas of data spread across the cluster; this, coupled with fast failure detection and fast activation of the appropriate replicas, will ensure that ta remains as low as possible.
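As a rough illustration of replica activation, the Python sketch below keeps an ordered list of nodes per data partition and, when a node fails, routes each partition to its first surviving replica. The placement map, node names, and `fail_over` function are all hypothetical; this is a toy model of the idea, not a description of how nCluster implements it.

```python
from typing import Dict, List


def fail_over(placement: Dict[str, List[str]], failed_node: str) -> Dict[str, str]:
    """Pick the node that serves each partition once `failed_node` is down.

    `placement` maps a partition to its replicas in preference order
    (the first entry is the primary). This layout is assumed for the sketch.
    """
    active = {}
    for partition, nodes in placement.items():
        live = [node for node in nodes if node != failed_node]
        if not live:
            # No surviving replica: this partition is unavailable.
            raise RuntimeError(f"partition {partition} has no live replica")
        active[partition] = live[0]
    return active


if __name__ == "__main__":
    # Hypothetical 3-node cluster where every partition has two replicas.
    placement = {
        "p0": ["n1", "n2"],
        "p1": ["n2", "n3"],
        "p2": ["n3", "n1"],
    }
    print(fail_over(placement, "n2"))
```

In this toy model, ta is the time to notice that a node is gone plus the time to run something like `fail_over`. No data has to move, which is what keeps the availability-recovery step short.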

Minimizing tb requires seamless re-incorporation of the transiently failed nodes into the system. Since in a distributed database each node has a lot of state, and the network is the biggest bottleneck, the system must be able to reuse as much of the state that pre-existed on the failed nodes as possible to reduce the recovery time. In other words, if most of the data that was on the node before the failure is still valid (a very likely case) then it needs to be identified, validated and reused during re-incorporation.
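One way to picture "identify, validate and reuse" is a checksum comparison, sketched below in Python. The returning node compares the blocks it still holds against checksums from an up-to-date replica and copies only what is missing or changed. The block layout, the use of SHA-256, and the function names are assumptions made for this illustration; the post does not say how nCluster validates state.

```python
import hashlib
from typing import Dict, List, Tuple


def checksum(data: bytes) -> str:
    """Content fingerprint used to decide whether a local block is still valid."""
    return hashlib.sha256(data).hexdigest()


def plan_reincorporation(
    local_blocks: Dict[str, bytes],
    replica_checksums: Dict[str, str],
) -> Tuple[List[str], List[str], List[str]]:
    """Split blocks into (reuse, fetch, discard) for a returning node.

    reuse:   blocks on the node that still match the up-to-date replica
    fetch:   blocks that are missing or stale and must cross the network
    discard: local blocks the replica no longer has
    """
    reuse, fetch = [], []
    for block_id, expected in replica_checksums.items():
        data = local_blocks.get(block_id)
        if data is not None and checksum(data) == expected:
            reuse.append(block_id)
        else:
            fetch.append(block_id)
    discard = [b for b in local_blocks if b not in replica_checksums]
    return sorted(reuse), sorted(fetch), sorted(discard)


if __name__ == "__main__":
    # Hypothetical state: b2 changed and b3 was added while the node was down.
    local = {"b1": b"alpha", "b2": b"old", "b4": b"dropped"}
    replica = {
        "b1": checksum(b"alpha"),
        "b2": checksum(b"new"),
        "b3": checksum(b"added"),
    }
    print(plan_reincorporation(local, replica))
```

Only the blocks in the fetch list travel over the network, so tb in this sketch grows with the amount of data that changed during the outage rather than with the total state on the node.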

Any system that cannot keep both ta and tb low does not provide good tolerance to transient failures.

And because there will always be more transient failures the bigger a system gets, any architecture that cannot handle failures correctly is - simply - not scalable. Any attempt to scale it up will likely result in outages and performance problems. Having a system designed with a Recovery-Oriented architecture, such as the Aster nCluster database, can ensure that transient failures are tolerated with minimal disruption, and thus true scalability is possible.
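A back-of-the-envelope Python calculation shows why failures grow with cluster size. It assumes each node is independently down with the same probability during some time window. Both the independence assumption and the 1% figure are illustrative choices for this sketch, not data from the post.

```python
def prob_any_node_down(nodes: int, p_node_down: float) -> float:
    """Probability that at least one of `nodes` is down in the window,
    assuming independent failures with equal probability per node."""
    if nodes < 1 or not 0.0 <= p_node_down <= 1.0:
        raise ValueError("need nodes >= 1 and a probability between 0 and 1")
    return 1.0 - (1.0 - p_node_down) ** nodes


def expected_nodes_down(nodes: int, p_node_down: float) -> float:
    """Average number of nodes down in the window under the same assumption."""
    return nodes * p_node_down


if __name__ == "__main__":
    p = 0.01  # hypothetical chance that a given node is down in the window
    for n in (1, 10, 100, 1000):
        print(
            f"{n:5d} nodes: P(at least one down)={prob_any_node_down(n, p):.3f}, "
            f"expected down={expected_nodes_down(n, p):.2f}"
        )
```

Under these assumptions, a failure that is rare on a single server becomes a routine event on a cluster of a hundred or more, so the cluster has to treat it as normal operation.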

Copyright © 2008 Aster Data Systems, Inc. All rights reserved.