Barton George is Cloud Computing and Scale-Out Evangelist for Dell.
Today at a press conference in San Francisco we announced the general availability of our Dell cloud solutions. One of the solutions we debuted was the Dell Cloud Solution for Data Analytics, a combination of our PowerEdge C servers with Aster Data’s nCluster, a massively parallel processing database with an integrated analytics engine.
Earlier this week I stopped by Aster Data’s headquarters in San Carlos, CA and met up with their EVP of marketing, Sharmila Mulligan. I recorded this video where Sharmila discusses the Dell and Aster solution and the fantastic results a customer is seeing with it.
Some of the ground Sharmila covers:
What customer pain points and problems this solution addresses (hint: organizations trying to manage huge amounts of both structured and unstructured data).
How Aster’s nCluster software is optimized for the Dell PowerEdge C2100, and how it provides very high-performance analytics as well as a cost-effective way to store very large data sets.
(2:21) InsightExpress, a leading provider of digital marketing research solutions, has deployed the Dell and Aster analytics solution and has seen great results:
Up and running within 6 weeks
Queries that took 7-9 minutes now run in 3 seconds
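To put the query numbers above in perspective, the arithmetic can be sketched in a few lines of Python. This is only an illustration of the figures quoted in this post; the function name is hypothetical, and the calculation assumes "7-9 minutes" and "3 seconds" describe the same queries before and after.

```python
def speedup(before_seconds, after_seconds):
    """Return how many times faster the 'after' runtime is than the 'before' runtime."""
    return before_seconds / after_seconds

# Figures quoted above: queries went from 7-9 minutes down to 3 seconds.
low = speedup(7 * 60, 3)    # 420 s / 3 s
high = speedup(9 * 60, 3)   # 540 s / 3 s

print(f"Roughly {low:.0f}x to {high:.0f}x faster")
```

In other words, the quoted improvement works out to somewhere between two and three orders of magnitude closer to interactive response times.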
Amazon announced today the availability of special EC2 cloud clusters that are optimized for low-latency network operations. This is useful for applications in the so-called High-Performance Computing area, where servers need to request and exchange data very fast. Examples of HPC applications range from nuclear simulations in government labs to playing chess.
I find this development interesting, not only because it makes scientific applications in the cloud a possibility, but also because it’s an indication of where cloud infrastructure is heading.
In the early days, Amazon EC2 was very simple: if you wanted 5 “instances” (that is, 5 virtual machines), that’s what you got. However, the instances had little memory and little disk capacity. Over time, more and more configurations were added, and now one can choose an instance type from a variety of disk and memory characteristics, with up to 15GB of memory and 2TB of disk per instance. However, the network was always a problem, regardless of the size of the instance. (According to rumors, EC2 would make things worse by distributing instances as far away from each other as possible in the datacenter to increase reliability; as a result, network latency would suffer.) Now, the network problem is being solved by means of these special “Cluster Compute Instances” that provide guaranteed, non-blocking access to a 10GbE network infrastructure.
Overall, this direction represents a departure from the super-simple black-box model that EC2 started from. Amazon wisely realizes that accommodating more applications requires transparency about, and guarantees for, the underlying infrastructure. Guaranteeing network latency is just the beginning: Amazon has the opportunity to add many more options and guarantees around I/O performance, quality of service, SSDs versus hard drives, fail-over behavior, etc. The more options and guarantees Amazon offers, the closer we’ll get to the promise of the cloud, at least for resource-intensive IT applications.
If you read this blog, you’ve probably seen the news about the partnership between Aster Data and Dell on their new PowerEdge C-Series servers. Together we have enabled some really successful customers such as MySpace and Mint.com and proven that Dell hardware with Aster Data software easily scales to support large-scale data warehousing and advanced analytics.
In the last few years there has been a significant amount of market pickup, from users and vendors, on data clouds and advanced analytics – specifically a new class of data-driven applications run in a data cloud or on-premise. What’s different about this from past approaches is the frequency and speed at which these applications are accessed, the depth of the analysis, the number of data sources involved and the volume of data mined by these applications – terabytes to petabytes. In the midst of this cacophony of dialogue, recent announcements from vendors in this space are helping to clarify different visions and approaches to the big data challenge.
Both Aster Data and Greenplum made announcements this week that illustrated different approaches. At the same time that Aster Data announced the Aster Analytics Center, Greenplum announced an upcoming product named Chorus. I wanted to take a moment to compare and contrast what these announcements say about the direction of the two companies.
Greenplum’s approach speaks to two traditional problem areas: i) access to data, from provisioning of data marts to connectivity to data across marts, and ii) some level of collaboration among certain developers and analysts. Their approach is to create a tool for provisioning, unified data access, and sharing of annotations and data among different developers and analysts. Interestingly, this is not an entirely new concept; these are well-known problems for which a number of companies and tools have already developed best-of-breed solutions over the last 15 years. For example, the capabilities for data access are another version of the Export/Copy primitives that already exist in all databases and that common ETL and EII tools have built upon for cases in which richer support than Export and Copy is needed, for instance when data has to be transformed, correlated or cleaned while being moved from one context (mart) to another (mart).
This approach is indicative of a product direction in which the primary focus is on adding another option to the list of tools available to customers to address these problems. It’s really not a ground-breaking innovation that evolves the world of analytics. New types of analytics, or ‘data-driven applications,’ are where the enormous opportunity lies. The Greenplum approach of data collaboration is interesting in a test environment or sandbox. When it comes to real production value, however, it increases the functions available to the end user only at a big cost: significant increases in complexity, security issues and extra administrative overhead. What does this mean exactly?
The spin-up of marts and moving data around can result in “data sprawl” which ultimately increases administrative overhead and is dangerous in these days of compliance and sensitivity to privacy and data leaks.
Adding a new toolset into the data processing stack creates difficult and painful work to either manage and administer multiple tool sets for similar purposes or to eliminate and transition away from investments in existing toolsets.
To enable effective communication and sharing, users need strong processes and features for source identification of data, data collection, data transformation, rule administration, error detection & correction, data governance and security. The quality and security policies around meta-data are especially important as free-form annotations can lead to propagation of errors or leaks in the absence of strong oversight.
In contrast, Aster Data’s recent announcements support our long-standing investments in our unique advanced in-database architecture where applications run fully inside Aster Data’s platform with complete application services essential for complex analytic applications. The announcements highlight that our vision is not to create a new set of tools and layers in the data stack that recreate capabilities currently available from a number of leading vendors, but rather to deliver a new Analytics Platform, a Data-Application Server, to uniquely enable analytics professionals to create data-rich applications that were impossible or impractical before – namely, to create and use advanced analytics for rich, rapid, and scalable insights into their data. This focus is complemented by our partners, who offer proven best-of-breed solutions for collaboration and data transformation.
Our vision was, and continues to be, to bring the power of MapReduce to a whole new class of developers and mission-critical enterprise systems. When would you use Aster’s In-Database MapReduce vs. a system like Hadoop? You need to ask a few questions as you think about this:
 Can I use my MapReduce system only for batch processing or can I do real-time reporting and analysis? Can I have a single system to do number-crunching AND needle-in-a-haystack summary or aggregation lookup? Can I get response to my short queries in seconds or do I need to wait for several minutes?
 How do I maximize developer productivity, using SQL for regular data processing and MapReduce for richer analysis?
 Do I only want to manage raw data files using file name conventions, or do I also want to use database primitives like partitions, tables, and views?
 How do I easily integrate the MapReduce system with my standard ETL and reporting tool, so I don’t have to reinvent the wheel on dashboards, scorecards, and reports?
 When I have such large data in an enterprise system, how do I control access to data and provide appropriate security privileges?
 Workload management: When I have invested in a system with hundreds or thousands of processors, how do I efficiently share it among multiple users and guarantee response-time SLAs?
 For mission-critical data-intensive applications, how do I do full and incremental backup and disaster recovery?
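For readers new to the programming model behind these questions, here is a minimal sketch of the map, shuffle and reduce steps in plain Python. It is not Aster’s In-Database MapReduce interface and not Hadoop’s API; every function name and the sample click records are hypothetical, chosen only to show the pattern on a single machine.

```python
from collections import defaultdict

def map_phase(records, mapper):
    """Apply the mapper to every record and yield its (key, value) pairs."""
    for record in records:
        yield from mapper(record)

def shuffle(pairs):
    """Group all emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    """Apply the reducer once per key to the list of values for that key."""
    return {key: reducer(key, values) for key, values in groups.items()}

def run_mapreduce(records, mapper, reducer):
    """Chain the three steps together."""
    return reduce_phase(shuffle(map_phase(records, mapper)), reducer)

# Hypothetical job: count clicks per user.
def clicks_mapper(record):
    user, _page = record
    yield user, 1

def sum_reducer(_key, values):
    return sum(values)

clicks = [("alice", "/home"), ("bob", "/home"), ("alice", "/cart")]
result = run_mapreduce(clicks, clicks_mapper, sum_reducer)
print(result)
```

This particular job is equivalent to a SQL GROUP BY with COUNT, which is exactly the kind of regular processing that SQL handles well; the mapper and reducer only earn their keep when the per-record or per-group logic is richer than what SQL expresses comfortably.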
It’s great to see MapReduce going mainstream and companies such as Amazon supporting the proliferation of innovative approaches to the data explosion problem. Together, we hope to help build mind-share around MapReduce and help companies do more with their data. In fact, we welcome users to put Amazon Elastic MapReduce output into Aster nCluster Cloud Edition for persistence, sharing, reporting and easy fast concurrent access. Lots of Aster customers are using both and it’s easy to move data since Aster is on the same Amazon Web Services cloud.
Please contact us if you’d like help getting started with your MapReduce explorations. We conducted a web seminar to introduce you to the concept.
As the Director of Technology Delivery for Aster Data Systems, I oversee the teams responsible for delivering and deploying our nCluster analytic database to customers and enabling prospective customers to evaluate our solutions effectively and efficiently. Recently, Shawn posted on the release of Aster nCluster Cloud Edition and discussed how cloud computing enables businesses to scale their infrastructure without huge hardware investments. As a follow-on, I’d like to explain how the flexibility provided by nCluster’s support of multiple platforms can reduce the time and costs associated with evaluating nCluster.
Evaluating enterprise software can be a costly effort in both time and money. The process typically requires weeks of prep work by the evaluation team, possibly including purchasing different hardware for each vendor being evaluated. Spending significant amounts of money and losing weeks of resource productivity to an evaluation is something few companies can afford to do, particularly in these uncertain times.
With our recent public release of Aster nCluster Cloud Edition, we now provide the most platform options of any major data warehouse vendor. While it’s natural to focus on the flexibility this affords for production systems, it also allows us to be very flexible for enabling customers to try our solution:
Commodity Hardware Evaluation
Several warehouse vendors claim to support commodity hardware, but most are very closely tied to one “preferred” vendor. Aster nCluster supports any x86-based hardware, meaning that you can evaluate us on either new hardware (if performance is a key aspect of the evaluation) or older hardware that is being repurposed (if you want to test the functionality of nCluster without buying new hardware).
Our data center in San Carlos, CA has racks of servers dedicated to customer evaluations. With an Aster-hosted system, functional evaluations of nCluster can be performed with minimum infrastructure requirements.
With Aster nCluster Cloud Edition, custom-configured nClusters can be brought up in minutes on either Amazon EC2 or AppNexus. POCs can be performed on one or multiple systems in parallel, with zero infrastructure requirements. Your teams can evaluate all of nCluster’s functionality in the cloud, with complete control over sizing and scaling. (While other vendors have announced cloud offerings, we’re the only data warehouse vendor to have production customers on two separate cloud services.)
Whether you’re building a new frontline data warehouse or looking to replace an existing system that doesn’t scale or costs too much, you should check us out. We have a great product that’s turning heads as an alternative to overpriced hardware appliances for multi-TB data warehouses. With all the flexibility our offerings provide, you can evaluate all the power of Aster nCluster without the costs of traditional POCs.
Give us a try and see everything you can do with Aster nCluster!
Cloud computing is a fascinating concept. It offers greenfield opportunities (or more appropriately, blue sky frontiers) for businesses to affordably scale their infrastructure needs without plunking down a huge hardware investment (and the space/power/cooling costs associated with managing your own hosted environment). This removes the risks of mis-provisioning by enabling on-demand scaling according to your data growth needs. Especially in these economic times, the benefits of Cloud computing are very attractive.
But let’s face it – there’s also a lot of hype, and it’s hard to separate truth from fiction. For example, what qualities would you say are key to data warehousing in the cloud?
Here’s a checklist of things I think are important:
 Time-To-Scalability. The whole point of clouds is to offer easy access to virtualized resources. A cloud warehouse needs to quickly scale-out and scale-in to adapt to changing needs. It can’t take days to scale…it has to happen on-demand in minutes (<1 hour).
 Manageability. You go with clouds because you not only want to save on hardware, but also on the operational people costs of maintaining that infrastructure. A cloud warehouse needs to offer one-click scaling, easy install/upgrade, and self-managed resiliency.
 Ecosystem. While clouds offer *you* huge TCO savings, you can’t compromise service levels for your customers – especially if you run your business on the cloud. BI/ETL/monitoring tools, Backup & Recovery, and ultra-fast data loading can’t be overlooked for “frontline” mission-critical warehousing on the cloud.
 Analytics. Lots of valuable data is generated via the cloud and there are opportunities to subscribe to new data feed services. It’s insufficient for a cloud warehouse to just do basic SQL reporting. Rather, it must offer the ability to do deep analytics very quickly.
 Choice. A truly best-in-class cloud warehouse won’t lock you in to a single cloud vendor. Rather, it will offer portability by enabling you to choose the best cloud for you to run your business on.
Finally, here are a couple of ideas on the future of cloud warehousing. What if you could link multiple cloud warehouses together and do interesting queries across clouds? And what about the opportunities for game-changing new analytics? With so many emerging data subscription services, wouldn’t this offer ripe opportunities for mash-up analytics (e.g., using Aster SQL/MapReduce)?
What do you think are the standards for “best-in-class” cloud warehousing?