I have always enjoyed the subtle irony of someone trying to be impressive by saying “my data warehouse is X Terabytes” [muted: “…and it’s bigger than yours”]! Why is this ironic? Because it describes a data warehouse, which is supposed to be all about data processing and analysis, using a storage metric. Having an obese 800-Terabyte system that may take hours or days just to do a single pass over the data is not impressive and definitely calls for a diet.
Surprisingly though, several vendors went down the path of making their data warehousing offerings fatter and fatter. Greenplum is a good example. Prior to Sun’s acquisition by Oracle, they were heavily pushing systems based on the Sun Thumper, a 48-disk-heavy 4U box that can store up to 100TBs/box. I was quite familiar with that box as it partly came out of a startup called Kealia that my Stanford advisor, David Cheriton, and Sun co-founder Andy Bechtolsheim had founded and then sold to Sun in 2004. I kept wondering, though, what a 50TB/CPU configuration has to do with data analytics.
After long deliberation I came to the conclusion that it has nothing to do with it. There were two reasons why people were interested in this configuration. First, there were some use cases that required “near-line storage”, a term that’s used to describe a data repository whose major purpose is to store data but also allows for basic & infrequent data access. In that respect, Greenplum’s software on top of the Sun Thumpers represented a cheap storage solution that offered basic data access and was very useful for applications where processing or analytics was not the main focus.
The second reason for the interest, though, is a tendency to drive DW projects towards the absolute lowest per-TB price to reduce costs. Experienced folks will recognize that such an approach leads to disaster, because – as mentioned above – analytics is more than just Terabytes. A rock-bottom per-TB price using fat storage looks great on glossy paper, but in reality it’s no good because nobody’s analytical problems are that simple.
The point here is that analytics has more to do with processing than with storage. It requires a fair number of balanced servers (thus good scalability & fault tolerance), CPU cycles, networking bandwidth, smart & efficient algorithms, fair amounts of memory to avoid thrashing, etc. It’s also about how much processing can be done in SQL, and how much of your analytics needs to use next-generation interfaces like MapReduce or pre-packaged in-database analytical engines. In the new decade on which we’re embarking, solving business problems like fraud, market segmentation & targeting, financial optimization, etc., requires much more than just cheap, overweight storage.
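To put a rough number on the "single pass over the data" problem, here is a back-of-the-envelope sketch. The node counts and the per-node scan rate are made-up figures for illustration only, not measurements of any product; the only point is that time per pass depends on aggregate processing bandwidth, not on how many Terabytes each box can hold.

```python
# Back-of-the-envelope sketch: how long does one full pass over a warehouse take?
# All throughput figures and node counts below are hypothetical.

def full_scan_hours(data_tb, nodes, scan_mb_per_sec_per_node):
    """Hours needed to read every byte once, assuming perfectly parallel scans."""
    total_mb = data_tb * 1_000_000  # 1 TB = 1,000,000 MB (decimal units)
    aggregate_mb_per_sec = nodes * scan_mb_per_sec_per_node
    return total_mb / aggregate_mb_per_sec / 3600


if __name__ == "__main__":
    # Same 800 TB, two hypothetical layouts: a few fat boxes vs. many balanced servers.
    fat = full_scan_hours(800, nodes=8, scan_mb_per_sec_per_node=1000)
    balanced = full_scan_hours(800, nodes=200, scan_mb_per_sec_per_node=1000)
    print(f"8 storage-heavy boxes: {fat:.1f} hours per pass")
    print(f"200 balanced servers:  {balanced:.1f} hours per pass")
```

Real systems will not scan perfectly in parallel, so treat the output as a lower bound under these assumptions rather than a prediction.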
So going to the EMC/Greenplum news, I think such an acquisition makes sense, but in a specific way. It will lead to systems that live between storage and data warehousing, systems able to store data and also give the ability to retrieve it on an occasional basis or if the analysis required is trivial. But the problems Aster is excited about are those of advanced in-database analytics for rich, ad hoc querying, delivered through a full application environment inside an MPP database. It’s these problems that we see as opportunities to not only cut IT costs but also provide tremendous competitive advantages to our customers. And on that front, we promise to continue innovating and pushing the limits of technology as much as possible.
Recently, a journalist called to ask about in-memory data processing, a very interesting subject. I always thought that in-memory processing will be more and more important as memory prices keep falling drastically. In fact, these days you can get 128GB of memory into a single system for less than $5K plus the server cost, not to mention that DDR3 and multiple memory controllers are giving a huge performance boost. And if you run software that can handle shared-nothing parallelism (MPP), your memory cost increases linearly, and systems with TBs of memory are possible.
So what do you do with all that memory? There are two classes of use cases that are emerging today. First is the case where you need to increase concurrent access to data with reduced latency. Tools like memcached offer in-memory caching that, used properly, can vastly improve latency and concurrency for large-scale OLTP applications like websites. Also, the nice thing with object caching is that it scales well in a distributed way, and people have built TB-level caches. Memory-only OLTP databases have started to emerge, such as VoltDB. And memory is used implicitly as a very important caching layer in open-source key-value products like Voldemort. We should only expect memory to play a more and more important role here.
The second way to use memory is to gain “processing flexibility” when doing analytics. The idea is to throw your data into memory (however much of it fits, of course) without spending much time thinking about how to do that or what queries you’ll need to run. Because memory is so fast, most simple queries will run at interactive speeds, and concurrency is handled well. European upstart QlikView exploits this fact to offer a memory-only BI solution which provides simple and fast BI reporting. The downside is that it applies only to 10s of GBs of data, as Curt Monash notes.
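For readers who have not used an object cache, the usual pattern behind "used properly" is cache-aside: look in memory first and only go to the database on a miss. The sketch below is a minimal illustration of that pattern. `TinyCache`, `load_profile`, and `fake_db` are hypothetical names, and the cache is a plain in-process dictionary standing in for a memcached-style service, not a real client library.

```python
import time


class TinyCache:
    """In-process stand-in for a memcached-style cache (hypothetical, not a real client)."""

    def __init__(self):
        self._items = {}  # key -> (expiry_timestamp, value)

    def get(self, key):
        entry = self._items.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() >= expires_at:
            del self._items[key]  # lazily evict expired entries
            return None
        return value

    def set(self, key, value, ttl_seconds=60.0):
        self._items[key] = (time.monotonic() + ttl_seconds, value)


def load_profile(user_id, cache, fetch_from_db):
    """Cache-aside read: try memory first, fall back to the database on a miss."""
    key = f"profile:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return cached
    value = fetch_from_db(user_id)  # slow path: hits the OLTP database
    cache.set(key, value)
    return value


if __name__ == "__main__":
    db_calls = []

    def fake_db(user_id):  # hypothetical database lookup
        db_calls.append(user_id)
        return {"id": user_id, "name": f"user-{user_id}"}

    cache = TinyCache()
    load_profile(42, cache, fake_db)  # miss: goes to the database
    load_profile(42, cache, fake_db)  # hit: served from memory
    print(f"database calls: {len(db_calls)}")
```

In a distributed deployment the dictionary would be replaced by a pool of cache servers with keys hashed across them, which is what lets this kind of cache grow to the TB level.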
By exploiting an MPP shared-nothing architecture, Aster Data has production clusters with TBs of total memory. Our software takes advantage of memory in two ways: first, it uses caching aggressively to ensure the most relevant data stays in memory; and when data is in memory, processing is much faster and more flexible. Secondly, MapReduce is a great way to utilize memory as it provides full flexibility to the programmer to use memory-focused data structures for data processing. In addition, Aster Data’s SQL-MapReduce provides tools to the user to encourage the development of memory-only MapReduce applications.
However, one shouldn’t fall into the trap of thinking that all analytics will be in-memory anytime soon. While memory is down to $30/GB, disk manufacturers have been busy increasing platter density and dropping their price to less than $0.06/GB. Given that the amount of data in the world grows faster than Moore’s law and memory, there will always be more data to be stored and analyzed than what fits into any amount of memory that an enterprise can use. In fact, most big data applications will have data sets that do not fit into memory because, while tools like memcached worry only about the present (e.g. current Facebook users), analytics need to worry about the past, as well – and that means much more data. So a multi-layer architecture will be the only cost-effective way of analyzing large amounts of data for some time.
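The arithmetic behind the multi-layer argument is easy to sketch. The two per-GB prices below are the ones quoted above; the 100 TB data size and the "hot" fractions are hypothetical, picked only to show how quickly an all-memory design gets expensive compared with keeping a small working set in memory and everything on disk.

```python
# Prices quoted in the post; the data size and hot fractions are hypothetical.
MEMORY_USD_PER_GB = 30.0
DISK_USD_PER_GB = 0.06


def storage_cost_usd(data_gb, hot_fraction):
    """Cost of keeping `hot_fraction` of the data in memory and all of it on disk."""
    if not 0.0 <= hot_fraction <= 1.0:
        raise ValueError("hot_fraction must be between 0 and 1")
    memory_cost = data_gb * hot_fraction * MEMORY_USD_PER_GB
    disk_cost = data_gb * DISK_USD_PER_GB  # disk holds the full data set
    return memory_cost + disk_cost


if __name__ == "__main__":
    data_gb = 100_000  # a hypothetical 100 TB warehouse
    print(f"memory/disk price ratio: {MEMORY_USD_PER_GB / DISK_USD_PER_GB:.0f}x")
    for hot in (0.0, 0.05, 1.0):
        print(f"{hot:>4.0%} in memory: ${storage_cost_usd(data_gb, hot):,.0f}")
```

This ignores servers, power, and replication, so it is only a sketch of the raw media cost, but the roughly 500x price gap is what drives the layered design.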
One shouldn’t discuss memory without mentioning solid-state disk products (like Aster Data partner company Fusion-io). SSDs are likely to be the surprise here, given that their per-GB price is falling faster than that of disks (being a solid-state product that follows Moore’s law does help). In the next few years we’ll witness SSDs in read-intensive applications providing similar advantages to memory while accommodating much larger data sizes.
Netezza pre-announced last week that they will be moving to a new architecture - one based around IBM blades (Linux + Intel + RAM) with commodity SAS disks, RAID controllers, and NICs. The product will continue to rely on an FPGA, but that would sit much further from the disks & RAID controller, beyond the RAM but adjacent to the Intel CPU, in contrast to their previous product line.
In assembling a new hardware stack, Netezza describes this re-architecture as a change but not really a change - the FPGA will continue to offload data compression/decompression, selection and projection from the Intel CPU; the Intel CPU will be used to push down joins and group bys; the RAM will be used to enable caching (thus helping improve mixed workload performance).
I think this is a pretty significant change for Netezza.
Clearly, Netezza would not have invested in this change - assemble & ship a new hardware stack to share revenue with IBM vs. a 3rd party hardware assembler - if Netezza’s old FPGA-dominant hardware was not being out-priced and out-performed by our Intel-based commodity hardware.
It was a matter of time before the market realized that FPGAs had reached their end-of-life status in the data warehousing market. In seeing the writing on the wall, and responding to it early, Netezza has made a bold decision to change - and yet, clung to the warm familiarity of an FPGA as a “side car”.
Netezza, and the rest of the market, will soon become aware that a change in hardware stack is not a free lunch. The richness of CPU and RAM resources in an IBM commodity blade comes at a cost that a resource-starved FPGA-based architecture never had to account for.
In 2009, after having engineered its software for an FPGA over the last 9 years, Netezza will need to come to terms with commodity hardware in production systems and demonstrate that they can:
- Manage processes and memory spawned by a single query across 100s of blade servers
- Maintain consistent caches across 100s of blade servers - after all, it is Oracle’s Cache Fusion technology that is the bane of scaling Oracle RAC beyond 8 blade servers
- Tolerate the higher frequency of failures that a commodity Linux + RAID Controller/driver + Network driver stack incurs when put under rigorous data movement (e.g., allocation/de-allocation of memory contributing to memory leaks)
- Add a new IBM blade and ensure incremental scaling of their appliance
- Upgrade the software stack in place - unlike an FPGA-based hardware stack that customers are OK to floor-sweep in their upgrade
- Contain run-away queries from allocating the abundant CPU and RAM resources and starving other concurrent queries in the workload
- Reduce network traffic for a blade with 2 NICs that is managing 8 disks vs. a Power-PC/FPGA that had 1 NIC for 1 disk
- …
If you take a quick pulse of the market, apart from our known installations of 100+ servers, there is no other vendor - mature or new-age - who has demonstrated that 100s of commodity servers can be made to work together to run a single database.
And I believe that there is a fundamental reason for this lack of proof-point even a decade after Linux has matured and commodity servers have been used for computing - software not built from the ground up to leverage the richness and contain the limitations of commodity hardware is incapable of scaling. Aster nCluster has been built from the ground up to have these capabilities on a commodity stack. Netezza’s software written for proprietary hardware cannot be retrofitted to work on commodity hardware (else, Netezza would have completely taken the FPGAs out, now that they have powerful CPUs!). Netezza has its work cut out - they have taken a dramatic shift that has the ability to bring the company and its production customers to their knees. And therein lies Netezza’s challenge - they must succeed in supporting their current customers on an FPGA-based platform while moving resources to build out a commodity-based platform.
And we have not even touched upon the extension of SQL with MapReduce to power big data manipulation using arbitrary user-written procedures.
If a system is not fundamentally designed to leverage commodity servers, it’s only going to be a band-aid on seams that are bursting. Overall, we will curiously watch how long it takes Netezza to eliminate their FPGAs completely and move to a real commodity stack so that the customers can have the freedom to choose their own hardware and not be locked down to Netezza-supplied custom hardware.
The Aster SQL/MapReduce framework allows developers to push analytics code for applications closer to the data in the database, without dealing with the headaches of extracting and analyzing data outside of the database. We’ve supported a variety of languages from day one, including Java, Python, and Perl. Today we’re pleased to announce official support for the .NET family of languages via Mono, an excellent open source .NET implementation. This will allow developers who use .NET languages like C# and VB (and, of course, F#) to more easily leverage nCluster for massively parallel analytics.
Our .NET support is enabled through our Stream SQL/MR function, which allows users to process data via a simple streaming interface: provide a program that reads rows from the console (stdin) and writes rows back to the console (stdout). Let’s consider a simple C# program called Tokenize, which splits incoming rows into tokens and then outputs each token (one per line):
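As an illustration of that streaming contract (this is not the original Tokenize.cs listing), here is a minimal sketch of the same idea in Python, one of the other languages mentioned above. It assumes rows arrive one per line on stdin and that whitespace separates tokens; the function names are hypothetical, and nothing here uses an nCluster-specific API.

```python
import sys


def stream_tokens(lines):
    """Yield every whitespace-separated token from an iterable of input rows."""
    for line in lines:
        for token in line.split():
            yield token


def main():
    # Rows arrive on stdin, one per line; each token goes to stdout on its own line.
    for token in stream_tokens(sys.stdin):
        sys.stdout.write(token + "\n")


if __name__ == "__main__":
    main()
```

Because the program only talks to stdin and stdout, it can be tried locally by piping a text file into it before installing anything in a cluster.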
To run this program over data stored in nCluster, a developer just needs to compile the above Tokenize.cs into Tokenize.exe with a C# compiler (in our case, the Mono C# compiler gmcs). With the compiled executable in hand, one command in our terminal client will install it in nCluster. The program can be then invoked from SQL. The below example will run the program over all the rows in the documents table, outputting a table with a single column (token). Each row in the result of the query will correspond to a single token in the input documents.
It’s as simple as that. We hope our new .NET support will enable an ever-broader group of developers to take advantage of SQL/MR, our in-database analytics technology! If you’re interested in learning more, please check out a host of new resources around our implementation of MapReduce within Aster nCluster including example applications and code.
When Mayank, George and I were at Stanford, one of the things that brought us together was a shared vision of how the world could benefit from a more scalable database to address exploding volumes of data. This led to the birth of Aster Data Systems and our flagship product, Aster nCluster, a highly scalable relational database system for what we call “frontline” data warehousing – an intersection of large data volumes, rich analytics, and mission-critical availability.
One way we found to solve the problem of managing and analyzing so much data was by implementing In-Database MapReduce. MapReduce is a programming model that was popularized at Google in 2003 to process large unstructured data sets distributed across thousands of nodes, and at Stanford we worked with some of the professors who had worked with the Google founders. In-Database MapReduce enables enterprises to harness the power of MapReduce while managing their data in Aster nCluster. Just like its massively parallel execution environment for standard SQL queries, Aster nCluster adds the ability to implement flexible MapReduce functions for parallel data analysis and transformation inside the database.
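For readers new to the programming model, the sketch below shows the three steps (map, group by key, reduce) in a single Python process. It is only a toy illustration of the model itself: the function names are hypothetical, there is no parallelism or distribution, and it says nothing about how In-Database MapReduce is implemented inside nCluster.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run map, shuffle (group by key), and reduce over an in-memory list of records."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):  # map: one record -> zero or more (key, value) pairs
            groups[key].append(value)      # shuffle: gather values that share a key
    return {key: reducer(key, values) for key, values in groups.items()}


def count_words(line):
    """Mapper: emit (word, 1) for every word in a line of text."""
    for word in line.lower().split():
        yield word, 1


def total(_key, values):
    """Reducer: sum the counts collected for one word."""
    return sum(values)


if __name__ == "__main__":
    docs = ["big data big analytics", "data warehouse"]
    print(map_reduce(docs, count_words, total))
```

In a real deployment the map and reduce calls run in parallel on many nodes, and the grouping step moves data across the network; the appeal of the model is that the user-written mapper and reducer stay this small.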
Much of the work of the Aster Data team is “fusing” best practices from the relational database world with innovations that Google pioneered for distributed computing; this takes strong engineering, so it’s no wonder that we are an engineering-driven company with some of the best minds available on our team. Of our 26 engineers on staff, there are seven PhDs, and six PhDs on leave. Over time in this blog I plan to highlight the members of the Aster team that help make nCluster a reality.
One key member is Dr. Mohit Aron. Mohit is an architect, and his focus is on the distributed aspects of the nCluster architecture. His achievements include the delivery of several key projects at Aster, notably in areas related to quality of service, SQL/MR, compression, performance, and fault-tolerance.
Before joining Aster Data Systems, Mohit was a Staff Engineer at Google Inc., where he was one of the lead designers of the super-scalable, award-winning Google File System. Dr. Aron has held senior technical positions in industry where his work has focused on scalable cluster-based storage and database technologies. He received his B.Tech. degree from the Indian Institute of Technology, New Delhi and his M.S. and Ph.D. from Rice University, Houston. His graduate research focused on high performance networking and cluster-based web server systems. He was one of the primary contributors to the ScalaServer project and won numerous best paper awards at prestigious conferences.
I am also very glad today to announce that another key member of our organization, Dheeraj Pandey, has been promoted to VP of Engineering. He has been with Aster ever since September ‘07. Dheeraj has played an instrumental role in building this strong team together with me. He has been my alter ego all this while, as we shipped two major releases and four patchsets in the last 19 months. Beyond the tangibles, he has an acute focus on nurturing emotional intelligence within the engineering organization. Too many organizations, with strong technical mindsets, falter because people begin to underemphasize the value of honest communication, trust, and self-awareness. I am proud that we are building a culture, from very early on, which will endure the test of time as the company grows.
Dheeraj came to Aster from Oracle Corporation, where he managed the storage engine of the database. Under his leadership, Oracle built the unstructured data management stack, called Oracle SecureFiles, from the ground up. He also led the development of Oracle 11g Advanced Compression Option for both structured and unstructured data. Dheeraj has co-invented several patent-pending algorithms on database transaction management, Oracle Real Application Clusters, and Data Compression. Previously, he was building commodity-clustered fileservers at Zambeel. In the past 10 years of his industry career, he has developed diverse software ranging from midtier Java/COM applications to fileservers, databases, and firmware in storage switches. Dheeraj received an M.S. in Computer Science from The University of Texas (Austin), where he was a doctoral fellow. He received a B.Tech. in Computer Science from IIT Kanpur, where he was judged the “Best All-Rounder Student Among All Graduating Students in All Disciplines.”
I am confident that, as an innovation-driven company, we are placing one of our most critical functions, Engineering, in very safe hands.
I hope you continue to watch this space for updates on Aster, our products, and our people.
MySpace decided to support one of its most important product launches of 2008 with an expansion of its Aster data warehouse. The data that would be collected would be used to provide information on trends in media and current interests on MySpace. The go-live date was October 2008.
MySpace planned for the data warehouse right from the inception of the project to ensure that reporting was considered a first-class citizen in the overall launch process, rather than a post-launch activity. The result was that the data warehouse was up and running to receive the usage streams, even during a private beta release period, giving the warehousing team the necessary time to prepare for the onslaught of data that would result after the public release.
In fact, there is a very interesting incident that happened on the day of the new MySpace product launch. At about 7am, one of the servers in the Aster nCluster data warehouse failed. The failure was detected by our support team - and no scrambling ensued. Aster nCluster detected and isolated the failure, continuing to run the service with n-1 nodes without a blip and with minimal performance change! Later, after the initial tense moments were behind us, the MySpace operations team walked over and replaced the failed hardware. The Aster database administrator then pressed a single button to bring the node back into the nCluster data warehouse - the database continued to hum away with zero downtime.
The power of “Always-On”!
We will be co-hosting a case study by MySpace on their use of Aster at the Gartner BI Summit next week in National Harbor, MD on March 11. If you’ll be at the event, please come by to hear what Hala has to say about their use of Aster to support their mission-critical operations at MySpace across multiple functions and departments.
Their Aster enterprise data warehouse supports frontline applications (e.g., MySpace TV, MySpace Video, etc.), as well as online marketing, sales, IT, finance, international, and legal. MySpace is also planning to incorporate data from Aster into a balanced scorecard for strategic alignment of the business around key performance indicators, as well as other future projects.
Some highlights from the video for folks who would rather read:
MySpace got up and running with Aster quickly
“We were able to bring that up online and actually start processing the data into it within a matter of weeks, and I think very few technologies give you the ability to do something like that.”
– Hala Al-Adwan, VP of Data Services at MySpace
Aster is mission-critical to MySpace
“With Aster, what we’ve been able to produce with commodity hardware has been a supercomputer-like infrastructure …the data that we collect and process is absolutely critical to the success of MySpace.”
– Bita Mathews, Data Warehouse Manager, MySpace
“Right now our key business performance metrics are all powered out of the Aster system. If somebody went and shut it down, none of that would be available. I think in a lot of ways, we were lacking that data before, and now that we’re used to having it, people are just hungry for more and more information. So if all that went away, I think it’s kinda like going back to an age where there was no light.”
– Hala Al-Adwan
MySpace’s data warehouse with Aster is extremely reliable
“Aster is always on and available. And this is very amazing thing about Aster, because it’s massive. There’s a lot of hardware underneath the system. When hardware fails, we can continue working, and although we know some engineers are fixing hardware, but that doesn’t stop us from continuing to run queries and producing our reports.”
– Anna Dorofiyenko, Data Architect, MySpace
Aster is the blueprint for successful data warehouse deployments going forward
“Integrating Aster and including them from the very beginning in the MySpace Music project … from beginning to end is what allowed that to be the most successful data warehouse implementation we’ve had to date, and I think we should definitely use it as a blueprint for any future implementations we do.”
– Christa Stelzmuller, Chief Data Architect, MySpace
As the Director of Technology Delivery for Aster Data Systems, I oversee the teams responsible for delivering and deploying our nCluster analytic database to customers and enabling prospective customers to evaluate our solutions effectively and efficiently. Recently, Shawn posted on the release of Aster nCluster Cloud Edition and discussed how cloud computing enables businesses to scale their infrastructure without huge hardware investments. As a follow-on, I’d like to let you know about how the flexibility provided by nCluster’s support of multiple platforms can reduce the time and costs associated with evaluating nCluster.
Evaluating enterprise software can be a costly effort in both time and money. The process typically requires weeks of prep work by the evaluation team, possibly including purchasing different hardware for each vendor being evaluated. Spending significant amounts of money and losing weeks of resource productivity to an evaluation is something few companies can afford to do, particularly in these uncertain times.
With our recent public release of Aster nCluster Cloud Edition, we now provide the most platform options of any major data warehouse vendor. While it’s natural to focus on the flexibility this affords for production systems, it also allows us to be very flexible for enabling customers to try our solution:
Commodity Hardware Evaluation
Several warehouse vendors claim to support commodity hardware, but most are very closely tied to one “preferred” vendor. Aster nCluster supports any x86-based hardware, meaning that you can evaluate us on either new hardware (if performance is a key aspect of the evaluation) or older hardware that is being repurposed (if you want to test the functionality of nCluster without buying new hardware).
Aster-Hosted Evaluation
Our data center in San Carlos, CA has racks of servers dedicated to customer evaluations. With an Aster-hosted system, functional evaluations of nCluster can be performed with minimum infrastructure requirements.
Cloud Evaluation
With Aster nCluster Cloud Edition, custom-configured nClusters can be brought up in minutes on either Amazon EC2 or AppNexus. POCs can be performed on one or multiple systems in parallel, with zero infrastructure requirements. Your teams can evaluate all of nCluster’s functionality in the cloud, with complete control over sizing and scaling. (While other vendors have announced cloud offerings, we’re the only data warehouse vendor to have production customers on two separate cloud services).
Whether you’re building a new frontline data warehouse or looking to replace an existing system that doesn’t scale or costs too much, you should check us out. We have a great product that’s turning heads as an alternative to overpriced hardware appliances for multi-TB data warehouses. With all the flexibility our offerings provide, you can evaluate all the power of Aster nCluster without the costs of traditional POCs.
Give us a try and see everything you can do with Aster nCluster!
If you had told me that Aster Data Systems would be referenced by Gartner in the Magic Quadrant for Data Warehouse Database Management Systems report just 6 months after coming out of stealth mode, I would have said you were pretty ambitious. There are some pretty high criteria for being placed, such as having a generally-available product for more than one year, as well as over 10 customers in production.
Well, we were included. Even if it was only a mention. And I’m proud to say that as of the publishing date, I believe the only criterion that held us back from being placed on it was having had a product available for less than one year.
Although Aster’s high-performance analytic database, Aster nCluster, has been generally available since May - we’re actually on the third major release tested in customer environments. We were busy building out its functionality and testing Aster nCluster in some pretty rigorous frontline data warehouses such as Aggregate Knowledge and MySpace, where we have strict requirements for scale (100+ terabytes) and uptime (no unplanned downtime for more than one year).
2008 was a break-out year for Aster. 2009 will be even more exciting as we continue to set new limits for what customers expect of a relational database management system.
I suppose the next thing you’ll tell me is that we should be placed in the Leaders or Visionaries Quadrants on the chart? That’s ambitious. But it’s just the sort of thing this team and product are capable of.
The Magic Quadrant is copyrighted 2008 by Gartner, Inc. and is reused with permission. The Magic Quadrant is a graphical representation of a marketplace at and for a specific time period. It depicts Gartner’s analysis of how certain vendors measure against criteria for that marketplace, as defined by Gartner. Gartner does not endorse any vendor, product or service depicted in the Magic Quadrant, and does not advise technology users to select only those vendors placed in the “Leaders” quadrant. The Magic Quadrant is intended solely as a research tool, and is not meant to be a specific guide to action. Gartner disclaims all warranties, express or implied, with respect to this research, including any warranties of merchantability or fitness for a particular purpose.
I recently wrote an article with Enterprise Systems Journal on how retailers can benefit from innovations in “always on” databases. This story is based on the real-life experiences of one of our customers during the last holiday shopping season. They saw a spike in traffic and quickly scaled their frontline data warehouse built with Aster Data Systems. This allowed them to maintain the service level agreements with the business for product cross-promotion without needing to wait to plan an upgrade in the off-season.
After next week, let us know - how did your favorite e-tailers fare on ‘Black Friday’ and ‘Cyber Monday’? Did their recommendation engines deliver, or did they leave cash in your wallet that got spent elsewhere?
I’m pleased to have been given the opportunity to introduce this new approach to in-database analytics and parallel data processing, in general. The most consistent feedback I had was that there wasn’t enough time to cover this topic in-depth, and attendees were eager to learn more! In that case, my previous post on educational resources for MapReduce may be of interest.
Since Aster Data Systems introduced In-Database MapReduce for the Aster nCluster relational database, there has been tremendous interest in the data warehousing and technology community, with recent coverage in the NY Times and by influential blogs like DBMS2, Beyond Search, and Cloud N, just to name a few.
Hopefully TDWI will turn this into a 1/2-day course in the future. (If you agree, feel free to contact them at info@tdwi.org).
If anyone knows of other good resources on this emerging topic, please feel free to put links in the comments here.