Watching our customers use Aster Data to discover new insights and build new big data products is one of the most satisfying parts of my job. Having seen this process a few times, I found that it always has the same steps:
An Idea or Concept – Someone comes up with an idea of a treasure that could be hidden in the data, e.g. a new customer segment that could be very profitable, a new pattern that reveals novel cases of fraud, or other event-triggered analysis.
Dataset – An idea based on data that doesn’t exist is like a great recipe without the ingredients. Hopefully the company has already deployed one or more big data repositories that have the necessary data in full detail (no summaries, sampling, etc). If that’s not the case, data has to be generated, captured and moved to a big data-analytics server, which is an MPP database with a fully integrated analytics engine, like Aster Data’s solution. It addresses both parts of the big data need – scalable data storage and data processing.
Iterative Experimentation – This is the fun part. In contrast to traditional reporting, where the idea translates almost automatically to a query or report (e.g.: I want to know average sales per store for the past 2 years), a big data product idea (e.g.: I want to know what is my most profitable customer segment) requires building an intuition about the data before coming up with the right answer. This can only be achieved by a large number of analytical queries using either SQL or MapReduce, and it’s the step where the analyst or data scientist builds their intuition and understanding of the dataset and of the hidden gems buried there.
Data Productization – Once iterative experimentation provides the data scientist with evidence of gold, the next step is to make the process repeatable so that its output can be systematically used by humans (e.g. marketing department) or systems (e.g. a credit card transaction clearing system that needs to identify fraudulent transactions). This requires not only a repeatable process but also data that’s certified to be of high quality and processing that can meet specific SLAs, all while using a hybrid of SQL and MapReduce for deep big data analysis.
If you think about it, this process is similar to the process of coming up with a new product (software or otherwise). You start with an idea, you then get the first material and build a lot of prototypes. I’ve found that people who find an important and valuable data insight after a process of iterative experimentation feel the same satisfaction as an inventor who has just made a huge discovery. And the next natural step is to take that prototype, make it a repeatable manufacturing process and start using it in the real world.
In the “old” world of simple reporting, the process of creating insights was straightforward. Correspondingly, the value of the outcome (reports) was much lower and easily replicable by everyone. Big Data Analytics, on the other hand, requires a touch of innovation and creativity, which is exactly why it is hard to replicate and why its results produce such important and sustainable advantages for businesses. I believe that Big Data Products are the next wave of corporate value creation and competitive differentiation.
I have always enjoyed the subtle irony of someone trying to be impressive by saying “my data warehouse is X Terabytes” [muted: “and it’s bigger than yours”]! Why is this ironic? Because it describes a data warehouse, which is supposed to be all about data processing and analysis, using a storage metric. Having an obese 800-Terabyte system that may take hours or days to just do a single pass over the data is not impressive and definitely calls for some diet.
Surprisingly though, several vendors went down the path of making their data warehousing offerings fatter and fatter. Greenplum is a good example. Prior to Sun’s acquisition by Oracle, they were heavily pushing systems based on the Sun Thumper, a 48-disk-heavy 4U box that can store up to 100TBs/box. I was quite familiar with that box as it partly came out of a startup called Kealia that my Stanford advisor, David Cheriton, and Sun co-founder Andy Bechtolsheim had founded and then sold to Sun in 2004. I kept wondering, though, what a 50TB/CPU configuration has to do with data analytics.
After long deliberation I came to the conclusion that it has nothing to do with it. There were two reasons why people were interested in this configuration. First, there were some use cases that required “near-line storage”, a term that’s used to describe a data repository whose major purpose is to store data but also allows for basic & infrequent data access. In that respect, Greenplum’s software on top of the Sun Thumpers represented a cheap storage solution that offered basic data access and was very useful for applications where processing or analytics was not the main focus.
The second reason for the interest, though, is a tendency to drive DW projects towards an absolute low per-TB price to reduce costs. Experienced folks will recognize that such an approach leads to disaster, because (as mentioned above) analytics is more than just Terabytes. Perfectly low per-TB price using fat storage looks great on glossy paper but in reality it’s no good because nobody’s analytical problems are that simple.
The point here is that analytics has more to do with processing than with storage. It requires a fair number of balanced servers (thus good scalability & fault tolerance), CPU cycles, networking bandwidth, smart & efficient algorithms, fair amounts of memory to avoid thrashing etc. It’s also about how much processing can be done by SQL, and how much of your analytics needs to use next-generation interfaces like MapReduce or pre-packaged in-database analytical engines. In the new decade on which we’re embarking, solving business problems like fraud, market segmentation & targeting, financial optimization, etc., requires much more than just cheap, overweight storage.
So going to the EMC/Greenplum news, I think such an acquisition makes sense, but in a specific way. It will lead to systems that live between storage and data warehousing, systems able to store data and also give the ability to retrieve it on an occasional basis or if the analysis required is trivial. But the problems Aster is excited about are those of advanced in-database analytics for rich, ad hoc querying, delivered through a full application environment inside an MPP database. It’s these problems that we see as opportunities to not only cut IT costs but also provide tremendous competitive advantages to our customers. And on that front, we promise to continue innovating and pushing the limits of technology as much as possible.
There is a lot of talk these days about relational vs. non-relational data. But what about analytics? Does it make sense to talk about relational and non-relational analytics?
I think it does. Historically, a lot of data analysis in the enterprise has been done with pure SQL. SQL-based analysis is a type of “relational analysis,” which I define as analysis done via a set-based declarative language like SQL. Note how SQL treats every table as a set of values; SQL statements are relational set operations; and any intermediate SQL results, even within the same query, need to follow the relational model. All these are characteristics of a relational analysis language. Although recent SQL standards define the language to be Turing Complete, meaning you can implement any algorithm in SQL, in practice implementing any computation that departs from the simple model of sets, joins, groupings, and orderings is severely sub-optimal, in terms of performance or complexity.
On the other hand, an interface like MapReduce is clearly non-relational in terms of its algorithmic and computational capabilities. You have the full flexibility of a procedural programming language, like C or Java; MapReduce intermediate results can follow any form; and the logic of a MapReduce analytical application can implement almost arbitrary formations of code flow and data structures. In addition, any MapReduce computation can be automatically extended to a shared-nothing parallel system, which implies the ability to crunch large amounts of data. So MapReduce is one version of “non-relational” analysis.
So Aster Data’s SQL-MapReduce becomes really interesting if you see it as a way of doing non-relational analytics on top of relational data. In Aster Data’s platform, you can store your data in a purely relational form. By doing that, you can use popular RDBMS mechanisms to achieve things like adherence to a data model, security, compliance, integration with ETL or BI tools etc. The similarities, however, stop there, because you can then use SQL-MapReduce to do analytics that were never possible before in an RDBMS: they are MapReduce-based and non-relational, and they extend to TBs or PBs. And that includes a large number of analytical applications like fraud detection, network analysis, graph algorithms, data mining, etc.
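To make the contrast concrete, here is a minimal sketch of the map, shuffle and reduce steps in plain Python. It is not Aster Data's SQL-MapReduce API or any real framework; the helper names (`map_phase`, `shuffle`, `reduce_phase`), the click records and the 30-minute session gap are all assumptions made for illustration. The reducer walks each user's clicks in time order and keeps running state, which is the kind of procedural logic that is awkward to express with sets, joins and groupings alone.

```python
from collections import defaultdict

def map_phase(records, mapper):
    """Apply the mapper to every record; the mapper yields (key, value) pairs."""
    for record in records:
        yield from mapper(record)

def shuffle(pairs):
    """Group values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    """Run the reducer once per key over all values collected for that key."""
    return {key: reducer(key, values) for key, values in groups.items()}

def click_mapper(record):
    """Hypothetical click record: (user, timestamp in seconds)."""
    user, timestamp = record
    yield user, timestamp

def count_sessions(user, timestamps, gap=1800):
    """Procedural reducer: a new session starts after `gap` seconds of inactivity."""
    sessions = 0
    last = None
    for ts in sorted(timestamps):
        if last is None or ts - last > gap:
            sessions += 1
        last = ts
    return sessions

def run(records):
    return reduce_phase(shuffle(map_phase(records, click_mapper)), count_sessions)

# Made-up sample data, for illustration only.
clicks = [("ann", 0), ("bob", 100), ("ann", 600), ("ann", 5000), ("bob", 200)]
print(run(clicks))  # {'ann': 2, 'bob': 1}
```

Because each key is reduced independently, the per-user work could in principle be spread across the nodes of a shared-nothing system, which is the property the paragraph above relies on.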
Recently, a journalist called to ask about in-memory data processing, a very interesting subject. I always thought that in-memory processing will be more and more important as memory prices keep falling drastically. In fact, these days you can get 128GB of memory into a single system for less than $5K plus the server cost, not to mention that DDR3 and multiple memory controllers are giving a huge performance boost. And if you run software that can handle shared-nothing parallelism (MPP), your memory cost increases linearly, and systems with TBs of memory are possible.
So what do you do with all that memory? There are two classes of use cases that are emerging today. First is the case where you need to increase concurrent access to data with reduced latency. Tools like memcached offer in-memory caching that, used properly, can vastly improve latency and concurrency for large-scale OLTP applications like websites. Also the nice thing with object caching is that it scales well in a distributed way and people have built TB-level caches. Memory-only OLTP databases have started to emerge, such as VoltDB. And memory is used implicitly as a very important caching layer in open-source key-value products like Voldemort. We should only expect memory to play a more and more important role here.
The second way to use memory is to gain “processing flexibility” when doing analytics. The idea is to throw your data into memory (however much it fits, of course) without spending much time thinking how to do that or what queries you’ll need to run. Because memory is so fast, most simple queries will be executed at interactive times and also concurrency is handled well. European upstart QlikView exploits this fact to offer a memory-only BI solution which provides simple and fast BI reporting. The downside is its applicability to only 10s of GBs of data as Curt Monash notes.
By exploiting an MPP shared-nothing architecture, Aster Data has production clusters with TBs of total memory. Our software takes advantage of memory in two ways: first, it uses caching aggressively to ensure the most relevant data stays in memory; and when data is in memory, processing is much faster and more flexible. Secondly, MapReduce is a great way to utilize memory as it provides full flexibility to the programmer to use memory-focused data structures for data processing. In addition, Aster Data’s SQL-MapReduce provides tools to the user to encourage the development of memory-only MapReduce applications.
However, one shouldn’t fall into the trap of thinking that all analytics will be in-memory anytime soon. While memory is down to $30/GB, disk manufacturers have been busy increasing platter density and dropping their price to less than $0.06/GB. Given that the amount of data in the world grows faster than Moore’s law and memory, there will always be more data to be stored and analyzed than what fits into any amount of memory that an enterprise can use. In fact, most big data applications will have data sets that do not fit into memory because, while tools like memcached worry only about the present (e.g. current Facebook users), analytics need to worry about the past, as well – and that means much more data. So a multi-layer architecture will be the only cost-effective way of analyzing large amounts of data for some time.
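The arithmetic behind this argument can be sketched in a few lines of Python. The two per-GB prices are the ones quoted above; the 100 TB data set and the 5% "hot" share kept in memory are assumed figures chosen only to illustrate the shape of the trade-off. Prices are held in cents so the arithmetic stays in exact integers.

```python
MEMORY_CENTS_PER_GB = 3000  # $30/GB, the memory price quoted in the post
DISK_CENTS_PER_GB = 6       # $0.06/GB, the disk price quoted in the post

def storage_cost_dollars(total_gb, hot_percent):
    """Cost of keeping all data on disk plus hot_percent of it cached in memory."""
    hot_gb = total_gb * hot_percent // 100
    cents = hot_gb * MEMORY_CENTS_PER_GB + total_gb * DISK_CENTS_PER_GB
    return cents / 100

total_gb = 100 * 1024  # assumption: a 100 TB data set

price_ratio = MEMORY_CENTS_PER_GB // DISK_CENTS_PER_GB
all_in_memory = storage_cost_dollars(total_gb, 100)
tiered = storage_cost_dollars(total_gb, 5)  # assumption: 5% of the data is hot

print(price_ratio)    # 500
print(all_in_memory)  # 3078144.0
print(tiered)         # 159744.0
```

Under these assumptions memory is 500 times more expensive per GB than disk, and the tiered layout costs roughly a twentieth of holding everything in memory.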
One shouldn’t be discussing memory without mentioning solid-state disk products (like Aster Data partner company Fusion-io). SSDs are likely to be the surprise here given that their per-GB price is falling faster than that of disks (being a solid-state product that follows Moore’s law does help). In the next few years we’ll witness SSDs in read-intensive applications providing similar advantages to memory while accommodating much larger data sizes.
It has been a few weeks since we announced the Aster Analytics Center, so I think this is a good time to shed a little more light on what we are doing. Our goal is to make analytical work easier and faster to do on many types of data sets. We have already worked closely with many customers to architect solutions that solve their analytics challenges: fraud detection; complex security analysis to detect communication anomalies; graph analysis for social networks.
As part of the center, we are building an analytics infrastructure to make advanced analytics readily accessible to anyone using Aster Data. This includes making use of our SQL-MapReduce interface to do analysis that can’t easily be expressed in SQL, and often leads to huge performance gains. In addition, we are releasing a suite of functions built on Aster’s API for MapReduce that allows for easy invocation from within SQL. The suite includes, for example, novel tools to do sequence analysis, which is very useful for anyone trying to do pattern analysis. It’s important to note that many of our customers are already writing their own applications using this API and it’s really straightforward to get started. Incidentally, development for our Java API has just become very easy with our new SDK that uses a plug-in for Eclipse. Also, we are actively developing partnerships with analytic functions and solution providers.
I’d like to provide a brief background on why I’m so excited about what Aster is enabling and how this is indicative of a significant shift in how companies use and analyze their data. I first encountered Aster Data when I was at LinkedIn building analytically driven products with the large data sets that LinkedIn has amassed. Our team faced severe limitations with our standard warehouse, but with the introduction of the MPP Aster system we were suddenly able to analyze data much faster. Analyses that previously took 10 hours to run could suddenly run in 5 minutes. Our ability to think of an idea and get answers was no longer limited by the constraints of the equipment we owned but was instead bottlenecked by how quickly we could think. With a 10-hour wait time you frequently forgot what you were working on, or the stakeholder had moved on without doing a proper analysis. If you made a mistake or wanted to tweak your query you had to wait another 10 hours. With the Aster-enabled approach to analytic development, however, a whole new way of thinking emerged and we started to perform analyses we hadn’t even thought were possible. Having the ability to quickly iterate on an idea is invaluable when solving problems – the answers we got back helped guide business decisions and enabled better products on LinkedIn.
As a customer I worked directly with the Aster team on a number of problems and was amazed by their depth of knowledge of the challenges analytics practitioners face and their ability to innovate. Since joining the team, I’ve been pleased by Aster’s strong commitment to make analytics accessible to all. A scalable system that can do more with data will unleash a whole new set of capabilities for enterprises. I’m very excited that the field team has grown and we have attracted top-talent like ex-particle physicist Puneet Batra and data mining experts like Qi Su. Ajay Mysore, another member of the team, conducted master’s research on clustering algorithms. Our team lives and breathes data and is always ready for new challenges. Right now the field of analytics is undergoing a renaissance and it’s exciting to be working with a leader in the field of big data and advanced analytics.
I’ve remarked in an earlier post that the usage of data is changing and new applications are on the horizon. In the last couple of years, we’ve observed interesting design patterns for business processes that use data.
In a previous post, I outlined a design pattern that we call “The Automated Feedback Loop.” In this post, I want to outline a design pattern that we call “The (Iterative) Analytics Data Warehouse”.
The traditional well-understood design pattern of a data warehouse is a central (for Enterprise Data Warehouse) or departmental (for Data Marts) repository of data. Data is fed into the warehouse from ETL processes that pull data from a variety of sources. The data is organized in a data model that caters to 3 use-cases of the warehouse:
Reports - A set of BI queries are run with regular frequency to monitor the state of the business. The target of the reports are business users who want to understand what happened. The goal is to keep them in touch with the pulse of the business.
Exports - A set of export jobs are run with adhoc frequency to provide data sets for further analysis. The target of the exports are business analysts who want to optimize business practices. The goal is to provide them with true, quality-stamped data so that they can make confident optimization recommendations.
Adhoc - A set of queries are run with adhoc frequency to detect or verify patterns that influence business events. The source of the queries are data scientists who want to understand and optimize business practices. The goal is to provide them with computation capabilities (good query interfaces, enough processing, memory and storage resources) to allow them to interact with the data.
The exports and adhoc tasks are transient tasks. Once the data analysts or data scientists find a pattern valuable to the business, that pattern is incorporated into a report so that business users can monitor that pattern on a frequent repeatable practice.
In a typical data warehouse, the bulk of tasks (~80%) are from [1] Reports. The remainder of 20% is from [2] Exports and [3] Adhoc.
Since Reports are frequent and generate known queries, the design of the data warehouse is done to cater to reporting. This includes data models, indexes, materialized views or derived tables – and other optimizations – to make the known Reporting queries go fast.
This Monday we announced a new web destination for MapReduce, MapReduce.org. At a high level, this site is the first consolidated source of information & education around MapReduce, the groundbreaking programming model which is rapidly revolutionizing the way people deal with big data. Our vision is to make this site the one-stop-shop for anyone looking to learn how MapReduce can help analyze large amounts of data.
There were a couple reasons why we thought the world of big data analytics needed a resource like this. First, MapReduce is a relatively new technology and we are constantly getting questions from people in the industry wanting to learn more about it, from basic facts to using MapReduce for complex data analytics at Petabyte scale. By placing our knowledge and references in one public destination, we hope to build a valuable self-serve resource to educate many more people than what we could ever reach directly. In addition, we were motivated by the fact that most MapReduce resources out there focus more on specific implementations of MapReduce, which fragments the available knowledge and reduces its value. In this new effort we hope to create a multi-vendor & multi-tool resource which will benefit anyone interested in MapReduce.
We’re already working with analysts such as Curt Monash, Merv Adrian, Colin White and James Kobielus to syndicate their MapReduce-related posts. Going forward, we expect even more analysts, bloggers, practitioners, vendors, and academics to contribute. If traffic grows like we expect, we may eventually add a community forum to aid in interaction and sharing of knowledge and best practices.
I hope you enjoy surfing this new site! Feel free to email me with any suggestions as we work to make MapReduce.org more useful for you.
In the last few years there has been a significant amount of market pickup, from users and vendors, on data clouds and advanced analytics – specifically a new class of data-driven applications run in a data cloud or on-premise. What’s different about this from past approaches is the frequency and speed at which these applications are accessed, the depth of the analysis, the number of data sources involved and the volume of data mined by these applications – terabytes to petabytes. In the midst of this cacophony of dialogue, recent announcements from vendors in this space are helping to clarify different visions and approaches to the big data challenge.
Both Aster Data and Greenplum made announcements this week that illustrated different approaches. At the same time that Aster Data announced the Aster Analytics Center, Greenplum announced an upcoming product named Chorus. I wanted to take a moment to compare and contrast what these announcements say about the direction of the two companies.
Greenplum’s approach speaks to two traditional problem areas i) access to data, from provisioning of data marts to connectivity to data across marts, and ii) some level of collaboration among certain developers and analysts. Their approach is to create a tool for provisioning, unified data access, and sharing of annotations and data among different developers and analysts. Interestingly, this is not an entirely new concept; these are well-known problems for which a number of companies and tools have already developed best-of-breed solutions over the last 15 years. For example, the capabilities for data access are another version of Export/Copy primitives that already exist in all databases and that have been built upon by common ETL and EII tools for cases in which richer support than Export & Copy are needed – for instance, when data has to be transformed, correlated or cleaned while being moved from one context (mart) to another (mart).
This approach is indicative of a product direction in which the primary focus is on adding another option to the list of tools available to customers to address these problems. It’s really not a ground-breaking innovation that evolves the world of analytics. New types of analytics, or ‘data-driven applications,’ is where the enormous opportunity lies. The Greenplum approach of data collaboration is interesting in a test environment or sandbox. When it comes to real production value however, it effectively increases the functions available to the end user, but at a big cost due to significant increases in complexity, security issues and extra administrative overhead. What does this mean exactly?
The spin-up of marts and moving data around can result in “data sprawl” which ultimately increases administrative overhead and is dangerous in these days of compliance and sensitivity to privacy and data leaks.
Adding a new toolset into the data processing stack creates difficult and painful work to either manage and administer multiple tool sets for similar purposes or to eliminate and transition away from investments in existing toolsets.
To enable effective communication and sharing, users need strong processes and features for source identification of data, data collection, data transformation, rule administration, error detection & correction, data governance and security. The quality and security policies around meta-data are especially important as free-form annotations can lead to propagation of errors or leaks in the absence of strong oversight.
In contrast, Aster Data’s recent announcements support our long-standing investments in our unique advanced in-database architecture where applications run fully inside Aster Data’s platform with complete application services essential for complex analytic applications. The announcements highlight that our vision is not to create a new set of tools and layers in the data stack that recreate capabilities currently available from a number of leading vendors, but rather to deliver a new Analytics Platform, a Data-Application Server, to uniquely enable analytics professionals to create data-rich applications that were impossible or impractical before – namely, to create and use advanced analytics for rich, rapid, and scalable insights into their data. This focus is complemented by our partners, who offer proven best-of-breed solutions for collaboration and data transformation.
Today Aster took a significant step and made it easier for developers building fraud detection, financial risk management, telco network optimization, customer targeting and personalization, and other advanced, interactive analytic applications.
Along with the release of Aster Data nCluster 4.5, we added a new Solution Partner level for systems integrators and developers.
Why is this relevant?
Recession or no-recession, IT executives are constantly challenged. They are asked to execute strategies based on better analytics and information to improve effectiveness of business processes (customer loyalty, inventory management, revenue optimization, ..), while staying on top of technology-based disruptions and managing (shrinking or flat) IT budgets.
IT organizations have taken on the challenge by building analytics-based offerings leveraging existing data management skills and increasingly taking advantage of MapReduce, a disruptive technology introduced by Google and now being rapidly adopted by mainstream enterprise IT shops in Finance, Telco, LifeSciences, Govt. and other verticals.
As MapReduce and big data analytics go mainstream, our customers and ecosystem partners have asked us to make it easier for their teams to leverage MapReduce across enterprise application lifecycles, while harvesting existing IT skills in SQL, Java and other programming languages. The Aster development team that brought us the SQL-MapReduce® innovation has now delivered the market’s first integrated visual development environment for developing, deploying and managing MapReduce and SQL-based analytic applications.
Enterprise MapReduce developers and system integrators can now leverage the integrated Aster platform and deliver compelling business results in record time.
We are also teaming up with leaders in our ecosystem like MicroStrategy to deliver an end-to-end analytics solution to our customers that includes SQL/MapReduce enabled reporting and rich visualization. Aster is proud to be driving innovation in the Analytics and BI market and was recently honored at MicroStrategy’s annual customer conference.
I am delighted with the rapid adoption of Aster Data’s platform by our partners and the strong continued interest from enterprise developers and system integrators in building big data applications using Aster. New partners are endorsing our vision and technical innovation as the future of advanced analytics for large data volumes.
Sign up today to be an Aster solution partner and join the revolution to deliver compelling information and analytics-driven solutions.
When we announced Aster nCluster’s In-Database MapReduce feature last year, many people were intrigued by the new analytics they would be able to do in their database. However, In-Database MapReduce is new and often loaded with a lot of technical discussion on how it’s different from PL/SQL or UDFs, whether it’s suitable for business analysts or developers, and more. What people really want to know is how businesses can take advantage of MapReduce.
I’ve referred to how our customers use In-Database MapReduce (and nPath) for click-stream analytics. In our “MapReduce for Data Warehousing and Analytics” webinar last week, Anand Rajaraman covered several other example applications. Rajaraman is CEO and Founder of Kosmix and Consulting Assistant Professor in the Computer Science Department at Stanford University (full disclosure: Anand is also on the Aster board of directors). After spending some time discussing graphing, i.e. finding the shortest path between items, Rajaraman discussed applications in finance, behavioral analytics, text, and statistical analysis that can be easily completed with In-Database MapReduce but are difficult or impossible with SQL alone.
As Rajaraman says, “We need to think beyond conventional relational databases. We need to move on to MapReduce. And the best way of doing that is to combine MapReduce with SQL.”
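As an illustration of the shortest-path example mentioned above, the sketch below runs repeated map/reduce rounds in plain Python over a small unweighted, directed graph. It is an assumption-laden toy, not the code from the webinar or Aster's In-Database MapReduce interface: the function names and the sample graph are hypothetical. In each round, the map step has every reachable node propose "my distance plus one" to its neighbors, and the reduce step keeps the minimum proposal per node; rounds repeat until nothing changes.

```python
INF = float("inf")

def bfs_round(graph, dist):
    """One map/reduce round over the current distance table."""
    # Map: every node re-emits its own distance and, if reachable,
    # proposes distance + 1 to each of its neighbors.
    emitted = []
    for node, d in dist.items():
        emitted.append((node, d))
        if d != INF:
            for neighbor in graph.get(node, []):
                emitted.append((neighbor, d + 1))
    # Reduce: keep the smallest distance proposed for each node.
    new_dist = {}
    for node, d in emitted:
        new_dist[node] = min(d, new_dist.get(node, INF))
    return new_dist

def shortest_paths(graph, source):
    """Iterate rounds until the distance table stops changing."""
    nodes = set(graph) | {n for neighbors in graph.values() for n in neighbors}
    dist = {n: INF for n in nodes}
    dist[source] = 0
    while True:
        new_dist = bfs_round(graph, dist)
        if new_dist == dist:
            return dist
        dist = new_dist

# Hypothetical adjacency list: node -> nodes it links to.
graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"]}
result = shortest_paths(graph, "a")
print(sorted(result.items()))
# [('a', 0), ('b', 1), ('c', 1), ('d', 2), ('e', 3)]
```

The loop that repeats until convergence is the part that a single set-based SQL statement does not express naturally, which is one reason graph problems are a common MapReduce example.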