Speaking of ending things on a high note, New York City on December 6th will play host to the final event in the Big Analytics 2013 Roadshow series. Big Analytics 2013 New York is taking place at the Sheraton New York Hotel and Towers in the heart of Midtown on bustling 7th Avenue.
As we reflect on the illustrious journey of the Big Analytics 2013 Roadshow – kicking off in San Francisco, traveling through major international destinations including Atlanta, Dallas, Beijing, Tokyo and London, and finally culminating in the Big Apple – it is clear the series truly encapsulated today’s appetite for collecting, processing, understanding and analyzing data.
Big Analytics Roadshow 2013 stops in Atlanta
Drawing business & technical audiences from across the globe, the roadshow afforded attendees an opportunity to learn more about the convergence of technologies and methods like data science, digital marketing, data warehousing, Hadoop, and discovery platforms. Going beyond the “big data” hype, the event offered learning opportunities on how these technologies and ideas combine to drive real business innovation. Our unyielding focus on results from data is truly what made the events so successful.
Continuing the rich lineage of delivering quality Big Data information, the New York event promises to pack a tremendous amount of Big Data learning & education. The keynotes for the event include such industry luminaries as Dan Vesset, Program VP of Business Analytics at IDC; Tasso Argyros, Senior VP of Big Data at Teradata; and Peter Lee, Senior VP of Tibco Software.
Teradata team at the Dallas Big Analytics Roadshow
The keynotes will be followed by three tracks around Big Data Architecture, Data Science & Discovery, and Data-Driven Marketing. Each of these tracks will feature industry luminaries like Richard Winter of WinterCorp, John O’Brien of Radiant Advisors & John Lovett of Web Analytics Demystified. They will be joined by vendor presenters including Shaun Connolly of Hortonworks, Todd Talkington of Tableau & Brian Dirking of Alteryx.
As with every Big Analytics event, it presents an exciting opportunity to hear firsthand from leading organizations like Comcast, Gilt Groupe & Meredith Corporation about how they are using Big Data Analytics & Discovery to deliver tremendous business value.
In summary, the event promises to be nothing less than the Oscars of Big Data and will bring together the who’s who of the Big Data industry. So, mark your calendars, pack your bags and get ready to attend the biggest Big Data event of the year.
I’ve been working in the analytics and database market for 12 years. One of the most interesting pieces of that journey has been seeing how the market is ever-shifting. Both the technology and business trends during these short 12 years have massively changed not only the tech landscape today, but also the future evolution of analytic technology. From a “buzz” perspective, I’ve seen “corporate initiatives” and “big ideas” come and go – everything from “e-business intelligence,” which was a popular term when I first started working at Business Objects in 2001, to corporate performance management (CPM) and “the balanced scorecard,” from business process management (BPM) to “big data,” and now the architectures and tools that everyone is talking about.
The one golden thread that ties each of these terms, ideas and innovations together is that each aims to solve the questions related to what we today call “big data.” At the core of it all, we are searching for the right way to harness and understand the explosion of data and analytics that today’s organizations face. People call this the “logical data warehouse,” “big data architecture,” “next-generation data architecture,” “modern data architecture,” “unified data architecture,” or (as I just saw last week) “unified data platform.” What is all the fuss about, and what is really new? My goal in this post and the next few will be to explain how the customers I work with are attacking the “big data” problem. We call it the Teradata Unified Data Architecture, but whatever you call it, the goals and concepts remain the same.
“The logical data warehouse is the next significant evolution of information integration because it includes ALL of its progenitors and demands that each piece of previously proven engineering in the architecture should be used in its best and most appropriate place. …
“… The logical data warehouse will finally provide the information services platform for the applications of the highly competitive companies and organizations in the early 21st Century.”
The idea of this next-generation architecture is simple: When organizations put ALL of their data to work, they can make smarter decisions.
It sounds easy, but as data volumes and data types explode, so does the need for more tools in your toolbox to help make sense of it all. Within that toolbox, the problems are NOT all nails, and you definitely need to be armed with more than a hammer.
In my view, enterprise data architectures are evolving to let organizations capture more data. Much of this data previously went untapped because the hardware costs required to store and process it were simply too high. However, the declining cost of hardware (thanks to Moore’s law) has opened the door for more data types and volumes, and more processing technologies, to be successful. But no single technology can be engineered and optimized for every dimension of analytic processing, including scale, performance and concurrent workloads.
Thus, organizations are creating best-of-breed architectures by taking advantage of new technologies and workload-specific platforms such as MapReduce, Hadoop, MPP data warehouses, discovery platforms and event processing, and putting them together into a seamless, transparent and powerful analytic environment. This modern enterprise architecture gives users deep business insights and makes ALL data available to the organization, creating competitive advantage while lowering total system cost.
But why not just throw all your data into files and put a search engine like Google on top? Why not just build a data warehouse and extend it with support for “unstructured” data? Because, in the world of big data, the one-size-fits-all approach simply doesn’t work.
Different technologies are more efficient at solving different analytical or processing problems. To steal an analogy from Dave Schrader—a colleague of mine—it’s not unlike a hybrid car. The Toyota Prius can average 47 mpg with hybrid (gas and electric) vs. 24 mpg with a “typical” gas-only car – almost double! But you do not pay twice as much for the car.
How’d they do it? Toyota engineered a system that uses gas when you need to accelerate quickly (recharging the battery at the same time), mostly electric power when driving around town, and regenerative braking to recharge the battery.
Three components integrated seamlessly – the driver doesn’t need to know how it works. It is the same idea with the Teradata UDA, which is a hybrid architecture for extracting the most insights per unit of time – at least doubling your insight capabilities at reasonable cost. And business users don’t need to know all of the gory details. Teradata builds analytic engines – much like the hybrid drivetrain Toyota builds – that are optimized and used in combination with different ecosystem tools, depending on customer preferences and requirements, within the overall data architecture.
In the case of the hybrid car, battery power and braking systems, which recharge the battery, are the “new innovations” combined with gas-powered engines. Similarly, there are several innovations in data management and analytics that are shaping the unified data architecture, such as discovery platforms and Hadoop. Each customer’s architecture is different depending on requirements and preferences, but the Teradata Unified Data Architecture recommends three core components of a comprehensive architecture – a data platform (often called a “Data Lake”), a discovery platform and an integrated data warehouse. There are other components, such as event processing, search and streaming, that can be used in data architectures, but I’ll focus on the three core areas in this blog post.
Data Lakes
In many ways, the data lake is not unlike the operational data store we’ve seen between transactional systems and the data warehouse, but it is bigger and less structured. Any file can be “dumped” in the lake with no attention to data integration or transformation. New technologies like Hadoop provide a file-based approach to capturing large amounts of data without requiring ETL in advance. This enables large-scale processing for refining, structuring and exploring data prior to downstream analysis in workload-specific systems, which are used to discover new insights and then move those insights into business operations for use by hundreds of end users and applications.
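To make the “no ETL in advance” idea concrete, here is a minimal sketch of how raw files dumped into a Hadoop-based data lake might be refined with Hadoop Streaming. The log layout, field positions and job parameters are illustrative assumptions, not a reference implementation: the mapper discards malformed or unsuccessful records and emits one key per user, and the reducer counts clicks per user for downstream analysis.

#!/usr/bin/env python
# refine_logs.py -- minimal Hadoop Streaming sketch. Assumption for illustration:
# raw logs are newline-delimited "timestamp user_id url status" records that were
# dumped into HDFS as-is, with no upfront ETL.
#
# Example invocation (jar location and HDFS paths are assumptions):
#   hadoop jar hadoop-streaming.jar \
#       -input /datalake/raw/weblogs -output /datalake/refined/clicks_per_user \
#       -mapper "python refine_logs.py map" \
#       -reducer "python refine_logs.py reduce" \
#       -file refine_logs.py
import sys

def map_phase():
    """Refine raw lines: drop malformed or failed requests, emit user_id<TAB>1."""
    for line in sys.stdin:
        parts = line.split()
        if len(parts) != 4:
            continue                      # discard malformed records
        timestamp, user_id, url, status = parts
        if status != "200":
            continue                      # keep only successful requests
        print("%s\t1" % user_id)

def reduce_phase():
    """Sum the 1s per user; streaming delivers input sorted by key."""
    current_user, count = None, 0
    for line in sys.stdin:
        user_id, value = line.rstrip("\n").split("\t")
        if user_id != current_user:
            if current_user is not None:
                print("%s\t%d" % (current_user, count))
            current_user, count = user_id, 0
        count += int(value)
    if current_user is not None:
        print("%s\t%d" % (current_user, count))

if __name__ == "__main__":
    reduce_phase() if len(sys.argv) > 1 and sys.argv[1] == "reduce" else map_phase()

Nothing here required defining a schema up front; structure is imposed only when the data is read, which is the essence of the data-lake approach described above.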
Discovery Platforms
Discovery platforms are a new class of workload-specific system, optimized to perform multiple analytic techniques in a single workflow – combining SQL with statistics, MapReduce, graph or text analysis to look at data from multiple perspectives. The goal is ultimately to provide more granular and accurate insights to users about their business. Discovery platforms enable a faster investigative analytical process to find new patterns in data and to identify types of fraud or consumer behavior that traditional data mining approaches may have missed.
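As a rough illustration of the “multiple techniques in a single workflow” idea, the sketch below uses SQL for the relational step (pulling an ordered clickstream) and a MapReduce-style pass in Python for the investigative step (counting which page paths most often precede a purchase). The table, columns and in-memory SQLite database are assumptions to keep the sketch self-contained; a discovery platform would run this kind of workflow at far larger scale.

import sqlite3
from collections import Counter
from itertools import groupby

# Assumed toy clickstream: (user_id, ts, page). An in-memory SQLite database
# stands in for a discovery platform so the sketch stays runnable.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clicks (user_id TEXT, ts INTEGER, page TEXT)")
conn.executemany("INSERT INTO clicks VALUES (?, ?, ?)", [
    ("u1", 1, "home"), ("u1", 2, "search"), ("u1", 3, "product"), ("u1", 4, "buy"),
    ("u2", 1, "home"), ("u2", 2, "product"), ("u2", 3, "buy"),
    ("u3", 1, "home"), ("u3", 2, "search"), ("u3", 3, "product"),
])

# Step 1 (SQL): ordinary relational processing -- ordered events per user.
rows = conn.execute(
    "SELECT user_id, page FROM clicks ORDER BY user_id, ts").fetchall()

# Step 2 (map): emit the two pages that immediately precede each purchase.
def paths_to_purchase(session):
    for i, page in enumerate(session):
        if page == "buy" and i >= 2:
            yield tuple(session[i - 2:i])

# Step 3 (reduce): count identical paths across all user sessions.
path_counts = Counter()
for user_id, events in groupby(rows, key=lambda r: r[0]):
    session = [page for _, page in events]
    path_counts.update(paths_to_purchase(session))

print(path_counts.most_common(3))
# e.g. [(('search', 'product'), 1), (('home', 'product'), 1)]

The point is not the toy numbers but the shape of the workflow: set-oriented SQL where it is strongest, and procedural map and reduce steps where path-style analysis is awkward to express in SQL alone.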
Integrated Data Warehouses
With all the excitement about what’s new, companies quickly forget the value of consistent, integrated data for reuse across the enterprise. The integrated data warehouse has become a mission-critical operational system and the point of value realization, or “operationalization,” for information. The data within a massively parallel data warehouse has been cleansed and provides a consistent source of data for enterprise analytics. Integrating relevant data from across the entire organization achieves a couple of key goals. First, organizations can answer the kind of sophisticated, impactful questions that require cross-functional analyses. Second, they can answer questions more completely by making relevant data available across all levels of the organization. Data lakes (Hadoop) and discovery platforms complement the data warehouse by enriching it with new data and new insights that can now be delivered to thousands of users and applications with consistent performance (i.e., they get the information they need quickly).
A critical part of incorporating these novel approaches to data management and analytics is putting new insights and technologies into production in reliable, secure and manageable ways for organizations. Fundamentals of master data management, metadata, security, data lineage, integrated data and reuse all still apply!
The excitement of experimenting with new technologies is fading. More and more, our customers are asking us about ways to put the power of new systems (and the insights they provide) into large-scale operation and production. This requires unified system management and monitoring, intelligent query routing, metadata about incoming data and the transformations applied throughout the data processing and analytical process, and role-based security that respects and applies data privacy, encryption and other required policies. This is where I will spend a good bit of time in my next blog post.
Barton George is Cloud Computing and Scale-Out Evangelist for Dell.
Today at a press conference in San Francisco we announced the general availability of our Dell cloud solutions. One of the solutions we debuted was the Dell Cloud Solution for Data Analytics, a combination of our PowerEdge C servers with Aster Data’s nCluster, a massively parallel processing database with an integrated analytics engine.
Earlier this week I stopped by Aster Data’s headquarters in San Carlos, CA and met up with their EVP of marketing, Sharmila Mulligan. I recorded this video where Sharmila discusses the Dell and Aster solution and the fantastic results a customer is seeing with it.
Some of the ground Sharmila covers:
What customer pain points and problems does this solution address? (Hint: organizations trying to manage huge amounts of both structured and unstructured data.)
How Aster’s nCluster software is optimized for the Dell PowerEdge C2100 and how it provides very high-performance analytics as well as a cost-effective way to store very large amounts of data.
(2:21) InsightExpress, a leading provider of digital marketing research solutions, has deployed the Dell and Aster analytics solution and has seen great results:
Up and running within 6 weeks
Queries that took 7-9 minutes now run in 3 seconds
Amazon announced today the availability of special EC2 cloud clusters that are optimized for low-latency network operations. This is useful for applications in the so-called High-Performance Computing area, where servers need to request and exchange data very fast. Examples of HPC applications range from nuclear simulations in government labs to playing chess.
I find this development interesting, not only because it makes scientific applications in the cloud a possibility, but also because it’s an indication of where cloud infrastructure is heading.
In the early days, Amazon EC2 was very simple: if you wanted 5 “instances” (that is, 5 virtual machines), that’s what you got. However, the instances’ memory and disk capacity were low. Over time, more and more configurations were added, and now one can choose an instance type from a variety of disk and memory characteristics, with up to 15GB of memory and 2TB of disk per instance. However, the network was always a problem, regardless of instance size. (According to rumors, EC2 would make things worse by distributing instances as far away from each other as possible in the datacenter to increase reliability – as a result, network latency would suffer.) Now, the network problem is being solved by means of these special “Cluster Compute Instances” that provide guaranteed, non-blocking access to a 10GbE network infrastructure.
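For a sense of what requesting this kind of low-latency capacity looks like programmatically, here is a hedged sketch using the boto3 SDK (which post-dates this post); the region, AMI ID, instance type and counts are placeholders for illustration, not recommendations.

import boto3

# Sketch: launch a small cluster into a "cluster" placement group, which is how
# EC2 exposes the low-latency, non-blocking network described above.
# Region, AMI ID, instance type and instance counts are illustrative assumptions.
ec2 = boto3.client("ec2", region_name="us-east-1")

ec2.create_placement_group(GroupName="hpc-demo", Strategy="cluster")

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",      # placeholder AMI
    InstanceType="c5n.18xlarge",          # a modern network-optimized type
    MinCount=4,
    MaxCount=4,
    Placement={"GroupName": "hpc-demo"},  # co-locate instances for low latency
)
print([i["InstanceId"] for i in response["Instances"]])

The Cluster Compute Instances announced here exposed the idea through a dedicated instance family; the placement-group request above is today’s equivalent of asking EC2 to keep the machines close together on the network.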
Overall, this move represents a departure from the super-simple black-box model that EC2 started from. Amazon – wisely – realizes that accommodating more applications requires transparency, and guarantees, for the underlying infrastructure. Guaranteeing network latency is just the beginning: Amazon has the opportunity to add many more options and guarantees around I/O performance, quality of service, SSDs versus hard drives, fail-over behavior, etc. The more options and guarantees Amazon offers, the closer we’ll get to the promise of the cloud – at least for resource-intensive IT applications.
If you read this blog, you’ve probably seen the news about the partnership between Aster Data and Dell on their new PowerEdge C-Series servers. Together we have enabled some really successful customers such as MySpace and Mint.com and proven that Dell hardware with Aster Data software easily scales to support large-scale data warehousing and advanced analytics.
In the last few years there has been a significant amount of market pickup, from users and vendors, on data clouds and advanced analytics – specifically, a new class of data-driven applications run in a data cloud or on-premises. What’s different about this from past approaches is the frequency and speed at which these applications are accessed, the depth of the analysis, the number of data sources involved and the volume of data mined by these applications – terabytes to petabytes. In the midst of this cacophony of dialogue, recent announcements from vendors in this space are helping to clarify the different visions of, and approaches to, the big data challenge.
Both Aster Data and Greenplum made announcements this week that illustrated different approaches. At the same time that Aster Data announced the Aster Analytics Center, Greenplum announced an upcoming product named Chorus. I wanted to take a moment to compare and contrast what these announcements say about the direction of the two companies.
Greenplum’s approach speaks to two traditional problem areas: i) access to data, from provisioning of data marts to connectivity to data across marts, and ii) some level of collaboration among certain developers and analysts. Their approach is to create a tool for provisioning, unified data access, and sharing of annotations and data among different developers and analysts. Interestingly, this is not an entirely new concept; these are well-known problems for which a number of companies and tools have already developed best-of-breed solutions over the last 15 years. For example, the capabilities for data access are another version of the Export/Copy primitives that already exist in all databases and that have been built upon by common ETL and EII tools for cases in which richer support than Export & Copy is needed – for instance, when data has to be transformed, correlated or cleaned while being moved from one context (mart) to another (mart).
This approach is indicative of a product direction in which the primary focus is on adding another option to the list of tools available to customers to address these problems. It’s really not a ground-breaking innovation that evolves the world of analytics. New types of analytics, or ‘data-driven applications,’ are where the enormous opportunity lies. The Greenplum approach of data collaboration is interesting in a test environment or sandbox. When it comes to real production value, however, it effectively increases the functions available to the end user, but at a big cost due to significant increases in complexity, security issues and extra administrative overhead. What does this mean exactly?
Spinning up marts and moving data around can result in “data sprawl,” which ultimately increases administrative overhead and is dangerous in these days of compliance and sensitivity to privacy and data leaks.
Adding a new toolset into the data processing stack creates difficult and painful work: either managing and administering multiple toolsets for similar purposes, or eliminating and transitioning away from investments in existing toolsets.
To enable effective communication and sharing, users need strong processes and features for source identification of data, data collection, data transformation, rule administration, error detection & correction, data governance and security. The quality and security policies around metadata are especially important, as free-form annotations can lead to the propagation of errors or leaks in the absence of strong oversight.
In contrast, Aster Data’s recent announcements support our long-standing investments in our unique advanced in-database architecture, where applications run fully inside Aster Data’s platform with the complete application services essential for complex analytic applications. The announcements highlight that our vision is not to create a new set of tools and layers in the data stack that recreate capabilities currently available from a number of leading vendors. Rather, it is to deliver a new Analytics Platform – a Data-Application Server – that uniquely enables analytics professionals to create data-rich applications that were impossible or impractical before; namely, to create and use advanced analytics for rich, rapid and scalable insights into their data. This focus is complemented by our partners, who offer proven best-of-breed solutions for collaboration and data transformation.
Our vision was, and continues to be, to bring the power of MapReduce to a whole new class of developers and mission-critical enterprise systems. When would you use Aster’s In-Database MapReduce vs. a system like Hadoop? You need to ask a few questions as you think about this:
 Can I use my MapReduce system only for batch processing, or can I do real-time reporting and analysis? Can I have a single system for number-crunching AND needle-in-a-haystack summary or aggregation lookups? Can I get responses to my short queries in seconds, or do I need to wait several minutes?
 How do I maximize developer productivity, using SQL for regular data processing and MapReduce for richer analysis? (See the sketch after this list.)
 Do I only want to manage raw data files using file-name conventions, or do I also want to use database primitives like partitions, tables, and views?
 How do I easily integrate the MapReduce system with my standard ETL and reporting tool, so I don’t have to reinvent the wheel on dashboards, scorecards, and reports?
 When I have such large data in an enterprise system, how do I control access to data and provide appropriate security privileges?
 Workload management: When I have invested in a system with hundreds or thousands of processors, how do I efficiently share it among multiple users and guarantee response-time SLAs?
 For mission-critical data-intensive applications, how do I do full and incremental backup and disaster recovery?
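To make the developer-productivity question above concrete, here is a small sketch contrasting the two styles on the same aggregate: a one-line SQL statement versus a hand-rolled map and reduce pass. The table and data are illustrative assumptions; the point is simply that routine processing is cheaper to express in SQL, which leaves MapReduce effort for analyses SQL cannot express.

import sqlite3
from collections import defaultdict

# Assumed toy data: (product, amount) sales rows in an in-memory SQLite table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("a", 10.0), ("b", 5.0), ("a", 2.5), ("c", 7.0), ("b", 1.0)])

# Style 1 -- SQL: routine aggregation in a single declarative statement.
sql_totals = dict(conn.execute(
    "SELECT product, SUM(amount) FROM sales GROUP BY product"))

# Style 2 -- MapReduce: the same aggregate written as explicit map and reduce
# steps, the way it would look in a hand-written batch job.
rows = conn.execute("SELECT product, amount FROM sales").fetchall()
mapped = [(product, amount) for product, amount in rows]   # map: emit (key, value)
reduced = defaultdict(float)
for product, amount in mapped:                              # reduce: sum by key
    reduced[product] += amount

assert sql_totals == dict(reduced)
print(sql_totals)   # {'a': 12.5, 'b': 6.0, 'c': 7.0}

Both styles produce identical answers here; the difference shows up in effort and in integration with existing SQL-based ETL and reporting tools, which is exactly what the questions above are probing.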
It’s great to see MapReduce going mainstream and companies such as Amazon supporting the proliferation of innovative approaches to the data explosion problem. Together, we hope to help build mind-share around MapReduce and help companies do more with their data. In fact, we welcome users to put Amazon Elastic MapReduce output into Aster nCluster Cloud Edition for persistence, sharing, reporting and easy fast concurrent access. Lots of Aster customers are using both and it’s easy to move data since Aster is on the same Amazon Web Services cloud.
Please contact us if you’d like help getting started with your MapReduce explorations. We conducted a web seminar to introduce you to the concept.
As the Director of Technology Delivery for Aster Data Systems, I oversee the teams responsible for delivering and deploying our nCluster analytic database to customers and enabling prospective customers to evaluate our solutions effectively and efficiently. Recently, Shawn posted on the release of Aster nCluster Cloud Edition and discussed how cloud computing enables businesses to scale their infrastructure without huge hardware investments. As a follow-on, I’d like to tell you how the flexibility provided by nCluster’s support for multiple platforms can reduce the time and costs associated with evaluating nCluster.
Evaluating enterprise software can be a costly effort in both time and money. The process typically requires weeks of prep work by the evaluation team, possibly including purchasing different hardware for each vendor being evaluated. Spending significant amounts of money and losing weeks of resource productivity to an evaluation is something few companies can afford to do, particularly in these uncertain times.
With our recent public release of Aster nCluster Cloud Edition, we now provide the most platform options of any major data warehouse vendor. While it’s natural to focus on the flexibility this affords for production systems, it also allows us to be very flexible for enabling customers to try our solution:
Commodity Hardware Evaluation
Several warehouse vendors claim to support commodity hardware, but most are very closely tied to one “preferred” vendor. Aster nCluster supports any x86-based hardware, meaning that you can evaluate us on either new hardware (if performance is a key aspect of the evaluation) or older hardware that is being repurposed (if you want to test the functionality of nCluster without buying new hardware).
Aster-Hosted Evaluation
Our data center in San Carlos, CA has racks of servers dedicated to customer evaluations. With an Aster-hosted system, functional evaluations of nCluster can be performed with minimum infrastructure requirements.
Cloud Evaluation
With Aster nCluster Cloud Edition, custom-configured nClusters can be brought up in minutes on either Amazon EC2 or AppNexus. POCs can be performed on one or multiple systems in parallel, with zero infrastructure requirements. Your teams can evaluate all of nCluster’s functionality in the cloud, with complete control over sizing and scaling. (While other vendors have announced cloud offerings, we’re the only data warehouse vendor to have production customers on two separate cloud services.)
Whether you’re building a new frontline data warehouse or looking to replace an existing system that doesn’t scale or costs too much, you should check us out. We have a great product that’s turning heads as an alternative to overpriced hardware appliances for multi-TB data warehouses. With all the flexibility our offerings provide, you can evaluate all the power of Aster nCluster without the costs of traditional POCs.
Give us a try and see everything you can do with Aster nCluster!
Cloud computing is a fascinating concept. It offers greenfield opportunities (or, more appropriately, blue-sky frontiers) for businesses to affordably scale their infrastructure without plunking down a huge hardware investment (and the space/power/cooling costs associated with managing your own hosted environment). This removes the risk of mis-provisioning by enabling on-demand scaling according to your data growth needs. Especially in these economic times, the benefits of cloud computing are very attractive.
But let’s face it – there’s also a lot of hype, and it’s hard to separate truth from fiction. For example, what qualities would you say are key to data warehousing in the cloud?
Here’s a checklist of things I think are important:
 Time-To-Scalability. The whole point of clouds is to offer easy access to virtualized resources. A cloud warehouse needs to scale out and scale in quickly to adapt to changing needs. It can’t take days to scale…it has to happen on-demand in minutes (<1 hour).
 Manageability. You go with clouds because you not only want to save on hardware, but also on the operational people costs of maintaining that infrastructure. A cloud warehouse needs to offer one-click scaling, easy install/upgrade, and self-managed resiliency.
 Ecosystem. While clouds offer *you* huge TCO savings, you can’t compromise service levels for your customers – especially if you run your business on the cloud. BI/ETL/monitoring tools, Backup & Recovery, and ultra-fast data loading can’t be overlooked for “frontline” mission-critical warehousing on the cloud.
 Analytics. Lots of valuable data is generated via the cloud and there are opportunities to subscribe to new data feed services. It’s insufficient for a cloud warehouse to just do basic SQL reporting. Rather, it must offer the ability to do deep analytics very quickly.
 Choice. A truly best-in-class cloud warehouse won’t lock you in to a single cloud vendor. Rather, it will offer portability by enabling you to choose the best cloud for you to run your business on.
Finally, here are a couple of ideas on the future of cloud warehousing. What if you could link multiple cloud warehouses together and run interesting queries across clouds? And what about the opportunities for game-changing new analytics – with so many emerging data subscription services, wouldn’t this offer ripe opportunities for mash-up analytics (e.g., using Aster SQL/MapReduce)?
What do you think are the standards for “best-in-class” cloud warehousing?