I’ve been working in the analytics and database market for 12 years. One of the most interesting parts of that journey has been seeing how the market is ever-shifting. The technology and business trends of these short 12 years have massively changed not only today’s tech landscape, but also the future evolution of analytic technology. From a “buzz” perspective, I’ve seen “corporate initiatives” and “big ideas” come and go: everything from “e-business intelligence,” which was a popular term when I first started working at Business Objects in 2001, to corporate performance management (CPM) and “the balanced scorecard,” and from business process management (BPM) to “big data” and now the architectures and tools that everyone is talking about.
The one golden thread that ties each of these terms, ideas and innovations together is that each aims to answer the questions related to what we today call “big data.” At the core of it all, we are searching for the right way to harness and understand the explosion of data and analytics that today’s organizations face. People call this the “logical data warehouse”, “big data architecture”, “next-generation data architecture”, “modern data architecture”, “unified data architecture”, or (I just saw last week) “unified data platform”. What is all the fuss about, and what is really new? My goal in this post and the next few will be to explain how the customers I work with are attacking the “big data” problem. We call it the Teradata Unified Data Architecture, but whatever you call it, the goals and concepts remain the same.
Mark Beyer from Gartner is credited with coining the term “logical data warehouse,” and there is an interesting story and explanation behind it. A nice summary of the term is:
“The logical data warehouse is the next significant evolution of information integration because it includes ALL of its progenitors and demands that each piece of previously proven engineering in the architecture should be used in its best and most appropriate place. …
“… The logical data warehouse will finally provide the information services platform for the applications of the highly competitive companies and organizations in the early 21st Century.”
The idea of this next-generation architecture is simple: When organizations put ALL of their data to work, they can make smarter decisions.
It sounds easy, but as data volumes and data types explode, so does the need for more tools in your toolbox to help make sense of it all. Within your toolbox, data is NOT all nails and you definitely need to be armed with more than a hammer.
In my view, enterprise data architectures are evolving to let organizations capture more data. The data was previously untapped because the hardware costs required to store and process the enormous amount of data were simply too high. However, the declining costs of hardware (thanks to Moore’s law) have opened the door for more data (types, volumes, etc.) and processing technologies to be successful. But no single technology can be engineered and optimized for every dimension of analytic processing, including scale, performance and concurrent workloads.
Thus, organizations are creating best-of-breed architectures by taking advantage of new technologies and workload-specific platforms such as MapReduce, Hadoop, MPP data warehouses, discovery platforms and event processing, and putting them together into a seamless, transparent and powerful analytic environment. This modern enterprise architecture enables users to get deep business insights and allows ALL data to be available to an organization, creating competitive advantage while lowering the total system cost.
But why not just throw all your data into files and put a search engine like Google on top? Why not just build a data warehouse and extend it with support for “unstructured” data? Because, in the world of big data, the one-size-fits-all approach simply doesn’t work.
Different technologies are more efficient at solving different analytical or processing problems. To steal an analogy from Dave Schrader—a colleague of mine—it’s not unlike a hybrid car. The Toyota Prius can average 47 mpg with hybrid (gas and electric) vs. 24 mpg with a “typical” gas-only car – almost double! But you do not pay twice as much for the car.
How’d they do it? Toyota engineered a system that uses gas when I need to accelerate fast (and also to recharge the battery at the same time), electric mostly when driving around town, and braking to recharge the battery.
Three components integrated seamlessly – the driver doesn’t need to know how it works. It is the same idea with the Teradata UDA, which is a hybrid architecture for extracting the most insights per unit of time – at least doubling your insight capabilities at reasonable cost. And business users don’t need to know all of the gory details. Teradata builds analytic engines, much like the hybrid drive train Toyota builds, that are optimized and used in combination with different ecosystem tools depending on customer preferences and requirements, within their overall data architecture.
In the case of the hybrid car, battery power and braking systems, which recharge the battery, are the “new innovations” combined with gas-powered engines. Similarly, there are several innovations in data management and analytics that are shaping the unified data architecture, such as discovery platforms and Hadoop. Each customer’s architecture is different depending on requirements and preferences, but the Teradata Unified Data Architecture recommends three core components for a comprehensive architecture – a data platform (often called a “data lake”), a discovery platform and an integrated data warehouse. There are other components such as event processing, search, and streaming which can be used in data architectures, but I’ll focus on the three core areas in this blog post.
In many ways, the data lake is not unlike the operational data store we’ve seen between transactional systems and the data warehouse, but it is bigger and less structured. Any file can be “dumped” in the lake with no attention to data integration or transformation. New technologies like Hadoop provide a file-based approach to capturing large amounts of data without requiring ETL in advance. This enables large-scale processing for refining, structuring, and exploring data prior to downstream analysis in workload-specific systems, which are used to discover new insights and then move those insights into business operations for use by hundreds of end-users and applications.
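To make the “no ETL in advance” idea concrete, here is a minimal sketch in Python. It is not how Hadoop or any Teradata product works internally; the in-memory `raw_lake` dictionary, the file names and the `read_records` helper are all hypothetical stand-ins for a file store. The point is only that files land as-is and structure is applied when they are read.

```python
import csv
import io
import json

# Hypothetical "lake": raw payloads kept exactly as they arrived, keyed by file name.
raw_lake = {
    "clicks.json": '{"user": "u1", "page": "/home"}\n{"user": "u2", "page": "/cart"}\n',
    "orders.csv": "user,amount\nu1,19.99\nu2,5.00\n",
}


def read_records(name):
    """Apply structure at read time, chosen by file extension (an assumption)."""
    payload = raw_lake[name]
    if name.endswith(".json"):
        # One JSON document per line.
        return [json.loads(line) for line in payload.splitlines() if line.strip()]
    if name.endswith(".csv"):
        # The first row is treated as the header.
        return list(csv.DictReader(io.StringIO(payload)))
    raise ValueError(f"no reader registered for {name}")


clicks = read_records("clicks.json")
orders = read_records("orders.csv")

# Refining step: cast types only now, just before handing data downstream.
refined_orders = [{"user": o["user"], "amount": float(o["amount"])} for o in orders]
```

Nothing was transformed when the files were stored; the casting and cleanup happen only in the refining step, which is where a downstream, workload-specific system would pick the data up.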
Discovery platforms are a new kind of workload-specific system, optimized to perform multiple analytic techniques in a single workflow, combining SQL with statistics, MapReduce, graph, or text analysis to look at data from multiple perspectives. The goal is ultimately to provide more granular and accurate insights to users about their business. Discovery platforms enable a faster investigative analytical process to find new patterns in data and to identify different types of fraud or consumer behavior that traditional data mining approaches may have missed.
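As a toy illustration of mixing techniques in one workflow, the sketch below uses SQL for aggregation and a simple statistical rule to flag unusual accounts. It uses Python’s built-in `sqlite3` and `statistics` modules purely as stand-ins for a discovery platform; the `payments` table, the sample rows and the “10 times the median absolute deviation” threshold are made-up assumptions, not a recommended fraud model.

```python
import sqlite3
import statistics

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (account TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO payments VALUES (?, ?)",
    [("a", 20.0), ("a", 25.0), ("b", 22.0), ("b", 18.0),
     ("c", 21.0), ("c", 24.0), ("d", 900.0), ("d", 950.0)],
)

# Step 1 (SQL): total spend per account.
rows = conn.execute(
    "SELECT account, SUM(amount) FROM payments GROUP BY account ORDER BY account"
).fetchall()

# Step 2 (statistics): median and median absolute deviation of those totals.
totals = [total for _, total in rows]
median = statistics.median(totals)
mad = statistics.median([abs(total - median) for total in totals])

# Step 3: flag accounts far from the median (hypothetical threshold).
flagged = [account for account, total in rows if abs(total - median) > 10 * mad]
print(flagged)  # ['d']
```

A real discovery workflow would run over far more data and richer techniques (graph, text, path analysis), but the shape is the same: each step uses the tool that suits it, and the analyst stays in one flow.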
Integrated Data Warehouses
With all the excitement about what’s new, companies quickly forget the value of consistent, integrated data for reuse across the enterprise. The integrated data warehouse has become a mission-critical operational system which is the point of value realization or “operationalization” for information. The data within a massively parallel data warehouse has been cleansed, and provides a consistent source of data for enterprise analytics. By integrating relevant data from across the entire organization, companies achieve a couple of key goals. First, they can answer the kind of sophisticated, impactful questions that require cross-functional analyses. Second, they can answer questions more completely by making relevant data available across all levels of the organization. Data lakes (Hadoop) and discovery platforms complement the data warehouse by enriching it with new data and new insights that can now be delivered to thousands of users and applications with consistent performance (i.e., they get the information they need quickly).
A critical part of incorporating these novel approaches to data management and analytics is putting new insights and technologies into production in reliable, secure and manageable ways for organizations. Fundamentals of master data management, metadata, security, data lineage, integrated data and reuse all still apply!
The excitement of experimenting with new technologies is fading. More and more, our customers are asking us about ways to put the power of new systems (and the insights they provide) into large-scale operation and production. This requires unified system management and monitoring, intelligent query routing, metadata about incoming data and the transformations applied throughout the data processing and analytical process, and role-based security that respects and applies data privacy, encryption and other required policies. This is where I will spend a good bit of time in my next blog post.
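In the meantime, here is a rough sketch of two of those requirements, query routing and role-based access, in a few lines of Python. Everything in it is hypothetical: the workload names, the role names, the `ROUTES` and `ROLE_ACCESS` tables and the `route` function are illustrations of the idea, not the API of any Teradata product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    user_role: str
    workload: str  # e.g. "raw_exploration", "pattern_discovery", "reporting"


# Hypothetical mapping from workload type to the platform best suited to it.
ROUTES = {
    "raw_exploration": "data_lake",
    "pattern_discovery": "discovery_platform",
    "reporting": "integrated_data_warehouse",
}

# Hypothetical role-based policy: which platforms each role may query.
ROLE_ACCESS = {
    "data_scientist": {"data_lake", "discovery_platform", "integrated_data_warehouse"},
    "business_analyst": {"integrated_data_warehouse"},
}


def route(request):
    """Pick a target platform for the workload, then enforce the role policy."""
    target = ROUTES.get(request.workload)
    if target is None:
        raise ValueError(f"unknown workload: {request.workload}")
    if target not in ROLE_ACCESS.get(request.user_role, set()):
        raise PermissionError(f"{request.user_role} may not query {target}")
    return target
```

For example, `route(Request("business_analyst", "reporting"))` returns `"integrated_data_warehouse"`, while the same role asking for `"raw_exploration"` raises `PermissionError`. A production system would also have to cover monitoring, lineage metadata and encryption, which this sketch leaves out.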