Archive for the ‘Database’ Category
|
|
|
|
|
About one year ago, Teradata Aster launched a powerful new way of integrating a database with Hadoop. With Aster SQL-H™, users of the Teradata Aster Discovery Platform got the ability to issue SQL and SQL-MapReduce® queries directly on Hadoop data as if that data had been in Aster all along. This level of simplicity and performance was unprecedented, and it enabled BI & SQL analysts that knew nothing about Hadoop to access Hadoop data and discover new information through Teradata Aster.
This innovation was not a one-off. Teradata has put forward the most complete vision for a data and analytics architecture in the 21st century. We call that the Unified Data Architecture™. The UDA combines Teradata, Teradata Aster & Hadoop into a best-of-breed, tightly integrated ecosystem of workload-specific platforms that provide customers the most powerful and cost-effective environment for their analytical needs. With Aster SQL-H™, Teradata provided a level of software integration between Aster & Hadoop that was, and still is, unchallenged in the industry.
 Teradata Unified Data Architecture™
Today, Teradata makes another leap in making its Unified Data Architecture™ vision a reality. We are announcing SQL-H™ for Teradata, bringing the best SQL engine for data warehousing and analytics to Hadoop. From now on, Enterprises that use Hadoop to store large amounts of data will be able to utilize Teradata’s analytics and data warehousing capabilities to directly query Hadoop data securely through ANSI standard SQL and BI tools by leveraging the open source Hortonworks HCatalog project. This is fundamentally the best and tightest integration between a data warehouse engine and Hadoop that exists in the market today. Let me explain why.
It is interesting to consider Teradata’s approach versus alternatives. If one wants to execute SQL on Hadoop, with the intent of building Data Warehouses out of Hadoop data, there are not many realistic options. Most databases have a very poor integration with Hadoop, and require Hadoop experts to manage the overall system – not a viable option for most Enterprises due to cost. SQL-H™ removes this requirement for Teradata/Hadoop deployments. Another “option” are the SQL-on-Hadoop tools that have started to emerge; but unfortunately, there are about a decade away from becoming sufficiently mature to handle true Data Warehousing workloads. Finally, the approach of taking a database and shoving it inside Hadoop has significant issues since it suffers from the worst of both worlds – Hadoop activity has to be limited so that it doesn’t disrupt the database, data is duplicated between HDFS and the database store, and performance of the database is less compared to a stand–alone version.
In contrast, a Teradata/Hadoop deployment with SQL-H™ offers the best of both worlds: unprecedented performance and reliability in the Teradata layer; seamless BI & SQL access to Hadoop data via SQL-H™; and it frees up Hadoop to perform data processing tasks at full efficiency.
Teradata is committed to being the strategic advisor of the Enterprise when it comes to Data Warehousing and Big Data. Through its Unified Data Architecture™ and today’s announcement on Teradata SQL-H™, it provides even more performance, flexibility and cost-effective options to Enterprises eager to use data as a competitive advantage.
|
|
|
|
|
|
|
|
|
|
|
|
|
Ever since Aster Data became part of Teradata a couple years ago, we have been fortunate to have the resources and focus to accelerate our rate of product innovation. In the past 8 months alone, we have led the market in deploying big analytics on Hadoop and introducing an ultra-fast appliance for discovering big data insights. Our focus is to provide the market with the best big data discovery platform; that is, the most efficient, cost-effective, and enterprise-friendly way to extract valuable business insights form massive piles of structured and unstructured data.
Today I am excited to announce another significant innovation that extends our lead in this direction. For the first time, we are introducing in-database, SQL-MapReduce-based visualization functions, as part of the Teradata Aster Discovery Platform 5.10 software release. These are functions that take the output of an analytical process (either SQL or MapReduce) and create an interactive data visualization that can be accessed directly from our platform through any web browser. There are several functions that we are introducing with today’s announcement, including functions that let you visualize flows of people or events, graphs, and arbitrary patterns. These functions complement your existing BI solution by extending the types of information you can visualize without adding the complexity of another BI deployment.
It did take some significant engineering effort and innovation from our field in working with customers to make a discovery platform produce in-database, in-process visualizations. So, why bother? Because these functions have three powerful characteristics: they are beautiful; powerful; and instant. Let me elaborate in reverse order.
Instant: the goal of a discovery platform like Aster’s is to accelerate the hypothesis –> analysis –> validation iteration process. One of the major big data challenges is that the data is so complex that you don’t even know what questions to ask. So you start with 10s or 100s of possible questions that you need to quickly implement and validate until you find the couple questions that extract the gold nuggets of information from the data. Besides analyzing the data, having access to instant visualizations can help data scientists and business analysts understand if they are down the right path of finding the insights they’re looking for. Being able to rapidly analyze and – now – visualize the insights in-process can rapidly accelerate the discovery cycle and save an analysts time and cost by more than 80% as has been recently validated.
Powerful: Aster comes with a broad library of pre-built SQL-MapReduce functions. Some of the most powerful, like nPath, crunch terabytes of customer or event data and produce patterns of activity that yield significant insights in a single pass of the data, regardless of the complexity of the pattern or history being analyzed. In the past, visualizing these insights required a lot of work – even after the insight was generated. This is because there were no specialized visualization tools that could consume the insight as-is to produce the visualizations. Abstracting the insights in order to visualize them is sub-optimal since it is killing the ‘a-ha!’ moment. With today’s announcement, we provide analysts with the ability to natively visualize concepts such as a graph of interactions or patterns of customer behavior with no compromises and no additional effort!
Beautiful: We all know that numbers and data are only as good as the story that goes with them. By having access to instant, powerful and also aesthetically beautiful in-database visualizations, you can do justice to your insights and communicate them effectively to the rest of the organization, whether that means business clients, executives, or peer analysts.
In addition, with this announcement we are introducing four buckets of pre-built SQL-MapReduce functions, I.e. Java functions that can be accessed through a familiar SQL or BI interface. These buckets are Data Acquisition (connecting to external sources and acquiring data); Data Preparation (manipulate structured and unstructured data to quickly prepare for analysis); Data Analytics (everything from path and pattern analysis to statistics and marketing analytics); and Data Visualization (introduced today). This is the most powerful collection of big data tools available in the industry today, and we’re proud to provide them to our customers.
 Teradata Aster Discovery Portfolio
Our belief is that our industry is still scratching the surface in terms of providing powerful analytical tools to enterprises that help them find more valuable insights, more quickly and more easily. With today’s launch, the Teradata Aster Discovery Platform reconfirms its lead as the most powerful and enterprise-friendly tool for big data analytics.
|
|
|
|
|
|
|
|
|
|
|
By MGoel in Analytic platform, Analytics, Availability, Blogroll, Business analytics, Database, Digital Marketing, MapReduce, Scalability, TCO, Teradata Aster on December 18, 2012 |
| |
|
|
|
|
It’s been about two months since Teradata launched the Aster Big Analytics Appliance and since then we have had the opportunity to showcase the appliance to various customers, prospects, partners, analysts, journalists etc. We are pleased to report that since the launch the appliance has already received the “Ventana Big Data Technology of the Year” award and has been well received by industry experts and customers alike.
Over the past two months, starting with the launch tweetchat, we have received numerous enqueries around the appliance and think now is a good time to answer the top 10 most frequently asked questions about the new Teradata Aster offering. Without further ado here are the top 10 questions and their answers:
WHAT IS THE TERADATA ASTER BIG ANALYTICS APPLIANCE?
The Aster Big Analytics Appliance is a powerful, ready to-run platform that is pre-configured and optimized specifically for big data storage and analysis. A purpose built, integrated hardware and software solution for analytics at big data scale, the appliance runs Teradata Aster patented SQL-MapReduce® and SQL-H™ technology on a time-tested, fully supported Teradata hardware platform. Depending on workload needs, it can be exclusively configured with Aster nodes, Hortonworks Data Platform (HDP) Hadoop nodes, or a mixture of Aster and Hadoop nodes. Additionally, integrated backup nodes are available for data protection and high availability
WHO WILL BENEFIT MOST BY DEPLOYING THE APPLIANCE?
The appliance is designed for organizations looking for a turnkey integrated hardware and software solution to store, manage and analyze structured and unstructured data (ie: multi-structured data formats). The appliance meets the needs of both departmental and enterprise-wide buyers and can scale linearly to support massive data volumes.
WHY DO I NEED THIS APPLIANCE?
This appliance can help you gain valuable insights from all of your multi-structured data. Using these insights, you can optimize business processes to reduce cost and better serve your customers. More importantly, these insights can help you innovate by identifying new markets, new products, new business models etc. For example, by using the appliance a telecommunications company can analyze multi-structured customer interaction data across multiple channels such as web, call center and retail stores to identify the path customers take to churn. This insight can be used proactively to increase customer retention and improve customer satisfaction.
WHAT’S UNIQUE ABOUT THE APPLIANCE?
The appliance is an industry first in tightly integrating SQL-MapReduce®, SQL-H™ and Apache Hadoop. The appliance delivers a tightly integrated hardware and software solution to store, manage and analyze big data. The appliance delivers integrated interfaces for analytics and administration, so all types of multi-structured data can be quickly and easily analyzed through SQL based interfaces. This means that you can continue to use your favorite BI tools and all existing skill sets while deploying new data management and analytics technologies like Hadoop and MapReduce. Furthermore, the appliance delivers enterprise class reliability to allow technologies like Hadoop to now be used for mission critical applications with stringent SLA requirements.
WHY DID TERADATA BRING ASTER & HADOOP TOGETHER?
With the Aster Big Analytics Appliance, we are not just putting Aster and Hadoop in the same box. The Aster Big Analytics Appliance is the industry’s first unified big analytics appliance, providing a powerful, ready to run big analytics and discovery platform that is pre-configured and optimized specifically for big data analysis. It provides intrinsic integration between the Aster Database and Apache Hadoop, and we believe that customers will benefit the most by having these two systems in the same appliance.
Teradata’s vision stems from the Unified Data Architecture. The Aster Big Analytics Appliance offers customers the flexibility to configure the appliance to meet their needs. Hadoop is best for capture, storing and refining multi-structured data in batch whereas Aster is a big analytics and discovery platform that helps derive new insights from all types of data. Hadoop is best for capture, storing and refining multi-structured data in batch. Depending on the customer’s needs, the appliance can be configured with all Aster nodes, all Hadoop nodes or a mix of the two.
WHAT SKILLS DO I NEED TO DEPLOY THE APPLIANCE?
The Aster Big Analytics appliance is an integrated hardware and software solution for big data analytics, storage, and management, which is also designed as a plug and play solution that does not require special skill sets.
DOES THE APPLIANCE MAKE DATA SCIENTISTS OR DATA ANALYSTS IRRELEVANT?
Absolutely not. By integrating the hardware and software in an easy to use solution and providing easy to use interfaces for administration and analytics, the appliance allows data scientists to spend more time analyzing data.
In fact, with this simplified solution, your data scientists and analysts are freed from the constraints of data storage and management and can now spend their time on value added insights generation that ultimately leads to a greater fulfillment of your organization’s end goals.
HOW IS THE APPLIANCE PRICED?
Teradata doesn’t disclose product pricing as part of its standard business operating procedures. However, independent research conducted by industry analyst Dr. Richard Hackathorn, president and founder, Bolder Technology Inc., confirms that on a TCO and Time-to-Value basis the appliance presents a more attractive option vs. commonly available do-it-yourself solutions. http://teradata.com/News-Releases/2012/Teradata-Big-Analytics-Appliance-Enables-New-Business-Insights-on–All-Enterprise-Data/
WHAT OTHER ASTER DEPLOYMENT OPTIONS ARE AVAILABLE?
Besides deploying via the appliance, customers can also acquire and deploy Aster as a software only solution on commodity hardware] or in a public cloud.
WHERE CAN I GET MORE INFORMATION?
You can learn more about the Big Analytics Appliance via http://asterdata.com/big-analytics-appliance/ – home to release information, news about the appliance, product info (data sheet, solution brief, demo) and Aster Express tutorials.
Join the conversation on Twitter for additional Q&A with our experts:
Manan Goel @manangoel | Teradata Aster @asterdata
For additional information please contact Teradata at http://www.teradata.com/contact-us/
|
|
|
|
|
|
|
|
|
|
|
By Tasso Argyros in Analytic platform, Analytics, Analytics tech, Business analytics, Business intelligence, Database, Digital Marketing, Interactive marketing, MapReduce, Teradata Aster on June 12, 2012 |
| |
|
|
|
|
Back in 2005, when we first founded Aster Data, our vision was to take some of the latest technology innovations – including MPP shared-nothing architectures; Linux-based commodity hardware; and novel analytical interfaces like Google’s MapReduce – and bring them to mainstream enterprises. This vision translated into a strategy focused not only on big data innovations, but also on delivering technologies that make big data viable for enterprise environments. SQL-MapReduce®, our industry-leading patented technology that combines standard SQL processing with a native MapReduce execution environment, is one example of how we make big data enterprise ready.
Today we have completed another major milestone on providing value to our customers by announcing a major innovation: Aster SQL-H™, a seamless way to execute SQL & SQL-MapReduce on Apache™ Hadoop™ data.
This is a significant step forward from what was state-of-the-art until yesterday. What was missing? A common DBMS-Hadoop connector operating at the physical layer. This means that getting data from Hadoop to a database required a Hadoop expert in the middle to do the data cleansing and the data type translation. If the data was not 100% clean (which is the case in most circumstances) a developer was needed to get it to a consistent, proper form. Besides wasting the valuable time of that expert, this process meant that business analysts couldn’t directly access and analyze data in Hadoop clusters. Other database connectors require duplicating the data into HDFS by using proprietary formats; a cumbersome and expensive approach by any measure.
SQL-H, an industry-first, solves all those problems.
First, we have integrated Aster’s metadata engine with Hadoop’s emerging metadata standard, HCatalog. This means that data stored in Hadoop using Pig, Hive & HBase can be “seen” in an Aster system as if they are just another Aster view. The business implication is that a business analyst using standard SQL or a BI tool can have full and seamless access to Hadoop data through the Aster’s standard ODBC/JDBC connector and Aster’s SQL engine. There is no need to have a human in the middle to translate the data or ensure its consistency; and no need to file tickets or call up experts to get the data the business needs. Everything happens transparently, seamlessly, and instantly. This is an industry first, since today all available Hadoop tools either do not provide standard SQL interfaces that are well optimized, do not provide native BI compatibility, or require manual data translation and movement from Hadoop to a third party system. None of these approaches are viable options for SQL & BI execution on Hadoop data, thus making it hard for enterprises to get value from Hadoop.
Secondly, SQL-H provides a high-performance, type-safe data connector, that can take a SQL or SQL-MapReduce query that involves Hadoop data, automatically select the minimum subset of data in Hadoop that is required for execution of the query, and run the query on the Aster system. The performance of running SQL and SQL-MapReduce analytics in Aster is significantly higher than Hadoop because (a) Aster can optimize data partitioning and distribution, thus reducing network transfers and overhead; (b) Aster’s engine can keep statistics about the data and use that to optimize execution of both SQL & MapReduce; (c) Aster’s SQL queries are cost-based-optimized which means that it can handle very complex SQL, including SQL produced by BI tools, very efficiently.
In addition, one can take advantage of SQL-H to apply the 50+ pre-build SQL-MapReduce apps that Teradata Aster provides on Hadoop data, thus doing big data analytics that are impossible to do in every other database without having to write a single line of Java MapReduce code! These apps include functions for path & pattern analysis, statistics, graph, text analysis, and more.
Teradata Aster is committed to groundbreaking product innovation as the key strategy in maintaining our #1 position in the big analytics market. SQL-H is another important step that we expect will make Hadoop and big data analytics much more palatable for enterprise environments, allowing business analysts, SQL power-users & BI tool users to analyze Hadoop data without having to learn about Hadoop interfaces and code.
If you want to find out more we’ll be talking about SQL-H at Hadoop Summit, on webcast taking place June 21st, at the upcoming Big Analytics 2012 events in Chicago & New York, and at the annual Teradata Partners event.
|
|
|
|
|
|
|
|
|
|
|
|
|
It has been about seven years since Aster Data was founded, four years since our industry-first Enteprise SQL-MapReduce implementation (first commercial MapReduce offering) and three years since our first Big Data Summit event (the first “Big Data” event in the industry as far as I know). During this whole time, we have witnessed our technology investments take off together with the Big Data market – just think how many people had never even heard the word MapReduce three years ago, and how many swear by it today!
As someone who was caught in the Big Data wave since 2005, I can tell you that the stage of the market has changed significantly during this time – and with it, the challenges that Enterprise customers face. A few years ago, customers were realizing the challenges that piles of new types of data were bringing – big volumes (terabytes to petabytes) and new, complex types (multi-structured data such as weblogs, text, customer interaction data); but at the same time, the opportunities that the new analytical interfaces, like MapReduce, were enabling. Fast forward to today and most enterprises are trying to put together their Big Data strategies and make sense of what the market has to offer – and as a result there is a lot of market noise and confusion: it is usually not clear what use cases apply to traditional technologies versus new; how to reconcile existing technologies with new investments; and what type of projects will they give them highest ROI versus a long and painful failure.
Teradata and Teradata Aster have a high interest in customers being successful with Big Data challenges and technologies, because we believe that the growth of the market will translate into growth for us. Given Teradata’s history in being the #1 strategic advisor to customers around data management and analytics, we only want to offer the best solutions to our customers. This includes our products –which are recognized by Gartner as leading technologies in Data Warehousing and Big Data analytics– but also our expertise helping customers how to use complementary solutions, like Hadoop, and making sure that the total solution works reliably and succeeds in tackling big business problems.
With this partnership, we are taking one more step towards this direction. So we are announcing three things:
1. Teradata and Hortonworks will work together to jointly solve big challenges for our customers. This is a win/win for customers and the industry.
2. Our intent to do joint R&D to make it easier for customers that use products from Teradata and Hadoop to utilize these products together. This is important because every enterprise will look to combine new technologies with existing investments, and there is plenty of opportunity to do better.
3. A set of reference architectures that combine Teradata and Hadoop products to accelerate the implementation of Big Data Big Data projects. We hope that this will be a starting point that will save enterprises time and money when they embark on Big Data projects.
We believe that all the above three points will translate into eliminating risks and unnecessary trial and error. We have enough collective experience to guide customers to avoid failed projects and traps. And by helping clear up some of the confusion in the big data market, we hope to accelerate its growth and the benefit to Enterprises that are looking to utilizing their data to become more competitive and efficient.
|
|
|
|
|
|
|
|
|
|
|
|
|
When we kicked off Aster Data back in 2005, we envisioned building a product that would advance the state of the art in data management in two areas; (1) size and diversity of data and (2) depth of insight/analytics. My co-founders and I quickly realized that building just another database wouldn’t cut it. With yet-another-database, even if we enabled companies to more cost-effectively manage large data sizes, it was not going to be enough given the explosion in diverse data types and the massive need to process all of it. So we set out to build a new platform that would solve these challenges – what’s now commonly known as the ‘Big Data’ challenge.
Fast forward to 2008 when Aster Data led the way in putting massive parallel processing inside a MPP database, using MapReduce, to advance how you process massive amounts of diverse data. While this was fully aligned with our vision for managing hoards of diverse data and allowing deep data processing in a single platform, most thought it was intriguing but couldn’t quite see the light in terms of where the future was going. At one point, we thought of naming our product XAP – “extreme analytic platform” or “extreme analytic processing” as that’s what it was designed to do from day one. However, we thought better of it since we thought we would have to educate people too much on what an “analytic platform” was and how it was different from a traditional DBMS for data warehousing. Since we also were serving the data architects in organizations as well as the front-line business that demands better, faster analytics, we needed to use terminology that resonated with both.
Then, in the fall of 2009, with our flagship product Aster Data nCluster 4.0, we made further strides in running advanced analytics inside the database by including all the built-in application services (e.g. like dynamic WLM, backup, monitoring, etc) to go with it. At that time, we referred to it as a Data-Application Server – which our customers quickly started calling a Data-Analytics Server. I remember when analyst Jim Kobielus at Forrester said,
“It’s really innovative and I don’t use those terms lightly. Moving application logic into the data warehousing environment is ‘a logical next step’.”
And others saying,
“The platform takes a different approach from traditional data warehouses, DBMS and data analytics solutions by housing data and applications together in one system, fully parallelizing both. This eradicates the need for movements of massive amounts of data and the problems with latency and restricted access that creates.”
What they started to fully appreciate and realize is that big data is not just about storing hoards of data, but rather, cracking the code on how to process all of it in deep ways, at blazing fast speeds. Read the rest of this entry »
|
|
|
|
|
|
|
|
|
|
|
|
|
Dave Kellog’s blog reminded me that the Claremont DB Research report was recently released. The Claremont report is the result of two days of discussion among some of the world’s greatest academics in databases and aims to identify and promote the most promising research directions in databases.
As I was reading the report, I realized that Aster Data is at the forefront of some of the most exciting database research topics. In particular, the report mentions four areas (out of a total of six) where Aster has been driving innovation very aggressively.
1. Revisiting database engines. MPP is the answer to Big Data, among other things.
2. Declarative programming for emerging platforms. MapReduce is explicitly mentioned here, noting its potential in data management. This is a very important development given that certain database academics (that participated in the report) have repeatedly shown their depreciation and ignorance on the topic.
3. Interplay of structured and unstructured data. This is an important area where MapReduce can play a huge role.
4. Cloud data services. Database researchers realize the potential of the cloud, both as a data management and a research tool. With our precision scaling feature, we are a strong fit for internal Enterprise clouds.
The world of databases is changing fast and this is an opportunity for us to provide the most cutting-edge database technology to our customers.
We’ve also found a lot of benefit from our strong ties with academia, by nature of our background and advisors, and we intend to strengthen these even more.
|
|
|
|
|
|
|
|
|
|
|
|
|
In response to Aster’s In-Database MapReduce initiative, I’ve been asked the following question:
“How does Aster Data Systems compete with open source MapReduce implementations, such as Hadoop?”
My answer –we simply do not.
Hadoop and Google’s implementation of MapReduce are targeted to the development (coding) community. The primary interface of these systems is the command line; and the primary means of accessing data is through Java or Python code. There have been efforts to build higher-level interfaces on top of these systems, but they are usually limited, do not follow any existing standard, and are incompatible with the existing filesystem.
Such tools are ideal for environments that are dominated by engineers, such as academic institutions, research labs or technology companies like Google/Yahoo that have a strong culture of in-house development (often hundreds of thousands of lines of code) to solve technical problems.
Most enterprises are unlike the culture of Google/Yahoo and each “build vs. buy” decision is carefully considered. Good engineering talent is a precious resource that is directed towards adding business value, not in building infrastructure from the ground up. The Data Services groups are universally under-staffed and consist of people that understand and leverage databases. As such, there are corporate governance expectations from any data management tool that they use:
- it has to comply with applicable standards like ANSI-SQL,
- it needs to provide a set of tools that IT can use & manage, and
- it needs to be ecosystem-friendly (BI and data integration tools compatibility).
In such an environment, using Java or developer-centric command line as the primary interface will increase the burden on the data services group and their IT counter-parts.
I strongly believe, that while existing MapReduce tools are good for development organizations, they are totally inappropriate for a large majority of enterprise IT departments.
Our goal is not to build yet another tool for development groups, but rather to create a product that unleashes the power of MapReduce for the enterprise IT organization.
How can we achieve that?
First, we’ve developed Aster to be a super-fast, always-parallel database for large-scale data warehousing using SQL. Then we allow our customers and partners to extend SQL through a tightly integrated MapReduce functionality.
The person that develops our MapReduce functions, naturally, needs to be a developer; but the person that is using this functionality can be an analyst using a standard BI tool (e.g., Microstrategy, Business Objects, Pentaho) over ODBC or JDBC connections!
Invoking MapReduce functions in Aster looks almost identical to writing standard SQL code. This way, the powerful MapReduce extensions that are developed by a small set of developers (either within an IT organization or by Aster itself) can be used by people with SQL skills using their existing sets of tools.
Integrating MapReduce and SQL is not an easy job; we had to innovate on multiple levels to achieve that, e.g. by creating a new type of UDFs that are both parallel and polymorphic, to make MapReduce extensions almost indistinguishable from standard SQL.
In summary, we have enabled:
- The flexible, parallel power of MapReduce to enable deep analytical insights that are impossible to express in standard SQL
- Seamless integration with ANSI standard SQL and all the rich commands, types, functions, etc. that are inherent in this well-known language
- Full JDBC/ODBC support ensures interoperability between Aster In-Database MapReduce and 3rd party database ecosystem tools like BI, reporting, advanced analytics (e.g., data mining), ETL, monitoring, scheduling, GUI administration, etc.
- SQL/MR functions –powerful plug-in operators that any non-engineer can easily plug into standard ANSI SQL to exploit the power of MapReduce analytic applications
- Polymorphism –unlike static, unreliable UDFs, SQL/MR functions unleash the power of polymorphism (run-time/dynamic) for cost-efficient reusability. Built-in sandboxing ensures fault tolerance to avoid system crashes commonly experienced with UDFs
To conclude, it is important to understand that Aster nCluster is not yet another MapReduce implementation nor does it compete with Hadoop for resources or audience.
Rather, Aster nCluster is the world’s most powerful database that breaks traditional SQL barriers allowing Data Services groups and IT organizations to extract more knowledge out of their data
|
|
|
|
|
|
|
|
|
|
|
|
|
Building on Mayank’s post, let me dig deeper into a few of the most important differences between Aster’s In-Database MapReduce and User Defined Functions (UDFs):
| Feature |
User Defined Functions |
Aster SQL/MR Functions |
What does it mean? |
| ———— |
———— |
———— |
———— |
| Dynamic Polymorphism |
No. Requires changing the code of the function and static declarations |
Yes |
SQL/MR functions work just like SQL extensions – no need to change function code |
| ———— |
———— |
———— |
———— |
| Parallelism |
Only in some cases and for few number of nodes |
Yes, across 100s of nodes |
Huge performance increases even for the most complex functions |
| ———— |
———— |
———— |
———— |
| Availability Ensured |
No. In most cases UDFs run inside the database |
Always. Functions run outside the database |
Even if functions have bugs, the system remains resilient to failures |
| ———— |
———— |
———— |
———— |
| Data Flow Control |
No. Requires changing the UDF code or writing complex SQL subselects |
Yes. “PARTITION BY” and “SEQUENCE BY” natively control the flow of data in and out of the SQL/MR functions |
Input/output of SQL/MR functions can be redistributed across the database cluster in different ways with no actual function code change |
In this blog post we’ll focus on Polymorphism – what it is and why it’s so critically important for building real SQL extensions using MapReduce.
Polymorphism allows Aster SQL/MR functions to be coded once (by a person that understands a programming or scripting language) and then used many times through standard SQL by analysts. In this context, comparing Aster SQL/MR functions and UDFs is like comparing SQL with the C language. The former is flexible, declarative and dynamic, the latter requires customization and recompilation even for the slightest change in usage.
For instance, take a SQL/MR function that performs sessionization. Let us assume that we have a webclicks(userId int, timestampValue timestamp, URL varchar, referrerURL varchar); table that contains a record of clicks for each user on our website. The same function, with no additional declarations, can be used in all the following ways:
SELECT sessionId, userId, timestampValue
FROM Sessionize( 'timestamp', 60 ) ON webclicks;
SELECT sessionId, userId, timestampValue
FROM Sessionize( 'timestamp', 60 ) ON
(SELECT userid, timestampValue FROM webclicks WHERE userid = 50 );
[Note how the number of input arguments changed (going from all columns of webclicks; to just two columns of webclicks) in the above clause but the same function can be used. This is not possible with a plain UDF without writing additional declarations and UDF code]
SELECT sessionId, UID, TS
FROM Sessionize( 'ts', 60 ) ON
(SELECT userid as UID, timestampValue as TS FROM webclicks WHERE userid = 50 );
[Note how the names of the arguments changed but the Sessionize() function does the right thing]
In other words, Aster SQL/MR functions are real SQL extensions – once they’ve been implemented there is zero need to change their code or write additional declarations – there is perfect separation between implementation and usage. This is an extremely powerful concept since in many cases the people that implement UDFs (engineers) and the people that use them (analysts) have different skills. Requiring a lot of back-and-forth can simply kill the usefulness of UDFs – but not so with SQL/MR functions.
How can we do that? There’s no magic, just technology. Our SQL/MR functions are dynamically polymorphic. To do this, our SQL/MR implementation (the sessionize.o file) includes not only code but also logic to determine its output schema based on its input, which is invoked at every query. This means that there is no need for a static signature as is the case with UDFs!

Polymorphism also makes it trivial to nest different functions arbitrarily. Consider a simple example with two functions, Sessionize() and FindBots(). FindBots() can filter the input from any users that seem to act as bots, e.g. users whom interactions are very frequent (who could click on 10 links per second? probably not a human). To use these two functions in combination, one would simply write:
SELECT sessionId, UserId, ts, URL
FROM Sessionize( 'ts', 60 ) ON FindBots( 'userid', 'ts' ) ON webclicks;
Using UDFs instead of SQL/MR functions would mean that this statement would require multiple subselects and special UDF declarations to accommodate the different inputs that come out of the different stages of the query.
So what is it that we have created? SQL? Or MapReduce? It really doesn’t matter. We just combined the best of both worlds. And it’s unlike anything else the world has seen!
|
|
|
|
|
|
|
|
|
|
|
|
|
Pardon the tongue-in-cheek analogy to Oldsmobile when describing user-defined functions (UDFs), but I want to draw out some distinctions between this new class of functions that In-Database MapReduce enables.

While similar on the surface, in practice there are stark differences between Aster In-Database MapReduce and traditional UDF’s.
MapReduce is a framework that parallelizes procedural programs to offload traditional cluster programming. UDF’s are simple database functions and while there are some syntactic similarities, that’s where the similarity ends. Several major differences between In-Database MapReduce and traditional UDF’s include:
Performance: UDF’s have limited or no parallelization capabilities in traditional databases (even MPP ones). Even where UDF’s are executed in parallel in an MPP database, they’re limited to accessing local node data, have byzantine memory management requirements, require multiple passes and costly materialization. In constrast, In-Database MapReduce automatically executes SQL/MR functions in parallel across potentially hundreds or even thousands of server nodes in a cluster, all in a single-pass (pipelined) fashion.
Flexibility: UDF’s are not polymorphic. Some variation in input/output schema may be allowed by capabilities like function overloading or permissive data-type handling, but that tends to greatly increase the burden on the programmer to write compliant code. In contrast, In-Database MapReduce MR/SQL functions are evaluated at run-time to offer dynamic type inference, an attribute of polymorphism that offers tremendous adaptive flexibility previously only found in mid-tier object oriented programming.
Manageability: UDF’s are generally not sandboxed in production deployments. Most UDF’s are executed in-process by the core database engine, which means bad UDF code can crash a database. SQL/MR functions execute in their own process for full fault isolation (bad SQL/MR code results in an aborted query, leaving other jobs uncompromised). A strong process management framework also ensures proper resource management for consistent performance and progress visibility.
|
|
|
|
|
|
|
|
|
|