Archive for the ‘MapReduce’ Category

15
Mar
By Steve Wooledge in Analytics, MapReduce, Teradata Aster on March 15, 2012
   

Yesterday I presented at the Los Angeles Teradata User Group on the topic of “Data Science: Finding Patterns in Your Data More Quickly & Easily with MapReduce”. One point discussed was the common misnomer that big data is about volume, which is certainly part of the issue organizations are facing. However, the big story in big data is the complexity and additional processing required to make “unstructured” data actionable through analytics. This is where procedural frameworks like MapReduce can help. Here is a great post by Teradata’s own Bill Franks about unstructured data which helps describe the requirements unstructured data demands in the context of analytics.

As Franks notes, “the thought of using unstructured data really shouldn’t intimidate people as much as it often does.” Read more to learn why.

 



21
Feb
By Tasso Argyros in Analytic platform, Analytics, Analytics tech, Database, MapReduce on February 21, 2012
   

It has been about seven years since Aster Data was founded, four years since our industry-first Enteprise SQL-MapReduce implementation (first commercial MapReduce offering) and three years since our first Big Data Summit event (the first “Big Data” event in the industry as far as I know). During this whole time, we have witnessed our technology investments take off together with the Big Data market – just think how many people had never even heard the word MapReduce three years ago, and how many swear by it today!

As someone who was caught in the Big Data wave since 2005, I can tell you that the stage of the market has changed significantly during this time – and with it, the challenges that Enterprise customers face. A few years ago, customers were realizing the challenges that piles of new types of data were bringing – big volumes (terabytes to petabytes) and new, complex types (multi-structured data such as weblogs, text, customer interaction data); but at the same time, the opportunities that the new analytical interfaces, like MapReduce, were enabling. Fast forward to today and most enterprises are trying to put together their Big Data strategies and make sense of what the market has to offer – and as a result there is a lot of market noise and confusion: it is usually not clear what use cases apply to traditional technologies versus new; how to reconcile existing technologies with new investments; and what type of projects will they give them highest ROI versus a long and painful failure.

Teradata and Teradata Aster have a high interest in customers being successful with Big Data challenges and technologies, because we believe that the growth of the market will translate into growth for us. Given Teradata’s history in being the #1 strategic advisor to customers around data management and analytics, we only want to offer the best solutions to our customers. This includes our products –which are recognized by Gartner as leading technologies in Data Warehousing and Big Data analytics– but also our expertise helping customers how to use complementary solutions, like Hadoop, and making sure that the total solution works reliably and succeeds in tackling big business problems.

With this partnership, we are taking one more step towards this direction. So we are announcing three things:

1. Teradata and Hortonworks will work together to jointly solve big challenges for our customers. This is a win/win for customers and the industry.

2. Our intent to do joint R&D to make it easier for customers that use products from Teradata and Hadoop to utilize these products together. This is important because every enterprise will look to combine new technologies with existing investments, and there is plenty of opportunity to do better.

3. A set of reference architectures that combine Teradata and Hadoop products to accelerate the implementation of Big Data Big Data projects. We hope that this will be a starting point that will save enterprises time and money when they embark on Big Data projects.

We believe that all the above three points will translate into eliminating risks and unnecessary trial and error. We have enough collective experience to guide customers to avoid failed projects and traps. And by helping clear up some of the confusion in the big data market, we hope to accelerate its growth and the benefit to Enterprises that are looking to utilizing their data to become more competitive and efficient.



29
Sep
By Tasso Argyros in Analytic platform, Analytics, MapReduce on September 29, 2011
   

One of the great things about starting your own company (if you’re lucky and your company does well) is that you take part in the evolution of a whole new market, from its nascent days to its heyday. This was the case with Aster and the “Big Data” market. Back when we started Aster, in 2005, MPP systems that could store and analyze data using off-the-self servers was still a pretty new concept. I also recall in 2008, when we first came out with our native in-database MapReduce support — and our SQL-MapReduce® technology — we had to explain to most people what MapReduce even was. In 2009, we came out with the first Big Data event series — “Big Data Summit” — because we knew we were doing something new and wanted a term to describe it. “Big Data” caught on more than we had imagined back then, and the rest is history. Product innovation was at the core of Aster’s existence, and we kept pushing ourselves and our product to become the best platform for enterprise-class data analytics using both SQL and MapReduce as first class citizens on one analytic platform.

Today there is a lot of innovation in the big data market. However, we see a “chasm” between the SQL technologies—which are very enterprise-friendly—and the new wave of open source big data or “NoSQL” software which is used extensively by engineering organizations. In the middle is a very large number of enterprises trying to understand how they can use these new technologies to push their analytical capabilities beyond purely SQL, while at the same time utilizing their existing investments in technologies and people. This is the problem that Aster solves.

With last week’s announcement, the launch of our Teradata Aster MapReduce solutions which include Aster Database 5.0 software (formerly Aster nCluster) and our new Aster MapReduce Appliance, we bring to market the best answer for the organizations that are “caught in the middle.”  Unlike SQL-only systems focused primarily on analyzing structured data, our database and appliance provide support for native MapReduce which enables a new generation of analytics, such as digital marketing optimization, social graph analysis, fraud detection based on customer behavior, etc. Our newly extended libraries of pre-built MapReduce analytical functions allows such applications to be developed with significantly less time and cost versus other MapReduce technologies. And, unlike other MapReduce-based systems, we offer full SQL support, integration with all major BI and ETL vendors and a data adapter to EDW systems that allows enterprises to utilize existing tools and skills to bring big data analytics to their businesses. Finally, with our new appliance, we leverage Teradata’s strength and engineering to provide a proven and performance-optimized system for businesses to start analyzing untapped diverse data while cutting down on time, cost and frustration!

As we move forward, Aster is committed to being the leader in SQL and MapReduce analytics for multi-structured data. Having spent 6 years in this market, we believe that it’s not just the coolest technologies that will win, but the ones that make it easier for business analysts and data scientists within organizations to solve their business problems and innovate with analytics. With the launch of our new Teradata Aster solutions — including the revamped SQL-MapReduce interfaces and the new Aster MapReduce appliance—we are pushing the state of the art towards this direction (or as my marketing team likes to say – “bringing the science of data to the art of business”). :)



25
May
By jonbock in Analytics, MapReduce on May 25, 2011
   

In case you missed the news, Aster Data just took another step to make SQL-MapReduce the best programming framework for big data analytics. The Aster Data SQL-MapReduce® Developer Portal is the first collaborative online developer community for SQL-MapReduce analytics, our framework for processing non-relational data and ultra-fast analytics. It builds on other efforts to enable MapReduce analytics including: Developer Center, a resource center for MapReduce and SQL-MapReduce developers; Aster Data Developer Express, the first integrated development environment for SQL-MapReduce; and Aster Data Analytic Foundation, a suite of ready-to-use SQL-MapReduce functions.

The Developer Portal gives our customers and partners a community for collaborating with peers to leverage the flexibility and power of SQL-MapReduce for analytics that were previously impossible or impractical. Data scientists, quantitative analysts, and developers from customers, partners, and Aster Data are using the portal to highlight insights and best practices, share analytic functions, and leverage the experience and knowledge of the community to easily harness the power of SQL-MapReduce for big data analytics.

The portal enables collaboration that is key in making it easy for our customers to become SQL-MapReduce experts so they can solve core business challenges. As Navdeep Alam, director of data architecture at Mzinga, said, the portal “will allow us the ability to share and leverage insights with others in using big data analytics to attain a deeper understanding of customers’ behavior and create competitive advantage for our business.”

We’re seeing strong interest in the Developer Portal from our current customers. Early activity and content on the portal includes discussions about using the GSL libraries, programming in .NET, and writing sessionization and sampling functions. We plan to expand on this with tutorials for additional functions over the next few months.

If you aren’t already a customer, we encourage you to get started at the Aster Data Developer Center, where you can get your hands on SQL-MapReduce by downloading Aster Data Developer Express for free and find links to other resources like www.mapreduce.org.  If you are an Aster Data customer, we encourage you to also register for access to the new SQL-MapReduce Developer Portal for additional content and learning.

We’re always interested in your feedback as to how we can better help developers learn about and use MapReduce and Aster Data’s SQL-MapReduce.  If you have any suggestions, please feel free to add them below in the comments.



26
Jan
By Tasso Argyros in Analytic platform, Analytics, Database, MapReduce on January 26, 2011
   

When we kicked off Aster Data back in 2005, we envisioned building a product that would advance the state of the art in data management in two areas; (1) size and diversity of data and (2) depth of insight/analytics. My co-founders and I quickly realized that building just another database wouldn’t cut it. With yet-another-database, even if we enabled companies to more cost-effectively manage large data sizes, it was not going to be enough given the explosion in diverse data types and the massive need to process all of it. So we set out to build a new platform that would solve these challenges – what’s now commonly known as the ‘Big Data’ challenge.

Fast forward to 2008 when Aster Data led the way in putting massive parallel processing inside a MPP database, using MapReduce, to advance how you process massive amounts of diverse data. While this was fully aligned with our vision for managing hoards of diverse data and allowing deep data processing in a single platform, most thought it was intriguing but couldn’t quite see the light in terms of where the future was going. At one point, we thought of naming our product XAP – “extreme analytic platform” or “extreme analytic processing” as that’s what it was designed to do from day one. However, we thought better of it since we thought we would have to educate people too much on what an “analytic platform” was and how it was different from a traditional DBMS for data warehousing. Since we also were serving the data architects in organizations as well as the front-line business that demands better, faster analytics, we needed to use terminology that resonated with both.

Then, in the fall of 2009, with our flagship product Aster Data nCluster 4.0, we made further strides in running advanced analytics inside the database by including all the built-in application services (e.g. like dynamic WLM, backup, monitoring, etc) to go with it. At that time, we referred to it as a Data-Application Server – which our customers quickly started calling a Data-Analytics Server.  I remember when analyst Jim Kobielus at Forrester said,

“It’s really innovative and I don’t use those terms lightly. Moving application logic into the data warehousing environment is ‘a logical next step’.”

And others saying,

“The platform takes a different approach from traditional data warehouses, DBMS and data analytics solutions by housing data and applications together in one system, fully parallelizing both. This eradicates the need for movements of massive amounts of data and the problems with latency and restricted access that creates.”

What they started to fully appreciate and realize is that big data is not just about storing hoards of data, but rather, cracking the code on how to process all of it in deep ways, at blazing fast speeds. Read the rest of this entry »



08
Dec
By Tasso Argyros in MapReduce on December 8, 2010
   

In the past couple of years, MapReduce – once an unknown, funky word – became a prominent, mainstream trend in data management and analytics. However even today I meet people that are not clear on what MapReduce exactly is and how it relates to some other terms and trends. In this post I attempt to clarify some of the MapReduce-related terminology. So here it goes.

MapReduce (the framework). MapReduce is a framework that allows programmers to develop analytical applications that run on (usually large) clusters of commodity hardware and process (usually large) amounts of data. It was first introduced by Google and it is language independent. It is abstract in the sense that an application that uses MapReduce doesn’t have to care about things like the number of servers/processes, fault tolerance, etc. MapReduce is supported by multiple implementations including the open source project Hadoop and Aster Data. Google also has its own proprietary implementation which, unfortunately, is also called MapReduce and sometimes creates confusion.

MapReduce (the Google implementation of MapReduce framework). As mentioned above, Google has its own implementation of MapReduce. This was described in the 2004 OSDI paper and it was the theoretical basis upon which Hadoop was developed. Google’s MapReduce was a processing framework and it was using Google’s GFS (Google File System) for data storage.

Aster Data’s SQL-MapReduce. Aster Data has a patent-pending implementation of MapReduce that (a) uses a database for data persistence, (b) is tightly integrated with SQL, i.e. an analyst or BI tool can invoke MapReduce via SQL queries, thus making MapReduce accessible to the enterprise. It supports multiple programming languages such as Java and C and it is accessible through standard interfaces such as ODBC and JDBC.

Hadoop. Hadoop is an Apache “umbrella” project that hosts many sub-projects, including Hadoop MapReduce and HDFS, Hadoop’s version of the Google File System which Hadoop MapReduce uses for data storage. Hadoop is the core open source project – however, there are many distributions for Hadoop, just as there are many distributions for Linux. These distributions contain Hadoop binaries together with other utilities and tools. The most popular distributions are the Cloudera distribution, the Yahoo distribution and the baseline Apache distribution.

HDFS. HDFS is Hadoop’s version of GFS and it is a distributed file system. HDFS can exist without Hadoop MapReduce, but usually Hadoop MapReduce requires HDFS. Aster Data’s MapReduce does not require HDFS as it uses an extensible MPP database for data storage and persistence.

Cloudera. Cloudera usually means either (a) the company, (b) Cloudera’s Distribution for Hadoop.

Sqoop. Sqoop which is short for “SQL to Hadoop” is an open source project that provides a framework for connecting to SQL data stores for data exchange.

NoSQL. NoSQL started as a term to describe a collection of products that did not support or rely on SQL. This included Hadoop and other products like Cassandra. However, as more people realized that SQL is a necessary interface  for many data management systems, the term evolved to mean (N)ot (o)nly SQL. These days there are attempts to port SQL on top of Hadoop and other NoSQL products.

Are there any MapReduce-related terms I omitted? Please add them in the comments below and include a definition and links to good resources if you’d like.



09
Nov
By Mayank Bawa in Analytics, MapReduce on November 9, 2010
   

It’s ironic how all of a sudden Vertica is changing its focus from being a column-only database to claiming to be an Analytic Platform.

If you’ve used an Analytic Platform you know it’s more than just bolting in a layer of analytic functions on top of a database. But that’s how Vertica claims it’s now a full-blown analytic platform when in fact their analytics capability is rather thin. For instance, their first layer is a pair of window functions (CTE and CCE). The CCE window function is used, for example, to do sessionization. Vertica has a blog post that posits sessionization as a major advanced analytic operation. In truth, Vertica’s sessionization is not analytics. It is a basic data preparation step that adds a session attribute to each clickstream event so that very simple session-level analytics can be performed.

What’s interesting is the CCE window function is simply a pre-built function – some might say just syntactic sugar – that combines the functionality of finite width SQL window functions (LEAD/LAG) with CASE statements (WHEN condition THEN predicate). Nothing ground breaking to say the least!

For example, the CTE query referred to in a Vertica blog post can be rewritten very simply using SQL-99:

SELECT
symbol, bid, timestamp,
SUM(CASE WHEN bid > 10.6 THEN 1 ELSE 0 END)
OVER (PARTITION BY symbol ORDER BY timestamp) window_id
FROM tickstore;

The layering of custom pre-built functions has for a long time been the traditional way of adding functions to a database. The SQL-99 and SQL-2003 analytic functions themselves follow this tradition.

The problem with this is not just with Vertica but also with the giants of the market, Oracle and Microsoft for instance. Their approach is that the customer is at the mercy of the database vendor – pre-built analytic functions are hard-coded to every major release of the DBMS. There is no independence between the analytics layer and the DBMS – which real, well-architected analytic platforms need to have. Simply put, if you want to do a different sessionization semantic, you’ll have to wait for Vertica to build a whole new function. Read the rest of this entry »



15
Apr
By Mayank Bawa in Analytics, Cloud Computing, MapReduce on April 15, 2010
   

In the last few years there has been a significant amount of market pickup, from users and vendors, on data clouds and advanced analytics – specifically a new class of data-driven applications run in a data cloud or on-premise. What’s different about this from past approaches is the frequency and speed at which these applications are accessed, the depth of the analysis, the number of data sources involved and the volume of data mined by these applications – terabytes to petabytes. In the midst of this cacophony of dialogue, recent announcements from vendors in this space are helping to clarify different visions and approaches to the big data challenge.

Both Aster Data and Greenplum made announcements this week that illustrated different approaches. At the same time that Aster Data announced the Aster Analytics Center, Greenplum announced an upcoming product named Chorus. I wanted to take a moment to compare and contrast what these announcements say about the direction of the two companies.

Greenplum’s approach speaks to two traditional problem areas i) access to data, from provisioning of data marts to connectivity to data across marts, and ii) some level of collaboration among certain developers and analysts. Their approach is to create a tool for provisioning, unified data access, and sharing of annotations and data among different developers and analysts. Interestingly, this is not an entirely new concept; these are well-known problems for which a number of companies and tools have already developed best-of-breed solutions over the last 15 years. For example, the capabilities for data access are another version of Export/Copy primitives that already exist in all databases and that have been built upon by common ETL and EII tools for cases in which richer support than Export & Copy are needed – for instance, when data has to be transformed, correlated or cleaned while being moved from one context (mart) to another (mart).

This approach is indicative of a product direction in which the primary focus is on adding another option to the list of tools available to customers to address these problems. It’s really not a ground-breaking innovation that evolves the world of analytics. New types of analytics, or ‘data-driven applications,’ is where the enormous opportunity lies. The Greenplum approach of data collaboration is interesting in a test environment or sandbox. When it comes to real production value however, it effectively increases the functions available to the end user, but at a big cost due to significant increases in complexity, security issues and extra administrative overhead. What does this mean exactly?

  • The spin-up of marts and moving data around can result in “data sprawl” which ultimately increases administrative overhead and is dangerous in these days of compliance and sensitivity to privacy and data leaks.
  • Adding a new toolset into the data processing stack creates difficult and painful work to either manage and administer multiple tool sets for similar purposes or to eliminate and transition away from investments in existing toolsets.
  • To enable effective communication and sharing, users need strong processes and features for source identification of data, data collection, data transformation, rule administration, error detection & correction, data governance and security. The quality and security policies around meta-data are especially important as free-form annotations can lead to propagation of errors or leaks in the absence of strong oversight.

In contrast, Aster Data’s recent announcements support our long-standing investments in our unique advanced in-database architecture where applications run fully inside Aster Data’s platform with complete application services essential for complex analytic applications. The announcements highlight that our vision is not to create a new set of tools and layers in the data stack that recreate capabilities currently available from a number of leading vendors, but rather to deliver a new Analytics Platform, a Data-Application Server, to uniquely enable analytics professionals to create data-rich applications that were impossible or impractical before – namely, to create and use advanced analytics for rich, rapid, and scalable insights into their data. This focus is complemented by our partners, who offer proven best-of-breed solutions for collaboration and data transformation.

Read the rest of this entry »



10
Sep
By Tasso Argyros in Blogroll, MapReduce on September 10, 2008
   

I am very excited about the power that In-Database MapReduce puts in the hands of the larger BI community. I’ll be leading a Night School session on In-Database MapReduce at the TDWI World Conference in November in New Orleans.

Please join me if you are interested in learning more about the MapReduce framework and its applications. I will introduce MapReduce from the basic principles, and then help build up your intuition. If we have time, I will even address why MapReduce is not UDF re-discovered. :-)

If you are unable to attend, or eager to understand, here are some MapReduce resources you may find informative: Aster’s whitepaper on In-Database MapReduce; Google Labs’ MapReduce research paper; Curt Monash’s post on Known Applications of MapReduce.

A great open-source project that I’d like to commend and draw your attention to illustrate the power of MapReduce is Apache’s Mahout Project, which is building machine learning algorithms on the MapReduce framework (Classification, Clustering, Regression, Dimension reduction and Evolutionary Algorithms).

I am sure this is just a snippet of the MapReduce resources available. If you have some that you have found helpful, please share them in your comments. I will be happy to review and cover them in our TDWI Night School!



06
Sep
By Tasso Argyros in Blogroll, Database, MapReduce on September 6, 2008
   

In response to Aster’s In-Database MapReduce initiative, I’ve been asked the following question:

“How does Aster Data Systems compete with open source MapReduce implementations, such as Hadoop?”

My answer –we simply do not.

Hadoop and Google’s implementation of MapReduce are targeted to the development (coding) community. The primary interface of these systems is the command line; and the primary means of accessing data is through Java or Python code. There have been efforts to build higher-level interfaces on top of these systems, but they are usually limited, do not follow any existing standard, and are incompatible with the existing filesystem.

Such tools are ideal for environments that are dominated by engineers, such as academic institutions, research labs or technology companies like Google/Yahoo that have a strong culture of in-house development (often hundreds of thousands of lines of code) to solve technical problems.

Most enterprises are unlike the culture of Google/Yahoo and each “build vs. buy” decision is carefully considered. Good engineering talent is a precious resource that is directed towards adding business value, not in building infrastructure from the ground up. The Data Services groups are universally under-staffed and consist of people that understand and leverage databases. As such, there are corporate governance expectations from any data management tool that they use:

- it has to comply with applicable standards like ANSI-SQL,

- it needs to provide a set of tools that IT can use & manage, and

- it needs to be ecosystem-friendly (BI and data integration tools compatibility).

In such an environment, using Java or developer-centric command line as the primary interface will increase the burden on the data services group and their IT counter-parts.

I strongly believe, that while existing MapReduce tools are good for development organizations, they are totally inappropriate for a large majority of enterprise IT departments.

Our goal is not to build yet another tool for development groups, but rather to create a product that unleashes the power of MapReduce for the enterprise IT organization.

How can we achieve that?

First, we’ve developed Aster to be a super-fast, always-parallel database for large-scale data warehousing using SQL. Then we allow our customers and partners to extend SQL through a tightly integrated MapReduce functionality.

The person that develops our MapReduce functions, naturally, needs to be a developer; but the person that is using this functionality can be an analyst using a standard BI tool (e.g., Microstrategy, Business Objects, Pentaho) over ODBC or JDBC connections!

Invoking MapReduce functions in Aster looks almost identical to writing standard SQL code. This way, the powerful MapReduce extensions that are developed by a small set of developers (either within an IT organization or by Aster itself) can be used by people with SQL skills using their existing sets of tools.

Integrating MapReduce and SQL is not an easy job; we had to innovate on multiple levels to achieve that, e.g. by creating a new type of UDFs that are both parallel and polymorphic, to make MapReduce extensions almost indistinguishable from standard SQL.

In summary, we have enabled:

- The flexible, parallel power of MapReduce to enable deep analytical insights that are impossible to express in standard SQL

- Seamless integration with ANSI standard SQL and all the rich commands, types, functions, etc. that are inherent in this well-known language

- Full JDBC/ODBC support ensures interoperability between Aster In-Database MapReduce and 3rd party database ecosystem tools like BI, reporting, advanced analytics (e.g., data mining), ETL, monitoring, scheduling, GUI administration, etc.

- SQL/MR functions –powerful plug-in operators that any non-engineer can easily plug into standard ANSI SQL to exploit the power of MapReduce analytic applications

- Polymorphism –unlike static, unreliable UDFs, SQL/MR functions unleash the power of polymorphism (run-time/dynamic) for cost-efficient reusability.  Built-in sandboxing ensures fault tolerance to avoid system crashes commonly experienced with UDFs

To conclude, it is important to understand that Aster nCluster is not yet another MapReduce implementation nor does it compete with Hadoop for resources or audience.

Rather, Aster nCluster is the world’s most powerful database that breaks traditional SQL barriers allowing Data Services groups and IT organizations to extract more knowledge out of their data