Archive for the ‘Analytic platform’ Category

By Mayank Bawa in Analytic platform, Analytics on July 28, 2011

I wrote earlier that data is structured in multiple forms. In fact, it is the structure of data that allows applications to handle it “automatically” – as an automaton, i.e., programmatically – rather than relying on humans to interpret it “semantically.”

Thus a search engine can search for words, propose completion of partially typed words, do spell checking, and suggest grammar corrections “automatically”.
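As a minimal sketch of how structure enables “automatic” handling, consider word completion: because text decomposes into words, a program can complete a partially typed word by simple prefix matching. The toy vocabulary below is a hypothetical stand-in for a search engine’s index.

```python
# Toy sketch: because text has word structure, a program can "automatically"
# complete a partially typed word by prefix matching against a vocabulary.
vocabulary = ["analytics", "analysis", "analyst", "archive", "platform"]

def complete(prefix: str) -> list[str]:
    """Return every vocabulary word that starts with the typed prefix."""
    return [word for word in vocabulary if word.startswith(prefix)]

print(complete("analy"))  # ['analytics', 'analysis', 'analyst']
```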

In the last 30 years, we’ve built specialized systems to handle each data structure differently at scale. We index a large corpus of documents in a dedicated search engine for searches, we arrange lots of words in a publishing framework to compose documents, we store relational data in an RDBMS to do reporting, we store emails in an e-discovery platform to identify emails that satisfy a certain pattern, we build and store cubes in a MOLAP engine to do interactive analysis, and so on.

Each such system is a silo – it imposes a particular structure on big data, and then it leverages that structure to do its tasks efficiently at scale.

The silo approach fragments our data assets. It is expensive to maintain these silos, and it is inefficient for humans and programs to master them – they have to learn the nuances of each silo to become an expert in exploiting it. As a result, we have all kinds of data specialists – a cube expert, a text expert, a spreadsheet expert, and so on.

The state of data fragmentation reminds me of the “dedicated function machines” that pre-dated the “Personal Computer”. We used to have electronic typewriters that would create documents, calculators that would calculate formulae, fax machines that would transmit documents, even tax machines that would calculate taxes. All of these machines were booted to relic status at a museum by the general-purpose computer – their functions were ported on top of its computing framework and their data was stored in its file system. The unity of all of these functions and their data on the general-purpose computer gave rise to “integration” benefits. It made tasks easier: we can now fill in our tax forms as (structured, form-based) PDF documents, do tax calculations, and file taxes by transmitting the document – all on one platform. Our productivity has gone up. Indeed, the assimilation of data is leading to net new tasks that were not possible before. We can let programs search for the previous year’s filings, read the entries, and populate this year’s forms from them to minimize data-entry errors.

We have the same opportunity in front of us now in the field of big data. For too long, we have relegated functions that work on big data to isolated “dedicated function machines.” These dedicated function machines are bad because they are not “open.” Data in a search engine can only be “searched” – it cannot be analyzed for sentiment or plagiarism, or edited to insert or remove references. The data is the same, but each of these tasks requires its own “dedicated function machine.”

We have the option to build a general-purpose machine for big data – a multi-structured big data platform – that allows multiple structures of data to co-exist on a single platform flexible enough to perform multiple functions on that data.

Such a platform, for example, would allow us to analyze structured payments data to identify our valuable customers, interpret the sentiment of calls they made to us, find the most common problem across negative-sentiment interactions, and predict both the loss in revenue that can be prevented by solving that problem and the cost of acquiring net new customers to offset the losses. Without a multi-structured big data platform, the above workflow is a 12-18 month cycle performed by a cross-functional team of “dedicated function experts” (CFO group, Customer Support group, Products group, Marketing group) – a bureaucratic mess of project management that produces results too expensively, too infrequently, and too inaccurately, making simplifying assumptions at each step because the teams cannot agree on even basic metrics.
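To make that workflow concrete, here is a hedged sketch in Python over made-up, in-memory records – the customer names, revenue figures, thresholds, and precomputed sentiment scores are all hypothetical, and a real multi-structured platform would run the equivalent logic in place over payments tables and call transcripts.

```python
from collections import Counter

# Hypothetical in-memory records; sentiment scores are assumed to be
# precomputed by a separate text-analysis step.
payments = [
    {"customer": "acme",   "revenue": 120_000},
    {"customer": "globex", "revenue": 45_000},
]
calls = [
    {"customer": "acme",   "sentiment": -0.8, "problem": "billing errors"},
    {"customer": "acme",   "sentiment":  0.4, "problem": None},
    {"customer": "globex", "sentiment": -0.6, "problem": "billing errors"},
]

# 1) Identify valuable customers from the structured payments data.
valuable = {p["customer"] for p in payments if p["revenue"] > 50_000}

# 2) Keep the negative-sentiment interactions from those customers.
negative = [c for c in calls if c["customer"] in valuable and c["sentiment"] < 0]

# 3) Find the most common problem across the negative interactions.
top_problem, _count = Counter(
    c["problem"] for c in negative if c["problem"]
).most_common(1)[0]

# 4) Estimate the revenue at risk if that problem goes unsolved.
affected = {c["customer"] for c in negative if c["problem"] == top_problem}
at_risk = sum(p["revenue"] for p in payments if p["customer"] in affected)
print(top_problem, at_risk)  # billing errors 120000
```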

An open “Multi-Structured Big Data Platform” would be hugely enabling, opening up efficiency and functionality that we can’t imagine today.



By Mayank Bawa in Analytic platform on June 13, 2011

The “big data” world is a product of exploding applications. The number of applications generating data has gone through the roof, and the number of applications being written to consume that data is growing just as quickly. Each application wants to produce and consume data in a structure that is most efficient for its own use. As Gartner points out in a recent report on big data[1], “Too much information is a storage issue, certainly, but too much information is also a massive analysis issue.”

In our data-driven economy, business models are being created, destroyed, and reshaped based on the ability to compete on data and analytics. The winners realize the advantage of having platforms that allow data to be stored in multiple structures and (more importantly) allow data to be processed in multiple structures. This lets companies 1) harness and 2) quickly process ALL of the data about their business to better understand customers, behaviors, and opportunities/threats in the market. We call this “multi-structured” data, which has been a topic of discussion lately with IDC Research (where we first saw the term referenced) and other industry analysts. It is also the topic of an upcoming webcast we’re doing with IDC on June 15th.

To us, multi-structured data means “a variety of data formats and types.” This could include any data, “structured” or “unstructured,” “relational” or “non-relational.” Curt Monash has blogged about naming such data “poly-structured” or “multi-structured.” At the core is the ability of an analytic platform to both 1) store and 2) process a diversity of formats in the most efficient means possible.

Handling Multi-structured Data

We in the industry use the term “structured” data to mean “relational” data. And data that is not “relational” is called “unstructured” or “semi-structured.”

Unfortunately, this definition lumps text, CSV, PDF, DOC, MPEG, JPEG, HTML, and log files together as unstructured data. Clearly, all of these forms of data have an implicit “structure” to them!

My first observation is that relational is just one way of manifesting data. Text is another way of expressing data; JPEG, GIF, BMP, and other formats are structured forms of expressing images. For example, (Mayank, Aster Data, San Carlos, 6/1/2011) is a relational row stored in a table (Name, Company Visited, City Visited, Date Visited) – the same data can be expressed in text as “Mayank visited Aster Data, based in San Carlos, on June 1, 2011.” A geo-tagged photograph of Mayank entering the Aster Data office in San Carlos on June 1, 2011 would also capture the same information.
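Here is the same fact rendered in all three structures, as a small illustrative sketch – the photo metadata fields and coordinates below are simplified, hypothetical stand-ins for what a camera might embed in a JPEG’s EXIF headers.

```python
# One fact, three structures.
row = ("Mayank", "Aster Data", "San Carlos", "2011-06-01")  # relational tuple

text = "Mayank visited Aster Data, based in San Carlos, on June 1, 2011."

photo_metadata = {                      # simplified EXIF-style metadata
    "subject": "Mayank",
    "gps": (37.51, -122.26),            # approximate San Carlos coordinates
    "timestamp": "2011-06-01T09:30:00",
}
```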

My second observation is that the “structure” of data is what makes applications understand the data and know what to do with it. For example, a SQL-based application can issue the right SQL queries to process its logic; an image viewer can decode JPG/GIF/BMP files to render the data; a text engine can parse subjects, objects, and verbs to interpret the data; etc.

Each application leverages the structure of data to do its processing in the most efficient manner. Thus, search engines recognize the white-space structure in English and can build inverted indexes on words to do fast searches. Relational engines recognize row headers and tuple boundaries to build indexes that can be used to retrieve selected rows very quickly. And so on.
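As a minimal sketch of the first example – with toy documents standing in for a real corpus – an inverted index maps each word to the set of documents containing it, so a search touches only the matching postings instead of scanning every document:

```python
# Toy corpus; a real engine indexes millions of documents the same way.
docs = {
    1: "search engines index words",
    2: "relational engines index rows",
}

# Build the inverted index: word -> set of documents containing it.
index: dict[str, set[int]] = {}
for doc_id, body in docs.items():
    for word in body.split():  # white space delimits the word structure
        index.setdefault(word, set()).add(doc_id)

print(index["engines"])  # {1, 2} -- only matching postings are touched
print(index["rows"])     # {2}
```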

My third observation is that each application produces data in a structure that is most efficient for its use. Thus, applications produce logs; cameras produce images; business applications produce relational rows; Web content engines produce HTML pages; etc. It is very hard to “transform” data from one structure to another. ETL tools have their hands full just doing transformations from one relational schema to another, and semantic engines have a hard time “transforming” text to relational forms. In all such “across-structure” transforms, information is lost.
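To see why across-structure transforms are hard and lossy, consider a hedged sketch of pulling a relational row back out of the example sentence with a hand-written pattern – the pattern below is hypothetical and deliberately brittle:

```python
import re

# Hand-written pattern for "transforming" the example sentence back into a
# relational row. It works for exactly this phrasing, and it drops the nuance
# that "based in" conveys -- a small instance of the information loss that
# across-structure transforms suffer.
text = "Mayank visited Aster Data, based in San Carlos, on June 1, 2011."
match = re.match(r"(\w+) visited (.+?), based in (.+?), on (.+)\.", text)
if match:
    print(match.groups())
# ('Mayank', 'Aster Data', 'San Carlos', 'June 1, 2011')
# A slight rewording of the sentence breaks the pattern entirely.
```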

Relational databases handle relational structure and relational processing very efficiently, but they are severely limited in their ability to store and process other structures (e.g., text, XML, JPG, PDF, DOC). In these engines, relations are first-class citizens; every other structure is a distant second-class citizen.

Hadoop is exciting in the “Big Data” world because it doesn’t presuppose any structure. Data in any structure can be stored in plain files, and applications can read the files and build their own structures on the fly. It is liberating. However, it is not efficient – precisely because it reduces all data to its base form of files and robs the data of its structure, the very structure that would allow applications to process or store it efficiently! Each application has to redo that work from scratch.
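A small sketch of this “structure on the fly” pattern, with a hypothetical log format: the raw lines carry no schema, so every reader re-derives the structure at read time, and nothing is indexed or typed ahead of time.

```python
# Hypothetical raw log lines, stored schema-free as they would be in plain
# files. Every application that reads them must re-derive the structure.
raw_lines = [
    "2011-06-13 10:02:11 user=mayank action=login",
    "2011-06-13 10:05:42 user=tasso action=query",
]

def parse(line: str) -> dict:
    """Each reader repeats this parsing work from scratch."""
    date, time, *fields = line.split()
    record = {"date": date, "time": time}
    record.update(kv.split("=", 1) for kv in fields)
    return record

for line in raw_lines:
    print(parse(line))
```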

What would it take for a platform to treat multiple structures of data as first-class citizens? How could it natively support each format, yet provide a unified way to express queries or analytic logic at the end-user level, so as to abstract away the complexity and diversity of the data and provide insights more quickly? It’d be liberating as well as efficient!


[1] “‘Big Data’ Is Only the Beginning of Extreme Information Management,” Gartner Research, April 7, 2011.



2011: The Year of the Analytics Platform – Part II
By Tasso Argyros in Analytic platform, Analytics on February 2, 2011

In my previous post, I spoke about how strongly I feel that this is the year the analytic platform will become its own distinct and unique category. As the market as a whole realizes the value of integrated data and process management, in-database applications, and in-database analytics, the “analytic platform”, or “analytic computing system”, or “data analytics server” (pick your name) will gain even more momentum, reaching critical mass this year.

In this process, you will see significant movement from vendors – first in their marketing collateral (as is always the case for followers in a technology space), and then in a scramble to cover their product gaps in the five categories that define a true analytic platform, which I laid out in Part I of “2011: The Year of the Analytics Platform.”

What took Aster Data 6+ years to build cannot be done overnight or over a few releases (side note: if you are interested in software product development and haven’t read The Mythical Man-Month, now is a good time – it’s an all-time classic and explains this point very clearly), especially if the fundamental architecture is not there from day one.

But the momentum for the analytic platform category is there and, at this point, is irreversible. Part of this powerful trend derives from the central place that analytics is taking in the enterprise and government. Analytics today is not a luxury, but a necessity for competitiveness. Every industry is thinking about how to employ analytics to better understand its customers, cut costs, and increase revenues. For example, companies in the financial services sector, a fiercely competitive space, want to use the wealth of data they have to become more relevant to their customers and increase customer satisfaction and retention rates. Governments’ use of data and analytics is one of the few remaining lines of defense against terrorism and cyber threats. In retail, the advent of the Internet, social networks, and globalization has increased competition and reduced margins; using analytics to understand the cross-channel behavior and preferences of consumers improves the returns of marketing campaigns, optimizes product pricing and placement, and can make the difference between red and black ink on the bottom line.



By Tasso Argyros in Analytic platform, Analytics, Database, MapReduce on January 26, 2011

When we kicked off Aster Data back in 2005, we envisioned building a product that would advance the state of the art in data management in two areas: (1) size and diversity of data, and (2) depth of insight/analytics. My co-founders and I quickly realized that building just another database wouldn’t cut it. With yet another database, even if we enabled companies to more cost-effectively manage large data sizes, it was not going to be enough given the explosion in diverse data types and the massive need to process all of it. So we set out to build a new platform that would solve these challenges – what’s now commonly known as the ‘Big Data’ challenge.

Fast-forward to 2008, when Aster Data led the way in putting massively parallel processing via MapReduce inside an MPP database to advance how you process massive amounts of diverse data. While this was fully aligned with our vision of managing hoards of diverse data and allowing deep data processing in a single platform, most thought it was intriguing but couldn’t quite see where the future was going. At one point, we thought of naming our product XAP – “extreme analytic platform” or “extreme analytic processing” – as that’s what it was designed to do from day one. However, we thought better of it, since we would have had to educate people too much on what an “analytic platform” was and how it was different from a traditional DBMS for data warehousing. Since we were serving data architects in organizations as well as the front-line business that demands better, faster analytics, we needed to use terminology that resonated with both.

Then, in the fall of 2009, with our flagship product Aster Data nCluster 4.0, we made further strides in running advanced analytics inside the database by including all the built-in application services (e.g., dynamic WLM, backup, and monitoring) to go with it. At that time, we referred to it as a Data-Application Server – which our customers quickly started calling a Data-Analytics Server. I remember when analyst Jim Kobielus at Forrester said,

“It’s really innovative and I don’t use those terms lightly. Moving application logic into the data warehousing environment is ‘a logical next step’.”

And others saying,

“The platform takes a different approach from traditional data warehouses, DBMS and data analytics solutions by housing data and applications together in one system, fully parallelizing both. This eradicates the need for movements of massive amounts of data and the problems with latency and restricted access that creates.”

What they started to fully appreciate is that big data is not just about storing hoards of data, but rather about cracking the code on how to process all of it in deep ways, at blazing-fast speeds.