By Mayank Bawa in Analytic platform on June 13, 2011

The “big data” world is a product of exploding applications. The number of applications generating data has gone through the roof, and the number of applications being written to consume that data is growing just as fast. Each application wants to produce and consume data in a structure that is most efficient for its own use. As Gartner points out in a recent report on big data[1], “Too much information is a storage issue, certainly, but too much information is also a massive analysis issue.”

In our data-driven economy, business models are being created, destroyed, and reshaped based on the ability to compete on data and analytics. The winners realize the advantage of platforms that allow data to be stored in multiple structures and, more importantly, processed in multiple structures. This allows companies to more easily harness and quickly process ALL of the data about their business to better understand customers, behaviors, and opportunities and threats in the market. We call this “multi-structured” data, a term that has been a topic of discussion lately with IDC Research (where we first saw it referenced) and other industry analysts. It is also the topic of an upcoming webcast we’re doing with IDC on June 15th.

To us, multi-structured data means “a variety of data formats and types.” This could include any data, “structured” or “unstructured,” “relational” or “non-relational.” Curt Monash has blogged about naming such data poly-structured or multi-structured. At the core is the ability for an analytic platform to both store and process a diversity of formats by the most efficient means possible.

Handling Multi-structured Data

We in the industry use the term “structured” data to mean “relational” data. And data that is not “relational” is called “unstructured” or “semi-structured.”

Unfortunately, this definition lumps text, CSV, PDF, DOC, MPEG, JPEG, HTML, and log files together as unstructured data. Clearly, all of these forms of data have an implicit “structure” to them!

My first observation is that relational is one way of manifesting data. Text is another way of expressing data; JPEG, GIF, BMP, and other formats are structured forms of expressing images. For example, (Mayank, Aster Data, San Carlos, 6/1/2011) is a relational row stored in a table (Name, Company Visited, City Visited, Date Visited); the same data can be expressed in text as “Mayank visited Aster Data, based in San Carlos, on June 1, 2011.” A geo-tagged photograph of Mayank entering the Aster Data office in San Carlos on June 1, 2011 would also capture the same information.
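To make the observation concrete, here is a minimal Python sketch (the field names are illustrative, not from any real schema) expressing the same fact as a relational record and as free text:

```python
# The same fact in two different "structures".

# Relational: a tuple whose meaning comes from the table schema.
schema = ("name", "company_visited", "city_visited", "date_visited")
row = ("Mayank", "Aster Data", "San Carlos", "6/1/2011")
record = dict(zip(schema, row))

# Text: the same fact as free-form English; the structure is implicit.
sentence = "Mayank visited Aster Data, based in San Carlos, on June 1, 2011."

# Only the relational form can answer a query like
# "who visited San Carlos?" without parsing the prose.
visitors = [r["name"] for r in [record] if r["city_visited"] == "San Carlos"]
```

The information content is identical; only the explicitness of the structure differs.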

My second observation is that the “structure” of data is what lets applications understand the data and know what to do with it. For example, a SQL-based application can issue the right SQL queries to process its logic; an image viewer can decode JPG/GIF/BMP files to render the data; a text engine can parse subject-verb-object triples to interpret the data; etc.

Each application leverages the structure of data to do its processing in the most efficient manner. Thus, search engines recognize the white-space structure in English and can build inverted indexes on words to do fast searches. Relational engines recognize row headers and tuple boundaries to build indexes that can be used to retrieve selected rows very quickly. And so on.
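As a toy illustration of how structure enables efficiency, a white-space-tokenized inverted index can be sketched in a few lines of Python (the documents and IDs are made up for the example):

```python
# Minimal inverted index: exploit the white-space structure of English
# to map each word to the set of documents that contain it.
docs = {
    1: "Mayank visited Aster Data",
    2: "Aster Data is based in San Carlos",
}

index = {}
for doc_id, text in docs.items():
    for word in text.lower().split():  # white-space tokenization
        index.setdefault(word, set()).add(doc_id)

# Lookup is a dictionary probe, not a scan of every document.
hits = index.get("aster", set())
```

The index is only possible because the engine recognizes word boundaries; hand it a JPEG and the same trick yields nothing useful.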

My third observation is that each application produces data in a structure that is most efficient for its use. Thus, applications produce logs; cameras produce images; business applications produce relational rows; Web content engines produce HTML pages; etc. It is very hard to transform data from one structure to another. ETL tools have their hands full just doing transformations from one relational schema to another, and semantic engines have a hard time transforming text to relational form. In all such cross-structure transforms, information is lost.
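A minimal sketch of why cross-structure transforms are fragile, assuming a naive regex-based “semantic engine” (the pattern and sentences are purely illustrative):

```python
import re

# A fragile "semantic engine": extract a relational row from text.
sentence = "Mayank visited Aster Data, based in San Carlos, on June 1, 2011."
pattern = r"(\w+) visited (.+), based in (.+), on (.+)\."

match = re.match(pattern, sentence)
row = match.groups() if match else None
# row == ("Mayank", "Aster Data", "San Carlos", "June 1, 2011")

# A slightly different phrasing breaks the transform entirely.
rephrased = "Mayank went to Aster Data in San Carlos."
assert re.match(pattern, rephrased) is None
```

Real semantic engines are far more sophisticated, but the underlying problem is the same: the text-to-relational mapping is approximate, so each transform risks dropping or distorting information.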

Relational databases handle relational structure and relational processing very efficiently, but they are severely limited in their ability to store and process other structures (e.g., text, XML, JPG, PDF, DOC). In these engines, relations are first-class citizens; every other structure is a distant second-class citizen.

Hadoop is exciting in the “big data” world because it doesn’t presuppose any structure. Data in any structure can be stored in plain files, and applications can read the files and build their own structures on the fly. It is liberating. However, it is not efficient, precisely because it reduces all data to its base form of files and robs the data of its structure, the very structure that would allow applications to store and process it efficiently! Each application has to redo that work from scratch.
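The schema-on-read style described above can be sketched as follows; the log format and field names are invented for the example:

```python
# Schema-on-read: raw lines in a file, structure imposed at read time.
# (A stand-in for how a job over plain files parses its own input.)
raw_lines = [
    "2011-06-01 10:42:07 GET /index.html 200",
    "2011-06-01 10:42:09 POST /login 302",
]

def parse(line):
    # Every application re-derives this structure itself, on every read.
    date, time, method, path, status = line.split()
    return {"date": date, "time": time, "method": method,
            "path": path, "status": int(status)}

records = [parse(l) for l in raw_lines]
gets = [r for r in records if r["method"] == "GET"]
```

Nothing stops a second application from parsing the same lines a completely different way, which is both the flexibility and the inefficiency of the approach.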

What would it take for a platform to treat multiple structures of data as first-class citizens? How could it natively support each format, yet provide a unified way to express queries or analytic logic at the end-user level so as to abstract away the complexity and diversity of the data and provide insights more quickly? It’d be liberating as well as efficient!

[1] “’Big Data’ Is Only the Beginning of Extreme Information Management”. Gartner Research, April 7, 2011

Seth Grimes on June 13th, 2011 at 12:27 pm #

Mayank, this is a helpful analysis, thanks.

One correction: XML and other object-structured information, stored natively by a system such as MarkLogic, MonetDB, IBM DB2, or Oracle, is neither relational nor un/semi-structured. Similarly, RDF/triple stores and graph-structured database systems (there’s some category overlap) are also neither relational nor un/semi-structured.

In any case, all this takes my thoughts back to the emergence of object-relational DBMSes, some of them (such as Oracle’s) with multi-engine architectures, in the mid/late ’90s.

Seth, http://twitter.com/sethgrimes

Notes and links, June 15, 2011 | DBMS 2 : DataBase Management System Services on June 15th, 2011 at 3:07 am #

[...] Data (now a Teradata company) is positioning itself as analyzing multi-structured data — which is my second-choice term, behind the more precise but odder-sounding [...]

Mayank Bawa on June 21st, 2011 at 11:25 am #

Seth, thanks.

I agree with you that XML, RDF and Graphs are not relational data. They have their own structure, and I like the category name “multi-structure” to include all of these forms of structure, rather than lumping them together under the “semi-structured” name.

I do think the innovation and growth here is very different from the Object Relational models of the past. But that’s a different post!


Teradata Columnar sounds good « Data Visualization on September 25th, 2011 at 8:00 am #

[...] MapReduce™ software with the enterprise friendliness of SQL. (Also see article about “multi-structured data sources” from Aster  [...]

Start Making Sense of All That New Data: Teradata Tells How - BusinessIntelligence.com on December 31st, 2011 at 11:09 pm #

[...] than ever to see, explore, and understand future possibilities across an infinite spectrum of new, multi-structured data sources which often go untapped within organizations,” Bawa [...]

What Big Data Can Learn From the PC Era: The Need for a Multi-Structured Big Data Platform on April 14th, 2012 at 7:02 pm #

[...] wrote earlier that data is structured in multiple forms. In fact, it is the structure of data that allows [...]

Bhanu Prakash on May 19th, 2012 at 8:07 pm #

Hi Mayank,

I was looking for this kind of article. A very well written, nicely explained article. Thank you for sharing this info.
