Every year or so Google comes out with an interesting piece of infrastructure, always backed by claims that it’s being used by thousands of people on thousands of servers and processes petabytes or exabytes of web data. That alone makes Google papers interesting reading.
This latest piece of research just came out on Google’s Research Buzz page. It’s about a system called Dremel (note: Dremel is a company building hardware tools which I happened to use a lot when I was building model R/C airplanes as a kid). Dremel is an interesting move by Google which provides a system for interactive analysis of data. It was created because it was thought that native MapReduce has too much latency for for fast interactive querying/analysis. It uses data that sits on different storage systems like GFS or BigTable. Data is modeled in a columnar, semi-structured format and the query language is SQL-like with extensions to handle the non-relational data model. I find this interesting - below is my analysis of what Dremel is and the big conclusion.
Main characteristics of the system:
Data & Storage Model
• Data is stored in a semi-structured format. This is not XML, rather it uses Google’s Protocol Buffers. Protocol Buffers (PB) allow developers to define schemas that are nested.
• Every field is stored in its own file, i.e. every element of the Protocol Buffers schema is columnar-ized. Columnar modeling is especially important for Dremel for two specific reasons:
- Protocol Buffer data structures can be huge (> 1000 fields).
- Dremel does not offer any data modeling tools to help break these data structures down. E.g. there’s nothing in the paper that explains how you can take a Protocol Buffers data structure and break it down to 5 different tables.
• Data is stored in a way that makes it possible to recreate the orignial “flat” schema from the columnar representation. This however requires a full pass over the data - the paper doesn’t explain how point or indexed queries would be executed.
• There’s almost no information about how data gets in the right format, how is it stored, deleted, replicated, etc. My best guess is that when someone defines a Dremel table, data is copied from the underlying storage to the local storage of Dremel nodes (“leaf nodes”) and at the same time is replicated across the leaf nodes. Since data in Dremel cannot be updated (it seems to be a write-once-read-many model), design & implementation of the replication subsystem should be significantly simplified.
Interface
• Query interface is SQL-like but with extensions to handle the semi-structured, nested nature of data. Input of queries is semi-structured, and output is semi-structured as well. One needs to get used to this since it’s significantly different from the relational model.
• Tables can be defined from files, e.g. stored in GFS by means of a “DEFINE TABLE” command.
• The data model and query language makes Dremel appropriate for developers; for Dremel to be used by analysts or database folks, a different/simpler data model and a good number of tools (for loading, changing the data model etc) would be needed.
Query Execution
• Queries do NOT use MapReduce, unlike Hadoop query tools like Pig & Hive.
• Dremel provides optimizations for sequential data access, such as async I/O & prefetching.
• Dremel supports approximate results (e.g. return partial results after reading X% of data - this speeds up processing in systems with 100s of servers or more since you don’t have to wait for laggards).
• Dremel can use replicas to speed up execution if a server becomes too slow. This is similar to the “backup copies” idea from the original Google MapReduce paper.
• There seems to be a tree-like model of executing queries, meaning that there are intermediate layers of servers between the leaf nodes and the top node (which receives the user query). This is useful for very large deployments (e.g. thousands of servers) since it provides some intermediate aggregation points that reduce the amount of data that needs to flow to any single node.
Performance & Scale
• Compared to Google’s native MapReduce implementation, Dremel is two orders of magnitude faster in terms of query latency. As mentioned above, part of the reason is that the Protocol Buffers are usually very large and Dremel doesn’t have a way to break those down except for its columnar modeling. Another reason is the high startup cost of Google’s MapReduce implementation.
• Following Google’s tradition, Dremel was shown to scale reasonably well to thousands of servers although this was demonstrated only over a single query that parallelizes nicely and from what I understand doesn’t reshuffle much data. To really understand scalability, it’d be interesting to see benchmarks with a more complex workload collection.
• The paper mentions little to nothing about how data is partitioned across the cluster. Scalability of the system will probably be sensitive to partitioning strategies, so that seems like a significant omission IMO.
So the big question – can MapReduce itself handle fast, interactive querying?
• There’s a difference between the MapReduce paradigm, as an interface for writing parallel applications, and a MapReduce implementation (two examples are Google’s own MapReduce implementation, which is mentioned in the Dremel paper, and open-source Hadoop). MapReduce implementations have unique performance characteristics.
• It is well known that Google’s MapReduce implementation & Hadoop’s MapReduce implementation are optimized for batch processing and not fast, interactive analysis. Besides the Dremel paper, look at this Berkeley paper for some Hadoop numbers and an effort to improve the situation.
• Native MapReduce execution is not fundamentally slow; however Google’s MapReduce and Hadoop happen to be oriented more towards batch processing. Dremel tries to overcome that by building a completely different system that speeds interactive querying. Interestingly, Aster Data’s SQL-MapReduce came about to address this in the first place and offers very fast interactive queries even though it uses MapReduce. So the idea that one needs to get rid of MapReduce to achieve fast interactivity is something I disagree with - we’ve shown this is not the case with SQL-MapReduce.
We just closed out the fourth Big Data Summit in 8 months this week. This time we brought big data and advanced analytics to downtown DC and it proved to have some fantastic sessions. Here’s a quick recap. I’ll link presenter’s slides to this post as they come available:
- Curt Monash was our keynote speaker, kicking off the event and providing some great context. Titled “Implications of New Analytic Technology”, Curt was able to raise a number of issues to consider as technology advances to enable big data analytics, not the least of which is legislative implications which need to be considered. (Check out Curt’s wrap on his talk at his DBMS2 blog.)
- Will Duckworth from comScore detailed the technical requirements around their highly successful MediaMetrix 360 product which has resulted in a flood of 10 billion rows per day of new data entering the Aster Data system. (In addition to the slides below, Will discusses more in this video.)
- Matt Ipri from MicroStrategy discussed how customers benefit from using MicroStrategy with systems like Aster Data because of their “database aware” BI platform. Their integration with Aster Data around SQL-MapReduce is also likely one of the reasons we won their Technology Innovation award at MicroStrategy World earlier this year.
- Michelle Wilkie from SAS described the advanced in-database analytics initiative they have to push more of the data mining process into DBMSs like Aster Data. SAS is using Aster Data’s SQL-MapReduce to accomplish this with Aster Data nCluster, providing statistical integrity of results.
– Tasso Argyros, CTO and CO-Founder of Aster Data, described the requirements for managing and analyzing big data, advanced analytic use-cases, and how Aster Data nCluster uniquely providers customers with a next-generation data analytics platform to do more.
- Jim Kobielus from Forrester Research joined the other speakers during a lunch panel, which proved to be exciting given the amount of innovation coming from distributed computing methods like MapReduce which are finding their way into commercial applications. Of note was a question from the audience around the right type of education background to look for when hiring analytics professionals. The answered ranged from “philosophy” to “engineering” and everything in between! Apparently, you need passion for analytics more than anything else. None of that was lacking in our panel.
Stay tuned for more summits on big data and advanced analytics from Aster Data. Chicago is up next and we’ll be firming up dates shortly. And if you can’t make it to the next event, follow us on Twitter at www.twitter.com/AsterData. There was some great conversation around the event there.
Today Aster took a significant step and made it easier for developers building fraud detection, financial risk management, telco network optimization, customer targeting and personalization, and other advanced, interactive analytic applications.
Along with the release of Aster Data nCluster 4.5, we added a new Solution Partner level for systems integrators and developers.
Why is this relevant?
Recession or no-recession, IT executives are constantly challenged. They are asked to execute strategies based on better analytics and information to improve effectiveness of business processes (customer loyalty, inventory management, revenue optimization, ..), while staying on top of technology-based disruptions and managing (shrinking or flat) IT budgets.
IT organizations have taken on the challenge by building analytics-based offeringsleveraging existing data management skills and increasingly taking advantage of MapReduce, a disruptive technology introduced by Google and now being rapidly adopted by mainstream enterprise IT shops in Finance, Telco, LifeSciences, Govt. and other verticals.
As MapReduce and big data analytics goes mainstream, our customers and ecosystem partners have asked us to make it easier for their teams to leverage MapReduce across enterprise application lifecycles, while harvesting existing IT skills in SQL, Java and other programming languages. The Aster development team that brought us the SQL/MapReduce innovation, has now delivered the market’s first integrated visual development environment for developing, deploying and managing MapReduce and SQL-based analytic applications.
Enterprise MapReduce developers and system integrators can now leverage the integrated Aster platform and deliver compelling business results in record time (read how ComScore delivers 360 degree view of digital world to enterprise customers, Full Tilt Poker gains the upper hand tackling online fraud using Aster).
We are also teaming up with leaders in our ecosystem like MicroStrategy to deliver an end-to-end analytics solution to our customers that includes SQL/MapReduce enabled reporting and rich visualization. Aster is proud to be driving innovation in the Analytics and BI market and was recently honored at MicroStrategy’s annual customer conference.
I am delighted with the rapid adoption of Aster Data’s platform by our partners and the strong continued interest from enterprise developers and system integrators in building big data applications using Aster. New partners are endorsing our vision and technical innovation as the future of advanced analytics for large data volumes.
Sign up today to be an Aster solution partner and join the revolution to deliver compelling information and analytics-driven solutions.
Last week I attended Bank of America’s Technology Innovation Summit in Silicon Valley. In attendance were leading technology executives from Bank of America who outlined needs and challenges for the global banking giant. BofA’s annual IT spend is greater than $5 Billion, serving almost 59 million, or one out of every two U.S. households and distribution strength of about six thousand branches, 18,000 ATMs and 24 million online banking customers, and more than 3,000 customer touches every second. Key themes discussed involved Cloud computing, Information Management, Security, Mobility and Green IT. And as I sat through the panel discussions and spoke to some of the IT leaders, it became evident that underpinning all the major business and IT initiatives for the global bank was a central theme – Lots of data, need for better and faster insights.
A senior BofA IT executive stated “Broad BI and data mining remain objectives, not realized goals”. There was a high level of interest in analytics anda big drive to be information-driven across business units.
Clearly, for a large bank like BofA, the business drivers exist. For example, the consumer channels executive was interested in understanding consumer behavior across different channels. In a saturated marketplace for retail customers and facing stiff competition from Chase (now owns WaMu), Wells Fargo (now owns Wachovia), BofA is keenly interested in strengthening its bond with its existing customer base. With thousands of interactions per second, every interaction with the customer is an opportunity to learn more about customer behavior and customer preferences.
In the credit card division, early detection of fraud patterns can translate into big savings for a market that is undergoing dramatic transformation due to reforms mandated by Congress.
On the IT front, BofA has lots of existing investments in BI tools and data management software.
So where is the gap? Why are BI/data mining unrealized goals?
The answer lies in re-thinking and challenging the status quo in data management and analytic application development in today’s big data IT environments. Google, Amazon, and other innovators are leading this and it is only a matter of time before leaders in the financial services industry follow suit. A new mandate and architecture for big data applications is emerging.
This new class of analytic applications will require a strategic investment in infrastructure that embraces assimilating advanced analytics processing right next to the terabytes to petabytes of enterprise data for key business initiatives including
Customer service effectiveness to predict customer requirements as well as fully understand customer relationships across branch office, ATM, online, and mobile channels
Ability to respond faster to regulators or to management and driving decisions based on insights driven from accurate, timely data
Broader, more pervasive BI and richer Analytics is on the threshold of becoming a reality!
MySpace decided to support one of its most important product launches of 2008 with an expansion of its Aster data warehouse. The data that would be collected would be used to provide information on trends in media and current interests on MySpace. The go-live date was October 2008.
MySpace planned for the data warehouse right from the inception of the project to ensure that reporting was considered a first-class citizen in the overall launch process, rather than a post-launch activity. The result was that the data warehouse was up and running to receive the usage streams, even during a private beta release period, giving the warehousing team the necessary time to prepare for the onslaught of data that would result after the public release.
In fact, there is a very interesting incident that happened on the day of the new MySpace product launch. At about 7am, one of the servers in the Aster nCluster data warehouse failed. The failure was detected by our support team - and no scrambling ensued. Aster nCluster detected and isolated the failure, continuing to run the service with n-1 nodes without a blip and minimal performance change! Later, after the initial tense moments were behind us, the MySpace operations team walked over and replaced the failed hardware. The Aster database administrator then pressed a single button to re-include the node back to the nCluster data warehouse - the database continued to hum away with zero downtime.
The power of “Always-On”!
We will be co-hosting a case study by MySpace on their use of Aster at the Gartner BI Summit next week in National Harbor, MD on March 11. If you’ll be at the event, please come by to hear what Hala has to say about their use of Aster to support their mission-critical operations at MySpace across multiple functions and departments.
Their Aster enterprise data warehouse supports frontline applications (e.g., MySpace TV, MySpace Video, etc.), as well as online marketing, sales, IT, finance, international, and legal. MySpace is also planning to incorporate data from Aster into a balanced scorecard for strategic alignment of the business around key performance indicators, as well as other future projects.
Some highlights from the video for folks who would rather read:
MySpace got up and running with Aster quickly
“We were able to bring that up online and actually start processing the data into it within a matter of weeks, and I think very few technologies give you the ability to do something like that.”
- Hala Al-Adwin, VP of Data Services at MySpace
Aster is mission-critical to MySpace
“With Aster, what we’ve been able to produce with commodity hardware has been a supercomputer-like infrastructure …the data that we collect and process is absolutely critical to the success of MySpace.”
– Bita Mathews, Data Warehouse Manager, MySpace
“Right now our key business performance metrics are all powered out of the Aster system. If somebody went and shut it down, none of that would be available. I think in a lot of ways, we were lacking that data before, and now that we’re used to having it, people are just hungry for more and more information. So if all that went away, I think it’s kinda like going back to an age where there was no light.”
– Hala Al-Adwan
MySpace’s data warehouse with Aster is extremely reliable
“Aster is always on and available. And this is very amazing thing about Aster, because it’s massive. There’s a lot of hardware underneath the system. When hardware fails, we can continue working, and although we know some engineers are fixing hardware, but that doesn’t stop us from continuing to run queries and producing our reports.”
– Anna Dorofiyenko, Data Architect, MySpace
Aster is the blueprint for successful data warehouse deployments going forward
“Integrating Aster and including them from the very beginning in the MySpace Music project … from beginning to end is what allowed that to be the most successful data warehouse implementation we’ve had to date, and I think we should definitely use it as a blueprint for any future implementations we do.”
– Christa Stelzmuller, Chief Data Architect, MySpace
I recently attended a panel discussion in New York on media fragmentation consisting of media agency execs including:
- Bant Breen (Interpublic - Initiative – President, Worldwide Digital Communications),
- John Donahue (Omnicom Media Group - Director of BI Analytics and Integration),
- Ed Montes (Havas Digital - Executive Vice President),
- Tim Hanlon (Publicis - Executive Vice President/Ventures for Denuo)
The discussion was kicked off of by Brian Pitz, Principle of Equity Research for Bank of America. Brian set the stage for a spirited discussion regarding the continuing fragmentation of online media along with research on the issues posed by this. The panel discussion touched upon many issues including fear placement around unknown user-generated content, agency lack of skill set to address this medium and lack of standards. However, what surprised me most was the unanimous consensus in opinion that there is more value further out on “The Tail” of the online publisher spectrum due to the targeted nature of the content. Yet the online media buying statistics conflict with this opinion (over 77% of online ad spending is still flowing to the top 10 sites).
When asked “why the contrast?” between their sentiment and the stats, the discussion revealed the level of uncertainty due to a lack of transparency into “The Tail”. Despite the 300+ ad networks that have emerged to address this very challenge, the value chain lacks the data to confidently invest the dollars. In addition, there was a rather cathartic moment when John Donahue professed that agencies should “Take Back Your Data From Those that Hold It Hostage”.
It is our belief that the opinions expressed by the panel serve as evidence of a shift towards a new era in media where evidential data will drive valuation across media rather than sampling-based ratings acting as the currency. No one will be immune from this:
- Agencies need it to confidentially invest their clients dollars and show demonstrable ROI of their services
- Ad networks need it to earn their constituencies’ share of marketing budgets
- Ad networks need it to defend the targeted value and the appropriateness of their collective content
- 3rd Party measurement firms (comScore, Nielsen Online, ValueClick) need it to maintain the value of their objective value
- Advertisers need it to support the logic budget allocation decisions
- BIG MEDIA needs it to defend their 77% stake
You might be thinking, “The need for data is no great epiphany”. However, I submit that the amount of data and the mere fact that all participants should have their own copy is a shift in thinking. Gone are the days where:
- The value chain is driven solely by 3rd Party’s and their audience samples
- Ad Servers/Ad Networks are the only keepers of the data
- Service Providers can offer data for a fee
In a down economy, marketing and advertising are some of the first budgets to get cut. However, recessions are also a great time to gain market share from your competitors. If you take a pragmatic, data-driven approach to your marketing, you can be sure you’re getting the most ROI from every penny spent. It is not a coincidence that in the last recession, Google and Advertising.com came out stronger since they provided channels that were driven by performance metrics.
That’s why I’m excited that Aster Data Systems will be at Net.Finance East in New York City next week. Given the backdrop of the global credit crisis, we will learn first-hand the implications of the events in the financial landscape. I am sure the marketing executives are thinking of ways to take advantage of a change in the financial landscape, whether it’s multi-variate testing, more granular customer segmentation, or simply lowering the data infrastructure costs associated with your data warehouse or Web analytics.
Look us up if you’re at Net.Finance East - we’d love to learn from you and vice-versa.
Aster announced the general availability of our nCluster 3.0 database, complete with new feature sets. We’re thrilled with the adoption we saw before GA of the product, and it’s always a pleasure to speak directly with someone who is using nCluster to enable their frontline decision-making.
Lenin Gali, director of business intelligence for the online sharing platform ShareThis, is one such friend. He recently sat down with us to discuss how Internet and social networking companies can successfully grow their business by rapidly analyzing and acting on their massive data.
It is really remarkable how many companies today view data analytics as the cornerstone of their businesses.
aCerno is an advertising network that uses powerful analytics to predict which advertisements to deliver to which person at what time. Their analytics are performed on completely anonymous consumer shopping data of 140M users obtained from an association of 450+ product manufacturers and multi-channel retailers. There is a strong appetite at aCerno to perform analytics that they have not done before because each 1% uplift in the click-through rates is a significant revenue stream for them and their customers.
Aggregate Knowledge powers a discovery network (The Pique Discovery™ Network) that delivers recommendations of products and content based on what was previously purchased and viewed by an individual using the collective behavior of the crowds that had behaved similarly in the past. Again, each 1% increase of engagement is a significant revenue stream for them and their customers.
ShareThis provides a sharing network via a widget that makes it simple for people to share things they find online with their friends. In a short period of time since their launch, ShareThis has reached over 150M unique monthly users. The amazing insight is that ShareThis knows which content users actually engage with, and want to tell their friends about! And in its sheer genius, ShareThis gives away its service to publishers and consumers free; relying on delivering targeted advertising for its revenue: by delivering relevant ad messages while knowing the characteristics of that thing being shared. Again, the better their analytics, the better their revenue.
Which brings me to my point: data analytics is a direct contributor of revenue gains in these companies.
Traditionally, we think of data warehousing as a back-office task. The data warehouse can be loaded in separate load windows; loads can run late (the net effect is that business users will get their reports late); loads, backups, and scale-up can take data warehouses offline –which is OK since these tasks can be done on non-business hours (nights/weekends).
But these companies rely on data analytics for their revenue.
· A separate exclusive load window implies that their service is not leveraging analytics during that window;
· A late-running load implies that the service is getting stale data;
· An offline warehouse implies that the service is missing fresh trends
Any such planned or unplanned outage results in lower revenues.
On the flip side, a faster load/query provides the service a competitive edge – a chance to do more with their data than anyone else in the market. A nimbler data model, a faster scale-out, or a more agile ETL process helps them implement their “Aha!” insights faster and gain revenue from a reduced time-to-market advantage.
These companies have moved data warehousing from the back-office to the frontlines of business: a competitive weapon to increase their revenues or to reduce their risks.
In response, the requirements of a data warehouse that supports these frontline applications go up a few notches: the warehouse has to be available for querying and loading 365×24x7; the warehouse has to be fast and nimble; the warehouse has to allow “Aha!” queries to be phrased.
We call these use cases “frontline data warehousing“. And today we released a new version of Aster nCluster that rises up those few notches to meet the demands of the frontline applications.
I am curious if anyone out there is attending the TDWI World Conference in San Diego this week? If so and you would like to meet up with me, please do drop me a line or comment below as I will be in attendance. I’m of course very excited to be making the trip to sunny San Diego and hope to catch a glimpse of Ron Burgundy and the channel 4 news team!
But of course it’s not all fun and games, as I’ll participate in one of TDWI’s famous Tool Talk evening sessions discussing data warehouse appliances. This should make for some great dialogue between me and other database appliance players, especially given the recent attention our industry has seen. I think Aster has a really different approach to analyzing big data and look forward to discussing exactly why.
For those interested in the talk, here are the details..come on by and let’s chat! What:TDWI Tool Talk Session on data warehouse appliances When: Wednesday, August 20, 2008 @ 6:00p.m. Where: Manchester Grand Hyatt, San Diego, CA