Archive for November, 2008

Media Fragmentation: How Do We Tame the “Tail”?
By John Guevara in Analytics, Blogroll on November 19, 2008

I recently attended a panel discussion in New York on media fragmentation consisting of media agency execs including:

- Bant Breen (Interpublic - Initiative - President, Worldwide Digital Communications),
- John Donahue (Omnicom Media Group - Director of BI Analytics and Integration),
- Ed Montes (Havas Digital - Executive Vice President),
- Tim Hanlon (Publicis - Executive Vice President/Ventures for Denuo)

The discussion was kicked off by Brian Pitz, Principal of Equity Research for Bank of America.  Brian set the stage for a spirited discussion regarding the continuing fragmentation of online media, along with research on the issues this poses.  The panel touched upon many issues, including fear of ad placement around unknown user-generated content, agencies’ lack of the skill set to address this medium, and a lack of standards.  However, what surprised me most was the unanimous opinion that there is more value further out on “The Tail” of the online publisher spectrum due to the targeted nature of the content.  Yet the online media buying statistics conflict with this opinion (over 77% of online ad spending is still flowing to the top 10 sites).

When asked “why the contrast?” between their sentiment and the stats, the discussion revealed the level of uncertainty due to a lack of transparency into “The Tail”.  Despite the 300+ ad networks that have emerged to address this very challenge, the value chain lacks the data to confidently invest the dollars.  In addition, there was a rather cathartic moment when John Donahue professed that agencies should “Take Back Your Data From Those that Hold It Hostage”.

It is our belief that the opinions expressed by the panel serve as evidence of a shift towards a new era in media where evidential data will drive valuation across media rather than sampling-based ratings acting as the currency.  No one will be immune from this:

- Agencies need it to confidently invest their clients’ dollars and show demonstrable ROI of their services
- Ad networks need it to earn their constituencies’ share of marketing budgets
- Ad networks need it to defend the targeted value and the appropriateness of their collective content
- 3rd Party measurement firms (comScore, Nielsen Online, ValueClick) need it to maintain the value of their objectivity
- Advertisers need it to support the logic behind budget allocation decisions
- BIG MEDIA needs it to defend their 77% stake

You might be thinking, “The need for data is no great epiphany”.  However, I submit that the amount of data and the mere fact that all participants should have their own copy is a shift in thinking.  Gone are the days where:

- The value chain is driven solely by 3rd Parties and their audience samples
- Ad Servers/Ad Networks are the only keepers of the data
- Service Providers can offer data for a fee

By Tasso Argyros in Blogroll, Frontline data warehouse on November 18, 2008

I recently wrote an article with Enterprise Systems Journal on how retailers can benefit from innovations in “always on” databases. This story is based on the real-life experiences of one of our customers during the last holiday shopping season. They saw a spike in traffic and quickly scaled their frontline data warehouse built with Aster Data Systems. This allowed them to maintain the service level agreements with the business for product cross-promotion without needing to wait to plan an upgrade in the off-season.

Thinking about more than query performance and initial system cost is something we believe firmly in. People often overlook the cost of maintaining and scaling systems in a time of need.

After next week, let us know - how did your favorite e-tailers fare on ‘Black Friday’ and ‘Cyber Monday’? Did their recommendation engines deliver, or did they leave cash in your wallet that got spent elsewhere?

TDWI MapReduce Nightschool Recap
By Tasso Argyros in Blogroll on November 14, 2008

Last week I conducted a course at the TDWI World Conference in New Orleans, LA called “Introduction to Map/Reduce Data Transformations”.  If you weren’t able to make the session, my slides are embedded here.

I’m pleased to have been given the opportunity to introduce this new approach to in-database analytics and parallel data processing, in general. The most consistent feedback I had was that there wasn’t enough time to cover this topic in-depth, and attendees were eager to learn more! In that case, my previous post on educational resources for MapReduce may be of interest.

Since Aster Data Systems introduced In-Database MapReduce for the Aster nCluster relational database, there has been tremendous interest in the data warehousing and technology community, with recent coverage in the NY Times and by influential blogs like DBMS2, Beyond Search, and Cloud N, just to name a few.

Hopefully TDWI will turn this into a 1/2-day course in the future. (If you agree, feel free to contact them at

If anyone knows of other good resources on this emerging topic, please feel free to put links in the comments here.

Introduction to MapReduce Data Transformations

By Mayank Bawa in Analytics, Blogroll, Frontline data warehouse on November 6, 2008

I was at Defrag 2008 yesterday and it was a wonderful, refreshing experience. A diverse group of Web 2.0 veterans and newcomers came together to accelerate the “Aha!” moment in today’s online world. The conference was very well organized and there were interesting conversations on and off the stage.

The key observation was that individuals, groups and organizations are struggling to discover, assemble, organize, act on, and gather feedback from data. Data itself is growing and fragmenting at an exponential pace. We as individuals feel overwhelmed by the slew of data (messages, emails, news, posts) in the microcosm, and we as organizations feel overwhelmed in the macrocosm.

The very real danger is that an individual or organization’s feeling of being constantly overwhelmed could result in the reduction of their “Aha!” moments - our resources will be so focused on merely keeping pace with new information that we won’t have the time or energy to connect the dots.

The goal then is to find tools and best practices to enable the “Aha!” moments - to connect the dots even as information piles up at our fingertips.

My thought going into the conference was that we need to understand what causes these “Aha!” moments. If we understand the cause, we can accelerate the “Aha!” even at scale.

Earlier this year, Janet Rae-Dupree published an insightful piece in the International Herald Tribune on Reassessing the Aha! Moment. Her thesis is that creativity and innovation - “Aha! Moments” - do not come in flashes of pure brilliance. Rather, innovation is a slow process of accretion, building small insight upon interesting fact upon tried-and-true process.

Building on this thesis, I focused my talk on using frontline data warehousing as an infrastructure piece that allows organizations to collect, store, analyze and act on market events. The incremental fresh data loads in a frontline data warehouse add up over time to build a stable historical context. At the same time, applications can contrast fresh data with historical data to build the small contrasts gradually until the contrasts become meaningful to act upon.
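As a rough illustration of that idea, the sketch below folds each incremental load into a running historical baseline and reports only the contrasts large enough to act on. It is a toy in plain Python, not a description of any Aster product; the class name `HistoricalContext`, the metric name `widgets`, the use of a simple running mean as the "historical context", and the 50% threshold are all assumptions made for the example.

```python
from typing import Dict, Optional


class HistoricalContext:
    """Accumulates incremental loads and contrasts fresh values with history."""

    def __init__(self) -> None:
        self._count: Dict[str, int] = {}
        self._total: Dict[str, float] = {}

    def baseline(self, key: str) -> Optional[float]:
        """Mean of all values loaded so far for `key`, or None if no history."""
        n = self._count.get(key, 0)
        return self._total[key] / n if n else None

    def load(self, batch: Dict[str, float], threshold: float = 0.5) -> Dict[str, float]:
        """Contrast a fresh batch with history, then fold it into the history.

        Returns the relative change for each key whose fresh value differs
        from its historical mean by at least `threshold` (0.5 means 50%).
        """
        contrasts: Dict[str, float] = {}
        for key, value in batch.items():
            base = self.baseline(key)
            if base:  # skip keys with no history yet (or a zero baseline)
                change = (value - base) / base
                if abs(change) >= threshold:
                    contrasts[key] = change
            # The fresh value becomes part of the historical context.
            self._count[key] = self._count.get(key, 0) + 1
            self._total[key] = self._total.get(key, 0.0) + value
        return contrasts


ctx = HistoricalContext()
ctx.load({"widgets": 100.0})           # first load: no history to contrast with
ctx.load({"widgets": 110.0})           # small contrast, below the threshold
alerts = ctx.load({"widgets": 210.0})  # large contrast against the mean of 105
print(alerts)
```

The first two loads build context without raising anything; only the third differs enough from the accumulated baseline to be reported.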

I’d love to hear back from you on how massive data can accelerate, rather than impede, the “Aha!” moment.

Aster Defrag 2008 97

By Tasso Argyros in Blogroll on November 6, 2008

Forget about total cost of ownership (TCO). In the Internet age, scaling up can be the biggest cost factor in your infrastructure. And overlooking the scaling challenge can be disastrous.

Take Bob, for instance. Bob is the fictional database manager at FastGrowth, an Internet company with a fast-growing user base.  Bob is 36 and has done more than a dozen data warehouse deployments in his career. He’s confident in his position. Granted, this is his first Internet gig. But this shouldn’t matter a lot, right? Bob’s been there, done that, got the tee-shirt.

Bob’s newest gig is to implement a new data warehouse system to accommodate FastGrowth’s explosive data growth. He estimates that there will be 10TB of data in the next 6 months, 20TB in the next 12, and 40TB 18 months from now.

Bob needs to be very careful about cost (TCO); getting way overboard on his budget could cost him his reputation or even (gasp) his job. He thus asks vendors and friends how much hardware and software he needs to buy at each capacity level. He also makes conservative estimates about the number of people required to manage the system and its data at 10, 20, and 40 terabytes.

Fast-forward 18 months. Bob’s DW is in complete chaos; it can hardly manage half of the 40TB target, and it has required twice the number of people and dollars so far. Luckily for Bob, his boss (Suzy) has been doing Internet infrastructure projects for her whole career and knows exactly what mistake Bob made (and why he deserves a second chance).

What went wrong? Bob did everything almost perfectly. His TCO estimates at each scale level were, in fact, correct. But what he did not account for was the effort of going from one scale level to the other in such a short time! Doubling the size of data every 6 months is 3x faster than Moore’s law. That’s like trying to build a new car that is 3x faster than a Ferrari. As a result, growing from 10TB to 20TB in six months may cost many times more (in terms of people, time and dollars) than running a 20TB system for 6 months.
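To make the arithmetic concrete, here is a small sketch of the two growth rates. It assumes Moore’s law is read as a doubling every 18 months (the reading under which the 3x figure holds); the function and constant names are made up for illustration.

```python
def size_after(months: float, start_tb: float, doubling_months: float) -> float:
    """Size in TB after `months`, given a starting size and a doubling period."""
    return start_tb * 2 ** (months / doubling_months)


DATA_DOUBLING_MONTHS = 6     # Bob's forecast: data doubles every 6 months
MOORE_DOUBLING_MONTHS = 18   # assumption: Moore's law as an 18-month doubling

# How many times shorter the data's doubling period is.
speedup = MOORE_DOUBLING_MONTHS / DATA_DOUBLING_MONTHS

# Bob's forecast, starting from 10TB at month 6.
forecast = {m: size_after(m - 6, 10, DATA_DOUBLING_MONTHS) for m in (6, 12, 18)}

# What the same 10TB would grow to if it only kept pace with Moore's law.
moore_pace = {m: size_after(m - 6, 10, MOORE_DOUBLING_MONTHS) for m in (6, 12, 18)}

print(f"Data doubles {speedup:.0f}x faster than the assumed Moore's law pace")
for month in (6, 12, 18):
    print(f"month {month}: data {forecast[month]:.1f}TB, "
          f"Moore's-law pace {moore_pace[month]:.1f}TB")
```

Under these assumptions the data reaches 40TB by month 18, while growth at the Moore’s law pace would be at roughly 16TB, which is the gap Bob’s team had to close by scaling out.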

In some ways, this is not news. The Internet space is full of stories where scaling was either too expensive or too disruptive to be carried out properly. Twitter, with its massive success, has had to put huge effort into scaling up its systems. And Friendster lost the opportunity to be a top social network partly because it was taking too long to scale up its infrastructure. Moreover, as new data sources become available, companies outside the Internet space are facing similar challenges - scaling needs that are too hard to manage!

So how can we reason about this new dimension of infrastructure cost? What happens when data is growing constantly, and scaling up ends up being the most expensive part of most projects?

The answer, we believe, is that the well-known concept of TCO is not good enough to capture scaling costs in this new era of fast data growth. Instead, we need to also start thinking about the Total Cost of Scaling - or TCS.

Why is TCS useful? TCS captures all costs - in terms of hardware, software and people - that are required to increase the capacity of the infrastructure. Depending on the application, capacity can mean anything from the amount of data (e.g. for data warehousing projects) to queries per second (for OLTP systems).  TCO together with TCS gives a true estimate of project costs for environments that have been blessed with a growing business.

Let’s see how TCS works in an example. Say that you need 100 servers to run your Web business at a particular point in time, and you have calculated the TCO for that. You can also calculate the TCO of having 250 servers running 12 months down the road, when your business has grown. But going from 100 servers to 250 - that’s where TCS comes in. The careful planner (e.g. Bob in his next project) will need to add all three numbers together - TCO at 100 servers, TCO at 250 servers and TCS for scaling from 100 to 250 - to get an accurate picture of the full cost.
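The sum described above can be written down in a few lines. The sketch below is only bookkeeping: the class name `ScalingPlan` and every dollar figure are hypothetical placeholders, not numbers from a real deployment.

```python
from dataclasses import dataclass


@dataclass
class ScalingPlan:
    """Cost picture for growing from one capacity level to another."""
    tco_before: float  # TCO at the starting scale (e.g. 100 servers)
    tco_after: float   # TCO at the target scale (e.g. 250 servers)
    tcs: float         # Total Cost of Scaling between the two levels

    def total_cost(self) -> float:
        """Full project cost: both TCO figures plus the cost of the transition."""
        return self.tco_before + self.tco_after + self.tcs

    def scaling_share(self) -> float:
        """Fraction of the full cost that is due to scaling itself."""
        return self.tcs / self.total_cost()


# Hypothetical figures, in dollars, chosen only to show the calculation.
plan = ScalingPlan(tco_before=1_000_000, tco_after=2_500_000, tcs=1_500_000)

print(f"Full cost: ${plan.total_cost():,.0f}")
print(f"Share due to scaling: {plan.scaling_share():.0%}")
```

A plan that budgets only the two TCO figures would, with these placeholder numbers, miss almost a third of the real cost.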

At Aster, we have been thinking about TCS from day one exactly because we design our systems for environments of fast data growth. We have seen TCS dominating the cost of data projects. As a result, we have built a product that is designed from the ground up to make scalability seamless and reduce the TCS of our deployments to a minimum. For example, one of our customers scaled up their Aster deployment from 45 to 90 servers with a click of a button. In contrast, traditional scaling approaches - manual, tedious and risky - bloat TCS and can jeopardize whole projects.

As fast data growth becomes the rule rather than the exception, we expect more people to start measuring TCS and seek ways to reduce it. As Francis Ford Coppola put it, “anything you build on a large scale or with intense passion invites chaos.” And while passion is hard to manage, there is something we can do about scale.