By Mayank Bawa in Analytics, Blogroll, Business analytics, Interactive marketing on June 11, 2008

I had the opportunity to work closely with Anand Rajaraman while at Stanford University and now at our company. Anand teaches the Data Mining class at Stanford as well, and recently he did a very instructive post on the observation that efficient algorithms on more data usually beat complex algorithms on small data. He followed it up with an elaboration post. Google also seems to believe in a similar philosophy.

I want to build upon that observation here. If you haven’t read the posts, do read them first. It is well-worth the time!

I propose that there are 2 forces in action that help simple algorithms on big data beat complex algorithms on small data:

  1. The freedom of big data allows us to bring in related datasets that provide contextual richness.
  2. Simple algorithms allow us to identify small nuances by leveraging contextual richness in the data.

Let me expand my proposal using Internet Advertising Networks as an example.

Advertising networks essentially make a guess about a user’s intent and present an advertisement (creative) to the consumer. If the user is indeed interested, the user clicks through the creative to learn more.

Advertising networks are used today on a CPC model (Cost-Per-Click). There are stronger variants of CPL (Cost-Per-Lead) or CPA (Cost-Per-Acquisition) but these variants are as applicable to the discussion as the simpler CPC model. There is a simpler variant of CPM (Cost-Per-Impression) but an advertiser ends up effectively computing CPC by keeping track of click-through rates for money spent via the CPM model. The CPC model dictates that Advertising Networks do not make money unless the user clicks on a creative.

Today, the best advertising networks have a click through rate of less than 1%. In other words, advertising networks correctly interpret a user’s intentions 1% of the time, 99% of the time they are ineffective!I find this statistic immensely liberating. Here is a statistic that shows that even if we are correct 1% of the time, the rewards are significant. Why is the click-through rate so low? I think it is because human behavior is difficult to predict. Even sophisticated algorithms (that are computationally practical only on small datasets) do a bad job of predicting human behavior.It is much more powerful to think of efficient algorithms that execute across larger, diverse datasets to exploit the richness inherent in the context to enable a higher click-through rate.I’ve observed people in the field sample behavioral data to reduce their operating dataset. I submit that a sample of 1% will lose the nuances and the context that can cause an uplift and growth in revenue.For example, a Content Media site may have 2% of their users who come in to read Sports stay on to read Finance articles. A sampling of 1% is certain to reduce this 2% population trait to a statistically insignificant portion in the sample. Should we or should we not derive this insight to identify and engage the 2% by serving them better content?Similarly, an Internet Retailer may have 2% of their users who come in to buy flat-panel TV have also bought video games recently. Should we or should we not act on this insight to identify and engage the 2% by offering them better deals on games? Given that games are a high-margin product, the net effect on revenue via cross-sell could be higher than 2% in dollars.We often want to develop an algorithm that is provably correct under all circumstances. In a bid to satisfy this urge, we restrict our datasets to find a statistically significant model that is a good predictor. I associate that with a purist way of algorithm development that was drilled into us at school.Anand’s observation is a call for practitioners to think simple, use context and come up with rules that segment and win locally. It will be faster to develop, test and win on simple heuristics than waiting for a perfect “Aha!” that explains all things human.

Eric on June 12th, 2008 at 1:13 pm #

I’m no statistics expert, but if the overall % of visitors who visit a Content Media site to read Sports and stay on to read Finance is 2%, than wouldn’t a sample of just 1% of the visitors lead to the conclusion that 2% of the visitors do just that? If you take a 1% sample, and find that 50% of the sample only reads sports, your going to assume that 50% of your total readership just reads Sports, not 0.5%.

Mayank on June 13th, 2008 at 7:35 am #

Eric - Yes, sampling is likely to preserve patterns in the data. The stronger pattern will have a stronger presence, the weaker pattern will have a weaker presence.

The problem arises because the weaker pattern has such a weak support in the sample that we are unable to say which of the following two is correct:

[1] the weak pattern actually is a pattern that occurs in the complete dataset, or

[2] the weak pattern is a noise that was introduced because sampling happened to pick something that is very rare.

As a result, all algorithms that work on sampled dataset must apply metrics (typically support) to ensure that the patterns they detect are substantial and not co-incidental.

Know your algorithm, learn the dataset! on June 13th, 2008 at 8:06 am #

[...] For the post that started this discussion please check out the aster data blog [...]

Hints, tips and cheats to better datamining « Lightspeed Venture Partners Blog on June 16th, 2008 at 9:55 pm #

[...] Bawa of Aster Data chimes in to say that running simple analysis over complete datasets is better than running more complex data over sampled… for two reasons: 1. The freedom of big data allows us to bring in related datasets that provide [...]

Ted Dunning on June 23rd, 2008 at 11:47 am #

This is a very simplistic view of both sampling approaches and of click-through rates.

Taking the latter first, click-through rates on internet ads are not low because users are unpredictable. They are low precisely because users are trivially predictable (i.e. if you simply predict no-click, you will almost always be correct). The key need here is to build models that correctly handle the alternative values, costs and benefits. Think multi-armed bandit.

Regarding sampling, what you say is correct if you do simple-minded sampling. Alternately, you can do slightly more interesting sampling and get very good results. Taking your ad case, you can down-sample non-clicks by 50 to one, but keep all clicks. This gives you a data set 30 times smaller which retains all click behavior. Depending on your learning algorithm this may require some (slight) cleverness, but you will generally get very good results.

I agree that it is better to use all of the data and include a viable cost model when doing learning, but sampling and using less scalable techniques is also a very viable option if you have truly enormous data.

The most important thing to remember is that you have to think about your data and what you are trying to do with it.

Mayank on June 24th, 2008 at 1:04 am #

Ted - Very good counter-point to my post. Thank you for voicing it.

My point is that sampling loses nuances and context of weak patterns.

Your counter-point is that we can do weighted sampling to boost the weak pattern.

I agree we can, but the resulting algorithms that work on the weighted sampling have to be smart about recognizing and then discounting the boost. The boost itself will kill other patterns.

For example,

(a) your suggestion of 50x boost for clicks as opposed to impressions can be used to generate average statistics, but will fail for

(b) time series analysis to understand how many impressions to a user before a click, if there is a click.

The latter will require you to sample on user-ids to retain clickers in entirety. The Sampling in (a) loses the nuance and context required for (b).

Each use-case will introduce its own requirements and we will end up with different samples to test different hypothesis.

Mining for patterns then becomes an iterative process of defining the right sample, extracting it from the base data (if all such data exists), writing careful algorithms to work with the skewed sample, and adjusting parameters to empirically remove approximations in the resulting pattern.

It can be done: indeed, we have been using sampling to great effect in various industries for the last 100 years.

My post is not a call to disband sampling but an “Aha!” that came from Anand’s insights on the power of big data and simple algorithms.

Post a comment