Archive for the ‘Analytics’ Category

By Mayank Bawa in Analytics, Blogroll, Business analytics, Interactive marketing on June 11, 2008

I had the opportunity to work closely with Anand Rajaraman while at Stanford University, and now at our company. Anand also teaches the Data Mining class at Stanford, and he recently wrote a very instructive post on the observation that efficient algorithms on more data usually beat complex algorithms on small data. He followed it up with an elaboration post. Google also seems to believe in a similar philosophy.

I want to build upon that observation here. If you haven’t read the posts, do read them first. It is well worth the time!

I propose that there are two forces at work that help simple algorithms on big data beat complex algorithms on small data:

  1. The freedom of big data allows us to bring in related datasets that provide contextual richness.
  2. Simple algorithms allow us to identify small nuances by leveraging contextual richness in the data.

Let me expand my proposal using Internet Advertising Networks as an example.

Advertising networks essentially make a guess about a user’s intent and present an advertisement (creative) to the consumer. If the user is indeed interested, the user clicks through the creative to learn more.

Advertising networks today are used on a CPC (Cost-Per-Click) model. There are stronger variants, CPL (Cost-Per-Lead) and CPA (Cost-Per-Acquisition), but they are as applicable to this discussion as the simpler CPC model. There is also a simpler variant, CPM (Cost-Per-Mille, i.e., cost per thousand impressions), but an advertiser ends up effectively computing a CPC anyway by tracking the click-through rate on money spent via the CPM model. The CPC model dictates that advertising networks do not make money unless the user clicks on a creative.
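To make that arithmetic concrete, here is a minimal sketch of how CPM spend reduces to an effective CPC; the prices and rates are purely illustrative, not from any real network:

```python
def effective_cpc(cpm_dollars: float, ctr: float) -> float:
    """Effective cost per click, given a CPM price and a click-through rate.

    cpm_dollars: price paid per 1,000 impressions
    ctr: fraction of impressions that result in a click (0.01 == 1%)
    """
    clicks_per_thousand_impressions = 1000 * ctr
    return cpm_dollars / clicks_per_thousand_impressions

# Illustrative numbers: a $2.00 CPM at a 0.5% CTR works out to $0.40 per click.
print(effective_cpc(2.00, 0.005))  # 0.4
```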

Today, the best advertising networks have a click-through rate of less than 1%. In other words, advertising networks correctly interpret a user’s intentions 1% of the time; the other 99% of the time they are ineffective!

I find this statistic immensely liberating. It shows that even if we are correct just 1% of the time, the rewards are significant. ☺

Why is the click-through rate so low? I think it is because human behavior is difficult to predict. Even sophisticated algorithms (which are computationally practical only on small datasets) do a bad job of predicting human behavior. It is much more powerful to think of efficient algorithms that execute across larger, more diverse datasets, exploiting the richness inherent in the context to achieve a higher click-through rate.

I’ve observed people in the field sample behavioral data to reduce their operating dataset. I submit that a 1% sample will lose the nuances and the context that can produce an uplift and growth in revenue; the simulation sketched below illustrates the effect.

For example, a content media site may find that 2% of the users who come in to read Sports stay on to read Finance articles. A 1% sample is certain to reduce this 2% population trait to a statistically insignificant sliver of the sample. Should we or should we not derive this insight, so that we can identify and engage the 2% by serving them better content?

Similarly, an Internet retailer may find that 2% of the users who come in to buy a flat-panel TV have also bought video games recently. Should we or should we not act on this insight, identifying and engaging the 2% by offering them better deals on games? Given that games are a high-margin product, the net effect on revenue via cross-sell could be higher than 2% in dollars.

We often want to develop an algorithm that is provably correct under all circumstances. In a bid to satisfy this urge, we restrict our datasets until we find a statistically significant model that is a good predictor. I associate that with a purist style of algorithm development that was drilled into us at school.

Anand’s observation is a call for practitioners to think simple, use context, and come up with rules that segment and win locally. It will be faster to develop, test, and win on simple heuristics than to wait for a perfect “Aha!” that explains all things human.
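As a concrete (and entirely hypothetical) sketch of the sampling argument above, here is a small simulation; the population size and the 2% segment rate are made-up numbers for illustration:

```python
import random

random.seed(7)

POPULATION = 1_000_000    # site visitors in the behavioral dataset
SEGMENT_RATE = 0.02       # 2% read Sports and then stay on for Finance
SAMPLE_RATE = 0.01        # keep a 1% sample to shrink the working set

# Mark each visitor: True if they belong to the cross-reading segment.
in_segment = [random.random() < SEGMENT_RATE for _ in range(POPULATION)]

# Take the 1% sample and count how many segment members survive it.
sampled = [s for s in in_segment if random.random() < SAMPLE_RATE]
print(f"segment members overall:  {sum(in_segment):,}")  # ~20,000
print(f"visitors in the sample:   {len(sampled):,}")     # ~10,000
print(f"segment members sampled:  {sum(sampled):,}")     # ~200
```

The segment’s 2% share survives in expectation, but only a couple hundred of its roughly 20,000 members remain in the sample: far too few records to capture the nuances of how that segment actually behaves.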



By Mayank Bawa in Analytics, Blogroll, Business analytics, Business intelligence on May 20, 2008

I’ve remarked in an earlier post that the usage of data is changing and new applications are on the horizon. Over the past few years, we’ve observed or invented quite a few interesting design patterns for business processes that use data.

There are no books or tutorials for these new applications, and they are certainly not being taught in the classrooms of today. So I figured I’d share some of these design patterns on our blog.

Let me start with a design pattern that we internally call “The Automated Feedback Loop”. I didn’t invent it but I’ve seen it being applied successfully at search engines during my research days at Stanford University. I certainly think there is a lot of power that remains to be leveraged from this design principle in other verticals and applications.

Consider a search engine. Users ask keyword queries. The search engine ranks documents that match the queries and provides 10 results to the user. The user clicks one of these results, perhaps comes back and clicks another result, and then does not come back.

How do search engines improve themselves? One key way is by recording the number of times users clicked on or ignored a result page. They also record how quickly a user returned from that page to continue his exploration: the quicker the user returned, the less relevant the page was for the user’s query. The relevancy of a page then becomes a factor in the ranking function itself for future queries.

The Automated Feedback Loop

So here is an interesting feedback loop. We offered options (search results) to the user, and the user provided us feedback (came back or not) on how good one option was compared to the others. We then used this knowledge to adapt and improve future options. The more the user engages, the more everyone wins!
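As a toy sketch of this loop (hypothetical scores and thresholds; not how any real search engine ranks), click and dwell-time feedback can be folded directly back into a per-query, per-document relevance signal:

```python
from collections import defaultdict

# relevance[(query, doc)] starts neutral at 1.0 and is nudged by feedback.
relevance = defaultdict(lambda: 1.0)

def record_feedback(query: str, doc: str, clicked: bool, dwell_seconds: float) -> None:
    """Fold one user interaction back into the ranking signal."""
    key = (query, doc)
    if not clicked:
        relevance[key] *= 0.99   # shown but ignored: demote slightly
    elif dwell_seconds < 10:
        relevance[key] *= 0.95   # clicked, then bounced right back: demote
    else:
        relevance[key] *= 1.05   # clicked and stayed: likely relevant, promote

def rank(query: str, candidates: list[str]) -> list[str]:
    """Order candidates by their learned relevance for this query."""
    return sorted(candidates, key=lambda doc: relevance[(query, doc)], reverse=True)

# One turn of the loop: serve results, observe the user, adapt.
docs = ["doc_a", "doc_b", "doc_c"]
top = rank("laptop reviews", docs)[0]
record_feedback("laptop reviews", top, clicked=True, dwell_seconds=3.0)
print(rank("laptop reviews", docs))  # the bounced-from page drops in rank
```

Each interaction nudges the scores only slightly, so the ranking adapts continuously with no human reading reports in the middle.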

This same pattern could hold true in a lot of consumer-facing applications that provide consumers with options.

Advertising networks, direct marketing companies, and social networking sites do take consumer feedback into account. However, at most companies today this feedback loop is manual, not automated. Usually the optimization (adapting to user response) is done by domain experts who read historical reports from their warehouses, build an intuition of user needs, and then apply that intuition to build a model that runs everything from marketing campaigns to supply chain processes.

Such a manual feedback loop has two significant drawbacks:

1. The process is expensive: it takes a lot of time, trial and error for humans to become experts, and as a result the experts are hard to find and worth their weight in gold.

2. The process is ineffective: humans can only think about a handful of parameters, and they optimize for the most popular products or processes (e.g., “Top 5 products or Top 10 destinations”). Everything outside this area of comfort is left under-optimized.

Such a narrow focus on optimization is severely limiting. Incorporating only the Top 10 trends into future behavior is akin to a search engine saying that it will optimize for only the top 10 searches of the quarter. Google would be a far less valuable company then, and the world a less engaging place.

I strongly believe that there are rich dividends to be reaped if we can automate the feedback process in more consumer-facing areas. What about hotel selection, airline travel, and e-mail marketing campaigns? E-tailers, news and content providers, insurance companies, banks, and media sites are all offering the consumer choices for his time and money. Why not build an automated feedback loop into every consumer-facing process to improve the consumer experience? The world would be a better place for both the consumer and the provider!



A taste of something new
By Mayank Bawa in Analytics, Analytics tech, Blogroll, Statements on May 15, 2008

Have you ever discovered a wonderful little restaurant off the beaten path? You know the kind of place. It’s not part of some corporate conglomerate. They don’t advertise. The food is fresh and the service is perfect – it feels like your own private oasis. Keeping it to yourself would just be wrong (even if you selfishly don’t want the place to get too crowded).

We’re happy to see a similar anticipation and word-of-mouth about some new ideas Aster is bringing to the data analytics market. Seems that good news is just too hard to keep to yourself.

We’re serving up something unique that we’ve been preparing for several years now. We’re just as excited to be bringing you this fresh approach.



How can I analyze all of this data?
By Tasso Argyros in Analytics, Analytics tech on May 8, 2008

Over the last couple of years I’ve talked to scores of companies that face data analytics problems and ask this question. From those discussions it became pretty clear that, for most enterprises, no existing infrastructure can really solve the problem of deriving deep insights from massive amounts of data. But why? And how do companies try to cope with this issue today?

I’ve seen three classes of “solutions” that companies implement in a desperate attempt to overcome their data analytics challenges. Let me describe what I’ve seen.

“Solution” One. Vertical scale-up. If you are like most companies, database performance problems make your favorite hardware vendor’s sales rep lots of money every year! There is nothing new here. Ever since the 1960s, when the first data management systems came around, performance issues have been solved by buying much more expensive hardware. So here’s the obvious problem with this approach: cost. And here’s the non-obvious one: there is a limit to how far you can scale this way, and it is actually pretty low. (Question: what is the maximum number of CPUs you can buy in a high-end server? How does it compare to the average Google cluster?)

“Solution” Two. “Massively” parallel database clusters. Sometimes I’ve heard an argument that goes like this: “Why shouldn’t it be simple to build a farm of databases, just like we have farms of app servers or web servers?” Driven by this seemingly innocent question, you may try (or have tried) to put together clusters of databases to do analytics, either on your own or using one of the MPP products in the marketplace. This will work fine for small datasets *or* very simple queries (e.g., computing a sum of values). But, as any student of distributed systems knows, there is a reason why web servers scale so nicely: they are stateless! That’s why they are so easy to deploy and scale. Databases, on the other hand, do have state. In fact, they have lots of it, perhaps several gigabytes per box. And, guess what, in analytics each query potentially needs access to all of it at once! So what works fine for very small numbers of nodes or small amounts of data does nothing for slightly more complex queries and larger systems, which is probably the issue you were trying to solve in the first place.
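A minimal sketch of that contrast, with in-memory lists standing in for per-node state (the partitions and the queries are hypothetical):

```python
# Each "node" holds one partition of the table; that partition is its state.
node_partitions = [
    [("user1", 10), ("user2", 5)],   # node 0's rows
    [("user3", 7), ("user1", 3)],    # node 1's rows
]

# Simple query (a global sum): every node reduces its own partition locally,
# and only one tiny partial result per node has to cross the network.
partial_sums = [sum(value for _, value in part) for part in node_partitions]
print(sum(partial_sums))  # 25

# Complex query (say, grouping or joining on user): rows with the same key
# must meet on the same node, so potentially *all* of the state has to be
# reshuffled across the network, the step that breaks naive database farms.
by_user = {}
for part in node_partitions:
    for user, value in part:
        by_user.setdefault(user, []).append(value)  # a network shuffle in real life
print(by_user)  # {'user1': [10, 3], 'user2': [5], 'user3': [7]}
```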

By the way, the solutions in the marketplace today solve the wrong problems. For instance, some optimize the disk I/O of individual nodes rather than overall system performance for complex queries, which is the real issue (e.g., “columnar” systems). Others allow fast execution of really simple queries but do nothing to make more complex ones go quickly (e.g., “MPP” databases). None of these products provides a solution that is even relevant to the hardest problems these systems face.

“Solution” Three. Write custom code. Why not? Google and Yahoo have done it pretty successfully! The only problem is that this approach is even more expensive than approach #1! Google has built a great infrastructure, but what does it cost to retain and compensate the best minds in the world who can develop and maintain your analytics? (Hint: it’s more than free snacks and soda.) I’ve frequently seen what starts as a simple, cheap solution to a single point problem evolve into a productivity nightmare, where each new data insight requires development time and specialized (thus expensive) skills. If you can afford that, fine. But I’ll bet you do not want to spend your most precious resources reinventing the wheel every time you need to run a new query, instead of doing what makes your company most successful.

The end result is that all of these approaches fall far short of solving the real problem. The cost of becoming more competitive through data is currently huge, and it shouldn’t be! I believe that as soon as the right tools are built and made available, companies will immediately take advantage of them to become more competitive and successful. This is the upcoming data revolution that I see, and, frankly, it is long overdue.