Archive for the ‘Blogroll’ Category

11
Jun
By Mayank Bawa in Analytics, Blogroll, Business analytics, Interactive marketing on June 11, 2008
   

I had the opportunity to work closely with Anand Rajaraman while at Stanford University and now at our company. Anand teaches the Data Mining class at Stanford as well, and recently he did a very instructive post on the observation that efficient algorithms on more data usually beat complex algorithms on small data. He followed it up with an elaboration post. Google also seems to believe in a similar philosophy.

I want to build upon that observation here. If you haven’t read the posts, do read them first. It is well-worth the time!

I propose that there are two forces at work that help simple algorithms on big data beat complex algorithms on small data:

  1. The freedom of big data allows us to bring in related datasets that provide contextual richness.
  2. Simple algorithms allow us to identify small nuances by leveraging contextual richness in the data.

Let me expand my proposal using Internet Advertising Networks as an example.

Advertising networks essentially make a guess about a user’s intent and present an advertisement (creative) to the consumer. If the user is indeed interested, the user clicks through the creative to learn more.

Advertising networks are typically used today on a CPC (Cost-Per-Click) model. There are stronger variants, CPL (Cost-Per-Lead) and CPA (Cost-Per-Acquisition), but the discussion here applies to them just as it does to the simpler CPC model. There is also a simpler variant, CPM (Cost-Per-Mille, i.e., per thousand impressions), but an advertiser effectively ends up computing a CPC anyway by tracking click-through rates for the money spent under CPM. The CPC model dictates that advertising networks do not make money unless the user clicks on a creative.
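The CPM-to-CPC conversion is simple arithmetic. Here is a minimal sketch of it; the prices and the click-through rate are made-up illustrative values, not figures from any real campaign:

```python
# Illustrative only: the back-of-the-envelope conversion an advertiser might
# do to compare CPM-priced inventory with CPC-priced inventory.

def effective_cpc(cpm: float, click_through_rate: float) -> float:
    """Effective cost per click when buying on a CPM basis.

    cpm: price paid per 1,000 impressions (in dollars)
    click_through_rate: fraction of impressions that result in a click
    """
    cost_per_impression = cpm / 1000.0
    return cost_per_impression / click_through_rate

# E.g., paying a $2.00 CPM with a 0.5% click-through rate
# works out to an effective CPC of $0.40.
print(f"${effective_cpc(2.00, 0.005):.2f}")
```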

Today, the best advertising networks have a click-through rate of less than 1%. In other words, advertising networks correctly interpret a user's intentions 1% of the time; 99% of the time they are ineffective!

I find this statistic immensely liberating. Here is a statistic that shows that even if we are correct only 1% of the time, the rewards are significant. ☺

Why is the click-through rate so low? I think it is because human behavior is difficult to predict. Even sophisticated algorithms (that are computationally practical only on small datasets) do a bad job of predicting human behavior. It is much more powerful to think of efficient algorithms that execute across larger, diverse datasets and exploit the richness inherent in the context, enabling a higher click-through rate.

I've observed people in the field sample behavioral data to reduce their operating dataset. I submit that a 1% sample will lose the nuances and the context that can cause an uplift and growth in revenue.

For example, a content media site may have 2% of its users who come in to read Sports stay on to read Finance articles. A 1% sample is certain to reduce this 2% population trait to a statistically insignificant portion of the sample. Should we or should we not derive this insight to identify and engage the 2% by serving them better content?

Similarly, an Internet retailer may find that 2% of the users who come in to buy a flat-panel TV have also bought video games recently. Should we or should we not act on this insight to identify and engage the 2% by offering them better deals on games? Given that games are a high-margin product, the net effect on revenue via cross-sell could be higher than 2% in dollars.

We often want to develop an algorithm that is provably correct under all circumstances. In a bid to satisfy this urge, we restrict our datasets to find a statistically significant model that is a good predictor. I associate that with a purist way of algorithm development that was drilled into us at school.

Anand's observation is a call for practitioners to think simple, use context and come up with rules that segment and win locally. It will be faster to develop, test and win on simple heuristics than to wait for a perfect "Aha!" that explains all things human.
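To make the sampling point above concrete, here is a small, purely illustrative simulation. The population size, the 2% crossover rate, the 1% sample rate, and the random seed are all assumptions chosen for the example, not figures from the post:

```python
# Toy simulation: what a 1% sample does to a 2% behavioral segment.
import random

random.seed(42)

TOTAL_USERS = 1_000_000
CROSSOVER_RATE = 0.02   # 2% read Sports and stay for Finance
SAMPLE_RATE = 0.01      # the 1% behavioral sample

# Tag each user: True if they exhibit the crossover trait.
users = [random.random() < CROSSOVER_RATE for _ in range(TOTAL_USERS)]

# Draw a 1% sample of the population.
sample = [u for u in users if random.random() < SAMPLE_RATE]

crossover_in_sample = sum(sample)
print(f"Sample size: {len(sample)}")
print(f"Crossover users in sample: {crossover_in_sample}")
# Roughly 200 of the ~10,000 sampled users carry the trait: enough to
# estimate the overall 2% rate, but once you slice by page, geography,
# time of day, etc., each cell holds only a handful of these users and
# the nuance disappears into noise.
```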



27
May
Visibility vs. Control
By George Candea in Administration, Blogroll, Manageability on May 27, 2008
   

When developing a system that is expected to take care of itself (self-managing, autonomic, etc.) the discussion of how much control to give users over the details of the system inevitably comes up. There is, however, a clear line between visibility and control.

Users want control primarily because they don't have visibility into the reasons for a system's behavior. Take for instance a database whose performance has suddenly dropped by 3x… This can be due to someone running a crazy query, or some other process on the same machine updating a filesystem index, or the battery of a RAID controller's cache having run out and forcing all updates to be write-through, etc. In order to figure out what is going on, the DBA would normally start poking around with ps, vmstat, mdadm, etc., and for this (s)he needs control. However, what the DBA really wants is visibility into the cause of the slowdown… the control needed to remedy the situation is minimal: kill a query, reboot, replace a battery, etc.

To provide good visibility, one ought to expose why the system is doing something, not how it is doing it. Any system that self-manages must be able to explain itself when requested to do so. If a DB is slow, it should be able to provide a profile of the in-flight queries. If a cluster system reboots nodes frequently, it should be able to tell whether it’s rebooting due to the same cause or a different one every time. If a node is taken offline, the system should be able to tell it’s because of suspected failure of disk device /dev/sdc1 on that node. And so on… this is visibility.
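To make the distinction tangible, here is a minimal sketch, purely hypothetical and not tied to any real system's API, of the kind of "explain yourself" interface this argues for:

```python
# A hypothetical self-explanation hook: the system reports *why* it is in
# its current state, rather than handing the operator raw knobs for *how*.

from dataclasses import dataclass
from typing import List

@dataclass
class Explanation:
    symptom: str              # what the operator observes
    cause: str                # the system's own diagnosis
    evidence: List[str]       # data backing the diagnosis
    suggested_action: str     # the small amount of control actually needed

def explain_slowdown() -> Explanation:
    """Hypothetical diagnosis for the database-slowdown example above."""
    return Explanation(
        symptom="query latency up 3x since 14:02",
        cause="RAID controller cache battery exhausted; writes are write-through",
        evidence=["controller reports battery_status=FAILED",
                  "write latency correlates with cache-bypass counter"],
        suggested_action="replace battery on node 7",
    )

if __name__ == "__main__":
    e = explain_slowdown()
    print(f"{e.symptom}\n  because: {e.cause}\n  do: {e.suggested_action}")
```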

We do see, however, very many systems and products that substitute control for visibility, such as providing root access on the machines running the system. I believe this is mainly because the engineers themselves do not understand very well in which way the how turns into the why, i.e., they do not understand all the different paths that lead to poor system behavior.

Choosing to expose the why instead of the how influences the control knobs provided to users and administrators. Retrofitting complex systems to provide visibility instead of control is hard, so this really needs to be done from day one. What's more, when customers get used to control, it becomes difficult for them to give it up in exchange for visibility, so the product must maintain the user-accessible controls for backward compatibility. This allows administrators to introduce unpredictable causes of system behavior (e.g., by allowing RAID recovery to be triggered at arbitrary times), which makes self-management that much harder and less accurate. Hence the need to build visibility in from day one and to minimize unnecessary control.



20
May
By Mayank Bawa in Analytics, Blogroll, Business analytics, Business intelligence on May 20, 2008
   

I’ve remarked in an earlier post that the usage of data is changing and new applications are on the horizon. Over the past few years, we’ve observed or invented quite a few interesting design patterns for business processes that use data.

There are no books or tutorials for these new applications, and they are certainly not being taught in the classrooms of today. So I figured I’d share some of these design patterns on our blog.

Let me start with a design pattern that we internally call “The Automated Feedback Loop”. I didn’t invent it but I’ve seen it being applied successfully at search engines during my research days at Stanford University. I certainly think there is a lot of power that remains to be leveraged from this design principle in other verticals and applications.

Consider a search engine. Users ask keyword queries. The search engine ranks documents that match the queries and provides 10 results to the user. The user clicks one of these results, perhaps comes back and clicks another result, and then does not come back.

How do search engines improve themselves? One key way is by recording the number of times users clicked or ignored a result page. They also record the speed with which a user returned from that page to continue his exploration. The quicker the user returned, the less relevant the page was for the user's query. The relevancy of a page then becomes a factor in the ranking function itself for future queries.
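For concreteness, here is a minimal sketch of that feedback signal. It is a toy illustration under assumed weights, thresholds, and field names, not any real engine's ranking function:

```python
# Turn click logs into a crude relevance signal and fold it back into ranking.

def click_feedback_score(impressions: int, clicks: int,
                         avg_dwell_seconds: float) -> float:
    """Behavioral relevance signal for one result page."""
    if impressions == 0:
        return 0.0
    click_rate = clicks / impressions
    # Quick "bounces" back to the results page suggest the page was not
    # relevant; longer dwell times suggest it was.
    dwell_factor = min(avg_dwell_seconds / 60.0, 1.0)
    return click_rate * dwell_factor

def blended_rank_score(text_match_score: float, feedback_score: float,
                       feedback_weight: float = 0.3) -> float:
    """Blend the behavioral signal into the ranking for future queries."""
    return (1 - feedback_weight) * text_match_score + feedback_weight * feedback_score

# Example: a result clicked 40 times out of 1,000 impressions with an
# average dwell of 90 seconds gets its base text-match score nudged.
print(blended_rank_score(text_match_score=0.62,
                         feedback_score=click_feedback_score(1000, 40, 90.0)))
```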

So here is an interesting feedback loop. We offered options (search results) to the user, and the user provided us feedback (came back or not) on how good one option was compared to the others. We then used this knowledge to adapt and improve future options. The more the user engages, the more everyone wins!

This same pattern could hold true in a lot of consumer-facing applications that provide consumers with options.

Advertising networks, direct marketing companies, and social networking sites are taking consumer feedback into account. However, this feedback loop in most companies today is manual and not automated. Usually the optimization (adapting to user response) is done by domain experts who read historical reports from their warehouses, build an intuition of user needs and then apply their intuition to build a model that runs everything from marketing campaigns to supply chain processes.

Such a manual feedback loop has two significant drawbacks:

1. The process is expensive: it takes a lot of time, trial and error for humans to become experts, and as a result the experts are hard to find and worth their weight in gold.

2. The process is ineffective: humans can only think about a handful of parameters, and they optimize for the most popular products or processes (e.g., the "Top 5 products" or "Top 10 destinations"). Everything outside this comfort zone is left under-optimized.

Such a narrow focus on optimization is severely limiting. Incorporating only the Top 10 trends into future behavior is akin to a search engine saying that it will optimize for only the top 10 searches of the quarter. I am sure Google would be a far less valuable company then, and the world a less engaging place.

I strongly believe that there are rich dividends to be reaped if we can automate the feedback process in more consumer-facing areas. What about hotel selection, airline travel, and e-mail marketing campaigns? E-tailers, news and content providers, insurers, banks and media sites are all offering the consumer choices for his time and money. Why not build an automated feedback loop into every consumer-facing process to improve the consumer experience? The world will be a better place for both the consumer and the provider!



20
May
By Mayank Bawa in Blogroll, Statements on May 20, 2008
   

I am glad to share the news that one of our first customers, MySpace, has scaled their Aster nCluster enterprise data warehouse to more than 100 Terabytes of actual data.

It is not easy to cross the 100TB barrier, especially when loads happen continuously and queries are relentless, as they are at MySpace.com.

Hala, Richard, Dan, Jim, Allen, and Aber, you have been awesome partners for us! It has been a great experience for Aster to work with you and we can see the reasons behind MySpace’s continued success. Your team is amazingly strong and capable and there is a clear sense of purpose. Tasso and I often remark that we need to replicate that culture in our company as we grow. At the end of the day, it is the culture and the strength of a team that makes a company successful.

And to everyone at Aster, you have been great from Day 1. It is impressive how a fresh perspective and a clean architecture can solve a tough technical challenge!

Thank you. And I wish everyone as much fun in the coming days!



19
May
By Tasso Argyros in Blogroll, Database, Manageability, Scalability on May 19, 2008
   

One of the most interesting, complex and perhaps overused terms in data analytics today is scalability. People constantly talk about "scaling problems" and "scalable solutions." But what really makes a data analytics system "scalable"? Unfortunately, despite its importance, this question is rarely discussed, so I wanted to post my thoughts here.

Any good definition of scalability needs to be multi-dimensional. In other words, there is no single system property that is enough to make a data analytics system scalable. But what are the dimensions that separate scalable from non-scalable systems? In my opinion, the three most important are (a) data volume; (b) analytical power; and (c) manageability. Let me provide a couple of thoughts on each.

(a) Data Volume. This is definitely an important scale dimension because enterprises today generate huge amounts of data. For a shared-nothing MPP system, this means provisioning a sufficient number of nodes to accommodate the available data. Evolution in disk and server technology has made it possible to store tens of terabytes of data per node, so this scale dimension alone can be achieved even with a relatively small number of nodes.

(b) Analytical Power. This scale dimension is just as important as data volume, because storing large amounts of data alone has little benefit; one needs to be able to extract deep insights out of it to provide real business value. For non-trivial queries in a shared-nothing environment, this presents two requirements. First, the system needs to be able to accommodate a large number of nodes so it has adequate processing power to execute complex analytics. Second, the system needs to scale its performance linearly as more nodes are added. The latter is particularly hard for queries that involve processing of distributed state, such as distributed joins: really intelligent algorithms have to be in place or else interconnect bottlenecks simply kill performance and the system is not truly scalable.
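As a toy illustration of the distributed-join point (an assumption-laden sketch, not how Aster or any particular MPP system implements joins), hash-partitioning both sides of a join on the join key lets each node work purely on local data:

```python
# Hash-partition two relations on the join key so equal keys co-locate,
# making the join local to each node and avoiding interconnect shuffles
# at join time.

from collections import defaultdict

NUM_NODES = 4

def node_for(key) -> int:
    """Route a row to a node by hashing its join key."""
    return hash(key) % NUM_NODES

def partition(rows, key_index):
    """Redistribute rows so that equal join keys land on the same node."""
    parts = defaultdict(list)
    for row in rows:
        parts[node_for(row[key_index])].append(row)
    return parts

# Two toy relations joined on user_id (first column).
orders = [(1, "tv"), (2, "game"), (3, "router"), (1, "console")]
users = [(1, "alice"), (2, "bob"), (3, "carol")]

orders_by_node = partition(orders, 0)
users_by_node = partition(users, 0)

# Each node joins only its local partitions; no row needs to be compared
# against rows on other nodes, so adding nodes adds join capacity.
for n in range(NUM_NODES):
    local = {u[0]: u[1] for u in users_by_node.get(n, [])}
    for o in orders_by_node.get(n, []):
        print(n, local[o[0]], o[1])
```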

(c) Manageability. Scalability across the manageability dimension means that a system can scale up and keep operating at a large scale without armies of administrators or downtime. For an MPP architecture this translates to seamless incremental scalability, scalable replication and failover, and little if any requirement for human intervention during management operations. Despite popular belief, we believe manageability can be measured and we need to take such metrics into account when characterizing a system as scalable or non-scalable.

At Aster, we focus on building systems that scale across all dimensions. We believe that even if one dimension is missing our products do not deserve to be called scalable. And since this is such an important issue, I’ll be looking forward to more discussion around it!



17
May
By George Candea in Administration, Blogroll, Database, Manageability on May 17, 2008
   

I want databases that are as easy to manage as Web servers.

IT operations account for 50%-80% of today's IT budgets and amount to tens of billions of dollars yearly(1). Poor manageability impacts the bottom line and reduces reliability, availability, and security.

Stateless applications, like Web servers, require little configuration, can be scaled through mere replication, and are reboot-friendly. I want to do that with databases too. But the way they're built today, the number of knobs is overwhelming: the most popular DB has 220 initialization parameters and 1,477 tables of system parameters, while its "Administrator's Guide" is 875 pages long(2).

What worries me is an impending manageability crisis, as large data repositories are proliferating at an astonishing pace… in 2003, large Internet services were collecting >1 TB of clickstream data per day(3). Five years later, we're encountering businesses that want SQL databases to store >1 PB of data. PB-scale databases are by necessity distributed, since no DB can scale vertically to 1 PB; now imagine taking notoriously hard-to-manage single-node databases and distributing them…

How does one build a DB as easy to manage as a Web server? All real engineering disciplines use metrics to quantitatively measure progress toward a design goal, to evaluate how different design decisions impact the desired system property.

We ought to have a manageability benchmark, and the place to start is a concrete metric for manageability, one that is simple, intuitive, and applies to a wide range of systems. Such a metric would not just measure; it would also guide developers in making day-to-day choices. It should tell engineers how close their system is to the manageability target. It should enable IT managers to evaluate and compare systems to each other. It should lay down a new criterion for competing in the market.

Here’s a first thought…

I think of system management as a collection of tasks the administrators have to perform to keep a system running in good condition (e.g., deployment, configuration, upgrades, tuning, backup, failure recovery). The complexity of a task is roughly proportional to the number of atomic steps Steps_i required to complete task i; the larger Steps_i, the more inter-step intervals, so the greater the opportunity for the admin to mess up. Installing an operating system, for example, has Steps_install in the tens or hundreds.

Efficiency of management operations can be approximated by the time T_i in seconds it takes the system to complete task i; the larger T_i, the greater the opportunity for unrelated failures to impact atomicity of the management operation. For a trouble-free OS install, T_install is probably around 1-3 hours.

If N_i represents the number of times task i is performed during an evaluation interval T_evaluation (e.g., 1 year) and N_total = N_1 + … + N_n, then task i's relative frequency of occurrence is Frequency_i = N_i / N_total. Typical values for Frequency_i can be derived empirically or extracted from surveys(4),(5),(6). The less frequently one needs to manage a system, the better.

Manageability can now be expressed with a formula, with larger values of manageability being better:

[manageability formula image]

This says that the more frequently a system needs to be "managed," the poorer its manageability. The longer each step takes, the poorer the manageability. The more steps involved in each management action, the poorer the manageability. The longer the evaluation interval, the better the manageability, because observing a system longer increases the confidence in the "measurement."

While complexity and efficiency are system-specific, their relative importance is actually specific to a customer: an improvement in complexity may be preferred over an improvement in efficiency or vice-versa; this differentiated weighting is captured by α. I would expect α>2 in general, because having fewer, atomic steps is valued more from a manageability perspective than reducing task duration, since the former reduces the risk of expensive human mistakes and training costs, while the latter relates almost exclusively to service-level agreements.
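The formula itself survives only as an image above. A plausible reconstruction consistent with the prose (poorer manageability with more frequent, longer, many-step tasks; better manageability over a longer evaluation interval; α weighting complexity against efficiency) would be something like the following. This is my reading of the description, not necessarily the exact expression in the original image:

```latex
% Hypothetical reconstruction of the manageability metric described in the
% text; the exact form in the original image may differ.
\[
  \mathrm{Manageability} \;=\;
    \frac{T_{\mathrm{evaluation}}}
         {\sum_{i=1}^{n} N_i \cdot \mathrm{Steps}_i^{\,\alpha} \cdot T_i}
\]
% Larger values are better: tasks that are performed often (N_i), involve
% many steps (Steps_i, weighted by alpha), or take long (T_i) drag the score
% down, while a longer evaluation interval T_evaluation raises it. Using
% Frequency_i = N_i / N_total, the denominator can equivalently be written
% in terms of relative task frequencies.
```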

So would this metric work? Is there a simpler one that’s usable?



15
May
A taste of something new
By Mayank Bawa in Analytics, Analytics tech, Blogroll, Statements on May 15, 2008
   

Have you ever discovered a wonderful little restaurant off the beaten path? You know the kind of place. It’s not part of some corporate conglomerate. They don’t advertise. The food is fresh and the service is perfect – it feels like your own private oasis. Keeping it to yourself would just be wrong (even if you selfishly don’t want the place to get too crowded).

We’re happy to see a similar anticipation and word-of-mouth about some new ideas Aster is bringing to the data analytics market. Seems that good news is just too hard to keep to yourself.

We’re serving up something unique that we’ve been preparing for several years now. We’re just as excited to be bringing you this fresh approach.



24
Apr
By Mayank Bawa in Blogroll, Statements on April 24, 2008
   

My name is Mayank, and I co-founded Aster Data Systems with George and Tasso in 2005.

Shortly after incorporation, the three of us were eating lunch at a Chinese restaurant and out popped a fortune slip from a cookie reading:

You will always live in interesting times.

Indeed. The Internet is changing the speed at which we communicate, processes are being automated to react and execute in the blink of an eye, and data is playing a key role in guiding execution. Analysis of data is moving front-and-center, breaking out of the passive world of warehousing and reporting, as applications create intelligent processes and companies live and die by their ability to monetize their data.

A new set of applications is being written, or waiting in the wings to be written, that will leverage data to act smarter. Consider the rapid evolution of online advertising networks: in the past 5 years, we have seen a spate of successful companies carving out niches for themselves in the market. Their differentiation? The unique ability to match advertising inventory with consumer segments. Their basis of differentiation? Data!

And yet, a majority of these advertising networks do not use databases for their optimizations! Google and Yahoo! have famously built their own platforms; so did the amazingly talented teams at Right Media, Kosmix and Revenue Science. Of course, these companies do use databases, but only for reporting and billing purposes.

How did we get to this crossroads where data is being analyzed outside the database?

For too long, databases have been clunky, monolithic systems that are rigid and inflexible, locking up the data in architectures that are

1. Hard to query
2. Hard to scale
3. Hard to manage

Meanwhile, the landscape of applications around the database is changing, shifting away from these rigid architectures slowly but surely.

We will use this blog to outline our thoughts on this changing landscape, along with our experiences in building an analytics database and a company that participates in this change.