Archive for June, 2008

By Tasso Argyros in Analytics, Analytics tech, Blogroll, Database, Scalability on June 17, 2008

I’m delighted to bring a guest post to our blog this week. David Cheriton, one of Aster Data Systems’ angel investors, leads the Distributed Systems Group at Stanford University and is known for making some smart investments. Below is what David has to say about the need to address the network interconnect in MPP systems - we hope this spurs some interesting conversation!

“A cluster of commodity computer nodes clearly offers a very cost-effective means of tackling demanding large-scale applications such as data mining over large data sets. However, most applications require substantial communication. For example, consider a query that requires a join between three tables that share no common key to partition on (non-parallelizable query), a frequent case in analytics. In conventional architectures, such operations need to move huge amounts of data among different nodes and depend on the interconnect to deliver adequate performance.
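To make the data-movement point concrete, here is a minimal, hypothetical sketch (in Python, not any particular MPP engine) of the hash-repartitioning step a distributed join performs when the tables are not already partitioned on the join key; the node count and column names are invented for illustration.

    # Hypothetical sketch: why joining on a non-partitioning key forces a shuffle.
    # Node count and column names are made up for the example.
    def hash_partition(rows, key, num_nodes):
        """Assign each row to the node that owns its join-key hash bucket."""
        buckets = {n: [] for n in range(num_nodes)}
        for row in rows:
            buckets[hash(row[key]) % num_nodes].append(row)
        return buckets

    # Rows on this node are partitioned by some other column, so most rows hash
    # to a different node and must be shipped across the interconnect.
    local_rows = [{"user_id": i, "region": i % 7, "spend": i * 1.5} for i in range(10_000)]
    num_nodes = 8
    destinations = hash_partition(local_rows, key="user_id", num_nodes=num_nodes)
    moved = sum(len(rows) for node, rows in destinations.items() if node != 0)
    print(f"{moved / len(local_rows):.0%} of this node's rows cross the network")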

The cost and performance impact of the interconnect the cluster needs to support this communication is often an unpleasant surprise, particularly without careful design of the cluster software. Yes, the cost of 10G Ethernet is coming down, both in switches and NICs, and the IEEE is starting work on 100G Ethernet. However, the interconnect is, and will remain, an issue for several reasons.

First, in a parallelizable query, you need to get data from one node to several others. The bandwidth out of this one node is limited by its NIC bandwidth, Bn. In a uniformly configured cluster, each of the receiving nodes has the same NIC bandwidth Bn, so with K receivers, each receives at most Bn/K. However, the actual performance of the cluster can be limited by data hotspots, where the demand for data from a given node far exceeds its NIC and/or memory bandwidth.
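A quick back-of-the-envelope version of the fan-out arithmetic above; the 10 Gbps NIC speed and receiver count are assumed numbers, not figures from the post.

    # Fan-out arithmetic: one sender, K receivers, all with the same NIC speed.
    # The 10 Gbps NIC and K = 20 are illustrative assumptions.
    Bn_gbps = 10.0          # per-node NIC bandwidth
    K = 20                  # receivers pulling data from the one hot node
    per_receiver = Bn_gbps / K
    print(f"Each of {K} receivers gets at most {per_receiver:.2f} Gbps from the hot node")
    # If many nodes all need data held by this one node (a hotspot), adding
    # receivers only shrinks each receiver's share further.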

The inverse problem, often called the incast problem, arises when K nodes need to send data to a single node. Each can send at bandwidth Bn, for a total offered load of K*Bn, but the target node can only receive at Bn, i.e. 1/K of the offered load. The result is congestion, packet drops from overflowing packet queues, and TCP timeouts and backoff, leaving goodput dramatically lower than even Bn. I say “dramatically” because performance can collapse to 1/10 of what you expect, or worse, due to the packet drops, timeouts and retries that occur at the TCP level. In systems with as few as 10 nodes connected via a Gigabit Ethernet interconnect, performance can deteriorate to under 10 MB per second per node! For larger numbers of nodes, the problem becomes even worse.
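Rough incast arithmetic for the GbE scenario the post describes; the node count and NIC speed come from the text, the rest is back-of-the-envelope.

    # Incast: K senders converge on one receiver whose NIC drains at only Bn.
    Bn_gbps = 1.0                 # Gigabit Ethernet, as in the example above
    K = 10                        # sending nodes
    offered_gbps = K * Bn_gbps    # aggregate offered load at the receiver
    print(f"Offered load {offered_gbps:.0f} Gbps vs. a {Bn_gbps:.0f} Gbps receive link "
          f"({K}x oversubscribed)")
    ideal_MBps = Bn_gbps * 1000 / 8   # ~125 MB/s if no packets were ever dropped
    print(f"Loss-free ceiling ~{ideal_MBps:.0f} MB/s at the receiver; with queue "
          "overflow and TCP timeouts the post reports goodput under 10 MB/s per node")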

Phanishayee et al. have studied the incast problem. They show that TCP tuning does not help significantly. They observe that significantly larger switch buffers help up to some scale, but that drives up the cost of the switches substantially. Besides some form of link-level flow control (which suffers from head-of-line blocking, is not generally available, and usually does not work between switches), the other solution is simply adding more NICs, or faster NICs, per node to increase the send and receive bandwidth.

Moreover, with k NICs per node, an N-node cluster now requires k*N ports, requiring a larger network to interconnect all the nodes. Large, fast networks are an engineering and operations challenge. The simplest switch is a single-chip shared-memory switch. This type of switch is limited by the memory and memory bandwidth available for buffering. For instance, a 24-port 10 Gbps switch requires roughly 30 Gbytes/sec of memory bandwidth, forcing the use of on-chip memory or off-chip SRAM, either of which is rather limited in size, aggravating TCP performance problems. This memory bandwidth demand tends to limit the size of shared-memory switches.
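The 30 Gbytes/sec figure falls straight out of the port arithmetic; here it is reconstructed, with the read-back doubling noted as an assumption about how a shared-memory switch buffers packets.

    # Reconstructing the switch memory-bandwidth figure from the paragraph above.
    ports = 24
    port_speed_gbps = 10
    aggregate_gbps = ports * port_speed_gbps      # 240 Gbps arriving at line rate
    write_GBps = aggregate_gbps / 8               # = 30 GB/s written into the shared buffer
    print(f"{ports} x {port_speed_gbps} Gbps = {aggregate_gbps} Gbps, "
          f"i.e. about {write_GBps:.0f} GB/s of buffer-write bandwidth")
    # Reading buffered packets back out toward the egress ports costs roughly the
    # same again, which is why small on-chip memory or off-chip SRAM ends up
    # being the only practical option.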

The next step up is a crossbar switch. In effect, each line card is a shared-memory switch, possibly splitting the send and receive sides, connected by a special interconnect, the crossbar. The cost per port increases because of the crossbar interconnect, the overall complexity of the system, and the lower volumes in which large-scale switches ship. In particular, each line card needs to solve the same congestion problems as above when sending through the interconnect to other line cards.

Scaling larger means building a multi-switch network. The conventional hierarchical multi-switch network introduces bottlenecks within the network itself, such as the link from the top-of-rack switch to the inter-rack switch, leading to packet loss inside the network. Various groups have proposed building Clos networks out of commodity GbE switches, but these require specialized routing support, complex configuration, and a larger number of components, leading to more failures, more complex failure behavior, and extra cost.

Overall, you can regard the problem as being k nodes of a cluster needing to read from and write to the memory of the other nodes. The network is just an intermediary trying to handle this aggregate of read and write traffic across all the nodes in the cluster, thus requiring expensive high-speed buffering because these actions are asynchronous/streamed. Given this aggregate demand, faster processors and faster NICs just make the challenge greater.

In summary, MPP databases are more MPP than databases, in the sense that for complex distributed queries the network (the major bottleneck in MPP systems) is much more challenging than disk I/O (the major bottleneck in conventional database systems). Smart software that minimizes demands on the network and avoids hotspots and incast can achieve far more cost-efficient scaling of the cluster, and avoid dependence on complex (Clos) or non-sweet-spot networking technologies (i.e. non-Ethernet). It’s a great investment in software and processor cycles when the network is intrinsically a critical resource. In some sense, smart software in the nodes is the ultimate end-to-end solution, achieving good application performance by minimizing its dependence on the intermediary, the interconnect.”

- Prof. David Cheriton, Computer Science Dept., Stanford University

 



The Power of Efficient Algorithms on Big Data
By Mayank Bawa in Analytics, Blogroll, Business analytics, Interactive marketing on June 11, 2008

I had the opportunity to work closely with Anand Rajaraman at Stanford University, and I continue to do so now at our company. Anand also teaches the Data Mining class at Stanford, and he recently wrote a very instructive post on the observation that efficient algorithms on more data usually beat complex algorithms on small data. He followed it up with an elaboration post. Google also seems to believe in a similar philosophy.

I want to build upon that observation here. If you haven’t read the posts, do read them first. They are well worth the time!

I propose that there are two forces at work that help simple algorithms on big data beat complex algorithms on small data:

  1. The freedom of big data allows us to bring in related datasets that provide contextual richness.
  2. Simple algorithms allow us to identify small nuances by leveraging contextual richness in the data.

Let me expand my proposal using Internet Advertising Networks as an example.

Advertising networks essentially make a guess about a user’s intent and present an advertisement (creative) to the consumer. If the user is indeed interested, the user clicks through the creative to learn more.

Advertising networks today are typically paid on a CPC (Cost-Per-Click) model. There are stronger variants, CPL (Cost-Per-Lead) and CPA (Cost-Per-Acquisition), but the discussion here applies to them just as it does to the simpler CPC model. There is also a simpler variant, CPM (Cost-Per-Impression), but an advertiser effectively ends up computing a CPC anyway by tracking the click-through rate for the money spent under CPM. The CPC model dictates that advertising networks do not make money unless the user clicks on a creative.
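A small sketch of that CPM-to-effective-CPC bookkeeping; the dollar amounts and counts below are invented example numbers, not industry figures.

    # Hypothetical numbers showing how CPM spend reduces to an effective CPC.
    cpm_dollars = 2.00        # price per 1,000 impressions (made-up rate)
    impressions = 100_000
    clicks = 800              # observed click-throughs on those impressions

    spend = cpm_dollars * impressions / 1000
    ctr = clicks / impressions
    effective_cpc = spend / clicks
    print(f"CTR = {ctr:.2%}, spend = ${spend:.2f}, effective CPC = ${effective_cpc:.3f}")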

Today, the best advertising networks have a click-through rate of less than 1%. In other words, advertising networks correctly interpret a user’s intentions 1% of the time; 99% of the time they are ineffective!

I find this statistic immensely liberating. It shows that even if we are correct only 1% of the time, the rewards are significant. Why is the click-through rate so low? I think it is because human behavior is difficult to predict. Even sophisticated algorithms (that are computationally practical only on small datasets) do a bad job of predicting human behavior. It is much more powerful to think of efficient algorithms that execute across larger, diverse datasets and exploit the richness inherent in the context to enable a higher click-through rate.

I’ve observed people in the field sample behavioral data to reduce their operating dataset. I submit that a 1% sample will lose the nuances and the context that can cause an uplift and growth in revenue (a small sketch at the end of this post illustrates the arithmetic).

For example, a content media site may have 2% of its users who come in to read Sports stay on to read Finance articles. A 1% sample is certain to reduce this 2% population trait to a statistically insignificant portion of the sample. Should we or should we not derive this insight to identify and engage the 2% by serving them better content? Similarly, an Internet retailer may find that 2% of the users who come in to buy a flat-panel TV have also bought video games recently. Should we or should we not act on this insight to identify and engage the 2% by offering them better deals on games? Given that games are a high-margin product, the net effect on revenue via cross-sell could be higher than 2% in dollars.

We often want to develop an algorithm that is provably correct under all circumstances. In a bid to satisfy this urge, we restrict our datasets to find a statistically significant model that is a good predictor. I associate that with a purist way of algorithm development that was drilled into us at school.

Anand’s observation is a call for practitioners to think simple, use context, and come up with rules that segment and win locally. It will be faster to develop, test and win on simple heuristics than to wait for a perfect “Aha!” that explains all things human.
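The sampling sketch referenced above: a minimal, hypothetical simulation (all counts invented) of how a 1% behavioral sample shrinks a 2% cross-interest segment to a handful of users.

    # Toy simulation of the sampling argument; every number here is made up.
    import random

    random.seed(0)
    total_users = 200_000
    segment_rate = 0.02          # 2% of users show the cross-interest trait
    users = [random.random() < segment_rate for _ in range(total_users)]

    sample = [u for u in users if random.random() < 0.01]   # 1% behavioral sample
    segment_in_sample = sum(sample)

    print(f"Full data: ~{int(total_users * segment_rate)} users in the segment")
    print(f"1% sample: {len(sample)} users, only {segment_in_sample} from the segment")
    # A few dozen users is a thin basis for learning which content or offers this
    # segment responds to, which is the nuance the post argues gets lost.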