I’m delighted to be able to bring to a guest post to our blog this week. David Cheriton, one of Aster Data Systems’ angel investors, leads the Distributed Systems Group at Stanford University and has been known for making some smart investments. Below is what David has to say about the need to address the network interconnect in MPP systems – we hope this spurs some interesting conversation!
“A cluster of commodity computer nodes clearly offers a very cost-effective means of tackling demanding large-scale applications such as data mining over large data sets. However, most applications require substantial communication. For example, consider a query that requires a join between three tables that share no common key to partition on (non-parallelizable query), a frequent case in analytics. In conventional architectures, such operations need to move huge amounts of data among different nodes and depend on the interconnect to deliver adequate performance.
The cost and performance impact of the interconnect for the cluster to support this communication is often an unpleasant surprise, particularly without careful design of the cluster software. Yes, we are seeing the cost of 10G Ethernet coming down in cost, both in switches and NICs, and the IEEE is starting work on 100G Ethernet. However, the interconnect is, and will remain, an issue for several reasons.
First, in a parallelizable query, you need to get data from one node to several others. The bandwidth out of this one node is limited by its NIC bandwidth, Bn. In a uniformly configured cluster, each of the receiving nodes has the same NIC bandwidth Bn, so with K receivers, each is receiving at 1/K. However, the actual performance of the cluster can be limited by data hotspots, where the requirement for data from a given node far exceeds its NIC and/or memory bandwidth.
The inverse problem, often called the incast problem, arises when K nodes need to send data to a single node. Each can send at bandwidth Bn for a total bandwidth demand of K*Bn, but the target node can only receive at Bn or 1/K of the offered load. The result can be congestion, packet drop from overflowing packet queues, TCP timeouts and backoff, resulting in dramatically lower goodput than even Bn. Here, I say “dramatically” because the performance can collapse to 1/10 of expected or worse because of packet drop, timeout and retries that can occur at the TCP level. In systems with as little as 10 nodes, connected via a Gigabit Ethernet interconnect, performance can deteriorate to under 10 MB per second per node! For higher number of nodes, the problem becomes even worse.
Phanishayee et al have studied the incast problem. They show that TCP tuning does not help significantly. They observe that significantly larger switch buffering helps up to some scale, but that drives up the cost of the switches substantially. Besides some form of link-level flow control (which suffers from head-of-line blocking, is not generally available and usually does not work between switches), the other solution is just adding more NICs or faster NICs per node, to increase the send and receive bandwidth.
Moreover, with k NICs per node, an N node network now requires k*N ports, requiring a larger network to interconnect all the nodes in the cluster. Large fast networks are an engineering and operation challenge. The simplest switch is a single-chip shared memory switch. This type of switch is limited by the memory and memory bandwidth available for buffering. For instance, a 24-port 10 Gbps switch requires roughly 30 Gbytes/sec of memory bandwidth, forcing the use of on-chip memory or off-chip SRAM, in either case rather limited in size, aggravating TCP performance problems. This memory bandwidth demand tends to limit the size of shared memory switches.
The next step up is a crossbar switch. In effect, each line card is a shared memory switch, possibly splitting the send and receive sides, connected by a special interconnect, the crossbar. The cost per port increases because of the interconnect and the overall complexity of the system and the lower volume for large-scale switches. In particular each line card needs to solve the same congestion problems as above in sending through the interconnect to other line cards.
Scaling larger means building a multi-switch network. The conventional hierarchical multi-switch network introduces bottlenecks within the network, such as from the top-of-rack switch to the inter-rack switch, leading to packet loss inside the network. Various groups have proposed building Clos networks out of commodity GbE switches, but these require specialized routing support and complex configuration and a larger number of components, leading to more failures and complex failure behavior and extra cost.
Overall, you can regard the problem as being k nodes of a cluster needing to read from and write to the memory of the other nodes. The network is just an intermediary trying to handle this aggregate of read and write traffic across all the nodes in the cluster, thus requiring expensive high-speed buffering because these actions are asynchronous/streamed. Given this aggregate demand, faster processors and faster NICs just make the challenge greater.
In summary, MPP databases are more MPP than databases, in the sense that for complex distributed queries the network performance (major bottleneck in MPP systems) is much more challenging than disk I/O performance (major bottleneck in conventional database systems). Smart software that is able to minimize demands on the network and avoid hotspots and incast can significantly reduce the demand on the network and achieve far more cost-efficient scaling of the cluster, plus avoid dependence on complex (CLOS) or non-sweet spot networking technologies (i.e. non-Ethernet). It’s a great investment in software and processor cycles when the network is intrinsically a critical resource. In some sense, smart software in the nodes is the ultimate end-to-end solution, achieving good application performance by minimizing its dependence on the intermediary, the interconnect.”
- Prof. David Cheriton, Computer Science Dept., Stanford University