Archive for the ‘Database’ Category

By Tasso Argyros in Analytics, Analytics tech, Blogroll, Database, Scalability on June 17, 2008
   

 

I’m delighted to bring a guest post to our blog this week. David Cheriton, one of Aster Data Systems’ angel investors, leads the Distributed Systems Group at Stanford University and is known for making some smart investments. Below is what David has to say about the need to address the network interconnect in MPP systems. We hope this spurs some interesting conversation!

“A cluster of commodity computer nodes clearly offers a very cost-effective means of tackling demanding large-scale applications such as data mining over large data sets. However, most applications require substantial communication. For example, consider a query that requires a join between three tables that share no common key to partition on (non-parallelizable query), a frequent case in analytics. In conventional architectures, such operations need to move huge amounts of data among different nodes and depend on the interconnect to deliver adequate performance.

The cost and performance impact of the interconnect needed to support this communication is often an unpleasant surprise, particularly without careful design of the cluster software. Yes, the cost of 10G Ethernet is coming down, both in switches and NICs, and the IEEE is starting work on 100G Ethernet. However, the interconnect is, and will remain, an issue for several reasons.

First, in a parallelizable query, you need to get data from one node to several others. The bandwidth out of this one node is limited by its NIC bandwidth, Bn. In a uniformly configured cluster, each of the receiving nodes has the same NIC bandwidth Bn, so with K receivers, each receives at only Bn/K, that is, 1/K of what its own NIC could accept. Moreover, the actual performance of the cluster can be limited by data hotspots, where the demand for data from a given node far exceeds its NIC and/or memory bandwidth.

The inverse problem, often called the incast problem, arises when K nodes need to send data to a single node. Each can send at bandwidth Bn for a total bandwidth demand of K*Bn, but the target node can only receive at Bn, or 1/K of the offered load. The result can be congestion, packet drops from overflowing packet queues, and TCP timeouts and backoff, resulting in dramatically lower goodput than even Bn. Here, I say “dramatically” because the performance can collapse to 1/10 of expected or worse because of the packet drops, timeouts and retries that can occur at the TCP level. In systems with as few as 10 nodes, connected via a Gigabit Ethernet interconnect, performance can deteriorate to under 10 MB per second per node! For larger numbers of nodes, the problem becomes even worse.
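
The fan-out and incast arithmetic above can be sketched in a few lines of Python. This is only a back-of-the-envelope illustration: the function names and the 10-node Gigabit Ethernet figures are assumptions chosen to mirror the example in the text, and the model deliberately ignores packet loss, TCP timeouts and retries, which is why real goodput can fall well below these ideal numbers.

```python
def fanout_rate_per_receiver(nic_gbps: float, k_receivers: int) -> float:
    """One sender, K receivers: the sender's NIC bandwidth Bn is split K ways."""
    return nic_gbps / k_receivers


def incast_offered_load(nic_gbps: float, k_senders: int) -> float:
    """K senders, one receiver: each sender can offer Bn, so demand is K * Bn."""
    return k_senders * nic_gbps


def incast_deliverable_fraction(k_senders: int) -> float:
    """The receiver drains at most Bn, i.e. 1/K of the offered load."""
    return 1.0 / k_senders


def gbps_to_mbytes_per_sec(gbps: float) -> float:
    """Convert gigabits per second to megabytes per second (decimal units)."""
    return gbps * 1000 / 8


if __name__ == "__main__":
    bn, k = 1.0, 10  # hypothetical values: Gigabit Ethernet NICs, 10 senders
    print("offered load (Gbps):", incast_offered_load(bn, k))
    print("deliverable fraction:", incast_deliverable_fraction(k))
    # Loss-free fair share of the receiver's link for each sender.
    print("ideal per-sender share (MB/s):", gbps_to_mbytes_per_sec(bn) / k)
```

Even in this loss-free model, each of 10 senders gets at most 12.5 MB/s of a Gigabit link into the receiver, so there is very little headroom before drops and retransmissions push goodput lower still.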

Phanishayee et al. have studied the incast problem. They show that TCP tuning does not help significantly. They observe that significantly larger switch buffering helps up to some scale, but that drives up the cost of the switches substantially. Besides some form of link-level flow control (which suffers from head-of-line blocking, is not generally available and usually does not work between switches), the other solution is just adding more NICs or faster NICs per node, to increase the send and receive bandwidth.

Moreover, with k NICs per node, an N-node network now requires k*N ports, requiring a larger network to interconnect all the nodes in the cluster. Large fast networks are an engineering and operational challenge. The simplest switch is a single-chip shared memory switch. This type of switch is limited by the memory and memory bandwidth available for buffering. For instance, a 24-port 10 Gbps switch requires roughly 30 Gbytes/sec of memory bandwidth, forcing the use of on-chip memory or off-chip SRAM, in either case rather limited in size, aggravating TCP performance problems. This memory bandwidth demand tends to limit the size of shared memory switches.
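
The port-count and memory-bandwidth figures above follow from simple arithmetic, sketched here in Python. The function names are illustrative, the 100-node, 2-NIC cluster is a made-up example, and the bandwidth estimate assumes each bit is counted once at aggregate line rate; a design that both writes and reads every packet through the shared memory would need more.

```python
def ports_required(nodes: int, nics_per_node: int) -> int:
    """An N-node cluster with k NICs per node needs k * N switch ports."""
    return nodes * nics_per_node


def shared_memory_bandwidth_gbytes(ports: int, port_gbps: float) -> float:
    """Aggregate line rate of all ports, converted from Gbit/s to Gbyte/s."""
    return ports * port_gbps / 8


if __name__ == "__main__":
    # Hypothetical cluster: 100 nodes with 2 NICs each.
    print("ports needed:", ports_required(nodes=100, nics_per_node=2))
    # The 24-port 10 Gbps switch from the text.
    print("memory bandwidth (GB/s):", shared_memory_bandwidth_gbytes(24, 10.0))
```

For the 24-port 10 Gbps switch this gives 30 GB/s, which matches the rough figure quoted in the text.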

The next step up is a crossbar switch. In effect, each line card is a shared memory switch, possibly splitting the send and receive sides, connected by a special interconnect, the crossbar. The cost per port increases because of the interconnect and the overall complexity of the system and the lower volume for large-scale switches. In particular each line card needs to solve the same congestion problems as above in sending through the interconnect to other line cards.

Scaling larger means building a multi-switch network. The conventional hierarchical multi-switch network introduces bottlenecks within the network, such as from the top-of-rack switch to the inter-rack switch, leading to packet loss inside the network. Various groups have proposed building Clos networks out of commodity GbE switches, but these require specialized routing support and complex configuration and a larger number of components, leading to more failures and complex failure behavior and extra cost.

Overall, you can regard the problem as being k nodes of a cluster needing to read from and write to the memory of the other nodes. The network is just an intermediary trying to handle this aggregate of read and write traffic across all the nodes in the cluster, thus requiring expensive high-speed buffering because these actions are asynchronous/streamed. Given this aggregate demand, faster processors and faster NICs just make the challenge greater.

In summary, MPP databases are more MPP than databases, in the sense that for complex distributed queries the network performance (the major bottleneck in MPP systems) is much more challenging than disk I/O performance (the major bottleneck in conventional database systems). Smart software that minimizes demands on the network and avoids hotspots and incast can achieve far more cost-efficient scaling of the cluster, and it avoids dependence on complex (Clos) or non-sweet-spot (i.e., non-Ethernet) networking technologies. It’s a great investment in software and processor cycles when the network is intrinsically a critical resource. In some sense, smart software in the nodes is the ultimate end-to-end solution, achieving good application performance by minimizing its dependence on the intermediary, the interconnect.”

- Prof. David Cheriton, Computer Science Dept., Stanford University

 



By Tasso Argyros in Blogroll, Database, Manageability, Scalability on May 19, 2008
   

One of the most interesting, complex and perhaps overused terms in data analytics today is scalability. People constantly talk about “scaling problems” and “scalable solutions.” But what really makes a data analytics system “scalable”? Unfortunately, despite its importance, this question is rarely discussed so I wanted to post my thoughts here.

Any good definition of scalability needs to be multi-dimensional. In other words, there is no single system property that is enough to make a data analytics system scalable. But what are the dimensions that separate scalable from non-scalable systems? In my opinion the three most important are (a) data volume; (b) analytical power; and (c) manageability. Let me provide a couple of thoughts on each.

(a) Data Volume. This is definitely an important scale dimension because enterprises today generate huge amounts of data. For a shared-nothing MPP system this means supporting enough nodes to hold the available data. Advances in disk and server technology have made it possible to store tens of TBs of data per node, so this scale dimension alone can be achieved even with a relatively small number of nodes.

(b) Analytical Power. This scale dimension is as important as Data Volume, because storing large amounts of data alone has little benefit; one needs to be able to extract deep insights out of it to provide real business value. For non-trivial queries in a shared-nothing environment this presents two requirements. First, the system needs to be able to accommodate a large number of nodes to have adequate processing power to execute complex analytics. Second, the system needs to scale its performance linearly as more nodes are added. The latter is particularly hard for queries that involve processing of distributed state, such as distributed joins: really intelligent algorithms have to be in place, or else interconnect bottlenecks just kill performance and the system is not truly scalable.

(c) Manageability. Scalability across the manageability dimension means that a system can scale up and keep operating at a large scale without armies of administrators or downtime. For an MPP architecture this translates to seamless incremental scalability, scalable replication and failover, and little if any requirement for human intervention during management operations. Contrary to popular belief, we believe manageability can be measured, and we need to take such metrics into account when characterizing a system as scalable or non-scalable.

At Aster, we focus on building systems that scale across all dimensions. We believe that even if one dimension is missing our products do not deserve to be called scalable. And since this is such an important issue, I’ll be looking forward to more discussion around it!



By George Candea in Administration, Blogroll, Database, Manageability on May 17, 2008
   

I want databases that are as easy to manage as Web servers.

IT operations account for 50%-80% of today’s IT budgets and amount to tens of billions of dollars yearly(1). Poor manageability impacts the bottom line and reduces reliability, availability, and security.

Stateless applications, like Web servers, require little configuration, can be scaled through mere replication, and are reboot-friendly. I want to do that with databases too. But the way they’re built today, the number of knobs is overwhelming: the most popular DB has 220 initialization parameters and 1,477 tables of system parameters, while its “Administrator’s Guide” is 875 pages long(2).

What worries me is an impending manageability crisis, as large data repositories are proliferating at an astonishing pace; in 2003, large Internet services were collecting >1 TB of clickstream data per day(3). Five years later we’re encountering businesses that want SQL databases to store >1 PB of data. PB-scale databases are by necessity distributed, since no DB can scale vertically to 1 PB; now imagine taking notoriously hard-to-manage single-node databases and distributing them…

How does one build a DB as easy to manage as a Web server? All real engineering disciplines use metrics to quantitatively measure progress toward a design goal, to evaluate how different design decisions impact the desired system property.

We ought to have a manageability benchmark, and the place to start is a concrete metric for manageability, one that is simple, intuitive, and applies to a wide range of systems. We don’t just use the metric to measure, but also to guide developers in making day-to-day choices. It should tell engineers how close their system is to the manageability target. It should enable IT managers to evaluate and compare systems to each other. It should lay down a new criterion for competing in the market.

Here’s a first thought…

I think of system management as a collection of tasks the administrators have to perform to keep a system running in good condition (e.g., deployment, configuration, upgrades, tuning, backup, failure recovery). The complexity of a task is roughly proportional to the number of atomic steps Steps_i required to complete task i; the larger Steps_i, the more inter-step intervals, so the greater the opportunity for the admin to mess up. Installing an operating system, for example, has Steps_install in the 10s or 100s.

Efficiency of management operations can be approximated by the time T_i in seconds it takes the system to complete task i; the larger T_i, the greater the opportunity for unrelated failures to impact atomicity of the management operation. For a trouble-free OS install, T_install is probably around 1-3 hours.

If N_i represents the number of times task i is performed during a time interval T_evaluation (e.g., 1 year) and N_total = N_1 + … + N_n, then task i’s relative frequency of occurrence is Frequency_i = N_i / N_total. Typical values for Frequency_i can be derived empirically or extracted from surveys(4),(5),(6). The less frequently one needs to manage a system, the better.

Manageability can now be expressed with a formula, with larger values of manageability being better:

[Figure: manageability formula]

This says that the more frequently a system needs to be “managed,” the poorer its manageability. The longer each step takes, the poorer the manageability. The more steps involved in each management action, the poorer the manageability. The longer the evaluation interval, the better the manageability, because observing a system longer increases the confidence in the “measurement.”

While complexity and efficiency are system-specific, their relative importance is actually specific to a customer: an improvement in complexity may be preferred over an improvement in efficiency or vice versa. This differentiated weighting is captured by α. I would expect α>2 in general, because having fewer, atomic steps is valued more from a manageability perspective than reducing task duration, since the former reduces the risk of expensive human mistakes and training costs, while the latter relates almost exclusively to service-level agreements.
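
To make the metric concrete, here is a minimal Python sketch. The exact formula is not recoverable from this text, so the expression used is an assumption: one form that is consistent with the four properties described above (worse with more frequent management, longer tasks and more steps; better with a longer evaluation interval). The customer-specific weighting is modeled as an exponent on the step count, called alpha here, and the task list is invented purely for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Task:
    name: str
    steps: int      # Steps_i: atomic steps the administrator performs
    seconds: float  # T_i: time the system takes to complete the task
    count: int      # N_i: occurrences during the evaluation interval


def manageability(tasks: Sequence[Task], t_evaluation: float,
                  alpha: float = 2.0) -> float:
    """Assumed form: T_evaluation / (N_total * sum_i Frequency_i * Steps_i^alpha * T_i)."""
    n_total = sum(t.count for t in tasks)
    if n_total == 0:
        raise ValueError("no management tasks were recorded")
    weighted = sum(
        (t.count / n_total) * (t.steps ** alpha) * t.seconds  # Frequency_i term
        for t in tasks
    )
    return t_evaluation / (n_total * weighted)


if __name__ == "__main__":
    year = 365 * 24 * 3600  # evaluation interval of one year, in seconds
    # Hypothetical workload, not measured data.
    tasks = [
        Task("upgrade", steps=10, seconds=3600, count=4),
        Task("backup", steps=2, seconds=600, count=52),
    ]
    print(manageability(tasks, t_evaluation=year, alpha=2.0))
```

Under this assumed form, raising alpha penalizes multi-step tasks more heavily than long-running ones, which is the trade-off the weighting is meant to express.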

So would this metric work? Is there a simpler one that’s usable?