Configuring Replication Factor = 2 in Aster Data
By Mark Ott, Teradata Aster
Since taking the Aster Data class, I have wanted to learn more about this exciting new technology. After downloading the 2 VMware images (Queen and Worker), I realized I needed another Worker node before I could increase the Replication Factor (RF) from 1 to 2.
The Replication Factor is the number of copies of data that Aster stores to provide tolerance against hardware failure. For example, with RF = 2, when you load data, the system copies the same data to 2 different physical Worker nodes, so if 1 node goes down, the system can still function. This is similar to what Teradata does when you configure Fallback on the AMPs. An added bonus with RF = 2 is that Aster also replicates the Queen’s data, which allows faster Queen replacement.
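To make the arithmetic concrete, here is a small bash sketch. The function name `rf_summary` and its output format are hypothetical, not part of nCluster; it simply encodes the general replication rule that each copy must sit on a different Worker node, so RF copies need at least RF Workers, survive up to RF minus 1 Worker failures, and multiply the stored data by RF.

```shell
# Hypothetical helper (not part of Aster Data). Arguments:
#   $1 = number of Worker nodes, $2 = replication factor, $3 = data size in GB
rf_summary() {
  local workers=$1 rf=$2 size_gb=$3
  # Every copy must live on a different physical Worker, so RF <= workers.
  if (( rf < 1 || rf > workers )); then
    echo "infeasible: RF=$rf needs at least $rf Worker nodes, have $workers"
    return 1
  fi
  # With RF copies on distinct nodes, losing RF-1 nodes still leaves one copy.
  local tolerated=$(( rf - 1 ))
  # Each loaded row is stored RF times.
  local stored=$(( size_gb * rf ))
  echo "ok: tolerates $tolerated Worker failure(s), stores ${stored} GB"
}
```

For example, `rf_summary 2 2 10` prints `ok: tolerates 1 Worker failure(s), stores 20 GB`, while `rf_summary 1 2 10` reports that RF = 2 is infeasible on a single Worker, which is why a second Worker VM was needed.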
I wanted to test this concept by intentionally bringing an Active Worker node down while executing a query, to see whether the system could recover and return the answer set in real time. With the nCluster User’s Guide at my side, I set aside a full day to change the RF. What could possibly go wrong?
Before we begin, I want to point out the resources needed to run 3 VMware images on one physical PC. You will probably want a newer PC with scads of RAM. I'm running this demo on a Quad-CPU with 8 GB of RAM, and performance is reasonable. According to Task Manager, I'm consuming almost 5.5 GB of RAM, so I would not attempt this on a system with less than 6 GB of RAM. Having said that, let's roll.
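A rough way to size your own host is to add up the RAM given to each VM plus an allowance for the host OS. This bash sketch is hypothetical: the per-VM figures are whatever you configure in VMware, and the 1024 MB `RESERVE_MB` default is an assumed allowance, not a number from the Aster documentation.

```shell
# Hypothetical RAM budget check.
#   $1     = host's installed RAM in MB
#   $2...  = RAM allocated to each VM in MB
# RESERVE_MB (env, default 1024) is an ASSUMED allowance for the host OS.
ram_budget() {
  local host_mb=$1; shift
  local reserve_mb=${RESERVE_MB:-1024}
  local used_mb=$reserve_mb vm
  # Add up the allocation of every VM passed on the command line.
  for vm in "$@"; do
    used_mb=$(( used_mb + vm ))
  done
  local free_mb=$(( host_mb - used_mb ))
  if (( free_mb < 0 )); then
    echo "over budget by $(( -free_mb )) MB"
    return 1
  fi
  echo "fits with ${free_mb} MB to spare"
}
```

For instance, `ram_budget 8192 1536 1536 1536` (an 8 GB host with three VMs at 1.5 GB each, illustrative values) prints `fits with 2560 MB to spare`.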
Day 1 – Adding a 2nd Worker node (Worker-2)
This seemed easy enough: just make a copy of the existing VMware image for the Worker node and fire it up. I figured I would have to change the host name and IP address on the new image, and that was quite simple using the GUI tool provided in the image. But when I attempted to add the Worker node from the Aster Management Console (AMC), the AMC would not add it. Then I noticed what might be the problem: when adding a Worker, you can specify either its IP address or its MAC address.
I guessed that the new node was failing because the 2 Worker nodes had identical MAC addresses. After a Google search on how to change MAC addresses on Linux, I edited the boot.local file and inserted a fictitious MAC address.
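One way to pick a fictitious MAC address that is safe to use is to generate one with the locally administered bit set and the multicast bit cleared, so it can never collide with a vendor-assigned address. The bash function is a hypothetical sketch, not something shipped in the Aster images.

```shell
# Hypothetical helper: print a random locally administered, unicast MAC.
random_local_mac() {
  # First octet: clear bits 0-1, then set bit 1 (locally administered);
  # bit 0 stays clear, which marks the address as unicast.
  local first=$(( (RANDOM & 0xFC) | 0x02 ))
  # Remaining five octets are random bytes.
  printf '%02x:%02x:%02x:%02x:%02x:%02x\n' "$first" \
    $(( RANDOM & 0xFF )) $(( RANDOM & 0xFF )) $(( RANDOM & 0xFF )) \
    $(( RANDOM & 0xFF )) $(( RANDOM & 0xFF ))
}
```

Run it once and hard-code the result in boot.local rather than calling it at every boot, or the node's MAC will change on each restart. On a Linux guest with iproute2, the line might look like `ip link set dev eth0 address 02:5a:3c:11:7e:90`; the interface name `eth0` and that address are placeholders, and some drivers require the interface to be down before its address can change.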
Sure enough, this time I was able to register the Worker node using the AMC. However, the Worker node never entered Active status; it stayed Passive. With Status = Passive, the Worker node is in standby mode, which is clearly not what I wanted. I needed it to be Active so it could process queries. Hmmm, there must be some other configuration I needed to do. But since it was already dark outside, I figured I had done enough for one day.
Day 2 – Partition Splitting
Now I was at a standstill, since the Worker-2 node would not enter Status = Active. At that point, I knew I needed an intervention, so I e-mailed the Aster Support Team and laid out my dilemma. Sure enough, the reply suggested it might have something to do with Partition Splitting. Another trip back to the User’s Guide shed light on this topic.
Partition Splitting is an unfortunate name as far as I’m concerned. ‘Partition’ has a lot of different definitions depending on whom you ask in the Aster world. Basically, Partition Splitting is a feature that adds v-Workers. Let’s back up a minute to get perspective on this.
To scale out your cluster, you add Worker nodes, and of course as you add Workers, you get more CPUs on those nodes. To execute queries, Worker nodes need v-Workers, which do all the heavy lifting for a Worker node. Aster Data recommends about 2 CPU cores per v-Worker; if you have fewer v-Workers than that ratio suggests, the CPUs may be underutilized. Since I was unable to change the RF to 2, maybe the problem had something to do with a lack of v-Workers.
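The 2-cores-per-v-Worker guideline turns into simple integer arithmetic. This bash sketch is illustrative only: `recommended_vworkers` is a hypothetical name, and rounding down with a floor of 1 is an assumption, not Aster's documented rule.

```shell
# Hypothetical helper: suggested v-Worker count for one Worker node,
# using the ~2 CPU cores per v-Worker guideline.
#   $1 = number of CPU cores on the Worker node
recommended_vworkers() {
  local cores=$1
  # Integer division rounds down (ASSUMED rounding choice).
  local v=$(( cores / 2 ))
  # Every Worker needs at least one v-Worker to do any work.
  if (( v < 1 )); then
    v=1
  fi
  echo "$v"
}
```

`recommended_vworkers 8` prints 4, and `recommended_vworkers 1` prints 1 because of the floor.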
The User’s Guide mentions a file that holds the current partition count, so from the command shell of the Queen, I ran:
$ cat /home/beehive/config/totalPartitionCount
That returned a v-Worker count of 1. So at this point, I had potentially solved the mystery: the Worker-2 node was Passive because I only had 1 v-Worker available, and it was already assigned to the Worker-1 node. To confirm this, I checked the Partition Map in the AMC, and sure enough, it showed 1 Primary v-Worker, located on the Worker-1 node.
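The manual check (read the partition count, then compare it with the number of Workers you expect to be Active) can be scripted. This bash sketch assumes the file holds a single integer, as the `cat` output suggests; `check_partitions` is a hypothetical helper, and the rule it applies (fewer v-Workers than Worker nodes is a red flag) reflects the reasoning in this troubleshooting session, not an official Aster check.

```shell
# Hypothetical diagnostic.
#   $1 = number of Worker nodes you expect to be Active
#   $2 = path to the partition-count file (defaults to the Queen path quoted above)
# Returns 2 if the file cannot be read, 1 if there are fewer v-Workers than Workers.
check_partitions() {
  local workers=$1
  local file=${2:-/home/beehive/config/totalPartitionCount}
  local count
  # ASSUMPTION: the file contains a single integer; strip any whitespace/newline.
  count=$(tr -d '[:space:]' < "$file") || return 2
  if (( count < workers )); then
    echo "only $count v-Worker(s) for $workers Worker node(s)"
    return 1
  fi
  echo "ok: $count v-Worker(s) for $workers Worker node(s)"
}
```

With the count of 1 found here and 2 Workers, `check_partitions 2` would report `only 1 v-Worker(s) for 2 Worker node(s)`.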
Time to do some Partition Splitting. From the Queen, I ran the following to increase the v-Workers to 2:
After a Soft Reset (which restarts the software on the nodes), I did an Activate Cluster (which brings nodes online). A Balance Data (which moves v-Workers among Worker nodes) was done automatically, as was a Balance Process (which sets v-Workers as Primary or Secondary).
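With RF = 2, the intended outcome of Balance Data plus Balance Process is that every v-Worker has a primary on one Worker and a secondary on a different one. This bash sketch illustrates that idea with a simple round-robin rule; `place_vworkers` is hypothetical, and Aster's actual placement logic is not described in this post and may differ.

```shell
# Hypothetical illustration of primary/secondary placement for RF = 2.
#   $1 = number of v-Workers, $2 = number of Worker nodes
place_vworkers() {
  local vworkers=$1 nodes=$2 v
  # Two copies on distinct machines need at least two Workers.
  if (( nodes < 2 )); then
    echo "RF=2 needs at least 2 Worker nodes" >&2
    return 1
  fi
  for (( v = 0; v < vworkers; v++ )); do
    # Primaries go round-robin; each secondary goes to the next node,
    # which always differs from the primary when nodes >= 2.
    local primary=$(( v % nodes + 1 ))
    local secondary=$(( (v + 1) % nodes + 1 ))
    echo "v-Worker $v: primary=Worker-$primary secondary=Worker-$secondary"
  done
}
```

With 2 v-Workers and 2 Workers, `place_vworkers 2 2` gives Worker-1 and Worker-2 one primary and one secondary each.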