High Availability Series: Server Clustering

Clustering has been around for years. In a standard cluster, multiple servers share a common set of storage so that the workload keeps being served if one server fails, or so that capacity can scale when requirements call for it. Instead of one server supporting the entire workload, several servers are joined to appear and behave as one; together they can carry more load or compensate when one stops working. Technologies like Microsoft Failover Clustering have long been popular for tolerating hardware or software failures, and in recent years have done a fairly good job of solving the uptime and high availability conundrum. While virtualization technologies focus on keeping the virtual hardware itself up and running, clustering focuses on keeping the actual services up and running, or getting them back up.

Active/Passive Clusters

This setup compensates for a server outage. The primary server bears the entire workload until it fails or is taken down for maintenance. When the primary is unavailable, the secondary server takes on the workload. In these setups, the secondary server is commonly either just as capable as the primary or somewhat less capable.
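
To make the routing rule concrete, here is a minimal sketch in Python of how an active/passive pair might decide which server handles a request. It is an illustration only, not how Microsoft Failover Clustering or any other product is implemented; the names Node, ActivePassiveCluster, and handle are hypothetical, and a real cluster would detect failure through health checks rather than a flag set by hand.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical stand-in for a clustered server."""
    name: str
    healthy: bool = True


class ActivePassiveCluster:
    """Illustrative only: the primary serves everything while it is healthy."""

    def __init__(self, primary: Node, standby: Node) -> None:
        self.primary = primary
        self.standby = standby

    def active_node(self) -> Node:
        # The standby takes over only when the primary is unavailable.
        if self.primary.healthy:
            return self.primary
        if self.standby.healthy:
            return self.standby
        raise RuntimeError("no healthy node available")

    def handle(self, request: str) -> str:
        node = self.active_node()
        return f"{node.name} served {request}"


cluster = ActivePassiveCluster(Node("primary"), Node("standby"))
print(cluster.handle("request-1"))   # served by the primary
cluster.primary.healthy = False      # simulate an outage or maintenance window
print(cluster.handle("request-2"))   # served by the standby
```

Note that the standby does no work at all until the primary is marked unhealthy, which is why a less capable secondary server can be acceptable in this design.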

Active/Active Clusters

This arrangement is designed so that every server performs useful work: each server is the primary for a specified set of applications. If one server fails, another continues to perform its own functions and also takes on the workload of the failed server. Active/active clusters, also called symmetric clusters, are generally considered the more cost-effective option because the workload is spread more evenly across the available resources.
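
The placement logic described above can be sketched in Python as follows. This is a simplified model under stated assumptions: the function place_workloads and the application and node names are hypothetical, and the "hand orphaned applications to the least loaded survivor" rule is one plausible policy chosen for illustration, not the behavior of any particular clustering product.

```python
def place_workloads(preferred_owner: dict, healthy_nodes: set) -> dict:
    """Return a mapping of application -> node that should run it.

    preferred_owner maps each application to the node that is normally
    its primary. Applications whose owner has failed are handed to the
    surviving node that currently runs the fewest applications.
    """
    if not healthy_nodes:
        raise RuntimeError("no healthy node available")

    # Applications whose primary is still up stay where they are.
    placement = {
        app: node for app, node in preferred_owner.items()
        if node in healthy_nodes
    }

    # Count how many applications each surviving node already runs.
    load = {node: 0 for node in healthy_nodes}
    for node in placement.values():
        load[node] += 1

    # Reassign orphaned applications, least loaded survivor first
    # (ties are broken by node name so the result is deterministic).
    for app in sorted(preferred_owner):
        if app not in placement:
            target = min(sorted(healthy_nodes), key=lambda n: load[n])
            placement[app] = target
            load[target] += 1
    return placement


owners = {"web": "node-a", "db": "node-b", "mail": "node-b"}
print(place_workloads(owners, {"node-a", "node-b"}))  # everyone on their primary
print(place_workloads(owners, {"node-a"}))            # node-a absorbs node-b's work
```

In the second call the surviving node carries its own application plus the two that belonged to the failed node, so each server in an active/active cluster needs enough headroom to absorb a peer's workload.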

The downside to clustering is that it has always had rather specific requirements that can be hard to meet, and the shared storage it normally relies on can be a single point of failure. In most clustering environments, such as Microsoft Failover Clustering, there also tends to be some downtime while workloads fail over to the other node, which can interrupt service.

Check out our previous post on Storage Flexibility.