The main point of disaster recovery (DR) is to decouple it from your production environment. This is the reason that offsite (and offline) backups are so important. Backups kept onsite, even with a second copy online at another site, could still be wiped out by ransomware. Disaster recovery is your fallback when everything has gone wrong and you need to start bringing things back online, whereas high availability (HA) is supposed to have things back online in seconds to minutes. DR is often a multi-day process and should be able to survive anything coming at it.
However, HA and DR are two topics that are often confused in organizations. It’s not your fault if you lump these together; modern availability and DR vendors sell products that blur the line between the two. To clear this up, let’s run through what each of these is and where it can get fuzzy.
High Availability is the ability for an application to stay online as things fail.
For example, take a SQL Server failover cluster with two nodes. If one of the Windows boxes dies, the application fails over to the other node and everything keeps running as expected. This is application-level failover. Exchange also has application-level HA, with database availability groups (DAGs) allowing the environment to come back online after a node fails. Properly configured Active Directory also has inherent application-level high availability: the database is replicated to multiple servers, and all of them are capable of answering client requests.
A more modern example is a virtualization HA cluster. If a host dies, all the VMs fail over to another host. This is OS-level failover, where the entire OS moves over to another host and starts up there.
One of the key things with HA is that it has something shared, as opposed to DR which is meant to eliminate any single points of failure. Traditional SQL failover clustering shares the disks and by proxy the databases. Exchange DAGs share the same configuration and quorum but have different disks and databases. Active Directory shares config and duplicates the database across multiple locations. Virtualization shares the configs and (often) storage between the multiple nodes and relies on a single shared VM that moves between the nodes.
If you’re thinking of VMware Fault Tolerance right now, remember that it is still only a single VM, just running in two places. If the VM crashes on one side, it crashes on the other. If you’re thinking about software-defined storage (S2D or vSAN), remember that’s the same config and the same data being synchronized between multiple locations. If you delete a VM in one place, that deletion is synchronously completed everywhere. Synchronously mirrored SANs are the same way: if you screw the config up in one place, you lose both copies. Local SAN-based snapshots are another thing that many people try to sell as DR. Again, they inherently rely on the primary SAN being functional. If that SAN dies, so too do the snapshots.
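One way to make the "something shared" test concrete is to write down what each protection option depends on and intersect it with what production depends on. The sketch below is only an illustration: the component labels (`site:main`, `san:primary`, `config:storage`, and so on) are made-up names, not the output of any real inventory tool, and a real environment would have many more dependencies.

```python
# Hypothetical dependency labels; none of these come from a real product.
PRODUCTION = {"site:main", "san:primary", "config:storage"}

CANDIDATES = {
    # Snapshots live on the same SAN in the same room as production.
    "local SAN snapshots": {"site:main", "san:primary"},
    # A mirror has its own hardware but shares the storage configuration.
    "synchronous SAN mirror": {"site:second", "san:secondary", "config:storage"},
    # Separate server, separate software, separate storage, separate site.
    "offsite backups": {"site:offsite", "storage:backup", "software:backup"},
}


def shared_dependencies(production, candidate):
    """Return the components a candidate has in common with production."""
    return production & candidate


def classify(production, candidates):
    """Map each candidate to the sorted list of things it shares with production."""
    return {
        name: sorted(shared_dependencies(production, deps))
        for name, deps in candidates.items()
    }


if __name__ == "__main__":
    for name, shared in classify(PRODUCTION, CANDIDATES).items():
        verdict = "shares nothing" if not shared else "shares " + ", ".join(shared)
        print(f"{name}: {verdict}")
```

Anything with a non-empty list still leans on production in some way, which puts it on the HA side of the line rather than the DR side.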
This has all been wonderful for our production availability and ability to keep the lights on throughout the day as issues come up, but we haven’t touched disaster recovery.
Disaster Recovery’s job is to get the environment back online after those shared things above have failed. This is your fourth-and-30 punt option when everything has gone wrong and you just need to get stuff back online.
The oldest DR option is simple backups stored offsite. Create backups using a separate server, with separate software, and store them on separate storage. If the production SAN dies and kills the HA cluster, use the backups to restore onto another server and bring the environment back up. Shipping those backups to another location for future recovery was one of the first true DR options. If those backups were to stay in the production server room, then that room becomes a single point of failure. If a fire breaks out, the primary copy of the data and the DR copy are both gone.
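As a rough illustration of the offsite and offline requirements, the sketch below checks a list of backup copies for the two gaps described so far: everything sitting in the production site, and everything reachable online. The `BackupCopy` type, its fields, and the site names are assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    """One copy of the backups (hypothetical model for illustration)."""
    name: str
    site: str     # where the copy physically lives
    online: bool  # True if it is reachable from the production network


def survives_site_loss(copies, production_site):
    """At least one copy must live somewhere other than the production site."""
    return any(c.site != production_site for c in copies)


def survives_ransomware(copies):
    """At least one copy must be offline, out of reach of the network."""
    return any(not c.online for c in copies)


def gaps(copies, production_site):
    """List the protection gaps for a set of backup copies."""
    problems = []
    if not survives_site_loss(copies, production_site):
        problems.append("no offsite copy")
    if not survives_ransomware(copies):
        problems.append("no offline copy")
    return problems


if __name__ == "__main__":
    copies = [
        BackupCopy("backup server", site="main", online=True),
        BackupCopy("replica at second site", site="second", online=True),
    ]
    print(gaps(copies, production_site="main"))
```

With only the two online copies in the usage example, the check reports a missing offline copy, which is the ransomware scenario from the opening paragraph.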
Shipping backups around was slow and labor intensive, though, so people tried to figure out other methods of getting copies of the data elsewhere without the effort. That is where SAN-based asynchronous replication joined the fray, and that’s where the line started getting very blurry.
Asynchronous Mirroring Is Not True DR
Part of the problem with SAN-based replication, even asynchronous, is that if the data gets deleted at the main site, it will be deleted at the other location too. From there, people started taking snapshots on the DR SAN to keep multiple copies of the data. However, that leaves us with a disaster recovery solution that is directly tethered to our production environment. If you want an example of how this can go badly, think about what would happen if a really bad firmware bug corrupted data on your SAN. Most SANs require both sides to be on the same firmware version to replicate, so that bug would hit both sides.
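A toy model can show why replication alone does not protect against deletion. The sketch below is not how any particular SAN works; `ReplicatedVolume` and its methods are invented for illustration. It only captures the idea that an asynchronous replica faithfully replays whatever happened on the primary, deletes included, while an independent point-in-time copy does not.

```python
class ReplicatedVolume:
    """Toy model of asynchronous replication (hypothetical, for illustration)."""

    def __init__(self):
        self.primary = {}
        self.replica = {}
        self.pending = []  # changes made on the primary but not yet shipped

    def write(self, key, value):
        self.primary[key] = value
        self.pending.append(("write", key, value))

    def delete(self, key):
        self.primary.pop(key, None)
        self.pending.append(("delete", key, None))

    def replicate(self):
        """Ship queued changes to the replica, deletes included."""
        for op, key, value in self.pending:
            if op == "write":
                self.replica[key] = value
            else:
                self.replica.pop(key, None)
        self.pending.clear()


def demo():
    vol = ReplicatedVolume()
    vol.write("vm1.vhdx", "data")
    vol.replicate()

    # Independent point-in-time copy, taken before the mistake happens.
    backup = dict(vol.primary)

    vol.delete("vm1.vhdx")  # someone deletes the VM at the main site
    vol.replicate()         # the replica catches up a little later
    return vol.replica, backup


if __name__ == "__main__":
    replica, backup = demo()
    print("replica:", replica)
    print("backup:", backup)
```

The delay in asynchronous replication only changes when the deletion arrives at the other side, not whether it arrives.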
Software-Based Replication to the Rescue!
On top of all this, SANs are expensive! This is where software-based replication (Veeam, Carbonite Migrate, etc.) started to join the fight. This software allowed for shared-nothing replication of the environment. A VM running on a Hyper-V cluster off a SAN in one place can be replicated to a stand-alone server with local storage at another location.
Eliminating Single Points of Failure
There’s only one planet Earth, so at some point, you have a single point of failure. If your business exclusively exists inside the confines of the Louisville Metro area, and a disaster wipes out the entire city … being able to recover elsewhere may not be important to you if everyone is dead. If you have 50 sites across the whole US, a single tornado hitting Louisville is a disaster, but the business needs to keep working, so plan your disaster recovery around that. Your tolerance for downtime coupled with your budget will give you a pretty good direction for both your HA and DR initiatives.
If you have any further questions about what’s right for your organization, we can help! Send us an email or give us a call at 502-240-0404!