No discussion of high availability is complete without a quick dive into disaster recovery. Technically, disaster recovery (DR) is not at all the same as high availability (HA). If you are going for HA, you are going to build things wall to wall, location to location, country to country that can handle some failure and still remain up. You will build a resilient environment from start to finish – soup to nuts, as they say.
But bad things happen. This is a rule. They will happen when you least expect them. It could be a hurricane or a rogue collective of hackers that takes down entire countries. It could be something so catastrophic that I don’t even want to imagine it. While you’re building your utopian HA environment, you must also be having a disaster recovery conversation.
DR is NOT an IT decision, but a management decision. IT can start the conversation – and you should because you are a stakeholder – but management needs to drive and sign off on any DR discussion or plan. There are two main things to discuss:
Recovery Point Objective (RPO)
This is how much data you can lose. What’s the “point in time” you have to recover back to if something horrible happens? Can you lose a day’s worth of data/information? A week? A month? A minute? A second? The less data you can afford to lose, the lower the RPO and the more expensive things will be. You’ll need better software, hardware, processes, Internet bandwidth, and partnerships with offsite locations/gear.
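To make the RPO conversation concrete, here is a minimal sketch of the arithmetic. It assumes a simple scheduled-backup model where each backup takes some time to land offsite; the function names and the sample numbers are made up for illustration and are not from any particular product.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta,
                         offsite_copy_delay: timedelta) -> timedelta:
    """Upper bound on lost data under this (assumed) backup model.

    Worst case: disaster strikes just before the newest backup finishes
    copying offsite, so you fall back to the one before it.
    """
    return backup_interval + offsite_copy_delay


def meets_rpo(backup_interval: timedelta,
              offsite_copy_delay: timedelta,
              rpo: timedelta) -> bool:
    """True if the worst-case loss window fits inside the agreed RPO."""
    return worst_case_data_loss(backup_interval, offsite_copy_delay) <= rpo


# Hypothetical numbers: management has signed off on a 4-hour RPO.
rpo = timedelta(hours=4)
nightly = meets_rpo(timedelta(hours=24), timedelta(hours=2), rpo)
frequent = meets_rpo(timedelta(minutes=15), timedelta(minutes=5), rpo)
print(nightly, frequent)  # False True
```

A nightly backup cannot satisfy a 4-hour RPO no matter how good the gear is, which is exactly the kind of trade-off management needs to see in numbers.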
Recovery Time Objective (RTO)
This is how quickly you get back up and running. In the case of something horrible happening, one of the first questions is “when do I get my email back?” or “when does our e-commerce site come back online?” Can you be down for a week? A day? An hour? The faster you need to recover and be fully operational, the more expensive things will be. Again, you’ll need better software, hardware, processes, and partnerships. If your building is destroyed, chances are you don’t have an extra building across town. You’ll have to build a partnership to perhaps share a rack, a building, a co-location facility, or a sister company location.
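One rough way to sanity-check an RTO is to add up how long each recovery step is expected to take and compare the total with the target. The sketch below assumes the steps run one after another; the step names and durations are hypothetical placeholders, not measurements.

```python
from datetime import timedelta


def estimated_recovery_time(steps: dict[str, timedelta]) -> timedelta:
    """Total time to recover, assuming the steps happen sequentially."""
    return sum(steps.values(), timedelta())


# Hypothetical recovery plan with guessed durations.
steps = {
    "provision replacement gear": timedelta(hours=6),
    "restore data from offsite backup": timedelta(hours=3),
    "repoint DNS and verify services": timedelta(hours=1),
}

rto = timedelta(hours=8)  # assumed target agreed with management
total = estimated_recovery_time(steps)
slowest = max(steps, key=steps.get)

print(total, total <= rto)  # 10:00:00 False
print(slowest)              # provision replacement gear
```

Here the plan misses the 8-hour target, and the slowest step points at where money would help most: gear on a shelf or a standing partnership with another site.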
You want to make sure you have backup or replication (or both) technologies and processes in place to support your RPO/RTO. Get that data offsite and keep it easily accessible.
You probably want “extra gear” and that’s hard. In small/isolated failures, like a network closet, you probably have warranties in place to get gear in a few hours (or days). But what about the meantime? Can your company afford to have a whole section of users down and not working? Maybe you need some gear on a shelf you can swap out, and then put the newly-replaced gear on the shelf for next time.
We’ve discussed gateway networking, and BGP, and how to make that resilient. But what if you’re in a DR situation and you have to move services from A to B and your IP addresses change? That means you probably have DNS entries that need to change…quickly. There are technologies and services that can provide automated DNS heartbeat/failover too. If your RPO/RTO is low and you need to swing production in a matter of seconds or minutes, you can do that with enough planning.
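The heartbeat/failover idea can be sketched in a few lines. This is only an illustration of the logic, not how any specific DNS service works: the `FailoverMonitor` class, the `update_dns` callback, the failure threshold, and the documentation-range IP addresses are all assumptions. In practice `update_dns` would call whatever API your DNS provider offers.

```python
import socket
from typing import Callable


def tcp_alive(host: str, port: int, timeout: float = 2.0) -> bool:
    """Heartbeat check: can we open a TCP connection to the service?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


class FailoverMonitor:
    """Swing DNS to the standby after several consecutive missed heartbeats."""

    def __init__(self, primary_ip: str, standby_ip: str,
                 update_dns: Callable[[str], None], threshold: int = 3):
        self.primary_ip = primary_ip
        self.standby_ip = standby_ip
        self.update_dns = update_dns  # hypothetical hook into your DNS provider
        self.threshold = threshold
        self.failures = 0
        self.active_ip = primary_ip

    def record(self, healthy: bool) -> str:
        """Feed in one heartbeat result; return the IP DNS should point at."""
        if healthy:
            self.failures = 0
        else:
            self.failures += 1
            if (self.failures >= self.threshold
                    and self.active_ip == self.primary_ip):
                self.active_ip = self.standby_ip
                self.update_dns(self.standby_ip)  # fires once per failover
        return self.active_ip


# Simulated heartbeats so the example runs without a network.
changes: list[str] = []
monitor = FailoverMonitor("192.0.2.10", "198.51.100.10",
                          update_dns=changes.append)
for healthy in [True, False, False, False, False]:
    monitor.record(healthy)

print(monitor.active_ip, changes)  # 198.51.100.10 ['198.51.100.10']
```

In a real loop you would call something like `monitor.record(tcp_alive("192.0.2.10", 443))` every few seconds. Requiring several misses in a row avoids failing over on a single dropped packet, and keeping the TTL on the DNS record short helps clients pick up the change quickly.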
At the end of the day, with DR (and most any HA discussion), management needs to understand how decisions affect the use of technology. Remember, IT is typically a cost center. When things run well, IT doesn’t exist. When bad things happen, IT is the scapegoat. A clear partnership between IT and management is required to understand how IT is used in the organization and how it needs to remain active (as best as possible) when bad things happen. Which they will. We promise.