One of the biggest mistakes people make when writing a Disaster Recovery (DR) plan is assuming they’re always planning for the complete destruction of the main site.
While this is a part of a DR plan, it’s not the only thing you have to plan for. Proper DR starts with High Availability (HA), and the first tier of any good DR plan involves simply keeping services available at the main production location.
If an ESX host melts in the middle of the business day (let’s assume the host is LITERALLY melting), most environments need zero administrative action to get services back up. The fire suppression kicks in and puts out the fire. Once the other hosts see that the host is gone, VMware HA kicks in and boots its VMs on the remaining servers, and within a couple of minutes we’ve recovered from a serious disaster.
Why does that need to be documented? End users might even miss the entire event (excluding the frantic screaming and running of the IT staffers). It needs to be documented because everyone is off work at some point. The primary staff responsible for VMware know that this is going to happen, but if they are away (at VMworld, for instance), the rest of IT may not know how much of the system is automated and may unintentionally lower availability by trying to recover services manually. They may not know how to replace the puddle that was formerly ESX05, or whether it needs to be replaced at all. The DR plan in this case isn’t for getting services back on; it’s for restoring HA after a disaster has already been mitigated.
When planning for DR, you have to plan for dozens of different types of disasters. Just a few examples:
- Single file disaster: One super critical file is deleted or corrupt (think .edb or .mdf)
- Single server disaster: One server just dies; corrupt OS, hardware failure, etc.
- Single rack disaster: One whole rack of equipment is dead; water leak, power failure, etc.
- Single room disaster: One whole room is no longer usable; fire, water, power failure, etc.
- Single building disaster: An entire building is gone; fire, meteorite, jet engine, etc.
- Single campus disaster: Tornado, massive power outage, meteorite shower, etc.
- Single city disaster: Generic acts of God.
- Single IT person disaster: This could mean that John is on vacation, or that John just got hit by a bus.
- Multiple IT people disaster: This may mean that the flight back from VMworld with all the VMware admins just crashed.
- Employee Workspace uninhabitable: This could be a fire, flood, or a blizzard; for some reason employees can’t work in the office.
Most people plan only for acts of God and ignore all the other disasters, which are far more likely. If you read the previous article about HA, you know that most of these issues can be mitigated through HA, but that mitigation still needs to be part of the DR plan.
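One way to keep track of this is to treat the list above as a checklist and flag every scenario that has no written procedure, whether or not HA already covers it. The following is only a rough sketch: the `Scenario` fields, the `ha_mitigated` values, and the runbook paths are all made up for illustration and would need to reflect your own environment.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Scenario:
    name: str
    example: str
    ha_mitigated: bool      # assumption: automation restores service without manual steps
    runbook: Optional[str]  # hypothetical path to the written procedure, None if missing


def undocumented(scenarios: List[Scenario]) -> List[str]:
    """Return the names of scenarios with no written procedure.

    HA-mitigated scenarios are deliberately included: someone still has to
    know how to restore HA (or that nothing needs to be done at all).
    """
    return [s.name for s in scenarios if s.runbook is None]


# Illustrative entries only; a real plan would cover every scenario in the list.
PLAN = [
    Scenario("single file", "deleted .edb or .mdf", False, "runbooks/restore-file.txt"),
    Scenario("single server", "ESX host hardware failure", True, None),
    Scenario("single rack", "water leak or power failure", True, "runbooks/rack-loss.txt"),
    Scenario("single IT person", "primary VMware admin unavailable", False, None),
]

for name in undocumented(PLAN):
    print(f"No documented procedure for: {name}")
```

In this made-up plan, the single server scenario is flagged even though HA handles it, which is exactly the kind of gap the melting ESX host example describes.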
If you want to skip around, here are all the articles in this disaster recovery series:
- How to Construct a Proper Disaster Recovery Plan
- Technical Documentation
- End Users
- Management Buy-In
- Care and Feeding