The phone rang at 7am one morning: “Hey, we’ve got a big problem.”
“I just got a call that the air conditioner’s condenser pump went out, the room is almost a hundred degrees, but more importantly water is pouring out of it onto the server rack. Everything is down, we need help, and users will be here within the hour. I’m on my way in, we’ve already shut everything off.”
That is the kind of call that really wakes you up in the morning.
When I showed up at the customer’s site, it was as bad as I had imagined. There was a large puddle on top of the UPS, water was still dripping out of the servers, and the room was sweltering. We quickly organized. When we pulled servers out of the racks, water just poured out of them. The physical domain controller, VoIP server, backup server, and one half of their DataCore SAN were all dead. The ESX hosts had water in them, but it had not reached the motherboards.
We annexed a nearby office, ran extension cords to two old 110 V, 20 A UPS units pulled out of storage, set up box fans, and grabbed a handful of 20-foot Cat6 cables. One half of the SAN was completely toast, but the other half worked fine, so we had that DataCore node online in the office within a few minutes. The two ESX hosts were alive, apart from their front-panel iDRAC displays, so we were able to bring them back into production. My lab environment happened to have hardware matching the dead DataCore node and the backup server; I had it delivered, and we were able to get those servers up as well.
Within two hours we had the entire environment running out of an office and serving the whole business while we waited for replacement hardware to arrive from vendors.
The next day the air conditioner was fixed, and we used DataCore’s “shared nothing” approach to migrate the storage and the VMs back into their rack in the server room without downtime. The air conditioner was then moved out of the server room, and an IDF five stories above was annexed to serve as a separate datacenter. One half of DataCore was moved up there along with an ESX host, providing full server room redundancy to guard against such disasters in the future.
This incident further highlights how most disasters are local. It affected only a single rack of equipment, but that rack happened to be critical to running the entire environment. Had there not been redundant sets of DataCore storage, both the SAN and the backups would have been completely inoperable. Had these servers simply been distributed across multiple racks in that room, the impact would have been greatly reduced. This client’s new multi-floor stretch now accommodates not only the loss of a single rack but the loss of an entire floor.
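The failure-domain reasoning above can be checked with a few lines of Python. This is only a rough sketch: the node names, the domain labels, and the placement of each node are hypothetical and are not taken from the client's actual inventory. It asks which services still have a working node after a given rack or floor is lost.

```python
# Hypothetical inventory: which nodes can provide each service.
# All names below are illustrative, not the client's real hosts.
providers = {
    "storage": ["san-a", "san-b"],
    "compute": ["esx-1", "esx-2"],
    "directory": ["dc-1"],
}


def surviving_services(placement, providers, failed_domain):
    """Return services that keep at least one node outside the failed domain."""
    alive = {node for node, domain in placement.items() if domain != failed_domain}
    return {svc for svc, nodes in providers.items() if alive & set(nodes)}


# Layout 1: everything in one rack (assumed to resemble the original setup).
single_rack = {node: "rack-1" for nodes in providers.values() for node in nodes}

# Layout 2: one storage node and one host moved to a second location.
two_floors = {
    "san-a": "server-room",
    "esx-1": "server-room",
    "dc-1": "server-room",
    "san-b": "idf",
    "esx-2": "idf",
}

print(sorted(surviving_services(single_rack, providers, "rack-1")))
# → []
print(sorted(surviving_services(two_floors, providers, "server-room")))
# → ['compute', 'storage']
```

In this made-up layout the directory service still has a single point of failure, which is exactly the kind of gap a quick check like this is meant to surface.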