Hard drives fail. This is a fact. Anything that moves will, given time, stop moving. A hard drive failure is potentially one of the most disruptive incidents, because the drive actually holds the essential data. The question is: why do hard drives fail so often?
If we look at manufacturers’ drive specifications, the Mean Time Between Failures (MTBF) quoted for a drive is often 1,000,000 hours (or more), which works out to over 100 years. That should mean that our drive will last as long as we need it and never die, right? Well, not quite. The MTBF may be 1,000,000 hours, but you have to factor in how many drives you have.
To illustrate, let’s do some simple scale-up math.
First, how many hours are in a year? 365 days x 24 hours = 8760 hours (ignoring leap years).
That means that each drive runs for 8760 hours a year. So if I have 10 drives, together they run for 87,600 hours in a given year: 1,000,000 divided by 87,600 equals 11.4 years. That means that, according to MTBF alone, those 10 drives will run for about 11 years, on average, before one of them fails.
What if I have 100 drives? Now we’re up to 876,000 hours in a year, which means we would have a failure every 1.14 years. In a five-year life cycle, that works out to about 4.4 failures, so we would have to replace four or five drives. What if I have 1,000 drives? That adds up to 8,760,000 hours, which means I have a failure every 0.114 years, or about 42 days, and over a five-year life cycle, I could have 44 drives fail.
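The scale-up math above can be sketched in a few lines of Python. This is a minimal illustration, not a reliability model: it assumes every drive runs 24 hours a day and that the MTBF figure simply divides across the fleet, exactly as in the arithmetic above. The function names years_between_failures and expected_failures are made up for this example.

```python
# Assumption: drives run around the clock, and leap years are ignored.
HOURS_PER_YEAR = 365 * 24  # 8760


def years_between_failures(mtbf_hours, drive_count):
    """Average years between failures across a fleet of drives."""
    fleet_hours_per_year = drive_count * HOURS_PER_YEAR
    return mtbf_hours / fleet_hours_per_year


def expected_failures(mtbf_hours, drive_count, years):
    """Average number of failures in the fleet over a span of years."""
    fleet_hours = drive_count * HOURS_PER_YEAR * years
    return fleet_hours / mtbf_hours


if __name__ == "__main__":
    mtbf = 1_000_000  # hours, the figure quoted on the spec sheet
    for drives in (10, 100, 1000):
        gap = years_between_failures(mtbf, drives)
        failures = expected_failures(mtbf, drives, years=5)
        print(f"{drives:>5} drives: one failure every {gap:.2f} years, "
              f"about {failures:.1f} failures in 5 years")
```

Plug in your own drive count and life cycle to see how quickly a "100-year" MTBF turns into a regular replacement schedule.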
Okay, so now it makes sense why we still have failures. Even so, it seems that we see a LOT more failures in a given time period than MTBF predicts. Why is that? Well, the reason is that drive “failures” aren’t just from a drive itself failing. We’ll talk about that more in an upcoming post.