In previous blogs we’ve discussed how hard drives “fail” far more often than manufacturers’ specifications say they should, and how they are often substantially slower in a production environment than expected. Seagate noticed this trend in the early 2000s, when it found that 60-90 percent of all drives sent back to it had no issues when put through a bench test.
This obviously cost Seagate a lot of money, so it set about solving the problem through innovation and eventually produced the technology that became the X-IO ISE. A good description of the ISE can be found here, so these articles focus instead on the implications of this technology for the problems raised in previous blogs.
The X-IO chassis addresses the primary causes of hard drive failure through a series of ingenious design decisions.
All of the drives in an X-IO drive enclosure are mounted on special rubber grommets that isolate vibration and keep individual drives from disturbing one another. This keeps the average X-IO enclosure at only 2 rad/s (as opposed to 12-25 in a normal enclosure), which prevents many of the mechanical errors that lead to drive failures.
If a drive in an ISE enclosure shows symptoms that suggest it may be failing, the ISE chassis automatically performs a Bus Reset for that individual drive and, if necessary, a power cycle, which simulates reseating the drive. Unlike in a traditional array, however, this does not force an immediate full rebuild of the drive.
The whole process is managed by the ISE controllers, which know that the drive is going away and coming back, so only a small amount of data has to be rebuilt once the drive has been power cycled. If none of this fixes the drive, the system performs in-situ remanufacturing: it reformats the drive to zeros and identifies the particular parts of the drive that are defective.
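The escalation described above (bus reset, then power cycle, then in-situ remanufacturing) can be pictured as a simple ladder that stops at the first step that brings the drive back. The sketch below is purely illustrative: the step names, the `recover_drive` function and the `is_healthy_after` callback are hypothetical stand-ins, not X-IO's firmware or API.

```python
from typing import Callable, List, Tuple

# Hypothetical step names; only their order comes from the article.
ESCALATION_STEPS = ["bus_reset", "power_cycle", "in_situ_remanufacture"]


def recover_drive(is_healthy_after: Callable[[str], bool]) -> Tuple[bool, List[str]]:
    """Try each recovery step in order and stop at the first that works.

    is_healthy_after(step) stands in for whatever health check a
    controller would run after applying that step.
    """
    attempted: List[str] = []
    for step in ESCALATION_STEPS:
        attempted.append(step)
        if is_healthy_after(step):
            return True, attempted
    # Every step was tried and the drive is still unhealthy.
    return False, attempted


if __name__ == "__main__":
    # Simulated drive that recovers once it has been power cycled.
    recovered, steps = recover_drive(lambda step: step == "power_cycle")
    print(recovered, steps)
```

The point of the ladder is that the cheapest, least disruptive action is always tried first, and a full remanufacture happens only when the lighter steps have failed.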
One of the biggest changes from normal drive enclosures is that the drives in the ISE sit in a sealed Datapac. The Datapac is held in the enclosure by a small servo, which prevents the drives from being removed while in use unless the controllers have released it.
Individual drives no longer have to be replaced. Each Datapac contains 20 percent spare capacity spread across all of its drives, which is available to absorb any failures that occur within the ISE’s lifespan.
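To get a feel for what 20 percent spare capacity means, here is a rough back-of-the-envelope sketch. The drive count and drive size are made-up example figures, and treating the 20 percent as a share of raw capacity is an assumption; only the 20 percent figure itself comes from the text above.

```python
def spare_capacity_gb(drive_count: int, drive_size_gb: int, spare_percent: int = 20) -> int:
    """Spare capacity in GB, assuming the percentage applies to raw capacity."""
    raw_gb = drive_count * drive_size_gb
    return raw_gb * spare_percent // 100


def whole_drives_covered(drive_count: int, drive_size_gb: int, spare_percent: int = 20) -> int:
    """How many complete drive losses the spare pool could absorb."""
    return spare_capacity_gb(drive_count, drive_size_gb, spare_percent) // drive_size_gb


if __name__ == "__main__":
    # Hypothetical Datapac: 10 drives of 1000 GB each.
    print(spare_capacity_gb(10, 1000))     # 2000
    print(whole_drives_covered(10, 1000))  # 2
```

Because the ISE can retire parts of a drive rather than whole drives, the same pool would stretch across many more small failures than this whole-drive figure suggests.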
One reason drives fail when they are powered off after running for years is that they lose their calibration once they are stopped and moved around.
X-IO has built in technology to address this. If a drive is exhibiting odd behavior, the ISE chassis can load Seagate’s diagnostic firmware onto it, fully recalibrate everything about that drive on the fly (head fly height, for example), and then place it back into production when it is done.
The “Bathtub Curve”
While no one can prevent the bathtub curve, the ISE technology addresses it by failing components at a very granular level rather than failing an entire drive. In a normal drive enclosure, if one small part of a drive starts to exhibit problems, the entire drive has to be decommissioned.
Within an individual drive, an ISE can retire specific sectors, heads or platter surfaces without retiring the entire drive. When something does go wrong, only a small portion of the drive is lost, so a single drive can live much longer within the Datapac.
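The difference between retiring a whole drive and retiring one surface is easy to see with a toy model. The `Drive` class below, its field names and the 8-surface, 125 GB-per-surface figures are all hypothetical; real drives and the ISE's internal bookkeeping are certainly more involved.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class Drive:
    """Toy model of a drive as a set of equally sized recording surfaces."""
    surfaces: int
    gb_per_surface: int
    retired: Set[int] = field(default_factory=set)

    def retire_surface(self, index: int) -> None:
        """Mark one surface as unusable, leaving the rest in service."""
        if not 0 <= index < self.surfaces:
            raise ValueError(f"no such surface: {index}")
        self.retired.add(index)

    @property
    def usable_gb(self) -> int:
        return (self.surfaces - len(self.retired)) * self.gb_per_surface


if __name__ == "__main__":
    drive = Drive(surfaces=8, gb_per_surface=125)  # hypothetical 1000 GB drive
    drive.retire_surface(3)
    print(drive.usable_gb)  # 875, versus 0 if the whole drive were pulled
```

In this example a single bad surface costs one eighth of the drive instead of all of it, which is the effect the granular approach is after.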
In addition to minimizing vibration, the drives are mounted for optimal cooling. In a traditional drive enclosure, all of the drives go across the front of the enclosure into a backplane that is parallel with the front and restricts airflow. The backplane for the drives in the ISE is instead perpendicular to the front, which allows air to flow straight through the enclosure, over the top and bottom of the drives, to maintain optimal temperatures and prevent hotspots or dead air.
The fans in the ISE are also of aeronautic quality and pull air through the enclosure rather than pushing it, allowing them to move more air more efficiently without the need for dozens of fans. Additionally, each ISE enclosure contains redundant active/active controllers with dual power supplies and supercapacitor-backed intelligent cache. This means that within any enclosure, a large number of failures would have to occur before the system itself failed.
No system is perfect, and the ISE is no exception. Each ISE reports back to X-IO daily with telemetry data for the day as well as the overall health of the system. This allows X-IO to proactively notify customers if components are degraded or in a failing condition. In the rare situation that a Datapac approaches the point of needing replacement, a new one can be shipped proactively and data mirrored off the deteriorating Datapac before there is any production impact.