RAID: What You Should Know, and Probably Don't.

RAID stands for Redundant Array of Independent Disks. It uses multiple hard disk drives to increase storage capacity (RAID 0), redundancy (RAID 1), or both (RAID 5, RAID 6, etc.) over that of a single disk. RAID is frequently used in network servers because of its increased storage and reliability.

For instance, RAID 0, commonly referred to as "Striping", is in its simplest form an array of two disks, each of which stores half the data. The array has twice the capacity of a single disk, but it contains no redundancy, so if one disk fails, the remaining disk doesn't contain any useful data. On the other hand, RAID 1, commonly referred to as "Mirroring", stores the same data on both disks, so that if one disk fails, the data is still safe as long as the other disk continues to operate properly.
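The capacity side of this trade-off can be written down in a few lines. The following is a minimal sketch, assuming equal-sized disks and the usual textbook capacity rules for each level; the function name `usable_capacity` and the minimum disk counts are illustrative choices, not something taken from a particular RAID controller.

```python
def usable_capacity(level: int, disks: int, disk_size_tb: float) -> float:
    """Return the usable capacity in TB of an array of equal-sized disks.

    Assumed rules (standard textbook figures):
      RAID 0: all disks hold data, no redundancy
      RAID 1: every disk holds the same data
      RAID 5: one disk's worth of space goes to parity
      RAID 6: two disks' worth of space goes to parity
    """
    minimum_disks = {0: 2, 1: 2, 5: 3, 6: 4}
    if level not in minimum_disks:
        raise ValueError(f"unsupported RAID level: {level}")
    if disks < minimum_disks[level]:
        raise ValueError(f"RAID {level} needs at least {minimum_disks[level]} disks")

    if level == 0:
        return disks * disk_size_tb
    if level == 1:
        return disk_size_tb
    if level == 5:
        return (disks - 1) * disk_size_tb
    return (disks - 2) * disk_size_tb  # RAID 6


if __name__ == "__main__":
    for level, disks in [(0, 2), (1, 2), (5, 3), (6, 4)]:
        print(f"RAID {level} with {disks} x 2 TB disks:",
              usable_capacity(level, disks, 2.0), "TB usable")
```

With two 2 TB disks, the sketch gives 4 TB for RAID 0 and 2 TB for RAID 1, which matches the "twice the capacity" versus "same data on both disks" description above.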

It is this redundancy that makes RAID so attractive for the storage of shared data. But it's important to understand that, while adding more disks increases the capacity, it also shortens the time you can expect to wait before disks in the array start to fail, which decreases the reliability.

In the first chart you can see a "Bell Curve" showing a typical probability distribution, in this case representing the number of disk failures that can be expected to occur during certain fixed periods of time. It's true that the failure of one disk tells you nothing about the reliability of another, just as getting "Heads" while flipping a coin doesn't increase the probability of getting "Tails" on the following flip. But if you accept that the disks in your array constitute a random subset of all disks of that manufacturer and model, then it's reasonable to presume that the reliability of those disks will closely match the reliability of all disks of that manufacturer and model, especially as the number of disks in the array gets large.

In a two-disk RAID 0 (Striping) array, all the data is lost when the first disk fails, but in a RAID 1 (Mirroring) array, the data is still safe until the second disk fails.

Predicted reliability of two disks

Adding a disk to an array, as with RAID 5, increases the capacity of the array while maintaining redundancy. Again, the data is safe until a second disk fails. However, the array now has more disks, each likely to fail at a time drawn from the same probability distribution, so the expected time between the first and second failure is slightly shorter for the three-disk array than for the two-disk array above.

Predicted reliability of three disks.

Similarly, a fourth disk decreases the reliability of a RAID 5 array because of the increased probability that a second disk will fail in an even shorter span of time.

Predicted reliability of four disks.

A fifth disk also increases the probability that a second disk will fail sooner, thereby decreasing the reliability of the array. As the number of disks in an array goes up, the incidence of a second disk failure within the array moves farther and farther to the left on the probability 'Bell Curve', showing a consistent decrease in the reliability of the array.

Predicted reliability of five disks.

Comparing the following chart to the previous one shows that arrays with many disks are quite a bit less reliable than arrays with fewer disks.

Predicted reliability of nine disks.

Clearly, the more disks you put in an array, the less reliable it is.

This presumes that data is lost when a second disk fails. There are ways around that, of course. One is RAID 6 which uses an additional disk for double redundancy, so that data isn't lost until a third disk fails. But again, the more disks in the array, the sooner a third disk will fail.
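The trend described in the charts can be checked with a small simulation. The sketch below assumes that each disk's lifetime is drawn independently from a bell curve with a mean of 5 years and a standard deviation of 1.5 years; those figures, the function name `mean_time_to_kth_failure`, and the trial count are illustrative assumptions, not measurements of any real disk model. It estimates the average time until the second failure (the point of data loss for RAID 1 or RAID 5) and the third failure (RAID 6).

```python
import random


def mean_time_to_kth_failure(disks: int, k: int, trials: int = 20000,
                             mean_years: float = 5.0, sd_years: float = 1.5,
                             seed: int = 1) -> float:
    """Estimate the average time, in years, until the k-th disk in an array fails.

    Lifetimes are sampled from an assumed normal distribution and clamped
    at zero, since a disk cannot fail before it is installed.
    """
    if not 1 <= k <= disks:
        raise ValueError("k must be between 1 and the number of disks")
    rng = random.Random(seed)  # fixed seed so repeated runs give the same result
    total = 0.0
    for _ in range(trials):
        lifetimes = sorted(max(0.0, rng.gauss(mean_years, sd_years))
                           for _ in range(disks))
        total += lifetimes[k - 1]  # the k-th earliest failure in this trial
    return total / trials


if __name__ == "__main__":
    for n in (2, 3, 4, 5, 9):
        second = mean_time_to_kth_failure(n, 2)
        line = f"{n} disks: second failure after about {second:.2f} years"
        if n >= 4:
            third = mean_time_to_kth_failure(n, 3)
            line += f", third after about {third:.2f} years"
        print(line)
```

Under these assumptions the average time to the second failure falls as disks are added, and the third failure always comes later than the second but follows the same downward trend. The simulation ignores replacement of failed disks, so it illustrates the shape of the argument rather than predicting the life of a maintained array.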

The primary implication of this is that, when a disk fails, the system administrator needs to be "Johnny-on-the-spot" to replace the failed disk so the RAID Controller can rebuild the failed disk's data onto the replacement. A better option is to install an extra disk as a 'Hot Spare' that the RAID Controller can use to automatically replace the failed disk as soon as the failure happens. This is critical because the replacement disk must be a fully functional part of the array before a second disk fails, or all data will be lost.
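To get a feel for why the delay matters, the risk during the exposed period can be roughly estimated. The sketch below makes a simplifying assumption that differs from the bell curve used in the charts: each surviving disk fails at a constant rate given by a hypothetical MTBF figure of 1,000,000 hours. The function names, the MTBF value, and the example hours are all illustrative.

```python
import math


def exposure_hours(rebuild_hours: float, replacement_delay_hours: float = 0.0) -> float:
    """Time the array runs without redundancy: waiting for a disk plus rebuilding."""
    return replacement_delay_hours + rebuild_hours


def second_failure_probability(surviving_disks: int, hours: float,
                               mtbf_hours: float = 1_000_000.0) -> float:
    """Chance that at least one surviving disk fails within the given hours.

    Assumes independent disks with a constant failure rate of 1 / mtbf_hours.
    """
    combined_rate = surviving_disks / mtbf_hours
    return 1.0 - math.exp(-combined_rate * hours)


if __name__ == "__main__":
    survivors = 4          # a five-disk array that has just lost one disk
    rebuild = 10.0         # assumed rebuild time in hours
    hot_spare = second_failure_probability(survivors, exposure_hours(rebuild))
    manual = second_failure_probability(survivors, exposure_hours(rebuild, 48.0))
    print(f"with hot spare:       {hot_spare:.6%}")
    print(f"with a two-day delay: {manual:.6%}")
```

Whatever numbers are plugged in, the probability grows with both the number of surviving disks and the length of the exposed period, which is the case for a hot spare in a nutshell.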

When planning a RAID implementation, you need to compromise between a few large disks, where rebuilding onto a replacement takes longer, and a greater number of smaller disks, where each rebuild is quicker but, because there are more disks, a second one might fail sooner.

© 2013 T C Solutions, Inc.