SQL Man of Mystery

Wes Brown is a PASS chapter leader and SQL Server MVP. He writes for SQL Server Central and maintains his blog at http://www.sqlserverio.com. Wes is currently serving as a Senior Lead Consultant at Catapult Systems. Previous experience includes serving as Product Manager for SQL Litespeed by Quest Software and consulting for Fortune 500 companies. He specializes in high availability, disaster recovery, and very large database performance tuning. He is a frequent speaker at local user groups and SQLSaturdays.

Fundamentals of Storage Systems – RAID, An Introduction

In previous articles, we have covered the system bus, host bus adapters, and disk drives. Now we will move up the food chain and take a look at getting several disks to operate as one.

In 1988, David A. Patterson, Garth Gibson, and Randy H. Katz authored a seminal paper, A Case for Redundant Arrays of Inexpensive Disks (RAID). The main concept was to use off-the-shelf commodity hardware to provide better performance, better reliability, and a much lower price point than the then-current generation of storage. Even in 1988, we already knew that CPUs and memory were outpacing disk drives. To address this, Dr. Patterson and his team laid out the fundamentals of our modern RAID structures almost completely; RAID levels 1 through 5 all come directly from this paper. There have been improvements in the error checking since then, but the principles are the same. In 1993, Dr. Patterson and his team released a follow-up paper covering RAID 6.

 

Each RAID level below is listed with its description, minimum disk requirement, and usable capacity. N is the number of disks in the array (for RAID 50 and RAID 60, N is the number of disks per RAID 5 or RAID 6 sub-array and R is the number of sub-arrays).

RAID 0 (minimum disks: 2; usable capacity: N)

RAID 0 is striping without parity. Technically it is not a redundant array of disks, just an array of disks, but it is lumped in because it shares some of the same technical underpinnings, and other hybrid RAID solutions use RAID 0 to join other RAID arrays together. Each disk in the array holds data and no parity information. Without parity to calculate, there are no penalties on reads or writes, making this the fastest of all the RAID configurations. It is also the most dangerous: one drive failure means you lose all your data. I don't recommend using RAID 0 unless you are 100% sure that losing all your data is completely OK.

RAID 1 (minimum disks: 2; usable capacity: N/2)

RAID 1 mirrors two disks. Writes go to both disks simultaneously, and you can lose one disk and still operate. Some controllers allow you to read data from both disks; others return only the data from the disk that delivers it first. Since there are no parity calculations, it is generally the easiest RAID level to implement. Duplexing is another form of RAID 1 in which each disk has its own controller.

RAID 5 (minimum disks: 3; usable capacity: N-1)

RAID 5 is a striped array with distributed parity. It is similar to RAID 0 in that all data is striped across all available disks; where it differs is that the equivalent of one drive's worth of space holds parity information, distributed across the drives. If a drive fails, the data it contained is recreated on the fly using the parity data from the other drives. More than one disk failure means total data loss. The more drives you have in a RAID 5 array, the greater the risk of a second disk failure during the rebuild that follows the first failure. The general recommendation at this time is 8 drives or fewer, and the larger the drives, the fewer of them you should have in a RAID 5 configuration, because of the rebuild time and the likelihood of a second drive failure. (A minimal sketch of the parity idea follows the table below.)

RAID 6 (minimum disks: 4; usable capacity: N-2)

RAID 6 is a striped array with dual distributed parity. Like RAID 5, it is a distributed block system, but with two parity stripes instead of one. This allows you to sustain the loss of two drives, dramatically reducing the risk of a total stripe failure during a rebuild operation. Also known as P+Q redundancy, it uses Reed-Solomon coding and isn't practical to implement in software due to the math-intensive calculations needed to write parity data to two different stripes. The current recommendation is to use 8 drives or more.

RAID 10 (minimum disks: 4; usable capacity: N/2)

RAID 10 is a hybrid, or nested, scheme that combines RAID 1 mirrors with a RAID 0 stripe. It is intended for high-performing, fault-tolerant systems. Like RAID 1, you lose half your available space. You could lose up to N/2 drives and still have a functioning array, as long as no mirrored pair loses both of its members. Duplexing each mirror between two drive chassis is common, so you could lose a whole drive chassis and still function. The absence of parity means write speeds are high. Along with excellent redundancy, this is probably the best option for speed and redundancy.

RAID 0+1 (minimum disks: 4; usable capacity: N/2)

RAID 0+1 is not interchangeable with RAID 10. There is one huge difference, and that is reliability: you can only be certain of surviving one drive failure, and the more drives in a single RAID 0 stripe, the greater the chance you take. Speed characteristics are identical to RAID 10. I have never implemented RAID 0+1 when RAID 10 was available.

RAID 50 (minimum disks: 6; usable capacity: (N-1)*R)

Since RAID 5 becomes more susceptible to failure as you add drives, keeping each RAID 5 stripe small, usually under 8 drives, and then striping those arrays together with RAID 0 increases reliability while still allowing you to expand capacity. You lose one drive's worth of capacity per RAID 5 sub-array, but that is a lot less than losing half of them as in RAID 10. Before RAID 6, this was the way to get higher reliability in very large arrays of disks.

RAID 60 (minimum disks: 8; usable capacity: (N-2)*R)

RAID 60 is the same concept as RAID 50. A RAID 6 array is generally much less susceptible to failure during the rebuild of a failed drive because of the dual parity it uses. It still is not bulletproof, but RAID 6 sub-arrays can grow much larger than RAID 5 sub-arrays before the probability of a second drive failure, and then a failure during rebuild, becomes a concern. I do not see many RAID 60 configurations outside of SAN internal striping schemes. You do lose twice as many drives' worth of capacity as you do in a RAID 50 array.

RAID 100 (minimum disks: 8; usable capacity: N/2)

RAID 100 is RAID 10 with an additional RAID 0 stripe on top. Bridging multiple drive enclosures is its most common use. It also reduces the number of logical drives you have to maintain at the OS level.
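To make the parity idea behind RAID 5 concrete, here is a minimal Python sketch. It is an illustration only, not how a real controller works; controllers operate on disk blocks in hardware and rotate the parity across drives. It shows how a lost data block can be rebuilt from the surviving blocks plus the XOR parity block.

    from functools import reduce

    def parity(blocks):
        """Byte-wise XOR of equal-sized blocks; this is the RAID 5 parity block."""
        return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

    def rebuild(surviving_blocks, parity_block):
        """Recreate one lost block by XOR-ing the survivors with the parity block."""
        return parity(surviving_blocks + [parity_block])

    # Three data "drives" holding one 8-byte stripe unit each, plus one parity unit.
    d1, d2, d3 = b"SQLserv1", b"SQLserv2", b"SQLserv3"
    p = parity([d1, d2, d3])

    # Lose the second drive, then rebuild its contents from the rest of the stripe.
    assert rebuild([d1, d3], p) == d2

The same XOR trick is why a second failure is fatal: with two blocks missing from a stripe, the remaining blocks plus parity no longer determine the lost data.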

Speed, Fault Tolerance, or Capacity?

You can't have your cake and eat it too. In the past, it was hard to justify the cost of RAID 10 unless you really needed both speed and fault tolerance. RAID 5 was the default because in most situations it was good enough, offering near-RAID 0 read speeds. If you had a heavy write workload, though, you took a penalty due to the parity stripe, and RAID 6 suffers from this even more with two parity stripes to deal with. Today, with the cost of drives coming down and capacities going up, RAID 10 should be the default configuration for everything.
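As a rough illustration of the capacity side of that trade-off, here is a short sketch (my own, using the usable-disk formulas from the table above) comparing usable space for the same set of drives. The drive count and size are example numbers.

    # Usable capacity per the formulas above: N for RAID 0, N/2 for RAID 1/10,
    # N-1 for RAID 5, N-2 for RAID 6.
    def usable_tb(level, disks, drive_tb):
        usable_disks = {
            "RAID 0": disks,
            "RAID 1": disks / 2,
            "RAID 10": disks / 2,
            "RAID 5": disks - 1,
            "RAID 6": disks - 2,
        }[level]
        return usable_disks * drive_tb

    # Example: 8 x 2 TB drives.
    for level in ("RAID 10", "RAID 5", "RAID 6"):
        print(level, usable_tb(level, 8, 2), "TB usable")  # 8.0, 14, and 12 TB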

Here is a breakdown of how each RAID level handles reads and writes in order of performance.

 

RAID 0
Writes: 1 operation. High throughput, low CPU utilization; no data protection.
Reads: 1 operation. High throughput, low CPU utilization.

RAID 1
Writes: 2 I/O operations. Only as fast as a single drive.
Reads: 1 I/O operation. Two read schemes are available: read data from both drives, or take the data from whichever drive returns it first. One gives higher throughput, the other faster seek times.

RAID 5
Writes: 4 I/O operations. Read-modify-write requires two reads and two writes per write request. Lower throughput and higher CPU use if the HBA doesn't have a dedicated I/O processor.
Reads: 1 I/O operation. High throughput and low CPU utilization normally; in a failed state, performance falls dramatically due to parity calculations and any rebuild operations that are going on.

RAID 6
Writes: 6 I/O operations. Read-modify-write requires three reads and three writes per write request. Do not use a software implementation if a hardware one is available.
Reads: 1 I/O operation. High throughput and low CPU utilization normally; in a failed state, performance falls dramatically due to parity calculations and any rebuild operations that are going on.

 

Choosing your RAID level

This is not as easy as it should be. Between budgets, different storage types, and your requirements, any of the RAID levels could meet your needs. Let us work off some base assumptions. Reliability is necessary; that rules out RAID 0 and probably RAID 0+1. Is the workload read or write intensive? A good rule of thumb is that if more than 10% of the workload is writes, go RAID 10. In addition, if write latency is a factor, RAID 10 is the best choice. For read workloads, RAID 5 or RAID 6 will probably meet your needs just fine. Another thing to take into consideration: if you need lots of space, RAID 5 or RAID 6 may meet your IO needs through sheer number of disks. Take the number of disks, divide by 4 for RAID 5 or by 6 for RAID 6, then do your per-disk IO calculations; you may find that they do meet your IO requirements.
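A quick sketch of that calculation, using the write penalties from the read/write table above (2 back-end I/Os per write for RAID 1/10, 4 for RAID 5, 6 for RAID 6). The per-disk IOPS figure and the workload mix are assumed example numbers, not measurements.

    WRITE_PENALTY = {"RAID 0": 1, "RAID 1": 2, "RAID 10": 2, "RAID 5": 4, "RAID 6": 6}

    def effective_iops(level, disks, per_disk_iops, write_fraction):
        """Approximate host-visible IOPS once the RAID write penalty is paid."""
        raw = disks * per_disk_iops
        # Reads cost one back-end IO each, writes cost the level's penalty.
        return raw / ((1 - write_fraction) + write_fraction * WRITE_PENALTY[level])

    # Example: 8 drives at ~180 IOPS each, 30% writes.
    for level in ("RAID 10", "RAID 5", "RAID 6"):
        print(level, round(effective_iops(level, 8, 180, 0.30)), "IOPS")
        # Roughly 1108, 758, and 576 IOPS respectively.

Compare the result against your measured workload before committing to the parity levels; the heavier the write mix, the faster RAID 5 and RAID 6 fall behind.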

Separate IO types!

The type of IO, random or sequential, greatly affects your throughput. SQL Server's IO patterns are fairly well documented. One of the big things folks overlook is keeping their log files separate from their data files. I am not talking about all logs on one drive and all data on another, which buys you nothing; if you are going to do that, you might as well put them all on one large volume and use every disk available, because you are guaranteeing that all IOs will be random. If you want to avoid this, you must separate your log files from data files AND from each other. If the log file of a busy database shares a drive with other log files, you reduce its IO throughput three-fold and its data throughput 10 to 20 fold.

 

RAID Reliability and Failures


Correlated Disk Failures


Disks from the same manufacturing batch can suffer a similar fate. Correlated disk failures can come from a manufacturing defect that affects a large number of drives. It can be very difficult to get a vendor to give you disks from different batches, so your best bet is to hedge against this and plan the structure of your RAID arrays accordingly.

Error rates and Mean Time Between Failures

As hard disks get larger, the chance of an uncorrectable and undetected read or write failure goes up. On a desktop-class drive, the quoted rate is one unrecoverable error per 10^14 bits read. To put that in perspective, an array of the latest two-terabyte SATA drives would hit this error in roughly one full pass of a 6-drive RAID 5 array. When this happens it triggers a rebuild event, and the probability of hitting another failure during the rebuild is extremely high. Bianca Schroeder and Garth A. Gibson of Carnegie Mellon University have written an excellent paper on the subject. Read it; it will keep you up at night worrying about your current arrays. Enterprise-class drives are supposed to protect against this, although no study so far proves that out. That does not mean I am swapping out my SAS for SATA; performance is still king. They do boast a much better quoted error rate, 10^16, or 100 times better, though whether that number is accurate is another question altogether. Google also did a study on disk failure rates, Failure Trends in a Large Disk Drive Population, and also found correlated disk failures, among other things. It is necessary reading as well. Eventually, RAID 5 just will not be an option, and RAID 6 will be where RAID 5 is today.
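To put a rough number on that claim, here is a back-of-the-envelope sketch of the math (my own calculation, assuming the quoted one-error-per-10^14-bits rate and independent bit errors, which real drives only approximate):

    import math

    # Probability of hitting at least one unrecoverable read error (URE) while
    # reading every surviving drive in full during a RAID 5 rebuild.
    def rebuild_ure_probability(surviving_drives, drive_bytes, bits_per_error=1e14):
        bits_to_read = surviving_drives * drive_bytes * 8
        # 1 - (1 - 1/rate)^bits, computed with log1p/expm1 to stay numerically sane.
        return -math.expm1(bits_to_read * math.log1p(-1 / bits_per_error))

    # A 6-drive RAID 5 of 2 TB drives that lost one member: the rebuild reads
    # the 5 survivors end to end. For desktop-class drives this is roughly 55%.
    print(rebuild_ure_probability(5, 2e12))

    # The same rebuild at the enterprise 10^16 rate drops to under 1%.
    print(rebuild_ure_probability(5, 2e12, bits_per_error=1e16))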

What RAID Does Not Do

RAID doesn't back your data up. You heard me: it is not a replacement for a real backup system. Write errors do occur. As database people, we are aware of atomic operations, the concept of an all-or-nothing operation, and recovering from a failed transaction. People assume the file system and disk are also atomic; they aren't. NTFS does have a transactional system now (TxF), but I doubt SQL Server is using it. Disk drives limit data transfer guarantees to the sector size of the disk, 512 bytes. If you have the write cache enabled and suffer a power failure, it is possible to write only part of an 8 KB page. If this happens, SQL Server will read a mix of new and old data from that page, which is now in an inconsistent state. This is not a disk failure: the drive wrote every 512-byte sector it could successfully, and when it comes back online the data on the disk is not corrupted at the sector level at all. If you have turned off torn page detection or page checksums because you believe they are a huge performance hit, turn them back on. Add more disks if you need the extra performance; don't put your data at risk.
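Here is a minimal illustration of why a page checksum catches exactly this failure. It is a sketch of the idea only; the hash and the page layout are assumptions for the example, not SQL Server's actual page checksum algorithm.

    import hashlib

    SECTOR, PAGE = 512, 8192  # an 8 KB page spans sixteen 512-byte sectors

    def page_checksum(page):
        """Stand-in checksum; SQL Server uses its own, much cheaper, algorithm."""
        return hashlib.md5(page).hexdigest()

    old_page = bytes([0xAA]) * PAGE          # page contents before the write
    new_page = bytes([0xBB]) * PAGE          # page contents being written
    stored_checksum = page_checksum(new_page)

    # Power failure after only 3 of the 16 sectors made it to disk: a torn page.
    torn_page = new_page[:3 * SECTOR] + old_page[3 * SECTOR:]

    # On the next read the checksum no longer matches, so the damage is detected.
    assert page_checksum(torn_page) != stored_checksum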

Final Thoughts

  1. Data files tend to see random reads and writes.
  2. Log files normally have zero random reads and writes.
  3. More than one active log on a drive equals random reads and writes.
  4. Use RAID 1 for logs, or RAID 10 if you need the space.
  5. Use RAID 5 or RAID 6 for data files if capacity and read performance are more important than write speed.
  6. The more disks you add to an array, the greater your chance of data loss.
  7. RAID 5 offers very good reliability at small scale. Rule of thumb: more than 8 drives in a RAID 5 could be disastrous.
  8. RAID 6 offers very good reliability at large scale. Rule of thumb: with fewer than 9 drives you should consider RAID 5 instead.
  9. RAID 10 offers excellent reliability at any scale but is susceptible to correlated disk failures.
  10. The larger the disk drives, the fewer of them you should have per array.
  11. Turn on torn page detection for SQL Server 2000 and page checksums for 2005/2008.
  12. Restore backups regularly.
  13. RAID isn't a backup solution.

Up next: a deeper look into IO types, stripe size, disk alignment, and file system cluster sizes.

Comments

Posted by Anonymous on 7 December 2009

Pingback from  SQL Server Central Debt on Me

Posted by Dugi on 8 December 2009

Nice and simple introduction, thanks for sharing!

Posted by ta.bu.shi.da.yu on 15 December 2009

Awesome! Learned about 2 new RAID types I didn't know about before.

Posted by Matt Whitfield on 15 December 2009

I don't believe your descriptions of RAID 10 and RAID 0+1 are entirely accurate. With both array types, you can lose one disk for certain. If you lose two disks in either configuration, then it could be a total loss, depending on which was the second disk that failed. In your diagrams, losing both the left disks on RAID 10 or the first and third disk in RAID 0+1 would both lead to a total loss...

Posted by Wesley Brown on 15 December 2009

Matt,

You are correct, there is a scenario where RAID 10 is just as likely to lose data as RAID 0+1. It comes down to correlated disk failure and how you have the disks laid out. If you have two drive shelves, you can interleave the mirrored pairs between each shelf, reducing the likelihood of two failures happening in such a way as to cause stripe failure. It is still considered much more redundant than any other RAID scheme on the market. With RAID 5 and RAID 0+1, two failures always result in a failed stripe.

Posted by bwillsie-842793 on 15 December 2009

Your discussion of correlated disk failures is important.  I saw roughly 20 drives from the same manufacturing batch all fail within a two week period although they were operating in different machines.

I wonder if it is even possible to get drives from more than just a couple of batches at a time?

Posted by Wesley Brown on 15 December 2009

I wrote a more in-depth post on failures.

www.sqlservercentral.com/.../fundamentals-of-storage-systems-raid-and-hard-disk-reliability-under-the-covers.aspx

As important as RAID is, knowing what to expect on failure is critical.

I've tried to get mixed batches, and for small companies it is impossible. When I was at a much larger company and the order lead time was far enough out, we would get multiple batches and have them mixed into the SANs. Then you get Google-large, where you are the batch and are back in the same boat again. But they have their own ways of dealing with that issue :)

Posted by dld on 15 December 2009

I believe the paper "A Case for Redundant Arrays of Inexpensive Disks" was written in 1988, not 1998.

Posted by Wesley Brown on 15 December 2009

Fixed, right in one place wrong in another :)

Posted by TimothyAWiseman on 15 December 2009

One of the most concise and useful summaries of RAID usage I have seen yet. Thank you.

Posted by Brian Schaefer on 15 December 2009

I'm confused about your recommendation under Separate IO types. With enterprise-level database servers, the number of DBs is in the hundreds, at least. So you're saying that unless we have a separate drive for each DB log file, we are actually impeding disk/log I/O? So the recommended setup would be a separate disk for all data files, and then a separate one for each DB log file? Sounds good on paper and in theory.

Posted by moorer85 on 15 December 2009

Great job. Very nice document.

Posted by Wesley Brown on 15 December 2009

Brian,

Close.

Logs separate from other logs and data. Since your data files are generally accessed randomly, grouping them all together doesn't change the IO pattern; random is random.

I will separate out critical logs onto their own drives if they need as much throughput as possible. The only reason for separating logs and data otherwise is protection from drive failures, not putting all your eggs in one basket, so to speak.

The IO requirements don't magically change just because you have put all the DBs on one server. If you have planned to have all your logs on the same set of drives, then you have factored in the extra amount of IO and throughput needed to keep your databases running smoothly.

Sometimes it won't make a difference, say if you are on a virtual server and are really sharing drives under the covers anyway. Or if you are on a SAN that implements a wide, thin striping method and shares all LUNs with all drives in a drive pool with no way to separate them out.

Posted by skutnar on 15 December 2009

Wesley, it looks like your avatar picture and RAID 100 picture got swapped.

Posted by skutnar on 15 December 2009

Never mind, it may have been my browser.  RAID 100 looks okay now.

Posted by Tom.Thomson on 6 February 2010

I feel the final thoughts have a tendency to push people towards RAID 5 more than is desirable. For example, final thought number 5 lumps RAID 5 and RAID 6 together and says to use them if capacity and read performance are more important than write performance; the trouble with that is that in the case of RAID 5 it should probably say "more important than write performance and availability", because you will lose a disc now and again, recovery will take a long time, and read performance will be very poor while recovery takes place, perhaps low enough to render your system effectively unavailable. I haven't done any detailed study of RAID 6 so I don't know whether the same applies there or not, but it seems to me that RAID 5 will almost always be the wrong decision unless your throughput requirement is sufficiently low that your system doesn't lose useful capacity during RAID recovery.

Posted by Wesley Brown on 6 February 2010

Tom,

I can see that. You can balance your rebuild rate against performance, but then the rebuild will take longer. Also, limiting the number of drives will decrease the time to rebuild the array.

I also agree that RAID 5 is almost always the wrong choice. I have been in situations where you really didn't have a choice in the matter either :)

Posted by Anonymous on 22 April 2010

Pingback from  cmdln.org (a sysadmin blog)  » Blog Archive   » Analyzing I/O performance in Linux

Posted by wodom on 26 May 2010

Hi, Wesley, excellent article! Best on RAID I've seen, anywhere. I have two questions:

1. In the section "Choosing Your Raid Level" it says:

Is the workload read or write intensive? A good rule of thumb is more than 10% reads go RAID 10. In addition, if write latency is a factor RAID 10 is the best choice. For read workloads, RAID 5 or RAID 6 will probably meet your needs just fine.

My guess is there's a typo in there, and you really meant to say "A good rule of thumb is more than 10% WRITES go RAID 10." Is that right?

2. You say "I also agree that RAID 5 is almost always the wrong choice." But you also say "Rule of thumb, less than 9 drives you should consider RAID 5 instead [of RAID 6]." So which one of these rules should "rule" if you have an array of 4 to 6 drives?

We're configuring a SAN with 15 drives for a handful of to-be-virtualized servers and we can mix & match various RAID groupings, so we're trying to decide how many groupings and what kind. Right now leaning toward a mix of a RAID 10 (for SQL DBs), maybe a RAID 1 for SQL logs, and maybe a couple of RAID 5s for other stuff.

Posted by Wesley Brown on 26 May 2010

1. Yep, typo. I'll fix it up.

2. If you have fewer drives but want to use striping, RAID 5 is the first choice. If it is more than 8 or 9 drives, RAID 6 should be your choice. The larger the drives, the smaller that number should get: if they are 1TB drives, 7 or 8 should be the max for RAID 5; if they are 3TB drives, then 4 or 5 should be the top end. The problem is that if you have a failure, the likelihood of hitting a second failure before the rebuild to a new drive completes is almost 100%. There is an upper bound for RAID 6 too, but it is much higher. With drive sizes growing at the rate they are, RAID 5 won't be an option much longer.

If you are planning on virtualization, loss of IO is a real issue; I'd stay with RAID 10 if possible. I generally figure a 20% to 30% loss in random IOs for virtualized databases.

Posted by wodom on 28 May 2010

OK, thanks. The wording of "almost always wrong" threw me off -- I've never worked with any RAID 5 array of more than about 7 or 8 drives. Guess I'm showing my smaller-system experience! And yes, we do intend to use RAID 10 for the main SQL database. With 15 drives we've had the luxury of configuring one RAID 10, three RAID 1's, and one RAID 5 for various different purposes.
