• In debating RAID 5 against RAID 10 performance for SQL we need, in my view, to distinguish between read and write performance

    For read performance:

    SQL read page requests (except for SQL read-ahead) are generally un-overlapped: the request waits until the read is satisfied before it can continue, and you must retrieve the top index level before the next index level, that level before the leaf index level, and the leaf level before the data row. In effect, poor read performance compounds down the access path. In partial mitigation, metadata and non-leaf index pages have a relatively higher hit rate than leaf index and data pages, since they tend to stay buffer or cache resident longer
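
    A back-of-envelope sketch of that compounding, with every latency and hit rate below assumed purely for illustration (none of these figures come from the discussion itself):

        # Rough illustration: a single-row lookup walks root -> intermediate ->
        # leaf index -> data page, and each level must complete before the next
        # can start, so any miss is paid serially at that level (figures assumed)
        buffer_hit_us = 1          # assumed cost of a SQL buffer pool hit, in microseconds
        physical_read_us = 5000    # assumed cost of a physical read under load

        # assumed per-level hit rates: non-leaf pages stay resident longer
        levels = [
            ("root index page", 0.999),
            ("intermediate index page", 0.99),
            ("leaf index page", 0.90),
            ("data page", 0.80),
        ]

        expected_us = sum(p * buffer_hit_us + (1 - p) * physical_read_us
                          for _, p in levels)
        print(f"expected single-row lookup latency: ~{expected_us:.0f} us")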

    RAID 10 may satisfy read requests faster than RAID 5: a competent RAID 10 implementation will allow access to the first available mirror copy. In effect, path contention can be halved by RAID 10, or equivalently the effective device bandwidth for reads is doubled. In a multi-user SQL environment this can be very significant. In some SAN environments path contention (rather than device contention) may be the major bottleneck, and the benefit of RAID 10 over RAID 5 may not be seen because of that constraint
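
    A toy queueing sketch of the contention point, treating each path as a simple M/M/1 server and assuming the RAID 10 implementation splits reads evenly across the two mirror copies (the arrival and service rates are assumed figures):

        # Same read load served by one copy versus split across two mirror copies
        arrival_per_s = 150.0      # assumed read requests per second to this device
        service_per_s = 200.0      # assumed reads per second one path/spindle can service

        def mm1_time_in_system_ms(arrivals, service):
            """Mean time in system for an M/M/1 queue, in milliseconds."""
            if arrivals >= service:
                return float("inf")
            return 1000.0 / (service - arrivals)

        single_copy = mm1_time_in_system_ms(arrival_per_s, service_per_s)
        mirrored = mm1_time_in_system_ms(arrival_per_s / 2, service_per_s)
        print(f"one copy: {single_copy:.1f} ms, two mirror copies: {mirrored:.1f} ms each")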

    A large device cache may, or may not, be an advantage to read performance. In most of the cached devices I have worked with recently, including SANs, the SQL buffer pool has been significantly larger than the device cache. In effect, any SQL buffer miss is almost by definition also a device cache miss (both use similar ageing algorithms).

    In my (biased) view, additional RAM allocated to the SQL buffer pool, rather than to a larger device cache, is far more effective until the device cache is several times larger than the SQL buffer pool. A SQL buffer hit may be on the order of 10,000 times faster than a physical read, while a device cache hit may be less than 100 times faster than a physical read under load (depending on implementation and path contention)
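
    To make those ratios concrete, a hedged sketch (the absolute figures are assumed) in which the device cache only ever sees the buffer pool misses:

        physical_read_us = 5000.0                     # assumed physical read time under load
        buffer_hit_us = physical_read_us / 10000      # buffer hit ~10,000x faster, per the ratio above
        device_cache_hit_us = physical_read_us / 100  # device cache hit ~100x faster, per the ratio above

        def expected_read_us(buffer_hit_rate, device_cache_hit_rate):
            """Expected cost of one SQL read; the device cache sees only buffer misses."""
            miss = 1.0 - buffer_hit_rate
            return (buffer_hit_rate * buffer_hit_us
                    + miss * device_cache_hit_rate * device_cache_hit_us
                    + miss * (1.0 - device_cache_hit_rate) * physical_read_us)

        # assumed hit rates: a modest buffer pool improvement beats the device cache
        print(f"98% buffer hits: ~{expected_read_us(0.98, 0.05):.0f} us per read")
        print(f"99% buffer hits: ~{expected_read_us(0.99, 0.05):.0f} us per read")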

    For write performance:

    Data page write requests are effectively overlapped. I can find no specific reason why writing of dirty data pages during a checkpoint should delay transaction processing, unless the flushing of pages is slow enough to affect the free page replenishment algorithm. In effect, if data page writes are sufficiently slow, they may cause free list stalls, with all the negative performance implications thereof. To the SQL heavies: if I have missed some aspect of data page writes, please let me know
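
    A toy model of that free-list argument (the rates, free-list size, and one-second granularity are all assumed; real checkpoint and lazy-writer behaviour is considerably more involved):

        def simulate_free_list(seconds, pages_needed_per_s, pages_flushed_per_s, free_pages=1000):
            """Count page requests that stall because dirty pages are not flushed fast enough."""
            stalled = 0
            for _ in range(seconds):
                free_pages += pages_flushed_per_s     # pages cleaned and returned to the free list
                free_pages -= pages_needed_per_s      # pages consumed by new reads/allocations
                if free_pages < 0:
                    stalled += -free_pages            # requests that had to wait for a flush
                    free_pages = 0
            return stalled

        print("fast flushing:", simulate_free_list(60, 500, 600), "stalled page requests")
        print("slow flushing:", simulate_free_list(60, 500, 400), "stalled page requests")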

    Log page write requests are not overlapped; an implicit or explicit transaction is not complete until the log pages reflecting that transaction are confirmed as written. The effect of poor write performance should not be cumulative, though, as it is for reads: a log flush may require multiple physical writes, but they can be viewed essentially as either a single request or as multiple independent requests

    RAID 5 page writes require cache hits (not buffer hits) on the page to be written and on the parity page, and possibly on all the pages in the parity set (some implementations verify the parity before writing). In mitigation, the page(s) can be retrieved in parallel, but the physical write is delayed by that retrieval. SQL log page writes are likely to have a very low cache hit rate (the log writes into empty pages rather than previously used ones), so it is the retrieval of the parity page, and possibly the parity set, that delays the physical write. The same delay of the physical write does not apply to RAID 10 implementations
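
    A minimal sketch of the classic RAID 5 small-write (read-modify-write) path, to show why the old data and old parity must be in cache before the new parity, and hence the physical writes, can be produced; real controllers differ in the details:

        def raid5_new_parity(old_data: bytes, old_parity: bytes, new_data: bytes) -> bytes:
            """new_parity = old_parity XOR old_data XOR new_data, byte by byte."""
            return bytes(p ^ od ^ nd for p, od, nd in zip(old_parity, old_data, new_data))

        # If old_data and old_parity are not already in the device cache (the usual
        # case for log writes into not-previously-written stripes), they must first
        # be read from disk; only then can the data and parity writes be issued.
        old_data   = bytes(16)              # previous contents of the block being updated
        old_parity = bytes(range(16))       # previous parity block for the stripe
        new_data   = bytes([0xFF] * 16)     # block being written
        print(raid5_new_parity(old_data, old_parity, new_data).hex())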

    RAID 5 page writes require the page and the parity page to be physically written. In some implementations the writes are not strictly in parallel, because recovery from failed writes requires a particular sequence of writes. Essentially the same is true for RAID 10 implementations, where the page and its mirror copy must both be physically written

    In all SQL-supported implementations the modified page, and the parity page where applicable, must be stored in non-volatile cache before the write can be posted as complete, regardless of where they were retrieved from. In almost all implementations I have come across, the non-volatile cache is significantly smaller than the volatile cache, and its size is not readily increased. I have yet to meet a non-volatile cache that is not frequently (constantly?) overrun under load, delaying write completion until space becomes available. In effect, write buffering reduces SQL write times, but in practice they remain significantly dependent on the physical write time of the underlying pages
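
    A rough sketch of why write buffering only partly hides the physical write time; the acknowledgement time, destage time, and overrun fractions below are all assumed:

        cache_ack_ms = 0.1          # assumed acknowledgement time while the NV cache has space
        physical_write_ms = 5.0     # assumed physical destage time per page under load

        def avg_write_ms(fraction_overrun):
            """Average acknowledged write time when some writes arrive to a full NV cache."""
            return (1 - fraction_overrun) * cache_ack_ms + fraction_overrun * physical_write_ms

        for f in (0.0, 0.2, 0.5):
            print(f"{f:.0%} of writes arrive to a full cache -> ~{avg_write_ms(f):.2f} ms per write")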

    What distinguishes a RAID 5 write from a RAID 10 write is that RAID 5 generally requires two entries in the non-volatile cache against a single entry for RAID 10; RAID 10 thus effectively doubles the usable size of the non-volatile cache and reduces contention for space in it, and that cache is in almost all RAID implementations a severely constrained resource. In addition, RAID 5 SQL log write time is significantly delayed by the retrieval of the parity page, which for the SQL log is seldom a device cache hit almost regardless of the device cache size. This can significantly compound the contention for non-volatile cache space
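
    A simple capacity sketch (the cache size is assumed): the same non-volatile cache holds roughly half as many in-flight page writes under RAID 5 as under RAID 10:

        nv_cache_pages = 8192                    # assumed non-volatile cache size, in pages
        entries_per_write = {"RAID 5": 2,        # data page entry + parity page entry
                             "RAID 10": 1}       # single entry, mirrored by the controller

        for raid_level, entries in entries_per_write.items():
            print(f"{raid_level}: room for ~{nv_cache_pages // entries} in-flight page writes")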

    Thus, in my view, a large SQL buffer pool, a RAID 10 implementation, and as large a non-volatile cache as I can get are very desirable

    In my view SSDs do not change this equation, except that physical write rates to SSDs are significantly greater than to traditional disks. However, PCIe-attached SSDs can be seen as a very large but relatively slow non-volatile device cache, which can benefit both reads and writes