• A good introduction to some aspects of storage, but marred by a Fusion-io/PCIe SSD-specific perspective, generalizations without supporting evidence, and several serious omissions: no discussion of RAID levels in modern systems; no discussion of OS-presented mount points vs. logical drives vs. LUNs vs. RAIDsets/virtual drives vs. spindles, or even the very critical dedicated vs. shared spindle approach; no mention of hot spares; and, as far as I could tell, no mention of shared SAN backbone limitations.

    Note that on the storage front, there are modern 2U, 4U, and tower servers that support 20 to 30 or more local spindles each (2.5", of course), with a mix of 15K RPM, 10K RPM, 7.2K RPM, and SSD drives. These give us new options for high-IOPS, high-throughput SQL Servers, in addition to the PCIe SSD front.

    Note that with SAS and SATA SSDs, either local or on the SAN, you have the option of all the normal RAID levels - 1, 10, 5, 50, 6, 60, etc. With PCIe SSDs, the last I heard for both OCZ and Fusion-io cards was that you were limited to software RAID at this time. It's generally held that software RAID is inferior to hardware RAID; that may or may not still be true on the most modern server operating systems. I haven't bothered to try software RAID; I stick with hardware RAID on caching controller cards, as do the storage professionals I work with.

    Unsupported generalization: "... not a commodity piece of hardware... However 128 GB RAM for the SAN would cost a £six figure sum!"

    Reference for EMC CLARiiON systems: http://www.pinncomp.com/pdf/technical/compellent/emc_product_analysis_cx4.pdf, which lists DDR2 DIMMs as the RAM. DDR2 is commodity hardware, even in ECC variants (it's what we use in our servers as well), and I've bought hundreds of gigabytes at a time for far, far less than six figures USD (and used it in SQL Servers). Unless someone can provide references showing that third-party replacement memory for the SAN (i.e. without as much price gouging as the vendors may build into their replacement-part MSRPs) really costs that much, I don't believe this claim is true in 2012.

    RAID levels: Conventional wisdom is that RAID 1 and RAID 10 are better for writes (i.e. one log file per RAID set) and RAID 5 is good for reads (less wasted storage). On modern caching controllers and/or SAN hardware from the last couple of years, my benchmarking has shown this no longer quite holds; see my results in my post at http://www.sqlservercentral.com/Forums/FindPost1293225.aspx. On my particular setup, RAID 10 had an advantage over RAID 5 and RAID 50 only on 8KB and 64KB random (not sequential) writes, and was equivalent or worse on other operations. Test your own setup carefully, whether SAN or local - many setups have quirks in one specific aspect or another that you should take into account when planning what goes where and how it's configured (for instance, a sequential write throughput cap, or a severe performance problem with, say, 64KB random reads). Note that on some modern SANs, RAID 50 is extremely performant.
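
    As a complement to synthetic benchmarks, SQL Server's own I/O statistics show how each file actually behaves under the real workload on whatever RAID set it landed on. A minimal sketch (the DMV and catalog view are standard; what counts as "slow" latency on your hardware is your call):

        SELECT DB_NAME(vfs.database_id) AS database_name,
               mf.physical_name,
               vfs.num_of_reads,
               vfs.num_of_writes,
               -- average stall per I/O in milliseconds; counters are cumulative since the service started
               1.0 * vfs.io_stall_read_ms  / NULLIF(vfs.num_of_reads, 0)  AS avg_read_latency_ms,
               1.0 * vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0) AS avg_write_latency_ms
        FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
        JOIN sys.master_files AS mf
          ON mf.database_id = vfs.database_id
         AND mf.file_id = vfs.file_id
        ORDER BY avg_read_latency_ms DESC;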

    Perhaps the most critical oversight in the article (or in my reading of it) was the failure to discuss the path from SQL Server data files down to the storage spindles (or parts thereof), and the dedicated vs. shared spindle argument.

    For example (I'm going to skip subdirectory-level mount points, but be aware they exist), on your SQL Server you see:

    Production O:\userdb.mdf

    Production V:\userdb.ldf

    Development E:\tempdb.mdf and E:\tempdb.ldf

    The SAN admin tells you:

    O: maps to LUN 5

    V: maps to LUN 71

    E: maps to LUN 6

    Unless you ask further, you may not hear that:

    LUN 5 maps to RAIDset 12

    LUN 71 maps to RAIDset 13

    LUN 6 maps to RAIDset 13

    Corporate file share \\server\MainShare maps to RAIDset 12

    Then, you may still have to ask to find out:

    RAIDset 12 is a 14 disk RAID5 (1x13+1)

    RAIDset 13 is an 8 disk RAID50 (2x3+1)
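
    Note that SQL Server itself can only show you the first link in that chain. A quick sketch of how to see which databases and files share which drive letters or mount points on a given instance (everything below that - LUN, RAIDset, spindles - you still have to get from the SAN admin):

        SELECT DB_NAME(database_id) AS database_name,
               type_desc,                               -- ROWS (data) or LOG
               physical_name,                           -- full path, including drive letter or mount point
               UPPER(LEFT(physical_name, 2)) AS drive   -- e.g. O:, V:, E:
        FROM sys.master_files
        ORDER BY drive, database_name;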

    Oddly enough, when they're backing up the corporate file share, all sequential activity slows down, and reads on the production userdb are particularly slow. Why? Because A) the SAN has limited total Fibre Channel bandwidth, and the backups are using up a lot of the available throughput, and B) the corporate file share is hitting the same spindles that userdb.mdf is on with a mix of random and sequential access.

    Even worse, when the development machine is doing a lot of heavy tempdb activity, writes on the production userdb are slow. Why? Because the development tempdb LUN is on the same spindles as production userdb.ldf.
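
    This kind of cross-LUN contention usually shows up inside SQL Server as I/O-related waits long before anyone admits the spindles are shared. A rough sketch of where to look (PAGEIOLATCH_* covers data file reads, WRITELOG covers log flushes; the counters are cumulative since the last restart or stats clear):

        SELECT wait_type,
               waiting_tasks_count,
               wait_time_ms,
               wait_time_ms / NULLIF(waiting_tasks_count, 0) AS avg_wait_ms
        FROM sys.dm_os_wait_stats
        WHERE wait_type IN ('PAGEIOLATCH_SH', 'PAGEIOLATCH_EX', 'WRITELOG', 'IO_COMPLETION')
        ORDER BY wait_time_ms DESC;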

    There's also a large difference between dedicated spindles (e.g. an 8 disk RAID5 for userdb.mdf, a 2 disk RAID1 for userdb.ldf, a 2 disk RAID1 for tempdb.mdf, a 2 disk RAID1 for tempdb.ldf, a 2 disk RAID1 for the OS and programs, a 3 disk RAID5 for the file share, a 2 disk RAID1 for the system DBs, a 2 disk RAID1 for the system log files, and 1 global hot spare) and shared spindles (e.g. a 23 disk RAID5 for everything, plus 1 global hot spare). With dedicated spindles, you can have heavy tempdb, OS, file share, and user log file activity all at once, and each will proceed almost as quickly (random or sequential) as it would if it were the only activity. With shared spindles, the maximum speed for any one activity will be much higher, and the "average" will look much better on paper... but on the first day of the new fiscal year, when all kinds of activity happens at once, don't be surprised if everything slows down quite a lot.

    Shared spindles are basically large sets of spindles set up in a "storage pool" that everyone shares. It's very simple, lets you use fewer spindles overall (you're not really counting IOPS anymore), is easy to manage, and when only one or two things happen at a time, it performs very well indeed. When many, many things happen at once, it thrashes itself to death trying to deliver too many random IOPS (more so if too many spindles were traded away in the search for cost savings). Some SAN admins really, really push it, because it does utilize the storage most efficiently. However, it means Johnny playing with his MP3 library on the file share (bad Johnny!) can cause the production SQL Server to slow down. Shared spindles are all about averages, not concurrent peaks (contrary to storage admin whitepapers, peak usage is not random, nor is it based on a normal curve; it's based on business requirements, like reporting and commission periods).

    Dedicated spindles are about being able to predict performance and to guarantee minimum performance levels (call them... SLAs).

    Here's a Brent Ozar article on dedicated vs. shared: http://www.brentozar.com/archive/2008/08/sql-server-on-a-san-dedicated-or-shared-drives/.

    Shared SAN backbone limitations are also important. If you have, say, an 8Gbps Active/Passive FC setup to your SAN, you aren't going to get more than 8Gbps of throughput. This may sound great - it's higher than the 6Gbps of modern SAS and SATA drives, so it must be better, right? Well, remember, if the SAN itself is also 8Gbps Active/Passive, then _it_ can only provide 8Gbps total... to your production box, plus your development box, plus the data warehouse, plus the tape backup, plus the corporate file share, plus... and so on. If you have several 6Gbps drives locally, _each_ gets 6Gbps; I've seen a local 6 disk SATA SSD setup in RAID5 deliver 1.4GB/s (roughly 11Gbps, approaching the maximum an 8Gbps Active/Active bandwidth-aggregating FC setup could deliver)... on 64KB random reads, and on 64KB and larger sequential reads (apparently that was a bandwidth limitation on the controller). Further, each box is using its own throughput, not sharing it.

    Note that SANs can be very effectively supplemented by putting, say, tempdb data and log files on local SSDs, either SATA/SAS or PCIe. This not only allows tempdb to respond faster than the SAN could at peak, it also keeps tempdb transfers off the SAN, freeing up for everything else on the SAN the throughput and IOPS that now go to local storage... and since you don't back up tempdb, there's no need to change backup strategies. Most warm and cold DR capabilities are also unaffected by this.
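
    If you do go that route, relocating tempdb is just a metadata change. A minimal sketch, assuming the default logical file names (tempdev and templog - check sys.master_files for yours) and a hypothetical local SSD volume S:; the new paths take effect the next time the SQL Server service restarts, after which the old files on the SAN LUN can be deleted:

        -- Point tempdb's data and log files at the (hypothetical) local SSD volume S:
        ALTER DATABASE tempdb MODIFY FILE (NAME = tempdev, FILENAME = 'S:\tempdb\tempdb.mdf');
        ALTER DATABASE tempdb MODIFY FILE (NAME = templog, FILENAME = 'S:\tempdb\templog.ldf');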