Storage - A meeting of minds

  • Comments posted to this topic are about the item Storage - A meeting of minds

  • What an excellent article - thanks very much for sharing your thoughts. Some really well-researched material here!


  • Great article, David.


  • Great article, will reread it twice

  • Great article. Thanks for sharing your research. 🙂

  • Hi David,

    Wonderful article.


  • Really good article Dave and something we've spoken about in depth a lot over the last few years.

    Something worthwhile pointing out on the cost front, PCIe storage is the cheaper option compared to a SAN if doing a new build (where you have to factor in the SAN cost). I've just recently looked at the costs for a new SAN setup with 1.2TB as the basic requirement of storage capacity. On the SAN side I've gone for 4x600GB disks (RAID 10) and a dozen servers. I've looked at dedicated spindles per DB server as always (I would have preferred 300GB disks but it's the better balance for the example I was working on). The SAN is a Clarion VNX5.

    For PCIe I went with the FusionIO ioDrive2 Mono MLC card as an OEM bit of kit shipped from Dell. The rest of the server spec is identical to the above. It's more than fast enough but there are much faster.

    The cost of PCIe is well under half the SAN-based costs and will deliver a lot more performance (I upped the RAM spec on both sides as I found the requirements you mentioned from FusionIO, and it's not that much more expensive to allow for at this stage).

    On the lifespan front you can use the program / erase cycles to calculate the theoretical lifespan.

    The lowest number you normally see quoted is 10,000 p/e cycles. Using that value we can calculate (simplified theoretical version) :

    A 1.2TB drive = 1,318,554,959,872 bytes

    1,318,554,959,872 bytes * 10,000 p/e cycles = 13,185,549,598,720,000 bytes that can be written

    500GB written per day = 536,870,912,000 bytes (for me this is pretty close as TempDB takes a hammering in our estate)

    13,185,549,598,720,000 bytes / 536,870,912,000 bytes = 24,560 days of writing at 500GB per day

    24,560 days = 67 years or 589,440 hours (admittedly lower than half of SATA or SAS, but when you up the capacity to 2.4TB with the same write rate it almost matches the usual MTBF rates on mechanical storage [my preferred way of describing SAN storage without being offensive]).
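The arithmetic above can be sketched in a few lines of Python. This is only the simplified theoretical version from the post: it assumes every byte of capacity can absorb the full 10,000 p/e cycles and ignores write amplification and over-provisioning. The function name `endurance_days` is made up for illustration.

```python
# Simplified theoretical flash lifespan from program/erase (p/e) cycles.
# Assumption: perfect wear levelling and no write amplification.
GIB = 1024 ** 3

drive_bytes = 1228 * GIB        # 1,318,554,959,872 bytes, the "1.2TB" figure used above
pe_cycles = 10_000              # lowest commonly quoted p/e rating
daily_write_bytes = 500 * GIB   # 536,870,912,000 bytes written per day


def endurance_days(capacity_bytes, cycles, bytes_per_day):
    """Days until the theoretical write budget (capacity * cycles) is used up."""
    total_writable = capacity_bytes * cycles
    return total_writable / bytes_per_day


days = endurance_days(drive_bytes, pe_cycles, daily_write_bytes)
print(f"{days:,.0f} days, {days / 365:.0f} years, {days * 24:,.0f} hours")

# Doubling the capacity at the same write rate doubles the theoretical lifespan.
days_double = endurance_days(2 * drive_bytes, pe_cycles, daily_write_bytes)
print(f"{days_double:,.0f} days for the doubled capacity")
```

Swapping in your own daily write volume is the useful part; the p/e rating should come from the vendor's data sheet rather than from the default used here.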

    It's a bit of an unfair calculation if I'm honest as we are comparing the amount of times we can theoretically write to something vs a potential hardware failure rate with mechanical parts. However, since the end result is something being kaput it's probably not too wide of the mark. Adding more component parts increases the probability of failure so that is something else to consider with mechanical storage. If we add spindles for speed we increase the likelihood of something breaking.

    Oh, and once you see a 1TB database restore go from 4 hours to 5 minutes simply with non-mechanical storage, it's very hard to get it out of your head.

  • Missing, not for the first time in such essays, is discussion of normal forms, particularly for the operational data. If one moves to SSD, response time factor changes significantly, even compared to short-stroking. But doing so with the typical flat-file datastore is cost prohibitive. In order to get maximum user data back and forth with available IOPS, one needs a high NF datastore, which also happens to have the minimum footprint on storage.

    Coders just love to refactor code, but they (all too often in control of database schemas) refuse to refactor data. Since their schemas start life as byte dumps manipulated by their wondrous code (just like their granddaddies' COBOL/VSAM apps), refactoring data means re-writing code; well, mostly discarding lots of code. The lifetime employment assurance disappears.

    IOW, the problem isn't technical, but spiritual. Much the same thing happened when the 360 appeared with DASD. Rather than code to Direct Access, coders continued to do what was comfortable, code to Sequential Batch. Who said there's something new under the sun?

  • You state:

    In addition there are more sectors in the outside tracks than there are in the innter [sic] tracks.

    My understanding of disk sectors has always been that the number of sectors per track is constant for a given disk, and that each sector stores the same amount of data as any other sector.

    Because the sectors on the outer tracks cover a bigger surface area on the physical platter, the storage density for those outer tracks is correspondingly lower. The included angle subtended by any sector is the same, which allows the head to read the same amount of data per partial rotation, no matter where on the disk it's reading from.

    Anyone confirm this?


  • Correct me if I'm wrong, but wouldn't 40 disks in a RAID 1 give you 1/2 the capacity you stated? It's a mirror, so your array size is still only 6 TB, not 12. This works out to about 5.5 TB of usable space. More to the point, aren't we really talking about RAID 0+1 here?
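A quick sketch of the capacity arithmetic in the post above. The 300 GB disk size is an assumption inferred from 12 TB spread over 40 disks, and reading "about 5.5 TB" as the 6 TB figure expressed in binary units is also an assumption. The function name `usable_gb` is made up for illustration.

```python
# Usable capacity for a few RAID levels, ignoring hot spares and formatting overhead.
def usable_gb(disks, disk_gb, level):
    """Return usable capacity in decimal GB for a single RAID set."""
    if level == "RAID0":
        return disks * disk_gb            # striping only, no redundancy
    if level in ("RAID1", "RAID10", "RAID0+1"):
        return disks * disk_gb // 2       # every block is mirrored once
    if level == "RAID5":
        return (disks - 1) * disk_gb      # one disk's worth of parity
    raise ValueError(f"unsupported RAID level: {level}")


raw = usable_gb(40, 300, "RAID0")          # raw capacity of 40 x 300 GB
mirrored = usable_gb(40, 300, "RAID10")    # half of raw once mirrored
mirrored_tib = mirrored * 10**9 / 2**40    # decimal GB converted to binary TiB

print(f"raw {raw} GB, mirrored {mirrored} GB, about {mirrored_tib:.1f} TiB")
```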

  • rmechaber (7/16/2012)

    You state:

    In addition there are more sectors in the outside tracks than there are in the innter [sic] tracks.

    My understanding of disk sectors has always been that the number of sectors per track is constant for a given disk, and that each sector stores the same amount of data as any other sector.

    Anyone confirm this?


    That was true about a decade ago, or perhaps longer. For very many years, HDD have had variable geometry, with more sectors on outer tracks, since there's more there, there.

  • RobertYoung (7/16/2012)

    rmechaber (7/16/2012)

    You state:

    In addition there are more sectors in the outside tracks than there are in the innter [sic] tracks.

    My understanding of disk sectors has always been that the number of sectors per track is constant for a given disk, and that each sector stores the same amount of data as any other sector.

    Anyone confirm this?


    That was true about a decade ago, or perhaps longer. For very many years, HDD have had variable geometry, with more sectors on outer tracks, since there's more there, there.

    Ah, thank you -- it's been that long or more since I've looked into disk storage geometry. The 'net has a memory: I found several authoritative-"looking" pages via Google supporting my (older) knowledge in a way that sounded current. Hence my request for some confirmation/elaboration.

    Without add'l sectors on outer tracks, the concept of short-stroking makes no sense, so I knew something was off.

    Thanks again,


  • Thanks for this useful contribution.


    Basit A. Farooq (MSC Computing, MCITP SQL Server 2005 & 2008, MCDBA SQL Server 2000)

  • A good introduction to some aspects of storage, marred by a Fusion-IO/PCIe SSD specific perspective, generalizations without supporting evidence, and several serious omissions: no discussion of RAID levels in modern systems; no discussion of OS presented mount points vs. logical drive vs. LUN vs. raidset/virtual drive vs. spindle; nothing on the very critical dedicated vs. shared spindle approach; and no mention of hot spares. Shared SAN backbone limitations didn't make an appearance either.

    Note that on the storage front, there are modern 2U, 4U, and tower servers that support in excess of 20 to 30 local spindles each (2.5", of course), with a mix of 15k RPM, 10k RPM, 7.2k RPM, and SSD disks. These provide us with new options for high IOPS/throughput capable SQL Servers, in addition to the PCIe SSD front.

    Note that with SAS and SATA SSD's, either local or on the SAN, you have the option of all the normal RAID levels - 1, 10, 5, 50, 6, 60, etc. With PCIe SSD's, the last I heard for both OCZ and Fusion-IO SSD's was that you were limited to software RAID at this time. It's generally held that software RAID is inferior to hardware RAID; that may or may not be true with the most modern server operating systems. I haven't bothered to try software RAID; I stick with hardware RAID on caching controller cards, as do the storage professionals I work with.

    Unsupported generalization: "... not a commodity piece of hardware... However 128 GB RAM for the SAN would cost a £six figure sum!"

    The reference for EMC Clariion systems lists DDR2 DIMMs as RAM, which is commodity hardware, even in ECC variants (it's what we use in servers as well), and I've bought hundreds of gigabytes at a time for far, far less than six figures USD (and used it in SQL servers). Unless references for a third party replacement for SAN memory (i.e. without as much price gouging as the vendors may put in their replacement part MSRP) are provided, I don't believe this is true in 2012.

    RAID levels: Conventional wisdom is that RAID 1 and 10 are better for writes (i.e. one log file per RAID set), and RAID 5 is good for reads (less wasted storage). On modern caching controller and/or SAN hardware from the last couple years, my benchmarking has shown this to no longer quite be the case; see the results in my post. On my particular setup, RAID10 appears to have an advantage over RAID5 and RAID50 only on 8KB and 64KB random (not sequential) writes, and was equivalent or worse on other operations. Test your own setup carefully, whether SAN or local - many setups have quirks with one or another specific aspect that you should take into consideration when planning what goes where and how it's configured (for instance, a sequential write throughput cap, or severe performance problem with, say, 64KB random reads). Note that on some modern SAN's, RAID 50 is extremely performant.

    Perhaps the most critical oversight in the article or my reading of it was to not discuss the path from SQL Server data files down to storage spindles or parts thereof, and the dedicated vs. shared argument.

    I.e. (I'm going to skip subdirectory level mount points, but be aware they exist), on your SQL server you see:

    Production O:\userdb.mdf

    Production V:\userdb.ldf

    Development E:\tempdb.mdf and E:\tempdb.ldf

    The SAN admin tells you:

    O: maps to LUN 5

    V: maps to LUN 71

    E: maps to LUN 6

    Unless you ask further, you may not hear that:

    LUN 5 maps to RAIDset 12

    LUN 71 maps to RAIDset 13

    LUN 6 maps to RAIDset 13

    Corporate file share \\server\MainShare maps to RAIDset 12

    Then, you may still have to ask to find out:

    RAIDset 12 is a 14 disk RAID5 (1x13+1)

    RAIDset 13 is an 8 disk RAID50 (2x3+1)

    Oddly enough, when they're backing up the corporate file share, all sequential activity slows down, and reads on the production userdb are particularly slow. Why? Because A) the SAN has limited total fiber channel bandwidth, and the backups are using up a lot of the total throughput available, and B) because the corporate file share is hitting the same spindles that userdb.mdf is on with a mix of random and sequential access.

    Even worse, when the development machine is doing a lot of hard tempdb activity, writes on the production userdb are slow. Why? Because the development tempdb LUN is on the same spindles as production userdb.ldf.
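The drive letter to LUN to RAIDset chain above is easy to lose track of, so here is a small sketch that follows it and lists what ends up on the same spindles. The mappings are copied from the example; the dictionary layout and the function name `workloads_by_raidset` are made up for illustration, and a real SAN would need its own inventory export.

```python
from collections import defaultdict

# Mappings taken from the example: what the DBA sees, then what the SAN admin knows.
files_on_drive = {
    ("Production", "O:"): ["userdb.mdf"],
    ("Production", "V:"): ["userdb.ldf"],
    ("Development", "E:"): ["tempdb.mdf", "tempdb.ldf"],
}
drive_to_lun = {
    ("Production", "O:"): 5,
    ("Production", "V:"): 71,
    ("Development", "E:"): 6,
}
lun_to_raidset = {5: 12, 71: 13, 6: 13}
other_consumers = {12: [r"\\server\MainShare"]}  # non-SQL users of a RAID set


def workloads_by_raidset():
    """Group every file and share by the RAID set (spindles) it finally lands on."""
    grouped = defaultdict(list)
    for (server, drive), lun in drive_to_lun.items():
        raidset = lun_to_raidset[lun]
        for name in files_on_drive[(server, drive)]:
            grouped[raidset].append(server + " " + drive + "\\" + name)
    for raidset, consumers in other_consumers.items():
        grouped[raidset].extend(consumers)
    return dict(grouped)


for raidset, workloads in sorted(workloads_by_raidset().items()):
    print(f"RAIDset {raidset} is shared by: {', '.join(workloads)}")
```

Any RAID set that lists more than one workload is a place where one system's activity can slow another down.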

    There's also a large difference between dedicated spindles (i.e. 8 disk RAID5 for userdb.mdf, 2 disk RAID1 for userdb.ldf, 2 disk RAID1 for tempdb.mdf, 2 disk RAID1 for tempdb.ldf, 2 disk RAID1 for the OS and programs, 3 disk RAID5 for the file share, 2 disk RAID1 for system DB's, 2 disk RAID1 for system log files, and 1 global hot spare) vs. shared spindles (i.e. 23 disk RAID5 for everything, and 1 global hot spare). With dedicated spindles, you can have high tempdb, OS, file share, and user log file activity all at once, and each will proceed almost as quickly (random or sequential) as it would if it was the only activity. With shared spindles, the maximum speed for any one activity will be much higher, and the "average" will seem much better on paper... but on the first day of the new fiscal year, when all kinds of activity happens at once, don't be surprised if everything slows down quite a lot.

    Shared spindles are basically large sets of spindles set up in a "storage pool", and everyone shares it. It's very simple, allows you to use fewer overall spindles (you're not really counting IOPS anymore), is easy to manage, and when only one or two things happen at a time, it performs very well indeed. When many, many things happen at once, it thrashes itself to death (more so if too many spindles were traded away in the search for cost savings) trying to deliver too many random IOPS. Some SAN admins really, really push it, because it does most efficiently utilize the storage. However, it means Johnny playing with his MP3 library on the file share (Bad Johnny!) can cause the production SQL Server to slow down. Shared spindles are all about averages, and not about concurrent peaks (contrary to storage admin whitepapers, peak usage is not random, nor is it based on a normal curve; it's based on business requirements, like reporting and commission periods).

    Dedicated spindles are about being able to predict performance and guaranteeing minimum performance levels (call them... SLA's).

    Here's a Brent Ozar article on dedicated vs. shared.

    Shared SAN backbone limitations are also important. If you have, say, an 8Gbps Active/Passive FC setup to your SAN, you aren't going to get more than 8Gbps of throughput. This may sound great - it's higher than 6Gbps for modern SAS and SATA drives, so it must be better, right? Well, remember, if the SAN itself is also 8Gbps Active/Passive, then _it_ can only provide 8Gbps total... to your production box, plus your development box, plus the data warehouse, plus the tape backup, plus the corporate file share, plus... and so on. If you have several 6Gbps drives locally, _each_ gets 6Gbps; I've seen a local 6 disk SATA SSD setup in RAID5 deliver 1.4GB/s (i.e. ~14Gbps, or an 8Gbps Active/Active bandwidth aggregating FC's maximum)... on 64KB random reads, and 64KB and larger sequential reads (apparently that was a bandwidth limitation on the controller). Further, each box is using its own throughput, not sharing it.
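A rough back-of-envelope sketch of that sharing effect. It assumes all consumers are busy at once and split the SAN link evenly, which real traffic will not do; the five consumers are the ones named in the paragraph above, and the function name is made up for illustration.

```python
# How much of a shared SAN link each consumer gets when all are busy at once.
# Assumption: an even split; real contention depends on workload and QoS settings.
def per_host_share_gbps(link_gbps, busy_hosts):
    """Even share of one link's line rate across concurrently busy hosts."""
    return link_gbps / busy_hosts


consumers = ["production", "development", "data warehouse", "tape backup", "file share"]
shared = per_host_share_gbps(8, len(consumers))   # one 8Gbps Active/Passive path

# Locally attached drives each have their own 6Gbps link, so link capacity adds up
# (the controller may still cap the total, as in the 1.4GB/s case mentioned above).
local_link_total = 6 * 6                          # six drives at 6Gbps each

print(f"shared SAN: {shared} Gbps per consumer; local links: {local_link_total} Gbps in total")
```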

    Note that SAN's can be very effectively supplemented by putting, say, tempdb data and log files on local SSD's, either SATA/SAS or PCIe; this not only allows tempdb to respond faster than the SAN could at peak, but it keeps tempdb transfers off the SAN, allowing everything else on the SAN to use the throughput and IOPS that are now going to local storage... and since you don't back up tempdb, there's no need to change backup strategies. Most warm and cold DR capabilities are also unaffected by this.

  • Excellent article, and I cannot overemphasize along with you the importance of the DBA working as part of a team both with the application developers and the administrators that provide the lower level infrastructure.

    Timothy A Wiseman
