Implications of slower writes with 64K IOs using SSDs vs SAN

  • As part of our migration to SQL Server 2017, we are considering a move to local SSDs.  Some testing with DiskSpd.exe showed that all IO types are faster with local disk for 8K IOs, but writes are slower for 64K IOs.  Here's how it tested out:
    Workload                      LOCAL MB/s    SAN MB/s
    8KB Random Writes             434.65        190.38
    8KB Random Reads              1070.99       390.84
    8KB 60% Read / 40% Write      664.2         390.88
    64KB Random Writes            757.33        1584.57
    64KB 60% Read / 40% Write     1135.78       1607.31

    My first thought is that since writes are generally asynchronous for SQL Server workloads, the benefits of faster reads and lower total cost are worth the probably insignificant cost of slower writes for larger IOs.  Is this conventional wisdom, or are there other thoughts on this?
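
    In case anyone wants to reproduce this kind of comparison, the runs were done with DiskSpd commands along these lines.  The file path, duration, thread count, and queue depth here are illustrative rather than the exact values I used:

        :: 8KB and 64KB mixed-workload runs against a local test file
        :: -w40 = 40% writes, -r = random IO, -h = disable software and hardware write caching
        diskspd -b8K  -d60 -o8 -t8 -r -w40 -h -L -c50G D:\test\iotest.dat
        diskspd -b64K -d60 -o8 -t8 -r -w40 -h -L -c50G D:\test\iotest.dat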

    Thanks in advance for your thoughtful replies.

  • NJ-DBA - Monday, November 13, 2017 9:27 AM

    As part of our migration to SQL Server 2017, we are considering a move to local SSDs.  Some testing with DiskSpd.exe showed that all IO types are faster with local disk for 8K IOs, but writes are slower for 64K IOs.  Here's how it tested out:
    Workload                      LOCAL MB/s    SAN MB/s
    8KB Random Writes             434.65        190.38
    8KB Random Reads              1070.99       390.84
    8KB 60% Read / 40% Write      664.2         390.88
    64KB Random Writes            757.33        1584.57
    64KB 60% Read / 40% Write     1135.78       1607.31

    My first thought is that since writes are generally asynchronous for SQL Server workloads, the benefits of faster reads and lower total cost are worth the probably insignificant cost of slower writes for larger IOs.  Is this conventional wisdom, or are there other thoughts on this?

    Thanks in advance for your thoughtful replies.

    Just to give you something to think about: I find it difficult to believe that a SAN can sustain writes at roughly 1.6 GB/sec.  That seems to me likely to be an effect of write caching on the SAN, which means the I/O comparison is no longer in the category of "apples to apples".  SAN devices have to have huge caches because the typical SAN fabric is usually no faster than 1 Gb/sec fiber.  Not that there aren't faster SAN fabrics... 10 Gb/sec is readily available, but pricey.  Also, multi-pathing a SAN fabric can significantly impact your I/O speed.  However, perhaps the most valuable thing to consider in going to local SSDs over a SAN is whether or not you can take a backup quite the same way.  Some SANs have built-in write mirroring and a separate backup mechanism directly attached.  You might be hard-pressed to do better than that if you have any 24/7 web requirements.

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

  • What I've found is that, although there is a benefit to having things on SSD, it's trivial compared to the larger problem.  Main Memory (RAM) (comparatively) blows the doors off of even SSDs but, even if they were identical in performance, people forget what the true and much larger problem is.  NOTHING on the SSDs can be used directly by SQL Server.  Whatever the data is, it MUST be moved into Main Memory.  If the data is accessed many times, the data needs to be allowed to STAY in Main Memory.

    What that means is that Main Memory IS one of the 2 most precious resources (CPU is the other one) and people waste it (actually, people waste both resources).  How do they waste it?  They write inefficient code that bumps other stuff out of memory.  And don't forget that TempDB also starts out in memory and only spills to disk if the temporary data to support the task at hand becomes too large according to SQL Server.  Also remember that spools, hash joins, and a whole lot of other things use TempDB, which is using memory.  And, no... I AM NOT suggesting that you avoid Temp Tables because of this.  In fact, using tried and true "Divide'n'Conquer" methods that use Temp Tables can actually save a huge amount of memory.
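
    If you want to see just how precious that memory is, take a quick look at what's actually sitting in the buffer pool for each database.  The query below is just a rough sketch against sys.dm_os_buffer_descriptors; nothing fancy:

        -- Approximate buffer pool usage per database (each buffer page is 8KB)
        SELECT DB_NAME(bd.database_id) AS DatabaseName,
               COUNT(*) * 8 / 1024     AS BufferPoolMB
          FROM sys.dm_os_buffer_descriptors bd
         GROUP BY bd.database_id
         ORDER BY BufferPoolMB DESC;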

    A year ago, we migrated from some nearly 8 year old servers that would cost a fortune to upgrade memory on.  To upgrade the servers from 128GB to just 256GB would have cost as much as building new physical servers, so we upgraded the hardware.  In the process, 2 to 3TB of SSD cache was added to each server.  Everyone thought that there would be huge improvements in performance... everyone except me... and, unfortunately, I was right.  After upgrading from 16 CPUs and 128GB of RAM to 32 CPUs and 256GB of RAM, there was no improvement in the performance of front end code and only a 2X improvement on some of the overnight batch jobs.  While 2X seems like a lot, remember that it was sporadic and did NOTHING to improve performance for front end code.

    The reason the front-end code was no faster is that CPU clock rates have hit a practical ceiling (generally somewhere between 2.8 and 3.2GHz) and memory speeds have hit a similar limit.  Long gone are the days of a 2X improvement every 18 months.  It's physically impossible.  The only way you can get a real performance improvement nowadays is to go parallel, but most front end code is singleton in nature and can't go parallel.

    Doubling the Main Memory is what helped some of the batch procs run faster, because the data stayed where it needed to be (in Main Memory) instead of having to be loaded in pieces.  Even then, and as I previously stated, it was only a 2X improvement sometimes.  It wasn't the panacea that everyone but me thought it was going to be.

    In comparison, we recently found a way to make some of our imports (batch jobs) run 70X faster and some of our front end code (even though single threaded) run 40 to 440 times faster (not a misprint).  You can't buy hardware to even do the low end of that range of improvements.

    How did we do it?

    I finally managed to convince management that we need to address the real problem.  We made some relatively minor changes to the CODE (2 FTE weeks for the imports, which affected ALL of the imports, and 1 FTE day for the front end).

    The CODE is where performance lives.  Invest some time in doing that right the first time and fixing it if it wasn't.  It's actually not that difficult to do, either.  You just need someone that "gets it" and then give them the time to "do it". 😉

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • These are interesting points... It is definitely possible that we are seeing SAN caching ruining our apples to apples comparison.  I'm using diskspd with the -h switch to disable caching, but of course that's not going to work at the SAN layer.

    To Jeff's point, we definitely do have code that could be improved.  Our servers have huge amounts of memory, but we also have some huge tables and cache churn does happen.  That said, there are a few reasons we are moving away from SAN... first, it's a shared resource and a single point of failure for multiple database servers in our highly scaled out application.  (Think 20 clusters sharing the same SAN - the SAN has a problem and people ask why our "redundant" systems all stop working.)  Also, we are looking to implement AlwaysOn replicas, so if we can save money on storage cost by going to local disk, we want to do that.

  • NJ-DBA - Tuesday, November 14, 2017 10:27 AM

    These are interesting points... It is definitely possible that we are seeing SAN caching ruining our apples to apples comparison.  I'm using diskspd with the -h switch to disable caching, but of course that's not going to work at the SAN layer.

    To Jeff's point, we definitely do have code that could be improved.  Our servers have huge amounts of memory, but we also have some huge tables and cache churn does happen.  That said, there are a few reasons we are moving away from SAN... first, it's a shared resource and a single point of failure for multiple database servers in our highly scaled out application.  (Think 20 clusters sharing the same SAN - the SAN has a problem and people ask why our "redundant" systems all stop working.)  Also, we are looking to implement AlwaysOn replicas, so if we can save money on storage cost by going to local disk, we want to do that.

    Just be sure to include any additional costs associated with doing backups differently.  Many SAN systems automatically back up the data, and such a feature is difficult to replace, because backups then require additional data reads from the server that you don't normally see if your SAN is using write mirroring for that purpose.  Those additional reads might well happen at night, but if you operate 24/7 and have public-facing web apps, those backups could interfere with normal business operation in a way you've not previously experienced if the SAN you use does that kind of write mirroring to effect backups.

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

  • We are not taking backups with the SAN.

  • NJ-DBA - Tuesday, November 14, 2017 10:59 AM

    We are not taking backups with the SAN.

    Then I'd guess that local SSDs would certainly help with your I/O speed.  Given the typical SAN fabric speed of just 1 Gb/sec, local drives that can operate at 6 Gb/sec (effective rate closer to 4 to 5 Gb/sec) are a considerable speed boost.  However, given the considerable write caching that takes place in the SAN, I'd be sure to review I/O wait stats and disk queue length stats (perfmon) under load to see if having that extra speed just puts the I/O behind in a queue.  I'd want as many physical SSDs as the machine can handle to provide relief on I/O controller wait time or queueing.
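
    Something along these lines against sys.dm_io_virtual_file_stats will show where the I/O stalls are accumulating on the SQL Server side.  It's just a starting point - the numbers are cumulative since the last restart, so compare snapshots taken under load:

        -- Cumulative I/O counts and stall times per database file since the last restart
        SELECT DB_NAME(vfs.database_id) AS DatabaseName,
               mf.physical_name,
               vfs.num_of_reads,
               vfs.num_of_writes,
               vfs.io_stall_read_ms,
               vfs.io_stall_write_ms
          FROM sys.dm_io_virtual_file_stats(NULL, NULL) vfs
          JOIN sys.master_files mf
            ON mf.database_id = vfs.database_id
           AND mf.file_id     = vfs.file_id
         ORDER BY vfs.io_stall_write_ms DESC;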

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

  • NJ-DBA - Tuesday, November 14, 2017 10:27 AM

    These are interesting points... It is definitely possible that we are seeing SAN caching ruining our apples to apples comparison.  I'm using diskspd with the -h switch to disable caching, but of course that's not going to work at the SAN layer.

    To Jeff's point, we definitely do have code that could be improved.  Our servers have huge amounts of memory, but we also have some huge tables and cache churn does happen.  That said, there are a few reasons we are moving away from SAN... first, it's a shared resource and a single point of failure for multiple database servers in our highly scaled out application.  (Think 20 clusters sharing the same SAN - the SAN has a problem and people ask why our "redundant" systems all stop working.)  Also, we are looking to implement AlwaysOn replicas, so if we can save money on storage cost by going to local disk, we want to do that.

    Just curious.  What is your definition of a "huge" table in row count, column count, and average number of bytes per row?
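
    If you want actual numbers rather than a gut feel, something like the sketch below (based on sys.dm_db_partition_stats) will list row counts and reserved space per table:

        -- Row count and reserved space per table, largest first
        SELECT s.name + '.' + t.name AS TableName,
               SUM(CASE WHEN ps.index_id IN (0, 1) THEN ps.row_count ELSE 0 END) AS RowCnt,
               SUM(ps.reserved_page_count) * 8 / 1024 AS ReservedMB
          FROM sys.dm_db_partition_stats ps
          JOIN sys.tables  t ON t.object_id = ps.object_id
          JOIN sys.schemas s ON s.schema_id = t.schema_id
         GROUP BY s.name, t.name
         ORDER BY ReservedMB DESC;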

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • We have some real problem tables - in every variety... lots of large records, lots of small records, "small" record counts with large records, etc.
    For example, we have one table which has "only" about 50 million rows, but the table consumes about 600GB on disk - most of it in a varbinary(max) column being used to store images, with a handful of other columns.  We have another table with 900 million rows that consumes about 200GB on disk (there is a varchar(150) column and a handful of smaller columns).  We have tables with 100-300 million rows and a few small columns, which consume roughly 10-20GB on disk per table.

  • NJ-DBA - Thursday, November 16, 2017 8:36 AM

    We have some real problem tables - in every variety... lots of large records, lots of small records, "small" record counts with large records, etc.
    For example, we have one table which has "only" about 50 million rows, but the table consumes about 600GB on disk - most of it in a varbinary(max) column being used to store images, with a handful of other columns.  We have another table with 900 million rows that consumes about 200GB on disk (there is a varchar(150) column and a handful of smaller columns).  We have tables with 100-300 million rows and a few small columns, which consume roughly 10-20GB on disk per table.

    Thanks for the feedback.  Heh... I appreciate your misery... we have a table that stores call recordings (not the right thing to do until you realize that calls were "getting lost" on disk because of thoughtlessness), which only contains 2.8 million rows but occupies 900GB on disk.  During my "fall cleanup", I also discovered that someone who wrote a batch job decided to keep a record of things by creating a single 50 byte row for each run... one per date-named table.  I have more than a thousand of those damned things to clean up and I'm going to have to have a bit of a "calibration session" with a "Developer".

    As you do, I also have tables that I'm setting up to do a cleanup on.  We have a "network" of tables that keeps track of a certain form of batch runs.  The data is all but useless after a week or so, but the tables have never been purged (they contain 10 years of data now) because they couldn't figure out how to step through the 23 tables that are all linked with FKs.  The database that holds these tables is continuing to grow from its present 1.6TB.  Once I do the cleanup of those tables and some "other" tables, the database will be about 30GB in size.  It's just stupid how far they've let things go.
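
    The purge itself is nothing fancy... just delete in small batches from the "child" end of the FK chain and work up to the "parent" tables, keeping each transaction small so the log doesn't explode.  The table and column names below are made up purely to show the pattern:

        -- Purge one child table in small batches; repeat the pattern up the FK chain.
        -- dbo.BatchRunDetail, dbo.BatchRun, RunID, and RunDate are placeholder names.
        DECLARE @BatchSize INT = 50000;

        WHILE 1 = 1
        BEGIN
            DELETE TOP (@BatchSize) d
              FROM dbo.BatchRunDetail d
              JOIN dbo.BatchRun r ON r.RunID = d.RunID
             WHERE r.RunDate < DATEADD(DAY, -7, GETDATE());

            IF @@ROWCOUNT < @BatchSize BREAK;  -- last partial batch means we're done
        END;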

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Just in case anyone ends up here in a future search, I wanted to follow up with the end result.  It turned out that I was not comparing apples to apples.  I was comparing a 4TB local disk LUN (created with 4 underlying disks) to SAN.  With SAN, any given 4TB LUN could reach 1.6 GB/sec tested in isolation, but since the bottleneck is at the HBA, the total throughput for the server would be limited to 1.6 GB/sec.  So if I allocate 4 LUNs on SAN, I'm only getting 400 MB/sec per LUN.  With the local disk, I was getting 800 MB/sec per LUN, because the limitation was at the disk level.  By adding more disks (and more total usable space), I still get 800 MB/sec per LUN, so with 4 LUNs I may get 3.2 GB/sec total throughput... provided I don't hit the limit of the local array controller first.
