If you have been following this series, we have covered system buses, hard disks, host bus adapters, and RAID. Along the way we also covered how to capture your IO patterns and the SQLIO tool. Now we will pull it all together. We move up the stack even further, to the actual layout of the RAID stripe and the file system. How the stripe and file system are laid out on your disks has a huge impact on performance. One of the things that has really gotten some traction over the last few years is sector alignment. This one thing, if not done, could cost you 30% to 40% of your IO potential. Jimmy May has covered sector alignment in depth, so I won’t rehash it here. Kendal Van Dyke also has a good series that covers offset, stripe size, and allocation units with different RAID levels.
Something I’ve seen, and been guilty of, is taking a drive’s base specifications and just multiplying them out. Say the manufacturer says the drive will do 79MB/Sec minimum throughput; we have 10 drives, so that is 790MB/Sec of throughput! We all know from experience that this isn’t so. What eats us up is how much slower it really can be. As we have seen throughout this series, there is overhead associated with everything. Before we just throw a bunch of disks in an enclosure and press it into service, it would be nice to have an idea of what the performance should be. It’s also recommended to do some of this work before you actually buy anything, so you don’t have to go back to your boss, beg for more money, and explain that your wild guess was wrong.
Always add a pinch of salt to whatever the disk manufacturer puts in the specifications. Most of the time they will be close enough. The problem lies in the fact that they don’t always disclose the methods for achieving those numbers. For instance, when they report minimum and maximum throughput they are usually talking about a scan of the entire disk, including all the metadata stored between tracks, which is the best throughput possible. You won’t see those results in everyday life. They also give you numbers that can be completely irrelevant, like single sector read rates. Very rarely do you read a single sector at a time. Personally, I would love it if the drive makers gave out the engineering specifications. I know that won’t happen, but it would make my life easier. The disk characteristics that are important are sector size, spindle speed, read and write seek times, and read and write sequential times and, to a lesser extent, sequential throughput in megabytes per second. With the single disk numbers we can move on to the RAID configuration.
There are several factors that impact the RAID array’s ability to perform: the RAID level, the size of the IO request, and the stripe size. RAID level is the easy one: what kind of hit do you take on writes versus the capacity of the array? Stripe size correlates directly with the size of the IO request. If the IO request is bigger than the stripe size, it will have to seek across another disk to satisfy the data request. If the IO requests are very small and random, you may lose some IO performance if the requests pile up on one disk, causing a hot spot. There are established calculations that you can perform to get an idea of how to configure your array. I’ve built a web page that you can use to do all the basic calculations, the Disk Drive RAID Configuration Tool. These equations are baseline estimates so you aren’t working completely in the dark. You can enter your own drive statistics or pick from one of 1100 hard drives in the database. This web calculator is based on Peter Chen’s equations for estimating RAID performance and best stripe size. I’ll add more to it as I get time.
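To make the relationship between stripe size and request size concrete, here is a minimal Python sketch that counts how many stripe units a single request touches. It assumes "stripe size" means the amount written to one disk before moving on to the next, and that consecutive units land on different disks; the function name and the sample offsets are made up for illustration and are not taken from the calculator.

```python
KB = 1024

def stripe_units_touched(offset_bytes: int, length_bytes: int, stripe_unit_bytes: int) -> int:
    """Count the stripe units (and so, roughly, the disks) one IO request spans."""
    if length_bytes <= 0 or stripe_unit_bytes <= 0:
        raise ValueError("length and stripe unit must be positive")
    first_unit = offset_bytes // stripe_unit_bytes
    last_unit = (offset_bytes + length_bytes - 1) // stripe_unit_bytes
    return last_unit - first_unit + 1

# An aligned 64K request on a 64K stripe unit stays on one disk.
print(stripe_units_touched(0, 64 * KB, 64 * KB))         # 1
# A 256K request on a 64K stripe unit has to visit four units.
print(stripe_units_touched(0, 256 * KB, 64 * KB))        # 4
# The same 64K request starting 63 sectors in (a misaligned partition) spills onto a second unit.
print(stripe_units_touched(63 * 512, 64 * KB, 64 * KB))  # 2
# A small 8K request on a 256K stripe unit stays on one disk.
print(stripe_units_touched(0, 8 * KB, 256 * KB))         # 1
```

The third case is the sector alignment problem in miniature: nothing about the request changed, only where it starts.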
SQL Server works with two specific IO request sizes in general, 8K and 64K. If you did your due diligence earlier, you could also add any other request sizes that you saw come through. Focusing on the page size and extent size is a good place to start. Using the RAID calculator tool, I chose a Seagate Savvio 15K.2 drive as my base. One of the things my calculator can’t take into consideration is your system and RAID HBA. This is where testing is essential. You will find there are anomalies in every card, and physical limits on throughput and IOs. Since my RAID card won’t do a stripe bigger than 256k, that is my cap for size. Reading through several IO white papers on SQL Server, the general recommendation is a 64k or 128k stripe size for SQL Server 2000/2005 and a 256k stripe size for SQL Server 2008. I’ve found that, as general guidance, this is a good place to start as well. The calculator tells me that a RAID 10 array with 24 drives at a 256k stripe size and an 8k IO request size should get 9825 IOs/Sec and 76.75 MB/Sec on average, across reads, writes, sequential and random IO requests. That’s right, 76 MB/Sec throughput for 24 drives rated at 122 MB/Sec minimum. That is about 3.2 MB/Sec per drive. The same array at a 64k IO request size yields 8102 IOs/Sec and 506 MB/Sec. That is a huge difference in throughput based on the IO request size alone. Still, it is not anywhere near 122 MB/Sec per drive. As an estimate, I find that these numbers are “good enough” to start sizing my arrays. If you need to figure out how big the array has to be to support, say, 150 MB/Sec throughput or 10000 IOs/Sec, you can do that with the calculator as well. Armed with our estimates, it’s time to actually test our new RAID arrays. I use SQLIO to do synthetic benchmarking before running any actual data loads.
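The IOs/Sec and MB/Sec figures above are tied together by the IO request size, which is easy to check by hand. The Python sketch below only redoes that arithmetic; it is not the calculator's model (that is Peter Chen's equations), and the function names are mine.

```python
def throughput_mb_per_sec(iops: float, io_size_kb: float) -> float:
    """Throughput implied by an IO rate at a fixed request size."""
    return iops * io_size_kb / 1024.0

def iops_for_throughput(mb_per_sec: float, io_size_kb: float) -> float:
    """IO rate needed to sustain a throughput target at a fixed request size."""
    return mb_per_sec * 1024.0 / io_size_kb

drives = 24

small = throughput_mb_per_sec(9825, 8)    # 8K requests
large = throughput_mb_per_sec(8102, 64)   # 64K requests

print(round(small, 1), round(small / drives, 1))  # 76.8 3.2
print(round(large, 1), round(large / drives, 1))  # 506.4 21.1

# IOs/Sec needed to hit 150 MB/Sec with 64K requests.
print(iops_for_throughput(150, 64))               # 2400.0
```

The same array delivers very different MB/Sec simply because each IO carries eight times the data, which is why a target has to be stated as both an IO rate and a request size.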
After doing a round of testing, I found that in some cases the numbers were a little high or a little low. Cache hit ratios are another factor that is hard to calculate. Enterprise RAID HBAs usually disable the write cache on the local disk controller and just use their own battery-backed cache for all write operations. This is safer, but with more and more disks on a single controller the amount of cache per disk can get pretty low. The HBA will also want you to split that cache between read and write operations. On my HP RAID HBAs the default is 25% read and 75% write. An older study I found on disk caches and cache size saw diminishing returns above 2 MB, gaining between 1 and 2 percent additional cache hits per megabyte of cache. I expect that to flatten out even more as the caches get larger. You simply can’t get a 100% cache hit ratio; that would mean the whole drive fits in the RAM cache or your IO requests are the same over and over. Generally, if that is the case, you will find SQL Server won’t have to go to disk at all, because it will have what it needs for reads in the buffer pool. I find that if you have less than 20 percent write activity, leaving the defaults is fine. If I do have a write-heavy load, I will set the cache to 100% writes.
Having completed my benchmarking, I found that a 128k or 256k stripe size was fine on average. Just realize that if you optimize for one IO pattern, the others will suffer. Latency is also important, and I have included it here as well. You will find that the larger the IO request and the smaller the stripe size, the worse latency gets. Here are the results from my tests on a DL380 G5 with a P411 and 24 drives in an MSA 70 enclosure. I’ve included tests for stripe sizes from 8k to 256k.
As a footnote I’d like to thank Joe Handley, Ben Poliakoff, David Gosslin and Dale Davis for helping me get the Disk Drive RAID Configuration Tool together. I’m not a web guy!
WARNING! Lots of charts below!
In previous articles, we have covered the system bus, host bus adapters, and disk drives. Now we will move up the food chain and take a look at getting several disks to operate as one.
In 1988, David A. Patterson, Garth Gibson, and Randy H. Katz authored a seminal paper, A Case for Redundant Arrays of Inexpensive Disks (RAID). The main concept was to use off-the-shelf commodity hardware to provide better performance and reliability at a much lower price point than the current generation of storage. Even in 1988, we already knew that CPUs and memory were outpacing disk drives. To try to solve these issues, Dr. Patterson and his team laid out the fundamentals of our modern RAID structures almost completely; RAID levels 1 through 5 all come directly from this paper. There have been improvements in the error checking, but the principles are the same. In 1993, Dr. Patterson along with his team released a paper covering RAID 6.
RAID 0
Disk requirements: 2. Usable disks: N.
RAID 0 is striping without parity. Technically it is not a redundant array of disks, just an array of disks, but it gets lumped in since it uses some of the same technical aspects. Other hybrid RAID solutions utilize RAID 0 to join other RAID arrays together. Each disk in the array holds data and no parity information. Without having to calculate parity, there are no penalties on reads or writes. This is the fastest of all the RAID configurations. It is also the most dangerous. One drive failure means you lose all your data. I don’t recommend using RAID 0 unless you are 100% sure losing all your data is completely OK.
RAID 1
Usable disks: N/2.
RAID 1 is mirroring two disks. RAID 1 writes to and reads from both disks simultaneously. You can lose one disk and still operate. Some controllers allow you to read data from both disks; others return only data from the disk that delivers it first. Since there are no parity calculations, it is generally the easiest RAID level to implement. Duplexing is another form of RAID 1 where each disk has its own controller.
RAID 5
Disk requirements: 3. Usable disks: N-1.
RAID 5 is a striped array with distributed parity. This is similar to RAID 0 in that all data is striped across all available disks. Where it differs is that one stripe holds parity information. If a drive fails, the data contained on that drive is recreated on the fly using the parity data from the other drives. More than one disk failure equals total data loss. The more drives you have in a RAID 5 array, the greater the risk of having a second disk failure during the rebuild process from the first disk failure. The general recommendation at this time is 8 drives or fewer. In general, the larger the drive, the fewer of them you should have in a RAID 5 configuration, due to the rebuild time and the likelihood of a second drive failure.
RAID 6
Disk requirements: 4. Usable disks: N-2.
RAID 6 is a striped array with dual distributed parity. Like RAID 5, it is a distributed block system, but with two parity stripes instead of one. This allows you to sustain a loss of two drives, dramatically reducing the risk of a total stripe failure during a rebuild operation. Also known as P+Q redundancy, it uses Reed-Solomon codes and isn’t practical to implement in software because of the math-intensive calculations that have to take place to write parity data to two different stripes. The current recommendation is to use 8 drives or more.
RAID 10
RAID 10 is a hybrid or nested striping scheme combining RAID 1 mirrors with a RAID 0 stripe. This is for high performing and fault tolerant systems. Like RAID 1, you lose half your available space. You could lose up to N/2 drives and still have a functioning array, as long as no mirror loses both of its disks. Duplexing each mirror between two drive chassis is common. You could lose a drive chassis and still function. The absence of parity means write speeds are high. Along with excellent redundancy, this is probably the best option for speed and redundancy.
RAID 0+1
RAID 0+1 is not interchangeable with RAID 10. There is one huge difference, and that is reliability. You can lose only one drive and still have a functioning array. The more drives in a single RAID 0 stripe, the greater the chance you take. Speed characteristics are identical to RAID 10. I have never implemented RAID 0+1 when RAID 10 was available.
RAID 50
Disk requirements: 6. Usable disks: (N-1)*R.
Since RAID 5 becomes more susceptible to failure with more drives in the array, keeping each RAID 5 stripe small, usually under 8 drives, and then striping them together with RAID 0 increases the reliability while allowing you to expand capacity. You will lose a drive per RAID 5 stripe, but that is a lot less than losing half of them in a RAID 10. Before RAID 6, this was used to get higher reliability in very large arrays of disks.
RAID 60
Disk requirements: 8. Usable disks: (N-2)*R.
RAID 60 is the exact same concept as RAID 50. Generally, a RAID 6 array is much less susceptible to an array failure during a rebuild of a failed drive, due to the dual parity that it uses. It still is not bulletproof, though RAID 6 arrays can be much larger than RAID 5 arrays before you hit the probability of a dual drive failure followed by another failure during the rebuild. I do not see many RAID 60 configurations outside of SAN internal striping schemes. You do lose twice as many drives’ worth of capacity as you do in a RAID 50 array.
RAID 100
RAID 100 is RAID 10 with an additional RAID 0 stripe. Bridging multiple drive enclosures is the most common use of RAID 100. It also reduces the number of logical drives you have to maintain at the OS level.
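The usable-disk formulas above are simple enough to turn into a quick capacity check. This Python sketch assumes that, for RAID 50 and RAID 60, N is the number of disks in each parity set and R is the number of sets striped together (the original table does not spell R out, but that reading matches the 6 and 8 disk minimums). RAID 10 uses the same N/2 as RAID 1, since it loses half its space. The function name is mine.

```python
def usable_disks(level: str, n: int, r: int = 1) -> int:
    """Usable disks for a RAID level, following the formulas in the table above.

    n: disks in the array (or in each parity set for RAID 50/60, an assumption)
    r: number of parity sets striped together (RAID 50/60 only, an assumption)
    """
    formulas = {
        "0": lambda: n,
        "1": lambda: n // 2,
        "10": lambda: n // 2,
        "5": lambda: n - 1,
        "6": lambda: n - 2,
        "50": lambda: (n - 1) * r,
        "60": lambda: (n - 2) * r,
    }
    if level not in formulas:
        raise ValueError(f"no formula for RAID {level}")
    return formulas[level]()

print(usable_disks("10", 24))      # 12
print(usable_disks("5", 8))        # 7
print(usable_disks("6", 8))        # 6
print(usable_disks("50", 8, r=3))  # 21 usable out of 24 disks
print(usable_disks("60", 8, r=3))  # 18 usable out of 24 disks
```

With 24 disks, RAID 50 keeps 21 disks of capacity against 12 for RAID 10, which is the space argument made in the RAID 50 entry.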
You can’t have your cake and eat it too. In the past, it was hard to justify the cost of RAID 10 unless you really needed speed and fault tolerance. RAID 5 was the default because in most situations it was good enough, offering near RAID 0 read speeds. If you had a heavy write workload, you took a penalty due to the parity stripe. RAID 6 suffers from this even more, with two parity stripes to deal with. Today, with the cost of drives coming down and the capacity going up, RAID 10 should be the default configuration for everything.
Here is a breakdown of how each RAID level handles reads and writes in order of performance.
RAID 0
Write operations: 1 operation. Notes: High throughput, low CPU utilization. No data protection.
Read notes: High throughput, low CPU utilization.
RAID 1
Write operations: 2 IOPs. Notes: Only as fast as a single drive.
Read operations: 1 IOP. Notes: Two read schemes are available: read data from both drives, or take data from the drive that returns it first. One gives higher throughput, the other faster seek times.
RAID 5
Write operations: 4 IOPs. Notes: Read-Modify-Write requires two reads and two writes per write request. Lower throughput, and higher CPU if the HBA doesn’t have a dedicated IO processor.
Read notes: High throughput and low CPU utilization normally; in a failed state, performance falls dramatically due to parity calculation and any rebuild operations that are going on.
RAID 6
Write operations: 6 IOPs. Notes: Read-Modify-Write requires three reads and three writes per write request. Do not use a software implementation if it is available.
This is not as easy as it should be. Between budgets, different storage types, and your requirements, any of the RAID levels could meet your needs. Let us work off of some base assumptions. Reliability is necessary, which rules out RAID 0 and probably RAID 0+1. Is the workload read or write intensive? A good rule of thumb is that with more than 10% writes you go RAID 10. In addition, if write latency is a factor, RAID 10 is the best choice. For read workloads, RAID 5 or RAID 6 will probably meet your needs just fine. One other thing to take into consideration: if you need lots of space, RAID 5 or RAID 6 may meet your IO needs through the sheer number of disks. Take the number of disks and divide by 4 for RAID 5 or by 6 for RAID 6, then do your per-disk IO calculations. You may find that they do meet your IO requirements.
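The divide-by-4 and divide-by-6 rule is the write penalty applied to a pure write workload. A slightly more general back-of-the-envelope version weights the penalty by the share of writes. The Python sketch below does that; the 180 IOs/Sec per disk is a made-up figure, the penalty of 2 for RAID 10 is assumed to match a plain mirror, and the model ignores controller cache entirely.

```python
# Physical IOs generated per host write, per RAID level.
WRITE_PENALTY = {"RAID 0": 1, "RAID 1": 2, "RAID 10": 2, "RAID 5": 4, "RAID 6": 6}

def host_iops(disks: int, iops_per_disk: float, level: str, write_fraction: float) -> float:
    """Rough host-visible IOs/Sec for an array, given the share of IOs that are writes."""
    if not 0.0 <= write_fraction <= 1.0:
        raise ValueError("write_fraction must be between 0 and 1")
    raw_iops = disks * iops_per_disk
    penalty = WRITE_PENALTY[level]
    # Each read costs one physical IO, each write costs `penalty` physical IOs.
    return raw_iops / ((1.0 - write_fraction) + write_fraction * penalty)

DISKS, PER_DISK = 24, 180  # hypothetical array and per-disk IO rate

for level in ("RAID 10", "RAID 5", "RAID 6"):
    all_writes = round(host_iops(DISKS, PER_DISK, level, 1.0))
    ten_percent = round(host_iops(DISKS, PER_DISK, level, 0.10))
    print(level, all_writes, ten_percent)
# RAID 10 2160 3927
# RAID 5 1080 3323
# RAID 6 720 2880
```

Even at 10% writes the parity levels give up a noticeable slice of the array, which is where the rule of thumb comes from.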
The type of IO, random or sequential, greatly affects your throughput. SQL Server has some fairly well documented IO information. One of the big things folks overlook is keeping their logs separate from their data files. I am not talking about all logs on one drive and all data on another, which buys you nothing. If you are going to do that, you might as well put them all on one large volume and use every disk available. You are guaranteeing that all IOs will be random. If you want to avoid this, you must separate your log files from data files AND from each other! If the log file of a busy database is sharing a drive with other log files, you reduce its IO throughput 3 fold and its data throughput 10 to 20 fold.
Disks from the same batch can suffer a similar fate. Correlated disk failures can be due to a manufacturing defect that affects a large number of drives. It can be very difficult to get a vendor to give you disks from different batches. Your best bet is to hedge against that and plan to structure your RAID arrays accordingly.
As hard disks get larger, the chance of an uncorrectable and undetected read or write failure goes up. On a desktop drive, the rated figure is one unrecoverable error per 10^14 bits read. A good example: an array of the latest two-terabyte SATA drives would be expected to hit this error in just one full pass of a 6 drive RAID 5 array. When this happens, it will trigger a rebuild event. The probability of hitting another failure during the rebuild is extremely high. Bianca Schroeder and Garth A. Gibson of Carnegie Mellon University have written an excellent paper on the subject. Read it; it will keep you up at night worrying about your current arrays. Enterprise class drives are supposed to protect against this. No study so far proves that out. That does not mean I am swapping out my SAS for SATA. Performance is still king. They do boast a much better error rate, 10^16, or 100 times better. Whether this number is accurate is another question altogether. Google also did a study on disk failure rates, Failure Trends in a Large Disk Drive Population. Google also found correlated disk failures, among other things. This is a necessary read as well. Eventually, RAID 5 just will not be an option, and RAID 6 will be where RAID 5 is today.
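The unrecoverable error numbers are easier to appreciate as a probability. The Python sketch below treats the spec sheet figure as an independent per-bit error chance, which is a simplifying assumption (real errors cluster), and uses decimal terabytes; the function name is mine.

```python
import math

def p_unrecoverable_read_error(bytes_read: float, bits_per_error: float) -> float:
    """Chance of at least one unrecoverable read error while reading bytes_read bytes.

    Assumes each bit fails independently with probability 1 / bits_per_error.
    """
    bits = bytes_read * 8
    # 1 - (1 - p)^bits, computed in a numerically stable way.
    return -math.expm1(bits * math.log1p(-1.0 / bits_per_error))

TB = 10**12  # decimal terabyte, as drive makers count

full_pass = 6 * 2 * TB  # reading all six 2 TB drives once
rebuild = 5 * 2 * TB    # reading the five survivors to rebuild the sixth

print(round(p_unrecoverable_read_error(full_pass, 1e14), 2))  # 0.62
print(round(p_unrecoverable_read_error(rebuild, 1e14), 2))    # 0.55
print(round(p_unrecoverable_read_error(rebuild, 1e16), 4))    # 0.0096
```

Under these assumptions a rebuild on desktop-class drives is roughly a coin flip, while the 10^16 rating, if it holds, drops that to about one percent.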
RAID doesn’t back your data up. You heard me. It is not a replacement for a real backup system. Write errors do occur. As database people, we are aware of atomic operations, the concept of an all-or-nothing operation, and recovering from a failed transaction. People assume the file system and disk are also atomic; they aren’t. NTFS does have a transaction system now, TxF, but I doubt SQL Server is using it. Disk drives limit data transfer guarantees to the sector size of the disk, 512 bytes. If you have the write cache enabled and suffer a power failure, it is possible to write only part of the 8k block. If this happens, SQL Server will read a mix of new and old data from that page, which is now in an inconsistent state. This is not a disk failure. The drive wrote every 512-byte block it could successfully. When the disk drive comes back online, the data on the disk is not corrupted at the sector level at all. If you have turned off torn page detection or page checksum because you believe it is a huge performance hit, turn it back on. Add more disks if you need the extra performance; don’t put your data at risk.
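A torn page is easy to simulate, and doing so shows why a page-level checksum catches what the disk itself reports as healthy. The Python sketch below is only an illustration: it uses CRC32 and a made-up 4-byte header, not SQL Server's actual page checksum algorithm or page layout.

```python
import zlib

SECTOR = 512
PAGE = 8192          # 16 sectors per page
HEADER = 4           # hypothetical header holding the checksum

def build_page(payload: bytes) -> bytes:
    """Return an 8K page: checksum header followed by the zero-padded payload."""
    if len(payload) > PAGE - HEADER:
        raise ValueError("payload too large for one page")
    body = payload.ljust(PAGE - HEADER, b"\x00")
    return zlib.crc32(body).to_bytes(HEADER, "little") + body

def page_is_consistent(page: bytes) -> bool:
    """Recompute the checksum over the body and compare it with the stored one."""
    stored = int.from_bytes(page[:HEADER], "little")
    return stored == zlib.crc32(page[HEADER:])

def torn_write(old_page: bytes, new_page: bytes, sectors_written: int) -> bytes:
    """Simulate power loss after only the first N sectors of the new page reached disk."""
    cut = sectors_written * SECTOR
    return new_page[:cut] + old_page[cut:]

old_page = build_page(b"A" * 8000)
new_page = build_page(b"B" * 8000)

on_disk = torn_write(old_page, new_page, sectors_written=5)

print(page_is_consistent(old_page))  # True
print(page_is_consistent(new_page))  # True
print(page_is_consistent(on_disk))   # False: every sector is valid, the page is not
```

Every 512-byte sector in the torn page was written correctly, so nothing below the database notices. Only a check that spans the whole 8K page does.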
Up next, a deeper look into IO types, stripe size, disk alignment, and file system cluster sizes.
12/03/2009 - UPDATE! There were a couple of bugs in the SQLIOCommandGenerator. SQLIOTools.zip has been updated with a new version.
------------------------------------
I often tell people one of the greatest things about SQL Server is that anyone can install it. I also tell people one of the worst things about SQL Server is that anyone can install it. Microsoft fostered a "black-box" approach to SQL Server in 7.0 and 2000. Thankfully, they are reversing this course. As a follow-on to my last article, capturing I/O patterns, we will take a quick look at building some synthetic tests based on those results. There are several tools on the market to test I/O systems, some of them free, some of them not. SQLIO has been around for several years. There are lots of good articles already on the web describing various uses for this tool. SQLIO was specifically designed to test the limits of your I/O system at different workloads. The problem is that people tend to run this tool, look at the best results, and assume that they will see the same results when the server goes live. Without understanding your current workloads, that is an unreasonable expectation at best. What you end up with is a misconfigured I/O system, lots of headaches, and no idea why the system performs so poorly.
I always advocate testing new systems before they go into production. I also understand that it isn't always an option. Having found myself in that exact situation recently, I've decided to take my own advice and pull the new storage off-line to do the proper testing. I'm also taking this opportunity to refine my testing methodology and gather as many data points as I can before the system goes live.
With my IO patterns in hand, I set out to build a couple of little tools to help me generate all the test scripts and manage the data. As usual, I built these as command line tools since I have no skill at all with GUIs. It is all in C# and I will be posting them up to Codeplex. You can download the tools here: SQLIOTools.zip. This zip has the two tools; they are beta and don’t have a ton of error checking built into them yet. The first tool, SQLIOCommandGenerator, does just that: it generates the batch file that has all the commands. It does depend on SQLIO.exe being in the same directory, as well as on a parameter file already being defined for it to use.
params.txt
X:\SQLIO_testfile0.dat 8 0x0 150240
The first parameter is the test file name that SQLIO will create on start up, or use if it already exists. Second is the number of threads that will access that file. Third is the affinity mask. Fourth is the file size in megabytes. Make sure to size the file large enough to be representative of a real database you would be housing on the system. If it is too small, it will simply fit in the RAID controller’s cache and give you inflated results. I also tend to use one thread per physical CPU core. Be careful though: if you are using a lot of files, having too many threads can cause SQLIO to run out of memory.
Calling SQLIOCommandGenerator:
SQLIOCommandGenerator 0.10
We assume -F<paramfile> -LS
-d,-R,-f,-p,-a,-i,-m,-u,-S,-v, -t not implemented
Usage: SQLIOCommandGenerator [OPTIONS]
Generates the command line syntax for the SQLIO.exe program output into a batch file.
Options:
  -f, --iopattern[=VALUE]                 Random, Sequential or Both
  -k, --iotype[=VALUE]                    Read,Write or Both
  -s, --seconds[=VALUE]                   Number of seconds to run each test 1(60) to 10(600) minutes is normal
  -c, --cooldown[=VALUE]                  Number of seconds pause between tests suggested minimum is 5 seconds.
      --os, --outstandingiostart[=VALUE]  Starting number of outstanding IOs 1
      --oi, --outstandingioincrament[=VALUE]  Multiply Outstanding IO start by X i.e 2
      --oe, --outstandingioend[=VALUE]    Ending Number of outstanding IOs i.e. 64
      --ol, --outstandingiolist[=VALUE]   Specific Outstanding IO List i.e. 1,2,4,8,16,32,64,128,256,512,1024
      --oss, --iosizestart[=VALUE]        Starting Size of the IO request in kilobytes i.e. 1
      --osi, --iosizeincrament[=VALUE]    Multiply IO size by X in kilobytes i.e. 2
      --ose, --iosizeend[=VALUE]          Ending number of outstanding IOs in kilobytes i.e. 1024
      --osl, --iosizeList[=VALUE]         Specific IO Sizes in kilobytes i.e. 1,2,4,8,16,32,64,128,256,512,1024
  -b, --buffering[=VALUE]                 Set the type of buffering None, All, Hardware, Software. None is the default for SQL Server
      --bat, --sqliobatchfilename[=VALUE] The name of the output batch file that will be created
  -?, -h, --help                          show this message and exit
So I passed it this command:
SQLIOCommandGenerator.exe -k=Both -s=600 -c=5 --os=1 --oi=2 --oe=256 --oss=1 --osi=2 --ose=1024 -b=all --bat=c:\wes_sqlio_bat.txt -f=both
That generates this sample:
:: Generated by SQLIOCommandGenerator
:: This relies on SQLIO.exe being in the same directory.
:: c:\wes_sqlio_bat.txt c:\paramfile.txt c:\outputfile.csv "description of the tests"
:: param1 sqlio parameter file, param2 output of each test to single csv file, param3 test description
SET paramfile=%1
SET outfile=%2
SET runtime=600
SET cooloff=5
SET desc=%3
@ECHO OFF
ECHO ComputerName: %COMPUTERNAME% > %OUTFILE%
ECHO Date: %DATE% %TIME% >> %OUTFILE%
ECHO Runtime: %RUNTIME% >> %OUTFILE%
ECHO Cool Off: %COOLOFF% >> %OUTFILE%
ECHO Parameters File: %PARAMFILE% >> %OUTFILE%
ECHO Description: %DESC% >> %OUTFILE%
ECHO Test Start >> %OUTFILE%
ECHO Command Line: sqlio -kW -s%RUNTIME% -frandom -b1 -o1 -LS -BY -F%PARAMFILE% >> %OUTFILE%
sqlio -kW -s%RUNTIME% -frandom -b1 -o1 -LS -BY -F%PARAMFILE% >> %OUTFILE%
timeout /T %COOLOFF%
ECHO End Date: %DATE% %TIME% >> %OUTFILE%
:: This batch will take approximately 264.0014 Hours to Execute.
The batch file has the instructions for calling it and what parameters you can pass into it. You can omit seconds and cooldown if you want to generate a more generic batch file. This tool is flexible enough for my needs. I can generate specific targeted tests when I have data to back that up, or I can generate more general tests to feel out the performance edges.
You may have noticed the estimated run time; that is pretty accurate. This is a worst case scenario where you have chosen pretty much every possible test to run. I wouldn’t recommend this. With the data we have already, we can narrow down our testing to just a few IO sizes and queue depths to keep the test well within reason.
SQLIOCommandGenerator.exe -k=Both -s=600 -c=5 --ol=2 --osl=8,64 -b=None --bat=c:\wes_sqlio_bat.txt -f=both
This batch will take approximately 80.08334 Minutes to Execute.
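The size of a test run is just the product of the options you pick, so it is worth doing the multiplication before committing a server to it. The Python sketch below is my own rough estimate, not the generator's code; it charges one cooldown per test, which may differ slightly from how the tool counts.

```python
def estimate_run(outstanding_ios, io_sizes_kb, io_types, io_patterns, seconds, cooldown):
    """Return (number of tests, total seconds) for a full cross of the chosen options."""
    tests = len(outstanding_ios) * len(io_sizes_kb) * len(io_types) * len(io_patterns)
    total_seconds = tests * (seconds + cooldown)
    return tests, total_seconds

# The targeted run: one queue depth, 8K and 64K requests, reads and writes, random and sequential.
tests, total_seconds = estimate_run(
    outstanding_ios=[2],
    io_sizes_kb=[8, 64],
    io_types=["read", "write"],
    io_patterns=["random", "sequential"],
    seconds=600,
    cooldown=5,
)
print(tests, round(total_seconds / 60, 1))  # 8 80.7
```

Eight ten-minute tests land close to the roughly 80 minutes reported above. Every extra IO size or queue depth multiplies the total, which is why the capture data matters for picking them.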
Much better! By focusing on our IO targets, we now have a test that is meaningful and repeatable. Why would you want to repeat this test over and over? Simple: not all RAID controllers are created equal. You may need to adjust several options before you hit the optimal configuration.
Now that I have my tests defined, I need to start running them and gathering information. There are some constants I always stay with. One, use diskpart.exe to sector align your disks. Two, format NTFS with a 64k block size. Since I’m doing these tests over and over, I wrote a little batch file for that too. Diskpart can take a command file to do its work. Once the RAID controller is in, I create an array and look at what disk number is assigned to it. As long as you don’t make multiple arrays, you will always get the same disk number. After that I format the volume accordingly. WARNING, I do use the /Y so the format happens without prompting for permission!
diskpart.txt
select disk 2
create partition primary align=64
assign letter=X
testvol.bat
diskpart /S z:\diskpart.txt
format x: /q /FS:NTFS /V:TEMP /A:64K /Y
I also use the RAID controller’s command line interface, if it has one, to make it easier to construct the tests and just let them run using a batch file as a control file. If that isn’t possible, don’t worry; the bulk of your time will be spent waiting for the tests to complete anyway.
As you have guessed, I have a tool to parse the output of the tests and import it into SQL Server or export it as a CSV file for easy access in Excel. SQLIOParser is also pretty simple to use.
SQLIOParser 0.20
Usage: SQLIOParser [OPTIONS]
Process output of the SQLIO.exe program piped to a text file.
Options:
  -c, --computername[=VALUE]        The comptuer name that the test was executed on.
  -s, --sqlserver[=VALUE]           The SQL Server you want to import the data into.
  -u, --sqluser[=VALUE]             If using SQL Server authentication specify a user
  -p, --sqlpass[=VALUE]             If using SQL Server authentication specify a password
  -t, --tablename[=VALUE]           The table you want to import the data into.
  -d, --databasename[=VALUE]        The database you want to import the data into.
  -f, --sqliofilename[=VALUE]       The file name you want to import the data from.
  -a, --sqliofiledirectory[=VALUE]  The directory containing the files you want to import the data from.
  -o, --csvoutputfilename[=VALUE]   The file name you want to export the data to.
  -?, -h, --help                    show this message and exit
It will work with a single file or import a set of files in a single directory. If you are importing to SQL Server you need to have the table already created.
CREATE TABLE [dbo].[SQLIOResults](
    [ComputerName] [varchar](255) NULL,
    [TestDescription] [varchar](255) NULL,
    [SQLIOCommandLine] [varchar](255) NULL,
    [SQLIOFileName] [varchar](255) NULL,
    [ParameterFile] [varchar](255) NULL,
    [TestDate] [datetime] NULL,
    [RunTime] [int] NULL,
    [CoolOff] [int] NULL,
    [NumberOfFiles] [int] NULL,
    [FileSize] [int] NULL,
    [NumberOfThreads] [int] NULL,
    [IOOperation] [varchar](255) NULL,
    [IOSize] [varchar](255) NULL,
    [IOOutstanding] [int] NULL,
    [IOType] [varchar](255) NULL,
    [IOSec] [decimal](18, 2) NULL,
    [MBSec] [decimal](18, 2) NULL,
    [MinLatency] [int] NULL,
    [AvgLatency] [int] NULL,
    [MaxLatency] [int] NULL
) ON [PRIMARY]
This is the same structure the CSV is in as well.
I will warn you that the results you get will not match your performance 100% once the server is in production. This shows you the potential of the system. If you have horrible queries hitting your SQL Server, those queries are still just as bad as before. Generally, I ignore max latency and min latency and focus on the average. What I am most worried about is how the system will respond as the IO load changes or queue depth increases. Remember, raw megabytes a second isn’t always king. The number of IOs at a given IO block size is also very important. I will go into great detail in the next article as I walk you through analyzing the results from my own system, so stay tuned for that.
These tests aren’t the end of your road. I still advocate playing back traces and seeing how the system responds with your exact workload whenever possible. If you can’t do that then using tools like SQLIO is better than nothing at all. We are also working under the assumption that we are upgrading or replacing an existing production server. If that isn’t the case and this is a brand new deployment using SQLIO will help you know what your I/O system is capable of before you have a problem with bad queries or other issues that always crop up on new systems.
You can always do more testing. It is almost a never-ending process. My goal isn’t to give you the end solution, just to give you another tool to pull out when you need it. As always, I look forward to your feedback!
We often take the advice given to us on forums or in articles at face value. Even though the authors almost always say things like “your mileage may vary” or “may not apply to your situation” people still assume it is the gospel. Sometimes it is lack of experience. Other times it is just lack of knowledge on how to verify these things on your own. In this article I’m going to give you a tool to look at what SQL Server is doing at the disk level and allow you to make better decisions on how to configure your underlying disks.
There are several things you need to know about how SQL Server accesses the database files and the implications of that before you can construct a proper testing methodology.
http://technet.microsoft.com/en-us/library/cc966500.aspx covers the basics. There are a few things I will highlight here.
ACID (Atomicity, Consistency, Isolation, and Durability) is what makes our database reliable. The ability to recover from a catastrophic failure is key to protecting your data.
WAL (Write-Ahead Logging) is how ACID is achieved. Basically, the log record must be flushed to disk before the data file is modified.
Stable media isn’t just the disk drive. A controller with a battery backed cache is also considered stable. Since SQL Server can request blocks as big as 64KB make sure your controller can handle that block size request in cache. Some older controllers only do a 16KB block or smaller.
With the requirement of stable media SQL Server creates and opens all files with a specific set of flags. FILE_FLAG_WRITETHROUGH tells the underlying OS not to use write caching that isn’t considered stable media. So, the local disk cache is normally bypassed. Not all hard drives honor the flag though, Some SATA/IDE drives ignore it. Usually, the drive manufacturer provides a tool to turn off write caching. If you are using desktop drives in a mission critical situation be aware of the potential for data loss. FILE_FLAG_NO_BUFFERING tells the OS not to buffer the file ether. At this point the only cache available will be the battery backed or other durable cached on the controller.
SQL Server uses asynchronous access for data and log files. This allows IO request to queue up and use the IO system as efficiently as possible. The main difference between the two are SQL Server will try and gather writes to the data file into bigger blocks but the log is always written to sequentially.
All of these rules apply to everything but tempdb. Since tempdb is recreated at every restart, recoverability isn’t an issue.
Searching around you will find these generalities about SQL Server’s IO patterns:
Log Writes: Sequential, 512 bytes to 64KB
Data File Read/Writes: 8KB
Read ahead (more important to Enterprise Edition): 8KB to 125KB
Bulk Insert: 8KB to 128KB
Create Database: 512 byte, full initialize on log file only
Backup: Sequential Read/Write, 1 MB
Restore: Sequential Read/Write, 64K
DBCC – CHECKDB: Sequential Read, 8K – 64K
DBCC – DBREINDEX (Read Phase): Sequential Read (see Read Ahead)
DBCC – DBREINDEX (Write Phase): Sequential Write, any multiple of 8K up to 128K
DBCC – SHOWCONTIG
Now that we have an idea of what SQL Server is supposed to be doing, it’s time to verify our assumptions.
There are a few tools that will allow you to capture file activity at the system level. Process Monitor is a free tool from Microsoft that I will use to collect some baseline information. In its standard configuration Process Monitor captures a ton of stuff and spools it to the page file. So, before we begin, we need to change the default configuration.
Capturing IO data using Process Monitor.
Filter to apply:
Process Name is sqlservr.exe
Operation is Read
Operation is Write
Columns to choose:
Process Name, PID, Path, Detail, Date & Time, Time of Day, Relative Time, Duration, TID, Category
Change the backing file so events spool to a file on disk instead of the page file.
The maximum number of events it will capture is 199 million. On my system that is easily enough to capture 12 hours of activity. Once you have a good sample you can save it as an XML or CSV file. If you choose CSV, it is pretty easy to import the data into SQL Server using SSIS or your tool of choice.
I import the CSV into a raw table first.
Raw table to import into.
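-- Staging table: one column per Process Monitor column chosen above.
CREATE TABLE [SQLIO].[dbo].[pm_imp] (
    [Process Name]  VARCHAR(12),
    [PID]           INT,          -- process IDs can exceed the SMALLINT range
    [Path]          VARCHAR(255),
    [Detail]        VARCHAR(255),
    [Date & Time]   DATETIME,
    [Time of Day]   VARCHAR(20),
    [Relative Time] VARCHAR(50),
    [Duration]      REAL,
    [TID]           INT,          -- thread IDs can exceed the SMALLINT range
    [Category]      VARCHAR(6)
)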
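If you would rather skip SSIS, a plain BULK INSERT is one option. This is only a rough sketch under a few assumptions: the file path is hypothetical, the CSV columns were saved in the same order as the table columns, and the instance is SQL Server 2017 or later, since that is where the FORMAT = 'CSV' option for quoted fields is available.

```sql
-- Sketch only: load the Process Monitor CSV into the raw table.
-- The path below is a made-up example; point it at your own capture file.
BULK INSERT SQLIO.dbo.pm_imp
FROM 'C:\Temp\procmon_capture.csv'
WITH (
    FORMAT = 'CSV',      -- parse quoted CSV fields (SQL Server 2017+)
    FIELDQUOTE = '"',    -- fields are assumed to be wrapped in double quotes
    FIRSTROW = 2,        -- skip the header row
    TABLOCK
);
```

If the date or duration columns fail to convert, loading everything into VARCHAR columns first and converting afterwards is a safer fallback.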
Next I create a cleaner structure with some additional information separated from the detail provided.
-- Keep only database files and pull the IO length out of the Detail text.
SELECT [Process Name] AS ProcessName,
       PID AS ProcessID,
       [Path] AS DatabaseFilePath,
       Detail,
       [Date & Time] AS EventTimeStamp,
       [Time of Day] AS TimeOfDay,
       [Relative Time] AS RelativeTime,
       [Duration],
       TID AS ThreadID,
       Category AS IOType,
       -- The text between 'Length: ' and ', I/O' in Detail
       SUBSTRING(Detail, CHARINDEX('Length: ', Detail) + 8,
                 CHARINDEX(', I/O', Detail) - CHARINDEX('Length: ', Detail) - 8) AS IOLength,
       CASE RIGHT([Path], 3)
            WHEN 'mdf' THEN 'Data'
            WHEN 'ndf' THEN 'Data'
            WHEN 'ldf' THEN 'Log'
       END AS FileType
INTO SQLIOData
FROM dbo.pm_imp
WHERE RIGHT([Path], 3) IN ('mdf', 'ndf', 'ldf')
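The IOLength column above is still text. Process Monitor may write the length with thousands separators, so if you want to sort or sum by size you will probably need to strip the commas and cast to a number. Here is a small sketch of the same substring logic against a single sample string. The Detail value is a made-up example of the format, not a line from a real capture.

```sql
-- Sketch only: parse the IO length out of one sample Detail string
-- and convert it to an integer number of bytes.
DECLARE @detail VARCHAR(255);
SET @detail = 'Offset: 1,048,576, Length: 65,536, I/O Flags: Non-cached, Priority: Normal';

SELECT CAST(
         REPLACE(
           SUBSTRING(@detail,
                     CHARINDEX('Length: ', @detail) + 8,
                     CHARINDEX(', I/O', @detail) - CHARINDEX('Length: ', @detail) - 8),
           ',', '')          -- drop the thousands separators
       AS INT) AS IOLengthBytes;
```

The same REPLACE and CAST can be wrapped around the IOLength expression in the SELECT ... INTO if you prefer to store the size as an INT from the start.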
Once we have the data cleaned up a bit we can now start doing some analysis on it.
Queries for interesting patterns.
This query gives us our read and write counts.
SELECT COUNT(*) AS IOCount, IOType
FROM SQLIOData
GROUP BY IOType
ORDER BY COUNT(*) DESC
This one shows us the size of the IO and what type of operation it is.
SELECT COUNT(*) AS IOCount, IOLength, IOType
FROM SQLIOData
GROUP BY IOLength, IOType
ORDER BY COUNT(*) DESC
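Raw counts can be hard to compare between reads and writes. As a possible variation, not something from the original set of queries, the sketch below adds each IO size's share of its own operation type using a windowed SUM over the grouped counts. The PctOfType column name is my own.

```sql
-- Sketch only: share of each IO size within reads and within writes.
SELECT IOType,
       IOLength,
       COUNT(*) AS IOCount,
       CAST(100.0 * COUNT(*) / SUM(COUNT(*)) OVER (PARTITION BY IOType)
            AS DECIMAL(5, 2)) AS PctOfType
FROM SQLIOData
GROUP BY IOType, IOLength
ORDER BY IOType, IOCount DESC;
```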
This is a look at activity by file type data or log.
SELECT COUNT(*) AS IOCount, FileType
FROM SQLIOData
GROUP BY FileType
ORDER BY COUNT(*) DESC
Since we are capturing the thread id we can see how many IO’s by thread.
SELECT COUNT(*) AS IOCount, ThreadID
FROM SQLIOData
GROUP BY ThreadID
ORDER BY COUNT(*) DESC
We can also look at IO types, sizes and count by file helping you see which ones are hot.
SELECT COUNT(*) AS IOCount, DatabaseFilePath, IOType, IOLength
FROM SQLIOData
WHERE DatabaseFilePath LIKE '%filename%'
GROUP BY DatabaseFilePath, IOType, IOLength
HAVING COUNT(*) > 10000
ORDER BY DatabaseFilePath, COUNT(*) DESC
Now that we see exactly what our IO patterns are we can make adjustments to the disk subsystem to help scale it up or tune it for a particular pattern.
This is just another tool in your tool belt, and a supplement to using fn_virtualfilestats to track file usage. I use it to get a better idea of the size of the IOs being issued. Using the two tools together, I can see the size of the IOs within a window of time reported by my fn_virtualfilestats capture routine.
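To line the two up, one approach is to restrict the Process Monitor data to the same window your file stats snapshots cover. The sketch below assumes the window boundaries come from two of your own snapshots; the timestamps shown are placeholders, and my actual capture routine is not shown here.

```sql
-- Sketch only: IO sizes seen by Process Monitor inside one snapshot window.
-- The two timestamps are hypothetical placeholders.
DECLARE @WindowStart DATETIME, @WindowEnd DATETIME;
SET @WindowStart = '2010-01-15T09:00:00';
SET @WindowEnd   = '2010-01-15T09:15:00';

SELECT DatabaseFilePath, IOType, IOLength, COUNT(*) AS IOCount
FROM SQLIOData
WHERE EventTimeStamp >= @WindowStart
  AND EventTimeStamp <  @WindowEnd
GROUP BY DatabaseFilePath, IOType, IOLength
ORDER BY DatabaseFilePath, COUNT(*) DESC;

-- For comparison: cumulative per-file counters since the instance started.
-- Two snapshots of this output are needed to get numbers for a window.
SELECT DbId, FileId, NumberReads, BytesRead, NumberWrites, BytesWritten
FROM sys.fn_virtualfilestats(NULL, NULL);
```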
Always verify your assumptions, or advice from others.
At least once a year I give a large talk on disk subsystems, IO and SQL Server. It’s a ground-up tour, from the nuts and bolts of how a hard drive works through SANs and Solid State Disks. The reason I give this presentation so often is that it is one of the most requested topics and one of the most misunderstood. The problem often lies in the fact that the DBA may not know much about different storage systems but does know that storage is very important to the job. With the rise of SAN, iSCSI and other storage solutions, DBAs have less and less control over the disk system that their SQL Server relies on. It’s my goal to give them, or you, the tools needed to effectively present your needs to the storage team, hopefully without a major amount of fuss and arguments. If you know how and why it works the way it works, you can make logical requests in the language that your storage folks understand.
The presentation is meant to lay a foundation that you can build upon to expand your knowledge of all things I/O.
This article series will be slightly expanded over what my presentation normally covers, since I’m only restricted by your willingness to read what I write. It will still be a condensed version of storage systems but I’ll put up as many reference links as I can.
Storage Area Networks
Network Attached Storage/iSCSI
Solid State Disks
SQL Server and The File System
Understanding Mean Time to Failure and Other Failure Metrics
Tools and Techniques To Monitor SQL Server and I/O
Some topics may be a single post and some may span several; I won’t know for sure until I get done writing them. As requests come in I may try to post on specific questions, or at a minimum point you in the right direction.
Stay Tuned….
-Wes