We often take the advice given to us on forums or in articles at face value. Even though the authors almost always say things like “your mileage may vary” or “may not apply to your situation” people still assume it is the gospel. Sometimes it is lack of experience. Other times it is just lack of knowledge on how to verify these things on your own. In this article I’m going to give you a tool to look at what SQL Server is doing at the disk level and allow you to make better decisions on how to configure your underlying disks.
There are several things you need to know about how SQL Server accesses the database files and the implications of that before you can construct a proper testing methodology.
http://technet.microsoft.com/en-us/library/cc966500.aspx covers the basics. There are a few things I will highlight here.
ACID (Atomicity, Consistency, Isolation, and Durability) is what makes our database reliable. The ability to recover from a catastrophic failure is key to protecting your data.
WAL (Write-Ahead Logging) is how ACID is achieved. Basically, the log record must be flushed to disk before the data file is modified.
Stable media isn’t just the disk drive. A controller with a battery backed cache is also considered stable. Since SQL Server can request blocks as big as 64KB make sure your controller can handle that block size request in cache. Some older controllers only do a 16KB block or smaller.
With the requirement of stable media SQL Server creates and opens all files with a specific set of flags. FILE_FLAG_WRITETHROUGH tells the underlying OS not to use write caching that isn’t considered stable media. So, the local disk cache is normally bypassed. Not all hard drives honor the flag though, Some SATA/IDE drives ignore it. Usually, the drive manufacturer provides a tool to turn off write caching. If you are using desktop drives in a mission critical situation be aware of the potential for data loss. FILE_FLAG_NO_BUFFERING tells the OS not to buffer the file ether. At this point the only cache available will be the battery backed or other durable cached on the controller.
SQL Server uses asynchronous access for data and log files. This allows IO request to queue up and use the IO system as efficiently as possible. The main difference between the two are SQL Server will try and gather writes to the data file into bigger blocks but the log is always written to sequentially.
All of these rules apply to everything but tempdb. Since tempdb is recreated at restart every time recoverability isn’t an issue.
Searching around you will find these generalities about SQL Server’s IO patterns
Log Writes
Sequential 512 bytes to 64KB
Data File Read/Writes
8KB
Read ahead – more important to Enterprise Edition
8KB to 125KB
Bulk Insert
8KB to 128KB
Create Database
512 byte – full initialize on log file only.
Backup Sequential Read/Write
1 MB
Restore Sequential Read/Write
64K
DBCC – CHECKDB
Sequential Read 8K – 64K
DBCC – DBREINDEX
(Read Phase) Sequential Read (see Read Ahead)
(Write Phase) Sequential Write
Any multiple of 8K up to 128K
DBCC – SHOWCONTIG
Now that we have an idea of what SQL Server is suppose to be doing its time to verify our assumptions.
There are a few tools that will allow you to capture the file activity at the system level. Process Monitor is a free tool from Microsoft that I will use to collect some base line information. In it’s standard configuration Process Monitor captures a ton of stuff and uses the page file to spool the info to. So, before we begin we need to change the default configuration.
Capturing IO data using process monitor.
Filter to apply
process is sqlservr.exe Operation is Read Operation is Write
Columns to choose.
Process Name PID PATH Detail Date & Time Time of Day Relative Time Duration TID Category
Change Backing File.
The maximum number of events it will capture is 199 million. This is enough on my system to capture 12 hours of activity easily. Once we have a good sample you can save it off as an XML file or CSV. Choosing CSV it is pretty easy to import the data into SQL Server using SSIS or your tool of choice.
I import the CSV into a raw table first.
Raw table to import into.
CREATE TABLE [SQLIO].[dbo].[pm_imp] ( [Process Name] VARCHAR(12), [PID] SMALLINT, [Path] VARCHAR(255), [Detail] VARCHAR(255), [Date & Time] DATETIME, [Time of Day] VARCHAR(20), [Relative Time] VARCHAR(50), [Duration] REAL, [TID] SMALLINT, [Category] VARCHAR(6))
Next I create a cleaner structure with some additional information separated from the detail provided.
SELECT [Process Name] AS ProcessName, PID AS ProcessID, PATH AS DatabaseFilePath, Detail, [Date & Time] AS EventTimeStamp, [Time of Day] AS TimeOfDay, [Relative Time] AS RelativeTime, [Duration], TID AS ThreadID, Category AS IOType, substring(detail,charindex('Length: ',detail,0) + 8, (charindex(', I/O',detail,0) - charindex('Length: ',detail,0) - 8)) AS IOLength, CASE reverse(left(reverse(PATH),3)) WHEN 'mdf' THEN 'Data' WHEN 'ndf' THEN 'Data' WHEN 'ldf' THEN 'Log' END AS FileTypeINTO SQLIODataFROM dbo.pm_impWHERE reverse(left(reverse(PATH),3)) IN ('mdf','ndf','ldf')
Once we have the data cleaned up a bit we can now start doing some analysis on it.
Queries for interesting patterns.
This query gives us our read and write counts.
SELECT count(* ) IOCount, IOTypeFROM SQLIODataGROUP BY IOTypeORDER BY count(* ) DESC
This one shows us the size of the IO and what type of operation it is.
SELECT count(* ) IOCount, IOLength, IOTypeFROM SQLIODataGROUP BY IOLength,IOTypeORDER BY count(* ) DESC
This is a look at activity by file type data or log.
SELECT count(* ) IOCount, FileTypeFROM SQLIODataGROUP BY FileTypeORDER BY count(* ) DESC
Since we are capturing the thread id we can see how many IO’s by thread.
SELECT count(* ) IOCount, ThreadID FROM SQLIOData GROUP BY ThreadID ORDER BY count(* ) DESC
We can also look at IO types, sizes and count by file helping you see which ones are hot.
SELECT count(* ) IOCount, databasefilepath, iotype, iolengthFROM SQLIODataWHERE databasefilepath LIKE '%filename%'GROUP BY databasefilepath, iotype, iolengthHAVING count(* ) > 10000ORDER BY databasefilepath, count(* ) DESC
Now that we see exactly what our IO patterns are we can make adjustments to the disk subsystem to help scale it up or tune it for a particular pattern.
This is just another tool in your tool belt. This is a supplement to using fn_virtualfilestats to track file usage. I use it to get a better idea of the size of the IO’s being issued.Using these two tools I can see the size of the IO’s in a window of time that is reported by my fn_virtualfilestats capture routine.
Always verify your assumptions, or advice from others.
I’ve toyed with the CLR in SQL Sever 2005 off and on since the first Yukon beta had it enabled. And I’ll be honest with you, I was not a fan.It wasn’t like “YOU got chocolate in my peanut butter!” kind of moment for me. I really thought it was going to be a disaster of biblical proportions. As SQL Server DBA’s we caught a break, adoption wasn’t exactly stellar. The problem was there are enough restrictions and little gotchas to keep developers from whole sale abandoning Transact SQL for something more familiar. Fast forward a few years and now I’m not so scared.My biggest worry back then was memory usage. I’m still not very comfortable with it, but on a 64-bit platform you can mitigate those issues by adding more memory. On a 32-bit platform you could cause all kinds of damage by squeezing the lower 4GB memory space to the point you could have connection and backup failures due to lack of memory. Oh and the fix is usually restarting SQL Server. An example of this comes directly from http://msdn.microsoft.com/en-us/library/ms131075.aspx
Scalable Memory Usage
In order for managed garbage collection to perform and scale well in SQL Server, avoid large, single allocation. Allocations greater than 88 kilobytes (KB) in size will be placed on the Large Object Heap, which will cause garbage collection to perform and scale much worse than many smaller allocations. For example, if you need to allocate a large multi-dimensional array, it is better to allocate a jagged (scattered) array.
88KB!
This memory thing is serious.
The other biggie is what you can, or cannot do using the CLR.
Again from MSDN http://msdn.microsoft.com/en-us/library/ms131047.aspx
SAFE is the most reliable and secure mode with associated restrictions in terms of the allowed programming model. SAFE assemblies are given enough permission to run, perform computations, and have access to the local database. SAFE assemblies need to be verifiably type safe and are not allowed to call unmanaged code.
UNSAFE is for highly trusted code that can only be created by database administrators. This trusted code has no code access security restrictions, and it can call unmanaged (native) code.
EXTERNAL_ACCESS provides an intermediate security option, allowing code to access resources external to the database but still having the reliability guarantees of SAFE.
Most restrictive to least restrictive permissions. Something you don’t worry about in general as a C# programmer but in the database its always an issue in some way.
What it boils down to:
If you are just talking to SQL Server using basic C# stuff leave it in SAFE which is the default.
If you need access to the file system or the registry and some other limited stuff EXTERNAL_ACCESS is the way to go.
IF you want to have the ability to completely tank a production SQL Server UNSAFE puts it all into your hands. You can call unmanaged code via P/Invoke, all bets are off.
Some additional light reading on what libraries can and can’t be called in the CLR.
http://msdn.microsoft.com/en-us/library/ms403273.aspx
Fun stuff, no Finalizers or static fields, read-only static fields are ok though. You will see why this is important to me a little later on.
http://msdn.microsoft.com/en-us/library/ms403279.aspx
T-SQL vs. CLR
The other thing I had been promoting, and not always correctly, is putting complicated math functions in CLR. Generally, I’ve found that most math problems run faster in the CLR over native T-SQL. And I’ve found for the most part that holds true for the core algorithm. Once you add data retrieval into the mix things shift back in T-SQL’s favor for a lot of operations. Like everything else, test your ideas using real world scenarios or as close as you can before deciding on one technology over another. I prime example for me was coding up Pythagorean and Haversine equations for the classic distance between two zip codes in T-SQL and C# via CLR. Running test data through an array in the C# solution it ran rings around the T-SQL function I had coded up but once it had to start pulling and pushing data back to the database the T-SQL solution was the clear winner.
Another aspect where the CLR can be much better is string manipulation. I’ve written a couple of small UDF’s to handle some of this since using the LIKE ‘%’ would cause a table scan anyway the CLR UDF was faster internally when dealing with the string than T-SQL was using all the string handling functions.
I’m also seeing quite a bit on using the CLR for rolling aggregates and other kinds of aggregation problems. I don’t have any personal experience in that yet with the CLR.
There are also some things that aren’t practical at all using T-SQL, some would say you shouldn’t be using the database for some of this stuff in the first place but that is an argument for a different post.
I’ve recently started working on my most complex project using the CLR, some aspects have been covered by other folks like Adam Machanic, Robin Dewson and Jonathan Kehayias but there was some specific requirements that I needed.
Thus was born….
SQL Server File System Tools
This is a codeplex hosted project and all the source code is available there for your viewing pleasure.
I’ve done a lot of C# stuff but this was my first hard core CLR app for SQL Server.
What the assembly does is pretty simple, store files in the database ether native, encrypted or compressed.Yoel Martinez wrote up a nice UDF that does blob compression using the CLR. Between this and examples in Pro SQL Server 2005 on storing files in the database I knew I could do what I needed to do.
The wrinkle in my project was not just reading the file and storing it compressed it was putting it back on disk compressed as well. Enter #ziplib (SharpZipLib). This library allows you to pretty easily create standard zip files that even Windows Explorer can open and extract from. So with all the bits in place I set out to build my little tool.
The first thing I did was put together all the samples I’d found build them up as a set of stored procedures instead of UDF’s and just got the file in and out working. Next I added compression via C#’s DeflateStream to see what it would take to get the data flowing in and out and what the performance hit in memory and time would start looking like. At this point I was pretty optimistic I could knock this thing out in a day or two tops. That was all fine and dandy until I started integrating the #ziplib library. My initial goal was to have the assembly set to EXTERNAL_ACCESS since that was the most restrictive security model.
Since the guys that wrote #ziplib didn’t have the CLR in mind there are several things that break without UNSAFE set. As I mentioned earlier the use of finalizers and static fields were the two big ones. I will at some point recode those parts but for now they are still in place. The second thing is the library covers a lot more functionality that I actually need, So I’ve removed the bits I can without refactoring the library. The resulting DLL isn’t horribly big at this point but I figure when I get around to coding up the finalizers I’ll refactor down to what I need then. One big plus for me though is #ziplib is all managed code written in C# so it is pretty easily added directly into my DLL so I don’t have to register two assemblies or call down to the file system to a unmanaged DLL. Compression is handled by RijndaelManaged which is a built in .net 2.0 libraries.
The big downer for me was trying to debug the the code in Visual Studio 2008, when it did work it was ok but It would fail to connect or fail to register the assemblies so I just fell back to injecting debug messages and running tests manually in SSMS.
One thing I really like about programming languages like C# is method overloading, I really wished you could do that with stored procedures! Since I can’t there were only two options, a stored proc that had lots of flags and variables that may or may not be used and handle it all under the covers or just build each option into a proc with simple callers and a descriptive name. I voted for option two. Some of the T-SQL procedures are used internally by the CLR procedures while all the CLR procedures are called by the user.
Here is the list procedures and what they do.
Called by CLR procedures as helpers InsertFile Is called by every proc that inserts a file into the database. SavePassphrase Called by StorePassPhrase to handle insert into database. RetrievePassPhrase Called by any proc that has to decrypt a file stream RetrieveFile Called by any proc that retrieves a file from the database
Called by Users
T-SQL Procedures
RetrieveFileDetails Get details on a single file or every file stored in the database.
CLR Procedures
StorePassPhrase Give it a password and it generates a secure hash stored into the database for symmetric encryption
Below all store a file from the file system into the database. StoreFile StoreFileEncrypted StoreFileCompressed StoreFileEncryptedCompressed
Below all retrieve a file from the database back to the file system. ExtractFileToDisk ExtractFileToDiskEncrypted ExtractFileToDiskCompressed ExtractFileToDiskEncryptedCompressed
Below all retrieve a file from the database but returns a record set with the file name and the BLOB data. ExtractFileToRecord ExtractFileToRecordEncrypted ExtractFileToRecordCompressed ExtractFileToRecordEncryptedCompressed
And lastly, I put in an xp_getfiledetails clone since I wanted a way to verify the file is on disk and get attributes it seemed pretty straight forward since I’m getting the same details when i write the file to the database anyway.
This project isn’t done yet. there are a few more things to be added other than the code cleanup I mentioned already.
Off line decryption tool so the files dumped to disk still encrypted can be worked with.
Additional stored procedures for searching for files by tag or by attributes like name, size, etc.
A real installer and not a zip file with T-SQL scripts.
After that it goes into maintained mode with no new features but work on speeding it up, reducing the memory impact and fixing any bugs that are found. I really want to avoid this growing into a huge library, Keep it simple, do one thing and do it well.
Here are some things that helped me along the way.
Visual Studio 2008 could have used notepad but hey I’m getting lazy in my old age.
JetBrains ReSharper 4.5 If you are using Visual Studio ReSharper is a must. I feel like I’m programming the the stone age without it.
GhostDoc Free tool to help you document your C# code using XMLDoc. Yet something else I wished I could do with stored procedures
Doxygen If you are building documentation and have XMLDoc in your code this can make it easier to gather it all together. It isn’t perfect but it is free.
Both solid text Pro SQL Server 2005 has a chapter on CLR
This one is dedicated to just CLR and was also invaluable to me.
http://www.sqlclr.net Devoted to the CLR with some nice resources.
My Twitter buddies are always there to answer a question or two!
Until next time!
-Wes