I'm a little late reading this, but I have a couple of comments to add.
My first thought was to agree with Ian that this would be better done in the middle tier. It would reduce the bandwidth used, reduce the DB server load, and still give you the space savings. Since it would be almost the same chunk of code, it's really a 'where should the code go' question. The only downside I see to middle-tier placement is that if someone wants to do some type of bulk load or update, they'll have to be sure to use the same compression method/code. Actually, I think I still agree with Ian!
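For illustration, here's roughly what the middle-tier version might look like in C#. This is just a sketch: the table and column names are made up, and I'm assuming the same Deflate algorithm the article's CLR function uses, so both paths stay compatible.

using System.Data.SqlClient;
using System.IO;
using System.IO.Compression;

class BlobSaver
{
    // Compress with Deflate so the result matches what the CLR function produces.
    static byte[] Compress(byte[] data)
    {
        using (var output = new MemoryStream())
        {
            using (var deflate = new DeflateStream(output, CompressionMode.Compress))
            {
                deflate.Write(data, 0, data.Length);
            }
            return output.ToArray();
        }
    }

    // Only the compressed bytes cross the wire, which is where the
    // bandwidth and DB server load savings come from.
    static void SaveDocument(string connectionString, string fileName, byte[] content)
    {
        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand(
            "insert into _FilesCompressed (FileName, FileContent) values (@name, @content)", conn))
        {
            cmd.Parameters.AddWithValue("@name", fileName);
            cmd.Parameters.AddWithValue("@content", Compress(content));
            conn.Open();
            cmd.ExecuteNonQuery();
        }
    }
}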
I don't see this as a bad use of the CLR. If you've decided to do it on the server, clearly CLR provides the best way to accomplish it.
I'm not sure I agree with Peter that it would kill server performance. I'm not sure what the tradeoff would be; my guess is that CPU and memory usage would go up, while disk I/O would decrease along with space usage. If you needed to optimize, the compression could be done asynchronously, with a flag on the table indicating whether a row is compressed and a job running behind to handle the compression. There's not much you could do about decompression, though, unless you went with a split solution where you handled compression on the server and decompression on a different tier. It might not be the best fit for things like serving images, but it would probably do well for something like document storage.
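Something like this is what I have in mind for the asynchronous piece (just a sketch; the flag column and batch size are made up, and dba_clrCompress stands in for whatever compression function you register):

-- rows arrive uncompressed with IsCompressed = 0; a SQL Agent job
-- sweeps through in small batches and compresses them in the background
alter table _Files add IsCompressed bit not null default 0

-- job step, repeated until no uncompressed rows remain:
update top (100) f
set FileContent = dbo.dba_clrCompress(f.FileContent),
    IsCompressed = 1
from _Files f
where f.IsCompressed = 0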
I think it parallels how we do compression in the file system. Sometimes we just mark the folder as compressed, sometimes we compress the file manually. Which is better or more correct?
I also think that if SQL offered it as a native option you'd probably see a lot of people use it.
The NTFS file system comparison is great! Sometimes you compress by using file/folder attributes because you get seamless access to the file, it's convenient and you get some compression benefits for some acceptable overhead.
Likewise, sometimes explicit compression is better - you offload the compression overhead to another one-off task, but you lose the ease of access.
As always, it really depends. I do agree with others, though; it was a great article from the point of view of seeing just what sort of things you can do in the DB engine with the new CLR integration, good or bad, and is a welcome distraction from some of the more mundane DBA task articles.
Got a question...
So what is happening here is that you have created a .NET project and compiled it into an assembly that contains one type with functions that compress and decompress binary data, and that assembly is in some way registered with SQL 2005.
In order to utilize this functionality, one would need to store something like an image in the database. So since .NET binary data is wider ( NEED SOME help here ), the functions compress that binary data so that it can be stored in SQL.
Looking for some straightening here... I have one foot in and one out.. HELP!
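For anyone else trying to get their head around the registration step: roughly, yes. The class library is compiled into a DLL, and that DLL is then registered in SQL Server 2005 with T-SQL along these lines (the path, assembly, class, and function names below are placeholders):

-- one-time server setting to allow CLR code to run
exec sp_configure 'clr enabled', 1
reconfigure
go

-- load the compiled assembly into the database
create assembly BlobCompression
from 'C:\assemblies\BlobCompression.dll'
with permission_set = safe
go

-- expose one of its methods as a T-SQL function
create function dbo.dba_clrCompress (@blob varbinary(max))
returns varbinary(max)
as external name BlobCompression.[UserDefinedFunctions].fn_compress
go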
Hi Yoel - I enjoyed your article on blob compression - but I believe there is a bug in the compression function...
I was starting to work on the same concept (as well as blob encryption) when I came across your article. I started playing with your code using the sort of data that I have been intending to compress. I noticed that for most of my data I wasn't getting the kind of compression ratios I was expecting, and in many cases the resulting compressed data was actually 2x larger. This seemed to be the case for our smaller blobs (around 2K to 6K bytes), which make up the majority of the data that we are looking to compress (some rows contain much larger data, too). I compared this with the same data compressed from a file, using the same Deflate method in a standalone .NET program, and got 80% compression or so.
I determined that the cause of the problem is that the blobData.Length value (the length of the byte array) is not reliable; it didn't actually match the length of the blob data. It was often 6-8K when the actual data was much smaller, even zero length. When I used blob.Length (the length of the SqlBytes parameter) instead when writing to the compression stream, I got the results I expected.
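To make the fix concrete, here is a reconstruction of what the corrected compression function might look like. This is my sketch based on the description above, not the article's exact code, and the class and method names are guesses:

using System.Data.SqlTypes;
using System.IO;
using System.IO.Compression;
using Microsoft.SqlServer.Server;

public partial class UserDefinedFunctions
{
    [SqlFunction]
    public static SqlBytes fn_compress(SqlBytes blob)
    {
        if (blob.IsNull) return SqlBytes.Null;

        using (var output = new MemoryStream())
        {
            using (var deflate = new DeflateStream(output, CompressionMode.Compress))
            {
                // The fix: write blob.Length bytes, not blob.Buffer.Length,
                // because Buffer can be a larger scratch array padded with garbage.
                deflate.Write(blob.Buffer, 0, (int)blob.Length);
            }
            return new SqlBytes(output.ToArray());
        }
    }
}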
Just wanted to let you and the readers know what I found... please validate as you see fit
Oh, BTW - We are thinking about using blob compression to reduce our SAN chargeback costs (which are insanely high). The blob data we're compressing needs to be online with the active data in our application, but isn't updated after it's inserted, and isn't selected very often relative to the rest of the data; so the compute costs are manageable.
Thanks for pointing this out. You are right. I actually responded yesterday to an issue someone had with this code in SQL Server SP1. He was setting the same varbinary(max) variable twice, from larger to smaller, and was getting garbage after reducing the size. I looked into it and found a problem with .Buffer: that property was not refreshed by SQL Server every time, and it kept all the garbage from previous calls. I don't know if Microsoft solved that issue in SQL Server SP2, but I changed the code to use the stream version of SqlBytes. By the way, MS has resolved the size limitation in SP2.
There is a problem with the Buffer property that I couldn't fix, since it is "an undocumented feature". I looked into SQLCompress.NET's behavior, and it does not have this problem because it uses streaming methods to manipulate the data. Which made me wonder: have you tried that product? Whatever the case, I love to see my code put to good use and to see people committed to trying it out and improving it. Thanks again for your help and comments.
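For readers following along, here's a sketch of the stream-based variant described above. It reads through the SqlBytes Stream property and never touches .Buffer, which is what sidesteps the stale-buffer problem (again, an illustration, not the author's exact code):

// same usings and class as the earlier sketch
[SqlFunction]
public static SqlBytes fn_compressStream(SqlBytes blob)
{
    if (blob.IsNull) return SqlBytes.Null;

    using (var output = new MemoryStream())
    {
        using (var deflate = new DeflateStream(output, CompressionMode.Compress))
        {
            // copy in chunks from the value's stream instead of its buffer
            byte[] chunk = new byte[4096];
            int read;
            while ((read = blob.Stream.Read(chunk, 0, chunk.Length)) > 0)
            {
                deflate.Write(chunk, 0, read);
            }
        }
        return new SqlBytes(output.ToArray());
    }
}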
I have not tried SQLCompress.NET. We are less inclined to use smaller third-party add-ons. Not that they're bad, but in the end the corporate red tape involved in getting them in house and supportable exceeds the cost of building some of those capabilities ourselves... not to mention the general hassle of dealing with "the man".
--uncompressed BLOBs stored here:
create table _Files (FileName varchar(200), FileSize int, FileContent varbinary(max))

--load the data here.... (irrelevant)

--compressed BLOBs will be stored here:
create table _FilesCompressed (FileName varchar(200), FileSize int, FileContent varbinary(max))

--an attempt to compress 56 documents blows up:
insert into _FilesCompressed (FileName, FileSize, FileContent)
select f.FileName, null, dbo.dba_clrCompress(f.FileContent)
from _Files f