Using CLR integration to compress BLOBs/CLOBs in SQL Server 2005

  • Comments posted here are about the content posted at http://www.sqlservercentral.com/columnists/ymartinez/2798.asp

  • Thanks for the article - I had read it on CodeProject a couple of weeks ago (http://www.codeproject.com/cs/database/blob_compress.asp). Wanted to make sure each site had the same author, which it appears is the case!

    It's a clever application of .NET procs to do something that is definitely not suitable for T-SQL. I do wonder though if this is something better done by the business layer to avoid complex processing of data on the DB Server. Also, the compression of data across the link between the client/web-server & DB tiers may be desired - (de)compressing at the DB server helps with storage but not with traffic levels.

    However, the ideas presented could certainly be useful in smaller 2-tier apps where the DB server is on a LAN and transparent access to the decompressed BLOB data via a view is desired. I'd be interested in hearing about any experiences people have had with such an approach...

    Thanks for the article

  • Thanks for the article.

    Honestly though, I'm not sure how useful this will be in practice. To begin with, as mentioned, this is not going to be a good way to deal with larger objects (larger than 5MB was mentioned at the end of the article). Also, moving a compress/decompress function into the database's process could be a nightmare for performance.

    Overall, the idea of compressing data before pushing it into a BLOB is a great idea, but this should almost certainly be handled in application code rather than inside the database.

    Still, it does get one thinking about it.

  • I understand that your primary intention is "to show how to use CLR integration in SQL Server to extend the engine's functionality and provide an overview of what can be done with the new BLOB/CLOB data", but this is exactly the kind of thing you shouldn't do with CLR integration.

    Sure, there are exceptions, such as a single-user app using SQL Express, but for a robust, multi-tier, multi-user solution, this would have dire effects on the performance of the database server. As others have commented, this is best done in another tier of the solution.

    You need to make it very, very clear that this would be a poor architectural choice.

  • Good points there, but I guess I'd like to hear more about why this is a bad move or where it makes sense and where it doesn't. If you're driving the home page of a website, maybe this is bad.

    If you're storing off purchase order images that will rarely be accessed, is this a bad decision? Even in a multi-user system?

    Maybe someone will write some architectural articles about how to recognize good and bad uses of the CLR.

  • Interesting article - nice job.

    Oh, how I love the posts stating "don't do this or that"... you'd think that their authors either 1) got too bloated on Microsoft documentation circa 1998, or 2) have been taking Joe Celko posts at face value. I often think of a passage from Ken Henderson's second SQL Server 2000 book, where he quotes Steve McConnell:

    "To a great designer, not applying knowledge is tantamount to not having obtained the knowledge in the first place" (Henderson, 2002).

    I would hope that those same readers would at least post their alternative, stating the reasons for their belief to go along with their rebuttal, rather than a simple “nay”.

    Frankly, I enjoy the creativity here - if not for the idea in the form presented, then certainly for the thinking behind it, the very type of creativity that makes us all better at database development. Lest we forget, someday the relational database will not be as we know it now. In fact, many are surprised that concepts invented back in the mid-to-late 1970s are still around! I believe that in the next 20 years, advances in hardware, for example, might lead us to no longer think of what were once "data retrieval problems" as we now know them. Dare we venture to think that what was once unacceptable in the relational database model might gain acceptance? Sure, you may never use this type of coding, but who cares? That is completely immaterial from my point of view.

    Again - nice job here. Thank goodness that we didn’t get another article on SET NOCOUNT ON, ROWCOUNT, or something so exciting as the advantages of SET over SELECT.

    References:

    Henderson, Ken (2002). The Guru's Guide to SQL Server Stored Procedures, XML, and HTML. Indianapolis: Addison-Wesley.

  • Thanks to all of you for taking the time to comment on this article; there are some valid points here (and some I disagree with). A valid point: this should be done in another tier. For highly concurrent solutions, like an enterprise-level faxing application or a large website, this approach will present performance challenges. But for such applications you would probably use unmanaged code anyway, RIGHT!?

    I disagree with “but this is exactly the kind of thing you shouldn't do with CLR integration”.  Here is a rule of thumb: you should not replace things better left to T-SQL with CLR integration. I can’t imagine how you would do something like this in T-SQL!

    I also recall a SQL 2000 application for server-side encryption, and it was not exactly a poor architectural choice. This is "A solution", not "THE solution". It does provide the flexibility to avoid investing in application changes, testing and redeployment, BUT you should not use it for highly scalable, highly concurrent applications.

    Let me answer the question about possible applications, especially of the third-party product I mentioned in my article. The product is currently being integrated into a government archiving application and a change tracking application, and it is being reviewed by a large financial software provider for document management.

    About performance: the worst part is returning compressed data. It accounts for about two-thirds of the time and is proportional to blob size. Some preliminary tests on a dual-core 1.4 GHz Pentium with 2 GB of RAM show about 40 MB/s decompression throughput (around 30 MB/s for compression). That should give you some room before you decide to rewrite an application layer (or buy a new disk array). All in all, I advise you to take the solution, test its performance under a regular load, and if it is good enough, use it. Just remember to monitor your systems to avoid inadvertently outgrowing it (but that probably goes for everything else you write).

    Once again, thanks for your comments and I look forward to answering any questions or inquiries that might arise.

  • And what about also encrypting the compressed content? Months ago I wrote a UDT that automatically compresses/decompresses its content. It is not terribly useful, since a UDT can't be bigger than 8K, but I found that automatic encryption of the compressed content would be a nice feature (a rough sketch of the idea follows below).

    Again, we can debate whether this should be done in the database layer using the CLR or in the application layer, but as I explain in the article, if other RDBMSs offer this feature, someone is probably using it.

    You can find this UDT at http://jcarlossaez.blogspot.com/2006/11/compressed-and-encrypted-user-defined.html
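    Something along these lines, perhaps (a rough sketch only, not the actual UDT code from the link above; the class and method names are made up, and key/IV management is left out):

        using System.IO;
        using System.IO.Compression;
        using System.Security.Cryptography;

        public static class BlobPacker
        {
            // Compress first, then encrypt the compressed bytes.
            public static byte[] CompressAndEncrypt(byte[] data, byte[] key, byte[] iv)
            {
                using (MemoryStream output = new MemoryStream())
                using (RijndaelManaged aes = new RijndaelManaged())
                using (ICryptoTransform encryptor = aes.CreateEncryptor(key, iv))
                {
                    CryptoStream crypto = new CryptoStream(output, encryptor, CryptoStreamMode.Write);
                    DeflateStream deflate = new DeflateStream(crypto, CompressionMode.Compress, true);

                    deflate.Write(data, 0, data.Length);
                    deflate.Close();   // flush the deflate footer into the crypto stream
                    crypto.Close();    // flush the final cipher block

                    return output.ToArray();   // ToArray still works on a closed MemoryStream
                }
            }
        }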

  • Juan Carlos:

    SQL Server 2005 has built-in encryption, a pretty cool feature that MS acquired from three different vendors. Check Books Online; you will probably find plenty of documentation there. Having implemented encryption for SQL 2000 myself, I can tell you that managing keys is the biggest problem you will face, but 2005 has already fixed that.

    You are right that other RDBMSs have server-side column-level compression; MySQL is one of them. There must be a reason for including such functionality, and it works just fine for them. Why can't we do it in SQL Server? Is it less of a database engine? But then again, it is a matter of approach and what works for you based on the challenges your environment poses.

  • This sort of application, as well as the encryption one, makes a lot of sense on mobile devices, where you need to minimize space and maximize security.

    That said, neither of the two makes sense for *large* database deployments. So I think Steve was right on the money: it is where you use this that makes or breaks its usefulness.

    Just my $0.02


    * Noel

  • I'm a little late reading this, but have a couple of comments to add.

    My first thought was to agree with Ian, that I think this would be better done in the middle tier. It would reduce the bandwidth used, reduce the db server load, and still give you the space savings. Since it would be almost the same chunk of code, it's a 'where should the code go' question. The only downside I see to middle tier placement is that if someone wants to do some type of bulk load or update, they'll have to be sure to use the same compression method/code. Actually, I think I still agree with Ian!

    I don't see this as a bad use of the CLR. If you've decided to do it on the server, clearly CLR provides the best way to accomplish it.

    I'm not sure I agree with Peter that it would kill server performance. I'm not sure what the tradeoff would be; I would guess CPU and memory usage would go up, while disk I/O would decrease along with space usage. If you needed to optimize, the compression could be done asynchronously, with a flag on the table indicating whether a row is compressed and a job running behind to handle the compression (see the sketch after this post). Not much you could do about decompression though, unless you went with a split solution where you handled compression on the server and decompression on a different tier. It might not be the best fit for things like serving images, but it would probably do well for something like document storage.

    I think it parallels how we do compression in the file system. Sometimes we just mark the folder as compressed, sometimes we compress the file manually. Which is better or more correct?

    I also think that if SQL offered it as a native option you'd probably see a lot of people use it.
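    Roughly what I have in mind for the background job (a sketch only; the Documents table, its Id/Data/IsCompressed columns, and the connection string are made up for illustration, and error handling is omitted):

        using System;
        using System.Data;
        using System.Data.SqlClient;
        using System.IO;
        using System.IO.Compression;

        class CompressionJob
        {
            static void Main()
            {
                // Hypothetical connection string and schema.
                using (SqlConnection conn = new SqlConnection(
                    "Data Source=.;Initial Catalog=MyDb;Integrated Security=SSPI"))
                {
                    conn.Open();

                    // Pick up rows that were inserted uncompressed.
                    DataTable pending = new DataTable();
                    using (SqlDataAdapter da = new SqlDataAdapter(
                        "SELECT TOP (100) Id, Data FROM Documents WHERE IsCompressed = 0", conn))
                    {
                        da.Fill(pending);
                    }

                    foreach (DataRow row in pending.Rows)
                    {
                        byte[] compressed = Compress((byte[])row["Data"]);

                        // Write the compressed bytes back and flag the row as done.
                        using (SqlCommand update = new SqlCommand(
                            "UPDATE Documents SET Data = @data, IsCompressed = 1 WHERE Id = @id", conn))
                        {
                            update.Parameters.Add("@data", SqlDbType.VarBinary, -1).Value = compressed;
                            update.Parameters.Add("@id", SqlDbType.Int).Value = row["Id"];
                            update.ExecuteNonQuery();
                        }
                    }
                }
            }

            static byte[] Compress(byte[] data)
            {
                using (MemoryStream output = new MemoryStream())
                {
                    using (DeflateStream deflate = new DeflateStream(output, CompressionMode.Compress, true))
                    {
                        deflate.Write(data, 0, data.Length);
                    }
                    return output.ToArray();
                }
            }
        }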

  • The NTFS file system comparison is great!

    Sometimes you compress by using file/folder attributes because you get seamless access to the file; it's convenient, and you get some compression benefit for an acceptable overhead.

    Likewise, sometimes explicit compression is better - you offload the compression overhead to another one-off task, but you lose the ease of access.

    As always, it really depends. I do agree with others though; it was a great article from the point of view of seeing just what sort of things you can do in the DB engine with the new CLR integration - good or bad - and it is a welcome distraction from some of the more mundane DBA task articles.

  • Nice article...

    Got a question...

    So what is happening here is that you have created a .NET project and built an assembly that contains one type with functions that compress and decompress binary data, and that assembly is in some way registered with SQL 2005.

    In order to utilize this functionality one would need to store something like an image in the database. So since .NET binary data is wider (NEED SOME help here), the functions compress that binary data so that it can be stored in SQL...

    Looking for some straightening here... I have one foot in and one out.. HELP!

    Erik..............

    Dam again!

  • Hi Yoel - I enjoyed your article on blob compression - but I believe there is a bug in the compression function...

    I was starting to work on the same concept (as well as blob encryption) when I came across your article. I started playing with your code using the sort of data that I have been intending to compress. I noticed that for most of my data I wasn't getting the kind of compression ratios I was expecting, and in many cases the resulting compressed data was actually 2x larger. This seemed to be the case for our smaller blobs (around 2K to 6K bytes), which make up the majority of the data we are looking to compress (some rows contain much larger data, too). I compared this with the same data compressed from a file, using the same Deflate method in a standalone .NET program, and got 80% compression or so.

    I determined that the cause of the problem is that the blobData.Length value (the length of the byte array) is not reliable - it didn't actually match the length of the blob data. It was often 6-8K when the actual data was much smaller, even zero length. When I used blob.Length (the length of the SqlBytes parameter) while writing to the compression stream, I got the results I expected.

    Just wanted to let you and the readers know what I found... please validate as you see fit.

    Oh, BTW - we are thinking about using blob compression to reduce our SAN chargeback costs (which are insanely high). The blob data we're compressing needs to be online with the active data in our application, but it isn't updated after it's inserted and isn't selected very often relative to the rest of the data, so the compute costs are manageable.

  • JoeA:

    Thanks for pointing this out. You are right; I actually responded yesterday to an issue someone had with this code in SQL Server SP1. He was setting the same varbinary(max) variable twice, from larger to smaller, and was getting garbage after reducing the size. I looked into it and found a problem with .Buffer: that property was not refreshed by SQL every time, so it kept the garbage from previous calls. I don't know if Microsoft solved that issue in SQL Server SP2, BUT I changed the code to use the stream version of SqlBytes. By the way, MS has resolved the size limitation in SP2.

    There is a problem with the Buffer property that I couldn't fix, because it is "an undocumented feature". I looked into SQLCompress.NET's behavior and it does not have this problem, since it uses streaming methods to manipulate the data, which made me wonder: have you tried that product? Whatever the case, I love to see my code put to good use and to see people committed to trying it out and improving it. Thanks again for your help and comments.
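    For reference, the stream-based approach looks roughly like this (a simplified sketch, not the exact code from the article; the function name and buffer size are illustrative):

        using System.Data.SqlTypes;
        using System.IO;
        using System.IO.Compression;
        using Microsoft.SqlServer.Server;

        public partial class UserDefinedFunctions
        {
            [SqlFunction(IsDeterministic = true, DataAccess = DataAccessKind.None)]
            public static SqlBytes CompressBlob(SqlBytes blob)
            {
                if (blob.IsNull)
                    return SqlBytes.Null;

                using (MemoryStream compressed = new MemoryStream())
                {
                    using (DeflateStream deflate = new DeflateStream(compressed, CompressionMode.Compress, true))
                    {
                        // Read through blob.Stream instead of relying on blob.Buffer,
                        // the property that kept stale data between calls.
                        byte[] chunk = new byte[4096];
                        int read;
                        while ((read = blob.Stream.Read(chunk, 0, chunk.Length)) > 0)
                            deflate.Write(chunk, 0, read);
                    }
                    return new SqlBytes(compressed.ToArray());
                }
            }
        }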
