K. Brian Kelley - Databases, Infrastructure, and Security

IT Security, MySQL, Perl, SQL Server, and Windows technologies.

Architecture Reasons for Putting BLOBs in the Database

I was participating in a Twitter chat looking at suggestions and recommendations for developers on how to make the best use of SQL Server. One of the points that came up was about BLOBs (Binary Large OBjects) within the database. The general recommendation is to keep BLOBs out of the DB. I brought up SharePoint as a counter-argument to that recommendation. Then folks started joking about SharePoint. SharePoint does a lot of things that make DBAs and System Administrators/Active Directory Administrators tear their hair out. However, I'm sure SharePoint keeps BLOBs in the DB for a good reason or two.

Let me first state that I don't like BLOBs in the database. I like metadata in the database with the files stored somewhere else. The app reads the metadata and then goes and accesses the file(s) based on that metadata. Why do I like this?

  • It keeps the database small. This is good for performance and reduces my backup/restore headaches.
  • An RDBMS is not a file system. It isn't optimized to serve as one.
  • Things like implementing antivirus scanning become much, much harder when the files live inside the database.
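
To make that pattern concrete, here's a minimal sketch of the metadata-only layout. The table and column names are hypothetical, not taken from any particular product; the idea is simply that the database holds a pointer (a UNC path) plus descriptive attributes, while the file itself lives on the file system.

CREATE TABLE dbo.DocumentMetadata (
    DocumentID   INT IDENTITY(1,1) PRIMARY KEY,
    Title        NVARCHAR(200) NOT NULL,
    FilePath     NVARCHAR(400) NOT NULL,  -- UNC path to the file on the file server
    ContentType  NVARCHAR(100) NOT NULL,
    SizeBytes    BIGINT        NOT NULL,
    ModifiedUtc  DATETIME2(0)  NOT NULL DEFAULT SYSUTCDATETIME()
);

-- The app reads the row, then goes to the file server to fetch the file itself.
INSERT INTO dbo.DocumentMetadata (Title, FilePath, ContentType, SizeBytes)
VALUES (N'Q3 Budget', N'\\fileserver\docs\finance\q3-budget.xlsx',
        N'application/vnd.ms-excel', 48213);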

So why would a system like SharePoint implement BLOBs in the database? I can see several reasons.

 

Backups: It's Hard to Synchronize Between the Database and the File System

If you have two separate systems, synchronization usually poses a challenge. When you have systems that are very different, both in type AND size, synchronization is extremely challenging. Consider trying to take a backup of the file system for all the files stored in a document management system like SharePoint. During the time you're doing the backup, more files could show up. Now, consider the database. If it only contains metadata, it's likely to back up much, much faster. So its backup gets done faster, meaning it will miss those incoming files. The files are on the file system, but there is no metadata in the database. Now we have an inconsistency between the two systems.

How do we generally solve this situation? We solve it by taking a backup of the file system first, then the database. This should ensure that you minimize the likelihood of a file existing and the metadata not being there. However, consider if you've got different systems backing up the DB versus the file system. How do you synchronize those systems so that the file system fully completes before the database backup kicks off? It may not be so easy to coordinate those backups.
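
As a rough illustration of the database side of that coordinated backup, here's a sketch. The database name, backup path, and the idea of gating it on the file-system job are assumptions for illustration only; the file-system backup itself would be handled by whatever backup tool the organization uses.

-- Hypothetical: this step is scheduled to run only AFTER the file-system
-- backup job has reported success, so the metadata backup doesn't get ahead
-- of the file backup.
BACKUP DATABASE [DocMetadata]
    TO DISK = N'\\backupshare\sql\DocMetadata_Full.bak'
    WITH CHECKSUM, COMPRESSION, INIT;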

 

Restore: Making Sure You Have the Right Set of Backups

Now you have to keep the pair matched up. Worst case, if you don't have a matched set, you pick a database backup that's older than the file system backup. You'll have files you don't have metadata for, but at least everything within the app works.

There are additional procedures if coordination is required. That's additional work and potential confusion. If you find you don't have a matched pair, it's even more work. Recovery situations can be (and usually are) stressful. Disaster recovery situations are even more so. These additional procedures will likely increase the time to recover. It also means your recovery documentation has to be detailed and complete. It should be already, but even more so with this type of architecture.
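
One way to sanity-check a pairing during a restore is to inspect the backup header before restoring. Again, the file names and database name below are hypothetical; the point is confirming the database backup's finish time lines up with the file-system backup you intend to use.

-- Hypothetical: compare BackupFinishDate against the file-system backup's timestamp
RESTORE HEADERONLY FROM DISK = N'\\backupshare\sql\DocMetadata_Full.bak';

-- If the pair matches (or the database backup is the older of the two), restore it
RESTORE DATABASE [DocMetadata]
    FROM DISK = N'\\backupshare\sql\DocMetadata_Full.bak'
    WITH CHECKSUM, RECOVERY;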

 

Backups: What about during the day?

A lot of organizations don't run file system backups all day long. They usually pick a point during the night and that's when the backups kick off. If that isn't sufficient for recovery purposes, then you're looking at modifying file system backup processes to take the new application into account. Databases, on the other hand, tend to be backed up more frequently. If all the files are in the database, I just have to leverage what I'm likely already doing for other databases. Yes, the databases and, therefore, the backups will be larger (and slower), but I likely already have infrastructure for running database backups more often than once a day. I can leverage that existing infrastructure without having to build new infrastructure and design new processes.
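
For example, if the BLOBs live in the database, intraday protection can ride on the transaction log backups most shops already run. The schedule and names below are assumptions, not anything prescribed by SharePoint or SQL Server.

-- Hypothetical intraday job (e.g., every 15 minutes via SQL Server Agent):
-- documents uploaded during the day are covered by the same log chain
-- as everything else in the database.
BACKUP LOG [DocStore]
    TO DISK = N'\\backupshare\sql\DocStore_Log.trn'
    WITH CHECKSUM, COMPRESSION;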

 

Application Use: What about document versioning?

Products like SharePoint provide document versioning. We see this in source control systems as well as in document history. Windows Server offers versioning on the file system via Shadow Copy, but the number of versions you can recover is determined by the number of snapshots you keep. You can probably see how difficult it just became to ensure you can pull old versions of documents back up in the application.

So what if I use BLOBs? Things become significantly easier. I don't have to worry about operating system configuration or about taking snapshots (in the Shadow Copy sense), and I don't have to worry about losing a version when a snapshot is deleted to make room for a newer one.
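
Here's a minimal sketch of what in-database versioning can look like, with hypothetical table and column names: each version is just another row, so nothing ages out when a snapshot is pruned.

CREATE TABLE dbo.DocumentVersion (
    DocumentID  INT            NOT NULL,
    VersionNo   INT            NOT NULL,
    Content     VARBINARY(MAX) NOT NULL,   -- the BLOB itself
    CreatedUtc  DATETIME2(0)   NOT NULL DEFAULT SYSUTCDATETIME(),
    CONSTRAINT PK_DocumentVersion PRIMARY KEY (DocumentID, VersionNo)
);

-- Pull an old version straight back out of the database.
SELECT Content
FROM dbo.DocumentVersion
WHERE DocumentID = 42 AND VersionNo = 3;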

 

Application Use: Moving to a Different System

As soon as I involve Shadow Copy and snapshots for versioning, I complicate the ability to move the files to a new system. When would this come up?

  • When I detect my file system has a fault that's going to bring the system down.
  • When I need bigger hardware.
  • When I want to move to a different environment or a different location.

If it's all in the database, I can restore the database backup and I've got it all. That makes things a whole lot simpler and easier.
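
As a sketch of that migration path (database and file names are hypothetical): one restore on the new server brings the documents, their metadata, and their version history along together.

-- Hypothetical move to a new server: restore the single backup and relocate
-- the data and log files to the new hardware's paths.
RESTORE DATABASE [DocStore]
    FROM DISK = N'\\backupshare\sql\DocStore_Full.bak'
    WITH MOVE N'DocStore'     TO N'E:\SQLData\DocStore.mdf',
         MOVE N'DocStore_log' TO N'F:\SQLLogs\DocStore_log.ldf',
         CHECKSUM, RECOVERY;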

 

I know there are counter-arguments for every reason I've presented. My point in presenting these architecture reasons is that they do weigh into whether, for a given system, we put BLOBs in the database or not. Considering the pros and cons for a given system is important to making the right choice. As with some other design choices, there isn't a black and white answer.

 

 


Comments

Posted by thomas.kellerer on 25 October 2013

Another reason: putting the BLOB into the database is faster than using the filesystem.

Please see this test done by Microsoft:

research.microsoft.com/.../default.aspx

Posted by K. Brian Kelley on 25 October 2013

To a point. Even the abstract clearly lists the ranges where each is more beneficial and the range where other factors come into play.

Posted by jamyer on 26 October 2013

For a SharePoint environment I like all the data in the database. If the blob data is numerous and static, then it starts to be worth the administrative burden to keep it in Remote BLOB Storage (RBS). The material cost of storage generally being cheaper for RBS than for SQL data is another consideration.

Posted by Jeremy Brown on 26 October 2013

Great article Brian. There is no cut-and-dry answer other than "what makes sense" architecturally. Sometimes filestream or RBS looks right, sometimes HFS, sometimes pointers, others just a straight up BLOB in the Db. As with any decision, you have to consider how much it's worth to you in implementation overhead vs ongoing maintenance cost.

Posted by billhol 40227 on 26 October 2013

Smaller database sizes do not always result in better performance. Adding an index will increase database size, but it (when harnessed) should also increase performance. Database pages are 8KB offsets into a file - they tell NTFS where to _start_ reading (not where to stop reading). A LOB cannot be indexed, but its pages are not being read during some other object's index scans. Bigger is frequently faster.

Backup size is an important factor. But if a LOB must be backed up, it does not fundamentally matter whether the LOB is in the file system or the database. Still, LOB fragmentation and access time are concerns, as Jim Gray discussed in his research paper. It would be a mistake to use Jim's findings (the ranges) as a cookbook. Rather, Jim's paper ultimately encourages readers to find their own truth, by testing the design choice.

When the file system is used to persist a LOB, you cannot "solve" the inability to "synchronize" (i.e. to ensure transactional consistency) by backing up the file system first and the database second. In the context of bulk logging, Jim discusses the ACID properties of a transaction (a transaction must be Atomic, Consistent, Isolated, and Durable). By backing up the file system first (in a separate process), that system's transactions (when partially within the database and partially within the file system) cannot be made Atomic. A lack of atomicity can be tolerable for any given system, but that lack is not a solution when transactional integrity ("synchronization") is required. A system's tolerance for "orphans" (whether an "orphan" exists within the file system or within the database) is a judgement call that has to be rooted in probabilities of failures. But the effects of "orphans" can be readily exposed (and considered) in controlled/tested failure conditions.

Ignoring in-row data, LOBs for individual SQL Server tables have classically been persisted within their own structure for each table (indid 255). Starting with SQL Server 2012, LOBs can also be persisted within a FileTable, which alleviates some of the concerns about the complexity of anti-virus checks. A FileTable can allow or prohibit non-transactional (file system) access. I believe FileTable is yet another offshoot of the failed en.wikipedia.org/.../WinFS effort.

Fragmentation and LOB access performance ultimately boil down to a scalability concern. Even transactional consistency can be a scalability concern. For example, how does a system behave when the file system is out of space? As such, I think it can be a fatal mistake to preemptively choose (file system versus database) before scalability testing has been performed and failure points have been tested/considered. I believe there will always be compromises that must be made. But until the results of comparative tests are weighed against a system's business needs, I am not comfortable with choosing one or the other.

As Jim discussed, measuring LOB fragmentation is not very easy ("SQL Server [2005] does not provide facilities to report fragmentation of large object data or to defragment such data"), and CLEANTABLE did not exist in 2005. My guess is that Jim used a hex editor in conjunction with his patterned files.

I believe SharePoint's choice (to persist LOBs within the database) is rooted in its business needs. I have to assume SharePoint requires transactional consistency (for checking files in or out). Further evidence for SharePoint's business need is within support.microsoft.com/.../841057  ("Support for changes to the databases that are used by Office server products and by Windows SharePoint Services"), where it states:

{

Examples of unsupported database changes include, but are not limited to, the following:

[snip]

* Running DBCC_CHECKDB WITH [sic - should be "with"] REPAIR_ALLOW_DATA_LOSS

}

Allowing data loss is likely to cause orphans (of which only some can be uncovered via a use of CHECKCONSTRAINTS). Nobody can predict how SharePoint will behave (after allowing data loss). Similarly, a multi-cylinder engine with a damaged crankshaft is unlikely to behave predictably. Flogging that simile: Unlike SharePoint YMMV.

In short, both options should be tested :)
