I was participating in a Twitter Chat looking at what suggestions and recommendations for developer on how to make the best use of SQL Server. One of the points that came up was about BLOBs (Binary Large OBjects) within the database. The general recommendation is to keep BLOBs out of the DB. I brought up SharePoint as a counter-argument to the recommendation. Then folks starting joking on SharePoint. SharePoint does a lot of things that makes DBAs and System Administrators/Active Directory Administrators tear their hair out. However, I'm sure SharePoint includes BLOBs in the DB for a good reason or two.
Let me first state that I don't like BLOBs in the database. I like metadata in the database with the files stored somewhere else. The app reads the metadata and then goes and accesses the file(s) based on that metadata. Why do I like this?
- It keeps the database small. This is good for performance and for my backup/restore headaches.
- An RDBMS is not a file system. It is not optimized for such.
- Things like implementing antivirus becomes much, much harder.
So why would a system like SharePoint implement BLOBs in the database? I can see several reasons.
Backups: It's Hard to Synchronize Between the Database and the File System:
If you have two separate systems, synchronization usually poses a challenge. When you have systems that are very different, both in type AND size, synchronization is extremely challenging. Consider trying to take a backup of the file system for all the files stored in a document management system like SharePoint. During the time you're doing the backup, more files could show up. Now, consider the database. If it only contains metadata, it's likely to backup much, much faster. So its backup gets done faster, meaning it will miss those incoming files. The files are on the file system but there is no metadata in the database. Now we have inconsistency between the two systems.
How do we generally solve this situation? We solve it by taking a backup of the file system first, then the database. This should ensure that you minimize the likelihood of a file existing and the metadata not being there. However, consider if you've got different systems backing up the DB versus the file system. How do you synchronize those systems so that the file system fully completes before the database backup kicks off? It may not be so easy to coordinate those backups.
Restore: Making Sure You Have the Right Set of Backups
Now you have to keep the pair matched up. Worst case scenario, if you don't have a matched set, is to pick an older database backup compared to the file system backup. You'll have files you don't have metadata to, but at least everything within the app works.
There are additional procedures if a coordination is required. That's additional work and potential confusion. If you find you don't have a matched pair, it's even more work. Recovery situations can be (and usually are) stressful. Disaster recovery situations are even more so. These additional procedures will likely increase the time to recover. It also means your documentation for recovery have to be detailed and complete. They should be already, but even more so with this type of architecture.
Backups: What about during the day?
A lot of organizations don't run file system backups all day long. They usually pick a point during the night and that's when the backups kick off. If that isn't sufficient for recovery purposes, then you're looking at modifying file system backup processes to take into account the new application. Databases, on the other hand, tend to be backed up more frequently. If all the files are in the database, I just have to leverage what I'm likely already doing for other databases. Yes, the databases and, therefore, the backups will be larger (and slower), but likely I have more infrastructure around handling database backups more frequently than once a day. I can leverage that existing infrastructure without having to build new infrastructure and design new processes.
Application Use: What about document versioning?
Products like SharePoint provide document versioning. We see this in source control as well as history. Windows Server has versioning using Shadow Copy, but the number of copies you have is determined by the number of snapshots you have. You can probably think about how difficult things just became to ensure you can pull old versions of documents back up in the application.
So what if I use BLOBs? Things become signficantly easier. I don't have to worry about operating system configurations, taking snapshots (Shadow Copy meaning), and I don't have to worry about losing a version when a snapshot is deleted to make room for a newer one.
Application Use: Moving to a Different System
As soon as I involve Shadow Copy and snapshots for versioning, I complicate the ability to move the files to a new system. When would this come up?
- When I detect my file system has a fault that's going to get bring the system down.
- When I need bigger hardware.
- When I want to move to a different environment or a different location.
If it's all in the database, I can restore the database backup and I've got it all. That makes things a whole lot simpler and easier.
I know there are counter-arguments for every reason I've presented. My point in presenting these architecture reasons is because they do weigh in to whether for a given system we put BLOBs in the database or not. A consideration of the pros and cons for a given system is important to make the right choice. There isn't a black and white answer, as with some other design choices.
I'm sure there are other reasons. And there are counter-arguments for each of these reasons. However, whether or not BLOBs should be in