Need Recommendations for Document Database

Question

Need Recommendations for Document Database

aureolin

Hall of Fame

Points: 3520
More actions
February 9, 2018 at 10:37 am

#343591

We're in the process of a data migration from one LOB tool to another over this weekend. Part of the migration is moving just under a terabyte's worth of documents from files on disk to blobs in a database (SQL 2014). The destination system is a locally hosted VM with enough disk space on one data "disk drive", but I'm looking for recommendations on how to architect the destination database. Considerations are ease of management, ease of taking backups, etc. FWIW, the individual files are generally small, many are one to two page .PDF files. Yes, I do know this is last minute, but this is my only chance to "get it right" from the start.
Suggestions, comments, advice all welcome.
Steve G.
County of San Luis Obispo

Viewing 11 posts - 1 through 10 (of 10 total)

You must be logged in to reply to this topic. Login to reply

BU69 SSCrazy Points: 2784 More actions · Answer 1

For backups I had similar, also things like CHECKDB, read the thread I posted, may have some answers for maintenance and backup: https://www.sqlservercentral.com/Forums/1921328/Extremely-Large-Backup-Solutions

Basically striped backups across multiple disks are faster, I got the time down to less than half, and any VM performance settings you can change, restoring from striped is slightly more tricky but I worked out a routine which queries msdb to get the last backup paths to a restore script. I used the Ola Hallegren script for backups and you just add multiple paths in, see here where I got it wrong: https://www.sqlservercentral.com/Forums/1922540/Ola-Hallengren-Striped-Backup-Job

Did you come up with any solutions for the actual database itself?

Forgot to mention above that my DBs are also document stores, so compression on the backups I was only getting down to about 70% of the original data.

Jeff Moden SSC Guru Points: 1003857 More actions · Answer 2

aureolin - Friday, February 9, 2018 10:37 AM
We're in the process of a data migration from one LOB tool to another over this weekend. Part of the migration is moving just under a terabyte's worth of documents from files on disk to blobs in a database (SQL 2014). The destination system is a locally hosted VM with enough disk space on one data "disk drive", but I'm looking for recommendations on how to architect the destination database. Considerations are ease of management, ease of taking backups, etc. FWIW, the individual files are generally small, many are one to two page .PDF files. Yes, I do know this is last minute, but this is my only chance to "get it right" from the start.
Suggestions, comments, advice all welcome.
Steve G.
County of San Luis Obispo

Once added, do the documents change? New ones might get added but I'm thinking that existing documents are going to change. Backing up more than once, stuff that won't ever change, is a bit insane. All of this can be handled fairly gracefully where daily backups that measure in minutes (I have similar in a Terabyte telephone system where all call recordings are stored in the database) and a "get back in business DR restore" can be accomplished in a similar time.

Heh... and forget about striped backups... unless you have the right pipe and the right disk configuration, they'll actually slow things down and complicate restores. There's a much better way if the documents become static once loaded into the system.

Before we can continue, though, which version and edition of SQL Server do you have?

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

BU69 SSCrazy Points: 2784 More actions · Answer 3

Jeff Moden - Wednesday, February 14, 2018 6:43 AM
aureolin - Friday, February 9, 2018 10:37 AM
We're in the process of a data migration from one LOB tool to another over this weekend. Part of the migration is moving just under a terabyte's worth of documents from files on disk to blobs in a database (SQL 2014). The destination system is a locally hosted VM with enough disk space on one data "disk drive", but I'm looking for recommendations on how to architect the destination database. Considerations are ease of management, ease of taking backups, etc. FWIW, the individual files are generally small, many are one to two page .PDF files. Yes, I do know this is last minute, but this is my only chance to "get it right" from the start.
Suggestions, comments, advice all welcome.
Steve G.
County of San Luis Obispo
Once added, do the documents change? New ones might get added but I'm thinking that existing documents are going to change. Backing up more than once, stuff that won't ever change, is a bit insane. All of this can be handled fairly gracefully where daily backups that measure in minutes (I have similar in a Terabyte telephone system where all call recordings are stored in the database) and a "get back in business DR restore" can be accomplished in a similar time.
Heh... and forget about striped backups... unless you have the right pipe and the right disk configuration, they'll actually slow things down and complicate restores. There's a much better way if the documents become static once loaded into the system.
Before we can continue, though, which version and edition of SQL Server do you have?

It's not always possible to archive or set documents as read-only for some applications so you need to be more creative with the way you back things up. In previous systems I had the luxury of being able to get the vendor to create read-only databases we could backup once and keep but for other systems, the current one I have is in Sharepoint, the documents can be changed after they are added, so read-only is not possible.

Not sure why you dismiss striped backups, they have speeded up the backups I have in place, can you elaborate and provide examples on ways they slow the backup process down so I can keep this in mind for the future? They do complicate the restores though but htis can be mitigated with a good script.

Eric M Russell SSC Guru Points: 125520 More actions · Answer 4

aureolin - Friday, February 9, 2018 10:37 AM
We're in the process of a data migration from one LOB tool to another over this weekend. Part of the migration is moving just under a terabyte's worth of documents from files on disk to blobs in a database (SQL 2014). The destination system is a locally hosted VM with enough disk space on one data "disk drive", but I'm looking for recommendations on how to architect the destination database. Considerations are ease of management, ease of taking backups, etc. FWIW, the individual files are generally small, many are one to two page .PDF files. Yes, I do know this is last minute, but this is my only chance to "get it right" from the start.
Suggestions, comments, advice all welcome.
Steve G.
County of San Luis Obispo

Why is it considered important to migrate the documents to a database?
You're moving the documents from cheaper storage that is already optimized for files to more expensive database optimized storage, and all those BLOB pages will waste memory, data space, and transaction log space that could be put to better use for tables and indexing.

"Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

aureolin Hall of Fame Points: 3520 More actions · Answer 5

Jeff Moden - Wednesday, February 14, 2018 6:43 AM
aureolin - Friday, February 9, 2018 10:37 AM
We're in the process of a data migration from one LOB tool to another over this weekend. Part of the migration is moving just under a terabyte's worth of documents from files on disk to blobs in a database (SQL 2014). The destination system is a locally hosted VM with enough disk space on one data "disk drive", but I'm looking for recommendations on how to architect the destination database. Considerations are ease of management, ease of taking backups, etc. FWIW, the individual files are generally small, many are one to two page .PDF files. Yes, I do know this is last minute, but this is my only chance to "get it right" from the start.
Suggestions, comments, advice all welcome.
Steve G.
County of San Luis Obispo
Once added, do the documents change? New ones might get added but I'm thinking that existing documents are going to change. Backing up more than once, stuff that won't ever change, is a bit insane. All of this can be handled fairly gracefully where daily backups that measure in minutes (I have similar in a Terabyte telephone system where all call recordings are stored in the database) and a "get back in business DR restore" can be accomplished in a similar time.
Heh... and forget about striped backups... unless you have the right pipe and the right disk configuration, they'll actually slow things down and complicate restores. There's a much better way if the documents become static once loaded into the system.
Before we can continue, though, which version and edition of SQL Server do you have?

I have SQL 2014 Enterprise running the database. Generally, there's a large set of the documents that are not going to change; going forward there may be a subset of the documents that will change - though this remains to be seen.

aureolin Hall of Fame Points: 3520 More actions · Answer 6

Eric M Russell - Wednesday, February 14, 2018 7:17 AM
aureolin - Friday, February 9, 2018 10:37 AM
We're in the process of a data migration from one LOB tool to another over this weekend. Part of the migration is moving just under a terabyte's worth of documents from files on disk to blobs in a database (SQL 2014). The destination system is a locally hosted VM with enough disk space on one data "disk drive", but I'm looking for recommendations on how to architect the destination database. Considerations are ease of management, ease of taking backups, etc. FWIW, the individual files are generally small, many are one to two page .PDF files. Yes, I do know this is last minute, but this is my only chance to "get it right" from the start.
Suggestions, comments, advice all welcome.
Steve G.
County of San Luis Obispo
Why is it considered important to migrate the documents to a database?
You're moving the documents from cheaper storage that is already optimized for files to more expensive database optimized storage, and all those BLOB pages will waste memory, data space, and transaction log space that could be put to better use for tables and indexing.

Heh, This is the way our new LOB application works. I'm just going with the program. 😉

Jeff Moden SSC Guru Points: 1003857 More actions · Answer 7

aureolin - Wednesday, February 14, 2018 10:04 AM
Jeff Moden - Wednesday, February 14, 2018 6:43 AM
aureolin - Friday, February 9, 2018 10:37 AM
We're in the process of a data migration from one LOB tool to another over this weekend. Part of the migration is moving just under a terabyte's worth of documents from files on disk to blobs in a database (SQL 2014). The destination system is a locally hosted VM with enough disk space on one data "disk drive", but I'm looking for recommendations on how to architect the destination database. Considerations are ease of management, ease of taking backups, etc. FWIW, the individual files are generally small, many are one to two page .PDF files. Yes, I do know this is last minute, but this is my only chance to "get it right" from the start.
Suggestions, comments, advice all welcome.
Steve G.
County of San Luis Obispo
Once added, do the documents change? New ones might get added but I'm thinking that existing documents are going to change. Backing up more than once, stuff that won't ever change, is a bit insane. All of this can be handled fairly gracefully where daily backups that measure in minutes (I have similar in a Terabyte telephone system where all call recordings are stored in the database) and a "get back in business DR restore" can be accomplished in a similar time.
Heh... and forget about striped backups... unless you have the right pipe and the right disk configuration, they'll actually slow things down and complicate restores. There's a much better way if the documents become static once loaded into the system.
Before we can continue, though, which version and edition of SQL Server do you have?
I have SQL 2014 Enterprise running the database. Generally, there's a large set of the documents that are not going to change; going forward there may be a subset of the documents that will change - though this remains to be seen.

You definitely have the right tool to do this, then. You could do a Partitioned Table with one filegroup per month and one file per filegroup. As a month becomes "static" (no more changes/updates/additions), you can do a trick to squash all unused space out of the filegroup/file and set it (the filegroup/partition) to READ_ONLY and then back it up. What that will do for you is keep from ever having to back it up again. You only need to backup the current month and the next month (which is always empty until it becomes the current month and you build a new "next month" partition).

This also allows for online "Piece-Meal" restores (if a filegroup or file ever goes bad) and a very quick "get back in business" restore of the current data and then you can restore the outlying months is a less hectic manner.

I will say, though, that if I had it to do over, I'd use a Partitioned View, instead. It allows you to restore any "partition" at any time and also allows you to not restore them all if you're restoring to a Development box (for example). The other advantage of Partitioned Views (especially if you use one database per month {and that's not so bad as most would make it out to be}) is that the table in each of those databases can be made to have different indexes to optimize the "selectability" as compared to the current month which also needs optimization for inserts and updates.

I do have to agree with the others, though. Storing large binaries in a database is pretty tough on memory, etc. You might want to reconsider and use something like "file stream" or other available on disk/off database method. Of course, the advantage to having them in the database is that no one protects data better than a good DBA.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

BU69 SSCrazy Points: 2784 More actions · Answer 8

Jeff, can you let me know what issues you have had with the striped backups, I would genually like to know so I can avoid any issues going forward if I'm about to set our Sharepoint estate to be backed up this way? I've not been able to find anything on them slowing down backups, the only drawback so far is the restores be a bit of a pain.

For the DMS the company is implementing they are storing everything in Sharepoint (decided before my time here), so now it's more a case of trying to fix issues after the fact, we've requested from the vendor a facility to mark documents as complete so we can archive but that might be a long time in coming, at the moment users are free to change any document so our problem still remains having to backup Tbs of database on a regular basis. The only saving grace is that with Sharepoint you can set the content databases to be a maximum of 200Gb each so if you have an issue with only a few documents you can restore only that content database.

What I have found so far is striping the backups has more than halved the backup time, for CHECKDB I will need to restore each database to a separate server and kill two birds with one stone, restore test and checkdb, I can't find any alternative with the consistency check as it just take such a long time I can't let this happen on a production server. I presume that will be another issue for Steve G as well.

andrew gothard SSChampion Points: 12301 More actions · Answer 9

Eric M Russell - Wednesday, February 14, 2018 7:17 AM
aureolin - Friday, February 9, 2018 10:37 AM
We're in the process of a data migration from one LOB tool to another over this weekend. Part of the migration is moving just under a terabyte's worth of documents from files on disk to blobs in a database (SQL 2014). The destination system is a locally hosted VM with enough disk space on one data "disk drive", but I'm looking for recommendations on how to architect the destination database. Considerations are ease of management, ease of taking backups, etc. FWIW, the individual files are generally small, many are one to two page .PDF files. Yes, I do know this is last minute, but this is my only chance to "get it right" from the start.
Suggestions, comments, advice all welcome.
Steve G.
County of San Luis Obispo
Why is it considered important to migrate the documents to a database?
You're moving the documents from cheaper storage that is already optimized for files to more expensive database optimized storage, and all those BLOB pages will waste memory, data space, and transaction log space that could be put to better use for tables and indexing.

Backups. Security. Audit.

To expand slightly on this,
Backups.
we had a similar situation where we had x million critical documents held on a filestore. Backups took, due to the sheer number of documents, 4+ days. Recovery Point of this type of document, meaning a potential loss of a week's worth of data, certainly caught the attention of the board when this was explained. Log backups every 5 minutes made them rather happier.

Security.
adding effective security to a filestore of this size and complexity is rather more challenging than controlling access to a database storage system. If your document store contains any personally identifiable or commercially sensitive information you absolutely have to make sure access to this information is effectively controlled. No excuses.

Audit.
as above. If all this stuff is dumped on a disk, where's your audit trail? Who changed what, and when, and why? Who searched documents (and this is clearly easier in a properly structured database than a data jumble sale), for people, by what criteria. Who viewed data?

In the new system the 25 million + documents are fully secured. All activity; creation, position in document lifecycle, reads, updates are logged. Documents are immutable - if you make a change, even changing just the date from the 12th to the 11th, both versions, with history are stored - no idea how you'd do this on a shared drive and effectively control it auditably - which is actually critical in many situations and legally required in a hell of a lot. Partitioning of the documents by date also makes management easier, particularly as the database grows. Hitting the limit on your partition on a single SAN location could prove a maintenance nightmare in terms of storage and potential app changes if not very, very carefully planned up front. I can put my partitions wherever I want as long as I tell SQL Server. The app is obviously totally blind to that.

I'm a DBA.
I'm not paid to solve problems. I'm paid to prevent them.

Jeff Moden SSC Guru Points: 1003857 More actions · Answer 10

BU69 - Thursday, February 15, 2018 2:36 AM
Jeff, can you let me know what issues you have had with the striped backups, I would genually like to know so I can avoid any issues going forward if I'm about to set our Sharepoint estate to be backed up this way? I've not been able to find anything on them slowing down backups, the only drawback so far is the restores be a bit of a pain.
For the DMS the company is implementing they are storing everything in Sharepoint (decided before my time here), so now it's more a case of trying to fix issues after the fact, we've requested from the vendor a facility to mark documents as complete so we can archive but that might be a long time in coming, at the moment users are free to change any document so our problem still remains having to backup Tbs of database on a regular basis. The only saving grace is that with Sharepoint you can set the content databases to be a maximum of 200Gb each so if you have an issue with only a few documents you can restore only that content database.
What I have found so far is striping the backups has more than halved the backup time, for CHECKDB I will need to restore each database to a separate server and kill two birds with one stone, restore test and checkdb, I can't find any alternative with the consistency check as it just take such a long time I can't let this happen on a production server. I presume that will be another issue for Steve G as well.

Sure. And I don't actually have to write much about it. Backing up to multiple files isn't enough. Each file has to live on a separate physical disk or disk array for "striped" backups to actually provide a performance improvement. Here's an article by Jez Borland with a single graphic that explains it all and, yes... I've seen such slowdowns at several companies because they didn't actually test and just assumed that more files were better
https://www.brentozar.com/archive/2014/01/improving-the-performance-of-backups/

Now, if you really want to improve performance of backups and restores, you need to tweak a couple of things. One of the best and most comprehensive articles I've ever seen on the subject is by Ben Snaidaro. Between using native compression and such tweaking (not to mention having a pretty good "pipe" for a network), I was able to backup my 2TB production server in a single threaded manner to NAS in about an hour every night.
https://www.mssqltips.com/sqlservertip/2539/options-to-improve-sql-server-backup-performance/

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)