Archiving or Moving Data

Question

Archiving or Moving Data

Rechana Rajan

SSCertifiable

Points: 7709
More actions
April 9, 2017 at 12:21 am

#342220

Our database size is around 100GB and 50GB is occupied by a table having column datatype varbinary (max) holding documents.
can you suggest some solution to manage this table in a more efficient way.
1. I am planning to use partitioning and move data older than particular range to another table which is in separate file group.
Pros
----------
Data exists in same table.
Cons
-------
Backups will be still the same size or else need to change the whole backup plan.
2. Another option I come up is to move data to entirely new database for a particular range. This is done through a job which runs daily and a weekly job to shrink the main db and index maintenance after shrink.
Pros
-------
Backup size will be reduced
Cons
--------
For historic data ,have to query different DB.
Is there any better way to achieve this?

Viewing 6 posts - 1 through 5 (of 5 total)

You must be logged in to reply to this topic. Login to reply

Jeff Moden SSC Guru Points: 1003863 More actions · Answer 1

I did the same thing with a similar table, which contains the binary for a WAV file for recorded calls. It currently contains 91 months (we have to keep everything forever 🙁 ) of call recordings. We elected to keep it that way rather than copying the data out to files because of many reasons but the biggest reason is that they were losing call recordings on the file system as they brought up new systems and moved the data around. It was also a huge PITA to keep track of the file paths and change them when a file moved because no one would tell us when they were moving files.

The table where all of this data is stored is nearly a Terabyte in size now. Way back when it hit the 300GB mark, we realized that nightly backups were unnecessarily killing us. The time to take the backups and the disk space involved seemed stupid in every aspect because all but the "current month" was entirely static. A friend of my made the comment, "When you do the laundry, do you wash the clean clothes or the dirty ones"?

To make a longer story much shorter, we drank the MS Kool-Aid and partitioned the table with one file group per month and one file per file group. We then realized that we could make all but the current month's file group "Read Only" and never have to backup the previous months ever again. Back them up once and you're done. Our backups on that system went from sometimes more than 10 hours to just minutes because we only had to backup the "current month".

We had to be careful, though. The normal page splits on indexes that didn't use temporally based keys cause a lot of free space to be wasted. Worse than that, a shrink followed by a rebuild of indexes wasted even more space because the old index isn't dropped until the new one is completed and committed. We couldn't do some of the normal tricks with NCIs, like disabling the index and then rebuilding it. REORGANIZING the indexes didn't have the effect we wanted because, although it would "pack the space" the index was using, it didn't actually return any free space to the system. It stayed with the file group. Rebuilding the clustered index in-place was out of the question for the same reason and would have produced a bit more free space than what the actual data was.

As a result, we had to build a system that would "pack" the file group that was getting ready to be set to "read only" and it wasn't easy especially when it came to the data being offline for as little time as possible (I've got it down to about 8 seconds once a month). MS didn't make it easy either. If someone is using the table, you can't make any one of the file groups "read only". You must first set the whole damned database to "single user" to kick people and web services that connect to it out of it.

You MUST also develop a "Restore Plan" and that's another huge PITA when it comes to using "Partitioned Tables". During a restore, you can certainly get back online quickly by restoring the Primay, "current" month, and the empty "next moth" partitions in what is known as a "Piecemeal restore" BUT... you can't do a backup of any new activity until the whole bloody table and all of it's file groups are restored. You also can NOT leave out any of the file groups if you only want to restore just a couple of months to, say, a Development system. And, no, you can't MERGE file groups that aren't present to eliminate the need for restoring them all should the need arise. You'll get an error saying so if you try.

You'll also have to understand that partitioning actually makes most queries a bit slower because each partition has it's own B-TREE. The thing known as "Partition Elimination" for performance is mostly a myth and only helps when the queries and/or related indexes are poorly designed. A well designed monolithic table and related indexes will always beat and equally well designed partitioned table for performance.

Looking back at it all, we're really happy with the performance of backups and the serious reduction in disk space the backups used to take BUT, if I were to do it again, I'd consider taking on the caveats (and there are some serious ones that you WILL need to plan around) of using "Partitioned Views" instead and put each month's file into a separate database with the ISO year and month notation as a part of the database name. That would allow me to use different indexes for the Read Only months than what are used for the current month for a bit of a performance increase, allow me to restore only what I needed for Dev systems, and allow me to do backups of the current month and primary file groups before I had fully restored all of the file groups in a DR situation.

Shifting gears, NEVER have a daily job that shrinks databases or file groups. In fact, consider never shrinking a database unless something really bad happened and you need to recover a shedload (hundreds of GB) of free space. I don't do a shrink for my monthly read only partition process. Inste3ad, I temporarily copy the data in the partition to the Primary File Group, use an index REBUILD on the CI to copy it to a new file group, which also eliminates the wasted free space, rebuild the NCIs with SORT_IN_TEMP db enabled to help keep from generating any space that will be wasted, switch out the old partition, and switch in the new, and then set it to READ ONLY. That helps keep the total down time for the partition down to about 8 seconds.

The system has been operating without fault or failure for nearly 3 years but, like I said, it has some drawbacks when it comes to restores and I'm actually considering rebuilding the system to use "Partitioned Views" across multiple databases instead of using "Partitioned Tables" across multiple file groups in a single database. That would also mean that I wouldn't have to use the bloody partitioning functions or schema anymore, either, and that would be a huge joy for me. 😉 Part of my consideration for "Partitioned Views" is that I've used them for the same thing on other systems in the past and they have also continued to operate just fine and without the need for human interaction for years, as well.

The bottom line is that you should only partition to save on backups and index maintenance. If you're not going to split out into multiple file groups so that you can make the old static data read only, it isn't going to help backups at all, although it will help with index maintenance (which I don't do... it's another waste of time if the CI is correctly designed and doesn't matter on the NCIs for the most part). It's generally going to cause a bit of a performance hit and, to do it right, it's not going to be easy. And, if you don't practice doing restores on it, you might find out too late that you've done something wrong and lose all of your data during a DR situation even if you do have all the backups.

The bottom of the bottom line is, don't take any of this lightly. Plan, test, replan, test again. When you think it's right, test more and try to break it because, if you don't, it will break on its own when you least need it to during a DR restore. 😉

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

TheSQLGuru SSC Guru Points: 134017 More actions · Answer 2

I actually like the multiple database notion for such scenarios. With a good naming scheme you can automate the heck out of it, including view (with UNION ALL) modification, data movement, synonyms if you like, etc. And once you move over a period you can set each database to read only assuming no future modifications will be done. There are many many benefits to this that I don't have time to go into.

Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service

Jeff Moden SSC Guru Points: 1003863 More actions · Answer 3

TheSQLGuru - Sunday, April 9, 2017 1:22 PM
I actually like the multiple database notion for such scenarios. With a good naming scheme you can automate the heck out of it, including view (with UNION ALL) modification, data movement, synonyms if you like, etc. And once you move over a period you can set each database to read only assuming no future modifications will be done. There are many many benefits to this that I don't have time to go into.

Me too. And, the equivalent of SWITCH in or out is super easy and super fast. Just change the partitioned view to include or exclude the call to the separate DB.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Rechana Rajan SSCertifiable Points: 7709 More actions · Answer 4

Jeff Moden - Sunday, April 9, 2017 10:24 AM
I did the same thing with a similar table, which contains the binary for a WAV file for recorded calls. It currently contains 91 months (we have to keep everything forever 🙁 ) of call recordings. We elected to keep it that way rather than copying the data out to files because of many reasons but the biggest reason is that they were losing call recordings on the file system as they brought up new systems and moved the data around. It was also a huge PITA to keep track of the file paths and change them when a file moved because no one would tell us when they were moving files.
The table where all of this data is stored is nearly a Terabyte in size now. Way back when it hit the 300GB mark, we realized that nightly backups were unnecessarily killing us. The time to take the backups and the disk space involved seemed stupid in every aspect because all but the "current month" was entirely static. A friend of my made the comment, "When you do the laundry, do you wash the clean clothes or the dirty ones"?
To make a longer story much shorter, we drank the MS Kool-Aid and partitioned the table with one file group per month and one file per file group. We then realized that we could make all but the current month's file group "Read Only" and never have to backup the previous months ever again. Back them up once and you're done. Our backups on that system went from sometimes more than 10 hours to just minutes because we only had to backup the "current month".
We had to be careful, though. The normal page splits on indexes that didn't use temporally based keys cause a lot of free space to be wasted. Worse than that, a shrink followed by a rebuild of indexes wasted even more space because the old index isn't dropped until the new one is completed and committed. We couldn't do some of the normal tricks with NCIs, like disabling the index and then rebuilding it. REORGANIZING the indexes didn't have the effect we wanted because, although it would "pack the space" the index was using, it didn't actually return any free space to the system. It stayed with the file group. Rebuilding the clustered index in-place was out of the question for the same reason and would have produced a bit more free space than what the actual data was.
As a result, we had to build a system that would "pack" the file group that was getting ready to be set to "read only" and it wasn't easy especially when it came to the data being offline for as little time as possible (I've got it down to about 8 seconds once a month). MS didn't make it easy either. If someone is using the table, you can't make any one of the file groups "read only". You must first set the whole damned database to "single user" to kick people and web services that connect to it out of it.
You MUST also develop a "Restore Plan" and that's another huge PITA when it comes to using "Partitioned Tables". During a restore, you can certainly get back online quickly by restoring the Primay, "current" month, and the empty "next moth" partitions in what is known as a "Piecemeal restore" BUT... you can't do a backup of any new activity until the whole bloody table and all of it's file groups are restored. You also can NOT leave out any of the file groups if you only want to restore just a couple of months to, say, a Development system. And, no, you can't MERGE file groups that aren't present to eliminate the need for restoring them all should the need arise. You'll get an error saying so if you try.
You'll also have to understand that partitioning actually makes most queries a bit slower because each partition has it's own B-TREE. The thing known as "Partition Elimination" for performance is mostly a myth and only helps when the queries and/or related indexes are poorly designed. A well designed monolithic table and related indexes will always beat and equally well designed partitioned table for performance.
Looking back at it all, we're really happy with the performance of backups and the serious reduction in disk space the backups used to take BUT, if I were to do it again, I'd consider taking on the caveats (and there are some serious ones that you WILL need to plan around) of using "Partitioned Views" instead and put each month's file into a separate database with the ISO year and month notation as a part of the database name. That would allow me to use different indexes for the Read Only months than what are used for the current month for a bit of a performance increase, allow me to restore only what I needed for Dev systems, and allow me to do backups of the current month and primary file groups before I had fully restored all of the file groups in a DR situation.
Shifting gears, NEVER have a daily job that shrinks databases or file groups. In fact, consider never shrinking a database unless something really bad happened and you need to recover a shedload (hundreds of GB) of free space. I don't do a shrink for my monthly read only partition process. Inste3ad, I temporarily copy the data in the partition to the Primary File Group, use an index REBUILD on the CI to copy it to a new file group, which also eliminates the wasted free space, rebuild the NCIs with SORT_IN_TEMP db enabled to help keep from generating any space that will be wasted, switch out the old partition, and switch in the new, and then set it to READ ONLY. That helps keep the total down time for the partition down to about 8 seconds.
The system has been operating without fault or failure for nearly 3 years but, like I said, it has some drawbacks when it comes to restores and I'm actually considering rebuilding the system to use "Partitioned Views" across multiple databases instead of using "Partitioned Tables" across multiple file groups in a single database. That would also mean that I wouldn't have to use the bloody partitioning functions or schema anymore, either, and that would be a huge joy for me. 😉 Part of my consideration for "Partitioned Views" is that I've used them for the same thing on other systems in the past and they have also continued to operate just fine and without the need for human interaction for years, as well.
The bottom line is that you should only partition to save on backups and index maintenance. If you're not going to split out into multiple file groups so that you can make the old static data read only, it isn't going to help backups at all, although it will help with index maintenance (which I don't do... it's another waste of time if the CI is correctly designed and doesn't matter on the NCIs for the most part). It's generally going to cause a bit of a performance hit and, to do it right, it's not going to be easy. And, if you don't practice doing restores on it, you might find out too late that you've done something wrong and lose all of your data during a DR situation even if you do have all the backups.
The bottom of the bottom line is, don't take any of this lightly. Plan, test, replan, test again. When you think it's right, test more and try to break it because, if you don't, it will break on its own when you least need it to during a DR restore. 😉

Thanks Jeff for taking your time and explaining the Pros and Cons you have faced.

100% agree with you, I don't want to use partition as it wont help me reduce the DB size and for Prod copies in other environment it wont make any difference. As you mentioned, the backup will be real pain if I go for filegroup and piecemeal restorations have to be much more careful than straight forward Full+ Diff + Log.

I am planning to go with move data to another DB based which are older than X years. The advantage I am seeing here is when copying to other environments I need to give only the live copy no the whole(ofcourse with cleaner script run on top to remove sensitive data).

Rechana Rajan SSCertifiable Points: 7709 More actions · Answer 5

TheSQLGuru - Sunday, April 9, 2017 1:22 PM
I actually like the multiple database notion for such scenarios. With a good naming scheme you can automate the heck out of it, including view (with UNION ALL) modification, data movement, synonyms if you like, etc. And once you move over a period you can set each database to read only assuming no future modifications will be done. There are many many benefits to this that I don't have time to go into.

Thanks Kevin