RE: Best Approach to Archive DB

March 24, 2013 at 6:17 pm

Jeff Moden (3/24/2013)
Steven Willis (3/24/2013)
I'm in the middle of an archiving project right now. The customer has a single database instance on a shared server with a major host and isn't running enterprise edition so this limited my options greatly.
My basic concept is to export the data to be archived to a separate table with a constraint on the last updated date column that covers the time period to be archived. Then, I add the table to a partitioned view that can be queried without knowing which base table is involved.
To make matters more complicated, the archival data must be "split" so that any personal data is in one encrypted base table and related information that has no personally identifiable information goes into a second table. The tables are related by a uniqueidentifier that only exists in the two archive tables. Once archived, the underlying tables are read-only. The data in the view can be queried via a stored procedure that has the month and year as parameters.
The final step after creating the archive and moving over the data is to delete the rows from the production tables. This step is not to be taken lightly! So when I create the archive tables I create a flag on the production tables that they are only virtually "deleted." Then I compare the two new archive tables against the rows that have been "virtually" deleted and make sure that the numbers and data all add up. Only then are the original rows physically deleted. All of this is done within a transaction so that if something goes haywire it will get rolled back.
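The flag, verify, then delete sequence described in the quote above can be sketched roughly as follows. This is only a minimal illustration under stated assumptions: Python's `sqlite3` stands in for SQL Server so the sketch can be run anywhere, the tables `production` and `archive` and the columns `is_deleted`, `last_updated` and `payload` are hypothetical names, and a single archive table is used instead of the two split tables.

```python
import sqlite3


def archive_and_purge(conn, cutoff):
    """Flag, copy, verify and delete rows older than `cutoff` in one transaction."""
    with conn:  # commits on success, rolls back if anything below raises
        # 1. Virtually delete: flag the rows instead of removing them.
        conn.execute(
            "UPDATE production SET is_deleted = 1 WHERE last_updated < ?",
            (cutoff,),
        )
        # 2. Copy the flagged rows into the (hypothetical) archive table.
        conn.execute(
            "INSERT INTO archive (id, last_updated, payload) "
            "SELECT id, last_updated, payload FROM production WHERE is_deleted = 1"
        )
        # 3. Verify: every flagged row must have an identical archived copy.
        flagged = conn.execute(
            "SELECT COUNT(*) FROM production WHERE is_deleted = 1"
        ).fetchone()[0]
        matched = conn.execute(
            "SELECT COUNT(*) FROM production AS p "
            "JOIN archive AS a ON a.id = p.id "
            "AND a.last_updated = p.last_updated AND a.payload = p.payload "
            "WHERE p.is_deleted = 1"
        ).fetchone()[0]
        if flagged != matched:
            raise RuntimeError(
                f"archive mismatch: {flagged} flagged, {matched} matched"
            )
        # 4. Only now physically delete the flagged rows.
        conn.execute("DELETE FROM production WHERE is_deleted = 1")
    return flagged
```

If the counts disagree, or any statement fails, the exception leaves the `with conn:` block and the flagging, the copy and the delete are all rolled back together, so the production rows are untouched.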

Good documentary on what you do for archiving. I do have to ask though.... you mention the split for personal information and that you (if I'm reading this correctly) then encrypt the personal data and store it in a separate table. My question would be, why isn't that data encrypted in the original table to begin with?

The sensitive data in the original table is encrypted (or at least it will be). This client has been collecting data for his health-care related business and "suddenly" realized he might have a problem with HIPAA compliance. 😉

However, in addition to the personal health information (PHI), there is a second set of data which consists of more than a hundred questions, all of which are yes/no or have some scalar-value response that could never be directly associated with an individual unless someone also had access to the encrypted table which has the uniqueidentifier FK. By separating this second set of data from the PHI, access for reporting and analysis can be granted to users who don't need to see who the data was originally connected to. Also, performance is enhanced because it isn't necessary to encrypt this second set of data. (Hmmmm...maybe I could take all those question responses, concatenate them into a delimited string and store them in a single column and use DelimitedSplit8K to query them...just kidding!)
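To make the split concrete, here is a small sketch of how one wide row might be divided into a PHI part and a question-response part that share only a newly generated key. It is an illustration, not the actual implementation: the column names in `PHI_COLUMNS` are hypothetical, Python's `uuid.uuid4()` stands in for a SQL Server uniqueidentifier, and the encryption of the PHI part is left out.

```python
import uuid

# Hypothetical column names. The production key is treated as PHI here so that
# the response rows cannot be linked back to the original table.
PHI_COLUMNS = frozenset({"record_id", "patient_name", "date_of_birth"})


def split_for_archive(row, phi_columns=PHI_COLUMNS):
    """Return (phi_row, response_row) that share a freshly generated archive_id."""
    archive_id = str(uuid.uuid4())  # exists only in the two archive tables
    phi_row = {"archive_id": archive_id}
    response_row = {"archive_id": archive_id}
    for column, value in row.items():
        target = phi_row if column in phi_columns else response_row
        target[column] = value
    return phi_row, response_row
```

Because the response row carries nothing but `archive_id` and the answers, read access to it can be granted without exposing who the answers belong to.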

If this client had the resources to do so, it would certainly be best practice to put this "cleaned" data on a completely different server or at least a different database instance so if the tables were compromised it would still be several degrees of separation from the sensitive PHI. Working in a shared hosting environment on non-Enterprise Edition instances without full sysadmin privileges requires a lot of workarounds. There have been cases when my only SQL access was via a host company's interface--yuk! But that is mostly the SQL world I live in.

Another reason for separating the data is the poor query performance that results from the original design of the client's database tables. The PHI and the form question responses are held in one ridiculously wide table with over 150 columns. Unfortunately I have no control over that since the data comes from a third-party application, so the best I could do is manipulate the data myself. With all but the active and most recent data moved to the archive tables, the performance of the existing wide table is significantly improved.

If a search of the archive is required, it would only be on the encrypted table with the PHI (by users with proper credentials, of course). That encrypted PHI archive table has only about 15 columns, and with a partitioned view by month the query time on searches goes from something like 30 seconds down to about 1 second. If the user then wants to see the details for one individual row, a separate query retrieves the detail data, but it pulls in only the one related row instead of querying 25,000 rows x 150 columns. For all of this I had to write the search procedures so that the PHI is displayed either encrypted except for a search column (such as name) or completely unencrypted, depending on the user's credentials.
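For anyone who hasn't built one, the month-by-month partitioned view idea looks roughly like the sketch below. Treat it as an illustration only: it uses Python's `sqlite3` so it can be run anywhere, the table, view and column names are made up, and SQLite will not skip member tables based on the CHECK constraints the way SQL Server can with a partitioned view. The structure is the same, though: one table per month, a date-range constraint on each, a UNION ALL view over all of them, and a search routine that takes the year and month as parameters.

```python
import sqlite3

COLUMNS = "archive_id, patient_name, last_updated"  # hypothetical columns


def month_bounds(year, month):
    """Return the [start, end) ISO date strings for one calendar month."""
    start = f"{year:04d}-{month:02d}-01"
    if month == 12:
        end = f"{year + 1:04d}-01-01"
    else:
        end = f"{year:04d}-{month + 1:02d}-01"
    return start, end


def add_month_table(conn, year, month):
    """Create one member table whose CHECK constraint pins it to a single month."""
    start, end = month_bounds(year, month)
    name = f"phi_archive_{year:04d}_{month:02d}"
    conn.execute(
        f"CREATE TABLE {name} ("
        " archive_id TEXT PRIMARY KEY,"
        " patient_name TEXT NOT NULL,"
        " last_updated TEXT NOT NULL"
        f" CHECK (last_updated >= '{start}' AND last_updated < '{end}'))"
    )
    return name


def rebuild_view(conn, table_names):
    """(Re)create the view that hides which member table a row lives in."""
    conn.execute("DROP VIEW IF EXISTS phi_archive")
    selects = " UNION ALL ".join(f"SELECT {COLUMNS} FROM {t}" for t in table_names)
    conn.execute(f"CREATE VIEW phi_archive AS {selects}")


def search_archive(conn, year, month, name_prefix):
    """Search one month of the archive through the view, by name prefix."""
    start, end = month_bounds(year, month)
    return conn.execute(
        f"SELECT {COLUMNS} FROM phi_archive"
        " WHERE last_updated >= ? AND last_updated < ? AND patient_name LIKE ?"
        " ORDER BY patient_name",
        (start, end, name_prefix + "%"),
    ).fetchall()
```

Adding a new month is then a matter of creating one more member table and rebuilding the view; callers of `search_archive` never need to know which base table holds the rows.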