Best Approach to Archive DB

Author
Message
Jeff Moden
SSC-Forever (45K reputation)

Group: General Forum Members
Points: 45247 Visits: 39928
david.alcock (3/21/2013)
SSIS is normally utilised when the copy/move is complex, if it involves data transformations, validations etc.


I'll go ahead and disagree with that. I've built some killer ETL systems that would log on to FTP, download only what was needed, mark what was downloaded, import the data, transform it, validate it, and merge it with the final tables, all using just T-SQL, including some rather crazy "don't know how many columns there are in the import" functionality using data imported from Double-Click.net.

ETL is just not that difficult. It's OK to use SSIS for this type of stuff, but I haven't actually found a need for it. I've even been able to do parallel loads without it.

--Jeff Moden

RBAR is pronounced ree-bar and is a Modenism for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column.
Although they tell us that they want it real bad, our primary goal is to ensure that we don't actually give it to them that way.
Although change is inevitable, change for the better is not.
Just because you can do something in PowerShell doesn't mean you should.

Helpful Links:
How to post code problems
How to post performance problems
Forum FAQs
Orlando Colamatteo
SSCrazy Eights (8.3K reputation)

Group: General Forum Members
Points: 8259 Visits: 14368
In this case SSIS is a good choice regardless of the complexity of the transformation, because the source and destination databases are going to be on different instances.

If you want to stay in T-SQL, your choices become linked servers or xp_cmdshell. Personally, I would continue down the path of implementing an SSIS package to do the archiving.

My task is mentioned below:

Copy all data from the source to the destination DB.
Every month only the updated or new data needs to be transferred to the destination.



Will SSIS help in this?

To retrieve 'only the updated or new data' you will need to know something about each row that you can use to determine whether it was updated or new. And what about deletes?

Typical columns used to determine new or modified rows include 'created date' or 'last updated date' columns. Do you have one or both of those, and are they reliable?

A rock-solid way to do this, although it introduces more DBA work, is to leverage either Change Tracking or Change Data Capture. Those two features work at the database engine level to track which rows have changed.
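
For example, here is a minimal sketch of a watermark-driven incremental pull, assuming a reliable 'last updated date' column exists; the table, column, and watermark names are only illustrative, and the filtered SELECT is the sort of query that could feed an SSIS data-flow source or a linked-server INSERT. If you go the Change Tracking route, the CHANGETABLE function replaces the date filter and also surfaces deletes.

-- Minimal sketch of a watermark-driven incremental pull.
-- dbo.Orders, LastUpdatedDate and ETL.LoadWatermark are assumed, illustrative names.
DECLARE @LastRun datetime2(3),
        @ThisRun datetime2(3) = SYSDATETIME();

SELECT @LastRun = LastRunDate
FROM   ETL.LoadWatermark
WHERE  TableName = N'dbo.Orders';

-- Only rows created or modified since the previous successful run.
SELECT  o.OrderID, o.CustomerID, o.OrderTotal, o.LastUpdatedDate
FROM    dbo.Orders AS o
WHERE   o.LastUpdatedDate >  @LastRun
  AND   o.LastUpdatedDate <= @ThisRun;

-- Advance the watermark only after the destination load has succeeded,
-- so a failed run cannot silently skip rows.
UPDATE ETL.LoadWatermark
SET    LastRunDate = @ThisRun
WHERE  TableName = N'dbo.Orders';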

__________________________________________________________________________________________________
There are no special teachers of virtue, because virtue is taught by the whole community. --Plato
OnlyOneRJ
Old Hand (300 reputation)

Group: General Forum Members
Points: 300 Visits: 690
I am stuck at the final part.

Please refer to the link below:

http://www.sqlservercentral.com/articles/Integration+Services+(SSIS)/75331/

I have everything working up to the section "Let's continue building the incremental load."

I did build the incremental load, but the new rows are not getting transferred.

Please help.

************************************
Every Dog has a Tail !!!!! :-D
david.alcock
SSC Veteran (261 reputation)

Group: General Forum Members
Points: 261 Visits: 1158
Jeff Moden (3/21/2013)
david.alcock (3/21/2013)
SSIS is normally utilised when the copy/move is complex, if it involves data transformations, validations etc.


I'll go ahead and disagree with that. I've built some killer ETL systems that would log on to FTP, download only what was needed, mark what was downloaded, import the data, transform it, validate it, and merge it with the final tables, all using just T-SQL, including some rather crazy "don't know how many columns there are in the import" functionality using data imported from Double-Click.net.

ETL is just not that difficult. It's OK to use SSIS for this type of stuff, but I haven't actually found a need for it. I've even been able to do parallel loads without it.


When a guy like Jeff starts off by saying 'I'm going to disagree with that', you know you're in trouble. Well, I get that feeling of dread anyway! :-)

Of course you're absolutely right; you can implement all kinds of magic using T-SQL. My suggestion was based more on the requirement here than on a generalisation: if it involves some level of complexity, then SSIS would be a way of implementing the process rather simply (with a Lookup task or something).

'Only he who wanders finds new paths'
Jeff Moden
SSC-Forever (45K reputation)

Group: General Forum Members
Points: 45247 Visits: 39928
david.alcock (3/22/2013)
Jeff Moden (3/21/2013)
david.alcock (3/21/2013)
SSIS is normally utilised when the copy/move is complex, if it involves data transformations, validations etc.


I'll go ahead and disagree with that. I've built some killer ETL systems that would log on to FTP, download only what was needed, mark what was downloaded, import the data, transform it, validate it, and merge it with the final tables, all using just T-SQL, including some rather crazy "don't know how many columns there are in the import" functionality using data imported from Double-Click.net.

ETL is just not that difficult. It's OK to use SSIS for this type of stuff, but I haven't actually found a need for it. I've even been able to do parallel loads without it.


When a guy like Jeff starts off by saying 'I'm going to disagree with that', you know you're in trouble. Well, I get that feeling of dread anyway! :-)

Of course you're absolutely right; you can implement all kinds of magic using T-SQL. My suggestion was based more on the requirement here than on a generalisation: if it involves some level of complexity, then SSIS would be a way of implementing the process rather simply (with a Lookup task or something).



Heh... sorry, David. I didn't intend to make anyone nervous. Like you, I was making a mostly general statement. I'm just amazed at how many people make it seem like you have to use something like SSIS to do even the simplest of I/O tasks. Perhaps the greatest irony of all is that many people will call a stored procedure from SSIS to accomplish their task, yet still insist that SSIS is the only way to go for "complex" things.

--Jeff Moden

RBAR is pronounced ree-bar and is a Modenism for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column.
Although they tell us that they want it real bad, our primary goal is to ensure that we don't actually give it to them that way.
Although change is inevitable, change for the better is not.
Just because you can do something in PowerShell doesn't mean you should.

Helpful Links:
How to post code problems
How to post performance problems
Forum FAQs
Jeff Moden
SSC-Forever (45K reputation)

Group: General Forum Members
Points: 45247 Visits: 39928
opc.three (3/21/2013)
In this case SSIS is a good choice regardless of the complexity of the transformation, because the source and destination databases are going to be on different instances.

If you want to stay in T-SQL, your choices become linked servers or xp_cmdshell. Personally, I would continue down the path of implementing an SSIS package to do the archiving.


I have to both agree and disagree there. It's no more difficult to set up a linked server than it is to set up an extra connection in SSIS. As far as importing data from a non-database source on another server goes, there's no need for either a linked server or xp_CmdShell. A simple shared drive and trusted connections, along with BULK INSERT, will do fine for what most people want or need to do.

On the subject of xp_CmdShell, if someone gets the heebie-jeebies at the mere mention of xp_CmdShell or a trusted OPENQUERY that uses ACE drivers, then SSIS is definitely the way to go. Since we've been on that subject before, I'll remind folks that xp_CmdShell isn't a security risk. Bad security is a security risk, because any hacker that gets into your system with "SA" privs is going to change your career by using xp_CmdShell whether you have it turned off or not.
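
As a minimal sketch of the shared-drive approach (the UNC path, file layout, and staging table name below are made up purely for illustration):

-- Minimal sketch: loading a delimited file from a network share into a staging table.
-- \\FileServer\Exports\orders.csv and dbo.Stage_Orders are assumed names.
BULK INSERT dbo.Stage_Orders
FROM '\\FileServer\Exports\orders.csv'
WITH (
    FIRSTROW        = 2,        -- skip the header row
    FIELDTERMINATOR = ',',
    ROWTERMINATOR   = '\n',
    TABLOCK                     -- allows a minimally logged bulk load where eligible
);

Under a trusted (Windows) connection the engine typically reaches the share with the caller's credentials (or the service account's for a SQL login), so no xp_CmdShell is involved.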

--Jeff Moden

RBAR is pronounced ree-bar and is a Modenism for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column.
Although they tell us that they want it real bad, our primary goal is to ensure that we don't actually give it to them that way.
Although change is inevitable, change for the better is not.
Just because you can do something in PowerShell doesn't mean you should.

Helpful Links:
How to post code problems
How to post performance problems
Forum FAQs
Orlando Colamatteo
SSCrazy Eights (8.3K reputation)

Group: General Forum Members
Points: 8259 Visits: 14368
Right: BULK INSERT, or sp_OA, or SQLCLR, and maybe some other things that could be initiated from the server that I didn't bother mentioning. I still disagree with you on the point of 'xp_cmdshell is not a security risk', but that's OK. For me, SSIS would bridge the 'instance gap' and allow us to move data across the network without initiating file-system access from within the database engine, a "win" from my perspective.

__________________________________________________________________________________________________
There are no special teachers of virtue, because virtue is taught by the whole community. --Plato
Steven Willis
SSC-Addicted (483 reputation)

Group: General Forum Members
Points: 483 Visits: 1721
I'm in the middle of an archiving project right now. The customer has a single database instance on a shared server with a major hosting provider and isn't running Enterprise Edition, so this limited my options greatly.

My basic concept is to export the data to be archived to a separate table with a constraint on the last updated date column that covers the time period to be archived. Then, I add the table to a partitioned view that can be queried without knowing which base table is involved.

To make matters more complicated, the archival data must be "split" so that any personal data is in one encrypted base table and related information that has no personally identifiable information goes into a second table. The tables are related by a uniqueidentifier that only exists in the two archive tables. Once archived, the underlying tables are read-only. The data in the view can be queried via a stored procedure that has the month and year as parameters.

The final step after creating the archive and moving over the data is to delete the rows from the production tables. This step is not to be taken lightly! So when I create the archive tables, I set a flag on the production rows to mark them as only virtually "deleted." Then I compare the two new archive tables against the rows that have been "virtually" deleted and make sure that the numbers and data all add up. Only then are the original rows physically deleted. All of this is done within a transaction so that if something goes haywire it will get rolled back.
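
A minimal sketch of that flag/verify/delete step, assuming a bit column named IsArchived on the production table and a single archive table; every object name and the date cutoff are illustrative only, and THROW needs SQL Server 2012 or later (RAISERROR works on older versions):

-- Minimal sketch: soft-delete, verify row counts, then physically delete, all in one transaction.
BEGIN TRY
    BEGIN TRANSACTION;

    -- 1. Soft delete: flag the production rows that were copied to the archive.
    UPDATE p
    SET    p.IsArchived = 1
    FROM   dbo.Production AS p
    WHERE  p.LastUpdatedDate < '20130101';

    -- 2. Verify: the flagged rows must match what landed in the archive table.
    DECLARE @Flagged  int = (SELECT COUNT(*) FROM dbo.Production WHERE IsArchived = 1),
            @Archived int = (SELECT COUNT(*) FROM dbo.Archive_PHI);

    IF @Flagged <> @Archived
        THROW 50000, 'Archive row count does not match flagged production rows.', 1;

    -- 3. Only then remove the originals.
    DELETE FROM dbo.Production WHERE IsArchived = 1;

    COMMIT TRANSACTION;
END TRY
BEGIN CATCH
    IF @@TRANCOUNT > 0 ROLLBACK TRANSACTION;
    THROW;   -- re-raise so the caller sees the failure
END CATCH;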

 
Jeff Moden
SSC-Forever (45K reputation)

Group: General Forum Members
Points: 45247 Visits: 39928
Steven Willis (3/24/2013)
I'm in the middle of an archiving project right now. The customer has a single database instance on a shared server with a major hosting provider and isn't running Enterprise Edition, so this limited my options greatly.

My basic concept is to export the data to be archived to a separate table with a constraint on the last updated date column that covers the time period to be archived. Then, I add the table to a partitioned view that can be queried without knowing which base table is involved.

To make matters more complicated, the archival data must be "split" so that any personal data is in one encrypted base table and related information that has no personally identifiable information goes into a second table. The tables are related by a uniqueidentifier that only exists in the two archive tables. Once archived, the underlying tables are read-only. The data in the view can be queried via a stored procedure that has the month and year as parameters.

The final step after creating the archive and moving over the data is to delete the rows from the production tables. This step is not to be taken lightly! So when I create the archive tables, I set a flag on the production rows to mark them as only virtually "deleted." Then I compare the two new archive tables against the rows that have been "virtually" deleted and make sure that the numbers and data all add up. Only then are the original rows physically deleted. All of this is done within a transaction so that if something goes haywire it will get rolled back.

 


Good documentation of what you do for archiving. I do have to ask, though... you mention the split for personal information and that you (if I'm reading this correctly) then encrypt the personal data and store it in a separate table. My question would be: why isn't that data encrypted in the original table to begin with?

--Jeff Moden

RBAR is pronounced ree-bar and is a Modenism for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column.
Although they tell us that they want it real bad, our primary goal is to ensure that we don't actually give it to them that way.
Although change is inevitable, change for the better is not.
Just because you can do something in PowerShell doesn't mean you should.

Helpful Links:
How to post code problems
How to post performance problems
Forum FAQs
Steven Willis
SSC-Addicted (483 reputation)

Group: General Forum Members
Points: 483 Visits: 1721
Jeff Moden (3/24/2013)
Steven Willis (3/24/2013)
I'm in the middle of an archiving project right now. The customer has a single database instance on a shared server with a major hosting provider and isn't running Enterprise Edition, so this limited my options greatly.

My basic concept is to export the data to be archived to a separate table with a constraint on the last updated date column that covers the time period to be archived. Then, I add the table to a partitioned view that can be queried without knowing which base table is involved.

To make matters more complicated, the archival data must be "split" so that any personal data is in one encrypted base table and related information that has no personally identifiable information goes into a second table. The tables are related by a uniqueidentifier that only exists in the two archive tables. Once archived, the underlying tables are read-only. The data in the view can be queried via a stored procedure that has the month and year as parameters.

The final step after creating the archive and moving over the data is to delete the rows from the production tables. This step is not to be taken lightly! So when I create the archive tables, I set a flag on the production rows to mark them as only virtually "deleted." Then I compare the two new archive tables against the rows that have been "virtually" deleted and make sure that the numbers and data all add up. Only then are the original rows physically deleted. All of this is done within a transaction so that if something goes haywire it will get rolled back.

 

Good documentation of what you do for archiving. I do have to ask, though... you mention the split for personal information and that you (if I'm reading this correctly) then encrypt the personal data and store it in a separate table. My question would be: why isn't that data encrypted in the original table to begin with?

The sensitive data in the original table is encrypted (or at least it will be). This client has been collecting data for his health-care-related business and "suddenly" realized he might have a problem with HIPAA compliance. ;-)

However, in addition to the personal health information (PHI), there is a second set of data consisting of more than a hundred questions, all of which are yes/no or have some scalar-value response that could never be directly associated with an individual unless someone also had access to the encrypted table holding the uniqueidentifier FK. By separating this second set of data from the PHI, access for reporting and analysis can be granted to users who don't need to see who the data was originally connected to. Also, performance is enhanced because it isn't necessary to encrypt that data. (Hmmm... maybe I could take all those question responses, concatenate them into a delimited string, store them in a single column, and use DelimitedSplit8K to query them... just kidding!)

If this client had the resources to do so, it would certainly be best practice to put this "cleaned" data on a completely different server, or at least in a different database instance, so that if the tables were compromised there would still be several degrees of separation from the sensitive PHI. Working in a shared hosting environment on non-Enterprise Edition instances without full sysadmin privileges requires a lot of workarounds. There have been cases when my only SQL access was via a host company's interface--yuk! But that is mostly the SQL world I live in.

Another reason for separating the data is the poor query performance resulting from the original design of the client's database tables. The PHI and the form question responses live in one ridiculously wide table with over 150 columns. Unfortunately I have no control over that, since the data comes from a third-party application, so the best I could do is manipulate the data myself. With all but the active and most recent data moved to the archive tables, the performance of the existing wide table is significantly improved. If a search of the archive is required, it only hits the encrypted table with the PHI (by users with proper credentials, of course). That encrypted PHI archive table has only about 15 columns, and by creating a partitioned view by month, the query performance on searches goes from something like 30 seconds down to 1 second. If the user then wants to see the details for one individual row, a separate query retrieves the detail data, so it only pulls in the one related row instead of querying 25,000 rows x 150 columns. For all of this I had to write the search procedures so that the PHI is displayed encrypted except for a search column (such as name), or completely unencrypted, depending on the user's credentials.
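
A minimal sketch of that last part, assuming the PHI columns are encrypted with a symmetric key protected by a certificate and that privileged users belong to a database role; every object name here (PHIKey, PHICert, PHIReaders, dbo.Archive_PHI and its columns) is hypothetical:

-- Minimal sketch: return the searchable column decrypted for everyone,
-- but the rest of the PHI only for members of a privileged role.
CREATE PROCEDURE dbo.SearchArchive
    @Month int,
    @Year  int
AS
BEGIN
    SET NOCOUNT ON;

    OPEN SYMMETRIC KEY PHIKey DECRYPTION BY CERTIFICATE PHICert;

    SELECT  a.ArchiveID,
            CONVERT(varchar(100), DECRYPTBYKEY(a.PatientNameEnc)) AS PatientName,
            CASE WHEN IS_MEMBER('PHIReaders') = 1
                 THEN CONVERT(varchar(200), DECRYPTBYKEY(a.DetailsEnc))
                 ELSE '<encrypted>'
            END AS Details
    FROM    dbo.Archive_PHI AS a
    WHERE   a.ArchiveMonth = @Month
      AND   a.ArchiveYear  = @Year;

    CLOSE SYMMETRIC KEY PHIKey;
END;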

 