Best Approach to Archive DB

  • runal_jagtap (3/19/2013)


    Thanks Jeff 🙂

    Currently I'm planning to go with SSIS package automation...

    We'll see how I succeed... In case of any query I will come here again. Just stating:

    65 views & 0 replies :w00t: ( 😀 )

    So, what's the plan? Will the archive table be in a different database or the same database? I'm asking because although SSIS will certainly do the job, it seems like real overkill.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff, yup, the server & the database would be different...

    Why would it be overkill? :w00t: Explain it to me... because I am going to do it, and I don't know how to do it...

    ************************************
    Every Dog has a Tail !!!!! :-D

  • SSIS is normally utilised when the copy/move is complex, if it involves data transformations, validations etc.

    'Only he who wanders finds new paths'

  • My task is mentioned below...

    Copy all data from the source DB to the destination DB.

    Every month, only the updated or new data needs to be transferred to the destination.

    Will SSIS help with this?

    ************************************
    Every Dog has a Tail !!!!! :-D

  • runal_jagtap (3/21/2013)


    Jeff, yup, the server & the database would be different...

    Why would it be overkill? :w00t: Explain it to me... because I am going to do it, and I don't know how to do it...

    If you've already decided to use SSIS, then my suggestion of making a simple job do a simple insert to the archive and a simple delete from the source would probably underwhelm everyone's gumption. Like I said, doing this through SSIS is real overkill.
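
    For reference, a minimal sketch of that kind of simple job, with made-up table and column names (a source table with a LastUpdated column and an archive copy in an ArchiveDB database on the same instance):

    -- All object names below are hypothetical.
    BEGIN TRANSACTION;

    INSERT INTO ArchiveDB.dbo.SourceTable_Archive (ID, Col1, Col2, LastUpdated)
    SELECT ID, Col1, Col2, LastUpdated
    FROM   dbo.SourceTable
    WHERE  LastUpdated < DATEADD(MONTH, -12, GETDATE());

    DELETE FROM dbo.SourceTable
    WHERE  LastUpdated < DATEADD(MONTH, -12, GETDATE());

    COMMIT TRANSACTION;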

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • david.alcock (3/21/2013)


    SSIS is normally utilised when the copy/move is complex, if it involves data transformations, validations etc.

    I'll go ahead and disagree with that. I've built some killer ETL systems that would log on to FTP, download only what was needed, mark what was downloaded, import the data, transform it, validate it, and merge it with the final tables, all using just T-SQL, including some rather crazy "don't know how many columns there are in the import" functionality using data imported from DoubleClick.net.

    ETL is just not that difficult. It's ok to use SSIS for this type of stuff but I haven't actually found a need for it. I've even been able to do parallel loads without it.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • In this case SSIS is a good choice regardless of the complexity of the transformation because the source and destination databases are going to be on different instances.

    If you want to stay in T-SQL your choices become Linked Servers or xp_cmdshell. Personally I would continue down the path of implementing an SSIS Package to do the archiving.
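
    (If you did go the linked server route, the cross-instance insert could look roughly like this; the linked server, database, and table names below are invented for illustration:)

    -- Assumes a linked server named ARCHIVE_SRV has already been configured
    -- (e.g. via sp_addlinkedserver) and that the archive table already exists.
    DECLARE @LastRunDate datetime = '20130301';  -- last successful run, illustrative only

    INSERT INTO ARCHIVE_SRV.ArchiveDB.dbo.SourceTable_Archive (ID, Col1, LastUpdated)
    SELECT ID, Col1, LastUpdated
    FROM   dbo.SourceTable
    WHERE  LastUpdated >= @LastRunDate;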

    My task is mentioned below...

    Copy all data from the source DB to the destination DB.

    Every month, only the updated or new data needs to be transferred to the destination.

    Will SSIS help with this?

    In order to retrieve 'only the updated or new data' you will need to know something about each row that you can use to derive whether it was updated or new. And what about deletes?

    Typical columns used to determine new or modified rows include 'created date' or 'last updated date' columns. Do you have one or both of those, and are they reliable?

    A rock solid way to do this, although it introduces more DBA work, is to leverage either Change Tracking or Change Data Capture. Those two features work at the database-engine-level to track which rows have changed.
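
    For example, a rough sketch of the Change Tracking route (database, table, and column names here are placeholders; the table needs a primary key, and this assumes SQL Server 2008 or later):

    -- Enable change tracking at the database and table level.
    ALTER DATABASE SourceDB
        SET CHANGE_TRACKING = ON (CHANGE_RETENTION = 7 DAYS, AUTO_CLEANUP = ON);

    ALTER TABLE dbo.SourceTable
        ENABLE CHANGE_TRACKING;

    -- On each run, pull everything changed since the last synced version
    -- (persist @LastVersion between runs, e.g. in a control table).
    DECLARE @LastVersion bigint = 0;

    SELECT ct.SYS_CHANGE_OPERATION, ct.ID, s.Col1, s.LastUpdated
    FROM   CHANGETABLE(CHANGES dbo.SourceTable, @LastVersion) AS ct
    LEFT JOIN dbo.SourceTable AS s ON s.ID = ct.ID;  -- LEFT JOIN so deletes still appear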

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • I am stuck at the final part.

    Please refer to the link below:

    http://www.sqlservercentral.com/articles/Integration+Services+(SSIS)/75331/

    I have everything up till "Let's continue building the incremental load."

    I did build the incremental load but the new rows are not getting transferred :w00t:

    please help :crying:

    ************************************
    Every Dog has a Tail !!!!! :-D

  • Jeff Moden (3/21/2013)


    david.alcock (3/21/2013)


    SSIS is normally utilised when the copy/move is complex, if it involves data transformations, validations etc.

    I'll go ahead and disagree with that. I've built some killer ETL systems that would log on to FTP, download only what was needed, mark what was downloaded, import the data, transform it, validate it, and merge it with the final tables, all using just T-SQL, including some rather crazy "don't know how many columns there are in the import" functionality using data imported from DoubleClick.net.

    ETL is just not that difficult. It's ok to use SSIS for this type of stuff but I haven't actually found a need for it. I've even been able to do parallel loads without it.

    When a guy like Jeff starts off by saying 'I'm going to disagree with that' you know you're in trouble; well, I get that feeling of dread anyway!!!! 🙂

    Of course you're absolutely right, you can implement all kinds of magic using T-SQL. My suggestion was based more on the requirement here than on a generalisation; if it involves some level of complexity then SSIS would be a way of implementing the process rather simply (like a lookup task or something).

    'Only he who wanders finds new paths'

  • david.alcock (3/22/2013)


    Jeff Moden (3/21/2013)


    david.alcock (3/21/2013)


    SSIS is normally utilised when the copy/move is complex, if it involves data transformations, validations etc.

    I'll go ahead and disagree with that. I've built some killer ETL systems that would log on to FTP, download only what was needed, mark what was downloaded, import the data, transform it, validate it, and merge it with the final tables, all using just T-SQL, including some rather crazy "don't know how many columns there are in the import" functionality using data imported from DoubleClick.net.

    ETL is just not that difficult. It's ok to use SSIS for this type of stuff but I haven't actually found a need for it. I've even been able to do parallel loads without it.

    When a guy like Jeff starts off by saying 'I'm going to disagree with that' you know you're in trouble; well, I get that feeling of dread anyway!!!! 🙂

    Of course you're absolutely right, you can implement all kinds of magic using T-SQL. My suggestion was based more on the requirement here than on a generalisation; if it involves some level of complexity then SSIS would be a way of implementing the process rather simply (like a lookup task or something).

    Heh... sorry, David. I didn't intend to make anyone nervous. Like you, I was making a mostly general statement. I'm just amazed at how many people make it seem like you have to use something like SSIS to do even the simplest of IO tasks. Perhaps the greatest irony of them all is the fact that many people will call a stored procedure from SSIS to accomplish their SSIS task but will still insist that SSIS is the only way to go for "complex" things.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • opc.three (3/21/2013)


    In this case SSIS is a good choice regardless of the complexity of the transformation because the source and destination databases are going to be on different instances.

    If you want to stay in T-SQL your choices become Linked Servers or xp_cmdshell. Personally I would continue down the path of implementing an SSIS Package to do the archiving.

    I have to both agree and disagree there. It's no more difficult to set up a linked server than it is to set up an extra connection in SSIS. So far as importing data from a non-database source on another server goes, there's no need for either a linked server or xp_CmdShell. A simple shared drive and trusted connections along with BULK INSERT will do fine for what most people want or need to do.
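
    As a rough illustration of that shared-drive approach (the UNC path, file, and staging table below are made up; the account SQL Server runs under, or the caller under trusted delegation, needs read access to the share):

    -- Load an exported flat file from a shared folder into a staging table.
    BULK INSERT dbo.SourceTable_Staging
    FROM '\\FileServer\Exports\SourceTable_201303.txt'
    WITH (FIELDTERMINATOR = '\t', ROWTERMINATOR = '\n', FIRSTROW = 2, TABLOCK);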

    On the subject of xp_CmdShell, if someone gets the heebie-jeebies at the mere mention of xp_CmdShell or a trusted OPENQUERY that uses ACE drivers, then SSIS is definitely the way to go. Since we've been on that subject before, I'll remind folks that xp_CmdShell isn't a security risk. Bad security is a security risk, because any hacker that gets into your system with "SA" privs is going to change your career by using xp_CmdShell whether you have it turned off or not.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Right, BULK INSERT, or sp_OA, or SQLCLR, and maybe some other things that could be initiated from the server that I didn't bother mentioning. I still disagree with you on the point of 'xp_cmdshell is not a security risk' but that's OK. For me, SSIS would bridge the 'instance gap' and allow us to move data across the network without initiating file system access from within the database engine, a "win" from my perspective.

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • I'm in the middle of an archiving project right now. The customer has a single database instance on a shared server with a major host and isn't running enterprise edition so this limited my options greatly.

    My basic concept is to export the data to be archived to a separate table with a constraint on the last updated date column that covers the time period to be archived. Then, I add the table to a partitioned view that can be queried without knowing which base table is involved.
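
    (Roughly, that pattern might look like this; every name below is invented for illustration, and dbo.Archive_2013_01 is assumed to be defined the same way for the prior month:)

    -- One archive table per period, fenced off by a CHECK constraint on the date column.
    CREATE TABLE dbo.Archive_2013_02
    (
        ID          int          NOT NULL PRIMARY KEY,
        Col1        varchar(50)  NULL,
        LastUpdated datetime     NOT NULL,
        CONSTRAINT CK_Archive_2013_02
            CHECK (LastUpdated >= '20130201' AND LastUpdated < '20130301')
    );
    GO

    -- The partitioned view hides which base table holds a given month.
    CREATE VIEW dbo.ArchiveAll
    AS
    SELECT ID, Col1, LastUpdated FROM dbo.Archive_2013_01
    UNION ALL
    SELECT ID, Col1, LastUpdated FROM dbo.Archive_2013_02;
    GO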

    To make matters more complicated, the archival data must be "split" so that any personal data is in one encrypted base table and related information that has no personally identifiable information goes into a second table. The tables are related by a uniqueidentifier that only exists in the two archive tables. Once archived, the underlying tables are read-only. The data in the view can be queried via a stored procedure that has the month and year as parameters.

    The final step after creating the archive and moving over the data is to delete the rows from the production tables. This step is not to be taken lightly! So when I create the archive tables I create a flag on the production tables that they are only virtually "deleted." Then I compare the two new archive tables against the rows that have been "virtually" deleted and make sure that the numbers and data all add up. Only then are the original rows physically deleted. All of this is done within a transaction so that if something goes haywire it will get rolled back.
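
    A stripped-down sketch of that verify-then-delete step (again with invented names, assuming an IsArchived flag column on the production table):

    BEGIN TRANSACTION;

    -- Mark the candidate rows as virtually deleted.
    UPDATE dbo.SourceTable
    SET    IsArchived = 1
    WHERE  LastUpdated >= '20130201' AND LastUpdated < '20130301';

    -- Only delete for real if the archive row count matches the flagged row count.
    IF  (SELECT COUNT(*) FROM dbo.SourceTable WHERE IsArchived = 1)
      = (SELECT COUNT(*) FROM dbo.Archive_2013_02)
    BEGIN
        DELETE FROM dbo.SourceTable WHERE IsArchived = 1;
        COMMIT TRANSACTION;
    END
    ELSE
    BEGIN
        ROLLBACK TRANSACTION;  -- counts don't match: investigate before deleting anything
    END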

     

  • Steven Willis (3/24/2013)


    I'm in the middle of an archiving project right now. The customer has a single database instance on a shared server with a major host and isn't running enterprise edition so this limited my options greatly.

    My basic concept is to export the data to be archived to a separate table with a constraint on the last updated date column that covers the time period to be archived. Then, I add the table to a partitioned view that can be queried without knowing which base table is involved.

    To make matters more complicated, the archival data must be "split" so that any personal data is in one encrypted base table and related information that has no personally identifiable information goes into a second table. The tables are related by a uniqueidentifier that only exists in the two archive tables. Once archived, the underlying tables are read-only. The data in the view can be queried via a stored procedure that has the month and year as parameters.

    The final step after creating the archive and moving over the data is to delete the rows from the production tables. This step is not to be taken lightly! So when I create the archive tables I create a flag on the production tables that they are only virtually "deleted." Then I compare the two new archive tables against the rows that have been "virtually" deleted and make sure that the numbers and data all add up. Only then are the original rows physically deleted. All of this is done within a transaction so that if something goes haywire it will get rolled back.

     

    Good documentary on what you do for archiving. I do have to ask though.... you mention the split for personal information and that you (if I'm reading this correctly) then encrypt the personal data and store it in a separate table. My question would be, why isn't that data encrypted in the original table to begin with?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (3/24/2013)


    Steven Willis (3/24/2013)


    I'm in the middle of an archiving project right now. The customer has a single database instance on a shared server with a major host and isn't running enterprise edition so this limited my options greatly.

    My basic concept is to export the data to be archived to a separate table with a constraint on the last updated date column that covers the time period to be archived. Then, I add the table to a partitioned view that can be queried without knowing which base table is involved.

    To make matters more complicated, the archival data must be "split" so that any personal data is in one encrypted base table and related information that has no personally identifiable information goes into a second table. The tables are related by a uniqueidentifier that only exists in the two archive tables. Once archived, the underlying tables are read-only. The data in the view can be queried via a stored procedure that has the month and year as parameters.

    The final step after creating the archive and moving over the data is to delete the rows from the production tables. This step is not to be taken lightly! So when I create the archive tables I create a flag on the production tables that they are only virtually "deleted." Then I compare the two new archive tables against the rows that have been "virtually" deleted and make sure that the numbers and data all add up. Only then are the original rows physically deleted. All of this is done within a transaction so that if something goes haywire it will get rolled back.

     

    Good documentary on what you do for archiving. I do have to ask though.... you mention the split for personal information and that you (if I'm reading this correctly) then encrypt the personal data and store it in a separate table. My question would be, why isn't that data encrypted in the original table to begin with?

    The sensitive data in the original table is encrypted (or at least it will be). This client has been collecting data for his health-care related business and "suddenly" realized he might have a problem with HIPAA compliance. 😉

    However, in addition to the personal health information (PHI), there is a second set of data which consists of more than a hundred questions all of which are yes/no or have some scalar-value response that could never be directly associated to an individual unless they also had access to the encrypted table which has the uniqueidentifier FK. By separating this second set of data from the PHI the access control for reporting and analysis can be granted to users who don't need to see who the data was originally connected to. Also, performance is enhanced because it isn't necessary to encrypt the data. (Hmmmm...maybe I could take all those question responses, concatenate them into a delimited string and store them in a single column and use DelimitedSplit8K to query them...just kidding!)

    If this client had the resources to do so, it would certainly be best practice to put this "cleaned" data on a completely different server or at least a different database instance so if the tables were compromised it would still be several degrees of separation from the sensitive PHI. Working in a shared hosting environment on non-Enterprise Edition instances without full sysadmin privileges requires a lot of workarounds. There have been cases when my only SQL access was via a host company's interface--yuk! But that is mostly the SQL world I live in.

    Another reason for separating the data is the poor query performance resulting from the original design of the client's database tables. The PHI and the form question responses are held in one ridiculously wide table with over 150 columns. Unfortunately I have no control over that since the data comes from a third-party application. So the best that I could do is manipulate the data myself. With all but the active and most recent data moved to the archive tables, the performance of the existing wide table is significantly improved. If a search of the archive is required it would only be on the encrypted table with the PHI (by users with proper credentials, of course). That encrypted PHI archive table only has about 15 columns, and by creating a partitioned view by month the query performance on searches goes from something like 30 seconds down to 1 second. If the user then wants to see the details for one individual row, a separate query will retrieve the detail data. But then it is only pulling in the one related row and not querying 25,000 rows x 150 columns. For all of this I had to write the search query procedures so that the PHI will only be displayed encrypted except for a search column (such as name), or be completely unencrypted, depending on the user's credentials.

     
