Archiving data

  • Hi,

    I'm trying to archive data that is older than 6 months from the source database into an Archive database.

    There are many parent-child tables, and I've created them in the Archive database with the same structure.

    I've created a stored procedure similar to the one explained here at:  http://vyaskn.tripod.com/sql_archive_data.htm

    It works fine for a few runs, but after a few days I'm getting "Cannot insert duplicates" and some insert failures on foreign key errors.

    I took a copy of the source tables and I'm testing against that copy, where no updates happen to the source data, but I'm still getting the above errors while archiving.

    Please advise if there are any checks I can implement.

    And also, do you use any tools to achieve this? What is the industry standard when it comes to data archiving like this?

  • I am not aware of any archiving tools that work on generic data.  And designing one would not be the most fun thing to do as each table can be designed differently.

    Looking at the "cannot insert duplicates" error, it sounds to me like some data went into the archive database but was not cleaned up in the source, so it was attempted to be inserted twice.  The foreign key errors are likely the opposite problem: data that exists (or at least existed) in the live database is missing from the archive.  This makes me think that part of the query you are using is wrong.  For example, having a > in the INSERT portion but a >= in the DELETE (or the other way around), or possibly using GETDATE() directly instead of storing the date in a datetime variable, causing the time to change between archiving and deleting.
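
    For example (just a rough sketch against a made-up dbo.SomeTable with a CreateDateTime column, not your actual procedure), the pattern I mean is:

    DECLARE @Cutoff datetime = DATEADD(MONTH, -6, GETDATE());  -- capture the cutoff once

    BEGIN TRAN;

    -- The INSERT and the DELETE use the same column, the same operator, and the same variable,
    -- so they cover exactly the same set of rows.
    INSERT INTO archive.dbo.SomeTable (Col1, Col2, CreateDateTime)
    SELECT Col1, Col2, CreateDateTime
    FROM sourcedb.dbo.SomeTable
    WHERE CreateDateTime < @Cutoff;

    DELETE FROM sourcedb.dbo.SomeTable
    WHERE CreateDateTime < @Cutoff;

    COMMIT TRAN;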

    The way I would implement this, if it were me, would be to use SSIS to move the data over, validate the row count, and if the same number of rows exists on both sides, remove the data from the source.  If the row counts don't match and the archive has fewer, copy the missing rows from the source to the archive and repeat the check.  Repeat this N times; after N failed attempts, fail and alert the DBAs.

    If the row counts DO match, then delete the rows from the source.
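
    In rough T-SQL terms, the validation step looks something like this (again just a sketch using the made-up dbo.SomeTable; in an SSIS package the check would be done with the package's own row counts):

    DECLARE @Cutoff datetime = DATEADD(MONTH, -6, GETDATE());
    DECLARE @SourceRows int, @ArchiveRows int;

    SELECT @SourceRows  = COUNT(*) FROM sourcedb.dbo.SomeTable WHERE CreateDateTime < @Cutoff;
    SELECT @ArchiveRows = COUNT(*) FROM archive.dbo.SomeTable  WHERE CreateDateTime < @Cutoff;

    IF @SourceRows = @ArchiveRows
        DELETE FROM sourcedb.dbo.SomeTable WHERE CreateDateTime < @Cutoff;  -- counts match, safe to remove
    ELSE
        RAISERROR ('Archive row count does not match source - alert the DBAs', 16, 1);  -- stop and investigate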

    But if you prefer the stored procedure route, I would be looking at row counts before deleting anything.  Before I delete any data, I like to be certain that I can recover it from somewhere in the event I need it back.  The worst case scenario that I have hit was being told I could delete a lot of data (roughly 1 TB) that was unneeded.  I removed it, and the next day it was needed again.  Thankfully I had a backup to restore from and I was back up and running in a little bit.

    The foreign key errors likely mean that you are missing data in the archive database.

    The above is all just my opinion on what you should do. 
    As with all advice you find on a random internet forum - you shouldn't blindly follow it.  Always test on a test server to see if there are negative side effects before making changes to live!
    I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.

  • I lean towards what bmg002 said. Without more info, hard to know what's wrong. Have you added other fields? When you get this message, do you check the PK/FK violations? Can you determine what might not be working in your archive proc? Are you sure that when you archive data that you remove those rows from the source?

  • Thanks bmg002,

    I will try SSIS for data archiving and see if that helps.

  • If you do go the SSIS route, there are a few things to watch out for:

    •  if any step in the SSIS package fails, the whole package fails
    •  SSIS will fail on some things that are only warnings in T-SQL and could pass in a stored procedure
    •  the SSIS server side requires a SQL license.  I do not believe the client tools do
    •  if something is broken in your archive process, SSIS won't fix that for you, it may just present it differently

    SSIS is not going to be a "magic fix everything" tool for you, I just personally find it easier to work with and test when I am doing data archiving.  I would be curious to figure out why your stored procedure is not moving and removing the data properly, though.  That may be a quicker fix than building up an SSIS package.

    The above is all just my opinion on what you should do. 
    As with all advice you find on a random internet forum - you shouldn't blindly follow it.  Always test on a test server to see if there are negative side effects before making changes to live!
    I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.

    I found the reason why I'm getting the primary key violation while inserting into the archive database.

    Here is the query failing in the stored procedure:

    insert into archive.dbo.Operator
    select * from sourcedb.dbo.Operator
    where OperatorID in
    (
        select OperatorID from sourcedb.dbo.[Transaction] where CreateDateTime < @RetentionPeriod
    )
    IF @@ERROR <> 0
    BEGIN
        ROLLBACK TRAN
        RAISERROR ('Error occurred while copying data to Archive.dbo.Operator', 16, 1)
        RETURN -1
    END

    delete from Operator
    where OperatorID in
    (
        select OperatorID from sourcedb.dbo.[Transaction] where CreateDateTime < @RetentionPeriod
    )
    IF @@ERROR <> 0
    BEGIN
        ROLLBACK TRAN
        RAISERROR ('Error occurred while deleting data from sourcedb.dbo.Transaction', 16, 1)
        RETURN -1
    END

    I'm getting duplicates while inserting into archive.dbo.Operator because sourcedb.dbo.Transaction has transactions with an OperatorID for the date 5/08/2019, and that OperatorID already exists in archive.dbo.Operator for a different date, 5/07/2019.

    But the records in sourcedb.dbo.Transaction are unique by the primary key TransactionID, which I'm not considering anywhere in my query, because these two tables are related on the OperatorID column.

    OperatorID is the primary key in the Operator table and a foreign key in the Transaction table.

    So how can I use the TransactionID column in my query so that I don't get the duplicate issue?

    You haven't provided any DDL here. We have no idea what the Transaction or Operator tables look like, and your queries don't really provide much explanation. Also, please don't use SELECT *. If someone revs the transaction table, then this will break if the operator table isn't changed. If you have a column list, this will still work.

    I would guess that you need to consider the transaction id as a part of your WHERE clause, but without knowing a bit more about the data and structure, it's hard to say what to do.

  • Thanks Steve,

    Here is the DDL for the tables:

    Transaction:

    CREATE TABLE [dbo].[Transaction](
        [TransactionID] [int] NOT NULL,
        [OperatorID] [nchar](32) NOT NULL,
        [CustomerID] [char](100) NULL,
        [TransactionNumber] [int] NOT NULL,
        [TransactionDateTime] [datetime] NOT NULL,
        [CreateDateTime] [datetime] NOT NULL,
        PRIMARY KEY CLUSTERED
        (
            [TransactionID] ASC
        ) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 80) ON [PRIMARY]
    ) ON [PRIMARY]
    GO

    ALTER TABLE [dbo].[Transaction] ADD CONSTRAINT [DF_Transaction_OperatorID] DEFAULT ('0') FOR [OperatorID]
    GO

    ALTER TABLE [dbo].[Transaction] WITH CHECK ADD CONSTRAINT [FK_Transaction_Operator] FOREIGN KEY([OperatorID])
    REFERENCES [dbo].[Operator] ([OperatorID])
    ON DELETE CASCADE
    GO

    ALTER TABLE [dbo].[Transaction] CHECK CONSTRAINT [FK_Transaction_Operator]
    GO

     

    Operator:

    CREATE TABLE [dbo].[Operator](
        [OperatorID] [nchar](32) NOT NULL,
        [OperatorName] [varchar](50) NULL,
        CONSTRAINT [PK_Operator] PRIMARY KEY CLUSTERED
        (
            [OperatorID] ASC
        ) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 80) ON [PRIMARY]
    ) ON [PRIMARY]
    GO

    ALTER TABLE [dbo].[Operator] ADD CONSTRAINT [DF_Operator_OperatorID] DEFAULT (N'0') FOR [OperatorID]
    GO

     

  • Can you have duplicate OperatorID's in the Transaction table?

    try running the following query on the Transaction table:

    WITH duplicateCheck AS
    (
    SELECT OperatorID, COUNT(OperatorID) OVER (PARTITION BY OperatorID) AS OperatorCount
    FROM dbo.[Transaction]
    )
    SELECT OperatorID
    FROM duplicateCheck
    WHERE OperatorCount > 1

    What this will do is tell you if you have duplicate operatorID's in the Transaction table.  If you do, that is likely why you are getting that key violation error.

    In the Operator table, it looks like the OperatorID HAS to be unique because of your primary key.  But I am expecting that your Transaction table allows for duplicate OperatorIDs per row.

    I can think of a few solutions to this problem, the easiest (I think) being to check both past and future records instead of just past records.  What I mean is change your WHERE clause to be something like:

    WHERE OperatorID IN (
    SELECT OperatorID
    FROM sourcedb.dbo.[Transaction]
    WHERE createdatetime < @RetentionPeriod
    )
    AND OperatorID NOT IN (
    SELECT OperatorID
    FROM sourcedb.dbo.[Transaction]
    WHERE createdatetime >= @RetentionPeriod
    )

    Basically, archive the Operator if it has not been used since @RetentionPeriod.

    But without seeing some actual table data and without knowing that there are duplicate OperatorID's inside Transaction, I am not 100% certain the above will work.

    The above is all just my opinion on what you should do. 
    As with all advice you find on a random internet forum - you shouldn't blindly follow it.  Always test on a test server to see if there are negative side effects before making changes to live!
    I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.

  • Thank you.

    Yes, the Transaction table has duplicate values for OperatorID, but the rows are unique by TransactionID. Please see the screenshot.

     

    I have updated the insert logic for the archive.dbo.Operator table as you advised. Do we need to change anything in the delete logic for the Operator table in the source?

     

     

    You would need the same logic in your delete.  Your insert and delete logic should be the same, or you will either insert data that isn't deleted (resulting in duplicate primary key issues) or delete data that isn't inserted (potentially resulting in data loss, but more likely causing you a foreign key error).
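
    One way to guarantee the two statements cover the same rows (just a sketch built from the DDL you posted, not a drop-in replacement for your procedure) is to snapshot the qualifying keys once and drive both the INSERT and the DELETE from that snapshot:

    DECLARE @Cutoff datetime = DATEADD(MONTH, -6, GETDATE());

    -- snapshot the keys that qualify for archiving at this moment
    SELECT TransactionID
    INTO #KeysToArchive
    FROM sourcedb.dbo.[Transaction]
    WHERE CreateDateTime < @Cutoff;

    -- copy exactly those rows to the archive
    INSERT INTO archive.dbo.[Transaction] (TransactionID, OperatorID, CustomerID, TransactionNumber, TransactionDateTime, CreateDateTime)
    SELECT t.TransactionID, t.OperatorID, t.CustomerID, t.TransactionNumber, t.TransactionDateTime, t.CreateDateTime
    FROM sourcedb.dbo.[Transaction] AS t
    JOIN #KeysToArchive AS k ON k.TransactionID = t.TransactionID;

    -- and remove exactly those rows from the source
    DELETE t
    FROM sourcedb.dbo.[Transaction] AS t
    JOIN #KeysToArchive AS k ON k.TransactionID = t.TransactionID;

    DROP TABLE #KeysToArchive;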

    That being said, we really have no sample data to work from here and are just guessing at what your system looks like.  I am going off of assumptions based on what you told us.  It is all "best guess" here.

    But your query and your last response make me think that the archiving logic is definitely wrong in the OperatorID section; I am just not sure about anywhere else.

    Re-thinking what I was recommending earlier, I think that the logic may be backwards and your overall archiving process may have some issues.  For example, do you REALLY want to delete operator ID's?  What if somebody goes on leave of absence (medical leave, extended vacation, etc) longer than your retention period?  Their Operator ID would be removed and put into the archive and they would no longer be able to do transactions.

    If archiving the operator ID is required after the retention period, it may be better to do a process more like this:

    step 1 - insert all missing operator ID's from the sourcedb into archive

    step 2 - copy all archived data from the transaction table from sourcedb into archive and remove sourcedb data

    step 3 - copy all archived data from all other tables that have a foreign key pointing to the OperatorID and remove sourcedb data

    step 4 - remove the operator ID's where the operator ID no longer exists in the other tables

    The operator ID's look like small rows, so having some duplicate data between archive and sourcedb shouldn't be a problem, plus those operator ID's are going to need to exist in the archive in order for the transaction table to have data (unless you drop the foreign keys in which case your archived data has no real referential integrity).  Then you move the transaction data over to archive and remove it from sourcedb, and finally you remove the operator ID's which have no data in other tables.
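
    As a rough sketch of step 1, based on the Operator DDL you posted (adjust to suit your actual process):

    -- copy any operators that the archive does not have yet,
    -- so the foreign key on the archived Transaction rows can be satisfied
    INSERT INTO archive.dbo.Operator (OperatorID, OperatorName)
    SELECT s.OperatorID, s.OperatorName
    FROM sourcedb.dbo.Operator AS s
    WHERE NOT EXISTS (SELECT 1 FROM archive.dbo.Operator AS a WHERE a.OperatorID = s.OperatorID);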

    My recommendation though would be to skip step 4 for the most part as each row in the Operator table is pretty small (if my math is correct, it is around 114 bytes).  Looking at your attachment though, is there a reason that your OperatorID is being stored as an nchar(32) (64 bytes of data per row) instead of an INT (4 bytes per row)?  I know that some applications when building the database will use the wrong data types, so this may be beyond your control.  But if it is something you can control/manage, it may be worth looking into.  60 bytes is not that much, but 60 bytes in a million or billion row table adds up.  Plus if there is a lot of data in that table, performance will be impacted.  Pulling 4 million bytes is going to be much faster than 60 million bytes.

    The above is all just my opinion on what you should do. 
    As with all advice you find on a random internet forum - you shouldn't blindly follow it.  Always test on a test server to see if there are negative side effects before making changes to live!
    I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.

  • Thanks so much!

    Can you please advise me how exactly I should use @@ROWCOUNT to count the rows inserted into each table and the rows deleted from each table?

    I tried adding the code snippet below after every table and it is working.  Is there any better way to do it?

    declare @Delete int, @Insert int

    select @Insert = @@ROWCOUNT

    print 'rows inserted into TableName is ' + convert(varchar, @Insert)

    I want to make sure the number of rows inserted and the number of rows deleted are the same.

  • Sorry for the delayed response, but that is the method I would use.  Basically exactly the code you have.

    Then before you do the delete, do a quick check to make sure that the numbers match between the number of rows inserted and the expected.  I would have an IF statement before your DELETE to make sure that @Insert is the same as COUNT(1) from the table with the same filter as your insert.  Really though, checking the number of rows inserted is more of a sanity check.  If you use the same WHERE clause, the numbers should always match.  The only case where they might not is if someone puts some data in that fits the WHERE clause after you inserted the data into archive... or if you have some weird stuff happening with your indexes.
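
    Something along these lines (only a sketch against your Transaction table, and it assumes @RetentionPeriod is declared the same way as in your procedure):

    DECLARE @Insert int, @Expected int;

    INSERT INTO archive.dbo.[Transaction] (TransactionID, OperatorID, CustomerID, TransactionNumber, TransactionDateTime, CreateDateTime)
    SELECT TransactionID, OperatorID, CustomerID, TransactionNumber, TransactionDateTime, CreateDateTime
    FROM sourcedb.dbo.[Transaction]
    WHERE CreateDateTime < @RetentionPeriod;

    SELECT @Insert = @@ROWCOUNT;  -- rows actually copied

    SELECT @Expected = COUNT(1)
    FROM sourcedb.dbo.[Transaction]
    WHERE CreateDateTime < @RetentionPeriod;  -- rows that should have been copied

    IF @Insert = @Expected
        DELETE FROM sourcedb.dbo.[Transaction]
        WHERE CreateDateTime < @RetentionPeriod;
    ELSE
        RAISERROR ('Insert count does not match expected count - not deleting', 16, 1);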

    I just like having a few extra checks in my code, especially when I am doing a delete, if performance is not critical.  It is a trade off as every extra step you do will take some time.  The systems I work with, performance is "good enough" even with those types of sanity checks in place.  I am not sure what your system looks like performance wise, so I can't say for certain if adding in anything like that is going to get someone calling you up due to performance issues...  I expect the performance hit to be pretty small, but without seeing the actual table structure and full data set, it is hard to know.

    The above is all just my opinion on what you should do. 
    As with all advice you find on a random internet forum - you shouldn't blindly follow it.  Always test on a test server to see if there are negative side effects before making changes to live!
    I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.

    Thanks bmg002.  I really appreciate your help.

    I also want to do a schema comparison for the source database and Archive database tables.

    I want to perform a check before the archive procedure starts so that if a column has been added in the source database, the same column is added to the Archive database table dynamically.  Can you please advise?

  • If you are wanting to do it manually, I believe this query should help (may want to tweak it a bit to suit your specific needs):

    SELECT
    [tables].[name]
    , [columns].[name]
    , [system_type_id]
    , [user_type_id]
    , [max_length]
    , [precision]
    , [scale]
    , [collation_name]
    , [is_nullable]
    , [is_identity]
    FROM [sys].[columns]
    JOIN [sys].[tables]
    ON [COLUMNS].[object_id] = [TABLES].[object_id];

    That will give you a comparison that you can do, but dynamically adding columns sounds like it has room for automation-related errors.  For example, let's say a new column gets added to source.dbo.tableA - NewColumn INT NOT NULL with a default of 0.  Once the column is created, the default constraint is removed.  Then your automation tool runs and tries to add a column to tableA called NewColumn INT NOT NULL.  And it fails.

    My opinion on this would be that the columns in the source system should not be modified if at all possible OR if they are modified, that the script for modifying them is run against both the live and the archive.  Now, if this is a 3rd party system, you may not have control over changes to the source, but if it IS a 3rd party system, they may have an archiving solution that they can provide for you.

    If you really need to keep track of schema level changes on source, you could use the above query and dump that to a table on source.  Then in your data load compare the above query with the table copy of it and if they differ, send an email alert to the DBA that the source schema has changed, or you COULD build a dynamic SQL query to create or remove specific columns.  I personally don't like automated schema level changes, but it may be required for what you are trying to do...
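
    For example, a rough sketch of the drift check (this assumes the archive database can be reached by three-part name from where the check runs; any rows returned mean the schemas differ):

    -- columns that exist in sourcedb but are missing or defined differently in archive
    SELECT t.name AS table_name, c.name AS column_name, c.system_type_id, c.max_length, c.[precision], c.[scale], c.is_nullable
    FROM sourcedb.sys.columns AS c
    JOIN sourcedb.sys.tables  AS t ON t.object_id = c.object_id
    EXCEPT
    SELECT t.name, c.name, c.system_type_id, c.max_length, c.[precision], c.[scale], c.is_nullable
    FROM archive.sys.columns AS c
    JOIN archive.sys.tables  AS t ON t.object_id = c.object_id;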

    The above is all just my opinion on what you should do. 
    As with all advice you find on a random internet forum - you shouldn't blindly follow it.  Always test on a test server to see if there are negative side effects before making changes to live!
    I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.
