Best Practice for loading DW tables from live source with timestamp/Rowversion column to identify changes

  • I want to build a DW and I need to identify records that have changed. Luckily, each table in our ERP system has a timestamp/rowversion column, but I cannot see how to easily and efficiently compare the current rowversion value to one I could save to a staging table.

    I was planning a staging table containing the unique ID from each table and the rowversion value. Using the unique ID I can find records that are not yet in staging, but it is the edits I am having trouble with, as the rowversion needs to be handled differently.

    What is the normal best practice for identifying changed records via a rowversion column rather than checking the values of, say, the 20 fields I am interested in? Or is it best practice to not refer to rowversion at all and instead have something like this:

    insert into staging (f1, f2, f3, f4, f5)
    select live.f1, live.f2, live.f3, live.f4, live.f5
    from live
    inner join staging on live.uniqueid = staging.uniqueID
        and (live.f1 <> staging.f1 or live.f2 <> staging.f2 or live.f3 <> staging.f3 or live.f4 <> staging.f4 or live.f5 <> staging.f5)

    There must be an efficient way to copy the rowversion value to the staging table and then do a simple join query, I would have thought:

    insert into staging (f1, f2, f3, f4, f5)
    select live.f1, live.f2, live.f3, live.f4, live.f5
    from live
    inner join staging on live.uniqueid = staging.uniqueID and live.timestamp <> staging.timestamp
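
    For what it's worth, this is roughly what I had in mind for the staging key table; a copied rowversion value fits in a binary(8) column (the table and column names here are only illustrative):

    create table staging_keys
    (
        uniqueID        int       not null primary key,
        last_rowversion binary(8) not null
    );

    -- after each load, copy the current rowversion values back into staging
    update s
    set last_rowversion = live.[timestamp]
    from staging_keys s
    inner join live on live.uniqueID = s.uniqueID
    where live.[timestamp] <> s.last_rowversion;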

    Thanks in advance.

  • Please research SCD Type 2 (Slowly Changing Dimensions).

    _____________________________________
    Pablo (Paul) Berzukov

    Author of Understanding Database Administration available at Amazon and other bookstores.

    Disclaimer: Advice is provided to the best of my knowledge but no implicit or explicit warranties are provided. Since the advisor explicitly encourages testing any and all suggestions in a non-production test environment, the advisor should not be held liable or responsible for any actions taken based on the given advice.
  • Thanks for the reply, but SCD Type 2 won't help me, as the change will normally be in the fact table.

    For example, the user updates an order to change the price 3 days after it was created.

  • sotn (5/2/2013)


    I want to build a DW and I need to identify records that have changed. Luckily, each table in our ERP system has a timestamp/rowversion column, but I cannot see how to easily and efficiently compare the current rowversion value to one I could save to a staging table.

    I was planning a staging table containing the unique ID from each table and the rowversion value. Using the unique ID I can find records that are not yet in staging, but it is the edits I am having trouble with, as the rowversion needs to be handled differently.

    What is the normal best practice for identifying changed records via a rowversion column rather than checking the values of, say, the 20 fields I am interested in? Or is it best practice to not refer to rowversion at all and instead have something like this:

    insert into staging (f1, f2, f3, f4, f5)
    select live.f1, live.f2, live.f3, live.f4, live.f5
    from live
    inner join staging on live.uniqueid = staging.uniqueID
        and (live.f1 <> staging.f1 or live.f2 <> staging.f2 or live.f3 <> staging.f3 or live.f4 <> staging.f4 or live.f5 <> staging.f5)

    There must be an efficient way to copy the rowversion value to the staging table and then do a simple join query, I would have thought:

    insert into staging (f1, f2, f3, f4, f5)
    select live.f1, live.f2, live.f3, live.f4, live.f5
    from live
    inner join staging on live.uniqueid = staging.uniqueID and live.timestamp <> staging.timestamp

    Thanks in advance.

    I've always thought using a rowversion column to manage data extracts for a DW is a real pain. For the 8 bytes of storage required by the rowversion type, you can use a datetime or datetime2 column to record the date a row is modified. This requires code (either within every procedure that updates the table or a trigger) to handle the update of the datetime/datetime2 column, but an AFTER UPDATE trigger that just sets a single column = getdate() creates only a very minimal amount of overhead.

    Extracts to the DW become much easier: create a table to log the starting datetime of each extract that completes successfully, and for the next extract, look for modifications that occurred after the most recent starting datetime in the log table. An index on the modification datetime column makes this process run very efficiently.
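
    A minimal sketch of that approach, assuming a hypothetical dbo.orders table keyed on OrderID and a log table dbo.extractLog with an extractStart column (all names are illustrative):

    -- add a last-modified column and keep it current with a lightweight trigger
    alter table dbo.orders add ModifiedDate datetime2(3) not null
        constraint DF_orders_ModifiedDate default getdate();
    go

    create trigger dbo.trg_orders_SetModifiedDate on dbo.orders
    after update
    as
    begin
        set nocount on;
        update o
        set ModifiedDate = getdate()
        from dbo.orders o
        inner join inserted i on i.OrderID = o.OrderID;
    end;
    go

    -- next extract: pick up rows modified since the last successful extract started
    declare @lastExtractStart datetime2(3) = (select max(extractStart) from dbo.extractLog);
    select o.* from dbo.orders o where o.ModifiedDate > @lastExtractStart;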

    Jason Wolfkill

  • Thanks for the reply.

    Unfortunately, I've already examined that option, but we cannot really change our ERP system easily, so I'm hoping for an easy 'timestamp' comparison rather than comparing all fields.

  • sotn (5/3/2013)


    Thanks for the reply.

    Unfortunately, I've already examined that option, but we cannot really change our ERP system easily, so I'm hoping for an easy 'timestamp' comparison rather than comparing all fields.

    You should be able to do the same thing with the rowversion column. Create a log table for the rowversion value at the time you start an extraction. At the start of the extraction, capture the current rowversion value into a variable (DECLARE @currentRowversionValue varbinary(8) = @@DBTS) and the maximum rowversion value from the log table into a variable (DECLARE @maxLoggedRowversion varbinary(8) = (SELECT MAX(rowversionValue) FROM extractLog)). Within the extraction process, extract only rows where rowversion > @maxLoggedRowversion. When that's complete, INSERT INTO extractLog(rowversionValue) SELECT @currentRowversionValue. You may want to wrap all of that into an explicit transaction with appropriate error handling so that the extractLog table is only updated when the extraction process completes without error.

    You can index your rowversion column to make this faster. I would create a non-clustered index with only the rowversion column (with an appropriate maintenance plan to manage fragmentation). Presumably you'll want all columns of an updated row in the extract, so an index seek on the rowversion index to identify the updated rows followed by a key lookup to the underlying table to get all the columns will not be unduly burdensome.
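
    Put together, a sketch of the whole extract step (extractLog, the live table, and its column names are illustrative, and the whole thing would sit inside a transaction with error handling):

    declare @currentRowversionValue varbinary(8) = @@DBTS;
    declare @maxLoggedRowversion varbinary(8) = (select max(rowversionValue) from dbo.extractLog);

    -- pull only rows whose rowversion moved on since the last logged value,
    -- capped at the value we are about to log so the window is consistent
    insert into staging (uniqueID, f1, f2, f3, f4, f5)
    select live.uniqueID, live.f1, live.f2, live.f3, live.f4, live.f5
    from dbo.live
    where live.[timestamp] > @maxLoggedRowversion
      and live.[timestamp] <= @currentRowversionValue;

    -- record the high-water mark only after the extract succeeds
    insert into dbo.extractLog (rowversionValue) values (@currentRowversionValue);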

    Jason Wolfkill

  • Essentially you're looking for the best way to find records that have changed in your source system?

    If your source system is a SQL Server 2012 Enterprise instance, you could look into implementing Change Data Capture, which is designed for exactly what you're trying to do. I'm guessing that's unlikely, so you're going to be forced to find the changed records yourself; there is no other way round it.

    You could checksum the records or use ORs like you did. To implement the checksum, you would have to store the primary key and checksum value for each record in your staging area, then compare that list to your extract from source.
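
    A sketch of the checksum variant (staging_keys and rowChecksum are illustrative names; note that CHECKSUM can occasionally miss a change, so HASHBYTES is the more robust option if that matters):

    -- changed rows: checksum of the columns of interest no longer matches the stored value
    select live.uniqueID
    from live
    inner join staging_keys s on s.uniqueID = live.uniqueID
    where checksum(live.f1, live.f2, live.f3, live.f4, live.f5) <> s.rowChecksum;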

    I would also be tempted to consider a full reload of your DW during your ETL, which would make the problem go away. That really depends on how much transformation you need to do; you're having to do a full extract from source anyway, so it might be an option. You can load all of your data into a second version of the fact table and then, at the end, just rename the table or switch the data in using partition switching.
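
    If you go the full-reload route, the swap at the end might look roughly like this (the FactOrders names are illustrative; all three tables need identical structure on the same filegroup, and a switch target must be empty):

    -- dbo.FactOrders_Load holds the freshly loaded data; swap it in
    begin transaction;
        alter table dbo.FactOrders switch to dbo.FactOrders_Old;   -- empty out the live fact table
        alter table dbo.FactOrders_Load switch to dbo.FactOrders;  -- switch in the new load
    commit transaction;

    truncate table dbo.FactOrders_Old;  -- leave it empty, ready for the next run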

    Hope this helps.

  • Thanks for the reply.

    We are using SQL 2008 R2.

    I shall experiment with a full reload from scratch every day rather than just updates.

    As each table has a timestamp as well as a primary key field, I was hoping I could quickly compare just those two fields to find new/updated/deleted records rather than checking all the fields I am interested in. Over time I'm sure new fields will be asked for, so I would have to keep updating the check script, whereas with the timestamp I wouldn't have to.
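
    Something along these lines is what I have in mind, comparing only the key and the stored rowversion (staging_keys and last_rowversion are illustrative names):

    -- new: in live but not yet known to staging
    select live.uniqueID
    from live
    left join staging_keys s on s.uniqueID = live.uniqueID
    where s.uniqueID is null;

    -- updated: the rowversion has moved on since the last load
    select live.uniqueID
    from live
    inner join staging_keys s on s.uniqueID = live.uniqueID
    where live.[timestamp] <> s.last_rowversion;

    -- deleted: known to staging but no longer in live
    select s.uniqueID
    from staging_keys s
    left join live on live.uniqueID = s.uniqueID
    where live.uniqueID is null;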

  • We started doing incremental fact table loads using rowversion and like it a lot.

    I do have a log table that identifies the source (we host, so we have multiple source servers), extract name, last rowversion, and last update.

    I merely need to pass the last known rowversion to the query for that source and only updated records are returned. Once the data is in staging, depending on which fact table I'm working on, it's either a straight insert or I first remove the earlier version of the record.

    At the end of my process, I update my log table with the max rowversion per source for that extract.

    We had in the past done a full dump and reload, but this is much faster and puts less weight on the source systems.
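
    Roughly what that log table and the end-of-extract update look like (all names are illustrative, and this assumes the staging table carries the copied rowversion plus a SourceServer column):

    create table dbo.ExtractLog
    (
        SourceServer   sysname      not null,
        ExtractName    varchar(100) not null,
        LastRowVersion binary(8)    not null,
        LastUpdate     datetime     not null,
        constraint PK_ExtractLog primary key (SourceServer, ExtractName)
    );

    -- at the end of a successful extract, record the highest rowversion seen for that source
    update dbo.ExtractLog
    set LastRowVersion = (select max([timestamp]) from staging where SourceServer = 'ERP01'),
        LastUpdate     = getdate()
    where SourceServer = 'ERP01'
      and ExtractName  = 'FactOrders';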
