Keep only the true changes in a massive table

  • I've a history that got populated through a trigger when some change happened. Trigger had an issue and it was storing records in the history table whether there was an actual upate or not. Trigger is fixed now. Challange is how to clean up this massive table. We need to identify the records that got changed and remove others. Any better way to do this? Since this is a massive table, query is taking exteremly longer to return results.
    I'm coming up with query like this

    SELECT * FROM tbl_history a
    INNER JOIN tbl_history b ON a.ID = b.ID
    WHERE a.col1 <> b.col1
          OR a.col2 <> b.col2
          OR a.col3 <> b.col3
          OR a.col3 <> b.col3
          OR a.col4 <> b.col4
          OR a.col5 <> b.col5
          OR a.col6<> b.col6
          OR a.col7 <> b.col7
          OR a.col8 <> b.col8
          OR a.col9 <> b.col9
          OR a.col10 <> b.col10
          OR a.col11 <> b.col11

  • SQL_Surfer - Thursday, July 19, 2018 8:39 AM

    I've a history that got populated through a trigger when some change happened. Trigger had an issue and it was storing records in the history table whether there was an actual upate or not. Trigger is fixed now. Challange is how to clean up this massive table. We need to identify the records that got changed and remove others. Any better way to do this? Since this is a massive table, query is taking exteremly longer to return results.
    I'm coming up with query like this

    SELECT * FROM tbl_history a
    INNER JOIN tbl_history b ON a.ID = b.ID
    WHERE a.col1 <> b.col1
          OR a.col2 <> b.col2
          OR a.col3 <> b.col3
          OR a.col3 <> b.col3
          OR a.col4 <> b.col4
          OR a.col5 <> b.col5
          OR a.col6<> b.col6
          OR a.col7 <> b.col7
          OR a.col8 <> b.col8
          OR a.col9 <> b.col9
          OR a.col10 <> b.col10
          OR a.col11 <> b.col11

    This query will not correctly handle the case where one of the columns being checked is a NULL. A method which does is as follows:
    SELECT ...
    FROM ...
    WHERE EXISTS (SELECT a.col1, a.col2, ..., a.col11 EXCEPT SELECT b.col1, b.col2, ..., b.col11)

    But it's not going to run fast.

    If you haven't even tried to resolve your issue, please don't expect the hard-working volunteers here to waste their time providing links to answers which you could easily have found yourself.

  • Thanks, any idea to speed it up? May be do it in a batch?

  • Wouldn't you want to do that the other way?  That is, find rows where the previous row matches the next row -- i.e. no column changed -- so you could delete the older of those rows?  Thus something like this (?):

    WHERE a.col1 = b.col1 AND a.col2 = b.col2 AND a.col3 <> b.col3 ...

    As to NULL issues, check the column constraints first: if the column is defined as NOT NULL, obviously you don't have to worry about NULLs for that particular column.

    You didn't state it, but presumably ID is the clustering key on this table (not really the best idea of an audit table, but extremely common).  If so, you'd update by going in batches thru the table and deleting rows that were dups of the prior row.

    As to efficiency, perhaps a PARTITION BY col1, col2, col3, .... would work best, and would automatically take care of the NULL issue.  Perhaps with some type of "gaps and islands" logic to determine which rows to delete.  I/we would need more details about the table and its data to be sure on that one.

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

Viewing 4 posts - 1 through 3 (of 3 total)

You must be logged in to reply to this topic. Login to reply