Keep only the true changes in a massive table

Question

Keep only the true changes in a massive table

SQL_Surfer

SSCrazy Eights

Points: 8320
More actions
July 19, 2018 at 8:39 am

#373667

I've a history that got populated through a trigger when some change happened. Trigger had an issue and it was storing records in the history table whether there was an actual upate or not. Trigger is fixed now. Challange is how to clean up this massive table. We need to identify the records that got changed and remove others. Any better way to do this? Since this is a massive table, query is taking exteremly longer to return results.
I'm coming up with query like this
SELECT * FROM tbl_history a
INNER JOIN tbl_history b ON a.ID = b.ID
WHERE a.col1 <> b.col1
      OR a.col2 <> b.col2
      OR a.col3 <> b.col3
      OR a.col3 <> b.col3
      OR a.col4 <> b.col4
      OR a.col5 <> b.col5
      OR a.col6<> b.col6
      OR a.col7 <> b.col7
      OR a.col8 <> b.col8
      OR a.col9 <> b.col9
      OR a.col10 <> b.col10
      OR a.col11 <> b.col11

Viewing 4 posts - 1 through 4 (of 4 total)

You must be logged in to reply to this topic. Login to reply

Phil Parkin SSC Guru Points: 247227 More actions · Answer 1

SQL_Surfer - Thursday, July 19, 2018 8:39 AM
I've a history that got populated through a trigger when some change happened. Trigger had an issue and it was storing records in the history table whether there was an actual upate or not. Trigger is fixed now. Challange is how to clean up this massive table. We need to identify the records that got changed and remove others. Any better way to do this? Since this is a massive table, query is taking exteremly longer to return results.
I'm coming up with query like this
SELECT * FROM tbl_history a
INNER JOIN tbl_history b ON a.ID = b.ID
WHERE a.col1 <> b.col1
      OR a.col2 <> b.col2
      OR a.col3 <> b.col3
      OR a.col3 <> b.col3
      OR a.col4 <> b.col4
      OR a.col5 <> b.col5
      OR a.col6<> b.col6
      OR a.col7 <> b.col7
      OR a.col8 <> b.col8
      OR a.col9 <> b.col9
      OR a.col10 <> b.col10
      OR a.col11 <> b.col11

This query will not correctly handle the case where one of the columns being checked is a NULL. A method which does is as follows:
SELECT ... FROM ... WHERE EXISTS (SELECT a.col1, a.col2, ..., a.col11 EXCEPT SELECT b.col1, b.col2, ..., b.col11)
But it's not going to run fast.

SQL_Surfer SSCrazy Eights Points: 8320 More actions · Answer 2

Thanks, any idea to speed it up? May be do it in a batch?

ScottPletcher SSC Guru Points: 101238 More actions · Answer 3

Wouldn't you want to do that the other way? That is, find rows where the previous row matches the next row -- i.e. no column changed -- so you could delete the older of those rows? Thus something like this (?):

WHERE a.col1 = b.col1 AND a.col2 = b.col2 AND a.col3 <> b.col3 ...

As to NULL issues, check the column constraints first: if the column is defined as NOT NULL, obviously you don't have to worry about NULLs for that particular column.

You didn't state it, but presumably ID is the clustering key on this table (not really the best idea of an audit table, but extremely common). If so, you'd update by going in batches thru the table and deleting rows that were dups of the prior row.

As to efficiency, perhaps a PARTITION BY col1, col2, col3, .... would work best, and would automatically take care of the NULL issue. Perhaps with some type of "gaps and islands" logic to determine which rows to delete. I/we would need more details about the table and its data to be sure on that one.

SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".