April 19, 2017 at 8:47 am
We have a process where, every time an update is made to a table, the current / new records are written out to an audit table. The current process that loops through this data uses a cursor to read in each row, populate a bunch of variables with the values, and then use IF statements to determine which columns have changed. I was trying to update this to use XML, but am finding that extracting the changes is a very slow process.
In this example, the person table has 50 columns. We try to break up our batches into smaller chunks so that we are not trying to process 500k rows at once but more like 10k. If I grab 10k changes, that eventually explodes into 10,000 * 2 * 50 = about 1 million person id / column name / value combinations. The size of the workset can go higher or lower, but that doesn't seem to change how long it takes to process the full batch. Right now a batch of 50k changes will take roughly 5 minutes to process. The biggest time sink appears to be the two CTEs in the snippet below. By comparison, the cursor can do a batch like this in around 90 seconds.
SET @x = (
SELECT
new.PID AS [@PID]
, new.HL_GID AS [@HL_GID]
, ( SELECT old.* FOR XML PATH('old'), TYPE, ELEMENTS XSINIL ) AS [*]
, ( SELECT new.* FOR XML PATH('new'), TYPE, ELEMENTS XSINIL ) AS [*]
FROM person_hl AS old
INNER JOIN person_hl AS new ON old.[PID] = new.[PID] AND old.HL_GID = new.HL_GID
INNER JOIN #Person_Update_HL phl ON phl.pid=old.PID AND phl.HL_GID=old.HL_GID
INNER JOIN #workset w ON w.rowid=phl.rowid
WHERE old.HL_IID=phl.OldRecID
AND new.HL_IID=phl.NewRecID
FOR XML PATH('row'), ELEMENTS XSINIL, ROOT('root')
)
-- build CTE of just the changes
;WITH Nodes AS (
SELECT
P.X.value('.[1]','int') AS [Key]
, G.X.value('.[1]','int') AS Version_ID
, T.C.value('local-name(..[1])', 'varchar(255)') AS Version_Alias
, T.C.value('local-name(.[1])', 'varchar(255)') AS Field
, T.C.value('.[1]', 'varchar(max)') AS Val
FROM @x.nodes('/root/row/*/*') AS T(C)
OUTER APPLY T.c.nodes('../../@PID') AS P(X)
OUTER APPLY T.c.nodes('../../@HL_GID') AS G(X)
)
, NodeChanges as (
SELECT
[Key], Version_ID, Field
, Max(CASE Version_Alias WHEN 'old' THEN Val END) AS oldrow
, Max(CASE Version_Alias WHEN 'new' THEN Val END) AS newrow
FROM Nodes
GROUP BY [Key], Version_ID, Field
HAVING Max(CASE Version_Alias WHEN 'old' THEN Val END) <> Max(CASE Version_Alias WHEN 'new' THEN Val END)
)
Is there any way to speed up how it is determining which columns are different?
April 19, 2017 at 10:32 am
Parsing large XML data sets is very expensive. You would get the fastest results by putting the data into memory-optimized tables, which requires the Enterprise Edition of SQL Server 2014 or later; failing that, regular SQL tables are not as fast, but still much faster than shredding the XML with XQuery.
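To illustrate the regular-table approach: instead of serializing both rows to XML and shredding a million nodes, you can unpivot the columns in-line with CROSS APPLY (VALUES ...) and compare old vs. new directly. This is only a sketch against your posted joins — the column names (FirstName, LastName) are placeholders; you would list all 50 audited columns of person_hl. The EXISTS/EXCEPT predicate is a NULL-safe way to express "the values differ", which a plain <> would miss when one side is NULL.

```sql
-- Sketch: set-based change detection without XML.
-- Column list in the VALUES clause is illustrative, not your actual schema.
SELECT
      new.PID
    , new.HL_GID
    , c.Field
    , c.OldVal
    , c.NewVal
FROM person_hl AS old
INNER JOIN person_hl AS new
      ON  old.[PID]  = new.[PID]
      AND old.HL_GID = new.HL_GID
INNER JOIN #Person_Update_HL AS phl
      ON  phl.pid    = old.PID
      AND phl.HL_GID = old.HL_GID
INNER JOIN #workset AS w
      ON  w.rowid = phl.rowid
CROSS APPLY ( VALUES
      -- one row per audited column: (name, old value, new value)
      ('FirstName', CONVERT(varchar(max), old.FirstName), CONVERT(varchar(max), new.FirstName))
    , ('LastName',  CONVERT(varchar(max), old.LastName),  CONVERT(varchar(max), new.LastName))
    -- ... repeat for the remaining columns ...
) AS c (Field, OldVal, NewVal)
WHERE old.HL_IID = phl.OldRecID
  AND new.HL_IID = phl.NewRecID
  -- NULL-safe "values differ": empty only when OldVal and NewVal are equal
  AND EXISTS (SELECT c.OldVal EXCEPT SELECT c.NewVal);
```

The VALUES list is tedious to write for 50 columns, but it is generated once (you can build it from sys.columns), and the optimizer treats the whole thing as a single pass over the joined rows rather than a million-node XML shred.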