XML to compare two rows is slow when dealing with large sets

  • We have a process where, every time an update is made to a table, the current and new records are written out to an audit table.  The current process loops through this data with a cursor, reading in each row, populating a set of variables with the values, and then using IF statements to determine which columns have changed.  I was trying to update this to use XML, but am finding that extracting the changes is a very slow process.  

    In this example, the person table has 50 columns.   We try to break our batches into smaller chunks so that we are not processing 500k rows at once but more like 10k.  If I grab 10k changes, that eventually explodes into 10k * 2 * 50 = roughly 1 million person id / column name / value combinations.    The size of the workset can go higher or lower, but that doesn't seem to change how long it takes to process the full batch.  Right now a batch of 50k changes takes roughly 5 minutes to process, and the biggest timesink appears to be the two CTEs in the snippet below.  By comparison, the cursor can do a batch like this in around 90 seconds.   

    SET @x = (
              SELECT
                         new.PID AS [@PID]
                    , new.HL_GID AS [@HL_GID]
                    , ( SELECT old.* FOR XML PATH('old'), TYPE, ELEMENTS XSINIL ) AS [*]
                    , ( SELECT new.* FOR XML PATH('new'), TYPE, ELEMENTS XSINIL ) AS [*]
              FROM person_hl AS old
              INNER JOIN person_hl AS new ON old.[PID] = new.[PID] AND old.HL_GID= new.HL_GID
              INNER JOIN #Person_Update_HL phl ON phl.pid=old.PID AND phl.HL_GID=old.HL_GID
              INNER JOIN #workset w ON w.rowid=phl.rowid
              WHERE old.HL_IID=phl.OldRecID
              AND new.HL_IID=phl.NewRecID
              FOR XML PATH('row'), ELEMENTS XSINIL, ROOT('root')
              )
     
           -- build CTE of just the changes
           ;WITH Nodes AS (
              SELECT
                      P.X.value('.[1]','int') AS [Key]
                    , G.X.value('.[1]','int') AS Version_ID
                     , T.C.value('local-name(..[1])', 'varchar(255)') AS Version_Alias
                    , T.C.value('local-name(.[1])', 'varchar(255)') AS Field
                    , T.C.value('.[1]', 'varchar(max)') AS Val
              FROM @x.nodes('/root/row/*/*') AS T(C)
              OUTER APPLY T.c.nodes('../../@PID') AS P(X)
              OUTER APPLY T.c.nodes('../../@HL_GID') AS G(X)
              )
           , NodeChanges as (
                   SELECT
                          [Key], Version_ID, Field
                     , Max(CASE Version_Alias WHEN 'old' THEN Val END) AS oldrow
                     , Max(CASE Version_Alias WHEN 'new' THEN Val END) AS newrow
                  FROM Nodes
                  GROUP BY [Key], Version_ID, Field
                  HAVING Max(CASE Version_Alias WHEN 'old' THEN Val END) <> Max(CASE Version_Alias WHEN 'new' THEN Val END)
                  )

    Is there any way to speed up how it is determining which columns are different?

  • Parsing large XML data sets is very expensive. You would get the fastest results by putting the data into memory-optimized (in-memory) tables, which requires Enterprise edition of SQL Server 2014+; failing that, regular SQL tables instead of XML are not as fast, but still faster than XQuery.
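
    One way to stay in plain SQL, as a rough sketch: unpivot the old and new rows with CROSS APPLY (VALUES ...) and compare the pairs directly, skipping XML altogether. The column names here (FirstName, LastName) are placeholders — the real query would list all 50 columns of person_hl:

    ```sql
    SELECT v.PID, v.HL_GID, v.Field, v.OldVal, v.NewVal
    FROM person_hl AS old
    INNER JOIN person_hl AS new ON old.PID = new.PID AND old.HL_GID = new.HL_GID
    INNER JOIN #Person_Update_HL phl ON phl.pid = old.PID AND phl.HL_GID = old.HL_GID
    INNER JOIN #workset w ON w.rowid = phl.rowid
    CROSS APPLY ( VALUES
            -- one row per column; CAST everything to a common type for comparison
            ('FirstName', CAST(old.FirstName AS varchar(max)), CAST(new.FirstName AS varchar(max))),
            ('LastName',  CAST(old.LastName  AS varchar(max)), CAST(new.LastName  AS varchar(max)))
          ) AS v (Field, OldVal, NewVal)
    WHERE old.HL_IID = phl.OldRecID
      AND new.HL_IID = phl.NewRecID
      -- sentinel so NULL <-> value also counts as a change
      AND ISNULL(v.OldVal, CHAR(0)) <> ISNULL(v.NewVal, CHAR(0));
    ```

    This produces the same (key, field, old value, new value) shape as your NodeChanges CTE, but as a single set-based pass with no XML shredding. Note the plain `<>` in your HAVING clause would also miss NULL-to-value changes, which the ISNULL sentinel above works around.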
