SQLServerCentral Article

T-SQL Data Processing


Introduction

The following scenario is nothing new to the SQL programmer. You have a master table, or tables, that persist across processing cycles, and you have periodic updates that need to be merged with these master tables. Although this scenario is ubiquitous, it is frequently far from trivial.

Your updates might include rows that don't yet exist in your masters, rows that are duplicates of rows in your masters, or rows that should be "merged" with rows in your masters.

We've all seen many good articles on how to "dedup" our data. But how do you implement those "fuzzy" merge requirements? Do your project specs require that you replace "empty" columns (columns with blanks, Nulls, or default values) with more relevant data, replace first or middle initials with full first or middle names, arbitrate meta-data columns like AddDates or ModDates, track row update sources via merge bit maps, and perform the above operations via multi-row merge?

Your job, should you decide to accept it, is to make these tables become one, and to do it without resorting to an RBAR (row-by-agonizing-row) process.

Let's get right into the code; after all, that's really all you need to see!

The following T-SQL script segments represent a simplified implementation using tables with only enough columns to demonstrate the requisite techniques. The example code, in its entirety, can be downloaded from a link supplied with this article.


Code Segment 1 - Initialization

Use TempDB
Set NoCount On
Declare @AddDate SmallDateTime,@ModDate SmallDateTime
Select @AddDate=GetDate(),@ModDate=@AddDate
If Object_Id('dbo.ExistingMaster') is not Null Drop table dbo.ExistingMaster
If Object_Id('dbo.Updates') is not Null Drop table dbo.Updates
If Object_Id('dbo.NewMaster') is not Null Drop table dbo.NewMaster
If Object_Id('dbo.Merged') is not Null Drop table dbo.Merged
If Object_Id('dbo.Matched') is not Null Drop table dbo.Matched
--------------------------------------------------------------------------------
-- Create example tables.
--------------------------------------------------------------------------------
-- Create ExistingMaster table.
Create table dbo.ExistingMaster
(
PartyId Int not Null,
FirstName VarChar(64),
LastName VarChar(64),
-- A bit mask indicating data source.
-- 0x01 indicates this row contains data from Source1.
-- 0x02 indicates this row contains data from Source2.
-- 0x04 indicates this row contains data from Source3.
-- (0x05 would indicate data duplicated/merged from sources 1 and 3, etc.)
Source TinyInt not Null,
-- The date/time this row was added to the table.
AddDate SmallDateTime not Null,
-- The date/time this row was modified. Row modifications are
-- defined as either a dedup or data merge against this row.
ModDate SmallDateTime not Null
)
Insert dbo.ExistingMaster Values(1,'R','Jones',1,'1/1/2005','1/1/2005')
Insert dbo.ExistingMaster Values(1,'Sarah','Jones',1,'1/1/2005','1/1/2005')
Insert dbo.ExistingMaster Values(2,'Bob','Smith',1,'3/11/2006','6/22/2006')
Insert dbo.ExistingMaster Values(4,'Brad','Allen',1,'11/2/2006','12/1/2006')
-- Create Update table.
Create table dbo.Updates
(
PartyId Int not Null,
FirstName VarChar(64),
LastName VarChar(64),
Source TinyInt not Null
)
Insert dbo.Updates Values(1,'R','Jones',1)
Insert dbo.Updates Values(1,'Ron','Jones',4)
Insert dbo.Updates Values(1,'S','Jones',2)
Insert dbo.Updates Values(3,'John','Kelly',1)
Insert dbo.Updates Values(3,'J','Kelly',2)
Insert dbo.Updates Values(3,'John','Kelly',4)
-- Create empty NewMaster table.
Select top 0 * into dbo.NewMaster from dbo.ExistingMaster

I like to use TempDB for examples because if you forget to drop your tables they go away the next time you restart your server!

Here we just created the resources necessary for our example. I've created and populated a master table and an updates table. The final step creates the new master table into which the deduped/merged rows will be placed as the process executes. This simple table does not contain any computed columns or other entanglements, so the SELECT TOP 0 ... INTO shortcut works well.

The utility of the new master is such that all processing and data can be verified before the old master is dropped (or renamed) and the new master is renamed into its place.
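
When the verification checks out, the swap itself is just a couple of renames. Here's a minimal sketch of that final step (dbo.OldMaster is only an illustrative name; add whatever verification and error handling your shop requires):

-- Sketch only: verify row counts and spot-check the data before running this.
-- Retire the existing master under a new name, then promote the new master.
If Object_Id('dbo.OldMaster') is not Null Drop table dbo.OldMaster
Exec sp_rename 'dbo.ExistingMaster','OldMaster'
Exec sp_rename 'dbo.NewMaster','ExistingMaster'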


Code Segment 2 - Pre-Loop Data Aggregation

--------------------------------------------------------------------------------
-- Get ExistingMaster rows and Updates rows into a common table adding a
-- sequence column for processing efficiency.
--------------------------------------------------------------------------------
Select
Identity(Int,1,1)[Seq],
*
into dbo.Merged
from
(
Select
PartyId,
FirstName,
LastName,
Case when (Source&1)<>0 then 1 else 0 End[Source1],
Case when (Source&2)<>0 then 2 else 0 End[Source2],
Case when (Source&4)<>0 then 4 else 0 End[Source3],
AddDate,
ModDate
from dbo.ExistingMaster
Union all
Select
PartyId,
FirstName,
LastName,
Cast(Case Source when 1 then 1 else 0 End as TinyInt),
Cast(Case Source when 2 then 2 else 0 End as TinyInt),
Cast(Case Source when 4 then 4 else 0 End as TinyInt),
@AddDate,
@ModDate
from dbo.Updates
) t
-- Output the contents of ExistingMaster and Updates table.
Select Replicate('-',Len('ExistingMaster'))[ExistingMaster],* from dbo.ExistingMaster
Select Replicate('-',Len('Updates'))[Updates],* from dbo.Updates

Here we're pulling together, into one table, all the rows from our master and updates tables. These are the rows we want to dedup/merge. You could include a DISTINCT here, but it is doubtful it would eliminate anything: the master is already deduped, and the updates acquire current Add/Mod dates, which thwarts distinctness. Also, notice the inclusion of the Seq identity column. This column is used in subsequent code so that the join operation produces a quasi-combination of matching rows instead of a permutation.
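
If you want to see that combination-versus-permutation behavior in isolation, here's a throwaway sketch (the #Demo table is purely illustrative): with l.Seq<r.Seq, three rows yield the pairs (1,2), (1,3), and (2,3), plus one unmatched left-side row (3, Null), rather than the nine rows an unrestricted pairing would produce.

-- Illustrative only: demonstrates how l.Seq<r.Seq limits the self join
-- to useful pairings.
If Object_Id('tempdb..#Demo') is not Null Drop table #Demo
Create table #Demo (Seq Int not Null, Name VarChar(16))
Insert #Demo Values(1,'A')
Insert #Demo Values(2,'B')
Insert #Demo Values(3,'C')
Select l.Seq[lSeq],l.Name[lName],r.Seq[rSeq],r.Name[rName]
from #Demo l
left outer join #Demo r on l.Seq<r.Seq
Drop table #Demo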


Code Segment 3 - Matching Dedup/Merge Candidate Rows

--------------------------------------------------------------------------------
-- Dedup/Merge loop.
--------------------------------------------------------------------------------
While (1=1) Begin
If Object_Id('dbo.Matched') is not Null Drop table dbo.Matched
-- With this outer self join operation we match (pair up) rows that
-- are either duplicates or need to be merged.
-- The column-specific match rules are specified in the ON clause
-- of the LEFT OUTER JOIN predicate.
Select
-- Left side of the matched row pair.
l.Seq[lSeq],l.PartyId[lPartyId],
l.FirstName[lFirstName],l.LastName[lLastName],
l.Source1[lSource1],l.Source2[lSource2],l.Source3[lSource3],
l.AddDate[lAddDate],l.ModDate[lModDate],
-- Right side of the matched row pair.
r.Seq[rSeq],r.PartyId[rPartyId],
r.FirstName[rFirstName],r.LastName[rLastName],
r.Source1[rSource1],r.Source2[rSource2],r.Source3[rSource3],
r.AddDate[rAddDate],r.ModDate[rModDate]
into dbo.Matched
from dbo.Merged l
left outer join dbo.Merged r
-- To get a quasi-combination of matched rows instead of a permutation.
on l.Seq<r.Seq
-- FOLLOWING ARE THE MATCH RULES...
-- PartyIds match if they are both the same.
and l.PartyId=r.PartyId
-- FirstNames match if...
and
(
-- One or both of them is Null.
(l.FirstName is Null or r.FirstName is Null)
-- One or both of them is blank.
or (l.FirstName='' or r.FirstName='')
-- Left is an initial and it's the same as the first char of the right.
or ((Len(l.FirstName)=1) and (l.FirstName=Left(r.FirstName,1)))
-- Right is an initial and it's the same as the first char of the left.
or ((Len(r.FirstName)=1) and (r.FirstName=Left(l.FirstName,1)))
-- They are both the same.
or (l.FirstName=r.FirstName)
)
-- LastNames match if they are both the same.
and l.LastName=r.LastName
-- Source1 (not part of the match rules)
-- Source2 (not part of the match rules)
-- Source3 (not part of the match rules)
-- AddDate (not part of the match rules)
-- ModDate (not part of the match rules)

In this segment we're matching row merge candidates. The ON clause of the LEFT OUTER JOIN does all the work; this is where you put your "fuzzy" match rules. Remember, you are only matching rows here; the merge gets done in a subsequent step. The Seq column is used here to limit row matching to useful pairings. Why match a row to itself? Or why match row 1 to row 5 and then match row 5 to row 1? The l.Seq<r.Seq condition keeps this from happening.


Code Segment 4 - Inserting Deduped/Merged Rows into the New Master

-- We now have a table with matched, left and right, row images. However,
-- since this is a combinatoric LEFT OUTER JOIN, there will be left-side
-- row images that do not have a matching right-side row image. With one
-- exception these left-side row images are the final, dedup'd/merged,
-- output of this row set process. The one exception: a left-side row whose
-- sequence number appears among the non-Null right-side sequence numbers
-- has already been paired elsewhere and is not yet final. We now insert
-- the finished rows into the NewMaster table.
Insert dbo.NewMaster
Select
lPartyId[PartyId],
lFirstName[FirstName],lLastName[LastName],
lSource1|lSource2|lSource3[Source],
lAddDate[AddDate],lModDate[ModDate]
from dbo.Matched
where rSeq is Null
and lSeq not in
(
Select rSeq from dbo.Matched where rSeq is not Null
)

Here the disposition of the deduped/merged rows is handled. Obviously, any left-side row image with no right-side row match is where you want to start looking for completed rows. Not so obvious is the second test in the WHERE clause: a given row is not completely deduped/merged if it is still contributing to another match pairing in another row. Also, notice how the select-list recombines the source bits back into the original Source column.
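
As a side note, the same filter can be written with NOT EXISTS instead of NOT IN. The behavior is the same here because the sub-query already excludes Null rSeq values, but some readers find this form easier to follow; this is a sketch, not a change you have to make:

-- Sketch only: an equivalent filter using NOT EXISTS in place of NOT IN.
Insert dbo.NewMaster
Select
lPartyId[PartyId],
lFirstName[FirstName],lLastName[LastName],
lSource1|lSource2|lSource3[Source],
lAddDate[AddDate],lModDate[ModDate]
from dbo.Matched m
where m.rSeq is Null
and not exists (Select 1 from dbo.Matched x where x.rSeq=m.lSeq)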


Code Segment 5 - Merging Matched Rows

If Object_Id('dbo.Merged') is not Null Drop table dbo.Merged
-- Merge the remaining matched rows. At this point any row with a Null
-- right-side sequence number is no longer needed and is filtered out
-- of subsequent processing. GROUP BY is used to mediate meta-data
-- merging (aggregation) and elimination of duplicates resulting from
-- the merge operation.
Select
-- Seq
-- Re-added for continued processing.
Identity(Int,1,1)[Seq],
-- PartyId
-- Take the left-side value as they are both the same.
lPartyId[PartyId],
-- FirstName
-- If one side is Null then take the other side,
-- else take the longest one (to get a fullname instead of an initial).
Case
when lFirstName is Null then rFirstName
when rFirstName is Null then lFirstName
else Case
when Len(lFirstName)>Len(rFirstName) then lFirstName
else rFirstName
End
End[FirstName],
-- LastName
-- Take the left-side value as they are both the same.
lLastName[LastName],
-- Source1
-- OR the two values
-- then take the Max() from GROUP BY rows
Max(lSource1|rSource1)[Source1],
-- Source2
-- OR the two values
-- then take the Max() from GROUP BY rows
Max(lSource2|rSource2)[Source2],
-- Source3
-- OR the two values
-- then take the Max() from GROUP BY rows
Max(lSource3|rSource3)[Source3],
-- AddDate
-- Take the lowest of the two values
-- and then take the Min() from GROUP BY rows.
Min
(
Case
when lAddDate<rAddDate then lAddDate
else rAddDate
End
)[AddDate],
-- ModDate
-- Take the highest of the two values
-- then take the Max() from GROUP BY rows.
Max
(
Case
when lModDate>rModDate then lModDate
else rModDate
End
)[ModDate]
into dbo.Merged
from dbo.Matched with (nolock)
where rSeq is not Null
group by
lPartyId,
Case
when lFirstName is Null then rFirstName
when rFirstName is Null then lFirstName
else Case
when Len(lFirstName)>Len(rFirstName) then lFirstName
else rFirstName
End
End,
lLastName

Now we've got a rowset of left-side/right-side matched row pairs that need to be merged. The select-list, in combination with some judicious CASE expressions (or UDFs if the logic is complex), merges each row pair back into a single, more robust row. Notice how the source and date fields are processed. After the select-list has done its thing, the GROUP BY clause comes along, cleans up the duplicates, and facilitates the source-bit and date merges. Also, the WHERE clause (rSeq is not Null) just tells this step to only attempt a merge where there is a left- and right-side pairing. As this loop iterates there will be less and less data to process.


Code Segment 6 - Loop Termination

-- Test for completion.
-- If there were no matched rows to merge then we are done.
If @@RowCount=0 Break
End
-- Output the NewMaster table.
Select Replicate('-',Len('NewMaster'))[NewMaster],* from dbo.NewMaster order by PartyId

You can see what's going on here. With the data I'm processing I generally see 2-3 iterations for a complete dedup/merge cycle. Your mileage may vary.

Altering the contents of the example input tables and dumping the intermediate loop result sets may help you to understand the process more fully.
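
For example, a couple of throwaway SELECTs dropped into the loop (remove them once you're satisfied) will show you each iteration's intermediate state; this is only a debugging aid, styled after the output statements already in the script:

-- Illustrative debug output only: place the first just after dbo.Matched is
-- built and the second just after dbo.Merged is rebuilt inside the loop.
Select Replicate('-',Len('Matched'))[Matched],* from dbo.Matched order by lSeq
Select Replicate('-',Len('Merged'))[Merged],* from dbo.Merged order by Seq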


Discriminating Observations:

  • The Source column in the master table is a bit-mask indicating the origin of the data used to update a given row. In the case of this example there are three data providers, Source1, Source2, & Source3.
  • The single Source column in the master table is maintained, during processing, as three separate source columns. This distinction is important because the number of data providers is reflected directly in the logic.
  • This technique, splitting the bit-mask into separate columns for processing, is used because MS SQL does not have a Bitwise-OR aggregate function.
  • SQL 2005 Enhancement: Create a Bitwise-OR aggregate function with the CLR.
  • The inclusion of an Identity column may not be necessary if your data already has a primary key; however, it is unlikely that the update data will have a compatible primary key.
  • You may want to include some sort of data normalization in one of the preliminary steps so that column comparisons function as expected (see the sketch after this list).
  • Place SELECT statements in the loop to help you visualize the data transformations. This shouldn't generate too many row sets, as most dedup/merges complete within 2-3 loops (depending on the input data).
  • After the first SELECT statement in the loop (the LEFT OUTER JOIN) examine dbo.Matched to understand the differentiating logic used in the next step (its WHERE clause).
  • Obviously, any rows with data in the right-side image are candidates for subsequent dedup/merge processing.
  • Rows without a right-side image (all nulls, as it was generated with a LEFT OUTER JOIN) can be interpreted in one of two ways:

    • If the left-side Seq column exists in any of the right-side Seq rows then this row is redundant. It has already been included in a subsequent match pairing and will contribute its "DNA" to the results of the subsequent merge operation.
    • If the left-side Seq column does not exist in any of the right-side Seq rows then this row has been completely dedup'd or merged. It can be filtered out into the new master table (the very next step).
  • In the SELECT feeding the INSERT statement, the three separate source columns (Source1, Source2, & Source3) are recombined, via a Bitwise-OR operation, back into a single Source column.
  • The two types of rows with right-side null images are filtered out in the final SELECT statement of the loop where the matched data is dedup'd/merged back into a single row image.
  • The final SELECT statement in the loop merges the two row images in dbo.Matched into a single row image in dbo.Merged and:

    • Eliminates row-pairs that merge to the same result via GROUP BY aggregation.
    • "Rolls-Up" Source1, Source2, Source3, AddDate, & ModDate via GROUP BY aggregation (see code for specifics of each column).
  • None of the input tables are modified in any way during the process.
  • Process verification via before-after comparisons can be made before final disposition of the intermediate tables.
  • At the end of the process appropriate checks should be made before dropping dbo.ExistingMaster and renaming dbo.NewMaster to dbo.ExistingMaster.
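
As mentioned in the normalization bullet above, a small clean-up pass before Code Segment 2 keeps the column comparisons honest without modifying the input tables themselves. Here's a minimal sketch (dbo.UpdatesClean is only an illustrative name; extend the expressions to whatever normalization your data needs, and substitute it for dbo.Updates in Code Segment 2):

-- Sketch only: build a normalized copy of the updates. Trim leading/trailing
-- spaces and treat empty-after-trim names as Nulls so the match rules see
-- them consistently.
If Object_Id('dbo.UpdatesClean') is not Null Drop table dbo.UpdatesClean
Select
PartyId,
NullIf(LTrim(RTrim(FirstName)),'')[FirstName],
NullIf(LTrim(RTrim(LastName)),'')[LastName],
Source
into dbo.UpdatesClean
from dbo.Updates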

Disclaimer:

Before using this technique to process your beloved data, make sure that you completely understand the nuances of the code, especially as it applies to your use of blank or null column values.

Debug your code and Q/A your results thoroughly.

The following suggestions should be adapted to suit your environment and operational requirements.

  • Code refactoring

    • Because you frown on my aesthetics.
    • The twisted logic gives you a headache.
  • Table placement for better I/O performance

    • Use striped drives for all intermediate tables.
    • Use RAID protected drives for persisted tables.
    • Alternate drives between the two tables in the loop.
  • Indexing

    • You know how this goes... try everything.
    • In the loop, is the cost of creating an index offset by better DML performance? (One candidate to try is sketched after this list.)
    • For any given piece of T-SQL code - where is the "elbow" in the row-count vs performance curve where intermediate indexing will make a positive difference?
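
Following up on the indexing bullet above, here's one candidate worth timing (the index name is illustrative): create it each time dbo.Merged is rebuilt inside the loop and measure whether the build cost is repaid by the self join that follows.

-- Sketch only: an experimental index on the loop's work table. Keep it only
-- if it actually speeds up the matching step for your data volumes.
Create index IX_Merged_PartyId_LastName on dbo.Merged(PartyId,LastName,Seq)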

And finally, in closing...

Hey, it works on my system!

Peter E. Kierstead
Sr. Database Programmer
Merlin Information Services
Kalispell, MT
