SQLServerCentral Article

Roll Your Own SSIS Fuzzy Matching / Grouping (Jaro - Winkler)

,

The objective here is to demonstrate how to effectively use SSIS Scripting to accomplish Fuzzy Matching and Fuzzy Grouping . I have worked in this area for many years and this is by no means a complete solution suitable for very high volumes, however this article will provide a workable SSIS solution for matching and grouping required for de duplication. In addition we will demonstrate how to acquire a source connection in a Script component.

Some examples of using SSIS Fuzzy Lookup are at SSIS: Adventures with Fuzzy Matching and Fuzzy Lookup and Fuzzy Grouping in SQL Server Integration Services 2005

The Problem

Let's assume you have a list of prospective customers and you want to identify which ones are the same. However the list of prospective customers has some duplicate due to misspelling and or typos. Notice below cust_id 11 and 111 are probably the same person.

As you can see from the list above we have a list of Customer Ids and First and Last names. For our exercise the last names are assumed to be correct. Our objective is to group or match the unique Cust_Id records. We want to create an output list that links the similar customers and also normalizes or standardizes the first names.

I like to start with the end in mind and below is our final output.

We have cleansed (First_Name_Normalized) and grouped (Name_Id) similar customers. In this example for the customer Jones with three different first names Tomas, Tom and Thhomas we have normalized the first name as Thomas. Our next step would be to do a Merge/Purge and create a single customer records, we will examine this in a future article.

The Approach

Create a process which will compare the customer Name_Input list to a Lookup table with existing Standard first_Names , identify the similar first names and group them by assigning a Name_Id. separate lists into a new list containing and identifying the matches. In addition we will standardize the name. For instance we will make Tom and Tomas into Thomas or Pete into Peter.

This article will describe how to create a SSIS Package that will provide Fuzzy Matching via a SSIS Data Flow Script Component.

SSIS Fuzzy Matching and Fuzzy Grouping stock components are provided for the SQL Server Enterprise Edition and are not available in SQL Server Standard Edition.

We will develop a SSIS Package that will support both Fuzzy Matching and Fuzzy Grouping and can run in any SQL Server Edition that supports SSIS. Our initial version will perform adequately for medium volumes.

Specifically we will demonstrate how to implement the Jaro-Winkler Matching Algorithm .

“The Jaro-Winkler distance (Winkler, 1999) is a measure of similarity between two strings. It is a variant ofthe Jaro distance metric (Jaro, 1989, 1995) and mainly used in the area of record linkage (duplicatedetection). The higher the Jaro-Winkler distance for two strings is, the more similar the strings are. The Jaro-Winkler distance metric is designed and best suited for short strings such as person names. The score isnormalized such that 0 equates to no similarity and 1 is an exact match.”

The references used for this effort were:

This is not a tutorial on SSIS, we assume you know how to create a package and distinguish between a Control and a Data Flow. If you need a primer see The New ETL Paradigm by Jamie Thomson or review Speak like a SSIS Developer at JumpStartTV by Brian Knight.

The way it works

Basically we will read all the rows(NameInput) from an input dataet and compare each row (FirstName) in a lookup or reference table(NameLookup).

Obviously this process will create a large volumn of rows, we will split the rows into three categories.

Matches with Scores over 95%

Near Matches with Scores over 75% and under 85%

No Match with Scores under 75%

One advantage we have in implementing this in SSIS is the "Pipeline", The records will be created in the "Pipeline" and will not have to be staged or written to a file.

Pipeline architectureas defined by Microsoft

"At the core of SSIS is the data transformation pipeline. This pipeline has a buffer-oriented architecturethat is extremely fast at manipulating rowsets of data once they have been loaded into memory. The approach is to perform all data transformation steps of the ETL process in a single operation without staging data, although specific transformation or operational requirements, or indeed hardware may be a hindrance. Nevertheless, for maximum performance, the architecture avoids staging. Even copying the data in memory is avoided as far as possible. This is in contrast to traditional ETL tools, which often require staging at almost every step of the warehousing and integration process. The ability to manipulate data without staging extends beyond traditional relational and flat file data andbeyond traditional ETL transformation capabilities. With SSIS, all types of data (structured, unstructured, XML, etc.) are converted to a tabular (columns and rows) structure before being loaded into its buffers. Any data operation that can be applied to tabular data can be applied to the data at anystep in the data-flow pipeline. This means that a single data-flow pipeline can integrate diverse sources of data and perform arbitrarily complex operations on these data without having to stage the data.

It should also be noted though, that if staging is required for business or operational reasons, SSIS hasgood support for these implementations as well.

This architecture allows SSIS to be used in a variety of data integration scenarios, ranging fromtraditional DW-oriented ETL to nontraditional information integration technologies."

The SSIS Solution

The Data Flow Toolbox items we will use once we have created a Data Flow in a Package are:

Create OLD DB Source

I have provided a script to load the NameInput and NameLookup sample tables for our package. Incidently I used a SQL Server Central script from Tatsu to generate the Insert Scripts.

Set up Script Component

Add a Script Component and select all columns as input.

Create three outputs:

  • Match
  • Near Match
  • No Match

Set the SyncronousInputID to "None" for each output. This will make the Script Component asynchronous.

Next create the necessary Outputs and their associated columns , all of them as DT_STR length of 10.

  • NameInput
  • LastName
  • FirstNameNormalized
  • FirstName
  • NameGroupId
  • Score

Now add a Connection Manager for the NameLookUp reference table.

Now let's create the script.

Here is the complete script

Option Strict Off
Imports System
Imports System.Data
Imports System.Math
Imports Microsoft.SqlServer.Dts.Pipeline.Wrapper
Imports Microsoft.SqlServer.Dts.Runtime.Wrapper
Imports Microsoft.SqlServer.Dts.Runtime
Imports System.Data.SqlClient
Public Class ScriptMain
    Inherits UserComponent
    Dim connMgr As IDTSConnectionManager90
    Dim sqlConn As SqlConnection
    Dim sqlCmd As SqlCommand
    Dim sqlParam As SqlParameter
    Public Overrides Sub AcquireConnections(ByVal Transaction As Object)
   connMgr = Me.Connections.LookupConnection
        sqlConn = CType(connMgr.AcquireConnection(Nothing), SqlConnection)
    End Sub

    Public Overrides Sub PreExecute()
        sqlCmd = New SqlCommand("SELECT name_group_id ,first_name , first_name_normalized FROM NameLookup", sqlConn)
   
    End Sub

    Public Overrides Sub CustomerRecordsInput_ProcessInputRow(ByVal Row As CustomerRecordsInputBuffer)
        Dim reader As SqlDataReader
        Dim str1in As String
        Dim str2in As String
        reader = sqlCmd.ExecuteReader()
        Do While reader.Read()
   If jaro(Row.NameInput, (reader("First_Name"))) > 0.85 Then
                'Explicitly add a row to the output buffer
                MatchBuffer.AddRow()
                'Add rows from the input buffer
                MatchBuffer.NameInput = Row.NameInput
                MatchBuffer.LastName = Row.LastName
                MatchBuffer.FirstNameNormalized = (reader("first_name_normalized"))
                MatchBuffer.CustId = Row.CustId
                MatchBuffer.FirstName = (reader("first_name"))
                MatchBuffer.NameGroupId = (reader("name_group_id"))
                MatchBuffer.Score = jaro(Row.NameInput, (reader("First_Name")))
            Else
                If jaro(Row.NameInput, (reader("First_Name"))) > 0.75 Then
                    NearMatchBuffer.AddRow()
                    NearMatchBuffer.NameInput = Row.NameInput
                    NearMatchBuffer.LastName = Row.LastName
                    NearMatchBuffer.FirstNameNormalized = (reader("first_name_normalized"))
                    NearMatchBuffer.Custid = Row.CustId
                    NearMatchBuffer.FirstName = (reader("first_name"))
                    NearMatchBuffer.NameGroupId = (reader("name_group_id"))
                    NearMatchBuffer.Score = jaro(Row.NameInput, (reader("First_Name")))
                Else
                    If jaro(Row.NameInput, (reader("First_Name"))) > 0 Then
                        NoMatchBuffer.AddRow()
                        NoMatchBuffer.NameInput = Row.NameInput
                        NoMatchBuffer.LastName = Row.LastName
                        NoMatchBuffer.FirstNameNormalized = (reader("first_name_normalized"))
                        NoMatchBuffer.CustId = Row.CustId
                        NoMatchBuffer.FirstName = (reader("first_name"))
                        NoMatchBuffer.NameGroupId = (reader("name_group_id"))
                        NoMatchBuffer.Score = jaro(Row.NameInput, (reader("First_Name")))
                    Else
                        NoMatchBuffer.AddRow()
                        NoMatchBuffer.NameInput = Row.NameInput
                        NoMatchBuffer.LastName = Row.LastName
                        NoMatchBuffer.FirstNameNormalized = (reader("first_name_normalized"))
                        NoMatchBuffer.CustId = Row.CustId
                        NoMatchBuffer.FirstName = (reader("first_name"))
                        NoMatchBuffer.NameGroupId = (reader("name_group_id"))
                        NoMatchBuffer.Score = jaro(Row.NameInput, (reader("First_Name")))
                    End If
                End If
            End If
        Loop
        reader.Close()
    End Sub
    Public Overrides Sub ReleaseConnections()
        connMgr.ReleaseConnection(sqlConn)
    End Sub
    Function jaro(ByVal str1 As String, ByVal str2 As String) As Double
        Dim l1, l2, lmin, lmax, aux, f, l, m, i, j As Integer
        Dim common As Integer
        Dim tr
        Dim a1, auxstr, a2 As String
        l1 = Len(str1)
        l2 = Len(str2)
        If l1 > l2 Then
            aux = l2
            l2 = l1
            l1 = aux
            auxstr = str1
            str1 = str2
            str2 = auxstr
        End If
        lmin = l1
        lmax = l2
        Dim f1(), f2() As Boolean
        ReDim f1(l1), f2(l2)
        For i = 1 To l1
            f1(i) = False
        Next i
        For j = 1 To l2
            f2(j) = False
        Next j
        m = CInt(Int((lmax / 2) - 1))
        common = 0
        tr = 0
        For i = 1 To l1
            a1 = Mid(str1, i, 1)
            If m >= i Then
                f = 1
                l = i + m
            Else
                f = i - m
                l = i + m
            End If
            If l > lmax Then
                l = lmax
            End If
            For j = f To l
                a2 = Mid(str2, j, 1)
                If (a2 = a1) And (f2(j) = False) Then
                    common = common + 1
                    f1(i) = True
                    f2(j) = True
                    GoTo linea_exit
                End If
            Next j
linea_exit:
        Next i
        Dim wcd, wrd, wtr As Double
        l = 1
        For i = 1 To l1
            If f1(i) Then
                For j = l To l2
                    If f2(j) Then
                        l = j + 1
                        a1 = Mid(str1, i, 1)
                        a2 = Mid(str2, j, 1)
                        If a1 <> a2 Then
                            tr = tr + 0.5
                        End If
                        Exit For
                    End If
                Next j
            End If
        Next i
        wcd = 1 / 3
        wrd = 1 / 3
        wtr = 1 / 3
        If common <> 0 Then
            jaro = wcd * common / l1 + wrd * common / l2 + wtr * (common - tr) / common
        End If
    End Function
End Class

Highlights of the script

First for the Jaro-Winkler algorithm we have created a Function

Function jaro(ByVal str1 As String, ByVal str2 As String) As Double
        Dim l1, l2, lmin, lmax, aux, f, l, m, i, j As Integer
        Dim common As Integer
        Dim tr
        Dim a1, auxstr, a2 As String
        l1 = Len(str1)
        l2 = Len(str2)
        If l1 > l2 Then
            aux = l2
            l2 = l1
            l1 = aux
            auxstr = str1
            str1 = str2
            str2 = auxstr
        End If
        lmin = l1
        lmax = l2
        Dim f1(), f2() As Boolean
        ReDim f1(l1), f2(l2)
        For i = 1 To l1
            f1(i) = False
        Next i
        For j = 1 To l2
            f2(j) = False
        Next j
        m = CInt(Int((lmax / 2) - 1))
        common = 0
        tr = 0
        For i = 1 To l1
            a1 = Mid(str1, i, 1)
            If m >= i Then
                f = 1
                l = i + m
            Else
                f = i - m
                l = i + m
            End If
            If l > lmax Then
                l = lmax
            End If
            For j = f To l
                a2 = Mid(str2, j, 1)
                If (a2 = a1) And (f2(j) = False) Then
                    common = common + 1
                    f1(i) = True
                    f2(j) = True
                    GoTo linea_exit
                End If
            Next j
linea_exit:
        Next i
        Dim wcd, wrd, wtr As Double
        l = 1
        For i = 1 To l1
            If f1(i) Then
                For j = l To l2
                    If f2(j) Then
                        l = j + 1
                        a1 = Mid(str1, i, 1)
                        a2 = Mid(str2, j, 1)
                        If a1 <> a2 Then
                            tr = tr + 0.5
                        End If
                        Exit For
                    End If
                Next j
            End If
        Next i
        wcd = 1 / 3
        wrd = 1 / 3
        wtr = 1 / 3
        If common <> 0 Then
            jaro = wcd * common / l1 + wrd * common / l2 + wtr * (common - tr) / common
        End If
    End Function

Next we have created a connection to access our look up table

Dim sqlConn As SqlConnection
    Dim sqlCmd As SqlCommand
    Dim sqlParam As SqlParameter
    Public Overrides Sub AcquireConnections(ByVal Transaction As Object)
   connMgr = Me.Connections.LookupConnection
        sqlConn = CType(connMgr.AcquireConnection(Nothing), SqlConnection)
    End Sub

    Public Overrides Sub PreExecute()
        sqlCmd = New SqlCommand("SELECT name_group_id ,first_name , first_name_normalized FROM NameLookup", sqlConn)
   
    End Sub
  

Finally we will process the pipeline

        Public Overrides Sub CustomerRecordsInput_ProcessInputRow(ByVal Row As CustomerRecordsInputBuffer)
        Dim reader As SqlDataReader
        Dim str1in As String
        Dim str2in As String
        reader = sqlCmd.ExecuteReader()
        Do While reader.Read()
   If jaro(Row.NameInput, (reader("First_Name"))) > 0.85 Then
                'Explicitly add a row to the output buffer
                MatchBuffer.AddRow()
                'Add rows from the input buffer
                MatchBuffer.NameInput = Row.NameInput
                MatchBuffer.LastName = Row.LastName
                MatchBuffer.FirstNameNormalized = (reader("first_name_normalized"))
                MatchBuffer.CustId = Row.CustId
                MatchBuffer.FirstName = (reader("first_name"))
                MatchBuffer.NameGroupId = (reader("name_group_id"))
                MatchBuffer.Score = jaro(Row.NameInput, (reader("First_Name")))
            Else
                If jaro(Row.NameInput, (reader("First_Name"))) > 0.75 Then
                    NearMatchBuffer.AddRow()
                    NearMatchBuffer.NameInput = Row.NameInput
                    NearMatchBuffer.LastName = Row.LastName
                    NearMatchBuffer.FirstNameNormalized = (reader("first_name_normalized"))
                    NearMatchBuffer.Custid = Row.CustId
                    NearMatchBuffer.FirstName = (reader("first_name"))
                    NearMatchBuffer.NameGroupId = (reader("name_group_id"))
                    NearMatchBuffer.Score = jaro(Row.NameInput, (reader("First_Name")))
                Else
                    If jaro(Row.NameInput, (reader("First_Name"))) > 0 Then
                        NoMatchBuffer.AddRow()
                        NoMatchBuffer.NameInput = Row.NameInput
                        NoMatchBuffer.LastName = Row.LastName
                        NoMatchBuffer.FirstNameNormalized = (reader("first_name_normalized"))
                        NoMatchBuffer.CustId = Row.CustId
                        NoMatchBuffer.FirstName = (reader("first_name"))
                        NoMatchBuffer.NameGroupId = (reader("name_group_id"))
                        NoMatchBuffer.Score = jaro(Row.NameInput, (reader("First_Name")))
                    Else
                        NoMatchBuffer.AddRow()
                        NoMatchBuffer.NameInput = Row.NameInput
                        NoMatchBuffer.LastName = Row.LastName
                        NoMatchBuffer.FirstNameNormalized = (reader("first_name_normalized"))
                        NoMatchBuffer.CustId = Row.CustId
                        NoMatchBuffer.FirstName = (reader("first_name"))
                        NoMatchBuffer.NameGroupId = (reader("name_group_id"))
                        NoMatchBuffer.Score = jaro(Row.NameInput, (reader("First_Name")))
                    End If
                End If
            End If
        Loop
        reader.Close()
    End Sub

Running the Package

Now lets run the completed package using Data Viewers. For a quick tutorial on SSIS Data Viewers try JumpstartTV Using Data Viewers in SSIS by Brian Knight.

As you can see the record for Cust_Id 11 has two associated match records thta have a score over 95%.

Here are the nomatch records thta had a score less than 75%.

Near match records thta have a score over 85% and less than 95%.

As you can see above the result of the Sort Component is to "Group" the records into a set containing the input records along with the Normalized names and NameId which effectively identifies similar records together.

We have demonstrated how to create a solution for matching or linking records using SSIS that will run and be deployable on any version of SQL Server 2005 or 2008 that supports SSIS.

The major enhancement we would make in order to create a more performant solution would be to add a separate index based on decomposing the algorithm components.


Ira Warren Whiteside
Actuality Business Intelligence

"karo yaa na karo, koshish jaisa kuch nahi hai"
"Do, or do not. There is no try."

Rate

5 (15)

You rated this post out of 5. Change rating

Share

Share

Rate

5 (15)

You rated this post out of 5. Change rating