Using Fuzzy Lookups for Record Linkage

,

In a previous article I discussed using Fuzzy Grouping to

find inexact duplicates in the same dataset. But what if you have 2 different

datasets that you need to review to find common people? You could look for

exact records, but what about misspellings, typos There are a few

probabilistic linkage software packages on the market, but the fuzzy lookup

transformation in SSIS can do the job. For example, in our hospital I was

presented with 1244 patients from an outside study database and was asked if

they existed in our 1.7 million patient registration database.

Fix the SSIS executables first

First install SQL Server service pack 2 or the memory leak

fix http://support.microsoft.com/kb/912423/

. This fix resolves many memory issues when using record sets with over 1

million records, but there are still issues with more than 4 million records.

Define a Dataset

First add the input dataset that contains the records you

want to match. In this example it is my table of 1244 patients. To do this,

open Business Intelligence Development Studio.and start a BI project, then

define an OLE DB Source dataset. Any definable dataset may be used, but as many

temporary tables are created, you must have a connection to a SQL Server and be

a user with permission to create tables. Also limit your dataset to the fields

to be used for lookups and pass through to increase performance. Be wary of

data types, the fuzzy transformations rely on string comparisons to build the

Error-Tolerant Index (ETI), so datatypes like datetime must be transformed to

varchar.

Add a Fuzzy Lookup Transformation

Drop the fuzzy lookup transformation on the project. Then

point the datasource output to the fuzzy lookup. Double click on the

transformation to get to the Fuzzy Lookup Transformation Editor.

In the editor, define your reference table. My reference

table will be my patient registration table. The transformation is dependent on

indexes that are created during the lookup process. These indexes are different

than normal table indexes and are not stored, unless the Store new index box is

checked. Since this will not be a normal process of finding small lookup

values, there is no reason to store the newly created index. Add your OLE DB

connection, choose Generate new index, and then select the reference table

(mine is actually a view that converts the date of birth to a varchar).

Choose the Columns to Lookup and Return

Clicking on the Columns tab will allow selection of columns

to use in the fuzzy lookup and to return as a pass through.

The two tables here are the Input and the reference tables.

Simply drag and drop the fields from the input to the appropriate Lookup

columns. In this example I have first name, last name and date of birth in my

input, so I will match on those in the reference table. I need to get the medical

record number (MRN) out of the lookup table to return to our study coordinator,

so I choose MRN as a pass through.

Right clicking on the joined field line and choosing Edit

Mappings will allow further customization of the lookup in the Create

Relationships editor.

The Create Relationships Editor

Here are the guts of the lookup:

The Mapping Type indicates if the join is Fuzzy or Exact. If

your data columns are anything aside from strings (datetime, int, etc.) the

Mapping Type will only allow exact. The Comparison Flags allow further

processing choices, such as Ignore case, Ignore kana type, and others.

The Minimum Similarity sets the threshold similarity for

each field. If you want some fields to have a stronger weight in the linkage,

it should be given a higher value (closer to 1). In my example, I have very few

fields, and both first and last name can vary due to marriage, nicknames,

abbreviations and so forth. However, a patients date of birth should remain

constant, so I want it weighted higher than other columns. Setting a higher

Minimum Similarity will reduce the number of potential matches and thus speed

processing. Lastly, set the Similarity Output Alias if you want column names

other than the default in the resulting output table.

The Advanced Tab

Clicking on the Advanced tab shows the Maximum number of

matches to output per lookup. The default is 1, but if multiple matches are

desired, this number can be adjusted. The Similarity threshold sets the

threshold for the entire lookup. The closer to 1, the more alike the records

must be to satisfy the lookup. Bumping up this threshold will reduce the number

of matches, but could also miss records if they are less exact.

The Advanced Editor for Fuzzy Lookup

Most of the Custom Properties have been set in the

Relationship Editor; however there are other important elements in the advanced

editor. The Exhaustive indicator defaults to False. Setting it to True will

cause the lookup to reference each row in the input to each row in the

reference table. To do this, all records in the reference table are loaded into

memory and this can be very slow for large datasets. But it may result in

better matches. False will limit the results to matches where at least one of

the substrings is common to both the input and the reference.

WarmCaches are very important for speed with large datasets,

but do eat up more memory. If set to True, the ETI and the reference table will

be partially loaded into RAM. For small datasets setting WarmCaches to False

may actually be faster. The MaxMemoryUseage setting can be useful if you dont

want to allocate all available RAM to the transformation. If zero, it will

dynamically use RAM based on need and availability.

Add a Destination

Once the lookup component is configured, add a destination

and link the Fuzzy Lookup Output to this destination. Create the destination

first in the Connection Manager. Next click on the mappings to see what the

output will look like.

Running the package.

I recommend having SSIS on a separate machine (yes you need

another license) from your databases as this is a very memory- and CPU-

intensive operation. I ran this on a Dual Core 1.86 GHZ with 2 GB of RAM with

my reference and input tables on a Dual Core 2.21 GHZ server with 4 GB RAM. My

reference table had 1,711,968 records and the input had 1244. A non-exhaustive

transform took just over 15 minutes.

The Output

The output is of course dependent on your input and

selection of pass through fields. But the important fields are the Similarity

(one for each field and an overall), and the Confidence. Each field can have

its similarity to the reference output and it will tell you how like the values

are, where 1 is an exact match and .00 has nothing in common. There is also an overall

Similarity for all fields. The confidence field is the measure of likelihood

that this particular record is the best match with the reference table. So both

these fields are important in determining if the records are a match. A tie in

the similarities could be broken with a higher confidence score. Of course,

with SSIS the output can be directed to do many things. For my purposes, I will

have to generate a work list for someone to review.

To demonstrate, I created 2 dummy tables with fake data, and

I ran the project with a MaxOutputMatchesPerInput value

of 2.

The input:

The reference:

The output:

Note that with the MaxOutputMatchesPerInput of 2, I get two possible

matches for Mary Smith, Mary and Mark (both mathematically very similar). The

Similarity and confidence scores should tell me which to use. Also, if no

records are found that meet the minimum similarity, Nulls will be returned.

Conclusion

The fuzzy lookup transformation was designed to find and

standardize lookup values; however, with a little tweaking, it can be used for

probabilistic linkage of different datasets. The quality of the matching is

dependent on the fields you have and the quality of data therein.

Rate

5 (6)

Share

Share

Rate

5 (6)