Click here to monitor SSC
SQLServerCentral is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
 
 
 

Using Fuzzy Lookups for Record Linkage

By Brian Nordberg,

In a previous article I discussed using Fuzzy Grouping to find inexact duplicates in the same dataset. But what if you have 2 different datasets that you need to review to find common people? You could look for exact records, but what about misspellings, typos There are a few probabilistic linkage software packages on the market, but the fuzzy lookup transformation in SSIS can do the job. For example, in our hospital I was presented with 1244 patients from an outside study database and was asked if they existed in our 1.7 million patient registration database.

Fix the SSIS executables first

First install SQL Server service pack 2 or the memory leak fix http://support.microsoft.com/kb/912423/ . This fix resolves many memory issues when using record sets with over 1 million records, but there are still issues with more than 4 million records.

Define a Dataset

First add the input dataset that contains the records you want to match. In this example it is my table of 1244 patients. To do this, open Business Intelligence Development Studio.and start a BI project, then define an OLE DB Source dataset. Any definable dataset may be used, but as many temporary tables are created, you must have a connection to a SQL Server and be a user with permission to create tables. Also limit your dataset to the fields to be used for lookups and pass through to increase performance. Be wary of data types, the fuzzy transformations rely on string comparisons to build the Error-Tolerant Index (ETI), so datatypes like datetime must be transformed to varchar.

Add a Fuzzy Lookup Transformation

Drop the fuzzy lookup transformation on the project. Then point the datasource output to the fuzzy lookup. Double click on the transformation to get to the Fuzzy Lookup Transformation Editor.

In the editor, define your reference table. My reference table will be my patient registration table. The transformation is dependent on indexes that are created during the lookup process. These indexes are different than normal table indexes and are not stored, unless the Store new index box is checked. Since this will not be a normal process of finding small lookup values, there is no reason to store the newly created index. Add your OLE DB connection, choose Generate new index, and then select the reference table (mine is actually a view that converts the date of birth to a varchar).

Choose the Columns to Lookup and Return

Clicking on the Columns tab will allow selection of columns to use in the fuzzy lookup and to return as a pass through.

The two tables here are the Input and the reference tables. Simply drag and drop the fields from the input to the appropriate Lookup columns. In this example I have first name, last name and date of birth in my input, so I will match on those in the reference table. I need to get the medical record number (MRN) out of the lookup table to return to our study coordinator, so I choose MRN as a pass through.

Right clicking on the joined field line and choosing Edit Mappings will allow further customization of the lookup in the Create Relationships editor.

The Create Relationships Editor

Here are the guts of the lookup:

The Mapping Type indicates if the join is Fuzzy or Exact. If your data columns are anything aside from strings (datetime, int, etc.) the Mapping Type will only allow exact. The Comparison Flags allow further processing choices, such as Ignore case, Ignore kana type, and others.

The Minimum Similarity sets the threshold similarity for each field. If you want some fields to have a stronger weight in the linkage, it should be given a higher value (closer to 1). In my example, I have very few fields, and both first and last name can vary due to marriage, nicknames, abbreviations and so forth. However, a patients date of birth should remain constant, so I want it weighted higher than other columns. Setting a higher Minimum Similarity will reduce the number of potential matches and thus speed processing. Lastly, set the Similarity Output Alias if you want column names other than the default in the resulting output table.

The Advanced Tab

Clicking on the Advanced tab shows the Maximum number of matches to output per lookup. The default is 1, but if multiple matches are desired, this number can be adjusted. The Similarity threshold sets the threshold for the entire lookup. The closer to 1, the more alike the records must be to satisfy the lookup. Bumping up this threshold will reduce the number of matches, but could also miss records if they are less exact.

The Advanced Editor for Fuzzy Lookup

Most of the Custom Properties have been set in the Relationship Editor; however there are other important elements in the advanced editor. The Exhaustive indicator defaults to False. Setting it to True will cause the lookup to reference each row in the input to each row in the reference table. To do this, all records in the reference table are loaded into memory and this can be very slow for large datasets. But it may result in better matches. False will limit the results to matches where at least one of the substrings is common to both the input and the reference.

WarmCaches are very important for speed with large datasets, but do eat up more memory. If set to True, the ETI and the reference table will be partially loaded into RAM. For small datasets setting WarmCaches to False may actually be faster. The MaxMemoryUseage setting can be useful if you dont want to allocate all available RAM to the transformation. If zero, it will dynamically use RAM based on need and availability.

Add a Destination

Once the lookup component is configured, add a destination and link the Fuzzy Lookup Output to this destination. Create the destination first in the Connection Manager. Next click on the mappings to see what the output will look like.

Running the package.

I recommend having SSIS on a separate machine (yes you need another license) from your databases as this is a very memory- and CPU- intensive operation. I ran this on a Dual Core 1.86 GHZ with 2 GB of RAM with my reference and input tables on a Dual Core 2.21 GHZ server with 4 GB RAM. My reference table had 1,711,968 records and the input had 1244. A non-exhaustive transform took just over 15 minutes.

The Output

The output is of course dependent on your input and selection of pass through fields. But the important fields are the Similarity (one for each field and an overall), and the Confidence. Each field can have its similarity to the reference output and it will tell you how like the values are, where 1 is an exact match and .00 has nothing in common. There is also an overall Similarity for all fields. The confidence field is the measure of likelihood that this particular record is the best match with the reference table. So both these fields are important in determining if the records are a match. A tie in the similarities could be broken with a higher confidence score. Of course, with SSIS the output can be directed to do many things. For my purposes, I will have to generate a work list for someone to review.

To demonstrate, I created 2 dummy tables with fake data, and I ran the project with a MaxOutputMatchesPerInput value of 2.

The input:

The reference:

The output:

Note that with the MaxOutputMatchesPerInput of 2, I get two possible matches for Mary Smith, Mary and Mark (both mathematically very similar). The Similarity and confidence scores should tell me which to use. Also, if no records are found that meet the minimum similarity, Nulls will be returned.

Conclusion

The fuzzy lookup transformation was designed to find and standardize lookup values; however, with a little tweaking, it can be used for probabilistic linkage of different datasets. The quality of the matching is dependent on the fields you have and the quality of data therein.

Total article views: 9473 | Views in the last 30 days: 13
 
Related Articles
FORUM

Lookup transformation [Lookup Data Key [95]] Error: Row yielded no match during lookup

Lookup transformation [Lookup Data Key [95]] Error: Row yielded no match during lookup

FORUM

Multiple reference in a single field

Multiple reference in a single field

ARTICLE

SSIS Lookup using a Script Component

What would you do if you want to perform 15 to 30 lookups using the same reference dataset? This is ...

FORUM

find out which record_ids match on 2 fields and not match a third field.

match up information from 2 sources.

FORUM

Fuzzy Lookup???

How do I use the Fuzzy Lookup Component

 
Contribute

Join the most active online SQL Server Community

SQL knowledge, delivered daily, free:

Email address:  

You make SSC a better place

As a member of SQLServerCentral, you get free access to loads of fresh content: thousands of articles and SQL scripts, a library of free eBooks, a weekly database news roundup, a great Q & A platform… And it’s our huge, buzzing community of SQL Server Professionals that makes it such a success.

Join us!

Steve Jones
Editor, SQLServerCentral.com

Already a member? Jump in:

Email address:   Password:   Remember me: Forgotten your password?
Steve Jones