Fuzzy Matching

  • I'm reaching out looking for pointers as I start to brainstorm ideas on how to handle a task that has been presented to me. We receive a set of data that has a manually entered piece to it, so we can receive a file that has the entries "ABC Movers", "A.B.C. Movers", or "ABC Mvoers". We would like to maintain a list of what we know the value should be (i.e. "ABC Movers") and have any variation that is close to it changed to show the correct value. I started some research and came across the SOUNDEX function, but I'm not sure how reliable it can be or if it is the best way. Does anyone have experience doing something like this?

  • RonMexico (8/30/2016)


    I'm reaching out looking for pointers as I start to brainstorm ideas on how to handle a task that has been presented to me. We receive a set of data that has a manually entered piece to it, so we can receive a file that has the entries "ABC Movers", "A.B.C. Movers", or "ABC Mvoers". We would like to maintain a list of what we know the value should be (i.e. "ABC Movers") and have any variation that is close to it changed to show the correct value. I started some research and came across the SOUNDEX function, but I'm not sure how reliable it can be or if it is the best way. Does anyone have experience doing something like this?

    oh yay, lucky you!

    Start of Minor rant: I hate free-flowing note fields that contain data that belongs in columns in perhaps lookup tables. End minor rant.

    Your best bet is going to be regular expressions (fun fun; luckily there are a lot of websites with pre-made ones to get you started), and T-SQL will not be the easiest way to do that. If possible, do it in C# or VB.NET or something that supports full regular expressions.

    This whole thing is problematic though, because people who are entering things in free-flowing notes will not always format them the way you assume they would. There will always be mismatches. Therefore, in addition to the matches, you should report on the non-matches so someone can be tasked with seeing if there are additional patterns to add to the regexes that are scrubbing the data for you.

  • The SOUNDEX hashing function is primitive and will result in many false matches when comparing strings of multiple words, even if they are very dissimilar. Basically, it retains the first character, strips additional vowels, hashes the first character of each word, and is sensitive to spaces. It is better suited to hashing single words.

    For example, all of the following convert to the same SOUNDEX hash (shown as a comment next to each call), despite being obviously different.

    print soundex('ABC Movers')      -- A120
    print soundex('ABC Maplethorpe') -- A120
    print soundex('ABC M')           -- A120

    It is also sensitive to punctuation and symbols, which again is not what you want:

    print soundex('A.B.C. Movers')   -- A000
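    Where SOUNDEX holds up a little better is on single words, since a transposed vowel usually drops out of the hash entirely. As a rough illustration (and not a recommendation for the multi-word case), both of these should return the same code:

    print soundex('Movers')
    print soundex('Mvoers')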

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • I've got a better function than SOUNDEX for you. It's called "StringsClose" and it compares strings while forgiving one and only one typo. A typo is defined as a mistyped character, an extra character, an omitted character or a single transposition of two characters. It is not perfect, and really shouldn't be trusted where both strings are 1-5 characters. All disclaimers aside, it will still handle what you're asking for in your examples.

    Before comparing the strings in your example, you will probably want to get rid of all periods (and probably other special characters as well). Like this:

    declare @StringA varchar(50) = 'A.B.C. Movers'
           ,@StringB varchar(50) = 'ABC Movres'

    select @StringA as [@StringA], @StringB as [@StringB], sc.StringsClose
    from dbo.itvf_StringsClose(replace(@StringA,'.',''),
                               replace(@StringB,'.','')
                              ) sc

    The code for the function follows. It looks clunky but runs fast enough that I've never had to revisit it and make it more elegant. Let me know if this works for you, or if you have any questions.

    CREATE FUNCTION [dbo].[itvf_StringsClose]
    (
     @String1 varchar(50)
    ,@String2 varchar(50)
    )
    RETURNS TABLE
    AS
    -- =================================================================================================
    -- Author: Bob Hovious
    -- Create date: 3/19/2008
    -- Description: This function compares two Strings and deems them to be "close" if:
    --   1. One has one more character than the other, in any position
    --   2. One character has been mistyped
    --   3. Two characters are transposed
    -- =================================================================================================
    RETURN
    (
    WITH
    -- remove all blanks from Strings
    trimmedStrings AS (SELECT replace(@String1,' ','') AS String1
                             ,replace(@String2,' ','') AS String2)
    -- make sure that the longest String is always put in String1;
    -- if they are 2 or more apart they aren't close, so don't bother passing rows to the "stuffing" CTE
    ,basevalues AS
     (SELECT CASE WHEN len(String1) >= len(String2) THEN String1 ELSE String2 END AS String1 -- long String 1
            ,CASE WHEN len(String1) >= len(String2) THEN String2 ELSE String1 END AS String2 -- short String 2
      FROM trimmedStrings
      WHERE abs(len(String1)-len(String2)) < 2
     )
    ,tally1 AS (SELECT * FROM (VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10),
                                      (11),(12),(13),(14),(15),(16),(17),(18),(19),(20),
                                      (21),(22),(23),(24),(25),(26),(27),(28),(29),(30),
                                      (31),(32),(33),(34),(35),(36),(37),(38),(39),(40),
                                      (41),(42),(43),(44),(45),(46),(47),(48),(49),(50)) v (N1)
               )
    ,tally2 AS (SELECT * FROM (VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10),
                                      (11),(12),(13),(14),(15),(16),(17),(18),(19),(20),
                                      (21),(22),(23),(24),(25),(26),(27),(28),(29),(30),
                                      (31),(32),(33),(34),(35),(36),(37),(38),(39),(40),
                                      (41),(42),(43),(44),(45),(46),(47),(48),(49),(50)) v (N2)
               )
    ----------------------------------------------------------------------------------------
    -- build sets of character Strings for "LIKE" comparisons and transposition comparisons
    ----------------------------------------------------------------------------------------
    ,comparisons AS (-- if they're equal, they're equal
                     SELECT TOP (1) 'Y' AS StringsClose
                     FROM basevalues
                     WHERE String1 = String2
                     UNION ALL
                     -- replace each character in String1 with a wildcard to handle single keystroke typos,
                     -- whether a wrong character was typed or a single character was omitted;
                     -- this works whether or not the length of String1 is greater than String2
                     SELECT TOP (1) 'Y' AS StringsClose
                     FROM basevalues
                     CROSS JOIN tally1 t1
                     WHERE 1 = 1
                       AND String2 LIKE stuff(String1,N1,1,'%')
                       AND N1 BETWEEN 2 AND len(String1)
                     UNION ALL
                     -- where Strings are the same length, check for transposition
                     SELECT TOP (1) 'Y' AS StringsClose
                     FROM basevalues
                     CROSS JOIN tally2 t2
                     WHERE 1 = 1
                       AND len(String1) = len(String2)
                       AND N2 < len(String1)
                       AND String2 = stuff(String1,N2,2,reverse(substring(String1,N2,2)))
                    )
    SELECT isnull((SELECT TOP (1) StringsClose FROM comparisons),'N') AS StringsClose
    )

    Edited to add: Having re-read and thought about your problem, this may not be much help, as it would run slowly. It runs fast as written because it compares two strings almost instantly. In your case, you would have to adapt it to run against your table of correct values for each of the String1 "LIKE" combinations generated by the function. That would NOT run almost instantly.

    The deciding factor is how often you hit these misspellings. If it's one time in a thousand, this approach might be workable for you.

    How many rows are in the table with correct values? I assume you have an index over the column with the correct values. What is the length of that column?
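    To make the cost concrete, a brute-force version of that adaptation might look like the sketch below, where dbo.IncomingNames(RawName) and dbo.CorrectNames(CorrectName) are hypothetical tables standing in for your feed and your list of correct values. Every incoming name is compared against every correct value, which is exactly the part that would not run almost instantly:

    select i.RawName, c.CorrectName
    from dbo.IncomingNames i
    cross join dbo.CorrectNames c
    cross apply dbo.itvf_StringsClose(replace(i.RawName,'.',''),
                                      replace(c.CorrectName,'.','')) sc
    where sc.StringsClose = 'Y';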

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • RonMexico (8/30/2016)


    I'm reaching out looking for pointers as I start to brainstorm ideas on how to handle a task that has been presented to me. We receive a set of data that has a manually entered piece to it, so we can receive a file that has the entries "ABC Movers", "A.B.C. Movers", or "ABC Mvoers". We would like to maintain a list of what we know the value should be (i.e. "ABC Movers") and have any variation that is close to it changed to show the correct value. I started some research and came across the SOUNDEX function, but I'm not sure how reliable it can be or if it is the best way. Does anyone have experience doing something like this?

    You need to maintain a list of what you know the value should be (i.e. "ABC Movers") along with a table of all the variations that have been identified as belonging to the correct value.

    I have extensive experience with attempts at matching "close strings" on the fly - it never works.

    I bet "A.B.C" in "A.B.C. Movers" stands for something.

    I can guarantee that somebody will enter the full name, not the abbreviation.

    And it will never be close to "ABC Movers". You just need to know that, say, "Authentic British Cars Movers" is another name for "ABC Movers".
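    To make that concrete, a rough sketch of the synonym-table approach might look like this; the table and column names (dbo.CompanySynonyms, dbo.IncomingNames) are made up for illustration:

    create table dbo.CompanySynonyms
    (
        Synonym     varchar(100) not null primary key,
        CorrectName varchar(100) not null
    );

    insert into dbo.CompanySynonyms (Synonym, CorrectName)
    values ('A.B.C. Movers', 'ABC Movers'),
           ('Authentic British Cars Movers', 'ABC Movers');

    -- resolve incoming values via exact lookup; anything unresolved goes to a person
    select i.RawName,
           ResolvedName = coalesce(s.CorrectName, i.RawName)
    from dbo.IncomingNames i
    left join dbo.CompanySynonyms s on s.Synonym = i.RawName;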

    _____________
    Code for TallyGenerator

  • Sergiy, the problem with that approach is that it works for synonyms but not for typos in the correct character string. For example:

    ABC Movres

    A.B.C Movers

    Authentic British Cars Movers

    Authentic British Car Movers

    Authntic British Cars Movers

    Auth. Brit. Car Movers

    Auth Br Car Moevrs

    I was about to say you can't anticipate typos, but if you are willing to pay the price in disk space, it's possible to write a routine that generates the typo patterns I handle with StringsClose. Obviously, the larger the string, the more rows of acceptable combinations will be produced. But with some limits on what is considered acceptable, it might be workable.

    There would need to be additional synonym capability to associate 'ABC Movers' with 'Authentic British Car Movers' so you could handle typos against the longer string and against abbreviated strings such as 'Auth Br Car Movres'. There is no taking all of the manual work out of it, and it will always be imperfect. Some edge case you didn't anticipate will always slip through the cracks. But once it's running, it's probably going to be the fastest of many imperfect solutions.
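    For what it's worth, a rough sketch of generating those one-character wildcard patterns might look like this, against a hypothetical dbo.CorrectNames(CorrectName) table and assuming names of at most 20 characters once spaces are stripped. Each correct name yields one LIKE pattern per character position; persisting and indexing those rows is where the disk-space cost comes in:

    select c.CorrectName,
           Pattern = stuff(replace(c.CorrectName,' ',''), v.N, 1, '%')
    from dbo.CorrectNames c
    join (values (1),(2),(3),(4),(5),(6),(7),(8),(9),(10),
                 (11),(12),(13),(14),(15),(16),(17),(18),(19),(20)) v (N)
      on v.N <= len(replace(c.CorrectName,' ',''));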

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • This article describes a CLR function to calculate the Levenshtein distance. Implementing the full functionality in an iTVF is tricky and results in poor performance. Here's an iTVF which is almost there but runs fast enough for most purposes - about 10,000 matches per second.

    USE [Matching]
    GO
    /****** Object: UserDefinedFunction [dbo].[IF_Levenshtein02] Script Date: 27/01/2014 20:12:53 ******/
    SET ANSI_NULLS ON
    GO
    SET QUOTED_IDENTIFIER ON
    GO
    -- this will score around 10,000 word pairs per second on 2010 laptop technology
    ALTER FUNCTION [dbo].[IF_Levenshtein02]
    (
     @Reference VARCHAR(20), @Target VARCHAR(20)
    )
    RETURNS TABLE WITH SCHEMABINDING AS
    RETURN
    ( -- output query
    SELECT [Score %] = CASE
        WHEN @Reference = @Target THEN CAST(100 AS NUMERIC(5,2))
        WHEN 0 = 1 THEN CAST(100 AS NUMERIC(5,2)) -- placeholder for any other shortcuts
        ELSE
        (SELECT
            [Score %] = CAST(SUM(LetterScore)*100.0/MAX(WordLength*WordLength) AS NUMERIC(5,2))
         FROM ( -- do
            SELECT
                seq = t1.n,
                ref.Letter,
                v.WordLength,
                LetterScore = v.WordLength - ISNULL(MIN(tgt.n),v.WordLength)
            FROM ( -- v
                SELECT
                    Reference = LEFT(@Reference + REPLICATE('_',WordLength),WordLength),
                    Target = LEFT(@Target + REPLICATE('_',WordLength),WordLength),
                    WordLength = WordLength
                FROM ( -- di
                    SELECT WordLength = MAX(WordLength)
                    FROM (VALUES (DATALENGTH(@Reference)),(DATALENGTH(@Target))) d (WordLength)
                ) di
            ) v
            CROSS APPLY ( -- t1
                SELECT TOP(WordLength) n
                FROM (VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(11),(12),(13),(14),(15),(16),(17),(18),(19),(20)) t2 (n)
            ) t1
            CROSS APPLY (SELECT Letter = SUBSTRING(Reference,t1.n,1)) ref
            OUTER APPLY ( -- tgt
                SELECT TOP(WordLength) n = ABS(t1.n - t2.n)
                FROM (VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(11),(12),(13),(14),(15),(16),(17),(18),(19),(20)) t2 (n)
                WHERE SUBSTRING(@Target,t2.n,1) = ref.Letter
            ) tgt
            GROUP BY t1.n, ref.Letter, v.WordLength
         ) do
        )
        END
    ) -- output query
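    A quick usage sketch, using the example strings from this thread (the function returns a single [Score %] column):

    select lv.[Score %]
    from dbo.IF_Levenshtein02('ABC Movers', 'ABC Mvoers') lv;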

    Edit: see also these articles by Alan.B

    "Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms." - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • The Dixie Flatline (8/30/2016)

    if you are willing to pay the price in disk space, it's possible to write a routine that generates the typo patterns I handle with StringsClose.

    No - not generating all possible typos.

    Just record every resolution an operator applied when a name was not recognised by the system.

    On every such occasion the system should stop processing the file and notify someone about the issue. Once the correct match is suggested, the system saves it in the [Synonyms] table and uses it on every further occasion.

    In my experience, over the course of 10 years every name accumulated 5-7 nicknames, with a couple of unfortunate ones having >20 variations.

    Initially it will require some attention, but once all the most typical errors are recorded it will bother people maybe once a month.
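    A minimal sketch of that "stop and record" step, with hypothetical table names (dbo.IncomingNames for the feed, dbo.CorrectNames for the known-good list, dbo.CompanySynonyms for the recorded resolutions, dbo.UnresolvedNames for the queue an operator works through):

    -- queue up anything that matches neither the correct list nor a recorded synonym
    insert into dbo.UnresolvedNames (RawName, LoadDate)
    select i.RawName, getdate()
    from dbo.IncomingNames i
    left join dbo.CorrectNames    c on c.CorrectName = i.RawName
    left join dbo.CompanySynonyms s on s.Synonym     = i.RawName
    where c.CorrectName is null
      and s.Synonym is null;

    Once the operator supplies the correct match, it goes into the synonym table and that variation never needs attention again.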

    _____________
    Code for TallyGenerator

  • Sergiy, I too have been working with fuzzy matching on people's names for over a decade. I hear what you're saying, but that remains a manual process - when something doesn't match, look at it and make some table entries. Obviously that has to happen for edge cases that fall through the cracks in code. My point was that, for the price in disk space, one-character typos can be anticipated by the system.

    Having slept on this, it may still be possible to generate LIKE patterns on the fly, but substantially reduce the I/O. I have to go to work in a few, but I could do some coding and testing tonight. If anyone else wants to tackle this, here are my thoughts.

    Have indexes on the correct string (targetstring) and on REVERSE(targetstring).

    Remove spaces and special characters.

    First pass: get all rows where targetstring LIKE LEFT(sourcestring,2) + '%' AND ABS(LEN(targetstring) - LEN(sourcestring)) <= 1.

    Cross join the result set to the set of patterns generated by StringsClose.

    Take the top (1) result.

    If no hit, get all rows where REVERSE(targetstring) LIKE REVERSE(RIGHT(sourcestring,2)) + '%' AND ABS(LEN(targetstring) - LEN(sourcestring)) <= 1.

    Cross join the result set to the set of patterns generated by StringsClose.

    Take the top (1) result.

    This would not run as fast as generating a table with all possible one-character typos, but it would certainly run much faster than a manual process.
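    A rough sketch of the first pass, using the itvf_StringsClose function posted earlier in place of a pre-generated pattern set, and a hypothetical dbo.CorrectNames table with a targetstring column:

    declare @sourcestring varchar(50) = 'ABCMovres'; -- already stripped of spaces and special characters

    select top (1) c.targetstring
    from dbo.CorrectNames c
    cross apply dbo.itvf_StringsClose(@sourcestring, c.targetstring) sc
    where c.targetstring like left(@sourcestring,2) + '%'          -- index-friendly prefix filter
      and abs(len(c.targetstring) - len(@sourcestring)) <= 1       -- lengths within 1
      and sc.StringsClose = 'Y';

    The second pass would be the same idea against REVERSE(targetstring), using the reversed last two characters of the source string as the prefix.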

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • Thanks for all the help! It sounds like I might have to maintain a table with the correct name and create a list of all of the incorrect names that go along with it as they happen. It's a monthly process that brings in 300k records but I don't anticipate there being many misspellings given that it is a feed from a reputable source.

  • Ron, how many correct names are we talking about? I'm assuming this is something like a customer name. Is the number of correct names in the thousands, tens of thousands, or millions?

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • If the manual process works for you, then go with it. But thanks for bringing up an interesting problem. 🙂

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • Mis-posted. Withdrawn.

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • This is a new process so we haven't really taken a close look at the data at this point. I can't put a number on it but would venture to say the vast majority will be correct.

  • No, I'm not talking about the input batches. I mean how many names are in your client table, or whatever.

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills
