Comments posted to this topic are about the item Fuzzy Search
Love it. Brilliant work. So simple and effective. I just tried this on one of my databases, and it's a solid performer. I'm thinking of updating the algorithm to account for common transpositions. ie = ei and vice-versa..
I've tried this on MySQL with similar results.
As a homage to Jeff Modem I created a tally table with as least as many number as the longest string to be compared.
MySQL uses LOCATE rather than CHARINDEX but the principal is the same.
SELECT COUNT( DISTINCT LOCATE(SUBSTRING(@name1,tally_id,2),@name2))
I then ran the following to make sure that I understood what was going on
SELECT LOCATE(SUBSTRING(@name1,tally_id,2),@name2),SUBSTRING(@name2,tally_id,2),@name1, @name2
This revealed a gotcha to consider. Non-alphabetic characters crop up in pairs and count towards the score. This can increase the score without improving the match thus suggesting that some string cleansing would be required in the source data or some considerations to be added to the function. The combinations that can inflate the score that I can see are as follows
As a quick bodge to get around this I tried changing the score to a fraction of the 2 character matches to 3 character matches.
WHEN 0 THEN 0
ELSE trigram_match_count/bigram_match_count END AS Score
COUNT(DISTINCT LOCATE(SUBSTRING(@name1,tally_id,2),@name2)) AS bigram_match_count,
COUNT(DISTINCT LOCATE(SUBSTRING(@name1,tally_id,3),@name2)) AS trigram_match_count,
MAX(LOCATE(SUBSTRING(@name1,tally_id,3),@name2)) AS trigram_position_match
) AS D;
By having the score as a 3 letter match / 2 letter match the score range is from 0 to 1 with 1 being a perfect match.
Tried this method years ago - it's not very accurate. Here's an alternative:
ChrisM@Work - Wednesday, November 19, 2008 5:43 AM
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
I like the "brute-force" approach to this - just whip through the data and count matches.
For organization matches I have a table of common abbreviations (Co., Company, LLC, Inc., Brothers, And Sons etc.) and industry-specific words (Construction, Contractors, Builders, etc.) that I toss when looking for matches.
ONE caveat to the function here - I would add a SUBTRACTION to account for extra characters.
[James Madison] --> [James Madison] *SHOULD SCORE HIGHER THAN: [James Madison, JR] --> [James Madison]
There has actually been some research on this approach, see for instance:
And Microsoft SSIS uses a similar approach in it's "fuzzy matching" component, see:
lauri.pietarinen - Thursday, August 3, 2017 5:18 AM
From reading those links, I believe those approaches use old procedural logical methodologies to solve the problem. I've tried many "intelligent" approaches and they are inferior. The approach in this article uses what I would call a holistic approach. Notice there is virtually no process to this function. I'm forced to loop because of the language but with the right construct there would be no need for looping and you would just "get" a score.
Well, not really procedural, since all those "side tables" can be created using "vanilla" SQL. Your approach may run into problems if you are comparing thousands of values with each other. (1000x1000 = 1M, 1M x 1 sec = a very long time!). So basically the idea is exactly the same, but some "preprocessing" is done in order to use SQL join power to process lot's of rows efficiently.
That being said your solution is certainly very simple and elegant for certain situations!
Did you try it for pairs like
"UCLA" <--> "University of California Los Angeles" ?
What kind of score does it return?
And which pair is a better match:
[James Madison, JR] --> [James Madison, SR]
[James Madison, JR] --> [James Madison Junior]
Code for TallyGenerator
Sergiy - Thursday, August 3, 2017 6:09 AM
In the first parameter include all known variations so any of them can be found in the second parameter.
SELECT dbo.FuzzyMatchString('UCLA','University of California Los Angeles'); -- returns 0
SELECT dbo.FuzzyMatchString('UCLA University of California Los Angeles','University of California Los Angeles'); -- returns 36
SELECT dbo.FuzzyMatchString('James Madison, JR','James Madison, Junior'); -- returns 15
SELECT dbo.FuzzyMatchString('Mr. Dr. James R. Madison, JR Junior','James Madison, Junior'); -- returns 22
While this is very useful for returning a list to be examined manually, it's not appropriate for most automated name matching. For instance, if you were running this against first name, and used William as the search string, an exact match on William is rated the same as William-Frank, and Bill is rated the same as Lillian. Additional logic to weight the score to account for things such as exact match, matching substrings, etc. would increase the accuracy.
Kimberly Blum - Thursday, August 3, 2017 9:04 AM
Yes, it is a fuzzy search and not applicable to equi-join situations otherwise we would just do exact matching. I question any "automated name matching" routine as being 100% reliable. The magic is that the old saying "garbage in, garbage out" is no longer true. If a receptionist types your name in wrong while searching for your previous account, it will still find it, thus, "garbage in, good data out".
Bill Talada - Thursday, August 3, 2017 10:09 AM
I see this as a FABULOUS approach to give the user a quick-and-dirty listing of what MIGHT be what they are looking for on a lookup. Hey...I didn't find EXACTLY what you are looking for (else I return that first) but here are some items that MIGHT be what you are looking for.
any links to code/data sources to generate this 1 million row "Names" table please?
you can lead a user to data....but you cannot make them think
and remember....every day is a school day
this was really neat Bill, I like it;
the thing i wanted was to also see a perfect score alongside the actual score, since longer strings would return larger scores
I slapped together a ITVF version, and my scrores seem to be +1 compared to your score, but i get a bit more detail for me to see
IF OBJECT_ID('[dbo].[FuzzyMatchString_ITVF]') IS NOT NULL
DROP FUNCTION [dbo].[FuzzyMatchString_ITVF]
CREATE FUNCTION dbo.FuzzyMatchString_ITVF (@s1 varchar(100), @s2 varchar(100))
--yet another small tally generator: 256 values
ROW_NUMBER() over (order by num) as n
from (values (1)) t (num)
group by cube (num, num, num, num, num, num,num,num)
--more num = another order power of 2
SUM(CASE WHEN CHARINDEX(SUBSTRING(@s2,MiniTally.n,2),@s1) > 0 THEN 1 ELSE 0 END) AS SearchScore,
SUM(CASE WHEN CHARINDEX(SUBSTRING(@s1,MiniTally.n,2),@s1) > 0 THEN 1 ELSE 0 END) AS PerfectScore,
CONVERT(DECIMAL(5,2),100 * ((SUM(CASE WHEN CHARINDEX(SUBSTRING(@s2,MiniTally.n,2),@s1) > 0 THEN 1 ELSE 0 END) * 1.0) / (SUM(CASE WHEN CHARINDEX(SUBSTRING(@s1,MiniTally.n,2),@s1) > 0 THEN 1 ELSE 0 END) * 1.0))) AS PercentScore,
@s1 AS SearchTerm,
@s2 AS ItemCompared
and here's some sample data and the results
SELECT '40','Harrison','Ford' UNION ALL
SELECT '93','Jason','Robards' UNION ALL
SELECT '88','Edward','Norton' UNION ALL
SELECT '25','Orson','Welles' UNION ALL
SELECT '3','Marlon','Brando' UNION ALL
SELECT '61','Michael','Caine' UNION ALL
SELECT * , dbo.FuzzyMatchString('Harrison Ford',msn.[StarFirstName] + ' ' + msn.[StarLastName]) AS ScalarVal
FROM [MovieStarNames] AS [msn]
CROSS APPLY dbo.FuzzyMatchString_ITVF('Harrison Ford',msn.[StarFirstName] + ' ' + msn.[StarLastName]) fn
Viewing 15 posts - 1 through 15 (of 32 total)