January 15, 2013 at 6:11 pm
wolfkillj (1/15/2013)
Actually, that article was my starting point in challenging myself to come up with a true set-based DLD solution. The code posted there is functional but requires a lot of control-of-flow language. I hope to come up with a solution that can be used as an inline table-valued function. I'd say I've got it about 90% solved, but there are still some inconsistencies to resolve.
I can certainly appreciate that perspective as my curiosity often pulls me into such things as well. I looked at the LD algorithm on Wiki and since it was recursive I thought to apply a rCTE to it. Unfortunately I got bogged down and couldn't get one working in the 30 minutes (over lunch like Chris) that I had to devote to it.
Be sure to let us know what you come up with. There may be an article in it.
My thought question: Have you ever been told that your query runs too fast?
My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.
Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
Since random numbers are too important to be left to chance, let's generate some![/url]
Learn to understand recursive CTEs by example.[/url]
[url url=http://www.sqlservercentral.com/articles/St
January 16, 2013 at 1:02 am
wolfkillj (1/15/2013)
dwain.c (1/15/2013)
wolfkillj (1/15/2013)
I've been working on a set-based T-SQL solution to determine the Damerau-Levenshtein Distance between two strings. (DLD distance is just like LD - the number of changes required to transform one string into the other - but DLD allows transpositions (counting as one change) as well as the additions, deletions, and substitutions allowed by LD). My code is not yet working reliably, but it looks very similar to Chris's. If I ever get it working, I'll post it, but it's been about six weeks since my "real" work has allowed me any time to tinker with it.Jason
Jason - DLD is actually the approach applied in the article I linked to in my prior post. Forgot to mention it. You may want to take a look because the author spent a lot of time trying to optimize his solution and it might save you a bit of effort.
Don't mean for it to take away the challenge though. π
dwain.c (1/14/2013)
Jeff Moden (1/14/2013)
dwain.c (1/14/2013)
Good grief! Now I suppose I'm going to need to put up or shut up.Heh... glad it's not me this time. π
Maybe I don't need to because it's already been done.
http://www.sqlservercentral.com/articles/Fuzzy+Match/92822/
Seems I'd forgotten about this article, even though I believe I posted something to its discussion thread.
Actually, that article was my starting point in challenging myself to come up with a true set-based DLD solution. The code posted there is functional but requires a lot of control-of-flow language. I hope to come up with a solution that can be used as an inline table-valued function. I'd say I've got it about 90% solved, but there are still some inconsistencies to resolve.
That was my intention too. It's not quite there in terms of the spec, but I think it may be tweakable, and it can easily be converted to an iTVF.
Bearing in mind that fuzzy-matching is an inexact science, I spent a few minutes working on a modification - functionally crippled in terms of fitting the full Levenshtein distance model but designed instead for speed (fuzzy matching is SLOW), and this effort cropped up out of the alphabet soup:
DECLARE @Reference VARCHAR(100), @Target VARCHAR(100), @WordLength INT
SELECT @Reference = 'Maltesers', @Target = 'Maltster' -- 91.36%
SELECT @Reference = 'millipede', @Target = 'millimetre' -- 69.00%
SELECT @Reference = 'smith', @Target = 'smythe' -- 66.67%
SELECT @Reference = 'Smith', @Target = 'Smith' -- 100.00%
SELECT @WordLength = MAX(WordLength)
FROM (SELECT WordLength = DATALENGTH(@Reference)
UNION ALL SELECT DATALENGTH(@Target)) d
SET @Reference = LEFT(@Reference + REPLICATE('_',@WordLength),@WordLength)
SET @Target = LEFT(@Target + REPLICATE('_',@WordLength),@WordLength)
;WITH Tally AS ( -- restricted to 20 letters max
SELECT TOP(@WordLength) n
FROM (VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(11),(12),(13),(14),(15),(16),(17),(18),(19),(20)) t2 (n))
SELECT
[Score %] = CAST(SUM(LetterScore)*100.0/(@WordLength*@WordLength) AS NUMERIC(5,2))
FROM (
SELECT
seq = t1.n,
ref.Letter,
LetterScore = @WordLength - ISNULL(MIN(tgt.n),@WordLength)
FROM tally t1
CROSS APPLY (SELECT Letter = SUBSTRING(@Reference,t1.n,1)) ref
OUTER APPLY (
SELECT n = ABS(t1.n - t2.n)
FROM tally t2
WHERE SUBSTRING(@Target,t2.n,1) = ref.Letter
) tgt
GROUP BY t1.n, ref.Letter
) d
The execution plan and brief testing indicate that it's very quick and again it's easily converted to an iTVF.
If I were performing fuzzy-matching now, I'd probably settle for a couple of complimentary methods like this one plus maybe token-matching.
For fast, accurate and documented assistance in answering your questions, please read  this article.
Understanding and using APPLY, (I) and  (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
Viewing 2 posts - 31 through 32 (of 32 total)
You must be logged in to reply to this topic. Login to reply