Click here to monitor SSC
SQLServerCentral is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
 
 
 
        
Home       Members    Calendar    Who's On


Add to briefcase 12»»

Fuzzy searching Expand / Collapse
Author
Message
Posted Wednesday, July 24, 2013 6:55 AM
Forum Newbie

Forum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum Newbie

Group: General Forum Members
Last Login: Monday, August 12, 2013 12:41 PM
Points: 5, Visits: 26
Hello everyone,

I'm having trouble finding a way to use fuzzy searching (specifically Levenstein) in my database.

For now, I've been using an optimized T-SQL implementation of the Levenstein algorithm that I found on another forum. However, it is way too slow for me to use.

I'm unable to use CLR functions on my server due to "memory pressure" issues (My dba is nervous about using the -g start up parameter to allocate additional memory).

In addition, I'm unable to use Master Data Services' Similarity function. I'm using SQL 2008 R2 Standard, which doesn't support MDS.

Can anyone think of another way that I can use fuzzy searching to compare entries in my database?

Thanks in advance.
Post #1477041
Posted Wednesday, July 24, 2013 10:21 PM
SSC Rookie

SSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC Rookie

Group: General Forum Members
Last Login: Tuesday, November 4, 2014 6:03 PM
Points: 25, Visits: 160
Its unclear if you are trying to match individual words (as in name matching) or entire phrases. The performance of any routine which has to match every row against every other row in a large table will always be a problem. The trick is to find a way of "rough matching" which can be indexed. In word checking, you might use the length of the words being matched, so that only those word close to each other in length are checked. Usually, users of Levenstein type routines will only accept matches with values less than (say) 4. In that case, the only words which can match must have lengths within 4 of each other. You can filter the data to eliminate the vast bulk of potential matches which can never match. There are lots of academic papers on improving the performance of matching routines, and they might be fruitful source of potential ideas for your matching problem. They often have long lists of references which can be consulted as well. Good luck.
Post #1477336
Posted Thursday, July 25, 2013 2:09 AM


SSC Eights!

SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!

Group: General Forum Members
Last Login: Friday, December 12, 2014 1:02 AM
Points: 823, Visits: 753
If you have a CLR solution, looks like your main problem is your DBA.

Could you post the output from "SELECT @@version" for your server, as well information about the machine. How much memory does it have? Are there other instances on the machine? Any other software competing about resources?


Erland Sommarskog, SQL Server MVP, www.sommarskog.se
Post #1477385
Posted Thursday, July 25, 2013 3:13 AM
Ten Centuries

Ten CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen Centuries

Group: General Forum Members
Last Login: Today @ 1:05 AM
Points: 1,037, Visits: 7,018
I've had some good results from this in recent years:

USE [Matching]
GO
/****** Object: UserDefinedFunction [dbo].[IF_Levenshtein01] Script Date: 25/07/2013 10:00:08 ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
-- this will score around 10,000 word pairs per second on 2010 laptop technology
alter FUNCTION [dbo].[IF_Levenshtein02]
(
@Reference VARCHAR(20), @Target VARCHAR(20)
)
RETURNS TABLE WITH SCHEMABINDING AS
RETURN
( -- output query

SELECT [Score %] = CASE
WHEN @Reference = @Target THEN CAST(100 AS NUMERIC(5,2))
WHEN 0 = 1 THEN CAST(100 AS NUMERIC(5,2)) -- placeholder for any other shortcuts
ELSE
(SELECT
[Score %] = CAST(SUM(LetterScore)*100.0/MAX(WordLength*WordLength) AS NUMERIC(5,2))
FROM ( -- do
SELECT
seq = t1.n,
ref.Letter,
v.WordLength,
LetterScore = v.WordLength - ISNULL(MIN(tgt.n),v.WordLength)
FROM ( -- v
SELECT
Reference = LEFT(@Reference + REPLICATE('_',WordLength),WordLength),
Target = LEFT(@Target + REPLICATE('_',WordLength),WordLength),
WordLength = WordLength
FROM ( -- di
SELECT WordLength = MAX(WordLength)
FROM (VALUES (DATALENGTH(@Reference)),(DATALENGTH(@Target))) d (WordLength)

) di
) v
CROSS APPLY ( -- t1
SELECT TOP(WordLength) n
FROM (VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(11),(12),(13),(14),(15),(16),(17),(18),(19),(20)) t2 (n)
) t1
CROSS APPLY (SELECT Letter = SUBSTRING(Reference,t1.n,1)) ref
OUTER APPLY ( -- tgt
SELECT TOP(WordLength) n = ABS(t1.n - t2.n)
FROM (VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(11),(12),(13),(14),(15),(16),(17),(18),(19),(20)) t2 (n)
WHERE SUBSTRING(@Target,t2.n,1) = ref.Letter
) tgt
GROUP BY t1.n, ref.Letter, v.WordLength
) do
)
END

) -- output query

GO
SELECT * FROM [dbo].[IF_Levenshtein02] ('summer day','Sommarskog')

Score %
--------
52.00



Low-hanging fruit picker and defender of the moggies





For better assistance in answering your questions, please read this.




Understanding and using APPLY, (I) and (II) Paul White

Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
Post #1477405
Posted Thursday, July 25, 2013 6:47 AM
Forum Newbie

Forum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum Newbie

Group: General Forum Members
Last Login: Monday, August 12, 2013 12:41 PM
Points: 5, Visits: 26
Could you post the output from "SELECT @@version" for your server, as well information about the machine. How much memory does it have? Are there other instances on the machine? Any other software competing about resources?

Here's the results of SELECT @@versions:

Microsoft SQL Server 2005 - 9.00.5069.00 (Intel X86) Aug 22 2012 16:01:52 Copyright (c) 1988-2005 Microsoft Corporation Standard Edition on Windows NT 5.2 (Build 3790: Service Pack 2) The server has 16 GBs of memory, but there is a lot of competition for memory. This is our main server and there are a lot of jobs running in the background.

The reason why I'm not able to use CLR functions is because I get the following error when I try to install the .dll file: Msg 6513, Level 16, State 27, Line 8 Failed to initialize the Common Language Runtime (CLR) v2.0.50727 due to memory pressure. Please restart SQL server in Address Windowing Extensions (AWE) mode to use CLR integration features.

I've tried restarting the server with AWE mode enabled and have tried all the common solutions that the internet has suggested. We think that the problem could be a memory leak. When the problem first started happening, we allocated more memory to SQL server. It sucked it up, but the problems didn't go away, including the CLR error. I've tried installing the CLR function on other servers, but strangely, I get the same error even on servers that are hardly being used and have a lot of free memory.

Its unclear if you are trying to match individual words (as in name matching) or entire phrases. The performance of any routine which has to match every row against every other row in a large table will always be a problem. The trick is to find a way of "rough matching" which can be indexed. In word checking, you might use the length of the words being matched, so that only those word close to each other in length are checked. Usually, users of Levenstein type routines will only accept matches with values less than (say) 4. In that case, the only words which can match must have lengths within 4 of each other. You can filter the data to eliminate the vast bulk of potential matches which can never match. There are lots of academic papers on improving the performance of matching routines, and they might be fruitful source of potential ideas for your matching problem. They often have long lists of references which can be consulted as well. Good luck.

Thanks for the suggestion. Right now I'm comparing full names with each other. I'll try editing the function so that it ignores pairs that are too different in length.

@ChrisM@home: I'll give it a try, thanks.





Thanks again, everyone. I'm at a roadblock atm, so I appreciate your expertise.
Post #1477460
Posted Thursday, July 25, 2013 3:41 PM


SSC Eights!

SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!

Group: General Forum Members
Last Login: Friday, December 12, 2014 1:02 AM
Points: 823, Visits: 753
Is /PAE present in BOOT.INI?

Can you post the output of this batch:

select physical_memory_in_bytes, virtual_memory_in_bytes,bpool_committed, bpool_commit_target, bpool_visible 
from sys.dm_os_sys_info

EXEC sp_configure 'show advanced options', 1
RECONFIGURE
EXEC sp_configure 'awe enabled'
EXEC sp_configure 'max server memory'



Erland Sommarskog, SQL Server MVP, www.sommarskog.se
Post #1477771
Posted Thursday, July 25, 2013 4:24 PM
SSC Rookie

SSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC Rookie

Group: General Forum Members
Last Login: Tuesday, November 4, 2014 6:03 PM
Points: 25, Visits: 160
The paper you really need to read is:
A Guided Tour to Approximate String Matching
GONZALO NAVARRO
University of Chile
ACM Computing Surveys, Vol. 33, No. 1, March 2001.
It gives a concise statement of academic progress on the problem up to that date, and includes a terrific list of references. I guess you have three directions to go: improve your algorithm, or SQL Server performance, or hardware performance. Or all three.
Post #1477792
Posted Thursday, July 25, 2013 10:43 PM
SSC Rookie

SSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC Rookie

Group: General Forum Members
Last Login: Tuesday, November 4, 2014 6:03 PM
Points: 25, Visits: 160
Other possible "rough matching" filters to apply before the full matching routine:
- the first characters must match
- one of first two characters must match
- 2 out of the first three characters must match
Tests such as these can be hardcoded ar prefilters to allow the full matching routine to be invoked as rarely as possible. They are a compromise, but some of them may be acceptable compromises, if an impossible task becomes possible. Maybe you can plow through "the rest" in background as a long running low priority task, so you can evaluate the effect of the compromises. Or a sample of "the rest", at least.
Post #1477850
Posted Friday, July 26, 2013 2:59 AM


Hall of Fame

Hall of FameHall of FameHall of FameHall of FameHall of FameHall of FameHall of FameHall of FameHall of Fame

Group: General Forum Members
Last Login: Yesterday @ 4:49 PM
Points: 3,427, Visits: 5,378
Perhaps this article will help you: Fuzzy-String Search: Find misspelled information with T-SQL


My mantra: No loops! No CURSORs! No RBAR! Hoo-uh!

My thought question: Have you ever been told that your query runs too fast?

My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.


Need to UNPIVOT? Why not CROSS APPLY VALUES instead?
Since random numbers are too important to be left to chance, let's generate some!
Learn to understand recursive CTEs by example.
Splitting strings based on patterns can be fast!
Post #1477909
Posted Friday, July 26, 2013 6:56 AM
Forum Newbie

Forum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum Newbie

Group: General Forum Members
Last Login: Monday, August 12, 2013 12:41 PM
Points: 5, Visits: 26
Is /PAE present in BOOT.INI?

No. Here's the file in case it might be helpful.

[boot loader]
timeout=30
default=multi(0)disk(0)rdisk(0)partition(1)\WINDOWS
[operating systems]
multi(0)disk(0)rdisk(0)partition(1)\WINDOWS="Windows Server 2003, Enterprise" /noexecute=optout /fastdetect

Can you post the output of this batch:

select physical_memory_in_bytes, virtual_memory_in_bytes,bpool_committed, bpool_commit_target, bpool_visible
from sys.dm_os_sys_info

EXEC sp_configure 'show advanced options', 1
RECONFIGURE
EXEC sp_configure 'awe enabled'
EXEC sp_configure 'max server memory'

physical_memory_in_bytes: 17169162240
virtual_memory_in_bytes: 2147352576
bpool_committed: 1769472
bpool_commit_target: 1769472
bpool_visible: 181248

name: awe enabled
minimum: 0
maximum: 1
config_value: 1
run_value: 1

name: max server memory (MB)
minimum: 16
maximum: 2147483647
config_value: 16000
run_value: 16000

The paper you really need to read is:
A Guided Tour to Approximate String Matching
GONZALO NAVARRO

Perhaps this article will help you: Fuzzy-String Search: Find misspelled information with T-SQL

Thanks for the references and suggestions. I'm comparing so many strings that I probably won't be able to simply use a Levenstein algorithm.
Post #1477981
« Prev Topic | Next Topic »

Add to briefcase 12»»

Permissions Expand / Collapse