Fuzzy searching

  • Hello everyone,

    I'm having trouble finding a way to use fuzzy searching (specifically Levenstein) in my database.

    For now, I've been using an optimized T-SQL implementation of the Levenstein algorithm that I found on another forum. However, it is way too slow for me to use.

    I'm unable to use CLR functions on my server due to "memory pressure" issues (My dba is nervous about using the -g start up parameter to allocate additional memory).

    In addition, I'm unable to use Master Data Services' Similarity function. I'm using SQL 2008 R2 Standard, which doesn't support MDS.

    Can anyone think of another way that I can use fuzzy searching to compare entries in my database?

    Thanks in advance.

  • Its unclear if you are trying to match individual words (as in name matching) or entire phrases. The performance of any routine which has to match every row against every other row in a large table will always be a problem. The trick is to find a way of "rough matching" which can be indexed. In word checking, you might use the length of the words being matched, so that only those word close to each other in length are checked. Usually, users of Levenstein type routines will only accept matches with values less than (say) 4. In that case, the only words which can match must have lengths within 4 of each other. You can filter the data to eliminate the vast bulk of potential matches which can never match. There are lots of academic papers on improving the performance of matching routines, and they might be fruitful source of potential ideas for your matching problem. They often have long lists of references which can be consulted as well. Good luck.

  • If you have a CLR solution, looks like your main problem is your DBA.

    Could you post the output from "SELECT @@version" for your server, as well information about the machine. How much memory does it have? Are there other instances on the machine? Any other software competing about resources?

    [font="Times New Roman"]Erland Sommarskog, SQL Server MVP, www.sommarskog.se[/font]

  • I've had some good results from this in recent years:

    USE [Matching]

    GO

    /****** Object: UserDefinedFunction [dbo].[IF_Levenshtein01] Script Date: 25/07/2013 10:00:08 ******/

    SET ANSI_NULLS ON

    GO

    SET QUOTED_IDENTIFIER ON

    GO

    -- this will score around 10,000 word pairs per second on 2010 laptop technology

    alter FUNCTION [dbo].[IF_Levenshtein02]

    (

    @Reference VARCHAR(20), @Target VARCHAR(20)

    )

    RETURNS TABLE WITH SCHEMABINDING AS

    RETURN

    ( -- output query

    SELECT [Score %] = CASE

    WHEN @Reference = @Target THEN CAST(100 AS NUMERIC(5,2))

    WHEN 0 = 1 THEN CAST(100 AS NUMERIC(5,2))-- placeholder for any other shortcuts

    ELSE

    (SELECT

    [Score %] = CAST(SUM(LetterScore)*100.0/MAX(WordLength*WordLength) AS NUMERIC(5,2))

    FROM ( -- do

    SELECT

    seq = t1.n,

    ref.Letter,

    v.WordLength,

    LetterScore = v.WordLength - ISNULL(MIN(tgt.n),v.WordLength)

    FROM ( -- v

    SELECT

    Reference = LEFT(@Reference + REPLICATE('_',WordLength),WordLength),

    Target = LEFT(@Target + REPLICATE('_',WordLength),WordLength),

    WordLength = WordLength

    FROM ( -- di

    SELECT WordLength = MAX(WordLength)

    FROM (VALUES (DATALENGTH(@Reference)),(DATALENGTH(@Target))) d (WordLength)

    ) di

    ) v

    CROSS APPLY ( -- t1

    SELECT TOP(WordLength) n

    FROM (VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(11),(12),(13),(14),(15),(16),(17),(18),(19),(20)) t2 (n)

    ) t1

    CROSS APPLY (SELECT Letter = SUBSTRING(Reference,t1.n,1)) ref

    OUTER APPLY ( -- tgt

    SELECT TOP(WordLength) n = ABS(t1.n - t2.n)

    FROM (VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(11),(12),(13),(14),(15),(16),(17),(18),(19),(20)) t2 (n)

    WHERE SUBSTRING(@Target,t2.n,1) = ref.Letter

    ) tgt

    GROUP BY t1.n, ref.Letter, v.WordLength

    ) do

    )

    END

    ) -- output query

    GO

    SELECT * FROM [dbo].[IF_Levenshtein02] ('summer day','Sommarskog')

    Score %

    --------

    52.00


    [font="Arial"]Low-hanging fruit picker and defender of the moggies[/font]

    For better assistance in answering your questions, please read this[/url].


    Understanding and using APPLY, (I)[/url] and (II)[/url] Paul White[/url]

    Hidden RBAR: Triangular Joins[/url] / The "Numbers" or "Tally" Table: What it is and how it replaces a loop[/url] Jeff Moden[/url]

  • Could you post the output from "SELECT @@version" for your server, as well information about the machine. How much memory does it have? Are there other instances on the machine? Any other software competing about resources?

    Here's the results of SELECT @@versions:

    Microsoft SQL Server 2005 - 9.00.5069.00 (Intel X86) Aug 22 2012 16:01:52 Copyright (c) 1988-2005 Microsoft Corporation Standard Edition on Windows NT 5.2 (Build 3790: Service Pack 2) The server has 16 GBs of memory, but there is a lot of competition for memory. This is our main server and there are a lot of jobs running in the background.

    The reason why I'm not able to use CLR functions is because I get the following error when I try to install the .dll file: Msg 6513, Level 16, State 27, Line 8 Failed to initialize the Common Language Runtime (CLR) v2.0.50727 due to memory pressure. Please restart SQL server in Address Windowing Extensions (AWE) mode to use CLR integration features.

    I've tried restarting the server with AWE mode enabled and have tried all the common solutions that the internet has suggested. We think that the problem could be a memory leak. When the problem first started happening, we allocated more memory to SQL server. It sucked it up, but the problems didn't go away, including the CLR error. I've tried installing the CLR function on other servers, but strangely, I get the same error even on servers that are hardly being used and have a lot of free memory.

    Its unclear if you are trying to match individual words (as in name matching) or entire phrases. The performance of any routine which has to match every row against every other row in a large table will always be a problem. The trick is to find a way of "rough matching" which can be indexed. In word checking, you might use the length of the words being matched, so that only those word close to each other in length are checked. Usually, users of Levenstein type routines will only accept matches with values less than (say) 4. In that case, the only words which can match must have lengths within 4 of each other. You can filter the data to eliminate the vast bulk of potential matches which can never match. There are lots of academic papers on improving the performance of matching routines, and they might be fruitful source of potential ideas for your matching problem. They often have long lists of references which can be consulted as well. Good luck.

    Thanks for the suggestion. Right now I'm comparing full names with each other. I'll try editing the function so that it ignores pairs that are too different in length.

    @ChrisM@home: I'll give it a try, thanks.

    Thanks again, everyone. I'm at a roadblock atm, so I appreciate your expertise.

  • Is /PAE present in BOOT.INI?

    Can you post the output of this batch:

    select physical_memory_in_bytes, virtual_memory_in_bytes,bpool_committed, bpool_commit_target, bpool_visible

    from sys.dm_os_sys_info

    EXEC sp_configure 'show advanced options', 1

    RECONFIGURE

    EXEC sp_configure 'awe enabled'

    EXEC sp_configure 'max server memory'

    [font="Times New Roman"]Erland Sommarskog, SQL Server MVP, www.sommarskog.se[/font]

  • The paper you really need to read is:

    A Guided Tour to Approximate String Matching

    GONZALO NAVARRO

    University of Chile

    ACM Computing Surveys, Vol. 33, No. 1, March 2001.

    It gives a concise statement of academic progress on the problem up to that date, and includes a terrific list of references. I guess you have three directions to go: improve your algorithm, or SQL Server performance, or hardware performance. Or all three.

  • Other possible "rough matching" filters to apply before the full matching routine:

    - the first characters must match

    - one of first two characters must match

    - 2 out of the first three characters must match

    Tests such as these can be hardcoded ar prefilters to allow the full matching routine to be invoked as rarely as possible. They are a compromise, but some of them may be acceptable compromises, if an impossible task becomes possible. Maybe you can plow through "the rest" in background as a long running low priority task, so you can evaluate the effect of the compromises. Or a sample of "the rest", at least.

  • Perhaps this article will help you: Fuzzy-String Search: Find misspelled information with T-SQL[/url]


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
    Since random numbers are too important to be left to chance, let's generate some![/url]
    Learn to understand recursive CTEs by example.[/url]
    [url url=http://www.sqlservercentral.com/articles/St

  • Is /PAE present in BOOT.INI?

    No. Here's the file in case it might be helpful.

    [boot loader]

    timeout=30

    default=multi(0)disk(0)rdisk(0)partition(1)\WINDOWS

    [operating systems]

    multi(0)disk(0)rdisk(0)partition(1)\WINDOWS="Windows Server 2003, Enterprise" /noexecute=optout /fastdetect

    Can you post the output of this batch:

    select physical_memory_in_bytes, virtual_memory_in_bytes,bpool_committed, bpool_commit_target, bpool_visible

    from sys.dm_os_sys_info

    EXEC sp_configure 'show advanced options', 1

    RECONFIGURE

    EXEC sp_configure 'awe enabled'

    EXEC sp_configure 'max server memory'

    physical_memory_in_bytes: 17169162240

    virtual_memory_in_bytes: 2147352576

    bpool_committed: 1769472

    bpool_commit_target: 1769472

    bpool_visible: 181248

    name: awe enabled

    minimum: 0

    maximum: 1

    config_value: 1

    run_value: 1

    name: max server memory (MB)

    minimum: 16

    maximum: 2147483647

    config_value: 16000

    run_value: 16000

    The paper you really need to read is:

    A Guided Tour to Approximate String Matching

    GONZALO NAVARRO

    Perhaps this article will help you: Fuzzy-String Search: Find misspelled information with T-SQL

    Thanks for the references and suggestions. I'm comparing so many strings that I probably won't be able to simply use a Levenstein algorithm.

  • /PAE is needed to be able to use AWE. However, not always. Books Online says:

    In Windows Server 2003, PAE is automatically enabled only if the server is using hot-add memory devices. In this case, you do not have to use the /pae parameter on a system that is configured to use hot-add memory devices. In all other cases, you must use the /pae parameter in the Boot.ini file to take advantage of memory over 4 GB.

    But maybe your server is configured for that. Because all other output indicates that everything is OK. Just for a check, can you post the the first five minutes of the current ERRORLOG for this instance?

    This CLR dll you are having problem, from where did you get it? Is it something you can share? Also source code?

    [font="Times New Roman"]Erland Sommarskog, SQL Server MVP, www.sommarskog.se[/font]

  • Here's the first 5 mins of my current log:

    2013-07-29 07:59:04.53 Server Microsoft SQL Server 2005 - 9.00.1399.06 (Intel X86)

    Oct 14 2005 00:33:37

    Copyright (c) 1988-2005 Microsoft Corporation

    Express Edition on Windows NT 5.1 (Build 2600: Service Pack 3)

    2013-07-29 07:59:04.53 Server (c) 2005 Microsoft Corporation.

    2013-07-29 07:59:04.53 Server All rights reserved.

    2013-07-29 07:59:04.53 Server Server process ID is 400.

    2013-07-29 07:59:04.53 Server Logging SQL Server messages in file 'c:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\LOG\ERRORLOG'.

    2013-07-29 07:59:04.53 Server This instance of SQL Server last reported using a process ID of 240 at 7/26/2013 3:02:46 PM (local) 7/26/2013 7:02:46 PM (UTC). This is an informational message only; no user action is required.

    2013-07-29 07:59:04.53 Server Registry startup parameters:

    2013-07-29 07:59:04.53 Server -d c:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\DATA\master.mdf

    2013-07-29 07:59:04.53 Server -e c:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\LOG\ERRORLOG

    2013-07-29 07:59:04.53 Server -l c:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\DATA\mastlog.ldf

    2013-07-29 07:59:04.79 Server SQL Server is starting at normal priority base (=7). This is an informational message only. No user action is required.

    2013-07-29 07:59:04.79 Server Detected 2 CPUs. This is an informational message; no user action is required.

    2013-07-29 07:59:07.07 Server Using dynamic lock allocation. Initial allocation of 2500 Lock blocks and 5000 Lock Owner blocks per node. This is an informational message only. No user action is required.

    2013-07-29 07:59:07.21 Server Database Mirroring Transport is disabled in the endpoint configuration.

    2013-07-29 07:59:07.24 spid4s Starting up database 'master'.

    2013-07-29 07:59:07.42 spid4s Recovery is writing a checkpoint in database 'master' (1). This is an informational message only. No user action is required.

    2013-07-29 07:59:07.65 spid4s SQL Trace ID 1 was started by login "sa".

    2013-07-29 07:59:07.76 spid4s Starting up database 'mssqlsystemresource'.

    2013-07-29 07:59:08.67 spid8s Starting up database 'model'.

    2013-07-29 07:59:08.68 spid4s Server name is 'HNA-002297\SQLEXPRESS'. This is an informational message only. No user action is required.

    2013-07-29 07:59:08.70 spid4s Starting up database 'msdb'.

    2013-07-29 07:59:08.79 Server Error: 17190, Severity: 16, State: 1.

    2013-07-29 07:59:08.79 Server FallBack certificate initialization failed with error code: 1.

    2013-07-29 07:59:08.79 Server Warning:Encryption is not available, could not find a valid certificate to load.

    2013-07-29 07:59:08.79 Server Server local connection provider is ready to accept connection on [ \\.\pipe\SQLLocal\SQLEXPRESS ].

    2013-07-29 07:59:08.79 Server Server local connection provider is ready to accept connection on [ \\.\pipe\MSSQL$SQLEXPRESS\sql\query ].

    2013-07-29 07:59:08.79 Server Dedicated administrator connection support was not started because it is not available on this edition of SQL Server. This is an informational message only. No user action is required.

    2013-07-29 07:59:08.82 Server SQL Server is now ready for client connections. This is an informational message; no user action is required.

    2013-07-29 07:59:09.43 spid8s Clearing tempdb database.

    I don't think this specific dll is the problem. I say this because it worked for a few days until all of the CLR functions stopped working. I've tried implementing the same functionality, but I get the same error when I try to install the dll on the server. I can't even install a quick one-liner function. Here's the dll I'm looking to use: http://anastasiosyal.com/post/2009/01/11/Beyond-SoundEx-Functions-for-Fuzzy-Searching-in-MS-SQL-Server. It doesn't give the C# source, just the dll unfortunetly.

    Thanks again for your help.

  • The log you posted, disagrees with the output you posted from "SELECT @@version":

    Microsoft SQL Server 2005 - 9.00.5069.00 (Intel X86) Aug 22 2012 16:01:52 Copyright (c) 1988-2005 Microsoft Corporation Standard Edition on Windows NT 5.2 (Build 3790: Service Pack 2)

    The log file you posted is something else:

    2013-07-29 07:59:04.53 Server Microsoft SQL Server 2005 - 9.00.1399.06 (Intel X86)

    Oct 14 2005 00:33:37

    Copyright (c) 1988-2005 Microsoft Corporation

    Express Edition on Windows NT 5.1 (Build 2600: Service Pack 3)

    Maybe that's a local Express instance running on your local machine. (And to which you have not applied any service packs!).

    I was of course interesting in the errorlog for the SQL Server where you have problem with running the CLR.

    If this is happening to any CLR module, maybe your DBA should try restarting SQL Server?

    [font="Times New Roman"]Erland Sommarskog, SQL Server MVP, www.sommarskog.se[/font]

  • Erland Sommarskog (7/29/2013)


    The log you posted, disagrees with the output you posted from "SELECT @@version":

    Microsoft SQL Server 2005 - 9.00.5069.00 (Intel X86) Aug 22 2012 16:01:52 Copyright (c) 1988-2005 Microsoft Corporation Standard Edition on Windows NT 5.2 (Build 3790: Service Pack 2)

    The log file you posted is something else:

    2013-07-29 07:59:04.53 Server Microsoft SQL Server 2005 - 9.00.1399.06 (Intel X86)

    Oct 14 2005 00:33:37

    Copyright (c) 1988-2005 Microsoft Corporation

    Express Edition on Windows NT 5.1 (Build 2600: Service Pack 3)

    Maybe that's a local Express instance running on your local machine. (And to which you have not applied any service packs!).

    I was of course interesting in the errorlog for the SQL Server where you have problem with running the CLR.

    Oops, of course. I wasn't thinking and grabbed the log file off my laptop...

    Here's the first 5 mins of the current log file on my server:

    2013-07-14 05:50:55.54 Server Microsoft SQL Server 2005 - 9.00.5069.00 (Intel X86)

    Aug 22 2012 16:01:52

    Copyright (c) 1988-2005 Microsoft Corporation

    Standard Edition on Windows NT 5.2 (Build 3790: Service Pack 2)

    2013-07-14 05:50:55.62 Server (c) 2005 Microsoft Corporation.

    2013-07-14 05:50:55.65 Server All rights reserved.

    2013-07-14 05:50:55.67 Server Server process ID is 2936.

    2013-07-14 05:50:55.70 Server Authentication mode is MIXED.

    2013-07-14 05:50:55.71 Server Logging SQL Server messages in file 'C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\LOG\ERRORLOG'.

    2013-07-14 05:50:55.78 Server This instance of SQL Server last reported using a process ID of 2940 at 7/14/2013 5:47:19 AM (local) 7/14/2013 9:47:19 AM (UTC). This is an informational message only; no user action is required.

    2013-07-14 05:50:55.82 Server Registry startup parameters:

    2013-07-14 05:50:55.84 Server -d C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\DATA\master.mdf

    2013-07-14 05:50:55.85 Server -e C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\LOG\ERRORLOG

    2013-07-14 05:50:55.85 Server -l C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\DATA\mastlog.ldf

    2013-07-14 05:50:56.64 Server SQL Server is starting at normal priority base (=7). This is an informational message only. No user action is required.

    2013-07-14 05:50:56.65 Server Detected 16 CPUs. This is an informational message; no user action is required.

    2013-07-14 05:50:58.01 Server Address Windowing Extensions is enabled. This is an informational message only; no user action is required.

    If this is happening to any CLR module, maybe your DBA should try restarting SQL Server?

    Indeed. We've restarted the server, but it doesn't fix the problem. Everything looks good, besides the fact that we can't use or install CLR functions.

  • AWE is definitely enabled.

    Adding the startup option -g512 should not be a problem, and I recommend that you try that. This is a quite common thing to do.

    When this problem started to occur, had their been any changes to the server, like installing some new software or extended stored procedure?

    Another story is that your entire server is somewhat antiquated. It's running Windows 2003 which has gone out of support, and SQL 2005 which is in extended support. And you are running 32-bit SQL Server. Wity 64-bit SQL Server, SQL 2005 or a later version, it is unlikely that you would have this problem.

    [font="Times New Roman"]Erland Sommarskog, SQL Server MVP, www.sommarskog.se[/font]

Viewing 15 posts - 1 through 15 (of 15 total)

You must be logged in to reply to this topic. Login to reply