Click here to monitor SSC
SQLServerCentral is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
 
 
 
        
Home       Members    Calendar    Who's On


Add to briefcase 1234»»»

Sound Matching and a Phonetic Toolkit Expand / Collapse
Author
Message
Posted Monday, July 11, 2005 12:53 PM


Ten Centuries

Ten CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen Centuries

Group: General Forum Members
Last Login: Monday, September 15, 2014 5:20 PM
Points: 1,276, Visits: 1,134
Comments posted to this topic are about the content posted at http://www.sqlservercentral.com/columnists/mcoles/soundmatchingandaphonetictoolkit.asp
Post #199530
Posted Friday, July 29, 2005 4:19 PM


Ten Centuries

Ten CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen Centuries

Group: General Forum Members
Last Login: Monday, September 15, 2014 5:20 PM
Points: 1,276, Visits: 1,134

Note that some Windows 2003 systems might require modification to the ADD_XP.SQL installation script.  The following is a script change that might need to be made for some Windows 2003 systems.

Change:

EXEC sp_addextendedproc xp_nysiis, 'xp_nysiis.dll'
GO
EXEC sp_addextendedproc xp_levenshtein, 'xp_levenshtein.dll'

To:

EXEC sp_addextendedproc xp_nysiis, 'C:\Program Files\Microsoft SQL Server\MSSQL\BINN\xp_nysiis.dll'
GO
EXEC sp_addextendedproc xp_levenshtein, 'C:\Program Files\Microsoft SQL Server\MSSQL\BINN\xp_levenshtein.dll'

The directory used above is the default directory for C: drive installations.  Replace the directory with your appropriate SQL Server directory in the lines above, as necessary.

Also, it may be necessary to download the two Microsoft redistributable DLL files and copy them to your MSSQL\BINN directory on some Windows 2003 systems.  These files are OPENDS60.DLL and MSVCR71.DLL.  They can be downloaded as part of the SQL Encryption Toolkit here.

Post #205811
Posted Wednesday, August 3, 2005 2:02 AM
SSC-Enthusiastic

SSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-Enthusiastic

Group: General Forum Members
Last Login: Friday, December 9, 2005 10:22 AM
Points: 102, Visits: 1

If you are interested in phonetic matching then try http://aspell.net/metaphone/

If memory serves me correctly Lawrence Philips went on to work for Verity, the search engine people

Post #206724
Posted Wednesday, August 3, 2005 2:57 AM
Forum Newbie

Forum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum Newbie

Group: General Forum Members
Last Login: Thursday, February 7, 2008 5:00 PM
Points: 4, Visits: 3

Good to see this being discussed. Sound and speech are only going to become more important as we move forward.

Need to tie the discussion in with English Query and the new variants of this too when possible.

Cheers

Russell




Russell K Darroch
Post #206744
Posted Wednesday, August 3, 2005 7:29 AM


Ten Centuries

Ten CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen Centuries

Group: General Forum Members
Last Login: Monday, September 15, 2014 5:20 PM
Points: 1,276, Visits: 1,134
There's a "compact" double-metaphone implementation over at The Code Project.
Post #206855
Posted Wednesday, August 3, 2005 8:09 AM


Ten Centuries

Ten CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen Centuries

Group: General Forum Members
Last Login: Monday, September 15, 2014 5:20 PM
Points: 1,276, Visits: 1,134

Thanks for the feedback!  There is, of course, always more that can be added to most discussions   I've used the English Query Automation API to develop interfaces that perform "English"(-like) queries, but haven't really used it to correct user typographical errors or to phonetically match surnames.

I'd love some more information on this, if you have some links.  Thanks!

Post #206879
Posted Wednesday, August 3, 2005 11:00 AM
Grasshopper

GrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopper

Group: General Forum Members
Last Login: Wednesday, August 30, 2006 11:06 AM
Points: 11, Visits: 1

Thanks for the great article and toolkit. Works like a charm.

I just wanted to share my experience, maybe someone else will make use of it .

I have an average sized table, about 40,000 rows, and faced a performance issue while trying to use udf_NYSIIS. The query that takes NO time with soundex took like 12 seconds with udf_NYSIIS.

So I tried a workaround, and added the soundex to my where clause in front of the udf_NYSIIS clause with an AND...

This way, the lazy SQL processor checked soundex first, and if that one passes, then checked the udf_NYSIIS.

I was getting the same result again in NO time.

Try for yourself:

declare @myname varchar(255)
declare @mydex varchar(4)
declare @mydex2 varchar(10)
set @myname = 'smit' -- your criteria goes here
set @mydex = soundex(@myname)
set @mydex2 =  master.dbo.udf_NYSIIS(@myname)

select * from customers where soundex(lastname)=@mydex

select * from customers where master.dbo.udf_NYSIIS(lastname)=@mydex2

select * from customers where soundex(lastname)=@mydex and master.dbo.udf_NYSIIS(lastname)=@mydex2

Thanks again.

Duray AKAR

Post #206967
Posted Wednesday, August 3, 2005 12:08 PM


Ten Centuries

Ten CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen Centuries

Group: General Forum Members
Last Login: Monday, September 15, 2014 5:20 PM
Points: 1,276, Visits: 1,134

Thanks for the feedback!  I'm glad you found it useful.  UDF's in the WHERE clause do tend to slow down your SELECT statements since they have to perform a row-by-row calculation/comparison to return your final result.  This is especially true in this instance, since SOUNDEX is a very simple and straightforward algorithm that involves encoding a single character at a time; the NYSIIS algorithm is more complex, as it requires several multi-character position-dependent encodings as well as single-character encodings.

One method that I've used on a table I work with (about 19,000,000 rows and growing) is to generate the SOUNDEX and NYSIIS values for the last name at row insertion time.  This gives the advantage of eliminating the redundant calculations/conversions at SELECT time, and allows me to take advantage of indexes on the lastname_SOUNDEX and lastname_NYSIIS columns.  The down side is that it slows down row INSERTs (not significantly on my current database however), and it takes up more storage space for the extra columns and indexes.

Additionally, if you find that you are running across a lot of people with the same last name, it might make sense to move the SOUNDEX and NYSIIS encoded values to a separate table using the lastname as the Primary Key, to eliminate a lot of duplicate entries.

Thanks again.

Post #207019
Posted Wednesday, August 3, 2005 1:05 PM
Grasshopper

GrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopper

Group: General Forum Members
Last Login: Wednesday, August 30, 2006 11:06 AM
Points: 11, Visits: 1

Another quick note:

The user defined function udf_levenshtein is not symmetric. In other words,
udf_levenshtein(a,b) is not always equal to udf_levenshtein(b,a)

You might wonder "so what, why does it matter?"

But if I use it to order my results, and if I have two fields to weight, such as bl_lastname_str and bl_company_str, SQL Server will not allow me to use it as below:

order by dbo.udf_levenshtein(@myname, isnull(bl_lastname_str,''))
 , bl_lastname_str
 , dbo.udf_levenshtein(@myname, isnull(bl_company_str,''))
 , bl_company_str

So I have to use :

order by dbo.udf_levenshtein(isnull(bl_lastname_str,''), @myname)
 , bl_lastname_str
 , dbo.udf_levenshtein(isnull(bl_company_str,''), @myname)
 , bl_company_str

which in fact, produces different results.

I know that this is a general problem with user defined functions in SQL server. SQL server tends to optimize the function results in the query with only the value of the FIRST parameter, so you have to change the value of your FIRST parameter... Otherwise, both function calls will be considered to return the same value... Quick example, lets say you have a udfBinAnd that takes two int parameters a and b, and calculates a & b (Binary and) ... If you use it in an SQL query,  it will consider
udfBinAnd(1,2) = udfBinAnd(1,3).
Pretty SCARY thing, so watch out...

That might even be considered an SQL Server "BUG", but I will take it as a "caveat of performance" And I won't go further with the storage optimization of consecutive bit fields and their negative effects on the group by queries, because it is totally out of context.

But again, I thought that might be some useful info for other people, or maybe even trigger some workarounds to make the function symmetric...

Thanks

Duray AKAR

Post #207066
Posted Wednesday, August 3, 2005 2:53 PM


Ten Centuries

Ten CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen Centuries

Group: General Forum Members
Last Login: Monday, September 15, 2014 5:20 PM
Points: 1,276, Visits: 1,134

Thanks for the feedback.  Do you have some examples of that issue you can share?  A couple of strings that are returning different results dependent on the order?  One workaround I can think of might be to modify the UDF to always submit the shorter string first and the longer string second to the XP.  If you can post a couple of examples I'll see what I can come up with.

You can also calculate a Levenshtein Edit Distance "score".  Basically you take the calculated Edit Distance and divide by the longer of the two strings (be sure to CAST one of the operands as a floating point type).  The result is a "score" that estimates how closely the two strings match.  You might then set a threshold of returning only strings that match with a score of 80% (0.80) or greater.

As a matter of fact, I just thought of something else - I've had issues using UDF's in an ORDER BY clause before.  Not sure if it's documented as an issue on the MS website, but I have encountered that very issue.  I'll look into that as well.

Thanks again!

Post #207108
« Prev Topic | Next Topic »

Add to briefcase 1234»»»

Permissions Expand / Collapse