get a list of similar strings

Question

get a list of similar strings

shani Glatt

Grasshopper

Points: 17
More actions
February 15, 2006 at 6:25 am

#93270

I have a list of organizations that where not defined as single absolute key.
So what happened is that different users enter the same organizations in various names .

For example
Burlington the American coat factory
Burlington-coats
Burlington.com

Gap online
Gap-online
Gap.com
The gap

The American colony hotel
American colony
American-colony hotel

I need to make a list of all the similar name so I can get rid of all the duplicates
In general the main part of the name is in the first 10 characters of the string

I thought of using
PATINDEX(left(Org_Name,10))

But I need to get all the groups of similar names
I think I need to use a courser so it can go over all the projects and compare them
I have no idea of how to group them up
Dos any body have any code to offer to solve this Problem
.
v

Viewing 6 posts - 1 through 6 (of 6 total)

You must be logged in to reply to this topic. Login to reply

Steve Jones - SSC Editor SSC Guru Points: 738347 More actions · Answer 1

You will need to experiment for a few things and go from there. SOUNDEX can help as well.

I would first look to remove common things instead of matching up. So remove "THE, A, AN, HOTEL, etc" from the fields, probably using some other table where you will the PK and this field. Then use a few passes of PATINDEX or SOUNDEX to gather some results, maybe store them in a temp table or table var.

JeffB Hall of Fame Points: 3541 More actions · Answer 2

SOUNDEX and DIFFERENCE will work with something like this:

declare @Customer Table (customerID int identity(1,1), customername varchar(50))

insert @Customer(customername) values ('Gap Online')

insert @Customer(customername) values ('Gap-Online')

insert @Customer(customername) values ('American Large Colony Hotl')

insert @Customer(customername) values ('American Colony Hotel')

select c1.CustomerName, c2.CustomerName, DIFFERENCE(c1.CustomerName, c2.CustomerName), SOUNDEX(c1.CustomerName), SOUNDEX(c2.CustomerName)

from @Customer c1

inner join @Customer c2

on DIFFERENCE(c1.CustomerName, c2.CustomerName) = 4 and c1.customerID != c2.customerID

Richard Ozenbaugh SSC Enthusiast Points: 174 More actions · Answer 3

First I would recomend cleaning out "extra" chars like ".,-:;" etc. Secondly I would remove any common parts of company names or at the very least make sure that they match for example: Inc. & Incorporated, Co & Company.

After these activities try direct matching (COL1 = COL1) of the remaining string. After the direct matches have been identified, try using Soundex-Difference function.

There is another really good algorithm for comparisons of string called "Levenshtein distance" http://en.wikipedia.org/wiki/Levenshtein_distance . I have used this for comparisons and matching before.

Richard Ozenbaugh (Xerisoft.com)

AMIT GUPTA-263376 SSCarpal Tunnel Points: 4244 More actions · Answer 4

Create Temp table of same structure

then compare the first 10 letters of the same field

in both the tables.

JeffB Hall of Fame Points: 3541 More actions · Answer 5

The matching of similar strings is more complex than just comparing the first 10 characters. By just comparing the first 10 characters, 'AAA Int'l' won't match 'AAA International' but they are the same company.

As has been previously posted, you should remove short words such as the, a, an, or and then look for matches. There is no exact rule and you will have to tweak whatever method you chose to increase the number of matches balanced against an increased number of mismatches.

get a list of similar strings

Cookies on SQLServerCentral