get a list of similar strings

  • I have a list of organizations that where not defined as single absolute key. 

    So what happened is that different users enter the same organizations in various names .

     

    For example

    Burlington the American coat factory

    Burlington-coats

    Burlington.com

     

    Gap online

    Gap-online

    Gap.com

    The gap

     

    The American colony hotel

    American colony

    American-colony hotel

     

    I need to make a list of all the similar name so I can get rid of all the duplicates 

    In general the main part of the name is in the first 10 characters of the string

     

    I thought of using 

    PATINDEX(left(Org_Name,10))

     

    But I need to get all the groups of similar names

    I think I need to use a courser so it can go over all the projects and compare them

    I have no idea of how to group them up

    Dos any body have any code to offer to solve this Problem

    .

     v

  • You will need to experiment for a few things and go from there. SOUNDEX can help as well.

    I would first look to remove common things instead of matching up. So remove "THE, A, AN, HOTEL, etc" from the fields, probably using some other table where you will the PK and this field. Then use a few passes of PATINDEX or SOUNDEX to gather some results, maybe store them in a temp table or table var.

  • SOUNDEX and DIFFERENCE will work with something like this:

    declare @Customer Table (customerID int identity(1,1), customername varchar(50))

    insert @Customer(customername) values ('Gap Online')

    insert @Customer(customername) values ('Gap-Online')

    insert @Customer(customername) values ('American Large Colony Hotl')

    insert @Customer(customername) values ('American Colony Hotel')

    select c1.CustomerName, c2.CustomerName, DIFFERENCE(c1.CustomerName, c2.CustomerName), SOUNDEX(c1.CustomerName), SOUNDEX(c2.CustomerName)

    from @Customer c1

    inner join @Customer c2

    on DIFFERENCE(c1.CustomerName, c2.CustomerName) = 4 and c1.customerID != c2.customerID

  • First I would recomend cleaning out "extra" chars like ".,-:;" etc.  Secondly I would remove any common parts of company names or at the very least make sure that they match for example: Inc. & Incorporated, Co & Company.

    After these activities try direct matching (COL1 = COL1) of the remaining string. After the direct matches have been identified, try using Soundex-Difference function.

    There is another really good algorithm for comparisons of string called "Levenshtein distance" http://en.wikipedia.org/wiki/Levenshtein_distance . I have used this for comparisons and matching before.

    Richard Ozenbaugh (Xerisoft.com)

  •  

    Create Temp table of same structure

    then compare the first 10 letters of the same field

    in both the tables.

     

     

  • The matching of similar strings is more complex than just comparing the first 10 characters.  By just comparing the first 10 characters, 'AAA Int'l' won't match 'AAA International' but they are the same company.

    As has been previously posted, you should remove short words such as the, a, an, or and then look for matches.  There is no exact rule and you will have to tweak whatever method you chose to increase the number of matches balanced against an increased number of mismatches.

Viewing 6 posts - 1 through 5 (of 5 total)

You must be logged in to reply to this topic. Login to reply