Finding similar values in a column

  • I'm trying to find values in our account table where the account names are similiar. I've already found the exact duplicates but now I'm trying to find like values i.e. Thompson Tools and Thompson Tools, Inc.

    I've tried several different ways and nothing seems to work. Here are 2 ways I've tried and no success:

    select a1.account, a2.account

    from sysdba.ACCOUNT a1

    INNER JOIN sysdba.ACCOUNT a2

    on a1.accountid = a2.accountid

    Where substring(a1.account,1,6) like substring(a2.account ,1,6)

    Group by a1.account, a2.account

    Having count(*) > 1

    This is bringing back only the exact matches and not the like values.

    I've also tried something like this:

    select a1.account, a2.account

    from sysdba.ACCOUNT a1

    INNER JOIN sysdba.ACCOUNT a2

    on a1.accountid = a2.accountid

    Where a1.account like substring(a2.account ,1,6)

    Group by a1.account, a2.account

    This brings back all the same records just showing the substring value from both fields.

    I've tried using a subselect but the same issue as above.

    If anyone has any idea what I'm missing, I would appreciate the help.

  • Maybe you can use SOUNDEX or DIFFERENCE to find similar rows. But I guess you'd manually need to verify the results, since there is no logic know of that can tell you whether Mayer's Inc. is the same company as Maier's Inc. or Meyer's. 😉

    I'm not sure how your table structure looks like but I hope you don't have only the name in there. Maybe you can use SOUNDEX together with the zip code...



    Lutz
    A pessimist is an optimist with experience.

    How to get fast answers to your question[/url]
    How to post performance related questions[/url]
    Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

  • another thing I've done is use a Tally table to split the company name into single words....then try to match the single word against another copy of the table;

    that way, one of the searches would compare 'Thompson' against all columns, which might help me with duplicates.

    here's an example, you could use charindex instead of like as well to test for matches.

    select

    MyQuery.account,

    a1.account

    from sysdba.ACCOUNT a1

    INNER JOIN (select myAlias.item as account

    from sysdba.ACCOUNT

    CROSS APPLY [fn_split](account,' ') myAlias --space as a delimiter

    ) MyQuery

    on a1.accountid like '%' + MyQuery.account + '%'

    WHERE myAlias.item NOT IN ('CO.','CORP','CORP.','INC')

    GO

    create function [dbo].[fn_split](

    @STR varchar(8000),

    @spliter char(1)

    )

    returns @returnTable table (idx int primary key identity, item varchar(8000))

    as

    begin

    declare @spliterIndex int

    select @STR = @STR + @spliter

    SELECT @STR = @spliter + @STR + @spliter

    INSERT @returnTable

    SELECT SUBSTRING(@str,N+1,CHARINDEX(@spliter,@str,N+1)-N-1)

    FROM dbo.Tally

    WHERE N < LEN(@str)

    AND SUBSTRING(@str,N,1) = @spliter

    ORDER BY N

    return

    end

    Lowell


    --help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

  • Thanks! I'll give both of those a try.

Viewing 4 posts - 1 through 3 (of 3 total)

You must be logged in to reply to this topic. Login to reply