Find The Similar Matching String

  • Dear,

    I have a customer table containing many duplicate names and custid is the unique key. But the customer names are not 100% similar. For example,

    CustIDCustName

    --------------------

    100ABC CO

    101ABC CO.

    102ABC CO&

    103ABC CO,NY

    Here all the custname refers to similar customer. Now I wanna delete the duplicate names. I need to search custnames which are 80% similar and delete those customers.

    Please help me to do this.

    Regards,

    Akbar

  • shohelr2003 (7/13/2013)


    Dear,

    I have a customer table containing many duplicate names and custid is the unique key. But the customer names are not 100% similar. For example,

    CustIDCustName

    --------------------

    100ABC CO

    101ABC CO.

    102ABC CO&

    103ABC CO,NY

    Here all the custname refers to similar customer. Now I wanna delete the duplicate names. I need to search custnames which are 80% similar and delete those customers.

    Please help me to do this.

    Regards,

    Akbar

    I have a suspicion that you have over simplified your data,,,,,,does the company name always come first as you have described?

    do you have any other customer related columns ( telephone/zipcode/address etc) that could help with mapping duplicates?

    what is the scale of the problem,,,,ie no of rows?

    how do you define 80% match?

    regards

    ________________________________________________________________
    you can lead a user to data....but you cannot make them think
    and remember....every day is a school day

  • This is a non-trivial problem, and a T-SQL only solution is impossible. If you are on SQL 2012, you should have a look at Data Quality Services. If you are on SQL 2008, the Data Profiling task in SSIS can help. (But then again, that is when you load the data, not when already have in the table).

    [font="Times New Roman"]Erland Sommarskog, SQL Server MVP, www.sommarskog.se[/font]

  • See article by Dwain Camps 2012/11/29

    http://www.sqlservercentral.com/articles/String+Manipulation/94365/[/url]

    His article on pattern matching may provide a solution. Based on your limited example, I came up with this

    script that works on that limited set (the pattern split function is posted below). I can't guarantee that

    it's a universal solution to the problem.

    Sample data

    IF OBJECT_ID('tempdb..#TempTable') IS NOT NULL

    DROP TABLE #TempTable

    CREATE TABLE #TempTable

    (

    ID INT IDENTITY(1,1) NOT NULL

    ,CustID INT NOT NULL

    ,CustName VARCHAR(1000) NULL

    ,PRIMARY KEY CLUSTERED (CustID)

    )

    ;WITH cteSampleData (CustID,CustName)

    AS

    (

    SELECT 100,'ABC CO' UNION ALL

    SELECT 101,'ABC CO.' UNION ALL

    SELECT 102,'ABC CO&' UNION ALL

    SELECT 103,'ABC CO,NY' UNION ALL

    SELECT 104,'XYZ Corp' UNION ALL

    SELECT 105,'XYZ Corporation' UNION ALL

    SELECT 106,'XYZ Corp CA'

    )

    INSERT INTO #TempTable

    (CustID,CustName)

    SELECT

    CustID,

    CustName

    FROM

    cteSampleData

    Script (requires function PatternSplitCM...see below)

    SELECT

    CustID

    ,BaseName

    ,Dupe

    FROM

    (

    SELECT

    ROW_NUMBER() OVER (PARTITION BY ps.Item ORDER BY t.ID) AS rowNum

    ,MAX(LEN(t.CustName)) OVER (PARTITION BY t.CustName) AS maxLen

    ,t.CustID

    ,ps.Item AS BaseName

    ,t.CustName as Dupe

    FROM

    #TempTable AS t

    CROSS APPLY

    dbo.PatternSplitCM(t.CustName,'[A-Za-z0-9]') AS ps

    WHERE

    ps.ItemNumber = 1

    ) r

    WHERE

    rowNum > 1

    AND LEN(Dupe) >= maxLen

    Output:

    CustIDBaseNameDupe

    102ABCABC CO&

    103ABCABC CO,NY

    101ABCABC CO.

    106XYZXYZ Corp CA

    105XYZXYZ Corporation

    /* See article by Dwain Camps 2012/11/29 */

    /* http://www.sqlservercentral.com/articles/String+Manipulation/94365/ */

    CREATE FUNCTION [dbo].[PatternSplitCM] (@List VARCHAR(8000) = NULL, @Pattern VARCHAR(50))

    RETURNS TABLE

    WITH SCHEMABINDING

    AS

    RETURN

    WITH numbers AS (

    SELECT TOP (ISNULL(DATALENGTH(@List), 0)) n = ROW_NUMBER() OVER (

    ORDER BY (

    SELECT NULL

    )

    )

    FROM (

    VALUES (0), (0), (0), (0), (0), (0), (0), (0), (0), (0)

    ) d(n), (

    VALUES (0), (0), (0), (0), (0), (0), (0), (0), (0), (0)

    ) e(n), (

    VALUES (0), (0), (0), (0), (0), (0), (0), (0), (0), (0)

    ) f(n), (

    VALUES (0), (0), (0), (0), (0), (0), (0), (0), (0), (0)

    ) g(n)

    )

    SELECT ItemNumber = ROW_NUMBER() OVER (

    ORDER BY MIN(n)

    ), Item = SUBSTRING(@List, MIN(n), 1 + MAX(n) - MIN(n)), [Matched]

    FROM (

    SELECT n, y.[Matched], Grouper = n - ROW_NUMBER() OVER (

    ORDER BY y.[Matched], n

    )

    FROM numbers

    CROSS APPLY (

    SELECT [Matched] = CASE

    WHEN SUBSTRING(@List, n, 1) LIKE @Pattern

    THEN 1

    ELSE 0

    END

    ) y

    ) d

    GROUP BY [Matched], Grouper

    GO

Viewing 4 posts - 1 through 3 (of 3 total)

You must be logged in to reply to this topic. Login to reply