Problem in finding duplicate records using CTE

  • Hi All,

    I have a situation to find the duplicate records in a table using the comparing columns. The comparing columns can have NULL values.

    Sctipt to create the table

    Create Table DupTest

    (

    OID bigint null,

    Test VARCHAR(46) null,

    TestColumn int null,

    TestString VARCHAR(50) null,

    IFFRN Bigint Identity(1,1)

    )

    Insert Script

    Insert DupTest values

    (1,'sdfsd',10,NULL),

    (2,NULL,11,'test2'),

    (1,'sdfsd',NULL,''),

    (1,'sdfsd',NULL,''),

    (1,'sdfsd',10,NULL),

    (1,'sdfsd',NULL,NULL)

    Select * from DupTest

    OIDTestTestColumnTestStringIFFRN

    [highlight="#FF99FF"]

    1sdfsd10 NULL 1[/highlight]

    2NULL11 test2 2[highlight="#FFFF00"]

    1sdfsdNULL 3

    1sdfsdNULL 4[/highlight][highlight="#FF99FF"]

    1sdfsd10 NULL 5[/highlight]

    1sdfsdNULL NULL 6

    The Rows highlighted are the duplicate rows.

    I have used the CTE with Row_Number() rank function to find the duplicate rows.

    ;WITH myCTE AS (

    SELECT

    oid,Test,TestColumn,TestString,IFFRN,

    ROW_NUMBER() OVER (PARTITION BY oid,Test,TestColumn,TestString ORDER BY IFFRN) RowID

    FROM DupTest

    )

    SELECT * FROM myCTE

    oidTestTestColumnTestStringIFFRNRowID

    1sdfsdNULL NULL 6 1[highlight="#FFFF00"]

    1sdfsdNULL 3 1

    1sdfsdNULL 4 2[/highlight][highlight="#FF99FF"]

    1sdfsd10 NULL 1 1

    1sdfsd10 NULL 5 2[/highlight]

    2NULL11 test2 2 1

    In the above result set I want only the highlighted set of Rows.

    I need the result as below.. I need only the duplicated set of rows.

    Please help me with the respective query.

    IFFRNRowID

    31

    42

    11

    52

    [font="Verdana"]Regards,
    Rals
    [/font].
  • try using "count(*) over (partition by...."



    Clear Sky SQL
    My Blog[/url]

  • Thanks dave... its working but in my real time scenario where i have a millon of records to compare count(*) over (paritition...

    is taking very long time to return the result compare to ROW_NUMBER() function..

    [font="Verdana"]Regards,
    Rals
    [/font].
  • With proper indexing , it shouldnt be to bad

    Alternatively comparing the min and max values can give you the same results

    Create table BigTable

    (

    id integer identity,

    RId integer not null

    )

    go

    create index idxBigTable on BigTable (Rid) include(id)

    go

    insert into bigtable(rid)

    Select ABS(checksum(newid())) %10000

    from sys.columns a cross join sys.columns b

    go

    with cteCount

    as(

    select id,COUNT(*) over (partition by rid) as counter

    from bigtable

    )

    select *

    from cteCount

    where counter> 1

    go

    go

    select rid

    from bigtable

    group by rid

    having MIN(id) <> MAX(id)



    Clear Sky SQL
    My Blog[/url]

  • Maybe this could work (with the right index)... though i doubt it. Should probably be worse then the count version. But your the one with the data to test 🙂

    ;WITH myCTE AS (

    SELECT

    oid,Test,TestColumn,TestString,IFFRN,

    RANK() OVER (PARTITION BY oid,Test,TestColumn,TestString ORDER BY IFFRN) RowID

    FROM DupTest

    )

    select dt.*

    from DupTest dt

    join (select * from myCTE where RowId > 1) c

    on c.oid = dt.oid

    and (c.Test = dt.Test or (c.Test is null and dt.Test is null))

    and (c.TestColumn = dt.TestColumn or (c.TestColumn is null and dt.TestColumn is null))

    and (c.TestString = dt.TestString or (c.TestString is null and dt.TestString is null))

    /T

  • Hi,

    May be you can try this.

    WITH myCTE AS (

    SELECT

    oid,Test,TestColumn,TestString,IFFRN,

    ROW_NUMBER() OVER (PARTITION BY oid,ISNULL(Test,'X'),ISNULL(TestColumn,-12),ISNULL(TestString,'X') ORDER BY IFFRN) RowID

    FROM DupTest

    )

    SELECT * FROM myCTE

  • One alternative is the old school way: Do a summary query.

    GROUP BY all the columns used to determine a duplicate . Store all results HAVING COUNT(*) > 1 in a table variable or #temp table whose clustered index consists of the "duplicate" columns. Make a second pass through your detail table joining it to the summary results table on all the "duplicate" columns.

    What column(s) are used for your clustered index on the production table?

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • Another one to try:-)

    ;WITH cte AS

    (

    SELECT ROW_NUMBER() OVER (PARTITION BY oid, Test, TestColumn, TestString ORDER BY IFFRN) AS RowAsc,

    ROW_NUMBER() OVER (PARTITION BY oid, Test, TestColumn, TestString ORDER BY IFFRN DESC) AS RowDesc,

    oid, Test, TestColumn, TestString, IFFRN

    FROM DupTest

    )

    SELECT oid, Test, TestColumn, TestString, IFFRN

    FROM cte

    WHERE NOT (RowAsc = RowDesc AND RowAsc = 1 AND RowDesc = 1)

  • Thanks for all your ideas and Suggestions.

    Hi Dixie,

    In the production the query is running on a view which has again joined using many other views. It has 4 views and 12 tables.

    We are currently reviewing the design approach and possiblities of tuning the joins used in the views.

    Thanks

  • If some of the views reference the same tables you may want to consider rewriting the query entirely to run without views.

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

Viewing 10 posts - 1 through 9 (of 9 total)

You must be logged in to reply to this topic. Login to reply