Removing only contiguous/adjacent duplicate records from a rowset

  • In SQL 2008, is there any way to remove only adjacent/contiguous duplicate records? A record should be deleted only if it immediately repeats the previous record.

    Ex:

    id  value  name
    --  -----  ----
    1   10     test
    2   5      prod
    3   10     test
    4   4      test
    5   4      test
    6   10     test

    Only the records with ids 4 and 5 should be deleted.

    Records with ids 1, 3 and 6 should be preserved because the (value, name) combination (10, test) is duplicated but not contiguous.

    Is there any way to achieve this result using TSQL?

  • This appears to do what you want:

    use tempdb
    go

    declare @testtable table (id int, value int, name char(4))
    insert into @testtable values
    (1, 10, 'test'), (2, 5, 'prod'), (3, 10, 'test'), (4, 4, 'test'), (5, 4, 'test'), (6, 10, 'test')

    select * from @testtable

    delete from @testtable
    WHERE id in (select t2.id
                 from @testtable t1
                 INNER JOIN @testtable t2 ON t1.id = t2.id - 1
                 INNER JOIN @testtable t3 ON t2.id = t3.id - 1
                 WHERE t1.name = t2.name and t2.name = t3.name)

    select * from @testtable

    Thomas Rushton
    blog: https://thelonedba.wordpress.com

  • ThomasRushton (10/24/2012)


    This appears to do what you want:

    use tempdb
    go

    declare @testtable table (id int, value int, name char(4))
    insert into @testtable values
    (1, 10, 'test'), (2, 5, 'prod'), (3, 10, 'test'), (4, 4, 'test'), (5, 4, 'test'), (6, 10, 'test')

    select * from @testtable

    delete from @testtable
    WHERE id in (select t2.id
                 from @testtable t1
                 INNER JOIN @testtable t2 ON t1.id = t2.id - 1
                 INNER JOIN @testtable t3 ON t2.id = t3.id - 1
                 WHERE t1.name = t2.name and t2.name = t3.name)

    select * from @testtable

    That's a lot of joins for a simple operation.

    Maybe try something like this instead?

    DELETE a
    FROM @testtable a
    WHERE EXISTS (SELECT 1
                  FROM @testtable b
                  WHERE (a.id = b.id+1 OR a.id = b.id-1)
                    AND a.NAME = b.NAME AND a.value = b.value);


    Forever trying to learn
    My blog - http://www.cadavre.co.uk/
    For better, quicker answers on T-SQL questions, click on the following...http://www.sqlservercentral.com/articles/Best+Practices/61537/
    For better, quicker answers on SQL Server performance related questions, click on the following...http://www.sqlservercentral.com/articles/SQLServerCentral/66909/

  • select from the table using the ROW_NUMBER() function ordering by id and partitioning by value.

    Wrap in brackets giving you a derived table.

    Run a delete command against the table, joining to the derived table on the id, where rownum > 1.

    That should sort you out.

    Eamon

  • EamonSQL (10/24/2012)


    select from the table using the ROW_NUMBER() function ordering by id and partitioning by value.

    Wrap in brackets giving you a derived table.

    Run a delete command against the table, joining to the derived table on the id, where rownum > 1.

    That should sort you out.

    Eamon

    Don't you mean a COUNT(*), and deleting where COUNT(*) > 1?

    "Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms." - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • EamonSQL (10/24/2012)


    select from the table using the ROW_NUMBER() function ordering by id and partitioning by value.

    Wrap in brackets giving you a derived table.

    Run a delete command against the table, joining to the derived table on the id, where rownum > 1.

    That should sort you out.

    Eamon

    The OP said: -

    praveen_vejandla (10/24/2012)


    id  value  name
    --  -----  ----
    1   10     test
    2   5      prod
    3   10     test
    4   4      test
    5   4      test
    6   10     test

    Only the records with ids 4 and 5 should be deleted.

    Here's some sample data to play with: -

    DECLARE @testtable TABLE (id INT, value INT, NAME CHAR(4));
    INSERT INTO @testtable
    VALUES (1, 10, 'test'), (2, 5, 'prod'), (3, 10, 'test'), (4, 4, 'test'), (5, 4, 'test'), (6, 10, 'test');

    So the expected result after you have deleted the bad rows is: -

    id          value       NAME
    ----------- ----------- ----
    1           10          test
    2           5           prod
    3           10          test
    6           10          test

    So, let's code up what you just described: -

    DELETE a
    FROM @testtable a
    INNER JOIN (SELECT id, value, NAME,
                       ROW_NUMBER() OVER(PARTITION BY value ORDER BY id) AS rownum
                FROM @testtable) b ON a.id = b.id
    WHERE b.rownum > 1;

    OK, now we'll run it against the sample data and the result is: -

    id          value       NAME
    ----------- ----------- ----
    1           10          test
    2           5           prod
    4           4           test

    Ah, it seems you've deleted a few extra rows and kept a row that we wanted to delete.

    OK, let's take a look at ThomasRushton's answer: -

    DELETE
    FROM @testtable
    WHERE id IN (SELECT t2.id
                 FROM @testtable t1
                 INNER JOIN @testtable t2 ON t1.id = t2.id - 1
                 INNER JOIN @testtable t3 ON t2.id = t3.id - 1
                 WHERE t1.NAME = t2.NAME AND t2.NAME = t3.NAME
                );

    For me, there are two issues here. The first is that we're touching the table 4 times, which is unnecessary. The second is that we're only looking at "name" to see if a record is a duplicate. (If the OP agrees with this as the requirement, then we can scratch my second issue).

    What about the result? Does it give the correct answer from the sample data?

    id          value       NAME
    ----------- ----------- ----
    1           10          test
    2           5           prod
    3           10          test
    6           10          test

    Yes it does! Excellent.

    Now, let's look at my answer.

    DELETE a
    FROM @testtable a
    WHERE EXISTS (SELECT 1
                  FROM @testtable b
                  WHERE (a.id = b.id+1 OR a.id = b.id-1)
                    AND a.NAME = b.NAME AND a.value = b.value);

    This time we're only touching the table twice and we're looking at both name and value to determine a duplicate. But does it produce the correct result based on the sample data?

    id          value       NAME
    ----------- ----------- ----
    1           10          test
    2           5           prod
    3           10          test
    6           10          test

    Again, yes it does.


    A question for the OP though - after the query has been run, you'll notice that the result-set has a new "duplicate" (3 and 6). If you were to run either of the working solutions again, they wouldn't detect this as a duplicate. Would you expect them to? Or is this the desired behaviour?


    Forever trying to learn
    My blog - http://www.cadavre.co.uk/

  • Thanks everyone for the prompt replies and for sharing your thoughts.

    The id field in the sample table is contiguous, but in the real data I am working with the id field has gaps in between.

    Also, one issue with the answer provided by Thomas:

    sample data

    id  value  name
    1   10     test
    2   5      prod
    3   10     test
    4   4      test
    5   4      test
    6   10     test

    result

    id  value  name
    1   10     test
    2   5      prod
    3   10     test
    6   10     test

    It is removing all the records with value 4. What is actually required is to preserve the first record of the run (the one with min(id)); it should not remove all of them. (One gaps-tolerant way to do this is sketched below.)

    Required result
    ---------------
    id  value  name
    1   10     test
    2   5      prod
    3   10     test
    4   4      test
    6   10     test
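
    For reference, here is a minimal sketch (an editorial addition, not something posted in the thread) of one way to meet this clarified requirement. It re-sequences the rows by id so that gaps don't matter, tags each contiguous run of the same (value, name) pair with a gaps-and-islands grouping, and deletes everything except the lowest-id row of each run. Table and column names follow the sample data above.

    DECLARE @testtable TABLE (id INT, value INT, name CHAR(4));
    INSERT INTO @testtable
    VALUES (1, 10, 'test'), (2, 5, 'prod'), (3, 10, 'test'), (4, 4, 'test'), (5, 4, 'test'), (6, 10, 'test');

    WITH seq AS
    (   -- row_number over id minus row_number per (value, name) is constant within a contiguous run
        SELECT id, value, name,
               ROW_NUMBER() OVER (ORDER BY id)
             - ROW_NUMBER() OVER (PARTITION BY value, name ORDER BY id) AS grp
          FROM @testtable
    ),
    runs AS
    (   -- position of each row within its contiguous run
        SELECT id,
               ROW_NUMBER() OVER (PARTITION BY value, name, grp ORDER BY id) AS pos
          FROM seq
    )
    DELETE t
      FROM @testtable t
      JOIN runs r ON r.id = t.id
     WHERE r.pos > 1;   -- keeps the min(id) row of each run; with this data only id 5 is deleted

    SELECT * FROM @testtable;

    Because adjacency is judged on ROW_NUMBER() over id rather than on id arithmetic, gaps in the id values don't change which rows count as contiguous.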

  • Try this one:

    DECLARE @testtable TABLE (id INT, value INT, NAME CHAR(4));
    INSERT INTO @testtable
    VALUES
    (1, 10, 'test'),
    (2, 5, 'prod'),
    (3, 10, 'test'),
    (4, 4, 'test'),
    (5, 4, 'test'),
    (6, 4, 'test'),
    (7, 10, 'test')

    delete from @testtable where id in (
        select t1.id
        from @testtable t
        inner join @testtable t1 on t.id+1=t1.id and t.value=t1.value
    )

    select * from @testtable

  • This works for me. I've made a few assumptions as to the ID, value and name.

    I've also modified the test data for assumptions 1 and 2.

    1. The ID value is not always incremented by 1; there can be gaps.

    2. A value does not always have the same name (i.e. 10 is not always "test").

    3. We're looking for duplicates in Value and Name (i.e. a composite unique key).

    4. We keep the first instance of a duplicate and delete the subsequent ones.

    DECLARE @testtable TABLE (id INT, value INT, NAME CHAR(4));
    INSERT INTO @testtable
    VALUES
    (1, 10, 'test'),  --Keep
    (2, 5, 'prod'),   --Keep
    (4, 10, 'test'),  --Keep
    (6, 4, 'test'),   --Keep
    (7, 4, 'test'),   --Drop
    (9, 4, 'Job'),    --Keep (matching value but not name)
    (11, 10, 'test'), --Keep
    (13, 10, 'test')  --Drop
    ;

    with t1 as (
        select ROW_NUMBER() over(order by id) RowID, ID, Name, Value
        from @testtable
    )
    delete a
    from @testtable a
    where id in (
        select t1.id
        from t1
        inner join t1 t2
            on t1.RowID = t2.RowID + 1
           and t1.value = t2.value
           and t1.NAME = t2.NAME
    )

    select *
    from @testtable

    ---------------------------------------------------------------
    Mike Hahn - MCSomething someday:-)
    Right way to ask for help!!
    http://www.sqlservercentral.com/articles/Best+Practices/61537/
    I post so I can see my avatar :hehe:
    I want a personal webpage 😎
    I want to win the lotto πŸ˜€
    I want a gf like Tiffa :w00t: Oh wait I'm married!:-D

  • Thanks a lot for the replies. Finally, I could get the required result with your help.

  • Want a cool sig (10/25/2012)


    This works for me. I've made a few assumptions as to the ID, value and name.

    I've also modified the test data for assumptions 1 and 2.

    1. The ID value is not always incremented by 1; there can be gaps.

    2. A value does not always have the same name (i.e. 10 is not always "test").

    3. We're looking for duplicates in Value and Name (i.e. a composite unique key).

    4. We keep the first instance of a duplicate and delete the subsequent ones.

    DECLARE @testtable TABLE (id INT, value INT, NAME CHAR(4));
    INSERT INTO @testtable
    VALUES
    (1, 10, 'test'),  --Keep
    (2, 5, 'prod'),   --Keep
    (4, 10, 'test'),  --Keep
    (6, 4, 'test'),   --Keep
    (7, 4, 'test'),   --Drop
    (9, 4, 'Job'),    --Keep (matching value but not name)
    (11, 10, 'test'), --Keep
    (13, 10, 'test')  --Drop
    ;

    with t1 as (
        select ROW_NUMBER() over(order by id) RowID, ID, Name, Value
        from @testtable
    )
    delete a
    from @testtable a
    where id in (
        select t1.id
        from t1
        inner join t1 t2
            on t1.RowID = t2.RowID + 1
           and t1.value = t2.value
           and t1.NAME = t2.NAME
    )

    select *
    from @testtable

    Careful now... if you look at the row sizes of the arrows in the actual execution plan, there's a full-blown cartesian product in there. I'm not sure that'll go away for larger numbers of rows.
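
    One common way to sidestep that (an editorial sketch, not something posted in the thread) is to materialize the re-sequenced rows once into an indexed temp table and run the self-join against that, so the windowed CTE isn't expanded twice. Table and column names follow the quoted code; the temp table name #seq is made up here.

    IF OBJECT_ID('tempdb..#seq','U') IS NOT NULL DROP TABLE #seq;

    -- materialize the re-sequenced rows once
    SELECT ROW_NUMBER() OVER (ORDER BY id) AS RowID, id, value, name
      INTO #seq
      FROM @testtable;

    CREATE UNIQUE CLUSTERED INDEX ix_seq ON #seq (RowID);

    -- same adjacency test as before, but against the indexed copy
    DELETE a
      FROM @testtable a
     WHERE a.id IN (SELECT s1.id
                      FROM #seq s1
                      JOIN #seq s2 ON s1.RowID = s2.RowID + 1
                                  AND s1.value = s2.value
                                  AND s1.name  = s2.name);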

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • praveen_vejandla (10/24/2012)


    Thanks everyone for the prompt replies and for sharing your thoughts.

    The id field in the sample table is contiguous, but in the real data I am working with the id field has gaps in between.

    Also, one issue with the answer provided by Thomas:

    sample data

    id  value  name
    1   10     test
    2   5      prod
    3   10     test
    4   4      test
    5   4      test
    6   10     test

    result

    id  value  name
    1   10     test
    2   5      prod
    3   10     test
    6   10     test

    It is removing all the records with value 4. What is actually required is to preserve the first record of the run (the one with min(id)); it should not remove all of them.

    Required result
    ---------------
    id  value  name
    1   10     test
    2   5      prod
    3   10     test
    4   4      test
    6   10     test

    I guess I don't understand. In your first post, you said only rows 4 and 5 should be removed, which is what both Thomas and Cadavre came up with. Now you're including row 4.

    What are the correct requirements for this problem?

    --Jeff Moden



  • EamonSQL (10/24/2012)


    select from the table using the ROW_NUMBER() function ordering by id and partitioning by value.

    Wrap in brackets giving you a derived table.

    Run a delete command against the table, joining to the derived table on the id, where rownum > 1.

    That should sort you out.

    Eamon

    Got any code for this?

    --Jeff Moden



  • praveen_vejandla (10/25/2012)


    Thanks a lot for the replies. Finally, I could get the required result with your help.

    You can get a bit more speed out of it by limiting the number of scans to only 2 passes instead of 3, as Cadavre suggested. Here's a slightly different way of doing that, along with a performance test between the currently accepted solution and a delete that deletes directly from the CTE that makes the row numbers.

    --===== Conditionally drop the test tables to make reruns
         -- of this destructive test easier.
    IF OBJECT_ID('tempdb..#TestTable1','U') IS NOT NULL DROP TABLE #TestTable1;
    IF OBJECT_ID('tempdb..#TestTable2','U') IS NOT NULL DROP TABLE #TestTable2;
    ;
    --===== Create and populate the test table on-the-fly.
    SELECT TOP 100000
           ID    = IDENTITY(INT,1,1),
           Value = ABS(CHECKSUM(NEWID()))%10+1,
           NAME  = CASE ABS(CHECKSUM(NEWID()))%2
                   WHEN 0 THEN 'Test'
                   WHEN 1 THEN 'Prod'
                   ELSE 'Job'
                   END
      INTO #TestTable1
      FROM sys.all_columns ac1
     CROSS JOIN sys.all_columns ac2
    ;
    --===== Add what I believe is the primary key.
    ALTER TABLE #TestTable1
      ADD PRIMARY KEY CLUSTERED (ID)
    ;
    --===== Make an identical copy of the table
    SELECT *
      INTO #TestTable2
      FROM #TestTable1
    ;
    ALTER TABLE #TestTable2
      ADD PRIMARY KEY CLUSTERED (ID)
    ;
    ---------------------------------------------------------------------------------------------------
    PRINT '==============================================================================';
    PRINT '========== Want a cool sig''s solution ==========';
    PRINT '==============================================================================';
    SET STATISTICS TIME ON;
    with t1 as (
        select ROW_NUMBER() over(order by id) RowID, ID, Name, Value
        from #testtable1
    )
    delete a
    from #testtable1 a
    where id in (
        select t1.id
        from t1
        inner join t1 t2
            on t1.RowID = t2.RowID + 1
           and t1.value = t2.value
           and t1.NAME = t2.NAME
    )
    SET STATISTICS TIME OFF;
    PRINT '==============================================================================';
    PRINT '========== Jeff''s solution ==========';
    PRINT '==============================================================================';
    SET STATISTICS TIME ON;
    WITH
    cteEnumerate AS
    (
        SELECT RowNumLo = ROW_NUMBER() OVER (ORDER BY ID)+1,
               RowNumHi = ROW_NUMBER() OVER (ORDER BY ID),
               ID, Value, Name
          FROM #TestTable2
    )
    DELETE hi
      FROM cteEnumerate lo
      JOIN cteEnumerate hi ON lo.RowNumLo = hi.RowNumHi
       AND lo.Value = hi.Value
       AND lo.Name  = hi.Name
    SET STATISTICS TIME OFF;
    PRINT '==============================================================================';

    Here are the performance results on my older single CPU desktop.

    (100000 row(s) affected)
    (100000 row(s) affected)

    ==============================================================================
    ========== Want a cool sig's solution ==========
    ==============================================================================
    SQL Server Execution Times:
       CPU time = 1125 ms, elapsed time = 3440 ms.
    (3687 row(s) affected)
    ==============================================================================
    ========== Jeff's solution ==========
    ==============================================================================
    SQL Server Execution Times:
       CPU time = 610 ms, elapsed time = 615 ms.
    (3687 row(s) affected)
    ==============================================================================

    --Jeff Moden



  • Jeff: I see what you mean about the 3 scans vs 2 scans. I ran the script on a VM server here at work, since that's where I am now, and the results came back mixed. I ran it 5 times; in 2 of the 5 runs (1 and 3) the execution time for my script was actually faster, and the 4th run came out the same.

    I'm assuming the faster machine is compensating for any additional overhead, but I didn't expect it to execute faster. I do like your script better; it's easier to read. I didn't know you could delete through a CTE; you learn something new every day. πŸ™‚
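
    As an aside on the "delete through a CTE" point, the same mechanism drives the classic de-duplication pattern sketched below (an editorial sketch, not from the thread). Note that it removes every later duplicate of a (value, name) pair, contiguous or not, so it answers a different question than this thread's; it is only here to illustrate deleting through a CTE, using the #TestTable2 layout from the script above.

    WITH d AS
    (
        SELECT ROW_NUMBER() OVER (PARTITION BY value, name ORDER BY id) AS rn
          FROM #TestTable2
    )
    DELETE FROM d   -- deleting from the CTE deletes the underlying base rows
     WHERE rn > 1;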

    (100000 row(s) affected)
    (100000 row(s) affected)

    ==============================================================================
    ========== Want a cool sig's solution ==========
    ==============================================================================
    SQL Server parse and compile time:
       CPU time = 356 ms, elapsed time = 356 ms.
    SQL Server Execution Times:
       CPU time = 266 ms, elapsed time = 515 ms.
    (3702 row(s) affected)
    ==============================================================================
    ========== Jeff's solution ==========
    ==============================================================================
    SQL Server parse and compile time:
       CPU time = 386 ms, elapsed time = 386 ms.
    SQL Server Execution Times:
       CPU time = 282 ms, elapsed time = 521 ms.
    (3702 row(s) affected)
    ==============================================================================

    (100000 row(s) affected)
    (100000 row(s) affected)

    ==============================================================================
    ========== Want a cool sig's solution ==========
    ==============================================================================
    SQL Server parse and compile time:
       CPU time = 339 ms, elapsed time = 339 ms.
    SQL Server Execution Times:
       CPU time = 297 ms, elapsed time = 533 ms.
    (3670 row(s) affected)
    ==============================================================================
    ========== Jeff's solution ==========
    ==============================================================================
    SQL Server parse and compile time:
       CPU time = 331 ms, elapsed time = 331 ms.
    SQL Server Execution Times:
       CPU time = 266 ms, elapsed time = 553 ms.
    (3670 row(s) affected)
    ==============================================================================

    (100000 row(s) affected)
    (100000 row(s) affected)

    ==============================================================================
    ========== Want a cool sig's solution ==========
    ==============================================================================
    SQL Server parse and compile time:
       CPU time = 341 ms, elapsed time = 341 ms.
    SQL Server Execution Times:
       CPU time = 265 ms, elapsed time = 579 ms.
    (3765 row(s) affected)
    ==============================================================================
    ========== Jeff's solution ==========
    ==============================================================================
    SQL Server parse and compile time:
       CPU time = 331 ms, elapsed time = 331 ms.
    SQL Server Execution Times:
       CPU time = 265 ms, elapsed time = 556 ms.
    (3765 row(s) affected)
    ==============================================================================

    (100000 row(s) affected)
    (100000 row(s) affected)

    ==============================================================================
    ========== Want a cool sig's solution ==========
    ==============================================================================
    SQL Server parse and compile time:
       CPU time = 342 ms, elapsed time = 342 ms.
    SQL Server Execution Times:
       CPU time = 266 ms, elapsed time = 538 ms.
    (3700 row(s) affected)
    ==============================================================================
    ========== Jeff's solution ==========
    ==============================================================================
    SQL Server parse and compile time:
       CPU time = 340 ms, elapsed time = 340 ms.
    SQL Server Execution Times:
       CPU time = 297 ms, elapsed time = 529 ms.
    (3700 row(s) affected)
    ==============================================================================

    (100000 row(s) affected)
    (100000 row(s) affected)

    ==============================================================================
    ========== Want a cool sig's solution ==========
    ==============================================================================
    SQL Server parse and compile time:
       CPU time = 342 ms, elapsed time = 342 ms.
    SQL Server Execution Times:
       CPU time = 297 ms, elapsed time = 574 ms.
    (3724 row(s) affected)
    ==============================================================================
    ========== Jeff's solution ==========
    ==============================================================================
    SQL Server parse and compile time:
       CPU time = 331 ms, elapsed time = 331 ms.
    SQL Server Execution Times:
       CPU time = 250 ms, elapsed time = 495 ms.
    (3724 row(s) affected)
    ==============================================================================

    ---------------------------------------------------------------
    Mike Hahn - MCSomething someday:-)
