eliminating duplicates

  • There is a way to do this using SET ROWCOUNT 1 and a cursor (which is technically an intermediate table but fools most people). Performance will totally suck. It would be much easier if you got the people giving you these ridiculous requirements to cut you some slack... they should not be allowed anywhere near a database. 😉

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • ...and, if you just happend to have posted in the wrong forum and you actually have SQL Server 2005, you need to let us know that because there's a very simple solution there... :hehe:

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (6/21/2008)


    There is a way to do this using SET ROWCOUNT 1 and a cursor (which is technically an intermediate table but fools most people). Performance will totally suck. It would be much easier if you got the people giving you these ridiculous requirements to cut you some slack... they should not be allowed anywhere near a database. 😉

    Tak, tsk Jeff. This can be done in SQL2000 without any loops or cursors. All you need is a Tally table and some convoluted compression/decompression tricks.

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • Tally table method in 2000 for this would require a triangular join... I believe. If you know a way to do this in 2k without using a loop but using a Tally table instead, I'd sure like to see it. And, remember, OP said could not use an intermediate table or add any columns. 🙂

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (6/21/2008)


    Tally table method in 2000 for this would require a triangular join... I believe. If you know a way to do this in 2k without using a loop but using a Tally table instead, I'd sure like to see it. And, remember, OP said could not use an intermediate table or add any columns. 🙂

    Yep. Technically Triangular, but only over the count of duplicates-1. Given the other (extreme) constraints, I think that's a fair compromise.

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • Vish:

    vish (6/19/2008)


    I use SQL server 2000.

    I have a table 'TAB' with 3 coulumns 'Name','Age','Sex'

    and there is no primary key.

    I have multiple duplicate data in my table. for eg..

    ABC24M

    ABC24M

    LMN27M

    LMN27M

    LMN27M

    PQRS25F

    XYZ24M

    XYZ25M

    Now i would wish to 'DELETE' the 1st original row ( Row no 1,3,6,7)

    and keep the other duplcates from above data. 😎

    Now, If there is a row without any dupicates present.. it will be deleted ( for eg Row no 6)

    The condition is i dont want to go for intermediate tables or have any

    additional identity column. :w00t:

    Yes, according the the Relational rules of tuple identity, this can absolutely be done in SQL2000. It's a little obtuse, but not nearly as obtuse as your conditions.

    Try this:

    --Compress duplicates and

    --encode them into archetype records:

    Insert into TAB

    Select Name

    , Age

    , CHAR( (2*Count(*)) + (CASE Sex When 'M' Then 0 Else 1 End) )

    From TAB

    Group By Name, Age, Sex

    --remove the old leftover records:

    Delete from TAB

    Where NOT Sex IN('F','M')

    --Now Uncompress and restore

    --the original records skipping the 1st:

    Insert into TAB

    Select Name

    , Age

    , CASE (ASCII(Sex) && 1) When 0 Then 'M' Else 'F' End

    From TAB

    Join Tally ON Tally.Number <= CASE Ascii(Sex)/2

    Where Tally.Number Between 2 and 127

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • Oops, sorry. Forgot one of the steps:

    --pre-compress the records:

    Update TAB

    Set Sex = CHAR(CASE Sex When 'M' Then 0 Else 1 End)

    --Compress duplicates together and

    --encode them into archetype records:

    Insert into TAB

    Select Name

    , Age

    , CHAR( (2*Count(*)) + Ascii(Sex) )

    From TAB

    Group By Name, Age, Sex

    --remove the old leftover records:

    -- and the archetypes with a count=1

    Delete from TAB

    Where Ascii(Sex) <= 3

    --Now Uncompress and restore

    --the original records skipping the 1st:

    Insert into TAB

    Select Name

    , Age

    , CASE (ASCII(Sex) & 1) When 0 Then 'M' Else 'F' End

    From TAB

    Join Tally ON Tally.Number <= CASE Ascii(Sex)/2

    Where Tally.Number Between 2 and 127

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • Oh, very clever... well done, Barry!

    Only corrections are to remove the CASE in the final insert and to delete the compressed rows...

    Other than that, very, very clever... 🙂

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Heh... still, the best thing to do would be to give the person requesting that this be done without adding an IDENTITY column and without using an intermediate table a really bad case of porkchop-itus. 🙂

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (6/21/2008)


    Oh, very clever... well done, Barry!

    Only corrections are to remove the CASE in the final insert and to delete the compressed rows...

    Other than that, very, very clever... 🙂

    Dang! You're right.

    Oh well, as we used to say in theoretical mathematics: "I leave that as an exercise for the reader." 🙂

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • Jeff Moden (6/21/2008)


    Heh... still, the best thing to do would be to give the person requesting that this be done without adding an IDENTITY column and without using an intermediate table a really bad case of porkchop-itus. 🙂

    No argument there.

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • Hi.

    One of solutions I apply in these cases is.

    1. alter table and create a new identity column..

    2. Make a select group by items, and get min(identity),

    3. delete all records in table not in subselect (2)

    4. Alter table and drop column.

    Here you have and example:

    Table Test

    (

    id int,

    name char

    )

    You have the values:

    1 AAA

    1 AAA

    2 BBB

    3 CCC

    3 CCC

    - First create a new identity column pos order

    ALTER TABLE test ADD [POS][int] IDENTITY(1,1)

    Now whe have on table this

    1 AAA 1

    1 AAA 2

    2 BBB 3

    3 CCC 4

    4 CCC 4

    - Now drop (duplicated 2 o more times) values, group by columns values (in this case id,name)

    DELETE FROM test WHERE POS NOT IN(SELECT min(POS) FROM test group by id,name)

    - Drop new column

    ALTER table test DROP COLUMN POS

    That's all folks!!!!

  • vish (6/19/2008)


    Hi All,

    Let me give my problem description in detail.

    I use SQL server 2000.

    I have a table 'TAB' with 3 coulumns 'Name','Age','Sex'

    and there is no primary key.

    I have multiple duplicate data in my table. for eg..

    ABC24M

    ABC24M

    LMN27M

    LMN27M

    LMN27M

    PQRS25F

    XYZ24M

    XYZ25M

    Now i would wish to 'DELETE' the 1st original row ( Row no 1,3,6,7)

    and keep the other duplcates from above data. 😎

    Now, If there is a row without any dupicates present.. it will be deleted ( for eg Row no 6)

    The condition is i dont want to go for intermediate tables or have any

    additional identity column. :w00t:

    Please help..!!

    Thanks folks

    Pretty sure I can write something to do this actually.

    Weird request though! Can I just check, you would actually delete rows 7 and 8 up there, because the age is diff thus the rows are unique and according to yuor request, unique rows are to be removed?

  • OK Assuming your rows are unique only when all three rows match (this includes gender), this should work..

    DECLARE @Testing TABLE

    (

    FirstColvarChar(5),

    SecondColINT,

    ThirdColChar(1)

    )

    INSERT INTO @Testing

    SELECT 'ABC',24,'M'

    INSERT INTO @Testing

    SELECT 'ABC',24,'M'

    INSERT INTO @Testing

    SELECT 'DEF',24,'M'

    INSERT INTO @Testing

    SELECT 'DEF',24,'F'

    INSERT INTO @Testing

    SELECT 'GHI',26,'F'

    INSERT INTO @Testing

    SELECT 'GHI',26,'F'

    INSERT INTO @Testing

    SELECT 'GHI',26,'F'

    INSERT INTO @Testing

    SELECT 'GHI',26,'F'

    INSERT INTO @Testing

    SELECT 'GHI',26,'F'

    INSERT INTO @Testing

    SELECT 'LMN',27,'M'

    INSERT INTO @Testing

    SELECT 'LMN',27,'M'

    INSERT INTO @Testing

    SELECT 'LMN',27,'M'

    INSERT INTO @Testing

    SELECT 'PQRS',25,'F'

    INSERT INTO @Testing

    SELECT 'XYZ',24,'M'

    INSERT INTO @Testing

    SELECT 'XYZ',25,'M'

    SELECT * FROM @Testing

    -- Delete unique rows

    DELETE T1 FROM @Testing T1

    INNER JOIN (SELECT FirstCol,SecondCol,ThirdCol FROM @Testing GROUP BY FirstCol,SecondCol,ThirdCol HAVING COUNT(*) = 1) T2

    ON T1.FirstCol = T2.FirstCol

    AND T1.SecondCol = T2.SecondCol

    AND T1.ThirdCol = T2.ThirdCol

    SELECT * FROM @Testing

    -- Group up and count columns

    DECLARE@TestCount TABLE

    (

    FirstColvarChar(5),

    SecondColINT,

    ThirdColChar(1),

    CountColINT

    )

    INSERT INTO@TestCount

    SELECTFirstCol,SecondCol,ThirdCol,COUNT(*) FROM @Testing

    GROUP BY FirstCol,SecondCol,ThirdCol

    -- Get max count

    DECLARE@MaxCountINT

    SELECT@MaxCount = MAX(CountCol) FROM @TestCount

    -- Final table to hold last result set

    DECLARE @TestingFinal TABLE

    (

    FirstColvarChar(5),

    SecondColINT,

    ThirdColChar(1)

    )

    -- If there are two entries insert 1, 3 insert 2, 4 insert 3 etc.

    -- Thus removign "first original row"

    DECLARE@CurCountINT

    SELECT@CurCount = MIN(CountCol) FROM @TestCount

    WHILE@CurCount <= @MaxCount

    BEGIN

    -- Insert

    INSERT INTO@TestingFinal

    SELECTFirstCol,SecondCol,ThirdCol

    FROM@TestCount

    WHERECountCol >= @CurCount

    -- Increment

    SET@CurCount = @CurCount + 1

    END

    -- Done

    SELECT * FROM @TestingFinal

    ORDER BY FirstCol,SecondCol,ThirdCol

    *hides from the RBAR police*

  • *hides from the RBAR police*

    Heh.. nah... "we" see all RBAR 😛

    The problem with even that solution is that the op said "no intermediate tables"... @TestCount qualifies as an intermediate table...

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 15 posts - 16 through 30 (of 137 total)

You must be logged in to reply to this topic. Login to reply