eliminating duplicates

  • As far as a framework, Jeff Moden posts this link in his signature (and I believe others do as well).

    It makes it very clear what should be posted to receive quicker, more accurate responses. As far as SQL2000 vs. SQL2005, sometimes it's critical as there are changes/additions to functionality depending on version, correct? Sometimes posts are posted to the wrong forum and if people are thinking it's 2005, that's what most - if not all, responses will be geared toward. I don't like wasting my time trying to do something one way and find out, due to version, it's not going to work. And yes, sometimes, shouting at posters is the only way to be HEARD. The important thing is that people are giving their time, and talent, freely here to try and help someone out. Sometimes the OP also has to help themselves....IMHO :rolleyes:

    http://www.sqlservercentral.com/articles/Best+Practices/61537/

    -- You can't be late until you show up.

  • To the lovely Shawn: d_sysuk: dude you just don't get it do you?

    --> Stop posting, we're fed up:..had enough of this yesterday.

    Advice: STOP READING POST !

    I like many others do not live on USA timezones, picked this question up my time today.

    Had a look at this only for a couple of hours, but this highlights an interesting issue - almost valid of a MS connect feedback scenario - "To allow row_number deletes to apply - not to delete ALL duplicate entrys" ..., where it would be achieve a single SQL 2005 statement version :

    delete from dbo.my_tab

    from (select row_number() over (order by z.acctproc_id, z.[name], z.acct_id) row,

    acctproc_id, name, acct_id

    from my_tab z) SOURCE

    JOIN

    ( select a.acctproc_id, a.[name], a.acct_id,

    -- a.row,

    c.count_row,

    max(a.row) row_to_delete

    from ( select z.acctproc_id, z.[name], z.acct_id,

    row_number() over (order by z.acctproc_id, z.[name], z.acct_id) row

    from dbo.my_tab z) a

    join

    (select b.acctproc_id, b.name, b.acct_id,count(*) count_row

    from dbo.my_tab b

    group by b.acctproc_id, b.name, b.acct_id

    ) c

    on c.acctproc_id = a.acctproc_id

    and c.name = a.name

    and (c.acct_id = a.acct_id or c.acct_id is null and a.acct_id is null)

    group by a.acctproc_id, a.name, a.acct_id, c.count_row) FILTEREDSET ON

    --SOURCE.acctproc_id = FILTEREDSET.acctproc_id and

    --SOURCE.[name] = FILTEREDSET.[name]

    --and (SOURCE.acct_id = FILTEREDSET.acct_id

    -- or Source.acct_id is null and filteredset.acct_id is null)

    --and

    SOURCE.row= FILTEREDSET.row_to_delete

    to get the results we expect.

    Where it is using the actual row_numer to perform the inline delete.

  • d_sysuk: I am not in the USA, I'm ahead of them which could mean 16 hours behind πŸ˜€ as the earth spins. that aside your SQL2005 solution looks quite minimal in size.

    implying 2005 > 2000 and not just numerically πŸ˜‰

    but original question was SQL2000 :Whistling:

    πŸ˜€ Shaun not Shawn πŸ˜€

    Hiding under a desk from SSIS Implemenation Work :crazy:

  • per rbarryyoung:

    Except that that is NOT what was wanted. What was wanted was:

    Eliminate One row from each distinct Group.

    With the Following restrictions:

    No intermediate tables!

    and

    No additional Identity columns

    SQL Server 2000 (not 2005)

    This will delete all non duplicated entries and the first row of a duplicated entry - leaving all other duplicated rows in the table.

    ;with numbered as(SELECT rowno=row_number() over

    (partition by [your column] order by [your column name] ), [your column name] from [your table] )

    delete from numbered where rowno =1

    If everything seems to be going well, you have obviously overlooked something.

    Ron

    Please help us, help you -before posting a question please read[/url]
    Before posting a performance problem please read[/url]

  • bitbucket (7/1/2008)


    per rbarryyoung:

    Except that that is NOT what was wanted. What was wanted was:

    Eliminate One row from each distinct Group.

    With the Following restrictions:

    No intermediate tables!

    and

    No additional Identity columns

    SQL Server 2000 (not 2005)

    This will delete all non duplicated entries and the first row of a duplicated entry - leaving all other duplicated rows in the table.

    ;with numbered as(SELECT rowno=row_number() over

    (partition by [your column] order by [your column name] ), [your column name] from [your table] )

    delete from numbered where rowno =1

    Heh... RTFS! That won't work in SQL Server 2000 as requested!

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • d_sysuk (7/1/2008)


    Well going to the first post, its clear the the post is most unclear in its requirements. Shouting at posters wont help under any circumstances or imposing random conditions , e.g. its now SQL 2000, not SQL 2005 etc...

    The whole problem is that you won't stop... you and a lot of other folks keep writing code that doesn't even come close to the specs and it's going to be confusing for other folks that may read this thread.

    We've tried to tell everyone to stop because they're not reading the problem but they keep on posting any way... yourself included. So, I don't blame anyone for yelling because you... won't... listen... :hehe:

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Chris Morris (7/1/2008)


    Oh all right then. No funny stuff either.

    DECLARE @ItemID INT

    DECLARE @FirstCol varChar(5), @SecondCol INT, @ThirdCol Char(1)

    SET @ItemID = 1

    UPDATE #Testing SET ThirdCol = CASE @ItemID WHEN 1 THEN 'D' ELSE ThirdCol END,

    @ItemID = CASE WHEN @FirstCol = FirstCol AND @SecondCol = SecondCol AND @ThirdCol = ThirdCol THEN @ItemID+1 ELSE 1 END,

    @FirstCol = FirstCol, @SecondCol = SecondCol, @ThirdCol = ThirdCol

    DELETE FROM #Testing WHERE ThirdCol = 'D'

    Very nice, chris. Though don't forget that you need a covering Clustered Index to make this work.

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • Jeff as usual you are CORRECT ... I must have gotten carried away reading all these posts ... tell me one thing what the heck is RTFS and if it can not be said in mixed company send a PM....

    If everything seems to be going well, you have obviously overlooked something.

    Ron

    Please help us, help you -before posting a question please read[/url]
    Before posting a performance problem please read[/url]

  • RTFS = Read The (some not nice word beginning with F, usually emphasized) Spec - if I'm not mistaken πŸ˜€

    I could be wrong

    Hiding under a desk from SSIS Implemenation Work :crazy:

  • Yes the solution I posted, like others would work on SQL 2005, and not SQL 2000.

    Sometimes, we'll look at things not in the exact way a poster wants, but then downgrade a script.

    Something that has been missed from this nice analysis:

    I was looking at this problem in more detail, and what some seemed to miss was positional updates is just not possible in SQL Server.

    Something that you CAN do in ORACLE !

    I.e. every row has a unique identifier, which means if we wanted a table full of duplicates, we could delete one copy of a duplicate within a delete statement.

    The problem being a delete x,y,z from will delete every instance of that record which matchs, and not the first one from a set with no way of positioning the update..(which would probly make the partition idea also fail for this case) but alas this fine point has been missed, and the opportunity to improve something in the future.

    In terms of delete the records from the table for a SQL 2000 solution, what I would have suggested would have been turn the problem on its head for this one..

    Just SELECT out the data required at a view layer... would have been something like this (creating a view, as rqd):

    -- this requires some hard-coding with count=N

    -- Get records which have 0 duplicates

    --select acctproc_id, name, acct_id ,'U' duplicate

    -- from dbo.my_tab

    -- group by acctproc_id, name, acct_id

    -- having count(*) = 1

    --union

    -- Get records which have 1 DUPLICATE !

    -- output 1 Row, remove 1 dup row

    (select acctproc_id, name, acct_id ,'D' duplicate

    from dbo.my_tab

    group by acctproc_id, name, acct_id

    having count(*) =2 )

    union

    -- GRab this set twice, as we have 2 dups..

    (select acctproc_id, name, acct_id ,'D' duplicate

    from dbo.my_tab

    group by acctproc_id, name, acct_id

    having count(*) = 3)

    union

    (select acctproc_id, name, acct_id ,'D' duplicate

    from dbo.my_tab

    group by acctproc_id, name, acct_id

    having count(*) =3)

    -- ETC Rqd..

    ... Yes, requries repeating syntax for max number of dup items expected.

    The view then act as a temporary interface layer for the application, and tidy the data up in the background as required.

  • rbarryyoung (7/1/2008)


    Though don't forget that you need a covering Clustered Index to make this work.

    Thanks for pointing that out Barry. This solution will only work if the data is ordered such that identical rows are sequential*, or there's a clustered covering index.

    * Probably - there are threads arguing that this may not always be the case.

    β€œWrite the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • Jeff,

    Heh... this all reminds me of another of my favorite 4 letter acronyms... RTFS!!!!!

    what do you mean by RTFS ?

    karthik

  • karthikeyan (7/2/2008)


    Jeff,

    Heh... this all reminds me of another of my favorite 4 letter acronyms... RTFS!!!!!

    what do you mean by RTFS ?

    Read the friggin' spec πŸ˜›

    β€œWrite the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • Read

    The

    Final

    Spec.

    or

    Read

    The

    Full

    Spec.

    are the polite suggestions I can immediately think of πŸ˜€

    Hiding under a desk from SSIS Implemenation Work :crazy:

  • Try this

    set rowcount 1

    select 1

    while @@rowcount > 0

    delete Emp1 where 1 < (select count(*) from Emp1 a2

    where Emp1.Eno = a2.Eno

    and Emp1.Ename = a2.Ename

    )

    set rowcount 0

Viewing 15 posts - 76 through 90 (of 137 total)

You must be logged in to reply to this topic. Login to reply