deduplication

  • Hi all

    Could any body help me with a query?

    I have a table with a title,forename,surname and postcode field, I need to ensure only 1 record is selected per postcode ensuring that in the case of a duplicate postcode I select the male as preference.

    se following code to see what I am faced with:

    create table ashley.dbo.sample(

    title nvarchar(10),

    forename nvarchar(50),

    surname nvarchar(50),

    postcode nvarchar(10),

    dps nvarchar(10),

    ref nvarchar(10))

    insert into ashley.dbo.sample

    values ('Mr','John','Smith','AA1 1AB','AB','1')

    insert into ashley.dbo.sample

    values ('Mrs','Jane','Smith','AA1 1AB','AB','2')

    insert into ashley.dbo.sample

    values ('','J','Smith','AA1 1AB','AB','3')

    insert into ashley.dbo.sample

    values ('Mr','Bob','Smith','AA1 1AB','AB','4')

    insert into ashley.dbo.sample

    values ('Miss','Vera','Duckworth','AA1 1AC','FC','5')

    insert into ashley.dbo.sample

    values ('Mr','Jack','Duckworth','AA1 1AC','FC','6')

    insert into ashley.dbo.sample

    values ('Mrs','J','Duckworth','AA1 1AC','FC','7')

    insert into ashley.dbo.sample

    values ('Mrs','Jane','Cond','AA1 1AD','AD','8')

    insert into ashley.dbo.sample

    values ('','J','Cond','AA1 1AD','AD','9')

    So titles of Mr are the priority then i no male available any other 1 single record per postcode will do.

    Could anybody offer any assistance to this?

    Thanks in advance 🙂

  • I imagine there are better ways to do this, but this is what I was able to come up with...

    select * from

    (

    select *,

    row_number() over(partition by postcode order by case when title = 'Mr' then 0 else 1 end, ref) as pref

    from sample

    ) z where pref = 1

    The magic is all in the row_number function, where it Identifies the rows with 'Mr' and pushes them to the top of the set, partitioned by postcode. You then just select the top item in the set.

  • ROW_NUMBER() is an EXCELLENT way of identifying dups and getting rid of all but one.

    Good job.

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • Jason Norsworthy (12/19/2010)


    I imagine there are better ways to do this, but this is what I was able to come up with...

    select * from

    (

    select *,

    row_number() over(partition by postcode order by case when title = 'Mr' then 0 else 1 end, ref) as pref

    from sample

    ) z where pref = 1

    The magic is all in the row_number function, where it Identifies the rows with 'Mr' and pushes them to the top of the set, partitioned by postcode. You then just select the top item in the set.

    This works and I have been using it...However, I am inserting results into another table and sometimes I have to run the query a few time to get all the deduped results into the table. DOes anybody know why this is?

  • bicky1980 (1/10/2011)


    Jason Norsworthy (12/19/2010)


    I imagine there are better ways to do this, but this is what I was able to come up with...

    select * from

    (

    select *,

    row_number() over(partition by postcode order by case when title = 'Mr' then 0 else 1 end, ref) as pref

    from sample

    ) z where pref = 1

    The magic is all in the row_number function, where it Identifies the rows with 'Mr' and pushes them to the top of the set, partitioned by postcode. You then just select the top item in the set.

    This works and I have been using it...However, I am inserting results into another table and sometimes I have to run the query a few time to get all the deduped results into the table. DOes anybody know why this is?

    Post the code where it's happening...

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • If you are using the code shown to insert into another table, you are only going to get one "duplicate" at a time. To insert all but the first row, you would want to select where the column produced by row_number() is greater than 1. Like Jeff said, please post up the code you are actually running.

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • I have no idea how to use OVER and PARTITION and I hardly use ROW_NUMBER (probably out of ignorance), but just for fun I wrote a script that I think would work as well and should work with just one run.

    SELECT title, forename, surname, postcode, dps, ref

    from ashley.dbo.sample a

    where

    (

    title = 'Mr.'

    or

    NOT EXISTS (

    SELECT 1

    FROM ashley.dbo.sample b

    where

    a.postcode = b.postcode

    and b.title = 'Mr.'

    )

    )

    AND

    NOT EXISTS (

    SELECT 1

    FROM ashley.dbo.sample c

    where

    c.ref < a.ref

    and a.postcode = c.postcode

    and

    (

    c.title = 'Mr.'

    or

    NOT EXISTS (

    SELECT 1 FROM ashley.dbo.sample d where c.postcode = d.postcode and d.title = 'Mr.')

    )

    )

    I didn't build a test table to run this against, but I think it should run. Thanks for the other script it gives me something to learn and blog about in a few days.

  • Just out of curiosity, how do you use ROW_NUMBER() at all without using OVER? I'll encourage you to read up on ROW_NUMBER(), RANK(), and the other windowed functions in SQL2008. They are excellent tools to have in your toolbox.

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • I will have to study up on that. I should be more specific I suppose 🙂 I have maintained scripts that I have noticed that ROW_NUMBER was used (I probably didn't notice the other two.) I really love working in SQL and am always excited about learning new tricks; so please feel free to critique the mess out of any of my posts. I definitely learn from my mistakes and plan on making a lot.

Viewing 9 posts - 1 through 8 (of 8 total)

You must be logged in to reply to this topic. Login to reply