Help Eliminating Duplicate Data

  • I have a table in my database that has some product dimension information, however some of it is duplicated in a way and I want to eliminate the duplicates. The problem is that the duplication is not exact, so I've not been able to figure out a way to do this.

    As an example, I queried the following results:

    In this example, using only one ProductId, you can see Depth=9.00 for both ObjectId 17555 and 37575. However, that same Depth=9.00 is also included in the entry where Width=20.00 for the same ObjectId. I would like to figure out some sort of script that can check to see if the single field of the Depth is included already in another row with more dimensional information (Width=20 & Depth=9). If so, then delete the line that has only Depth=9.

    In other words, the last column in the example is a CheckNumber that shows if there is only 1 dimension in that row or not. If there is only 1 dimension in that ObjectId row & that same dimension is listed in a row with the same ObjectId and MORE, i.e., CheckNumber>1, kill the row with the single dimension.

    My SQL for this query so far is:

    select m.ProductId,m.ObjectId,Height,Width,Depth,Length,[Height (Min)],[Height (Max)],[Width (Min)],[Width (Max)],[Depth (Min)],[Depth (Max)],[Length (Min)],[Length (Max)], CASE WHEN height IS NULL THEN 0 ELSE 1 END+CASE WHEN width IS NULL THEN 0 ELSE 1 END+CASE WHEN depth IS NULL THEN 0 ELSE 1 END+CASE WHEN length IS NULL THEN 0 ELSE 1 END+CASE WHEN [height (min)] IS NULL THEN 0 ELSE 1 END

    +CASE WHEN [height (max)] IS NULL THEN 0 ELSE 1 END+CASE WHEN [width (min)] IS NULL THEN 0 ELSE 1 END+CASE WHEN [width (max)] IS NULL THEN 0 ELSE 1 END+CASE WHEN [depth (min)] IS NULL THEN 0 ELSE 1 END+CASE WHEN [depth (max)] IS NULL THEN 0 ELSE 1 END

    +CASE WHEN [length (min)] IS NULL THEN 0 ELSE 1 END+CASE WHEN [length (max)] IS NULL THEN 0 ELSE 1 END as CheckNumber

    from EuroLuxProductBE.dbo.pdt_multidimensions m

    inner join

    (select productid,objectid

    from EuroLuxProductBE.dbo.pdt_MultiDimensions

    group by productid, objectid

    having count(*)>1) multidups

    on m.productid=multidups.ProductId

    and m.ObjectId=multidups.ObjectId

    where m.ProductId=125319

    Any help would be appreciated. I just can't get my head around this one, and I have 1000s of records like this and don't want to have to spend tons of time deleting them manually.

    Thanks!

  • Try ROW_NUMBER(). Comment out the outer WHERE clause to see how all rows are numbered:

    SELECT *

    FROM (

    SELECT

    *,

    rn = ROW_NUMBER() OVER(PARTITION BY ProductId, ObjectId ORDER BY CheckNumber DESC)

    FROM EuroLuxProductBE.dbo.pdt_multidimensions m

    CROSS APPLY (

    SELECT CheckNumber =

    CASE WHEN height IS NULL THEN 0 ELSE 1 END +

    CASE WHEN width IS NULL THEN 0 ELSE 1 END +

    CASE WHEN depth IS NULL THEN 0 ELSE 1 END +

    CASE WHEN [length] IS NULL THEN 0 ELSE 1 END +

    CASE WHEN [height (min)] IS NULL THEN 0 ELSE 1 END +

    CASE WHEN [height (max)] IS NULL THEN 0 ELSE 1 END +

    CASE WHEN [width (min)] IS NULL THEN 0 ELSE 1 END +

    CASE WHEN [width (max)] IS NULL THEN 0 ELSE 1 END +

    CASE WHEN [depth (min)] IS NULL THEN 0 ELSE 1 END +

    CASE WHEN [depth (max)] IS NULL THEN 0 ELSE 1 END +

    CASE WHEN [length (min)] IS NULL THEN 0 ELSE 1 END +

    CASE WHEN [length (max)] IS NULL THEN 0 ELSE 1 END

    ) x

    WHERE m.ProductId = 125319

    ) d

    WHERE rn > 1

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • Aggregate functions automatically exclude NULL values, so this can be simplified.

    SELECT *

    FROM (

    SELECT

    *,

    rn = ROW_NUMBER() OVER(PARTITION BY ProductId, ObjectId ORDER BY CheckNumber DESC)

    FROM EuroLuxProductBE.dbo.pdt_multidimensions m

    CROSS APPLY (

    SELECT COUNT(*) AS CheckNumber

    FROM (

    VALUES

    (height),

    (width),

    (depth),

    ([length]),

    ([height (min)]),

    ([height (max)]),

    ([width (min)]),

    ([width (max)]),

    ([depth (min)]),

    ([depth (max)]),

    ([length (min)]),

    ([length (max)]

    ) x

    WHERE m.ProductId = 125319

    ) d

    WHERE rn > 1

    Although I prefer using CTEs rather than derived tables.

    Drew

    Edit: forgot to include column alias.

    J. Drew Allen
    Business Intelligence Analyst
    Philadelphia, PA

  • Thanks to both of you! I'll check it out tomorrow. I had to run out of town unexpectedly for a family funeral, so sorry for the late reply.

  • drew.allen (12/19/2016)


    Aggregate functions automatically exclude NULL values, so this can be simplified.

    SELECT *

    FROM (

    SELECT

    *,

    rn = ROW_NUMBER() OVER(PARTITION BY ProductId, ObjectId ORDER BY CheckNumber DESC)

    FROM EuroLuxProductBE.dbo.pdt_multidimensions m

    CROSS APPLY (

    SELECT COUNT(*) AS CheckNumber

    FROM (

    VALUES

    (height),

    (width),

    (depth),

    ([length]),

    ([height (min)]),

    ([height (max)]),

    ([width (min)]),

    ([width (max)]),

    ([depth (min)]),

    ([depth (max)]),

    ([length (min)]),

    ([length (max)]

    ) x

    WHERE m.ProductId = 125319

    ) d

    WHERE rn > 1

    Although I prefer using CTEs rather than derived tables.

    Drew

    Edit: forgot to include column alias.

    As it stands Drew this won't work - from BOL:

    "COUNT(*) does not require an expression parameter because, by definition, it does not use information about any particular column. COUNT(*) returns the number of rows in a specified table without getting rid of duplicates. It counts each row separately. This includes rows that contain null values."

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • Note also that

    CASE WHEN height IS NULL THEN 0 ELSE 1 END

    may be shortened to

    IIF(Height IS NULL, 0, 1)

    If you haven't even tried to resolve your issue, please don't expect the hard-working volunteers here to waste their time providing links to answers which you could easily have found yourself.

  • Phil Parkin (12/28/2016)


    Note also that

    CASE WHEN height IS NULL THEN 0 ELSE 1 END

    may be shortened to

    IIF(Height IS NULL, 0, 1)

    It's shortened...but is it fastened too? :-))

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • ChrisM@Work (12/28/2016)


    drew.allen (12/19/2016)


    Aggregate functions automatically exclude NULL values, so this can be simplified.

    SELECT *

    FROM (

    SELECT

    *,

    rn = ROW_NUMBER() OVER(PARTITION BY ProductId, ObjectId ORDER BY CheckNumber DESC)

    FROM EuroLuxProductBE.dbo.pdt_multidimensions m

    CROSS APPLY (

    SELECT COUNT(*) AS CheckNumber

    FROM (

    VALUES

    (height),

    (width),

    (depth),

    ([length]),

    ([height (min)]),

    ([height (max)]),

    ([width (min)]),

    ([width (max)]),

    ([depth (min)]),

    ([depth (max)]),

    ([length (min)]),

    ([length (max)]

    ) x

    WHERE m.ProductId = 125319

    ) d

    WHERE rn > 1

    Although I prefer using CTEs rather than derived tables.

    Drew

    Edit: forgot to include column alias.

    As it stands Drew this won't work - from BOL:

    "COUNT(*) does not require an expression parameter because, by definition, it does not use information about any particular column. COUNT(*) returns the number of rows in a specified table without getting rid of duplicates. It counts each row separately. This includes rows that contain null values."

    Of course, you're right. If the OP had provided consumable data, I would have been able to test. COUNT will still work if you specify the column rather than using *.

    Drew

    J. Drew Allen
    Business Intelligence Analyst
    Philadelphia, PA

Viewing 8 posts - 1 through 7 (of 7 total)

You must be logged in to reply to this topic. Login to reply