Detect duplicate rows - with a twist

  • I provided the trivial data, but mine still works if they aren't in sequence. That was omitted from the original specs, so I didn't want to assume it away. 🙂

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • Made a simple change to the data and broke my own code. This requires additional information regarding he requirements. If the data is inserted as specified by the OP, it works. If you change the order of the data such that you have two unbilled records entered followed by a third billed record for the same "rawdata", the first unbilled record is not flagged as a duplicate.

    Here is the modified test code:

    /****** Object: Table [dbo].[CallDetailRecords] Script Date: 04/27/2009 12:10:02 ******/

    SET ANSI_NULLS ON

    GO

    SET QUOTED_IDENTIFIER ON

    GO

    CREATE TABLE [dbo].[CallDetailRecords](

    [Id] [uniqueidentifier] NOT NULL,

    [RawData] [varchar](300) NOT NULL,

    [ErrorText] [varchar](200) NULL,

    [Rated] [bit] NOT NULL,

    [Billed] [bit] NULL

    ) ON [PRIMARY]

    GO

    CREATE CLUSTERED INDEX IX_CallDetailRecords1 ON [dbo].[CallDetailRecords](

    [RawData]

    ) ON [PRIMARY]

    ;

    GO

    -- Duplicate Unbilled Calls (one should be removed)

    INSERT INTO [CallDetailRecords] ( [Id],[RawData],[ErrorText],[Rated],[Billed] ) VALUES ( NEWID(),'1234567890',NULL,1,0 );

    INSERT INTO [CallDetailRecords] ( [Id],[RawData],[ErrorText],[Rated],[Billed] ) VALUES ( NEWID(),'1234567890',NULL,1,0 );

    -- Duplicate Billed Calls (neither one should be removed)

    INSERT INTO [CallDetailRecords] ( [Id],[RawData],[ErrorText],[Rated],[Billed] ) VALUES ( NEWID(),'9876543210',NULL,1,1 );

    INSERT INTO [CallDetailRecords] ( [Id],[RawData],[ErrorText],[Rated],[Billed] ) VALUES ( NEWID(),'9876543210',NULL,1,1 );

    -- Mixed duplicates. All unbilled calls should be removed.

    INSERT INTO [CallDetailRecords] ( [Id],[RawData],[ErrorText],[Rated],[Billed] ) VALUES ( NEWID(),'5555555555',NULL,1,0 );

    INSERT INTO [CallDetailRecords] ( [Id],[RawData],[ErrorText],[Rated],[Billed] ) VALUES ( NEWID(),'5555555555',NULL,1,0 );

    INSERT INTO [CallDetailRecords] ( [Id],[RawData],[ErrorText],[Rated],[Billed] ) VALUES ( NEWID(),'5555555555',NULL,1,1 );

    select

    *

    from

    dbo.CallDetailRecords

    ;

    declare @RawData varchar(300),

    @ErrorText varchar(200);

    set @RawData = '';

    update dbo.CallDetailRecords set

    @ErrorText = ErrorText = case when @RawData = cdr.RawData and cdr.Billed = 0 then 'Duplicate Call' else cdr.ErrorText end,

    @RawData = cdr.RawData

    from

    dbo.CallDetailRecords cdr with (Index = 1)

    ;

    select

    *

    from

    dbo.CallDetailRecords

    ;

    drop table dbo.CallDetailRecords

    ;

  • If you modify the clustered index to include the BILLED flag as DESC, the code will once again work.

    /****** Object: Table [dbo].[CallDetailRecords] Script Date: 04/27/2009 12:10:02 ******/

    SET ANSI_NULLS ON

    GO

    SET QUOTED_IDENTIFIER ON

    GO

    CREATE TABLE [dbo].[CallDetailRecords](

    [Id] [uniqueidentifier] NOT NULL,

    [RawData] [varchar](300) NOT NULL,

    [ErrorText] [varchar](200) NULL,

    [Rated] [bit] NOT NULL,

    [Billed] [bit] NULL

    ) ON [PRIMARY]

    GO

    CREATE CLUSTERED INDEX IX_CallDetailRecords1 ON [dbo].[CallDetailRecords](

    [RawData],

    [Billed] DESC

    ) ON [PRIMARY]

    ;

    GO

    -- Duplicate Unbilled Calls (one should be removed)

    INSERT INTO [CallDetailRecords] ( [Id],[RawData],[ErrorText],[Rated],[Billed] ) VALUES ( NEWID(),'1234567890',NULL,1,0 );

    INSERT INTO [CallDetailRecords] ( [Id],[RawData],[ErrorText],[Rated],[Billed] ) VALUES ( NEWID(),'1234567890',NULL,1,0 );

    -- Duplicate Billed Calls (neither one should be removed)

    INSERT INTO [CallDetailRecords] ( [Id],[RawData],[ErrorText],[Rated],[Billed] ) VALUES ( NEWID(),'9876543210',NULL,1,1 );

    INSERT INTO [CallDetailRecords] ( [Id],[RawData],[ErrorText],[Rated],[Billed] ) VALUES ( NEWID(),'9876543210',NULL,1,1 );

    -- Mixed duplicates. All unbilled calls should be removed.

    INSERT INTO [CallDetailRecords] ( [Id],[RawData],[ErrorText],[Rated],[Billed] ) VALUES ( NEWID(),'5555555555',NULL,1,0 );

    INSERT INTO [CallDetailRecords] ( [Id],[RawData],[ErrorText],[Rated],[Billed] ) VALUES ( NEWID(),'5555555555',NULL,1,0 );

    INSERT INTO [CallDetailRecords] ( [Id],[RawData],[ErrorText],[Rated],[Billed] ) VALUES ( NEWID(),'5555555555',NULL,1,1 );

    select

    *

    from

    dbo.CallDetailRecords

    ;

    declare @RawData varchar(300),

    @ErrorText varchar(200);

    set @RawData = '';

    update dbo.CallDetailRecords set

    @ErrorText = ErrorText = case when @RawData = cdr.RawData and cdr.Billed = 0 then 'Duplicate Call' else cdr.ErrorText end,

    @RawData = cdr.RawData

    from

    dbo.CallDetailRecords cdr with (Index = 1)

    ;

    select

    *

    from

    dbo.CallDetailRecords

    ;

    drop table dbo.CallDetailRecords

    ;

  • Bob Hovious (4/27/2009)


    I provided the trivial data, but mine still works if they aren't in sequence. That was omitted from the original specs, so I didn't want to assume it away. 🙂

    Actually the OP provided data in this post.

    I think you missed as you both posted at the same time.

  • Bob Hovious (4/27/2009)


    I provided the trivial data, but mine still works if they aren't in sequence. That was omitted from the original specs, so I didn't want to assume it away. 🙂

    This why I think this is a trivial example. It is possible, but I really think there is more to identifying the duplicates for this problem. we just don't have all the facts.

  • Thank you both very much for your solutions. I will definitely give both of them a try on larger data samples.

    As for row sequencing, is it not invalid to assume anything about the order of the rows in a table unless an index or "order by" is specified?

  • Scott Roberts (4/27/2009)


    Thank you both very much for your solutions. I will definitely give both of them a try on larger data samples.

    As for row sequencing, is it not invalid to assume anything about the order of the rows in a table unless an index or "order by" is specified?

    Yes, and even if an index is present. What I gave you ONLY works when doing an UPDATE and the CLUSTERED INDEX exists on the data in the order it is processed. Jeff moden would also tell you to include the TABLOCKX hint along woth the index hint for the clustered index (INDEX 1,TABLOCKX).

    As I don't have SQL Server 2008, I have not been able to test this on that platform as yet, but it works on SQL Server 2000/2005.

  • Lynn: Wait... are we disagreeing about something?

    If so, don't mind us, Scott. Sidebar discussions are the only payment a volunteer ever gets.

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • Bob Hovious (4/27/2009)


    Lynn: Wait... are we disagreeing about something?

    I don't think so. I think we are just confusing each other.

    If you are talking about the trivial data, most definately NOT. The OP posted data at the same time you did. I used the data provided by the OP, not you. It is the OP's data I was also calling trivial.

    As stated in later posts, i personally think there is more to the DEDUPE requirements, and based on the what the OP provided, we both have provided a solution that works.

  • Bob Hovious (4/27/2009)


    Sidebar discussions are the only payment a volunteer ever gets.

    And hopefully an occassional "thank you". Although I'd be obliged to agree if you don't consider that much of a "payment". 🙂

  • Lynn Pettis (4/27/2009)


    As stated in later posts, i personally think there is more to the DEDUPE requirements, and based on the what the OP provided, we both have provided a solution that works.

    For the sake of completeness, what additional requirements do you suspect exist? I'd like to be as clear as possible in case others wish to offer additional ideas.

  • Scott Roberts (4/27/2009)


    Lynn Pettis (4/27/2009)


    As stated in later posts, i personally think there is more to the DEDUPE requirements, and based on the what the OP provided, we both have provided a solution that works.

    For the sake of completeness, what additional requirements do you suspect exist? I'd like to be as clear as possible in case others wish to offer additional ideas.

    Not sure, but the fact that you are apparently doing this for call records, I'd assume that there would also be date/time values, elapsed time values, from/to phone numbers, account numbers, calling periods (peak/non-peak), and who knows what else.

    This is the type of info I'd want to see on my invoices so I could dispute any calls, or (if a business) make sure I get reimbursed by employees for personal calls or allocate costs to appropriate departments.

  • Lynn Pettis (4/27/2009)


    Not sure, but the fact that you are apparently doing this for call records, I'd assume that there would also be date/time values, elapsed time values, from/to phone numbers, account numbers, calling periods (peak/non-peak), and who knows what else.

    Gotcha.

    Recall from the very first post that the "RawData" field contains the entire fixed-length record received from the 3rd party. This includes the call date/time, call length, from number, to number, etc. All of the information necessary for detecting "duplicate calls" is included in the "RawData" field.

    Thanks again for your help. I'll try to post some results of the two new solutions later.

  • "Thank you"s are indeed payment. I stand corrected.

    You are welcome, Scott. We'll look forward to hearing from you.

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • Scott Roberts (4/27/2009)


    While I would completely agree that cursors are "significantly harmful", I'm not yet convinced that they are "no longer necessary".

    Skepticism is certainly understood, especially on the types of CDR's that get sent in by the "big man" (used to be ATT). Not only do you get dupes in the same file, but they delay the first CDR's for new customers for 90 to 120 days (120 days is too late to bill in some states) and then you get all their CDR's for for the last 3 to 4 months which really ticks the customer off.

    Worse yet, you may get duplicated CDR's for the same call but they're in separate files.

    Can you tell I've been there on this one? 😀

    1. Duplicate rows where all of them are "unbilled". All duplicates should be removed and one row should remain.

    2. Duplicate rows where one (or more) of them is "billed" and the others are "unbilled". All of the "unbilled" duplicate rows should be removed.

    3. Duplicate rows where all of them are "billed" (shouldn't happen, but what if?). No rows should be removed.

    Proverbial piece o' cake... first, let's try it with your sample data. The big mistake people sometimes make is trying to do it all in the same query...


    [font="Courier New"]--drop table [CallDetailRecords]

    go

    /****** Object:  Table [dbo].[CallDetailRecords]    Script Date: 04/27/2009 12:10:02 ******/

    SET ANSI_NULLS ON

    GO

    CREATE TABLE [dbo].[CallDetailRecords](

    [Id] [uniqueidentifier] NOT NULL,

    [RawData] [varchar](5000) NOT NULL,

    [ErrorText] [varchar](200) NULL,

    [Rated] [bit] NOT NULL,

    [Billed] [bit] NULL

    ON [PRIMARY]

    GO

    SET ANSI_PADDING OFF

    -- Duplicate Unbilled Calls (one should be removed) 

    INSERT INTO [CallDetailRecords] [Id],[RawData],[ErrorText],[Rated],[Billed] VALUES NEWID(),'1234567890',NULL,1,

    INSERT INTO [CallDetailRecords] [Id],[RawData],[ErrorText],[Rated],[Billed] VALUES NEWID(),'1234567890',NULL,1,

    -- Duplicate Billed Calls (neither one should be removed) 

    INSERT INTO [CallDetailRecords] [Id],[RawData],[ErrorText],[Rated],[Billed] VALUES NEWID(),'9876543210',NULL,1,

    INSERT INTO [CallDetailRecords] [Id],[RawData],[ErrorText],[Rated],[Billed] VALUES NEWID(),'9876543210',NULL,1,

    -- Mixed duplicates. All unbilled calls should be removed. 

    INSERT INTO [CallDetailRecords] [Id],[RawData],[ErrorText],[Rated],[Billed] VALUES NEWID(),'5555555555',NULL,1,

    INSERT INTO [CallDetailRecords] [Id],[RawData],[ErrorText],[Rated],[Billed] VALUES NEWID(),'5555555555',NULL,1,

    INSERT INTO [CallDetailRecords] [Id],[RawData],[ErrorText],[Rated],[Billed] VALUES NEWID(),'5555555555',NULL,1,)

    --===== Mark duplicate Unbilled no matter what

         -- All duplicated data will be contained in RawData

     UPDATE cdr1

        SET ErrorText 'Duplicated Call'

       FROM dbo.CallDetailRecords cdr1,

            dbo.CallDetailRecords cdr2

      WHERE cdr1.RawData cdr2.RawData

        AND cdr1.Rated cdr2.Rated

        AND cdr1.Billed 0

        AND cdr2.Billed 0

        AND cdr1.ID cdr2.ID

    --===== Mark Unbilled if billed exists

         -- All duplicated data will be contained in RawData

         -- Notice that we're not including rows already "marked"

     UPDATE cdr1

        SET ErrorText 'Duplicated Call'

       FROM dbo.CallDetailRecords cdr1,

            dbo.CallDetailRecords cdr2

      WHERE cdr1.RawData cdr2.RawData

        AND ISNULL(cdr1.ErrorText,0ISNULL(cdr2.ErrorText,0)

        AND cdr1.Rated   cdr2.Rated

        AND cdr1.Billed  0

        AND cdr2.Billed  1

    SELECT FROM dbo.CallDetailRecords

    [/font]


    Now, the problem is that no matter which method you use, this is going to be dreadfully slow because there's no chance of a decent index to drive the code. Having done this before, I'd say that you need to import the data to a staging table in the separate columns the data actually belongs in. Then, you can put a nice clustered index on things like FromNum, ToNum, Date, Time, and maybe CallType and Duration.

    If you still want to play with the idea of using the whole CDR as a comparison, here's a million row test table with some dupes for you to play with.


    [font="Courier New"]--===== Pick a nice, safe place to play

        USE TempDB

    --===== Conditionally drop the test table for reruns

         IF OBJECT_ID('dbo.CallDetailRecords'IS NOT NULL

            DROP TABLE dbo.CallDetailRecords

    --===== Create the table with the correct datatypes

         -- and fill it with a million rows on the fly

     SELECT TOP 1000000

            ID        NEWID(),

            RawData   REPLICATE(CAST(NEWID() AS CHAR(36)),10),

            ErrorText CAST(NULL AS VARCHAR(200)),

            Rated     CAST(AS BIT),

            Billed    CAST(ABS(CHECKSUM(NEWID()))%AS BIT)

       INTO dbo.CallDetailRecords

       FROM Master.dbo.SysComments      sc1

      CROSS JOIN Master.dbo.SysComments sc2

    --TRUST ME... DON'T ADD ANY INDEXES YET (if any, you can't index a VARCHAR(5000))

    --===== Add some randomized "duplicates"

         -- (TRUST ME... YOU DON'T WANT TO DO AN ORDER BY HERE)

     INSERT INTO dbo.CallDetailRecords

            (ID,RawData,ErrorText,Rated,Billed)

     SELECT TOP 10000

            ID=NEWID(),RawData,ErrorText,Rated,Billed=CAST(ABS(CHECKSUM(NEWID()))%AS BIT)

       FROM dbo.CallDetailRecords

     

    --===== Add some more randomized "duplicates", hopefully, from the same batch as above

         -- (TRUST ME... YOU DON'T WANT TO DO AN ORDER BY HERE)

     INSERT INTO dbo.CallDetailRecords

            (ID,RawData,ErrorText,Rated,Billed)

     SELECT TOP 5000

            ID=NEWID(),RawData,ErrorText,Rated,Billed=CAST(ABS(CHECKSUM(NEWID()))%AS BIT)

       FROM dbo.CallDetailRecords[/font]


    Just a parting thought. I recommended working on a staging table with separate columns and indexes on the couple of columns that actually determine a dupe CDR. The recommendation is not made lightly. Using the method I posted, I dupe check across 31 sets (1 for each day of the month) of 3 databases and a "recycle" database for a total of (upto but usually 1 or 2 less depending on the months) 94 databases with about 4 million rows in each database... in about 11 minutes.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 15 posts - 16 through 30 (of 41 total)

You must be logged in to reply to this topic. Login to reply