Detect duplicate rows - with a twist

Question

Detect duplicate rows - with a twist

Viewing 15 posts - 16 through 30 (of 41 total)

You must be logged in to reply to this topic. Login to reply

The Dixie Flatline SSC Guru Points: 53354 More actions · Answer 1

I provided the trivial data, but mine still works if they aren't in sequence. That was omitted from the original specs, so I didn't want to assume it away. 🙂

__________________________________________________

Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

Lynn Pettis SSC Guru Points: 442467 More actions · Answer 2

Made a simple change to the data and broke my own code. This requires additional information regarding he requirements. If the data is inserted as specified by the OP, it works. If you change the order of the data such that you have two unbilled records entered followed by a third billed record for the same "rawdata", the first unbilled record is not flagged as a duplicate.

Here is the modified test code:

/****** Object: Table [dbo].[CallDetailRecords] Script Date: 04/27/2009 12:10:02 ******/

SET ANSI_NULLS ON

GO

SET QUOTED_IDENTIFIER ON

GO

CREATE TABLE [dbo].[CallDetailRecords](

[Id] [uniqueidentifier] NOT NULL,

[RawData] [varchar](300) NOT NULL,

[ErrorText] [varchar](200) NULL,

[Rated] [bit] NOT NULL,

[Billed] [bit] NULL

) ON [PRIMARY]

GO

CREATE CLUSTERED INDEX IX_CallDetailRecords1 ON [dbo].[CallDetailRecords](

[RawData]

) ON [PRIMARY]

;

GO

-- Duplicate Unbilled Calls (one should be removed)

INSERT INTO [CallDetailRecords] ( [Id],[RawData],[ErrorText],[Rated],[Billed] ) VALUES ( NEWID(),'1234567890',NULL,1,0 );

-- Duplicate Billed Calls (neither one should be removed)

INSERT INTO [CallDetailRecords] ( [Id],[RawData],[ErrorText],[Rated],[Billed] ) VALUES ( NEWID(),'9876543210',NULL,1,1 );

-- Mixed duplicates. All unbilled calls should be removed.

INSERT INTO [CallDetailRecords] ( [Id],[RawData],[ErrorText],[Rated],[Billed] ) VALUES ( NEWID(),'5555555555',NULL,1,0 );

INSERT INTO [CallDetailRecords] ( [Id],[RawData],[ErrorText],[Rated],[Billed] ) VALUES ( NEWID(),'5555555555',NULL,1,1 );

select

*

from

dbo.CallDetailRecords

;

declare @RawData varchar(300),

@ErrorText varchar(200);

set @RawData = '';

update dbo.CallDetailRecords set

@ErrorText = ErrorText = case when @RawData = cdr.RawData and cdr.Billed = 0 then 'Duplicate Call' else cdr.ErrorText end,

@RawData = cdr.RawData

from

dbo.CallDetailRecords cdr with (Index = 1)

;

select

*

from

dbo.CallDetailRecords

;

drop table dbo.CallDetailRecords

;

Lynn Pettis SSC Guru Points: 442467 More actions · Answer 3

If you modify the clustered index to include the BILLED flag as DESC, the code will once again work.

/****** Object: Table [dbo].[CallDetailRecords] Script Date: 04/27/2009 12:10:02 ******/

SET ANSI_NULLS ON

GO

SET QUOTED_IDENTIFIER ON

GO

CREATE TABLE [dbo].[CallDetailRecords](

[Id] [uniqueidentifier] NOT NULL,

[RawData] [varchar](300) NOT NULL,

[ErrorText] [varchar](200) NULL,

[Rated] [bit] NOT NULL,

[Billed] [bit] NULL

) ON [PRIMARY]

GO

CREATE CLUSTERED INDEX IX_CallDetailRecords1 ON [dbo].[CallDetailRecords](

[RawData],

[Billed] DESC

) ON [PRIMARY]

;

GO

-- Duplicate Unbilled Calls (one should be removed)

INSERT INTO [CallDetailRecords] ( [Id],[RawData],[ErrorText],[Rated],[Billed] ) VALUES ( NEWID(),'1234567890',NULL,1,0 );

-- Duplicate Billed Calls (neither one should be removed)

INSERT INTO [CallDetailRecords] ( [Id],[RawData],[ErrorText],[Rated],[Billed] ) VALUES ( NEWID(),'9876543210',NULL,1,1 );

-- Mixed duplicates. All unbilled calls should be removed.

INSERT INTO [CallDetailRecords] ( [Id],[RawData],[ErrorText],[Rated],[Billed] ) VALUES ( NEWID(),'5555555555',NULL,1,0 );

INSERT INTO [CallDetailRecords] ( [Id],[RawData],[ErrorText],[Rated],[Billed] ) VALUES ( NEWID(),'5555555555',NULL,1,1 );

select

*

from

dbo.CallDetailRecords

;

declare @RawData varchar(300),

@ErrorText varchar(200);

set @RawData = '';

update dbo.CallDetailRecords set

@ErrorText = ErrorText = case when @RawData = cdr.RawData and cdr.Billed = 0 then 'Duplicate Call' else cdr.ErrorText end,

@RawData = cdr.RawData

from

dbo.CallDetailRecords cdr with (Index = 1)

;

select

*

from

dbo.CallDetailRecords

;

drop table dbo.CallDetailRecords

;

Lynn Pettis SSC Guru Points: 442467 More actions · Answer 4

Bob Hovious (4/27/2009)
I provided the trivial data, but mine still works if they aren't in sequence. That was omitted from the original specs, so I didn't want to assume it away. 🙂

Actually the OP provided data in this post.

I think you missed as you both posted at the same time.

Lynn Pettis SSC Guru Points: 442467 More actions · Answer 5

Bob Hovious (4/27/2009)
I provided the trivial data, but mine still works if they aren't in sequence. That was omitted from the original specs, so I didn't want to assume it away. 🙂

This why I think this is a trivial example. It is possible, but I really think there is more to identifying the duplicates for this problem. we just don't have all the facts.

Scott Roberts-430649 Mr or Mrs. 500 Points: 560 More actions · Answer 6

Thank you both very much for your solutions. I will definitely give both of them a try on larger data samples.

As for row sequencing, is it not invalid to assume anything about the order of the rows in a table unless an index or "order by" is specified?

Lynn Pettis SSC Guru Points: 442467 More actions · Answer 7

Scott Roberts (4/27/2009)
Thank you both very much for your solutions. I will definitely give both of them a try on larger data samples.
As for row sequencing, is it not invalid to assume anything about the order of the rows in a table unless an index or "order by" is specified?

Yes, and even if an index is present. What I gave you ONLY works when doing an UPDATE and the CLUSTERED INDEX exists on the data in the order it is processed. Jeff moden would also tell you to include the TABLOCKX hint along woth the index hint for the clustered index (INDEX 1,TABLOCKX).

As I don't have SQL Server 2008, I have not been able to test this on that platform as yet, but it works on SQL Server 2000/2005.

The Dixie Flatline SSC Guru Points: 53354 More actions · Answer 8

Lynn: Wait... are we disagreeing about something?

If so, don't mind us, Scott. Sidebar discussions are the only payment a volunteer ever gets.

__________________________________________________

Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

Lynn Pettis SSC Guru Points: 442467 More actions · Answer 9

Bob Hovious (4/27/2009)
Lynn: Wait... are we disagreeing about something?

I don't think so. I think we are just confusing each other.

If you are talking about the trivial data, most definately NOT. The OP posted data at the same time you did. I used the data provided by the OP, not you. It is the OP's data I was also calling trivial.

As stated in later posts, i personally think there is more to the DEDUPE requirements, and based on the what the OP provided, we both have provided a solution that works.

Scott Roberts-430649 Mr or Mrs. 500 Points: 560 More actions · Answer 10

Bob Hovious (4/27/2009)
Sidebar discussions are the only payment a volunteer ever gets.

And hopefully an occassional "thank you". Although I'd be obliged to agree if you don't consider that much of a "payment". 🙂

Scott Roberts-430649 Mr or Mrs. 500 Points: 560 More actions · Answer 11

Lynn Pettis (4/27/2009)
As stated in later posts, i personally think there is more to the DEDUPE requirements, and based on the what the OP provided, we both have provided a solution that works.

For the sake of completeness, what additional requirements do you suspect exist? I'd like to be as clear as possible in case others wish to offer additional ideas.

Lynn Pettis SSC Guru Points: 442467 More actions · Answer 12

Scott Roberts (4/27/2009)
Lynn Pettis (4/27/2009)
As stated in later posts, i personally think there is more to the DEDUPE requirements, and based on the what the OP provided, we both have provided a solution that works.
For the sake of completeness, what additional requirements do you suspect exist? I'd like to be as clear as possible in case others wish to offer additional ideas.

Not sure, but the fact that you are apparently doing this for call records, I'd assume that there would also be date/time values, elapsed time values, from/to phone numbers, account numbers, calling periods (peak/non-peak), and who knows what else.

This is the type of info I'd want to see on my invoices so I could dispute any calls, or (if a business) make sure I get reimbursed by employees for personal calls or allocate costs to appropriate departments.

Scott Roberts-430649 Mr or Mrs. 500 Points: 560 More actions · Answer 13

Lynn Pettis (4/27/2009)
Not sure, but the fact that you are apparently doing this for call records, I'd assume that there would also be date/time values, elapsed time values, from/to phone numbers, account numbers, calling periods (peak/non-peak), and who knows what else.

Gotcha.

Recall from the very first post that the "RawData" field contains the entire fixed-length record received from the 3rd party. This includes the call date/time, call length, from number, to number, etc. All of the information necessary for detecting "duplicate calls" is included in the "RawData" field.

Thanks again for your help. I'll try to post some results of the two new solutions later.

The Dixie Flatline SSC Guru Points: 53354 More actions · Answer 14

"Thank you"s are indeed payment. I stand corrected.

You are welcome, Scott. We'll look forward to hearing from you.

__________________________________________________

Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

Jeff Moden SSC Guru Points: 1004664 More actions · Answer 15

Scott Roberts (4/27/2009)
While I would completely agree that cursors are "significantly harmful", I'm not yet convinced that they are "no longer necessary".

Skepticism is certainly understood, especially on the types of CDR's that get sent in by the "big man" (used to be ATT). Not only do you get dupes in the same file, but they delay the first CDR's for new customers for 90 to 120 days (120 days is too late to bill in some states) and then you get all their CDR's for for the last 3 to 4 months which really ticks the customer off.

Worse yet, you may get duplicated CDR's for the same call but they're in separate files.

Can you tell I've been there on this one? 😀

1. Duplicate rows where all of them are "unbilled". All duplicates should be removed and one row should remain.
2. Duplicate rows where one (or more) of them is "billed" and the others are "unbilled". All of the "unbilled" duplicate rows should be removed.
3. Duplicate rows where all of them are "billed" (shouldn't happen, but what if?). No rows should be removed.

Proverbial piece o' cake... first, let's try it with your sample data. The big mistake people sometimes make is trying to do it all in the same query...

[font="Courier New"]--drop table [CallDetailRecords]

go

/****** Object: Table [dbo].[CallDetailRecords] Script Date: 04/27/2009 12:10:02 ******/

SET ANSI_NULLS ON

GO

CREATE TABLE [dbo].[CallDetailRecords](

[Id] [uniqueidentifier] NOT NULL,

[RawData] [varchar](5000) NOT NULL,

[ErrorText] [varchar](200) NULL,

[Rated] [bit] NOT NULL,

[Billed] [bit] NULL

) ON [PRIMARY]

GO

SET ANSI_PADDING OFF

-- Duplicate Unbilled Calls (one should be removed)

INSERT INTO [CallDetailRecords] ( [Id],[RawData],[ErrorText],[Rated],[Billed] ) VALUES ( NEWID(),'1234567890',NULL,1,0 )

-- Duplicate Billed Calls (neither one should be removed)

INSERT INTO [CallDetailRecords] ( [Id],[RawData],[ErrorText],[Rated],[Billed] ) VALUES ( NEWID(),'9876543210',NULL,1,1 )

-- Mixed duplicates. All unbilled calls should be removed.

INSERT INTO [CallDetailRecords] ( [Id],[RawData],[ErrorText],[Rated],[Billed] ) VALUES ( NEWID(),'5555555555',NULL,1,1 )

INSERT INTO [CallDetailRecords] ( [Id],[RawData],[ErrorText],[Rated],[Billed] ) VALUES ( NEWID(),'5555555555',NULL,1,0 )

--===== Mark duplicate Unbilled no matter what

-- All duplicated data will be contained in RawData

UPDATE cdr1

SET ErrorText = 'Duplicated Call'

FROM dbo.CallDetailRecords cdr1,

dbo.CallDetailRecords cdr2

WHERE cdr1.RawData = cdr2.RawData

AND cdr1.Rated = cdr2.Rated

AND cdr1.Billed = 0

AND cdr2.Billed = 0

AND cdr1.ID < cdr2.ID

--===== Mark Unbilled if billed exists

-- All duplicated data will be contained in RawData

-- Notice that we're not including rows already "marked"

UPDATE cdr1

SET ErrorText = 'Duplicated Call'

FROM dbo.CallDetailRecords cdr1,

dbo.CallDetailRecords cdr2

WHERE cdr1.RawData = cdr2.RawData

AND ISNULL(cdr1.ErrorText,0) = ISNULL(cdr2.ErrorText,0)

AND cdr1.Rated = cdr2.Rated

AND cdr1.Billed = 0

AND cdr2.Billed = 1

SELECT * FROM dbo.CallDetailRecords

[/font]

Now, the problem is that no matter which method you use, this is going to be dreadfully slow because there's no chance of a decent index to drive the code. Having done this before, I'd say that you need to import the data to a staging table in the separate columns the data actually belongs in. Then, you can put a nice clustered index on things like FromNum, ToNum, Date, Time, and maybe CallType and Duration.

If you still want to play with the idea of using the whole CDR as a comparison, here's a million row test table with some dupes for you to play with.

[font="Courier New"]--===== Pick a nice, safe place to play

USE TempDB

--===== Conditionally drop the test table for reruns

IF OBJECT_ID('dbo.CallDetailRecords') IS NOT NULL

DROP TABLE dbo.CallDetailRecords

--===== Create the table with the correct datatypes

-- and fill it with a million rows on the fly

SELECT TOP 1000000

ID = NEWID(),

RawData = REPLICATE(CAST(NEWID() AS CHAR(36)),10),

ErrorText = CAST(NULL AS VARCHAR(200)),

Rated = CAST(1 AS BIT),

Billed = CAST(ABS(CHECKSUM(NEWID()))%2 AS BIT)

INTO dbo.CallDetailRecords

FROM Master.dbo.SysComments sc1

CROSS JOIN Master.dbo.SysComments sc2

--TRUST ME... DON'T ADD ANY INDEXES YET (if any, you can't index a VARCHAR(5000))

--===== Add some randomized "duplicates"

-- (TRUST ME... YOU DON'T WANT TO DO AN ORDER BY HERE)

INSERT INTO dbo.CallDetailRecords

(ID,RawData,ErrorText,Rated,Billed)

SELECT TOP 10000

ID=NEWID(),RawData,ErrorText,Rated,Billed=CAST(ABS(CHECKSUM(NEWID()))%2 AS BIT)

FROM dbo.CallDetailRecords

--===== Add some more randomized "duplicates", hopefully, from the same batch as above

-- (TRUST ME... YOU DON'T WANT TO DO AN ORDER BY HERE)

INSERT INTO dbo.CallDetailRecords

(ID,RawData,ErrorText,Rated,Billed)

SELECT TOP 5000

ID=NEWID(),RawData,ErrorText,Rated,Billed=CAST(ABS(CHECKSUM(NEWID()))%2 AS BIT)

FROM dbo.CallDetailRecords[/font]

Just a parting thought. I recommended working on a staging table with separate columns and indexes on the couple of columns that actually determine a dupe CDR. The recommendation is not made lightly. Using the method I posted, I dupe check across 31 sets (1 for each day of the month) of 3 databases and a "recycle" database for a total of (upto but usually 1 or 2 less depending on the months) 94 databases with about 4 million rows in each database... in about 11 minutes.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)