April 27, 2009 at 1:43 pm
I provided the trivial data, but mine still works if they aren't in sequence. That was omitted from the original specs, so I didn't want to assume it away. 🙂
__________________________________________________
Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills
April 27, 2009 at 1:45 pm
Made a simple change to the data and broke my own code. This requires additional information regarding he requirements. If the data is inserted as specified by the OP, it works. If you change the order of the data such that you have two unbilled records entered followed by a third billed record for the same "rawdata", the first unbilled record is not flagged as a duplicate.
Here is the modified test code:
/****** Object: Table [dbo].[CallDetailRecords] Script Date: 04/27/2009 12:10:02 ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE TABLE [dbo].[CallDetailRecords](
[Id] [uniqueidentifier] NOT NULL,
[RawData] [varchar](300) NOT NULL,
[ErrorText] [varchar](200) NULL,
[Rated] [bit] NOT NULL,
[Billed] [bit] NULL
) ON [PRIMARY]
GO
CREATE CLUSTERED INDEX IX_CallDetailRecords1 ON [dbo].[CallDetailRecords](
[RawData]
) ON [PRIMARY]
;
GO
-- Duplicate Unbilled Calls (one should be removed)
INSERT INTO [CallDetailRecords] ( [Id],[RawData],[ErrorText],[Rated],[Billed] ) VALUES ( NEWID(),'1234567890',NULL,1,0 );
INSERT INTO [CallDetailRecords] ( [Id],[RawData],[ErrorText],[Rated],[Billed] ) VALUES ( NEWID(),'1234567890',NULL,1,0 );
-- Duplicate Billed Calls (neither one should be removed)
INSERT INTO [CallDetailRecords] ( [Id],[RawData],[ErrorText],[Rated],[Billed] ) VALUES ( NEWID(),'9876543210',NULL,1,1 );
INSERT INTO [CallDetailRecords] ( [Id],[RawData],[ErrorText],[Rated],[Billed] ) VALUES ( NEWID(),'9876543210',NULL,1,1 );
-- Mixed duplicates. All unbilled calls should be removed.
INSERT INTO [CallDetailRecords] ( [Id],[RawData],[ErrorText],[Rated],[Billed] ) VALUES ( NEWID(),'5555555555',NULL,1,0 );
INSERT INTO [CallDetailRecords] ( [Id],[RawData],[ErrorText],[Rated],[Billed] ) VALUES ( NEWID(),'5555555555',NULL,1,0 );
INSERT INTO [CallDetailRecords] ( [Id],[RawData],[ErrorText],[Rated],[Billed] ) VALUES ( NEWID(),'5555555555',NULL,1,1 );
select
*
from
dbo.CallDetailRecords
;
declare @RawData varchar(300),
@ErrorText varchar(200);
set @RawData = '';
update dbo.CallDetailRecords set
@ErrorText = ErrorText = case when @RawData = cdr.RawData and cdr.Billed = 0 then 'Duplicate Call' else cdr.ErrorText end,
@RawData = cdr.RawData
from
dbo.CallDetailRecords cdr with (Index = 1)
;
select
*
from
dbo.CallDetailRecords
;
drop table dbo.CallDetailRecords
;
April 27, 2009 at 1:47 pm
If you modify the clustered index to include the BILLED flag as DESC, the code will once again work.
/****** Object: Table [dbo].[CallDetailRecords] Script Date: 04/27/2009 12:10:02 ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE TABLE [dbo].[CallDetailRecords](
[Id] [uniqueidentifier] NOT NULL,
[RawData] [varchar](300) NOT NULL,
[ErrorText] [varchar](200) NULL,
[Rated] [bit] NOT NULL,
[Billed] [bit] NULL
) ON [PRIMARY]
GO
CREATE CLUSTERED INDEX IX_CallDetailRecords1 ON [dbo].[CallDetailRecords](
[RawData],
[Billed] DESC
) ON [PRIMARY]
;
GO
-- Duplicate Unbilled Calls (one should be removed)
INSERT INTO [CallDetailRecords] ( [Id],[RawData],[ErrorText],[Rated],[Billed] ) VALUES ( NEWID(),'1234567890',NULL,1,0 );
INSERT INTO [CallDetailRecords] ( [Id],[RawData],[ErrorText],[Rated],[Billed] ) VALUES ( NEWID(),'1234567890',NULL,1,0 );
-- Duplicate Billed Calls (neither one should be removed)
INSERT INTO [CallDetailRecords] ( [Id],[RawData],[ErrorText],[Rated],[Billed] ) VALUES ( NEWID(),'9876543210',NULL,1,1 );
INSERT INTO [CallDetailRecords] ( [Id],[RawData],[ErrorText],[Rated],[Billed] ) VALUES ( NEWID(),'9876543210',NULL,1,1 );
-- Mixed duplicates. All unbilled calls should be removed.
INSERT INTO [CallDetailRecords] ( [Id],[RawData],[ErrorText],[Rated],[Billed] ) VALUES ( NEWID(),'5555555555',NULL,1,0 );
INSERT INTO [CallDetailRecords] ( [Id],[RawData],[ErrorText],[Rated],[Billed] ) VALUES ( NEWID(),'5555555555',NULL,1,0 );
INSERT INTO [CallDetailRecords] ( [Id],[RawData],[ErrorText],[Rated],[Billed] ) VALUES ( NEWID(),'5555555555',NULL,1,1 );
select
*
from
dbo.CallDetailRecords
;
declare @RawData varchar(300),
@ErrorText varchar(200);
set @RawData = '';
update dbo.CallDetailRecords set
@ErrorText = ErrorText = case when @RawData = cdr.RawData and cdr.Billed = 0 then 'Duplicate Call' else cdr.ErrorText end,
@RawData = cdr.RawData
from
dbo.CallDetailRecords cdr with (Index = 1)
;
select
*
from
dbo.CallDetailRecords
;
drop table dbo.CallDetailRecords
;
April 27, 2009 at 1:49 pm
Bob Hovious (4/27/2009)
I provided the trivial data, but mine still works if they aren't in sequence. That was omitted from the original specs, so I didn't want to assume it away. 🙂
Actually the OP provided data in this post.
I think you missed as you both posted at the same time.
April 27, 2009 at 1:58 pm
Bob Hovious (4/27/2009)
I provided the trivial data, but mine still works if they aren't in sequence. That was omitted from the original specs, so I didn't want to assume it away. 🙂
This why I think this is a trivial example. It is possible, but I really think there is more to identifying the duplicates for this problem. we just don't have all the facts.
April 27, 2009 at 2:07 pm
Thank you both very much for your solutions. I will definitely give both of them a try on larger data samples.
As for row sequencing, is it not invalid to assume anything about the order of the rows in a table unless an index or "order by" is specified?
April 27, 2009 at 2:14 pm
Scott Roberts (4/27/2009)
Thank you both very much for your solutions. I will definitely give both of them a try on larger data samples.As for row sequencing, is it not invalid to assume anything about the order of the rows in a table unless an index or "order by" is specified?
Yes, and even if an index is present. What I gave you ONLY works when doing an UPDATE and the CLUSTERED INDEX exists on the data in the order it is processed. Jeff moden would also tell you to include the TABLOCKX hint along woth the index hint for the clustered index (INDEX 1,TABLOCKX).
As I don't have SQL Server 2008, I have not been able to test this on that platform as yet, but it works on SQL Server 2000/2005.
April 27, 2009 at 2:14 pm
Lynn: Wait... are we disagreeing about something?
If so, don't mind us, Scott. Sidebar discussions are the only payment a volunteer ever gets.
__________________________________________________
Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills
April 27, 2009 at 2:19 pm
Bob Hovious (4/27/2009)
Lynn: Wait... are we disagreeing about something?
I don't think so. I think we are just confusing each other.
If you are talking about the trivial data, most definately NOT. The OP posted data at the same time you did. I used the data provided by the OP, not you. It is the OP's data I was also calling trivial.
As stated in later posts, i personally think there is more to the DEDUPE requirements, and based on the what the OP provided, we both have provided a solution that works.
April 27, 2009 at 2:46 pm
Bob Hovious (4/27/2009)
Sidebar discussions are the only payment a volunteer ever gets.
And hopefully an occassional "thank you". Although I'd be obliged to agree if you don't consider that much of a "payment". 🙂
April 27, 2009 at 2:49 pm
Lynn Pettis (4/27/2009)
As stated in later posts, i personally think there is more to the DEDUPE requirements, and based on the what the OP provided, we both have provided a solution that works.
For the sake of completeness, what additional requirements do you suspect exist? I'd like to be as clear as possible in case others wish to offer additional ideas.
April 27, 2009 at 2:55 pm
Scott Roberts (4/27/2009)
Lynn Pettis (4/27/2009)
As stated in later posts, i personally think there is more to the DEDUPE requirements, and based on the what the OP provided, we both have provided a solution that works.For the sake of completeness, what additional requirements do you suspect exist? I'd like to be as clear as possible in case others wish to offer additional ideas.
Not sure, but the fact that you are apparently doing this for call records, I'd assume that there would also be date/time values, elapsed time values, from/to phone numbers, account numbers, calling periods (peak/non-peak), and who knows what else.
This is the type of info I'd want to see on my invoices so I could dispute any calls, or (if a business) make sure I get reimbursed by employees for personal calls or allocate costs to appropriate departments.
April 27, 2009 at 3:28 pm
Lynn Pettis (4/27/2009)
Not sure, but the fact that you are apparently doing this for call records, I'd assume that there would also be date/time values, elapsed time values, from/to phone numbers, account numbers, calling periods (peak/non-peak), and who knows what else.
Gotcha.
Recall from the very first post that the "RawData" field contains the entire fixed-length record received from the 3rd party. This includes the call date/time, call length, from number, to number, etc. All of the information necessary for detecting "duplicate calls" is included in the "RawData" field.
Thanks again for your help. I'll try to post some results of the two new solutions later.
April 27, 2009 at 3:37 pm
"Thank you"s are indeed payment. I stand corrected.
You are welcome, Scott. We'll look forward to hearing from you.
__________________________________________________
Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills
April 27, 2009 at 9:55 pm
Scott Roberts (4/27/2009)
While I would completely agree that cursors are "significantly harmful", I'm not yet convinced that they are "no longer necessary".
Skepticism is certainly understood, especially on the types of CDR's that get sent in by the "big man" (used to be ATT). Not only do you get dupes in the same file, but they delay the first CDR's for new customers for 90 to 120 days (120 days is too late to bill in some states) and then you get all their CDR's for for the last 3 to 4 months which really ticks the customer off.
Worse yet, you may get duplicated CDR's for the same call but they're in separate files.
Can you tell I've been there on this one? 😀
1. Duplicate rows where all of them are "unbilled". All duplicates should be removed and one row should remain.
2. Duplicate rows where one (or more) of them is "billed" and the others are "unbilled". All of the "unbilled" duplicate rows should be removed.
3. Duplicate rows where all of them are "billed" (shouldn't happen, but what if?). No rows should be removed.
Proverbial piece o' cake... first, let's try it with your sample data. The big mistake people sometimes make is trying to do it all in the same query...
)
ON [PRIMARY]GO
SET ANSI_PADDING OFF-- Duplicate Unbilled Calls (one should be removed) INSERT INTO [CallDetailRecords] ( [Id],[RawData],[ErrorText],[Rated],[Billed] ) VALUES ( NEWID(),'1234567890',NULL,1,0 ) INSERT INTO [CallDetailRecords] ( [Id],[RawData],[ErrorText],[Rated],[Billed] ) VALUES ( NEWID(),'1234567890',NULL,1,0 ) -- Duplicate Billed Calls (neither one should be removed) INSERT INTO [CallDetailRecords] ( [Id],[RawData],[ErrorText],[Rated],[Billed] ) VALUES ( NEWID(),'9876543210',NULL,1,1 ) INSERT INTO [CallDetailRecords] ( [Id],[RawData],[ErrorText],[Rated],[Billed] ) VALUES ( NEWID(),'9876543210',NULL,1,1 ) -- Mixed duplicates. All unbilled calls should be removed. INSERT INTO [CallDetailRecords] ( [Id],[RawData],[ErrorText],[Rated],[Billed] ) VALUES ( NEWID(),'5555555555',NULL,1,1 ) INSERT INTO [CallDetailRecords] ( [Id],[RawData],[ErrorText],[Rated],[Billed] ) VALUES ( NEWID(),'5555555555',NULL,1,0 ) INSERT INTO [CallDetailRecords] ( [Id],[RawData],[ErrorText],[Rated],[Billed] ) VALUES ( NEWID(),'5555555555',NULL,1,0 )--===== Mark duplicate Unbilled no matter what-- All duplicated data will be contained in RawData
UPDATE cdr1
SET ErrorText = 'Duplicated Call'
FROM dbo.CallDetailRecords cdr1,
dbo.CallDetailRecords cdr2
WHERE cdr1.RawData = cdr2.RawData
AND cdr1.Rated = cdr2.Rated
AND cdr1.Billed = 0
AND cdr2.Billed = 0
AND cdr1.ID < cdr2.ID--===== Mark Unbilled if billed exists
-- All duplicated data will be contained in RawData
-- Notice that we're not including rows already "marked"
UPDATE cdr1
SET ErrorText = 'Duplicated Call'
FROM dbo.CallDetailRecords cdr1,
dbo.CallDetailRecords cdr2
WHERE cdr1.RawData = cdr2.RawData
AND ISNULL(cdr1.ErrorText,0) = ISNULL(cdr2.ErrorText,0)
AND
cdr1.Rated = cdr2.RatedAND cdr1.Billed = 0
AND cdr2.Billed = 1SELECT * FROM dbo.CallDetailRecords[/font]
Now, the problem is that no matter which method you use, this is going to be dreadfully slow because there's no chance of a decent index to drive the code. Having done this before, I'd say that you need to import the data to a staging table in the separate columns the data actually belongs in. Then, you can put a nice clustered index on things like FromNum, ToNum, Date, Time, and maybe CallType and Duration.
If you still want to play with the idea of using the whole CDR as a comparison, here's a million row test table with some dupes for you to play with.
USE TempDB--===== Conditionally drop the test table for reruns
IF OBJECT_ID('dbo.CallDetailRecords') IS NOT NULL
DROP TABLE dbo.CallDetailRecords--===== Create the table with the correct datatypes
-- and fill it with a million rows on the fly
SELECT TOP 1000000
ID
= NEWID(),RawData = REPLICATE(CAST(NEWID() AS CHAR(36)),10),
ErrorText = CAST(NULL AS VARCHAR(200)),
Rated = CAST(1 AS BIT),
Billed = CAST(ABS(CHECKSUM(NEWID()))%2 AS BIT)
INTO dbo.CallDetailRecords
FROM Master.dbo.SysComments sc1
CROSS JOIN Master.dbo.SysComments sc2--TRUST ME... DON'T ADD ANY INDEXES YET (if any, you can't index a VARCHAR(5000))
--===== Add some randomized "duplicates"
-- (TRUST ME... YOU DON'T WANT TO DO AN ORDER BY HERE)
INSERT INTO dbo.CallDetailRecords
(ID,RawData,ErrorText,Rated,Billed)
SELECT TOP 10000
ID
=NEWID(),RawData,ErrorText,Rated,Billed=CAST(ABS(CHECKSUM(NEWID()))%2 AS BIT)FROM dbo.CallDetailRecords
--===== Add some more randomized "duplicates", hopefully, from the same batch as above
-- (TRUST ME... YOU DON'T WANT TO DO AN ORDER BY HERE)
INSERT INTO dbo.CallDetailRecords
(ID,RawData,ErrorText,Rated,Billed)
SELECT TOP 5000
ID
=NEWID(),RawData,ErrorText,Rated,Billed=CAST(ABS(CHECKSUM(NEWID()))%2 AS BIT)FROM dbo.CallDetailRecords[/font]
Just a parting thought. I recommended working on a staging table with separate columns and indexes on the couple of columns that actually determine a dupe CDR. The recommendation is not made lightly. Using the method I posted, I dupe check across 31 sets (1 for each day of the month) of 3 databases and a "recycle" database for a total of (upto but usually 1 or 2 less depending on the months) 94 databases with about 4 million rows in each database... in about 11 minutes.
--Jeff Moden
Change is inevitable... Change for the better is not.
Viewing 15 posts - 16 through 30 (of 41 total)
You must be logged in to reply to this topic. Login to reply