How to loop through dataset to identify and flag duplicates

  • Hi,

    I have a job on to identify duplicate leads in a single dataset based on whether or not the creationdatetime of a record is within 30 seconds of the previous record.

    I have attempted to split out into grouped sub sets but can't get it work. I think I am left with either a cursor or building a loop based ssis package. The criteria I have to follow is simply:

    If the next ID's media code and phone number match the previous and the creationdatetime is less than 30 seconds, then flag as duplicate.

    Any help greatly appreciated....

  • Hi, welcome to SSC, if you would take a moment to read the second link in my signature block on posting questions it would help you get an answer quicker, but as this seems like a simple data problem it isn't to hard to recreate some play data.

    Based on that I believe the problem is, the below should satisfy your problem.

    DECLARE @Table TABLE (MediaCode INT, PhoneNumber CHAR(11), CreationDateTime DATETIME)

    INSERT INTO @Table VALUES

    (1,'01234567890','2012-01-01 00:00:00'),

    (1,'01234567890','2012-01-01 00:00:30'),--Flag as duplicate

    (1,'01234567890','2012-01-01 00:10:00'),

    (1,'01234567890','2012-01-01 00:10:30'),--Flag as duplicate

    (1,'01234567890','2012-01-01 00:20:00'),

    (1,'01234567890','2012-01-01 00:20:30')--Flag as duplicate

    select distinct

    t2.*,

    CASE WHEN DATEDIFF(SECOND,t1.CreationDateTime,t2.CreationDateTime) <= 30 THEN 'Duplicate' ELSE 'NotDuplicate' END AS Duplicate

    from

    @Table t1

    inner join

    @Table t2

    on

    t1.MediaCode = t2.MediaCode

    and

    t1.PhoneNumber = t2.PhoneNumber

    and

    t1.CreationDateTime < t2.CreationDateTime

    where

    DATEDIFF(SECOND,t1.CreationDateTime,t2.CreationDateTime) <= 30

  • Thanks for the quick response!

    Point taken on getting quick answers through provision of useful data, below is a script to provide an actual example of the issue I am having I have also revised my original criteria:

    If the next ID's media code and phone number and createddate match the previous and the CREATIONDATETIME is less than 30 seconds, then flag as duplicate.

    CREATE TABLE #Deduplicate(

    ID int,

    CreatedDateTime datetime,

    CreatedDate datetime,

    Media varchar(25),

    ContactPhone varchar(25)

    )

    Insert into #Deduplicate

    (ID,CreatedDateTime,CreatedDate,Media,ContactPhone)

    Select 1,'Jan 01 2012 07:00:05AM','Jan 01 2012 00:00AM','TV','Anonymous' Union ALL

    Select 2,'Jan 01 2012 07:00:10AM','Jan 01 2012 00:00AM','TV','Anonymous' Union ALL

    Select 3,'Jan 01 2012 07:00:15AM','Jan 01 2012 00:00AM','TV','Anonymous' Union ALL

    Select 4,'Jan 01 2012 07:00:20AM','Jan 01 2012 00:00AM','TV','Anonymous' Union ALL

    Select 5,'Jan 01 2012 07:00:25AM','Jan 01 2012 00:00AM','TV','Anonymous' Union ALL

    I think your solution still works using this data as a guide? Can you confirm?

    Thanks again!

  • yes my original solution will still work, my solution will return ID's 2,3,4,5 as duplicates.

  • That's great,

    Just tested and it works exactly as you explained.

    Many thanks!

Viewing 5 posts - 1 through 5 (of 5 total)

You must be logged in to reply to this topic. Login to reply