Detect and update 'duplicates' based on time difference

  • Hello,

    I have written the following script which is designed to update what is deemed as a 'duplicate' based on the time difference between the 'duplicate' and other qualifying data. A duplicate is classed a row that has the same UserKey, ProductKey and DateKey as another row and where the TimeKey is within 300 (seconds) of the other row's TimeKey. I should probably give an example:

    ExtranetLinkKey UserKey ProductKey DateKey TimeKey

    669270 77255 1219 3629 56749

    669289 77255 1219 3629 56900

    669323 77255 1219 3629 57122

    Given the above 3 rows, what should happen is that Row 2 would be classed as a 'duplicate' and would have it's LinkCount set to 0. Row 3 should not be classed as a 'duplicate' because I want it to be compared against the 1st row, not the 2nd row and the difference in TimeKey between the 1st and 3rd rows is > 300.

    What is actually happening is that rows 2 and 3 are being treated as 'duplicates'.

    I have a feeling that the solution involves the use of ROW_NUMBER - I just need a push in the right direction!

    Anyway, here's the script...

    /*

    Script detects duplicate links and updates the LinkCount to 0 where duplicates are found

    Duplicates defined as links that have been submitted for the same user and product within 5 minutes of each other

    Script counts the first links and then sets subdequent links to have an LinkCount of 0

    */

    DECLARE @Msg VARCHAR(255)

    DECLARE @Severity INT

    DECLARE @Subj VARCHAR(255)

    DECLARE @SecondsDifference SMALLINT-- Time difference searched to detect duplicates

    DECLARE @MaxExtranetLinkKey INT

    SET @MaxExtranetLinkKey = (SELECT ExtranetLinkKey

    FROM tmp_ExtranetLinkKey)

    -- Holds details of duplicated links

    DECLARE @Working TABLE

    (ExtranetLinkKey INT NOT NULL PRIMARY KEY,

    DateKey INT NOT NULL,

    UserKey INT NOT NULL,

    ProductKey INT NOT NULL,

    Seconds CHAR(8) NOT NULL,

    TimeKey INT NOT NULL)

    DECLARE @RemoveDups TABLE (ExtranetLinkKey INT) -- Holds ExtranetLinkKeys to be updated.

    SET @SecondsDifference = 300;

    BEGIN TRY

    BEGIN TRAN

    -- Get duplicate links based on user and product

    INSERT INTO @Working (ExtranetLinkKey, DateKey, UserKey, ProductKey, Seconds, TimeKey)

    SELECT el.ExtranetLinkKey, el.DateKey, el.UserKey, el.ProductKey, Seconds, el.TimeKey

    FROM dbo.Fact_ExtranetLinks el

    INNER JOIN Dim_Time t ON el.TimeKey = t.TimeKey

    INNER JOIN Dim_Date d ON el.DateKey = d.DateKey

    INNER JOIN

    (SELECT DateKey, UserKey, ProductKey

    FROM Fact_ExtranetLinks el

    GROUP BY DateKey, UserKey, ProductKey

    HAVING COUNT(*)<>1) AS A ON el.UserKey = A.UserKey AND el.DateKey = A.DateKey AND el.ProductKey = A.ProductKey

    WHERE el.ExtranetLinkKey > @MaxExtranetLinkKey

    -- Join duplicate links to themselves on user and product where the extranetkey is different and the link was submitted within the specified time period i.e. links for the same user and product within the time specified in @SecondsDifference

    INSERT INTO @RemoveDups

    SELECT DISTINCT CAST(ds1.ExtranetLinkKey AS VARCHAR(10))

    FROM @Working ds

    INNER JOIN @Working ds1 ON ds.UserKey=ds1.UserKey AND ds.DateKey=ds1.DateKey AND ds.ProductKey=ds1.ProductKey AND ds.ExtranetLinkKey<>ds1.ExtranetLinkKey

    --AND ds.TimeKey = ds1.TimeKey

    WHERE DATEDIFF(ss, ds.Seconds, ds1.Seconds) BETWEEN 1 AND @SecondsDifference

    UPDATE dbo.Fact_ExtranetLinks

    SET LinkCount = 0

    WHERE ExtranetLinkKey IN

    (SELECT ExtranetLinkKey FROM @RemoveDups)

    AND LinkCount = 1

    COMMIT

    END TRY

    BEGIN CATCH

    IF @@TRANCOUNT<>0

    ROLLBACK TRAN

    SET @Msg = REPLACE(ERROR_MESSAGE(), '.', '') + ' (error no: ' + CAST(ERROR_NUMBER() AS VARCHAR(10)) + ') on line ' + CAST(ERROR_LINE() AS VARCHAR(10)) + CASE WHEN ERROR_PROCEDURE() IS NOT NULL THEN ' in ' + ERROR_PROCEDURE() ELSE '' END

    SET @Severity = ERROR_SEVERITY()

    SET @Subj = 'Failed to update LinkCount for ReAP_MI_ExtranetExcursions_History '

    RAISERROR(@Msg, @Severity , 1)

    EXEC master..sp_sendmail

    @recipients= <recipients list>,

    @subject=@Subj,

    @message=@Msg

    END CATCH

    RETURN

    Thanks

    Lempster

  • I think the way I'd do it would be to join rows based on UserKey, ProductKey and DateKey, with equality tests on those columns, and a join based on the range for the TimeKey, where there's a row that's within 300 seconds before.

    Then add a Row_Number() partitioned by UserKey, ProductKey and DateKey, ordered by TimeKey.

    Delete the ones where the Row_Number() = 1, then repeat as long as @@Rowcount > 0.

    I'd have to see actual table definitions and some data to make sure that would do what's needed, but it's probably the easiest way to accomplish what you need.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • Thanks GSquared,

    Nice to know I was in the right area. 😉

    I must admit I am not very good at using some of the (now not so) new T-SQL functions, so this will force me into it!

    Lempster

  • GSquared (11/11/2009)


    I think the way I'd do it would be to join rows based on UserKey, ProductKey and DateKey, with equality tests on those columns, and a join based on the range for the TimeKey, where there's a row that's within 300 seconds before.

    Hi GSquared,

    Is that not what I've done with my two table variables? The equality-based join in the @Working table variable and the TimeKey range join (ok I've used the Seconds column simply because I find it easier to verify the results by looking at the actual time rather an integer) in the @RemoveDups table variable.

    Then add a Row_Number() partitioned by UserKey, ProductKey and DateKey, ordered by TimeKey.

    Ok, I've modified the script like so:

    -- Join duplicate links to themselves on user and product where the extranetkey is different and the link was submitted within the specified time period i.e. links for the same user and product within the time specified in @SecondsDifference

    INSERT INTO @RemoveDups

    SELECT CAST(ds1.ExtranetLinkKey AS VARCHAR(10)),

    ROW_NUMBER() OVER(PARTITION BY ds.UserKey, ds.ProductKey, ds.DateKey

    ORDER BY ds.TimeKey)

    FROM @Working ds

    INNER JOIN @Working ds1 ON ds.UserKey=ds1.UserKey AND ds.DateKey=ds1.DateKey AND ds.ProductKey=ds1.ProductKey AND ds.ExtranetLinkKey<>ds1.ExtranetLinkKey

    --AND ds.TimeKey = ds1.TimeKey

    WHERE DATEDIFF(ss, ds.Seconds, ds1.Seconds) BETWEEN 1 AND @SecondsDifference

    ORDER BY ds1.ExtranetLinkKey

    Delete the ones where the Row_Number() = 1, then repeat as long as @@Rowcount > 0.

    I'm not sure what you mean by the second part of this statement - the 'repeat' bit?

    If you take this set of set test data as being the entire contents of the dbo.Fact_ExtranetLinks table, you can omit the WHERE el.ExtranetLinkKey > @MaxExtranetLinkKey condition. (Sorry about the formatting; I don't know how to get the coulmn values to line up)

    ExtranetLinkKeyDateKeyUserKeyProductKeySecondsTimeKey

    6702843971184610117310:41:0138461

    6702863971178855117310:58:2139501

    6702873971184610117310:59:1739557

    6702883971184610117310:59:4439584

    6702893971184610117310:59:4839588

    6702903971184610117310:59:5139591

    6702913971178855117311:10:5740257

    6702933971183948117315:08:4954529

    6702953971186733117314:19:4151581

    6702963971186733117314:45:2453124

    6702973971186733117315:04:4054280

    6702983971186733117314:55:2453724

    6702993971186733117314:41:4652906

    6703003971186733117314:02:0250522

    6703013971186733117314:02:5050570

    6703023971186733117315:57:4757467

    6703033971186733117315:31:1955879

    6703043971186733117315:10:1754617

    6703053971186733117315:28:3355713

    6703063971178678117316:22:0958929

    6703073971178678117316:19:0658746

    6703083971178678117316:19:1258752

    6703093971178678117316:19:1758757

    6703103971179182117317:31:1863078

    6703113971185714119412:07:5443674

    Out of the above I would want the following ExtranetLinkKeys to be classed as duplicates and subsequently deleted:

    670288

    670289

    670290

    670296

    670301

    670303

    670306

    670308

    670309

    Regards

    Lempster

  • In re-reading your post, I realized you're planning on an update instead of a deletion, so change that part of what I wrote.

    The "repeat till @@rowcount=0" is because you could, theoretically, have two rows that are within 300 seconds of a prior row. You need to make sure you get both of them.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • Got it...I needed to include the TimeKey range in the join clause in the @Working table variable as well as the @RemoveDups table variable like so:

    INSERT INTO @Working (ExtranetLinkKey, DateKey, UserKey, ProductKey, Seconds, TimeKey)

    SELECT DISTINCT el.ExtranetLinkKey, el.DateKey, el.UserKey, el.ProductKey, Seconds, el.TimeKey

    FROM dbo.Fact_ExtranetLinks el

    INNER JOIN Dim_Time t ON el.TimeKey = t.TimeKey

    INNER JOIN Dim_Date d ON el.DateKey = d.DateKey

    INNER JOIN

    (SELECT ExtranetLinkKey, DateKey, UserKey, ProductKey, TimeKey

    FROM Fact_ExtranetLinks el) AS A

    ON el.UserKey = A.UserKey AND el.DateKey = A.DateKey AND el.ProductKey = A.ProductKey

    AND el.TimeKey - A.TimeKey < 300 AND el.ExtranetLinkKey <> A.ExtranetLinkKey

    WHERE el.ExtranetLinkKey > @MaxExtranetLinkKey

    Thanks for your help 🙂

  • You're welcome.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

Viewing 7 posts - 1 through 6 (of 6 total)

You must be logged in to reply to this topic. Login to reply