Finding records clustered by date

  • Hi! I'm processing patient data on thousands of patients and I'm trying to determine a way of identifying a set of records from a single table of events that meet certain criteria and are within 7 days of each other. Identifying the records is not an issue, but my problem is determining which records are within 7 days of each other. For example, a patient may have the following set of events:

    Pat123 Code3 03/05/2010

    Pat123 Code1 05/19/2010

    Pat123 Code3 05/20/2010

    Pat123 Code2 05/21/2010

    Pat123 Code4 05/22/2010

    Pat123 Code1 06/20/2010

    Pat123 Code3 06/30/2010

    Of course, this table contains other patient records as well. I'm looking for a solution that doesn't involve loops or cursors. The answer from the above would be:

    Pat123 Code1 05/19/2010

    Pat123 Code3 05/20/2010

    Pat123 Code2 05/21/2010

    Pat123 Code4 05/22/2010

    Any thoughts?

  • So you have a specific code, then you want to see which items fall between a 7 day period for that code from the same table?

    I'm not 100% on what you are doing, but you could do that with an inner join. Your first select would pull the code and date and then the inner join could be used to pull the records within the 7 day span.

  • Try this:

    DECLARE @PatientsInfo TABLE

    (

    PatientID VARCHAR(15),

    PatientCode VARCHAR(5),

    Activity DATETIME

    )

    INSERT INTO @PatientsInfo

    SELECT 'Pat123', 'Code3', '03/05/2010'

    UNION ALL SELECT 'Pat123', 'Code1', '05/19/2010'

    UNION ALL SELECT 'Pat123', 'Code3', '05/20/2010'

    UNION ALL SELECT 'Pat123', 'Code2', '05/21/2010'

    UNION ALL SELECT 'Pat123', 'Code4', '05/22/2010'

    UNION ALL SELECT 'Pat123', 'Code1', '06/20/2010'

    UNION ALL SELECT 'Pat123', 'Code3', '06/30/2010'

    ; WITH GiveARowNumber AS

    (

    SELECT

    RowNum = ROW_NUMBER() OVER (ORDER BY (SELECT NULL))

    ,PatientID

    ,PatientCode

    ,Activity

    FROM

    @PatientsInfo

    ),

    GroupedSet AS

    (

    SELECT

    RN = ROW_NUMBER() OVER (PARTITION BY PatientID ORDER BY RowNum)

    ,RowNum

    ,PatientID

    ,PatientCode

    ,Activity

    FROM

    GiveARowNumber

    )

    SELECT *

    FROM GroupedSet GS1

    INNER JOIN GroupedSet GS2

    ON GS1.RN +1 = GS2.RN

    AND GS1.PatientID = GS2.PatientID

    WHERE

    DATEDIFF(dd,GS2.Activity,GS1.Activity) <= 7

    Please take a peek at how i am producing my table and how i am setting up sample data. You might want to do the same to attract more people to view your thread; more than attracting, you will get top-brainers to work on your request..

  • I think Monsieur ColdCoffee's code can be simplified - I don't think that the first CTE is necessary if in the second one, we order by the Activity in the ROW_NUMBER() function call.

    ; WITH GroupedSet AS

    (

    SELECT

    RN = ROW_NUMBER() OVER (PARTITION BY PatientID ORDER BY Activity)

    ,PatientID

    ,PatientCode

    ,Activity

    FROM

    @PatientsInfo

    )

    SELECT *

    FROM GroupedSet GS1

    INNER JOIN GroupedSet GS2

    ON GS1.RN +1 = GS2.RN

    AND GS1.PatientID = GS2.PatientID

    WHERE

    DATEDIFF(dd,GS2.Activity,GS1.Activity) <= 7

    Wayne
    Microsoft Certified Master: SQL Server 2008
    Author - SQL Server T-SQL Recipes


    If you can't explain to another person how the code that you're copying from the internet works, then DON'T USE IT on a production system! After all, you will be the one supporting it!
    Links:
    For better assistance in answering your questions
    Performance Problems
    Common date/time routines
    Understanding and Using APPLY Part 1 & Part 2

  • WayneS (11/3/2010)


    I think Monsieur ColdCoffee's code can be simplified - I don't think that the first CTE is necessary if in the second one, we order by the Activity in the ROW_NUMBER() function call.

    Wayne, specifically, did 2 ROW_NUMBERs just for the fact tht, if the PatientIDs dont follow an order (meaning a mx of PatientIDs in each row) , then that extra ROW_NUMBER will be effective..

    Consider this sample data

    DECLARE @PatientsInfo TABLE

    (

    PatientID VARCHAR(15),

    PatientCode VARCHAR(5),

    Activity DATETIME

    )

    INSERT INTO @PatientsInfo

    SELECT 'Pat123', 'Code3', '03/05/2010'

    UNION ALL SELECT 'Pat123', 'Code1', '05/19/2010'

    UNION ALL SELECT 'Pat123', 'Code3', '05/20/2010'

    UNION ALL SELECT 'Pat123', 'Code2', '05/21/2010'

    UNION ALL SELECT 'Pat123', 'Code4', '05/22/2010'

    UNION ALL SELECT 'Pat123', 'Code1', '06/20/2010'

    UNION ALL SELECT 'Pat123', 'Code3', '06/30/2010'

    UNION ALL SELECT 'Pat456', 'Code1', '05/19/2010'

    UNION ALL SELECT 'Pat456', 'Code3', '05/20/2010'

    UNION ALL SELECT 'Pat456', 'Code2', '05/21/2010'

    UNION ALL SELECT 'Pat456', 'Code4', '05/22/2010'

    UNION ALL SELECT 'Pat456', 'Code1', '06/20/2010'

    UNION ALL SELECT 'Pat456', 'Code3', '06/30/2010'

    UNION ALL SELECT 'Pat123', 'Code1', '05/19/2010'

    UNION ALL SELECT 'Pat123', 'Code3', '05/20/2010'

    UNION ALL SELECT 'Pat123', 'Code2', '05/21/2010'

    UNION ALL SELECT 'Pat123', 'Code4', '05/22/2010'

    UNION ALL SELECT 'Pat456', 'Code1', '06/20/2010'

    UNION ALL SELECT 'Pat456', 'Code3', '06/30/2010'

  • WayneS (11/3/2010)


    I think Monsieur ColdCoffee's code can be simplified - I don't think that the first CTE is necessary if in the second one, we order by the Activity in the ROW_NUMBER() function call.

    ; WITH GroupedSet AS

    (

    SELECT

    RN = ROW_NUMBER() OVER (PARTITION BY PatientID ORDER BY Activity)

    ,PatientID

    ,PatientCode

    ,Activity

    FROM

    @PatientsInfo

    )

    SELECT *

    FROM GroupedSet GS1

    INNER JOIN GroupedSet GS2

    ON GS1.RN +1 = GS2.RN

    AND GS1.PatientID = GS2.PatientID

    WHERE

    DATEDIFF(dd,GS2.Activity,GS1.Activity) <= 7

    YEs, looking closely at the sample data provided in my first post and presuming ActivityDate column follows the order as u said, then your simplified version of the code wins 🙂

  • ColdCoffee (11/3/2010)You might want to do the same to attract more people to view your thread; more than attracting, you will get top-brainers to work on your request..

    Thanks for your help. I fully intended to work up a better example using some code, but I was late for a meeting that I had to get to.

    I'll try out your suggestions and see what develops. Again, thanks a lot!!

  • ColdCoffee (11/3/2010)


    WayneS (11/3/2010)


    I think Monsieur ColdCoffee's code can be simplified - I don't think that the first CTE is necessary if in the second one, we order by the Activity in the ROW_NUMBER() function call.

    Wayne, specifically, did 2 ROW_NUMBERs just for the fact tht, if the PatientIDs dont follow an order (meaning a mx of PatientIDs in each row) , then that extra ROW_NUMBER will be effective..

    I could potentially see the need for that... we'll see what the OP says about it.

    Wayne
    Microsoft Certified Master: SQL Server 2008
    Author - SQL Server T-SQL Recipes


    If you can't explain to another person how the code that you're copying from the internet works, then DON'T USE IT on a production system! After all, you will be the one supporting it!
    Links:
    For better assistance in answering your questions
    Performance Problems
    Common date/time routines
    Understanding and Using APPLY Part 1 & Part 2

  • Aaron N. Cutshall (11/3/2010)


    ColdCoffee (11/3/2010)You might want to do the same to attract more people to view your thread; more than attracting, you will get top-brainers to work on your request..

    Thanks for your help. I fully intended to work up a better example using some code, but I was late for a meeting that I had to get to.

    I'll try out your suggestions and see what develops. Again, thanks a lot!!

    Good. Let us know how it works out... you've got two curious folks (plus, I'm sure, some lurking).

    Wayne
    Microsoft Certified Master: SQL Server 2008
    Author - SQL Server T-SQL Recipes


    If you can't explain to another person how the code that you're copying from the internet works, then DON'T USE IT on a production system! After all, you will be the one supporting it!
    Links:
    For better assistance in answering your questions
    Performance Problems
    Common date/time routines
    Understanding and Using APPLY Part 1 & Part 2

  • Thanks, Wayne. It will be a few days until I get to try it again. I'm off to a wee bit of a vacation to my brother's wedding. I'll be back next Wednesday and I'll give it a shot then.

  • OK, I finally had a chance to review the suggested code. To get it to work, I had to make a few modifications. I'm not sure that it's the best way, but it does work. Take a look at it and let me know what you think:

    DECLARE @PatientsInfo TABLE (

    PatientID VARCHAR(15),

    PatientCode VARCHAR(5),

    Activity DATETIME

    );

    INSERT INTO @PatientsInfo

    VALUES ('Pat123', 'Code3', '03/05/2010'),

    ('Pat123', 'Code1', '05/19/2010'),

    ('Pat123', 'Code3', '05/20/2010'),

    ('Pat123', 'Code2', '05/21/2010'),

    ('Pat123', 'Code4', '05/22/2010'),

    ('Pat123', 'Code1', '06/20/2010'),

    ('Pat123', 'Code3', '06/30/2010'),

    ('Pat456', 'Code1', '05/19/2010'),

    ('Pat456', 'Code3', '05/20/2010'),

    ('Pat456', 'Code2', '05/21/2010'),

    ('Pat456', 'Code4', '05/22/2010'),

    ('Pat456', 'Code1', '06/20/2010'),

    ('Pat456', 'Code3', '06/30/2010');

    WITH cteGroupedSet(PatientID, PatientCode, Activity, RowNbr) AS (

    SELECT PatientID, PatientCode, Activity,

    ROW_NUMBER() OVER (PARTITION BY PatientID ORDER BY Activity)

    FROM @PatientsInfo

    ),

    cteMatches(PatientID, PatientCode1, Activity1, PatientCode2, Activity2) AS (

    SELECT GS1.PatientID, GS1.PatientCode, GS1.Activity, GS2.PatientCode, GS2.Activity

    FROM cteGroupedSet GS1

    INNER JOIN cteGroupedSet GS2 ON GS2.PatientID = GS1.PatientID

    AND (GS2.RowNbr = GS1.RowNbr-1)

    WHERE DATEDIFF(dd,GS2.Activity,GS1.Activity) <= 7

    )

    SELECT DISTINCT P.PatientID, P.PatientCode, P.Activity

    FROM cteMatches M

    INNER JOIN @PatientsInfo P ON P.PatientID = M.PatientID

    AND (P.PatientCode = M.PatientCode1 OR P.PatientCode = M.PatientCode2)

    AND (P.Activity = M.Activity1 OR P.Activity = M.Activity2)

    ORDER BY P.PatientID, P.Activity;

    I really don't care for using the DISTINCT clause on the last query, but I couldn't come up with a better solution. Any ideas?

  • I've had an opportunity to work some more on this and the situation is a bit more complex than I originally thought. Therefore, the following code has been revised:

    DECLARE @PatientsInfo TABLE (

    PatientID VARCHAR(15),

    EventCode VARCHAR(5),

    EventDate DATETIME

    );

    INSERT INTO @PatientsInfo

    VALUES('Pat123', 'Code3', '03/05/2010'),

    ('Pat123', 'Code3', '03/06/2010'),

    ('Pat123', 'Code1', '05/19/2010'),

    ('Pat123', 'Code3', '05/20/2010'),

    ('Pat123', 'Code2', '05/21/2010'),

    ('Pat123', 'Code4', '05/22/2010'),

    ('Pat123', 'Code1', '06/20/2010'),

    ('Pat123', 'Code3', '06/30/2010'),

    ('Pat456', 'Code3', '05/17/2010'),

    ('Pat456', 'Code1', '05/19/2010'),

    ('Pat456', 'Code3', '05/20/2010'),

    ('Pat456', 'Code2', '05/21/2010'),

    ('Pat456', 'Code4', '05/24/2010'),

    ('Pat456', 'Code4', '05/28/2010'),

    ('Pat456', 'Code1', '06/20/2010'),

    ('Pat456', 'Code3', '06/30/2010'),

    ('Pat123', 'Code1', '07/01/2010'),

    ('Pat123', 'Code2', '06/29/2010'),

    ('Pat123', 'Code4', '07/01/2010'),

    ('Pat123', 'Code2', '06/19/2010');

    WITH cteGroupedSet(PatientID, EventCode, EventDate, RowNbr) AS (

    SELECT PatientID, EventCode, EventDate,

    ROW_NUMBER() OVER (PARTITION BY PatientID ORDER BY EventDate)

    FROM @PatientsInfo

    ),

    cteMatches(PatientID, EventCode, EventDate, RowNbr, GroupNbr) AS (

    SELECT PatientID, EventCode, EventDate, RowNbr, RowNbr

    FROM cteGroupedSet

    WHERE RowNbr = 1

    UNION ALL

    SELECT G.PatientID, G.EventCode, G.EventDate, G.RowNbr,

    CASE WHEN (ABS(DATEDIFF(dd,M.EventDate,G.EventDate))<=7) THEN M.GroupNbr ELSE M.GroupNbr+1 END

    FROM cteMatches M

    INNER JOIN cteGroupedSet G ON G.PatientID = M.PatientID AND G.RowNbr-1 = M.RowNbr

    WHERE G.RowNbr > 1

    ),

    cteUnique(PatientID, GroupNbr, EventCode, EventDate) AS (

    SELECT PatientID, GroupNbr, EventCode, MAX(EventDate)

    FROM cteMatches

    GROUP BY PatientID, GroupNbr, EventCode

    ),

    cteCompleteGroups(PatientID, GroupNbr, AvgEventDate) AS (

    SELECT PatientID, GroupNbr, CAST(AVG(CAST(EventDate as int)) as datetime)

    FROM cteUnique

    GROUP BY PatientID, GroupNbr

    HAVING COUNT(*) = 4

    )

    SELECT CG.PatientID, P.EventCode, CG.GroupNbr, P.EventDate

    FROM cteCompleteGroups CG

    INNER JOIN @PatientsInfo P ON P.PatientID = CG.PatientID

    WHERE ABS(DATEDIFF(dd,P.EventDate,CAST(AvgEventDate as datetime))) <= 3

    ORDER BY CG.PatientID, P.EventDate;

    This now accommodates situations wherein we want to ignore partial clusters, multiple codes clustered together; and when the entire cluster goes beyond the 7 day range.

    I think that I can be happy with this arrangement! 🙂

  • Aaron,

    It does look a bit complicated, but it looks efficient also. Thanks for the feedback!

    Wayne
    Microsoft Certified Master: SQL Server 2008
    Author - SQL Server T-SQL Recipes


    If you can't explain to another person how the code that you're copying from the internet works, then DON'T USE IT on a production system! After all, you will be the one supporting it!
    Links:
    For better assistance in answering your questions
    Performance Problems
    Common date/time routines
    Understanding and Using APPLY Part 1 & Part 2

Viewing 13 posts - 1 through 12 (of 12 total)

You must be logged in to reply to this topic. Login to reply