Selecting the record with the “nearest” date

  • Dear experts

    I have a rather difficult record selection problem in a SQL 2012 database which results from sometimes-missing data.

    First, the way it is supposed to work…

    I have a table with registration details of patients in a population. There are about half a million patients, and each time their details change, a new record is generated. Each record has a patient ID and a From Date and To Date showing the date range in which the details stored on the record were applicable.

    In principle therefore it should be straightforward, given a date, to pull out the relevant record for every patient in the population, using something like SELECT field1, …WHERE (DateAsAt >= IndexStartDate ) AND (DateAsAt<=IndexEndDate).

    Very often, the From date in the “first” record for a given patient is null, representing some date in the dim and distant past. OK then, lets modify our naïve selection criteria to WHERE ((IndexStartDate IS NULL) OR( IndexStartDate<=DateAsAt)) AND (DateAsAt<=IndexEndDate).

    This approach is however spannered by the fact that cases exist where the FromDate may be null multiple time for a given patient. (The To Date is always present, thankfully!) For instance:

    PatientIDFromDateToDate

    PatientANULL2015-01-19value11,value12, …

    PatientANULL2015-06-30Value21,Value22, …

    PatientANULL2500-12-31value31,value32, …

    I am assuming that if the data were complete, it would read like this:

    PatientIDFromDateToDate

    PatientANULL2015-01-19value11,value12, …

    PatientA2015-01-202015-06-30Value21,Value22, …

    PatientA2015-07-012500-12-31value31,value32, …

    Suppose I want to pull out the record that was current as at 2015-01-01. All three records fulfil my second WHERE condition, but I am most interested in the one whose ToDate is nearest to the AsAtDate, while still being greater than or equal to it.

    So, what is a good way of framing the criteria to get the record I want, while still (if possible) handling the non-deficient cases correctly? I suspect I am into windowing functions and/or subqueries, but need some inspiration.

    Yours hopefully

    Mark Dalley

  • Can you post some DDL and sample data? It would be helpful to understand, too, where the dates in your second result set are coming from.

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • Something like this?

    SELECT * FROM Patients

    WHERE @AsAtDate BETWEEN >= ToDate

    AND @AsAtDate < LEAD(ToDate,1) OVER (PARTITION BY PatientID ORDER BY ToDate)

    As Alan said, please provide full table DDL and sample data for a fully tested solution.

    John

  • The simplest solution by far would seem to be fix the data. 🙂 Update all NULL start-date values to the minimum for the data type (1/1/1753 for full datetime for example), and make a DEFAULT for that and also change the column to be NOT NULL. Wouldn't that then allow all properly-crafted code work like a charm without resorting to machinations to deal with that NULL-value scenario?

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • Hi guys,

    Thanks for the replies so far.

    I obviously need to brush up my approach to asking questions! Here therefore is some DDL for a test table (thanks to Jeff Moden for your really good article about this).

    CREATE TABLE [dbo].[MDTest2](

    [PatientID] [varchar](11) NOT NULL,

    [IndexStartDate] [datetime] NULL,

    [IndexEndDate] [datetime] NULL,

    [RecSource] [varchar](8) NOT NULL

    To fill it conveniently, use this DML... Since is is "real" data I have changed the IDs to ensure that they cannot be associated with real people:

    INSERT INTO dbo.MDTest2 (PatientID, IndexStartDate, IndexEndDate, RecSource)

    SELECT 'P5711063012',NULL,'Mar 24 2015 12:00AM','Historic' UNION ALL

    SELECT 'P5711063012',NULL,'Dec 31 2500 12:00AM','Current' UNION ALL

    SELECT 'P1063766833',NULL,'Sep 21 2015 12:00AM','Historic' UNION ALL

    SELECT 'P1063766833',NULL,'Dec 31 2500 12:00AM','Current' UNION ALL

    SELECT 'P0519084455',NULL,'Sep 21 2015 12:00AM','Historic' UNION ALL

    SELECT 'P0519084455',NULL,'Dec 31 2500 12:00AM','Current' UNION ALL

    SELECT 'P8362712077',NULL,'Sep 29 2015 12:00AM','Historic' UNION ALL

    SELECT 'P8362712077',NULL,'Dec 31 2500 12:00AM','Current' UNION ALL

    SELECT 'P6421082878',NULL,'Feb 22 2015 12:00AM','Historic' UNION ALL

    SELECT 'P6421082878',NULL,'May 22 2015 12:00AM','Deducted' UNION ALL

    SELECT 'P9814006870',NULL,'Jan 14 2015 12:00AM','Historic' UNION ALL

    SELECT 'P9814006870',NULL,'Jan 19 2015 12:00AM','Historic' UNION ALL

    SELECT 'P8293058689',NULL,'Sep 29 2015 12:00AM','Historic' UNION ALL

    SELECT 'P8293058689',NULL,'Dec 31 2500 12:00AM','Current' UNION ALL

    SELECT 'P8026056091',NULL,'Jan 19 2015 12:00AM','Historic' UNION ALL

    SELECT 'P8026056091',NULL,'Feb 18 2015 12:00AM','Historic' UNION ALL

    SELECT 'P8026056091',NULL,'Dec 31 2500 12:00AM','Current' UNION ALL

    SELECT 'P9037628490',NULL,'Oct 7 2015 12:00AM','Historic' UNION ALL

    SELECT 'P9037628490',NULL,'Dec 31 2500 12:00AM','Current' UNION ALL

    SELECT 'P1276762217',NULL,'Aug 25 2015 12:00AM','Historic' UNION ALL

    SELECT 'P1276762217',NULL,'Dec 31 2500 12:00AM','Current' UNION ALL

    SELECT 'P5608863213',NULL,'Jul 13 2015 12:00AM','Historic' UNION ALL

    SELECT 'P5608863213',NULL,'Jul 22 2015 12:00AM','Historic' UNION ALL

    SELECT 'P5608863213',NULL,'Dec 31 2500 12:00AM','Current' UNION ALL

    SELECT 'P4004258144',NULL,'Jun 29 2015 12:00AM','Historic' UNION ALL

    SELECT 'P4004258144',NULL,'Dec 31 2500 12:00AM','Current' UNION ALL

    SELECT 'P2695719501',NULL,'May 18 2015 12:00AM','Historic' UNION ALL

    SELECT 'P2695719501',NULL,'Dec 31 2500 12:00AM','Current' UNION ALL

    SELECT 'P9460813602',NULL,'Mar 4 2015 12:00AM','Historic' UNION ALL

    SELECT 'P9460813602',NULL,'Dec 31 2500 12:00AM','Current' ;

    Hope this helps

    Mark Dalley

  • Hi experts

    To answer Kevin's question, the table I am getting the data from is a read-only view maintained by another team and originating from a non-relational database of doubtful integrity! So, no, much as I would like to, I cannot clean up the source, despite the obvious added value for future generations.;-)

    I can of course have a go at filling in the null start dates in my MDtest2 table. That throws me back to windowing functions, I think, though I'm a newbie with these. I will play with the ideas suggested in John Mitchell's answer.

    Mark Dalley

  • Dear experts

    Here are the records I would expect a working query to return, highlighted in bold. I am taking an As At date of the 1st of June 2015 (2015-06-01):

    PatientID IndexStartDate IndexEndDate RecSource

    P5711063012NULL2015-03-24 00:00:00.000Historic

    P5711063012NULL2500-12-31 00:00:00.000Current

    P1063766833NULL2015-09-21 00:00:00.000Historic

    P1063766833NULL2500-12-31 00:00:00.000Current

    P0519084455NULL2015-09-21 00:00:00.000Historic

    P0519084455NULL2500-12-31 00:00:00.000Current

    P8362712077NULL2015-09-29 00:00:00.000Historic

    P8362712077NULL2500-12-31 00:00:00.000Current

    P6421082878NULL2015-02-22 00:00:00.000Historic

    P6421082878NULL2015-05-22 00:00:00.000Deducted

    P9814006870NULL2015-01-14 00:00:00.000Historic

    P9814006870NULL2015-01-19 00:00:00.000Historic

    P8293058689NULL2015-09-29 00:00:00.000Historic

    P8293058689NULL2500-12-31 00:00:00.000Current

    P8026056091NULL2015-01-19 00:00:00.000Historic

    P8026056091NULL2015-02-18 00:00:00.000Historic

    P8026056091NULL2500-12-31 00:00:00.000Current

    P9037628490NULL2015-10-07 00:00:00.000Historic

    P9037628490NULL2500-12-31 00:00:00.000Current

    P1276762217NULL2015-08-25 00:00:00.000Historic

    P1276762217NULL2500-12-31 00:00:00.000Current

    P5608863213NULL2015-07-13 00:00:00.000Historic

    P5608863213NULL2015-07-22 00:00:00.000Historic

    P5608863213NULL2500-12-31 00:00:00.000Current

    P4004258144NULL2015-06-29 00:00:00.000Historic

    P4004258144NULL2500-12-31 00:00:00.000Current

    P2695719501NULL2015-05-18 00:00:00.000Historic

    P2695719501NULL2500-12-31 00:00:00.000Current

    P9460813602NULL2015-03-04 00:00:00.000Historic

    P9460813602NULL2500-12-31 00:00:00.000Current

    In each case, the record I am after is the the record with the smallest index end date which comes after the as at date.

    MarkD

  • TheSQLGuru (12/4/2015)


    The simplest solution by far would seem to be fix the data. 🙂 Update all NULL start-date values to the minimum for the data type (1/1/1753 for full datetime for example), and make a DEFAULT for that and also change the column to be NOT NULL. Wouldn't that then allow all properly-crafted code work like a charm without resorting to machinations to deal with that NULL-value scenario?

    But you're corrupting the data rather than fixing it. The specific first date is unknown, but it is definitely not 1753! Later NULL values could be updated once/if it's confirmed that the change is assumed to be immediately after the previous end date. Or, it could be that there was a gap between the two, and the specific start date is still unknown, but you decide to just use the next date to insure faster lookups.

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • Does your table have a primary key constraint, please? I can't see a way of writing a robust query without one.

    John

  • Try this, should do the trick.

    Excuse the sloppy formatting 🙂

    declare @AsAtDate Datetime = '2015-01-01'

    ;with

    MinPat as

    (

    select patientID, indexenddate, M.MinDiff

    ,ROW_NUMBER() over (partition by patientID order by M.MinDiff) as RowNum

    from [dbo].[MDTest2]

    cross apply (values(datediff(DAY, @AsAtDate, IndexEndDate))) as M(MinDiff)

    where IndexEndDate > @AsAtDate

    )

    select *

    from MinPat

    where RowNum = 1

    order by patientid

  • Provided explicitly defined intervals do not overlap try this to infer unknown start dates

    declare @AsAtDate Datetime = '2015-06-01';

    with cte as (

    select *,

    strt = isnull([IndexStartDate],lag([IndexEndDate],1,cast('19710101' as date))

    over(partition by [PatientID] order by [IndexStartDate])+1)

    from [dbo].[MDTest2]

    )

    select *

    from cte

    where @AsAtDate between strt and [IndexEndDate]

    I also advise against updating nulls directly in table. This may lead to overlapping intervals when some more data are added later.

  • In response to John Mitchell,

    There are multiple records for each PatientID, each covering a separate date range. (We know that the date ranges can never overlap, though they will often be contiguous.) Since, as we have seen, the IndexFromDate can be null, but the IndexToDate never is. a possible definition would be :

    CREATE TABLE [dbo].[MDTest2](

    [PatientID] [varchar](11) NOT NULL,

    [IndexStartDate] [datetime] NULL,

    [IndexEndDate] [datetime] NOT NULL,

    [RecSource] [varchar](8) NOT NULL

    CONSTRAINT PrimaryKey PRIMARY KEY (PatientID,IndexEndDate)

    )

    I note that this also touches on the approach suggested by Kevin Boles. Basically, Kevin was suggesting that we use a specific non-null date (1753-01-01) to represent an unknown date in the distant past. This simplifies determination of the relevant interval, but it would have to be understood as being a convention, and adhered to. Looking at the test data, I see that whoever created the data is already doing something similar with the IndexEndDate - 2025-12-31 is used to represent an undefined and as-yet-totally-unknown date in the future - a sort of high-valued null. And in this case not having a null works well - among other things, it makes it possible to have a somewhat sensible primary key!

    Now to try and get my head around Dave Morrisons suggested answer...

    MarkD

  • In that case, this should also work. It has a very similar execution plan to Dave's. That's only for a very small table, though - your mileage may vary when you start using it on production-size data.

    WITH MinDates AS (

    SELECT

    PatientID

    ,MIN(IndexEndDate) IndexEndDate

    FROM dbo.MDTest2

    WHERE IndexEndDate > '2015-06-01'

    GROUP BY PatientID

    )

    SELECT

    t.PatientID

    ,t.IndexStartDate

    ,t.IndexEndDate

    ,t.RecSource

    FROM MinDates m

    JOIN MDTest2 t

    ON m.PatientID = t.PatientID AND m.IndexEndDate = t.IndexEndDate;

    John

  • In similar situations, I found the following to work pretty well:

    SELECT PatientID, ...

    from dbo.MDTest tbl1

    where tbl1.IndexEndDate > @dParamDate

    and not exists (select 1

    from dbo.MDTest tbl2

    where tbl2.PatientID = tbl1.PatientID

    and tbl2.IndexEndDate > @dParamDate

    and tbl2.IndexEndDate < tbl1.IndexEndDate);

    This will return the latest record where the end date is greater than the specified date, but where there are no other records for the patient where the other record's end date is also greater than the specified date, but earlier than the "candidate" record's end date.

    This avoids all functions, data type conversions, etc. This would work especially well with a clustered index on (PatientID, IndexEndDate).

  • Kim Crosser (12/8/2015)


    In similar situations, I found the following to work pretty well:

    SELECT PatientID, ...

    from dbo.MDTest tbl1

    where tbl1.IndexEndDate > @dParamDate

    and not exists (select 1

    from dbo.MDTest tbl2

    where tbl2.PatientID = tbl1.PatientID

    and tbl2.IndexEndDate > @dParamDate

    and tbl2.IndexEndDate < tbl1.IndexEndDate);

    This will return the latest record where the end date is greater than the specified date, but where there are no other records for the patient where the other record's end date is also greater than the specified date, but earlier than the "candidate" record's end date.

    This avoids all functions, data type conversions, etc. This would work especially well with a clustered index on (PatientID, IndexEndDate).

    Yes, fine on small data sets. But check out the execution plan - it does two scans of the table and therefore as the table gets larger and larger you are likely to see performance deteriorate.

    John

Viewing 15 posts - 1 through 15 (of 19 total)

You must be logged in to reply to this topic. Login to reply