Selecting the record with the “nearest” date

Question

Mark Dalley

SSCrazy

Points: 2974
December 4, 2015 at 8:28 am

#322730

Dear experts
I have a rather difficult record selection problem in a SQL 2012 database which results from sometimes-missing data.
First, the way it is supposed to work…
I have a table with registration details of patients in a population. There are about half a million patients, and each time their details change, a new record is generated. Each record has a patient ID and a From Date and To Date showing the date range in which the details stored on the record were applicable.
In principle, therefore, it should be straightforward, given a date, to pull out the relevant record for every patient in the population, using something like SELECT field1, … WHERE (DateAsAt >= IndexStartDate) AND (DateAsAt <= IndexEndDate).
Very often, the From Date in the “first” record for a given patient is null, representing some date in the dim and distant past. OK then, let's modify our naïve selection criteria to WHERE ((IndexStartDate IS NULL) OR (IndexStartDate <= DateAsAt)) AND (DateAsAt <= IndexEndDate).
This approach is, however, spannered by the fact that cases exist where the From Date may be null multiple times for a given patient. (The To Date is always present, thankfully!) For instance:
PatientID  FromDate    ToDate      Other fields
PatientA   NULL        2015-01-19  value11, value12, …
PatientA   NULL        2015-06-30  Value21, Value22, …
PatientA   NULL        2500-12-31  value31, value32, …
I am assuming that if the data were complete, it would read like this:
PatientID  FromDate    ToDate      Other fields
PatientA   NULL        2015-01-19  value11, value12, …
PatientA   2015-01-20  2015-06-30  Value21, Value22, …
PatientA   2015-07-01  2500-12-31  value31, value32, …
Suppose I want to pull out the record that was current as at 2015-01-01. All three records fulfil my second WHERE condition, but I am most interested in the one whose ToDate is nearest to the AsAtDate, while still being greater than or equal to it.
So, what is a good way of framing the criteria to get the record I want, while still (if possible) handling the non-deficient cases correctly? I suspect I am into windowing functions and/or subqueries, but need some inspiration.
Yours hopefully
Mark Dalley

Alan Burstein SSC Guru Points: 61136 · Answer 1

Can you post some DDL and sample data? It would be helpful to understand, too, where the dates in your second result set are coming from.

"I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

-- Itzik Ben-Gan 2001

John Mitchell-245523 SSC Guru Points: 148809 · Answer 2

Something like this?

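-- Window functions cannot appear in WHERE, so the previous ToDate is computed in a CTE first

WITH Ordered AS (

SELECT *, LAG(ToDate) OVER (PARTITION BY PatientID ORDER BY ToDate) AS PrevToDate FROM Patients

)

SELECT * FROM Ordered

WHERE @AsAtDate <= ToDate AND (PrevToDate IS NULL OR @AsAtDate > PrevToDate);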

As Alan said, please provide full table DDL and sample data for a fully tested solution.

John

TheSQLGuru SSC Guru Points: 134017 · Answer 3

The simplest solution by far would seem to be to fix the data. 🙂 Update all NULL start-date values to the minimum for the data type (1/1/1753 for full datetime, for example), make that the column DEFAULT, and change the column to NOT NULL. Wouldn't that then allow all properly crafted code to work like a charm without resorting to machinations to deal with that NULL-value scenario?

Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service

Mark Dalley SSCrazy Points: 2974 · Answer 4

Hi guys,

Thanks for the replies so far.

I obviously need to brush up my approach to asking questions! Here therefore is some DDL for a test table (thanks to Jeff Moden for your really good article about this).

CREATE TABLE [dbo].[MDTest2](

[PatientID] [varchar](11) NOT NULL,

[IndexStartDate] [datetime] NULL,

[IndexEndDate] [datetime] NULL,

[RecSource] [varchar](8) NOT NULL

);

To fill it conveniently, use this DML. Since it is "real" data, I have changed the IDs to ensure that they cannot be associated with real people:

INSERT INTO dbo.MDTest2 (PatientID, IndexStartDate, IndexEndDate, RecSource)

SELECT 'P5711063012',NULL,'Mar 24 2015 12:00AM','Historic' UNION ALL

SELECT 'P5711063012',NULL,'Dec 31 2500 12:00AM','Current' UNION ALL

SELECT 'P1063766833',NULL,'Sep 21 2015 12:00AM','Historic' UNION ALL

SELECT 'P1063766833',NULL,'Dec 31 2500 12:00AM','Current' UNION ALL

SELECT 'P0519084455',NULL,'Sep 21 2015 12:00AM','Historic' UNION ALL

SELECT 'P0519084455',NULL,'Dec 31 2500 12:00AM','Current' UNION ALL

SELECT 'P8362712077',NULL,'Sep 29 2015 12:00AM','Historic' UNION ALL

SELECT 'P8362712077',NULL,'Dec 31 2500 12:00AM','Current' UNION ALL

SELECT 'P6421082878',NULL,'Feb 22 2015 12:00AM','Historic' UNION ALL

SELECT 'P6421082878',NULL,'May 22 2015 12:00AM','Deducted' UNION ALL

SELECT 'P9814006870',NULL,'Jan 14 2015 12:00AM','Historic' UNION ALL

SELECT 'P9814006870',NULL,'Jan 19 2015 12:00AM','Historic' UNION ALL

SELECT 'P8293058689',NULL,'Sep 29 2015 12:00AM','Historic' UNION ALL

SELECT 'P8293058689',NULL,'Dec 31 2500 12:00AM','Current' UNION ALL

SELECT 'P8026056091',NULL,'Jan 19 2015 12:00AM','Historic' UNION ALL

SELECT 'P8026056091',NULL,'Feb 18 2015 12:00AM','Historic' UNION ALL

SELECT 'P8026056091',NULL,'Dec 31 2500 12:00AM','Current' UNION ALL

SELECT 'P9037628490',NULL,'Oct 7 2015 12:00AM','Historic' UNION ALL

SELECT 'P9037628490',NULL,'Dec 31 2500 12:00AM','Current' UNION ALL

SELECT 'P1276762217',NULL,'Aug 25 2015 12:00AM','Historic' UNION ALL

SELECT 'P1276762217',NULL,'Dec 31 2500 12:00AM','Current' UNION ALL

SELECT 'P5608863213',NULL,'Jul 13 2015 12:00AM','Historic' UNION ALL

SELECT 'P5608863213',NULL,'Jul 22 2015 12:00AM','Historic' UNION ALL

SELECT 'P5608863213',NULL,'Dec 31 2500 12:00AM','Current' UNION ALL

SELECT 'P4004258144',NULL,'Jun 29 2015 12:00AM','Historic' UNION ALL

SELECT 'P4004258144',NULL,'Dec 31 2500 12:00AM','Current' UNION ALL

SELECT 'P2695719501',NULL,'May 18 2015 12:00AM','Historic' UNION ALL

SELECT 'P2695719501',NULL,'Dec 31 2500 12:00AM','Current' UNION ALL

SELECT 'P9460813602',NULL,'Mar 4 2015 12:00AM','Historic' UNION ALL

SELECT 'P9460813602',NULL,'Dec 31 2500 12:00AM','Current' ;

Hope this helps

Mark Dalley

Mark Dalley SSCrazy Points: 2974 · Answer 5

Hi experts

To answer Kevin's question, the table I am getting the data from is a read-only view maintained by another team and originating from a non-relational database of doubtful integrity! So, no, much as I would like to, I cannot clean up the source, despite the obvious added value for future generations.;-)

I can of course have a go at filling in the null start dates in my MDtest2 table. That throws me back to windowing functions, I think, though I'm a newbie with these. I will play with the ideas suggested in John Mitchell's answer.

Mark Dalley

Mark Dalley SSCrazy Points: 2974 · Answer 6

Dear experts

Here are the records I would expect a working query to return, marked with an asterisk. I am taking an As At date of the 1st of June 2015 (2015-06-01):

PatientID    IndexStartDate  IndexEndDate             RecSource
P5711063012  NULL            2015-03-24 00:00:00.000  Historic
P5711063012  NULL            2500-12-31 00:00:00.000  Current   *
P1063766833  NULL            2015-09-21 00:00:00.000  Historic  *
P1063766833  NULL            2500-12-31 00:00:00.000  Current
P0519084455  NULL            2015-09-21 00:00:00.000  Historic  *
P0519084455  NULL            2500-12-31 00:00:00.000  Current
P8362712077  NULL            2015-09-29 00:00:00.000  Historic  *
P8362712077  NULL            2500-12-31 00:00:00.000  Current
P6421082878  NULL            2015-02-22 00:00:00.000  Historic
P6421082878  NULL            2015-05-22 00:00:00.000  Deducted
P9814006870  NULL            2015-01-14 00:00:00.000  Historic
P9814006870  NULL            2015-01-19 00:00:00.000  Historic
P8293058689  NULL            2015-09-29 00:00:00.000  Historic  *
P8293058689  NULL            2500-12-31 00:00:00.000  Current
P8026056091  NULL            2015-01-19 00:00:00.000  Historic
P8026056091  NULL            2015-02-18 00:00:00.000  Historic
P8026056091  NULL            2500-12-31 00:00:00.000  Current   *
P9037628490  NULL            2015-10-07 00:00:00.000  Historic  *
P9037628490  NULL            2500-12-31 00:00:00.000  Current
P1276762217  NULL            2015-08-25 00:00:00.000  Historic  *
P1276762217  NULL            2500-12-31 00:00:00.000  Current
P5608863213  NULL            2015-07-13 00:00:00.000  Historic  *
P5608863213  NULL            2015-07-22 00:00:00.000  Historic
P5608863213  NULL            2500-12-31 00:00:00.000  Current
P4004258144  NULL            2015-06-29 00:00:00.000  Historic  *
P4004258144  NULL            2500-12-31 00:00:00.000  Current
P2695719501  NULL            2015-05-18 00:00:00.000  Historic
P2695719501  NULL            2500-12-31 00:00:00.000  Current   *
P9460813602  NULL            2015-03-04 00:00:00.000  Historic
P9460813602  NULL            2500-12-31 00:00:00.000  Current   *

In each case, the record I am after is the record with the smallest index end date that comes after the as at date.

MarkD

ScottPletcher SSC Guru Points: 100949 · Answer 7

TheSQLGuru (12/4/2015)
The simplest solution by far would seem to be to fix the data. 🙂 Update all NULL start-date values to the minimum for the data type (1/1/1753 for full datetime, for example), make that the column DEFAULT, and change the column to NOT NULL. Wouldn't that then allow all properly crafted code to work like a charm without resorting to machinations to deal with that NULL-value scenario?

But you're corrupting the data rather than fixing it. The specific first date is unknown, but it is definitely not 1753! Later NULL values could be updated once/if it's confirmed that the change is assumed to be immediately after the previous end date. Or, it could be that there was a gap between the two, and the specific start date is still unknown, but you decide to just use the next date to ensure faster lookups.

SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

John Mitchell-245523 SSC Guru Points: 148809 · Answer 8

Does your table have a primary key constraint, please? I can't see a way of writing a robust query without one.

John

Dave Morrison SSCrazy Points: 2017 · Answer 9

Try this, should do the trick.

Excuse the sloppy formatting 🙂

declare @AsAtDate Datetime = '2015-01-01'

;with

MinPat as

(

select patientID, indexenddate, M.MinDiff

,ROW_NUMBER() over (partition by patientID order by M.MinDiff) as RowNum

from [dbo].[MDTest2]

cross apply (values(datediff(DAY, @AsAtDate, IndexEndDate))) as M(MinDiff)

where IndexEndDate > @AsAtDate

)

select *

from MinPat

where RowNum = 1

order by patientid

serg-52 SSCrazy Eights Points: 9913 · Answer 10

Provided explicitly defined intervals do not overlap try this to infer unknown start dates

declare @AsAtDate Datetime = '2015-06-01';

with cte as (

select *,

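-- order by end date: start dates are often NULL, so they cannot define the row sequence

strt = isnull([IndexStartDate], dateadd(day, 1, lag([IndexEndDate], 1, cast('19710101' as datetime))

over (partition by [PatientID] order by [IndexEndDate])))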

from [dbo].[MDTest2]

)

select *

from cte

where @AsAtDate between strt and [IndexEndDate]

I also advise against updating the nulls directly in the table. This may lead to overlapping intervals when more data are added later.
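
A minimal sketch of one way to follow that advice, assuming the dbo.MDTest2 table posted earlier in the thread: infer the start dates in a view rather than updating the table. The view name and the InferredStartDate column are hypothetical, and the sketch assumes that each interval starts the day after the previous end date, which may not hold where there are gaps. The first record for each patient keeps a NULL start, since that date is genuinely unknown.

```sql
-- Hypothetical view: leaves dbo.MDTest2 untouched and derives a start date on the fly.
-- Assumption: a NULL start date means "the day after the previous IndexEndDate".
CREATE VIEW dbo.MDTest2_InferredStart
AS
SELECT PatientID, IndexStartDate, IndexEndDate, RecSource,
       COALESCE(IndexStartDate,
                DATEADD(DAY, 1,
                        LAG(IndexEndDate) OVER (PARTITION BY PatientID
                                                ORDER BY IndexEndDate))
               ) AS InferredStartDate
FROM dbo.MDTest2;
GO  -- batch separator understood by SSMS and sqlcmd, not a T-SQL statement

DECLARE @AsAtDate datetime = '2015-06-01';

-- A NULL InferredStartDate can only be the first record for a patient,
-- so it is treated as "open-ended in the past".
SELECT PatientID, IndexStartDate, IndexEndDate, RecSource
FROM dbo.MDTest2_InferredStart
WHERE (InferredStartDate IS NULL OR InferredStartDate <= @AsAtDate)
  AND @AsAtDate <= IndexEndDate;
```

Because the inference is recomputed on every query, rows added later cannot leave stale, overlapping start dates behind.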

Mark Dalley SSCrazy Points: 2974 · Answer 11

In response to John Mitchell,

There are multiple records for each PatientID, each covering a separate date range. (We know that the date ranges can never overlap, though they will often be contiguous.) Since, as we have seen, the IndexStartDate can be null but the IndexEndDate never is, a possible definition would be:

CREATE TABLE [dbo].[MDTest2](

[PatientID] [varchar](11) NOT NULL,

[IndexStartDate] [datetime] NULL,

[IndexEndDate] [datetime] NOT NULL,

[RecSource] [varchar](8) NOT NULL,

CONSTRAINT PrimaryKey PRIMARY KEY (PatientID,IndexEndDate)

)

I note that this also touches on the approach suggested by Kevin Boles. Basically, Kevin was suggesting that we use a specific non-null date (1753-01-01) to represent an unknown date in the distant past. This simplifies determination of the relevant interval, but it would have to be understood as being a convention, and adhered to. Looking at the test data, I see that whoever created the data is already doing something similar with the IndexEndDate: 2500-12-31 is used to represent an undefined and as-yet-totally-unknown date in the future, a sort of high-valued null. And in this case not having a null works well. Among other things, it makes it possible to have a somewhat sensible primary key!

Now to try and get my head around Dave Morrison's suggested answer...

MarkD

John Mitchell-245523 SSC Guru Points: 148809 · Answer 12

In that case, this should also work. It has a very similar execution plan to Dave's. That's only for a very small table, though - your mileage may vary when you start using it on production-size data.

WITH MinDates AS (

SELECT

PatientID

,MIN(IndexEndDate) IndexEndDate

FROM dbo.MDTest2

WHERE IndexEndDate > '2015-06-01'

GROUP BY PatientID

)

SELECT

t.PatientID

,t.IndexStartDate

,t.IndexEndDate

,t.RecSource

FROM MinDates m

JOIN MDTest2 t

ON m.PatientID = t.PatientID AND m.IndexEndDate = t.IndexEndDate;

John

Kim Crosser SSCommitted Points: 1763 · Answer 13

In similar situations, I found the following to work pretty well:

SELECT PatientID, ...

from dbo.MDTest2 tbl1

where tbl1.IndexEndDate > @dParamDate

and not exists (select 1

from dbo.MDTest2 tbl2

where tbl2.PatientID = tbl1.PatientID

and tbl2.IndexEndDate > @dParamDate

and tbl2.IndexEndDate < tbl1.IndexEndDate);

This returns, for each patient, the record with the earliest end date after the specified date: a candidate record qualifies only if no other record for the same patient has an end date that is also greater than the specified date but earlier than the candidate's end date.

This avoids all functions, data type conversions, etc. This would work especially well with a clustered index on (PatientID, IndexEndDate).
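
As a rough sketch, such an index could be created as below. The index name is hypothetical, and the statement assumes dbo.MDTest2 has no clustered index yet. A table can have only one, and a PRIMARY KEY constraint on (PatientID, IndexEndDate) would normally already be clustered by default, in which case this extra index is unnecessary.

```sql
-- Hypothetical index name; assumes dbo.MDTest2 is currently a heap
-- (no clustered index and no clustered primary key).
CREATE CLUSTERED INDEX CIX_MDTest2_PatientID_IndexEndDate
    ON dbo.MDTest2 (PatientID, IndexEndDate);
```

With the rows ordered by patient and then end date, both the outer filter and the NOT EXISTS probe can seek on PatientID and range over IndexEndDate.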

John Mitchell-245523 SSC Guru Points: 148809 · Answer 14

Kim Crosser (12/8/2015)
In similar situations, I found the following to work pretty well:
SELECT PatientID, ...
from dbo.MDTest2 tbl1
where tbl1.IndexEndDate > @dParamDate
and not exists (select 1
from dbo.MDTest2 tbl2
where tbl2.PatientID = tbl1.PatientID
and tbl2.IndexEndDate > @dParamDate
and tbl2.IndexEndDate < tbl1.IndexEndDate);
This returns, for each patient, the record with the earliest end date after the specified date: a candidate record qualifies only if no other record for the same patient has an end date that is also greater than the specified date but earlier than the candidate's end date.
This avoids all functions, data type conversions, etc. This would work especially well with a clustered index on (PatientID, IndexEndDate).

Yes, fine on small data sets. But check out the execution plan - it does two scans of the table and therefore as the table gets larger and larger you are likely to see performance deteriorate.

John
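
One way to check the scan counts for yourself, sketched here against the dbo.MDTest2 test table from this thread, is to run the candidate queries with SET STATISTICS IO ON and compare the "Scan count" and "logical reads" figures reported for each statement. The two queries below are restatements of the NOT EXISTS and ROW_NUMBER approaches with explicit column lists. The figures depend on indexes and data volume, so none are quoted here.

```sql
DECLARE @AsAtDate datetime = '2015-06-01';

SET STATISTICS IO ON;   -- per-table scan counts and logical reads appear in the messages output

-- NOT EXISTS form: the table is referenced twice
SELECT t1.PatientID, t1.IndexStartDate, t1.IndexEndDate, t1.RecSource
FROM dbo.MDTest2 AS t1
WHERE t1.IndexEndDate > @AsAtDate
  AND NOT EXISTS (SELECT 1
                  FROM dbo.MDTest2 AS t2
                  WHERE t2.PatientID = t1.PatientID
                    AND t2.IndexEndDate > @AsAtDate
                    AND t2.IndexEndDate < t1.IndexEndDate);

-- ROW_NUMBER form: the table is referenced once
WITH Ranked AS (
    SELECT PatientID, IndexStartDate, IndexEndDate, RecSource,
           ROW_NUMBER() OVER (PARTITION BY PatientID
                              ORDER BY IndexEndDate) AS RowNum
    FROM dbo.MDTest2
    WHERE IndexEndDate > @AsAtDate
)
SELECT PatientID, IndexStartDate, IndexEndDate, RecSource
FROM Ranked
WHERE RowNum = 1;

SET STATISTICS IO OFF;
```

Comparing the actual execution plans on a production-sized copy of the data is still the more reliable test.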