Identifying Start Dates not Aligned with End Dates on a Prior Row

,

In a prior article published by SQL Server Central, Creating a Date Range from Multiple Rows Based on a Single Date, I discussed the various ways you can pair up rows when all you have is an effective date on each row, with the effective end date being implied by the effective date on the preceding row.

While in my opinion that is the preferred structuring of such data, I realize that sometimes a designer comes along and decides that each row should have both an effective start date and an effective end date.  Without proper constraints in place, this can lead to anomalies in the data, such as an end date overlapping a start date on the following row.

There are a few ways of structuring your data model to include constraints that enforce data integrity for the case where start and end dates appear on each row:

Avoiding such anomalies can also be done in the front-end application, however that seems like a complexity that can be avoided by a constraints-based approach.  From seeing some of the posts in the SSC forums (such as this one), I get the distinct feeling that either many folks don’t realize it can be done with constraints or they’re letting their front-end applications handle the chore but the approach they’ve chosen is at least partially ineffective.

So for all of you folks out there who have this data structure and now have to deal with the consequences of it, I’d like to offer you some alternatives to identify your persistent data anomalies so (presumably) you can make progress to correct them.

Sample Data Structure and Data

Drawing upon my previous SSC article, we’ll construct a table with rows containing effective start and end dates, and then populate that table with some sample data containing anomalies such as I’ve described.

CREATE TABLE dbo.ProductPrices (
      ProductID           INT
     ,EffectiveStartDT   DATETIME
     ,EffectiveEndDT     DATETIME
     ,ProductPrice       MONEY
     ,CONSTRAINT md1 PRIMARY KEY (ProductID, EffectiveStartDT)
 );
-- Insert some initial sample data 
INSERT INTO dbo.ProductPrices (
     ProductID, EffectiveStartDT, EffectiveEndDT, ProductPrice )
 VALUES (1, '2013-06-04', '2013-06-17', 15.25)
     ,(2, '2013-05-01', '2013-08-09', 42.13)
     ,(1, '2013-06-17', '2013-06-25', 16.33)
     ,(1, '2013-06-25', '2014-02-23', 16.45)
     ,(2, '2013-08-09', '2013-08-10', 45.88)
     ,(2, '2013-08-10', '2014-03-01', 45.88)
     ,(2, '2014-03-01', NULL, 45.85)
     ,(1, '2014-02-23', NULL, 16.65)
     ,(3, '2013-01-01', '2013-12-31', 17.77)
     ,(3, '2013-12-31', NULL, 22.75);
DECLARE @GetPriceOn DATETIME = '2013-12-31';
SELECT ProductID, EffectiveStartDT, EffectiveEndDT, ProductPrice
 FROM dbo.ProductPrices
 WHERE @GetPriceOn >= EffectiveStartDT
 AND    @GetPriceOn < ISNULL(EffectiveEndDT, '2099-12-31');
GO 
-- DROP TABLE dbo.ProductPrices;

When the final SELECT query is run, the result produces the “effective” prices on 31 Dec 2013.

ProductID  EffectiveStartDT          EffectiveEndDT            ProductPrice
1          2013-06-25 00:00:00.000   2014-02-23 00:00:00.000   16.45
2          2013-08-10 00:00:00.000   2014-03-01 00:00:00.000   45.88
3          2013-12-31 00:00:00.000   NULL                      22.75

Let’s now introduce two anomalies

  • ProductID=4 has a gap occurring on precisely the date for which we want the price (31 Dec 2013). 
  • ProductID=5 has an overlap occurring on that date.
-- Insert some more sample data with anomalies
INSERT INTO dbo.ProductPrices
 (
     ProductID, EffectiveStartDT, EffectiveEndDT, ProductPrice )
 VALUES (4, '2013-06-04', '2013-12-31', 18.25)
     ,(4, '2014-01-01', '2014-08-09', 22.13)
     ,(4, '2014-08-09', NULL, 25.13)
     ,(5, '2013-06-04', '2014-01-01', 18.25)
     ,(5, '2013-12-30', '2014-08-09', 22.13)
     ,(5, '2014-08-09', NULL, 25.13);
SELECT ProductID, EffectiveStartDT, EffectiveEndDT, ProductPrice
 FROM dbo.ProductPrices
 WHERE @GetPriceOn >= EffectiveStartDT
 AND   @GetPriceOn < ISNULL(EffectiveEndDT, '2099-12-31');

Note that you cannot create an anomaly where the same start date occurs more than once for a product, due to the construction of the PRIMARY KEY.  When we run our SELECT query again, this is the new result set.

ProductID  EffectiveStartDT          EffectiveEndDT            ProductPrice
1          2013-06-25 00:00:00.000   2014-02-23 00:00:00.000   16.45
2          2013-08-10 00:00:00.000   2014-03-01 00:00:00.000   45.88
3          2013-12-31 00:00:00.000   NULL                      22.75
5          2013-06-04 00:00:00.000   2014-01-01 00:00:00.000   18.25
5          2013-12-30 00:00:00.000   2014-08-09 00:00:00.000   22.13

The anomalies present show no price on 31 Dec 2013 for ProductID=4 (which may be the intent if you never expect to sell that product on that date) or two effective prices for ProductID=5 on that date (very unlikely to be something you want).

Our business rule thus suggests that for every non-NULL end date on a row, there should be a corresponding row that follows with that end date as the new start date.

Our goal, should we decide to accept it, is to identify any row after the first where its effective start date does not match the end date of the prior row.  This means that our target result set for this set of sample data is the following:

ProductID   EffectiveStartDT          EffectiveEndDT            ProductPrice
4           2014-01-01 00:00:00.000   2014-08-09 00:00:00.000   22.13
5           2013-12-30 00:00:00.000   2014-08-09 00:00:00.000   22.13

Solution 1: Using the SQL 2012 LAG Function

No doubt there will be those pundits that will exclaim how incredibly easy this is if you are using SQL 2012 and are familiar with the capabilities of the LAG analytic function available in that (but not prior versions).  Indeed, it would seem that the LAG function was built specifically with this sort of an issue in mind.  So without any further introduction, here is how to use LAG to produce the target result set.

SELECT ProductID, EffectiveStartDT, EffectiveEndDT, ProductPrice
FROM
(
    SELECT ProductID, EffectiveStartDT, EffectiveEndDT, ProductPrice
        ,d=LAG(EffectiveEndDT, 1) OVER
            (
                PARTITION BY ProductID ORDER BY EffectiveStartDT
            )
    FROM dbo.ProductPrices
) a
WHERE d IS NOT NULL AND d <> EffectiveStartDT;

Cases where d is NULL indicate the first row (there is no prior row), so those are eliminated.  We use LAG on EffectiveEndDT for each ProductID (the PARTITION column) and look back to the prior occurring row using EffectiveStartDT in the ORDER BY clause.  When the value retrieved by LAG is different than the EffectiveStartDT, we’ve encountered an anomaly.

We’ll leave it to the interested reader to verify that we’ve achieved our target result set.  

Solution 2: An Alternative that will Work Prior to SQL 2012

It is my experience that in university, budding developers studying the SQL language will be taught how to do a self-JOIN.  This capability has been around for as long as the language has had the ability to do JOINs, which is presumably forever.

Indeed, even many a seasoned developer will suggest this method to solve this particular problem, which it most certainly can.  So let’s take a look at how we would do that now, using some constructs that are available to us in SQL 2005, namely ROW_NUMBER() and Common Table Expressions or CTEs.

WITH Prices AS
(
    SELECT ProductID, EffectiveStartDT, EffectiveEndDT, ProductPrice
        ,rn=ROW_NUMBER() OVER (PARTITION BY ProductID ORDER BY EffectiveStartDT)
    FROM dbo.ProductPrices
)
SELECT b.ProductID, b.EffectiveStartDT, b.EffectiveEndDT, b.ProductPrice
FROM Prices a
JOIN Prices b ON a.ProductID = b.ProductID AND a.rn + 1 = b.rn
WHERE a.EffectiveEndDT <> b.EffectiveStartDT;

As it turns out, that wasn’t too difficult at all, and it gets exactly the target result set that we want.  However I should note that I personally despise self-JOINs because when your tables get really large they can become performance hogs.

Solution 3: A SQL 2005 Solution Using CROSS APPLY

Another solution that has gained a lot of popularity since it was introduced in SQL 2005 is CROSS APPLY, and indeed it too can be used to solve this problem.  For an excellent introduction to both CROSS and OUTER APPLY, I suggest the following articles by the renowned SQL MVP Paul White.

With a CROSS APPLY, we can dispense with the ROW_NUMBER() and don’t need to use a CTE either, which wasn’t really required in solution two for that matter (it was just for convenience).  Here is that solution.

SELECT a.ProductID, a.EffectiveStartDT, a.EffectiveEndDT, a.ProductPrice
FROM dbo.ProductPrices a
CROSS APPLY
(
    SELECT TOP 1 b.ProductID, b.EffectiveStartDT, b.EffectiveEndDT, b.ProductPrice
    FROM dbo.ProductPrices b
    WHERE a.ProductID = b.ProductID AND b.EffectiveStartDT < a.EffectiveStartDT
    ORDER BY b.EffectiveStartDT DESC
) b
WHERE a.EffectiveStartDT <> b.EffectiveEndDT;

Once again this produces the target row set.

Solution 4: Another Solution that will Work in SQL Versions Prior to 2012

I’m not sure why it is, but for some reason I’m always forgetting to think about solutions that use EXISTS in the WHERE clause.  Since I have that tendency, I need to force myself to do so.  So let’s think about this problem and see if we can devise a solution using EXISTS.  Let’s try this:

SELECT a.ProductID, a.EffectiveStartDT, a.EffectiveEndDT, a.ProductPrice
FROM dbo.ProductPrices a
WHERE NOT EXISTS
(
    SELECT 1
    FROM ProductPrices b
    WHERE a.ProductID = b.ProductID AND b.EffectiveEndDT = a.EffectiveStartDT
);

This produces the following results set, which certainly does not produce the number of rows in our target set.

ProductID  EffectiveStartDT         EffectiveEndDT           ProductPrice
1          2013-06-04 00:00:00.000  2013-06-17 00:00:00.000  15.25
2          2013-05-01 00:00:00.000  2013-08-09 00:00:00.000  42.13
3          2013-01-01 00:00:00.000  2013-12-31 00:00:00.000  17.77
4          2013-06-04 00:00:00.000  2013-12-31 00:00:00.000  18.25
4          2014-01-01 00:00:00.000  2014-08-09 00:00:00.000  22.13
5          2013-06-04 00:00:00.000  2014-01-01 00:00:00.000  18.25
5          2013-12-30 00:00:00.000  2014-08-09 00:00:00.000  22.13

However upon close inspection, we see that the two rows we need are included (bold) and the other rows happen to be the first (earliest) effective start date for each ProductID.  Perhaps we can eliminate those using a ROW_NUMBER(), thusly:

SELECT a.ProductID, a.EffectiveStartDT, a.EffectiveEndDT, a.ProductPrice
FROM
(
    SELECT a.ProductID, a.EffectiveStartDT, a.EffectiveEndDT, a.ProductPrice
        ,rn=ROW_NUMBER() OVER (PARTITION BY a.ProductID ORDER BY EffectiveStartDT)
    FROM dbo.ProductPrices a
    WHERE NOT EXISTS
    (
        SELECT 1
        FROM ProductPrices b
        WHERE a.ProductID = b.ProductID AND b.EffectiveEndDT = a.EffectiveStartDT
    )
) a
WHERE rn <> 1;

Indeed, this query now produces precisely the target results set that we seek.

The One Million Row Test Harness

To find out if any query is worth its salt, you should always try it in the one million row test harness.  Here’s some reading on that:

This test harness works best when you have more than one query that you want to compare, and in our case we have four so that may be good enough.  To generate the test data:

TRUNCATE TABLE dbo.ProductPrices;
WITH Tally (n) AS (
     SELECT TOP (10000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
     FROM sys.all_columns a CROSS JOIN sys.all_columns b )
 INSERT INTO dbo.ProductPrices (
     ProductID, EffectiveStartDT, EffectiveEndDT, ProductPrice )
 SELECT ProductID=n
     -- Adjust some start dates back to get overlaps
     ,StartDate  = CASE m%13 WHEN 0 THEN StartDate-2 ELSE StartDate END
     -- Adjust some end dates back to get gaps
     ,EndDate    = CASE
                     -- Last row end date is NULL
                     WHEN m = 100 THEN NULL
                     WHEN m%16 = 0 THEN EndDate-4
                     ELSE EndDate
                     END
     ,Price      = 10+(ABS(CHECKSUM(NEWID()))%1000)/100.
 FROM (
     SELECT TOP 100 n, m
         ,StartDate          -- Align the end dates with the start date on the next row
         ,EndDate    = LEAD(StartDate, 1) OVER (PARTITION BY n ORDER BY StartDate)
     FROM
     (
         SELECT a.n, m=b.n
             ,StartDate=DATEADD(day, 10*b.n + CHECKSUM(NEWID())%3, '2010-01-01')
         FROM Tally a
         CROSS APPLY
         (
             SELECT n
             FROM Tally b
             WHERE n <= 100
         ) b
     ) a
 ) a;
SELECT TOP 20 ProductID, EffectiveStartDT, EffectiveEndDT, ProductPrice 
FROM dbo.ProductPrices 
ORDER BY ProductID, EffectiveStartDT;

This is a somewhat more advanced 1M row test harness than described in any of the articles I’ve linked to, because of the way the intervals need to be set-up and matched, then adjusted.  But we believe that with a little study you should be able to grasp what’s going on and use the same technique in any future effort you may find relevant.

Note that it will only run in SQL 2012 because it uses LEAD, so if you are using an earlier version of SQL Server you’ll need to do some gymnastics to generate matching end dates where needed.

The table gets populated with 10,000 products with start/end intervals that are close to 10 days plus or minus 3 days, which are then adjusted at rows divisible by 13 or 16 to produce overlaps and gaps.  This is shown by SELECTing the first 20 rows, where I have highlighted the anomalies below (in bold). 

ProductID  EffectiveStartDT         EffectiveEndDT           ProductPrice
1          2010-01-09 00:00:00.000  2010-01-19 00:00:00.000  11.35
1          2010-01-19 00:00:00.000  2010-01-31 00:00:00.000  19.82
1          2010-01-31 00:00:00.000  2010-02-10 00:00:00.000  17.66
1          2010-02-10 00:00:00.000  2010-02-20 00:00:00.000  15.28
1          2010-02-20 00:00:00.000  2010-03-04 00:00:00.000  15.61
1          2010-03-04 00:00:00.000  2010-03-12 00:00:00.000  11.38
1          2010-03-12 00:00:00.000  2010-03-20 00:00:00.000  11.75
1          2010-03-20 00:00:00.000  2010-04-01 00:00:00.000  17.27
1          2010-04-01 00:00:00.000  2010-04-10 00:00:00.000  17.00
1          2010-04-10 00:00:00.000  2010-04-20 00:00:00.000  13.52
1          2010-04-20 00:00:00.000  2010-04-30 00:00:00.000  14.73
1          2010-04-30 00:00:00.000  2010-05-11 00:00:00.000  11.63
1          2010-05-09 00:00:00.000  2010-05-22 00:00:00.000  12.79 <-- overlaps prior
1          2010-05-22 00:00:00.000  2010-05-31 00:00:00.000  10.97
1          2010-05-31 00:00:00.000  2010-06-11 00:00:00.000  12.84
1          2010-06-11 00:00:00.000  2010-06-17 00:00:00.000  10.45 <-- gap before next
1          2010-06-21 00:00:00.000  2010-07-02 00:00:00.000  15.78
1          2010-07-02 00:00:00.000  2010-07-10 00:00:00.000  17.12
1          2010-07-10 00:00:00.000  2010-07-22 00:00:00.000  14.36
1          2010-07-22 00:00:00.000  2010-07-28 00:00:00.000  15.55

You can verify (by selecting the TOP 100 from the table) that the last row for each ProductID is NULL as it should be.

Performance Testing Results

In the resources attached to this article, you’ll find a single SQL script which does the following:

  • Creates the ProductPrices table.
  • Populates the ProductPrices table with the 1M row test harness.  The original sample data is included also but commented out if you want to play around with it.
  • Runs the four queries with results rendered to the results pane.  I use this for checking that row counts are the same and otherwise spot checking the results.
  • Runs the four queries again, but this time shunting the results to local variables and displaying elapsed and CPU times in the Messages pane.

Since some queries are improved by good INDEXing, we also tried using this statement to create a covering INDEX (non-clustered) on the table.  This too is present in the resources script but commented out for the initial run.

CREATE INDEX md2
    ON dbo.ProductPrices (ProductID, EffectiveEndDT)
    INCLUDE (EffectiveStartDT, ProductPrice);

The easiest method for showing the relative timings is graphically, so here are those charts for comparison.

We see from these results that the fastest (elapsed time) of the query, regardless of whether the covering INDEX is present is our solution four – the one we almost forgot to try!  The version without the covering INDEX is faster than the one with the INDEX.  If CPU is the measure you need to tune to, and you don’t have a covering INDEX to draw upon, solution two may be the better option.

Performance testing was performed on a Dell laptop running Windows 7 64-bit on a Core-i5 2.6 GHz processor with 12GB RAM, using SQL 2012 64-bit version.

Query Plans

For the sake of completeness, we’ll show the graphical execution plans for each of our four solutions.  These are the plans generated when there are 1M rows of test data present (they could be different if all you are dealing with is the small initial sample).  The first plans are returned when there is no non-clustered covering INDEX.

Here are the plans with the non-clustered covering INDEX.

It is interesting to note that the “query cost relative to the batch” reported by SQL Server for solution one is lower than other query costs for both cases and significantly lower for the no covering INDEX case.  I’m not saying it is wrong – more likely it is simply calculating a weighted cost using various factors.  The lesson here is to not rely solely upon it when evaluating the expected elapsed time performance of queries you are comparing.

It is also interesting to note that the plans are basically the same except that with the covering INDEX, SQL Server chose to use the non-clustered index in solutions three and four, but it only improved the elapsed time performance for query four.  Negative values in the “Improvement with INDEX” columns in the table below indicate longer elapsed or more CPU time usage in the INDEXed query.

Without INDEX

With INDEX

Improvement With INDEX

CPU (ms)

Elapsed (ms)

CPU (ms)

Elapsed (ms)

CPU (ms)

Elapsed (ms)

Solution 1

2387

2532

1700

1760

29%

30%

Solution 2

1373

1382

1389

1424

-1%

-3%

Solution 3

4554

1471

4711

1696

-3%

-15%

Solution 4

2261

640

546

543

76%

15%

Note that it is not clear why solution one’s performance improved with the covering INDEX, as the query plan doesn’t show it being used, however it doesn’t really matter because it still trails the pack in elapsed time, and only comes in third in CPU time.

What We Learned

Due to design choices you will often find effective dates applied to rows in a table including both start and end dates.  While that is not my usual choice, there may be other reasons for it.  This can lead to data anomalies, which can be avoided by a constraint-based strategy, or simply identified and corrected after the fact.

We have chosen four queries (undoubtedly there are others), only one of which requires the use of SQL Server 2012, which can all be used pretty effectively to identify these data anomalies.  It was interesting to find that the new SQL 2012 version, wasn’t the fastest.

We’ve also shown how to construct a 1M row test harness to demonstrate which of our queries performs the best.

We’ve also shown that even the best query you can construct may be improved by adding a covering INDEX (no surprise there).

In all of my articles I have a tradition of omitting certain things and leaving them to the imagination and ingenuity of my valued readers.  So consider these:

  • When you read the goal stated for the query solutions, you may realize that instead of the start date anomaly, you may wish to identify rows with end dates for which there are no corresponding start dates.  I believe that all of the queries presented could be modified to accommodate this scenario.
  • Correcting the data, whether that is the anomalous start or end date, is also something that can be readily accomplished using the queries provided in conjunction with an UPDATE, however that is not described.

Until next time folks, happy querying!

Follow me on Twitter: @DwainCSQL

© Copyright Dwain Camps Mar 13, 2015 All Rights Reserved

Resources

Rate

4.46 (13)

Share

Share

Rate

4.46 (13)