Counting the number of unique days between ranges of dates

Question

Counting the number of unique days between ranges of dates

Viewing 6 posts - 16 through 20 (of 20 total)

You must be logged in to reply to this topic. Login to reply

nicod Old Hand Points: 330 More actions · Answer 1

Thanks all for your help with this one. Reading through all your suggestions it looks like using a tally is what'll work for us. The dataset is only about 2,800 rows so performance isn't a huge problem.

Also came across this solution which is very similar and seems to work on the test data:

https://bertwagner.com/posts/gaps-and-islands/

jcelko212 32090 SSCrazy Eights Points: 9295 More actions · Answer 2

>> Example (dates changed to integers to simplify the problem): <<

Don't do this. We were very proud of putting temporal data types into the language. Unfortunately, what you posted for DDL is not a table and can never be a table, by definition. A key not as an option, a table has to have a key. Keys cannot have NULLs, by definition. The first thing we need to do is try to fix your DDL and turn it into a real relational database instead of a deck of punch cards written in SQL.

Since a table models a set of entities, the name should be a collective noun or plural. Unless of course you really do have only one element in the set. The reason people use singular names as the due assembly early standards for models of unit record processing like IDEF. The reason they used singular names is that they did record at a time processing so that was the unit of work. SQL uses sets

Temporal data also comes in intervals. In SQL Server, we have to model it with start and stop timestamps. The ISO 11179 naming conventions use"<attribute name>_<attribute property>", instead of the old file system single name conventions.

An identifier ("_id") is on a nominal scale and cannot be integers again, by definition. Have you ever read any introduction to data modeling so that you will know what the scales and units of measurement are for a valid data model?

Your table should have looked like this in all probability:

CREATE TABLE UnitMemberships

(membership_id CHAR(5) NOT NULL,

unit_id CHAR(5) NOT NULL,

PRIMARY KEY (membership_id, unit_id, membership_start_date), --- required in RDBMS!!

membership_start_date DATE NOT NULL,

membership_end_date DATE NOT NULL,

CHECK (membership_start_date <= membership_end_date)

);

Please pay attention to how the primary key is declared. After 30 years of doing this. I have found that failure to do valid DDL is the cause of most failed RDBM projects. And it's also the reason I can charge $$$$$ as a consultant when I come in to fix them. :-).

There is no reason to use an old row at a time INSERT INTO syntax that was required decades ago, in the first releases of Sybase SQL Server. We also have date datatypes now in SQL Server (this was always part of the ANSI ISO standards but Microsoft was slow).

INSERT INTO UnitMemberships

VALUES

('00001', '00001', '2021-01-01', '2021-01-05'),

('00001', '00002', '2021-01-08', '2021-01-15'),

('00001', '00003', '2021-01-10', '2021-01-18');

The reason for writing it all in one values clause is that this lets the optimizer take the table constructor as a whole, instead of having to do it row by row by row as if you were still using punch cards.

You'll probably want to download a copy of "Temporal Queries in SQL" by Rick Snodgrass. It's a free PDF file from the University of Arizona. It is a classic and it should be on your bookshelf. One of the standard programming idioms in SQL is a calendar table. It's just what you think it is but may include other information.

To get the set of all the dates that are included in your membership table, you can use:

SELECT COUNT (DISTINCT C.cal_date) AS unique_membership_date_count

FROM Calendar AS C, UnitMemberships AS U

WHERE C.cal_date BETWEEN U.membership_start_date AND U.membership_end_date;

Generally speaking, you want to avoid using function calls and computations in SQL. The language was designed for set-oriented operations and data manipulation. This is not FORTRAN.

Please post DDL and follow ANSI/ISO standards when asking for help.

Jeff Moden SSC Guru Points: 1003875 More actions · Answer 3

nicod wrote:

Thanks all for your help with this one. Reading through all your suggestions it looks like using a tally is what'll work for us. The dataset is only about 2,800 rows so performance isn't a huge problem.
Also came across this solution which is very similar and seems to work on the test data:
https://bertwagner.com/posts/gaps-and-islands/

Bert's solution doesn't appear to handle multiple distinct sets. Could you post the actual code you used?

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Johan Bijnens SSC Guru Points: 135265 More actions · Answer 4

nicod wrote:

...Example (dates changed to integers to simplify the problem) ...

Just wondering how the current solution performs with date/datetime data types on a larger set of logging data.

Johan

Learn to play, play to learn !

Dont drive faster than your guardian angel can fly ...
but keeping both feet on the ground wont get you anywhere :w00t:

- How to post Performance Problems
- How to post data/code to get the best help[/url]

- How to prevent a sore throat after hours of presenting ppt

press F1 for solution, press shift+F1 for urgent solution 😀

Need a bit of Powershell? How about this

Who am I ? Sometimes this is me but most of the time this is me

DesNorton SSC-Insane Points: 24283 More actions · Answer 5

So, we have 2 problems to solve

Overlapping ranges
Islands

Bert's solution can be tweaked to 1st consolidate the overlapping sets, then calculate the islands, and finally do our range calculations.

DECLARE @UnitMembership table (
    IdentityID int NOT NULL
  , UnitID     int NOT NULL
  , StartDate  int NOT NULL
  , EndDate    int NOT NULL
);

INSERT INTO @UnitMembership ( IdentityID, UnitID, StartDate, EndDate )
VALUES-- Jeffs data set -------------------------------------------
        (6, 1, 1, 10)
      , (6, 2, 2, 9)
      , (6, 3, 3, 8)
      , (6, 4, 4, 7)
      , (6, 5, 5, 6)
      , (6, 6, 5, 5)
      -- Overlapping data set --------------------------------------
      , (9, 1, 1, 3)
      , (9, 2, 5, 7)
      , (9, 3, 6, 10)
      , (9, 4, 7, 8)
      , (9, 5, 9, 11)
      , (9, 5, 16, 18)
      , (9, 6, 17, 19);

WITH cteOverlaps AS (
      -- We need to know the previous end date, so that we can determine whether the current record is the start of a new island
      SELECT  um.IdentityID
            , um.StartDate
            , um.EndDate
            , PrevEndDate = LAG(um.EndDate, 1, 0) OVER (PARTITION BY um.IdentityID ORDER BY um.StartDate, um.EndDate)
      FROM    @UnitMembership AS um
)
, cteIslands AS (
      -- Now we calculate the start of each island
      SELECT o.IdentityID, o.StartDate, o.EndDate
           , isStartNewIsland = CASE
                                  WHEN o.StartDate > MAX(o.PrevEndDate) OVER (PARTITION BY o.IdentityID ORDER BY o.StartDate, o.EndDate
                                                                              ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
                                      THEN 1
                                  ELSE 0
                                END
      FROM cteOverlaps AS o
)
, cteGroups AS (
      -- A windowed running total will define the group
      SELECT  i.IdentityID, i.StartDate, i.EndDate, i.isStartNewIsland
            , GroupID = SUM(i.isStartNewIsland) OVER (PARTITION BY i.IdentityID ORDER BY i.StartDate, i.EndDate
                                                      ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
      FROM cteIslands AS i
)
-- Finally we do our calculations per IdentityID and GroupID
SELECT g.IdentityID
     , StartDate = MIN(g.StartDate)
     , EndDate   = MAX(g.EndDate)
     , Duration  = MAX(g.EndDate) - MIN(g.StartDate) +1
FROM cteGroups AS g
GROUP BY g.IdentityID, g.GroupID
ORDER BY g.IdentityID, g.GroupID

How to post data/code on a forum to get the best help.

The Dixie Flatline SSC Guru Points: 53354 More actions · Answer 6

The solution seems much simpler, once you look at the data sorted by startdate and enddate within each IdentityID. A gap only occurs when the current start date is at least two days greater than the preceding maximum end date. Could someone (Jeff?) try to break the following code?

DECLARE @UnitMembership TABLE
        (
        IdentityID int NOT NULL,
        UnitID int not NULL,
        StartDate INT not NULL,
        EndDate INT not NULL,
primary key(identityid,startdate,enddate,unitid)
        );

INSERT  @UnitMembership
VALUES  (1, 1, 1, 5),
        (1, 2, 8, 15),
        (1, 3, 14, 18),
(6, 4,1, 10),
(6, 5,2, 9), 
(6, 6,3, 8), 
(6, 7,4, 7), 
(6, 8,5, 6), 
(6, 9,5, 5),
(9,10,1,30),
(9,11,2,10),
(9,12,15,25); 




set statistics time, io on;

with cte as (select IdentityID, StartDate, EndDate
, MX = max(EndDate) over(partition by IdentityID order by StartDate, EndDate ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)
from @UnitMembership e
)


select IdentityID, TotalDays = 1 + Max(EndDate) - Min(StartDate)
, MissedDays = sum(case when StartDate > MX + 1 then StartDate - MX - 1 else 0 end ) 
from cte
group by IdentityID
order by IdentityID

set statistics time, io off;

select * from @UnitMembership

This reply was modified 3 years, 2 months ago by The Dixie Flatline. Reason: Had the wrong version of code. TotalDays wasn't calculated correctly

__________________________________________________

Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills