Grouping dates and IDs together

Question

Grouping dates and IDs together

Brandie Tarvin

SSC Guru

Points: 172991
More actions
February 9, 2018 at 12:35 pm

#343592

I have what at first (and second and fifth) glance appears to be an impossible task. I have to take a table with multiple rows per ID and group it across different dates. I've tried CTEs, cursors, WHILE loops and MERGE Upserts all in an effort to get around this problem and at the end of the day, something is always going wrong with the data. The worst part is I can't use MAX() and MIN() effectively. (Note: this problem will look familiar to those who helped me with a similar but slightly different issue a couple of weeks ago).
Here's what I've got (data obfuscated, so if something seems off, please ask):
CREATE TABLE #MyTemp (Office VARCHAR(20), POID INT, DeliverySite INT, AmtDollars MONEY, StartDate DATE, EndDate DATE, DelSite_Order TINYINT, POID_Order TINYINT)
INSERT INTO #MyTemp (Office, POID, DeliverySite, AmtDollars, StartDate, EndDate, DelSite_Order, POID_Order)
VALUES ('ABC3566', 1071581, 1821, 50.00, '08/18/2010', '02/23/2011', 1, 1),
('ABC3566', 1071578, 1821, 50.00, '09/03/2010', '05/01/2013', 1, 2),
('ABC3566', 1092732, 1821, 50.00, '07/01/2011', '02/11/2014', 1, 3),
('ABC3566', 1195618, 1821, 75.00, '02/12/2014', '06/05/2014', 1, 4),
('ABC3566', 1195621, 1821, 75.00, '02/12/2014', '06/03/2014', 1, 5),
('ABC3566', 1200240, 1821, 125.00, '06/06/2014', '07/31/2014', 1, 6),
('ABC3566', 1204971, 1821, 125.00, '08/01/2014', '10/31/2014', 1, 7),
('ABC3566', 1204980, 1821, 125.00, '08/01/2014', '10/31/2014', 1, 8),
('ABC3566', 1214303, 1821, 50.00, '11/01/2014', '01/12/2015', 1, 9),
('ABC3566', 1214303, 3882, 50.00, '11/01/2014', '01/12/2015', 2, 9),
('ABC3566', 1216923, 1821, 50.00, '01/13/2015', '09/30/2015', 1, 10),
('ABC3566', 1216923, 3882, 50.00, '01/13/2015', '09/30/2015', 2, 10),
('ABC3566', 1216926, 1821, 50.00, '01/13/2015', '09/30/2015', 1, 11),
('ABC3566', 1216926, 3882, 50.00, '01/13/2015', '09/30/2015', 2, 11),
('ABC3566', 1228334, 1821, 50.00, '10/01/2015', '06/30/2016', 1, 12),
('ABC3566', 1228334, 3882, 50.00, '10/01/2015', '06/30/2016', 2, 12),
('EFJ9833', 1072026, 2607, 25.00, '11/01/2010', '05/01/2013', 1, 1),
('EFJ9833', 1072026, 2608, 25.00, '11/01/2010', '05/01/2013', 2, 1),
('EFJ9833', 1099563, 2607, 25.00, '07/01/2011', '06/04/2013', 1, 2),
('EFJ9833', 1099563, 2608, 25.00, '07/01/2011', '06/04/2013', 2, 2),
('EFJ9833', 1189651, 2607, 25.00, '06/05/2013', '05/14/2014', 1, 3),
('EFJ9833', 1189651, 2608, 25.00, '06/05/2013', '05/14/2014', 2, 3),
('EFJ9833', 1189651, 3504, 25.00, '06/05/2013', '05/14/2014', 3, 3),
('EFJ9833', 1199690, 2607, 25.00, '05/15/2014', '06/30/2014', 1, 4),
('EFJ9833', 1199690, 3784, 25.00, '05/15/2014', '06/30/2014', 2, 4),
('EFJ9833', 1199690, 3785, 25.00, '05/15/2014', '06/30/2014', 3, 4),
('EFJ9833', 1209430, 2607, 25.00, '07/01/2014', '07/31/2014', 1, 5),
('EFJ9833', 1209430, 3784, 25.00, '07/01/2014', '07/31/2014', 2, 5),
('EFJ9833', 1209430, 3828, 25.00, '07/01/2014', '07/31/2014', 3, 5),
('EFJ9833', 1209513, 2607, 25.00, '08/01/2014', '09/30/2015', 1, 6),
('EFJ9833', 1209513, 3784, 25.00, '08/01/2014', '09/30/2015', 2, 6),
('EFJ9833', 1209513, 3828, 25.00, '08/01/2014', '09/30/2015', 3, 6);
What I've got to do is combine data sets based on the start / end date, the POID, and the amount. So, for example, office ABC3566 has 3 rows initially from 08/18/2010 to the "max" end date of 2/11/2014 for $50.00 heading to delivery site 1821. Since the amount then changes to $75.00 (but for the same site), I should get a new record and the same when it increases to $125.00. When the delivery sites break into 2 different sites, there should be one new record each for each site. Therefore the records should look like this:
ABC3566   1821 $50.00 08/18/2010 02/11/2014
ABC3566   1821 $75.00 02/12/2014 06/05/2014 <- Note the "05", not the "03" day even though the order puts "03" last.
ABC3566   1821 $125.00 06/06/2014 10/31/2014
ABC3566   1821 $50.00 11/01/2014 06/30/2016
ABC3566   3882 $50.00 11/01/2014 06/30/2016
My biggest problem here is that because 1821 has two different time frames in which the amount is $50.00, the MAX EndDate ends up being 6/30/2016 for the first entry when it shouldn't be. No matter how I group things, I can't seem to fix this problem.
With the next one, EFJ9833, the results should look like this:
EFJ9833 2607 $25.00 11/01/2010 06/04/2013
EFJ9833 2608 $25.00 11/01/2010 06/04/2013
EFJ9833 2607 $25.00 06/05/2013 05/14/2014
EFJ9833 2608 $25.00 06/05/2013 05/14/2014
EFJ9833 3504 $25.00 06/05/2013 05/14/2014
EFJ9833 2607 $25.00 05/15/2014 06/30/2014
EFJ9833 3784 $25.00 05/15/2014 06/30/2014
EFJ9833  3785 $25.00 05/15/2014 06/30/2014
EFJ9833 2607 $25.00 07/01/2014 09/30/2015
EFJ9833 3784 $25.00 07/01/2014 09/30/2015
EFJ9833  3828 $25.00 07/01/2014 09/30/2015
And then, just when the dates seem to be making sense, i.e., a StartDate occurring a day after a EndDate, there's data like this:
INSERT INTO #MyTemp (Office, POID, DeliverySite, AmtDollars, StartDate, EndDate, DelSite_Order, POID_Order) VALUES ('XKA9523', 1043624, 1987, 25.00, '11/01/2010', '05/01/2013', 1, 1), ('XKA9523', 1100518, 2005, 25.00, '12/01/2011', '05/01/2013', 1, 1), ('XKA9523', 1096836, 1592, 40.00, '09/01/2015', '06/04/2016', 1, 2);
Which needs to be:
XKA9523 1987 $25.00 11/01/2010 05/01/2013
XKA9523 2005 $25.00 12/01/2011 05/01/2013 <- Notice this StartDate starts before the last ENDDAte
XKA9523 1592 $40.00 09/01/2015 06/04/2016 <- Notice this StartDate has skipped several months between the last
So if a StartDate happens before or after the last EndDate but is for the same Office and the same DeliverySite with the same dollar amount, it gets grouped in with the previous record. If it happens before or after the last EndDate for a different DeliverySite *or* a different amount, it gets its own record. The "order" columns were done by me using RANK() and DENSE_RANK() in an effort to group DeliverySites by Office and in order of site #. Because I had no other way of effectively grouping this information it was the best I could do.
I've been banging my head against this particular wall for 2 weeks now and still am no closer to a solution. Each time I solve the problem for one office, I break it for another.
Any thoughts on where I can start?
EDIT: <headdesk> My apologies. I thought I posted this in the T-SQL forum but my mouse must have skipped up when I hit the forum header. I'm not going to double-post, so please answer here.
Brandie Tarvin, MCITP Database AdministratorLiveJournal Blog: http://brandietarvin.livejournal.com/[/url]On LinkedIn!, Google+, and Twitter.Freelance Writer: ShadowrunLatchkeys: Nevermore, Latchkeys: The Bootleg War, and Latchkeys: Roscoes in the Night are now available on Nook and Kindle.

Viewing 14 posts - 1 through 13 (of 13 total)

You must be logged in to reply to this topic. Login to reply

drew.allen SSC Guru Points: 76943 More actions · Answer 1

This gives the results that you are looking for. I had to group based on the delivery sites per POID in order to get the "splits" correct.

; WITH DeliverySites AS ( SELECT *, STUFF( ( SELECT ',', DeliverySite AS [text()] FROM #MyTemp sub WHERE main.POID = sub.POID ORDER BY DeliverySite FOR XML PATH(''), TYPE ).value('.', 'VARCHAR(MAX)') , 1, 1, '') AS DeliverySites FROM #MyTemp main ) , Starts AS ( SELECT *, CASE WHEN AmtDollars = LAG(AmtDollars) OVER(PARTITION BY Office, DeliverySites, DeliverySite ORDER BY StartDate) THEN 0 ELSE 1 END IsStart FROM DeliverySites ) , Groups AS ( SELECT Office, DeliverySite, AmtDollars, StartDate, EndDate, DeliverySites, SUM(IsStart) OVER(PARTITION BY Office, DeliverySites, DeliverySite ORDER BY StartDate ROWS UNBOUNDED PRECEDING) AS Grp FROM Starts ) SELECT Office, DeliverySite, AmtDollars, MIN(StartDate), MAX(EndDate) FROM Groups GROUP BY Office, DeliverySites, DeliverySite, AmtDollars, Grp ORDER BY Office, MIN(StartDate), DeliverySite

Drew

J. Drew Allen
Business Intelligence Analyst
Philadelphia, PA

Lynn Pettis SSC Guru Points: 442467 More actions · Answer 2

I am curious why in some cases an adjacent date range is merged but in others they aren't. What I mean is the enddate for one range is the day before the start of the next (range a enddate 2014-07-31 and range b start date 2014-08-01 for the same Office, DeliverySite, and AmtDollars.

drew.allen SSC Guru Points: 76943 More actions · Answer 3

If you look at POID 1099563, it delivers to sites 2607 and 2608, but POID 1189651 delivers to sites 2607, 2608, and 3504. I treated those as a break even though there is no break if you just look at sites 2607 and 2608, because that was the only way that I could get my results to match the desired results

Drew

J. Drew Allen
Business Intelligence Analyst
Philadelphia, PA

Lynn Pettis SSC Guru Points: 442467 More actions · Answer 4

drew.allen - Friday, February 9, 2018 3:29 PM
If you look at POID 1099563, it delivers to sites 2607 and 2608, but POID 1189651 delivers to sites 2607, 2608, and 3504. I treated those as a break even though there is no break if you just look at sites 2607 and 2608, because that was the only way that I could get my results to match the desired results
Drew

That really isn't very intuitive. You must have spent some time on this or something pattern-wise just struck you as I didn't catch that.

Brandie Tarvin SSC Guru Points: 172991 More actions · Answer 5

drew.allen - Friday, February 9, 2018 3:29 PM
If you look at POID 1099563, it delivers to sites 2607 and 2608, but POID 1189651 delivers to sites 2607, 2608, and 3504. I treated those as a break even though there is no break if you just look at sites 2607 and 2608, because that was the only way that I could get my results to match the desired results
Drew

That's exactly what I need to do. And yes, Lynn, it's not intuitive. In fact, I told our project manager that our data was incompatible with the requirements. But the requirements are the way it needs to be done because Reasons.

Thanks, Drew. I'll check your code out. I toyed with the idea of using STUFF() for about 5 minutes but couldn't see how to apply it, so went on to try nested cursors and nested while loops, etc. Maybe I should have actually tried STUFF() instead of just looked at it. @=)

Brandie Tarvin, MCITP Database AdministratorLiveJournal Blog: http://brandietarvin.livejournal.com/[/url]On LinkedIn!, Google+, and Twitter.Freelance Writer: ShadowrunLatchkeys: Nevermore, Latchkeys: The Bootleg War, and Latchkeys: Roscoes in the Night are now available on Nook and Kindle.

Brandie Tarvin SSC Guru Points: 172991 More actions · Answer 6

Drew, thank you so very much. I'd never heard of LAG() or the ROWS UNBOUNDED PRECEDING stuff until now. I love this job because I learn new things every day.

The kicker to this specific project is that I have to check delivery sites for overlapping dates / amounts and error them out. Which would be my next hurdle if I hadn't gotten your help on the "Tying MAX() value" thread for the last project a couple of weeks ago. You just sorted out a whole bunch of problems with two posts. Thank you so very much.

Brandie Tarvin, MCITP Database AdministratorLiveJournal Blog: http://brandietarvin.livejournal.com/[/url]On LinkedIn!, Google+, and Twitter.Freelance Writer: ShadowrunLatchkeys: Nevermore, Latchkeys: The Bootleg War, and Latchkeys: Roscoes in the Night are now available on Nook and Kindle.

Brandie Tarvin SSC Guru Points: 172991 More actions · Answer 7

BAH. I thought I was through with this, but I just found an outlier.

When testing with the this data:

INSERT INTO #MyTemp (Office, POID, DeliverySite, AmtDollars, StartDate, EndDate, DelSite_Order, POID_Order) VALUES ('KTY4568', 1071581, 2056, 50.00, '01/30/2012', '07/31/2013', 1, 1), ('KTY4568', 1071578, 5943, 50.00, '08/01/2013', '09/02/2013', 1, 2), ('KTY4568', 1092732, 2056, 50.00, '09/03/2013', '12/12/2013', 1, 3)

I get

KTY4568 2056 $50.00 01/30/2012 12/12/2013
KTY4568 5943 $50.00 08/01/2013 09/02/2013

When I should be getting:

KTY4568 2056 $50.00 01/30/2012 07/31/2013
KTY4568 5943 $50.00 08/01/2013 09/02/2013
KTY4568 2056 $50.00 09/03/2013 12/12/2013

Thoughts?

EDIT: I think I may have to do these ones manually. Fortunately, there only appears to be about 9 offices affected by this.

Brandie Tarvin, MCITP Database AdministratorLiveJournal Blog: http://brandietarvin.livejournal.com/[/url]On LinkedIn!, Google+, and Twitter.Freelance Writer: ShadowrunLatchkeys: Nevermore, Latchkeys: The Bootleg War, and Latchkeys: Roscoes in the Night are now available on Nook and Kindle.

Lynn Pettis SSC Guru Points: 442467 More actions · Answer 8

Brandie Tarvin - Monday, February 12, 2018 11:46 AM
BAH. I thought I was through with this, but I just found an outlier.
When testing with the this data:
INSERT INTO #MyTemp (Office, POID, DeliverySite, AmtDollars, StartDate, EndDate, DelSite_Order, POID_Order) VALUES ('KTY4568', 1071581, 2056, 50.00, '01/30/2012', '07/31/2013', 1, 1), ('KTY4568', 1071578, 5943, 50.00, '08/01/2013', '09/02/2013', 1, 2), ('KTY4568', 1092732, 2056, 50.00, '09/03/2013', '12/12/2013', 1, 3)
I get
KTY4568 2056 $50.00 01/30/2012 12/12/2013
KTY4568 5943 $50.00 08/01/2013 09/02/2013
When I should be getting:
KTY4568 2056 $50.00 01/30/2012 07/31/2013
KTY4568 5943 $50.00 08/01/2013 09/02/2013
KTY4568 2056 $50.00 09/03/2013 12/12/2013

Thoughts?

So this goes back to sometimes adjacent time spans aren't joined together. Is there something that tells you that or is it institutional knowledge that you have to code for in the query?

drew.allen SSC Guru Points: 76943 More actions · Answer 9

Brandie Tarvin - Monday, February 12, 2018 11:46 AM
BAH. I thought I was through with this, but I just found an outlier.
When testing with the this data:
INSERT INTO #MyTemp (Office, POID, DeliverySite, AmtDollars, StartDate, EndDate, DelSite_Order, POID_Order) VALUES ('KTY4568', 1071581, 2056, 50.00, '01/30/2012', '07/31/2013', 1, 1), ('KTY4568', 1071578, 5943, 50.00, '08/01/2013', '09/02/2013', 1, 2), ('KTY4568', 1092732, 2056, 50.00, '09/03/2013', '12/12/2013', 1, 3)
I get
KTY4568 2056 $50.00 01/30/2012 12/12/2013
KTY4568 5943 $50.00 08/01/2013 09/02/2013
When I should be getting:
KTY4568 2056 $50.00 01/30/2012 07/31/2013
KTY4568 5943 $50.00 08/01/2013 09/02/2013
KTY4568 2056 $50.00 09/03/2013 12/12/2013

Thoughts?

EDIT: I think I may have to do these ones manually. Fortunately, there only appears to be about 9 offices affected by this.

You just need to go back and adjust the CASE expression to make sure that the IsStart matches your expectations. You will probably need to add additional WHEN clauses. As Lynn says, it's still not clear what the exact criteria for packing intervals are.
Drew

J. Drew Allen
Business Intelligence Analyst
Philadelphia, PA

Brandie Tarvin SSC Guru Points: 172991 More actions · Answer 10

Lynn Pettis - Monday, February 12, 2018 11:50 AM
Brandie Tarvin - Monday, February 12, 2018 11:46 AM
BAH. I thought I was through with this, but I just found an outlier.
When testing with the this data:
INSERT INTO #MyTemp (Office, POID, DeliverySite, AmtDollars, StartDate, EndDate, DelSite_Order, POID_Order) VALUES ('KTY4568', 1071581, 2056, 50.00, '01/30/2012', '07/31/2013', 1, 1), ('KTY4568', 1071578, 5943, 50.00, '08/01/2013', '09/02/2013', 1, 2), ('KTY4568', 1092732, 2056, 50.00, '09/03/2013', '12/12/2013', 1, 3)
I get
KTY4568 2056 $50.00 01/30/2012 12/12/2013
KTY4568 5943 $50.00 08/01/2013 09/02/2013
When I should be getting:
KTY4568 2056 $50.00 01/30/2012 07/31/2013
KTY4568 5943 $50.00 08/01/2013 09/02/2013
KTY4568 2056 $50.00 09/03/2013 12/12/2013
Thoughts?

So this goes back to sometimes adjacent time spans aren't joined together. Is there something that tells you that or is it institutional knowledge that you have to code for in the query?
This data is coming from a SQL Server database table. It was inputed through a client interface. Things changed then changed back. I have no way of telling (except either through code or visual inspection) when this happens. And I'm trying to figure out a way to tell through code without having to update these 9 offices manually. Especially since my dev data might not be equal to production data by the time I need this file in Production.

Brandie Tarvin, MCITP Database AdministratorLiveJournal Blog: http://brandietarvin.livejournal.com/[/url]On LinkedIn!, Google+, and Twitter.Freelance Writer: ShadowrunLatchkeys: Nevermore, Latchkeys: The Bootleg War, and Latchkeys: Roscoes in the Night are now available on Nook and Kindle.

Brandie Tarvin SSC Guru Points: 172991 More actions · Answer 11

drew.allen - Monday, February 12, 2018 12:12 PM
Brandie Tarvin - Monday, February 12, 2018 11:46 AM
BAH. I thought I was through with this, but I just found an outlier.
When testing with the this data:
INSERT INTO #MyTemp (Office, POID, DeliverySite, AmtDollars, StartDate, EndDate, DelSite_Order, POID_Order) VALUES ('KTY4568', 1071581, 2056, 50.00, '01/30/2012', '07/31/2013', 1, 1), ('KTY4568', 1071578, 5943, 50.00, '08/01/2013', '09/02/2013', 1, 2), ('KTY4568', 1092732, 2056, 50.00, '09/03/2013', '12/12/2013', 1, 3)
I get
KTY4568 2056 $50.00 01/30/2012 12/12/2013
KTY4568 5943 $50.00 08/01/2013 09/02/2013
When I should be getting:
KTY4568 2056 $50.00 01/30/2012 07/31/2013
KTY4568 5943 $50.00 08/01/2013 09/02/2013
KTY4568 2056 $50.00 09/03/2013 12/12/2013
Thoughts?
EDIT: I think I may have to do these ones manually. Fortunately, there only appears to be about 9 offices affected by this.

You just need to go back and adjust the CASE expression to make sure that the IsStart matches your expectations. You will probably need to add additional WHEN clauses. As Lynn says, it's still not clear what the exact criteria for packing intervals are.

Drew
The criteria for packing intervals are based on 4 things. Office, Delivery Site aggregation (how many Delivery Sites are used for the same time interval for the site), Amount, and the dates. If a date completely cuts off a site (EndDate), then the packing needs to pick up the next set of sites for the office with those dates and those amounts. But if the sites have begin / end dates concurrent / crossing with the same amounts over the same time frames, they need to be packed.

Site 1821 could have 3 different date ranges right after each other for the same amount, so it needs to be packed. The second line could also have a start date before the first line's cancel date which, so long as the amount was the same, would still fall into the packing. But if Site 2015 (for the same office) runs across these dates, even if it has the same amount, this is a change that needs to be delineated by a separate line with its own dates. And when a new set of dates with multiple sites (1821,3882) pops up for the office, then that needs to be separated out from the first grouping but can have its own set of continuous / contiguous dates.

And if a site is split, like in this last example, it needs to be treated as two different lines with the splitting site in between.

I really wish I could share more information about this. I'm not sure my explanation is cutting it.

Brandie Tarvin, MCITP Database AdministratorLiveJournal Blog: http://brandietarvin.livejournal.com/[/url]On LinkedIn!, Google+, and Twitter.Freelance Writer: ShadowrunLatchkeys: Nevermore, Latchkeys: The Bootleg War, and Latchkeys: Roscoes in the Night are now available on Nook and Kindle.

drew.allen SSC Guru Points: 76943 More actions · Answer 12

I think this captures everything.

; WITH DeliverySites AS ( SELECT *, STUFF( ( SELECT ',', DeliverySite AS [text()] FROM #MyTemp sub WHERE main.POID = sub.POID ORDER BY DeliverySite FOR XML PATH(''), TYPE ).value('.', 'VARCHAR(MAX)') , 1, 1, '') AS DeliverySites FROM #MyTemp main ) , Starts AS ( SELECT *, CASE WHEN AmtDollars <> LAG(AmtDollars, 1, 0) OVER(PARTITION BY Office, DeliverySites, DeliverySite ORDER BY StartDate) THEN 1 WHEN DeliverySites = LAG(DeliverySites) OVER(PARTITION BY Office ORDER BY StartDate) THEN 0 WHEN StartDate > LAG(EndDate, 1, '1900-01-01') OVER(PARTITION BY Office ORDER BY StartDate) THEN 1 ELSE 0 END IsStart FROM DeliverySites ) , Groups AS ( SELECT Office, DeliverySite, AmtDollars, StartDate, EndDate, DeliverySites, SUM(IsStart) OVER(PARTITION BY Office, DeliverySites, DeliverySite ORDER BY StartDate ROWS UNBOUNDED PRECEDING) AS Grp FROM Starts ) SELECT Office, DeliverySite, AmtDollars, MIN(StartDate), MAX(EndDate) FROM Groups GROUP BY Office, DeliverySites, DeliverySite, AmtDollars, Grp ORDER BY Office, MIN(StartDate), DeliverySite

All I did was change the CASE expression.

Drew

J. Drew Allen
Business Intelligence Analyst
Philadelphia, PA

Brandie Tarvin SSC Guru Points: 172991 More actions · Answer 13

Thanks, Drew. I was working on that, but changing different items in the CASE. I'll take a look at that and see how it works.

Brandie Tarvin, MCITP Database AdministratorLiveJournal Blog: http://brandietarvin.livejournal.com/[/url]On LinkedIn!, Google+, and Twitter.Freelance Writer: ShadowrunLatchkeys: Nevermore, Latchkeys: The Bootleg War, and Latchkeys: Roscoes in the Night are now available on Nook and Kindle.