Grouping dates and IDs together

  • I have what at first (and second and fifth) glance appears to be an impossible task. I have to take a table with multiple rows per ID and group it across different dates. I've tried CTEs, cursors, WHILE loops and MERGE Upserts all in an effort to get around this problem and at the end of the day, something is always going wrong with the data. The worst part is I can't use MAX() and MIN() effectively. (Note: this problem will look familiar to those who helped me with a similar but slightly different issue a couple of weeks ago).

    Here's what I've got (data obfuscated, so if something seems off, please ask):


    CREATE TABLE #MyTemp (Office VARCHAR(20), POID INT, DeliverySite INT, AmtDollars MONEY, StartDate DATE, EndDate DATE, DelSite_Order TINYINT, POID_Order TINYINT)

    INSERT INTO #MyTemp (Office, POID, DeliverySite, AmtDollars, StartDate, EndDate, DelSite_Order, POID_Order)
    VALUES ('ABC3566', 1071581, 1821, 50.00, '08/18/2010', '02/23/2011', 1, 1),
    ('ABC3566', 1071578, 1821, 50.00, '09/03/2010', '05/01/2013', 1, 2),
    ('ABC3566', 1092732, 1821, 50.00, '07/01/2011', '02/11/2014', 1, 3),
    ('ABC3566', 1195618, 1821, 75.00, '02/12/2014', '06/05/2014', 1, 4),
    ('ABC3566', 1195621, 1821, 75.00, '02/12/2014', '06/03/2014', 1, 5),
    ('ABC3566', 1200240, 1821, 125.00, '06/06/2014', '07/31/2014', 1, 6),
    ('ABC3566', 1204971, 1821, 125.00, '08/01/2014', '10/31/2014', 1, 7),
    ('ABC3566', 1204980, 1821, 125.00, '08/01/2014', '10/31/2014', 1, 8),
    ('ABC3566', 1214303, 1821, 50.00, '11/01/2014', '01/12/2015', 1, 9),
    ('ABC3566', 1214303, 3882, 50.00, '11/01/2014', '01/12/2015', 2, 9),
    ('ABC3566', 1216923, 1821, 50.00, '01/13/2015', '09/30/2015', 1, 10),
    ('ABC3566', 1216923, 3882, 50.00, '01/13/2015', '09/30/2015', 2, 10),
    ('ABC3566', 1216926, 1821, 50.00, '01/13/2015', '09/30/2015', 1, 11),
    ('ABC3566', 1216926, 3882, 50.00, '01/13/2015', '09/30/2015', 2, 11),
    ('ABC3566', 1228334, 1821, 50.00, '10/01/2015', '06/30/2016', 1, 12),
    ('ABC3566', 1228334, 3882, 50.00, '10/01/2015', '06/30/2016', 2, 12),
    ('EFJ9833', 1072026, 2607, 25.00, '11/01/2010', '05/01/2013', 1, 1),
    ('EFJ9833', 1072026, 2608, 25.00, '11/01/2010', '05/01/2013', 2, 1),
    ('EFJ9833', 1099563, 2607, 25.00, '07/01/2011', '06/04/2013', 1, 2),
    ('EFJ9833', 1099563, 2608, 25.00, '07/01/2011', '06/04/2013', 2, 2),
    ('EFJ9833', 1189651, 2607, 25.00, '06/05/2013', '05/14/2014', 1, 3),
    ('EFJ9833', 1189651, 2608, 25.00, '06/05/2013', '05/14/2014', 2, 3),
    ('EFJ9833', 1189651, 3504, 25.00, '06/05/2013', '05/14/2014', 3, 3),
    ('EFJ9833', 1199690, 2607, 25.00, '05/15/2014', '06/30/2014', 1, 4),
    ('EFJ9833', 1199690, 3784, 25.00, '05/15/2014', '06/30/2014', 2, 4),
    ('EFJ9833', 1199690, 3785, 25.00, '05/15/2014', '06/30/2014', 3, 4),
    ('EFJ9833', 1209430, 2607, 25.00, '07/01/2014', '07/31/2014', 1, 5),
    ('EFJ9833', 1209430, 3784, 25.00, '07/01/2014', '07/31/2014', 2, 5),
    ('EFJ9833', 1209430, 3828, 25.00, '07/01/2014', '07/31/2014', 3, 5),
    ('EFJ9833', 1209513, 2607, 25.00, '08/01/2014', '09/30/2015', 1, 6),
    ('EFJ9833', 1209513, 3784, 25.00, '08/01/2014', '09/30/2015', 2, 6),
    ('EFJ9833', 1209513, 3828, 25.00, '08/01/2014', '09/30/2015', 3, 6);

    What I've got to do is combine data sets based on the start / end date, the POID, and the amount. So, for example, office ABC3566 has 3 rows initially from 08/18/2010 to the "max" end date of 2/11/2014 for $50.00 heading to delivery site 1821. Since the amount then changes to $75.00 (but for the same site), I should get a new record and the same when it increases to $125.00. When the delivery sites break into 2 different sites, there should be one new record each for each site. Therefore the records should look like this:

    ABC3566   1821  $50.00  08/18/2010  02/11/2014
    ABC3566   1821  $75.00  02/12/2014  06/05/2014 <- Note the "05", not the "03" day even though the order puts "03" last.
    ABC3566   1821  $125.00  06/06/2014  10/31/2014
    ABC3566   1821  $50.00  11/01/2014  06/30/2016
    ABC3566   3882  $50.00  11/01/2014  06/30/2016

    My biggest problem here is that because 1821 has two different time frames in which the amount is $50.00, the MAX EndDate ends up being 6/30/2016 for the first entry when it shouldn't be. No matter how I group things, I can't seem to fix this problem.

    With the next one, EFJ9833, the results should look like this:

    EFJ9833  2607  $25.00  11/01/2010  06/04/2013
    EFJ9833 2608  $25.00  11/01/2010  06/04/2013
    EFJ9833  2607  $25.00  06/05/2013  05/14/2014
    EFJ9833  2608  $25.00  06/05/2013  05/14/2014
    EFJ9833  3504  $25.00  06/05/2013  05/14/2014
    EFJ9833  2607  $25.00  05/15/2014  06/30/2014
    EFJ9833  3784  $25.00  05/15/2014  06/30/2014
    EFJ9833  3785  $25.00  05/15/2014  06/30/2014
    EFJ9833  2607  $25.00  07/01/2014  09/30/2015
    EFJ9833  3784  $25.00  07/01/2014  09/30/2015
    EFJ9833  3828  $25.00  07/01/2014  09/30/2015

    And then, just when the dates seem to be making sense, i.e., a StartDate occurring a day after a EndDate, there's data like this:


    INSERT INTO #MyTemp (Office, POID, DeliverySite, AmtDollars, StartDate, EndDate, DelSite_Order, POID_Order)
    VALUES ('XKA9523', 1043624, 1987, 25.00, '11/01/2010', '05/01/2013', 1, 1),
    ('XKA9523', 1100518, 2005, 25.00, '12/01/2011', '05/01/2013', 1, 1),
    ('XKA9523', 1096836, 1592, 40.00, '09/01/2015', '06/04/2016', 1, 2);

    Which needs to be:

    XKA9523  1987  $25.00  11/01/2010  05/01/2013
    XKA9523  2005  $25.00  12/01/2011  05/01/2013 <- Notice this StartDate starts before the last ENDDAte
    XKA9523  1592  $40.00  09/01/2015  06/04/2016  <- Notice this StartDate has skipped several months between the last

    So if a StartDate happens before or after the last EndDate but is for the same Office and the same DeliverySite with the same dollar amount, it gets grouped in with the previous record. If it happens before or after the last EndDate for a different DeliverySite *or* a different amount, it gets its own record. The "order" columns were done by me using RANK() and DENSE_RANK() in an effort to group DeliverySites by Office and in order of site #. Because I had no other way of effectively grouping this information it was the best I could do.

    I've been banging my head against this particular wall for 2 weeks now and still am no closer to a solution. Each time I solve the problem for one office, I break it for another.

    Any thoughts on where I can start?

    EDIT: <headdesk> My apologies. I thought I posted this in the T-SQL forum but my mouse must have skipped up when I hit the forum header. I'm not going to double-post, so please answer here.

    Brandie Tarvin, MCITP Database AdministratorLiveJournal Blog: http://brandietarvin.livejournal.com/[/url]On LinkedIn!, Google+, and Twitter.Freelance Writer: ShadowrunLatchkeys: Nevermore, Latchkeys: The Bootleg War, and Latchkeys: Roscoes in the Night are now available on Nook and Kindle.

  • This gives the results that you are looking for.  I had to group based on the delivery sites per POID in order to get the "splits" correct.

    ;
    WITH DeliverySites AS
    (
        SELECT *,
            STUFF(
                (
                    SELECT ',', DeliverySite AS [text()]
                    FROM #MyTemp sub
                    WHERE main.POID = sub.POID
                    ORDER BY DeliverySite
                    FOR XML PATH(''), TYPE
                ).value('.', 'VARCHAR(MAX)')
            , 1, 1, '') AS DeliverySites
        FROM #MyTemp main
    )
    , Starts AS
    (
        SELECT *, CASE WHEN AmtDollars = LAG(AmtDollars) OVER(PARTITION BY Office, DeliverySites, DeliverySite ORDER BY StartDate) THEN 0 ELSE 1 END IsStart
        FROM DeliverySites
    )
    , Groups AS
    (
        SELECT Office, DeliverySite, AmtDollars, StartDate, EndDate, DeliverySites, SUM(IsStart) OVER(PARTITION BY Office, DeliverySites, DeliverySite ORDER BY StartDate ROWS UNBOUNDED PRECEDING) AS Grp
        FROM Starts
    )
    SELECT Office, DeliverySite, AmtDollars, MIN(StartDate), MAX(EndDate)
    FROM Groups
    GROUP BY Office, DeliverySites, DeliverySite, AmtDollars, Grp
    ORDER BY Office, MIN(StartDate), DeliverySite

    Drew

    J. Drew Allen
    Business Intelligence Analyst
    Philadelphia, PA

  • I am curious why in some cases an adjacent date range is merged but in others they aren't.  What I mean is the enddate for one range is the day before the start of the next (range a enddate 2014-07-31 and range b start date 2014-08-01 for the same Office, DeliverySite, and AmtDollars.

  • If you look at POID 1099563, it delivers to sites 2607 and 2608, but POID 1189651 delivers to sites 2607, 2608, and 3504.  I treated those as a break even though there is no break if you just look at sites 2607 and 2608, because that was the only way that I could get my results to match the desired results

    Drew

    J. Drew Allen
    Business Intelligence Analyst
    Philadelphia, PA

  • drew.allen - Friday, February 9, 2018 3:29 PM

    If you look at POID 1099563, it delivers to sites 2607 and 2608, but POID 1189651 delivers to sites 2607, 2608, and 3504.  I treated those as a break even though there is no break if you just look at sites 2607 and 2608, because that was the only way that I could get my results to match the desired results

    Drew

    That really isn't very intuitive.  You must have spent some time on this or something pattern-wise just struck you as I didn't catch that.

  • drew.allen - Friday, February 9, 2018 3:29 PM

    If you look at POID 1099563, it delivers to sites 2607 and 2608, but POID 1189651 delivers to sites 2607, 2608, and 3504.  I treated those as a break even though there is no break if you just look at sites 2607 and 2608, because that was the only way that I could get my results to match the desired results

    Drew

    That's exactly what I need to do. And yes, Lynn, it's not intuitive. In fact, I told our project manager that our data was incompatible with the requirements. But the requirements are the way it needs to be done because Reasons.

    Thanks, Drew. I'll check your code out. I toyed with the idea of using STUFF() for about 5 minutes but couldn't see how to apply it, so went on to try nested cursors and nested while loops, etc. Maybe I should have actually tried STUFF() instead of just looked at it. @=)

    Brandie Tarvin, MCITP Database AdministratorLiveJournal Blog: http://brandietarvin.livejournal.com/[/url]On LinkedIn!, Google+, and Twitter.Freelance Writer: ShadowrunLatchkeys: Nevermore, Latchkeys: The Bootleg War, and Latchkeys: Roscoes in the Night are now available on Nook and Kindle.

  • Drew, thank you so very much. I'd never heard of LAG() or the ROWS UNBOUNDED PRECEDING stuff until now. I love this job because I learn new things every day.

    The kicker to this specific project is that I have to check delivery sites for overlapping dates / amounts and error them out. Which would be my next hurdle if I hadn't gotten your help on the "Tying MAX() value" thread for the last project a couple of weeks ago. You just sorted out a whole bunch of problems with two posts. Thank you so very much.

    Brandie Tarvin, MCITP Database AdministratorLiveJournal Blog: http://brandietarvin.livejournal.com/[/url]On LinkedIn!, Google+, and Twitter.Freelance Writer: ShadowrunLatchkeys: Nevermore, Latchkeys: The Bootleg War, and Latchkeys: Roscoes in the Night are now available on Nook and Kindle.

  • BAH. I thought I was through with this, but I just found an outlier.

    When testing with the this data:

    INSERT INTO #MyTemp (Office, POID, DeliverySite, AmtDollars, StartDate, EndDate, DelSite_Order, POID_Order)
    VALUES ('KTY4568', 1071581, 2056, 50.00, '01/30/2012', '07/31/2013', 1, 1),
    ('KTY4568', 1071578, 5943, 50.00, '08/01/2013', '09/02/2013', 1, 2),
    ('KTY4568', 1092732, 2056, 50.00, '09/03/2013', '12/12/2013', 1, 3)

    I get

    KTY4568  2056  $50.00  01/30/2012  12/12/2013
    KTY4568  5943  $50.00  08/01/2013  09/02/2013

    When I should be getting:

    KTY4568  2056  $50.00  01/30/2012  07/31/2013
    KTY4568  5943  $50.00  08/01/2013  09/02/2013
    KTY4568  2056  $50.00  09/03/2013  12/12/2013

    Thoughts?

    EDIT: I think I may have to do these ones manually. Fortunately, there only appears to be about 9 offices affected by this.

    Brandie Tarvin, MCITP Database AdministratorLiveJournal Blog: http://brandietarvin.livejournal.com/[/url]On LinkedIn!, Google+, and Twitter.Freelance Writer: ShadowrunLatchkeys: Nevermore, Latchkeys: The Bootleg War, and Latchkeys: Roscoes in the Night are now available on Nook and Kindle.

  • Brandie Tarvin - Monday, February 12, 2018 11:46 AM

    BAH. I thought I was through with this, but I just found an outlier.

    When testing with the this data:

    INSERT INTO #MyTemp (Office, POID, DeliverySite, AmtDollars, StartDate, EndDate, DelSite_Order, POID_Order)
    VALUES ('KTY4568', 1071581, 2056, 50.00, '01/30/2012', '07/31/2013', 1, 1),
    ('KTY4568', 1071578, 5943, 50.00, '08/01/2013', '09/02/2013', 1, 2),
    ('KTY4568', 1092732, 2056, 50.00, '09/03/2013', '12/12/2013', 1, 3)

    I get

    KTY4568  2056  $50.00  01/30/2012  12/12/2013
    KTY4568  5943  $50.00  08/01/2013  09/02/2013

    When I should be getting:

    KTY4568  2056  $50.00  01/30/2012  07/31/2013
    KTY4568  5943  $50.00  08/01/2013  09/02/2013
    KTY4568  2056  $50.00  09/03/2013  12/12/2013

    Thoughts?

    So this goes back to sometimes adjacent time spans aren't joined together.   Is there something that tells you that or is it institutional knowledge that you have to code for in the query?

  • Brandie Tarvin - Monday, February 12, 2018 11:46 AM

    BAH. I thought I was through with this, but I just found an outlier.

    When testing with the this data:

    INSERT INTO #MyTemp (Office, POID, DeliverySite, AmtDollars, StartDate, EndDate, DelSite_Order, POID_Order)
    VALUES ('KTY4568', 1071581, 2056, 50.00, '01/30/2012', '07/31/2013', 1, 1),
    ('KTY4568', 1071578, 5943, 50.00, '08/01/2013', '09/02/2013', 1, 2),
    ('KTY4568', 1092732, 2056, 50.00, '09/03/2013', '12/12/2013', 1, 3)

    I get

    KTY4568  2056  $50.00  01/30/2012  12/12/2013
    KTY4568  5943  $50.00  08/01/2013  09/02/2013

    When I should be getting:

    KTY4568  2056  $50.00  01/30/2012  07/31/2013
    KTY4568  5943  $50.00  08/01/2013  09/02/2013
    KTY4568  2056  $50.00  09/03/2013  12/12/2013

    Thoughts?

    EDIT: I think I may have to do these ones manually. Fortunately, there only appears to be about 9 offices affected by this.

    You just need to go back and adjust the CASE expression to make sure that the IsStart matches your expectations.  You will probably need to add additional WHEN clauses.  As Lynn says, it's still not clear what the exact criteria for packing intervals are.
    Drew

    J. Drew Allen
    Business Intelligence Analyst
    Philadelphia, PA

  • Lynn Pettis - Monday, February 12, 2018 11:50 AM

    Brandie Tarvin - Monday, February 12, 2018 11:46 AM

    BAH. I thought I was through with this, but I just found an outlier.

    When testing with the this data:

    INSERT INTO #MyTemp (Office, POID, DeliverySite, AmtDollars, StartDate, EndDate, DelSite_Order, POID_Order)
    VALUES ('KTY4568', 1071581, 2056, 50.00, '01/30/2012', '07/31/2013', 1, 1),
    ('KTY4568', 1071578, 5943, 50.00, '08/01/2013', '09/02/2013', 1, 2),
    ('KTY4568', 1092732, 2056, 50.00, '09/03/2013', '12/12/2013', 1, 3)

    I get

    KTY4568  2056  $50.00  01/30/2012  12/12/2013
    KTY4568  5943  $50.00  08/01/2013  09/02/2013

    When I should be getting:

    KTY4568  2056  $50.00  01/30/2012  07/31/2013
    KTY4568  5943  $50.00  08/01/2013  09/02/2013
    KTY4568  2056  $50.00  09/03/2013  12/12/2013

    Thoughts?

    So this goes back to sometimes adjacent time spans aren't joined together.   Is there something that tells you that or is it institutional knowledge that you have to code for in the query?
    This data is coming from a SQL Server database table. It was inputed through a client interface. Things changed then changed back. I have no way of telling (except either through code or visual inspection) when this happens. And I'm trying to figure out a way to tell through code without having to update these 9 offices manually. Especially since my dev data might not be equal to production data by the time I need this file in Production.

    Brandie Tarvin, MCITP Database AdministratorLiveJournal Blog: http://brandietarvin.livejournal.com/[/url]On LinkedIn!, Google+, and Twitter.Freelance Writer: ShadowrunLatchkeys: Nevermore, Latchkeys: The Bootleg War, and Latchkeys: Roscoes in the Night are now available on Nook and Kindle.

  • drew.allen - Monday, February 12, 2018 12:12 PM

    Brandie Tarvin - Monday, February 12, 2018 11:46 AM

    BAH. I thought I was through with this, but I just found an outlier.

    When testing with the this data:

    INSERT INTO #MyTemp (Office, POID, DeliverySite, AmtDollars, StartDate, EndDate, DelSite_Order, POID_Order)
    VALUES ('KTY4568', 1071581, 2056, 50.00, '01/30/2012', '07/31/2013', 1, 1),
    ('KTY4568', 1071578, 5943, 50.00, '08/01/2013', '09/02/2013', 1, 2),
    ('KTY4568', 1092732, 2056, 50.00, '09/03/2013', '12/12/2013', 1, 3)

    I get

    KTY4568  2056  $50.00  01/30/2012  12/12/2013
    KTY4568  5943  $50.00  08/01/2013  09/02/2013

    When I should be getting:

    KTY4568  2056  $50.00  01/30/2012  07/31/2013
    KTY4568  5943  $50.00  08/01/2013  09/02/2013
    KTY4568  2056  $50.00  09/03/2013  12/12/2013

    Thoughts?

    EDIT: I think I may have to do these ones manually. Fortunately, there only appears to be about 9 offices affected by this.

    You just need to go back and adjust the CASE expression to make sure that the IsStart matches your expectations.  You will probably need to add additional WHEN clauses.  As Lynn says, it's still not clear what the exact criteria for packing intervals are.

    Drew
    The criteria for packing intervals are based on 4 things. Office, Delivery Site aggregation (how many Delivery Sites are used for the same time interval for the site), Amount, and the dates. If a date completely cuts off a site (EndDate), then the packing needs to pick up the next set of sites for the office with those dates and those amounts. But if the sites have begin / end dates concurrent / crossing with the same amounts over the same time frames, they need to be packed.

    Site 1821 could have 3 different date ranges right after each other for the same amount, so it needs to be packed. The second line could also have a start date before the first line's cancel date which, so long as the amount was the same, would still fall into the packing. But if Site 2015 (for the same office) runs across these dates, even if it has the same amount, this is a change that needs to be delineated by a separate line with its own dates. And when a new set of dates with multiple sites (1821,3882) pops up for the office, then that needs to be separated out from the first grouping but can have its own set of continuous / contiguous dates.

    And if a site is split, like in this last example, it needs to be treated as two different lines with the splitting site in between.

    I really wish I could share more information about this. I'm not sure my explanation is cutting it.

    Brandie Tarvin, MCITP Database AdministratorLiveJournal Blog: http://brandietarvin.livejournal.com/[/url]On LinkedIn!, Google+, and Twitter.Freelance Writer: ShadowrunLatchkeys: Nevermore, Latchkeys: The Bootleg War, and Latchkeys: Roscoes in the Night are now available on Nook and Kindle.

  • I think this captures everything.

    ;
    WITH DeliverySites AS
    (
        SELECT *,
            STUFF(
                (
                    SELECT ',', DeliverySite AS [text()]
                    FROM #MyTemp sub
                    WHERE main.POID = sub.POID
                    ORDER BY DeliverySite
                    FOR XML PATH(''), TYPE
                ).value('.', 'VARCHAR(MAX)')
            , 1, 1, '') AS DeliverySites
        FROM #MyTemp main
    )
    , Starts AS
    (
        SELECT *,
            CASE
                WHEN AmtDollars <> LAG(AmtDollars, 1, 0) OVER(PARTITION BY Office, DeliverySites, DeliverySite ORDER BY StartDate) THEN 1
                WHEN DeliverySites = LAG(DeliverySites) OVER(PARTITION BY Office ORDER BY StartDate) THEN 0
                WHEN StartDate > LAG(EndDate, 1, '1900-01-01') OVER(PARTITION BY Office ORDER BY StartDate) THEN 1
                ELSE 0
            END IsStart
        FROM DeliverySites
    )
    , Groups AS
    (
        SELECT Office, DeliverySite, AmtDollars, StartDate, EndDate, DeliverySites, SUM(IsStart) OVER(PARTITION BY Office, DeliverySites, DeliverySite ORDER BY StartDate ROWS UNBOUNDED PRECEDING) AS Grp
        FROM Starts
    )
    SELECT Office, DeliverySite, AmtDollars, MIN(StartDate), MAX(EndDate)
    FROM Groups
    GROUP BY Office, DeliverySites, DeliverySite, AmtDollars, Grp
    ORDER BY Office, MIN(StartDate), DeliverySite

    All I did was change the CASE expression.

    Drew

    J. Drew Allen
    Business Intelligence Analyst
    Philadelphia, PA

  • Thanks, Drew. I was working on that, but changing different items in the CASE. I'll take a look at that and see how it works.

    Brandie Tarvin, MCITP Database AdministratorLiveJournal Blog: http://brandietarvin.livejournal.com/[/url]On LinkedIn!, Google+, and Twitter.Freelance Writer: ShadowrunLatchkeys: Nevermore, Latchkeys: The Bootleg War, and Latchkeys: Roscoes in the Night are now available on Nook and Kindle.

Viewing 14 posts - 1 through 13 (of 13 total)

You must be logged in to reply to this topic. Login to reply