Max window function with order by

  • Hello Everyone

    Thought I knew how Window functions worked in SQL Server but that was  before I ran below script

    This script is part of a CTE which is then fed into other CTEs downstream - but I am having hard time understanding how the MAX function here is deriving maximum date for each row

    For example - ids 14  to 18 appears to be in one group with same max_date (2017-05-22 21:05:10.000) , ids 3 - 5 again have the same max date(2016-08-03 15:25:07.000)  - not sure how the grouping works and what actually 'order by' is doing here

    Script below

    DROP TABLE IF EXISTS #test 
    CREATE TABLE #test(
    [id] [int] IDENTITY(1,1) NOT NULL,
    [customer_id] [int] NULL,
    [start_Date] [datetime] NULL,
    [end_date] [datetime] NULL
    ) ON [PRIMARY]

    GO
    SET IDENTITY_INSERT #test ON

    INSERT #test ([id], [customer_id], [start_Date], [end_date]) VALUES (1, 111, CAST(N'2016-05-04T22:34:38.000' AS DateTime), CAST(N'2016-07-03T22:34:38.000' AS DateTime))
    INSERT #test ([id], [customer_id], [start_Date], [end_date]) VALUES (2, 111, CAST(N'2016-05-05T08:51:08.000' AS DateTime), CAST(N'2016-08-03T08:51:08.000' AS DateTime))
    INSERT #test ([id], [customer_id], [start_Date], [end_date]) VALUES (3, 111, CAST(N'2016-05-05T15:25:07.000' AS DateTime), CAST(N'2016-08-03T15:25:07.000' AS DateTime))
    INSERT #test ([id], [customer_id], [start_Date], [end_date]) VALUES (4, 111, CAST(N'2016-05-05T15:25:40.000' AS DateTime), CAST(N'2016-05-19T15:25:40.000' AS DateTime))
    INSERT #test ([id], [customer_id], [start_Date], [end_date]) VALUES (5, 111, CAST(N'2016-05-06T08:50:23.000' AS DateTime), CAST(N'2016-06-05T08:50:23.000' AS DateTime))
    INSERT #test ([id], [customer_id], [start_Date], [end_date]) VALUES (6, 111, CAST(N'2017-05-16T20:22:44.000' AS DateTime), CAST(N'2017-07-15T20:22:44.000' AS DateTime))
    INSERT #test ([id], [customer_id], [start_Date], [end_date]) VALUES (7, 111, CAST(N'2017-07-18T15:15:42.000' AS DateTime), CAST(N'2017-08-01T15:15:42.000' AS DateTime))
    INSERT #test ([id], [customer_id], [start_Date], [end_date]) VALUES (8, 111, CAST(N'2017-07-18T15:16:29.000' AS DateTime), CAST(N'2017-07-25T15:16:29.000' AS DateTime))
    INSERT #test ([id], [customer_id], [start_Date], [end_date]) VALUES (9, 111, CAST(N'2017-07-26T10:16:46.000' AS DateTime), CAST(N'2017-10-24T10:16:46.000' AS DateTime))
    INSERT #test ([id], [customer_id], [start_Date], [end_date]) VALUES (10, 111, CAST(N'2018-09-27T11:19:13.000' AS DateTime), CAST(N'2018-10-04T11:19:13.000' AS DateTime))
    INSERT #test ([id], [customer_id], [start_Date], [end_date]) VALUES (11, 111, CAST(N'2019-01-30T10:16:46.000' AS DateTime), CAST(N'2019-02-06T10:16:46.000' AS DateTime))
    INSERT #test ([id], [customer_id], [start_Date], [end_date]) VALUES (12, 222, CAST(N'2016-12-20T16:32:09.000' AS DateTime), CAST(N'2017-03-20T16:32:09.000' AS DateTime))
    INSERT #test ([id], [customer_id], [start_Date], [end_date]) VALUES (13, 222, CAST(N'2016-12-20T16:33:39.000' AS DateTime), CAST(N'2017-01-19T16:33:39.000' AS DateTime))
    INSERT #test ([id], [customer_id], [start_Date], [end_date]) VALUES (14, 222, CAST(N'2017-02-21T21:05:10.000' AS DateTime), CAST(N'2017-05-22T21:05:10.000' AS DateTime))
    INSERT #test ([id], [customer_id], [start_Date], [end_date]) VALUES (15, 222, CAST(N'2017-02-21T21:09:25.000' AS DateTime), CAST(N'2017-03-07T21:09:25.000' AS DateTime))
    INSERT #test ([id], [customer_id], [start_Date], [end_date]) VALUES (16, 222, CAST(N'2017-02-28T13:47:14.000' AS DateTime), CAST(N'2017-04-29T13:47:14.000' AS DateTime))
    INSERT #test ([id], [customer_id], [start_Date], [end_date]) VALUES (17, 222, CAST(N'2017-04-04T11:10:20.000' AS DateTime), CAST(N'2017-05-04T11:10:20.000' AS DateTime))
    INSERT #test ([id], [customer_id], [start_Date], [end_date]) VALUES (18, 222, CAST(N'2017-04-04T11:10:20.000' AS DateTime), CAST(N'2017-05-04T11:10:20.000' AS DateTime))
    INSERT #test ([id], [customer_id], [start_Date], [end_date]) VALUES (19, 222, CAST(N'2017-10-26T15:25:36.000' AS DateTime), CAST(N'2017-11-25T15:25:36.000' AS DateTime))
    INSERT #test ([id], [customer_id], [start_Date], [end_date]) VALUES (20, 222, CAST(N'2018-04-12T09:29:28.000' AS DateTime), CAST(N'2018-04-19T09:29:28.000' AS DateTime))
    INSERT #test ([id], [customer_id], [start_Date], [end_date]) VALUES (21, 222, CAST(N'2018-04-12T09:43:29.000' AS DateTime), CAST(N'2018-07-11T09:43:29.000' AS DateTime))
    INSERT #test ([id], [customer_id], [start_Date], [end_date]) VALUES (22, 222, CAST(N'2018-08-02T23:10:08.000' AS DateTime), CAST(N'2018-09-01T23:10:08.000' AS DateTime))
    INSERT #test ([id], [customer_id], [start_Date], [end_date]) VALUES (23, 222, CAST(N'2019-08-20T11:15:03.000' AS DateTime), CAST(N'2019-09-19T11:15:03.000' AS DateTime))
    INSERT #test ([id], [customer_id], [start_Date], [end_date]) VALUES (24, 222, CAST(N'2019-10-15T15:34:53.000' AS DateTime), CAST(N'2019-10-22T15:34:53.000' AS DateTime))
    INSERT #test ([id], [customer_id], [start_Date], [end_date]) VALUES (25, 222, CAST(N'2019-10-15T15:35:31.000' AS DateTime), CAST(N'2020-01-13T15:35:31.000' AS DateTime))
    SET IDENTITY_INSERT #test OFF

    SELECT * FROM #test AS t ORDER BY customer_id , t.start_Date , t.end_date


    SELECT * , MAX(t.end_date) OVER (PARTITION BY t.customer_id ORDER BY t.start_Date ) AS max_date
    FROM #test AS t
    ORDER BY customer_id , t.start_Date , t.end_date
  • if you compare the results with the ORDER BY t.start_Date and without, it might be clearer.  Just having PARTITION BY t.customer_id will get the MAX(end_date) for each customer_id and return it for every row of that customer_id.  The entire set that match the customer_id is the window regardless of which customer_id row you are looking at.

    Having an ORDER BY in the OVER clause of MAX(end_date) causes it to evaluate only records before or up-to the current record in the function, so the window is growing for each row within the customer_id.

    For your specific questions, since you ORDER BY start_date, which also seem to match order by id, then the following explanations hold true:

    for customer_id = 111, id = 1 only has itself to compare end_date to so max_date = 2016-07-03 22:34:38.000

    for customer_id = 111, id = 2, it has a later end_date than 1 so max_date = it's own end_date 2016-08-03 08:51:08.000

    for customer_id = 111, id = 3, it has a later end_date than 1 or 2 so max_date = it's own end_date 2016-08-03 15:25:07.000

    for customer_id = 111, id = 4, it has an earlier end_date than id = 3 so id = 3 is still the MAX(end_date) 2016-08-03 15:25:07.000

    for customer_id = 111, id = 5, it has an earlier end_date than id = 3 so id = 3 is still the MAX(end_date) 2016-08-03 15:25:07.000

    for customer_id = 111, id = 6, it has a later end_date than 1 or 2 or 3 or 4 so max_date - it's own end_date 2017-07-15 20:22:44.000

    Same situation for id 14,15,16,17,18.

    15,16,17,18 have later start_date than 14 and also have earlier end_date so the end_date of id = 14 is the MAX(end_date) for each of those rows.

    Here's an article that describes in detail how they work:

    https://www.red-gate.com/simple-talk/sql/t-sql-programming/introduction-to-t-sql-window-functions/

     

  • Certain windowed functions require a frame when there is an ORDER BY clause.  MAX() is one of those functions.  So MAX(t.end_date) OVER (PARTITION BY t.customer_id ORDER BY t.start_Date ) AS max_date is short for MAX(t.end_date) OVER (PARTITION BY t.customer_id ORDER BY t.start_Date RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS max_date.

    Some things to note here:

    1. RANGE is deterministic, but ROWS is not necessarily.  This is why RANGE is the default.
    2. RANGE will often write to disk, but ROWS will never write to disk.  This is why ROWS is generally preferred.
    3. RANGE will consider all tied records, ROWS will "arbitrarily" break the tie.  (See 1.)

      1. This is why it's common practice to include a unique key in the sort order.  This ensures that there will not be ties.

    Chris' statement below is inaccurate in that it ignores ties.

    Having an ORDER BY in the OVER clause of MAX(end_date) causes it to evaluate only records before or up-to the current record in the function, so the window is growing for each row within the customer_id.

    The differences between RANGE and ROWS all come down to the difference in how they process ties.  Since ROWS arbitrarily breaks ties, it only ever needs to consider two values: the value from the current row and the value of the aggregate from the previous row.  Since it only ever needs to consider two values, it never writes to disk. Since RANGE considers all records within the tied group, the number of records processed to calculate the current aggregate value can be arbitrarily large, because you can have an arbitrary number of tied records.  RANGE will write to disk when the number of values being considered reaches a certain threshold.  (I believe it's 10,000 values).

    Drew

    J. Drew Allen
    Business Intelligence Analyst
    Philadelphia, PA

  • Thanks Chris and Drew for your detailed explanation - it does make sense.

    I guess I was just so hung up with the idea of some strange grouping - I did not think of it being something like running max

    Thanks

     

Viewing 4 posts - 1 through 3 (of 3 total)

You must be logged in to reply to this topic. Login to reply