SQLServerCentral Article

Window Function Basics - PARTITION BY

,

One of the things that I've done over the last year to improve my T-SQL skills is learn to work with the various window functions that exist in the language. These were added in 2012 and there have been a number of improvements across the last decade. Since I struggled to understand some of the concepts early on, I decided to write a few articles on some basic concepts of how these functions work.

This article will look at the partitioning clause inside a window function with a few examples that I hope will help you learn how to better work with SUM, COUNT, and other aggregate functions. Before I start, I'll give a general description of what a window function is, since there isn't a set of "window functions" documented by Microsoft.

One thing to note is that often the window functions work better (perform better) than non-window function queries because the data is only processed once as the window is examined. I shall try to show some performance improvements in a later article.

A Basic Look at OVER()

The OVER() clause is where the window is applied to an aggregate function. This is actually listed in the documentation as the first sentence says: "Determines the partitioning and ordering of a rowset before the associated window function is applied." - from SELECT-OVER Clause.

You should think of a window as a set of rows that are gathered together and then a function is run on these. The OVER() clause is where the window is defined and applied to data.

Note: This can change in SQL Server 2022 and later as there is actually a Window clause now.

The structure of the OVER() clause is that it comes after an aggregate function. For example, if I have code like this:

SELECT
  customerid
, COUNT (*) OVER (PARTITION BY soh.CustomerID)
FROM Sales.SalesOrderHeader AS soh;

I have written a query that gets a count of rows, but it breaks the data into windows defined by the OVER() clause. In this case, I've partitioned the data by CustomerID. We'll discuss what this means later, but think of this as a way of breaking the data into groups and then applying the aggregate to each group. In this case COUNT() is applied to each CustomerID.

One thing to note here is that I didn't need to specify a GROUP BY as I normally would with a COUNT function. This doesn't mean that OVER() replaces GROUP BY, but instead GROUP BY isn't required here. This means I can list more columns without needing to add them all to another clause. This is a more complex, and valid, query:

SELECT
  customerid, soh.AccountNumber, soh.DueDate
, COUNT (*) OVER (PARTITION BY soh.CustomerID)
, SUM(soh.SubTotal) OVER (PARTITION BY soh.CustomerID)
FROM Sales.SalesOrderHeader AS soh;

Setup

To show some data, I'll work with AdventureWorks, which is a very common sample database from Microsoft. You can download this from the links in this article (https://learn.microsoft.com/en-us/sql/samples/adventureworks-install-configure?view=sql-server-ver16&tabs=ssms)

Where possible, I will try to limit the amount of data so that we can examine the windows in more detail and understand what is happening. For example, I would likely try to explain windows by running this query before the one above.

SELECT
  customerid
FROM Sales.SalesOrderHeader AS soh
WHERE soh.CustomerID IN (11014, 11015, 11017)

Partitions

A partition is a set of data that is contained together. We have various types of partitioning in SQL Server, but in this article we are speaking of the partition section within an OVER() clause. In this case, we relate to a group of data in our query. We define the grouping with this parameter.

Our PARTITION BY refers to a set of data returned by the FROM clause in a SELECT. We can decide how we want to break this data into sections, with our aggregate applied to each section. This might make more sense with a few examples, so let's look at a few sets of data.

In AdventureWorks, I'll take my second query above and return a bit more data. Look at this query:

SELECT
  customerid
, soh.AccountNumber
, soh.SalesPersonID
FROM Sales.SalesOrderHeader AS soh
WHERE
  soh.CustomerID IN ( 29718, 30052, 29890 );

When I run this, I get 20 rows with the following data:

customerid  AccountNumber   SalesPersonID
----------- --------------- -------------
29718       10-4020-000161  275
29718       10-4020-000161  275
29718       10-4020-000161  275
29718       10-4020-000161  275
29718       10-4020-000161  275
29718       10-4020-000161  275
29718       10-4020-000161  277
29718       10-4020-000161  277
29890       10-4020-000200  283
29890       10-4020-000200  283
29890       10-4020-000200  283
29890       10-4020-000200  283
29890       10-4020-000200  283
29890       10-4020-000200  283
29890       10-4020-000200  283
29890       10-4020-000200  283
30052       10-4020-000511  276
30052       10-4020-000511  276
30052       10-4020-000511  276
30052       10-4020-000511  276

If I look at this data, I see there are 3 CustomerIDs, with 4 SalesPersonIDs. The breakdown (by CustomerID) is:

  • 29718 - 8 rows, 2 SalesPersonIDs
  • 29890 - 8 rows, 1 SalesPersonID
  • 30052 - 4 rows, 1 SalesPersonID

If I add this WHERE clause to the query above, my paritions for data by CustomerID should match the counts above. In fact, if I run this query:

SELECT
  customerid
, soh.SalesPersonID
, COUNT (*) OVER (PARTITION BY soh.CustomerID) AS countByCustomer
FROM Sales.SalesOrderHeader AS soh
WHERE
  soh.CustomerID IN ( 29718, 30052, 29890 );

That's what I see. Here are the results.

customerid  SalesPersonID countByCustomer
----------- ------------- ---------------
29718       275           8
29718       275           8
29718       275           8
29718       275           8
29718       275           8
29718       275           8
29718       277           8
29718       277           8
29890       283           8
29890       283           8
29890       283           8
29890       283           8
29890       283           8
29890       283           8
29890       283           8
29890       283           8
30052       276           4
30052       276           4
30052       276           4
30052       276           4

Well, those are the raw results. Note that I still get 20 rows, but my COUNT aggregate is totaled for each "parition". In this case, customer 29718 returns 8 for all their rows, as does customer 29890. Customer 30052 returns 4. Note the SalesPersonID doesn't affect the grouping, which is just by customer.

If I were to count the rows by SalesPersonID as the partition, I use this query:

SELECT
  customerid
, soh.SalesPersonID
, COUNT (*) OVER (PARTITION BY soh.SalesPersonID) AS countBySalesPersonID
FROM Sales.SalesOrderHeader AS soh
WHERE
  soh.CustomerID IN ( 29718, 30052, 29890 );

Now I get results that look at the partitions of different SalesPersonIDs. In this case we see there are 6 orders with 275, all from one customer (29718). Then I have the 4 orders from customer 30052 with SalesPerson 276. Next we have 2 orders from customer 29718 with SalesPerson 277. Then the last 8 from customer 29890.

customerid  SalesPersonID countBySalesPersonID
----------- ------------- --------------------
29718       275           6
29718       275           6
29718       275           6
29718       275           6
29718       275           6
29718       275           6
30052       276           4
30052       276           4
30052       276           4
30052       276           4
29718       277           2
29718       277           2
29890       283           8
29890       283           8
29890       283           8
29890       283           8
29890       283           8
29890       283           8
29890       283           8
29890       283           8

That seems strange, but remember there is no real ordering of data without a specifiction, which is part of the issue here. We'll look at ordering in the next article.

I can also combine columns to make more partitions. For example, I can do this:

SELECT
  customerid
, soh.SalesPersonID
, COUNT (*) OVER (PARTITION BY soh.CustomerID, soh.SalesPersonID) AS countByCustomerID_SalesPersonID
FROM Sales.SalesOrderHeader AS soh
WHERE
  soh.CustomerID IN ( 29718, 30052, 29890 );

This will actually give me the same results as the previous query, but in a different order.

customerid  SalesPersonID countByCustomerID_SalesPersonID
----------- ------------- -------------------------------
29718       275           6
29718       275           6
29718       275           6
29718       275           6
29718       275           6
29718       275           6
29718       277           2
29718       277           2
29890       283           8
29890       283           8
29890       283           8
29890       283           8
29890       283           8
29890       283           8
29890       283           8
29890       283           8
30052       276           4
30052       276           4
30052       276           4
30052       276           4

What we've done is we break up the data like this, looking at the original data. Here are the partitions for CustomerID.

Partition 1
29718       10-4020-000161  275
29718       10-4020-000161  275
29718       10-4020-000161  275
29718       10-4020-000161  275
29718       10-4020-000161  275
29718       10-4020-000161  275
29718       10-4020-000161  277
29718       10-4020-000161  277
Partition 2
29890       10-4020-000200  283
29890       10-4020-000200  283
29890       10-4020-000200  283
29890       10-4020-000200  283
29890       10-4020-000200  283
29890       10-4020-000200  283
29890       10-4020-000200  283
29890       10-4020-000200  283
Partition 3
30052       10-4020-000511  276
30052       10-4020-000511  276
30052       10-4020-000511  276
30052       10-4020-000511  276

Now we add new partitions based on the SalesPersonID in these groups. That gives me:

Partition 1
29718       10-4020-000161  275
29718       10-4020-000161  275
29718       10-4020-000161  275
29718       10-4020-000161  275
29718       10-4020-000161  275
29718       10-4020-000161  275
Partition 2
29718       10-4020-000161  277
29718       10-4020-000161  277
Partition 3
29890       10-4020-000200  283
29890       10-4020-000200  283
29890       10-4020-000200  283
29890       10-4020-000200  283
29890       10-4020-000200  283
29890       10-4020-000200  283
29890       10-4020-000200  283
29890       10-4020-000200  283
Partition 4
30052       10-4020-000511  276
30052       10-4020-000511  276
30052       10-4020-000511  276
30052       10-4020-000511  276

In this case, I'm assuming the data is always in this order. If we had other data or ordering, we might get different partitions. There are default ordering and partitions. The default ordering is ascending and with all rows.

If I do not specify a partition, the default is the entire set of data returned by the FROM clause (with the WHERE applied).

Let's use a bit more data to explain things. I will add another set of data to my query and I'll remove AccountNumber, which isnt' very interesting as it tends to just match CustomerID

SELECT
  customerid
, soh.SalesPersonID
, soh.DueDate
FROM Sales.SalesOrderHeader AS soh
WHERE
  soh.CustomerID IN ( 29718, 30052, 29890 );

I get these results, again, without ordering, but with a spread of data.

customerid  SalesPersonID DueDate
----------- ------------- -----------------------
29718       275           2011-06-12 00:00:00.000
29718       275           2011-09-12 00:00:00.000
29718       275           2011-12-13 00:00:00.000
29718       275           2012-03-12 00:00:00.000
29718       275           2012-06-11 00:00:00.000
29718       275           2012-09-11 00:00:00.000
29718       277           2012-12-12 00:00:00.000
29718       277           2013-03-12 00:00:00.000
29890       283           2011-06-12 00:00:00.000
29890       283           2011-12-13 00:00:00.000
29890       283           2012-03-12 00:00:00.000
29890       283           2012-06-11 00:00:00.000
29890       283           2012-09-11 00:00:00.000
29890       283           2012-12-12 00:00:00.000
29890       283           2013-03-12 00:00:00.000
29890       283           2013-12-12 00:00:00.000
30052       276           2011-06-12 00:00:00.000
30052       276           2011-09-12 00:00:00.000
30052       276           2011-12-13 00:00:00.000
30052       276           2012-03-12 00:00:00.000

What happens if I partition by customerID? I should get the same results.

SELECT
  customerid
, soh.SalesPersonID
, soh.DueDate
, COUNT (*) OVER (PARTITION BY soh.CustomerID) AS countByCustomer
FROM Sales.SalesOrderHeader AS soh
WHERE
  soh.CustomerID IN ( 29718, 30052, 29890 );

My results are the same, with dates instead of the account number.

customerid  SalesPersonID DueDate                 countByCustomer
----------- ------------- ----------------------- ---------------
29718       275           2011-06-12 00:00:00.000 8
29718       275           2011-09-12 00:00:00.000 8
29718       275           2011-12-13 00:00:00.000 8
29718       275           2012-03-12 00:00:00.000 8
29718       275           2012-06-11 00:00:00.000 8
29718       275           2012-09-11 00:00:00.000 8
29718       277           2012-12-12 00:00:00.000 8
29718       277           2013-03-12 00:00:00.000 8
29890       283           2011-06-12 00:00:00.000 8
29890       283           2011-12-13 00:00:00.000 8
29890       283           2012-03-12 00:00:00.000 8
29890       283           2012-06-11 00:00:00.000 8
29890       283           2012-09-11 00:00:00.000 8
29890       283           2012-12-12 00:00:00.000 8
29890       283           2013-03-12 00:00:00.000 8
29890       283           2013-12-12 00:00:00.000 8
30052       276           2011-06-12 00:00:00.000 4
30052       276           2011-09-12 00:00:00.000 4
30052       276           2011-12-13 00:00:00.000 4
30052       276           2012-03-12 00:00:00.000 4

What if I partition by DueDate? I'll use this query:

SELECT
  customerid
, soh.SalesPersonID
, soh.DueDate
, COUNT (*) OVER (PARTITION BY soh.DueDate) AS countByDueDate
FROM Sales.SalesOrderHeader AS soh
WHERE
  soh.CustomerID IN ( 29718, 30052, 29890 );

What would you expect? Take a minute and think about it before you look at the results below. How is the data broken up?

Here are the results, which should make sense.

customerid  SalesPersonID DueDate                 countByDueDate
----------- ------------- ----------------------- --------------
29718       275           2011-06-12 00:00:00.000 3
29890       283           2011-06-12 00:00:00.000 3
30052       276           2011-06-12 00:00:00.000 3
30052       276           2011-09-12 00:00:00.000 2
29718       275           2011-09-12 00:00:00.000 2
29718       275           2011-12-13 00:00:00.000 3
29890       283           2011-12-13 00:00:00.000 3
30052       276           2011-12-13 00:00:00.000 3
30052       276           2012-03-12 00:00:00.000 3
29890       283           2012-03-12 00:00:00.000 3
29718       275           2012-03-12 00:00:00.000 3
29718       275           2012-06-11 00:00:00.000 2
29890       283           2012-06-11 00:00:00.000 2
29890       283           2012-09-11 00:00:00.000 2
29718       275           2012-09-11 00:00:00.000 2
29718       277           2012-12-12 00:00:00.000 2
29890       283           2012-12-12 00:00:00.000 2
29890       283           2013-03-12 00:00:00.000 2
29718       277           2013-03-12 00:00:00.000 2
29890       283           2013-12-12 00:00:00.000 1

If I look at the data, what I really see is this (partitions listed):

customerid  SalesPersonID DueDate                 countByDueDate
----------- ------------- ----------------------- --------------
partition 1
29718       275           2011-06-12 00:00:00.000 3
29890       283           2011-06-12 00:00:00.000 3
30052       276           2011-06-12 00:00:00.000 3
partition 2
30052       276           2011-09-12 00:00:00.000 2
29718       275           2011-09-12 00:00:00.000 2
partition 3
29718       275           2011-12-13 00:00:00.000 3
29890       283           2011-12-13 00:00:00.000 3
30052       276           2011-12-13 00:00:00.000 3
partition 4
30052       276           2012-03-12 00:00:00.000 3
29890       283           2012-03-12 00:00:00.000 3
29718       275           2012-03-12 00:00:00.000 3
partition 5
29718       275           2012-06-11 00:00:00.000 2
29890       283           2012-06-11 00:00:00.000 2
partition 6
29890       283           2012-09-11 00:00:00.000 2
29718       275           2012-09-11 00:00:00.000 2
partition 7
29718       277           2012-12-12 00:00:00.000 2
29890       283           2012-12-12 00:00:00.000 2
partition 8
29890       283           2013-03-12 00:00:00.000 2
29718       277           2013-03-12 00:00:00.000 2
partition 9
29890       283           2013-12-12 00:00:00.000 1

Each time the DueDate changes, we have a new parition. We've "grouped" the data by DueDate and then we count how many rows are in each.

What about if we look at DueDate and the CustomerID? The query is here:

SELECT
  customerid
, soh.SalesPersonID
, soh.DueDate
, COUNT (*) OVER (PARTITION BY soh.DueDate, soh.CustomerID) AS countByDueDate
FROM Sales.SalesOrderHeader AS soh
WHERE
  soh.CustomerID IN ( 29718, 30052, 29890 );

The results show that we've broken the data by DueDate, but each time the customer is different it's a separate partition. In this case, every single partition has 1 row.

customerid  SalesPersonID DueDate                 countByDueDate
----------- ------------- ----------------------- --------------
29718       275           2011-06-12 00:00:00.000 1
29890       283           2011-06-12 00:00:00.000 1
30052       276           2011-06-12 00:00:00.000 1
29718       275           2011-09-12 00:00:00.000 1
30052       276           2011-09-12 00:00:00.000 1
29718       275           2011-12-13 00:00:00.000 1
29890       283           2011-12-13 00:00:00.000 1
30052       276           2011-12-13 00:00:00.000 1
29718       275           2012-03-12 00:00:00.000 1
29890       283           2012-03-12 00:00:00.000 1
30052       276           2012-03-12 00:00:00.000 1
29718       275           2012-06-11 00:00:00.000 1
29890       283           2012-06-11 00:00:00.000 1
29718       275           2012-09-11 00:00:00.000 1
29890       283           2012-09-11 00:00:00.000 1
29718       277           2012-12-12 00:00:00.000 1
29890       283           2012-12-12 00:00:00.000 1
29718       277           2013-03-12 00:00:00.000 1
29890       283           2013-03-12 00:00:00.000 1
29890       283           2013-12-12 00:00:00.000 1

Let's reverse that and compare the results. I'll parition by customer and then date.

SELECT
  customerid
, soh.SalesPersonID
, soh.DueDate
, COUNT (*) OVER (PARTITION BY
                    soh.CustomerID
                  , soh.DueDate) AS countByDueDate
FROM Sales.SalesOrderHeader AS soh
WHERE
  soh.CustomerID IN ( 29718, 30052, 29890 );

Now I see these results.

customerid  SalesPersonID DueDate                 countByDueDate
----------- ------------- ----------------------- --------------
29718       275           2011-06-12 00:00:00.000 1
29718       275           2011-09-12 00:00:00.000 1
29718       275           2011-12-13 00:00:00.000 1
29718       275           2012-03-12 00:00:00.000 1
29718       275           2012-06-11 00:00:00.000 1
29718       275           2012-09-11 00:00:00.000 1
29718       277           2012-12-12 00:00:00.000 1
29718       277           2013-03-12 00:00:00.000 1
29890       283           2011-06-12 00:00:00.000 1
29890       283           2011-12-13 00:00:00.000 1
29890       283           2012-03-12 00:00:00.000 1
29890       283           2012-06-11 00:00:00.000 1
29890       283           2012-09-11 00:00:00.000 1
29890       283           2012-12-12 00:00:00.000 1
29890       283           2013-03-12 00:00:00.000 1
29890       283           2013-12-12 00:00:00.000 1
30052       276           2011-06-12 00:00:00.000 1
30052       276           2011-09-12 00:00:00.000 1
30052       276           2011-12-13 00:00:00.000 1
30052       276           2012-03-12 00:00:00.000 1

Same results, but different partitions. Note here we see that for each customer, all of our dates are different. Let me insert 1 row with this code to show the difference:

INSERT Sales.SalesOrderHeader
   (RevisionNumber, OrderDate, DueDate, ShipDate, Status, OnlineOrderFlag, PurchaseOrderNumber, AccountNumber, CustomerID, SalesPersonID, TerritoryID, BillToAddressID, ShipToAddressID, ShipMethodID, CreditCardID, CreditCardApprovalCode, CurrencyRateID, SubTotal, TaxAmt, Freight, Comment, rowguid, ModifiedDate)
 VALUES
   (8, '2011-12-01', '2011-12-13', NULL, DEFAULT, DEFAULT, NULL, NULL, 30052, 276, 4, 1074, 1074, 5, NULL, NULL, NULL, DEFAULT, DEFAULT, DEFAULT, NULL, DEFAULT, DEFAULT)

Now my raw data has a second order by customer 30052. Let's run the two queries again. First the partition by duedate and then customer.

SELECT
  customerid
, soh.SalesPersonID
, soh.DueDate
, COUNT (*) OVER (PARTITION BY soh.DueDate, soh.CustomerID) AS countByDueDate
FROM Sales.SalesOrderHeader AS soh
WHERE
  soh.CustomerID IN ( 29718, 30052, 29890 );

This returns all 1s. Now this query again:

SELECT
  customerid
, soh.SalesPersonID
, soh.DueDate
, COUNT (*) OVER (PARTITION BY
                    soh.CustomerID
                  , soh.DueDate) AS countByDueDate
FROM Sales.SalesOrderHeader AS soh
WHERE
  soh.CustomerID IN ( 29718, 30052, 29890 );

I now see this data:

customerid  SalesPersonID DueDate                 countByDueDate
----------- ------------- ----------------------- --------------
29718       275           2011-06-12 00:00:00.000 1
29718       275           2011-09-12 00:00:00.000 1
29718       275           2011-12-13 00:00:00.000 1
29718       275           2012-03-12 00:00:00.000 1
29718       275           2012-06-11 00:00:00.000 1
29718       275           2012-09-11 00:00:00.000 1
29718       277           2012-12-12 00:00:00.000 1
29718       277           2013-03-12 00:00:00.000 1
29890       283           2011-06-12 00:00:00.000 1
29890       283           2011-12-13 00:00:00.000 1
29890       283           2012-03-12 00:00:00.000 1
29890       283           2012-06-11 00:00:00.000 1
29890       283           2012-09-11 00:00:00.000 1
29890       283           2012-12-12 00:00:00.000 1
29890       283           2013-03-12 00:00:00.000 1
29890       283           2013-12-12 00:00:00.000 1
30052       276           2011-06-12 00:00:00.000 1
30052       276           2011-09-12 00:00:00.000 1
30052       276           2011-12-13 00:00:00.000 2
30052       276           2011-12-13 00:00:00.000 2
30052       276           2012-03-12 00:00:00.000 1

One of our partitions has two rows. CustomerID 30052 has two rows with DueDate 2011-12-13. Hence the 2 is returned as the aggregate of this partition.

One other thing to keep in mind is that we can have separate partitions for different aggregates. As an example, I will set up two different aggregates and then partition them separately. We will get a MAX() with one partition and a MIN() with another. Her is our query:

SELECT
  customerid
, soh.SalesPersonID
, soh.DueDate
, MAX(duedate) OVER (PARTITION BY soh.CustomerID) AS MaxDueDatebyCustomer
, MIN(soh.CustomerID) OVER (PARTITION BY soh.DueDate) AS MinCustomer
FROM Sales.SalesOrderHeader AS soh
WHERE
  soh.CustomerID IN ( 29718, 30052, 29890 );

And the results:

customerid  SalesPersonID DueDate                 MaxDueDatebyCustomer    MinCustomer
----------- ------------- ----------------------- ----------------------- -----------
29718       275           2011-06-12 00:00:00.000 2013-03-12 00:00:00.000 29718
29890       283           2011-06-12 00:00:00.000 2013-12-12 00:00:00.000 29718
30052       276           2011-06-12 00:00:00.000 2012-03-12 00:00:00.000 29718
30052       276           2011-09-12 00:00:00.000 2012-03-12 00:00:00.000 29718
29718       275           2011-09-12 00:00:00.000 2013-03-12 00:00:00.000 29718
29718       275           2011-12-13 00:00:00.000 2013-03-12 00:00:00.000 29718
29890       283           2011-12-13 00:00:00.000 2013-12-12 00:00:00.000 29718
30052       276           2011-12-13 00:00:00.000 2012-03-12 00:00:00.000 29718
30052       276           2011-12-13 00:00:00.000 2012-03-12 00:00:00.000 29718
30052       276           2012-03-12 00:00:00.000 2012-03-12 00:00:00.000 29718
29890       283           2012-03-12 00:00:00.000 2013-12-12 00:00:00.000 29718
29718       275           2012-03-12 00:00:00.000 2013-03-12 00:00:00.000 29718
29718       275           2012-06-11 00:00:00.000 2013-03-12 00:00:00.000 29718
29890       283           2012-06-11 00:00:00.000 2013-12-12 00:00:00.000 29718
29890       283           2012-09-11 00:00:00.000 2013-12-12 00:00:00.000 29718
29718       275           2012-09-11 00:00:00.000 2013-03-12 00:00:00.000 29718
29718       277           2012-12-12 00:00:00.000 2013-03-12 00:00:00.000 29718
29890       283           2012-12-12 00:00:00.000 2013-12-12 00:00:00.000 29718
29890       283           2013-03-12 00:00:00.000 2013-12-12 00:00:00.000 29718
29718       277           2013-03-12 00:00:00.000 2013-03-12 00:00:00.000 29718
29890       283           2013-12-12 00:00:00.000 2013-12-12 00:00:00.000 29890

In the results, we have the DueDate values together. If I look at the duedate values first, notice that for each duedate, I get the lowest customer value, which is the min. For 2011-06-12, there are all three customers with data (first 3 rows), but 29718 is returned for all as this is the lowest value. Going through other due dates, we see the same thing. When we get to 2013-12-12, only customer 29890 has data, so this one is returned as the min.

For the MAX() aggregate, we are partitioning by customer, so for customer 29718, the max date is 2013-03-12, which is returned for all rows with this customer. That also happens to be the same date for customer 30052. However, for customer 29890, the max is 2013-12-12, which is returned for all these rows.

This seems complex, but imagine trying to get these two aggregates without an OVER and using GROUP BY. We would need a much more complex query, likely with two different CTEs to calculate these values and then bring them together in a third query. That can be hard to read and hard to maintain. Performance is something to test and evaluate, however, as you want to be sure that you choose the query that stresses your server less. Often this is the window function, but not always.

Using a Window Clause

Maintaining the separate OVER clauses for all aggregates can be hard to do, especially if we reuse the same paritions over and over. In SQL Server 2022, we have the addition of a WINDOW clause, which makes this easier. Let us see how this helps by examining a few queries. First, a complex query that has multiple window clauses that are repeated a few times. This is how the query would need to be written in SQL Server 2019 and previous versions.

SELECT
  customerid
, soh.SalesPersonID
, soh.DueDate
, COUNT(duedate) OVER (PARTITION BY soh.CustomerID) AS MaxDueDatebyCustomer
, MAX(duedate) OVER (PARTITION BY soh.CustomerID) AS MaxDueDatebyCustomer
, MIN(soh.CustomerID) OVER (PARTITION BY soh.DueDate) AS MinCustomer
, SUM(soh.CustomerID) OVER (PARTITION BY soh.DueDate) AS MinCustomer
FROM Sales.SalesOrderHeader AS soh
WHERE
  soh.CustomerID IN ( 29718, 30052, 29890 );

Now look at how this can be improved in SQL Server 2022. This is the same query:

SELECT
  customerid
, soh.SalesPersonID
, soh.DueDate
, COUNT (duedate) OVER pCustomerID AS MaxDueDatebyCustomer
, MAX (duedate) OVER pCustomerID AS MaxDueDatebyCustomer
, MIN (soh.CustomerID) OVER pDueDate AS MinCustomer
, SUM (soh.CustomerID) OVER pDueDate AS MinCustomer
FROM Sales.SalesOrderHeader AS soh
WHERE
  soh.CustomerID IN ( 29718, 30052, 29890 ) 
WINDOW pCustomerID AS(PARTITION BY soh.CustomerID),
       pDueDate AS(PARTITION BY soh.DueDate);

Notice how the definition of the window partition is moved to the WINDOW clause after the WHERE and the aggregates are simpler to read with the OVER clause naming the window.

One thing I have found is that it is good to try and have a naming standard for these windows. For now, we have used a pCol_oCol_RCol structure. Future articles will look at the order by and rows sections of the OVER clause.

I encourage you to use this new WINDOW clause when you can to make queries easier to read.

Summary

This article has looked at the way partitions work in an OVER clause. We examined how paritioning breaks the data into groups that can be processed by the query, often in a much more efficient way than previously done in T-SQL.

This is the start to using Window functions in your queries, but there is much more to learn. We will add the ORDER BY section to the OVER() clause in the next article and begin to have more control over how we analyze and aggregate our data.

Rate

5 (3)

You rated this post out of 5. Change rating

Share

Share

Rate

5 (3)

You rated this post out of 5. Change rating