A Window Function question...

  • Hi,

     I am working with Window Functions and am trying to understand their behaviour a bit better.  The database I am using is the TSQL2012 database from Itzik Ben-Gans book "Inside Microsoft SQL Server 2008:  T-SQL Querying".  I am running SQL Server 2017.

    By the way......the book is awesome! 🙂

    Sorry the code isn't copyable.  My browser keeps crashing when I try to copy and paste code from SSMS.

    This is my table:

    create table sales.orders (
        orderid  bigint,
        custid    int
    )

    The table is filled with data whereby each customer has several orderids against their unique custid.

     I want to find the highest order number for each custid.  This query produces the required results:

    The result set:

    So far so good.  Now I want to rewrite the query to take advantage of a Window Function:

    To me the query looks good but I get the following result set:

     I can't understand why I am getting a different result set from this query.

    Can someone explain what I am doing wrong?

    Regards,
    Kev

  • Kev

    The second query has no WHERE clause, so you'll get one row in your result set for every row in the table.  The first query filters out rows so that you are only left with the highest orderid per customer.

    John

  • The first query is simulating an aggregate function. It's not returning all rows, just one for each customer. The second query will return one row per order.
    You don't need group by when using window functions. But you would need to correct the group by and remove the OVER clause to get the correct result.

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • I'm now sure why you want to use a Window Function for this task. Would a GROUP BY not achieve what you're looking for instead?
    SELECT MAX(o.orderid) AS orderid,
           o.custid
    FROM Sales.Orders o
    GROUP BY o.custid
    ORDER BY o.custid ASC;

    Using a Window Function, as you saw, will bring back lots of rows for a single customer
    because they have many orders; the Window function does not eliminate/group these, it simply let's you know the MAX order number for that customer on every row.

    Edit:"Windows" Function, heh. :hehe:

    Thom~

    Excuse my typos and sometimes awful grammar. My fingers work faster than my brain does.
    Larnu.uk

  • Quick thought, in the case of the aggregated results, there is no benefit of using Window functions.
    😎

  • Thom A - Thursday, September 28, 2017 9:07 AM

    I'm now sure why you want to use a Window Function for this task. Would a GROUP BY not achieve what you're looking for instead?
    SELECT MAX(o.orderid) AS orderid,
           o.custid
    FROM Sales.Orders o
    GROUP BY o.custid
    ORDER BY o.custid ASC;

    Using a Window Function, as you saw, will bring back lots of rows for a single customer
    because they have many orders; the Window function does not eliminate/group these, it simply let's you know the MAX order number for that customer on every row.

    Edit:"Windows" Function, heh. :hehe:

  • Much simpler query:
    SELECT MAX(o.orderid) OVER(PARTITION BY o.custid ORDER BY o.orderid) AS LastOrderId,
        o.custid
    FROM Sales.Orders AS o

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

  • So, this is how I solved it first of all....

    selectdistinct(max(o.orderid) over (partition by o.custid order by o.custid)) as LastOrderId
              ,o.custid

         from Sales.Orders o

         order by o.custid asc

    ;

    It works but when I look at the execution plan.....

    I know that there are many of you out there that cringed when they saw the distinct but I wanted to see the difference between the Window Function and a normal GROUP BY: 


    SELECT

    MAX(o.orderid) AS orderid,

    o.custid

    FROM Sales.Orders o

    GROUP BY o.custid

    ORDER BY o.custid ASC;

    Although it is still a scan and not a seek, the improvement is clear and this was what I was looking for.

    @Eirikur:  The majority of examples I have seen have used aggregate functions and that is something I wanted to understand a bit more.  I know that there are some pretty elegant Window Functions available (LEAD, LAG and so on) but if it can be used then I don't make the assumption that it should be used or that there isn't a better solution.  That is the reason why I wanted to get a better look at this particular problem.  But now my question:  If there is no benefit to generating aggregates using Window Functions then where do the real benefits lie?

    @sgmunson:  This was my first solution (although I put a GROUP BY in it) but it didn't produce the required results; the highest orderid that a single custid has (see my very first post).  The post resulted from my confusion about why it didn't produce the required results.

  • kevaburg - Friday, September 29, 2017 4:22 AM

    Although it is still a scan and not a seek, the improvement is clear and this was what I was looking for.

    That's to be expected. There's no filtering on your query, so SQL Server has to scan the entire index to get all the results. On the other hand, if you add a WHERE then it changes to a seek. For example:
    USE Sandbox;
    GO

    CREATE TABLE orders (orderid int PRIMARY KEY CLUSTERED,
           custid int);
    GO
    CREATE INDEX idx_custid ON orders (custid ASC);

    GO
    INSERT INTO orders
    VALUES (12,324),
       (17,324),
       (123,324),
       (435,324),
       (546,123),
       (153,123),
       (654,123),
       (14,22),
       (234,22),
       (54,65),
       (6573,6),
       (489,6),
       (694,6),
       (21,567),
       (23,567);
    GO
    SELECT *
    FROM orders;

    SELECT *
    FROM orders
    WHERE orderid = 23;

    SELECT MAX(orderid) AS orderid, custid
    FROM orders
    GROUP BY custid;

    SELECT MAX(orderid) AS orderid, custid
    FROM orders
    WHERE custid > 100
    GROUP BY custid;
    GO
    DROP TABLE orders;
    GO

    (BE CAREFUL, I have a drop clause in there).

    Notice, the first query (just the Select *) uses a Index Scan on the Clustered Index. The second, a Seek on the Clustered Index. The latter two use a Scan and Seek respectively as well on the NonClustered Index. This is due to the 2 queries with a Seek having a predicate on one of the columns. Without the predicate in both queries SQL Server has to get the result for every row; hence the Scan.

    Thom~

    Excuse my typos and sometimes awful grammar. My fingers work faster than my brain does.
    Larnu.uk

  • Thom A - Friday, September 29, 2017 4:42 AM

    kevaburg - Friday, September 29, 2017 4:22 AM

    Although it is still a scan and not a seek, the improvement is clear and this was what I was looking for.

    That's to be expected. There's no filtering on your query, so SQL Server has to scan the entire index to get all the results. On the other hand, if you add a WHERE then it changes to a seek. For example:
    USE Sandbox;
    GO

    CREATE TABLE orders (orderid int PRIMARY KEY CLUSTERED,
           custid int);
    GO
    CREATE INDEX idx_custid ON orders (custid ASC);

    GO
    INSERT INTO orders
    VALUES (12,324),
       (17,324),
       (123,324),
       (435,324),
       (546,123),
       (153,123),
       (654,123),
       (14,22),
       (234,22),
       (54,65),
       (6573,6),
       (489,6),
       (694,6),
       (21,567),
       (23,567);
    GO
    SELECT *
    FROM orders;

    SELECT *
    FROM orders
    WHERE orderid = 23;

    SELECT MAX(orderid) AS orderid, custid
    FROM orders
    GROUP BY custid;

    SELECT MAX(orderid) AS orderid, custid
    FROM orders
    WHERE custid > 100
    GROUP BY custid;
    GO
    DROP TABLE orders;
    GO

    (BE CAREFUL, I have a drop clause in there).

    Notice, the first query (just the Select *) uses a Index Scan on the Clustered Index. The second, a Seek on the Clustered Index. The latter two use a Scan and Seek respectively as well on the NonClustered Index. This is due to the 2 queries with a Seek having a predicate on one of the columns. Without the predicate in both queries SQL Server has to get the result for every row; hence the Scan.

    OK, I understand that.  With the production tables I have worked with it was rarely the case that we retrieved the entire table in a query so in producing a subset of data with a sensible index on it I would see a seek rather than a scan.  Certainly something to think about when considering how much data is actually required.

  • kevaburg - Thursday, September 28, 2017 8:38 AM

     I can't understand why I am getting a different result set from this query.

    Can someone explain what I am doing wrong?

    Regards,
    Kev

    It may help you to see what you are doing wrong by including o.orderid in your columns selected.  The query you tried essentialy says "for every unique combination of custid & orderid, get me the most recent orderid".  Each unique combination is looking into a "window" for the most recent orderid for the customer.

    Wes
    (A solid design is always preferable to a creative workaround)

Viewing 11 posts - 1 through 10 (of 10 total)

You must be logged in to reply to this topic. Login to reply