Critique My Understanding of Window Functions Please...

  • Good Evening Gang,

    My knowledge of Window Functions is a bit lacking, so decided this weekend to spend some time getting it nailed, as it were. As a result, I've been using the AdventureWorks database as my test DB and was hoping you guys could let me know if I'm doing something wrong with my query, for a given a made up scenario.

    My made up scenario is to identify average pay rates, broken down by Gender, Marital Status and All.

    I'll post my code below, and then explain why I used various functions:

    with mytable as

    (

    select

    e.[BusinessEntityID],

    row_number() over (partition by p.businessentityid order by p.[RateChangeDate] desc) as recency,

    [Gender],

    [MaritalStatus],

    p.rate,

    avg(p.rate) over (partition by gender) as avg_rate_for_gender,

    AVG(p.rate) over (partition by maritalstatus) as avg_rate_for_marital_status,

    avg(p.rate) over() as avg_rate

    from [HumanResources].[Employee] e

    inner join [HumanResources].[EmployeePayHistory] p on e.BusinessEntityID = p.BusinessEntityID

    )

    select * from mytable

    where recency =1

    order by [BusinessEntityID] asc

    So, explanations as follows:

    Why row_number?

    Employee pay rate is stored in [EmployeePayHistory] - this table also keeps historical changes, so I am using the row_number function to identify the most recent rate for a person (BusinessEntityID)

    Why avg over (partition by)?

    I want to find the average pay rate dervied from gender and marital status

    Why avg over()?

    I also want the average pay rate accross all users regardless of their gender or marital status.

    Thanks guys, I appreciate any comments/confirmation you can give on my understanding of these functions.

    Cheers

    Alex

  • Very quick note, you are definitely on the right track here, the notes are all correct and accurate so far.

    😎

  • Thanks Eirikur!

  • If you're getting into windowed functions in SQL 2012, I'd encourage to to also included "Window Frames" in your study.

    Windowed functions are an extremely powerful feature, and 2012's inclusion of widow frames only adds to that.

    Also, if you haven'rt already picked up a copy, I don't think any attempt to learn Windowed Functions is complete without this... Microsoft SQL Server 2012 High-Performance T-SQL Using Window Functions (Developer Reference)

  • Jason A. Long (5/25/2015)


    If you're getting into windowed functions in SQL 2012, I'd encourage to to also included "Window Frames" in your study.

    Windowed functions are an extremely powerful feature, and 2012's inclusion of widow frames only adds to that.

    Also, if you haven'rt already picked up a copy, I don't think any attempt to learn Windowed Functions is complete without this... Microsoft SQL Server 2012 High-Performance T-SQL Using Window Functions (Developer Reference)

    +1, Itzik's book is a very good reference!

    😎

  • Yes, I've got Kindle copies of High Performance Windows Functions and Expert TSQL Windows Functions.

    Am I correct in thinking that the Window Frame portion is the part of the call that uses 'unbounded preceding' ~ 'current row'? Or is there more to it?

    Cheers

  • alex.sqldba (5/25/2015)


    Am I correct in thinking that the Window Frame portion is the part of the call that uses 'unbounded preceding' ~ 'current row'? Or is there more to it?

    Cheers

    You are correct... On both counts... The window frame allows you a great deal of control in determining which rows from the set are included in the window.

  • It's the "FOLLOWNG" portion that's a bit of a head-trip for me, right now!

  • Morning

    I've just tried to add in the lag function to the above query (reposted below) because I want to add the previous pay rate per user to the report, so I've have the following query:

    with mytable as

    (

    select

    e.[BusinessEntityID],

    row_number() over (partition by p.businessentityid order by p.[RateChangeDate] desc) as recency,

    [Gender],

    [MaritalStatus],

    p.rate as current_rate,

    lag(p.rate) over (partition by p.businessentityid order by p.[RateChangeDate] desc) as previous_rate,

    avg(p.rate) over (partition by gender) as avg_rate_for_gender,

    AVG(p.rate) over (partition by maritalstatus) as avg_rate_for_marital_status,

    avg(p.rate) over() as avg_rate

    from [HumanResources].[Employee] e

    inner join [HumanResources].[EmployeePayHistory] p on e.BusinessEntityID = p.BusinessEntityID

    )

    select * from mytable

    where recency =1

    order by [BusinessEntityID] asc

    Which you can see the LAG/OVER() to I thought grab the previous row in that window.

    But when I sort that window DESC (the same sort as I use to get the most recent pay as current_pay) it returns all nulls. If I sort ASC instead I get some values back as expected.

    So, my question is why? Why sorting DESC do I not get the right value back?

    Cheers

    Alex

  • I think a slightly more simplistic sample offers a better illustration of what happens with LAG & LEAD in relation to the order by clause...

    -- Test Data --

    IF OBJECT_ID('tempdb..#temp') IS NOT NULL

    DROP TABLE #temp;

    CREATE TABLE #temp (

    ID INT,

    Value VARCHAR(15)

    );

    INSERT #temp (ID,Value) VALUES

    (1,'One'),(2,'Two'),(3,'Three'),(4,'Four'),(5,'Five'),

    (6,'Six'),(7,'Seven'),(8,'Eight'),(9,'Nine'),(10,'Ten')

    -- LAG & LEAD query --

    SELECT

    t.ID,

    t.Value,

    LAG(t.Value, 1) OVER (ORDER BY t.ID) Lag_Asc,

    LAG(t.Value, 1) OVER (ORDER BY t.ID DESC) Lag_Desc,

    LEAD(t.Value, 1) OVER (ORDER BY t.ID) Lead_Asc,

    LEAD(t.Value, 1) OVER (ORDER BY t.ID DESC) Lead_Desc

    FROM

    #temp t

    ORDER BY

    t.ID

    Results...

    ID Value Lag_Asc Lag_Desc Lead_Asc Lead_Desc

    ----------- --------------- --------------- --------------- --------------- ---------------

    1 One NULL Two Two NULL

    2 Two One Three Three One

    3 Three Two Four Four Two

    4 Four Three Five Five Three

    5 Five Four Six Six Four

    6 Six Five Seven Seven Five

    7 Seven Six Eight Eight Six

    8 Eight Seven Nine Nine Seven

    9 Nine Eight Ten Ten Eight

    10 Ten Nine NULL NULL Nine

    As you can see from the example, sorting in DESC order, causes the LAG to act like LEAD and vise versa...

  • I think you're on the right track, you'll notice in Jason's excellent example that he includes a number in his lag/lead functions (you did not not which is the problem) and I'll say that Microsoft SQL Server 2012 High-Performance T-SQL Using Window Functions (Ben-Gan) changed my career.

    A few other things to note about windows functions:

    Dwain Camps wrote an excellent article some time back titled, The Performance of the T-SQL Window Functions[/url]. A key takeaway from that article is that aggregate windows functions, though more convenient, are pretty darn slow. It's good to understand them but I would avoid them.

    One of the most important uses of Windows Ranking Functions is the good ol':

    SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)). I first learned about that here: The "Numbers" or "Tally" Table: What it is and how it replaces a loop [/url]. You see Itzik Ben-Gan use that in the aforementioned book and it's not uncommon you'll see that around these forums. You can also do the ol' OVER (ORDER BY (SELECT NULL)) with NTILE() but I NTILE(), unlike the other ranking functions, is rather slow in MS SQL Server as the MS implementation is rather poorly designed IMHO.

    Lastly, If you're going to start using Windows Functions then you should begin practicing how to index properly for Windows Functions. In Ben-Gan's Microsoft SQL Server 2012 High-Performance T-SQL Using Window Functions he discusses the POC index which stands for Partition, Order and Cover. Take this scaled down version of the query you posted (scaled down for brevity):

    select

    row_number() over (partition by p.businessentityid order by p.[RateChangeDate] desc) as recency,

    p.rate

    FROM [HumanResources].[EmployeePayHistory] p

    A good index to have handy if you ran the above query frequently would be:

    CREATE INDEX index_xxx ON HumanResources.EmployeePayHistory(businessentityid, RateChangeDate desc)

    INCLUDE (rate)

    It first handles the partition (P), then the Order (O) and includes rate (covering) which, in the case, that single index could handle the above query with a single non-clustered index.

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • Awesome reply, Alan... Cheers!

    I need to re-read it a few times!

  • Brilliant, cheers Jason!

    So, for Lag and Lead to work as 'expected' - I should leave the default sort order in place, realistically?

  • useful reading here

    http://sqlmag.com/sql-server-2012/how-use-microsoft-sql-server-2012s-window-functions-part-1

    http://sqlmag.com/sql-server-2012/microsoft-sql-server-2012-how-write-t-sql-window-functions-part-2

    http://sqlmag.com/sql-server-2012/sql-server-2012-how-write-t-sql-window-functions-part-3

    ________________________________________________________________
    you can lead a user to data....but you cannot make them think
    and remember....every day is a school day

Viewing 14 posts - 1 through 13 (of 13 total)

You must be logged in to reply to this topic. Login to reply