Critique My Understanding of Window Functions Please...

Question

Post reply

Critique My Understanding of Window Functions Please...

alex.sqldba

SSChampion

Points: 10254
More actions
May 25, 2015 at 2:06 pm

#297238

Good Evening Gang,
My knowledge of Window Functions is a bit lacking, so decided this weekend to spend some time getting it nailed, as it were. As a result, I've been using the AdventureWorks database as my test DB and was hoping you guys could let me know if I'm doing something wrong with my query, for a given a made up scenario.
My made up scenario is to identify average pay rates, broken down by Gender, Marital Status and All.
I'll post my code below, and then explain why I used various functions:
with mytable as
(
select
e.[BusinessEntityID],
row_number() over (partition by p.businessentityid order by p.[RateChangeDate] desc) as recency,
[Gender],
[MaritalStatus],
p.rate,
avg(p.rate) over (partition by gender) as avg_rate_for_gender,
AVG(p.rate) over (partition by maritalstatus) as avg_rate_for_marital_status,
avg(p.rate) over() as avg_rate
from [HumanResources].[Employee] e
inner join [HumanResources].[EmployeePayHistory] p on e.BusinessEntityID = p.BusinessEntityID
)
select * from mytable
where recency =1
order by [BusinessEntityID] asc
So, explanations as follows:
Why row_number?
Employee pay rate is stored in [EmployeePayHistory] - this table also keeps historical changes, so I am using the row_number function to identify the most recent rate for a person (BusinessEntityID)
Why avg over (partition by)?
I want to find the average pay rate dervied from gender and marital status
Why avg over()?
I also want the average pay rate accross all users regardless of their gender or marital status.
Thanks guys, I appreciate any comments/confirmation you can give on my understanding of these functions.
Cheers
Alex

Viewing 14 posts - 1 through 13 (of 13 total)

You must be logged in to reply to this topic. Login to reply

Eirikur Eiriksson SSC Guru Points: 182882 More actions · Answer 1

Very quick note, you are definitely on the right track here, the notes are all correct and accurate so far.

😎

alex.sqldba SSChampion Points: 10254 More actions · Answer 2

alex.sqldba

SSChampion

Points: 10254

May 25, 2015 at 2:23 pm

#1800446

Thanks Eirikur!

Jason A. Long SSC-Insane Points: 23780 More actions · Answer 3

If you're getting into windowed functions in SQL 2012, I'd encourage to to also included "Window Frames" in your study.

Windowed functions are an extremely powerful feature, and 2012's inclusion of widow frames only adds to that.

Also, if you haven'rt already picked up a copy, I don't think any attempt to learn Windowed Functions is complete without this... Microsoft SQL Server 2012 High-Performance T-SQL Using Window Functions (Developer Reference)

Eirikur Eiriksson SSC Guru Points: 182882 More actions · Answer 4

Jason A. Long (5/25/2015)
If you're getting into windowed functions in SQL 2012, I'd encourage to to also included "Window Frames" in your study.
Windowed functions are an extremely powerful feature, and 2012's inclusion of widow frames only adds to that.
Also, if you haven'rt already picked up a copy, I don't think any attempt to learn Windowed Functions is complete without this... Microsoft SQL Server 2012 High-Performance T-SQL Using Window Functions (Developer Reference)

+1, Itzik's book is a very good reference!

😎

alex.sqldba SSChampion Points: 10254 More actions · Answer 5

Yes, I've got Kindle copies of High Performance Windows Functions and Expert TSQL Windows Functions.

Am I correct in thinking that the Window Frame portion is the part of the call that uses 'unbounded preceding' ~ 'current row'? Or is there more to it?

Cheers

Jason A. Long SSC-Insane Points: 23780 More actions · Answer 6

alex.sqldba (5/25/2015)
Am I correct in thinking that the Window Frame portion is the part of the call that uses 'unbounded preceding' ~ 'current row'? Or is there more to it?
Cheers

You are correct... On both counts... The window frame allows you a great deal of control in determining which rows from the set are included in the window.

alex.sqldba SSChampion Points: 10254 More actions · Answer 7

It's the "FOLLOWNG" portion that's a bit of a head-trip for me, right now!

alex.sqldba SSChampion Points: 10254 More actions · Answer 8

Morning

I've just tried to add in the lag function to the above query (reposted below) because I want to add the previous pay rate per user to the report, so I've have the following query:

with mytable as

(

select

e.[BusinessEntityID],

row_number() over (partition by p.businessentityid order by p.[RateChangeDate] desc) as recency,

[Gender],

[MaritalStatus],

p.rate as current_rate,

lag(p.rate) over (partition by p.businessentityid order by p.[RateChangeDate] desc) as previous_rate,

avg(p.rate) over (partition by gender) as avg_rate_for_gender,

AVG(p.rate) over (partition by maritalstatus) as avg_rate_for_marital_status,

avg(p.rate) over() as avg_rate

from [HumanResources].[Employee] e

inner join [HumanResources].[EmployeePayHistory] p on e.BusinessEntityID = p.BusinessEntityID

)

select * from mytable

where recency =1

order by [BusinessEntityID] asc

Which you can see the LAG/OVER() to I thought grab the previous row in that window.

But when I sort that window DESC (the same sort as I use to get the most recent pay as current_pay) it returns all nulls. If I sort ASC instead I get some values back as expected.

So, my question is why? Why sorting DESC do I not get the right value back?

Cheers

Alex

Jason A. Long SSC-Insane Points: 23780 More actions · Answer 9

I think a slightly more simplistic sample offers a better illustration of what happens with LAG & LEAD in relation to the order by clause...

-- Test Data --

IF OBJECT_ID('tempdb..#temp') IS NOT NULL

DROP TABLE #temp;

CREATE TABLE #temp (

ID INT,

Value VARCHAR(15)

);

INSERT #temp (ID,Value) VALUES

(1,'One'),(2,'Two'),(3,'Three'),(4,'Four'),(5,'Five'),

(6,'Six'),(7,'Seven'),(8,'Eight'),(9,'Nine'),(10,'Ten')

-- LAG & LEAD query --

SELECT

t.ID,

t.Value,

LAG(t.Value, 1) OVER (ORDER BY t.ID) Lag_Asc,

LAG(t.Value, 1) OVER (ORDER BY t.ID DESC) Lag_Desc,

LEAD(t.Value, 1) OVER (ORDER BY t.ID) Lead_Asc,

LEAD(t.Value, 1) OVER (ORDER BY t.ID DESC) Lead_Desc

FROM

#temp t

ORDER BY

t.ID

Results...

ID Value Lag_Asc Lag_Desc Lead_Asc Lead_Desc

----------- --------------- --------------- --------------- --------------- ---------------

1 One NULL Two Two NULL

2 Two One Three Three One

3 Three Two Four Four Two

4 Four Three Five Five Three

5 Five Four Six Six Four

6 Six Five Seven Seven Five

7 Seven Six Eight Eight Six

8 Eight Seven Nine Nine Seven

9 Nine Eight Ten Ten Eight

10 Ten Nine NULL NULL Nine

As you can see from the example, sorting in DESC order, causes the LAG to act like LEAD and vise versa...

Alan Burstein SSC Guru Points: 61136 More actions · Answer 10

I think you're on the right track, you'll notice in Jason's excellent example that he includes a number in his lag/lead functions (you did not not which is the problem) and I'll say that Microsoft SQL Server 2012 High-Performance T-SQL Using Window Functions (Ben-Gan) changed my career.

A few other things to note about windows functions:

Dwain Camps wrote an excellent article some time back titled, The Performance of the T-SQL Window Functions[/url]. A key takeaway from that article is that aggregate windows functions, though more convenient, are pretty darn slow. It's good to understand them but I would avoid them.

One of the most important uses of Windows Ranking Functions is the good ol':

SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)). I first learned about that here: The "Numbers" or "Tally" Table: What it is and how it replaces a loop [/url]. You see Itzik Ben-Gan use that in the aforementioned book and it's not uncommon you'll see that around these forums. You can also do the ol' OVER (ORDER BY (SELECT NULL)) with NTILE() but I NTILE(), unlike the other ranking functions, is rather slow in MS SQL Server as the MS implementation is rather poorly designed IMHO.

Lastly, If you're going to start using Windows Functions then you should begin practicing how to index properly for Windows Functions. In Ben-Gan's Microsoft SQL Server 2012 High-Performance T-SQL Using Window Functions he discusses the POC index which stands for Partition, Order and Cover. Take this scaled down version of the query you posted (scaled down for brevity):

select

row_number() over (partition by p.businessentityid order by p.[RateChangeDate] desc) as recency,

p.rate

FROM [HumanResources].[EmployeePayHistory] p

A good index to have handy if you ran the above query frequently would be:

CREATE INDEX index_xxx ON HumanResources.EmployeePayHistory(businessentityid, RateChangeDate desc)

INCLUDE (rate)

It first handles the partition (P), then the Order (O) and includes rate (covering) which, in the case, that single index could handle the above query with a single non-clustered index.

"I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

-- Itzik Ben-Gan 2001

alex.sqldba SSChampion Points: 10254 More actions · Answer 11

Awesome reply, Alan... Cheers!

I need to re-read it a few times!

alex.sqldba SSChampion Points: 10254 More actions · Answer 12

Brilliant, cheers Jason!

So, for Lag and Lead to work as 'expected' - I should leave the default sort order in place, realistically?

J Livingston SQL SSC Guru Points: 51550 More actions · Answer 13

useful reading here

http://sqlmag.com/sql-server-2012/how-use-microsoft-sql-server-2012s-window-functions-part-1

http://sqlmag.com/sql-server-2012/microsoft-sql-server-2012-how-write-t-sql-window-functions-part-2

http://sqlmag.com/sql-server-2012/sql-server-2012-how-write-t-sql-window-functions-part-3

________________________________________________________________
you can lead a user to data....but you cannot make them think
and remember....every day is a school day