Blog Post

Rows or Range, What’s the difference ?

,

With the release of Denali CTP3 came an extension of the over clause to allow for a sliding window of data.  This allows us to , amongst other things, to efficiently and neatly calculate rolling balances.  This is a very common requirement for a database system and one which crops up time and time again on forums etc.

The basic syntax is documented in BOL here and Wayne Sheffield(blog|twitter) has written a great introduction here.

To help demonstrate, I need to generate a temporary table using some data from AdventureWorks.

use AdventureWorks2008R2
go
drop table #orders
go

Select SalesPersonID,
min(OrderDate) as OrderMonth,
sum(TotalDue ) as TotalDue
into #Orders
from Sales.SalesOrderHeader SOH
where SOH.SalesPersonID is not null
group by SalesPersonID,datediff(mm,0,OrderDate);

go
create clustered index idxOrder on #Orders(SalesPersonId,OrderMonth);
go

So, in Denali CTP3 to perform a rolling balance of the TotalDue accumulating for each OrderMonth we can perform:

select *,
sum(TotalDue) over(partition by SalesPersonId
order by OrderMonth)
as RollingBalance
from #Orders
order by SalesPersonID,OrderMonth;
and we will see a result set like this
 
image

Neat, a rolling balance, and as we as partitioning by SalesPersonID it will be re-initialised for each SalesPerson.  Problem solved.

But what about the ‘BY RANGE’ and ‘BY ROWS’ option of the clause , what’s their significance ?

Let us duplicate the data in the table

insert into #Orders
select * from #Orders

and retry the RollingBalance query (  remember that the default is RANGE so this is exactly the same query as before )

select *,
sum(TotalDue) over(partition by SalesPersonId
order by OrderMonth
Range UNBOUNDED PRECEDING)
as RollingBalance
from #Orders
order by SalesPersonID,OrderMonth;

image

Notice how the RollingBalance is now summing together all the totaldue values for each SalesPersonId / ordermonth.

Where-as by using ROWS

select *,
sum(TotalDue) over(partition by SalesPersonId
order by OrderMonth
Rows UNBOUNDED PRECEDING)
as RollingBalance
from #Orders
order by SalesPersonID,OrderMonth;

image

We now have the value incremented over each row.

Other than the output , ( which i grant you is quite important Smile ) there is another fundamental difference to be looked at.  How much work has SQLServer done to service both of these queries ?

Lets create more data..

insert into #Orders
Select SalesPersonID,OrderMonth,TotalDue
from #Orders
go 8
and re-execute the queries
 
image
 
The columns are (CPU,Reads,Writes and TotalDuration)

Wow , quite a stark difference.  But, a very big but, if range is what you require then that is what you have to do.

Now consider a different set of data.

drop table #orders
go

Select SalesPersonID,
min(OrderDate) as OrderMonth,
sum(TotalDue ) as TotalDue
into #Orders
from Sales.SalesOrderHeader SOH
where SOH.SalesPersonID is not null
group by SalesPersonID,datediff(mm,0,OrderDate);

go
create unique clustered index idxOrder on #Orders(SalesPersonId,OrderMonth);
go


declare @MaxSalesPersonId integer
Select @MaxSalesPersonId = max(SalesPersonId)
from #Orders

insert into #Orders
select SalesPersonID+@MaxSalesPersonId,OrderMonth,TotalDue
from #Orders
go 8

The same amount of rows overall but because there are now many more combinations of SalesPersonID and OrderMonth the difference in timings is now even more stark.

image

Also,  the outputs are now exactly the same.  In-fact notice how I have a unique index on SalesPersonID and OrderMonth so by definition the outputs are guaranteed to be the same, every ‘range’ will only ever have one row. 

From experience I would say that ROWS will be used in 99% of scenario’s and TBH i think that

A) BY ROWS should of been the default

B) The optimizer could be enhanced to ensure that if unique , then BY ROWS is used. 

Rate

You rated this post out of 5. Change rating

Share

Share

Rate

You rated this post out of 5. Change rating