Click here to monitor SSC
SQLServerCentral is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
 
 
 

PARTITION BY clause works incorrectly when in nested select over large volumes of data

I have been witnessing a very strange behaviour of ROW_NUMBER() PARTITION BY clause in complex join and nested select or CTE statements over large volumes of data with Max Degree of Parallelism <> 1

The ROW_NUMBER() or RANK() were incorrectly marking records within a partiton, and that behaviour was nondeterministic.

Let's say I have three tables girls, shoppingmall and times girls and shoppingmall have from and to dates, and times table has a row for every date. shoppingmall has shopid which can appear in different mallid.

ex: Now, I want to get first mallid where the girl was shopping for every day. I join girls table with times by from and to. Then I create a nested select from shoppingmall also joined to times by from and to, which would use ROW_NUMBER () PARTITION BY in order to give me first mallid for a given date and given shop.

 

select * into #resultset

from girls

left join times ON times.timeid >= girls.datefrom and times.timeid < girls.dateto

left join (select mallid, shopid as shopidx, times.timeid as timeidx, ROW_NUMBER() OVER (PARTITION BY shopid, times.timeid ORDER BY ordering ASC) Lenin

             FROM shoppingmall INNER JOIN times ON times.timeid between shoppingmall.datefrom and shoppingmall.dateto) AS trip

            ON girls.shopid = trip.shopidx AND times.timeid = trip.timeidx and Lenin = 1

 

Now, what I should get is only one row per date and per shopid from shoppingmall

However, if I run this query on a very large dataset ROW_NUMBER() occasionally marks as 1 two rows for same shopid and timeid!

As soon as I add a very restrictive WHERE condition which limits the dataset, ex: WHERE girls.girlid = "Kathy", the query starts behaving correctly (whereas without strict WHERE and on the same record it was marking two rows PARTITIONED BY shopid and timeid as 1)

Also, if I fist insert result of inner select in the temporary table, and then use the table in the join - results are correct.

I am running on SQL Server 2008 10.0.2531.0

I will rewrite all my inner selects to use temporary tables instead of nested selects and CTE, but still I do not like the unexpected and unexplainable behaviour of PARTITION BY.

I guess that this behaviour is due to parallelism because when I changed the Max Degree of Paralellism to 1 this behaviour disappeared and PARTITION BY started working correctly?

Comments

Posted by aedna on 3 November 2010

UPDATE: as data and server load have changed I am not able to reproduce this exact scenario at the moment. As I noted before, it was nondeterministic, i.e. same query produced correct and incorrect results at different times

Posted by Anonymous on 3 November 2010

Pingback from  Twitter Trackbacks for                 SQL Server Central, PARTITION BY clause works incorrectly when in nested select over large volumes of data - aedna@SQL         [sqlservercentral.com]        on Topsy.com

Posted by Anonymous on 4 November 2010

Pingback from  Dew Drop &ndash; November 4, 2010 | Alvin Ashcraft&#039;s Morning Dew

Leave a Comment

Please register or log in to leave a comment.