Blog Post

PARTITION BY clause works incorrectly when in nested select over large volumes of data


I have been witnessing a very strange behaviour of ROW_NUMBER() PARTITION BY clause in complex join and nested select or CTE statements over large volumes of data with Max Degree of Parallelism <> 1

The ROW_NUMBER() or RANK() were incorrectly marking records within a partiton, and that behaviour was nondeterministic.

Let's say I have three tables girls, shoppingmall and times girls and shoppingmall have from and to dates, and times table has a row for every date. shoppingmall has shopid which can appear in different mallid.

ex: Now, I want to get first mallid where the girl was shopping for every day. I join girls table with times by from and to. Then I create a nested select from shoppingmall also joined to times by from and to, which would use ROW_NUMBER () PARTITION BY in order to give me first mallid for a given date and given shop.


select * into #resultset

from girls

left join times ON times.timeid >= girls.datefrom and times.timeid < girls.dateto

left join (select mallid, shopid as shopidx, times.timeid as timeidx, ROW_NUMBER() OVER (PARTITION BY shopid, times.timeid ORDER BY ordering ASC) Lenin

             FROM shoppingmall INNER JOIN times ON times.timeid between shoppingmall.datefrom and shoppingmall.dateto) AS trip

            ON girls.shopid = trip.shopidx AND times.timeid = trip.timeidx and Lenin = 1


Now, what I should get is only one row per date and per shopid from shoppingmall

However, if I run this query on a very large dataset ROW_NUMBER() occasionally marks as 1 two rows for same shopid and timeid!

As soon as I add a very restrictive WHERE condition which limits the dataset, ex: WHERE girls.girlid = "Kathy", the query starts behaving correctly (whereas without strict WHERE and on the same record it was marking two rows PARTITIONED BY shopid and timeid as 1)

Also, if I fist insert result of inner select in the temporary table, and then use the table in the join - results are correct.

I am running on SQL Server 2008 10.0.2531.0

I will rewrite all my inner selects to use temporary tables instead of nested selects and CTE, but still I do not like the unexpected and unexplainable behaviour of PARTITION BY.

I guess that this behaviour is due to parallelism because when I changed the Max Degree of Paralellism to 1 this behaviour disappeared and PARTITION BY started working correctly?


You rated this post out of 5. Change rating




You rated this post out of 5. Change rating