Weird "Group By" Behavior from Subquery with Derived Table

  • I was asked to help a developer with a slow query.  While making changes, I generated a query with a GROUP BY that produces several records for each group.  In addition, the where clause no longer limits the records included in the resultset.  Can somebody explain why having a derived table in the sub query causes this behavior?  I'm using SQL Server 2000. 

    -- Create and load a test table. 

    CREATE TABLE #temp (

     [calendar_sid] [int] NOT NULL,

     [case_sid] [int] NOT NULL,

     [defendant_sid] [int] NULL,

     [calendar_time] [datetime] NULL ,

     CONSTRAINT [PK_case_calendar] PRIMARY KEY CLUSTERED

        (

            [calendar_sid]

        ) ON [PRIMARY] ,

    ) ON [PRIMARY]

    SET NOCOUNT ON

    INSERT INTO #temp

    SELECT calendar_sid, case_sid, defendant_sid, calendar_time

    FROM case_calendar

    SET NOCOUNT OFF

    -- This query works as expected - one record per c.defendant_sid

    SELECT c.defendant_sid,

        count(*) as [count],

        CAST(MIN(c.calendar_time) as char(20)) as [MIN(calendar_time)],

        (   SELECT CAST(MIN(t.calendar_time) as char(20))

            FROM #temp t

            WHERE t.defendant_sid = c.defendant_sid

            ) as [Min2]

    FROM #temp c

    WHERE c.defendant_sid < 10

    GROUP BY c.defendant_sid

    ORDER BY c.defendant_sid

    -- This query does not work as expected - many records

    -- per c.defendant_sid and c.defendant_sid < 10 not true

    SELECT c.defendant_sid,

        count(*) as [count],

        CAST(MIN(c.calendar_time) as char(20)) as [MIN(calendar_time)],

        (SELECT CAST(MIN(a.calendar_time) as char(20))

            FROM (

                SELECT t.calendar_time

             FROM #temp t

             WHERE t.defendant_sid = c.defendant_sid

                ) as a

            ) as [Min2]

    FROM #temp c

    WHERE c.defendant_sid < 10

    GROUP BY c.defendant_sid

    ORDER BY c.defendant_sid

    drop table #temp

    Results:

    defendant_sid count       MIN(calendar_time)   Min2                

    ------------- ----------- -------------------- --------------------

    3             1           May  1 2003 10:00AM  May  1 2003 10:00AM

    6             2           Mar  1 2003 10:00AM  Mar  1 2003 10:00AM

    9             2           Feb  1 2003  1:00PM  Feb  1 2003  1:00PM

     

    (3 row(s) affected)

     

    defendant_sid count       MIN(calendar_time)   Min2                

    ------------- ----------- -------------------- --------------------

    3             1           May  1 2003 10:00AM  May  1 2003 10:00AM

    3             2           Mar  1 2003 10:00AM  May  1 2003 10:00AM

    3             2           Feb  1 2003  1:00PM  May  1 2003 10:00AM

    6             1           May  1 2003 10:00AM  Mar  1 2003 10:00AM

    6             2           Mar  1 2003 10:00AM  Mar  1 2003 10:00AM

    6             2           Feb  1 2003  1:00PM  Mar  1 2003 10:00AM

    6             1           May  1 2003 10:00AM  Mar  1 2003 10:00AM

    6             2           Mar  1 2003 10:00AM  Mar  1 2003 10:00AM

    6             2           Feb  1 2003  1:00PM  Mar  1 2003 10:00AM

    9             1           May  1 2003 10:00AM  Feb  1 2003  1:00PM

    9             2           Mar  1 2003 10:00AM  Feb  1 2003  1:00PM

    9             2           Feb  1 2003  1:00PM  Feb  1 2003  1:00PM

    9             1           May  1 2003 10:00AM  Feb  1 2003  1:00PM

    9             2           Mar  1 2003 10:00AM  Feb  1 2003  1:00PM

    9             2           Feb  1 2003  1:00PM  Feb  1 2003  1:00PM

    13            1           May  1 2003 10:00AM  Mar 27 2003  9:00AM

    13            2           Mar  1 2003 10:00AM  Mar 27 2003  9:00AM

    13            2           Feb  1 2003  1:00PM  Mar 27 2003  9:00AM

    13            1           May  1 2003 10:00AM  Mar 27 2003  9:00AM

    13            2           Mar  1 2003 10:00AM  Mar 27 2003  9:00AM

    13            2           Feb  1 2003  1:00PM  Mar 27 2003  9:00AM

    14            1           May  1 2003 10:00AM  Apr 28 2003  1:00PM

    ...

    1149          2           Feb  1 2003  1:00PM  Jan 22 2004  1:30PM

    1150          1           May  1 2003 10:00AM  Jan 22 2004  1:00PM

    1150          2           Mar  1 2003 10:00AM  Jan 22 2004  1:00PM

    1150          2           Feb  1 2003  1:00PM  Jan 22 2004  1:00PM

    1150          1           May  1 2003 10:00AM  Jan 22 2004  1:00PM

    1150          2           Mar  1 2003 10:00AM  Jan 22 2004  1:00PM

    1150          2           Feb  1 2003  1:00PM  Jan 22 2004  1:00PM

     

    (1236 row(s) affected)

    PS  I probably will not use a subquery in the final result, but I'm curious about this. 

    RandyHelpdesk: Perhaps Im not the only one that does not know what you are doing. 😉

  • It looks like the sort of bug where the parser designers never tested a correlated derived table in a select list used with aggregates.  This should instead give you an error message, as derived tables consisting of correlated subqueries are generally not allowed.  I think it's performing the correlation before aggregating, essentially overriding the aggregation of the first MIN().  Your first formulation is better, anyway, and you could also (as you like derived tables ) try:

    SELECT c.defendant_sid,

        count(*) as [count],

        CAST(MIN(c.calendar_time) as char(20)) as [MIN(calendar_time)], 

        CAST(x.xMin) as char(20)) as [Min2]

    FROM #temp c JOIN

    (SELECT Defendent_SId, MIN(Calendar_Time) xMin

     FROM #temp

     GROUP BY Defendant_SId) x ON x.Defendant_SId = c.Defendant_SId

    WHERE c.defendant_sid < 10

    GROUP BY c.defendant_sid, x.xMin

    ORDER BY c.defendant_sid



    --Jonathan

  • I think I'll skip the derived table.  I can never tell what I'm going to get from the developers .  The original query is worse as this stuff was in the on clause of a join .  It does get interesting from time to time to see a query that works, but is not compatible with sanity.  BTW, the select from a correlated subquery in the derived table did work in the on clause of the join.   

     

    RandyHelpdesk: Perhaps Im not the only one that does not know what you are doing. 😉

  • Could you post the example of the correlated subquery used as a derived table?  I thought derived tables are not evaluated dependently...  The only time I've seen a correlation work in a derived table was when it was self-contained; i.e. no tables outside the derived table were referenced. 



    --Jonathan

  • I took a look at the execution plans for the example code included above.  They both use parallelism, but the second one (with the problem) is more complicated and has a "Nested Loops/Inner Join" step which the first query does not.  I'm no expert when it comes to plans - guess it's time to find a good reference, if there is one.  Recommendations?   

    *** I'll put the SHOWPLAN output in the next post because of it's ugly format. ***

    Below is the original query that led me to experiment with the above.  It does work - the correlated derived table is in the on clause, so I guess different behavior is expected.  I added an index on Calendar_time and pre-calculated datetimes (no cast or dateadd) which reduced the query time from 30 to 2 seconds. 

    In the final version, I got rid of the correlation and used nested derived tables to select one calendar_sid for each defendant.  I also moved two inner joins to the WHERE as EXISTS clauses.  The result now takes 10ms. 

    DECLARE @StartDate datetime

    SET @StartDate = '2004-01-23'

    -- Proc code follows

    DECLARE @StringStartDate varchar(10)

    SET @StringStartDate = CONVERT(varchar(10), @StartDate, 101)

    SELECT DISTINCT

        n.formatted_name,

        d.booking_number,

        d.booking_date,

        dbo.fn_ListReferredCharges(d.defendant_sid) as [Booking Charges],

        dbo.fn_ListFiledCharges(d.defendant_sid) as [Filed Charges],

        fc.calendar_time,

        fc.hearing_description,

        s.status_name,

        c.RA_case_numbers,

        d.defendant_sid

    FROM dbo.vw_defendant_unrestricted d

    LEFT OUTER JOIN dbo.vw_person_name_unrestricted n ON

        n.name_sid = d.name_sid

    LEFT OUTER JOIN dbo.vw_cases_unrestricted c ON

        c.case_sid = d.case_sid

    LEFT OUTER JOIN dbo.ref_case_status s ON

        s.status_id = d.defendant_status

    LEFT OUTER JOIN dbo.case_calendar fc ON

        fc.defendant_sid = d.defendant_sid AND

        fc.calendar_time = (

            SELECT MIN(calendar_time)

      FROM (

                SELECT calendar_time

       FROM case_calendar

       WHERE defendant_sid = d.defendant_sid AND

                    calendar_time > cast((@StringStartDate + ' 1:30 PM') as datetime)

      &nbsp as a

     &nbsp

    INNER JOIN dbo.case_calendar cc ON

        cc.defendant_sid = d.defendant_sid AND

        cc.hearing_type IN ('FAR', 'MAR') AND

        cc.court_department = 'M' AND

        DATEADD(mi, -810, cc.calendar_time) = @StringStartDate

    INNER JOIN dbo.case_activity ca ON

        ca.case_sid = c.case_sid AND

        ca.activity_id = 'INTK' AND

        ca.priority = 1

    WHERE d.custody_code = 'IC'

    RandyHelpdesk: Perhaps Im not the only one that does not know what you are doing. 😉

  • StmtText                                                                                                                                                                                                                                                                                                                                                                       

    -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

    SELECT c.defendant_sid,

        count(*) as [count],

        CAST(MIN(c.calendar_time) as char(20)) as [MIN(calendar_time)],

        (   SELECT CAST(MIN(t.calendar_time) as char(20))

            FROM #temp t

            WHERE t.defendant_sid = c.defendant_sid

            ) as [Min2]

    FROM #temp c

    WHERE c.defendant_sid < 10

    GROUP BY c.defendant_sid

    ORDER BY c.defendant_sid

    (1 row(s) affected)

    StmtText                                                                                                                                                                                                                                                                                               

    -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

      |--Compute Scalar(DEFINE[Expr1003]=Convert([Expr1002]), [Expr1007]=Convert([Expr1005])))

           |--Stream Aggregate(GROUP BY[c].[defendant_sid]) DEFINE[Expr1005]=MIN([t].[calendar_time]), [Expr1001]=ANY([Expr1001]), [Expr1002]=ANY([Expr1002])))

                |--Sort(ORDER BY[c].[defendant_sid] ASC))

                     |--Parallelism(Gather Streams)

                          |--Hash Match(Left Outer Join, HASH[c].[defendant_sid])=([t].[defendant_sid]), RESIDUAL[t].[defendant_sid]=[c].[defendant_sid]))

                               |--Bitmap(HASH[c].[defendant_sid]), DEFINE[Bitmap1012]))

                               |    |--Compute Scalar(DEFINE[Expr1001]=Convert([globalagg1009])))

                               |         |--Stream Aggregate(GROUP BY[c].[defendant_sid]) DEFINE[globalagg1009]=SUM([partialagg1008]), [Expr1002]=MIN([partialagg1010])))

                               |              |--Parallelism(Repartition Streams, PARTITION COLUMNS[c].[defendant_sid]), ORDER BY[c].[defendant_sid] ASC))

                               |                   |--Stream Aggregate(GROUP BY[c].[defendant_sid]) DEFINE[partialagg1008]=Count(*), [partialagg1010]=MIN([c].[calendar_time])))

                               |                        |--Sort(ORDER BY[c].[defendant_sid] ASC))

                               |                             |--Clustered Index Scan(OBJECT[tempdb].[dbo].[#temp_______________________________________________________________________________________________________________00000007A633].[PK_case_calendar] AS [c]), WHERE[c].[defendant_sid]<10))

                               |--Parallelism(Repartition Streams, PARTITION COLUMNS[t].[defendant_sid]), WHEREPROBE([Bitmap1012])=TRUE))

                                    |--Clustered Index Scan(OBJECT[tempdb].[dbo].[#temp_______________________________________________________________________________________________________________00000007A633].[PK_case_calendar] AS [t]))

    (14 row(s) affected)

     

    StmtText                                                                                                                                                                                                                                                                                                                                                                                                                                               

    -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

    SELECT c.defendant_sid,

        count(*) as [count],

        CAST(MIN(c.calendar_time) as char(20)) as [MIN(calendar_time)],

        (SELECT CAST(MIN(a.calendar_time) as char(20))

            FROM (

                SELECT t.calendar_time

             FROM #temp t

             WHERE t.defendant_sid = c.defendant_sid

                ) as a

            ) as [Min2]

    FROM #temp c

    WHERE c.defendant_sid < 10

    GROUP BY c.defendant_sid

    ORDER BY c.defendant_sid

    (1 row(s) affected)

    StmtText                                                                                                                                                                                                                                                                      

    ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

      |--Compute Scalar(DEFINE[Expr1004]=Convert([Expr1002]), [Expr1008]=Convert([Expr1006])))

           |--Nested Loops(Inner Join)

                |--Parallelism(Gather Streams, ORDER BY[c].[defendant_sid] ASC))

                |    |--Sort(ORDER BY[c].[defendant_sid] ASC))

                |         |--Hash Match(Right Outer Join, HASH[t].[defendant_sid])=([c].[defendant_sid]), RESIDUAL[t].[defendant_sid]=[c].[defendant_sid]))

                |              |--Hash Match(Aggregate, HASH[t].[defendant_sid]), RESIDUAL[t].[defendant_sid]=[t].[defendant_sid]) DEFINE[Expr1006]=MIN([t].[calendar_time])))

                |              |    |--Parallelism(Repartition Streams, PARTITION COLUMNS[t].[defendant_sid]))

                |              |         |--Clustered Index Scan(OBJECT[tempdb].[dbo].[#temp_______________________________________________________________________________________________________________00000007A633].[PK_case_calendar] AS [t]))

                |              |--Parallelism(Repartition Streams, PARTITION COLUMNS[c].[defendant_sid]))

                |                   |--Clustered Index Scan(OBJECT[tempdb].[dbo].[#temp_______________________________________________________________________________________________________________00000007A633].[PK_case_calendar] AS [c]))

                |--Table Spool

                     |--Compute Scalar(DEFINE[Expr1001]=Convert([Expr1015])))

                          |--Stream Aggregate(GROUP BY[c].[defendant_sid]) DEFINE[Expr1015]=Count(*), [Expr1002]=MIN([c].[calendar_time])))

                               |--Sort(ORDER BY[c].[defendant_sid] ASC))

                                    |--Clustered Index Scan(OBJECT[tempdb].[dbo].[#temp_______________________________________________________________________________________________________________00000007A633].[PK_case_calendar] AS [c]), WHERE[c].[defendant_sid]<10))

    (15 row(s) affected)

    PS  I used cut and paste from QA.  I do not know why the "Sad" and "Wink" faces are displayed in the preview - I did not add them.  A bug? 

    RandyHelpdesk: Perhaps Im not the only one that does not know what you are doing. 😉

  • Okay, I see why that "correlated derived table" works.  It doesn't need to be a derived table, and the query optimizer automatically reduces it to a simple correlated subquery. 



    --Jonathan

  • Simply turn on Query|Show Execution Plan in QA, run both queries in the same batch, and then compare the relative times in the Execution Plan window.  Try my variant with the subquery joined, too.



    --Jonathan

  • Here it is one more time - my post gets lost from time to time....

    The developer simplified the problem by removing the derived table.  I think this is the best form. 

        LEFT OUTER JOIN dbo.case_calendar fc ON

            fc.calendar_sid = (

                SELECT TOP 1 calendar_sid

                FROM case_calendar

                WHERE defendant_sid = d.defendant_sid AND

                    calendar_time > @StringStartDateTime

                ORDER BY calendar_time

                )  

     

    RandyHelpdesk: Perhaps Im not the only one that does not know what you are doing. 😉

  • -- Query 1: Query Cost (relative to batch): 48.89%

    SELECT c.defendant_sid,

        count(*) as [count],

        CAST(MIN(c.calendar_time) as char(20)) as [MIN(calendar_time)],

        (   SELECT CAST(MIN(t.calendar_time) as char(20))

            FROM #temp t

            WHERE t.defendant_sid = c.defendant_sid

            ) as [Min2]

    FROM #temp c

    WHERE c.defendant_sid < 10

    GROUP BY c.defendant_sid

    ORDER BY c.defendant_sid

    -- Query 2: Query Cost (relative to batch): 51.11%

    SELECT c.defendant_sid,

        count(*) as [count],

        CAST(MIN(c.calendar_time) as char(20)) as [MIN(calendar_time)],

        CAST(x.xMin as char(20)) as [Min2]

    FROM #temp c JOIN

    (SELECT defendant_sid, MIN(calendar_time) xMin

     FROM #temp

     GROUP BY Defendant_sid) x ON

        x.Defendant_sid = c.Defendant_sid

    WHERE c.defendant_sid < 10

    GROUP BY c.defendant_sid, x.xMin

    ORDER BY c.defendant_sid

    Not much different. 

    RandyHelpdesk: Perhaps Im not the only one that does not know what you are doing. 😉

  • ...and thanks for the tip.  These queries are fast enough that comparing relative times in one batch via the execution plan window is nice. 

    RandyHelpdesk: Perhaps Im not the only one that does not know what you are doing. 😉

  • Here is another batch that includes the buggy result.  In the last query, the output has the same number of records as in #temp (776937) rather than 1.  This is despite the where and group by clauses. 

     

    -- Query 1: Query Cost (relative to batch): 0.02%

    SELECT c.defendant_sid,

        count(*) as [count],

        CAST(MIN(c.calendar_time) as char(20)) as [MIN(calendar_time)],

        (   SELECT CAST(MIN(t.calendar_time) as char(20))

            FROM #temp t

            WHERE t.defendant_sid = c.defendant_sid

            ) as [Min2]

    FROM #temp c

    WHERE c.defendant_sid < 10

    GROUP BY c.defendant_sid

    ORDER BY c.defendant_sid

    -- Query 2: Query Cost (relative to batch): 0.02%

    SELECT c.defendant_sid,

        count(*) as [count],

        CAST(MIN(c.calendar_time) as char(20)) as [MIN(calendar_time)],

        CAST(x.xMin as char(20)) as [Min2]

    FROM #temp c JOIN

    (SELECT defendant_sid, MIN(calendar_time) xMin

     FROM #temp

     GROUP BY Defendant_sid) x ON

        x.Defendant_sid = c.Defendant_sid

    WHERE c.defendant_sid < 10

    GROUP BY c.defendant_sid, x.xMin

    ORDER BY c.defendant_sid

    -- Query 3: Query Cost (relative to batch): 0.13%

    SELECT c.defendant_sid,

        count(*) as [count],

        CAST(MIN(c.calendar_time) as char(20)) as [MIN(calendar_time)],

        (   SELECT TOP 1 t.calendar_time

            FROM #temp t

            WHERE t.defendant_sid = c.defendant_sid

            ORDER BY t.calendar_time

            ) as [Min2]

    FROM #temp c

    WHERE c.defendant_sid < 10

    GROUP BY c.defendant_sid

    ORDER BY c.defendant_sid

    -- Query 4: Query Cost (relative to batch): 99.82%

    --

    -- (776937 row(s) affected)!!! -- should be 1

    SELECT c.defendant_sid,

        count(*) as [count],

        CAST(MIN(c.calendar_time) as char(20)) as [MIN(calendar_time)],

        (SELECT CAST(MIN(a.calendar_time) as char(20))

            FROM (

                SELECT t.calendar_time

             FROM #temp t

             WHERE t.defendant_sid = c.defendant_sid

                ) as a

            ) as [Min2]

    FROM #temp c

    WHERE c.defendant_sid < 10

    GROUP BY c.defendant_sid

    ORDER BY c.defendant_sid

     

    RandyHelpdesk: Perhaps Im not the only one that does not know what you are doing. 😉

Viewing 12 posts - 1 through 11 (of 11 total)

You must be logged in to reply to this topic. Login to reply