Understanding and Using APPLY (Part 2)

  • Hmmm.... replacing Having with Top 1...order by causes it to return Nulls for the max() for some rows -- not what I want. Here are the execution plans (couldn't see how to attach a document -- did I miss it?):

    Join:

    |--Merge Join(Inner Join, MANY-TO-MANY MERGE:([nco].[STATUS_ID])=(.[STATUS_ID]), RESIDUAL:([OR_NCCORE].[dbo].[ORDER_DIM].[STATUS_ID] as [nco].[STATUS_ID]=[OR_NCCORE].[dbo].[STATUS_DIM].[STATUS_ID] as .[STATUS_ID]))

    |--Sort(ORDER BY:([nco].[STATUS_ID] ASC))

    | |--Hash Match(Inner Join, HASH:([OR_NCCORE].[dbo].[ORDER_ITEM_DIM].[ORDER_ID])=([nco].[ORDER_ID]), RESIDUAL:([OR_NCCORE].[dbo].[ORDER_DIM].[ORDER_ID] as [nco].[ORDER_ID]=[OR_NCCORE].[dbo].[ORDER_ITEM_DIM].[ORDER_ID]))

    | |--Filter(WHERE:([Expr1010]=(1)))

    | | |--Sequence Project(DEFINE:([Expr1010]=row_number))

    | | |--Segment

    | | |--Sort(ORDER BY:([OR_NCCORE].[dbo].[ORDER_ITEM_DIM].[ORDER_ID] ASC, [OR_NCCORE].[dbo].[ORDER_ITEM_DIM].[ORDER_ITEM_ID] ASC))

    | | |--Table Scan(OBJECT:([OR_NCCORE].[dbo].[ORDER_ITEM_DIM]))

    | |--Table Scan(OBJECT:([OR_NCCORE].[dbo].[ORDER_DIM] AS [nco]))

    |--Sort(ORDER BY:(.[STATUS_ID] ASC))

    |--Table Scan(OBJECT:([OR_NCCORE].[dbo].[STATUS_DIM] AS ))

    Cross Apply:

    |--Nested Loops(Inner Join, OUTER REFERENCES:([nco].[ORDER_ID]))

    |--Hash Match(Inner Join, HASH:(.[STATUS_ID])=([nco].[STATUS_ID]), RESIDUAL:([OR_NCCORE].[dbo].[STATUS_DIM].[STATUS_ID] as .[STATUS_ID]=[OR_NCCORE].[dbo].[ORDER_DIM].[STATUS_ID] as [nco].[STATUS_ID]))

    | |--Table Scan(OBJECT:([OR_NCCORE].[dbo].[STATUS_DIM] AS ))

    | |--Table Scan(OBJECT:([OR_NCCORE].[dbo].[ORDER_DIM] AS [nco]))

    |--Filter(WHERE:([Expr1010] IS NOT NULL))

    |--Stream Aggregate(DEFINE:([Expr1010]=MAX([OR_NCCORE].[dbo].[ORDER_ITEM_DIM].[ORDER_ITEM_ID])))

    |--Index Spool(SEEK:([OR_NCCORE].[dbo].[ORDER_ITEM_DIM].[ORDER_ID]=[OR_NCCORE].[dbo].[ORDER_DIM].[ORDER_ID] as [nco].[ORDER_ID]))

    |--Index Scan(OBJECT:([OR_NCCORE].[dbo].[ORDER_ITEM_DIM].[IX_ORDER_ITEM_ID_with_Includes]))

  • b5535084 (1/30/2012)


    Hmmm.... replacing Having with Top 1...order by causes it to return Nulls for the max() for some rows -- not what I want. Here are the execution plans (couldn't see how to attach a document -- did I miss it?)

    Yes, when you are adding a reply, there is an 'Edit Attachments' button in the Post Options section. If it is convenient, please also give the CREATE TABLE and CREATE INDEX statements for the tables involved. The more information you can provide, the more comprehensive an answer I'll be able to give you.

  • OK , I've attached the two execution plans. Recall that I do a similar query two ways: 1 with join and one with cross apply. The produce identical results. I'm trying to understand why the cross apply version is more expensive and if the latter can be rewritten to do better or if join is the best choice for this application.

    with join:

    select nco.ORDER_ID, nci.ORDER_ITEM_ID

    from OR_NCCore.dbo.Order_Dim nco

    INNER JOIN or_nccore.dbo.STATUS_DIM S

    ON nco.STATUS_ID = S.STATUS_ID

    inner join (

    select *, rn=row_number()

    over (partition by order_id order by order_item_id asc)

    from or_nccore.dbo.ORDER_ITEM_DIM

    ) nci

    on nco.order_id=nci.order_id and rn=1

    Stats:

    (70108 row(s) affected)

    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Table 'ORDER_DIM'. Scan count 1, logical reads 4128, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Table 'ORDER_ITEM_DIM'. Scan count 1, logical reads 1417, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Table 'STATUS_DIM'. Scan count 1, logical reads 10, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:

    CPU time = 391 ms, elapsed time = 1075 ms

    with cross apply:

    select nco.ORDER_ID, nci.ORDER_ITEM_ID

    From OR_NCCore.dbo.Order_Dim nco

    INNER JOIN or_nccore.dbo.STATUS_DIM S

    ON nco.STATUS_ID = S.STATUS_ID

    cross apply (

    select top 1 *, rn=row_number()

    over (partition by order_id order by order_item_id asc)

    from or_nccore.dbo.ORDER_ITEM_DIM nci

    where nco.order_id=nci.order_id

    ) nci

    stats:

    (70108 row(s) affected)

    Table 'Worktable'. Scan count 70147, logical reads 448162, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Table 'ORDER_ITEM_DIM'. Scan count 1, logical reads 1417, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Table 'ORDER_DIM'. Scan count 1, logical reads 4128, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Table 'STATUS_DIM'. Scan count 1, logical reads 10, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:

    CPU time = 828 ms, elapsed time = 1080 ms.

  • If those two queries produce identical results, it is just luck. Your original queries had (rn = 1) ordered by order_item_id ascending for the JOIN option (which would give the row with the lowest order_item_id). The APPLY form used MAX(order_item_id) which will produce the highest order_item_id of course. Your new query just uses TOP (1) without an ORDER BY - so which row you get is down to chance. The ROW_NUMBER in the apply does nothing, by the way. To answer your implied question about the worktable reads - these are related to the on-the-fly index SQL Server is creating for you each time the query runs (the Eager Index Spool in the query plan). This is an indication that you are missing a real index that would be helpful to the APPLY query. Based on the information provided, this is how I would write the APPLY query, and includes the definition of the index you should create to get best performance from it:

    SELECT

    nco.ORDER_ID,

    nci.ORDER_ITEM_ID

    FROM OR_NCCore.dbo.Order_Dim AS nco

    JOIN OR_NCCore.dbo.Status_Dim AS s

    ON nco.STATUS_ID = S.STATUS_ID

    CROSS APPLY

    (

    SELECT TOP (1)

    nci.ORDER_ITEM_ID

    FROM OR_NCCore.dbo.ORDER_ITEM_DIM AS nci

    WHERE

    nci.ORDER_ID = nco.ORDER_ID

    ORDER BY

    nci.ORDER_ITEM_ID ASC

    ) AS nci;

    -- Optimal index for this query:

    CREATE INDEX x

    ON OR_NCCore.dbo.ORDER_ITEM_DIM

    (ORDER_ID, ORDER_ITEM_ID);

  • I took a look at the plans and what confused me is the apparent lack of clustered indexes (i noticed table scans instead of clustered index scans)!

    Could it be that these are heaps with nonclustered indexes applied to them?

    If so, that would create lots of perfomance issues and limit the optimizer.

    Each table, except for very few exceptions should have a primarky key, and it should be clustered.

  • peter-757102 (1/31/2012)


    I took a look at the plans and what confused me is the apparent lack of clustered indexes (i noticed table scans instead of clustered index scans)!

    The table names made me think these are fact and dimension tables in a data warehouse. It is not uncommon to use heaps in that scenario.

  • SQL Kiwi (1/31/2012)


    peter-757102 (1/31/2012)


    I took a look at the plans and what confused me is the apparent lack of clustered indexes (i noticed table scans instead of clustered index scans)!

    The table names made me think these are fact and dimension tables in a data warehouse. It is not uncommon to use heaps in that scenario.

    Can you explain (or have a link that does a good job at it), why one would want to go that route over clustered storage in that scenario?

    The one thing that comes to my mind is reduces storage for simple entities, but when you start to add indexes and query it like it is relational, that small 'advantage' evaporates real fast.

  • Thanks for the insights! I have a question about your comment about the rn doing nothing though. My hope was that it would force the windowing so that the "top 1" would be in the window. Or, must I reference rn directly to make that happen?

  • b5535084 (1/31/2012)


    Thanks for the insights! I have a question about your comment about the rn doing nothing though. My hope was that it would force the windowing so that the "top 1" would be in the window. Or, must I reference rn directly to make that happen?

    The optimizer removes it entirely because the result of the ROW_NUMBER isn't used for anything. Even if you did reference it, the ORDER BY in the ROW_NUMBER only defines how rows are ordered for numbering. In principle, the TOP could still return rows in some other order - the point is there is no guarantee, so you should not rely on any apparently-reproducible observed behaviour.

  • peter-757102 (1/31/2012)


    Can you explain (or have a link that does a good job at it), why one would want to go that route over clustered storage in that scenario?

    Not offhand, no. It's not something I have looked at for years - but I am aware that some DW methodologies advocate heaps. It may have something to do with requiring sorted order for no apparent benefit, but we're on the ragged edge of my recollection here.

  • b5535084 (1/31/2012)


    Thanks for the insights! I have a question about your comment about the rn doing nothing though. My hope was that it would force the windowing so that the "top 1" would be in the window. Or, must I reference rn directly to make that happen?

    I often wished for a ransomization (on/off) setting, that forces deliberately anti-ordered results and expression execution order to quickly identify hidden false assumptions in queries.

  • SQL Kiwi (1/31/2012)


    peter-757102 (1/31/2012)


    Can you explain (or have a link that does a good job at it), why one would want to go that route over clustered storage in that scenario?

    Not offhand, no. It's not something I have looked at for years - but I am aware that some DW methodologies advocate heaps. It may have something to do with requiring sorted order for no apparent benefit, but we're on the ragged edge of my recollection here.

    I did check some on the net myself (long live Google)....and this is what I get from it:

    1. Nonclustered indexes on heaps have direct "pointers" to their respective record data in the heap, avoiding a secondary scan trough a clustered index. On large tables that are heavily read and where records are never changed, this could be a big speed factor.

    2. It is easier to append new data and use many different search patterns using multiple indexes (a clustered index is only optimal on one, the rest becomes expensive).

    It might add flexibility in segregating inactive and active data too i think. Keeping the active set small, makes things fast ofcourse.

    Some testing on my side is needed to figure out when to use and when not too and what problems I encounter.

    Currenly all cubes I worked with are relativly easy managable in size, around 110 GB for the biggest, but with very limited access pattens, so clustering worked fine there. In other scenarios with much smaller datasets I also use clustereing, supported with one index in some cases and that worked fine too. But as usual, as the dataset grows, issues and wrong choices become more aparent.

  • OK -- I get that now. So I added:

    where rn=1 (so it would be included)

    and removed the top 1, so I now looks like:

    select nco.*, nci.ORDER_ITEM_ID o2

    From OR_NCCore.dbo.Order_Dim nco

    INNER JOIN or_nccore.dbo.STATUS_DIM S

    ON nco.STATUS_ID = S.STATUS_ID

    cross apply (

    select *,rn=row_number()

    over (partition by order_id order by order_item_id asc)

    from or_nccore.dbo.ORDER_ITEM_DIM nci

    where nco.order_id=nci.order_id

    ) nci

    where rn=1

    The inner join still beats it by a wide margin on I/O, even after I added the index you suggested (thanks for that!) It's also a pity that I can't put the where clause inside the subquery for the cross apply, but such is life!

  • b5535084 (1/31/2012)


    OK -- I get that now. So I added:

    where rn=1 (so it would be included)

    and removed the top 1, so I now looks like:

    select nco.*, nci.ORDER_ITEM_ID o2

    From OR_NCCore.dbo.Order_Dim nco

    INNER JOIN or_nccore.dbo.STATUS_DIM S

    ON nco.STATUS_ID = S.STATUS_ID

    cross apply (

    select *,rn=row_number()

    over (partition by order_id order by order_item_id asc)

    from or_nccore.dbo.ORDER_ITEM_DIM nci

    where nco.order_id=nci.order_id

    ) nci

    where rn=1

    The inner join still beats it by a wide margin on I/O, even after I added the index you suggested (thanks for that!) It's also a pity that I can't put the where clause inside the subquery for the cross apply, but such is life!

    I'm confused. Why can't you use:

    SELECT

    nco.*,

    nci.ORDER_ITEM_ID AS o2

    FROM OR_NCCore.dbo.Order_Dim AS nco

    JOIN OR_NCCore.dbo.Status_Dim AS s

    ON nco.STATUS_ID = S.STATUS_ID

    CROSS APPLY

    (

    SELECT TOP (1)

    nci.ORDER_ITEM_ID

    FROM OR_NCCore.dbo.ORDER_ITEM_DIM AS nci

    WHERE

    nci.ORDER_ID = nco.ORDER_ID

    ORDER BY

    nci.ORDER_ITEM_ID ASC

    ) AS nci;

    ...? I would expect this to perform well with the index. I'm struggling to understand the need for the ROW_NUMBER. Perhaps an execution plan and statistics results for my query above?

  • peter-757102 (1/31/2012)


    I did check some on the net myself (long live Google)....and this is what I get from it:

    I guess another consideration might be what to use as the clustering key, how wide that would be (which would affect nc indexes too), and whether having to sort new data in order of the cluster would be expensive or not.

Viewing 15 posts - 31 through 45 (of 51 total)

You must be logged in to reply to this topic. Login to reply