Efficiency of a "Most Recent for Each" join

  • I often perform reports which require joining the most recent of a many-to-one table joined back to the main. I've been using the below pattern for this type of query, but I was wondering if there is a better approach.

    Any feedback would be greatly appreciated.

    SELECT a.CustomerID, a.CustomerName, ..., b.CallDate, b.CallerID, b.CallDuration, ...

    FROM Customers a

    LEFT OUTER JOIN

    CustomerCalls b

    ON b.ID =

    (

    SELECT TOP 1 ID

    FROM CustomerCalls

    WHERE CustomerID = a.CustomerID

    ORDER BY CallDate DESC

    )

  • I would probably use a cte, something like this:

    ;

    with MostRecentCalls as (

    select CustomerID, max(ID) as callID

    from CustomerCalls

    group by CustomerID

    )

    select a.CustomerID, a.CustomerName, ..., C.CallDate, C.CallerID, C.CallDuration, ...

    FROM Customers a

    LEFT OUTER JOIN

    MostRecentCalls b on a.CustomerID = b.CustomerID

    JOIN

    CustomerCalls C on b.callID = C.ID

    The probability of survival is inversely proportional to the angle of arrival.

  • Craig has already posted some test scripts with different solutions and comparison in another topic.

    Check this out: http://www.sqlservercentral.com/Forums/FindPost1210550.aspx


    Alex Suprun

  • The CROSS APPLY solution is pretty interesting, I haven't messed with that type of query before.

    I did some timed tests using Jeff Modon's million row test table, which I modified to have a million row main table and a 50 million row sub table with pricing information.

    -- Build the test data

    IF OBJECT_ID('tempdb..#TempMain') IS NULL BEGIN

    SELECT TOP 1000000

    RowNum = IDENTITY(INT,1,1),

    SomeInt = ABS(CHECKSUM(NEWID()))%50000+1,

    SomeLetters2 = CHAR(ABS(CHECKSUM(NEWID()))%26+65)

    + CHAR(ABS(CHECKSUM(NEWID()))%26+65),

    SomeCSV = CAST('Part01,Part02,Part03,Part04,Part05,Part06,Part07,Part08,Part09,Part10' AS VARCHAR(80)),

    SomeMoney = CAST(ABS(CHECKSUM(NEWID()))%10000 /100.0 AS MONEY),

    SomeDate = CAST(RAND(CHECKSUM(NEWID()))*3653.0+36524.0 AS DATETIME),

    SomeHex12 = RIGHT(NEWID(),12)

    INTO #TempMain

    FROM Master.dbo.SysColumns t1,

    Master.dbo.SysColumns t2

    ALTER TABLE #TempMain

    ADD CONSTRAINT PK_TempMain_RowNum PRIMARY KEY CLUSTERED (RowNum)

    END

    IF OBJECT_ID('tempdb..#TempSub') IS NULL BEGIN

    SELECT TOP 50000000

    RowNum = IDENTITY(INT,1,1),

    Parent = ABS(CHECKSUM(NEWID()))%1000000+1,

    SomeMoney = CAST(ABS(CHECKSUM(NEWID()))%10000 /100.0 AS MONEY),

    SomeDate = CAST(RAND(CHECKSUM(NEWID()))*3653.0+36524.0 AS DATETIME)

    INTO #TempSub

    FROM Master.dbo.SysColumns t1,

    Master.dbo.SysColumns t2

    ALTER TABLE #TempSub

    ADD CONSTRAINT PK_TempSub_RowNum PRIMARY KEY CLUSTERED (RowNum)

    CREATE INDEX IX_TempSub_Parent ON #TempSub ( Parent, SomeDate ) INCLUDE ( RowNum )

    END

    I tried three different approaches:

    Method 1 is the pattern I've been using, doing a left join from the main table and using a subquery to get the id of the sub table.

    Method 2 uses a CTE to calculate the most recent date and then uses that join to both the main and sub table.

    Method 3 uses the OUTER APPLY approach (as opposed to a CROSS APPLY, to ensure we have a main record when nothing exists in the sub table.

    -- Throwaway variables.

    DECLARE @RowNum1 int, @SomeLetters2 char(2), @SomeCSV varchar(80), @RowNum2 int, @SomeMoney money, @SomeDate datetime

    -- Method 1

    SET STATISTICS TIME ON

    SELECT @RowNum1 = a.RowNum, @SomeLetters2 = a.SomeLetters2, @SomeCSV = a.SomeCSV, @RowNum2 = b.RowNum, @SomeMoney = b.SomeMoney, @SomeDate = b.SomeDate

    FROM #TempMain a

    LEFT OUTER JOIN

    #TempSub b

    ON b.RowNum =

    (

    SELECT TOP 1 RowNum

    FROM #TempSub

    WHERE Parent = a.RowNum

    ORDER BY SomeDate DESC

    )

    SET STATISTICS TIME OFF

    -- Method 2

    SET STATISTICS TIME ON

    ;WITH CTE AS (

    SELECT Parent, MAX(SomeDate) as MaxDate

    FROM #TempSub

    GROUP BY Parent

    )

    SELECT @RowNum1 = a.RowNum, @SomeLetters2 = a.SomeLetters2, @SomeCSV = a.SomeCSV, @RowNum2 = c.RowNum, @SomeMoney = c.SomeMoney, @SomeDate = c.SomeDate

    FROM #TempMain a

    LEFT OUTER JOIN

    CTE b ON a.RowNum = b.Parent

    LEFT OUTER JOIN

    #TempSub c ON a.RowNum = c.RowNum AND b.MaxDate = c.SomeDate

    SET STATISTICS TIME OFF

    -- Method 3

    SET STATISTICS TIME ON

    SELECT @RowNum1 = a.RowNum, @SomeLetters2 = a.SomeLetters2, @SomeCSV = a.SomeCSV, @RowNum2 = b.RowNum, @SomeMoney = b.SomeMoney, @SomeDate = b.SomeDate

    FROM #TempMain a

    CROSS APPLY

    (

    SELECT TOP 1 RowNum, SomeMoney, SomeDate

    FROM #TempSub

    WHERE Parent = a.RowNum

    ORDER BY SomeDate DESC

    ) b

    SET STATISTICS TIME OFF

    Results:

    SQL Server Execution Times:

    CPU time = 16005 ms, elapsed time = 1230 ms.

    SQL Server Execution Times:

    CPU time = 37159 ms, elapsed time = 3370 ms.

    SQL Server Execution Times:

    CPU time = 16285 ms, elapsed time = 1238 ms.

    Method 1 is pretty solid, though the OUTER APPLY in Method 3 is almost the same.

    Are there any other approaches worth giving a try? Or modification of the three queries I've tested above?

Viewing 4 posts - 1 through 3 (of 3 total)

You must be logged in to reply to this topic. Login to reply