Odd Results from an Indexed JOIN Column

  • Hello there! Starting off with some DDL...

    CREATE TABLE #AgentSession(
        [sessionId] [int] IDENTITY(1,1) NOT NULL,
        [dirSrvSessionGuid] [uniqueidentifier] NOT NULL,
        [serverId] [int] NOT NULL,
        [projectId] [int] NOT NULL,
        [loginDateTime] [datetime] NOT NULL,
        [logoutDateTime] [datetime] NULL,
        [userId] [int] NOT NULL,
        [testMode] [bit] NULL,
        CONSTRAINT [PK_AgentSession] PRIMARY KEY CLUSTERED
        (
            [sessionId] ASC
        ) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 95) ON [PRIMARY]
    ) ON [PRIMARY]

    CREATE NONCLUSTERED INDEX IX_LoginTime
        ON #AgentSession(loginDateTime, logoutDateTime)

    CREATE TABLE #AgentSessionConnectionTimes(
        [sessionId] [int] NOT NULL,
        [talkTime] [int] NULL,
        [waitingTime] [int] NULL,
        [pauseTime] [int] NULL,
        [reviewTime] [int] NULL,
        CONSTRAINT [PK_AgentSessionConnectionTimes] PRIMARY KEY CLUSTERED
        (
            [sessionId] ASC
        ) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 95) ON [PRIMARY]
    ) ON [PRIMARY]

    I've got a more general question on this pair of tables. I have a query that's like so:

    DECLARE @StartDate datetime = '4/17/2014 11:00:00'
    DECLARE @EndDate datetime = '4/18/2014 6:00:00'

    SELECT A.sessionId, SUM(talkTime) + SUM(pauseTime) + SUM(waitingTime)
    FROM #AgentSession A
    INNER JOIN #AgentSessionConnectionTimes B
        ON A.sessionId = B.sessionId
    WHERE loginDateTime >= @StartDate
        AND logoutDateTime <= @EndDate
    GROUP BY A.sessionId

    I would have thought that the PK (with its associated clustered index) on AgentSessionConnectionTimes' sessionId column would produce an index seek for this query; instead, it produces an index scan. This table has a bit more than 11 million rows, and the query takes about 8 seconds as a result.

    However, if I run this:

    SELECT A.sessionId, SUM(talkTime) + SUM(pauseTime) + SUM(waitingTime)
    FROM #AgentSession A
    INNER JOIN #AgentSessionConnectionTimes B WITH (FORCESEEK)
        ON A.sessionId = B.sessionId
    WHERE loginDateTime >= @StartDate
        AND logoutDateTime <= @EndDate
    GROUP BY A.sessionId

    The query hint forces a seek (rather badly), but performance does improve; the query takes about half the time. On the other hand, I can cause a seek in a much better manner, like so:

    DECLARE @SessionMin int =
        (SELECT MIN(sessionId)
         FROM #AgentSession
         WHERE loginDateTime >= @StartDate
           AND logoutDateTime <= @EndDate)

    SELECT A.sessionId, SUM(talkTime) + SUM(pauseTime) + SUM(waitingTime)
    FROM #AgentSession A
    INNER JOIN #AgentSessionConnectionTimes B
        ON A.sessionId = B.sessionId
    WHERE loginDateTime >= @StartDate
        AND logoutDateTime <= @EndDate
        AND B.sessionId >= @SessionMin
    GROUP BY A.sessionId

    This basically does what I thought the index on AgentSessionConnectionTimes, along with the INNER JOIN, would do: limit the range of sessionId values to only those present in the join. With this, the query runs in less than a second, and the execution time of the entire procedure this query lives in falls from 30 seconds to 5 seconds. Much better!

    Is there some reason the index on AgentSessionConnectionTimes isn't used unless I force it to be used in some way? This seems contrary to how I'd expect it to work, but there's probably something I'm missing here :-).

    Also, the statistics for each query, from SET STATISTICS TIME,IO ON:

    Base query:

    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table '#AgentSessionConnectionTimes'. Scan count 1, logical reads 4348, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table '#AgentSession'. Scan count 1, logical reads 58, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    FORCESEEK query:

    Table '#AgentSessionConnectionTimes'. Scan count 0, logical reads 4051, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table '#AgentSession'. Scan count 1, logical reads 59, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Variable query:

    Table '#AgentSession'. Scan count 1, logical reads 8144, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table '#AgentSessionConnectionTimes'. Scan count 1, logical reads 13, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table '#AgentSession'. Scan count 1, logical reads 51, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    I'm also attaching the actual query plans for each query. There's an oddity in these, as well; in the variable query plan, the estimated number of rows is quite high. I did a statistics update WITH FULLSCAN on both tables, but that didn't seem to budge the estimates. Not sure why that's the case.

    Thanks in advance for any information on this issue! I'd post some sample data to go with the tables, but the slowdown issue isn't apparent until I hit about 5 million rows, and that would be quite a bit of data :-D. If needed, though, I could package the INSERT statements into a text file and attach it, as well.

    - 😀

  • It's difficult to analyze this completely without a representative data set, but based on the tables, statistics, and execution plans you've provided, a few things stand out:

    1) Seek vs. Scan: The logical reads don't differ very much between the two, but the subtree cost for your query is actually higher for the seek than for the scan.

    2) Nested Loops vs. Hash Match: The hash match is more expensive than the nested loops when comparing the first 2 execution plans. The overall cost of the scan vs. seek decision is a function of the cost of each plus the join type.

    3) Even with an index available, the inequalities on loginDateTime and logoutDateTime are still expensive. If the query optimizer sees these inequalities and quickly determines that the number of rows to be checked is large enough, it will choose the scan, because the efficiency advantage of a seek over a scan shrinks as the percentage of rows that needs to be checked or returned grows.

    My first thought is statistics: are yours up to date? In the first two queries, the estimated and actual row counts are very different. That can't always be avoided, but it can be a sign that statistics need to be updated. By default, the statistics on a temp table will not be refreshed unless you run an UPDATE STATISTICS command in tempdb against the temp table in question. I was able to experiment locally with the schema you provided and could reproduce somewhat similar results by creating a large volume of dummy data, running your query, updating statistics, and then running it again. If messing with tempdb is not possible in your environment, it might be beneficial to create a permanent table in your database rather than a temp table, which would give you more control over statistics.
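
    For example, after the temp tables are populated, something along these lines would refresh them (a minimal sketch; FULLSCAN is just the most thorough option, and a sampled update may be enough):

    -- Explicitly refresh statistics on the temp tables after loading them
    -- (sketch only; run from the same session that created the tables).
    UPDATE STATISTICS #AgentSession WITH FULLSCAN;
    UPDATE STATISTICS #AgentSessionConnectionTimes WITH FULLSCAN;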

    If this is a very frequently executed query (i.e., one where a small saving per execution will net you immense improvements overall), consider experimenting with a covering index. The output list of the original query includes pauseTime, sessionId, talkTime, and waitingTime. In addition to the other decisions the optimizer needs to make, it also weighs the output columns. If many columns are being returned, it may opt for a scan instead of a seek, since it has to retrieve the base data anyway.
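
    For instance, something along these lines might be worth testing (the index name is just a placeholder, and whether it pays off depends on how heavily the table is written to):

    -- Hypothetical covering index: sessionId as the key, with the summed columns
    -- carried as INCLUDEd columns so the query can be answered from this index alone.
    CREATE NONCLUSTERED INDEX IX_AgentSessionConnectionTimes_Covering
        ON #AgentSessionConnectionTimes (sessionId)
        INCLUDE (talkTime, waitingTime, pauseTime);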

    Again, I am making some inferences here about where this schema comes from and how it is used, but let me know if you have more specifics that could help, or if this helped answer your question.

  • Thank you for the response, Ed!

    Some of the answers to your questions are in my original post, but it's quite messy, so no fault of yours! Here are the bits that concerned me, in summary:

    The statistics were freshly updated using WITH FULLSCAN just before these queries were run; if there's some sort of delay between the statistics being updated and the new statistics being put into use, that would certainly explain it. Otherwise, I'm stumped!

    The actual query doesn't use temp tables; this is probably a gaffe on my part. I wanted to make usable DDL for others to work with, but the tables are actually permanent ones.

    This query isn't executed too frequently, at least not yet; it's part of a report that's currently in testing, and it will be run about two dozen times a day once it's been approved and moved out of testing. That report usually takes about 45 seconds to a minute to run; after analyzing the code and noticing these scans, I did some testing, as there had been complaints about the speed.

    After adding in the bit with the variable, the execution time on the whole procedure dropped immensely, and it ran in 5 seconds or less, depending on how much data was present for the day being reported on.

    I should add that I cleared the cache several times and re-ran the query uncached until it was cached again, and the execution times returned to my previous findings.
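
    (By "cleared the cache" I mean roughly the usual pattern for this kind of timing test:)

    -- Rough sketch of the cache clearing between test runs (definitely not for production).
    CHECKPOINT;
    DBCC DROPCLEANBUFFERS;  -- flush clean pages so timings start from a cold buffer pool
    DBCC FREEPROCCACHE;     -- throw away cached plans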

    The really weird thing is that the seek does seem to produce a much lower runtime for the report, even though acquiring the variable's value produces more reads than running the query without it. I'd guess that pulling the 11 million rows in AgentSessionConnectionTimes and discarding the ones that won't be used is more costly than finding the minimum sessionId value for the query, but I don't know the specifics at that level.

    - 😀

  • It's an interesting case where the optimizer has two choices that look very similar. In situations like this, minor changes in the query, plan cache, or data could push it either way.

    Were you able to try a covering index on the query, and if so, did it help with the speed and execution plan?

    Also, can you confirm that the execution plan and read stats are the same for the temp table version and the normal version of the schema?

  • Haha, actually, this may be the result of an entirely different problem altogether :-P.

    It seems that, for whatever reason, I can't get the procedure containing this query to have a reusable plan. Every time I change the @StartDate and @EndDate parameters, a new plan is generated.

    I've hacked away at the coding of the procedure until it's literally nothing more than the query contained here, and still no luck. I'm kinda confused, but it gives me something to test on to see if I can get it fixed.

    Originally, I was doing my runs of the query with a clean cache, and I must've run the query on one day without the sessionid limitations, then run it again on the same day with the limiter, and then done some more testing with and without it on other days, which confused the issue.

    My fault on this one! I really should've been looking at the plan cache more closely. Hopefully more testing will eliminate the issue in the end.

    - 😀

  • That sheds a ton of light on the odd behavior!

    If you want plan reuse, try taking your SQL and either putting it into a stored procedure with the dates as parameters, or running it with sp_executesql as a parameterized dynamic SQL batch. This should give you better control over parameter sniffing and plan reuse, especially if you have any other parameters you want to pass in.
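
    Something along these lines (a minimal sketch; the table names come from your DDL, and the date filter is just illustrative):

    -- Parameterized dynamic SQL: the date literals stay out of the statement text,
    -- so the same plan can be reused across different date ranges.
    DECLARE @sql nvarchar(max) = N'
        SELECT A.sessionId, SUM(B.talkTime) + SUM(B.pauseTime) + SUM(B.waitingTime)
        FROM dbo.AgentSession A
        INNER JOIN dbo.AgentSessionConnectionTimes B
            ON A.sessionId = B.sessionId
        WHERE A.loginDateTime >= @StartDateValue
          AND A.logoutDateTime <= @EndDateValue
        GROUP BY A.sessionId;';

    EXEC sp_executesql @sql,
        N'@StartDateValue datetime, @EndDateValue datetime',
        @StartDateValue = @StartDate,
        @EndDateValue = @EndDate;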

    Introducing sp_executesql into your query and passing in @StartDate and @EndDate should allow plans to be reused consistently (I hope!).
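
    If you want to confirm the reuse, the plan cache's use counts will show it; a generic check looks something like this (nothing here is specific to your schema):

    -- Find the cached statement and see how many times its plan has been reused.
    SELECT cp.usecounts, cp.objtype, cp.cacheobjtype, st.text
    FROM sys.dm_exec_cached_plans AS cp
    CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) AS st
    WHERE st.text LIKE '%AgentSessionConnectionTimes%';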

    Good luck with this!

  • Oh, yep, that's part of what's confusing me currently :-P. That query's part of a stored procedure, and it uses dynamic SQL; however, @StartDate and @EndDate are input parameters for the procedure, and the dynamic SQL is parameterized using the input parameters.

    That's why the cache problem didn't even occur to me... It seems like the whole thing should've plopped right into the cache and gotten reused the whole time, but something's stopping the reuse. Didn't have much time to experiment with it yesterday, but I should have a good bit of time to tinker with it today. I'll post any interesting findings for the benefit of the readers, should I stumble upon something.

    EDIT: Just in case it's part of the problem, the dynamic SQL is basically done like so:

    DECLARE @sql nvarchar(max) = ''

    SELECT @sql = @sql +
        'SELECT (columns)
         FROM ['+column+'].dbo.(table)
         WHERE (conditions including @StartDateValue and @EndDateValue)'
        + CHAR(10)
    FROM #DatabaseList

    EXEC sp_executesql @sql,
        N'@StartDateValue datetime, @EndDateValue datetime',
        @StartDateValue = @StartDate,
        @EndDateValue = @EndDate

    Basically, I have a table that holds the system names and front-end names of our databases, and I'm using it to run a query against multiple databases.

    - 😀

  • Whew! Alright, posting my findings back as I stated earlier, just a good while later :-D. I went to SQL Saturday Atlanta just this last weekend, and saw Grant's amazing session on statistics. It gave me quite a few ideas to work with. First, the main problematic table is like so:

    CREATE TABLE [dbo].[AgentSessionConnectionTimes](
        [sessionId] [int] NOT NULL,
        [talkTime] [int] NULL,
        [waitingTime] [int] NULL,
        [pauseTime] [int] NULL,
        [reviewTime] [int] NULL,
        CONSTRAINT [PK_AgentSessionConnectionTimes] PRIMARY KEY CLUSTERED
        (
            [sessionId] ASC
        ) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 95) ON [PRIMARY]
    ) ON [PRIMARY]

    Now, the problem here is that I have a ton of INT values, and the sessionId column is basically an IDENTITY column. It's not an actual IDENTITY, but every time a sessionId (which is an INT IDENTITY in a different table) is created, it's also inserted into this table along with the associated time values.

    Since all I have are INT values, though, there's absolutely no selectivity on the table as far as the statistics are concerned. Using SHOW_STATISTICS on the table shows me that there are just two steps, with a RANGE_HI_KEY of 1 and... whatever the max sessionId is at the moment. This is clearly not very good for querying!
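
    For anyone who wants to check the same thing, the histogram can be pulled with something like this (the statistics object here is the one auto-created for the PK; yours may be named differently):

    -- Show just the histogram for the PK's statistics on the times table.
    DBCC SHOW_STATISTICS ('dbo.AgentSessionConnectionTimes', 'PK_AgentSessionConnectionTimes')
        WITH HISTOGRAM;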

    As such, I've instead created an indexed view out of the above table and the main table I'm using for my query criteria; by doing this, I've been able to get an actual, fleshed-out histogram for the indexed view.
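
    A simplified sketch of the kind of definition I mean (the column list and names here are illustrative rather than exactly what I deployed):

    -- Indexed view joining the session table to its connection times; materializing
    -- the join gives the view its own statistics on the indexed columns.
    CREATE VIEW dbo.vAgentSessionTimes
    WITH SCHEMABINDING
    AS
    SELECT A.sessionId, A.loginDateTime, A.logoutDateTime,
           B.talkTime, B.waitingTime, B.pauseTime
    FROM dbo.AgentSession AS A
    INNER JOIN dbo.AgentSessionConnectionTimes AS B
        ON A.sessionId = B.sessionId;
    GO

    -- The unique clustered index materializes the view...
    CREATE UNIQUE CLUSTERED INDEX IX_vAgentSessionTimes
        ON dbo.vAgentSessionTimes (sessionId);

    -- ...and a secondary index on the date columns gives a usable histogram
    -- for the report's date-range criteria.
    CREATE NONCLUSTERED INDEX IX_vAgentSessionTimes_LoginTime
        ON dbo.vAgentSessionTimes (loginDateTime, logoutDateTime);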

    Granted, I'm still not able to get the queries' plans to stay cached, for whatever reason; I'm still puzzled on that one. Despite this, execution times have improved very nicely; the uncached execution time is about 2-3 seconds, which is plenty acceptable (the procedure I'm running has to chug through about 80 databases and process a few hundred thousand to a million rows). The cached query plan has also fallen from several kilobytes in size to maybe 30-40 bytes on average.

    Still not a perfect solution, but this definitely fits the current needs. Probably the trickiest thing I've had to do in SQL so far, even if it doesn't seem like much. I'm feeling pretty happy about the results :-).

    - 😀

  • Thanks for the summary of everything you did! It's definitely good to have a working solution, even if there is still an unsolved mystery in there.

    I've run into many cases in the past where the query optimizer didn't behave as expected and I scratched my head for a while before figuring out why, or at least before putting together a solution that rendered the problem irrelevant.

    No better place to go than a SQL Saturday for good ideas, too : )
