To those who don’t know me, “Natural Born Killers (The Series)” sounds like a very strange choice for a SQL Server blog title, and to be fair it is.
Those who do know me will know that I have been presenting a session this year titled “Natural Born Killers, performance issues to avoid” around the UK. The “Series” element of the title is there because I have decided to serialise the session for the benefit of those who have been unable to see it thus far.
The plan is to release one topic from the presentation each day of SQL Relay 2013. Today is the seventh day of SQL Relay, and this leg is being held in Reading. It’s the closest event to the user group that I run in Maidenhead, so I am hoping that this will help to encourage more people to attend and present at their local user group. I’m only going to be able to make the first half of this leg, so I guess it’s more of a knee than a leg for me, as I have to take a flight to Madrid for TechEd, where I will be representing Dell on their stand.
Today’s topic, table variables versus temporary tables, comes up on forums on almost a daily basis, so I thought it would make a great subject. The obvious answer to which one you should use will always be “it depends”. What I want to do is show you why it depends, so you will have the information you need to make the best decision. With that in mind, we’ll start with the major consideration points for using table variables.
Table variables are limited to the batch or procedure from which they are called. This means that, unlike global temporary tables, they cannot be referenced from other stored procedures in the same or other sessions. If you are using SQL Server 2008 or above, you also have the ability to use Table-Valued Parameters to pass records into a stored procedure.
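As a quick sketch of what a Table-Valued Parameter looks like (the type, procedure, and dbo.Orders table here are purely illustrative, not part of the demo databases used later in this series):

```sql
-- A user-defined table type acts as the parameter's "shape"
CREATE TYPE dbo.OrderIDList AS TABLE (OrderID INT NOT NULL PRIMARY KEY);
GO
-- TVPs must be declared READONLY inside the procedure
CREATE PROCEDURE dbo.GetOrders @IDs dbo.OrderIDList READONLY
AS
BEGIN
    SELECT o.OrderID, o.OrderDate
    FROM dbo.Orders o
    INNER JOIN @IDs i ON i.OrderID = o.OrderID;
END
GO
-- The caller fills the parameter just like any table variable
DECLARE @List dbo.OrderIDList;
INSERT INTO @List VALUES (1), (2), (3);
EXEC dbo.GetOrders @List;
```

The READONLY requirement is worth remembering: you cannot modify the contents of a TVP inside the procedure that receives it.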
Many developers prefer to use table variables because they are perceived to run much faster. One of the reasons for this is the myth that table variables reside in memory; I will provide more information on this later. Table variables also manage to shave time off by performing less logging than temporary tables and by not maintaining statistics.
There is a fantastic blog post called Temporary Table Caching Explained by Paul White which goes into great detail, showing how both temporary tables and table variables are logged using the fn_dblog function. I highly recommend that you put some time aside to read through it!
You may also want to read a blog I wrote some time back that shows a consequence of the logging levels, namely that the contents of a table variable will not roll back inside a transaction: now that’s scary. http://sql.richarddouglas.co.uk/archive/2011/06/rollback-gotchas-part-2-2.html
For me, this is a huge problem and the reason why I tend to shy away from table variables. Statistics play a huge part in deciding how a query is executed; by not having proper statistics for an object, you are really hampering the optimizer, and you should not be surprised when it comes back with a bad plan. I’ll be providing some examples of this later on in the post.
This may not be a problem for everyone, but I remember working on a project where I needed to dynamically create and alter some tables. If this is a requirement for you, then table variables are not an option (at least not at the time of writing this post).
Rather than reproducing work that others have done a great job on already, Gail Shaw wrote a blog on table variables covering this aspect which can be found here – http://sqlinthewild.co.za/index.php/2010/10/12/a-trio-of-table-variables/
I’ve heard many, many times that it is not possible to create an index on a table variable. You can, but I understand why people think you can’t. The reason is simple: if you use the CREATE INDEX syntax you will receive an error. What you need to do instead is create a unique constraint, and as we all know, a unique constraint is just a unique index behind the scenes.
At this point you might be thinking, how does that help me? I can’t make this attribute unique. Well, just create a surrogate primary key and build a composite unique constraint that includes it. This quick demo will show you how:
USE TEMPDB;
GO

DECLARE @TV TABLE
(TV_ID INT IDENTITY(1,1) PRIMARY KEY CLUSTERED
,TV_COL1 CHAR(10)
,TV_COL2 CHAR(10)
,UNIQUE (TV_COL1, TV_ID)
);

INSERT INTO @TV
SELECT REPLICATE('A',10), REPLICATE('B',10)
FROM sys.columns;

SELECT TV_COL2 FROM @TV WHERE TV_COL2 = 'BBBBBBBBBB';
SELECT TV_COL1 FROM @TV WHERE TV_COL1 = 'AAAAAAAAAA';
As you can see from the image of the actual execution plans below, our query against the non-indexed attribute TV_COL2 results in a clustered index scan. The next query, against our unique constraint, performs an index seek.
It’s worth noting again that table variables are really fast with small data sets, where the lack of statistics is not going to be a problem.
We’ve taken a quick look through some of the main points about table variables, now it’s time to turn our attention to temporary tables.
Woohoo! Temporary tables use statistics; that’s great. The only downside is that when we hit the statistics thresholds mentioned in Day 1, we hit a recompile event. This again goes to show that table variables are great for small data sets, as you won’t have the overhead of these recompile events. On the flip side, SQL Server now has a great mechanism for caching temporary tables, so using them is not as expensive as it once was.
The benefit of statistics is that the optimizer can create much better plans. If you are looking for consistency, this might be a much better option for you.
Temporary tables have a much wider scope than that of table variables. They can exist outside of a batch of code, they can even exist outside of a session with the use of a global temporary table which can be a very handy feature at times.
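The difference in scope can be sketched in a few lines (the table and column names here are made up for illustration):

```sql
-- #Local  : visible only to the creating session
--           (and to stored procedures it calls)
-- ##Global: visible to every session until the creating session
--           disconnects and no other session is still using it
CREATE TABLE ##SharedResults (ID INT, Note VARCHAR(50));
INSERT INTO ##SharedResults VALUES (1, 'visible to other sessions');

-- From a completely different connection you could now run:
-- SELECT ID, Note FROM ##SharedResults;

DROP TABLE ##SharedResults;
```

Just be aware that global temporary tables are a shared resource: two sessions both trying to create ##SharedResults will collide on the name.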
Ah, as a DBA I love consistency, and temporary tables provide it for me on this point. If I have a transaction and there is data inside my temporary table, this data will be rolled back should the transaction fail. There are of course times when it’s useful to store transient data outside of a transaction, and yes, table variables are great for that. Just make sure everyone on your team is aware of this fundamental difference in behaviour.
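You can prove this difference to yourself in a few lines; this quick sketch rolls back a transaction that inserted into both object types:

```sql
DECLARE @tv TABLE (i INT);
CREATE TABLE #tt (i INT);

BEGIN TRAN;
    INSERT INTO @tv VALUES (1);
    INSERT INTO #tt VALUES (1);
ROLLBACK;

-- The table variable keeps its row: modifications to table
-- variables are not undone by a transaction rollback
SELECT COUNT(*) AS TableVariableRows FROM @tv;

-- The temporary table is empty again: its insert rolled back
SELECT COUNT(*) AS TempTableRows FROM #tt;

DROP TABLE #tt;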
This is probably quite a minor consideration for a lot of people, but there are times, especially when you are playing with metadata, that you want to build and alter objects dynamically. I recall once building a report that dynamically created attributes built on year and month values inside another data set and having to manually pivot them. I really wish the client had been on 2005!
This is really all down to statistics and the optimization processes that can be used when the optimizer knows how much data is really inside the object.
So far this has probably come across as quite theoretical and dry. You all want to see some examples, and I’d hate to disappoint, so here you go:
To play along at home (or your office) you’ll need a copy of the TableDesign_Bad database that we created earlier in the series.
There’s not much to do in order to set up the demonstration; we’re going to connect to the correct database and turn on some statistical information. You will, however, need to turn on “Include Actual Execution Plan”.
USE TableDesign_Bad;
GO

SET STATISTICS IO ON;
SET STATISTICS TIME ON;

/*
--Base Query - (1,666,667 row(s) affected)
--
SELECT b.BetterDate, b.BetterID
FROM dbo.BetterDataTypes b
WHERE b.SurrogateID % 2 = 1
AND b.SurrogateID % 3 = 1
--AND b.SurrogateID <= 2000000
*/
The commented section is our base query which will be performed against both a table variable and a temporary table. If you are following along this should create a resultset of over 1.6 million rows.
This next section is going to declare a table variable and insert data into it using our base query. I will then select everything out of it and finally perform an inner join back to the source table.
--Create a Table Variable
DECLARE @TableVar TABLE
(SurrogateID INT
,BetterID UNIQUEIDENTIFIER
,BetterDate DATE
);

INSERT INTO @TableVar
SELECT b.SurrogateID, b.BetterID, b.BetterDate
FROM dbo.BetterDataTypes b
WHERE b.SurrogateID % 2 = 1
AND b.SurrogateID % 3 = 1
--AND b.SurrogateID <= 2000000

SELECT SurrogateID, BetterID, BetterDate
FROM @TableVar;

SELECT B.SurrogateID, B.BetterID, B.BetterDate
FROM @TableVar T
INNER JOIN dbo.BetterDataTypes B
ON B.SurrogateID = T.SurrogateID;
Let’s break this down and look at what has happened for each part of this script, starting with the INSERT statement.
We can see that 66036 logical reads were performed against our table. This should be the same for both approaches, let’s take a look at the query plan to see if that throws up any surprises.
Other than seeing the statistics for this table are completely shot, there is one other thing that worries me about this plan. That is the lack of parallelism. My instance is running on a VM with multiple processors and the cost of this operation is significantly higher than the cost threshold for parallelism and yet this is running under one thread.
Moving on to the next piece of code, we find that there were 6588 logical reads performed against our table variable.
Note that the name of the table variable isn’t the same as the declared name; as far as I am aware, they always seem to start with a # and are 9 characters in length.
The query plan for this statement is pretty straightforward, but it’s worth reiterating the huge disparity between the estimated number of rows and the actual number of rows. This will cause us some resource utilisation headaches, as you will see in the third query in this script.
So far I’ve always shown you the statistics followed by the plan. This time I’m going to switch things round and show the plan before the statistics.
Well, that doesn’t look too bad, does it? If you were to look at that on its own you might not be too concerned; it looks pretty straightforward. Fortunately we know better: we’ve already established that the statistics are wrong, we know how many records there are in the table variable, and we also know that this data originated from the table it is being joined against.
Because of the lack of proper statistics, the optimizer thought there would be one record and one corresponding lookup. It calculated that the optimal join in this scenario would be the nested loop operator. Unfortunately, there were actually over 1.6 million rows, and therefore we will be doing over 1.6 million lookups!
Let’s take a look at how many logical reads were performed now. Actually, wait: I think you had better brace for impact, put down your coffee, etc., before scrolling down…
As this blog is open for anyone to view, I’m not going to type any expletives. You will probably not be surprised that when I first saw just how many logical reads were performed against the BetterDataTypes table, a very high number of expletives were in fact expressed towards the display.
That’s right folks, 66036 logical reads against the table to put it into the table variable in the first place and a whopping 5000001 reads when the same data was joined back to the table variable.
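It is worth knowing one common mitigation here, which the original demo deliberately does not use: adding OPTION (RECOMPILE) to the join. This is a sketch only; a recompile lets the optimizer see the table variable’s actual row count at that point (though still no column-level statistics), which is usually enough to avoid the one-row nested loop plan, at the cost of compiling the statement on every execution.

```sql
-- Recompiling at execution time exposes the table variable's
-- actual cardinality to the optimizer
SELECT B.SurrogateID, B.BetterID, B.BetterDate
FROM @TableVar T
INNER JOIN dbo.BetterDataTypes B
ON B.SurrogateID = T.SurrogateID
OPTION (RECOMPILE);
```

Whether that recompile cost is acceptable depends on how often the statement runs, which is one more reason the answer remains “it depends”.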
Let’s do the same test again, but this time using a temp table.
--Create Temp Table
CREATE TABLE #TempTable
(SurrogateID INT
,BetterID UNIQUEIDENTIFIER
,BetterDate DATE
);

INSERT INTO #TempTable
SELECT b.SurrogateID, b.BetterID, b.BetterDate
FROM dbo.BetterDataTypes b
WHERE b.SurrogateID % 2 = 1
AND b.SurrogateID % 3 = 1
--AND b.SurrogateID <= 2000000;

SELECT SurrogateID, BetterID, BetterDate
FROM #TempTable;

SELECT B.SurrogateID, B.BetterID, B.BetterDate
FROM #TempTable T
INNER JOIN dbo.BetterDataTypes B
ON B.SurrogateID = T.SurrogateID;

DROP TABLE #TempTable;
After running this we can see that there were 66774 logical reads performed when inserting into our temporary table.
What’s interesting to me in the plan is that you can clearly see that when the same data is inserted into a temporary table, the optimizer chose to use parallelism.
The next two screenshots will be pretty similar to that of the table variable test.
As a side note if you are looking at this in your environment take a moment to look at the estimated cost of this step using the temporary table and that of the table variable. Why do you think there is such a difference?
The last step is where you can see a real difference between the two approaches for this data. Again we will start with the plan and then the stats.
The plan here is radically different; we’re making use of all kinds of optimizer goodness, including parallelism and a much more efficient join operator for the amount of data that we are using.
What effect has this more complex plan had on the logical reads?
Wow! That’s a massive difference. In the first iteration the BetterDataTypes table had 5000001 logical reads performed against it, now just 66290. That’s a significant reduction.
Those of you who read my post Improving performance by using SET STATISTICS IO will have seen that when a hash join operator is used, no logical reads are reported by STATISTICS IO. When I ran a profiler session against this query, it reported that 74443 reads were performed in total.
I hope that this post will help you to decide when each of these options is viable for your environment.