SQLServerCentral Article

How SQL Server Chooses the Type of Join


The problem

A stored procedure called from a .NET application sometimes runs in as little as 1-3 seconds, but at other times runs for as long as 1.5 minutes. The same stored procedure, run in Query Analyzer, always returns in the 1-3 second range. The procedure is called multiple times within a short window and collects data based on start dates and end dates. We can't reproduce the problem to the degree seen on the client side, but we are able to see the pattern described above. At the client site, they reach the .NET timeout of 3 minutes on two of the eight times the procedure is called, and those timeouts occur on the calls made with the largest date ranges.

Finding the Real Problem

So the first thing that jumps out is that this stored procedure fails to complete within the .NET timeout of 3 minutes on the two largest date ranges. However, when run in Query Analyzer, it finishes within 3 seconds.

Every time.

No hang-ups, no slowdowns, nothing. Even with date ranges bigger than any specified by our application, and returning as much data as possible, we can't come close to a three-minute timeout. The procedure still runs in a matter of seconds.

Our thoughts turned to the client environment and possible differences at their site that might explain what they were seeing. Were there other things running at the same time? Was something else blocking tables that we needed to get to? We tested this by again running the procedure in Query Analyzer, which did take longer than at our site. The indexes on the affected tables were rebuilt, and now the Query Analyzer test finished in a few seconds. Ahh! Problem solved. Rebuild the indexes and we should be fine. The client ran the procedure via the application again that night, and again it failed. Back to the drawing board.

At this point we needed to start over and gather facts. We decided to verify that the procedure ran slowly from the application and quickly from Query Analyzer, and it did. In fact, while the application's call was still outstanding, and while watching a trace of the server activity, we were able to run the same stored procedure from Query Analyzer and get results back before the application's call returned. We began to suspect something in the .NET communications layer, as that seemed to be the only difference. To verify this, we used ApexSQL Edit to run the procedure, since it connects to the database the same way our application does. That proved to be the jumping-off point that would eventually lead to the detection of the problem.

When we tested the procedure from ApexSQL Edit, it ran slowly, much like the application. We were now convinced it was something in the way .NET connected to the database server. We made a change to the stored procedure and recompiled it. Sure enough, it was much faster: 1-3 seconds, just like Query Analyzer. We then ran the application again, which didn't seem to improve much over the previous time, and when we ran the stored procedure from ApexSQL Edit once more, it was slow again. Can you imagine our level of confusion? After running several other tests, we became convinced something taking place within the application was causing the query to run slowly both during and after the application run.

So at this point, the facts were as follows:

  • The procedure, when run via the application, was slow: over a minute in some cases.
  • The same stored procedure run from ApexSQL Edit was also slow once the application had been run.
  • The stored procedure always ran quickly from SQL Server Query Analyzer.
  • The stored procedure ran quickly from ApexSQL Edit as long as it had just been recompiled.

So what was different about the procedure when it ran fast from ApexSQL Edit versus when it ran slow? We decided at this point to start digging into the query plans. Comparing them, we could see the fast plan was much shorter than the slow one; we just didn't know how or why. Eventually we discovered that when the procedure took a long time, its query plan used nested loop joins, whereas the fast plan used hash joins.

Why would it use nested loops sometimes and hash joins at other times? And if the hash joins were faster, why didn't it use them all of the time? Why would it ever use nested loops? According to the Microsoft documentation, the engine will choose nested loop joins when there is not a lot of data being joined, and hash joins when there is a large amount of data being joined. In our case we had both, so why didn't it choose the right one for the right situation?
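For reference, you can see which join operator the optimizer picks without actually running the query by turning on the estimated plan output in Query Analyzer. A minimal sketch follows; the table and column names are hypothetical, not from the procedure described in this article:

```sql
-- Show the estimated plan (including the join operator) without executing.
SET SHOWPLAN_TEXT ON
GO

-- With a narrow date range the optimizer tends to pick Nested Loops;
-- with a wide range it tends to pick a Hash Match (hash join).
SELECT o.OrderID, c.CustomerName
FROM   dbo.Orders o
JOIN   dbo.Customers c ON c.CustomerID = o.CustomerID
WHERE  o.OrderDate BETWEEN '2004-01-01' AND '2004-01-07'
GO

SET SHOWPLAN_TEXT OFF
GO
```

Running the same query with a much wider date range and comparing the two plan outputs shows the join choice flip from one operator to the other.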

The answer lies in one of the other things we mentioned earlier: the procedure ran just fine from ApexSQL Edit if it had been recompiled before it ran, but if the application ran first, ApexSQL Edit was slow. What was happening is that the procedure, when run from ApexSQL Edit, would reuse the query plan from the procedure cache if one was there. After a recompile there isn't one, so the optimizer built a plan suited to the larger data set, since we were testing the procedure with the widest date ranges and the largest data sets.

When the application ran, however, the first set of dates covered a small range and returned a relatively small data set. Based on those parameters, SQL Server chose a query plan using nested loops, which in and of itself is fine. The problem is that this plan was now in cache and would continue to tell the engine to use nested loop joins. The engine didn't know that the data set would change and grow; all it knew was that the application had called a stored procedure for which it had a query plan in cache. It reused that plan rather than compiling a new one, and the problems described above appeared.
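On SQL Server 2000 you can watch this reuse happen by looking at the plan cache directly: master.dbo.syscacheobjects lists cached plans, and usecounts shows how often each is being reused. A rough sketch, where the procedure name is hypothetical:

```sql
-- List cached plans for our procedure and how often each has been reused.
-- A climbing usecounts value while performance stays bad suggests the
-- engine is reusing the plan compiled for the first (small) date range.
SELECT cacheobjtype, objtype, usecounts, sql
FROM   master.dbo.syscacheobjects
WHERE  sql LIKE '%usp_GetOrderData%'

-- Flushing the whole cache (a test-server-only step) forces fresh compiles:
DBCC FREEPROCCACHE
```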

The Solution

The fix for this was relatively simple. We inserted a call to sp_recompile for the procedure so that it would be recompiled before each run. This allows a new query plan to be chosen based on the size of the current data set rather than simply reusing what is in cache. You might be thinking that compiling the same stored procedure over and over again adds overhead, and you would be right, which is why this is not a general best practice. However, in cases where you call the same stored procedure multiple times with parameters that drastically affect the size of the data set returned, you may see a huge gain in performance by recompiling each time before the stored procedure is run.
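The approach described above can be sketched as follows; the procedure and parameter names are hypothetical. Calling sp_recompile marks the procedure so that a fresh plan is compiled on its next execution; creating the procedure WITH RECOMPILE is an alternative that achieves the same net effect by never caching a plan for it at all:

```sql
-- Mark the procedure for recompilation so the next call builds a new plan
-- sized to the current parameters, instead of reusing the cached one.
EXEC sp_recompile N'dbo.usp_GetOrderData'
GO

-- Alternative: tell SQL Server never to cache a plan for this procedure.
CREATE PROCEDURE dbo.usp_GetOrderData
    @StartDate datetime,
    @EndDate   datetime
WITH RECOMPILE
AS
BEGIN
    SELECT o.OrderID, o.OrderDate, o.Amount
    FROM   dbo.Orders o
    WHERE  o.OrderDate BETWEEN @StartDate AND @EndDate
END
GO
```

Either way, the optimizer re-evaluates the parameters on each execution and is free to switch between nested loop and hash joins as the date range changes.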

Keep in mind this is true in any scenario in which SQL Server chooses a query plan for a stored procedure whose parameters can produce drastically different data set sizes from one call to the next.
