The team I work on does nothing but data conversions. We convert data for new customers from their legacy system format into our database format.
In general, we get their data into a SQL Server database, then use a series of stored procedures to convert this data into the new customer’s production database. A typical project can convert anywhere from tens of thousands of records to 20-30 million records. The SQL code related to the project can run anywhere from 5000-25,000 lines of code, depending on the number of modules being converted. Much of the code can be reused as we go from project to project. Customers may be using either SQL Server 2000 or SQL Server 2005. We run these processes on servers in-house for development and testing, then run them on the customer’s servers. So the processes run on a large variety of hardware configurations.
When we started using SQL Server 2005, we noticed that occasionally a SELECT query would “hang”. (This would generally be one of the more complex queries containing multiple INNER and OUTER joins.) That is, a query expected to run within a couple of minutes might run for two or three hours without returning any data. Reviewing the data, indexes, server load, etc. revealed no particular problem. Running the query in SQL Server 2000 revealed no problem. We found that often, adding OPTION (MERGE JOIN) to the query would result in the query running as expected, even though the execution plan would show this to be a slower process.
Later in the process, another query might hang. Adding the OPTION (MERGE JOIN) seemed to resolve the problem more often than not.
Subsequent runs on the same hardware with the same data might cause queries to hang that had run just fine in a previous test. Likewise, some queries that ran faster after OPTION (MERGE JOIN) had been added will later hang and run better after OPTION (MERGE JOIN) is removed.
Our team is fairly proficient in tuning queries, and understands how to use indexes and execution plans. We can see that, according to the execution plan, the merge join we are forcing is not as optimal as the hash join SQL Server seems to want to use. So we are at a loss to explain this behavior. We also have not discovered any way to predetermine which queries may have the problem.
Has anyone else run into this, or does anyone have a plausible explanation as to why this may be happening? Any thoughts would be appreciated.