RE: Difference between 'hash match/inner join' and 'nested loops/inner join' in execution query plan?

  • TheSQLGuru (1/2/2008)


    A nested loop query plan is usually optimal when there is a small number of rows in one table (think tens to perhaps hundreds in most cases) that can be probed into another table that is either very small or has an index that allows a seek for each input row. One thing to watch out for here is when the optimizer THINKS there will be few rows (check the estimated rows in the popup for the estimated query plan graphic) but in reality there are LOTS of rows (thousands or even millions). This is one of the worst things that can happen to a query plan, and it is usually the result of out-of-date statistics, a cached query plan, or unevenly distributed data. Each of these can be addressed to minimize the likelihood of the problem.

    A hash-match plan is optimal when relatively few rows are joined to a relatively large number of rows. Gail's post explains the basics of the mechanism. It can get REALLY slow on machines with insufficient buffer RAM to hold the hash tables in memory, because they must then be spilled to disk at a HUGE relative cost.
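
    Before getting to my own problem: the estimated row count TheSQLGuru mentions can be compared against the actual count with the plan SET options. This is only a minimal sketch (the SELECT 1 is a placeholder for whatever query you're tuning): SET SHOWPLAN_XML ON returns the estimated plan without executing the statement, while SET STATISTICS XML ON executes it and returns the actual plan, so Estimated Rows and Actual Rows can be compared.

        SET SHOWPLAN_XML ON;
        GO
        SELECT 1;  -- placeholder query: nothing executes; the estimated plan XML is returned
        GO
        SET SHOWPLAN_XML OFF;
        GO

        SET STATISTICS XML ON;
        GO
        SELECT 1;  -- placeholder query: executes and returns the actual plan with actual row counts
        GO
        SET STATISTICS XML OFF;
        GO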

    I know this is an old thread, but it seems to address the problem I'm facing now. I'll preface this description by saying that I deal mostly in BI, so you won't hurt my feelings if you point out a fundamental error in my performance tuning logic 😉

    I've got a relational query run from SSRS that seems to be generating an inefficient execution plan. As part of the query, I'm joining an 18-million-row table to a 2,000-row table, and that join shows up as a Nested Loops operator in my execution plan. Statistics on both tables are up to date, I've cleared the plan cache, and the columns on which the tables are joined are both indexed and share the same data type (VARCHAR(7)).
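
    Roughly, the join looks like this (table, column, and index names are hypothetical; the real ones belong to the client):

        SELECT f.SaleAmount, d.ProductName
        FROM dbo.FactSales AS f               -- ~18 million rows
        INNER JOIN dbo.DimProduct AS d        -- ~2,000 rows
            ON d.ProductCode = f.ProductCode; -- both columns VARCHAR(7) and indexed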

    My research tells me that a nested loop is best suited to small data sets, and that a hash join might perform better here. I suspect the nested loop is being chosen because the estimated number of rows (1) is wildly different from the actual number of rows (2.5 million).
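
    To dig into that estimate, I've been inspecting the statistics histograms on the join columns, along the lines of this sketch (index names hypothetical, matching the example above):

        -- Show the histograms the optimizer uses for the join columns
        DBCC SHOW_STATISTICS ('dbo.FactSales', IX_FactSales_ProductCode) WITH HISTOGRAM;
        DBCC SHOW_STATISTICS ('dbo.DimProduct', IX_DimProduct_ProductCode) WITH HISTOGRAM;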

    So the question is: is there anything else I can do to help the query optimizer choose a different join strategy? And are there any suggestions for closing the gap between the estimated and actual row counts?
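
    For context, the workarounds I'm aware of look roughly like this (same hypothetical names as above), though I'd rather fix the estimate than hard-code a hint:

        -- Rebuild statistics from a full scan instead of a sample
        UPDATE STATISTICS dbo.FactSales WITH FULLSCAN;
        UPDATE STATISTICS dbo.DimProduct WITH FULLSCAN;

        -- Last resort: force the join algorithm for this query only
        SELECT f.SaleAmount, d.ProductName
        FROM dbo.FactSales AS f
        INNER HASH JOIN dbo.DimProduct AS d
            ON d.ProductCode = f.ProductCode
        OPTION (RECOMPILE); -- also bypasses any stale cached plan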

    By the way, I'm checking with my client to make sure they have no concerns about uploading their execution plan to this thread.

    Thanks,

    Tim

    Tim Mitchell, Microsoft Data Platform MVP
    Data Warehouse and ETL Consultant
    TimMitchell.net | @Tim_Mitchell | Tyleris.com
    ETL Best Practices