RE: Join Operations – Hash Match

SSCertifiable

Points: 6355

May 18, 2016 at 8:19 am

simon.dobner 32307 (5/4/2014)
OK, Thanks, got it now.
'Input' is not necessarily the same as 'table in the from clause'
Regards
Simon

This is a great article, and I think this question in the comments adds a lot of value for people new to query optimisation. SQL is supposed to be a declarative language; you tell it what output you want, not how to execute. It *shouldn't* matter what order you put the tables in your join clause, and doesn't, until a certain point where you have lots of tables and other complexities to be considered which leads to lots of possible combination ways that the SQL query engine can re-arrange your query to determine the best possible plan.

The number of possible plans goes up exponentially with the number of tables and other complexities, so there is a point where the query engine gives up searching for the optimal plan and just goes with the best it could find in that limited time period. If it didn't do that the it's possible that the search for the best plan could take a lot longer than just running a *good enough* plan. It could takes years.

SQL query plan optimisation is NP hard.

So what do you do? Well if you see the query plan using a larger table as the hash table on top to hash join to a smaller table then perhaps rearranging the tables ordering in the join clause will help. Maybe, but there's so many factors that go into generating a plan that you can't take table join order in isolation. The query engine certainly starts somewhere, and if it were possible to shortcut the checking of plans so that it finds an optimal one sooner then certainly worth trying. Easier said than done.

Whatever you do don't go using these join hints to force the type of join. There's very few occasions where you can outfox the query engine like that. What might be optimal for a certain state of data and certain predicates might be disastrous for another set. Data (state) changes. Programmers dream about stateless databases (that's my favourite bad joke ever :p )

Hash matches are good for unsorted data where one table is large and the other table is small enough to make a practical hash table. If down the track the data changes, or indexes change and then both tables are now similarly sized and pre-sorted then a merge join is likely optimal and if you've told it specifically to hash join then you've just missed out on a massive performance gain happening for you. (with lots of *it depends* caveats like caching of plans, parallelism, etc etc)