RE: Hash Match – SQLServerCentral

May 1, 2014 at 1:15 pm

Given the options: first, larger, smaller, or second input, I still say that 'smaller input' is closest to the correct answer.

On his MSDN blog, Craig Freedman goes into some detail about the hash join, and he states that the choice of which input is used for building the hash table is cost based, with a preference for the smaller of the two tables. He was a member of the query execution team that developed the scan, seek, and join operators in SQL Server 2005, so he must know a good deal about the internals of how the algorithm works. It wouldn't be the first time that the official documentation for an application doesn't correspond exactly with the technical implementation, especially when it comes to something like this.

"... Before a hash join begins execution, SQL Server tries to estimate how much memory it will need to build its hash table. We use the cardinality estimate for the size of the build input along with the expected average row size to estimate the memory requirement. To minimize the memory required by the hash join, we try to choose the smaller of the two tables as the build table ..."

http://blogs.msdn.com/b/craigfr/archive/2006/08/10/687630.aspx
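To make the quoted idea concrete, here is a rough Python sketch of a hash join that picks its build input from a memory estimate (row count times average row size). This is only a toy model, not SQL Server's actual implementation: the function names, the 208-byte average row size, and the use of the real row counts in place of optimizer cardinality estimates are all my own assumptions for illustration.

```python
from collections import defaultdict


def estimated_build_memory(row_count, avg_row_bytes):
    """Toy estimate: cardinality times average row size (hypothetical formula)."""
    return row_count * avg_row_bytes


def choose_build_input(left, right, avg_row_bytes=208):
    """Return (build, probe), preferring the input with the smaller estimate.

    avg_row_bytes is an assumed figure; both inputs share it here, so the
    comparison reduces to row count.
    """
    left_cost = estimated_build_memory(len(left), avg_row_bytes)
    right_cost = estimated_build_memory(len(right), avg_row_bytes)
    return (left, right) if left_cost <= right_cost else (right, left)


def hash_join(left, right, key):
    """Equi-join two lists of dicts; returns the chosen build input and matches."""
    build, probe = choose_build_input(left, right)

    # Build phase: hash every row of the build input on the join key.
    table = defaultdict(list)
    for row in build:
        table[row[key]].append(row)

    # Probe phase: look up each probe row in the hash table.
    matches = []
    for row in probe:
        for build_row in table.get(row[key], []):
            matches.append((build_row, row))
    return build, matches


# Stand-ins for a 1,000-row and a 10,000-row table (made-up key values).
t1 = [{"a": i * 2} for i in range(1000)]
t2 = [{"a": i * 3} for i in range(10000)]

build_a, rows_a = hash_join(t1, t2, "a")  # small table written first
build_b, rows_b = hash_join(t2, t1, "a")  # small table written second

print(build_a is t1, build_b is t1, len(rows_a) == len(rows_b))
```

In this sketch the smaller list ends up as the build input whichever order the arguments are passed in, because the choice is driven by the size estimate rather than by position.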

I've experimented using the sample tables T1 (1,000 rows) and T2 (10,000 rows) provided in the blog post above, and regardless of the left or right position of each in the JOIN clause or ON expression, SQL Server chooses the smaller T1 table for building the hash table.

"Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho