Hash Match

Question

Hash Match

Steve Jones - SSC Editor

SSC Guru

Points: 741373
More actions
April 30, 2014 at 11:47 pm

#396686

Comments posted to this topic are about the item Hash Match

Viewing 15 posts - 1 through 15 (of 35 total)

You must be logged in to reply to this topic. Login to reply

Raghavendra Mudugal SSChampion Points: 10658 More actions · Answer 1

Interesting one, thank you for the post, Steve.

(Glad to be a part of the Steve's 12, who got it right. Now at least I take a hint that I am not smart and I am not dumb either... well lets go with the latter :-D)

ww; Raghu
--
The first and the hardest SQL statement I have wrote- "select * from customers" - and I was happy and felt smart.

gbritton1 SSCertifiable Points: 6522 More actions · Answer 2

Before I answered this question, I read

It reads, in part:

The hash join has two inputs: the build input and probe input. The query optimizer assigns these roles so that the smaller of the two inputs is the build input.

then later:

The hash join first scans or computes the entire build input and then builds a hash table in memory.

I'm having trouble reconciling the official description with the "correct" answer.

bkmsmith Ten Centuries Points: 1132 More actions · Answer 3

gbritton1 (5/1/2014)
Before I answered this question, I read
It reads, in part:
The hash join has two inputs: the build input and probe input. The query optimizer assigns these roles so that the smaller of the two inputs is the build input.
then later:
The hash join first scans or computes the entire build input and then builds a hash table in memory.
I'm having trouble reconciling the official description with the "correct" answer.

+1

I followed the same reference in my investigation on this QOTD. Any insight on this seemingly discrepant definition would be appreciated.

Brian

sestell1 SSChampion Points: 10230 More actions · Answer 4

bkmsmith (5/1/2014)
gbritton1 (5/1/2014)
Before I answered this question, I read
It reads, in part:
The hash join has two inputs: the build input and probe input. The query optimizer assigns these roles so that the smaller of the two inputs is the build input.
then later:
The hash join first scans or computes the entire build input and then builds a hash table in memory.
I'm having trouble reconciling the official description with the "correct" answer.
+1
I followed the same reference in my investigation on this QOTD. Any insight on this seemingly discrepant definition would be appreciated.
Brian

As did I, after much digging to see what SQL Server defined as the 'build' input.

PHYData DBA SSCertifiable Points: 7541 More actions · Answer 5

See lots of discussion about why the right answer is...

Here is a reference to the right answer from TechNet BOL.

Looks like this answer to this question has not changed for at least 15 years...

http://technet.microsoft.com/en-us/library/aa237090(v=SQL.80).aspx

For any joins, use the first (top) input to build the hash table and the second (bottom) input to probe the hash table. Output matches (or nonmatches) as dictated by the join type. If multiple joins use the same join column, these operations are grouped into a hash team.

😎

gbritton1 SSCertifiable Points: 6522 More actions · Answer 6

So, we have conflicting opinions! I suspect that the old Technet article is incorrect, or at least no longer correct. It only makes sense that SQL would build the hash table from the smaller input to minimize work.

jackson.fabros Say Hey Kid Points: 674 More actions · Answer 7

still unclear what the correct answer is. my textbook also mentions that "the hash table is built from the smaller input"

PHYData DBA SSCertifiable Points: 7541 More actions · Answer 8

gbritton1 (5/1/2014)
So, we have conflicting opinions! I suspect that the old Technet article is incorrect, or at least no longer correct. It only makes sense that SQL would build the hash table from the smaller input to minimize work.

Most likely this is convergence not conflict.

Here is a 2008 R2 article that says the same thing as the old one.

This and the last link I posted are actually listed as part of the SQL 2014 BOL but where not changed.

http://technet.microsoft.com/en-US/library/ms189582(v=SQL.105).aspx

For any joins, use the first (top) input to build the hash table and the second (bottom) input to probe the hash table. Output matches (or nonmatches) as dictated by the join type. If multiple joins use the same join column, these operations are grouped into a hash team.

There are some sources that say the QA will try to choose the smaller set as the build input, but here is some truth from SQL 2012 BOL.

http://technet.microsoft.com/en-us/library/ms173815.aspx

When using the HASH Join hint:

If a join hint is specified for any two tables, the query optimizer automatically enforces the join order for all joined tables in the query, based on the position of the ON keywords

So the first table in the ON statement is used unless REMOTE table exists or is specified.

IMHO: Considering the Join Hint documentation for HASH, the Query Optimizer would have to change the join order so that the smaller set is first (like any good Query writing utility would do) to do what the other article says it is doing.

Note: I could not find anything in the BOL articles unique to SQL 2014 so we are left to guess that SQL 2014 BOL just has pointers to the old articles because the information has not changed.

PHYData DBA SSCertifiable Points: 7541 More actions · Answer 9

jackson.fabros (5/1/2014)
still unclear what the correct answer is. my textbook also mentions that "the hash table is built from the smaller input"

Is this exactly what your textbook says?

Let's hope not since the build input should be made from the smaller set and the probe input from the remainder in a HASH JOIN.

A "hash table" would be something else but not sure how it would have different inputs. :hehe:

There are even HASH JOIN where the build and probe inputs are the same thing...

TomThomson SSC Guru Points: 104768 More actions · Answer 10

gbritton1 (5/1/2014)
Before I answered this question, I read
It reads, in part:
The hash join has two inputs: the build input and probe input. The query optimizer assigns these roles so that the smaller of the two inputs is the build input.
then later:
The hash join first scans or computes the entire build input and then builds a hash table in memory.
I'm having trouble reconciling the official description with the "correct" answer.

Without looking anything up, I answered smallest (ie build) input first. When I saw the answer, I constructed a test (using SQL Server 2012) to see what happened. Given one table with 65537 rows and another with 393222 rows (the sizes of teh test tables I built - admittedly not very big, but big enough to refute a silly "correct" answer) I found that whether I wrote the join so that the small table was the first or second referenced in the select clause and whether the small table was teh first or second listed in teh from clause, the actual (not estimated but actual) execution plan took the smaller table as first (ie build) table.

So I don't just have trouble reconciling the official correct answer with the documentation, I have clear and solid experimantal evidence that the officially correct answer is just plain wrong.

Tom

PHYData DBA SSCertifiable Points: 7541 More actions · Answer 11

TomThomson (5/1/2014)
gbritton1 (5/1/2014)
Before I answered this question, I read
It reads, in part:
The hash join has two inputs: the build input and probe input. The query optimizer assigns these roles so that the smaller of the two inputs is the build input.
then later:
The hash join first scans or computes the entire build input and then builds a hash table in memory.
I'm having trouble reconciling the official description with the "correct" answer.
Without looking anything up, I answered smallest (ie build) input first. When I saw the answer, I constructed a test (using SQL Server 2012) to see what happened. Given one table with 65537 rows and another with 393222 rows (the sizes of teh test tables I built - admittedly not very big, but big enough to refute a silly "correct" answer) I found that whether I wrote the join so that the small table was the first or second referenced in the select clause and whether the small table was teh first or second listed in teh from clause, the actual (not estimated but actual) execution plan took the smaller table as first (ie build) table.
So I don't just have trouble reconciling the official correct answer with the documentation, I have clear and solid experimantal evidence that the officially correct answer is just plain wrong.

Tom - See my earlier post... there is no need for the answer to be wrong ( run test using HASH join hint) to match with your tests so far. When writing a hash join using the Hint you put the smaller to the right to make it the build table. Try you experiment with the Hash Join hint and see what happens.

Your experiment proves what the Query Optimizer does, but not how it is done.

Of course I would not know about any of this if I had not learned all of this by having to work with an Execution Plan that for some reason had chosen the Larger (2 million records) set as the Build list years ago.

TomThomson SSC Guru Points: 104768 More actions · Answer 12

PHYData DBA (5/1/2014)
TomThomson (5/1/2014)
gbritton1 (5/1/2014)
Before I answered this question, I read
It reads, in part:
The hash join has two inputs: the build input and probe input. The query optimizer assigns these roles so that the smaller of the two inputs is the build input.
then later:
The hash join first scans or computes the entire build input and then builds a hash table in memory.
I'm having trouble reconciling the official description with the "correct" answer.
Without looking anything up, I answered smallest (ie build) input first. When I saw the answer, I constructed a test (using SQL Server 2012) to see what happened. Given one table with 65537 rows and another with 393222 rows (the sizes of teh test tables I built - admittedly not very big, but big enough to refute a silly "correct" answer) I found that whether I wrote the join so that the small table was the first or second referenced in the select clause and whether the small table was teh first or second listed in teh from clause, the actual (not estimated but actual) execution plan took the smaller table as first (ie build) table.
So I don't just have trouble reconciling the official correct answer with the documentation, I have clear and solid experimantal evidence that the officially correct answer is just plain wrong.
Tom - See my earlier post... there is no need for the answer to be wrong ( run test using HASH join hint) to match with your tests so far. When writing a hash join using the Hint you put the smaller to the right to make it the build table. Try you experiment with the Hash Join hint and see what happens.
Your experiment proves what the Query Optimizer does, but not how it is done.
Of course I would not know about any of this if I had not learned all of this by having to work with an Execution Plan that for some reason had chosen the Larger (2 million records) set as the Build list years ago.

I suppose it depends on what the answer means by "first". If it means the "the table which ir processed first in order to build the the hash table against which rows from teh other table will be tested" then " that's first" is a pointless tautology. It's not at all clear to me that any sane person can take it as meaning that. If on the other hand it means that the table used as build table can be any of the tables involved, regardless of size, which is the only sensible interpretation of the words given that (a) "is the one that is used first used first?" is not a sensible question and (b) the statement "Size does not matter for this operator" in the explanation it seems clear both that the BOL documentation quoted by gbritton1 indicates that the "correct" answer is wrong fpr SQL 2008R2 (not a terribly interesting argument, given how often BOL gets it wrong, but in fact - see below - BOL didn't go wrong here) and that the experiment I undertook after seeing this crazy answer proved conclusively that the "correct" answer is wrong for SQL 2012. It may of course be different for SQL 2014, but there are several releases currently in full support and we can see that it is wrong for the majority of them (I ran the experiment for 2008 R2 as well before composing this reply, and this is not one of the places where BOL got it wrong).

So we have to work out what Jason (author of the referenced document) meant by "first(top)" and "second(bottom)". Jason doesn't usually get things wrong. I think he was referring to the layout of the operators in the query plan in the standard query plan display provided by SSMS - certainly that's what I thought he meant when I read it a month or two ago. There's no imaginable way that today's question could be interpreted so that "first" mant that, so I think that Steve has misunderstood what Jason was saying and produced a question and answer based on that misunderstanding.

Tom

PHYData DBA SSCertifiable Points: 7541 More actions · Answer 13

TomThomson (5/1/2014)
PHYData DBA (5/1/2014)
TomThomson (5/1/2014)
gbritton1 (5/1/2014)
Before I answered this question, I read
It reads, in part:
The hash join has two inputs: the build input and probe input. The query optimizer assigns these roles so that the smaller of the two inputs is the build input.
then later:
The hash join first scans or computes the entire build input and then builds a hash table in memory.
I'm having trouble reconciling the official description with the "correct" answer.
Without looking anything up, I answered smallest (ie build) input first. When I saw the answer, I constructed a test (using SQL Server 2012) to see what happened. Given one table with 65537 rows and another with 393222 rows (the sizes of teh test tables I built - admittedly not very big, but big enough to refute a silly "correct" answer) I found that whether I wrote the join so that the small table was the first or second referenced in the select clause and whether the small table was teh first or second listed in teh from clause, the actual (not estimated but actual) execution plan took the smaller table as first (ie build) table.
So I don't just have trouble reconciling the official correct answer with the documentation, I have clear and solid experimantal evidence that the officially correct answer is just plain wrong.
Tom - See my earlier post... there is no need for the answer to be wrong ( run test using HASH join hint) to match with your tests so far. When writing a hash join using the Hint you put the smaller to the right to make it the build table. Try you experiment with the Hash Join hint and see what happens.
Your experiment proves what the Query Optimizer does, but not how it is done.
Of course I would not know about any of this if I had not learned all of this by having to work with an Execution Plan that for some reason had chosen the Larger (2 million records) set as the Build list years ago.
I suppose it depends on what the answer means by "first". If it means the "the table which ir processed first in order to build the the hash table against which rows from teh other table will be tested" then " that's first" is a pointless tautology. It's not at all clear to me that any sane person can take it as meaning that. If on the other hand it means that the table used as build table can be any of the tables involved, regardless of size, which is the only sensible interpretation of the words given that (a) "is the one that is used first used first?" is not a sensible question and (b) the statement "Size does not matter for this operator" in the explanation it seems clear both that the BOL documentation quoted by gbritton1 indicates that the "correct" answer is wrong fpr SQL 2008R2 (not a terribly interesting argument, given how often BOL gets it wrong, but in fact - see below - BOL didn't go wrong here) and that the experiment I undertook after seeing this crazy answer proved conclusively that the "correct" answer is wrong for SQL 2012. It may of course be different for SQL 2014, but there are several releases currently in full support and we can see that it is wrong for the majority of them (I ran the experiment for 2008 R2 as well before composing this reply, and this is not one of the places where BOL got it wrong).
So we have to work out what Jason (author of the referenced document) meant by "first(top)" and "second(bottom)". Jason doesn't usually get things wrong. I think he was referring to the layout of the operators in the query plan in the standard query plan display provided by SSMS - certainly that's what I thought he meant when I read it a month or two ago. There's no imaginable way that today's question could be interpreted so that "first" mant that, so I think that Steve has misunderstood what Jason was saying and produced a question and answer based on that misunderstanding.

It seems you are once again ranting and/or trolling for something you are not going to find here.

Have fun with that.

I have read what you wrote, re-read my post, and read the BOL again.

Your post reads the least sane of all of them and is full of assumptions without reference, along with an experiment that only prove the Query Optimizer did it's job and changed how your select was executed.

Had you used the HASH Join hint you might have seen something different.

Maybe if I feel motivated I ask Kim or Paul at SQLskills.com how this is done.

Hrmmm.... reading your post above has actually unmotivated me to do any of this...

<un-subscribing now for same reason as a thousand others>

Koen Verbeeck SSC Guru Points: 259207 More actions · Answer 14

Answered incorrectly because I found some blogs online which said something different than the official answer.

Mystery mystery...

Need an answer? No, you need a question
My blog at https://sqlkover.com.
MCSE Business Intelligence - Microsoft Data Platform MVP