March 12, 2008 at 5:08 pm
I'm not going to include my code; those of you who want to test this out can do it on your own systems and see what you get.
Here is what I did.
I cast each column of our student master table as varbinary(max); using ISNULL, I replaced any NULLs with CAST('' AS varbinary(max)), concatenated all these fields, and used the result as input to the HashBytes function with the 'SHA1' algorithm.
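For anyone who wants to try it, here is only a minimal sketch of the kind of statement described above, not my actual query; the column names are made up, so substitute your own:

-- Hash every row by concatenating its columns as varbinary(max).
-- NULLs are replaced with an empty binary value so the concatenation never nulls out.
SELECT HASHBYTES('SHA1',
           ISNULL(CAST(student_id AS varbinary(max)), CAST('' AS varbinary(max)))
         + ISNULL(CAST(last_name  AS varbinary(max)), CAST('' AS varbinary(max)))
         + ISNULL(CAST(first_name AS varbinary(max)), CAST('' AS varbinary(max)))
         + ISNULL(CAST(birth_date AS varbinary(max)), CAST('' AS varbinary(max)))
       ) AS row_hash
FROM dbo.studemo;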
A select of only this HashBytes value returned 122,695 rows in 27 seconds.
I changed the select to a select distinct to determine whether there were any duplicates, and this select returned 122,695 rows in 11 seconds.
Comments, suggestions, ideas? To me, this is counterintuitive, since the server has to do more work to determine the distinct values before sending them to my client system.
:unsure:
March 12, 2008 at 5:46 pm
Can you post the query plans?
-- RBarryYoung, (302)375-0451, blog: MovingSQL.com, Twitter: @RBarryYoung
Proactive Performance Solutions, Inc. "Performance is our middle name."
March 12, 2008 at 8:00 pm
The server had to go to disk to get the data for your first query; by the second query the data was all cached in memory, so the reduced I/O time made it the faster query.
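One way to rule that in or out (on a test server only, since it flushes the data cache) is to clear the buffer pool between runs and see whether the first query is still the slow one:

-- Test systems only: force a cold cache before each timed run
CHECKPOINT;              -- flush dirty pages to disk
DBCC DROPCLEANBUFFERS;   -- drop the clean pages from the buffer pool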
Jack Corbett
Consultant - Straight Path Solutions
Check out these links on how to get faster and more accurate answers:
Forum Etiquette: How to post data/code on a forum to get the best help
Need an Answer? Actually, No ... You Need a Question
March 12, 2008 at 9:24 pm
Don't think so. If that were the case, the first query should have been quicker when run again. It was consistently 27 seconds.
Honest, all you wizards out there should try it and see if you get the same results.
The query plans will have to wait until tomorrow. I looked at them, and the select distinct is obviously doing more work, but that doesn't explain why it returns the same result set to the client (SSMS) faster than the straight select. Both have to transfer the same amount of data over the same size pipe.
😎
March 13, 2008 at 6:50 am
Here is the output from having SET STATISTICS IO ON:
SELECT -
Table 'studemo'. Scan count 1, logical reads 5674, physical reads 2, read-ahead reads 5717, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SELECT DISTINCT -
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'studemo'. Scan count 9, logical reads 6289, physical reads 9, read-ahead reads 5693, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
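For anyone reproducing this, those numbers come from turning on the session's I/O statistics before running the two queries, along these lines:

SET STATISTICS IO ON;    -- report logical/physical/read-ahead reads per table
SET STATISTICS TIME ON;  -- optional: report CPU and elapsed times as well
-- then run the plain SELECT and the SELECT DISTINCT and compare the messages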
I am working on getting the execution plans copied.
Edit: Uploaded the .sqlplan files in a .zip file.
😎
March 13, 2008 at 10:59 am
I think I have come up with an explanation. Part of it comes down to parallelism: the straight select has a degree of parallelism of 1, while the select distinct has a degree of parallelism of 8 (we are using a dual quad-core x64 system with 8 GB of RAM).
The straight select computes the HashBytes value for each row and streams the data to the client (SSMS) as it runs. The select distinct, however, has to compute the HashBytes value for every row and weed out the duplicates before sending the data to the client. Since each hash is only 20 bytes (the size of a SHA-1 hash), all of that fits in memory, and spread across all 8 cores it can be done very quickly before the results are sent.
Anyone buy this explanation? By the way, a select * on the table returned all the data from the table to SSMS in 16 seconds, and that is when I came up with the theory above.
😎
March 13, 2008 at 12:18 pm
That does make sense, and the query plans confirm it as well.
You could try setting MAXDOP to 1 and running them both again. If your theory is correct, the select distinct should take about 10-20% longer than the straight select.
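For instance, either a per-query hint or the server-wide setting would do (a sketch; remember to restore the server setting afterwards):

-- Option 1: a per-query hint, forcing a serial plan for just that statement
--   (append to the DISTINCT hash query): ... OPTION (MAXDOP 1);

-- Option 2: the server-wide setting
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max degree of parallelism', 1;
RECONFIGURE;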
-- RBarryYoung, (302)375-0451, blog: MovingSQL.com, Twitter: @RBarryYoung
Proactive Performance Solutions, Inc. "Performance is our middle name."
March 13, 2008 at 12:50 pm
I set MAXDOP to 1 on the server, reran the queries, and .........
the winner is .........
the select distinct took 30 seconds versus 27 seconds for the straight select!!!!
This goes back to those who say never to use DISTINCT in a subquery: if you have multiple processors, it may not have much effect either way. Still, the best thing to do is test, test, and test again! It also depends on how many rows are affected by the query; if you are returning MANY rows, you would probably want to leave off the DISTINCT.
By the way, I reset MAXDOP back to 0 on the server.
:cool::w00t::cool: