While working on the Advent of Code problems in SQL, I ran across something interesting. Day 4 involves hashing, which is done with the HASHBYTES function in SQL Server. This is a computation and given the problem, there is no good way to do this without brute force. The problem says
- hash a specific string + an integer.
- If the leftmost digits are 0 (5 or 6 of them), stop
- increment the integer
Since a hash doesn’t lend itself to a pattern, you can’t start with 100,000 and determine if the integer you need is higher or lower. Instead you need to work through the integers.
I decided to try this with a tally table and hashing with TOP 1. BTW, TOP 1 makes a huge difference.
However, my structure was to query my tally table like this:
, HASHBYTES(‘MD5’, ‘iwrupvqb’ + CONVERT(VARCHAR(15), n))
This was in a second CTE, and in the main query I then use a WHERE clause to filter the list down to the entry with leading zeros. When I ran this, I noticed it was rather slow at first, at least, what I considered slow. I checked with a few other people that had solved the problem, and I found their times were faster than mine.
I wasn’t sure the brute force technique would benefit from a TOP clause, but I added a TOP 1 to the outer query. This made the entire process run much quicker, which is interesting. Apparently the filtering is collapsed across the tally table join with the hash computation and as soon as a valid match is found, this ends the calculations. My average went down by a factor of 10.
However, I wondered if moving the calculation to a join, with CROSS APPLY, would be quicker. I couldn’t imagine why, but I decided to try this. I moved the calcuation by changing the HASHBYTES calculation to a SELECT statement in a derived table for the CROSS APPLY and then taking the result of that as part of my column list. This changed my CTE to this:
CROSS APPLY (SELECT HASHBYTES(‘MD5’, ‘iwrupvqb’ + CONVERT(VARCHAR(15), n))) AS hb(hashvalue)
That resulted in a slightly faster query time. When I added a TOP to this, the times improved slightly from using HASHBYTES in the column list with a TOP. Intuitively this doens’t make sense, as it would seem the same number of function calls need to be completed, but the CROSS APPLY handles them a bit more efficiently. I’m sure someone has a much more in-depth understanding of the query optimizer here, and I won’t try to explain things myself. The times are close enough that I suspect some minor optimization from CROSS APPLY.
As a comparison, I also ran a brute force loop, with this code, that calculates the values sequentially until the result is determine. This should be equivalent to the results from TOP 1, and we find that they aren’t. The tally table solution with CROSS APPLY is much quicker.
DECLARE @t BIT = 1;
DECLARE @i INT = 0;
DECLARE @start DATETIME = GETDATE();
WHILE @t = 1
IF LEFT( CONVERT(VARCHAR(50), HASHBYTES(‘MD5’, ‘iwrupvqb’ + CAST(@i AS VARCHAR(10))), 2), 6) = ‘000000’
SELECT @t = 0
SELECT @i = @i + 1
–IF @i > 10000000
— SELECT @t = 0
SELECT starttime = @start
, seconds = DATEDIFF(SECOND, @start, GETDATE())
Here’s a summary of the code timings (averaged across 5 executions), for the second part of the puzzle, which looks for 6 leading zeros and has a result in the 9million range.
|Hashbytes in column list, no TOP||
|CROSS APPLY, no TOP||
|Hasbytes in columns list, TOP||
|CROSS APPLY with TOP||
|Brute Force, WHILE loop||
The conclusion I’d take here is that CROSS APPLY ought to be a tool you keep in the front of your toolbox and use when you must execute a function for each row of a set of tables. This is one of the T-SQL techniques that I never learned early in my career (it wasn’t available), and I haven’t used much outside of looking for execution plans, but it’s a join capability I will certainly look to use in the future.
However, if you are using UDFs instead of system functions, I’d certainly recommend you read Adam Machanic’s post on Scalar Functions and CROSS APPLY, and perhaps you can change to ITVFs and get some great performance gains.