Recently I was faced with an interesting challenge that involved random numbers - to randomly select 1 in 5 invoices to be audited. My initial approach was to write a query with a column that contained a random number from 1 to 5 for each row in that table, and then to select the rows with a value of 1 for this new column. This should be easy I thought, and so I started by writing the following query.
CAST(5 * RAND() + 1 AS INT) AS Rnd --Generate random # between 1 and 5
This produced the following output.
Every time I ran the query it showed the same behavior, although with a different number. Exactly what was going on here? To find out, I simplified the query.
RAND() AS Rnd --Generate random # between 0 and 1
It produced the following outputs the first time:
and the second time:
Then, just for giggles, I modified the query again as follows.
RAND() AS Rnd, --Generate random # between 0 and 1
RAND() AS Rnd2 --Generate a second random # between 0 and 1
This query produced even stranger results.
InvoiceID........................... Rnd..... Rnd2
1E92C9B1-7C7D-40EB-A386-6535A3AA557C 0.961508 0.358211
ECC458E2-CA56-4FD7-B508-5708C367D578 0.961508 0.358211
14EA7295-5338-4C92-AB4E-04FA88ADF8AB 0.961508 0.358211
AF414580-D271-4E46-96CA-AA5E2C4A05FD 0.961508 0.358211
F2126FAE-F22F-4B21-AB98-0C50EE233498 0.961508 0.358211
2A151BE0-5215-4EB7-A315-CD403E00A51D 0.961508 0.358211
D0BA9F75-8C4D-43F0-B929-1CA6F9F61A1A 0.961508 0.358211
5EC94F38-BF00-41DF-89E6-C175E4E886D5 0.961508 0.358211
7C32ADC4-6119-4D13-9B0E-D0207D15A208 0.961508 0.358211
0383E4E2-86C1-4FBD-9280-582132A401B9 0.961508 0.358211
A few hours of reading and some deductive reasoning later I had worked out what was happening. The RAND() function, by default, uses the current date and time as its seed value. The problem is that the RAND() function re-seeds itself once for every row in the result set, and the query runs so quickly that the seed value for each row ends up the same.
The RAND() function does take an optional parameter, a seed value. But what seed value should I use. The first thing I tried to use was the Row_Number() function.
RAND(Row_Number() OVER (Order by InvoiceID)) AS Rnd
This produced a different number for each row, but it suffered from two new problems. The first is that the random numbers generated are very similar. The second problem is that every time the query is run, it produced the exact same results.
This whole thing was turning into a real head-scratcher. The problem was that I needed a random number with which to seed the random number generator - a veritable catch 22. Then I noticed the primary key values - they certainly looked unique. Could I use a uniqueidentifier as the seed for the random number generator?
I first tried to simply pass the newid() function as a parameter to the rand() function, but this only led to an operand type clash. What I needed was an easy way to convert a uniqueidentifier to an integer. The solution was to use the built in checksum function, as shown in the next example.
RAND(Checksum(Newid())) AS Rnd --Generate random # between 0 and 1
This did the trick.
The last step was to modify the original query with my new supercharged RAND() function.
Cast(5 * RAND(Checksum(Newid())) + 1 AS INT) AS Rnd
This finally produced the desired results.
In my production environment, I modified the final query so the user could select the percentage of invoices that should be audited. To accomplish this, I changed the final query to produce a number in the range of 1 to 100 and saved it as a view. Then I created a table value function that took a parameter to specify the percentage of rows to be selected. These are shown below.
CREATE VIEW dbo.vwInvoiceAudit AS
Cast(100 * RAND(Checksum(Newid())) + 1 AS INT) AS Rnd
CREATE FUNCTION dbo.fnInvoiceAudit(@pctToSelect INT)
WHERE Rnd <= @pctToSelect
This approach is easy to implement, easy to understand and quite flexible because it is set based, rather than procedural. It can be applied to a number of other scenarios - for example randomly assigning cases to auditors, or even selecting door prize winners. The only other lesson to come from this case study is that the SQL Server's implementation of the RAND() function has some severe limitations.