Random problems

Recently I was faced with an interesting challenge that involved random numbers - to randomly select 1 in 5 invoices to be audited. My initial approach was to write a query with a column that contained a random number from 1 to 5 for each row in that table, and then to select the rows with a value of 1 for this new column. This should be easy I thought, and so I started by writing the following query.

SELECT InvoiceID,
CAST(5 * RAND() + 1 AS INT) AS Rnd --Generate random # between 1 and 5
FROM Invoice

This produced the following output.

InvoiceID........................... Rnd
1E92C9B1-7C7D-40EB-A386-6535A3AA557C 2
ECC458E2-CA56-4FD7-B508-5708C367D578 2
14EA7295-5338-4C92-AB4E-04FA88ADF8AB 2
AF414580-D271-4E46-96CA-AA5E2C4A05FD 2
F2126FAE-F22F-4B21-AB98-0C50EE233498 2
2A151BE0-5215-4EB7-A315-CD403E00A51D 2
D0BA9F75-8C4D-43F0-B929-1CA6F9F61A1A 2
5EC94F38-BF00-41DF-89E6-C175E4E886D5 2
7C32ADC4-6119-4D13-9B0E-D0207D15A208 2
0383E4E2-86C1-4FBD-9280-582132A401B9 2

Every time I ran the query it showed the same behavior, although with a different number. Exactly what was going on here? To find out, I simplified the query.

SELECT InvoiceID, 
RAND() AS Rnd --Generate random # between 0 and 1
FROM Invoice

It produced the following outputs the first time:

InvoiceID........................... Rnd
1E92C9B1-7C7D-40EB-A386-6535A3AA557C 0.735523213
ECC458E2-CA56-4FD7-B508-5708C367D578 0.735523213
14EA7295-5338-4C92-AB4E-04FA88ADF8AB 0.735523213
AF414580-D271-4E46-96CA-AA5E2C4A05FD 0.735523213
F2126FAE-F22F-4B21-AB98-0C50EE233498 0.735523213
2A151BE0-5215-4EB7-A315-CD403E00A51D 0.735523213
D0BA9F75-8C4D-43F0-B929-1CA6F9F61A1A 0.735523213
5EC94F38-BF00-41DF-89E6-C175E4E886D5 0.735523213
7C32ADC4-6119-4D13-9B0E-D0207D15A208 0.735523213
0383E4E2-86C1-4FBD-9280-582132A401B9 0.735523213

and the second time:

InvoiceID........................... Rnd
1E92C9B1-7C7D-40EB-A386-6535A3AA557C 0.187330049
ECC458E2-CA56-4FD7-B508-5708C367D578 0.187330049
14EA7295-5338-4C92-AB4E-04FA88ADF8AB 0.187330049
AF414580-D271-4E46-96CA-AA5E2C4A05FD 0.187330049
F2126FAE-F22F-4B21-AB98-0C50EE233498 0.187330049
2A151BE0-5215-4EB7-A315-CD403E00A51D 0.187330049
D0BA9F75-8C4D-43F0-B929-1CA6F9F61A1A 0.187330049
5EC94F38-BF00-41DF-89E6-C175E4E886D5 0.187330049
7C32ADC4-6119-4D13-9B0E-D0207D15A208 0.187330049
0383E4E2-86C1-4FBD-9280-582132A401B9 0.187330049

Then, just for giggles, I modified the query again as follows.

SELECT InvoiceID, 
RAND() AS Rnd, --Generate random # between 0 and 1
RAND() AS Rnd2 --Generate a second random # between 0 and 1
FROM Invoice

This query produced even stranger results.

InvoiceID........................... Rnd..... Rnd2
1E92C9B1-7C7D-40EB-A386-6535A3AA557C 0.961508 0.358211
ECC458E2-CA56-4FD7-B508-5708C367D578 0.961508 0.358211
14EA7295-5338-4C92-AB4E-04FA88ADF8AB 0.961508 0.358211
AF414580-D271-4E46-96CA-AA5E2C4A05FD 0.961508 0.358211
F2126FAE-F22F-4B21-AB98-0C50EE233498 0.961508 0.358211
2A151BE0-5215-4EB7-A315-CD403E00A51D 0.961508 0.358211
D0BA9F75-8C4D-43F0-B929-1CA6F9F61A1A 0.961508 0.358211
5EC94F38-BF00-41DF-89E6-C175E4E886D5 0.961508 0.358211
7C32ADC4-6119-4D13-9B0E-D0207D15A208 0.961508 0.358211
0383E4E2-86C1-4FBD-9280-582132A401B9 0.961508 0.358211

A few hours of reading and some deductive reasoning later I had worked out what was happening. The RAND() function, by default, uses the current date and time as its seed value. The problem is that the RAND() function re-seeds itself once for every row in the result set, and the query runs so quickly that the seed value for each row ends up the same.

The RAND() function does take an optional parameter, a seed value. But what seed value should I use. The first thing I tried to use was the Row_Number() function.

SELECT InvoiceID, 
RAND(Row_Number() OVER (Order by InvoiceID)) AS Rnd 
FROM Invoice

InvoiceID........................... Rnd
14EA7295-5338-4C92-AB4E-04FA88ADF8AB 0.713591993
F2126FAE-F22F-4B21-AB98-0C50EE233498 0.713610626
D0BA9F75-8C4D-43F0-B929-1CA6F9F61A1A 0.713629259
52DC8A47-4B1F-4258-BD40-23A0C657DEE7 0.713647892
481130B9-6237-4B0D-AA2F-25D8141656B5 0.713666525
CCB83FA5-4D42-4848-8EED-388C6AFDA9CB 0.713685158
2C39B111-0AEA-410B-BC1D-3DC95B6A707D 0.713703791
91F0469F-8DE1-48DC-9953-3E5B0D4C0E26 0.713722424
204C8A91-36FD-4D1B-8C40-3FF9C3DE6263 0.713741057
EBDEF2EE-7208-4893-BF5B-40057F0C3285 0.713759690

This produced a different number for each row, but it suffered from two new problems. The first is that the random numbers generated are very similar. The second problem is that every time the query is run, it produced the exact same results.

This whole thing was turning into a real head-scratcher. The problem was that I needed a random number with which to seed the random number generator - a veritable catch 22. Then I noticed the primary key values - they certainly looked unique. Could I use a uniqueidentifier as the seed for the random number generator?

I first tried to simply pass the newid() function as a parameter to the rand() function, but this only led to an operand type clash. What I needed was an easy way to convert a uniqueidentifier to an integer. The solution was to use the built in checksum function, as shown in the next example.

SELECT InvoiceID, 
RAND(Checksum(Newid())) AS Rnd --Generate random # between 0 and 1
FROM Invoice

This did the trick.

InvoiceID........................... Rnd
1E92C9B1-7C7D-40EB-A386-6535A3AA557C 0.518769865
ECC458E2-CA56-4FD7-B508-5708C367D578 0.246216954
14EA7295-5338-4C92-AB4E-04FA88ADF8AB 0.069079661
AF414580-D271-4E46-96CA-AA5E2C4A05FD 0.207718428
F2126FAE-F22F-4B21-AB98-0C50EE233498 0.562418735
2A151BE0-5215-4EB7-A315-CD403E00A51D 0.095575008
D0BA9F75-8C4D-43F0-B929-1CA6F9F61A1A 0.390121252
5EC94F38-BF00-41DF-89E6-C175E4E886D5 0.118476845
7C32ADC4-6119-4D13-9B0E-D0207D15A208 0.620123360
0383E4E2-86C1-4FBD-9280-582132A401B9 0.027715165

The last step was to modify the original query with my new supercharged RAND() function.

SELECT InvoiceID, 
Cast(5 * RAND(Checksum(Newid())) + 1 AS INT) AS Rnd 
FROM Invoice

This finally produced the desired results.

InvoiceID........................... Rnd
1E92C9B1-7C7D-40EB-A386-6535A3AA557C 4
ECC458E2-CA56-4FD7-B508-5708C367D578 5
14EA7295-5338-4C92-AB4E-04FA88ADF8AB 4
AF414580-D271-4E46-96CA-AA5E2C4A05FD 2
F2126FAE-F22F-4B21-AB98-0C50EE233498 5
2A151BE0-5215-4EB7-A315-CD403E00A51D 3
D0BA9F75-8C4D-43F0-B929-1CA6F9F61A1A 4
5EC94F38-BF00-41DF-89E6-C175E4E886D5 1
7C32ADC4-6119-4D13-9B0E-D0207D15A208 5
0383E4E2-86C1-4FBD-9280-582132A401B9 5

In my production environment, I modified the final query so the user could select the percentage of invoices that should be audited. To accomplish this, I changed the final query to produce a number in the range of 1 to 100 and saved it as a view. Then I created a table value function that took a parameter to specify the percentage of rows to be selected. These are shown below.

CREATE VIEW dbo.vwInvoiceAudit AS
SELECT InvoiceID,
 Cast(100 * RAND(Checksum(Newid())) + 1 AS INT) AS Rnd
FROM Invoice

CREATE FUNCTION dbo.fnInvoiceAudit(@pctToSelect INT)
RETURNS TABLE 
AS 
RETURN 
(
 SELECT *
 FROM vwInvoiceAudit
 WHERE Rnd <= @pctToSelect
)

This approach is easy to implement, easy to understand and quite flexible because it is set based, rather than procedural. It can be applied to a number of other scenarios - for example randomly assigning cases to auditors, or even selecting door prize winners. The only other lesson to come from this case study is that the SQL Server's implementation of the RAND() function has some severe limitations.

Rate

Share

Categories

Share

Rate

Random problems

Rate

Share

Categories

Share

Rate

Related content

random randomise randomize data query

Random DEFAULT string for a column using CASE and NEWID

Generating Test Data: Part 2 - Generating Sequential and Random Dates

Generating Test Data: Part 1 - Generating Random Integers and Floats

random integer number scalar function