Random problems

By Chris Nowicki, 2009/06/09

Recently I was faced with an interesting challenge that involved random numbers - to randomly select 1 in 5 invoices to be audited. My initial approach was to write a query with a column that contained a random number from 1 to 5 for each row in that table, and then to select the rows with a value of 1 for this new column. This should be easy I thought, and so I started by writing the following query.

```SELECT InvoiceID,CAST(5 * RAND() + 1 AS INT) AS Rnd --Generate random # between 1 and 5FROM Invoice
```

This produced the following output.

```InvoiceID........................... Rnd
1E92C9B1-7C7D-40EB-A386-6535A3AA557C 2
ECC458E2-CA56-4FD7-B508-5708C367D578 2
AF414580-D271-4E46-96CA-AA5E2C4A05FD 2
F2126FAE-F22F-4B21-AB98-0C50EE233498 2
2A151BE0-5215-4EB7-A315-CD403E00A51D 2
D0BA9F75-8C4D-43F0-B929-1CA6F9F61A1A 2
5EC94F38-BF00-41DF-89E6-C175E4E886D5 2
0383E4E2-86C1-4FBD-9280-582132A401B9 2
```

Every time I ran the query it showed the same behavior, although with a different number. Exactly what was going on here? To find out, I simplified the query.

```SELECT InvoiceID, RAND() AS Rnd --Generate random # between 0 and 1FROM Invoice
```

It produced the following outputs the first time:

```InvoiceID........................... Rnd
1E92C9B1-7C7D-40EB-A386-6535A3AA557C 0.735523213
ECC458E2-CA56-4FD7-B508-5708C367D578 0.735523213
AF414580-D271-4E46-96CA-AA5E2C4A05FD 0.735523213
F2126FAE-F22F-4B21-AB98-0C50EE233498 0.735523213
2A151BE0-5215-4EB7-A315-CD403E00A51D 0.735523213
D0BA9F75-8C4D-43F0-B929-1CA6F9F61A1A 0.735523213
5EC94F38-BF00-41DF-89E6-C175E4E886D5 0.735523213
0383E4E2-86C1-4FBD-9280-582132A401B9 0.735523213
```

and the second time:

```InvoiceID........................... Rnd
1E92C9B1-7C7D-40EB-A386-6535A3AA557C 0.187330049
ECC458E2-CA56-4FD7-B508-5708C367D578 0.187330049
AF414580-D271-4E46-96CA-AA5E2C4A05FD 0.187330049
F2126FAE-F22F-4B21-AB98-0C50EE233498 0.187330049
2A151BE0-5215-4EB7-A315-CD403E00A51D 0.187330049
D0BA9F75-8C4D-43F0-B929-1CA6F9F61A1A 0.187330049
5EC94F38-BF00-41DF-89E6-C175E4E886D5 0.187330049
0383E4E2-86C1-4FBD-9280-582132A401B9 0.187330049
```

Then, just for giggles, I modified the query again as follows.

```SELECT InvoiceID, RAND() AS Rnd, --Generate random # between 0 and 1RAND() AS Rnd2 --Generate a second random # between 0 and 1FROM Invoice
```

This query produced even stranger results.

```InvoiceID........................... Rnd..... Rnd2
1E92C9B1-7C7D-40EB-A386-6535A3AA557C 0.961508 0.358211
ECC458E2-CA56-4FD7-B508-5708C367D578 0.961508 0.358211
AF414580-D271-4E46-96CA-AA5E2C4A05FD 0.961508 0.358211
F2126FAE-F22F-4B21-AB98-0C50EE233498 0.961508 0.358211
2A151BE0-5215-4EB7-A315-CD403E00A51D 0.961508 0.358211
D0BA9F75-8C4D-43F0-B929-1CA6F9F61A1A 0.961508 0.358211
5EC94F38-BF00-41DF-89E6-C175E4E886D5 0.961508 0.358211
0383E4E2-86C1-4FBD-9280-582132A401B9 0.961508 0.358211
```

A few hours of reading and some deductive reasoning later I had worked out what was happening. The RAND() function, by default, uses the current date and time as its seed value. The problem is that the RAND() function re-seeds itself once for every row in the result set, and the query runs so quickly that the seed value for each row ends up the same.

The RAND() function does take an optional parameter, a seed value. But what seed value should I use. The first thing I tried to use was the Row_Number() function.

```SELECT InvoiceID,
RAND(Row_Number() OVER (Order by InvoiceID)) AS Rnd
FROM Invoice
```
```InvoiceID........................... Rnd
F2126FAE-F22F-4B21-AB98-0C50EE233498 0.713610626
D0BA9F75-8C4D-43F0-B929-1CA6F9F61A1A 0.713629259
52DC8A47-4B1F-4258-BD40-23A0C657DEE7 0.713647892
481130B9-6237-4B0D-AA2F-25D8141656B5 0.713666525
CCB83FA5-4D42-4848-8EED-388C6AFDA9CB 0.713685158
2C39B111-0AEA-410B-BC1D-3DC95B6A707D 0.713703791
91F0469F-8DE1-48DC-9953-3E5B0D4C0E26 0.713722424
204C8A91-36FD-4D1B-8C40-3FF9C3DE6263 0.713741057
EBDEF2EE-7208-4893-BF5B-40057F0C3285 0.713759690
```

This produced a different number for each row, but it suffered from two new problems. The first is that the random numbers generated are very similar. The second problem is that every time the query is run, it produced the exact same results.

This whole thing was turning into a real head-scratcher. The problem was that I needed a random number with which to seed the random number generator - a veritable catch 22. Then I noticed the primary key values - they certainly looked unique. Could I use a uniqueidentifier as the seed for the random number generator?

I first tried to simply pass the newid() function as a parameter to the rand() function, but this only led to an operand type clash. What I needed was an easy way to convert a uniqueidentifier to an integer. The solution was to use the built in checksum function, as shown in the next example.

```SELECT InvoiceID, RAND(Checksum(Newid())) AS Rnd --Generate random # between 0 and 1FROM Invoice
```

This did the trick.

```InvoiceID........................... Rnd
1E92C9B1-7C7D-40EB-A386-6535A3AA557C 0.518769865
ECC458E2-CA56-4FD7-B508-5708C367D578 0.246216954
AF414580-D271-4E46-96CA-AA5E2C4A05FD 0.207718428
F2126FAE-F22F-4B21-AB98-0C50EE233498 0.562418735
2A151BE0-5215-4EB7-A315-CD403E00A51D 0.095575008
D0BA9F75-8C4D-43F0-B929-1CA6F9F61A1A 0.390121252
5EC94F38-BF00-41DF-89E6-C175E4E886D5 0.118476845
0383E4E2-86C1-4FBD-9280-582132A401B9 0.027715165
```

The last step was to modify the original query with my new supercharged RAND() function.

```SELECT InvoiceID, Cast(5 * RAND(Checksum(Newid())) + 1 AS INT) AS Rnd FROM Invoice
```

This finally produced the desired results.

```InvoiceID........................... Rnd
1E92C9B1-7C7D-40EB-A386-6535A3AA557C 4
ECC458E2-CA56-4FD7-B508-5708C367D578 5
AF414580-D271-4E46-96CA-AA5E2C4A05FD 2
F2126FAE-F22F-4B21-AB98-0C50EE233498 5
2A151BE0-5215-4EB7-A315-CD403E00A51D 3
D0BA9F75-8C4D-43F0-B929-1CA6F9F61A1A 4
5EC94F38-BF00-41DF-89E6-C175E4E886D5 1
0383E4E2-86C1-4FBD-9280-582132A401B9 5
```

In my production environment, I modified the final query so the user could select the percentage of invoices that should be audited. To accomplish this, I changed the final query to produce a number in the range of 1 to 100 and saved it as a view. Then I created a table value function that took a parameter to specify the percentage of rows to be selected. These are shown below.

```CREATE VIEW dbo.vwInvoiceAudit ASSELECT InvoiceID, Cast(100 * RAND(Checksum(Newid())) + 1 AS INT) AS RndFROM Invoice
```
```CREATE FUNCTION dbo.fnInvoiceAudit(@pctToSelect INT)RETURNS TABLE AS RETURN ( SELECT * FROM vwInvoiceAudit WHERE Rnd <= @pctToSelect)
```

This approach is easy to implement, easy to understand and quite flexible because it is set based, rather than procedural. It can be applied to a number of other scenarios - for example randomly assigning cases to auditors, or even selecting door prize winners. The only other lesson to come from this case study is that the SQL Server's implementation of the RAND() function has some severe limitations.

Total article views: 5001 | Views in the last 30 days: 2

Related Articles
FORUM

Select query

FORUM

Select Query

FORUM

select query

FORUM

select query

select query to with condition

FORUM

functions

functions causing query to run excessively long

Tags
 checksum() newid() rand() random numbers seed values

Join the most active online SQL Server Community

SQL knowledge, delivered daily, free:

You make SSC a better place

As a member of SQLServerCentral, you get free access to loads of fresh content: thousands of articles and SQL scripts, a library of free eBooks, a weekly database news roundup, a great Q & A platformâ€¦ And itâ€™s our huge, buzzing community of SQL Server Professionals that makes it such a success.