# RANDom traps for the unwary

,

The RAND() function presents developers with a trap for the unwary. According to the latest edition of Books Online it "produces a pseudo-random number in the range 0 to 1, exclusive".

As a statement this is not a universal truth, it is only true from certain perspectives.

In this article I am going to use the Adventureworks2014 database to demonstrate the pitfalls of the RAND() statement.

## PROBLEM ONE:  Random numbers over a batch

If we use RAND() without a seed value then a single "random" value will be generated across the entire result set and not a random value per record.

````SELECT` RandomValue=RAND()
`FROM `person.StateProvince```

At least if we run the above query again we will get a different RandomValue, which is more than can be said for specifying a specific seed

````SELECT `RandomValue=RAND(32768)
`FROM `person.StateProvince```

The query above generates the value 0.324138575220812 for every record on all the machines I have available to me.  This tells me that for a given seed the random number algorithm is a deterministic function.

Finally, whatever the formula the RAND() function uses it is clearly symmetrical around zero.  The following query will produce two identical values.

``SELECT `RAND(32768),RAND(-32768)`

In short RAND() does not appear to have been designed with set based operations in mind.

## PROBLEM TWO:  RAND() for sequential seeds is definitely not random

The query below demonstrates just how poor the RAND() function is.

````SELECT `RandomValue=RAND(StateProvinceID),StateProvinceID
`FROM `person.StateProvince
`ORDER BY` 1 `-- Order by the allegedly random value````

If RAND() was truly random I would expect the StateProvinceID records to be randomly ordered however this is clearly not the case.

Plotting the values on a graph and adding a linear trend line emphasizes this fact.

For those of you not familiar with linear regression the R2 value measures the goodness of fit between the data sample and the calculated trend line.  A value of 1 represents a perfect fit.

I decided to explore this observation a bit further as Person.StateProvince is quite a small set of data.

I expanded the database and log file each to 1GB so that I could create my test data.

When I experiment I use SQLCMD mode as this allows me to parameterise what I am trying to do to a greater extent than T-SQL variables normally allow. SQLCMD mode can be switched on from the "Query" menu and appears near the bottom of the screen as cir led in `RED `below

The snippet of code below ensures that the recovery model for the database is set to SIMPLE and prints a confirmation message just to be sure.

```:setvar DB Adventureworks2014
`USE `\$(DB)
```GO
IF``` EXISTS(`SELECT `1 `FROM ``sys.databases` `WHERE `name = `'\$(DB)'` AND recovery_model_desc<>`'SIMPLE'`)
```BEGIN
ALTER DATABASE``` \$(DB) SET RECOVERY SIMPLE
`PRINT ``'DB RECOVERY MODEL CHANGED:  SIMPLE'`
```END
GO
DECLARE``` @RecoveryModelName  `NVARCHAR`(120)
`SELECT` @RecoveryModelName=recovery_model_desc
`FROM ``sys.databases`
`WHERE `name = `'\$(DB)'`
`RAISERROR`(`'DB RECOVERY MODEL FOR \$(DB) = %s'`,10,1,@RecoveryModelName) `WITH NOWAIT`
`GO````

The next step is to create a table to hold a large number of seed values and their associated RAND() value.

````IF `EXISTS(`SELECT `1 `FROM ``sys.objects` `WHERE `object_id=OBJECT_ID(`'dbo.RandomTest'`))
```BEGIN
DROP TABLE``` dbo.RandomTest
`PRINT ``'TABLE DROPPED: dbo.RandomTest'`
```END
GO
CREATE TABLE ```dbo.RandomTest (
SeedValue `INT `NOT NULL
`CONSTRAINT `PK_RandomTest `PRIMARY KEY CLUSTERED`,
RANDNumber `FLOAT `NULL
)
`GO````

### Populating the dbo.RandomTest table

We want to generate a respectable number of seed values and their associated RAND() values.

The use of a cross join on sys.columns is a useful way of generating a large number of records.   If there are 2,000 records in sys.columns then the cross join will produce 2,000 x 2,000 = 4,000,000 records.

The query below makes use of the cross join trick in the Common Table Expression (CTE) but caps the number of records used to 1 million in the query that consumes the CTE.

````--Populate our experiment table`
`WITH `cte(SequentialNumber)
`AS `(   `SELECT `ROW_NUMBER() `OVER `(`ORDER BY` C1.Column_Id)
`FROM ``sys.columns` `AS `C1,`sys.Columns` `AS `C2
)
`INSERT INTO` dbo.RandomTest (SeedValue,RANDNumber)
`SELECT TOP` 1000000 SequentialNumber,RAND(SequentialNumber)
`FROM `cte
```GO
CHECKPOINT
GO``````

### Evaluating the results

I wanted to gain some form of metric for the relationship between the seed value and its associated RAND() value.

I wanted to compare the random number generated for a seed with that of the next sequential seed.  If the current random value is less than the next random value then I wanted to count those occurrences.

````WITH `cte(Seed1, Seed2,  InOrder)
`AS `(
`SELECT `A.SeedValue, B.SeedValue,`CASE WHEN` A.RANDNumber<B.RANDNumber `THEN `1 `ELSE `0 `END`
`FROM `dbo.RandomTest `AS `A
LEFT JOIN dbo.RandomTest `AS `B
`ON `A.SeedValue = B.SeedValue-1
)
`SELECT`
OutOfOrder=COUNT(*)-COUNT (NULLIF(InOrder,1)),
InOrder=COUNT(*)-COUNT (NULLIF(InOrder,0))
`FROM `cte
`GO````
• NULLIF returns a NULL value if all arguments are equal
• COUNT will only count the non-null values.

My query returned the following result set.

 OutOfOrder InOrder 20 999,980

### Implications of ordered random numbers

Clearly RAND() produces a value that is highly correlated with its seed.

Back in the days of the Commodore VIC20 the trick to getting the BASIC RND() function to produce a decent sequence of random numbers was to seed it with a timer value.

The equivalent .NET function also defaults to using a timer based seed when no seed is specified.

The T-SQL equivalent clearly will not work on two fronts:-

• The correlation with the sequential seed as described above
• The conversion of CURRENT_TIMESTAMP (or GETDATE()) to a seed value is simply too coarse grained.

Consider the query below:

``SELECT `CAST(CURRENT_TIMESTAMP `AS FLOAT`)`
• The integer portion of the value represents the days elapsed since 1st January 1900.
• The decimal portion is a fraction of a day.

To keep the value within the bounds of a 32 bit integer we can multiply timestamp value by 10,000.  This allows us to make use of up to 4 decimal places.

0.001 of a day = 8.64 seconds.  Even if sequential seeds did produce a random distribution we would still not be able to use the timer unless we only generated a number every 8.6 seconds!

## Does the .NET framework offer a better solution?

The short answer is NO.  To test the .NET ability to generate random numbers from sequential seed values I wrote a short console application and noticed something peculiar.  It appeared that odd seeds were following a pattern and even seeds were following a similar but separate pattern.

````using `System;
`namespace `RandomExperiment
{
`class `Program
{
`static void` Main(`string`[] args)
{
`for `(`int `i = 0; i < 200; i+=2)
{
Console.Write(`" {0,10} "`, new Random(i).Next());
Console.Write(`" {0,10} "`, new Random(i+1).Next());
Console.WriteLine();
}
`string `s=Console.ReadLine();
}
}
}```

The graph below shows the output from the program.

Clearly there is a strong pattern

## How about repeated calls to RAND()

In no way conforming to the stereotype of an IT geek playing Dungeons and Dragons I am going to  test the randomness of the RAND function by simulating a 10 sided dice and by generating 100,000 roles of that dice.

To determine whether our dice are biased a little bit of knowledge of the binomial distribution should help.

If RAND() is unbiased then I would expect approximately 10,000 incidents of each dice face.  I would expect some faces to appear slightly more and others slightly less than 10,000 but not significantly.

V = np(1-p)

Variance = Number of Roles x Probability of getting a particular face x probability of not getting a face.

V= 100,000 x 0.9 * 0.1 = 9,000

Standard Deviation = Square Root of the Variance = 94.87.

The normal distribution tells us that a certain number of standard deviations either side of the mean describes the probability of the dice being biased.

 Confidence level Number of standard deviations Leeway either side Low Occurrence High Occurrence 90% 1.64 156 9,844 10,156 95% 1.96 186 9,814 10,186 99% 2.58 245 9,755 10,245

In other words if we wanted to be 99% confident that a dice was biased we would expect a particular face to crop up outside the range of 9,844 and 10,156 times.

So let us turn that into SQL code.

````IF `EXISTS(`SELECT `1 `FROM ``sys.objects` `WHERE `object_id=OBJECT_ID(`'dbo.RepeatedRand`') AND type=`'U'`)
```BEGIN
DROP TABLE``` dbo.RepeatedRand
`PRINT` `'TABLE DROPPED: dbo.RepeatedRand'`
```END
GO
CREATE TABLE``` dbo.RepeatedRand (
RepeatedRandID `INT `NOT NULL `IDENTITY`(1,1)
`CONSTRAINT `PK_RepeatedRand `PRIMARY KEY CLUSTERED`,
RandValue `TINYINT `NOT NULL
`CONSTRAINT `DF_RepeatedRand_RandValue `DEFAULT`(CAST(10*RAND() `AS INT`))
)
```GO
SET``` NOCOUNT ON
`-- Generate 100,000 records`
`INSERT INTO` dbo.RepeatedRand ```DEFAULT VALUES
GO ```100000```

To measure the number of occurrences of each value we run the simple query shown below.

````SELECT `RandValue,NumberOfOccurrences=COUNT(*)
`FROM `dbo.RepeatedRand
`GROUP BY` RandValue
`ORDER BY` RandValue```

Your results will vary on each run. The table below shows a result set I received.

RandValue

NumberOfOccurrences

0

9,979

1

9,932

2

9,909

3

10,127

4

9,971

5

9,915

6

10,075

7

9,960

8

10,001

9

10,131

As all values are within the 9,844 to 10,156 range I can be sure that there is no particular bias towards one number.

However, that is not the whole story.  The numbers picked may comply with a uniform distribution but what about when they are picked?

If the choosing of numbers is random then from our 100,000 sample we would expect the following:-

• Approximately 45,000 numbers to be lower than the preceding number
• Approximately 10,000 numbers to be the same as the preceding number
• Approximately 45,000 numbers to be higher than the preceding number

The SQL Query to measure this is as follows:-

```;`WITH `cte(Seed1, Seed2,  InOrder)
`AS `(
`SELECT `A.RepeatedRandId, B.RepeatedRandId,
`CASE WHEN `A.RANDValue<B.RANDValue `THEN `-1
`WHEN `A.RandValue>B.RandValue `THEN `1
`WHEN `A.RandValue=B.RandValue `THEN `0
`ELSE `2 `END`
`FROM `dbo.RepeatedRand `AS `A
LEFT JOIN dbo.RepeatedRand `AS `B
ON A.RepeatedRandId = B.RepeatedRandId-1
)
`SELECT`
LessThan=COUNT(*)-COUNT (NULLIF(InOrder,-1)),
Equal=COUNT(*)-COUNT (NULLIF(InOrder,0)) ,
WTF=COUNT(*)-COUNT (NULLIF(InOrder,2)) ,
MoreThan=COUNT(*)-COUNT (NULLIF(InOrder,1))
`FROM `cte
`GO````

An example of a result set I got is shown below:-

LessThan

Equal

WTF

MoreThan

44974

9962

1

45063

Note that the WTF captures the last LEFT record for which there is no RIGHT equivalent.

Again, this shows that the appearance of values is unbiased.

## The work around to make RAND() work with sets

To recap the problem with the RAND() function is that it does not appear to be designed with set based operations in mind.

This presents database developers with a paradox.  To produce random numbers over a set it needs to be seeded with random numbers in the first place.  This is somewhat of a chicken and egg scenario.

The general solution used involves CHECKSUM() and NEWID().  For example

````SELECT `RAND(CHECKSUM(NEWID()))
`FROM `person.StateProvince```

It works as follows:-

• The nearest we have to a random value is a unique identifier as generated by the NEWID() function.
• RAND needs an integer value for its seed. By using  the CHECKSUM() value of our NEWID() we achieve our goal.
• RAND produces a identical value for a given integer regardless of its sign (positive or negative)

## What are we trying to do with RAND() anyway?

### Choosing records at random

If we are trying to choose random records from a table then there are a couple of approaches covered by the MSDN article from 2008 by Marcelo De Barros, Kenton Gidewall.

For small to medium tables the following query is adequate

````SELECT TOP `10 *
`FROM `Person.StateProvince
`ORDER BY` NEWID()```

As illustrated in their article, once the number of records gets beyond a certain point then this method becomes progressively more expensive.

For large tables Marcelo De Barros and Kenton Gidewall's more creative approach offers an higher performance approximation

````SELECT` *
`FROM `Person.StateProvince
`WHERE `(ABS(CAST((BINARY_CHECKSUM(*) * RAND()) `AS `INT)) % 100) < 10```

The way that this works is as follows:-

• BINARY_CHECKSUM(*) creates a signed 32 bit integer value as a check sum across the record.
• RAND() produces a single random value across all records but when multiplied by the BINARY_CHECKSUM the result does produce the appearance of randomness
• ABS turns this into a signed value
• %100 returns the modulus (remainder) when the values above are divided by 100
• <10  Gives us the remainders that are 0 to 9 or 10 percent of the recordset.

The De Barros and Gidewall approach works best with tables with a large number of rows and also a large number of varying columns.

### Generating test data and/or running simulations

Personally I would sooner research and buy a tool to achieve data generation or for producing simulations.

Producing a decent tool for either situation is a skilled and time-consuming job.  It is easy for an IT person to cost £300/day or more so these tools are worthwhile on two fronts:-

• IT staff time/cost saved
• A good simulation can tease out bugs and realistic performance issues early.  Avoidance is cheaper than fixing.

I would sooner be able to say "Our simulation shows that we are likely to get a problem in scenario 'x', let's think about the possible ways to avoid the scenario" than be in the thick of a production incident trying to put in a sticking plaster solution.

For general data generation Red-Gate Data Generator offers a decent feature set and works well and handles data referential integrity well.

Simulations are a trickier area and are usually in the province of statistical software packages

## Concluding thoughts

Random number generation or utilisation is not something I have used often within a database.  There are practises that today we class as anti-patterns as better thought through approaches exist.  Yesterday the approaches had yet to emerge and our anti-pattern practises were all that we had.  There are always exceptions where the use of a  pattern may be appropriate but I suspect the use of RAND() is one that leans towards being an anti-pattern.

4.58 (19)

4.58 (19)