SQLServerCentral Article

Finding Unique Non-Repeating Random Numbers

,

Introduction

I'm writing this article because there isn't a lot of detail on the internet regarding the generation of unique, non-repeating random numbers. Today I'll outline three methods for generating this data.

With all the recent legislation to protect Personally Identifiable Information (PII), DBAs are left trying to figure out how to simulate Production databases in Development environments while masking private information from developers, testers, and end users who don't have valid business reasons to see the private information. One solution is to encrypt data, but that can cause performance issues that interfere with the ability to troubleshoot true code performance problems. Another, easier, solution is to simply replace the data with random characters when the database is being restored down to your Dev environment.

Take TaxIDs for example. While the Federal Employer Identification Number (or Federal Tax ID) for businesses isn't usually considered private information, self-employed people like contractors use their Social Security Numbers as their EIN and that is considered private information. So, assume you have the following tables:

Create Table Store (StoreID int NOT NULL, FedTaxID char(9) NOT NULL);
GO
Create Table Franchise (FranchiseID int NOT NULL, StoreID int NOT NULL);
GO
Insert into Store (StoreID, FedTaxID) (Select 1234, 115588779
UNION
Select 2564, 998822546 
UNION
Select 3258, 641122553 
UNION
Select 5582, 641122553 
UNION 
Select 8457, 992233457
UNION 
Select 7586, 992233457);

Insert into Franchise (FranchiseID, StoreID) 
(Select 1, 1234
UNION
Select 2, 2564
UNION
Select 3, 3258
UNION
Select 3, 5582
UNION 
Select 4, 8457
UNION 
Select 4, 7586);

A simple update to the TaxID should resolve the problem of PII. Just use the RAND() function to generate a random number, and convert the result so it looks like an EIN instead of a SSN.

Update Store 
Set TaxID = '11' + convert(varchar(7),(left(convert(bigint,RAND()*100000000000),7)))

However, this update causes every single TaxID to become the same random number. This isn't a problem unless you have business rules requiring each store to have a different TaxID. My situation is a little strange. Stores can have the same TaxID so long as they are within the same franchise. But stores from other franchises must have different TaxID and TaxIDs cannot repeat outside of those franchises.

So, franchises 1 & 2 must have TaxIDs different from each other and different from franchises 3 & 4. 3 & 4 must have different TaxIDs from each other and different from 1 & 2. But every store in franchise 3 must have the same TaxID as all the stores in franchise 3. An additional restriction is that I cannot use StoreID or FranchiseID in the generation of these random TaxIDs.

Method 1 - The Loop:

DECLARE @StoreCnt int, @CntUp int, @NewTaxID varchar(10);
SELECT @StoreCnt = Count(StoreID) FROM dbo.tblStore;
--Get the total count of stores so I know how many random numbers I'll need
SET @CntUp = 0;
--This is my loop variable
CREATE TABLE dbo.#RandomNumbers (TaxID varchar(10));
WHILE @CntUp < (@StoreCnt + 10000) 
--Add an additional 10000 to the count 
--because some generated numbers won't be unique, non-repeating numbers
 BEGIN
  SET @NewTaxID = '11-' + convert(varchar(7),(left(convert(bigint,RAND()*100000000000),7)));
  --To make sure I'm comparing same # in the IF conditional that will be inserted
  IF (SELECT TaxID FROM dbo.#RandomNumbers WHERE TaxID = @NewTaxID) IS NULL --Verify random # doesn't currently exist
   BEGIN
    INSERT INTO dbo.#RandomNumbers (TaxID)
    SELECT @NewTaxID; --Inserts random number when it doesn't exist
   END;
 SET @CntUp = @CntUp + 1; --increment counter
 END;

The above is a simple, easy solution that almost anyone can write. Performance, however, is a bear. Running this code against a table with 55,000 records took me 4 minutes (longer if you have more columns in your table). But the more stores that get entered in my table, the longer this query will take to run. Time is a precious commodity in the DBA world, so this solution is unacceptable to me.

Method 2 - The Tally Table:

Ahha! Let's use a Tally table to generate unique numbers, using Jeff Moden's favorite method of generating a Tally table.

Tally Table Code:

IF OBJECT_ID('dbo.Tally') IS NOT NULL
 DROP TABLE dbo.Tally;
--===== Create and populate the Tally table on the fly
SELECT TOP 15000000 IDENTITY(INT,1,1) AS N 
 INTO dbo.Tally 
 FROM Master.dbo.SysColumns sc1 
 CROSS JOIN Master.dbo.SysColumns sc2;
--===== Add a Primary Key to maximize performance 
ALTER TABLE dbo.Tally 
 ADD CONSTRAINT PK_Tally_N
 PRIMARY KEY CLUSTERED (N) WITH FILLFACTOR = 100;
--===== Let the public use it 
GRANT SELECT, REFERENCES ON dbo.Tally TO PUBLIC;

Now, we can match the RAND() function against the numbers generated by the Tally to get our TaxIDs.

CREATE TABLE dbo.#RandomTaxID (TaxID int, N int);
GO
INSERT INTO dbo.#RandomTaxID (TaxID, N)
SELECT TOP (SELECT Count(StoreID) + 10000 FROM dbo.Store) 
 convert(varchar(7),(left(convert(bigint,RAND()*(Convert(bigint,N)*1000000000)),7))), N
 FROM dbo.Tally;

Simple enough, but when we go back to verify the uniqueness of our numbers, we get a little surprise.

SELECT TaxID
 FROM dbo.#RandomTaxID
 GROUP BY TaxID
 HAVING Count(TaxID) > 1;
--Find all repeating TaxIDs

This code runs significantly faster than the Loop Method. It takes me 15 seconds to generate the needed numbers. I like this performance, so I continue testing.

When I do my "double-check" for repeating numbers, I get 5,499 duplicate numbers. No matter how many times I re-run my random code, I continually get that same exact number of duplicates, though the actual randomized numbers vary. Curious, I ran the following code to investigate why the number of duplicates was always the same. (Note: In my situation, I have 51,105 unique StoreIDs, plus the 10,000 I added for insurance.)

SELECT TaxID
 INTO dbo.#Tax
 FROM dbo.#RandomTaxID
 GROUP BY TaxID
 HAVING Count(TaxID) > 1;
SELECT rt.TaxID, rt.N
 FROM dbo.#RandomTaxID rt
  JOIN dbo.#Tax t
  ON rt.TaxID = t.TaxID
 ORDER BY rt.TaxID
--Find all repeating TaxIDs and corresponding Tally number

Here's a sample of my results:

TaxID   N
1000199 4520
1000199 452
1000199 45200
1000420 45210
1000420 4521
1000642 4522
1000642 45220

It turns out that every time part of my Tally number repeats itself, the randomization process turns out the same number. For the numbers beginning with "4520" or "452," I get the same "random" number. And when it switches to "4521," I get the repetition again. Now this repetition can be easily resolved with a DELETE statement, but that wastes a lot of numbers. Future additions to the table will cause more repeats and eventually, I see a situation where I'm deleting so many numbers that I won't have enough to entirely update my table. Again, this solution is unacceptable.

Method 3 - The NewID():

This solution comes courtesy of a suggestion made by Eugene Elutin to use NewID() in a simple UPDATE statement. That particular method didn't work for my circumstances, but I was able to adapt his suggestion to the Tally Table method above and came up with far fewer repeating numbers than the Tally method.

CREATE TABLE dbo.#RandomTaxID (TaxID int, N int);
GO
INSERT INTO dbo.#RandomTaxID (TaxID, N)
SELECT TOP (SELECT Count(StoreID) + 10000 FROM dbo.Store) 
 RIGHT(CAST(CAST(CAST(NEWID() AS VARBINARY) AS BIGINT) AS VARCHAR(30)),9),
--Notice above statement uses NewID() instead of N.
 N
FROM dbo.Tally;
--Still using Tally, just not as the random number generator
--IMPORTANT NOTE: If the MAX(StoreID) <> to COUNT(StoreID),
-- change the above COUNT() to be MAX()
-- or some records may not end up updated correctly!
SELECT TaxID
 FROM dbo.#RandomTaxID
 GROUP BY TaxID
 HAVING Count(TaxID) > 1; --Checking for duplicates

Performance is the same as Method 2. Repeating runs of the above code generates between 0 to X number of duplicate numbers. In over twenty runs, however, I have not seen more than 5 duplicated random numbers. 5 is a lot easier to delete than 5000+ and less likely to cause me to re-work this code in the future.

Choosing Method 3 definitely seems to be the right choice for me. Now I need to finish my updates for my particularly strange set of business rules. First, I delete the duplicated numbers.

DELETE FROM dbo.#RandomTaxID
 FROM dbo.#RandomTaxID rt1
   INNER JOIN (SELECT Min(N) as N, TaxID
                FROM dbo.#RandomTaxID
                GROUP BY TaxID
                HAVING Count(TaxID) > 1) rt2
 ON rt1.N = rt2.N
AND rt1.TaxID = rt2.TaxID;
--Using Min(N) allows me to delete only one repetition 
-- of a particular number instead of all repetitions
ALTER TABLE dbo.#RandomTaxID
 ADD MyID int Identity(1,1) NOT NULL;
GO
--Adds an identity field for matching storeID numbers

Then, it's a matter of actually updating the stores in each franchise to have the same exact TaxID without affect any of the other stores in other franchises.

--If your StoreID is an Identity field,
--use the below update.
UPDATE s
 SET TaxID = rt.TaxID
FROM dbo.Store s
 INNER JOIN dbo.#RandomTaxID rt
 ON s.StoreID = rt.MyID;
--If your StoreID is NOT an Identity field,
--add an identity field and match Identity to Identity
ALTER TABLE dbo.StoreID
 ADD MyID int Identity(1,1) NOT NULL;
GO
UPDATE s
 SET TaxID = rt.TaxID
 FROM dbo.Store s
   INNER JOIN dbo.#RandomTaxID rt
   ON s.MyID = rt.MyID;
ALTER TABLE dbo.StoreID
  DROP COLUMN MyID;
GO
--Now update stores within the same Franchise
UPDATE s1
  SET s1.TaxID = s3.TaxID
 FROM dbo.Store s1 --Table being updated
   JOIN dbo.Franchise f1 --Getting FranchiseIDs
     ON s1.StoreID = f1.FranchiseID
   JOIN (Select min(s.StoreID) as StoreID, f.FranchiseID
         From dbo.Store s
           Join dbo.Franchise f
             On s.StoreID = f.StoreID
         Group By FranchiseID) s2
  --Find minimum store for TaxID copy
  ON f1.FranchiseID = s2.FranchiseID
   --Join Franchise from outer to Franchise of min store
  JOIN dbo.Store s3
  ON s2.StoreID = s3.StoreID;
--Join Store again to get TaxID of min store
Select f.FranchiseID, s.StoreID, s.TaxID
 from dbo.Store s
   join dbo.Franchise f
     on s.StoreID = f.StoreID
 Order by f.FranchiseID, s.StoreID;
--Double-check that the Update worked correctly

A quick check of the tables shows that my code successfully copied over the TaxID from one store in a franchise to the rest of the stores in that franchise. And I got it done within the strictures of my business rules and without using a Loop, Cursor, or a CTE. Additionally, I can probably alter this code for use with zip codes, addresses, email addresses, or any other PII that government comes up with.

There are probably other methods that can create a unique, non-repeating random number. I'm sure some of them are more elegant than mine. If you have solutions, I would love to see them. If you have questions, I will try to answer them. If you have comments, I would love to read them. Hopefully this trial and error of mine will help someone else in the pursuit of the unique non-repeating numbers table.

Rate

3.48 (23)

You rated this post out of 5. Change rating

Share

Share

Rate

3.48 (23)

You rated this post out of 5. Change rating