I'm writing this article because there isn't a lot of detail on the internet regarding the generation of unique, non-repeating random numbers. Today I'll outline three methods for generating this data.
With all the recent legislation to protect Personally Identifiable Information (PII), DBAs are left trying to figure out how to simulate Production databases in Development environments while masking private information from developers, testers, and end users who don't have valid business reasons to see the private information. One solution is to encrypt data, but that can cause performance issues that interfere with the ability to troubleshoot true code performance problems. Another, easier, solution is to simply replace the data with random characters when the database is being restored down to your Dev environment.
Take TaxIDs for example. While the Federal Employer Identification Number (or Federal Tax ID) for businesses isn't usually considered private information, self-employed people like contractors use their Social Security Numbers as their EIN and that is considered private information. So, assume you have the following tables:
Create Table Store (StoreID int NOT NULL, FedTaxID char(9) NOT NULL); GO Create Table Franchise (FranchiseID int NOT NULL, StoreID int NOT NULL); GO Insert into Store (StoreID, FedTaxID) (Select 1234, 115588779 UNION Select 2564, 998822546 UNION Select 3258, 641122553 UNION Select 5582, 641122553 UNION Select 8457, 992233457 UNION Select 7586, 992233457); Insert into Franchise (FranchiseID, StoreID) (Select 1, 1234 UNION Select 2, 2564 UNION Select 3, 3258 UNION Select 3, 5582 UNION Select 4, 8457 UNION Select 4, 7586);
A simple update to the TaxID should resolve the problem of PII. Just use the RAND() function to generate a random number, and convert the result so it looks like an EIN instead of a SSN.
Update Store Set TaxID = '11' + convert(varchar(7),(left(convert(bigint,RAND()*100000000000),7)))
However, this update causes every single TaxID to become the same random number. This isn't a problem unless you have business rules requiring each store to have a different TaxID. My situation is a little strange. Stores can have the same TaxID so long as they are within the same franchise. But stores from other franchises must have different TaxID and TaxIDs cannot repeat outside of those franchises.
So, franchises 1 & 2 must have TaxIDs different from each other and different from franchises 3 & 4. 3 & 4 must have different TaxIDs from each other and different from 1 & 2. But every store in franchise 3 must have the same TaxID as all the stores in franchise 3. An additional restriction is that I cannot use StoreID or FranchiseID in the generation of these random TaxIDs.
Method 1 - The Loop:
DECLARE @StoreCnt int, @CntUp int, @NewTaxID varchar(10); SELECT @StoreCnt = Count(StoreID) FROM dbo.tblStore; --Get the total count of stores so I know how many random numbers I'll need SET @CntUp = 0; --This is my loop variable CREATE TABLE dbo.#RandomNumbers (TaxID varchar(10)); WHILE @CntUp < (@StoreCnt + 10000) --Add an additional 10000 to the count --because some generated numbers won't be unique, non-repeating numbers BEGIN SET @NewTaxID = '11-' + convert(varchar(7),(left(convert(bigint,RAND()*100000000000),7))); --To make sure I'm comparing same # in the IF conditional that will be inserted IF (SELECT TaxID FROM dbo.#RandomNumbers WHERE TaxID = @NewTaxID) IS NULL --Verify random # doesn't currently exist BEGIN INSERT INTO dbo.#RandomNumbers (TaxID) SELECT @NewTaxID; --Inserts random number when it doesn't exist END; SET @CntUp = @CntUp + 1; --increment counter END;
The above is a simple, easy solution that almost anyone can write. Performance, however, is a bear. Running this code against a table with 55,000 records took me 4 minutes (longer if you have more columns in your table). But the more stores that get entered in my table, the longer this query will take to run. Time is a precious commodity in the DBA world, so this solution is unacceptable to me.
Method 2 - The Tally Table:
Ahha! Let's use a Tally table to generate unique numbers, using Jeff Moden's favorite method of generating a Tally table.
Tally Table Code:
IF OBJECT_ID('dbo.Tally') IS NOT NULL DROP TABLE dbo.Tally; --===== Create and populate the Tally table on the fly SELECT TOP 15000000 IDENTITY(INT,1,1) AS N INTO dbo.Tally FROM Master.dbo.SysColumns sc1 CROSS JOIN Master.dbo.SysColumns sc2; --===== Add a Primary Key to maximize performance ALTER TABLE dbo.Tally ADD CONSTRAINT PK_Tally_N PRIMARY KEY CLUSTERED (N) WITH FILLFACTOR = 100; --===== Let the public use it GRANT SELECT, REFERENCES ON dbo.Tally TO PUBLIC;
Now, we can match the RAND() function against the numbers generated by the Tally to get our TaxIDs.
CREATE TABLE dbo.#RandomTaxID (TaxID int, N int); GO INSERT INTO dbo.#RandomTaxID (TaxID, N) SELECT TOP (SELECT Count(StoreID) + 10000 FROM dbo.Store) convert(varchar(7),(left(convert(bigint,RAND()*(Convert(bigint,N)*1000000000)),7))), N FROM dbo.Tally;
Simple enough, but when we go back to verify the uniqueness of our numbers, we get a little surprise.
SELECT TaxID FROM dbo.#RandomTaxID GROUP BY TaxID HAVING Count(TaxID) > 1; --Find all repeating TaxIDs
This code runs significantly faster than the Loop Method. It takes me 15 seconds to generate the needed numbers. I like this performance, so I continue testing.
When I do my "double-check" for repeating numbers, I get 5,499 duplicate numbers. No matter how many times I re-run my random code, I continually get that same exact number of duplicates, though the actual randomized numbers vary. Curious, I ran the following code to investigate why the number of duplicates was always the same. (Note: In my situation, I have 51,105 unique StoreIDs, plus the 10,000 I added for insurance.)
SELECT TaxID INTO dbo.#Tax FROM dbo.#RandomTaxID GROUP BY TaxID HAVING Count(TaxID) > 1; SELECT rt.TaxID, rt.N FROM dbo.#RandomTaxID rt JOIN dbo.#Tax t ON rt.TaxID = t.TaxID ORDER BY rt.TaxID --Find all repeating TaxIDs and corresponding Tally number
Here's a sample of my results:
TaxID N 1000199 4520 1000199 452 1000199 45200 1000420 45210 1000420 4521 1000642 4522 1000642 45220
It turns out that every time part of my Tally number repeats itself, the randomization process turns out the same number. For the numbers beginning with "4520" or "452," I get the same "random" number. And when it switches to "4521," I get the repetition again. Now this repetition can be easily resolved with a DELETE statement, but that wastes a lot of numbers. Future additions to the table will cause more repeats and eventually, I see a situation where I'm deleting so many numbers that I won't have enough to entirely update my table. Again, this solution is unacceptable.
Method 3 - The NewID():
This solution comes courtesy of a suggestion made by Eugene Elutin to use NewID() in a simple UPDATE statement. That particular method didn't work for my circumstances, but I was able to adapt his suggestion to the Tally Table method above and came up with far fewer repeating numbers than the Tally method.
CREATE TABLE dbo.#RandomTaxID (TaxID int, N int); GO INSERT INTO dbo.#RandomTaxID (TaxID, N) SELECT TOP (SELECT Count(StoreID) + 10000 FROM dbo.Store) RIGHT(CAST(CAST(CAST(NEWID() AS VARBINARY) AS BIGINT) AS VARCHAR(30)),9), --Notice above statement uses NewID() instead of N. N FROM dbo.Tally; --Still using Tally, just not as the random number generator --IMPORTANT NOTE: If the MAX(StoreID) <> to COUNT(StoreID), -- change the above COUNT() to be MAX() -- or some records may not end up updated correctly! SELECT TaxID FROM dbo.#RandomTaxID GROUP BY TaxID HAVING Count(TaxID) > 1; --Checking for duplicates
Performance is the same as Method 2. Repeating runs of the above code generates between 0 to X number of duplicate numbers. In over twenty runs, however, I have not seen more than 5 duplicated random numbers. 5 is a lot easier to delete than 5000+ and less likely to cause me to re-work this code in the future.
Choosing Method 3 definitely seems to be the right choice for me. Now I need to finish my updates for my particularly strange set of business rules. First, I delete the duplicated numbers.
DELETE FROM dbo.#RandomTaxID FROM dbo.#RandomTaxID rt1 INNER JOIN (SELECT Min(N) as N, TaxID FROM dbo.#RandomTaxID GROUP BY TaxID HAVING Count(TaxID) > 1) rt2 ON rt1.N = rt2.N AND rt1.TaxID = rt2.TaxID; --Using Min(N) allows me to delete only one repetition -- of a particular number instead of all repetitions ALTER TABLE dbo.#RandomTaxID ADD MyID int Identity(1,1) NOT NULL; GO --Adds an identity field for matching storeID numbers
Then, it's a matter of actually updating the stores in each franchise to have the same exact TaxID without affect any of the other stores in other franchises.
--If your StoreID is an Identity field, --use the below update. UPDATE s SET TaxID = rt.TaxID FROM dbo.Store s INNER JOIN dbo.#RandomTaxID rt ON s.StoreID = rt.MyID; --If your StoreID is NOT an Identity field, --add an identity field and match Identity to Identity ALTER TABLE dbo.StoreID ADD MyID int Identity(1,1) NOT NULL; GO UPDATE s SET TaxID = rt.TaxID FROM dbo.Store s INNER JOIN dbo.#RandomTaxID rt ON s.MyID = rt.MyID; ALTER TABLE dbo.StoreID DROP COLUMN MyID; GO --Now update stores within the same Franchise UPDATE s1 SET s1.TaxID = s3.TaxID FROM dbo.Store s1 --Table being updated JOIN dbo.Franchise f1 --Getting FranchiseIDs ON s1.StoreID = f1.FranchiseID JOIN (Select min(s.StoreID) as StoreID, f.FranchiseID From dbo.Store s Join dbo.Franchise f On s.StoreID = f.StoreID Group By FranchiseID) s2 --Find minimum store for TaxID copy ON f1.FranchiseID = s2.FranchiseID --Join Franchise from outer to Franchise of min store JOIN dbo.Store s3 ON s2.StoreID = s3.StoreID; --Join Store again to get TaxID of min store Select f.FranchiseID, s.StoreID, s.TaxID from dbo.Store s join dbo.Franchise f on s.StoreID = f.StoreID Order by f.FranchiseID, s.StoreID; --Double-check that the Update worked correctly
A quick check of the tables shows that my code successfully copied over the TaxID from one store in a franchise to the rest of the stores in that franchise. And I got it done within the strictures of my business rules and without using a Loop, Cursor, or a CTE. Additionally, I can probably alter this code for use with zip codes, addresses, email addresses, or any other PII that government comes up with.
There are probably other methods that can create a unique, non-repeating random number. I'm sure some of them are more elegant than mine. If you have solutions, I would love to see them. If you have questions, I will try to answer them. If you have comments, I would love to read them. Hopefully this trial and error of mine will help someone else in the pursuit of the unique non-repeating numbers table.