Finding Unique Non-Repeating Random Numbers

  • sql 7342 (8/12/2010)


    ...not proven to have good distributions...

    Ok... since you brought it up, let's see the code that demonstrates what you said. 😉

    ...and seems a bit slow.

    As compared to WHAT?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • shad-873858 (8/12/2010)


    You will get much higher degree of uniqueness if you look at the least significant digits of your RAND() instead of the most significant.

    In your Tally solution, simply changing LEFT() to RIGHT() will yield an almost completely unique set.

    With the small sample set provided, I ran several times and never got a single dupe.

    Thank you for posting this. I'll definitely test with that.

    Brandie Tarvin, MCITP Database AdministratorLiveJournal Blog: http://brandietarvin.livejournal.com/[/url]On LinkedIn!, Google+, and Twitter.Freelance Writer: ShadowrunLatchkeys: Nevermore, Latchkeys: The Bootleg War, and Latchkeys: Roscoes in the Night are now available on Nook and Kindle.

  • Jeff Moden (8/13/2010)


    wppatton (8/12/2010)


    Jim,

    It's not even close to being a homework problem. I wouldn't post homework for someone else to do. It's real world problem being brought to me by a business associate.

    Heh... your business associate must like Celko's seating challenge posted on one of the other forums. It's a contest and it comes with a prize. Tell your business associate that he's going to have to solve it himself or wait for the contest to end.

    Aha. I was wondering about that. @=)

    Brandie Tarvin, MCITP Database AdministratorLiveJournal Blog: http://brandietarvin.livejournal.com/[/url]On LinkedIn!, Google+, and Twitter.Freelance Writer: ShadowrunLatchkeys: Nevermore, Latchkeys: The Bootleg War, and Latchkeys: Roscoes in the Night are now available on Nook and Kindle.

  • Jeff Moden (8/13/2010)


    You can see that an average of about 520 numbers out of the million duplicate on each run. The run above (not including the dupe check) takes about 2 seconds to execute. The following run consitently gens a million UNIQUE random numbers in random order in about 26 seconds (again, not including the verification times).

    Oh. Now I get to play with more of your code. Thanks, Jeff. @=)

    Brandie Tarvin, MCITP Database AdministratorLiveJournal Blog: http://brandietarvin.livejournal.com/[/url]On LinkedIn!, Google+, and Twitter.Freelance Writer: ShadowrunLatchkeys: Nevermore, Latchkeys: The Bootleg War, and Latchkeys: Roscoes in the Night are now available on Nook and Kindle.

  • magarity kerns (8/12/2010)


    I understand random numbers for some cases but it seems strange here. If the purpose is to encrypt real EINs then one of the built in encryption routines would be a lot easier. If the purpose is to just a placeholder then why not just a simple integer counter; how is 1,2,3 less secure by obfuscation than random numbers?

    Exactly. 😀

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (8/13/2010)


    AndyC London (8/12/2010)


    Other functions to play with with regards to generating test data are the HashBytes function and CHECKSUM function which both have the useful property of being able to take a string and generate a unique value from it.

    I'll have to dig for the example, but CHECKSUM is not guaranteed to produce a unique value for different operands.

    I'm fairly certain I tried this variant at some point in the past, but it didn't work. But it was way before this particular project, so I'd have to dig too.

    Brandie Tarvin, MCITP Database AdministratorLiveJournal Blog: http://brandietarvin.livejournal.com/[/url]On LinkedIn!, Google+, and Twitter.Freelance Writer: ShadowrunLatchkeys: Nevermore, Latchkeys: The Bootleg War, and Latchkeys: Roscoes in the Night are now available on Nook and Kindle.

  • James Goodwin (8/12/2010)


    As AndyC mentions this is not a random set.

    However, for this case you don't need a random set, you need an arbitrary set of numbers. (More accurately, you need an arbitrary set that is easy to generate and impossible to reverse.) One way to achieve this would be to use RANK over the TaxID and then pad them out to the proper length. For nonnumeric data I would look at using CHECKSUM.

    --

    JimFive

    Since you brought it up, please explain the semantical differences between a random set and an arbitrary set of numbers.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • You can easely resolve the "unique" problem by adding a distinct to the select top().

    I tried this, and it works fine on SQL 2005. The distinct will not generate an ordered list ... ( you can use newid() or rand() )

    CREATE TABLE dbo.#RandomTaxID (TaxID int);

    GO

    INSERT INTO dbo.#RandomTaxID (TaxID)

    SELECT distinct TOP (70000)

    RIGHT(CAST(CAST(CAST(NEWID() AS VARBINARY) AS BIGINT) AS VARCHAR(30)),9)

    FROM dbo.Tally;

    ALTER TABLE dbo.#RandomTaxID ADD MyID int Identity(1,1) NOT NULL primary key clustered;

    go

    -- verify uniqueness :

    create unique index UQ_RandomTaxID on #RandomTaxID(TaxID);

    select top 10 * from #RandomTaxID order by MyID;

    [font="Courier New"]

    TaxIDMyID

    514443407 1

    959020860 2

    287024448 3

    269261913 4

    575137863 5

    423204402 6

    566434316 7

    66537729 8

    82584852 9

    752384001 10[/font]

  • Since you brought it up, please explain the semantical differences between a random set and an arbitrary set of numbers.

    Jeff,

    An arbitrary number is a number that you don't care how it is generated. As in the example in my previous post, since you just need an arbitrary number you can use Rank() to generate the numbers. They're not random, but they solve the deidentification issue presented.

    A random number, on the other hand, is a number generated through a random (or pseudo-random) process. This means that any number has an equal chance of coming up on any given test and therefore a set of random numbers is not guaranteed to contain unique numbers. The benefits of random numbers are distribution and unpredictability. If you don't need those, then asking for randomness just makes your life more difficult. Putting constraints (such as uniqueness) on a set of random numbers makes the set less random.

    To see why asking another person for a random number is bad, see http://scienceblogs.com/cognitivedaily/2007/02/is_17_the_most_random_number.php

    --

    JimFive

  • The benefits of random numbers are distribution and unpredictability. If you don't need those, then asking for randomness just makes your life more difficult.

    I think this is the same basic point I was asking about; if this article is about a theoretical exercise it is worded poorly. I can't figure out why anyone would jump through these hoops in the given situation except maybe a consultant paid by the hour.

    Stores in franchise #1, obfuscated EIN = 1. Stores in franchise 2, obfuscated EIN = 2. What am I missing??

    PS - I wish some people would have pity on my underpowered netbook and its tiny screen trying to render these pages and go easy on the animated gifs and 10 line sigs attached to one word comments :crying:

  • It's not a theoretical exercise. I actually had to do this for a project. And setting up encryption was not an option given me.

    EDIT: Since I had to jump through so many hoops and I didn't find a lot of articles on the subject, I thought I'd post for anyone else who is looking for something like this.

    Brandie Tarvin, MCITP Database AdministratorLiveJournal Blog: http://brandietarvin.livejournal.com/[/url]On LinkedIn!, Google+, and Twitter.Freelance Writer: ShadowrunLatchkeys: Nevermore, Latchkeys: The Bootleg War, and Latchkeys: Roscoes in the Night are now available on Nook and Kindle.

  • James Goodwin (8/13/2010)


    Since you brought it up, please explain the semantical differences between a random set and an arbitrary set of numbers.

    Jeff,

    An arbitrary number is a number that you don't care how it is generated. As in the example in my previous post, since you just need an arbitrary number you can use Rank() to generate the numbers. They're not random, but they solve the deidentification issue presented.

    A random number, on the other hand, is a number generated through a random (or pseudo-random) process. This means that any number has an equal chance of coming up on any given test and therefore a set of random numbers is not guaranteed to contain unique numbers. The benefits of random numbers are distribution and unpredictability. If you don't need those, then asking for randomness just makes your life more difficult. Putting constraints (such as uniqueness) on a set of random numbers makes the set less random.

    To see why asking another person for a random number is bad, see http://scienceblogs.com/cognitivedaily/2007/02/is_17_the_most_random_number.php

    --

    JimFive

    Thanks, James. As Magarity Kerns just posted, that ties it all together. I think a lot of people spend a lot of time making "random" numbers in an attempt to "obfuscate" data and all they really need are "arbitrary" place holders. It makes life even simpler when folks are made to understand that a list of test SSN's can actually be just a bunch of sequential 9 digit numbers.

    Anyway, thank you and Magarity for putting a very different and still very practical spin on all of this subject.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Bert De Haes (8/13/2010)


    You can easely resolve the "unique" problem by adding a distinct to the select top().

    I tried this, and it works fine on SQL 2005. The distinct will not generate an ordered list ... ( you can use newid() or rand() )

    CREATE TABLE dbo.#RandomTaxID (TaxID int);

    GO

    INSERT INTO dbo.#RandomTaxID (TaxID)

    SELECT distinct TOP (70000)

    RIGHT(CAST(CAST(CAST(NEWID() AS VARBINARY) AS BIGINT) AS VARCHAR(30)),9)

    FROM dbo.Tally;

    ALTER TABLE dbo.#RandomTaxID ADD MyID int Identity(1,1) NOT NULL primary key clustered;

    go

    -- verify uniqueness :

    create unique index UQ_RandomTaxID on #RandomTaxID(TaxID);

    select top 10 * from #RandomTaxID order by MyID;

    [font="Courier New"]

    TaxIDMyID

    514443407 1

    959020860 2

    287024448 3

    269261913 4

    575137863 5

    423204402 6

    566434316 7

    66537729 8

    82584852 9

    752384001 10[/font]

    Works just fine... until you need more than what your Tally table may have in it. Try it to gen a million numbers using a single cross join on an 11000 row Tally table. On second thought, don't try it... I stopped the code after 2.5 hours just trying to gen 70K rows.

    Using a Million row Tally, it generated 999541 rows after the distinct in about 27 seconds. Deduping appears to cost a lot on the fly both in your code and mine. 🙂

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Brandie Tarvin (8/13/2010)


    Jeff Moden (8/13/2010)


    You can see that an average of about 520 numbers out of the million duplicate on each run. The run above (not including the dupe check) takes about 2 seconds to execute. The following run consitently gens a million UNIQUE random numbers in random order in about 26 seconds (again, not including the verification times).

    Oh. Now I get to play with more of your code. Thanks, Jeff. @=)

    Having code to play with is always fun. Nice article, Brandie. It brought out several good discussions including the discussion that you may not need "random" numbers at all... just "arbitrary" numbers.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Support the animated giff comment - they get do very old very quickly.

Viewing 15 posts - 31 through 45 (of 51 total)

You must be logged in to reply to this topic. Login to reply