Finding Unique Non-Repeating Random Numbers

  • Sounds a bit like a Sudoku puzzle, there might be similar techniques you could use

    http://www.vsj.co.uk/articles/display.asp?id=540

  • wppatton (8/12/2010)


    Jim,

    It's not even close to being a homework problem. I wouldn't post homework for someone else to do. It's real world problem being brought to me by a business associate.

    Post the table structure and data samples in the T-SQL forums and then PM me the link. I'll take a look at it and I'm sure a lot of other people will too.

    Brandie Tarvin, MCITP Database AdministratorLiveJournal Blog: http://brandietarvin.livejournal.com/[/url]On LinkedIn!, Google+, and Twitter.Freelance Writer: ShadowrunLatchkeys: Nevermore, Latchkeys: The Bootleg War, and Latchkeys: Roscoes in the Night are now available on Nook and Kindle.

  • A little off topic maybe, but below is a possible technique to hash demographic or personally identifying data for a QA or Development environment. The distribution of the data remains basically the same as the original. Please note that this is something I hacked together in a few minutes, and I've never actually used this professionally. Also, the performance would suck unless customer name and birth_date are both indexed.

    declare @customer table

    (

    customer_id int not null,

    first_name varchar(40) not null,

    last_name varchar(40) not null,

    birth_date smalldatetime not null

    );

    insert into @customer (customer_id, first_name, last_name, birth_date)

    select 1, 'Beverly','Johnson','1970/04/01' union all

    select 2, 'Mark','Johnson','1972/03/10' union all

    select 3, 'Mark','Johnson','1972/10/03' union all

    select 4, 'Scott','Lemon','1982/01/04' union all

    select 5, 'Michelle','Snow','1958/10/24' union all

    select 6, 'Scott','Richards','1958/10/24';

    select

    customer_id,

    left(first_name,1)+cast( (select count(*) from @customer b where b.first_name > a.first_name) as varchar(9) ) first_name,

    left(last_name,1)+cast( (select count(*) from @customer b where b.last_name > a.last_name) as varchar(9) ) last_name,

    dateadd( day, (select count(*) from @customer b where b.birth_date > a.birth_date), birth_date ) birth_date

    from

    @customer a

    order by

    customer_id;

    customer_id first_name last_name birth_date

    ----------- ---------- ---------- -----------------------

    1 B5 J3 1970-04-04 00:00:00

    2 M3 J3 1972-03-12 00:00:00

    3 M3 J3 1972-10-04 00:00:00

    4 S0 L2 1982-01-04 00:00:00

    5 M2 S0 1958-10-28 00:00:00

    6 S0 R1 1958-10-28 00:00:00

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • I understand random numbers for some cases but it seems strange here. If the purpose is to encrypt real EINs then one of the built in encryption routines would be a lot easier. If the purpose is to just a placeholder then why not just a simple integer counter; how is 1,2,3 less secure by obfuscation than random numbers?

    PS - The guy with the table arrangement problem sounded at first like a contrived homework problem but then I thought he's probably trying to set up a dating websites and needs a process to prearrange the seating at a group meet&greet so don't go too harsh on him.

  • You will get much higher degree of uniqueness if you look at the least significant digits of your RAND() instead of the most significant.

    In your Tally solution, simply changing LEFT() to RIGHT() will yield an almost completely unique set.

    With the small sample set provided, I ran several times and never got a single dupe.

  • Foks,

    It is NOT a dating website. It's a seating chart request from a business partner. I haven't touched TSQL in awhile so I thought I would bang it out on my own, but I obviously need more experience. I've been building/testing running through the booksline that comes with SQL server and trying it that way.

    I certainly appreciate all the help and suggestions given. But don't think someone is just trying to get homework done or some other ruse. It's a fricking MATH problem dealing with matrices,iterations. Once the alogrhythm is idenified it needs to be translated into TSQL and run. I don't have the neurons for this, but thought this group would..

    Again thanks for the help.

  • wppatton,

    I've been thinking about your problem off and on all day and it is much harder than it originally appears. My initial thought is that you would be able to do something with mod to determine the next chair to go to, but I haven't been able to make it work. I then tried shifting people and haven't found one that works yet. I'm wondering if this might be a form of the "traveling salesman" or "bin packing" problem (which would mean that it isn't possible to solve in the general case).

    One way to look at the problem is to find a set of paths for each person at table 0 that goes to each of the other tables, but never "meets up" with anyone else from table 0 probably using prime numbers. Then you would have to generalize that for each of the other tables.

    Mathematically speaking it is probably better to number your people 00 - 69 where the first digit is the seat number and the last digit is the table number.

    Good Luck,

    JimFive

  • I have also been thinking about the homework problem regarding the dating website. 😉 I don't think a solution is possible for the problem as presented.

    Any given person needs to meet 69 other people. Best case, he can meet 6 new people in each round. That would take 12 rounds to accomplish, not 10.

    Even for other numbers of people and tables, I'm not sure there are solutions where the number of rounds equals the number of tables. Except in the case of one large table and one very long round, after which you'd probably be sick of all these prospective mates, plus your homework would be overdue anyway.

    If we remove the restriction on the number of rounds, it's still an interesting SQL challenge. It's beyond my meager math and coding skills, but I'll be interested in the solution if one is posted.

  • Thanks for the article Brandie

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
    Posting Performance Based Questions - Gail Shaw[/url]
    Learn Extended Events

  • wppatton (8/12/2010)


    Jim,

    It's not even close to being a homework problem. I wouldn't post homework for someone else to do. It's real world problem being brought to me by a business associate.

    Heh... your business associate must like Celko's seating challenge posted on one of the other forums. It's a contest and it comes with a prize. Tell your business associate that he's going to have to solve it himself or wait for the contest to end. 😉

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • wppatton (8/12/2010)


    Foks,

    It is NOT a dating website. It's a seating chart request from a business partner. I haven't touched TSQL in awhile so I thought I would bang it out on my own, but I obviously need more experience. I've been building/testing running through the booksline that comes with SQL server and trying it that way.

    I certainly appreciate all the help and suggestions given. But don't think someone is just trying to get homework done or some other ruse. It's a fricking MATH problem dealing with matrices,iterations. Once the alogrhythm is idenified it needs to be translated into TSQL and run. I don't have the neurons for this, but thought this group would..

    Again thanks for the help.

    Please take this out to the forum. This thread is about the article, not about this seating chart problem. Thanks.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • jcrawf02 (8/12/2010)


    Jeff likes to use ABS(CHECKSUM(NEWID())) to generate random numbers, I haven't done any testing on it like you did, so don't know if it has issues with repetition or not. I've been treating it like it doesn't for non-critical use, suppose I should check that.;-)

    See http://www.sqlservercentral.com/Forums/Topic666309-338-1.aspx for an example of his use.

    Thanks for the reference. Yes, it will generate non unique numbers. That's part of what makes it random. Each number in a given domain supposedly has the same chance of being picked as any other number no matter how many numbers have been picked. The larger the domain for a given set size, the smaller the number of dupes.

    A nine digit number such as a Social Security Number will give you a domain of 1 billion numbers if you include zero. Let's see how many dupes we'd get trying to generate a million nine digit numbers using my favorite method that Brandie mentioned in her article...

    --===== Conditionally drop the test table to make reruns easier.

    IF OBJECT_ID('TempDB..#RandomSSN','U') IS NOT NULL

    DROP TABLE #RandomSSN

    ;

    --===== Build the million row test table of 9 digit random SSN's (includes leading zeros)

    SELECT TOP (1000000)

    RIGHT('000000000'+CAST(ABS(CHECKSUM(NEWID()))%1000000000 AS VARCHAR(9)),9) AS RandomSSN

    INTO #RandomSSN

    FROM Master.sys.All_Columns ac1

    CROSS JOIN Master.sys.All_Columns ac2

    ;

    --===== See how many dupes there are.

    SELECT RandomSSN, COUNT(*)

    FROM #RandomSSN

    GROUP BY RandomSSN

    HAVING COUNT(*) > 1

    ;

    You can see that an average of about 520 numbers out of the million duplicate on each run. The run above (not including the dupe check) takes about 2 seconds to execute. The following run consitently gens a million UNIQUE random numbers in random order in about 26 seconds (again, not including the verification times).

    --===== Conditionally drop the test table to make reruns easier.

    IF OBJECT_ID('TempDB..#RandomSSN','U') IS NOT NULL

    DROP TABLE #RandomSSN

    ;

    WITH

    cteBuildRandomSSNs AS

    ( --=== Build the million row test table of 9 digit random SSN's with twice as many extra rows

    -- as what we expect the dupes to be.

    SELECT TOP (1001000)

    RIGHT('000000000'+CAST(ABS(CHECKSUM(NEWID()))%1000000000 AS VARCHAR(9)),9) AS RandomSSN

    FROM Master.sys.All_Columns ac1

    CROSS JOIN Master.sys.All_Columns ac2

    )

    ,

    cteDupeCheck AS

    ( --=== Number how many times each random ssn is used. Unfortunately, this kills the random order

    SELECT ROW_NUMBER() OVER (PARTITION BY RandomSSN ORDER BY (SELECT 1)) AS Occurance,

    RandomSSN

    FROM cteBuildRandomSSNs

    )

    SELECT TOP (1000000)

    IDENTITY(INT,1,1) AS SortOrder,

    RandomSSN

    INTO #RandomSSN

    FROM cteDupeCheck

    WHERE Occurance = 1

    ORDER BY NEWID() -- This re-establishes a random order, but at great cost

    ;

    --===== See how many dupes there are.

    SELECT RandomSSN, COUNT(*)

    FROM #RandomSSN

    GROUP BY RandomSSN

    HAVING COUNT(*) > 1

    ;

    --===== See how many RandomSSN's there are

    SELECT COUNT(*)

    FROM #RandomSSN

    ;

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • sql 7342 (8/12/2010)


    Great post. Good example of use of random numbers.

    Rand versus NewID has been discussed. Rand only runs once per query, NewID is like other functions, but requires conversion and is not proven to have good distributions and seems a bit slow.

    An alternative is to union a bunch of "select Rand()" statments to create the list of unique random values.

    Select top 10 from

    (

    select cast(rand()*100000 as int) as rnd union

    select cast(rand()*100000 as int) as rnd union

    ...

    select cast(rand()*100000 as int) as rnd union

    select cast(rand()*100000 as int) as rnd

    ) as t

    Union All if you want non-unique numbers

    it may be a long sql query, but its very fast

    BTW if the multiple can easily be a count(*) from some other table, instead of a constant.

    Heh... gen a million numbers that way. 😉 See above.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • AndyC London (8/12/2010)


    There is a terminolgy issue here. The numbers in this sequence are not actually random as we could predict the next number in the sequence by looking at the previous values, the accuracy of this prediction would get better as we went along.

    For example, if there were 9 numbers in the range and we specified that they would not repeat then we might get a sequence

    3,2,5,4,7,8,1,6

    The next number would have to be 9 as it's the only number left in the sequence.

    I'm not sure if what you described would clasifiy as pseudo random but I know that the Linear Feedback Shift Register technique definately is.

    However, with regards to generating test data both these techniques are great and it's a well written article so many thanks.

    Other functions to play with with regards to generating test data are the HashBytes function and CHECKSUM function which both have the useful property of being able to take a string and generate a unique value from it.

    So for example Checksum('John Doe') would always return the same value.

    See also:

    http://msdn.microsoft.com/en-us/library/aa175776(SQL.80).aspx

    I'll have to dig for the example, but CHECKSUM is not guaranteed to produce a unique value for different operands.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • CELKO (8/12/2010)


    This is called an additive congruency generator, as oppsd to a random number generator. Here is the formula for a 31-bit additive congruency generator.

    UPDATE generator

    SET keyval = keyval/2 + MOD(MOD(keyval, 2) + MOD(keyval/8, 2), 2) * 2^30;

    Oh, I'm going to have some fun with that. Thanks for posting it, Joe.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 15 posts - 16 through 30 (of 51 total)

You must be logged in to reply to this topic. Login to reply