Generate n random int subsets

  • Howdy SSC,

    I'm trying to generate some realistic test data to do some load testing. The app I'm testing takes a parameter of a comma delimited list of integers, representing different stocks. For simplicity's sake, lets say represented by the numbers 1 through 10,000. The app caches calls, so I can't just pass the same set of symbols each time, or it won't hit all the code I want. Order of the subset doesn't matter to me

    What I'd like to do is generate about a million random subsets of that list with each subset containing, say, between 1 and 100 identifiers.

    So a dumbed down version might be, given [1,2,3,4,5] and I want to generate n=5 subsets of between 1 and 3 symbols

    1) [1,3]

    2) [2,3,5]

    3) [4]

    4) [1,3]

    5) [2,3,4,5]

    While i recognize this is HORRIBLY RBAR, this is what I'm trying to get at. Any suggestions for scaling this up and getting rid of the loop would be most welcome

    declare

    @numSubsets int = 50,

    @subsetMaxSize int = 5,

    @i int = 1,

    @topCt int

    if object_id('tempdb.dbo.#symbols') is not null drop table #symbols

    create table #symbols

    (

    SymbolId int primary key clustered

    )

    if object_id('tempdb.dbo.#Subsets') is not null drop table #Subsets

    create table #Subsets

    (

    SubsetId int,

    SymbolId int

    )

    insert into #symbols (SymbolId)

    select top 5 row_number() over (order by (select null))

    from sys.all_objects

    while @i <= @numSubsets

    begin

    select @topCt = (abs(checksum(newid())) % @subsetMaxSize)

    insert into #Subsets

    (

    SubsetId,

    SymbolId

    )

    select top (@topCt)

    SubsetId = @i,

    SymbolId = SymbolId

    from #Symbols

    order by newid()

    select @i += 1

    end

    select top 1000 *

    from #Subsets

    Executive Junior Cowboy Developer, Esq.[/url]

  • Here two fairly simple set based solutions based on the same logic. The first is pretty fast generating smaller sets and doesn't use a temporary table, the second one uses a temporary table and is better suited for larger sets.

    😎

    USE tempdb;

    GO

    SET NOCOUNT ON;

    -- Test set configurations:

    -- @ENTRY_NUMBER, size of the population

    DECLARE @ENTRY_NUMBER INT = 10000;

    -- @MAX_SET_ENTRIES, maximum number of entries from the population

    -- in each subset.

    DECLARE @MAX_SET_ENTRIES INT = 100;

    -- @SAMPLE_SIZE, number of subsets

    DECLARE @SAMPLE_SIZE INT = 1000;

    -- Inline Tally table producing a number sequence from 1 to @SAMPLE_SIZE

    ;WITH T(N) AS (SELECT N FROM (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) AS X(N))

    , NUMS(N) AS (SELECT TOP(@SAMPLE_SIZE) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS N

    FROM T T1,T T2,T T3,T T4,T T5,T T6,T T7,T T8,T T9)

    -- BASE_SET returns @SAMPLE_SIZE number of rows with a pseudo random number from

    -- 1 to @MAX_SET_ENTRIES which is controls the subset size for the entry.

    ,BASE_SET AS

    (

    SELECT

    NM.N AS SAMPLE_ID

    ,((ABS(CHECKSUM(NEWID())) % @MAX_SET_ENTRIES) + 1) AS SAMPLE_SIZE

    FROM NUMS NM

    )

    -- BASE_WITH_SUBSETS adds one row to the set for each member of the subset

    -- and the numerical identifier for that member

    ,BASE_WITH_SUBSETS AS

    (

    SELECT DISTINCT

    BS.SAMPLE_ID

    ,BS.SAMPLE_SIZE

    ,CONVERT(VARCHAR(12),((ABS(CHECKSUM(NEWID())) % @ENTRY_NUMBER) + 1),0) AS SUB_VAL

    FROM BASE_SET BS

    CROSS APPLY

    (

    SELECT TOP(BS.SAMPLE_SIZE) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS N

    FROM T T1,T T2,T T3,T T4,T T5,T T6,T T7,T T8,T T9

    ) AS SNUM(N)

    )

    -- The final select concatenates the subsets returning one row for each

    -- set and it's subset as a comma delimited value string.

    SELECT

    BWS.SAMPLE_ID

    ,STUFF((

    SELECT

    CHAR(44) + SBWS.SUB_VAL

    FROM BASE_WITH_SUBSETS SBWS

    WHERE BWS.SAMPLE_ID = SBWS.SAMPLE_ID

    FOR XML PATH('')

    ),1,1,'') AS SUB_SET

    FROM BASE_WITH_SUBSETS BWS

    GROUP BY BWS.SAMPLE_ID;

    USE tempdb;

    GO

    SET NOCOUNT ON;

    -- Drop the temporary table if it exists

    IF OBJECT_ID('tempdb..#TMP_BASE_WITH_SUBSETS') IS NOT NULL DROP TABLE #TMP_BASE_WITH_SUBSETS;

    -- Test set configurations:

    -- @ENTRY_NUMBER, size of the population

    DECLARE @ENTRY_NUMBER INT = 10000;

    -- @MAX_SET_ENTRIES, maximum number of entries from the population

    -- in each subset.

    DECLARE @MAX_SET_ENTRIES INT = 100;

    -- @SAMPLE_SIZE, number of subsets

    DECLARE @SAMPLE_SIZE INT = 1000000;

    -- Inline Tally table producing a number sequence from 1 to @SAMPLE_SIZE

    ;WITH T(N) AS (SELECT N FROM (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) AS X(N))

    , NUMS(N) AS (SELECT TOP(@SAMPLE_SIZE) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS N

    FROM T T1,T T2,T T3,T T4,T T5,T T6,T T7,T T8,T T9)

    -- BASE_SET returns @SAMPLE_SIZE number of rows with a pseudo random number from

    -- 1 to @MAX_SET_ENTRIES which is controls the subset size for the entry.

    ,BASE_SET AS

    (

    SELECT

    NM.N AS SAMPLE_ID

    ,((ABS(CHECKSUM(NEWID())) % @MAX_SET_ENTRIES) + 1) AS SAMPLE_SIZE

    FROM NUMS NM

    )

    -- BASE_WITH_SUBSETS adds one row to the set for each member of the subset

    -- and the numerical identifier for that member

    ,BASE_WITH_SUBSETS AS

    (

    SELECT DISTINCT

    BS.SAMPLE_ID

    ,BS.SAMPLE_SIZE

    ,CONVERT(VARCHAR(12),((ABS(CHECKSUM(NEWID())) % @ENTRY_NUMBER) + 1),0) AS SUB_VAL

    FROM BASE_SET BS

    CROSS APPLY

    (

    SELECT TOP(BS.SAMPLE_SIZE) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS N

    FROM T T1,T T2,T T3,T T4,T T5,T T6,T T7,T T8,T T9

    ) AS SNUM(N)

    )

    SELECT

    BWS.SAMPLE_ID

    ,BWS.SUB_VAL

    INTO #TMP_BASE_WITH_SUBSETS

    FROM BASE_WITH_SUBSETS BWS;

    CREATE NONCLUSTERED INDEX NCLIDX#TMP_BASE_WITH_SUBSETS_SAMPLE_ID_INCL_SUB_VAL ON #TMP_BASE_WITH_SUBSETS (SAMPLE_ID ASC) INCLUDE (SUB_VAL);

    -- The final select concatenates the subsets returning one row for each

    -- set and it's subset as a comma delimited value string.

    SELECT

    BWS.SAMPLE_ID

    ,STUFF((

    SELECT

    CHAR(44) + SBWS.SUB_VAL

    FROM #TMP_BASE_WITH_SUBSETS SBWS

    WHERE BWS.SAMPLE_ID = SBWS.SAMPLE_ID

    FOR XML PATH('')

    ),1,1,'') AS SUB_SET

    FROM #TMP_BASE_WITH_SUBSETS BWS

    GROUP BY BWS.SAMPLE_ID;

    Sample output

    SAMPLE_ID SUB_SET

    ---------- ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

    1 1182,1185,1287,1303,1347,136,1702,2064,2088,2178,2328,238,2388,2410,2519,2553,2586,2628,2694,2757,2772,2888,3113,3277,3345,3406,3414,3542,3873,3908,3909,3995,4081,4225,4247,4348,4442,4464,4544,4955,4960,5121,5542,5544,5673,5882,5959,6120,6188,6838,6851,698

    2 1585,16,1875,2771,5113,5377,5531,6486,694,786,8477,8570

    3 1025,1100,1181,1212,1227,1296,1418,1419,1866,1874,188,2009,2049,2261,2325,244,2442,2473,2547,3061,3096,3235,3247,3343,3356,3563,3798,3944,4092,4110,4311,438,4426,4539,4601,4773,4809,4836,4929,4955,4972,4976,4996,5178,5196,5250,53,5400,5582,5605,5791,5807,6

    4 1198,1545,1738,1773,1817,1819,2109,2174,2313,2316,2456,2589,279,3419,3782,4076,4139,422,4347,4436,4457,4536,4565,4816,4971,5057,5288,5404,5450,5686,5710,6258,6623,6685,6691,67,6744,6874,696,71,7289,7440,7505,7552,7902,8199,8250,8390,850,8526,8688,893,9151,

    5 1151,119,1298,1380,1394,1478,1551,1552,1603,1702,1989,1996,2158,2211,2278,233,254,2563,2712,2748,2763,2882,2889,3028,3075,3276,3385,3410,3470,4006,4111,4151,4298,4579,4650,5084,5279,5296,5347,5488,5792,5883,5898,6018,6223,636,6374,6381,6399,6531,6674,6676,

    6 128,1367,1662,2189,2717,3537,3660,4307,4422,6662,7956,8131,8975

    7 1322,140,1534,1551,1681,2292,241,2431,2628,2649,2679,2815,284,2949,30,3159,322,3264,3852,413,4440,4523,4670,4936,4937,495,5005,5048,5056,5099,5964,6079,6331,6446,6693,694,7414,8126,8178,870,9089,9238,9241,9262,9369,9474,9610,9650,9815

    8 1119,1446,1596,1635,1700,1788,1809,2003,2186,2259,2483,2646,270,2721,2791,2851,2880,2892,2966,3024,3153,3160,3200,3212,3213,3299,3309,3467,3496,3554,3757,3767,4178,4188,4400,447,4485,450,5035,504,53,5466,5531,5612,5675,5765,5899,5902,6037,6043,6045,6100,63

    9 1093,12,1212,1326,1417,1474,1604,1652,1762,1829,1871,190,1925,2090,2155,2692,2790,2797,2925,3062,3077,315,3543,3845,42,4599,4997,5074,5149,5202,5222,5233,5521,5554,5663,5797,5952,6148,6173,6203,626,6275,635,6534,6591,6600,6711,6968,7013,711,7311,7319,7374,

    10 1719,1820,2325,2578,3426,3560,3666,3806,3817,3875,4149,4726,4776,5530,5641,6185,6403,7118,7340,740,7401,7520,7645,8293,8472,8483,8564,9054,907,9691,9747

    Edit: added the second solution using a temporary table.

  • Thanks for the response. Taking a look now...

    Executive Junior Cowboy Developer, Esq.[/url]

  • Hi Erikur,

    Thanks again for the reply. To preface this, your post was VERY helpful, and I doubt I would have got to the version I'm happy with now without your suggestions. I'll just say where I had to make some changes. Yours didn't do quite what I wanted right out of the box, but it gave me enough idea to get it working for what I needed. For what it's worth, when I ran your temp table version ran for about 50 minutes on my machine before I just killed it. Truthfully I didn't to a lot of profiling on it because there were a few other things I wanted to rewrite anyway. While using a random number mod @ENTRY_NUMBER is a really clever way of getting symbols, it can produce duplicates in a single set. And, unfortunately in my real set, the numbers are not uniformly distributed between 1 and 10000. Not a big deal, and given the information I gave you, perfectly valid assumptions.

    After working through your sample code, here's what I came up with. All in all, it took about a 5 minutes to generate 1 million samples - start to finish. I'ts pretty disk intensive on tempdb (part of that may be the setup I have at home), but memory and memory never got too bad.

    Note: I already had a persisted tally table (in my script, sysmon.dbo.numbers; I have one on ever server I ever use) so I don't use an in-line tally table

    I also tried batching various steps like pulling the symbols, or serializing them, but all took longer, and didn't seem to make a sizable impact on system resources.

    use tempdb

    go

    set nocount on

    go

    declare

    @universeSize int = 10000,

    @maxSubsetSize int = 100,

    @maxSubsets int = 1000000

    if object_id('tempdb.dbo.#Symbols') is not null drop table #Symbols

    create table #Symbols

    (

    SymbolId int primary key clustered

    )

    if object_id('tempdb.dbo.#Samples') is not null drop table #Samples

    create table #Samples

    (

    SampleId int primary key clustered,

    SampleSize int,

    Samples nvarchar(max)

    )

    if object_id('tempdb.dbo.#SampleSymbols') is not null drop table #SampleSymbols

    create table #SampleSymbols

    (

    SampleId int,

    SymbolId int

    --primary key clustered (SampleId, SymbolId) with (data_compression = page)

    )

    -- This will normally be derived from a real table, so you can't assume uniform distribution, or even that all the values will be lower than some number.

    -- Get a list of

    insert into #Symbols (SymbolID)

    select top 10000 SymbolID = num

    from sysmon.dbo.numbers

    order by newid() -- Just to give the list an air of randomness, so you don't try to use tricks involving sequential numbers. Not strictly necessary at all.

    -- Insert an integer identifer for each sample you want to generate, plus a random number between 1 and @maxSubsetSize saying how many symbols should be in the sample

    -- This serves as the framework for each group. SampleID is the identifier ofa given group

    -- SampleSize is how mny rows it should randomly take off the symbol list

    insert into #Samples

    (

    SampleID,

    SampleSize

    )

    select top (@maxSubsets)

    SampleID = num,

    SampleSize = (abs(checksum(newid())) % @maxSubsetSize) + 1

    from sysmon.dbo.numbers

    -- DON PUT A CLUSTERED INDEX ON THIS. ITS WAAAAAAAAY FASTER INTO A HEAP

    -- For each sample, cross apply a randomized subset of #Symbols with a top count = [SampleSize] (which we set in the previou statement)

    insert into #SampleSymbols

    (

    SampleID,

    SymbolId

    )

    select

    s.SampleId,

    x.SymbolId

    from #Samples s

    cross apply (select top (s.SampleSize) SymbolId

    from #Symbols

    order by newid()) x

    --1.5 min on on about 50m rows

    -- Cluster one up homie

    -- Note I tried this witout clustering the #SampleSymbols table and it did horribly

    create unique clustered index #PKC__#SampleSymbols on #SampleSymbols (SampleID, SymbolID)

    with (Data_compression = none, maxdop = 4, online = off)

    -- Serialize data

    update o

    set Samples = '[' + stuff((select ',{XID:' + cast(SymbolId as varchar(30)) + '}'

    from #SampleSymbols i

    where i.SampleId = o.sampleId

    for xml path('')), 1, 1, '') + ']'

    from #Samples o

    Executive Junior Cowboy Developer, Esq.[/url]

  • Hi Xedni,

    you are very welcome. I do see one problem with your code and that is that all subsets of the same size will be identical (duplicates), is this what you intended?

    😎

  • Eirikur Eiriksson (6/27/2016)


    ...all subsets of the same size will be identical (duplicates), is this what you intended?

    😎

    Ew... you're right. Nice catch.

    Executive Junior Cowboy Developer, Esq.[/url]

  • This would be much shorter version:

    SELECT n.N ID, STUFF(

    (SELECT ',' + CONVERT(VARCHAR(20), n3.RandomN)

    FROM

    --picking a random quantity of random numbers per ID

    (SELECT TOP 1 N topn FROM dbo.TallyGenerator(1,3,1) WHERE n.N > 0 ORDER BY NEWID() ) n2

    -- now picking the random numbers themselves, [topn] of different numbers per each ID

    CROSS APPLY (SELECT TOP (topn) N RandomN FROM dbo.TallyGenerator(1,1000,1) WHERE n.N > 0 ORDER BY NEWID()) n3

    FOR XML PATH('')

    ), 1,1,'')

    --generating ID's - may be coming from a table with actual set of ID's, does not have to be generated

    FROM dbo.TallyGenerator(1,10000,1) n --(SELECT id FROM #Numbers GROUP BY id) n

    ORDER BY id

    dbo.TallyGenerator here is an inline table function generating sequential numbers from @Param1 to @Param2 with step @Param3.

    May be replaced with the commonly used CTE.

    _____________
    Code for TallyGenerator

  • Could you provide the code for TallyGenerator? Since I'm not actually trying to just generate random numbers, it's difficult to see where I'd slot in my legitimate symbols, or how it would scale.

    Executive Junior Cowboy Developer, Esq.[/url]

  • Xedni (6/27/2016)


    Could you provide the code for TallyGenerator? Since I'm not actually trying to just generate random numbers, it's difficult to see where I'd slot in my legitimate symbols, or how it would scale.

    One of the versions is right there, in Erikur's post:

    -- Inline Tally table producing a number sequence from 1 to @SAMPLE_SIZE

    ;WITH T(N) AS (SELECT N FROM (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) AS X(N))

    , NUMS(N) AS (SELECT TOP(@SAMPLE_SIZE) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS N

    FROM T T1,T T2,T T3,T T4,T T5,T T6,T T7,T T8,T T9)

    It uses a single parameter (numbers to ..) .

    Parameters "numbers from" and "increment" are hardcoded as 1.

    Put this code into a function - and here you go.

    _____________
    Code for TallyGenerator

  • That's how you do it:

    CREATE FUNCTION dbo.TallyGenerator (

    @EndNumber bigint

    )

    RETURNS TABLE

    AS RETURN

    (-- Inline Tally table producing a number sequence from 1 to @EndNumber

    WITH T(N) AS (SELECT N FROM (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) AS X(N))

    SELECT TOP(@EndNumber) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS N

    FROM T T1,T T2,T T3,T T4,T T5,T T6,T T7,T T8,T T9

    )

    GO

    SELECT * FROM dbo.TallyGenerator(15) tg

    _____________
    Code for TallyGenerator

  • Gotcha. Unfortunately, that function doesn't have the same signature as the one in your original code sample. I modified the function to accept all three parameters (start, end and increment; num % @increment, while still being an inline tvf), but cancelled it after 5 minutes. It's an interesting (and informative) approach, however it would seem simplicity in this case doesn't equate to performance.

    Executive Junior Cowboy Developer, Esq.[/url]

  • Xedni (6/27/2016)


    Gotcha. Unfortunately, that function doesn't have the same signature as the one in your original code sample. I modified the function to accept all three parameters (start, end and increment; num % @increment, while still being an inline tvf), but cancelled it after 5 minutes. It's an interesting (and informative) approach, however it would seem simplicity in this case doesn't equate to performance.

    That's odd.

    How many sets are your trying to generate?

    It took 7 seconds on my laptop to generate 10k sets of up to 30 numbers from the selection of 1000.

    _____________
    Code for TallyGenerator

  • Here is the code I used:

    CREATE FUNCTION dbo.TallyN (

    @EndNumber bigint

    )

    RETURNS TABLE

    AS RETURN

    (-- Inline Tally table producing a number sequence from 1 to @SAMPLE_SIZE

    WITH T(N) AS (SELECT N FROM (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) AS X(N))

    SELECT TOP(@EndNumber) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS N

    FROM T T1,T T2,T T3,T T4,T T5,T T6,T T7,T T8,T T9

    )

    GO

    SELECT n.N ID, STUFF(

    (SELECT ',' + CONVERT(VARCHAR(20), n3.RandomN)

    FROM

    --picking a random quantity of random numbers per ID

    (SELECT TOP 1 N topn FROM dbo.TallyN(30) WHERE n.N > 0 ORDER BY NEWID() ) n2

    -- now picking the random numbers themselves, [topn] of different numbers per each ID

    CROSS APPLY (SELECT TOP (topn) N RandomN FROM dbo.TallyN(1000) WHERE n.N > 0 ORDER BY NEWID()) n3

    FOR XML PATH('')

    ), 1,1,'')

    --generating ID's - may be coming from a table with actual set of ID's, does not have to be generated

    FROM dbo.TallyN(10000) n

    ORDER BY id

    Execution time - 7 seconds.

    _____________
    Code for TallyGenerator

Viewing 13 posts - 1 through 12 (of 12 total)

You must be logged in to reply to this topic. Login to reply