• Hi Erikur,

    Thanks again for the reply. To preface this, your post was VERY helpful, and I doubt I would have got to the version I'm happy with now without your suggestions. I'll just say where I had to make some changes. Yours didn't do quite what I wanted right out of the box, but it gave me enough idea to get it working for what I needed. For what it's worth, when I ran your temp table version ran for about 50 minutes on my machine before I just killed it. Truthfully I didn't to a lot of profiling on it because there were a few other things I wanted to rewrite anyway. While using a random number mod @ENTRY_NUMBER is a really clever way of getting symbols, it can produce duplicates in a single set. And, unfortunately in my real set, the numbers are not uniformly distributed between 1 and 10000. Not a big deal, and given the information I gave you, perfectly valid assumptions.

    After working through your sample code, here's what I came up with. All in all, it took about a 5 minutes to generate 1 million samples - start to finish. I'ts pretty disk intensive on tempdb (part of that may be the setup I have at home), but memory and memory never got too bad.

    Note: I already had a persisted tally table (in my script, sysmon.dbo.numbers; I have one on ever server I ever use) so I don't use an in-line tally table

    I also tried batching various steps like pulling the symbols, or serializing them, but all took longer, and didn't seem to make a sizable impact on system resources.

    use tempdb

    go

    set nocount on

    go

    declare

    @universeSize int = 10000,

    @maxSubsetSize int = 100,

    @maxSubsets int = 1000000

    if object_id('tempdb.dbo.#Symbols') is not null drop table #Symbols

    create table #Symbols

    (

    SymbolId int primary key clustered

    )

    if object_id('tempdb.dbo.#Samples') is not null drop table #Samples

    create table #Samples

    (

    SampleId int primary key clustered,

    SampleSize int,

    Samples nvarchar(max)

    )

    if object_id('tempdb.dbo.#SampleSymbols') is not null drop table #SampleSymbols

    create table #SampleSymbols

    (

    SampleId int,

    SymbolId int

    --primary key clustered (SampleId, SymbolId) with (data_compression = page)

    )

    -- This will normally be derived from a real table, so you can't assume uniform distribution, or even that all the values will be lower than some number.

    -- Get a list of

    insert into #Symbols (SymbolID)

    select top 10000 SymbolID = num

    from sysmon.dbo.numbers

    order by newid() -- Just to give the list an air of randomness, so you don't try to use tricks involving sequential numbers. Not strictly necessary at all.

    -- Insert an integer identifer for each sample you want to generate, plus a random number between 1 and @maxSubsetSize saying how many symbols should be in the sample

    -- This serves as the framework for each group. SampleID is the identifier ofa given group

    -- SampleSize is how mny rows it should randomly take off the symbol list

    insert into #Samples

    (

    SampleID,

    SampleSize

    )

    select top (@maxSubsets)

    SampleID = num,

    SampleSize = (abs(checksum(newid())) % @maxSubsetSize) + 1

    from sysmon.dbo.numbers

    -- DON PUT A CLUSTERED INDEX ON THIS. ITS WAAAAAAAAY FASTER INTO A HEAP

    -- For each sample, cross apply a randomized subset of #Symbols with a top count = [SampleSize] (which we set in the previou statement)

    insert into #SampleSymbols

    (

    SampleID,

    SymbolId

    )

    select

    s.SampleId,

    x.SymbolId

    from #Samples s

    cross apply (select top (s.SampleSize) SymbolId

    from #Symbols

    order by newid()) x

    --1.5 min on on about 50m rows

    -- Cluster one up homie

    -- Note I tried this witout clustering the #SampleSymbols table and it did horribly

    create unique clustered index #PKC__#SampleSymbols on #SampleSymbols (SampleID, SymbolID)

    with (Data_compression = none, maxdop = 4, online = off)

    -- Serialize data

    update o

    set Samples = '[' + stuff((select ',{XID:' + cast(SymbolId as varchar(30)) + '}'

    from #SampleSymbols i

    where i.SampleId = o.sampleId

    for xml path('')), 1, 1, '') + ']'

    from #Samples o

    Executive Junior Cowboy Developer, Esq.[/url]