Hi Erikur,
Thanks again for the reply. To preface this, your post was VERY helpful, and I doubt I would have gotten to the version I'm happy with now without your suggestions. I'll just note where I had to make some changes. Yours didn't do quite what I wanted right out of the box, but it gave me enough of an idea to get it working for what I needed. For what it's worth, your temp table version ran for about 50 minutes on my machine before I killed it. Truthfully, I didn't do a lot of profiling on it because there were a few other things I wanted to rewrite anyway. While using a random number mod @ENTRY_NUMBER is a really clever way of getting symbols, it can produce duplicates within a single set. And unfortunately, in my real set the numbers are not uniformly distributed between 1 and 10000. Not a big deal, and given the information I gave you, those were perfectly valid assumptions.
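Just to illustrate the duplicates point, here's a quick repro sketch (using the same abs(checksum(newid())) % n trick, with my numbers table as the row source; any table with enough rows works):

declare @n int = 10000
select Drawn = count(*), DistinctDrawn = count(distinct SymbolId)
from (select top (100) SymbolId = abs(checksum(newid())) % @n + 1
      from sysmon.dbo.numbers) d
-- Run it a few times: with 100 draws from 10,000 values, Drawn comes out greater than
-- DistinctDrawn in roughly 40% of runs (plain birthday-problem math), so duplicates
-- within a single sample are common enough to matter.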
After working through your sample code, here's what I came up with. All in all, it took about 5 minutes to generate 1 million samples, start to finish. It's pretty disk intensive on tempdb (part of that may be the setup I have at home), but memory never got too bad.
Note: I already had a persisted tally table (in my script, sysmon.dbo.numbers; I have one on every server I use), so I don't use an inline tally table.
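If anyone following along doesn't have one, this is roughly how I build mine (the name and schema are whatever you like; for this script it needs at least @maxSubsets rows, so a million here):

create table dbo.numbers (num int not null primary key clustered)
insert into dbo.numbers (num)
select top (1000000) num = row_number() over (order by (select null))
from sys.all_columns a cross join sys.all_columns b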
I also tried batching various steps, like pulling the symbols or serializing them, but every variation took longer and didn't noticeably reduce the load on system resources.
use tempdb
go
set nocount on
go
declare
@universeSize int = 10000,
@maxSubsetSize int = 100,
@maxSubsets int = 1000000
if object_id('tempdb.dbo.#Symbols') is not null drop table #Symbols
create table #Symbols
(
SymbolId int primary key clustered
)
if object_id('tempdb.dbo.#Samples') is not null drop table #Samples
create table #Samples
(
SampleId int primary key clustered,
SampleSize int,
Samples nvarchar(max)
)
if object_id('tempdb.dbo.#SampleSymbols') is not null drop table #SampleSymbols
create table #SampleSymbols
(
SampleId int,
SymbolId int
--primary key clustered (SampleId, SymbolId) with (data_compression = page)
)
-- This will normally be derived from a real table, so you can't assume a uniform distribution, or even that all the values will be lower than some number.
-- Get a list of symbols to sample from
insert into #Symbols (SymbolId)
select top (@universeSize) SymbolId = num
from sysmon.dbo.numbers
order by newid() -- Just to give the list an air of randomness, so you don't try to use tricks involving sequential numbers. Not strictly necessary at all.
-- Insert an integer identifier for each sample you want to generate, plus a random number between 1 and @maxSubsetSize saying how many symbols should be in the sample.
-- This serves as the framework for each group. SampleId is the identifier of a given group;
-- SampleSize is how many rows it should randomly take off the symbol list.
insert into #Samples
(
SampleID,
SampleSize
)
select top (@maxSubsets)
SampleID = num,
SampleSize = (abs(checksum(newid())) % @maxSubsetSize) + 1
from sysmon.dbo.numbers
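-- Optional sanity check I used while testing: SampleSize should come out roughly
-- uniform over 1..@maxSubsetSize, so expect about @maxSubsets / @maxSubsetSize
-- (~10k here) samples per size.
select SampleSize, SampleCount = count(*)
from #Samples
group by SampleSize
order by SampleSize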
-- DON'T PUT A CLUSTERED INDEX ON #SampleSymbols YET. IT'S WAAAAAAAAY FASTER TO INSERT INTO A HEAP AND BUILD THE INDEX AFTERWARD.
-- For each sample, cross apply a randomized subset of #Symbols with a top count of [SampleSize] (which we set in the previous statement)
insert into #SampleSymbols
(
SampleID,
SymbolId
)
select
s.SampleId,
x.SymbolId
from #Samples s
cross apply (select top (s.SampleSize) SymbolId
from #Symbols
order by newid()) x
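-- Quick row count check: SampleSize is uniform over 1..@maxSubsetSize, so the total
-- should land near @maxSubsets * (@maxSubsetSize + 1) / 2, i.e. about 50.5M rows here.
select TotalRows = count_big(*) from #SampleSymbols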
-- ~1.5 min on about 50M rows
-- Cluster one up, homie
-- Note: I tried the rest of this without clustering the #SampleSymbols table and it performed horribly
create unique clustered index #PKC__#SampleSymbols on #SampleSymbols (SampleID, SymbolID)
with (Data_compression = none, maxdop = 4, online = off)
-- Serialize each sample's symbol list into a single string
update o
set Samples = '[' + stuff((select ',{XID:' + cast(SymbolId as varchar(30)) + '}'
from #SampleSymbols i
where i.SampleId = o.SampleId
order by i.SymbolId -- keep the serialized order deterministic; the clustered index makes this nearly free
for xml path('')), 1, 1, '') + ']'
from #Samples o
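And to eyeball the result (just a spot check, nothing scientific):

select top (10) SampleId, SampleSize, Samples
from #Samples
order by SampleId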