Hi Erikur,
Thanks again for the reply. To preface this, your post was VERY helpful, and I doubt I would have gotten to the version I'm happy with now without your suggestions. I'll just note where I had to make some changes. Yours didn't do quite what I wanted right out of the box, but it gave me enough of an idea to get it working for what I needed. For what it's worth, your temp table version ran for about 50 minutes on my machine before I killed it. Truthfully, I didn't do a lot of profiling on it because there were a few other things I wanted to rewrite anyway. While using a random number mod @ENTRY_NUMBER is a really clever way of getting symbols, it can produce duplicates within a single set. And unfortunately, in my real set the numbers are not uniformly distributed between 1 and 10000. Not a big deal, and given the information I gave you, those were perfectly valid assumptions.
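Just to illustrate the duplicates point, here's a quick repro sketch (using the same abs(checksum(newid())) % n trick, with my numbers table as the row source; any table with enough rows works):

declare @n int = 10000
select Drawn = count(*), DistinctDrawn = count(distinct SymbolId)
from (select top (100) SymbolId = abs(checksum(newid())) % @n + 1
      from sysmon.dbo.numbers) d
-- Run it a few times: with 100 draws from 10,000 values, Drawn comes out greater than
-- DistinctDrawn in roughly 40% of runs (plain birthday-problem math), so duplicates
-- within a single sample are common enough to matter.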
After working through your sample code, here's what I came up with. All in all, it took about 5 minutes to generate 1 million samples, start to finish. It's pretty disk intensive on tempdb (part of that may be the setup I have at home), but memory never got too bad.
Note: I already had a persisted tally table (in my script, sysmon.dbo.numbers; I have one on every server I use), so I don't use an inline tally table.
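If anyone following along doesn't have one, this is roughly how I build mine (the name and schema are whatever you like; for this script it needs at least @maxSubsets rows, so a million here):

create table dbo.numbers (num int not null primary key clustered)
insert into dbo.numbers (num)
select top (1000000) num = row_number() over (order by (select null))
from sys.all_columns a cross join sys.all_columns b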
I also tried batching various steps, like pulling the symbols or serializing them, but every variation took longer and didn't noticeably reduce the load on system resources.
use tempdb
go
set nocount on
go
declare
@universeSize int = 10000,
@maxSubsetSize int = 100,
@maxSubsets int = 1000000
if object_id('tempdb.dbo.#Symbols') is not null drop table #Symbols
create table #Symbols
(
SymbolId int primary key clustered
)
if object_id('tempdb.dbo.#Samples') is not null drop table #Samples
create table #Samples
(
SampleId int primary key clustered,
SampleSize int,
Samples nvarchar(max)
)
if object_id('tempdb.dbo.#SampleSymbols') is not null drop table #SampleSymbols
create table #SampleSymbols
(
SampleId int,
SymbolId int
--primary key clustered (SampleId, SymbolId) with (data_compression = page)
)
-- This will normally be derived from a real table, so you can't assume a uniform distribution, or even that all the values will be lower than some number.
-- Get a list of symbols to sample from
insert into #Symbols (SymbolId)
select top (@universeSize) SymbolId = num
from sysmon.dbo.numbers
order by newid() -- Just to give the list an air of randomness, so you don't try to use tricks involving sequential numbers. Not strictly necessary at all.
-- Insert an integer identifier for each sample you want to generate, plus a random number between 1 and @maxSubsetSize saying how many symbols should be in the sample.
-- This serves as the framework for each group. SampleId is the identifier of a given group;
-- SampleSize is how many rows it should randomly take off the symbol list.
insert into #Samples
(
SampleID,
SampleSize
)
select top (@maxSubsets)
SampleID = num,
SampleSize = (abs(checksum(newid())) % @maxSubsetSize) + 1
from sysmon.dbo.numbers
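-- Optional sanity check I used while testing: SampleSize should come out roughly
-- uniform over 1..@maxSubsetSize, so expect about @maxSubsets / @maxSubsetSize
-- (~10k here) samples per size.
select SampleSize, SampleCount = count(*)
from #Samples
group by SampleSize
order by SampleSize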
-- DON'T PUT A CLUSTERED INDEX ON #SampleSymbols YET. IT'S WAAAAAAAAY FASTER TO INSERT INTO A HEAP AND BUILD THE INDEX AFTERWARD.
-- For each sample, cross apply a randomized subset of #Symbols with a top count of [SampleSize] (which we set in the previous statement)
insert into #SampleSymbols
(
SampleID,
SymbolId
)
select
s.SampleId,
x.SymbolId
from #Samples s
cross apply (select top (s.SampleSize) SymbolId
from #Symbols
order by newid()) x
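-- Quick row count check: SampleSize is uniform over 1..@maxSubsetSize, so the total
-- should land near @maxSubsets * (@maxSubsetSize + 1) / 2, i.e. about 50.5M rows here.
select TotalRows = count_big(*) from #SampleSymbols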
-- ~1.5 min on about 50M rows
-- Cluster one up, homie
-- Note: I tried the rest of this without clustering the #SampleSymbols table and it performed horribly
create unique clustered index #PKC__#SampleSymbols on #SampleSymbols (SampleID, SymbolID)
with (Data_compression = none, maxdop = 4, online = off)
-- Serialize each sample's symbol list into a single string
update o
set Samples = '[' + stuff((select ',{XID:' + cast(SymbolId as varchar(30)) + '}'
from #SampleSymbols i
where i.SampleId = o.SampleId
order by i.SymbolId -- keep the serialized order deterministic; the clustered index makes this nearly free
for xml path('')), 1, 1, '') + ']'
from #Samples o
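And to eyeball the result (just a spot check, nothing scientific):

select top (10) SampleId, SampleSize, Samples
from #Samples
order by SampleId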