@alan.B: Thank you sir! Coming from you that's especially high praise 🙂
I absolutely agree; this is a deceptively difficult exercise, and I spent a couple hours last night trying to find something better as well.
I did come up with one other possibility, based on a little math trick to make sure the groups are distributed correctly.
In the first version I used CTEs to do all the required aggregations, but some of them are just painful to do that way.
I switched to a series of temp tables instead, which makes it much faster, especially since the running product can be calculated using a quirky-update-style approach (multiplication is commutative, so the order in which it's calculated doesn't actually matter).
In my tests with >100 synonyms for some of the keywords, it generally outperforms the dynamic TSQL approach, but it's a lot more bizarre 🙂
Basically, the idea is to count the synonyms for each keyword (including the keyword itself). Then, you number each synonym within a cWord group.
Then, you calculate a running product of the count per keyword, which is used to give both the running product for each cWord and the total number of combinations.
Once you have all that, you can use a tally to generate a candidate row for each synonym/group_id combination, and distribute synonyms into groups correctly by requiring that row_number-1 = ((group_id-1)/(total_combinations/running_product))%count (using integer division).
Essentially you're splitting the possibility space into buckets. So, let's say you have just 2 keywords with 2 possible values each (the keyword itself and one synonym).
For the first cWord (in arbitrary order), the running product is 2. For the second cWord, the running product is 4, which is also the total number of combinations.
The first cWord is then split into buckets of size 2 (total_combinations/running_product) over which its values cycle. So, group_ids 1 and 2 would have the same value, and group_ids 3 and 4 would have the other value.
The second cWord is split into buckets of size 1 (total_combinations/running_product), over which its values cycle. By making sure that each successive cWord cycles through all of its values within a single cycle of the previous cWord (again, in arbitrary order), you guarantee that you get every combination exactly once.
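To make the bucket math concrete, here's a small Python sketch of the same indexing trick (the sample keywords and the helper names are made up for illustration; this is just the arithmetic, not the T-SQL itself):

```python
from itertools import product

# Hypothetical data: each keyword maps to its values (the keyword itself plus synonyms).
synonyms = {"fast": ["fast", "quick"], "car": ["car", "auto"]}

words = list(synonyms)                       # arbitrary but fixed order
counts = [len(synonyms[w]) for w in words]   # count per keyword

# Running product per keyword; the last one is the total number of combinations.
running, prod = [], 1
for c in counts:
    prod *= c
    running.append(prod)
total = prod                                 # 4 in this example

# For each group_id, each keyword picks value index
# ((group_id-1) // (total // running_product)) % count  -- integer division.
groups = []
for group_id in range(1, total + 1):
    combo = tuple(
        synonyms[w][((group_id - 1) // (total // rp)) % cnt]
        for w, cnt, rp in zip(words, counts, running)
    )
    groups.append(combo)

# Every combination appears exactly once.
assert sorted(groups) == sorted(product(*[synonyms[w] for w in words]))
print(groups)
```

With 2 keywords of 2 values each, the first keyword's value is constant over buckets of 2 group_ids while the second cycles every group_id, so the 4 group_ids enumerate all 4 combinations.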
I still prefer the dynamic TSQL approach, honestly, but this one was more fun to write 🙂
IF OBJECT_ID('tempdb..#unified') IS NOT NULL DROP TABLE #unified;
CREATE TABLE #unified (rn BIGINT, cWord VARCHAR(255), cSyn VARCHAR(255));
INSERT INTO #unified
SELECT rn=ROW_NUMBER() OVER (PARTITION BY cWord ORDER BY cSyn)-1, cWord, cSyn
FROM
(   -- each keyword counts as one of its own synonyms, plus the listed synonyms
    SELECT cWord, cSyn=cWord
    FROM @tSynonyms ts
    UNION
    SELECT cWord, cSyn
    FROM @tSynonyms ts
) src;
IF OBJECT_ID('tempdb..#prods') IS NOT NULL DROP TABLE #prods;
SELECT cWord, cnt=COUNT(*),moving_prod=0 INTO #prods FROM #unified GROUP BY cWord;
DECLARE @agg_prod INT=1;
-- quirky update: accumulate the running product in @agg_prod while writing it back;
-- multiplication is commutative, so the row order doesn't matter here
UPDATE #prods SET @agg_prod=moving_prod=cnt*@agg_prod OPTION (MAXDOP 1);
IF OBJECT_ID('tempdb..#some_aggs') IS NOT NULL DROP TABLE #some_aggs;
CREATE TABLE #some_aggs (cWord VARCHAR(255), cnt BIGINT, moving_prod BIGINT, marker BIGINT, N BIGINT);
INSERT INTO #some_aggs
SELECT cWord, cnt, moving_prod,
       marker=((tn.N-1)/(@agg_prod/moving_prod))%cnt,
       N
FROM #prods p
CROSS JOIN dbo.TallyN(@agg_prod) tn;
SELECT group_id=s.N, cWord=u.cSyn
FROM #unified u
JOIN #some_aggs s ON u.cWord=s.cWord AND u.rn=s.marker
ORDER BY s.N ASC, u.cWord ASC