@alan.B: Thank you sir! Coming from you that's especially high praise 🙂
I absolutely agree; this is a deceptively difficult exercise, and I spent a couple hours last night trying to find something better as well.
I did come up with one other possibility, based on a little math trick to make sure the groups are distributed correctly.
In the first version I used CTEs to do all the required aggregations, but some of them are just painful to do that way.
I switched to a series of temp tables instead, which makes it much faster, especially since the running product can be calculated using a quirky-update-style approach (multiplication is commutative, so the order in which it's calculated doesn't actually matter).
In my tests with >100 synonyms for some of the keywords, it generally outperforms the dynamic TSQL approach, but it's a lot more bizarre 🙂
Basically, the idea is to count the synonyms for each keyword (including the keyword itself). Then, you number each synonym within a cWord group.
Then, you calculate a running product of the count per keyword, which is used to give both the running product for each cWord and the total number of combinations.
Once you have all that, you can use a tally to generate a candidate row for each synonym/group_id combination, and distribute synonyms into groups correctly by requiring that row_number-1 = ((group_id-1)/(total_combinations/running_product))%count (using integer division).
Essentially you're splitting the possibility space into buckets. So, let's say you have just 2 keywords with 2 possible values each (the keyword itself and one synonym).
For the first cWord (in arbitrary order), the running product is 2. For the second cWord, the running product is 4, which is also the total number of combinations.
The first cWord is then split into buckets of size 2 (total_combinations/running_product) over which its values cycle. So, group_ids 1 and 2 would have the same value, and group_ids 3 and 4 would have the other value.
The second cWord is split into buckets of size 1 (total_combinations/running_product), over which its values cycle. By making sure that each successive cWord cycles through all of its values within a single cycle of the previous cWord (again, in arbitrary order), you guarantee that you get every combination exactly once.
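To make the bucket math concrete, here's a small Python sketch of the same indexing trick (the sample keywords and the helper names are made up for illustration; this is just the arithmetic, not the T-SQL itself):

```python
from itertools import product

# Hypothetical data: each keyword maps to its values (the keyword itself plus synonyms).
synonyms = {"fast": ["fast", "quick"], "car": ["car", "auto"]}

words = list(synonyms)                       # arbitrary but fixed order
counts = [len(synonyms[w]) for w in words]   # count per keyword

# Running product per keyword; the last one is the total number of combinations.
running, prod = [], 1
for c in counts:
    prod *= c
    running.append(prod)
total = prod                                 # 4 in this example

# For each group_id, each keyword picks value index
# ((group_id-1) // (total // running_product)) % count  -- integer division.
groups = []
for group_id in range(1, total + 1):
    combo = tuple(
        synonyms[w][((group_id - 1) // (total // rp)) % cnt]
        for w, cnt, rp in zip(words, counts, running)
    )
    groups.append(combo)

# Every combination appears exactly once.
assert sorted(groups) == sorted(product(*[synonyms[w] for w in words]))
print(groups)
```

With 2 keywords of 2 values each, the first keyword's value is constant over buckets of 2 group_ids while the second cycles every group_id, so the 4 group_ids enumerate all 4 combinations.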
I still prefer the dynamic TSQL approach, honestly, but this one was more fun to write 🙂
IF OBJECT_ID('tempdb..#unified') IS NOT NULL DROP TABLE #unified;
CREATE TABLE #unified (rn BIGINT, cWord VARCHAR(255), cSyn VARCHAR(255));
INSERT INTO #unified
SELECT rn=ROW_NUMBER() OVER (PARTITION BY cWord ORDER BY cSyn)-1, cWord, cSyn
FROM
(   -- each keyword counts as one of its own synonyms, plus the listed synonyms
    SELECT cWord, cSyn=cWord
    FROM @tSynonyms ts
    UNION
    SELECT cWord, cSyn
    FROM @tSynonyms ts
) src;
IF OBJECT_ID('tempdb..#prods') IS NOT NULL DROP TABLE #prods;
SELECT cWord, cnt=COUNT(*),moving_prod=0 INTO #prods FROM #unified GROUP BY cWord;
DECLARE @agg_prod INT=1;
-- quirky update: accumulate the running product in @agg_prod while writing it back;
-- multiplication is commutative, so the row order doesn't matter here
UPDATE #prods SET @agg_prod=moving_prod=cnt*@agg_prod OPTION (MAXDOP 1);
IF OBJECT_ID('tempdb..#some_aggs') IS NOT NULL DROP TABLE #some_aggs;
CREATE TABLE #some_aggs (cWord VARCHAR(255), cnt BIGINT, moving_prod BIGINT, marker BIGINT, N BIGINT);
INSERT INTO #some_aggs
SELECT cWord, cnt, moving_prod,
       marker=((tn.N-1)/(@agg_prod/moving_prod))%cnt,
       N
FROM #prods p
CROSS JOIN dbo.TallyN(@agg_prod) tn;
SELECT group_id=s.N, cWord=u.cSyn
FROM #unified u
JOIN #some_aggs s ON u.cWord=s.cWord AND u.rn=s.marker
ORDER BY s.N ASC, u.cWord ASC