Help with query

  • I need all combinations (not permutations) of the keywords and their synonyms. I'm having trouble getting groups of words substituting with their synonyms.
    A keyword's synonym could be matched in the synonym's table cWord column or cSyn column.
    I'm guessing it's going to involve something like a recursive cte w/ cross join (shudders), but the keywords table will rarely be more than 5, and never more than 8, rows.
    Can anyone help?
    TIA,

    Paul

    DECLARE @tKeywords    TABLE
        (nGroup_Id    INTEGER,
        cWord        VARCHAR(255))

    DECLARE @tSynonyms    TABLE
        (cWord        VARCHAR(255),
        cSyn        VARCHAR(255))

    INSERT    @tKeywords
    SELECT    1, 'Employment'
    UNION ALL SELECT 1, 'Insurance'
    UNION ALL SELECT 1, 'Online'
    UNION ALL SELECT 1, 'Application'

    INSERT    @tSynonyms
    SELECT    'Employment', 'Job'
    UNION ALL SELECT 'Employment', 'Profession'
    UNION ALL SELECT 'Employment', 'Vocation'
    UNION ALL SELECT 'Online', 'Electronic'

    Expected results:
    Group_Id     cWord
    1 Employment
    1 Insurance
    1 Online
    1 Application
    2 Job
    2 Insurance
    2 Online
    2 Application
    3 Profession
    3 Insurance
    3 Online
    3 Application
    4 Vocation
    4 Insurance
    4 Online
    4 Application
    5 Employment
    5 Insurance
    5 Electronic
    5 Application
    6 Job
    6 Insurance
    6 Electronic
    6 Application
    7 Profession
    7 Insurance
    7 Electronic
    7 Application
    8 Vocation
    8 Insurance
    8 Electronic
    8 Application

  • I'm, sorry, but I can't for the life of me figure out what you need here. I'll blame it on percocet. 🙂

    1) You have a groupid field that is all 1's, but your output has an incrementing groupid

    2) Insurance and Application have no matches at all, yet they are in the output

    3) Just general confusion. 🙂

    Can you please simplify with perhaps just two of each thingy and put some verbiage with each expected output row explaining the rule(s) for why each exists. Perhaps something about why other permutations do NOT make it in the output could be helpful too?

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • I think he wants all combinations of one word from each group of synonyms.  The groups are (Employment, Job, Profession, Vocation), (Insurance), (Online, Electronic) and (Application).  I got close, but I've run out of time, so I'll save what I've done and have another go at it tomorrow if I get a chance.

    Edit: what even is percocet?  Is it like covfefe?!

    John

  • John Mitchell-245523 - Wednesday, July 12, 2017 10:23 AM

    I think he wants all combinations of one word from each group of synonyms.  The groups are (Employment, Job, Profession, Vocation), (Insurance), (Online, Electronic) and (Application).  I got close, but I've run out of time, so I'll save what I've done and have another go at it tomorrow if I get a chance.

    Edit: what even is percocet?  Is it like covfefe?!

    John

    Can you post up what you have so far? Maybe I can carry on with it.

    Percocet is a narcotic painkiller. I had a surgery go bad almost 2 months ago and I still have to take it regularly. :crying:

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • I have annotated the expected results to try to understand better. Am I on the right track? One thing that still gets me is that it (almost) seems like order matters? Still fuzzy ..

    Expected results:
    Group_Id cWord
    1   Employment Employment, 1st hit, NO replacement
    1   Insurance
    1   Online
    1   Application

    2   Job    Employment --> Job cSyn
    2   Insurance
    2   Online
    2   Application

    3   Profession Employment --> Profession cSyn
    3   Insurance
    3   Online
    3   Application

    4   Vocation  Employment --> Vocation cSyn
    4   Insurance
    4   Online
    4   Application

    5   Employment Employment, 2nd hit, replace Online (3rd row) with Electronic cSyn
    5   Insurance
    5   Electronic
    5   Application

    6   Job    Employment, 2nd hit --> Job cSyn, replace Online (3rd row) with Electronic cSyn
    6   Insurance
    6   Electronic
    6   Application

    7   Profession Employment, 2nd hit --> Profession cSyn, replace Online (3rd row) with Electronic cSyn
    7   Insurance
    7   Electronic
    7   Application

    8   Vocation  Employment, 2nd hit --> Vocation Syn, replace Online (3rd row) with Electronic cSyn
    8   Insurance
    8   Electronic
    8   Application

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • Sorry, was -- and am  -- out to lunch :Whistling:

    Precisely!

    But thinking on it over food, I may have been making this harder than it need be.
    I can start with the string 'Employment Insurance Online Application', do the substitutions, then run them through the splitter.
    It's a WHILE loop, but, seems to work alright.

    DECLARE @cCleanStringVar VARCHAR(512) = 'Employment Insurance Online Application'

    DECLARE @tKeywords    TABLE
        (cWord        VARCHAR(255))

    DECLARE @tSynonyms    TABLE
        (cWord        VARCHAR(255),
        cSyn        VARCHAR(255))

    INSERT    @tSynonyms
    SELECT    'Employment', 'Job'
    UNION ALL SELECT 'Employment', 'Work'
    UNION ALL SELECT 'Employment', 'Profession'
    UNION ALL SELECT 'Employment', 'Vocation'
    UNION ALL SELECT 'Online', 'Electronic'

    INSERT    @tKeywords
    SELECT    @cCleanStringVar

    WHILE @@ROWCOUNT > 0

        INSERT    @tKeywords
        SELECT    DISTINCT --ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) + 1,
            STUFF(k.cWord, CHARINDEX(s.cWord, k.cWord, nRow), DATALENGTH(s.cWord), s.cSyn)--,
    --        s.cWord
        FROM    @tKeywords    k
        CROSS APPLY Project.dbo.ufn_Numbers(DATALENGTH(k.cWord))
        JOIN    @tSynonyms    s    ON CHARINDEX(s.cWord, k.cWord, nRow) > 0    
        AND NOT EXISTS (SELECT    *
                FROM    @tKeywords k2
                WHERE    k2.cWord = STUFF(k.cWord, CHARINDEX(s.cWord, k.cWord, nRow), DATALENGTH(s.cWord), s.cSyn))

    SELECT    ROW_NUMBER() OVER (ORDER BY (SELECT NULL)),
        cWord
    FROM    @tKeywords
    ORDER BY 1

  • schleep - Wednesday, July 12, 2017 11:21 AM

    Sorry, was -- and am  -- out to lunch :Whistling:

    Precisely!

    But thinking on it over food, I may have been making this harder than it need be.
    I can start with the string 'Employment Insurance Online Application', do the substitutions, then run them through the splitter.
    It's a WHILE loop, but, seems to work alright.

    DECLARE @cCleanStringVar VARCHAR(512) = 'Employment Insurance Online Application'

    DECLARE @tKeywords    TABLE
        (cWord        VARCHAR(255))

    DECLARE @tSynonyms    TABLE
        (cWord        VARCHAR(255),
        cSyn        VARCHAR(255))

    INSERT    @tSynonyms
    SELECT    'Employment', 'Job'
    UNION ALL SELECT 'Employment', 'Work'
    UNION ALL SELECT 'Employment', 'Profession'
    UNION ALL SELECT 'Employment', 'Vocation'
    UNION ALL SELECT 'Online', 'Electronic'

    INSERT    @tKeywords
    SELECT    @cCleanStringVar

    WHILE @@ROWCOUNT > 0

        INSERT    @tKeywords
        SELECT    DISTINCT --ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) + 1,
            STUFF(k.cWord, CHARINDEX(s.cWord, k.cWord, nRow), DATALENGTH(s.cWord), s.cSyn)--,
    --        s.cWord
        FROM    @tKeywords    k
        CROSS APPLY Project.dbo.ufn_Numbers(DATALENGTH(k.cWord))
        JOIN    @tSynonyms    s    ON CHARINDEX(s.cWord, k.cWord, nRow) > 0    
        AND NOT EXISTS (SELECT    *
                FROM    @tKeywords k2
                WHERE    k2.cWord = STUFF(k.cWord, CHARINDEX(s.cWord, k.cWord, nRow), DATALENGTH(s.cWord), s.cSyn))

    SELECT    ROW_NUMBER() OVER (ORDER BY (SELECT NULL)),
        cWord
    FROM    @tKeywords
    ORDER BY 1

    That code makes my brain hurt!! 😀 But if it works I'm moving on ... LOL

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • Mine too. Here's the final version. I didn't mention before, we also have phrases and synonyms of phrases.
    Managed to get rid of the cross apply, which makes me considerably happier.

    IF OBJECT_ID('dbo.ufn_IPP_GetKWSynsGroups') IS NOT NULL
    DROP FUNCTION dbo.ufn_IPP_GetKWSynsGroups
    GO

    CREATE FUNCTION dbo.ufn_IPP_GetKWSynsGroups
        (@cCleanStringArg    VARCHAR(255),
        @cLangArg        CHAR(1))
        
        RETURNS @tRS TABLE (    nGroup_Id    INTEGER,
                    cWord        VARCHAR(512))
    AS
     BEGIN
        DECLARE @nRowCountVar    INTEGER
        
        DECLARE @tKeywords    TABLE
            (nGroup_Id    INTEGER    IDENTITY,
            cWord        VARCHAR(255) PRIMARY KEY)

        DECLARE @tSynonyms    TABLE
            (cWord        VARCHAR(255),
            cSyn        VARCHAR(255),
            nPos        INTEGER)

        INSERT    @tSynonyms
        SELECT    cWord, cSyn, CHARINDEX(cWord, @cCleanStringArg)
        FROM    dbo.IPP_Synonyms    s
        WHERE    CHARINDEX(cWord, @cCleanStringArg) > 0
        AND    s.cLang = @cLangArg

        WHILE @@ROWCOUNT > 0

            INSERT    @tSynonyms
            SELECT    s.cWord, s.cSyn, CHARINDEX(s.cWord, @cCleanStringArg, nPos)
            FROM    dbo.IPP_Synonyms    s
            CROSS JOIN @tSynonyms        t
            WHERE    CHARINDEX(s.cWord, @cCleanStringArg) > 0
            AND    s.cLang = @cLangArg
            AND NOT EXISTS (SELECT    *
                    FROM    @tSynonyms t2
                    WHERE    t2.cWord = s.cWord
                    AND    t2.cSyn = s.cSyn)

        INSERT    @tKeywords
        SELECT    @cCleanStringArg

        SELECT @nRowCountVar = @@ROWCOUNT

        WHILE @nRowCountVar > 0
         BEGIN
         
            INSERT    @tKeywords
            SELECT    DISTINCT STUFF(k.cWord, CHARINDEX(s.cWord, k.cWord), DATALENGTH(s.cWord), s.cSyn)
            FROM    @tKeywords    k
            JOIN    @tSynonyms    s ON CHARINDEX(s.cWord, k.cWord) > 0
            AND NOT EXISTS (SELECT    *
                    FROM    @tKeywords k2
                    WHERE    k2.cWord = STUFF(k.cWord, CHARINDEX(s.cWord, k.cWord), DATALENGTH(s.cWord), s.cSyn))
            SELECT @nRowCountVar = @@ROWCOUNT

            INSERT    @tKeywords
            SELECT    DISTINCT STUFF(k.cWord, CHARINDEX(s.cSyn, k.cWord), DATALENGTH(s.cSyn), s.cWord)
            FROM    @tKeywords    k
            JOIN    @tSynonyms    s    ON CHARINDEX(s.cSyn, k.cWord) > 0
            AND NOT EXISTS (SELECT    *
                    FROM    @tKeywords k2
                    WHERE    k2.cWord = STUFF(k.cWord, CHARINDEX(s.cSyn, k.cWord), DATALENGTH(s.cSyn), s.cWord))

            SELECT @nRowCountVar += @@ROWCOUNT
         END

        INSERT    @tRS
        SELECT    nGroup_Id,
            cRow
        FROM    (SELECT    nGroup_Id,
                cWord
            FROM    @tKeywords) k
        CROSS APPLY Project.dbo.ufn_DelimitedStringToTable (k.cWord, ' ')
        
        RETURN
     END
    GO

  • TheSQLGuru - Wednesday, July 12, 2017 12:42 PM

    That code makes my brain hurt!! 😀 But if it works I'm moving on ... LOL

    Might wanna pop a couple more percs 😛

    Thanks for taking the time.

  • Here's another way (based on the original sample data with two table variables) that uses dynamic TSQL. Of course, all the usual caveats with such things apply (protecting against injection, etc.); this is just to illustrate another technique:

    DECLARE
    @CASE VARCHAR(MAX)=' CASE '
    ,@FROM VARCHAR(MAX)=' FROM (SELECT cWord FROM #temp WHERE cWord=cSyn) t '
    ,@ORDER VARCHAR(MAX)=''
    ,@SQL VARCHAR(MAX)
    ;

    IF OBJECT_ID('tempdb..#temp') IS NOT NULL DROP TABLE #temp;

    SELECT cWord, cSyn=cWord
    INTO #temp
    FROM @tKeywords
    UNION ALL
    SELECT cWord, cSyn
    FROM @tSynonyms ts
    WHERE EXISTS (SELECT 0 FROM @tKeywords tk WHERE ts.cWord=tk.cWord)
    ;

    SELECT @CASE=@CASE+'WHEN t.cWord='''+cWord+''' THEN '+QUOTENAME(cWord)+'.cSyn ',
       @FROM=@FROM+'OUTER APPLY (SELECT cSyn FROM #temp WHERE cWord='''+cWord+''') '+QUOTENAME(cWord)+' ',
       @ORDER=@ORDER+', '+QUOTENAME(cWord)+'.cSyn'
    FROM @tKeywords;

    SELECT @CASE=@CASE+' END ',
       @ORDER=STUFF(@ORDER,1,1,'');

    SELECT @SQL='SELECT group_id=DENSE_RANK() OVER (ORDER BY '+@ORDER+'), cWord='+@CASE+@FROM+' ORDER BY group_id ASC, t.cWord ASC';

    EXEC(@SQL);

    Cheers!

  • Jacob Wilkins - Wednesday, July 12, 2017 3:50 PM

    Here's another way (based on the original sample data with two table variables) that uses dynamic TSQL. Of course, all the usual caveats with such things apply (protecting against injection, etc.); this is just to illustrate another technique:

    DECLARE
    @CASE VARCHAR(MAX)=' CASE '
    ,@FROM VARCHAR(MAX)=' FROM (SELECT cWord FROM #temp WHERE cWord=cSyn) t '
    ,@ORDER VARCHAR(MAX)=''
    ,@SQL VARCHAR(MAX)
    ;

    IF OBJECT_ID('tempdb..#temp') IS NOT NULL DROP TABLE #temp;

    SELECT cWord, cSyn=cWord
    INTO #temp
    FROM @tKeywords
    UNION ALL
    SELECT cWord, cSyn
    FROM @tSynonyms ts
    WHERE EXISTS (SELECT 0 FROM @tKeywords tk WHERE ts.cWord=tk.cWord)
    ;

    SELECT @CASE=@CASE+'WHEN t.cWord='''+cWord+''' THEN '+QUOTENAME(cWord)+'.cSyn ',
       @FROM=@FROM+'OUTER APPLY (SELECT cSyn FROM #temp WHERE cWord='''+cWord+''') '+QUOTENAME(cWord)+' ',
       @ORDER=@ORDER+', '+QUOTENAME(cWord)+'.cSyn'
    FROM @tKeywords;

    SELECT @CASE=@CASE+' END ',
       @ORDER=STUFF(@ORDER,1,1,'');

    SELECT @SQL='SELECT group_id=DENSE_RANK() OVER (ORDER BY '+@ORDER+'), cWord='+@CASE+@FROM+' ORDER BY group_id ASC, t.cWord ASC';

    EXEC(@SQL);

    Cheers!

    Brilliant work! This is excellent! 
    I spent a couple hours trying to come up with a more set based way to do this - it looks really simple but I ended up pulling my hair out.

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • @alan.B: Thank you sir! Coming from you that's especially high praise 🙂

    I absolutely agree; this is a deceptively difficult exercise, and I spent a couple hours last night trying to find something better as well.

    I did come up with one other possibility, based on a little math trick to make sure the groups are distributed correctly.

    In the first version I used CTEs to do all the required aggregations, but some of them are just painful to do that way.

    I switched to using a series of temp tables instead, which makes it much faster, especially since the running product can be calculated using a quirky update style approach (the order in which it's calculated doesn't actually matter).

    In my tests with >100 synonyms for some of the keywords, it generally outperforms the dynamic TSQL approach, but it's a lot more bizarre 🙂

    Basically, the idea is to count the synonyms for each keyword (including the keyword itself). Then, you number each synonym within a cWord group.

    Then, you calculate a running product of the count per keyword, which is used to give both the running product for each cWord and the total number of combinations.

    Once you have all that, you can use a tally to generate a possible row for each synonym/group_id combination, and distribute synonyms into groups correctly by requiring that row_number-1=((group_id-1)/(total_combinations/running_product))%count

    Essentially you're splitting the possibility space into buckets, So, let's say you have just 2 keywords with 2 possible values each (the keyword itself and one synonym).

    For the first cWord (in arbitrary order), the running product is 2. For the second cWord, the running product is 4, which is also the total number of combinations.

    The first cWord is then split into buckets of size 2  (total_combinations/running_product) over which its values cycle. So, group_ids 1 and 2 would have the same value, and group_ids 3 and 4 would have the other value.

    The second cWord is split into buckets of size 1 (total_combinations/running_product) over which its values cycle. By essentially making sure that each successive cWord cycles through its values within a single cycle of the previous cWord (again, in arbitrary order), you make sure you get all combinations.

    I still prefer the dynamic TSQL approach, honestly, but this one was more fun to write 🙂

    IF OBJECT_ID('tempdb..#unified') IS NOT NULL DROP TABLE #unified;

    CREATE TABLE #unified (rn BIGINT, cWord VARCHAR(255), cSyn VARCHAR(255));

    INSERT INTO #unified
    SELECT rn=ROW_NUMBER() OVER (PARTITION BY cWord ORDER BY cSyn)-1, cWord, cSyn
    FROM
    (
    SELECT cWord, cSyn=cWord
    FROM @tKeywords
    UNION ALL
    SELECT cWord, cSyn
    FROM @tSynonyms ts
    )x
    ;

    IF OBJECT_ID('tempdb..#prods') IS NOT NULL DROP TABLE #prods;

    SELECT cWord, cnt=COUNT(*),moving_prod=0 INTO #prods FROM #unified GROUP BY cWord;

    DECLARE @agg_prod INT=1;

    UPDATE #prods SET @agg_prod=moving_prod=cnt*@agg_prod OPTION (MAXDOP 1);

    IF OBJECT_ID('tempdb..#some_aggs') IS NOT NULL DROP TABLE #some_aggs;

    CREATE TABLE #some_aggs (cWord VARCHAR(255), cnt BIGINT, moving_prod BIGINT, marker BIGINT, N BIGINT);

    INSERT INTO #some_aggs SELECT cWord, cnt, moving_prod, marker=((tn.N-1)/(@agg_prod/moving_prod))%cnt,N FROM #prods p CROSS JOIN dbo.TallyN(@agg_prod) tn;

    SELECT group_id=s.N, cWord=u.cSyn
    FROM #unified u
     INNER JOIN
     #some_aggs s ON u.cWord=s.cWord AND u.rn=s.marker
    ORDER BY s.N ASC, u.cWord ASC
    OPTION (RECOMPILE);

Viewing 12 posts - 1 through 11 (of 11 total)

You must be logged in to reply to this topic. Login to reply