Grouping the words which has maximum count

  • Hi All,

    I have to highest two word in the columns. It should have the maximum two words combination.  I have attached the excel for reference.

     

     

    Attachments:
    You must be logged in to view attached files.
  • Is it possible to have an sql-script posted as code to reproduce your question so we don't have to open a possible dangerous excelfile

  • This is my SQL Query to get the single words. I need two words grouping

     

    SELECT value AS keyword, COUNT(*) AS keyword_count

    FROM (

    SELECT value

    FROM [title_one_space]

    CROSS APPLY STRING_SPLIT([title], ' ')

    ) keywords

    GROUP BY value

    ORDER BY keyword_count DESC;

  • This is the column

  • Need the result like this

     

  • We cannot dump an image of your data in SSMS to test any code.

    So, I made up some dummy data for myself.

    The following code appears to do what you are looking for (NOTE the extra parameter in STRING_SPLIT)

    WITH cteTwoWords AS (
    SELECT TwoWords = ss.value + ' ' + LEAD(ss.value) OVER (PARTITION BY td.[title] ORDER BY ss.ordinal)
    FROM [title_one_space] AS td
    CROSS APPLY STRING_SPLIT(td.[title], ' ', 1) AS ss
    )
    SELECT TwoWord_group = cte.TwoWords
    , TwoWord_count = COUNT(*)
    FROM cteTwoWords AS cte
    WHERE TwoWords IS NOT NULL
    GROUP BY TwoWords
    ORDER BY COUNT(*) DESC;
  • This is the error message I had got

     

    Msg 8144, Level 16, State 3, Line 17

    Procedure or function STRING_SPLIT has too many arguments specified.

    Completion time: 2023-05-03T21:27:20.2644947+05:30

  • That error seems to indicate you're not using SQL Server 2022. (Which is probably why DesNorton wrote " (NOTE the extra parameter in STRING_SPLIT)"

    Are you on SQL Server 2022? You posted in a 2022 forum.

  • As ratbak indicated, you have posted in a SQL 2022 forum.  As such, my proposed solution uses functionality that is available in SQL 2022.

    If you are using SQL 2012 to 2019, then I suggest that you read Reaping the benefits of the Window functions in T-SQL by Eirikur Eiriksson.  Under the resources section, you will find a link to download the code for CREATE_FUNCTION_dbo_DelimitedSplit8K_LEAD, which you can install and use.

    My suggested solution is updated to use DelimitedSplit8K_LEAD

    WITH cteTwoWords AS (
    SELECT TwoWords = ss.Item + ' ' + LEAD(ss.Item) OVER (PARTITION BY td.[title] ORDER BY ss.ItemNumber)
    FROM [title_one_space] AS td
    CROSS APPLY dbo.DelimitedSplit8K_LEAD(td.[title], ' ') AS ss
    )
    SELECT TwoWord_group = cte.TwoWords
    , TwoWord_count = COUNT(*)
    FROM cteTwoWords AS cte
    WHERE TwoWords IS NOT NULL
    GROUP BY TwoWords
    ORDER BY COUNT(*) DESC;
  • If you are using an earlier version of SQL, then I suggest that you read Tally OH! An Improved SQL 8K “CSV Splitter” Function by Jeff Moden.  Under the resources section, you will find a link to download the code for the function, which you can install and use.  There is even a version that will work with nvarchar(4000) strings.

    My suggested solution is updated to use DelimitedSplit8K

    WITH cteTwoWords AS (
    SELECT TwoWords = ss.Item + ' ' + LEAD(ss.Item) OVER (PARTITION BY td.[title] ORDER BY ss.ItemNumber)
    FROM [title_one_space] AS td
    CROSS APPLY dbo.DelimitedSplit8K(td.[title], ' ') AS ss
    )
    SELECT TwoWord_group = cte.TwoWords
    , TwoWord_count = COUNT(*)
    FROM cteTwoWords AS cte
    WHERE TwoWords IS NOT NULL
    GROUP BY TwoWords
    ORDER BY COUNT(*) DESC;
  • This was the error message

    Msg 208, Level 16, State 1, Line 14

    Invalid object name 'dbo.DelimitedSplit8K'

     

  • sathishkm wrote:

    This was the error message

    Msg 208, Level 16, State 1, Line 14 Invalid object name 'dbo.DelimitedSplit8K'

    Did you download and install dbo.DelimitedSplit8K on your SQL instance?

  • An alternative

    drop table if exists #title_one_space;
    go
    select * into #title_one_space from (values
    ('one word')
    ,('one word two words')
    ,('one word two words three words')
    ,('one word two words three words four')) v(title);

    select x.two_word, count(*) two_word_count
    from #title_one_space tos
    cross apply dbo.fnTally(1, len(title)-len(replace(title, ' ', ''))) fn
    cross apply (select string_agg(splt.value, ' ')
    within group (order by splt.ordinal)
    from string_split(tos.title, ' ', 1) splt
    where splt.ordinal between fn.n and fn.n+1) x(two_word)
    group by x.two_word
    order by two_word_count desc;

    Aus dem Paradies, das Cantor uns geschaffen, soll uns niemand vertreiben können

  • sathishkm wrote:

    Hi All,

    I have to highest two word in the columns. It should have the maximum two words combination.  I have attached the excel for reference.

    So which word pairs do you want returned from...

    research and operations analyst

    To me, it seems that there are two 2 word titles there... Research Analyst and Operations Analyst your original data is fraught with such things.

    Or, is that not important?  And how do you want to treat the word "and"???

    OR... is what you're really asking for is to pair every combination of individual unique words in a job title and then do a count? (I still recommend removing the word "and" there).

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Msg 8144, Level 16, State 3, Line 37

    Procedure or function string_split has too many arguments specified.

    Completion time: 2023-05-08T15:45:39.2855074+05:30

Viewing 15 posts - 1 through 15 (of 18 total)

You must be logged in to reply to this topic. Login to reply