Count Words in Text Field and/or Column (frequency of use)

  • I searched the archives for this and could not find a solution.

    I have a comment table and would like to get a word count.

    So for example, if comment #1 was red dress and comment #2 was purple dress, the result would be:

    1 red

    1 purple

    2 dress

    Performance is not a concern. Thanks.

  • I haven't done this myself, but I would probably try to apply the concepts behind splitting strings using a tally table that Jeff Moden explains in this article: http://www.sqlservercentral.com/articles/T-SQL/62867/ You can do the same thing with loops (Jeff's article includes examples of this), but the tally table approach is much more efficient. I would probably try to use the method towards the end of the article for splitting strings on commas and tweak it to split on spaces and punctuation (in order to identify words), then do a SELECT word, COUNT(word)...GROUP BY word from the result set.

  • I searched the archives for this and could not find a solution.

    I have a comment table and would like to get a word count.

    Are you aware of SQL Server's Full Text Search capabilities. If not may I suggest starting with reading:

    http://msdn.microsoft.com/en-us/library/ms142571(v=SQL.100).aspx

    Or read this excellent article here on SSC

    http://www.sqlservercentral.com/articles/SS2K5+-+Full+Text+Searching/3054/

    If everything seems to be going well, you have obviously overlooked something.

    Ron

    Please help us, help you -before posting a question please read[/url]
    Before posting a performance problem please read[/url]

  • Here's a quick stab at doing this with a tally table. Note that to run this code, you need to first have created the tally table using the code in the article I linked above.

    DECLARE @Sentence VARCHAR(7998);

    SET @Sentence = 'In many of the more relaxed civilizations on the Outer Eastern Rim of the Galaxy, ' +

    'the Hitchhiker''s Guide has already supplanted the great Encyclopaedia Galactica ' +

    'as the standard repository of all knowledge and wisdom, for though it has many omissions ' +

    'and contains much that is apocryphal, or at least wildly inaccurate, it scores over the older, ' +

    'more pedestrian work in two important respects. First, it is slightly cheaper; ' +

    'and secondly it has the words DON''T PANIC inscribed in large friendly letters on its cover.';

    DECLARE @Words TABLE

    (

    Word VARCHAR(8000)

    )

    SET @Sentence = ' ' + @Sentence + ' ';

    INSERT INTO @Words

    SELECT SUBSTRING(@Sentence,N+1,CHARINDEX(' ',@Sentence,N+1)-N-1)

    FROM dbo.Tally

    WHERE N < LEN(@Sentence)

    AND SUBSTRING(@Sentence,N,1) = ' ';

    SELECT Word, COUNT(Word) AS WordCount

    FROM @Words

    GROUP BY Word

    Note that this isn't very robust, it has problems with punctuation, multiple adjacent spaces, etc. But that should be easy enough to fix.

  • texpic (12/3/2011)


    I searched the archives for this and could not find a solution.

    I have a comment table and would like to get a word count.

    So for example, if comment #1 was red dress and comment #2 was purple dress, the result would be:

    1 red

    1 purple

    2 dress

    Performance is not a concern. Thanks.

    What's the maximum number of bytes you expect in any given comment?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • bitbucket-25253 (12/4/2011)


    I searched the archives for this and could not find a solution.

    I have a comment table and would like to get a word count.

    Are you aware of SQL Server's Full Text Search capabilities. If not may I suggest starting with reading:

    http://msdn.microsoft.com/en-us/library/ms142571(v=SQL.100).aspx

    Or read this excellent article here on SSC

    http://www.sqlservercentral.com/articles/SS2K5+-+Full+Text+Searching/3054/

    It's been a while since I looked at Full Text Search in SQL Server but my memory is that it's, ummmm... not as robust as it could be. IIRC, I don't believe that it'll actually help with this type of problem (histogram of words). It doesn't even do an explicit proximity search.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • It is a varchar(5500). There are 300,000 or so comments (rows).

  • That tally table is neat. I'm going to try and figure out how to apply this to my table. It is about 300,000 rows.

  • I just realized you need to summarize word count across multiple rows, so what I posted would need some reworking to be effective for that (although it should be doable, I think). I'm interested to see what Jeff comes up with...

  • texpic (12/4/2011)


    That tally table is neat. I'm going to try and figure out how to apply this to my table. It is about 300,000 rows.

    Then you want the function found in the "Resources" attachements of the following article. It's an 8k splitter on steroids which runs faster than any other splitter except for the CLR splitter.

    The problem with splitting on "white" space is that punctuation and other nuances such as parenthesis can get into the way of splitting actual words. My recommendation is to split on the white space and remove the punctuation later after the aggregations have occurred. Then reaggregate.

    ...OR...

    Modify the splitter function to use LIKE '%[^-A-Z0-9]%' so it'll split on anything but letters, numbers, and dashes.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff, which areticle has resources? Thanks.

  • In case Jeff Moden does not see your request, I believe this is the article he was referring to:

    http://www.sqlservercentral.com/articles/Tally+Table/72993/

    The articles title is:

    Tally OH! An Improved SQL 8K “CSV Splitter” Function

    By Jeff Moden, 2011/05/02

    The resources links are at the very end of the article.

    If everything seems to be going well, you have obviously overlooked something.

    Ron

    Please help us, help you -before posting a question please read[/url]
    Before posting a performance problem please read[/url]

  • bitbucket-25253 (12/6/2011)


    In case Jeff Moden does not see your request, I believe this is the article he was referring to:

    http://www.sqlservercentral.com/articles/Tally+Table/72993/

    The articles title is:

    Tally OH! An Improved SQL 8K “CSV Splitter” Function

    By Jeff Moden, 2011/05/02

    The resources links are at the very end of the article.

    Oh wow, I hadn't seen that article before! I guess somehow I'd only seen the original article. That is really cool.

  • texpic (12/6/2011)


    Jeff, which areticle has resources? Thanks.

    My apologies. I forgot to post the link. Good thing there's good people to cover for me. Bit Bucket posted the correct link.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • bitbucket-25253 (12/6/2011)


    In case Jeff Moden does not see your request, I believe this is the article he was referring to:

    http://www.sqlservercentral.com/articles/Tally+Table/72993/

    The articles title is:

    Tally OH! An Improved SQL 8K “CSV Splitter” Function

    By Jeff Moden, 2011/05/02

    The resources links are at the very end of the article.

    Outstanding! Thanks for the cover, Ron! :blush:

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 15 posts - 1 through 15 (of 32 total)

You must be logged in to reply to this topic. Login to reply