t-sql 2012 parse out multiple values in one field that is varchar(1200)

  • In a t-sql 2012 database, I am suppose to count the number of times each 'canned' message is used for
    Elementary students in the last school year and the current school  year.

    Right now I have the following sql that kind of works:

    USE TEST
      SELECT GS.Comments,Count(*) AS [Counts]
      FROM dbo.Enrol Enrol
      JOIN dbo.Student Student
          ON Student.StudentID = Enrol.StudentID
      JOIN dbo.GS GS
          ON GS.StudentID = Student.Studentid
           AND (GS.Comments IS NOT NULL)
        AND (GS.Comments <> '')
      where Enrol.grade in ('KG','01','02','03','04','05','06')
      and Enrol.endYear between 2016 and 2017
      Group by GS.Comments
      order by  Counts desc,GS.Comments asc

    The problem is the GS.Comments field is defined as varchar(1200). There can be one message in the field and/or there can be lots of messages in this field. Each message ends with a period and there is a space between each message.

    An example of multiple messages in the one GS.Comments field would look like the following:

    The student is trying hard and needs to make their time more efficiently. This student is good at math. This student turns in their assignments on time. This student seems to enjoy school.
    An example of when one messages is in  the one GS.Comments field would look like the following:

    This student seems to enjoy school.

    Thus would showe me the t-sql 2012 logic that I can use when the GS.Comments field contains multiple messages and/or just one message so that I can count the number of times each unique message has been used?

  • What do you consider "unique messages"?  Do you mean something like "common sentences"?  I would probably use Jeff's splitter function and write the results to at least a temporary table (better a permanent indexed table), and then this is trivial.  Read Jeff Moden's article on DelimitedSplit8K.

    You should know this by now, because everyone says it, but some sample data would make everyone's life a lot easier.  Doesn't have to be real, just representative.  CREATE TABLE script(s), INSERT script(s)?

  • First for some sample data

    USE tempdb
    GO

    IF OBJECT_ID('tempdb.dbo.cannedPhrases') IS NOT NULL DROP TABLE dbo.CannedPhrases;
    IF OBJECT_ID('tempdb.dbo.GS') IS NOT NULL DROP TABLE dbo.GS;
    GO

    DECLARE
    @c1 varchar(100) = 'This student seems to enjoy school',
    @c2 varchar(100) = 'This student has so much potential',
    @c3 varchar(100) = 'He takes after his mother';

    SELECT Canned = x INTO dbo.CannedPhrases FROM (VALUES (@c1), (@c2), (@c3)) x(x);
    SELECT CommentId = ROW_NUMBER() OVER (ORDER BY (SELECT 1)), Comment = x
    INTO dbo.GS
    FROM
    (
    SELECT REPLICATE('Blah blah blah', 10)+@c1+REPLICATE('Blah blah blah', 5)  UNION ALL
    SELECT @c2+REPLICATE('Blah blah blah', 20)+@c2+REPLICATE('Blah blah ', 5)  UNION ALL
    SELECT REPLICATE('Blah blah blah', 12)+@c3+REPLICATE('Blah blah blah', 5)+@c1 UNION ALL
    SELECT REPLICATE('Blah blah blah', 100)             UNION ALL
    SELECT REPLICATE('Blah blah blah', 12)+@c1+'blah...'+@c2
    ) x(x);

    Note that I stuffed 7 canned phrases in there. Next you need to get yourself a copy of NGrams8K. Here's an example of how to use NGrams8K to identify the "canned phrases and where they live in the string".

    SELECT CommentId, Phrase, position
    FROM
    (
    SELECT canned, phraseLen = LEN(canned)
    FROM dbo.CannedPhrases
    ) AS Canned(Phrase, PhraseLen)
    CROSS JOIN dbo.GS
    CROSS APPLY dbo.ngrams8K(GS.Comment, PhraseLen)
    WHERE token = Phrase;

    This returns:

    CommentId Phrase           position
    ---------- -------------------------------------- --------------------
    1    This student seems to enjoy school  141
    3    This student seems to enjoy school  264
    5    This student seems to enjoy school  169
    2    This student has so much potential  1
    2    This student has so much potential  315
    5    This student has so much potential  210
    3    He takes after his mother            169

    And to get your counts:

    SELECT Comment, Counts = COUNT(*)
    FROM
    (
    SELECT canned, phraseLen = LEN(canned)
    FROM dbo.CannedPhrases
    ) AS Canned(Phrase, PhraseLen)
    CROSS JOIN dbo.GS
    CROSS APPLY dbo.ngrams8K(GS.Comment, PhraseLen)
    WHERE token = Phrase
    GROUP BY Comment;
    --GROUP BY ALL Comment;
    -- Note: use the depreciated GROUP BY ALL to get comments with a 0 count

    It's that easy.

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • wendy elizabeth - Friday, May 19, 2017 3:44 PM

    In a t-sql 2012 database, I am suppose to count the number of times each 'canned' message is used for
    Elementary students in the last school year and the current school  year.

    Right now I have the following sql that kind of works:

    USE TEST
      SELECT GS.Comments,Count(*) AS [Counts]
      FROM dbo.Enrol Enrol
      JOIN dbo.Student Student
          ON Student.StudentID = Enrol.StudentID
      JOIN dbo.GS GS
          ON GS.StudentID = Student.Studentid
           AND (GS.Comments IS NOT NULL)
        AND (GS.Comments <> '')
      where Enrol.grade in ('KG','01','02','03','04','05','06')
      and Enrol.endYear between 2016 and 2017
      Group by GS.Comments
      order by  Counts desc,GS.Comments asc

    The problem is the GS.Comments field is defined as varchar(1200). There can be one message in the field and/or there can be lots of messages in this field. Each message ends with a period and there is a space between each message.

    An example of multiple messages in the one GS.Comments field would look like the following:

    The student is trying hard and needs to make their time more efficiently. This student is good at math. This student turns in their assignments on time. This student seems to enjoy school.
    An example of when one messages is in  the one GS.Comments field would look like the following:

    This student seems to enjoy school.

    Thus would showe me the t-sql 2012 logic that I can use when the GS.Comments field contains multiple messages and/or just one message so that I can count the number of times each unique message has been used?

    Do you have a table where the "canned" comments are stored?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Yes there is a table where the canned statements are stored at. The problem is the unique identifier stored in the canned table are not passed along to the table where the statements are actually entered. One of my goals is to only find the statements that exist in the canned table and the table where the canned messages re entered. When the match occurs, I am suppose to count the number of occurences of where the canned messages are actually used.

  • The following additional items:
    1. Can you explain --GROUP BY ALL Comment;
    -- Note: use the depreciated GROUP BY ALL to get comments with a 0 count
    to me? What am I trying to accomplish here/
    2. All the values are stored in a canned table. Can you show me how to modify the sql listed above where the value is obtained from a canned table and not stuffed into a variables and the union all statement is used?
  • Got Google?

    http://sqlmag.com/t-sql/group-all

  • wendy elizabeth - Saturday, May 20, 2017 8:22 PM

    Yes there is a table where the canned statements are stored at. The problem is the unique identifier stored in the canned table are not passed along to the table where the statements are actually entered. One of my goals is to only find the statements that exist in the canned table and the table where the canned messages re entered. When the match occurs, I am suppose to count the number of occurences of where the canned messages are actually used.

    Understood and expected.  Can you post the CREATE TABLE for that canned messages table?  I believe I can show you a simple way to do this.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Thanks for showing me the sql:

    Here is the table with the canned 'comments' listed below:
    CREATE TABLE [dbo].[GradingCommentTemplate](
     [commentID] [int] IDENTITY(1,1) NOT FOR REPLICATION NOT NULL,
     [schoolID] [int] NULL,
     [comment] [varchar](200) NULL,
      [varchar](6) NULL,
     CONSTRAINT [PK_GradingCommentTemplate] PRIMARY KEY NONCLUSTERED
    (
     [commentID] ASC
    )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 97) ON [PRIMARY]
    ) ON [PRIMARY]

    GO

    SET ANSI_PADDING OFF
    GO
    ALTER TABLE [dbo].[GradingCommentTemplate] CHECK CONSTRAINT [FK_GradingCommentTemplate_School]

    ALTER TABLE [dbo].[GradingCommentTemplate]  WITH NOCHECK ADD  CONSTRAINT [FK_GradingCommentTemplate_School] FOREIGN KEY([schoolID])
    REFERENCES [dbo].[School] ([schoolID])
    ON DELETE CASCADE
    GO

    Notes:
    1. When I run the job, I will select the elementary schools by school id.
    2. It is a possibility that different elementary schools have the exact same message. For this request, I do not care if they have the same comments.
    3. The value for 'code' in the GradingCommentTemplate is not used anywhere in the database.
    4. The identity key of CommentID is only used in this table and does not refer to any other database tables. There are lots
    of tables in this database that have their commentID column with their own unique values.
    Here is a copy of the school table that is referenced above as a foreign key relationship:

    CREATE TABLE [dbo].[School](
     [schoolID] [int] IDENTITY(1,1) NOT FOR REPLICATION NOT NULL,
     [districtID] [int] NOT NULL,
     [ncesSchoolID] [varchar](9) NULL,
     [number] [varchar](7) NOT NULL,
     [name] [varchar](50) NOT NULL,
     [type] [varchar](5) NULL,
     [comments] [varchar](255) NULL,
     [address] [varchar](50) NULL,
     [city] [varchar](25) NULL,
     [state] [varchar](2) NULL,
     [zip] [varchar](10) NULL,
     [phone] [varchar](25) NULL
    ) ON [PRIMARY]

    GO

  • wendy elizabeth - Saturday, May 20, 2017 8:29 PM

    The following additional items:
    1. Can you explain --GROUP BY ALL Comment;
    -- Note: use the depreciated GROUP BY ALL to get comments with a 0 count
    to me? What am I trying to accomplish here/
    2. All the values are stored in a canned table. Can you show me how to modify the sql listed above where the value is obtained from a canned table and not stuffed into a variables and the union all statement is used?

    Sorry for the later reply - busy few days. I re-read your requirement and, if I understand it correctly, you are looking to count the times that phrase is used. My original solution counted how many time they appear in a comment. 

    Ignore my GROUP BY ALL comment; it was a quick little side-note that won't really help you solve this. Regarding the values being stored in a canned table. The code that I posted was intended to emulate what you described. Let me re-post a simplified version. 


    IF OBJECT_ID('dbo.cannedPhrases') IS NOT NULL DROP TABLE dbo.CannedPhrases;
    IF OBJECT_ID('dbo.comments') IS NOT NULL DROP TABLE dbo.comments;
    GO

    DECLARE
    @c1 varchar(100) = 'This student seems to enjoy school',
    @c2 varchar(100) = 'This student has so much potential',
    @c3 varchar(100) = 'He takes after his mother';

    SELECT Canned = x
    INTO dbo.CannedPhrases
    FROM (VALUES (@c1), (@c2), (@c3)) x(x);

    SELECT CommentId = ROW_NUMBER() OVER (ORDER BY (SELECT 1)), Comment = x
    INTO dbo.comments
    FROM
    (
    SELECT 'Blah blah blah. '+@c1+'. Blah blah blah.'   UNION ALL
    SELECT @c2+'. Blah blah blah. '+@c2+'. Blah blah blah.' UNION ALL
    SELECT 'Blah blah blah. '+@c3+'. Blah blah blah. '+@c1 UNION ALL
    SELECT 'Blah blah blah.'           UNION ALL
    SELECT 'Blah blah blah. '+@c1+'. blah. '+@c2
    ) x(x);

    Now we have these two tables.

    Notice that a canned phrase appears once in commentId 1, twice in commentid 2, 3 and 5.  The phrase "He takes after his mother" appears once, the other two appear three times. 

    Here's two solutions, one which uses NGrams8K and the other that uses DelimitedSplit8K_LEAD


    -- "Splitter" solution
    SELECT canned, occurances = COUNT(*)
    FROM dbo.comments c
    CROSS APPLY dbo.delimitedsplit8K_lead(REPLACE(Comment, '. ', char(0)), char(0)) s
    JOIN dbo.CannedPhrases cp ON cp.Canned = s.Item
    GROUP BY Canned;

    -- NGrams8K solution
    SELECT canned, occurances = COUNT(*)
    FROM
    (
    SELECT canned, phraseLen = LEN(canned)
    FROM dbo.CannedPhrases
    ) AS Canned(canned, PhraseLen)
    CROSS JOIN dbo.comments
    CROSS APPLY dbo.ngrams8K(Comment, PhraseLen)
    WHERE token = canned
    GROUP BY canned;

    Each solution produces:

    I think this is what you're looking for; or will at least get you closer.

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • Where can I access NGrams8K or DelimitedSplit8K in the sql server 2012 database?
    I want to use either NGrams8K or DelimitedSplit8K as you suggested but I cannot find them in the databases I work with.
    I do not see these items in the 'Custom' databases  that I am working with.
    I do not see these objects in the master, model, msdb, or tempdb databases probably since I do not have access for everything.

    I see NGrams8K is an inline tabled valued function. Thus I was looking at table valued functions and I cannot find NGrams8K thnere.

    DelimitedSplit8K is some code in the link that you pointed me to. Thus is this code that I need to setup from the link you gave to me?

    Thus can you tell me how I can access NGrams8K or DelimitedSplit8K inthie sql server 2012 database that I have aqccess to?

  • wendy elizabeth - Monday, May 22, 2017 3:58 PM

    Where can I access NGrams8K or DelimitedSplit8K in the sql server 2012 database?
    I want to use either NGrams8K or DelimitedSplit8K as you suggested but I cannot find them in the databases I work with.
    I do not see these items in the 'Custom' databases  that I am working with.
    I do not see these objects in the master, model, msdb, or tempdb databases probably since I do not have access for everything.

    I see NGrams8K is an inline tabled valued function. Thus I was looking at table valued functions and I cannot find NGrams8K thnere.

    DelimitedSplit8K is some code in the link that you pointed me to. Thus is this code that I need to setup from the link you gave to me?

    Thus can you tell me how I can access NGrams8K or DelimitedSplit8K inthie sql server 2012 database that I have aqccess to?

    go to the articles that were underlined
    Here's two solutions, one which uses NGrams8K and the other that uses DelimitedSplit8K_LEAD

    for each article....I strongly suggest you read them...and then download the code from the "resources" at the bottom of the articles

    ________________________________________________________________
    you can lead a user to data....but you cannot make them think
    and remember....every day is a school day

  • Jeff Moden - Sunday, May 21, 2017 11:22 AM

    wendy elizabeth - Saturday, May 20, 2017 8:22 PM

    Yes there is a table where the canned statements are stored at. The problem is the unique identifier stored in the canned table are not passed along to the table where the statements are actually entered. One of my goals is to only find the statements that exist in the canned table and the table where the canned messages re entered. When the match occurs, I am suppose to count the number of occurences of where the canned messages are actually used.

    Understood and expected.  Can you post the CREATE TABLE for that canned messages table?  I believe I can show you a simple way to do this.

    Looks like a couple of folks beat me to it.  You posting that table helped them very well, though.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 13 posts - 1 through 13 (of 13 total)

You must be logged in to reply to this topic. Login to reply