'Spell check' a string-valued field

  • Hello all,

    We have a field in a table that's a string-valued field. In this field are (mainly) what I'll call 'tags'. These tags are pre-defined by our application, and there are quite a few. They can be arranged in any order in the field, they can use one or all or anywhere in between. [example: you have 3 tags called <remarks>,<boolresult>,<name>. this field can have '<boolresult><remarks>', '<name><boolresult>', '<remarks>', etc] The field can also include other characters such as colons, commas, etc.

    We've found that some of these tags are misspelled, thus breaking the application. The problem is that it's not feasible to have people sift through them all manually as there are thousands upon thousands of records. 

    Is there a way to 'spell check' this field against what is known to be correct? I personally can't think of any, except maybe using a LIKE in a where clause with just about every single possible incorrect spelling, order of tags, and combination (which isn't feasible, either). Am I crazy to think there may be a way to do this?

    Thanks!

  • scarr030 - Tuesday, March 14, 2017 10:49 AM

    Hello all,

    We have a field in a table that's a string-valued field. In this field are (mainly) what I'll call 'tags'. These tags are pre-defined by our application, and there are quite a few. They can be arranged in any order in the field, they can use one or all or anywhere in between. [example: you have 3 tags called <remarks>,<boolresult>,<name>. this field can have '<boolresult><remarks>', '<name><boolresult>', '<remarks>', etc] The field can also include other characters such as colons, commas, etc.

    We've found that some of these tags are misspelled, thus breaking the application. The problem is that it's not feasible to have people sift through them all manually as there are thousands upon thousands of records. 

    Is there a way to 'spell check' this field against what is known to be correct? I personally can't think of any, except maybe using a LIKE in a where clause with just about every single possible incorrect spelling, order of tags, and combination (which isn't feasible, either). Am I crazy to think there may be a way to do this?

    Thanks!

    This might be possible, but I would need DDL, sample data and expected results to give an answer. Check my signature on guidance on how to post this correctly.

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • Heh... spell checking tags.  There's another good reason to avoid XML if I ever heard one. 😉

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • There is, however, a way to do this.  I forget the exact code but if you open up XML as an XML Document, you can easily derive all of the tag names from the XML as a table and compare it to a list of authorized tag names.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Luis Cazares - Tuesday, March 14, 2017 10:55 AM

    scarr030 - Tuesday, March 14, 2017 10:49 AM

    Hello all,

    We have a field in a table that's a string-valued field. In this field are (mainly) what I'll call 'tags'. These tags are pre-defined by our application, and there are quite a few. They can be arranged in any order in the field, they can use one or all or anywhere in between. [example: you have 3 tags called <remarks>,<boolresult>,<name>. this field can have '<boolresult><remarks>', '<name><boolresult>', '<remarks>', etc] The field can also include other characters such as colons, commas, etc.

    We've found that some of these tags are misspelled, thus breaking the application. The problem is that it's not feasible to have people sift through them all manually as there are thousands upon thousands of records. 

    Is there a way to 'spell check' this field against what is known to be correct? I personally can't think of any, except maybe using a LIKE in a where clause with just about every single possible incorrect spelling, order of tags, and combination (which isn't feasible, either). Am I crazy to think there may be a way to do this?

    Thanks!

    This might be possible, but I would need DDL, sample data and expected results to give an answer. Check my signature on guidance on how to post this correctly.

    I hope I've provided enough data/info to help. I also hope I'm formatting this properly to help you help me. New to this, so apologies if I've made a mistake. The below is just a small sample of some of the data, which can be much more varied than what I've provided. Note that there are spelling errors such as '<REMAKR>' or '<BOOOLRESULT>'. 

    IF OBJECT_ID('TempDB..#TESTS','U') IS NOT NULL
       DROP TABLE #TESTS

    CREATE TABLE #TESTS (
    SENTENCE VARCHAR(255) NULL
    )

    INSERT INTO #TESTS (SENTENCE)
    SELECT ' <BOOLRESULT>. <REMARK>' UNION ALL
    SELECT ' <BOOLRESULT> <REMARK>' UNION ALL
    SELECT ' <RESULT> <REMARK>' UNION ALL
    SELECT ' via <RESULT> route. <REMARK>' UNION ALL
    SELECT ' <BOOLRESULT> <REMARK>' UNION ALL
    SELECT ' <BOOLRESULT> <REMARK>' UNION ALL
    SELECT ' <BOOLRESULT> <REMARK>' UNION ALL
    SELECT ' <BOOLRESULT> <REMARK>' UNION ALL
    SELECT ' <BOOOLRESULT> <REMARK>' UNION ALL
    SELECT ' <BOOLRESULT> <REMARK>' UNION ALL
    SELECT ' <BOLRESULT> <REMARK>' UNION ALL
    SELECT ' <BOOLRESULT>. <REMARK>.' UNION ALL
    SELECT ' <GENDER> <RESULT>. <REMARK>' UNION ALL
    SELECT ' <GENDER> has had <RESULT> in the past. <REMARK>.' UNION ALL
    SELECT ' <RESULT> <REMAKR>' UNION ALL
    SELECT ' <RESULT> <REMARK>' UNION ALL
    SELECT ' <RESULT> <REMARK>' UNION ALL
    SELECT ' <RESULT> <REARK>' UNION ALL
    SELECT ' <RESULT> <REMARK>' UNION ALL
    SELECT ' <RESULT> <REMARK>'

  • Jeff Moden - Tuesday, March 14, 2017 11:27 AM

    There is, however, a way to do this.  I forget the exact code but if you open up XML as an XML Document, you can easily derive all of the tag names from the XML as a table and compare it to a list of authorized tag names.

    I'm going to show my ignorance, here....

    How is XML involved? Admittedly, I really don't like working with XML in SQL because I'm terrible at it! Really, I just need to read about it and practice it. I've had basically no exposure to it.

  • scarr030 - Tuesday, March 14, 2017 12:17 PM

    Jeff Moden - Tuesday, March 14, 2017 11:27 AM

    There is, however, a way to do this.  I forget the exact code but if you open up XML as an XML Document, you can easily derive all of the tag names from the XML as a table and compare it to a list of authorized tag names.

    I'm going to show my ignorance, here....

    How is XML involved? Admittedly, I really don't like working with XML in SQL because I'm terrible at it! Really, I just need to read about it and practice it. I've had basically no exposure to it.

    I suspect because you referred to them as "tags" and from the formatting of the information (<tag name> is a standard way of representing tags in an XML file)

  • Ah! That makes sense. We can call them labels, if that's less confusing. There is no XML involved, as far as I'm aware. The labels/tags are used to input a value into a form template.

  • First, apologies for thinking it was XML.  Jason is correct.  I got the word "tags" and the <somestring> format stuck in my head.  Shifting gears to your problem...

    Try this please...

    First, here's a repeat of your test data.  I just squared it up for personal use and I added an extra row to demonstrate what happens if more than one label per "sentence" is bad.  I also added a "RowNum" column to the table just to make that final example more obvious and to ensure each row had something unique on it.

    --===== If the test table already exists, drop it to make reruns in SSMS easier.
         IF OBJECT_ID('tempdb..#Tests','U') IS NOT NULL
       DROP TABLE #Tests
    ;
     CREATE TABLE #Tests
            (
             RowNum     INT          IDENTITY(1,1) --Added this column
            ,Sentence   VARCHAR(255) NULL
            )
    ;
     INSERT INTO #Tests
            (Sentence)
     SELECT ' <BOOLRESULT>. <REMARK>'                           UNION ALL
     SELECT ' <BOOLRESULT> <REMARK>'                            UNION ALL
     SELECT ' <RESULT> <REMARK>'                                UNION ALL
     SELECT ' via <RESULT> route. <REMARK>'                     UNION ALL
     SELECT ' <BOOLRESULT> <REMARK>'                            UNION ALL
     SELECT ' <BOOLRESULT> <REMARK>'                            UNION ALL
     SELECT ' <BOOLRESULT> <REMARK>'                            UNION ALL
     SELECT ' <BOOLRESULT> <REMARK>'                            UNION ALL
     SELECT ' <BOOOLRESULT> <REMARK>'                           UNION ALL
     SELECT ' <BOOLRESULT> <REMARK>'                            UNION ALL
     SELECT ' <BOLRESULT> <REMARK>'                             UNION ALL
     SELECT ' <BOOLRESULT>. <REMARK>.'                          UNION ALL
     SELECT ' <GENDER> <RESULT>. <REMARK>'                      UNION ALL
     SELECT ' <GENDER> has had <RESULT> in the past. <REMARK>.' UNION ALL
     SELECT ' <RESULT> <REMAKR>'                                UNION ALL
     SELECT ' <RESULT> <REMARK>'                                UNION ALL
     SELECT ' <RESULT> <REMARK>'                                UNION ALL
     SELECT ' <RESULT> <REARK>'                                 UNION ALL
     SELECT ' <RESULT> <REMARK>'                                UNION ALL
     SELECT ' <RESULT> <REMARK>'                                UNION ALL
     SELECT ' <Wup>! Der it is! <Not Allowed> but <Result> is.' --Added this
    ;
    --===== Let's see what the test table looks like.
     SELECT *
       FROM #Tests
    ;

    The next thing you need is the "Swiss Army Knife" for T-SQL in the form of an iTVF (Inline Table Valued Function) known as a "Tally Table".  You can find out more about how it replaces a loop at the following URL.  http://www.sqlservercentral.com/articles/T-SQL/62867/

    Here's the code for it (and I'm working on a new one that will be a little bit faster and more flexible).

     CREATE FUNCTION [dbo].[fnTally]
    /**********************************************************************************************************************
     Purpose:
     Return a column of BIGINTs from @ZeroOrOne up to and including @MaxN with a max value of 1 Trillion.

     As a performance note, it takes about 00:02:10 (hh:mm:ss) to generate 1 Billion numbers to a throw-away variable.

     Usage:
    --===== Syntax example (Returns BIGINT)
     SELECT t.N
       FROM dbo.fnTally(@ZeroOrOne,@MaxN) t
    ;

     Notes:
     1. Based on Itzik Ben-Gan's cascading CTE (cCTE) method for creating a "readless" Tally Table source of BIGINTs.
        Refer to the following URLs for how it works and introduction for how it replaces certain loops.
        http://www.sqlservercentral.com/articles/T-SQL/62867/
        http://sqlmag.com/sql-server/virtual-auxiliary-table-numbers
     2. To start a sequence at 0, @ZeroOrOne must be 0 or NULL. Any other value that's convertable to the BIT data-type
        will cause the sequence to start at 1.
     3. If @ZeroOrOne = 1 and @MaxN = 0, no rows will be returned.
     5. If @MaxN is negative or NULL, a "TOP" error will be returned.
     6. @MaxN must be a positive number from >= the value of @ZeroOrOne up to and including 1 Billion. If a larger
        number is used, the function will silently truncate after 1 Billion. If you actually need a sequence with
        that many values, you should consider using a different tool. ;-)
     7. There will be a substantial reduction in performance if "N" is sorted in descending order.  If a descending
        sort is required, use code similar to the following. Performance will decrease by about 27% but it's still
        very fast especially compared with just doing a simple descending sort on "N", which is about 20 times slower.
        If @ZeroOrOne is a 0, in this case, remove the "+1" from the code.

        DECLARE @MaxN BIGINT;
         SELECT @MaxN = 1000;
         SELECT DescendingN = @MaxN-N+1
           FROM dbo.fnTally(1,@MaxN);

     8. There is no performance penalty for sorting "N" in ascending order because the output is explicity sorted by
        ROW_NUMBER() OVER (ORDER BY (SELECT NULL))

     Revision History:
     Rev 00 - Unknown     - Jeff Moden
            - Initial creation with error handling for @MaxN.
     Rev 01 - 09 Feb 2013 - Jeff Moden
            - Modified to start at 0 or 1.
     Rev 02 - 16 May 2013 - Jeff Moden
            - Removed error handling for @MaxN because of exceptional cases.
     Rev 03 - 22 Apr 2015 - Jeff Moden
            - Modify to handle 1 Trillion rows for experimental purposes.
    **********************************************************************************************************************/
            (@ZeroOrOne BIT, @MaxN BIGINT)
    RETURNS TABLE WITH SCHEMABINDING AS
     RETURN WITH
      E1(N) AS (SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
                SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
                SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
                SELECT 1)                                  --10^1 or 10 rows
    , E4(N) AS (SELECT 1 FROM E1 a, E1 b, E1 c, E1 d)      --10^4 or 10 Thousand rows
    ,E12(N) AS (SELECT 1 FROM E4 a, E4 b, E4 c)            --10^12 or 1 Trillion rows                
                SELECT N = 0 WHERE ISNULL(@ZeroOrOne,0)= 0 --Conditionally start at 0.
                 UNION ALL
                SELECT TOP(@MaxN) N = ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E12 -- Values from 1 to @MaxN
    ;

    Next, we need a reference table of allowed labels...

     CREATE TABLE dbo.AllowedLabel
            (
         Label varchar(100) NOT NULL PRIMARY KEY CLUSTERED
            )
    ;
    GO
     INSERT INTO dbo.AllowedLabel
            (Label)
     SELECT '<BOOLRESULT>'  UNION ALL
     SELECT '<GENDER>'      UNION ALL
     SELECT '<REMARK>'      UNION ALL
     SELECT '<RESULT>'
    ;

    ... and another iTVF, which uses the fnTally iTFV, to take care of the dirty work...

     CREATE FUNCTION dbo.SpellCheckLabels
    /**************************************************************************************************
     Purpose:
     Given a string with "labels" encapsulated as <string>, split the labels out of the string and check
     them for existance in the dbo.AllowedLabel table. This inherently does a "spell check" on the
     labels found. Any labels not found in the table will be returned along with their ordinal position
     (LabelNumber) and their substring position and length.

     Revision History:
     Rev 00 - 14 Mar 2017 - Jeff Moden
    **************************************************************************************************/
            (@pString VARCHAR(8000))
    RETURNS TABLE WITH SCHEMABINDING AS
     RETURN WITH
    cteSplit AS
    (--===== This splits the labels out and returns the positional information
     SELECT  LabelNumber    = ROW_NUMBER() OVER (ORDER BY t.N)
            ,Label          = SUBSTRING(@pString,t.N,CHARINDEX('>',@pString,t.N)-t.N+1)
            ,LabelPosition  = t.N
            ,LabelLength    = CHARINDEX('>',@pString,t.N)-t.N+1
       FROM dbo.fnTally(1,LEN(@pString)) t
      WHERE SUBSTRING(@pString,t.N,1) = '<'
    )--==== This checks the split-out labels against the table containing allowed labels
         -- and only returns only dis-allowed labels.
     SELECT LabelNumber, DisallowedLabel = split.Label, LabelPosition, LabelLength
       FROM cteSplit split
      WHERE NOT EXISTS (SELECT 1 FROM dbo.AllowedLabel allowed WHERE allowed.Label = Split.Label)
    ;

    And then the rest is easy...

     SELECT t.*, chk.*
       FROM #Tests t
      CROSS APPLY dbo.SpellCheckLabels(t.Sentence) chk
    ;

    Here's the result set, which is a list of the misspelled/illegal labels...
    [
    RowNum Sentence                                          LabelNumber DisallowedLabel LabelPosition LabelLength
    ------ ------------------------------------------------- ----------- --------------- ------------- -----------
    9       <BOOOLRESULT> <REMARK>                           1           <BOOOLRESULT>   2             13
    11      <BOLRESULT> <REMARK>                             1           <BOLRESULT>     2             11
    15      <RESULT> <REMAKR>                                2           <REMAKR>        11            8
    18      <RESULT> <REARK>                                 2           <REARK>         11            7
    21      <Wup>! Der it is! <Not Allowed> but <Result> is. 1           <Wup>           2             5
    21      <Wup>! Der it is! <Not Allowed> but <Result> is. 2           <Not Allowed>   20            13

    (6 row(s) affected)

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Just curious... is that what you were looking for?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden - Friday, March 17, 2017 8:28 PM

    Just curious... is that what you were looking for?

    Hi Jeff,

    At first glance, yes. It looks like what we need. I'm reading through it all (including your tally table article - very interesting, btw) and will try to implement it soon.

    Apologies for the delay. I was thrown into the midst of some real client-initiated BS that's consumed my life fore a couple of weeks. 

    Thank you very much, kind sir!

  • You're welcome.  It might sound a bit strange but (I have no life 😉 ) thank you for the interesting problem.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • I don't know if what Jeff posted was helpful (another vanishing OP) but this is certainly doable. You need to do two things:
    (1) Extract the element names from the XML into a table
    (2) Write or obtain a CLR that does spellchecking 

    The first part is extremely simple. 

    USE tempdb
    GO

    -- A table to store the names
    IF OBJECT_ID('dbo.XML_Tag') IS NOT NULL DROP TABLE dbo.XML_Tag;
    CREATE TABLE dbo.XML_Tag(eName varchar(100)) -- adjust size as needed
    GO

    -- Some XML
    DECLARE @someXML XML = '
    <animal>
    <cat>fluffy</cat>
    <dog>
      <nickname>spot</nickname>
    </dog>
    <fish>Billy</fish>
    </animal>';

    -- how to get the element names
    INSERT dbo.XML_Tag
    SELECT Element_Name = xx1.value('local-name(.)', 'varchar(100)')
    FROM (VALUES (@someXML)) x(xx)
    CROSS APPLY xx.nodes('//*') x1(xx1);
    -- Note that, with multiple namespaces, this logic will need to be slightly tweeked

    What we now have in dbo.XML_tag:

    eName
    ----------------------------------------------------------------------------------------------------
    animal
    cat
    dog
    nickname
    fish

    The second part looks pretty easy but I'm not a CLR guy so who knows. I know you can leverage MS Offices's spell-checker functionality from a CLR. I did a quick search and found stuff like this. Either way - extracting the element names is simple. If you get stuck on the second part you can create a new thread in the CLR forum. There's some solid CLR ninjas on SSC.

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • Alan.B - Thursday, March 30, 2017 8:49 PM

    I don't know if what Jeff posted was helpful (another vanishing OP) but this is certainly doable. You need to do two things:
    (1) Extract the element names from the XML into a table
    (2) Write or obtain a CLR that does spellchecking 

    You've made the same error that I did.  It's not, in any way, shape, or form, XML.  They're simply using <somestring> as labels within in non-XML string.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden - Friday, March 31, 2017 7:35 AM

    Alan.B - Thursday, March 30, 2017 8:49 PM

    I don't know if what Jeff posted was helpful (another vanishing OP) but this is certainly doable. You need to do two things:
    (1) Extract the element names from the XML into a table
    (2) Write or obtain a CLR that does spellchecking 

    You've made the same error that I did.  It's not, in any way, shape, or form, XML.  They're simply using <somestring> as labels within in non-XML string.

    Doh!

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

Viewing 15 posts - 1 through 14 (of 14 total)

You must be logged in to reply to this topic. Login to reply