How can I count the number of occurances of a string in an ntext column for each record?

  • Dear SQL Gurus:

    I have a table [tblHTML] with an ntext column [strPage] that has HTML in it.

    Each record's [strPage] field text should contain an equal number of '<table>' and '</table>' strings.

    But some do not. I need to find those records so I can fix them.

    I think what I need to do is:

    1. figure out how many occurrences of '<table>' and '</table>' there are in each record's [strPage].

    2. If the number of '<table>' strings does not equal the number of '</table>' strings, then return the id of that record as part of the recordset.

    I'm stuck on #1. I've looked for 3 hours now, and I haven't found anyone else with my exact issue - to return a count for each record of the number of occurrences of a string in an ntext field.

    I found some solutions that involve using LEN and REPLACE ( like this one) but those do not seem to work because "Argument data type ntext is invalid for argument 1 of replace function."

    Then I found a post here at SQL Sever Central and I eagerly made a number table to use with it, but then I realized it returns a total count for the column, not the individual records that contain the string.

    Can you please help me or point me in the right direction? Thank you!

  • You should really avoid using ntext, text and image datatypes and use nvarchar(max), varchar(max), and varbinary(max) instead http://msdn.microsoft.com/en-us/library/ms187993.aspx. I guess you have old database migrated from previous version of SQL Server.

    For mow you can convert ntext to nvarchar(max) before using string functions. However, I would create and use CLR function which would give way better performance for what you're trying to do.

    _____________________________________________
    "The only true wisdom is in knowing you know nothing"
    "O skol'ko nam otkrytiy chudnyh prevnosit microsofta duh!":-D
    (So many miracle inventions provided by MS to us...)

    How to post your question to get the best and quick help[/url]

  • This will get you 1 and 2:

    DROP TABLE #tblHTML

    CREATE TABLE #tblHTML (RowID INT, strPage ntext)

    INSERT INTO #tblHTML (RowID, strPage)

    SELECT

    1, REPLICATE(CAST('I have a table [tblHTML] with an ntext column [strPage] that has HTML in it ' AS VARCHAR(MAX)),25)

    + '<table>' + 'I have a table [tblHTML] with an ntext column [strPage] that has HTML in it ' + '</table>'

    + REPLICATE('I have a table [tblHTML] with an ntext column [strPage] that has HTML in it ',25)

    + '<table>' + 'I have a table [tblHTML] with an ntext column [strPage] that has HTML in it ' + '</table>'

    + REPLICATE('I have a table [tblHTML] with an ntext column [strPage] that has HTML in it ',25)

    + '<table>' + 'I have a table [tblHTML] with an ntext column [strPage] that has HTML in it ' + '</table>'

    + REPLICATE('I have a table [tblHTML] with an ntext column [strPage] that has HTML in it ',25)

    + '<table>' + 'I have a table [tblHTML] with an ntext column [strPage] that has HTML in it ' + '</table>'

    + REPLICATE('I have a table [tblHTML] with an ntext column [strPage] that has HTML in it ',25)

    + '<table>' + 'I have a table [tblHTML] with an ntext column [strPage] that has HTML in it ' + '</table>'

    + REPLICATE('I have a table [tblHTML] with an ntext column [strPage] that has HTML in it ',25)

    + '<table>' + 'I have a table [tblHTML] with an ntext column [strPage] that has HTML in it ' + '</table>'

    UNION ALL

    SELECT

    2, REPLICATE(CAST('I have a table [tblHTML] with an ntext column [strPage] that has HTML in it ' AS VARCHAR(MAX)),25)

    + '<table>' + 'I have a table [tblHTML] with an ntext column [strPage] that has HTML in it ' + '</table>'

    + REPLICATE('I have a table [tblHTML] with an ntext column [strPage] that has HTML in it ',25)

    + '<table>' + 'I have a table [tblHTML] with an ntext column [strPage] that has HTML in it ' + '</table>'

    + REPLICATE('I have a table [tblHTML] with an ntext column [strPage] that has HTML in it ',25)

    + '<table>' + 'I have a table [tblHTML] with an ntext column [strPage] that has HTML in it ' + 'xxxxxxxx'

    + REPLICATE('I have a table [tblHTML] with an ntext column [strPage] that has HTML in it ',25)

    + '<table>' + 'I have a table [tblHTML] with an ntext column [strPage] that has HTML in it ' + '</table>'

    + REPLICATE('I have a table [tblHTML] with an ntext column [strPage] that has HTML in it ',25)

    + '<table>' + 'I have a table [tblHTML] with an ntext column [strPage] that has HTML in it ' + '</table>'

    + REPLICATE('I have a table [tblHTML] with an ntext column [strPage] that has HTML in it ',25)

    + '<table>' + 'I have a table [tblHTML] with an ntext column [strPage] that has HTML in it ' + '</table>'

    SELECT *, LEN(CAST(strPage AS VARCHAR(MAX))) FROM #tblHTML

    ;WITH rCTE AS (

    SELECT RowID, q.strPage,

    lastStartpos = NULLIF(CHARINDEX('<table>',q.strPage,1),0),

    lastEndpospos = NULLIF(CHARINDEX('</table>',q.strPage,1),0)

    FROM #tblHTML

    CROSS APPLY (SELECT strPage = CAST(strPage AS VARCHAR(MAX))) q

    UNION ALL

    SELECT RowID, strPage,

    lastStartpos = NULLIF(CHARINDEX('<table>',strPage,lastStartpos+1),0),

    lastEndpospos = NULLIF(CHARINDEX('</table>',strPage,lastEndpospos+1),0)

    FROM rCTE

    WHERE NOT (lastStartpos IS NULL OR lastEndpospos IS NULL)

    )

    SELECT RowID, TagStarts = COUNT(lastStartpos), TagEnds = COUNT(lastEndpospos)

    FROM rCTE

    GROUP BY RowID

    HAVING COUNT(lastStartpos) <> COUNT(lastEndpospos)

    ORDER BY RowID

    OPTION(MAXRECURSION 0)

    β€œWrite the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • This appears to work:

    I've elected to show rows where both values are zero in case these are wrong - You can remove this if you want.

    I've set MAXRECURSION to 10, but you can increase this if necessary...

    IF OBJECT_ID('dbo.tblHTML') IS NOT NULL

    DROP TABLE dbo.tblHTML;

    CREATE TABLE [dbo].[tblHTML](

    [RowId] [int] NOT NULL,

    [strPage] [ntext] NULL

    ) ON [PRIMARY]

    insert into dbo.tblHTML values ( 1, '<table>www</table><table>xxx</table><table>yyy</table>zzz</table>' );

    insert into dbo.tblHTML values ( 2, '<table>xxx</table><table>yyy</table><table>zzz</table>' );

    insert into dbo.tblHTML values ( 3, '>xxx</table><table>yyy</table><table>zzz</table>' );

    insert into dbo.tblHTML values ( 4, '>xxx</table>' );

    insert into dbo.tblHTML values ( 5, '>xxx</table' );

    insert into dbo.tblHTML values ( 6, '<table>xxx</table>' );

    DECLARE @SearchHTMLStart Varchar(30), @SearchHTMLEnd Varchar(30);

    SET @SearchHTMLStart = '<table>';

    SET @SearchHTMLEnd = '</table>';

    WITH CTE_Start AS

    (

    SELECT RowId,

    Occurrence = CASE WHEN CHARINDEX ( @SearchHTMLStart ,[strPage] , 0 ) > 0 THEN 1 ELSE 0 END,

    Location = CHARINDEX ( @SearchHTMLStart ,[strPage] , 0 )

    FROM dbo.tblHTML

    UNION ALL

    SELECT h.RowId,

    Occurrence = CASE WHEN CHARINDEX ( @SearchHTMLStart ,h.[strPage] , c.Location +1 ) > 0 THEN 1 ELSE 0 END,

    Location = CHARINDEX ( @SearchHTMLStart ,h.[strPage] , c.Location +1 )

    FROM dbo.tblHTML h

    INNER JOIN CTE_Start c ON h.RowId = c.RowId

    WHERE Occurrence > 0

    ),

    CTE_End AS

    (

    SELECT RowId,

    Occurrence = CASE WHEN CHARINDEX ( @SearchHTMLEnd ,[strPage] , 0 ) > 0 THEN 1 ELSE 0 END,

    Location = CHARINDEX ( @SearchHTMLEnd ,[strPage] , 0 )

    FROM dbo.tblHTML

    UNION ALL

    SELECT h.RowId,

    Occurrence = CASE WHEN CHARINDEX ( @SearchHTMLEnd ,h.[strPage] , c.Location +1 ) > 0 THEN 1 ELSE 0 END,

    Location = CHARINDEX ( @SearchHTMLEnd ,h.[strPage] , c.Location +1 )

    FROM dbo.tblHTML h

    INNER JOIN CTE_End c ON h.RowId = c.RowId

    WHERE Occurrence > 0

    ),

    CTE_Start_Sum AS

    (

    SELECT s.RowId, SUM(s.Occurrence) as NumberStart

    FROM CTE_Start s

    GROUP BY s.RowId

    ),

    CTE_End_Sum AS

    (

    SELECT e.RowId, SUM(e.Occurrence) as NumberEnd

    FROM CTE_End e

    GROUP BY e.RowId

    )

    SELECT ISNULL(s.RowId, e.RowId) as RowId, NumberStart, NumberEnd

    FROM CTE_Start_Sum s

    INNER JOIN CTE_End_Sum e ON s.RowId = e.RowId

    WHERE (NumberStart <> NumberEnd) OR (NumberStart = 0 AND NumberEnd = 0)

    OPTION (MAXRECURSION 10);

  • i like all the examples so rar, but make sure you search for <table without the ending > tag.

    if the table has a class, a css style or even older style html, you will get misleading counts;

    <table border="1">

    <table class="main_css">

    Lowell


    --help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

  • laurie-789651 (10/1/2012)


    This appears to work:

    I've elected to show rows where both values are zero in case these are wrong - You can remove this if you want.

    I've set MAXRECURSION to 10, but you can increase this if necessary...

    IF OBJECT_ID('dbo.tblHTML') IS NOT NULL

    DROP TABLE dbo.tblHTML;

    CREATE TABLE [dbo].[tblHTML](

    [RowId] [int] NOT NULL,

    [strPage] [ntext] NULL

    ) ON [PRIMARY]

    insert into dbo.tblHTML values ( 1, '<table>www</table><table>xxx</table><table>yyy</table>zzz</table>' );

    insert into dbo.tblHTML values ( 2, '<table>xxx</table><table>yyy</table><table>zzz</table>' );

    insert into dbo.tblHTML values ( 3, '>xxx</table><table>yyy</table><table>zzz</table>' );

    insert into dbo.tblHTML values ( 4, '>xxx</table>' );

    insert into dbo.tblHTML values ( 5, '>xxx</table' );

    insert into dbo.tblHTML values ( 6, '<table>xxx</table>' );

    DECLARE @SearchHTMLStart Varchar(30), @SearchHTMLEnd Varchar(30);

    SET @SearchHTMLStart = '<table>';

    SET @SearchHTMLEnd = '</table>';

    WITH CTE_Start AS

    (

    SELECT RowId,

    Occurrence = CASE WHEN CHARINDEX ( @SearchHTMLStart ,[strPage] , 0 ) > 0 THEN 1 ELSE 0 END,

    Location = CHARINDEX ( @SearchHTMLStart ,[strPage] , 0 )

    FROM dbo.tblHTML

    UNION ALL

    SELECT h.RowId,

    Occurrence = CASE WHEN CHARINDEX ( @SearchHTMLStart ,h.[strPage] , c.Location +1 ) > 0 THEN 1 ELSE 0 END,

    Location = CHARINDEX ( @SearchHTMLStart ,h.[strPage] , c.Location +1 )

    FROM dbo.tblHTML h

    INNER JOIN CTE_Start c ON h.RowId = c.RowId

    WHERE Occurrence > 0

    ),

    CTE_End AS

    (

    SELECT RowId,

    Occurrence = CASE WHEN CHARINDEX ( @SearchHTMLEnd ,[strPage] , 0 ) > 0 THEN 1 ELSE 0 END,

    Location = CHARINDEX ( @SearchHTMLEnd ,[strPage] , 0 )

    FROM dbo.tblHTML

    UNION ALL

    SELECT h.RowId,

    Occurrence = CASE WHEN CHARINDEX ( @SearchHTMLEnd ,h.[strPage] , c.Location +1 ) > 0 THEN 1 ELSE 0 END,

    Location = CHARINDEX ( @SearchHTMLEnd ,h.[strPage] , c.Location +1 )

    FROM dbo.tblHTML h

    INNER JOIN CTE_End c ON h.RowId = c.RowId

    WHERE Occurrence > 0

    ),

    CTE_Start_Sum AS

    (

    SELECT s.RowId, SUM(s.Occurrence) as NumberStart

    FROM CTE_Start s

    GROUP BY s.RowId

    ),

    CTE_End_Sum AS

    (

    SELECT e.RowId, SUM(e.Occurrence) as NumberEnd

    FROM CTE_End e

    GROUP BY e.RowId

    )

    SELECT ISNULL(s.RowId, e.RowId) as RowId, NumberStart, NumberEnd

    FROM CTE_Start_Sum s

    INNER JOIN CTE_End_Sum e ON s.RowId = e.RowId

    WHERE (NumberStart <> NumberEnd) OR (NumberStart = 0 AND NumberEnd = 0)

    OPTION (MAXRECURSION 10);

    I'm informed it won't work properly:

    CHARINDEX won't work properly with ntext, you have to cast to VARCHAR(max), otherwise it stops working at 8000 characters

    Thanks ChrisM@Work πŸ™‚

  • laurie-789651 (10/1/2012)


    laurie-789651 (10/1/2012)


    This appears to work:

    I've elected to show rows where both values are zero in case these are wrong - You can remove this if you want.

    I've set MAXRECURSION to 10, but you can increase this if necessary...

    IF OBJECT_ID('dbo.tblHTML') IS NOT NULL

    DROP TABLE dbo.tblHTML;

    CREATE TABLE [dbo].[tblHTML](

    [RowId] [int] NOT NULL,

    [strPage] [ntext] NULL

    ) ON [PRIMARY]

    insert into dbo.tblHTML values ( 1, '<table>www</table><table>xxx</table><table>yyy</table>zzz</table>' );

    insert into dbo.tblHTML values ( 2, '<table>xxx</table><table>yyy</table><table>zzz</table>' );

    insert into dbo.tblHTML values ( 3, '>xxx</table><table>yyy</table><table>zzz</table>' );

    insert into dbo.tblHTML values ( 4, '>xxx</table>' );

    insert into dbo.tblHTML values ( 5, '>xxx</table' );

    insert into dbo.tblHTML values ( 6, '<table>xxx</table>' );

    DECLARE @SearchHTMLStart Varchar(30), @SearchHTMLEnd Varchar(30);

    SET @SearchHTMLStart = '<table>';

    SET @SearchHTMLEnd = '</table>';

    WITH CTE_Start AS

    (

    SELECT RowId,

    Occurrence = CASE WHEN CHARINDEX ( @SearchHTMLStart ,[strPage] , 0 ) > 0 THEN 1 ELSE 0 END,

    Location = CHARINDEX ( @SearchHTMLStart ,[strPage] , 0 )

    FROM dbo.tblHTML

    UNION ALL

    SELECT h.RowId,

    Occurrence = CASE WHEN CHARINDEX ( @SearchHTMLStart ,h.[strPage] , c.Location +1 ) > 0 THEN 1 ELSE 0 END,

    Location = CHARINDEX ( @SearchHTMLStart ,h.[strPage] , c.Location +1 )

    FROM dbo.tblHTML h

    INNER JOIN CTE_Start c ON h.RowId = c.RowId

    WHERE Occurrence > 0

    ),

    CTE_End AS

    (

    SELECT RowId,

    Occurrence = CASE WHEN CHARINDEX ( @SearchHTMLEnd ,[strPage] , 0 ) > 0 THEN 1 ELSE 0 END,

    Location = CHARINDEX ( @SearchHTMLEnd ,[strPage] , 0 )

    FROM dbo.tblHTML

    UNION ALL

    SELECT h.RowId,

    Occurrence = CASE WHEN CHARINDEX ( @SearchHTMLEnd ,h.[strPage] , c.Location +1 ) > 0 THEN 1 ELSE 0 END,

    Location = CHARINDEX ( @SearchHTMLEnd ,h.[strPage] , c.Location +1 )

    FROM dbo.tblHTML h

    INNER JOIN CTE_End c ON h.RowId = c.RowId

    WHERE Occurrence > 0

    ),

    CTE_Start_Sum AS

    (

    SELECT s.RowId, SUM(s.Occurrence) as NumberStart

    FROM CTE_Start s

    GROUP BY s.RowId

    ),

    CTE_End_Sum AS

    (

    SELECT e.RowId, SUM(e.Occurrence) as NumberEnd

    FROM CTE_End e

    GROUP BY e.RowId

    )

    SELECT ISNULL(s.RowId, e.RowId) as RowId, NumberStart, NumberEnd

    FROM CTE_Start_Sum s

    INNER JOIN CTE_End_Sum e ON s.RowId = e.RowId

    WHERE (NumberStart <> NumberEnd) OR (NumberStart = 0 AND NumberEnd = 0)

    OPTION (MAXRECURSION 10);

    I'm informed it won't work properly:

    CHARINDEX won't work properly with ntext, you have to cast to VARCHAR(max), otherwise it stops working at 8000 characters

    Thanks ChrisM@Work πŸ™‚

    You're welcome - your query works just fine if you introduce the CAST where necessary. Why am I writing this when you're sitting 3 feet away? How was your weekend? πŸ˜€

    β€œWrite the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • Solution incorporating the above suggestions (thank you):

    --====== TEST DATA =============

    IF OBJECT_ID('dbo.tblHTML') IS NOT NULL

    DROP TABLE dbo.tblHTML;

    CREATE TABLE [dbo].[tblHTML](

    [RowId] [int] NOT NULL,

    [strPage] [ntext] NULL

    ) ON [PRIMARY]

    insert into dbo.tblHTML values ( 1, '<table>www</table><table>xxx</table><table>yyy</table>zzz</table>' );

    insert into dbo.tblHTML values ( 2, '<table>xxx</table><table>yyy</table><table>zzz</table>' );

    insert into dbo.tblHTML values ( 3, '>xxx</table><table>yyy</table><table>zzz</table>' );

    insert into dbo.tblHTML values ( 4, '>xxx</table>' );

    insert into dbo.tblHTML values ( 5, '>xxx</table' );

    insert into dbo.tblHTML values ( 6, '<table>xxx</table>' );

    --========= SOLUTION Altered as recommended by ChrisM@Work ===============

    DECLARE @SearchHTMLStart Varchar(30), @SearchHTMLEnd Varchar(30);

    SET @SearchHTMLStart = '<table'; -- Leaving off trailing '>'

    SET @SearchHTMLEnd = '</table>';

    WITH CTE_Start AS

    (

    SELECT RowId,

    Occurrence = CASE WHEN CHARINDEX ( @SearchHTMLStart ,q.[strPage] , 0 ) > 0 THEN 1 ELSE 0 END,

    Location = CHARINDEX ( @SearchHTMLStart ,q.[strPage] , 0 )

    FROM dbo.tblHTML

    CROSS APPLY (SELECT strPage = CAST(strPage AS VARCHAR(MAX))) q

    UNION ALL

    SELECT h.RowId,

    Occurrence = CASE WHEN CHARINDEX ( @SearchHTMLStart ,q.[strPage] , c.Location +1 ) > 0 THEN 1 ELSE 0 END,

    Location = CHARINDEX ( @SearchHTMLStart ,q.[strPage] , c.Location +1 )

    FROM dbo.tblHTML h

    CROSS APPLY (SELECT strPage = CAST(strPage AS VARCHAR(MAX))) q

    INNER JOIN CTE_Start c ON h.RowId = c.RowId

    WHERE Occurrence > 0

    ),

    CTE_End AS

    (

    SELECT RowId,

    Occurrence = CASE WHEN CHARINDEX ( @SearchHTMLEnd ,q.[strPage] , 0 ) > 0 THEN 1 ELSE 0 END,

    Location = CHARINDEX ( @SearchHTMLEnd ,q.[strPage] , 0 )

    FROM dbo.tblHTML

    CROSS APPLY (SELECT strPage = CAST(strPage AS VARCHAR(MAX))) q

    UNION ALL

    SELECT h.RowId,

    Occurrence = CASE WHEN CHARINDEX ( @SearchHTMLEnd ,q.[strPage] , c.Location +1 ) > 0 THEN 1 ELSE 0 END,

    Location = CHARINDEX ( @SearchHTMLEnd ,q.[strPage] , c.Location +1 )

    FROM dbo.tblHTML h

    CROSS APPLY (SELECT strPage = CAST(strPage AS VARCHAR(MAX))) q

    INNER JOIN CTE_End c ON h.RowId = c.RowId

    WHERE Occurrence > 0

    ),

    CTE_Start_Sum AS

    (

    SELECT s.RowId, SUM(s.Occurrence) as NumberStart

    FROM CTE_Start s

    GROUP BY s.RowId

    ),

    CTE_End_Sum AS

    (

    SELECT e.RowId, SUM(e.Occurrence) as NumberEnd

    FROM CTE_End e

    GROUP BY e.RowId

    )

    SELECT ISNULL(s.RowId, e.RowId) as RowId, NumberStart, NumberEnd

    FROM CTE_Start_Sum s

    INNER JOIN CTE_End_Sum e ON s.RowId = e.RowId

    WHERE (NumberStart <> NumberEnd) OR (NumberStart = 0 AND NumberEnd = 0)

    OPTION (MAXRECURSION 10);

  • Wow! I had hoped to find a response but I am overwhelmed. Thank you all very much!

    Lowell, thank you for the valuable tip. I do need to check for <table in case there are some classes or other values in the opening tag.

    SSCommitted, You are quite right, this database has been around a while. Thank you for the suggestion I will read up on changing the data type.

    And ChrisM@Work & laurie-789651, thank you both for your code! Both your examples are way over my head, but I'm going to pick them apart and learn how they work.

    I've already run the query (with the <table modification) and found the 500 or so records out of 30,000 which need fixing. I'm very happy! :w00t:

    Thank you all again! πŸ˜€

  • Mixolydian (10/1/2012)


    Wow! I had hoped to find a response but I am overwhelmed. Thank you all very much!

    Lowell, thank you for the valuable tip. I do need to check for <table in case there are some classes or other values in the opening tag.

    SSCommitted, You are quite right, this database has been around a while. Thank you for the suggestion I will read up on changing the data type.

    And ChrisM@Work & laurie-789651, thank you both for your code! Both your examples are way over my head, but I'm going to pick them apart and learn how they work.

    I've already run the query (with the <table modification) and found the 500 or so records out of 30,000 which need fixing. I'm very happy! :w00t:

    Thank you all again! πŸ˜€

    You're welcome, thanks for the feedback. If you have trouble understanding the solutions posted, just ask πŸ˜‰

    β€œWrite the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

Viewing 10 posts - 1 through 9 (of 9 total)

You must be logged in to reply to this topic. Login to reply